Why choose? A process approach to usability testing

T. Kelley and L. Allender
Army Research Laboratory, AMSRL-HR-MB, Aberdeen Proving Ground, Aberdeen, Maryland 21005-5425

1. ABSTRACT

Two sets of prototype screens for a complex, computerized analysis tool were evaluated using a series of usability analysis techniques. The empirical, or experimental, usability method identified more interface design problems of a severe nature than the other methods and gave a clear indication of which prototype design to choose for the final development process. While the individual walkthrough evaluation identified the most design problems overall, many of those problems were less severe than the ones identified by the experimental method. The implications for selecting appropriate usability techniques and using them collectively, as a process, are discussed.

2. INTRODUCTION

Several comparisons of usability methodologies have recently appeared in the literature (e.g., Karat et al., 1992; Virzi et al., 1993; Jeffries et al., 1991). Questions addressed by this research include: How effective is one particular usability technique compared to another? How many and what type of problems are uncovered with each technique? How much does one technique cost in comparison to other usability analysis techniques? Are the cost savings of a given method offset by a failure to identify problems?

Comparisons of usability techniques are, however, sometimes difficult to interpret. Each method is still only loosely defined, and overlap across techniques is common. In general, though, it can be agreed that the techniques vary according to certain factors: whether the technique is empirical and task oriented or a subjective walkthrough; and, for the walkthrough techniques, whether they are individual or group, what the subjects' level of user interface (UI) design expertise is, whether the group is all end users or pluralistic (a mix of users, UI experts, programmers, etc.), whether the walkthrough is "self-guided" or guided by a set of guidelines, and the method of data collection (written or "think-aloud"; time periods ranging from hours to weeks).

The desire to use one technique over another is driven by cost and effectiveness concerns. But the literature is unclear about which technique, empirical, individual walkthrough, or group walkthrough, is the "best." Karat et al. (1992) compared three techniques, empirical, individual walkthrough, and group walkthrough, and found that the empirical method identified the largest number of problems and identified problems missed by the individual and group walkthroughs.
Cost analysis also showed that the empirical usability technique used the same or less time to identify each problem. Contrary to the findings of Karat et al. (1992), Jeffries et al. (1991) reported that heuristic evaluations found the most problems at the lowest cost. However, Jeffries et al. used usability specialists, whereas Karat et al. used a mix of mostly end users and developers of graphical user interface systems, along with a few usability specialists and software support staff. Also, the subjects in the Jeffries et al. study reported problems found over a two-week period; those in the Karat et al. study, over a three-hour period.

One common element across the reviewed research is that the software being evaluated with the various techniques was relatively simple. When developing analytical tools for the scientific community, the software can be quite complex, both in terms of overall conceptual organization and in terms of individual screen designs. The work reported here is an evaluation of a complex task analysis modeling tool being developed for the U.S. Army and helps to answer questions about the effectiveness of the various usability techniques for larger scale software products.

To reiterate an earlier point, there is no clear answer on which usability technique is the best. Karat et al. (1992) point out that each of the techniques serves to uncover different types of problems. In the work reported here, the natural development cycle of the software to be evaluated was used as the deciding factor: Different techniques were used at different times.

3. OBJECTIVES

The usability evaluations reported here were conducted with two goals. The first goal was to use the output of the usability evaluations to select one of two different interface design prototypes for a complex task analysis tool for the scientific and analysis communities and to subsequently refine the selected design. The two prototypes differed in their conceptual and organizational structure and, therefore, in most of the screen designs. However, a number of the screen designs, particularly "lower level" data input screens, were identical. The second goal was to use and compare a variety of usability analysis techniques as a part of the development process in order to gain insight into the strengths and weaknesses of each and to guide future technique selection. Five techniques were used: the first and fifth techniques were employed without strict experimental controls; the second, third, and fourth techniques were employed and evaluated in an experimental setting. The five techniques were: (1) individual walkthrough evaluation; (2) empirical evaluation (experimental); (3) individual heuristic walkthrough (experimental); (4) group walkthrough (experimental); and (5) group pluralistic evaluation.

4. PARTICIPANTS

Three different groups of participants used the five different techniques: one group used the individual walkthrough; one group used the three techniques that were evaluated experimentally; and one group used the pluralistic evaluation technique. The group who participated in the individual walkthrough evaluation consisted of six expected end users of the software under development who were currently active users of the predecessor software tool.
The second group participated in the three experimental evaluations and comprised 20 subjects. All 20 subjects were employees of the Army Research Laboratory and had various educational and professional backgrounds. All of these subjects were equal in experience in that they had all received a three-day training course on the predecessor software but had not used the software since the course. All 20 participated in the empirical evaluation. Half of the subjects then participated in the individual heuristic evaluation, and the other half participated in the group walkthrough evaluation. Finally, the third group included 18 participants who employed the pluralistic evaluation technique. By definition, this group included a mix of end users, designers, programmers, and human factors practitioners.

5. PROCEDURE

5.1. Individual Walkthrough Evaluation

For the individual walkthrough evaluation, a hard-copy packet was mailed to the participants after initial versions of the two prototypes had been completed. This was the first evaluation point in the development process. The packet included printouts of every screen in each of the two prototypes and a set of instructions. The instructions provided information about how to interpret the printouts and how to map the prototypes onto the functionality of the predecessor software. Rather than the human interface guidelines that might be given to subjects in a standard heuristic evaluation, the evaluators were given a set of questions to guide and prompt their responses, for example, "What do you like or dislike about the prototypes?" and "What changes should be made to the layout and organization of the prototype?" They were also encouraged to mark directly on the printouts and to make any other comments they felt appropriate. They had three weeks to evaluate both prototypes.

5.2. Empirical Evaluation

The second evaluation technique was the empirical method. It was used at the same time as the other two experimental techniques, during the development of the software specifications but before coding had begun. Twenty subjects were each tested individually using the same Gateway 2000 33 MHz computer with a color VGA monitor. The interactive screen prototypes, which were created using the ToolBook™ development environment, were presented in a counterbalanced scheme so the time and errors for each could be compared. All subjects received a refresher training session on the predecessor software immediately prior to the experiment. Subjects then had to successfully complete five training tasks before proceeding with the experiment. The experiment consisted of carrying out 10 goal-oriented tasks that would actually be performed using the software. Although prototypes were used in the experiment, there was sufficient functionality to permit performance of all 10 tasks. The same set of 10 tasks was presented in a different random order for each of the two prototypes. (Of note, 10 subjects received one set of 10 tasks; the other 10 subjects received a different set of 10 tasks. This change was made to maximize the amount of information available to guide actual interface revisions.) Data were collected by a video camera and also directly by the computer the subjects were using during the experiment. Subjects were not given any special instructions about how fast or how accurately to work.
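Purely as an illustration, and not the study's actual materials or software, the following Python sketch shows one way a counterbalanced prototype order with independently randomized task orders could be generated; the prototype labels, task names, and function are hypothetical.

```python
# A minimal sketch (hypothetical names, not the study's materials) of generating a
# counterbalanced prototype order and a randomized task order for each prototype.
import random

PROTOTYPES = ["prototype_A", "prototype_B"]   # the two competing interface designs
TASKS = [f"task_{i}" for i in range(1, 11)]   # ten goal-oriented tasks

def build_schedule(subject_id: int, seed: int = 0) -> list[tuple[str, list[str]]]:
    """Alternate which prototype is presented first and shuffle the ten tasks
    independently for each prototype."""
    rng = random.Random(seed + subject_id)
    order = PROTOTYPES if subject_id % 2 == 0 else list(reversed(PROTOTYPES))
    return [(proto, rng.sample(TASKS, k=len(TASKS))) for proto in order]

if __name__ == "__main__":
    for subject in range(4):                  # print example schedules for four subjects
        for proto, task_order in build_schedule(subject):
            print(subject, proto, task_order)
```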
5.3. Heuristic Evaluation

The third technique, the heuristic evaluation, was conducted immediately after the empirical evaluation session. Ten of the twenty subjects who participated in the empirical evaluation were randomly selected. They were given the set of usability guidelines from Nielsen and Molich (1990), which included: simple and natural dialog, speak the user's language, minimize user memory load, be consistent, provide feedback, provide clearly marked exits, provide shortcuts, provide good error messages, and prevent errors. The subjects were then instructed to use the guidelines to identify usability problems with each interface. They could choose to work from the on-line versions of the prototypes on the computer or from a printout of each screen.
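As an illustration only (the study collected these data on paper and video, not with a script), a reviewer's findings could be tabulated against the Nielsen and Molich heuristics roughly as in the sketch below; the example problem records are invented.

```python
# A hypothetical sketch of tallying reported problems against the Nielsen and Molich
# (1990) heuristics used in the session; the example findings are invented.
from collections import Counter

HEURISTICS = [
    "simple and natural dialog", "speak the user's language",
    "minimize user memory load", "be consistent", "provide feedback",
    "provide clearly marked exits", "provide short cuts",
    "good error messages", "prevent errors",
]

# (screen, description, violated heuristic) as a reviewer might record them
findings = [
    ("task setup", "no indication the model is still loading", "provide feedback"),
    ("data input", "abbreviation differs from the menu label", "be consistent"),
]

tally = Counter(heuristic for _, _, heuristic in findings)
for heuristic in HEURISTICS:
    print(f"{heuristic}: {tally.get(heuristic, 0)} problem(s)")
```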
5.4. Group Walkthrough

For the fourth technique, the group walkthrough, the subjects were the remaining ten from the empirical evaluation. Subjects met in one room facing a large-screen monitor displaying the prototype. One experimenter served as the moderator for the session. The task lists used for the empirical evaluation were given to each of the subjects, and then each task was presented for evaluation with the interface. Subjects vocalized any concerns they had with the interface while each task was being walked through. Data were collected with a video camera and by a second experimenter taking notes.
5.5. Group Pluralistic Walkthrough

The final usability technique was the group pluralistic evaluation. It served as the final review before actual software coding began on the task analysis tool. Eighteen people participated. The prototype was displayed with an overhead projector, and one moderator, the program developer, took the group through as much of the interface as possible in the time allotted, which was approximately eight hours. The pluralistic walkthrough is distinguished by the wide range of experience of its participants (Bias, 1991). In this case, the pluralistic walkthrough included end users, human factors experts, developers, and programmers.

6. RESULTS

As Figure 1 illustrates, the individual walkthrough evaluation identified more problems than any of the other techniques. The individual walkthrough evaluation identified a total of 39 unique problems, compared to 21 for the pluralistic evaluation, 15 for the empirical evaluation, 12 for the heuristic evaluation, and 9 for the group walkthrough. Severity ratings of each identified problem were made on a three-point scale. Two human factors experts rated the problems independently; the ratings were then compared, and any disagreements were discussed until a consensus was reached. Figure 2 shows the severity rating scores for the problems found with each usability technique. As Figure 2 indicates, the empirical method identified the highest number of high-severity problems, a total of six. The individual walkthrough identified the highest number of low-severity problems, a total of 29.
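As a hedged illustration of that rating step (the ratings shown are invented, not the study's data), the sketch below compares two raters' independent three-point severity scores, reports simple percent agreement, and lists the disagreements that would go to the consensus discussion described above.

```python
# A minimal sketch (invented data) of comparing two experts' independent severity
# ratings on a three-point scale and flagging disagreements for consensus discussion.

# problem id -> severity (1 = low, 2 = medium, 3 = high)
rater_1 = {"P01": 3, "P02": 1, "P03": 2, "P04": 1}
rater_2 = {"P01": 3, "P02": 2, "P03": 2, "P04": 1}

disagreements = {p: (rater_1[p], rater_2[p])
                 for p in rater_1 if rater_1[p] != rater_2[p]}

agreement = 1 - len(disagreements) / len(rater_1)
print(f"percent agreement: {agreement:.0%}")
print("discuss until consensus:", disagreements)
```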
Figure 1. Number of problems identified.
Figure 2. Error severity identification.
The individual walkthrough and heuristic evaluations were further compared in order to address issues of low-priority problem identification. The heuristic evaluation yielded a total of 84 individual comments. (Note that many individual comments reported the same problem, so there are more comments than problems identified.) All 84 comments were classified in order to give an indication of the type of comments received. The individual walkthrough evaluation yielded a total of 356 individual comments. Of these 356 comments, 84 were randomly selected and classified in order to give some indication of the type of comments received and to allow comparison with the heuristic evaluation. The categories used for the classification of user comments were as follows: a fidelity problem with the prototype, a question about the prototype, a compliment of the prototype, a suggestion to change the prototype, a problem with the interface, a syntax or wording problem, and a meaningless comment that could not be interpreted. Results indicated that the individual walkthrough evaluation, which was conducted without the "standard" usability guidelines, yielded a smaller percentage of problem identification (16%) than did the heuristic evaluation, which was conducted with guidelines (33%). Results also indicated that evaluators had more questions during the individual walkthroughs (22%) than during the heuristic evaluation (2%). This is not surprising given that those in the individual walkthroughs were seeing the prototypes for the first time, whereas those in the heuristic evaluation had previously participated in the empirical evaluation.

During the experimental evaluation, the two prototypes were compared on the time and errors obtained on one set of ten tasks. A 2 (prototype) x 10 (task) repeated-measures ANOVA was conducted on the data for 10 subjects. Results indicated a significant main effect of prototype, F(1, 9) = 14.39, p < .01, as well as of task, F(9, 81) = 14.85, p < .01. The prototype by task interaction was also significant, F(9, 81) = 5.15, p < .05.
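For readers who want to run this kind of analysis on their own data, the sketch below shows a minimal 2 (prototype) x 10 (task) repeated-measures ANOVA using the statsmodels library; the file name and column names are hypothetical assumptions, and the code is not the authors' original analysis.

```python
# A minimal sketch, not the authors' analysis code: a 2 (prototype) x 10 (task)
# repeated-measures ANOVA on task completion time. The CSV file and column names
# are hypothetical, and the data must be balanced (one observation per subject
# per prototype per task).
import pandas as pd
from statsmodels.stats.anova import AnovaRM

df = pd.read_csv("usability_times.csv")  # columns: subject_id, prototype, task, completion_time

result = AnovaRM(
    data=df,
    depvar="completion_time",    # e.g., seconds to finish the task
    subject="subject_id",
    within=["prototype", "task"],
).fit()
print(result)                    # F and p values for each within-subject effect
```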
7. CONCLUSIONS

The usability analysis process should be a combination of usability analysis techniques. Each usability analysis technique has its own advantages and disadvantages, but together the techniques complement one another and can collectively be more powerful than when used separately; in other words, a Gestalt analysis.
First, for this evaluation process, we chose to use the individual walkthrough, which was a paper-and-pencil exercise without the usability guidelines characteristic of a heuristic evaluation. These two factors most likely led to feedback consisting largely of comments and questions rather than problem identification per se, and many of the problems that were identified were of low severity. Next, we used the empirical method, which was task oriented and used the interactive prototypes. With this method, the largest number of severe problems was identified by noting where subjects made navigation and menu selection errors and which steps took the most time. The heuristic and group walkthrough evaluations came next, with the expectation that subjects would draw from their intensive experience in the empirical evaluation and be more likely to identify additional, severe usability problems. Both techniques did identify additional problems; the heuristic evaluation identified slightly more, a tradeoff to consider when selecting one method over the other. The last evaluation technique was the pluralistic evaluation because, at this point in the development cycle, the most severe problems should already have been identified, allowing discussion of detailed design changes from the user and programmer points of view. Additional investigation needs to be done to help clarify the process, as well as to identify the best order in which to use each methodology in an overall usability process.

8. ACKNOWLEDGMENTS

The authors would like to thank CPT Jim Nagel, Diane Mitchell, Linda Fatkin, Jock Grynovicki, and Debbie Rice for their assistance with the statistical analysis.

9. REFERENCES
1. Bias, R. (1991). Walkthroughs: Efficient collaborative testing. IEEE Software, 8(5), 94-95.
2. Jeffries, R., Miller, J.R., Wharton, C., and Uyeda, K.M. (1991). User interface evaluation in the real world: A comparison of four techniques. In Proc. of ACM CHI'91 Conference on Human Factors in Computing Systems, pp. 119-124. ACM, New York.
3. Karat, C., Campbell, R., and Fiegel, T. (1992). Comparison of empirical testing and walkthrough methods in user interface evaluation. In Proc. of ACM CHI'92 Conference on Human Factors in Computing Systems, pp. 397-404. ACM, New York.
4. Nielsen, J. and Molich, R. (1990). Heuristic evaluation of user interfaces. In Proc. of ACM CHI'90 Conference on Human Factors in Computing Systems, pp. 249-256. ACM, New York.
5. Virzi, R.A., Sorce, J.F., and Herbert, L.B. (1993). A comparison of three usability evaluation methods: Heuristic, think-aloud, and performance testing. In Proc. of the Human Factors and Ergonomics Society 37th Annual Meeting, pp. 309-313.