USER DOCUMENTATION AND CONTROL LANGUAGE I
EVALUATION AND COMPARISON OF STATISTICAL COMPUTER PACKAGES*

RONALD A. THISTED
Department of Statistics, University of Chicago, Chicago, IL 60637, U.S.A.

(Received 28 November 1977; in revised form 16 March 1978)
Abstract-Procedures for comparing and evaluating aspects of the user interface of statistical computer packages are described. These procedures are implemented in a study of three packages, SPSS, BMDP and Minitab, by a class of 21 students with some statistical background. It was found that most participants exhibited consistent personal preferences among the packages. In selecting packages to solve specific problems, however, their choice was determined more by issues of good statistical practice than by personal preference for overall package features.

*Support for this research was provided in part by National Science Foundation Grant No. SOC72-05228 A04 and Grant No. MCS72-04364 A04, and by U.S. Energy Research and Development Administration Contract No. E(11-1)-2751. By acceptance of this article the publisher and/or recipient acknowledges the United States Government's right to retain a non-exclusive, royalty-free license in and to any copyright covering this paper. Notice: This report was prepared as an account of work sponsored by the United States Government. Neither the United States nor the U.S. Energy Research and Development Administration, nor any of their employees, nor any of their contractors, subcontractors, or their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product or process disclosed, or represents that its use would not infringe privately owned rights. Technical report No. 35, revised.

INTRODUCTION

Experience to date in the area of statistical software evaluation has shown that it is difficult to evaluate user documentation and package control language in any objective fashion using universally acceptable criteria. In 1975, Richard Heiberger[1] pointed out that

"very little has appeared in print on the user interface, by which we mean the package command language, the display of the printed output, and the user documentation, even though there are strong and contradictory opinions on its proper design."

His remark is equally appropriate today. The strong and contradictory opinions of which he spoke appear to be the major obstacles to formulating objective criteria for assessing the user interfaces which statistical packages present.

There are several reasons why it is difficult to evaluate the user interface objectively, and most of them have to do with the fact that packages differ in the audiences for which they are intended. The documentation and the control language often are designed to make smooth interaction possible between the package and a particular group of users. Consequently, the degree of statistical sophistication required of the user, the syntax of the control language, and the kinds of examples given in the user's manual differ a great deal. Tastes vary, and so do the experience and backgrounds of package users. Any assessment of the user interface of a package must, then, take into account the audience with respect to which the evaluation is made.

In this study, the terms user documentation and control language will be used in a broad sense. Users include all those who install, maintain, and consult on package usage. Documentation includes all those means by which information about an analysis is provided to the user; in particular, most explanatory or descriptive output is considered to be documentation. Documentation and control language are not easily separated. Working together they make a package a dream to use, or they make it a nightmare. Documentation and control language complement one another: through documentation the package communicates with the user, and by means of the control language the user instructs the program. An elegant language ill-described is hardly better than an awkward language explained clearly. The documentation and command structure considered
together distinguish a package from a set of statistical routines. The language structure, user's manual, and sometimes the form of the printed output are often the only features of the package shared by all of the routines. They constitute the brown paper and string that make a collection of statistical procedures a 'package'.

This paper is a report on an investigation conducted at Stanford University in 1976. The purpose of our study was to replicate and to extend the work by Francis and Valliant[2]. We focused on the three components of the user interface: documentation, control language, and output. In this respect our study differs from theirs, which sought to determine whether one package was more suitable for novices than another. Participants in our study examined three packages on six statistical problems during a one-quarter course on statistical computer packages.

THE PACKAGES
The statistical packages selected for study were the 1975 version of the BMDP programs from UCLA[3], Minitab II from Pennsylvania State University[4], and SPSS (Version 6) from SPSS, Inc.[5]. These three were selected because all are widely distributed, all are easily accessed at Stanford, all are general-purpose statistical packages, all have documentation available at a reasonable price, and all have a 'control language' as such. In addition, the three packages appeared to me to approach statistical analysis and package design in quite different ways, yet for many problems all had similar capabilities.
AUDIENCE

The participants in our investigation were twenty-one students in a class on statistical computer packages. Billed as a second course in statistics for practitioners, the course had as its only prerequisite one (unspecified) course in statistics. Most of these students had taken only one or two quarters of statistics at an introductory level, although a few were pursuing graduate work in statistics. Computing experience was uniformly distributed through the group. No more than two students had previous working experience with any one of the packages.

Francis and Heiberger[6] developed a classification of users by statistical training, computing experience and package usage. This 'user audience' falls into category B-2-8 in their classification, that is, users with some computer experience and statistical training who are using packages for the analysis of data (as opposed to learning to use packages or learning statistics with the aid of computer examples). The students' academic backgrounds were diverse. Those in the course majored in industrial engineering, chemistry, psychology, economics, history, education, sociology, geology, operations research, mathematical sciences and statistics.

DESIGN OF THE STUDY
This work focuses on the user interface. We have argued that the components of this interface (documentation, control language, and output) are 'package-wide' features, not specific to individual statistical procedures. If this assertion is correct, then an objective evaluation of a user interface based upon one statistical problem should agree roughly with that based upon another statistical problem. By using the packages to perform several unrelated statistical analyses, we have examined this conjecture.

Forsythe and Hill[7] suggest that software evaluation experiments ought not to employ textbook problems, but rather should use problems similar to those that an investigator might encounter in practice. Because of the prior statistical experience of the students in our course, we were able to use such problems, using real data sets of moderate size to answer interesting questions. Francis and Valliant[2] found that, among novices, work stopped when no error messages appeared on the output, and that the resulting 'analyses' were unacceptable. In this experiment emphasis was placed on obtaining a complete analysis appropriate to the problem using statistical packages as a tool. Students were graded on the quality of their analyses.

The study was based upon six problems assigned and due weekly. For each problem and for each package used, the students recorded their reactions to and opinions on the documentation, output, and language. They were asked to record what they felt to be the strengths and weaknesses of each package in each of the three areas. In addition, they rated each package according to several criteria, and ranked the three packages according to their preferences. The criteria were similar to those used by Francis and Valliant, although our emphasis was on the user interface. Participants were asked to rate each package on a scale of one to four.
The user's manual was rated on its clarity, completeness, examples, organization, debugging aids, and table of contents. The control language was rated on the ease with which it is used, the ease with which it is learned, the understandability of commands and procedure names, and the ease with which data are entered. The output was rated for clarity, diagnostic messages, usefulness, and completeness.

As in Francis and Valliant's study, the students were given a general orientation to the computer system, but no instruction on the use of the packages. The students had to use the user's manuals to learn how to use the three packages.

Forsythe and Hill point out that in such a study there are myriad sources of variation, and they mention a crossover design as one way to control for some of this variation. In fact, the first half of our experiment uses exactly that. On each of the first three problems, each student was to perform the same analysis using each package in a specified order. Students were randomly assigned to six groups, each of which used the packages in a different order. The permutation of the packages assigned to each group was changed from problem to problem as well; a sketch of one such assignment scheme appears at the end of this section. On these three problems, the participants evaluated all three packages 'side by side'. The problems used for this portion were:

1. computation of descriptive statistics and visual displays such as histograms for the detection of outliers and blunders, and for the comparison of distributions;
2. analysis requiring several one- and two-sample t-tests;
3. construction and analysis of a 6 x 6 correlation matrix.

For the last three problems, students were still required to indicate what they felt to be the strengths and weaknesses of each package, but were to use whichever one they chose. Grading of these exercises was based only upon the statistical analysis, and not on the choice of package for the problem. The problems used in the second half were:

4. model building using regression programs, including outlier detection, residual analysis, and comparison of models;
5. one-way analysis of variance with groups of unequal size;
6. tests for independence in contingency tables extracted, in tabular form, from the published literature.
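To make the crossover scheme concrete, the following short sketch (in Python, chosen purely for illustration) shows one way such an assignment of package orderings to groups could be generated. The group labels, the random seed, and the rule for rotating orderings from problem to problem are assumptions of mine; the study does not record the exact schedule that was used.

    # Illustrative sketch of a crossover assignment: six groups, each using the
    # three packages in a different order, with orderings rotated from problem
    # to problem.  The rotation rule and the seed are assumptions for illustration.
    from itertools import permutations
    import random

    packages = ["SPSS", "BMDP", "Minitab"]
    orders = list(permutations(packages))                   # the 6 possible orderings

    students = [f"student_{i:02d}" for i in range(1, 22)]   # 21 participants
    random.seed(1976)
    random.shuffle(students)
    groups = {g: students[g::6] for g in range(6)}          # six roughly equal groups

    for problem in range(1, 4):                  # the crossover covers problems 1-3
        print(f"Problem {problem}:")
        for g in range(6):
            order = orders[(g + problem - 1) % 6]    # shift each group's ordering
            print(f"  group {g + 1} ({len(groups[g])} students): " + " -> ".join(order))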
RESULTS

Before discussing the results of the experiment in detail, we should sound a note of warning. Now that we have high-speed digital computers and easy-to-use software, it is more important than it ever has been to recognize the problems of simultaneous statistical inference. It is a simple matter to compute a 90 x 90 correlation matrix based upon, say, only 21 observations. And in practically any 90 x 90 correlation matrix we are bound to find one or more coefficients 'significant at the 0.05 level'. My students are always warned to use caution when many questions are asked of the same observations, and that warning applies to the results of this study as well. One of the shortcomings of an experiment such as this is that it is far easier to come up with questions to ask than it is to come up with participants to answer them. For these reasons we have chosen to emphasize the major findings and have tried not to milk the data for everything they are worth; there are simply too few degrees of freedom.
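To see why the warning matters, a back-of-the-envelope calculation (my own arithmetic, not a result of the study) is instructive: a 90 x 90 correlation matrix contains 90 x 89/2 = 4005 distinct coefficients, so even if every true correlation were zero, testing each at the 0.05 level would be expected to flag roughly 200 of them as 'significant'.

    # Back-of-the-envelope arithmetic for the simultaneous-inference warning:
    # how many of the distinct correlations in a 90 x 90 matrix would appear
    # "significant at the 0.05 level" purely by chance?
    m = 90 * 89 // 2                      # 4005 distinct off-diagonal correlations
    alpha = 0.05

    expected_spurious = m * alpha                 # about 200
    p_at_least_one = 1 - (1 - alpha) ** m         # treating the tests as independent,
                                                  # which is an idealization

    print(f"correlations tested:              {m}")
    print(f"expected 'significant' by chance: {expected_spurious:.0f}")
    print(f"P(at least one 'significant'):    {p_at_least_one:.6f}")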
(a) The ratings of minute aspects of the user interface were remarkably consistent from problem to problem. This finding indicates that these aspects are indeed features of the package rather than a function of the statistical procedures. Table 1 exhibits the results from the first problem, which are representative.

Table 1. User interface evaluation, problem 1. For each of BMDP, Minitab, and SPSS, the entries give the number of participants assigning each rating (1 = very poor, misleading or confusing; 2 = poor or inadequate; 3 = adequate or good; 4 = excellent) to the user's manual (clarity, completeness, examples, organization, debugging aids, table of contents), to the control language (ease of use, ease of learning, understandability, data entry), and to the output (clarity, diagnostic messages, usefulness, completeness).

(b) The ratings were used to rank the packages with respect to the user's manual, the control language, and the output. Table 2 contains these relative rankings. The numbers in Table 2 are the number of participants rating that package first. Fractions result from distributing ties (for example, a participant who ranked two packages equally first would contribute one-half to each). Table 2 also contains the students' package preferences for each problem. Although the rankings of the user interface remain roughly constant throughout the first three problems, the overall package preferences depend heavily on the problem being solved.

Table 2. Package preferences. For problems 1-3 the table gives, for each of BMDP, Minitab, and SPSS, the number of participants ranking that package first on documentation, on language, on output, and in overall preference; for problems 4-6 it gives the overall preference only. Entries represent the number of participants preferring each package; fractions result from distributing ties.

(c) There appeared to be no relationship between package preference and statistical experience, computing background, or previous work with statistical packages, although the last conclusion is based on very few observations. This finding contradicts the common folklore which holds that people always prefer the first package they learn to use. On the first problem, for instance, of the two students who knew BMD programs, both preferred SPSS. Of the two who knew SPSS, both chose SPSS. The two with Minitab experience split evenly: one for SPSS and one for BMDP.

(d) The crossover design enabled us to determine whether the order in which the packages were used on a problem had any effect on the package preferences. We found no evidence of such an effect, in any direction.

(e) From comments received with the problems that were turned in, the overall preferences appear to be determined by the students' assessment of the suitability of the package as a tool for solving the problem at hand. Features such as well-written manuals, simple control structures, and readable output were appreciated by the students, but those features did not determine their choices.

(f) From the point of view of good statistical practice, nearly all of the submissions were adequate. Some, in fact, were outstanding. In sharp distinction to Francis and Valliant's findings among novices, the participants in this study frequently made several error-free runs in order to improve their analysis of the problem and to arrive at more informative or useful results.

(g) Table 2 apparently reveals a paradox: when students must use all three packages they will prefer SPSS to BMDP to Minitab, but when allowed to use any one package will prefer Minitab and BMDP to SPSS. Even though the students could use any of the packages, they still had to write short paragraphs evaluating the other packages. These records indicate why the pattern reverses. For problem four, the regression problem, most students found the Minitab manual to be markedly clearer and more detailed in its treatment of regression than any other statistical method. Most of the students also remarked that the output from Minitab was more complete,
more useful, and more clearly described than that from SPSS or BMDP. Most found the control language more compact for Minitab than for the others. In the analysis of variance problem, the consensus was that Minitab's documentation was clear on how the data had to be supplied to the program and that the other two packages' documentation was not. The final problem involved re-analyzing a published data set consisting of contingency tables in tabular form. Most of the students were unable to discover how to enter data which are already tabulated into SPSS. The examples in the BMDP manual, however, made it clear what had to be done. Incidentally, it is quite easy to enter such a table in SPSS (using the WEIGHT feature), although it is difficult to locate the appropriate documentation in the manual. The consensus was that Minitab's control language was easier to use, but that BMDP's output was more complete.

(h) From studying the written comments, it appears that a package which supplied all of the information needed to complete an analysis, and which required the least effort to obtain that information, was preferred to the others.

(i) Francis and Valliant addressed the question, "which (of two) packages is better for the novice?" The program preferences displayed here, which obviously depend upon the exact problem being solved, indicate that there is no 'best' package for persons of intermediate statistical and computing experience in an applied problem-solving environment. This result is hardly earth-shaking, but it is somewhat comforting to know that not only social scientists prefer SPSS, that not only medical researchers like the BMDPs, and that not only elementary students can appreciate Minitab.

LIMITATIONS ON RESULTS
This is a good point to consider just what this study has, and what it has not, accomplished. Students of intermediate experience were studied as they worked on six problems using three packages. Their preferences were determined more by the needs of the problem than by the quality of the user interface. No package was universally endorsed, and none was unanimously shunned. These findings should be of interest to anyone who teaches a course in statistical methods which uses the computer as a tool for lengthy or complicated analyses.

In some respects these results can be misleading, and we point them out lest anyone be fooled. The evaluations of documentation, control language, and output alike are functions of the versions of the software and user's manuals available in early 1976. All have changed since then, some dramatically. In particular, the current edition of the Minitab manual contains an index and has been expanded, and a Student Handbook is available[8]. One of the major difficulties that our students experienced in using the Minitab manual was the lack of an index. It is likely that Minitab's current manual and Student Handbook would compare more favorably to the others than did the version we used. In addition, both SPSS and BMDP have sent corrections and improvements to their installations, as well as providing long lists of known errors and undocumented limitations. In short, documentation is evolving, and to the extent that the manuals and programs we studied have been improved, this study is already out of date. Of course, this limitation is inherent in any experiment of this sort, but it makes life difficult for those who would draw absolute conclusions about the relative merits of these packages.

Another important point to remember in using this report is that no attempt was made to find problems that were particularly hard or particularly easy for individual programs. Nobody preferred Minitab to the other packages for the correlation problem, which required that a 6 x 6 correlation matrix be constructed for the analysis. Minitab had no instruction for computing a matrix of correlation coefficients. (The current version does.) Consequently, fifteen commands were required to compute the numbers needed. For use, say, in a beginning course in statistics, it might be easier to use Minitab to compute one or two correlations than it would be to use any other package, even using the older version of Minitab. The results here, however, would erroneously lead to the opposite conclusion: that Minitab is no good for computing correlation coefficients. In short, the fact that suitability of a package for a problem is often of greater importance than suitability of a package for an audience (with some sophistication), and the fact that packages and their user interfaces evolve continuously, limit the uses to which the results of any evaluation can fairly be put.

There is one more limitation inherent in investigations of this sort. They really have more to say about the participants in the study than about the packages themselves. Both in this study and in the Francis and Valliant study, the packages were not judged to be greatly different from
one another: even in instances where one package was preferred by a large number of participants, the preference was 'not by much'. Novices appear to be concerned with eliminating error messages, while more advanced students seek quality analyses. In either case, the students were asked to supply subjective judgments to questions about aspects of the user interface. The very fact that these questions must be addressed to a group of people is an indication that no one person can answer them for another. Since tastes differ, we have obtained, in effect, a public opinion poll of users' tastes. Just as we ought not to decide how to vote by studying the Gallup Poll, we also ought not to select a package for use in a problem on the basis of such an experiment as this. If, however, a package must be selected for use by some group of people, then it is helpful to know something about that group's reactions and opinions about several packages, and studies such as this one can be useful in this context.
NEXT STEPS
In light of these limitations it may be valuable to discuss the direction that further study of the user interface might take. Heiberger[1] has noted that "package users do not have the information they need to make an intelligent choice of the package most appropriate for their needs." For many users, experiments such as this one do not provide all of the information that is required. What more information could be provided?

Once again, Heiberger has hinted at a useful approach. With respect to the contents of output he said, "The evaluation of content is much easier since it can be done completely objectively by noting whether the feature exists as a default, an option, or not at all. A user knowing his output requirements can just look through a chart to decide if the package can fulfill his requirements." Precisely this approach should be taken with respect to user documentation and control language. It cannot answer all questions, either. But there are identifiable features of documents, for instance, which can be listed and are of relevance to some people. It is easy to check to see whether certain features exist or not. (For example, it is usually easy to find out whether or not a manual has an index.) After listing the documentation characteristics of a package, we can give the list to the user, and as before, a user who knows what suits his taste can look through such a chart to decide whether the package is likely to be suitable for him.

Under this scheme it will be necessary to enumerate relevant features of documentation and control languages. There are many levels of documentation, and all should be included. Among them are documentation of the system design, the interaction of control language and documentation, and documentation of the program itself: the source code, machine dependencies, and installation notes. Statistical documentation is important, including references to the literature, suggestions for use, descriptions of the items in the output, definitions for statistical terms and alternative approaches to the techniques incorporated in the package. Numerical methods and computational algorithms used in a package are features of documentation relevant to most users, as is information about the costs of obtaining, maintaining and using the package. Of course documents often describe how to use a package and its control language, and such aspects of the usage documentation can be enumerated and charted for various packages. The features of the control language and its syntax, their hierarchies and parsing algorithms all may or may not be documented, and such an evaluation as I envision would note which of them, if any, were included in the document. Similarly, the way in which the output is documented would be noted. Large user's manuals may have aids to help the user employ the manual itself effectively (document documentation). These aids would be listed, together with descriptions of the typeface and binding of the manual and descriptions of the document maintenance that is available through newsletters, update services, or regular revisions of a basic manual. Such a list of 'possible features' that statistical package documentation could conceivably include, together with a similar list for control language features, is presented by Thisted[9].

In compiling such a list, several difficulties in providing good documentation become apparent. First, many among the possible features cannot be realized simultaneously. Especially when budgets are limited, it is unlikely that package designers will spend 80% of their effort on complete documentation of all aspects of a package; not enough software is produced that way. Second, programs are never static: they change continuously. Consequently many documents, especially those that attempt to be complete, are out of date as soon as they are published. Finally, fully documenting a sophisticated statistical software system is a task so enormous that it overshadows the task of providing good, flexible and useful statistical routines. Similar remarks hold for package control languages.
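To illustrate the kind of feature chart proposed above, the sketch below records, for each documentation feature, whether a package offers it as a default, as an option, or not at all, and lets a user filter packages by his own requirements. The feature names, package names, and entries are hypothetical examples of mine, not findings of this study.

    # Hypothetical sketch of a documentation/control-language feature chart.
    # Every entry below is invented for illustration; none is a finding of
    # this study.
    feature_chart = {
        "manual has an index": {
            "Package A": "default", "Package B": "absent", "Package C": "default"},
        "worked examples for each command": {
            "Package A": "option", "Package B": "default", "Package C": "absent"},
        "installation notes": {
            "Package A": "default", "Package B": "default", "Package C": "option"},
        "references to the statistical literature": {
            "Package A": "absent", "Package B": "option", "Package C": "default"},
    }

    def packages_satisfying(requirements):
        """Return the packages offering every required feature (default or option)."""
        names = {p for row in feature_chart.values() for p in row}
        return [p for p in sorted(names)
                if all(feature_chart[f].get(p) in ("default", "option")
                       for f in requirements)]

    # A user who knows his requirements can simply "look through the chart":
    print(packages_satisfying(["manual has an index", "installation notes"]))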
CONCLUSIONS

The purpose of the study reported here was to assess the adequacy of user documentation and control language of three packages for an audience defined by the statistical experience of its members. All of the packages were adequate, in the sense that most participants could use each of the packages to solve any of the problems (with one exception). But there was no consensus that one package's documentation was better than another's, or that one's control language clearly dominated those of the others. Instead, individuals exhibited fairly consistent personal preferences. These preferences, however, did not interfere with, nor did they dominate, their selection of packages to solve specific problems.

Choosing a package for personal use on a particular problem involves many subjective decisions which vary from user to user, and even from problem to problem for a single user. Apparently, when the emphasis is on good statistical practice, the primary decision is whether a package provides what is needed for a complete analysis. Only after this question is answered affirmatively do considerations of taste concerning the user's manual and the control language come into play. This suggests that future evaluations of the user interface intended to assist users in selecting packages for their personal use ought to identify, isolate, and examine major features of the documentation and the control structures. The features of which I speak here are those that can be identified by noting whether they exist as defaults, options, or not at all. The user could then weigh his statistical needs and personal tastes against the capabilities and features of individual packages.

Evaluations intended to provide information for selecting packages for general use by more or less homogeneous groups, however, can profitably study the reactions and preferences of a sample of group members in situations typical of intended use. Group experiments provide a different sort of information from that found in a catalog of features. Both kinds of information are valuable and both types of package assessment should be pursued.
Acknowledgements-Ronald A. Thisted is Assistant Professor, Department of Statistics, University of Chicago, 5724 University Avenue, Chicago, IL 60637. This article is a revision of an invited paper presented to the Statistical Computing Section at the 136th Annual Meeting of the American Statistical Association, Boston, MA, August 1976. Discussion with and comments from Ivor Francis, David Wallace, Richard Heiberger, David Pasta and Richard Poppen are gratefully acknowledged. The cooperation and dedication of my class in statistical computer packages are also acknowledged. Permission of the American Statistical Association to reproduce portions of an earlier version of this article appearing in the Proceedings of the Statistical Computing Section is gratefully acknowledged.
REFERENCES
1. R. M. Heiberger, Proc. Comp. Sci. Stat.: Eighth Annual Symp. Interface, 115-121 (1975).
2. I. Francis and R. Valliant, Proc. Comp. Sci. Stat.: Eighth Annual Symp. Interface, 110-114 (1975).
3. W. J. Dixon (ed.), BMDP Biomedical Computer Programs. University of California, Berkeley (1975).
4. T. A. Ryan, Jr., B. Joiner and B. F. Ryan, Minitab II Reference Manual, Preliminary Edition. Department of Statistics, The Pennsylvania State University, University Park (1976).
5. N. H. Nie, C. H. Hull, J. G. Jenkins, K. Steinbrenner and D. H. Bent, SPSS: Statistical Package for the Social Sciences. McGraw-Hill, New York (1975).
6. I. Francis and R. M. Heiberger, Proc. Comp. Sci. Stat.: Eighth Annual Symp. Interface, 106-109 (1975).
7. A. Forsythe and M. Hill, Proc. Comp. Sci. Stat.: Eighth Annual Symp. Interface, 17-20 (1975).
8. T. A. Ryan, Jr., B. Joiner and B. F. Ryan, MINITAB Student Handbook. Duxbury, North Scituate (1976).
9. R. A. Thisted, in preparation (1977).