MSUSTAT—An interactive statistical analysis package


Computational Statistics & Data Analysis 2 (1985) 331-336
North-Holland

Package Review

Reviewed by John ISI’FT
Estelle Doheny Eye Foundation and the Department of Ophthalmology, University of Southern California School of Medicine, Los Angeles, CA 90033, USA

Received November 1984

0167-9473/85/$3.30 © 1985, Elsevier Science Publishers B.V. (North-Holland)

A great deal of interest has recently centered on the development of statistical packages for microcomputers. Generally, users of such packages have had to settle for fewer tests, slower processing rates, and cumbersome methods for entering data. “MSUSTAT - An Interactive Statistical Analysis Package,” by Richard E. Lund (Montana State University, Bozeman, MT 59717-0002) is reviewed here for numerical accuracy, handling of data, and program documentation.

The version reviewed here is designed for the Apple IIe and II+ computers with CP/M operating systems, using the Z-80A microprocessor at 4 MHz. MSUSTAT is also currently available for the IBM-PC and XT computers, and presumably for computers that are directly compatible with the IBM-PC and use the MS-DOS operating system. Numerical accuracy is compared between the output of the Apple and IBM versions of the package. The programs were compiled with SSS FORTRAN IV from Super Soft Associates (Champaign, IL).

The author claims MSUSTAT provides results accurate to four decimal places for the MS-DOS version and three decimal places for CP/M on Apple computers. These results were obtained by multiple regression on the highly correlated Longley data (see Chambers, J., Computational Methods for Data Analysis, New York: John Wiley and Sons, 1977).

MSUSTAT was evaluated thoroughly, covering all program capabilities with one exception. The version for Apple computers will not execute data transformations for the [illegible] program or for the multivariate analysis programs (multiple and stepwise regression). Thus, program execution and accuracy could not be fully evaluated using data transformed prior to program operation.

The total package provides seven ANOVA programs (including multi-factor, three stage nested hierarchical, analysis of covariance and multivariate tests), Chi-square, regression (simple, multiple, and stepwise), summary statistics, non-parametric (including Wilcoxon, Mann-Whitney, Kruskal-Wallis, and confidence intervals), Student t-test, and Z-probabilities, among others.

The CP/M version operates in an overlay mode, requiring at least four diskettes for the complete package. The user must therefore switch disks during program operation for most analyses in MSUSTAT. Complete instructions are provided with the package for installation, with step-by-step procedures for either floppy disk drives or hard disk applications. The program diskettes are not copy protected, allowing the user to make backup copies and to access a local file for data input and output to and from data disks or application software. Several popular editing software products produce acceptable data files for MSUSTAT. Further, if an editing program produces standard ASCII files (with a carriage-return-line-feed separator), the file will likely be acceptable to MSUSTAT.

Documentation is well written with few errors and appears to have been printed using a microcomputer. For locating specific programs and text, the manual has a table of contents, but lacks an index. The rest of the documentation is divided into three parts. Part one is thirteen pages of general features with instructions for prompt commands and control language. Part two is twenty-two pages of text describing the specific mathematical algorithm used in computation of most statistical tests offered in the package. Part three consists of seventy-seven pages of examples of program operation with data, followed by a list of references and addenda.

MSUSTAT communicates with the user through line prompts, such as “program desired”. While a menu is not used, program choice is a matter of typing the first five letters of the program selected. A complete listing of programs can be displayed by typing “programs” when asked for “program desired”.
As programs begin operation, additional information is usually provided with each prompt throughout program execution. If at any time the user needs additional information about a prompt, he can type “help” at that point and MSUSTAT will provide text information specifying the proper format or options for that individual prompt. In addition to help messages, MSUSTAT also provides the option of rerunning most programs with the same data and new control options, or different data with the same control options, or restarting the program from the beginning with no assumptions.

Data entry is handled by one of three schemes. Scheme one accepts data entered from the keyboard one row at a time; each row is typed in response to the prompt for the next consecutive row number (1-N). This allows data to be interpreted in row/column form. Scheme two allows rows of data or samples to be entered by keyboard with no row/column association. Scheme three provides data entry and analysis for the ‘case’, which consists of measurements or other quantitative data that serve as identifiers or variables. The assumption is made that all cases entered have the same set of identifiers.

At the start of a program, the prompt for input source (data) appears after the program choice has been made. File name, type, and disk drive designation must be given if the file resides on a separate diskette. If data is entered by keyboard, the default “me” is selected by pressing return. MSUSTAT allows construction of data files with FORTRAN statements or with comma or space delimiters. If data files are formatted with FORTRAN statements, the input source prompt “prior” is used. The default configuration for data file construction is with comma or space delimiters.

Accuracy was evaluated by comparing multiple regression outputs for the Longley data from the Apple II and IBM-PC microcomputers. A data file was created for sixteen cases, each with seven variables (see Table 1). For labeling variables, MSUSTAT allows descriptive headings up to eight characters in length, which may be entered directly from the keyboard or from a disk file. A separate disk file for both variable headings and data was used with the regression analysis, and no problems arose during program operation or printer output. Once the program started, MSUSTAT was instructed to produce a correlation matrix and then do multiple regression using the first six variables to predict the seventh (employment).

The first output is descriptive statistics including skewness, kurtosis, and maximum/minimum values for all analyzed variables. Next, a Pearson correlation matrix was produced (Table 2). It took approximately twenty-five seconds of processing time for both the descriptive statistics and correlation matrix to be produced. The correlation matrix was accurate to four decimal places without any errors. An additional twenty seconds was required to complete the multiple regression analysis.

Regression coefficients are compared to Longley’s desk calculator results in Table 3. For the Apple’s results, three of the coefficients agree with Longley’s results to the second significant digit. The other four (GNP DEF, POP, and YEAR) were not in error by more than 0.9%. The IBM-PC maintained better accuracy with MSUSTAT, calculating regression coefficients that yielded no more than 0.20% error in any variable when compared to Longley’s twelve place calculations.
MSUSTAT produced essentially four-digit error-free results on the IBM-PC.

Table 1 Longley data

GNP DEF     GNP  UNEMPL  MILITARY     POP  YEAR  EMPLOYED
   83.0  234289    2356      1590  107608  1947     60323
   88.5  259426    2325      1456  108632  1948     61122
   88.2  258054    3682      1616  109773  1949     60171
   89.5  284599    3351      1650  110929  1950     61187
   96.2  328975    2099      3099  112075  1951     63221
   98.1  346999    1932      3594  113270  1952     63639
   99.0  365385    1870      3547  115094  1953     64989
  100.0  363112    3578      3350  116219  1954     63761
  101.2  397469    2904      3048  117388  1955     66019
  104.6  419180    2822      2857  118734  1956     67857
  108.4  442769    2936      2798  120445  1957     68169
  110.8  444546    4681      2637  121950  1958     66513
  112.6  482704    3813      2552  123366  1959     68655
  114.2  502601    3931      2514  125368  1960     69564
  115.7  518173    4806      2572  127852  1961     69331
  116.9  554894    4007      2827  130081  1962     70551
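A case from Table 1 could be stored in either of the two data file layouts the review describes. The Python sketch below is only an illustration and is not MSUSTAT code: the function names are hypothetical, and the 7F8.0 layout is an assumed example of a FORTRAN format (seven fields of eight characters each), chosen because the Table 1 values fit in eight columns.

```python
import re

def parse_free(line):
    """Split a record on commas and/or blanks (the default delimiters)."""
    return [float(tok) for tok in re.split(r"[,\s]+", line.strip()) if tok]

def parse_fixed(line, count, width):
    """Read `count` fields of `width` characters each, as a format such as
    7F8.0 would. An all-blank field is read as zero, as in standard FORTRAN."""
    fields = [line[i * width:(i + 1) * width] for i in range(count)]
    return [float(f) if f.strip() else 0.0 for f in fields]

# First case of Table 1 (1947).
first_case = [83.0, 234289.0, 2356.0, 1590.0, 107608.0, 1947.0, 60323.0]

# Free format: commas and spaces may be mixed.
free_record = "83.0,234289 2356 1590,107608 1947 60323"

# Fixed format: every value right-justified in an 8-character field.
fixed_record = "".join(f"{v:8.1f}" for v in first_case)

print(parse_free(free_record))
print(parse_fixed(fixed_record, count=7, width=8))
```

Both calls recover the same seven values. The fixed-width form needs no delimiters at all, which is consistent with the review's later observation that formatted files are read faster.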


Table 2 Regression analysis with Z80-MP (Longley data)

Pearson correlations
         1       2       3       4       5       6       7
1    1.000
2    .9916   1.000
3    .6206   .6043   1.000
4    .4647   .4464  -.1774   1.000
5    .9792   .9911   .6866   .3644   1.000
6    .9911   .9953   .6683   .4172   .9940   1.000
7    .9709   .9836   .5025   .4573   .9604   .9713   1.000

DEPENDENT = 7 EMPLOYED

FIT:
VAR            R-PART      B           SE(B)      T       P-VALUE
1 GNP DEF      .5639E-1    15.20       89.69      .1694   .8632
2 GNP         -.3208      -.3594E-1    .3537E-1   1.016   .3378
3 UNEMPL      -.7941      -2.022       .5158      3.920   .3777E-2
4 MILITARY    -.8359      -1.034       .2263      4.568   .1667E-2
5 POP         -.7085E-1   -.5088E-1    .2388      .2131   .8298
6 YEAR         .7853       1831        481.1      3.806   .4411E-2

INTERCEPT = -.3485E7
R-SQUARED = .9950

ANALYSIS OF VARIANCE:
SOURCE     DF   SS        MS        F-VALUE       P-VALUE
REGRESS     6   .1840E9   .3067E8   [illegible]   .0000
RESIDUAL    9   .9326E6   .1036E6
TOTAL      15   .1850E9
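The first off-diagonal entry of the correlation matrix can be checked independently of MSUSTAT. This is a minimal Python sketch using the GNP DEF and GNP columns of the Longley data; the `pearson` helper is written here for illustration and is not taken from the package.

```python
import math

# Columns 1 and 2 of the Longley data (GNP DEF and GNP), 1947-1962.
gnp_def = [83.0, 88.5, 88.2, 89.5, 96.2, 98.1, 99.0, 100.0,
           101.2, 104.6, 108.4, 110.8, 112.6, 114.2, 115.7, 116.9]
gnp = [234289, 259426, 258054, 284599, 328975, 346999, 365385, 363112,
       397469, 419180, 442769, 444546, 482704, 502601, 518173, 554894]

def pearson(x, y):
    """Pearson correlation computed from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r_12 = pearson(gnp_def, gnp)
print(f"r(GNP DEF, GNP) = {r_12:.4f}")
```

The printed value should agree with the .9916 that the review reports for variables 1 and 2.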

To test MSUSTAT’s ability to handle large whole numbers, a data file comprising five variables was used to calculate means and standard deviations. Variable one consisted of the values one through nine. The next four variables were generated by adding one thousand, ten thousand, one hundred thousand, and one million, respectively, to the values of the first variable (see Table 4).

Table 3 Regression coefficients

Variable   Longley              Apple      IBM
1            15.061872271373    15.20      15.03
2            -0.035819179292    -0.03594   -0.03581
3            -2.020229803816    -2.022     -2.020
4            -1.033226867173    -1.034     -1.033
5            -0.051104105653    -0.05088   -0.05118
6          1829.151464613551    1831       1829
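The percentage errors quoted in the text can be recomputed from the Table 3 coefficients. The sketch below is an illustration under two assumptions: the first Longley coefficient is taken as 15.061872271373 (the scanned entry is partly illegible), and the variable 5 estimates for both machines are taken as negative, in line with Table 2 and the Longley column.

```python
# Coefficients for variables 1-6 as read from Table 3.
longley = [15.061872271373, -0.035819179292, -2.020229803816,
           -1.033226867173, -0.051104105653, 1829.151464613551]
apple = [15.20, -0.03594, -2.022, -1.034, -0.05088, 1831.0]
ibm = [15.03, -0.03581, -2.020, -1.033, -0.05118, 1829.0]

def percent_errors(estimates, reference):
    """Absolute relative error of each estimate, in percent."""
    return [abs(e - r) / abs(r) * 100.0 for e, r in zip(estimates, reference)]

apple_err = percent_errors(apple, longley)
ibm_err = percent_errors(ibm, longley)

for i, (a, b) in enumerate(zip(apple_err, ibm_err), start=1):
    print(f"variable {i}: Apple {a:.3f}%   IBM {b:.3f}%")
```

Under these assumptions the largest Apple error is just over 0.9% and the largest IBM error is about 0.2%, both on variable 1, which is close to the bounds stated in the review.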


Table 4 Integer data

1.00   1001.00   10001.00   100001.00   1000001.00
2.00   1002.00   10002.00   100002.00   1000002.00
3.00   1003.00   10003.00   100003.00   1000003.00
4.00   1004.00   10004.00   100004.00   1000004.00
5.00   1005.00   10005.00   100005.00   1000005.00
6.00   1006.00   10006.00   100006.00   1000006.00
7.00   1007.00   10007.00   100007.00   1000007.00
8.00   1008.00   10008.00   100008.00   1000008.00
9.00   1009.00   10009.00   100009.00   1000009.00

The program for summary statistics (SUMSTATS) allows data to be entered by separate disk file and provides accurate results (Table 5). But when using the MEAN STD program (keyboard entry only), large integer data had to be entered in exponential form to maintain accurate results. The user’s manual recommends that SUMSTATS be used in preference to MEAN STD if additional options are required.

To evaluate processing rates, multiple regression was tested with both the IBM-PC and Apple microcomputers, running formatted and unformatted data files. The files were generated with a variance of 100 and zero correlation using n = 40 cases for each run. A single space delimiter was used between variables when the data files were constructed. Results are compared to MSUSTAT’s performance when installed on a Honeywell Level 66 (CP-6) computer running at 2400 baud (see Table 6). Formatted data files ran about 30% faster on average than unformatted files. The IBM-PC with 8087 support seemed to produce the quickest overall results. Processing times for runs with formatted and unformatted data were averaged and indexed for each computer, with the Honeywell CP-6 results serving as the base (1.00) for comparison. The Z-80’s processing time is approximately twice that of the IBM-PC with 8087 support. These results would be expected, since the comparison here is between an 8 bit environment and a 16 bit environment (with coprocessing support). When comparing processing times of the IBM-PC with and without 8087 support, the differences are negligible. Processing rates are faster when reading data into and out of main memory.

Table 5 Summary statistics for integer data

FIRST CASE = 1.000   1001.   .1000E5   .1000E6   .1000E7
CASES READ = 9
VARIABLE   = 1       2       3         4         5
MEAN( 9)   = 5.000   1005.   .1000E5   .1000E6   .1000E7
STD DEV    = 2.739   2.739   2.739     2.739     2.739
SKEWNESS   = .0000   .0000   .0000     .0000     .0000
KURTOSIS   = 1.770   1.770   1.770     1.770     1.770
MAXIMUM    = 9.000   1009.   .1000E5   .1000E6   .1000E7
MINIMUM    = 1.000   1001.   .1000E5   .1000E6   .1000E7
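The integer-data test can be reproduced outside MSUSTAT. The Python sketch below rebuilds the five variables of Table 4 and computes the mean, standard deviation, and kurtosis with a two-pass method. The review does not say how either MSUSTAT program computes its results, so everything else here is an assumption made for illustration: kurtosis is taken as the fourth central moment over the squared second moment (which happens to reproduce the 1.770 of Table 5), and the single-precision one-pass formula is shown only as one plausible way that large whole numbers can defeat a summary routine.

```python
import math
import struct

def f32(x):
    """Round to IEEE single precision (assumed arithmetic, for illustration)."""
    return struct.unpack("f", struct.pack("f", x))[0]

def two_pass(xs):
    """Mean, sample standard deviation, and moment-ratio kurtosis."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs)
    m4 = sum((x - mean) ** 4 for x in xs)
    std = math.sqrt(m2 / (n - 1))
    kurtosis = (m4 / n) / (m2 / n) ** 2
    return mean, std, kurtosis

def naive_variance_f32(xs):
    """One-pass sum-of-squares variance with every step rounded to float32."""
    n = len(xs)
    s = ss = 0.0
    for x in xs:
        s = f32(s + x)
        ss = f32(ss + f32(x * x))
    return f32(f32(ss - f32(s * s / n)) / (n - 1))

results = {}
for offset in (0, 1000, 10000, 100000, 1000000):
    xs = [float(v + offset) for v in range(1, 10)]
    results[offset] = two_pass(xs) + (naive_variance_f32(xs),)

for offset, (mean, std, kurt, naive_var) in results.items():
    print(f"{offset:>8}: mean={mean:.1f} std={std:.3f} "
          f"kurtosis={kurt:.3f} naive float32 variance={naive_var}")
```

The two-pass figures stay at 2.739 and 1.770 for every column, as in Table 5, while the naive single-precision variance is no longer close to the true value of 7.5 once one million is added to the data.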

Table 6 Processing time comparison (K=10)

Computer         Step         No format   Format 10F5.0   Index
IBM-PC/8087      read data       31            16
                 regression       8             8
                 residuals       38            27
                 total           77            51          1.31
IBM-PC/8088      read data       37            23
                 regression       8             8
                 residuals       42            30
                 total           87            61          1.51
Apple/Z80        read data       65            53
                 regression      20            19
                 residuals       67            41
                 total          152           113          2.69
Honeywell CP-6   read data       12            14
                 regression       8             8
                 residuals       28            28
                 total           48            50          1.00
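The index column can be recomputed from the description in the text: average each machine's unformatted and formatted total times, then divide by the Honeywell CP-6 average. The sketch below is an illustration. Several totals in Table 6 are partly illegible in the source and are taken here as the sums of the three steps.

```python
# (unformatted total, formatted total) for each machine, from Table 6.
totals = {
    "IBM-PC/8087": (77, 51),
    "IBM-PC/8088": (87, 61),
    "Apple/Z80": (152, 113),
    "Honeywell CP-6": (48, 50),
}

def average(pair):
    return sum(pair) / len(pair)

# The Honeywell CP-6 average is the base (1.00) for comparison.
base = average(totals["Honeywell CP-6"])
index = {name: average(pair) / base for name, pair in totals.items()}

for name, value in index.items():
    print(f"{name:15s} {value:.2f}")
```

With these totals the two IBM-PC indexes match the printed 1.31 and 1.51. The Apple/Z80 value comes out near 2.70 against the printed 2.69, a small difference that may reflect rounding in the original or the reconstructed totals.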

MSUSTAT’s processing rates (programming) for regression calculations, however, are the same for both the 8088 alone and with 8087 support.

Further comparison of MSUSTAT with other statistical packages for microcomputers can be made with reference to the November 1983 issue of Byte magazine (pp. 560-570). This article by Dr. Lachenbruch reviews statistical packages designed for 8 bit computers and may be of value in comparing program output of MSUSTAT’s Z80 version in this review.

Overall, MSUSTAT provides a powerful and complete package for numerical analysis on microcomputers. Program diversity and brevity offer an attractive application for small systems. The non-menu-driven system, with help commands available at every prompt, makes the package functional yet easy to use. Program overlays work well with dual drive systems and can easily be transferred to a hard or RAM disk system for continuous, less delayed operation. From a mathematical point of view, MSUSTAT offers one of the most diverse and accurate packages for numerical analysis in small system computing.