Neurotoxicology and Teratology, Vol. 12, pp. 449-454. ©Pergamon Press plc, 1990. Printed in the U.S.A.
0892-0362/90 $3.00 + .00
Dealing With Large Data Sets¹

JOHN F. GRAEFE AND RONALD W. WOOD²

Department of Environmental Medicine, New York University Medical Center, 550 First Avenue, New York, NY 10016
GRAEFE, J. F. AND R. W. WOOD. Dealing with large data sets. NEUROTOXICOL TERATOL 12(5) 449-454, 1990. Collection, storage and retrieval of large amounts of data from multiple experiments for subsequent reduction, graphing and statistical analysis need not be a burdensome task. Although turnkey systems may offer significant economies for single well-defined and repetitive tasks, they may not permit sufficient flexibility to achieve the diverse aims required by many research programs. Using popular microcomputers to run one or a few experimental subjects may confront the investigator not only with significant bookkeeping problems, but also with an allocation of labor resources to computer maintenance and support that might be better invested in research effort. By using networked minicomputers, economies of scale emerge in data collection, transfer, reduction, and analysis, as well as in maintenance, support, and scientific effort.

Data collection    Data management    Data analysis    Network    SKED    RS/1

In laboratories focused on behavioral pharmacology/toxicology and the experimental analysis of behavior, the use of computers to control experimental processes and to collect and store data has become as commonplace as the use of electromechanical relays and counters was two decades ago. Although computers have been used for this purpose for quite some time (2), the choice among various system designs can be perplexing. One strategy is to use personal computers, each dedicated to running one or a few experimental subjects, but unless appropriately networked, this microcomputer approach confronts the investigator not only with the headache of tracking and assembling data from disks on each PC before performing overall analyses, but also with the maintenance of hardware and software on multiple instruments.

Collection, storage and retrieval of large amounts of data from multiple experiments running simultaneously for subsequent reduction, graphing and statistical analysis need not be burdensome in a minicomputer environment. A minicomputer dedicated to process control and data collection, networked to a separate minicomputer for data reduction, relieves the investigator from having to deal repetitively with individual data files and disks. In our laboratory, data are collected and stored on a single disk on a PDP11/73 and are transferred electronically to a MicroVAX II for subsequent examination and statistical analysis using simple command files. Only one local backup operation must be performed to archive data, rather than separate backups on each of several microcomputers.

Computing needs in our laboratory are dictated by two projects, one focused on the behavioral effects of solvents subject to abuse by inhalation, and the second focused on the development of an animal model of cocaine smoking. The solvent project is currently examining the structure-activity relationships among the single
and double constituent C7, C8 and C9 alkyl benzenes across several different preparations using mouse, rat, and monkey. Behavioral effects are characterized in individual animals using enough replications to allow for regression analyses and simultaneous comparisons within animals; the studies employ enough animals to permit quantitative estimates of the variation in sensitivity between animals. For example, spontaneous and food-reinforced activity are simultaneously studied in four 1.3-m³ exposure chambers. One chamber holds 24 rats in suspended wire cages, one holds 24 mice in running wheels, and the two remaining chambers each hold eight rats in running wheels (four of which have feeders) for a total of 64 animals. Animals are typically exposed on Tuesday and Friday to one member of a series of six replications at each of six solvent concentrations, a typical design used in behavioral pharmacology. Activity counts from individual animals are recorded every 10 min, six hours per day, five days per week. A total of 11,520 numbers are recorded per week, 207,360 numbers per compound studied in 64 animals, or a total of 622,080 numbers per year for three compounds.

A second focus is on schedule-controlled operant behavior in both rat and monkey. In the rat preparations, 18 animals per day can be studied in individual five-liter acrylic inhalation boxes; six animals per day are studied in the monkey preparation. Animals are run five days per week for 72 min each day, responses are recorded every two min, and six replications of six concentrations of a solvent are typically examined. Thus, the rat and monkey operant preparations can generate 174,960 and 58,320 numbers per year, respectively; together with the activity studies, the solvent project totals 855,360 numbers annually.

¹Supported by grants K02-DA00117, R01-DA05080, and R01-DA04438 from the National Institute on Drug Abuse.
²Requests for reprints should be addressed to Ronald W. Wood, A.J. Lanza Labs, Long Meadow Road, Tuxedo, NY 10987.

FIG. 1. Flow diagram depicting the hardware and software utilized in experiment control and data collection.

The initial phase of the cocaine project has been the development of a cocaine aerosol ("smoke") generator for delivery of measured atmospheric concentrations with a known particle size distribution appropriate for nonhuman primates. The primary instruments for this task are a gas chromatograph and integrator (GC), which is connected by a serial line to the MicroVAX, and a particle sizing optical spectrometer (OPS), which is operated by software on a PC. The PC is attached to ethernet, thus allowing data files collected from the OPS to be written directly onto a "virtual" disk on the MicroVAX using the Personal Computer Systems Architecture (PCSA), which relies on the DECnet protocols. The GC generates 50 numbers and the OPS generates about 400 numbers per typical sample run. If 10 samples per day are taken, a total of about 4,500 numbers per day are recorded. Heart rate and operant performance are recorded during experimental sessions; heart rate is obtained by recording the interbeat interval with 10-msec resolution, while responses are recorded in two-min epochs. In a 60-min session approximately 36,000 numbers are recorded. For both the analytical and functional measurements, a total of approximately 40,000 numbers per day or upwards of 4 million numbers per year could be generated if warranted.

Thus, in a year we anticipate collecting about 5 million numbers (not bytes). Managing a data base of this size is a small task in comparison with that of an airline reservation or banking system, and the computational effort involved in number-crunching and sorting data into daily tables is not nearly as intensive as in fluid dynamics, weather prediction, or computer-aided design. Basically, the task is the same one that behavioral pharmacologists have always done, except that there are five times the number of animals that Peter Dews (1982) said was possible to track before experimental error begins to creep in and obscure effects.
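The solvent-project totals quoted above follow directly from the sampling schedule. The Python sketch below reproduces them as a worked check. The variable names are illustrative only, and the 54-week study year (three compounds at 18 weeks each) is an assumption inferred from the published totals, not a figure stated in the text.

```python
# Worked check of the solvent-project data volumes quoted in the text.
# Assumption: a study year is 54 weeks (3 compounds x 18 weeks each).

# Activity studies: 24 rats + 24 mice + two chambers of 8 rats.
ACTIVITY_ANIMALS = 24 + 24 + 2 * 8
ACTIVITY_READINGS_PER_DAY = (6 * 60) // 10   # one count every 10 min for 6 h
DAYS_PER_WEEK = 5

# Six replications at each of six concentrations, two exposures per week.
EXPOSURES_PER_COMPOUND = 6 * 6
EXPOSURES_PER_WEEK = 2                       # Tuesday and Friday
WEEKS_PER_COMPOUND = EXPOSURES_PER_COMPOUND // EXPOSURES_PER_WEEK
COMPOUNDS_PER_YEAR = 3
WEEKS_PER_YEAR = WEEKS_PER_COMPOUND * COMPOUNDS_PER_YEAR

activity_per_week = ACTIVITY_ANIMALS * ACTIVITY_READINGS_PER_DAY * DAYS_PER_WEEK
activity_per_compound = activity_per_week * WEEKS_PER_COMPOUND
activity_per_year = activity_per_compound * COMPOUNDS_PER_YEAR

# Operant studies: 72-min sessions with responses recorded every 2 min.
OPERANT_READINGS_PER_DAY = 72 // 2
rat_operant_per_year = 18 * OPERANT_READINGS_PER_DAY * DAYS_PER_WEEK * WEEKS_PER_YEAR
monkey_operant_per_year = 6 * OPERANT_READINGS_PER_DAY * DAYS_PER_WEEK * WEEKS_PER_YEAR

solvent_total_per_year = (
    activity_per_year + rat_operant_per_year + monkey_operant_per_year
)

if __name__ == "__main__":
    print("activity, per week:     ", activity_per_week)
    print("activity, per compound: ", activity_per_compound)
    print("activity, per year:     ", activity_per_year)
    print("rat operant, per year:  ", rat_operant_per_year)
    print("monkey operant, per year:", monkey_operant_per_year)
    print("solvent project total:  ", solvent_total_per_year)
```

Under these assumptions every intermediate value matches the figure given in the text, which suggests that the annual operant totals were also computed on a 54-week basis.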
In order to conduct reliable scientific research and stay within budgets without a significant actuarial burden, our general strategy has been to purchase proprietary hardware and software products (thus profiting from the extensive engineering invested in them and not having to design and test interfaces and operating system software) and to let service contracts for computer hardware and software maintenance. This leaves the investigators responsible only for data collection and analysis, and for monitoring the continuing function of the instruments. Our systems have proven to be cost effective and powerful (Figs. 1 and 2).

The remainder of this paper examines: 1) the desirable attributes of a data collection system; 2) the requirements of data transfer, storage, and retrieval; and 3) the desirable attributes of a data reduction, analysis and graphics system.

DESIRABLE ATTRIBUTES OF A DATA COLLECTION SYSTEM
1) Reliability of hardware. This factor is critical to continuous data collection. The hobbyist "do-it-yourself" machine is "self-insured" and has hidden personnel costs when the equipment breaks and must be repaired in house, or when someone else must be found to perform either remote or onsite repairs. A proprietary instrument with a service contract for 24-hour repair by trained service company personnel assures minimal interruption of data collection and analysis and the continued productivity of laboratory staff.

2) Economy of scale. Duplicating microprocessor systems in the $2000-$5000 range, each with its own peripherals, may ultimately be costlier than purchasing one $8000-$10,000 machine with a larger disk drive and fewer keyboards, monitors, peripherals, power supplies, and tape units. Furthermore, the address space of a controlling microcomputer may limit data manipulation: if local data reduction is required in a microcomputer environment, data sorting or description may be awkward and/or time consuming as intermediate files or disk-resident arrays are brought in and out of memory.

3) Independent starting and stopping of experiments. Single program path environments, such as those using BASIC or RT11, can require the investigator to anticipate and load every parameter at once at the start of the session. The investigator must anticipate required service for any condition of every experiment on each path through the program's cycle. This can cause severe problems in debugging as well as limit the experimenter's ability to modify parameters while an experiment is in progress.

4) Multiple user access. Several experimenters must be able to access the machine simultaneously for program preparation, experiment control, and data handling.

5) Easy exportation of data to other computing environments.

6) Local backup capabilities. How clumsy will it be a year from now to find your data spread across 400 diskettes? How important is loss of data to you? Backing up data regularly on removable media of some form is essential.

FIG. 2. Diagram of the hardware configuration utilized for data collection, storage and retrieval. (Labeled components: PDP 11/73, node RWW, end node; PDP 11/73, node RWW2, end node; MicroVAX II, node MVAX, routing node; 80286 microcomputer, node PC1; 70 MB and 20 MB Winchester disk drives; disk drives, 684 MB total; 95 MB cassette tape unit; tape library; 1.2 MB floppy disk drives; 16-port terminal server with printers and terminals.)

    Do j = 1 to 24;
      /* make individual rat tables from a mastertable containing all rat data */
      Make Table ('RAT'.j) from table ('Mastertable') where column RAT = j;
      /* make a series of graphs of total responses across daily sessions */
      Make graph ('RAT'.j.'Calendar') from column DATE of table ('RAT'.j) vs column TOTAL RESPONSES;
      Display graph ('RAT'.j.'Calendar');
      /* make a table of summary statistics where columns are different doses */
      Allocate table 'Summary' 20 rows by 8 columns;
      Do k = 1 to 8;    /* 6 concentrations + air & control */
        MEASURE Column TOTAL RESPONSES of Table('RAT'.j) where column DOSE = k;
        if table exists('Summary')
          then {ADD Column to Table 'Summary' from Table 'Measures'}
          else {make Table 'Summary' from Table 'Measures'}
        SET ROW DOSE COL K to K    /* Enter Dose for Column */
      End;
      /* make a series of dose-effect graphs with SEMs for each rat */
      Make graph ("DoseEffect".j) from row DOSE of table 'summary' vs row MEAN;
      ADD ERROR BARS to CURVE 1 of GRAPH ("DoseEffect".j) from row STDERROR of Table 'summary';
      Plot graph ("DoseEffect".j)
    End;

FIG. 3. Example of RS/1 Research Programming Language used to make individual animal data files, daily session data plotted over actual calendar days, summary statistics tables, and concentration-effect graphs with S.E.M.s for individual animals.

         1           2           3           4    5
    Row  INTERVAL 1  INTERVAL 2  INTERVAL 3  ...  INTERVAL 36
    1    220         222         224         ...  53
    2    213         203         173         ...  124
    3    195         175         235         ...  36
    4    220         223         227         ...  27
    5    153         110         134         ...  199

         6          7     8     9           10
    Row  DATE       SOLV  CONC  TOTAL RESP  LOG10
    1    11-MAR-87  0     0     4867        3.687261
    2    12-MAR-87  0     0     3499        3.543944
    3    13-MAR-87  0     AIR   4167        3.619824
    4    17-MAR-87  0     AIR   4057        3.608205
    5    18-MAR-87  0     0     3592        3.555336

         11               12          13          14
    Row  LOG - LOG THURS  SUM LAST30  LOG LAST30  LOG SUMLAST30 - LOG THURS
    1    0.143317         280         2.447158    -0.176091
    2    0.000000         420         2.623249    0.000000
    3    0.170808         36          1.556303    -0.663806
    4    0.159190         52          1.716003    -0.504105
    5    0.106321         233         2.367356    0.147248

FIG. 4. Example of a portion of an individual animal daily data file.

To attain these attributes we chose a Digital Equipment Corporation (DEC) Micro PDP11/73 system with 4 megabytes (mb) of memory (2 mb will do), a 70-mb nonremovable disk, and
a 95-mb tape unit for software distribution and backup. We use Coulbourn interface modules (Coulbourn Instruments, Inc.) and SKED-11 (State Systems, Inc.) software for experiment control and online monitoring of experiment status. Raw binary data files are formatted and converted to machine-independent ASCII data files using Fortran 77. This machine can also be used for further data reduction, although we no longer do so.

THE REQUIREMENTS OF DATA TRANSFER, STORAGE, AND RETRIEVAL
Data Transfer

Moving data between machines requires one of the following: a) compatible media, so that disks or tapes can be carried between machines ("sneaker net"); b) asynchronous communications programs such as Kermit or xmodem to move files between machines; these are usually multistep procedures that require operator execution, and are limited to conventional baud rates (e.g., 9600 bps); or c) networks. Networks entail additional expense, but have major advantages: 1) speed (10 megabits/sec with ethernet); 2) direct compatibility with institutionwide computing systems; 3) given privilege, access to all machines on the network from one seat at one terminal; 4) multiple concurrent sessions, e.g., examining experiment control on one or several machines while doing data reduction and analysis or working on manuscripts or correspondence on another machine; 5) easy syntax for moving files (e.g., copy a to b); 6) capability for complete automation, e.g., copying files at a set time of day.

For data transfer DECnet has been adopted; this is a proprietary protocol which is in wide use and has a very high degree of reliability and security.
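To give a feel for the speed advantage claimed for networks, the Python sketch below compares nominal transfer times for one year of solvent-project data over a 9600-bps serial line and over 10-megabit ethernet. Several values are assumptions made only for illustration: 8 bytes of ASCII per stored number, 10 bits on the wire per byte for the asynchronous line (start and stop bits), and no protocol overhead, retries, or disk latency on either link. Real throughput would be lower in both cases.

```python
# Nominal transfer-time comparison: serial line vs. ethernet.
# Assumptions (not from the paper): 8 ASCII bytes per number, 10 line bits
# per byte on the asynchronous link, and zero protocol overhead.

NUMBERS_PER_YEAR = 855_360      # solvent-project total quoted in the text
BYTES_PER_NUMBER = 8            # assumed ASCII field width incl. delimiter

SERIAL_BPS = 9_600              # "conventional baud rate" from the text
ETHERNET_BPS = 10_000_000       # 10 megabits/sec from the text


def transfer_seconds(n_bytes, bits_per_second, bits_per_byte=8):
    """Return the time to move n_bytes at a nominal line rate."""
    return n_bytes * bits_per_byte / bits_per_second


payload_bytes = NUMBERS_PER_YEAR * BYTES_PER_NUMBER
serial_seconds = transfer_seconds(payload_bytes, SERIAL_BPS, bits_per_byte=10)
ethernet_seconds = transfer_seconds(payload_bytes, ETHERNET_BPS)

if __name__ == "__main__":
    print(f"payload:  {payload_bytes / 1e6:.2f} million bytes")
    print(f"serial:   {serial_seconds / 3600:.2f} h at {SERIAL_BPS} bps")
    print(f"ethernet: {ethernet_seconds:.2f} s at {ETHERNET_BPS} bps")
    print(f"ratio:    about {serial_seconds / ethernet_seconds:.0f}x")
```

Even as a rough estimate, the difference is between an operator-attended transfer lasting hours and an unattended copy lasting seconds, which is the practical point of the comparison in the text.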
FIG. 5. Example of a portion of an individual animal's performance across calendar days. (Legend: control, 3000 ppm, 4000 ppm.)

Data Storage

For storage purposes, there is a large disk capacity on a central MicroVAX: 684 mb divided across three disk drives (one system disk and two data disks, one of which can stand in for one of the other two should it fail). Estimated use is about 15 mb per month in our environment. A 95-mb cartridge tape drive is used for weekly backups (slow, but 35 cents/mb), and four sets of tapes are rotated to minimize potential data loss. When an experiment is completed, the raw binary data are archived on tape and then deleted from the disks. Backup tapes of the operating system and archived data are stored in a vault for physical security.
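The storage figures above lend themselves to a quick capacity estimate. The Python sketch below is illustrative only. It assumes that the full 684 mb is available for data (which overstates the headroom, since one drive is the system disk and one can serve as a spare), that the 35 cents/mb figure refers to tape media cost, and that the four tape sets are rotated in simple round-robin order. None of these details is confirmed by the text.

```python
import math

# Figures taken from the text.
DISK_CAPACITY_MB = 684        # total across the three MicroVAX drives
GROWTH_MB_PER_MONTH = 15      # estimated use
TAPE_CAPACITY_MB = 95         # cartridge tape
TAPE_COST_PER_MB = 0.35       # assumed to be media cost in dollars
TAPE_SETS = 4                 # sets rotated for weekly backups


def months_until_full(capacity_mb, growth_mb_per_month, used_mb=0.0):
    """Months of headroom left at a constant growth rate."""
    return (capacity_mb - used_mb) / growth_mb_per_month


def tapes_needed(data_mb, tape_capacity_mb=TAPE_CAPACITY_MB):
    """Whole cartridges required to hold data_mb in one full backup."""
    return math.ceil(data_mb / tape_capacity_mb)


def tape_set_for_week(week_number, n_sets=TAPE_SETS):
    """Round-robin choice of tape set (0-based) for a given week."""
    return week_number % n_sets


headroom_months = months_until_full(DISK_CAPACITY_MB, GROWTH_MB_PER_MONTH)
one_year_mb = 12 * GROWTH_MB_PER_MONTH
tapes_per_full_backup = tapes_needed(one_year_mb)
cost_per_tape = TAPE_CAPACITY_MB * TAPE_COST_PER_MB

if __name__ == "__main__":
    print(f"headroom: {headroom_months:.1f} months")
    print(f"one year of data: {one_year_mb} mb on {tapes_per_full_backup} tapes")
    print(f"cost per cartridge: ${cost_per_tape:.2f}")
    print("tape set by week:", [tape_set_for_week(w) for w in range(8)])
```

With four sets in rotation, the oldest surviving backup is always about four weeks old, so a corrupted file can go unnoticed for up to three backup cycles and still be recoverable.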
    0   Statistic            1               2               3
                             0.000000        air             178.000000
    1   Count (N)            26.000000       16.000000       6.000000
    2   Sum                  88102.000000    62813.000000    24816.000000
    3   Mean                 3388.538462     3925.812500     4136.000000
    4   SEM (s.e. of mean)   229.874067      233.032584      666.927282
    5   Median               3392.500000     4027.000000     4564.000000
    6   Variance             1373894.258462  868866.962500   2668752.000000
    7   StDev (sd)           1172.132355     932.130336      1633.631537
    8   Maximum              5703.000000     5503.000000     5690.000000
    9   Minimum              760.000000      1986.000000     2176.000000
    10  Range                4943.000000     3517.000000     3514.000000
    11  Skewness             0.009963        -0.365674       -0.375960
    12  Kurtosis             0.175232        0.068746        -2.462357
    13  Mean+2SEM            3848.286596     4391.877668     5469.854565
    14  Mean-2SEM            2928.790327     3459.747332     2802.145435

    0   Statistic            4               5               6
                             300.000000      560.000000      1000.000000
    1   Count (N)            6.000000        6.000000        6.000000
    2   Sum                  33661.000000    25374.000000    18678.000000
    3   Mean                 5610.166667     4229.000000     3113.000000
    4   SEM (s.e. of mean)   320.840918      750.410110      420.425182
    5   Median               5680.000000     4397.500000     2932.000000
    6   Variance             617633.366667   3378692.000000  1060544.000000
    7   StDev (sd)           785.896537      1838.121868     1029.827170
    8   Maximum              6637.000000     6327.000000     4327.000000
    9   Minimum              4666.000000     2088.000000     1645.000000
    10  Range                1971.000000     4239.000000     2682.000000
    11  Skewness             -0.069509       -0.172171       -0.056658
    12  Kurtosis             -1.542608       -2.057119       -1.000018
    13  Mean+2SEM            6251.848502     5729.820220     3953.850363
    14  Mean-2SEM            4968.484831     2728.179780     2272.149637

    0   Statistic            7               8
                             1780.000000     3000.000000
    1   Count (N)            6.000000        2.000000
    2   Sum                  7027.000000     1229.000000
    3   Mean                 1171.166667     614.500000
    4   SEM (s.e. of mean)   168.172217      300.500000
    5   Median               1129.000000     614.500000
    6   Variance             169691.366667   180600.500000
    7   StDev (sd)           411.936120      424.971175
    8   Maximum              1664.000000     915.000000
    9   Minimum              557.000000      314.000000
    10  Range                1107.000000     601.000000
    11  Skewness             -0.218929
    12  Kurtosis             -0.587222
    13  Mean+2SEM            1507.511100     1215.500000
    14  Mean-2SEM            834.822233      13.500000

FIG. 6. Example of a summary statistics table showing control (0 ppm), air and six concentrations of a test atmosphere.
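The per-concentration summary in Fig. 6 can be approximated outside RS/1 with a few lines of general-purpose code. The Python sketch below groups session totals by concentration and computes most of the same rows. It is not the RS/1 MEASURE procedure: the function names and the sample data are hypothetical, the variance uses the usual n - 1 denominator, and skewness and kurtosis are omitted because the exact formulas RS/1 uses are not given in the text.

```python
import math
import statistics
from collections import defaultdict


def summarize(values):
    """Return descriptive statistics for one list of session totals."""
    n = len(values)
    mean = statistics.fmean(values)
    # The sample variance (n - 1 denominator) is undefined for one value.
    variance = statistics.variance(values) if n > 1 else float("nan")
    sd = math.sqrt(variance)
    sem = sd / math.sqrt(n)
    return {
        "Count (N)": n,
        "Sum": sum(values),
        "Mean": mean,
        "SEM": sem,
        "Median": statistics.median(values),
        "Variance": variance,
        "StDev": sd,
        "Maximum": max(values),
        "Minimum": min(values),
        "Range": max(values) - min(values),
        "Mean+2SEM": mean + 2 * sem,
        "Mean-2SEM": mean - 2 * sem,
    }


def summarize_by_dose(sessions):
    """Group (dose, total_responses) pairs by dose and summarize each group."""
    groups = defaultdict(list)
    for dose, total in sessions:
        groups[dose].append(total)
    return {dose: summarize(totals) for dose, totals in groups.items()}


# Hypothetical session totals for one animal: (concentration, total responses).
sessions = [
    ("air", 3900), ("air", 4100),
    (300, 5600), (300, 5700), (300, 5500),
    (3000, 600),
]
table = summarize_by_dose(sessions)

if __name__ == "__main__":
    for dose, stats in table.items():
        print(dose, {name: round(value, 3) for name, value in stats.items()})
```

The relation SEM = StDev / sqrt(N) used here is consistent with the published table: for the control column, 1172.132355 / sqrt(26) is approximately 229.874.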
Retrieval

By keeping large amounts of data on line, the necessity of maintaining extensive hardcopy catalogues and tape or disk libraries from which data must be physically retrieved is obviated. Because the storage capacity of 5.25-inch Winchester disk drives is becoming enormous, the economy of scale is dramatic; by purchasing large capacity drives for a MicroVAX, one can achieve almost an order of magnitude savings in cost per unit of storage.
As in most laboratories, we have experienced disk failures, and the existence of our backup tapes has permitted us to come back up painlessly after hardware replacement. As disks get larger, backup becomes more burdensome unless an incremental backup strategy or a larger capacity tape technology is adopted.

DESIRABLE ATTRIBUTES OF A DATA REDUCTION, ANALYSIS AND GRAPHICS SYSTEM

1) Ease of use of any software system is paramount, so that the naive user is encouraged to jump right in; extensive defaulting and native language syntax are very helpful to this end. The investigator should be able to ask a naive technician to make a graph and that person should have success on the first try.

2) The system should either be menu driven or should have a graphical user interface for the naive user.

3) The user should be able to address and retrieve data sets quickly and automatically using a syntax for generating data set names, e.g., by concatenating strings of letters and numbers under program control, so that data structures are addressable by animal, treatment, sex, or other attribute.

4) There should be macro or programming language capability for the automation of repeated sequences of commands.

5) The user should be able to edit all objects, i.e., data tables, graphs, barcharts, macros, and programs.

6) There should be device independence, i.e., any graphics terminal, laser printer, plotter, or dumb device should be supported.

7) The system should allow the user to work at a fast enough pace so that he/she doesn't feel put upon. Speed reflects not just central processing unit (CPU) speed but data throughput (hardware and software).

8) Color displays are very helpful for rapid examination of complex data sets; otherwise, unique line types and symbols must be used to extract the finding of interest.

9) Independence of data format restrictions is quite useful; for example, being able to plot data using the date or the time of day as the abscissa can be advantageous.

10) Efficient use of storage media is important. Space should not be allocated to hold nothing; garbage should be collected and data compressed.

11) The system and its software should allow access by multiple users to shared procedures and data, thus reducing duplication of effort.

12) Security of data sets from inappropriate mixing, manipulation, or vandalism is imperative.
13) There should be extensive documentation, easy management and installation, online help, and a phone support hotline.

14) There should be enough statistical capability to perform routine tasks and some not so routine tasks, but not so much as to duplicate other major packages (e.g., SAS, BMDP).

RS/1, a product developed and marketed by Bolt, Beranek and Newman (BBN Software Products Corp.), together with its companion statistical advisor, RS/Explore, meets or exceeds all of the above requirements. This product was originally bought for the PDP11/73. It was a major advance for this laboratory, since it is much faster and easier to use than other programming and data reduction environments with which we have had experience. Figure 3 is an example of the programming language and some of the descriptive statistics capabilities of RS/1. Figures 4, 5, 6 and 7 show the tables and graphs produced by the program shown in Fig. 3. When this software environment was no longer supported on this machine, we bought a MicroVAX to implement it on a stable computing platform. At about the same time the MicroVAX was purchased, DECnet became available. Given the reductions in hardware prices that have occurred in the past few years, the whole environment consisting of a MicroVAX, the Micro PDP11/73, interface hardware, and software can be purchased for about $40,000. It would not be possible with our limited manpower to undertake the effort to handle such large data sets without this environment.

FIG. 7. Example of an individual animal concentration-effect graph (mean ± 2 s.e.m.) made from a table similar to that shown in Fig. 6. (Abscissa: CONCENTRATION, with air, 178, 300, 560, 1000, 1780, and 3000.)

We have a final word about flexibility and obsolescence. One of the great advantages of the early equipment used in behavioral pharmacology and toxicology was the flexibility and rapidity with which procedures could be changed; this ability to ask scientific questions quickly facilitated innovation and the fleshing out of the literature. Even then the hardware imposed limits on the questions asked; the limits now are frequently the effort and sophistication required to implement the experiment. This ability to do things quickly and simply is a major consideration today when we ask students to become simultaneously not only psychologists, pharmacologists, and toxicologists, but also computer programmers. The computing environment chosen directly influences the actual behavior of the scientist at the bench and the amount of effort necessary to ask interesting scientific questions.

There is an old saying that if you can buy a piece of equipment it is already obsolete. There is a lot of wisdom in this observation, which reflects the product development cycle of the computer industry, a cycle largely driven by the joint actions of engineering advances and marketing strategies. The experimenter considering a system purchase must realize the possibility that the competencies developed in a given environment may eventually be useless if the cost of maintaining the system becomes too great or parts cease to be available. An examination of the history of computing systems across the last twenty years shows us a littered landscape of once useful systems that have fallen by the wayside as either too clumsy, too weak, or too expensive to maintain. The secret of success in this area is choosing systems that can be rapidly configured to ask the scientific questions at hand at a reasonable cost in both manpower and equipment.
This choice is heavily influenced by the state of the computer industry as a whole. The researcher should be satisfied that the chosen computer "hotbox," today's dream machine, will have a product lifecycle long enough to comprise a suitably large proportion of his or her research career.
REFERENCES

1. Dews, P. Epistemology of screening for behavioral toxicity. In: Mitchell, C. L., ed. Nervous system toxicology. New York: Raven Press; 1982:229-236.
2. Wood, R. W.; Sette, W. F.; Weiss, B. Languages for psychologists. Am. Psychol. 30:230-238; 1975.