LIGA members

LIGA members

Quaternary International, Vol. l&12, pp. 243-248, Printed in Great Britain. All rights reserved. THOUGHTS 1991. ON A PRACTICAL 0 104&6182/91$0.00...

678KB Sizes 4 Downloads 23 Views

Quaternary International, Vol. l&12, pp. 243-248, Printed in Great Britain. All rights reserved.

THOUGHTS

1991.

ON A PRACTICAL

0

104&6182/91$0.00 + SO 1992 INQUAIPergamon Press Ltd

DATABASE AND DATA EXCHANGE FOR CELWLIGA MEMBERS

Louis Department of Geology and Geophysics,

J. Maher,

Jr

University of Wisconsin, Madison, WI 53706, U.S.A.

A practical database should function to record information about a topic, a site, or sites, so that the information is secure, but at the same time readily available to authorized users. The information should be in the form of a digital file using standard and well-proven formats and media. Copies should be made available easily to authorized users, but the original database file should be ‘read only’. As scientists will be using microcomputers both for providing and receiving data, the files should have a format that can be distributed to the widest number of potential users. There is no reason to demand conformity among end-users in how the data are handled in their own facilities; however, the mannner in which data are transferred between the users is important. Practical considerations suggest that data in standard ASCII text format will be the easiest to handle. For the immediate future, data files can be exchanged in the mail or in person using standard ‘floppy’ magnetic disks. It would be a marked advantage to use the various governmental e-mail networks for both the international exchange of data and ideas. Concomitantly, agreements of a more delicate nature need to be worked out, to insure that the rightful ‘ownership’ of the original data is observed and credit given when it is used by others.

INTRODUCTION

A DATABASE

The individuals and teams studying the Last Interglacial in the Arctic and Sub-Arctic are going to acquire diverse kinds of data at many sites. This will range from botanic and zoologic material through stratigraphies of soils, magnetism and isotopes, and it will come from both terrestrial and marine sources (see introductory chapter of this volume). To be useful, this information must be recorded and maintained in some form of database. The purpose of this paper is to discuss some of the options currently available for storing and sharing these data. It is possible to specify equipment and techniques for handling data that are very expensive. However, funds required for data handling cannot be used for research personnel, field work, and laboratory study. It is for that reason that it is essential to utilize techniques and equipment that are readily available and low in cost; indeed, some very useful avenues of data exchange are essentially free. There are many ways of exchanging data. Tables of numbers on paper are easily handed or mailed to a colleague and kept for a lifetime in a file folder. For many purposes paper records are still the easiest and most permanent means of storing and sharing data. FAX has recently become popular by moving information at near-light speed and letting the recipient supply the paper. But there are not many ‘degrees of freedom’ in a table of numbers on paper. Today, microcomputers and floppy magnetic disks give Quaternary scientists incredibly effective ways of sharing information. The developing European and North American pollen databases (Grimm, 1990a) come to mind as examples involving the handling and storage of large masses of data.

CENTER

A practical database should function to record information about a topic, a site, or sites, so that the information is secure, but at the same time readily available to authorized users. This implies that there be a Center (or Centers) where the body of data is stored or archived. The Center then serves as a clearing-house enabling colleagues to share their information. It is suggested that two Centers for the CELIA/LIGA database be established, one in Edmonton, Alberta, and the other at Madison, Wisconsin. The body of information about a particular topic or site should be in the form of a digital file using standard and well-proven formats and media. At present the files would normally be stored on magnetic disks and archived on either magnetic or optical disks, or tape. Copies should be made available easily to authorized users, but the original database file should be ‘read only’. That is to say, changes to the archive file should be made only on the authority of the data’s originator (the owner), and the date and extent of the changes should be clearly indicated at the end of the archive file. The archive file should also include the names of users, and the date they were sent copies; in this way colleagues can be advised when major changes made to the master file make their copies obsolete. A STANDARD

MICROCOMPUTER

As scientists will be using microcomputers both for providing data to-and receiving data from-the Center, the files should have a format that is accessible internationally and sharable to the widest number of potential users. On the basis of the number of installed

243

244

L.J.

units and reasonable cost, the ‘industry standard’ (the IBM PC and its many clones) would seem the standard to adopt for data transferred by magnetic disk. Loyalists of other types of computer (Apple, VAX, etc.) are able to translate between the IBM protocol and their own. DATA-HANDLING

PROGRAMS

There are many word-processor-, spread-sheet-, and database-programs to choose between, and we often develop strong feelings about which is best. There is no reason to demand conformity among end-users in how the data are handled in their own facilities. However, common standards must be used when data are to be shared. Separate regional pollen and spore databases have recently been set up in Europe, Canada, and the United States. Fortunately, there has been sufficient liaison between the participants that certain common standards have been agreed on, and the CELIA/LIGA database can benefit by adhering to them. The European and United States Pollen Databases are using ‘industry standard’ computers and the Borland PARADOX Database program (Grimm, 1990a; Anonymous, 1991). Borland’s QUATTRO PRO is a top-rated spreadsheet program (Anonymous, 1990b; Reinhardt, 1990) which is compatible with PARADOX. Further, QUATTRO PRO reads Lotus’ l-2-3 spreadsheets, and all can import suitably structured tables of ASCII text. These spreadsheet and database programs are powerful, and - because numerous copies have been sold worldwide - they are relatively inexpensive and wellsupported. Sources for these and other programs are listed in the Appendix. Most of the data to be gathered by the CELIA/LIGA members can be compiled, handled, and maintained in QUATTRO PRO and PARADOX files. Although the programs mentioned next were designed with palynology in mind, they also can display other kinds of data. TILIA and TILIA-GRAPH (Grimm, 1990b) uses its own spreadsheet to handle pollen data, and can produce a standard pollen diagram suitable for publishing. It can also produce labeled columns of other time-series data. TILIA imports and exports several different data formats as well as its own. PALYPLOT (Chumbley, 1991) constructs publication-quality pollen and plant macrofossil diagrams by producing a file that is loaded into GENERIC CADD for editing or printing. These programs have provision for outputting digital files which can be shared with other users. I have produced several utility programs that can be useful for handling and displaying pollen data (Maher, 1990). In particular, my POLFILE version 1.15 also translates spreadsheet and other formats between, for example, TILIA and PALYPLOT. DATA STORAGE

Maher. Jr considerations suggest that data in standard ASCII text format will be the easiest to handle and exchange on an international basis. All major data-manipulating programs have an option for importing or exporting ASCII text. Data in ASCII text format also can be viewed easily with a text editor, and the receiver gets immediate assurance that the file survived the trip to (or from) the Data Center. ASCII text generally can be converted to another individually-preferred style with a simple conversion program written with the BASIC interpreter supplied with most PCs. One may hear that ASCII text is ‘old technology’, and that there are better techniques for efficient storage - like the waste of using zeros for unrecorded taxa, or wasting a whole byte to indicate a decimal ‘l’, say, when the same byte could record 256 separate integers (Walanus, 1989). Efficient utilization of space is undoubtedly important when the volume of data is truly huge and when storage media are being newly developed and especially expensive. But it becomes trivial when one considers that all the pollen and spore data for a good-sized country can be kept on a few floppy disks. I once did a study on how the size of a file depends on the storage format; I provide a summary here because it is relevant. I used an array of pollen data from my Devils Lake Site in Wisconsin. It consists of 80 taxon categories over 134 stratigraphic samples; that makes up 10,720 items of data though about a third were zeros. To that value we must add the two array dimensions, an alpha-string title, and 80 alpha-strings for the taxon names. I stored these data in seven different formats to determine the file size required by each format; the results are shown in Fig. 1. The shortest ASCII file was one in which the number quantity was converted to a string value with the leading blank (needed for the sign in negative numbers) removed and terminated with hexadecimal OD (carriage return). See Table 1 for examples of the various formats. When the 28,600-byte file is examined with an editor, it is one thin line of numbers along the left

8 ;; 8 k! s

programs for convertto another, practical

ONE/LINE+.ZIP ARRAY(lSP).ZIP WISC-CONDZIP TILIA-TIL.ZIP WISC-REG.ZIP PAFtADOX(3).ZIP QUATTROPRO.ZIP L 0

20 FILE SIZE IN BYTES (Thousands)

AND TRANSMISSION

Although there are specialized ing one propriety file format

ONE/LINE+(CR) ARRAY(lSP).PRN WISC-CONDRAW WISC-REG.RAW TILIA.TIL PARADOX(3).DB QUATTROPRO.WQl

FIG.

1. An array of data requires different amounts of file space according to its format. See text for discussion.

Databases

TABLE

1. An illustration

how different

file formats

handle

for CELIA/LIGA

an 80 x 134 data shortened

245

Members

array

for Devils

Lake,

Wisconsin.

Standard Table of Data ---Devils Lake, Sauk Co., Wisconsin 80

(Title) (Number of taxon categories) (Number of depth levels)

134 Depth

Spike

Spike

(cm) 0.5

Mean

Std.Dev.

124890

1553

0.665

0.008

2

124890

1553

0.614

0.007

518

9

124890

1553

0.598

0.007

305

16

124890

1553

0.676

0.008

366

19.5

1553 0.603 124890 [remainder deleted]

0.007

531

ASCII File: ----

One item E

Volume cm3

(taxa)

134

(depth levels)

Spike

Etc

Std.Dev.

Counted

___

302

--line followed by a carriage return

Devils Lake, Sauk Co., Wisconsin .80

Volume

(Title)

.5 124890 1553 .665 ,008 302 [remainder deleted] ASCII File: ---

Standard Array -with 1 space --between each item

“Devils Lake, Sauk Co., Wisconsin” 80 134 .5 124890 1553 ,665 ,008 302 ,846 ,264 .0514 -10 1 0 7 12 1

1 16 5 0 31...

2 124890 1553 ,614 ,007 518 ,782 ,228 .0514 30 2 0 19 12 1 4 22 1 0 39... 9 124890 1553 ,598 ,007 305 .764 ,242 .0514 110 1 0 6 12

1 2 25 8 0 46...

16 124890 1553 .676 ,008 366 .79 ,175 .0514 190 1 0 8 6 1 0 16 6 0 47 lg... 19.5 124890 1553 ,603 ,007 531 ,692 ,131 .0514 230 0 0 7 2 7 8 33 4 0 63... 22.5 124890 1553 ,633 ,008 433 ,717 .124 .0514 260 1 2 17 17 2 6 19 11 O... [remainder

deleted]

ASCII Wisconsin Condensed -----

RAW file (always 10 or fewer items per line)

Devils Lake, Sauk Co., Wisconsin 80 134 .5 124890 1553 ,665 ,008 302 ,846 ,264 .0514 -10 107121

1165031

131725351051

11

0120020000 00002411

515304

The ASCII Wisconsin Regular -RAW file --~ (always 10 or fewer items per line, but with an extra space between numbers)

13011420120 30000075221 [remainder deleted] The binary file formats of TILIA hexadecimal

, PARADOX, and OUATTROPRO

are not shown,

notation can be viewed with the DOS file DEBUG.COM

Their structure m

or other utilmes.

All examples

drastically

246

L.J.

Maher.

margin of the screen - very easy for a machine to read, but impossible for a human to comprehend. The next format was an ASCII array in which each of the 134 rows consisted of 80 numbers separated by a single blank space. There is little improvement in the ease of reading this file, nor in space reduction, because most of the carriage returns were replaced by the spaces, and the long rows extend off the screen. The third file structure is what I call a ‘Concentrated Wisconsin’ format. Here the array rows are broken with a carriage return after every ten numbers, and the numbers are each separated by a single space; this file format is easier to interpret when viewed on the screen - perhaps because of our ten fingers. The ‘Regular Wisconsin’ format has two spaces between numbers; it is easier to read, but a bit more bulky. When the data are stored in Grimm’s TILIA file structure, it uses 48,100 bytes, partly because Grimm builds in space for additional information about the pollen taxa. Stored in Borland’s PARADOX Database results in three files which total over 100,000 bytes. And if the data are saved as a Borland QUATTRO PRO spreadsheet, it is 126,500 bytes long. The various files can differ markedly in size; powerful programs pay a certain size penalty in the overhead it takes to provide that power. File size is probably less important than it was only a few years ago; disk storage is relatively cheap. In addition, there are some very excellent, inexpensive compression utilities (i.e. PKZIP and PKUNZIP) that can drastically shrink the size of stored files, as is shown in the lower part of Fig. 1. For the immediate future, data files can be exchanged in the mail or in person using the standard 5% and 3%inch floppy magnetic disks. For international distribution the low-density (360 Kb 5?&inch and 720 Kb 3%-inch) disks are compatible with the widest range of equipment. The higher-density (1.2 Mb %-inch and 1.4 Mb 3%inch) disks transport data more efficiently and can be used when both sender and receiver agree to their use. Computers equipped with inexpensive modems can exchange limited data sets by the commercial telephone network. When the files are large, disks sent by mail are much more cost effective. CELIA/LIGA members would be well advised to make a concerted effort to use the various governmental e-mail networks (Macdonald, 1987; Myers, 1990) for both the international exchange of data and ideas. Academic users are probably most familiar with BITNET. This system is connected with EARN (European Research Network), BITNET North (Canada), and many others. BITNET files are routed serially among nodes set up at various sites around the world. INTERNET serves as a local area network between research and academic institutions in the United States; participants with the correct identification can actually examine files and run programs from different cities. These various networks are in the process of changing the scientific community in ways not fully comprehended. The system is essentially free for many academic and governmental users. It is incredibly fast, and

Jr

the file comes in digital form that can be recorded on disk when the e-mail is read. Thus the receiver can store the data directly on his/her own equipment, and it is in a form that can be manipulated at will. Contrast this with a page of figures transmitted by FAX; the data are very difficult to process further. The ‘folded’ structure (ten items per row followed by a carriage return) of my ‘Wisconsin’ .RAW and .DAT files was in part devised for use with e-mail. They always fit into a normal-width screen, and a standard text editor can be used to recover a usable data file. While ASCII files wider than 80 characters will easily travel by e-mail, there is always the possibility that extra-long lines may become fragmented during the process, and the recipient of a ‘trashed’ file may spend hours trying to find why it does not work. Much of the exchange of data among the CELIA/ LIGA members and the Central Database sites can be done by simply e-mailing the material. The digital file can be read as ASCII text using a word processor, and when its routing address and comments are deleted, the data file can be saved as ASCII text, translated as necessary, and read into a suitable program for further study. It is also possible to send compiled programs or binary data files over networks other than INTERNET by using e-mail. This requires a bit more experience on the part of both parties to the transmission, and I do not recommend it to be used for most data exchange. The problem with trying to send binary files by e-mail is that such files contain numbers that e-mail removes and uses for control purposes; aside from the wild effects this has on the system, the files that reach the other party will be missing information or will not work, or both. One can get around this problem when using a unix system for e-mail - and most do - but it requires more effort at each end. Unix has ‘uuencode’, a program that can code a binary file into a form of ASCII which will pass through the e-mail system. When it is delivered, the encoded version is processed with ‘uudecode’, and it is converted back to its original binary form. The size of the encoded form is considerably larger than the original binary file, and this can result in a file that exceeds the permissible length of some mail systems. If the mail is refused, one can clip the encoded ASCII form into two or more pieces and send them separately as is suggested in Table 2. The receiver then must reassemble the pieces, and most scientists will need some help from those who maintain their unix systems. An individual can enjoy many of the computational advantages of a large research center by utilizing readily available commercial programs (word processors and spreadsheets) and tieing them into specialized programs through simple translation utilities. Version 1.15 of my POLFILE program can read and write my .RAW file format which Grimm’s (1990b) TILIA program recognizes as a ‘Wisconsin’ File. As noted above, TILIA’s binary files would require special handling on e-mail; the .RAW file simply travels as

Databases for CELIA/LIGA

247

Members

TABLE 2. An illustration how a binary data file or an executable program file can be coded as an ASCII file with Unix’s ‘uuencode’, edited into smaller pieces, and mailed individually. The received pieces are edited with ‘vi’ omitting everything outside the cut-lines. The saved files are concatenated together with ‘cat’ and then converted back to the original binary file with ‘uudecode’. The example file is drastically shortened

(A file received by email would contain several lines of address and routing information, and it might contain a normal message with detailed instructions for reassembling the binary file.) Polfile.uue Part 1 of 3 Parts This is an IBM PC executable uuencoded. To reassemble cat then run The resultant ,,L-,,,-,,-,-,------

this file, uudecode. file

is

cut

file

that

the

pieces

has

and

been

assemble

them

using

an IBM PC executable file. CUT BBBB ---_---_--_____------------_-_

begin 644 polfile.uue M33,,B\?(\\\\@\.X’ \lIlX\\~\\o\(,
PS’!E(!

“#‘!EO!<$&X”lON%8!4)K&”

\,

MQP9: ‘O”QP9<‘O”QP9>‘O”~P9@‘O”QP9B’O’~QP9D’O”QP9F’O”Q~9H M’O”QP9J’O”QP9L’O”QP9N’O”QP9P’O”QP9R’O”QP9T’>! ‘QP9V’O” MQP9X’
[major portion of file deleted] M’-@2”xA$@“ZA(“/,2”#\$@“!1,

M=Q ,

“!P3”‘E$P”+A,

““X3”!“$P”

“),~“#A$p”“!~“!p4”~~~“‘>~~‘~((~“~~~~”’~Q~”~~~”~~~

M%“‘s!o”-H4”#^%“‘~~A4”‘T5”‘>%O”+Q4”~~\5”!%%0”3A4”%<5

M”!@%0”<14”( (5”“;%O”NA4”, \5”#H%O”]A4”“O6”‘I%@“,A8’ !8”‘$6”!~%@“=!8”~06”1l-%@“Cl8”)X6”11~%@” M’$D6”!?%@“: ML!E”+D6”#/%@“Y!E”/T6”‘+%P”‘l<““(7”’K%P”OA<”$47”!% M%P”ll<“$@7’‘!-%P”3l<“%D7”! :%P”9A<“&87”“*%P”F!<“)L7 M”“>%P”KQ<“, (7”#.%P”Tl<“-$7”#4%P”ZA<“.\7\\XX%P”_l<’ -------------------CUT HEHE --------------------------____

end

of

Part

1 of

3 Parts

ASCII text. Chumbley’s (1991) PALYPLOT uses POLFILE’s data file format (.DAT), and this provides a route between TILIA and PALYPLOT. POLFILE can convert either file structure so that they can be directly imported as ASCII text into Lotus l-2-3 or QUAITRO PRO. Some of the proprietary statistics packages should accept output from these ubiquitous spreadsheets. POLFILE is a useful bridge between

some very powerful programs, and its ASCII format allows one to exchange data by disk, or by e-mail. THE CELIA/LIGA

DATABASE AND THE FUTURE

The first task in studying the Last Interglacial in the Arctic and Sub-Arctic is getting and storing useful information. But concomitantly, the CELIA/LIGA

248

L.J.

Maher.

members need to work out agreements of a more delicate nature, to insure that the rightful ‘ownership’ of the original data is observed and credit is given when it is used by others. There will have to be procedures for handling disputes if (when) they arrive. Some among us will demand that - to be truly useful - the material must be compiled into a large ‘relational’ database, whereby a single query will couple the mollusk diversity index in Denmark with, say, the inclination of the earth’s axis to produce the trajectory of the jetstream in western Canada during the Last Interglacial. Perhaps that will come some day, but the files from our anticipated field parties will be raw numbers. Later these may be associated with agreed ages and put together in percentages of arboreal taxa based on an agreed pollen sum, and the like. Until that is done probably by a committee of experts - the relational aspect of the database is moot. At the beginning of the enterprise we must find ways for sharing our problems and solutions. Timely meetings are one way to do this. Increased use of e-mail between colleagues is another. In addition we all should try to maintain a friendly liaison with those of our colleagues who are setting up regional databases dealing with modern and Holocene materials; their successes and failures are likely to be instructive. REFERENCES Anonymous (1990a). Paradox: The SQL with many improvements, short takes. Byte, lS/lO, 132. Anonymous (1990b). Quattro Pro 2.0, The 1990 Byte Awards. Byte, 16/l, 150. Anonymous (1991). American pollen database. The paleoclimate data record, newsletter for NOAA’s program in paleoclimatology, national geophysical data center, NOAA, 325 Broadway, E/GC. Boulder, CO 80303 U.S.A, 2, no. 1, 4. Chumbley, C.A. (1991). Palyplot: A PC-based program for plotting pollen and plant macrofossil stratigraphic data. INQUA Commission for the Study of the Holocene, Working Group on Data-Handling Methods, Newsletter, 5, 2-4. Grimm, E.C. (1990a). The North American pollen database. INQCJA-Commission for the Study of the Holocene, Working Group on Data-Handling Methods, Newsletter, 4, 4-5. Grimm, E.C. (1990b) TILIA and TILIA-GRAPH: PC Spreadsheets and graphics software for pollen data. INQUA - Commission for the Study of the Holocene, Working Group on Data-Handling Methods, Newsletter, 4, 5-7. Maher, L.J., Jr. (1990). Programs useful in the pollen lab. ZNQCJA - Commission for the Study of the Holocene, Working Group on Data-Handling Methods, Newsletter, 4, 7-10. Macdonald, P. (1987). An introduction to e-mail. Bulletin of the Institute of Mathematical Statistics, 1615, 258259. Myers, D.E. (1990). Electronic mail and networks. Geotimes, 35/l, 17-18. Reinhardt, A. (1990). A new status quo for Quattro. Byte, 15/12, 1.56157. Walanus, A. (1989). Saving computer memory in storing tables of pollen counts. Pollen et Spores, XxX1/1-2, 161-164.

SOURCE AND COMMENTS ON RECOMMENDED PROGRAMS Generic CADD Level3 or CADD 5.0, Available from Generic Software, Inc., 11911 North Creek

JI

Parkway South, Bothell, WA 98011 U.S.A. The program may be obtained at reduced prices of from $225 to $250 from mail-order sources. Although PALYPLOT uses this program for making pollen- or macrofossil diagrams, Generic CADD would be useful whenever stratigraphic data needs to be created, stored, and plotted. PALYPLOT. Available from Craig A. Chumbley, New York State Museum, Biological Survey, Albany, NY 12230 U.S.A. for $5 to cover cost of disk and mailing. From a raw data file, PALYPLOT produces a file which can be imported into Generic CADD for the production of a publication-quality pollen diagram. PARADOX. Available from Borland International, 1800 Green Hills Road, Scotts Valley, CA 95067 U.S.A. This is a top-ranked database program. Borland has been offering it to users of their other products for about $150. PKZIP and PKUNZIP. Available from PKWARE, Inc., 9025 N Deerwood Drive, Brown Deer, WI 53223 U.S.A. Version 1.1 is a ‘shareware’ program available for the downloading from many electronic bulletin boards. The package comes in a ‘self-unzipping’ form called PKZ lOO.EXE. A user is asked to send PKWare, Inc. $25; $47 gets you the next edition as well. File compression and expansion is fast and efficient; see Fig. 1 of this paper. POLFILE. Available from Louis J. Maher, Jr. Department of Geology and Geophysics, University of Wisconsin, Madison, WI 53706 U.S.A. POLFILE is a utility for making and editing Wisconsin RAW files of pollen data arrays. The program also has provision for producing DATA files for plotting reconnaissance style pollen diagrams on the screen. Grimm’s TILIA imports and exports the Wisconsin RAW file format. The program is free (Maher, 1990). QUATTRO PRO. Available from Borland International, 1800 Green Hills Road, Scotts Valley, CA 95067 U.S.A. This program has been available recently for $99.95 to owners of rival spreadsheets. QUATTRO PRO is a top-ranked spreadsheet. It is being sold as having more ‘3D’ graphics, but that is of doubtful value in science . . or even business. Versions 2 and 3 read the new releases of LOTUS l-2-3. TILIA and TILIA-GRAPH. Available from Eric C. Grimm, Illinois State Museum, Research and Collections Center, 1920 South 10% Street, Springfield, IL 62703, U.S.A. Grimm has written a very useful utility program that has its own spreadsheet, allows statistical operations on arrays of data, produces dendrograms, and imports and exports files from a variety of pollen laboratories. TILIA is available from Grimm for $5 to cover mailing of the disk. TILIA-GRAPH, which creates a publication-quality pollen diagram, costs $200, and this includes two proprietary GSS device drivers (for your screen and your printer); additional drivers are $25 each.