Computers & Geosciences Vol. 12, No. 4B, pp. 619-620, 1986 Printed in Great Britain. All rights reserved
0098-3004/86 $3.00 + 0.00 Copyright © 1986 Pergamon Journals Ltd
COMPUTERIZATION OF BIOSTRATIGRAPHIC DATA COLLECTION AND ANALYSIS
B. T. WELLS
Robertson Research International Ltd, "TYN-Y-COED", Llanrhos, Llandudno, Gwynedd LL30 1SA, Wales
(Received 1 November 1985)
The volume of data involved suggests biostratigraphy as an obvious candidate for investigation by data processing (d.p.) departments. In fact it has lagged behind, for example, core analysis and geochemistry; for these there even exist pressure groups for standardization, a sure sign that progress has gone beyond the stage where standards should, or could, have been introduced. A field of 150 wells may give rise to 3 × 10^6 observations on taxa, where an observation is a (perhaps codified) numeric count of a taxon at a given depth. If nonoccurrences (which are significant) are included, the amount of data is larger still. It seems certain that not all significant relationships will be noticed.

Microcomputer-based systems exist (e.g. CHECKLIST) but larger processors are required if true multiwell work is to be undertaken. Single-well information is not to be dismissed lightly. The types of closure diagrams which computers have been producing are expensive if drawn by hand. They are, however, at least possible without a computer. A principal component analysis across many wells, for instance, is out of the question by hand. And yet it is only recently that, in Europe at least, computerization has been implemented. AGIP and BRITOIL certainly have systems in operation. Only one company, however, has expressed an interest in our open offer to supply data on magnetic tape, in addition to our usual services.

The reasons for this backwardness must lie mainly in taxonomic classification. Any viable system must be based on a codification of taxon names. Within any one group of biostratigraphers this could lead to a database of the order of 10,000 names. Between groups, differences could be as high as 10%, accounted for by different names given to the same taxon and by differing lists of taxa. The group, of course, would wish to add taxa to its database continuously. Are these really good reasons for the apparent reluctance to computerize?
The storage requirements indicated so far would be only 2 Mb for the database and less than 10 Mb for that 150-well field; this is well within microcomputer territory since the introduction of Winchester discs, and no problem for minis. The fluid database is not beyond the capabilities of a good ISAM system, let alone a relational database; some programs utilize binary tree structures. The whole problem therefore is solved. Why, then, is it within the bounds of possibility to suggest an implementable standard for transfer of computerized biostratigraphic data?

Competing requirements for a computerized routine biostratigraphic data analysis system include: fast data entry by depth key; editing of entries by depth or taxon key; retrieval of all records relating to a well, sorted either by depth or by taxa; creation of views spanning wells, usually for access by background (plotting) programs.

Internal storage of taxa must be by code; an analysis option is the extent to which these codes are available to the user. Ideally they should be fully transparent, but in practice users will prefer using a code to typing a full name, or to waiting for a search on an incomplete mask. Codes, therefore, should be meaningful if possible. An obvious possibility is for low numbers to be the most used taxa. Alphabetic order is not possible because this precludes list expansion. The list cannot be regarded as static although it will have low volatility; on-line updating is not necessary.

In any event, at least two access methods are essential: directly, by code, and by search. The simplest realization of the first goal is for the code to be the retrieval key in the preferred storage system. For the second, an index giving an alphabetically sorted view minimizes storage, whilst a copy of the file with pointers to the main list optimizes buffered disc access. Given the size of the files involved (typically 0.2 Mb) the latter option is preferred. Additions are made to the main list, on line, with the sorted copy being updated by a process scheduled without wait. Data stored in this way is transportable only in conjunction with the list, and is not useful in this form. Internal representation is implementation-dependent, precluding comparison across sites' data sets.
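The two access methods for the taxon list can be illustrated with a minimal Python sketch. It is not the system described here: the class name `TaxonList`, its methods, the sequential assignment of codes, and the sample names are all assumptions made for illustration. For brevity the sorted copy is updated immediately rather than by a separately scheduled process.

```python
import bisect


class TaxonList:
    """Hypothetical master taxon list with direct (code) and search access."""

    def __init__(self):
        self._by_code = {}   # main list: code -> name, the direct retrieval key
        self._sorted = []    # sorted copy: (lower-cased name, code) pointers
        self._next_code = 1  # assumption: codes are issued sequentially

    def add(self, name):
        """Append a new taxon; existing codes never change, so the list can grow."""
        code = self._next_code
        self._next_code += 1
        self._by_code[code] = name
        # The text schedules this update as a separate process; done inline here.
        bisect.insort(self._sorted, (name.lower(), code))
        return code

    def name_of(self, code):
        """Direct access by code."""
        return self._by_code[code]

    def search(self, prefix):
        """Access by search on an incomplete name, via the sorted copy."""
        key = prefix.lower()
        start = bisect.bisect_left(self._sorted, (key,))
        matches = []
        for name, code in self._sorted[start:]:
            if not name.startswith(key):
                break
            matches.append((code, self._by_code[code]))
        return matches


taxa = TaxonList()
taxa.add("Globigerina bulloides")
taxa.add("Globorotalia inflata")
taxa.add("Ammonia beccarii")
print(taxa.name_of(2))
print(taxa.search("glob"))
```

Because new names are only ever appended to the main list, observations already stored against a code stay valid as the list expands, which is the reason given above for rejecting alphabetic code order.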
If the requirements are analyzed from the standpoint of a researcher, instead of a routine analyst, the specification becomes more daunting. Because the presence/absence of a taxon is influenced by chronostratigraphy, environment of deposition, and facies, these all may be regarded as keys; for example, if a taxon is absent, both its degree of facies independence
and that of present species must be considered, as well as the absence of species which are markers for the current age and environment. These competing requirements cannot be regarded as too daunting, because the potential gains to both routine biostratigrapher and researcher are so great that degradation of performance on a given function may be overlooked.

Our solution has been based on dual-key ISAMs. Data entry by coded name or number has been bypassed by using Hewlett-Packard touch-sensitive microcomputers (HP150s) as data-entry terminals to the central processor, an HP 1000-series A900. Digitizers and character recognition tablets were evaluated initially, but the ability to build a sublist dynamically, extracted from the master list of taxa, was considered essential. Default lists for Micropalaeontology, Palynology, and Nannofossil work are defined by geographic area and age. Up to 250 names can be accommodated comfortably, all accessible instantly by pointing a finger at a screen. Programmable keys are used for abundance input. Keystrokes thus are reduced to a minimum for the majority of data input (typically 95%+ of logging). For the exceptional situations, typically caved or reworked specimens, data input is no more time-consuming than other computer-aided options and less so than manual methods. Footprint also is at a minimum, with the input medium being a single interactive input and display device; microscope + VDU + digitizer was judged to be unacceptable clutter for the geologist.

For maximum data entry speed, buffering is by whole samples; indexing functions can take place whilst the biostratigrapher changes slides. Locking is only by depth, which causes some problems with editing by taxa although none with retrieval. Thus only correction of a misclassification throughout a well is likely to result in degradation of response. Production of routine biostratigraphic summary charts ("checklists") gives this format the most problems.
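The dual-key idea can be sketched in Python as follows. This is an in-memory stand-in using dictionaries, not an ISAM file and not the system described here; the class `WellObservations`, its method names, and the sample depths and counts are assumptions for illustration only.

```python
from collections import defaultdict


class WellObservations:
    """Hypothetical store for one well, indexed on both depth and taxon code."""

    def __init__(self):
        self._records = []                   # (depth, taxon_code, count)
        self._by_depth = defaultdict(list)   # depth -> positions in _records
        self._by_taxon = defaultdict(list)   # taxon code -> positions in _records

    def add_sample(self, depth, counts):
        """Enter a whole sample at once, then update both indexes."""
        for taxon_code, count in counts.items():
            position = len(self._records)
            self._records.append((depth, taxon_code, count))
            self._by_depth[depth].append(position)
            self._by_taxon[taxon_code].append(position)

    def at_depth(self, depth):
        """Retrieve on the depth key, in entry order."""
        return [self._records[i] for i in self._by_depth.get(depth, [])]

    def for_taxon(self, taxon_code):
        """Retrieve on the taxon key, sorted by depth."""
        return sorted(self._records[i] for i in self._by_taxon.get(taxon_code, []))


well = WellObservations()
well.add_sample(1000.0, {1: 5, 2: 0})   # a zero count records a nonoccurrence
well.add_sample(1010.0, {1: 2, 3: 7})
print(well.at_depth(1000.0))
print(well.for_taxon(1))
```

Entering a sample touches a single depth key but many taxon keys, which suggests why locking by depth suits data entry while a correction applied to one taxon throughout a well is the expensive case.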
Range diagrams retrieve on a key of taxon, summary/composite logs on a key of depth; checklists require depth retrieval, but sorted by first or last occurrence, or by almost any taxon attribute. Going back to our 150-well example, an output file of incoming and outgoing attributes, for contouring, is computed in minutes.

The problem of data transfer is solved incidentally. As with any database application where the internal format is transparent, the external appearance of the data is at the user's discretion; this can apply to magtape input and output. If the taxon names list is fluid, new names from a different site may be added in the same way as newly classified taxa.

The number of types of analysis available to determine biostratigraphic attributes and value is large. Whilst the time series analyses of the sedimentologist are not of use, the whole of the burgeoning discriminatory/clustering field of multivariate statistics is at the disposal of the researcher. The problem is in deciding their relative merits and processing the answers. Whilst the amount of data which goes in is limited by the biostratigrapher's logging speed, the amount which comes out is limited only by his imagination.

If the actual count at a sample depth is considered as a single realization of a known distribution (Jasko, 1984; Blank and Ellis, 1982) then the ability of a human mind to process the information (presence/absence data of marker taxa and the relative ranges of species with respect to chronostratigraphy, environment, facies, etc.) must be questioned. Once an initial zonation is established, and the environmental and facies dependence of the markers is known, the determination of chronostratigraphic boundaries is routine. Inference is required with respect to caved and reworked specimens, and in determining the relative merits of contradictory marker indications. This is just the sort of problem at which an inference engine excels. Do we see a limit to the life of the routine biostratigrapher? Will his range be an important marker for the comings of the petroleum age (incoming) and the 5th generation of computers (outgoing)?

REFERENCES
Blank, R. G., and Ellis, C. H., 1982, The probable range concept applied to the biostratigraphy of marine microfossils: Jour. Geology, v. 90, no. 4, p. 415-433.
Jasko, T., 1984, The first find: estimation of the precision of range zone boundaries: Computers & Geosciences, v. 10, no. 1, p. 133-136.