Earth-Science Reviews, 9 (1973) 159-196 © Elsevier Scientific Publishing Company, Amsterdam - Printed in The Netherlands
A New Geological Tool - The Data

A. Hubaux ¹

¹ CETIS, European Joint Research Center, Ispra (Va.), Italy. Former Cogeodata Research Secretary.

ABSTRACT

Hubaux, A., 1973. A new geological tool - the data. Earth-Sci. Rev., 9: 159-196.

Today data processing technology offers new and attractive possibilities for creating and exchanging geological data. This technique will be of profit to policy makers and economists establishing national and worldwide reserves; to developing countries comparing their resources potentials with well-studied areas of like geological setting; to geological surveys, computerizing their archives; to explorationists making interregional and intercontinental comparisons; to conservationists keeping track of ephemeral phenomena; to environmentalists making interdisciplinary studies, and to geomathematicians. Much remains to be done, however, in this area where many enthusiastic starts have turned short and have made apparent the necessity of more coordinated efforts. From the information he could gather as Cogeodata research secretary, the author reviews the present state of the art in the subdisciplines of geology, and attempts to mention all the main aspects of geological data file building and interchange.

INTRODUCTION

"As I see it, at least part of the conflict amounts to a philosophic judgement whether Science is the search for new knowledge or the organizer of existing knowledge." (A.M. Weinberg, Reflections on Big Science; Pergamon Press, 1967).

The forgotten data

Without observations and measurements, without data, there would be no geology. And yet these humble servants of the science have received remarkably little attention. Only a small part of these data are being published. The bulk remain in notebooks and are, for all practical purposes, lost to the geological community. In geology, it is easier to find a publisher for the most irrelevant ideas than it is to publish a table of good data. The two main systems of communication, papers and maps, emphasize syntheses rather than details. Hence, geological communication is loaded with personal bias and genetic hypotheses. Too often geological archives are still regarded as dust-accumulating piles of paper and, in most instances, but a small fraction of the potential information they contain is
retrieved and processed. The technological tools to make proper use of this reservoir of data have existed for some years, but the correct methodology to tap it is emerging only now. Amid the abundant geological literature, there is not a single journal which devotes more than an occasional article to the problems of geological data storage and retrieval. Yet important problems exist and should require more attention, as this paper will attempt to show. The benefits which would accrue from this improvement in data communication would be numerous and of far-reaching consequences. One of the points, regularly brought forward in the studies of human environment, which is attracting much attention today, is that reserves of most commodities are finite. To establish worldwide inventories of these reserves is an urgent task, if mankind is to avoid self-destruction. There is an increasing pressure on geologists by governments and international organizations to apply their science to improve our knowledge of natural resources potential. The usefulness of such inventories is directly related to the degree of compatibility between the data on which these inventories are based. Yet conventional methodology is completely inadequate to meet this challenge. The very future of the geological profession may well depend on its capability to set a new policy for the presentation to proper authorities of the basic geological data to estimate the reserves of commodities. By proper standardization, intercontinental comparisons made on a factual basis could be achieved. Not only would this have important scientific impacts, it would as well have economic significance. For instance, comparisons of newly explored regions with wellknown mineralized regions in other parts of the world will materially facilitate the discovery of new mineral deposits, an approach which may be of great help to developing countries. 
New international research programs such as the International Geological Correlation Program and the Geodynamics Project are being launched under the impetus of the International Union of Geological Sciences (IUGS - see section on "The International Scene", p. 186). Such interregional and intercontinental studies imply the exchange, not only of syntheses, but also of the underlying basic observations, on a scale which will eventually be several orders of magnitude greater than it is today. More consistent data would also tend to be less subjective and would enhance the image of geology as a science, by facilitating communication with scientists and technicians who occasionally have to use these data, such as policy makers, mining engineers, geophysicists, geographers, urbanists, etc. Industry is spending increasing sums on the creation and updating of archives on exploration data by methods which have been largely empirical. The benefits of a more methodical approach would yield important consequences which are but dimly seen at present. Indeed, the technique to properly use the geological data accumulated within one organization, important as it is for the organization itself, has aspects which are of interest to the geological community. A workable and consistent methodology would be of profit to all those geologists who are now attempting to create useful data files, especially computer-based data files of a more or less permanent nature.
Of the next ten mineral deposits of economic significance which will be developed within the coming years, at least six or seven are already mentioned in the files of some mining company. The geologists of the mineral companies to whom the author was able to pay a visit agree on this prediction. Also, recording data about transient phenomena could be of incalculable value for future generations. Volcanic eruptions, water-filled old mining deposits, stratigraphic sections in underground works and numerous hydrogeological data would be in that category. Finally, the more reproducible the data, the more amenable they will be to mathematical treatment. Any mathematically derived results are only as good as the input data. Quantification and exchangeability of data present essentially the same problems, and yet this important aspect of geomathematics, the improvement of geological data quality, has been almost neglected, for reasons the author will try to outline in this paper (see "Needed studies", p. 188). Hence, the development of better tools for description would be of profit to economists establishing national and worldwide reserves; to developing countries; to geologists creating computer-based data files; to explorationists and searchers making interregional and intercontinental comparisons; to conservationists archiving ephemeral phenomena; to environmentalists making interdisciplinary studies, and to geomathematicians. What is needed, however, is not simply a new technique, but a new and more rigorous descriptor language. The present survey will serve its purpose if it can help to outline the usefulness of the endeavour and the magnitude of the problems to be solved. 
After explaining briefly in the next chapter what geological data are and outlining the background information on which this paper is based, examples are given of current activity in data storage and retrieval on the following topics: oil wells; mineral deposits; field observations; environmental data; hydrogeology; museum data; geochemistry; paleontology; orogenic studies; and indexes on documents. A few general aspects, including software problems, geological standards, the availability of data and international activities are then presented. Finally, a tentative list of problems to be solved and practical guidelines for the building of a geological data file conclude this essay.
What are geological data files?

Before reviewing the state of the art in the subdisciplines of geology, some precision ought to be given about what is intended by geological data and by data files. Data are defined in "A national system for storage and retrieval of geological data in Canada" (Brisbin and Ediger, 1967) as "observations or measurements reproducible within the limits set by the observer." As the present paper is addressed to geologists, the geological concepts which ought to be the basis for the storage of more consistent data will be the focus of attention. A data file may be conceived as a collection of measurements and/or observations pertaining to a given set of properties or characters about a series of geological objects of the same kind. These objects, which serve as basic units to the file, may be, e.g., oil wells,
mineral deposits, fossils, outcrops, etc. This notion of the basic unit will be clearer when the types of data files currently being used are reviewed. Let us only note here the close relationship between these basic units and the resulting data file. For the description of a given region, a file on outcrops will not be the same as a file on rock samples. The proper choice of the basic unit is just one of the difficulties of geological data file building.

The problem of maintaining a geological data file of lasting usefulness is closely linked to the exchange of geological data, and it must be pointed out that there is still some skepticism as to the value of using other people's geological data. One extreme view was expressed to me by a geologist who is also a geomathematician: "I have no need for geological data collected by other geologists. It would be nice if these data could be used, but they are so heavily loaded with personal interpretations and subjective hypotheses that they are practically useless. All I need are theories. While data are accumulating in staggering amounts, theories may be expressed in a few simple and synthetic sentences and what I need is only to test whether a theory applies to my field of study or not."

While this extreme opinion is probably not shared by the majority of the geological community, it must be recognized that it contains an element of truth. Geological data are certainly not easy to communicate. Very few of these data may be considered as completely objective; nearly all of them contain some degree of genetic hypothesis. Nevertheless, better means of communication are an essential part of any scientific progress. Indeed, the sciences which have progressed more rapidly are the so-called "exact sciences", where many concepts have been quantified and may therefore not only be precisely measured, but also transmitted easily. The difficulty inherent in the proper building of a geological data file has often been underestimated.
There was, 10 or 15 years ago, a great deal of enthusiasm for creating data files in a variety of fields. The computer in those days was a relatively new tool and, as it could most impressively solve in the nick of time some kinds of very difficult problems, there was the feeling that it could eventually solve all kinds of scientific problems. But disenchantment after some negative experiences led some geologists to hastily conclude that computers and geology had better be kept separated. Geological data, for some mysterious reasons, would somehow not fit within the strictly ruled realm of the computer. Also, geologists supposedly were of such individualistic nature that they would never comply to normalization or standardization. So the very nature of the science and the psychology of the scientists have both been evoked to testify to the failure of most early attempts to build computer-based data files. Continuous use of the computer has thereafter been restricted principally to workers within the oil industry, to commercial companies selling information to that industry and to some faithful but isolated enthusiasts. Many oil companies also realized that great benefits could be expected from an exchange of data between companies and that complete confidentiality was not beneficial in the long run. This of course had already been realized before the advent of computers, but undoubtedly, with the computer, the exchange of data has been further enhanced. Another factor playing a role here is the legal obligation to deposit the data in some kind of State repository. The legal situation in this
respect varies very much from country to country and even somewhat from province to province or state to state in the same country. Clearly however, the legal dispositions in Canada have had conspicuous beneficial effects. In Alberta, for instance, certain engineering and geological data must be given to the Energy Resources Conservation Board within a month. These data are released within one year at maximum. It is partly due to this law that Canadian authorities have been able to take a series of actions to facilitate the exchange of geological data, at the request of the oil and the mineral industries. As a result of these actions, among others, the Canadian Centre for Geosciences Data was created in 1970.

Also, in view of the results obtained on a national scale, one of the initiators of the Canadian actions approached the International Union of Geological Sciences (IUGS) in 1967. Dr. S.C. Robinson of the Geological Survey of Canada (Ottawa) was appointed by that Union to form the "Committee on Storage, Automatic Processing and Retrieval of Geological Data", which he chaired from then until the 24th International Geological Congress at Montreal. The aims of the committee (later called COGEODATA for short) are to facilitate access to available geological data and to promote compatibility between these data on an international scale. During a meeting of this committee in 1970, it was realized that the best move was to send an official IUGS and Cogeodata representative to oil and mining companies and to geological surveys in order to: (a) get first-hand information on the expertise gained on geological data storage and retrieval problems; and (b) inform these organizations of the existence of Cogeodata and study the services which this committee could give to the geological community (International Union of Geological Sciences, 1970). (The list of these organizations is given in the Appendix.)
The present article relates the main results not only of this inquiry, but also of the information stemming from the contacts which the author has had occasion to develop through his activities as Cogeodata's Research Secretary. These activities include visits to geological organizations in various countries of Western Europe and some of Eastern Europe. Also, the completion of an inquiry on existing geological data files, through a questionnaire followed by an exchange of correspondence, has brought valuable information, the essence of which is included here. (The list resulting from this inquiry will be published in Codata Bulletin No. 8.) Also, conversations on various occasions with data specialists within the Geological Survey of Canada (G.S.C.), especially with Dr. S.C. Robinson and Dr. C.F. Burk Jr., and discussions with Professor P. Sutterlin at the University of Western Ontario have made possible the use of at least a part of the experience acquired in that country. Finally, the author has had the opportunity to hold several discussions with researchers at the Ecole Supérieure des Mines (Paris) and the C.R.P.G. (Nancy), which are among the few institutes where theoretical research is done on the subject. The contribution of all these specialists is gratefully acknowledged. The manuscript has also been reviewed by many of the geologists who contributed to the input of information. The careful correcting and editing work of Drs. C.F. Burk Jr., E.K.L. Gunn and G.D. William (G.S.C.) and of Professor P. Sutterlin (University of Western Ontario)
must be especially mentioned. The author alone, however, is responsible for the bias and inadequacies which such a broad encompassing survey unavoidably contains. The paper is an attempt at presenting an overall view in a rapidly developing field of geological activity. Although certainly not complete, it hopefully gives a general picture at a time when this is still possible. Data files will soon be mushrooming at such a rate that a survey of this kind will be much more difficult to make.

SURVEY OF CURRENT DATA FILES

Oil and gas wells
The conditions in the oil industry with respect to the use of computers for geological problems have been different from any other geological environment. The oil companies were, right at the beginning of the computer era, large enough and sufficiently centralized to possess in-house computer installations, which they needed anyway to solve problems outside the scope of geology. The computer proved to be a handy tool to process geophysical data, so that more and more logs were digitized. These companies were accumulating data of various kinds at an ever-increasing rate and realized more and more clearly the usefulness of these data. The number of geologists using these data was growing, and some kind of standardization was necessary anyway. Furthermore, the individual oil or gas well formed a logical unit of description which conspicuously helped to rationalize the ordering of the geological descriptions. Finally, oil companies had reached the critical dimensions where a fully-fledged staff of computer specialists may be maintained in house. Their geologists have had, then, the help of programmers and systems analysts who could be trained to use the geological language. Of all the abovementioned factors, this latter situation, almost unique in geology, is certainly the most important. Most, if not all, oil companies have stored a significant amount of their well data on computer processable mediums. The wells thus described amount to several tens of thousands for a big company. The most common kind of data usually stored is the engineering data about these wells. The geological data usually include the digitized logs, the formation tops and sometimes the lithology and the paleontology. The variety of data which are contained in a well data file is exemplified by the following list established for the file of the Standard Oil Company of California. 
The heading zone contains general data about the wells and indication as to the presence or absence and availability of various kinds of data. It appears once for each well. Data and information of the interval zones are repeated for each interval within the well. Of course this set of data is not complete for every interval.
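The two-zone layout just described can be pictured as a record holding one heading and a repeated series of intervals. The following is only a minimal sketch in Python, not the company's actual file format: the class names, field names, units (metres) and sample values are all hypothetical, and only a handful of items are modelled. Fields that may be missing for a given interval are declared optional, reflecting the remark that the set of data is not complete for every interval.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HeadingZone:
    """General data about a well; stored once per well."""
    company_name: str
    well_number: str
    total_depth_m: float          # unit is an assumption
    has_paleo_data: bool = False  # presence/availability flag


@dataclass
class IntervalZone:
    """Data for one depth interval; repeated for each interval."""
    top_m: float
    base_m: float
    porosity_pct: Optional[float] = None  # None means "not recorded"
    formation_top: Optional[str] = None   # a geological "pick"


@dataclass
class WellRecord:
    heading: HeadingZone
    intervals: List[IntervalZone] = field(default_factory=list)

    def intervals_with(self, attribute: str) -> List[IntervalZone]:
        """Return the intervals for which `attribute` was recorded."""
        return [iv for iv in self.intervals
                if getattr(iv, attribute) is not None]


# Hypothetical example values, for illustration only.
well = WellRecord(
    heading=HeadingZone("Example Oil Co.", "A-1", total_depth_m=2500.0),
    intervals=[
        IntervalZone(1000.0, 1050.0, porosity_pct=12.5),
        IntervalZone(1050.0, 1120.0, formation_top="Marker X"),
    ],
)
print(len(well.intervals_with("porosity_pct")))  # prints 1
```

Keeping the heading separate from the intervals means that general well data are stored once, while each interval carries only what was actually measured or interpreted for it.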
heading zone

company name and former names
lease name and well number
location (administrative unit, Lambert coordinates, narrative)
elevation and water depth
total depth
initial and final classification
platform or drilling vessel name
spud, abandon, completion dates
current well status
completion method
oil field name
logs run and available
presence and availability of: paleontological, palynological, nanopaleontological data, heavy mineral analyses, thin sections, water, oil, gas analyses, permeability and porosity, density, magnetic analyses
bottom hole temperature
formation name and age at total depth
formation name and age of producing zone
initial production rate
oil shows (presence)

interval zone

drillstem test results including pressures and water salinity
production tests other than producing interval
core and side wall samples depths
core dips and dipmeter results
directional survey results
permeability and porosity, density and magnetic data
sample availability and preparation
paleo-age determinations
geologic picks (i.e., markers, formation tops)
mechanical summary
well log system
sand count
sonic and check shot velocity surveys

The absence of lithological information is to be noted. For this file, the lithology is considered as not sufficiently reliable and of limited use only, and has therefore been excluded. Also worth noting is the mixing of purely physical measurements with interpretations such as paleontological age determinations and geological "picks" (i.e., marker
beds, formation tops) for which geological judgment is needed. Keeping together these two kinds of data has not given rise to particular problems. These big data files, stored on several dozens of magnetic tapes, require sophisticated management systems. Often, the management system is entirely written in house, but some companies have made use of commercially available systems such as MARK IV. Conspicuous efforts are devoted at developing sophisticated and flexible retrieval techniques, so as to answer always more complex questions in an increasingly efficient way. Retrieval may for instance be made by: isopach values; number of beds within a given interval; average bed thickness; ratio of some lithological characteristics within given intervals; standard deviation; kurtosis, etc. The well data kept by oil companies come from several sources. In North America they include data generated by the company itself, data from State agencies, data traded with other companies and data purchased from a variety of commercial companies. There is constant and apparently satisfactory dialogue among the three kinds of organizations. The governmental or provincial agencies are not looked at as an impediment but as integral and necessary parts of the system. At Calgary in particular, everyone in the oil industry with whom I had the opportunity to speak is happy with the Energy Resources Conservation Board, with which there exists a close collaboration. Data gathered by the governmental agencies often serve to produce statistics and maps. Thus the U.S. Geological Survey is preparing an oil and gas map of the country, where the units will not be the well, but the oil or gas pools. All the basic data for the map are being put on magnetic tapes. There exist quite a variety of commercial companies which collect and organize information from various sources and which edit these data and sell them to the oil industry. 
Some of these companies specialize in well data and are also developing computer programs for file management and data processing, especially graphical output. One example is Petroleum Information Corporation (Denver) which maintains data files on every well drilled in the U.S.A., including Alaska. Others, like CANSTRAT (Calgary) or AMSTRAT (Denver) make careful and complete descriptions of selected wells, including paleontological and lithological descriptions. These data are most often sold both on magnetic tapes and as hard copies. Some of these companies also prepare a variety of maps from these data for regions which look promising and they sell these maps to the oil industry. Although most companies use the computer to store their data, the degree of computerization varies very much from one company to the other. Some companies have taken as their motto that data computerization must remain an essentially economic process at every stage. This policy has led to a rather careful approach where only sections of critical importance have been digitized. Others have taken a bold approach at an early date. A good example is Imperial Oil Limited of Canada. In an enlightening paper, Stauft (1968) explains the whole history and development of the company's Well Data System which contains the description of over 60,000 wells in Western Canada. After an initial period of several years of trial and error, the system has worked satisfactorily and proved to be useful in a number of ways. More particularly, in two kinds of circumstances the services given by the system have been of critical value. When the
management needs to take important decisions on bids quickly, the computer is used to arrive at the best presentation of data. Here the speed of the tool, which permits several versions and presentations to be obtained almost simultaneously, is a great asset. Also, when geologists discuss conflicting hypotheses, the computer has enabled rapid visualization of both hypotheses, often yielding the result that the differences between two interpretations were not as significant as thought beforehand. A change in company policy, however, has led to the selling of the Well Data System to a commercial company. New interest in the potential of frontier areas such as the Canadian Arctic has somewhat lessened the value of the Well Data System for the company. But the expertise thus acquired in building a complex data management system has proved to be of great value for the new interests of the company.

Within most companies, contacts between the computer geologists on the one hand, who are storing and processing large volumes of geological data, and the management in exploration branches on the other hand, are not always as close as they should be. This is due partly to the psychological fear that the computer infringes on human decisions and also to the effort which is still required to decipher computer output. Enormous progress has been made in these last years, however, on this last topic through the possibility of having completely uncoded output and through a striking development of graphical output which will be briefly reviewed in the section on "Processing programs", p. 181.

To conclude this brief review of a sector of geological activity where data have received more attention than in any other, it may be said that the petroleum industry is comparatively well developed.
The efforts of the oil companies, however, have been more directed to the automatic management of complex systems than to the solving of purely geological problems of data storage and retrieval. If we exclude applied geophysical data, purely geological data are in most cases limited to the identification and elevation of formation tops and marker beds. This information implies some judgment, and a sophisticated system to record the degree of subjectivity of this judgment has not come to the attention of the author. Nevertheless, oil companies' experience will be useful for other geological disciplines. Indeed, with the progress of computer technology, it may be predicted that the use of the computer will be at least as widespread by all kinds of geologists as it is now within the oil industry. The lead of the petroleum companies, however, will persist at least within the foreseeable future. The present trends to create more integrated systems; to enhance the ease of file access and processing; to obtain more consistent and more complete data; and the facility to exchange data will be parallel developments.
Mineral deposits

The storage of geological data is in a less advanced stage in the mineral industry than in the oil industry. A mining geologist has said to me: "We are using less than 5% of the potential information contained in our archives and yet completely new ore deposits are increasingly harder to find; the majority of ore deposits which will be developed and exploited within the next years are contained somewhere in these archives." There is,
however, a growing recognition of the usefulness of these archives, and some efforts are being made by several companies to develop computer-based data files. Several "basic units", i.e., described objects, are used to compile information on mineral deposits. The three units in most common use are: (1) documents; (2) field observations; and (3) ore deposits, where each deposit is considered as an entity. The first category includes reports, property records, maps and publications; the relative files are not basically different from indexes for documents which will be reviewed in the section on "Indexes", p.177. Observations made in the field or in underground works will be examined in the section on "Field observations", p. 170. Files on "ore deposits" may relate to a single mineral commodity, several commodities, a mine, a district, an area summary, etc. Emphasis may be on economic exploration or metallogenic research. All these files, whatever their purposes, are among the most difficult to establish. The principal difficulty is to choose a consistent definition for the basic unit of description. What is an ore deposit? Are two adjoining mines to be considered as separate units or as one unit on the presumption that they are exploiting the same geological unit? The answer will vary from file to file but the conventions should remain constant within one file. There is a real danger in relating the size of the unit of description to the amount of available information. Where information is scarce, several deposits will be lumped together and, vice versa, where many details are available, the units will be split. As a result, comparisons will be made on objects which are not really comparable. Research and definition along the same lines as that presently done in stratigraphy on an international scale, where stratotypes, lithostratigraphic and biostratigraphic units are being distinguished and defined, would be most useful (Van Eysinga, 1970). 
Efforts such as those of Routhier (1969) to bring some order in the nomenclature should be pursued and developed. Yet the files on ore deposits are likely to be among the most useful of all geological files. They will serve to establish, on a factual basis, better estimates of reserves of unrenewable commodities. Information on abandoned and inaccessible mines will be preserved. Also, and most importantly, these files will be the basis for comparison between well-studied mineralized areas and newly explored areas having a comparable geological setting. This comparison will be of considerable help in assessing the potential economic value of the region and in finding better research strategies. This should be of particular usefulness for developing countries. Special efforts of Cogeodata have therefore been devoted to this problem. While oil companies devote some 15-20% of their net income for exploration, only 5% is spent in most mining companies. Also, many of the large mining companies are decentralized. In many instances, adverse economic situations have prompted some companies to curtail experiments done in this field. Some early experiments done in the late 1950's or early 1960's with rather disappointing results have provided arguments against computerizing geological data on a large scale. Also, unlike the oil industry, exploration is done on a discontinuous basis which discourages systematic approaches to data recording. Another reason, which is not often mentioned but which I believe is important, is that
there is not the same logical basic unit of description as in the oil industry where, as we have mentioned, the well efficiently fulfills that purpose. For these reasons there exists no equivalent to the network of commercial companies as in the oil industry. Some companies are marketing machine-processable files on ore deposits, but these files, as far as I know, contain no geological data but only proprietary and economic data. For instance McDonald Consultants Ltd. has an index to documents on 6,000 mineral showings in British Columbia, but geological information is not included.

The U.S. Bureau of Mines has started an interesting file on economically significant ore deposits within the United States and within all countries where the U.S.A. might have an interest. Here the emphasis is on the evaluation of the reserves, not on geology. The file will be organized by commodities and those which are most critical will be treated first. The main characteristic of this file is that the system to record economic data is based on a probabilistic approach. Instead of encoding one figure for the most probable value of the reserves of the commodity within a deposit, the geologist, after studying all available documents, has to file a whole probability matrix. One line of this matrix could be read as "I feel 95% confident that the deposit contains at least x tons of ore at 1.5% or more of metal content, and at least y tons at 1.0% or more, etc." Each line in the matrix corresponds to a confidence level, each column to a minimum content. Such an approach, where the geologist's judgment is quantified in a rational way, so that judgments and estimates by various geologists may be effectively compared, could be applied to a number of situations in geology. But this probabilistic approach, unfortunately, is not yet part of normal geological training, and it has required time for the concept to be properly understood and used in a consistent manner by members of the team.
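The probability matrix described above can be made concrete with a small sketch. What follows is an illustration only and not the Bureau of Mines' actual encoding: the names, the cutoff grades, the confidence levels and every tonnage are hypothetical. Each row is a confidence level and each column a minimum metal content. The coherence check relies on two logical consequences of the description: lowering the minimum content can only add ore, and a less demanding confidence level can only allow a larger "at least" tonnage.

```python
from typing import Dict, List

# Columns: minimum metal content (%), from highest to lowest cutoff.
CUTOFFS_PCT: List[float] = [1.5, 1.0, 0.5]

# Rows: confidence level -> "at least" tonnage for each cutoff.
ReserveMatrix = Dict[float, List[float]]

# Hypothetical estimates by two geologists for the same deposit.
geologist_a: ReserveMatrix = {
    0.95: [1.0e6, 2.5e6, 6.0e6],
    0.50: [2.0e6, 5.0e6, 1.1e7],
    0.05: [4.0e6, 9.0e6, 2.0e7],
}
geologist_b: ReserveMatrix = {
    0.95: [0.8e6, 3.0e6, 5.0e6],
    0.50: [1.5e6, 6.0e6, 9.0e6],
    0.05: [3.0e6, 1.0e7, 1.8e7],
}


def min_tonnage(matrix: ReserveMatrix, confidence: float,
                cutoff_pct: float) -> float:
    """Tonnage believed present, at this confidence, above this cutoff."""
    return matrix[confidence][CUTOFFS_PCT.index(cutoff_pct)]


def is_coherent(matrix: ReserveMatrix) -> bool:
    """Check that tonnage never decreases along a row or down the rows."""
    levels = sorted(matrix, reverse=True)  # e.g. 0.95, 0.50, 0.05
    for row in matrix.values():
        # A lower cutoff includes all ore counted at a higher cutoff.
        if any(a > b for a, b in zip(row, row[1:])):
            return False
    for high, low in zip(levels, levels[1:]):
        # Lower confidence permits an equal or larger tonnage claim.
        if any(a > b for a, b in zip(matrix[high], matrix[low])):
            return False
    return True


def ratio(first: ReserveMatrix, second: ReserveMatrix,
          confidence: float, cutoff_pct: float) -> float:
    """Compare two geologists' estimates for the same matrix cell."""
    return (min_tonnage(second, confidence, cutoff_pct)
            / min_tonnage(first, confidence, cutoff_pct))


print(min_tonnage(geologist_a, 0.95, 1.0))
print(is_coherent(geologist_a), is_coherent(geologist_b))
print(round(ratio(geologist_a, geologist_b, 0.95, 1.0), 2))
```

Because both estimates share the same rows and columns, any cell can be compared directly, which is the sense in which judgments by various geologists become comparable.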
Geological surveys and departments of mines of many countries are devoting efforts to establishing inventories of ore deposits. The Geological Survey of Canada and the U.S.G.S. are jointly working on a metallogenic map of North America, as part of the project of the metallogenic map of the world. Although the map is considered here as the final product, and not the file, a great amount of data has been accumulated, partly on manual and partly on computer media. A similar effort is being made for Europe under the sponsorship of Professor P. Laffitte of the Ecole des Mines de Paris. One of the problems of such maps, which is not inherent to files, is the fact that while some regions are practically devoid of any ore showings, others are so heavily crowded that it is impossible to represent all ore deposits. The idea, however, that the final product could be, not the map, but a magnetic tape from which an infinite variety of specialized maps could be easily derived, is gradually gaining ground. Some academic organizations have undertaken the time-consuming and patient task of gathering data about the geology of ore deposits. For instance, at the Ecole des Mines de Paris, several researchers are gathering data on the geology of deposits of various commodities (Mo, Be, Cu, Mn, Fe, etc.) and others are studying different theoretical problems related to the compiling, processing and synthesizing of these data. At the University of Manitoba (Winnipeg), Laznicka (1972) has begun the gathering of geological and metallogenic data on base, rare and precious metal deposits of the world.
This file is, to the knowledge of the author, unique because it has a truly worldwide coverage and comprises all the important non-ferrous metals. The roughly 4,000 deposits now described, although they represent less than 1% of existing deposits, are estimated to include more than 85% of known reserves. In conclusion, for the various aspects under which ore deposits may be studied (genesis, economic statistics, exploration, etc.) data files with corresponding emphasis are being created. Unfortunately, the collecting of these data is done largely without liaison. To my knowledge, nowhere except in Canada (Burk, 1972b) is there any systematic attempt at rationalizing the efforts, or at least at communicating the results systematically. The work of Cogeodata on this question is reviewed in the section on "The International Scene", p. 186. Many efforts are still needed before ore deposits can be described with a satisfactory degree of consistency. Some examples of research which should enhance this compatibility are mentioned in the section on "Needed studies", p. 188.
Field observations
Observations and measurements made on outcrops during field work are among the examples which are often cited to show the difficulty of automating geological data; and it must be admitted that the obstacles are real indeed. For example, one mineral exploration company has discovered with dismay that a file on outcrops was not giving the expected results, as many of the students doing the field work and filling in the input sheets gave more attention to the sheets than to observing the outcrops; anything which was not already included in the input form was disregarded. The file has therefore been abandoned. However, teams in Canada have now acquired several years of positive experience in recording field observations on input sheets for the computer. In Canada only two or three months are available for mapping.
To arrive therefore at mapping the country at a reasonable speed, numerous groups have to be sent in the field during the summer. For the Manitoba Department of Natural Resources for example, some 10 to 20 parties, each composed of 1 geologist and 1 assistant, are sent each summer into a target region. Thus a total of 20,000-30,000 stations are observed each year. They were thus able to map within two years some 7,000 sq. miles of the Canadian Shield on a scale of 1:50,000 with a mean distance of half a mile between stations. The computer has proved to be a handy tool for the processing of this great amount of data. How can the data gathered by so many geologists be integrated into a coherent whole? This problem of consistency is precisely the most important for all kinds of geological data files. But let us note that it does not stem here from the use of the computer; it is a direct consequence of the number of geologists involved in the project. However, by fulfilling this condition of consistency, the use of the computer becomes possible and indeed useful. Consistency, i.e., recording in the same manner what is actually the same, has been obtained through the years by close liaison between the teams and
constant supervision of the chief geologist. But it has inherent limits. At the above-cited Manitoba Department, experience has shown for instance that the human recording of rock colours is not reliable, even when using a colour chart. Also, to cut down the nomenclature problems, the names of the rocks are not used but instead one sample is collected at each station and extensive use is made of thin sections. However, the space required for the repository of all these samples is becoming somewhat of a problem. The system is still evolving from year to year. The data file is not a separate entity, but forms an integral part of the whole chain of procedures, which consists in bringing back to the headquarters the input sheets and the rock samples, key-punching, processing and obtaining various listings, preparing a draft map directly from the computer, and, during the winter, petrographic studies of the thin sections and preparation of the final edition of the map and of the report. The draft map, automatically drawn by the computer, where every observation is reported in its place (but the geological boundaries are still largely drawn by hand) has proved to be a useful document, which is sold to mining companies. Another example worth mentioning is an advanced project of computerized field mapping devised by the Quebec Department of Natural Resources in highly metamorphosed terrain on the southeastern margin of the Canadian Shield (Wynne-Edwards et al., 1970). Extensive use is made of outcrop input documents which are largely self-explanatory and function as a check-list on which answers to specific questions can be given in key-punchable form, by choosing among a specific range of answers according to agreed standards. The data bank now contains descriptions of about 5,000 outcrops, covering some 30,000 sq. miles described to uniform standards, capable of being processed in many different ways.
Special care has been taken to achieve consistency in field rock descriptions. Besides the name, the observing geologist is required to give a careful description of the physical, mineralogical and structural characteristics of each rock by using a check-list, so that the primary data by which it may be named are permanently recorded. With practice, the outcrop input document can be completed in less than two minutes. An appropriate computer program derives a second name from the observed characteristics. Confidence levels on the name are assigned by the computer on the basis of the amount of supporting information. The lowest confidence level is for a rock name recorded by a geologist with no supporting data. Computer-generated maps are widely used as primary documents for interpretation. The high degree of consistency makes the data readily usable; indeed, the upgrading of the data collection is so important that the use of the system could be justified even without computer processing. Experience is also being acquired in Scandinavia where the GEOMAP system has now been used for three field seasons by about 60 geologists (Berner et al., 1972). Detailed codes are used, with structures and textures illustrated on charts. Information on the system may be obtained by writing to Dr. O. Stephansson, Institute of Geology, Box 555, S-75122 Uppsala, Sweden. Another interesting experiment is being made by Professor Pouba and Dr. Pluskal of Prague for recording the geological observations in gallery fronts of metallic mines. Here
also, the main data will be put on computer. The system of description and the input sheets are at the test stage.
Environmental data
The National Aeronautics and Space Administration (NASA) possesses a series of Earth photographs taken during the Apollo and Gemini expeditions, which are at the disposal of the entire scientific community. NASA also launched, in 1972, a first Earth Resources Technology Satellite. Placed on a polar orbit, it could provide complete coverage of the Earth and transmit imagery at the scale of 1:1,000,000 taken at four different wavelengths: blue, green, red and infrared (0.8 micron). The resolution of the pictures should be around 100 m. The whole imagery will be at the disposal of the scientific community, free of charge, on the condition that its use will be to the benefit of mankind. This venture is interesting in the present context, because it illustrates a probable trend. Data will be collected in increasing amounts and will be more and more often of an interdisciplinary nature. The storage and retrieval of this enormous quantity of data will give rise to problems which are seen but dimly at the present time, much less solved. Geofond, a part of the Geological Survey of Czechoslovakia, has undertaken the creation of a data base where data from various branches of geology are stored in a consistent way. One of its main aims will be to serve as a basis for defining the best policy concerning the country's ground and underground. Part of the data is already on computer. They concern engineering geology (including hazards such as landslides), hydrogeology, location and setting of actual and potential mineral deposits, etc. The point of importance here is that these various aspects of geology are being integrated into a coherent whole.
Hydrogeology
In many countries data on surface- and groundwater are held within the same organization, while in others they are held and processed independently. One example of each will be briefly examined here. In the U.S.A., the Water Resources Division of the U.S. Geological Survey has a staff of 2,900 employees. Data on inland waters are computerized and centralized at the computer centre in Washington, D.C. Data on surface waters comprise general and analytical data (location, daily discharges, etc.) for all stations. The resulting file contains about 300,000 station-years of data, each station-year composed of some 2,000 characters. This big file (by geological standards) is held on magnetic tapes with the current and immediately preceding years on disk pack. It is mostly a numerical data file, kept on fixed format and processed by specially written programs. For groundwaters, geologists were at first reluctant to use standard input sheets, but the procedure is by now largely accepted. The format now used is fixed, even though the amount of information varies within considerable limits. Some 400,000 water wells are
drilled each year in the United States with an average depth of 30-40 m, but only 10-15% are described, with the amount of detail available varying widely from well to well. Most data on surface- and groundwaters are accessible but the retrieval procedure may be costly. A file of water-quality data is integrated into the system; it contains information for approximately 5,500 stations throughout the country. The growing interest in water quality has resulted in an increasing demand for data, so that this file is growing much more rapidly than either the surface-water or ground-water files. Plans for the flow and banking of these data are now being nationally organized. The guidelines for the implementation of this organization are given in "Design Characteristics for a National System to Store, Retrieve and Disseminate Water Data" (Federal Advisory Committee on Water Data, 1971). The basic recommendations for the system, referred to as the National Water Data Exchange (NAWDEX), could serve as a useful model for the exchange of other kinds of geological data, both nationally and internationally. NAWDEX will link all organizations collecting and using water data in the United States. The objective of NAWDEX is "a fully coordinated handling system in which member units (1) make the data in files available to all users, (2) follow prescribed standards for stating precision and quality of data, and (3) identify the same types of data in the same manner." Its main feature will be a Systems Central which will: (1) transmit requests for data; (2) establish standards and formats for storing and disseminating water data, both for manual and for computerized files; (3) develop adequate computer programs for internal manipulation, exchange and dissemination of these data; (4) maintain an index to water data activities. It should be remarked that the data will not be physically centralized in a unique data bank.
Furthermore, only that part of the manual files for which there is sufficient demand will eventually be converted into machine-readable form. There are close similarities between these recommendations, the principles which have served to create the Canadian National System for geological data storage and retrieval, and the conclusions arrived at by Cogeodata for international exchange. In France, hydrogeological data are held separately from surface-water data. The water wells are part of the so-called Code Minier. By law, any boring deeper than 10 m must be geologically described and this description must be submitted to the Bureau des Recherches Géologiques et Minières (the French Geological Survey). Where hydrological data have been captured, they are added to the file. The coverage of the described wells is fairly complete. Until now, the file has been manual but its computerization is now actively pursued. The computerization of such a voluminous and complex file poses real problems which are solved in part by the use of the Semantic Coding System (Laffitte, 1969; Dixon, 1970).
Museum files
An endeavour worth mentioning here is presently being undertaken in the United
States to organize data in museum collections. The program, which has now a budget of U.S. $ 250,000, began in January 1970, using the experience acquired by another program started in 1967 and which proved to be somewhat too ambitious. The project is conceived for all departments of the Museum of Natural History in Washington, in close collaboration with other natural history museums within the United States. The basic idea is to improve the value of the collections by facilitating retrieval. Not only with this system will each museum know what the specimens held by other museums are, and this with a very short delay, but it will also be possible to easily draw maps of various kinds, showing regional distributions. One of the important problems requiring trial and error experiments, is to find the best way to improve data capture and to minimize the amount of clerical work. Indeed, the conventional practice of manual duplication of the same information from the register to the label and to the various card indexes represents a conspicuous investment in clerical time and a source of errors which plagues every museum. The information automatically stored by the new system will comprise all label information (collector's name, date etc.) and all measurements that have been made on the specimen, such as geometrical measurements, chemical analysis etc. Latitude and longitude will always be included. The set of computer programs which has been developed especially for this purpose and which is known under the acronym SELGEM, will be described in the section on "Data-management systems", p. 178. Efforts in the same direction, including the development of original, computer-based registration systems, have been pursued for several years at the Rijksmuseum voor Geologie en Mineralogie, Leiden, Netherlands, and independently at the Sedgwick Museum, Cambridge, United Kingdom. 
These systems and the experience acquired in this field should be of interest to all museums of the world.
Geochemistry
Many exploration geochemical data files are computerized, but they are held by private companies and generally concern limited areas. Three examples of geochemical data files of a more general interest will be considered here. The first, called the Rock Analysis Storage System (RASS), was developed and is maintained by the U.S. Geological Survey at Denver. It consists of two basic computer programs, one for adding data to a file and another for selectively retrieving from it. The system is used to maintain several different geochemical data files. The principal file has been in operation for 5 years and contains data on about 50,000 samples, not only of rocks but of soils and vegetation as well. Each record in the file pertains to a sample, and contains a brief description of the sample, location information, and any general comment about the sample that the geologist cared to submit. Data on as many as 430 kinds of analytical determinations are allowed for any one sample. Any of this information, including the analytical data, may be queried on retrieval. The files are maintained on
magnetic disks. Input card formats are fixed field, but the number of cards required and the record length vary with the amount of data available. The system contains completely general provisions for updating, correcting, and deleting information. The maintenance cost for the system, including all costs of input and retrieval, is about 1 dollar per sample. Emphasis has been put on arriving at a workable, practical system. This necessitated a restriction of: (1) the method used to describe the analyzed samples; and (2) the specification of the analytical methods employed. Samples are described by a series of 30 single-character codes. Other information includes laboratory and field sample numbers, location data (latitude and longitude, state and county), stratigraphic name, name of collector, and some administrative data. Analytical methods are specified as either emission spectrographic or other. Earlier attempts to record the analytical methods more completely were discouraging because of the many types and variations of methods used. The principal benefit derived from the use of the RASS system is the ability to selectively retrieve data in a form that is directly acceptable to a large system of computer programs for data reduction and graphical summary - the U.S.G.S. STATPAC. Data are entered into the RASS system directly from the research laboratories. When the geologist receives the laboratory reports, he may have the data retrieved from RASS and proceed with use of the STATPAC system without manual data preparation. The RASS files of the U.S.G.S. are not available. The data are released by conventional methods of publication, or, in some instances, on magnetic tapes through the National Technical Information Service of the U.S. Department of Commerce. The RASS and STATPAC computer programs, however, have been made available to many organizations throughout the world.
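The kind of record described for RASS, with a fixed descriptive part and a variable number of analytical determinations, may be sketched as follows. This is a minimal illustration in Python, not the RASS programs themselves; the field names, the function `retrieve` and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SampleRecord:
    """Hypothetical sample record: fixed description plus variable-length analyses."""
    lab_number: str
    latitude: float
    longitude: float
    description_codes: str            # stands in for the series of single-character codes
    comment: str = ""
    # Only the determinations actually made are stored, so record length varies.
    determinations: Dict[str, float] = field(default_factory=dict)


def retrieve(records: List[SampleRecord], element: str, minimum: float) -> List[SampleRecord]:
    """Select the samples for which `element` was determined and is at least `minimum`."""
    return [r for r in records
            if element in r.determinations and r.determinations[element] >= minimum]


# Invented values, for illustration only.
records = [
    SampleRecord("L-1", 39.7, -105.0, "RA", determinations={"Cu": 150.0, "Zn": 40.0}),
    SampleRecord("L-2", 39.8, -105.1, "SB", determinations={"Zn": 90.0}),
    SampleRecord("L-3", 39.9, -105.2, "RA", determinations={"Cu": 20.0}),
]
selected = retrieve(records, "Cu", 100.0)
```

The selected records could then be handed to a statistical package without manual re-entry, which is the benefit the text attributes to the link between RASS and STATPAC.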
The system developed at the Centre de Recherches Pétrographiques et Géochimiques at Nancy, France, is based on a somewhat different approach. As the Centre is an academic organization, more time has been devoted to developing a sophisticated system. Although already implemented, the storage and retrieval programs are the object of further studies. The file is made up of various homogeneous series of rock analyses, including traces, from the Vosges, the Ivory Coast, Norway etc. Great care is taken to relate the samples to their geological setting. In the new version of the system, comments and qualifiers may be included with each datum. At Mobil Oil Development and Research Corp. (Dallas), R. Pauken and T. Dennison have developed a geochronological file from selected areas of the world. These data come from the literature and from determinations made within their laboratory. This rather unusual file in an oil company bears mostly on eruptive rocks and minerals; it has proven its usefulness in a number of cases for the exploration department of the company. Part of these data may be made available on special request.
Paleontology
Three files will illustrate the type of current activity in the storage and retrieval of paleontological data.
The efforts devoted to paleontology vary within large limits from one oil company to the other, depending on the usefulness of fossils for picking stratigraphic markers in the field studied by the company. At Imperial Oil in Canada (Calgary) a staff of 6 academicians and 6 technicians work in palynology, micropaleontology (notably Coccoliths) and megapaleontology. The efficiency of the staff is considerably enhanced by a standardized and automated storage system. For each of the disciplines considered, two types of cards have been devised, sample record cards and fossil record cards. These 5-inch x 8-inch cards can be employed in visual files and are also coded, so that data may be key-punched directly from them. The sample records are common to all paleontological disciplines. In order to reduce the amount of subjectivity in identifying, naming and recording the names of taxa, a system based on standard descriptions has been devised. A photo-card is constructed for each species, carrying one or several photographs of a type specimen together with a description. This specimen is labelled and preserved. The specimen and cards then form a standard to which fossils can be compared. Each species is given a code number and identified as accurately as possible by name. In the data processing system the fossil is recorded by a number only, and the name is never keypunched. Dictionaries of numbers and names are maintained. In parallel a series of computer programs has been devised to answer all pertinent questions of retrieval and for drawing automatically a large variety of maps and charts. The system will enable the paleontological staff to examine samples from the wildcat wells which will be drilled by the company within the next year at an average of 24 wells per year.
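The principle of recording a fossil by number only, with names kept in a separate dictionary, can be illustrated by a short sketch. The Python below is a hypothetical illustration, not the Imperial Oil system; the code numbers, taxon names, sample identifiers and function names are invented.

```python
# Hypothetical dictionary of code numbers and names; entries are invented.
taxon_names = {
    101: "Taxon A",
    102: "Taxon B",
}

# Fossil records carry only (sample identifier, taxon code); names are never stored here.
fossil_records = [
    ("S-001", 101),
    ("S-001", 102),
    ("S-002", 101),
]


def names_in_sample(sample_id, records, names):
    """Return the sorted taxon names found in one sample, resolved through the dictionary."""
    return sorted(names[code] for sid, code in records if sid == sample_id)


def samples_with_taxon(code, records):
    """Return the sorted identifiers of all samples in which a given code was recorded."""
    return sorted({sid for sid, c in records if c == code})
```

Because the records hold numbers only, a revised identification changes a single dictionary entry (for instance `taxon_names[101] = "Taxon A (revised)"`) and every listing drawn from the file follows automatically.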
Use is presently being made of the facility offered by this system to free the professionals from routine work as much as possible, identification being made more and more reliably by the technical staff. In a later stage it is planned to try to make at least part of the identification automatically by computer analysis of the image obtained by a scanning electron microscope. The efficiency of the system was made possible through direct contact and cooperation, constantly maintained between all interested parties, a major condition for success. One of the anticipated consequences of the wide use of the system is an extension in the exchange of mutually useful data between affiliated companies. At Amoco (Houston) a "Paleowell Data File" on 20,000 company wells has been maintained for a number of years. It is a computer-based file on fossil assemblages which receive the name of the leading species, accompanied by sedimentological data. With this data base, paleo-environmental maps and sections of various kinds are automatically created. In particular, wide use is made of contouring techniques. Another kind of file is being developed by Dr. Germeraad at the Museum of Natural History of Leiden (Netherlands). For all palynological taxa, a morphological description is introduced in the computer file. In order to do this, it has been necessary to standardize the basic morphological forms which together make the morphological description possible. When the system becomes operational, within the next few years, it will be possible to introduce in a query the morphological description of any described palynological species and to retrieve the name of the species, its stratigraphic ranges, bibliographical references etc. The system will be available at a nominal cost to any
palynologist. This morphological file is related to the endeavours of a committee chaired by Kremp (1972; University of Arizona) "to bring to reality plans concerning a world consortium of palynological data banks". Dr. Kremp concludes that "if these efforts are successful, palynology might advance into a usefulness that we now can see only dimly".
Orogenic studies
A collection of data for orogenic studies has been collated and edited by A.M. Spencer, under the sponsorship of a committee of the Geological Society of London, England. It will be published as a Special Publication of this society before the end of 1972. "The data concern samples of Mesozoic and Tertiary orogenic structures selected on a worldwide basis. The information on these structures was obtained by a questionnaire sent to experts in many countries and designed so as to obtain data in the most objective and quantitative manner possible within a small editorial budget." (W.B. Harland, personal communication, 1972). It should be remarked that the basic units (orogenic structures) are by far the largest geological objects of all the data files on which I am informed.
Indexes
Some of the earliest attempts to rationalize the organization of data have been, within most if not all of the large industrial companies, governmental agencies, etc., to create files on documents. Here the basic unit of the file is a report, a publication, a map, etc. The technique for maintaining such a type of file is in a more advanced stage than the techniques for data files. Use is generally made of a series of keywords which together form what is called a thesaurus. The contents of the document are recorded by a series of these keywords. There exists some divergence of opinion as to the optimum number of keywords for each case. The problem, however, falls outside the scope of the present survey. One point worth mentioning is that there have been various attempts at storing data together with bibliographical information. To my knowledge, most of these attempts have been unsuccessful, probably because in many cases reports or publications do not give all the necessary data on one and only one geological object such as a fossil, a mineral deposit, a mineral species etc. Some confusion then is introduced as to the basic object of reference. Public agencies have developed various systems of bibliographic references, including for example the American Geological Institute (U.S.A.) and the Bureau des Recherches Géologiques et Minières (France), which both are now computer-based. The BRGM system is based on a limited number of keywords displayed graphically, while with the AGI system some 40,000 keywords are utilized, not counting the geographic terms. They are organized in a three-level hierarchy, the first order numbering 200 keywords. Both systems include bibliographic lists, indexes to these bibliographies and various selective retrieval capabilities.
Of special significance here is the Canadian Index to Geoscience Data which contains references to publications, reports etc., containing geological data on Canada. No data are stored within the index itself. The index is maintained by the Canadian Centre for Geoscience Data. Contrary to the two systems previously mentioned, the indexing is done at a variety of places, generally close to the sources of documents. Another difference is that the file relates only to documents containing data proper.
GENERAL ASPECTS
Data management systems
Frequently, a system of computer programs to create, update, edit, retrieve and print the file is prepared at the time when the data are beginning to accumulate. At that time, the organization of the file and its basic conventions are supposed to remain valid for a number of years. Care is then taken to optimize the programs in order to reduce computer time. The procedure has the advantage that the intricacies of the system are known by the people who use it, because they have written it. The system is also well suited to their needs. The price to be paid, however, for the use of custom-designed programs is very often underestimated at the beginning and may well be prohibitive. First, writing and implementing all the subroutines very often turns out to be more time-consuming than expected. All the programs are conceived for a certain type of data organization; after some time and after some experience has been acquired with the file, the case often arises that some change is necessary. For instance, the input system may be too cumbersome or too detailed; or on the contrary, it may be of advantage to add new characteristics. Reprogramming of the subroutines then becomes necessary. In some instances, after the system has worked for some time, there is pressure on the programmer to arrive quickly at a set of running modified programs. He will then be inclined to use techniques which work well, but which nobody can master once he has left because he was unwilling, unable or not provided with the time to document the program's procedures. The tendency to accumulate more and more programs of which nobody really possesses a complete written description and an updated how-to-use guide is widespread indeed. Another major inconvenience of disparate systems is the incompatibility of files created by different systems. The merging of two data files may then become a heavy burden.
Yet the ability to merge data files so as to easily compare data from different sources should be a major concern. Outside the oil industry, most geologists live in an environment where a programmer's time for scientific purposes is limited indeed. The geologist then has the choice either of writing his own program or of trying to find some obliging programmer who will give him some of his time. Neither solution is optimal. Very few geologists have the opportunity to learn Cobol and file-management systems. They are interested in obtaining results as
quickly as possible, and not in developing sophisticated program packages. On the other hand, discussions with a programmer who has had no geological training at all prove to be a hard school of patience. The time necessary for this programmer to really grasp what the geologist wants is to be counted in weeks or more probably in months. These ideas have been at the root of efforts to develop generalized systems for geological data storage and retrieval, which presently are being undertaken in Canada and in the United States. Two of them will be briefly described here. The first is the "Self-Adaptative Flexible Retrieval and Storage" System (SAFRAS), developed by Sutterlin et al. (1969, 1971, 1972) at the University of Western Ontario (London, Ont., Canada), under sponsorship of the National Advisory Committee on Research in the Geological Sciences. The other is a generalized system for information, management and retrieval which has been in operation at the Smithsonian Institution, Washington, D.C., U.S.A., since spring 1970; called SELGEM (Self-Generating Master), it has been written by Creighton et al. (1972). A major asset of these two systems is that they are written in Cobol (for SELGEM) and Cobol and Fortran (SAFRAS). They may then be implemented on any computer of medium size which has a Cobol compiler or Cobol and Fortran compilers. Data from various institutions, using computers of various makes, may then easily be exchanged or merged. Furthermore, it is possible for an institute to change completely its computer installation without having to re-write the whole system. Before briefly describing these two systems, some terms have to be defined. It is a deplorable but undeniable fact that the discrepancies between the meanings of such words as data, information, item, record, hierarchic organization, are as serious as the differences in uses of the geological vocabulary, without even the excuse here that natural and complex phenomena are involved!
We have seen that a geological data file may be envisaged as a collection of measurements, observations, information of various kinds on one kind of object. For each object, a series of characters is given, such as its longitude, latitude, density, color etc. We will call a data element the most elementary item of data which may be given, for instance, longitude, SiO2 contents (the results of one analysis); the data element may be either a single number, a word or a sentence. Data elements may be grouped, e.g., the geographic location may be made of three data elements, longitude, latitude and topographic name. There is a wide variety of ways in which data elements may be grouped because groups may be arranged into subgroups, sub-subgroups, etc. This hierarchical organization is a major characteristic of the file and its processing system. A somewhat unrealistic example of a hierarchical organization with many levels would be a file on formations, with the following sublevels: (1) first order sublevel - outcrop with position; (2) second order sublevel - point of observation of that outcrop with position relative to the other points; (3) third order sublevel - description of selected fossil and/or description and analysis of major minerals.
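The hierarchy just described can be made concrete with a small sketch. It is only an illustration in a present-day language (Python), not the record layout of SAFRAS or SELGEM; the class name Record, the field names and all values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

# A data element is the most elementary item: a number, a word or a sentence.
DataElement = Union[float, str]


@dataclass
class Record:
    """One node of the hierarchy: named data elements plus sub-records."""
    level: str
    elements: Dict[str, DataElement] = field(default_factory=dict)
    children: List["Record"] = field(default_factory=list)

    def depth(self) -> int:
        """Number of levels at and below this record."""
        return 1 + max((child.depth() for child in self.children), default=0)


# Hypothetical file on formations with three sublevels, as in the text.
formation = Record(
    level="formation",
    elements={"name": "Example Formation"},
    children=[
        Record(
            level="outcrop",
            # The geographic location is a group of three data elements.
            elements={
                "longitude": 8.61,
                "latitude": 45.80,
                "topographic_name": "Example Hill",
            },
            children=[
                Record(
                    level="observation_point",
                    elements={"offset_m": 12.0},
                    children=[
                        Record(
                            level="mineral_analysis",
                            elements={"SiO2_percent": 71.3},
                        )
                    ],
                )
            ],
        )
    ],
)

print(formation.depth())  # file level plus three sublevels
```

In such a representation, the number of admissible levels is simply the greatest depth the system is prepared to accept.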
The number of admissible levels within one system determines, to some extent, the complexity of a system. SELGEM allows a practically infinite number of levels. SAFRAS allows three levels only. The first system indeed has been conceived for museum data files, not only for geological or mineralogical, but also zoological, etc., purposes. Fortunately, many geological data files which exist today have a rather simple hierarchy. Both SAFRAS and SELGEM have been conceived for use by scientists who have practically no training in programming. They have a retrieval language which is based on Boolean logic and which is somewhat similar to the "logical IF" statement of Fortran IV, with "greater than", "less than", etc. linked by the logical operators "or", "and", etc. The retrieval strategy built into the systems, however, is different. Both have the capability of analyzing sentences, so that all the words within a sentence may be compared separately to a given word. For instance, one data element could be Comments on age. Suppose that some of these comments are: Sample number 325: probably Silurian or Cambrian; Sample number 422: Cambrian (doubtful); Sample number 478: certainly not Cambrian. All three samples would be retrieved by a query asking to list all comments on age which contain the word Cambrian. If one wants to avoid the last sample, he must ask that "not Cambrian" be excluded, a thing which is possible in a single pass. For both systems, the input procedure is quite flexible and easy to learn. It allows data of variable length, without fixed format, but data prepared for other systems (for instance for Fortran input with fixed format) may readily be accepted also. Retrieval requests may be operated through a remote terminal. Both have the facility to accept variable length logical records, with repetition of fields permitted within a logical record.
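The word-by-word retrieval on Comments on age can be sketched as follows. This is an illustration in Python of the logic only, not the retrieval language of SAFRAS or SELGEM; the function names are hypothetical, and the three comments are those quoted above.

```python
import re
from typing import Dict, List, Optional

# The three comments on age quoted in the text, keyed by sample number.
comments_on_age: Dict[int, str] = {
    325: "probably Silurian or Cambrian",
    422: "Cambrian (doubtful)",
    478: "certainly not Cambrian",
}


def words(sentence: str) -> List[str]:
    """Split a sentence into lower-case words, dropping punctuation."""
    return re.findall(r"[a-z]+", sentence.lower())


def contains_word(sentence: str, word: str) -> bool:
    """True if the word occurs as a separate word in the sentence."""
    return word.lower() in words(sentence)


def contains_phrase(sentence: str, phrase: str) -> bool:
    """True if the words of the phrase occur consecutively in the sentence."""
    w, p = words(sentence), words(phrase)
    return any(w[i:i + len(p)] == p for i in range(len(w) - len(p) + 1))


def retrieve(include: str, exclude_phrase: Optional[str] = None) -> List[int]:
    """Single pass over the file: keep samples that contain the word
    `include` and do not contain `exclude_phrase` (if one is given)."""
    return sorted(
        number
        for number, text in comments_on_age.items()
        if contains_word(text, include)
        and not (exclude_phrase and contains_phrase(text, exclude_phrase))
    )


print(retrieve("Cambrian"))                  # [325, 422, 478]
print(retrieve("Cambrian", "not Cambrian"))  # [325, 422]
```

The second query combines both conditions with "and" and "not" in one pass, which is the behaviour described above.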
SAFRAS is well suited for universities and in general for places where geologists have access to computers, but little help from programmers. It is relatively cheap ($1,500 for licences) and easy to implement. It takes 110 K character locations of core. It has been implemented for 8 organizations on computers of type IBM 360/50, 65, 67, 85, UNIVAC 1106, PDP 10 and CDC 6400. The implementation takes 20 man-days or less. Emphasis has been put on simplicity of use, and that aim has indeed been achieved. In particular it should be of great help in testing various possible designs of a file and the feasibility of creating a new file. For these reasons it should be useful to students. It has therefore been implemented, for instance, at the Colorado School of Mines.
SELGEM runs on the Honeywell 200 series, IBM 360 series and UNIVAC 1106. It takes 90 K character locations of core. It is used by half a dozen universities and is provided free to non-commercial scientific activities. Learning how to use it is easy, even for scientists without training in programming.
Beside such systems, which have been developed especially for geological or Natural History data, there exist over 100 generalized data systems which are commercially available: report generators, file management systems; but only a few are used by geologists as far as I know. Probably the most widely used is MARK IV, which is being used by a number of oil companies for their well data and other files. It is a general purpose
file management system for file creation, maintenance and report generation. It handles multiple input files and generally multiple output reports or files. It is adapted to voluminous data files which, within the geological community, exist mostly at oil companies. It is written in IBM 360 Assembler language and its use is therefore limited to this type of machine. The GIPSY system (General Information Processing System, Addison et al., 1969), which has been adopted by the U.S. Geological Survey for several files, notably its Oil and Gas file and its Computerized Resource Information Bank file, should also be mentioned. Like MARK IV, it is written in IBM 360 Assembler language and is therefore tied to that type of computer. It is a generalized information storage and retrieval system, developed at the University of Oklahoma computing centre. The system is suited for mixed numerical and alphanumerical data. As with SAFRAS and SELGEM, data may be referred to by the data element name. Some commercial companies, selling geological data to the oil industry, also provide specialized programs to handle these data. For instance, Petroleum Information (Denver, Col.) sells a "Petroleum Information Retrieval System", specially designed to operate on the Well History Control System. It also has a report generator and full geological mapping capability. Computer Data Processors Limited (Calgary) has designed a full set of computer programs, the CDP Geological System, which is specially suitable for handling oil well geological data. The system has the capability to build a file, retrieve, edit, and update it, and also has quite a number of graphical output capabilities, including trend surface analysis, grid perspective programs, contouring, etc.
Processing programs
The present state of the art of the mathematical treatment of geological data has been recently reviewed in several papers, notably in Griffiths (1970) and in Agterberg and Robinson (1971). Griffiths points out the feedback effect by which the advance in methodology interacts with the advance of science, leading to the mutual progress of both science and methods. He predicts that "adoption of modern mathematics and modern aspects of the scientific method will lead to revolution in the paradigmatic foundations of geoscience and this aspect, rather than the adoption of mathematics as an auxiliary tool to conventional practice, will ultimately be the most rewarding contribution of geomathematics." The point of interest for our subject is that improvement in geomathematical methods and in the quality and objectivity of geological data are in fact two facets of the same problem. The results obtained by mathematical treatment are only as good as the treated data. Hence, any improvement in the consistency of the data will bring about an improvement in the applicability of mathematical solutions, together with a better exchangeability of the data - a case where you can both have the cake and eat it. One of the most difficult points of geomathematics is to express the results in a language in which they may be compared to the "ground truth" as described qualitatively by geologists (Agterberg and Robinson, 1971). The most promising field of
application is undoubtedly, in this respect, all types of graphical output. Graphics are easy to grasp at a glance, or in other words, as said by a U.S.G.S. geologist, they are "sexy". The system developed by J.M. Botbol and G.B. Gott at the U.S.G.S. in Denver seems to be most promising. They obtain stereoscopic pairs of figures which may be observed like pairs of air photographs. Leymarie in Nancy and Bickmore in London have tested the same system but using red and blue colors, to be looked at with red and blue colored glasses. Interesting research is also being done at Atlantic Richfield on interactive graphic systems with video and light pen equipment. It may be predicted that with the advent of today's technology to handle graphics, massive use of graphical output will be developed and accepted by the geological community within the next decade. The realization that, to use the data properly and to obtain more meaningful graphics, consistency should be improved by several orders of magnitude, will then become commonplace. Not all geologists, however, and not even all geomathematicians, recognize the close connection between data and mathematical methods. They do not yet see the need to change the present state of affairs where, as pointed out by S.C. Robinson, "the product available to the mathematician conveys the geologist's interpretations of field and laboratory observations and measurements but not the observations and measurements themselves."
Standards, pros and cons
Experience acquired over the years in building a variety of geological data files has shown that the dominating problem is consistency. Consistency may be said to be achieved when two geologists encoding independently the same observations on the same object arrive at the same results. The main reason for failure of geological data files (and failures have happened a number of times) has been a lack of rigor in the use of descriptive terms. It is a well-known fact indeed that not only may two geologists encode the same thing in different ways, but that the same geologist will probably change his system somewhat, even between the beginning and the end of a field season. Admittedly, there are other problems, such as the compression and selection of data, using adequate software, etc., but arriving at consistency within a single file is an absolute necessity, if the file is to be of any use at all. A useful distinction may be made here according to the number of users. (By users we mean both people who will process the file in any way and people who encode the data; in many cases indeed the people preparing the input are also using the output of the file). If the number of users is limited, consistency is generally achieved through permanent contacts between users. That this close collaboration is a necessity has been realized by successful commercial companies, selling geological information to the oil industry. If the number of users is large, and this includes the case where the data are taken from various sources in the literature, achieving some degree at least of consistency becomes a major problem. It is already important within big companies which employ a number of geologists, and becomes crucial on an international level. Here, close and permanent contacts
are not possible. The only alternative then is to arrive at better standards. The problems of standards in geology are of major importance, but first it must be noted that the word "standards" provokes negative reactions by many geologists. The importance of this feeling should not be underestimated. Dangers of standards which have been pointed out to me, are: (1) Standards may enforce measuring systems which may be a barrier to progress; for instance a standard would impose that the length of some geological characteristic should be reported in millimeters, whereas in reality the phenomenon would be much better described by a logarithmic scale, not a linear scale. It will then require some time and efforts to overcome the barrier. (Let us note, however, that this objection is even more true for reporting theories than it is for reporting facts.) (2) Standards are sometimes felt as encroaching on the liberty of presentation of the data. "Presentation of the data", said the leader of a commercial company, "is an important part of our business; rules on presentation of these data may infringe on our competitivity". Standards are also sometimes felt as some kind of governmental intervention in the geologists' freedom of thought. (3) There is also a not unfounded fear that the filling of normalized input sheets would transform the geologist into some kind of servant to a blind and all-powerful machine. Young geologists would then be tempted to pay more attention to filling in the sheet properly than to making the observations. Any case which does not readily fit into the "small boxes" system would be disregarded. Proof that this is a real danger is the fact that it has indeed happened within one mineral company at least. (4) Finally, although very rarely expressed, there is some obscure fear about mathematics creeping into one of the few scientific fields which is comparatively free of this drastically rigorous method. 
The main problem is that while an improvement in the communication of geological facts and theories is generally recognized as a factor in progress, some geologists think that the present situation cannot really be improved. No effort therefore should be devoted to improving geological describing tools because such efforts are hopeless; it is in the nature of geology itself. A great part of the problem, I feel, is due to a misunderstanding of what standards are really about. A complete definition of what geological standards could be and an exhaustive study of their usefulness and limits, has still to be done. It would be, I think, a most useful contribution to geological progress. But, already now, the following ideas seem to emerge, thanks for a large part to research being done in Canada (see Robinson, 1972). One of the outcomes of this endeavor has been the realization of the fact that the amount of work required to obtain useful standards or norms is often hard to foresee. Thus, geographical location recording of mineral deposits proved to be much more tricky than expected. This has led Kelly (1972) to make a comprehensive study of the problem and to prepare a series of standards which could be used, at least in part, in other countries. Three levels of standardization may be distinguished: concepts, codes and structures.
Standardization at the conceptual level is by far the most important, and will be discussed in the next paragraph. Too often, in creating data files, this aspect is neglected and attention is concentrated on codes and overall organization of the data file. Problems dealt with are, e.g., "shall we record limestone with a mnemonic code or with a number, and shall we put it in columns 40-45?", whereas the true problem is: "Should we record limestone at all and what will we call limestone?". Fortunately, systems now available such as SAFRAS will free geologists from bothering too much about format and structure and let them concentrate on the real geological problems. Conceptual standards may concern: (1) objects or (2) characteristics. The best-known examples of geological objects which have been made into standards are the famous G-1 and W-1, two geochemical standard rocks which have served to compare analytical methods. Other geochemical standards (rocks and common minerals) are being prepared now, among others, by the CRPG of Nancy. Stratotypes which are being proposed now under the aegis of the IUGS Commission on Stratigraphy are another good example of standard geological objects. I have been told that a standard series of rock chips is being given to prospectors in Canada; also, Cogeodata has suggested the establishment of a standard petrographic series internationally. S.C. Robinson has also proposed the selection of a number of mineral deposits as standards to which all other deposits could be compared. Certainly other objects could also serve as standards and would give much service. Although various kinds of standard objects will be of great help in geological communication, still more important are efforts at unification of concepts. Stratigraphy is probably the branch of geology where most efforts are being made in that direction.
The clarification of the concepts of formation and of various lithostratigraphic, biostratigraphic and chronostratigraphic units, certainly will bring order in that important branch of geology (Van Eysinga, 1970); and more order means progress. It could be applied to other fields of geology as well, but much remains to be done. The important work of the subcommission on Nomenclature of Igneous Rocks, chaired by Professor Streckeisen is worth mentioning here. The American Petroleum Institute has been involved for several years in standardizing well data terms. Admittedly, only a fraction of these terms concern geology, and emphasis has been put more on encoding than on meaning; nevertheless this effort has saved enormous expenses when converting well data from one system to another. To my knowledge the only major effort of a national scale, however, has been made in Canada where a series of geological standards of various kinds is now being prepared, under the coordination of the Canadian Centre for Geoscience Data. An experiment worth mentioning here is the pilot "Multilingual Thesaurus" on Structural Geology prepared by the IUGS Committee on Geological Documentation. The first draft has been presented at the 24th International Geological Congress. Experience acquired up to now by the working group shows that specialists from different countries, sitting together and comparing the meanings of keywords, can arrive at a good degree of agreement within a reasonable time. The results obtained so far are encouraging and could be extended to other kinds of concepts and standards. It may already be foreseen that notwithstanding the reluctance of the geological
community, efforts to standardize geological concepts will be increased conspicuously. Beside providing better means for exchanging data, it will also improve the exchange of ideas and theories and finally, and most importantly, it will yield a better quality of the data themselves. This research on better standards must take into account the recognized fact that, for most geological concepts, there are no pigeon holes, but that the majority of concepts are continuous. While the fact is recognized, however, all its implications and ways to overcome these difficulties are still to be studied.
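Consistency was defined at the start of this section as two geologists, encoding the same observations independently, arriving at the same results. A crude numerical form of that definition can be sketched as follows. The sketch is an assumption of this edition, not a method proposed in the text: it uses the raw fraction of agreement and Cohen's kappa, a general statistic for agreement between two classifiers that corrects for agreement expected by chance. The rock names and the two encodings are hypothetical.

```python
from collections import Counter
from typing import Sequence


def raw_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of objects to which both geologists gave the same term."""
    if len(a) != len(b) or not a:
        raise ValueError("both encodings must cover the same, non-empty set")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Agreement corrected for the agreement expected by chance alone."""
    n = len(a)
    observed = raw_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability that both pick the same term at random,
    # given how often each geologist uses each term.
    expected = sum(count_a[t] * count_b[t] for t in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both used one and the same term throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical encodings of the same six hand specimens.
geologist_1 = ["granite", "granite", "gneiss", "limestone", "gneiss", "limestone"]
geologist_2 = ["granite", "gneiss", "gneiss", "limestone", "gneiss", "limestone"]

print(raw_agreement(geologist_1, geologist_2))
print(cohen_kappa(geologist_1, geologist_2))
```

A measure of this kind treats the descriptive terms as pigeon holes; for concepts that are continuous, as noted above, it can only be a first approximation.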
Data availability
The degrees of availability of geological data files are more numerous than is generally believed. The complexity of the situation is, in great part, due to the fact that the interchange of data files has not yet come of age. If we except commercial companies, very few computer processable data files are prepared as end-products. The idea that magnetic tapes could be a useful form of data transmission has not yet taken ground in geology. There are some indications, however, that things are changing. First, commercial companies have been selling tapes for several years to the oil industry. Second, the U.S. Geological Survey has recently defined a policy to release magnetic tapes in the same way as it is releasing publications. It must be acknowledged, however, that demands for such tapes are not yet numerous. There is also within the USGS, the Geological Survey of Canada and the Dutch Geological Survey a growing recognition that we are nearing the time when a part of their "geological maps" will be released as data tapes. Maps will then be produced either by the Survey or by some editing company, and tailored to every request. They will cover the requested geographic region, be at the requested scale and contain only the specified data directly useful for the customer. Admittedly, this new possibility is not for tomorrow. But the trend toward that new system has already begun. Finally, the number of geologists working on an interregional or intercontinental scale is growing and these specialists will require access to increasing amounts of data from increasingly diversified sources. One of the most widespread beliefs among geologists from academic circles is that all data held by industry are confidential. The situation, however, is not that simple. In many countries, there is a legal obligation to release some or all of the geological data acquired to some national or provincial authority.
In North America where there is sufficient demand for it, several companies have made a business of exploiting this source of information, editing it and, above all, homogenizing it and selling it to the oil industry. A great part of the data produced by the oil industry is available, but is mixed with data of critical importance for the company. Hence, to satisfy an external demand, confidential and non-confidential data should be separated. This separation would require considerable time for some official within the company and, quite understandably, he will be reluctant to do this task unless he can justify that the work to be done on these data would in some way be beneficial to the company. A somewhat similar situation arises
when a company holds for some reason, data from another company. It also happens that a state survey will be reluctant to furnish data which might infringe on the rights of a private company. Therefore, any request for data must be clearly justified and the use and dissemination of the data must be specified. Restrictions to data availability are not by any means confined to industry. In public organizations (surveys, universities, etc.) data will not be disseminated before publication of the results obtained from processing these data by the authors of the data file. This protection of the scientific rights is certainly justified. On the other hand, a fact which too often happens and which is less justified is that, once the research is done and the publication achieved, nobody cares about these data any more. Many valuable data have thus been lost. When the data come from the literature, some copyright problems may arise. How far can magnetic tapes be disseminated when they contain data extracted from scientific journals? The problem is similar to abusive dissemination of scientific papers by photocopies. The situation does not seem to be so different in Eastern Europe; for instance in Czechoslovakia several geological data files are available but without containing the exact geographic location of the stations described. A final remark. It seems to be a trend that the degree of availability is reduced by the owners as the file grows, at least in some instances. This is because the economic usefulness of the information potentially contained in the file grows enormously as the volume of data grows, after a certain threshold has been reached.
The international scene
On an international level, many bodies are dealing with some aspects of geological data. Indeed, it may be said that most if not all commissions and committees of the International Union of Geological Sciences (IUGS) are bringing order into their respective categories of data. However, it is the task of Cogeodata to deal with these questions for geological data in general. This committee, created in 1968, has sponsored various activities which are reviewed in several issues of Geological Newsletter. Notably, it has proposed a series of guidelines for the recording of: reference numbering; geographic location; data on paleontology, stratigraphy, geochemistry, mineral deposits and petrography (see Geological Newsletter, vol. 1971, no. 3, pp. 175-194, and Cogeodata, 1972). Also, for Cogeodata a bibliography on geological data storage and retrieval has been prepared by Hruska and Burk (1971), supplemented by Burk (1972a); a list of existing geological files will be published in Codata Bulletin, no. 8 (Hubaux, 1972b). Finally, an inquiry on present usage in rock names and description has been organized. The first results thereof were presented at the 24th International Geological Congress in 1972. International bodies interested in geological data are not all within IUGS, however. We will attempt to review briefly the most important of these bodies, although today's
situation is rapidly evolving. First, some words ought to be said about the two main international organizations responsible for scientific activities in general and which are of prime importance here: UNESCO and ICSU. While the first is well known, many scientists are still unaware of the existence of the second. Yet the International Council of Scientific Unions was created after World War I and is thus notably older than UNESCO, which is a member of the family of organizations created together with the UNO. ICSU is non-governmental, which means that it is subsidized by scientific bodies of the member countries such as the academies, and not directly by governments. ICSU is made of 16 scientific unions among which we may mention, beside the IUGS, the International Geographical Union and the International Union of Crystallography. Beside these unions and linked to them, are a series of associations such as the International Association on Paleontology, the International Association for Mathematical Geology, etc. Several long-range research programs are sponsored by these unions and associations, of which two ought to be mentioned here. The International Geological Correlation Program is a joint venture of IUGS and UNESCO (1971a). Its aim is to encourage interregional and intercontinental correlations. One of the four sections of the program concerns data storage, retrieval and processing. The exchange of data will constitute one of the major aspects of this program. The Geodynamics Project is jointly sponsored by IUGS and the International Union of Geodesy and Geophysics. It is, in a sense, a successor to the Upper Mantle Project, which ended in 1970 and which was highly successful. The focus of the project relates to the dynamic processes, present and past, which have shaped and are shaping the Earth's surface. 
There exists also a series of interdisciplinary services within the ICSU family; among others, the World Data Centres which were created during the International Geophysical Year and are still active. Although they have dealt up till now with geophysical data and very marginally with geological data, their potential usefulness for geology is now being reassessed. The ICSU Committee on Data for Science and Technology (Codata), which originally dealt mainly with critically evaluated data on simple substances, has recently broadened its scope so as to include Earth sciences and Biosciences data. Codata will more particularly study interdisciplinary aspects of scientific data. Finally, there exists a joint project of UNESCO and ICSU, called Unisist, aiming at fostering the unimpeded exchange of published and publishable scientific information and data among scientists of all countries (UNESCO, 1971b). It will thus cover the two aspects: documentation and data. In this sense it is unique, because generally these two aspects are kept separate. It is my feeling, however, that emphasis is put more on documentation (i.e., bibliographical information) than on data. The above enumeration should not give the impression that scientific data, and geological data in particular, are well organized and cared for internationally. Actually, the total sum of money and effort which is devoted to improving the present situation is small indeed. There could seem to be a danger of duplication of efforts. The danger is not
so much of duplication, however, as of working on a scale which will be one or two orders of magnitude below what is really needed. One of the basic problems, the importance of which had been, until recently, underestimated, is to ascertain the real needs of the geological community. This problem is not particular to geology. Some 10 or 15 years ago, the creation of big scientific data banks, which would contain all scientific knowledge internationally available, was seriously envisaged. The World Data Centres associated with the International Geophysical Year were created in that period. The idea, however, proved to be impractical. Political problems would have been difficult to solve, and also the cost of establishing and maintaining such huge data banks would have been staggering indeed and their usefulness doubtful. Although there are some instances of data centralized and stored on an international scale in some specific scientific fields, it is generally admitted that the creation of a network of data centres is more advisable for the time being for most environmental data, and specifically for geological data. On a national scale, the geological surveys should, as one of their tasks, maintain repositories of publicly available data within the country. The coverage is incomplete in most countries, but it must be emphasized that several surveys have in these last years devoted much more effort than before to gathering data which are dispersed in universities, governmental agencies, museums, etc. and even in industry. The idea of a network of geological data centres is thus taking ground nationally in several countries and is also beginning to take form internationally. The problem here is to guess which action will be the least objectionable to the next generation of geologists.
CONCLUSIONS
"When you can measure what you are speaking about and express it in numbers, you know something about it, and when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind. It may be the beginning of knowledge, but you have scarcely in your thought advanced to the stage of a science." (Lord Kelvin).
Needed studies
Among the pioneering research of geologists to improve the usefulness of data, one voice is conspicuously silent. While industry, especially the oil industry, has paved the way for a number of years by using a trial and error approach which has brought a useful harvest of experience; while geological surveys are now following suit in several countries, the academic world is hardly beginning to move. With few exceptions, such as the CRPG in Nancy, the Ecole des Mines in Paris and a few others, hardly any researchers have given thought to the theoretical aspects of the question. The Geological Survey of Canada, aware of the fact that the attention of universities should be called to the subject, has been devoting during these last few years one-fourth of its grants in aid to universities to sponsor research on various aspects of data storage and retrieval. The return, the author
has been told, has been particularly successful. But the recognition of the existence of interesting problems in this field is certainly not widespread. Among geomathematicians, the prevailing tendency is to refine tools to process that part of the geological data which is already quantified (where, admittedly, considerable progress has been achieved during this last decade), but not to attempt a better quantification of other concepts. However, truly quantified concepts represent today but a small part of all geological concepts, and the usefulness of geomathematics will remain marginal if the list of objects amenable to mathematical treatment is not extended. The subject is fraught with problems of theoretical interest and of practical value. We have met with several of them in this review. I will attempt to list them here, not as an exhaustive list but with the hope that they will serve as hints to imaginative researchers. None of these questions, I believe, is really fashionable yet, but all of them should yield results of direct usefulness. (1) Comparative studies of current practice in such domains as the storage and processing of laboratory results, file building and management programs, programs for graphical output (contouring techniques, etc.), would be valuable to a number of people presently engaged in these activities. (2) Bringing some order into the maze of technical words used in connection with data files would be a most useful task. Words such as attribute, character, data, information, records, bundle of data, data element, data element set, etc. are used with widely different meanings. (3) How could we achieve better consistency for the description of: rocks; structures; textures; tectonic phenomena; metamorphic phenomena; etc.? (4) How could we measure the degree of consistency achieved in the above list? Even a crude measure would be useful. (5) For a petrographer, gneiss is nearer to granite than to chalk. But how much nearer?
Is there a way by which we could measure the "distances" between rock types? Admittedly, these distances will depend on the point of view: an engineering geologist or a hydrogeologist will see these "distances" differently from a petrographer. But will two petrographers (or two hydrogeologists) give widely divergent appreciations of these distances, or will the discrepancies remain within tolerable limits? (6) More generally, what are the most adequate systems of description when there are no pigeon-holes, i.e., when the objects to be described belong to a continuum? (see Hubaux, 1970, 1972a). (7) A systematic study of geological standards presently in use would help in: (a) coming nearer to a clear definition of what geological standards are, and of their usefulness and their limits; and (b) consequently, deciding how and to what extent new standards should be developed. How widely are color charts, for instance, presently in use? There have been, it seems, several unsuccessful attempts at using them for rocks. What are the causes of failure? What kind of geological objects are presently being used as standards (for instance, stratotypes, geochemical standards...)? What kind of geological standard
should be first developed? What standards would have the greatest chances of success? (8) A significant part of geological data, even quantitative data, will always be derived in part by the geologists' best judgment. Examples are the so-called absolute age, the volume represented by a given sample (for geochemical studies for instance) and especially the reserves of a mineral deposit. This last instance has been solved, as we have seen in the section on "Mineral deposits", p. 167, by using a probability matrix. This probabilistic approach could be extended to a number of geological characteristics if the reproducibility of this method could be ascertained; a series of experiments could be devised for that purpose. Several geologists would be asked to encode their estimates, with confidence levels, of the same quantity. Such experiments would be easy to organize and would most probably prove to be enlightening. (9) A much more difficult problem would be a system to evaluate the potential information contained in a data file. Indeed, the harvest of results which may be obtained by processing a file grows as the volume of the file grows, but the relationship is probably not linear. In many cases, the potential information grows rather slowly at the beginning and increases at a much steeper rate once a threshold is surpassed. Where is that threshold? The question may probably not be answered at this level of generality, but to have some rough guidelines as to the potential usefulness of a file would be a really valuable tool when the decision is taken to build or to interrupt the updating of a file. (10) Any file is a description of geological objects of the same kind. As we have seen in the section on "Data-management systems", all geological objects belong to a given level within a hierarchy.
Clearly also, when we build a file we are not interested only in that hierarchical level from which the basic units of descriptions have been chosen, but in higher and perhaps lower levels in the hierarchy as well. How far up or down the hierarchy is it possible to go usefully with one file? For instance, can we compare metallogenic provinces, using characteristics of minerals from ore shows? Or, rather, what are the conditions to make such comparisons feasible? (11) Any storage system provides for some cancellation of information, either by simply leaving out unwanted details or by some kind of generalization, i.e., by recording at a higher level of the hierarchy; for instance, recording Paleozoic instead of Devonian. In the RASS file, as we have seen in the "Geochemistry" section, p.174, the analytical method is simply recorded as belonging to one of two categories. (Of course the contrary operation, i.e., encoding with more precision than is actually known, is forbidden for obvious reasons; I cannot record Spirifer if I only know that the fossil is a Brachiopod.) A certain amount of information cancellation is undoubtedly unavoidable, but how much? This amount is built into the system at the start and any change in this convention will be costly. The designer of the file necessarily makes use of his best knowledge and his guesses about the possible uses of the future file, but criteria to define the degree of cancellation in a more systematic way would be useful. As an illustration of this, the Manitoba Department of Natural Resources keeps a rock sample from every field mapping station. Clearly, the sample is kept for the potential information it contains, but the physical storage of this information is a costly operation and its justification must be
analyzed. Generally speaking, unwanted information is a special kind of pollution and its destruction has not attracted the attention it really deserves. These are a few examples of the kinds of problems, the solution of which would bring about conspicuous progress in data exchange and data processing. They are given here to stimulate reflections and comments and with the hope that more research will be devoted to this subject.
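As a small illustration of question (4) above, the following is a minimal sketch of one possible crude measure of consistency: the fraction of fields on which two input sheets, filled in independently by two geologists for the same object, agree. The field names, the sample values and the function names `agreement` and `discrepancies` are assumptions made for the example; they are not taken from any of the systems reviewed here.

```python
from typing import Dict, Tuple

# An input sheet is modelled as a mapping from field name to recorded value.
Sheet = Dict[str, str]


def _normalize(value: str) -> str:
    """Ignore case and spacing so that 'Gneiss' and ' gneiss' count as equal."""
    return " ".join(value.lower().split())


def discrepancies(sheet_a: Sheet, sheet_b: Sheet) -> Dict[str, Tuple[str, str]]:
    """Return the fields present on both sheets whose values differ."""
    shared = set(sheet_a) & set(sheet_b)
    return {
        field: (sheet_a[field], sheet_b[field])
        for field in sorted(shared)
        if _normalize(sheet_a[field]) != _normalize(sheet_b[field])
    }


def agreement(sheet_a: Sheet, sheet_b: Sheet) -> float:
    """Fraction of the shared fields on which the two sheets agree (0 to 1)."""
    shared = set(sheet_a) & set(sheet_b)
    if not shared:
        raise ValueError("the two sheets have no field in common")
    return 1.0 - len(discrepancies(sheet_a, sheet_b)) / len(shared)


# Hypothetical sheets describing the same sample (illustrative values only).
geologist_1 = {"rock_name": "gneiss", "color": "grey",
               "texture": "foliated", "age": "Paleozoic"}
geologist_2 = {"rock_name": "Gneiss", "color": "green",
               "texture": "foliated", "age": "Paleozoic"}

score = agreement(geologist_1, geologist_2)      # 3 of 4 fields agree
to_review = discrepancies(geologist_1, geologist_2)
```

Such a score treats every field as equally important and every disagreement as total, which is certainly too crude for continuous characters; it is offered only as a starting point of the kind the question asks for.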
The art of data file building

A happy few among geologists live in an environment where they have easy access to a big computer and, most important, may find the ready help of programmers and information specialists in their own organization, without having to explain at length the intricacies of the geological problems they have to solve. This is the case in big oil companies where picks and tops and casings and logs are the daily bread. Most geologists, however, are not so lucky. They must do a great deal of work by themselves when they want to create a data file. As we have seen in the preceding chapters, obtaining a file which will meet its goals is still more an art than a science. There is no sure recipe for success. Some hints, based on the experience of several specialists, may nevertheless be of help to isolated geologists wanting to create a geological data file. Useful guidelines will also be found in Brisbin and Ediger (1967) and in Cogeodata Recommendations (1972), although these documents should not be considered as the Gospel. Generally, the main theme of study and the field of application (a region, an eruptive complex) are imposed. But the first choice to be made, and one to which not enough thought is given, is the choice of the basic unit of description. A data file indeed is a collection of observations on one kind of geological object. It is not advisable to mix objects of various kinds within the same file. For instance, water analyses are better kept separate from rock analyses. Any geological object may be considered as belonging to a hierarchy (see section on "Data management systems"). As we have seen, we are always interested in more than one level of the hierarchy. So the first problem is to decide at which level of the hierarchy the basic unit must be chosen. When the basic unit is a complex object, such as a volcano, it is generally advisable to keep separate files on series of characters.
For instance, the volcano reference data for petrology of the Natural History Museum in Washington are kept in four separate files: (1) volcano anatomy and description; (2) main eruptions of that volcano; (3) events that occurred within a particular eruption; (4) rock names and analyses. It is useful to choose objects which are as distinct as possible, and a clear definition of these objects is imperative. This is not as obvious as it sounds. For instance, what is an ore deposit? If clear conventions are not adopted at the start, one runs into the danger of storing information on a series of closely connected deposits, and then again on one particular deposit within that series. Generally, therefore, man-made objects such as samples or oil wells are preferred to natural geological objects which are not so easy to define. Even man-made objects, however, are not exempt from pitfalls. Take the case of
oil wells. Apparently, nothing is simpler, but what if there is a directional redrill (i.e., a new hole drilled at an angle from a point in a pre-existing hole); what about a well which is drilled deeper? The American Petroleum Institute Sub-Committee on Well-Data Retrieval Systems has devoted enormous efforts to arriving at widely accepted conventions on problems like this. Once the basic unit is chosen, the next step consists in establishing the list of characters which will be observed or measured. It will be noted that some of these characters are merely objects at a lower level of the hierarchy. As a general rule, it is advisable to split rather than to lump these characters. For example, many characters may be accompanied by a qualifier; e.g., the color of the rock, if recorded by a name, could be made more specific by having the possibility to qualify it: dark green. It is generally better to record such qualifiers separately. Often, to keep track of the precision of each measurement, four subjective categories are distinguished, ranging from excellent to poor precision. Although somewhat crude, this system seems to be satisfactory. If the numerical data are partly based on subjective judgment, such as the estimate of reserves in ore deposits, there is a tendency to omit them from the data file on the basis that data files should contain only objective data and not so-called information, i.e., subjective deductions. The author believes, however, that such data are important and so numerous in geology that some effort should be made to record them as consistently as possible. Much wider use could be made in that respect of the probability matrix which is used by the U.S. Bureau of Mines (see section on mineral deposits). If the data must be expressed not by numbers but by words (mineral names, colors, etc.) use may be made of mnemonic codes, although there is no obligation for any kind of coding from the computer point of view.
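The advice given above, to split rather than lump, with the qualifier and the precision category recorded as fields of their own, might be sketched as follows. This is only an illustration under stated assumptions: the names `Character`, `Precision` and `split_color` are hypothetical, and the two intermediate precision labels are invented for the example, since the text specifies only that four categories range from excellent to poor.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Precision(IntEnum):
    """Four subjective precision categories, from excellent to poor."""
    EXCELLENT = 1
    GOOD = 2   # assumed label for an intermediate category
    FAIR = 3   # assumed label for an intermediate category
    POOR = 4


@dataclass(frozen=True)
class Character:
    """One observed or measured character of the basic unit of description."""
    name: str
    value: str
    qualifier: Optional[str] = None        # recorded separately, never fused
    precision: Optional[Precision] = None  # None when no judgment was given


def split_color(text: str) -> Character:
    """Split a lumped color term such as 'dark green' into qualifier and value."""
    parts = text.split()
    if len(parts) == 1:
        return Character(name="color", value=parts[0])
    if len(parts) == 2:
        return Character(name="color", value=parts[1], qualifier=parts[0])
    raise ValueError(f"cannot split color term: {text!r}")


rock_color = split_color("dark green")
# rock_color.value is "green" and rock_color.qualifier is "dark", so a search
# for all green rocks finds this record without decoding a fused term.
```

Keeping the qualifier in its own field means that retrieval on the main character does not depend on how each geologist happened to phrase the description.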
The time is past when it was necessary to compress information in highly cryptic symbols. It proves, however, often convenient to use mnemonics as a kind of shorthand to facilitate filling the input sheets. Four-letter mnemonics seem to be widely used and adequate for most purposes. Useful rules to create these mnemonics from the original word will be found in Brisbin and Ediger (1967). An alternative which is used with success for a file on Foraminifera is to use the first three letters of the generic name plus the first three letters of the species name, followed by the total number of letters in the word. Experience has shown that duplicate meanings of symbols occur very rarely. These cases then are easily handled. Whatever the system of encoding, recording of the data is done by hand for most geological files. This represents a heavy and costly part of the work. The system should be devised so as to make that recording as easy as possible. Also, extra care must be taken to facilitate debugging errors, which have an incredible tendency to creep in at every step of the procedure. Therefore, some redundancy must be kept in the data, so that at least a part of the errors may be automatically checked. Whatever the system of editing, however, the perpetual struggle against errors is the plague of almost all data file building. Before attempting to write specially tailored programs, the possibility of using generalized data management systems such as SAFRAS should be thoroughly investigated. The
advantages of such systems have been reviewed in the section on "Data management systems". The greater flexibility of input, the time saved in implementing the system, the ease of changing organization, the compatibility with other files, are important assets. At all times, obtaining and keeping consistency of the recorded data should be a primary preoccupation. It is important therefore to test this consistency by comparing input sheets obtained independently on the same object and recorded by different geologists. All discrepancies should be scrutinized. Provision must be taken for incomplete information, e.g., if the recording of the stratigraphic age is by eras and periods, how should the indication "Lower Paleozoic" be recorded? If we record it as Cambrian, we are introducing more information than we actually know, a thing which is clearly not permitted; on the other hand, by recording Paleozoic, we are losing the information "Lower". One way out is to provide for recording of an "Upper" and "Lower" limit in all cases. If the age is exactly known, both limits will be the same and we will thus have an indication of precision. It is important to minimize the initial time necessary to implement the system. Not too much time should be devoted at the start in trying to conceive the recording of all possible cases. Instead, places for comments written in free form should be provided, for recording exceptional cases. If these exceptions become too numerous or if these comments prove useful for retrieval or other purposes, it will be time to introduce them as data at a later stage. Today, technology allows this change with minimal revision of the system, and the technique of providing space for comments in free format is to be highly recommended. Procedures to present the data should be developed at an early stage. Efforts to obtain clear, uncryptic presentation of these data, are as important as good input systems. 
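The convention suggested earlier in this section for incomplete information, namely recording both a "Lower" and an "Upper" limit for the stratigraphic age, can be sketched as below. This is a minimal sketch under assumptions: the class name `AgeRange` is hypothetical, only the periods of the Paleozoic are listed, and "Lower Paleozoic" is taken, by one common convention, to run from Cambrian to Silurian.

```python
from dataclasses import dataclass

# Periods of the Paleozoic from oldest to youngest (illustrative subset only).
PALEOZOIC_PERIODS = ["Cambrian", "Ordovician", "Silurian",
                     "Devonian", "Carboniferous", "Permian"]


@dataclass(frozen=True)
class AgeRange:
    """Stratigraphic age recorded as a lower (oldest) and upper (youngest) limit."""
    lower: str
    upper: str

    def __post_init__(self) -> None:
        # list.index raises ValueError for a period name that is not listed.
        if PALEOZOIC_PERIODS.index(self.lower) > PALEOZOIC_PERIODS.index(self.upper):
            raise ValueError("lower limit must not be younger than upper limit")

    @property
    def is_exact(self) -> bool:
        """Both limits coincide when the period is exactly known."""
        return self.lower == self.upper

    @property
    def span(self) -> int:
        """Number of periods covered; 1 means the age is exactly known."""
        return (PALEOZOIC_PERIODS.index(self.upper)
                - PALEOZOIC_PERIODS.index(self.lower) + 1)

    def may_include(self, period: str) -> bool:
        """True if the recorded range does not exclude the given period."""
        position = PALEOZOIC_PERIODS.index(period)
        return (PALEOZOIC_PERIODS.index(self.lower) <= position
                <= PALEOZOIC_PERIODS.index(self.upper))


lower_paleozoic = AgeRange(lower="Cambrian", upper="Silurian")  # assumed extent
devonian = AgeRange(lower="Devonian", upper="Devonian")
```

Recorded in this way, "Lower Paleozoic" neither claims Cambrian nor loses the word "Lower", and the width of the range serves as the indication of precision mentioned in the text.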
Other geologists will not use the system if it is too intricate. Users should have to do as little decoding as possible. Graphical presentations in maps, cross sections, and graphs of various kinds are especially attractive and should be developed as early as possible. Not only will these graphs make the system more acceptable to co-workers, but they will also be helpful in detecting errors and in improving the usefulness of the file. A golden rule is: never accumulate data for a long period of time without using them. All aspects of a data system should be developed more or less simultaneously: data recording, file building, editing, retrieval and use, since all functions are interdependent. Many failures can be accounted for by the fact that each phase was treated in isolation.

From protozoan to vertebrate files
The first automobiles were designed to look as much as possible like horse-drawn carriages. Most of today's computer-based geological data files are still conceived to mimic manual files. Too often, the conception of the file, its organization, is thought of as a one-step operation, as if the conventions could be established once and for all and the data accrued thereafter in an essentially static manner. In reality, in almost every case where sufficient experience has been acquired, the
process has proved to be dynamic. The organization of the file has grown in parallel with the file itself. Not only have new possibilities been discovered "downstream", but also, by a feedback process, more efficient or more complete input systems have gradually been introduced. The data file, its updating and processing programs, are part of a chain of processes, where a change in any link of the chain progressively creates a change in all other links. There is an interesting parallel here with organic evolution, and the lesson is to accept at the start that the system will indeed evolve. It is advisable therefore to devise, at the beginning, a management system as unspecialized as possible. User-oriented systems such as SAFRAS are particularly useful for this purpose. Somewhat like organic evolution, geological data handling will go through several stages of increasing complexity:
  data kept in a note book;
  manual file of temporary use;
  manual file of permanent use (e.g., plain cards);
  peek-a-boo, edge-punched cards;
  punched cards (normally with fixed format);
  magnetic tape file, fixed or variable format;
  data base of interrelated and cross-referenced computer-based files;
  idem, with the use of a generalized data management system;
  network of such data bases, with development of common standards;
  idem, important volumes of data being exchanged through tele-communication channels.
A data file is just a tool, like a microscope or a hammer. A hammer, by itself, has never broken a stone; but it prolongs the arm and its presence influences the manner of working in the field. Likewise, the data file will eventually renew the approach to geological problems. There is an important difference with the hammer or the microscope, however. These tools are suited for the individual, whereas data files are essentially the result of team work. They are the first team tool of geology.
Success or failure will depend in large part on the team's spirit of unity.

APPENDIX

List of visited organizations
During travels in Europe and a six weeks' journey in North America, the author had the occasion to meet with numerous specialists and to visit a series of organizations, viz:

Canada
  Calgary: Mobil Oil; Gulf Oil; Imperial Oil; Chevron Standard; Shell Canada; Computer Data Processing; Canadian Stratigraphic Service; Geophysical Service Inc.
  London: University of Western Ontario
  Ottawa: Geological Survey of Canada
  Toronto: Department of Mines and Northern Affairs
  Vancouver: Rio Tinto Canadian Exploration Ltd.; Amax Exploration Inc.; Cominco Ltd.
  Winnipeg: University of Manitoba; Department of Natural Resources
Czechoslovakia
  Prague: Geofond
France
  Nancy: Centre de Recherches Pétrographiques et Géochimiques (C.R.P.G.)
  Orleans: Bureau de Recherches Géologiques et Minières (B.R.G.M.)
  Paris: Ecole Nationale Supérieure des Mines; Institut Français du Pétrole
Italy
  Milan: Università degli Studi; AGIP Nucleare
The Netherlands
  Haarlem: Rijksgeologische Dienst
  Leiden: Rijksmuseum voor Geologie en Mineralogie
  The Hague: Royal Dutch Shell
U.S.A.
  Dallas: Atlantic Richfield Co.; Mobil Research and Development Corp.
  Denver: U.S. Geological Survey; Colorado School of Mines; Petroleum Information Corp.
  Houston: Rice University; Amoco Production Company; Geological Consulting Services; Information Processing Corp.; Bonner and Moore Computing Company; NASA Manned Spacecraft Center
  Menlo Park: U.S. Geological Survey
  San Francisco: Standard Oil Company of California
  Syracuse (N.Y.): Syracuse University
  Washington: U.S. Geological Survey; American Geological Institute; Smithsonian Institution; Bureau of Mines

REFERENCES

(Only the papers cited in the text are referenced here. A thorough bibliography on geological data storage and retrieval will be found in Hruska and Burk (1971), supplemented by Burk (1972a).)

Addison, Ch.H., Coney, M.D., Jones, M.A., Shields, R.W. and Sweeney, J.W., 1969. GIPSY, General Information Processing System: Application description. Univ. Okla. Inf. Sci. Ser., Monogr., IV: 127 pp.
Agterberg, F.P. and Robinson, S.C., 1971. Mathematical problems in geology. Preprint of paper presented at IAMG-ISI Joint Session, Washington, D.C., August 17th, 1971. 28 pp.
Berner, H., Ekstrom, T., Lilljequist, R., Stephansson, O. and Wikstrom, A., 1972. Geomap - A data system for geological mapping. Proc. Int. Geol. Congr., 24th, Montreal, 1972, 16: 3-11.
Brisbin, W.C. and Ediger, N.M., 1967. A national system for storage and retrieval of geological data in Canada. Nat. Adv. Comm. Res. Geol. Sci. (available from Geol. Surv. Canada), 175 pp.
Burk Jr., C.F., 1971. Computer-based geological data systems - An emerging basis for international communication. Proc. World Pet. Congr., 8th, 2: 327-335.
Burk Jr., C.F., 1972a. Computer-based storage and retrieval of geoscience information: Bibliography 1970-72. Can. Centre Geosci. Data (in press).
Burk Jr., C.F., 1972b. Development of a national computer-based network of basic information on Canadian mineral deposits. Can. Mining J., 93(4): 34-38.
Cogeodata, 1972. Recommendations, Doc. No. 33 (available soon through Dr. C.F. Burk Jr., Cogeodata Secretary, Canadian Centre for Geoscience Data, 601 Booth Street, Ottawa K1A 0E8, Ontario, Canada).
Creighton, R. and Crockett, J.J., 1971. SELGEM: A system for collection management. Smithsonian Inst., Inf. Systems Div., 2(3): 1-24.
Dixon, C.J., 1970. Semantic symbols. Math. Geol., 2(1): 81-88.
Federal Advisory Committee on Water Data, 1971. Design Characteristics for a National System to Store, Retrieve and Disseminate Water Data. U.S. Dept. of the Interior, Geol. Survey, Office of Water Data Coordination, Washington, D.C., 31 pp.
Germeraad, J.H. et al., 1972. A computer-based registration system for geological collections. Scripta Geologica, Leiden, 9.
Griffiths, J.C., 1970. Current trends in geomathematics. Earth-Sci. Rev., 6: 121-140.
Hruska, J. and Burk Jr., C.F., 1971. Computer-based storage and retrieval of geoscience information: Bibliography 1946-69. Geol. Surv. Canada, Pap., 71-40: 52 pp.
Hubaux, A., 1970. Description of geological objects. Math. Geol., 2(1): 89-95.
Hubaux, A., 1972a. Dissecting geological concepts. Math. Geol., 4(1): 77-80.
Hubaux, A., 1972b (Editor). Geological data files. Survey of international activity. Codata Bull., 8: 30 pp.
International Union of Geological Sciences, 1970. Committee on Storage, Automatic Processing and Retrieval of Geological Data, Report for 1970. Geol. Newsletter, 1970(4): 386-390.
Kelly, A.M., 1972. Recommended standards for recording the location of mineral deposits. Can. Centre Geosci. Data, Geol. Surv. Canada, Pap., 72-9: 8 pp.
Kremp, G.O.W., 1972. Palynologic data bank. Geotimes, 17(4): 30.
Laffitte, P., 1969. La codification sémantique en informatique géologique (Semantic coding in geological information processing). Ann. Mines, 12: 75-83.
Laznicka, P., 1972. The University of Manitoba file of world's nonferrous metal deposits (Manifile), its content and use. Dept. Earth Sci., Univ. Manitoba, 62 pp.
Robinson, S.C., 1972. Data standardization in geology. To be published in: Geoscience Information Society Symposium, Washington, D.C.
Routhier, P., 1969. Essai Critique sur les Méthodes de la Géologie (de l'Objet à la Genèse). Masson, Paris, 202 pp.
Stauft, D.L., 1968. Computer applications in an oil exploration company. Can. Pet. Geol. Bull., 16(1): 64-86.
Sutterlin, P.G. and De Plancke, J., 1969. Development of a flexible computer-processable file for storage and retrieval of mineral deposits data. Proc. Symp. Decision-making in Mineral Explor. II, Univ. British Columbia, Feb. 1969, pp. 11-42.
Sutterlin, P.G., 1971. The design of computer-processable information, decision-making in the mineral industry. Can. Inst. Min. Metallurgy, Spec. Vol., 12: 399-403.
Sutterlin, P.G. and Cooper, M.A., 1972. SAFRAS - A generalized data base management system (in preparation).
UNESCO, 1971a. Intergovernmental Conference of Experts for Preparing an International Geological Correlation Programme, Final Rep. UNESCO, Paris, SC/MD/28, 52 pp.
UNESCO, 1971b. UNISIST, Synopsis of the Feasibility Study on a World Science Information System. UNESCO, Paris, SC,70/D.74/A, 92 pp.
Van Eysinga, F.W.B., 1970. Stratigraphic terminology and nomenclature; a guide for editors and authors. Earth-Sci. Rev., 6(4): 267-288.
Wynne-Edwards, H.R., Laurin, A.F., Sharma, K.N.M., Nandi, A., Kehlenbeck, M.M. and Franconi, A., 1970. Computerized geological mapping in the Grenville Province, Quebec, Canada. Can. J. Earth Sci., 7(6): 1357-1373.

(Accepted for publication January 18, 1973)