A new method for correlation of multiple stratigraphic sequences

A new method for correlation of multiple stratigraphic sequences

PII: Compurers & GeoscierlcesVol. 23, No. 6, pp. 697-700. 1997 0 1997 Elsevier Science Ltd. All rights reserved Printed in Great Britain 0098-3004/97...

364KB Sizes 3 Downloads 61 Views

PII:

Compurers & GeoscierlcesVol. 23, No. 6, pp. 697-700. 1997 0 1997 Elsevier Science Ltd. All rights reserved Printed in Great Britain 0098-3004/97 $17.00 + 0.00 SOO98-3004(97)00046-O

SHORT NOTE A NEW METHOD FOR CORRELATION OF MULTIPLE STRATIGRAPHIC SEQUENCES IAIN M. BROWN Department of Computer Studies, University of Glamorgan, Pontypridd CF37 IDL, Wales, U.K.

(e-mai[: IBrown@,glam.ac.uk) I Received I7 November 1996; revised 17 February 1997)

INTRODUCTION

comparison, because the elevated cost increases the likelihood of a unique solution. Unfortunately, for multiple correlation, computing time increases exponentially with the number of sequences compared, and the amount of available memory rapidly diminishes too. This is because, whereas pairwise correlation is based upon a 2-D matrix, multiple sequence alignment involves a multi-dimensional “hyperlattice” (Fig. l(B)) and the recursive procedure used by dynamic programming to determine the minimum cost path for each node now has to visit (2” - 1) previous nodes for an n-dimensional alignment, rather than just three previous nodes for pairwise correlation. Recent research in biocomputing has concentrated on reducing the search space within the hyperlattice (cf. Fuellen, 1996, for a good review), principally by assuming that the optimal alignment path will be contained in a “polyhedron” around the main diagonal. A practical solution to creating this polyhedron is to develop a heuristic alignment which provides limits (the so-called Carillo-Lipman bounds) on each face of the hyperlattice (i.e. each pairwise alignment). and then to project these limits within the polyhedron.

Correlation of stratigraphic sequences between boreholes remains one of the more fundamental problems facing the geologist, especially in locations where similar lithologies are cyclically repeated in vertical sequence (cyclothems) as these cycles are usually incomplete and laterally variable. Manual correlation is subjective and time-consuming, making a more objective automated approach an attractive proposition. Techniques have been developed to match the traces from wireline logs (e.g. Olea, 1994), but these have only been applied to pairs of logs. This note demonstrates how advances in dynamic programming can now allow several lithological logs to be compared at a speed which makes the computation procedure practical. The dynamic programming algorithm was originally adapted from linguistics analysis and biocomputing to geological sequences by Smith and Waterman (1980) with subsequent modifications by Waterman and Raymond (1987). Its main advantage is that it can readily accommodate gaps in the stratigraphy as a result of erosion (unconformities) or non-deposition. The basic technique involves pattern-matching between strings (e.g. ACBDECAE; CBABDEAE) with one sequence merged into the other via three basic operations: match, insertion, and deletion. By associating costs with each of these operations, and then both minimising the operations required and the resulting cost, an optimal correlation can be determined which, by considering all of the options, represents a global solution. The process can be constructed graphically, by a traceback through a matrix of cumulative cost to give the minimal cost pathway (Fig. l(A)). However, in pairwise correlation the optimal path is frequently not unique, with bifurcations occurring at certain points (Fig. l(A)). The potential for these degeneracies could be reduced or even eliminated by a multiple correlation of several sequences, as a geologist would do with a manual

THE MSA PROGRAM We have adapted for geological purposes the MSA (Multiple Sequence Alignment) program for developed at the US National Centre Biotechnology Information. Unlike nearly all of the other genetics programs created for this purpose. this program still attempts to determine an optimal rather than a heuristic solution, although this cannot always be guaranteed, and an optimal solution may take some time longer to determine. In general terms, MSA first estimates a heuristic correlation of sequences and then by comparing this with each series of optimal pairwise alignments derives the bounds (epsilon values) for the search space 691

Short Note

698

trace-back J

A

B D

DC

DAE

B

C

BA

log 2 2-D matrix for pairwise correlation showing trace-back of optimal minimal path through transference net of local minimum costs. (8) 3-D “hyperlattice” with optimal path.

Figure 1. (A)

(Lipman, Altschul, and Kececioglu, 1989; Gupta, Kececioglu, and Schaffer, 1995). The heuristic employed by MSA is to determine positions where all of the sequences agree and force these into alignment, then progressively fill in the intervening segments by maximising the similarity between as many sequences as possible. However, other heuristics may be used, including techniques such as genetic algorithms, or the epsilon values may be set explicitly. Another feature of MSA is that, by default, it builds up the series of pairwise alignments into a tree by aligning the most similar pair first and then proceeding upwards to the least similar. This allows each pairwise alignment to be weighted when projected for optimal multiple alignment, hence reducing the bias caused when two or more similar sequences are included in the series. OBSERVATfONS

As the original MSA program was designed for genetics, some modifications are required and a variety of other options improve performance. Firstly, the costs table must be re-designed for the lithological units present in the boreholes to be correlated; the standard format for this is explained in the program documentation. As with the original pairwise dynamic programming method, unless a similar area with known correlations is available to finetune the values, estimating the costs of matching individual units tends to rely on existing knowledge of the relative likelihood of such a match occurring. Gap costs in MSA are based on an atline model with an initial gap cost plus additional costs per character; other methods could be devised but this seems to work well for our purposes. An option

flag (-c) in the MSA syntax allows a user-supplied cost file to be used, but if the costs are hard-coded within the program it runs faster. An example output from MSA is given in Appendix A for five borehole segments from the Northumberland coalfield (U.K.), an area where correlation of coal seams between logs proves problematical. The -g flag in the command-line syntax is an option to cost terminal gaps the same as internal gaps; this is necessary for geological sequences, otherwise all the gaps tend to be forced to either end. The final epsilon values should not exceed the maximum epsilon values assigned, although this occurs frequently on an initial run, and it is advisable that the program is then re-run with the new maximum values. The MSA program works best with sequences of approximately equal length; as yet, incorporating fragmentary sequences with more complete sequences proves impractical in time and memory usage. The program is particularly effective for stratigraphic sequences demarcated at the top and bottom by marker beds (i.e. units having a distinct identity throughout the area). In such instances, because the heuristic alignment automatically aligns the first and last element in each sequence, it tends to narrow the search space to a more manageable extent. If known marker beds also exist within each of the sequences to be compared, another MSA option allows these to be fixed during the correlation which helps to increase accuracy. Unlike the pairwise dynamic programming algorithm, our adaptation of MSA can be criticised because it currendy does not include unit thickness in the calculation, as this would significantly increase the calculation time. However, the discrimi-

699

Short Note

natory power of multiple correlation is such that this variable tends to be less important with several sequences, except in those situations where there are great variations in unit thickness. Our results show that MSA performs well with up to six sequences, but above that the heuristic often does not constrain the search space enough to allow a solution in reasonable time, depending on how diverse the sequences are. A compromise must therefore be made in which one can either correlate a few long sequences ( 100 units or more) or several short sequences. Nevertheless, by comparison with the established stratigraphy, the general accuracy of the correlations suggest that programs such as MSA couldprovide a useful quick analysis tool, providing a first-pass estimate of stratigraphic variations for further investigation, and they deserve more development in the future. Furthermore, if the current trend in increased computing power continues as predicted (e.g. with parallel processors), it may be expected that this methodology will soon become more viable for larger databases and longer sequences.

Acknowledgments-The

borehole data from the Northumberland coalfield was kindly made available by the British Geological Survey. REFERENCES

Fuellen, G. (1996) A gentle guide to multiple alignment, version 2.02: http://www.techfak.uni-bielefeld.de/bcd/ Curric/MulAli/mulali.html. Gupta, S. K., Kececioglu, J. D. and Schaffer, A. A. (1995) Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment. Journal of Computational Biology 2(3), 459-472.

Lipman, D. J., Altschul, S. F. and Kececioglu, J. D. (1989) A tool for multiple sequence alignment. Proceedings of the National Academy of Science 86, 4412-4415.

Olea, R. A. (1994) Expert systems for automated correlation and interpretation of wireline logs. Mathemutical Geology 26(8), 879-897.

Smith. T. F. and Waterman, M. S. (1980) graphic correlation techniques. Journal 88(4), 451-457. Waterman, M. S. and Raymond, R. (1987) game: new stratigraphic correlation

New stratiof Geology

The match algorithms,

Mathematical Geolog,v 19(2), 109-127.

APPENDIX A MSA Program

The MSA program can be obtained through the World Wide Web at the URL: http://ibc.wustl.edu/msa.html. It is designed to run under UNIX and requires several MB of memory. A man page is available for extra assistance and a sample genetics file to test the program. After that all you need for an initial correlation is to supply your own cost and sequence files. Several web-based guides to sequence alignment for genetics now exist: in addition to Fuellen (1996) a good interactive example of optimal path alignment in a 3-D lattice using a Java-based tool is available at the URL: http://oleander.techfak.uni-bielefeld.de:808O/java~biomoo/visu.html. Sample Output

In this example, five sequences are correlated with the -g option to penahse should be run again without the heuristic alignment and using a file containing some of them are higher than the maximum values, but in this instance the same Key: C = coal, E = seaearth, S = sandstone, L = siltstone, M = mudstone, I

terminal gaps. In practice, the program the new epsilon values (-e option), as optimal correlation results. = ironstone.

ibrown:> msa -g -c costfile.txt pegswood.txt ________________

On On On On On On On On

the the the the the the the the ***

left: left: left: right: right: left: right: right:

Tree

given

Internal Node Internal Node Leaf #l Leaf #4 Internal Node Leaf #2 Leaf #3 Leaf #5

from

ancestor

Distance Distance Distance Distance Distance Distance Distance Distance

Heuristic Multiple Alignment

*23514 C-S--CE--S-MMI-SCE-SM-MCEL-MELM-SCEM-SCEMSC CEMEMCESMS-MSECMC---M-SCESCM-SMCS--M-SC--MC CESCSME-MSCMSE-MCE--M-SCES-M-SMC---M-SCM C-SMSCM-MS-MMS-SCSCSMSMC-S-MESM-SC-SESC--SC CES---E--SCSLE-SCE----SCEL-SELSES--L-MC-LSC

-_______________

to to to to to to to to ***

parent parent parent parent parent parent parent parent

= = = = = = = =

262.75 46.75 130.75 182,25 77.75 147.17 71.83 0.00

Short Note

700

t**

Optimal

Multiple

Alignment

l

**

C---SCE-SMMISCE-SM-MCELM-EL-M-SCEMSCEMSC CEMEMCE-SM--S-M-SECMC-MSCESCMSMCSMSC--MC C--ESC--SMEMSCM-SE-MCEMSCES-MSMC-MSCMMMC C-SMSCMMSMMSSCSCSMSMC-SM-ES-M-SCSESC--SC CESESC--SLE-SCE-S---CELS-EL-SES--LMC-LSC

Costfile: Alignment Delta:

coal-matrix cost: 7324 800

Lower bound: Max. Delta:

Pair. Cost Proj. Cost Sequences 352 1 402 2 312 13 352 342 14 364 344 298 15 286 2 3 296 420 350 2 4 338 2 5 404 370 3 4 422 360 310 3 5 368 4 5 436 Elapsed time = 180.136

6524 820

Epsilon Max.Epsi. 50 50 40 50 22 30 44 46 10 28 70 50 66 50 52 50 50 50 50 68

Weight 1 1 4 2 4 1 2 1 2 2

Weight*Cost 402 352 1456 688 1184 420 808 422 720 872