CLONEPLACER: a software tool for simulating contig formation for ordered shotgun sequencing

CLONEPLACER: a software tool for simulating contig formation for ordered shotgun sequencing

SHORT COMMUNICATION CLONEPLACER: A Software Tool for Simulating Contig Formation for Ordered Shotgun Sequencing GAUTAM B. SINGH* AND STEPHEN A. KRAWET...

356KB Sizes 0 Downloads 46 Views

SHORT COMMUNICATION CLONEPLACER: A Software Tool for Simulating Contig Formation for Ordered Shotgun Sequencing GAUTAM B. SINGH* AND STEPHEN A. KRAWETZ-I":I:'§'¶ '1 *Electrical and Computer Engineering, College of Engineering, Wayne State University, Detroit, Michigan 48202, and Departments of tMolecular Biology & Genetics and tObstetrics and Gynecology, §C S. Mott Center for Human Growth and Development, and the ¶Center for Molecular Medicine and Genetics, Wayne State University School of Medicine, 253 CSMC, 275 East Hancock, Detroit, Michigan 48201 Received May 16, 1994; revised October 26, 1994

This c o m m u n i c a t i o n d e s c r i b e s a s o f t w a r e t o o l t h a t e n a b l e s o n e to s i m u l a t e l a r g e - s c a l e r e g i o n a l m a p p i n g u s i n g an o r d e r e d s h o t g u n s e q u e n c i n g a p p r o a c h . T h e analysis routines that are provided yield an estimate of t h e d e p t h of c o v e r a g e o f t h e p h y s i c a l map, t h e l a r g e s t c o n t i g f o r m e d , a n d t h e n u m b e r o f g a p s rem a i n i n g at a n y g i v e n j u n c t u r e in t h e project. A det a i l e d l i s t i n g d e s c r i b i n g t h e s p a n of e a c h c o n t i g w i t h i n t h e p h y s i c a l m a p is a l s o p r e s e n t e d . This p r o v i d e s an a priori m e a n s o f e s t i m a t i n g t h e r e s o u r c e s t h a t will be r e q u i r e d to u n d e r t a k e a n y m e g a b a s e m a p p i n g o r seq u e n c i n g project. C L O N E P L A C E R p r o v i d e s t h e m u c h n e e d e d g u i d e to d e r i v i n g t h e o p t i m a l s t r a t e g y . © 1995 Academic

Press, Inc.

Ordered shotgun sequencing (OSS) (3) has recently been proposed as an alternative to the random shotgun approach (2) to increasing sequencing throughout. OSS initially utilizes short sequence information typically ranging from 500 to 1000 bp from the ends of a series of large subclones contained within the project. This enables one to create a physical map based on the overlaps that are detected. Sequencing of the region of interest now relies only on sequencing of a series of defined overlapping fragments, achieving maximal nonredundant sequence. The rate of project completion using OSS can be defined by asking the following question: what is the expected frequency at which two random clones with known end sequences will overlap? The corresponding probability that two random subclones will overlap and form a contig is small, given that only the end sequence from each subclone is determined. This contrasts with the higher likelihood for overlap using the traditional random shotgun strategy (2, 6, 7), in which contig formation is based on the DNA sequence of the entire subclone. Mathematical modeling of two random subclones 1 To w h o m c o r r e s p o n d e n c e s h o u l d be a d d r e s s e d . Telephone: (313) 577-6770. I n t e r n e t : s t e v e @ c o m p b i o . m e d . w a y n e . e d u .

possessing an average Poisson-distributed length of shown in Fig. la, obtained from the larger clone of interest of size #, predicts a strategy-dependent probability of overlap among the subclones. For example, the probability of contig formation in the absence of repetitive DNA using a random shotgun sequencing strategy is simply equal to the overlap probability of 2h/# since the entire sequence of both subclones is known. In contrast, when the OSS strategy is employed, the simple overlap between two clones, A and B, shown in Fig. lb is not sufficient to result in the formation of a contig. One of the four configurations of the two clones A and B shown in Fig. lc must be encountered in order for two clones to form a contig. Therefore, the probability that any two clones will form a contig during OSS will be equal to the probability of randomly selecting two clones that assume one of the four configurations shown in Fig. lc. If we assume that both clones A and B have an average length of k, then the probability of observing both Type I and Type II overlaps shown in Fig. lc is given by Pob. . . . . overlap :

(ll + r l +12 +r2) X

[1]

It is reasonable to expect that Type I contigs will form twice as frequently as Type II contigs since the probability of generating multiple random clones with the same average length that span the same approximate map area renders the two Type II configurations indistinguishable. Accordingly, the probability that two end-sequenced clones will overlap and that this overlap will result in a contig shown in Fig. lc is Pcontig-formation = Poverlap X Pobserve-overlap

2k x (ll + ri -~- 12 -~ r2) p k

2(ll + rl + 12 + r2) #

[2]

This equation provides an overestimate of the probability of forming a contig. However, it is unlikely that two clones will overlap at both ends simultaneously since 555

GENOMICS 25, 555--558 (1995) 0888-7543/95 $6.00 Copyright © 1995 by Academic Press, Inc. All rights of reproduction in any form reserved.

556

SHORT COMMUNICATION

m

m

Average S u b - c l o n e L e n g t h ( P I : 100kb, Cosmid: 40kb, Rasmid: 5 k b etc.)

:

A

:



Sequenced Ends (500 bp)

i

m

!

m

B

ll

i

ll

!

i

~'pe I Cont~.g

i

Type II Coneig

A"

NB



K

I*ii





K

J'!

..,



A

iI

El i•

~mm

i

Needed Sequence Overlaps = 25bp



B

~( Resulting Contlg Formed

F I G . 1. C o n t i g f o r m a t i o n b a s e d on e n d s e q u e n c e o v e r l a p i n ord e r e d s h o t g u n s e q u e n c i n g (OSS). (a) A clone w i t h only t h e 500-bp e n d s e q u e n c e s a v a i l a b l e . (b) Two clones, A a n d B, w i t h no o v e r l a p i n t h e end s e q u e n c e s do n o t l e a d to f o r m a t i o n of a contig. T h e s i g n i f i c a n t o v e r l a p of a t l e a s t 25 bp m u s t occur b e t w e e n two clone e n d s to r e s u l t in contig f o r m a t i o n . (c) One of t h e four p o s s i b l e c o n f i g u r a t i o n s s h o w n c a n r e s u l t i n contig f o r m a t i o n .

the fragment size is Poisson distributed. If we assume that both clones have the same length of sequence ~1 and c2 generated at the 5' and 3' ends, respectively, then upon substituting ll = le = cl and rl = re = ee in Eq. [2] above, the probability of contig formation is 4 ( ( 1 + C2)

#

[3]

The probability of forming a contig is nearly zero (0.002-0.004), given that the size of s e q u e n c e d ends ~1 and c2 is approximately 5 0 0 - 1 0 0 0 bp and the comparatively large size of # of - 1 Mb. However, it is of note that the probability of overlap is independent of the length of the clones that constitute the sequenced group. A theoretical analysis of stochastic processes needed to model the OSS strategy analytically is mathematically intractable. Accordingly, computer simulations can provide far better insight into the process of contig formation, yielding the expected distribution of the contig lengths and singleton clones over the entire map.

The CLONEPLACER tool described in this communication is built upon this model and assists one in defining the project resource needs by simulating OSS prior to project implementation. Simulation parameters can be varied in accord with the individual laboratory's expertise and their effect on the predicted success of map completion. This assists one in devising the optimal project strategy for megabase mapping and sequencing efforts. C L O N E P L A C E R was developed to simulate the viability of the OSS approach to the sequencing of YAC clones, utilizing the plasmid subclone library approach (4, 5). This strategy can be also used for subclones constructed in P1, phage, cosmid, or plasmid vectors. The simulation program input consists of the expected size of the physical map, the sizes of the inserts of the subclones, and the lengths of the sequenced ends. We have a s s u m e d that sequenced ends range in size from 0.5 to 1.0 kb (5). Typical results obtained using this analysis are p r e s e n t e d in Figs. 2 and 3, which show the coverage of the physical map by contigs containing two or more clones, the expected Poisson distribution of the various sized contigs (1), and their relative locations. C L O N E P L A C E R performs the simulation of OSS by allowing the user to specify the size of the region being mapped. It then prompts the user for insert size. The lengths of inserts accommodated within the commonly used P1, cosmid, phage, and plasmid subclones can be selected from the menu. In addition, custom insert sizes can be specified. The size of the sequenced ends defaults to a Poisson distribution t h a t deviates with a mean of 500 bp. This can be adjusted to a different value to best reflect the individual data set. The m i n i m u m overlap required to establish the placement of two clones as neighbors SelectCloningVectorType(Default= [P]) : [P] PI Vector(Average Size= I00 KB) [C] CosmidVector(Average Size= 40 KB) [G] PhageVector(Average Size = 21 KB) [L] PlasmidVector (Average Size= 5KB) [X'] Specifya differentsizedclone Cloning Vector :[P* C G L X] L EnterPercentage of Map Coverage Needed(e.g. 95) TerminateSimulationif Map Coyer'age= [default=95,0%] Using Poisson Deviate for Randomly generatingclone size Map Size = 200000 bp Average Clone Size = 5000 bp

# of clones LargestContig 50

3 Clones

Coverage

Gap Count

50.832

9

100 14Clones 83.758 8 ........................................................................... 150 38 Clones 94.406 3 ........................................................................ 200 67 Clones 96.108 3 ....................................................................... Print IndividualContig Data ? (default = y) n F I G . 2. C o v e r a g e of a 200-kb YAC clone by a l i b r a r y of p l a s m i d subclones.

SHORT COMMUNICATION

557

Specify Map Size in kb 760 Select Cloning Vector Type (Default = [P]) [P] Pl Vector (Average Size = 100 kb) [C] Cosmid Vector (Average Size = 40 kb) [G] Phage Vector (Average Size = 21 kb) [L] Plesmid Vector (Average Size = 5 kb) [X] Specify a different sized clone Cloning Vector : [P* C G L X] C Enter Percentage Map Coverage Needed (e.g. 95) Terminate Simulation if Map Coverage = [default---95.0%] 60 Using Poisson Deviate for randomly generating clone sizes Map Size = 760000 bp Average Clone Size = 40000 bp # of clones

LargestContig

Coverage

Gap Count

20 0 Clones .......................................................................... 40 0 Clones ................................................................................... 60 2 Clones 27.431 2 ............................................................................ 80 2Clones 58.643 6 .......................................................................... 100 2 Clones 63.857 5 ................................................................. Print Individual Contig Data ? (default = y) y Individual Contig Profde D a t a 199631,279299 = 7966gbp Coverage By Contig # 1 = 0.105 (2) 288928,368453 : 79525 bp Coverage By Conti8 # 2 = 0.105 (2) Coverage By Contig # 3 = 0.104 (2) 202855,282198 = 79343 bp 272182,312396 Coverage By Conti8 # 4 = 0.0529 (2) = 40214 bp 726275,759999 Coverage By Conti8 # 5 = 0.0444 (4) = 33724 bp Coverage By Contig # 6 = O.106 (2) 579364,659658 = 80294 bp Coverage By Contig # 7 = 0.0537 (3) 491412,532203 = 40791 bp 235396,314847 Coverage By Cootig # 8 = 0.105 (2) = 79451 bp 125503,165719 Coverage By Contig # 9 = 0.0529 (2) = 40216 bp 400909,480020 Coverage By Contig # 10 = 0.104 (3) = 79111 bp 158067,237941 Coverage By Contig # 11 = 0.105 (2) = 79874 bp Number of Unpaired Clones = 74

#8(2)

#i0(3)

#3(2)

#7(3)

#5(4)

#i(2) 0

152

304

456

608

760 kb

I

I

I

I

I

1

#2(2)

#6(2)

#4(2) #11(2)

FIG. 3. Coverage of 760-kb YAC clone by a library of cosinid subclones. The largest contig (No. 6) spans 80.294 kb of the YAC map and contains two subclones. Contigs 4, 5, 7, 9, and 10 contain at least one Type II overlap. After 100 subclone ends are sequenced, 63% of the map is covered.

has a default value of 25 overlapped bases with 80% i d e n t i t y which can also be varied. Figure 2 shows the results of the OSS simulation for a 200-kb region t h a t is represented as a set of 5-kb plasmid subclones. The simulation demonstrates t h a t from 200 sequenced ends, the largest contig formed contains 67 clones, covering 97% of the map. The gap count shows t h a t 2 regions were not covered by any of the contigs. At this juncture, the project would consist of 67 clones in 17 contigs with 12 unpaired clones and 3 gaps. Other methods of closure would be advised. The application of CLONEPLACER to a 760-kb YAC is shown in Fig. 3. The parameters were adjusted to achieve 60% coverage of a 760-kb YAC using a library of cosmid subclones. A 25-bp overlap from clones with 500 bp of end sequence was assumed. In this case, 100

randomly selected cosmid clones should cover more than 60% of the YAC. Approximately 11 contigs would be formed, and 74 clones would not be paired with any other clone. A gap count of 5 was estimated, representing five regions that would not be covered by any clones. It is interesting to note that as the map coverage increases, so does the gap count. This results from the placing of a new contig within the map. This bisects the original gap between two previously existing singly gapped contigs. Thus, with respect to the original gap, the number of gaps will increase while the total gap length, i.e., the sum of two smaller gaps, will decrease. Intuitively, this analysis suggests that it may be desirable to apply a hierarchical approach to sequencing large YAC clones. One could first use the OSS strategy to generate the physical map with cosmid or P1 sub-

558

SHORT COMMUNICATION

clones. Further application of the OSS strategy could then be applied, such that minimally overlapped cosmids are subcloned into plasmids mapped by OSS in preparation for sequencing. This hierarchical strategy for sequencing large regions should markedly improve the efficiency and rate of completion of any sequencing project. ACKNOWLEDGMENTS The authors acknowledge the support of the Office of Research and Sponsored Programs (ORSP), Wayne State University, for providing the funding to initiate research in Computational Genomics through a Research Stimulation Grant. The authors gratefully acknowledge Research Equipment Grant EDUC-US93015 from Sun Microsysterns to S.A.K. and FMRE 1-064 research support to S.A.K. CLONEPLACER was implemented in C++ and can be obtained by sending an electronic mail message to [email protected]. The simulation run time on a Sparc-l+ systems ranged from 1 to 10 min based on the size of the map.

REFERENCES

1. Balding, D., and Torney, D. (1991). Statistical analysis of DNA fingerprint data for ordered clone physical mapping of human chromosomes. Bull. Math. Biol. 53: 853-879. 2. Bankier, A. T., and Barrell, B. G. (1983). Shotgun DNA sequencing. In "Techniques in the Life Sciences," Vol. 35, pp. 1-34, Elsevier, Ireland. 3. Chen, E., Schlessinger, D., and Juha, K. (1993). Ordered shotgun sequencing: A strategy for integrated mapping and sequencing of YAC clones. Genomics 17: 651-656. 4. Hoheisel, J., and Lehrach, H. (1993). Use of reference libraries and hybridization fingerprinting for relational genome analysis. F E B S Lett. 325: 118-122. 5. Hunkapiller, T., Kaiser, R., Koep, B., and Hood, L. (1991). Large scale and automated DNA sequence determination. Science 25: 59-67. 6. Lander, E. S., and Waterman, S. (1988). Genomic mapping by fingerprinting random clone: A mathematical analysis. Genomics 2: 231-239. 7. Szybalski, W. (1993). From double helix to novel approaches to the sequencing of large genomes. Gene 135: 279-290.