GENOMICS 4, 114-128 (1989)

Sequencing of Megabase Plus DNA by Hybridization: Theory of the Method

RADOJE DRMANAC, IVAN LABAT, IVAN BRUKNER, AND RADOMIR CRKVENJAKOV

Genetic Engineering Center, P.O. Box 794, 11000 Belgrade, Yugoslavia

Received April 8, 1988; revised September 2, 1988

A mismatch-free hybridization of oligonucleotides containing from 11 to 20 monomers to unknown DNA represents, in essence, a sequencing of a complementary target. Realizing this, we have used probability calculations and, in part, computer simulations to estimate the types and numbers of oligonucleotides that would have to be synthesized in order to sequence a megabase plus segment of DNA. We estimate that 95,000 specific mixes of 11-mers, mainly of the 5’(A,T,C,G)(A,T,C,G)N8(A,T,C,G)3’ type, hybridized consecutively to dot blots of cloned genomic DNA fragments would provide primary data for the sequence assembly. An optimal mixture of representative libraries in M13 vector, having inserts of (i) 7 kb, (ii) 0.6 kb genomic fragments randomly ligated in up to 10-kb inserts, and (iii) tandem “jumping” fragments 100 kb apart in the genome, will be needed. To sequence each million base pairs of DNA, one would need hybridization data from about 2100 separate hybridization sample dots. Inevitable gaps and uncertainties in alignment of sequenced fragments arising from nonrandom or repetitive sequence organization of complex genomes and difficulties in cloning “poisonous” sequences in Escherichia coli, inherent to large sequencing by any method, have been considered and minimized by choice of libraries and number of subclones used for hybridization. Because it is based on simpler biochemical procedures, our method is inherently easier to automate than existing sequencing methods. The sequence can be derived from simple primary data only by extensive computing. Phased experimental tests and computer simulations increasing in complexity are needed before accurate estimates can be made in terms of cost and speed of sequencing by the new approach. Nevertheless, sequencing by hybridization should show advantages over existing methods because of the inherent redundancy and parallelism in its data gathering.

0888-7543/89 $3.00 Copyright © 1989 by Academic Press, Inc. All rights of reproduction in any form reserved.

INTRODUCTION

Recently, there has been a surge of interest in mapping and sequencing the entire human genome (Lewin, 1986; Wada, 1987; Smith and Hood, 1987). This stems from the fact that only 1 in about 75 human genes is either cloned or mapped (Human Gene Mapping 9, 1987). Unknown genes will have much to tell us about human biology. In the future, the progress of studies on molecular evolution may depend on the sequencing of genomes of species besides Homo.

Rapid progress has been achieved in several areas. A linkage map of the human genome based on cloned DNA probes detecting RFLPs has been obtained (Donis-Keller et al., 1987). Once mapped, a gene can be approached from a neighboring DNA marker not only by walking (Cross and Little, 1986) but also by the use of jumping (Collins and Weissman, 1984; Poustka et al., 1987) and linking (Poustka and Lehrach, 1986) libraries. However, the task of going from a marker to a mapped gene could be facilitated immensely if an ordered collection of overlapping cosmid or phage clones representing individual chromosomes were available. Two approaches have been tried. The first is based on overlapping the clones by similarities in their patterns of restriction digests (Coulson et al., 1986; Olson et al., 1986; Kohara et al., 1987). In the second approach, Lehrach has proposed the hybridization of a collection of 100 specific oligonucleotides to an array of 3-10 X 10” cosmid-containing colonies on filters. The resulting patterns of hybridization identify specific regions along the genome to which a smaller collection of cosmids from chromosome libraries can be fitted in the second step (Poustka et al., 1986; Craig et al., 1986; Michiels et al., 1987).

In the area of human genetics, the emphasis is on individual DNA and the methods to detect the patterns of its variation and inheritance that determine a patient’s chances for health and disease. The number of genetic regions to be scored in the DNA of an individual requires a large number of polymorphic probes and makes the use of traditional Southern blotting impractical. However, a method that is capable of amplifying 1000-bp stretches of DNA starting from two flanking small oligonucleotide primers and that requires DNA from only 150 cells of an individual has been described recently (Saiki et al., 1986, 1988). Oligonucleotide probes can detect sequence changes in amplified DNA in dot blot hybridization. The procedure is easily automated to test hundreds of patients for many genetic regions simultaneously (Saiki et al., 1988).

Both Lehrach’s method of ordering cosmid libraries and the method for amplifying DNA are based on the work of Wallace showing that small oligonucleotides require perfect homology in their target DNA to hybridize at all (Wallace et al., 1979). Once obtained, the genome sequence removes the need for ordered overlapping clone libraries and gives numerous polymorphic probes. Since these subsidiary tasks can be solved by oligonucleotide hybridization, it was natural to consider this approach as a way to accomplish sequencing. (H. Lehrach (personal communication) has independently considered the idea of sequencing by hybridization.)

We have explored in detail the concept of a procedure for sequencing complex DNA based on mismatch-free hybridizations of oligonucleotide probes (ONPs) to target DNA (Wallace et al., 1979). The numbers of ONPs, lengths and types of cloned fragments, numbers of clones, and numbers of hybridizations necessary for sequencing were obtained using probability calculations, computer analysis, and some known facts about mammalian genomic DNA organization. The results indicate that sequencing by hybridization may be more efficient for DNA approaching genome size than computer-robotized Maxam-Gilbert or Sanger sequencing methods (Wada, 1987; Smith and Hood, 1987; Maxam and Gilbert, 1977; Sanger et al., 1977).
THEORY

Basic Idea

The ONP hybridization under certain conditions unequivocally determines the sequence. The mismatch-free hybridization of ONPs containing from 11 to 20 monomers allows this (Wallace et al., 1979; Thein and Wallace, 1986). A single mismatch lowers the melting point by 5-10°C, which is sufficient to eliminate less than perfect hybrids. For hybridizations in the presence of 3 M tetramethylammonium chloride, the melting point of the hybrid is independent of the GC content (Wood et al., 1985). The determination of sequence by hybridization requires that all necessary probes be made and all hybridizations performed, even for sequencing of a single clone. This comes from the fact that the subset of ONPs showing hybridization to the DNA fragment being sequenced is not known.

Sequencing by hybridization (SBH) of a clone can be considered as consisting of two steps: a process of dissolving the clone’s sequence into all its constituent oligonucleotide N-mers, and the back assembly of N-mers by overlap into an extended sequence. It is easy to visualize that hybridization of all possible N-mer ONPs to a clone determines the N-mer oligonucleotide subset contained in the sequence of the clone and is therefore analogous to the first step. The second step cannot be accomplished in the general case. This is because some information is lost in the process of dissolving the sequence. The quantity of information lost is proportional to the length of a clone being sequenced. However, if sufficiently short clones are used, their sequence can be unambiguously determined.

In general, positively hybridizing N-mer ONPs are ordered and the sequence of the target DNA is determined using (N-1)mer overlapping frames between the ONPs. This process of sequence assembly will be interrupted wherever a given overlapping (N-1)mer is encountered two or more times (Figs. 1A and 1B). Then either of the two N-mers differing in the last nucleotide can be used in extending the sequence. This branching point does not allow further unambiguous assembly of sequence.

To derive the theoretical explanation of SBH it is necessary to introduce the definition of a parameter having to do with sequence organization: the sequence subfragment (SF). SF is any part of the sequence of a clone that starts and ends with the (N-1)mer repeated two or more times within the clone sequence. SFs are sequences generated between two points of branching in the process of assembly of the sequences in SBH (Fig. 1C). The sum of SFs is longer than the sequence because of overlapping short ends. Generally, SFs cannot be assembled in a linear order without additional information (Fig. 1D), since they have shared (N-1)mers at their ends and starts. However, when there are only three SFs they can be ordered unambiguously. The three SFs are a single internal SF and two end SFs that can be recognized by their content of vector sequences.

Different numbers of SFs are obtained for each DNA fragment depending on the number of its repeated (N-1)mers. The number depends on the value of N-1 and the length of the clones. We have used probability calculations to estimate the interrelationship of the two factors (Fig. 2; Appendixes A1 and A3). For reasons of space we do not give the mathematical theory and the resulting equations but restrict ourselves to a graphic presentation of the results (Fig. 2). Also, a computer program that is able to form SFs from the content of N-mers for any given sequence was developed. There is a relatively good agreement between the theory and the computer simulation for a random DNA sequence. For instance, for N-1 = 10, a clone of 1500 bp will have an average of three SFs. However, because of the dispersion around the mean, a library should have inserts of 500 bp so that less than 1 in 2000 inserts have more than three SFs.

Thus in an ideal case of SBH determination on a long DNA of random sequence, it is only necessary to use a representative library with sufficiently short inserts. For such inserts it is possible to reconstruct their individual sequences by SBH. The entire sequence is obtained by overlapping of individual insert sequences. In reality, this simple scheme would not work for several reasons. For instance, mammalian DNA is far from randomly organized. It contains repetitive and structurally and functionally selected sequences. The libraries are to varying degrees unstatistical. Also there are practical limits to the numbers of individual clones and ONPs that can be used. The ONP hybridization will show some degree of error. However, by judicious choice of probes and libraries the SBH can be used to determine the sequence of real DNAs up to mammalian genome size as described below.

FIG. 1. Generation of SFs in SBH and problem of their ordering. (A) The sequence of a hypothetical clone. NNNNNNN, ends of vector sequence. AGTCCCT and TTGGCTG are the only oligonucleotides 7 bp or longer repeated within the depicted sequence. (B) Formation of SFs. Assuming that the content of 8-mers for the depicted sequence is known, these 8-mers are ordered by maximal overlap, in this case 7 bp. Starting from the 5’ 8-mer (NNNNNNNc), ordering is unambiguous up to gAGTCCCT, which on its 3’ end contains a repeated 7-mer. Both AGTCCCTc and AGTCCCTg can be overlapped with gAGTCCCT, preventing further ordering. Each of the two sequences serves as a starting point for new ordering (not shown). Therefore, each repeated sequence 7 bp or longer represents a branching point. Unambiguous sequences are obtained between two consecutive branching points only. (C) Listing of SFs formed from 8-mers of depicted sequence. SFs are horizontally displaced to indicate overlap; the orientation is 5’ to 3’ and end SFs are identifiable. (D) The SFs cannot be unambiguously ordered into a starting sequence without additional information. Both arrangements shown are possible since SFs AGTCCCTcggTTGGCTG and AGTCCCTgatTTGGCTG have the same 7-mers at their 5’ and 3’ ends, respectively.
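The formation of SFs described under Basic Idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' program: the function names are hypothetical, hybridization is assumed to be error-free, only one strand is considered, and the 60-bp test sequence is a made-up example modelled loosely on Fig. 1 (the two vector ends are arbitrary placeholders). SFs are built by extending each N-mer through (N-1)-mer overlaps until an (N-1)-mer is reached that does not have exactly one predecessor and one successor.

```python
from collections import defaultdict

def spectrum(seq, n):
    """Set of N-mers present in seq, i.e. the +ONPs under ideal hybridization."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}

def subfragments(nmers):
    """Assemble sequence subfragments (SFs) from a set of N-mers.

    Extension by (N-1)-mer overlap stops at every (N-1)-mer that is a
    sequence end or a branching point. Isolated cycles are ignored here.
    """
    successors = defaultdict(list)   # (N-1)-mer -> N-mers starting with it
    n_predecessors = defaultdict(int)  # (N-1)-mer -> number of N-mers ending with it
    for probe in nmers:
        successors[probe[:-1]].append(probe)
        n_predecessors[probe[1:]] += 1

    def is_simple(overlap):
        # exactly one way in and one way out: extension is unambiguous
        return (n_predecessors.get(overlap, 0) == 1
                and len(successors.get(overlap, [])) == 1)

    sfs = []
    for overlap in list(successors):
        if is_simple(overlap):
            continue                 # SFs start only at ends or branching points
        for probe in successors[overlap]:
            sf, nxt = probe, probe[1:]
            while is_simple(nxt):
                probe = successors[nxt][0]
                sf += probe[-1]
                nxt = probe[1:]
            sfs.append(sf)
    return sorted(sfs)

# Hypothetical clone: AGTCCCT and TTGGCTG are the only repeated 7-mers.
PARTS = ["GCATGCACTGG", "AGTCCCT", "CGG", "TTGGCTG", "TAGCA",
         "AGTCCCT", "GAT", "TTGGCTG", "ACTTACGTAC"]
clone = "".join(PARTS)
# Same parts with the two internal stretches (CGG and GAT) exchanged.
swapped = "".join(PARTS[:2] + [PARTS[6]] + PARTS[3:6] + [PARTS[2]] + PARTS[7:])

sfs = subfragments(spectrum(clone, 8))
```

With 8-mers this toy clone yields five SFs, and `swapped` has exactly the same 8-mer content as `clone`, which is the ordering ambiguity of Fig. 1D: the two internal SFs flanked by the same 7-mers cannot be placed without additional information.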
The Number of Probes for SBH

The number of probes is rigidly set by the requirement that N-mers having all possible sequences must be used in SBH. For the shortest N-mer known to be able to hybridize accurately, an 11-mer (Wallace et al., 1979; Wood et al., 1985), this means 4 million (4^11) probes. If not reduced, this number of probes makes SBH impossible to perform at the outset. In principle, the reduction of this number can be achieved in two ways: by decreasing the length of the hybridizing sequence N and by omitting a certain fraction of probes. The reduction in the length of the probes increases the number of SFs in a clone, i.e., reduces the length of the insert whose SFs can be unambiguously ordered (Fig. 2). For the reasons explained in Appendix A2, we think that 8-mer probes are the most practical choice for SBH. Since 11-mers are the shortest ONPs that can form stable hybrids, one can use as probes 64 member groups of ONPs of the type

(A,T,C,G)(A,T,C,G)N8(A,T,C,G), [(N2)N8(N1)],

where only one ONP in the group will be able to form a hybrid with an individual 11-mer target sequence. There are 65,536 such groups. Since all members of a group share the internal 8-mer sequence, the information about its presence or absence in target DNA can be obtained.

In real DNA of known GC content, the number of SFs depends not only on the length of the ONP but also on its base composition. Taking this into account, one can devise a collection of 50% more 11-mer ONP groups than the original 65,536, which allows the use of significantly longer inserts (Appendix A2). This translates into a significant reduction in necessary clones. Our list of 86,000 lacks 7000 probe groups which cannot be used if M13 phage is employed as a vector for libraries. Further problems are monotonous sequences defined as (N1)n, (N1N2)n, (N1N2N3)n, . . . , n > 8 bp; 7432 additional probes will be needed to measure their lengths up to 18 bp.

FIG. 2. Average number of SFs (Nsf) as a function of the length of DNA fragment (Lf, in kb) for various values of the length of the overlapping sequence (N - 1, in bp), or average distance of two consecutive identical N - 1 sequences in DNA subjected to SBH (A, in kb). The curves are obtained using equation [2] (Appendix A1).

The Choice of Libraries for SBH and the Number of Clones
The basic problem that SBH must address is the occurrence of repeated stretches within longer sequences. It arises from the principle of SBH that sequence is generated by overlap. For an N-mer ONP the overlap sequence is N-1 bases long. The repeats of the overlap lead to formation of SFs, as already mentioned. For longer repeated sequences the overlap becomes impossible when the unit repeat is longer than the cloned insert, and unconnected stretches of sequence called contigs (Staden, 1980) are generated. However, contigs can also arise due to the absence of certain sequences in libraries, mainly since they appear to be “poisonous” to the Escherichia coli host (Cross and Little, 1986; Coulson et al., 1986). To obtain maximally extended sequences containing repeats, SBH requires three levels of fragmentation of genomic DNA in the preparation of libraries to deal with each of these problems. The first level supplies the dense overlap of relatively short inserts. This library is necessary for the ordering of SFs. The use of 8-mer rather than 11-mer probes
allows a 40-fold reduction in the number of probes. However, 8-mer probes impose the requirement that the length of clones must be less than 200 bp for an unambiguous ordering of SFs in the vast majority of clones (Appendix A2). This is clearly impractical in terms of the numbers of clones such a library would have to contain and the subsequent use of such clones. Thus, the major problem that SBH must resolve is the ordering of SFs in libraries that can realistically be made and managed. We describe the solution to this problem in the following two sections, and here only summarize the requirements of the ordering library. It should have inserts 500 bp long and an average displacement of 40 bp. This amounts to 25,000 clones per million base pairs of genomic DNA being sequenced. Pooling (described below) allows a further 20-fold reduction in the number of samples from 25,000 to 1250 clones per million base pairs of DNA. It seems that the best way to form pools is random ligation of 500-bp fragments via adapters into “trains,” size-selected to 10 kb before cloning into M13 vector. This does not compromise the information potential of individual fragments. Such an ordering library of sufficient complexity covering haploid mammalian genome would contain 3.75 million clones and 12.5 genome complements. The second level of fragmentation, producing DNA pieces 7-10 kb long, in effect solves the problem of unit repeat being longer than overlap. Since the longest highly repetitive sequences in mammalian DNAs, so-called LINE 1 sequences, are 6 kb (Singer, 1982), the maximally extended contigs would be obtained with a library with insert length of 7 kb or more. We call the library fulfilling these criteria the basic library. It would consist of clones having inserts of 7-10 kb (7-kb clones) with an average displacement of 1.4 kb.
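The probe and clone numbers quoted so far follow from simple arithmetic, which the short sketch below reproduces. It is only a bookkeeping aid: the function and key names are hypothetical, and the haploid genome size of 3 x 10^9 bp is an assumption chosen because it is consistent with the quoted 3.75 million clones.

```python
GENOME_MB = 3000  # assumption: haploid mammalian genome of 3 x 10^9 bp

def clones_per_megabase(displacement_bp):
    """Clones per 10^6 bp when clone starts are, on average, displacement_bp apart."""
    return 1_000_000 / displacement_bp

def sbh_bookkeeping(probes_used=95_000, pool_size=20):
    ordering = clones_per_megabase(40)   # 500-bp inserts, 40-bp average displacement
    pooled = ordering / pool_size        # hybridization samples after pooling into trains
    return {
        "all_11mers": 4 ** 11,                        # every possible 11-mer
        "core_8mers": 4 ** 8,                         # every possible internal 8-mer
        "probe_reduction": 4 ** 11 / probes_used,     # roughly the quoted 40-fold
        "ordering_clones_per_mb": ordering,
        "ordering_samples_per_mb": pooled,
        "ordering_samples_per_genome": pooled * GENOME_MB,
        "ordering_genome_complements": 500 / 40,      # insert length / displacement
        "basic_clones_per_mb": clones_per_megabase(1400),  # 7-kb clones, 1.4-kb displacement
    }

numbers = sbh_bookkeeping()
```

Under these assumptions the sketch gives 4,194,304 possible 11-mers, 65,536 8-mer cores, 25,000 ordering clones and 1250 pooled samples per million base pairs, 3.75 million pooled clones per genome at 12.5 genome complements, and about 714 basic-library clones per million base pairs.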
More than 99% of the DNA would be cloned with 700 random 7-kb clones per million base pairs or 2.1 million clones per mammalian genome (Maniatis et al., 1982). The basic library answers the real need for sequenced DNA in the minimal number of clones for permanent storage and retrieval. Employed together, the ordering and basic libraries solve two additional problems for SBH-the more efficient ordering of SFs and the further reduction in the overall number of clones representing DNA to be sequenced. As described below the ordering of SFs and generation of sequence are performed by an algorithm comparing the contents of +ONP in the individual clones from the two libraries. Also, to calculate the numbers of clones in the basic library, we used a number of clones sufficient to cover 1 million bp or genome 4.7 times. This gives the probability that more than 99% of the sequences will be cloned (Maniatis et al., 1982). There is a greater than 99.99999% chance that the clones of the ordering library contain the DNA
missing in the basic library (Appendix A3). In this way, the number of clones in the basic library can be kept reasonably low. The third level of fragmentation is needed for bridging exceptionally long repetitive sequences, blocks of nearly identical, tandemly organized repeats and gaps of DNA not cloned in the libraries because of “poisonous” sequences (Cross and Little, 1986; Coulson et al., 1986) and other reasons. This missing information will create sequence contigs when only data from the ordering and basic libraries are used. The length of usable fragments should be at least an order of magnitude longer than that for the basic library since some satellite or minisatellite DNA is known to be very long (Jeffreys et al., 1985). However, a library with inserts of 100 kb and longer is useless for SBH, since our calculations show that all inserts of this size will have approximately the same content of ONPs. Thus, although cloning is possible in yeast artificial chromosome vectors (Burke et al., 1987), and the only way to obtain sequences in gaps between contigs is with large inserts, they cannot be used in SBH. However, SBH can provide the solution for the correct ordering of the contigs by the use of a “jumping” library (Collins and Weissman, 1984; Poustka et al., 1987) covering intervals of 100 kb. The representative library would need to have a complexity of 170 clones per million base pairs of DNA. This statistical representation ensures a very low chance that the pair of contigs will remain unlinked due to the absence of the particular jumping clone (Appendix A3). The sum of samples from the three libraries that would have to be hybridized is 2120 per million base pairs of DNA or about 6 million per mammalian genome. The need to use only 95,000 probes and hybridizations in parallel makes the described method a realistic proposition.

The Practical Solution for the Ordering of SFs
The problem of ordering of SFs can be solved by (i) the choice of libraries, (ii) the use of libraries containing densely overlapping clones, and (iii) a specific algorithm for assembling sequence of contigs by comparing positively hybridizing ONPs (+ONPs) contained in clones. In the section on Theory it was shown that the ordering of SFs can be accomplished by reducing the length of inserts in a library. An additional way of ordering is based on the idea that the sequence can be regenerated if one uses the information contained in the order of overlapped clones to replace the information lost in resolving the sequence into SFs. For instance, if two clones are known to overlap, the comparison of their +ONPs allows partitioning of +ONPs of an individual clone into two groups: one contained in the neighboring clone and coming from one part of the sequence, and the other not contained in the
neighboring clone and coming from the other part of the sequence. Now, the formation of SFs can take place in each of the two parts independently and not in the insert as a whole. By extending this principle to densely overlapped clones, one can see that a substantial subdivision of each insert is possible in an informational sense and that the problem of ordering SFs now applies to each of the sections separately. It helps to define a new sequence parameter representing units of these subdivisions: an information fragment of sequence (IF). IF is a part of a genomic DNA sequence between the two nearest ends, either 5’ or 3’, of the overlapped clones in simple or mixed libraries (Fig. 3). The numbers and sizes of IFs are not intrinsic properties of the sequence but are dependent on the mode of generation of genomic libraries and their complexity. Their number is twofold larger than the number of cloned genomic fragments, and the average size of IFs is half of the average displacement of cloned genomic fragments in the library. The mathematical treatment of the number of SFs per clone is easily extended to IFs (Appendix A3). The criterion for an IF suitably small for purposes of SBH is that it does not contain more than three SFs including the end ones. A library forming IFs 20 bp long on average rarely contains an IF longer than 300 bp because of the dispersion. Since IFs 300 bp long will contain an average of three SFs with 95,000 ONPs, the number of IFs with unordered SFs in such a library is negligible (Appendix A3). The ordering library fitting the description should have inserts 500 bp long and an average displacement of 40 bp. As mentioned above, this amounts to 25,000 clones per million base pairs of genomic DNA being sequenced. However, the final, more efficient scheme for ordering uses the information obtained from both the ordering and the basic libraries. The clones of the basic library define IFs as well.
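The definition of an IF can be made concrete with a small sketch. It assumes clones are given as half-open coordinate intervals on the genome and uses an idealized, perfectly regular library (a real library would be random, with the dispersion of IF sizes mentioned above); the function names are hypothetical.

```python
from statistics import median

def information_fragments(clones):
    """Split the cloned region into IFs.

    clones: list of (start, end) half-open intervals in genome coordinates.
    Every clone end, 5' or 3', is an IF border; an IF is the stretch between
    two neighboring borders, reported with the set of clones that cover it.
    """
    borders = sorted({pos for clone in clones for pos in clone})
    fragments = []
    for left, right in zip(borders, borders[1:]):
        covering = frozenset(i for i, (start, end) in enumerate(clones)
                             if start <= left and right <= end)
        if covering:                      # skip uncloned gaps
            fragments.append((left, right, covering))
    return fragments

def regular_library(n_clones, insert_bp, displacement_bp):
    """Idealized library: equal inserts, equal displacement between neighbors."""
    return [(i * displacement_bp, i * displacement_bp + insert_bp)
            for i in range(n_clones)]

ordering_clones = regular_library(100, insert_bp=500, displacement_bp=40)
ifs = information_fragments(ordering_clones)
if_lengths = [right - left for left, right, _ in ifs]
```

For 100 clones with 500-bp inserts displaced by 40 bp, this gives 199 IFs, about twice the number of clones, and the typical (median) IF is 20 bp long, half the displacement. Each IF carries the set of clones covering it, which is the information used to partition the +ONPs of a clone.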
The average length of the IFs of the basic library is 700 bp, which is too long for the purpose of ordering of SFs. Nevertheless, the SFs will be generated in IFs of the basic library. The 700-bp IFs are also used to obtain IFs of the ordering library (section on algorithm and Appendix A5). The basic library IFs will serve as a matrix for finding clones of the ordering library which overlap with them. In fact, the clones and IFs from the ordering library could be used for subdivision of IFs from basic library to 20-bp IFs. This could be done by the principle outlined above, except that the two libraries are used to give a clone each for pairwise comparisons.

A further refinement, which consists of random pooling of clones from the ordering library, is possible. As mentioned, the subdivision of 700-bp IFs is accomplished by finding their overlapping 500-bp clones. Correctly overlapped clones are found by comparisons of contents of +ONPs from a given 700-bp IF from basic library with 500-bp clones. The chances of false pairing are vanishingly small when the overlap is longer than 30 bp. A probability calculation indicates that the chance of a spurious match with a 700-bp IF is negligible not only for a single ordering clone but also for their random pool of at least 20 members (Appendix A3). In an informational sense, the presence of 19 unspecific clones in such a pool does not prevent the pairing of a specific ordering clone with its IF from the basic library. Thus, random pooling allows a 20-fold reduction in the number of individual hybridization samples from the ordering library. The hybridization data with the described probes using the basic and ordering libraries are sufficient for obtaining sequences of extended contigs. The sequences are formed from input data after extensive computation.

FIG. 3. Formation of IFs. Overlapping clones of cloned genomic fragments are represented as horizontal bars (a-f). The vertical lines represent borders of IFs. IFs are labeled with numerals (1-11). Note that all combinations of fragment ends participate in IF formation: IF4 is formed by 5’ ends of clones d and e; IF5 with 5’ end of clone e and 3’ end of clone a; IF6 with 3’ end of clone a and 5’ end of clone f; and IF8 with 3’ ends of clones b and c.

Algorithm for Obtaining the Sequence of Contigs from the Content of +ONPs of Clones from Basic and Ordering Libraries

The software consists of two parts. With the first, the SFs are formed in the 7-kb clones of the basic library. With the second, numerous cross comparisons of the contents of +ONPs are made between each of the pairs of clones from the two libraries.
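A cross comparison of +ONP contents can be illustrated with a toy simulation. The sketch below is not the authors' algorithm, for which they refer to Appendixes A3 and A5; it is a simplified stand-in that scores a pool of ordering clones against a 700-bp IF by the number of shared 8-mers. The random sequences, the 300-bp overlap, the seed, and all function names are assumptions made for illustration, and only one strand is considered.

```python
import random

def onp_content(seq, n=8):
    """Set of N-mers in seq, standing in for its +ONPs under ideal hybridization."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}

def pool_content(clones, n=8):
    """+ONP content of a pooled sample: the union over its member clones."""
    content = set()
    for clone in clones:
        content |= onp_content(clone, n)
    return content

def random_dna(length, rng):
    return "".join(rng.choice("ACGT") for _ in range(length))

def pooling_demo(seed=1):
    rng = random.Random(seed)
    genome = random_dna(2000, rng)
    basic_if = genome[500:1200]            # a 700-bp IF of the basic library
    overlapping = genome[900:1400]         # 500-bp ordering clone sharing 300 bp with it
    unrelated = [random_dna(500, rng) for _ in range(39)]
    specific_pool = [overlapping] + unrelated[:19]   # 1 specific + 19 unspecific clones
    unspecific_pool = unrelated[19:]                 # 20 unspecific clones
    target = onp_content(basic_if)
    return (len(target & pool_content(specific_pool)),
            len(target & pool_content(unspecific_pool)))

specific_score, unspecific_score = pooling_demo()
```

In this toy setting the pool containing the overlapping clone shares roughly three hundred or more 8-mers with the IF, whereas a pool of 20 unrelated clones shares only those that coincide by chance, on the order of a hundred. The gap suggests why a specific clone can still be recognized inside a random pool; the quantitative claims in the text rest on the authors' own probability calculation, not on this sketch.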
The frame on which these comparisons are organized is the order of overlapping 7-kb clones in the basic library obtained using the principles of the cosmid link-up method (Poustka et al., 1986; Craig et al., 1986; Michiels et al., 1987). It should be realized that this is a computing operation using data already obtained for sequencing. In the next stage, false associations from the cross comparisons are eliminated by numerous sifting and resifting loops, resulting in an unambiguous order of SFs in IFs of the basic library. This information easily leads to sequences of IFs, which are merged into sequences of extended contigs using information about the order of 7-kb clones obtained previously. The most efficient computing would use data from hybridizations of 1000 probes to the basic library to generate the order of 7-kb clones, from all 95,000 probes to generate SFs in IFs of the basic library, and from 10,000 probes from the ordering library to merge SFs of the basic library into sequence contigs. Further details of the algorithm and the principle for the reduction in number of probes necessary for hybridization to the ordering library are given in Appendixes A3 and A5.

Algorithm for Linking the Contigs
The basic and ordering libraries do not allow unambiguous determination of the entire sequence. The contigs will end on all repeated sequences or tandemly repeated units (minisatellites) of composite length greater than 7 kb. This also includes gaps of DNA not cloned in the libraries because of poisonous sequences and other reasons. We think the solution for the correct alignment of contigs in SBH is best provided by jumping libraries (Collins and Weissman, 1984; Poustka et al., 1987) covering intervals of 100 kb. As mentioned, the representative library would need to have a complexity of 170 clones per million base pairs of DNA. Jumping subclones would be hybridized with all probes. Their SFs would be generated by overlapping +ONPs. The linear order of contigs would be determined by linking pairs of contigs which contain a statistically significant number of SFs from single jumping clones. Although jumping clones would provide ordering of the great majority of the remaining contigs, definitive sequences between them would remain undetermined. In principle, to complete the sequence, long-range restriction maps (Smith et al., 1987), ordered phage, or cosmid or artificial yeast chromosome libraries of large DNA being sequenced (Coulson et al., 1986; Olson et al., 1986; Kohara et al., 1987; Michiels et al., 1987; Burke et al., 1987) could be used to pinpoint DNA for sizing and filling in the remaining gaps by standard sequencing.

Error Tolerance and Limitations of SBH
The error rate of the procedure includes false hybridization data, both positive and negative, regardless of the cause. Theoretically, a hybridization procedure in which 6% of the probes give totally false results with all clones and remaining probes yielding up to 50% error per clone, when false positives and negatives are randomly spread with respect to probe sequence and clones, would be sufficiently accurate to provide information to assemble almost the complete sequence. Tolerance of such a large error is possible because every nucleotide is read more than 40 times in the basic library only. The large majority of the information missing for each clone because of hybridization error is recovered from other clones densely overlapped with it (Appendix A4). The information that is missing for the complete sequence of contigs is the order of SFs within some
IFs, the placement of small differences in duplicated units within an IF to the correct unit, the sequences across the eventual gaps within IFs (mostly due to snap-back structures), and the lengths of monotonous stretches longer than 18 bp (Appendix A2) or the number of units in tandem repeats. In principle, all of these problems can be eliminated by decreasing the lengths of IFs using a larger number of train fragments (i.e., larger pools) in the ordering library, decreasing the number of SFs by using a larger number of longer probes, or by secondary subcloning from the pools of affected 7-kb clones followed by SBH, except for the problem of lengths of monotonous sequences. The measurement of lengths of long monotonous sequences requires the standard methods for restriction fragment length measurement and sequencing. It may be more appropriate to use those to solve some of the other problems mentioned rather than to increase the complexity of working libraries. On average, one can expect 12 IFs with unordered SFs and 0.3 monotonous sequence of undetermined length per million base pairs of DNA for the suggested combination of probes and libraries, after increasing the calculated result 10-fold because of the inexactness of the calculation (Appendixes A2 and A3). This order of magnitude factor is included to account for untreated parameters, nonrandomness of real DNA sequences, and unstatistical representation of libraries.

Biochemical Procedures
The obvious choice among several hybridization methods is dot blot filter hybridization (Kafatos et al., 1979; Wallace et al., 1979; Saiki et al., 1986). It offers an easy repetition in applying cloned DNA and scoring positive hybridization signals. It eliminates a need for replica plating as in plaque and colony hybridizations and allows easy removal of the difference in signal intensity due to the growth rate of recombinants. The family of single-stranded phage vectors exemplified by phage M13 (Messing and Vieira, 1982) seems the most suitable for preparing recombinant DNA. It can accommodate inserts of up to 14 kb in a stable manner (Barnes and Bevan, 1983), and it forms high titers of recombinant phage without lysing the infected cells. Therefore, it offers the possibility of eliminating contaminating bacterial DNA. Individual clone plaques can be propagated in small vessels, cells centrifuged, and supernatants containing sufficient numbers of recombinant DNA molecules dot blotted. The protein coat is easily removed by the addition of alkali in the applying solution. However, M13 is still untested as a vector for cloning of large random libraries. We propose the simultaneous hybridization of ONP mixes, most containing 64 different oligonucleotides. This is possible using tetramethylammonium chloride (Wood et al., 1985). Under these conditions the hybrid
ET
AL.
melting point does not depend on the GC content of the oligonucleotide probe. However, the sharp dependence of melting point on the oligonucleotide length may be blurred by the “end fraying” effect (Wood et al., 1985). SBH imposes different and much stricter requirements for dot blot hybridization than its previous uses, especially because 11-mer probes form hybrids that are at the very limit of stability. It is possible that some IO-mers hybridize and some II-mers do not hybridize under conditions optimal for 11-mer hybridization, although limited data do not support such a notion (Wallace et al., 1979; Wood et al., 1985). Since we expect that a mismatch would be tolerated only at the end positions of a hybrid (Wood et al., 1985; Thein and Wallace, 1986), the problem can be circumvented by choosing a criterion of a perfect match of the internal 8 bp. Hence, the choice is for the (N2)N8(Nl) type of probe instead of (N3)N8. If some of the llmers are unable to hybridize, the corresponding information is lost. This can be avoided, if necessary, using 12-mers of the type (N2)N8(N2) and hybridization under conditions optimal for 11-mers. It would raise the number of individual probes within groups to 256, leaving all other parameters unchanged. Conversely, if ONPs shorter than 11 nucleotides can hybridize accurately and reproducibly (Estivill and Williamson, 1987), one would use (Nl)N8(Nl) type probes. It has been suggested to us that there is a considerable problem in the lack of hybridization due to the presence of short snap-backs in immobilized target DNA. However, because of a small displacement between neighboring clones in our ordering library, we expect sequences of the majority of snap-back structures to be represented in insert ends unable to form duplexes. This would probably leave only a small number of clones that would require an additional treatment for this reason. 
Such an absence of hybridization due to snap-back structures, if unavoidable by optimizing binding of DNA to filter, will be recognizable as a gap in the sequence in the assembly of contigs. Those 7-kb clones that show gaps can be reapplied to a filter, but after the DNA has been sheared to an average size of 50 nucleotides and denatured. This would ensure that at least a fraction of DNA (≥40% for snap-backs of ≥20 bp) formerly within internal duplexes is now open and available for hybridization.
The problem of nonstatistical representation of libraries requires a comment. The estimate of the number of poisonous sequences is not available for inclusion in the calculation of the required number of clones. However, with the exception of a small number of completely unclonable sequences, one can expect that representing poorly clonable sequences at all would require libraries several times more complex than those indicated by the theory. This factor depends on the choice of vector and host as well as on the way the library is propagated (Coulson et al., 1986). An extrapolation of published data seems to indicate that poorly clonable sequences would not require that the libraries have a complexity more than three times higher than that of an ideal case (Fig. 4 of Coulson et al., 1986). Since overgrowth by fast replicating clones, which leads to loss of slow growers, represents the main problem, it can be solved by separate growth of individual clones immediately after transformation. Aliquots of diluted transformation mixes can be inoculated directly in individual microtiter wells. Construction of libraries from parts of genomic DNA (chromosomes, chromosome arms, etc.) can be used also for overcoming the problem of nonrandomness of libraries with the least number of clones possible. The degree of randomness of libraries for SBH can be tested by hybridization with 1000-2000 ONPs. The results would permit the determination of the number of clones that must be used from each good library in SBH and allow the elimination of those lacking a sufficient representation.
The procedures outlined above can be optimized using simple known probes and target DNAs. We are performing experiments along these lines.
DISCUSSION
When the method presented here is compared with the existing sequencing methods (Maxam and Gilbert, 1977; Sanger et al., 1977), its advantages and disadvantages can be seen. Because SBH differs in principle, there is an obvious need for such a comparison. Biochemically, the existing methods are based on partial cleavage of DNA molecules at every nucleotide, with a subsequent visualization of nested fragments having the same end after separation by gel electrophoresis allowing 1-bp resolution. SBH does not require enzymatic or chemical pretreatment of DNA or gel electrophoresis. Its basic biochemical step is hybridization with a subsequent visualization of hybrids. It requires chemical synthesis of a number of labeled ONPs. The subsequent steps of signal development and reading, in principle, could be the same for both approaches. This also holds for cloning procedures preceding the actual sequencing.
Informationally, the two are radically different. The standard methods are linear, sequential, determinate, and nonredundant, whereas SBH is random, parallel, indeterminate, and redundant. By linearity, we mean that sequencing information is gathered by reading base by base, proceeding from one end of the fragment being sequenced. Only one clone is sequenced at a time or up to 40 in a recent proposal (Church and Kieffer-Higgins, 1988). By determination, we mean that information about a nucleotide is determined directly in the sense that no coding-decoding process is interpolated. For a given sequencing run, every base is read only once, i.e., nonredundantly.
SBH reads random bits of linear sequence as determined by the order of specific probes. Random clones or their pools with sufficient overlap to represent a genomic DNA being sequenced are sequenced together in parallel. The method is indeterminate, since the reading does not specify the position and kind of nucleotide at the position in the sequence. The reading is in a binary code, + or −, which in itself has no meaning before decoding, which requires extensive computing. The primary information is redundant, since every base is read eight times when (N2)N8(N1) probes are used.
SBH, besides requiring rather a low accuracy of hybridization, is highly resistant to sampling error. Since every nucleotide is read at least 40 times, the statistical elimination of random errors is assured. In the standard methods this is the largest source of error, imposing the requirement of obtaining the sequence of a given stretch several times. The errors in sequence obtained by SBH are identifiable as gaps, or ambiguities, at defined positions. Between two such positions, the sequence should be highly accurate, much more so than in standard methods after the sequence has been read only two or three times.
There are three ways to determine the sequence of complex DNA by sequencing clones from a random library without further subcloning: SBH, standard sequencing from one end of an insert, and standard sequencing from both ends without leaving a gap in the middle. All three require sequencing of a similar number of clones, within a factor of 2. Furthermore, for our method, the introduction of a pooling procedure decreases the number of individual samples to be hybridized by 20-fold. Thus, the number of samples to be sequenced required in SBH is of the same order of magnitude as that of the standard methods even after the recent improvement by Church and Kieffer-Higgins (1988).
The argument based on the foregoing is that SBH of larger DNA fragments, chromosomal molecules, or genomes is far simpler than standard methods at the level of biochemical procedures and primary data gathering. However, SBH is more complex at the level of sequence assembly by computing from primary data. Qualitatively, the longer the sequence, the more important automation with exclusion of expert manual labor becomes in primary data gathering. Therefore, SBH may become more favorable than standard methods at some point.
There is a mutual dependence of the numbers of necessary probes as a function of the length of informative sequence of a group of ONPs and the percentage of detected sequences, on the one hand, and the number of hybridization dots (number of clones and pool size) required, on the other hand, for obtaining a DNA sequence by SBH. The probes and libraries recommended here represent only one of the possible choices. We think our choice is near the optimum. For further optimization, computer simulations which take into account specific properties of a given DNA sequence and various technological requirements are needed.
Only the most general and practical considerations can be outlined here. One should consider cost, speed of sequencing at the level of accuracy, and completeness. SBH requires experimental data about the oligonucleotide content of clones, in contrast to standard methods which require experimental determination of the position of each nucleotide. It is intuitively obvious which is experimentally easier, and hence cheaper and less labor intensive. The use of less labor is possible because extensive computing determines the position of each nucleotide from the experimentally obtained contents of oligonucleotides.
SBH described here requires 100,000 probes and as many hybridizations, with dots representing individual clones on filters. For a mammalian genome this is about 6 million clones from the three libraries that must be prepared by scientists beforehand. The low-volume, commercial cost of synthesis of all probes in quantities sufficient for many genomes is 2-7 million dollars. We cannot see any reason why the automated, computer-driven equipment for probe labeling, dot forming, hybridization, and optical reading of hybridization signals in SBH would cost any more than similar equipment for standard sequencing of DNA of comparable complexity in the megabase range. The quantity of filters required for SBH is similar to that for multiplex sequencing (Church and Kieffer-Higgins, 1988). Because unspecified nucleotides in probes must be used, SBH requires about 50 times more radioactivity than that of the standard methods of sequencing. This would be obviated if conditions for reliable hybridization of 8-mers could be found (Estivill et al., 1987).
SBH does not require gel electrophoresis, which represents a significant part of the cost of the standard methods. Presumably, the computing of sequence from the data would not require any further investment in equipment, only time on existing computers.
When all technological requirements are met, the speed with which a sequence can be obtained in SBH is on the order of 1 year. Hybridization is the time-limiting step of SBH: 1000 hybridization vessels operating simultaneously would require only 100 days to complete the 100,000 hybridizations, if one assumes that each hybridization takes 24 h. Allowing similar 3-month periods for probe synthesis and preparing phage supernatants, for making dots on filters and probe labeling, and finally for optical reading of results and computing, one arrives at an estimate of 1 year. Of course, this is the current minimal estimate. However, we think that much longer times will be impractical as well.
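The 100-day figure is simple arithmetic and can be checked with a short sketch. The function name and its parameters are illustrative only; the 100,000 probes, 1000 vessels, and 24 h per hybridization are taken from the text, and one probe per vessel per round is an assumption.

```python
import math

def hybridization_days(n_probes=100_000, n_vessels=1_000, hours_per_round=24.0):
    """Days needed to run every probe once.

    Assumes (hypothetically) that each vessel handles one probe per round
    and that all vessels run in parallel.
    """
    rounds = math.ceil(n_probes / n_vessels)   # sequential rounds of hybridization
    return rounds * hours_per_round / 24.0     # convert hours to days

if __name__ == "__main__":
    print(hybridization_days())
```

Under these assumptions, doubling the number of vessels halves the hybridization time, while the other three stages of roughly 3 months each are unaffected.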
All of the technological parameters mentioned are open to further improvement that might lead to very significant decreases in cost and time for SBH. For instance, simply by increasing the volume of hybridization vessels to accept more filters, the number of genomes simultaneously sequenced can be increased from one to several. Also, various schemes are being proposed for simultaneous detection of several labels in nucleic acids (Smith et al., 1986; Prober et al., 1987). This can lead to an order of magnitude reduction in the number of required hybridizations. A drastic cut in the cost of materials for SBH could be achieved by replacement of radioactivity with biotin (Agrawal et al., 1986), enzymatic, or fluorescence labeling (Urdea et al., 1988). We are working on these and other ideas that might put SBH within the range of large specialized laboratories.
It is not the purpose of this paper to predict explicitly the values of experimental, technological, and computing parameters that would be necessary for a reasonable estimate of the cost and speed of obtaining the sequence of a given length of DNA by the new method, but rather to provide the theoretical base for further work in this direction. However, the theoretical arguments brought forward already indicate that our method has advantages in sequencing human and other complex genomes.
APPENDIX A1
Calculation of the Number of SFs
It is important to determine the expected number of subfragments ($N_{sf}$). $N_{sf}$ is estimated using probability calculations assuming randomness of genomic DNA sequences. Oligonucleotide sequences (ONSs) follow a binomial distribution within a randomly formed long sequence of DNA. The average distance between two identical ONSs ($A$) is influenced only by the length of the ONS ($N$) and is given by $A = 4^N$. The probability of an ONS occurring $K$ times on a fragment $L_f$ bp long is given by
$$P(K, L_f) = C(K, L_f) \times (1/A)^K \times (1 - 1/A)^{L_f}, \tag{1}$$
where $C(K, L_f)$ is the number of combinations of $L_f$ elements taken $K$ at a time. It takes care of the fact that an ONS repeated $K$ times can be found anywhere along $L_f$ and not at only one location. $(1/A)^K$ reflects that the chance of $K$ occurrences decreases with increasing $A$ within a sequence of defined length, and $(1 - 1/A)^{L_f}$ signifies that for a given $A$, the chance of being repeated increases with the length of the sequenced fragment $L_f$. This is a simplified version of the complete equation and holds only when $L_f$ is much larger than $K$ and $N$.
The average number of different sequences of length $N$ bp or at average distance $A$ bp, which are repeated $K$ times on a fragment $L_f$ bp long, is given by the product $P(K, L_f) \times A$. If the ordering of positive ONPs is accomplished by using overlapping sequences of length $N - 1$, or at average distance $A_0$, the $N_{sf}$ on a fragment $L_f$ bp long is given by
$$N_{sf} = 1 + A_0 \times \sum_{K \ge 2} K \times P(K, L_f). \tag{2}$$
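Relations [1] and [2] can be evaluated numerically. The following is a minimal sketch, not the authors' program: the function names, the truncation of the sum at `k_max`, and the early-exit tolerance are assumptions introduced here, and $P(K, L_f)$ is computed in the simplified form of [1].

```python
from math import comb

def occurrence_probability(k, fragment_length, avg_distance):
    """Relation [1]: P(K, Lf) = C(Lf, K) * (1/A)^K * (1 - 1/A)^Lf (simplified form)."""
    p = 1.0 / avg_distance
    return comb(fragment_length, k) * p**k * (1.0 - p) ** fragment_length

def expected_subfragments(fragment_length, overlap_length, k_max=50):
    """Relation [2]: Nsf = 1 + A0 * sum over K >= 2 of K * P(K, Lf).

    A0 = 4**overlap_length is the average distance between identical
    overlap sequences in random DNA. The sum is truncated at k_max
    (an assumption) because the terms fall off rapidly when Lf << A0.
    """
    a0 = 4**overlap_length
    total = 0.0
    for k in range(2, min(k_max, fragment_length) + 1):
        term = k * occurrence_probability(k, fragment_length, a0)
        total += term
        if term < 1e-15:  # remaining terms are negligible
            break
    return 1.0 + a0 * total

if __name__ == "__main__":
    # 11-mer probes overlap by 10 bases; 1.5-kb fragment (roughly 3 SFs)
    print(expected_subfragments(1500, 10))
    # probe groups with an informative 8-mer overlap by 7 bases; 200-bp insert
    print(expected_subfragments(200, 7))
```

For these two parameter sets the sketch gives values slightly above three, in line with the averages quoted in the appendixes. It reproduces only the mean; the dispersion around it has to be estimated by simulation.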
Figure 2 shows $N_{sf}$ as a function of $L_f$ for various $N - 1$ or various $A_0$. One can see that using all possible 11-mers ($4 \times 10^6$), there will be approximately three SFs within every sequenced 1.5-kb fragment. In general, three SFs can be unambiguously ordered since end SFs that can be distinguished from other SFs are counted in [2] as well.
The average number of SFs occurring on a sequence of a certain length is not a useful parameter for the determination of the required length of clones. This is due to the dispersion around the mean. Even when the dispersion is relatively small, due to a large number of clones, there are a considerable number of clones whose content of SFs is even 10-fold larger than that based on the average value. A computer simulation on a random DNA sequence of 40% G + C was used to estimate dispersion. The results for an 11-mer ONP indicate that 16% of 1.5-kb clones would have more than three SFs, despite an average expectation of three SFs. This translates into 100 clones with unordered SFs per million base pairs of DNA. With the use of 0.5-kb inserts the number of clones without ordered SFs would be lowered 100-fold.
APPENDIX A2
Determining the Number and Type of Probes
The use of ordered groups of ONPs with a shorter informative sequence reduces the number of required probes and independent hybridizations. On the other hand, the number of individual ONPs within a group and the number of SFs within a clone being sequenced are increased (Fig. 2). Therefore, we think that an informative 8-mer sequence of the ONP group provides the best balance between these factors. Using [2] (Fig. 2), one can calculate that inserts of 200 bp will give, on average, three SFs, since these probe groups have a 7-bp overlap. Even with (N2)N8(N1) type probes, the ordering of SFs is inefficient with clones which must be shorter than 200 bp, due to the large numbers of required clones. Instead we propose the use of a combination of two libraries whose clones define IFs (Fig. 3) of appropriate length in which the order of included SFs can be unambiguously determined (Appendix A3).
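The trade-off between the number of probe groups and the number of oligonucleotides per group can be made concrete with a short sketch. It reads the notation (N2)N8(N1) as two unspecified bases, a specified informative 8-mer, and one unspecified base; the function names are illustrative and not from the paper.

```python
from itertools import product

BASES = "ACGT"

def expand_probe_group(core, n_left=2, n_right=1):
    """Yield every oligonucleotide in an (N{n_left})core(N{n_right}) mix.

    `core` is the informative sequence shared by the group; the flanking
    positions run over all four bases.
    """
    for left in product(BASES, repeat=n_left):
        for right in product(BASES, repeat=n_right):
            yield "".join(left) + core + "".join(right)

def group_size(n_left, n_right):
    """Number of individual oligonucleotides in one probe mix."""
    return 4 ** (n_left + n_right)

def number_of_groups(core_length):
    """Number of distinct probe mixes when every core of this length is used."""
    return 4**core_length

if __name__ == "__main__":
    mix = list(expand_probe_group("ACGTACGT"))    # one (N2)N8(N1) group
    print(len(mix), mix[0], mix[-1])
    print(group_size(2, 2), number_of_groups(8))  # (N2)N8(N2) mix size, count of 8-mer cores
```

Shortening the informative core by one base cuts the number of groups four-fold, while each added unspecified flanking position multiplies the size of every mix by four.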
With (N2)N8(N1) type probes, one can go from an ideal case of random DNA sequence to a nonrandom sequence XXXXXXXX arrangement in mammalian DNA (Setlow, 1976). Average distances of ONSs are obtained by multiplying measured dinucleotide frequencies and correction factors for connecting dinucleotides for consecutive dinucleotides contained within an ONS (Drmanac et al., 1986). The higher the AT content, the longer the common sequence in groups of probes should be. Taking this into account, instead of making all 65,500 (N2)N8(N1) probe mixes, it is possible to use all (N1)N10, where N10 are all 10-mers without C or G nucleotides, all (N1)N9(N1), where N9 are all 9-mers with 1 or 2 C + G, and all (N2)N8(N1), where N8 contains all 8-mers with 3 to 8 C + G. To link the described probes one needs additional ONPs: (N1)N9(C,G), where N9 are all 9-mers without C or G nucleotides; (N1)N7(A,T)(C,G)(N1), where N7 are all 7-mers with 2 C + G nucleotides; and (N2)N6(C,G)(C,G)(N1), where N6 are all 6-mers without C or G. This amounts to synthesis of about 93,000 probe groups.
Using the principle described above, an average $A_0$ value of identical overlapping ONS would be 30,000 bp compared with 16,000 bp for the (N2)N8(N1) probe (Fig. 2). To achieve this $A_0$ value in random DNA, one would need 130,000 probes of type (N2)N8(A,T) and (N2)N8(G,C). This gain, allowing a significant reduction in the number of necessary clones, more than compensates for the increase of 50% in the number of probes. Some additional ONPs are needed for genome compartments with extreme GC content.
With M13 as the vector of choice, about 7000 of the probes could not be used because of the complementarity to vector DNA. On the other hand, one needs additional probes to measure lengths of monotonous sequences, i.e., to determine numbers of tandemly repeated unit ONSs ((N1)n, (N1N2)n, (N1N2N3)n, ...) and to confirm the position of insert ends.
It is possible to measure precisely the lengths of monotonous sequences of up to 18 bp, which represent the vast majority of these sequences in mammalian genomic DNA. ONP groups of type (N1)N9-18(N1), where (N1) are unspecified bases and N9-18 are more repeats of 1- to 7-bp-long ONSs of total length from 9 to 18 bp, are needed for this purpose. Counting all possible monotonous sequences of this type, it is necessary to synthesize an additional 7432 probe groups. Depending on length, these probes require 10 different conditions for hybridization. Since hybridization in tetramethylammonium chloride allows precise measurement of the length of a hybrid of ±1 bp up to 20 bp (Wood et al., 1985), the use of unspecified base on probe ends eliminates this as a source of error. Since these sequences are about 7-fold more frequent in mammalian DNA than expected for a random sequence (Tautz et al.,
1986), the total expected number of monotonous stretches of 18 bp and longer defined above is 0.03 per $10^6$ bp, if they are not parts of repetitive sequences occurring more frequently.
The computing step of assembling SFs within a clone or an IF would provide sequences of two SFs coming from ends of the clone, each of which would contain appropriate vector 7-mer sequences. However, a more reliable determination of end SFs, i.e., distinguishing them from ends on both sides of a gap, is useful in determining the order of SFs, but not absolutely necessary. It can be accomplished by the use of 512 ONPs of type (N1)N10N4(N1) and (N1)N4N10(N1), where N10 represents sequences of the start and end of the M13 vector and N4 represents all possible 4-mers in both cases. The previously defined end subfragment, besides having the M13 end 7-mer, would have to show the 4-mer identified by positive hybridization of the (N1)N10N4(N1) or (N1)N4N10(N1) type probes. The positively determined 14-mer obtained in this way is sufficient for the unambiguous allocation of ends in the vast majority of clones.
Even with an additional number of ONPs that might eventually be needed, the total number of required probes would not exceed $10^5$.
APPENDIX A3
The Number of Cloned Genomic Fragments Necessary to Obtain a Complete Sequence from a Random Library by SBH. Pooling Limits
To obtain a complete sequence, it is necessary that every sequence be represented in at least one of the clones (or in a cloned genomic fragment (CGF)) in a library. The usual relation (Maniatis et al., 1982) for representative libraries is not applicable here, since it estimates the total amount of uncloned DNA. What one needs is an estimate of the number of uncloned sites which will create gaps. Each gap would require sequencing of a fragment containing it, which could be obtained from a new ordered or random library. If one starts from the assumption of random fragmentation of DNA and binomial distribution, one can obtain the relation
$$N_{pu} = N_c \, (1 - N_c/N_{bp})^{L_c}, \tag{3}$$
which gives the number of uncloned sites ($N_{pu}$) as a function of the number ($N_c$) and the length ($L_c$) of clones and the length of DNA being sequenced ($N_{bp}$). The factor $(1 - N_c/N_{bp})^{L_c}$ represents the probability of occurrence of a displacement of two neighboring clones longer than $L_c$, which is the physical situation when at least some DNA remains uncloned. Multiplying this probability with the number of clones (i.e., number of displacements), one obtains the expected number of $N_{pu}$.
Since standard gel sequencing from both ends of an insert without a gap is limited to about 1000 bp, the complete sequencing by standard methods from a random library requires 12,500 clones for $10^6$ bp of sequenced DNA. Only 0.043 uncloned site is expected on average with 12,500 clones of 1000 bp.
In SBH, the number of CGFs is determined by the requirement that the resulting length of IFs allow unambiguous ordering of SFs. With the use of the 93,000 ONP groups ($A_0$ = 30,000), an average of more than three SFs are obtained only with IFs longer than 300 bp (Fig. 2). However, because of the dispersion around the mean, many IFs shorter than 300 bp will contain more than three SFs. The total number of such IFs represents the expected number of IFs with unordered SFs ($N_{ifa}$). From [3], for $10^6$ bp represented with 25,000 CGFs of 0.5 kb
$$N_{ifa} = 2 N_c \times \sum_{L_{if}} \left[ (2 N_c/N_{bp}) \times (1 - 2 N_c/N_{bp})^{L_{if}} \right] \times P(>3\ \mathrm{SFs}/L_{if}) = 1.2. \tag{4}$$
The $2 N_c$ represents the number of IFs; $L_{if}$, the length of IF; $[(2 N_c/N_{bp}) \times (1 - 2 N_c/N_{bp})^{L_{if}}]$, the probability of occurrence of IFs whose length is equal to $L_{if}$ bp for a given number of clones ($N_c$); and $P(>3\ \mathrm{SFs}/L_{if})$, the probability of more than three SFs occurring on the length $L_{if}$. The value of the $P(>3\ \mathrm{SFs}/L_{if})$ factor was determined by computer analysis of a random 40% GC sequence. IFs having more than three SFs which can be unambiguously ordered due to the advantageous position of repeated ONSs were not included in the above estimate.
Since the feature of the method is that a usable pool of CGFs is not limited by the number of CGFs in a pool but by their aggregate lengths, the number of CGFs in a pool can be increased at the expense of their length. With 12,500 1-kb CGFs, $N_{pu}$ = 0.043 and $N_{ifa}$ = 28, which can be compared with $N_{pu}$ = 0.08 and $N_{ifa}$ = 1.2 for 25,000 CGFs of 0.5 kb. Hence, with this decrease in fragment size and increase in number of CGFs, the number of uncloned sites is increased only twice, but the number of ambiguities in the sequence is reduced more than 20-fold. This does not require an increase in the number of pools. Therefore, it is better to use fragments of 500 bp, especially when the basic M13 library with 7-kb inserts is employed as well. A further decrease in the length of individual genomic fragments in the ordering library lowers the efficiency of their overlap in formation of IFs.
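The uncloned-site estimates quoted for the two library designs follow directly from relation [3]. A minimal sketch is given below; the function name and the default of one million base pairs are illustrative choices, not part of the original text.

```python
def expected_uncloned_sites(n_clones, clone_length, dna_length=1_000_000):
    """Relation [3]: Npu = Nc * (1 - Nc/Nbp)^Lc.

    (1 - Nc/Nbp)^Lc is the probability that the displacement between two
    neighboring clones exceeds the clone length, leaving a site uncloned;
    multiplying by Nc counts the displacements.
    """
    return n_clones * (1.0 - n_clones / dna_length) ** clone_length

if __name__ == "__main__":
    # 12,500 clones of 1 kb versus 25,000 clones of 0.5 kb per 10^6 bp
    print(expected_uncloned_sites(12_500, 1_000))
    print(expected_uncloned_sites(25_000, 500))
```

Both designs cover the DNA 12.5-fold in aggregate. The shorter clones roughly double the expected number of uncloned sites, which is the comparison made in the text. Relation [4] is not sketched here because its factor $P(>3\ \mathrm{SFs}/L_{if})$ was obtained by simulation.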
The basic library would provide information about most sites not cloned in the 0.5-kb library. Using the relation for representation of libraries (Maniatis et al., 1982), in this case the probability of uncloned sequence would be $4 \times 10^{-6}$ for the first library and $8 \times 10^{-3}$ for the 7-kb library of 700 clones per $10^6$ bp of DNA. Total probability is $3 \times 10^{-8}$, meaning that only about 0.03 bp per million bases would remain uncloned, on average. Using [3], the representative jumping library should contain 170 clones for $10^6$ bp of DNA.
Informationally, the size of pooled DNA is limited by the number of used ONP groups, i.e., the length of the informative part of ONPs and the complexity of DNA being sequenced. We consider the value of 10 kb for the size of pooled DNA as realistic. IF0.5 (IFs generated in the ordering library) and groups of neighboring IF0.5s are defined within IF7 (IFs generated in the basic library) by finding sets of the nearest 25 +ONPs of 60 contained in IF7 on average, and obtaining the cross sections of these sets. IF7 contain 60 +ONPs on average when data from 10,000 ONPs are used. This is sufficient for ordering of SFs as shown below. In principle, such a reduction in the number of hybridizations with the ordering library is possible. However, because of the nonrandomness of libraries, the ordering library must provide information about sequences not cloned in the basic library. This means that it is probably useful to have complete sequencing data for this library as well.
The set of nearest 25 +ONPs defines a 300-bp-long sequence which is shared by several 0.5-kb genomic fragments. Since the average displacement of 0.5-kb fragments in the ordering library of the complexity we proposed is 40 bp, the 300-bp sequence is contained in 5 fragments, on average. The probability is 0.08 for finding target ONS within 10,000 bp for a given ONP.
There is a vanishing probability for more than 6 +ONPs to be found by chance in any of the $10^7$ pools required to cover a mammalian genome. However, there is a certain probability that one of 25 shared +ONPs found in 5 pools and in their IF7 is there by chance: $P = 25 \times 0.08^5 = 83 \times 10^{-6}$. Since ordering of SFs requires the definition of 5 sets of 25 +ONPs per IF7, the total number of these sets is $2 \times 10^7$ per mammalian genome. Multiplying this number by $P$ gives the mathematical expectation that 125 groups of neighboring IF0.5s per genome will have a wrong +ONP, while the chance is vanishingly small that 2 +ONPs are simultaneously wrong. Since SFs are marked with 3 +ONPs, on average, there is no appreciable chance of erroneous partitioning of SFs among groups of neighboring IF0.5s. The recommended composite size of inserts of 10 kb is not the limiting value for the pool size. However, only a computer simulation can finally determine the limiting value.
APPENDIX A4
On the Possibility of Omitting Some ONPs or Not Detecting Some ONSs
We have based our calculations on sequencing of single-stranded DNA. To sequence a double-stranded DNA, one does not need to synthesize both members of each pair of complementary ONPs, which reduces the number of ONPs required by half. However, the use of double-stranded DNA would impose the requirement of isolation of DNA from each subclone and the problem of contaminating bacterial DNA. The use of noncomplementary ONPs is also possible only for sequencing single-stranded DNA. In the single-stranded DNA case, individual clones would show gaps of undetermined sequence. However, these gaps would be read in a clone having the complementary sequence. In a good random library in which every sequence is represented 10 times on average, there is a considerable chance that every sequence will be cloned in both orientations. However, a part of the complementary pairs of probes (5000 pairs) must be used for overlapping of 7-kb clones and generation of IF0.5. In this way, the number of independent probes and hybridizations can be reduced to 50,000. In any case, 7000 ONPs that are represented in the extruded strand of the M13 vector can be omitted. In this way, cross hybridization with vector does not represent a problem for SBH.
Further reduction of ONPs either from the total number necessary for single-stranded sequencing or from the halved number needed for double-stranded sequencing is limited by the occurrence of gaps. Omitting the synthesis of a certain fraction $F$ of all possible probes, there is a chance that at some positions in the genome, the two nearest probes synthesized will fail to overlap at all or less than maximally. Thus, there is a probability that certain sequences of a given length would not be readable. The probability that there are sequences where $L$ would be the distance between the two nearest positive ONPs is given by
$$P(L) = F^L. \tag{5}$$
When $L$ is equal to the informative part of the ONP ($N$), the overlap becomes impossible and a gap in sequence occurs. The expected number of these gaps ($N_g$) with length equal to or longer than $L_g$ within a haploid mammalian genome is given by
$$N_g(\ge L_g) = 3 \times 10^9 \times F^{(N + L_g)}. \tag{6}$$
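Relation [6] is easy to evaluate for different omitted fractions. The sketch below is illustrative: the function and parameter names are not from the paper, and the defaults (an informative 8-mer and a haploid genome of $3 \times 10^9$ bp) are taken from the surrounding text.

```python
def expected_gaps(omitted_fraction, informative_length=8, min_gap_length=0,
                  genome_size=3e9):
    """Relation [6]: expected number of gaps of at least `min_gap_length` bases.

    A gap opens when `informative_length` consecutive probes are all missing
    (probability F^N, relation [5] with L = N); each further base of gap
    length costs another factor F.
    """
    return genome_size * omitted_fraction ** (informative_length + min_gap_length)

if __name__ == "__main__":
    print(expected_gaps(0.3))                            # 30% of 8-mer groups omitted
    print(3e9 / expected_gaps(0.5, min_gap_length=1))    # mean spacing of gaps >= 1 bp
    print(expected_gaps(0.06))                           # 6% of probes failing
```

With 30% of the probe groups omitted the sketch gives on the order of 200,000 gaps per genome, and with half omitted one gap of at least 1 bp roughly every 500 bp. At 6% the expectation drops below one gap per genome.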
Applying this relation, one can calculate that using 70% ($F$ = 0.3) of 8-mers ((N2)N8(N1)), about 200,000 gaps of 0-18 bases would occur. With only half of the 8-mers made there would be a 1-bp gap every 500 bp.
Equation [6] can be used to estimate the limits of error tolerance in data gathering by hybridization. The error can be either the consequence of wrongly detected ONS (false positives) or the absence of detection of ONS (false negatives). Since every base is read as many times as the length of the ONP or the common sequence in the group of ONPs used as probe, a large error can be tolerated. If certain ONPs fail to detect or wrongly detect complementary sequences in all clones, this might lead to gaps in sequence. However, the method can tolerate a considerable error of this kind. As can be shown by [6], if the fraction of ONPs that always err is kept below 6%, no gaps in sequence are expected for mammalian DNA. Of course, this applies to the use of noncomplementary probes. Otherwise, 6% of complementary pairs would have to always give false results for gaps due to this kind of error to occur.
SBH tolerates a large percentage of falsely positive and falsely negative hybridization data for individual clones. This is a result of comparisons of data with overlapped clones and the fact that each base is read eight times. Falsely positive ONPs are recognized as efficiently as those that fail to overlap with other ONPs. Practically, there is no chance of occurrence of eight falsely positive hybridizations with specific erroneous overlap with simultaneous false negative hybridizations of correct ONPs. The occurrence of each sequence in several overlapped clones results in a large tolerance of SBH to falsely negative hybridizations. The aggregate error for all clones cannot be higher than 6%. Otherwise, there will be a chance of failure to detect eight consecutive ONSs, leading to a gap. However, consider a basic library where every sequence is cloned an average of five times. If the percentage of undetected ONSs per clone is 50%, the total number of undetected ONSs is $100 \times 0.5^5 = 3\%$. This demonstrates that hybridization efficiency is not a major problem in SBH. The larger error will require more extensive computing, since more SFs will arise per individual clone in the first round.
APPENDIX A5
Algorithm for Assembling the Sequence of Contigs
The sequence of maximally extended groups of overlapping clones (contigs) would be generated in three stages in the following way.
In the first stage, overlapping and thus ordering of 7-kb clones would be performed. The procedure described by Lehrach (Poustka et al., 1986; Craig et al., 1986; Michiels et al., 1987) would be used for this purpose except that hybridization data would be obtained with 1000 instead of 100 probes. One would obtain maximally extended groups of overlapping 7-kb clones. Overlapped 7-kb clones would simply define IF7.
In the second stage, one would generate a linear order of IF0.5. The partitioning of positively hybridizing probes (+ONP) into respective IF7 using 10,000 ONP groups would be accomplished by comparison of the contents of +ONP between overlapping 7-kb clones. Since the average distance between target ONSs for the 10,000 probes is 12 bp, an average of 60 +ONPs are expected within one IF7. Using the same probes on the ordering library, one would be in a position to compare +ONPs from IF7 and ordering clones. It would be done in two steps. In the first, for a given IF7, one would form sets of neighboring 25-35 of the total 60 +ONPs by the criterion that they are contained in about 5 ordering clones. This comes from the fact that 25 neighboring +ONPs are contained within a 300-bp sequence represented on average in 5 clones of the ordering library. In the second step, the cross sections of the formed sets would be found. This comparison gives their linear order. Also, this determines the linear order of IF0.5s since they are defined as the cross sections of sets of +ONPs which contain 2 +ONPs on average. In the majority of cases, it is sufficient to find groups of neighboring IF0.5s together containing 5-10 +ONPs instead of IF0.5s to be able to obtain the order of SFs.
In the last stage, the SFs would be formed and ordered within each IF7. The data from hybridization of the basic library with the remaining 75,000 probes are used and +ONPs subdivided among IF7 as described above. All SFs within an IF7 would be generated by maximal overlapping of its +ONPs. The average number of SFs expected in one IF7 is 16 and its average length is 45 bp (Fig. 3).
Since the average distance between two +ONPs is 12 bp, every SF would contain an average of 3 of these +ONPs. A comparison of the +ONP contents of the SFs from an IF7 with those of the groups of neighboring IF0.5s allows partitioning of the SFs into these groups. This partitioning reduces the number of unordered SFs. For instance, a group of IF0.5s 300 bp long would contain 7 SFs from a 7-kb clone, but their real number will be only 3 new SFs on average, obtained by fusion of primary SFs. This comes from the fact that primary SFs are real only in the context of a whole IF7, of which a given group of IF0.5s is a part. This reduction in the number of primary SFs allows unequivocal determination of the linear order of SFs in the vast majority of IF7s. On the basis of these algorithmic principles, the sequences of contigs composed of overlapped 7-kb clones would be determined by a computer. Our preliminary computer simulation results indicate that 75% of detected ONSs in an IF7 are sufficient for formation of all predicted SFs.
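The partitioning of SFs among groups of neighboring IF0.5s by comparison of +ONP contents can be sketched as a set comparison. The text does not state the exact matching rule, so the rule used here (assign each SF to the group with which it shares the most +ONPs) is an assumption. The function name and the probe and fragment identifiers are hypothetical.

```python
def partition_subfragments(sf_probes, group_probes):
    """Assign each SF to the IF0.5 group that shares the most positive probes with it.

    sf_probes:    dict mapping an SF name to the set of +ONP identifiers it contains
    group_probes: dict mapping a group name to the set of +ONP identifiers in that group
    Returns (assignment, unassigned). Ties go to the group listed first (assumed rule).
    """
    if not group_probes:
        return {}, list(sf_probes)
    assignment = {group: [] for group in group_probes}
    unassigned = []
    for sf, probes in sf_probes.items():
        shared = {g: len(probes & members) for g, members in group_probes.items()}
        best = max(shared, key=shared.get)
        if shared[best] == 0:
            unassigned.append(sf)  # no +ONP in common with any group
        else:
            assignment[best].append(sf)
    return assignment, unassigned


# Hypothetical data: probes are numbered, two groups overlap in probe 5.
groups = {"G1": {1, 2, 3, 4, 5}, "G2": {5, 6, 7, 8, 9}}
sfs = {"SF_a": {1, 2, 3}, "SF_b": {7, 8}, "SF_c": {4, 5}, "SF_d": {20}}
assignment, unassigned = partition_subfragments(sfs, groups)
print(assignment)
print(unassigned)
```

Once each group holds only a few SFs, the number of candidate orders within the group is small, which is the reduction the paragraph above relies on.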
ACKNOWLEDGMENTS
We gratefully acknowledge the help and advice of Bruce Alberts. Thanks to H. Lehrach for advice and for sending us preprints of his papers and to G. Crouse for advice and help with English. This work has been supported by SIZ for Science of SR Serbia, Yugoslavia. This work is the subject of Yugoslav patent applications.
REFERENCES
1. AGRAWAL, S., CHRISTODOULOU, C., AND GAIT, M. J. (1986). Efficient methods for attaching non-radioactive labels to the 5' ends of synthetic oligodeoxyribonucleotides. Nucleic Acids Res. 14: 6227-6245.
2. BARNES, W. M., AND BEVAN, M. (1983). Kilo-sequencing: An ordered strategy for rapid DNA sequence data acquisition. Nucleic Acids Res. 11: 349-368.
3. BURKE, D. T., CARLE, G. F., AND OLSON, M. V. (1987). Cloning of large segments of exogenous DNA into yeast by means of artificial chromosome vectors. Science 236: 806-812.
4. CHURCH, G. M., AND KIEFFER-HIGGINS, S. (1988). Multiplex DNA sequencing. Science 240: 185-188.
5. COLLINS, F. S., AND WEISSMAN, S. M. (1984). Directional cloning of DNA fragments at a large distance from an initial probe: A circularization method. Proc. Natl. Acad. Sci. USA 81: 6812-6816.
6. COULSON, A., SULSTON, J., BRENNER, S., AND KARN, J. (1986). Toward a physical map of the genome of the nematode Caenorhabditis elegans. Proc. Natl. Acad. Sci. USA 83: 7821-7825.
7. CRAIG, A., MICHIELS, F., ZEHETNER, G., SPROAT, B., BURMEISTER, M., BUCAN, M., POUSTKA, A., POHL, T., FRISCHAUF, A.-M., AND LEHRACH, H. (1986). Molecular techniques in mammalian genetics: A new era in genetic analysis. In "Human Genetics, Proceedings of the 7th International Congress, Berlin."
8. CROSS, S. H., AND LITTLE, P. F. R. (1986). A cosmid vector for systematic chromosome walking. Gene 49: 9-22.
9. DONIS-KELLER, H., et al. (1987). A genetic linkage map of the human genome. Cell 51: 319-337.
10. DRMANAC, R., PETROVIC, N., GLISIN, V., AND CRKVENJAKOV, R. (1986). A calculation of fragment lengths obtainable from human DNA with 78 restriction enzymes: An aid for cloning and mapping. Nucleic Acids Res. 14: 4691-4692.
11. ESTIVILL, X., AND WILLIAMSON, R. (1987). A rapid method to identify cosmids containing rare restriction sites. Nucleic Acids Res. 15: 1415-1425.
12. JEFFREYS, A. J., WILSON, V., AND THEIN, S. L. (1985). Hypervariable "minisatellite" regions in human DNA. Nature (London) 314: 67-73.
13. KAFATOS, F. C., JONES, C. W., AND EFSTRATIADIS, A. (1979). Determination of nucleic acid sequence homologies and relative concentration by a dot blot hybridization procedure. Nucleic Acids Res. 7: 1541-1552.
14. KOHARA, Y., AKIYAMA, K., AND ISONO, K. (1987). The physical map of the whole Escherichia coli chromosome. Cell 50: 495-508.
15. LEWIN, R. (1986). Research News: Proposal to sequence the human genome stirs debate. Science 232: 1598-1600.
16. MANIATIS, T., FRITSCH, E. F., AND SAMBROOK, J. (1982). "Molecular Cloning: A Laboratory Manual," pp. 270-271, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY.
17. MAXAM, A. M., AND GILBERT, W. (1977). A new method for sequencing DNA. Proc. Natl. Acad. Sci. USA 74: 560-564.
18. MESSING, J., AND VIEIRA, J. (1982). A new pair of M13 vectors for selecting either strand of a double-digest restriction fragment. Gene 19: 269-276.
19. MICHIELS, F., CRAIG, A. G., ZEHETNER, G., SMITH, G. P., AND LEHRACH, H. (1987). Molecular approaches to genome analysis: A strategy for the construction of ordered overlapping clone libraries. CABIOS 3: 203-210.
20. OLSON, M. V., DUTCHIK, J. E., GRAHAM, M. Y., BRODEUR, G. M., HELMS, C., FRANK, M., MACCOLLIN, M., SCHEINMAN, R., AND FRANK, T. (1986). Random-clone strategy for genomic restriction mapping in yeast. Proc. Natl. Acad. Sci. USA 83: 7826-7830.
21. POUSTKA, A., AND LEHRACH, H. (1986). Jumping libraries and linking libraries: The next generation of molecular tools in mammalian genetics. Trends Genet. 2: 174-179.
22. POUSTKA, A., POHL, T., BARLOW, D. P., ZEHETNER, G., CRAIG, A., MICHIELS, F., EHRICH, E., FRISCHAUF, A.-M., AND LEHRACH, H. (1986). Molecular approaches to mammalian genetics. Cold Spring Harbor Symp. Quant. Biol. 51: 131-139.
23. POUSTKA, A., POHL, T., BARLOW, D., FRISCHAUF, A.-M., AND LEHRACH, H. (1987). Construction and use of human chromosome jumping libraries from NotI-digested DNA. Nature (London) 325: 353-355.
24. PROBER, J. M., TRAINOR, G. L., DAM, R. J., HOBBS, F. W., ROBERTSON, C. W., ZAGURSKY, R. J., COCUZZA, A. J., JENSEN, M. A., AND BAUMEISTER, K. (1987). A system for rapid DNA sequencing with fluorescent chain-terminating dideoxynucleotides. Science 238: 336-341.
25. SAIKI, R. K., BUGAWAN, T. L., HORN, G. T., MULLIS, K. B., AND ERLICH, H. A. (1986). Analysis of enzymatically amplified β-globin and HLA-DQα DNA with allele-specific oligonucleotide probes. Nature (London) 324: 163-166.
26. SAIKI, R. K., GELFAND, D. H., STOFFEL, S., SCHARF, S. J., HIGUCHI, R., HORN, G. T., MULLIS, K. B., AND ERLICH, H. A. (1988). Primer-directed enzymatic amplification of DNA with a thermostable DNA polymerase. Science 239: 487-491.
27. SANGER, F., NICKLEN, S., AND COULSON, A. R. (1977). DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. USA 74: 5463-5467.
28. SETLOW, P. (1976). "Handbook of Biochemistry and Molecular Biology," 3rd ed. Vol. II. "Nucleic Acids" (D. G. Fasman, Ed.), pp. 312-318. CRC Press, Cleveland, OH.
29. SINGER, M. F. (1982). SINES and LINES: Highly repeated short and long interspersed sequences in mammalian genomes. Cell 28: 433-434.
30. SMITH, C. L., ECONOME, J. G., SCHUTT, A., KLCO, S., AND CANTOR, C. R. (1987). A physical map of the Escherichia coli K12 genome. Science 236: 1448-1453.
31. SMITH, L. M., SANDERS, J. Z., KAISER, R. J., HUGHES, P., DODD, C., CONNELL, C. R., HEINER, C., KENT, S. B. H., AND HOOD, L. E. (1986). Fluorescence detection in automated DNA sequence analysis. Nature (London) 321: 674-679.
32. SMITH, L., AND HOOD, L. (1987). Mapping and sequencing the human genome: How to proceed. Bio/Technology 5: 933-939.
33. STADEN, R. (1980). A new method for storage and manipulation of DNA gel reading data. Nucleic Acids Res. 8: 3673-3694.
34. TAUTZ, D., TRICK, M., AND DOVER, G. A. (1986). Cryptic simplicity in DNA is a major source of genetic variation. Nature (London) 322: 652-656.
35. THEIN, S. L., AND WALLACE, B. (1986). The use of synthetic oligonucleotides as specific hybridization probes in the diagnosis of genetic disorders. In "Human Genetic Diseases: A Practical Approach" (E. K. Davies, Ed.), pp. 33-50, IRL Press, Oxford.
36. URDEA, M. S., WARNER, B. D., RUNNING, J. A., STEMPIEN, M., CLYNE, J., AND HORN, T. (1988). A comparison of non-radioisotopic hybridization assay methods using fluorescent, chemiluminescent and enzyme labeled synthetic oligodeoxyribonucleotide probes. Nucleic Acids Res. 16: 4937-4956.
37. WADA, A. (1987). Automated high-speed DNA sequencing. Nature (London) 325: 771-772.
38. WALLACE, R. B., SHAFFER, J., MURPHY, R. F., HIROSE, T., AND ITAKURA, K. (1979). Hybridization of synthetic oligonucleotides to φX 174 DNA: The effect of single base pair mismatch. Nucleic Acids Res. 6: 3543-3557.
39. WOOD, W. I., GITSCHER, L. A., LASKY, L. A., AND LAWN, R. M. (1985). Base composition-independent hybridization in tetramethylammonium chloride: A method for oligonucleotide screening of highly complex gene libraries. Proc. Natl. Acad. Sci. USA 82: 1585-1588.