Comparison of Clone-Ordering Algorithms Used in Physical Mapping


SHORT COMMUNICATION

DARREN M. PLATT¹ AND TREVOR I. DIX²

Department of Computer Science, Monash University, Clayton, 3168, Australia Received August 13, 1996; accepted December 19, 1996

In this paper, a number of existing and novel techniques are considered for ordering cloned extracts from the genome of an organism based on fingerprinting data. A metric is defined for comparing the quality of the clone order for each technique. Simulated annealing is used in combination with several different objective functions. Empirical results with many simulated data sets for which the correct solution is known indicate that a simple greedy algorithm with some subsequent stochastic shuffling provides the best solution. Other techniques that attempt to weight comparisons between nonadjacent clones bias the ordering and give worse results. We show that this finding is not surprising since without detailed attempts to reconcile the data into a detailed map, only approximate maps can be obtained. Making $N^2$ pieces of data from measurements of $N$ clones cannot improve the situation. © 1997 Academic Press

GENOMICS 40, 490–492 (1997), ARTICLE NO. GE964588

¹ Partly supported by an Australian Research Council small grant and ARC Grant A49330684.
² To whom correspondence should be addressed. Telephone: 61-39905-5146.

Fingerprinting is still a popular method for mapping large genomes. A variety of algorithms have been used to determine the order of clones based on such data. We have focused on restriction digestion as a fingerprinting technique, but this work is relevant to probe-based clone-ordering techniques (1, 5, 6, 10) and to clone ordering based on sequence data (3). More detailed maps can be constructed by considering the position of individual fragments in the map (7, 14), and a clone order may be a precursor to that stage of mapping (12). We consider only pure clone-ordering techniques in this paper, although we acknowledge that better clone orders can be obtained using detailed mapping. Detailed mapping, however, is time consuming and frequently requires human intervention.

A variety of techniques have been used for clone ordering. Kohara et al. (8) used multiple single restriction digestions with eight enzymes and then used sequence analysis software to detect overlaps from common restriction site sequences. Clones from the Caenorhabditis elegans mapping project were ordered using the CONTIG-9 software (15), which implemented a combination of greedy heuristics and human supervision. A greedy algorithm was also used for the ordering of human chromosome 19 clones (2). A clone-ordering method based on a genetic algorithm has also been applied to this data set (4).

To rank the ordering algorithms objectively, simulated data were used for which the correct solution is known. The data were generated along the lines of the model given in Ref. (11), which in turn adopted the assumptions of Ref. (9). Neither repetitive DNA nor cloning bias was simulated. These effects would degrade the performance of all the algorithms presented here. Data for 200 cosmids taken from a 1-Mb genome were simulated, using representative levels of error. This provides an average of sixfold coverage. Fifty different genomes were generated. Pairwise overlap scores were calculated for each pair of clones.

We define a matrix of overlap scores $M$, where $M_{i,j}$ represents the overlap score between clones $C_i$ and $C_j$, and the concept of a clone order $O$, where $O_i$ gives the number of the clone in the $i$th position.

A metric for comparing the performance of two clone-ordering techniques is also required. Given both a correct clone order $O^c$ and an alternative clone order $O^a$, resulting from some mapping process, a measure of success can be defined by taking all pairs of clones that are adjacent in $O^c$ and measuring their separation in $O^a$. A histogram can be constructed from such data. This measure is appropriate because having a clone within two to three clones of its "correct" position is often not a problem when the clones are stacked deeply.

The algorithms studied use two basic approaches. The first is to join clones greedily into a spanning line, joining the two clones with the highest overlap score first and then proceeding, in order of decreasing overlap strength, to make any join that does not create a branch or a cycle in the line through the graph thus formed.
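The greedy join and the comparison metric described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`greedy_order`, `adjacent_separations`, `separation_histogram`) are hypothetical, the use of a union-find structure to detect cycles is an implementation choice made here, and ties between equal overlap scores are broken arbitrarily. The overlap matrix `M` is assumed to be square and symmetric.

```python
from collections import Counter


def greedy_order(M):
    """Join clones into one line, taking the strongest overlaps first."""
    n = len(M)
    if n == 0:
        return []
    parent = list(range(n))              # union-find forest, used to reject cycles
    neighbours = [[] for _ in range(n)]  # a clone in a line has at most 2 neighbours

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    pairs = sorted(((M[i][j], i, j) for i in range(n) for j in range(i + 1, n)),
                   reverse=True)
    joins = 0
    for _score, i, j in pairs:
        if len(neighbours[i]) == 2 or len(neighbours[j]) == 2:
            continue                     # this join would create a branch
        root_i, root_j = find(i), find(j)
        if root_i == root_j:
            continue                     # this join would create a cycle
        parent[root_i] = root_j
        neighbours[i].append(j)
        neighbours[j].append(i)
        joins += 1
        if joins == n - 1:
            break

    # Walk the finished line from one of its two end clones.
    start = next(c for c in range(n) if len(neighbours[c]) <= 1)
    order, prev, cur = [], None, start
    while cur is not None:
        order.append(cur)
        onward = [c for c in neighbours[cur] if c != prev]
        prev, cur = cur, (onward[0] if onward else None)
    return order


def adjacent_separations(correct, candidate):
    """Separation in `candidate` of each pair of clones adjacent in `correct`."""
    position = {clone: k for k, clone in enumerate(candidate)}
    return [abs(position[a] - position[b]) for a, b in zip(correct, correct[1:])]


def separation_histogram(correct, candidate):
    """Histogram (separation -> count) of the values above."""
    return dict(sorted(Counter(adjacent_separations(correct, candidate)).items()))
```

A perfect ordering (or its mirror image, which is equally valid) places every pair at separation 1, so its histogram has a single bar.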
This greedy joining is essentially what many of the genome mapping projects have done (2, 15). The second approach is to use a stochastic search such as simulated annealing. Stochastic clone-ordering systems use an objective function that measures progress toward the correct order. A random search is then used to generate potential solutions. An annealing schedule slightly modified from that of Ref. (6) was chosen, with a decrease in the maximum number of iterations at a single temperature.

The objective functions considered are shown in Table 1. $f_{\mathrm{single}}$ was used in Ref. (6). It takes into account only the overlap between adjacent clones in the clone order. $f_{\mathrm{linear}}$ was used in Refs. (3 and 14) and attempts to weight the values in the matrix so that good scores migrate toward the diagonal. $f_{\mathrm{nonlinear}}$ was developed by the authors to weight the off-diagonal elements of $M$ more appropriately. $f_{\mathrm{dist}}$ is a more elaborate function that uses a distribution for the scores expected on each diagonal. These distributions were determined empirically.

TABLE 1
Objective Functions for Stochastic Searches

Function                          Definition
$f_{\mathrm{single}}(M, O)$       $\sum_{i=0}^{n-2} M_{O_i, O_{i+1}}$
$f_{\mathrm{linear}}(M, O)$       $\sum_{i=0}^{n-1} \sum_{j=i+1}^{n-1} |i - j| \times M_{O_i, O_j}$
$f_{\mathrm{nonlinear}}(M, O)$    $\sum_{i=0}^{n-1} \sum_{j=i+1}^{n-1} P(\mathrm{Coverage} \ge |i - j|) \times M_{O_i, O_j}$
$f_{\mathrm{dist}}(M, O)$         $\sum_{i=0}^{n-1} \sum_{j=i+1}^{n-1} -\log\left(P_{|i-j|}(M_{O_i, O_j})\right)$

The greedy and stochastic algorithms were all tested with each of the 50 simulated genomes. Figure 1 shows the averaged histograms computed using the comparison metric. The x-axis shows the separation of clones that are adjacent in the correct clone order, and the y-axis shows the number of clones falling into each category. The use of the full custom distribution $f_{\mathrm{dist}}$ provided only a small improvement over the nonlinear penalty, and the improvement was not statistically significant, so those results are not shown.
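To make the first two objective functions of Table 1 concrete, the following Python sketch evaluates $f_{\mathrm{single}}$ and $f_{\mathrm{linear}}$ and runs a very small simulated annealing loop over clone orders. Several things here are assumptions rather than details taken from the paper: the geometric cooling schedule and its parameters (the schedule of Ref. (6) is not reproduced), the segment-reversal move, the convention that the supplied `cost` is minimised, and all function and parameter names. $f_{\mathrm{nonlinear}}$ and $f_{\mathrm{dist}}$ are omitted because they depend on coverage and score distributions that are not given in the text.

```python
import math
import random


def f_single(M, O):
    """Sum of overlap scores between clones adjacent in the order O."""
    return sum(M[O[i]][O[i + 1]] for i in range(len(O) - 1))


def f_linear(M, O):
    """Overlap scores weighted by the distance |i - j| between positions."""
    n = len(O)
    return sum((j - i) * M[O[i]][O[j]] for i in range(n) for j in range(i + 1, n))


def anneal(M, order, cost, t_start=1.0, t_end=1e-3, cooling=0.95,
           steps_per_temp=200, seed=0):
    """Minimise cost(M, order) by annealing; returns the best order seen.

    The schedule and the move set are illustrative assumptions.
    """
    rng = random.Random(seed)
    current = list(order)
    if len(current) < 2:
        return current
    current_cost = cost(M, current)
    best, best_cost = list(current), current_cost
    t = t_start
    while t > t_end:
        for _ in range(steps_per_temp):
            # Candidate move: reverse a randomly chosen segment of the order.
            i, j = sorted(rng.sample(range(len(current)), 2))
            candidate = current[:i] + current[i:j + 1][::-1] + current[j + 1:]
            c = cost(M, candidate)
            # Always accept improvements; accept worse orders with
            # probability exp(-(increase in cost) / temperature).
            if c <= current_cost or rng.random() < math.exp((current_cost - c) / t):
                current, current_cost = candidate, c
                if c < best_cost:
                    best, best_cost = list(candidate), c
        t *= cooling
    return best
```

If a larger $f_{\mathrm{single}}$ is taken to mean a better order (an assumption consistent with summing adjacent overlaps), it can be optimised by passing its negative as the cost, for example `anneal(M, start_order, lambda m, o: -f_single(m, o))`.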

FIG. 1. Histogram of ordering efficiency, comparing the results for each technique.

Results for the $f_{\mathrm{single}}$ objective function showed that it was slightly worse than the greedy algorithm. Detailed examination of the results indicated that in 49 of the 50 trials, the maps produced by the greedy algorithm had a better $f_{\mathrm{single}}$ value than those produced by simulated annealing with the $f_{\mathrm{single}}$ objective function. The failure of the $f_{\mathrm{single}}$ trial to outperform the greedy algorithm can therefore be attributed to the difficult nature of this optimization problem rather than to a defect in the objective function.

One final trial was performed in which the clone orders produced by the greedy algorithm were broken into sections wherever the overlap score was below a certain threshold. The sections were then shuffled using simulated annealing. This operation is quick and produced a small improvement in the $f_{\mathrm{single}}$ score for 47 of the 50 trials, giving a very slight improvement in the ordering.

Trials were also conducted with 104 clones from a chromosome 19 contig that had already been assembled manually. Eighty-eight clones were placed within 6 clones of their correct position by the greedy algorithm, and further simulated annealing was unable to improve the $f_{\mathrm{single}}$ score. A completely correct clone order could be obtained only by constructing the restriction map or by incorporating other data such as STS hits.

Despite its short-sightedness, the greedy algorithm is clearly effective. The results also show that considering only overlaps between adjacent clones is better than weighting all of the overlap scores. This may appear counterintuitive, but it is a natural result of artificially creating $N^2$ overlap scores from $N$ fingerprints and incorporating them all into the objective function. To make matters worse, the linear penalty function assigns inappropriate weights to the overlap scores, and this gives a biased result. The greedy algorithm is definitely the preferred choice for generating an initial clone order.

The clone overlap and ordering system developed for this work is available on the World Wide Web at http://www.cs.monash.edu.au/biocomputing. Contigs of up to 200 clones may be ordered using this service. Further technical details concerning the research can be found in Ref. (13).
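As an illustration of the final trial reported above, in which a greedy clone order is broken into sections wherever the adjacent overlap score falls below a threshold, the sectioning step might look like the following Python sketch. The function names, the `threshold` parameter, and the representation of a rearrangement as (section index, reversed) pairs are all hypothetical; the annealing over rearrangements itself is not shown.

```python
def split_at_weak_joins(M, order, threshold):
    """Cut `order` into sections wherever adjacent overlap is below `threshold`."""
    if not order:
        return []
    sections = [[order[0]]]
    for prev, cur in zip(order, order[1:]):
        if M[prev][cur] < threshold:
            sections.append([cur])    # weak join: start a new section
        else:
            sections[-1].append(cur)  # strong join: extend the current section
    return sections


def join_sections(sections, arrangement):
    """Rebuild a clone order from sections.

    `arrangement` lists (section_index, reversed) pairs in the desired
    sequence; a stochastic search could propose such arrangements and
    score each rebuilt order with an objective function.
    """
    rebuilt = []
    for index, is_reversed in arrangement:
        section = sections[index]
        rebuilt.extend(reversed(section) if is_reversed else section)
    return rebuilt
```

Because only whole sections are moved or flipped, the search space is far smaller than that of all clone permutations, which is consistent with the paper's remark that the operation is quick.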

ACKNOWLEDGMENTS

We thank Glen Pringle for constructing the web pages for this paper, and we thank Lawrence Livermore National Laboratories for providing the chromosome 19 data that were used.

REFERENCES

1. Alizadeh, F., Karp, R., Newberg, L., and Weisser, D. (1993). Physical mapping of chromosomes: A combinatorial problem in molecular biology. In "ACM 4th Annual Symposium on Discrete Algorithms," pp. 371–381.
2. Branscomb, E., Slezak, T., Pae, R., Galas, D., Carrano, A., and Waterman, M. (1990). Optimizing restriction fragment fingerprinting methods for ordering large genomic libraries. Genomics 8: 351–366.
3. Burks, C., Parsons, R., and Engle, M. (1994). Integration of competing ancillary assertions into genome assembly. In "Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology," pp. 62–69.
4. Cedeño, W., Vemuri, V., and Slezak, T. (1995). Assembly of DNA restriction-fragments using genetic algorithms. Evol. Computat. 2: 321–345.
5. Craig, A., Nizetic, D., Hoheisel, J., Zehetner, G., and Lehrach, H. (1990). Ordering of cosmid clones covering the herpes simplex virus type I (HSV-I) genome: A test case for fingerprinting by hybridisation. Nucleic Acids Res. 18(9): 2653–2660.
6. Cuticchia, A., Arnold, J., and Timberlake, W. (1993). ODS: Ordering DNA sequences, a physical mapping program based on simulated annealing. CABIOS 9: 215–219.
7. Gillett, W., Hanks, L., Wong, G., Yu, J., Lim, R., and Olson, M. (1996). Assembly of high-resolution restriction maps based on multiple complete digests of a redundant set of overlapping clones. Genomics 33: 389–408.
8. Kohara, Y., Akiyama, K., and Isono, K. (1987). The physical map of the whole E. coli chromosome: Application of a new strategy for rapid analysis and sorting of a large genomic library. Cell 50: 495–508.
9. Lander, E., and Waterman, M. (1988). Genomic mapping by fingerprinting random clones: A mathematical analysis. Genomics 2: 231–239.
10. Mott, R., Grigoriev, A., Maier, E., Hoheisel, J., and Lehrach, H. (1993). Algorithms and software tools for ordering clone libraries: Application to the mapping of the genome of Schizosaccharomyces pombe. Nucleic Acids Res. 21: 1965–1974.
11. Platt, D., and Dix, T. (1995). A model for comparing genomic restriction maps. In "Proceedings of the 28th Hawaii International Conference on System Sciences," Vol. 1, pp. 24–31.
12. Platt, D., and Dix, T. (1995). Stochastic assembly of contig restriction maps. In "Proceedings of the 28th Hawaii International Conference on System Sciences," Vol. 1, pp. 155–164.
13. Platt, D., and Dix, T. (1996). Clone Ordering Using Restriction Fingerprinting Data. Technical Report 287, Department of Computer Science, Monash University.
14. Soderlund, C., and Burks, C. (1994). GRAM and genfragII: Solving and testing the single-digest, partially ordered restriction map problem. CABIOS 10(3): 349–358.
15. Sulston, J., Mallet, F., Staden, R., Durbin, R., Horsnell, T., and Coulson, A. (1988). Software for genome mapping by fingerprinting techniques. CABIOS 4(1): 125–132.