DNA uptake signal sequences in naturally transformable bacteria


Res. Microbiol. 150 (1999) 603−616 © 1999 Éditions scientifiques et médicales Elsevier SAS. All rights reserved

Hamilton O. Smith*, Michelle L. Gwinn, Steven L. Salzberg

The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA

Abstract: The naturally transformable bacterium Haemophilus influenzae Rd contains 1471 copies of the DNA uptake signal sequence (USS) 5’-AAGTGCGGT in its genome. Neisseria meningitidis contains 1891 copies of the USS 5’-GCCGTCTGAA. The USSs are often found in the base-paired stem of transcription terminators.

Keywords: uptake / DNA / exogenous / Neisseria / Haemophilus

* Correspondence and reprints. Current address: Celera Genomics Corporation, 45 West Gude Drive, Rockville, MD 20850, USA. Tel.: +1 240 453 3720; fax: +1 410 429 4008.

1. Introduction

Many bacteria have genetically determined systems for the uptake and integration of exogenous DNA. Of these naturally transformable organisms, the most thoroughly studied are the Gram-positive bacteria Bacillus subtilis and Streptococcus pneumoniae and the Gram-negative bacteria Haemophilus influenzae Rd and Neisseria gonorrhoeae. The two Gram-positive bacteria efficiently take up DNA from any source, although only homologous DNA from related species is normally recombined into the cell’s chromosome. In contrast, the two Gram-negative bacteria, when exposed to a mixture of homologous and foreign DNAs, show preferential uptake of the homologous DNA [28, 33]. This selective uptake depends on the presence of uptake signal sequences (USS) in the DNA [34].

The H. influenzae USS sites were first identified as the 11-bp sequence 5’-AAGTGCGGTCA, common to four DNA fragments showing strongly preferential uptake from a mixture of cloned H. parainfluenzae restriction fragments [8]. Subsequent examination of additional sites revealed that only the first 9 bp, 5’-AAGTGCGGT, were essential for preferential uptake [11]. The N. gonorrhoeae USS sites were identified as the 10-bp sequence 5’-GCCGTCTGAA [18].

When the complete sequence of H. influenzae Rd was determined in 1995 [12], it became possible to make a detailed examination of the frequency and distribution of the USS sites [36]. A total of 1465 were found in the 1.83-Mb genome. When reading on one strand of the DNA, 734 were in the plus orientation (5’-AAGTGCGGT) and 731 were in the minus orientation (5’-ACCGCACTT). On superficial examination they appeared to be randomly distributed, with a fairly even admixture of plus and minus orientations. However, closer examination revealed a moderate excess of sites in intergenic regions (35% compared to an expected 14%) and a higher than expected number of closely spaced sites. There were 127 USS pairs separated by less than 26 bp (compared to a chance expectation of 39): 87 were in a plus/minus stem-loop configuration and 39 in a minus/plus stem-loop configuration, with the two oppositely oriented USSs forming the base-paired stem. The stem-loops were primarily located in intergenic regions and had the appearance of classical transcription terminators, i.e., a stem-loop followed by several T nucleotides in the direction of RNA transcription.

The plus/minus pairs were primarily in two spacing groups: exactly 8 bp apart or 19 to 22 bp apart. In contrast, in the 39 minus/plus pairs, the separation distances were essentially randomly distributed up to about 26 bp. An alignment of all 1465 sites and examination of the surrounding base composition pointed to a possible explanation for the spacing groups. The full USS site was found to extend over 29 base pairs, with two regions of weak conservation: 5’-aAAGTGCGGT.rwwwww......rwwwww, where ‘r’ = A or G, ‘w’ = A or T, and ‘.’ = any base. In the plus/minus configuration, the rwwwww repeats on the two strands overlap so as to conserve both sites only at the preferred spacings. In the minus/plus configuration the extended regions point away from each other and essentially any spacing is possible.

A similar detailed analysis of N. gonorrhoeae will not be possible until the sequence is completed. However, N. meningitidis, which is naturally transformable [26] and has 10-bp USSs identical to those of N. gonorrhoeae, was recently sequenced [32]. This allows a detailed comparison of USSs in H. influenzae and N. meningitidis.

2. USS in H. influenzae Rd and N. meningitidis

N. meningitidis (2.18 Mb) has 1891 USS sites, 958 in the plus orientation and 933 in the minus orientation (table I). The average separation distance between consecutive sites is 1152 bp. The recently updated sequence of H. influenzae Rd (1.83 Mb) has 1471 sites, 737 in the plus orientation and 734 in the minus orientation. The average separation distance between consecutive sites is 1224 bp.

Figure 1 shows maps of the USS sites aligned with a plot of the trinucleotide composition in 1000-bp windows. The compositional analysis uses the chi-squared statistic as in Nelson et al. [29], and peaks in this plot indicate regions of unusual composition, including laterally transferred DNA. Although not always visible at this scale, a close-up view of these plots indicates a correspondence between peaks in the chi-square plot and gaps in the USS plot. Based on the Kolmogorov-Smirnov test [20], the distributions of sites for both Haemophilus (P < 0.5) and Neisseria (P < 0.7) approximate uniform random distributions.

The plus and minus USSs are mixed together relatively evenly. There is no polar distribution of plus and minus USS sites surrounding any point in the genome. This is in contrast to the polar distributions of certain frequent oligonucleotides in other bacteria. For example, Escherichia coli chi sites (recombination hot spots) are distributed with plus sites predominantly to one side of the origin and minus sites to the other side [4]. Salzberg et al. [31] successfully located origins of replication in several organisms based on the polar distribution of certain frequently occurring oligonucleotides.

Over shorter distances the USSs are distributed nonrandomly. In both bacterial maps (figure 1A, B) there are several 5-kb to 30-kb regions with few or no plus or minus sites. Some of these align with positive deflections of the chi-square plots and may correspond to laterally transferred sequences of recent origin that have not yet acquired a full complement of USSs. For example, the approximately 30-kb open region near the right end of the H. influenzae map (figure 1A) contains sequences homologous to phage genes. The ribosomal RNA operon regions and the major ribosomal protein locus deviate from the average composition of the genome and also tend to be low in USS sites. These genes are highly conserved and perhaps less able to accommodate USS sites.

USSs are distributed nonrandomly with respect to coding and noncoding sequences (table II). In H. influenzae Rd, 87.93% of the genome is predicted to be gene coding sequence and 12.07% is intergenic sequence. However, only about 65% of the USSs are in genes and about 33% are entirely intergenic.

The USSs are found more frequently in certain reading frames within genes (table III). Plus sites are located most often in frame 3 in both bacteria. Minus sites are located most often in frame 1 for Haemophilus and frame 3 for Neisseria. In these frames the sites are divided into codons that are weighted toward common, relatively neutral amino acids. In Haemophilus, the plus site in frame 3 is divided into the codon triplets A|AGT|GCG|GT, which code for the amino acids S, A, and V. The first frame for the minus site divides the USS sequence into the codon triplets ACC|GCA|CTT, which code for T, A, and L. For Neisseria, the third frame for the plus site divides the USS sequence into the codon triplets G|CCG|TCT|GAA, which code for P, S, and E, and the third frame for the minus site gives T|TCA|GAC|GGC, which codes for S, D, and G. The assumption is that the USSs are located in protein coding sequences in such a way that these neutral amino acids are well tolerated. The least represented USS reading frames code for basic, aromatic, or cysteine amino acid residues, or in one case, for a stop codon.

Plots of the separation distances (d) between consecutive USSs are shown in figure 2 for H. influenzae Rd. There is an excess of sites < 26 bp apart and a compensatory decrease in sites 40 to 500 bp apart (figure 2A). Above 500 bp the separation distance returns to the expected level (figure 2B). At less than 25 bp separation, the pairs of sites fall into two distinct groups centering on 8 bp and 18 to 22 bp (figure 2C). In N. meningitidis, the USS separation distances show a similar distribution. There is an excess of closely spaced USSs followed by a compensatory decrease in the intermediate range (figure 3A) and a resumption of the expected distribution at larger distances (figure 3B). The closely spaced sites are distributed in two peaks centering at 6 and 14 bp separation (figure 3C).

Table I. Highly repeated oligonucleotide sequences in transformable bacteria with sequenced genomes.

Bacterial name              Gram +/–   NTR (a)   Repeated sequence   Sites total   SD (b)   Sites plus   Sites minus   Sites per kb
H. influenzae Rd            –          +         AAGTGCGGT           1471          517      737          734           0.80
N. meningitidis Z2491       –          +         GCCGTCTGAA          1891          943      958          933           0.87
N. gonorrhoeae FA1090 (c)   –          +         GCCGTCTGAA          1952 (d)      974      959 (d)      993 (d)       0.91 (d)
Synechocystis sp. PCC6803   –          +         GGCGATCGCC (e)      2823          1993     2823         NA (f)        0.79
B. subtilis 168             +          +         None

(a) Naturally transforming. (b) Standard deviation of the observed frequency compared to the mean frequency calculated from the base composition. (c) Sequence not completed. (d) Data based on incomplete sequence. (e) Not known to be involved in DNA uptake. (f) Nonapplicable; palindromic sequence has only the plus orientation.

Figure 1. USS maps for H. influenzae Rd, N. meningitidis, and Synechocystis sp. Plus sites are above the line and minus sites below the line in each map, except for Synechocystis, which has a palindromic site. In this case every other site is alternated above or below the line to reduce crowding. Chi-square plots for each genome are aligned with each map to show alignment of regions deviating from the average base composition with regions of low site density as explained in the text.

Table II. Location of USSs in relation to coding sequences.

Bacterium           Total sites   % Coding sequence   Entirely in genes   Partially in genes   Completely intergenic
H. influenzae Rd    1471          87.93               966 (65.67)         17 (1.16)            488 (33.17)
N. meningitidis     1891          88.45 (a)           908 (48.02)         38 (2.01)            945 (49.97)
Synechocystis sp.   2823 (b)      87.37 (a)           2540 (89.98)        0 (0.00)             283 (10.02)

(a) Gene coordinates predicted using GeneSmith (unpublished). (b) These sites are not known to be involved in DNA uptake and therefore are not real USSs.

Table III. Location of USS in open reading frames.

Location       H. influenzae   N. meningitidis   Synechocystis sp.
Plus sites
  Frame 1      29              4                 568
  Frame 2      88              167               103
  Frame 3      278             218               1869
  Total        395             389               2540
Minus sites
  Frame 1      370             76                NA (a)
  Frame 2      117             175               NA
  Frame 3      84              268               NA
  Total        571             519               NA

(a) Nonapplicable. Site is a palindrome.
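The reading-frame argument in section 2 (table III) can be checked mechanically. The following is a minimal sketch, assuming that "frame n" means the first base of the site occupies codon position n; the function names are illustrative and not taken from the paper. It splits a site into the complete codons it contains and translates them with the standard genetic code.

```python
from itertools import product

# Standard genetic code in the conventional TCAG order (NCBI translation table 1).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(("".join(c) for c in product("TCAG", repeat=3)), AMINO_ACIDS))


def codons_in_frame(site, frame):
    """Split `site` into the complete codons it contains.

    `frame` (1, 2, or 3) is the codon position occupied by the first base
    of the site; this numbering is an assumption about the paper's usage.
    """
    start = (4 - frame) % 3  # frame 1 -> 0, frame 2 -> 2, frame 3 -> 1
    return [site[i:i + 3] for i in range(start, len(site) - 2, 3)]


def translate_in_frame(site, frame):
    """Translate the complete codons of `site`; partial end codons are ignored."""
    return "".join(CODON_TABLE[c] for c in codons_in_frame(site, frame))


print(translate_in_frame("GCCGTCTGAA", 3))  # Neisseria plus site
print(translate_in_frame("TTCAGACGGC", 3))  # Neisseria minus site
print(translate_in_frame("AAGTGCGGT", 3))   # Haemophilus plus site
```

Under these assumptions the three calls print PSE, SDG, and SA. The Haemophilus plus site ends in the partial codon GT, which encodes valine whatever the third base is, so the sketch reports only the two complete codons.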

3. Role of USS in transcription termination

The closely spaced USS pairs are themselves not randomly distributed with respect to plus and minus orientations. Instead of an equal number of – –, – +, + –, and + + pairs, there is a very strong bias toward the – + and + – inverted site orientations. For Haemophilus, there are 0 – –, 1 + +, 88 + –, and 41 – + pairs with d < 26. Neisseria has 2 – –, 1 + +, 360 + –, and 45 – + pairs with d < 21. The inverted pairs can form stem-loop structures if transcribed into RNA. The stem is greater than or equal to the USS length and the loop is less than or equal to the separation distance. In most cases, the stem-loops terminate with a run of several T’s at the 3’ end, as is characteristic of transcription terminators. Most of these structures in both bacteria are found in intergenic regions (table IV), supporting the hypothesis that they are transcription terminators. However, about a quarter of them are entirely within predicted genes and some do not have a characteristic run of T’s at the 3’ end. These are less likely to be transcription terminators, although they might stabilize mRNA or function in gene regulation. Experimental work is needed to determine the range of functions of these structures.
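The orientation counts above can be reproduced from a genome sequence along the following lines. This is a minimal sketch, not the authors' scripts: it assumes the separation distance d is the number of bases between the end of one site and the start of the next, and the function names are hypothetical.

```python
import re
from collections import Counter


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sites(genome, uss):
    """Return sorted (start, strand) tuples for plus and minus copies of `uss`."""
    motifs = [(uss, "+")]
    minus = reverse_complement(uss)
    if minus != uss:  # a palindromic site has only the plus orientation
        motifs.append((minus, "-"))
    sites = []
    for motif, strand in motifs:
        # A lookahead pattern reports overlapping matches as well.
        for match in re.finditer(f"(?={motif})", genome):
            sites.append((match.start(), strand))
    return sorted(sites)


def close_pair_orientations(sites, site_len, max_gap):
    """Count orientation pairs of consecutive sites fewer than `max_gap` bases apart."""
    counts = Counter()
    for (p1, s1), (p2, s2) in zip(sites, sites[1:]):
        gap = p2 - (p1 + site_len)  # assumed definition of the separation distance
        if 0 <= gap < max_gap:
            counts[s1 + s2] += 1
    return counts


# Toy sequence: a plus site, an 8-base loop, then a minus site.
toy = "TT" + "AAGTGCGGT" + "ACGTACGT" + "ACCGCACTT" + "TTTT"
sites = find_sites(toy, "AAGTGCGGT")
print(sites)
print(close_pair_orientations(sites, 9, 26))
```

On the toy sequence this reports one plus site at position 2, one minus site at position 19, and a single "+-" pair, the configuration that can fold into a stem-loop followed by a run of T’s.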

Figure 2. Distribution of 9-bp USS separation distances in the H. influenzae Rd genome. A. The number of sites separated from the preceding site by a given distance is plotted vs distance. Distances are from 0 to 1000 bp. B. Distances are from 0 to 13 000 bp. C. Distances are from 0 to 50 bp.

Figure 3. Distribution of 10-bp USS separation distances in the N. meningitidis genome. A. The number of sites separated from the preceding site by a given distance is plotted vs distance. Distances are from 0 to 1000 bp. B. Distances are from 0 to 13 000 bp. C. Distances are from 0 to 50 bp.

Table IV. Location of USS stem-loop structures.

Bacterium          Total   Entirely in genes   Partially in genes   Completely intergenic
H. influenzae Rd   130     32                  1                    97
N. meningitidis    408     103                 11                   292

4. Extended USS sequences

Experimental studies have shown that not all Haemophilus USS sites have equal activity in DNA uptake [7, 17]. Danner et al. [7] found that a synthetic site was most active when surrounded by AT-rich sequences. Smith et al. [36] found a 29-bp region of partially conserved sequence, as mentioned in the Introduction. There were two AT-rich repeats 3’ to the core 9-bp site. Alignment of the 1471 USSs in the plus orientation shows these repeated regions at about one and two helix turns from the beginning of the 9-bp core site (figure 4). The conservation of these AT-rich repeats in the majority of USS sites suggests that they are important for DNA uptake. Sequence variation in the AT-rich repeats may explain the variable levels of uptake activity of individual sites [7, 17].

The extended Haemophilus site region also appears to influence the spacing of USSs in the stem-loop structures. In the + – configuration, the two extended AT-rich repeat regions face each other, and the AT-rich repeats overlap only when the separation distances are about 8 bp and 19 to 22 bp. At other spacings, at least one of the AT-rich repeats on one strand would fall opposite the GC-rich core sequence on the other strand and could not both maintain its AT-rich character and be complementary to the core sequence. This is illustrated in figure 5, which shows several examples of typical + – stem-loop sequences. In the case of – + stem-loops, the extended regions face outward and thus the separation distances are distributed essentially randomly. Plots of the distribution of USS separation distances in the stem-loop structures are shown in figure 6A.

In N. meningitidis, there is partial conservation of four bases on the 5’ side of the 10-bp core site and one base on the 3’ side: the extended site is 5’-aaatGCCGTCTGAAa (figure 4). Thus for the most common variety of stem-loops in Neisseria (360 out of 405), the typical sequence is 5’-aaatGCCGTCTGAAa..loop..tTTCAGACGGCattt. The GC-rich core sequence provides the stem, which is followed by several T residues that are required for good termination. Because of the extended symmetry manifested by many of the stem-loop terminator structures, they should often be capable of functioning in both directions.

Somewhat surprisingly, the plus/minus stem-loops in Neisseria show a bimodal distribution, similar to that seen in Haemophilus, with a peak at around d = 5 bp and another peak at around d = 13 bp (figure 6B). Both peaks are relatively broad. It is not readily apparent why there should be two preferred separation distances rather than a single peak or a random distribution. The peak at 3 to 5 bp is most easily explained as the preferred loop size for terminators. However, this leaves the second peak unexplained.

Figure 4. Consensus sequences flanking Haemophilus and Neisseria USSs. The consensus sequence for the Synechocystis sp. 10-bp palindromic repeat is also shown. In each bacterium, all of the sites are aligned in the plus orientation and the frequency of bases at each position of the alignment is calculated. Numbers are rounded downwards to integers, therefore the columns total less than 100 in some cases.

Figure 5. USS stem-loop sequences in Haemophilus. The examples show how alignment of the AT-repeats in the 29-bp domain determines spacing of the two USSs that make up a stem-loop.
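The per-position base frequencies described for figure 4 can be tabulated from a set of sites that have already been aligned in the plus orientation. The sketch below is illustrative only (the function name and the toy alignment are hypothetical); it rounds percentages down to integers, which is why a column can total less than 100.

```python
from collections import Counter


def base_frequencies(aligned_sites):
    """Return one {base: percent} dict per alignment column, rounded down."""
    n = len(aligned_sites)
    length = len(aligned_sites[0])
    table = []
    for i in range(length):
        counts = Counter(site[i] for site in aligned_sites)
        # Integer division rounds each percentage downwards.
        table.append({base: (100 * counts[base]) // n for base in "ACGT"})
    return table


# Toy alignment of three equal-length sites (not real USS data).
toy_sites = ["AAG", "AAC", "ATG"]
for column in base_frequencies(toy_sites):
    print(column, "total:", sum(column.values()))
```

In the toy alignment the first column totals 100, while the second and third total 99 because two-thirds and one-third are each rounded down (66 and 33).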

Figure 6. Distribution of separation distances of USS stem-loop structures in Haemophilus and Neisseria depending on whether the configuration of the two USSs is plus/minus or minus/plus.

5. Genus-specific distribution of USS sites

Deich et al. [10] reported efficient uptake of H. haemolyticus, H. parainfluenzae, and H. aegyptius DNAs into competent H. influenzae Rd cells. DNAs from these bacteria also compete with H. influenzae Rd DNA for uptake into competent H. influenzae Rd cells on an approximately equal mass basis. GenBank sequences for these bacteria were examined for the 9-bp USS sequence. We found 10 USSs with conserved 29-bp domains in approximately 18 000 bp of H. parainfluenzae sequence. Interestingly, 7 of the 10 have the full 11-bp sequence, 5’-AAGTGCGGTCA, that was originally reported by Danner et al. [8]. Two plus/minus stem-loop structures are present, accounting for 4 of the USSs. H. aegyptius GenBank sequences (about 29 000 bp) contained 12 USSs with 29-bp domains. One pair of these USSs is contained in a plus/minus stem-loop structure.

Too few H. haemolyticus sequences are in GenBank to give a valid analysis. H. ducreyi GenBank sequences (about 112 000 bp) contained no USS sequences. Among the Neisseria, N. gonorrhoeae [30] and N. meningitidis have about the same density of sites per kb of genome sequence (table I). However, only 1 site was found in 18 000 bp of N. sicca sequence and 1 site in 7 900 bp of N. pharyngis sequence. A more exhaustive survey of related bacteria is beyond the scope of this paper. We conclude that USS sites are widespread within a genus, but not universal.

6. Intergeneric transfer of chromosomal genes between Haemophilus and Neisseria

Kroll et al. [26] searched N. meningitidis with the 9-bp Haemophilus USS and found a Haemophilus-like sequence associated with three genes: the virulence gene sodC, the bio gene cluster, and an unidentified open reading frame. A very significant finding was that two 9-bp USSs formed the transcription terminator of the sodC gene. They postulate that these are examples of horizontal gene transfer and state that discovery of regions with the ‘wrong’ USS is a ‘way of establishing potentially important chromosomal mosaicism in these pathogenic bacteria’. A search of N. meningitidis with the Haemophilus 9-bp core sequence finds 19 matches and one stem-loop. Seven are expected by chance based on the nucleotide composition of N. meningitidis. A search of H. influenzae Rd with the 10-bp N. meningitidis sequence finds only 4 hits, but less than 1 is expected by chance. The significance of a hit must be based on substantiating evidence of the sort provided by Kroll et al.

7. Maintenance of USS copy number

It is not unusual for sequence repeats of various classes to accumulate in a genome. This is particularly true in higher eukaryotes, where most of a genome may consist of repeated sequences. These most often represent complete transposons and retrotransposons or various remnants thereof. Considerable divergence of the sequences can occur over time. The same is true to a lesser extent in bacteria. However, the USSs appear to be a totally different class of repeats. They are short, highly conserved sequences and they are distributed in the genome in an apparently purposeful way. Because of this, it is a given that they must confer some selective advantage on the organism.

Individual copies of USSs may confer selective advantage because of their placement in the genome. The stem-loop USS pairs are an example. One could argue that the exact sequence in the helical stem of a terminator loop is not crucial to function and so a mutation would be tolerated. However, two mutations would be required to maintain the base pairing. It may also be that the USS terminators interact with one or more proteins during termination or translation. Thus mutations leading to loss of function would be selected against.

Some, or even most, of the sites may individually impart only minor selective advantages to the cell, but the totality of sites may have an important survival role. We can assume that transformation is an important function for the cell and that, in the natural environment, there is an advantage to selectively taking up homologous DNA. Thus a large number of sites is important because it assures that essentially every piece of DNA released by dying or lysing cells carries one or more uptake sites.

The question then arises as to how a particular density of sites is maintained in the chromosome. Smith et al. [36] proposed that good USSs are maintained by the act of transformation. Suppose a USS site mutates to an inactive form. A fragment of DNA containing that site is then somewhat less likely to be taken up compared to the same DNA with a good site. A cell carrying the bad site will have a slightly better chance of picking up DNA molecules with the good site, thereby restoring the good site by recombination. By the same token, if a new site is generated by mutation, it will impart to that part of the chromosome a better chance of being taken up by competent cells and spreading itself in the population, thus increasing the number of sites by one. Eventually equilibrium will be achieved such that creation of new sites and loss of sites by mutation are in balance. If this is true, then mutated sites with one incorrect nucleotide should be in excess in the genome. This is the case in both Haemophilus (761 sites with one mismatch to the 9-bp site) and Neisseria (815 sites with one mismatch to the 10-bp site).
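Counting sites with exactly one mismatch, as in the equilibrium argument above, amounts to a Hamming-distance-1 search. The following is a minimal sketch with hypothetical function names; it scans a single strand only, and whether the published counts include both strands is not stated here, so that choice is left to the caller.

```python
def one_mismatch_variants(site):
    """Return every sequence that differs from `site` at exactly one position."""
    variants = set()
    for i, base in enumerate(site):
        for alt in "ACGT":
            if alt != base:
                variants.add(site[:i] + alt + site[i + 1:])
    return variants


def count_one_mismatch_sites(genome, site):
    """Count windows of `genome` (given strand only) one mismatch away from `site`."""
    variants = one_mismatch_variants(site)
    k = len(site)
    return sum(genome[i:i + k] in variants for i in range(len(genome) - k + 1))


print(len(one_mismatch_variants("AAGTGCGGT")))   # 9 positions x 3 alternatives
print(len(one_mismatch_variants("GCCGTCTGAA")))  # 10 positions x 3 alternatives
print(count_one_mismatch_sites("AAGTGCGGACC", "AAGTGCGGT"))
```

A 9-bp site has 27 one-mismatch variants and a 10-bp site has 30, so the lookup set stays small. To cover the minus orientation, the same count can be repeated with the reverse complement of the site.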

8. A high frequency 10-bp site in the Synechocystis sp. PCC6803 genome

Synechocystis sp. PCC6803 is a naturally transformable, Gram-negative, photosynthetic cyanobacterium that was sequenced in 1996 [22]. Its genome is 3.57 Mb in size and contains 2823 copies of the palindromic sequence 5’-GGCGATCGCC. The mean distance between sites is 1265 bp. The sites appear to be randomly distributed on a genome-wide scale (P < 0.9). Because the site has two-fold rotational symmetry, there is no minus orientation. A map of the sites is shown in figure 1 with sites alternating above and below the line to reduce crowding. A chi-square plot shows that several regions of low site density correspond to regions of atypical sequence, as with H. influenzae and N. meningitidis.

The distribution of the 10-bp sites over close distances is nonrandom (figure 7). There is a total lack of sites closer together than about 50 bp. The number of sites increases toward expected random levels at separations of 160 to 180 bp, and then to above the expected levels at separation distances of up to about two thousand base pairs. At more distant spacings the number is about as expected. This is exactly opposite to the distribution of USS spacings in H. influenzae Rd and N. meningitidis.

The Synechocystis sites are preferentially located in gene coding sequences (table II), and they are most frequently found in frame 3 (table III), which divides the 10-bp sequence into the codon triplets G|GCG|ATC|GCC for the neutral amino acids A, I, A. Alignment of all the sites (figure 4) shows that conservation of nucleotide residues does not extend beyond the core 10 base pairs. Specifically, there is no AT-rich context extending to either side of the site. This, along with the preference for coding regions and the lack of closely spaced sites, argues against a role in transcription termination. There is also no experimental evidence to suggest a role for these sites in DNA uptake. In fact, the bacteria are reported to take up DNA without selectivity [39]. We can thus safely conclude that the 10-bp sequences are not USSs of the type found in Haemophilus or Neisseria.

The high frequency of the sites and their peculiar distribution in the genome suggest a function that can exert a strong selective advantage. A great deal of selective pressure is required to totally eliminate sites in close proximity to each other and to preferentially space sites a few hundred base pairs apart. Akiyama et al. [1] recently discovered that plasmids in Synechococcus sp. PCC7002 contain several copies of the 10-bp palindrome and that it is involved in site-specific recombination between plasmids. Mutation of a site results in loss of plasmid recombination at that position. They also found high representations of the 10-bp palindrome in chromosomal sequences of several other species of cyanobacteria, including PCC6803. It requires no great stretch of the imagination to speculate that the chromosomal sites may also be involved in site-specific recombination events during interspecies transformation. This could significantly influence the rate of evolution in cyanobacteria.

Figure 7. Distribution of 10-bp palindromic repeat sequence separation distances in the Synechocystis sp. genome. A. The number of sites separated from the preceding site by a given distance is plotted vs distance. Distances are from 0 to 13 000 bp. B. Distances are from 0 to 1 000 bp.
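To put the absence of closely spaced Synechocystis sites in perspective, one can estimate how many close neighbours random placement would produce. The sketch below is a back-of-envelope calculation under an assumption that is not from the paper: sites scattered independently at random, so that separations between consecutive sites are approximately exponentially distributed. The function names are hypothetical.

```python
import math


def mean_spacing(genome_len, n_sites):
    """Average distance between consecutive sites."""
    return genome_len / n_sites


def expected_close_pairs(genome_len, n_sites, max_sep):
    """Expected number of consecutive separations below `max_sep`.

    Assumes independent random placement (exponential spacing), which is
    an approximation introduced here for illustration.
    """
    rate = n_sites / genome_len
    return n_sites * (1.0 - math.exp(-rate * max_sep))


genome_len = 3_570_000  # 3.57 Mb
n_sites = 2823
print(round(mean_spacing(genome_len, n_sites)))
print(expected_close_pairs(genome_len, n_sites, 50))
```

The first value reproduces the mean spacing of about 1265 bp. Under the random model, on the order of a hundred consecutive pairs would be expected to lie within 50 bp of each other, which is the baseline against which the observed total lack of such pairs stands out.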

9. Survey of other sequenced microbial genomes for USS-like sites

We examined all of the currently completed microbial genomes for USS-like sites. The frequencies of all 8-bp, 9-bp, 10-bp, 11-bp, and 12-bp sequences were computed. For each sequence, its observed frequency (f) was compared to the expected frequency (m) and the number of standard deviations above or below the mean was computed using the formula $\mathrm{SD} = \left((f - m)^2 / m\right)^{1/2}$. We used the frequencies of USSs in Haemophilus and Neisseria and their standard deviations from the mean as a guide in examining each genome. We looked for oligonucleotide sequences that are found at least once every two to three thousand base pairs in the genome and that are 200 or more standard deviations from the mean. Most of the genomes did not contain sequences of high enough copy number to be significant.

There were no sequences with the characteristics of USSs among the six archaeal genomes: Methanococcus jannaschii [5], Pyrococcus horikoshii [23], Pyrococcus abyssi [16], Archaeoglobus fulgidus [25], Methanobacterium thermoautotrophicum [35], and Aeropyrum pernix [24]. However, A. pernix contains 361 copies of the palindrome 5’-GAGGACCTC. Only eight copies are expected by chance (SD = 121). Among the sequenced bacterial genomes, no highly repeated oligonucleotide sequences meeting the above criteria were found in B. subtilis [27], Borrelia burgdorferi [15], Escherichia coli [4], Helicobacter pylori [38], Mycobacterium tuberculosis [6], Mycoplasma genitalium [13], Mycoplasma pneumoniae [19], Aquifex aeolicus [9], Treponema pallidum [14], Rickettsia prowazekii [3], Thermotoga maritima [29], Chlamydia pneumoniae [21], or Chlamydia trachomatis [21, 37].

Of the sequenced microbial organisms, only those bacteria listed in table I are known to be naturally transformable. Our survey of the sequenced genomes included an effort to predict whether any might possess transformation capability by examining each for homology to known transformation (competence) genes from H. influenzae, N. gonorrhoeae, S. pneumoniae, and B. subtilis using tblastn [2]. Although occasional homologies to one or a few of the genes were found, there was not enough evidence to safely predict the transformability of any of them. This does not rule out the possibility that one or more might be capable of transformation. A universal transformation gene profile with predictive ability does not yet exist: each naturally transformable organism seems to have its own transformation-specific genes, with only minimal overlap between the competence genes of the known transformers.
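The screening statistic used in this survey is easy to restate in code. The sketch below is illustrative rather than the authors' implementation: the expected count is computed from single-base frequencies on one strand, which is one simple reading of "expected frequency calculated from the base composition", and the function names are hypothetical.

```python
import math
from collections import Counter


def expected_count(genome, oligo):
    """Expected occurrences of `oligo` on one strand from single-base frequencies."""
    n = len(genome)
    freqs = Counter(genome)
    p = 1.0
    for base in oligo:
        p *= freqs[base] / n
    return (n - len(oligo) + 1) * p


def sd_score(f, m):
    """Number of standard deviations between observed f and expected m."""
    return math.sqrt((f - m) ** 2 / m)


print(sd_score(361, 8))                    # A. pernix palindrome, rounded expectation
print(expected_count("ACGT" * 25, "AC"))   # toy genome with equal base frequencies
```

With the rounded expectation of eight copies, the A. pernix score comes out near 125; the reported value of 121 presumably reflects an unrounded expected frequency. Because the score divides by the square root of m, rare expected oligonucleotides reach very large scores even at modest copy numbers, which is why the survey also required a minimum site density.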

10. Conclusions

Two natural transformers, H. influenzae Rd and N. gonorrhoeae, have been shown experimentally to selectively take up DNA fragments containing short sites of specific sequence referred to as uptake signal sequences or USSs. These sites have been identified as 5’-AAGTGCGGT and 5’-GCCGTCTGAA for the two bacteria, respectively. Analysis of genome sequences reveals a large number of copies of the sequences: 1471 in H. influenzae Rd, 1952 in N. gonorrhoeae, and 1891 in N. meningitidis. An analysis of the distribution of the USS sites in H. influenzae Rd and N. meningitidis shows that they occur on both strands in about equal number. Those on one strand are plus sites and those on the complementary strand are minus sites. The sites are nearly randomly distributed on a large scale, but at close range they are nonrandom in distribution. About twenty percent of the USS sites in both bacteria are coupled together so as to form stem-loop structures when transcribed into RNA. These pairs of sites are found in close proximity to the 3’ ends of a number of genes and have the appearance of typical transcription terminators.

Several other bacteria in each genus carry identical sites, but some species in each genus appear to lack sites or have only a few copies. The high density of USSs in Haemophilus and Neisseria enables tracking of the DNAs during intergeneric transfers. Thus, for example, if characteristic Haemophilus sites are found in close association with a gene on the Neisseria chromosome, this is strong evidence for horizontal gene transfer.

Analysis of other sequenced bacterial and archaeal genomes does not reveal any highly repeated sequences with the characteristics of USSs. However, Synechocystis sp. PCC6803 has a high copy number palindromic sequence that has been identified as functioning in site-specific recombination. Aeropyrum pernix, an archaeon, also has a high copy number palindromic sequence, but its function has not been studied.

Acknowledgments

We wish to thank Jeremy Peterson for help in writing scripts for some of the searches. This work was funded by a grant from the Merck Genome Research Institute.

References [1] Akiyama, H., Kanai, S., Hirano, M., Miyasaka, H., A novel plasmid recombination mechanism of the marine cyanobacterium Synechococcus sp. PCC7002, DNA Res. 5 (1998) 327–334. [2] Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J., Basic local alignment search tool, J. Mol. Biol. 215 (1990) 403–410. [3] Andersson S.G., Zomorodipour A., Andersson J.O., SicheritzPonten T., Alsmark U.C., Podowski R.M., Naslund A.K., Eriksson A.S., Winkler H.H., Kurland C.G., The genome sequence of Rickettsia prowazekii and the origin of Mitochondria, Nature 396 (1998) 133–140. [4] Blattner F.R., Plunkett G., Bloch C.A. et al., The complete genome sequence of Escherichia coli K-12, Science 277 (1997) 1453–1474. [5] Bult, C.J. et al., Complete genome sequence of the methanogenic Archaeon, Methanococcus jannaschii, Science 273 (1996) 1058–1073. [6] Cole S.T., Brosch R., Parkhill J., Garnier T. et al., Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence, Nature 393 (1998) 537–544. [7] Danner, D.B., Smith, H.O., Narang, S.A., Construction of DNA recognition sites active in Haemophilus transformation, Proc. Natl. Acad. Sci. USA 79 (1982) 2393–2397.

[8] Danner, D.B., Deich, R.A., Sisco, K.L., Smith, H.O., An eleven base pair sequence determines the specificity of DNA uptake in Haemophilus transformation, Gene 11 (1980) 311–318.
[9] Deckert, G., Warren, P.V., Gaasterland, T., Young, W.G., Lenox, A.L., Graham, D.E., Overbeek, R., Snead, M.A., Keller, M., Aujay, M., Huber, R., Feldman, R.A., Short, J.M., Olsen, G.J., Swanson, R.V., The complete genome of the hyperthermophilic bacterium Aquifex aeolicus, Nature 392 (1998) 353–358.
[10] Deich, R.A., Smith, H.O., Homologous and heterologous DNA uptake in Haemophilus transformation, in: Glover, S.W., Butler, L.O. (Eds.), Transformation 1978, Cotswold Press Ltd., Oxford UK, 1979, pp. 377.
[11] Fitzmaurice, W.P., Benjamin, R.C., Huang, P.C., Scocca, J.J., Characterization of recognition sites on bacteriophage HP1c1 DNA which interact with the DNA uptake system of Haemophilus influenzae Rd, Gene 31 (1984) 187–196.
[12] Fleischmann, R.D., Adams, M.D., White, O. et al., Whole-genome random sequencing and assembly of Haemophilus influenzae Rd, Science 269 (1995) 496–512.
[13] Fraser, C.M. et al., The minimal gene complement of Mycoplasma genitalium, Science 270 (1995) 397–408.
[14] Fraser, C.M. et al., Complete genomic sequence of Treponema pallidum, the syphilis spirochete, Science 281 (1998) 375–388.
[15] Fraser, C.M. et al., Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi, Nature 390 (1997) 580–586.
[16] Genoscope, Database for Pyrococcus abyssi, http://www.genoscope.cns.fr/Pab/
[17] Goodgal, S.H., Mitchell, M.A., Sequence and uptake specificity of cloned sonicated fragments of Haemophilus influenzae DNA, J. Bacteriol. 172 (1990) 5924–5928.
[18] Goodman, S.D., Scocca, J.J., Identification and arrangement of the DNA sequence recognized in specific transformation of Neisseria gonorrhoeae, Proc. Natl. Acad. Sci. USA 85 (1988) 6982–6986.
[19] Himmelreich, R., Hilbert, H., Plagens, H., Pirkl, E., Li, B.C., Herrmann, R., Complete sequence analysis of the genome of the bacterium Mycoplasma pneumoniae, Nucleic Acids Res. 24 (1996) 4420–4449.
[20] Hollander, M., Wolfe, D.A., Nonparametric Statistical Methods, John Wiley Publishers, New York, 1973, pp. 219–228.
[21] Kalman, S., Mitchell, W., Marathe, R., Lammel, C., Fan, J., Hyman, R.W., Olinger, L., Grimwood, J., Davis, R.W., Stephens, R.S., Comparative genomes of Chlamydia pneumoniae and C. trachomatis, Nat. Genet. 21 (1999) 385–389.
[22] Kaneko, T., Sato, S., Kotani, H. et al., Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. II. Sequence determination of the entire genome and assignment of potential protein-coding regions, DNA Res. 3 (1996) 109–136.
[23] Kawarabayasi, Y., Sawada, M., Horikawa, H. et al., Complete sequence and gene organization of the genome of a hyperthermophilic archaebacterium, Pyrococcus horikoshii OT3, DNA Res. 5 (1998) 55–76.
[24] Kawarabayasi, Y. et al., Complete genome sequence of an aerobic hyper-thermophilic crenarchaeon, Aeropyrum pernix K1, DNA Res. 6 (1999) 83–101.
[25] Klenk, H.P. et al., The complete genome sequence of the hyperthermophilic, sulphate-reducing archaeon Archaeoglobus fulgidus, Nature 390 (1997) 364–370.
[26] Kroll, J.S., Wilks, K.E., Farrant, J.L., Langford, P.R., Natural genetic exchange between Haemophilus and Neisseria: intergeneric transfer of chromosomal genes between major human pathogens, Proc. Natl. Acad. Sci. USA 95 (1998) 12381–12385.
[27] Kunst, F., Ogasawara, N., Moszer, I. et al., The complete genome sequence of the Gram-positive bacterium Bacillus subtilis, Nature 390 (1997) 249–256.
[28] Mathis, L.S., Scocca, J.J., Haemophilus influenzae and Neisseria gonorrhoeae recognize different specificity determinants in the DNA uptake step of genetic transformation, J. Gen. Microbiol. 128 (1982) 1159–1161.

[29] Nelson, K.E., Clayton, R.A., Gill, S.R. et al., Evidence for lateral gene transfer between Archaea and bacteria from genome sequence of Thermotoga maritima, Nature 399 (1999) 323–329.
[30] Roe, B.A., Lin, S.P., Song, L., Yuan, X., Clifton, S., Ducey, T., Lewis, L., Dyer, D.W., Gonococcal Genome Sequencing Project, supported by USPHS/NIH grant #AI38399.
[31] Salzberg, S., Salzberg, A., Kerlavage, A., Tomb, J.F., Skewed oligomers and origins of replication, Gene 217 (1998) 57–67.
[32] Sanger Centre, Neisseria meningitidis sequence available at ftp://ftp.sanger.ac.uk/pub/pathogens/nm
[33] Scocca, J.J., Poland, R.L., Zoon, K.C., Specificity in deoxyribonucleic acid uptake by transformable Haemophilus influenzae, J. Bacteriol. 118 (1974) 369–373.
[34] Sisco, K.L., Smith, H.O., Sequence specific DNA uptake in Haemophilus transformation, Proc. Natl. Acad. Sci. USA 76 (1979) 972–976.

[35] Smith, D.R., Doucette-Stamm, L.A., Deloughery, C. et al., Complete genome sequence of Methanobacterium thermoautotrophicum deltaH: functional analysis and comparative genomics, J. Bacteriol. 179 (1997) 7135–7155.
[36] Smith, H.O., Tomb, J.F., Dougherty, B.A., Fleischmann, R.D., Craig, J.C., Frequency and distribution of DNA uptake signal sequences in the Haemophilus influenzae Rd genome, Science 269 (1995) 538–540.
[37] Stephens, R.S., Kalman, S., Lammel, C., Fan, J., Marathe, R., Aravind, L., Mitchell, W., Olinger, L., Tatusov, R.L., Zhao, Q., Koonin, E.V., Davis, R.W., Genome sequence of an obligate intracellular pathogen of humans: Chlamydia trachomatis, Science 282 (1998) 754–759.
[38] Tomb, J.F. et al., The complete genome sequence of the gastric pathogen Helicobacter pylori, Nature 388 (1997) 539–547.
[39] Yura, K., Toh, H., Go, M., Putative mechanism of natural transformation as deduced from genome data, DNA Res. 6 (1999) 75–82.