
Plasmid 51 (2004) 127–147 www.elsevier.com/locate/yplas

Comparative sequence analysis of the icm/dot genes in Legionella

Irina Morozova,a,1 Xiaoyan Qu,a,1 Shundi Shi,a,1 Gifty Asamani,a Joseph E. Greenberg,a Howard A. Shuman,b and James J. Russo a,*

a Columbia Genome Center, Columbia University College of Physicians and Surgeons, 1150 St. Nicholas Avenue, New York, NY 10032, USA
b Department of Microbiology, Columbia University College of Physicians and Surgeons, 701 W. 168th Street, New York, NY 10032, USA

Received 18 August 2003, revised 21 November 2003

Abstract

The icm/dot genes in Legionella pneumophila are essential for the ability of the bacteria to survive within macrophages in lung infections such as Legionnaires' disease, or amoebae in nature. The 22 genes of the complex, thought to encode a transport apparatus for transfer of effector molecules into the host cell cytoplasm, are located in two chromosomal loci. We demonstrate that these genes are present in all the L. pneumophila strains examined herein, but display a wide range of sequence variation among the different strains, none of which are clearly associated with virulence potential. The strains fall within seven phylogenetic groups, but discrepancies among the gene trees indicate a complicated evolutionary history for the icm/dot loci, with perhaps two independent gene acquisition events and subsequent genomic rearrangements. Significant findings include a probable t-SNARE domain in IcmG that may indicate a direct role for this putative inner membrane protein in altering the host's membrane fusion machinery, a potential functional domain in the central hydrophobic portion of IcmK that may allow it to participate in forming the pore of the secretion complex, and strict conservation of the amino acid physicochemical characteristics in the IcmP region corresponding to the trbA domain that could play a role in molecular transfer.

© 2004 Elsevier Inc. All rights reserved.

Keywords: Legionella pneumophila; icm/dot genes; Evolution; Phylogenetic analysis; Virulence

* Corresponding author. Fax: +212-851-5215. E-mail address: [email protected] (J.J. Russo).
1 These three authors contributed equally to the work.

0147-619X/$ - see front matter © 2004 Elsevier Inc. All rights reserved. doi:10.1016/j.plasmid.2003.12.004

1. Introduction

Legionella pneumophila, the causative agent of Legionnaires' disease, an occasionally fatal pneumonia, as well as much more common mild "flu"-like lung infections, is found in fresh water throughout the world. The Philadelphia 1 isolate of L. pneumophila, named after the site of the originally described outbreak in 1976 (Fraser et al., 1977), is a member of the most prevalent serogroup 1 (Fields et al., 2002). Isolates associated with at least 15 other serogroups have been

described in the ensuing years (Helbig et al., 2002; Yu et al., 2002). In addition, L. pneumophila is one of about 42 known species within the genus Legionella (Fields et al., 2002; Yu et al., 2002), many of which can be associated with clinical symptoms.

As part of their life cycle, Legionella bacteria are taken up and survive within phagocytic cells (e.g., amoebae in the environment, macrophages in the human lung). The bacteria replicate within intracellular vacuoles, and eventually kill the original host cell, whereupon they may infect nearby phagocytes (Swanson and Hammer, 2000). Among other virulence genes, two regions with some features of pathogenicity islands, the so-called icm/dot gene clusters, appear to be essential for their ability to survive within and kill macrophages and amoebae (Andrews et al., 1998; Berger and Isberg, 1993; Sadosky et al., 1993; Segal and Shuman, 1997; Vogel et al., 1998). The icm/dot² cluster I includes seven genes (dotA–D, icmV, W, X) and the larger cluster II contains the remaining 17 members (icmT, S, R, Q, P, O, N, M, L, K, E, G, C, D, J, B, F). Their encoded proteins are thought to translocate effector molecules into the host cell that somehow prevent the latter from killing the proliferating bacteria, probably by preventing phagosome–lysosome fusion in macrophages (Christie, 2001; Nagai et al., 2002). The icm/dot loci are highly similar to the transfer region of plasmid R64 and other IncI1 plasmids, and it has previously been suggested that icm/dot virulence genes share a common ancestor with plasmid conjugation genes (Komano et al., 2000; Segal and Shuman, 1999; Segal et al., 1998; Sexton and Vogel, 2002). It is unclear whether the icm/dot genes derive from a single plasmid, after which they separated into the two gene clusters, or whether there were multiple gene transfer events.
² Many of these genes were discovered at about the same time in two laboratories and referred to as either icm (intracellular multiplication) or dot (defective in organellar trafficking). Where a particular gene has two names, we use the icm designation in this paper.

Of the more than 100 bacteria for which complete genome sequence is available, only Coxiella burnetii has homologs of the full set of icm/dot genes; in Coxiella, all the icm/dot genes are contained in a single locus (Seshadri et al., 2003). Besides the icm/dot genes, Legionella, like many other bacteria, contain most members of a Type IV secretion system, the lvh/lvr genes, which have virulence properties in some organisms, though apparently not in L. pneumophila (Segal et al., 1999).

Since most of the icm/dot proteins are clearly implicated in the ability of Legionella to grow and survive within macrophages, it would be of interest to know if any of them are missing or considerably different in strains of lower pathogenicity. In the present work, we demonstrated the presence of several of the icm/dot genes in at least seven Legionella species other than L. pneumophila, and one or more lvh/lvr genes in most of the Legionella species tested.

The evolution of the icm/dot genes may be distinct from the majority of the other Legionella genes, particularly the housekeeping genes, either because they are part of the bacteria's virulence gene set, or because of their presumed plasmid origin. Virulence genes are often subject to diversifying selection and evolve faster than the rest of the genome to avoid the host's response to the infection. But Legionella, with their largely intracellular lifestyle, are only briefly exposed to the mammalian immune system, and are not known to establish infections in serial hosts; therefore they are unlikely to undergo adaptive evolution. Indeed, the mip gene, which encodes a possible virulence factor, was found to have relatively few polymorphisms in L. pneumophila strains even though it encodes an outer membrane protein (Bumbaugh et al., 2002). In addition, although the quite variable dotA gene product was found to be less conserved in its outer domains, the ratio of synonymous and nonsynonymous nucleotide substitutions in this region did not indicate adaptive evolution according to these same investigators (Bumbaugh et al., 2002).
In this study we addressed the question of whether the remaining genes of the icm/dot loci show the same elevated level of variability as dotA relative to other portions of the genome, particularly housekeeping genes. A possible plasmid origin of the icm/dot genes would account for differences in evolutionary history between these genes as a group and the rest of Legionella's chromosomal genes. According to this hypothesis, the icm/dot loci would constitute relatively pliable regions susceptible to repeated regional gene rearrangements. Indications of multiple rearrangements in the icm/dot region have indeed been described (Ko et al., 2002a,b).

We sequenced 18 icm/dot genes and 4 other genes in 18 different strains of L. pneumophila. Comparative sequence analysis reveals a wide spectrum of variability among the icm/dot genes, some as conservative as mip and housekeeping genes, others more variable than dotA. A search for protein functional motifs, together with the distribution of variable and conservative regions along the gene sequence, gave additional information on the location of functionally important domains in some icm/dot gene products. We did not observe clear-cut associations of any particular gene variations with either known serogroups or virulence phenotypes. Phylogenetic analysis indicated that different L. pneumophila strains displayed distinct acquisition histories for some subsets of the icm/dot genes, consistent with an evolutionary scenario for the L. pneumophila species encompassing gene rearrangements as well as repeated horizontal gene transfer events.

2. Materials and methods

2.1. Bacterial strains

The Legionella species and L. pneumophila strains used in this study are enumerated in Table 1.

2.2. Hybridization

Specific primers were designed to amplify all or a portion of each gene in the Philadelphia 1 strain of L. pneumophila. In general, PCR was carried out using 150–200 ng DNA, 1.5 mM MgCl2, 1× reaction buffer, 0.2 µM each dNTP, 10 pmol each primer, and 2 U Taq polymerase (Invitrogen) in 50 µl reaction volumes with the following PCR profile (5 min at 95 °C; 35 cycles of 95 °C, 30 s;

Table 1
Legionella strains

L. pneumophila

Leg #   Serogroup   Isolate
1       1           Bellingham 1
2       11
3       1           Philadelphia 1
4       2           Togus 1
5       3           Bloomington 2
6       4           Los Angeles 1
7       5           Dallas
8       6           Chicago 2
9       7           Chicago 8
10      8           Concord 3
11      9           IN-23-G1-C2
30      13          82A31053
31      1           Knoxville 1
32      10          Leiden 1
33      11          797-PA-H
34      12          570-CO-H
35      1           Amsterdam B-1
36      1           Amsterdam B-2

Other Legionella species

Leg #   Legionella species     Isolate
12      L. dumoffii            NY-23
13      L. longbeachae 1       LB-4
14      L. longbeachae 2       Tucker 1
15      L. gormanii            LS-13
16      L. micdadei            TATLOCK
17      L. wadsworthii         81-716
18      L. oakridgensis        Oak Ridge-10
19      L. feeleii 1           WO-44C-C3
20      L. feeleii 2           691-WI-H
21      L. sainthelensis       Mt.St.Helens-4
22      L. jordanis            BL-54D
23      L. spiritensis         Mt.St.Helens-9
25      L. jamestowniensis     JA-26-G1-E2
26      L. cherrii             ORW
27      L. steigerwaltii       SC-18-C9
28      L. parisiensis         PF-209C-C2
29      L. rubrilucens         WA-270A-C2

Sources of strains: Leg 35 and 36, provided by Dr. R. van Ketel, were derived from the Legionnaires' disease outbreak at the 1999 Netherlands flower show. Leg 35 was identified in 28 and Leg 36 in 1 out of 29 patients. The rest of the strains were obtained from Dr. B. Fields at the CDC.

52 °C, 30 s; 72 °C, 30 s; 7 min at 72 °C). The PCR product was radiolabeled using random primer labeling kits RTS RadPrime (Life Technologies) or Redi-Prime 2 (Amersham) with [α-32P]dATP according to the manufacturers' instructions. EcoRI-digested Southern blots of 15 species of Legionella other than pneumophila (17 strains in all), and 18 strains of L. pneumophila were probed with each gene-specific amplimer under standard conditions (overnight hybridization at 65 °C in 0.5 M NaHPO4, pH 7.2, 7% sodium dodecyl sulfate (SDS), 1% bovine serum albumin, 1 mM ethylenediamine tetraacetic acid, 125 µg/ml sheared single-stranded salmon sperm DNA), and then washed to a relatively low stringency (75 mM NaCl/7.5 mM Na3 citrate/0.1% SDS, 65 °C). In a few cases, hybridization temperature was reduced (45 °C) and washing eliminated (just a single 300 mM NaCl/30 mM Na3 citrate/0.1% SDS room temperature rinse).
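The wash and rinse buffers above are stated as salt concentrations; expressing them as multiples of standard saline-sodium citrate (SSC) makes the two stringencies easier to compare. The following is a minimal sketch, assuming the conventional definition of 1× SSC as 150 mM NaCl and 15 mM trisodium citrate; the function name `ssc_fold` is illustrative and does not come from the paper.

```python
# Assumption: 1x SSC = 150 mM NaCl + 15 mM trisodium citrate (conventional recipe).
SSC_1X_NACL_MM = 150.0
SSC_1X_CITRATE_MM = 15.0


def ssc_fold(nacl_mm: float, citrate_mm: float) -> float:
    """Return the SSC multiple for a buffer given in mM NaCl / mM Na3 citrate.

    Raises ValueError if the two salts do not correspond to the same multiple,
    i.e. the buffer is not a simple dilution of SSC.
    """
    nacl_fold = nacl_mm / SSC_1X_NACL_MM
    citrate_fold = citrate_mm / SSC_1X_CITRATE_MM
    if abs(nacl_fold - citrate_fold) > 1e-9:
        raise ValueError("buffer is not a simple multiple of SSC")
    return nacl_fold


# Buffers quoted in Section 2.2
wash = ssc_fold(75, 7.5)    # wash at 65 °C
rinse = ssc_fold(300, 30)   # single room-temperature rinse
print(f"wash: {wash}x SSC, rinse: {rinse}x SSC")
```

Under that assumption, the 65 °C wash corresponds to 0.5× SSC and the room-temperature rinse to 2× SSC.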

2.3. PCR and sequencing

The same primer pairs were used to attempt to amplify genes from various Legionella strains and species. When amplification failed, at least one additional attempt was made with alternative primer pairs; moreover, in some cases it was necessary to adjust the annealing temperatures for individual strains or genes. After PCR, unincorporated nucleotides were dephosphorylated and primers degraded with shrimp alkaline phosphatase and exonuclease I, respectively (incubation at 37 °C, 90 min; enzyme denaturation at 72 °C, 15 min). The same primers were then used for bidirectional sequencing. With genes longer than about 500 bases, additional internal oligonucleotides were designed for priming sequencing reactions. Most sequencing reactions were done with ABI big dye terminator kits or Amersham 377 energy transfer kits, according to the manufacturers' instructions, and following isopropanol precipitation, the sequencing products were separated on ABI 377 gel systems (Perkin–Elmer). Individual sequence reads were assembled into contigs using the Phrap assembler (Green P., http://bozeman.mbt.washington.edu/) or SeqMan (Lasergene System, DNASTAR, Madison, WI). The quality of each base in each sequence was checked both automatically and manually. In a few cases where there was uncertain base calling even after repeated sequencing attempts, these positions were eliminated from the analyses. The sequences obtained in this study have been submitted to GenBank (National Center for Biotechnology Information, Bethesda, MD). Accession numbers and the sequences themselves are available at http://genome3.cpmc.columbia.edu/~legion/comp_proj.htm.

2.4. Additional gene sequences

Homologs of the icm/dot genes from C. burnetii [sequence data were provided by The Institute for Genomic Research (TIGR) under an academic license agreement] and Legionella longbeachae (GenBank Accession No. gi|18693262) were used as outgroups in L. pneumophila gene analysis.

2.5. Sequence alignment and analysis

The ClustalW (Thompson et al., 1994) program was selected for aligning nucleotide or translated amino acid sequences. BioEdit version 5.0.6 (TA Hall, http://www.mbio.ncsu.edu/BioEdit/bioedit.html) and GeneDoc programs (H.B. Nicholas Jr, http://www.psc.edu/biomed/genedoc/) were used to manipulate the alignments and to build the protein hydrophilicity profiles. The numbers of nonsynonymous (leading to amino acid substitutions) (Kn) and synonymous (silent) (Ks) nucleotide substitutions were calculated using the MEGA 2.1 package (Kumar et al., 2001, http://www.megasoftware.net/). Both Kn and Ks were calculated per corresponding site to avoid the influence of overall gene composition. Four physico-chemical properties (volume, polarity, charge, and hydrophobicity) were used to characterize the results of amino acid substitutions in comparisons of translated homologous sequences (Bogardt et al., 1980; Kawashima and Kanehisa, 2000). Corresponding dG values were obtained using Miyata's matrix (Miyata et al., 1979) and were calculated per one amino acid substitution so that they would not depend on the rates of nucleotide substitutions per se. Protein secondary structure prediction was done using SSpro (Pollastri et al., 2002), APSSP2 (Raghava, 2000), and PHD programs (Rost, 1996).

2.6. Homology search

The generic BLAST program (Altschul et al., 1990) and PARACEL BLASTER machine were used for the searches against the TIGR databases for completed bacterial genomes (http://www.tigr.org) and the NCBI nonredundant databases (http://www.ncbi.nlm.nih.gov).

2.7. Domain search

The SMART server (http://smart.embl-heidelberg.de; Letunic et al., 2002) and PFAM database (Bateman et al., 2002) were used to search for protein functional domains and coiled-coil structures.

2.8. Phylogenetic analyses

Tree reconstruction and visualization were accomplished using the MEGA 2.1 package (Kumar et al., 2001). The Li distance approach (Li et al., 1985) was used for building the distance matrix. The neighbor-joining (NJ) tree-building algorithm (Saitou and Nei, 1987), which builds a branching tree diagram from the distance matrix by successively clustering pairs together, was used for phylogenetic inference. Confidence levels of inferred relationships were estimated following 1000 bootstrap iterations. To address uncertainties of tree branching, the split decomposition method of the program SplitsTree (Huson, 1998) was utilized. Unlike most tree building methods, which force data into a tree-like phylogeny, this method portrays the data in a mesh-like graph allowing conflicting phylogenetic information to be visualized, estimated, and compared.
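The successive pair-clustering performed by neighbor-joining in Section 2.8 can be illustrated with a short sketch. This is not the MEGA 2.1 implementation and does not compute Li distances or bootstrap support; it is a textbook version of the Saitou and Nei procedure run on a hypothetical four-taxon distance matrix. The taxon names, the distances, and the function name `neighbor_joining` are all assumptions made for illustration.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Textbook neighbor-joining.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) -> pairwise distance.
    Returns (newick_string, list of leaf sets in the order they were joined).
    """
    d = dict(dist)
    nodes = {name: frozenset([name]) for name in names}  # label -> leaves below
    joins = []
    while len(nodes) > 2:
        active = list(nodes)
        n = len(active)
        # Net divergence of each active node from all others
        r = {i: sum(d[frozenset((i, k))] for k in active if k != i) for i in active}
        # Pick the pair minimising Q(i, j) = (n - 2) d_ij - r_i - r_j
        i, j = min(combinations(active, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        dij = d[frozenset((i, j))]
        len_i = 0.5 * dij + (r[i] - r[j]) / (2 * (n - 2))  # branch i -> new node
        len_j = dij - len_i                                # branch j -> new node
        u = f"({i}:{len_i:g},{j}:{len_j:g})"
        # Distances from the new node to every remaining node
        for k in active:
            if k not in (i, j):
                d[frozenset((u, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij)
        leaves = nodes.pop(i) | nodes.pop(j)
        nodes[u] = leaves
        joins.append(leaves)
    a, b = list(nodes)
    return f"({a},{b}:{d[frozenset((a, b))]:g});", joins


# Hypothetical additive distances for the tree ((A:1,B:2),(C:3,D:4)) with an
# internal branch of length 5.
pairs = {("A", "B"): 3, ("A", "C"): 9, ("A", "D"): 10,
         ("B", "C"): 10, ("B", "D"): 11, ("C", "D"): 7}
dist = {frozenset(k): v for k, v in pairs.items()}
newick, joins = neighbor_joining(["A", "B", "C", "D"], dist)
print(newick)
```

Because the example distances are exactly additive, the sketch recovers the branch lengths used to build them. With real sequence data, the distance matrix would come from the chosen substitution model, and the whole procedure would be repeated on resampled alignment columns to obtain bootstrap values.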

3. Results

3.1. Gene composition of Legionella species

Low stringency hybridization to EcoRI-digested Legionella DNA was carried out using labeled amplified regions of selected genes from the Philadelphia 1 strain of L. pneumophila. The left side of the autoradiogram shown in Fig. 1 depicts some typical results. A table (Supplementary Table SI) compiling the results for every gene and strain is available online at http://genome3.cpmc.columbia.edu/~legion/comp_project/comp_proj.htm. In addition to 16S rRNA genes, positive signals were obtained in most species for housekeeping (e.g., asd) genes; in contrast, only some icm/dot genes were detected in the non-pneumophila species (see right side of autoradiogram in Fig. 1), chiefly in L. longbeachae (Leg14), and to a lesser extent in Legionella dumoffii, Legionella wadsworthii, Legionella gormanii, Legionella micdadei, Legionella feeleii, and Legionella sainthelensis. Intermediate results were obtained for the lvh/lvr genes (lvrA-E, lvhB2-11, D4). For the most part, they generated strong hybridization signals in L. dumoffii, L. longbeachae (Tucker 1), L. wadsworthii, L. oakridgensis, and L. cherrii, but weak or no signals in the remaining species tested. These genes have previously been shown not to be essential for growth of L. pneumophila in macrophages (Segal et al., 1999).

Fig. 1. Gene distribution in different legionellae as scored by hybridization. Comparison of hybridization results using Philadelphia 1 PCR amplified probes in L. pneumophila and other Legionella species for 16S rRNA, aspartate β-semialdehyde dehydrogenase (asd), one of the lvh and three of the icm genes. From left to right: strains Leg 7–17. The presence of multiple bands for 16S rRNA is likely due to multiple copies of the rDNA in the bacterial species (there are at least three partial or complete loci in the Philadelphia 1 strain of L. pneumophila based on the genomic sequence). Variation in banding patterns for other genes could be due to presence or absence of paralogs, or to EcoRI restriction site polymorphisms within or near the gene, in the different organisms. In the case of the lvhB4 and other lvh/lvr genes, variable patterns could reflect the fact that they can be located on a plasmid, as supported by recent data from our laboratory (not shown).

As an alternative to hybridization, primer pairs from Philadelphia 1 were used in an attempt to amplify genes directly from the other strains and species. The results were usually consistent with those obtained using hybridization; with few exceptions, if negative results were obtained using one approach, negative results were also obtained with the alternative procedure (see supplementary Table SII at the above URL for all the PCR results). Still, it is important to realize that under the stringency conditions we utilized for hybridization, we would not expect to identify genes with less than 70% identity at the nucleic acid level; similarly, at least 90% conservation of primer sequence would be required for consistently successful PCR amplification. Thus, an unobserved signal may be due either to true absence of a gene, or perhaps more likely, substantial variation in the gene's sequence compared to that of Philadelphia 1. Among the L. pneumophila strains, high signal strength was obtained for nearly every gene (housekeeping, icm/dot, and lvr/lvh). This would

imply that different strain virulence phenotypes are not accounted for by simple presence or absence of these orthologs. Therefore, to determine if more subtle genetic features were involved, we carried out a comparative sequence analysis on all these genes in the different L. pneumophila strains as well as the L. longbeachae icm/dot genes available from GenBank. (Although we were able to amplify a few of the icm/dot genes in the non-pneumophila species, the comparative sequencing described below was restricted to the pneumophila strains.)

3.2. Level of interstrain and interspecies variation in L. pneumophila

There are now about 48 known Legionella species (Perez-Luz et al., 2002) and about 15 L. pneumophila serogroups, comprising approximately 70 known serogroups in the genus overall. Despite detailed analyses, there are complications in some of the assignments (see review by Benson and Fields, 1998). Appreciating that taxonomic positioning cannot always accurately reflect evolutionary distance (Rosello-Mora and Amann, 2001), Table 2 summarizes icm and non-icm sequence diversity based on our data for different strains of L. pneumophila, and in a lesser number of cases, other Legionellae, as well as published gene sequence data. The differences between Legionella species are within or close to the standard boundaries of speciation (for review, see Rosello-Mora and Amann, 2001): 95% homology for 16S

Table 2
Variations among Legionella genes

                            16S rRNA (a)     Non-icm genes (b)              icm/dot genes (c)
                                             DNA              Protein       DNA              Protein
Within L. pneumophila       99.2%            89–100% (96%)    97.9%         96–98% (97%)     94–100% (98%)
Between Legionella spp.     91–99% (96%)     69–99%           75–99%        62–79% (70%)     58–91% (74%)
With Coxiella (d)           85.5%            -                -             39–66%           23–63%

The data in the above table represent averages or ranges (and averages) of the percent homology for different genes.
a Based on our data and that of Adeleke et al. (1996).
b Based on information available for 8 non-icm/dot genes, our data plus that of Ratcliff et al. (1997), Doyle et al. (1998), Ratcliff et al. (1998), and Avison and Simm (2002).
c Based on our data plus gene sequences for L. longbeachae submitted to GenBank by Rogers et al. (2002) (AF288617). Ko et al. (2002b) have shown a wider range of nucleic acid homology for dotA within L. pneumophila (78–100%), when they include in the comparisons the L. pneumophila subsp. fraseri and subsp. pneumophila.
d TIGR data compared with the Philadelphia 1 strain of L. pneumophila.

rRNA and 70% for other genes. As can be seen from the table, icm/dot genes have a higher level of interspecies diversity (62–79% homology, with a mean value of 70%) than non-icm/dot genes, though as of today, only L. pneumophila vs L. longbeachae comparisons for several icm genes are available.

There is a considerable range of variability for the different icm genes among the L. pneumophila strains examined both at the nucleotide and protein levels (Table 3). Some genes have a very low percentage of variable positions, and even silent substitutions are rare, while others, such as icmX, have many polymorphic sites. There were no major insertions or deletions in the sequenced genes, though there were a few 1 or 2 amino acid insertions and deletions (e.g., in icmG in Leg7 and Leg31; icmX in several strains). The number of synonymous (Ks) and nonsynonymous (Kn) nucleotide substitutions was determined per corresponding site, and the mean of all pairwise strain comparisons was calculated for each gene. The icmX, W, V, and dotA genes, which are all members of icm/dot region I (the small icm locus), are quite variable, showing consistently higher Ks and Kn values (with the exception of Kn

Table 3
Sequence variations in icm/dot and non-icm genes

Columns: Gene; number of strains sequenced; gene length in Leg 3; % finished; number of polymorphic sites (nucl, aa); mean pairwise value per site (Kn, Ks); Kn/Ks ratio; dG per one aa change.

Gene                 Strains  Length  % fin.  nucl  aa    Kn      Ks      Kn/Ks   dG

icm/dot locus II
icmF                 15       2922    100     278   50    0.005   0.075   0.067   0.923
icmB                 16       3030    100     276   16    0.002   0.107   0.019   0.615
icmJ                 17       627     99      46    5     0.002   0.105   0.019   1.167
icmD                 14       399     99      39    4     0.002   0.092   0.022   0.566
icmC                 18       582     100     52    10    0.006   0.111   0.054   0.688
icmG                 18       807     99      94    27    0.012   0.131   0.092   1.063
icmK                 18       1083    98      141   30    0.008   0.169   0.047   0.373
icmL                 18       639     100     57    3     0.001   0.087   0.011   0.633
icmM                 17       285     100     20    6     0.008   0.063   0.127   1.433
icmN                 15       570     88      59    7     0.004   0.07    0.057   0.120
icmP                 14       1131    90      95    12    0.002   0.077   0.026   0.034
icmQ                 16       576     98      34    3     0.002   0.079   0.025   0.135
icmR                 17       363     100     40    10    0.01    0.083   0.120   1.148
icmS                 17       345     100     32    3     0.003   0.164   0.018   1.299
icmT                 17       261     100     19    2     0.002   0.097   0.021   0.850
Mean for the locus                                        0.005   0.101   0.048   0.736

icm/dot locus I
icmV                 17       456     100     56    19    0.024   0.147   0.163   0.927
icmW                 17       456     100     34    3     0.002   0.119   0.017   0.998
icmX                 18       1404    99      298   78    0.036   0.307   0.117   1.131
dotA                 5        3189    100     557   139   0.042   0.352   0.118   -
Mean for the locus                                        0.026   0.231   0.104   1.019

Non-icm genes
tphA                 17       1257    99      159   37    0.008   0.097   0.082   0.987
asd                  35       1020    99      29    3     0.001   0.031   0.032   0.374
flrpp                17       690     100     74    11    0.006   0.147   0.040   1.084
RNAseH               11       573     100     22    7     0.006   0.046   0.130   1.769
mip                  17       699     100     54    4     0.002   0.070   0.025   -

Both nonsynonymous (Kn) and synonymous (Ks) nucleotide substitutions are calculated per corresponding site to avoid the influence of gene composition, and dG values are calculated per one amino acid change. Numbers in bold vary the most from the remainder. Abbreviations: asd, aspartate β-semialdehyde dehydrogenase; flrpp, flagellar L-ring protein precursor; mip, macrophage infectivity potentiator.
* Data from Bumbaugh et al., 2002.

in the case of icmW), compared to most of the genes from icm/dot region II. The icmX, V, and dotA genes have Kn values approximately 10 times higher than most of the rest of the icm/dot genes.

The ratio of nonsynonymous to synonymous nucleotide substitutions is usually taken as an indicator of the functional and structural restrictions on gene variability and is independent of the time of gene diversification. The icm/dot genes show a wide distribution in their Kn/Ks ratios, with, for example, icmV having a ratio nearly 15 times higher than that of icmL. The highly conserved genes (icmL, W, S, B, J, T, D, P, and Q) have lower Kn/Ks ratios than even the very conservative housekeeping gene encoding aspartate β-semialdehyde dehydrogenase (asd), which shows as much as 62% homology even with its relatively distant Vibrio cholerae ortholog. In contrast, the most variable genes (icmV, M, R, and X) have Kn/Ks ratios close to or even higher than dotA, which is considered a relatively variable gene (Bumbaugh et al., 2002).

Not all amino acid substitutions in the genes with low Kn/Ks ratios are conservative, as assessed by changes in amino acid physico-chemical properties, and there are cases of genes with relatively conservative amino acid substitutions that nonetheless have a high level of gene variability as judged by Kn/Ks ratios (Fig. 2). For example, the IcmJ, S, and W protein products, despite displaying relatively low Kn/Ks values, have amino acid substitutions that result in drastic changes in their properties; on the other hand, three genes (icmN, P, and Q) have close to locus II average (0.05) Kn/Ks ratios, but their encoded proteins have extremely low dG values, indicating that only substitutions in amino acids with similar physico-chemical properties have been permitted. Since nucleotide substitutions may exert their influence on the function of the final protein product at any of several levels (e.g., DNA, mRNA or protein), Kn/Ks ratios reflect general restrictions on gene and protein variability. On the other hand, dG values reflect variation purely in protein structural and functional features, indicating some restrictions on the amino acid substitutions at the level of the final functioning product. In this sense, icmN, P, and Q may be considered the most conservative of the icm/dot genes.

Fig. 2. Comparison of Kn/Ks and dG values for icm/dot and non-icm/dot genes. Kn/Ks ratios and dG values for the icm/dot genes shown in order from highest (icmV) to lowest (icmL) Kn/Ks values. Dashed lines correspond to locus II mean values.

There is no obvious correlation between the predicted cell localization of the protein products of these genes and their variability levels. While IcmN is thought to be an outer membrane protein and not necessary for macrophage killing, IcmK is an indispensable periplasmic or outer membrane protein, IcmP is an inner membrane protein, and IcmQ is a soluble cytoplasmic protein required for pore formation (Andrews et al., 1998; Coers et al., 2000; Dumenil and Isberg, 2001; Segal and Shuman, 1998a; Watarai et al., 2001).

Overall, the levels of sequence variation found among the non-icm genes in L. pneumophila strains (last group of genes in Table 3) and most of the icm genes from locus II were comparable to the level of diversity in, for example, Salmonella enterica housekeeping genes reported by Boyd et al. (1997) (where the mean nonsynonymous to synonymous nucleotide substitution ratio was 0.032).
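The distinction between synonymous and nonsynonymous changes that underlies the Kn/Ks comparisons above can be illustrated with a short sketch. It only classifies codon pairs that differ at a single nucleotide and returns raw counts; it does not normalize per synonymous or nonsynonymous site and is not the procedure implemented in MEGA 2.1. The standard genetic code is assumed, and the function names and the two short test sequences are hypothetical.

```python
# Standard genetic code, codons enumerated in TCAG order at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_codon_differences(seq1: str, seq2: str) -> tuple[int, int]:
    """Count (synonymous, nonsynonymous) differences between two aligned,
    gap-free coding sequences. Only codon pairs differing at exactly one
    nucleotide are classified; identical and multi-hit codons are skipped."""
    if len(seq1) != len(seq2) or len(seq1) % 3:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    syn = nonsyn = 0
    for pos in range(0, len(seq1), 3):
        c1, c2 = seq1[pos:pos + 3], seq2[pos:pos + 3]
        if sum(x != y for x, y in zip(c1, c2)) != 1:
            continue
        if CODON_TABLE[c1] == CODON_TABLE[c2]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


def kn_ks_ratio(kn: float, ks: float) -> float:
    """Ratio of per-site nonsynonymous to synonymous substitution values."""
    return kn / ks


# Hypothetical sequences: AAA->AAG is silent (Lys), GGA->GAA changes Gly to Glu.
print(classify_codon_differences("ATGAAACTTGGA", "ATGAAGCTTGAA"))
# Per-site values reported for icmF in Table 3 (Kn 0.005, Ks 0.075).
print(round(kn_ks_ratio(0.005, 0.075), 3))
```

Applied to the Table 3 ratios, the same arithmetic gives 0.163 / 0.011 ≈ 14.8 for icmV relative to icmL, which is the "nearly 15 times" difference noted in the text.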
The level of polymorphism among icm genes from locus I (second group in Table 3) and some locus II members exceeds significantly that for both Legionella and Salmonella housekeeping genes and most of the genes from icm/dot locus II, and corresponds to the variability level for the spaM and spaN genes of the S. enterica inv-spa pathogen invasion complex (Boyd et al., 1997). The order of the icm/dot genes was apparently the same in all 18 strains we examined, as assessed

by our ability to amplify these genes using primers from the expected surrounding genes.

3.3. Paralogs of icm/dot genes in Philadelphia strain of L. pneumophila

It is not unusual to find distant homologs among the genes of a single organism. These may represent members of a gene family that carry out related but not identical functions, or they may no longer have any functional properties in common. Among the icm/dot genes, four partial homologs (paralogs) for the 3′ part of icmL (134 aa), one for its 5′ portion (79 aa), and one for icmC were identified in a search of the now essentially complete Philadelphia 1 genome (http://genome3.cpmc.columbia.edu/~legion/). The icmL paralogs are located in different regions of the genome, and the icmC1 paralog is separated from the locus II icmC gene by 23 kbp. In each case, the paralogs are surrounded by genomic housekeeping genes. The average protein sequence homology between IcmL and its 3′ paralogs is relatively low but clear: 31% identity and 52% similarity over an approximately 120 aa (amino acid) stretch. For comparison, the L. pneumophila IcmL has 91% amino acid identity with L. longbeachae IcmL over a 220 aa stretch; 39% identity to C. burnetii IcmL over 200 aa; and 25–30% identity to traM genes (Klebsiella oxytoca, Pseudomonas syringae, Escherichia coli, and Salmonella typhimurium plasmids) over 160–180 aa. IcmC and IcmC1 have 40% identity over 171 aa.

3.4. Further analysis of individual icm/dot genes

Multiple alignments of the icm/dot genes in all the L. pneumophila strains under study permitted more detailed sequence analyses. The sequence variation patterns at both the nucleotide and amino acid levels, and dG and hydrophilicity profiles along the length of each ORF were determined, as well as potential structural and functional motifs. In Fig. 3, the distribution of nucleotide and amino acid substitutions along the nucleotide and corresponding amino acid sequences are compared for all the sequenced icm genes in L. pneumophila strains. Apparently, in many cases, nonsynonymous substitutions (leading to amino acid changes in encoded proteins) are distributed unevenly along the sequence. The gene regions with low or no nonsynonymous substitutions and close to average number of synonymous substitutions are of special interest since the observed conservatism cannot be explained merely by too little evolutionary time for the compared sequences to diverge. These regions, conservative at the protein level, especially those preserved also in distant homologous proteins, may correspond to important protein domains, so where possible, comparisons were made with distant homologs in other bacteria in conjunction with the functional

Fig. 3. Distribution of nucleotide and amino acid substitutions in the icm genes among L. pneumophila strains. Every nucleotide (bottom halves of each bar) and amino acid substitution (upper bar halves) along the icm sequences from all the L. pneumophila strains is indicated with a vertical hatchmark.


I. Morozova et al. / Plasmid 51 (2004) 127–147

motif predictions. A more detailed description of some of the icm genes (icmP, G, N, and K) follows.

3.5. IcmP

IcmP is believed to be an inner membrane protein, possibly involved in DNA transfer, and

Fig. 4. Hydrophilicity profiles of IcmP and distant homologs. Red—L. pneumophila IcmP; blue—Coxiella IcmP homolog; green—Pseudomonas sp. PyR19 plasmid conjugal-transfer related sequence SAT (gi 2642198); brown—Salmonella typhimurium R64 plasmid trbA gene (gi 20521502).

Fig. 5. Hopp and Woods hydrophilicity profiles for IcmG and its homologs. Blue—L. pneumophila IcmG; red—TraP of plasmid R64 gi 4903119; green—C. burnetii IcmG homolog.

absolutely indispensable for macrophage killing (Segal and Shuman, 1998a). The gene product is predicted to have a signal peptide (aa 1–35), transmembrane regions (aa 17–39 and 92–114) and a trbA domain (aa 204–372). trbA is one of the genes found within the transfer region of IncI1 plasmids such as R64, and is absolutely required for conjugal transfer of these plasmids (Furuya and Komano, 1996). Although distant homologs of icmP are found in Coxiella, Pseudomonas, and Salmonella, they display a low overall level of sequence similarity (18–35% identity at the protein level); slightly increased homology is found only in the region of the trbA domain. Nonetheless, all the homologs have very comparable hydrophilicity profiles over their entire lengths (Fig. 4). Since the gene has not been allowed to accumulate a significant number of variable amino acid positions, it is likely to share a closely related function in these fairly diverse genera. Taking the 15 L. pneumophila strains as a group, both synonymous and nonsynonymous substitutions are distributed evenly along the icmP

Fig. 9. Hydrophilicity profile of IcmK and distant homologs. Red—L. pneumophila; blue—L. longbeachae; green—C. burnetii; brown—Shigella TraN; black—Klebsiella TraN.

Fig. 6. Alignment of t-SNARE domains in assorted proteins. Sequences (from top to bottom): L. pneumophila IcmG; Bradyrhizobium japonicum Blr2548 protein (BAC47813); Clostridium acetobutylicum methyl-accepting chemotaxis protein (AE007559); Pseudomonas aeruginosa probable chemotaxis transducer (AE004706)—t-SNARE domains predicted by SMART system; human SNAP25 C-terminal end (D21267); SNAP25 N-terminal end (D21267); Saccharomyces cerevisiae SEC9p protein—putative t-SNARE (NP_011523). The conservative amino acids are highlighted.


sequence, but the gene appears to be very conservative, both at the nucleotide and amino acid levels, with the lowest dG value of all the icm and housekeeping proteins sequenced, especially in the trbA region.

3.6. IcmG

IcmG has also been predicted to be an inner membrane protein; mutation of this gene leads to a partial reduction in the bacteria's ability to kill macrophages (Segal et al., 1998). When Legionella pneumophila strains are compared, IcmG shows elevated variability, both at the nucleotide and protein levels. Variable positions are almost evenly distributed along the sequence, except in the vicinity of the C- and N-termini that lack even synonymous substitutions. Fig. 5 shows hydrophilicity profile comparisons for icmG in Legionella and two distant homologs, C. burnetii IcmG and plasmid TraP. Despite relatively low sequence homology among the three genes (less than 20% at the protein level), their predicted secondary structures (not shown) and hydrophilicity profiles display significant similarity. Preservation of the protein structure in some cases may be more important for a protein's function than the amino acid sequence itself, and probably because of this, structure-based methods of searching for distant homologs are more efficient than sequence-based approaches (Pawlowski et al., 2001; Sauder et al., 2000). Examples of related bacterial proteins with very low sequence identity but nearly identical structures are not uncommon (Bauer et al., 2001; Ginalski et al., 2000; Girardeau et al., 2000). For the IcmC–TraQ (not shown) and IcmG–TraP comparisons, the protein similarity at these higher structural levels is indeed stronger than at the sequence level. Thus, despite sequence discrepancies, the major function of these distant homologs may remain intact. Local dissimilarities of the protein profiles, as in the case of IcmG–TraP at positions 165–185 (Fig. 5), require additional analysis.
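As a rough illustration of how a hydrophilicity profile like those in Figs. 4 and 5 can be computed, the sketch below averages Hopp and Woods residue values over a sliding window. The window length of 7 and the function name `hydrophilicity_profile` are assumptions made for this example; the text does not state the parameters used for the published profiles.

```python
# Hopp and Woods hydrophilicity value for each of the 20 standard amino acids.
HOPP_WOODS = {
    "R": 3.0, "D": 3.0, "E": 3.0, "K": 3.0, "S": 0.3, "N": 0.2, "Q": 0.2,
    "G": 0.0, "P": 0.0, "T": -0.4, "A": -0.5, "H": -0.5, "C": -1.0,
    "M": -1.3, "V": -1.5, "I": -1.8, "L": -1.8, "Y": -2.3, "F": -2.5,
    "W": -3.4,
}


def hydrophilicity_profile(protein, window=7):
    """Return the mean Hopp-Woods value of every full window along a protein.

    The window length is an assumed parameter, not one taken from the paper.
    """
    scores = [HOPP_WOODS[aa] for aa in protein.upper()]
    if window < 1 or window > len(scores):
        raise ValueError("window must be between 1 and the sequence length")
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]
```

Profiles computed this way for two aligned homologs can be plotted on the same axis; stretches where the curves diverge would correspond to local dissimilarities of the kind noted for IcmG and TraP.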
The Legionella and Coxiella IcmG proteins, unlike their TraP homolog, are predicted to have a t-SNARE domain precisely in this region of local profile dissimilarity (aa 142–210 in the Legionella IcmG protein; aa 95–194 in the C. burnetii homolog, which correspond to


positions 153–221 in aligned sequences in Fig. 5) and this similarity extends beyond the coiled-coil structural features predicted for all three homologs in this area (positions 123–179) (Segal and Shuman, 1998b). [Weimbs et al. (1997, 1998) even screen out coiled-coil features when performing t-SNARE domain searches.] Proteins with t-SNARE domains play important roles in membrane fusion in eukaryotes (Weber et al., 1998). While the t-SNARE domains are highly diverse, they usually possess a central glutamine (Q) residue and preserve the overall domain structure (Gotte and von Mollard, 1998; Weimbs et al., 1998). There are only a few bacterial proteins known to have similarity to the t-SNARE domain (SMART Accession No. SM0397); most of these are bacterial sensor and chemotaxis integral membrane proteins. Several examples of these are aligned with IcmG in Fig. 6. It will be interesting to see if the t-SNARE domain is conserved in non-pneumophila Legionella species with icm/dot loci. If it is required for IcmG function during infection, this feature may differentiate the global function of the Legionella icm/dot system from that of its homologs in other organisms.

3.7. IcmN

IcmN is a putative outer membrane lipoprotein, containing a signal peptide, and is dispensable for macrophage killing (Segal et al., 1998). The sequence is well conserved, especially at the protein level: amino acid substitutions among L. pneumophila strains occur only in the N-terminal half of the protein, and the alternative amino acids always have very similar physico-chemical properties (Figs. 2 and 3). An alignment of L. pneumophila and L. longbeachae sequences also reveals that the C-terminal half (after aa 90) is more conserved than the N-terminal portion (Fig. 7).
Starting at aa 83, the IcmN protein shows weak homology to the OmpA domain (Pfam PF00691), which is found in bacterial porin-like integral-membrane proteins and lipoproteins, most of which, like IcmN, have a conserved OmpA domain within the C-terminal half and a variable N-terminal portion. Some members of this protein group have antigenic determinants, but IcmN does not display obvious hypervariable


Fig. 7. Alignment of IcmN gene product with distant homologs. Dots (.) in the alignment represent identical amino acids. Sequences (top to bottom with NCBI accession numbers): L. pneumophila IcmN; L. longbeachae IcmN (AAL78305); Pseudomonas aeruginosa hypothetical protein (NP_249524); E. coli putative outer membrane protein (OMP) (NP_290137); Agrobacterium tumefaciens putative OMP (NP_355655); Sinorhizobium meliloti hypothetical transmembrane protein (NP_384380); Salmonella enterica putative OMP (NP_458280); Yersinia pestis putative lipoprotein (NP_407501).

regions. The alignment with several distant homologs reveals two extremely conserved motifs: QGVD at aa 147 and RVEIT at the C-terminus (boxed in Fig. 7).

3.8. IcmK

The IcmK product is putatively a periplasmic or outer membrane protein, and possesses a secretion signal peptide (Andrews et al., 1998); the protein is needed for "pore" formation (Kirby et al., 1998), indispensable for macrophage killing, but not necessary for conjugation (Andrews et al., 1998; Segal and Shuman, 1998a). It is homologous to the plasmid traN gene product. According to the Pfam database, the TraN domain starts at position 62 of both the protein alignment (Fig. 8) and the hydrophilicity profile for L. pneumophila icmK and its distant homologs (Fig. 9); the alignment shown before that point is uncertain owing to very low homology. As seen in the alignment, the homology level between the orthologs is quite low with <30%

amino acid identity and <50% similarity between the IcmK and TraN gene products. Despite this, the hydrophilicity profiles have similar patterns, e.g., a hydrophobic initial portion, probably corresponding to the signal sequence (predicted by Pfam at aa 1–26 in L. pneumophila IcmK), and a hydrophobic central region (aa 150–270). The latter feature is conserved even in the most distant Klebsiella TraN homolog, which does not show significant sequence homology in this region (and displays no homology at all prior to aa 150). The icmK gene is one of the more conservative icm/dot genes in L. pneumophila strains as judged by both its Kn/Ks ratio and dG value (Fig. 2). The Ks values, though, are somewhat elevated and synonymous nucleotide substitutions are distributed evenly along the gene (Fig. 10, panel 2). In contrast, amino acid substitutions exhibit a very uneven distribution, occurring only in the first half of the protein, before aa 150 (Fig. 10, remaining panels). This conservation of the second half of the protein, along with the preservation of the central


Fig. 8. Alignment of IcmK and TraN gene products. Dots represent identical amino acids and dashes are gaps in the alignment. Sequences from top to bottom: L. pneumophila, L. longbeachae (AF288617), and C. burnetii IcmK; E. coli (Shigella sonnei) plasmid ColIb-P9 TraN (BAA75158) (has only 1 aa difference with Salmonella typhimurium IncI1 plasmid R64 TraN, BAB91663); Klebsiella oxytoca plasmid pACM1 primase (AF139719).

hydrophobic portion among distant homologs suggests that this region is a functionally important domain.

3.9. Phylogenetic relationships between strains based on icm gene sequence

The dot/icm genes were presumably introduced into the Legionella genomes from a plasmid (Komano et al., 2000; Segal and Shuman, 1998a),

possibly prior to their separating into two loci. It is unknown, though, if this was a one-time event or the region(s) were lost and re-introduced repeatedly during Legionella evolution. Often when gene transfer occurs from a distant organism with different nucleotide content, the transferred region is evident due to its different GC content compared to the rest of the genome. In the case of Legionella's icm/dot loci, their GC content is equivalent to the genome average (38%). Moreover, the regions


Fig. 10. icmK variability profiles. A window size of 15 amino acids or codons was used. See text for details.

are distinct from those of their homologs in Coxiella and the R64 plasmid where most icm/dot gene homologs are around 44 and 50% GC, respectively. It is possible that the transfer occurred from a different plasmid with similar GC content to that of Legionella. Based on the differences between phylogenetic trees built for mip and dotA (Bumbaugh et al., 2002) and dotA and rpoB genes (Ko et al., 2002a,b), it has been suggested that repeated events of genetic exchange or loss and acquisition led to the current "complex" composition of these loci. To determine if the rates of molecular evolution of icm genes are disparate in different L. pneumophila strains, a comparison of icm genes from all available strains was undertaken, using their C. burnetii orthologs as outgroups (Sexton and Vogel, 2002). The distances in synonymous and nonsynonymous substitutions per corresponding site were analyzed separately, as was done by Whittam and Bumbaugh (2002). All analyzed genes from all the L. pneumophila strains showed approximately equal relative substitution rates (data not shown). It is probable, though, that minor differences were missed, using such distant homologs from Coxiella. In the future, when more of the closer homologs, e.g., icm/dot genes from other Legionella species, are available, it should be possible to obtain a finer resolution.
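The GC-content argument above rests on a simple calculation: the percentage of G and C bases in a region, compared with the genome-wide average. A minimal sketch follows; the function names are hypothetical, and skipping ambiguous bases such as N is an assumption of this example rather than a detail given in the text.

```python
def gc_content(dna):
    """Percent G+C among the unambiguous bases (A, C, G, T) of a DNA string."""
    seq = dna.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * gc / total


def gc_deviation(region, genome_average_percent):
    """Difference in percentage points between a region and the genome average."""
    return gc_content(region) - genome_average_percent
```

With a genome average of 38%, as quoted for Legionella, a region acquired from a donor near 44% or 50% GC would show a deviation of roughly +6 or +12 points, whereas the icm/dot loci would be expected to give a value close to zero.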

A detailed phylogenetic analysis was carried out. Phylogenetic trees were built for 18 icm genes, 3 housekeeping genes, and the icmB/tphA intergenic region as well as "combined" trees built for icm locus subregions (i.e., concatenated icm genes from extensive portions of the two loci or the entire loci). The presented trees were built by two methods: NJ, with 1000 bootstrap iterations to estimate confidence level for the tree topology, and the split decomposition method which displays branching alternatives in a single representation. While trees were built for each gene and several icm/dot subregions, only some representative examples are included in Fig. 11. Based on the combined phylogenetic trees, the strains consistently group into seven subsets: [Leg 5, 1, 9], [6, 11, 32], [{36, 10}, {30, 35}], [3, 4, 8, 34], [2, 33], and [7, 31], though the separation between groups {36, 10} and {30, 35} is less consistent (cf. Figs. 11A and C). This clustering is almost identical, with a few exceptions, for the icm genes of the two loci, housekeeping genes and the icmB/tphA intergenic region and is supported by high bootstrap values on almost all of the trees. Exceptions to this clustering were most frequently found with the Leg6 strain, which, for 6 icm and 3 housekeeping genes, merges with the (5, 1, 9) group (see for example the tree for icmK in Fig. 11D). It appears that in the case of trees built for genes of


Fig. 11. Phylogenetic trees. The gene sets do not include dotA,B,C, icmO or icmE. Numbers at the nodes are bootstrap values. (A) Combined NJ tree for all icm genes. (B) Aligned NJ trees for the two icm loci. Notable differences in the tree topologies are detailed in the text. Left: locus II (all locus II icm genes except icmF). Right: locus I (icmX, W, and V only). (C) Combined split decomposition tree for all icm genes. This figure shows the strain clustering, emphasizing the divergence of Leg 7 and 31 from each other and from the rest of the strains. (D) Split decomposition tree for IcmK, demonstrating the complicated picture of group branching. Rectangles represent alternative branching.

the small locus (icmV, W, and X) and those at one end of the large icm locus (icmF, tphA, icmB, J, D, C, and G), Leg 6 belongs to the (11, 32) group, whereas based on trees built for many of the genes at the other end of locus II (icmK, L, N, R, S, and T), this strain falls into the (5, 1, 9) group. In only a very few cases was the clustering violated by other strains (e.g., Leg 30 and 35 are in separate branches in icmV, W, X, F, and B individual gene trees). Despite the largely consistent strain clustering, the relationship between clusters is not as clear, that is, the groups as a whole can switch their relative positions in different trees and sometimes cannot be positioned unambiguously (for example, see Fig. 11D). In many cases, these cluster relocations have low bootstrap values, making it difficult to judge whether they correspond to actual gene transfer or to recombination events. In all the trees Leg 7 and 31 constitute a separate group, so distant from the remaining strain clusters that it almost has the appearance of an outgroup. But when strains 7 and 31 are considered independently, they seem to be almost as distant from each other as from the remaining strains (Fig. 11C). Thus, they probably do not form an actual group, but are merely the two most divergent strains of L. pneumophila examined. It was previously shown that L. pneumophila strain Dallas, serogroup 5, which corresponds to our Leg 7 strain, belongs to L. pneumophila subspecies fraseri (Brenner et al., 1988) and that the dotA and


mip genes from this strain were most distant from their homologs in other L. pneumophila strains (Bumbaugh et al., 2002). The observed strain clustering does not correlate with serogroups. Thus, while both Leg 2 and 33 belong to serogroup 11 and also to one cluster, none of the five strains of serogroup 1 for which we have sequences (Leg 1, 3, 31, 35, and 36) group together. Trees built for the locus I icm genes vary the most from the locus II genes (compare the two combined trees in Fig. 11B, left and right). The initial trees were aligned by rotating branches around internal nodes, while preserving the branching pattern, to accentuate the differences between the two resulting topologies. Branches corresponding to strains Leg 5, 35, 1, and 6 could not be aligned.
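The bootstrap values reported for the NJ trees in this section come from resampling alignment columns with replacement and rebuilding the tree each time. The sketch below illustrates only the resampling step; it is not the authors' pipeline and does not build NJ trees. As a stand-in for tree topology, it asks how often a given pair of sequences is the closest pair under a simple p-distance. All function names and the toy strain labels are hypothetical.

```python
import random
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences, gaps skipped."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (alignment: name -> sequence)."""
    length = len(next(iter(alignment.values())))
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all sequences must have the same aligned length")
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def pair_support(alignment, pair, replicates=1000, seed=0):
    """Fraction of bootstrap replicates in which `pair` is the closest pair."""
    rng = random.Random(seed)
    target = frozenset(pair)
    hits = 0
    for _ in range(replicates):
        sample = bootstrap_alignment(alignment, rng)
        best = min(
            combinations(sorted(sample), 2),
            key=lambda p: p_distance(sample[p[0]], sample[p[1]]),
        )
        hits += frozenset(best) == target
    return hits / replicates
```

In a full analysis, each resampled alignment would be passed to a distance-based tree builder, and the support for a clade would be the fraction of replicate trees that contain it. Low support of this kind is what makes the cluster relocations described above hard to interpret.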

4. Discussion

The icm/dot gene loci are present in each of the L. pneumophila serogroups and strains we sequenced. Moreover, based on our ability to amplify and sequence across genes of interest using primers in expected surrounding adjacent genes, it appears that gene order within these clusters is also retained within the L. pneumophila strains. Other investigators have shown that among nine strains of L. pneumophila, eight from serogroup 1 including three commonly used in laboratory studies (AA100, JR32, and Lp01), the presence or absence of two loci involved in Type IV secretion (traI and lvh) and the rtxA locus, may correlate to some extent with the strains' pathogenicities (Samrakandi et al., 2002). More specifically, the lvh and rtxA loci were found more commonly in strains generally associated with disease, whereas the traI locus was not. These authors also were able to detect and discriminate these genes by hybridization in some non-pneumophila species. More recently, dissection of an expanded locus surrounding a set of the so-called tra/trb genes, presumably involved in pilus assembly, distinct from the traI locus of the AA100 strain, as well as from the icm/dot and lvr/lvh loci, revealed it to be a likely pathogenicity island, containing additional genes for putative virulence factors such as methionine

sulfoxide reductases, as well as plasmid mobility elements; while present in Philadelphia 1-derived strains, it appears to be missing in part or in its entirety from JR32 and several clinical isolates (Brassinga et al., 2003). Interestingly, this locus contains paralogs of the lvrA, B, and C genes of the lvr/lvh Type IV secretion locus. Perhaps the most intriguing finding was the presence of a 30 kb unstable genetic element in strain Olda but not in Philadelphia 1 strains, possibly phage derived, involved in phase variation (Luneberg et al., 2001). When integrated into the chromosome, the strain is virulent, but when excised and replicating as a high-copy plasmid, it results in a mutant phenotype with a modified lipopolysaccharide O-antigen epitope associated with reduced virulence. At this point, we have insufficient evidence to determine if the icm/dot genes are absent or present in most other Legionella species, with the exception of L. longbeachae where good hybridization signals were obtained for icm C, D, G, K, L, M, O, P, and T; weak signals with J, Q, R, S, V, and X; and no signal for icm B, E, and F. Six L. longbeachae icm/dot genes from the center of locus II have been submitted to GenBank by T. Rogers, S. List, R.M. Doyle, and M.W. Heuzenroeder. In cases where we do not obtain positive hybridization signals using L. pneumophila probes, it is probable that the orthologs are too dissimilar in their sequence, at least in the region between where the primers were designed, to be detected by even the reduced stringency hybridization or amplification used in this study. Their characterization thus awaits large-scale sequencing of other species, or the use of degenerate oligonucleotide-based PCR. Terry Alli et al. (2003) recently reported the presence of the icm/dot loci in every Legionella species they examined based on hybridization, even under high stringency conditions, using pooled regional probes.
While we did get weak signals with many icm/dot genes in non-pneumophila species similar to the ones they displayed in their paper, we are unable to explain the several cases of disagreement, except that we used single gene probes which might have been too species-specific. Since we probed the same blots subsequently with several other probes for 16S rRNA, housekeeping or lvh/lvr genes and obtained


excellent signals, the absence of hybridization with those icm/dot gene probes cannot be due to the quality of the DNA itself. The two icm/dot clusters may have been subject to substantial changes in the course of their intraspecies evolution. Given that the icm/dot loci are present in all the L. pneumophila strains from the 15 different serogroups we examined and the fact that these strains have a 100-fold range in their ability to replicate within macrophages (data not shown), it might be expected that the strains' differences in virulence depend on sequence variations within the genes, especially in functionally important gene and protein regions, such as those responsible for efficient transport of effector molecules. Of course, it is also possible that altered regulation of these genes (when and where they are expressed), or in the effector molecules themselves, can contribute to the pathogenic phenotype. It is worth noting that even though the entire icm gene set is present in the Coxiella genome (Seshadri et al., 2003), its lifestyle is very different from that of Legionella. In particular, Coxiella does not seem to depend on the disruption of phagosome–lysosome fusion for its survival, which is considered to be the main function of the icm/dot system in Legionella. In the current study, we assessed the level of diversity among genes of the dot/icm loci, focusing on the putative functional domains that are preserved even in distant homologs. The dot/icm genes display a wide range of variability, some being more conservative than an average housekeeping gene (icmP, Q, D, T, J, B, S, W, and L), while others are 5–10 times more variable (icmM, R, V, X, and dotA), as indicated by the ratio of nonsynonymous and synonymous nucleotide substitutions. Low variability at the sequence level, though, does not necessarily mean that all the observed amino acid substitutions are conservative with regard to their physico-chemical properties.
For example, it appears that IcmT, J, S, and W proteins are permitted rather dramatic amino acid substitutions. In contrast, IcmN, P, and Q are extremely conservative at this level, but not as much at the sequence level. In general, genes from locus I show higher diversity compared to locus II, both at the gene and protein levels.
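The variability comparison above depends on separating nucleotide changes that alter the encoded amino acid (nonsynonymous) from those that do not (synonymous). The sketch below shows that classification for two aligned coding sequences using the standard codon assignments. It returns raw counts of differing codons only. It does not normalize by the number of synonymous and nonsynonymous sites or correct for multiple substitutions, so it is not the Kn/Ks estimate reported in the paper. The function name is hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def count_codon_changes(cds_a, cds_b):
    """Count differing codons as (synonymous, nonsynonymous).

    Both inputs must be in-frame coding sequences of equal length; codons
    containing gaps or ambiguous bases are skipped.
    """
    if len(cds_a) != len(cds_b) or len(cds_a) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame, and equal in length")
    syn = nonsyn = 0
    for i in range(0, len(cds_a), 3):
        ca, cb = cds_a[i:i + 3].upper(), cds_b[i:i + 3].upper()
        if ca == cb or ca not in CODON_TABLE or cb not in CODON_TABLE:
            continue
        if CODON_TABLE[ca] == CODON_TABLE[cb]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn
```

For example, comparing `ATGCTTAAA` with `ATGCTCAGA` gives one synonymous change (CTT and CTC both encode leucine) and one nonsynonymous change (AAA lysine to AGA arginine). A gene with many synonymous but few nonsynonymous differences between strains would fall among the more conservative genes described above.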


A second category of intra-species variation is positional, with some portions of the genes and their products more dissimilar than others. For instance, the IcmK and IcmV proteins have many more amino acid substitutions in their N-terminal than their C-terminal portions. Most variation at the amino acid level is found at the ends of IcmP, but centrally in IcmG. At the same time, the silent nucleotide changes are often distributed evenly along the gene, indicating that the preservation of amino acid sequence in some regions is not simply due to time of gene divergence, but rather to the presence of important functional domains, especially when the sequence, or at least the protein structure, is preserved in distant orthologs. It is interesting in this regard that remote homology detection by structural methods has helped predict the function of many otherwise uncharacterized proteins in several sequenced genomes (Pawlowski et al., 1999, 2001; Rychlewski et al., 1998). For some icm/dot genes (icmP, G, N, and K) the combination of relatively high regional sequence conservatism and the presence of predicted domains and sequence and/or structure preservation in distant homologs in the same areas serve as indicators of the presence of a functional domain, though they await experimental proof. Features such as the t-SNARE-like domain in IcmG and its Coxiella homolog occur rarely enough in bacterial genes as to make them noteworthy. If the t-SNARE domain is functional in IcmG, it may compete with the host's membrane fusion SNARE system, potentially altering its normal vesicular trafficking pathways, and preventing phagosome–lysosome fusion, for the bacteria's own ends. Thus these findings may provide the impetus for future experimental studies to more directly determine the function of these proteins. Phylogenetic analysis for individual genes as well as locus subregions largely reveals similar strain groupings, as in Fig. 11C.
However, some branches either switch their positions on different trees or cannot be unambiguously positioned. Though it is tempting to speculate that these represent instances of lateral transfer within the locus, it is not possible to determine this with any certainty.
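The positional variation discussed earlier in this section can be summarized with a sliding-window profile of the kind shown in Fig. 10, which used a window of 15 amino acids or codons. The sketch below is one simple way to produce such a profile from a multiple alignment of protein sequences. The function names are hypothetical, and scoring a column as variable whenever more than one residue is observed is an assumption of this example, not necessarily the measure plotted in the figure.

```python
def variable_positions(aligned):
    """Flag each alignment column as variable (True) or invariant (False).

    A column is variable when more than one distinct residue is observed;
    gap characters are ignored.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must have the same aligned length")
    flags = []
    for column in zip(*aligned):
        residues = {c for c in column if c != "-"}
        flags.append(len(residues) > 1)
    return flags


def window_variability(flags, window=15):
    """Fraction of variable positions in every full window along the alignment."""
    if window < 1 or window > len(flags):
        raise ValueError("window must be between 1 and the alignment length")
    return [
        sum(flags[i:i + window]) / window
        for i in range(len(flags) - window + 1)
    ]
```

Running the same procedure on codon columns scored for synonymous changes gives a second profile. A stretch where the amino acid profile is flat while the synonymous profile stays near the gene average is the pattern the text interprets as a candidate functional domain, such as the C-terminal half of IcmK.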


Not only are the locus I genes more variable than most of locus II, but interestingly, genes of the smaller locus (icmW, V, X, and dotA) have accumulated more silent nucleotide substitutions per site (Ks values) than most of those from locus II. If both loci were acquired, probably from a plasmid, at the same time, this may mean that locus I is evolving at a higher rate. Alternatively, under the assumption that the evolutionary rates have been the same and unchanged for both loci, genes from the smaller locus must be "older" than most of those in the large icm cluster. This, taken with the fact that the most disparate branching patterns are observed when either individual or combined trees for icm/dot locus I vs locus II are compared, leads to the assumption that the icm/dot region has a rather complex history of gene acquisition and rearrangement events. In Coxiella all the icm genes are located next to each other whereas in L. pneumophila they are split into two icm/dot loci that are located on opposite sides of the circular genome (http://genome3.cpmc.columbia.edu/~legion/index.html). This may serve as an additional indication that two loci in Legionella were acquired separately or rearranged afterwards. So far, full icm/dot gene sets have only been found in two relatively close species (Legionella and Coxiella), and this system differs substantially from the known Type IV systems. Nonetheless, given the presence of limited but obvious homology of most icm/dot genes from both loci and tra/trb genes, it is possible to suggest that they may have derived from the same ancestor. This ancestor may be of plasmid origin or assembled from various chromosomal components in ancestral bacteria; in the latter case, these genes may subsequently have been incorporated into a plasmid, support for which would come from the fact that many different bacteria possess tra-like genes (e.g., Type IV secretion systems).
Other researchers have also pointed out that the icm/dot region may have a complicated evolutionary history in L. pneumophila. Bumbaugh et al. (2002) compared dotA and mip (a 24 kDa surface protein with peptidyl-prolyl-cis/trans isomerase activity that may be involved in establishment of infections, but not intracellular

survival (Cianciotto et al., 1990)), in 17 clinical and environmental isolates. Compared to mip, DotA, a cytoplasmic membrane spanning protein, was extremely and perhaps unexpectedly variable, and the neighbor-joining trees produced for the two genes were discordant at several branch points with high bootstrap values. The authors considered this an indication of lateral gene transfer and recombination and relatively recent gene dispersal. Ko et al. (2002b) compared the dotA and rpoB alleles in 79 Korean isolates of L. pneumophila from six clonal populations. The most parsimonious tree produced using rpoB distinguished four closely related L. pneumophila pneumophila subspecies and two closely related L. pneumophila fraseri subspecies. In contrast, in the case of dotA, one of the pneumophila subspecies seemed more closely related to the fraseri subspecies than to the other three pneumophila. Some caution should be exercised, however, in that these authors previously showed that the rpoB trees, themselves, differed substantially from 16S rRNA and mip trees, which was the basis for distinguishing the six clonal populations (Ko et al., 2002a). Our comparisons, taking into consideration nearly all the members of the icm/dot loci, may point out additional subpopulations, especially for those genes showing substantial variation. In the future, comparisons with icm and lvh plasmid gene orthologs may be especially interesting.
Since the lvh/lvr locus is likely to have been inherited as a plasmid unit, as we discovered during the sequencing of the Philadelphia 1 genome (manuscript in preparation), with a substantially higher GC content (43%) than the rest of the genome (Segal et al., 1999), we intend to compare its history with that of the icm/dot islands, which have only some of the classic features of pathogenicity islands (apparent absence of essential genes, all-or-none presence of the complete gene set), but not others (GC content the same as the remainder of the genome, separation into two subsets). The separate tra/trb locus also appears to be a pathogenicity island, the central core of which has an elevated GC content (Brassinga et al., 2003), and is thus another good candidate for such comparative sequence analysis.


Acknowledgments Strains Leg 1–Leg 34 were kindly provided by Dr. Barry Fields at the CDC; Leg 35 and Leg 36, specimens from an outbreak at a Dutch flower show, were a generous gift from Dr. Ruud van Ketel at the University of Amsterdam. We thank Huitao Sheng for assistance in sequence submission and Dr. Pavel Morozov for helpful comments throughout the course of this work. This work was supported by NIH Grant U01 1 AI 44371 awarded to J.J.R., and funds generously provided by the Columbia Genome Center. References Adeleke, A., Pruckler, J., Benson, R., Rowbotham, T., Halablab, M., Fields, B., 1996. Legionella-like amebal pathogens—phylogenetic status and possible role in respiratory disease. Emerg. Infect Dis. 2, 225–230. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J., 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403–410, doi: 10.1006/jmbi.1990.9999. Andrews, H.L., Vogel, J.P., Isberg, R.R., 1998. Identification of linked Legionella pneumophila genes essential for intracellular growth and evasion of the endocytic pathway. Infect. Immun. 66, 950–958, id: 0019-9567/98/$04.00+0. Avison, M.B., Simm, A.M., 2002. Sequence and genome context analysis of a new molecular class D b-lactamase gene from Legionella pneumophila. J. Antimicrob. Chemother. 50, 331–338, doi: 10.1093/jac/dkf135. Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L., Eddy, S.R., Griffiths-Jones, S., Howe, K.L., Marshall, M., Sonnhammer, E.L., 2002. The Pfam protein families database. Nucleic Acids Res. 30, 276–280. Bauer, F., Schweimer, K., Kluver, E., Conejo-Garcia, J.-R., Forssmann, W.-G., Rosch, P., Adermann, K., Sticht, H., 2001. Structure determination of human and murine bdefensins reveals structural conservation in the absence of significant sequence similarity. Protein Sci. 10, 2470–2479. Benson, R., Fields, B., 1998. Classification of the genus Legionella. Semin. Respir. Infect. 13, 90–99. Berger, K.H., Isberg, R.R., 1993. 
Two distinct defects in intracellular growth complemented by a single genetic locus in Legionella pneumophila. Mol. Microbiol. 7, 7–19. Bogardt, R.A., Jones, B.N., Dwulet, F.E., Garner, W.H., Lehman, L.D., Gurd, F.R., 1980. Evolution of the amino acid substitution in the mammalian myoglobin gene. J. Mol. Evol. 15, 197–218. Boyd, E.F., Li, J., Ochman, H., Selander, R.K., 1997. Comparative genetics of the inv-spa invasion gene complex of Salmonella enterica. J. Bacteriol. 179, 1985–1991, id: 00219193/97/$04.00+0.

Brassinga, A.K.C., Hiltz, M.F., Sisson, G.R., Morash, M.G., Hill, N., Garduno, E., Edelstein, P.H., Garduno, R.A., Hoffman, P.S., 2003. A 65-kilobase pathogenicity island is unique to Philadelphia-1 strains. J. Bacteriol. 185, 4630–4637, doi: 10.1128/JB.185.15.4630-4637.2003.
Brenner, D.J., Steigerwalt, A.G., Epple, P., Bibb, W.F., McKinney, R.M., Starnes, R.W., Colville, J.M., Selander, R.K., Edelstein, P.H., Moss, C.W., 1988. Legionella pneumophila serogroup lansing 3 isolated from a patient with fatal pneumonia, and descriptions of L. pneumophila subsp. pneumophila subsp. nov., L. pneumophila subsp. fraseri subsp. nov., and L. pneumophila subsp. pascullei subsp. nov. J. Clin. Microbiol. 26, 1695–1703.
Bumbaugh, A.C., McGraw, E.A., Page, K.L., Selander, R.K., Whittam, T.S., 2002. Sequence polymorphism of dotA and mip alleles mediating invasion and intracellular replication of Legionella pneumophila. Curr. Microbiol. 44, 314–322, doi: 10.1007/s0024-01-0024-6.
Christie, P.J., 2001. Type IV secretion: intercellular transfer of macromolecules by systems ancestrally related to conjugation machines. Mol. Microbiol. 40, 294–305, doi: 10.1046/j.1365-2958.
Cianciotto, N.P., Eisenstein, B.I., Mody, C.H., Engleberg, N.C., 1990. A mutation in the mip gene results in an attenuation of Legionella pneumophila virulence. J. Infect. Dis. 162, 121–126.
Coers, J., Kagan, J.C., Matthews, M., Nagai, H., Zuckman, D.M., Roy, C.R., 2000. Identification of Icm protein complexes that play distinct roles in the biogenesis of an organelle permissive for Legionella pneumophila intracellular growth. Mol. Microbiol. 38, 719–736, doi: 10.1046/j.1365-2958.2000.02176.x.
Doyle, R.M., Steele, T.W., McLennan, A.M., Parkinson, I.H., Manning, P.A., Heuzenroeder, M.W., 1998. Sequence analysis of the mip gene of the soilborne pathogen Legionella longbeachae. Infect. Immun. 66, 1492–1499, id: 0019-9567/98/$04.00+0.
Dumenil, G., Isberg, R., 2001. The Legionella pneumophila IcmR protein exhibits chaperone activity for IcmQ by preventing its participation in high-molecular-weight complexes. Mol. Microbiol. 40, 1113–1127, doi: 10.1046/j.1365-2958.2001.02454.x.
Fields, B.S., Benson, R.F., Besser, R.E., 2002. Legionella and Legionnaires' disease: 25 years of investigation. Clin. Microbiol. Rev. 15, 506–526, doi: 10.1128/CMR.15.3.506-526.2002.
Fraser, D.W., Tsai, T.R., Orenstein, W., Parkin, W.E., Beecham, H.J., Sharrar, R.G., Harris, J., Mallison, G.F., Martin, S.M., McDade, J.E., Shepard, C.C., Brachman, P.S., 1977. Legionnaires' disease: description of an epidemic of pneumonia. N. Engl. J. Med. 297, 1189–1197.
Furuya, N., Komano, T., 1996. Nucleotide sequence and characterization of the trbABC region of the IncI1 plasmid R64: existence of the pnd gene for plasmid maintenance within the transfer region. J. Bacteriol. 178, 1491–1497, id: 0021-9193/96/$04.00+0.

Ginalski, K., Venclovas, C., Lesyng, B., Fidelis, K., 2000. Structure-based sequence alignment for the beta-trefoil subdomain of the clostridial neurotoxin family provides residue level information about the putative ganglioside binding site. FEBS Lett. 482, 119–124, doi: 10.1016/S0014-5793(00)01954-2.
Girardeau, J.P., Bertin, Y., Callebaut, I., 2000. Conserved structural features in class i major fimbrial subunits (Pilin) in gram-negative bacteria. Molecular basis of classification in seven subfamilies and identification of intrasubfamily sequence signature motifs which might be implicated in quaternary structure. J. Mol. Evol. 50, 424–442, ISSN: 0022-2844.
Gotte, M., von Mollard, G.F., 1998. A new beat for the SNARE drum. Trends Cell. Biol. 8, 215–218, doi: 10.1016/S0962-8924(98)01272-0.
Helbig, J.H., Bernander, S., Castellani Pastoris, M., Etienne, J., Gaia, V., Lauwers, S., Lindsay, D., Luck, P.C., Marques, T., Mentula, S., Peeters, M.F., Pelaz, C., Struelens, M., Uldum, S.A., Wewalka, G., Harrison, T.G., 2002. Pan-European study on culture-proven legionnaires' disease: distribution of Legionella pneumophila serogroups and monoclonal subgroups. Eur. J. Clin. Microbiol. Infect Dis. 21, 710–716, doi: 10.1007/s10096-002-0820-3.
Huson, D., 1998. SplitsTree: analyzing and visualizing evolutionary data. Bioinformatics 14, 68–73.
Kawashima, S., Kanehisa, M., 2000. AAIndex: amino acid index database. Nucleic Acids Res. 28, 374.
Kirby, J.E., Vogel, J.P., Andrews, H.L., Isberg, R.R., 1998. Evidence for pore-forming ability by Legionella pneumophila. Mol. Microbiol. 27, 323–336, doi: 10.1046/j.1365-2958.1998.00680.x.
Ko, K.S., Lee, H.K., Park, M.Y., Lee, K.-H., Yun, Y.-J., Woo, S.-Y., Miyamoto, H., Kook, Y.-H., 2002a. Application of RNA polymerase beta-subunit gene (rpoB) sequences for the molecular differentiation of Legionella species. J. Clin. Microbiol. 40, 2653–2658, doi: 10.1128/JCM.40.7.2653-2658.2002.
Ko, K.S., Lee, H.K., Park, M.-Y., Park, M.-S., Lee, K.-H., Woo, S.-Y., Yun, Y.-J., Kook, Y.-H., 2002b. Population genetic structure of Legionella pneumophila inferred from rna polymerase gene (rpoB) and DotA gene (dotA) sequences. J. Bacteriol. 184, 2123–2130, doi: 10.1128/JB.184.8.2123-2130.2002.
Komano, T., Yoshida, S., Narahara, K., Furuya, N., 2000. The transfer region of IncI1 plasmid R64: similarities between R64 tra and Legionella icm/dot genes. Mol. Microbiol. 35, 1348–1359, doi: 10.1046/j.1365-2958.2000.01769.x.
Kumar, S., Tamura, K., Jakobsen, I.B., Nei, M., 2001. MEGA2: molecular evolutionary genetics analysis software. Bioinformatics 17, 1244–1245.
Letunic, I., Goodstadt, L., Dickens, N.J., Doerks, T., Schultz, J., Mott, R., Ciccarelli, F., Copley, R.R., Ponting, C.P., Bork, P., 2002. Recent improvements to the SMART domain-based sequence annotation resource. Nucleic Acids Res. 30, 242–244.

Li, W.H., Wu, C.I., Luo, C.C., 1985. A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol. Biol. Evol. 2, 150–174, id: 0737-4038/85/0202-0201$02.00.
Luneberg, E., Mayer, B., Daryab, N., Koolstra, O., Zahringer, U., Rohde, M., Swanson, J., Frosch, M., 2001. Chromosomal insertion and excision of a 30 kb unstable genetic element is responsible for phase variation of lipopolysaccharide and other virulence determinants in Legionella pneumophila. Mol. Microbiol. 39, 1259–1271, doi: 10.1046/j.1365-2958.2001.02314.x.
Miyata, T., Miyazawa, S., Yasunaga, T., 1979. Two types of amino acid substitutions in protein evolution. J. Mol. Evol. 12, 219–236.
Nagai, H., Kagan, J.C., Zhu, X., Kahn, R.A., Roy, C.R., 2002. A bacterial guanine nucleotide exchange factor activates ARF on Legionella phagosomes. Science 295, 679–682.
Pawlowski, K., Rychlewski, L., Zhang, B., Godzik, A., 2001. Fold predictions for bacterial genomes. J. Struct. Biol. 134, 219–231, doi: 10.1006/jsbi.2001.4394.
Pawlowski, K., Zhang, B., Rychlewski, L., Godzik, A., 1999. The Helicobacter pylori genome: from sequence analysis to structural and functional predictions. Proteins: Struct., Funct., Genet. 36, 20–30, doi: 10.1002/(SICI)1097-0134(19990701)36.1<20::AID-PROT2>3.0.CO;2-X.
Perez-Luz, S., Fernandez, J., Rodriguez-Valera, F., Pascual, L., Moreno, C., Amo, A., Apraiz, D., Catalan, V., 2002. Sequence diversity of the internal transcribed spacer (its) region of the rRNA operons among different serogroups of Legionella pneumophila isolates. Syst. Appl. Microbiol. 25, 212–219, doi: 10.1078/072320202320386370.
Pollastri, G., Przybylski, D., Rost, B., Baldi, P., 2002. Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins 47, 228–235, online ISSN: 1097-0134; print ISSN: 0887-3585.
Raghava, G.P.S., 2000. Protein secondary structure prediction using nearest neighbor and neural network approach. CASP 4, 75–76.
Ratcliff, R., Donnellan, S.C., Lanser, J.A., Manning, P.A., Heuzenroeder, M.W., 1997. Interspecies sequence differences in the Mip protein from the genus Legionella: implications for function and evolutionary relatedness. Mol. Microbiol. 25, 1149–1158.
Ratcliff, R.M., Lanser, J.A., Manning, P.A., Heuzenroeder, M.W., 1998. Sequence-based classification scheme for the genus Legionella targeting the mip gene. J. Clin. Microbiol. 36, 1560–1567, id: 0095-1137/98/$04.00+0.
Rosello-Mora, R., Amann, R., 2001. The species concept for prokaryotes. FEMS Microbiol. Lett. 25, 39–67, doi: 10.1016/S0168-6445(00)00040-1.
Rost, B., 1996. PHD: predicting one-dimensional protein structure by profile based neural networks. Methods Enzymol. 266, 525–539.

Rychlewski, L., Zhang, B., Godzik, A., 1998. Fold and function predictions for Mycoplasma genitalium proteins. Fold Des. 3, 229–238, ISSN: 1359-0278.
Sadosky, A., Wiater, L.A., Shuman, H.A., 1993. Identification of Legionella pneumophila genes required for growth within and killing of human macrophages. Infect. Immun. 61, 5361–5373.
Saitou, N., Nei, M., 1987. The Neighbor-Joining Method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406–425, id: 0737-4038/87/0.
Samrakandi, M.M., Cirillo, S.L.G., Ridenour, D.A., Bermudez, L.E., Cirillo, J.D., 2002. Genetic and phenotypic differences between Legionella pneumophila strains. J. Clin. Microbiol. 40, 1352–1362, doi: 10.1128/JCM.40.4.1352-1362.2002.
Sauder, J.M., Arthur, J.W., Dunbrack Jr., R.L., 2000. Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins: Struct., Funct., Genet. 40, 6–22, online ISSN: 1097-0134, print ISSN: 0887-3585.
Segal, G., Shuman, H.A., 1997. Characterization of a new region required for macrophage killing by Legionella pneumophila. Infect. Immun. 65, 5057–5066, id: 0019-9567/$04.00+0.
Segal, G., Shuman, H.A., 1998a. Intracellular multiplication and human macrophage killing by Legionella pneumophila are inhibited by conjugal components of IncQ plasmid RSF1010. Mol. Microbiol. 30, 197–208.
Segal, G., Shuman, H.A., 1998b. How is the intracellular fate of the Legionella pneumophila phagosome determined. Trends Microbiol. 6, 253–255, doi: 10.1016/S0966-842X(98)01308-0.
Segal, G., Shuman, H.A., 1999. Possible origin of the Legionella pneumophila virulence genes and their relation to Coxiella burnetii. Mol. Microbiol. 33, 669–670, doi: 10.1046/j.1365-2958.1999.01511.x.
Segal, G., Purcell, M., Shuman, H.A., 1998. Host cell killing and bacterial conjugation require overlapping sets of genes within a 22-kb region of the Legionella pneumophila genome. Proc. Natl. Acad. Sci. USA 95, 1669–1674.
Segal, G., Russo, J.J., Shuman, H.A., 1999. Relationships between a new type iv secretion system and the icm/dot virulence system of Legionella pneumophila. Mol. Microbiol. 34, 799–809, doi: 10.1046/j.1365-2958.1999.01642.x.
Seshadri, R., Paulsen, I.T., Eisen, J.A., Read, T.D., Nelson, K.E., Nelson, W.C., Ward, N.L., Tettelin, H., Davidsen, T.M., Beanan, M.J., Deboy, R.T., Daugherty, S.C., Brinkac, L.M., Madupu, R., Dodson, R.J., Khouri, H.M., Lee, K.H., Carty, H.A., Scanlan, D., Heinzen, R.A., Thompson, H.A., Samuel, J.E., Fraser, C.M., Heidelberg, J.F., 2003. Complete genome sequence of the Q-fever pathogen Coxiella burnetii. Proc. Natl. Acad. Sci. USA 100, 5455–5460, doi 10.1073.
Sexton, J.A., Vogel, J.P., 2002. Type IVB secretion by intracellular pathogens. Traffic 3, 178–185, doi: 10.1034/j.1600-0854.2002.030303.x.
Swanson, M.S., Hammer, B.K., 2000. Legionella pneumophila pathogenesis: a fateful journey from amoebae to macrophages. Annu. Rev. Microbiol. 54, 567–613.
Terry Alli, O.A., Zink, S., von Lackum, N.K., Abu-Kwaik, Y., 2003. Comparative assessment of virulence traits in Legionella spp. Microbiology 149, 631–641, doi: 10.1099/mic.0.25980-0.
Thompson, J.D., Higgins, D.G., Gibson, T.J., 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673–4680.
Vogel, J.P., Andrews, H.L., Wong, S.K., Isberg, R.R., 1998. Conjugative transfer by the virulence system of Legionella pneumophila. Science 279, 873–876.
Watarai, M., Andrews, H.L., Isberg, R.R., 2001. Formation of a fibrous structure on the surface of Legionella pneumophila associated with exposure of DotH and DotO proteins after intracellular growth. Mol. Microbiol. 39, 313–329, doi: 10.1046/j.1365-2958.2001.02193.x.
Weber, T., Zemelman, B.V., McNew, J.A., Westermann, B., Gmachl, M., Parlati, F., Sollner, T.H., Rothman, J.E., 1998. SNAREpins: minimal machinery for membrane fusion. Cell 92, 759–772.
Weimbs, T., Low, S.H., Chapin, S.J., Mostov, K.E., Bucher, P., Hofmann, K., 1997. A conserved domain is present in different families of vesicular fusion proteins: a new superfamily. Proc. Natl. Acad. Sci. USA 94, 3046–3051.
Weimbs, T., Mostov, K., Low, S.H., Hofmann, K., 1998. A model for structural similarity between different SNARE complexes based on sequence relationships. Trends Cell Biol. 8, 260–262, doi: 10.1016/S0962-8924(98)01285-9.
Whittam, T.S., Bumbaugh, A.C., 2002. Inferences from whole-genome sequences of bacterial pathogens. Curr. Opin. Genet. Dev. 12, 719–725, doi: 10.1016/S0959-437X(02)0036-1.
Yu, V.L., Plouffe, J.F., Castellani Pastoris, M., Stout, J.E., Schousboe, M., Widmer, A., Summersgill, J., File, T., Heath, C.M., Paterson, D.L., Chereshsky, A., 2002. Distribution of Legionella species and serogroups isolated by culture in patients with sporadic community-acquired legionellosis: an international collaborative survey. J. Infect. Dis. 186, 127–128, id: 0022-1899/2002/18601-0020$15.00.

Communicated by R. Novick