Evolution of Protein Superfamilies and Bacterial Genome Size




doi:10.1016/j.jmb.2003.12.044

J. Mol. Biol. (2004) 336, 871–887

Juan A. G. Ranea1*, Daniel W. A. Buchan1, Janet M. Thornton2 and Christine A. Orengo1

1 Biomolecular Structure and Modelling Group, Department of Biochemistry and Molecular Biology, University College London, London WC1E 6BT, UK

2 EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

We present the structural annotation of 56 different bacterial species based on the assignment of genes to 816 evolutionary superfamilies in the CATH domain structure database. These assignments have enabled us to analyse the recurrence of specific superfamilies within and across the genomes. We have selected the superfamilies that have a very broad representation and therefore appear to be universally distributed in a significant number of bacterial lineages. Occurrence profiles of these universally distributed superfamilies are compared with genome size in order to estimate the correlation between superfamily duplication and the increase in proteome size. This distinguishes between those size-dependent superfamilies where frequency of occurrence is highly correlated with increase in genome size, and size-independent superfamilies where no correlation is observed. Consideration of the size correlation and the ratio between the mean and the standard deviations for all the superfamily profiles allows more detailed subdivisions and classification of superfamilies. For example, within the size-independent superfamilies, we distinguished a group that are distributed evenly amongst all the genomes. Within the size-dependent superfamilies we differentiated two groups: linearly distributed and non-linearly distributed. Functional annotation using the COG database was performed for all superfamilies in each of these groups, and this revealed significant differences amongst the three sets of superfamilies. Evenly distributed, size-independent domains are shown to be involved primarily in protein translation and biosynthesis. For the size-dependent superfamilies, linearly distributed superfamilies are involved mainly in metabolism, and non-linearly distributed superfamily domains are involved principally in gene regulation.

© 2003 Elsevier Ltd. All rights reserved.

*Corresponding author

Keywords: protein family; three-dimensional structure; genome size; bacteria; domain distribution

Introduction

Supplementary data associated with this article can be found at doi: 10.1016/j.jmb.2003.12.044 Present address: D. W. A. Buchan, Genome Centre, John Vane Science Centre, Queen Mary and Westfield College, London EC1M 6BQ, UK. Abbreviations used: ORF, open reading frame; PDB, Protein Data Bank. E-mail address of the corresponding author: [email protected]

Evolutionary theory postulates that the different homologues observed in each protein family are the final product of consecutive duplication and divergence processes undergone by an ancestral sequence.1 Although in bacteria evolution has also followed other processes (e.g. the loss of genetic material2–4 or horizontal transfer of genes between species5), duplication and diversification of genes seem to be the major factors driving the increase in size and complexity of the larger genomes.6 Therefore, it is reasonable to believe that the
duplication of protein families and increase of genome size are evolutionarily inter-dependent variables.7 In order to evaluate this hypothesis, it is necessary to recognise individual domains and determine their recurrence within the completed genomes. The bacterial organisms are an ideal model within which to study such relationships, due to their relative genomic simplicity in comparison to the eukaryotic genomes. The absence of long intergene regions and the lack of intra-gene intron–exon regions make open reading frame (ORF) identification more accurate. In bacteria, the genome size will be more directly proportional to the proteome size.3 Additionally, the high number of sequenced species of prokaryotes, compared to eukaryotes, enables more significant statistical analysis of the data.

In the CATH database,8 the homologous superfamily level groups together protein domains that have a high probability of sharing a common ancestor. Criteria for recognising homologues are based on structure comparison and sequence or functional similarity. The combination of sequence, structural and functional analyses allows greater sensitivity when detecting remote phylogenetic relationships between distantly related protein domains. However, although structural comparison is the most sensitive method for recognising remote evolutionary relationships, it is somewhat limited by the scarce availability of structural data compared to the huge volumes of sequence entries.9–11 Nevertheless, recent analyses6,12,13 have demonstrated that a significant proportion of sequences in small bacterial genomes can be assigned to domain structure families using the most sophisticated sequence profile methods available. For example, the most powerful methodologies, which employ sequence profiles or fold recognition methods, can provide some structural annotation for up to 60% of the genes in small microbial genomes, depending on the method. In particular, profile-based methods using PSI-BLAST or IMPALA have been used to assign structures to about 40% of the proteins in Mycoplasma genitalium,14,15 whilst threading algorithms have provided annotations for nearly 60% of this genome.16 Similarly, sensitive hidden Markov model-based methods are currently capable of assigning at least 50–60% of the ORFs in a given genome to structural families.13,17 In the analysis presented here, the average coverage of 37% of genes, over all ORFs in a sample of 56 bacterial species, supports the statistical significance of the study.

As well as limitations in the amount of structural data available, there may be a bias in the structural superfamilies determined to date. The CATH database8,18 is a hierarchical classification of domains, within protein structures, in the Protein Data Bank (PDB19). Therefore, the structural protein space in CATH reflects the data structure in the PDB and may not necessarily be related to the
true distribution of protein structures in nature. For example, the PDB may be over-represented by structures that are more amenable to crystallisation or have been historically of more interest (e.g. enzyme families). However, this is also a good reason to perform CATH domain annotations on real organisms in order to determine whether the family distributions in the genomes mirror those detected in the PDB. Although structural domain annotation does not cover the complete lengths of all the ORFs, the structural annotations do form a representative sample for reconstructing and studying a gene’s phylogenetic history. Genes have undergone frequent recombination processes with different fusions and fissions of domains within bacteria. These reorganisations make it difficult to trace the duplication and divergence events that have occurred in gene evolution. However, because domains are basic units that are not susceptible to subdivision, they constitute stable markers in evolution that can be used for phylogenetic reconstruction, reducing the background noise that gene reorganisation produces. Analysis of domain superfamily occurrence profiles throughout prokaryotes (Figure 1) will help to answer some important questions such as: have all the superfamilies contributed in the same manner to the increase of genome size in bacterial evolution? And, if there are distribution differences, what is their biological meaning? By correlating bacterial genome size and superfamily recurrence it is possible to distinguish the superfamilies in which duplication and expansion have contributed to increasing genome size from the superfamilies for which duplication has been independent of genome size. Furthermore, the study of the distribution of the superfamilies throughout different organisms (using superfamily occurrence profiles) allows us to distinguish different distribution shapes of superfamilies and relate these to the functional role of the superfamily in prokaryotes. 
Here, we present an analysis of superfamily distributions in bacterial genomes together with a study of the correlation of domain recurrence with genome size, discussing the functional attributes of the different types of superfamilies we identify.

Results

Analysis of the genome size correlation and universal distribution of superfamilies

Conventionally, superfamily distributions in bacteria can be divided into two main classes: the universally distributed and the specifically distributed superfamilies. Universally distributed superfamilies are defined as those present in a significant number of different species. By contrast, specific superfamilies show the opposite behaviour and are present in only a few species. This distinction is related to the functional and evolutionary differences expected for these two groups. Specific superfamilies will confer functions contributing to the unique phenotypes observed in the species. On the other hand, the universally distributed superfamilies usually supply essential functions in bacterial cells and have a probable ancestral origin in the evolution of bacterial division and diversification.

Figure 1. Occurrence profile creation. The domain assignment data are taken from several disparate files, shown at the top of the diagram. For each organism the number of occurrences of each particular domain is listed. These data are converted into a single matrix containing the assignment data for all the domains and all the organisms.
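
The occurrence profile construction described in the legend to Figure 1 can be sketched as follows. This is only an illustrative sketch: the function name `build_occurrence_matrix`, the input layout (one list of CATH superfamily codes per genome) and the toy genome names are assumptions made for this example, not the authors' actual file formats or code.

```python
from collections import Counter

def build_occurrence_matrix(assignments):
    """Build a superfamily-by-organism occurrence matrix.

    `assignments` is a hypothetical input layout: a dict mapping each
    organism name to a list of superfamily codes, one entry per assigned domain.
    """
    organisms = sorted(assignments)
    # Count how often each superfamily occurs in each organism.
    counts = {org: Counter(domains) for org, domains in assignments.items()}
    superfamilies = sorted({sf for c in counts.values() for sf in c})
    # One row per superfamily, one column per organism; absent entries count as 0.
    matrix = [[counts[org][sf] for org in organisms] for sf in superfamilies]
    return superfamilies, organisms, matrix

# Toy data (invented for illustration only).
assignments = {
    "genome_A": ["3.40.50.300", "3.40.50.300", "3.40.50.720"],
    "genome_B": ["3.40.50.300", "3.30.230.10"],
}
superfamilies, organisms, matrix = build_occurrence_matrix(assignments)
for sf, row in zip(superfamilies, matrix):
    print(sf, row)
```

Each row of the resulting matrix is the occurrence profile of one superfamily across the genomes, which is the quantity analysed in the rest of this section.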


Because of their broad distribution and ancestral origin, we have concentrated primarily on an analysis of these universal superfamilies to study general trends in bacterial evolution. To select these superfamilies, we considered universally distributed superfamilies to be those present in at least 70% of all the species. This threshold was chosen because it reflects a large proportion of the genomes and takes into account the likely proportion of distant homologues (paralogues) that will be missed because they are at the limit of sensitivity of the current profile-based methods used to identify relatives (e.g. PSI-BLAST, IMPALA, HMMs13,20).

The size correlation coefficients and universal distribution percentages were calculated for each superfamily detected in the bacterial genomes (a total of 816 annotated superfamilies). The distributions of the superfamilies ordered on the basis of their size correlation (Figure 2a) and universal distribution values (Figure 2b) decrease in an approximately linear fashion. A total of 274 universally distributed superfamilies (universal distribution percentage ≥ 70%) were selected from the complete sample of 816 annotated superfamilies (see the shaded region at the top left of Figure 2b). This group was ordered independently and replotted with respect to their size correlation values (Figure 2d), revealing a linear distribution identical with the original distribution of correlation values for all superfamilies (compare Figure 2a and d). It can be seen from Figure 2c that there is no correlation between universal distribution and genome size correlation.
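
The two per-superfamily quantities used above, the universal distribution percentage and the genome size correlation, can be sketched in a few lines. This is a minimal sketch under stated assumptions: the function names, the toy genome sizes and the toy profiles are hypothetical, and returning 0.0 for a constant profile (where the Pearson coefficient is undefined) is a convention chosen for this example only.

```python
import math

def universal_percentage(profile):
    """Percentage of genomes in which the superfamily occurs at least once."""
    return 100.0 * sum(1 for n in profile if n > 0) / len(profile)

def is_universal(profile, threshold=70.0):
    """Apply the 70% presence threshold described in the text."""
    return universal_percentage(profile) >= threshold

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat an undefined correlation as "no correlation"
    return cov / math.sqrt(var_x * var_y)

# Toy data (invented): genome sizes in ORFs and two occurrence profiles.
genome_sizes = [500, 1000, 2000, 4000]
profile_linear = [5, 10, 20, 40]   # grows in proportion to genome size
profile_patchy = [2, 2, 0, 2]      # present in 3 of 4 genomes, no size trend

print(universal_percentage(profile_patchy), is_universal(profile_patchy))
print(pearson(genome_sizes, profile_linear))
print(pearson(genome_sizes, profile_patchy))
```

In this toy case the first profile would fall in the highest size correlation range, while the second is universal by the 70% criterion but essentially uncorrelated with genome size.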

Figure 2. (a) Graphical representation of genome size correlation indexes (y axis) of all the 816 annotated superfamilies (x axis) in decreasing order based on their correlation index values. (b) Universal distribution percentages in decreasing order (y axis) calculated for all 816 superfamily groups (x axis). (c) Genome size correlation index values (y axis) versus universal distribution percentages (x axis). (d) Genome size correlation index values (y axis) of the 274 selected universal superfamilies (x axis) in decreasing order (universal superfamily implies universal distribution percentage ≥ 70%).


Table 1. Range classifications of the set of all 816 superfamilies

                     All superfamilies                                    Universal superfamilies
Order  Range class   N.Sup.  % Sup.  N. domains  % domains  Dupli. rate   N.Sup.  % Sup.  N. domains  % domains  Dupli. rate
1      1.0–0.7       59      7       34,185      44         579           51      6       33,572      43         658
2      0.7–0.6       70      8       10,694      14         153           35      4       8282        11         273
3      0.6–0.5       88      11      8011        10         91            29      4       4172        5          144
4      0.5–0.4       87      11      6349        8          73            30      4       4513        6          150
5      0.4–0.3       107     13      4961        6          46            21      2.5     2294        3          109
6      0.3–0.2       114     14      3720        5          33            24      3       2185        3          91
7      0.2–0.1       97      12      4237        6          44            21      2.5     2832        4          135
8      0.1–0.0       87      11      2267        3          26            25      3       1484        2          59
9      0.0–(−0.3)    107     13      3230        4          30            38      5       2317        3          61
Total                816     100     77,654      100        95            274     34      61,651      80         225

All superfamilies (left) and the universal superfamily group (right) classified on the basis of their genome size correlation indexes, together with different parameters calculated for the superfamilies grouped in each range. Columns from left to right: Order, range order; Range class, ranges of the size correlation divisions; N.Sup., number of superfamilies belonging to each range; % Sup., percentage of superfamilies with respect to the total (816 superfamilies); N. domains, total number of superfamily domains; % domains, percentage of domains compared to the total of annotated domains for the 816 superfamilies; Dupli. rate (mean), duplication rate (number of domains divided by number of superfamilies).

The set of all superfamilies and the universal sub-group were subsequently divided into nine different classes of size correlation for further analysis (Table 1). In the highest range of size correlation (between 1.0 and 0.7) there are 59 superfamily groups, representing 7% of the total number of CATH superfamilies and 44% (almost half) of all annotated CATH domains in bacteria (Table 1). These percentages remain similar, for the same range of size correlation, when only the universal sub-group is considered (see the range between 1.0 and 0.7 in Table 1). In the universal cluster there are 51 superfamilies (6% of the total, and only eight superfamilies fewer than when all superfamilies are considered) representing 43% of all annotated domains (compared to 44% when all superfamilies are considered).

The relationship between highly populated superfamilies and strong correlation with genome size suggests that both processes, superfamily duplication and increase in genome size, are strongly connected during evolution. Indeed, the fact that practically all the superfamilies with high size correlation (≥ 0.7) are universally distributed appears to be related to the frequent occurrences observed for these kinds of superfamilies. The high frequency of these superfamilies increases the probability that they will be present in a significant proportion of species and therefore be universally distributed.

The number of universal superfamilies decreases significantly in the remaining size correlation ranges when compared with the complete superfamily set. By contrast, the means of the superfamily occurrences are about twofold higher in the universal group, compared with the complete superfamily group, for the same ranges of size correlation (compare the columns for All superfamilies and Universal superfamilies for correlation indexes from < 0.7 to −0.3 in Table 1). Universal superfamilies represent 34% of all superfamilies (274 of 816) but 80% of all the annotated domains (Table 1). This high representation of a relatively low percentage of superfamilies is mirrored in the high occurrence means of these superfamilies, almost two and a half times higher than the average calculated for the set of all the superfamilies.

In general, no superfamily frequency distributions are negatively correlated with size (that is, none shows a significant negative slope). While the highest positive value of correlation reaches 0.89, the lowest, −0.27, is far from the minimum negative value of correlation (i.e. −1.0). This result indicates that superfamilies highly represented in small genomes are not observed to be, proportionally and significantly, less represented in larger genomes. Therefore, when the superfamilies are ordered on the basis of their size correlation values (Figure 2a and d), it is possible to distinguish between superfamilies that are highly correlated with genome size, and the superfamilies that are not correlated with the genome size.

For the universal set of superfamilies (universal distribution ≥ 70%) we have selected two groups for further analysis. The first group corresponds to universal superfamilies in which duplication and expansion have been highly correlated with the increasing genome size (with a correlation index ≥ 0.7; see the shaded region at the top left of Figure 2d). This group was discussed above and represents 6% of all CATH superfamilies and 43% of all annotated bacterial domains (universal superfamilies in Table 1). For future reference, the superfamilies belonging to this group will be named size-dependent superfamilies.

The second group corresponds to those universal superfamilies for which duplication and genetic diversification are not related to increasing genome size in bacteria. Superfamilies in this second group have been selected with a correlation index of ≤ 0.2 (see the shaded region in bottom right
hand of Figure 2d). This group represents 10.5% of CATH superfamilies and 9% of all annotated domains in bacteria (see universal superfamilies in Table 1). The superfamilies in this group will be referred to as size-independent superfamilies. Although the correlation limits of 0.2 and 0.7 used to differentiate size-independent from size-dependent superfamilies are, to some extent, arbitrary, they provide reasonable boundaries that delineate both groups and allow their contrasting behaviour to be compared.

Distribution analysis

Size-independent superfamilies

Size-independent and universally distributed superfamilies show no correlation between genome size and superfamily occurrences. However, for a given superfamily, the occurrences in different genomes may fluctuate around the mean to different degrees. The significance of these fluctuations can best be estimated by considering the standard deviation (σ) of the occurrences relative to the mean (M), that is, the M/σ ratio (this parameter corresponds to the inverse of the coefficient of variation; see Materials and Methods). Figure 3 uses simple examples to illustrate the fact that the standard deviation (σ) or the mean (M) on their own cannot be used to distinguish evenly distributed superfamilies, and that M/σ is a useful indicator. The implicit mathematical model used here assumes that the standard deviations change in proportion to the means. The means and corresponding standard deviations for all universal
superfamilies were plotted to check this point (data not shown) and shown to be related by a linear function (M = 0.93σ + 0.33, R² = 0.87).

Superfamilies that show a low fluctuation in proportion to the mean (high M/σ) are distributed more evenly throughout the genomes and are therefore more likely to exhibit similar duplication processes and functional diversification for all the species. These families, with similar numbers of gene copies in each genome, are unlikely to be exhibiting a random distribution and can therefore be referred to as evenly distributed. The flat distribution in the occurrences of evenly distributed superfamilies is observed at the domain level and by considering the non-redundant ORFs containing these domains (Figure 7.1a and 2a). For these families, the number of copies of the genes, and therefore the functions, must have crystallised early in bacterial evolution. By contrast, a high deviation in occurrences in proportion to the mean (low M/σ) identifies the families in which the relatives have been duplicated more extensively in some genomes compared to others. These are more likely to be families that have been able to evolve further functions in some genomes through more extensive duplication.

For the 274 universal superfamilies identified above, the relationship between the size correlation indexes and the M/σ ratio was plotted (Figure 4). Size-independent superfamilies (size correlation index ≤ 0.2) show a wide range of M/σ ratio values, varying from about 1.0 to 5.2 (Figure 4). However, it can be seen that all the universal superfamilies with high M/σ ratios (i.e. M/σ > 2.7) are size-independent (correlation index < 0.2). Therefore, the relationship between size correlation and M/σ ratio provides a good means for identifying evenly distributed superfamilies (Figure 4). The shaded region in Figure 4 highlights those size-independent and evenly distributed superfamilies (M/σ ≥ 2.7) selected for further analysis. This group comprises 30 superfamilies containing a total of 3285 domains (Table 2). The functions of these evenly distributed superfamilies are considered in the functional analysis section below.

Figure 3. Frequency distribution (y axis) of two hypothetical superfamilies (a) and (b) in the same sample of organisms (x axis). Both superfamilies have the same mean of 20, but they differ in their standard deviation (σ). The superfamily in (a) shows a uniform distribution profile compared to the superfamily in (b), which is much more dispersed from the mean. If we use the relationship between the mean (M) and standard deviation (σ) in both distributions, the M/σ ratio analysis clearly allows us to distinguish the evenly distributed superfamilies. In this example, the ratio M/σ = 3 for the more dispersed superfamily in (b) is much less than the ratio M/σ = 11 for the evenly distributed superfamily in (a). Panels (c) and (d) show the same example for two superfamilies, but in this case they have the same standard deviation value of 4, and different means.

Figure 4. Genome size correlation index values (y axis) versus M/σ ratio values (x axis), for the 274 universal bacterial superfamilies.
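
The mean-to-standard-deviation criterion for evenly distributed superfamilies can be illustrated with a short sketch. The two toy profiles below are invented and merely mimic the situation in Figure 3 (same mean of 20, different dispersion). The function name is hypothetical, and the use of the population standard deviation is an assumption, since the exact estimator is defined in Materials and Methods rather than in this section.

```python
import math
import statistics

def mean_to_sd_ratio(profile):
    """Inverse coefficient of variation of an occurrence profile."""
    mean = statistics.mean(profile)
    sd = statistics.pstdev(profile)  # assumption: population standard deviation
    return math.inf if sd == 0 else mean / sd

EVEN_THRESHOLD = 2.7  # selection threshold quoted in the text

# Two invented profiles with the same mean (20) but different dispersion.
profile_even = [18, 22, 20, 19, 21]
profile_dispersed = [10, 30, 20, 12, 28]

ratio_even = mean_to_sd_ratio(profile_even)
ratio_dispersed = mean_to_sd_ratio(profile_dispersed)

print(round(ratio_even, 2), ratio_even >= EVEN_THRESHOLD)
print(round(ratio_dispersed, 2), ratio_dispersed >= EVEN_THRESHOLD)
```

Only the first profile passes the threshold, even though both have identical means, which is the point made by the hypothetical examples of Figure 3.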

Size-dependent superfamilies

Generally, the size-dependent superfamilies show high mean occurrences in the bacterial proteomes (Table 1), suggesting that these superfamilies also have a high probability of being broadly distributed in the bacterial phylum, and are therefore more likely to be providing ancestral functions. We have used the Pearson coefficient to measure the strength of the linear relationship between the number of domain occurrences and genome sizes. However, for some non-linear relationships it is possible to find significant Pearson correlation values, > 0.7 (see, for example, Figure 5c). Therefore, although the high correlation with genome size for these superfamilies suggests that domain frequency and genome size are mathematically related, it is not always obvious what function best describes this relationship.

Superfamilies that show a linear relationship with the genome size must show a more or less similar proportion of copies per genome size unit (in this case the unit of measurement of the genome size is the number of ORFs). When the genome frequencies of such linearly distributed superfamilies are normalised by the genome size, the normalised frequencies will show a flat distribution (slope close to 0) throughout the different species, close to the mean for all of them (for example, see Figure 5a and b). When the distribution is non-linear, for example when the distribution is exponential with respect to the genome size, the size-normalised frequencies are more dispersed from the mean (Figure 5c and d). Therefore, using the M/σ ratio calculated on normalised superfamily occurrences, it is possible to distinguish those size-dependent and linearly distributed superfamilies from the rest of the size-dependent superfamilies, which are referred to as non-linearly distributed superfamilies.

The frequencies of the universal superfamilies were therefore normalised by the mean of the genome sizes over the 56 bacterial species. For each species, the normalisation was calculated by multiplying the frequencies of each superfamily by the mean size of the 56 studied genomes (2317 ORFs: see Table 1 in the Supplementary Material) and dividing by the genome size. The normalised frequencies were then used to recalculate the mean and standard deviation and, likewise, the ratio between these two parameters (N_M/N_σ; all the normalised parameters are given the N_ prefix in this work). The plot of size correlation versus N_M/σ ratio shows that high N_M/σ values also correspond to high size correlation indexes (Figure 6). For the highest N_M/σ ratios (> 1.9) practically all the superfamilies are highly correlated with increase in genome size (size correlation ≥ 0.7). When N_M/σ ratio values fall below 1.25, size correlation and the N_M/σ ratio lose their relationship and change randomly. In this second region (N_M/σ < 1.25), any N_M/σ value can correspond to a wide range of size correlation indexes.

Therefore, as with the size-independent superfamilies, we have used the plots of size correlation versus N_M/σ to distinguish between the two main extreme distributions in the size-dependent set of superfamilies. The first group corresponds to superfamilies that are linearly distributed with respect to the genome size; that is, the superfamilies with a high N_M/σ ratio of ≥ 1.9 and a correlation index of ≥ 0.7.
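
The size normalisation and the resulting linear versus non-linear distinction can be sketched as below. This is an illustrative sketch only: the function names, the four toy genome sizes and both profiles are invented, the population standard deviation is an assumption, and the thresholds are the ones quoted in the text (N_M/σ ≥ 1.9 for the linear group, N_M/σ ≤ 1.25 for the non-linear group, size correlation ≥ 0.7 for both).

```python
import math
import statistics

def normalise(profile, genome_sizes):
    """Scale each occurrence by (mean genome size / size of that genome)."""
    mean_size = statistics.mean(genome_sizes)
    return [n * mean_size / size for n, size in zip(profile, genome_sizes)]

def normalised_ratio(profile, genome_sizes):
    """Mean-to-standard-deviation ratio of the size-normalised occurrences."""
    norm = normalise(profile, genome_sizes)
    sd = statistics.pstdev(norm)  # assumption: population standard deviation
    return math.inf if sd == 0 else statistics.mean(norm) / sd

def classify(size_correlation, n_ratio):
    """Assign a size-dependent superfamily to one of the two extreme groups."""
    if size_correlation < 0.7:
        return "not size-dependent"
    if n_ratio >= 1.9:
        return "linear"
    if n_ratio <= 1.25:
        return "non-linear"
    return "intermediate"

# Toy data (invented): genome sizes in ORFs and two occurrence profiles.
genome_sizes = [500, 1000, 2000, 4000]
profile_linear = [5, 11, 19, 41]     # roughly proportional to genome size
profile_nonlinear = [0, 2, 16, 80]   # absent in the smallest genome, steep growth

ratio_linear = normalised_ratio(profile_linear, genome_sizes)
ratio_nonlinear = normalised_ratio(profile_nonlinear, genome_sizes)

# The size correlation of 0.8 is passed in by hand here purely for illustration.
print(classify(0.8, ratio_linear))
print(classify(0.8, ratio_nonlinear))
```

After normalisation the roughly proportional profile becomes almost flat, so its ratio is high, whereas the steeply growing profile stays widely dispersed around its mean.
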
For this group, 12 superfamilies were selected with a total of 17,490 domains (see the upper right shaded region in Figure 6, and Table 2). The non-linearly distributed group comprises those superfamilies with a low N_M/σ value of ≤ 1.25 and a size correlation index of ≥ 0.7. For this group, 15 superfamilies were selected with a total of 7508 domains (see the upper left shaded region in Figure 6, and Table 2). Linearly distributed superfamilies are shown to be present even in the smallest genomes, which usually correspond to the parasitic or symbiotic bacteria (Figure 7.1.b and 2.b). By contrast, the top-ranked non-linearly distributed superfamilies show a distribution close to a quadratic function (polynomial function of order 2) (Figure 7.1.c and 2.c) and tend not to be present in the smallest genomes. All superfamilies were evaluated individually by visual inspection of their distribution plots (see third rows in the x axis and the legend to Figure 8).21,22 The size-dependent superfamilies (both linearly and non-linearly distributed) have undergone a

Table 2. Statistics from COG functional annotation and analysis for the main groups: evenly, linearly and non-linearly distributed superfamilies

Distribution group   Me. Size Cor.   Me. M/σ   Me. N_M/σ   Num. Sups.
Evenly               −0.11           3.24      1.18        30
Linearly             0.81            1.55      2.26        12
Non-linearly         0.78            0.84      1.06        15
Total                                                      57

Domains
Distribution group   Num.     Dupli. Rate (Mean)   N. COG hits   % N. COG hits
Evenly               3285     109                  2499          76
Linearly             17,490   1458                 12,427        71
Non-linearly         7508     500                  4032          54
Total                28,283                        18,958        67

ORFs
Distribution group   N. Non Redun.   N. COG hits   % N. COG hits
Evenly               2120            1947          92
Linearly             14,177          11,814        83
Non-linearly         6121            3659          60
Total                22,418          17,420        78

COG classification
Distribution group   Num. COG clus.   Num. Gen. Fun.
Evenly               67               6
Linearly             598              20
Non-linearly         183              16
Total                848              25

Most frequent general functions: Translation, ribosomal structure and biogenesis (1712 ORFs); Amino acid transport and metabolism (1987); Replication, recombination and repair (1032); Transcription (1476); Signal transduction mechanisms (583).

Columns from left to right: Me. Size Cor., mean of size correlation indexes; Me. M/σ, average of M/σ ratios; Me. N_M/σ, mean of normalised M/σ ratios; Num. Sups., number of superfamilies in each group. For the domain classification section, columns from left to right: Num., number of domains; Dupli. Rate (Mean), duplication rate (number of domains divided by number of superfamilies); N. COG hits, number of domains annotated in the COG database; % N. COG hits, percentage of domains annotated in the COG database. In the ORFs classification section, columns from left to right: N. Non Redun., number of non-redundant ORFs annotated with at least one CATH domain (for non-redundant ORFs selection, see the legend to Figure 9); N. COG hits, number of non-redundant ORFs annotated in the COG database; % N. COG hits, percentage of ORFs annotated in the COG database. Then, continuing with the main Table, columns from left to right: Num. COG clus., number of different orthologue clusters in which ORFs are grouped in the COG database; Num. Gen. Fun., number of general functions; the most frequent general functions and their frequencies (figures in parentheses).


Figure 5. Graphical representation of (a) the original frequencies, and (b) normalised frequencies of one linearly distributed superfamily; (c) the original frequencies, and (d) normalised frequencies of one exponentially distributed superfamily.

higher rate of duplication than the evenly distributed superfamilies (Table 2), and are very likely to be superfamilies that are closely involved in the functional diversification of biochemical systems in bacteria. A more detailed functional analysis of all three sets of superfamilies (size-independent evenly distributed, size-dependent linearly distributed and size-dependent non-linearly distributed) is presented below.

Functional analysis

Functional annotation with the COG database was accomplished for all the superfamily groups, as the three superfamily sets are well represented within this resource (Table 2). The COG database is, notionally, hierarchically classified into functional categories. The ORFs are annotated at the level of gene product and grouped by homology

Figure 6. Genome size correlation index values (y axis) versus normalised M/σ ratio values (x axis) for the 274 universal bacterial superfamilies.

Figure 7. Total domain and non-redundant ORF frequencies (y axis) versus genome sizes in the 56 species (x axis), for the three superfamily groups. The total domain frequency is represented by the plots in the left column (1), and the total non-redundant ORF frequency by the plots in the right column (2). Plots 1.a and 2.a correspond to the evenly distributed superfamilies. Plots 1.b and 2.b display the total frequency distribution for the linearly distributed set. Plots 1.c and 2.c give the total frequency distribution for the non-linearly distributed superfamilies.

Figure 8. Frequencies (y axis) of the superfamilies (x axis) ordered on the basis of decreasing frequency values, in the (a) evenly distributed, (b) linearly distributed and (c) non-linearly distributed sets. All superfamilies are numbered in each of the three main groups (first row in x axis). Their main functional role in the COG database is indicated with a one-letter code (second row in x axis). The visual evaluation of superfamily distribution plots with respect to the increasing genome size (flat slope, linear or quadratic regression lines) is indicated (third row in x axis). (a) Evenly distributed superfamilies are numbered from 1 to 30 (first row of x axis), and the superfamily frequencies are represented by dark bars (y axis). The second row gives the functional classifications: S, evenly distributed superfamilies involved in the synthesis of proteins (translation). G, superfamily involved in the glycolysis pathway. The last row of symbols on the x axis indicates the visual evaluation of the best regression line fitting superfamily occurrence distribution with genome size: ) , no correlation with size and slope equal to zero. 
The evenly distributed superfamilies are as follows: 1, tRNA synthetase domain (CATH code 3.40.690.10) (CATH-PDB representative code 1eovA2); 2, tRNA synthetase (3.40.510.10) (2ts101); 3, ribosomal protein S5 (3.30.230.10) (1pkp02); 4, tRNA synthetase (3.40.50.800) (1httA2); 5, ribosomal protein S5 (2.40.50.30) (1b7yB2); 6, tRNA synthetase (3.90.740.10) (1ile02); 7, ribosomal protein S2 (3.40.50.1990) (1fjgB1); 8, ribosomal protein L6P/L9P (3.66.20.12) (1rl6A1); 9, ribosomal protein L1 (3.30.190.20) (1ad201); 10, signal recognition particle GTPase (1.20.120.140) (2ng101); 11, tRNA synthetase (3.30.56.20) (1b7yB4); 12, ribosomal protein S15P/S13E (1.10.287.10) (1fjgO0); 13, ribosomal protein L14 (2.40.150.20) (1whi00); 14, ribosomal protein L2 (2.20.30.70) (1rl2A2); 15, ribosomal protein S6 (3.30.70.60) (1ris00); 16, ribosomal protein S13 (1.10.8.50) (1fjgM1); 17, ribosomal protein S11 (3.30.420.80) (1fjgK0); 18, ribosomal protein S3 (3.30.300.60) (1fjgC1); 19, 3-phosphoglycerate kinase (3.40.50.1270) (1qpg02); 20, pseudouridylate synthase domain 1 (3.30.70.580) (1dj0A2); 21, pseudouridylate synthase domain 2


(3.30.70.660) (1dj0A1); 22, ribosomal protein S10 (3.30.70.600) (1fjgJ0); 23, ribosomal protein S8 (2.30.35.20) (1seiA2); 24, ribosomal protein S5 (3.30.160.30) (1pkp01); 25, ribosomal protein S8 (3.30.70.350) (1seiA1); 26, ribosomal protein S19 (3.30.860.10) (1qkfA2); 27, ribosomal protein L1 (3.40.50.790) (1ad202); 28, ribosomal protein L22 (3.90.470.10) (1bxeA0); 29, ribosomal protein S9 (3.30.230.20) (1fjgI0); 30, ribosomal protein L2 (2.40.50.150) (1rl2A1).
(b) Linearly distributed superfamilies are numbered from 1 to 12 (x axis), and the superfamily frequencies are represented by dark bars (y axis). Functional classifications are: M, linearly distributed superfamilies involved in metabolism. The symbols indicate: *, distributions for which visual evaluation shows a linear correlation with size; —, distributions that do not show a linear correlation. Linearly distributed superfamilies are: 1, P-loop containing nucleotide triphosphate hydrolase domain (3.40.50.300) (1efuA1); 2, NAD(P)-binding (3.40.50.720) (1db3A0); 3, methyltransferases (3.40.50.150) (1admA1); 4, type-I PLP-dependent aspartate aminotransferase, major subunit (3.40.640.10) (1tplA2); 5, type-I PLP-dependent aspartate aminotransferase, capping domain (3.30.70.160) (1ars01); 6, FMN-dependent oxidoreductase-like superfamily domain (3.20.20.90) (1ep2A0); 7, thiamine diphosphate binding domain (3.40.50.970) (1poxA3); 8, thioredoxin-like (3.40.30.10) (1ego00); 9, dehalogenase-like hydrolases (3.40.50.1000) (1jud01); 10, nucleotidyl transferase-like (3.90.550.30) (1e5kA0); 11, metal-dependent hydrolases (3.20.20.140) (1a4mA0); 12, carbamoyl-phosphate synthase L chain, N-terminal domain (3.40.50.20) (1iow01).
(c) Non-linearly distributed superfamilies are numbered from 1 to 15 (first row of the x axis), and the superfamily frequencies are represented by dark bars (y axis).
Functional code: R, non-linearly distributed superfamilies involved in regulation; r, probably involved in regulation; O, oxidoreductase domains; T, TPP-binding domain (thiamine pyrophosphate as cofactor); P, peptidase. The symbols in the third row of the x axis indicate the visual evaluation of superfamily distributions: √, distribution shows a quadratic relationship with genome size; —, distribution does not show a quadratic relationship; ?, distribution shows a poor fit to a quadratic function. Non-linearly distributed superfamilies are: 1, winged helix repressor DNA-binding domain (1.10.10.10) (1lea00); 2, homeodomain-like (1.10.10.60) (2tct01); 3, response regulator receiver domain CheY (3.40.50.3000) (3chy00); 4, histone acetyltransferase-like (3.40.630.30) (1cjwA0); 5, λ-repressor DNA-binding domain (1.10.260.10) (1neq00); 6, aldehyde dehydrogenase domain 1 (3.40.605.10) (1ag8A1); 7, medium-chain (zinc-containing) alcohol dehydrogenases, catalytic domain (3.90.180.10) (1qorA1); 8, His-kinase signal transduction protein (1.20.15.210) (1joyA0); 9, SBP-bac3 domains, bacterial extracellular solute-binding protein domain, chemoreceptor (3.40.190.30) (1pda01); 10, aldehyde dehydrogenase domain 2 (3.40.309.10) (1ag8A2); 11, carbamoyl-phosphate synthase L chain, ATP domain (2.30.35.30) (1iow03); 12, DJ-1/PfpI family domain (3.40.50.1090) (1cf9A3); 13, TPP-binding domain (3.40.50.70) (1poxA2); 14, metallocarboxypeptidase, peptidase M20 family dimerisation domain (3.30.70.360) (1cg2A2); 15, isomerase, endoribonuclease L-PSP (3.30.70.130) (2chsA0).


Figure 9. Functional distributions amongst the four main broad functional COG classes for: (a) evenly distributed, (b) linearly distributed and (c) non-linearly distributed superfamilies. General functional statistics and frequency estimations were done with a non-redundant set of the ORFs in each group of superfamilies.

into clusters of orthologues. These clusters of orthologues are classified into 25 general functional categories (Figure 10), and these categories are subsequently grouped into four broader functional classes (information storage and processing, cellular processes and signaling, metabolism, and poorly characterized; Figure 9). In addition, PFAM,23 SWISS-PROT,24 SCOP,25,26 and the literature were used as complementary resources for functional annotation.
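The two-level COG hierarchy described above can be sketched as a simple lookup. This is a minimal illustration, not the procedure used in this study: the one-letter codes follow the legend to Figure 10, while the assignment of letters to the four broad classes is our reading of the COG scheme and should be checked against the release actually used. The names `COG_CLASSES` and `broad_class_fractions` are hypothetical.

```python
from collections import Counter

# Assumed grouping of the 25 one-letter COG categories into four broad classes.
COG_CLASSES = {
    "information storage and processing": "JAKLB",
    "cellular processes and signaling": "DYVTMNZWUO",
    "metabolism": "CGEFHIPQ",
    "poorly characterized": "RS",
}

# Invert the table: one-letter category -> broad class name.
CATEGORY_TO_CLASS = {
    letter: broad for broad, letters in COG_CLASSES.items() for letter in letters
}


def broad_class_fractions(category_letters):
    """Return the fraction of annotations falling in each broad class.

    `category_letters` is an iterable of one-letter COG codes, one per
    annotated (non-redundant) ORF.
    """
    counts = Counter(CATEGORY_TO_CLASS[c] for c in category_letters)
    total = sum(counts.values())
    return {broad: counts.get(broad, 0) / total for broad in COG_CLASSES}


# Toy input: three translation (J) annotations and one carbohydrate (G) annotation.
example = broad_class_fractions("JJJG")
```

Summing category-level annotations in this way is what allows the same non-redundant ORF set to be summarised at both levels of the hierarchy.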

Evenly distributed superfamilies

Functional analysis shows that evenly distributed superfamilies are involved mainly in information storage and processing functions, with 90% of the functional annotations belonging to this broad class (Figure 9a). A more detailed analysis shows that 88% of all functional annotations for these superfamilies are associated with translation, ribosomal structure and biogenesis


(Figure 10). Furthermore, although annotations for these superfamilies involve six of the 25 general functions in the COG classification (Table 2), five of these six categories represent only 10% of the annotated domains. The remaining 90% belong to the translation and protein biosynthesis category (Figure 10). Almost all the evenly distributed superfamilies are classified into a single general function (translation, ribosomal structure and biogenesis) (Figure 8a), except one, the 3-phosphoglycerate kinase, an important enzyme in glycolysis (see superfamily 19 in Figure 8a). Twenty-one of the domain superfamilies are ribosomal proteins and five are functionally annotated as tRNA synthetases. The remaining domains are involved in protein biosynthesis: for example, the two domains of the pseudouridylate synthase involved in tRNA modification (see superfamilies 20 and 21 in Figure 8a), and the prokaryotic homologues of the eukaryote signal recognition particle receptor that mediates binding of the signal sequence of the nascent polypeptide from the ribosome (see superfamily 10 in Figure 8a). In summary, all the evenly distributed superfamilies show a mean size correlation very close to zero, suggesting that they are not involved in the size diversification of bacteria (Table 2). Additionally, the high mean M/σ ratio of these superfamilies indicates a narrow deviation of gene copies from the mean in bacterial evolution, with a uniform number of gene copies present in each genome (Table 2). They also have a relatively low duplication rate (109 domains/superfamily) compared with the other types of superfamilies (Table 2). The low functional diversification with practically only one functional category, and the

Figure 10. Distributions of the functional annotation frequencies in the three main superfamily sets (evenly, linearly, and non-linearly distributed) amongst the 25 general functional categories in COG. The frequency of functional annotations refers to the same non-redundant set of the ORFs used in Figure 9 (see the legend to Figure 9). The one-letter code for the COG categories is as follows: J, translation, ribosomal structure and biogenesis; A, RNA processing and modification; K, transcription; L, replication, recombination and repair; B, chromatin structure and dynamics; D, cell cycle control, cell division, chromosome partitioning; Y, nuclear structure; V, defense mechanisms; T, signal transduction mechanisms; M, cell wall/membrane/envelope biogenesis; N, cell motility; Z, cytoskeleton; W, extracellular structures; U, intracellular trafficking, secretion, and vesicular transport; O, post-translational modification, protein turnover, chaperones; C, energy production and conversion; G, carbohydrate transport and metabolism; E, amino acid transport and metabolism; F, nucleotide transport and metabolism; H, coenzyme transport and metabolism; I, lipid transport and metabolism; P, inorganic ion transport and metabolism; Q, secondary metabolites biosynthesis, transport and catabolism; R, general function prediction only; S, function unknown. The broader classes into which the 25 functional categories are grouped are indicated.



low number of orthologous clusters in the COG database to which they have been assigned (67 clusters, Table 2), suggest an early functional crystallisation of these superfamilies in bacterial evolution.

Linearly distributed superfamilies

Linearly distributed superfamilies are the most represented group in this analysis, with 17,490 annotated domains in 14,177 different ORFs (Table 2). These superfamilies are present in practically all the general functional classes in bacteria, covering 20 of the 25 functional categories in COGs (Table 2). Of the five categories not represented in these superfamilies, four are involved primarily in specific eukaryote biosystems (see categories A, B, Y and Z in Figure 10), and the fifth is the very specific extracellular structures category (see category W in Figure 10). This last category is poorly represented, with only one COG group of orthologues, that of the autotransporter adhesin sequences in bacteria. It can be seen from Figure 9b that domains in these superfamilies are involved mainly in cellular metabolism, with 50% having these functional annotations, followed by cellular processes (21%), and information storage (15%). There is a remarkable number of poorly characterised domains with a general function prediction only (14%) (Figures 9b and 10), suggesting that a significant number of domains in these superfamilies have diverged, providing new gene functions that are not well characterised in the databases. The most populated general categories, apart from general function prediction only (14%), are amino acid transport and metabolism (17%), and replication, recombination and repair (9%) (Figure 10). However, other functional categories are represented significantly for these superfamily domains and indicate a broad functional expansion for the superfamilies in this group.
These superfamilies show the highest mean values of duplication (1458 domains/superfamily) and cover the highest number of general functions and orthologous clusters (598 clusters) compared to the other groups (Table 2). The results support the hypothesis that these superfamilies have undergone an important genetic expansion and functional diversification during bacterial evolution. All the linearly distributed superfamilies were functionally analysed in more detail (Figure 8b). The most populated superfamilies are the nucleotide triphosphate hydrolase superfamily, with 7091 domains in bacteria, and the NADP-binding domain, with 2711 domains (see superfamilies 1 and 2 in Figure 8b). The nucleotide triphosphate hydrolase and the NADP-binding domains have appeared as the most frequently occurring structural domains in other studies.27,28 Both these domains show extreme functional versatility,29 with representatives in almost all COG functional subcategories. The nucleotide triphosphate hydrolase domains provide motion and

reaction energy in all organisms (prokaryotes and eukaryotes). Additionally, they can act as kinases on their own or combined with other domains. The same functional versatility is observed for the NADP-binding domains that provide oxidising and reducing energy for a huge number of reactions.27,28 Other examples are the FMN-dependent oxidoreductase-like domain, the two domains of the type-I PLP-dependent aspartate aminotransferase, and the metal-dependent hydrolase superfamilies (see superfamilies 6, 4 and 5, and 11, respectively, in Figure 8b). FMN-dependent oxidoreductase domains adopt a TIM barrel fold and use FMN as a cofactor in oxidoreductase reactions on diverse substrates. The type-I PLP-dependent aspartate aminotransferases are made up of two structural domains. Enzymes in this superfamily use PLP as a cofactor in different metabolic reactions such as transamination or decarboxylation. All the metal-dependent hydrolase domains adopt a TIM barrel fold with a divalent metal-binding site. This family of enzymes performs the metal-assisted activation of a water molecule for nucleophilic attack on substrates.30 The linearly distributed superfamily domains are represented in a huge variety of different enzyme families, making it very difficult to assign a single functional definition relating to metabolism in bacteria. In general, these domains seem to provide oxido-reducing energy and/or basic mechanisms of reaction for a huge number of catabolic and anabolic processes. The frequent occurrence of these superfamily domains, and their presence in genes involved in practically all levels of bacterial metabolism, make these domains useful markers with regard to metabolic “weight” or content in bacterial genome comparison.

Non-linearly distributed superfamilies

Non-linearly distributed superfamilies are present in a significant number of different functional categories (16 categories, Table 2 and Figure 10), although fewer than the linearly distributed group.
For this group, information storage and processing has a prominent presence (48% of the annotated domains) (Figure 9c), followed by metabolism (25%), cellular processes (18%), and poorly characterised (9%). Further analysis of the functional categories reveals that these superfamilies are particularly involved in transcription within the broader information processing class (40%), and in signal transduction (16%) within the cellular processes and signalling class. However, non-linearly distributed superfamilies have notably low coverage (60%) of ORFs by COG annotation (see N. COGs hits row in Table 2), suggesting a poorer functional characterisation of these superfamilies compared to the other two groups, evenly and linearly distributed, where the annotation covers 92% and 83% of the non-redundant ORFs, respectively (Table 2). A more


detailed analysis of the non-linearly distributed superfamilies, using other functional resources such as PFAM, SWISS-PROT and the literature, shows that the percentage of domains involved in gene regulation is underestimated by using COG data alone, and is in fact significantly higher (for example, see Figure 8c). As with the linearly distributed families, the high domain duplication rate (500 domains/superfamily), the number of general functions (16), and the high number of orthologue clusters in the COG database (183), indicate that these non-linearly distributed superfamilies have been expanded genetically and functionally during bacterial evolution. However, these values are, in general, about three times lower than those of the linearly distributed group and about three times as high as those of the evenly distributed set (Table 2). Therefore, these results suggest that non-linearly distributed superfamilies have undergone an intermediate rate of gene duplication and functional diversification compared with the other two groups. Non-linear superfamilies contain several important DNA-binding transcription factor domains such as superfamilies of the winged-helix repressor (1852 domains), the homeodomain-like (942 domains), and the λ-repressor DNA-binding (588 domains) (see superfamilies 1, 2 and 5 in Figure 8c). DNA-binding transcription factors are an important component in regulation of gene expression. Usually, they bind to a second domain that mediates the response to changes in the cellular environment. Our results are broadly in agreement with a recent analysis by Babu et al.,31 which showed that the winged-helix repressor, the homeodomain-like and the λ-repressor domains constitute 90% of all DNA-binding transcription factors found in Escherichia coli. The various structural domains involved in the bacterial two-component signal transduction system are represented prominently in the non-linear superfamily set.
This system is a basic mechanism of regulation that has been highly extended and diversified in bacteria. The two-component system generally involves transmembrane receptors (the first component) that sense extracellular signals and transmit information to an intracellular signalling component (the second component) that in turn regulates the responses elicited by target effector proteins.32 Therefore, each two-component system comprises two domains. The SBP-bac3 domains (see superfamily 9 in Figure 8c), or bacterial extracellular solute-binding proteins, serve as chemoreceptors in the periplasmic sensing domains of receptors (first component). These domains also function as recognition constituents of transport systems, and as initiators of signal transduction pathways.33 Histidine kinase domains (see superfamily 8 in Figure 8c) constitute the cytoplasmic signalling domain of the first component, and mediate signal transduction in bacteria by downstream phosphorylation of response regulator proteins. The


response regulator receiver domain CheY (see superfamily 3 in Figure 8c) receives the signal from the sensor partner in bacterial two-component systems, and is usually found N-terminal to a DNA-binding effector domain.34 The DJ-1/PfpI domain is also found in transcriptional regulators (see superfamily 12 in Figure 8c). Other non-linear superfamilies shown to be involved in regulation include the endoribonuclease L-PSP protein domain (see superfamily 15 in Figure 8c) that inhibits protein synthesis by cleavage of single-stranded mRNA.35,36 Some superfamilies are putatively involved in regulation, such as the acetyltransferase (GNAT) domain superfamily (see superfamily 4 in Figure 8c). This domain is the bacterial homologue of the eukaryote histone acetyltransferases that promote transcriptional activation.37 Several other members of this family are associated with gene regulation in prokaryote and eukaryote cells.38 Some other enzymes in this family have been recognised to be related to metabolism, detoxification, and drug resistance.39 In other, less populated, non-linearly distributed superfamilies there appears to be no connection with regulation. Examples are the zinc-containing alcohol dehydrogenase domain and the two domains of the aldehyde dehydrogenase family (see superfamily 7, and 6 and 10, respectively, in Figure 8c) that bind NADP as cofactor; the peptidase-M20 family domain (see superfamily 14 in Figure 8c); the TPP-binding domain (see superfamily 13 in Figure 8c) that binds TPP as cofactor; and the CPSase-L-D2 ATP-binding domain of the carbamoyl-phosphate synthase that initiates the urea cycle (see superfamily 11 in Figure 8c). Overall, however, the more detailed functional analysis summarised above reveals that there are eight superfamilies, with a total of 5213 domains, clearly involved in regulation; these represent about 70% of all the domains belonging to non-linearly distributed superfamilies.
If the acetyltransferase (GNAT) family domain, with a possible role in regulation, is included, this percentage rises to 60% of all superfamilies and to 80% of all domains (Figure 8c). This strongly supports the suggestion that the non-linearly distributed superfamilies are involved in functional regulation in bacteria, although the extreme modularity of the regulation system in bacteria obscures this observation to some extent. In these superfamilies, the winged-helix DNA-binding domains or the homeodomain-like domains are usually combined with other domains responsible for the specificity of the regulation, such as small molecule-binding or enzymatic domains.27

Statistical validation of the functional composition for the three main superfamily groups

In order to assess the statistical significance of the trends suggested by the above analyses, three independent sets of 1000 random samples with


the same sizes as the three main superfamily groups (sizes of 30, 12 and 15 superfamilies) were selected from the set of 274 universal superfamilies. All of the universal superfamilies (274) were functionally annotated using information from the COG database in order to analyse the distribution of functional properties within each random sample. To test the significance of the evenly distributed cluster, we randomly selected samples (of size 30) and estimated the number of superfamilies involved in protein biosynthesis. A frequency distribution of sample classes was plotted (Figure 1a in Supplementary Material). The Poisson distribution of the 1000 different samples revealed that, by chance, we can expect, on average, six superfamilies to be involved in protein biosynthesis in each sample of 30 randomly selected universal superfamilies. Therefore, the fact that 28 superfamilies associated with protein biosynthesis are identified within the sample of 30 evenly distributed superfamilies reported in this work is highly significant, as it is highly improbable that this would be observed following random selection. (See the black arrow in Figure 1a of the Supplementary Material.) For the linearly distributed cluster test, 1000 random samples of 12 superfamilies were selected. The most probable observation is that random samples will have seven metabolic superfamilies per 12 randomly selected superfamilies. For the superfamilies reported here, the probability p is about 0.003 (see the black arrow in Figure 1b of the Supplementary Material) which, though not as significant as for the evenly distributed superfamilies, is highly significant. After analysing the 1000 random samples for the non-linearly distributed cluster test, it was clear that the probability of finding eight or nine superfamilies associated with regulation, within a sample of 15 superfamilies, as reported in the text, was extremely low (see the black arrow in Figure 1c of the Supplementary Material).
Therefore, these results support the statistical significance of the functional composition of the three main groups identified; evenly, linearly, and non-linearly distributed. As demonstrated above, this statistical significance is particularly remarkable for the evenly distributed (protein biosynthesis) and non-linearly (regulation) distributed sets.
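The random-sampling test described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the analysis: the function name `empirical_p_value`, the fixed seed, and the toy labelling of 55 of the 274 universal superfamilies as protein-biosynthesis related (chosen only so that a sample of 30 contains about six on average) are all hypothetical.

```python
import random


def empirical_p_value(flags, sample_size, observed, n_samples=1000, seed=0):
    """Estimate how often a random sample matches or exceeds `observed`.

    `flags` holds one boolean per universal superfamily (True if it carries
    the functional annotation of interest). The function draws `n_samples`
    random samples of `sample_size` superfamilies without replacement and
    returns the fraction containing at least `observed` flagged members.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        sample = rng.sample(flags, sample_size)
        if sum(sample) >= observed:
            hits += 1
    return hits / n_samples


# Hypothetical labelling: 55 of 274 superfamilies flagged (an assumption).
toy_flags = [True] * 55 + [False] * (274 - 55)

# Chance of seeing at least 28 flagged superfamilies in a sample of 30.
p_even = empirical_p_value(toy_flags, sample_size=30, observed=28)
```

With only 1000 samples the smallest non-zero estimate is 0.001, so an estimate of zero should be read as "below the resolution of the test" rather than as an exact probability.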

Conclusions

In summary, by examining distributions of superfamily occurrences across 56 bacterial genomes and selecting the superfamilies with extreme characteristics with regard to genome size correlation and M/σ ratios, we have been able to distinguish three groups of superfamilies, each with largely distinct functional characteristics. However, in drawing any conclusions from the data it is important to bear in mind the technical features of


this analysis. Although there are now about 100 completed bacterial genomes, we have investigated distributions in almost 60 completed bacterial genomes, a sample we felt was sufficiently representative for trends to be identified. The sequence-based annotation tools we employed recognise about 50–60% of the distant homologues that would be identified using structural data.40 Since the most efficient fold recognition methods still detect only up to 70% of these distant relatives,13 we have considered that it is reasonable to assume that a superfamily is universal when it is present in at least 70% of the genomes. Furthermore, although there may be some bias in the set of structural domains used for genome annotation, recent analyses13,41 suggest that this is not significant, although clearly membrane-based proteins are still under-represented. Therefore, bearing in mind these considerations, the data collected allow us to confidently answer some of the important questions formulated in the Introduction. It is clear that not all the superfamilies have contributed in the same manner to the increases in genome sizes, and the data can be used to distinguish the superfamilies that show an important correlation between their genetic expansion and increasing bacterial size, from those where evolution has been completely independent of genome size. Although the universal set of genome size-dependent superfamilies we analyse here represents only 6% of all annotated CATH superfamilies, they account for almost half of all the annotated domains (43%) (Table 1). Therefore, only a few superfamilies seem to have been primarily responsible for the increase in genome size in bacteria. Additionally, by also considering the M/σ characteristics of the superfamilies it has been possible to further distinguish superfamily groups with specific functional roles. Protein biosynthetic machinery is represented by superfamilies whose occurrences are found to be independent of genome size.
Therefore, domains involved in translation and protein biosynthesis have not contributed to the diversity of sizes observed in bacteria. These evenly distributed superfamilies perform essential functions universally represented in bacteria but they are not responsible for bacterial speciation or functional diversification. Evenly distributed superfamilies belong to ancestral molecular mechanisms with functions that have remained stable throughout evolution. It is very probable that these superfamilies still perform the functions they performed in the progenote cell, especially as there has been no innovation in these superfamilies for a long time, at least since the origin and diversification of all prokaryote branches (Eubacteria and Archaea), implying ancestral and conserved functions. These results strongly support the hypothesis that translation crystallised earlier than any other system in evolution.42,43 Amongst the universal superfamilies shown to be size-dependent, the relationship can be


mathematically described as either linear or non-linear with respect to genome size, with different functional characteristics associated with each type of behaviour. Linearly distributed superfamilies are involved mainly in metabolism, for all bacteria, including the smallest organisms (parasites and symbionts). Non-linearly distributed superfamilies tend to be absent from the smallest organisms and increase their frequency of occurrence in the larger genomes. These superfamilies are associated mainly with gene expression (transcription) and control (signal transduction processes). Gene control seems to be simple for the smallest bacteria and increases quadratically with gene complexity in longer genomes. This observation suggests that an increase of the number of genes involved in metabolism demands an even greater development of transcription regulation mechanisms. Bacteria are single-celled organisms for which genome sizes vary by little more than an order of magnitude in length, but which display extraordinary variation in their metabolic properties, cellular structures and lifestyles. Even within relatively narrow taxonomic groups, such as the enteric bacteria, the phenotypic diversity among species is remarkable.5 Bacterial genome size appears to be the sum of different genetic events, such as gene duplication from a common ancestor, horizontal transfer and lineage-specific gene loss,7 and is therefore not a good indicator of evolutionary lineage. Smaller genomes are not necessarily the result of a lower rate of gene duplication and evolution. Phylogenetic analyses have revealed that bacteria with the smallest genomes are derived from bacteria with larger genomes, through evolution by reduction.2–4,44 Therefore, we clearly cannot use our methodology to analyse the lineage dependency of the superfamilies and model all the evolutionary steps by which the superfamilies have been expanded or deleted within a given organism.
However, and although it is difficult to determine and is probably under-estimated with currently available data, lateral or horizontal transfer of genes has been calculated to affect only about 8% of the genome.5 This implies that most duplication and divergence processes, in the size-dependent superfamilies, have probably been lineage-specific in bacteria.45 One main question is why do the bacteria increase their proteomes? Clearly, by expanding their functional repertoire they can promote their survival. However, various hypotheses also argue that genome size is itself the object of selection: that is, that the tight packing and small size of bacterial genomes is an adaptation to promote efficiency or competitiveness during reproduction.3 There is clearly a balance between maintaining a minimal genome size and the need to respond to, or exploit, environmental conditions. The size-dependent superfamilies have the highest rate of duplication and additionally cover the greatest number of functional categories compared


with the evenly distributed superfamilies (Table 2). That is, gene duplication quickly gives rise to functional diversification in bacterial evolution and genome size variation translates almost directly into biochemical, physiological and organismal complexity because the majority of the duplicated sequences are functional protein-coding regions.3,4 Thus, bacteria increase their genome sizes to enrich their complexity and therefore their functome. Gene duplication is clearly the first step in increasing the functome, and thus can bring important benefits in speciation and adaptation to the environment. However, this increase in complexity is balanced by a cost in terms of regulation. Increasing numbers of genes implies increasing logistical problems in distinguishing signal from noise,46 and so gene number is intrinsically limited by the imprecision of the biochemical mechanism governing gene expression. Because domains involved in cell regulation increase quadratically, for longer genomes the number of genes involved in regulation quickly surpasses the number involved in metabolism. This conclusion agrees with the observations and results obtained recently by other groups.47,48 Therefore, the need to maintain regulation appears to restrict increasing complexity in bacteria. Probably, this limit was established early in evolution, since the mean number of genes appears constant in prokaryotes over a long time.46 This size limit means that bacteria must optimise their genetic space, deleting non-functional DNA and selecting a specific arrangement of genes. An understanding of the balance between metabolism and regulation systems may help determine optimum genome size in bacteria, and will be investigated in future work.
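The argument that quadratically growing regulatory domains must eventually outnumber linearly growing metabolic domains can be made explicit with a minimal sketch. The notation is ours and purely illustrative: G is genome size, N_met and N_reg are the numbers of metabolic and regulatory domains, and a and b are hypothetical positive constants, not values fitted in this study.

```latex
N_{\mathrm{met}}(G) = aG, \qquad N_{\mathrm{reg}}(G) = bG^{2}, \qquad a, b > 0

N_{\mathrm{reg}}(G) > N_{\mathrm{met}}(G)
\iff bG^{2} > aG
\iff G > \frac{a}{b}

\frac{N_{\mathrm{reg}}(G)}{N_{\mathrm{met}}(G)} = \frac{b}{a}\,G
```

Under these assumptions the regulatory share exceeds the metabolic share beyond the genome size a/b, and the ratio between the two keeps growing in proportion to genome size.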

Materials and Methods

This study has used those ORFs identified from the sequencing of the complete genomes of 56 different bacteria (see Table 1 of the Supplementary Material). The bacterial species are divided between the two principal prokaryote phylogenetic phyla, the Eubacteria (46 species) and Archaebacteria (ten species). The mean genome size of the 56 different species was calculated and later used to normalise the frequencies with which superfamilies occur in a given genome (see Table 1 of the Supplementary Material).

Genome structural annotation and occurrence profiles

Genome structural annotation data were taken directly from release 2 of the Gene3D database.20 The structural annotations in Gene3D were obtained by scanning sequences from completed genomes against IMPALA profiles49 for non-identical structures from the CATH database.15 The structural annotation coverage was, on average, 37% of the genes (total of 49,224 annotated ORFs in a total of 134,404 ORFs of all 56 bacterial proteomes), and ~22% of the amino acid residues in any given organism (see Table 1 of the Supplementary

885

Evolution of Protein Superfamilies

Material). The structural data are summarised as a selection of gene regions for which a CATH domain structural family has been assigned.40 From the 1386 CATH superfamilies used for genome annotation, 816 were found to be present in the 56 bacterial species. These data were converted into a series of superfamily occurrence profiles, for the purposes of the analysis (Figure 1). Occurrence profiles report the frequencies with which a particular domain superfamily occurs in each of the 56 genomes in the dataset. Occurrence profiles can therefore be used for conceptualising and differentially studying the presence or absence of domains in different genomes (Figure 1). These occurrence profiles differ from the phylogenetic profiles pioneered by Eisenberg and others.50,51 Rather than building up the profile with a binary score, indicating the presence or absence of a gene/domain type, the profiles used in this study maintain the frequency information about the domain types within each genome. This allows the differential duplication and dispersion of domains between the organisms to be studied.
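The distinction between a frequency-based occurrence profile and a binary phylogenetic profile can be sketched as follows. This is a minimal illustration, not the Gene3D pipeline: the function names, the genome labels and the toy assignments are assumptions, and the CATH-style identifiers serve only as labels.

```python
from collections import Counter


def occurrence_profiles(assignments, genomes):
    """Build frequency-based occurrence profiles.

    assignments: iterable of (genome, superfamily) pairs, one per annotated domain.
    genomes: ordered list of genome names; it fixes the order of profile entries.
    Returns {superfamily: [copies in genomes[0], copies in genomes[1], ...]}.
    """
    counts = Counter(assignments)  # (genome, superfamily) -> number of copies
    superfamilies = sorted({sf for _, sf in counts})
    # A Counter returns 0 for missing keys, so absent superfamilies score 0.
    return {sf: [counts[(g, sf)] for g in genomes] for sf in superfamilies}


def binary_profile(profile):
    """Collapse a frequency profile to presence (1) or absence (0)."""
    return [1 if n > 0 else 0 for n in profile]


# Hypothetical toy data: three genomes and two superfamilies.
genomes = ["genome_A", "genome_B", "genome_C"]
assignments = [
    ("genome_A", "3.40.50.300"), ("genome_A", "3.40.50.300"),
    ("genome_B", "3.40.50.300"),
    ("genome_C", "3.40.50.300"), ("genome_C", "3.40.50.300"),
    ("genome_C", "3.40.50.300"),
    ("genome_A", "1.10.10.10"), ("genome_C", "1.10.10.10"),
]
profiles = occurrence_profiles(assignments, genomes)

if __name__ == "__main__":
    for sf, profile in profiles.items():
        print(sf, profile, binary_profile(profile))
```

For the toy data, the first superfamily has the occurrence profile [2, 1, 3] but the binary profile [1, 1, 1]: the binary form records that it is present everywhere and discards the copy-number differences that the occurrence profile keeps.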

Mean and standard deviation of superfamily frequencies

The statistical mean (M) and standard deviation (s) of all CATH superfamily frequencies among the 56 different species of bacteria (see Table 1 of the Supplementary Material) were calculated.52 The M/s ratio is calculated to measure the relative deviation from the mean number of copies for each superfamily. This parameter corresponds to the inverse of the coefficient of variation (s/M), which is often used as a statistical measure of the relative variability of a variable about its mean.52

Universal distribution percentage of superfamily distribution

The universal distribution percentage is a parameter that measures how common a superfamily is throughout the 56 bacterial species studied. To determine this, the occurrence profiles of the superfamilies are reinterpreted to record the absence (superfamily frequency = 0) or presence (superfamily frequency > 0) of a given superfamily in each of the 56 bacterial species. The total number of genomes in which the superfamily is present is then divided by the total number of organisms (56 bacteria) and multiplied by 100. A universal distribution percentage of 100% indicates that the superfamily is present (universal) in all the bacterial species; a value close to 0% indicates that the superfamily has a highly specific distribution in just a few species. This parameter was calculated for all the annotated superfamilies in this work. The value of 70% was chosen as the cut-off to define the universally distributed superfamily group. That is, a given superfamily is considered universally distributed when its universal distribution percentage is equal to or greater than 70%.

Calculation of Pearson correlation coefficient between the superfamilies' occurrence profiles and genome sizes

An index representing the correlation of the occurrence profiles with genome size, for the 56 bacterial species, was calculated for all the annotated superfamilies following the standard Pearson formula.53 On the basis of the size correlation values, two limits were established to define size-dependent superfamilies (size correlation ≥ 0.7) and size-independent superfamilies (size correlation ≤ 0.2).

Functional annotation

The COG database54 provides the functional annotation for this study. The data were downloaded from the latest release (May 2003), which contained protein cluster data for 66 complete genomes. The COG database comprises clusters of orthologous genes that are assigned a common function. Gene identifiers in the COG database were matched to those in the set of genomes used for this study so that COG annotations could be used to functionally characterise the structural domains. For more detailed functional analysis of superfamilies, we also used functional annotations for domains identified in the PFAM domain database (version 9.0, May 2003),23 the SWISS-PROT database,24 the SCOP database25,26 and the literature.

Acknowledgements

We thank Florencio Pazos and Annabel Todd for discussion and helpful comments about the paper, and Ami for her help in the inspiration of this work. The work was supported by grants from the BBSRC (to D.W.A.B.), the MRC (to C.A.O.) and the European Union (to J.A.G.R.).

References

1. Koonin, E. V., Wolf, Y. I. & Karev, G. P. (2002). The structure of the protein universe and genome evolution. Nature, 420, 218–223.
2. Dobrindt, U. & Hacker, J. (2001). Whole genome plasticity in pathogenic bacteria. Curr. Opin. Microbiol. 4, 550–557.
3. Mira, A., Ochman, H. & Moran, N. A. (2001). Deletional bias and the evolution of bacterial genomes. Trends Genet. 17, 589–596.
4. Moran, N. A. (2002). Microbial minimalism: genome reduction in bacterial pathogens. Cell, 108, 583–586.
5. Ochman, H., Lawrence, J. G. & Groisman, E. A. (2000). Lateral gene transfer and the nature of bacterial innovation. Nature, 405, 299–304.
6. Chothia, C., Gough, J., Vogel, C. & Teichmann, S. A. (2003). Evolution of the protein repertoire. Science, 300, 1701–1703.
7. Jordan, I. K., Makarova, K. S., Spouge, J. L., Wolf, Y. I. & Koonin, E. V. (2001). Lineage-specific gene expansions in bacterial and archaeal genomes. Genome Res. 11, 555–565.
8. Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. & Thornton, J. M. (1997). CATH—a hierarchic classification of protein domain structures. Structure, 5, 1093–1108.
9. Heger, A. & Holm, L. (2003). Exhaustive enumeration of protein domain families. J. Mol. Biol. 328, 749–767.
10. Vitkup, D., Melamud, E., Moult, J. & Sander, C. (2001). Completeness in structural genomics. Nature Struct. Biol. 8, 559–566.


11. Liu, J. & Rost, B. (2002). Target space for structural genomics revisited. Bioinformatics, 18, 922–933.
12. Teichmann, S. A., Chothia, C. & Gerstein, M. (1999). Advances in structural genomics. Curr. Opin. Struct. Biol. 9, 390–399.
13. Lee, D., Grant, A., Buchan, D. & Orengo, C. (2003). A structural perspective on genome evolution. Curr. Opin. Struct. Biol. 13, 359–369.
14. Muller, A., MacCallum, R. M. & Sternberg, M. J. (1999). Benchmarking PSI-BLAST in genome annotation. J. Mol. Biol. 293, 1257–1271.
15. Buchan, D. W., Shepherd, A. J., Lee, D., Pearl, F. M., Rison, S. C., Thornton, J. M. & Orengo, C. A. (2002). Gene3D: structural assignment for whole genes and genomes using the CATH domain structure database. Genome Res. 12, 503–514.
16. Jones, D. (1999). GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J. Mol. Biol. 287, 797–815.
17. Gough, J. & Chothia, C. (2002). SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments and genome assignments. Nucl. Acids Res. 30, 268–272.
18. Pearl, F. M. G., Lee, D., Bray, J. E., Sillitoe, I., Todd, A. E., Harrison, A. P. et al. (2000). Assigning genomic sequences to CATH. Nucl. Acids Res. 28, 277–282.
19. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H. et al. (2000). The Protein Data Bank. Nucl. Acids Res. 28, 235–242.
20. Buchan, D., Rison, S., Bray, J., Lee, D., Pearl, F., Thornton, J. & Orengo, C. (2003). Gene3D: structural assignments for the biologist and bioinformaticist alike. Nucl. Acids Res. 31, 469–473.
21. Cook, R. D. & Weisberg, S. (1982). Residuals and Influence in Regression, Chapman and Hall, New York.
22. Weisberg, S. (1985). Applied Linear Regression, 2nd edit., Wiley, New York.
23. Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L., Eddy, S. R. et al. (2002). The Pfam protein families database. Nucl. Acids Res. 30, 276–280.
24. Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M., Estreicher, A., Gasteiger, E. et al. (2003). The SwissProt protein knowledgebase and its supplement TrEMBL in 2003. Nucl. Acids Res. 31, 365–370.
25. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540.
26. Lo Conte, L., Brenner, S. E., Hubbard, T. J. P., Chothia, C. & Murzin, A. (2002). SCOP database in 2002: refinements accommodate structural genomics. Nucl. Acids Res. 30, 264–267.
27. Apic, G., Gough, J. & Teichmann, S. A. (2001). Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J. Mol. Biol. 310, 311–325.
28. Hegyi, H., Lin, J., Greenbaum, D. & Gerstein, M. (2002). Structural genomics analysis: characteristics of atypical, common, and horizontally transferred folds. Proteins: Struct. Funct. Genet. 47, 126–141.
29. Leipe, D. D., Wolf, Y. I., Koonin, E. V. & Aravind, L. (2002). Classification and evolution of P-loop GTPases and related ATPases. J. Mol. Biol. 317, 41–72.
30. Todd, A. E., Orengo, C. A. & Thornton, J. M. (2001). Evolution of function in protein superfamilies, from a structural perspective. J. Mol. Biol. 307, 1113–1143.
31. Babu, M. M. & Teichmann, S. A. (2003). Evolution of transcription factors and the gene regulatory network in Escherichia coli. Nucl. Acids Res. 31, 1234–1244.
32. Goudreau, P. N. & Stock, A. M. (1998). Signal transduction in bacteria: molecular mechanisms of stimulus-response coupling. Curr. Opin. Microbiol. 1, 160–169.
33. Tam, R. & Saier, M. H., Jr (1993). Structural, functional, and evolutionary relationships among extracellular solute-binding receptors of bacteria. Microbiol. Rev. 57, 320–346.
34. Pao, G. M. & Saier, M. H., Jr (1995). Response regulators of bacterial signal transduction systems: selective domain shuffling during evolution. J. Mol. Evol. 40, 136–154.
35. Rappu, P., Shin, B. S., Zalkin, H. & Mantsala, P. (1999). A role for a highly conserved protein of unknown function in regulation of Bacillus subtilis purA by the purine repressor. J. Bacteriol. 181, 3810–3815.
36. Morishita, R., Kawagoshi, A., Sawasaki, T., Madin, K., Ogasawara, T., Oka, T. & Endo, Y. (1999). Ribonuclease activity of rat liver perchloric acid-soluble protein, a potent inhibitor of protein synthesis. J. Biol. Chem. 274, 20688–20692.
37. Dyda, F., Klein, D. C. & Hickman, A. B. (2000). GCN5-related N-acetyltransferases: a structural overview. Annu. Rev. Biophys. Biomol. Struct. 29, 81–103.
38. Neuwald, A. F. & Landsman, D. (1997). GCN5-related histone N-acetyltransferases belong to a diverse superfamily that includes the yeast SPT10 protein. Trends Biochem. Sci. 22, 154–155.
39. Draker, K. A., Northrop, D. B. & Wright, G. D. (2003). Kinetic mechanism of the GCN5-related chromosomal aminoglycoside acetyltransferase AAC(6')-Ii from Enterococcus faecium: evidence of dimer subunit cooperativity. Biochemistry, 42, 6565–6574.
40. Pearl, F., Lee, D., Bray, J., Buchan, D., Shepherd, A. & Orengo, C. (2002). The CATH extended protein-family database: providing structural annotations for genome sequences. Protein Sci. 11, 233–244.
41. Gerstein, M. (1998). How representative are the known structures of the proteins in a complete genome? A comprehensive structural census. Fold. Des. 3, 497–512.
42. Woese, C. (1998). The universal ancestor. Proc. Natl Acad. Sci. USA, 95, 6854–6859.
43. Ramakrishnan, V. (2002). Ribosome structure and the mechanism of translation. Cell, 108, 557–572.
44. Andersson, S. G. & Kurland, C. G. (1998). Reductive evolution of resident genomes. Trends Microbiol. 6, 263–268.
45. Kurland, C. G., Canback, B. & Berg, O. G. (2003). Horizontal gene transfer: a critical view. Proc. Natl Acad. Sci. USA, 100, 9658–9662.
46. Bird, A. P. (1995). Gene number, noise reduction and biological complexity. Trends Genet. 11, 94–100.
47. Cases, I., de Lorenzo, V. & Ouzounis, C. A. (2003). Transcription regulation and environmental adaptation in bacteria. Trends Microbiol. 11, 248–253.
48. Nimwegen, E. (2003). Scaling laws in the functional content of genomes. Trends Genet. 19, 479–484.
49. Schaffer, A. A., Wolf, Y. I., Ponting, C. P., Koonin, E. V., Aravind, L. & Altschul, S. F. (1999). IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics, 15, 1000–1011.
50. Marcotte, E., Pellegrini, M., Ng, H., Rice, D., Yeates, T. & Eisenberg, D. (1999). Detecting protein function and protein–protein interactions from genome sequences. Science, 285, 751–753.
51. Pellegrini, M., Marcotte, E., Thompson, M., Eisenberg, D. & Yeates, T. (1999). Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl Acad. Sci. USA, 96, 4285–4288.
52. Wayne, W. D. (1995). Biostatistics, 6th edit., Wiley, New York.
53. Edwards, A. L. (1976). The correlation coefficient. An Introduction to Linear Regression and Correlation, WH Freeman, San Francisco, CA.
54. Tatusov, R., Natale, D., Garkavtsev, I., Tatusova, T., Shankavaram, U., Rao, B. et al. (2001). The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucl. Acids Res. 29, 22–28.


Edited by G. von Heijne (Received 24 September 2003; received in revised form 11 December 2003; accepted 12 December 2003)

Supplementary Material comprising one Figure and one Table is available on Science Direct