The origins of modern proteomes



Available online at www.sciencedirect.com

Biochimie 89 (2007) 1454–1463 www.elsevier.com/locate/biochi

Research paper

C.G. Kurland a, B. Canbäck b, O.G. Berg c,*

a Department of Microbial Ecology, Lund University, Ecology Building, SE-223 62 Lund, Sweden
b Björn Canbäck Bioinformatics, Vindögatan 66, SE-257 33 Rydebäck, Sweden
c Department of Molecular Evolution, Evolutionary Biology Centre, Uppsala University, SE-75236 Uppsala, Sweden

Received 1 June 2007; accepted 6 September 2007
Available online 15 September 2007

Abstract

Distributions of phylogenetically related protein domains (fold superfamilies, or FSFs) among the three Superkingdoms (the trichotomy) are assessed. Very nearly 900 of the 1200 FSFs of the trichotomy are shared by two or three Superkingdoms. Parsimony analysis of FSF distributions suggests that the FSF complement of the last common ancestor to the trichotomy was more like that of modern eukaryotes than that of archaea and bacteria. Studies of length distributions among members of orthologous families of proteins present in all three Superkingdoms reveal that such lengths are significantly longer among eukaryotes than among bacteria and archaea. The data also reveal that protein lengths of eukaryotes are more broadly distributed than they are within archaeal and bacterial members of the same orthologous families. Accordingly, selective pressure for a minimal size is significantly greater for orthologous protein lengths in archaea and bacteria than in eukaryotes. Alignments of orthologous proteins of archaea, bacteria and eukaryotes are characterized by greater sequence variation at their N-terminal and C-terminal domains than in their central cores. Length variations tend to be localized in the terminal sequences; the conserved sequences of orthologous families are localized in a central core. These data are consistent with the interpretation that the genomes of the last common ancestor (LUCA) encoded a cohort of FSFs not very different from that of modern eukaryotes. Divergence of bacterial and archaeal genomes from that common ancestor may have been accompanied by more intensive reductive evolution of proteomes than that expressed in eukaryotes. Dollo’s Law suggests that the evolution of novel FSFs is a very slow process, while laboratory experiments suggest that novel protein genesis from preexisting FSFs can be relatively rapid. Reassortment of FSFs to create novel proteins may have been mediated by genetic recombination before the advent of more efficient splicing mechanisms.

© 2007 Elsevier Masson SAS. All rights reserved.

Keywords: Fold superfamilies; Orthologous families; Parsimony; Length distributions; Dollo’s Law

* Corresponding author. Tel.: +46 (0)18 471 4215; fax: +46 (0)18 471 6404. E-mail address: [email protected] (O.G. Berg).
0300-9084/$ - see front matter. doi:10.1016/j.biochi.2007.09.004

1. Introduction

We begin with the provocative but informative claim of F.B. Salisbury [1], who observed: “If life depends on each gene being as unique as it appears to be, then it is too unique to come into being by mutations.” By this he meant that there would not be enough mass in our galaxy or time in the history of our universe for each and every coding sequence that is expressed in modern organisms to have evolved independently of every other one. J. Maynard Smith replied to this challenge with the suggestion that protein evolution by random mutations is possible if “functional proteins – form a continuous network which can be traversed by unit mutational steps without passing through non-functional intermediates” [2]. Maynard Smith’s conjectured space of nested functional proteins has been the ruling paradigm for decades, and now it has been mapped in some detail by comparisons of sequences and structures encoded by hundreds of fully sequenced genomes [3–16]. An amazing feature of the natural protein space is that it is so compacted that more than half of all families of phylogenetically related protein domains (fold superfamilies) can be found in each of the three modern Superkingdoms [15,16].


The clustering of extant organisms into three major clades is the signature of unrooted phylogeny based on ribosomal RNA sequences, protein coding sequences as well as fold superfamilies (FSFs) from hundreds of fully sequenced genomes [3–16]. In all such unrooted trees the three major clades corresponding to the modern Superkingdoms, or the trichotomy [16], seem to diverge from or coalesce into a virtual node. That virtual node is not directly identifiable with a common ancestor but it does reflect the fact that there is an underlying homology that relates the genomes of one Superkingdom to those of another.

A principal objective here is to identify properties of the ancestral lineage from which the proteomes of the trichotomy are descended. This ancestry has been the subject of shifting fashions since Woese first promoted the idea of a common ancestor for the trichotomy [3–5,10]. The history of those fashions is too tortuous to present in detail here. Instead, we consider the two extreme views currently competing for attention. One assumes that eukaryote genomes are chimeric mosaics created from archaeal and bacterial genomes [17–21]. According to this complex view the last common ancestor diverged into archaeal and bacterial lineages, which subsequently combined to form the third, eukaryotic lineage. The simpler view suggests that a divergence from a primitive eukaryote-like ancestor produced minimalist proteomes of both archaea and bacteria [22–28]. In the simpler scenario the last common ancestor is identified with an early stage of the eukaryote lineage.

One practical distinction between these two extremes rests on whether the proteins of LUCA (Last Universal Common Ancestor) most resembled those of modern archaea and bacteria or those of modern eukaryotes. This study re-examines that question by focusing on the distributions of fold superfamilies (FSFs) within the trichotomy [16].
Common folds correspond to protein domains with the same structural elements in the same order, and each FSF is a family of proteins or protein domains that share a common fold and are likely to have a common evolutionary origin. FSFs are found to be present in roughly comparable numbers in each Superkingdom: 1110, 883 and 713 FSFs respectively in eukaryotes, bacteria and archaea from a cohort of 57 fully sequenced genomes [16]. The most informative observation here is that a substantial majority of all these FSFs in each Superkingdom are shared by two or three of the Superkingdoms. This observation deflates the notion that it is “easier” to evolve archaea and bacteria and then to combine their genomes to create a chimeric eukaryote genome than it would be to evolve the modern trichotomy from a eukaryote-like common ancestor. In fact, we show that the FSF distributions among the three modern Superkingdoms suggest that the FSF complement of LUCA was more like that of modern eukaryotes than that of archaea and bacteria.

If the FSF cohort of the common ancestor most resembled that of eukaryotes, modern archaea and bacteria are likely to have evolved under more minimalist constraints than did the eukaryotes. Since minimalist evolutionary constraints might entail both fewer coding sequences and shorter lengths for the retained sequences, it was encouraging to learn that randomly sampled protein lengths tend to be significantly longer in eukaryotes than in archaea and bacteria [29–31]. We have extended these studies of length distributions by analyzing the members of orthologous families of proteins. We find that the lengths of proteins within the same orthologous families are significantly longer among eukaryotes than among bacteria and archaea. The detailed comparisons also show that eukaryote proteins are more broadly distributed in length than are the archaeal and bacterial members of the same orthologous families. These data suggest that selective pressure for a minimal size is significantly greater for orthologous protein lengths in archaea and bacteria than in eukaryotes.

Finally, we explore the idea that eukaryote protein lengths might be significantly longer because they include dedicated sequences not present in archaea and bacteria, but required in eukaryotes to regulate or distribute proteins within the cell [31]. We find that alignments of orthologous proteins of archaea, bacteria and eukaryotes are characterized by greater sequence variation at their N-terminal and C-terminal domains than in their central cores. Thus, the length variations of all protein families tend to be localized in the terminal sequences while the conserved sequences of orthologous families are localized in a central core.

2. Methods

The Cogent and OFAM databases were downloaded from the European Bioinformatics Institute (http://cgg.ebi.ac.uk/cgg/Services.html) [32]. Cogent is a database of complete genomes and their protein sequences. A recent version (September 2005) contains 915,554 proteins from 243 genomes including 197 from bacteria, 22 from archaea and 24 from eukarya (Supplementary Tables S1–S3). OFAM (Ortholog families database) is a database of protein ortholog families based on the proteins in the Cogent database.
Orthologous sequences are identified by using best bidirectional BLAST hits and then clustering these with a Markov cluster (MCL) algorithm. The database (release of July 25, 2005) holds 809,091 proteins distributed in 308,594 families. Of these families, 220,260 contain only one member. When aligning sequences from the same family, it becomes apparent that the thresholds used in the MCL algorithm are quite strict, resulting in high quality alignments, but at the same time probably disqualifying diverged orthologs.

Comparisons of sequence lengths were made between each possible combination of domains: bacteria-eukarya, bacteria-archaea, eukarya-archaea and all domains. Only protein families with at least 5 members in each domain were included in the analyses. The cohort containing families of orthologs that are present in all three Superkingdoms is reduced to 365 families.

In order to study the sequence variation within the orthologous families analyzed in Figs. 2 and 3, all members of each orthologous protein family were aligned with CLUSTALW [33]. The resulting alignments were used to compare the protein sequences in orthologous families present in all three cellular classes. Here, the alignments were divided arbitrarily into 10 equal length segments and the relative numbers of gaps were calculated for each segment as shown in Figs. 3 and 4.

Supplementary material is available at http://www.acgt.se/Protein_evolution/.
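The segment-wise gap count described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `gap_fraction_by_segment`, the toy alignment, and the choice to normalize gaps by the number of alignment cells in each segment are all assumptions made here for clarity. The input is taken to be an already aligned family (for example, rows read from a CLUSTALW output), with `-` marking gaps.

```python
def gap_fraction_by_segment(alignment, n_segments=10, gap_char="-"):
    """Fraction of gap characters in each of n_segments equal-width column blocks.

    `alignment` is a list of equal-length strings (one aligned sequence per row).
    Normalizing by the number of cells per segment is an assumption of this sketch.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    if width < n_segments:
        raise ValueError("alignment is shorter than the number of segments")

    fractions = []
    for i in range(n_segments):
        # Integer boundaries keep the segments as equal as the width allows.
        start = i * width // n_segments
        end = (i + 1) * width // n_segments
        cells = (end - start) * len(alignment)
        gaps = sum(row[start:end].count(gap_char) for row in alignment)
        fractions.append(gaps / cells)
    return fractions


# Hypothetical three-sequence alignment with ragged termini and a gap-free core.
toy_alignment = [
    "--MKTAYIAKQRQISFVK--",
    "MSMKTAYIAKQRQISFVKHE",
    "---KTAYIAKQRQISFVKH-",
]

for segment, fraction in enumerate(gap_fraction_by_segment(toy_alignment), start=1):
    print(f"segment {segment:2d}: gap fraction {fraction:.2f}")
```

In the toy alignment the gaps fall only in the first two segments and the last one, which is the kind of terminal enrichment the segment profile is meant to expose.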

3. Results and discussion

3.1. FSF distributions

The differences between the complexities of proteomes in the three Superkingdoms are huge [32,34] while the differences between their complements of FSFs are quite modest [15,16]. The leveling of FSF distributions within the trichotomy facilitates simple parsimony calculations to challenge both competing views of the common ancestor. These tests are based on counts that score the presence or absence of each of the trichotomy’s circa 1200 FSFs within the three Superkingdoms [16]. Here, a cohort of FSFs within each Superkingdom is used to define a virtual genome that is identified with modern Archaea, Bacteria or Eukarya, respectively. Then we count the differences between the presence and absence of FSFs within each virtual genome.

There is however a potential ambiguity in these counts. The total numbers of FSFs within the modern trichotomy are a bit uncertain because the available sampling of fully sequenced genomes is heavily biased by the genomes of bacteria. On the other hand, there is some comfort in the fact that the total number of FSFs is found to decrease only marginally (from 1244 to 1205) when the number of fully sequenced genomes is decreased from 174 to 57 [16]. The larger cohort is highly biased in favor of bacterial FSFs because the difference between 174 and 57 fully sequenced genomes is dominated by the excess of sequenced bacterial genomes available in the public domain; 121 out of 174 genomes are bacterial. As there were only nineteen archaeal genomes available in the public domain (Supplementary Table S4), in order to balance the FSF distributions, equal numbers (nineteen) of bacterial and eukaryote genomes were chosen and analyzed [16]. These nineteen genome cohorts were selected [16] to represent a wide spectrum of species in each domain (Supplementary Tables S5 and S6). The FSF counts [16] from that balanced cohort are shown in Fig. 1.

It is also comforting that there is good agreement between the distributions of FSFs recently summarized independently for 185 fully sequenced genomes [15,16]. Again, the most recent genome cohort is dominated by bacteria. This cohort yields similar numbers (1259) as well as similar distributions of FSFs among the 185 fully sequenced genomes [15]. This suggests that almost all of the bacterial FSFs in the modern trichotomy are present in that cohort of 174 genomes and that we may expect a modest increase of archaeal as well as eukaryote FSFs when a comparable number of these genomes have been sequenced. Most supportive for the present study is the finding that calculations (data not shown) based on the FSFs among 57 genomes are robust in the sense that similar conclusions can be drawn about the distributions of FSFs from the larger but biased cohorts obtained with 174 and 185 genome sequences [15,16].

[Figure 1: Venn diagram of three overlapping sets, Eukaryote (19), Archaea (19) and Bacteria (19). Region counts: eukaryotes only, 244; archaea and eukaryotes only, 62; bacteria and eukaryotes only, 199; all three Superkingdoms, 605; archaea only, 16; archaea and bacteria only, 30; bacteria only, 49.]

Fig. 1. Venn diagram summarizing the distributions of 1205 FSFs among archaea, bacteria and eukaryotes according to [16]. Each Superkingdom is represented by nineteen fully sequenced genomes as described [16].

The data set we analyzed is presented in Fig. 1, which is a Venn diagram summarizing the phylogenetic distributions of the 1205 FSFs from fifty-seven fully sequenced genomes representing nineteen genomes from each of the Superkingdoms [16]. There are a number of striking features in this diagram. One is that the largest single group of FSFs corresponds to those shared by all three Superkingdoms: 605 FSFs, or a bit more than half of the entire cohort. A total of 896, or very nearly three fourths of the FSFs, are shared between two or three Superkingdoms. The smallest cohort of FSFs belongs to the archaea and they share the smallest number of FSFs with other Superkingdoms. Nevertheless, more than nine tenths of the archaeal and bacterial FSFs are shared with eukaryotes. These features suggest that most if not all of the extant FSFs may have been present in the last common ancestor of all three Superkingdoms.

A second remarkable feature of the Venn diagram is that it shows that all three Superkingdoms present FSF profiles with roughly the same degree of complexity: 1110, 883 and 713 FSFs in eukaryotes, bacteria and archaea, respectively. There is then a huge discrepancy between the total sum of 1205 FSFs for the trichotomy [16] and the more than 300,000 orthologous protein families in the Ofam database [32]. That discrepancy, together with the one between the ca. 4 million bp coding sequence of a bacterium such as Escherichia coli and the circa 70 million bp coding sequence of the highly reduced genome of Fugu rubripes [34], suggests that the proteomes of the trichotomy in general, and of eukaryotes in particular, contain highly permuted concatenates of a relatively small and conservative cohort of FSFs. In other words, the common ancestor of the trichotomy may not have been significantly poorer in terms of FSFs than the most highly evolved modern eukaryote or bacterium.
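The shares quoted above follow from the seven regions of the Venn diagram in Fig. 1. The sketch below is only a bookkeeping check under one stated assumption: the three pairwise counts (62, 199 and 30) are assigned to the archaea-eukaryote, bacteria-eukaryote and archaea-bacteria overlaps, which is the assignment consistent with the reported totals of 1110, 883 and 713 FSFs. The names `REGIONS` and `count` are illustrative.

```python
# Region counts read off Fig. 1. Keys name the Superkingdoms sharing each region:
# A = archaea, B = bacteria, E = eukaryotes. The pairwise assignment is an
# assumption checked against the per-Superkingdom totals below.
REGIONS = {
    frozenset("E"): 244,
    frozenset("A"): 16,
    frozenset("B"): 49,
    frozenset("AE"): 62,
    frozenset("BE"): 199,
    frozenset("AB"): 30,
    frozenset("ABE"): 605,
}


def count(predicate):
    """Sum the FSF counts of all regions whose Superkingdom set satisfies predicate."""
    return sum(n for kingdoms, n in REGIONS.items() if predicate(kingdoms))


total = count(lambda k: True)
per_kingdom = {s: count(lambda k, s=s: s in k) for s in "ABE"}
shared_by_all = REGIONS[frozenset("ABE")]
shared_by_two_or_three = count(lambda k: len(k) >= 2)
archaeal_shared_with_euk = count(lambda k: {"A", "E"} <= k) / per_kingdom["A"]
bacterial_shared_with_euk = count(lambda k: {"B", "E"} <= k) / per_kingdom["B"]

print("total FSFs:", total)
print("per Superkingdom:", per_kingdom)
print(f"shared by all three: {shared_by_all} ({shared_by_all / total:.1%})")
print(f"shared by two or three: {shared_by_two_or_three} "
      f"({shared_by_two_or_three / total:.1%})")
print(f"archaeal FSFs also in eukaryotes: {archaeal_shared_with_euk:.1%}")
print(f"bacterial FSFs also in eukaryotes: {bacterial_shared_with_euk:.1%}")
```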
It is generally held to be much more difficult to create a novel protein domain than it is to lose that sequence because gene loss is facile and virtually irreversible once selection for a particular sequence has been suspended [35–37]. Thus, random mutations can in principle generate any particular FSF but they can do so in relatively few ways, while that same FSF can be obliterated in numbers of ways that are orders of magnitude larger. This means that from the point of view of evolutionary rates or probabilities, random mutations that enable FSF births are very much less frequent than those that lead to FSF deaths. Accordingly, we assume that the number of novel sequences required to evolve from a putative ancestral genome to a descendant genome provides a measure of the ease or difficulty of that particular evolutionary path [35–37]. In comparing two paths, the one requiring the smaller number of novel FSFs would be the one with the greater a priori expectation.

One school of thought suggests that genomes of ancestral archaea and bacteria created the eukaryote genome as a chimera of the two [17–21]. It is evident from the Venn diagram in Fig. 1 that 244 novel sequences would be required to create the eukaryotes from the archaea and bacteria described in that diagram. In contrast, the descent of archaea and bacteria from an ancestral eukaryote requires only 95 creation events. Clearly, of these two scenarios, the one that minimizes FSF births is that beginning with an ancestral eukaryote and leading to modern archaea and bacteria.

For completeness, we can calculate for each of the virtual genomes how many novel FSFs would have to be created de novo to generate the other two virtual genomes. With the region counts of Fig. 1, the counts of FSF gains are 492 and 322 FSFs beginning with archaea and bacteria, respectively, to generate the trichotomy. Thus, the path starting from both archaea and bacteria together is not much more parsimonious with respect to FSF gains than that starting with either archaea or bacteria alone to evolve the eukaryote: 244 versus 492 or 322, respectively.
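The gain counts in this parsimony argument are complements of the Venn regions in Fig. 1: the FSFs that must be created are those absent from every starting Superkingdom. A minimal sketch of that bookkeeping is given below. The names `REGIONS`, `present_in` and `gains_needed` are illustrative, and the assignment of the pairwise counts to specific overlaps is an assumption chosen to reproduce the reported totals of 1110, 883 and 713 FSFs.

```python
# Region counts read off Fig. 1 (A = archaea, B = bacteria, E = eukaryotes).
# The pairwise assignment is an assumption consistent with the reported totals.
REGIONS = {
    frozenset("E"): 244,
    frozenset("A"): 16,
    frozenset("B"): 49,
    frozenset("AE"): 62,
    frozenset("BE"): 199,
    frozenset("AB"): 30,
    frozenset("ABE"): 605,
}


def present_in(starting):
    """FSFs found in at least one of the starting Superkingdoms."""
    return sum(n for kingdoms, n in REGIONS.items() if kingdoms & starting)


def gains_needed(starting):
    """FSFs that would have to arise de novo to reach the full trichotomy."""
    return sum(REGIONS.values()) - present_in(frozenset(starting))


scenarios = [
    ("archaea + bacteria (chimera scenario)", "AB"),
    ("eukaryote-like ancestor", "E"),
    ("archaea alone", "A"),
    ("bacteria alone", "B"),
]
for label, start in scenarios:
    print(f"{label}: {gains_needed(start)} FSF gains")
```

The first two scenarios give the 244 and 95 creation events compared in the text; the last two are the single-Superkingdom starting points considered for completeness.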
The focus can be shifted to FSF losses by assuming that FSF gains have been minimized to zero in the trajectory from the common ancestor to the modern trichotomy. Here, it is assumed that the ancestor encoded all 1205 FSFs found in the modern trichotomy in Fig. 1. The count shows that to evolve the trichotomy 95, 492, and 322 FSFs must be lost in the transit from such an ancestor to modern eukaryotes, archaea and bacteria, respectively. Clearly, from the point of view of minimal gain and loss within the cohort in Fig. 1, the most parsimonious path begins with an ancestor that most resembles eukaryotes in its presentation of FSFs.

3.2. Protein lengths

If archaea and bacteria are the minimalist genomic descendants of an ancient eukaryote lineage, the expectation would be that reductive evolutionary pressures might have influenced the evolution of the lengths of archaeal and bacterial proteins so that they are closer in size to minimum functional sequences than are their eukaryote homologues. Indeed, it has been noted previously [29–31] that the average sizes of proteins from the Superkingdoms differ systematically, with the mean and median lengths of proteins from archaea and bacteria circa two thirds the lengths of proteins that evolved among eukaryotes. Although the initial interpretation was that eukaryotes have selected longer protein lengths [31], the data are in fact superficially consistent with the interpretation that archaea and bacteria evolve with stronger reductive constraints on their genomes than do eukaryotes. In order to distinguish these interpretations we have reinvestigated the length distributions from new perspectives.

First, genomes, particularly those of bacteria, contain significant numbers of very short open reading frames (ORFs) that often are pseudogenes arising as products of gene meltdown [38,39]. The problem is to eliminate such degraded sequences so that the estimates of average coding lengths are not distorted by pseudogenes. We approach this problem by including in our database only those proteins that are members of orthologous families with five or more members in the Ofam database [32]. Orthologous families representing at least five taxa are not likely to contain only fragments from randomly degraded genes.

A recent version of Ofam contains more than 809,000 proteins arranged in circa 309,000 orthologous families, many of which are single member families (see Section 2). By considering only families with five or more members the database is reduced to circa 300,000 proteins distributed among 18,000 families. This culling has the effect of increasing the average family size to 17 from the original 2.6. The mean values observed for the mean as well as the median lengths of orthologous families tend to be very similar within each domain. For archaea these are 311 and 308, for bacteria they are 309 and 308 and for eukaryotes they are 508 and 506 amino acids, respectively. Here, the cohort we studied did not reveal as large a difference between archaeal and bacterial sequence lengths as recorded previously [29–31]. This and other differences in the data may result in part from our culling of pseudogenes from the cohort.
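The culling step and the per-family length summaries can be sketched as follows. This is an illustration on invented data, not the authors' pipeline: the family names, the lengths, and the helper names `cull_families` and `summarize` are hypothetical, and only the threshold of five members is taken from the text.

```python
from statistics import mean, median


def cull_families(families, min_members=5):
    """Keep only orthologous families with at least min_members proteins."""
    return {name: lengths for name, lengths in families.items()
            if len(lengths) >= min_members}


def summarize(families):
    """Average family size plus the mean of per-family mean and median lengths."""
    sizes = [len(lengths) for lengths in families.values()]
    return {
        "families": len(families),
        "proteins": sum(sizes),
        "average_family_size": mean(sizes),
        "mean_of_family_means": mean(mean(v) for v in families.values()),
        "mean_of_family_medians": mean(median(v) for v in families.values()),
    }


# Hypothetical protein lengths (amino acids) keyed by family name.
toy_families = {
    "fam1": [310, 305, 298, 320, 307, 312],
    "fam2": [95],                     # single-member family, removed by culling
    "fam3": [450, 470, 445, 460, 455],
    "fam4": [120, 40],                # contains a short fragment, removed by culling
}

print("before culling:", summarize(toy_families))
print("after culling: ", summarize(cull_families(toy_families)))
```

The same arithmetic applied to the rounded figures in the text gives 809,091 / 308,594 ≈ 2.6 proteins per family before culling and 300,000 / 18,000 ≈ 17 afterwards.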
The similarities of mean and median lengths suggest that the orthologous family sizes on average are more or less symmetrically distributed in each of the three domains. Our figures are consistent with the previous finding of significant differences in the average lengths between proteins from archaea and bacteria on the one hand and eukaryotes on the other [29–31].

The second novelty of this study is that we focused our attention on detailed comparisons between orthologous proteins that are distributed among all three Superkingdoms. In this way we can be more confident that protein characteristics from one Superkingdom were comparable with those of functional homologues from another Superkingdom. Accordingly, we extracted from the Ofam database a well defined cohort constituted from 365 orthologous families containing very nearly 50,000 proteins distributed in families shared by all three Superkingdoms. Fig. 2 displays the mean lengths of these orthologous families. The lengths are distributed between fifty amino acids and more than one thousand amino acids with the eukaryote sequences skewed toward longer lengths than those of bacteria or archaea. The means and medians for these lengths are 363 and 358 amino acids for archaeal proteins, 384 and 381 amino acids for bacterial proteins, and 474 and 467 amino acids for eukaryote proteins. The differences between these averages and those for all orthologous families may be significant, but at this point we have no explanation for these differences.

[Figure 2: histogram titled "Ortholog families common to all domains". x-axis: mean length, in bins of 50 amino acids from 0-50 to 1201 and above; y-axis: frequency, 0 to 70. Series: Eukaryota, Bacteria, Archaea.]

Fig. 2. Length variation of orthologous families containing proteins from each of the three classes of cells in the trichotomy.

There may be some bias towards shorter genomes among those that have been sequenced in each of the three Superkingdoms. However, we find at most only a very weak positive correlation between genome size and average protein length among the selected 365 families (Supplementary Figs. S1–S3). Also, there is virtually no overlap between the average protein lengths of eukaryotes and those of archaea and bacteria. This suggests that the differences in observed protein lengths, discussed above, are not due to sampling biases of the genomes.

3.3. Cellular contingencies

One striking feature of the length distributions described in Fig. 2 is that they are very broad. Part of that size variation is certainly due to the different functional constraints on the structures of individual orthologous protein families. Nevertheless, the size variation within any given orthologous family is also very broad. Since we study orthologous families of proteins, we can identify differences in the length-selective pressures that distinguish the functionally equivalent proteins in each cell class by analyzing the variations in their length distributions. For example, Fig. 3 depicts the sizes of the variants from one orthologous family with members in each of the three cellular classes (archaea, bacteria or eukaryote). From these distributions it is apparent that eukaryote orthologs in this family are more broadly distributed than are their structural and functional homologues in the archaea as well as in the bacteria.

This is a general attribute of the distributions. Fig. 4 displays the length variation within the orthologous families that are described in Fig. 2. Here, the mean standard deviation of the protein lengths within any cell class from each orthologous family is calculated. Then these mean standard deviations are normalized to the mean length of the corresponding orthologs to yield a relative standard deviation that is expressed as a fraction of the mean length of each protein class. It is evident from Fig. 4 that the relative standard deviations of the eukaryote proteins are more broadly distributed than those of archaea as well as of bacteria and tend to be skewed towards larger values. In fact, the aggregate relative standard deviations are 0.099, 0.091, and 0.21 for protein lengths in the archaeal, bacterial and eukaryote classes, respectively. We conclude that on average the larger proteins of eukaryotes are prone to a more substantial relative variation in their lengths within any protein family than are their orthologous relatives among the archaea and bacteria. In effect, the length constraints on archaeal and bacterial proteins are twice as stringent as those exerted on eukaryote proteins. This is precisely as would be expected if the archaeal and bacterial coding sequences evolve under much stronger minimalist constraints than do the eukaryote proteins.

[Figure 3: histogram titled "Gene lengths in one ortholog family". x-axis: length, 400 to 600 in steps of 20; y-axis: frequency, 0 to 0.5. Series: Eukaryota, Bacteria, Archaea.]

Fig. 3. Normalized gene-length distributions in one orthologous family present in all three domains (23, 181, and 20 genes from eukaryotes, bacteria, and archaea, respectively). For clarity, two extreme outliers (lengths = 977, 1132) for eukaryotes are not included. Mean lengths are 559, 433, and 423 and standard deviation/mean are 0.28, 0.05, and 0.025 for eukaryotes, bacteria and archaea, respectively.

[Figure 4: histogram titled "Ortholog families common to all domains". x-axis: standard deviation / mean length, in bins of 0.05 from 0-0.05 to 0.7 and above; y-axis: frequency, 0 to 160. Series: Eukaryota, Bacteria, Archaea.]

Fig. 4. Standard deviation normalized to mean length of proteins described in Fig. 2.

What characteristics of the three cell classes are related to their typical protein length variations? Nested within the protein length distributions for the modern trichotomy are other detailed variations that are contingent upon specific parameters such as ambient high or low growth temperatures, or specific metabolic functions that can be correlated with protein length variations [29–31]. Such correlations suggest quite generally that length variation of orthologs depends on cell growth conditions. Which features of the cell populations within each class contribute to the evolution of the characteristic sizes of their proteins?

Our working hypothesis suggests that the divergence of archaea and bacteria from their common ancestor was associated with selection for augmented cellular growth rates that enabled the emergent microbes to out-grow phagotrophic eukaryotes and thereby escape decimation by the foraging of these cellular predators [26]. An increase in the growth efficiency of cells necessarily entails the evolution of kinetically efficient proteomes that carry out their functions with a minimum mass investment in each functional compartment of the cell [40–42]. In general, the evolution of physiologically efficient cells requires selection against the fixation of excessively large proteins and against excessive numbers of proteins [40–42]. In addition, there seems to be a universal tendency for the frequencies as well as the lengths of small deletions to exceed those of small insertions [43–51].
Consequently, there are both selective pressures as well as mutational biases to support the tendency towards reductive evolution. The point then is that all cells evolve under reductive pressure but some cell classes do so much more intensively than do others. What then are the contingencies that bias the expression of reductive evolution in the three cell classes? The lengths of all proteins in all cells are selected in a tradeoff between the minimum metabolic cost to translate them and

the maximum rate at which they function under particular circumstances. This trade off is implicit in the conditions that maximize the physiological functions of cells and their growth rates. The growth rate, k0, can be expressed as [40e42]: k0 ¼ kel R=r0 ;

ð1Þ

where kel is the translation rate, R is the number of translating ribosomes and r0 is the total mass (in amino acid equivalents) of protein translated per cell cycle. This relation is based simply on the fact that all proteins in a cell growing in a defined steady state are duplicated in each growth cycle. However, the growth rates are maximized for cells that have an optimal investment in ribosomes and associated gene expression equipment [40,42]. Conversely, cells that are required to invest most of their proteome in functions other than gene expression are perforce relegated to slow growth rates. So, one way that selection on genomes and their coding sequences can increase growth efficiency is to reduce the numbers of expressed proteins that do not contribute to gene expression. This is expressed in the tendency to ‘‘streamline’’ the proteome, which for archaea and bacteria is thought to result in the loss of typical eukaryote cellular protein structures such as internal membranous compartments, and a cytoskeleton including mitotic and meiotic structures [26,27]. We expect the selection pressure that streamlines the complexity of proteomes to be associated with a complementary tendency to reduce the sizes of proteins. Thus, we consider a protein whose expression level corresponds to n molecules per cell. If the size of this protein is increased by one amino acid, the ribosomes will have to translate n extra amino acids per cell cycle and the total mass of translated proteins increases from r0 to r0 þ n. To a good first order approximation, this would correspond to a reduction of the relative growth rate by the same amount, and the extra amino acid would be counter selected by a selection coefficient si ¼  n/r0. Similarly, the deletion of an unnecessary amino acid would be positively selected by sd ¼ n/r0. This kind of selection would be most efficient in organisms that compete through growth rate. It also will be most effective in a population with a large

C.G. Kurland et al. / Biochimie 89 (2007) 1454e1463

effective size (Ne) as well as for proteins at high expression level (n). Its effect will be to increase the deletion bias by the factor exp (2Nen/r0). Eukaryotes, on the other hand, tend to evolve in much smaller populations than do archaea and bacteria [52] and their total protein investment (r0) is larger. These two effects, a large proteome and a small effective population size, would tend to reduce the selective pressure for reductive evolution in eukaryotes. Also, multicellular eukaryotes may be less sensitive to this pressure due to their complex life cycles. We note that the intensity of this selection also will be reduced for individual proteins with relatively small expression levels. Indeed, studies of the growth rate variations of ribosome mutants as well as of tRNA isoacceptor distributions have verified the expectation that selective pressure for optimal function is reduced when expression levels are reduced in bacteria [40e42,53e55]. To illustrate the strong dependence between the intensity of reductive selection and the size of the population of evolving cells we consider an E. coli cell with r0 w 109 amino acids incorporated into protein [40]. E. coli is thought to have an effective population size of 108 estimated on the basis of neutral diversity [47,56]. However, this population size is probably too large to describe evolution in patchy populations i.e. populations that are associations of many smaller sub-populations [47]. Thus, for an effective population size of 108, and for proteins that are present at intermediate expression levels (e.g. n ¼ 102 copies) there will be an enormous selective factor ¼ exp(2Nen/r0), which is of the order of 108 or larger. Such a selective factor does not allow any random variations in gene length, but we know that free living bacteria produce orthologous proteins with significant length variations (Fig. 4). 
In contrast, the effective population size for this kind of selection in patchy populations would be much smaller, in the range of roughly 10^5 [47,56]. In this case, the selective factor for n = 10^2 would be reduced to 1.02, which is consistent with the length distributions that are observed (see Fig. 4). Here, selection would be effective only for proteins at very high expression levels, n = 10^4 or larger. These rough calculations illustrate how the population structure as well as the composition of the proteome of an organism may influence the strength of selection for minimum protein length. In general, greater complexity and size of proteomes, as well as smaller effective population size, mitigate selection pressure for minimal protein lengths. In other words, the variations in the lengths as well as the length distributions described in Figs. 2–4 for orthologous proteins of the three cell classes are consistent with those expected from the cellular growth rate optimizations [40–42,53–55] of cells in the populations characteristic of eukaryotes, bacteria and archaea, respectively. It should perhaps be emphasized that the pressures discussed here refer only to the amount and size of expressed protein. Total genome size, on the other hand, is subject to other pressures and constraints, e.g. from parasitic DNA. The relatively small genomes of endosymbionts and parasites are the result not so much of a stronger reductive pressure as of their lifestyle in stable environments [57], with reduced needs for a number of enzymatic pathways and reduced opportunities to pick up foreign DNA.
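The rough calculations above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `selective_factor`, the constant `R0` and the list `cases` are hypothetical names introduced here, while the parameter values (r_0 of about 10^9, N_e of 10^8 or 10^5, n of 10^2 or 10^4) are the ones quoted in the text.

```python
import math


def selective_factor(n_e: float, n: float, r0: float) -> float:
    """Factor exp(2 * N_e * n / r_0) by which the deletion bias is increased.

    n_e : effective population size
    n   : expression level of the protein (molecules per cell)
    r0  : total protein mass per cell cycle, in amino acid equivalents
    """
    return math.exp(2.0 * n_e * n / r0)


# Value quoted in the text for E. coli (hypothetical constant name).
R0 = 1e9

# (label, N_e, n) combinations discussed in the text.
cases = [
    ("N_e = 1e8, n = 1e2", 1e8, 1e2),
    ("patchy N_e = 1e5, n = 1e2", 1e5, 1e2),
    ("patchy N_e = 1e5, n = 1e4", 1e5, 1e4),
]

for label, n_e, n in cases:
    s_d = n / R0  # selection coefficient for deleting one amino acid
    print(f"{label}: s_d = {s_d:.1e}, factor = {selective_factor(n_e, n, R0):.3g}")
```

With these inputs the exponent 2 N_e n/r_0 takes the values 20, 0.02 and 2, so the factor drops from several times 10^8 to about 1.02 for the patchy population at n = 10^2, and rises to about 7.4 at n = 10^4.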

3.4. Sequence variation

We have attempted to determine whether or not there is a distinctive pattern for the length variations of the proteins. For example, are proteins more conserved at their ends? Relevant here is the suggestion that eukaryote proteins are larger than bacterial and archaeal ones because they require additional domains to regulate and distribute proteins within larger, more complex cells [31]. Thus, the functional demands on eukaryote cells might be expressed as more elaborate sequence evolution for addressing proteins and docking them in specialized compartments [31]. Indeed, there is ample evidence for such dedicated N-terminal and C-terminal domains in all cells, but these, especially the C-terminal ones, are more extensively documented in eukaryotes [58,59]. If the protein termini have evolved in eukaryotes to enhance their docking and addressing functions, these might reflect the idiosyncrasies of the cell lineages in which they evolve. Accordingly, the sites for such addressing and docking functions might evolve in a more cell lineage-specific way than would the cores of protein structures. In particular, the co-evolution of cell-specific interactive sequences within orthologous families of proteins might be expressed as enhanced rates of sequence divergence for the terminal sequences and lower rates of sequence divergence for core sequences. If such divergent sequences could be identified, it would be possible to determine whether they are more prominent in eukaryote proteins than in prokaryote proteins. Accordingly, we studied the orthologous sequence families described in Fig. 2 to determine whether there are identifiable sequence strings that are more variable and others that are more conserved within these proteins (Fig. 5). We do this by arbitrarily dividing each alignment into ten equal-size sequence strings and calculating the mean frequencies of gaps in the alignments for each deci-sequence. The resulting data are straightforward.
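The deci-sequence procedure described above can be sketched as follows. This is a minimal illustration and not the authors' actual pipeline: the function name `gap_percent_by_decile`, the gap character `-` and the toy alignment are assumptions made for the example. It splits the alignment columns into ten equal-width blocks and reports the percentage of gap characters in each block.

```python
from typing import List


def gap_percent_by_decile(alignment: List[str], n_bins: int = 10,
                          gap: str = "-") -> List[float]:
    """Percentage of gap characters in each of n_bins equal-width column blocks."""
    if not alignment:
        raise ValueError("alignment is empty")
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    if width < n_bins:
        raise ValueError("alignment is shorter than the number of bins")

    percents = []
    for b in range(n_bins):
        # Integer arithmetic keeps the blocks contiguous and non-overlapping.
        start = b * width // n_bins
        end = (b + 1) * width // n_bins
        cells = (end - start) * len(alignment)
        gaps = sum(row[start:end].count(gap) for row in alignment)
        percents.append(100.0 * gaps / cells)
    return percents


# Hypothetical toy alignment (4 sequences, 20 columns) with gappy termini.
toy_alignment = [
    "--MKTAYIAKQRQISFVK--",
    "MSMKTAYIAKQRQISFVKGE",
    "---KTAYLAKQRQLSFVK--",
    "-AMKTAYIAKQRELSFV---",
]

for segment, pct in enumerate(gap_percent_by_decile(toy_alignment), start=1):
    print(f"segment {segment:2d}: {pct:5.1f}% gaps")
```

To obtain a curve for a whole set of ortholog families, one would average the per-segment values over all alignments in the set; that averaging step is left out here for brevity.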
[Fig. 5 plot: average percentage of gaps (vertical axis, 0 to 100) against gene segments 1 to 10 (horizontal axis) for Archaea, Bacteria and Eukaryota; ortholog families common to all domains.]

Fig. 5. Evolutionary divergence of proteins described in Fig. 2 plotted as a function of relative position within sequence alignments.

The N-terminal and C-terminal
deci-regions of proteins from all three phylogenetic domains have the highest gap frequencies; these vary between 80% and 90%, and between 60% and 80%, respectively. There is a decreasing gradient of gaps that proceeds from both ends of the sequences towards a relatively conservative central core that makes up roughly six-tenths of the sequences and has gap frequencies near 40%. If the variable terminal strings may be tentatively identified with the postulated interactive sequences of proteins [31,58,59], it would seem that archaea, bacteria and eukaryotes have evolved similar C-terminal and N-terminal interactive sites. If so, their evolution would not have driven the length increases preferentially in the eukaryote lineage. The main result of these alignment studies is that most of the sequence variation, including variation in sequence length, is localized at the ends of the proteins and that the orthologous domains in all families are localized within a major conserved central core.

4. Conclusions

We have explored the proteomes of the modern trichotomy from two vantages: one from the perspective of families of protein building blocks, the FSFs, and the other at the level of complete proteins. There are huge differences in the complexities of these two levels of organization, which are numerically defined by circa 1200 FSFs in the modern trichotomy and a corresponding 300,000 orthologous families in Ofam [15,16,32]. The one informs us about a surprisingly simple connectivity between the proteomes of modern organisms, while the other points to the adaptive diversity of modern proteomes. Together these two perspectives are mutually informative about the evolution of the trichotomy because they allow us to relate its common origins to the evolutionary specializations of its proteins. A set of 1205 FSFs has been identified in a cohort of fully sequenced genomes from nineteen representatives of each of the three Superkingdoms [16].
Of these, 796 are members of two or three Superkingdoms, while more than 90% of both the archaeal and the bacterial FSFs are found among the eukaryote cohort. To this may be added that the total numbers of FSFs within each Superkingdom are suggestively similar: 1110, 883 and 713 among eukaryotes, bacteria and archaea, respectively. These FSF distributions define a rather level playing field for the evolution of the modern trichotomy. They also suggest that the common ancestor had an FSF cohort containing no fewer than 796 FSFs and perhaps as many as 1200, if none were lost since the divergence of the trichotomy from the common ancestor. It has been possible to compute the most parsimonious trajectory based on the expected gains and losses of FSFs from a putative common ancestor to the modern trichotomy. The most parsimonious trajectory begins with an ancestor that encodes an FSF cohort more like that of modern eukaryotes than of modern archaea and/or modern bacteria. The very approximate equivalence of FSF distributions among the Superkingdoms speaks rather strongly against a common prejudice [17–21]. This is the view that the evolution of the eukaryotes entailed a substantial increase in the complexities of the ancestor to the eukaryotes through the creation of a chimera consisting of the archaeal and bacterial proteomes. A maximum of roughly 170 FSFs would have been added to the archaeal FSFs by the bacterial ones, if the modern FSF distributions are a relevant guide to the diversities of the ancestral FSF distributions. The substantial overlap between FSFs of the three Superkingdoms suggests that a relatively small fraction of novel fold superfamilies have been fixed among the eukaryotes since the time of the common ancestor. The number of supplementary novel domains might range between zero and 244 FSFs (the eukaryote signature FSFs). The reasons for favoring a supplementary eukaryote FSF count closer to zero than to 244 arise from Dollo's Law and the ease of recombination between domains of coding sequences. Thus, Dollo's Law suggests that the creation and fixation of a complex structure, in this case a fold sequence, is a far less likely event than the loss of that complex sequence by random mutations [18,35,36]. So, we might expect most of the eukaryote signature FSFs, if they were present in the ancestor, to have been lost from the archaea and/or from the bacteria rather than created in the eukaryotes after their divergences from the common ancestor. In addition, genetic recombination between coding sequences is facile in the sense that it is easily observed in real-time laboratory experiments, particularly when random mutants are screened under suitable selective conditions [60]. This means that new proteins compounded from preexisting folds can evolve much more quickly than new folds can be created.
Accordingly, we might entertain the possibility that the common ancestor marks the point in proteome evolution at which the permutation of concatenated protein domains replaced protein domain creation as the dominant path of protein evolution. This suggestion can explain why it might take billions of years to create the common ancestor and why the divergence of that ancestor into the three Superkingdoms occurred when it did [61]. Thus, there is no compelling evidence that identifies the common ancestor of the modern trichotomy with the origin of life [24–28]. On the other hand, the divergence of the trichotomy may have followed from the emergence of rapid modes through which proteomes could adapt to specific ecological challenges [27,28]. In particular, the rates of cellular adaptations may have accelerated because the cohort of FSFs had begun to approximate the now definitive number found in the modern trichotomy, 1200 or so [15,16]. Were this the case, the common ancestor might represent the transition stage at which reassortment of protein domains by genetic recombination replaced the creation of novel domains as the dominant path of protein evolution. Splicing is certainly more efficient than genetic recombination at generating proteins with reassorted domains. However, more details relevant to the early evolution of splicing systems are needed before we can pinpoint the debut of splicing systems [27]. We note in passing that in the context of the present "eukaryotes early" model, "introns early" has lost its former identification with archaeal and bacterial evolution.


Reductive constraints that cap the protein mass not involved in the reproduction of cells, as well as the lengths of all proteins, are ubiquitous; they ensure that reproductive rates remain competitive for cell lineages in any environmental setting. Any cell lineage that fails to evolve its protein mass on the basis of an optimal reproductive rate is doomed to extinction [40–42]. Nevertheless, the magnitudes of these reductive constraints are subject to the vagaries of cellular population size, proteome complexity and specific growth conditions. Thus, we expect the selective constraints on protein lengths to be much more stringent in archaea and in bacteria because eukaryotes have evolved in a mode of specialization that is not directed towards absolutely maximum growth rates. In contrast, archaea and bacteria may have evolved highly specialized proteomes that enable them to outgrow and avoid extinction by eukaryotic phagotrophs [26,27]. In addition, archaeal and bacterial population sizes are much larger than those of eukaryotes, while the archaeal and bacterial proteomes are considerably smaller and less complex than those of eukaryotes [32,52]. These differences translate into the more intense reductive pressure we observe in the archaeal and bacterial proteomes (Figs. 2–4). Thus, the characteristically larger orthologous proteins of eukaryotes seem to have evolved under less intense selection for minimal lengths rather than through more direct selection for larger, more complex proteins.

Acknowledgements

This study originated from a richly informative disagreement with David Penny, to whom we are deeply indebted. We thank Russ Doolittle for getting his calculations right the first time. We are indebted to David Penny, Lesley Collins, Russ Doolittle, and Roger Garrett for helpful and encouraging comments as well as guidance with the literature. Gustavo Caetano-Anollés generously shared data before publication. The work of C. G. K. is supported by The Royal Physiographic Society, Lund and The Nobel Committee for Chemistry, The Royal Swedish Academy of Sciences, Stockholm. O. G. B. is supported by The Swedish Research Council, Stockholm.

Appendix I. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.biochi.2007.09.004.

References

[1] F.B. Salisbury, Natural selection and the complexity of the gene, Nature 224 (1969) 342–343.
[2] J. Maynard Smith, Natural selection and the concept of a protein space, Nature 225 (1970) 563–564.
[3] C.R. Woese, G.E. Fox, The concept of cellular evolution, J. Mol. Evol. 10 (1977) 1–6.
[4] C.R. Woese, The primary lines of descent and the universal ancestor, in: D.S. Bendall (Ed.), Evolution from Molecules to Men, Cambridge University Press, Cambridge, 1983, pp. 209–233.
[5] C.R. Woese, The universal ancestor, Proc. Natl. Acad. Sci. U.S.A. 95 (1998) 6854–6859.
[6] S.T. Fitz-Gibbon, C.H. House, Whole genome-based phylogenetic analysis of free-living microorganisms, Nucleic Acids Res. 27 (1999) 4218–4222.
[7] M.A. Huynen, B. Snel, P. Bork, Lateral gene transfer, genome surveys, and the phylogeny of prokaryotes (in Technical Comments), Science 286 (1999) 1443a.
[8] B. Snel, P. Bork, M.A. Huynen, Genome phylogeny based on gene content, Nat. Genet. 21 (1999) 108–110.
[9] F. Tekaia, A. Lazcano, B. Dujon, The genomic tree as revealed from whole proteome comparisons, Genome Res. 9 (1999) 550–557.
[10] C.R. Woese, Interpreting the universal phylogenetic tree, Proc. Natl. Acad. Sci. U.S.A. 97 (2000) 8392–8396.
[11] J.R. Brown, C.J. Douady, M.J. Italia, W.E. Marshall, M.J. Stanhope, Universal trees based on large combined protein sequence data sets, Nat. Genet. 28 (2001) 281–285.
[12] J.O. Korbel, B. Snel, M.A. Huynen, P. Bork, SHOT: a web server for the construction of genome phylogenies, Trends Genet. 18 (2002) 158–162.
[13] B. Snel, P. Bork, M.A. Huynen, Genomes in flux: the evolution of archaeal and proteobacterial gene content, Genome Res. 12 (2002) 17–25.
[14] G. Caetano-Anollés, D. Caetano-Anollés, An evolutionarily structured universe of protein architecture, Genome Res. 13 (2003) 1563–1571.
[15] G. Caetano-Anollés, D. Caetano-Anollés, Universal sharing patterns in proteomes and evolution of protein fold architecture and life, J. Mol. Evol. 60 (2005) 484–498.
[16] S. Yang, R.F. Doolittle, P.E. Bourne, Phylogeny determined by protein domain content, Proc. Natl. Acad. Sci. U.S.A. 102 (2005) 373–378.
[17] J.P. Gogarten, L. Olendzenski, E. Hilario, C. Simon, K.E. Holsinger, Dating the cenancester of organisms, Science 274 (1996) 1750–1751; author reply 1751–1753.
[18] W. Martin, M. Muller, The hydrogen hypothesis for the first eukaryote, Nature 392 (1998) 37–41.
[19] W.F. Doolittle, Phylogenetic classification and the universal tree, Science 284 (1999) 2124–2129.
[20] P. Lopez-Garcia, D. Moreira, Metabolic symbiosis at the origin of eukaryotes, Trends Biochem. Sci. 24 (1999) 88–93.
[21] E.V. Koonin, L. Aravind, A.S. Kondrashov, The impact of comparative genomics on our understanding of evolution, Cell 101 (2000) 573–576.
[22] P. Forterre, Thermoreduction, a hypothesis for the origin of prokaryotes, C.R. Acad. Sci. III 318 (1995) 415–422.
[23] P. Forterre, H. Philippe, Where is the root of the universal tree of life? BioEssays 21 (1999) 871–879.
[24] N. Glansdorff, About the last common ancestor, the universal life-tree and lateral gene transfer: a reappraisal, Mol. Microbiol. 38 (2000) 177–185.
[25] D. Penny, A. Poole, The nature of the last universal common ancestor, Curr. Opin. Genet. Dev. 9 (1999) 672–677.
[26] C.G. Kurland, L.J. Collins, D. Penny, Genomics and the irreducible nature of eukaryote cells, Science 312 (2006) 1011–1014.
[27] C.G. Kurland, L.J. Collins, D. Penny, The evolution of eukaryotes, Response, Science 316 (2007) 543.
[28] M. Wang, S. Yafremava, D. Caetano-Anollés, J.E. Mittenthal, G. Caetano-Anollés, Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world, Genome Res., in press. doi:10.1101/gr.6454307.
[29] J. Zhang, Protein-length distributions for the three domains of life, Trends Genet. 16 (2000) 107–109.
[30] P. Liang, M. Riley, A comparative genomics approach for studying ancestral proteins and evolution, Adv. Appl. Microbiol. 50 (2001) 39–72.
[31] L. Brocchieri, S. Karlin, Protein length in eukaryotic and prokaryotic proteomes, Nucleic Acids Res. 33 (2005) 3390–3400.
[32] L. Goldovsky, P. Janssen, D. Ahren, B. Audit, I. Cases, et al., CoGenT++: an extensive and extensible data environment for computational genomics, Bioinformatics 21 (2005) 3806–3810.
[33] J.D. Thompson, D.G. Higgins, T.J. Gibson, CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Res. 22 (1994) 4673–4680.
[34] S. Aparicio, J. Chapman, E. Stupka, N. Putnam, J.M. Chia, et al., Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes, Science 297 (2002) 1301–1310.
[35] J.S. Farris, Phylogenetic analysis under Dollo's Law, Syst. Zool. 26 (1977) 77–88.
[36] F.D. Ferrari, Evolutionary transformations and Dollo's Law, J. Crust. Biol. 8 (1988) 618–619.
[37] C.R. Marshall, E.C. Raff, R.A. Raff, Dollo's Law and the death and resurrection of genes, Proc. Natl. Acad. Sci. U.S.A. 91 (1994) 12283–12287.
[38] J.O. Andersson, S.G. Andersson, Insights into the evolutionary process of genome degradation, Curr. Opin. Genet. Dev. 9 (1999) 664–671.
[39] W. Davids, H. Amiri, S.G. Andersson, Small RNAs in Rickettsia: are they functional? Trends Genet. 18 (2002) 331–334.
[40] M. Ehrenberg, C.G. Kurland, Costs of accuracy determined by a maximal growth rate constraint, Q. Rev. Biophys. 17 (1984) 45–82.
[41] C.G. Kurland, Translational accuracy and the fitness of bacteria, Annu. Rev. Genet. 26 (1992) 29–50.
[42] H. Bremer, P.P. Dennis, Modulation of chemical composition and other parameters of the cell by growth rate, in: F.C. Neidhardt (Ed.), Escherichia coli and Salmonella, ASM Press, Washington DC, 1996, pp. 1553–1569.
[43] S.G. Andersson, A. Zomorodipour, J.O. Andersson, T. Sicheritz-Pontén, U.C. Alsmark, et al., The genome sequence of Rickettsia prowazekii and the origin of mitochondria, Nature 396 (1998) 133–140.
[44] D.A. Petrov, D.L. Hartl, High rate of DNA loss in Drosophila melanogaster and Drosophila virilis species groups, Mol. Biol. Evol. 15 (1998) 293–302.
[45] D.A. Petrov, T.A. Sangster, J.S. Johnston, D.L. Hartl, K.L. Shaw, Evidence for DNA loss as a determinant of genome size, Science 287 (2000) 1060–1062.
[46] S.T. Cole, K. Eiglmeier, J. Parkhill, K.D. James, N.R. Thomson, et al., Massive gene decay in the leprosy bacillus, Nature 409 (2001) 1007–1011.
[47] O.G. Berg, C.G. Kurland, Evolution of microbial genomes: sequence acquisition and loss, Mol. Biol. Evol. 19 (2002) 2265–2276.
[48] A.C. Frank, H. Amiri, S.G. Andersson, Genome deterioration: loss of repeated sequences and accumulation of junk DNA, Genetica 115 (2002) 1–12.
[49] A. Mira, L. Klasson, S.G. Andersson, Microbial genome evolution: sources of variability, Curr. Opin. Microbiol. 5 (2002) 506–512.
[50] D.A. Petrov, Mutational equilibrium model of genome size evolution, Theor. Pop. Biol. 61 (2002) 533–546.
[51] H. Ochman, Genomes on the shrink, Proc. Natl. Acad. Sci. U.S.A. 102 (2005) 11959–11960.
[52] M. Nei, D. Graur, Extent of protein polymorphism and the neutral mutation theory, Evol. Biol. 17 (1984) 73–118.
[53] R. Mikkola, C.G. Kurland, Evidence for demand-regulation of ribosome accumulation in E coli, Biochimie 73 (1991) 1551–1556.
[54] H. Dong, L. Nilsson, C.G. Kurland, Co-variation of tRNA abundance and codon usage in Escherichia coli at different growth rates, J. Mol. Biol. 260 (1996) 649–663.
[55] O.G. Berg, C.G. Kurland, Growth rate-optimised tRNA abundance and codon usage, J. Mol. Biol. 270 (1997) 544–550.
[56] O.G. Berg, Selection intensity for codon bias and the effective population size of Escherichia coli, Genetics 142 (1996) 1379–1382.
[57] I. Tamas, L. Klasson, B. Canbäck, A.K. Näslund, A.-S. Eriksson, et al., 50 million years of genomic stasis in endosymbiotic bacteria, Science 296 (2002) 2376–2379.
[58] S. High, B.M. Abell, Tail-anchored protein biosynthesis at the endoplasmic reticulum: the same but different, Biochem. Soc. Trans. 32 (2004) 659–662.
[59] O. Emanuelsson, S. Brunak, G. von Heijne, H. Nielsen, Locating proteins in the cell using TargetP, SignalP and related tools, Nat. Protoc. 2 (2007) 953–971.
[60] D.L. Hartl, E.W. Jones, Genetics, Jones and Bartlett, Sudbury, MA, 2004.
[61] R.F. Doolittle, D.F. Feng, S. Tsang, G. Cho, E. Little, Determining divergence times of the major kingdoms of living organisms with a protein clock, Science 271 (1996) 470–477.