Palaeogenomics of plants: synteny-based modelling of extinct ancestors


Opinion

Michael Abrouk1, Florent Murat1, Caroline Pont1, Joachim Messing2, Scott Jackson3, Thomas Faraut4, Eric Tannier5, Christophe Plomion6, Richard Cooke7, Catherine Feuillet1 and Jérôme Salse1

1 INRA, UMR 1095, Laboratoire Génétique, Diversité et Ecophysiologie des Céréales, 234 avenue du Brézet, 63100 Clermont Ferrand, France
2 The Plant Genome Initiative at Rutgers (PGIR), Piscataway, New Jersey 08854, USA
3 Department of Agronomy, 915 W. State Street, Purdue University, West Lafayette, Indiana 47907-2054, USA
4 INRA, UMR 444, Laboratoire de Génétique Cellulaire, BP 52627, 31326 Castanet Tolosan, France
5 INRIA, Grenoble Rhône-Alpes, CNRS, UMR 5558, Laboratoire Biométrie et Biologie Evolutive, 43 Boulevard du 11 Novembre 1918, 69622 Villeurbanne Cedex, France
6 INRA, UMR 1202 BIOGECO, 69 route d’Arcachon, 33612 Cestas Cedex, France
7 CNRS, Laboratoire Génome et Développement des Plantes, 52 avenue P. Alduy, 66860 Perpignan, France

In the past ten years, international initiatives have led to the development of large sets of genomic resources that allow comparative genomic studies between plant genomes at a high level of resolution. Comparison of map-based genomic sequences revealed shared intragenomic duplications, providing new insights into the evolution of flowering plant genomes from common ancestors. Plant genomes can be presented as concentric circles, providing a new reference for plant chromosome evolutionary relationships and an efficient tool for gene annotation and cross-genome marker development. Recent palaeogenomic data demonstrate that whole-genome duplications have provided a motor for the evolutionary success of flowering plants over the last 50–70 million years.

Corresponding author: Salse, J. ([email protected]).

1360-1385/$ – see front matter © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.tplants.2010.06.001 Trends in Plant Science 15 (2010) 479–487

Access to new genome sequences opens new perspectives in palaeogenomics

An increasing number of animal and plant genomes are being sequenced; as a consequence, comparative genomics has become an important field of research that will shed light on genome function and structure as well as on the evolutionary mechanisms that have shaped chromosome structure. Flowering plants, or angiosperms, are derived from a common ancestor 150–300 million years ago (mya) during the early Cretaceous period according to fossil records [1] and contain socio-economically important crop species, spanning both monocots and dicots. Genome sequences from monocots and from Eurosid dicots, which both diverged from a common ancestor 150–250 mya [2] (Figure 1, yellow panels, red stars), are now available. Monocots include three subfamilies of the grasses (Poaceae), i.e. the Panicoideae (milo, Sorghum bicolor [3] and maize, Zea mays [4]), Ehrhartoideae (rice, Oryza sativa [5]) and Pooideae (brome, Brachypodium distachyon [6]). Eurosid dicots include grape (Vitis vinifera [7]), poplar (Populus trichocarpa [8]), papaya (Carica papaya [9]), cress (Arabidopsis thaliana [10]) and soybean (Glycine max [11]). Since 2000, the genomes of the nine previously cited monocot and dicot species have been sequenced (see sequence genome features in Table 1). In addition, high resolution gene-based genetic maps are available for other agronomically important monocots, such as those of the Triticeae tribe, e.g. wheat (Triticum aestivum [12]) and barley (Hordeum vulgare [13]), within the subfamily of the Pooideae, and eudicots such as Brassicales [14]. These resources have recently been used to perform large scale inter-specific sequence comparisons and refine our understanding of collinearity (see Glossary) between plant chromosomes as well as their palaeohistory from their common ancestor [15,16]. This opinion article provides a global and integrative view of palaeogenomics in plants based on the comparison of sequenced genomes for eudicots and monocots and highly saturated gene-based genetic maps available for wheat and barley.

Glossary
Synteny: physical co-localisation of genetic loci and/or genes on the same chromosome and/or linkage group within an individual or species.
Conserved and/or shared synteny: preserved co-localisation of genetic loci and/or genes on chromosomes and/or linkage groups of different species, also referred to as macrosynteny (based on large portions of a chromosome) and microsynteny (based on only a few genes at a time).
Collinearity: a more specific form of conserved synteny, which requires co-localisation of genes on the same chromosome within an individual or species with common gene order.
Ortholog: genes in different species that originated from a common ancestor.
Paralog: genes in an organism that are duplicated to occupy two different positions in the same genome.
Ohnolog: paralogous genes that originated through a whole genome duplication (WGD) event.
Palaeogenomics: reconstruction and analysis of ancestral genomes from extant species.
Palaeohistory: study of evolutionary and/or speciation events that have shaped extant species.

Figure 1. Identification of orthologs and paralogs in plants. Orange panels: schematic representation of the phylogenetic relationship between angiosperm species. Divergence times from a common ancestor are indicated on the branches of the phylogenetic tree (in millions of years). Species for which we present synteny and duplication patterns in the current review are highlighted in red. Sequenced genomes are indicated with a red asterisk. Red panels: schematic representation of the 20 270 orthologs identified between the rice chromosomes (r1–r12) used as a reference, and the Brachypodium (bd1–bd5), wheat (w1–w7), Sorghum (s1–s10), and maize (m1–m10) chromosomes; and the 14 156 orthologs identified between the grape chromosomes (G1–G19) used as a pivotal genome, and the Arabidopsis (A1–A5), poplar (P1–P19), Papaya (Py1–Py9) and soybean (S1–S20) chromosomes. Each line represents an orthologous gene. The different coloured blocks reflect the origin from the five and seven ancestral protochromosomes, respectively, for the monocots and the eudicots. Grey panels: schematic representation of the 5 093 paralogous pairs identified within the rice (r1–r12), Brachypodium (bd1–bd5), wheat (w1–w7), Sorghum (s1–s10), and maize (m1–m10) genomes as well as the 16 712 paralogous pairs identified within the grape (G1–G19), Arabidopsis (A1–A5), poplar (P1–P19), Papaya (Py1–Py9) and soybean (S1–S20) genomes. Each line represents a duplicated gene. The different coloured blocks represent distinct chromosome-to-chromosome duplication blocks with respect to the five and seven ancestral protochromosomes, respectively, for the monocots and the eudicots. Species-specific duplications are displayed within black blocks.

A need for improved standards of criteria in the field of comparative genomics

It is difficult to infer orthologous (derived from a common ancestor by speciation) and paralogous (derived by duplication within one genome) relationships from sequence comparisons alone. Stringent alignment criteria and statistical validation are crucial when comparing gene sequences to evaluate accurately whether the association between two or more genes found in the same order on two chromosomal segments is due to chance or reflects true conserved collinearity. Several algorithms and/or visualisation tools are available as open source and can accommodate a number of output file standards to represent comparative analyses derived from genomic alignments. However, the limitation of these tools is that they do not necessarily provide any statistical test enabling robust assessment of the relationships between the aligned sequences, which is necessary to ensure relevant downstream interpretation of evolutionary mechanisms [17]. The presence in protein or nucleotide sequences of short, highly-conserved motifs can easily lead to artefactual attribution of orthologous or paralogous relationships between genes. Plant genomes also contain many gene families, which can likewise lead to erroneous conclusions. To increase the significance of inter-specific coding sequence (CDS)

Table 1. Plant genomes used in palaeogenomics analysis

Species | No. of chrom. | Gene assembly physical size (Mbp) | No. of annotated unigene models | Synteny data: (No) a | Synteny data: (No) b | Synteny data: (%) c | Duplication data: (No) d | Duplication data: (No) e | Duplication data: (%) c
Monocots
Oryza sativa | 12 | 372 | 41046 | RG* | RG* | RG* | 448 | 10 | 73
Sorghum bicolor | 10 | 659 | 34008 | 6147 | 12 | 99 | 409 | 10 | 84
Zea mays | 10 | 2365 | 32540 | 4454 | 30 | 82 | 3454 | 17 | 99
Brachypodium distachyon | 5 | 271 | 25504 | 8533 | 12 | 99 | 642 | 13 | 79
*Triticum aestivum f | 21 (7 groups) | NA | 5003 | 827 | 13 | 91 | 102 | 10 | 75
*Hordeum vulgare f | 7 | NA | 3423 | 309 | 13 | 84 | 38 | 9 | 75
Eudicots
Vitis vinifera | 19 | 302 | 21189 | RG* | RG* | RG* | 543 | 23 | 71
Arabidopsis thaliana | 5 | 119 | 33198 | 2389 | 80 | 99 | 1630 | 55 | 83
Populus trichocarpa | 19 | 294 | 30260 | 4555 | 90 | 75 | 4791 | 70 | 66
Carica papaya | 9 | 234 | 19205 | 3199 | 65 | 75 | 215 | 36 | 55
Glycine max | 20 | 949 | 46194 | 4013 | 164 | 97 | 9533 | 89 | 55

Abbreviations: chrom, chromosomes; RG* refers to Reference Genome, indicating that rice (Oryza sativa) and grape (Vitis vinifera) have been used as reference genomes for the synteny analysis of the monocots and the eudicots, respectively.
a Number of orthologous genes
b Number of collinear blocks
c Percentage of genome coverage
d Number of paralogous genes
e Number of duplicated blocks
f Asterisks for the wheat and barley genomes highlight the highly saturated gene-based genetic maps used in the analysis

alignments for inferring evolutionary relationships between genomes, we defined two new parameters for BLAST analyses (either nucleic or protein-based), which take into account not only similarity but also the relative lengths of the sequences: CIP for Cumulative Identity Percentage and CALP for Cumulative Alignment Length Percentage. The CIP [(Σ nb ID by HSP/AL) × 100] corresponds to the cumulative number of identities (nb ID) observed for all the High Scoring Pairs (HSPs) divided by the cumulative Aligned Length (AL), which is the sum of all HSP lengths, expressed as a percentage. The CALP [AL/Query length] is the sum of the HSP lengths (AL) for all HSPs divided by the length of the query sequence. With these parameters, BLAST hits are retained on the basis of the highest cumulative percentage identity over the longest cumulative length, thereby increasing stringency in defining conservation between two compared genome sequences [17,18].

Furthermore, most comparative genomics studies were done without applying statistical validation of the results and therefore the significance of the relationships established in different studies is difficult to infer. In previous studies, we systematically performed a statistical test after BLAST comparison with the CIP/CALP parameters to validate non-random associations between groups of sequences. In order to compare heterogeneous genome sequence data sets (from fully-sequenced genomes to large mapped EST collections), we have derived a new approach involving two criteria: the Density Ratio (DR) and the Cluster Ratio (CR), which are functions of the physical and/or genetic size (Size), the total number of genes and/or loci (Gnumber) and the number of orthologous sequence pairs (Cnumber) defined in the orthologous regions identified with the previous CIP/CALP parameters.
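As an illustration only, the four filtering parameters can be written as small Python functions. This is a minimal sketch, not the authors' implementation: the `HSP` record and all function names are hypothetical, CIP and CALP follow the bracketed formulas given above, and DR and CR follow the bracketed formulas stated in the text that comes next. Scaling CALP by 100 so that it reads as a percentage is an assumption of this sketch, since the bracketed formula gives only the ratio AL/Query length.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class HSP:
    """Minimal, hypothetical record for one BLAST High Scoring Pair."""
    identities: int  # number of identical positions in this HSP
    length: int      # aligned length of this HSP


def aligned_length(hsps: Sequence[HSP]) -> int:
    """AL: cumulative aligned length, i.e. the sum of all HSP lengths."""
    return sum(h.length for h in hsps)


def cip(hsps: Sequence[HSP]) -> float:
    """CIP = (sum of identities over all HSPs / AL) x 100."""
    al = aligned_length(hsps)
    if al == 0:
        raise ValueError("no aligned positions")
    return 100.0 * sum(h.identities for h in hsps) / al


def calp(hsps: Sequence[HSP], query_length: int) -> float:
    """CALP = AL / query length, scaled by 100 (assumption of this sketch)."""
    if query_length <= 0:
        raise ValueError("query length must be positive")
    return 100.0 * aligned_length(hsps) / query_length


def density_ratio(size_1: float, size_2: float, c_number: int) -> float:
    """DR = (Size 1 + Size 2) / (2 x Cnumber) x 100."""
    return 100.0 * (size_1 + size_2) / (2 * c_number)


def cluster_ratio(g_number_1: int, g_number_2: int, c_number: int) -> float:
    """CR = (2 x Cnumber) / (Gnumber 1 + Gnumber 2) x 100."""
    return 100.0 * (2 * c_number) / (g_number_1 + g_number_2)


if __name__ == "__main__":
    # Made-up numbers, chosen only to exercise the functions.
    hsps = [HSP(identities=90, length=100), HSP(identities=45, length=50)]
    print(cip(hsps), calp(hsps, query_length=200))
    print(density_ratio(10.0, 20.0, 50), cluster_ratio(100, 150, 50))
```

In this form a candidate pair of regions with many orthologous pairs relative to its size and gene content gives a low DR and a high CR, which is the combination the text associates with statistically significant collinearity. Any thresholds applied to these values would have to come from the cited studies [17,18]; none are assumed here.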
DR [(Size 1 + Size 2)/(2 × Cnumber) × 100] represents the number of links between two orthologous regions as a function of the size of the considered blocks, while CR [(2 × Cnumber)/(Gnumber 1 + Gnumber 2) × 100] represents the number of links between two orthologous regions as a function of the number of genes in the considered blocks. Statistically significant chromosome-to-chromosome collinear relationships between two genomes are associated with the lowest DR and highest CR values, while the remaining collinear regions are considered artefactual, i.e. obtained at random [17,18]. The need for accurate filtering methods will become increasingly important with the onslaught of genome sequencing, and community standards need to be established to ensure that different studies can be correlated and compared. The method described here can be directly used to initiate comparative genome analyses and can also be considered as an improved community standard for comparative studies that require stringent filtering methods, such as ancestral genome reconstruction.

Plant genomes are diploidised palaeopolyploids

Monocot synteny was analysed using the alignment parameters and statistical tests described previously, considering rice as the reference genome (red panels in Figure 1; and Table 1), to identify orthologous relationships and delimit inter-genome collinearity. Overall, we identified 20 270 orthologous pairs and 80 syntenic blocks covering on average 91% of the six cereal genomes. Systematic analysis of the conservation pattern between homologous genes among the five cereal genomes indicates that 77% of the genes are conserved when relationships are established solely on the basis of short conserved sequence regions (67% between rice and maize, 79% between rice and Sorghum, 73% between rice and wheat, and 87% between rice and Brachypodium), whereas only 16% are conserved (11% between rice and maize, 15% between rice and Sorghum, 17% between rice and wheat, and 21% between rice and Brachypodium) when orthologous relationships (segments derived from a common ancestor) are considered. Similar analysis of eudicot synteny with grape as the reference genome (red panels in Figure 1; and Table 1) identified 14 156 orthologous pairs and 399 syntenic blocks covering on average 87% of the five considered genomes.
Comparison of homologous genes among the five eudicot genomes indicates that 77% of the genes are conserved when relationships are based only on short conserved sequence motifs (52% between grape and Arabidopsis, 69% between grape and poplar, 97% between grape and soybean, and 89% between grape and Papaya), whereas orthologous relationships are conserved for only 17% (11% between grape and Arabidopsis, 22% between grape and poplar, 19% between grape and soybean, and 15% between grape and Papaya). By providing precise limits for the syntenic regions in monocots and eudicots and systematically validating the data with a statistical test, this study complements and greatly refines previous marker-based and low resolution, sequence-based macrocollinearity studies [15], thereby allowing us to better characterise duplication patterns in different plant genomes.

Our approach similarly allowed us to refine and extend the identification of interchromosomal duplications (Table 1). There were 69 duplications characterised in the six cereal genomes and 273 in the five eudicot genomes. Thus, in total, 5 093 (defining 69 blocks) and 16 712 (defining 273 blocks) paralogous genes were identified for the 11 monocot (81% of genome coverage on average) and eudicot (66% of genome coverage on average) genomes, respectively, providing the largest set of conserved duplicated genes in plants to date from an evolutionary perspective. Integration of intra-species duplication and inter-species synteny analyses in the eudicot and monocot genomes allowed precise characterisation of seven shared ancestral duplications.
These were found on the following chromosome pair combinations in monocots (t4–t5/r11–r12/s5–s8/m2–m4–m1–m3–m10/bd4–bd4, t1–t3/r5–r1/s9–s3/m6–m8–m3/bd2–bd2, t1–t4/r10–r3/s1–s1/m1–m5–m9/bd1–bd3, t2–t4/r7–r3/s2–s1/m2–m7–m1–m9–m5/bd1–bd1, t2–t6/r4–r2/s6–s4/m2–m10–m4–m5/bd3–bd5, t5–t7/r9–r8/s2–s7/m2–m7–m1–m4–m10–m6/bd3–bd4, and t6–t7/r2–r6/s4–s10/m4–m5–m6–m9/bd1–bd3), where ‘t’ is Triticeae, ‘r’ rice, ‘s’ Sorghum, ‘m’ maize and ‘bd’ Brachypodium. In the eudicots these seven ancestral duplications, involving the entire genomes of Arabidopsis, poplar, Papaya and soybean, correspond to the following chromosomal relationships in the grape (G) genome: G1–G14–G17/G2–G15–G12–G16/G3–G4–G7–G18/G4–G9–G11/G5–G7–G14/G6–G8–G13/G10–G12–G19. The precise identification of seven ancestral duplications covering more than 50% of any considered genome in eudicots and monocots is clear proof of whole genome duplication (WGD) events, demonstrating that these diploid plant species are all diploidised ancient polyploids.

Plant palaeohistory reveals evolutionary novelties through polyploidy

Recent palaeogenomic analyses of plants led to the identification of a palaeohexaploid ancestor with seven protochromosomes comprising 9 731 protogene models (eudicots) and a palaeotetraploid ancestor with five protochromosomes comprising 9 138 protogene models (monocots) [8,19]. Based on these rich comparative data sets we recently proposed a model for an evolutionary pathway for angiosperm genomes (Figure 2). In monocots, characterisation of the seven palaeoduplications and relationships between the different conserved regions allowed us to identify evolutionary events that have shaped grass genomes since their divergence from a putative ancestor with five chromosomes (A5, A7, A11, A8 and A4; red panel in Figure 2). After a WGD (γ event; 5 + 5 = 10 chromosomes) about 50–70 mya, the ancestral genome underwent two interchromosomal translocations and fusions (δ event, A1–A12 in Figure 2) that resulted in an n = 12 intermediate ancestor (5 + 5 + 2 = 12 chromosomes). An alternative and simpler model (δ event, green panel in Figure 2) would be that the seven ancestral duplications represent seven ancestral protochromosomes instead of five in the monocots. In this scenario the two interchromosomal fissions (involving A2, A4, A6 and A3, A7, A10) would be replaced by two palaeoduplications (A7/A7′, A10/A10′ and A4/A4′, A6/A6′), followed by two chromosome fusion events (A3 = A7′ + A10′, and A2 = A4′ + A6′). Fissions (n = 5 based scenario, δ event) and fusions (n = 7 based scenario, δ event) can be considered as alternative pathways that might have shaped the A2 and A3 ancestral chromosomes. However, because A2 and A3 cannot be reconstructed in their entirety from chromosomes A4 + A6 and A7 + A10, the n = 5 scenario is the most likely course of events, considering the synteny and duplication relationships, and more precisely the boundaries, observed in monocots. Moreover, end-to-end chromosome fusion (n = 7 based scenario) is an unlikely evolutionary mechanism for chromosome number reduction in grasses, as will be discussed in detail in the next section. In the n = 5 model, rice retained the original chromosome number of 12, whereas the other grass genomes have evolved from this ancestral genome structure through independent nested chromosome fusion (NCF) events.

In rice, additional segmental duplications occurred without modifying the basic structure of 12 chromosomes, including the recent duplications over 3 Mb at the terminal ends of chromosomes r11 and r12 [20]. The maize and Sorghum genomes evolved from the 12 intermediate ancestral chromosomes through two chromosomal fusions (between A3 and A10 and between A7 and A9, Figure 2, β event) that resulted in a Panicoideae ancestor with n = 10 (5 + 5 + 2 – 2) chromosomes [21,22]. Maize and Sorghum subsequently evolved independently from this ancestor. While the Sorghum genome structure remained similar to the n = 10 chromosome ancestral genome (Figure 2), maize underwent a WGD event (α event), resulting in an intermediate with n = 20 chromosomes. Rapidly following this event, numerous chromosomal fusions led to a genome structure with 10 chromosomes (n = 10 = [{5 + 5 + 2 – 2} × 2] – 10). At least 17 NCF events must have occurred to explain the paralogous relationships that can be observed today between the different maize chromosomes [5,19,22] (Figure 2). From the intermediate ancestral genome with 12 chromosomes, the Triticeae ancestral genome underwent five chromosomal fusions (Figure 2) between A5 and A10, A6 and A8, A9 and A12, A3 and A11, and A4 and A7 that resulted in the five chromosomes T1, T7, T5, T4 and T2, respectively, and a basic number of n = 7 (5 + 5 + 2 – 5, β event) for the wheat and barley genomes. Wheat-specific translocation events (T4–T5 and T4–T7, shown in black in the wheat circle, Figure 1) were also characterised.
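The chromosome-number bookkeeping of the n = 5 grass model can be checked with a few lines of Python. This is a minimal sketch that only replays the arithmetic quoted in the text (for example 5 + 5 + 2 = 12 and [{5 + 5 + 2 – 2} × 2] – 10 = 10). The function and dictionary key names are hypothetical, and the sketch says nothing about which chromosomes are involved in each event.

```python
from typing import Dict


def whole_genome_duplication(n: int) -> int:
    """A WGD doubles the basic chromosome number."""
    return 2 * n


def grass_chromosome_numbers(ancestral_n: int = 5) -> Dict[str, int]:
    """Replay the chromosome counts of the n = 5 model described in the text."""
    # Ancestral WGD: 5 + 5 = 10 chromosomes.
    duplicated = whole_genome_duplication(ancestral_n)
    # Two translocation/fusion events give the n = 12 intermediate (5 + 5 + 2).
    intermediate = duplicated + 2
    # Two fusions give the Panicoideae ancestor (5 + 5 + 2 - 2 = 10).
    panicoideae = intermediate - 2
    # Maize: lineage-specific WGD (n = 20), then a net reduction of 10.
    maize = whole_genome_duplication(panicoideae) - 10
    # Triticeae: five fusions from the n = 12 intermediate (5 + 5 + 2 - 5 = 7).
    triticeae = intermediate - 5
    return {
        "intermediate_ancestor": intermediate,
        "rice": intermediate,    # rice retained the 12 chromosomes
        "sorghum": panicoideae,  # Sorghum stayed close to the n = 10 ancestor
        "maize": maize,
        "triticeae": triticeae,
    }


if __name__ == "__main__":
    for lineage, n in grass_chromosome_numbers().items():
        print(f"{lineage}: n = {n}")
```

The resulting counts agree with the chromosome numbers listed for rice, Sorghum, maize, wheat (seven homoeologous groups) and barley in Table 1.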


Figure 2. Angiosperm evolutionary models. The monocot chromosomes (right) are represented with colour codes to illustrate the evolution of segments from a common ancestor with five protochromosomes (named according to the rice nomenclature). The four shuffling events that have shaped the structure of the different grass genomes during their evolution from the common ancestor are indicated with γ (Whole Genome Duplication), δ (ancestral chromosome translocations and fusions), β (family-specific shuffling event), and α (lineage-specific shuffling event). The seven eudicot protochromosomes (left) are represented with different colours. The two scenarios based on a hexa- or tetra-ploid palaeo-ancestor are shown in red and green boxes, respectively. The different shuffling events that have shaped the structure of the five genomes during their evolution from the common ancestor are indicated as γ (WGD), β and α (ancestor intermediate or lineage-specific WGD events). The current structure of the plant genomes is represented at the bottom of the figure.

Two models were proposed for sequenced eudicot genomes, based on the characterisation of seven palaeoduplications (γ event, green and red panels in Figure 2). The first suggests that the grape, Arabidopsis, and poplar genomes derive from a hexaploid ancestor with seven protochromosomes followed by one and two specific WGDs in poplar and Arabidopsis, respectively [6,8]. The second proposes that the eudicots derive from three WGDs: one ancestral WGD, one Arabidopsis–poplar shared WGD, and one independent Arabidopsis and poplar specific WGD [23]. The identification of at least remnants of triplications (i.e. hexaploid event) in all the genomes analysed would favour the first model, making an n = 21 intermediate common to all eudicot genomes. In such a scenario, the Arabidopsis genome evolved from an n = 21 intermediate followed by two specific WGDs (α and β events) and 30 NCF. The grape genome would also have evolved from an n = 21 intermediate followed by four NCF. The poplar genome has subsequently undergone a specific WGD (α event) with 71 NCF from the n = 21 intermediate ancestor. The Papaya genome also derived from an n = 21 intermediate and would have undergone 56 NCF. Finally, the soybean genome structure can be explained by two rounds of WGD (α and β events) and 82 NCF.

Having protochromosome models from monocots and eudicots directly derived from the accurate identification of seven shared ancestral duplications in both families enabled us to re-examine the divergence of chromosome structure between these two botanical classes of the plant kingdom. Interestingly, when we aligned the grass monocot (five protochromosomes with 9 138 genes) and eudicot (seven protochromosomes with 9 731 genes) ancestral genomes, no orthologous chromosomal relationships were identified [19]. This result clearly indicates that macrocollinearity has eroded since the two botanical classes diverged from a common ancestor 150–300 mya and reflects an active history of rearrangements during the evolution of plant genomes. Comparison of these observations with analyses of genome evolution in the animal kingdom suggests that plants had to rely on more rapid and frequent changes in chromosomal architecture (especially WGD followed by NCF) in speciation than did mammalian species. We attribute these features to the evolution of DNA replication and repair mechanisms (especially double strand break repair, i.e. DSBR) in plants, which could be explained by the immobility of plants compared with animals and their vulnerability to environmental changes. However, interestingly, similar evolutionary mechanisms have been described both in animals and plants, with a reduced number of protochromosomes through NCF via telomeric/centromeric repeat-mediated illegitimate recombination [6,24] and several rounds of WGD followed by lineage-specific rearrangements leading to different chromosome numbers in extant species. Although there are many similarities between the eukaryotic kingdoms with respect to the characteristics of such chromosomal rearrangements, there are also significant differences. Polyploidisation, a dominant force in the evolution of plants and fungi, occurs far less frequently in vertebrates and is a rare event in most vertebrate lineages, indicating differences in the capacity to adapt to genome duplications [19].

Neo- and palaeo-polyploidisation events provide competitive and selective advantages

Gene duplication generates functional redundancy followed either by pseudogenisation (i.e. unexpressed or functionless paralogs) or concerted evolution (i.e. conservation of function for paralogs) or subfunctionalisation (i.e. complementary function of paralogs) or neofunctionalisation (i.e. novel function of paralogs) during the course of genome evolution. Functional divergence either by subfunctionalisation or neofunctionalisation among duplicated genes is one of the most important sources of evolutionary innovations in complex organisms. Recent studies suggested that a majority of duplicated genes that are structurally retained during evolution have at least partially diverged in their function [25,26]. This can be demonstrated in segregation analysis of single factors controlling macrophenotypes. Although there are two orthologous and paralogous copies of the P gene in maize, only one controls seed colour [27]. The same Mendelianisation is true for the transcription factor in maize affecting seed opacity, which was duplicated before the 12 chromosomes of rice were formed 57 mya [28]. Micro-array studies in eudicots and monocots showed that the vast majority of duplicated genes diverged in their expression profile, with 73% (from 420 gene pairs for old duplications [29,30]) and 88% (from 115 gene pairs for old duplications [31]) of gene pairs in Arabidopsis and rice, respectively, associated with asymmetric expression profiles after 50–70 million years of evolution. In maize, where a recent WGD occurred in addition to the ancient one, >50% of the duplicated genes have been deleted, indicating a selection against gene duplication by ploidy [32]. These results clearly demonstrate that most of the genetic redundancy originating from polyploidy events is erased by a massive loss of duplicated genes by pseudogenisation in one of the duplicated segments soon after the polyploidisation event.


One mechanism by which dispersed gene copies seem to have arisen is transposition. Although gene copies can be fragmented by such events, there are also examples of intact genes being dispersed [33]. While in maize, helitrons seem to have transposed intact gene copies around the genome, CACTA elements have taken over this function in Sorghum [4]. Moreover, chimerism of gene copies and transposable elements is the basis of a change in transcriptional regulation of paralogous gene copies. Because many genes exert their function through interaction networks, a change in the expression and/or function of a single gene could induce changes for a large number of genes present in the same functional pathway. Georg Haberer et al. [34] noted that tandem as well as segmental duplicate gene-pairs had divergent expression patterns in Arabidopsis, even when they shared many similar cis-regulatory sequences, and suggested that changes to a small fraction of cis-elements could be sufficient for neofunctionalisation or subfunctionalisation. Epigenetic modifications constitute one mechanism controlling the expression of gene copies, and also control sequence amplification in general. Such a mechanism has the advantage of being reversible, in contrast to the deletions discussed above. Xiyin Wang et al. [35] observed silencing of polyploidy-derived duplicates due to hypermethylation in Arabidopsis polyploids. Epigenetic mechanisms as well as interaction networks may be the origin of extremely rapid expression divergence of gene duplicates soon after polyploidisation events.

Moreover, biases in the functions of genes that are retained in their structure and function have been reported in plants [36,37] as well as in fungi and mammals [38,39]. Our results for the monocot ancestor [19] are consistent with the results obtained by Andrew H. Paterson and colleagues [36] for the eudicots, who showed that ‘duplication-resistant’ gene families correspond to transcriptional regulators that are retained more significantly after WGD events. Even functional diversification of duplicate genes might be due to epigenetic mechanisms [40]. Genome duplication, and more precisely the resulting gene dosage doubling, may induce disease syndromes and abnormal development [41], similar to events observed in humans. Thus, the spatiotemporal pattern and level of expression of paralogs must be reprogrammed through epigenetic mechanisms in the early process of diploidisation [25,42,43]. Given the prevalence of gene and genome duplication in the evolutionary history of plants, evolution of development in angiosperms may differ from organisms where genome duplication is rare and where extensive expression divergence after duplication would have a profound impact on the evolution of developmental and regulatory networks.

Our data support the idea that after 50–70 million years of evolution since the genome has undergone a polyploidisation event, the vast majority of the paralogous genes (>80%) have been lost within a sister block and that the remaining gene pairs have largely diverged in their expression profile [31]. We can argue that gene functional novelties derived from neo- or subfunctionalisation of orthologous and paralogous copies may reduce the risk of plant species extinction [44,45], as has been suggested in mammals, where vertebrate lineage extinction is higher in the preduplication palaeohistory [46]. Rapid structural (i.e. reciprocal gene loss) and epigenetic and/or functional changes (i.e. neo- or subfunctionalisation) following WGD may give polyploids the ability to adapt quickly and survive environmental conditions that do not favour their diploid ancestors, as it has been reported that neo- or palaeopolyploidy (i) increases vigour [47], (ii) favours tolerance to a wider range of environments [48], and (iii) facilitates self-fertilisation and the formation of asexually reproducing (apomictic) species [49,50].

Palaeogenomics needs to be considered as a tool for trait dissection and genome annotation

The identification of the seven shared palaeoduplications in monocots and eudicots provides the simplest picture of orthologous relationships between the plant genomes and allowed a revision of the ‘concentric plant circles’ [16,19] by introducing the ancestral inner circle with five and seven chromosomes, respectively (Figure 3a and 3b). In Figure 3b, we propose the first representation of the ‘Eudicot concentric circles’ aimed at identifying the orthologous regions between Arabidopsis, poplar, Papaya, soybean and grape that have a common ancestral origin at the microscale level. Thus, by including the ancestral genomes as the inner circles and proposing a reconstruction of monocot and eudicot collinearity from an ancestor with n = 5 and 7 chromosomes, respectively, it is possible to immediately identify the ancestral relationships and origins (WGD, NCF, breakage and fusion) of the different chromosomes in each of the 11 genomes using a simple colour code (Figure 3a and 3b). Microcollinearity (inset in Figure 3a and 3b) clearly establishes the genome-specific duplications (α for the maize genome, regarding the monocots; and α and β duplications for poplar, Papaya, Arabidopsis and soybean in the eudicots) that are not shared by any of the other studied genomes. As shown in Figure 3b, when one copy of the grape genome is considered as the closest relative of the ancestral genome, a single homologous segment is identified in the Papaya genome, suggesting no specific WGD; two homologous segments are identified in the poplar genome, suggesting one (α) additional, independent and recent round of WGD; and four homologous segments are identified in the soybean and Arabidopsis genomes

Figure 3. The ‘concentric circles’ of the plant genomes. (a) Monocot concentric circles. The Triticeae, maize, Sorghum, rice and Brachypodium chromosomes are represented as concentric circles according to their genome size, with the Brachypodium genome closest to the ancestral genome at the centre. The inner circles represent the ancestor with n = 5 (or alternatively 7) chromosomes: A5 = purple, A7 = red, A11 = blue, A8 = yellow and A4 = green. Any radius of the circle establishes the chromosome-to-chromosome collinear relationships between the five monocot species, as detailed in the top inset, which illustrates a microcollinearity relationship between wheat chromosome 3B (2048 Kb, 16 genes), rice chromosome 1 (173 Kb, 21 genes), Sorghum chromosome 3 (214 Kb, 25 genes), maize chromosomes 3-8 (799 Kb, 58 genes) and Brachypodium chromosome 2 (124 Kb, 15 genes). Conserved orthologous genes are shown with the same colour code and non-conserved genes are shown in grey. (b) Eudicot concentric circles. The soybean, grape, poplar, papaya and Arabidopsis chromosomes are represented as concentric circles according to their genome size, with the Arabidopsis genome closest to the ancestral genome at the centre. The inner circles represent the ancestor with n = 7 chromosomes: A1 = purple, A4 = dark red, A7 = black, A10 = yellow, A13 = green, A16 = red and A18 = blue. Any radius of the circle establishes the chromosome-to-chromosome collinear relationships between the five eudicot species, as detailed in the top inset, which illustrates a microcollinearity relationship between soybean chromosomes 2, 3, 10 and 19 (698 Kb, 76 genes), grape chromosome 7 (415 Kb, 47 genes), poplar chromosomes 2 and 14 (679 Kb, 71 genes), papaya linkage group 5 (307 Kb, 40 genes) and Arabidopsis chromosomes 1, 2, 3 and 4 (204 Kb, 63 genes). Conserved orthologous genes are shown with the same colour code and non-conserved genes are shown in grey.


suggesting two (α and β) additional, independent and recent rounds of WGD. In Figure 3a, a single rice region is associated with a unique orthologous counterpart in the Brachypodium, Sorghum and wheat genomes and with two orthologous regions in maize, whose genome underwent an additional WGD (α) about 4.8 mya [20]. This tetraploidisation event (described in the previous section) leads to the representation of the maize genome as a double circle (fifth circle from the centre in Figure 3a).

In addition to bringing new insight into genome evolution, knowledge of the extent of conservation between cereal genomes and the tools generated through comparative genomic studies can be used to (i) define efficient strategies for genetic studies and gene isolation through the design of conserved orthologous marker sets, and (ii) improve the accuracy of gene annotation through the alignment of conserved orthologous genes [17]. This is particularly useful for genomes for which a physical map and whole-genome sequence are not yet available, such as those of the Triticeae.

To support these synteny-based applications, we provide a user-friendly online interface, called ‘Plant-Synteny’, to access the comparative analyses represented in Figures 2 and 3 (http://www.clermont.inra.fr/umr1095/PlantSynteny). It allows visualisation of orthologs and paralogs among plant genomes. The web site provides access to the raw data (gene name, sequence, position and alignment criteria) obtained from the synteny and duplication analyses, and also provides information about the non-redundant ancestral gene set of the grass genomes, which can be used as a platform for the development of Conserved Orthologous Set (COS) markers [51] to support cross-genome map-based cloning strategies.
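Two pieces of reasoning in this section can be sketched in a few lines of Python. The sketch is illustrative only and is not the Plant-Synteny implementation: the function names, the toy gene identifiers and the copy-number rule are assumptions introduced here. `inferred_wgd_rounds` encodes the segment-count reading of Figure 3 (one, two or four homologous segments per reference segment suggesting zero, one or two lineage-specific rounds of WGD), assuming doublings only and no loss of whole segments. `select_cos_candidates` keeps ancestral genes that are still present in every genome consulted, rather than in a single reference genome.

```python
from typing import Dict, List

# Toy ortholog table (hypothetical identifiers, not real data):
# ancestral gene -> species -> orthologous gene copies found by synteny.
TOY_ORTHOLOGS: Dict[str, Dict[str, List[str]]] = {
    "anc_gene_1": {"rice": ["r1"], "sorghum": ["s1"], "brachypodium": ["b1"]},
    "anc_gene_2": {"rice": ["r2"], "sorghum": [], "brachypodium": ["b2"]},
    "anc_gene_3": {"rice": ["r3a", "r3b", "r3c"], "sorghum": ["s3"],
                   "brachypodium": ["b3"]},
}


def inferred_wgd_rounds(segment_count: int) -> int:
    """Rounds of lineage-specific WGD implied by the number of homologous
    segments matching one reference segment (1 -> 0, 2 -> 1, 4 -> 2).

    Assumption: each round doubles the segment and no copy is lost, so the
    count must be a power of two; other counts are rejected, not guessed.
    """
    if segment_count < 1 or segment_count & (segment_count - 1):
        raise ValueError("segment count must be a positive power of two")
    return segment_count.bit_length() - 1


def select_cos_candidates(
    orthologs: Dict[str, Dict[str, List[str]]],
    required_species: List[str],
    max_copies: int = 1,
) -> List[str]:
    """Return ancestral genes with between 1 and max_copies orthologs in
    every required species (a hypothetical selection rule)."""
    candidates = []
    for ancestral_gene, hits in orthologs.items():
        copies = [len(hits.get(species, [])) for species in required_species]
        if all(1 <= n <= max_copies for n in copies):
            candidates.append(ancestral_gene)
    return sorted(candidates)


if __name__ == "__main__":
    # Segment counts relative to one grape segment, as described for Figure 3b.
    for species, segments in [("papaya", 1), ("poplar", 2), ("soybean", 4)]:
        print(species, inferred_wgd_rounds(segments))
    print(select_cos_candidates(TOY_ORTHOLOGS, ["rice", "sorghum", "brachypodium"]))
```

In this toy table only `anc_gene_1` passes the default filter, because `anc_gene_2` has no sorghum ortholog and `anc_gene_3` has three rice copies. For a target with a recent tetraploidisation, such as maize, relaxing the rule to `max_copies=2` would be one plausible adjustment.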
Information on the ancestral gene set can greatly increase the success rate of COS marker design, because markers (genes) are no longer selected from a single reference model genome and applied to another, with the risk that the locus of interest has been lost or subjected to lineage-specific rearrangements not shared with the target species. ‘Plant-Synteny’ can greatly simplify and accelerate the identification of candidate genes using the synteny information [17,19]. Finally, the possibility to identify and align genes within and between genomes provides strong support for genome annotation.

Conclusion

Whole-genome sequencing projects in monocots, including foxtail millet (www.jgi.doe.gov) and banana (www.cns.fr), and the prospect of the barley and wheat genome sequences in the coming years (www.barleygenome.org and www.wheatgenome.org), as well as ongoing sequencing projects in the eudicots, such as those for the cacao (www.cacaogenomedb.org), tomato (solgenomics.net), cassava (www.phytozome.org), Brassica (www.brassica-rapa.org), cucumber (www.icugi.org) and apple (www.reeis.usda.gov) genomes, will help to continue refining the degree of collinearity between the plant genomes as well as the evolutionary pathway that shaped them over 150–300 million years of speciation.

Acknowledgments

This work has been supported by grants from the Agence Nationale de la Recherche (Program ANRjc-PaleoCereal, ref: ANR-09-JCJC-0058-01).


References

1 Friis, E.M. et al. (2006) Cretaceous angiosperm flowers: innovation and evolution in plant reproduction. Palaeogeogr. Palaeocl. Palaeoecol. 232, 251–293
2 Paterson, A.H. et al. (2009) The Sorghum bicolor genome and the diversification of grasses. Nature 457, 551–556
3 Moore, M.J. et al. (2007) Using plastid genome-scale data to resolve enigmatic relationships among basal angiosperms. Proc. Natl. Acad. Sci. U. S. A. 104, 19363–19368
4 Schnable, P.S. et al. (2009) The B73 maize genome: complexity, diversity, and dynamics. Science 326, 1112–1115
5 International Rice Genome Sequencing Project (2005) The map-based sequence of the rice genome. Nature 436, 793–800
6 International Brachypodium Initiative (2010) Genome sequencing and analysis of the model grass Brachypodium distachyon. Nature 463, 763–768
7 Jaillon, O. et al. (2007) The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449, 463–467
8 Tuskan, G.A. et al. (2006) The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313, 1596–1604
9 Ming, R. et al. (2008) The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus). Nature 452, 991–996
10 Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815
11 Schmutz, J. et al. (2010) Genome sequence of the palaeopolyploid soybean. Nature 463, 178–183
12 Qi, L.L. et al. (2004) A chromosome bin map of 16,000 expressed sequence tag loci and distribution of genes among the three genomes of polyploid wheat. Genetics 168, 701–712
13 Stein, N. et al. (2007) A 1,000-loci transcript map of the barley genome: new anchoring points for integrative grass genomics. Theor. Appl. Genet. 114, 823–839
14 Trick, M. et al. (2009) Complexity of genome evolution by segmental rearrangement in Brassica rapa revealed by sequence-level analysis. BMC Genomics 10, 539
15 Salse, J. and Feuillet, C. (2007) Comparative genomics of cereals. In Genomics-Assisted Crop Improvement (Varshney, R.K. and Tuberosa, R., eds), pp. 177–205, Springer Verlag
16 Bolot, S. et al. (2009) The ‘inner circle’ of the cereal genomes. Curr. Opin. Plant Biol. 12, 119–125
17 Salse, J. et al. (2009) Improved standards and new comparative genomics tools provide new insights into grasses paleogenomics. Brief. Bioinform. 10, 619–630
18 Salse, J. et al. (2008) Identification and characterization of conserved duplications between rice and wheat provide new insight into grass genome evolution. Plant Cell 20, 11–24
19 Salse, J. et al. (2009) Reconstruction of monocotelydoneous protochromosomes reveals faster evolution in plants than in animals. Proc. Natl. Acad. Sci. U. S. A. 106, 14908–14913
20 Rice Chromosomes 11 and 12 Sequencing Consortia (2005) The sequence of rice chromosomes 11 and 12, rich in disease resistance genes and recent gene duplications. BMC Biol. 3, 20
21 Swigoňová, Z. et al. (2004) Close split of sorghum and maize genome progenitors. Genome Res. 14, 1916–1923
22 Wei, F. et al. (2007) Physical and genetic structure of the maize genome reflects its complex evolutionary history. PLoS Genet. 3, e123
23 Velasco, R. et al. (2007) A high quality draft consensus sequence of the genome of a heterozygous grapevine variety. PLoS One 2, e1326
24 Luo, M.C. et al. (2009) Genome comparisons reveal a dominant mechanism of chromosome number reduction in grasses and accelerated genome evolution in Triticeae. Proc. Natl. Acad. Sci. U. S. A. 106, 15780–15785
25 Doyle, J.J. et al. (2008) Evolutionary genetics of genome merger and doubling in plants. Annu. Rev. Genet. 42, 443–461
26 Paterson, A.H. et al. (2010) Insights from the comparison of plant genome sequences. Annu. Rev. Plant Biol. 61, 349–372
27 Goettel, W. and Messing, J. (2009) Change of gene structure and function by non-homologous end-joining, homologous recombination, and transposition of DNA. PLoS Genet. 5, e1000516
28 Xu, J.H. and Messing, J. (2008) Diverged copies of the seed regulatory Opaque-2 gene by a segmental duplication in the progenitor genome of rice, sorghum, and maize. Mol. Plant 1, 760–769
29 Ganko, E.W. et al. (2007) Divergence in expression between duplicated genes in Arabidopsis. Mol. Biol. Evol. 24, 2298–2309
30 Blanc, G. and Wolfe, K.H. (2004) Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution. Plant Cell 16, 1679–1691
31 Throude, M. et al. (2009) Structure and expression analysis of rice paleo duplications. Nucleic Acids Res. 37, 1248–1259
32 Messing, J. et al. (2004) Sequence composition and genome organization of maize. Proc. Natl. Acad. Sci. U. S. A. 101, 14349–14354
33 Xu, J.H. and Messing, J. (2006) Maize haplotype with a helitron-amplified cytidine deaminase gene copy. BMC Genet. 7, 52
34 Haberer, G. et al. (2004) Transcriptional similarities, dissimilarities, and conservation of cis-elements in duplicated genes of Arabidopsis. Plant Physiol. 136, 3009–3022
35 Wang, X. et al. (2005) Duplication and DNA segmental loss in the rice genome: implications for diploidization. New Phytol. 165, 937–946
36 Tang, H. et al. (2008) Unraveling ancient hexaploidy through multiply aligned angiosperm gene maps. Genome Res. 18, 1944–1954
37 Seoighe, C. and Gehring, C. (2004) Genome duplication led to highly selective expansion of the Arabidopsis thaliana proteome. Trends Genet. 20, 461–464
38 Blomme, T. et al. (2006) The gain and loss of genes during 600 million years of vertebrate evolution. Genome Biol. 7, R43
39 Davis, J.C. and Petrov, D.A. (2005) Do disparate mechanisms of duplication add similar genes to the genome? Trends Genet. 21, 548–551
40 Chen, Z.J. (2007) Genetic and epigenetic mechanisms for gene expression and phenotypic variation in plant polyploids. Annu. Rev. Plant Biol. 58, 377–406
41 Bailey, J.A. et al. (2002) Recent segmental duplications in the human genome. Science 297, 1003–1007
42 Lynch, M. and Conery, J.S. (2000) The evolutionary fate and consequences of duplicate genes. Science 290, 1151–1155
43 Zhang, J. (2003) Evolution by gene duplication: an update. Trends Ecol. Evol. 18, 292–298
44 Fawcett, J.A. et al. (2009) Plants with double genomes might have had a better chance to survive the Cretaceous-Tertiary extinction event. Proc. Natl. Acad. Sci. U. S. A. 106, 5737–5742
45 Van de Peer, Y. et al. (2009) The flowering world: a tale of duplications. Trends Plant Sci. 14, 680–688
46 Donoghue, P.C.J. et al. (2005) Genome duplication, extinction and vertebrate evolution. Trends Ecol. Evol. 20, 312–319
47 Rieseberg, L.H. et al. (2003) Major ecological transitions in wild sunflowers facilitated by hybridization. Science 301, 1211–1216
48 Van de Peer, Y. et al. (2009) The evolutionary significance of ancient genome duplications. Nat. Rev. Genet. 10, 725–732
49 Hegarty, M. and Hiscock, S. (2007) Polyploidy: doubling up for evolutionary success. Curr. Biol. 17, R927–R929
50 Bicknell, R.A. and Koltunow, A.M. (2007) Understanding apomixis: recent advances and remaining conundrums. Plant Cell 16 (Suppl), S228–S245
51 Masood Quraishi, M. et al. (2009) Genomics in cereals: From genome-wide conserved orthologous set (COS) sequences to candidate genes for trait dissection. Funct. Integr. Genomics 9, 473–484
