Molecular Evolution of VEF-Domain-Containing PcG Genes in Plants


Molecular Plant • Volume 2 • Number 4 • Pages 738–754 • July 2009 RESEARCH ARTICLE Molecular Evolution of VEF-Domain-Containing PcG Genes in...



Ling-Jing Chen, Zhao-Yan Diao, Chelsea Specht and Z. Renee Sung¹

Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720–3102, USA

ABSTRACT

Arabidopsis VERNALIZATION2 (VRN2), EMBRYONIC FLOWER2 (EMF2), and FERTILIZATION-INDEPENDENT SEED2 (FIS2) are involved in vernalization-mediated flowering, vegetative development, and seed development, respectively. Together with Arabidopsis VEF-L36, they share a VEF domain that is conserved in plants and animals. To investigate the evolution of VEF-domain-containing genes (VEF genes), we analyzed sequences related to VEF genes across land plants. To date, 24 full-length sequences from 11 angiosperm families and 54 partial sequences from another nine families have been identified. The majority of the full-length sequences share the greatest sequence similarity with Arabidopsis EMF2 and possess the same major domain structure. EMF2-like sequences are not only widespread among angiosperms, but are also found in genomic sequences of gymnosperms, a lycophyte, and a moss. No FIS2- or VEF-L36-like sequences were recovered from plants other than Arabidopsis, including rice and poplar, for which whole genomes have been sequenced. Phylogenetic analysis of the full-length sequences showed a high degree of amino acid sequence conservation in EMF2 homologs of closely related taxa. VRN2 homologs are recovered as a clade nested within the larger EMF2 clade. FIS2 and VEF-L36 are recovered in the VRN2 clade. The VRN2 clade may have evolved from an EMF2 duplication event that occurred in the rosids prior to the divergence of the eurosid I and eurosid II lineages. We propose that dynamic changes in genome evolution contribute to the generation of the family of VEF-domain-containing genes. Phylogenetic analysis of the VEF domain alone showed that VEF sequences continued to evolve following the EMF2/VRN2 divergence in accordance with species relationships. The existence of EMF2-like sequences in animals and across land plants suggests that a prototype form of EMF2 was present prior to the divergence of the plant and animal lineages. A proposed sequence of events, based on domain organization and the occurrence of intermediate sequences throughout angiosperms, could explain VRN2 evolution from an EMF2-like ancestral sequence, possibly following duplication of the ancestral EMF2. Available data further suggest that VEF-L36 and FIS2 were derived from a VRN2-like ancestral sequence. Thus, the presence of VEF-L36 and FIS2 in a genome may ultimately depend upon the presence of a VRN2-like sequence.

Key words: VEF; EMF2; FIS2; VRN2; VEF-L36; Arabidopsis; PcG; phylogeny; evolution.

¹ To whom correspondence should be addressed. E-mail zrsung@nature.berkeley.edu, fax (510) 642-4995, tel. (510) 642-6966.

© The Author 2009. Published by the Molecular Plant Shanghai Editorial Office in association with Oxford University Press on behalf of CSPP and IPPE, SIBS, CAS. doi: 10.1093/mp/ssp032. Advance Access publication 19 June 2009. Received 10 March 2009; accepted 25 April 2009.

INTRODUCTION

Identifying genes that act in developmental pathways and determining how they or their interactions are modified throughout organismal evolution is a major focus of the field of evolutionary developmental biology. Understanding how genes and gene networks function during the development of the model plant Arabidopsis thaliana provides a starting point for investigating how characterized developmental pathways may have played a role in the evolution of diverse plant body plans (Irish and Benfey, 2004).

The Polycomb Group protein (PcG) genes play a major role in epigenetic regulation of gene expression. Originally characterized in Drosophila, they encode a conserved group of chromatin proteins found in animals and plants. Structurally different Drosophila PcG proteins form complexes that maintain the repression of target genes. A PcG protein complex composed of four core proteins, Suppressor of Zeste 12 (Su(z)12), Extra sex combs (Esc), P55, and Enhancer of zeste (E(z)) (Kuzmichev et al., 2002; Muller et al., 2002), can methylate histone H3 at lysine 27 through the E(z) SET domain, providing a methyl mark for subsequent transcriptional repression and gene silencing (Cao et al., 2002; Czermin et al., 2002; Muller et al., 2002). Arabidopsis genes structurally similar to Drosophila PcG genes have been reported and their mutants characterized: CURLY LEAF (CLF) (Goodrich et al., 1997), FERTILIZATION-INDEPENDENT SEED DEVELOPMENT1 (FIS1)/MEDEA (MEA) (Grossniklaus et al., 1998; Luo et al., 1999), SWINGER (SWN) (Chanvivattana et al., 2004), FIS3/FERTILIZATION-INDEPENDENT ENDOSPERM (FIE) (Ohad et al., 1999), FERTILIZATION-INDEPENDENT SEED2 (FIS2) (Luo et al., 1999), EMBRYONIC FLOWER2 (EMF2) (Yoshida et al., 2001), VERNALIZATION2 (VRN2) (Gendall et al., 2001), and MULTICOPY SUPPRESSOR OF IRA1 (MSI1) (Hennig et al., 2003). Evidence indicates that these genes encode proteins that form putative PcG complexes involved in maintaining the silencing of Arabidopsis MADS-box genes (Chanvivattana et al., 2004; Sung and Amasino, 2004; Wood et al., 2006).

Some PcG genes can be grouped into families based on sequence homology, such as CLF, MEA, and SWN (Chanvivattana et al., 2004) and EMF2, VRN2, and FIS2 (Yoshida et al., 2001). It is possible that these gene families are the result of gene duplication and subsequent diversification from ancestral sequences that were present prior to the divergence of the lineages that ultimately led to plants and animals. Duplication and diversification of nucleotide sequences have been shown to lead to functional innovation across the tree of life (Kim et al., 2004).

EMF2 is a core component of the putative PcG complex that represses flowering (Chanvivattana et al., 2004). Loss-of-function mutations in the EMF2 gene lead to elimination of vegetative growth in Arabidopsis (Yang et al., 1995), resulting in early flowering. EMF2 thus may have played a major role in plant survival and the evolution of phenological variability. Protein interactions between EMF2 and three other proteins, CLF (Goodrich et al., 1997), FIE (Kinoshita et al., 2001), and MSI1 (Hennig et al., 2003), suggest that they function as a protein complex in mediating floral repression. The putative EMF2/CLF or SWN/FIE/MSI1 complex represses the floral MADS-box genes AGAMOUS (AG), APETALA3 (AP3), and PISTILLATA (PI) during vegetative development (Moon et al., 2003; Calonje et al., 2008). CLF also represses flowering time genes, such as FLOWERING LOCUS T (FT) and AG-LIKE 19 (AGL19) (Schonrock et al., 2006; Jiang et al., 2008). FIS2 is a core component of the putative PcG complex FIS2/MEA/FIE/MSI1, which regulates Arabidopsis seed development via repression of PHERES1 during gametophyte and endosperm development (Kohler et al., 2003). VRN2 is a core component of another putative PcG complex, VRN2/CLF or SWN/FIE/MSI1, which induces flowering in response to vernalization via regulation of FLOWERING LOCUS C (FLC) (Sung and Amasino, 2004; Wood et al., 2006). It appears that the two groups of plant PcG genes, CLF-MEA-SWN and EMF2-VRN2-FIS2, have co-evolved to form multi-protein complexes that target different gene regulatory networks (Calonje and Sung, 2006).

The molecular similarity of the VEF genes suggests that they are related and may be the result of an historic gene duplication event followed by diversification. To understand how the Arabidopsis VEF gene family evolved, we investigated homologs of this gene family in Arabidopsis and other land plants. In this paper, we identified 85 partial and full-length sequences from land plants, with a taxonomic focus on flowering plants. Our results suggest that EMF2 is the most plesiomorphic form of the gene and may have acted as a prototype in the generation of the VEF gene family. Intragenic sequence duplication, deletion/insertion, and intergenic exon shuffling could account for the structural and functional diversification of the VEF genes from an EMF2-like ancestor. We propose that VRN2 evolved from an EMF2-like ancestor, and that VEF-L36 and FIS2 were derived from a VRN2-like ancestral sequence in Arabidopsis and possibly in other angiosperms.

RESULTS

Domain Organization in Arabidopsis VEF Family Proteins

Using the deduced EMF2 amino acid sequence as a BLAST query against GenBank, we recovered four full-length Arabidopsis proteins, EMF2 (At5g51230), FIS2 (At2g35670), VRN2 (At4g16845), and VEF-L36 (At4g16810), with significant e-values (<2e–12). In addition to the common VEF domain that defines this gene family (Figure 1), EMF2, VRN2, and FIS2 share a C2H2 domain. EMF2 and VRN2 further share an N-terminal domain (N-ter) that is present in the Drosophila homolog, Su(z)12, but is absent in FIS2 and VEF-L36. However, VRN2 differs from EMF2 in lacking the sequence corresponding to EMF2 exon 5 through exon 10 (E5–10), as well as a stretch of sequence at the N-terminus called the N-terminal cap (N-ter cap). VRN2 also has a 52-aa repeat in the C-terminus that is absent in EMF2. Despite these differences, VRN2 and EMF2 share a similar overall domain organization and 45% amino acid sequence identity.

First reported as EMF2-like 1 by Yoshida et al. (2001), VEF-L36 is a hypothetical protein based on its gene structure as predicted by TAIR (www.Arabidopsis.org/servlets/TairObject?id=128616&type=locus). It shares only the VEF domain with the other VEF proteins (Figure 1). Unlike in EMF2, VRN2, and FIS2, its VEF domain is located at the N-terminus, and its C-terminus comprises a sequence with low similarity to ribosomal protein L36. There is also a stretch of repeat sequence in the middle region that is not found in any of the other VEF genes.

Figure 1. Domain Organization of VEF-Domain-Containing Proteins of Arabidopsis. Blue block: EMF2 N-terminal domain (N-ter), which is composed of two parts: an N-terminal cap (cap) and the remaining part (N-ter Δcap) as seen in VRN2. Orange block: EMF2-specific E5–10 domain. Green block: C2H2 zinc finger domain. Red block: VEF domain, which is uniquely located at the N-terminus of VEF-L36. Pink block: EMF2/VRN2-specific E15–17 domain. Light-blue block: VEF-L36-specific repeat domain. Dark-green block: VEF-L36-specific L36 domain. Yellow block: FIS2-specific S-rich domain. Purple block: FIS2 C-terminal tail.
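The domain comparison among the four Arabidopsis VEF proteins can be restated as a small set-based sketch. The domain labels are transcribed from the text and Figure 1; the dictionary encoding, the function names `shared_domains` and `closest_architecture`, the use of Jaccard similarity, and the example query are illustrative assumptions and not part of the analysis reported here.

```python
# Domain inventory transcribed from the text and the Figure 1 legend.
# Encoding each protein as a set of labels is an illustrative simplification.
DOMAINS = {
    "EMF2": {"N-ter cap", "N-ter", "E5-10", "C2H2", "E15-17", "VEF"},
    "VRN2": {"N-ter", "C2H2", "E15-17", "VEF", "52-aa repeat"},
    "FIS2": {"C2H2", "VEF", "S-rich", "C-ter tail"},
    "VEF-L36": {"VEF", "VEF-L36 repeat", "L36"},
}


def shared_domains(*proteins):
    """Return the domain labels present in every named protein."""
    return set.intersection(*(DOMAINS[p] for p in proteins))


def closest_architecture(query_domains):
    """Hypothetical helper: name the reference protein whose domain set
    has the highest Jaccard similarity to the query set."""
    query = set(query_domains)

    def jaccard(reference):
        return len(reference & query) / len(reference | query)

    return max(DOMAINS, key=lambda name: jaccard(DOMAINS[name]))


if __name__ == "__main__":
    print(sorted(shared_domains("EMF2", "VRN2", "FIS2", "VEF-L36")))
    print(sorted(shared_domains("EMF2", "VRN2", "FIS2")))
    # Hypothetical query: a sequence lacking both the N-ter cap and E5-10.
    print(closest_architecture({"N-ter", "C2H2", "E15-17", "VEF"}))
```

Under this encoding, the VEF domain is the only label common to all four proteins, and a query lacking both the N-ter cap and E5–10 ranks closest to VRN2. Set overlap is only a coarse proxy for the alignment-based comparison described in the text.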

Widespread Distribution of EMF2/VRN2 Homologs among Land Plants

To investigate the distribution of homologs of VEF genes in plants, we used VEF-containing proteins to perform BLAST searches against the databases listed in Methods. Using the Arabidopsis EMF2 amino acid sequence as a BLAST query against GenBank, 10 full-length homologs were returned: eight from grasses (Poaceae), one from Carica (Caricaceae), and one from Silene (Caryophyllaceae) (Table 1). The grass homologs included one from wheat (Triticum aestivum), three from barley (Hordeum vulgare), two from maize (Zea mays), and two from rice (Oryza sativa). The Silene homolog is from Silene latifolia of Caryophyllaceae, a member of the core eudicots. The Chromatin Database (www.chromdb.org/) identifies three full-length sequences from poplar (Populus trichocarpa: VEF901, VEF902, and VEF904) and one partial sequence (VEF903). The full-length sequences are hereafter referred to as PtEMF2_1 (VEF901), PtEMF2_2 (VEF902), and PtEMF2_4 (VEF904) (see Table 1A). We also sequenced six full-length cDNAs from species in five different angiosperm families representing early-diverging monocots (Acorus; Acorales), higher monocots (Asparagus, Yucca; Asparagales), basal eudicots (Eschscholzia; Papaveraceae), and the asterids (Solanum; Solanaceae) (see Methods). The Kazusa DNA Research Institute provided one full-length sequence from Lotus japonicus (Fabaceae). Using the deduced amino acid sequences of these cDNAs as BLAST queries against GenBank, the same homologs were returned as when using the Arabidopsis EMF2 sequence. Using full-length VRN2, VEF-L36, and FIS2 as queries, we found mostly the same sequences as described above, likely due to sequence homology in the VEF domain. Pair-wise identity scores of these full-length sequences indicate that non-Arabidopsis sequences display higher identity to Arabidopsis EMF2 and VRN2 than to FIS2 and VEF-L36 (Table 2). Among these homologs, VEF-L36 shows the lowest pair-wise identity to other members (average score: 8), followed by FIS2 (average score: 17). Both show higher identity to VRN2 than to other EMF2/VRN2 homologs.
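For readers unfamiliar with pair-wise identity scores, the following is a minimal sketch of one common definition: the percentage of identical residues among aligned columns in which neither sequence has a gap. The exact scoring procedure used for Table 2 is described in Methods and may differ; the function name `percent_identity` and the two short aligned strings are hypothetical and are not real VEF sequences.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identical residues over columns where neither row has a gap.

    Both inputs must come from the same alignment, so they have equal length
    and use '-' for gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


if __name__ == "__main__":
    # Toy alignment (hypothetical residues, for illustration only).
    print(percent_identity("MKT-AL", "MRTQAL"))
```

Other conventions divide by the full alignment length or by the length of the shorter sequence, which lowers the score for sequences of very different lengths, so scores are only comparable when computed the same way.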

Sequence alignment of the 24 full-length proteins was performed using MUSCLE (www.ebi.ac.uk). All non-Arabidopsis full-length sequences possess N-terminal (N-ter), C2H2, and VEF domains homologous to those of the EMF2/VRN2 sequences (Figure 2), indicating a high conservation of domain organization. These sequences are not likely to be orthologs of FIS2 or VEF-L36, given both the presence of the EMF2/VRN2-characteristic N-ter domain and the absence of either the S-rich domain found in FIS2 or the L36 domain characteristic of VEF-L36 (Figure 1). Sixteen full-length, non-Arabidopsis sequences contain the complete N-ter, including the N-ter cap: ZmEMF2_1, ZmEMF2_2, HvEMF2_4, HvEMF2_5, LjEMF2, OsEMF2_4, AaEMF2, YfEMF2, AoEMF2, LeEMF2_1, SlEMF2, LjEMF2, PtEMF2_1, PtEMF2_2, TaEMF2_3, CpEMF2. Five sequences, EcEMF2_2, OsEMF2_9, HvEMF2_1, ZmEMF2_2, and PtEMF2_4, lack the N-ter cap. One sequence from barley, HvEMF2_1, lacks both the N-ter cap and the VEF domain. Together with Arabidopsis EMF2 and VRN2, these full-length EMF2/VRN2 sequences represent 14 species from 11 angiosperm families (Acoraceae, Asparagaceae, Agavaceae, Poaceae, Caryophyllaceae, Fabaceae, Brassicaceae, Solanaceae, Salicaceae, Caricaceae, and Papaveraceae). No discernible FIS2 or VEF-L36 orthologs were recovered from rice or poplar, despite the availability of full genomic sequences. In addition to the full-length sequences, we found ~140 incomplete sequences showing significant homology to three EMF2 domains in various genomic databases (see Methods). After the elimination of identical sequences, 54 new sequences homologous to one or more EMF2 domains were identified (Table 1B): (1) 9 ESTs possess N-terminal domain sequences, (2) 16 possess C2H2 domain sequences, and (3) 36 possess VEF domain sequences, from nine additional angiosperm families (Malvaceae, Vitaceae, Liliaceae, Vitaceae, Nymphaeaceae, Ranunculaceae, Asteraceae, Bromeliaceae, and Euphorbiaceae) (Table 1 and Supplemental Figure 1).
Altogether, 78 sequences (24 full-length and 54 partial) were identified from 20 angiosperm families. Outside of the angiosperms, we identified two gymnosperm ESTs from Pinaceae sharing homology with the EMF2 C-terminal domain (Supplemental Figure 2B and 2C), one each in Pinus taeda (pine) and Picea engelmannii (spruce), and two individual ESTs from the lycophyte species Selaginella moellendorffii (Table 1C). One Selaginella partial sequence (SdEMF2p_1) contains both N-ter and C2H2 domains, showing 44% and 39% identity to the respective domains of EMF2. The other Selaginella sequence (SdEMF2p) contains only the VEF domain, showing 58% identity to the VEF domain of EMF2 in a 145-aa region of overlap (Table 1C and Supplemental Figure 2A). The Chromatin Database yielded three full-length sequences homologous to EMF2 from Physcomitrella patens (Bryophyta; moss), PpEMF2_1, PpEMF2_2, and PpEMF2_3 (Table 1C). Despite low sequence similarity to Arabidopsis EMF2 (~25%), the moss sequences possess N-ter, C2H2, and VEF domains. The finding that EMF2/VRN2 homologs exist in lycophytes and mosses, with a domain structure similar to that of modern angiosperm EMF2, indicates that EMF2 was likely present in the genomes of early land plants (Supplemental Figure 2D).

Table 1. Full-Length and Partial Sequences of VEF Gene Homologs.

(A) Full-length VEF gene homologs from angiosperms.
Name | Family | Plant | Accession #
AaEMF2 | Acoraceae | Acorus americanus, sweet flag | GenBank: ABI99480
AoEMF2 | Asparagaceae | Asparagus officinalis, asparagus | GenBank: ABD85301
AtEMF2 | Brassicaceae | Arabidopsis thaliana | TAIR: AT5G51230
CpEMF2 | Caricaceae | Carica papaya | CoGe: Chr Supercontig_13 21118352–2159309
EcEMF2_1 | Papaveraceae | Eschscholzia californica, California poppy | GenBank: ABD98790
EcEMF2_2 | Papaveraceae | Eschscholzia californica, California poppy | GenBank: ABD98791
FIS2_692 | Brassicaceae | Arabidopsis thaliana | TAIR: AT2G35670
HvEMF2_1 | Poaceae | Hordeum vulgare, barley | GenBank: BAD99132
HvEMF2_4 | Poaceae | Hordeum vulgare, barley | GenBank: BAD99131
HvEMF2_5 | Poaceae | Hordeum vulgare, barley | GenBank: BAD99131
LeEMF2_1 | Solanaceae | Lycopersicon esculentum | GenBank: ABI99480
LjEMF2 | Fabaceae | Lotus japonicus | Legume database
OsEMF2_4 | Poaceae | Oryza sativa, rice | TIGR: LOC_Os04g08034
OsEMF2_9 | Poaceae | Oryza sativa, rice | TIGR: LOC_Os09g13630
PtEMF2_1 | Salicaceae | Populus trichocarpa, cottonwood | ChromDB: VEF901
PtEMF2_2 | Salicaceae | Populus trichocarpa, cottonwood | ChromDB: VEF902
PtEMF2_4 | Salicaceae | Populus trichocarpa, cottonwood | ChromDB: VEF904
SlEMF2 | Caryophyllaceae | Silene latifolia, white campion | GenBank: BAD93353
TaEMF2_3 | Poaceae | Triticum aestivum, wheat | GenBank: AAX78232
VRN2_445 | Brassicaceae | Arabidopsis thaliana | TAIR: AT4G16845
VEF_L36 | Brassicaceae | Arabidopsis thaliana | TAIR: AT4G16810
YfEMF2 | Yuccaceae | Yucca filamentosa, Yucca | GenBank: ABD85300
ZmEMF2_1 | Poaceae | Zea mays, maize | ChromDB: VEF101
ZmEMF2_2 | Poaceae | Zea mays, maize | ChromDB: VEF102

(B) EMF2/VRN2-related ESTs from angiosperms.

N-terminal (nine ESTs)
Name | Family | Plant | Accession #
CaEMF2p | Solanaceae | Capsicum annuum, pepper | GenBank: CA847455
GmEMF2p_3 | Fabaceae | Glycine max, soybean | TIGR: TC221104
GmEMF2p_4 | Fabaceae | Glycine max, soybean | TIGR: TC211671
GrEMF2p | Malvaceae | Gossypium barbadense, cotton | TIGR: TC40052
LsEMF2p_1 | Asteraceae | Lactuca saligna, lettuce | TIGR: TA10917_4236
MtEMF2p | Fabaceae | Medicago truncatula | TIGR: TC108897
SbEMF2p_2 | Poaceae | Sorghum bicolor, sorghum | TIGR: TA29013_4558
VvEMF2p_3 | Vitaceae | Vitis vinifera, grape | GenBank: CF609577
ZmEMF2p_3 | Poaceae | Zea mays, maize | TIGR: CD436196

C2H2 zinc finger (16 ESTs)
Name | Family | Plant | Accession #
CcEMF2p_1 | Rubiaceae | Coffea canephora | TIGR: TA7702_49390
CsEMF2p_1 | Asteraceae | Centaurea solstitialis | TIGR: TA4722_347529
CtEMF2p | Asteraceae | Carthamus tinctorius | TIGR: TA2823_4222
EeEMF2p | Euphorbiaceae | Euphorbia esula | TIGR: TA17942_3993
GmEMF2p_3 | Fabaceae | Glycine max, soybean | TIGR: TC221104
GtrEMF2p | Asteraceae | Gerbera hybrid cv. Terra Regina | GenBank: AJ759904
LeEMF2p_2 | Solanaceae | Lycopersicon esculentum, tomato | GenBank: AW038171
SbEMF2p_2 | Poaceae | Sorghum bicolor, sorghum | TIGR: TA29013_4558
ScEMF2p | Poaceae | Secale cereale, cereal rye | GenBank: BE587348
SoEMF2p_2 | Poaceae | Saccharum officinarum, sugarcane | TIGR: TA38345_4547
SoEMF2p_3 | Poaceae | Saccharum officinarum, sugarcane | TIGR: TC71329
SoEMF2p_1 | Poaceae | Saccharum officinarum, sugarcane | GenBank: CA098901
TaEMF2p_2 | Poaceae | Triticum aestivum, wheat | GenBank: BJ211655
ToEMF2p_1 | Asteraceae | Taraxacum officinale | TIGR: TA5836_50225
VvEMF2p_1 | Vitaceae | Vitis vinifera, grape | GenBank: CN006883
ZmEMF2p_4 | Poaceae | Zea mays, maize | TIGR: TA193846_4577

VEF domain (36 ESTs)
Name | Family | Plant | Accession #
AcEMF2p | Liliaceae | Allium cepa | GenBank: CF443745
AfEMF2p | Ranunculaceae | Aquilegia formosa | TIGR: TA14166_338618
AnanasEMF2p | Bromeliaceae | Ananas comosus | GenBank: DT339533
BnEMF2p_1 | Brassicaceae | Brassica napus | GenBank: CX194398
BnEMF2p_2 | Brassicaceae | Brassica napus | GenBank: CX188412
CcEMF2p | Rubiaceae | Coffea canephora | TIGR: TA7701_49390
CiEMF2p_1 | Asteraceae | Cichorium intybus | GenBank: EH708467
CiEMF2p_2 | Asteraceae | Cichorium intybus | TIGR: TA5136_13427
CsEMF2p | Asteraceae | Centaurea solstitialis | GenBank: EH782846
EeEMF2p | Euphorbiaceae | Euphorbia esula | TIGR: TA17942_3993
GhEMF2p_1 | Malvaceae | Gossypium hirsutum, cotton | GenBank: DW229901
GhEMF2p_2 | Malvaceae | Gossypium hirsutum, cotton | TIGR: TA37052_3635
GhEMF2p_3 | Malvaceae | Gossypium hirsutum, cotton | TIGR: TA35411_3635
GmEMF2p_1 | Fabaceae | Glycine max, soybean | TIGR: TA61896_3847
HaEMF2p_1 | Asteraceae | Helianthus annuus, sunflower | GenBank: CD848472
HpEMF2p_1 | Asteraceae | Helianthus paradoxus, sunflower | GenBank: EL487885
LeEMF2p_3 | Solanaceae | Lycopersicon esculentum, tomato | GenBank: BI932726
LsEMF2p | Asteraceae | Lactuca saligna, lettuce | TIGR: TA3490_75948
LvEMF2p | Asteraceae | Lactuca virosa, wild lettuce | GenBank: DW160707
MeEMF2p | Euphorbiaceae | Manihot esculenta, cassava | GenBank: DV449784
MtEMF2p | Fabaceae | Medicago truncatula | TIGR: TC108897
NaEMF2p | Nymphaeaceae | Nuphar advena, yellow pondlily | FGP: nad03-13ms2-e08
NtEMF2p | Solanaceae | Nicotiana tabacum, tobacco | GenBank: EB678277
PsEMF2p | Fabaceae | Pisum sativum, pea | GenBank: AAX47184
SbEMF2p_1 | Poaceae | Sorghum bicolor, sorghum | TIGR: TA34517_4558
SbEMF2p_3 | Poaceae | Sorghum bicolor, sorghum | TIGR: TA35158_4558
LeEMF2p | Solanaceae | Solanum lycopersicum, tomato | GenBank: AW038171
SoEMF2p_1 | Poaceae | Saccharum officinarum, sugarcane | GenBank: CA098901
SoEMF2p_2 | Poaceae | Saccharum officinarum, sugarcane | TIGR: TA38345_4547
SoEMF2p_4 | Poaceae | Saccharum officinarum, sugarcane | GenBank: CA098901
StEMF2p_2 | Solanaceae | Solanum tuberosum | TIGR: TA35890_4113
StEMF2p_3 | Solanaceae | Solanum tuberosum | GenBank: BQ505017
TaEMF2p_1 | Poaceae | Triticum aestivum, wheat | TIGR: TA70383_4565
VvEMF2p_2 | Vitaceae | Vitis vinifera, grape | TIGR: TA47215_29760
VvEMF2p_4 | Vitaceae | Vitis vinifera, grape | GenBank: AM447481
ZmEMF2p_4 | Poaceae | Zea mays, maize | TIGR: TA193846_4577

(C) EMF2/VRN2 homologs from gymnosperms, spikemoss, and moss.

Gymnosperm
Name | Family | Plant | Accession #
PeEMF2p | Pinaceae | Picea engelmannii, spruce | TIGR: TA1969_373101
PlEMF2p | Pinaceae | Pinus taeda, pine | GenBank: CO368996

Lycophyte
Name | Family | Plant | Accession #
SdEMF2p | Selaginellaceae | Selaginella moellendorffii, spikemoss | gnl|050718cr339|1588846_1
SdEMF2p_1 | Selaginellaceae | Selaginella moellendorffii, spikemoss | gnl|050718cr339|1588846_2

Moss
Name | Family | Plant | Accession #
PpEMF2_1 | Funariaceae | Physcomitrella patens, moss | ChromDB: VEF1501
PpEMF2_2 | Funariaceae | Physcomitrella patens, moss | ChromDB: VEF1502
PpEMF2_3 | Funariaceae | Physcomitrella patens, moss | ChromDB: VEF1503

Note: ‘p’ in the sequence name stands for partial sequence. The letters in the sequence name stand for the following plants: Aa: Acorus americanus, Ac: Allium cepa, Af: Aquilegia formosa, Ao: Asparagus officinalis, At: Arabidopsis thaliana, Bn: Brassica napus, Ca: Capsicum annuum, Cc: Coffea canephora, Ci: Cichorium intybus, Cp: Carica papaya, Cs: Centaurea solstitialis, Ct: Carthamus tinctorius, Ec: Eschscholzia californica, Ee: Euphorbia esula, Gh: Gossypium hirsutum, Gm: Glycine max, Gr: Gossypium barbadense, Gtr: Gerbera, Ha: Helianthus annuus, Hp: Helianthus paradoxus, Hv: Hordeum vulgare, Le: Lycopersicon esculentum, Lj: Lotus japonicus, Ls: Lactuca saligna, Lv: Lactuca virosa, Me: Manihot esculenta, Mt: Medicago truncatula, Na: Nuphar, Nt: Nicotiana tabacum, Os: Oryza sativa, Pe: Picea engelmannii, Pl: Pinus taeda, Pp: Physcomitrella patens, Ps: Pisum sativum, Pt: Populus trichocarpa, Sb: Sorghum bicolor, Sc: Secale cereale, Sd: Spikemoss, Sl: Silene latifolia, So: Saccharum officinarum, St: Solanum tuberosum, Ta: Triticum aestivum, To: Taraxacum officinale, Vv: Vitis vinifera, Yf: Yucca filamentosa, Zm: Zea mays.
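As a quick cross-check of the sequence counts reported in this section, the tally below adds the angiosperm sequences to those from gymnosperms, the lycophyte, and moss. The numbers are taken from the text; the dictionary keys are informal labels chosen here for illustration.

```python
# Counts as reported in the text of this section.
counts = {
    "angiosperm full-length": 24,
    "angiosperm partial": 54,
    "gymnosperm ESTs": 2,
    "lycophyte partial": 2,
    "moss full-length": 3,
}

angiosperm_total = counts["angiosperm full-length"] + counts["angiosperm partial"]
land_plant_total = sum(counts.values())

print("angiosperm sequences:", angiosperm_total)
print("all land plant sequences:", land_plant_total)
```

The angiosperm subtotal is 78, and the overall total of 85 agrees with the number of partial and full-length land plant sequences stated in the Introduction.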

Sequence Comparison of EMF2/VRN2 Class Homologs

Predicted full-length and partial EMF2/VRN2 protein homologs were aligned using MUSCLE (see Methods).

N-terminal (N-ter) Domain

An N-terminal domain for Arabidopsis EMF2 was defined by Yoshida et al. (2001) as a fragment spanning amino acids 47 to 81 (Figure 2A). The domain is also found in VRN2 and in the Drosophila Su(z)12 protein. Our alignment of the full-length sequences from all identified EMF2/VRN2 class homologs shows that a larger area is conserved across land plants, extending from the first amino acid of EMF2 to the end of exon 4 (aa 81); this area is hereafter referred to as the N-ter domain (Figure 2A). Relative to EMF2, VRN2 has an abbreviated N-ter domain, starting translation from a methionine (M) corresponding to the second M of EMF2. The sequence between the two methionines of EMF2 is referred to as the N-ter cap. EMF2/VRN2 homologs of the monocots Acorus, Yucca, and Asparagus and of the basal eudicot Eschscholzia all contain the N-ter cap (Figure 2A), suggesting that the angiosperm ancestral sequence may have had both M sites, similar to Arabidopsis EMF2. Indeed, Selaginella SdEMF2p_1 and the Physcomitrella sequence PpEMF2_3 (VEF1503) have an N-ter cap (Supplemental Figure 2D), although the sequences and lengths are divergent from those found in the identified angiosperm sequences. In some sequences, the second M of the N-ter cap is replaced by a different amino acid; for example, the second M of SlEMF2 is replaced by an S (Figure 2A). In species with two or more EMF2 class homologous sequences found so far, at least one sequence has such a cap, such as rice (OsEMF2_4 vs. OsEMF2_9), maize (ZmEMF2_1 vs. ZmEMF2_2), poplar (PtEMF2_1 and PtEMF2_2 vs. PtEMF2_4), and California poppy (EcEMF2_1 vs. EcEMF2_2) (Figure 2A and Supplemental Figure 1A). The early land plants also possess at least one sequence with the N-ter cap (Supplemental Figure 2A and 2D).

E5–10 Domain

VRN2 is missing most of the amino acid sequence corresponding to EMF2 exon 5 through exon 10 (E5–10). Comparison of EMF2 and VRN2 genomic DNA revealed no conserved corresponding sequence in VRN2 in this region, excluding the possibility of alternative mRNA splicing as the cause of the difference. One full-length sequence, PtEMF2_4 from Populus trichocarpa (poplar) (Figure 2 and Supplemental Figure 1B), and three partial sequences, MtEMF2p from Medicago truncatula, GmEMF2p_3 from Glycine max, and CaEMF2p from Capsicum annuum, lack both the E5–10 domain and the N-ter cap, like VRN2. Within the E5–10 domain, amino acids encoded by EMF2 exons 5 (Figure 2A), 6, and 8 were highly conserved among the non-Arabidopsis EMF2 homologs, suggesting a potentially conserved function of the E5–10 region. Alignment analysis suggested the presence of E5–10 in all three Physcomitrella sequences, though with divergent sequences (Supplemental Figure 2D).

Table 2. Pair-Wise Alignment Scores of Full-Length VEF Protein Homologs.

No. Name      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
 1. AaEMF2    -
 2. AoEMF2   61  -
 3. AtEMF2   54 55  -
 4. CpEMF2   57 58 68  -
 5. EcEMF2_1 53 55 53 54  -
 6. EcEMF2_2 52 52 51 51 57  -
 7. FIS2_692 18 19 16 18 19 18  -
 8. HvEMF2_1 42 49 40 43 40 43 12  -
 9. HvEMF2_4 52 60 49 52 50 49 17 78  -
10. HvEMF2_5 42 46 41 40 37 39 16 54 58  -
11. LeEMF2_1 58 59 58 62 53 52 18 41 50 42  -
12. LjEMF2   40 40 42 46 40 38 17 32 37 34 43  -
13. OsEMF2_4 45 50 42 45 45 43 17 53 61 54 43 32  -
14. OsEMF2_9 53 61 50 52 52 52 18 71 82 61 51 37 62  -
15. PpEMF2_1 25 25 27 26 25 25 13 17 27 23 27 21 22 24  -
16. PpEMF2_2 24 23 26 23 23 20 10 15 21 20 24 15 22 23 69  -
17. PpEMF2_3 26 28 29 28 29 26 15 20 29 22 28 22 28 29 55 53  -
18. PtEMF2_1 57 56 63 71 53 51 17 42 49 39 59 41 44 51 23 25 28  -
19. PtEMF2_2 56 58 63 70 53 50 19 40 50 42 60 42 42 51 27 21 32 84  -
20. PtEMF2_4 61 60 53 54 52 58 30 25 49 43 57 44 43 50 27 31 30 56 55  -
21. SlEMF2   53 53 57 62 50 47 18 35 45 39 58 44 39 47 23 19 27 57 58 49  -
22. TaEMF2_3 51 58 49 51 51 48 18 80 93 57 49 35 60 81 27 23 29 48 50 48 45  -
23. VEF_L36   8  8  8  8  8  7  7  5  8  7  8  6  8  8  8  9  7  8  8 12  8  8  -
24. VRN2_445 46 47 45 45 43 46 31 20 44 34 48 36 37 44 29 29 31 45 44 51 42 44 14  -
25. YfEMF2   61 82 56 58 57 52 18 50 58 47 59 40 49 60 25 24 28 57 59 60 53 57  9 47  -
26. ZmEMF2_1 51 57 48 51 48 47 16 68 75 58 50 36 60 77 23 23 27 47 48 48 46 78  8 43 55  -
27. ZmEMF2_2 55 59 51 51 51 52 17 71 80 60 53 39 62 81 24 22 29 51 51 48 49 80  8 44 59 80  -
Average      46 49 46 48 44 43 17 42 51 41 47 35 43 51 26 25 28 47 47 46 43 51  8 40 49 49 51

Note: The numbers in the top row refer to the sequences with the same numbers in the first column. Calculation of pair-wise alignment scores is described in Methods. Average scores were calculated as the sum of the individual scores for one sequence divided by 26. Among these homologs, VEF-L36 showed the lowest identity to other members (average score: 8), followed by FIS2 (average score: 17). On the other hand, both showed higher identity to VRN2 than to other EMF2/VRN2 homologs (pair-wise alignment score between VEF-L36 and VRN2: 14; between FIS2 and VRN2: 31). The average pair-wise alignment score of the other EMF2/VRN2 members was ~44, calculated as the sum of the average scores (excluding 8 and 17) divided by 25.
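The averaging described in the note to Table 2 (the sum of all pair-wise scores involving one sequence, divided by the number of other sequences) can be sketched for a lower-triangular score matrix as follows. To keep the example short, it uses only the four Arabidopsis proteins, with scores transcribed from Table 2; the resulting averages therefore differ from the 27-sequence averages in the table. The function names are illustrative assumptions.

```python
# Scores among the four Arabidopsis proteins, transcribed from Table 2.
names = ["AtEMF2", "FIS2_692", "VEF_L36", "VRN2_445"]
triangle = [
    [],            # AtEMF2
    [16],          # FIS2_692 vs AtEMF2
    [8, 7],        # VEF_L36  vs AtEMF2, FIS2_692
    [45, 31, 14],  # VRN2_445 vs AtEMF2, FIS2_692, VEF_L36
]


def scores_for(index, triangle):
    """All scores involving one sequence, ordered by partner index."""
    row = list(triangle[index])  # partners listed before it
    column = [triangle[j][index] for j in range(index + 1, len(triangle))]
    return row + column          # partners listed after it


def average_score(index, triangle):
    """Sum of a sequence's scores divided by the number of partners."""
    return sum(scores_for(index, triangle)) / (len(triangle) - 1)


def best_partner(index, triangle, names):
    """Name of the partner with the highest score."""
    partners = [name for i, name in enumerate(names) if i != index]
    return max(zip(scores_for(index, triangle), partners))[1]


if __name__ == "__main__":
    for i, name in enumerate(names):
        print(name, round(average_score(i, triangle), 1),
              best_partner(i, triangle, names))
```

Within this subset, both FIS2 and VEF-L36 score highest against VRN2, consistent with the note to Table 2. With the full 27-sequence matrix, the divisor becomes 26, as stated in the note.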

C2H2 Zinc Finger Domain

Unlike most Arabidopsis C2H2-domain-containing genes, which have multiple C2H2 motifs in tandem (Englbrecht et al., 2004), the three VEF proteins VRN2, FIS2, and EMF2 contain a single C2H2 motif, which is encoded by exons 12 and 13 in EMF2. Previous studies found a conserved 37-aa C2H2 domain sequence in EMF2 homologs from Drosophila, human, and Arabidopsis (Yoshida et al., 2001). Alignment of the full-length sequences shows a conserved region extending from EMF2 exon 11 through exon 14 in the EMF2 homologs, covering a range of ~77 aa. In the EMF2/VRN2 class, the C2H2 of VRN2 is the most divergent from that of other members (Figure 2B). Selaginella SdEMF2p_1 has a C2H2 region with 39% identity to that of EMF2 (Supplemental Figure 2A). Physcomitrella PpEMF2_3 has a region corresponding to the C2H2 of EMF2; its two Cs line up with those in EMF2, but the two Hs are absent (Supplemental Figure 2D).

Figure 2. Alignment of Three Domains of Predicted Full-Length Plant VEF Proteins.

E15–17 Domain

E15–17 is a region encoded by EMF2 exons 15 to 17, connecting the C2H2 and VEF domains of EMF2, VRN2, and FIS2. Alignment of the EMF2 homologs shows that this region has the highest variability of all EMF2 domains in both amino acid sequence composition and total length, suggesting intensive diversification, including multiple insertion and/or deletion events, during the evolution of this region (Supplemental Figure 1D). All three Physcomitrella sequences, including PpEMF2_3, appear to possess this region.

[...] similar to EMF2 in that it possesses the N-ter cap, E5–10, C2H2-like, and VEF regions.

Phylogenetic Analysis of Full-Length and VEF Sequences

Phylogenetic analysis of the full-length sequences using maximum likelihood and Bayesian methods recovered various lineages reflecting organismal evolution (Figures 3 and 4). Using human and Drosophila sequences as outgroups, phylogenetic analyses of full-length sequences (Figure 3) and VEF domain alone (Figure 4) both recovered a monophyletic angiosperm lineage with monophyletic monocot and eudicot clades. Within the monocots, the grasses (Poales) were also recovered as monophyletic in both full-length and VEF-based gene trees. For VEF domain analyses containing greater sampling of land plant diversity, gymnosperms were found to be monophyletic and sister to angiosperms, Selaginella sister to an angiosperm plus gymnosperm clade, and Physcomitrella sequences sister to remaining land plants. As with full-length sequences, monocots are recovered as monophyletic; however, Eschscholzia, unresolved in the full-length analysis, groups with Aquilegia VEF domain (Figure 4), forming a basal eudicot clade sister to monocots. This clade is unresolved with respect to monocots and core eudicots. Within monophyletic core eudicots, the asterids and rosids are roughly falling out as separate clades, with a few exceptions (e.g. Silene within rosid clade, two sequences of Gossypium recovered as sister to the rosid plus asterid sister group, Lotus japonicus within an otherwise monophyletic asterid clade, and one Helianthus sequence falling within the rosids rather than the asterids). In addition, several sequences from core eudicot species are resolved in a clade containing VRN2, FIS2, and VEF-L36 (Figure 4). This clade is distant from AtEMF2, indicating a different evolutionary history for the VEF domain of VRN2, FIS2, and VEF-L36. In the full-length analyses, PtEMF2_4 or VEF904, a proposed VRN2 ortholog from Populus, is strongly supported within a VRN2 clade reflecting potential homology (or full-length sequence conversion) of the Populus sequence with VRN2. 
In the VEF domain analyses, this Populus sequence groups with other Populus sequences rather than with the VRN2 clade, indicating that the VEF domain itself is not converging on a VRN2-like VEF domain, despite full sequence and domain-level similarity. Another potential VRN2 ortholog, Medicago truncatula’s MtEMF2p, lacking the E5–10 domain and the N-ter cap, is grouped in the VRN2 clade. It remains

VEF (C-terminal) Domain Alignment of C-terminal sequences of EMF2, VRN2, FIS2, Su(z)12, and the human KIAA0160 led Yoshida et al. (2001) to define an acidic-W/M domain, ;130 aa from exons 18–22 in Arabidopsis EMF2, which is characterized by an acidic cluster and a sequence rich in tryptophan and methionine. A smaller region was later called the VEF domain derived from the initials of VRN2, EMF2, and FIS2 (Birve et al., 2001), which did not include sequences in exon 18, but extended beyond that of the acidic-W/M domain (Figure 2). In this paper, we adopt a broader sense of the VEF domain, encompassing both the acidic-W/M, defined by Yoshida et al. (2001), and the VEF, by Birve et al. (2001), domains (Supplemental Figure 1E–1G). VRN2hasan additional52-aaC’oftheVEFdomain(Supplemental Figure 1G and Figures 1 and 2C) that is not shared with other EMF2-class proteins, including VRN2-like sequences, full-length or partial from plants other than Arabidopsis. Analysis using RADAR (www.ebi.ac.uk/Radar/) suggests that this 52-aa region is a duplication of a stretch of amino acids found within the VEF domain (Supplemental Figure 1G). Selaginella SdEMF2p corresponds to the VEF domain (Supplemental Figure 2A). All three Physcomitrella sequences and the two partial gymnosperm sequences possess the VEF domain (Supplemental Figure 2B–2D). None of the VEF domains found in Physcomitrella, Selaginella, pine, or spruce possesses the VRN2-characteristic repeat sequence in their Ctermini, indicating that this repeat likely evolved in angiosperms after the divergence of the gymnosperm lineage. Among the three moss sequences, PpEMF2_3 is the most sim-

The T-COFFEE (Version 4.85) program was used for the sequence alignment. Vertical lines on top of the sequence mark the boundaries of EMF2 exons, and the arrows and numbers prefixed with an E on top of the sequence indicate EMF2 exons. (A) N-ter domain. Light-blue bar on top of the sequence marks the N-ter domain. Colorless horizontal bar marks the N-ter cap. Dark-blue bar marks the N-terminal domain defined by Yoshida et al. (2001). (B) C2H2 domain. Green bar on top of the sequence marks the C2H2 domain defined by Yoshida et al. (2001). Numbers –1, +3, and +6 denote the position relative to the start site of the α-helix of the C2H2 domain. (C) VEF domain. Red and yellow horizontal bars on top of the sequence mark the C-terminal domain defined by Yoshida et al. (2001) and the VEF domain defined by Birve et al. (2001), respectively. Because VEF-L36 only shares VEF with other homologs, its middle and C-terminal sequences were cut off.


Figure 3. Phylogenetic Analysis of Full-Length VEF Protein Homologs. Phylogeny of EMF2/VRN2 using Bayesian inference; average branch lengths are shown. Measures of support are given at the nodes; Bayesian posterior probability (PP)/maximum likelihood bootstrap support (BS). Support values less than 50 are shown as hyphen "-" and support values of 100 are shown as "+".

to be tested whether other homologs with VRN2-like domain structure will have their VEF sequence converge with AtVRN2 VEF amino acid sequence.

Sequence Relationship between VEF-L36 and EMF2/VRN2

VEF-L36 cDNA was deduced from Arabidopsis genomic sequence (TAIR: www.Arabidopsis.org/servlets/TairObject?id=128616&type=locus) but has not been assayed for function. The 1872-bp open reading frame encodes a predicted 623-aa protein, with the 125-aa VEF located at the N-terminus and a 113-aa C-terminus with only low sequence similarity to L36. The RADAR program detected three types of repeat sequence in the middle region of VEF-L36 (Figure 1 and Supplemental Figure 3A). Except for the VEF domain, VEF-L36 shares no other domains with the other three Arabidopsis VEF proteins. Using its 495-aa sequence without the VEF domain to BLAST search against GenBank, we found three Arabidopsis fragments and one rice homolog, as well as a few sequences in non-plant organisms, such as Drosophila, Dictyostelium, Danio, and Trypanosoma, all lacking the VEF domain (Supplemental Figure 3B). The rice homolog encodes a 410-aa protein with low global homology to the non-VEF part of VEF-L36 (22% identity and 37% similarity, Supplemental Figure 3C). To date, VEF-L36 is the only gene found with both VEF and L36 domains. The VEF domain of VEF-L36 is more closely related to that of VRN2 than to that of EMF2, as indicated by phylogenetic analyses of both the VEF domain alone and of full-length sequences (Figures 3 and 4). Among the divergent amino acids between EMF2 and VRN2, VEF-L36 shares nine with VRN2 and only three with EMF2 (Table 3). Moreover, VRN2 (AT4G16845) and VEF-L36 (AT4G16810) are closely linked on Arabidopsis chromosome 4. Among the VEF-domain-containing proteins, the VEF domain in VEF-L36 is the only one located at the N-terminus of a protein. Together, these observations suggest that the VEF domain of VEF-L36 may have been transferred from VRN2 on a sister chromatid through an accidental intronic recombination event during meiosis (Figure 5C). This would imply that only plants with VRN2 may generate VEF-L36. So far, VEF-L36 has only been identified in Arabidopsis.

Sequence Relationship between FIS2 and EMF2/VRN2

FIS2 is similar to EMF2/VRN2 in possessing a single C2H2 and the VEF domain, which are connected by a 459-aa region with 70 serines, called the S-rich domain. In addition to the two types of repeats identified previously (Luo et al., 1999), RADAR identified a third type of repeat in the S-rich domain (Supplemental Figure 4A). Sequences homologous to the S-rich domain have been found in plants, fungi, bacteria, and animals, but none share the C2H2 or VEF domains with FIS2. Despite the abundance of S-rich homologous domains in nature, the rarity of the S-rich domain in the VEF-domain-containing protein family suggests that FIS2 may represent a unique evolutionary event within the Arabidopsis lineage. The C2H2 domain of FIS2 has greater sequence similarity to VRN2 than to EMF2 (Table 3). The VEF domain of FIS2 shows a closer phylogenetic relationship to the VEF domain of VRN2 than to that of EMF2 (Figure 4), forming a clade with the VRN2 sequence, indicating common ancestry to the exclusion of EMF2. Among the amino acids diverged between EMF2 and VRN2, FIS2 shares 20 identical amino acid residues with VRN2 and only eight with EMF2 in the VEF domain (Table 3). Globally, FIS2 shared a higher pair-wise alignment score with VRN2 than with EMF2 (29 vs. 18%; Table 2).

Figure 4. Phylogenetic Analysis of VEF Domain Sequences. Phylogeny of VEF domain using maximum likelihood as implemented in RAxML. Measures of support are given at the nodes; Bayesian posterior probability (PP)/maximum likelihood bootstrap support (BS). Support values less than 50 are shown as hyphens (-). Taxonomic groups indicated at right, with exceptions described in text.

Table 3. Number of Amino Acids Shared between FIS2/VEF-L36 and VRN2 or EMF2*.

Identical aa between:   FIS2 and VRN2   FIS2 and EMF2   VEF-L36 and VRN2   VEF-L36 and EMF2
C2H2 domain             20/131          8/131           na                 na
VEF domain              20/116          5/116           9/98               3/98

* Among the divergent amino acids between EMF2 and VRN2, the number of aa shared with EMF2 or VRN2 out of total number of aa in the domain. na, not applicable.

Figure 5. Model on VRN2, FIS2, and VEF-L36 Evolution. (A) Proposed VRN2 evolution from EMF2. (B) FIS2 evolution from VRN2. (C) VEF-L36 evolution from VRN2.

DISCUSSION

The VEF domain is found in chromatin proteins required for gene silencing throughout eukaryotic organisms. In addition to the universal VEF domain, the VEF proteins possess other characteristic domains that distinguish them from one another. Based on domain organization, four Arabidopsis VEF proteins were grouped into three classes: EMF2/VRN2, FIS2, and VEF-L36 (Figure 1). Our analysis of homologous sequences throughout land plants indicates the existence of EMF2 in early diverging lineages of land plants (bryophytes and lycophytes) and suggests the presence of an ancestral EMF2-like gene in early land plants. Phylogenetic results (Figures 3 and 4) are consistent with the hypothesis that VRN2 was likely derived from an EMF2-like ancestor within the angiosperms, and that FIS2 and VEF-L36 were secondarily derived from a VRN2-like ancestral sequence in Arabidopsis. Current phylogenetic hypotheses are limited in taxon sampling and in character sampling, constrained by currently available sequences that are not equally distributed across angiosperm evolution and may not represent complete genomic data for all species sampled. Such limitations reduce overall phylogenetic resolution and make it difficult to assign orthology and paralogy to the available sequences in the face of multiple gene and genome duplication events spanning angiosperm evolution. However, given current sampling, our phylogenetic results indicate that EMF2-like genes in angiosperms demonstrate an evolutionary history largely consistent with the taxonomic history of the plants in which they are found.

Proposed Evolution of VEF Genes

The EMF2/VRN2 class proteins show strong sequence similarity despite modified domain structure. Sequences with the EMF2-like domain structure are widespread, found in animals and most vascular plants. Sequences with the VRN2-like domain structure, which like VRN2 lack the N-ter cap and E5–10, have only been identified in poplar (PtEMF2_4), pepper (CaEMF2p), alfalfa (MtEMF2p), and soybean (GmEMF2p_3) (Table 1). In Arabidopsis, EMF2 is an essential gene, as evidenced by the short-lived and sterile nature of the emf2 mutants. VRN2 promotes vernalization-mediated flowering and vrn2 mutants flower late, but the loss of VRN2 is not lethal (Gendall et al., 2001). Alternative vernalization mechanisms that do not utilize a putative Arabidopsis VRN2 ortholog have evolved in other species (Yan et al., 2004) and may be present in


Arabidopsis as well. While every plant sequenced thus far has at least one copy of EMF2, VRN2 is found only infrequently. The dispensable nature of VRN2 may result in its lower frequency of occurrence throughout land plants. Based on our data, it is likely that VRN2 can arise from a duplication of an EMF2-like ancestor. Once an additional EMF2 copy is present, one of the copies is no longer under strong selection and is able to diverge, potentially resulting in a VRN2-like sequence. Under this scenario, VRN2-like sequences could arise multiple times and independently following any duplication event that included the EMF2 gene. Similarity in domain structure and amino acid composition could then be the result of convergent evolution. Genes possessing all domains found in EMF2 exist in insects and mammals (Yoshida et al., 2001; Schuettengruber et al., 2007). It can be argued, based on the presence of EMF2-like genes in animals, lycophytes, bryophytes, gymnosperms, and angiosperms, that early land plants shared an ancestral sequence having the domain structure found in modern copies of EMF2. As the gene or genome duplicated, VRN2 may have arisen from a duplication of the ancestral EMF2 (Figure 5A), followed by subsequent loss of the N-ter cap and the E5–10 domain, and the acquisition of the 52-aa C-terminal repeat. The presence of intermediary forms with partial domain structure suggests a potential step-wise evolution of VRN2 from an EMF2-like sequence. Among the full-length and partial sequences from 20 angiosperm families used in this analysis, 20 sequences contain complete N-ter domain (Figure 2A and Supplemental Figure 1A), nine lack the N-ter cap only (Intermediary molecule #1 in Figure 5A) and four lack both the N-ter cap and the E5–10 domain (Intermediary #2 in Figure 5A; Figure 2 and Supplemental Figure 1B) but do not contain a VEF repeat. 
So far, no sequence that lacks E5–10 but contains the N-ter cap has been found, suggesting that the N-ter cap may need to be lost first in order for the E5–10 domain to be lost. Finally, only one VRN2-like sequence, Arabidopsis VRN2, possesses the C-terminal repeat (Supplemental Figure 1G). Based on the frequency of the intermediary forms and results from phylogenetic analyses, we propose a three-step hypothesis for the evolution of VRN2 from a parental EMF2 following gene duplication (Figure 5A). In the first step, EMF2 loses the N-ter cap, resulting in Intermediary molecule #1. This could be achieved by mutation of the first ATG, making the second ATG the translation start site. In the second step, Intermediary #1 loses the E5–10 domain, resulting in Intermediary molecule #2. This could be achieved by mutation of the splice sites within exons 5–10, resulting in exon skipping (Hayashi et al., 1991). In the third step, Intermediary #2 gains a C-terminal repeat, resulting in the backbone of VRN2. Currently, this third step has only been observed in Arabidopsis. The importance of the 52-aa VEF repeat to VRN2 function remains to be tested, but the intermediate sequences may represent forms that are in the process of evolving the VRN2 function. Comparison of structure and function between these sequences and VRN2 will be required to better understand the relationships of these genes.
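The three-step hypothesis can be pictured as a short series of edits to an ordered list of domains. The sketch below is only an illustration of that reasoning, not an analysis from this study: representing a protein as a tuple of domain labels is a simplifying assumption, the helper names `lose` and `gain_c_terminal` are hypothetical, and the labels are taken from the text (with a plain hyphen in "E5-10").

```python
# Hypothetical model: a protein is an ordered tuple of domain labels.
# The labels follow the text; the tuple representation is an assumption.
EMF2_LIKE = ("N-ter cap", "E5-10", "C2H2", "E15-17", "VEF")


def lose(architecture, domain):
    """Return a copy of the architecture without the named domain."""
    if domain not in architecture:
        raise ValueError(f"{domain!r} is not present in {architecture}")
    return tuple(d for d in architecture if d != domain)


def gain_c_terminal(architecture, domain):
    """Return a copy of the architecture with a domain appended at the C-terminus."""
    return architecture + (domain,)


# Step 1: loss of the N-ter cap (Intermediary molecule #1).
intermediary_1 = lose(EMF2_LIKE, "N-ter cap")
# Step 2: loss of the E5-10 domain (Intermediary molecule #2).
intermediary_2 = lose(intermediary_1, "E5-10")
# Step 3: gain of the C-terminal VEF repeat (VRN2 backbone).
vrn2_backbone = gain_c_terminal(intermediary_2, "VEF repeat")

for name, arch in [
    ("EMF2-like ancestor", EMF2_LIKE),
    ("Intermediary #1", intermediary_1),
    ("Intermediary #2", intermediary_2),
    ("VRN2 backbone", vrn2_backbone),
]:
    print(f"{name}: {' | '.join(arch)}")
```

The order of the two loss steps mirrors the observation that no sequence lacking E5–10 but retaining the N-ter cap has been found.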

The proposed process could happen sequentially, resulting in independent derivations of a VRN2-like sequence from an EMF2-like ancestor multiple times throughout plant evolution. Convergence of the VEF domain among the VRN2-like sequences may occur concurrently with the losses of domains during steps 1 and 2, or may occur following these structural changes due to selection on the resulting gene sequence. This latter case assumes that independently evolved VRN2 sequences would converge upon a particular function, with selection then acting in a similar manner on the individual VEF domains. Studies demonstrating the function of VRN2-like sequences in the plants in which they are found would be required to understand the selection events leading to convergence of sequence data. More complete genomic and taxonomic sampling focused on VRN2-like sequences will enable us to test for possible differences in selection on the VRN2 clade in comparison with the various recovered EMF2 clades. The presence of the VEF repeat only in Arabidopsis VRN2 indicates that it may be a lineage-specific event. In this case, the ancestral VRN2 in the most recent common ancestor of Arabidopsis and Populus would not have had the VEF repeat, and the repeat was subsequently gained in the lineage leading to Arabidopsis after its divergence from the eudicot lineage leading to Populus. Phylogenetic analysis showed that the full-length Populus and Arabidopsis VRN2-like sequences are in the same clade, despite the lack of the VEF repeat in PtEMF2_4. However, in the analysis of the VEF domain alone, the VEF of PtEMF2_4 remained in the same clade as that of PtEMF2_1 and PtEMF2_2, suggesting stabilizing selection on the VEF domain in Populus since the duplication event leading to the Populus EMF2/VRN2-like divergence. This indicates that the overall domain architecture of the EMF2 gene is evolving independently from within-domain protein structure, at least for the VEF domain. 
Studies investigating evidence for directional selection on the VEF domain following duplication of EMF2 will be helpful to assess the likelihood of VRN2 evolution following gene or genome duplication. Phylogenetic analysis and sequence similarity comparison clearly demonstrate that the VEF domain of VEF-L36 is more closely related to that of VRN2 than to EMF2 (Table 3 and Figures 3 and 4). Similarly, both the C2H2 and VEF domains of FIS2 are more closely related to those of VRN2 than EMF2 (Table 3 and Figures 3 and 4). These findings support the derivation of FIS2 and VEF-L36 from VRN2; only plants that have evolved VRN2 could generate sequences like Arabidopsis FIS2 and VEF-L36. FIS2 is an essential gene in Arabidopsis, but has not yet been identified in other plants, including plants with full genome sequences. FIS2 is specifically expressed in the gametophyte of Arabidopsis and prevents endosperm development prior to fertilization (Luo et al., 1999, 2000). A search against cDNA libraries constructed from various angiosperm flowers did not result in any FIS2-like homologs. In plants that did not evolve VRN2, EMF2-like or alternative sequences may have evolved to prevent endosperm development without fertilization. Alternatively, genes with functional but without


sequence conservation (Calonje et al., 2008) may have evolved to take the place of FIS2. The presence of FIS2 and VEF-L36 should be investigated across Brassicaceae and its sister family, Capparaceae (Hall et al., 2002), in order to localize the potential duplication events leading to the evolution of these sequences from a hypothetical VRN2-like ancestral sequence. FIS2 may have diverged from a duplicated VRN2, while VEF-L36 may have evolved via a translocation of a VEF domain donated by VRN2 (Figure 5B and 5C). PRC2 components play important roles in animal development, notably in insects and mammals (Schuettengruber et al., 2007). Some animal VEF protein sequences in the database possess all domains found in Su(z)12; others possess only the VEF and C2H2, or only the VEF domain. Indeed, nematode has a sequence that shares C2H2 and VEF domain with Su(z)12 (see GenBank’s protein databases). Protein sequence alignment based on identity/similarity did not identify any animal protein with the VEF domain linked to FIS2’s S-rich or VEF-L36’s L36 domain, despite the abundance of S-rich and L36 in nature. A comprehensive evolutionary analysis of animal VEF-containing proteins is beyond the scope of the present study. However, gene duplication, domain deletion/insertion/ rearrangement apparently occurred during the evolution of animal VEF proteins as well. For example, mouse has one, chimpanzee has three, and zebra fish has two VEF protein homologs. Some animal homologs possess the N-terminal sequence, while others do not; and some domains specific to certain animals can be identified (data not shown). Gene fusion involving the human VEF homolog would lead to neoplastic tumor growth (Li et al., 2008). Future investigation of domain architectures in animal VEF proteins would provide insights into the evolutionary trends of VEF proteins in plants versus those in animals.

Dynamic Changes during VEF Gene Evolution

The evolution of the VEF genes in plants is characterized by the mobility of the VEF domain, duplication, and functional divergence of homologous sequences. In addition to its diverse location in the genome, a VEF domain can be located in the N- or C-terminus within a genetic locus. A VEF domain-containing gene may even lose the VEF domain, as in the case of HvEMF2_1. These phenomena indicate that the VEF domain functions like a mobile functional module that plays a major role in protein evolution, facilitated by intronic recombination or exon shuffling (Patthy, 1996; Kolkman and Stemmer, 2001). The dynamic genetic changes that occurred during the evolution of this small gene family caused varying degrees of divergence in sequences located between the conserved domains. For example, a region encoded by EMF2 exon 15 through exon 17 (E15–17), flanked by the highly conserved C2H2 zinc finger and VEF domains, is the region with the lowest conservation among the EMF2/VRN2 class homologs (Supplemental Figure 1D). While the ends use identical or similar amino acids and have almost no length variation, the center


of the gene region encoded by exon 16 and the 5’ end of exon 17 requires indels for multiple sequence alignment representing up to 20–70 aa in length difference. The gradient in the degree of similarity, from highly divergent at the center region to highly conserved at the 5’ and 3’ ends, may be informative in plant phylogenetics. Finally, we note that the VEF gene tree reflected our best understanding of the organismal tree for included taxa (Grass Phylogeny Working Group, 2001). Regions with high levels of variability combined with low copy number may render EMF2, particularly the E15–17 domain, a useful phylogenetic tool for evaluating the evolutionary relationships of plants across both deep and shallow nodes.

METHODS

Identification of Sequences and Domains of VEF Genes across Land Plants

Full-length EMF2 putative protein sequence was used to BLAST (Basic Local Alignment Search Tool) search against the following databases: GenBank (www.ncbi.nih.gov/), TIGR/JCVI (www.tigr.org/), the Floral Genome Project (http://fgp.bio.psu.edu/fgp/), Plant Genome DataBase (www.plantgdb.org/), the moss genome (www.cosmoss.org/, http://genome.jgi-psf.org/Phypa1_1/Phypa1_1.home.html), the papaya genome (http://tinyurl.com/3ua95v), the pine EST database (http://fungen.botany.uga.edu/), the Plant Genome Network (http://pgn.cornell.edu/cgi-bin/blast/blast_search.pl), Brassica (http://ukcrop.net/), SOL Genomics Network (www.sgn.cornell.edu/), the poplar genome (http://genome.jgi-psf.org), the Chromatin database (www.chromdb.org/), and the Selaginella genome (http://selaginella.genomics.purdue.edu). Sequences with an e-value greater than 0.001 (non-significant homology) were eliminated, thereby eliminating all non-plant sequences. Plant sequences containing intact EMF2-like N-terminal, C2H2, and/or C-terminal domains were selected for further analysis. For identification of homologs of FIS2’s S-rich domain and VEF-L36’s L36 domain, S-rich domain and L36 domain amino acid sequences were used to BLAST search against the same databases listed above with an e-value cut-off of 0.001.
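The e-value filter described above can be sketched in a few lines. This is a minimal illustration rather than the pipeline actually used: the `BlastHit` record, its field names, the `filter_hits` helper, and the example hits are all hypothetical, and the only value taken from the text is the 0.001 cut-off.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 0.001  # cut-off stated in the Methods


@dataclass
class BlastHit:
    subject_id: str  # hypothetical identifier of the database sequence
    e_value: float   # expectation value reported for the hit


def filter_hits(hits, cutoff=E_VALUE_CUTOFF):
    """Keep hits whose e-value is not greater than the cut-off."""
    return [hit for hit in hits if hit.e_value <= cutoff]


# Made-up example hits, for illustration only.
hits = [
    BlastHit("hit_A", 1e-50),
    BlastHit("hit_B", 0.0005),
    BlastHit("hit_C", 0.2),
]
kept = filter_hits(hits)
print([hit.subject_id for hit in kept])
```

Hits with an e-value exactly equal to the cut-off are retained here, because the text eliminates only sequences with an e-value greater than 0.001.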

Sequencing EMF2 Homolog cDNA

Plasmid cDNAs were extracted from bacterial culture according to the manufacturer’s protocols (QIAGEN Inc., Valencia, CA 91355, USA). M13 rev (5′-GGAAACAGCTATGACCATG-3′) and M13 (–20) (5′-GTAAAACGACGGCCAG-3′) primers were used for sequencing, with the following internal primers used as necessary to obtain full sequences:

Acorus: 5′-CTCAGTAGAGCATGTCTGCTG-3′, 5′-CCCATGCAATCGTGAGAATGC-3′, 5′-TGACACGCTGAAAGATGATG-3′, 5′-CATTAACTGCCTGATACTCTTC-3′
Asparagus: 5′-CAATACGGAATCCATCATTTCTGC-3′, 5′-CTTGCTCCAATGCCATTGGC-3′
Nuphar: 5′-GATGAGGTCGATGATGATATTGC-3′, 5′-CTGCCAAAACCCGCTGTTTC-3′
Yucca: 5′-GTCAATCGGGCATGTATACTG-3′, 5′-CTTGCTCCAACGCCATTGGC-3′
Eschscholzia 8.1: 5′-GCTGATTACAAGGAACAGACTG-3′, 5′-CACGGAACATGACCATCTGC-3′
Eschscholzia 8.2: 5′-GAGGAATGACAGGGTGGAAGC-3′, 5′-GTTCCAGAGATGCATAATCCTTG-3′
Tomato: 5′-GCTTTGCCGAACTTGCCAG-3′, 5′-CCCTATGAGAATGAAAGAATTGCC-3′

Sequence Alignment

T-coffee (www.ebi.ac.uk/t-coffee/) was used to produce a global amino acid alignment using the default values for protein alignment. RADAR (www.ebi.ac.uk/Radar/) was used to detect de novo repeat regions in EMF2 homologous sequences. Classification of VEF subgroups was performed based on domain organization in the aligned sequences. The full-length VEF homologs were aligned using T-coffee, and pair-wise scores were calculated with ClustalW (version 1.83, http://www.ebi.ac.uk/Tools/Radar/) as the number of identities in the best alignment divided by the number of residues compared (gap positions excluded). Scores were initially calculated as percent identities and were converted to distances by dividing by 100 and subtracting from 1.0, giving the proportion of differing sites. No correction for multiple substitutions was performed.
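The identity and distance calculation described above can be written out as a short sketch. This is an illustration under stated assumptions, not the ClustalW implementation: the function names are hypothetical, the two aligned toy sequences are invented, and gaps are assumed to be written as "-". The 29% and 18% values are the pair-wise scores quoted in the Results for FIS2 against VRN2 and EMF2.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # gap positions are excluded from the comparison
        compared += 1
        if res_a == res_b:
            identical += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * identical / compared


def identity_to_distance(percent):
    """Convert a percent identity to an uncorrected distance (1 - identity/100)."""
    return 1.0 - percent / 100.0


# Invented toy alignment: 6 ungapped columns, 4 of them identical.
toy_identity = percent_identity("MKT-LLV", "MRTALLI")
print(round(toy_identity, 2), round(identity_to_distance(toy_identity), 4))

# Scores quoted in the Results: FIS2 vs. VRN2 (29%) and FIS2 vs. EMF2 (18%).
for label, score in [("FIS2-VRN2", 29.0), ("FIS2-EMF2", 18.0)]:
    print(label, round(identity_to_distance(score), 2))
```

Because no correction for multiple substitutions is applied, the distance is simply the fraction of compared sites that differ.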

Phylogenetic Analysis

EMF2/VRN2 Full-Length Sequence Data

The T-coffee alignment was used for phylogenetic analysis based on its superior prediction of primary homology statements as compared with prior knowledge of functional domain architecture; for example, the N-terminus-located C2H2 domain of FIS2 aligned with the EMF2 N-terminal domain when using MegAlign or ClustalW, while, in T-coffee, the annotated C2H2 domains aligned with one another across all sequences. Bayesian phylogenetic analyses on aligned full-length sequences were performed with MrBayes v. 3.1.2 (Huelsenbeck and Ronquist, 2001; Ronquist and Huelsenbeck, 2003). The model of protein evolution that best fit the protein sequence data was selected using the AIC as implemented in ProtTest 2.0 (Abascal et al., 2005). The best-scoring model for the EMF2/VRN2 full-length alignment was the Jones-Taylor-Thornton (JTT) probability model (Jones et al., 1992), with rate variation among sites calculated as a gamma distribution (+G), and global rearrangements were sampled with a random order of input sequences. Posterior probabilities of the generated trees were approximated using an MCMC algorithm with four incrementally heated chains (T = 0.2) for 5 000 000 generations, sampling trees every 100 generations. Two independent runs were conducted for each dataset simultaneously, the default setting in MrBayes v. 3.1.2. Following completion, the sampled trees from each analysis were plotted against their log-likelihood score to identify the point at which log-likelihood scores reached a maximum value. All trees prior to this point were discarded as the burn-in phase, all post-burn-in trees from each run were pooled, and a 50% majority-rule consensus tree was calculated to obtain a topology with average branch lengths as well as posterior probabilities as indicators of support for all resolved nodes.

VEF Domain Sequence Data

The VEF domain, a region held in common by EMF2, VRN2, FIS2, and VEF-L36, was used to estimate the phylogenetic relationships among VEF gene sequences across land plants. Protein alignment of the VEF domain was performed with MUSCLE, resulting in a multiple sequence alignment of about 130 aa. ProtTest 2.0 was also used to determine the model of evolution that best fits the VEF domain alignment. The best-scoring model for the VEF alignment was also JTT+G, and global rearrangements were sampled with a random order of input sequences. Bayesian and maximum likelihood methods of phylogenetic inference were conducted on the VEF domain alignment using MrBayes (tree not shown) and RAxML-VI-HPC (Stamatakis, 2006), respectively. The analyses were performed on the computer cluster of the Cyber-Infrastructure for Phylogenetic Research project (CIPRES, www.phylo.org) at the San Diego Supercomputer Center. Clade support, which was assessed with nonparametric bootstrapping (Felsenstein, 1985) as implemented in RAxML-VI-HPC, was based on 100 replicates. The tree with the highest log-likelihood score from the RAxML analysis was chosen for representation here.

Accession Numbers

Novel full-length protein sequences generated for this study were deposited in GenBank with the following accession numbers: Yucca filamentosa EMF2 (YfEMF2, GenBank accession number (acc. #) ABD85300); Asparagus officinalis EMF2 (AoEMF2, acc. # ABD85301); Eschscholzia californica EMF2 (EcEMF2_2, acc. # ABD98791); Eschscholzia californica EMF2 (EcEMF2_1, acc. # ABD98790); Tomato EMF2 (LeEMF2_1, acc. # ABI99480); Acorus americanus EMF2 (AaEMF2, acc. # ABI99481).

SUPPLEMENTARY DATA

Supplementary Data are available at Molecular Plant Online.

FUNDING

This work is supported by NSF grant #IBN 0236399 and USDA grant #03–35301–13244 to Z.R.S.

ACKNOWLEDGMENTS

The authors thank Dr Hong Ma (Pennsylvania State University), the Floral Genome Project, and the SOL Genomics Network (www.sgn.cornell.edu/) for providing EMF2 homologous cDNA clones, Kazusa DNA Research Institute for providing Lotus japonicus EMF2 sequence to Dr Rieko Nishimura, Dr Jo Ann Banks (National Science Foundation/Purdue University) for providing Selaginella EMF2 homologous EST sequences, Dr Ralph Quatrano (Washington University) for providing access to the Physcomitrella website, Drs Hong Ma and Damon R. Lisch (UC Berkeley) for comments on the manuscript, Steve Ruzin and Denise Schichnes (Bioimaging Facility, CNR, UC Berkeley) for image processing, and our laboratory members Myriam Calonje, Tiffany Tirtadinata, Robert Luan, Heather Driscoll, and Rosario Sanchez for help and support in preparation of this work. No conflict of interest declared.

REFERENCES

Abascal, F., Zardoya, R., and Posada, D. (2005). ProtTest: selection of best-fit models of protein evolution. Bioinformatics. 21, 2104–2105.
Birve, A., Sengupta, A.K., Beuchle, D., Larsson, J., Kennison, J.A., Rasmuson-Lestander, A., and Muller, J. (2001). Su(z)12, a novel Drosophila Polycomb group gene that is conserved in vertebrates and plants. Development. 128, 3371–3379.
Calonje, M., and Sung, Z.R. (2006). Complexity beneath the silence. Curr. Opin. Plant Biol. 9, 530–537.
Calonje, M., Sanchez, R., Chen, L., and Sung, Z.R. (2008). EMBRYONIC FLOWER1 participates in Polycomb group-mediated AG gene silencing in Arabidopsis. Plant Cell. 20, 277–291.
Cao, R., Wang, L., Wang, H., Xia, L., Erdjument-Bromage, H., Tempst, P., Jones, R.S., and Zhang, Y. (2002). Role of histone H3 lysine 27 methylation in Polycomb-group silencing. Science. 298, 1039–1043.
Chanvivattana, Y., Bishopp, A., Schubert, D., Stock, C., Moon, Y.H., Sung, Z.R., and Goodrich, J. (2004). Interaction of Polycomb-group proteins controlling flowering in Arabidopsis. Development. 131, 5263–5276.
Czermin, B., Melfi, R., McCabe, D., Seitz, V., Imhof, A., and Pirrotta, V. (2002). Drosophila enhancer of Zeste/ESC complexes have a histone H3 methyltransferase activity that marks chromosomal Polycomb sites. Cell. 111, 185–196.
Englbrecht, C.C., Schoof, H., and Bohm, S. (2004). Conservation, diversification and expansion of C2H2 zinc finger proteins in the Arabidopsis thaliana genome. BMC Genomics. 5, 39.
Felsenstein, J. (1985). Confidence limits on phylogenies: an approach using the bootstrap. Evolution. 39, 783–791.
Gendall, A.R., Levy, Y.Y., Wilson, A., and Dean, C. (2001). The VERNALIZATION 2 gene mediates the epigenetic regulation of vernalization in Arabidopsis. Cell. 107, 525–535.
Goodrich, J., Puangsomlee, P., Martin, M., Long, D., Meyerowitz, E.M., and Coupland, G. (1997). A Polycomb-group gene regulates homeotic gene expression in Arabidopsis. Nature. 386, 44–51.
Grass Phylogeny Working Group (Nigel P. Barker, Lynn G. Clark, Jerrold I. Davis, Melvin R. Duvall, Gerald F. Guala, Catherine Hsiao, Elizabeth A. Kellogg, and H. Peter Linder) (2001). Phylogeny and subfamilial classification of the grasses (Poaceae). Annals of the Missouri Botanical Garden. 88, 373–457.
Grossniklaus, U., Vielle-Calzada, J.P., Hoeppner, M.A., and Gagliano, W.B. (1998). Maternal control of embryogenesis by MEDEA, a polycomb group gene in Arabidopsis. Science. 280, 446–450.

Hennig, L., Taranto, P., Walser, M., Schonrock, N., and Gruissem, W. (2003). Arabidopsis MSI1 is required for epigenetic maintenance of reproductive development. Development. 130, 2555–2565.

Huelsenbeck, J.P., and Ronquist, F. (2001). MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics. 17, 754–755.

Irish, V.F., and Benfey, P.N. (2004). Beyond Arabidopsis: translational biology meets evolutionary developmental biology. Plant Physiol. 135, 611–614.

Jiang, D., Wang, Y., Wang, Y., and He, Y. (2008). Repression of Flowering Locus C and Flowering Locus T by the Arabidopsis Polycomb Repressive Complex 2 components. PLoS One. 3, e3404.

Jones, D.T., Taylor, W.R., and Thornton, J.M. (1992). The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8, 275–282.

Kim, S., Yoo, M., Albert, V., Farris, J., Soltis, P.S., and Soltis, D.E. (2004). Phylogeny and diversification of B-function MADS-box genes in angiosperms: evolutionary and functional implications of a 260-million-year-old duplication. Amer. J. Bot. 91, 2102–2118.

Kinoshita, T., Harada, J.J., Goldberg, R.B., and Fischer, R.L. (2001). Polycomb repression of flowering during early plant development. Proc. Natl Acad. Sci. U S A. 98, 14156–14161.

Kohler, C., Hennig, L., Spillane, C., Pien, S., Gruissem, W., and Grossniklaus, U. (2003). The Polycomb-group protein MEDEA regulates seed development by controlling expression of the MADS-box gene PHERES1. Genes Dev. 17, 1540–1553.

Kolkman, J.A., and Stemmer, W.P. (2001). Directed evolution of proteins by exon shuffling. Nat. Biotechnol. 19, 423–428.

Kuzmichev, A., Nishioka, K., Erdjument-Bromage, H., Tempst, P., and Reinberg, D. (2002). Histone methyltransferase activity associated with a human multiprotein complex containing the Enhancer of Zeste protein. Genes Dev. 16, 2893–2905.

Li, J., Wang, J., Mor, G., and Sklar, J. (2008). A neoplastic gene fusion mimics trans-splicing of RNAs in normal human cells. Science. 321, 1357–1361.

Luo, M., Bilodeau, P., Koltunow, A., Dennis, E.S., Peacock, W.J., and Chaudhury, A.M. (1999). Genes controlling fertilization-independent seed development in Arabidopsis thaliana. Proc. Natl Acad. Sci. U S A. 96, 296–301.

Luo, M., Bilodeau, P., Dennis, E.S., Peacock, W.J., and Chaudhury, A. (2000). Expression and parent-of-origin effects for FIS2, MEA, and FIE in the endosperm and embryo of developing Arabidopsis seeds. Proc. Natl Acad. Sci. U S A. 97, 10637–10642.

Moon, Y.H., Chen, L., Pan, R.L., Chang, H.S., Zhu, T., Maffeo, D.M., and Sung, Z.R. (2003). EMF genes maintain vegetative development by repressing the flower program in Arabidopsis. Plant Cell. 15, 681–693.

Hall, J.C., Sytsma, K.J., and Iltis, H.H. (2002). Phylogeny of Capparaceae and Brassicaceae based on chloroplast sequence data. Amer. J. Bot. 89, 1826–1842.

Muller, J., Hart, C.M., Francis, N.J., Vargas, M.L., Sengupta, A., Wild, B., Miller, E.L., O’Connor, M.B., Kingston, R.E., and Simon, J.A. (2002). Histone methyltransferase activity of a Drosophila Polycomb group repressor complex. Cell. 111, 197–208.

Hayashi, S.I., Kunisada, T., Ogawa, M., Yamaguchi, K., and Nishikawa, S.I. (1991). Exon skipping by mutation of an authentic splice site of c-kit gene in W/W mouse. Nucleic Acids Res. 19, 1267–1271.

Ohad, N., Yadegari, R., Margossian, L., Hannon, M., Michaeli, D., Harada, J.J., Goldberg, R.B., and Fischer, R.L. (1999). Mutations in FIE, a WD Polycomb group gene, allow endosperm development without fertilization. Plant Cell. 11, 407–416.

Ronquist, F., and Huelsenbeck, J.P. (2003). MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 19, 1572–1574.

Wood, C.C., Robertson, M., Tanner, G., Peacock, W.J., Dennis, E.S., and Helliwell, C.A. (2006). The Arabidopsis thaliana vernalization response requires a Polycomb-like protein complex that also includes VERNALIZATION INSENSITIVE 3. Proc. Natl Acad. Sci. U S A. 103, 14631–14636.

Schonrock, N., Bouveret, R., Leroy, O., Borghi, L., Kohler, C., Gruissem, W., and Hennig, L. (2006). Polycomb-group proteins repress the floral activator AGL19 in the FLC-independent vernalization pathway. Genes Dev. 20, 1667–1678.

Yan, L., Loukoianov, A., Blechl, A., Tranquilli, G., Ramakrishna, W., SanMiguel, P., Bennetzen, J., Echenique, V., and Dubcovsky, J. (2004). The Wheat VRN2 gene is a flowering repressor downregulated by vernalization. Science. 303, 1640–1644.

Schuettengruber, B., Chourrout, D., Vervoort, M., Leblanc, B., and Cavalli, G. (2007). Genome regulation by Polycomb and Trithorax proteins. Cell. 128, 735–745.

Yang, C.H., Chen, L.J., and Sung, Z.R. (1995). Genetic regulation of shoot development in Arabidopsis: role of the EMF genes. Developmental Biol. 169, 421–435.

Stamatakis, A. (2006). RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 22, 2688–2690.

Yoshida, N., Yanai, Y., Chen, L., Kato, Y., Hiratsuka, J., Miwa, T., Sung, Z.R., and Takahashi, S. (2001). EMBRYONIC FLOWER2, a novel Polycomb group protein homolog, mediates shoot development and flowering in Arabidopsis. Plant Cell. 13, 2471–2481.

Patthy, L. (1996). Exon shuffling and other ways of module exchange. Matrix Biol. 15, 301–310; discussion 311–302.

Sung, S., and Amasino, R.M. (2004). Vernalization and epigenetics: how plants remember winter. Curr. Opin. Plant Biol. 7, 4–10.