TRPLSC 1515 No. of Pages 10
Opinion
Beyond Genomics: Studying Evolution with Gene Coexpression Networks
Colin Ruprecht,1 Neha Vaid,2 Sebastian Proost,2 Staffan Persson,3,4 and Marek Mutwil2,*
Understanding how genomes change as organisms become more complex is a central question in evolution. Molecular evolutionary studies typically correlate the appearance of genes and gene families with the emergence of biological pathways and morphological features. While such approaches are of great importance to understand how organisms evolve, they are also limited, as functionally related genes work together in the context of dynamic gene networks. Since functionally related genes are often transcriptionally coregulated, gene coexpression networks present a resource to study the evolution of biological pathways. In this opinion article, we discuss recent developments in this field and how coexpression analyses can be merged with existing genomic approaches to transfer functional knowledge between species to study the appearance or extension of pathways.
Trends
Genomic studies correlate the appearance of new gene families with the emergence of novel biological processes. However, biological processes and pathways are composed of multiple genes that work together (i.e., gene modules). To provide a more complete understanding of evolution, these functional relationships need to be taken into account.
Since functionally related genes are often transcriptionally coordinated (coexpressed), coexpression networks, and comparative studies of these, provide a basis for studying evolution of gene modules.
Systematic analyses of coexpression networks demonstrated that gene modules can be highly conserved across distant species.
Recent bioinformatic advances revealed that some gene modules have duplicated over time, suggesting that certain biological pathways are present in multiple copies within an organism.
Plant Evolution and Comparative Genomics
Evolution has allowed plants to extend their molecular repertoires and morphological features to adapt to changes in their surroundings. Comparative genomics is typically used to correlate the appearance (or loss) of such changes with the appearance (or loss) of genes or gene families (Box 1). For example, comparative genomic analyses of the early land plant model Physcomitrella (Physcomitrella patens) and Arabidopsis (Arabidopsis thaliana) indicated that gene families important for drought tolerance, including abscisic acid- and cutin synthesis-related genes along with those for auxin and cytokinin signal transduction pathways, were important for the emergence of land plants [1]. Conversely, gene families needed to sustain flagella and centrioles are present in the green alga Chlamydomonas reinhardtii, but are absent in land plants, which are not flagellated and lack centrosomes [2].
The ever-increasing number of sequenced genomes has promoted the use of comparative genomics as a tool to understand plant evolution (Box 1). While this approach clearly has led to a burst of important discoveries [3], it also has some important limitations. First, the emergence of genes or gene families may contribute to certain cellular pathways or morphological features, such as cell walls, seeds, and flowers, but these genes need to operate in the context of other genes to do so. Even though comparative genomics has revealed that approximately 3000 gene families appeared in land plants (Box 1), it does not provide much information about how these gene families relate to pathways or cellular processes. Second, morphological features can arise by repurposing existing genetic material toward new functions by a process called co-option (see Glossary) [4].
For example, the nonvascular plant Physcomitrella contains homologs of transcription factors regulating biosynthesis of cellulose, monolignols, and xylan, which are essential components of vessel elements of vascular tissues [5]. This suggests that
Trends in Plant Science, Month Year, Vol. xx, No. yy
By combining traditional genomic approaches with coexpression networks of lower and higher plants, we can expand our knowledge on how biological pathways emerge or are extended.
1 Max-Planck Institute of Colloids and Interfaces, Am Muehlenberg 1, 14476 Potsdam, Germany
2 Max-Planck Institute for Molecular Plant Physiology, Am Muehlenberg 1, 14476 Potsdam, Germany
3 School of BioSciences, University of Melbourne, Parkville, VIC 3010, Australia
http://dx.doi.org/10.1016/j.tplants.2016.12.011 © 2016 Elsevier Ltd. All rights reserved.
Box 1. Genomic Approaches
Comparative Genomics, Phylostratigraphy, and Phylogenetics
Genomes of model species that represent major plant lineages are now available for glaucophytes (Cyanophora paradoxa) [68], red algae (Cyanidioschyzon merolae) [69], chlorophytes (Chlamydomonas reinhardtii) [2], charophytes (Klebsormidium flaccidum) [70], early embryophytes (Physcomitrella patens) [1], early vascular plants (Selaginella moellendorffii) [6], seed plants (Picea abies) [71], basal flowering plants (Amborella) [72], monocots (Oryza sativa) [73], and dicots (Arabidopsis thaliana) [74]. Comparative genomic studies aim to correlate the appearance (or loss) of morphological features in these lineages, with the appearance (or loss) of gene families in their genomes.
Phylostratigraphic analysis is a genomic method that estimates the age of a gene or a gene family [40]. This method uses genome sequences of diverse taxa to identify the last common ancestor (LCA) in which genes from a certain gene family can be found. The identity of the LCA is then used to estimate the age of the gene family, as this LCA represents the evolutionary period, or phylostratum, in which the gene family emerged. For example, a gene family found only in flowering plants (i.e., LCA was likely a direct ancestor of flowering plants, thus the gene family would belong to the angiosperm phylostratum) is deemed younger than a gene family found in green algae and flowering plants. Furthermore, the analysis can also reveal appearance of gene families during plant evolution. For example, all green plants contain approximately 4000 gene families, while the movement to land and the development of vasculature coincided with the appearance of approximately 3000 and approximately 500 new gene families, respectively [6].
Phylogenetics can be used to investigate the evolutionary relationships between species or genes.
The evolutionary relationships of genes are determined by their sequence similarity and include speciation and duplication events. As these relationships are usually represented in a tree form, the identity of the connecting node from any two given genes determines the relationship (speciation or duplication) of the two genes of interest. In addition to phylogenetic trees, phylogenetic networks have also been introduced to account for the complexity in evolution [75], for example, due to horizontal gene transfer and hybridization events. Comparative genomic and phylogenetic databases that sample widely across the green plant clade include Phytozome (https://phytozome.jgi.doe.gov/) [76], GreenPhylDB (www.greenphyl.org/) [77], PLAZA (bioinformatics.psb.ugent.be/plaza/) [28], and PlantGDB (www.plantgdb.org/) [78].
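As a minimal sketch of the phylostratigraphic assignment described in this box, the Python example below assigns a gene family to the youngest clade that still contains every species in which the family is found; that clade stands in for the LCA. The clade list, its simplified species membership, and the function name `assign_phylostratum` are assumptions made for illustration and are not taken from the cited studies.

```python
# Nested clades ordered from oldest (broadest) to youngest (narrowest).
# Membership is a simplified, illustrative subset of the lineages named in Box 1.
PHYLOSTRATA = [
    ("green plants", {"Chlamydomonas", "Physcomitrella", "Selaginella", "Oryza", "Arabidopsis"}),
    ("land plants", {"Physcomitrella", "Selaginella", "Oryza", "Arabidopsis"}),
    ("vascular plants", {"Selaginella", "Oryza", "Arabidopsis"}),
    ("angiosperms", {"Oryza", "Arabidopsis"}),
]


def assign_phylostratum(species_with_family):
    """Return the youngest clade containing all species in which the family occurs."""
    present = set(species_with_family)
    if not present:
        raise ValueError("the gene family must occur in at least one species")
    assigned = None
    for clade_name, clade_members in PHYLOSTRATA:
        # Because clades are nested, every match narrows the previous one.
        if present <= clade_members:
            assigned = clade_name
    if assigned is None:
        raise ValueError("family occurs in a species outside the sampled clades")
    return assigned
```

With this toy sampling, a family found in Chlamydomonas and Arabidopsis is placed in the green plant phylostratum, whereas a family found only in Oryza and Arabidopsis is placed in the angiosperm phylostratum and is therefore deemed younger.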
vascular plants ‘co-opted’ genes to synthesize lignified secondary walls [6]. Consequently, as many genes do not directly relate to morphological features, comparative genomics might often fail to link genes to distinct processes. These examples illustrate that comparative genomic analyses certainly have limitations, and that approaches that can link genes and/or gene families to existing biological processes may be beneficial to predict evolutionary relationships.
Conservation of Modules
Functional contexts of genes or their products can, in many cases, be inferred through ‘-omic’ approaches, including large-scale protein–protein interaction studies (interactomics), or mRNA expression studies (transcriptomics). Through various bioinformatic pipelines, such large-scale data sets can be mined to identify functionally related genes [7–9]. However, while large-scale protein–protein interaction data are available for Arabidopsis [10], they are still limited for other plants. By contrast, large-scale transcriptomic data are abundant for many plant species [11], and may therefore be a suitable choice to extend comparative genomic evaluations. These data may allow for identification of functionally related genes by coexpression analysis (Box 2). A highly effective way of displaying and investigating coexpressed relationships between genes is by representing them as gene coexpression networks [12–15]. Comparative analyses of coexpression networks have shown that parts of the networks are conserved across different species, and even kingdoms [16–18]. These relationships have been used to infer biological functions for genes in species that are less well investigated as compared with certain model species [19–22]. We refer to such relationships as conserved modules, which represent gene neighborhoods in genome-wide coexpression networks that have similar compositions in terms of gene families and/or Pfam domains (labels) across multiple species (Figure 1A, Key Figure) [23]. The significance of module conservation is usually
4 ARC Centre of Excellence in Plant Cell Walls, School of Biosciences, University of Melbourne, Parkville, VIC 3010, Australia
*Correspondence: [email protected] (M. Mutwil).
Box 2. Coexpression Analysis
Glossary
Coexpression analysis assumes that genes whose mRNAs are expressed similarly across many conditions, for example, different tissues, genotypes, and during biotic/abiotic perturbations, are involved in related biological processes [79]. The degree of coexpression between two genes is typically calculated using the Pearson correlation coefficient (linear relationships), Spearman correlation (monotonic relationships), or mutual information (complex, nonlinear relationships) [79]. Because of the large abundance of expression data, coexpression analysis has proven to be an invaluable bioinformatic method to uncover genes associated with certain biological processes [80]. For example, by identifying genes coexpressed with secondary cell wall cellulose synthases, several genes involved in various aspects of secondary cell wall synthesis were uncovered [81,82]. It is important to note that the interpretation of coexpression networks faces several caveats: (i) there is often low correlation between transcript and protein levels [83], which highlights that coexpression does not necessarily mean cofunction; (ii) the coexpression relationships are a product of transcriptional and post-transcriptional regulation of mRNAs, and the underlying regulatory relationships of coexpressed genes are often unknown [16]; (iii) the inferred coexpression relationships depend on the expression data set used. For example, if pollen expression data are absent from the data set, pollen-specific modules will not be detected; (iv) when whole tissues are sampled, pooling mRNA of the different cell types present in the tissue can lead to spurious coexpression relationships. Recent advances and accumulating data from single-cell transcriptomics can potentially address the latter caveat [84]. 
Several online databases are available to explore coexpression relationships and modules, including COP (http://webs2.kazusa.or.jp/kagiana/cop0911/) [85], CoExpNetViz (http://bioinformatics.psb.ugent.be/webtools/coexpr/) [34], STARNET2 (http://vanburenlab.medicine.tamhsc.edu/starnet2.html) [86], ATTED-II (http://atted.jp) [87], CORNET [88], and PlaNet (www.gene2function.de) [19]. PlaNet contains a comprehensive expression atlas of the major tissue types of Arabidopsis thaliana, Oryza sativa, and Physcomitrella patens, and it has recently been extended with FamNet that can detect conserved and duplicated modules [23], and phylostratigraphic and phylogenetic analyses of modules (unpublished data). A tutorial and examples of how to use the latter tools are available online (www.gene2function.de/publications).
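To make the coexpression measures named in this box concrete, the following Python sketch computes Pearson and Spearman correlations between expression profiles and keeps gene pairs above a cutoff as network edges. It is an illustration only: the function names, the 0.8 threshold, and the use of a simple cutoff are assumptions and do not describe how any of the listed databases build their networks.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed profiles."""
    return pearson(average_ranks(x), average_ranks(y))


def coexpression_edges(profiles, threshold=0.8, measure=pearson):
    """Return (gene_a, gene_b, score) for every gene pair scoring at or above the threshold."""
    edges = []
    for gene_a, gene_b in combinations(sorted(profiles), 2):
        score = measure(profiles[gene_a], profiles[gene_b])
        if score >= threshold:
            edges.append((gene_a, gene_b, score))
    return edges
```

For a monotonic but nonlinear pair of profiles such as `[1, 2, 3, 4, 5]` and `[1, 4, 9, 16, 25]`, the Spearman correlation is 1 while the Pearson correlation is slightly lower, which illustrates the distinction between linear and monotonic relationships drawn above.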
estimated by permutation analysis, where the observed similarity of two modules is compared against a null model [23,24]. Conserved modules can be observed in many fundamental biological processes, for example, protein synthesis, cell cycle, and photosynthesis [16,25], and also in plant-specific pathways, for example, secondary metabolite synthesis or plant defense [19,26]. The identification of conserved modules has several important applications. First, due to large gene families, it is often difficult to find orthologs that perform the same functions across plant species by sequence similarity approaches only [27–29]. By contrast, conserved modules are found by combining conserved labels and gene coexpression, which may allow for identification of ‘true’ orthologs that serve the same function with high confidence [30,31]. The conserved modules can thus be used to efficiently transfer gene annotations between, for example, Arabidopsis and economically important crop plants. Second, while simple transcriptome analyses of related organs between species (e.g., leaves in Arabidopsis and poplar) can identify putative orthologs, such analyses become difficult when equivalent transcriptomic conditions are lacking or not possible (e.g., when comparing Arabidopsis leaves vs. single-celled algae). The inability to produce equivalent transcriptomic conditions can be solved by identifying conserved modules, as they are based on similar neighborhoods of coexpression networks rather than similar transcriptomic conditions [11]. Third, coexpression relationships that are supported by only a few transcriptome experiments, or that are found through experimental artifacts, are not likely to be conserved in multiple species [32]. The conserved modules may therefore be used to enrich coexpressed neighborhoods for biologically relevant information [24,32].
Several Web tools that can explore conserved coexpressed relationships have been developed, such as PlaNet, ATTED-II, and CoExpNetViz [19,33,34] (Box 2).
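The permutation analysis used to judge module conservation can be illustrated with a short Python sketch. Here the similarity of two modules is taken to be the Jaccard index of their label sets, and the null model draws random gene sets of the same size from the second species. Both choices, like all function and parameter names, are assumptions for illustration; the cited tools may define similarity and the null model differently.

```python
import random


def jaccard(labels_a, labels_b):
    """Jaccard index of two label sets (0.0 if both are empty)."""
    labels_a, labels_b = set(labels_a), set(labels_b)
    union = labels_a | labels_b
    return len(labels_a & labels_b) / len(union) if union else 0.0


def module_labels(genes, gene_to_labels):
    """Union of the labels (gene families, Pfam domains) carried by the genes."""
    labels = set()
    for gene in genes:
        labels |= set(gene_to_labels.get(gene, ()))
    return labels


def conservation_p_value(labels_a, module_b_genes, gene_to_labels_b, n_perm=1000, seed=0):
    """Compare the observed label similarity with random gene sets from species B.

    Returns (observed similarity, empirical p-value).
    """
    rng = random.Random(seed)
    observed = jaccard(labels_a, module_labels(module_b_genes, gene_to_labels_b))
    all_genes = sorted(gene_to_labels_b)
    at_least_as_similar = 0
    for _ in range(n_perm):
        random_module = rng.sample(all_genes, len(module_b_genes))
        null_similarity = jaccard(labels_a, module_labels(random_module, gene_to_labels_b))
        if null_similarity >= observed:
            at_least_as_similar += 1
    # Add-one correction so the empirical p-value is never exactly zero.
    return observed, (at_least_as_similar + 1) / (n_perm + 1)
```

A small p-value indicates that random gene sets rarely share as many labels with the first module as the observed module does, which is the sense in which a module pair is called conserved here.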
Duplication of Modules
Expansion of gene families through duplication and subfunctionalization or neofunctionalization of individual genes might indicate functional relevance of the gene family for an organism.
Conserved module: a neighborhood of coexpressed genes that can be found in at least two species. For example, the highlighted PIN modules are conserved, as the modules can be found in Arabidopsis and Physcomitrella. Note that since modules can be duplicated, a module in one species can be conserved with multiple modules in another species, as exemplified by conservation of the PIN module from Physcomitrella with three PIN modules from Arabidopsis.
Co-option: indicates new functions of existing traits, such as genes and organs. Genes can be co-opted to generate new functions by changing their patterns of regulation and/or the functions of the proteins they encode.
Duplicated module: a group of coexpressed genes that can be found at least two times in one species. For example, the PIN1 module in Arabidopsis is referred to as duplicated, as PIN2 and PIN4 modules contain similar labels. Note that duplicated modules can be also conserved across species, as exemplified by duplicated AtPIN1, AtPIN2, and AtPIN3,4,7 modules that are similar (i.e., conserved) to the Physcomitrella module PpPIN (Figure 2A).
Labels: functional features that are assigned to genes, such as gene families and protein domains. Two genes that have similar labels are presumed to have similar function.
Module: can be defined as a group of genes involved in the same biological process or, as in graph theory, as a collection of connected nodes. In the context of coexpression networks, a module represents a group of coexpressed genes, which are putatively involved in the same process. In PlaNet, modules are based on second neighborhoods.
Neighborhood: in graph theory, a neighborhood consists of a node and all nodes connected to the node. In terms of coexpression networks, a neighborhood is defined as all genes coexpressed with a given gene. PlaNet uses second neighborhoods to identify modules. Second neighborhoods are extracted in three steps. First, a query gene (e.g., PIN1) is selected. Then, all genes connected (coexpressed) with the
For example, expansion of the E3 ubiquitin ligase family increased substrate specificity for ubiquitination, which is thought to be important for establishing new traits, such as self-incompatibility, hormone responses, and abiotic stress responses [35]. However, it is becoming increasingly clear that evolution of new traits requires polygenic changes in the promoter and coding sequences of genes that constitute a given trait [36,37]. Apart from the conserved modules discussed earlier, several case studies highlight that related gene coexpression relationships can occur multiple times in a plant. For example, a module in Arabidopsis is responsible for a specialized phenolic pathway during pollen development [38], while genes constituting a similar module are expressed in other parts of the plant, indicating that Arabidopsis employs two phenolic pathways [39]. A recent study showed that these duplicated modules are frequently found in plants (Figure 1B) [23]. For example, a cell wall biosynthesis module has duplicated at least three times to support different types of cell walls in Arabidopsis [23]. This study, furthermore, revealed that over 30% of genes in a typical plant genome can be found in hundreds of duplicated modules, indicating that plants frequently duplicate coexpressed gene neighborhoods, which could represent subspecialized or co-opted biological pathways. The ability to account for such duplicated modules is therefore a major strength of network-based analyses over comparative genomics.
Integrating Gene Coexpression Networks with Genomic Approaches
Common genomic approaches, such as phylogeny and phylostratigraphy, are useful tools to study the evolution of coexpression networks (Box 1). Phylostratigraphic analyses identify the evolutionary period, or phylostratum, in which genes and gene families appeared [40] (Box 1). Appearance of genes associated with a module at a certain phylostratum is therefore an indicator of the evolutionary period in which certain modules emerged (Figure 1C). For example, we observed that cellulose-related cell wall biosynthetic modules are significantly enriched for gene families that appeared in land plants, indicating that the module for cellulosic cell walls likely appeared when plants colonized land (Ruprecht et al., unpublished). Phylogenetic gene trees can reveal the type (speciation or duplication) and the evolutionary period of the relationship (i.e., phylostratum) between members of a gene family. Since conserved and duplicated modules often share multiple gene families, the information from the phylogenetic trees can be readily mapped onto modules. The enrichment of speciation and duplication events for certain evolutionary periods between two modules indicates when the duplication event happened (Figure 1C). For example, if two duplicated modules contain genes that were duplicated in the ancestor of angiosperms, the modules are also likely to have been duplicated during this period.
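One way to test whether a module is enriched for genes of a given phylostratum is sketched below in Python. The sketch compares the observed number of module genes assigned to the phylostratum with the number expected from the genome-wide proportion, and computes a one-sided hypergeometric p-value. The choice of the hypergeometric test and all function names are assumptions for illustration; the text above does not specify which test was used.

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when `draws` genes are sampled without replacement from
    `population` genes, of which `successes` belong to the phylostratum."""
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(population, draws)


def phylostratum_enrichment(module_genes, gene_to_stratum, stratum):
    """Return (observed count, expected count, one-sided p-value) for a module."""
    population = len(gene_to_stratum)
    successes = sum(1 for s in gene_to_stratum.values() if s == stratum)
    draws = len(module_genes)
    observed = sum(1 for gene in module_genes if gene_to_stratum[gene] == stratum)
    expected = draws * successes / population
    p_value = hypergeom_upper_tail(observed, population, successes, draws)
    return observed, expected, p_value
```

In a toy genome of 100 genes with 20 assigned to the land plant phylostratum, a module of 10 genes is expected to contain 2 land plant genes by chance, so a module containing 8 such genes would be reported as strongly enriched.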
Auxin-Related Modules As an Illustration of Module Evolution
To illustrate how comparative genomic studies can be combined with coexpression network-based analyses, we investigated the conservation and duplication of modules associated with auxin signaling and response using the PlaNet database (see the online supplement at www.gene2function.de/publications for a step-by-step guide to perform this analysis). The plant hormone auxin (indole-3-acetic acid) plays an essential role in regulating several fundamental processes during plant growth and development, such as organ formation, vascular differentiation, and embryogenesis [41–43]. The function of auxin is dependent on its spatiotemporal distribution, which is determined to a large degree by genes from the PIN family that govern intracellular and intercellular transport of auxin across membranes [44–46]. Comparative genomic analyses between the bryophyte P. patens and angiosperms already indicated expansion of auxin-related gene families in angiosperms [1], but the biological function of these gene duplications remained unclear. Interestingly, we found one PIN module in Physcomitrella and three PIN-related modules in Arabidopsis (Figure 2A). While not identical,
query gene are collected. Finally, all genes connected to the genes found in the first neighborhood are collected.
Network: a mathematical representation of many-to-many relationships (e.g., protein–protein interactions or coexpression) between elements (e.g., proteins or transcripts) under investigation. A coexpression network consists of nodes (genes), while edges (or links) connect coexpressed genes. The edges can have weights, which indicate how strongly two genes are coexpressed.
Phylostratum: evolutionary period that corresponds to the time of emergence of a gene or a gene family. For example, a gene family that arose in the common ancestor of land plants would belong to the land plant phylostratum (see also Box 1).
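The three-step extraction of a second neighborhood given in the Neighborhood entry of this glossary can be sketched in Python as follows. The network is represented as a dictionary that maps each gene to the set of genes it is coexpressed with; this representation and the function name `second_neighborhood` are assumptions for illustration and are not taken from PlaNet.

```python
def second_neighborhood(network, query):
    """Return the query gene, its coexpressed genes, and their coexpressed genes.

    `network` maps each gene to the set of genes it is coexpressed with.
    """
    # Step 1: the query gene is given.
    # Step 2: collect all genes coexpressed with the query gene (first neighborhood).
    first = set(network.get(query, ()))
    # Step 3: collect all genes coexpressed with the genes of the first neighborhood.
    second = set()
    for gene in first:
        second |= set(network.get(gene, ()))
    return {query} | first | second
```

Genes that are three or more edges away from the query gene are not included, so the size of the resulting module depends on how densely the query gene is connected.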
Key Figure
Concepts of Conserved and Multiplied Modules, Together with Phylostratigraphic and Phylogenetic Approaches
[Figure 1 image not reproduced. Recoverable panel labels: (A) last common ancestor, gene module, speciation, Species A, Species B, conserved gene module; (B) last common ancestor, gene module, intermediate, duplicated gene module, extant species; (C) last common land plant ancestor, land plant gene module, last common angiosperm ancestor, angiosperm-specific gene duplications, duplicated gene module, extant species.]
Figure 1. (A) A conserved module is defined as a coexpressed gene neighborhood that is similar across multiple species. Nodes represent genes, while edges connect coexpressed genes. Node shapes and colors indicate gene labels, that is, what gene family the genes belong to. (B) Duplicated modules are defined as modules that are found in multiple copies in a species. In this example, the module is duplicated in successive single gene duplication steps. (C) Phylostratigraphic and phylogenetic analyses of modules. The age of a module can be estimated by phylostratigraphic enrichment analysis of genes found in the module. In this example, the duplicated modules are composed entirely of genes belonging to land plant phylostratum, since the module appeared in the last common ancestor of land plants. Phylogenetic analysis can reveal when modules have duplicated by mapping information obtained from phylogenetic trees onto the modules. In this example, the duplicated modules contain genes that were duplicated in the last common angiosperm ancestor.
these modules contain similar sets of gene families, that is, labels, whose coexpressed genes likely form functional units related to auxin signaling. The PIN modules contain several genes encoding auxin response factors (AUX-IAA, auxin-inducible) [47], other auxin transporters [ATP-binding cassette transporters (ABC transporter)] [48,49], and many cell wall-modifying enzymes (xyloglucan endotransglycosidases, glycosyl hydrolases, and pectinesterases and pectin lyases), consistent with the connection of the growth hormone auxin to cell wall formation and extension (Figure 2B). Moreover, we found several transcription factors (bZIP and MYB) and receptor kinases (protein kinase: PXY, MOL1) related to vascular differentiation and patterning (bZIP: ATHB-8, PHV, PHB, REV, CNA, KNAT6, and MOL1) [50–56], as well as cytokinin-related genes (Response reg: AHK4/WOL1, AHK5/CKI2 and ABC transporter: ABCG14) [57–59] in the AtPIN1 and AtPIN2
[Figure 2 image not reproduced. Recoverable labels: (A) modules PpPIN, AtPIN1, AtPIN2, and AtPIN4 (with AtPIN3 and AtPIN7), with expression profiles across seed, stem, flower, leaf, and root in Arabidopsis and across archegonia, sporophyte, gametophore, rhizoids, chloronema, and caulonema in Physcomitrella; (B) label legend: PIN, auxin inducible, AUX-IAA, ABC transporter, glycosyl hydrolase 17, glycosyl hydrolase 19, glycosyl hydrolase 28, xyloglucan endotransglycosidase, pectin esterase, pectin lyase, bZIP, MYB, response reg, protein kinase; (C) bar chart of the number of genes in modules (y-axis, 0 to 60) for PpPIN, AtPIN4, AtPIN2, and AtPIN1, with observed and expected counts for the green plant and land plant phylostrata; (D) phylogenetic edges between the PIN modules, grouped by land plants, angiosperms, dicots, and Arabidopsis; (E) evolutionary model: land plant emergence, bryophyte-vascular plant split, duplication in angiosperms.]
modules, which corroborate the role of auxin in vasculature formation and in hormonal crosstalk with cytokinin. In Physcomitrella, the genes in the PpPIN module showed predominant expression in ‘complex’ tissue types, such as gametophore, archegonia, and sporophyte, and very low expression in ‘simple’ tissues with linear cell files, such as chloronema and caulonema (Figure 2A). This expression pattern is consistent with a growth-coordinating role of auxin in gametophore, archegonia, and sporophyte [60]. In Arabidopsis, the genes associated with the AtPIN1 module were mainly active in shoots and root vasculature, whereas genes associated with the AtPIN2 module were mainly expressed in roots (Figure 2A). By contrast, genes in the AtPIN4 module, which also included AtPIN3 and AtPIN7 genes, were ubiquitously expressed (Figure 2A). A phylostratigraphic analysis of the modules revealed a significant enrichment for genes that emerged in land plants (red bars, Figure 2C), suggesting that the PIN modules originated in a common ancestor of land plants. This raises the question ‘When were the three Arabidopsis PIN modules duplicated?’. Since there is only one PIN module in Physcomitrella, it appears likely that the duplications of the Arabidopsis modules took place after the split of early land plants and vascular plants. Moreover, phylogenetic analyses of the genes in the Arabidopsis PIN modules revealed an enrichment of genes that were duplicated in the ancestor of angiosperms (purple edges; Figure 2D). The phylostratigraphic and phylogenetic analyses therefore indicate that a PIN module emerged in the common ancestor of land plants and then triplicated in the ancestor of flowering plants, probably to support the more complex morphology of angiosperms (Figure 2E). 
While the function of the AtPIN1 and AtPIN2 modules can be related to the role of AtPIN1 and AtPIN2 in shoot and root development, respectively, the exact functions of the ubiquitously expressed AtPIN4 module and the PpPIN module are less clear. The presence of genes involved in auxin response, cell wall modification, and vascular patterning suggests that these modules represent downstream transcriptional outputs of auxin signaling, and presents exciting directions for functional annotation and experimental testing of these genes in the context of auxin signaling.
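The mapping of phylogenetic information onto modules used in this section can be illustrated with a small Python sketch that counts, for two modules, the paralogous gene pairs linking them, grouped by the evolutionary period in which each pair was duplicated. The input format (triples of two genes and a period, as might be read from annotated gene trees) and the function names are assumptions for illustration. The sketch only counts edges and omits the enrichment test against an expected number.

```python
from collections import Counter


def count_duplication_edges(module_a, module_b, paralog_pairs):
    """Count paralog pairs that link the two modules, per duplication period.

    `paralog_pairs` is an iterable of (gene_1, gene_2, period) triples.
    """
    module_a, module_b = set(module_a), set(module_b)
    counts = Counter()
    for gene_1, gene_2, period in paralog_pairs:
        links_modules = (gene_1 in module_a and gene_2 in module_b) or (
            gene_2 in module_a and gene_1 in module_b
        )
        if links_modules:
            counts[period] += 1
    return counts


def most_frequent_duplication_period(module_a, module_b, paralog_pairs):
    """Return the period with the most linking edges, or None if there are none."""
    counts = count_duplication_edges(module_a, module_b, paralog_pairs)
    return counts.most_common(1)[0][0] if counts else None
```

If most edges between two duplicated modules stem from duplications in the ancestor of angiosperms, the modules themselves are likely to have been duplicated in that period, which mirrors the reasoning applied to the PIN modules above.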
How Are the Modules Duplicated?
A pertinent question is ‘How are gene modules duplicated?’ Did they emerge via whole genome duplication (WGD) events, or perhaps via multiple single gene duplications (SGDs)? A recent study indicated that gene modules are preferentially generated by SGDs, as only 13% of gene pairs generated by WGDs were associated with duplicated modules [23]. Indeed, other analyses indicated that genes duplicated by WGDs are typically retained in the same module [23,61], and only a few duplicated modules derived from recent WGDs could be found in Arabidopsis [62]. Similarly, paralogs generated by WGDs in yeast were rarely found in duplicated modules [63,64]. In addition, protein complexes, another form of functional modules, show evidence for stepwise duplication, underpinned by SGD, rather than via WGDs [65]. These findings are contrary to theoretical expectations, which suggested that gene-balanced
Figure 2. PIN Modules in Physcomitrella (PpPIN) and Arabidopsis (AtPIN1, AtPIN2, AtPIN4). (A) Modules and labels of the four PIN-related modules in Physcomitrella and Arabidopsis. Nodes represent genes in the modules, colored shapes represent labels (indicators of which gene family/Pfam a gene belongs to), while gray edges connect coexpressed genes. The colored boxes below the modules represent the expression profiles of the centroid PIN genes of the modules. Blue and white colors indicate high and low expression, respectively. (B) Annotations of selected labels. For example, the red diamond corresponds to the PIN gene family. (C) Phylostratigraphic composition of the four PIN modules. Green bars indicate the ‘green plant’ phylostratum, while red bars indicate the ‘land plant’ phylostratum. The observed and expected numbers of genes assigned to a given phylostratum are indicated by darker and lighter shades, respectively. The asterisks indicate that the observed number was significantly (p < 0.05) larger than the expected number, which is the case for the ‘land plant’ phylostratum for all modules. (D) Phylogenetic edges found between the PIN modules. The red, purple, orange, and black edges indicate genes duplicated in the ancestor of land plants, angiosperms, dicots, and Arabidopsis, respectively. The numbers indicate the observed number of these edges. (E) Evolutionary model of the PIN modules.
duplications, such as WGDs, might underpin the duplication of functional modules, as deleterious gene-dosage effects might not occur after WGDs [62,66]. However, it is important to note that only recent WGD events (e.g., the α, β, and γ WGDs) can be reliably analyzed, as duplicated gene pairs become difficult to identify from more ancient WGDs due to genomic rearrangements [29] and saturation of Ks values [67]. As many of the duplicated modules, such as the PIN modules, likely predate the recent WGDs, we cannot exclude that more ancient WGDs might have contributed to module duplication. If this were the case, we would expect the age of a WGD event to correlate with the contribution of gene pairs to modules. To test this, we reanalyzed the module partition of gene pairs for the three recent WGDs (α, β, and γ) in the Arabidopsis lineage and found that approximately 16% of the gene pairs that were generated in the most recent WGD (α) are present in the same coexpression clusters (WGD data derived from PLAZA 3.0 [28]; coexpression clusters from PlaNet [19]). By contrast, approximately 9% (β) and approximately 6% (γ) of the gene pairs generated in the older WGDs can be found in the same coexpression cluster. These data confirm a rewiring of coexpression networks following WGDs [61] and suggest that gene pairs generated in more ancient WGD events may indeed contribute more to module duplication than gene pairs from more recent ones. Hence, further analyses of gene pairs generated during older WGDs may shed further light on how modules were created.
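The quantity compared across the three WGD events above, the share of duplicated gene pairs whose members fall into the same coexpression cluster, can be computed as in the following Python sketch. The input structures and the function name are assumptions for illustration, as is the decision to skip pairs in which a gene has no cluster assignment; the sketch does not reproduce the PLAZA or PlaNet data.

```python
def fraction_in_same_cluster(gene_pairs, gene_to_cluster):
    """Fraction of gene pairs whose two members share a coexpression cluster.

    Pairs in which either gene lacks a cluster assignment are skipped.
    """
    considered = 0
    same_cluster = 0
    for gene_a, gene_b in gene_pairs:
        if gene_a in gene_to_cluster and gene_b in gene_to_cluster:
            considered += 1
            if gene_to_cluster[gene_a] == gene_to_cluster[gene_b]:
                same_cluster += 1
    if considered == 0:
        raise ValueError("no gene pair has cluster assignments for both genes")
    return same_cluster / considered
```

Applying the function separately to the gene pairs of each WGD event would give one fraction per event, which can then be compared with the age of the event.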
Concluding Remarks and Future Prospects
Coexpression networks can identify functionally related genes (i.e., modules). The modules can be duplicated to expand certain functions in plants, exemplified by phenolic pathways [39] and cell wall biosynthesis [23]. By combining gene coexpression networks with methods established by comparative genomics, the evolutionary history of modules may be elucidated. Future analyses of coexpression networks from different plant clades should therefore enable us to identify and understand the principles governing the birth, death, and duplication of modules and biological pathways (see also Outstanding Questions).
Acknowledgements
S. Persson was funded by a R@MAP Professorship at University of Melbourne. This work was in part supported by an ARC Discovery grant (DP150103495), and an ARC Future Fellowship grant (FT160100218). S. Proost was funded by ERACAPS grant EVOREPRO. C.R., N.V., and M.M. were funded by the Max Planck Society.
References
1. Rensing, S.A. et al. (2008) The Physcomitrella genome reveals evolutionary insights into the conquest of land by plants. Science 319, 64–69
2. Merchant, S.S. et al. (2007) The Chlamydomonas genome reveals the evolution of key animal and plant functions. Science 318, 245–250
3. Chaney, L. et al. (2016) Genome mapping in plant comparative genomics. Trends Plant Sci. 21, 770–780
4. True, J.R. and Carroll, S.B. (2002) Gene co-option in physiological and morphological evolution. Annu. Rev. Cell Dev. Biol. 18, 53–80
5. Zhong, R. et al. (2010) Evolutionary conservation of the transcriptional network regulating secondary cell wall biosynthesis. Trends Plant Sci. 15, 625–632
6. Banks, J.A. et al. (2011) The Selaginella genome identifies genetic changes associated with the evolution of vascular plants. Science 332, 960–963
7. Rhee, S.Y. and Mutwil, M. (2014) Towards revealing the functions of all genes in plants. Trends Plant Sci. 19, 212–221
8. Radivojac, P. et al. (2013) A large-scale evaluation of computational protein function prediction. Nat. Methods 10, 221–227
9. Proost, S. and Mutwil, M. (2016) Tools of the trade: studying molecular networks in plants. Curr. Opin. Plant Biol. 30, 130–140
10. Arabidopsis Interactome Mapping Consortium (2011) Evidence for network evolution in an Arabidopsis interactome map. Science 333, 601–607
11. Movahedi, S. et al. (2012) Comparative co-expression analysis in plant biology. Plant Cell Environ. 35, 1787–1798
12. Lee, T. et al. (2015) AraNet v2: an improved database of cofunctional gene networks for the study of Arabidopsis thaliana and 27 other nonmodel plant species. Nucleic Acids Res. 43, D996–D1002
13. Mutwil, M. et al. (2010) Assembly of an interactive correlation network for the Arabidopsis genome using a novel heuristic clustering algorithm. Plant Physiol. 152, 29–43
14. Stelzl, U. et al. (2005) A human protein-protein interaction network: a resource for annotating the proteome. Cell 122, 957–968
15. Maslov, S. and Sneppen, K. (2002) Specificity and stability in topology of protein networks. Science 296, 910–913
16. Stuart, J.M. et al. (2003) A gene-coexpression network for global discovery of conserved genetic modules. Science 302, 249–255
17. Zarrineh, P. et al. (2014) Genome-scale co-expression network comparison across Escherichia coli and Salmonella enterica serovar Typhimurium reveals significant conservation at the regulon level of local regulators despite their dissimilar lifestyles. PLoS One 9, e102871
Outstanding Questions
When in plant history did specific gene modules appear and how are they changing?
Are single gene duplications or whole genome duplications generating duplicated modules?
Are conserved modules regulated by conserved transcriptional regulators and cis-elements?
Are duplicated modules regulated by duplicated regulatory network motifs?
Is there a relationship between increased complexity of higher plants and the number of duplicated modules?
How are new genes and new modules integrated into existing gene networks?
18. Gerstein, M.B. et al. (2014) Comparative analysis of the transcriptome across distant species. Nature 512, 445–448
19. Mutwil, M. et al. (2011) PlaNet: combined sequence and expression comparisons across plant networks derived from seven species. Plant Cell 23, 895–910
20. Ruprecht, C. et al. (2011) Large-scale co-expression approach to dissect secondary cell wall formation across plant species. Front. Plant Sci. 2, 1–13
21. Tzfadia, O. et al. (2012) The MORPH algorithm: ranking candidate genes for membership in Arabidopsis and tomato pathways. Plant Cell 24, 4389–4406
22. Park, C.Y. et al. (2013) Functional knowledge transfer for high-accuracy prediction of under-studied biological processes. PLoS Comput. Biol. 9, e1002957
23. Ruprecht, C. et al. (2016) FamNet: a framework to identify multiplied modules driving pathway expansion in plants. Plant Physiol. 170, 1878–1894
24. Movahedi, S. et al. (2011) Comparative network analysis reveals that tissue specificity and gene function are important factors influencing the mode of expression evolution in Arabidopsis and rice. Plant Physiol. 156, 1316–1330
25. Ficklin, S.P. and Feltus, F.A. (2011) Gene coexpression network alignment and conservation of gene modules between two grass species: maize and rice. Plant Physiol. 156, 1244–1256
26. Humphry, M. et al. (2010) A regulon conserved in monocot and dicot plants defines a functional module in antifungal plant immunity. Proc. Natl. Acad. Sci. U. S. A. 107, 21896–21901
27. Kuzniar, A. et al. (2008) The quest for orthologs: finding the corresponding gene across genomes. Trends Genet. 24, 539–551
28. Proost, S. et al. (2009) PLAZA: a comparative genomics resource to study gene and genome evolution in plants. Plant Cell 21, 3718–3731
29. Van Bel, M. et al. (2012) Dissecting plant genomes with the PLAZA comparative genomics platform. Plant Physiol. 158, 590–600
30. Patel, R.V. et al. (2012) BAR expressolog identification: expression profile similarity ranking of homologous genes in plant species. Plant J. 71, 1038–1050
31. Das, M. et al. (2016) Expression pattern similarities support the prediction of orthologs retaining common functions after gene duplication events. Plant Physiol. 171, 2343–2357
32. Hansen, B.O. et al. (2014) Elucidating gene function and function evolution through comparison of co-expression networks of plants. Front. Plant Sci. 5, 1–9
33. Aoki, Y. et al. (2015) ATTED-II in 2016: a plant coexpression database towards lineage-specific coexpression. Plant Cell Physiol. 57, e5
34. Tzfadia, O. et al. (2016) CoExpNetViz: comparative co-expression networks construction and visualization tool. Front. Plant Sci. 6, 1194
35. Yee, D. and Goring, D.R. (2009) The diversity of plant U-box E3 ubiquitin ligases: from upstream activators to downstream target substrates. J. Exp. Bot. 60, 1109–1121
36. Bullard, J.H. et al. (2010) Polygenic and directional regulatory evolution across pathways in Saccharomyces. Proc. Natl. Acad. Sci. U. S. A. 107, 5058–5063
37. Roop, J.I. et al. (2016) Polygenic evolution of a sugar specialization trade-off in yeast. Nature 530, 336–339
38. Matsuno, M. et al. (2009) Evolution of a novel phenolic pathway for pollen development. Science 325, 1688–1692
39. Ehlting, J. et al. (2008) An extensive (co-)expression analysis tool for the cytochrome P450 superfamily in Arabidopsis thaliana. BMC Plant Biol. 8, 47
40. Domazet-Loso, T. et al. (2007) A phylostratigraphy approach to uncover the genomic history of major adaptations in metazoan lineages. Trends Genet. 23, 533–539
41. Geldner, N. et al. (2001) Auxin transport inhibitors block PIN1 cycling and vesicle trafficking. Nature 413, 425–428
42. Benková, E. et al. (2003) Local, efflux-dependent auxin gradients as a common module for plant organ formation. Cell 115, 591–602
43. Reinhardt, D. (2000) Auxin regulates the initiation and radial position of plant lateral organs. Plant Cell 12, 507–518
44. Petrasek, J. and Friml, J. (2009) Auxin transport routes in plant development. Development 136, 2675–2688
45. Michniewicz, M. et al. (2007) Antagonistic regulation of PIN phosphorylation by PP2A and PINOID directs auxin flux. Cell 130, 1044–1056
46. Feraru, E. et al. (2012) Evolution and structural diversification of PILS putative auxin carriers in plants. Front. Plant Sci. 3, 227
47. Abel, S. and Theologis, A. (1996) Early genes and auxin action. Plant Physiol. 111, 9–17
48. Ranocha, P. et al. (2013) Arabidopsis WAT1 is a vacuolar auxin transport facilitator required for auxin homoeostasis. Nat. Commun. 4, 2625
49. Ruzicka, K. et al. (2010) Arabidopsis PIS1 encodes the ABCG37 transporter of auxinic compounds including the auxin precursor indole-3-butyric acid. Proc. Natl. Acad. Sci. U. S. A. 107, 10749–10753
50. Emery, J.F. et al. (2003) Radial patterning of Arabidopsis shoots by class III HD-ZIP and KANADI genes. Curr. Biol. 13, 1768–1774
51. Dean, G. et al. (2004) KNAT6 gene of Arabidopsis is expressed in roots and is required for correct lateral root formation. Plant Mol. Biol. 54, 71–84
52. Berleth, T. et al. (2000) Vascular continuity and auxin signals. Trends Plant Sci. 5, 387–393
53. Ilegems, M. et al. (2010) Interplay of auxin, KANADI and class III HD-ZIP transcription factors in vascular tissue formation. Development 137, 975–984
54. Fisher, K. and Turner, S. (2007) PXY, a receptor-like kinase essential for maintaining polarity during plant vascular-tissue development. Curr. Biol. 17, 1061–1066
55. Mähönen, A.P. et al. (2000) A novel two-component hybrid molecule regulates vascular morphogenesis of the Arabidopsis root. Genes Dev. 14, 2938–2943
56. Gursanscky, N.R. et al. (2016) MOL1 is required for cambium homeostasis in Arabidopsis. Plant J. 86, 210–220
57. Kakimoto, T. (1996) CKI1, a histidine kinase homolog implicated in cytokinin signal transduction. Science 274, 982–985
58. Zhang, Q.C. et al. (2012) Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490, 556–560
59. Zhang, K. et al. (2014) Arabidopsis ABCG14 protein controls the acropetal translocation of root-synthesized cytokinins. Nat. Commun. 5, 3274
60. Lavy, M. et al. (2016) Constitutive auxin response in Physcomitrella reveals complex interactions between Aux/IAA and ARF proteins. Elife Published online June 1, 2016. http://dx.doi.org/10.7554/elife.13325
61. De Smet, R. and Van de Peer, Y. (2012) Redundancy and rewiring of genetic networks following genome-wide duplication events. Curr. Opin. Plant Biol. 15, 168–176
62. Blanc, G. and Wolfe, K.H. (2004) Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell 16, 1667–1678
63. Conant, G.C. and Wolfe, K.H. (2006) Functional partitioning of yeast co-expression networks after genome duplication. PLoS Biol. 4, e109
64. Wapinski, I. et al. (2007) Natural history and evolutionary principles of gene duplication in fungi. Nature 449, 54–61
65. Pereira-Leal, J.B. and Teichmann, S.A. (2005) Novel specificities emerge by stepwise duplication of functional modules. Genome Res. 15, 552–559
66. Papp, B. et al. (2003) Dosage sensitivity and the evolution of gene families in yeast. Nature 424, 194–197
67. Vanneste, K. et al. (2013) Inference of genome duplications from age distributions revisited. Mol. Biol. Evol. 30, 177–190
68. Price, D.C. et al. (2012) Cyanophora paradoxa genome elucidates origin of photosynthesis in algae and plants. Science 335, 843–847
69. Matsuzaki, M. et al. (2004) Genome sequence of the ultrasmall unicellular red alga Cyanidioschyzon merolae 10D. Nature 428, 653–657
70. Hori, K. et al. (2014) Klebsormidium flaccidum genome reveals primary factors for plant terrestrial adaptation. Nat. Commun. 5, 3978
71. Nystedt, B. et al. (2013) The Norway spruce genome sequence and conifer genome evolution. Nature 497, 579–584
72. Albert, V.A. et al. (2013) The Amborella genome and the evolution of flowering plants. Science 342, 1241089
73. International Rice Genome Sequencing Project (2005) The map-based sequence of the rice genome. Nature 436, 793–800
74. The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815
75. Linder, C.R. and Rieseberg, L.H. (2004) Reconstructing patterns of reticulate evolution in plants. Am. J. Bot. 91, 1700–1708
76. Goodstein, D.M. et al. (2012) Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res. 40, D1178–D1186
77. Conte, M.G. et al. (2008) GreenPhylDB: a database for plant comparative genomics. Nucleic Acids Res. 36, D991–D998
78. Duvick, J. et al. (2008) PlantGDB: a resource for comparative plant genomics. Nucleic Acids Res. 36, D959–D965
79. Usadel, B. et al. (2009) Co-expression tools for plant biology: opportunities for hypothesis generation and caveats. Plant Cell Environ. 32, 1633–1651
80. Serin, E.A.R. et al. (2016) Learning from co-expression networks: possibilities and challenges. Front. Plant Sci. 7, 444
81. Brown, D.M. et al. (2005) Identification of novel genes in Arabidopsis involved in secondary cell wall formation using expression profiling and reverse genetics. Plant Cell 17, 2281–2295
82. Persson, S. et al. (2005) Identification of genes required for cellulose synthesis by regression analysis of public microarray data sets. Proc. Natl. Acad. Sci. U. S. A. 102, 8633–8638
83. Maier, T. et al. (2009) Correlation of mRNA and protein in complex biological samples. FEBS Lett. 583, 3966–3973
84. Efroni, I. and Birnbaum, K.D. (2016) The potential of single-cell profiling in plants. Genome Biol. 17, 65
85. Ogata, Y. et al. (2010) CoP: a database for characterizing coexpressed gene modules with biological information in plants. Bioinformatics 26, 1267–1268
86. Jupiter, D. et al. (2009) STARNET 2: a web-based tool for accelerating discovery of gene regulatory networks using microarray co-expression data. BMC Bioinformatics 10, 332
87. Obayashi, T. et al. (2011) ATTED-II updates: condition-specific gene coexpression to extend coexpression analyses and applications to a broad range of flowering plants. Plant Cell Physiol. 52, 213–219
88. De Bodt, S. et al. (2012) CORNET 2.0: integrating plant coexpression, protein-protein interactions, regulatory interactions, gene associations and functional annotations. New Phytol. 195, 707–720