Discovery of SCORs: Anciently derived, highly conserved gene-associated repeats in stony corals

Discovery of SCORs: Anciently derived, highly conserved gene-associated repeats in stony corals

Accepted Manuscript Discovery of SCORs: Anciently derived, highly conserved geneassociated repeats in stony corals Huan Qiu, Ehud Zelzion, Hollie M. ...

664KB Sizes 1 Downloads 14 Views

Accepted Manuscript Discovery of SCORs: Anciently derived, highly conserved geneassociated repeats in stony corals

Huan Qiu, Ehud Zelzion, Hollie M. Putnam, Ruth D. Gates, Nicole E. Wagner, Diane K. Adams, Debashish Bhattacharya PII: DOI: Reference:

S0888-7543(17)30044-7 doi: 10.1016/j.ygeno.2017.06.003 YGENO 8891

To appear in:

Genomics

Received date: Revised date: Accepted date:

9 May 2017 ###REVISEDDATE### 16 June 2017

Please cite this article as: Huan Qiu, Ehud Zelzion, Hollie M. Putnam, Ruth D. Gates, Nicole E. Wagner, Diane K. Adams, Debashish Bhattacharya , Discovery of SCORs: Anciently derived, highly conserved gene-associated repeats in stony corals, Genomics (2017), doi: 10.1016/j.ygeno.2017.06.003

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT Original Article

Discovery of SCORs: Anciently Derived, Highly Conserved Gene-Associated Repeats in Stony Corals

T

Huan Qiua, Ehud Zelziona, Hollie M. Putnamb,d, Ruth D. Gatesb, Nicole E. Wagnera, Diane K.

CR

a

IP

Adamsc, Debashish Bhattacharyaa,*

Department of Ecology, Evolution and Natural Resources, Rutgers University, New Brunswick,

b

Hawai’i Institute of Marine Biology, Kaneohe, HI 96744, USA.

Department of Marine and Coastal Sciences, Rutgers University, New Brunswick, NJ 08901,

AN

c

US

NJ 08901, USA.

USA.

Current address: Department of Biological Sciences, University of Rhode Island, Kingston, RI

M

d

ED

02881, USA.

*Corresponding author

PT

Debashish Bhattacharya, Department of Ecology, Evolution and Natural Resources, 59 Dudley Road, Rutgers University, New Brunswick, NJ 08901, USA.

AC

CE

email: [email protected]; phone: +1 848-932-6218

1

ACCEPTED MANUSCRIPT Abstract Stony coral (Scleractinia) genomes are still poorly explored and many questions remain about their evolution and contribution to the success and longevity of reefs. We analyzed transcriptome and genome data from Montipora capitata, Acropora digitifera, and transcriptome data from 20 other coral species. To our surprise, we found highly conserved, anciently derived, Scleractinia

T

COral-specific Repeat families (SCORs) that are abundant in all the studied lineages. SCORs

IP

form complex secondary structures and are located in untranslated regions and introns, but most abundant in intergenic DNA. These repeat families have undergone frequent duplication and

CR

degradation, suggesting a ‘boom and bust’ cycle of invasion and loss. We speculate that due to their surprisingly high sequence identities across deeply diverged corals, physical association

US

with genes, and dynamic evolution, SCORs might have adaptive functions in corals that need to

AN

be explored using population genomic and function-based approaches.

AC

CE

PT

ED

M

Keywords: DNA Repeats, Genomics, Horizontal Transfer, Secondary Structure, Scleractinia

2

ACCEPTED MANUSCRIPT Introduction Coral reefs are rich in biodiversity and support about 30% of all known marine species [1]. Built by stony corals (Scleractinia) that radiated over two hundred million years ago [2], coral reefs not only stabilize coastlines, but also support fisheries and tourism with a global annual economic value of about 300 billion dollars [3]. The published coral genome from Acropora

T

digitifera is of size 420 Mbp and encodes ~24,000 genes [4]. Analysis of this reference genome

IP

and other coral transcriptome data reveal highly conserved pathways related to stress response, DNA repair, cell cycle and apoptosis [5,6]. Stony corals have also acquired foreign genes

CR

through horizontal gene transfer (HGT) to provide defense against DNA damage and other

their photosynthetic dinoflagellate symbionts [4,5].

US

stresses associated with life in the near-shore environment, and the reactive species generated by

To date, our knowledge of coral genomes comes largely from the analysis of protein-

AN

coding genes, which account for only a small fraction of nuclear DNA and as expected, these amino acid sequences are highly constrained [5]. In addition to epigenetics [7], another driver of

M

genome evolution can be mobile genetic elements that impact gene expression patterns through chromatin modification or facilitate domain shuffling through repeat-based recombination. To

ED

explore this latter source of variation, we used the stony corals Montipora capitata and Acropora digitifera as models for DNA repeat analysis. Our work revealed a novel class of widely

PT

distributed, anciently derived families of repeats (we named SCORs) that have extensive secondary structures and show significant sequence identity within families. Their high

CE

conservation, that eclipses that of protein coding DNA, and wide taxonomic distribution in corals since the divergence of the Scleractinia ca. 240 Mya suggest that SCORs may have a role in

AC

coral biology; i.e., beyond that of being parasitic elements.

Results and Discussion Discovery of SCORs in Montipora capitata Transcriptome Data. We generated >36 million Illumina reads from M. capitata RNA samples that were assembled into >73K contigs with a N50 = 443 bp (see Methods). Using these M. capitata transcriptome data (available under NCBI BioProject PRJNA339779), we identified a total of 57 non-coding SCOR families with consensus sequences ranging in size from 73–633bp (Supplementary Data 1). These SCORs have a total abundance of 2,518 copies in the transcriptome data and are frequently located in the 3

ACCEPTED MANUSCRIPT 5’ and 3’ untranslated regions (UTRs) flanking the predicted open reading frames (ORFs). SCORs had one to two orders of magnitude greater read coverage when compared to coding regions. Examples of 3’ UTR-encoded SCORs are shown in Figure 1A for nearly complete cDNAs encoding a transformation/transcription domain associated protein (Mcap.SCOR-29 family [involved in transcription activation]) and a Ras-related protein Rab-43 (Mcap.SCOR-2 family [involved in intracellular membrane trafficking]). In the former, presence of the 111bp

T

SCOR (Mcap.SCOR-29 family) leads to massive coverage bias that varies from ca 20x – 520x

IP

over the CDS to ca. 4,290x in the SCOR. In the latter case, the average coverage of the CDS is

CR

40x, whereas it is 2,933x in the 133bp repeated region (Mcap.SCOR-2). These SCOR coverage values represent mapping of total transcriptome data (RNA-seq reads), therefore the repeated

US

DNAs recruit sequences from all related family members in the transcriptome (see below). Using a loop-based energy model [8], we observed that the two expressed SCORs in Figure 1A form

AN

robust RNA hairpin structures (Fig. 1B). This is also the case for many of the other expressed SCORs that encode significantly supported (i.e., RNA regions in the red field) helical regions

M

(Fig. S1).

ED

Intronic SCORs in the Montipora capitata Pax-C Gene. At the genomic level, SCORs were commonly found within introns in M. capitata genes. An example is provided by the

PT

spliceosomal intron located between exons 46 and 47 in the nuclear encoded single-copy homeobox gene Pax-C (Fig. S2A). This region has been used for phylogeographic studies in

CE

corals such as Acropora species (e.g., [9]). In practice, only the ca. 5’- terminal 400nt is targeted even though this conserved intron is >800nt in length. The likely reason for this pattern of data

AC

usage is shown in Figure S2A. Mapping the Illumina DNA (see Methods for details) reads to the 5’-half of this intron (Genbank accession: AY313479) identified a single nucleotide polymorphism (SNP) at the expected frequency of ca. 50% in this sequence library derived from a single sperm bundle extract. Phylogenetic analysis of the Pax-C protein results in a topology (Fig. S2B) that is consistent with the reference tree for corals (Fig. S3). These results suggest that the evolution of the Pax-C gene reflects vertical transmission. In contrast, the 3’-half of the intron encodes a repeated sequence that has a copy number exceeding 10K in the genomic data and contains a large number of SNPs across the individual copies. The high copy number of this previously unidentified SCOR stands in contrast to the single-copy nature of the 5’ neighboring

4

ACCEPTED MANUSCRIPT intron region and the Pax-C gene as a whole. Genome data from additional coral species are needed to clarify if the SCOR was present in Pax-C gene in the common ancestor of corals or was inserted more recently into Pax-C intron in M. capitata.

SCORs in the Acropora digitifera Draft Genome. To gain additional insights into the origin, evolution, and potential function(s) of SCORs, we shifted our analysis to the draft genome of A.

T

digitifera. A total of 89 SCOR families (Supplementary Data 2) were predicted in this coral that

IP

were primarily located in intergenic (239.3 Mbp with 22,780 SCORs) and genic regions (172.8

CR

Mbp with 6,451 SCORs) (Table S1). Genic regions showed a >50% deficiency (37 repeats/Mbp versus 95 repeats/Mbp) relative to intergenic regions in terms of SCOR density. Thus, SCOR

US

movement into genic regions has in general been repressed, suggesting a functional role for those that have been retained. Phylogenetic analyses showed a “star-like” topology for most of the

AN

SCOR families in A. digitifera (Supplementary Data 3), including the Adig.SCOR-17 family shown in Figure 2. The majority of sequences in this family therefore resulted from a few bursts

M

of duplication followed by the independent accumulation of substitutions (Fig. 2A). This pattern of evolution is typical for transposons and interspersed elements present in many other animal

ED

genomes [10,11]. Regarding length conservation, most SCORs, particularly those with a consensus sequence >1500bp, were present primarily as partial fragments, suggesting rapid

PT

SCOR degradation after duplication (Fig. S4). Alternatively, many SCOR fragments (not the original repeat) may have undergone extensive duplication. SCORs located in different genomic

CE

domains (e.g., intergenic, intron, and exon) are intermingled (Fig. S5). For example, two SCORs found in genic regions were nested in clusters of intergenic SCORs (Fig. 2B), suggesting that

AC

they likely originated from intergenic DNA. Taken together, these results underline the dynamic evolution of repetitive DNA that resulted from the frequent birth and death of SCOR families in the A. digitifera genome. In addition, we found more than a hundred A. digitifera proteins that correspond to, or contain (near-) complete known transposable element-encoded proteins (Table S2). These enzymes likely explain the widespread expression and retro-transposition of SCORs in the A. digitifera genome.

SCOR Evolution Across Corals. Using a comprehensive coral transcriptome database [5], we studied sequence conservation in A. digitifera SCORs when compared to robust corals.

5

ACCEPTED MANUSCRIPT Surprisingly, high sequence identities (with alignments ≥120bp) were found that ranged from 0.80-0.97 in 44 out of the 89 SCOR families (Fig. 3A). The remaining SCOR families are either specific to complex corals or have corresponding families in robust corals that are too divergent to be detected. In any case, for those SCOR families with high conserved sequences, the level of conservation is much higher than the pairwise identities (i.e., A. digitifera versus Stylophora pistillata) found in coding DNA, which followed a normal distribution with a peak at identity =

T

0.82 (Fig. 3A) and the 90% percentile identity = 0.85. In the corresponding protein data, the

IP

identities peaked at 0.78 with the 80% percentile identity = 0.86. The latter value is lower than

CR

the identities found in most (except 2 cases) of the conserved SCOR families (Fig. 3A). Despite strong conservation at the sequence level, most of the A. digitifera SCOR families were found as

US

partial fragments in the transcriptomes of robust coral species (Fig. S6). This result might be explained by partial integration into the DNA regions giving rise to the transcriptome, or more

AN

likely by reduced lengths due to degradation in robust coral genomes as were found in the A. digitifera genome (Fig. S4).

M

Phylogenetic analysis of the A. digitifera-derived SCOR families across multiple corals species resulted in many complex topologies (Supplementary Data 4) that were not, at first

ED

glance, compatible with coral species relationships (Fig. S3). One example is Adig.SCOR-62 family, in which the complex and robust corals are intermingled throughout the tree and within

PT

the network (Fig. 3B and Fig. S7). The explanation for this pattern of evolution, under the assumption of vertical transmission, requires massive duplications in the common ancestor of

CE

corals followed by multiple independent losses in descendants. This scenario is likely to be true for most SCORs, given their mobility and turnover within species. This would result in a tree

AC

with a highly reticulated pattern of evolution for the different repeat copies over the ca. 240 My of scleractinian evolution. The significant sequence conservation (Fig. 3A) of SCORs therefore likely resulted from constraints on sequence substitution since the divergence between complex and robust corals (Fig. S3). These results suggest that many SCORs have dynamic evolutionary histories and evolved primarily through frequent duplication, fragmentation, and deletion rather than sequence substitutions. This hypothesis is supported by the many instances of recent SCOR duplication shown at the tips of tree that is apparent in Figure 3B and Figure S7. Alternatively, many SCORs might have rapidly become undetectable due to accelerated sequence divergence,

6

ACCEPTED MANUSCRIPT whereas only conserved SCOR copies were included in our analysis. In any case, these results are consistent with the highly dynamic nature of SCOR evolution.

Potential Horizontal Transfer of SCORs Between Corals. For 28 SCOR families, the trees were dominated by complex corals with the inclusion of one or a few sequences from robust corals, or vice versa (Supplementary Data 5) and may indicate HGT across coral species. One

T

putative example of this phenomenon is provided by the Adig.SCOR-70 family that primarily

IP

contains sequences of complex corals with the exception of five from the robust coral Favia sp.

CR

in the tree and in the protein similarity network (Fig. 3C). A closer examination of the tree shows clustering of Acropora species and Favia sp. within the Acropora species complex (Fig. S7).

US

This topology suggests possible transfers of the SCOR from Acropora to Favia. Transfer of interspersed repeats and transposons is commonly reported and regarded to be a major driving

AN

force behind genomic variation and innovation among animals [12]. Transfer of SCORs between complex and robust corals may explain about one-half of the observed high pairwise identities

M

between the two groups (Fig. 3A). We note however that the coral transcriptome data we studied are highly heterogeneous in terms of sequence coverage and quality. Furthermore, many short

ED

SCOR fragments were discarded during the phylogenetic tree building procedure (see Methods). For these reasons, some SCOR families may have a broader taxonomic distribution than reported

during coral evolution.

PT

here. Nevertheless, these results suggest that horizontal transfer of SCORs may have occurred

CE

The question of how SCORs move between coral species remains to be determined and would presumably require a vector or alternatively, direct fusion of coral juveniles or colonies. In

AC

terms of vectors, BLASTn analysis against RefSeq did not turn up evidence of SCOR sequences being present in any non-coral taxa (e.g., bacteria, viruses), although relatively limited data are available from coral microbiomes. With regard to chimerism, this explanation is highly unlikely for the Acropora species and Favia sp. example shown in Figure 3C because coral fusions are thought to occur only between closely related taxa. Existing data provide evidence for both intraand inter-species chimerism in corals [13–16]. For example, molecular markers and observation of A. millepora show that juveniles can fuse and gregarious larvae form chimeric colonies during settlement [13]. Gregarious larval settlement may lead to inter-specific chimeric fusion between M. capitata and M. flabellata [14]. Chimeras can also form without the need for settlement

7

ACCEPTED MANUSCRIPT aggregation. Asexually produced chimeric larvae have been identified in Pocillopora damicornis [16]. In light of these observations, it is conceivable that rare instances of successful chimerism can lead to the spread of SCORs across ‘local neighborhoods’ in the coral tree of life, either at the juvenile stage or between cells at the margin of fused colonies where DNA exchange may occur. An in-depth population level analysis of SCOR genomic distribution in chimeric and non-

T

chimeric colonies is required to test this hypothesis.

IP

Putative Functions of SCORs. The biological function of SCORs is yet to be determined but

CR

they may act as transcription regulators, as previously described in the prasinophyte green alga Micromonas [17]. Transcribed transposable elements are common targets for RNA-binding

US

protein and have important impacts on RNA-protein regulatory networks [18], RNA degradation [19], and protein translation [20]. Similar to the known coding potential of transposable elements

AN

(e.g., in humans; [21]), SCORs are also found in the coding region of many gene transcripts (Table S1). However, their contribution is limited in terms of total numbers (Table S1). To

M

address SCOR function, we first assessed if they are sources of long noncoding RNA and/or microRNA. SCORs were found in hundreds of instances within long noncoding RNA transcripts.

ED

However, most of these expressed SCORs are short fragments and account for a minor fraction of the long noncoding RNA transcript population (Fig. S8). We searched miRBase

PT

(http://www.mirbase.org/) [22] and a database of validated Stylophora pistillata small RNAs [23] for potential similarities in sequence and structure to existing microRNAs in animals and

CE

none were found. In addition, we downloaded the 147 microRNAs from Nematostella vectensis available in miRBase and queried them using BLASTn (e-value cutoff = 1e-5) with the SCORs

AC

as queries. The significant hits were restricted to short (11-14 nt) regions that do not show clear evidence for microRNA structure in the coral. These results do not exclude the possibility that some SCORs generate microRNAs but our bioinformatic analysis does not support this hypothesis. To test whether SCORs take part in regulating gene expression, we analyzed RNA-seq data that address two conserved pathways in coral biology, development (including biomineralization in calcifying spats and adults) and symbiosome formation with the dinoflagellate symbiont. The data on development was derived from 5 different stages in A. digitifera [24] and the symbiosome work addressed pre- and post-exposure of A. digitifera

8

ACCEPTED MANUSCRIPT planulae to Symbiodinium cells [25]. After performing differential gene expression (see Methods), 515 SCOR-associated genes (with ≥1 SCOR in the UTR or intron sequence) were identified as being differentially expressed (DE) between consecutive developmental time points. However, no clear repeat-driven expression pattern was found among the impacted genes for each SCOR family (e.g., Fig. S9). In the case of symbiosome formation, 102 DE genes were found that were associated with SCORs. When these data were clustered according to gene

T

expression over the two conditions versus repeat presence in the genes, the results showed that

IP

repeat expression was not correlated with the DE genes (Fig. S10). These results do not lend

CR

support to the idea that SCORs are directly involved in regulating gene expression related to development and symbiosome formation in corals. The potential impacts of SCORs on gene

US

expression under other conditions (e.g., varying light, CO2 and pH) remain however to be tested. It is possible that these shorter-term stress responses may be the targets for SCOR-based gene

AN

regulation. We also tested the idea that SCORs may mediate domain shuffling and gene fusion, thereby providing a source of genetic variation in corals (e.g., [26]). However, comparison of the

M

A. digitifera predicted proteome with that of the closely related sea anemone Aiptasia pallida [27] showed no evidence of domain shuffling or gene fusion in the coral genome. That is, no A.

ED

digitifera protein was found to be comprised of domains derived from two different genes in A.

PT

pallida (see Methods).

Conclusion

CE

The study of coral genomes and their evolution has until now been primarily limited to proteincoding genes or DNA differences within well-aligned genomic regions. Whereas these targets

AC

are readily amenable to study, they account for limited changes, particularly over short evolutionary timescales. In contrast, SCORs are abundant with a dynamic evolutionary history that suggests they may play a role in coral biology, perhaps generating genetic variation at the population level. Analysis of SCOR distribution within and between coral populations would address this hypothesis. In addition, functional genomic analysis of corals under different environmental conditions as well as DNA footprinting studies need to be done to assess if genes containing SCORs may be differentially expressed and if these repeated regions are sites of DNA or RNA-binding proteins that regulate gene expression.

9

ACCEPTED MANUSCRIPT Materials and Methods Collection and processing of coral tissues Montipora capitata was collected during the July 2015 spawning period (Special Activity Permit 2015-17) in Kāneohe Bay, O’ahu Hawai’i. Egg sperm bundles were collected immediately upon release and placed in 1.5mL sterile RNAse and DNase free microfuge tubes. The bundles were

T

left in individual tubes for 30 min to break apart, with the buoyant eggs floating to the surface

IP

and the denser sperm settling to the bottom. The sperm fraction was removed by pipetting to a new tube and was cleaned by a series of three rinse and spin steps, with samples rinsed with 0.2

CR

µm filtered seawater, centrifuged at 13,000 rpm for 3 min. The supernatant was removed and the

US

concentrated sperm was stored at -80°C.

DNA extraction and genomic shotgun library construction

AN

Genomic DNA was extracted from individual sperm bundles using the Zymo Quick-DNA Universal Kit (Zymo Res. Corp.), with the Biological Fluids and Cells protocol, and eluted in

M

HCl (pH 8.5). DNA concentrations were measured on a Qubit instrument. A total of 200 ng of genomic DNA from a selected single sperm bundle was used to construct a

ED

library using the Illumina TruSeq Nano DNA LT Library Prep Kit (Illumina, Inc.). The library was run on an Illumina MiSeq Personal Genome Sequencer using the Illumina MiSeq Reagent

PT

Kit v3 (600 cycles paired-end [PE]).

CE

RNA extraction and RNA-seq library construction Total RNA from sperm derived from five colonies was extracted by resuspending each sample in

AC

550 L of Trizol (ThermoFisher Scientific) [28]. These samples were passed twice through a QiaShredder column (Qiagen, Inc.) and then transferred to a 1.5mL microcentrifuge tube. A total of 450 L of Trizol was added to bring the volume to 1.0 mL. After 5-minute room temperature incubation, 200 L of chloroform was added and the sample was vigorously shaken for 15 seconds, and then incubated at room temp for 3 minutes. The samples were centrifuged for 15 minutes at 4C. The upper aqueous layer was transferred to a new 1.5 mL tube, and an equal volume of 70% ethanol was added and gently mixed. The samples were transferred to Qiagen RNeasy mini columns. From here, the Qiagen RNeasy mini protocol was followed, including the optional on-column DNase treatment. Total RNA was eluted in 55L of nuclease-free water. 10

ACCEPTED MANUSCRIPT Five individual RNA-seq libraries were generated using 200ng of the total RNA from each sample using the Illumina TruSeq RNA Library Preparation Kit v2. The libraries were combined in equimolar concentrations and run on a single Illumina MiSeq flowcell using the Illumina MiSeq Reagent Kit v3 (150 cycles PE).

Transcriptomic and genomic data generation and analysis

T

The RNA-seq run with the combined 5 libraries yielded 36,250,700 MiSeq raw reads that were

IP

adapter and quality-trimmed using the CLC Genomics Workbench (v7.5, Qiagen, Inc.). After

CR

trimming, 22,681,438 (3.2 Gbp) reads remained for assembly and downstream analysis. The trimmed reads were assembled with CLC Genomics Workbench into 73,094 contigs with a N50

US

= 442 bp. The M. capitata genomic DNA library was run twice on the MiSeq and yielded a total of 95,971,984 raw reads, of which 70,060,127 were used for assembly, producing 600,706

AN

contigs totaling 359,691,707 bp and with a N50 = 720 bp.

M

Identification of expressed SCORs

To identify SCORs in M. capitata, its transcriptome assembly was analyzed using

ED

RepeatModeler 1.0.8 (http://www.repeatmasker.org/RepeatModeler.html) under the default setting. As a de novo repeat family identification and modeling pipeline, RepeatModeler

PT

automates the runs of core repeat finding programs (RECON [29] and RepeatScout [30]), processing the output and provides the consensus models of putative repeats. The resulting

CE

library of repeats (containing consensus sequences for 57 non-coding families) was used as input for the RepeatMasker (http://www.repeatmasker.org/) to determine the distribution of the repeats

AC

(copy number and location) in M. capitata genome under the default setting. The distribution of repeats was parsed from the output from the RepeatMasker analysis. To identify the repeats in Acropora digitifera genome, we collected the genic regions (i.e., the combination of UTR, exon and intron) from 5,000 largest genes that are free of assembly gaps. These genomic sequences were used for repeat identification following the same procedure.

Prediction of SCOR secondary structure The secondary structures of SCORs were inferred using the RNAfold webserver (http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi). By default, the RNA structure was searched to

11

ACCEPTED MANUSCRIPT minimize the free energy of an RNA sequence which is calculated using a loop-based energy model [8]. The partition function and base pairing probability matrix was estimated using McCaskill’s method [31]. The graphic representation of secondary structures and base-pairing probabilities were generated using default settings at the server site. Sequence identity measurement

T

One-to-one orthologous genes pairs between A. digitifera and S. pistillata were identified using

IP

the reciprocal smallest distance algorithm implemented in the ‘rsd_search’ program (-de = ‘0.25

CR

1e-5’) [32]. We used S. pistillata to represent robust corals because its gene models were built on a genomic assembly and were the least fragmented among the available data from robust corals.

US

The alignments between orthologs at nucleotide and protein levels were built using BLASTn and BLASTp, respectively. Sequence identities were subtracted from the BLAST outputs using

AN

custom scripts. In case of two or more non-overlapping alignment blocks, all blocks were concatenated and the average identity was reported. To identify the highest identity between SCORs from the two species, we searched the A. digitifera SCOR consensuses sequences against

M

its whole genome sequence. For each SCOR family, the top 100 A. digitifera hits were retrieved

ED

and used as queries to search against S. pistillata transcriptome using BLASTn (e-value =1e-10). The highest sequence identity among the corresponding 100 BLASTn top-hits (with alignment

PT

≥120 bps) was reported. Replacing S. pistillata with transcriptome data from those from the 11 robust coral species [5], we applied the same method to derive the highest identity between A.

CE

digitifera and robust corals.

AC

Phylogenetic analysis

To explore the evolutionary history of SCORs (e.g., Fig. 3), we searched the A. digitifera draft genome using the 89 A. digitifera SCOR consensus sequences as queries with BLASTn (e-value =1e-10). The homologous sequences from the queried coral species were parsed from the BLASTn output. For each SCOR family, the homologous sequences were aligned using MUSCLE v3.8.31 under the default settings [33]. From the resulting alignment, we removed columns with >50% missing data and sequences with >70% gaps. Phylogenetic trees were built using FastTree v2.1.7 [34] with four rounds of minimum-evolution SPR moves (-spr 4) and exhaustive maximum likelihood nearest-neighbor interchanges (-mlacc 2 -slownni). Branch

12

ACCEPTED MANUSCRIPT supports are calculated local support values using the Shimodaira-Hasegawa test [35]. The resulting

trees (if returned) were manually examined. To study the evolution of SCORs across coral species (e.g., Figs. 3B and 3C), we searched coral transcriptome data [5] using the A. digitifera SCOR consensus sequences as queries and performed the downstream analyses following the same procedure as described above.

T

Network analysis

IP

Network analysis of was done with EGN v1.0 [36] using the same sequence data for each SCOR

CR

family. The outputs from all-to-all BLASTn searches (e-value =1e-10) were used to build edges and graphs with sequence identity cutoff (75%) and coverage cutoff (20% in both sequences).

US

The networks were then visualized with Cytoscape v2.8.3 (www.cytoscape.org/).

AN

Coral gene expression

The raw reads from 5 A. digitifera developmental stages were downloaded from the Short Read

M

Archive (Blastula: SAMD00021035, Gastrula: SAMD00021036, Sphere: SAMD00021038, Planula: SAMD00021037, and Adult: SAMD00021034). The reads were then quality trimmed

ED

and sequencing adapters were removed using CLC Genomic Workbench (QIAGEN), which was used also for mapping the trimmed reads on to the publicly available A. digitera genome [4].

PT

Differential gene expression was performed using DESeq2 [37], genes with a fold change > 4 and with a false discovery rate (FDR) < 0.01 were considered as differentially expressed (DE).

CE

The expression values of 515 DE genes were exported from DESeq2 following transformation using the variant stabilizing transformation, which provides a statistical basis for comparing

AC

expression values between libraries. The same analytical approach was applied to the RNA-seq data from Mohamed et al. [25] (GEO accession no. GSE76976).

Domain shuffling and gene fusion To determine if the SCORs mediate domain shuffling and gene fusion, we carried out a BLASTp search (e-value cutoff = 1e-3) using A. digitifera proteome as queries against its sister taxon, the non-calcifying sea anemone, Aiptasia pallida [27]. We looked for instances where two nonoverlapping regions of a single A. digitifera protein corresponded to two different A. pallida proteins. This pattern could have resulted from a variety of mechanisms including gene fusion in

13

ACCEPTED MANUSCRIPT A. digitifera since its split from A. pallida. We required the matched A. pallida proteins to cover 30%-70% length of the potentially fused A. digitifera proteins.

Acknowledgments This work was made possible by a grant from the National Science Foundation EF-

T

1416785 awarded to D.B. and Paul Falkowski (Rutgers University), OCE-PRF 1323822 to

IP

H.M.P., and Hawaii EPSCOR EPS-0903833. Funding was also provided to R.D.G. by the Paul G. Allen Family Foundation, and by the US-Israel Binational Science Foundation (BSF-

CR

2014035) to D.K.A. These funding sources had no involvement in the conduct of the research

US

and/or preparation of the article.

Conflicts of interests

AN

The authors declare that they have no conflicts of interest.

M

References

ED

1. Wilkinson C. Status of coral reefs of the world. Aust. Inst. Mar. Stud. 2004. 2. Veron J. Corals in space and time: the biogeography and evolution of the Scleractinia. Sydney: UNSW Press; 1995.

PT

3. Costanza R, d’Arge R, de Groot R, Farber S, Grasso M, Hannon B, et al. The value of the world’s ecosystem services and natural capital. Nature. 1997;387:253–60.

CE

4. Shinzato C, Shoguchi E, Kawashima T, Hamada M, Hisata K, Tanaka M, et al. Using the Acropora digitifera genome to understand coral responses to environmental change. Nature.

AC

2011;476:320–3.

5. Bhattacharya D, Agrawal S, Aranda M, Baumgarten S, Belcaid M, Drake JL, et al. Comparative genomics explains the evolutionary success of reef-forming corals. eLife. 2016;5:e12388. 6. Shinzato C, Mungpakdee S, Satoh N, Shoguchi E. A genomic approach to coral-dinoflagellate symbiosis: studies of Acropora digitifera and Symbiodinium minutum. Front. Microbiol. 2014;5:336.

14

ACCEPTED MANUSCRIPT 7. Putnam HM, Davidson JM, Gates RD. Ocean acidification influences host DNA methylation and phenotypic plasticity in environmentally susceptible corals. Evol. Appl. 2016;9:1165–78. 8. Zuker M, Stiegler P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 1981;9:133–48. 9. Richards ZT, van Oppen MJH, Wallace CC, Willis BL, Miller DJ. Some rare Indo-Pacific coral species are probable hybrids. PloS One. 2008;3:e3240.

T

10. Craig NL, Lambowitz AM, Craigie R, Gellert M, editors. Mobile DNA II [Internet].

IP

Washington, D.C.: ASM Press; 2002 [cited 2016 Oct 24]. Available from:

CR

http://www.asmscience.org/content/book/10.1128/9781555817954

11. Gilbert C, Schaack S, Pace JK, Brindley PJ, Feschotte C. A role for host-parasite interactions

US

in the horizontal transfer of transposons across phyla. Nature. 2010;464:1347–50. 12. Schaack S, Gilbert C, Feschotte C. Promiscuous DNA: horizontal transfer of transposable

AN

elements and why it matters for eukaryotic evolution. Trends Ecol. Evol. 2010;25:537–46. 13. Puill-Stephan E, Willis BL, van Herwerden L, van Oppen MJH. Chimerism in wild adult

M

populations of the broadcast spawning coral Acropora millepora on the Great Barrier Reef. PloS One. 2009;4:e7751.

ED

14. Work TM, Forsman ZH, Szabó Z, Lewis TD, Aeby GS, Toonen RJ. Inter-specific coral chimerism: genetically distinct multicellular structures associated with tissue loss in Montipora

PT

capitata. PloS One. 2011;6:e22869.

15. Schweinsberg M, González Pech RA, Tollrian R, Lampert KP. Transfer of intracolonial

CE

genetic variability through gametes in Acropora hyacinthus corals. Coral Reefs. 2014;33:77–87. 16. Rinkevich B, Shaish L, Douek J, Ben-Shlomo R. Venturing in coral larval chimerism: a

AC

compact functional domain with fostered genotypic diversity. Sci. Rep. 2016;6:19493. 17. Worden AZ, Lee J-H, Mock T, Rouzé P, Simmons MP, Aerts AL, et al. Green evolution and dynamic adaptations revealed by genomes of the marine picoeukaryotes Micromonas. Science. 2009;324:268–72. 18. Kelley DR, Hendrickson DG, Tenen D, Rinn JL. Transposable elements modulate human RNA abundance and splicing via specific RNA-protein interactions. Genome Biol. 2014;15:537. 19. Gong C, Maquat LE. lncRNAs transactivate STAU1-mediated mRNA decay by duplexing with 3’ UTRs via Alu elements. Nature. 2011;470:284–8.

15

ACCEPTED MANUSCRIPT 20. Carrieri C, Cimatti L, Biagioli M, Beugnet A, Zucchelli S, Fedele S, et al. Long non-coding antisense RNA controls Uchl1 translation through an embedded SINEB2 repeat. Nature. 2012;491:454–7. 21. Lin L, Jiang P, Park JW, Wang J, Lu Z-X, Lam MPY, et al. The contribution of Alu exons to the human proteome. Genome Biol. 2016;17:15. 22. Kozomara A, Griffiths-Jones S. miRBase: annotating high confidence microRNAs using

T

deep sequencing data. Nucleic Acids Res. 2014;42:D68-73.

IP

23. Liew YJ, Aranda M, Carr A, Baumgarten S, Zoccola D, Tambutté S, et al. Identification of

CR

microRNAs in the coral Stylophora pistillata. PloS One. 2014;9:e91101.

24. Reyes-Bermudez A, Villar-Briones A, Ramirez-Portilla C, Hidaka M, Mikheyev AS.

US

Developmental progression in the coral Acropora digitifera is controlled by differential expression of distinct regulatory gene networks. Genome Biol. Evol. 2016;8:851–70.

AN

25. Mohamed AR, Cumbo V, Harii S, Shinzato C, Chan CX, Ragan MA, et al. The transcriptomic response of the coral Acropora digitifera to a competent Symbiodinium strain: the

M

symbiosome as an arrested early phagosome. Mol. Ecol. 2016;25:3127–41. 26. Méheust R, Zelzion E, Bhattacharya D, Lopez P, Bapteste E. Protein networks identify novel

ED

symbiogenetic genes resulting from plastid endosymbiosis. Proc. Natl. Acad. Sci. U. S. A. 2016;113:3579–84.

PT

27. Baumgarten S, Simakov O, Esherick LY, Liew YJ, Lehnert EM, Michell CT, et al. The genome of Aiptasia, a sea anemone model for coral symbiosis. Proc. Natl. Acad. Sci. U. S. A.

CE

2015;112:11893–8.

28. Mass T, Putnam HM, Drake JL, Zelzion E, Gates RD, Bhattacharya D, et al. Temporal and

AC

spatial expression patterns of biomineralization proteins during early development in the stony coral Pocillopora damicornis. Proc. R. Soc. B Biol. Sci. 2016;283:20160322. 29. Bao Z, Eddy SR. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 2002;12:1269–76. 30. Price AL, Jones NC, Pevzner PA. De novo identification of repeat families in large genomes. Bioinforma. Oxf. Engl. 2005;21 Suppl 1:i351-358. 31. McCaskill JS. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers. 1990;29:1105–19.

16

ACCEPTED MANUSCRIPT 32. Wall DP, Fraser HB, Hirsh AE. Detecting putative orthologs. Bioinforma. Oxf. Engl. 2003;19:1710–1. 33. Edgar RC. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics. 2004;5:113. 34. Price MN, Dehal PS, Arkin AP. FastTree 2--approximately maximum-likelihood trees for large alignments. PloS One. 2010;5:e9490.

T

35. Shimodaira H, Hasegawa M. Multiple comparisons of log-likelihoods with applications to

IP

phylogenetic inference. Mol. Biol. Evol. 1999;16:1114–6.

CR

36. Halary S, McInerney JO, Lopez P, Bapteste E. EGN: a wizard for construction of gene and genome similarity networks. BMC Evol. Biol. 2013;13:146.

US

37. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-

AN

seq data with DESeq2. Genome Biol. 2014;15.

M

Figure legends:

Fig. 1. Two examples of 3’-UTR SCORs in M. capitata. (A) The genomic locations of these

ED

SCORs and their host genes are shown. Arrows indicate genes or SCORs. The sequence coverage across the regions is shown as are examples of the mapped Illumina RNA-seq reads

PT

with green and red lines indicating mapping in forward and reverse orientations, respectively. Broken lines indicate mismatches and indels. (B) Predicted minimum free energy (MFE)

CE

secondary structures of SCOR 29 and SCOR 2 from two different genomic locations in M. capitata are shown, as well as a list of other genes with which they are associated in this coral.

AC

For the predicted structures of other M. capitata SCORs, see Fig. S1. The structures are colored according to base-pairing probabilities, with red being more stable. For regions not involved in pairing, the color denotes the probability of being unpaired. The MFEs for two structures are indicated.

Fig. 2. Phylogenetic trees of A. digitifera SCORs from the Adig.SCOR 17 family. (A) Unrooted maximum likelihood tree of this SCOR family. The broken lines indicate artificially shortened long branches. (B) Two selected clades from the SCOR tree in panel A are shown. The

17

ACCEPTED MANUSCRIPT dots and bold texts represent SCORs located in genic regions. Other SCORs reside in intergenic DNA. The statistical support for nodes (when ≥0.5) is shown above the branches.

Fig. 3. Evolution of SCORs. (A) Sequence conservation in SCORs, coding DNA, and protein sequences. The distribution of identities (A. digitifera versus S. pistillata) among orthologous genes is shown as bar plots in grey colors for coding DNA and protein sequence data. The

T

highest identity for each SCOR family is shown as rectangle with the x-axis coordinate

IP

corresponding to the identity value. The y-axis coordinate is specified in arbitrary for clarity.

CR

White rectangles indicate SCOR families with complex evolutionary history, whereas black rectangles indicate SCOR families that were potentially transferred across coral species. (B)

US

Maximum likelihood tree and network of SCORs from Adig.SCOR-62 family. The branches with ≥0.85 statistic support are indicated with vertical bars. Complex and robust corals are

AN

shown in green and brown colors, respectively. (C) Maximum likelihood tree and network of the Adig.SCOR-70 family. This tree is rooted arbitrarily. The branch support information is shown

AC

CE

PT

ED

M

for the clade containing robust corals (for details, see Fig. S7).

18

Fig. 1

AC

CE

PT

ED

M

AN

US

CR

IP

T

ACCEPTED MANUSCRIPT

19

AC

CE

PT

ED

M

AN

US

CR

IP

T

ACCEPTED MANUSCRIPT

Fig. 2

20

Fig. 3

AC

CE

PT

ED

M

AN

US

CR

IP

T

ACCEPTED MANUSCRIPT

21