Single-molecule real-time (SMRT) sequencing facilitates Tachypleus tridentatus genome annotation

Single-molecule real-time (SMRT) sequencing facilitates Tachypleus tridentatus genome annotation

Journal Pre-proof Single-molecule real-time (SMRT) sequencing Tachypleus tridentatus genome annotation facilitates Fangrui Lou, Na Song, Zhiqiang Ha...

1MB Sizes 1 Downloads 31 Views

Journal Pre-proof Single-molecule real-time (SMRT) sequencing Tachypleus tridentatus genome annotation

facilitates

Fangrui Lou, Na Song, Zhiqiang Han, Tianxiang Gao PII:

S0141-8130(19)40237-7

DOI:

https://doi.org/10.1016/j.ijbiomac.2020.01.029

Reference:

BIOMAC 14333

To appear in:

International Journal of Biological Macromolecules

Received date:

12 December 2019

Revised date:

4 January 2020

Accepted date:

4 January 2020

Please cite this article as: F. Lou, N. Song, Z. Han, et al., Single-molecule real-time (SMRT) sequencing facilitates Tachypleus tridentatus genome annotation, International Journal of Biological Macromolecules(2020), https://doi.org/10.1016/ j.ijbiomac.2020.01.029

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2020 Published by Elsevier.

Journal Pre-proof

Single-molecule real-time (SMRT) sequencing facilitates Tachypleus tridentatus genome annotation Fangrui Lou1,2, Na Song1, Zhiqiang Han2*, Tianxiang Gao2* 1

Fishery College, Ocean University of China, Qingdao, Shandong, 266003,

China. Fishery College, Zhejiang Ocean University, Zhoushan, Zhejiang, 316022,

of

2

ro

China.

*corresponding author(s): Correspondence and requests for materials should be to

Z.Q.H

(email:

T.X.G.

(email:

lP na

Abstract

and

re

[email protected]).

[email protected])

-p

addressed

Tachypleus tridentatus is a keystone species in marine ecosystems. Its hemolymph

Jo ur

also provides the limulus amebocyte lysate (LAL) for detection of bacterial endotoxin in human medical service. Here we combined SMRT sequencing and Illumina RNA-seq to characterize the novel isoforms, novel genetic loci, fusion isoforms formation and transcriptome structure and further to unveil the transcriptome complexity of T. tridentatus. We identified 26,705 non-redundancy isoforms form 10,919 genetic loci, including 25,713 novel isoforms, 2,403 novel genes and 170 fusion isoforms. In addition, 1,578 novel genes and 23,172 novel isoforms were annotated in the NR, Pfam, KOG, COG, eggNOG, Swiss-Prot, KEGG and GO databases. Meanwhile, we have obtained 4,671 gene family clustering based on

Journal Pre-proof

genetic loci. Furthermore, there are 17,296, 4,887, 1,054, and 1,435 APAs, AS events, lncRNAs, and TFs were identified in the T. tridentatus long-read transcriptome and the target genes of 1,054 lncRNA sequences were also predicted. Overall, our work firstly provided the long-read transcriptome and these data are very necessary to improve the annotation information of T. tridentatus genome and optimize the

of

boundaries of 12,342 original reference annotated genes. Furthermore, these information are a potential resource to study LAL secretion mechanisms in T.

ro

tridentatus.

Jo ur

na

lP

re

-p

Key words: Tachypleus tridentatus; long-read transcriptome; SMRT sequencing

Journal Pre-proof

1. Introduction The marine living fossil, horseshoe crab is a keystone species in marine ecosystems, it can serve as a bioturbator and its eggs is a fundamental protein source for many marine animals [1,2]. Additionally, its hemolymph provides the limulus amebocyte lysate (LAL) for detection of bacterial endotoxin in human medical service

of

[3]. A large number of studies referred horseshoe crab also is an indicator animal for reflecting the environmental state of coastal habitats, this is the case because

ro

hemolymph can help horseshoe crab have particular sensibility and higher tolerance

-p

to metal contaminants compared to other marine animals [4,5]. Therefore, the

re

horseshoe crab is valued because of both its ecological and economic importance. As

lP

an extant species of the horseshoe crab, Chinese horseshoe crab (Tachypleus

na

tridentatus) has maintained their specific ecological value for several million years in the Indo-Pacific region [6]. However, declining populations of T. tridentatus has

Jo ur

widely been reported in many Asian waters owing to habitat destruction, environmental pollution and overfish for LAL production [4,7,8]. In fact, T. tridentatus has existed in the IUCN Red List of Threatened Species [9]. Additionally, it is worth noting that LAL harvest processes might lead to different mortality rates of male and female T. tridentatus due to LAL primarily comes from female [3,10]. This trend can change the behavioral and physiological effects associated with the breeding activity, especially in the spawning season, and further leading to significantly decrease of T. tridentatus resource and ultimately leading to the significantly decreased of egg abundances [11,12]. Although a previous study has revealed the

Journal Pre-proof

genome information of T. tridentatus, the low accuracy and completeness of genome annotation information have affected the exploration of important biological processes (including LAL secretion) of T. tridentatus. Recently, characterization of the transcriptome provides insight into the analysis of gene function and its regulation mechanism. Highthtought short-read sequencing

of

has revolutionized the transcriptome study and provides a novel method to perfect the genome annotation information [13,14]. However, short-read can result in incorrect

ro

annotation information due to short transcripts often fails to identify alternative

-p

splicing, alternative transcription initiation, or alternative transcription termination

re

sites [15,16]. Additionally, whole-genome duplication event also undermined the gene

lP

annotation ability of short transcripts, this is the case because highly similar

na

transcripts were appeared in that event [17]. Above all, the genome of several organisms is very difficult to obtain complete and accurate annotation information

Jo ur

without full-length (FL) sequences. Recently, the third generation sequencing technology Pacific Bioscience (PacBio) provides the opportunity to directly obtain FL transcripts due to its technology accomplishes single molecule real-time (SMRT) sequencing with a read length up to 20 kb [18]. Additionally, SMRT sequencing can comprehensively identify the novel transcribed regions and disambiguate between distinct but highly similar isoforms of functional genes, ultimately it provides abilities to investigate the polymorphism of protein structure and function [19,20]. While higher sequencing error rate existed in the early SMRT sequencing, self-correction of FL transcripts and correction of short reads have reduced the effective error rate

Journal Pre-proof

[21,22]. Therefore, SMRT sequencing can serve as a valuable tool used in the improvement process of the genome annotation [23,24]. To date, SMRT sequencing has been successfully used to analyze the FL transcripts in many marine animals, such as Haliotis discus hannai [25], Danio rerio [26]. However, no studies have been done in T. tridentatus.

of

In this study, SMRT sequencing was used to produce FL transcripts of T. tridentatus and the sequencing process was based on PacBio RSII platform.We can

ro

optimize the T. tridentatus genome annotation information by the FL transcripts and

-p

ultimately provide fundamental resource for some important biological regulatory

re

mechanism of T. tridentatus, such as LAL secretion.

na

2.1 Animal specimens

lP

2. Materials and methods

T. tridentatus specimens were collected from the coastal water of Zhoushan,

Jo ur

China. Female and male T. tridentatuss were immediately euthanized, and muscle tissues were then extracted using sterilized scissors and forceps. Then, muscle samples were separately snap-frozen in liquid nitrogen and stored at -80°C prior to the subsequent experiments. The total RNA of individual muscle was extracted using a standard Trizol Reagent Kit according to the manufacturer’s protocol. The RNA concentration was measured using NanoDrop 2000 (Thermo). RNA integrity was assessed using the RNA Nano 6000 Assay Kit of the Agilent Bioanalyzer 2100 system (Agilent Technologies, CA, USA). 2.2 Illumina RNA-seq library construction and sequencing

Journal Pre-proof

We purified mRNA by depleting rRNA from total RNA using the RNA Purification Beads. The remaining RNA was cleaned three times using Beads Binding Buffer, and eluted RNA was incubated at 94°C for 8 min. First-strand cDNA synthesis was performed by combining the fragmented mRNA with 8 μl of First Strand Synthesis Act D Mix and SuperScript II Reverse Transcriptase and incubated at 25°C for 50 min, dropped to 4°C. For second strand synthesis, we added 5 μl of End Repair

of

Control and 20 μl of Second Strand Marking Master Mix and incubated at 16°C for 60

ro

min. The reaction was then purified using 90 μl of AMPure XP beads. Subsequently,

-p

A-Tailing Control and Ligation Control were applied to A-tailing and adapter ligation

re

of the double stranded cDNA, respectively. In the cDNA fragments enrich reaction

lP

process, we added 5 μl of PCR Primer Cocktail and 25 μl of PCR Master Mix and ran

na

the initial denaturing at 98°C for 30 sec, followed by 15 cycles of denaturing at 98°C for 10 sec, annealing at 60°C for 30 sec, and extending at 72°C for 30 sec, and

Jo ur

finished with a final extending at 72°C for 5 min. RNA-seq libraries were diluted to 10 pM and quantified using an Agilent 2100 Bioanalyzer. Libraries were sequenced on an Illumina HiSeq 2000 across one lane with paired-end 150 bp. 2.3 SMRT sequencing library construction and sequencing The first-strand cDNA was synthesized and amplified using the SMARTerTM PCR cDNA Synthesis Kit (Takara Bio USA, Inc., Mountain View, CA, USA). High-quality large-scale PCR products were purified and eluted to the optimal size (approximately 1-6 Kb) using the BluePippinTM Size-Selection System (Sage Science Inc., Beverly, MA, USA). Subsequently, each SMRTbell template library was constructed after

Journal Pre-proof

repairing the damage, blunt end ligation, and adding the SMRTbell adapter for selected cDNAs. The SMRTbell templates combined with polymerases and then the polymeride was bound to the zero-mode waveguide. In the sequencing process, two SMRT cells were prepared on the PacBio RSII platform. 2.4 Illumina RNA-seq data processing

of

Raw RNA-seq reads in FASTQ format were firstly filtered and then clean reads were obtained by removing reads containing sequencing adapters, ploy-N, and low

ro

quality. The remaining high-quality reads were then mapped to the T. tridentatus

-p

genome sequence using HISAT2 software [27]. Meanwhile, we analyzed the location

re

distribution and cover the depth distribution of RNA-seq data on the different

lP

chromosomes.

na

2.5 SMRT sequencing data processing

All polymerase reads produced by the PacBio RSII platform were filtered by

Jo ur

removing reads with sequences length less than 50bp and accuracy less than 0.80 using SMRT analysis software v2.3 Suite (http:// www.pacb.com/devnet/) with the following parameters: readScore = 0.80 and minLength = 50. The remaining high-quality polymerase reads were broken at the adaptor position. Subsreads were generated by trimming all adaptor sequences of polymerase reads. Effective subreads were further obtained by removing reads with sequences length less than 50bp using SMRT analysis software v2.3 Suite with parameters: minSubReadLength = 50. Circular consensus (CCS) reads were processed using Iso-seq pipeline of SMRTLink software with the following parameters: minFullPass = 0 and minPredictedAccuracy

Journal Pre-proof

= 0.90. Then, these CCS reads containing both polyA tail signal, 5' and 3' primers were considered as FL reads, otherwise, these CCS reads were considered as Non-full-length (NFL) reads. It is worth noting that two cDNA template sequences were always direct connected to generate chimeric reads due to the concentration of adaptors or SMRTbell in the library construction process. These FL reads without

of

chimeric reads were considered as FL non-chimetric (FLNC) reads (Figure 1.A). Next, iterative clustering for error correction (ICE) was used to obtain consensus isoforms

ro

and polished FL consensus sequences from ICE were polished using Quiver [18]. It is

-p

worth noting that polished high-quality consensus sequences were classified with the

re

criteria post-correction accuracy above 99%. In order to improve sequences accuracy,

lP

polished low-quality consensus sequences were corrected using clean female

na

RNA-seq data based on Proovread software [28]. In fact, the 5' end of some FLNC reads will be degraded in the sequencing process, in spite of the 3' end can remain

Jo ur

intact due to the presence of the polyA structure. Therefore, redundant transcripts might be generated in the clustering process of FLNC reads. In order to remove redundant sequences, the calibrated consensus sequences were mapped to the reference genome [29] using Genomic Mapping and Alignment Program (GMAP; [30]) with the following parameters: --cross-species and --allow-close-indels 0. The high percent of identity (PID) and high percent of coverage (PCO) aligned FLNC reads were regarded as the non-redundant sequences and these sequences with low PID (less than 90 % ) and low PCO (less than 85 % ) were removed by the cDNA_Cupcake

package

(https://github.com/Magdoll/cDNA_Cupcake/wiki).

Journal Pre-proof

Additionally, exon 5' end difference was not considered when collapsing redundant sequences and therefore we incorporated sequences that were only different at the exon 5' end (Figure 1.B). BUSCO software was applied to the quantitative assessment of the completeness of non-redundant sequences [31]. 2.6 Identification of fusion isoforms

of

All consensus sequences were used to find fusion isoforms and the identification criteria were as follows (Figure 1.C; [18]): (i) a single FL consensus sequence must be

ro

mapped to two or more annotation loci in the T. tridentatus genome; (ii) the minimum

-p

coverage for each annotation loci must be at least 5% of FL consensus sequences and

re

the minimum coverage length is 1 bp; (iii) the combined alignment coverage of all

lP

annotation loci must be at least 95% of FL consensus sequences; (iv) distance

na

between two mapped locus must be at least 10 kb.

optimization

Jo ur

2.7 Identification of potential novel genes, novel isoforms and genetic structure

We identified some potential novel genes, novel isoforms based on the mapping information of FL consensus sequences. Specifically, the FL consensus sequence that having a low overlapping ratio (< 20%) with the reference gene site or having a high overlapping ratio (> 20%) but the gene direction was not consistent with reference gene was considered as a novel gene. The filtering criteria of novel isoform were as follows: (i) the final splice site of 3' end changed; (ii) new intron or new exon emerged (Figure 1.D; [18]). All novel genes and novel isoforms were used to analyze the potential functions based on the following databases using the BLAST alignments:

Journal Pre-proof

NR, KOG/COG, eggNOG, Swiss-Prot, KEGG, and GO. Meanwhile, we obtained the gene family clustering based on the Pfam database annotation information. Finally, we also re-optimized the genetic boundaries of the reference annotations based on the

lP

re

-p

ro

of

novel genetic information.

na

Figure 1. PacBio data processing. (A). FLNC. (B). Remove redundancy of consensus

Jo ur

sequences. (C). Fusion isoforms identification. (D). Novel isoforms identification. 2.8 T. tridentatus transcriptome structure analysis FLNCs were applied to identify alternative polyadenylation using TAPIS pipeline [32]. The comprehensive distribution and classification analysis of alternative splicing (AS) structure in non-redundant sequences were identified using Astalavista software [33]. Four computational approaches include coding potential calculator (CPC; [34]), coding-non-coding index (CNCI; [35]), coding potential assessment tool (CPAT; [36]) and Pfam database were combined to sort non-protein coding RNA candidates from putative protein-coding RNAs in the novel isoforms. Putative protein-coding RNAs were filtered out using a minimum length and exon number threshold. Isoforms with

Journal Pre-proof

lengths more than 200 nt and have more than two exons were selected as long non-coding RNA (LncRNA) candidates and further screened using CPC, CNCI, CPAT, and Pfam database that have the power to distinguish the protein-coding genes from the non-coding genes. Additionally, LncRNA target genes were predicted based on the position and effect of LncRNA and mRNA using LncTar software [37]. Finally, iTAK

of

software [38] was used to identify the T. tridentatus transcription factors (TFs) from novel isoforms based on the animalTFDB 2.0 [39] transcription factor database.

ro

2.9 Gender-specific expression analyses

-p

Gene expression levels were obtained by mapping clean RNA-seq reads to the

re

long-read transcripts. Then, differentially expressed genes (DEGs) of female and male

used

as

the

filtering

thresholds

of

DEGs

na

were

lP

T. tridentatus were identified using EBSeq, FDR < 0.01 and Fold Change ≥ 2

(http://www.bioconductor.org/packages/devel/bioc/html/EBSeq.html). Subsequently,

Jo ur

the functional categories and biochemical metabolic pathways of DEGs were predicted based on Blast2GO software, and the threshold parameter of significant difference was Q ≤ 0.05. 3. Results and Discussion 3.1 Sequencing of the T. tridentatus transcriptome using Illumina platform The raw RNA-seq reads of female and male T. tridentatus were released in the NCBI Sequence Read Archive under BioProject number PRJNA576971, with accession number of SRR10270295 and SRR10270295. After filtering, a total of 42,435,306 and 48,654,696 clean RNA-seq reads were obtained in female and male T.

Journal Pre-proof

tridentatus, respectively. Then, we used HISAT2 to sequence align clean RNA-seq reads with the reference genome and obtain the location information of clean reads on the reference genome. In female RNA-seq data, 95.18% of clean reads could be mapped to the T. tridentatus genome, of which 81.71% were mapped to the only loci in the genome. In male RNA-seq data, 94.98% of clean reads could be mapped to the

of

T. tridentatus genome, of which 84.17% were mapped to the only loci in the genome. Meanwhile, we analyzed the coverage depth distribution of reads that have been

ro

mapped to the reference genome (Figure 2). In addition, we counted the number of

-p

successful reads in different regions of the reference genome (exons, introns and

re

intergenic regions). It is undeniable that although a large number of reads are mapped

lP

to the exon region (72.65% in female and 68.91% in male), a small number of reads

na

are mapped to the intron region (11.76% in female and 13.99% in male) and intergenic region (15.59% in female and 17.10% in male). In fact, reads from mature

Jo ur

mRNA should be mapped to the exon region. Additionally, reads mapped to intron region are related to mRNA precursor and intron retention with AS. It is worth noting that the intergenic regions is belonging to the non-coding region, therefore it is not transcribed. In the present study, some RNA-seq reads were mapped to intergenic regions, which may be related to incomplete annotation information of T. tridentatus genome.

Journal Pre-proof

of

Figure 2. The location and coverage depth distribution of mapped reads on the

ro

reference genome. (A). Female T. tridentatus. (B). Male T. tridentatus.

-p

3.2 Sequencing of the T. tridentatus transcriptome using PacBio plotform

re

In the present study, high-quality total RNA was extracted from T. tridentatus

lP

and then sequenced on the Pacific Bioscience RS II platform. After filtering, we obtained 32.73 G effective subread data from 2 SMRT cells and the subreads were

na

released in the NCBI Sequence Read Archive under BioProject number

Jo ur

PRJNA577185, with accession number of SRR10271121 and SRR10271122. Furthermore, we detected 793,144 CCS reads with a mean sequencing depth of 16 passes. Among them, 635,260 (80.09%) were considered as FLNC. ICE was applied to improve the consensus accuracy of FLNCs and a total of 38,762 consensus isoforms were obtained, with the mean length of 3,124bp. Furthermore, 37,756 high-quality consensus isoforms were mapped to T. tridentatus genome using GMAP and 1,001 low-quality consensus isoforms were corrected by short read RNA-seq data. Finally, 26,705 non-redundancy isoforms were yielded and used to the following analyses. The length distribution of CCS, FLNC, and consensus isoform is shown in Supplementary file 1. We also provided a preliminary assessment of the T. tridentatus

Journal Pre-proof

annotation representation in the non-redundancy isoforms based on the aligned results in the reference genome (Figure 3). In reference genome, we found 60,679 isoforms covering 37,712 genetic loci and 96.54% (58,578/60,679) isoforms were > 1kb in length. Out of 37,712 reference genetic loci, 28,494 (75.56%) genetic loci produced more than one isoforms, corresponding to 51,461 isoforms. Our long-reads transcript

of

mapping results showed that 26,705 non-redundancy isoforms were allocated to 10,919 genetic loci and about 84.31% (9,206/10,919) genes were > 1kb in length.

ro

Among 26,705 isoforms, 2,030 (7.60%) were single-exon isoforms and 24,675

-p

(92.40%) were multiple-exon isoforms. Additionally, 6,029 (55.22%) genetic loci

re

produced more than one isoforms, corresponding to 21,745 isoforms. We determined

lP

whether the number of isoform was affected by the gene length and results showed

file

2).

na

that the number of isoform was not correlated with the gene length (Supplementary Interestingly,

we

found

that

the

gene

SLC7A8

(gene

ID:

Jo ur

evm.TU.Hic_chr_13.3049) had the largest number of isoforms in both reference genome and Pac-Bio data, with 72 and 71 isoforms, respectively, which played a vital role in the amino acid transport process of T. tridentatus [40].

lP

re

-p

ro

of

Journal Pre-proof

na

Figure 3. Distributed visualization of genome locus, transcriptome locus, alternative

Jo ur

splicing, fusion transcripts and LncRNA at chromosome level. The curved black line in each circle corresponding to each chromosome. The lines above and below the each curved black line represent the locus information of the sense strand (+) and the antisense strand (-), respectively.

Finally, 86.1% complete BUSCOs in the arthropoda_odb9 database were covered by isoforms and therefore the completeness of T. tridentatus long-read transcriptome was be confirmed (Figure 4).

ro

of

Journal Pre-proof

-p

Figure 4. BUSCO assessment results.

re

3.3 Fusion isoforms

lP

A fusion isoform is a chimeric RNA encoded by one fusion gene or two different

na

genes that are subsequently joined by trans-splicing or somatic chromosomal rearrangements. Gene fusion is a common feature in many eukaryotes, but no

Jo ur

examples of gene fusion have been described in T. tridentatus. In the present study, we identified 170 fusion isoforms covered 1,943 isoforms and were corresponded to 340 genetic loci. Distribution of fusion isoforms revealed that large amounts of fusion isoforms were predominately located in chr8 (31 fusion isoforms), followed by chr13 (27 fusion isoforms) and chr3 (22 fusion isoforms). Perhaps for two reasons: (1) The quantity variance of fusion isoforms may be influenced by the chromosome length. In fact, we found that chr13 has the longest length (275,858,341 bp) based on reference genome [29]; (2) Chromosomal rearrangements (including chromosomal translocation, interstitial deletion and inversion) caused by transposons eventually lead to gene

Journal Pre-proof

fusion, therefore, we speculate that there may be more transposons on chr3, chr8 and chr13 [41]. However, it cannot be denied that the gene fusion identified by the transcriptome sequencing may come from the RNA fusion after transcription of different genes, rather than the actual genomic variation. Future studies still need to address chromosome-specific gene fusion. The pervious study considered that fusion

of

isoforms usually formed by the concatenation of two or more separate isoforms,

ro

whereas all fusion isoforms in this study from two isoforms. 3.4 Novel genes and novel isoforms

-p

3.4.1 Identification and annotation of novel genes and novel isoforms

re

Mapping information showed that all non-redundancy isoforms were classified

lP

into three groups: (i) 992 known isoforms from 567 genetic loci, (ii) 21,242 novel

na

isoforms of 8,516 genetic loci and (iii) 4,471 isoforms were probably from 2,403 novel genetic loci that do not overlap with any annotated loci. It is undeniable that our

Jo ur

non-redundant isoforms only covered about 24.09% (9,083/37,712) of reference annotation loci. A recent zebrafish long-read transcriptome study reported a similarly low proportion (25.4%) of transcripts accurately matched to the reference genome [26]. This may be related to the completeness of genomic annotation information. The low coverage of the non-redundant isoforms found in the genome data also supports the presence of significant previously undiscovered transcriptional diversity [26]. It is worth noting that a large number of new isoforms (25,713) may include transcripts with skipped exons, retained introns, and other types of canonical alternative splicing events [42]. Additionally, two other categories of transcripts were classified as new

Journal Pre-proof

isoforms: (1) these long-read transcripts with 5'- or 3'-end exons missing from the reference transcript; (2) these long-read transcripts mapped to regions that are generally considered less likely to give rise to true transcripts. It is worth noting that the annotation information of T. tridentatus genome may be limited by the previous data used, therefore the potential novel genes and novel

of

isoforms were further to annotation to improve the completeness of T. tridentatus genome annotation information. We searched 2,403 novel genes and 25,713 novel

ro

isoforms in NR, Pfam, KOG, COG, eggNOG, Swiss-Prot, KEGG and GO databases.

-p

In total, 1,578 novel genes and 23,172 novel isoforms were annotated in the

re

aforementioned databases. Specifically, 1,550, 1,098, 973, 352, 1,419, 814, 665 and

lP

607 novel genes could be found in NR, Pfam, KOG, COG, eggNOG, Swiss-Prot,

na

KEGG and GO database, respectively. Additionally, 23,056, 10,051, 18,389, 8,655, 22,158, 16,028, 13,765 and 12,497 novel isoforms could be found in NR, Pfam, KOG,

Jo ur

COG, eggNOG, Swiss-Prot, KEGG and GO database, respectively. We further integrated all annotated novel genes and novel isoforms to the T. tridentatus reference genome annotation and ultimately generated an updated reference genome annotation information that containing 54,450 annotated isoforms (Supplementary file 3) and 32,755 annotated genetic loci (Supplementary file 4). Meanwhile, we have obtained 4,671 gene family clustering based on genetic loci (Supplementary file 5). 3.4.2 Genetic structure optimization It is worth noting that the reference annotation is often not accurate due to the limitations of the software or data, so we optimize the gene structure of the T.

Journal Pre-proof

tridentatus reference annotation based on the novel genes and novel isoforms. In the present study, the map read sequence appeared outside the boundaries of 12,342 original reference annotated genes, and the untranslated region (UTR) upstream or downstream of these reference annotated genes was further extended and the modification of the gene boundaries was finally completed (Supplementary file 6).

of

3.5 T. tridentatus transcriptome structure 3.5.1 Alternative polyadenylation (APA)

ro

Polyadenylation of most mRNAs 3' end is necessary for transport, localization,

-p

stability and translation of RNA in the eukaryotes protein biosynthesis. Previous

re

studies have shown that polyadenylation can generate different transcripts in the

lP

coding region or the 3' end non-coding region, thus complicating the eukaryotic

na

transcriptome and generating different regulatory mechanisms. Additionally, differential polyadenylation of mRNAs may play different roles in T. tridentatus

Jo ur

development. Here we reported the first genome-wide map of global polyadenylation sites of T. tridentatus. A total of 17,296 APAs from 8,470 genes were obtained from T. tridentatus transcriptome (Supplementary file 7). It is worth noting that the establishment of accurate polyadenylate cleavage site also can help us to determine the 3' end location of each gene in T. tridentatus, and finally improve gene annotation. 3.5.2 Alternative splicing (AS) AS is an important transcriptional mechanism that regulates the expression of eukaryotic traits. Or more specifically, the same gene can selectively recognize exons and splicing sites to produce mature mRNA and protein isoforms with different

Journal Pre-proof

functions during splicing. In the present study, we quantified the number of AS events and compared the distribution of AS types in the T. tridentatus transcriptome based on Astalavista software (Figure 5). In total, we have discovered 4,887 AS events in the T. tridentatus transcriptome and the main AS event types were intron retention, exon skipping, alternative 3' splice site, alternative 5' splice site and mutually exclusive

of

exon (Supplementary file 8). Among all AS events, exon skipping predominated, accounting for 41.80% of all AS events, while mutually exclusive exon was the least

ro

frequent (2.82%). The study of AS may be very important to understanding why there

Jo ur

na

lP

re

-p

are differences in the LAL secretion of different genders T. tridentatus.

Figure 5. The types and number of AS events in the T. tridentatus transcriptome. 3.5.3 Long non-coding RNA (LncRNA) and LncRNA target genes LncRNA is a non-coding RNAs with a length of more than 200 nucleotides. LncRNA plays an important role in many life activities such as dosage compensation effect, epigenetic regulation, cell cycle regulation and cell differentiation regulation, and thus LncRNA has become a research hotspot. Therefore, LncRNAs might constitute an important class of regulators of LAL secretion in T. tridentatus. In the present study, we used four computational approaches (CPC, CNCI, CPAT and Pfam

Journal Pre-proof

database) to identify putative lncRNAs from all novel isoforms. By filtering and excluding isoforms with lengths more than 200nt, 1,486, 7,737, 5,015 and 5,010 lncRNAs were found based on CPC, CNCI, CPAT and Pfam database, respectively. Only 1,054 lncRNAs were found in all four computational approaches (Figure 6). According to the position of lncRNA on the annotation information of reference

of

genome, 1,054 lncRNAs were divided into four categories: lincRNA (575), antisense-lncRNA (50), intronic-lncRNA (160) and sense-lncRNA (244). Additionally,

ro

we used two methods to predict the target genes of 1,054 lncRNA sequences and the

Jo ur

na

lP

re

-p

predicted gene results are shown in the Supplementary file 9.

Figure 6. The number and classification of lncRNAs.

3.5.4 Transcription factor (TF) TF is a kind of protein that can bind to specific nucleotide sequence on the upstream of a gene. It can regulate the binding of RNA polymerase to DNA template and finally influence the transcription of genes. Therefore, TFs may be very important in regulating the LAL secretion in T. tridentatus. In the present study, we predicted 1,435 TFs from the novel isoforms of T. tridentatus based on the animal transcription

Journal Pre-proof

factor database (animalTFDB 2.0). Furthermore, all TFs were classified into 47 families and a large number TFs were dominant in zf-C2H2, bHLH, HMG families Supplementary file 10. 3.6 Gender-specific expression analyses of T. tridentatus To elucidate the gender-specific expression pattern of T. tridentatus, we first

of

counted the number of DEGs. Results showed that a total of 1,796 DEGs were obtained between female and male T. tridentatus (Figure 7.A). Among all DEGs, 964

ro

up- and 832 down-regulated genes were founded in female T. tridentatus. To explore

-p

the potential functions and regulatory pathways of these differential genes, we

re

conducted GO terms and KEGG pathways enrichment analysis for 1,796 DEGs.

lP

Results showed that a total of 698 DEGs were successfully mapped to 47 GO terms

na

(Figure 7.B). Among these GO terms, “metabolic process”, “catalytic activity” and “cell” were dominant in “biological process”, “molecular function” and “cellular

Jo ur

component”, respectively. Based on KEGG database, the KEGG classifications of DEGs were shown in Figure 7.C. In brief, some metabolism, immune, genetic information

processing,

environmental

information

processing

and

cellular

processes-related pathways that may participate in the sexual differentiation process were detected in this study. Interestingly, our results indicated that 14 genes are enriched in the Wnt signaling pathway. In fact, the Wnt signaling pathway is a conservative signaling network, which involve in a variety of biological processes, such as embryonic development, growth regulation and others [43]. It is worth noting that Wnt4 gene in the Wnt signaling pathway is regarded as a sex-determining gene

Journal Pre-proof

and it is crucial for female sexual development [44]. This is the case because Wnt4 gene can regulate the formation of the mullerian duct and then inhibit substance and testosterone secretes, which ultimately lead to sexual differentiation [45]. Further research is required to elucidate the physiological function of these DEGs, and ultimately provide fundamental resource for understanding some critical biological

Jo ur

na

lP

re

-p

ro

of

processes of T. tridentatus, such as LAL secretion.

Figure 7. The number (A), GO term (B) and KEGG pathway (C) analyses of gender-specific DEGs. 4. Conclusions In the present study, SMRT sequencing and RNA-Seq were applied to sequencing the T. tridentatus transcriptome. We obtained 26,705 non-redundancy

Journal Pre-proof

isoforms, including fusion isoforms, novel isoforms of previously annotated genes as well as novel genetic loci. We further integrated all annotated novel genes and novel isoforms to the T. tridentatus reference genome annotation and ultimately generated an updated reference sequence database. Additionally, we identified APA, AS, lncRNA and TF in T. tridentatus transcriptome data and predicted the target genes of

tridentatus

of

lncRNAs. These data are very necessary to improve the annotation information of T. genome and optimize the gene structure.

Furthermore, these

-p

LAL secretion mechanisms in T. tridentatus.

ro

non-redundancy isoforms and transcriptome structure are a potential resource to study

re

Acknowledgments

Reference

na

China (2017YFA0604902).

lP

This work was supported by the National Key Research and Development Program of

Jo ur

[1] D.M. Rudkin, G.A. Young, Horsehsoe crabs-An ancient ancestry revealed. In: Tanacredi, J.T., Botton, M.L., Smith, D.R. (eds) Biology and Conservation of Horseshoe Crabs. Springer, US 2009 25-44. [2] M.L. Botton, C.N. Shuster, Horseshoe crabs in a food web: who eats whom? In: Shuster, C.N. Jr, Barlow, R.B., Brockmann, H.J. (ed) The American Horseshoe Crab. Harvard Press, Cambridge 2003 133-153. [3] R.L. Anderson, W.H. Watson, C.C. Chabrt, Sublethal behavioral and physiological effects of the biomedical bleeding process on the American horseshoe crab, Limulus polyphemus. Biol. Bull. 225 (2013) 137-151. DOI: 10.1086/BBLv225n3p137.

Journal Pre-proof

[4] C.P. Chen, H.Y. Yeh, P.F. Lin, Conservation of the horseshoe crab at Kinmen, Taiwan: strategies and practices. Biodivers. Conserv. 13 (2004) 1889-1904. DOI: 10.1023/b:bioc.0000035868.11083.84 [5] M.L. Botton, M. Hodge, T.I. Gonzalez, High tolerance to tributyltin in embryos and larvae of the horseshoe crab, Limulus polyphemus. Estuaries 21 (1998) 340-346.

of

DOI: 10.2307/1352480. [6] K. Sekiguchi, H. Sugita, Systematics and hybridization in the four living species

ro

of horseshoe crabs. Evolution 34 (1980) 712. DOI: 10.2307/2408025

-p

[7] Y. Liao, S. Hong, X. Li, A survey on the horseshoe crabs in the north of South

re

China Sea. Acta. Zool. Sin. 47 (2011) 108-111 (Chinese with English abstract).

lP

[8] P.K.S. Shin, H. Li, S.G. Cheung, Horseshoe crabs in Hong Kong: current

na

population status and human exploitation. In: Tanacredi, J.T., Botton, M.L., Smith,

347-360.

Jo ur

D.R. (eds) Biology and Conservation of Horseshoe Crabs. Springer, US, 2009

[9] Centre, World Conservation Monitoring. Tachypleus tridentatus. The IUCN Red List of Threatened Species 1996. [10] D. Rutecki, R.H. Carmichael, I. Valiela, Magnitude of harvest of Atlantic horseshoe crabs, Limulus polyphemus, in Pleasant Bay, Massachusetts. Estuaries 27 (2004) 179-187. DOI: 10.2307/1353450. [11] L. Hurton, J. Berkson, Potential causers of mortality for horseshoe crabs (Limulus polyphemus) during the biomedical bleeding process. Fish. Bull. 104 (2006) 293-298. DOI: 10.1111/j.1444-2906.2006.01172.x.

Journal Pre-proof

[12] A.S. Leschen, S.J. Correia. Mortality in female horseshoe crabs (Limulus polyphemus) from biomedical bleeding and handling: implications for fisheries management.

Mar.

Freshw.

Behav.

Physiol.

43

(2010)

135-147.

DOI:

10.1080/10236241003786873. [13] F. Ozsolak, P.M. Milos, RNA sequencing: advances, challenges and opportunities. Nat. Rev. Genet. 12: 87-98. DOI: 10.1038/nrg2934.

of

[14] J. Duitama, P.K. Srivastava, I.I. Măndoiu, Towards accurate detection and

ro

genotyping of expressed variants from whole transcriptome sequencing data. BMC

-p

Genomics 13 (2012) S6. DOI: 10.1186/1471-2164-13-s2-s6.

re

[15] T. Steijger, J.F. Abril, P.G. Engstrom, et al., Assessment of transcript

lP

reconstruction methods for RNA-seq. Nat. Methods 10 (2013) 1177-1184. DOI:

na

10.1038/nmeth.2714.

[16] H. Tilgner, D. Raha, L. Habegger, L., et al., Accurate identification and analysis

Jo ur

of human mRNA isoforms using deep long read sequencing. G3 3 (2013) 387-397. DOI: 10.1534/g3.112.004812.

[17] J.H. Postlethwait, Y.L. Yan, M.A. Gates, et al., Vertebrate genome evolution and the zebrafish gene map. Nat. Genet. 18 (1998) 345-349. DOI: 10.1038/ng0498-345. [18] Y. Li, C.C. Fang, Y.H. Fu, et al., A survey of transcriptome complexity in Sus scrofa using single-molecule long-read sequencing. DNA Res. 25 (2018) 421-437. DOI: 10.1093/dnares/dsy014 [19] M.J. Chaisson, J. Huddleston, M.Y. Dennis, et al., Resolving the complexity of the human genome using single-molecule sequencing. Nature 517 (2014) 608-611.

Journal Pre-proof

DOI: 10.1038/nature13907. [20] F. Min, S. Wang, L. Zhang, Survey of programs used to detect alternative splicing isoforms from deep sequencing data In Silico. Biomed. Res. Int. 2015 (2015) 831352. DOI: 10.1155/2015/831352. [21] K.F. Au, V. Sebastiano, P.T. Afshar, et al., Characterization of the human ESC

of

transcriptome by hybrid sequencing. Proc. Natl Acad. Sci. USA 110 (2013) E4821-E4830. DOI: 10.1073/pnas.1320101110.

ro

[22] S.S. Vembar, M. Seetin, C. Lambert, et al., Complete telomere-to-telomere de

-p

novo assembly of the Plasmodium falciparum genome through long-read (>11 kb),

re

single molecule, real-time sequencing, DNA Res. 23 (2016) 339-351. DOI:

lP

10.1093/dnares/dsw022.

na

[23] K.E. Kim, P. Peluso, P. Babayan, P., et al., Long-read, whole-genome shotgun sequence data for five model organisms. Sci. Data 1 (2014) 140045. DOI:

Jo ur

10.1038/sdata.2014.45.

[24] S.E. Abdel-Ghany, M. Hamilton, J.L. Jacobi, et al., A survey of the sorghum transcriptome using single-molecule long reads. Nat. Commun. 7 (2016) 11706. DOI: 10.1038/ncomms11706. [25] M.A. Kim, J.S. Rhee, T.H. Kim, et al., Alternative splicing profile and sex-preferential gene expression in the female and male Pacific Abalone Haliotis discus hannai. Genes 8 (2017) 99. DOI: 10.3390/genes8030099. [26] G. Nudelman, A. Frasca, B. Kent, High resolution annotation of zebrafish transcriptome using long-read sequencing. Genome res. 28 (2018) 1415-1425. DOI:

Journal Pre-proof

10.1101/gr.223586.117. [27] D. Kim, B. Langmead, S.L. Salzberg, HISAT: a fast spliced aligner with low memory requirements. Nat Methods 12(2015) 357-360. DOI: 10.1038/nmeth.3317. [28] T. Hackl, R. Hedrich, J. Schultz, et al., Proovread: large-scale high-accuracy PacBio correction through iterative short read consensus. Bioinformatics 30 (2014)

of

3004-3011. DOI: 10.1093/bioinformatics/btu392. [29] L. Gong, G.Y. Fan, Y.D. Ren, et al., Chromosomal level reference genome of

ro

Tachypleus tridentatus provides insights into evolution and adaptation of horseshoe

-p

crabs. Mol. Ecol. Resour. 19 (2019) 744-756. DOI: 10.1111/1755-0998.12988.

re

[30] T.D. Wu, C.K. Watanabe, GMAP: a genomic mapping and alignment program

na

10.1093/bioinformatics/bti310.

lP

for mRNA and EST sequences. Bioinformatics 21 (2005) 1859-1875. DOI:

[31] F.A. Simao, R.M. Waterhouse, P. Ioannidis, et al., BUSCO: assessing genome

Jo ur

assembly and annotation completeness with single-copy orthologs. Bioinformatics 19 (2015) 3210-3212. DOI: 10.1093/bioinformatics/btv351. [32] S.E. Abdelghany, M. Hamilton, J.L. Jacobi, et al., A survey of the sorghum transcriptome using single-molecule long reads. Nat. Commun. 7 (2016) 11706. DOI: 10.1038/ncomms11706. [33] S. Foissac, M. Sammeth, ASTALAVISTA: dynamic and flexible analysis of alternative splicing events in custom gene datasets. Nucleic Acids Res. 35 (2007) 297-299. DOI: 10.1093/nar/gkm311. [34] L. Kong, Y. Zhang, Z.Q. Ye, et al., CPC: assess the protein-coding potential of

Journal Pre-proof

transcripts using sequence features and support vector machine. Nucleic Acids Res. 36 (2007) 345-349. DOI: 10.1093/nar/gkm391. [35] L. Sun, H.T. Luo, D.C. Bu, et al., Utilizing sequence intrinsic composition to classify protein-coding and long non-coding transcripts. Nucleic Acids Res. 41 (2013) e166. DOI: 10.1093/nar/gkt646.

of

[36] L. Wang, H.J. Park, S. Dasari, et al., CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model. Nucleic Acids Res. 6 (2013) e74.

ro

DOI: 10.1093/nar/gkt006.

-p

[37] J. Li, W. Ma, P. Zeng, et al., LncTar: a tool for predicting the RNA targets of long

re

noncoding RNAs. Brief. Bioinform. 16 (2015) 806-812. DOI: 10.1093/bib/bbu048.

lP

[38] Y. Zheng, C. Jiao, H. Sun, et al., iTAK: a program for genome-wide prediction

na

and classification of plant transcription factors, transcriptional regulators, and protein kinases. Mol. Plant 9 (2016) 1667-1670. DOI: 10.1016/j.molp.2016.09.014.

Jo ur

[39] H.M. Zhang, T. Liu, C.J. Liu, et al., AnimalTFDB 2.0: a resource for expression, prediction and functional study of animal transcription factors. Nucl. Acids Res. 43 (2015): 76-81. DOI: 10.1093/nar/gku887.[40] M.T. Bassi, M.P. Sperandeo, B. Incerti, et al., SLC7A8, a gene mapping within the lysinuric protein intolerance critical region, encodes a new member of the glycoprotein-associated amino acid transporter family. Genomics 62 (1999): 297-303. DOI: 10.1006/geno.1999.5978. [41] F. Mertens, B. Johansson, T. Fioretos, et al., The emerging complexity of gene fusions in cancer. Nat. Rev. Cancer. 15 (2015):371-81. DOI: 10.1038/nrc3947. [42] Q. Pan, O. Shai, L.J. Lee, et al., Deep surveying of alternative splicing

Journal Pre-proof

complexity in the human transcriptome by high-throughput sequencing. Nat. Genet. 40 (2008): 1413-1415. DOI: 10.1038/ng.259. [43] J. Gao, X.W. Wang, Z.H. Zou, et al., Transcriptome analysis of the differences in gene expression between testis and ovary in green mud crab (Scylla paramamosain). BMC Genomics 15 (2014): 585. DOI: 10.1186/1471-2164-15-585

of

[44] S. Vainio, M. Heikkila, A. Kispert, A., et al., Female development in mammals is regulated by Wnt-4 signalling. Nature 397(1999): 405-409. DOI: 10.1038/17068.

ro

[45] F.R. Lou, T.Y. Yang, Z.Q. Han, et al., Transcriptome analysis for identification of

-p

candidate genes related to sex determination and growth in Charybdis japonica. Gene

Jo ur

na

lP

re

66 (2018): 10-16. DOI: 10.1016/j.gene.2018.07.044.

Journal Pre-proof Author statement

Jo ur

na

lP

re

-p

ro

of

Lou Fangrui: Methodology, Validation, Formal analysis, Writing-Original draft. Song Na: Validation, Supervision, Project administration. Han Zhiqiang: Conceptualization, Methodology, Resources, Writing - Review & Editing, Funding acquisition. Gao Tianxiang: Validation, Supervision, Writing- Reviewing & Editing,