Accepted Manuscript De novo assembly, gene annotation, and marker development using Illumina paired-end transcriptome sequencing in the Crassadoma gigantea
Shanmao Cao, Lijie Zhu, Hongtao Nie, Minghao Yin, Gang Liu, Xiwu Yan
PII: S0378-1119(18)30257-9
DOI: doi:10.1016/j.gene.2018.03.019
Reference: GENE 42641
To appear in: Gene
Received date: 6 November 2017
Revised date: 13 February 2018
Accepted date: 6 March 2018
Please cite this article as: Shanmao Cao, Lijie Zhu, Hongtao Nie, Minghao Yin, Gang Liu, Xiwu Yan, De novo assembly, gene annotation, and marker development using Illumina paired-end transcriptome sequencing in the Crassadoma gigantea. Gene (2017), doi:10.1016/j.gene.2018.03.019
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
De novo assembly, gene annotation, and marker development using Illumina paired-end transcriptome sequencing in the Crassadoma gigantea Running title: Transcriptome sequencing in Crassadoma gigantea
Shanmao Cao a, Lijie Zhu a, Hongtao Nie a*,1, Minghao Yin b, Gang Liu a, Xiwu Yan a
a College of Fisheries and Life Science, Engineering and Technology Research Center of Shellfish Breeding in Liaoning Province, Dalian Ocean University, Dalian, 116023, China
b Dalian City Oceanic and Fishery Administration, 100000 Dalian, China
* Corresponding author. Tel.:+86-411-84763083. E-mail address: [email protected] (H. Nie).
1 Present address: Department of Anatomy, Physiology, and Pharmacology, College of Veterinary Medicine, Auburn University, Auburn, Alabama 36849-5519, USA.
Abstract
Crassadoma gigantea is an important commercial marine bivalve species in Baja California and Mexico. In this study, we applied RNA-Seq technology to profile the transcriptome of C. gigantea for the first time. A total of 80,832,518 raw reads were produced on an Illumina HiSeq4000 platform, and 77,306,198 (95.64%) clean reads were generated after trimming the adaptor sequences. The transcriptome was assembled into 158,855 transcripts with an N50 size of 1995 bp and an average size of 1008 bp. A number of DNA repair related genes, such as MSH3, and genes encoding different groups of growth factors, such as EGF, TGF, IGF, and FGF, were found in the transcriptome data of C. gigantea. In addition, immune related Toll-like receptor (TLR) genes, including TLR1, TLR2, TLR3, TLR4, TLR5, TLR6, TLR7, TLR8, and TLR9, were also observed in C. gigantea. A set of 12 polymorphic microsatellite loci was developed and characterized in C. gigantea for the first time. The results show that the number of alleles and expected heterozygosity ranged from 3 to 9 and from 0.254 to 0.820, respectively. The average polymorphic information content was 0.790. These microsatellite loci will facilitate future studies of population structure and conservation genetics in this species.
Keywords: Crassadoma gigantea, Transcriptome, Simple sequence repeats, Functional related genes
1. Introduction
The rock scallop, Crassadoma gigantea (Gray 1825) (formerly Hinnites multirugosus), is a prized food item among North American west coast sport divers (Culver et al., 2006). This species is mainly distributed along the Pacific coast of North America, from Alaska in the north through Canada and the United States (including California) to Mexico in the south (Cao et al., 2016a). It has many commercially important characteristics, including a large adductor muscle and a high market value for both live and half-shell products, and it is therefore considered an emerging prospect for marine aquaculture (Leighton et al., 1979; Culver et al., 2006). C. gigantea aquaculture was initiated in the early 1970s and has continued since then (Leighton et al., 1976; Olsen, 1984; Leighton, 1991; Bourne, 1991). However, the development of rock scallop culture has been constrained by the inability to produce commercial quantities of scallop seed and the lack of cost-effective grow-out techniques (Bourne, 1991). In the past decades, considerable research has been conducted on this species, including reproductive biology (Lauren, 1982; Malachowski, 1988; Barber and Blake, 2006), biochemical physiology (Whyte et al., 1990b) and aquaculture ecology (Culver et al., 2006; Cao et al., 2017; Cary et al., 1981). There have been advances in the induction of spawning, cultivation of larvae, and juvenile production of rock scallop (Christopher et al., 1996; Cao et al., 2017). In addition, previous studies on C. gigantea have reported on filter-feeding rate (Cao et al., 2016b), nutritive condition (Whyte et al., 1990a), toxicity (Beitler, 1991), optimization of larval and spat collection (Christopher et al., 1996) and plasticity of attachment (Culver et al., 2006). However, functional gene analysis and molecular marker development in C. gigantea have not yet been reported due to the lack of genomic and transcriptome resources.
RNA sequencing (RNA-Seq) is an efficient high-throughput method that has been used to obtain large amounts of transcriptome sequence information, even for non-model organisms that lack a reference genome (Marioni et al., 2008; Feldmesser et al., 2014; Wang et al., 2014). The main advantages of RNA-Seq are its good repeatability, wide detection range, cost-effectiveness, and quantitative measurement of gene activity and expression (Nagalakshmi et al., 2008). Moreover, RNA-Seq is a powerful technique for the analysis of gene expression due to its higher sensitivity and specificity in comparison to microarrays, along with its ability to detect new genes, rare transcripts, alternative splice isoforms and novel SNPs which can be used for association studies (Marioni et al., 2008; Nielsen et al., 2011). Because transcriptome sequences largely exclude non-coding DNA, they contain a high percentage of functional information, which can help to reveal the molecular mechanisms of functional genes (Shiel et al., 2014; Werner et al., 2013; Zhang et al., 2014; Liu et al., 2015).
Meanwhile, transcriptome data can be used to identify microsatellite markers (Li et al., 2015). Compared with genomic simple sequence repeat (SSR) markers, SSRs developed using RNA-Seq technologies can help to identify candidate functional genes and increase the efficiency of marker assisted selection (Gupta and Rustgi, 2004; Wu et al., 2014). In recent years, RNA-seq has been extensively employed for transcriptional analysis, novel gene discovery, and molecular marker development in a number of bivalve organisms (Hou et al., 2011; Artigaud et al., 2014; Deng et al., 2014; Uliano-Silva et al., 2014).
In the present study, RNA-Seq analyses were conducted in C. gigantea using Illumina sequencing technology to discover functional related genes and to develop microsatellite markers in this economically important bivalve. To our knowledge, this is the first report on transcriptome analysis and marker development of C. gigantea. The sequencing data, transcript sequence information, and polymorphic SSR markers are valuable resources for further genetic and molecular studies of C. gigantea. This work provides a database of molecular resources for this taxon and can guide future studies related to non-model species.
2. Materials and methods
2.1. Sample collection
In March 2017, we collected C. gigantea samples from Hekou (Dalian, Liaoning Province, China); the animals were offspring of introduced scallops provided by Canada Vancouver products. Offspring of C. gigantea with a shell height of about 3 cm were used as experimental material in this study. The tissue samples (gill, adductor muscle, mantle, and visceral mass) were obtained from a one-year-old scallop. Each tissue sample was immediately frozen in liquid nitrogen after collection and stored at −80 °C until RNA extraction.
Total RNA was extracted separately from each tissue following the protocol previously described in Hu et al. (2006). The degradation and contamination of total RNA were monitored on 1% agarose gels. RNA purity was checked using the NanoPhotometer® spectrophotometer (IMPLEN, CA, USA). RNA concentration was measured using the Qubit RNA Assay Kit in a Qubit 2.0 Fluorometer (Life Technologies, CA, USA). RNA integrity was assessed using the RNA Nano 6000 Assay Kit of the Agilent Bioanalyzer 2100 system (Agilent Technologies, CA, USA). To develop and characterize polymorphic microsatellites derived from the C. gigantea transcriptome data, cultured offspring of the wild population introduced from Canada were used to validate marker effectiveness and polymorphism. Thirty individuals were sacrificed for the microsatellite marker polymorphism screening analysis, and the adductor muscles were preserved in 100% ethanol until DNA extraction. Genomic DNA was extracted from the adductor muscles using the TIANamp Marine Animals DNA extraction kit (TIANGEN).
2.2. Library Preparation for Transcriptome Sequencing
A total of 1.5 μg RNA was used as input material for the RNA sample preparations. Equal amounts of RNA from gill, adductor muscle, mantle, and visceral mass were mixed to construct one transcriptome library. The sequencing library was generated using the NEBNext Ultra™ RNA Library Prep Kit for Illumina (NEB, USA). The mRNA was purified using poly-T oligo-attached magnetic beads, and fragmentation was carried out using divalent cations under elevated temperature in NEBNext First Strand Synthesis Reaction Buffer (5X). First strand cDNA was synthesized using random hexamer primers and M-MuLV Reverse Transcriptase (RNase H-). Second strand cDNA synthesis was subsequently performed using DNA Polymerase I and RNase H. Remaining overhangs were converted into blunt ends via exonuclease/polymerase activities. After adenylation of the 3' ends of the DNA fragments, NEBNext Adaptors with a hairpin loop structure were ligated to prepare for hybridization. The library fragments were purified with the AMPure XP system (Beckman Coulter, Beverly, USA) to select cDNA fragments with an insert size of about 150-200 bp. Then 3 μl USER Enzyme (NEB, USA) was used with the size-selected, adaptor-ligated cDNA at 37 °C for 15 min, followed by 5 min at 95 °C, before PCR. Finally, the PCR products were purified and library quality was assessed on the Agilent Bioanalyzer 2100 system.
2.3. Quality control, Analysis and Assembly
Raw data (raw reads) in FASTQ format were first processed through in-house Perl scripts. The raw sequencing reads were filtered to generate high-quality reads for de novo assembly (Kang et al., 2017). In this step, clean data (clean reads) were obtained by removing reads containing adapters, reads containing poly-N, and low-quality reads from the raw data. At the same time, the Q20, Q30, GC content, and sequence duplication level of the clean data were calculated. All downstream analyses were based on the high-quality clean data. The clean reads were assembled de novo with Trinity, an efficient and stable transcriptome assembly software for RNA-Seq data developed jointly by the Broad Institute and the Hebrew University of Jerusalem (Grabherr et al., 2011).
2.4. Functional annotation
Functional annotations for the unigenes were conducted by sequence similarity comparisons against the nonredundant nucleotide database and the nonredundant protein database of NCBI (http://www.ncbi.nlm.nih.gov/) with BLASTx (E-value cutoff ≤ 1e−5), as well as the SWISS-PROT database (European Bioinformatics Institute, ftp://ftp.ebi.ac.uk/pub/databases/swissprot/), the Clusters of Orthologous Groups of proteins database (KOG) using diamond v0.8.22 (E-value cutoff ≤ 1e−3) (Tatusov et al., 2000), and the Kyoto Encyclopedia of Genes and Genomes database (KEGG) (Kanehisa et al., 2004) with BLASTx (E-value cutoff ≤ 1e−10). Moreover, functional assignments of the unigenes were further annotated with Gene Ontology (GO) terms using Blast2GO v2.5 (E-value cutoff ≤ 1e−6) (Ashburner et al., 2000). BLAST assignments were conducted against Nt using NCBI blast 2.2.28+ and against Nr using diamond v0.8.22 (E-value cutoff ≤ 1e−5) (Deng et al., 2006). We used KOBAS software to test the statistical enrichment of differentially expressed genes in KEGG pathways (Mao et al., 2005).
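As an illustration of how an E-value cutoff of the kind described above is applied, the following minimal Python sketch keeps, for each unigene, the highest-scoring hit whose E-value does not exceed the cutoff. It assumes the standard 12-column tabular output (outfmt 6) that BLAST and DIAMOND can produce; the function name best_hits, the sequence identifiers, and the example records are hypothetical and do not come from this study.

```python
import csv
import io

# Column order of the standard 12-column tabular output (outfmt 6).
# That the pipeline used this format is an assumption made for the sketch.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} for the highest-bitscore
    hit of each query whose E-value is at or below max_evalue."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        rec = dict(zip(COLUMNS, row))
        evalue = float(rec["evalue"])
        bits = float(rec["bitscore"])
        if evalue > max_evalue:
            continue  # hit fails the E-value cutoff
        prev = best.get(rec["qseqid"])
        if prev is None or bits > prev[2]:
            best[rec["qseqid"]] = (rec["sseqid"], evalue, bits)
    return best


# Hypothetical hits for two unigenes (identifiers are invented).
example = io.StringIO(
    "unigene1\tprotA\t75.0\t200\t50\t0\t1\t600\t1\t200\t1e-80\t250\n"
    "unigene1\tprotB\t60.0\t180\t72\t0\t1\t540\t1\t180\t1e-30\t120\n"
    "unigene2\tprotC\t40.0\t50\t30\t0\t1\t150\t1\t50\t0.5\t30\n"
)
hits = best_hits(example, max_evalue=1e-5)
```

In this toy input, only unigene1 is retained, because the single hit of unigene2 has an E-value above the cutoff. Stricter cutoffs such as 1e−10 can be passed through the max_evalue argument.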
2.5. SSR polymorphism validation
MIcroSAtellite (MISA) (http://pgrc.ipk-gatersleben.de/misa/) was used to identify repetitive elements in the assembled transcriptome of C. gigantea. We focused on SSRs with motifs of 2–6 nucleotides and a minimum of six contiguous repeat units. PCR primers for each microsatellite locus were designed using the Primer 3 software and tested on 30 wild individuals of C. gigantea sampled from Hekou, Dalian. PCR amplifications were carried out in a 10 μl reaction volume containing about 100 ng genomic DNA, 1×PCR buffer (TaKaRa), 0.2 mM dNTPs, 1.5 mM MgCl2, 1 μM of each primer and 0.25 U Taq DNA polymerase (TaKaRa). PCR amplification was performed on a gradient thermal cycler (Bio-Rad) with the following protocol: denaturation for 5 min at 94 °C; 30 cycles of 45 s at 94 °C, 45 s at the annealing temperature (Table 1), and 1 min at 72 °C; a final extension at 72 °C for 10 min; and storage at 4 °C. The amplification products were then resolved on a 6% denaturing polyacrylamide gel and visualized via silver staining. A 10 bp DNA ladder (Invitrogen, San Diego, CA, USA) was used as a reference marker for allele size determination. The number of alleles and the observed (HO) and expected (HE) heterozygosities of these microsatellite loci were estimated using the Microsatellite Analyzer software (Dieringer and Schlotterer, 2003). Tests for linkage disequilibrium and for deviations from Hardy-Weinberg equilibrium (HWE) were carried out with GENEPOP 4.0 (Raymon, 1995; Rousset, 2008). Null alleles were checked with the MICRO-CHECKER 2.2.3 program (van Oosterhout et al., 2004).
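The search criterion stated above (motifs of 2–6 nucleotides with at least six contiguous repeat units) can be sketched with a regular expression. The Python example below is only an illustration of that criterion, not a reimplementation of MISA: it ignores compound SSRs, and the function name find_ssrs and the test sequence are hypothetical.

```python
import re

# A motif of 2-6 bases (tried shortest first) followed by at least five
# further copies of itself, i.e. six or more contiguous repeat units.
SSR_PATTERN = re.compile(r"([ACGT]{2,6}?)\1{5,}")


def find_ssrs(sequence):
    """Return a list of (start, motif, repeat_count) for perfect SSRs."""
    sequence = sequence.upper()
    found = []
    for match in SSR_PATTERN.finditer(sequence):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # a run of one base is a mononucleotide repeat, not a 2-6 nt motif
        repeats = len(match.group(0)) // len(motif)
        found.append((match.start(), motif, repeats))
    return found


# Invented sequence containing (AT)7 and (ATC)6.
ssrs = find_ssrs("GGC" + "AT" * 7 + "CCG" + "ATC" * 6 + "T")
```

For this invented sequence the sketch reports a dinucleotide repeat (AT, seven units) starting at position 3 and a trinucleotide repeat (ATC, six units) starting at position 20, using 0-based positions.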
3. Results and discussion
3.1. Transcriptome sequencing and assembly
We pooled RNA from the visceral mass, gill, adductor muscle, and mantle of rock scallops and sequenced the pooled sample on the Illumina HiSeq™ platform. A total of 80,832,518 raw reads were produced from the Illumina sequencing platform and were deposited in the NCBI SRA database (accession number PRJNA416200). A total of 77,306,198 (95.64%) clean reads were generated after trimming the adaptor sequences, the ambiguous N nucleotides (>0.1%), and the low-quality sequences. These clean reads comprised 11.6 G clean bases with a mean read length of 150 bp and were used for the subsequent analysis. The nucleotide analysis showed a GC content of 42.12% and a Q30 of 92.03% (Table 2). The transcriptome was assembled into 158,855 transcripts with an N50 size of 1995 bp and an average size of 1008 bp. The transcriptome library generated 96007 unigene sequences with an N50 size of 2298 bp and an average size of 1479 bp (Table 2). The size distribution of the assembled transcripts and unigenes is shown in Fig. 1. Among the transcripts, the largest fraction (30.58%) of the assembled sequences had sizes ≤ 301 bp. A total of 33923 (21.36%), 29720 (18.71%), 23984 (15.1%), and 22650 (14.26%) transcripts had sizes in the ranges of 301-500 bp, 501-1000 bp, 1001-2000 bp, and >2000 bp, respectively. Among the unigenes, 21531 (22.43%) sequences were no longer than 500 bp, 27878 unigenes (29.04%) were in the length range of 501 to 1000 bp, 23948 unigenes (24.94%) ranged from 1001 to 2000 bp, and 22650 unigenes (23.59%) were longer than 2000 bp. Gill, adductor muscle, mantle and visceral mass samples from one individual were pooled into a single mixed cDNA library for sequencing. The N50 length (2298 bp) of the 54,852 assembled unigene sequences was larger than that reported for many other shellfish species. For example, the N50 length was 356 bp for the South African abalone Haliotis midae (Franchini et al., 2011), 486 bp for the pearl oyster Pinctada maxima (Deng et al., 2014), and 655 bp for Chlamys farreri (Shi et al., 2013; Chen et al., 2016).
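Because N50 is used here as the main measure of assembly contiguity, a short sketch of how it is computed may be useful. N50 is the length L such that sequences of length L or longer together contain at least half of all assembled bases. The Python function below is a generic illustration with invented lengths; it is not the script used to produce Table 2.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths: the length L such
    that sequences of length >= L contain at least half of the total bases."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Invented example: five sequences totalling 1500 bases.
toy_lengths = [100, 200, 300, 400, 500]
toy_n50 = n50(toy_lengths)
toy_mean = sum(toy_lengths) / len(toy_lengths)
```

In the toy example the two longest sequences (500 and 400 bases) already cover more than half of the 1500 bases, so N50 is 400 while the mean length is 300. The same effect explains why the N50 values reported above exceed the corresponding average sizes.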
3.2. Assembly evaluation and annotation
The unigenes ranging in length from 500 to 2000 bp and those ≥ 2000 bp showed higher match percentages than those ≤ 500 bp (Table 3). The assembly derived unigenes were searched against public protein and nucleotide databases for validation and annotation. The summary statistics for BLAST assignment against the NR, NT, KEGG, Swissprot, PFAM, KOG, and GO databases are shown in Table 3. Out of 96007 unigenes, 36938 (38.47%), 35575 (37.05%), 35524 (37%), and 28533 (29.71%) unigenes showed annotation hits to homologous sequences in the NR, GO, PFAM and SwissProt databases, respectively. In addition, 6802, 8074 and 15682 unigenes showed homology matches in the NT, KEGG and KOG databases, respectively. In all, a total of 45,389 (47.27%) assembled unigene sequences showed annotation hits to at least one of the protein and nucleotide databases, and 3386 unigenes (3.52%) showed annotation hits to all databases. Unannotated sequences can arise for several reasons. Some may represent short sequences lacking a conserved protein domain. The transcriptome may also contain incompletely spliced introns, orphaned untranslated regions (UTRs), non-coding genes, and random transcriptional noise, none of which may show homology to available sequences in the public databases. An absence of annotations in a transcriptome may also reflect genes expressed at low levels or not expressed at the time of RNA extraction (Kang et al., 2017).
A five-way Venn diagram (Fig. 2) also shows the representation of assembled unigenes annotated against the NT, NR, GO, PFAM and KOG databases. It shows that 4349 unigenes were concurrently annotated by all five databases, while 12935 unigenes had homologous sequences in the GO, NR and PFAM databases. Besides these unigenes, 9073 unigenes were annotated in the above three databases and KOG, and 7912 unigenes were annotated in both the GO and PFAM databases. Because the NR database yielded the largest number of annotation hits, we examined the homology characteristics of the BLASTX hits for the unigenes annotated in this database. The E-value distribution indicated that 30.8% of annotated unigenes had an E-value in the range of 0 to 1E−100, followed by 18.3% of unigenes with an E-value of 1E−100 to 1E−60 (Supplementary Fig. 1A). The similarity distribution shows that 48% of the unigenes had a best-hit similarity of 60-80%, 32.1% of unigenes had a similarity of 40-60% and 17.3% of unigenes had a similarity of 80-95% (Supplementary Fig. 1B). In the comparison with other bivalve species, 26975 C. gigantea unigenes had hits to 34 species (Table 4). Among these hit species, Crassostrea gigas was the most predominant (26054, 96.59%) (Zhang et al., 2012). Other top hit species with sufficient genomic resources included Azumapecten farreri (326, 1.21%), Mytilus galloprovincialis (164, 0.6%), Argopecten irradians (77, 0.29%), Mizuhopecten yessoensis (68, 0.25%), Pinctada fucata (64, 0.24%), Mimachlamys nobilis (34, 0.13%), Pinctada martensii, Littorina littorea (23, 0.09%), and C. ariakensis (20, 0.07%).
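The annotation rates quoted in this section follow directly from the unigene counts. As a quick arithmetic check, the sketch below recomputes each database's share of the 96007 unigenes from the counts given in the text; the variable and function names are illustrative only.

```python
TOTAL_UNIGENES = 96007

# Numbers of annotated unigenes per database, as reported in the text.
annotated = {
    "NR": 36938,
    "GO": 35575,
    "PFAM": 35524,
    "SwissProt": 28533,
    "KOG": 15682,
    "KEGG": 8074,
    "NT": 6802,
}


def percent(count, total=TOTAL_UNIGENES):
    """Share of the total expressed as a percentage with two decimals."""
    return round(100 * count / total, 2)


shares = {database: percent(count) for database, count in annotated.items()}

# Databases ordered from the most to the fewest annotated unigenes.
ranking = sorted(annotated, key=annotated.get, reverse=True)
```

The recomputed shares for NR, GO and SwissProt agree with the reported 38.47%, 37.05% and 29.71%, and the ranking confirms that NR provides the most annotation hits.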
3.3. Unigenes annotation and Gene Ontology (GO) classification
The unigenes annotated in the GO database (35575) were classified in terms of their associated biological process, cellular component and molecular function. Details of the distribution of the unigenes in the main ontologies are shown in Fig. 3. Among the annotated unigenes, 100868, 59792, and 41990 sequences were classified under biological process, cellular component, and molecular function, respectively. In addition, 8116, 2491, and 1373 unigenes were associated with biological process and molecular function, biological process and cellular component, and cellular component and molecular function, respectively. About 13816 annotated unigenes were associated with all three GO categories. This suggests that there is a strong overlap of associated functions for unigenes under the biological process and molecular function categories. Among the top 10 GO term assignments for biological process, the most frequent were cellular process (GO:0009987, 20348), metabolic process (GO:0008152, 16847), and single organism process (GO:0044699, 16718). Other significant GO terms under this category included biological regulation (GO:0065007, 8777), regulation of biological process (GO:0050789, 8371), localization (GO:0051179, 6691), response to stimulus (GO:0050896, 6506), cellular component organization or biogenesis (GO:0071840, 3946), and signaling (GO:0023052, 4543). The dominant terms in the cellular component category were cell part (GO:0044464, 10718) and cell (GO:0005623, 10720), followed by membrane (GO:0016020, 7435), organelle (GO:0043226, 7234), membrane part (GO:0044425, 6948), and macromolecular complex (GO:0032991, 6762). Under molecular function, binding (GO:0005488, 20261) was the most dominant term, followed by catalytic activity (GO:0003824, 13894). Only a few genes were assigned to antioxidant activity (GO:0016209, 89), metallochaperone activity (GO:0016530, 13), transcription factor activity, protein binding (GO:0000988, 360), and structural molecule activity (GO:0005198, 819) in the molecular function category. The sequence and annotation information from the GO annotations provides valuable gene resources for studying the molecular basis that underlies the economic traits of C. gigantea. Transcripts putatively taking part in growth (GO:0040007) and reproduction (GO:0000003) were found in our RNA-seq database (Supplementary Table 1); such transcripts were also found in Patinopecten yessoensis (Hou et al., 2011).
3.4. Genes involved in immune and stress responses
In this study, GO analysis identified 1699 transcripts that are involved in cellular responses to environmental pressure and stimulus. The stress responses (GO:0006950) included responses to oxidative stress, starvation, metalloids, and wounding, as well as defense responses; the stimulus responses included responses to external, endogenous, abiotic, and biotic stimuli. A number of DNA repair related genes, such as MSH3 (Hall et al., 1996; Marsischky et al., 1996), and genes encoding different groups of growth factors, such as EGF, TGF, IGF, and FGF (Daub et al., 1996; Wang et al., 2013), were found in the transcriptome data of C. gigantea. In addition, immune related Toll-like receptor (TLR) genes, including TLR1, TLR2, TLR3, TLR4, TLR5, TLR6, TLR7, TLR8, and TLR9, were also observed in C. gigantea. TLR1, 2, 3, 5, 7, 9, 21, and 22 have been identified in Takifugu rubripes, among which TLR1 was identified as an immune enriched gene in the gill and was successfully cloned (Cui et al., 2014). A novel toll-like receptor gene called CnTLR-1 has been cloned, and its transcript levels under different challenges were determined (Lu et al., 2016). Heat shock proteins (HSP) are a family of proteins that play an important role in modulating the stress response, including stress caused by temperature, salinity, pH, hypoxia, aerial exposure, heavy metals, and bacterial infection (Zhang et al., 2012; Nie et al., 2017). HSP expression levels increase significantly under stress, enabling organisms to resist the insults of adverse stressors so as to maintain the homeostasis and survival of cells (Zhang et al., 2012). In this study, members of the HSP family, including HSP40, HSP60, HSP70, HSP90, CaRHSP1 and HSP75, were found in C. gigantea.
3.5. KOG and KEGG classification
Based on a conserved domain alignment, all the unigenes were searched against the KOG database to classify and functionally predict the orthologous gene products. In the KOG classification analysis, 15682 unigenes were assigned to 25 protein families (Fig. 4), among which signal transduction mechanisms (2732 unigenes, 17.42%) represented the largest annotated group, followed by general function prediction (2558 unigenes, 16.31%), posttranslational modification, protein turnover, chaperones (1556 unigenes, 9.92%), function unknown (1168 unigenes, 7.45%) and intracellular trafficking, secretion, and vesicular transport (1018 unigenes, 6.49%). The smallest KOG groups to which the unigenes were mapped were unnamed protein (5), cell motility (50), nuclear structure (99), and defense mechanisms (150 unigenes, 0.96%). KEGG is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies. We found that 14671 unigenes were assigned to 229 KEGG pathways (Fig. 5). The most predominant category was signal transduction (2002) from Environmental Information Processing, followed by endocrine system (1143) from Organismal Systems and transport and catabolism (1105) from Cellular Processes. Within Genetic Information Processing, the largest categories were folding, sorting and degradation (746) and translation (662). Within Metabolism, the largest categories were lipid metabolism (706), carbohydrate metabolism (662), amino acid metabolism (495), glycan biosynthesis and metabolism (444), and metabolism of cofactors and vitamins (336). In C. gigantea, the unigenes predicted under immune system process were related to 15 signaling pathways: chemokine signaling pathway, leukocyte transendothelial migration, Fc gamma R-mediated phagocytosis, toll-like receptor signaling pathway, T cell receptor signaling pathway, RIG-I-like receptor signaling pathway, NOD-like receptor signaling pathway, natural killer cell mediated cytotoxicity, B cell receptor signaling pathway, Fc epsilon RI signaling pathway, cytosolic DNA-sensing pathway, antigen processing and presentation, platelet activation, hematopoietic cell lineage, and complement and coagulation cascades. The signaling pathway information from the transcriptome data may help identify immune-related genes for C. gigantea. The number of unigenes per KEGG pathway ranged from 2 to 395. The top 16 pathways with the greatest number of genes are listed in Table 5. KEGG analysis identified transcripts potentially involved in responses to environmental pressure and stimulus. Overall, functional analysis of our RNA-seq database identified candidate genes potentially involved in growth, reproduction, stress and immunity. Further experiments are needed to validate the functions and expression patterns of these candidate genes.
3.6. Characterization of SSRs and PCR primers
Screening of 23682 unigenes with MISA detected 24795 SSR sites distributed in 19182 unigenes, accounting for approximately 19.98% of the total unigenes. A total of 4148 unigenes contained more than one SSR (Table 6). Trinucleotides and dinucleotides were the most frequent types of repeats in C. gigantea (Table 7). The SSRs identified in the Cristaria plicata (Patnaik et al., 2016b) and Crassostrea hongkongensis (Tong et al., 2015) transcriptomes showed a similar pattern, with dinucleotide repeat motifs dominant. Trinucleotides and tetranucleotides have been reported as the most frequent repeats in C. farreri and M. yessoensis (Wang et al., 2013). AT/AT was the dominant dinucleotide repeat motif, and AAT/ATT and ATC/ATG were the dominant trinucleotide repeat motifs. In C. farreri and M. yessoensis, the ATC motif was the most abundant trinucleotide motif in both scallop species, and it was also common in other bivalves (Wang et al., 2007). Ninety-seven microsatellite-containing sequences were selected for microsatellite marker optimization and validation on the basis of their repeat number and flanking sequences. We focused on SSRs with motifs of 2–6 nucleotides and a minimum of six contiguous repeat units. The 97 primer pairs, selected randomly within each SSR type, comprised 22 dinucleotide, 66 trinucleotide, and 9 tetranucleotide repeats. Of the 97 potential microsatellite markers, 42 (43.3%) amplified the expected products, and the PCR products were separated on denaturing polyacrylamide gels. Among the 42 primer pairs, 12 revealed length polymorphisms in 30 C. gigantea samples (Table 1). This success rate (43.3%) was higher than that reported in the abalone Haliotis diversicolor diversicolor (28.3%) and lower than that in the clams Mercenaria mercenaria (50%) and Paphia textile (53.8%) (Chen et al., 2016; Wang et al., 2010; An et al., 2012). The percentage of polymorphic loci (28.6%) in C. gigantea was higher than that reported in the abalone (20.2%) and lower than that in the clam Ruditapes philippinarum (43.9%) (Nie et al., 2014). In the present study, the number of alleles per locus varied from 3 to 9, with an average of 3.25. The observed and expected heterozygosities ranged from 0.033 to 0.923 and from 0.254 to 0.820, with an average of 0.496 and 0.542, respectively. Two of the 12 microsatellite loci showed significant departure from HWE after a sequential Bonferroni correction (P < 0.01), which might be caused by the presence of null alleles and the small sample size (Pemberton et al., 1995). Traditional approaches for SSR discovery tend to be labor intensive, time consuming, or highly dependent on available resources. In this study, the transcriptome data provided a rich EST resource for mining genetic variants. The failure of some of the primer pairs to produce amplicons could be caused by the location of the primers across splice sites, chimeric primers, or poor quality sequences. Because we identified many other SSRs in the transcriptome dataset, more PCR primers could be developed to validate them. The SSRs could be very useful in genetic polymorphism assessment, quantitative trait loci mapping, comparative genomics, and functional genomics studies.
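To make the diversity statistics concrete, the sketch below computes the number of alleles, the observed heterozygosity (HO, the fraction of heterozygous individuals) and the expected heterozygosity (HE = 1 − Σ p_i², where p_i is the frequency of allele i) for one locus. This is a minimal illustration with invented genotypes; the text does not state whether Microsatellite Analyzer applies a small-sample correction to HE, so the uncorrected form used here is an assumption.

```python
from collections import Counter


def locus_diversity(genotypes):
    """Return (number_of_alleles, HO, HE) for one diploid locus.

    genotypes is a list of (allele_1, allele_2) pairs, one per individual,
    with alleles coded for example by fragment size in bp.
    """
    n = len(genotypes)
    if n == 0:
        raise ValueError("at least one genotype is required")
    # HO: proportion of individuals carrying two different alleles.
    ho = sum(a != b for a, b in genotypes) / n
    # Allele frequencies over the 2n gene copies sampled.
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total_copies = 2 * n
    # HE: probability that two randomly drawn copies differ (no size correction).
    he = 1 - sum((c / total_copies) ** 2 for c in counts.values())
    return len(counts), ho, he


# Invented genotypes (allele sizes in bp) for four individuals.
toy = [(150, 150), (150, 154), (154, 158), (150, 158)]
n_alleles, ho, he = locus_diversity(toy)
```

For the four invented genotypes, three alleles are present, three of the four individuals are heterozygous (HO = 0.75), and the allele frequencies 0.5, 0.25 and 0.25 give HE = 0.625. A large deficit of HO relative to HE at a locus is the pattern that tests for departure from HWE and null-allele checks are designed to detect.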
4. Conclusion
In this study, we used the Illumina™ platform to sequence the transcriptome of C. gigantea and provide a valuable resource for future research on this species. Unigenes showing putative functions in immunity, defense and stress response, including the TLRs, TGF, and HSPs, were identified. A total of 24795 SSRs were identified and characterized as potential molecular markers; 97 primer pairs were randomly selected to test amplification and polymorphism, and 12 loci were polymorphic. These twelve microsatellites enrich the molecular marker resources of C. gigantea and will be applicable to studies of genetic diversity, population structure, genetic linkage mapping and identification of quantitative trait loci, as well as to future investigations of molecular markers for production and adaptive traits of individuals and populations.
Acknowledgments
The study was supported by grants from the project of Department of Ocean and Fisheries of Liaoning Province (201214), the project of Dalian City Oceanic and Fishery Administration (2012011), the Modern Agro-industry Technology Research System (CARS-49), the Cultivation Plan for Youth Agricultural Science and Technology Innovative Talents of Liaoning Province (2014004), and the Dalian Youth Science and Technology Star Project Support Program (2016RQ065).
References
An, H.S., Lee, J.W., Hong, S.W., 2012. Application of novel polymorphic microsatellite loci identified in the Korean Pacific Abalone (Haliotis diversicolor supertexta (Haliotidae)) in the genetic characterization of wild and released populations. Int. J. Mol. Sci. 13(9), 10750-10764.
Artigaud, S., Thorne, M.A.S., Richard, J., 2014. Deep sequencing of the mantle transcriptome of the great scallop Pecten maximus. Mar. Genomics 15, 3-4.
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M., Sherlock, G., 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25-29.
Barber, B.J., Blake, N.J., 2006. Reproductive physiology. Developments in Aquaculture and Fisheries Science 35, 357-416.
Beitler, M.K., 1991. Toxicity of adductor muscles from the purple hinge rock scallop (Crassadoma gigantea) along the Pacific coast of North America. Toxicon 29(7), 889-894.
Bourne, N., 1991. West coast of North America. In: Shumway, S.E. (Ed.), Scallops: Biology, Ecology, and Aquaculture. Elsevier Science Publishers, New York, pp. 925-939.
Cao, S.M., Liang, W.F., Wang, J., Liu, G., 2016b. Fundamental research on filter-feeding rate of juvenile rock scallop. J. Dalian Ocean Univ. 31(6), 613-617.
Cao, S.M., Wang, H., Chen, W., 2016a. Analysis, evaluation and comparison of nutritive composition in rock scallop Crassadoma gigantea with three Chinese scallops. J. Dalian Ocean Univ. 31(5), 544-550.
Cao, S.M., Wang, J., Wang, Q., Liang, W.F., Yin, M.H., Liu, G., 2017. Artificial breeding of introduced rock scallop Crassadoma gigantea. J. Dalian Ocean Univ. 32(1), 1-6.
Cary, S.C., Leighton, D.L., Phleger, C.F., 1981. Food and feeding strategies in culture of larval and early juvenile purple-hinge rock scallops, Hinnites multirugosus (Gale). J. World Aquaculture Society 12(1), 156-169.
Chen, H., Lin, L., Xie, M., Zhang, G., Su, W., 2015. De novo sequencing and characterization of the Bradysia odoriphaga (Diptera: Sciaridae) larval transcriptome. Comp. Biochem. Physiol. Part D 16, 20-27.
Chen, X., Li, J., Xiao, S., Liu, X., 2016. De novo assembly and characterization of foot transcriptome and microsatellite marker development for Paphia textile. Gene 576, 537-543.
Christopher, M.P., Edwin, B., 1996. Settlement of larvae of the giant scallop, Placopecten magellanicus (Gmelin), on various artificial and natural substrata under hatchery-type conditions. Aquaculture 141(3-4), 201-221.
Cui, J., Liu, S.K., Zhang, B., Wang, H.D., Sun, H.J., Song, S.H., Qiu, X.M., Liu, Y., Wang, X.L., Jiang, Z.Q., Liu, Z.J., 2014. Transcriptome analysis of the gill and swimbladder of Takifugu rubripes by RNA-Seq. PLoS One 9(1), e85505.
Culver, C.S., Richards, J.B., Page, H.M., 2006. Plasticity of attachment in the purple-hinge rock scallop, Crassadoma gigantea: Implications for commercial culture. Aquaculture 254, 361-369.
Daub, H., Weiss, F.U., Wallasch, C., Ullrich, A., 1996. Role of transactivation of the EGF receptor in signalling by G-protein-coupled receptors. Nature 379(6565), 557.
Deng, Y., Lei, Q., Tian, Q., Xie, S., Du, X., Li, J., Wang, L., Xiong, Y., 2014. De novo assembly, gene annotation, and simple sequence repeat marker development using Illumina paired-end transcriptome sequences in the pearl oyster Pinctada maxima. Biosci. Biotechnol. Biochem. 78(10), 1685-1692.
Deng, Y.Y., Li, J.Q., Wu, S.F., Zhu, Y.P., Chen, Y.W., He, F.C., 2006. Integrated nr database in protein annotation system and its localization. Comput. Eng. 32, 71-74.
Dieringer, D., Schlotterer, C., 2003. Microsatellite analyser (MSA): a platform independent analysis tool for large microsatellite datasets. Mol. Ecol. Notes 3, 167-169.
Feldmesser, E., Rosenwasser, S., Vardi, A., et al., 2014. Improving transcriptome construction in non-model organisms: integrating manual and automated gene definition in Emiliania huxleyi. BMC Genomics 15, 148.
Franchini, P., Van der Merwe, M., Roodt-Wilding, R., 2011. Transcriptome characterization of the South African abalone Haliotis midae using sequencing-by-synthesis. BMC Research Notes 4, 59.
Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., et al., 2011. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology 29(7), 644-652.
Hall, D.B., Holmlin, R.E., Barton, J.K., et al., 1996. Oxidative DNA damage through long-range electron transfer. Nature 382(6593), 731.
Hou, R., Bao, Z., Wang, S., et al., 2011. Transcriptome sequencing and de novo analysis for Yesso scallop (Patinopecten yessoensis) using 454 GS FLX. PLoS One 6, e21560.
Hu, X.L., Bao, Z.M., Hu, J.J., Shao, M.Y., Zhang, L.L., 2006. Cloning and characterization of tryptophan 2,3-dioxygenase gene of Zhikong scallop Chlamys farreri (Jones and Preston 1904). Aquac. Res. 37, 1187-1194.
Kanehisa, M., Goto, S., Kawashima, S., Hattori, M., 2004. The KEGG resource for deciphering the genome. Nucleic Acids Res. 32, 277-280.
Kang, S.W., Bharat, B.P., Wang, H.J., 2017. Sequencing and de novo assembly of visceral mass transcriptome of the critically endangered land snail Satsuma myomphala: Annotation and SSR discovery. Comp. Biochem. Physiol. Part D 21, 77-89.
Lauren, D.J., 1982. Oogenesis and protandry in the purple-hinge rock scallop, Hinnites giganteus, in upper Puget Sound, Washington, U.S.A. Can. J. Zool. 60(10), 2333-2336.
Leighton, D.L., 1979. A growth profile for the rock scallop Hinnites multirugosus held at several depths off La Jolla, California. Mar. Biol. 51(3), 229-232.
Leighton, D.L., 1991. Culture of Hinnites and related scallops on the American Pacific coast. In: Menzel, W. (Ed.), Estuarine and Marine Bivalve Mollusk Culture. CRC Press Inc, Boca Raton, Florida, pp. 100-110.
Li, H.J., Liu, M., Ye, S., Yang, F., 2017. De novo assembly, gene annotation, and molecular marker development using Illumina paired-end transcriptome sequencing in the clam Saxidomus purpuratus. Genes & Genomics, 1-11.
Liu, H., Zheng, H., Zhang, H., Deng, L., Liu, W., Wang, S., Meng, F., Wang, Y., Guo, Z., Li, S., Zhang, G., 2015. A de novo transcriptome of the noble scallop, Chlamys nobilis, focusing on mining transcripts for carotenoid-based coloration. BMC Genomics 16, 44.
Lu, Y., Zheng, H., Zhang, H., Yang, J., Wang, Q., 2016. Cloning and differential expression of a novel toll-like receptor gene in noble scallop Chlamys nobilis with different total carotenoid content. Fish Shellfish Immunol. 56, 229-238.
Malachowski, M., 1988. The reproductive cycle of the rock scallop Hinnites giganteus (Grey) in Humboldt Bay, California. J. Shellfish Res. 7, 341-348.
Mao, X., Cai, T., Olyarchuk, J.G., 2005. Automated genome annotation and pathway identification using the KEGG Orthology (KO) as a controlled vocabulary. Bioinformatics 21, 3787-3793.
Marioni, J.C., Mason, C.E., Mane, S.M., Stephens, M., Gilad, Y., 2008. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 18(9), 1509-1517.
Marsischky, G.T., Filosi, N., Kane, M.F., et al., 1996. Redundancy of Saccharomyces cerevisiae MSH3 and MSH6 in MSH2-dependent mismatch repair. Genes Dev. 10(4), 407-420.
McDonald, B.A., Bourne, N.F., 1989. Growth of the purple hinge rock scallop, Crassadoma gigantea Gray, 1825 under natural conditions and those associated with suspended culture. J. Shellfish Res. 8(1), 179-186.
Nagalakshmi, U., et al., 2008. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320, 1344-1349.
Ni, L.H., Li, Q., Kong, L.F., 2010. Isolation and characterization of 19 microsatellite markers from the Chinese surf clam (Mactra chinensis). Conserv. Genet. Resour. 2(1), 27-30.
Nie, H., Zhu, D., Yang, F., Zhao, L., Yan, X., 2014. Development and characterization of EST-derived microsatellite markers for Manila clam (Ruditapes philippinarum). Conserv. Genet. Resour. 6, 25-37.
Nie, H.T., Liu, L.H., Huo, Z.M., Chen, P., Ding, J.F., Yang, F., Yan, X., 2017. The HSP70 gene expression responses to thermal and salinity stress in wild and cultivated Manila clam Ruditapes philippinarum. Aquaculture 470, 149-156.
Nielsen, R., Paul, J.S., Albrechtsen, A., Song, Y.S., 2011. Genotype and SNP calling from next-generation sequencing data. Nat. Rev. Genet. 12(6), 443-451.
Olsen, S., 1984. Completion Report on Invertebrate Aquaculture Shellfish Enhancement Project. Point Whitney Shellfish Lab. Washington Department of Fisheries. 85 pp.
Patnaik, B.B., Wang, H.J., Kang, S.W., Park, S.Y., Wang, T.H., Park, E.B., Chung, J.M., Song, D.K., Kim, C., Kim, S., Lee, J.B., Jeong, H.C., Park, H.S., Han, Y.S., Lee, Y.S., 2015. Transcriptome characterization for non-model endangered Lycaenids, Protantigius superans and Spindasis takanosis, using Illumina HiSeq 2500 Sequencing. Int. J. Mol. Sci. 16, 29948-29970.
Patnaik, B.B., Wang, T.H., Kang, S.W., Wang, H.J., Park, S.Y., Park, E.B., Chung, J.M., Song, D.K., Kim, C., Kim, S., Lee, J.S., Han, Y.S., Park, H.S., Lee, Y.S., 2016b. Sequencing, de novo assembly, and annotation of the transcriptome of the endangered freshwater pearl bivalve, Cristaria plicata, provides novel insights into functional genes and marker discovery. PLoS One 11(2), e0148622.
Pemberton, J.M., Slate, J., Bancroft, D.R., Barrett, J.A., 1995. Non amplifying alleles at microsatellite loci: a caution for parentage and population studies. Mol. Ecol. 4, 249-252.
Raymond, M., 1995. GENEPOP: Population genetics software for exact tests and ecumenism. Vers. 1.2. J. Hered. 86, 248-249.
Richardson, M.F., Sherman, C.D.H., 2015. De novo assembly and characterization of the invasive Northern Pacific Seastar transcriptome. PLoS One 10, e0142003.
Rousset, F., 2008. GENEPOP 007: a complete re-implementation of the GENEPOP software for Windows and Linux. Mol. Ecol. Resour. 8, 103-106.
Shiel, B.P., Hall, N.E., Cooke, I.R., et al., 2014. De novo characterisation of the greenlip abalone transcriptome (Haliotis laevigata) with a focus on the heat shock protein 70 (HSP70) family. Mar. Biotechnol. 17, 23-32.
Shi, M., Lin, Y., Xu, G., Xie, L., Hu, X., Bao, Z., Zhang, R., 2013. Characterization of the Zhikong scallop (Chlamys farreri) mantle transcriptome and identification of biomineralization-related genes. Mar. Biotechnol. 15, 706-715.
Tatusov, R.L., Galperin, M.Y., Natale, D.A., Koonin, E.V., 2000. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 28(1), 33-36.
Tong, Y., Zhang, Y., Huang, J., Xiao, S., Zhang, Y., Li, J., Chen, J., Yu, Z., 2015. Transcriptomics analysis of Crassostrea hongkongensis for the discovery of reproduction-related genes. PLoS One 10, e0134280.
Uliano-Silva, M., Americo, J.A., Brindeiro, R., Dondero, F., Prosdocimi, F., Rebelo, M.D.F., 2014. Gene discovery through transcriptome sequencing for the invasive mussel Limnoperna fortunei. PLoS One 9, e102973.
Van Oosterhout, C., Hutchinson, W.F., Wills, D.P.M., Shipley, P., 2004. MICRO-CHECKER: software for identifying and correcting genotyping errors in microsatellite data. Mol. Ecol. Notes 4(3), 535-538.
Wang, S., Hou, R., Bao, Z.M., Du, H.X., He, Y., Su, H.L., Zhang, Y.Y., Fu, X.T., Jiao, W.Q., Li, Y., Zhang, L.L., Wang, S., Hu, X.L., 2013. Transcriptome sequencing of Zhikong scallop (Chlamys farreri) and comparative transcriptomic analysis with Yesso scallop (Patinopecten yessoensis). PLoS One 8(5), e63927.
Wang, W., Hui, J.H.L., Chan, T.F., Chu, K.H., 2014. De novo transcriptome sequencing of the snail Echinolittorina malaccana: identification of genes responsive to thermal stress and development of genetic markers for population studies. Mar. Biotechnol. 16, 547-559.
Wang, Y., Guo, X., 2007. Development and characterization of EST-SSR markers in the eastern oyster Crassostrea virginica. Mar. Biotechnol. 9, 500-511.
Wang, Y., Wang, A., Guo, X., 2010. Development and characterization of polymorphic microsatellite markers for the northern quahog Mercenaria mercenaria (Linnaeus, 1758). J. Shellfish Res. 29, 77-82.
Werner, G.D.A., Gemmell, P., Grosser, S., et al., 2013. Analysis of a deep transcriptome from the mantle tissue of Patella vulgata Linnaeus (Mollusca: Gastropoda: Patellidae) reveals candidate biomineralising genes. Mar. Biotechnol. 15, 230-243.
Whyte, J.N.C., Bourne, N., Hodgson, C.A., 1990a. Nutritional condition of rock scallop, Crassadoma gigantea (Gray), larvae fed mixed algal diets. Aquaculture 86, 25-40.
Whyte, J.N.C., Bourne, N., Ginther, N.G., 1990b. Biochemical and energy changes during embryogenesis in the rock scallop Crassadoma gigantea. Mar. Biol. 106, 239-244.
Zhang, G.F., Fang, X.D., Guo, X.M., Li, L., Luo, R.B., Xu, F., Yang, P.C., Zhang, L.L., Wang, X.T., Qi, H.G., Xiong, Z.Q., Que, H.Y., Xie, Y.L., Holland, P.W.H., Paps, J., Zhu, Y.B., Wu, F.C., Chen, Y.X., Wang, J.F., Peng, C.F., Meng, J., Yang, L., Liu, J., Wen, B., Zhang, N., et al., 2012. The oyster genome reveals stress adaptation and complexity of shell formation. Nature 490, 49-54.
Zhang, L.L., Li, L., Zhu, Y.B., Zhang, G.F., Guo, X.M., 2014. Transcriptome analysis reveals a rich gene set related to innate immunity in the eastern oyster (Crassostrea virginica). Mar. Biotechnol. 16, 17-33.
Figure Captions
Fig. 1 The length distribution of unigenes and transcripts from the de novo assembly of the C. gigantea transcriptome.
Fig. 2 A five-way Venn diagram depicting the unique and overlapping unigenes showing homology to sequences in the PFAM, NT, NR, KOG, and GO databases.
Fig. 3 Functional annotations of the assembled unigene sequences based on Gene Ontology (GO). The GO analysis was performed at level 2 under the three main GO categories.
Fig. 4 KOG classification of the assembled unigene sequences. The y-axis indicates the number of unigenes in a specific functional cluster. Each category is indicated on the x-axis by a letter that is listed in the right column.
Fig. 5 Pathway assignment based on KEGG. (A) Cellular Processes (B) Environmental Information Processing (C) Genetic Information Processing (D) Metabolism (E) Organismal Systems.
Table 1 Characteristics of 12 PCR-validated microsatellite markers developed for C. gigantea
Locus  Gene ID              Repeat motif  Primer sequence (5'-3')    Ta (ºC)  Number of alleles  Size range (bp)  Ho     HE     P value
Cdg01  Cluster-21242.28813  (CTG)7        F:CGAGCTGGACTTCACAGGTT     60       3                  260-281          0.280  0.254  1.0000
                                          R:GCTAGTCTGGCAGCCAATCT
Cdg54  Cluster-20101.6      (GTTG)5       F:CTCAACAGACTGGTAGGTCCA    56       5                  234-266          0.923  0.723  0.1579
                                          R:ACTGACACGCAACACATCCT
Cdg55  Cluster-20101.4      (GTTG)5       F:TCAGCAGACTGGTAGGTCCA     60       5                  242-260          0.821  0.745  0.8546
                                          R:ACTGACACACAACACATCCTT
Cdg56  Cluster-21242.18094  (GTTC)6       F:TTCCGTCCGCTTCTCTAGGA     56       3                  150-182          0.360  0.316  1.0000
                                          R:TACTAGAACGGCACTTCGCG
Cdg57  Cluster-21242.33855  (ATTG)5       F:ACGTTACAACCATCCACACA     60       4                  92-104           0.440  0.411  0.8245
                                          R:CGTCGAACGATCAATCAAGCA
Cdg59  Cluster-21242.28287  (TGTA)5       F:TTGGAAGCAGGCCAGGTCTA     54       3                  250-330          0.033  0.267  0.0000*
                                          R:ACAAGCTGGCATGTGAATATTCA
Cdg60  Cluster-19933.0      (ATTT)5       F:TCCTCCTCGTTCTGTCAGACT    60       5                  258-290          0.346  0.664  0.0000*
                                          R:TGCCACCACTTCACTTGTGT
Cdg61  Cluster-31365.0      (ATTG)5       F:ACGTCCGGACAAGATGGATT     57       4                  232-300          0.433  0.597  0.0259
                                          R:ACATTCGCCATGCCTCGTAT
Cdg90  Cluster-21242.17985  (AGG)6        F:AGCTTAACCACACTGCTGCT     60       3                  229-241          0.519  0.577  0.3345
                                          R:TGAGCTGGACACTGGGTAGA
Cdg91  Cluster-21242.1498   (CTG)6        F:TCCTGAATGACGTCCATGGC     60       9                  170-260          0.808  0.820  0.1877
                                          R:ACCAGGGTCAACAAGGTCAG
Cdg92  Cluster-21242.13588  (ATT)5        F:TGCACAGAGTATTGGTAACGGT   60       5                  206-275          0.321  0.537  0.0062
                                          R:GACACCTGAACTGGGTCACA
Cdg98  Cluster-21242.29945  (GAG)5        F:GTGGTGGTGGTGGTCCTATG     60       4                  191-232          0.667  0.588  0.0199
                                          R:CATCTGATCGTGGTCGCTCA
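For readers unfamiliar with the Ho and HE columns of Table 1, the sketch below shows how observed and expected heterozygosity are conventionally computed for one locus from diploid genotypes. The genotype list is hypothetical and is not data from this study, and the uncorrected formula HE = 1 - Σp² is an assumption; the manuscript does not state which estimator the population genetics software applied.

```python
from collections import Counter

def heterozygosity(genotypes):
    """Return (Ho, He) for one locus from a list of diploid genotypes (allele_a, allele_b)."""
    n = len(genotypes)
    # Observed heterozygosity: fraction of individuals carrying two different alleles.
    ho = sum(1 for a, b in genotypes if a != b) / n
    # Allele counts over all 2n gene copies.
    counts = Counter(allele for g in genotypes for allele in g)
    total = 2 * n
    # Expected heterozygosity under Hardy-Weinberg proportions (no sample-size correction).
    he = 1.0 - sum((c / total) ** 2 for c in counts.values())
    return ho, he

# Hypothetical genotypes (allele sizes in bp) for ten individuals at one locus.
sample = [(260, 260), (260, 269), (269, 269), (260, 281), (260, 260),
          (269, 281), (260, 260), (260, 269), (260, 260), (269, 269)]
ho, he = heterozygosity(sample)
print(round(ho, 3), round(he, 3))
```

A large deficit of Ho relative to HE at a locus is the pattern that null alleles would be expected to produce.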
Table 2 Summary of sequencing data of the C. gigantea transcriptome
Category                           Number/length
Raw reads                          80,832,518
Clean reads                        77,306,198
Average size of clean reads (bp)   150
Clean bases (Gb)                   11.6
Error (%)                          0.02
GC content (%)                     42.12
Q30 percentage (%)                 92.03
Total Nucleotides                  160,144,398
Transcripts                        158,855
  N50 length (bp)                  1995
  N90 length (bp)                  354
  Mean length (bp)                 1008
Unigenes                           96,007
  N50 length (bp)                  2298
  N90 length (bp)                  653
  Mean length (bp)                 1479
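The N50 values in Table 2 follow the usual assembly convention: the sequence length at which the cumulative length of sequences, sorted from longest to shortest, first reaches half of the total. The sketch below illustrates that definition on a hypothetical list of lengths (the assembler's exact implementation is an assumption) and also checks that the mean transcript length reported in Table 2 agrees with the total nucleotides divided by the transcript count.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # N50 is the length at which the cumulative sum first reaches half the total.
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")

# Hypothetical sequence lengths (bp), not data from the study.
toy_n50 = n50([100, 200, 300, 400, 500])

# Consistency check using values from Table 2: total nucleotides / number of transcripts.
mean_transcript_length = round(160_144_398 / 158_855)

print(toy_n50, mean_transcript_length)
```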
Table 3 Annotation of unigenes of the C. gigantea transcriptome against seven public databases.
Annotated databases  Number of Unigenes  ≤500  500-2000  ≥2000
NR                   36938               2117  17526     17295
NT                   6802                457   2290      4055
KEGG                 8074                381   3580      4113
SwissProt            28533               1179  12696     14658
PFAM                 35524               2408  16680     16436
GO                   35575               2420  16713     16442
KOG                  15682               482   6487      8713
All                  3386
At least one         45389
Total Unigenes       96007
Table 4 Distribution of C. gigantea unigene annotations among species of the class Bivalvia.
Bivalve species            The number of genes
Tegillarca granosa         6
Sinonovacula constricta    3
Scrobicularia plana        1
Saccostrea kegaki          1
Ruditapes philippinarum    4
Placopecten magellanicus   13
Pinctada maxima            1
Pinctada martensii         23
Pinctada margaritifera     5
Pinctada fucata            64
Pecten maximus             18
Ostrea edulis              11
Nodipecten subnodosus      3
Mytilus trossulus          5
Mytilus galloprovincialis  164
Mytilus edulis             6
Mytilus coruscus Mizuhopecten yessoensis Mimachlamys nobilis
Meretrix meretrix Hyriopsis schlegelii Hyriopsis cumingii Cristaria plicata
Crassostrea virginica
Littorina littorea
6
68 34 8 24 5 9 1 1
Crassostrea hongkongensis  6
Crassostrea gigas          26054
Crassostrea ariakensis     20
Crassostrea angulata       4
Corbicula sp.              1
Corbicula japonica         2
Azumapecten farreri        326
Atrina rigida              1
Argopecten irradians       77
Table 5 Top 16 KEGG pathways mapped to the transcriptome unigenes.
Rank  KEGG Pathway                      Pathway ID  Gene Number
1     Focal adhesion                    ko04510     395
2     Endocytosis                       ko04144     368
3     Rap1 signaling pathway            ko04015     345
4     Purine metabolism                 ko00230     329
5     Oxytocin signaling pathway        ko04921     325
6     Lysosome                          ko04142     309
7     cAMP signaling pathway            ko04024     305
8     Ubiquitin mediated proteolysis    ko04120     303
9     Ras signaling pathway             ko04014     301
10    MAPK signaling pathway            ko04010     299
11    Calcium signaling pathway         ko04020     284
12    PI3K-Akt signaling pathway        ko04151     280
13    Regulation of actin cytoskeleton  ko04810     278
14    Insulin signaling pathway         ko04910     268
15    cGMP-PKG signaling pathway        ko04022     265
16    Spliceosome                       ko03040     260
Table 6 Summary of the SSR search results in the C. gigantea transcriptome
Item                                             Number
Total number of sequences examined               96007
Total size of examined sequences (bp)            141948608
Total number of identified SSRs                  24795
Number of SSR containing sequences               19182
Number of sequences containing more than 1 SSR   4148
Number of SSRs present in compound formation     1113
Table 7 Summary of simple sequence repeat (SSR) types in the C. gigantea transcriptome
              Repeat Numbers
Motif Length  5     6     7     8    9    10   11   12  16  Number  Percentage (%)
Di-           0     2729  1293  715  426  258  169  11  1   5602    72.52
Tri-          1069  431   364   56   3    1    0    0   0   1924    24.91
Tetra-        143   39    2     0    0    0    2    0   0   186     2.41
Penta-        6     2     0     1    0    0    0    0   0   9       0.12
Hexa-         4     0     0     0    0    0    0    0   0   4       0.05
Total         1222  3201  1659  772  429  259  171  11  1   7725
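The row totals, column totals, and percentages in Table 7 can be recomputed from the per-cell counts. The short Python sketch below does so with the counts transcribed from the table; the variable names are illustrative only.

```python
# Counts per repeat number (5, 6, 7, 8, 9, 10, 11, 12, 16), transcribed from Table 7.
counts = {
    "Di-":    [0, 2729, 1293, 715, 426, 258, 169, 11, 1],
    "Tri-":   [1069, 431, 364, 56, 3, 1, 0, 0, 0],
    "Tetra-": [143, 39, 2, 0, 0, 0, 2, 0, 0],
    "Penta-": [6, 2, 0, 1, 0, 0, 0, 0, 0],
    "Hexa-":  [4, 0, 0, 0, 0, 0, 0, 0, 0],
}

# Total SSRs per motif length (the "Number" column).
row_totals = {motif: sum(row) for motif, row in counts.items()}
# Total SSRs per repeat number (the "Total" row).
column_totals = [sum(col) for col in zip(*counts.values())]
grand_total = sum(row_totals.values())
# Share of each motif length among all SSRs in the table, in percent.
percentages = {motif: round(100 * n / grand_total, 2) for motif, n in row_totals.items()}

print(row_totals)
print(column_totals)
print(grand_total, percentages)
```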
Abbreviation list
Simple sequence repeat (SSR)
Gene Ontology (GO)
Kyoto Encyclopedia of Genes and Genomes (KEGG)
Heat shock proteins (HSP)
Toll-like receptor (TLR)
Highlights
• This is the first transcriptome analysis for the rock scallop Crassadoma gigantea.
• 96,007 unigenes were assembled and annotated by public function and pathway databases.
• 24,795 microsatellites were detected and 42 loci of them were validated in amplification experiments.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5