De novo assembly, gene annotation, and marker development using Illumina paired-end transcriptome sequencing in the Crassadoma gigantea


Shanmao Cao, Lijie Zhu, Hongtao Nie, Minghao Yin, Gang Liu, Xiwu Yan

PII: S0378-1119(18)30257-9
DOI: doi:10.1016/j.gene.2018.03.019
Reference: GENE 42641

To appear in: Gene

Received date: 6 November 2017
Revised date: 13 February 2018
Accepted date: 6 March 2018

Please cite this article as: Shanmao Cao, Lijie Zhu, Hongtao Nie, Minghao Yin, Gang Liu, Xiwu Yan, De novo assembly, gene annotation, and marker development using Illumina paired-end transcriptome sequencing in the Crassadoma gigantea. The address for the corresponding author was captured as affiliation for all authors. Please check if appropriate. Gene (2017), doi:10.1016/j.gene.2018.03.019

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT

Running title: Transcriptome sequencing in Crassadoma gigantea

Shanmao Cao a, Lijie Zhu a, Hongtao Nie a*,1, Minghao Yin b, Gang Liu a, Xiwu Yan a

a College of Fisheries and Life Science, Engineering and Technology Research Center of Shellfish Breeding in Liaoning Province, Dalian Ocean University, Dalian, 116023, China

b Dalian City Oceanic and Fishery Administration, 100000 Dalian, China

* Corresponding author. Tel.: +86-411-84763083. E-mail address: [email protected] (H. Nie).

1 Present address: Department of Anatomy, Physiology, and Pharmacology, College of Veterinary Medicine, Auburn University, Auburn, Alabama 36849-5519, USA.

Abstract

Crassadoma gigantea is an important commercial marine bivalve species in Baja California and Mexico. In this study, we applied RNA-Seq technology to profile the transcriptome of C. gigantea for the first time. A total of 80,832,518 raw reads were produced on an Illumina HiSeq4000 platform, and 77,306,198 (95.64%) clean reads were generated after trimming the adaptor sequences. The transcriptome was assembled into 158,855 transcripts with an N50 size of 1995 bp and an average size of 1008 bp. A number of DNA repair related genes, such as MSH3, and genes encoding different groups of growth factors, such as EGF, TGF, IGF, and FGF, were found in the transcriptome data of C. gigantea. In addition, immune-related Toll-like receptor (TLR) genes, including TLR1, TLR2, TLR3, TLR4, TLR5, TLR6, TLR7, TLR8, and TLR9, were also observed in C. gigantea. A set of 12 polymorphic microsatellite loci was developed and characterized in C. gigantea for the first time. The results show that the number of alleles and expected heterozygosity ranged from 3 to 9 and from 0.254 to 0.820, respectively. The average polymorphic information content was 0.790. These microsatellite loci will facilitate future studies of population structure and conservation genetics in this species.

Keywords: Crassadoma gigantea, Transcriptome, Simple sequence repeats, Functional related genes

1. Introduction

The rock scallop, Crassadoma gigantea (Gray 1825) (formerly Hinnites multirugosus), is a prized food item among North American west coast sport divers (Culver et al., 2006). This species is mainly distributed from Alaska in the north, through Canada and the United States (including California), to Mexico in the south (Cao et al., 2016a). It has many commercially important characteristics, including a large adductor muscle and a high market value for both live and half-shell products, and is thus considered an emerging prospect for marine aquaculture (Leighton et al., 1979; Culver et al., 2006). C. gigantea aquaculture was initiated in the early 1970s and has continued since then (Leighton et al., 1976; Olsen, 1984; Leighton, 1991; Bourne, 1991). However, the development of rock scallop culture has been constrained by the inability to produce commercial quantities of scallop seed and by the lack of cost-effective grow-out techniques (Bourne, 1991). In the past decades, considerable research has been conducted on this species, including reproductive biology (Lauren, 1982; Malachowski, 1988; Barber and Blake, 2006), biochemical physiology (Whyte et al., 1990b) and aquaculture ecology (Culver et al., 2006; Cao et al., 2017; Cary et al., 1981). There have been advances in the induction of spawning, cultivation of larvae, and juvenile production of rock scallop (Christopher et al., 1996; Cao et al., 2017). In addition, previous studies on C. gigantea have reported on filter-feeding rate (Cao et al., 2016b), nutritive condition (Whyte et al., 1990a), toxicity (Beitler, 1991), optimization of larval and spat collection (Christopher et al., 1996) and plasticity of attachment (Culver et al., 2006). However, functional gene analysis and molecular marker development in C. gigantea have not yet been reported due to the lack of genomic and transcriptome resources.

RNA sequencing (RNA-Seq) is an efficient high-throughput method that has been used to obtain large amounts of transcriptome sequence information, even for non-model organisms that lack a reference genome (Marioni et al., 2008; Feldmesser et al., 2014; Wang et al., 2014). The main advantages of RNA-Seq are its good repeatability, wide detection range, cost-effectiveness, and quantitative measurement of gene activity and expression (Nagalakshmi et al., 2008). Moreover, RNA-Seq is a powerful technique for the analysis of gene expression due to its higher sensitivity and specificity in comparison to microarrays, along with its ability to detect new genes, rare transcripts, alternative splice isoforms and novel SNPs which can be used for association studies (Marioni et al., 2008; Nielsen et al., 2011). Transcriptome sequences largely exclude non-coding DNA and contain a high percentage of functional information, which can help to reveal the molecular mechanisms of functional genes (Shiel et al., 2014; Werner et al., 2013; Zhang et al., 2014; Liu et al., 2015).

Meanwhile, transcriptome data can be used to identify microsatellite markers (Li et al., 2015). Compared with genomic simple sequence repeat (SSR) markers, SSRs developed using RNA-Seq technologies can help to identify candidate functional genes and increase the efficiency of marker assisted selection (Gupta and Rustgi, 2004; Wu et al., 2014). In recent years, RNA-Seq has been extensively employed for transcriptional analysis, novel gene discovery, and molecular marker development in a number of bivalve organisms (Hou et al., 2011; Artigaud et al., 2014; Deng et al., 2014; Uliano-Silva et al., 2014).

In the present study, RNA-Seq analyses were conducted in C. gigantea using Illumina sequencing technology to discover functionally relevant genes and to develop microsatellite markers in this economically important bivalve. To our knowledge, this is the first report on transcriptome analysis and marker development in C. gigantea. The sequencing data, transcript sequence information, and polymorphic SSR markers are valuable resources for further genetic and molecular studies of C. gigantea. This work provides a database of molecular resources for this taxon and can guide future studies related to non-model species.

2. Materials and methods

2.1. Sample collection

In March 2017, we collected C. gigantea samples from Hekou (Dalian, Liaoning Province, China); these were the offspring of introduced scallops provided by Canada Vancouver products. Offspring with a shell height of about 3 cm were used as experimental material in this study. The tissue samples (gill, adductor muscle, mantle, and visceral mass) were obtained from a one-year-old scallop. Each tissue sample was immediately frozen in liquid nitrogen after collection and stored at −80 °C until RNA extraction.

Total RNA was extracted separately from each tissue following the protocol previously described in Hu et al. (2006). The degradation and contamination of total RNA were monitored on 1% agarose gels. RNA purity was checked using the NanoPhotometer® spectrophotometer (IMPLEN, CA, USA). RNA concentration was measured using the Qubit RNA Assay Kit in a Qubit 2.0 Fluorometer (Life Technologies, CA, USA). RNA integrity was assessed using the RNA Nano 6000 Assay Kit of the Agilent Bioanalyzer 2100 system (Agilent Technologies, CA, USA). To develop and characterize polymorphic microsatellites derived from the C. gigantea transcriptome data, the cultured offspring of the wild population introduced from Canada were used to validate marker effectiveness and polymorphism. Thirty individuals were sacrificed for the microsatellite marker polymorphism screening analysis, and the adductor muscles were preserved in 100% ethanol until DNA extraction. Genomic DNA was extracted from the adductor muscles using the TIANamp Marine Animals DNA extraction kit (TIANGEN).

2.2. Library Preparation for Transcriptome Sequencing

A total of 1.5 μg RNA was used as input material for the RNA sample preparation. Equal amounts of RNA from gill, adductor muscle, mantle, and visceral mass were mixed to construct one transcriptome library. The sequencing library was generated using the NEBNext Ultra™ RNA Library Prep Kit for Illumina (NEB, USA). The mRNA was purified using poly-T oligo-attached magnetic beads, and fragmentation was carried out using divalent cations under elevated temperature in NEBNext First Strand Synthesis Reaction Buffer (5X). First strand cDNA was synthesized using random hexamer primers and M-MuLV Reverse Transcriptase (RNase H-). Second strand cDNA synthesis was subsequently performed using DNA Polymerase I and RNase H. Remaining overhangs were converted into blunt ends via exonuclease/polymerase activities. After adenylation of the 3' ends of the DNA fragments, NEBNext Adaptor with a hairpin loop structure was ligated to prepare for hybridization. The library fragments were purified with the AMPure XP system (Beckman Coulter, Beverly, USA) to select for an insert size of about 150~200 bp. Then 3 μl USER Enzyme (NEB, USA) was used with the size-selected, adaptor-ligated cDNA at 37 °C for 15 min, followed by 5 min at 95 °C, before PCR. Finally, PCR products were purified and library quality was assessed on the Agilent Bioanalyzer 2100 system.

2.3. Quality control, Analysis and Assembly

Raw data (raw reads) in fastq format were first processed through in-house Perl scripts. The raw sequencing reads were filtered in order to generate high-quality reads for de novo assembly (Kang et al., 2017). In this step, clean data (clean reads) were obtained by removing reads containing adapters, reads containing poly-N, and low-quality reads from the raw data. At the same time, the Q20, Q30, GC content and sequence duplication level of the clean data were calculated. All downstream analyses were based on the high-quality clean data. The clean reads were assembled de novo with Trinity, an efficient and robust transcriptome assembler for RNA-Seq data developed jointly by the Broad Institute and the Hebrew University of Jerusalem (Grabherr et al., 2011).

2.4. Functional annotation

Functional annotation of the unigenes was conducted by sequence similarity comparisons against the nonredundant nucleotide database and the nonredundant protein database of NCBI (http://www.ncbi.nlm.nih.gov/) with BLASTx (E-value cutoff ≤ 1e−5), as well as the SWISS-PROT database (European Bioinformatics Institute, ftp://ftp.ebi.ac.uk/pub/databases/swissprot/), the eukaryotic Clusters of Orthologous Groups of proteins database (KOG) using diamond v0.8.22 (E-value cutoff ≤ 1e−3) (Tatusov et al., 2000), and the Kyoto Encyclopedia of Genes and Genomes database (KEGG) (Kanehisa et al., 2004) with BLASTx (E-value cutoff ≤ 1e−10). Moreover, functional assignments of the unigenes were further annotated with Gene Ontology (GO) terms using Blast2GO v2.5 (E-value cutoff ≤ 1e−6) (Ashburner et al., 2000). BLAST assignments were conducted against Nt using NCBI blast 2.2.28+ and against Nr using diamond v0.8.22 (E-value cutoff ≤ 1e−5) (Deng et al., 2006). We used KOBAS software to test the statistical enrichment of differentially expressed genes in KEGG pathways (Mao et al., 2005).
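The E-value cutoffs described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study; it assumes that hits are available in the standard 12-column tabular output produced by BLAST and diamond, and the function name `best_hits` and the toy records are hypothetical.

```python
import csv
import io

def best_hits(handle, evalue_cutoff=1e-5):
    """Keep the highest-bitscore hit per query among hits passing the E-value cutoff.

    Assumes the standard 12-column tabular layout:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 12 or row[0].startswith("#"):
            continue  # skip blank, short, or comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > evalue_cutoff:
            continue  # hit is weaker than the chosen cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best

# Hypothetical toy hits (IDs and scores are invented for illustration).
EXAMPLE = (
    "unigene1\tprotA\t85.0\t200\t30\t0\t1\t600\t1\t200\t1e-50\t300\n"
    "unigene1\tprotB\t60.0\t180\t72\t0\t1\t540\t1\t180\t1e-20\t120\n"
    "unigene2\tprotC\t35.0\t90\t58\t1\t1\t270\t5\t94\t0.5\t30\n"
)

annotated = best_hits(io.StringIO(EXAMPLE), evalue_cutoff=1e-5)
print(annotated)
```

A stricter or looser cutoff (for example 1e−10 for KEGG or 1e−3 for KOG, as in the text) would be passed through the `evalue_cutoff` argument.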

2.5. SSR polymorphism validation

MIcroSAtellite (MISA) (http://pgrc.ipk-gatersleben.de/misa/) was used to identify repetitive elements in the assembled transcriptome of C. gigantea. We focused on SSRs with motifs of 2–6 nucleotides and a minimum of six contiguous repeat units. PCR primers for each microsatellite locus were designed using the Primer 3 software and tested on 30 wild individuals of C. gigantea sampled from Hekou, Dalian. PCR amplifications were carried out in a 10 μl reaction volume containing about 100 ng genomic DNA, 1× PCR buffer (TaKaRa), 0.2 mM dNTPs, 1.5 mM MgCl2, 1 μM of each primer and 0.25 U Taq DNA polymerase (TaKaRa). PCR amplification was performed on a gradient thermal cycler (Bio-Rad) with the following protocol: denaturation for 5 min at 94 °C; 30 cycles of 45 s at 94 °C, 45 s at the annealing temperature (Table 1), and 1 min at 72 °C; a final extension at 72 °C for 10 min; and storage at 4 °C. The amplification products were then resolved on 6% denaturing polyacrylamide gels and visualized by silver staining. A 10 bp DNA ladder (Invitrogen, San Diego, CA, USA) was used as a reference marker for allele size determination. The number of alleles and the observed (HO) and expected (HE) heterozygosities of these microsatellite loci were estimated using the Microsatellite Analyzer software (Dieringer and Schlotterer, 2003). Tests for linkage disequilibrium and for deviations from Hardy-Weinberg equilibrium (HWE) were carried out with GENEPOP 4.0 (Raymond and Rousset, 1995; Rousset, 2008). Null alleles were checked with the MICRO-CHECKER 2.2.3 program (van Oosterhout et al., 2004).
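The SSR search criteria stated above (motifs of 2–6 nucleotides, at least six contiguous repeat units) can be sketched with a regular expression. This is only an illustration of the criteria, not a reimplementation of MISA; the function name `find_ssrs` and the test sequence are hypothetical, and compound or interrupted repeats are not handled.

```python
import re

MIN_REPEATS = 6  # minimum number of contiguous repeat units, as in the text

# A 2-6 nt motif (shortest unit tried first) followed by at least
# MIN_REPEATS - 1 further copies of the same motif.
SSR_PATTERN = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (MIN_REPEATS - 1))

def find_ssrs(sequence):
    """Return (start, motif, repeat_count) for each perfect SSR in the sequence."""
    ssrs = []
    for match in SSR_PATTERN.finditer(sequence.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # a homopolymer run such as AAAA..., not a 2-6 nt motif
        repeats = len(match.group(0)) // len(motif)
        ssrs.append((match.start(), motif, repeats))
    return ssrs

# Hypothetical example: a dinucleotide SSR flanked by two bases on each side.
print(find_ssrs("GG" + "AT" * 6 + "CC"))
```

Because the motif group is matched lazily, a run such as ATATATATATAT is reported with the dinucleotide unit AT rather than with a longer unit such as ATAT.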

3. Results and discussion

3.1. Transcriptome sequencing and assembly

We pooled the visceral mass, gill, adductor muscle, and mantle tissues from rock scallops into a single sample for sequencing on the Illumina HiSeq™ platform. A total of 80,832,518 raw reads were produced and deposited in the NCBI SRA database (accession number PRJNA416200). A total of 77,306,198 (95.64%) clean reads were generated after removing the adaptor sequences, reads with ambiguous N nucleotides (>0.1%), and the low-quality sequences. These clean reads comprised 11.6 Gb of clean bases with a mean read length of 150 bp and were used for the subsequent analysis. The nucleotide analysis showed a GC content of 42.12% and a Q30 of 92.03% (Table 2). The transcriptome was assembled into 158,855 transcripts with an N50 size of 1995 bp and an average size of 1008 bp. The transcriptome library generated 96007 unigene sequences with an N50 size of 2298 bp and an average size of 1479 bp (Table 2). The size distributions of the assembled transcripts and unigenes are shown in Fig. 1. Among the transcripts, the largest fraction (30.58%) of the assembled sequences had sizes ≤ 301 bp. A total of 33923 (21.36%), 29720 (18.71%), 23984 (15.1%), and 22650 (14.26%) transcripts had sizes in the ranges of 301-500 bp, 501-1000 bp, 1001-2000 bp, and >2000 bp, respectively. Among the unigenes, 21531 (22.43%) sequences were no longer than 500 bp, 27878 (29.04%) were in the length range of 501 to 1000 bp, 23948 (24.94%) ranged from 1001 to 2000 bp, and 22650 (23.59%) were longer than 2000 bp. Tissue samples from one individual, including gill, adductor muscle, mantle and visceral mass, were pooled into a single mixed cDNA library for sequencing. The N50 length (2298 bp) of the 96,007 assembled unigene sequences was larger than that reported for many other shellfish species. For example, the N50 length was 356 bp for the South African abalone Haliotis midae (Franchini et al., 2011), 486 bp for the pearl oyster Pinctada maxima (Deng et al., 2014), and 655 bp for Chlamys farreri (Shi et al., 2013; Chen et al., 2016).
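For readers unfamiliar with the N50 statistic used above, the following Python sketch shows how it and the mean length are usually computed from a list of sequence lengths: N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembled bases. The function names and the toy lengths are hypothetical and are not taken from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

def mean_length(lengths):
    """Return the arithmetic mean of the sequence lengths."""
    return sum(lengths) / len(lengths)

# Hypothetical toy assembly with nine sequences (total 54 bases).
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_lengths), mean_length(toy_lengths))
```

In this toy case the three longest sequences (10 + 9 + 8 = 27) reach half of the 54 total bases, so the N50 is 8 while the mean length is 6.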

3.2. Assembly evaluation and annotation

Unigenes of 500 to 2000 bp and ≥ 2000 bp in length showed higher match percentages than those ≤ 500 bp (Table 3). The assembly-derived unigenes were searched against public protein and nucleotide databases for validation and annotation. The summary statistics for BLAST assignment against the NR, NT, KEGG, SwissProt, PFAM, KOG, and GO databases are shown in Table 3. Out of 96007 unigenes, 36938 (38.47%), 35575 (37.05%), 35524 (37%), and 28533 (29.71%) unigenes showed annotation hits to homologous sequences in the NR, GO, PFAM and SwissProt databases, respectively. In addition, 6802, 8074 and 15682 unigenes showed homology matches in the NT, KEGG and KOG databases, respectively. In all, a total of 45,389 (47.27%) assembled unigene sequences showed annotation hits to at least one of the protein and nucleotide databases, and 3386 unigenes (3.52%) showed annotation hits to all databases. There are several possible reasons for the unannotated sequences. They may represent short sequences lacking a conserved protein domain. The transcriptome may also contain incompletely spliced introns, orphaned untranslated regions (UTRs), non-coding genes, and random transcriptional noise that may not show homology to available sequences in the public databases. An absence of annotation may also reflect genes expressed at low levels or not expressed at the time of RNA extraction (Kang et al., 2017).
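The "at least one database" and "all databases" counts reported above are a set union and a set intersection over the per-database hit lists. The Python sketch below illustrates that bookkeeping under the assumption that each database's annotated unigene IDs are available as a set; the function name `annotation_summary` and the toy IDs are hypothetical.

```python
def annotation_summary(hits_by_db, total_unigenes):
    """Map each database (plus 'at least one' and 'all') to (count, percent).

    hits_by_db maps a database name to the set of unigene IDs with a hit in it.
    """
    id_sets = list(hits_by_db.values())
    in_any = set.union(*id_sets)          # annotated in at least one database
    in_all = set.intersection(*id_sets)   # annotated in every database

    def pct(count):
        return round(100.0 * count / total_unigenes, 2)

    summary = {db: (len(ids), pct(len(ids))) for db, ids in hits_by_db.items()}
    summary["at least one"] = (len(in_any), pct(len(in_any)))
    summary["all"] = (len(in_all), pct(len(in_all)))
    return summary

# Hypothetical toy data: 10 unigenes in total, three databases.
toy_hits = {
    "NR": {"u1", "u2", "u3", "u4"},
    "GO": {"u2", "u3", "u5"},
    "KOG": {"u3"},
}
print(annotation_summary(toy_hits, total_unigenes=10))
```

Applying the same percentage calculation to the figures in the text, 36938 of 96007 unigenes corresponds to 38.47% for the NR database.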

A five-way Venn diagram (Fig. 2) shows the representation of assembled unigenes annotated against the NT, NR, GO, PFAM and KOG databases. A total of 4349 unigenes were concurrently annotated by all five databases, while 12935 unigenes had homologous sequences in the GO, NR and PFAM databases. In addition, 9073 unigenes were annotated in these three databases and KOG, and 7912 unigenes were annotated in both the GO and PFAM databases. Because the NR database yielded the most annotation hits, we examined the homology characteristics of the BLASTX hits for the unigenes annotated in this database. The E-value distribution indicated that 30.8% of the annotated unigenes had an E-value in the range of 0 to 1E−100, followed by 18.3% of unigenes with an E-value in the range of 1E−100 to 1E−60 (Supplementary Fig. 1A). The similarity distribution shows that 48% of the unigenes had a best-hit similarity of 60-80%, 32.1% had a similarity of 40-60% and 17.3% had a similarity of 80-95% (Supplementary Fig. 1B). When compared with other species, 26975 C. gigantea unigenes had hits to 34 species (Table 4). Among these hit species, Crassostrea gigas was the most predominant (26054, 96.59%) (Zhang et al., 2012). Other top hit species with sufficient genomic resources included Azumapecten farreri (326, 1.21%), Mytilus galloprovincialis (164, 0.6%), Argopecten irradians (77, 0.29%), Mizuhopecten yessoensis (68, 0.25%), Pinctada fucata (64, 0.24%), Mimachlamys nobilis (34, 0.13%), Pinctada martensii, Littorina littorea (23, 0.09%), and C. ariakensis (20, 0.07%).

3.3. Unigenes annotation and Gene Ontology (GO) classification

The unigenes annotated in the GO database (35575) were classified in terms of their associated biological process, cellular component and molecular function. Details of the distribution of the unigenes in the main ontologies are shown in Fig. 3. Among the annotated unigenes, 100868, 59792, and 41990 sequences were classified under biological process, cellular component, and molecular function, respectively. In addition, 8116, 2491, and 1373 unigenes were associated with both biological process and molecular function, both biological process and cellular component, and both cellular component and molecular function, respectively. About 13816 annotated unigenes were associated with all three GO categories. This suggests that there is a strong overlap of associated functions for unigenes under the biological process and molecular function categories. The top GO term assignments for biological process were cellular process (GO:0009987, 20348), metabolic process (GO:0008152, 16847), and single organism process (GO:0044699, 16718). Other significant GO terms under this category included biological regulation (GO:0065007, 8777), regulation of biological process (GO:0050789, 8371), localization (GO:0051179, 6691), response to stimulus (GO:0050896, 6506), cellular component organization or biogenesis (GO:0071840, 3946), and signaling (GO:0023052, 4543). The dominant terms in the cellular component category were cell part (GO:0044464, 10718) and cell (GO:0005623, 10720), followed by membrane (GO:0016020, 7435), organelle (GO:0043226, 7234), membrane part (GO:0044425, 6948), and macromolecular complex (GO:0032991, 6762). Under molecular function, binding (GO:0005488, 20261) was the most dominant term, followed by catalytic activity (GO:0003824, 13894). Only a few genes were assigned to antioxidant activity (GO:0016209, 89), metallochaperone activity (GO:0016530, 13), transcription factor activity, protein binding (GO:0000988, 360), and structural molecule activity (GO:0005198, 819) in the molecular function category. The sequence and annotation information from the GO annotations provides valuable gene resources for studying the molecular basis that underlies the economic traits of C. gigantea. Transcripts putatively involved in growth (GO:0040007) and reproduction (GO:0000003) were found in our RNA-Seq database (Supplementary Table 1); such transcripts were also found in Patinopecten yessoensis (Hou et al., 2011).

3.4. Genes involved in immune and stress responses

In this study, GO analysis identified 1699 transcripts involved in cellular responses to environmental pressure and stimuli. The stress responses (GO:0006950) included responses to oxidative stress, starvation, defense response, metalloid, and wounding; the stimulus responses included responses to external, endogenous, abiotic, and biotic stimuli. A number of DNA repair related genes, such as MSH3 (Hall et al., 1996; Marsischky et al., 1996), and genes encoding different groups of growth factors, such as EGF, TGF, IGF, and FGF (Daub et al., 1996; Wang et al., 2013), were found in the transcriptome data of C. gigantea. In addition, immune-related Toll-like receptor (TLR) genes, including TLR1, TLR2, TLR3, TLR4, TLR5, TLR6, TLR7, TLR8, and TLR9, were also observed in C. gigantea. TLR1, 2, 3, 5, 7, 9, 21, and 22 have been identified in Takifugu rubripes, among which TLR1 was identified as an immune-enriched gene in gill and was successfully cloned (Cui et al., 2014). A novel toll-like receptor gene called CnTLR-1 was cloned and its transcript levels under different challenges were determined (Lu et al., 2016). Heat shock proteins (HSP) are a family of proteins that play an important role in modulating the response to stressors such as temperature, salinity, pH, hypoxia, aerial exposure, heavy metals, and bacterial infection (Zhang et al., 2012; Nie et al., 2017). Under such stress, HSP expression levels increase significantly, enabling organisms to resist the insults of adverse stressors so as to maintain the homeostasis and survival of cells (Zhang et al., 2012). In this study, members of the HSP family, including HSP40, HSP60, HSP70, HSP90, CaRHSP1 and HSP75, were found in C. gigantea.

3.5. KOG and KEGG classification

Based on a conserved domain alignment, all the unigenes were searched against the KOG database to classify and functionally predict the orthologous gene products. In the KOG classification analysis, 15682 unigenes were assigned to 25 protein families (Fig. 4), among which signal transduction mechanisms (2732 unigenes, 17.42%) represented the largest annotated group, followed by general function prediction (2558 unigenes, 16.31%), posttranslational modification, protein turnover, chaperones (1556 unigenes, 9.92%), function unknown (1168 unigenes, 7.45%) and intracellular trafficking, secretion, and vesicular transport (1018 unigenes, 6.49%). The smallest KOG groups to which the unigenes were mapped were unnamed protein (5), cell motility (50), nuclear structure (99), and defense mechanisms (150 unigenes, 0.96%). KEGG is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies. We found that 14671 unigenes were assigned to 229 KEGG pathways (Fig. 5). The most predominant category was signal transduction (2002) from Environmental Information Processing, followed by endocrine system (1143) from Organismal Systems and transport and catabolism (1105) from Cellular Processes. Within Genetic Information Processing, the main categories were folding, sorting and degradation (746) and translation (662). Within Metabolism, the main categories were lipid metabolism (706), carbohydrate metabolism (662), amino acid metabolism (495), glycan biosynthesis and metabolism (444), and metabolism of cofactors and vitamins (336). In C. gigantea, the unigenes predicted under immune system process were related to 15 signaling pathways: chemokine signaling pathway, leukocyte transendothelial migration, Fc gamma R-mediated phagocytosis, toll-like receptor signaling pathway, T cell receptor signaling pathway, RIG-I-like receptor signaling pathway, NOD-like receptor signaling pathway, natural killer cell mediated cytotoxicity, B cell receptor signaling pathway, Fc epsilon RI signaling pathway, cytosolic DNA-sensing pathway, antigen processing and presentation, platelet activation, hematopoietic cell lineage, and complement and coagulation cascades. The signaling pathway information from the transcriptome data may help identify immune-related genes in C. gigantea. The number of unigenes in individual KEGG pathways ranged from 2 to 395. The top 16 pathways with the greatest number of genes are listed in Table 5. KEGG analysis identified transcripts potentially involved in responses to environmental pressure and stimuli. Overall, functional analysis of our RNA-Seq database identified candidate genes potentially involved in growth, reproduction, stress and immunity. Further experiments are needed to validate the functions and expression patterns of these candidate genes.

3.6. Characterization of SSRs and PCR primers

Screening of 23682 unigenes with the MISA software detected 24795 SSR sites distributed in 19182 unigenes, accounting for approximately 19.98% of the total unigenes. A total of 4148 unigenes contained more than one SSR (Table 6). Trinucleotide and dinucleotide repeats were the most frequent types of repeats in C. gigantea (Table 7). The SSRs identified in the Cristaria plicata (Patnaik et al., 2016b) and Crassostrea hongkongensis (Tong et al., 2015) transcriptomes showed a similar pattern, with dinucleotide repeat motifs being dominant. Trinucleotide and tetranucleotide repeats have been reported to be the most frequent in C. farreri and M. yessoensis (Wang et al., 2013). AT/AT was the dominant dinucleotide repeat motif, and AAT/ATT and ATC/ATG were the dominant trinucleotide repeat motifs.

In a comparison of the scallops C. farreri and M. yessoensis, ATC was the most abundant trinucleotide motif in both species, and it was also common in other bivalves (Wang et al., 2007). Ninety-seven microsatellite-containing sequences were selected for microsatellite marker optimization and validation on the basis of repeat number and flanking sequence suitability. We focused on SSRs with motifs of 2–6 nucleotides and a minimum of six contiguous repeat units. From each SSR type, we randomly selected a subset, giving 97 primer pairs comprising 22 dinucleotide, 66 trinucleotide, and 9 tetranucleotide repeats. Of the 97 potential microsatellite markers, 42 (43.3%) amplified the expected products, and the PCR products were separated on denaturing polyacrylamide gels. Among the 42 primer pairs, 12 revealed length polymorphisms in 30 C. gigantea samples (Table 1). This success rate (43.3%) was higher than that reported in the abalone Haliotis diversicolor diversicolor (28.3%) and lower than that in the clams Mercenaria mercenaria (50%) and Paphia textile (53.8%) (Chen et al., 2016; Wang et al., 2010; An et al., 2012). The percentage of polymorphic loci (28.6%) in C. gigantea was higher than that reported in the abalone (20.2%) and lower than that in the clam Ruditapes philippinarum (43.9%) (Nie et al., 2014). In the present study, the number of alleles per locus varied from 3 to 9, with an average of 3.25. The observed and expected heterozygosities ranged from 0.033 to 0.923 and from 0.254 to 0.820, with averages of 0.496 and 0.542, respectively. Two of the 12 microsatellite loci showed significant departure from HWE after a sequential Bonferroni correction (P < 0.01), which might be caused by the presence of null alleles and the small sample size (Pemberton et al., 1995). Traditional approaches for SSR discovery tend to be labor intensive, time consuming, or highly dependent on available resources. In this study, the transcriptome data provided a rich EST resource for mining genetic variants. The failure of some of the primer pairs to produce amplicons could be caused by primers spanning splice sites, chimeric primers, or poor-quality sequences. Because we identified many other SSRs in the transcriptome dataset, more PCR primers could be developed to validate them. The SSRs could be very useful in genetic polymorphism assessment, quantitative trait loci mapping, comparative genomics, and functional genomics studies.
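The per-locus statistics discussed above (number of alleles, observed and expected heterozygosity, polymorphic information content) can be illustrated with a short Python sketch. It uses the textbook definitions HE = 1 − Σ p_i² and PIC = 1 − Σ p_i² − Σ_{i<j} 2 p_i² p_j², without any small-sample correction, so it may differ slightly from values produced by the software used in this study. The function name `locus_stats` and the toy genotypes are hypothetical.

```python
from collections import Counter

def locus_stats(genotypes):
    """Summarize one microsatellite locus.

    genotypes: list of (allele1, allele2) tuples, one per diploid individual,
    where alleles are fragment sizes in bp.
    """
    n = len(genotypes)
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    freqs = [count / (2 * n) for count in counts.values()]

    # Observed heterozygosity: fraction of individuals carrying two different alleles.
    ho = sum(1 for a, b in genotypes if a != b) / n
    # Expected heterozygosity under Hardy-Weinberg proportions (uncorrected).
    he = 1.0 - sum(p * p for p in freqs)
    # PIC subtracts the contribution of uninformative heterozygote matings.
    cross = sum(
        2 * freqs[i] ** 2 * freqs[j] ** 2
        for i in range(len(freqs))
        for j in range(i + 1, len(freqs))
    )
    return {"alleles": len(counts), "Ho": ho, "He": he, "PIC": he - cross}

# Hypothetical genotypes for four individuals at a two-allele locus.
toy_genotypes = [(100, 100), (100, 104), (104, 104), (100, 104)]
print(locus_stats(toy_genotypes))
```

For this toy locus both alleles have frequency 0.5, so HO and HE are both 0.5 and the PIC is 0.375; by construction the PIC of a locus never exceeds its HE.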

4. Conclusion

In this study, we used the Illumina™ platform to sequence the transcriptome of C. gigantea, providing a valuable resource for future research on this species. Unigenes with putative functions in immunity, defense and stress response, including the TLRs, TGF, and HSPs, were identified. A total of 24795 SSRs were identified and characterized as potential molecular markers; 97 primer pairs were randomly selected to test amplification and polymorphism, and 12 loci were polymorphic. These twelve microsatellites enrich the molecular marker resources of C. gigantea and will be applicable to studies of genetic diversity, population structure, genetic linkage mapping and identification of quantitative trait loci, as well as to the future investigation of molecular markers for production and adaptive traits of individuals and populations.

Acknowledgments

The study was supported by grants from the project of Department of Ocean and Fisheries of Liaoning Province (201214), the project of Dalian City Oceanic and Fishery Administration (2012011), the Modern Agro-industry Technology Research System (CARS-49), the Cultivation Plan for Youth Agricultural Science and Technology Innovative Talents of Liaoning Province (2014004), and the Dalian Youth Science and Technology Star Project Support Program (2016RQ065).

References

An, H.S., Lee, J.W., Hong, S.W., 2012. Application of novel polymorphic microsatellite loci identified in the Korean Pacific Abalone (Haliotis diversicolor supertexta (Haliotidae)) in the genetic characterization of wild and released populations. Int. J. Mol. Sci. 13(9), 10750-10764.

Artigaud, S., Thorne, M.A.S., Richard, J., 2014. Deep sequencing of the mantle transcriptome of the great scallop Pecten maximus. Mar. Genomics 15, 3-4.

Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M., Sherlock, G., 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25-29.

Barber, B.J., Blake, N.J., 2006. Reproductive physiology. Developments in aquaculture and Fish. Sci. 35, 357-416.

Beitler, M.K., 1991. Toxicity of adductor muscles from the purple hinge rock scallop (Crassadoma gigantea) along the Pacific coast of North America. Toxicon 29(7), 889-894.

Bourne, N., 1991. West coast of North America. In: Shumway, S.E. (Ed.), Scallops: Biology, Ecology, and Aquaculture. Elsevier Science Publishers, New York, pp. 925-939.

Cao, S.M., Wang, H., Chen, W., 2016a. Analysis, evaluation and comparison of nutritive composition in rock scallop Crassadoma gigantea with three Chinese scallops. J. Dalian Ocean Univ. 31(5), 544-550.

Cao, S.M., Liang, W.F., Wang, J., Liu, G., 2016b. Fundamental research on filter-feeding rate of juvenile rock scallop. J. Dalian Ocean Univ. 31(6), 613-617.

Cao, S.M., Wang, J., Wang, Q., Liang, W.F., Yin, M.H., Liu, G., 2017. Artificial breeding of introduced rock scallop Crassadoma gigantea. J. Dalian Ocean Univ. 32(1), 1-6.

Cary, S.C., Leighton, D.L., Phleger, C.F., 1981. Food and feeding strategies in culture of larval and early juvenile purple-hinge rock scallops, Hinnites multirugosus (Gale). J. World Aquaculture Society 12(1), 156-169.

Chen, H., Lin, L., Xie, M., Zhang, G., Su, W., 2015. De novo sequencing and characterization of the Bradysia odoriphaga (Diptera: Sciaridae) larval transcriptome. Comp. Biochem. Physiol. Part D 16, 20-27.

Chen, X., Li, J., Xiao, S., Liu, X., 2016. De novo assembly and characterization of foot transcriptome and microsatellite marker development for Paphia textile. Gene 576, 537-543.

Christopher, M.P., Edwin, B., 1996. Settlement of larvae of the giant scallop, Placopecten magellanicus (Gmelin), on various artificial and natural substrata under hatchery-type conditions. Aquaculture 141(3-4), 201-221.

Cui, J., Liu, S.K., Zhang, B., Wang, H.D., Sun, H.J., Song, S.H., Qiu, X.M., Liu, Y., Wang, X.L., Jiang, Z.Q., Liu, Z.J., 2014. Transcriptome analysis of the gill and swimbladder of Takifugu rubripes by RNA-Seq. PLoS One 9(1), e85505.

Culver, C.S., Richards, J.B., Page, H.M., 2006. Plasticity of attachment in the purple-hinge rock scallop, Crassadoma gigantea: Implications for commercial culture. Aquaculture 254, 361-369.

Daub, H., Weiss, F.U., Wallasch, C., Ullrich, A., 1996. Role of transactivation of the EGF receptor in signalling by G-protein-coupled receptors. Nature 379(6565), 557.

Deng, Y., Lei, Q., Tian, Q., Xie, S., Du, X., Li, J., Wang, L., Xiong, Y., 2014. De novo assembly, gene annotation, and simple sequence repeat marker development using Illumina paired-end transcriptome sequences in the pearl oyster Pinctada maxima. Biosci. Biotechnol. Biochem. 78(10), 1685-1692.

Deng, Y.Y., Li, J.Q., Wu, S.F., Zhu, Y.P., Chen, Y.W., He, F.C., 2006. Integrated nr database in protein annotation system and its localization. Comput. Eng. 32, 71-74.

Dieringer, D., Schlotterer, C., 2003. Microsatellite analyser (MSA): a platform independent analysis tool for large microsatellite datasets. Mol. Ecol. Notes 3, 167-169.

Feldmesser, E., Rosenwasser, S., Vardi, A., et al., 2014. Improving transcriptome construction in non-model organisms: integrating manual and automated gene definition in Emiliania huxleyi. BMC Genomics 15, 148.

Franchini, P., Van der Merwe, M., Roodt-Wilding, R., 2011. Transcriptome characterization of the South African abalone Haliotis midae using sequencing-by-synthesis. BMC Res. Notes 4, 59.

Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., et al., 2011. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 29(7), 644-652.

Hall, D.B., Holmlin, R.E., Barton, J.K., et al., 1996. Oxidative DNA damage through long-range electron transfer. Nature 382(6593), 731.

Hou, R., Bao, Z., Wang, S., et al., 2011. Transcriptome sequencing and de novo analysis for Yesso scallop (Patinopecten yessoensis) using 454 GS FLX. PLoS One 6, e21560.

Hu, X.L., Bao, Z.M., Hu, J.J., Shao, M.Y., Zhang, L.L., 2006. Cloning and characterization of tryptophan 2,3-dioxygenase gene of Zhikong scallop Chlamys farreri (Jones and Preston 1904). Aquac. Res. 37, 1187-1194.

Kanehisa, M., Goto, S., Kawashima, S., Hattori, M., 2004. The KEGG resource for deciphering the genome. Nucleic Acids Res. 32, 277-280.

Kang, S.W., Bharat, B.P., Wang, H.J., 2017. Sequencing and de novo assembly of visceral mass transcriptome of the critically endangered land snail Satsuma myomphala: Annotation and SSR discovery. Comp. Biochem. Physiol. Part D 21, 77-89.

Lauren, D.J., 1982. Oogenesis and protandry in the purple-hinge rock scallop, Hinnites giganteus in upper Puget Sound, Washington, U.S.A. Canadian Journal of Zoology 60(10), 2333-2336.

Leighton, D.L., 1979. A growth profile for the rock scallop Hinnites multirugosus held at several depths off La Jolla, California. Mar. Biology 51(3), 229-232.

Leighton, D.L., 1991. Culture of Hinnites and related scallops on the American Pacific coast. In: Menzel, W. (Ed.), Estuarine and Marine Bivalve Mollusk Culture. CRC Press Inc, Boca Raton, Florida, pp. 100-110.

Li, H.J., Liu, M., Ye, S., Yang, F., 2017. De novo assembly, gene annotation, and molecular marker development using Illumina paired-end transcriptome sequencing in the clam Saxidomus purpuratus. Genes & Genomics, 1-11.

Liu, H., Zheng, H., Zhang, H., Deng, L., Liu, W., Wang, S., Meng, F., Wang, Y., Guo, Z., Li, S., Zhang, G., 2015. A de novo transcriptome of the noble scallop, Chlamys nobilis, focusing on mining transcripts for carotenoid-based coloration. BMC Genomics 16, 44.

Lu, Y., Zheng, H., Zhang, H., Yang, J., Wang, Q., 2016. Cloning and differential expression of a novel toll-like receptor gene in noble scallop Chlamys nobilis with different total carotenoid content. Fish & Shellfish Immunology 56, 229-238.

Malachowski, M., 1988. The reproductive cycle of the rock scallop Hinnites giganteus (Grey) in Humboldt Bay, California. J. Shellfish Res. 7, 341-348.

Mao, X., Cai, T., Olyarchuk, J.G., 2005. Automated genome annotation and pathway identification using the KEGG Orthology (KO) as a controlled vocabulary. Bioinformatics 21, 3787-3793. (KOBAS)

Marioni, J.C., Mason, C.E., Mane, S.M., Stephens, M., Gilad, Y., 2008. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 18(9), 1509-1517.

Marsischky, G.T., Filosi, N., Kane, M.F., et al., 1996. Redundancy of Saccharomyces cerevisiae MSH3 and MSH6 in MSH2-dependent mismatch repair. Genes & Development 10(4), 407-420.

McDonald, B.A., Bourne, N.F., 1989. Growth of the purple hinge rock scallop, Crassadoma gigantea Gray, 1825 under natural conditions and those associated with suspended culture. J. Shellfish Res. 8(1), 179-186.

Nagalakshmi, U., et al., 2008. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320, 1344-1349.

Ni, L.H., Li, Q., Kong, L.F., 2010. Isolation and characterization of 19 microsatellite markers from the Chinese surf clam (Mactra chinensis). Conserv. Genet. Resour. 2(1), 27-30.

Nie, H., Zhu, D., Yang, F., Zhao, L., Yan, X., 2014. Development and characterization of EST-derived microsatellite markers for Manila clam (Ruditapes philippinarum). Conserv. Genet. Resour. 6, 25-37.

Nie, H.T., Liu, L.H., Huo, Z.M., Chen, P., Ding, J.F., Yang, F., Yan, X., 2017. The HSP70 gene expression responses to thermal and salinity stress in wild and cultivated Manila clam Ruditapes philippinarum. Aquaculture 470, 149-156.

Nielsen, R., Paul, J.S., Albrechtsen, A., Song, Y.S., 2011. Genotype and SNP calling from next-generation sequencing data. Nat. Rev. Genet. 12(6), 443-451.

Olsen, S., 1984. Completion Report on Invertebrate Aquaculture Shellfish Enhancement Project. Point Whitney Shellfish Lab. Washington Department of Fisheries. 85 pp.

Patnaik, B.B., Wang, H.J., Kang, S.W., Park, S.Y., Wang, T.H., Park, E.B., Chung, J.M., Song, D.K., Kim, C., Kim, S., Lee, J.B., Jeong, H.C., Park, H.S., Han, Y.S., Lee, Y.S., 2015. Transcriptome characterization for non-model endangered Lycaenids, Protantigius superans and Spindasis takanosis, using Illumina HiSeq 2500 Sequencing. Int. J. Mol. Sci. 16, 29948-29970.

Patnaik, B.B., Wang, T.H., Kang, S.W., Wang, H.J., Park, S.Y., Park, E.B., Chung, J.M., Song, D.K., Kim, C., Kim, S., Lee, J.S., Han, Y.S., Park, H.S., Lee, Y.S., 2016b. Sequencing, de novo assembly, and annotation of the transcriptome of the endangered freshwater pearl bivalve, Cristaria plicata, provides novel insights into functional genes and marker discovery. PLoS One 11(2), e0148622.

Pemberton, J.M., Slate, J., Bancroft, D.R., Barrett, J.A., 1995. Non amplifying alleles at microsatellite loci: a caution for parentage and population studies. Mol. Ecol. 4, 249-252.

Raymond, M., 1995. GENEPOP: Population genetics software for exact tests and ecumenism. Vers. 1.2. J. Hered. 86, 248-249.

Richardson, M.F., Sherman, C.D.H., 2015. De novo assembly and characterization of the invasive Northern Pacific Seastar transcriptome. PLoS One 10, e0142003.

Rousset, F., 2008. GENEPOP'007: a complete re-implementation of the GENEPOP software for Windows and Linux. Mol. Ecol. Resour. 8, 103-106.

Shiel, B.P., Hall, N.E., Cooke, I.R., et al., 2014. De novo characterisation of the greenlip abalone transcriptome (Haliotis laevigata) with a focus on the heat shock protein 70 (HSP70) family. Mar. Biotechnol. 17, 23-32.

Shi, M., Lin, Y., Xu, G., Xie, L., Hu, X., Bao, Z., Zhang, R., 2013. Characterization of the Zhikong scallop (Chlamys farreri) mantle transcriptome and identification of biomineralization-related genes. Mar. Biotechnol. 15, 706-715.

Tatusov, R.L., Galperin, M.Y., Natale, D.A., Koonin, E.V., 2000. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 28(1), 33-36.

Tong, Y., Zhang, Y., Huang, J., Xiao, S., Zhang, Y., Li, J., Chen, J., Yu, Z., 2015. Transcriptomics analysis of Crassostrea hongkongensis for the discovery of reproduction-related genes. PLoS One 10, e0134280.

Uliano-Silva, M., Americo, J.A., Brindeiro, R., Dondero, F., Prosdocimi, F., Rebelo, M.D.F., 2014. Gene discovery through transcriptome sequencing for the invasive mussel Limnoperna fortunei. PLoS One 9, e102973.

Van Oosterhout, C., Hutchinson, W.F., Wills, D.P.M., Shipley, P., 2004. MICRO-CHECKER: software for identifying and correcting genotyping errors in microsatellite data. Mol. Ecol. Notes 4(3), 535-538.

Wang, S., Hou, R., Bao, Z.M., Du, H.X., He, Y., Su, H.L., Zhang, Y.Y., Fu, X.T., Jiao, W.Q., Li, Y., Zhang, L.L., Wang, S., Hu, X.L., 2013. Transcriptome sequencing of Zhikong scallop (Chlamys farreri) and comparative transcriptomic analysis with Yesso scallop (Patinopecten yessoensis). PLoS One 8(5), e63927.

Wang, W., Hui, J.H.L., Chan, T.F., Chu, K.H., 2014. De novo transcriptome sequencing of the snail Echinolittorina malaccana: identification of genes responsive to thermal stress and development of genetic markers for population studies. Mar. Biotechnol. 16, 547-559.

Wang, Y., Guo, X., 2007. Development and characterization of EST-SSR markers in the eastern oyster Crassostrea virginica. Mar. Biotechnol. 9, 500-511.

Wang, Y., Wang, A., Guo, X., 2010. Development and characterization of polymorphic microsatellite markers for the northern quahog Mercenaria mercenaria (Linnaeus, 1758). J. Shellfish Res. 29, 77-82.

Werner, G.D.A., Gemmell, P., Grosser, S., et al., 2013. Analysis of a deep transcriptome from the mantle tissue of Patella vulgata Linnaeus (Mollusca: Gastropoda: Patellidae) reveals candidate biomineralising genes. Mar. Biotechnol. 15, 230-243.

Whyte, J.N.C., Bourne, N., Hodgson, C.A., 1990a. Nutritional condition of rock scallop, Crassadoma gigantea (Gray), larvae fed mixed algal diets. Aquaculture 86, 25-40.

Whyte, J.N.C., Bourne, N., Ginther, N.G., 1990b. Biochemical and energy changes during embryogenesis in the rock scallop Crassadoma gigantea. Mar. Biol. 106, 239-244.

Zhang, G.F., Fang, X.D., Guo, X.M., Li, L., Luo, R.B., Xu, F., Yang, P.C., Zhang, L.L., Wang, X.T., Qi, H.G., Xiong, Z.Q., Que, H.Y., Xie, Y.L., Holland, P.W.H., Paps, J., Zhu, Y.B., Wu, F.C., Chen, Y.X., Wang, J.F., Peng, C.F., Meng, J., Yang, L., Liu, J., Wen, B., Zhang, N., et al., 2012. The oyster genome reveals stress adaptation and complexity of shell formation. Nature 490, 49-54.

Zhang, L.L., Li, L., Zhu, Y.B., Zhang, G.F., Guo, X.M., 2014. Transcriptome analysis reveals a rich gene set related to innate immunity in the eastern oyster (Crassostrea virginica). Mar. Biotechnol. 16, 17-33.

Figure Captions

Fig. 1 The length distribution of unigenes and transcripts from the de novo assembly of the C. gigantea transcriptome.

Fig. 2 A five-way Venn diagram depicting the unique and overlapping unigenes showing homology to sequences in the PFAM, NT, NR, KOG, and GO databases.

Fig. 3 Functional annotations of the assembled unigene sequences based on Gene Ontology (GO). The GO analysis was performed at level 2 under the three main GO categories.

Fig. 4 KOG classification of the assembled unigene sequences. The y-axis indicates the number of unigenes in a specific functional cluster. Each category is indicated on the x-axis by a letter that is listed in the right column.

Fig. 5 Pathway assignment based on KEGG. (A) Cellular Processes (B) Environmental Information Processing (C) Genetic Information Processing (D) Metabolism (E) Organismal Systems.

Table 1 Characteristics of 12 PCR-validated microsatellite markers developed for C. gigantea

Locus  Gene ID              Repeat motif  Primer sequence (5'-3')    Ta (ºC)  Number of alleles  Size range (bp)  Ho     He     P value
Cdg01  Cluster-21242.28813  (CTG)7        F:CGAGCTGGACTTCACAGGTT     60       3                  260-281          0.280  0.254  1.0000
                                          R:GCTAGTCTGGCAGCCAATCT
Cdg54  Cluster-20101.6      (GTTG)5       F:CTCAACAGACTGGTAGGTCCA    56       5                  234-266          0.923  0.723  0.1579
                                          R:ACTGACACGCAACACATCCT
Cdg55  Cluster-20101.4      (GTTG)5       F:TCAGCAGACTGGTAGGTCCA     60       5                  242-260          0.821  0.745  0.8546
                                          R:ACTGACACACAACACATCCTT
Cdg56  Cluster-21242.18094  (GTTC)6       F:TTCCGTCCGCTTCTCTAGGA     56       3                  150-182          0.360  0.316  1.0000
                                          R:TACTAGAACGGCACTTCGCG
Cdg57  Cluster-21242.33855  (ATTG)5       F:ACGTTACAACCATCCACACA     60       4                  92-104           0.440  0.411  0.8245
                                          R:CGTCGAACGATCAATCAAGCA
Cdg59  Cluster-21242.28287  (TGTA)5       F:TTGGAAGCAGGCCAGGTCTA     54       3                  250-330          0.033  0.267  0.0000*
                                          R:ACAAGCTGGCATGTGAATATTCA
Cdg60  Cluster-19933.0      (ATTT)5       F:TCCTCCTCGTTCTGTCAGACT    60       5                  258-290          0.346  0.664  0.0000*
                                          R:TGCCACCACTTCACTTGTGT
Cdg61  Cluster-31365.0      (ATTG)5       F:ACGTCCGGACAAGATGGATT     57       4                  232-300          0.433  0.597  0.0259
                                          R:ACATTCGCCATGCCTCGTAT
Cdg90  Cluster-21242.17985  (AGG)6        F:AGCTTAACCACACTGCTGCT     60       3                  229-241          0.519  0.577  0.3345
                                          R:TGAGCTGGACACTGGGTAGA
Cdg91  Cluster-21242.1498   (CTG)6        F:TCCTGAATGACGTCCATGGC     60       9                  170-260          0.808  0.820  0.1877
                                          R:ACCAGGGTCAACAAGGTCAG
Cdg92  Cluster-21242.13588  (ATT)5        F:TGCACAGAGTATTGGTAACGGT   60       5                  206-275          0.321  0.537  0.0062
                                          R:GACACCTGAACTGGGTCACA
Cdg98  Cluster-21242.29945  (GAG)5        F:GTGGTGGTGGTGGTCCTATG     60       4                  191-232          0.667  0.588  0.0199
                                          R:CATCTGATCGTGGTCGCTCA
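As a hedged illustration of the Ho and He columns in Table 1, the sketch below computes observed heterozygosity as the fraction of heterozygous individuals and expected heterozygosity as one minus the sum of squared allele frequencies. This is the simple (uncorrected) estimator; the software used in the study may apply a small-sample correction. The genotypes in `example_genotypes` are hypothetical allele sizes invented for the example, while the two lists of per-locus values are copied from Table 1.

```python
from collections import Counter


def heterozygosity(genotypes):
    """Return (Ho, He) for one locus from a list of (allele_a, allele_b) pairs."""
    n = len(genotypes)
    # Observed heterozygosity: share of individuals carrying two different alleles.
    ho = sum(a != b for a, b in genotypes) / n
    # Expected heterozygosity: 1 - sum of squared allele frequencies.
    counts = Counter(allele for pair in genotypes for allele in pair)
    total = 2 * n
    he = 1.0 - sum((c / total) ** 2 for c in counts.values())
    return ho, he


# Hypothetical genotypes (allele sizes in bp), for illustration only.
example_genotypes = [(260, 260), (260, 263), (263, 263), (260, 281)]
ho, he = heterozygosity(example_genotypes)

# Per-locus values as printed in Table 1; their means reproduce the reported averages.
TABLE1_HO = [0.280, 0.923, 0.821, 0.360, 0.440, 0.033,
             0.346, 0.433, 0.519, 0.808, 0.321, 0.667]
TABLE1_HE = [0.254, 0.723, 0.745, 0.316, 0.411, 0.267,
             0.664, 0.597, 0.577, 0.820, 0.537, 0.588]
mean_ho = round(sum(TABLE1_HO) / len(TABLE1_HO), 3)
mean_he = round(sum(TABLE1_HE) / len(TABLE1_HE), 3)
print(ho, he, mean_ho, mean_he)
```

The two means evaluate to 0.496 and 0.542, matching the average observed and expected heterozygosities given in the text.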

Table 2 Summary of sequencing data of C. gigantea transcriptome

Category                           Number/length
Raw reads                          80,832,518
Clean reads                        77,306,198
Average size of clean reads (bp)   150
Clean bases (Gb)                   11.6
Error (%)                          0.02
GC content (%)                     42.12
Q30 percentage (%)                 92.03
Total Nucleotides                  160,144,398
Transcripts                        158,855
  N50 length (bp)                  1995
  N90 length (bp)                  354
  Mean length (bp)                 1008
Unigenes                           96,007
  N50 length (bp)                  2298
  N90 length (bp)                  653
  Mean length (bp)                 1479
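The N50 and N90 statistics in Table 2 follow the usual definition: the length L such that sequences of length L or longer account for at least 50% (or 90%) of all assembled bases. The sketch below is a minimal illustration of that definition, not the assembler's own code; `nx_length` and the toy length list are assumptions made for the example. The last lines check that the mean lengths in Table 2 agree with the totals reported in Table 2 (transcripts) and Table 6 (unigenes).

```python
def nx_length(lengths, fraction=0.5):
    """Return the Nx length (N50 for fraction=0.5, N90 for fraction=0.9)."""
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running total reaches the target is Nx.
        if running >= target:
            return length
    raise ValueError("lengths must be a non-empty list of positive numbers")


# Toy contig lengths (hypothetical), total 1500 bp.
toy_lengths = [100, 200, 300, 400, 500]
n50 = nx_length(toy_lengths, 0.5)
n90 = nx_length(toy_lengths, 0.9)

# Consistency check on reported values: total bases / sequence count = mean length.
transcript_mean = round(160_144_398 / 158_855)  # Table 2 totals
unigene_mean = round(141_948_608 / 96_007)      # total size taken from Table 6
print(n50, n90, transcript_mean, unigene_mean)
```

For the toy list, the two longest contigs (500 + 400) already cover half of the 1500 bases, so N50 is 400; reaching 90% requires contigs down to 200 bp.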

Table 3 Annotation of unigenes of the C. gigantea transcriptome against seven public databases.

Annotated databases  Number of Unigenes  ≤500   500-2000  ≥2000
NR                   36938               2117   17526     17295
NT                   6802                457    2290      4055
KEGG                 8074                381    3580      4113
SwissProt            28533               1179   12696     14658
PFAM                 35524               2408   16680     16436
GO                   35575               2420   16713     16442
KOG                  15682               482    6487      8713
All                  3386
At least one         45389
Total Unigenes       96007
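The annotation counts in Table 3 can be turned into annotation rates with simple arithmetic. The sketch below copies the counts from the table, checks that the three length bins of each database add up to that database's total, and expresses each total as a percentage of the 96,007 unigenes. The dictionary layout and the helper name `percent_of_unigenes` are illustrative choices, not part of the original analysis.

```python
TOTAL_UNIGENES = 96007

# Database -> (annotated unigenes, (<=500, 500-2000, >=2000)) as listed in Table 3.
ANNOTATION = {
    "NR": (36938, (2117, 17526, 17295)),
    "NT": (6802, (457, 2290, 4055)),
    "KEGG": (8074, (381, 3580, 4113)),
    "SwissProt": (28533, (1179, 12696, 14658)),
    "PFAM": (35524, (2408, 16680, 16436)),
    "GO": (35575, (2420, 16713, 16442)),
    "KOG": (15682, (482, 6487, 8713)),
}
ANNOTATED_IN_AT_LEAST_ONE = 45389


def percent_of_unigenes(count):
    """Express a unigene count as a percentage of all assembled unigenes."""
    return round(100.0 * count / TOTAL_UNIGENES, 2)


# The length bins of every database should add up to its total.
bins_consistent = all(sum(bins) == total for total, bins in ANNOTATION.values())

rates = {name: percent_of_unigenes(total) for name, (total, _) in ANNOTATION.items()}
overall_rate = percent_of_unigenes(ANNOTATED_IN_AT_LEAST_ONE)
print(bins_consistent, rates["NR"], overall_rate)
```

With these counts, 38.47% of unigenes have an NR hit and 47.28% are annotated in at least one database.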

Table 4 Distribution of annotated C. gigantea unigenes among species of the class Bivalvia.

Bivalve species             The number of genes
Tegillarca granosa          6
Sinonovacula constricta     3
Scrobicularia plana         1
Saccostrea kegaki           1
Ruditapes philippinarum     4
Placopecten magellanicus    13
Pinctada maxima             1
Pinctada martensii          23
Pinctada margaritifera      5
Pinctada fucata             64
Pecten maximus              18
Ostrea edulis               11
Nodipecten subnodosus       3
Mytilus trossulus           5
Mytilus galloprovincialis   164
Mytilus edulis              6
Mytilus coruscus            6
Mizuhopecten yessoensis     68
Mimachlamys nobilis         34
Meretrix meretrix           8
Littorina littorea          24
Hyriopsis schlegelii        5
Hyriopsis cumingii          9
Cristaria plicata           1
Crassostrea virginica       1
Crassostrea hongkongensis   6
Crassostrea gigas           26054
Crassostrea ariakensis      20
Crassostrea angulata        4
Corbicula sp.               1
Corbicula japonica          2
Azumapecten farreri         326
Atrina rigida               1
Argopecten irradians        77

Table 5 Top 16 KEGG pathways mapped to the transcriptome unigenes.

Rank  KEGG Pathway                      Pathway ID  Gene Number
1     Focal adhesion                    ko04510     395
2     Endocytosis                       ko04144     368
3     Rap1 signaling pathway            ko04015     345
4     Purine metabolism                 ko00230     329
5     Oxytocin signaling pathway        ko04921     325
6     Lysosome                          ko04142     309
7     cAMP signaling pathway            ko04024     305
8     Ubiquitin mediated proteolysis    ko04120     303
9     Ras signaling pathway             ko04014     301
10    MAPK signaling pathway            ko04010     299
11    Calcium signaling pathway         ko04020     284
12    PI3K-Akt signaling pathway        ko04151     280
13    Regulation of actin cytoskeleton  ko04810     278
14    Insulin signaling pathway         ko04910     268
15    cGMP-PKG signaling pathway        ko04022     265
16    Spliceosome                       ko03040     260

Table 6 Summary of the SSR search results in the C. gigantea transcriptome

Item                                             Number
Total number of sequences examined               96007
Total size of examined sequences (bp)            141948608
Total number of identified SSRs                  24795
Number of SSR containing sequences               19182
Number of sequences containing more than 1 SSR   4148
Number of SSRs present in compound formation     1113

Table 7 Summary of simple sequence repeat (SSR) types in the C. gigantea transcriptome

Motif Length  Repeat Numbers                                          Number  Percentage%
              5     6     7     8    9    10   11   12   16
Di-           0     2729  1293  715  426  258  169  11   1            5602    72.52
Tri-          1069  431   364   56   3    1    0    0    0            1924    24.91
Tetra-        143   39    2     0    0    0    2    0    0            186     2.41
Penta-        6     2     0     1    0    0    0    0    0            9       0.12
Hexa-         4     0     0     0    0    0    0    0    0            4       0.05
Total         1222  3201  1659  772  429  259  171  11   1            7725
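To make the SSR categories in Table 7 concrete, the following sketch scans a nucleotide sequence for perfect di- to hexa-nucleotide repeats with a regular expression. It is an illustration only, not the search tool used in the study. The minimum repeat numbers (six for dinucleotides, five for longer motifs) are an assumption read off Table 7, where no dinucleotide SSR has five repeats; the function names and the test sequence are hypothetical.

```python
import re

# Assumed minimum repeat numbers per motif length, inferred from Table 7.
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_repeat_of_shorter_unit(motif):
    """True if the motif is itself a tandem repeat of a shorter unit (e.g. ATAT, AAA)."""
    n = len(motif)
    return any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_ssrs(sequence):
    """Return a sorted list of (start, motif, repeat_count) for perfect SSRs."""
    sequence = sequence.upper()
    hits = []
    for motif_len, min_rep in MIN_REPEATS.items():
        # Group 2 captures one motif copy; \2{n,} requires at least n further copies.
        pattern = re.compile(r"(([ACGT]{%d})\2{%d,})" % (motif_len, min_rep - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(2)
            if is_repeat_of_shorter_unit(motif):
                continue  # counted under the shorter motif class instead
            repeats = len(match.group(1)) // motif_len
            hits.append((match.start(), motif, repeats))
    return sorted(hits)


# Hypothetical test sequence containing a (CTG)7 and a (GTTG)5 repeat.
test_sequence = "TTGCA" + "CTG" * 7 + "ACGTA" + "GTTG" * 5 + "CA"
ssrs = find_ssrs(test_sequence)
print(ssrs)
```

Each hit gives the 0-based start position, the repeat motif, and the number of tandem copies, which is the information needed to place an SSR in one cell of Table 7.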

Abbreviation list

Simple sequence repeat (SSR)
Gene Ontology (GO)
Kyoto Encyclopedia of Genes and Genomes (KEGG)
Heat shock proteins (HSP)
Toll-like receptor (TLR)

Highlights

• This is the first transcriptome analysis for the rock scallop Crassadoma gigantea.

• 96,007 unigenes were assembled and annotated by public function and pathway databases.

• 24,795 microsatellites were detected and 42 loci of them were validated in amplification experiments.

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5