Transcriptome characterization and SSR discovery in large-scale loach Paramisgurnus dabryanus (Cobitidae, Cypriniformes)

Transcriptome characterization and SSR discovery in large-scale loach Paramisgurnus dabryanus (Cobitidae, Cypriniformes)

Gene 557 (2015) 201–208 Contents lists available at ScienceDirect Gene journal homepage: www.elsevier.com/locate/gene Transcriptome characterizatio...

860KB Sizes 0 Downloads 24 Views

Gene 557 (2015) 201–208

Contents lists available at ScienceDirect

Gene journal homepage: www.elsevier.com/locate/gene

Transcriptome characterization and SSR discovery in large-scale loach Paramisgurnus dabryanus (Cobitidae, Cypriniformes) Caijuan Li, Qufei Ling ⁎, Chen Ge, Zhuqing Ye, Xiaofei Han School of Biology and Basic Medical Sciences, Soochow University, 199 Renai Road, Suzhou, Jiangsu 215123, PR China

a r t i c l e

i n f o

Article history: Received 29 August 2014 Received in revised form 2 December 2014 Accepted 15 December 2014 Available online 18 December 2014 Keywords: Paramisgurnus dabryanus Illumina sequencing Transcriptome analysis SSR markers

a b s t r a c t The large-scale loach (Paramisgurnus dabryanus, Cypriniformes) is a bottom-dwelling freshwater species of fish found mainly in eastern Asia. The natural germplasm resources of this important aquaculture species has been recently threatened due to overfishing and artificial propagation. The objective of this study is to obtain the first functional genomic resource and candidate molecular markers for future conservation and breeding research. Illumina paired-end sequencing generated over one hundred million reads that resulted in 71,887 assembled transcripts, with an average length of 1465 bp. 42,093 (58.56%) protein-coding sequences were predicted; and 43,837 transcripts had significant matches to NCBI nonredundant protein (Nr) database. 29,389 and 14,419 transcripts were assigned into gene ontology (GO) categories and Eukaryotic Orthologous Groups (KOG), respectively. 22,102 (31.14%) transcripts were mapped to 302 KEGG pathways. In addition, 15,106 candidate SSR markers were identified, with 11,037 pairs of PCR primers designed. 400 primers pairs of SSR selected randomly were validated, of which 364 (91%) pairs of primers were able to produce PCR products. Further test with 41 loci and 20 large-scale loach specimens collected from the four largest lakes in China showed that 36 (87.8%) loci were polymorphic. The transcriptomic profile and SSR repertoire obtained in this study will facilitate population genetic studies and selective breeding of large-scale loach in the future. © 2015 Elsevier B.V. All rights reserved.

1. Introduction The large-scale loach Paramisgurnus dabryanus (Guichenot; Cobitidae), a small freshwater fish species found mainly in eastern Asia, is farmed in China as a food source (Meng et al., 1995). It is a bottom-dwelling fish and the only species belonging to the genus Paramisgurnus. As a result of its desirable taste, contribution to health and economic value (Zhao et al., 1999), P. dabryanus has been recognized as a species suitable for freshwater aquaculture and become popular in Asia. Owing to its artificial propagation (Zhu et al., 2007), the P. dabryanus industry has developed quickly in recent years and the total output of P. dabryanus in Jiangsu Province, China reached 50,000 tons in 2011. As a consequence of overfishing, artificial propagation and high-density breeding, however, P. dabryanus is under the threat of germplasm recession and mixture. The cultured population of P. dabryanus has been gradually exhibiting early sexual maturity, slow growth-rate and disease susceptibility (Wang and Li, 2005; You et al., 2012). Therefore, the application of molecular genetics tools is essential to protect P. dabryanus germplasm resources, to promote genetic

Abbreviations: GO, gene ontology; SSRs, simple sequence repeats; COG, Clusters of Orthologous Groups; KOG, Eukaryotic Orthologous Groups; KEGG, Kyoto Encyclopedia of Genes and Genomes; QTL, quantitative trait loci. ⁎ Corresponding author. E-mail address: [email protected] (Q. Ling).

http://dx.doi.org/10.1016/j.gene.2014.12.034 0378-1119/© 2015 Elsevier B.V. All rights reserved.

selection for the improvement of the growth rate and to enhance the economic value of this fish. The development of genetic markers is the first step in the application of genomics to improve broodstock. Genetic markers can be used to create linkage maps and identify subsequent quantitative trait loci (QTLs) (Gao et al., 2012a, 2012b). Marker-assisted selection (MAS) can be used in assisting broodstock selection on the basis of the genotypes of QTL that are relevant to economically important traits, including rapid growth rate (Zhang et al., 2011). Owing to their locus-specific codominant and multiallelic nature and high level of abundance in the genome, simple sequence repeats (SSRs or microsatellite markers) are very useful molecular markers (O'Connell and Wright, 1997). SSRs have been used widely in marker-assisted breeding in many organisms (Alam and Islam, 2005; Cavagnaro et al., 2010; Hulaka et al., 2010; Zhu et al., 2011). Genetic linkage mapping has gained deep insight into many species, from economic crops, including wheat, rice, soybean and sesame, to Homo sapiens (Jun et al., 2011). Genetic linkage maps of SSRs have been published for several fish species (Chistiakov et al., 2006; Wang et al., 2007; Rexroad et al., 2008) and marker-assisted selection in breeding is increasingly popular in economic fish studies. There were only 134 nucleotide sequences and 12 ESTs available for large-scale loach in GenBank. One issue related to the genetic research of P. dabryanus is the lack of effective genetic markers. The development of a large set of SSR markers for analysis of the P. dabryanus genome could help to enhance

202

C. Li et al. / Gene 557 (2015) 201–208

P. dabryanus genomics research, to protect P. dabryanus germplasm resources, to facilitate fine mapping of quantitative trait loci (QTL), to improve the identification and exploitation of genes affecting important traits and even to enable marker-assisted selection. Traditional development method of SSR markers consists of establishing a library and then sequencing. Because cloning and sequencing DNA or EST libraries on a massive scale are costly and time-consuming, and importantly the throughput is low, the lack of SSR markers has made it difficult to construct a detailed linkage map that can be used for P. dabryanus genetic breeding. With the advent of next-generation sequencing technology (Metzker, 2010), sequencing has become progressively more efficient. Sequencing the genome of a non-model species, however, is expensive. Recent advances in massive RNA sequencing (RNA-seq) provide a fast, cost-effective and reliable approach for the generation of large expression datasets in non-model species (Marioni et al., 2008; Nagalakshmi et al., 2008; Crawford et al., 2010). Compared to genomic SSR makers, EST–SSR (from genes) can help to identify candidate functional genes and increase the efficiency of marker-assisted selection (Gupta and Rustgi, 2004). We therefore used RNA-seq to gain deep insight into the transcriptome of P. dabryanus and to develop large numbers of efficient genetic-SSR molecular markers. In this study, we sampled 13 tissues of large-scale loach from the parents of a mapping family, generated a database using RNA-seq and developed tens of thousands of SSRs. This is the first comprehensive transcriptome analysis and large-scale SSRs development for P. dabryanus. The results of this study will provide a valuable resource for novel gene discovery, for construction of a genetic linkage map and for comparative genome analysis, which will together facilitate marker-assisted selection in P. dabryanus breeding. 2. Materials and methods 2.1. Sampling A breeding female P. dabryanus from Hongze Lake in Jiangsu Province, China (HZ) and a mature wild type male P. dabryanus from Poyang Lake in Jiangxi Province, China (PY), were euthanized with 100 mg/L tricaine methanesulfonate (MS 222, Sigma, USA) for tissue collection. Seven tissues, including the muscle, bone, fin, brain, spleen, liver and ovary, were collected and placed into RNA sample protector (D311A, TaKaRa Biotechnology (Dalian, China) Co., Ltd.) according to the manufacturer's protocol.

other transcriptomes obtained by our Lab, were assembled by Trinity software using default parameters (Grabherr et al., 2011). 2.4. Transcriptome analysis All assembled sequences were screened for ORF prediction using Trinity software to find potential protein-coding genes (Grabherr et al., 2011). Sequence similarity search was performed against Nr database, String database and Danio rerio transcriptome database using the Blastx algorithm (E-value b 10− 5) (Altschul et al., 1997). Gene ontology (GO) terms were extracted from the best hits using Blast2GO. The GO results were categorized using in-house Perl scripts (Ashburner et al., 2000). Clusters of Orthologous Groups (COG) (with E = 10− 6), Eukaryotic Orthologous Groups (KOG) and Kyoto Encyclopedia of Genes and Genomes (KEGG) were predicted based upon BLAST against Nr and String database for possible functional classifications and molecular pathways (Kanehisa et al., 2006, 2008). 2.5. Identification and validation of SSRs Microsatellites with a minimum repeat length of 12 bp, including di-, tri-, tetra-, penta- and hexanucleotide motifs, were detected using msatcommander software with Primer 3.0 used for primer designing for all of the loci with enough flanking sequences (Rozen and Skaletsky, 2000). 41 SSR loci were randomly selected to test 20 individuals, five from each of the four largest lakes in China (Taihu Lake, Chaohu Lake, Dongting Lake and Hongze Lake). PCR admixture (20 μL) contains 100 ng template DNA, 2× PCR Mix (CW2296, Shiji Kangwei, China) and 0.4 μM for each primer. The following PCR protocol was used: 94 °C for 3 min followed by 25 cycles at 94 °C for 30 s, 55 °C for 40 s, 72 °C for 30 s and, finally, an extension step at 72 °C for 5 min. The separation of alleles was first done by electrophoresis in a 2% (w/v) agarose gel (Sangon, China) and then by PAGE in a 6%(w/v) polyacrylamide gel (Sangon, China) with a DL-1000 DNA marker (TaKaRa Biotechnology (Dalian, China) Co., Ltd.) to calculate the length of the SSR amplicons. The polymorphic loci were used to genotype 20 P. dabryanus individuals. The genetic diversity and allele number were calculated using Popgene 32 software. The polymorphism information content (PIC) of each SSR marker was calculated as: m m−1 m X X X 2 2 2 PIC ¼ 1− Pi − 2P i P j i¼1

i¼1 j¼iþ1

2.2. RNA extraction, library preparation and Illumina sequencing For each tissue, total RNA was extracted using TRIzol® reagent according to the manufacturer's instructions (Invitrogen). The libraries were prepared using a TruSeq™ RNA sample preparation kit from Illumina (San Diego, CA) with 5 μg of mixed total RNA. Messenger RNA was isolated with poly(A) selection by oligo(dT) beads and fragmented using fragmentation buffer, cDNA synthesis, end repair, A-base addition and ligation of the Illumina-indexed adaptors according to Illumina's protocol. Libraries were size-selected for target fragments of 300–500 bp on 2% (w/v) Low Range Ultra Agarose followed by PCR amplified using Phusion DNA polymerase (NEB) for 15 cycles. After quantification by TBS380, paired-end libraries were sequenced with the Illumina HiSeq 2000 (read length 2 × 100 bp). 2.3. Sequence cleaning and assembly Initial raw reads were filtered through SeqPrep (https://github. com/jstjohn/SeqPrep) and Condetri_v2.0.pl (http://code.google. com/p/condetri/downloads/detail?name=condetri_v2.0.pl) software to remove low-quality reads (quality score lower than 25 and error probability N0.003), sequencing adapters, reads with many ambiguity bases and reads with a length b25 bp. All the clean reads from two

where Pi and Pj represent the frequency of the ith and jth alleles for a given SSR marker and m is the total number of alleles detected for that SSR marker (Botstein et al., 1980). 3. Results 3.1. Sequencing and de novo assembly For the specimen AB2012F-6, 62,532,704 raw reads were generated by the Illumina HiSeq sequencer, and 39,751,380 clean reads were left after filtering of low quality reads (Table 1). Two more transcriptome datasets (AB2012M-1, AB2012M-2) were included in the assembly of large-scale loach transcriptome (Table 1). The sequence data were

Table 1 The numbers of clean reads and corresponding numbers of nucleotides in three Paramisgurnus dabryanus samples. Sample name

Clean reads

Nucleotides

AB2012F-6 AB2011M-1 AB2011M-2

39,751,380 33,004,972 32,551,726

3,941,421,395 3,266,627,229 3,266,627,229

C. Li et al. / Gene 557 (2015) 201–208

deposited in the NCBI Sequence Read Archive http//:www.ncbi.nlm.nih. gov/Trace/sra under accession number SRP049748. In total more than 1.05 × 108 reads (~ 10.4 Gb) were used for sequence assembly and results in 71,887 transcripts. The length of transcripts ranged from 301 bp to 24,747 bp, and 34,603 (48.1%) transcripts were longer than 1000 bp. The frequency distribution of the transcript length is shown in Fig. 1. 3.2. Annotation of all nonreduntant transcripts 42,093 (58.55%) out of the 71,887 transcripts were predicted to have an ORF of N150 bp, in which ORFs N200 bp account for 41,254 (96.16%). 43,837 (60.98%) and 18,062 (25.13%) transcripts were found to have significant matches in NCBI non-redundant database and String database (Additional file 1). 83.66% of the transcripts N1000 bp in length showed homologous matches, whereas only 32.42% of the 1–400 bp transcripts showed matches (Additional file 2). BLASTx searches against D. rerio transcriptome revealed that 43,807 P. dabryanus transcripts showed significant matches to 19,064 D. rerio transcripts (Additional file 3). The percent coverage of alignment length against query is shown in Additional file 4. 29,389 transcripts were assigned to gene ontology (GO) terms in three categories, biological processes, cellular components and molecular function clusters (Fig. 2). Within the biological process category, cellular processes were the most dominant group (69.69%) followed by metabolic processes (53.98%), biological regulation (41.51%) and regulation of biological processes (39.60%). For molecular functions, 62.40% of the transcripts were assigned to binding, followed by catalytic activity (38.19%), transporter activity (6.61%) and enzyme regulator activity (5.23%). With regard to the cellular component category, cell (63.12%) and cell part (63.12%) were the dominant groups, followed by organelles (44.99%) and membrane (31.47%). 8254 of the 43,837 transcripts were assigned to 24 COGs, including cellular structure, biochemistry metabolism, molecular processing and signal transduction (Fig. 3). The cluster for general function (3354, 25%) represented the largest group, followed by signal transduction mechanisms (1257, 9.5%), transcription (1254, 9.5%), replication, recombination and repair (1132, 8.5%), post-translational modification, protein turnover, chaperones (944, 7.1%), translation, ribosomal structure and biogenesis (559, 4.2%). Only nine and six transcripts were assigned to cell motility and nuclear structure terms, respectively (Fig. 3). In addition, 22,102 (31.14%) transcripts were mapped to 302 cellular metabolic or signaling pathways, including metabolic pathways, focal adhesion, and regulation of actin cytoskeleton (Additional file 5).

203

3.3. Frequency and distribution of SSRs in the P. dabryanus transcriptome 15,106 SSRs were identified from the 71,887 transcripts. The SSR frequency in the P. dabryanus transcriptome was 21% with a distribution density of 143 per Mb. The most abundant repeats motif was trinucleotide (43.08%), followed by dinucleotide (35.67%), mononucleotide (14.53%), tetranucleotide (6.30%), pentanucleotide (0.39%) and hexanucleotide (0.03%) repeat units (Fig. 4). 66 types of repeats motifs were found among the P. dabryanus transcriptome. The most frequent type was (AC/GT)n (2745, 18.17%), followed by (A/T)n (2162, 14.31%), (ATC/GAT) n (1761, 11.66%), (AG/CT)n (1574, 10.42%), (AAT/ATT)n (1170, 7.75%), and (A/T)n (1049, 6.94%). Additionally, the most abundant type in tetraand pentanucleotide SSRs was (AAAC/GTTT)n (230,1.52%) and (AAAAT/ ATTTT)n (11, 1.5%), respectively (Additional file 6). The copy number of different repeat motifs in the SSR sequences was distributed unevenly. The copy number of different repeat motifs varied from 4 to 33, with the highest copy number in (AG/CT)n dinucleotide repeats. The four most frequent copy numbers for SSRs were 4 (31.38%), 6 (20.41%), 5 (11.17%) and 7 (10.29%), while those presenting a repeat number N 15 accounted for 4.02%. The longest SSR in each unit type (from mono- to hexanucleotide repeats) was 24 bp (A/T), 66 bp (AG/ CT), 39 bp (AAC/GTT), 60 bp (ATTG/CAAT), 85 bp (AAACT/AGTTT) and 66 bp (ACACGC/GCGTGT) (Table 2). 3.4. PCR amplification and polymorphism of SSR markers 11,037 primer pairs were designed from 15,106 SSR-containing sequences and 400 SSR primer pairs were randomly selected for validation. Of these 400 primer pairs, 364 (91%) primer pairs yielded amplification products in two randomly selected P. dabryanus individuals, of which 41 pairs of primer were selected randomly for further validation of polymorphism. The results indicated that 36 (87.8%) microsatellite loci had allelic polymorphism, with the number of alleles per locus varying from 2 to 14 (mean 6.47). PIC values for the SSR markers ranged from 0.048 to 0.872, with an average value of 0.619 (Table 3). 4. Discussion 4.1. Illumina paired-end sequencing and de novo assembly To identify microsatellite markers and achieve a relatively abundant transcriptome database for further research, we sequenced a mixed sample of 13 tissues. 71,887 transcripts were acquired with an average length of 1464.58 bp, similar to that of Lucioperca lucioperca 1474 bp,

Fig. 1. Size distribution of assembled transcripts in Paramisgurnus dabryanus.

204

C. Li et al. / Gene 557 (2015) 201–208

A

B

C

Fig. 2. Gene ontology classification of annotated transcripts in Paramisgurnus dabryanus. A: Biological processes, B: cellular components, C: molecular function.

C. Li et al. / Gene 557 (2015) 201–208

205

Fig. 3. COG (Clusters of Orthologous Groups) classification of transcripts in Paramisgurnus dabryanus.

Siniperca chuatsi 1493 bp, while longer than Pelodiscus sinensis 1275 bp (Illumina plateform, data unpublished) and much longer than contigs originated from 454 RNA-seq including Megalobrama amblycephala 730 bp (Gao et al., 2012a, 2012b), Sparus aurata 727 bp (Josep et al., 2013), common carp 888 bp (Ji et al., 2012) and Gerris incognitus 369 bp (Perry and Rowe., 2011). Apart from the transcriptome difference between organisms, our large quantity database (N10 Gb) might account for its long average sequence length, since the increased transcriptome nucleotide coverage depth facilitates de novo assembly and enhances the sequencing accuracy. With the development of bioinformatics tool, de novo assembly of transcriptome was greatly improved, as newly developed assembly software has overcome the disadvantage of Illumina sequence assembly (Zhao et al., 2011), resulting in an ideal sequence length. Because greater sequence length has more potential for SSR marker identification (Zalapa et al., 2012), our results suggested that RNA-seq using the Illumina platform provide a more intelligent and cost-effective option for SSR development. 4.2. Functional annotation, classification and comparative analysis Transcriptome sequencing projects are difficult or impossible to evaluate without a completely annotated reference genome sequence,

Fig. 4. Distribution of different EST–SSRs' motifs in Paramisgurnus dabryanus transcriptome.

however estimating the number of genes and the level of transcript coverage is an important issue. In this study, we indirectly evaluated this transcriptome coverage through numbering the transcripts with a Blast match with the existing database and performing a comparison with the D. rerio cDNA database. The ORF prediction revealed that 58.55% (42,093) of the 71,887 transcripts has an ORF of N 150 bp, of which ORFs N200 bp account for 57.39% (41,254). In M. amblycephala (Gao et al., 2012a, 2012b) for which a full run of 454 platform was performed, those contigs with an ORF N200 bp only account for a frequency of 21% (5615). Blastx search against the Nr database showed a higher annotation ratio (43,837; 60.98%) than reported in other fish species such as M. amblycephala (40.5%) (Gao et al., 2012a, 2012b), Hypophthalmichthys molitrix (26.9%) (Fu and He., 2012) and Scophthalmus maximus (44.8%) (Pereiro et al, 2012). This annotation improvement can be explained by the higher average length (1464.5 bp) of the database. In the present study, 83.66% of the transcripts of N 1000 bp showed Blast hits with the Nr database, whereas only 32.42% of the 300–400 bp transcripts showed matches, again revealing that longer assembled sequences led to higher Blast hit frequency with the existing protein database. Many P. dabryanus transcripts were assigned to GO categories and COG classifications (Figs. 2 and 3). As for the biological process category, 1005 transcripts were assigned into growth, gaining deep insight into growth-related research such as breeding. Totally 22,102 (31.14%) transcripts were grouped into 302 cellular metabolic or signaling pathways, more than reported in Oviductus Ranae transcriptome database (218 pathways) (Zhang et al., 2013). These annotations provide a valuable resource for investigating specific processes, functions and pathways in P. dabryanus research. In addition, comparison with the D. rerio cDNA database indicated that 43,807 (60.9%) transcripts of the P. dabryanus transcriptome had a homology match with D. rerio cDNA sequences, more than reported in M. amblycephala (15,110; 55%), S. maximus (16,894; 40.35%). At the same time, the number of sequences matched with D. rerio in current research is more than reported in common carp (19,064 VS 14,554) (Ji et al., 2012). However, those which have no significant similarity with D. rerio cDNA database accounts for 39.1% of P. dabryanus transcriptome. This can be attributed to the presence of novel genes and the difference between species. Both D. rerio and P. dabryanus belong to Cyprinidae but in different subfamilies distantly (Meng et al., 1995). They may share limited similarity in some degree. In addition, this study may not be possible to cover the whole transcriptome as only seven kinds of tissues of adult large-scale loach were sampled and some rare transcripts may be

206

C. Li et al. / Gene 557 (2015) 201–208

Table 2 The number of repeats in each different motif length in Paramisgurnus dabryanus transcriptome. Number of repeats

4

5

6

7

8

9

10

11

12

13

14

15

N15

Mono-nucleotide Di-nucleotide Tri-nucleotide Tetra-nucleotide Penta-nucleotide Hexa-nucleotide Total Type %

0 0 4088 606 45 2 4741 31.38

0 0 1367 309 10 1 1687 11.17

0 2400 661 21 1 0 3083 20.41

0 1214 332 8 0 0 1554 10.29

0 636 25 0 1 0 662 4.38

0 504 7 1 1 0 513 3.40

0 416 4 2 0 0 422 2.79

0 210 0 3 0 1 214 1.42

876 5 0 0 0 0 881 5.83

374 2 6 0 0 0 382 2.53

201 0 11 1 0 0 213 1.41

146 0 0 1 0 0 147 0.97

598 2 6 0 1 0 607 4.02

missed. For further understanding and characterization of P. dabryanus transcriptome, we definitely need a complete set of transcriptome from virtually every tissue across every life stage and every circumstance. As we know, this is the first transcriptome sequencing analysis for P. dabryanus. The large number of sequences generated in this study provides valuable sequence information for novel gene discovery, development of genetic markers and further genetic analysis of P. dabryanus. 4.3. Frequency and SSR distribution in P. dabryanus transcriptome Polymorphic SSR markers have an important role in genetic diversity research, population genetics, genetic linkage mapping, comparative genomics and trait-association analysis. In the present study, 15,106 perfect SSRs (≥12 bp) were identified from 71,887 transcripts, which suggested that every EST sequence possesses an average number of

0.21 SSR, more than Phoca largha 0.024 (Gao et al., 2012a, 2012b), Misgurnusan guillicaudatus 0.076 (Chen et al., 2010) and similar to Haliotis midae 0.34 (Franchini et al, 2011) and P. sinensis 0.23 (data unpublished). This might owe to the effective assembly of our dataset. The distribution density was one SSR per 6.99 kb, quite higher than reported for other fish species including M. amblycephala 9.53 kb (Gao et al., 2012a, 2012b), P. largha 9.65 kb (Gao et al., 2012a, 2012b), while lower than H. midae 0.756 kb (Franchini et al, 2011). The distribution density of SSR is influenced by several factors, including genome structure or composition (Toth et al., 2000), SSR detection criteria, dataset size, database-mining tools and the parameters for exploration of microsatellites (Varshney et al., 2005; Wei et al., 2008). Most SSR loci are dinucleotide repeats (30–67%) in the vertebrate genome; in the present study, dinucleotide repeats accounted for 35.67% and (AC/GT)n motif was the most common (2745, 18.17%) dinucleotide repeat microsatellites, consistent with Toth's survey in vertebrate animal

Table 3 Polymorphic SSR markers and polymorphism within 20 Paramisgurnus dabryanus samples. SSR locus

Left primer

Right primer

SSR

Tm

No. of alleles

PIC

GenBank accession no.

Pada105 Pada206 Pada207 Pada258 Pada261 Pada263 Pada265 Pada268 Pada269 Pada272 Pada274 Pada280 Pada283 Pada284 Pada285 Pada286 Pada287 Pada292 Pada293 Pada294 Pada298 Pada299 Pada301 Pada302 Pada304 Pada306 Pada308 Pada309 Pada310 Pada317 Pada319 Pada320 Pada322 Pada323 Pada325 Pada326

TGAGGCAAATCCTGGAGGC CTCCCAGACCAGTGAAGGG GTCGTGACGTTGGTTAGCG GCACATATAACCTGATTTGCATTAGAC TTGCCCACCTTCAGCTTTG TCAGCAAGTACAGGTTGGG GCTGTCTAAATGACGCGTAGG GCAGCTCTAGGTGGCCTG AATCATTGCCTCGCTGACG AGGAAGTTTCCACGGAGCC CAGGCGCAAATGCCTTAAAC ACATGCAAGGCAGTTATAGGG TGATGTCAGTGGCTTTCCAG GCAACTCAACAGCCTGAGC CCTAACTGCGCACCAACTG TGACGCCCAATAGACAAAGC TTGACTGATGGGTGGCTTC TCCCATTCTTAACCCGAGC GCCTACCGAGTGGTCTGG GAAGACGAGGAGTCTCGGG CTTCTGGCAAGAGTTCAGGG TACGGGAGCTAAGGCCAAC CCTGTAGTTGACCCATTTCGC GAGGAGGTCTTCGGATGGG CGCCGACAAATCTGACTAGC TTGCCAATGCAAGTGAACC GGTCCTCGTTAAGGTGGCAG TGGGAAGAGGTGCATCGTG GGGTCTCCACAGTGACGC CACACACGCTCATTGCTTTC AAGGAGGAACTGGAGCCAC ACAATCCTTCAACCTTTGGTG ACAACAGTCCCATTGGTCAAAC CACTCATCGACACGCTTGC GATGGCTGTAATCCACTCGC CGACGATCAGCAGTTCACG

TCAACTCTCACAGCACGGG CGGACTGGACAAACAAGTCG GCCTCCCATGATGCCTTTC TGTGGGTTCCCAGTCTTCC ATCAGGTTCGATGACCGGC TCTGCGGGTTGGAGACTTG TAGCACCCTGCTGTGGAAG CTGCAAATCTCCACGGCTC AACCCGAGGAGATTGACCC AGATTGCTTTGTGCAGCGG TGAGTCGGGTGAAACTCGG TGCCGCACAGCTACATTTG TCATTGCTGCCTCCAAACG CGATTTGTCCAGTGCTGCC CTCGCAAGGCACAAACGAG ACCCACTGAGCTCTGGTTC ATGGTAAAGGCAAAGATACGC ATACCCTCAGGAACCAGCG TTGACAGGTTCCCAAGGGC AAGAAGCCACCCATCAGTC TCAGGTGATCTGGGAATGC TGCAAACAAGTGCTGCCTC AGAAGGGTGACTTGTCCAAAC CACACAGAGGTTAGGGTTCAC CCCATGCAAAGATGCTCCG TGGACCCGCTAAATAGCCC AGTCAGGGCAGGAACATCG CCACCTACTGGGCCTCAAG CGCTGTCTGAGTGATCCCG CTGTGCTTTACTCCTGGGTTG ATCAGCAGGCCGATGACTG GCAGCACTGTGGTCATGC GGGATTGGCTGAGAGCACC CCATCGCAGTGAAGACATGG ACATCATCGGGATGGCTCG ATTAAGTCCAACACAGGTTCAG

(AGC)5 (ATGC)4 (ACC)6 (AAC)4 (AAT)4 (AC)6 (AC)6 (AC)7 (AC)7 (ACG)4 (AG)6 (AGG)4 (AT)6 (AT)6 (AT)7 (AT)7 (AT)7 (ATC)5 (ATT)4 (ATT)4 (CT)6 (CT)6 (CT)7 (CT)7 (CT)8 (CTT)4 (CTT)4 (CTT)4 (CTT)4 (GT)6 (GT)6 (GT)6 (GT)6 (GT)7 (GT)7 (GT)8

T9 T9 54.5 °C T9 T9 T2 T2 T9 T9 T9 T9 T9 T9 T9 T9 T9 T9 T9 55 °C 54.5 °C T9 T9 52 °C T9 T9 52 °C 52 °C 52 °C T9 T9 52 °C T9 52 °C 54.5 °C 52 °C 54.5 °C

3 4 14 4 4 8 12 8 6 6 3 2 9 7 7 9 6 3 6 4 10 2 10 8 4 8 6 7 6 4 7 2 10 7 11 6

0.410 0.599 0.872 0.520 0.317 0.735 0.813 0.829 0.746 0.706 0.570 0.129 0.795 0.718 0.756 0.695 0.718 0.282 0.541 0.464 0.829 0.195 0.795 0.829 0.418 0.740 0.630 0.735 0.628 0.432 0.783 0.048 0.858 0.678 0.828 0.626

KP144788 KP121691 KP202116 KP202117 KP202118 KP202119 KP202120 KP202121 KP202122 KP202123 KP202124 KP202125 KP202126 KP202127 KP202128 KP202129 KP202130 KP202131 KP202132 KP202130 KP202133 KP202134 KP202135 KP202136 KP202137 KP202138 KP202139 KP202140 KP202141 KP202142 KP202143 KP202144 KP202145 KP202146 KP202147 KP202148

T9: a touchdown PCR Tm of 63 °C–52 °C for 11 circles and 52 °C for 14 circles; T2: a Tm of 53°C–47°C for 6 circles and 49 °C for 25 circles.

C. Li et al. / Gene 557 (2015) 201–208

species (Toth et al., 2000). While in P. dabryanus transcriptome, trinucleotide was the richest SSR loci repeats (43.08%).

4.4. Polymorphism of the SSR markers The majority of P. dabryanus SSRs tested generated high-quality amplicons, suggesting that ESTs are suitable for specific primer design. In the present study, 364 (91%) of the primer pairs designed from ESTs yielded amplicons. The failure of primer pairs to produce amplicons could be caused by the location of the primer(s) across splice sites, chimeric primers or poor-quality sequences. Within the successful primer pairs, 36 (87.8%) of the 41 randomly selected amplicons were polymorphic, although several loci did not produce the expected size. The deviation from the expected size could have been due to the presence of introns, large insertions, lack of specificity and mutation in flanking sequences or even assembly errors (Saha et al., 2004; Varshney et al., 2005). These results suggested that the assembled transcripts were of high quality and the SSRs identified in our dataset could be used in the future. Using the SSRs in our dataset, the mean number of alleles per locus was 6.47. The PIC values ranged from 0.048 to 0.872 (mean 0.619), indicating that the polymorphism of SSRs was relatively high (PIC N 0.5). Totally, 15,106 SSRs were identified in our study and more PCR primers may be synthesized in the future as tools for quantitative trait loci mapping, population genetic structure assessment (Li et al., 2014) and protection of germplasm of P. dabryanus.

5. Conclusion We characterized the transcriptome profile of large-scale loach, an important aquaculture species with natural germplasm resources under potential recession and admixture and identified thousands of candidate SSR markers. The transcriptomics data will be valuable for functional studies of genes and gene networks that may lead to the improvement of large-scale loach aquaculture. This study provides deep insights into the properties and patterns of SSR markers. The candidate markers identified in this study will be useful for the construction of linkage maps, population genetic studies and so on.

Acknowledgments We appreciate great advice and comments from Dr. Quoqing Lu at the University of Nebraska at Omaha, USA and Dr. Moli Huang at Soochow University, China. This work was supported by the Science Fund of Jiangsu Province (BE2012354), Agricultural Science and Technology Innovation Fund of Jiangsu Province (CX(13)2042) and by the Aquatic Three Project of Jiangsu Province (PJ2011-62, DY2012-3), China. Appendix A. Additional files Additional file 1a: Annotation of Paramisgurnus dabryanus transcripts with ORF. Additional file 1b: Annotation of Paramisgurnus dabryanus transcripts without ORF. Additional file 2: Comparison of Paramisgurnus dabryanus transcript length with hits against Nr database. Additional file 3: Results of Paramisgurnus dabryanus transcript comparison with Danio rerio cDNA database. Additional file 4: Paramisgurnus dabryanus transcriptome comparison with Danio rerio cDNA database. Additional file 5: KEGG pathways assigned by Paramisgurnus dabryanus transcripts. Additional file 6: The number and corresponding proportion for each specific SSR motif in Paramisgurnus dabryanus transcriptome.

207

References Alam, M.S., Islam, M.S., 2005. Population genetic structure of Catla catla (Hamilton) revealed by microsatellite DNA markers. Aquaculture 246, 151–160. Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389–3402. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J., Richardson, J.E., Ringwald, M., Rubin, G.M., Sherlock, G., 2000. Gene ontology: tool for the unification of biology. Nat. Genet. 25, 25–29. Botstein, D., White, R.L., Skolnick, M., Davis, R.W., 1980. Construction of a genetic linkage map in man using restriction fragment length polymorphisms [J]. Am. J. Hum. Genet. 32 (3), 314–331. Cavagnaro, P.F., Senalik, D.A., Yang, L., Simon, P.W., Harkins, T.T., Kodria, C.D., Huang, S., Weng, Y., 2010. Genome-wide characterization of simple sequence repeats in cucumber (Cucumis sativus L.). BMC Genomics 11, 569–586. Chen, D., Li, Q., Kong, L., 2010. Development and characterization of 47 genic microsatellite markers of the loach, Misgurnus anguillicaudatus. J. World Aquac. Soc. 41 (1), 163–167. Chistiakov, D.A., Hellemans, B., Volckaert, F.A.M., 2006. Microsatellites and their genomic distribution, evolution, function and applications: a review with special reference to fish genetics. Aquaculture 255, 1–29. Crawford, J.E., Guelbeogo, W.M., Sanou, A., Traoré, A., Vernick, K.D., Sagnon, N.F., Lazzaro, B.P., 2010. De novo transcriptome sequencing in Anopheles funestus using Illumina RNA-seq technology. PLoS ONE 5 (12), e14202. http://dx.doi.org/10.1371/journal. pone.0014202. Franchini, P., Van der Merwe, M., Roodt-Wilding, R., 2011. Transcriptome characterization of the South African abalone Haliotis midae using sequencing-by-synthesis. BMC Res. Notes 4, 59 (http://www.biomedcentral.com/1756-0500/4/59). Fu, B., He, S., 2012. Transcriptome analysis of silver carp (Hypophthalmichthys molitrix) by paired-end DNA sequencing. DNA Res. 19, 131–142. Gao, Z., Luo, W., Liu, H., Zeng, C., Liu, X., Yi, S., Wang, W., 2012a. Transcriptome analysis and SSR/SNP markers information of the blunt snout bream (Megalobrama amblycephala). PLoS ONE 7 (8), e42637. Gao, X., Han, J., Lu, Z., Li, Y., He, C., 2012b. Characterization of the spotted seal Phoca largha transcriptome using Illumina paired-end sequencing and development of SSR markers. Comp. Biochem. Phys. D 7, 277–284. Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q., Chen, Z., Mauceli, E., Hacohen, N., Gnirke, A., Rhind, N., Palma, F.D., Birren, B.W., Nusbaum, C., Lindblad-Ton, K., Friedman, N., Regev, A., 2011. Full length transcriptome assembly from RNA-seq data without a reference genome. Nat. Biotechnol. 29, 644–652. Gupta, P.K., Rustgi, S., 2004. Molecular markers from the transcribed/expressed region of the genome in higher plants. Funct. Integr. Genomics 4 (3), 139–162. Hulaka, M., Kaspar, V., Kohlmann, K., Coward, K., Tešitel, J., Rodina, M., Gela, D., Kocour, M., Linhart, O., 2010. Microsatellite-based genetic diversity and differentiation of foreign common carp (Cyprinus carpio) strains farmed in the Czech Republic. Aquaculture 298, 194–201. Ji, P., Liu, G., Xu, J., Wang, X., Li, J., Zhao, Z., zhang, X., Zhang, Y., Xu, P., Sun, X., 2012. Characterization of common carp transcriptome: sequencing, de novo assembly, annotation and comparative genomics. PLoS ONE 7 (4), e35152. Josep, A.C.G., Azucena, B.N., Laura, B.P., Itziar, E., Gabriel, B.L., Ariadna, S.B., Jaume, P.S., 2013. Deep sequencing for de novo construction of a marine fish (Sparus aurata) transcriptome database with a large coverage of protein-coding transcripts. BMC Genomics 14, 178. Jun, T.H., Michel, A.P., Mian, M.A.R., 2011. Development of soybean aphid genomic SSR markers using next generation sequencing. Genome 54, 360–367. Kanehisa, M., Goto, S., Hattori, M., Aoki-Kinoshita, K.F., Itoh, M., Kawashima, S., Katayama, T., Araki, M., Hirakawa, M., 2006. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 34, D354–D357. Kanehisa, M., Araki, M., Goto, S., Hattori, M., Hirakawa, M., Itoh, M., Katayama, T., Kawashima, S., Okuda, S., Tokimatsu, T., Yamanishi, Y., 2008. KEGG for linking genomes to life and the environment. Nucleic Acids Res. 36 (Suppl. 1), D480–D484. Li, C.J., Ling, Q.F., Ge, C., Zhu, P.F., Wang, G.C., Ye, Z.Q., Han, X.F., 2014. Genetic diversity of Paramisgurnus dabryanus from four big lakes in China. Jiangsu J. Agric. Sci. 30 (5), 1087–1094. Marioni, J., Mason, C., Mane, S., Stephens, M., Gilad, Y., 2008. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 18, 1509–1517. Meng, Q.W., Su, J.X., Miao, X.Z., 1995. Fish taxology. China agricultural publishing company, Beijing. Metzker, M.L., 2010. Sequencing technologies—the next generation. Nat. Rev. Genet. 11 (1), 31–46. Nagalakshmi, U., Wang, Z., Waern, K., Shou, C., Raha, D., Gerstein, M., Snyder, M., 2008. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320, 1344–1349. O'Connell, M., Wright, J.M., 1997. Microsatellite DNA in fishes. Rev. Fish Biol. Fish. 7, 331–363. Pereiro, P., Balseiro, P., Romero, A., Dios, S., Forn-Cuni, G., Fuste, B., Planas, J.V., Beltran, S., Novoa, B., 2012. High-throughput sequence analysis of turbot (Scophthalmus maximus) transcriptome using 454-pyrosequencing for the discovery of antiviral immune genes. 7, e35369. Perry, J.C., Rowe, L., 2011. Rapid microsatellite development for water striders by next-generation sequencing. J. Hered. 102 (1), 125–129.

208

C. Li et al. / Gene 557 (2015) 201–208

Rexroad, C.E., Palti, Y., Gahr, S.A., Vallejo, R.L., 2008. A second generation genetic map for rainbow trout (Oncorhynchus mykiss). BMC Genet. 9, 74. http://dx.doi.org/10.1186/ 1471-2156-9-74. Rozen, S., Skaletsky, H., 2000. Primer3 on the WWW for general users and for biologist programmers. Methods Mol. Biol. 132, 365–386. Saha, M., Mian, M., Eujayl, I., Zwonitzer, J., Wang, L., May, G., 2004. Tall fescue EST–SSR markers with transferability across several grass species. Theor. Appl. Genet. 109 (4), 783–791. Toth, G., Gaspari, Z., Jurka, J., 2000. Microsatellites in different eukaryotic genomes: survey and analysis. Genome Res. 10 (7), 967–981. Varshney, R.K., Graner, A., Sorrells, M.E., 2005. Genic microsatellite markers in plants: features and applications. Trends Biotechnol. 23 (1), 48–55. Wang, Y.J., Li, D.X., 2005. Research on biological characters and exploitations of Paramisgurnus dabryanus Sauvage. Spec. Wild Econ. Anim. Plant Res. (1), 60–62. Wang, C.M., Zhu, Z.Y., Lo, L.C., Feng, F., Lin, G., Yang, W.T., Li, J., Yue, G.H., 2007. A microsatellite linkage map of Barramundi, Lates calcarifer. Genetics 175, 907–915. Wei, L.B., Zhang, H.Y., Zheng, Y.Z., Guo, W.Z., Zhang, T.Z., 2008. Developing EST-derived microsatellites in sesame (Sesamum indicum L.). Acta Agron. Sin. 34 (12), 2077–2084. You, C.H., Tong, J.G., Yu, X.M., 2012. Microsatellite DNA analysis on genetic diversity of seven populations of Paramisgurnus dabryanus. J. Hydroecology 33 (1), 84–91. Zalapa, J.E., Cuevas, H., Zhu, H., Steffan, S., Senalik, D., Zeldin, E., McCown, B., Harbut, R., Simon, P., 2012. Using next-generation sequencing approaches to isolate simple sequence repeat (SSR) loci in the plant science. Am. J. Bot. 99 (2), 193–208.

Zhang, Y., Xu, P., Lu, C., Kuang, Y.Y., Zhang, X.F., Cao, D.C., Li, C., Chang, Y.M., Hou, N., Li, H.D., Wang, S., Sun, X.W., 2011. Genetic linkage mapping and analysis of muscle fiber-related QTLs in common carp (Cyprinus carpio L.). Mar. Biotechnol. 13, 376–392. Zhang, M., Li, Y., Yao, B., Sun, M., Wang, Z., Zhao, Y., 2013. Transcriptome sequencing and de novo analysis for Oviductus Ranae of Rana chensinensis using Illumina RNA-Seq technology. J. Genet. Genomics 40, 137–140. Zhao, Z.S., Gao, G.Q., Yin, G., Wang, X.W., Chen, W.H., 1999. Nutrition analysis of Misgurnus anguillicaudatus and Paramisgurnus dabryanus. Reserv. Fish. 19 (2), 16–17. Zhao, Q., Wang, Y., Kong, Y., Luo, D., Li, X., Hao, P., 2011. Optimizing de novo transcriptome assembly from short-read RNA-seq data: a comparative study. Bioinformatics 12 (Suppl. 14), S2. http://dx.doi.org/10.1186/1471-2015-12-S14-S2. Zhu, S.L., Wan, Q., Li, F., Gan, X.S., Zhang, H.Q., Tuan, X.S., Cao, B.L., 2007. Determination of fecundity and artificial propagation of Paramisgurnus dabryanus. Mod. Agric. Sci. Technol. (13), 171–174. Zhu, H., Senalik, D., Mccown, B.H., Zeldin, E.L., Speers, J., Hyman, J., Bassil, N., Hummer, K., Simon, P.W., Zalapa, J.E., 2011. Mining and validation of pyrosequenced simple sequence repeats (SSRs) from American cranberry (Vaccinium macrocarpon Ait.). Theor. Appl. Genet. http://dx.doi.org/10.1007/s00122-011-1689-2.