Comparison of different sequencing and assembly strategies for a repeat-rich fungal genome, Ophiocordyceps sinensis

Comparison of different sequencing and assembly strategies for a repeat-rich fungal genome, Ophiocordyceps sinensis

    Comparison of different sequencing and assembly strategies for a repeat-rich fungal genome, Ophiocordyceps sinensis Yi Li, Tom Hsiang...

368KB Sizes 4 Downloads 68 Views

    Comparison of different sequencing and assembly strategies for a repeat-rich fungal genome, Ophiocordyceps sinensis Yi Li, Tom Hsiang, Rui-Heng Yang, Xiao-Di Hu, Ke Wang, Wen-Jing Wang, Xiao-Liang Wang, Lei Jiao, Yi-Jian Yao PII: DOI: Reference:

S0167-7012(16)30155-5 doi: 10.1016/j.mimet.2016.06.025 MIMET 4941

To appear in:

Journal of Microbiological Methods

Received date: Revised date: Accepted date:

20 March 2016 21 June 2016 21 June 2016

Please cite this article as: Li, Yi, Hsiang, Tom, Yang, Rui-Heng, Hu, Xiao-Di, Wang, Ke, Wang, Wen-Jing, Wang, Xiao-Liang, Jiao, Lei, Yao, Yi-Jian, Comparison of different sequencing and assembly strategies for a repeat-rich fungal genome, Ophiocordyceps sinensis, Journal of Microbiological Methods (2016), doi: 10.1016/j.mimet.2016.06.025

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT REVISED Comparison of different sequencing and assembly strategies for a repeat-rich fungal genome, Ophiocordyceps sinensis

RI

PT

Yi Li a,b, Tom Hsiang c, Rui-Heng Yang a,d, Xiao-Di Hu a,d, Ke Wang a,d, Wen-Jing Wanga, XiaoLiang Wang a, Lei Jiao a & Yi-Jian Yao a,* a

Corresponding author.

D

*

MA

NU

SC

State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, NO.1 Beichen West Road, Chaoyang District, Beijing 100101, China b Fujian Province Key Laboratory of Plant Virology, Institute of Plant Virology, Fujian Agricultural and Forestry University, No.15 Shangxiadian Road, Cangshan District, Fuzhou, 350002, China c School of Environmental Sciences, University of Guelph, 50 Stone Road East, Guelph, Ontario, N1G 2W1, Canada d University of Chinese Academy of Sciences, 19 A Yuquan Rd, Shijingshan District, Beijing 100049, China.

AC CE P

TE

E-mail address: [email protected] (Y.-J. Yao)

1

ACCEPTED MANUSCRIPT Abstract

AC CE P

TE

D

MA

NU

SC

RI

PT

Ophiocordyceps sinensis is one of the most expensive medicinal fungi world-wide, and has been used as a traditional Chinese medicine for centuries. In a recent report, the genome of this fungus was found to be expanded by extensive repetitive elements after assembly of Roche 454 (223 Mb) and Illumina HiSeq (10.6 Gb) sequencing data, producing a genome of 87.7 Mb with an N50 scaffold length of 12 kb and 6,972 predicted genes. To test whether the assembly could be improved by deeper sequencing and to assess the amount of data needed for optimal assembly, genomic sequencing was run several times on genomic DNA extractions of a single ascospore isolate (strain 1229) on an Illumina HiSeq platform (25 Gb total data). Assemblies were produced using different data types (raw vs. trimmed) and data amounts, and using three freely available assembly programs (ABySS, SOAP and Velvet). In nearly all cases, trimming the data for low quality base calls did not provide assemblies with higher N50 values compared to the non-trimmed data, and increasing the amount of input data (i.e. sequence reads) did not always lead to higher N50 values. Depending on the assembly program and data type, the maximal N50 was reached with between 50% to 90% of the total read data, equivalent to 100 to 200 coverage. The draft genome assembly was improved over the previously published version resulting in a 114 Mb assembly, scaffold N50 of 70 kb and 9,610 predicted genes. Among the predicted genes, 9,213 were validated by RNA-Seq analysis in this study, of which 8,896 were found to be singletons. Evidence from genome and transcriptome analyses indicated that species assemblies could be improved with defined input material (e.g. haploid mono-ascospore isolate) without the requirement of multiple sequencing technologies, multiple library sizes or data trimming for low quality base calls, and with genome coverages between 100 and 200. Key words: genome; sequencing; assembly; coverage; trimming; RNA-Seq

1. Introduction

Since the draft of the human genome was produced by conventional Sanger sequencing (International Human Genome Sequencing Consortium, 2001), incredible advances have been made in high-throughput sequencing technologies, otherwise known as Next Generation Sequencing (NGS). Several second generation sequencing platforms have been developed including Roche 454 (GS FLX Titanium System), SOLiD System, Ion Torrent and Illumina Sequencing. The second generation sequencing technologies could produce massive amounts of data in a short time at a much lower cost, but with a shorter read length than Sanger sequencing. The short reads produced by NGS pose a problem in assembly of repeat regions since repeated sequences with high similarity may be inaccurately placed during assembly (Treangen et al., 2012). Furthermore, these technologies also have difficulties in accurate sequencing of regions extremely rich in GC or AT (Kozarewa et al., 2009; Quail et al., 2012). The third generation technologies, such as those developed by Helicos Biosciences (www.helicosbio.com) and Pacific Biosciences (www.pacificbiosciences.com), give high-throughput with longer reads and an amplification-free workflow which facilitates the assembly process and also sequencing of GC or AT-rich regions, but at higher prices and higher error rates per 2

ACCEPTED MANUSCRIPT

AC CE P

TE

D

MA

NU

SC

RI

PT

sequenced base than other NGS technologies (Pettersson et al., 2009; Miller et al., 2010). Currently, many hundreds of fungal genomes are publicly available via the NCBI database (http://www.ncbi.nlm.nih.gov/genome/) and other sources, and many more sequencing projects for fungi are going on. In addition to genome sequencing, NGS technologies can also be used for transcriptome analyses, such as RNA-Seq, which allow enumeration of expressed transcripts (Wang et al., 2009). With greater sequencing depth, this provides precise measurement of transcript levels, and with proper software, the discrimination of isoforms (Wang et al., 2009). Moreover, RNA-Seq can also improve the genome assembly (Mortazavi et al., 2010) and annotation (Li et al., 2011). Ophiocordyceps sinensis (Berk.) G.H. Sung, J.M. Sung, Hywel-Jones & Spatafora, an insect pathogen belonging to the class Ophiocordycipitaceae, is a highly valued medicinal fungus which has been used in traditional Chinese medicine for centuries (Pegler et al., 1994). Because of the increasing price and scarcity, it has gained significant scientific attention in recent decades. However, the reproductive biology and conditions required for sporulation are not understood. Cultivation of the fungus for ascocarp production has not yet been commercially successful, although many efforts have been made in the last several decades. It has been recently reported that both MAT1-1 and MAT1-2 were presented in a single genome and expressed in vegetative mycelia, providing evidence that O. sinensis is likely homothallic, capable of selfing (Bushley et al., 2013). This result has also been supported by genome sequencing (Hu et al., 2013). The genome was sequenced using a combination of Roche 454 GS FLX (223 Mb with an average read length of 400 bp) and Illumina HiSeq 2000 platform with three different sized libraries (200 bp, 500 bp and 2 kb) to produce an assembly of 87.7 Mb (Hu et al., 2013). The estimated genome size (120 Mb) is approximately three times larger than other insect ascomycete pathogens due to repeat driven expansion. Their number of predicted genes (6,972) was much fewer than that of a closely related fungus, Cordyceps militaris (9,684), with a much smaller genome size (32.2 Mb) (Zheng et al., 2011), and also fewer than the number of expressed genes detected from transcriptome analysis of O. sinensis (Xiang et al., 2014; Liu et al., 2015). This first draft genome assembly of O. sinensis is likely far from complete. To test whether the assembly could be improved, this study was designed to sequence a single sized library of 500 bp from a mono-ascospore isolate of O. sinensis using Illumina HiSeq 2000 and 2500 technology. A single run of RNA-Seq was added to support the gene predictions. Different data types (raw and trimmed) and amounts were employed using three freely available assembly tools: ABySS (Simpson et al., 2009), SOAP (Li et al., 2008) and Velvet (Zerbino et al., 2008). This study examined how much sequencing coverage (i.e. data) would be needed for a draft assembly of an apparently complicated fungal genome, and evaluated which software could produce the best assembly with gene predictions supported by RNA-Seq data. The results presented here are helpful to researchers contemplating sequencing of their own target genomes, and provide a useful guide for sequencing of other repeat-rich fungal genomes.

2. Materials and methods 2.1. Fungal growth and culture 3

ACCEPTED MANUSCRIPT

NU

SC

RI

PT

Strain 1229 of O. sinensis used in this study was isolated from a single ascospore of a mature specimen originally collected from Guoluo, Qinghai, China, following the method described by Jiao (2010). The stock strain was maintained at 10 °C on Potato Dextrose Agar (PDA) supplemented with 5 g/l wheat bran and 0.5 g/l peptone (wheat bran medium). Seed cultures were grown in 250-ml Erlenmeyer flasks, containing 50 ml of wheat bran liquid culture medium. The flasks were maintained at 18 °C and 100 rpm for 15 d. A new 250-ml Erlenmeyer flask containing 50 ml medium was then inoculated with 10 ml of seed culture and incubated under the same conditions. The fresh mycelia were harvested after 25 d of incubation and washed with distilled water. The fresh mycelial pellets were used immediately for RNA extraction, while for DNA extraction, mycelia were first frozen under −40 °C overnight and then vacuum freeze dried using a freeze dryer (VirTis Co., Gardiner, NY) at room temperature for 1 d and stored at −80 °C until use.

MA

2.2. DNA and RNA extraction and preparation for sequencing

AC CE P

TE

D

Approximately 250 mg dry mycelia were ground with liquid nitrogen and incubated in a 30 ml Eppendorf tube containing 10 ml CTAB buffer at 65°C for 2 h, with 1% β-Mercaptoethanol and 10 μl 10mg/ml RNaseA. An equal volume of chloroform-isoamyl alcohol (24:1) was added to the mixture and centrifuged at 12,000 rpm, 4 °C for 15 min. The supernatant was transferred to a new tube, and the chloroform-isoamyl alcohol extraction was repeated until no more precipitate formed. DNA was precipitated with 2:3 (vol:vol) of cold isopropanol and 1:10 (vol:vol) of 3 M NaAc (pH = 5.2) at −20 °C for at least 2 h, and centrifuged at 10,000 rpm for 10 min at 4 °C. The pellet was washed twice, first with 70% and then with 100% cold ethanol. Air dried DNA was dissolved in 10 mM Tris-HCl (pH = 8.0). The amount and quality of total DNA was visualized by running out on a 1% agarose gel and quantified with a NanoDrop 1000 Spectrophotometer (Thermo Scientific). Approximately 250 mg dry mycelium of O. sinensis isolate 1229 was obtained from each 250-ml Erlenmeyer flask producing about 80 µg genomic DNA. A single paired-end (PE) DNA library of 500 bp was prepared with 40 µg total genomic DNA for sequencing. Total RNA was extracted from fresh mycelia with TRIzol® Reagent (Life Technologies, Inc., Grand Island, NY) following manufacture’s protocol. Dissolved RNA precipitates were treated with DNase I (GenStar Biosolutions Co., Ltd., Beijing, China) according to the manufacture’s instruction. The quality and concentration of the RNA were respectively assayed in an Agilent 2100 Bioanalyzer and using an Agilent RNA 6000 Nano kit. Mycelium from each 250 ml flask produced approximately 150 µg total RNA. RNA extracts (150 µg) with high purity and quality were selected for cDNA library construction. Oligo (dT) magnetic beads were used for purifying the mRNA from the total RNA. Fragmentation buffer treated mRNA (200 nt) were used as the template for doublestranded cDNA library construction. A 200 nt cDNA library was constructed with the NEBNext Ultra Directional RNA Library Prep Kit for Illumina. 2.3. Genome and transcriptome sequencing

4

ACCEPTED MANUSCRIPT

SC

RI

PT

The 500 bp DNA library of O. sinensis, strain 1229, was sequenced using Illumina HiSeq technology. Three separate sequencing runs were performed. The first run of 90 bp PE sequencing was carried out on an Illumina HiSeq 2000 system at BGI (Beijing Genomics Institute, Beijing, China) with a stipulation of more than 30 coverage. Another two runs of 100 bp PE sequencing were done on the Illumina HiSeq 2500 platform at Nextomics (Wuhan, China) with a stipulation of 6 Gb and 4 Gb trimmed data, respectively. The purpose of these multiple runs was to obtain data in order to mix and match different amounts to see their effect on final assembly results, and also as a form of replication to see how different runs might yield different results. The RNA-Seq of the 200 nt cDNA library was done using 100 bp PE sequencing on Illumina HiSeq 2500 at Nextomics, with a target of 4 Gb trimmed data.

NU

2.4. De novo genome assembly and annotation

AC CE P

TE

D

MA

Both trimmed and raw data were received in FASTQ.GZ format. Data trimming refers to removing the low quality reads (reads containing more than 50% base pairs with a quality score lower than 5 on 0 to 41 scale), and reads above a certain proportion (10%) of N (ambiguous sites). As well, adapter sequences were removed. The genome data were used in all possible combinations, but keeping trimmed and raw datasets separate. Three different programs were used for genome assembly: ABySS ver. 1.2.3 (www.bcgsc.ca/platform/bioinfo/software/ABySS; Simpson et al., 2009), SOAPdenovo ver. 1.1.2 (soap.genomics.org.cn/soapdenovo.html; Li et al., 2008) and Velvet ver 1.0.12 (www.ebi.ac.uk/~zerbino/velvet; Zerbino et al., 2008). For each set of data and each program, K-mer values ranging from 41 to 91 were tested for assembly. The quality of the genome assemblies was evaluated with the scaffold N50 statistic. The N90 statistic was also calculated providing information on the median weighted size of the smallest 90th percentile. Genes were predicted using Augustus ver 3.0.2 (http://bioinf.uni-greifswald.de/augustus), based on the Fusarium graminearum gene model provided with the software since both Fusarium spp. and O. sinensis are Sordariomycetes in the order Hypocreales. Assembly of Co18 provided by Hu et al. (2013) was also annotated following the same gene prediction procedure. 2.5. Transcriptome assembly A de-novo transcriptome assembly was generated from the RNA-Seq reads using Trinity ver. 2013-02-25 (http://trinityrnaseq.sourceforge.net; Grabherr et al., 2011). Reads containing adapter sequences or reads containing more than 30% of N or base pairs with a quality score lower than 20 were trimmed prior to assembly. Assembled transcripts were compared to the gene prediction result. 2.6. Assessment of genome sequencing, assembly and gene predictions Genome sequencing, assembly and gene predictions were assessed in different ways: 1) comparison of assembly size to estimated genome size; 2) comparison of predicted genes to eukaryotic core genes created by Parra et al. 2009; and 3) validation of gene predictions by RNA-Seq data. 5

ACCEPTED MANUSCRIPT 3. Results

PT

3.1. Sequence data generation

NU

SC

RI

PE sequencing with a single insert size (500 bp) was used for genome sequencing of O. sinensis. Three sets of sequencing data were produced for this study (Supplmental Table S1). Each of the three sets had a raw version and a trimmed version. Set 0 which was produced by BGI (Shenzhen, China) yielded 2.8109 bp of raw data and 2.6109 bp of trimmed data. Set 1 from a single run conducted by Nextomics (Wuhan, China) yielded 13.6109 bp raw data, and trimming yielded 9.3109 bp. A further run was then added by Nextomics which yielded 8.6109 bp raw data and 6.6109 bp trimmed data. In general, filtering of the raw data gave trimmed versions that were ~70% of the original database sizes.

MA

3.2. Performance comparison of different data types and amounts.

AC CE P

TE

D

There are known differences in the ability of the three programs (ABySS, SOAP and Velvet) to assemble the data from the millions of read sequences into larger consensus sequences (contigs) and into scaffolds (Haridas et al., 2011). To investigate the effects of data types and amounts on O. sinensis genome assembly, raw and trimmed datasets of three separate sequencing runs were combined in all possible combinations keeping raw and trimmed datasets separate (Supplemental Table S1), and the data were subjected to genome assembly using the three programs. Except for two cases, i.e. set 0 assembled with SOAP and set 1 & 2 assembled with ABySS, the scaffold N50 values of raw data assemblies were always larger than their counterparts using trimmed data (Fig. 1, Supplemental Table S1). Increasing the number of reads did not always lead to higher N50 values (Fig. 1). The N50 values for the assemblies seemed to plateau after reaching a saturation point in terms of sequence data, followed by a possible dip, if the optimal range for assembly was exceeded. In most cases, ABySS performed better than SOAP and Velvet in providing higher N50 values. Among 14 different assemblies for each of three programs, the largest N50 values (70,237 bp) were produced by ABySS with trimmed datasets 1 & 2 (Supplemental Table S2). Depending on the assembly program and data type, the maximal N50 was reached with between 50% to 90% of the total read data, equivalent to 109 to 200 coverage.

6

NU

SC

RI

PT

ACCEPTED MANUSCRIPT

MA

Fig. 1. The relationship between the number of nucleotides used in an assembly and the highest N50 value obtained from assembly using raw or clean (trimmed) input data with ABySS or SOAP.

3.3. General genome features

AC CE P

TE

D

The assembly sizes for the three programs across the various sized databases were on average 117 Mb, 133 Mb and 114 Mb for ABySS, SOAP and Velvet, respectively (Supplemental Table S2). The coverage for the highest scaffold N50 assemblies for each of the programs in the current study ranged from 160 to 195 (Supplemental Table S3). The combination of datasets 1 & 2 gave two of the best assemblies than the other possible combinations, with scaffold N50 values of 69,970 bp for raw and 70,237 bp for the trimmed datasets. Table 1 Assembly statistics for Ophiocordyceps sinensis isolates 1229 and Co18, and Cordyceps militaris Cm01. Assembly program and isolate of O. sinensis Characteristic Assembly of C. militaris ABySS-1229 SOAP-1229 Velvet-1229 SOAP-Co18a Cm01b Sequence reads (Gb)

22.3

22.3

22.3

11

4.7

Estimated genome size (Mb)

~ 139

~ 139

~ 139

~120

32.2

Coverage (fold)

195 

160 

187 

241 

147 

GC content (%)

44.7

44.2

41.5

46.1

51.4

114

139

119

87.7

32.2

69,970 / 91

27,304 / 89

49,085 / 87

11,986 / N/A 4,600,000

Scaffold N90 (bp)

17,666

118

12,692

N/A

N/A

Number of scaffolds

12,238

208,216

13,182

10,603

33

Number of predicted genes

9,610

9,714

9,276

6,972

9,684

Assembly size (Mb) Scaffold N50 (bp) / K-mer

c

a

These data are from Hu et al. (2013). N/A = not available. These data are from Zheng et al. (2011). The number of reads was calculated from the stated coverage multiplied by the assembly size. N/A = not available. c K-mer that produced highest scaffold N50 values. b

7

ACCEPTED MANUSCRIPT 3.4. Gene prediction

AC CE P

TE

D

MA

NU

SC

RI

PT

The trimmed dataset using ABySS produced a slightly larger scaffold N50 for the assembly (70,237 bp) than the raw dataset (69,970 bp), but because the raw datasets gave much higher N50 values in the assemblies than the trimmed for SOAP and Velvet, the genome assemblies from the raw dataset of strain 1229 (sets 1 & 2) composed of 22.3109 nucleotides (Table 1) was used for gene prediction. The assemblies from ABySS, SOAP and Velvet gave predictions using Augustus of 9,610, 9,714 and 9,276 genes, respectively, all of which were much more than 6,972 genes predicted by Hu et al. (2013), implying an improved draft genome assembly for this species. Further, it was surprising to find that a total of 8,996 genes (Table 2) were predicted following the procedure used in this study in a test using the genome assembly of Co18 produced by Hu et al. (2013); over a quarter of genes more than the original prediction by Hu et al. (2013) were detected. To assess the validity of the gene predictions and hence the assembly itself, two methods were used including examination of core genes (Parra et al. 2009) by comparative genomics, and RNA-Seq evidence for gene expression. Parra et al. (2009) created a core eukaryotic set of 248 genes to evaluate genome assemblies and predictions by checking whether an assembled genome and its predicted genes might contain all of these highly conserved genes. All three sets of predictions contained 246 out of 248 core genes at evalue<1e-5 with one gene at evalue>1e-5 (from 0.015 to 2e-5). The missing gene was a 60S ribosomal gene. In comparison, the assembly of Co18 (Hu et al., 2013) was found to be missing 16 core genes. For the RNA-Seq data, a total of 57,885,858 original reads (raw data) were generated from Illumina HiSeq 2500 sequencing. After excluding the low-quality and adapter-contaminated reads, 46,748,662 modified reads (trimmed data, 80.76% of the raw) were retained and used for transcriptome assembly. De novo assembly of the trimmed data set (~6.5 Gb) yields 84,432 contigs totaling 225 Mb. Out of 9,610 predicted genes of strain 1229 using the raw data of combined sets 1&2, 95.9% (9,213) had RNA-Seq evidence with an RNA-Seq contig matching at evalue=1e-5 or lower. Among the 9,213 genes, 8,896 were found to be singletons. Even with very strict matching stringency (evalue=1e-50), the number with RNA-Seq evidence was still surprisingly high, 8,956 out of 9,610 (93.2%). The RNA-Seq raw reads were mapped to 9,610 predicted genes. The mapped read numbers and FPKM (fragments per kilobase of exon per million fragments mapped) values were calculated. A total of 8,875 out of 9,610 predicted genes were covered by more than one read (Supplemental Table S4). The predicted gene sets resulting from different assembly programs, using the raw data of combined set 1 & 2, were compared to each other using BLASTP to assess how different the assemblies might be. The results showed that the ABySS assembly had over 96% of its predicted genes in common with the SOAP or the Velvet assemblies (Table 2). The 9,610 genes predicted by ABySS assembly were compared against the 6,972 genes predicted by Hu et al. (2013) and showed 7,370 top matches at cut off e-value<1e-5 (Table 2), indicating that some predicted genes of strain Co18 were matched by more than one gene of strain 1229. In a reciprocal BLAST, 6,851 out of the 6,972 predicted genes of Co18 showed matches (e-value<1e-5) with predicted genes of strain 1229, indicating that the 9,610 predicted genes from our strain encompassed nearly all (98.3%) gene predictions of the strain Co18 by Hu et al. (Table 2). Among the genes predicted in the strain 1229 8

ACCEPTED MANUSCRIPT

SC

RI

PT

assembly that were not predicted for strain Co18 (9610−6972=2638), RNA-Seq evidence supported 2,062 at evalue<1e-5 and 1,909 at evalue<1e-50 indicating that the previous gene predictions (Hu et al., 2013) were missing a large portion of genes which had experimental evidence. The published draft genome assembly data of strain Co18 (Hu et al. 2013) were also subjected to our gene prediction system, and this yielded 8,996 predicted genes compared to their 6,972 genes. When these 8,996 genes were compared to the 9,610 from our local sequenced strain 8,646 showed matches at evalue<1e-5, and 863 showed no matches. Among these 863 genes without matches, only seven were found to be duplicates, leaving 856 that might be unique to strain 1229. Among these 856 genes, 94.6% (810) were supported by RNA-Seq data.

4. Discussion

AC CE P

TE

D

MA

NU

Table 2 Blast comparisons between genes predicted from different genome assemblies of Ophiocordyceps sinensis. BLAST database (gene number) at different e-value cut-offs Query set SOAP-Co18 (8,996b) ABySS-1229 (9,610) SOAP-Shanghai (6,972a) (gene number) 1e-20 1e-5 1e-20 1e-5 1e-20 1e-5 ABySS-1229 (9,610) NA NA 6859 7370 NA NA SOAP-1229 (9,714) 9,350 9,477 NA NA NA NA Velvet-1229 (9,276) 9,061 9,116 NA NA NA NA SOAP-Shanghai (6,972) 6,783 6,851 NA NA 6850 6890 SOAP-Co18 (8,996) 8,579 8,732 6877 7234 NA NA a These gene predictions (6972) are from Hu et al. (2013). b Gene prediction with the assembly from Hu et al. (2013) following our gene prediction pipeline yielded 8996 genes.

Hu et al. (2013) published a 87.7 Mb draft genome of O. sinensis and stated that the genome size should be closer to ~120 Mb based on K-mer calculations. Almost 7,000 genes were predicted from this draft assembly, however, EST library sequencing revealed many more expressed genes (9,641) (Xiang et al., 2014). The results of the present study gave a draft genome size for a haploid single ascospore isolate of O. sinensis at between 114 and 139 Mb (with highly repetitive regions absent from the assembly since Illumina HiSeq short read data does not give these regions), which is larger than the previous estimate of ~120 Mb by Hu et al. (2013). These estimated genome sizes for O. sinensis are unusually large, considering that the closely related species of Cordyceps s.l. usually have nuclear genomes of 30–50 Mb (www.ncbi.nlm.nih.gov/genome/browse/). Interestingly, the mitochondrial genome of this species is also expanded (157 Kb) compared with other hypocrealean species (from ~22 kb in Cordyceps bassiana to ~42 kb in Trichoderma reesei; Li et al., 2015). Long intergenic regions and numerous introns composing of abundant repetitive sequences which associate with mobile elements largely contributed to the mt genome expansion (Li et al., 2015). However, the genome expansion of O. sinensis experienced a different process which was driven by numerous retrotransposons (Hu et al., 2013). Similar processes have been observed in other ascomycetes with expanded genomes. For example, proliferation of transposable elements (TEs) accounted for 58% of the 125 Mb genome in 9

ACCEPTED MANUSCRIPT

AC CE P

TE

D

MA

NU

SC

RI

PT

Tuber melanosporum (Martin et al., 2010), and 64% of the genome size (~120 Mb) in a barley powdery mildew, Blumeria graminis f. sp. hordei (Spanu et al., 2010). TEs causing genome expansion has also been reported in basidiomycetes. As in Melampsora larici-populina and Puccinia graminis f. sp. tritici, TEs accounted for nearly 45% of the 101.1 Mb and 88.6 Mb assembly sizes, respectively (Duplessis et al., 2011). A filtering and trimming process of raw reads is often performed in de novo genome assembly, but it is still unclear that this process always results in assemblies of higher N50. We found that raw sequence reads yielded higher N50 values than the trimmed data in nearly all combinations in genome assembly of O. sinensis, 40 out of 42 (Table 2), no matter which software was used or how much data were assessed. Lower quality reads might not improve the assembly of genome sequence but may be valuable for finishing draft genome sequences (DiGuistini et al., 2009). Haridas et al. (2011) found that there was no advantage in trimming data for assembly of isolates of Grosmannia clavigera using ABySS or SOAP but improved assembly statistics were found with Velvet. Tritt et al. (2012) also reported that assembly accuracy and statistics are apparently unaffected by the read trimming process when using SOAP for assembling. The quality of a draft genome assembly could be affected by the sequence depth (average sequence coverage) and the repeat content. For the genome of O. sinensis, increasing the coverage did not always lead to better assembly (represented by higher N50; Fig. 1). The N50 values for the assemblies seemed to reach a plateau and followed by a possible dip. The highest scaffold N50 values were produced by ABySS when 15.0 Gb trimmed or 22.3 Gb raw data were used, meaning ~160  average coverage since the genome was estimated as 139 Mb. Depth of coverage higher than this is most likely unnecessary for de novo assembly of fungal genomes with high repeat content. Other sequencing technologies which produce long reads, such as PacBio sequencing, might provide even better assemblies. Different programs have been used for de novo assembly of shot-gun sequences from NGS data, and the performance of these programs may differ for different genomes and even different isolates (Haridas et al., 2011; Zhang et al., 2011). In this study, the three assembly programs yielded assembly sizes that showed ~20% difference, from 114 Mb to 139 Mb. SOAP produced the largest assembly size (139 Mb) with more genes (9,714) predicted and at the same time more short contigs retained when using raw datasets 1 & 2. Meanwhile, ABySS gave higher scaffold N50 values than SOAP or Velvet in 37 out of 42 combinations of different datasets. Hu et al. (2013) obtained an assembly for their sequencing of O. sinensis at 87.7 Mb using SOAP. This assembly size is smaller than that found here and a possible reason for this difference might be the source of the input material used for sequencing. In our study, a mono-ascospore isolate was used, whereas Hu et al. (2013) used an isolate derived from stroma which might have been composed of multiple genotypes, thereby complicating the assembly process. Only 6,972 protein coding genes were predicted in the previous study (Hu et al., 2013). This number of coding genes is much fewer than most other sequenced filamentous fungi. A great number of pseudogenes (1,110) were identified which is apparently much more than closely related species C. militaris (139), Beauveria bassiana (170), Metarhizium robertsii (112) and M. acridum (82) indicating the loss of massive functions such as ability to assimilate nitrate (Hu et al., 2013). Expanded genomes encoding fewer genes was thought to be a feature of some obligate biotrophs (Hu 10

ACCEPTED MANUSCRIPT

AC CE P

TE

D

MA

NU

SC

RI

PT

et al., 2013), e.g. the 125 Mb genome of T. melanosporum contains only 7,500 protein-coding genes (Martin et al., 2010) and the curated gene number of the ~120 Mb genome of Blumeria graminis f.sp. hordei is 5,854, which is at the lower end of the range of fungi (Spanu et al., 2010). Obligate biotrophs may have lost some metabolic pathways, including the metabolism of inorganic nitrogen, inorganic sulfur and thiamine, and genes encoding carbohydrate active enzymes and secondary metabolism enzyme (Baxter et al., 2010; Spanu, 2012). However, the transcriptome analyses using EST library sequencing of stroma of O. sinensis revealed more expressed genes (9,641) (Xiang et al., 2014) than the genome prediction by Hu et al. (2013). Even more surprisingly, a total of 12,980 unigenes were found to be expressed in cultured mycelia of the species in a recent study (Liu et al., 2015). The secondary metabolism clusters were largely conserved between O. sinensis and other insect pathogens (Hu et al., 2013) and enzymes associated with carbohydrate metabolism were found to be expressed (Xiang et al., 2014). The identified pseudogenes should be re-evaluated to check whether the fungus has really lost the relevant function or whether functional paralogs can be found in the genome. Furthermore, this fungus has been reported from soils in the natural habitats (Jin et al., 2013) and even in roots of plants (Zhong et al., 2014). Ophiocordyceps sinensis has a biotrophic phase which asymptomatically colonizes living larva of host insects and then switches to necrotrophy when the larvae are eventually dead (laboratory observation). These insights indicated that O. sinensis is probably not an obligate biotroph but a facultative saprophyte. Compared to the 8,996 predicted genes of Co18 which were obtained from the genome assembly by Hu et al. (2013) following the same gene prediction procedure used in this study, 856 genes were found to be unique to strain 1229. The surprisingly large difference in gene number may partly due to the genome completeness, 87.7 Mb with an N50 scaffold length of 12 kb for the strain Co18 compared to 114 Mb with scaffold N50 of 70 kb for 1229. Another possible reason could be the inherent differences among strains. In the rice blast fungus Magnaporthe oryzae, hundreds of isolate-specific genes were discovered in field isolates (Xue et al., 2012); large difference up to 2882 genes among isolates was also reported in the filamentous fungus Aspergillus niger (Andersen et al., 2011).

5. Conclusions

The previous genome sequencing of O. sinensis resulted in a draft assembly of 88.7 Mb with scaffold N50 of 11,986 bp and a predicted 6,972 protein coding genes, using a combination of both Roche 454 (223 Mb) and Illumina HiSeq (10.6 Gb) technology (Hu et al., 2013). By sequencing a single library of 500 bp of a mono-ascospore isolate of the fungus using Illumina HiSeq technology (22.3 Gb), we obtained a more complete genome assembly compared to the published version. Different data types and amounts were applied in de novo genome assembling using three freely available programs. Higher N50 values were achieved with raw data rather than the trimmed data. Increasing the amount of input data did not lead to better assemblies. ABySS performed better than SOAP and Velvet in assembling the repeat rich genome of O. sinensis. ABySS produced an assembly of 114 Mb for the mono-ascospore isolate 1229 of O. sinensis with genome size estimated as 139 Mb, scaffold N50 of 69,970 bp and 9,610 genes predicted. Among the 9,610 predicted genes, 11

ACCEPTED MANUSCRIPT

PT

9,213 were further supported by transcriptome analysis, of which 8896 were found to be singletons. Evidence from genome and transcriptome analyses indicated that species assemblies could be improved with defined input material (e.g. haploid mono-ascospore isolate) without the requirement of multiple sequencing technologies, multiple library sizes or data trimming for low quality base calls, and with genome coverages between 100 and 200.

RI

Acknowledgments

MA

Availability of supporting data

NU

SC

This work was supported by the National Science Foundation of China (31400018, 31170017 and 30025002), the Ministry of Science and Technology (2013BAD16B013 and 2007BAI32B03), the Qinghai Science & Technology Department (2014-NS-524 and 2014-NS-525) and the Chinese Academy of Sciences (KSCX2-YW-G-076, KSCX2-SW-101C and the scheme of Introduction of Overseas Outstanding Talents).

AC CE P

References

TE

D

The Whole Genome Shotgun project has been deposited at DDBJ/EMBL/GenBank under the accession LKHE00000000 and the RNA-Seq raw reads have been deposited in the GenBank Sequence Read Archive (SRA Accession ID: SRP064291).

Andersen, M.R., Salazar, M.P., Schaap, P.J., van de Vondervoort, P.J., Culley, D., Thykaer, J., Frisvad, J.C., Nielsen, K.F., Albang, R., Albermann, Kaj., Berka, R.M., Braus, G.H., BrausStromeyer, S.A., Corrochano, L.M., Dai, Z.Y., van Dijck, P.W.M., Hofmann, G., Lasure, L.L., Magnuson, J.K., Menke, H., Meijer, M., Meijer, S.L., Nielsen, J.B., Nielsen, M.L., van Ooyen, A.J.J., Pel, H.J., Poulsen, L., Samson, R.A., Stam, H., Tsang, A., van den Brink, J.M., Atkins, A., Aerts, A., Shapiro, H., Pangilinan, J., Salamov, A., Lou, Y.G. Lindquist, E., Lucas, S., Grimwood, Jane., Grigoriev, I.V., Kubicek, C.P., Martinez, D., van Peij, N.N.M.E., Roubos, J.A., Nielsen, J., Baker, S.E., 2011. Comparative genomics of citric-acid-producing Aspergillus niger ATCC 1015 versus enzyme-producing CBS 513.88. Genome Res. 21, 885–897. Baxter, L., Tripathy, S., Ishaque, N., Boot, N., Cabral, A., Kemen, E., Thines, M., Ah-Fong, A., Anderson, R., Badejoko, W., Bittner-Eddy, P., Boore, J.L., Chibucos, M.C., Coates, M., Dehal, P., Delehaunty, K., Dong, S.M., Downton, P., Dumas, B., Fabro, G., Fronick, C., Fuerstenberg, S.I., Fulton, L., Gaulin, E., Govers, F., Hughes, L., Humphray, S., Jiang, R.H. Y., Judelson, H., Kamoun, S., Kyung, K., Meijer, H., Minx, P., Morris, P., Nelson, J., Phuntumart, V., Qutob, D., Rehmany, A., Rougon-Cardoso, A., Ryden, P., Torto-Alalibo, T., Studholme, D., Wang, Y.C., Win, J., Wood, J., Clifton, S.W., Rogers, J., Van den Ackerveken, G., Jones, J.D.G., McDowell, J.M., Beynon, J., Tyler, Brett M., 2010. Signatures of adaptation to obligate biotrophy in the Hyaloperonospora arabidopsidis genome. Science. 330, 1549–1551. 12

ACCEPTED MANUSCRIPT

AC CE P

TE

D

MA

NU

SC

RI

PT

Bushley, K.E., Li, Y., Wang, W.J., Wang, X.L., Jiao, L., Spatafora, J.W., Yao, Y.J., 2013. Isolation of the MAT1-1 mating type idiomorph and evidence for selfing in the Chinese medicinal fungus Ophiocordyceps sinensis. Fungal Biol. 117, 599–610. DiGuistini, S., Liao, N.Y., Platt, D., Robertson, G., Seidel, M., Chan, S.K., Docking, T.R., Birol, I., Holt, R.A., Hirst, M., Mardis, E., Marra, M.A., Hamelin, R.C., Bohlmann, J., Breuil, C., Jones, St.J.M., 2009. De novo genome sequence assembly of a filamentous fungus using Sanger, 454 and Illumina sequence data. Genome Biol. 10, R94. Duplessis, S., Cuomo, C.A., Lin, Y.C., Aerts, A., Tisserant, E.,Veneault-Fourrey, C., Joly, D.L., Hacquard, S., Amselem, J., Cantarel, B.L., Chiu, R., Coutinho, P.M., Feau, N., Field, M., Frey, P., Gelhaye, E., Goldberg, J., Grabherr, M.G., Kodira, C.D., Kohler, A., Kües, U., Lindquist, E.A., Lucas, S.M., Mago, R., Mauceli, E., Morin, E., Murat, C., Pangilinan, J.L., Park, R., Pearson, M., Quesneville, H., Rouhier, N., Sakthikumar, S., Salamov, A.A., Schmutz, J., Selles, B., Shapiro, H., Tanguay, P., Tuskan, G.A., Henrissat, B., Van de Peer, Y., Rouzé, P., Ellis, J.G., Dodds, P.N., Schein, J.E., Zhong, S.B., Hamelin, R.C., Grigoriev, I.V., Szabo, L.J., Martin, Francis., 2011. Obligate biotrophy features unraveled by the genomic analysis of rust fungi. Proc Natl Acad Sci USA. 108, 9166–9171. Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q.D., Chen, Z.H., Mauceli, E., Hacohen, N., Gnirke, A., Rhind, N., di Palma, F., Birren, B.W., Nusbaum, C., Lindblad-Toh, K., Friedman, N., Regev, Aviv., 2011. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 29, 644–52. Haridas, S., Breuill, C., Bohlmann, J., Hsiang, T., 2011. A biologist's guide to de novo genome assembly using next-generation sequence data: a test with fungal genomes. J Microbiol Methods. 86, 368–375. Hu, X., Zhang, Y.J., Xiao, G.H., Zheng, P., Xia, Y.L., Zhang, X.Y., St Leger, R.J., Liu, X.Z., Wang, C.S., 2013. Genome survey uncovers the secrets of sex and lifestyle in caterpillar fungus. Chin Sci Bull. 58, 2846–2854. International Human Genome Sequencing Consortium., 2001. Initial sequencing and analysis of the human genome. Nature. 409, 860–921. Jiao, L., 2010. Phylogeographic study on Ophiocordyceps sinensis. Beijing: Thesis submitted for the Doctoral Degree, Graduate School of Chinese Academy of Sciences. Jin, G.S., Wang, X.L., Li, Y., Wang, W.J., Yang, R.H., Ren, S.Y., Yao, Y.J., 2013. Development of conventional and nested PCR assays for the detection of Ophiocordyceps sinensis. J Basic Microbiology, 53, 340–347. Kozarewa, I., Ning, Z., Quail, M.A., Sanders, M.J., Berriman, M., Turner, D.J., 2009. Amplificationfree Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+ C)-biased genomes. Nature methods, 6, 291–295. Li, R.Q., Li, Y.R., Kristiansen, K., Wang, J., 2008. SOAP: Short oligonucleotide alignment program. Bioinformatics. 24, 713–714. Li, Y., Hu, X.D., Yang, R.H., Hsiang, T., Wang, K., Liang, D.Q., Liang, F., Cao, D.M., Zhou, F., Wen, G., Yao, Y.J., 2015. Complete mitochondrial genome of the medicinal fungus Ophiocordyceps sinensis. Sci Rep. 5, 13892. 13

ACCEPTED MANUSCRIPT

AC CE P

TE

D

MA

NU

SC

RI

PT

Li, Z., Zhang, Z.H., Yan, P.C., Huang, S.W., Fei, Z.J., Lin, K., 2011. RNA-Seq improves annotation of protein-coding genes in the cucumber genome. BMC Genomics. 12, 540. Liu, Z.Q., Lin, S., Baker, P.J., Wu, L.F., Wang, X.R., Wu, H., Xu, F., Wang, H.Y., Brathwaite, M.E., Zheng, Y.G., 2015. Transcriptome sequencing and analysis of the entomopathogenic fungus Hirsutella sinensis isolated from Ophiocordyceps sinensis. BMC Genomics. 16, 106. Martin, F., Kohler, A., Murat, C., Balestrini, R., Coutinho, P.M., Jaillon, O., Montanini, B., Morin, E., Noel, B., Percudani, R., Porcel, B., Rubini, A., Amicucci, A., Amselem, J., Anthouard, V., Arcioni, S., Artiguenave, F., Aury, J.M., Ballario, P., Bolchi, A., Brenna, A., Brun, A., Buée, M., Cantarel, B., Chevalier, G., Couloux, A., Da Silva, C., Denoeud, F., Duplessis, S., Ghignone, S., Hilselberger, B., Iotti, M., Marçais, B., Mello, A., Miranda, M., Pacioni, G., Quesneville, H., Riccioni, C., Ruotolo, R., Splivallo, R., Stocchi, V., Tisserant, E., Viscomi, A.R., Zambonelli, A., Zampieri, E., Henrissat, B., Lebrun, M.H., Paolocci, F., Bonfante, P., Ottonello, S., Wincker, P., 2010. Périgord black truffle genome uncovers evolutionary origins and mechanisms of symbiosis. Nature. 464, 1033–1038. Miller, J.R., Koren, S., Sutton, G., 2010. Assembly algorithms for next-generation sequencing data. Genomics. 95, 315–327. Mortazavi, A., Schwarz, E.M., Williams, B., Schaeffer, L., Antoshechkin, I., Wold, B.J., Sternberg, P.W., 2010. Scaffolding a Caenorhabditis nematode genome with RNA-Seq. Genome Res. 20, 1740–1747. Parra, G., Bradnam, K., Ning, Z., Keane, T., Korf. I., 2009. Assessing the gene space in draft genomes. Nucleic Acids Res. 37, 289–297. Pegler, D.N., Yao, Y.J., Li, Y., 1994. The Chinese ‘caterpillar fungus’. Mycologist. 8, 3–5. Pettersson, E., Lundeberg, J., Ahmadian, A., 2009. Generations of sequencing technologies. Genomics, 93, 105–111. Quail, M.A., Smith, M., Coupland, P., Otto, T.D., Harris, S.R., Connor, T.R., Bertoni, A., Swerdlow, H.P., Gu, Y., 2012. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics. 13, 341. Simpson, J.T., Wong, K., Jackman, S.D., Schein, J.E., Jones, S.J.M., Birol, İ., 2009. ABySS: a parallel assembler for short read sequence data. Genome Res. 19, 1117–1123. Spanu, P.D., 2012. The genomics of obligate (and nonobligate) biotrophs. Annu Rev Phytopathol. 50, 91–109. Spanu, P.D., Abbott, J.C., Amselem, J., Burgis, T.A., Soanes, D.M., Stüber, K., van Themaat, E.V.L., Brown, J.K.M., Butcher, S.A., Gurr, S.J., Lebrun, M.H., Ridout, C.J., Schulze-Lefert, P., Talbot, N.J., Ahmadinejad, N., Ametz, C., Barton, G.R., Benjdia, M., Bidzinski, P., Bindschedler, L.V., Both, M., Brewer, M.T., Cadle-Davidson, L., Cadle-Davidson, M.M., Collemare, J., Cramer, R., Frenkel, O., Godfrey, D., Harriman, J., Hoede, C., King, B.C., Klages, Jochen Kleemann, Daniela Knoll, Prasanna S. Koti, Jonathan Kreplak, Francisco J. López-Ruiz, S., Lu, X.L., Maekawa, T., Mahanil, S., Micali, C., Milgroom, M.G., Montana, G., Noir, S., O’Connell, R.J., Oberhaensli, S., Parlange, F., Pedersen, C., Quesneville, H., Reinhardt, R., Rott, M., Sacristán, S., Schmidt, S.M., Schön, M., Skamnioti, P., Sommer, H., Stephens, A., Takahara, H., Thordal-Christensen, H., Vigouroux, M., Weßling, R., Wicker, T., Panstruga, R., 14

ACCEPTED MANUSCRIPT

AC CE P

TE

D

MA

NU

SC

RI

PT

2010. Genome expansion and gene loss in powdery mildew fungi reveal tradeoffs in extreme parasitism. Science. 330, 1543–1546. Treangen, T.J., Salzberg, S.L., 2012. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nature Rev Genet. 13, 36–46. Tritt, A., Eisen, J.A., Facciotti, M.T., Darling, A.E., 2012. An integrated pipeline for de novo assembly of microbial genomes. PloS One. 7, e42304. Wang, Z., Gerstein, M., Snyder, M., 2009. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 10, 57–63. Xiang, L., Li, Y., Zhu, Y.J., Luo, H.M., Li, C.F., Xu, X.L., Sun, C., Song, J.Y., Shi, L.C., He, L., Sun, W., Chen, S.L., 2014. Transcriptome analysis of the Ophiocordyceps sinensis fruiting body reveals putative genes involved in fruiting body development and cordycepin biosynthesis. Genomics. 103, 154–159. Xue, M., Yang, J., Li, Z., Hu, S., Yao, N., Dean, R. A., Zhao, W.S., Shen, Mi., Zhang, H.W., Li, Chao., Liu, L.Y., Cao, L., Xu, X.W., Xing, Y.F., Hsiang, T., Zhang, Z.D., Xu, J.R., Peng. Y.L., 2012. Comparative analysis of the genomes of two field isolates of the rice blast fungus Magnaporthe oryzae. PLoS Genet. 8, e1002869. Zerbino, D.R., Birney, E., 2008. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829. Zhang, W.Y., Chen, J.J., Yang, Y., Tang, Y.F., Shang, J., Shen, B.R., 2011. A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies. PloS One. 6, e17915. Zheng, P., Xia, Y.L., Xiao, G.H., Xiong, C.H., Hu, X., Zhang, S.W., Zheng, H.J., Huang, Y., Zhou, Y., Wang, S.Y., Zhao, G.P., Liu, X.Z. St Leger, R.J., Wang, C.S., 2011. Genome sequence of the insect pathogenic fungus Cordyceps militaris, a valued traditional Chinese medicine. Genome Biol. 12, R116. Zhong, X., Peng, Q.Y., Li, S.S., Chen, H., Sun, H.X., Zhang, G.R., Liu, X., 2014. Detection of Ophiocordyceps sinensis in the roots of plants in alpine meadows by nested-touchdown polymerase chain reaction. Fungal Biol. 118, 359–363.

15

ACCEPTED MANUSCRIPT

TE

D

MA

NU

SC

RI

PT

Highlights The genome was sequenced for a mono-ascospore isolate of Ophiocordyceps sinensis. Combined genome and transcriptome analyses gave an improved draft genome assembly. Data trimming for low quality base calls did not increase N50 values. Coverage from 100 to 200 was optimal for assembling 100 bp paired-end read data.

AC CE P

   

16