Improved genomic resources for the black tiger prawn (Penaeus monodon)


Marine Genomics xxx (xxxx) xxxx

Contents lists available at ScienceDirect

Marine Genomics journal homepage: www.elsevier.com/locate/margen

Method paper


Dong Van Quyen (a,e), Han Ming Gan (b,c), Yin Peng Lee (b,c), Dinh Duy Nguyen (a), Thi Hoa Nguyen (a), Xuan Thach Tran (a), Van Sang Nguyen (d), Dinh Duy Khang (a), Christopher M. Austin (b,c)

a Laboratory of Molecular Microbiology, Institute of Biotechnology, Vietnam Academy of Science and Technology, 18 Hoang Quoc Viet, Cau Giay, Hanoi, Viet Nam
b Centre of Integrative Ecology, School of Life and Environmental Sciences, Deakin University, Geelong, Australia
c Deakin Genomics Centre, Deakin University, Geelong, Australia
d Institute for Aquaculture No.2 (RIA2), 116 Nguyen Dinh Chieu St., Dist. 1, Ho Chi Minh City, Viet Nam
e University of Science and Technology of Hanoi, Vietnam Academy of Science and Technology, 18 Hoang Quoc Viet, Cau Giay, Hanoi, Viet Nam

ARTICLE INFO

Keywords: Tiger prawn; Aquaculture; Nanopore; Illumina; Genome

ABSTRACT

World production of farmed crustaceans was 7.8 million tons in 2016. Although crustaceans make up only approximately 10% of world aquaculture production, they are generally high-value species and can earn significant export income for producing countries. Viet Nam is a major seafood-producing country, earning USD 7.3 billion in export income in 2016, with shrimp as a major commodity. However, there is a general lack of genomic resources for shrimp species; such resources are challenging to obtain because of the large, repetitive genomes that characterize many decapod crustaceans. The first tiger prawn (P. monodon) genome assembly was generated in 2016 using standard Illumina PCR-based paired-end reads and a computationally efficient but relatively suboptimal assembler, SOAPdenovo v2. As a result, the current P. monodon draft genome is highly fragmented (> 2 million scaffolds with an N50 length of < 1000 bp) and exhibits only moderate genome completeness (< 35% BUSCO complete single-copy genes). We sought to improve the contiguity and completeness of the recently published P. monodon genome assembly by generating Illumina PCR-free paired-end sequencing reads, to eliminate genomic gaps associated with PCR bias, and by performing de novo assembly with the updated MaSuRCA assembler. Furthermore, we scaffolded the assembly with low-coverage Nanopore long reads and several recently published deep Illumina transcriptome paired-end sequencing datasets, producing a final genome assembly of 1.6 Gbp (1,211,364 scaffolds; N50 length of 1982 bp) with an Arthropod BUSCO completeness of 96.8%. Compared to the previously published P. monodon genome assembly from China (NCBI Accession Code: NIUS01), this represents an almost 20% increase in overall BUSCO genome completeness, and the assembly now contains more than 90% of Arthropod BUSCO single-copy genes. The revised P. monodon genome assembly (NCBI Accession Code: VIGR01) will be a valuable resource to support ongoing functional genomics and molecular-based breeding studies in Vietnam.

1. Introduction

World production of farmed crustaceans was 7.8 million tons in 2016 (FAO, 2018). Although crustaceans make up only ~10% of world aquaculture production, they are generally high-value species and can earn significant export income for producing countries. Marine shrimps dominate crustacean aquaculture, are typically farmed in coastal areas, and are an important source of income for local farmers. Viet Nam is a major seafood-producing country, earning USD 7.3 billion in export income in 2016 (FAO, 2018), with most of its revenue coming from exports of farmed Pangas catfishes (Pangasius spp.) and shrimp. After the Whiteleg Shrimp (Litopenaeus vannamei), the Black Tiger Shrimp or Prawn (Penaeus monodon) is the world's most important cultured shrimp, and farmers are increasingly attracted to this species because of its higher and more stable price. However, there is a general lack of genetic and genomic resources available for shrimp species, and what is available is of questionable value (Yuan et al., 2018; Yuan et al., 2017), due in part to the challenges of dealing with the large repetitive genomes that characterize many decapod crustaceans; these challenges are now gradually being overcome with the advent of long-read technology (Zhang et al., 2019). Genomic studies are essential for understanding gene functions and their relationship to phenotypes and are fundamental for selective

Correspondence to: C. M. Austin, Deakin Genomics Centre, Deakin University, Geelong, Australia.
Correspondence to: D. D. Khang, Institute of Biotechnology, Viet Nam Academy of Science and Technology, Viet Nam.
E-mail addresses: [email protected] (D.D. Khang), [email protected] (C.M. Austin).

https://doi.org/10.1016/j.margen.2020.100751 Received 8 September 2019; Received in revised form 24 November 2019; Accepted 3 January 2020 1874-7787/ © 2020 Published by Elsevier B.V.

Please cite this article as: Dong Van Quyen, et al., Marine Genomics, https://doi.org/10.1016/j.margen.2020.100751


breeding programs for increased growth and disease resistance. Similarly, whole-genome assemblies are required to identify trait-specific loci using GWAS and for genomic-based selective breeding. To this end, whole-genome sequencing has been conducted on several shrimp/prawn species. Except for the recently updated PacBio-led genome assembly of Litopenaeus vannamei (Zhang et al., 2019), none of the sequenced prawn/shrimp genomes, e.g. Penaeus monodon (Yuan et al., 2018), Exopalaemon carinicauda (Yuan et al., 2017), Parhyale hawaiensis (Kao et al., 2016), and Neocaridina denticulata (Kenny et al., 2014), has resulted in a high-quality assembly because of the challenges presented by the large genome size and highly repetitive sequences that are typical of decapod crustaceans (Abdelrahman et al., 2017; Yu et al., 2015). In this study, we report the first draft genome of a Vietnamese tiger prawn (Fig. 1) in an effort to improve the quality of the existing genome assembly and annotation for this species and to provide a more robust resource for the ongoing genomic-based selective breeding program supported by the Vietnamese Government.

Fig. 1. The Vietnamese black tiger prawns (Photo by Dong Van Quyen, RIA2).

2. Materials and methods

2.1. Sample collection

A male tiger prawn (Prawn ID 26D) from the 4th breeding generation, bred and cultured at the National Breeding Centre for Southern Marine Aquaculture, Institute for Aquaculture No.2 (RIA2), Vung Tau, Vietnam, was selected as the DNA source for Illumina and Nanopore sequencing (Table 1).

2.2. Illumina sequencing

Genomic DNA (gDNA) was extracted from ethanol-preserved tail muscle tissue using the column-based DNeasy Blood and Tissue Kit (Qiagen, Hilden, Germany), according to the manufacturer's instructions. Two different PCR-free libraries were constructed from the extracted gDNA using the NuGen Celero DNA-Seq Library Preparation Kit (Tecan Genomics, San Carlos, CA) and the TruSeq DNA PCR-Free library preparation kit (Illumina, San Diego, CA). For comparison, a PCR-based library was also constructed using the NEBNext Ultra DNA library preparation kit (New England Biolabs, Ipswich, MA). Quantification and fragment length estimation of the constructed libraries used the KAPA Library Quantification kit (KAPA Biosystems, Cape Town, South Africa) and Tapestation (Agilent, USA), respectively. Sequencing was performed on a NovaSeq6000 (run configuration of 2 × 150 bp) located at the Deakin Genomics Centre.

2.3. Nanopore sequencing

We used the SQK-LSK109 Ligation Sequencing Kit (Oxford Nanopore, Oxford, UK) to process the extracted gDNA for sequencing. Quantification of the extracted gDNA used the Qubit dsDNA HS Assay kit and Qubit 4 Fluorometer (Invitrogen, San Clara, CA). Two libraries were prepared from 1 μg of column-extracted gDNA (low molecular weight for higher sequencing output) and one library from 1 μg of gDNA extracted with the conventional salting-out approach (high molecular weight but lower sequencing yield) (Sokolov, 2000). Each library was loaded onto a dedicated MinION R9.4.1 revD flowcell (Oxford Nanopore, Oxford, UK) and sequenced for 48 h. Basecalling of the raw Nanopore reads used Guppy v3.1.5 (high accuracy mode).

2.4. De novo genome assembly

Optical duplicates, which are commonly encountered with the NovaSeq6000 patterned flowcell, were identified and removed with clumpify, a bioinformatic tool from BBMap v38.51 (https://github.com/BioInfoTools/BBMap). The deduplicated reads were subsequently processed with fastp v0.20 (Chen et al., 2018) for poly-G trimming, to reduce the over-representation of high-quality poly-G bases at the 3′ end of reads that is commonly associated with the Illumina NovaSeq6000 2-colour imaging system. De novo assembly of the poly-G trimmed paired-end reads used MaSuRCA v3.3.1 (Zimin et al., 2013).

2.5. Scaffolding of assembly and assessment of genome completeness

Scaffolding of the resulting assembly with Nanopore long reads and Illumina transcriptome data used LRScaf v1.1.7 (Qin et al., 2018) and BESST_RNA (Sahlin et al., 2014), respectively. The transcriptome data used for BESST_RNA scaffolding were obtained from a recently published comprehensive study of the tiger prawn (Huerlimann et al., 2018). Briefly, the RNA paired-end sequencing reads generated from 57 libraries were individually mapped to the genome using HISAT2 v2.1.0 (default settings) (Kim et al., 2015), with the output piped to "samtools sort" to generate BAM files. The BAM files were combined with "samtools merge" and used as the input for BESST_RNA scaffolding (Sahlin et al., 2014). Genome assembly statistics were computed with QUAST v5.0.2 (Mikheenko et al., 2018). In addition to the assemblies generated in this study, the recently published tiger prawn genome (Yuan et al., 2018) was downloaded from NCBI (https://www.ncbi.nlm.nih.gov/assembly/GCA_002291185.1/) and included for comparison. Genome completeness, based on the presence of single-copy genes, was assessed with BUSCO v3 (Arthropod odb9 database) (Waterhouse et al., 2017).

2.6. Identification of simple sequence repeats

Simple sequence repeats (SSRs) were identified from Illumina and Nanopore reads using PAL_FINDER v0.02.03 (Castoe et al., 2012). Prior to repeat identification, the Illumina reads were poly-G, quality and adapter-trimmed using fastp (Chen et al., 2018).

2.7. Mitogenome assembly and phylogenetic analysis

De novo mitogenome assembly using the TruSeq PCR-free sequencing reads was performed with NOVOPlasty v3.2 (Dierckxsens et al., 2016) with a chosen k-mer size of 59 (the default k-mer size was 39) to enable the complete assembly of the repetitive control region. The


Table 1. General information of Penaeus monodon.

Items: Description

MIxS shared mandatory descriptors
Investigation type: Eukaryote
Classification: Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria; Protostomia; Ecdysozoa; Panarthropoda; Arthropoda; Mandibulata; Pancrustacea; Crustacea; Malacostraca; Eumalacostraca; Eucarida; Decapoda; Dendrobranchiata; Penaeoidea; Penaeidae; Penaeus monodon
Project name: Whole-genome sequencing of Penaeus monodon
Geographic location: Da Kao, District 1, Ho Chi Minh City, Viet Nam
Latitude, longitude: 10.785542 N, 106.696493 E
Collection date: 2018
Environment (biome): Marine biome (ENVO:00000447)
Environment (feature): Saline shrimp pond water (ENVO:01001257)
Environment (material): Marine water mass (ENVO:01000686)
Isolation and growth condition: Somatic
Sequencing method: Illumina NovaSeq6000 paired-end (2 × 150); Nanopore MinION

MIGS-specific mandatory descriptors
Ploidy: Diploid 2n
Number of replicons: 88 chromosomes
Estimated genome size: 2.59 Gb
Reference of biomaterial: Primary genome report
Assembly method: De novo assembly
Assembly program: MaSuRCA v. 3.3.1; BESST_RNA v. MAY-2019; LRScaf v. 1.1.7
Finishing strategy: Draft genome; 30× coverage

circularized mitogenome was reoriented to the cox1 gene and annotated with MITOS (Bernt et al., 2013). The assembly of the cox1 gene for the remaining datasets used a reference-based mapping approach as implemented in MITObim v1.9 (Hahn et al., 2013). The assembled cox1 genes from our sequencing data and publicly available datasets were aligned with MAFFT v7 (default options) (Katoh and Standley, 2013). The alignment was used to construct a maximum likelihood tree in IQ-TREE v1.6.5 with 1000 ultrafast bootstrap replicates (Hoang et al., 2017) to summarise evolutionary relationships among cox1 gene haplotypes.

2.8. Visualization of PCR and sequencing bias in the mitogenome

To assess the influence of insert size and PCR amplification on sequencing depth, Illumina reads generated from PCR-dependent (NEBNext Ultra DNA and TruSeq DNA) and PCR-free (NuGen Celero DNA-Seq and TruSeq DNA PCR-Free) libraries were aligned with Bowtie2 (Langmead and Salzberg, 2012) to the black tiger prawn reference mitogenome (GenBank Accession Number: NC_002184.1) (Wilson et al., 2000). The mitogenome was chosen because many crustacean mitogenomes are prone to PCR bias owing to the presence of two major low-GC regions, namely the 16S rRNA and the control region (Gan et al., 2019a; Gan et al., 2019b). The alignment files in BAM format were visualized in Integrative Genomics Viewer (Thorvaldsdóttir et al., 2013).

2.9. Genome annotation and variant-calling

The newly assembled genome was repeat-masked using Red (default settings) (Girgis, 2015). The transcriptome data previously used for BESST_RNA scaffolding were then aligned to the soft-masked genome with HISAT2 v2.1.0. The individual BAM files and the soft-masked tiger prawn genome were used as the input for annotation in BRAKER v2.1.2 (Hoff et al., 2016). For variant-calling, trimmed PCR-free reads were mapped to the genome using the default settings of bwa mem (Li, 2013). SNP calling from the alignment (> 1 kb scaffolds) was performed using Strelka v2.9.10 (Kim et al., 2018).

2.10. Ortholog inference and protein functional annotation

The genome assemblies and annotations (gff3 files) of the Pacific white shrimp (Litopenaeus vannamei) (Zhang et al., 2019), marbled crayfish (Procambarus virginalis) (Gutekunst et al., 2018) and sand flea (Parhyale hawaiensis) (Kao et al., 2016) were downloaded, and the predicted protein sequences were retrieved using the program gffread (Pertea et al., 2015). Ortholog inference was subsequently performed on the black tiger prawn and the three newly obtained proteomes using OrthoFinder2 v2.2.7, which performed (1) an all-versus-all BLAST search between each species pair using the Diamond protein similarity search algorithm (-evalue 0.001); (2) Markov Clustering (MCL) of the putative cognate gene pairs using the default MCL inflation value of 1.5; and (3) refinement of orthologous grouping via gene tree inference using FastTree (Emms and Kelly, 2019). The R package UpSetR (Conway et al., 2017) was used to visualise the number of overlapping and unique orthologous groups among these four crustacean species. Functional annotation of the predicted proteins used InterProScan v5.37–74.0 (Jones et al., 2014). Representative genes that belonged to an orthologous group in OrthoFinder2 or had a Gene Ontology assignment or InterPro signature were retained for further analysis.

2.11. Identification of carbohydrate-active enZYmes (CAZy) and phylogenetic analysis

Carbohydrate-Active enZYmes (CAZy) were identified with the dbCAN2 meta server (Zhang et al., 2018). Predicted chitinases with a significant hit (E-value < 1e-50) to the GH18 hidden Markov model (HMM) domain were further extracted and aligned with MAFFT v7 (--maxiterate 1000 --localpair) (Katoh and Standley, 2013). The alignment was subsequently trimmed with trimAl v1.2 (-gt 0.2) (Capella-Gutiérrez et al., 2009), followed by maximum likelihood tree construction in IQ-TREE v1.6.5 with 1000 ultrafast bootstrap replicates (Hoang et al., 2017).

3. Results and discussion

3.1. Sequencing statistics

A total of 75 gigabases of data were generated from Illumina whole-genome sequencing, of which 76% (57 gigabases, ~28.5× genome coverage) were derived from the PCR-free libraries and 24% (18 gigabases, ~9× genome coverage) from the PCR-based library. An additional 2.5 gigabases (~1.25× genome coverage) of data were generated from three separate Nanopore runs, averaging 833 megabases per flowcell, which is substantially lower than the advertised specification of ~10 Gb per flowcell. Although there are now studies showing acceptable Nanopore yield for a few non-model organisms (Gan et al., 2019c; Gan et al., 2019d; Tan et al., 2018; Austin et al., 2017), to our knowledge a high-yielding Nanopore run has yet to be demonstrated for any marine crustacean. Based on the widely reported repetitiveness of most marine crustacean genomes (Yuan et al., 2018; Yuan et al., 2017; Zhang et al., 2019), we suspect that the low Nanopore data yield is due to the rapid accumulation of irreversible pore blockages caused by the formation of complex secondary structures during sequencing. Given the recent major improvements in PacBio sequencing technology and its demonstrated success in the Pacific white shrimp genome sequencing project (Zhang et al., 2019), PacBio sequencing may be an alternative for future long-read sequencing of the black tiger prawn and other marine crustaceans. The Illumina read depth obtained in this study precludes accurate estimation of genome size and heterozygosity using the reference-free k-mer approach (Vurture et al., 2017). As a result, we opted to estimate genome heterozygosity by mapping the trimmed PCR-free reads to the whole genome (see below), followed by variant-calling. By restricting variant-calling to scaffolds larger than 1000 bp (accumulated length of 1.2 Gb), a total of 6,368,014 high-confidence heterozygous SNPs were identified, giving an estimated genome heterozygosity of 0.53%.
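The coverage, per-flowcell yield and heterozygosity figures reported in this section follow from simple ratios. The sketch below reproduces them under one loudly stated assumption: a genome size of about 2.0 Gb, which is not given in this section but is the value implied by the reported coverage (57 gigabases at ~28.5×); Table 1 lists a larger estimate of 2.59 Gb. All function and variable names are illustrative and not taken from the authors' workflow.

```python
def fold_coverage(sequenced_bases: float, genome_size: float) -> float:
    """Mean fold coverage = total sequenced bases / genome size."""
    return sequenced_bases / genome_size


def heterozygosity_percent(het_snps: int, callable_bases: float) -> float:
    """Heterozygous SNPs per callable base, expressed as a percentage."""
    return 100.0 * het_snps / callable_bases


# ASSUMPTION: ~2.0 Gb genome size, inferred from the reported coverage values.
GENOME_SIZE_BP = 2.0e9

pcr_free_cov = fold_coverage(57e9, GENOME_SIZE_BP)    # PCR-free Illumina libraries
pcr_based_cov = fold_coverage(18e9, GENOME_SIZE_BP)   # PCR-based Illumina library
nanopore_cov = fold_coverage(2.5e9, GENOME_SIZE_BP)   # three Nanopore runs combined

# Mean Nanopore yield per flowcell, in megabases (three flowcells).
mean_yield_mb = 2.5e9 / 3 / 1e6

# Heterozygosity over scaffolds larger than 1000 bp (accumulated length 1.2 Gb).
het_pct = heterozygosity_percent(6_368_014, 1.2e9)

if __name__ == "__main__":
    print(f"PCR-free coverage:  {pcr_free_cov:.2f}x")
    print(f"PCR-based coverage: {pcr_based_cov:.2f}x")
    print(f"Nanopore coverage:  {nanopore_cov:.2f}x")
    print(f"Mean yield:         {mean_yield_mb:.0f} Mb per flowcell")
    print(f"Heterozygosity:     {het_pct:.2f}%")
```

With the 2.59 Gb estimate from Table 1 instead, the same function would give proportionally lower coverage values, so the assumed genome size should be stated whenever such figures are quoted.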

3.2. High abundance of long dinucleotide repeats in the black tiger prawn genome

Improved coverage of genomic regions with high AT content was observed in the PCR-free sequencing data (Fig. 2A and C), underscoring the biasing effects of PCR on library GC distribution. A peak at around 50% GC is visible in each library, with PCR-free libraries exhibiting a more prominent peak. Scanning of the poly-G, quality and adapter-trimmed PCR-free reads revealed a relatively high proportion of reads with at least 10 copies of TG/CA (13.9%), AG/CT (13.3%), or AT/AT (4.9%) dinucleotide repeats (Fig. 2B). Both TG/CA and AG/CT dinucleotide repeats have 50% GC content, corroborating the observed high representation of sequences with 50% GC (Fig. 2A) in the PCR-free library and the resulting genome assembly (Fig. 2C). However, the inclusion of PCR in the library preparation reduced the overall abundance of perfect dinucleotide repeats (Fig. 2B). The lower representation of such sequences in the PCR-based libraries may be explained by PCR errors in the dinucleotide repeats generating increasing copies of truncated repeats (PCR slippage) and non-perfect repeats (nucleotide misincorporation) during each PCR cycle (Shinde et al., 2003; Neininger et al., 2019).
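To make the "at least 10 copies" criterion concrete, the following is a minimal sketch of how the fraction of reads carrying a perfect dinucleotide run could be counted. It is not the PAL_FINDER algorithm used in the study; it is a plain regular-expression scan, and it assumes that a repeat class such as TG/CA covers both the unit and its reverse complement. Function names and the toy reads are hypothetical.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def has_perfect_repeat(read: str, unit: str, min_copies: int = 10) -> bool:
    """True if the read contains >= min_copies uninterrupted tandem copies
    of `unit` or of its reverse complement (e.g. TG or CA)."""
    read = read.upper()
    unit = unit.upper()
    units = {unit, reverse_complement(unit)}
    return any(
        re.search(f"(?:{u}){{{min_copies},}}", read) is not None for u in units
    )


def repeat_fraction(reads, unit: str, min_copies: int = 10) -> float:
    """Fraction of reads with at least one qualifying perfect repeat run."""
    reads = list(reads)
    if not reads:
        return 0.0
    hits = sum(has_perfect_repeat(r, unit, min_copies) for r in reads)
    return hits / len(reads)


# Toy reads (hypothetical): two carry a >= 10-copy TG/CA run, two do not.
toy_reads = ["TG" * 10 + "ACGT", "CA" * 12, "TG" * 9 + "A", "ACGTACGT"]
tg_ca_fraction = repeat_fraction(toy_reads, "TG")
```

Because only perfect runs match, a single substitution or slipped copy inside a repeat breaks the run, which is one way PCR errors of the kind described above could lower the apparent abundance of such reads.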

3.3. Nanopore long reads revealed under-represented AT/AT dinucleotide repeats in Illumina libraries

In contrast to the Illumina data, AT/AT and TG/CA repeats appear to be equally abundant (~40% of total reads) in the Nanopore data (Fig. 2B). The Nanopore sequencing profile is more likely to reflect the actual genomic content, given that it is based on native DNA sequencing, despite its relatively higher error rate, which is now consistently less than 10% with the Guppy basecaller high accuracy model. The end-repair step for Illumina PCR-free preparation requires an extended (~10–30 min) incubation at 65 °C for A-tailing. When fragmented to less than 500 bases, a standard requirement for Illumina library preparation, high-AT genomic fragments, for example those with a high representation of AT/AT dinucleotide repeats, may have a melting temperature lower than 65 °C. As a result, a substantial portion of high-AT genomic fragments will no longer be present in double-stranded form, which is a requirement for the subsequent adapter ligation step (Kozarewa et al., 2009). The incorporation of PCR during library preparation can further reduce the representation of high-AT genomic regions because of the polymerase amplification bias towards GC-balanced fragments (Aird et al., 2011).

3.4. Illumina PCR-free library and MaSuRCA-based assembly improved the black tiger prawn genome assembly

An initial de novo genome assembly using only Illumina whole-genome sequencing generated a genome with an accumulated length of 1.6 Gb (N50 = 1756 bp) contained in 1,283,603 scaffolds (Table 2). A substantially higher representation of assembled genomic regions with extremely low GC content (< 20%) was observed in the current assemblies compared to the tiger prawn genome assembled by Yuan et al. (2018) (Fig. 2C). Similar to the GC plots of the Illumina sequencing reads, a sharp peak at 50% GC was observed across all assemblies, with a lower prominence in the first tiger prawn genome assembly. Based on the GC plots of the sequencing data, it is conceivable that the improved genome assembly is partially due to the higher representation of PCR-free data, which better recover AT-rich genomic regions (Fig. 2A), and possibly also to the use of the updated MaSuRCA assembler.

Fig. 2. Profiling of sequencing reads and genome assemblies. (A) GC content of different library constructs. Libraries with the prefix SRR were generated by Yuan et al. (2018). (B) Frequency of dinucleotide repeats in Illumina and Nanopore reads. (C) Distribution of GC content in the assemblies, calculated from the number of non-overlapping 100 bp windows. MaSuRCA: initial de novo assembly using MaSuRCA and PCR-free reads; Nanopore scaf: MaSuRCA assembly scaffolded with Nanopore long reads; RNA-Nanopore scaf: additional scaffolding of the assembly with deep and comprehensive transcriptome reads; NCBI filtered: final assembly submitted to NCBI.
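The GC distributions discussed here are described as being calculated over non-overlapping 100 bp windows. A minimal sketch of that calculation is shown below; it is an illustration rather than the authors' script, and it assumes that a trailing partial window is dropped and that any non-G/C character (including N) counts as non-GC. The function names and the toy sequence are hypothetical.

```python
def gc_percent_windows(seq: str, window: int = 100) -> list:
    """GC percentage of each full, non-overlapping window along `seq`.

    ASSUMPTIONS: a trailing partial window is ignored, and every character
    other than G or C (including N) is treated as non-GC.
    """
    seq = seq.upper()
    values = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        values.append(100.0 * gc / window)
    return values


def fraction_below(values, threshold: float = 20.0) -> float:
    """Share of windows whose GC percentage is below `threshold`."""
    values = list(values)
    if not values:
        return 0.0
    return sum(v < threshold for v in values) / len(values)


# Toy scaffold (hypothetical): an AT-only window, a TG repeat window,
# a GC-only window, and 4 trailing bases that do not fill a window.
toy_scaffold = "AT" * 50 + "TG" * 50 + "GC" * 50 + "ACGT"
toy_gc = gc_percent_windows(toy_scaffold)
toy_low_gc_share = fraction_below(toy_gc, 20.0)
```

In this toy case the TG repeat window sits at exactly 50% GC, which mirrors the observation above that long TG/CA and AG/CT repeats contribute to the sharp 50% GC peak.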

3.5. Genome-scaffolding with comprehensive and deep Illumina transcriptome data increased genome completeness

Despite the low output from Nanopore sequencing, scaffolding with Nanopore reads still resulted in a 60% increase in the number of scaffolds longer than 10,000 bp (Table 2). Additional scaffolding with previously published RNA-sequencing data from multiple P. monodon tissues and developmental stages (Huerlimann et al., 2018) further improved the assembly metrics, with the most dramatic improvement in genome completeness, as evidenced by the successful identification of more than 90% BUSCO complete single-copy genes, with only 3.2% of the Arthropod BUSCO genes missing (Fig. 3).

Fig. 3. BUSCO completeness comparison between the previously published black tiger prawn genome and the current assemblies based on the Arthropod odb9 dataset. Colors in the bars represent the different categories of the identified BUSCO genes.

3.6. Small PCR-based library insert size exhibited reduced sequencing evenness in the black tiger prawn mitogenome with extremely low coverage in high-AT regions

The read coverage of most PCR-based libraries exhibits a relatively similar curve profile with respect to their GC plots (Fig. 4A, B and C). Regardless of insert size, the read depth is lower in the AT-rich control region (Fig. 4D), with the strongest bias observed in the PCR-based 170-bp insert size library (Fig. 4A). We further confirmed that this is not library-, lab- or Illumina instrument-specific by performing a similar read alignment and visualization using the tiger prawn Illumina dataset generated by Yuan et al. (2018). The choice of a larger insert size for the PCR-based library (~500 bp) appears to improve sequencing evenness across the mitogenome, except for the control region, where read depth is at least 50% lower (Fig. 4C).

Table 2. Genome assembly metrics.

Statistics                  Yuan et al.    MaSuRCA        Nanopore scaffolding  Nanopore+RNA scaffolding  NCBI submission
# contigs                   2,525,346      1,283,603      1,238,255             1,211,636                 1,211,364
# contigs (≥ 1000 bp)       322,978        525,567        504,015               483,646                   483,541
# contigs (≥ 5000 bp)       6304           27,060         37,183                38,968                    38,961
# contigs (≥ 10,000 bp)     421            3178           5078                  8950                      8940
# contigs (≥ 25,000 bp)     2              35             58                    694                       689
# contigs (≥ 50,000 bp)     0              1              5                     20                        20
Largest contig              26,545         51,625         93,158                93,158                    93,158
Total length                1,447,415,504  1,608,897,298  1,630,181,707         1,632,843,607             1,632,419,792
Total length (≥ 1000 bp)    583,943,639    1,165,073,195  1,202,431,084         1,209,749,762             1,209,454,222
Total length (≥ 5000 bp)    42,638,659     197,032,681    279,112,891           339,464,223               339,276,585
Total length (≥ 10,000 bp)  5,108,688      41,443,282     66,881,999            139,587,347               139,369,382
Total length (≥ 25,000 bp)  51,580         1,104,114      1,887,485             22,209,852                22,061,325
Total length (≥ 50,000 bp)  0              51,625         329,239               1,167,697                 1,167,697
N50                         769            1756           1919                  1981                      1982
N75                         378            928            962                   972                       972
GC (%)                      40             39             39                    39                        39
# N's                       80,289,002     2,294,808      25,627,055            28,288,955                28,288,955
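The N50 and N75 values in Table 2 were computed with QUAST. For readers unfamiliar with the metric, the sketch below shows the usual definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least the given fraction of the total assembly length. This is a generic illustration with a hypothetical function name and toy lengths, not QUAST's own code.

```python
def nx(lengths, fraction: float = 0.5) -> int:
    """Nx statistic: walk sequences from longest to shortest and return the
    length at which the running total first reaches `fraction` of the
    total assembly length (fraction=0.5 gives N50, 0.75 gives N75)."""
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    return 0  # empty input


# Toy assembly (hypothetical scaffold lengths, total = 54).
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_n50 = nx(toy_lengths, 0.5)    # 10 + 9 + 8 = 27 reaches half of 54
toy_n75 = nx(toy_lengths, 0.75)   # 10 + 9 + 8 + 7 + 6 + 5 = 45 reaches 40.5
```

Because the statistic is weighted by length, the large number of sub-1000 bp scaffolds in every column of Table 2 keeps N50 below 2 kb even though the longest scaffold exceeds 90 kb.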


Fig. 4. Mitogenome-based evaluation of the effects of sequencing library preparation protocols on sequencing depth. The approximate insert size of each library is indicated by the window size above the read coverage plot(s), and each colored vertical line indicates a mismatch to the reference mitogenome (GenBank Accession Number: NC_002184.1). (A) PCR-based library with a small insert size close to the sequencing read length (150 bp), giving strong paired-end read overlap. (B) PCR-based and PCR-free libraries with an insert size larger than the sequencing read length, allowing modest paired-end read overlap. (C) PCR-based library with a large insert size that precludes paired-end read overlap. (D) Location of transfer RNA, ribosomal RNA and protein-coding genes in the reference mitogenome. (E) Alignment of transcriptome reads generated by Huerlimann et al. (2018), with maximum read coverage set to 500×.

3.7. Two distinct mitolineages present within Penaeus monodon

The complete mitogenome of the Vietnamese Penaeus monodon isolate 26D is 15.9 kb in length (GenBank accession code: MN057663). Its mitochondrial control region has the highest sequence similarity (99.81%) to control region haplotype VN48 (GenBank accession code: EU426694), which is unique to the currently sampled Vietnamese black tiger prawns (You et al., 2008). Surprisingly, the alignment of mitochondrial reads from the black tiger prawn previously published by Yuan et al. (2018) revealed an SNP profile identical to that of our sample 26D, suggesting a recent shared maternal lineage. In contrast, the mitochondrial SNP distribution of these samples is quite divergent from that of the Australian tiger prawns used for the transcriptome study, with one of the samples exhibiting a strikingly high sequence divergence from the reference mitogenome (Fig. 4E). The maximum likelihood tree based on the alignment of the newly assembled cox1 genes and publicly available P. monodon cox1 gene sequences showed two strongly supported (bootstrap values ≥ 90%) clades (Fig. 5A). Within the same clade, the pairwise nucleotide identity of the cox1 gene ranges from 97% to 100%, while between the two major clades it ranges from 91% to 93% (Fig. 5B). We found a similar result using publicly available 16S rRNA sequences, but with a lower level of mean divergence between the two major clades (~2%; see Supplemental Table 1). The division of P. monodon into two quite divergent clades, which appear to correspond to western (Indian Ocean) and eastern (south-east Asian and western Pacific) groupings, is consistent with other studies of geographic-based genetic variation (You et al., 2008; Waqairatu et al., 2012; Benzie et al., 2002). Our study indicates that the degree of molecular divergence (> 5% for the COI barcoding region) is potentially of taxonomic significance at the subspecies or even species level based on the principles of DNA barcoding (Hebert et al., 2003). We therefore recommend that a comprehensive study of geographic variation be undertaken for the black tiger prawn using contemporary population genomics methods such as Pool-seq (Robledo et al., 2018; Micheletti and Narum, 2018; Dorant et al., 2019), which will benefit from the availability of this revised genome assembly and annotation. Further, existing population genetic studies are limited by the techniques used (some are over 20 years old) (Duda and Palumbi, 1999), by variable geographic coverage and sampling intensity, and by the diversity of molecular genetic methodologies used.

3.8. Genome annotation and the identification of carbohydrate-active enzymes

Consistent with the high sequence repetitiveness observed from the sequencing reads (Fig. 2B), approximately 800 Mb (50% of the


Fig. 5. Penaeus monodon consists of two divergent maternal lineages. (A) Maximum likelihood tree depicting the evolutionary relationships of various Penaeus monodon voucher specimens based on the alignment of cox1 genes. Branch length indicates the number of substitutions per site, and numbers next to nodes are IQ-TREE ultrafast bootstrap (UFBoot) support values. The tree was rooted with members of Penaeus semisulcatus as the outgroup. (B) Heatmap of cox1 pairwise identity, with values indicated in the colored cells.
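A pairwise identity heatmap such as panel B can be derived directly from the multiple sequence alignment. The sketch below is one hedged way to do so; how gapped columns were treated in the original analysis is not stated, so skipping any column with a gap in either sequence is an assumption here, and the function names and toy sequences are hypothetical.

```python
def pairwise_identity(a: str, b: str) -> float:
    """Percent identity between two aligned sequences of equal length.

    ASSUMPTION: columns where either sequence has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def identity_matrix(aligned: dict) -> dict:
    """All-against-all identity, keyed by (name_a, name_b)."""
    names = list(aligned)
    return {
        (p, q): pairwise_identity(aligned[p], aligned[q])
        for p in names
        for q in names
    }


# Toy alignment (hypothetical haplotypes, not real cox1 data).
toy_alignment = {
    "hapA": "ACGTACGTAC",
    "hapB": "ACGTACGTAT",   # one mismatch against hapA
    "hapC": "ACGT-CGTAC",   # one gapped column against hapA
}
toy_matrix = identity_matrix(toy_alignment)
```

Applied to a real cox1 alignment, values of the kind reported in the text (97% to 100% within a clade, 91% to 93% between clades) would appear as two high-identity blocks along the diagonal of the matrix.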


Fig. 6. Orthologous clustering of the tiger prawn proteome with three crustacean proteomes and its functional annotation. (A) UpSet plot showing unique and shared protein ortholog clusters across Pacific white shrimp (Litopenaeus vannamei), black tiger prawn (Penaeus monodon), marbled crayfish (Procambarus virginalis) and sand flea (Parhyale hawaiensis). Connected dots are the intersections of overlapping orthologs with the vertical black bars above showing the number of orthogroups in each intersection. (B) Venn diagram showing the number of tiger prawn proteins with Gene Ontology assignment, InterPro Signature and/or orthologous grouping.
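The bar heights in an UpSet plot such as panel A are counts of orthogroups grouped by the exact set of species they contain. The sketch below shows that tally on toy data. It assumes exclusive intersections (each orthogroup is counted once, under the full set of species present in it), which we understand to be how UpSetR draws its bars by default; the input structure and orthogroup identifiers are hypothetical and do not reproduce the OrthoFinder2 output format.

```python
from collections import Counter


def exclusive_intersections(orthogroup_species: dict) -> Counter:
    """Count orthogroups per exact species combination.

    `orthogroup_species` maps an orthogroup ID to the set of species that
    have at least one gene in it (hypothetical structure). Each orthogroup
    contributes to exactly one combination.
    """
    counts = Counter()
    for species in orthogroup_species.values():
        if species:
            counts[frozenset(species)] += 1
    return counts


# Toy data using the species abbreviations from Fig. 7.
toy_orthogroups = {
    "OG1": {"Lv", "Pm"},
    "OG2": {"Lv", "Pm"},
    "OG3": {"Lv", "Pm", "Pv", "Ph"},
    "OG4": {"Pm"},
    "OG5": set(),  # empty groups are ignored
}
toy_counts = exclusive_intersections(toy_orthogroups)
penaeid_only = toy_counts[frozenset({"Lv", "Pm"})]
shared_by_all = toy_counts[frozenset({"Lv", "Pm", "Pv", "Ph"})]
```

Under this convention an orthogroup shared by all four species does not also add to the L. vannamei plus P. monodon bar, which is why that bar can be read as "uniquely shared" between the two penaeids.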

assembled length) of the genome sequence were characterized as repetitive. BRAKER2 initially predicted 237,505 protein-coding genes, of which 36,685 were functionally annotated and/or clustered into orthologous groups with other crustaceans (Fig. 6B). Consistent with their close phylogenetic relatedness, L. vannamei and P. monodon have the highest number of uniquely shared orthologous groups (n = 5651, Fig. 6A), followed by a total of 5132 orthologous groups that were shared with the remaining two more distantly related crustacean species, namely the sand flea (Parhyale hawaiensis) and the marbled crayfish (Procambarus virginalis) (Fig. 6A). We identified several significant CAZy hits in the predicted proteomes, with GH18 protein (chitinase) being the most abundantly represented GH family in all four crustacean species (Fig. 7A). The high diversity of chitinase is consistent with its role in molting and the digestion of chitin-containing food (Proespraiwong et al., 2010; Watanabe et al., 1998). In addition, several GHs associated with the metabolism of plant materials (Bredon et al., 2019; Gan et al., 2018) were also identified, which could facilitate the formulation of optimal prawn feed. Phylogenetic analysis of the chitinases showed clustering first by chitinase variant and only then by species relatedness (Fig. 7B), suggesting that a majority of the chitinase genes were present in the ancestral crustacean genome. One chitinase clade (highlighted in Fig. 7B), however, consists exclusively of members from the black tiger prawn and Pacific white shrimp. The higher representation and diversity of chitinase in penaeid shrimps is thought to be due to their high molting frequency compared to other arthropods (Gao et al., 2017; Godin et al., 1996).

4. Conclusions

We present a significantly improved whole-genome assembly of the black tiger prawn that was assembled from a mixture of PCR-free Illumina reads and Nanopore long reads. Given the high proportion of long dinucleotide repeats in the genome, as evidenced by the GC plots of genome assemblies and sequencing reads, future genome sequencing of black tiger prawns, or more generally decapod crustaceans, will benefit greatly from long-read-only assemblies that can span complex repeats, followed by error correction with Illumina reads generated from PCR-free libraries, preferably with large insert sizes to minimize coverage bias. By using recently generated RNA-seq data for scaffolding, we also observed a much higher BUSCO completeness for a black tiger prawn genome assembly, which will facilitate functional genomics and population genetic studies of the giant tiger prawn in Vietnam and, more generally, South East Asia.

Supplementary data to this article can be found online at https://doi.org/10.1016/j.margen.2020.100751.

Data availability

Raw Illumina sequence reads and the genome assembly have been deposited in the NCBI database under the BioProject accession number PRJNA548012 (https://www.ncbi.nlm.nih.gov/bioproject/548012). Basecalled Nanopore reads, intermediate assemblies, BRAKER2 predicted genes, OrthoFinder2 output, and CAZy and InterProScan functional annotations are available in the public Zenodo database (https://doi.org/10.5281/zenodo.3520179).

Authors' contributions

DVQ, DDK, VSN and CMA conceived the study. DDN, THN and XTT collected the samples. HMG and YPL isolated DNA and performed Illumina sequencing. DVQ, HMG and CMA performed the analyses and wrote the paper. All authors read and approved the final manuscript.

Declaration of Competing Interest

The authors declare that they have no competing interests.


Fig. 7. Identification of glycoside hydrolases in selected crustacean genomes. (A) Heatmap showing the number of predicted protein-coding genes in each crustacean genome that were assigned to the glycoside hydrolase (GH) class of CAZy based on dbCAN2 classification (Zhang et al., 2018). (B) IQ-TREE maximum-likelihood tree depicting the evolutionary relationships of chitinases from Litopenaeus vannamei (Lv), Penaeus monodon (Pm), Procambarus virginalis (Pv) and Parhyale hawaiensis (Ph). The tree was midpoint rooted and the nodes were colored based on ultrafast bootstrap values. The first two letters in each tip label correspond to the species name, and branch lengths indicate the number of substitutions per site.
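As a rough illustration of the kind of count matrix that underlies a heatmap such as Fig. 7A, the following minimal Python sketch tallies GH family assignments per species. The annotation tuples, protein identifiers and the function name `gh_count_matrix` are hypothetical toy values introduced here for illustration only. They are not taken from the study, and real dbCAN2 output would first need to be parsed into this form.

```python
from collections import Counter, defaultdict

# Hypothetical toy annotations: (species code, protein id, GH family).
# These values are illustrative assumptions, not data from the paper.
ANNOTATIONS = [
    ("Pm", "Pm_0001", "GH18"),
    ("Pm", "Pm_0002", "GH18"),
    ("Pm", "Pm_0003", "GH9"),
    ("Lv", "Lv_0001", "GH18"),
    ("Lv", "Lv_0002", "GH13"),
    ("Pv", "Pv_0001", "GH18"),
    ("Ph", "Ph_0001", "GH9"),
]


def gh_count_matrix(annotations):
    """Return (families, species, rows), where rows[i][j] is the number of
    annotation records from species[j] assigned to families[i]."""
    counts = defaultdict(Counter)
    for species, _protein, family in annotations:
        counts[family][species] += 1
    species = sorted({s for s, _, _ in annotations})
    # Most abundant family first; ties are broken alphabetically.
    families = sorted(counts, key=lambda f: (-sum(counts[f].values()), f))
    rows = [[counts[f][s] for s in species] for f in families]
    return families, species, rows


families, species, rows = gh_count_matrix(ANNOTATIONS)
for family, row in zip(families, rows):
    print(family, dict(zip(species, row)))
```

Each annotation record is counted once, so a protein carrying two domains of the same family would contribute twice unless the records are deduplicated beforehand.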

Acknowledgments

This study was funded through a grant “Research on the application of biotechnology in the selection of fast-growing black tiger shrimp (Penaeus monodon)” to DDK from the Vietnamese Ministry of Agriculture and Rural Development, and by the Deakin Genomics Centre, Deakin University, Australia.

References

Abdelrahman, H., ElHady, M., Alcivar-Warren, A., Allen, S., Al-Tobasei, R., Bao, L., Beck, B., Blackburn, H., Bosworth, B., Buchanan, J., Chappell, J., Daniels, W., Dong, S., Dunham, R., Durland, E., Elaswad, A., Gomez-Chiarri, M., Gosh, K., Guo, X., Hackett, P., Hanson, T., Hedgecock, D., Howard, T., Holland, L., Jackson, M., Jin, Y., Khalil, K., Kocher, T., Leeds, T., Li, N., Lindsey, L., Liu, S., Liu, Z., Martin, K., Novriadi, R., Odin, R., Palti, Y., Peatman, E., Proestou, D., Qin, G., Reading, B., Rexroad, C., Roberts, S., Salem, M., Severin, A., Shi, H., Shoemaker, C., Stiles, S., Tan, S., Tang, K.F.J., Thongda, W., Tiersch, T., Tomasso, J., Prabowo, W.T., Vallejo, R., van der Steen, H., Vo, K., Waldbieser, G., Wang, H., Wang, X., Xiang, J., Yang, Y., Yant, R., Yuan, Z., Zeng, Q., Zhou, T., The Aquaculture Genomics, Genetics and Breeding Workshop, 2017. Aquaculture genomics, genetics and breeding in the United States: current status, challenges, and priorities for future research. BMC Genomics 18, 191.
Aird, D., Ross, M.G., Chen, W.-S., Danielsson, M., Fennell, T., Russ, C., Jaffe, D.B., Nusbaum, C., Gnirke, A., 2011. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 12, R18.
Austin, C.M., Tan, M.H., Harrisson, K.A., Lee, Y.P., Croft, L.J., Sunnucks, P., Pavlova, A., Gan, H.M., 2017. De novo genome assembly and annotation of Australia's largest freshwater fish, the Murray cod (Maccullochella peelii), from Illumina and Nanopore sequencing read. GigaScience 6, 1–6.
Benzie, J.A.H., Ballment, E., Forbes, A.T., Demetriades, N.T., Sugama, K., Haryanti, Moria, S., 2002. Mitochondrial DNA variation in Indo-Pacific populations of the giant tiger prawn, Penaeus monodon. Mol. Ecol. 11, 2553–2569.
Bernt, M., Donath, A., Juhling, F., Externbrink, F., Florentz, C., Fritzsch, G., Putz, J., Middendorf, M., Stadler, P.F., 2013. MITOS: improved de novo metazoan mitochondrial genome annotation. Mol. Phylogenet. Evol. 69, 313–319.
Bredon, M., Herran, B., Lheraud, B., Bertaux, J., Grève, P., Moumen, B., Bouchon, D., 2019. Lignocellulose degradation in isopods: new insights into the adaptation to terrestrial life. BMC Genomics 20, 462.
Capella-Gutiérrez, S., Silla-Martínez, J.M., Gabaldón, T., 2009. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25, 1972–1973.
Castoe, T.A., Poole, A.W., de Koning, A.P.J., Jones, K.L., Tomback, D.F., Oyler-McCance, S.J., Fike, J.A., Lance, S.L., Streicher, J.W., Smith, E.N., Pollock, D.D., 2012. Rapid microsatellite identification from Illumina paired-end genomic sequencing in two birds and a snake. PLoS One 7, e30953.
Chen, S., Zhou, Y., Chen, Y., Gu, J., 2018. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890.
Conway, J.R., Lex, A., Gehlenborg, N., 2017. UpSetR: an R package for the visualization of intersecting sets and their properties. Bioinformatics 33, 2938–2940.
Dierckxsens, N., Mardulyn, P., Smits, G., 2016. NOVOPlasty: de novo assembly of organelle genomes from whole genome data. Nucleic Acids Res. 45, e18.
Dorant, Y., Benestan, L., Rougemont, Q., Normandeau, E., Boyle, B., Rochette, R., Bernatchez, L., 2019. Comparing Pool-seq, Rapture, and GBS genotyping for inferring weak population structure: the American lobster (Homarus americanus) as a case study. Ecol. Evol. 9, 6606–6623.
Duda, T.F., Palumbi, S.R., 1999. Population structure of the black tiger prawn, Penaeus monodon, among western Indian Ocean and western Pacific populations. Mar. Biol. 134, 705–710.
Emms, D.M., Kelly, S., 2019. OrthoFinder: phylogenetic orthology inference for comparative genomics. bioRxiv 466201.
FAO, 2018. The State of World Fisheries and Aquaculture 2018 - Meeting the sustainable development goals. http://www.fao.org/3/i9540en/i9540en.pdf.
Gan, H.M., Austin, C., Linton, S., 2018. Transcriptome-guided identification of carbohydrate active enzymes (CAZy) from the Christmas Island red crab, Gecarcoidea natalis and a vote for the inclusion of transcriptome-derived crustacean CAZys in comparative studies. Mar. Biotechnol. 20, 654–665.
Gan, H.M., Linton, S.M., Austin, C.M., 2019a. Two reads to rule them all: nanopore long read-guided assembly of the iconic Christmas Island red crab, Gecarcoidea natalis (Pocock, 1888), mitochondrial genome and the challenges of AT-rich mitogenomes. Mar. Genomics 45, 64–71.
Gan, H.M., Grandjean, F., Jenkins, T.L., Austin, C.M., 2019b. Absence of evidence is not evidence of absence: Nanopore sequencing and complete assembly of the European lobster (Homarus gammarus) mitogenome uncovers the missing nad2 and a new major gene cluster duplication. BMC Genomics 20, 335.
Gan, H.M., Tan, M.H., Austin, C.M., Sherman, C.D.H., Wong, Y.T., Strugnell, J., Gervis, M., McPherson, L., Miller, A.D., 2019c. Best foot forward: nanopore long reads, hybrid meta-assembly, and haplotig purging optimizes the first genome assembly for the southern hemisphere blacklip abalone (Haliotis rubra). Front. Genet. 10.
Gan, H.M., Falk, S., Morales, H.E., Austin, C.M., Sunnucks, P., Pavlova, A., 2019d. Genomic evidence of neo-sex chromosomes in the eastern yellow robin. GigaScience 8.
Gao, Y., Wei, J., Yuan, J., Zhang, X., Li, F., Xiang, J., 2017. Transcriptome analysis on the exoskeleton formation in early developmetal stages and reconstruction scenario in growth-moulting in Litopenaeus vannamei. Sci. Rep. 7, 1098.
Girgis, H.Z., 2015. Red: an intelligent, rapid, accurate tool for detecting repeats de-novo on the genomic scale. BMC Bioinforma. 16, 227.
Godin, D.M., Carr, W.H., Hagino, G., Segura, F., Sweeney, J.N., Blankenship, L., 1996. Evaluation of a fluorescent elastomer internal tag in juvenile and adult shrimp Penaeus vannamei. Aquaculture 139, 243–248.
Gutekunst, J., Andriantsoa, R., Falckenhayn, C., Hanna, K., Stein, W., Rasamy, J., Lyko, F., 2018. Clonal genome evolution and rapid invasive spread of the marbled crayfish. Nat. Ecol. Evol. 2, 567–573.
Hahn, C., Bachmann, L., Chevreux, B., 2013. Reconstructing mitochondrial genomes directly from genomic next-generation sequencing reads: a baiting and iterative mapping approach. Nucleic Acids Res. 41, e129.
Hebert, P.D., Cywinska, A., Ball, S.L., de Waard, J.R., 2003. Biological identifications through DNA barcodes. Proc. Biol. Sci. 270, 313–321.
Hoang, D.T., Chernomor, O., Von Haeseler, A., Minh, B.Q., Vinh, L.S., 2017. UFBoot2: improving the ultrafast bootstrap approximation. Mol. Biol. Evol. 35, 518–522.
Hoff, K.J., Lange, S., Lomsadze, A., Borodovsky, M., Stanke, M., 2016. BRAKER1: unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics 32, 767–769.
Huerlimann, R., Wade, N.M., Gordon, L., Montenegro, J.D., Goodall, J., McWilliam, S., Tinning, M., Siemering, K., Giardina, E., Donovan, D., Sellars, M.J., Cowley, J.A., Condon, K., Coman, G.J., Khatkar, M.S., Raadsma, H.W., Maes, G.E., Zenger, K.R., Jerry, D.R., 2018. De novo assembly, characterization, functional annotation and expression patterns of the black tiger shrimp (Penaeus monodon) transcriptome. Sci. Rep. 8, 13553.
Jones, P., Binns, D., Chang, H.-Y., Fraser, M., Li, W., McAnulla, C., McWilliam, H., Maslen, J., Mitchell, A., Nuka, G., Pesseat, S., Quinn, A.F., Sangrador-Vegas, A., Scheremetjew, M., Yong, S.-Y., Lopez, R., Hunter, S., 2014. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240.
Kao, D., Lai, A.G., Stamataki, E., Rosic, S., Konstantinides, N., Jarvis, E., Di Donfrancesco, A., Pouchkina-Stancheva, N., Sémon, M., Grillo, M., Bruce, H., Kumar, S., Siwanowicz, I., Le, A., Lemire, A., Eisen, M.B., Extavour, C., Browne, W.E., Wolff, C., Averof, M., Patel, N.H., Sarkies, P., Pavlopoulos, A., Aboobaker, A., 2016. The genome of the crustacean Parhyale hawaiensis, a model for animal development, regeneration, immunity and lignocellulose digestion. eLife 5, e20062.
Katoh, K., Standley, D.M., 2013. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780.
Kenny, N.J., Sin, Y.W., Shen, X., Zhe, Q., Wang, W., Chan, T.F., Tobe, S.S., Shimeld, S.M., Chu, K.H., Hui, J.H., 2014. Genomic sequence and experimental tractability of a new decapod shrimp model, Neocaridina denticulata. Mar. Drugs 12, 1419–1437.
Kim, D., Langmead, B., Salzberg, S.L., 2015. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12, 357.
Kim, S., Scheffler, K., Halpern, A.L., Bekritsky, M.A., Noh, E., Källberg, M., Chen, X., Kim, Y., Beyter, D., Krusche, P., 2018. Strelka2: fast and accurate calling of germline and somatic variants. Nat. Methods 15, 591.
Kozarewa, I., Ning, Z., Quail, M.A., Sanders, M.J., Berriman, M., Turner, D.J., 2009. Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes. Nat. Methods 6, 291–295.
Langmead, B., Salzberg, S.L., 2012. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357.
Li, H., 2013. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997.
Micheletti, S.J., Narum, S.R., 2018. Utility of pooled sequencing for association mapping in nonmodel organisms. Mol. Ecol. Resour. 18, 825–837.
Mikheenko, A., Prjibelski, A., Saveliev, V., Antipov, D., Gurevich, A., 2018. Versatile genome assembly evaluation with QUAST-LG. Bioinformatics 34, i142–i150.
Neininger, K., Marschall, T., Helms, V., 2019. SNP and indel frequencies at transcription start sites and at canonical and alternative translation initiation sites in the human genome. PLoS One 14, e0214816.
Pertea, M., Pertea, G.M., Antonescu, C.M., Chang, T.-C., Mendell, J.T., Salzberg, S.L., 2015. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290.
Proespraiwong, P., Tassanakajon, A., Rimphanitchayakit, V., 2010. Chitinases from the black tiger shrimp Penaeus monodon: phylogenetics, expression and activities. Comp. Biochem. Physiol. B Biochem. Mol. Biol. 156, 86–96.
Qin, M., Wu, S., Li, A., Zhao, F., Feng, H., Ding, L., Chang, Y., Ruan, J., 2018. LRScaf: improving draft genomes using long noisy reads. bioRxiv 374868.
Robledo, D., Palaiokostas, C., Bargelloni, L., Martinez, P., Houston, R., 2018. Applications of genotyping by sequencing in aquaculture breeding and genetics. Rev. Aquac. 10, 670–682.
Sahlin, K., Vezzi, F., Nystedt, B., Lundeberg, J., Arvestad, L., 2014. BESST - efficient scaffolding of large fragmented assemblies. BMC Bioinforma. 15, 281.
Shinde, D., Lai, Y., Sun, F., Arnheim, N., 2003. Taq DNA polymerase slippage mutation rates measured by PCR and quasi-likelihood analysis: (CA/GT)n and (A/T)n microsatellites. Nucleic Acids Res. 31, 974–980.
Sokolov, E.P., 2000. An improved method for DNA isolation from mucopolysaccharide-rich molluscan tissues. J. Molluscan Stud. 66, 573–575.
Tan, M.H., Austin, C.M., Hammer, M.P., Lee, Y.P., Croft, L.J., Gan, H.M., 2018. Finding Nemo: hybrid assembly with Oxford Nanopore and Illumina reads greatly improves the clownfish (Amphiprion ocellaris) genome assembly. GigaScience 7.
Thorvaldsdóttir, H., Robinson, J.T., Mesirov, J.P., 2013. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief. Bioinform. 14, 178–192.
Vurture, G.W., Sedlazeck, F.J., Nattestad, M., Underwood, C.J., Fang, H., Gurtowski, J., Schatz, M.C., 2017. GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics 33, 2202–2204.
Waqairatu, S.S., Dierens, L., Cowley, J.A., Dixon, T.J., Johnson, K.N., Barnes, A.C., Li, Y., 2012. Genetic analysis of black tiger shrimp (Penaeus monodon) across its natural distribution range reveals more recent colonization of Fiji and other South Pacific islands. Ecol. Evol. 2, 2057–2071.
Watanabe, T., Kono, M., Aida, K., Nagasawa, H., 1998. Purification and molecular cloning of a chitinase expressed in the hepatopancreas of the penaeid prawn Penaeus japonicus. Biochim. Biophys. Acta (BBA)-Protein Struct. Mol. Enzymol. 1382, 181–185.
Waterhouse, R.M., Seppey, M., Simão, F.A., Manni, M., Ioannidis, P., Klioutchnikov, G., Kriventseva, E.V., Zdobnov, E.M., 2017. BUSCO applications from quality assessments to gene prediction and phylogenomics. Mol. Biol. Evol. 35, 543–548.
Wilson, K., Cahill, V., Ballment, E., Benzie, J., 2000. The complete sequence of the mitochondrial genome of the crustacean Penaeus monodon: are malacostracan crustaceans more closely related to insects than to branchiopods? Mol. Biol. Evol. 17, 863–874.
You, E.M., Chiu, T.S., Liu, K.F., Tassanakajon, A., Klinbunga, S., Triwitayakorn, K., de la Pena, L.D., Li, Y., Yu, H.T., 2008. Microsatellite and mitochondrial haplotype diversity reveals population differentiation in the tiger shrimp (Penaeus monodon) in the Indo-Pacific region. Anim. Genet. 39, 267–277.
Yu, Y., Zhang, X., Yuan, J., Li, F., Chen, X., Zhao, Y., Huang, L., Zheng, H., Xiang, J., 2015. Genome survey and high-density genetic map construction provide genomic and genetic resources for the Pacific white shrimp Litopenaeus vannamei. Sci. Rep. 5, 15612.
Yuan, J., Gao, Y., Zhang, X., Wei, J., Liu, C., Li, F., Xiang, J., 2017. Genome sequences of marine shrimp Exopalaemon carinicauda Holthuis provide insights into genome size evolution of Caridea. Mar. Drugs 15.
Yuan, J., Zhang, X., Liu, C., Yu, Y., Wei, J., Li, F., Xiang, J., 2018. Genomic resources and comparative analyses of two economical penaeid shrimp species, Marsupenaeus japonicus and Penaeus monodon. Mar. Genomics 39, 22–25.
Zhang, H., Yohe, T., Huang, L., Entwistle, S., Wu, P., Yang, Z., Busk, P.K., Xu, Y., Yin, Y., 2018. dbCAN2: a meta server for automated carbohydrate-active enzyme annotation. Nucleic Acids Res. 46, W95–W101.
Zhang, X., Yuan, J., Sun, Y., Li, S., Gao, Y., Yu, Y., Liu, C., Wang, Q., Lv, X., Zhang, X., Ma, K.Y., Wang, X., Lin, W., Wang, L., Zhu, X., Zhang, C., Zhang, J., Jin, S., Yu, K., Kong, J., Xu, P., Chen, J., Zhang, H., Sorgeloos, P., Sagi, A., Alcivar-Warren, A., Liu, Z., Wang, L., Ruan, J., Chu, K.H., Liu, B., Li, F., Xiang, J., 2019. Penaeid shrimp genome provides insights into benthic adaptation and frequent molting. Nat. Commun. 10, 356.
Zimin, A.V., Marçais, G., Puiu, D., Roberts, M., Salzberg, S.L., Yorke, J.A., 2013. The MaSuRCA genome assembler. Bioinformatics 29, 2669–2677.
