De novo transcriptome sequencing of Cryptotermes domesticus and comparative analysis of gene expression in response to different wood species

De novo transcriptome sequencing of Cryptotermes domesticus and comparative analysis of gene expression in response to different wood species

    De novo transcriptome sequencing of Cryptotermes domesticus and comparative analysis of gene expression in response to different wood...

1MB Sizes 1 Downloads 62 Views

    De novo transcriptome sequencing of Cryptotermes domesticus and comparative analysis of gene expression in response to different wood species Wenjing Wu, Zhenyou Huang, Zhiqiang Li, Shijun Zhang, Xiaolin Liu, Daifei Gu PII: DOI: Reference:

S0378-1119(15)01164-6 doi: 10.1016/j.gene.2015.09.055 GENE 40873

To appear in:

Gene

Received date: Accepted date:

24 July 2015 18 September 2015

Please cite this article as: Wu, Wenjing, Huang, Zhenyou, Li, Zhiqiang, Zhang, Shijun, Liu, Xiaolin, Gu, Daifei, De novo transcriptome sequencing of Cryptotermes domesticus and comparative analysis of gene expression in response to different wood species, Gene (2015), doi: 10.1016/j.gene.2015.09.055

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT De novo transcriptome sequencing of Cryptotermes domesticus and comparative analysis of gene expression in response to different wood

1,2,3

, Zhenyou Huang 1, Zhiqiang Li

1,2,3,

IP

Author names: Wenjing Wu

T

species

1,2,3

,

SC R

Xiaolin Liu 1,4, Daifei Gu 1,5

*, Shijun Zhang

Affiliation: 1 Guangdong Entomological Institute, Guangzhou, China Guangdong Public Laboratory of Wild Animal Conservation and Utilization, Guangzhou,

NU

2

China

Guangdong Key Laboratory of Integrated Pest Management in Agriculture, Guangzhou,

MA

3

China

Institute of Entomology and State Key Laboratory of Biocontrol, Sun Yat-sen University,

D

4

5

TE

Guangzhou, China

College of Forestry, Northeast Forestry University, Harbin, China

CE P

Address: Xingangxi Road 105, Guangzhou, Guangdong, 510260, China.

* Corresponding author: Zhiqiang Li. Address: Guangdong Entomological Institute,

AC

Xingangxi Road 105, Guangzhou, Guangdong, 510260, China. Tel: +86 20 84183617. E-mail address: [email protected].

1

ACCEPTED MANUSCRIPT Abstract

T

The drywood termite Cryptotermes domesticus is an important worldwide pest with limited

IP

genomic resources that causes substantial damage to dry timber and structural lumber.

SC R

Here, we performed transcriptome sequencing for Cr. domesticus pseudergate using Illumina paired-end sequencing technology. A total of 108,745,470 clean reads were collected and assembled into 302,979 contigs with an average length of 648 bp and an

NU

N50 length of 893 bp. A total of 185,248 unigenes and 100,680 proteins were identified among the assembled contigs. Of these, there were 152,317 (50.27%) contigs with

MA

significant similarity to publicly available databases. To understand how the termites respond to phylogenetically diverse wood species, variations in gene expression were examined among pseudergates feeding on three wood species from different plant

D

families, Casuarina equisetifolia (CE), Koompassia excelsa (KE) and Myristica sp. (MS). A

TE

total of 417 (118 up-regulated / 299 down-regulated), 599 (148 up-regulated / 451 down-regulated) and 505 (223 up-regulated / 282 down-regulated) differentially expressed

CE P

genes were detected in KE vs. CE, KE vs. MS and CE vs. MS, respectively. Digital gene expression analysis indicated that different wood species played an important role in the expression of termite genes, such as genes involved in carbohydrate metabolism, and

AC

proteins with catalytic activity and hydrolase activity. Additionally, the genes encoding cellulase were identified and analyzed. This study provides the first primary transcriptome of Cr. domesticus and lays a foundation for future functional genomics studies in the feeding responses.

Keywords: Drywood termite; Casuarina equisetifolia; Koompassia excelsa; Myristica sp.; Feeding response

2

ACCEPTED MANUSCRIPT 1. Introduction Drywood termite Cryptotermes domesticus (Blattodea: Kalotermitidae) is a related lower

T

termite species, and the termite foragers are called a “Pseudergate” without a true worker

IP

caste. Cr. domesticus is capable of living in a piece of dry wood without free water for

SC R

years, so it is a serious pest causing substantial damage to dry timber or structural lumber (Huang et al., 2000; Rust and Su, 2012). Moreover, Cr. domesticus is an important worldwide invasive termite species, spreading from Southeast Asia to the Pacific islands

NU

and Central America (Evans et al., 2013). In China, Cr. domesticus has been found in Hainan, Guangdong, Guangxi, Yunnan and Taiwan, and it is one of the most economically

MA

important termite pests (Huang et al., 2000). This species has a great capacity to feed on wood from different plant families and in which their offspring can grow normally (Qian et al., 2005; Huang et al., 2007; Huang et al., 2011). Myristica sp. (Myristicaceae) and

D

Koompassia excelsa (Caesalpiniaceae) are common timbers imported from Southeast

TE

Asia, and Casuarina equisetifolia (Casuarinaceae) is an important shelter-forest tree species in South China. Cr. domesticus colonies can survive in the specimens from K.

CE P

excelsa and Myristica sp. (Qian et al., 2005; Huang et al., 2007) as well as Ca. equisetifolia (based on our observations). Additionally, we observed in our laboratory that the alates fly was also associated with these tree species. However, after an

AC

approximately ten-year observation period, we found that the fitness of the Cr. domesticus colonies was different among the tree species, and the consumption of Myristica sp. was higher in laboratory conditions. At present, most studies of drywood termites are primarily focused on their distribution (Scheffrahn et al., 2000; Li et al., 2008; Evans et al., 2013), resistance (Qian et al., 2005; Garcia et al., 2011; Hadi et al., 2012) and control (Sbeghen-Loss et al., 2011; Rust and Su, 2012) because the termites spread rapidly and are difficult to control. Currently, the termite genome of the lower termite Zootermopsis nevadensis (Termopsidae) has been reported (Terrapon et al., 2014). However, adaptation mechanisms of drywood termites in different wood species have not been reported at the molecular level. Although the feeding habits of drywood termites have been well studied, it is very difficult to investigate 3

ACCEPTED MANUSCRIPT drywood termites as cryptic and single-piece nesters. To understand the feeding responses of Cr. domesticus to phylogenetically distant wood species, we assembled the transcriptome of Cr. domesticus and compared the gene expression profiles among

IP

T

different wood species. Here, we analyze Cr. domesticus colonies reared in Ca. equisetifolia, K. excelsa and Myristica sp. This work may lay the foundation to reveal

SC R

molecular fitness mechanisms and to prohibit the continued spread of drywood termites.

NU

2. Materials and Methods 2.1. Termite Materials

MA

Cr. domesticus colonies were collected from Zhanjiang City, Guangdong Province, China, and the termites from infested wood species were reared in a glass jar (90 cm long, 50 cm wide, 70 cm high) until primary reproductives occurred in the laboratory at the Guangdong

D

Entomological Institute, Guangzhou City, Guangdong Province, China. From 2002 to

TE

2003, the new primary reproductives as well as their offspring were collected to establish a new colony in three wood species, Ca. equisetifolia (CE), K. excelsa (KE) and Myristica

CE P

sp. (MS) at a constant temperature of 27°C and relative humidity of 80%. In 2014, Cr. domesticus pseudergates were collected from CE, KE and MS and were immediately frozen and stored in liquid nitrogen until used.

AC

For each of the termite colonies, 5 individual pseudergates were randomly chosen and then pooled to generate one mixed sample per colony. Three RNA samples (five pseudergates each from CE, KE and MS woods) were extracted using TRIzol reagent (Invitrogen, Carlsbad, CA, USA) according to the manufacturer‟s instructions. Total RNA integrity was checked with an Agilent Technologies 2100 Bioanalyzer (Agilent Technologies, Inc., Santa Clara, CA, USA) and a 1% agarose gel. All samples had a RNA integrity number (RIN) of more than 6.

2.2. cDNA library construction and Illumina sequencing for transcriptome analysis To obtain complete gene expression information, a pooled RNA sample obtained from CE, 4

ACCEPTED MANUSCRIPT KE and MS was used to construct a reference library for de novo transcriptome sequencing. According to the Illumina manufacturer‟s protocols, poly (A)+ RNA was purified from 5 μg of total RNA using oligo (dT) magnetic beads and fragmented into short

IP

T

sequences in the presence of divalent cations at 94°C for 5 min. The cleaved poly (A)+ RNA was transcribed, and then second-strand cDNA synthesis was performed. After the

SC R

end-repair and ligation of adapters, the products were amplified by PCR and purified using the QIAquick PCR Purification Kit (QIAGEN, Inc., Valencia, CA, USA) to create a cDNA library. The cDNA library was sequenced on the Illumina HiSeq2000 platform at the

NU

Annoroad Gene Technology Corporation (Beijing, China), which generated approximately

MA

100 bp paired-end raw reads.

2.3. Data filter and de novo assembly of reads

D

Raw sequencing data were evaluated by the quality control tool FastQC (version 0.11.2)

TE

to characterize sequence quality, GC content, base N content, and adapter content. After quality control testing, the raw reads were cleaned by removing the adapter sequences,

CE P

low quality sequences (more than 15% bases with quality value under 19), and any reads with more than 5% of unknown sequences designated as „N‟. The clean reads were then used for transcriptome de novo assembly using the Trinity program with its default

AC

parameter values (Grabherr et al., 2011). In consideration of gut symbionts, we filtered the sequences of all of the known protozoa of Cr. domesticus: Devescovina axiomaculata, Foaina acontophira, Foaina grassii, and Stephanympha helumbium (Huang et al., 2000). After filtering, the assembled transcriptome of Cr. domesticus was used as a reference for the following analyses.

2.4. Analysis and functional annotation of sequences All assembled contigs were searched against NCBI nucleotide (NT), NCBI non-redundant protein (NR), Swiss-Prot, Kyoto Encyclopedia of Genes and Genome (KEGG), Cluster of Orthologous Groups (COG), euKaryotic Orthologous Groups (KOG), and TrEMBL (a computer-annotated supplement of Swiss-Prot that contains all the translations of EMBL nucleotide sequence entries not yet integrated in Swiss-Prot) databases using the BLAST 5

ACCEPTED MANUSCRIPT algorithm (Korf et al., 2003) with an E-value cut-off of 10-5. If there were conflicting results between the different databases, priority was given, in descending order, to NR, Swiss-Prot, KEGG and COG. The best alignments from the four databases were used to

IP

T

determine the direction of the contigs. Contigs with NR annotation were further analyzed with Blast2GO (version 3.0.8) to obtain their gene ontology (GO) annotations (Conesa et

SC R

al., 2005). The Web Gene Ontology (WEGO) annotation software was used to classify all of the contigs according to GO functions and to understand the distribution of gene functions of the species from the macro level (Ye et al., 2006). The TransDecoder (v2.0.1)

NU

program was used to predict the open reading frame (ORF) from contig sequences and translate them into peptide sequences. TransDecoder identifies likely coding sequences

MA

based on the following criteria: a minimum length ORF of 300 bp; a log-likelihood score > 0; the above coding score is greatest when the ORF is scored in the 1st reading frame as

D

compared to scores in the other 5 reading frames; if a candidate ORF is found fully

TE

encapsulated by the coordinates of another candidate ORF, the longer one is reported. The peptide sequences were aligned with Pfam (Protein family) using HMMER3 with an

CE P

E-value cut-off of 10-10.

2.4. SSR Detection

AC

The assembled sequences were used to identify the signatures of SSRs (Simple Sequence Repeat). FASTA files containing all of the assembled sequences were used as an input file in the Batch Primer3 software (http://probes.pw.usda.gov/batchprimer3/). The parameters were adjusted for identification of perfect di-, tri-, tetra-, penta-, and hexa-nucleotide motifs with a minimum of 9, 6, 5, 4, and 4 repeats, respectively. Mononucleotide repeats were ignored because it was difficult to distinguish genuine mononucleotide repeats from polyadenylation products and single nucleotide stretch errors generated by sequencing.

2.5. Digital gene expression (DGE) library preparation and sequencing DGE library preparation for the three different colony samples (CE, KE, MS) was performed in parallel using Illumina Digital Gene Expression-Tag Profiling with NlaIII kit 6

ACCEPTED MANUSCRIPT (Illumina Inc., San Diego, CA, USA). Briefly, mRNA was purified from 2 μg of total RNA by binding the mRNA to a magnetic oligo (dT) bead. First and second strand cDNAs were directly synthesized and bead-bound cDNA was subsequently digested with NlaIII. The

IP

T

cDNA fragments with 3' ends were then purified with magnetic beads, and then the GEX NlaIII Adapter 1 was added to their 5' ends. The restriction enzyme MmeI recognizes the

SC R

GEX NlaIII Adapter I cDNA junction and cuts 17 bp downstream of the CATG site, producing tags with Adapter I. Subsequently, the GEX Adapter 2 was ligated to the 3' ends of the tags to create a tag library, with cDNA fragments that had adapter molecules on

NU

both ends. The library was amplified by PCR for 15 cycles, and 85 bp strips were purified from 6% TBE PAGE gels. These strips were then digested, and the single-stranded

MA

molecules were attached to the Illumina sequencing chip for sequencing. Each flow cell

D

generated millions of raw tags with a length of 35 bp.

TE

2.6. Analysis and mapping of DGE tags To map the DGE tags, the sequenced raw data were filtered to remove adapter

CE P

sequences, low quality sequences (more than 15% bases with quality value under 19), and reads with more than 5% unknown sequences designated as „N‟. For tag annotation, the clean tags containing CATG and 21 bp tag sequences were mapped to our

AC

transcriptome reference database, and no more than one nucleotide mismatch was allowed.

2.7. Identification of differentially expressed genes The mapped read counts for each gene were normalized for gene length and library size using the reads per kilobase per million (RPKM) method, which facilitates the comparison of the number of transcript levels generated between samples. Differentially expressed genes (DEGs) were analyzed by DEGSeq (1.12.0) based on comparing the RPKM mapped reads between KE and CE (KE vs. CE), between CE and MS (CE vs. MS), and between KE and MS (KE vs. MS). The P-value of DEGs between two samples was adjusted using the Benjamini & Hochberg method (Robinson and Oshlack, 2010). Genes with a corrected P-value ≤ 0.001 and an absolute value log2ratio ≥ 1 were considered 7

ACCEPTED MANUSCRIPT differentially expressed. For GO and KEGG enrichment analyses, all identified DEGs were mapped to GO annotations using the Blast2GO and KEGG annotations using KOBAS, respectively. We selected a corrected P-value < 0.05 as the threshold to

IP

T

determine significant enrichment of the gene sets.

SC R

2.8. Validation and analysis of β-glucosidase encoding genes PCR was performed to validate β-glucosidase encoding genes among the Cr. domesticus transcriptome data. After extraction of total RNA, the total RNA was reverse transcribed by

NU

PrimeScriptTM RT Reagent Kit (TaKaRa Biotechnology Co., Ltd., Dalian, China) according to the manufacturer‟s protocol. The PCR primers were designed by Primer Premier 5.0

MA

software (Supplementary Table S1). The PCR program consisted of 30 cycles of 10 sec at 98°C, 15 sec at 55°C and 2 min at 72°C with PrimeSTAR HS (Premix) (TaKaRa). The

D

PCR products were quantified by gel electrophoresis and then purified using TaKaRa

TE

MiniBEST Agarose Gel DNA Extraction Kit (TaKaRa). All of the purified PCR products were cloned into pGEM-T Easy Vector Systems (Promega Corp., Madison, WI, USA) for

CE P

Sanger sequencing (Invitrogen Biotechnology Co. Ltd., Shanghai, China). The carrier sequences were removed by the SeqMan program within DNAstar and aligned with the transcriptome results. The amino acid sequences were predicted by

AC

DNAMAN (version 8). We performed BLAST on NCBI and selected the best hit results to identified domains of query sequences. Eighteen β-glucosidase sequences from termites and one from cockroaches were downloaded from the GenBank (list in Supplementary Table S2). All of these sequences were aligned using ClustalX. Phylogenetic analysis was performed using MEGA 6.0. The phylogenetic tree was constructed using the Maximum Likelihood method, and bootstrap values were calculated from 1,000 replicates.

2.9. Data Deposition The Illumina sequencing reads of Cr. domesticus were submitted to NCBI Sequence Read Archive (SRA) under the accession number of SRP058685, including the transcriptome

of

Cr.

domesticus

(SRR2039534),

gene

expression

profiles

of

pseudergates living in Ca. equisetifolia woods (SRR2039535), pseudergates living in K. 8

ACCEPTED MANUSCRIPT excelsa woods (SRR2039536), and pseudergates living in Myristica sp. woods

T

(SRR2039537).

SC R

3.1. Illumina sequencing and de novo assemble

IP

3. Results

Using Illumina paired-end sequencing technology, a total of 126,777,292 raw sequencing reads were generated. After quality control and data cleaning, 14.22% reads were filtered,

NU

and 108,745,470 clean reads were obtained with 94.95% Q30 bases (base quality more than 30) (Table 1). Based on the clean reads, a total of 302,979 contigs were assembled

MA

with an average length of 648 bp. The length distribution of these contigs is shown in Fig. 1A, with a minimum length of 201 bp, a maximum length of 17,578 bp and an N50 of 893 bp. After clustering, the contigs generated 185,248 unigenes. A total of 100,680 ORFs

D

was obtained with the mean length of 881 bp and N50 of 1,122 bp. The length distribution

TE

of these ORFs is shown in Fig. 1B. Using the BLAST-Like Alignment Tool (BLAT), all of the assembled contigs were matched to the sequences of Z. nevadensis (with an E-value of

82.82%.

CE P

10-20). Briefly, 19.15% contigs were successfully matched with the average accuracy of

AC

3.2. Functional annotation Out of 302,979 contigs, 152,317 (50.27%) obtained annotations at least in one database, including 32,264 (10.65%) in NT, 142,654 (47.08%) in NR, 79,855 (26.36%) in Swiss-Prot, 80,572 (26.59%) in KOG, 143,029 (47.21%) in TrEMBL, and 81,320 (26.84%) in Pfam. (Fig. 2). The query sequence length over 1,000 bp matched successfully in databases with 36,843 of 44,422 (82.93%) sequences obtaining annotations, while 52.93% of the sequences with lengths from 300 to 1000 bp and 32.33% with lengths less than 300 bp were annotated (Fig. 2). Because of the relatively short length of gene sequence and lacking genome information in Cr. domesticus, 49.73% assembled sequences could not be matched to known genes. On the basis of NR annotation, the species distribution of the best match result (with a cut-off E-value of 10-5) for each sequence is shown in Fig. 3. The 9

ACCEPTED MANUSCRIPT sequences of Cr. domesticus transcriptome matched 36.82% with Trichomonas vaginalis sequences, followed by Z. nevadensis (32.44%), Tribolium castaneum (2.00%) and

IP

T

Coptotermes formosanus (1.02%).

3.3. Functional classification by GO, COG and KEGG

SC R

Based on sequence homology, the 50,329 (16.61%) matched contigs were assigned to 4,947 GO terms and classified into 58 functional groups that belong in biological process, cellular component and molecular function clusters (Fig. 4). In molecular function category,

NU

the matched unique sequences were clustered into 17 classifications. The largest subcategory of the molecular function was „catalytic activity‟ (58.42%), and the second

MA

was „binding‟ (47.85%). In the biological process category, these sequences were divided into 23 classifications. The most represented categories were „metabolic process‟

D

(61.58%) and „cellular process‟ (48.07%). „Cell part‟ (23.85%) and „cell‟ (23.71%) were the

TE

most represented among the 18 categories of the cellular component category. In addition, 33,579 matched genes obtained more than one GO term, and one gene had 93 GO

CE P

terms.

The COG is used to classify orthologous gene products and predict their possible functions. A total of 45,263 (14.94%) COG annotated putative proteins were classified into

AC

25 categories (Supplementary Figure S1). After filtering the categories of the poorly characterized („general function prediction only‟ and „function unknown‟), the largest cluster was „transcription‟ (7,727, 17.07%), followed by „signal transduction mechanisms‟ (7,087,

15.65%)

and

„replication,

recombination

and

repair‟

(6,894,

15.23%)

(Supplementary Figure S1). Pathway analyses were also performed on all assembled genes as an alternative approach to functional categorization and annotation. Briefly, 41,905 (13.83%) sequences had significant matches in the KEGG database and were assigned to 228 KEGG pathways. The number of genes involved in metabolism was 10,575 (25.23%). Among them, the majority were categorized in carbohydrate metabolism with 2,948 genes, followed by nucleotide metabolism with 1,892 genes, amino acid metabolism with 1,815 genes, and lipid metabolism with 1,313 genes. Genetic information processing accounted 10

ACCEPTED MANUSCRIPT for 11,815 (28.19%) genes. The main sub-groups were protein processing in endoplasmic reticulum (1,741 genes), RNA transport (1,507 genes), mRNA surveillance pathway (1,327 genes), and ubiquitin mediated proteolysis (1,322 genes). Environmental

IP

T

information processing, cellular processes, and organismal systems accounted for 4,481 (10.69%), 5,788 (13.81%), and 3,491 (8.33%) genes, respectively. The KEGG

3.4. Frequency and distribution of SSRs

SC R

classification of sequences are listed in Supplementary Table S3.

NU

In total, 4,353 sequences containing 4,709 SSRs were predicted from 302,979 contigs (Table 2). The SSR frequency in the Cr. domesticus transcriptome was 1.44%. Among

MA

them, 4,014 contigs were associated with one SSR hit, while 695 contigs contained two or more SSR hits. The SSR distribution rate was 1.55%. The most abundant type of repeat

D

motif was trinucleotide (48.23%), followed by dinucleotide (35.59%), and tetranucleotide

TE

(16.18%). Pentanucleotide and hexanucleotide repeat units were absent in the Cr. domesticus transcriptome. The frequencies of SSRs with different numbers of tandem

CE P

repeats were calculated and are shown in Table 2. The SSRs with six tandem repeats (26.25%) were the most common, followed by seven tandem repeats (21.15%), ten tandem repeats (15.33%), and five tandem repeats (14.14%). The main SSR repeat units

AC

are listed in Table 2.

3.5. Digital gene expression profile analysis We used DGE to analyze variations in gene expression of Cr. domesticus reared in different wood species. Three DGE libraries were constructed and sequenced, including the following categories: CE (fed with Ca. equisetifolia wood), KE (fed with K. excelsa wood), and MS (fed with Myristica sp. wood). Sequencing depths of more than 13 million raw tags were achieved for each of the three libraries. After removing the low quality tags and tags with adapters, the total number of clean tags per library ranged from 12.7 to 16.8 million (Table 1). To reveal the molecular events underlying the DGE profiles, we mapped the tag sequences of whole DGE libraries to our transcriptome reference database of Cr. domesticus. The CE library (16,850,008 clean tags) had the most clean tags. However, 11

ACCEPTED MANUSCRIPT the percentage of tags mapped to reference genes from the CE library (81.75%) was lower than that from the KE library (85.33%) (Table 1). The number of expressed genes (RPKM > 0) ranged from 83.2 to 135.3 thousand (Table 1). The number in CE library was

IP

T

1.63-fold higher than that in KE library.

SC R

3.6. Differential gene expression at different wood environments To identify genes displaying significant changes in expression when reared with different wood species, differentially expressed tags were analyzed by comparing the CE, KE, and

NU

MS libraries. From our data sets, a total of 417 (118 up-regulated / 299 down-regulated), 599 (148 up-regulated / 451 down-regulated) and 505 (223 up-regulated / 282

MA

down-regulated) DEGs were detected in KE vs. CE, KE vs. MS and CE vs. MS, respectively (Fig. 5). The number of down-regulated DEGs exceeded the number of

D

up-regulated DEGs. We used a Venn diagram to show all possible logical relationships

TE

between the three DEGs datasets (Supplementary Figure S2). There were 72 DEGs exclusively in KE vs. CE, 109 DEGs exclusively in CE vs. MS, 90 DEGs exclusively in KE

Figure S2).

CE P

vs. MS, and 38 common DEGs in KE vs. CE, CE vs. MS and KE vs. MS (Supplementary

AC

3.7. Gene Ontology Functional analysis of DEGs GO enrichment analysis was performed to understand the functions of the exclusive DEGs in each dataset and to look for significantly enriched GO terms. Seven GO terms for cellular components indicated remarkable relationships (Fig. 6). The number of DEGs in KE vs. MS was more than in the other two sets. We found that some of the GO terms were only enriched in KE vs. MS, such as „extracellular region‟, „synapse‟, „presynaptic active zone‟, „plasma membrane‟, and „respiratory chain‟. The DEGs that mapped to „respiratory chain‟ were up-regulated in the KE library compared to the MS library. Some GO terms were not enriched in KE vs. CE, such as „macromolecular complex‟, „envelope‟, „membrane-bounded organelle‟, „organelle membrane‟, and „membrane‟. Only one GO term for molecular function, „catalytic activity‟, was significantly enriched 12

ACCEPTED MANUSCRIPT between the three datasets (Fig. 6). The DEGs mapped to this term were 19, 19, and 28 in KE vs. CE, CE vs. MS, and KE vs. MS, respectively. Three terms were only enriched in KE vs. CE: „monooxygenase activity‟, „phospholipid binding‟, and „GTPase regulator

T

activity‟. The DEGs that mapped to these three terms were down-regulated in the KE

IP

library compared to the CE library. The GO terms „hydrolase activity, acting on ester

SC R

bonds‟, „ligase activity‟, „NADP or NADPH binding‟, „anion binding‟, „enzyme inhibitor activity‟, and „translation regulator activity‟ were only enriched in CE vs. MS, and DEGs in the CE library were up-regulated relative to the MS library. The DEGs that mapped to lyase

activity‟,

„cis-trans

isomerase

NU

„carbon-oxygen

activity‟,

„intramolecular

oxidoreductase activity‟, and „vitamin binding‟ were up-regulated only in the KE library

MA

compared to MS library.

Nine GO terms for biological processes were significantly enriched between the three

D

datasets (Fig. 6). Among them, the GO term „cellular biosynthetic process‟ was only

TE

enriched in CE vs. MS, and these 11 DEGs were up-regulated in CE library. The cellular protein, biopolymer, and macromolecule metabolic processes were not enriched in KE vs.

CE P

CE. The GO terms „reproduction‟ and „pigmentation‟ were only enriched in KE vs. MS. Three DEGs that mapped to the two terms were down-regulated in the KE library.

AC

3.8. Functional annotation of common DEGs A total of 38 DEGs were common to all three libraries. These DEGs and their expression patterns in different wood species are presented in Supplementary Table S4. To profile the changes in the gene expression patterns of these genes in response to different wood species, all 38 common DEGs were classified using hierarchical clustering (Fig. 7). The DEGs in the CE library were grouped closely with those of the KE library. The up-regulated and down-regulated genes were divided into three clusters based on tree branching (Fig. 7). The first cluster contained 10 DEGs that had highest expression in the CE library. The second cluster consisted of 18 DEGs that had the highest expression in the MS library. The third cluster consisted of 10 DEGs that had the highest expression in the KE library. According to KEGG classification, 16 (47.1%) common DEGs were annotated. Among 13

ACCEPTED MANUSCRIPT them, 6 genes were classified into „carbohydrate metabolism‟, including the following 6 pathways: „glycolysis / gluconeogenesis‟ (comp215972_c0), „citrate cycle (TCA cycle)‟ (comp209569_c0), „pentose phosphate pathway‟ (comp214847_c0), „starch and sucrose (comp218797_c0,

comp214419_c2),

„pyruvate

T

metabolism‟

metabolism‟

IP

(comp209569_c0, comp218834_c0), and „glyoxylate and dicarboxylate metabolism‟

SC R

(comp209569_c0). Their expression levels showed a relationship as KE
NU

activity‟, „cell adhesion‟ and „basement membrane‟ were enriched.

3.9. Candidate genes involved in cellulose metabolism

MA

All of the cellulose-related genes in the Cr. domesticus transcriptome were searched, including endoglucanase, exoglucanase, and β-glucosidase (Table 3).

D

Eleven unigenes were annotated as endoglucanase or endo-β-1,4-glucanase. All

TE

sequences had an identity of >70% against the NR database. These genes belonged to glycosyl hydrolase family (GHF) 7, 9 or 45. In GHF 9, all four genes were homologous to

CE P

enzymes in termites, including Z. nevadensis, Neotermes koshunensis, Hodotermopsis sjostedti, and Odontotermes formosanus. Five genes, belonging to GHF 7, were similar with endo-β-1,4-glucanase homologs in symbiotic protist of N. koshunensis. Two genes in

AC

GHF 45 were homologous to endoglucanase in Mastotermes darwiniensis hindgut symbiont. The expression level of endoglucanase was various among CE, KE and MS libraries. Genes in GHF 9 were expressed higher in the CE library than the KE and MS libraries, while genes in GHF 45 were up-regulated in MS library relative to the CE and MS libraries. Genes in KE library were expressed lower than the other two libraries, except gene comp192032_c0. Five exoglucanase genes have been found in the Cr. domesticus transcriptome. They belong to GHF 7 and were similar with homologs in symbionts found in N. koshunensis, Cryptocercus punctulatus, and M. darwiniensis. Exoglucanase of termite homologous has not yet been found from the annotation. Similarly, the exoglucanase genes were expressed at lower levels in the KE library than the CE and MS libraries. A total of 16 unigenes were annotated as β-glucosidase. Homologous sequences had 14

ACCEPTED MANUSCRIPT identities from 46% to 89% with query cover from 63% to 100% against the NR database. These genes were divided into two groups: one in GHF 1 and the other in GHF 3. Seven genes in GHF 1 were termite homologous β-glucosidases and were up-regulated in the

T

CE library relative to the KE and MS libraries. Nine genes in GHF 3 were endosymbiont

IP

homologous β-glucosidases and their expression was higher in the MS library than the CE

SC R

and KE libraries.

Among the 32 cellulose-related genes, 13 had significantly differential expression, including 6 in KE vs. CE, 8 in KE vs. MS, and 12 in CE vs. MS (Table 3). Two were

NU

common DEGs, and both were endosymbiont homologs. Only 4 DEGs were of termite

MA

origin.

3.10. Sequences comparison of Cr. domesticus β-glucosidase

D

The authenticity and reliability of three β-glucosidase encoding genes, comp139691_c0,

TE

comp201652_c0 and comp201932_c1, found in the Cr. domesticus transcriptome were verified. Gene comp139691_c0 contained an ORF of 1,497 bp encoding a 498 aa protein

CE P

with a predicted molecular weight of 56.6 kDa. Gene comp201652_c0 contained an ORF of 1,596 bp encoding a 531 aa protein of 60.9 kDa. Gene comp201932_c1 contained an ORF of 1,494 bp encoding a 497 aa protein of 57.4 kDa.

AC

All of the predicted amino acid sequences from these genes were aligned with the best hit results in NCBI: three known β-glucosidases from Co. formosanus, N. koshunensis and Periplaneta americana (accession number: AEW67361, BAB91145 and AIA09348) (Fig. 8).

The

GHF

1

N-terminal

signature

(accession:

PS00653,

pattern:

F-x-[FYWM]-[GSTA]-x-[GSTA]-x-[GSTA](2)-[FYNH]-[NQ]-x-E-x-[GSTA]) was found in four sequences (Fig. 8). The catalytic domain of glycoside hydrolase and a 5-element fingerprint (accession: PR00131), providing a signature for family 1 glycosyl hydrolases, were found among all sequences. Motif 2 includes the region of GHF 1 active site (accession:

PS00572,

pattern:

[LIVMFSTC]-[LIVFYS]-[LIV]-[LIVMST]-E-N-G-[LIVMFAR]-[CSAGN]), which contains the catalytically active Glu. The sequence pattern was identified in five sequences (Fig. 9). The motif 2 in comp201652_c0 was not consistent with the pattern, but the sequence was 15

ACCEPTED MANUSCRIPT similar and contained the catalytically active Glu. Only the eighth amino acid was changed to Tyr. A phylogenetic tree was constructed and the phylogenetic relationship among 22

T

sequences from 12 termite species and one cockroach species were analyzed

IP

(Supplementary Figure S3). Some species had more than one β-glucosidase genes.

SC R

Some isoforms clustered closely, but some were in different clades. The evolutionary history of β-glucosidase was not obvious from the phylogenetic tree.

NU

4. Discussion

MA

4.1. Cr. domesticus transcriptome data

Several transcriptome studies have previously been performed in the following termite species: Co. formosanus, Coptotermes gestroi, Cryptotermes secundus, H. sjostedti, M. flavipes, Cryptotermes

D

darwiniensis, Reticulitermes

cynocephalus, Reticulitermes

TE

speratus, Nasutitermes takasagoensis, and O. formosanus (Scharf, 2015). Those studies sequenced the host termite (workers or soldiers), gut microbial symbionts, or combined

CE P

the host and symbionts. However, the transcriptome of Cr. domesticus has not previously been reported. Here, the combined transcriptome of Cr. domesticus, including host (pseudergate caste) and gut symbionts, was performed using Illumina HiSeq 2000

AC

technology. Using the Trinity method, 108.7 million sequencing reads were assembled into 302,979 contigs (mean length = 648 bp) and 185,248 unigenes. The dataset here was slightly larger than from the O. formosanus study, which used the same Illumina sequencing platform (Huang et al., 2012). Overall, 50.27% of the contigs had a high homology with sequences in the public databases and were classified into functional categories. These functional annotations provide a valuable genomic resource for investigating specific processes, structures, functions, and pathways in Cr. domesticus. Transcriptome data comparison between Cr. domesticus and Z. nevadensis was performed. Of the contigs of Cr. domesticus, 19.15% showed homologous matches in the results of genome study of Z. nevadensis (Terrapon et al., 2014). The low percentage of matched contigs presented here may be due to three possible reasons. First, the 16

ACCEPTED MANUSCRIPT phylogenetic relationship between the two species may not be very close. Although both species were lower termites in the Isoptera, they belong to different families: Cr. domesticus in Kalotermitidae and Z. nevadensis in Termopsidae. Second, there may be

IP

T

different compositions of the transcriptome. The genome of Z. nevadensis focused on the host termite, only the head of soldiers. However, our study pooled the whole body of

SC R

pseudergates and considered not only the host but also the symbionts. The species distribution (Fig. 3) showed that 36.82% of the contigs matched with T. vaginalis sequences and other species, suggesting the assembly transcriptome of Cr. domesticus

NU

contained some protozoa, bacteria, and fungi. T. vaginalis may also be a primary symbiont of Cr. domesticus, which has not been recognized before. Third, the lack of

MA

available genomic information increased the difficulty of assembly. According to the length distribution (Fig. 1), most of the assembled contigs and ORFs were less than 1,000 bp,

TE

D

which can make it difficult to match homologous sequences.

4.2. Fitness of Cr. domesticus in different wood species

CE P

Woods are host and food for Cr. domesticus, providing the living conditions and nutrients. In the current study, we compared the genetic differences among Cr. domesticus pseudergates living in three distinct wood species. We chose Ca. equisetifolia, K. excelsa,

AC

and Myristica sp. woods for Cr. domesticus to establish colonies and constructed three DGE libraries of the pseudergates, respectively. Our results showed that pseudergates of CE library expressed more genes than pseudergates of KE and MS libraries, and the expressed gene number was the least in the KE library. The quantity variance suggested that drywood termites can survive in different woods but they may express different genes, which may be inhibited or activated. The expression level of many genes was low (RPKM < 5). However, the number of genes with RPKM ≥ 5 and RPKM ≥ 10 in MS library was more than the other two libraries (Table 1). The relatively high expression level of MS library suggested genes in MS were active and may result in high consumption of Myristica sp. wood. A number of DEGs were identified among the comparison KE vs. CE, KE vs. MS, and CE vs. MS. According to the results, most genes of pseudergates living in K. excelsa wood 17

ACCEPTED MANUSCRIPT were significantly down-regulated, and most genes were up-regulated in Myristica sp. wood. The result of gene expression analysis suggested that different wood species played an important impact on the expression of termite genes. Myristica sp. wood may be

T

more positive to drywood termite and K. excelsa wood may be opposite.

IP

In the GO functional analysis of DEGs, the GO term of „catalytic activity‟ was enriched in

SC R

all three datasets, suggesting that termites shared some common activities for living in different woods but reactions between substrates and enzymes were various. We suggest that different enzymes were regulated their activity in different wood species environments.

NU

For example, the activity of monooxygenase, GTPase regulator, hydrolase and ligase may be increased in termites living in Ca. equisetifolia wood and decreased in K. excelsa, and

MA

Myristica sp. woods. The carbon-oxygen lyase, cis-trans isomerase and intramolecular oxidoreductase may increase their activities in termites living in K. excelsa wood. Because

D

different wood species have distinct composition, moisture, hardness, etc. Those

TE

characters constitute specific environments and nutrients for termites and may influence various enzymes and their activities.

CE P

Among the common DEGs, a majority of genes involved in carbohydrate metabolism pathways were identified, especially the genes had hydrolase activity belong to the glycosyl hydrolase family. Most genes were significantly down-regulated in pseudergates

AC

living in K. excelsa wood, suggesting a weak hydrolase activity and cellulose metabolism may be reduced. Together, we suggested that Myristica sp. wood may be the most fitting woods for Cr. domesticus among three studied wood species.

4.3. Genes encoding the cellulases Different woods have different nutrient composition that can influence the behavior and metabolism of wood-living insects. The chemical composition of Ca. equisetifolia wood includes 42.68% cellulose, 22.30% xylogen, 14.36% pentosan, and 10.85% water (Lai and Lin, 1996). However, study of the chemical composition of the other two woods was limited. However, cellulose is likely a major constituent of the different woods, and cellulose metabolism plays an important role in termites. Thirty-two cellulases were found in the transcriptome analyzed here, including 11 endoglucanases, 5 exoglucanases, and 18

ACCEPTED MANUSCRIPT 16 β-glucosidases, which provide a rich resource for exploiting cellulase genes. Among them, a considerable proportion of cellulases identified in this study are homologous to genes of symbionts, such as protists, fungi, and amoebae, according to the NR

T

description. Our cellulases covered GHF 1, 3, 7, 9 and 45. All identified exoglucanases

IP

were in GHF 7 and were of endosymbiotic origin. Both endoglucanase and β-glucosidase

SC R

in Cr. domesticus transcriptome have two possible origins: termite or symbiotic microbes. Our study found that the termite origin homologs belonged to GHF 1 and 9, β-glucosidase and endoglucanase, respectively, which was consistent with reports that termite origin

NU

cellulases were in GHF 1 and 9 (Xiang and Zhou, 2009). This result also suggests that Cr. domesticus digests cellulose using endogenous cellulases and cellulases from symbionts,

MA

similar to Co. formosanus that has two cellulose-digesting systems (Nakashima et al., 2002). Moreover, the quantity and types of enzymes from termite symbionts were greater

D

in number than those from the termite itself, which suggested that Cr. domesticus mainly

TE

relies on symbionts to digest and absorb nutrients. The expression of cellulases varied among the pseudergates living in different woods. Genes of termite origin were had higher

CE P

expression in the CE library relative to the KE and MS libraries. Beta-glucosidase expression from symbionts was up-regulated in the MS library. All of the genes from termites living in K. excelsa wood had the lowest levels of gene expression. These results

AC

suggest that termites living in Ca. equisetifolia wood depend primarily on termite origin cellulases, while in Myristica sp. wood, symbionts play an important role in cellulose metabolism. Moreover, the results further indicate that termites living in K. excelsa wood have reduced metabolic activities. Beta-glucosidase catalyzes the hydrolysis of terminal non-reducing residues in beta-D-glucosides and releases of glucose (Nelson and Cox, 2000). We validated three β-glucosidase encoding genes of termite origin from the Cr. domesticus transcriptome and compared their sequences with homologs in related species. All three had the conserved glycoside hydrolase family 1 domain, which suggested that they were members of GHF 1. However, only one gene, comp201932_c1, had two GHF1 active sites. Gene comp139691_c0 and N. Koshunensis β-glucosidase were both without the GHF1 N-terminal active site. Gene comp201652_c0 had one amino acid replaced and 19

ACCEPTED MANUSCRIPT thus does not match the active site pattern; this may be a novel β-glucosidase mutation. Beta-glucosidase expression among the three libraries was different. However, comp139691_c0 was expressed at higher levels than the other two β-glucosidase gene in

IP

T

all of the libraries, suggesting that comp139691_c0 is the main β-glucosidase in Cr. domesticus and that the other genes may have supplementary functions. However, the

SC R

genes‟ true functions should be researched in future studies.

NU

5. Conclusion

We have generated a transcriptome of Cr. domesticus and constructed three libraries

MA

using the Illumina platform. These efforts revealed a large number of genes of both known and unknown functions, thus greatly changing the current status of available genetic information for this species. This transcriptome can be used as a reference for

D

comparative studies within the genus or family. The SSRs predicted in this study could lay

TE

a platform for better understanding the molecular evolution of Cr. domesticus. This study is also a first step toward a better understanding of the effect of wood nutrition on termites

CE P

at the molecular level, providing valuable information for further investigations of the mechanisms of regulation. However, as the sequences obtained by Illumina sequencing are short, full-length sequence data obtained through RACE PCR are needed to elucidate

AC

most of the interesting putative genes, and the SSRs need to be verified to exclude false positives and sequencing errors.

Conflict of interest The authors have declared that no conflict of interest exist.

Acknowledgements The authors thank Annoroad Gene Technology Corporation (Beijing, China) for assisting with transcriptome sequencing. We are grateful to all persons that in any way contributed to the development of this work.

20

ACCEPTED MANUSCRIPT Funding Source This study was financially supported by grants from the National Natural Science

T

Foundation of China (31172163), the Science and Technology Planning Project of

IP

Guangzhou city, China (201510010036), and the Special Planning of Major Scientific and

SC R

Technological Production (Cultivation project), Guangdong Academy of Sciences (zdccyd201307).

NU

References

Conesa, A., Götz, S., García-Gómez, J.M., Terol, J., Talón, M. and Robles, M., 2005. Blast2GO: a universal tool for annotation, visualization and analysis in functional

MA

genomics research. Bioinformatics 21, 3674-3676. Evans, T.A., Forschler, B.T. and Grace, J.K., 2013. Biology of invasive termites: a worldwide review. Annual Review of Entomology 58, 455-474. Garcia, C.M., Eusebio, D.A., San Pablo, M.R. and Villena, E.M., 2011. Resistance of wood

D

wool cement board to the attack of Philippine termites. Insects 3, 18-24.

TE

Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R. and Zeng, Q., 2011. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature biotechnology 29, 644-652.

CE P

Hadi, Y., Nurhayati, T., Yamamoto, H. and Kamiya, N., 2012. Resistance of smoked wood to subterranean

and

dry-wood

termite

attack.

International

Biodeterioration

&

Biodegradation 70, 79-81. Huang, F., Zhu, S., Ping, Z., He, X., Li, G. and Gao, D., 2000. Fauna Sinica. Insecta,

Science

AC

Press, Beijing.

Huang, Q., Sun, P., Zhou, X. and Lei, C., 2012. Characterization of head transcriptome and analysis of gene expression involved in caste differentiation and aggression in Odontotermes formosanus (Shiraki). PloS one 7, e50383.

Huang, Z., Qian, X., Zhong, J., Hu, J., Li, Z., Li, Q. and Liu, B., 2011. Influence of Colony Age on Production and Fecundity of Replacement Reproductives in the Dry Wood Termite Cryptotermes domesticus (Isoptera: Kalotermitidae). Sociobiology 58, 77-84. Huang, Z., Qian, X., Zhong, J., Xia, C. and Hu, J., 2007. Progress of biological studies on primary reproductives in Cryptotermes domesticus (Isoptera: Kalotermitidae). sociobiology 50, 599-606. Korf, I., Yandell, M. and Bedell, J., 2003. Blast,

O'Reilly Media, Inc., Sebastopol, CA.

Lai, S. and Lin, Z., 1996. Chemical composition and fiber form of Casuarina equisetifolia wood. Journal of Fujian College of Forestry 16, 89-92. Li, H., Scheffrahn, R.H., Su, N., Kanzaki, N. and Yang, R., 2008. Survey of the termites (Isoptera: Kalotermitidae, Rhinotermitidae, Termitidae) of Lanyu Island, Taiwan. Florida entomologist 91, 472-473. Nakashima, K., Watanabe, H., Saitoh, H., Tokuda, G. and Azuma, J.-I., 2002. Dual 21

ACCEPTED MANUSCRIPT cellulose-digesting system of the wood-feeding termite, Coptotermes formosanus Shiraki. Insect biochemistry and molecular biology 32, 777-784. Nelson, D.L. and Cox, M.M., 2000. Lehninger principles of biochemistry, 3rd ed. W. H. Freeman, New York.

T

Qian, X., Huang, Z., Zhong, J., Liu, B., Xia, C., Huang, H., Xia, F. and Yang, R., 2005. Influences of different species of wood on new colonies of Cryptotermes domesticus

IP

(Haviland) (Isoptera: Kalotermitidae). Natural enemies of insects 27, 170-177. Robinson, M.D. and Oshlack, A., 2010. A scaling normalization method for differential

SC R

expression analysis of RNA-seq data. Genome Biol 11, R25.

Rust, M.K. and Su, N., 2012. Managing social insects of urban importance. Annual Review of Entomology 57, 355-375.

Sbeghen-Loss, A.C., Mato, M., Cesio, M.V., Frizzo, C., de Barros, N.M. and Heinzen, H., 2011.

NU

Antifeedant activity of citrus waste wax and its fractions against the dry wood termite, Cryptotermes brevis. Journal of Insect Science 11, 159. Scharf, M.E., 2015. Omic research in termites: an overview and a roadmap. Frontiers in

MA

genetics 6.

Scheffrahn, R.H., Su, N., Chase, J.A., Mangold, J.R., Grace, J.K. and Yates III, J.R., 2000. First record of Cryptotermes cynocephalus Light (Isoptera: Kalotermitidae) and natural woodland infestations of C. brevis (Walker) on Oahu, Hawaiian Islands. Hawaiian

D

Entomological Society 34, 121-125.

TE

Terrapon, N., Li, C., Robertson, H.M., Ji, L., Meng, X., Booth, W., Chen, Z., Childers, C.P., Glastad, K.M., Gokhale, K., Gowin, J., Gronenberg, W., Hermansen, R.A., Hu, H., Hunt, B.G., Huylmans, A.K., Khalil, S.M., Mitchell, R.D., Munoz-Torres, M.C., Mustard,

CE P

J.A., Pan, H., Reese, J.T., Scharf, M.E., Sun, F., Vogel, H., Xiao, J., Yang, W., Yang, Z., Yang, Z., Zhou, J., Zhu, J., Brent, C.S., Elsik, C.G., Goodisman, M.A., Liberles, D.A., Roe, R.M., Vargo, E.L., Vilcinskas, A., Wang, J., Bornberg-Bauer, E., Korb, J., Zhang, G. and Liebig, J., 2014. Molecular traces of alternative social organization in a termite

AC

genome. Nature communications 5, 3636. Xiang, H. and Zhou, Z., 2009. Lignocellulolytic enzymes in termite and its symbiotic microbes. Chinese Bulletin of Entomology 46, 32-40.

Ye, J., Fang, L., Zheng, H., Zhang, Y., Chen, J., Zhang, Z., Wang, J., Li, S., Li, R. and Bolund, L., 2006. WEGO: a web tool for plotting GO annotations. Nucleic acids research 34, W293-W297.

22

ACCEPTED MANUSCRIPT Figure captions Fig. 1. Length distribution of Cr. domesticus transcriptome. (A) Histogram presentation of

T

contig length distribution. (B) Histogram presentation of ORF length distribution.

IP

Fig. 2. Proportion of different query sequence length with annotations in the public

SC R

databases. Blue bar represents sequence length under 300 bp, red bar represents sequence length between 300 and 1000 bp, and green bar represents sequence length above 1000 bp. Number of sequences is shown in each bar.

NU

Fig. 3. Species distribution of the BLASTX results. This figure shows the species distribution of contigs BLASTX results against the NR protein database with a cut-off

MA

E-value < 10-5 and the proportions of each species.

Fig. 4. Gene Ontology (GO) analysis and functional classification of the Cr. domesticus transcriptome. The 50,329 matched contigs were classified into 3 functional categories:

D

molecular function, biological process and cellular component.

TE

Fig. 5. Differentially expressed genes in different wood-living termites. Koompassia excelsa (KE) vs. Casuarina equisetifolia (CE) refers to the comparison of pseudergates

CE P

living between KE and CE. KE vs. MS refers to the comparison of living pseudergates between KE and Myristica sp. (MS). CE vs. MS refers to the comparison of living pseudergates between CE and MS.

AC

Fig. 6. GO function enrichment analysis of DEGs. Fig. 7. Hierarchical cluster analysis of common DEGs. Fig. 8. The sequence character of β-glucosidase. BG means β-glucosidase, Cfo means Coptotermes formosanus, Nko means Neotermes koshunensis, Pam means Periplaneta americana. Fig. 9. Alignment of the 5-element fingerprint sequences in family 1 glycosyl hydrolases. BG means β-glucosidase, Cfo means Coptotermes formosanus, Nko means Neotermes koshunensis, Pam means Periplaneta americana. The fingerprint contains five motifs and is boxed in red. The glycosyl hydrolases family 1 active site is shaded in grey.

23

AC

Figure 1

CE P

TE

D

MA

NU

SC R

IP

T

ACCEPTED MANUSCRIPT

24

MA

NU

SC R

IP

T

ACCEPTED MANUSCRIPT

AC

CE P

TE

D

Figure 2

25

MA

NU

SC R

IP

T

ACCEPTED MANUSCRIPT

AC

CE P

TE

D

Figure 3

26

MA

NU

SC R

IP

T

ACCEPTED MANUSCRIPT

AC

CE P

TE

D

Figure 4

27

NU

SC R

IP

T

ACCEPTED MANUSCRIPT

AC

CE P

TE

D

MA

Figure 5

28

TE

D

MA

NU

SC R

IP

T

ACCEPTED MANUSCRIPT

AC

CE P

Figure 6

29

MA

NU

SC R

IP

T

ACCEPTED MANUSCRIPT

AC

CE P

TE

D

Figure 7

30

TE

D

MA

NU

SC R

IP

T

ACCEPTED MANUSCRIPT

AC

CE P

Figure 8

31

AC

Figure 9

CE P

TE

D

MA

NU

SC R

IP

T

ACCEPTED MANUSCRIPT

32

ACCEPTED MANUSCRIPT Table Table 1 Sequencing data summary. KE

10,844 0

501 86,580

MS

13,148,500 12,768,789 97.11 226,575

14,255,436 13,848,995 97.15 293,058

371 152,765

396 112,987

97.32

97.36

96.91

39

39

39

44 -

48 13,774,436

45 10,895,594

53 11,114,511

185,248

81.75 135,298

85.33 83,202

80.26 129,277

6,361

9,182

4,073

16,737

3,019

4,545

2,142

7,124

NU

D

MA

37

IP

17,215,853 16,850,008 97.87 278,764

SC R

126,777,292 108,745,470 85.78 18,020,978

94.98

AC

CE P

Raw reads number Clean reads number Clean reads rate (%) Low-quality reads number Ns reads number Adapter polluted reads number Clean Q30 bases rate Per sequence quality scores GC content (%) All tags mapping to gene Mapping rate (%) Total expressed genes (RPKM > 0) Total expressed genes (RPKM ≥ 5) Total expressed genes (RPKM ≥ 10)

CE

T

Transcriptome

TE

Sample

33

ACCEPTED MANUSCRIPT

Repeat numbers 5

6

7

8

9

10

>10

Di Tri

0 0

0 1,156

0 987

0 124

618 2

722 0

Tetra

666

80

9

6

0

Total %

666 14.14

1,236 26.25

996 21.15

130 2.76

620 13.17

IP

Motif length

T

Table 2 Summary of SSRs information in the transcriptome of Cr. domestictus.

Average length (bp)

%

336 2

1,676 2,271

35.59 48.23

19.67 19.66

0

1

762

16.18

20.72

722 15.33

339 7.20

US

MA N

AC

CE P

TE D

4,709

34

CR

Total

19.84

Repeat units AT/TA, TG/AC, GT/CA TGA/ACT, TTA/AAT, TAT/ATA, ATT/TAA, CAG/GTC, CTG/GAC, GAT/CTA TATG, AAGA, TTTC, AGTA, TGCT, ATGA, ATTC, CATA, GTAT, TATT, ATAC

ACCEPTED MANUSCRIPT

Expression level (RPKM value) Identities 3.07

0.36

92%

78%

Zootermopsis nevadensis

KDR16731

18944.48

13724.78

8543.96

94%

82%

Neotermes koshunensis

BAD12006

2575.56

804.87

789.43

78%

99%

Hodotermopsis sjostedti

BAD12004

1.32

0

0.32

73%

98%

Odontotermes formosanus

ADB82658

179.09

102.59

2449.55

77%

94%

symbiotic protist of Mastotermes darwiniensis

BAF57448

comp209945_c0

20.8

22.22

70.37

71%

95%

symbiotic protist of Mastotermes darwiniensis

BAF57443

comp139286_c0

17.2

7.91

127.14

85%

99%

symbiotic protist of Neotermes koshunensis

BAF57387

comp213308_c0

13.16

c ac

comp264088_c0

7

Endoglucanase

comp204068_c0

bc

comp188859_c0 comp190349_c0 comp208235_c0 7

Exoglucanase

86%

97%

symbiotic protist of Neotermes koshunensis

BAF57384

23.92

0.28

17.95

93%

98%

symbiotic protist of Neotermes koshunensis

BAF57396

644.98

28.04

154.95

91%

94%

symbiotic protist of Neotermes koshunensis

BAF57391

abc

442.89

204.5

94%

99%

symbiotic protist of Neotermes koshunensis

BAF57379

comp182791_c0

2.26

comp202383_c1

52.63

15.47

1

0.65

89%

86%

symbiotic protist of Neotermes koshunensis

BAF57377

2.57

21.55

60%

97%

symbiotic protist of Cryptocercus punctulatus

BAF57463

45.66

1.69

2.84

95%

98%

symbiotic protist of Neotermes koshunensis

BAF57382

46.29

14.18

6719.68

97%

94%

symbiotic protist of Neotermes koshunensis

BAF57389

2.65

0

2.96

67%

98%

symbiotic protist of Mastotermes darwiniensis

BAF57417

comp150477_c0

0.25

0

0

53%

98%

Zootermopsis nevadensis

BAO85045

comp161273_c0

0.54

0.13

0.31

79%

87%

Coptotermes formosanus

AEW67361

1.16

0.49

0.8

62%

83%

Coptotermes formosanus

AEW67361

300.44

72.4

31.3

63%

63%

Periplaneta americana

AIA09348

1.33

0.54

0.34

57%

97%

Coptotermes formosanus

AGM32308

2055.88

619.81

510.52

89%

87%

Neotermes koshunensis

BAB91145

comp212347_c0

bc

comp250097_c0 Beta-glucosidase

70.06

ac

comp214625_c1

1

4.5

CE P

comp218787_c0

Endoglucanase

number

2.24

comp215949_c0

45

Accession NR Species

cover

MS

TE D

comp192032_c0

comp161273_c1 comp201932_c1

ac

comp215861_c0 comp139691_c0

ac

AC

Endoglucanase

KE

CR

CE 9

Query

Gene ID

US

Homologs

MA N

Family

IP

T

Table 3 Candidate genes encoding cellulases of glycosyl hydrolase family.

35

ACCEPTED MANUSCRIPT

comp215086_c2

b

comp210970_c0 comp214419_c0 comp214419_c2

abc

comp215086_c1 comp203451_c2 comp214757_c1 comp206523_c0

17.02

0.93

96.09

55%

32

1.46

98.27

59%

1.92

0.43

46.21

47%

6.5

0.5

42.57

57.44

3.58

206.86

81%

Neotermes Koshunensis

T

59%

96%

BAB91145

Acanthamoeba castellanii str. Neff

XP_004339837

Dictyostelium purpureum

XP_003288844

92%

Polysphondylium pallidum PN500

EFA84959

68%

95%

Dictyostelium fasciculatum

XP_004356243

46%

92%

Dictyostelium discoideum AX3

AAA74233

55%

94%

Dictyostelium discoideum AX3

AAA74233

51%

96%

Dictyostelium purpureum

XP_004356243

IP

comp214419_c1

3.28

100%

CR

Beta-glucosidase

11.41

27.7

1.77

101.7

bc

3.22

0.87

136.15

bc

2.07

0.92

73.38

65%

99%

Dictyostelium discoideum AX4

XP_629427

bc

5.49

0.86

94.11

55%

97%

Dictyostelium discoideum AX4

XP_629427

a

TE D

DEGs in comparison of KE vs. CE. DEGs in comparison of KE vs. MS. c DEGs in comparison of CE vs. MS.

MA N

3

15.92

US

comp201652_c0

AC

CE P

b

36

ACCEPTED MANUSCRIPT Abbreviations list BG, β-glucosidase;

T

CE, pseudergates living in Casuarina equisetifolia woods;

IP

COG, Cluster of Orthologous Groups;

SC R

DEG, Differentially Expressed Gene; DGE, Digital Gene Expression; GHF, Glycosyl Hydrolase Family;

NU

GO, Gene Ontology;

KE, pseudergates living in Koompassia excelsa woods;

MA

KEGG, Kyoto Encyclopedia of Genes and Genomes; KOG, euKaryotic Orthologous Groups of proteins; MS, pseudergates living in Myristica sp. woods

D

NR, NCBI non-redundant protein;

TE

NT, NCBI nucleotide; ORF, Open Reading Frame;

CE P

PFAM, protein family;

RPKM, Reads Per Kilobase per Million SRA, NCBI Sequence Read Archive;

AC

SSR, Simple Sequence Repeat; TrEMBL, a computer-annotated supplement of Swiss-Prot that contains all the translations of EMBL nucleotide sequence entries not yet integrated in Swiss-Prot.

37

ACCEPTED MANUSCRIPT Highlights • The transcriptome of Cryptotermes domesticus was first sequenced and assembled.

SC R

• The fitness of termites in distinct woods was different.

IP

T

• Different wood species played an important impact on the expression of termite genes.

AC

CE P

TE

D

MA

NU

• Genes associated with cellulose metabolism were identified.

38