Accepted Manuscript Brassica rapa genome 2.0: a reference upgrade through sequence re-assembly and gene re-annotation Chengcheng Cai, Xiaobo Wang, Bo Liu, Jian Wu, Jianli Liang, Yinan Cui, Feng Cheng, Xiaowu Wang PII: DOI: Reference:
S1674-2052(16)30278-7 10.1016/j.molp.2016.11.008 MOLP 396
To appear in: MOLECULAR PLANT Accepted Date: 9 November 2016
Please cite this article as: Cai C., Wang X., Liu B., Wu J., Liang J., Cui Y., Cheng F., and Wang X. (2016). Brassica rapa genome 2.0: a reference upgrade through sequence re-assembly and gene reannotation. Mol. Plant. doi: 10.1016/j.molp.2016.11.008. This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. All studies published in MOLECULAR PLANT are embargoed until 3PM ET of the day they are published as corrected proofs on-line. Studies cannot be publicized as accepted manuscripts or uncorrected proofs.
ACCEPTED MANUSCRIPT Brassica rapa genome 2.0: a reference upgrade through sequence re-assembly and gene re-annotation
RI PT
Dear Editor,
Brassica rapa includes many important crops that are cultivated as vegetables,
condiments, and oilseeds. Recently, the Brassica genomes have been sequenced extensively: a B. rapa draft reference genome in 2011 (Wang et al., 2011), a B.
SC
oleracea in 2014 (Liu et al., 2014), a B. napus in 2014 (Chalhoub et al., 2014), and B. nigra and B. juncea in 2016 (Yang et al., 2016). The first released B. rapa genome
M AN U
reference served as a valuable resource in the genome assembly and annotation of the other Brassicas (Chalhoub et al., 2014; Liu et al., 2014; Parkin et al., 2014). B. rapa has been used widely in Brassica comparative and evolutionary genomics among the Brassicaceae (Cheng et al., 2013). However, the first B. rapa genome assembly (version 1.5) is only about 283.8 Mb, 58.52% of the estimated genome size (485 Mb)
TE D
(Wang et al., 2011). Considering that much of the genome assembly is still missing (41.48%), there is a considerable possibility that important genes have been missed.
EP
In this study, the B. rapa genome was de novo assembled and largely improved upon with more Illumina and PacBio sequencing. This improved assembly, B. rapa genome (V2.0), includes a total of ~76 Gb Illumina paired-end reads (≈156×), 21 Gb
AC C
mate-paired reads, and 6.5 Gb of PacBio single-molecule data (Supplemental Table 1). The data for the new genome were generated with read-through paired-end reads, 20 Kb and 40 Kb mate-paired reads, single-molecule data, and all of the reads used in the previous assembly of B. rapa genome V1.5 (Wang et al., 2011). The new assembly has a total size of 389.2 Mb (Supplemental Methods). It covers approximately 80.25% of the B. rapa genome, 106 Mb more than V1.5. The new assembly contains 86,986 scaffolds, with a scaffold N50 of 3.38 Mb. 90% of the assembled sequences fall into 349 scaffolds larger than 26 Kb (Supplemental Table 2). Statistical comparisons were performed on V2.0 before and after the inclusion of the PacBio data (Supplemental 1
ACCEPTED MANUSCRIPT Table 3).
Two B. rapa segregating populations were used to construct high-density genetic maps to correct assembly errors and assign the scaffolds to the chromosomes of B.
RI PT
rapa. One genetic map was generated from a RIL (recombinant inbred lines) population (Yu et al., 2013), and the other was built on a F1DH (double haploid)
population. A total of 96,589 and 6,944 polymorphic SNPs were identified between
the two parents of each population (Supplemental Methods). With these SNPs, 2,063
SC
and 1,622 bin-markers were identified and used to construct two genetic maps with total genetic distances of 1,316.731 cM and 1,391.516 cM, respectively
M AN U
(Supplemental Table 4). These newly generated genetic maps have much higher marker-density than the previous version which was built with InDel and SSR markers in V1.5 (Wang et al., 2011). A total of 28 scaffolds were corrected (Supplemental Table 5). After that, scaffolds were ordered along the 10 chromosomes of B. rapa based on the two maps and syntenic evidence (Supplemental Methods and
TE D
Supplemental Figure 1). ~85% (>138 scaffolds) of the assembly was assigned to chromosomes, which are ~330 Mb (Supplemental Table 4 and 6). Interestingly, three chromosomes (A05, A06, and A09) had genetic map inconsistences in F1DH
EP
(Supplemental Figure 2), indicating possible structural variation among varieties (parents of population) of B. rapa.
AC C
Three methods were used to predict gene models in the updated genome: ab initio modeling, homologous gene detection, and transcript fragment mapping. For transcript based gene prediction, large volumes of mRNA-Seq data were used to improve gene prediction quality (Supplemental Methods). The summary statistics of the updated gene set and its comparison to previous version are shown in Supplemental Table 7. 48,826 protein-coding genes were predicted in V2.0, which is 7,652 more than V1.5. Among the annotated genes, a great increase in the number of multi-exon genes was observed in V2.0, about 4,610 more than V1.5. Furthermore, we improved the gene models by adding UTRs and alternative splicing isoforms in 2
ACCEPTED MANUSCRIPT V2.0—which were not available in V1.5.
Interestingly, among these 7,652 additional genes predicted in V2.0, 35.93% were tandem genes, which are more abundant than those in the V1.5 tandem genes
RI PT
(12.70%). 1,368 more tandem arrays (corresponding to 2,749 tandem genes) were identified in V2.0 than in V1.5 (Supplemental Figure 3), highlighting the importance of a high-quality assembly in the prediction of tandemly duplicated genes. Based on genome-wide synteny analysis, there were 453 tandem arrays in V1.5 and 1,636 in
SC
V2.0 corresponding to single genes (non-tandem) in V2.0 and V1.5, respectively
(Supplemental Table 8). Tandem gene arrays in V2.0 are significantly enriched for
M AN U
genes associated with defense response, membrane proteins, and enzyme activity, (Supplemental Table 9) which could be the result of adaptation to environmental conditions. These over-retained genes after tandem duplication were inconsistent with genes that were over-retained after WGT, which supports the gene balance hypothesis (Freeling, 2009). This indicates that there is a reciprocal relationship between the
duplication.
TE D
frequencies of retained genes post-tandem duplication and post-whole-genome
EP
By considering each tandem array as one gene locus, there were 38,110 and 44,374 genes for V1.5 and V2.0, respectively (Supplemental Table 10). Among which, 34,276 genes (89.93%) in V1.5 were identified as counterparts to 35,744 genes
AC C
(80.55%) in V2.0. These gene pairs were separated into three groups: 1) one-to-one gene pairs; 2) multiple-to-one gene pairs; and 3) one-to-multiple gene pairs (Supplemental Figure 4A and Supplemental Table 10). A total of 29,294 (76.87% in V1.5 and 66.02% in V2.0) one-to-one gene pairs were identified. Among which, 14,879 were 100% identical at the sequence level. For the others, sequence differences were found as exon changes or point variations (Supplemental Figure 4C). In the group of one-to-multiple gene arrays, there were 2,327 genes in V1.5 corresponding to 4,972 genes in V2.0; while for multiple-to-one gene arrays, there were 2,235 genes in V1.5 corresponding to 1,076 genes in V2.0. mRNA-Seq data 3
ACCEPTED MANUSCRIPT suggested that 196 out of 1,076 multiple genes in V1.5, and 1,295 out of 2,327 multiple genes in V2.0 should be merged (Supplemental Methods). Additionally, we found 3,834 genes in V1.5 and 8,630 in V2.0 were version-specific genes.
RI PT
A new TE expansion event was observed in B. rapa genome V2.0, which was not detected previously. This is contrary to previous findings that indicated a lack of TE expansion in B. rapa after whole genome triplication. In V2.0, 32.30% (~54 Mb) of the genome is comprised of TEs, while only 25.44% was found previously in V1.5
SC
(Figure 1A,1B). As shown in Supplemental Figure 5, more than 70% of the TEs were located in the length interval of 0–300 bp. Additionally, many more TEs were
M AN U
assembled at centromere regions in 10 of the chromosomes in V2.0, indicating a more complete centromeric assembly (Supplemental Figure 6). More importantly, using the same method, a total of 4,129 full-length LTRs (fl-LTR) were annotated in V2.0, compared to 801 in V1.5. With these fl-LTRs, a new TE expansion event ~6.5 million years ago was identified (Figure 1C), which is much earlier than the TE expansion in
TE D
B. oleracea (Liu et al., 2014).
We compared the two B. rapa genomes to identify the content of the additional ~106
EP
Mb added in V2.0. Approximately 6 Mb are CDSs, ~4 Mb are introns, ~54 Mb are TEs, ~30 Mb are intergenic components, and ~12 Mb are gaps (N’s) (Figure 1B). Newly annotated genes were identified in V2.0 within syntenic regions between the
AC C
two genomes. Gaps (Ns) between two neighbor genes in V1.5 that were amended in V2.0, led to the identification of newly annotated genes (Figure 1D). For those without gaps between neighbor genes, two situations were observed: 1) extra genes were predicted in the more completely assembled sequences (Figure 1E); 2) extra genes were predicted with a more comprehensive predicting process (Figure 1F).
With B. rapa genome V2.0, we reconstructed the three sub-genomes and performed comparative and evolutionary genomic analyses. The results support previous findings on the over-retention of specific genes (Wang et al., 2011), biased 4
ACCEPTED MANUSCRIPT fractionation, dominant gene expression among sub-genomes (Feng et al., 2012), and small RNA mediated TE methylation in the regulation of expression between paralogues (Wang and Cheng, 2016). These analyses provide ready-to-use datasets to
RI PT
facilitating the update of previous research to the new B. rapa genome.
In summary, a new B. rapa genome has been assembled, and new gene models have been established. A more complete assembly of B. rapa reference genome will be a valuable resource in exploring information related to genes, TEs, and other factors,
M AN U
SC
thus contributing to comparative genomics and gene function research in Brassica.
SUPPLEMENTAL INFORMATION
Supplemental Information is available at Molecular Plant Online.
FUNDING
TE D
This work is funded by the National Program on Key Research Project (2016YFD0100307); the 973 program (2013CB127000); the National Natural Science Foundation of China (31301771, 31630068) and National Science and Technology
EP
Ministry (2014BAD01B09); the Science and Technology Innovation Program of the Chinese Academy of Agricultural Sciences; the Key Laboratory of Biology and
AC C
Genetic Improvement of Horticultural Crops, Ministry of Agriculture, P. R. China.
AUTHOR CONTRIBUTIONS Xiaowu Wang and F.C. planned and designed the research. J.W., J.L. and Y.C. conducted fieldwork and performed experiments. C.C., Xiaobo Wang, B.L. and F.C. analyzed data. F.C. and C.C. wrote the manuscript.
ACKNOWLEDGEMENTS No conflict of interest declared. 5
ACCEPTED MANUSCRIPT Chengcheng Cai1, Xiaobo Wang1, Bo Liu, Jian Wu, Jianli Liang, Yinan Cui, Feng Cheng*, and Xiaowu Wang*
Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences,
1
RI PT
Zhongguancun Nandajie No.12, Haidian district, Beijing 100081, P.R. China.
These authors contributed equally to this article.
*
Correspondence: Xiaowu Wang (
[email protected]),
AC C
EP
TE D
M AN U
SC
Feng Cheng (
[email protected])
6
ACCEPTED MANUSCRIPT REFERENCES Chalhoub, B., Denoeud, F., Liu, S., Parkin, I.A., Tang, H., Wang, X., Chiquet, J., Belcram, H., Tong, C., and Samans, B. (2014). Early allopolyploid evolution in the post-Neolithic Brassica napus oilseed genome. Science 345:950-953. Cheng, F., Mandáková, T., Wu, J., Xie, Q., Lysak, M.A., and Wang, X. (2013). Deciphering the diploid ancestral genome of the mesohexaploid Brassica rapa. The Plant Cell 25:1541-1554.
RI PT
Feng, C., Jian, W., Lu, F., Silong, S., Bo, L., Ke, L., Guusje, B., and Xiaowu, W. (2012). Biased Gene Fractionation and Dominant Gene Expression among the Subgenomes of Brassica rapa. PloS one 7:e36442.
Freeling, M. (2009). Bias in plant gene content following different sorts of duplication: tandem, whole-genome, segmental, or by transposition. Annual review of plant biology 60:433-453.
Liu, S., Liu, Y., Yang, X., Tong, C., Edwards, D., Parkin, I.A., Zhao, M., Ma, J., Yu, J., and Huang, S. (2014).
SC
The Brassica oleracea genome reveals the asymmetrical evolution of polyploid genomes. Nature communications 5.
Parkin, I.A., Koh, C., Tang, H., Robinson, S.J., Kagale, S., Clarke, W.E., Town, C.D., Nixon, J., Krishnakumar,
M AN U
V., and Bidwell, S.L. (2014). Transcriptome and methylome profiling reveals relics of genome dominance in the mesopolyploid Brassica oleracea. Genome biology 15:R77. Wang, X., and Cheng, F. (2016). Epigenetic regulation of subgenome dominance following whole genome triplication in Brassica rapa. New Phytologist:1.
Wang, X., Wang, H., Wang, J., Sun, R., Wu, J., Liu, S., Bai, Y., Mun, J.-H., Bancroft, I., and Cheng, F. (2011). The genome of the mesopolyploid crop species Brassica rapa. Nature genetics 43:1035-1039.
TE D
Yang, J., Liu, D., Wang, X., Ji, C., Cheng, F., Liu, B., Hu, Z., Chen, S., Pental, D., Ju, Y., et al. (2016). The genome sequence of allopolyploid Brassica juncea and analysis of differential homoeolog gene expression influencing selection. Nat Genet. Yu, X., Wang, H., Zhong, W., Bai, J., Liu, P., and He, Y. (2013). QTL mapping of leafy heads by genome
AC C
EP
resequencing in the RIL population of Brassica rapa. PloS one 8:e76059.
7
ACCEPTED MANUSCRIPT FIGURE LEGENDS
Figure 1 Comparisons of genomic components between the B. rapa genome
RI PT
assemblies V1.5 and V2.0. (A) Percentage of genomic components in V1.5 (left) and V2.0 (right). (B) Size of genomic components in the two assemblies. (C)
Distributions of insertion time dated by divergence rate of full length LTRs (the rate
SC
of neutral mutation sites accumulated in the two terminal repeat sequences of LTRs). This was calculated in the two genome assemblies of B. rapa and the genome of B.
M AN U
oleracea. y-axis in the left and right are the number and frequency distributions of LTRs, respectively. A new TE expansion event (~6.5 million years) was observed in V2.0. (D) An example showing an extra annotated gene BraA01001064 in V2.0,
TE D
compared to gap sequences in V1.5. (E) BraA03003780 is not present between genes Bra001448 and Bra001449 in V1.5, indicating the improved sequence assembly in V2.0. (F) The extra gene BraA03002268 in V2.0 showed high sequence similarity
EP
(identity:100%, coverage:100%) to intergenic sequences between Bra000197 and
AC C
Bra000198 in V1.5, indicating a more comprehensive gene annotation in V2.0. Gene models colored in yellow are extra genes annotated in V2.0. Red arrows denote counterpart gene pairs that keep a perfect syntenic relationship in the two assemblies. Orange rectangles denote gap positions (N’s). Figures were plotted using the GEvo module in CoGe (https://genomevolution.org/coge/).
8
TEs
3.77%
32.30%
Intron
5.85%
41.55%
C
D
50
500 B. rapa V1.5 B. rapa V2.0 B. oleracea
40
TE D
300
EP
30
AC C
200
100
50
48 54
35 39 11
20
E
V1.5
Coding
Intron
Intergenic (Without Ns)
23
10
V2.0
BraA01001063
V1.5
V1.5
TEs
Bra013312
BraA01001064
Bra001448
BraA01001065
Bra001449
BraA03003780
V2.0
F
N
Bra013311
BraA03003779
Bra000197
BraA03003781
Bra000198
0
0
V2.0
0
2
4
6
8
Age of LTR insertions (Million years)
10
V1.5 V2.0
72
0
Intergenic
400
Number of LTRs
10.14%
100
37.91%
N
Intergenic
Intron
SC
12.24%
13.80%
126
118
Percent of LTRs
N
Coding
17.00%
148
150
Size (Mb)
25.44%
B
RI PT
Coding TEs
ACCEPTED MANUSCRIPT
V2.0
M AN U
A
V1.5
12
BraA03002268 BraA03002267 BraA03002259