Genomics xxx (xxxx) xxx–xxx
Contents lists available at ScienceDirect
Genomics journal homepage: www.elsevier.com/locate/ygeno
VIpower: Simulation-based tool for estimating power of viral integration detection via high-throughput sequencing
Arvis Sulovari a, Dawei Li a,b,c,⁎
a Department of Microbiology and Molecular Genetics, University of Vermont, Burlington, VT 05405, USA
b Department of Computer Science, University of Vermont, Burlington, VT 05405, USA
c Neuroscience, Behavior, and Health Initiative, University of Vermont, Burlington, VT 05405, USA
ARTICLE INFO
Keywords: Viral integration (VI); Detection power; High-throughput sequencing (HTS); Modeling; Simulation; VIpower
ABSTRACT
Viral sequence integrations in the human genome have been implicated in various human diseases. Viral integrations remain among the most challenging-to-detect structural changes of the human genome. No studies have systematically analyzed how molecular and bioinformatics factors affect the power (sensitivity) to detect viral integrations using high-throughput sequencing (HTS). We selected a wide range of molecular and bioinformatics factors covering genome sequence characteristics, HTS features, and viral integration detection. We designed a fast simulation-based framework to model the process of detecting variable viral integration events in the human genome. We then examined the associations of selected factors with viral integration detection power. We identified six factors that significantly affected viral integration detection power (P < 2 × 10^−16). The strongest factors associated with detection power included proportion of sample cells with clonal viral integrations (Pearson's ρ = 0.64), sequencing depth (ρ = 0.37), length of viral integration (ρ = 0.37), paired-end read insert size (ρ = 0.23), user-defined threshold (number of supporting reads) to claim successful identification of integrations (ρ = −0.19), and read length (when sequence volume was fixed) (ρ = −0.09). As the first tool of its kind, VIpower incorporates all these factors, which can be manipulated in concert with each other to optimize the detection power. This tool may be used to estimate viral integration detection power for various combinations of sequencing or analytic parameters. It may also be used to estimate the parameters required to achieve a specific power when designing new sequencing experiments.
1. Introduction
Viral etiologies have been speculated to be involved in various complex human diseases [1–9]. Many viruses are able to insert their genetic materials into host chromosomes [10–13], and the resulting viral integrations, i.e., human-virus-human sequences, may play roles in the pathogenesis and development of some diseases via different mechanisms, such as expressing viral proteins, dysregulating host gene functions, and influencing genomic instability. Use of high-throughput sequencing (HTS) allows for detection of viral integrations, both germline and somatic events, in the human genome. We recently compared the historical success of identifying cancer-causal viruses through clonal viral integration analyses [14] and analyzed the existing HTS-based methods and software for viral integration detection [15]. Accurate identification of viral integrations in the human genome remains challenging, in part due to the limitations of available computational methods and insufficient empirical data to guide new experimental designs and data analyses. Although viral integration hotspots have been reported in some tumor samples [1], viral integration sites are largely randomly distributed across the entire human genome, including regions with high GC content and repetitive sequences. Thus, in general, HTS-based methods for viral integration detection suffer from ambiguous mapping of short reads in such “challenging-to-detect” regions. For somatic viral integrations, since not all cells carry the integration event, the cellular proportions for clonal integrations vary, further reducing the power (sensitivity) to detect them. Other factors, such as sequencing depth and paired-end read insert size, might also influence, positively or negatively, the power to detect integration events. To accurately capture viral integrations on a whole-genome scale, systematic analyses are required to identify molecular and bioinformatics factors that potentially affect the power of viral integration detection via HTS.
In this study, we designed a fast, simulation-based framework to model the process of whole-genome sequencing and viral integration
⁎ Corresponding author at: Department of Microbiology and Molecular Genetics, University of Vermont, Burlington, VT 05405, USA. E-mail address: [email protected] (D. Li).
https://doi.org/10.1016/j.ygeno.2019.01.015 Received 10 August 2018; Received in revised form 31 December 2018; Accepted 22 January 2019 0888-7543/ © 2019 Elsevier Inc. All rights reserved.
Please cite this article as: Sulovari, A., Genomics, https://doi.org/10.1016/j.ygeno.2019.01.015
A. Sulovari, D. Li
detection. We selected a wide range of molecular and bioinformatics factors that potentially influence the power of viral integration detection, covering genome sequence characteristics, HTS features, and viral integration detection. These factors are among the most variable components in HTS experiments or bioinformatics approaches necessary to detect viral integration events. We developed software, VIpower, utilizing all these factors for power estimation. We then examined the extent of association for some of the factors most frequently speculated to affect viral integration detection power.
2. Methods
2.1. Modeling of viral integration detection
The entire process of viral integration detection was modeled by simulation, which included four modules: 1) simulation of virtual human sequences; 2) simulation of virtual viral sequences; 3) simulation of paired-end sequencing reads and virtual alignment of the reads to the human and viral reference genomes; and 4) simulation of viral integration detection (Fig. 1). The sequence datasets, including the human reference genome (GRCh37/hg19), the repeat regions from RepeatMasker [16], the dr.VIS database [17], and the “profile” data from pIRS [18], as well as the empirical distributions of genomic features, including whole-genome GC content, lengths of repeat regions, and characteristics of known viral integrations, were incorporated into these simulations (Supplementary Fig. 1).
Fig. 1. Flow diagram of modeling viral integration detection. The modeling of viral integration detection in the human genome is composed of four modules. The first two modules simulate the features of human and viral sequences, while the last two simulate the alignment of paired-end reads and the detection of viral integration events (see Methods for details).
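As an illustration of the first module, the following minimal Python sketch computes GC content in 200-bp tiling windows and then resamples the resulting empirical distribution with replacement, as described for the simulated human sequences in Methods. It is not the VIpower implementation (which was written primarily in R); the function names, the toy reference string, and the seed handling are assumptions made for this example only.

```python
import random


def gc_percent_windows(sequence, window=200):
    """Return the GC percentage of each non-overlapping (tiling) window.

    A trailing partial window is ignored in this sketch.
    """
    seq = sequence.upper()
    values = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        gc_count = chunk.count("G") + chunk.count("C")
        values.append(100.0 * gc_count / window)
    return values


def sample_gc_profile(empirical_gc, n_windows, seed=0):
    """Draw n_windows GC values from the empirical distribution, with replacement."""
    rng = random.Random(seed)  # hypothetical seed handling, for reproducibility
    return rng.choices(empirical_gc, k=n_windows)


# Toy "reference" of three 200-bp windows: all GC, no GC, and half GC.
toy_reference = "GC" * 100 + "AT" * 100 + "GATC" * 50
empirical = gc_percent_windows(toy_reference)
simulated = sample_gc_profile(empirical, n_windows=10, seed=1)
print(empirical)   # GC percentage per 200-bp window
print(simulated)   # ten values resampled from the empirical distribution
```

In a real run, the empirical distribution would come from the whole reference genome rather than a toy string, and the same resampling idea would apply to repeat lengths.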
2.1.1. Modeling of human sequences
The human sequences were simulated (modeled) according to the whole-genome distributions of GC content (Supplementary Fig. 2) and repeat regions (Supplementary Fig. 3). The GC content, which was obtained from the pIRS profile data, was calculated by employing 200 base pair (bp) tiling windows across the human reference genome (GRCh37). The lengths and frequencies (17 repeats/10,000 bps) of repeat regions were extracted from RepeatMasker. The whole-genome distributions of GC content and repeat regions were randomly sampled with replacement and assigned to our simulated human sequences.
2.1.2. Modeling of viral sequences and integrations
Viral integration events, i.e., human-virus-human sequences, were simulated based on the properties of known viral integrations. Specifically, the lengths of viral integrations were created based on the widely-studied hepatitis B virus (HBV) integrations maintained in the dr.VIS database. The locations of the simulated viral integrations were assigned according to the observed distances between the HBV integration sites and the repeat regions from RepeatMasker (Supplementary Fig. 4).
2.1.3. Modeling of in silico sequencing
Each paired-end read was assigned physical coordinates according to the empirical distribution of GC content-specific sequencing depths (Supplementary Fig. 5), which was generated based on the GC-corrected read depth data from pIRS. To remove low quality reads, several quality control procedures commonly-used in HTS data analysis were employed, including minimum mappable read length, read trimming, PCR duplicate removal, and non-uniquely mapped read removal (Supplementary Table 1). A read was discarded if its full length overlapped with a repeat element. If a portion of a single read was mapped to a unique sequence and that portion was longer than the minimum mappable length (with a default value of 20 bp), the read was labelled as a read supporting integration. In a simulated example with commonly-used HTS parameters, use of the quality controls described in Supplementary Table 1 led to reduced low-quality reads, particularly those mapped to regions with very high sequencing depth (Supplementary Fig. 6). For somatic viral integration events, we adjusted the numbers of simulated reads by matching, linearly, the corresponding cellular proportions over the integrated viral sequence regions.
2.1.4. Modeling of viral integration detection and calculation of detection power
We first simulated 50 viral integrations across the simulated human genome sequence and recorded their genomic coordinates. Then, we identified the overlap between the coordinates of the viral integrations and the previously-simulated paired-end reads. We then labelled each paired-end read as either chimeric or split when one entire end or a portion of an individual read mapped to the simulated viral reference genome, respectively, while the remaining portion mapped to the virtual human genome. Chimeric reads and split reads are also known as discordant reads and soft-clipped reads, respectively. A split read should contain the viral integration breakpoint. Both chimeric and split reads were used as supporting reads. The power (sensitivity) to detect viral integrations was defined as:

\[ \text{Detection power (\%)} = \frac{\text{Number of identified viral integrations}}{\text{Number of simulated viral integrations}} \times 100 \]

Within our simulation framework, genomic coordinates alone are both necessary and sufficient for estimating viral integration detection power. Every simulated paired-end read and viral integration event is represented by a start and end coordinate; hence, the operations of counting and intersecting genomic coordinates lead to significantly decreased computational runtimes for our simulation framework.
2.2. Association analysis of factors with detection power
We selected a wide range of molecular and bioinformatics factors that potentially affected the power to detect viral integrations. We then selected a portion of these factors for further association analyses in this study. Pearson's correlation test was used to measure the association between each factor and detection power using all the combined datasets. The cor.test function in R version 3.4.3 was used. The statistical significance threshold was adjusted for multiple testing using Bonferroni correction, resulting in an adjusted threshold α = 0.0001.
2.3. Comparative analysis of the simulation framework
We compared the power estimates from our viral integration detection simulation framework with those from a previously-published viral integration detection tool, Virus-Clip [19]. First, we randomly selected 100 sequences of equal length from the HBV reference genome
(NC_003977.2) and inserted them into randomly-selected positions of human chromosome 22 (GRCh37/hg19). This process was repeated with HBV integration lengths of 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 500, and 1000 bp, and the resulting sequences were stored in FASTA format. Second, these FASTA files were used to generate paired-end sequencing reads in FASTQ format with varying sequencing depths, i.e., 1×, 2×, 4×, 6×, 8×, 10×, 20×, and 40×; read lengths, i.e., 75 and 100 bp; and insert sizes, i.e., 600, 1300, and 2200 bp, using pIRS [18]. Third, we ran these FASTQ files through Virus-Clip and counted how many of the simulated HBV integrations were detected. The same set of parameters was used as the input for our simulation framework to generate power estimates for comparisons.
3. Results
We collected a total of 18 molecular and bioinformatics factors (Supplementary Table 2) that potentially affect viral integration detection power. These factors were selected as they were among those representative of the paired-end sequencing process, human genome characteristics, and integration detection practices. We incorporated all these factors into our viral integration detection simulation framework (Fig. 1) as a software package: VIpower. Seven factors were further selected for association analyses with detection power. To compare the detection power of different values of the seven factors, we calculated the numbers of correctly detected viral integrations under various commonly-used values for each factor, as shown in Supplementary Table 2, leading to a total of 23,040 unique combinations of factors and values. To examine the extent of association with viral integration detection power, we carried out Pearson's correlation analyses using the combination of these computed power estimates.
We found that six of the seven factors were significantly associated with detection power, including cellular proportion (Pearson's ρ = 0.64 and P < 2 × 10^−16; adjusted α = 0.0001), sequencing depth (ρ = 0.37 and P < 2 × 10^−16), length of integrated viral sequence (ρ = 0.37, P = 1 × 10^−13), paired-end read insert size (ρ = 0.23 and P < 2 × 10^−16), minimum number of supporting reads required to determine viral integration event, i.e., the threshold set by users to claim successful identification of viral integration events (ρ = −0.19 and P < 2 × 10^−16), and read length (ρ = −0.09 and P < 2 × 10^−16 when the total data volume/sequencing depth was fixed; ρ = 0.1 and P < 2 × 10^−16 when the total read number was fixed). The first factor, cellular proportion, is particularly relevant to somatic viral integration detection in a heterogeneous cell population, such as cancer biopsies [20]. Fig. 2 shows the associations of the six factors. Additionally, we observed a marginal association of the seventh factor, minimum mappable length, with detection power (ρ = −0.02 and P = 0.0003).
We further analyzed the associations of the number of supporting reads, e.g., chimeric and split reads flanking the integration breakpoint, with detection power. Fig. 3 shows the pairwise correlations among the seven examined molecular and bioinformatics factors, number of supporting reads, and detection power. As expected, the observed numbers of supporting reads were strongly associated with detection power. In real HTS data, we expect fluctuations in the number of supporting reads across viral integration breakpoints. Indeed, we observed a wide variation in the number of supporting reads over the breakpoints of our simulated viral integrations (Supplementary Fig. 7).
We compared the power estimates from our simulation framework with those from Virus-Clip for each of the six significantly-associated factors. As Virus-Clip was designed to use split reads only, we tested our simulation framework by using both split and chimeric reads as well as using split reads only. Three replication experiments, each corresponding to different viral sequences and integration breakpoints, were carried out. The average detection powers were compared between Virus-Clip and our simulation framework. We found that the power estimates from our simulation framework were in general slightly lower than those from Virus-Clip when only split reads were used (Supplementary Fig. 8). When both chimeric and split reads were utilized by our simulation framework, we observed a higher detection power. For example, at 10× sequencing depth, the detection power of our simulation framework increased by approximately 10% when both chimeric and split reads were used for integration detection (Supplementary Fig. 8).
We then used our simulation framework and all 18 factors to develop VIpower, a software package for fast estimation of viral integration detection power. VIpower is available as a Linux command-line version where users may calculate power estimates for other HTS scenarios by modifying each factor with different values in the VIpower files, such as viral integration profile, distance to repeats, and GC content-specific read coordinates. This tool is also available as a user-friendly web interface for live runs of power analysis. The web interface can be used not only to query the power estimates precomputed for the values shown in Supplementary Table 2, but also to compute power estimates for new values. In the web version, 15 of the 18 factors can be customized by users since three of them are fixed, including “GC distribution file”, “repeat regions”, and “random seed”. However, in the command-line version, all 18 factors can be customized with desired values.
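The association analysis reported in this section can be sketched as follows. This is a minimal Python illustration, not the R cor.test call used in the study: it computes Pearson's correlation coefficient directly and approximates the two-sided P value with a normal distribution, which is reasonable only for large samples (cor.test itself uses a t distribution). The toy data, the function names, and the number of tests passed to the Bonferroni helper are hypothetical.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_two_sided_p(r, n):
    """Large-sample normal approximation to the P value of the test r = 0.

    Assumption of this sketch: the exact test would use a t distribution
    with n - 2 degrees of freedom.
    """
    denom = 1.0 - r * r
    if denom <= 0.0:
        return 0.0
    t_stat = r * math.sqrt((n - 2) / denom)
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


# Hypothetical toy data: a factor value and a noisy, increasing "power".
factor = list(range(50))
power = [2 * v + (v % 5) for v in factor]

r = pearson_r(factor, power)
p = approx_two_sided_p(r, len(factor))
threshold = bonferroni_threshold(0.05, 7)  # 7 tests is a hypothetical count
print(r, p, p < threshold)
```

With the 23,040 power estimates used in the study, the normal approximation and the t-based test would be expected to agree closely, but the sketch makes no claim to reproduce the reported P values.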
4. Discussion
We developed a fast, modeling-based framework to estimate viral integration detection power. This framework incorporated multiple sequence features in HTS data, such as GC content and distance to transposable elements and other repeat sequences. We compared the power estimates for different combinations of commonly-used values of seven selected molecular and bioinformatics factors, and then identified six factors that were significantly associated with viral integration detection power. We constructed a software package, VIpower, that incorporates all potential factors. These molecular and bioinformatics factors can be optimized in concert with each other to improve the power for detecting viral integrations. VIpower may be used to estimate viral integration detection power for different combinations of sequencing or analytic values. It may also be used to determine the values required to achieve specific power when designing new sequencing experiments.
The outputs from VIpower allow users to test the effect of the interactions between factors on detection power. For instance, when the sequencing read length was increased from 100 bp to 300 bp, while the total sequence volume was fixed, the number of total supporting reads decreased by an average of 37%, resulting in an ~4% drop in detection power; however, the proportion of split reads increased 4.7-fold (Supplementary Fig. 9). As split reads are necessary to determine the exact breakpoints of integrations, the latter experimental design, i.e., increasing read length, may provide a higher chance for the precise mapping of integration breakpoints at the cost of some loss of detection power.
Another important feature of VIpower is that its runtime is minimal. This is because VIpower detects and stores viral integrations by genomic features (genomic coordinates) rather than performing actual sequence alignment.
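To illustrate why coordinate arithmetic can stand in for alignment, the following minimal Python sketch labels read pairs as split or chimeric from interval overlaps and computes detection power as defined in Methods. It is an illustration under stated assumptions rather than the VIpower code: the `Interval` type, the function names, the use of a single sample-genome coordinate system, and the requirement that both the human and viral parts of a split read reach the minimum mappable length are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    start: int  # inclusive coordinate
    end: int    # exclusive coordinate


def overlap(a, b):
    """Number of bases shared by two intervals (0 if they do not intersect)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def classify_pair(read1, read2, integration, min_mappable=20):
    """Return 'split', 'chimeric', or None for one read pair and one integration."""
    reads = (read1, read2)
    lengths = [r.end - r.start for r in reads]
    viral = [overlap(r, integration) for r in reads]
    # Split: one read crosses a breakpoint, with enough bases on each side
    # (the two-sided length requirement is an assumption of this sketch).
    for length, v in zip(lengths, viral):
        if v >= min_mappable and length - v >= min_mappable:
            return "split"
    # Chimeric: one end lies entirely in viral sequence, the other entirely in human.
    if (viral[0] == lengths[0] and viral[1] == 0) or \
       (viral[1] == lengths[1] and viral[0] == 0):
        return "chimeric"
    return None


def detection_power(integrations, read_pairs, min_support=2, min_mappable=20):
    """Percentage of integrations with at least min_support supporting pairs."""
    detected = 0
    for integration in integrations:
        support = sum(
            classify_pair(r1, r2, integration, min_mappable) is not None
            for r1, r2 in read_pairs
        )
        if support >= min_support:
            detected += 1
    return 100.0 * detected / len(integrations)


# Hypothetical toy data: two integrations and three read pairs.
integrations = [Interval(1000, 1500), Interval(5000, 5300)]
read_pairs = [
    (Interval(900, 1000), Interval(1200, 1300)),   # chimeric for the first integration
    (Interval(950, 1050), Interval(1300, 1400)),   # split for the first integration
    (Interval(4000, 4100), Interval(4400, 4500)),  # supports neither integration
]
print(detection_power(integrations, read_pairs, min_support=2))
```

The nested loop is quadratic in the worst case; a sorted-interval or interval-tree lookup would be the usual choice at genome scale, but no sequence comparison is needed in either case.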
For example, each of our simulations shown in Supplementary Table 2 can be completed by one standard computing core in an average of nine seconds, ranging from 0.6 to 62 s, compared to >100 CPU hours required by some of the previously-published viral integration detection tools [21,22]. Similarly, the web interface can conduct a power calculation within one minute.
We found no supporting reads for a small portion (2.5–7.5%) of our simulated viral integrations (Supplementary Fig. 7). This was likely due to the integrations occurring near difficult-to-sequence regions. This was consistent with our observation that the real HBV integrations maintained in the dr.VIS database were located, on average, <500 bp from transposable elements (Supplementary Fig. 4). Proximity to repeat elements may make these viral integrations inaccessible or difficult to detect with existing short read sequencing. This might be, in part, due
to the limitations of the current bioinformatics approaches, the protocols employed to generate the integrations leveraged by the dr.VIS database, or the evolutionary idiosyncrasies following the introduction of the viral genome into host cells over long evolutionary periods [23]. Longer insert sizes and/or read lengths may help identify some of the viral integrations in low-complexity regions.
Fig. 2. Six factors significantly associated with viral integration detection power. The six factors are ordered by significance level of correlation. The box plots indicate five quantiles, and the star symbol (*) represents the average value. The correlation coefficients ρ and P values for each factor were (A) cellular proportion (ρ = 0.64, P < 2 × 10^−16), (B) sequencing depth (ρ = 0.37, P < 2 × 10^−16), (C) viral integration length (ρ = 0.37, P = 1 × 10^−13), (D) insert size (ρ = 0.23, P < 2 × 10^−16), (E) supporting reads threshold (ρ = −0.19, P < 2 × 10^−16), (F) read length (the top panel represents a scenario where the sequencing depth is fixed, ρ = −0.09, P < 2 × 10^−16; the bottom panel represents a scenario where the read number is fixed, ρ = 0.1, P < 2 × 10^−16), respectively. In each box plot, all other involved variables were simulated in equal proportion of representation (all these datasets were used in each analysis) to ensure balanced comparisons among data points.
This study has some limitations. First, the empirical viral integrations analyzed in this study were primarily HBV integrations detected in hepatocellular carcinoma [17], and thus, the power estimates analyzed here might be specific to HBV integration detection. Further analyses are necessary to examine integrations of other viruses to determine how the detection power varies between viral species. VIpower allows users to replace the viral integration reference with that of any other virus or combination of viruses; thus, the reference files in VIpower can be updated when additional viral integration data become available. This makes VIpower applicable for virome-wide integration screens of various human samples [21]. Second, false discovery rate should be controlled when estimating detection power; however, false integration events could not be simulated in this study. Adding false discovery rate to power analysis is warranted in future analysis. Third, in this study, we compared our viral integration detection simulation framework only with Virus-Clip because of the very short runtimes of this software. However, the more recently-published tools [21] should also be analyzed. Fourth, in this study, association testing with detection power was carried out only for a portion of the collected molecular and bioinformatics factors, as shown in Supplementary Table 2. The extent of association of the remaining factors should also be examined, and additional power-associated factors may be identified in future studies.
To conclude, this is the first study focused on identifying molecular and bioinformatics factors that affect viral integration detection power using whole-genome sequencing. VIpower is the first tool for fast estimation of power for viral integration detection. The resources generated in this study may aid in sequencing library preparation and bioinformatics pipeline development. Optimization of the values for the factors related to deep sequencing designs and viral integration analyses may further help increase detection power.
Supplementary data to this article can be found online at https://doi.org/10.1016/j.ygeno.2019.01.015.
Source code and web application
The source code of the software was written primarily in R (version 3.3.0). The web interface of the software was designed using HTML and PHP (version 5.3.3), and MySQL was used to store the pre-computed power estimates.
Availability of data and software
VIpower is available as a command-line application and a user-friendly web application. The command-line application, the real-time web interface, and the precomputed power estimates from the 23,040 combinations of factors and values are available at www.uvm.edu/genomics/software/VIpower.
Fig. 3. Pairwise correlations of detection power with selected factors and total number of supporting reads. The color of each square corresponds to correlation coefficient ρ (darker color corresponds to stronger correlation) while the size corresponds to the P value (smaller P value corresponds to bigger square size). The six significant factors (P ≤ 0.0001), ordered by their correlation coefficient with detection power, are cellular proportion, sequencing depth, viral integration length, insert size, user-defined minimum number of supporting reads (threshold), and read length. One additional marginal factor (minimum mappable length), observed total number of supporting reads, and runtime are also shown. All parameters represent their average values, except minimum mappable length, cellular proportion, runtime, and detection power.
[Figure labels: molecular and bioinformatics factors (cellular proportion, sequencing depth, insert size, viral integration length, read length, supporting reads threshold, minimum mappable length); supporting reads (split reads at 5′, split reads at 3′, chimeric reads at 5′, chimeric reads at 3′); runtime; detection power. Color scale: correlation coefficient.]
Acknowledgements
This work was supported by the University of Vermont Start-up Fund, and the University of Vermont Cancer Center Institutional Research Grant 126773-IRG 14-196-01 from the American Cancer Society. The authors thank Michael Mariani for his help with the website design; thank Xun Chen for his constructive comments; and thank Jason Kost for his careful review of the manuscript.
Conflict of Interests
The authors declare no potential competing interests.
References
[1] W.K. Sung, H. Zheng, S. Li, R. Chen, X. Liu, Y. Li, N.P. Lee, W.H. Lee, P.N. Ariyaratne, C. Tennakoon, F.H. Mulawadi, K.F. Wong, A.M. Liu, R.T. Poon, S.T. Fan, K.L. Chan, Z. Gong, Y. Hu, Z. Lin, G. Wang, Q. Zhang, T.D. Barber, W.C. Chou, A. Aggarwal, K. Hao, W. Zhou, C. Zhang, J. Hardwick, C. Buser, J. Xu, Z. Kan, H. Dai, M. Mao, C. Reinhard, J. Wang, J.M. Luk, Genome-wide survey of recurrent HBV integration in hepatocellular carcinoma, Nat. Genet. 44 (2012) 765–769.
[2] J.D. Khoury, N.M. Tannir, M.D. Williams, Y. Chen, H. Yao, J. Zhang, E.J. Thompson, T. Network, F. Meric-Bernstam, L.J. Medeiros, J.N. Weinstein, X. Su, Landscape of DNA virus associations across human malignant cancers: analysis of 3,775 cases using RNA-Seq, J. Virol. 87 (2013) 8916–8926.
[3] J.A. Mikovits, V.C. Lombardi, M.A. Pfost, K.S. Hagen, F.W. Ruscetti, Detection of an infectious retrovirus, XMRV, in blood cells of patients with chronic fatigue syndrome, Virulence 1 (2009) 386–390.
[4] I. Carbone, T. Lazzarotto, M. Ianni, E. Porcellini, P. Forti, E. Masliah, L. Gabrielli, F. Licastro, Herpes virus in Alzheimer's disease: relation to progression of the disease, Neurobiol. Aging 35 (2014) 122–129.
[5] R. Douville, J. Liu, J. Rothstein, A. Nath, Identification of active loci of a human endogenous retrovirus in neurons of patients with amyotrophic lateral sclerosis, Ann. Neurol. 69 (2011) 141–151.
[6] D.J. Smyth, J.D. Cooper, R. Bailey, S. Field, O. Burren, L.J. Smink, C. Guja, C. Ionescu-Tirgoviste, B. Widmer, D.B. Dunger, D.A. Savage, N.M. Walker, D.G. Clayton, J.A. Todd, A genome-wide association study of nonsynonymous SNPs identifies a type 1 diabetes locus in the interferon-induced helicase (IFIH1) region, Nat. Genet. 38 (2006) 617–619.
[7] E.F. Foxman, A. Iwasaki, Genome-virome interactions: examining the role of common viral infections in complex disease, Nat. Rev. Microbiol. 9 (2011) 254–264.
[8] S.M. Karst, C.E. Wobus, M. Lay, J. Davidson, H.W. Virgin, STAT1-dependent innate immunity to a Norwalk-like virus, Science 299 (2003) 1575–1578.
[9] J.E. Gern, Rhinovirus and the initiation of asthma, Curr. Opin. Allergy Cl 9 (2009) 73–78.
[10] P. Klenerman, H. Hengartner, R.M. Zinkernagel, A non-retroviral RNA virus persists in DNA form, Nature 390 (1997) 298–301.
[11] M. Horie, T. Honda, Y. Suzuki, Y. Kobayashi, T. Daito, T. Oshida, K. Ikuta, P. Jern, T. Gojobori, J.M. Coffin, K. Tomonaga, Endogenous non-retroviral RNA virus elements in mammalian genomes, Nature 463 (2010) 84–87.
[12] V.A. Belyi, A.J. Levine, A.M. Skalka, Unexpected inheritance: multiple integrations of ancient bornavirus and ebolavirus/marburgvirus sequences in vertebrate genomes, PLoS Pathog. 6 (2010) e1001030.
[13] D.J. Taylor, J. Bruenn, The evolution of novel fungal genes from non-retroviral RNA viruses, BMC Biol. 7 (2009) 88.
[14] J. Cao, D. Li, Searching for human oncoviruses: Histories, challenges, and opportunities, J. Cell. Biochem. 119 (2018) 4897–4906.
[15] X. Chen, J. Kost, D. Li, Comprehensive comparative analysis of methods and software for identifying viral integrations, Brief. Bioinform. (2019), https://doi.org/10.1093/bib/bby070.
[16] M. Tarailo-Graovac, N. Chen, Using RepeatMasker to identify repetitive elements in genomic sequences, Curr. Protoc. Bioinformatics 25 (2009) 4.10.1–4.10.14.
[17] X. Yang, M. Li, Q. Liu, Y. Zhang, J. Qian, X. Wan, A. Wang, H. Zhang, C. Zhu, X. Lu, Y. Mao, X. Sang, H. Zhao, Y. Zhao, X. Zhang, Dr.VIS v2.0: an updated database of human disease-related viral integration sites in the era of high-throughput deep sequencing, Nucleic Acids Res. 43 (2015) D887–D892.
[18] X. Hu, J. Yuan, Y. Shi, J. Lu, B. Liu, Z. Li, Y. Chen, D. Mu, H. Zhang, N. Li, Z. Yue, F. Bai, H. Li, W. Fan, pIRS: Profile-based Illumina pair-end reads simulator, Bioinformatics 28 (2012) 1533–1535.
[19] D.W. Ho, K.M. Sze, I.O. Ng, Virus-Clip: a fast and memory-efficient viral integration site detection tool at single-base resolution with annotation capability, Oncotarget 6 (2015) 20959–20963.
[20] M. Meyerson, S. Gabriel, G. Getz, Advances in understanding cancer genomes through second-generation sequencing, Nat. Rev. Genet. 11 (2010) 685–696.
[21] X. Chen, J. Kost, D. Li, Comprehensive comparative analysis of methods and software for identifying viral integrations, Brief. Bioinform. (2018), https://doi.org/10.1093/bib/bby1070/5066709.
[22] Q. Wang, P. Jia, Z. Zhao, VirusFinder: software for efficient and accurate detection of viruses and their integration sites in host genomes through next generation sequencing data, PLoS ONE 8 (2013) e64465.
[23] M. Pistello, G. Antonelli, Integration of the viral genome into the host cell genome: a double-edged sword, Clin. Microbiol. Infect. 22 (2016) 296–298.