VIpower: Simulation-based tool for estimating power of viral integration detection via high-throughput sequencing

Genomics xxx (xxxx) xxx–xxx

Contents lists available at ScienceDirect

Genomics journal homepage: www.elsevier.com/locate/ygeno

Arvis Sulovari a, Dawei Li a,b,c,⁎

a Department of Microbiology and Molecular Genetics, University of Vermont, Burlington, VT 05405, USA
b Department of Computer Science, University of Vermont, Burlington, VT 05405, USA
c Neuroscience, Behavior, and Health Initiative, University of Vermont, Burlington, VT 05405, USA

ARTICLE INFO

Keywords: Viral integration (VI); Detection power; High-throughput sequencing (HTS); Modeling; Simulation; VIpower

ABSTRACT

Viral sequence integrations in the human genome have been implicated in various human diseases. Viral integrations remain among the most challenging-to-detect structural changes of the human genome. No studies have systematically analyzed how molecular and bioinformatics factors affect the power (sensitivity) to detect viral integrations using high-throughput sequencing (HTS). We selected a wide range of molecular and bioinformatics factors covering genome sequence characteristics, HTS features, and viral integration detection. We designed a fast simulation-based framework to model the process of detecting variable viral integration events in the human genome. We then examined the associations of selected factors with viral integration detection power. We identified six factors that significantly affected viral integration detection power (P < 2 × 10^−16). The strongest factors associated with detection power included proportion of sample cells with clonal viral integrations (Pearson's ρ = 0.64), sequencing depth (ρ = 0.37), length of viral integration (ρ = 0.37), paired-end read insert size (ρ = 0.23), user-defined threshold (number of supporting reads) to claim successful identification of integrations (ρ = −0.19), and read length (when sequence volume was fixed) (ρ = −0.09). As the first tool of its kind, VIpower incorporates all these factors, which can be manipulated in concert with each other to optimize the detection power. This tool may be used to estimate viral integration detection power for various combinations of sequencing or analytic parameters. It may also be used to estimate the parameters required to achieve a specific power when designing new sequencing experiments.

⁎ Corresponding author at: Department of Microbiology and Molecular Genetics, University of Vermont, Burlington, VT 05405, USA. E-mail address: [email protected] (D. Li).

https://doi.org/10.1016/j.ygeno.2019.01.015
Received 10 August 2018; Received in revised form 31 December 2018; Accepted 22 January 2019
0888-7543/© 2019 Elsevier Inc. All rights reserved.

Please cite this article as: Sulovari, A., Genomics, https://doi.org/10.1016/j.ygeno.2019.01.015

1. Introduction

Viral etiologies have been speculated to be involved in various complex human diseases [1–9]. Many viruses are able to insert their genetic materials into host chromosomes [10–13], and the resulting viral integrations, i.e., human-virus-human sequences, may play roles in the pathogenesis and development of some diseases via different mechanisms, such as expressing viral proteins, dysregulating host gene functions, and influencing genomic instability. Use of high-throughput sequencing (HTS) allows for detection of viral integrations, both germline and somatic events, in the human genome. We recently compared the historical success of identifying cancer-causal viruses through clonal viral integration analyses [14] and analyzed the existing HTS-based methods and software for viral integration detection [15].

Accurate identification of viral integrations in the human genome remains challenging, in part due to the limitations of available computational methods and insufficient empirical data to guide new experimental designs and data analyses. Although viral integration hotspots have been reported in some tumor samples [1], viral integration sites are largely randomly distributed across the entire human genome, including regions with high GC content and repetitive sequences. Thus, in general, HTS-based methods for viral integration detection suffer from ambiguous mapping of short reads in such “challenging-to-detect” regions. For somatic viral integrations, since not all cells carry the integration event, the cellular proportions for clonal integrations vary, further reducing the power (sensitivity) to detect them. Other factors, such as sequencing depth and paired-end read insert size, might also influence, positively or negatively, the power to detect integration events. To accurately capture viral integrations on a whole-genome scale, systematic analyses are required to identify molecular and bioinformatics factors that potentially affect the power of viral integration detection via HTS.

In this study, we designed a fast, simulation-based framework to model the process of whole-genome sequencing and viral integration detection. We selected a wide range of molecular and bioinformatics factors that potentially influence the power of viral integration detection, covering genome sequence characteristics, HTS features, and viral integration detection. These factors are among the most variable components in HTS experiments or bioinformatics approaches necessary to detect viral integration events. We developed software, VIpower, utilizing all these factors for power estimation. We then examined the extent of association for some of the factors most frequently speculated to affect viral integration detection power.

2. Methods

2.1. Modeling of viral integration detection

The entire process of viral integration detection was modeled by simulation, which included four modules: 1) simulation of virtual human sequences; 2) simulation of virtual viral sequences; 3) simulation of paired-end sequencing reads and virtual alignment of the reads to the human and viral reference genomes; and 4) simulation of viral integration detection (Fig. 1). The sequence datasets, including the human reference genome (GRCh37/hg19), the repeat regions from RepeatMasker [16], dr.VIS database [17], and the “profile” data from pIRS [18], as well as the empirical distributions of genomic features, including whole-genome GC content, lengths of repeat regions, and characteristics of known viral integrations, were incorporated into these simulations (Supplementary Fig. 1).

Fig. 1. Flow diagram of modeling viral integration detection. The modeling of viral integration detection in the human genome is composed of four modules. The first two modules simulate the features of human and viral sequences, while the last two simulate the alignment of paired-end reads and detection of viral integration events (see Methods for details). Diagram labels: %GC (200 bp window), repeat, human, viral integrations, sequencing, chimeric and split reads (*adjusted by cellular proportion).

2.1.1. Modeling of human sequences

The human sequences were simulated (modeled) according to the whole-genome distributions of GC content (Supplementary Fig. 2) and repeat regions (Supplementary Fig. 3). The GC content, which was obtained from the pIRS profile data, was calculated by employing 200 base pair (bp) tiling windows across the human reference genome (GRCh37). The lengths and frequencies (17 repeats/10,000 bps) of repeat regions were extracted from RepeatMasker. The whole-genome distributions of GC content and repeat regions were randomly sampled with replacement and assigned to our simulated human sequences.

2.1.2. Modeling of viral sequences and integrations

Viral integration events, i.e., human-virus-human sequences, were simulated based on the properties of known viral integrations. Specifically, the lengths of viral integrations were created based on the widely-studied hepatitis B virus (HBV) integrations maintained in the dr.VIS database. The locations of the simulated viral integrations were assigned according to the observed distances between the HBV integration sites and the repeat regions from RepeatMasker (Supplementary Fig. 4).

2.1.3. Modeling of in silico sequencing

Each paired-end read was assigned physical coordinates according to the empirical distribution of GC content-specific sequencing depths (Supplementary Fig. 5), which was generated based on the GC-corrected read depth data from pIRS. To remove low quality reads, several quality control procedures commonly used in HTS data analysis were employed, including minimum mappable read length, read trimming, PCR duplicate removal, and non-uniquely mapped read removal (Supplementary Table 1). A read was discarded if its full length overlapped with a repeat element. If a portion of a single read was mapped to a unique sequence and that portion was longer than the minimum mappable length (with a default value of 20 bp), the read was labelled as a read supporting integration. In a simulated example with commonly-used HTS parameters, use of the quality controls described in Supplementary Table 1 led to reduced low-quality reads, particularly those mapped to regions with very high sequencing depth (Supplementary Fig. 6). For somatic viral integration events, we adjusted the numbers of simulated reads by matching, linearly, the corresponding cellular proportions over the integrated viral sequence regions.

2.1.4. Modeling of viral integration detection and calculation of detection power

We first simulated 50 viral integrations across the simulated human genome sequence and recorded their genomic coordinates. Then, we identified the overlap between the coordinates of the viral integrations and the previously-simulated paired-end reads. We then labelled each paired-end read as either chimeric or split when one entire end or a portion of an individual read mapped to the simulated viral reference genome, respectively, while the remaining portion mapped to the virtual human genome. Chimeric reads and split reads are also known as discordant reads and soft-clipped reads, respectively. A split read should contain the viral integration breakpoint. Both chimeric and split reads were used as supporting reads. The power (sensitivity) to detect viral integrations was defined as:

\[
\text{Detection power (\%)} = \frac{\text{Number of identified viral integrations}}{\text{Number of simulated viral integrations}} \times 100
\]
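The definition of detection power, together with the linear adjustment for cellular proportion, can be sketched in a few lines. This is a minimal, hypothetical Python illustration rather than VIpower's own code (which the authors wrote in R): the function names, the rounding rule, and the example counts are assumptions made for illustration. An integration is treated as identified when its supporting-read count reaches the user-defined threshold.

```python
def adjust_for_cellular_proportion(read_count, cellular_proportion):
    """Scale a supporting-read count linearly by the fraction of cells carrying the integration."""
    if not 0.0 <= cellular_proportion <= 1.0:
        raise ValueError("cellular_proportion must lie in [0, 1]")
    # Rounding to the nearest integer is an assumption of this sketch.
    return round(read_count * cellular_proportion)


def detection_power(supporting_read_counts, min_supporting_reads=2):
    """Percentage of simulated integrations whose supporting reads reach the threshold."""
    if not supporting_read_counts:
        raise ValueError("at least one simulated integration is required")
    identified = sum(1 for n in supporting_read_counts if n >= min_supporting_reads)
    return 100.0 * identified / len(supporting_read_counts)


# Hypothetical supporting-read counts for four simulated integrations.
counts = [0, 3, 10, 40]
print(detection_power(counts, min_supporting_reads=2))  # 75.0

# The same integrations when only 20% of the cells carry them.
adjusted = [adjust_for_cellular_proportion(n, 0.2) for n in counts]
print(detection_power(adjusted, min_supporting_reads=2))  # 50.0
```

Raising `min_supporting_reads` or lowering the cellular proportion can only lower the returned percentage, which matches the direction of the associations reported for these two factors.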

Within our simulation framework, genomic coordinates alone are both necessary and sufficient for estimating viral integration detection power. Every simulated paired-end read and viral integration event is represented by a start and end coordinate; hence, the operations of counting and intersecting genomic coordinates lead to significantly decreased computational runtimes for our simulation framework.

2.2. Association analysis of factors with detection power

We selected a wide range of molecular and bioinformatics factors that potentially affected the power to detect viral integrations. We then selected a portion of these factors for further association analyses in this study. Pearson's correlation test was used to measure the association between each factor and detection power using all the combined datasets. The cor.test function in R version 3.4.3 was used. The statistical significance threshold was adjusted for the number of multiple tests using Bonferroni correction, resulting in an adjusted threshold α = 0.0001.

2.3. Comparative analysis of the simulation framework

We compared the power estimates from our viral integration detection simulation framework with those from a previously-published viral integration detection tool, Virus-Clip [19]. First, we randomly selected 100 sequences of equal length from the HBV reference genome (NC_003977.2) and inserted them into randomly-selected positions of human chromosome 22 (GRCh37/hg19). This process was repeated with HBV integration lengths of 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 500, and 1000 bp, and the resulting sequences were stored in FASTA format. Second, these FASTA files were used to generate paired-end sequencing reads in FASTQ format with varying sequencing depths, i.e., 1×, 2×, 4×, 6×, 8×, 10×, 20×, and 40×; read lengths, i.e., 75 and 100 bp; and insert sizes, i.e., 600, 1300, and 2200 bp, using pIRS [18]. Third, we ran these FASTQ files through Virus-Clip and counted how many of the simulated HBV integrations were detected. The same set of parameters was used as the input for our simulation framework to generate power estimates for comparisons.

3. Results

We collected a total of 18 molecular and bioinformatics factors (Supplementary Table 2) that potentially affect viral integration detection power. These factors were selected as they were among those representative of the paired-end sequencing process, human genome characteristics, and integration detection practices. We incorporated all these factors into our viral integration detection simulation framework (Fig. 1) as a software package: VIpower. Seven factors were further selected for association analyses with detection power. To compare the detection power of different values of the seven factors, we calculated the numbers of correctly detected viral integrations under various commonly-used values for each factor, as shown in Supplementary Table 2, leading to a total of 23,040 unique combinations of factors and values. To examine the extent of association with viral integration detection power, we carried out Pearson's correlation analyses using the combination of these computed power estimates.

We found that six of the seven factors were significantly associated with detection power, including cellular proportion (Pearson's ρ = 0.64 and P < 2 × 10^−16; adjusted α = 0.0001), sequencing depth (ρ = 0.37 and P < 2 × 10^−16), length of integrated viral sequence (ρ = 0.37 and P = 1 × 10^−13), paired-end read insert size (ρ = 0.23 and P < 2 × 10^−16), minimum number of supporting reads required to determine a viral integration event, i.e., the threshold set by users to claim successful identification of viral integration events (ρ = −0.19 and P < 2 × 10^−16), and read length (ρ = −0.09 and P < 2 × 10^−16 when the total data volume/sequencing depth was fixed; ρ = 0.1 and P < 2 × 10^−16 when the total read number was fixed). The first factor, cellular proportion, is particularly relevant to somatic viral integration detection in a heterogeneous cell population, such as cancer biopsies [20]. Fig. 2 shows the associations of the six factors. Additionally, we observed a marginal association of the seventh factor, minimum mappable length, with detection power (ρ = −0.02 and P = 0.0003).

We further analyzed the associations of the number of supporting reads, e.g., chimeric and split reads flanking the integration breakpoint, with detection power. Fig. 3 shows the pairwise correlations among the seven examined molecular and bioinformatics factors, number of supporting reads, and detection power. As expected, the observed numbers of supporting reads were strongly associated with detection power. In real HTS data, we expect fluctuations in the number of supporting reads across viral integration breakpoints. Indeed, we observed a wide variation in the number of supporting reads over the breakpoints of our simulated viral integrations (Supplementary Fig. 7).

We compared the power estimates from our simulation framework with those from Virus-Clip for each of the six significantly-associated factors. As Virus-Clip was designed to use split reads only, we tested our simulation framework by using both split and chimeric reads as well as using split reads only. Three replication experiments, each corresponding to different viral sequences and integration breakpoints, were carried out. The average detection powers were compared between Virus-Clip and our simulation framework. We found that the power estimates from our simulation framework were in general slightly lower than those from Virus-Clip when only split reads were used (Supplementary Fig. 8). When both chimeric and split reads were utilized by our simulation framework, we observed a higher detection power. For example, at 10× sequencing depth, the detection power of our simulation framework increased by approximately 10% when both chimeric and split reads were used for integration detection (Supplementary Fig. 8).

We then used our simulation framework and all the 18 factors to develop VIpower, a software package for fast estimation of viral integration detection power. VIpower is available as a Linux command line version where users may calculate power estimates for other HTS scenarios by modifying each factor with different values in the VIpower files, such as viral integration profile, distance to repeats, and GC content-specific read coordinates. This tool is also available as a user-friendly web interface for live runs of power analysis. The web interface can be used not only to query the power estimates precomputed for the values shown in Supplementary Table 2, but also to compute power estimates for new values. In the web version, 15 of the 18 factors can be customized by users because three of them are fixed: “GC distribution file”, “repeat regions”, and “random seed”. However, in the command-line version, all the 18 factors can be customized with desired values.
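The association analysis of Section 2.2 (Pearson's correlation with a Bonferroni-adjusted threshold) can be sketched as follows. The authors used `cor.test` in R; the dependency-free Python below is only an illustration, and the function names and the toy data are assumptions, not values from the study. The t statistic shown is the standard one for a Pearson correlation, with n − 2 degrees of freedom; turning it into a P value requires a t-distribution routine, which is left out to keep the sketch free of third-party packages.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have the same length (at least 3)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """Test statistic for H0: correlation = 0, with n - 2 degrees of freedom (requires |r| < 1)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


# Toy data (illustrative only): a factor value and the power estimated at that value.
factor_values = [1, 2, 3, 4, 5]
power_estimates = [2, 1, 4, 3, 5]

r = pearson_r(factor_values, power_estimates)
print(r, t_statistic(r, len(factor_values)))

# Hypothetical number of tests; the number used in the study is not restated here.
print(bonferroni_alpha(0.05, n_tests=10))
```

In the study, each of the seven selected factors would take the place of `factor_values`, with the power estimates from all simulated combinations as the second sequence.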

4. Discussion

We developed a fast, modeling-based framework to estimate viral integration detection power. This framework incorporated multiple sequence features in HTS data, such as GC content and distance to transposable elements and other repeat sequences. We compared the power estimates for different combinations of commonly-used values of seven selected molecular and bioinformatics factors, and then identified six factors that were significantly associated with viral integration detection power. We constructed a software package, VIpower, that incorporates all potential factors. These molecular and bioinformatics factors can be optimized in concert with each other to improve the power for detecting viral integrations. VIpower may be used to estimate viral integration detection power for different combinations of sequencing or analytic values. It may also be used to determine the values required to achieve a specific power when designing new sequencing experiments.

The outputs from VIpower allow users to test the effect of the interactions between factors on detection power. For instance, when the sequencing read length was increased from 100 bp to 300 bp, while the total sequence volume was fixed, the number of total supporting reads decreased by an average of 37%, resulting in an ~4% drop in detection power; however, the proportion of split reads increased 4.7-fold (Supplementary Fig. 9). As split reads are necessary to determine the exact breakpoints of integrations, the latter experimental design, i.e., increasing read length, may provide a higher chance for the precise mapping of integration breakpoints at the cost of some loss of detection power.

Another important feature of VIpower is that its runtime is minimal. This is because VIpower detects and stores viral integrations by genomic features (genomic coordinates) rather than performing actual sequence alignment.
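To illustrate why coordinate-only bookkeeping is cheap, the labelling rule of Section 2.1.4 reduces to a handful of integer comparisons per read pair. The sketch below is a hypothetical Python illustration, not VIpower's R implementation: the names `Interval`, `read_state` and `classify_pair`, the half-open coordinate convention, and the restriction to a single integration are all assumptions, and quality-control steps such as repeat filtering are omitted.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    start: int  # 0-based, inclusive
    end: int    # exclusive


def overlap(a, b):
    """Number of positions shared by two half-open intervals."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def read_state(read, integration):
    """'human', 'viral', or 'spanning' (the read crosses an integration breakpoint)."""
    viral = overlap(read, integration)
    if viral == 0:
        return "human"
    if viral == read.end - read.start:
        return "viral"
    return "spanning"


def classify_pair(read1, read2, integration, min_mappable=20):
    """Label a read pair as 'split', 'chimeric', or 'unsupported'."""
    states = (read_state(read1, integration), read_state(read2, integration))
    for read, state in zip((read1, read2), states):
        if state == "spanning":
            viral = overlap(read, integration)
            human = (read.end - read.start) - viral
            # The text says "longer than" the minimum mappable length, hence ">".
            if min(viral, human) > min_mappable:
                return "split"
    # One end entirely viral and the other entirely human: a chimeric (discordant) pair.
    if set(states) == {"human", "viral"}:
        return "chimeric"
    return "unsupported"


integration = Interval(1000, 1500)  # viral segment inside the simulated human sequence
print(classify_pair(Interval(700, 800), Interval(1100, 1200), integration))   # chimeric
print(classify_pair(Interval(950, 1050), Interval(1300, 1400), integration))  # split
print(classify_pair(Interval(100, 200), Interval(400, 500), integration))     # unsupported
```

No sequence is read or aligned at any point; the cost per read pair is constant, which is consistent with the short runtimes reported for the simulations.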
For example, each of our simulations shown in Supplementary Table 2 can be completed by one standard computing core in an average of nine seconds, ranging from 0.6 to 62 s, compared to > 100 CPU hours required by some of the previously-published viral integration detection tools [21,22]. Similarly, the web interface can conduct a power calculation within one minute.

We found no supporting reads for a small portion (2.5–7.5%) of our simulated viral integrations (Supplementary Fig. 7). This was likely due to the integrations occurring near difficult-to-sequence regions. This was consistent with our observation that the real HBV integrations maintained in the dr.VIS database were located, on average, < 500 bps from transposable elements (Supplementary Fig. 4). Proximity to repeat elements may make these viral integrations inaccessible or difficult to detect with existing short read sequencing. This might be, in part, due to the limitations of the current bioinformatics approaches, the protocols employed to generate the integrations leveraged by the dr.VIS database, or the evolutionary idiosyncrasies following the introduction of viral genome into host cells over long evolutionary periods [23]. Longer insert sizes and/or read lengths may help identify some of the viral integrations in low-complexity regions.

This study has some limitations. First, the empirical viral integrations analyzed in this study were primarily HBV integrations detected in hepatocellular carcinoma [17], and thus, the power estimates analyzed here might be specific to HBV integration detection. Further analyses are necessary to examine integrations of other viruses to determine how the detection power varies between viral species. VIpower allows users to replace the viral integration reference with that of any other virus or combination of viruses; thus, the reference files in VIpower can be updated when additional viral integration data become available. This makes VIpower applicable for virome-wide integration screens of various human samples [21]. Second, false discovery rate should be controlled when estimating detection power; however, false integration events could not be simulated in this study. Adding false discovery rate to power analysis is warranted in future analysis. Third, in this study, we compared our viral integration detection simulation framework only with Virus-Clip because of the very short runtimes of this software. However, the more recently-published tools [21] should also be analyzed. Fourth, in this study, association testing with detection power was carried out only for a portion of the collected molecular and bioinformatics factors, as shown in Supplementary Table 2. The extent of association of the remaining factors should also be examined, and additional power-associated factors may be identified in future studies.

To conclude, this is the first study focused on identifying molecular and bioinformatics factors that affect viral integration detection power using whole-genome sequencing. VIpower is the first tool for fast estimation of power for viral integration detection. The resources generated in this study may aid in sequencing library preparation and bioinformatics pipeline development. Optimization of the values for the factors related to deep sequencing designs and viral integration analyses may further help increase detection power.

Supplementary data to this article can be found online at https://doi.org/10.1016/j.ygeno.2019.01.015.

Fig. 2. Six factors significantly associated with viral integration detection power. The six factors are ordered by significance level of correlation. The box plots indicate five quantiles, and the star symbol (*) represents the average value. The correlation coefficients ρ and P values for each factor were (A) cellular proportion (ρ = 0.64, P < 2 × 10^−16), (B) sequencing depth (ρ = 0.37, P < 2 × 10^−16), (C) viral integration length (ρ = 0.37, P = 1 × 10^−13), (D) insert size (ρ = 0.23, P < 2 × 10^−16), (E) supporting reads threshold (ρ = −0.19, P < 2 × 10^−16), (F) read length (the top panel represents a scenario where the sequencing depth is fixed, ρ = −0.09, P < 2 × 10^−16; the bottom panel represents a scenario where the read number is fixed, ρ = 0.1, P < 2 × 10^−16), respectively. In each box plot, all other involved variables were simulated in equal proportion of representation (all these datasets were used in each analysis) to ensure balanced comparisons among data points. Panel x-axes: cellular proportion, sequencing depth (fold), viral integration length (bp), insert size (bp), supporting reads threshold, and read length (bp); y-axis: detection power (%).

Fig. 3. Pairwise correlations of detection power with selected factors and total number of supporting reads. The color of each square corresponds to correlation coefficient ρ (darker color corresponds to stronger correlation) while the size corresponds to the P value (smaller P value corresponds to bigger square size). The six significant factors (P ≤ 0.0001), ordered by their correlation coefficient with detection power, are cellular proportion, sequencing depth, viral integration length, insert size, user-defined minimum number of supporting reads (threshold), and read length. One additional marginal factor (minimum mappable length), observed total number of supporting reads, and runtime are also shown. All parameters represent their average values, except minimum mappable length, cellular proportion, runtime, and detection power. Matrix labels: cellular proportion, sequencing depth, viral integration length, insert size, read length, supporting reads threshold, minimum mappable length, supporting reads, split reads at 5’, split reads at 3’, chimeric reads at 5’, chimeric reads at 3’, runtime, and detection power; color scale: correlation coefficient.

Source code and web application

The source code of the software was written primarily in R (version 3.3.0). The web interface of the software was designed using HTML and PHP (version 5.3.3), and MySQL was used to store the pre-computed power estimates.

Availability of data and software

VIpower is available as a command-line application and a user-friendly web application. The command-line application, the real-time web interface, and the precomputed power estimates from the 23,040 combinations of factors and values are available at www.uvm.edu/genomics/software/VIpower.

Acknowledgements

This work was supported by the University of Vermont Start-up Fund, and the University of Vermont Cancer Center Institutional Research Grant 126773-IRG 14-196-01 from the American Cancer Society. The authors thank Michael Mariani for his help with the website design; thank Xun Chen for his constructive comments; and thank Jason Kost for his careful review of the manuscript.

Conflict of Interests

The authors declare no potential competing interests.

References

[1] W.K. Sung, H. Zheng, S. Li, R. Chen, X. Liu, Y. Li, N.P. Lee, W.H. Lee, P.N. Ariyaratne, C. Tennakoon, F.H. Mulawadi, K.F. Wong, A.M. Liu, R.T. Poon, S.T. Fan, K.L. Chan, Z. Gong, Y. Hu, Z. Lin, G. Wang, Q. Zhang, T.D. Barber, W.C. Chou, A. Aggarwal, K. Hao, W. Zhou, C. Zhang, J. Hardwick, C. Buser, J. Xu, Z. Kan, H. Dai, M. Mao, C. Reinhard, J. Wang, J.M. Luk, Genome-wide survey of recurrent HBV integration in hepatocellular carcinoma, Nat. Genet. 44 (2012) 765–769.
[2] J.D. Khoury, N.M. Tannir, M.D. Williams, Y. Chen, H. Yao, J. Zhang, E.J. Thompson, T. Network, F. Meric-Bernstam, L.J. Medeiros, J.N. Weinstein, X. Su, Landscape of DNA virus associations across human malignant cancers: analysis of 3,775 cases using RNA-Seq, J. Virol. 87 (2013) 8916–8926.
[3] J.A. Mikovits, V.C. Lombardi, M.A. Pfost, K.S. Hagen, F.W. Ruscetti, Detection of an infectious retrovirus, XMRV, in blood cells of patients with chronic fatigue syndrome, Virulence 1 (2009) 386–390.
[4] I. Carbone, T. Lazzarotto, M. Ianni, E. Porcellini, P. Forti, E. Masliah, L. Gabrielli, F. Licastro, Herpes virus in Alzheimer's disease: relation to progression of the disease, Neurobiol. Aging 35 (2014) 122–129.
[5] R. Douville, J. Liu, J. Rothstein, A. Nath, Identification of active loci of a human endogenous retrovirus in neurons of patients with amyotrophic lateral sclerosis, Ann. Neurol. 69 (2011) 141–151.
[6] D.J. Smyth, J.D. Cooper, R. Bailey, S. Field, O. Burren, L.J. Smink, C. Guja, C. Ionescu-Tirgoviste, B. Widmer, D.B. Dunger, D.A. Savage, N.M. Walker, D.G. Clayton, J.A. Todd, A genome-wide association study of nonsynonymous SNPs identifies a type 1 diabetes locus in the interferon-induced helicase (IFIH1) region, Nat. Genet. 38 (2006) 617–619.
[7] E.F. Foxman, A. Iwasaki, Genome-virome interactions: examining the role of common viral infections in complex disease, Nat. Rev. Microbiol. 9 (2011) 254–264.
[8] S.M. Karst, C.E. Wobus, M. Lay, J. Davidson, H.W. Virgin, STAT1-dependent innate immunity to a Norwalk-like virus, Science 299 (2003) 1575–1578.
[9] J.E. Gern, Rhinovirus and the initiation of asthma, Curr. Opin. Allergy Cl 9 (2009) 73–78.
[10] P. Klenerman, H. Hengartner, R.M. Zinkernagel, A non-retroviral RNA virus persists in DNA form, Nature 390 (1997) 298–301.
[11] M. Horie, T. Honda, Y. Suzuki, Y. Kobayashi, T. Daito, T. Oshida, K. Ikuta, P. Jern, T. Gojobori, J.M. Coffin, K. Tomonaga, Endogenous non-retroviral RNA virus elements in mammalian genomes, Nature 463 (2010) 84–87.
[12] V.A. Belyi, A.J. Levine, A.M. Skalka, Unexpected inheritance: multiple integrations of ancient bornavirus and ebolavirus/marburgvirus sequences in vertebrate genomes, PLoS Pathog. 6 (2010) e1001030.
[13] D.J. Taylor, J. Bruenn, The evolution of novel fungal genes from non-retroviral RNA viruses, BMC Biol. 7 (2009) 88.
[14] J. Cao, D. Li, Searching for human oncoviruses: Histories, challenges, and opportunities, J. Cell. Biochem. 119 (2018) 4897–4906.
[15] X. Chen, J. Kost, D. Li, Comprehensive comparative analysis of methods and software for identifying viral integrations, Brief. Bioinform. (2019), https://doi.org/10.1093/bib/bby070.
[16] M. Tarailo-Graovac, N. Chen, Using RepeatMasker to identify repetitive elements in genomic sequences, Curr. Protoc. Bioinformatics 25 (2009) 4.10.1–4.10.14.
[17] X. Yang, M. Li, Q. Liu, Y. Zhang, J. Qian, X. Wan, A. Wang, H. Zhang, C. Zhu, X. Lu, Y. Mao, X. Sang, H. Zhao, Y. Zhao, X. Zhang, Dr.VIS v2.0: an updated database of human disease-related viral integration sites in the era of high-throughput deep sequencing, Nucleic Acids Res. 43 (2015) D887–D892.
[18] X. Hu, J. Yuan, Y. Shi, J. Lu, B. Liu, Z. Li, Y. Chen, D. Mu, H. Zhang, N. Li, Z. Yue, F. Bai, H. Li, W. Fan, pIRS: Profile-based Illumina pair-end reads simulator, Bioinformatics 28 (2012) 1533–1535.
[19] D.W. Ho, K.M. Sze, I.O. Ng, Virus-Clip: a fast and memory-efficient viral integration site detection tool at single-base resolution with annotation capability, Oncotarget 6 (2015) 20959–20963.
[20] M. Meyerson, S. Gabriel, G. Getz, Advances in understanding cancer genomes through second-generation sequencing, Nat. Rev. Genet. 11 (2010) 685–696.
[21] X. Chen, J. Kost, D. Li, Comprehensive comparative analysis of methods and software for identifying viral integrations, Brief. Bioinform. (2018), https://doi.org/10.1093/bib/bby1070/5066709.
[22] Q. Wang, P. Jia, Z. Zhao, VirusFinder: software for efficient and accurate detection of viruses and their integration sites in host genomes through next generation sequencing data, PLoS ONE 8 (2013) e64465.
[23] M. Pistello, G. Antonelli, Integration of the viral genome into the host cell genome: a double-edged sword, Clin. Microbiol. Infect. 22 (2016) 296–298.
