Structural variation detection using next-generation sequencing data



Accepted Manuscript

Structural Variation Detection Using Next-Generation Sequencing Data: A Comparative Technical Review

Peiyong Guan, Wing-Kin Sung

PII: S1046-2023(16)30018-4
DOI: http://dx.doi.org/10.1016/j.ymeth.2016.01.020
Reference: YMETH 3896
To appear in: Methods
Received Date: 18 October 2015
Revised Date: 9 January 2016
Accepted Date: 31 January 2016

Please cite this article as: P. Guan, W-K. Sung, Structural Variation Detection Using Next-Generation Sequencing Data: A Comparative Technical Review, Methods (2016), doi: http://dx.doi.org/10.1016/j.ymeth.2016.01.020

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Structural Variation Detection Using Next-Generation Sequencing Data: A Comparative Technical Review

Peiyong Guan (1) and Wing-Kin Sung (1,2,*)

(1) School of Computing, National University of Singapore, 117543
(2) Computational & Mathematical Biology Group, Genome Institute of Singapore, 138672
(*) To whom correspondence should be addressed ([email protected])

Abstract

Structural variations (SVs) are mutations of our genome of size at least fifty nucleotides. They contribute to the phenotypic differences among healthy individuals and can cause severe diseases, including cancers, by breaking or linking genes. Thus, it is crucial to systematically profile SVs in the genome. In the past decade, many next-generation sequencing (NGS)-based SV detection methods have been proposed, owing to the significant cost reduction of NGS experiments and their ability to detect SVs without bias at base-pair resolution. These SV detection methods vary in both sensitivity and specificity, since they use different SV-property-dependent and library-property-dependent features. As a result, predictions from different SV callers are often inconsistent. In addition, noise in the data (both platform-specific sequencing errors and artificial chimeric reads) impedes the specificity of SV detection. Poorly characterized regions in the human genome (e.g., repeat regions) greatly impact reads mapping and in turn affect SV calling accuracy. Calling complex SVs requires specialized SV callers. Apart from accuracy, the processing speed of an SV caller is another factor deciding its usability. Knowing the pros and cons of different SV calling techniques and the objectives of the biological study is essential for biologists and bioinformaticians to make informed decisions. This paper describes the different components of the SV calling pipeline and reviews the techniques used by existing SV callers. Through a simulation study, we also demonstrate that library properties, especially insert size, greatly impact the sensitivity of different SV callers. We hope the community can benefit from this work both in designing new SV calling methods and in selecting the appropriate SV caller for specific biological studies.

Keywords: structural variation; next-generation sequencing

1. Structural Variations (SVs)

Figure 1. Different types of SVs and discordantly mapped reads. Blue arrows represent reads from the 5’ end and red arrows represent reads from the 3’ end. The first line of each SV type in A to G represents the reference genome sequence and the last line represents the sequence in the sample. The orange-colored sequence is the sequence being deleted, inserted, duplicated or inverted. H shows a compound event leading to an unbalanced translocation.

Structural variations (SVs) are large-scale changes in the genome, often more than 50 nucleotides (Feuk, Carson, & Scherer, 2006). They can be classified into different types (see Figure 1) based on read pair information. Deletion is the removal of a DNA sequence from the genome. Insertion is the addition of a DNA sequence into the genome. There are two types of insertions, depending on whether the inserted sequence is from the genome of the sample. If the inserted sequence is not from the genome of the sample, the insertion is called a novel insertion. One example is the insertion of the hepatitis B virus (HBV) into the human genome in hepatocellular carcinoma (Sung et al., 2012). Duplication is the copying of a DNA sequence and the pasting of the copy into the genome. Depending on the pasting position, duplications can be classified as interspersed duplications and tandem duplications. Inversion involves breaking a DNA sequence at two loci and inverting the segment between them, resulting in a reversed sequence. Translocation involves deleting a DNA sequence from one locus and inserting it at another locus in the genome. Depending on whether the chromosome of the source locus is the same as that of the target locus, translocations can be further classified as intra-chromosomal and inter-chromosomal translocations.

Deletions, insertions and duplications alter the copy number of the genome and are thus called unbalanced SVs. Inversions and translocations do not change the copy number and are called balanced SVs.

Besides the simple SVs described thus far, combinations of these SV events can occur. For example, the duplicated sequences can be inverted before being inserted into the target loci. Duplications of this nature are called inverted duplications (see Figure 1G), which can be formed via the breakage-fusion-bridge cycle (Bunting & Nussenzweig, 2013). Chromothripsis is another type of chromosomal rearrangement, where the changes are so intense that the region involved is changed beyond recognition (Weckselblatt & Rudd, 2015). Various other types of SVs can occur, resulting from different formation mechanisms (Raphael, 2012; Weischenfeldt, Symmons, Spitz, & Korbel, 2013).

Compound SV events also occur. For example, if one parent has two normal chromosomes A and B and the other parent has a balanced translocation between A and B, the child can inherit a normal chromosome A from one parent and a rearranged chromosome B from the other parent (see Figure 1H). If only the child’s sample is available, it appears that the child has an unbalanced translocation, since the child has an additional copy of genes originally on chromosome A and a reduced copy number of genes originally on chromosome B. Such compound events can only be properly explained when all samples of the pedigree are available. Many studies are designed to sequence the pedigree (e.g., CEPH pedigree 1463, http://www.illumina.com/platinumgenomes/) and compare the variations between the parents and children, not only to study the mechanisms of the SVs but also to estimate the false discovery rate of various detection tools.

SVs affect the activities within our cells, including alteration of gene copy number and change of gene regulation (Weischenfeldt et al., 2013). The variation of copy numbers often leads to a change of gene dosage, further disrupting and perturbing biological pathways and leading to undesired biological and physiological conditions (Tubio, 2015). SVs may also cause breaking and linking of genes and thus have been demonstrated to have significant biological implications. For example, the BCR-ABL gene fusion leads to increased expression of ABL kinase, causing a cascade of cellular changes that eventually results in chronic myeloid leukemia (CML) (Nowell & Hungerford, 1960; Sattler & Griffin, 2001). The 3.7Mb deletion of the Chr17p11.2 region can cause Smith-Magenis syndrome (SMS), a developmental disorder that may cause intellectual disability (Smith et al., 1986). Weischenfeldt et al. have conducted a comprehensive survey of the phenotypic impacts of structural variations (Weischenfeldt et al., 2013).

Due to the biological impacts of SVs, we need technologies to call SVs. Different technologies have different throughputs and resolutions. Microscopic genomic aberrations can often be discovered with low-resolution approaches like cytoband staining and karyotyping. Another low-throughput technique is fluorescence in situ hybridization (FISH) (Feuk et al., 2006; Speicher & Carter, 2005; Trask, 2002). FISH uses fluorescently labeled DNA probes designed to bind to the SVs of interest; after the probes are hybridized to the sample genome, we can detect whether the SVs are present. Note that the above-mentioned techniques are low throughput and low resolution; they are mainly used to confirm the existence of SVs.

Currently, two technologies are more often used to study SVs genome-wide: Array Comparative Genomic Hybridization (aCGH) and Next-Generation Sequencing (NGS). aCGH is a microarray technology. A set of probes is designed to cover the whole genome. By measuring the relative copy number changes between two samples (usually the matched normal and disease samples), we can detect unbalanced SVs (Raphael, 2012). NGS technologies allow us to perform whole genome sequencing. By mapping the reads to the reference genome, we can detect SVs. Both aCGH and NGS technologies have their limitations and advantages. aCGH can only discover unbalanced SVs. It cannot detect novel insertions and can only provide a rough estimate of the breakpoint positions. However, aCGH experiments are more economical than NGS experiments. NGS technology is often chosen because it can detect both known and novel SVs in one experiment. It can also detect both balanced and unbalanced SVs. Compared to aCGH, NGS technology can locate the SV breakpoints at base-pair resolution.
With the increased throughput and decreasing cost of sequencing technologies, NGS has become the de facto standard in studying genomic variations. International collaborations such as The Cancer Genome Atlas (TCGA) (Cancer Genome Atlas Research et al., 2013) and the International Cancer Genome Consortium (ICGC) (Joly, Dove, Knoppers, Bobrow, & Chalmers, 2012) continue to produce massive NGS data that potentially enable accurate statistical reasoning to formulate cancer biology hypotheses. This review focuses on SV detection methods that use NGS technologies.

Below, we briefly describe the NGS wet-lab protocol. The input is the genomic DNA of some sample. The NGS protocol first breaks the genome into short fragments by sonication or enzymatic cutting; then, the size selection step selects fragments of a certain size (such a size is called the insert size); finally, through high-throughput sequencing, we sequence the two ends of each fragment. Given these paired reads, the computational problem is to identify the SVs. Figure 2 illustrates the process. In this example, the SV is an insertion of a yellow-colored sequence into the genome. The two boundaries between the black-colored and yellow-colored sequences are called breakpoints. Paired reads that cross a breakpoint are called anomalous paired reads. The reads that overlap with a breakpoint are called split-reads. (In Figure 2, split-reads are reads that contain both black and yellow colored sequences, while anomalous paired reads are paired reads that contain both black and yellow colored sequences.)

Figure 2. Illustration of the NGS protocol for whole genome sequencing. The example shows an insertion of the yellow-colored sequence into the genome.

Sensitive and specific SV calling remains a challenge (Cibulskis et al., 2013; Puente et al., 2011). In this review, we focus on the computational aspect of SV calling. We survey the literature and categorize the SV callers based on the features and techniques they use (Table 1). Based on a theoretical analysis of the features used by SV callers and on the simulation study, we also suggest the appropriate SV caller for specific SV types or sequencing libraries, hoping to assist biologists in making more informed decisions when choosing among different SV callers.

2. The SV Calling Pipeline


SV calling requires comparing the sample DNA sequences with a reference genome, or comparing the DNA sequences between two samples, to find the variations. Different strategies for such comparisons are possible. Depending on whether NGS reads are mapped (aligned) to the reference genome, SV calling methods can be classified into two categories: mapping-free and mapping-based. SMUFIN (Moncunill et al., 2014) is a mapping-free method that uses a quaternary tree to directly compare sequences from two samples (normal vs. tumor), grouping the reads based on the differences between the normal and tumor reads to identify somatic SVs. SMUFIN can be used for detection of both large-scale SVs and single nucleotide variants (SNVs). Most SV callers are mapping-based; they identify SV candidates from abnormally mapped reads.

The mapping-based SV calling pipeline consists of four steps, namely (1) data preprocessing, (2) SV discovery, (3) SV verification and (4) SV annotation and visualization (Figure 3). The input is a set of paired reads extracted from the sample genome. Data preprocessing organizes and filters the input data. The SV discovery step finds the SV candidates. The SV verification step examines the SV candidates and removes false discoveries. The SV annotation and visualization step provides more information for biological analysis or biological validation. Below, we detail these four steps.

Figure 3. The 4-Staged SV Calling Pipeline.

2.1. Data Preprocessing

The data preprocessing step performs three tasks, namely reads mapping, reads filtering and reads classification. First, the reads are mapped to a reference genome using reads mappers. After that, the reads are filtered to keep only confident mappings. Finally, the confident mappings are classified into different types of anomalously mapped reads, which are indicative of SVs.

Reads Mapping

Each paired read consists of two DNA sequences at the two ends of some random DNA fragment of the sample genome. If a DNA fragment does not contain an SV breakpoint and is of high quality, the corresponding paired read will be mapped concordantly (i.e., the two reads are mapped in the correct orientation, on the correct strands, and the distance between them is consistent with the insert size distribution) and all bases of the two reads are mapped, i.e., without soft-clipping. Otherwise, the corresponding paired read is anomalous: it is either discordant (i.e., not concordant) or soft-clipped.

Paired reads are usually aligned with generic reads mappers like BWA (H. Li & Durbin, 2009) and Bowtie2 (Langmead & Salzberg, 2012). Li and Homer (H. Li & Homer, 2010) provide a comparison of reads mappers. However, certain SV callers require specific reads mappers, either for the initial reads mapping or for the subsequent realignment of anomalous reads, because those mappers can produce the types of anomalously mapped reads they require. For example, LUMPY (Layer, Chiang, Quinlan, & Hall, 2014) accepts a general mapping file, extracts the unmapped reads and remaps them with YAHA (Faust & Hall, 2012), which can produce split-read mappings. Splitread (Karakoc et al., 2012) uses mrFAST (Alkan et al., 2009) for reads mapping. BreaKmer (Abo et al., 2014) uses GATK (DePristo et al., 2011; McKenna et al., 2010) for local realignment of the reads. Gustaf (Trappe, Emde, Ehrlich, & Reinert, 2014) uses Stellar (Kehr, Weese, & Reinert, 2011) to find partial read mappings.

After the paired reads are mapped, those paired reads that do not cover a breakpoint are expected to be mapped concordantly, i.e., with the correct orientation, strand and relative positions. For example, in the Illumina paired-end sequencing protocol, the two reads of a read pair are expected to be mapped to different strands of the same chromosome, and the read mapped to the negative strand is expected to have the larger genomic coordinate. Otherwise, the paired reads are either (1) soft-clipped (i.e., partially mapped), (2) one-end-anchored (i.e., one of the reads in the paired read is unmapped) or (3) mapped discordantly.
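The concordance rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the logic of any particular mapper or SV caller: the `Read` record, the `classify_pair` function, the label strings and the `min_insert`/`max_insert` bounds are hypothetical names introduced here, and the insert size is approximated by the outer distance between the two mapped reads. A pair that is both discordant and soft-clipped receives only the first matching label.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Read:
    chrom: Optional[str]  # None means the read is unmapped
    pos: int = 0          # leftmost mapped reference coordinate
    strand: str = "+"     # "+" or "-"
    length: int = 100     # read length in bases
    clipped: int = 0      # number of soft-clipped bases


def classify_pair(r1, r2, min_insert, max_insert):
    """Label an Illumina-style paired-end read pair (hypothetical labels)."""
    if r1.chrom is None and r2.chrom is None:
        return "unmapped"
    if r1.chrom is None or r2.chrom is None:
        return "one-end-anchored"
    if r1.chrom != r2.chrom:
        return "discordant:different-chromosome"
    if r1.strand == r2.strand:
        return "discordant:same-strand"
    # The read on the negative strand is expected to have the larger coordinate.
    fwd, rev = (r1, r2) if r1.strand == "+" else (r2, r1)
    if rev.pos < fwd.pos:
        return "discordant:wrong-orientation"
    # Outer distance between the two reads, used as a proxy for the insert size.
    span = rev.pos + rev.length - fwd.pos
    if not (min_insert <= span <= max_insert):
        return "discordant:insert-size"
    if r1.clipped or r2.clipped:
        return "soft-clipped"
    return "concordant"
```

In practice the bounds `min_insert` and `max_insert` would be estimated from the insert size distribution of the concordantly mapped pairs in the library.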


Reads mapping affects both SV calling sensitivity and specificity. Mapping reads that cross SV breakpoints is challenging, since most existing reads mappers are designed to map reads allowing mismatches and are biased towards mapping the reads concordantly. Different reads mappers often use different heuristics for efficient reads mapping, and they have different sensitivities and specificities. The specificity difference between mapping concordant reads and discordant reads can be as great as 7.1%. The sensitivity differences between different mappers can be more than 10% (Lim, Tennakoon, Guan, & Sung, 2015). Mapping-based SV callers are very sensitive to the quality of the mappings. To achieve both sensitive and specific SV calling, a good reads mapper is essential. Recent developments like BatAlign (Lim et al., 2015) specifically target accurate and sensitive mapping of reads crossing and spanning SV breakpoints, potentially leading to more sensitive and specific SV predictions.

Reads Filtering

Mappers may align the reads incorrectly. The incorrect alignments can be caused either by the heuristics used by the mappers or by sequencing errors in the NGS libraries. Most reads mappers compute a mapping-quality score (MAPQ) to indicate the probability of a wrong mapping, by considering the base quality scores, secondary alignments in the genome, etc. To have confident SV predictions, SV callers often filter out low-quality mappings at this stage. For example, BreakDancer (Chen et al., 2009) only keeps mappings with MAPQ of at least 35. Ulysses (Gillet-Markowska, Richard, Fischer, & Lafontaine, 2015) only uses paired-read mappings with MAPQ greater than 20. Concordantly mapped reads are discarded unless the SV caller uses read-depth as a feature (S. S. Sindi, Onal, Peng, Wu, & Raphael, 2012). These types of filtering can remove certain mapping errors. However, such filtering may also remove SV signals, because SV-containing reads are often partially mapped, leading to lower mapping scores. Reads mapped to repetitive regions, which are hotspots for SVs, may also have lower MAPQs because of similar secondary alignments. Such upstream filtering inevitably impacts the sensitivity of the pipeline.
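A minimal sketch of such a filter is given below. It relies on the SAM convention that MAPQ is Phred-scaled, i.e., MAPQ = -10 log10 Pr(mapping position is wrong). The function names, the tuple layout `(name, mapq_read1, mapq_read2)` and the rule that both reads of a pair must pass the threshold are assumptions made for illustration; individual SV callers apply their own rules.

```python
def mapq_to_error_prob(mapq):
    """Convert a Phred-scaled MAPQ into the probability of a wrong mapping."""
    return 10 ** (-mapq / 10)


def filter_pairs(pairs, min_mapq):
    """Keep the read pairs whose two reads both have MAPQ >= min_mapq.

    pairs: iterable of (name, mapq_read1, mapq_read2) tuples (assumed layout).
    """
    return [p for p in pairs if min(p[1], p[2]) >= min_mapq]


pairs = [("p1", 60, 60), ("p2", 60, 10), ("p3", 37, 35)]
kept = filter_pairs(pairs, 35)  # a BreakDancer-like threshold of 35
print([name for name, _, _ in kept])
```

With a threshold of 35, the pair `p2` is dropped because one of its reads has MAPQ 10, which corresponds to a 10% chance of a wrong mapping. If `p2` were a partially mapped read pair at a true breakpoint, this is exactly the loss of signal described above.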

Reads Classification

Given the alignments of the reads, often in sorted SAM/BAM (H. Li et al., 2009) format (containing the chromosomes, positions and strands of the mapped read pairs), the data preprocessing step identifies the anomalously mapped reads, which provide evidence for the existence of SVs. Mapping-based SV callers mainly infer SVs from three types of anomalously mapped reads, namely (1) discordant reads, which are mapped to different chromosomes (signal for translocations or transpositions, Figure 1F), to incorrect strands (signal for inversions, Figure 1E), in incorrect orientation (signal for duplications, Figure 1C, Figure 1D) or with incorrect insert size (signal for insertions and deletions, Figure 1A, Figure 1B); (2) soft-clipped reads (signal for the breakpoint, mapped partially); and (3) one-end-anchored reads (signal for the breakpoint, unmapped reads whose mates are mapped). During the reads classification step, SV callers extract one or more types of anomalously mapped reads and organize them in certain ways to enable efficient SV discovery and verification (see Figure 1 for the discordantly mapped reads for each SV type; see Table 1 "Anomalously Mapped Reads Used" column for details).

2.2. SV Discovery

Given the anomalously mapped reads discovered during data preprocessing, the SV discovery step aims to identify regions of possible SV candidates. In this section, we first classify the types of anomalously mapped reads into two cases: direct and indirect. After that, we discuss the various SV discovery techniques: clustering, split-reads alignment, contig (consensus sequence) assembly and statistical testing. Generally, SV callers assign their predictions to different SV types based on the types of anomalously mapped reads. For example, if SV callers detect paired reads mapped to two different chromosomes, they report translocations/transpositions; if SV callers detect reads mapped with incorrect insert sizes, they report insertions/deletions; and similarly for other types of SVs. The soft-clipped and one-end-anchored reads can be aligned to the SV breakpoint regions to refine the positions of the breakpoints up to base-pair resolution (Figure 1).

2.2.1. Direct vs. Indirect Cases

Different SV callers use different types of anomalously mapped reads, which affects their performance. Here, we classify the anomalously mapped reads that support an SV into two cases: the direct case and the indirect case.

The direct case refers to discordantly mapped read pairs, which are the most direct evidence, since the mapping locations of their two reads indicate the approximate locations of the two regions that are linked together in the SV genome. Defining the exact breakpoints requires further computation, either by local assembly or by split-reads remapping. Direct case SV callers include BreakDancer (Chen et al., 2009), DELLY (Rausch et al., 2012) and other SV callers that use clustering of discordant read pairs (see Table 1).

The indirect case refers to soft-clipped reads or one-end-anchored reads. In this case, the SV callers only know one end of the SV. To discover the other end of the SV, they need to map the soft-clipped portions of the reads or the one-end-anchored reads within the reference genome. Table 1 gives a list of indirect case SV callers. CREST (J. Wang et al., 2011) is one of the indirect case SV callers. In principle, any SV can be detected by either the direct case or the indirect case of anomalously mapped reads.

Apart from the direct and indirect cases, we can also use read-depth (i.e., the number of reads covering a genomic region) to call SVs. The read-depth feature alone can only detect unbalanced SVs. However, read-depth changes at the boundaries of balanced SVs can also be used to better annotate the balanced SVs. For example, GASVPro (S. S. Sindi et al., 2012) uses subtle local changes of the read-depth at the breakpoint for zygosity predictions.
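As a rough illustration of how read-depth can flag an unbalanced SV, the sketch below tests the number of reads observed in a genomic window against a Poisson null model with a known expected count. The Poisson assumption, the function names and the example counts are choices made here for illustration; they are not taken from GASVPro or from any other caller, and real tools typically also correct for GC content and mappability.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term.

    Intended for moderate lam (math.exp(-lam) underflows for lam above ~745).
    """
    if k < 0:
        return 0.0
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return min(total, 1.0)


def depth_p_values(observed, expected):
    """Return (p_low, p_high) for the read count observed in one window.

    p_low  = P(X <= observed): small values suggest a deletion-like loss.
    p_high = P(X >= observed): small values suggest a duplication-like gain.
    """
    p_low = poisson_cdf(observed, expected)
    p_high = 1.0 - poisson_cdf(observed - 1, expected)
    return p_low, p_high


# Hypothetical window with 30 expected reads and only 5 observed.
print(depth_p_values(5, 30.0))
```

A window whose `p_low` or `p_high` falls below a chosen significance level, after correcting for the number of windows tested, would be reported as a copy-number change candidate.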

Figure 4. SV features. (A) shows a segment with an SV in the sample genome. It also gives 6 read pairs extracted near the SV. (B) The 6 read pairs in (A) are aligned on the reference genome. (a) to (c) are mapped as discordant reads. (a’) and (c’) are mapped as soft-clipped reads. (b’) is mapped as a one-end-anchored read. Read-depth can also be used as a feature to further validate unbalanced SVs.

2.2.2. SV Discovery Techniques

By combining different anomalously mapped reads, we can identify candidate SVs (Medvedev, Stanciu, & Brudno, 2009) (Figure 5). Four commonly used techniques are:

(1) Clustering (CL), which groups the discordant reads that are mapped in proximity to identify SV regions;
(2) Split-reads alignment (SA), which maps the unmapped portion of the soft-clipped reads or the one-end-anchored reads to find the pairing breakpoints;
(3) Contig assembly (CA), where the anomalously mapped reads are de novo assembled to form longer consensus sequences (called contigs) that can be remapped back to the reference genome to identify the pairing breakpoints; and
(4) Statistical testing (ST), which is often used to detect copy-number variations using the local variations of read-depth.

The “techniques” column in Table 1 summarizes the techniques used by different SV callers. Below, we detail the ideas of these four techniques.

Figure 5. Structural variation (SV) calling techniques. Clustering (CL) groups reads in proximity. Split-Reads Alignment (SA) maps the partially-mapped reads across the breakpoint of an SV candidate, defining the accurate breakpoint. Contig Assembly (CA) forms the consensus sequence of the candidate SV, which can be remapped to the reference genome to define the accurate breakpoint. Statistical Testing (ST) is mainly used to combine the different types of anomalously mapped reads at the breakpoint for defining the confidence level.

Clustering (CL)

Different clustering methods have been proposed, but the objective is the same: to group together the reads that support the same SV. VariationHunter (Hormozdiari et al., 2010) transforms the clustering problem into a set-cover problem to identify the minimum number of clusters that can cover the set of discordant reads. GASV (S. Sindi, Helman, Bashir, & Raphael, 2009) and GASVPro (S. S. Sindi et al., 2012) formulate the clustering problem as a computational geometry problem (plane sweep) to efficiently find the set of clusters. DELLY (Rausch et al., 2012), CLEVER (Marschall et al., 2012) and Ulysses (Gillet-Markowska et al., 2015) use graph-based methods that try to find cliques in the reads mapping graph. SVMiner (Hayes, Pyon, & Li, 2012) uses a model-based clustering method. Most other SV callers first estimate the insert sizes of the library and group those reads that lie within the estimated maximum insert size. For example, BreakDancer (Chen et al., 2009) is an SV caller that purely uses discordant read pairs. The read pairs are clustered together to form confident SV candidates.

It is known that clustering methods are less sensitive to small insertions or deletions, because the read pairs containing those small SVs are often not recognizable as discordantly mapped read pairs: the small SV size is below the standard deviation of the insert size of the concordantly mapped read pairs.
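The simplest of these strategies, grouping discordant pairs whose two ends both lie within the maximum insert size of each other, can be sketched as follows. This greedy single-pass version is an illustrative stand-in and not the algorithm of BreakDancer or of any other caller named above; `cluster_discordant`, the tuple layout and `min_support` are hypothetical names.

```python
def cluster_discordant(pairs, max_insert, min_support=2):
    """Greedily group discordant read pairs that may support the same SV.

    pairs: list of (chrom1, pos1, chrom2, pos2) tuples, one per discordant pair.
    A pair joins an existing cluster when it links the same two chromosomes and
    both of its ends lie within max_insert of the cluster's first pair.
    Clusters with fewer than min_support pairs are discarded.
    """
    clusters = []
    for pair in sorted(pairs):
        c1, p1, c2, p2 = pair
        for cluster in clusters:
            a1, q1, a2, q2 = cluster[0]
            if (c1, c2) == (a1, a2) and abs(p1 - q1) <= max_insert \
                    and abs(p2 - q2) <= max_insert:
                cluster.append(pair)
                break
        else:
            # No compatible cluster was found, so start a new one.
            clusters.append([pair])
    return [c for c in clusters if len(c) >= min_support]


pairs = [
    ("chr1", 1000, "chr1", 5000),
    ("chr1", 1050, "chr1", 5100),
    ("chr1", 1100, "chr1", 4950),
    ("chr1", 9000, "chr2", 300),   # isolated pair, likely noise
]
print(cluster_discordant(pairs, max_insert=500))
```

Raising `min_support` trades sensitivity for specificity: the isolated chr1/chr2 pair is dropped here, which removes a likely chimeric read but would also discard a true SV covered by a single discordant pair.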

Split-Reads Alignment (SA)

Split-reads alignment tries to align soft-clipped reads and one-end-anchored reads to either find the matching breakpoints (indirect case) or refine the breakpoints identified by discordantly mapped reads (direct case).

For the indirect case, CREST (J. Wang et al., 2011), ClipCrop (Suzuki, Yasuda, Shiraishi, Miyano, & Nagasaki, 2011) and Socrates (Schröder et al., 2014) use soft-clipped reads to detect the breakpoints. However, they align the soft-clipped reads differently. Instead of aligning the split-reads directly, CREST first builds the consensus sequence of the soft-clipped reads and aligns the consensus sequence to the reference genome to find the pairing breakpoint. ClipCrop (Suzuki et al., 2011) aligns the split-reads within the 1000bp flanking each candidate breakpoint. Restricting alignments to local regions makes it more likely for the reads, especially short ones, to align to the desired position. However, restricted alignment may also miss long-distance SVs like translocations, long indels or long duplications. On the other hand, aligning split-reads globally searches the entire reference genome for the best alignments of the soft-clipped reads to find the matching breakpoints, but this requires longer soft-clipped reads to find confident mappings. Socrates (Schröder et al., 2014) uses a hybrid approach: long soft-clipped reads are used to find breakpoints by realigning them globally against the reference genome, while short soft-clipped reads are used to support the correctness of the identified breakpoints. The alignment not only finds the matching breakpoint but also defines the exact breakpoint.

For the direct case, SV callers define the SV candidates through clustering of discordant reads. These methods often use split-reads alignment as a filtering method to validate the discordant read clusters. For example, DELLY (Rausch et al., 2012) uses split-reads alignment to define the exact positions of the breakpoints; it aligns the split-reads across the two regions linked by the discordant clusters. Gustaf (Trappe et al., 2014) uses a split-read graph to find the best split-alignment, with the shortest edit distance along a path. SV callers like PRISM (Jiang, Wang, & Brudno, 2012) use only one-end-anchored (OEA) reads for refining the breakpoints, which is effective when OEA reads are abundant.

Note that OEA reads are abundant only if the sequencing error rate is high or there are novel insertions. As NGS technologies develop, read lengths increase, error rates fall and read mappers achieve better mapping rates, so the number of OEA reads is expected to decrease. Thus, methods that use only OEA reads, rather than combining soft-clipped and OEA reads, will suffer in sensitivity.
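The trade-off between local and global split-read alignment can be illustrated with a toy search for the soft-clipped part of a read. The sketch below uses exact string matching in place of a real aligner, so it ignores mismatches, indels and the reverse strand; the function name, the `window` and `min_len` parameters and the uniqueness rule are assumptions made for illustration only.

```python
def find_partner_breakpoint(reference, clipped_seq, anchor_pos,
                            window=None, min_len=10):
    """Locate the soft-clipped part of a read by exact matching.

    window=None searches the whole reference (global strategy); otherwise only
    reference[anchor_pos - window : anchor_pos + window] is searched (local
    strategy). Returns the match position when it is unique, else None.
    """
    if len(clipped_seq) < min_len:
        return None  # too short to be placed confidently
    if window is None:
        start, end = 0, len(reference)
    else:
        start = max(0, anchor_pos - window)
        end = min(len(reference), anchor_pos + window)
    hits = []
    i = reference.find(clipped_seq, start, end)
    while i != -1:
        hits.append(i)
        i = reference.find(clipped_seq, i + 1, end)
    return hits[0] if len(hits) == 1 else None


# Toy reference: the clipped sequence occurs once, 1900 bases from the anchor.
clip = "ACGTTGCAGGCTTACGATCG"
reference = "A" * 2000 + clip + "C" * 2000
print(find_partner_breakpoint(reference, clip, anchor_pos=100, window=1000))
print(find_partner_breakpoint(reference, clip, anchor_pos=100))
```

The local search misses the distant partner breakpoint that the global search finds, while a clipped sequence shorter than `min_len` is rejected outright because it cannot be placed with confidence.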

Contig Assembly (CA)

Paired reads may be too short to accurately locate the SVs. Some SV callers therefore assemble paired reads into contigs. When the contigs are long enough, it is easier to locate the breakpoints of the SVs. Different SV callers may use different methods to build the contigs. Some methods (Hajirasouliha et al., 2010; Mohiyuddin et al., 2015; Wong, Keane, Stalker, & Adams, 2010) build the contigs with existing de novo assemblers (SSAKE (Warren, Sutton, Jones, & Holt, 2007), ABySS (Simpson et al., 2009), Velvet (Miller, Koren, & Sutton, 2010), SGA (Simpson & Durbin, 2012), SPAdes (Bankevich et al., 2012) and Cortex (Iqbal, Caccamo, Turner, Flicek, & McVean, 2012) etc.), which are designed mainly for whole-genome assembly, are biased towards producing long contigs and are often computationally demanding. Some SV callers (like TIGRA (Chen et al., 2014)) use an assembler that was specifically designed for SV breakpoint assembly.

Some SV callers obtain the contigs by the pileup approach. For example, Socrates (Schröder et al., 2014) aligns the split-reads and combines them with the discordant read pairs in a pileup to form the consensus sequence using a voting matrix approach. The consensus sequence formed by Socrates (Schröder et al., 2014) is additional information output by the program; it is not used for discovering or verifying SVs. The pileup approach is computationally more efficient because it uses the existing alignment information to get the relative positions of the sequences involved, without the need to re-compute the positions of the reads.

No matter which approach is used for contig assembly, the SV callers need to handle low coverage data, which results in discontinuous contigs with gaps. For example, SVMerge (Wong et al., 2010) uses ABySS (Simpson et al., 2009) or Velvet (Zerbino & Birney, 2008) for assembly, which can join adjacent non-overlapping contigs by scaffolding. SOAPindel (S. Li et al., 2013) starts the de novo assembly process from one-end-anchored (OEA) reads and adds the mapped reads to build the contig of the breakpoints. However, the availability of OEA reads depends on both the reads coverage and the reads mapping software.

Contig assembly methods can also detect complex events and find the sequences of novel insertions. TIGRA (Chen et al., 2014) can detect complex events like inversions occurring together with deletions. NovelSeq (Hajirasouliha et al., 2010) uses ABySS (Simpson et al., 2009) to find the novel insertion sequence.

CREST (J. Wang et al., 2011) is also one of the methods that use contig assembly to find the breakpoints. It finds the first breakpoint by looking for soft-clipped reads clipped at the same position; the soft-clipped reads are then assembled and mapped to the reference genome to search for the matching breakpoint. Finally, the soft-clipped reads at the matching breakpoint are assembled and mapped back to the first breakpoint.
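To make the pileup idea concrete, the sketch below builds a consensus from soft-clipped segments that were all clipped at the same reference position, using a column-wise majority vote. It is a simplified illustration in the spirit of a voting matrix and does not reproduce the implementation of Socrates or CREST; the function name and the `min_votes` cutoff are assumptions.

```python
from collections import Counter


def consensus_from_clips(clipped_seqs, min_votes=2):
    """Column-wise majority vote over clipped segments sharing one clip position.

    The sequences are assumed to be left-aligned at the clip point, so column i
    of every sequence corresponds to the same base beyond the breakpoint.
    Extension stops at the first column whose winning base has < min_votes.
    """
    consensus = []
    longest = max((len(s) for s in clipped_seqs), default=0)
    for col in range(longest):
        votes = Counter(s[col] for s in clipped_seqs if len(s) > col)
        base, count = votes.most_common(1)[0]
        if count < min_votes:
            break  # support is too thin to extend the consensus further
        consensus.append(base)
    return "".join(consensus)


clips = ["ACGTAC", "ACGTACGG", "ACCTACG", "ACGT"]
print(consensus_from_clips(clips))  # prints ACGTACG
```

The single mismatching base in the third segment is outvoted, and the consensus stops where only one read remains. The resulting sequence is longer than most individual clipped segments, which is what makes it easier to map confidently to the reference genome.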

Statistical Testing (ST)

The statistical testing techniques are commonly used for detecting copy number variations (CNVs). Most of these methods use statistical models to test the observed read depth against null distributions (i.e., the distributions expected in the absence of CNVs; see Table 1 for more details). Statistical methods are also used by general SV calling methods to derive a confidence score for filtering and ranking the predictions. For example, BreakDancer (Chen et al., 2009) computes a confidence score based on the read mappings, measuring the depth and breadth of the clusters.

2.2.3. Hybrid-Approach for SV Discovery

The previous subsections describe methods that use either direct-case or indirect-case mapped reads for calling SVs. Those methods cannot fully utilize all the information provided by the sequencing dataset. Here, we study SV callers that integrate different types of anomalously mapped reads to improve SV calling sensitivity. DELLY (Rausch et al., 2012) uses discordant reads and one-end-anchored reads, and optionally soft-clipped reads. However, it uses only discordant reads during SV discovery; the one-end-anchored reads are used in the SV verification step to define the exact breakpoints. SoftSV (Bartenhagen & Dugas, 2015) is another SV caller that uses both paired reads and soft-clipped reads. SoftSV achieves better sensitivity because it includes discordant read clusters of size 1 (singletons) as SV candidates. When discordant reads are absent, SoftSV pairs the soft-clipped reads with their neighbors to identify small SVs. The implicit assumption of SoftSV is that large SVs will have discordantly mapped reads. However, this might not be the case: with low-coverage data (Figure 7), even large SVs might not be covered by discordant read pairs. Meerkat (L. Yang et al., 2013) and Scalpel (Narzisi et al., 2014) both use all types of anomalously mapped reads (i.e., discordant reads, soft-clipped reads and one-end-anchored reads). The difference is that Meerkat uses realignment and clustering of the anomalously mapped reads, whereas Scalpel integrates the reads via de novo assembly.
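To make the read categories used by hybrid callers concrete, the sketch below labels a single paired-end alignment record as discordant, soft-clipped, one-end-anchored, unmapped or concordant. It relies only on the standard SAM FLAG bits and CIGAR string. The function name `classify_read` and the label strings are assumptions for illustration. It also assumes that the mapper sets the proper-pair bit according to its own insert-size model, so "discordant" here simply means "paired, both ends mapped, not flagged as a proper pair".

```python
import re

# Standard SAM FLAG bits.
PAIRED, PROPER_PAIR, UNMAPPED, MATE_UNMAPPED = 0x1, 0x2, 0x4, 0x8

def classify_read(flag, cigar):
    """Return the set of evidence labels carried by one paired-end record."""
    if flag & UNMAPPED:
        # The read itself has no alignment; its CIGAR is not informative.
        return {"unmapped"}
    labels = set()
    if flag & MATE_UNMAPPED:
        # This end is the anchor; its mate could not be mapped.
        labels.add("one_end_anchored")
    elif not flag & PROPER_PAIR:
        # Both ends mapped, but orientation or distance is unexpected.
        labels.add("discordant")
    if re.search(r"\d+S", cigar):
        labels.add("soft_clipped")
    return labels or {"concordant"}
```

A read can carry more than one label (for example, discordant and soft-clipped), which is the information a hybrid caller can exploit and a single-signal caller cannot.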

Besides sensitivity and specificity, the types of anomalously mapped reads used by SV callers also determine the resolution of the SV prediction (Table 1). Methods using clustering of paired-end reads cannot achieve base-pair resolution, whereas split-read or contig alignment methods can in general detect the exact breakpoint by aligning reads or contigs that cover the breakpoint of the SV. Other than SV callers that use purely discordant, soft-clipped and one-end-anchored reads, there are SV callers that also use the concordantly mapped reads as background information to assess the likelihood of the SVs. For example, CLEVER (Marschall et al., 2012) uses both concordant and discordant reads in a region to identify signals of possible insertions and deletions. SV callers that use more types of anomalously mapped reads during the SV discovery stage can in general find more SV candidates, making them more sensitive. However, using more features also increases the number of SV candidates to verify, which may make the SV caller less efficient. Often, it is necessary to make trade-offs between speed and sensitivity. Most SV callers have parameters that can be tuned to shorten the SV candidate list.

2.3. SV Verification

The SV verification step checks the candidate SVs to find more evidence to support the predictions. Even though mappers can map most reads correctly, they can produce many wrong mappings for reads crossing SV breakpoints, which leads to false predictions. Thus, it is important for SV callers to check the origins of the SV-supporting reads to ensure that they are not artifacts created by the mappers. Usually, the SV candidates defined only by discordant read clusters are quite noisy, whereas clusters with split-read support tend to be more specific. Thus, split-reads are often used by SV callers to filter out the noisy candidates (Jiang et al., 2012; Newman et al., 2014; Rausch et al., 2012; L. Yang et al., 2013). Other SV callers, like GASVPro (S. S. Sindi et al., 2012), use read depth to validate the SV breakpoints with statistical methods. Another type of filtering uses known features of genomic regions. For example, repeat regions, telomeres and centromeres are known to harbor many predictions that are more likely to be caused by mapping noise (Wong et al., 2010). Some SV callers, like FusionSeq (Sboner et al., 2010) and FACTERA (Newman et al., 2014), filter SVs near repeat regions defined by RepeatMasker.
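The two verification filters discussed above, requiring split-read support and discarding candidates in annotated repeat regions, can be sketched as follows. The `Candidate` record, the `verify` function, the interval representation and the default `min_split_reads=1` are all hypothetical names and values chosen for the example. They do not reproduce the logic of any specific caller.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    chrom: str
    pos: int               # breakpoint position
    discordant_pairs: int  # size of the discordant read cluster
    split_reads: int       # split reads supporting the breakpoint

def in_any_interval(chrom, pos, intervals):
    """True if pos falls inside a (chrom, start, end) half-open interval."""
    return any(c == chrom and start <= pos < end for c, start, end in intervals)

def verify(candidates, repeats, min_split_reads=1):
    """Keep candidates with split-read support that lie outside repeat regions."""
    return [
        c for c in candidates
        if c.split_reads >= min_split_reads
        and not in_any_interval(c.chrom, c.pos, repeats)
    ]
```

The linear scan over repeat intervals is adequate for a sketch. A genome-wide annotation would normally be indexed (for example with a sorted list or an interval tree) before querying.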


Other than the filtering methods mentioned above, some SV callers use ad hoc quantitative filters based on the number of discordant reads (Chen et al., 2009) or the number of split-reads (Schröder et al., 2014; J. Wang et al., 2011).

2.4. SV Annotation

SV callers may optionally include an annotation step to annotate the called SVs. This step provides more information to biologists for follow-up studies. For example, Meerkat (L. Yang et al., 2013) annotates the formation mechanisms of the SVs. Socrates (Schröder et al., 2014) annotates the breakpoints in terms of their features, namely blunt-end joining, micro-homology and untemplated sequences. BreakSeq (Lam et al., 2010) is an annotation tool that annotates given SVs. It infers the ancestral state and the SV formation mechanisms by comparing the sequences against primate sequences, and it also annotates the SVs’ location, motifs and physical properties. To summarize, the SV annotation modules of SV callers generally provide information on (1) location of SVs; (2) nearest genes; (3) formation mechanisms; (4) sequence features. Such annotations provide important information to biologists, who are more interested in the formation mechanisms and downstream implications of the SVs. However, many SV annotations are performed on an ad hoc basis. Very often only the closest genes are used to annotate the breakpoints to produce the list of candidate genes for subsequent studies. There is a need for more standardized and comprehensive SV annotation tools.

2.5. SV Visualization

SV visualization is a very important step in the SV calling process. SVs are often visualized using either customized or standard tools. Visualization enables biologists to inspect the SVs visually and often leads to more insightful conclusions. Visualization tools can also help biologists make sense of the data, judge the specificity of the predictions and hypothesize the mechanisms of SV formation. Currently, visualization tools mainly focus on the input data and the final SV predictions. Less effort has been spent on showing the intermediate outputs (often in non-standardized formats) of the SV detection tools.

Customized SV visualization tools are available in some of the SV calling pipelines. They are specifically designed for SV visualization, showing the supporting reads and the interactions between different genomic regions. They either produce standardized output files that can be shown in standard visualization tools or integrate visualization modules into the processing pipeline. For example, SVDetect (Zeitouni et al., 2010) includes a CIRCOS (Krzywinski et al., 2009) visualization that shows the SVs by chromosome, enabling users to study the patterns of the SV positions. ViVar (Sante et al., 2014) is an SV analysis and visualization platform (with built-in established standard SV calling pipelines) that can visualize the results in a genome browser. Gremlin (O'Brien, Ritz, Raphael, & Laidlaw, 2010) provides multiple-scale, interactive SV visualizations, showing the SV regions and other information such as copy count. TargetSeqView (Halper-Stromberg, Steranka, Burns, Sabunciyan, & Irizarry, 2014) is another SV visualization tool that uses probability-based scores and visualization to distinguish correct SV predictions from mapping artifacts. Fastbreak (Bressler et al., 2012) offers scalable visualization of large numbers of samples, which allows interactive analysis and visual data mining. However, Fastbreak can only give a rough estimate of the SV breakpoints, without explicitly indicating the nature of the SVs. CIRCUS (Naquin, d'Aubenton-Carafa, Thermes, & Silvain, 2014) is an R package designed for visualizing SVs; it provides an easy-to-use R wrapper to visualize the outputs of SV callers. While most of the previously mentioned visualization tools only show the reads mapped to the reference genome, svviz (Spies, Zook, Salit, & Sidow, 2015) visualizes the supporting reads of an SV candidate by sorting and assigning the reads to either the reference allele or the alternate (SV) allele based on Smith-Waterman alignment (Zhao, Lee, Garrison, & Marth, 2013). Those alignments are visualized in separate tracks.

Besides the customized visualization tools mentioned above, standard general-purpose NGS data visualization tools (mainly genome browsers) can also be used to visualize SV data. For example, in the UCSC Genome Browser (Kent et al., 2002), the input BAM file (H. Li et al., 2009) can be visualized as a track. Alternatively, the BAM file can be converted to a WIG file and visualized as the coverage of the SV-supporting reads. The Integrative Genomics Viewer (IGV) (Robinson et al., 2011; Thorvaldsdottir, Robinson, & Mesirov, 2013) is another commonly used NGS data visualization tool. It supports most of the standardized file formats, including BAM (H. Li et al., 2009), Browser Extensible Data (BED), Variant Call Format (VCF) and many others. Those files in combination can be used to visualize the SV breakpoints and the supporting paired reads. Both the UCSC Genome Browser and IGV allow users to zoom in to specific regions of the genome to inspect the supporting reads. Interested readers may refer to (Pavlopoulos et al., 2013; J. Wang, Kong, Gao, & Luo, 2013) for more detailed reviews of genome browsers.
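As a small illustration of preparing SV calls for a genome browser, the sketch below formats one call as a line of a four-column BED file, a format that both browsers mentioned above accept. The function name `sv_to_bed_line` and the choice of the SV type as the feature name are assumptions for the example. The only format fact relied upon is that BED uses tab-separated, 0-based, half-open coordinates.

```python
def sv_to_bed_line(chrom, start, end, sv_type):
    """Format an SV given in 1-based inclusive coordinates as a BED4 line.

    BED is 0-based and half-open, so only the start is shifted by one.
    """
    return f"{chrom}\t{start - 1}\t{end}\t{sv_type}"

def write_bed(path, calls):
    """Write (chrom, start, end, sv_type) tuples to a BED file."""
    with open(path, "w") as handle:
        for chrom, start, end, sv_type in calls:
            handle.write(sv_to_bed_line(chrom, start, end, sv_type) + "\n")
```

The resulting file can be loaded as an annotation track next to the BAM file of supporting reads.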


Most of the existing visualization tools focus on visualizing SVs for one sample or a small number of samples. With collaborative sequencing efforts (Cancer Genome Atlas Research et al., 2013; Genomes Project et al., 2012), there is a need for visualization tools that can handle many samples. Further, the visualization tools often lack support for interactive analysis of the data, which requires efficient computational strategies to aggregate the data and provide real-time responses to biologists’ queries. Such interactive visualization tools would aid biologists in making interesting biological discoveries.

3. SV and Library Properties Impact SV Calling

SV calling is a complex process. It is affected by SV properties and library properties (Figure 6), and further by sequencing errors, the quality of the reference genome and the sequence context. This section discusses these factors.

Figure 6. SV and library properties impact SV discovery. The SV types and library properties affect the read mapping, which in turn determines the SV discovery performance.

3.1. SV Properties Impact SV Calling

SV properties, including SV type, size and frequency, determine the coverage and the types of anomalously mapped reads, which in turn determine the performance of SV callers.


SV types influence the local coverage of the SV breakpoints as well as the types of anomalously mapped reads. Different types of SVs produce different types of anomalously mapped reads (Figure 1). For example, deletions produce discordant reads with insert sizes longer than expected. Inter-chromosomal translocations produce read pairs that are mapped to different chromosomes. SV types can also impact the local coverage.

SV size can impact the types of anomalously mapped reads. Small SVs, such as small indels, may not produce the necessary discordant reads, because the insert size standard deviation might be even larger than the SV size, leaving the SV undetectable when examining changes in the insert size. Detecting small indels requires split-read mapping methods. For example, Gustaf (Trappe et al., 2014) identifies SVs from 30 to 100bp by finding the optimal alignment in the split-read graph.

SV frequency is another factor that impacts SV calling. In a heterogeneous sample, an SV may be supported at only a low allele frequency. Less frequent SVs are under-sampled during library preparation for sequencing. Hence, SV callers may fail to detect low-frequency SVs, since the number of anomalously mapped reads covering these SVs may be too small. The situation becomes worse when the SV caller does not use all types of anomalously mapped reads. Simply increasing the coverage of the library may solve the problem. However, it costs more to sequence the library and takes longer to process the data. SV callers using read count thresholds will be less sensitive to low-frequency SVs (Chen et al., 2009; S. Sindi et al., 2009; J. Wang et al., 2011).

3.2. Library Properties Impact SV Calling

Library properties describe the parameters used in the wet-lab protocol. Those properties include coverage (i.e., sequencing depth), insert size and read length of the paired reads. They impact the types and the frequencies of the anomalously mapped reads.
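How the insert-size distribution of a library limits which deletions produce discordant read pairs (the small-indel case noted in Section 3.1) can be illustrated with a simple cutoff rule. The sketch assumes that a pair is called discordant when its mapped insert size exceeds the library mean by more than `k` standard deviations. The value `k=3.0` and both function names are assumptions for the example, and real callers differ in how they set this cutoff.

```python
def discordant_cutoff(mean_insert, sd_insert, k=3.0):
    """Mapped insert size above which a pair is treated as discordant."""
    return mean_insert + k * sd_insert

def deletion_detectable(deletion_size, mean_insert, sd_insert, k=3.0):
    """Would a typical pair spanning the deletion be flagged as discordant?

    A pair whose true insert size equals the library mean maps with an
    apparent insert size of mean_insert + deletion_size, so it exceeds the
    cutoff only when the deletion is larger than k standard deviations.
    """
    return mean_insert + deletion_size > discordant_cutoff(mean_insert, sd_insert, k)
```

Under these assumptions, a library with a 500bp mean insert size and a 50bp standard deviation would not flag a typical pair spanning a 50bp deletion, while a 1000bp deletion would be flagged. This is why small indels are left to split-read methods.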
A library of higher coverage produces more reads, and potentially more anomalously mapped reads for the SV callers to work on. Library insert size determines the types of anomalously mapped reads that will be produced by read mappers. Libraries with smaller insert sizes will have more reads crossing the breakpoint than spanning the breakpoint (Figure 4). Libraries of smaller and larger insert sizes both have advantages and disadvantages (English et al., 2015). Libraries with smaller insert sizes (e.g., Illumina HiSeq) often produce reads mapped across the breakpoint, which can be used to define the breakpoint accurately. However, a small-insert-size library only covers a short sequence of the sample, making it hard to cover large SVs. On the other hand, libraries with large insert sizes (e.g., Illumina Nextera Mate Pair Sequencing) can in general cover large-scale SVs, but may lack the read coverage at the exact breakpoint that is needed to determine the breakpoint at base-pair resolution. Thus, it might be beneficial to integrate paired reads of different insert sizes to call SVs. Biologists may also consider optimizing the library properties to suit the chosen SV callers.

To study the impact of insert size on the types of anomalously mapped reads, we performed a simulation study. We randomly simulated 3260 SVs in hg19 using RSVSim (Bartenhagen & Dugas, 2013) to produce an SV genome (hg19Sv). We then simulated paired reads (read length: 100bp) at 2x coverage from hg19Sv using pIRS (Hu et al., 2012), with insert sizes from 170bp to 800bp. In total, we generated 14 libraries. Reads from all the libraries were mapped to hg19 using BWA-MEM (H. Li & Durbin, 2009). Figure 7 shows the number of SVs covered by different types of anomalously mapped reads. The number of SVs covered by both discordant reads and soft-clipped reads varies considerably with the insert size. Libraries with smaller insert sizes tend to have more soft-clipped reads mapped. As the insert size increases, the number of SVs covered by at least two soft-clipped reads decreases (blue + grey), whereas the number of SVs covered by at least two discordant reads increases (blue + orange). Only 15% of the simulated SVs are tandem duplications. However, 65% to 75% of the SVs with “at least two discordant and two soft-clipped reads” are tandem duplications, which have more read support because of the increased copy numbers.

Read length can also impact the sensitivity and specificity of SV calling. Shorter reads result in less confident mappings of the soft-clipped reads, causing the SV callers to fail to detect a confident matching breakpoint if those reads are used to find the pairing breakpoint.
On the other hand, longer reads (e.g., PacBio reads) may be able to detect SVs covering long repeat regions. Such SVs may not be detected by SV callers using discordant short read pairs.

In summary, both SV and library properties can greatly influence the SV calling results. It is thus essential to choose the SV caller carefully for a specific dataset, and understanding how read mappers work greatly helps in this choice. Ideally, an SV caller that utilizes all types of anomalously mapped reads in both the SV discovery and verification processes should be used. However, when such an option is not available, better sensitivity may be achieved by processing libraries with smaller insert sizes with SV callers that utilize soft-clipped reads (e.g., CREST (J. Wang et al., 2011) and Socrates (Schröder et al., 2014)), whereas libraries with large insert sizes should be handled with SV callers that cluster discordant reads (e.g., DELLY (Rausch et al., 2012), Ulysses (Gillet-Markowska et al., 2015), VariationHunter (Hormozdiari et al., 2010) (which does not give base-pair resolution) and BreakDancer (Chen et al., 2009) (which does not give base-pair resolution but is sensitive for low-coverage data)). Further, to detect smaller SVs, soft-clipped-read-based SV callers may be chosen over discordant-read-based methods. If the exact breakpoint, rather than a rough SV region, is required, methods that use split-read mapping are required (e.g., Meerkat (L. Yang et al., 2013) and DELLY (Rausch et al., 2012)).

Given the lack of gold standard data to objectively assess the performance of different SV callers, caution must be taken when choosing SV callers. Diligent verification and validation of the predicted SV candidates are required to remove the noisy predictions caused by sequencing errors and poorly characterized regions of the genome (see Sections 3.3 and 3.4 for details). The methods recommended above are based on a theoretical analysis of NGS read mapping and of the read features used by the different methods, given the specific data. Besides the sensitivity and specificity concerns discussed, other considerations in selecting a particular SV caller include (1) breakpoint resolution (base-pair resolution vs. crude breakpoint regions); (2) processing speed (capability of multi-threading, multiple sample processing) and (3) the biological questions to answer (germline mutations vs. somatic mutations).

Since paired reads of different insert sizes produce different types of anomalously mapped reads, it might be advantageous to design libraries with different insert sizes experimentally and to combine them computationally to achieve maximal sensitivity. SV callers capable of integrating libraries with different insert sizes include DELLY (Rausch et al., 2012), Socrates (Schröder et al., 2014), etc.
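The trend that larger insert sizes yield more discordant (spanning) read pairs can be made plausible with a back-of-the-envelope model. The sketch below is not the simulation described above. It assumes fragments placed uniformly along the genome, takes "insert size" to mean the full fragment length including both reads, and counts a pair as spanning when the breakpoint falls in the unsequenced gap between the two reads. The function name and the `allele_fraction` parameter are hypothetical.

```python
def expected_spanning_pairs(coverage, insert_size, read_length, allele_fraction=1.0):
    """Expected number of read pairs whose inner gap contains a given breakpoint.

    For a genome of length G, the number of pairs is coverage * G / (2 * read_length).
    Each pair spans the breakpoint with probability (insert_size - 2 * read_length) / G.
    G therefore cancels out of the product.
    """
    inner_gap = max(insert_size - 2 * read_length, 0)
    return allele_fraction * coverage * inner_gap / (2 * read_length)
```

With 100bp reads at 2x coverage, this model gives zero expected spanning pairs for a 170bp library (the two reads overlap) and six for an 800bp library. An SV present in only half of the sequenced molecules would halve that figure, which links library design to the low-frequency problem discussed in Section 3.1.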


Figure 7. Insert sizes determine the types of anomalously mapped reads. The x-axis represents the insert size of the simulated libraries, ranging from 170bp to 800bp. The coverage of all libraries is 2x. The y-axis represents the number of SVs covered by the different types of anomalously mapped reads. Only SVs with at least two discordant and two soft-clipped reads (blue line) can be discovered by both paired-end read methods and soft-clipped read methods.

3.3. Sequencing Errors Impact SV Calling

Because of the limitations of NGS technologies, the reads from the sequencers contain errors. Two types of errors are possible: sequencing errors and artificial chimeric reads.

Sequencing errors impact both the sensitivity and the specificity of SV calling. They occur when bases in the reads cannot be determined with certainty (van Dijk, Auger, Jaszczyszyn, & Thermes, 2014). Two types of sequencing errors may occur: substitutions and indels (Yang, Chockalingam, & Aluru, 2013). Both may cause wrong mappings of the reads, possibly leading to anomalously mapped reads and eventually to false predictions. To filter anomalously mapped reads caused by sequencing errors, SV callers may use base quality scores, which represent the probability that the sequenced base is wrong. For example, Socrates (Schröder et al., 2014) requires the average base quality score of the soft-clipped reads to be at least 5. SoftSV (Bartenhagen & Dugas, 2015) uses 10 as the quality score cutoff for quality-trimming of the soft-clipped reads.
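A mean-base-quality filter of the kind described above can be sketched as follows. The example assumes Phred+33 encoded quality strings, as used in the SAM QUAL field, and the standard Phred relation between a score Q and the error probability, 10^(-Q/10). The function names are hypothetical. The default cutoff of 5 mirrors the Socrates threshold quoted above, but the code is not taken from that tool.

```python
def phred_scores(quality_string, offset=33):
    """Decode an ASCII quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]

def error_probability(q):
    """Probability that a base with Phred score q was called incorrectly."""
    return 10 ** (-q / 10)

def passes_mean_quality(quality_string, min_mean_q=5, offset=33):
    """True if the mean Phred score of the given bases reaches the cutoff."""
    scores = phred_scores(quality_string, offset)
    return bool(scores) and sum(scores) / len(scores) >= min_mean_q
```

In practice the quality string passed in would be the portion of QUAL that corresponds to the soft-clipped bases, so that only the clipped segment is assessed.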

Chimeric reads are artificial sequences formed during the polymerase chain reaction (PCR) amplification step of NGS experiments (Ashelford, Chuzhanova, Fry, Jones, & Weightman, 2005; Edgar, Haas, Clemente, Quince, & Knight, 2011). These artificial chimeric reads may be misinterpreted as being formed by SVs (Edgar et al., 2011). Two approaches can mitigate the impact of chimeric reads. The first is to combine different libraries of the same sample, preferably prepared with different sequencing parameters. Since the probability of having the same artificial chimeric reads in two libraries is expected to be low, novel sequences present in only one library can be treated as artificial chimeric reads. However, such filtering might remove true rare SVs, which may appear in only one library because of their low frequency. The second, for when a matching library is not available, is to use chimeric read removal tools like ChimeraChecker (Nilsson et al., 2010), ChimeraSlayer (Haas et al., 2011), Perseus (Quince, Lanzen, Davenport, & Turnbaugh, 2011) and UCHIME (Edgar et al., 2011) to remove the chimeric reads before SV detection. Existing SV callers mostly do not explicitly address the issue of chimeric reads. Thus, it is crucial to filter such noise before passing the data to the SV callers, particularly if the experimental protocols used tend to produce more chimeric reads.

3.4. Reference Genome and Sequence Context Impact SV Calling

Mapping-based methods by design require a complete and accurately assembled reference genome. Otherwise, poorly characterized regions will not have a sufficient number of reads mapped for the SV callers to make confident predictions. Since the initial sequencing of the human genome (Lander et al., 2001), great efforts have been made to close the gaps in the human genome (International Human Genome Sequencing, 2004). However, the human genome is still not complete (Altemose, Miga, Maggioni, & Willard, 2014). There are still poorly characterized regions, such as pericentromeric and telomeric regions, where the sequences are of low complexity or not available (represented by a stretch of Ns).

Further, the sequence context of the reference genome impacts the sensitivity of SV calling. Certain genomic regions, e.g., genomic repeats and GC-rich regions, are known to be difficult to sequence (Kieleczawa, 2006). Those regions are also known to be difficult to map. According to the mechanisms of SV formation (Pang, Migita, Macdonald, Feuk, & Scherer, 2013; L. Yang et al., 2013), SVs tend to occur more frequently in repetitive regions of the genome. To sequence those regions confidently, NGS technologies need to be improved to produce long reads with fewer or comparable sequencing errors.
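Regions of the reference that are represented by stretches of Ns can be located directly from the reference sequence, for example to exclude SV candidates that fall in or next to them. The sketch below is a minimal illustration. The function name `n_runs` and the `min_len` parameter are assumptions, and the returned coordinates are 0-based, half-open offsets into the sequence string.

```python
import re

def n_runs(sequence, min_len=1):
    """Return (start, end) offsets of runs of N (or n) of at least min_len bases."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", sequence)
        if m.end() - m.start() >= min_len
    ]
```

For a whole chromosome the same scan would be applied to the sequence read from a FASTA file, and the resulting intervals could be combined with repeat annotations when filtering candidates.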


To summarize, SV detection is a complex process whose sensitivity and specificity greatly depend on the SV properties, the library properties, platform-specific sequencing errors and the sequence context of the reference genome. In order to detect SVs accurately and sensitively, the necessary steps must be taken to remove the noise resulting from library preparation, mapping artifacts, etc. Depending on the library properties, it is also important to select SV callers that use the types of anomalously mapped reads available in the libraries.

4. SV Detection by Integrating Multiple SV Callers

It is possible to combine predictions from different SV callers to achieve better sensitivity (Lin, Smit, Bonnema, Sanchez-Perez, & de Ridder, 2014), since different SV callers often use different types of anomalously mapped reads or different techniques to call SVs. For example, SVMerge (Wong et al., 2010) combines prediction results from BreakDancer (Chen et al., 2009), Pindel (Ye, Schulz, Long, Apweiler, & Ning, 2009), RDXplorer (Yoon, Xuan, Makarov, Ye, & Sebat, 2009), SECluster and RetroSeq (Keane, Wong, & Adams, 2013) to obtain the SV candidates, which are combined and verified using local assembly with Velvet (Zerbino & Birney, 2008) and ABySS (Simpson et al., 2009) to refine the breakpoints and reduce the false discovery rate. MetaSV (Mohiyuddin et al., 2015) is another SV caller that combines a suite of orthogonal SV callers, including BreakSeq (Lam et al., 2010), BreakDancer (Chen et al., 2009), Pindel (Ye et al., 2009) and CNVnator (Abyzov, Urban, Snyder, & Gerstein, 2011). It further improves insertion calling by analyzing the soft-clipped reads. As in SVMerge (Wong et al., 2010), local assembly is used to refine the breakpoints, here with SPAdes (Bankevich et al., 2012), which can use paired-end information. MetaSV does not simply aggregate SV predictions from different tools. Instead, it gives priority to methods that are known to be more accurate. For example, it trusts split-read methods over methods that cluster discordant paired reads.

Though integrating multiple SV callers can achieve better sensitivity, the combined sensitivity can at best equal the collective sensitivity of the individual SV callers. Thus, it is more advantageous to integrate SV callers that utilize different types of anomalously mapped reads, which can complement one another. Further, even if we integrate SV callers using different types of anomalously mapped reads (for example, combining a soft-clipped-read-based SV caller with a discordant-read-pair-based SV caller), we may not obtain the optimal set of predictions. This is because there are SVs discoverable by none of those methods (Figure 7). For example, an SV supported by one discordant read pair and two soft-clipped reads at one breakpoint will not be discovered by combining CREST (J. Wang et al., 2011) and BreakDancer (Chen et al., 2009), because CREST by default requires at least three soft-clipped reads at both breakpoints and BreakDancer requires at least two discordant read pairs to form a cluster. Hence, integrating different SV callers that complement each other is crucial. More importantly, individual SV callers should utilize information from all the anomalously mapped reads.
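The example above can be written out as a toy model. The two single-signal functions below are simplified stand-ins that encode only the thresholds quoted in the text (at least three soft-clipped reads, at least two discordant read pairs). They are not the actual decision logic of CREST or BreakDancer. The pooled caller and its `min_total=3` threshold are purely hypothetical and serve only to show how a caller that uses all anomalously mapped reads could behave differently.

```python
def soft_clip_caller(soft_clipped, min_soft_clipped=3):
    """Stand-in for a caller that needs enough soft-clipped reads."""
    return soft_clipped >= min_soft_clipped

def read_pair_caller(discordant_pairs, min_pairs=2):
    """Stand-in for a caller that needs enough discordant read pairs."""
    return discordant_pairs >= min_pairs

def union_of_callers(discordant_pairs, soft_clipped):
    """Merge the two call sets: an SV is reported if either caller finds it."""
    return soft_clip_caller(soft_clipped) or read_pair_caller(discordant_pairs)

def pooled_evidence_caller(discordant_pairs, soft_clipped, min_total=3):
    """Hypothetical caller that counts both evidence types together."""
    return discordant_pairs + soft_clipped >= min_total
```

For an SV with one discordant pair and two soft-clipped reads, `union_of_callers(1, 2)` is false while `pooled_evidence_caller(1, 2)` is true. Merging call sets cannot recover an SV that each caller individually misses.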

5. SV Detection by Combining Multiple Samples

With reduced sequencing costs, more and more NGS libraries are routinely sequenced. Recently, single-cell sequencing has become possible, opening up opportunities for analyzing SVs at the single-cell level. Simultaneous analysis and comparison of multiple samples require more computational power, so SV callers need to be designed to process the data efficiently. Given that SV calling itself is a challenging task, it is even more challenging to compare multiple samples for population-scale studies.

Combining and integrating different samples may improve the SV prediction. For example, if we sequence the same sample using paired reads of different insert sizes, different types of anomalously mapped reads are produced in the mapping files. This enables SV callers that use multiple types of anomalously mapped reads to detect the SVs more effectively, improving the sensitivity. Specificity can be improved by comparing different samples to eliminate possible false predictions resulting from systematic noisy mappings. Integrating replicated libraries of the same sample can help to reduce the number of false predictions caused by chimeric reads or sequencing errors.

Most existing SV callers are designed to analyze one sample or to compare at most two samples (for identifying somatic SVs). Some methods, like LUMPY (Layer et al., 2014) and Hydra-Multi (Lindberg, Hall, & Quinlan, 2015), can analyze multiple samples. However, both methods use only direct-case (see Section 2.2.1) mapped reads to find SV candidates. Their performance may be compromised since they do not use indirect-case mapped reads. Thus, it is important to develop SV callers that utilize all anomalously mapped reads while being capable of integrating multiple samples.

Because whole genome sequencing has become the norm, more and more libraries are being accumulated. SV callers that can integrate and analyze large numbers of samples are required. It is also important that SV callers can retrospectively re-analyze different cohorts of sequences. Furthermore, pipelines that can run in a distributed fashion across different computing environments, and that use distributed computing systems effectively, are also important.

The speed of an SV caller greatly impacts its usability and practicality. Depending on their design, existing SV callers usually make trade-offs between sensitivity and processing speed. Computational efficiency is also greatly affected by the techniques used. For example, given the same number of SV candidates, assembly-based methods (e.g., CREST (J. Wang et al., 2011)) generally tend to run slower than clustering-based methods (e.g., BreakDancer (Chen et al., 2009)). To improve speed, SV callers like Socrates (Schröder et al., 2014) use a generic read mapper to remap the soft-clipped reads, which is much more efficient than mapping the reads or consensus sequences using BLAST (local alignment in Meerkat) (L. Yang et al., 2013) or BLAT (CREST) (J. Wang et al., 2011). To fully utilize the computational resources, most SV callers are designed to run in parallel, which is a necessary feature for any SV caller that aims to process large datasets. Speed is also affected by the features used by the SV caller: with more features considered, SV callers tend to be slower but more sensitive. Thus, the speed of SV callers can only be assessed together with their sensitivity and specificity.
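Comparing call sets between samples, as discussed in this section, usually comes down to deciding whether two calls describe the same event. The sketch below treats two calls as the same event when they share chromosome and SV type and their positions differ by at most a tolerance. The `(chrom, pos, sv_type)` tuple representation, the function names and the default tolerance of 100bp are assumptions for illustration.

```python
def same_event(a, b, tolerance=100):
    """True if two (chrom, pos, sv_type) calls plausibly describe one SV."""
    return a[0] == b[0] and a[2] == b[2] and abs(a[1] - b[1]) <= tolerance

def calls_absent_from(calls, reference_calls, tolerance=100):
    """Calls with no match in reference_calls (e.g. tumor calls not in normal)."""
    return [c for c in calls
            if not any(same_event(c, r, tolerance) for r in reference_calls)]

def calls_shared_by(calls_a, calls_b, tolerance=100):
    """Calls from calls_a that are reproduced in calls_b (e.g. replicates)."""
    return [c for c in calls_a
            if any(same_event(c, r, tolerance) for r in calls_b)]
```

`calls_absent_from` corresponds to the two-sample somatic comparison, and `calls_shared_by` to requiring support from a replicate library. Both are quadratic in the number of calls, so a population-scale tool would sort or index the calls first.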

6. Conclusions

In this paper, we have conducted a comprehensive survey of existing SV callers. We analyzed the SV and library properties that can impact SV calling performance. Based on our simulation study, we emphasize that, in order to achieve better sensitivity, libraries with different insert sizes should be analyzed using different tools, especially for low-coverage libraries or rare SVs. We also discussed the impacts of sequencing errors, reference genomes and sequence context on SV calling. We generalized SV detection into a four-stage process and analyzed the design options of the SV callers in each stage, highlighting their pros and cons. Based on the types of anomalously mapped reads and the techniques they use, we studied their advantages and disadvantages in SV discovery.

The evolution of sequencing technologies requires SV calling to be further improved. Both the reduction of sequencing cost and the production of longer reads call for new SV calling methods. Lower sequencing cost will lead to larger libraries, so we need to improve the efficiency of the current SV calling algorithms. Sequencing technologies now produce longer reads, which bring new challenges and opportunities. The error profiles of longer reads are different from those of shorter reads, making accurate mapping of the reads challenging. Further, smaller SVs may be covered within one read, making SV detection easier. However, this also requires better read mappers that can accurately map the reads to the correct locus. SV callers should also be improved to handle such long reads, where the entire SV could be contained in one read.

Efforts have been made to incorporate multiple data types from the same sample to properly assess the sensitivity and specificity of the current methods. Jiang et al. (Jiang, Turinsky, & Brudno, 2015) compared the indels called from NGS data and Sanger sequencing data to estimate the number of indels in the sample. English et al. (English et al., 2015) assessed the SV landscape of one individual, profiled using different technologies (aCGH, short reads and long reads), to assess the false discoveries by exploiting the complementary advantages of the different technologies. Combining the different data may help to reduce the number of false predictions. These efforts may finally lead to gold standard data for benchmarking the performance of SV calling methods.

Recent developments also try to incorporate biological knowledge to better target regions of the genome, for example, for the study of fusion genes. Such targeted regions enable more efficient processing of the data and make daily use in clinical settings for personalized treatment possible.

SV detection has largely focused on the human genome. However, large and complex SVs are also observed in plant genomes (Saxena, Edwards, & Varshney, 2014), where they are crucial for agricultural studies. How to integrate prior knowledge of human and plant biological processes is another challenging task. A better understanding of the SV mechanisms can make SV prediction more accurate. SV detection methods will continue to evolve with the improvements of sequencing technologies, as already seen over the past decade.

Acknowledgments

We would like to thank the anonymous reviewers, who gave very detailed advice on improving the manuscript. Funding: This research is supported in part by MOE's AcRF Tier 2 funding R-252-000-444-112.


Table 1. Classification of SV and CNV Callers

Columns: SV Callers; SV Types (CNV, INS, DEL, DUP, INV, TRA); Data; Anomalously Mapped Reads Used in the Discovery Stage (RD, SC, PR, OEA, UM); Anomalously Mapped Reads Used in the Validation Stage (RD, SC, PR, OEA, UM); Techniques (CL, SA, CA, ST); BP Resolution (Y/N); References.

[Table body not recoverable: the per-caller marks for SV types, data, mapping features, techniques and BP resolution are omitted. The callers and their references are listed below.]

BIC-seq (Xi et al., 2011)
cn.MOPS (Klambauer et al., 2012)
cnD (Simpson, McIntyre, Adams, & Durbin, 2010)
CNVeM (Z. Wang, Hormozdiari, Yang, Halperin, & Eskin, 2013)
CNVnator (Abyzov et al., 2011)
CNV-seq (Xie & Tammi, 2009)
JointSLM (Magi, Benelli, Yoon, Roviello, & Torricelli, 2011)
RDXplorer (Yoon et al., 2009)
SegSeq (Chiang et al., 2009)
CNVer (Medvedev, Fiume, Dzamba, Smith, & Brudno, 2010)
LUMPY (Layer et al., 2014)
MetaSV (Mohiyuddin et al., 2015)
SVM2 (Chiara, Pesole, & Horner, 2012)
Meerkat (L. Yang et al., 2013)
Scalpel (Narzisi et al., 2014)
SVMerge (Wong et al., 2010)
SoftSV (Bartenhagen & Dugas, 2015)
BreaKmer (Abo et al., 2014)
ClipCrop (Suzuki et al., 2011)
CREST (J. Wang et al., 2011)
Gustaf (Trappe et al., 2014)
Breakpointer (Sun et al., 2012)
Socrates (Schröder et al., 2014)
Bellerophon (Hayes & Li, 2013)
BreakDancer (Chen et al., 2009)
CLEVER (Marschall et al., 2012)
DELLY (Rausch et al., 2012)
FACTERA (Newman et al., 2014)
GASV (S. Sindi et al., 2009)
GASVPro (S. S. Sindi et al., 2012)
GenomeSTRiP (Handsaker, Korn, Nemesh, & McCarroll, 2011)
HYDRA (Quinlan et al., 2010)
HYDRA-Multi (Lindberg et al., 2015)
inGAP-SV (Qi & Zhao, 2011)
MoDIL (Lee, Hormozdiari, Alkan, & Brudno, 2009)
PEMer (Korbel et al., 2009)
PeSV-Fisher (Escaramis et al., 2013)
PRISM (Jiang et al., 2012)
RetroSeq (Keane et al., 2013)
SVDetect (Zeitouni et al., 2010)
SVMiner (Hayes et al., 2012)
Ulysses (Gillet-Markowska et al., 2015)
VariationHunter (Hormozdiari et al., 2010)
NovelSeq (Hajirasouliha et al., 2010)
PINDEL (Ye et al., 2009)
SLOPE (Abel et al., 2010)
SOAPindel (S. Li et al., 2013)
Splitread (Karakoc et al., 2012)
BreakSeq (Lam et al., 2010)
SMUFIN (Moncunill et al., 2014)

Abbreviations.
SV Types: CNV – copy number variation; SV – structural variation; INS – insertion; DEL – deletion; DUP – duplication; INV – inversion; TRA – translocation.
Data: PE – paired end; SE – single end; MP – mate pair.
Mapping Features Used: RD – read depth; SC – soft-clipped; PR – paired reads; OEA – one-end-anchored; UM – unmapped.
Techniques: CL – clustering; SA – split-reads alignment; CA – contig assembly; ST – statistical testing.
BP Resolution: base-pair resolution.

References

Abel, H. J., Duncavage, E. J., Becker, N., Armstrong, J. R., Magrini, V. J., & Pfeifer, J. D. (2010). SLOPE: a quick and accurate method for locating non-SNP structural variation from targeted next-generation sequence data. Bioinformatics, 26(21), 2684-2688. doi: 10.1093/bioinformatics/btq528
Abo, R. P., Ducar, M., Garcia, E. P., Thorner, A. R., Rojas-Rudilla, V., Lin, L., . . . MacConaill, L. E. (2014). BreaKmer: detection of structural variation in targeted massively parallel sequencing data using kmers. Nucleic Acids Res. doi: 10.1093/nar/gku1211
Abyzov, A., Urban, A. E., Snyder, M., & Gerstein, M. (2011). CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res, 21(6), 974-984. doi: 10.1101/gr.114876.110
Alkan, C., Kidd, J. M., Marques-Bonet, T., Aksay, G., Antonacci, F., Hormozdiari, F., . . . Eichler, E. E. (2009). Personalized copy number and segmental duplication maps using next-generation sequencing. Nat Genet, 41(10), 1061-1067. doi: 10.1038/ng.437
Altemose, N., Miga, K. H., Maggioni, M., & Willard, H. F. (2014). Genomic characterization of large heterochromatic gaps in the human genome assembly. PLoS Comput Biol, 10(5), e1003628. doi: 10.1371/journal.pcbi.1003628
Ashelford, K. E., Chuzhanova, N. A., Fry, J. C., Jones, A. J., & Weightman, A. J. (2005). At least 1 in 20 16S rRNA sequence records currently held in public repositories is estimated to contain substantial anomalies. Appl Environ Microbiol, 71(12), 7724-7736. doi: 10.1128/AEM.71.12.7724-7736.2005
Bankevich, A., Nurk, S., Antipov, D., Gurevich, A. A., Dvorkin, M., Kulikov, A. S., . . . Pevzner, P. A. (2012). SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol, 19(5), 455-477. doi: 10.1089/cmb.2012.0021
Bartenhagen, C., & Dugas, M. (2013). RSVSim: an R/Bioconductor package for the simulation of structural variations. Bioinformatics, 29(13), 1679-1681. doi: 10.1093/bioinformatics/btt198
Bartenhagen, C., & Dugas, M. (2015). Robust and exact structural variation detection with paired-end and soft-clipped alignments: SoftSV compared with eight algorithms. Brief Bioinform. doi: 10.1093/bib/bbv028
Bressler, R., Lin, J., Eakin, A., Robinson, T., Kreisberg, R., Rovira, H., . . . Shmulevich, I. (2012). Fastbreak: a tool for analysis and visualization of structural variations in genomic data. EURASIP J Bioinform Syst Biol, 2012(1), 15. doi: 10.1186/1687-4153-2012-15
Bunting, S. F., & Nussenzweig, A. (2013). End-joining, translocations and cancer. Nat Rev Cancer, 13(7), 443-454. doi: 10.1038/nrc3537
Cancer Genome Atlas Research Network, Weinstein, J. N., Collisson, E. A., Mills, G. B., Shaw, K. R., Ozenberger, B. A., . . . Stuart, J. M. (2013). The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet, 45(10), 1113-1120. doi: 10.1038/ng.2764
Chen, K., Chen, L., Fan, X., Wallis, J., Ding, L., & Weinstock, G. (2014). TIGRA: a targeted iterative graph routing assembler for breakpoint assembly. Genome Res, 24(2), 310-317. doi: 10.1101/gr.162883.113
Chen, K., Wallis, J. W., McLellan, M. D., Larson, D. E., Kalicki, J. M., Pohl, C. S., . . . Mardis, E. R. (2009). BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods, 6(9), 677-681. doi: 10.1038/nmeth.1363
Chiang, D. Y., Getz, G., Jaffe, D. B., O'Kelly, M. J., Zhao, X., Carter, S. L., . . . Lander, E. S. (2009). High-resolution mapping of copy-number alterations with massively parallel sequencing. Nat Methods, 6(1), 99-103. doi: 10.1038/nmeth.1276
Chiara, M., Pesole, G., & Horner, D. S. (2012). SVM(2): an improved paired-end-based tool for the detection of small genomic structural variations using high-throughput single-genome resequencing data. Nucleic Acids Res, 40(18), e145. doi: 10.1093/nar/gks606
Cibulskis, K., Lawrence, M. S., Carter, S. L., Sivachenko, A., Jaffe, D., Sougnez, C., . . . Getz, G. (2013). Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol, 31(3), 213-219. doi: 10.1038/nbt.2514
DePristo, M. A., Banks, E., Poplin, R., Garimella, K. V., Maguire, J. R., Hartl, C., . . . Daly, M. J. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet, 43(5), 491-498. doi: 10.1038/ng.806
Edgar, R. C., Haas, B. J., Clemente, J. C., Quince, C., & Knight, R. (2011). UCHIME improves sensitivity and speed of chimera detection. Bioinformatics, 27(16), 2194-2200. doi: 10.1093/bioinformatics/btr381
English, A. C., Salerno, W. J., Hampton, O. A., Gonzaga-Jauregui, C., Ambreth, S., Ritter, D. I., . . . Gibbs, R. A. (2015). Assessing structural variation in a personal genome-towards a human reference diploid genome. BMC Genomics, 16, 286. doi: 10.1186/s12864-015-1479-3
Escaramis, G., Tornador, C., Bassaganyas, L., Rabionet, R., Tubio, J. M., Martinez-Fundichely, A., . . . Estivill, X. (2013). PeSV-Fisher: identification of somatic and non-somatic structural variants using next generation sequencing data. PLoS One, 8(5), e63377. doi: 10.1371/journal.pone.0063377
Faust, G. G., & Hall, I. M. (2012). YAHA: fast and flexible long-read alignment with optimal breakpoint detection. Bioinformatics, 28(19), 2417-2424. doi: 10.1093/bioinformatics/bts456
Feuk, L., Carson, A. R., & Scherer, S. W. (2006). Structural variation in the human genome. Nat Rev Genet, 7(2), 85-97. doi: 10.1038/nrg1767
1000 Genomes Project Consortium, Abecasis, G. R., Auton, A., Brooks, L. D., DePristo, M. A., Durbin, R. M., . . . McVean, G. A. (2012). An integrated map of genetic variation from 1,092 human genomes. Nature, 491(7422), 56-65. doi: 10.1038/nature11632
Gillet-Markowska, A., Richard, H., Fischer, G., & Lafontaine, I. (2015). Ulysses: accurate detection of low-frequency structural variations in large insert-size sequencing libraries. Bioinformatics, 31(6), 801-808. doi: 10.1093/bioinformatics/btu730
Haas, B. J., Gevers, D., Earl, A. M., Feldgarden, M., Ward, D. V., Giannoukos, G., . . . Birren, B. W. (2011). Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Genome Res, 21(3), 494-504. doi: 10.1101/gr.112730.110
Hajirasouliha, I., Hormozdiari, F., Alkan, C., Kidd, J. M., Birol, I., Eichler, E. E., & Sahinalp, S. C. (2010). Detection and characterization of novel sequence insertions using paired-end next-generation sequencing. Bioinformatics, 26(10), 1277-1283. doi: 10.1093/bioinformatics/btq152
Halper-Stromberg, E., Steranka, J., Burns, K. H., Sabunciyan, S., & Irizarry, R. A. (2014). Visualization and probability-based scoring of structural variants within repetitive sequences. Bioinformatics, 30(11), 1514-1521. doi: 10.1093/bioinformatics/btu054
Handsaker, R. E., Korn, J. M., Nemesh, J., & McCarroll, S. A. (2011). Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nat Genet, 43(3), 269-276. doi: 10.1038/ng.768
Hayes, M., & Li, J. (2013). Bellerophon: a hybrid method for detecting interchromosomal rearrangements at base pair resolution using next-generation sequencing data. BMC Bioinformatics, 14 Suppl 5, S6. doi: 10.1186/1471-2105-14-S5-S6
Hayes, M., Pyon, Y. S., & Li, J. (2012). A model-based clustering method for genomic structural variant prediction and genotyping using paired-end sequencing data. PLoS One, 7(12), e52881. doi: 10.1371/journal.pone.0052881
Hormozdiari, F., Hajirasouliha, I., Dao, P., Hach, F., Yorukoglu, D., Alkan, C., . . . Sahinalp, S. C. (2010). Next-generation VariationHunter: combinatorial algorithms for transposon insertion discovery. Bioinformatics, 26(12), i350-357. doi: 10.1093/bioinformatics/btq216
Hu, X., Yuan, J., Shi, Y., Lu, J., Liu, B., Li, Z., . . . Fan, W. (2012). pIRS: Profile-based Illumina pair-end reads simulator. Bioinformatics, 28(11), 1533-1535. doi: 10.1093/bioinformatics/bts187
International Human Genome Sequencing Consortium. (2004). Finishing the euchromatic sequence of the human genome. Nature, 431(7011), 931-945. doi: 10.1038/nature03001
Iqbal, Z., Caccamo, M., Turner, I., Flicek, P., & McVean, G. (2012). De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat Genet, 44(2), 226-232. doi: 10.1038/ng.1028
Jiang, Y., Turinsky, A. L., & Brudno, M. (2015). The missing indels: an estimate of indel variation in a human genome and analysis of factors that impede detection. Nucleic Acids Res, 43(15), 7217-7228. doi: 10.1093/nar/gkv677
Jiang, Y., Wang, Y., & Brudno, M. (2012). PRISM: pair-read informed split-read mapping for base-pair level detection of insertion, deletion and structural variants. Bioinformatics, 28(20), 2576-2583. doi: 10.1093/bioinformatics/bts484
Joly, Y., Dove, E. S., Knoppers, B. M., Bobrow, M., & Chalmers, D. (2012). Data sharing in the post-genomic world: the experience of the International Cancer Genome Consortium (ICGC) Data Access Compliance Office (DACO). PLoS Comput Biol, 8(7), e1002549. doi: 10.1371/journal.pcbi.1002549
Karakoc, E., Alkan, C., O'Roak, B. J., Dennis, M. Y., Vives, L., Mark, K., . . . Eichler, E. E. (2012). Detection of structural variants and indels within exome data. Nat Methods, 9(2), 176-178. doi: 10.1038/nmeth.1810
Keane, T. M., Wong, K., & Adams, D. J. (2013). RetroSeq: transposable element discovery from next-generation sequencing data. Bioinformatics, 29(3), 389-390. doi: 10.1093/bioinformatics/bts697
Kehr, B., Weese, D., & Reinert, K. (2011). STELLAR: fast and exact local alignments. BMC Bioinformatics, 12 Suppl 9, S15. doi: 10.1186/1471-2105-12-S9-S15
Kent, W. J., Sugnet, C. W., Furey, T. S., Roskin, K. M., Pringle, T. H., Zahler, A. M., & Haussler, D. (2002). The human genome browser at UCSC. Genome Res, 12(6), 996-1006. doi: 10.1101/gr.229102. Article published online before print in May 2002
Kieleczawa, J. (2006). Fundamentals of sequencing of difficult templates--an overview. J Biomol Tech, 17(3), 207-217.
Klambauer, G., Schwarzbauer, K., Mayr, A., Clevert, D. A., Mitterecker, A., Bodenhofer, U., & Hochreiter, S. (2012). cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate. Nucleic Acids Res, 40(9), e69. doi: 10.1093/nar/gks003
Korbel, J. O., Abyzov, A., Mu, X. J., Carriero, N., Cayting, P., Zhang, Z., . . . Gerstein, M. B. (2009). PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data. Genome Biol, 10(2), R23. doi: 10.1186/gb-2009-10-2-r23
Krzywinski, M., Schein, J., Birol, I., Connors, J., Gascoyne, R., Horsman, D., . . . Marra, M. A. (2009). Circos: an information aesthetic for comparative genomics. Genome Res, 19(9), 1639-1645. doi: 10.1101/gr.092759.109
Lam, H. Y., Mu, X. J., Stutz, A. M., Tanzer, A., Cayting, P. D., Snyder, M., . . . Gerstein, M. B. (2010). Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library. Nat Biotechnol, 28(1), 47-55. doi: 10.1038/nbt.1600
Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., . . . International Human Genome Sequencing Consortium. (2001). Initial sequencing and analysis of the human genome. Nature, 409(6822), 860-921. doi: 10.1038/35057062
Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nat Methods, 9(4), 357-359. doi: 10.1038/nmeth.1923
Layer, R. M., Chiang, C., Quinlan, A. R., & Hall, I. M. (2014). LUMPY: a probabilistic framework for structural variant discovery. Genome Biol, 15(6), R84. doi: 10.1186/gb-2014-15-6-r84
Lee, S., Hormozdiari, F., Alkan, C., & Brudno, M. (2009). MoDIL: detecting small indels from clone-end sequencing with mixtures of distributions. Nat Methods, 6(7), 473-474. doi: 10.1038/nmeth.f.256
Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14), 1754-1760. doi: 10.1093/bioinformatics/btp324
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., . . . 1000 Genome Project Data Processing Subgroup. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078-2079. doi: 10.1093/bioinformatics/btp352
Li, H., & Homer, N. (2010). A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform, 11(5), 473-483. doi: 10.1093/bib/bbq015
Li, S., Li, R., Li, H., Lu, J., Li, Y., Bolund, L., . . . Wang, J. (2013). SOAPindel: efficient identification of indels from short paired reads. Genome Res, 23(1), 195-200. doi: 10.1101/gr.132480.111
Lim, J. Q., Tennakoon, C., Guan, P., & Sung, W. K. (2015). BatAlign: an incremental method for accurate alignment of sequencing reads. Nucleic Acids Res. doi: 10.1093/nar/gkv533
Lin, K., Smit, S., Bonnema, G., Sanchez-Perez, G., & de Ridder, D. (2014). Making the difference: integrating structural variation detection tools. Brief Bioinform. doi: 10.1093/bib/bbu047
Lindberg, M. R., Hall, I. M., & Quinlan, A. R. (2015). Population-based structural variation discovery with Hydra-Multi. Bioinformatics, 31(8), 1286-1289. doi: 10.1093/bioinformatics/btu771
Magi, A., Benelli, M., Yoon, S., Roviello, F., & Torricelli, F. (2011). Detecting common copy number variants in high-throughput sequencing data by using JointSLM algorithm. Nucleic Acids Res, 39(10), e65. doi: 10.1093/nar/gkr068
Marschall, T., Costa, I. G., Canzar, S., Bauer, M., Klau, G. W., Schliep, A., & Schonhuth, A. (2012). CLEVER: clique-enumerating variant finder. Bioinformatics, 28(22), 2875-2882. doi: 10.1093/bioinformatics/bts566
McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., . . . DePristo, M. A. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res, 20(9), 1297-1303. doi: 10.1101/gr.107524.110
Medvedev, P., Fiume, M., Dzamba, M., Smith, T., & Brudno, M. (2010). Detecting copy number variation with mated short reads. Genome Res, 20(11), 1613-1622. doi: 10.1101/gr.106344.110
Medvedev, P., Stanciu, M., & Brudno, M. (2009). Computational methods for discovering structural variation with next-generation sequencing. Nat Methods, 6(11 Suppl), S13-20. doi: 10.1038/nmeth.1374
Miller, J. R., Koren, S., & Sutton, G. (2010). Assembly algorithms for next-generation sequencing data. Genomics, 95(6), 315-327. doi: 10.1016/j.ygeno.2010.03.001
Mohiyuddin, M., Mu, J. C., Li, J., Bani Asadi, N., Gerstein, M. B., Abyzov, A., . . . Lam, H. Y. (2015). MetaSV: an accurate and integrative structural-variant caller for next generation sequencing. Bioinformatics. doi: 10.1093/bioinformatics/btv204
Moncunill, V., Gonzalez, S., Bea, S., Andrieux, L. O., Salaverria, I., Royo, C., . . . Torrents, D. (2014). Comprehensive characterization of complex structural variations in cancer by directly comparing genome sequence reads. Nat Biotechnol, 32(11), 1106-1112. doi: 10.1038/nbt.3027
Naquin, D., d'Aubenton-Carafa, Y., Thermes, C., & Silvain, M. (2014). CIRCUS: a package for Circos display of structural genome variations from paired-end and mate-pair sequencing data. BMC Bioinformatics, 15, 198. doi: 10.1186/1471-2105-15-198
Narzisi, G., O'Rawe, J. A., Iossifov, I., Fang, H., Lee, Y. H., Wang, Z., . . . Schatz, M. C. (2014). Accurate de novo and transmitted indel detection in exome-capture data using microassembly. Nat Methods, 11(10), 1033-1036. doi: 10.1038/nmeth.3069
Newman, A. M., Bratman, S. V., Stehr, H., Lee, L. J., Liu, C. L., Diehn, M., & Alizadeh, A. A. (2014). FACTERA: a practical method for the discovery of genomic rearrangements at breakpoint resolution. Bioinformatics, 30(23), 3390-3393. doi: 10.1093/bioinformatics/btu549
Nilsson, R. H., Abarenkov, K., Veldre, V., Nylinder, S., De Wit, P., Brosche, S., . . . Kristiansson, E. (2010). An open source chimera checker for the fungal ITS region. Mol Ecol Resour, 10(6), 1076-1081. doi: 10.1111/j.1755-0998.2010.02850.x
Nowell, P. C., & Hungerford, D. A. (1960). Chromosome studies on normal and leukemic human leukocytes. J Natl Cancer Inst, 25, 85-109.
O'Brien, T. M., Ritz, A. M., Raphael, B. J., & Laidlaw, D. H. (2010). Gremlin: an interactive visualization model for analyzing genomic rearrangements. IEEE Trans Vis Comput Graph, 16(6), 918-926. doi: 10.1109/TVCG.2010.163
Pang, A. W., Migita, O., Macdonald, J. R., Feuk, L., & Scherer, S. W. (2013). Mechanisms of formation of structural variation in a fully sequenced human genome. Hum Mutat, 34(2), 345-354. doi: 10.1002/humu.22240
Pavlopoulos, G. A., Oulas, A., Iacucci, E., Sifrim, A., Moreau, Y., Schneider, R., . . . Iliopoulos, I. (2013). Unraveling genomic variation from next generation sequencing data. BioData Min, 6(1), 13. doi: 10.1186/1756-0381-6-13
Puente, X. S., Pinyol, M., Quesada, V., Conde, L., Ordonez, G. R., Villamor, N., . . . Campo, E. (2011). Whole-genome sequencing identifies recurrent mutations in chronic lymphocytic leukaemia. Nature, 475(7354), 101-105. doi: 10.1038/nature10113
Qi, J., & Zhao, F. (2011). inGAP-sv: a novel scheme to identify and visualize structural variation from paired end mapping data. Nucleic Acids Res, 39(Web Server issue), W567-575. doi: 10.1093/nar/gkr506
Quince, C., Lanzen, A., Davenport, R. J., & Turnbaugh, P. J. (2011). Removing noise from pyrosequenced amplicons. BMC Bioinformatics, 12, 38. doi: 10.1186/1471-2105-12-38
Quinlan, A. R., Clark, R. A., Sokolova, S., Leibowitz, M. L., Zhang, Y., Hurles, M. E., . . . Hall, I. M. (2010). Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Res, 20(5), 623-635. doi: 10.1101/gr.102970.109
Raphael, B. J. (2012). Chapter 6: Structural variation and medical genomics. PLoS Comput Biol, 8(12), e1002821. doi: 10.1371/journal.pcbi.1002821
Rausch, T., Zichner, T., Schlattl, A., Stutz, A. M., Benes, V., & Korbel, J. O. (2012). DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics, 28(18), i333-i339. doi: 10.1093/bioinformatics/bts378
Robinson, J. T., Thorvaldsdottir, H., Winckler, W., Guttman, M., Lander, E. S., Getz, G., & Mesirov, J. P. (2011). Integrative genomics viewer. Nat Biotechnol, 29(1), 24-26. doi: 10.1038/nbt.1754
Sante, T., Vergult, S., Volders, P. J., Kloosterman, W. P., Trooskens, G., De Preter, K., . . . Menten, B. (2014). ViVar: a comprehensive platform for the analysis and visualization of structural genomic variation. PLoS One, 9(12), e113800. doi: 10.1371/journal.pone.0113800
Sattler, M., & Griffin, J. D. (2001). Mechanisms of transformation by the BCR/ABL oncogene. Int J Hematol, 73(3), 278-291.
Saxena, R. K., Edwards, D., & Varshney, R. K. (2014). Structural variations in plant genomes. Brief Funct Genomics, 13(4), 296-307. doi: 10.1093/bfgp/elu016
Sboner, A., Habegger, L., Pflueger, D., Terry, S., Chen, D. Z., Rozowsky, J. S., . . . Gerstein, M. B. (2010). FusionSeq: a modular framework for finding gene fusions by analyzing paired-end RNA-sequencing data. Genome Biol, 11(10), R104. doi: 10.1186/gb-2010-11-10-r104
Schröder, J., Hsu, A., Boyle, S. E., Macintyre, G., Cmero, M., Tothill, R. W., . . . Papenfuss, A. T. (2014). Socrates: identification of genomic rearrangements in tumour genomes by re-aligning soft clipped reads. Bioinformatics. doi: 10.1093/bioinformatics/btt767
Simpson, J. T., & Durbin, R. (2012). Efficient de novo assembly of large genomes using compressed data structures. Genome Res, 22(3), 549-556. doi: 10.1101/gr.126953.111
Simpson, J. T., McIntyre, R. E., Adams, D. J., & Durbin, R. (2010). Copy number variant detection in inbred strains from short read sequence data. Bioinformatics, 26(4), 565-567. doi: 10.1093/bioinformatics/btp693
Simpson, J. T., Wong, K., Jackman, S. D., Schein, J. E., Jones, S. J., & Birol, I. (2009). ABySS: a parallel assembler for short read sequence data. Genome Res, 19(6), 1117-1123. doi: 10.1101/gr.089532.108
Sindi, S., Helman, E., Bashir, A., & Raphael, B. J. (2009). A geometric approach for classification and comparison of structural variants. Bioinformatics, 25(12), i222-230. doi: 10.1093/bioinformatics/btp208
Sindi, S. S., Onal, S., Peng, L. C., Wu, H. T., & Raphael, B. J. (2012). An integrative probabilistic model for identification of structural variation in sequencing data. Genome Biol, 13(3), R22. doi: 10.1186/gb-2012-13-3-r22
Smith, A. C., McGavran, L., Robinson, J., Waldstein, G., Macfarlane, J., Zonona, J., . . . Magenis, E. (1986). Interstitial deletion of (17)(p11.2p11.2) in nine patients. Am J Med Genet, 24(3), 393-414. doi: 10.1002/ajmg.1320240303
Speicher, M. R., & Carter, N. P. (2005). The new cytogenetics: blurring the boundaries with molecular biology. Nat Rev Genet, 6(10), 782-792. doi: 10.1038/nrg1692
Spies, N., Zook, J. M., Salit, M., & Sidow, A. (2015). svviz: a read viewer for validating structural variants. Bioinformatics, 31(24), 3994-3996. doi: 10.1093/bioinformatics/btv478
Sun, R., Love, M. I., Zemojtel, T., Emde, A. K., Chung, H. R., Vingron, M., & Haas, S. A. (2012). Breakpointer: using local mapping artifacts to support sequence breakpoint discovery from single-end reads. Bioinformatics, 28(7), 1024-1025. doi: 10.1093/bioinformatics/bts064
Sung, W. K., Zheng, H., Li, S., Chen, R., Liu, X., Li, Y., . . . Luk, J. M. (2012). Genome-wide survey of recurrent HBV integration in hepatocellular carcinoma. Nat Genet, 44(7), 765-769. doi: 10.1038/ng.2295
Suzuki, S., Yasuda, T., Shiraishi, Y., Miyano, S., & Nagasaki, M. (2011). ClipCrop: a tool for detecting structural variations with single-base resolution using soft-clipping information. BMC Bioinformatics, 12 Suppl 14, S7. doi: 10.1186/1471-2105-12-S14-S7
Thorvaldsdottir, H., Robinson, J. T., & Mesirov, J. P. (2013). Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform, 14(2), 178-192. doi: 10.1093/bib/bbs017
Trappe, K., Emde, A. K., Ehrlich, H. C., & Reinert, K. (2014). Gustaf: Detecting and correctly classifying SVs in the NGS twilight zone. Bioinformatics, 30(24), 3484-3490. doi: 10.1093/bioinformatics/btu431
Trask, B. J. (2002). Human cytogenetics: 46 chromosomes, 46 years and counting. Nat Rev Genet, 3(10), 769-778. doi: 10.1038/nrg905
Tubio, J. M. (2015). Somatic structural variation and cancer. Brief Funct Genomics. doi: 10.1093/bfgp/elv016
van Dijk, E. L., Auger, H., Jaszczyszyn, Y., & Thermes, C. (2014). Ten years of next-generation sequencing technology. Trends Genet, 30(9), 418-426. doi: 10.1016/j.tig.2014.07.001
Wang, J., Kong, L., Gao, G., & Luo, J. (2013). A brief introduction to web-based genome browsers. Brief Bioinform, 14(2), 131-143. doi: 10.1093/bib/bbs029
Wang, J., Mullighan, C. G., Easton, J., Roberts, S., Heatley, S. L., Ma, J., . . . Zhang, J. (2011). CREST maps somatic structural variation in cancer genomes with base-pair resolution. Nat Methods, 8(8), 652-654. doi: 10.1038/nmeth.1628
Wang, Z., Hormozdiari, F., Yang, W. Y., Halperin, E., & Eskin, E. (2013). CNVeM: copy number variation detection using uncertainty of read mapping. J Comput Biol, 20(3), 224-236. doi: 10.1089/cmb.2012.0258
Warren, R. L., Sutton, G. G., Jones, S. J., & Holt, R. A. (2007). Assembling millions of short DNA sequences using SSAKE. Bioinformatics, 23(4), 500-501. doi: 10.1093/bioinformatics/btl629
Weckselblatt, B., & Rudd, M. K. (2015). Human Structural Variation: Mechanisms of Chromosome Rearrangements. Trends Genet, 31(10), 587-599. doi: 10.1016/j.tig.2015.05.010
Weischenfeldt, J., Symmons, O., Spitz, F., & Korbel, J. O. (2013). Phenotypic impact of genomic structural variation: insights from and for human disease. Nat Rev Genet, 14(2), 125-138. doi: 10.1038/nrg3373
Wong, K., Keane, T. M., Stalker, J., & Adams, D. J. (2010). Enhanced structural variant and breakpoint detection using SVMerge by integration of multiple detection methods and local assembly. Genome Biol, 11(12), R128. doi: 10.1186/gb-2010-11-12-r128
Xi, R., Hadjipanayis, A. G., Luquette, L. J., Kim, T. M., Lee, E., Zhang, J., . . . Park, P. J. (2011). Copy number variation detection in whole-genome sequencing data using the Bayesian information criterion. Proc Natl Acad Sci U S A, 108(46), E1128-1136. doi: 10.1073/pnas.1110574108
Xie, C., & Tammi, M. T. (2009). CNV-seq, a new method to detect copy number variation using high-throughput sequencing. BMC Bioinformatics, 10, 80. doi: 10.1186/1471-2105-10-80
Yang, L., Luquette, L. J., Gehlenborg, N., Xi, R., Haseley, P. S., Hsieh, C. H., . . . Park, P. J. (2013). Diverse mechanisms of somatic structural variations in human cancer genomes. Cell, 153(4), 919-929. doi: 10.1016/j.cell.2013.04.010
Yang, X., Chockalingam, S. P., & Aluru, S. (2013). A survey of error-correction methods for next-generation sequencing. Brief Bioinform, 14(1), 56-66. doi: 10.1093/bib/bbs015
Ye, K., Schulz, M. H., Long, Q., Apweiler, R., & Ning, Z. (2009). Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics, 25(21), 2865-2871. doi: 10.1093/bioinformatics/btp394
Yoon, S., Xuan, Z., Makarov, V., Ye, K., & Sebat, J. (2009). Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res, 19(9), 1586-1592. doi: 10.1101/gr.092981.109
Zeitouni, B., Boeva, V., Janoueix-Lerosey, I., Loeillet, S., Legoix-ne, P., Nicolas, A., . . . Barillot, E. (2010). SVDetect: a tool to identify genomic structural variations from paired-end and mate-pair sequencing data. Bioinformatics, 26(15), 1895-1896. doi: 10.1093/bioinformatics/btq293
Zerbino, D. R., & Birney, E. (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res, 18(5), 821-829. doi: 10.1101/gr.074492.107
Zhao, M., Lee, W. P., Garrison, E. P., & Marth, G. T. (2013). SSW library: an SIMD Smith-Waterman C/C++ library for use in genomic applications. PLoS One, 8(12), e82138. doi: 10.1371/journal.pone.0082138
