Mutation Research-Reviews in Mutation Research 779 (2019) 114–125
Contents lists available at ScienceDirect
Mutation Research-Reviews in Mutation Research journal homepage: www.elsevier.com/locate/mutrev
Review
Free-access copy-number variant detection tools for targeted nextgeneration sequencing data
T
Iria Rocaa,b,c, , Lorena González-Castrod, Helena Fernándezd, Mª Luz Coucec, Ana Fernández-Marmiesseb,c ⁎
a
University of Santiago de Compostela, Spain Genomes and Disease Group, Molecular Medicine and Chronic Diseases Centre (CiMUS), University of Santiago de Compostela, Spain c Unit of Diagnosis and Treatment of Congenital Metabolic Diseases. Fundación Instituto de Investigación Sanitaria de Santiago de Compostela (FIDIS), Spain d Galician Research and Development Center in Advanced Telecommunications (GRADIANT), Vigo, Spain b
ARTICLE INFO
ABSTRACT
Keywords: Targeted next-generation sequencing CNV-detection tools7 Gene panels Simulated data
Copy number variants (CNVs) are intermediate-scale structural variants containing copy number changes involving DNA fragments of between 1 kb and 5 Mb. Although known to account for a significant proportion of the genetic burden in human disease, the role of CNVs (especially small CNVs) is often underestimated, as they are undetectable by traditional Sanger sequencing. Since the development of next-generation sequencing (NGS) technologies, several research groups have compared depth of coverage (DoC) patterns between samples, an approach that may facilitate effective CNV detection. Most CNV detection tools based on DoC comparisons are designed to work with whole-genome sequencing (WGS) or whole-exome sequencing (WES) data. However, few methods developed to date are designed for custom/commercial targeted NGS (tg-NGS) panels, the assays most commonly used for diagnostic purposes. Moreover, the development and evaluation of these tools is hindered by (i) the scarcity of thoroughly annotated data containing CNVs and (ii) a dearth of simulation tools for WES and tg-NGS that mimic the errors and biases encountered in these data. Here, we review DoC-based CNV detection methods described in the current literature, assess their performance with simulated tg-NGS data, and discuss their strengths and weaknesses when integrated into the daily laboratory workflow. Our findings suggest that the best methods for CNV detection in tg-NGS panels are DECoN, ExomeDepth, and ExomeCNV. Regardless of the method used, there is a need to make these programs more user-friendly to enable their use by diagnostic laboratory staff who lack bioinformatics training.
1. Introduction Copy number variations (CNVs) are structural variants containing copy number changes typically involving segments of between 1 kb and 5 Mb [1], and are thought to account for a large component of our genetic diversity [2–4] and a significant proportion of the genetic burden of human disease [5,6]. CNVs spanning 1 Mb are detectable by cytogenetic techniques such as spectral karyotyping and fluorescence in situ hybridization. However, the level of resolution attained by these methods does not permit detection of CNVs in smaller segments. Microarray comparative genomic hybridization (array-CGH) was the first technique developed to enable higher resolution CNV detection (˜100 kb) [7–9], while subsequent single-nucleotide polymorphism (SNP) array strategies provided yet higher resolution (˜20 kb) [10]. These approaches have several inherent drawbacks, including
⁎
hybridization noise, limited genome coverage, and difficulty detecting novel and rare mutations [11,12]. The emergence of next-generation sequencing (NGS) technologies signaled a turning point in genetic diagnosis by providing faster, cheaper, and higher-resolution testing options [13–15]. Data produced by these platforms allow both the identification of point mutations and the detection of CNVs. Compared with array-based approaches, in which probes are predefined for limited genomic regions, short reads from NGS platforms are randomly sampled from the entire genome (whole-genome sequencing [WGS] analysis). Advantages of the WGS approach include greater coverage, higher resolution, more accurate copy number estimation, more precise breakpoint detection, and a greater ability to identify novel CNVs [16,17]. Exploiting these advantages, multiple tools has been developed to detect CNVs based on different features that can be extracted from NGS data [18–22]. The 4 most commonly used methods to detect
Corresponding author. E-mail address:
[email protected] (I. Roca).
https://doi.org/10.1016/j.mrrev.2019.02.005 Received 13 April 2018; Received in revised form 25 December 2018; Accepted 22 February 2019 Available online 23 February 2019 1383-5742/ © 2019 Elsevier B.V. All rights reserved.
Mutation Research-Reviews in Mutation Research 779 (2019) 114–125
I. Roca, et al.
CNVs using NGS data are paired-end mapping, split read, de novo assembly, and depth of coverage (DoC). In paired-end sequencing, DNA fragments are assumed to have a specific insert-size distribution. Tools based on the paired-end mapping approach identify CNVs by detecting fragments with a discordance between mapped paired-reads whose distances differ significantly from the predetermined average insert size. The split read approach uses reads from pair-end sequencing in which only one read of the pair is reliably mapped and the other either partially or completely fails to map to the genome. In de novo assembly, DNA fragments are reconstructed from short reads by assembling overlapping reads, and the assembled fragments then compared to the reference genome to identify regions with different copy numbers. Finally, the DoC approach is based on the assumption that the depth of sequence coverage at any target is generally correlated with the initial copy number of the corresponding region [23]. Most of the CNV detection methods summarized in Table 1 were developed for use with WGS data [24–42]. Most CNV detection tools for tg-NGS data have been developed for whole-exome sequencing (WES), the broadest targeted panel available. However, in their attempts to overcome the coverage limitations of WES and reduce costs, and to reduce the complexity of the analysis of the huge amount of data generated by WES, molecular diagnosis laboratories have largely turned to custom or commercial gene panels (tg-NGS), which offer greater mean coverage of clinically relevant genes and are more economical [43–45]. Gene panels, designed to explore a much smaller proportion of the genome than WGS or WES, offer much greater coverage in the interrogated regions, allowing more accurate and sensitive detection of disease-related small CNVs encompassing one or more exons of one or multiple genes [22]. Tools that can simultaneously detect small CNVs and point mutations with a resolution as high as one small exon are essential to effectively tackle diagnostic problems. Previously, when classical sequencing of consecutive genes was the main approach used by diagnostics laboratories, CNV detection within the gene of interest was facilitated by optimizing custom, gene-specific assays such as multiplex ligation-dependent probe amplification (MLPA) [46] and quantitative polymerase chain reaction (qPCR) [47]. Most CNV detection tools based on DoC comparisons are designed to work with WGS or WES data (Table 1). By contrast, few methods have been optimized for use with the custom/commercial panels used in clinical labs nowadays. Unlike WGS, WES data offer greater coverage of targeted regions (genome-coding regions), allowing for more accurate CNV detection using a DoC-based calling approach. However, due to differential capture efficiency, the DoC can vary substantially between coding regions, a factor that should be considered in the downstream analysis of CNV calling. This variability can be caused by several biases. First, due to inconsistent capture efficiency, certain regions may be poorly sequenced. Second, the assumption of normal distribution (as occurs in WGS) may no longer be valid due to biases relating to readdepth distribution. Third, owing to the discontinuity of genomic regions, most CNV breakpoints cannot be detected. Finally, the segmentation algorithms typically used to merge windows in WGS may not be applicable to WES data owing to the non-continuous distributions of the reads. Since tg-NGS data (gene panels) constitutes a special case of WES, the aforementioned biases inherent to WES also apply. In most gene panels, the size of the target region (exons) is typically small (100–200 base pairs [bp]), as in whole-exome capture. Moreover, the genomic distribution of target regions is often sparse and uneven due to the size of intronic segments and the arbitrary locations of the genes of interest. The small size and the sparse, non-contiguous nature of target regions pose challenges to the application of existing CNV methods to gene panel data. Furthermore, the underlying assumptions made for CNV estimation in WGS may not hold in exome sequencing settings [48]. In particular, the aforementioned assumptions that genome-wide read depth is normally distributed and that segmentation search space is continuous do not apply in the case of exon-capture data. Another issue is the number of control samples available. In the case
of WES all simultaneously processed samples can potentially serve as controls for one another. This is not necessarily possible with tg-NGS analysis, since gene panel design changes continuously with the addition/removal of genes to/from panels. Moreover, some laboratories may work with more than one gene panel (e.g., if studying multiple diseases). Here, we present a comprehensive review of the requirements, advantages, and drawbacks of the most commonly used free-access bioinformatic tools created for CNV detection since 2011 and applicable to WES or tg-NGS sequencing data: ExomeCNV (2011) [48], ExomeCopy (2011) [49], CONTRA (2012) [50], ExomeDepth (2012) [51], CONIFER (2012) [52], CANOES (2014) [53], CODEX (2015) [54], CLAMMS (2015) [55], CoNVaDING (2015) [56], DECoN (2016) [57], CNVkit (2016) [58], and SeqCNV (2017) [59]. Both the capacity and scope of these tools are questionable owing to the lack of a gold standard against which to validate and compare results [60]. Ideally, their performance should be evaluated using a large number of validated CNV-positive controls, but these are often unavailable. Although several NGS simulation tools have been developed, none can fully reproduce the complexity of sequencing data generated by NGS technologies. Several factors are important when choosing a simulator, including the type of sequencing platform (Illumina, Roche 454, SOLiD, etc.) and the availability of options for simulating sequencing errors, reproducing biases (e.g., GC content bias), configuring parameters (e.g., read length or DoC), and simulating variants such as gross indels and CNVs. The main NGS simulation tools are summarized in Table 2. Although some, such as Metasim [61], Flowsim [62], Pbsim [63], GenFrag [64], Mason [65], GemSIM [66], ART [67], Dwgsim [68], and pIRS [69], can generate reads from a reference genome, none can simulate samples containing CNVs. Another class of simulation tools exist in which variations are introduced into a DNA chain (e.g., VarSim [70], RSVSim [71], FIGG [72], and SCNVSim [73]), although these unfortunately do not generate reads. An intuitive approach would be to combine tools from the two groups. However, using the output from one tool as input for another is not always straightforward and is sometimes impossible [74]. An additional limitation is that the aforementioned simulation tools were designed to generate WGS data. Very few NGS simulation tools generate synthetic tg-NGS data that faithfully reproduce the specific biases and errors resulting from the restriction of the genome regions being sequenced. In addition to the biases and errors inherent to any NGS experiment, tg-NGS data suffer from other problems related to the size of the sequenced regions and the capture library. These problems need to be reproduced to optimize the development and evaluation of genetic data analysis tools, such as CNV detectors [75,76]. To our knowledge, only two tg-NGS simulators have been described: Wessim [77] and TargetedSim [78]. Wessim generates tg-NGS reads and considers factors such as sequencing errors and GC content bias, but does not simulate variants. TargetedSim was designed to generate tg-NGS samples through the controlled simple insertion of CNVs, but cannot model GC bias, and has since been discontinued due to its serious coverage limitations. In this article, we 1) demonstrate how to generate simulated tg-NGS data with artificial CNVs for the validation of CNV detection tools, 2) present a comprehensive review of the most commonly used CNV detection methods and 3) compare the performance of such methods using simulated datasets. 2. Drawbacks associated with DoC-based CNV detection DoC-based CNV detection methods are based on the underlying assumption that the DoC of each target sequenced is proportional to the initial number of copies in the corresponding region of the genome. Thus, if a region is duplicated (or deleted), its number of reads should be either double or half that of the expected value. The main issue with this approach is that differences between samples in the DoC in a given region can also be influenced by experimental biases, such as GC115
2011
2012 2012
2012 2014
2015
2015
2015
2016
2016
2017
CONTRA ExomeDepth
CONIFER CANOES
CODEX
CLAMMS
CoNVaDING
DECoN
CNVkit
SeqCNV
2014
Lumpy
ExomeCopy
2013
m-HMM
2015 2011
2012
CnvHiTSeq
Ulysses ExomeCNV
2009 2009 2009 2010 2011 2011 2012 2012
RDXplorer Magnolya Pindel Copy-Seq CNVnator BIC-seq Cortex cn.MOPS
2015
2009
BreakDancer
GenomeSTRiP v2
2009 2009
GASV VariationHunter
2014
2009 2009 2009
SegSeq PEMer CNV-seq
CNVrd2
Year
Tool
116
Depth of coverage
Depth of coverage
Depth of coverage
Depth of coverage
Depth of coverage
Depth of coverage
Depth of coverage Depth of coverage
Depth of coverage Depth of coverage
Depth of coverage
Depth of coverage and Paired-end mapping Paired-end mapping Depth of coverage
Paired-end mapping, split read and DOC Depth of coverage
Paired-end mapping, split read and DOC Depth of coverage
Depth of coverage De novo assembly Split read Depth of coverage Depth of coverage Depth of coverage De novo assembly Depth of coverage
Paired-end mapping
Paired-end mapping Paired-end mapping
Depth of coverage Paired-end mapping Depth of coverage
Method based
C++
Python
R
Perl
C
R
Python R
Python, R R
R
Python, R R
Java, R
R
C++
R
Java
Python, R Python C++ Java C++ Perl, R C R
Perl/C++
Java, Perl C
Matlab Python, Perl Perl, R
Language
OS OS
OS OS OS OS OS OS OS
OS OS OS
OS OS
OS
OS
Linux, Mac OS
Windows, Linux, Mac OS Linux, Mac OS
Linux, Mac OS
Windows, Linux, Mac OS Linux, Mac OS
Linux, Mac Windows, Linux, Mac Windows, Linux, Mac Linux, Mac Windows, Linux, Mac Linux, Mac Linux, Mac
Windows, Linux, Mac OS Linux, Mac OS
Windows, Linux, Mac OS Linux, Mac OS
Linux, Mac Linux, Mac Linux, Mac Linux, Mac Linux, Mac Linux, Mac Linux, Mac Windows, Linux, Mac Linux, Mac
Linux, Mac OS
Linux, Mac OS Linux, Mac OS
Linux, Mac OS Linux, Mac OS Linux, Mac OS
SO
BAM files
BAM files
BAM files
BED files (SAMtools bedcov output) BAM files
BAM files or RPKM files Plain text (bedtools multicov output) BAM files
BAM or SAM files BAM files
BAM or SAM files GTF format or Plain text (GATK’s DepthOfCoverage output) BAM files
BAM files
BAM files or read count matrices
BAM files
Read count matrices
BAM files
Alignment files FASTA files Alignment files (obtained with BLAT) BAM files An specific format (DIVET) obtained from mrFast Alignment files (obtained with: MAQ, BWA, NovoAlign or Bfast) BAM files FASTQ or FASTA files BAM files BAM files BAM or SAM files BAM files FASTQ or FASTA files BAM files or read count matrices
Input
tg-NGS / WES tg-NGS / WES tg-NGS / WES tg-NGS / WES
WES
WES
WES WES
WES WES
WES
WGS WES
WGS
WGS
WGS
WGS
WGS
WGS WGS WGS WGS WGS WGS WGS WGS
WGS
WGS WGS
WGS WGS WGS
Data type
http://www.iipl.fudan.edu.cn/SeqCNV
https://cnvkit.readthedocs.io/en/stable/
https://github.com/RahmanTeam/DECoN
https://github.com/molgenis/CoNVaDING
https://github.com/rgcgithub/clamms
https://bioconductor.org/packages/release/bioc/html/CODEX.html
http://conifer.sourceforge.net/ http://www.columbia.edu/-ys2411/canoes/
http://contra-cnv.sourceforge.net/ https://cran.r-project.org/web/packages/ExomeDepth/index.html
http://bioconductor.org/packages/exomeCopy/
http://www.lcqb.upmc.fr/ulysses/ https://github.com/cran/ExomeCNV
http://software.broadinstitute.org/software/genomestrip/
https://github.com/hoangtn/CNVrd2
https://github.com/arq5x/lumpy-sv
https://www.stt.msu.edu/users/hengwang/mHMM.html
https://sourceforge.net/projects/cnvhitseq/
http://rdxplorer.sourceforge.net/ https://sourceforge.net/projects/magnolya/ https://github.com/genome/pindel http://www.korbel.embl.de/CopySeq/ https://github.com/abyzovlab/CNVnator http://compbio.med.harvard.edu/Supplements/PNAS11.html http://cortexassembler.sourceforge.net/index.html http://bioconductor.org/packages/release/bioc/html/cn.mops.html
http://breakdancer.sourceforge.net/
http://compbio.cs.brown.edu/projects/gasv/ http://variationhunter.sourceforge.net/Home
http://portals.broadinstitute.org/cgi-bin/cancer/publications/pub_paper.cgi?mode = view&paper_id = 182 http://sv.gersteinlab.org/pemer/ http://tiger.dbs.nus.edu.sg/cnv-seq/
Availability
Table 1 Summary of Copy Number Variation detection tools for Next Generation Sequencing data, ranked by data type (WGS, WES, tg-NGS) and publication year.
I. Roca, et al.
Mutation Research-Reviews in Mutation Research 779 (2019) 114–125
Mutation Research-Reviews in Mutation Research 779 (2019) 114–125
I. Roca, et al.
3. Generation of artificial CNVs
content, highly homologous regions, poor mappability, and other biases like technical variability in library preparation, capture, and sequencing [22,79,80]. Nonetheless, high levels of DoC correlation for a given region have been demonstrated between samples, especially if processed simultaneously. For any given sample, this correlation can be used to construct a model to predict sequence depth at any particular target in the absence of a CNV [48]. The relation between GC-content and DoC has been widely discussed [81,82]. In the in-solution hybridization step, capture probes of differential GC content behave differently, with a subsequent effect on the resulting sequence coverage. Regions with very high (70–80%) or very low (30–40%) GC content have a lower average DoC than regions with medium GC content, reflecting the poor efficiency of hybridization to AT- or GC-rich targets. However, there are other possible explanations for these observations, including potential oligonucleotide synthesis issues associated with high and low GC-content probes, selfstructure of targeted DNA during hybridization, and biases in the PCR amplification step during sequencing library generation, resulting in fewer corresponding targeted sequences [83]. In fact, the GC content of the full DNA fragment, not the sequenced read, most significantly influences fragment count [81,84]. These low coverage regions caused by extreme GC content can give rise to false positives and negatives in CNV detection, since the method used may not discriminate between inherent noise due to GC content and the presence of a real CNV. Highly homologous regions are sequences containing similar (or identical) copies that are either contiguous (repetitive regions) or are found in other areas of the genome. These similarities complicate the alignment of reads corresponding to these regions, since the algorithm responsible for assigning each read to a genomic position cannot determine which is the appropriate region. If these reads are not discarded using a quality filter (which would promote a loss of coverage in the corresponding regions, and preclude detection of point mutations or CNVs), this can result in the detection of false positives when most reads are aligned to one of the candidate regions; the algorithm may consider the read-rich region a duplication and the read-poor region a deletion [85]. Mappability is the probability that a read originating from a given region is mapped to that region, and is closely related to the presence of repetitive sequences and of short-length reads, given that the smaller the read the more likely it can be assigned to different genomic positions. While regions with more repetitive sequences have lower mappability, mutations or sequencing errors in these regions may also lead to mapping of the reads to other genomic positions [60]. In addition to these drawbacks, outcomes may be affected by biases associated with the specific platform used, and how these biases in turn affect simultaneous sequencing of samples [86].
3.1. Methodology To obtain a comprehensive benchmark against which to test the algorithms, we generated a synthetic dataset spanning multiple CNVs of different lengths, types (duplications and deletions in both homozygous and heterozygous state), and mean coverages. We simulated tg-NGS using a combination of Wessim and RSVSim. In Wessim, target regions can be specified in two ways: 1) using a list of genomic coordinates (ideal targets) in a BED file (although this approach has limitations when mutations that alter the number of bases of the genome are inserted); or 2) using a set of probe sequences to capture fragments (this allows for more realistic DNA fragment generation as it uses statistics based on real data). We selected the latter, probe-based approach, as it offers greater flexibility for generating different CNVs in the selected target regions along the reference genome. To generate artificial CNVs in Wessim, we modified the input data (the reference genome and probe sequences) at the positions in which the CNVs were to be introduced. This step was performed with the aid of another simulation tool, RSVSim. To generate deletions, the genome was modified by removing the bases from the regions in which the deletion was to be introduced, preventing the generation of fragments in these zones and resulting in drops in coverage. To introduce duplications, probe sequences that hybridized to the regions of interest were duplicated, increasing the final coverage in these regions. However, this strategy only permits the generation of samples with homozygous CNVs (heterozygous CNVs are much more frequent and difficult to detect). To overcome this limitation, we introduced a second step consisting of two simulations with identical genomes in all regions except those containing CNVs. For homozygous CNVs, both copies would have the same deleted regions, but for heterozygous variants, only one of the copies would contain the CNV. For each of these copies of the genome, a Wessim read simulation was performed with half of the final desired coverage. Finally, results from both experiments were combined into a single file to produce a single sample. 3.2. Description of the simulated dataset We introduced genomic variants into a given reference genome (hg19) to create a set of mutated samples that resemble individuals with rare CNVs. To determine the target regions we selected a custom gene panel, containing 4179 exons ranging in size from 53 bp to 17,155 bp and distributed across all chromosomes except chromosome 18. The loci selected for the introduction of the CNVs were chosen randomly. Three types of variants spanning different sizes were considered:
Table 2 Summary of the most representative NGS data simulation tools. IL, Illumina; R, Roche 454; S, SOLID; IT, Ion Torrent; CS, classic sequencing (Sanger); n.a. = not available; —— = Not applicable; * VarSim integrates directly with other read generation tools, which will determine the format of the final output. Tool
Reads
CNVs
TR
GC-content
Platform
Software language
Input
Output
ART dwgsim FIGG Flowsim GemSIM GenFrag Mason MetaSim PBSIM pirs RSVSim SCNVsim TargetedSim VarSim Wessim
YES YES NO YES YES YES YES YES YES YES NO NO YES NO YES
NO NO YES NO NO NO NO NO NO NO YES YES YES YES NO
NO NO ——— NO NO NO NO NO NO NO ——— ——— YES ——— YES
NO NO ——— NO NO NO NO NO NO YES ——— ——— NO ——— YES
IL,R,S IL,S,IT ——— R IL,R CS IL,R,CS IL,R, CS PacBio IL ——— ——— IL ——— IL,R,S, CS
C++ C Java Haskell Python n.a. C++ Java C C++ R Java Python Java, Python Python
FASTA FASTA FASTA FASTA FASTA FASTA FASTA FASTA FASTA FASTA FASTA FASTA FASTA, BED FASTA FASTA, BED
FASTQ FASTQ FASTA SFF FASTQ,SAM FASTQ FASTQ, SAM FASTA FASTQ FASTQ FASTA FASTA FASTQ FASTA* FASTQ
117
Mutation Research-Reviews in Mutation Research 779 (2019) 114–125
I. Roca, et al.
homozygous deletions (DEL-HO), heterozygous deletions (DEL-HT), and duplications (DUP). A set of 320 synthetic samples with 4151 deletions and duplications (2000 DEL-HT, 273 DEL-HO, and 1878 DUP) ranging from 100 bp to 20 kb and 20 control samples without CNVs were generated as follows: 137 samples contained “simple” CNVs (i.e., only one type of CNV ranging from 1 exon to a set of genes) and 83 samples contained “compound” CNVs (i.e., a combination of different CNV types in separate genes). To determine differences in algorithm performance according to coverage, the same 100 samples with simple CNVs were simulated at 50X and 300X to eliminate bias due to GC content, exon size, and other potential modifiers. An empirical error model was used to simulate paired-end reads from the Illumina platform (length, 75 bp; mean insert size, 200 bp) that were then aligned to the reference genome using BWA software.
4.3. CONTRA CONTRA [50] can be used by users with moderate programming skills, requiring one command line to create the baseline, and another to perform the analysis. It also allows the user to change a wide range of parameters, such as minimum DoC, minimum read length, minimum number of bases with the specified minimum DoC, and other parameters specific to the method itself, therefore allowing optimization of each analysis. Nonetheless, CONTRA requires considerable computational resources and time to create the baseline and perform the analysis, probably owing to the normalization steps involved. The algorithm first creates a baseline from the control samples. First, it calculates the library size of each control sample as the product of the total number of reads for that sample, the read length, and the percentage of reads on target. Next, the base-level coverage for each control sample is computed for each bp as the product of the DoC in that bp and the geometric mean of the library size of all control samples, and the result is then divided by the library size of the sample. Once the baseline has been calculated, CONTRA repeats the aforementioned process for the case and control/baseline samples, and calculates the base-level log-ratios as the logarithm of the ratio between the normalized test sample DoC and the normalized control sample DoC. For this step, regions with DoC values lower than a predefined threshold are excluded from the analysis: by default, regions with less than 10 bp at 10X are discarded. Next, the mean of all base-level logratios in the region and the value of the fitted linear regression between the log-ratios and the log-coverage are extracted from each individual log-ratio. After this correction step, a 2-tailed p-value test is performed (the base-level log-ratios are assumed to follow a normal distribution), with the null hypothesis assuming no differences between the coverage of the test sample and that of the control. The p-value is the parameter used to prioritize the most likely CNVs.
4. CNV detection methods suitable for tg-NGS data 4.1. ExomeCNV ExomeCNV [48] is very fast and can be used with any operating system (Windows, Linux, and Mac). It provides multiple analysis options (determine optimal cutoff ratio by maximizing specificity, sensitivity, or area under curve; minimum specificity/sensitivity required, and minimum read length), allowing selection of the most appropriate parameters for each experiment. However, the user must be able to write an R-script to use the package. Although the analysis is performed rapidly, data preprocessing (to obtain GATK output from bam files) is required, and slows down the process. A key drawback of this tool is that all regions examined are reported in the final output, after which the user must filter those most likely to be real CNVs. For each exon, this method determines the log-ratio between the coverage of the test sample and the control. These log-ratios are then divided by their median. Assuming that coverage of test and control samples follows a Poisson distribution, with sufficient DoC the Poisson distribution converges to a normal distribution with equal mean and variance. Equal means for the test and control samples indicate an absence of CNVs, while the presence of CNVs the mean normalized logratio for the test sample will be equal to that of the control multiplied by a parameter p (0.5 for deletions and 1.5 for duplications). The program obtains the equations for sensitivity and specificity in terms of this parameter p and the log-ratio values, and the user can solve these equations in 3 ways: by maximizing sensitivity, maximizing specificity, or using area under the ROC curve. The log-ratio and ratio are the parameters that can be used to prioritize the most likely CNVs.
4.4. ExomeDepth ExomeDepth [51] is a fast tool which can be used with any operating system. It allows the user to introduce a large dataset of possible control samples, from which the program selects those that most closely resemble the case sample, thereby precluding the need for user selection of the most appropriate samples. This is an R-based package; users must have a moderate knowledge of R, and be capable of writing their own script using the functions available in the package. ExomeDepth also requires importing certain packages, and each of them requires its own dependencies, different versions of which may exist depending on the user’s R configuration. This in turn can result in dependencies problems and/or differential outputs. This method assumes that the number of test reads in each exon follows a binomial distribution, and calculates this parameter based on the sum of the number of test and control reads, the GC content of the exon, a dispersion parameter (estimated from number of global reads), and another parameter based on the number of copies of the exon. The method then generates a beta-binomial model that calculates the expected average coverage for each exon of the test sample. The model is solved in 3 ways: assuming the exon has 2 copies, has 1 copy, or has 3 copies. All these values are incorporated into a hidden Markov model. This model allows merging of adjacent CNVs into one and the determination of CNV type (duplication or deletion). Users can use a specific set of controls or allow the software to select those which best fit the test, based on correlations between DoC values. The Bayes factor and the ratio of observed/expected number of reads are the parameters used to prioritize the most likely CNVs.
4.2. ExomeCopy ExomeCopy [49] can be used with any operating system. Users must have moderate R programming skills (i.e. be capable of writing a script using all functions required to perform an analysis, in the correct order). Furthermore, ExomeCopy depends on other R-packages, and its functionality may be affected depending on the versions used. This method normalizes the coverage of control samples based on the median value, and then models the number of reads for each chromosome in the test sample using a hidden Markov model (where the hidden state is the copy number of the sample). It assumes that the DoC for each exon in the test sample follows a negative binomial distribution, and determines the parameter of the distribution as a combination of the logarithm of the normalized control sample coverage, the GC content, and the exon length. This method does not provide a p-value, but instead calculates the log odds, which is the sum of the log-ratio of the emission probability for the given read count for the predicted state divided by the emission probability for the normal state (the higher the value, the more likely the read counts in the region could not have been generated in the normal state).
4.5. CONIFER CONIFER [52] enables CNV detection from a large set of samples in one single analysis, but requires moderate programming skills (several 118
Mutation Research-Reviews in Mutation Research 779 (2019) 114–125
I. Roca, et al.
different commands are required and must be used in a specific order). Moreover, the analysis is not completely automatic: the user must create a plot, open it, select the necessary parameters and introduce them into the program (explained below), and continue with the process. The minimum number of samples in each analysis is 6 (although the authors recommend at least 8 samples), which can pose problems for laboratories using tg-NGS panels, as the minimum required number of samples may not be available for every panel. The following important drawbacks should be noted: 1) the method is intended to detect CNVs containing at least 3 consecutive exons (precluding detection of smaller CNVs); and 2) no parameter is used to prioritize the most likely CNVs. This method first determines the standardized RPKM values (reads per thousand bases per million reads sequenced) [87] for each exon, which are a normalization of the exon’s DoC. Next, the method applies singular value decomposition: it creates a matrix containing all standardized RPKM values for each sample and exon, and decomposes the matrix into the product of another 3 matrices, whereby the central matrix (singular values matrix) reflects the relative importance of each component of the other 2 matrices. The highest values, are assumed to be the result of bias, and are thus eliminated and the product recalculated. These new values correspond to the normalized relative number of copies for each sample in each exon. Those with a value < 1.5 are considered deletions and those with a value > 1.5 are considered duplications. The number of components eliminated is determined manually by the user for each analysis. To determine the optimal number, CONIFER generates a scree plot of the relative contributed variance per singular value: the optimal number is that corresponding to the inflection point of the plot.
interrogated in each sample follows a Poisson distribution, and estimates the parameter of the distribution based on the global mean coverage of the corresponding sample, the total number of reads, the GC content of the exon, and a parameter that measures exon-specific biases. A maximum likelihood method is used to calculate this parameter. Next, the program uses a Poisson likelihood-based recursive segmentation algorithm to estimate copy number in the case sample, using the previously calculated expected coverage, and employs a circular binary segmentation procedure [88] to calculate the p-value. Two parameters can be used to prioritize the most likely CNVs: the p-value and the mBIC (the higher this value, the more likely the CNV is real). 4.8. CLAMMS This tool [55] analyzes all samples simultaneously, thus saving considerable time in analyses involving large numbers of samples. It also selects the most suitable control samples for each case, reducing the need for prior filtering of samples by the user. The algorithm is extremely fast but users must have good programming skills, given that multiple command lines are required to perform CNV calling. Moreover, chromosomes must be correctly labeled (“1”, “2”, …, instead of “chr1”, “chr2”, …), meaning that an additional step is required when analyzing samples with distinct chromosome nomenclatures. Another drawback is that users must create a BED file with the mean DoC per exon for each sample using BEDtools and other input files must be downloaded from public repositories. This program normalizes the DoC of each bp as a function of the GC content of the region. Next, the optimal set of controls is selected by choosing the most similar samples based on the quality control metrics of the Picard program [89]. Using this set of optimal controls, the method generates a model to determine the expected DoC of the test sample in each exon, according to the number of copies (generally 0, 1, 2, or 3, although for exons with known duplications [41] components for 4, 5, and 6 copies are also included). A hidden Markov model identifies regions with a number of copies other than 2. The following parameters are used to measure CNV quality, although no threshold values are provided: Q_SOME (Phred-scaled quality of any CNV within the interval), Q_EXACT (a non-Phred-scaled quality score that measures how closely the coverage profile matches the exact CNV state and breakpoints), Q_LEFT_EXTEND (Phred-scaled quality of the left breakpoint), and Q_RIGHT_EXTEND (Phred-scaled quality of the right breakpoint).
4.6. CANOES CANOES [53] is a high-speed program which analyzes all samples simultaneously, and allows the user to select the most suitable control samples for each case. Users must have at least moderate programming skills and a basic knowledge of R, as they need to use command lines to prepare the sample data and write a basic R script for CNV calling. For analysis, a minimum of 3 input samples is required, all of which must be preprocessed with BEDtools to extract the corresponding DoC. Regions within the sex chromosomes are excluded from the analysis. To reduce DoC biases between samples, the DoC of each control sample is normalized to achieve the same average coverage over all targets as in the case sample. The DoC is assumed to follow a negative binomial distribution, with mean and variance equal to the weighted mean and variance of the normalized DoC of the control samples (the weights for each control sample are obtained by regression of the test samples’ DoC against that of the normalized controls). To establish variance, a floor is set based on the mean DoC and GC content. Finally, a hidden Markov model is used to identify and merge exons with equal number of copies. Two parameters serve as quality metrics: the ‘NQ’ score (the Phredscaled probability that the corresponding sample contains no CNV in the region), and the ‘SQ’ score (the Phred-scaled probability that any of the targets in that region contain a CNV). The optimal limits of these 2 parameters are not provided.
4.9. CoNVaDING With this method [56] the user can introduce a large dataset of possible control samples and let the program select those most similar to the case sample, precluding the need for user selection of the most appropriate samples. Several eligibility parameters allow optimization of the analysis, including number of optimal controls, minimum coefficient of variation, and other parameters specific to the method itself. Moderate programming skills are required, given that several commands must be introduced in the correct order, and appropriate settings and inputs selected. The method has several notable drawbacks: in version 1.1.6 targets with an average coverage of 0 must be excluded from the analysis, precluding detection of homozygous deletions (this issue has been resolved in v1.2.0), and a minimum of 6 control samples are required (although the authors recommend 30 control samples). Finally, the computational time and resources required are considerable. To select the optimal set of controls, for each sample the method divides the mean DoC of each exon by the mean DoC of all autosomes, and selects the controls displaying the smallest differences with respect to test sample values (absolute values). Next, 2 distinct normalization strategies are applied: in the first (type A), for each sample the average
4.7. CODEX CODEX [54] can be used with any operating system and estimates the number of copies for each region. The user can choose different thresholds for several parameters (coverage, exon length, GC content, and mappability), and thus select the best combination for each experiment. A moderate knowledge of R is required (write a script using all the functions needed to perform an analysis, in the correct order). This tool also depends on other R packages, which may affect its functionality depending on the version of each used. To normalize the DoC of each exon, this method assumes that the DoC of all bps 119
Mutation Research-Reviews in Mutation Research 779 (2019) 114–125
I. Roca, et al.
DoC of each exon is divided by the average DoC of all autosomal chromosomes; in the second (type B), for each sample the average DoC of each exon is divided by the average DoC of all the exons belonging to the same gene. Three different values are calculated for each exon: 1) the result of type A normalization of the test sample is divided by the result of type A normalization of the controls (values < 0.65 are indicative of deletions, and those > 1.4 are indicative of duplications); 2) the result of type A normalization of the controls is subtracted from the result of type A normalization of the test sample, and the value obtained divided by the standard deviation of the result of type A normalization of the controls (values > 3 are indicative of a duplication, and those < 3 are indicative of a deletion); 3) the process described in 2 is applied to the product of type B normalization. Only exons with values indicative of a CNV in all 3 cases are reported by the program. No parameters are provided to allow users to prioritize results.
created using a pool of control samples) used to obtain the log2 copy ratios for each bin. Bias correction is also implemented in the normalization step: the GC-content, repeat-masked fraction, and target density values are obtained for each bin. Next, the bins are sorted by these values, and the rolling median of the bins' log2 ratios is then subtracted from each log2 ratio (since that median value is considered to represent the expected bias for each bin). A notable drawback is that all regions examined are reported in the final output, but since the copy number of each region is reported, users can filter the results by choosing those with a copy number other than 2. Another important issue is that to use this tool the user must first install several Python and R packages, and CNV calling requires the use of multiple command lines in a specific order, meaning that users must have moderate programming skills. 4.12. SeqCNV
4.10. DECoN
This tool [59] is easy to use for users with basic programming skills. Only one command line with the path to the input files, as well as the path to the program folder and the output path, is needed to run the program. A key drawback is that SeqCNV only allows one control sample per analysis, and is thus adequate for matched control/case studies (e.g., cancer studies) but not for the detection of rare CNVs associated with rare disorders. Moreover, the recursive formula used (with a computational complexity of O(n²)) means that analyses require considerable time and computational resources. This method considers each read start point as a possible start point for a CNV. For the segment between 2 start points of consecutive reads, the tool develops a mathematical model based on the numbers of test and control reads, and estimates the probability that a given read belongs to the test or the control sample, assuming all reads are mixed together. Next, SeqCNV performs maximum penalized log-likelihood estimation (MPLE) to determine the boundaries of the CNVs, as well as the copy number; the penalization parameter of the MPLE is obtained based on the total number of reads. To solve the MPLE model, a recursive formula is applied using double loops. Once the candidate CNV regions are determined, only those with a likelihood value < 0.6 (for deletions) or > 1.4 (for duplications) are reported. The parameter used to prioritize likely CNVs is the “r_on” value (the value of the maximum penalty log-likelihood for the region). Values < 0.6 indicate deletions, while those > 1.4 indicate duplications (the lower or higher the r_on, the most likely the CNV is real).
This method [57] is the result of modification of ExomeDepth v1.0.0. The main difference is that the transition probabilities of the hidden Markov model have been altered to depend upon the distance between exons, so that exons adjacent in the list of targeted regions are treated independently if they are located too far apart on the chromosome. Therefore, this tool has the same strengths as ExomeDepth (can be used with any operating system, allows automated selection of optimal control samples, and performs rapid analyses), but is much easier to use (the user only has to run the provided R scripts or executables). Furthermore, program setup involves automatic download and installation of all the appropriate versions of all necessary packages and dependencies. Finally, all samples introduced are analyzed simultaneously, saving the user a considerable amount of time. For Windows users, there are 3 executables. The first is ReadInBams.bat (data-reading function), where the following data must be entered: 1) the list with the folder containing the bams and their names, and 2) the BED file for the targeted regions. Starting with this information the executable provides the user with a RData file. The second (optional) executable is IdentifyFailures.bat, which identifies regions that do not pass the quality threshold. Here, the following information must be entered: the output from the previous step, and (optionally) the minimum correlation between samples and the minimum coverage deemed acceptable for the analysis. Finally, the makeCNVcalls.bat function identifies the CNVs, and requires entry of the RData file created by the first executable. For Linux/Mac OS users 3 R scripts are required to run the program. These are equivalent to the executables in Windows. A key weakness of this tool should be noted: given that the method implements strict version control over the R-packages and dependencies, it changes local default R settings, and therefore care is required when installing DECoN to ensure that these settings are not altered in other analyses or R-packages. The parameters used to prioritize CNVs are the same as those used in ExomeDepth; the Bayes factor (higher values indicate a greater likelihood of a real CNV) and the ratio between expected and observed reads.
5. Comparison between methods 5.1. Results with the simulated dataset We used the simulated tg-NGS data described in 3.2 to evaluate the performance of each tool. The algorithms were run using default parameters. For ExomeCNV, only regions with ratios < 0.6 (for deletions) or > 1.3 (for duplications) and an average coverage for control samples of over 10X in the region of interest were considered. For CNVkit, only regions with a number of copies other than 2 were considered. Each exon was considered separately to calculate the total number of false negatives, false positives, and other parameters. Each tool was evaluated based on the following parameters: the number of control samples used (from 1 to 10 control samples), the mean DoC of the samples (300X vs 50X), simple vs compound CNVs, CNV type (duplication vs deletion), and CNV length. We used the following versions of each CNV-detection tool: ExomeCNV v1.4, ExomeCopy v1.18.0, CONTRA v2.0.4, ExomeDepth v1.1.10, CONIFER v0.2.2, CANOES (no version number provided), CODEX v1.7.0, CLAMMs v1.0, CoNVaDING v1.2.1, DECoN v1.0.1, CNVkit v0.9.6, and SeqCNV (no version number provided). First, we evaluated the performance at 300X DoC depending on the number of control samples used. The results are summarized in
4.11. CNVkit CNVkit [58] is a command-line toolkit and Python library available for Ubuntu or Debian Linux and Mac OS X. Although designed for hybrid capture sequencing data, it can be used for WGS and targeted amplicon sequencing by changing the ‘method’ option to ‘wgs’ or ‘amplicon’, respectively. The main difference between this tool and the others is the use of both on- and off-target sequencing reads to call CNVs: the mean DoC of each bin located between targeted regions is combined with the mean DoC of the targeted regions, and (after normalizing to a reference 120
Mutation Research-Reviews in Mutation Research 779 (2019) 114–125
I. Roca, et al.
Fig. 1. A,B,C,D Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) of each tool (DECoN, ExomeDepth, ExomeCNV, CoNVaDING, CANOES, CNVkit, CLAMMS, CODEX, SeqCNV, CONTRA, ExomeCopy, and CONIFER) using 220 samples simulated with a mean DoC of 300X, with different numbers of control samples (1, 5 and 10). E,F Number of false negatives detected by each tool using 100 samples simulated with a mean DoC of 300X and 100 samples simulated with a mean DoC of 50 × .
Fig. 1(A–D). The most sensitive methods, regardless of the number of control samples used, were ExomeDepth and DECoN (> 0.99), followed closely by ExomeCNV (> 0.93). CONIFER, ExomeCopy, and CONTRA were the least sensitive. It should be noted that CONTRA performs far better with a single control sample than with 2 or more, probably due to the baseline used with the control samples. All methods showed comparable specificity (> 0.95), except for ExomeCopy (˜0.75) and CLAMMS when run using a single control sample (˜0.21). The methods with a higher positive predictive value (PPV, or precision) were ExomeDepth, CANOES, CoNVaDING, DECoN, and ExomeCNV (> 0.91). CONTRA (˜0.56 for 1 control sample, but ˜0.02 for 2 or more samples) and ExomeCopy (˜0.01) returned the lowest PPV. All methods produced similar negative predictive values (NPV; > 0.99). To evaluate the influence of sample DoC (300X vs 50X), we compared the number of false negatives produced by each tool using data from 100 simulated samples at both levels of coverage (Fig. 1E and F). All tools produced better results with higher levels of coverage, except for CNVkit, which, surprisingly, performed slightly better in the 50X scenario, although no statistically significant differences were observed. CANOES and CONIFER were the methods least affected by DoC differences (Wilcoxon test, p > 0.05). We also evaluated differences in sensitivity between simple CNVs (1 CNV per sample) and compound CNVs (> 2 CNVs in different genes) using simulated data from 137 and 83 samples, respectively (Fig. 2A
and B); between deletions (76 samples) and duplications (61 samples), using only simple CNVs to avoid bias (Fig. 2C and D); and between small (1 exon) and large (> 1exon) CNVs (38 and 99 samples, respectively; Fig. 2E and F). For DECoN, ExomeDepth, CoNVaDING, CANOES, and CONTRA, no significant differences in the detection of simple or compound CNVs were observed (p > 0.05). ExomeCNV (p = 0.002), CODEX (p = 0.006), ExomeCopy (p = 0.002), CONIFER (p = 0.03), and SeqCNV performed better with compound CNVs, while the performance of CLAMMS and CNVkit was better with simple CNVs (p = 0.002). For DECoN (p = 0.004), ExomeDepth (p = 0.004), and CONTRA (p = 0.002), duplications were easier to detect than deletions, while ExomeCNV (p = 0.002), CNVkit (p = 0.002), CANOES (p = 0.007), CLAMMS (p = 0.002), CODEX (p = 0.002), SeqCNV, and ExomeCopy (0.002) detected more deletions than duplications. CNV size had no effect on the detection rates of DECoN, ExomeDepth, CoNVaDING, or CANOES (p > 0.05). ExomeCNV (p = 0.002) and SeqCNV performed better with small CNVs, while for CLAMMS (p = 0.004), CODEX (p = 0.002), CONTRA (p = 0.002), CNVkit (p = 0.003), ExomeCopy (p = 0.006) and CONIFER (p = 0.03), large CNVs were easier to detect. Since none of the simulated tools achieved 100% sensitivity, we sought to identify the optimal combination required to minimize the number of false negatives. To this end, we categorized a region as a CNV if classified as such by at least 3 tools, after combining between 4 121
Mutation Research-Reviews in Mutation Research 779 (2019) 114–125
I. Roca, et al.
Fig. 2. For each tool, the percentage of false negatives (%FN) over the total number of real positives in several scenarios, considering different numbers of control samples (1–10), is shown. A %FN using 137 simulated samples with simple CNVs (1 CNV per sample). B %FN using 83 simulated samples with compound CNVs (> 1 CNV per sample). C %FN using 76 simulated samples with deletions (only simple CNVs). D %FN using 61 simulated samples with duplications (only simple CNVs). E %FN using 38 simulated samples with small CNVs (1 exon, only simple CNVs). F %FN using 99 simulated samples with large CNVs (> 1 exon, only simple CNVs).
included 2 DEL-het of a single exon, 1 DEL-het of 2 exons, 1 DEL-het of 3 exons, and 1 DUP of 3 exons (total, 10 regions with CNVs and 11,083 normal regions). The results of this comparison are summarized in Table 4. ExomeDepth, DECoN, and ExomeCNV detected all CNVs, while CoNVaDING did not detect the 3 duplicated exons, and CANOES failed to detect both the 3 duplicated exons and 1 single-exon deletion (it should be noted that the duplicated exons were located in chrX, and CANOES does not analyze sex chromosomes). Regarding the number of false positives, CANOES (0 F P), ExomeDepth (1 F P), and CoNVaDING (2 F P) were the best performing tools (DECoN, 18 F P; ExomeCNV, 44 F P).
and 10 programs. The results for the simulated samples at 300X are summarized in Table 3. When using 5 or fewer control samples, the results obtained were no better than those produced by DECoN or ExomeDepth; the number of false negatives was slightly lower but the number of false positives was much higher. Using 6 or more control samples, the best results (in terms of sensitivity) were obtained using a combination of 9 tools (DECoN, ExomeDepth, CoNVaDING, CANOES, CNVkit, CLAMMS CODEX, ExomeCopy, and CONIFER), with 0 false negatives. A 100% sensitivity was also achieved using a combination of 10 tools (DECoN, ExomeDepth, ExomeCNV, CoNVaDING, CANOES, CNVkit, CLAMMS, CODEX, CONIFER, and ExomeCopy), but the number of false positives increased considerably.
6. Conclusions
5.2. Results with real data
The main challenges in CNV detection are experimental biases caused by GC content [81] and repetitive DNA elements [85], and other biases such as poor mappability and technical variability in library preparation, capture, and sequencing. To overcome the GC-content bias, some CNV detection tools (ExomeDepth, ExomeCopy, CODEX, DECoN, CLAMMS, and CANOES) employ GC-content correction, while others (ExomeCNV, CoNVaDING, and CONTRA) reduce this bias by using the ratios between samples [80]. There are certain similarities between the tools described here. Some programs (e.g., ExomeCNV) have a specific feature whereby the copy number is assumed to follow a Poisson distribution, which
Based on the results obtained with the simulated dataset, we selected the 5 best-performing tools (DECoN, ExomeDepth, ExomeCNV, CANOES, and CoNVaDING) and used a small dataset of real samples with high mean DoC (analyzed using different tg-NGS panels) to compare their performance in a “real-world” example. In-solution hybridization capture (SureSelect XT; Agilent Technologies) was used for enrichment, captured fragments were sequenced as paired-end 100base reads on the MiSeq platform (Illumina Inc.), and reads were aligned to the reference genome hg19 using BWA. We used 8 different samples with no CNVs as controls for each test sample. The CNVs 122
X X X X X X X
2 2 1 1 – – –
5 5 61 69 – – –
3 3 2 1 – – –
FN
FP
FN
X X X
CoNVaDING
5 controls
X
X
X X
eCNV
4 controls
X X X X X X X
eDepth
6 6 49 77 – – –
FP
X X X X
CANOES
X X X X X X X
X X X X X X X
3 3 2 1 0 0 0
FN 6 6 33 43 39 155 230
FP
CLAMMS
6 controls
CNVkit
X X
X X
CODEX
4 3 1 1 1 0 0
FN
7 controls
SeqCNV
6 6 44 51 49 148 222
FP
X
CONTRA
X X
X X X
2 2 1 1 1 0 0
FN
5 4 3 2 – – –
3 3 2 1 0 0 0
FN
9 controls
54 103 859 594 – – –
3 3 2 2 – – –
6 6 43 70 101 293 368
FP
FN
FN
FP
2 controls
1 control
6 6 43 53 68 285 361
FP
CONIFER
8 controls
eCopy
6 7 38 283 – – –
FP
2 2 1 1 0 0 0
FN
10 controls
2 2 1 1 – – –
FN
3 controls
7 7 45 86 102 306 380
FP
6 6 42 54 – – –
FP
123
Sample T1 T2 T3 T4 T5
N_bp 513,533 515,928 513,624 513,887 124,700
N_exons 2,610 2,621 2,609 2,608 645
Mean DoC 554X 621X 547X 601X 757X
CNV Type DEL-het DEL-het DEL-het DEL-het DUP
CNV region E7-E8 ASS1 gene(NM_054012) E1-E2-E3 GLDC gene(NM_000170) E5 PAH gene(NM_000277) E6 PDHX gene(NM_001135024) E6-E7-E8 CASK gene(NM_003688)
CNV chrom position chr9:133346195-133346927 chr9:6620158-6645549 chr12:103260348-103260466 chr11:34988161-34988386 chrX:41519666-41530808
#FN 0 0 0 0 0
DECoN #FP 13 0 1 3 1
#FN 0 0 0 0 0
#FP 0 0 1 0 0
ExomeDepth
#FN 0 0 0 0 0
#FP 30 5 2 6 1
ExomeCNV
#FN 0 0 0 0 3
#FP 2 0 0 0 0
CoNVaDING
#FN 0 0 1 0 3
CANOES #FP 0 0 0 0 0
Table 4 Performance of the 5 tools that showed better results in the simulated dataset (DECoN, ExomeDepth, ExomeCNV, CoNVaDING, and CANOES) with a small dataset of real samples. For each test sample, 8 different samples with no CNVs were used as controls. N_bp: number of base pairs analyzed in the tg-NGS panel; N_exons: number of exons analyzed in the tg-NGS panel; Mean DoC: average depth of coverage for the sample; CNV Type: type of the CNV; CNV region: region of the CNV; CNV chrom position: chromosomal position of the CNV (according to the hg19 reference genome); #FN: number of false negatives; #FP: number of false positives.
4 programs 5 programs 6 programs 7 programs 8 programs 9 programs 10 programs
4 programs 5 programs 6 programs 7 programs 8 programs 9 programs 10 programs
DECoN
Table 3 Table shows the best combination (in terms of sensitivity) of tools for different scenarios (4–10 programs), and the numbers of false negatives (FN) and false positives (FP) obtained in test samples at 300X according to the number of control samples used (1–10).
I. Roca, et al.
Mutation Research-Reviews in Mutation Research 779 (2019) 114–125
Mutation Research-Reviews in Mutation Research 779 (2019) 114–125
I. Roca, et al.
converges to a normal distribution, while others (e.g., CONIFER) employ singular value decomposition for CNV calling. DECoN appears to be the best method for the analysis of tg-NGS gene panel data. It was the most sensitive, is fast and easy to use, selects the optimal controls for each sample, and analyzes all samples simultaneously. These features make this tool suitable for almost any laboratory using NGS data. DECoN’s main drawback is that it is not entirely intuitive for users who have never worked in a terminal. Even in Windows, the executable does not open a graphical interface, but rather opens a terminal, and the user must type the path to the files (or the list in plain text). Moreover, at least 2 executables are required (the first generates an output that is fed into the second). ExomeDepth is also an excellent tool: it is very sensitive, very fast, and selects the optimal controls for each sample. However, despite the detailed user instructions provided, at least moderate R programming skills are required, and ExomeDepth’s dependence on multiple packages can result in incompatibilities between the R version the tool requires and the one installed. ExomeCNV performed well, but suffers from several drawbacks: the output file reports all regions sorted by chromosome, from which the user must select those most likely to be pathogenic (based on the value of the log-ratio or the ratio itself); the data must be preprocessed using GATK (or a script written to convert data to that format); and users must have moderate R skills. Although these 3 tools performed best with the simulated data and the real samples, it should be noted that we used default parameters for all tools, and therefore some could potentially have produced better results after adjustment of certain parameters. To improve these 3 programs to allow their introduction into the workflow of a small laboratory, and their use by staff without bioinformatics expertise, a graphical interface should be designed, in which the user could press a button to enter input files and select the necessary options and the entire process could be performed automatically. To prioritize the most likely CNVs, DECoN and ExomeDepth use the Bayes factor and the ratio of observed/expected number of reads. By contrast, ExomeCNV uses the log-ratios values or the ratio itself. These tools could also be improved by providing an optimal threshold for these parameters, allowing the user to more rapidly identify the most likely real CNVs as well as possible false positives. The combination of these CNV detection tools proved effective to increase sensitivity only when using more than 5 control samples and when 9 or 10 different tools were combined, which considerably increases the computational time required for analysis. The approach we used was to categorize a region as a CNV if classified as such by at least 3 tools. However, other approaches would provide different results, with higher sensitivity and lower specificity in less restrictive scenarios (e.g., CNV calling based on detection by at least 2 tools), and lower sensitivity and higher specificity in more restrictive scenarios (e.g., CNV calling based on detection by at least 4 tools).
[2] M. Zarrei, J.R. MacDonald, D. Merico, S.W. Scherer, A copy number variation map of the human genome, Nat. Rev. Genet. 16 (2015) 172–183. [3] R. Redon, S. Ishikawa, K.R. Fitch, L. Feuk, G.H. Perry, T.D. Andrews, et al., Global variation in copy number in the human genome, Nature. 444 (2006) 444–454. [4] K.K. Wong, R.J. deLeeuw, N.S. Dosanjh, L.R. Kimm, Z. Cheng, D.E. Horsman, et al., A comprehensive analysis of common copy-number variations in the human genome, Am. J. Hum. Genet. 80 (2007) 91–104. [5] I. Ionita-Laza, A.J. Rogers, C. Lange, B.A. Raby, C. Lee, Genetic association analysis of copy-number variation (CNV) in human disease pathogenesis, Genomics 93 (2009) 22–26. [6] F. Zhang, W. Gu, M.E. Hurles, J.R. Lupski, Copy number variation in human health, disease, and evolution, Annu. Rev. Genomics Hum. Genet. 10 (2009) 451–481. [7] R. Lucito, J. Healy, J. Alexander, A. Reiner, D. Esposito, M. Chi, et al., Representational oligonucleotide microarray analysis: a high-resolution method to detect genome copy number variation, Genome Res. 13 (2003) 2291–2305. [8] D. Pinkel, R. Segraves, D. Sudar, S. Clark, I. Poole, D. Kowbel, et al., High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays, Nat. Genet. 20 (1998) 207–211. [9] J.R. Pollack, C.M. Perou, A.A. Alizadeh, M.B. Eisen, A. Pergamenschikov, C.F. Williams, et al., Genome-wide analysis of DNA copy-number changes using cDNA microarrays, Nat. Genet. 23 (1999) 41–46. [10] N.P. Carter, Methods and strategies for analyzing copy number variation using DNA microarrays, Nat. Genet. 39 (2007) S16–S21. [11] A.M. Snijders, N. Nowak, R. Segraves, S. Blackwood, N. Brown, J. Conroy, et al., Assembly of microarrays for genome-wide measurement of DNA copy number, Nat. Genet. 29 (2001) 263–264. [12] J. Shendure, H. Ji, Next-generation DNA sequencing, Nat. Biotechnol. 26 (2008) 1135–1145. [13] D.B. Goldstein, A. Allen, J. Keebler, E.H. Margulies, S. Petrou, S. Petrovski, et al., Sequencing studies in human genetics: design and interpretation, Nat. Rev. Genet. 14 (2013) 460–470. [14] T. Tucker, M. Marra, J.M. Friedman, Massively parallel sequencing: the next big thing in genetic medicine, Am. J. Hum. Genet. 85 (2009) 142–154. [15] L.M. Baudhuin, A new era of genetic testing and its impact on research and clinical care, Clin. Chem. 58 (2012) 1070–1071. [16] C. Alkan, B.P. Coe, E.E. Eichle, Genome structural variation discovery and genotyping, Nat. Rev. Genet. 12 (2011) 363–376. [17] M. Meyerson, S. Gabriel, G. Getz, Advances in understanding cancer genomes through second-generation sequencing, Nat. Rev. Genet. 11 (2010) 685–696. [18] P.J. Campbell, P.J. Stephens, E.D. Pleasance, S. O’Meara, H. Li, T. Santarius, et al., Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing, Nat. Genet. 40 (2008) 722–729. [19] S. Ivakhno, T. Royce, A.J. Cox, D.J. Evers, R.K. Cheetham, S. Tavaré, CNAseg–a novel framework for identification of copy number changes in cancer from secondgeneration sequencing data, Bioinformatics 26 (2010) 3051–3058. [20] P. Medvedev, M. Fiume, M. Dzamba, T. Smith, M. Brudno, Detecting copy number variation with mated short reads, Genome Res. 20 (2010) 1613–1622. [21] M. Pirooznia, F.S. Goes, P.P. Zandi, Whole-genome CNV analysis: advances in computational approaches, Front. Genet. 6 (2015) 138. [22] M. Zhao, Q. Wang, Q. Wang, P. Jia, Z. Zhao, Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives, BMC Bioinformatics 14 (2013) S1. [23] D.Y. Chiang, G. Getz, D.B. Jaffe, M.J. O’Kelly, X. Zhao, S.L. Carter, et al., Highresolution mapping of copy-number alterations with massively parallel sequencing, Nat. Methods 6 (2009) 99–103. [24] J.O. Korbel, A. Abyzov, X.J. Mu, N. Carriero, P. Cayting, Z. Zhang, et al., PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data, Genome Biol. 10 (2009) R23. [25] C. Xie, M.T. Tammi, CNV-seq.; a new method to detect copy number variation using high-throughput sequencing, BMC Bioinformatics 10 (2009) 80. [26] S. Sindi, E. Helman, A. Bashir, B.J. Raphael, A geometric approach for classification and comparison of structural variants, Bioinformatics 25 (2009) i222–i230. [27] F. Hormozdiari, C. Alkan, E.E. Eichler, S.C. Sahinalp, Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes, Genome Res. 19 (2009) 1270–1278. [28] K. Chen, J.W. Wallis, M.D. McLellan, D.E. Larson, J.M. Kalicki, C.S. Pohl, et al., BreakDancer: An algorithm for high resolution mapping of genomic structural variation, Nat. Methods 6 (2009) 677–681. [29] S. Yoon, Z. Xuan, V. Makarov, K. Ye, J. Sebat, Sensitive and accurate detection of copy number variants using read depth of coverage, Genome Res. 19 (2009) 1586–1592. [30] J.F. Nijkamp, M.A. van den Broek, J.M. Geertman, M.J. Reinders, J.M. Daran, D. de Ridder, De novo detection of copy number variation by co-assembly, Bioinformatics 28 (2012) 3195–3202. [31] K. Ye, M.H. Schulz, Q. Long, R. Apweiler, Z. Ning, Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads, Bioinformatics. 25 (2009) 2865–2871. [32] S.M. Waszak, Y. Hasin, T. Zichner, T. Olender, I. Keydar, M. Khen, et al., Systematic inference of copy-number genotypes from personal genome sequencing data reveals extensive olfactory receptor gene content diversity, PLoS Comput. Biol. 6 (2010) e1000988. [33] A. Abyzov, A.E. Urban, M. Snyder, M. Gerstein, CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing, Genome Res. 21 (2011) 974–984. [34] R. Xi, A.G. Hadjipanayis, L.J. Luquette, T.M. Kim, E. Lee, J. Zhang, et al., Copy
Conflict of interest statement The authors declare that there are no conflicts of interest. Acknowledgments We thank Rosanna Abal García for her invaluable laboratory work, and Belén Pérez for sharing DNA samples with known CNVs to use as positive controls. This work was supported by the Instituto de Salud Carlos III [grant numbers: PI13/02177, JR13/00019, and PI13/02451]. References [1] J.L. Freeman, G.H. Perry, L. Feuk, R. Redon, S.A. McCarroll, D.M. Altshuler, et al., Copy number variation: new insights in genome diversity, Genome Res. 16 (2006) 949–961.
124
Mutation Research-Reviews in Mutation Research 779 (2019) 114–125
I. Roca, et al.
[35] [36]
[37] [38] [39] [40] [41] [42] [43] [44] [45] [46] [47] [48] [49] [50] [51] [52] [53] [54] [55]
[56] [57] [58] [59]
[60] B. Liu, C.D. Morrison, C.S. Johnson, D.L. Trump, M. Qin, J.C. Conroy, et al., Computational methods for detecting copy number variations in cancer genome using next generation sequencing: principles and challenges, Oncotarget 4 (2013) 1868–1881. [61] D.C. Richter, F. Ott, A.F. Auch, R. Schmid, D.H. Huson, MetaSim: a sequencing simulator for genomics and metagenomics, PLoS One 3 (2008) e3373. [62] S. Balzer, K. Malde, A. Lanzén, A. Sharma, I. Jonassen, Characteristics of 454 pyrosequencing data–enabling realistic simulation with flowsim, Bioinformatics 26 (2010) i420–i425. [63] Y. Ono, K. Asai, M. Hamada, PBSIM: PacBio reads simulator–toward accurate genome assembly, Bioinformatics. 29 (2013) 119–121. [64] M.L. Engle, C. Burks, GenFrag 2.1: new features for more robust fragment assembly benchmarks, Comput. Appl. Biosci. 10 (1994) 567–568. [65] M. Holtgrewe, Mason–A Read Simulator for Second Generation Sequencing Data. TR-B-10-06, Institut Für Mathematik Und Informatik, Freie Universität Berlin, 2010. [66] K.E. McElroy, F. Luciani, T. Thomas, GemSIM: general, error-model based simulator of next-generation sequencing data, BMC Genomics 13 (2012) 74. [67] W. Huang, L. Li, J.R. Myers, G.T. Marth, ART: a next-generation sequencing read simulator, Bioinformatics. 28 (2012) 593–594. [68] Dwgsim: https://github.com/nh13/DWGSIM (Accessed 2 April 2018). [69] X. Hu, J. Yuan, Y. Shi, J. Lu, B. Liu, Z. Li, et al., pIRS: profile-based Illumina pair-end reads simulator, Bioinformatics. 28 (2012) 1533–1535. [70] J.C. Mu, M. Mohiyuddin, J. Li, N. Bani Asadi, M.B. Gerstein, A. Abyzov, et al., VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications, Bioinformatics 31 (2015) 1469–1471. [71] C. Bartenhagen, M. Dugas, RSVSim: an R/Bioconductor package for the simulation of structural variations, Bioinformatics 29 (2013) 1679–1681. [72] S. Killcoyne, A. del Sol, FIGG: simulating populations of whole genome sequences for heterogeneous data analyses, BMC Bioinformatics 15 (2014) 149. [73] M. Qin, B. Liu, J.M. Conroy, C.D. Morrison, Q. Hu, Y. Cheng, et al., SCNVSim: somatic copy number variation and structure variation simulator, BMC Bioinformatics 16 (2015) 66. [74] K. Břinda, V. Boeva, G. Kucherov, RNF: a general framework to evaluate NGS read mappers, Bioinformatics. 32 (2016) 136–139. [75] M. Fromer, J.L. Moran, K. Chambert, E. Baks, S.E. Bergen, D.M. Ruderfer, et al., Discovery and statistical genotyping of copy-number variation from whole-exome sequencing depth, Am. J. Hum. Genet. 91 (2012) 597–607. [76] J. Wu, K.R. Grzeda, C. Stewart, F. Grubert, A.E. Urban, M.P. Snyder, et al., Copy Number Variation detection from 1000 Genomes project exon capture sequencing data, BMC Bioinformatics 13 (2012) 305. [77] S. Kim, K. Jeong, V. Bafna, Wessim: a whole-exome sequencing simulator based on in silico exome capture, Bioinformatics. 29 (2013) 1076–1077. [78] TargetedSim: http://sourceforge.net/projects/targetedsim/files/TargetedSim (Accessed 2 April 2018). [79] D. Sims, I. Sudbery, N.E. Ilott, A. Heger, C.P. Ponting, Sequencing depth and coverage: key considerations in genomic analyses, Nat. Rev. Genet. 15 (2014) 121–132. [80] S.M. Teo, Y. Pawitan, C.S. Ku, K.S. Chia, A. Salim, Statistical challenges associated with detecting copy number variations with next-generation sequencing, Bioinformatics 28 (2012) 2711–2718. [81] Y. Benjamini, T.P. Speed, Summarizing and correcting the GC content bias in highthroughput sequencing, Nucleic Acids Res. 40 (2012) e72. [82] Y.C. Chen, T. Liu, C.H. Yu, T.Y. Chiang, C.C. Hwang, Effects of GC Bias in nextgeneration-Sequencing data on de novo genome assembly, PLoS One 8 (2013) e62856. [83] M.A. Quail, I. Kozarewa, F. Smith, A. Scally, P.J. Stephens, R. Durbin, et al., A large genome center’s improvements to the Illumina sequencing system, Nat. Methods 5 (2008) 1005–1010. [84] D. Aird, M.G. Ross, W.S. Chen, M. Danielsson, T. Fennell, C. Russ, et al., Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries, Genome Biol. 12 (2011) R18. [85] T.J. Treangen, S.L. Salzberg, Repetitive DNA and next-generation sequencing: computational challenges and solutions, Nat. Rev. Genet. 13 (2011) 36–46. [86] O. Harismendy, P.C. Ng, R.L. Strausberg, X. Wang, T.B. Stockwell, K.Y. Beeson, et al., Evaluation of next generation sequencing platforms for population targeted sequencing studies, Genome Biol. 10 (2009) R32. [87] A. Mortazavi, B.A. Williams, K. McCue, L. Schaeffer, B. Wold, Mapping and quantifying mammalian transcriptomes by RNA-Seq, Nat. Methods 5 (2008) 621–628. [88] A.B. Olshen, E.S. Venkatraman, R. Lucito, M. Wigler, Circular binary segmentation for the analysis of array-based DNA copy number data, Biostatistics 5 (2004) 557–572. [89] Picard: http://broadinstitute.github.io/picard (Accessed 2 April 2018).
number variation detection in whole-genome sequencing data using the Bayesian information criterion, Proc. Natl. Acad. Sci. U. S. A. 108 (2011) E1128-E136. Z. Iqbal, M. Caccamo, I. Turner, P. Ficek, G. McVean, De novo assembly and genotyping of variants using colored de Bruijn graphs, Nat. Genet. 44 (2012) 226–232. G. Klambauer, K. Schwarzbauer, A. Mayr, D.A. Clevert, A. Mitterecker, U. Bodenhofer, et al., cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate, Nucleic Acids Res. 40 (2012) e69. E. Bellos, M.R. Johnson, L.J. Coin, cnvHiTSeq: integrative models for high-resolution copy number variation detection and genotyping using population sequencing data, Genome Biol. 13 (2012) R120. H. Wang, D. Nettleton, K. Ying, Copy number variation detection using next generation sequencing read counts, BMC Bioinformatics 15 (2014) 109. R.M. Layer, C. Chiang, A.R. Quinlan, I.M. Hall, LUMPY: a probabilistic framework for structural variant discovery, Genome Biol. 15 (2014) R84. H.T. Nguyen, T.R. Merriman, M.A. Black, The CNVrd2 package: measurement of copy number at complex loci using high-throughput sequencing data, Front. Genet. 5 (2014) 248. R.E. Handsaker, V. Van Doren, J.R. Berman, G. Genovese, S. Kashin, L.M. Boettger, et al., Large multiallelic copy number variations in humans, Nat. Genet. 47 (2015) 296–303. A. Gillet-Markowska, H. Richard, G. Fischer, I. Lafontaine, Ulysses: accurate detection of low-frequency structural variations in large insert-size sequencing libraries, Bioinformatics 31 (2015) 801–808. B. Sikkema-Raddatz, L.F. Johansson, E.N. de Boer, R. Almomani, L.G. Boven, M.P. van den Berg, et al., Targeted next-generation sequencing can replace Sanger sequencing in clinical diagnostics, Hum. Mutat. 34 (2013) 1035–1042. C.J. Klein, T.M. Foroud, Neurology individualized medicine: when to use nextgeneration sequencing panels, Mayo Clin. Proc. 92 (2017) 292–305. Saudi Mendeliome Group, Comprehensive gene panels provide advantages over clinical exome sequencing for Mendelian diseases, Genome Biol. 16 (2015) 134. P.G. Eijk-Van Os, J.P. Schouten, Multiplex Ligation-dependent Probe Amplification (MLPA®) for the detection of copy number variation in genomic sequences, Methods Mol. Biol. 688 (2011) 97–126. S.N. Peirson, J.N. Butler, Quantitative polymerase chain reaction, Methods Mol. Biol. 362 (2007) 349–362. J.F. Sathirapongsasuti, H. Lee, B.A. Horst, G. Brunner, A.J. Cochran, S. Binder, et al., Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV, Bioinformatics 27 (2011) 2648–2654. M.I. Love, A. Myšičková, R. Sun, V. Kalscheuer, M. Vingron, S.A. Haas, Modeling read counts for CNV detection in exome sequencing data, Stat. Appl. Genet. Mol. Biol. (2011) 10. J. Li, R. Lupat, K.C. Amarasinghe, E.R. Thompson, M.A. Doyle, G.L. Ryland, et al., CONTRA: copy number analysis for targeted resequencing, Bioinformatics 28 (2012) 1307–1313. V. Plagnol, J. Curtis, M. Epstein, K.Y. Mok, E. Stebbings, S. Grigoriadou, et al., A robust model for read count data in exome sequencing experiments and implications for copy number variant calling, Bioinformatics 28 (2012) 2747–2754. N. Krumm, P.H. Sudmant, A. Ko, B.J. O’Roak, M. Malig, B.P. Coe, et al., Copy number variation detection and genotyping from exome sequence data, Genome Res. 22 (2012) 1525–1532. D. Backenroth, J. Homsy, L.R. Murillo, J. Glessner, E. Lin, M. Brueckner, et al., CANOES: detecting rare copy number variants from whole exome sequencing data, Nucleic Acids Res. 42 (2014) e97. Y. Jiang, D.A. Oldridge, S.J. Diskin, N.R. Zhang, CODEX: a normalization and copy number variation detection method for whole exome sequencing, Nucleic Acids Res. 43 (2015) e39. J.S. Packer, E.K. Maxwell, C. O’Dushlaine, A.E. Lopez, F.E. Dewey, R. Chernomorsky, et al., CLAMMS: a scalable algorithm for calling common and rare copy number variants from exome sequencing data, Bioinformatics 32 (2016) 133–135. L.F. Johansson, F. van Dijk, E.N. de Boer, K.K. van Dijk-Bos, J.D. Jongbloed, A.H. van der Hout, et al., CoNVaDING: single exon variation detection in targeted NGS data, Hum. Mutat. 37 (2016) 457–464. A. Fowler, S. Mahamdallie, E. Ruark, S. Seal, E. Ramsay, M. Clarke, et al., Accurate clinical detection of exon copy number variants in a targeted NGS panel using DECoN, Wellcome Open Res. 1 (2016) 20. E. Talevich, A.H. Shain, T. Botton, B.C. Bastian, CNVkit: genome-wide copy number detection and visualization from targeted DNA sequencing, PLoS Comput. Biol. 12 (4) (2016) e1004873. Y. Chen, L. Zhao, Y. Wang, M. Cao, V. Gelowani, M. Xu, et al., SeqCNV: a novel method for identification of copy number variations in targeted next-generation sequencing data, BMC Bioinformatics 18 (2017) 147.
125