Chapter 10
Structural Genomic Variation in the Human Genome
Charles Lee
INTRODUCTION The presence of clinically significant structural and copy-number variation in the human genome has been known since the early days of human genetics. Early cytogenetic studies recognized that microscopically visible aberrations such as duplications or deletions of entire chromosomes (aneuploidy) were associated with specific congenital developmental disorders. For example, an extra copy of chromosome 21 (i.e., trisomy 21) – or even a part of chromosome 21 – has long been established as being correlative with the then-called “mongoloid” phenotype that was first recognized in 1866 by John Langdon Down (Down, 1866; Lejeune et al., 1959). While standard karyotyping methods represented the first widely applied “genome-wide” diagnostic tests, it has taken decades of technological innovation to refine estimates of underlying genomic variation in the human genome at the submicroscopic and now nucleotide level. The development of high-resolution assays, capable of systematically detecting small segmental genetic alterations in a genome-wide fashion, has led to the detection of widespread structural genomic variation among the genomes of healthy individuals (Iafrate et al., 2004; Sebat et al., 2004). This finding has sparked intense efforts to identify and characterize the extent of these structural variants (SVs) in human populations and to understand their impact on human health in the context of genomic and personalized medicine.
BASIC PRINCIPLES OF COPY NUMBER VARIATION As the major class of SVs, copy number variants (CNVs) were initially defined as gains and losses of DNA segments of 1 kb or larger in one individual’s genome when compared to another (Feuk et al., 2006; Freeman et al., 2006). More recently, the term CNV has been used by some in a more encompassing manner to include smaller-sized fragments, but is still differentiated from other forms of polymorphism and/or repeated DNAs in the human genome (see Chapter 1), including insertions and deletions (indels), microsatellites, minisatellites, simple repeats (e.g., dinucleotide repeats, trinucleotide repeats, etc.), telomeric and centromeric repetitive DNAs, and most interspersed repetitive elements. In addition, since CNVs contribute to regional imbalances much smaller than entire chromosomes, they are differentiated from whole chromosomal aneuploidies, such as trisomy 21 or Turner syndrome. As CNV data become available for many individuals within a given human population, they can be categorized into biallelic or multiallelic states. Biallelic CNVs have only two alleles and thus produce three different genotypes (Figure 10.1A), while CNVs with more than two alleles are considered multiallelic and result in more than three different genotypes (Figure 10.1B). Heritable CNVs are thought to arise from germline genomic rearrangements (or, in some cases, possibly very
Genomic and Personalized Medicine, 2nd edition by Ginsburg & Willard. DOI: http://dx.doi.org/10.1016/B978-0-12-382227-7.00010-0
Copyright © 2013 2012, Elsevier Inc. All rights reserved.
[Figure 10.1 graphic. Panel A, biallelic CNV alleles (hypothetical gene): reference = 2 copies (40% frequency); duplication CNV = 3 copies (50% frequency); duplication CNV = 4 copies (10% frequency). Panel B, multiallelic CNV alleles (SULT1A1 gene): deletion CNV = 3 copies (10% frequency); reference = 4 copies (60% frequency); duplication CNV = 5 copies (20% frequency). Both panels plot frequency (%) against copy number (1 to 6).]
Figure 10.1 Examples of biallelic and multiallelic CNVs. (A) An example of a biallelic CNV that has a one-copy allele and a two-copy allele. The reference individual has two one-copy alleles, but 50% of individuals in this population have a total of three copies of this gene per cell. All biallelic CNVs have three genotypes per diploid cell, and in this case, copy numbers of two, three, and four per diploid cell. (B) An example of a multiallelic CNV that has zero-, one-, two- and three-copy alleles, resulting in six genotypes in this population. Only the allelic combinations for the three most common genotypes are shown.
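The genotype counts in the caption of Figure 10.1 follow from pairing alleles. The short Python sketch below enumerates the unordered allele pairs for a CNV and the total copy number each pair gives per diploid cell. The function name `diploid_genotypes` and the three-allele example are illustrative assumptions and are not taken from the chapter.

```python
from itertools import combinations_with_replacement


def diploid_genotypes(allele_copy_numbers):
    """Return (allele 1, allele 2, total copies) for every unordered allele pair."""
    alleles = sorted(set(allele_copy_numbers))
    return [(a, b, a + b) for a, b in combinations_with_replacement(alleles, 2)]


# Biallelic CNV from Figure 10.1A: a one-copy allele and a two-copy allele.
biallelic = diploid_genotypes([1, 2])
for a, b, total in biallelic:
    print(f"alleles {a}+{b} -> {total} copies per diploid cell")

# Hypothetical multiallelic CNV with one-, two-, and three-copy alleles.
multiallelic = diploid_genotypes([1, 2, 3])
print(len(multiallelic), "possible allele pairs")
```

For the biallelic case this gives three genotypes with two, three, and four copies per diploid cell. In the three-allele case, the pairs 1+3 and 2+2 both give four copies, which illustrates why total copy number alone does not determine the allelic state.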
early somatic events). The genomic rearrangements (mutational events) that are thought to cause CNVs can be broadly categorized according to mechanism of generation:
1. Non-allelic homologous recombination (NAHR). Homologous recombination occurs between multiple, highly identical sequences in the genome (e.g., segmental duplications or related interspersed repetitive elements) during meiosis (Figure 10.2). DNA sequences that lie between the sequences that recombine will be either duplicated or deleted.
2. Non-homologous end joining (NHEJ). A repair mechanism (that can occur during G1 or G2 of the cell cycle) in which double-strand breaks in the genome are ligated together with the assistance of specific protein complexes that form at the sites of the double-strand breaks. Erroneous repairs can eventually lead to genomic imbalances near the sites of the DNA breaks.
3. Replication-based DNA mechanisms. These involve, for example, stalling of the replication fork and template switching (Lee et al., 2007b). Template switching is thought to require only extremely small regions of homology (microhomology) and likely results in non-recurrent CNVs that are complex in nature or, if facilitated by the presence of cruciform or other non-B DNA structures, less complex in structure.
The rate for NHEJ in particular is likely influenced significantly by environmental factors and localized DNA conformations, but in general has been estimated to occur at a rate of less than 10⁻⁷ per locus per generation (Conrad and Hurles, 2007), similar to the 10⁻⁸ per locus per generation mutational rate estimate for single nucleotide polymorphisms (SNPs). NAHR events are believed to occur more frequently, with estimates of up to 10⁻⁴ per locus per generation (Shaffer and Lupski, 2000). Since NAHR events lead to duplication and deletion of DNA sequences that lie between highly similar sequences in the genome, this mutational event tends to be associated with larger CNVs (Conrad et al., 2010; Redon et al., 2006). With the advent of next-generation DNA sequencing (see Chapter 7), large-scale projects garnering whole genome sequences (such as the 1000 Genomes Project) have been able
[Figure 10.2 graphic. Labels: segmental duplication (allele 1), segmental duplication (allele 2), gene a, gene b; recombination products: gene a and b deleted, gene a and b duplicated.]
Figure 10.2 Non-allelic homologous recombination (NAHR) is a mechanism for generating CNVs during meiosis, in which recombination occurs between non-allelic repeats with >90% sequence homology (indicated by black- and gray-colored DNA segments). Intervening DNA sequences are deleted and duplicated on different chromatids.
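The outcome shown in Figure 10.2 can be mimicked with a toy model in which each chromatid is an ordered list of segments. The Python sketch below is a deliberate simplification under stated assumptions: the segment names (`left`, `SD1`, `SD2`, `right`) and the function `nahr` are hypothetical, and the recombined repeat, which in reality is a hybrid of the two segmental duplications, is simply labeled with the name of the repeat it came from.

```python
def nahr(chromatid_a, chromatid_b, repeat_a, repeat_b):
    """Unequal crossover in which repeat_a on chromatid_a pairs with repeat_b on chromatid_b."""
    i = chromatid_a.index(repeat_a)
    j = chromatid_b.index(repeat_b)
    # Each product joins the left arm of one chromatid (through its repeat)
    # to the right arm of the other chromatid (after its repeat).
    product_1 = chromatid_a[: i + 1] + chromatid_b[j + 1:]
    product_2 = chromatid_b[: j + 1] + chromatid_a[i + 1:]
    return product_1, product_2


# Two genes flanked by a pair of highly similar segmental duplications (SD1, SD2).
chromatid = ["left", "SD1", "gene a", "gene b", "SD2", "right"]

# Misalignment: SD1 on one chromatid recombines with SD2 on the other.
deleted, duplicated = nahr(chromatid, chromatid, "SD1", "SD2")
print("deletion product:   ", deleted)
print("duplication product:", duplicated)
```

One product loses gene a and gene b, and the reciprocal product carries both genes twice, matching the two chromatids drawn in the figure.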
to obtain exact nucleotide breakpoint sequence data for over 10,000 common CNVs, and have therefore been able to classify these CNVs with respect to mechanism of formation (Mills et al., 2011). Approximately 65% of common deletions and 6% of common genomic gains exhibit 2 bp (base pair) microhomologies at the SV breakpoints, characteristic of non-homologous mechanisms of SV formation. Most common genomic gains (67%) are due to mobile element insertions. NAHR-mediated CNVs account for approximately one-third of common deletions and 26% of common genomic gains. CNVs are scattered throughout the human genome, and it is estimated that as many as ~1500 CNVs can be found in a single individual’s genome (Conrad et al., 2010). When one examines all the CNVs identified in a given individual, there are clearly more small CNVs than large CNVs (Figure 10.3). While some CNVs can be greater than 1 Mb (1 megabase, or 1 million base pairs) in size in healthy individuals, the median size of CNVs is approximately 2.9 kb (kilobases). It is important to note that earlier CNV discovery projects relied
[Figure 10.3 graphic. Axis labels: CNV content, Frequency; x-axis: Size (kb), 0 to 100.]
Figure 10.3 Size distribution of CNVs in a given person. Note that there are exponentially more small CNVs than large CNVs in a person’s genome.
on array-based comparative genomic hybridization technologies that used large-insert DNA clones (see section below). Such studies yielded CNVs with ill-defined boundaries and likely overestimated the sizes for many currently known CNVs (Perry et al., 2008). The identification of CNVs from higher-resolution array platforms and more recently from next-generation DNA sequencing projects provides much more accurate CNV breakpoint information and sizes (see later sections). Taken together, when comparing the genomes of two healthy individuals, approximately 0.8% of their genomes differ with respect to CNVs, involving eight times more genome sequence than is accounted for by single nucleotide polymorphism (SNP) differences. Many of the documented CNVs are being cataloged and collated in the Database of Genomic Variants (http://projects.tcag.ca/variation) and the Human Paralogy Database (http://humanparalogy.gs.washington.edu/structuralvariation). Most CNVs found in the genomes of healthy individuals lie within intergenic regions and are biased away from genes (Conrad et al., 2006; Nguyen et al., 2006) and away from ultraconserved elements in the genome that are indicators of important regulatory sequences (Derti et al., 2006). Nevertheless, the remaining CNVs have been shown to overlap some 3000 RefSeq genes and 300 OMIM genes (as listed in the Online Mendelian Inheritance in Man database; http://www.ncbi.nlm.nih.gov/sites/entrez?db=OMIM). Ontological analyses of the genes thought to be copy number variable point to a substantial number that can be classified as “environmental sensor/interaction” genes (Tuzun et al., 2005). These are genes that are involved in sensory perception, neurophysiological processes, drug detoxification, immunity, and inflammation, as well as cell-surface integrity and cell-surface antigens.
Clearly, such genes are not critical for early development but rather are involved in our perception and interaction with external stimuli, helping us to adapt to our ever-changing environment. The functional impact of certain CNVs can be relatively straightforward. For example, reduced copies of a given gene can often be correlated with reduced expression levels and additional copies of a gene could lead to increased expression levels of the CNV gene (McCarroll et al., 2006), provided that the duplicated segment also contains the essential corresponding regulatory elements that drive expression of that gene or efficiently uses a shared regulatory element. CNVs that involve parts of a gene could result in fusion gene products or aberrant proteins with addition or loss of specific protein domains. It has been suggested that CNVs potentially alter the structure of over 3800 gene transcripts as well as lead to the complete loss of function of as many as 247 genes (Conrad et al., 2010). CNVs within intergenic regions can also overlap regulatory elements that affect genes as far away as 4 Mb (Stranger et al., 2007), and correlations of CNVs with transcriptional levels do not necessarily have to be positive (Brown et al., 2012). For example, deletion of a repressor element may cause up-regulation of an associated gene. Moreover, it is plausible that some CNVs disrupt position effects and thereby cause dysregulation of expression of certain genes (Reymond et al., 2007).
DETECTING CNVs IN A GENOME-WIDE MANNER Array-based Comparative Genomic Hybridization There are a number of different genome-wide methods for detecting CNVs. By far the most widely used approach has been array-based comparative genomic hybridization (aCGH). This technology was first introduced as “matrix-CGH” (Solinas-Toldo et al., 1997) and later referred to as “array CGH” (Pinkel et al., 1998). In aCGH, the “test” genome being interrogated is labeled with one type of fluorescent molecule (e.g., Cy5) and a “normal” or “reference” control genome is labeled with another type of fluorescent molecule (e.g., Cy3) (Figure 10.4).
[Figure 10.4 graphic. Panel A: microarray with DNA segments hybridized with test DNA and reference DNA; spots labeled Gain, No change, Loss. Panel B: log2 ratio (−1.2 to 1.2) plotted against position along the genome.]
Figure 10.4 (A) A schematic of an array CGH assay where a test genome (labeled with Cy5, denoted in green) is cohybridized with a reference genome (labeled with Cy3, denoted in red). The DNA probes are mixed and allowed to hybridize to their complementary sequences on the array in a stoichiometric fashion. Fluorescence intensities of the spots on the microarray (each containing a specific DNA sequence) are measured and DNA sequences occurring in greater copy number in the test than in the reference will result in more green fluorescence for those spots on the microarray. A lower copy number of the same DNA sequences will result in more red fluorescence. (B) Typically, the log2 of the fluorescence ratios for each DNA segment on the array is then plotted from one end of a chromosome to the other. A gain is indicated in green and a loss in red. DNA segments having no significant change in DNA copy number in the test sample (with respect to the reference DNA) are indicated in yellow.
The labeled DNAs are combined, denatured, and hybridized to an array of DNA fragments or oligonucleotides on a microscope slide, with each DNA fragment or oligonucleotide representing a unique part of the human genome. The labeled DNAs are then allowed to hybridize to their complementary DNA sequences on the array in a stoichiometric fashion, such that – by measuring the fluorescence ratio of the two fluorescent dyes at each spot on the array – one can infer the relative copy number of that particular DNA sequence in the genome being tested with respect to the reference genome (Figure 10.4B). There has been steady advancement in aCGH technology over the past five years. Originally, aCGH platforms contained several hundred, or even up to a thousand, large-insert DNA clones (e.g., BAC clones that have an average insert size of about 120–150 kb), which covered the human genome at a density of about one clone per ~3 Mb. More recently, in order to increase both coverage and resolution, the trend has been to manufacture arrays that use smaller DNA sequences as hybridization targets but with increasing numbers of targets on an array. On such arrays, targets can be oligonucleotides of 45 to 75 bases in length that have been designed to have similar annealing temperatures (isothermic), based primarily on the length of the oligonucleotide and its GC base content. Two companies that produce such arrays are NimbleGen Systems, Inc. (http://www.nimblegen.com), and Agilent, Inc. (http://www.agilent.com/chem/goCGH). NimbleGen uses a programmable mirror array to synthesize as many as 4.2 million targets directly on a single glass slide using photolithography. Agilent, on the other hand, uses ink-jet technology to synthesize as many as 1.1 million targets on a single glass slide.
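The relationship between relative copy number and the log2 ratio plotted in Figure 10.4B can be made concrete with a small calculation. The Python sketch below gives the idealized log2 ratio expected for a diploid reference, assuming perfectly stoichiometric hybridization and no noise; the function name `expected_log2_ratio` is an illustrative choice, and measured ratios on real arrays will generally deviate from these ideal values.

```python
import math


def expected_log2_ratio(test_copies, reference_copies=2):
    """Idealized log2(test / reference) for a given copy number in the test genome."""
    if test_copies == 0:
        # A homozygous deletion has no finite log2 ratio in this idealized model.
        return float("-inf")
    return math.log2(test_copies / reference_copies)


for copies in (1, 2, 3, 4):
    print(f"{copies} copies in test vs 2 in reference: log2 ratio = "
          f"{expected_log2_ratio(copies):+.2f}")
```

Under these assumptions a single-copy loss gives a log2 ratio of −1, a single-copy gain gives about +0.58, and no change gives 0.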
While oligonucleotide-based arrays offer the advantage of increased coverage and resolution, their primary disadvantage is that each oligonucleotide probe tends to have a lower signal-to-noise ratio than a single large-insert genomic clone target. This results in more experimental “noise” for oligonucleotide-based aCGH assays. Indeed, the typical standard deviation of log2 ratios for an oligonucleotide-based array is approximately 0.25–0.3, five times the standard deviation of log2 ratios obtained for BAC-based arrays. However, the inclusion of hundreds of thousands to millions of targets on a given oligonucleotide-based platform does ultimately provide an assay with increased resolution. The “effective resolution” of one of these platforms depends partially on the minimum number of consecutive probes needed to confidently call a CNV, which in turn is a function of how well the target sequences were chosen to accurately and consistently report copy number changes. Hence, a particular platform that has 500,000 targets and requires only three consecutive probes to make a confident CNV call actually has a higher effective resolution than an array platform with 1 million targets, but requiring ten consecutive probes to make a CNV call, assuming that both platforms distribute targets evenly throughout the genome. Genotyping Platforms High-throughput array technologies for identifying SNPs can now also be used to identify CNVs. In general, these platforms utilize shorter targets (20–30-base oligonucleotides) that make
them ideal for detecting single base alterations, but less ideal for identifying CNVs (especially when compared to long oligonucleotide-based arrays). For genotyping platforms, only a single labeled DNA source (“test sample”) is hybridized; hence these are not referred to as comparative genomic hybridization assays. The signal intensities obtained at each hybridized target appear to have a linear relationship with respect to copy numbers of that particular DNA sequence in the test genome. For example, if a given DNA sequence has four copies in test genome “X” and only two copies in test genome “Y,” the signal intensity obtained for that DNA sequence when test genome “X” is hybridized would be twice that of when test genome “Y” is hybridized. In general, genotyping platforms can detect larger CNVs (especially when there is a higher level of copy number change) or smaller CNVs (that are detectable by numerous targets on the array platform). Affymetrix (http://www.affymetrix.com) SNP arrays contain targets that are ~25 bases long, and Illumina Inc. (http://www.illumina.com) has designed a genotyping platform that uses 50-base oligonucleotides attached to indexed beads on a glass slide. Labeled test DNAs are hybridized to the slide, followed by primer extension and then immunofluorescence detection. Clearly, there would be significant benefits for platforms that are capable of accurately determining both SNP and CNV genotypes, including reduced reagent costs and expenditure of minimal amounts of DNA. However, the fluorescence intensity data obtained from these genotyping platforms typically have more noise when trying to obtain copy number information than do long oligonucleotide-based arrays. Hence, both Affymetrix and Illumina have now produced “next-generation” arrays that incorporate thousands of non-polymorphic probes (i.e., probes that do not target known SNPs) that fall within known and unknown CNV regions.
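The effective-resolution comparison made earlier in this section, between a 500,000-target platform needing three consecutive probes and a 1-million-target platform needing ten, can be checked with simple arithmetic. The Python sketch below assumes a round genome size of 3 Gb and perfectly even target spacing, and approximates the smallest callable CNV as the probe spacing multiplied by the number of consecutive probes required. The constant `GENOME_SIZE_BP` and the function `effective_resolution_bp` are illustrative assumptions, not values or names from the chapter.

```python
GENOME_SIZE_BP = 3_000_000_000  # assumed round figure for the human genome


def effective_resolution_bp(n_targets, probes_per_call, genome_size_bp=GENOME_SIZE_BP):
    """Approximate smallest CNV (bp) callable with evenly spaced targets."""
    spacing_bp = genome_size_bp / n_targets  # average distance between targets
    return spacing_bp * probes_per_call      # span covered by the required probes


platform_a = effective_resolution_bp(500_000, probes_per_call=3)
platform_b = effective_resolution_bp(1_000_000, probes_per_call=10)
print(f"500,000 targets, 3 probes per call:    ~{platform_a / 1000:.0f} kb")
print(f"1,000,000 targets, 10 probes per call: ~{platform_b / 1000:.0f} kb")
```

Under these assumptions the first platform resolves CNVs of roughly 18 kb and the second only roughly 30 kb, even though the second carries twice as many targets.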
For example, Affymetrix has released the Human SNP Array 6.0 that contains 906,600 probes for detecting common SNPs and 946,000 probes for detecting CNVs (Korn et al., 2008; McCarroll et al., 2008). It is thought that a CNV can be confidently detected if enough targets are strategically chosen for a given CNV region and included in the array. In other words, one probe may not be able to reliably detect a single copy loss in a CNV region but cumulative data from one hundred targets in the same CNV region may result in a consistent and confident CNV call. Illumina has released several newer platforms, adding common CNV regions identified by each phase of the 1000 Genomes Project. Some of the newer arrays include the Illumina 660 W, Omni 2.5S, and the Omni 5 Quad. With the availability of so many commercial arrays for genome-wide CNV analyses, it can be difficult to compare the limitations of each platform. However, some cross-platform comparison studies have been conducted; CNV detection issues appear to include batch effects, as well as which analytical program is being used for the CNV detection (Pinto et al., 2011). Whole-Genome Sequence Comparisons CNVs can also be detected with whole-genome sequence comparison analyses, using a variety of analytic strategies (Figure
[Figure 10.5 graphic, panels A to C. Recoverable labels: number of read pairs (0 to 35,000) plotted against span between pairs (0 to 300), with limits Ci and Cd; pairs of DNA sequences marked Expected, >Cd (deletion), and Reverse orientation (inversion); mean sequence coverage marked Deletion and Duplication; pairs of DNA sequences marked Deletion and Insertion.]
Figure 10.5 Strategies that can be used to identify structural genomic variants from next-generation DNA-sequencing datasets. (A) Paired-end mapping. The upper left-hand side shows pairs of DNA sequence reads that have been obtained from next-generation DNA sequencing. DNA reads are often approximately 100 bp in length, but read lengths differ with respect to the technology used for sequencing. The upper right-hand side shows the lower limit (Ci) and upper limit (Cd) of the DNA insert sizes used in the library preparation. For accurate detection of structural variants, the library insert sizes should have minimal deviation. The lower schematic shows how mapping these pairs of sequence reads can identify structural variants. If a pair of sequence reads map back to the reference genome with the expected distance from each other (corresponding to the DNA library size), there is no detectable SV. If a pair of sequence reads map back to the reference genome with a distance greater than the library size, it indicates a deletion at this genomic site in the individual being sequenced. Conversely, if a pair of sequence reads map back to the reference genome with a distance smaller than the library size, it indicates an insertion at this genomic site in the individual being sequenced. Finally, if a pair of sequence reads map back to the reference genome – but in the same orientation – it is suggestive of the presence of an inversion
10.5). The main advantage of this approach for identifying CNVs is the acquisition of fine-scale genomic architecture of CNVs (i.e., accurate CNV sizes and breakpoint information). Tuzun and colleagues (2005) performed one of the initial whole-genome sequence analyses by aligning end-sequence data from thousands of fosmid clones from the G248 DNA library (derived from a single North American female) and comparing these with the human reference genome sequence. Taking advantage of the tight size restriction of fosmid clones, they were able to identify genomic gains/losses in the reference genome, when pairs of end-sequences aligned with intervening spacing significantly greater or less than the expected 40 kb (Figure 10.5). In this manner, more than 200 CNVs were identified in one of these two healthy and presumably “normal” individuals. Next-generation DNA sequencing technologies have advanced to a point where complete human genome sequences can now be obtained more efficiently and cost-effectively than previously possible (see Chapter 7). Genome sequences for different individuals now typically report thousands of CNVs in any given individual, encompassing hundreds of millions of bases of DNA (Ahn et al., 2009; Bentley et al., 2008; Kim et al., 2009; Korbel et al., 2007; Levy et al., 2007; Wang et al., 2008; Wheeler et al., 2008). A major advantage of utilizing DNA sequence analyses is the ability to identify balanced chromosomal rearrangements that cannot be detected by aCGH-based methods (Figure 10.5). For example, Tuzun and colleagues (2005) found evidence for 56 inversion breakpoints in their comparative analysis of the two genomes of two individuals, and Korbel and colleagues (2007) found 132 inversion breakpoints when comparing the genomes of two different individuals.
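The paired-end logic described above, in which end-sequence pairs are compared against the expected insert size, can be sketched in a few lines of Python. The size limits of 32 kb and 48 kb around the expected 40 kb fosmid insert, the function name `classify_pair`, and the example pairs are all illustrative assumptions; a real analysis would derive the limits from the observed insert-size distribution and would cluster supporting pairs by genomic position.

```python
from collections import Counter

# Assumed, illustrative limits (bp) around the expected ~40 kb insert size.
LOWER_LIMIT = 32_000
UPPER_LIMIT = 48_000


def classify_pair(span, same_orientation, lower=LOWER_LIMIT, upper=UPPER_LIMIT):
    """Classify one end-sequence pair by its span and orientation on the reference."""
    if same_orientation:
        return "inversion"   # ends map in the same orientation
    if span > upper:
        return "deletion"    # ends map too far apart: sequence missing in the sample
    if span < lower:
        return "insertion"   # ends map too close: extra sequence in the sample
    return "concordant"


# Hypothetical pairs near one locus: (span on the reference in bp, same orientation?)
pairs = [(40_500, False), (61_000, False), (59_800, False), (60_400, False), (39_900, False)]
support = Counter(classify_pair(span, same) for span, same in pairs)
print(support)
```

Counting how many independent pairs support the same call reflects the point that multiple discordant pairs at one location increase confidence in a variant.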
ASSOCIATION OF CNVs TO DISEASE AND DISEASE SUSCEPTIBILITY Genomic imbalances can contribute to human diseases in at least two ways. First, certain genomic imbalances appear to cause neurodevelopmental diseases in a highly penetrant manner. These genomic imbalances (referred to by some as “pathogenic” CNVs) are usually de novo in nature, and recent
estimates have associated specific genomic imbalances with as many as 50 such genetic syndromes (http://www.sanger.ac.uk/PostGenomics/decipher/). Most of the remaining known genomic imbalances – sometimes referred to as “CNVs of unknown clinical significance” or “benign CNVs” because of their identification in healthy individuals – may actually have more subtle consequences on human health. Indeed, as illustrated by the examples below, some of these CNVs have been shown to confer differential susceptibility to common human diseases. Fcgr3 Copy Number Variation in Glomerulonephritis Glomerulonephritis is a major contributor to human kidney failure. Fcgr3 is a gene that encodes a receptor found on the surfaces of macrophages, which binds immunoglobulin G with low affinity. The copy number of Fcgr3 can vary in humans and among rat strains from 0 to 4 per diploid cell. Individuals with fewer copies of this gene, due to deletions of paralogous Fcgr3 genes (which appear to have a negative regulatory effect on the “full-length” Fcgr3 gene/gene products), exhibit increased macrophage activity and an autoimmune response (Aitman et al., 2006; Fanciulli et al., 2007). DEFB4 Copy Number Variation in Crohn’s Disease and Psoriasis Human β-defensins are a family of genes predominantly secreted from leukocytes and epithelial tissues. β-Defensins are small proteins (15–20 residues) that function in antimicrobial defense by penetrating a microbe’s cell membrane, causing microbial death in a manner similar to that of antibiotics. In the presence of interleukin 1-alpha (IL-1α), which is secreted by macrophages and other immunologically relevant cell types at the site of tissue inflammation, the expression levels of the β-defensin gene, DEFB4, increase (O’Neil et al., 1999) to protect the tissue from further microbial invasion.
Therefore, individuals with a lower copy number of this β-defensin have decreased immunity against microbes and increased susceptibility to Crohn’s disease (inflammatory bowel disease 1) (Fellermann et al., 2006; Naser et al., 2004). Hollox and colleagues (2007) later found that individuals with greater than
breakpoint. Detecting multiple pairs of sequences that suggest a given SV at a particular genomic location increases the confidence that the structural variant is present in the individual. (B) Split-read analyses. When pairs of DNA sequence reads are mapped back to the reference genome, sometimes there is a split in the sequence read itself. If all the DNA maps to the reference genome but is split during alignment, it suggests the presence of a deletion breakpoint at the site of the split. However, if some of the DNA does not align back to the reference genome and there is a split in the read during alignment, it suggests an insertion at the site of the split. Split-read analyses are especially helpful for rapidly identifying SV breakpoints. DNA sequencing technologies generating longer read lengths have an increased probability of identifying SV breakpoints with this methodology. (C) Read-depth analysis. When a genome is sequenced at a given coverage (e.g., 40× coverage), one expects to find an average of 40 reads for a particular DNA sequence. When the number of reads for a given DNA sequence is statistically lower or higher than expected from the sequencing coverage, it suggests a deletion or gain, respectively, of the given DNA sequence.
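The read-depth reasoning in panel (C) can be illustrated with a minimal Python sketch. It assumes that read counts follow a Poisson distribution around the expected coverage, so that the standard deviation is the square root of the mean, and uses a z-score threshold of 3 to flag regions. The Poisson assumption, the threshold, and the function names `depth_z_score`, `call_region`, and `estimate_copy_number` are all illustrative; real sequencing data are usually more variable than this model suggests.

```python
import math


def depth_z_score(observed, expected=40.0):
    """Standardized deviation of the observed read depth from the expected coverage."""
    return (observed - expected) / math.sqrt(expected)


def call_region(observed, expected=40.0, threshold=3.0):
    """Flag a region as a deletion or gain when depth deviates beyond the threshold."""
    z = depth_z_score(observed, expected)
    if z <= -threshold:
        return "deletion"
    if z >= threshold:
        return "gain"
    return "no change"


def estimate_copy_number(observed, expected=40.0, ploidy=2):
    """Scale the observed depth to a copy number, assuming depth is proportional to copies."""
    return round(ploidy * observed / expected)


for depth in (20, 45, 60):
    print(depth, call_region(depth), estimate_copy_number(depth))
```

At 40× coverage, a region with 20 reads is flagged as a deletion consistent with one copy, and a region with 60 reads as a gain consistent with three copies, while 45 reads falls within the expected range.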
four copies of the β-defensin gene have an increased susceptibility to psoriasis. It is thought that an elevated inflammatory response is elicited upon minor skin injury by increasing cytokine, epidermal growth factor receptor (EGF-R), and signal transducer and activator of transcription (STAT) signaling pathways, thus leading to inflammation. Copy Number Variation of Complement Component C4 in Systemic Lupus Erythematosus Although a link between the complement component C4 (and its isotypes, C4A and C4B) and systemic lupus erythematosus (SLE) has been previously reported (Hauptmann et al., 1974), a recent study suggests that the gene’s variable copy number actually serves as a significant risk factor for the disease (Yang et al., 2007). The complement system comprises over 20 proteins or protein fragments that normally circulate in the blood but, when activated, cause a biochemical cascade that clears pathogens from the human body, often by forming new transmembrane channels in the pathogen and causing osmotic lysis of the target cell. The median copy number of the complement component C4 gene is four, but it can range from zero to greater than five copies among humans (Yang et al., 2007). Low copy numbers of this gene are correlated with increased risk for SLE (Fanciulli et al., 2007; Yang et al., 2007).
IMPLICATIONS OF CNVs Disease-Association Studies SNPs have become powerful markers for identifying important disease loci in genetic association studies (see Chapter 3). However, recent analysis of the complete DNA sequence from a single human individual has shown that CNVs (and other structural genomic variants, including small, balanced chromosomal rearrangements) account for at least 22% of all genetic variation events in the individual and 74% of the total DNA sequence variation, when compared to the human reference genome (Levy et al., 2007). Although it is premature to predict the relative contribution of CNVs to the etiology of common, complex diseases (compared to SNPs), it is clear that CNVs represent a substantial component of human genetic variation that should not be ignored in future disease-association studies. Indeed, the presence of CNVs in the human genome has actually reduced the power of certain SNPs. For example, SNPs that lie within CNV regions are difficult to genotype, result in Hardy–Weinberg equilibrium distortions, and are often erroneously calculated as having reduced linkage disequilibrium to nearby causative genomic regions. CNVs are far more complex in nature than SNPs and demand appreciation of several factors in order to use this form of genetic variation appropriately in disease association studies. First, the absolute copy number (rather than relative copy number obtained from aCGH-based experiments) needs to be established for each CNV (Park et al., 2010). The exact
genomic location of duplicated CNVs should also be considered: the presence of a third copy of a gene may have dramatically different phenotypic effects depending on where the third copy occurs in the genome. The precise boundaries (at the DNA sequence level) of each CNV should also be known as well as the specific allelic state of a CNV (e.g., two copies of a gene could be distributed as one copy per chromosome or both copies on a single chromosome). Without this level of information, the power of CNVs in genetic association studies diminishes. Indeed, among the several thousand CNVs that have been identified to date, only a small percentage have actually been genotyped to this precision. Some efforts have been made to determine if CNVs are in linkage disequilibrium with nearby SNPs. If so, this would allow for specific CNV alleles to be assayed indirectly with a subset of well-characterized SNPs (i.e., “tagging” SNPs) (see Chapter 3). Initial observations suggested that a subset of CNVs did appear to be ancestral in nature and therefore taggable by specific SNPs (McCarroll and Altshuler, 2005). Larger CNVs, especially those that are in segmental duplication-rich areas of the genome, appear to be less taggable by known SNPs (Locke et al., 2006). Part of the reason for this may be that the mutation rate for such CNVs is higher than the mutation rates for nearby SNPs (NAHR mutation rates are estimated to be at least double those of SNPs; see earlier section). Overall, this suggests that for association studies, a substantial number of CNVs may need to be genotyped directly. Pharmacogenetics Since many CNV genes are involved in metabolism and drug detoxification, it has been speculated that CNVs may also make significant contributions to future pharmacogenomic studies (Ouahchi et al., 2006).
For example, the CYP2D6 gene encodes an enzyme involved in the metabolism of more than 30 medications, including anti-arrhythmics, antihypertensives, and beta-blockers (see Chapter 31). CYP2D6 CNVs have been identified and shown to result in gene products with differential metabolism efficiency (Rotger et al., 2007). Similarly, CNVs of genes involved in metabolism may explain some cases of inter-individual variation in drug toxicity. McCarroll and colleagues (2005) and Conrad and coworkers (2006) found more than 120 CNV genes that were homozygously deleted. Individuals harboring such homozygous deletions presumably have low toxicity tolerance to medications that depend heavily on the homozygously deleted CNV gene product for proper metabolism. Rapid and accurate identification of these individuals should be made a priority in pharmacogenomic research.

Clinical Cytogenetic Diagnostics

From the cytogenetics perspective, aCGH-based techniques are now being used more widely in clinical cytogenetic diagnostics to identify smaller genomic imbalances that may be associated with neurodevelopmental disorders. Indeed, it has been estimated that aCGH-based assays are now detecting apparently pathogenic genomic imbalances in as many as 20% of cases that previously had normal results from conventional chromosome-banded karyotyping tests. However, the recognition of widespread CNVs among the genomes of healthy individuals has made interpretation of aCGH-based assays more difficult (Lee et al., 2007a).
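The interpretive difficulty described above is often approached by comparing an imbalance found in a patient against CNVs already catalogued in healthy individuals. The following is a minimal sketch of such a comparison, not a method given in this chapter. It assumes a reciprocal-overlap criterion with a 50% threshold, which is a common convention but an assumption here, and all coordinates and names (`Interval`, `matches_known_cnv`, the catalogue entries) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A genomic segment; coordinates are hypothetical, in base pairs."""
    chrom: str
    start: int  # inclusive
    end: int    # exclusive

    @property
    def length(self) -> int:
        return self.end - self.start


def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Return the smaller of the two fractions of a and b covered by their overlap."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / a.length, shared / b.length)


def matches_known_cnv(call, catalogue, threshold=0.5):
    """List catalogue CNVs whose reciprocal overlap with the call meets the threshold.

    The 0.5 default is an assumed convention, not a value from the chapter.
    """
    return [known for known in catalogue
            if reciprocal_overlap(call, known) >= threshold]


# Hypothetical catalogue of CNVs observed in healthy individuals.
catalogue = [
    Interval("chr1", 1_000_000, 1_200_000),
    Interval("chr1", 5_000_000, 5_050_000),
]

# Hypothetical aCGH calls from a patient sample.
patient_call = Interval("chr1", 1_050_000, 1_250_000)
small_call = Interval("chr1", 5_000_000, 5_010_000)

print(matches_known_cnv(patient_call, catalogue))  # overlaps the first entry
print(matches_known_cnv(small_call, catalogue))    # too small relative to the second entry
```

A match of this kind only flags a call for closer review. It does not establish that the imbalance is benign, since the same region may differ in exact boundaries, copy number, or allelic state between individuals.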
There is increasing interest in the role of CNVs and other SVs in clinical cytogenetics and molecular diagnostic laboratories within the healthcare setting (see Chapter 11), and this represents an important translational aspect of defining the significance of structural genome variation in genomic and personalized medicine.
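As a worked illustration of the Hardy–Weinberg distortion mentioned under Disease-Association Studies, the sketch below models a SNP with alleles A and B lying inside a region that also carries a deletion allele. It rests on simplifying assumptions that are not taken from this chapter: the genotyping assay reports hemizygous samples (A/deletion, B/deletion) as homozygous, and homozygous deletions produce no call. The function names and allele frequencies are hypothetical.

```python
def observed_genotype_freqs(p_a: float, p_b: float, p_del: float) -> dict:
    """Apparent SNP genotype frequencies among called samples.

    p_a, p_b, p_del are the population frequencies of allele A, allele B,
    and the deletion allele; they must sum to 1. True genotypes are assumed
    to be in Hardy-Weinberg proportions.
    """
    if abs(p_a + p_b + p_del - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    called = 1.0 - p_del ** 2  # homozygous deletions give no call
    return {
        "AA": (p_a ** 2 + 2 * p_a * p_del) / called,  # true AA plus A/deletion
        "AB": (2 * p_a * p_b) / called,
        "BB": (p_b ** 2 + 2 * p_b * p_del) / called,  # true BB plus B/deletion
    }


def hwe_chi_square(freqs: dict, n_called: int) -> float:
    """Chi-square statistic comparing apparent genotypes with HWE expectations.

    Assumes both apparent allele frequencies are greater than zero.
    """
    p_hat = freqs["AA"] + 0.5 * freqs["AB"]  # apparent frequency of allele A
    q_hat = 1.0 - p_hat
    expected = {"AA": p_hat ** 2, "AB": 2 * p_hat * q_hat, "BB": q_hat ** 2}
    return sum(n_called * (freqs[g] - expected[g]) ** 2 / expected[g]
               for g in expected)


# Hypothetical frequencies: 20% of chromosomes carry the deletion.
with_deletion = observed_genotype_freqs(0.5, 0.3, 0.2)
no_deletion = observed_genotype_freqs(0.6, 0.4, 0.0)

print(with_deletion)
print(hwe_chi_square(with_deletion, n_called=1000))
print(hwe_chi_square(no_deletion, n_called=1000))
```

Under these assumptions the hidden deletion produces an apparent deficit of heterozygotes, so the SNP fails a Hardy–Weinberg test even though the underlying population is in equilibrium. This is one reason SNPs in CNV regions are often filtered out before association analysis.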
REFERENCES

Ahn, S.-M., Kim, T.-H., Lee, S., et al., 2009. The first Korean genome sequence and analysis: Full genome sequencing for a socio-ethnic group. Genome Res 19, 1622–1629.
Aitman, T.J., Dong, R., Vyse, T.J., et al., 2006. Copy number polymorphism in Fcgr3 predisposes to glomerulonephritis in rats and humans. Nature 439, 851–855.
Bentley, D.R., Balasubramanian, S., Swerdlow, H.P., et al., 2008. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59.
Brown, K.H., Dobrinski, K.P., Lee, A.S., et al., 2012. Extensive genetic diversity and substructuring among zebrafish strains revealed through copy number variant analysis. Proc Natl Acad Sci USA 109, 529–534.
Conrad, D.F., Andrews, T.D., Carter, N.P., Hurles, M.E., Pritchard, J.K., 2006. A high-resolution survey of deletion polymorphism in the human genome. Nat Genet 38, 75–81.
Conrad, D.F., Hurles, M.E., 2007. The population genetics of structural variation. Nat Genet 39, S30–S36.
Conrad, D.F., Pinto, D., Redon, R., et al., 2010. Origins and functional impact of copy number variation in the human genome. Nature 464, 704–712.
Derti, A., Roth, F.P., Church, G.M., Wu, C.-T., 2006. Mammalian ultraconserved elements are strongly depleted among segmental duplications and copy number variants. Nat Genet 38, 1216–1220.
Down, J.L.H., 1866. Observations on an ethnic classification of idiots. London Hosp Clin Lect Rep 3, 259.
Fanciulli, M., Norsworthy, P.J., Petretto, E., et al., 2007. FCGR3B copy number variation is associated with susceptibility to systemic, but not organ-specific, autoimmunity. Nat Genet 39, 721–723.
Fellermann, K., Stange, D.E., Schaeffeler, E., et al., 2006. A chromosome 8 gene-cluster polymorphism with low human beta-defensin 2 gene copy number predisposes to Crohn disease of the colon. Am J Hum Genet 79, 439–448.
Feuk, L., Marshall, C.R., Wintle, R.F., Scherer, S.W., 2006. Structural variants: Changing the landscape of chromosomes and design of disease studies. Hum Mol Genet 15, R57–R66.
Freeman, J.L., Perry, G.H., Feuk, L., et al., 2006. Copy number variation: New insights in genome diversity. Genome Res 16, 949–961.
Gonzalez, E., Kulkarni, H., Bolivar, H., et al., 2005. The influence of CCL3L1 gene-containing segmental duplications on HIV-1/AIDS susceptibility. Science 307, 1434–1440.
Hansson, K., Szuhai, K., Knijnenburg, J., Van Haeringen, A., De Pater, J., 2007. Interstitial deletion of 6q without phenotypic effect. Am J Med Genet A 143, 1354–1357.
Hauptmann, G., Grosshans, E., Heid, E., 1974. Lupus erythematosus syndrome and complete deficiency of the fourth component of complement. Boll Ist Sieroter Milan Suppl 28, 53.
Hollox, E.J., Huffmeier, U., Zeeuwen, P.L., et al., 2007. Psoriasis is associated with increased beta-defensin genomic copy number. Nat Genet 40, 23–25.
Iafrate, A.J., Feuk, L., Rivera, M.N., et al., 2004. Detection of large-scale variation in the human genome. Nat Genet 36, 949–951.
Kim, J.-I., Ju, Y.S., Park, H., et al., 2009. A highly annotated whole-genome sequence of a Korean individual. Nature 460, 1011–1015.
Korbel, J.O., Urban, A.E., Affourtit, J.P., et al., 2007. Paired-end mapping reveals extensive structural variation in the human genome. Science 318, 420–426.
Korn, J.M., Kuruvilla, F.G., McCarroll, S.A., et al., 2008. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms, and CNVs. Nat Genet 40, 1253–1260.
Lee, C., Iafrate, A.J., Brothman, A.R., 2007a. Copy number variations and clinical cytogenetic diagnosis of constitutional disorders. Nat Genet 39, S48–S54.
Lee, J.A., Carvalho, C.M., Lupski, J.R., 2007b. A DNA replication mechanism for generating nonrecurrent rearrangements associated with genomic disorders. Cell 131, 1235–1247.
Lejeune, J., Gautier, M., Turpin, R., 1959. Etude des chromosomes somatiques de neuf enfants mongoliens. C.R. Acad. Sci 248, 1721–1722.
Levy, S., Sutton, G., Ng, P.C., et al., 2007. The diploid genome sequence of an individual human. PLoS Biol 5, e254.
Locke, D.P., Sharp, A.J., McCarroll, S.A., et al., 2006. Linkage disequilibrium and heritability of copy-number polymorphisms within duplicated regions of the human genome. Am J Hum Genet 79, 275–290.
McCarroll, S.A., Altshuler, D.M., 2007. Copy-number variation and association studies of human disease. Nat Genet 39, S37–S42.
McCarroll, S.A., Hadnott, T.N., Perry, G.H., et al., 2006. Common deletion polymorphisms in the human genome. Nat Genet 38, 86–92.
McCarroll, S.A., Kuruvilla, F.G., Korn, J.M., et al., 2008. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet 40, 1166–1174.
Mills, R.E., Walter, K., Stewart, C., et al., 2011. Mapping copy number variation by population-scale genome sequencing. Nature 470, 59–65.
Naser, S.A., Ghobrial, G., Romero, C., Valentine, J.F., 2004. Culture of Mycobacterium avium subspecies paratuberculosis from the blood of patients with Crohn’s disease. Lancet 364, 1039–1044.
Nguyen, D.Q., Webber, C., Ponting, C., 2006. Bias of selection on human copy-number variants. PLoS Genet 2, e20.
O’Neil, D.A., Porter, E.M., Elewaut, D., et al., 1999. Expression and regulation of the human beta-defensins hBD-1 and hBD-2 in intestinal epithelium. J Immunol 163, 6718–6724.
Ouahchi, K., Lindeman, N., Lee, C., 2006. Copy number variants and pharmacogenomics. Pharmacogenomics 7, 25–29.
Park, H.S., Kim, J.I., Ju, Y.S., et al., 2010. Absolute quantification of common Asian copy number variants (CNVs) using an integrated approach of high-resolution array CGH and massively parallel DNA sequencing. Nat Genet 42, 400–405.
Park, J., Chen, L., Ratnasinghe, L., et al., 2006. Deletion polymorphism of UDP-glucuronosyltransferase 2B17 and risk of prostate cancer in African American and Caucasian men. Cancer Epidemiol Biomarkers Prev 15, 1473–1478.
Peiffer, D.A., Le, J.M., Steemers, F.J., Chang, W., Jenniges, T., Garcia, F.K., 2006. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res 16, 1136–1148.
Perry, G.H., Ben-Dor, A., Tsalenko, A., et al., 2008. The fine-scale and complex architecture of human copy-number variation. Am J Hum Genet 82, 685–695.
Pinkel, D., Segraves, R., Sudar, D., et al., 1998. High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat Genet 20, 207–211.
Pinto, D., Darvishi, K., Shi, X., et al., 2011. Comprehensive assessment of array-based platforms and calling algorithms for detection of copy number variants. Nat Biotechnol 29, 512–520.
Redon, R., Ishikawa, S., Fitch, K.R., et al., 2006. Global variation in copy number in the human genome. Nature 444, 444–454.
Reymond, A., Henrichsen, C.N., Harewood, L., Merla, G., 2007. Side effects of genome structural changes. Curr Opin Genet Dev 17, 381–386.
Rotger, M., Saumoy, M., Zhang, K., et al., 2007. The Swiss HIV Cohort Study. Partial deletion of CYP2B6 owing to unequal crossover with CYP2B7. Pharmacogenet Genomics 17, 885–890.
Scherer, S.W., Lee, C., Birney, E., et al., 2007. Challenges and standards in integrating surveys of structural variation. Nat Genet 39, S7–S15.
Sebat, J., Lakshmi, B., Troge, J., et al., 2004. Large-scale copy number polymorphism in the human genome. Science 305, 525–528.
Shaffer, L.G., Lupski, J.R., 2000. Molecular mechanisms for constitutional chromosomal rearrangements in humans. Annu Rev Genet 34, 297–329.
Solinas-Toldo, S., Lampel, S., Stilgenbauer, S., et al., 1997. Matrix-based comparative genomic hybridization: Biochips to screen for genomic imbalances. Genes Chromosom Cancer 20, 399–407.
Stranger, B.E., Forrest, M.S., Dunning, M., et al., 2007. Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 315, 848–853.
Tuzun, E., Sharp, A.J., Bailey, J.A., et al., 2005. Fine-scale structural variation of the human genome. Nat Genet 37, 727–732.
Wang, J., Wang, W., Li, R., et al., 2008. The diploid genome sequence of an Asian individual. Nature 456, 60–65.
Wheeler, D.A., Srinivasan, M., Egholm, M., et al., 2008. The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872–876.
Wilson III, W., Pardo-Manuel de Villena, F., Lyn-Cook, B.D., et al., 2004. Characterization of a common deletion polymorphism of the UGT2B17 gene linked to UGT2B15. Genomics 84, 707–714.
Yang, Y., Chung, E.K., Wu, Y.L., et al., 2007. Gene copy-number variation and associated polymorphisms of complement component C4 in human Systemic Lupus Erythematosus (SLE): Low copy number is a risk factor for and high copy number is a protective factor against SLE susceptibility in European Americans. Am J Hum Genet 80, 1037–1054.
RECOMMENDED RESOURCES

1000 Genomes Project http://www.1000genomes.org
Affymetrix, Inc. http://www.affymetrix.com
Agilent, Inc. http://www.agilent.com/chem/goCGH
Database of Genomic Variants http://projects.tcag.ca/variation
Ensembl Genome Browser http://www.ensembl.org/index.html
Human Paralogy Database http://humanparalogy.gs.washington.edu/structuralvariation
Human Segmental Duplication Database http://projects.tcag.ca/humandup/
Illumina Inc. http://www.illumina.com
NimbleGen Systems, Inc. http://www.nimblegen.com
UCSC Genome Browser http://www.genome.ucsc.edu/