1
Common Genetic Variation and Human Disease Nick Orr* and Stephen Chanock*,† *Laboratory of Translation Genomics, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892 †Core Genotyping Facility, Advanced Technology Center, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892
I. Introduction II. Variation in the Human Genome A. Single-nucleotide polymorphisms B. Factors influencing SNP frequencies in populations C. SNPs in health and disease D. Other categories of genomic variation III. Utilization of Genetic Variation in Gene Mapping Studies IV. Mapping Complex Disease Genes Using Association A. Linkage and association: Out with the old and in with the new? B. Genetic association testing: The direct approach C. Genetic association testing: The indirect approach D. tagSNPs E. Quantifying LD in the genome F. Testing association using haplotypes V. Genome-wide Association Studies VI. Study Design and Data Analysis A. Introduction B. Type I error and the multiple testing problem C. Additional sources of error
Advances in Genetics, Vol. 62
0065-2660/08 $35.00 DOI: 10.1016/S0065-2660(08)00601-9
VII. Significance for Public Health VIII. Concluding Remarks References
ABSTRACT

The landscape of human genetics has changed remarkably in a relatively short space of time. The field has progressed from comparatively small studies of rare genetic diseases to vast consortium-based efforts that target the inherited components of common complex diseases and typically involve thousands of individual samples. In particular, genome-wide association studies have become possible as a result of a new generation of genotyping platforms. At the time of writing, these have led to the discovery of more than 150 novel susceptibility loci across a broad spectrum of diseases, a few in genes with high biological plausibility but the majority in genes that had not been considered candidates. Here, we provide an overview of the field of complex disease genetics pertaining to mapping by association and consider the many pitfalls and caveats that have arisen. © 2008, Elsevier Inc.
I. INTRODUCTION

Annotation of genetic variation in the human genome, coupled with advances in bioinformatics and technology, has significantly changed the landscape of human genetics. Geneticists are now in a strong position to answer questions pertaining to the heritability of common conditions. In particular, we are able to address the contribution of common germ line genetic variation to disease susceptibility and outcome. The potential to sequence entire human genomes looms in the near future (Bennett et al., 2005; Binladen et al., 2007; Margulies et al., 2005) and will yield insights into the significance of less common genetic variants, perhaps in both common and uncommon diseases.

The successes achieved in the analysis of single-gene disorders using linkage or reverse cloning, e.g., hemophilia (Youssoufian et al., 1988), cystic fibrosis (Kerem et al., 1989), thrombosis (Bertina et al., 1994), and chronic granulomatous disease (Royer-Pokora et al., 1986), had, until recently, not been matched for polygenic conditions. The etiology of these diseases may be further influenced by environmental challenges (Hunter, 2005). Complex diseases do not necessarily follow traditional patterns of Mendelian inheritance. The identification of disease-causing genes using conventional linkage-based approaches has been less than successful because studies do not have the requisite statistical power to detect
low-penetrance, high-frequency alleles. During the course of this review, we will discuss advances in the approaches for mapping genetic variants that contribute to complex diseases, particularly those that utilize unrelated subjects. We will outline the goals and limitations of current procedures and will discuss the recent successes seen in genome-wide association studies (GWAS). We will also touch upon issues underlying the pre-GWAS paucity of replicable findings.
II. VARIATION IN THE HUMAN GENOME

A. Single-nucleotide polymorphisms

The human genome is composed of over three billion bases of DNA encoding between 25,000 and 30,000 genes (International Human Genome Sequencing Consortium, 2004; Lander et al., 2001; Venter et al., 2001). The most common form of variation in the genome is the single-nucleotide polymorphism (SNP). SNPs are DNA variants in which a single nucleotide at a fixed position in the genome is substituted with another. It has been estimated that there are in excess of 10 million common [minor allele frequency (MAF) > 1%] SNPs within the genome (Kruglyak and Nickerson, 2001; Reich and Lander, 2001); a small subset of these likely gives rise to the observable phenotypic differences within and between populations, including disease susceptibility and outcome. One may expect to observe a single-nucleotide difference between two haploid genomes at a rate of about 1 in every 300–1000 base pairs.

Although the vast majority of SNPs are shared between populations (Conrad et al., 2006; Hinds et al., 2005; International HapMap Consortium, 2005), it is evident that many are specific to populations or continental groupings of populations that share recent history. In this regard, it is possible to identify sets of markers that can be used to measure admixture in populations and that in some circumstances may be utilized to map genes that could partially account for differences in disease incidence between populations (Freedman et al., 2006; Patterson et al., 2004; Shriver et al., 2005).
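The two quantities used in this section, the MAF threshold that defines a common SNP and the expected spacing of differences between two haploid genomes, can be made concrete with a short calculation. The following Python sketch is illustrative only: the function names and the genotype counts are hypothetical, and the genome length is rounded to three billion bases.

```python
def minor_allele_frequency(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Return the minor allele frequency at a biallelic SNP.

    The arguments are counts of the three diploid genotypes (AA, AB, BB)
    in a sample; each individual contributes two alleles.
    """
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    if total_alleles == 0:
        raise ValueError("no genotypes supplied")
    freq_a = (2 * n_aa + n_ab) / total_alleles
    return min(freq_a, 1.0 - freq_a)


def is_common(maf: float, threshold: float = 0.01) -> bool:
    """Apply the MAF > 1% convention for a 'common' SNP."""
    return maf > threshold


def expected_differences(genome_length: int, bases_per_difference: int) -> float:
    """Expected number of single-nucleotide differences between two
    haploid genomes, given one difference per `bases_per_difference`."""
    return genome_length / bases_per_difference


if __name__ == "__main__":
    # Hypothetical sample of 1000 individuals.
    maf = minor_allele_frequency(810, 180, 10)
    print(f"MAF = {maf:.3f}, common = {is_common(maf)}")
    # One difference per 300-1000 bp over roughly 3 billion bases.
    for spacing in (1000, 300):
        print(spacing, expected_differences(3_000_000_000, spacing))
```

Under these assumptions, the quoted spacing of one difference per 300–1000 base pairs corresponds to roughly 3 to 10 million differences between any two haploid genomes.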
B. Factors influencing SNP frequencies in populations

Because the chemical structure of DNA influences the rate of mutation, there is bias in the frequency distribution of categories of mutational events. Transitions, which preserve the class of nucleotide (purine substituted for purine, A ↔ G, or pyrimidine for pyrimidine, C ↔ T), are more common than transversions, which result in a switch of a pyrimidine for a purine or a purine for a pyrimidine (A ↔ C, A ↔ T, G ↔ C, or G ↔ T) (Topal and Fresco, 1976). New SNPs arise via mutation and with time they either disappear or reach fixation, replacing the ancestral allele. The time taken for either of these
extremes to be reached is proportional to population size; generally, as population size increases, so too does the number of generations in which a new SNP will be observed in its heterozygous state. It is thought that the lifespan of most SNPs is under the influence of neutral selection because they are inconsequential with respect to the fitness of an organism. SNPs that confer a selective advantage among members of a population may become enriched within that population through positive selection. Signatures of positive selection, though rare in genes, can be useful for the identification of those that have played an important role in the adaptation of a species to its local environment (Bersaglieri et al., 2004; Nielsen et al., 2005). For example, the frequency of lactose intolerance is low in European populations that have historically relied heavily on dairy farming for nutrition (Scrimshaw and Murray, 1988). This is due to the rapid expansion of a lactase variant that remains expressed in adulthood. Carriers of the variant thus have a selective advantage over those who lose the ability to metabolize lactose. Positive selection at the lactase gene is characterized by functional variants that lie on high-frequency, long-range haplotypes with often substantial frequency differences between populations (Bersaglieri et al., 2004). Patterns of selection observed in the human genome are not uniform and in fact vary according to gene function. Balancing selection at immune system loci can act to sustain a high level of functional diversity relative to other gene classes (Hughes et al., 2005) and provides a mechanism by which diversity at antigen recognition sites may be maintained, providing significant host advantage upon immune challenge. 
In contrast, purifying selection acting at sites of nonsynonymous nucleotide substitution is manifest as reduced heterozygosity and involves the process of elimination of variation with only slightly detrimental effects (Hughes et al., 2003).
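The distinction between transitions and transversions drawn at the start of this section is mechanical enough to express in a few lines. The Python sketch below classifies a substitution by nucleotide class and computes a transition/transversion ratio for a list of substitutions; the function names and the example substitutions are hypothetical and are not taken from any dataset discussed here.

```python
from collections import Counter

PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def classify_substitution(ref: str, alt: str) -> str:
    """Return 'transition' if the substitution preserves the nucleotide
    class (purine to purine or pyrimidine to pyrimidine), otherwise
    'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("alleles must each be one of A, C, G, T")
    if ref == alt:
        raise ValueError("reference and alternate alleles are identical")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def ts_tv_ratio(substitutions) -> float:
    """Ratio of transitions to transversions over (ref, alt) pairs."""
    counts = Counter(classify_substitution(r, a) for r, a in substitutions)
    if counts["transversion"] == 0:
        raise ValueError("no transversions observed; ratio undefined")
    return counts["transition"] / counts["transversion"]


if __name__ == "__main__":
    example = [("A", "G"), ("C", "T"), ("A", "T")]
    print(ts_tv_ratio(example))
```

There are twice as many possible transversions as transitions, so a ratio above 0.5 in real data already reflects the mutational bias described in the text.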
C. SNPs in health and disease

It has been estimated that 50,000–200,000 SNPs may be biologically important (Chanock, 2001; Risch, 2000; Sachidanandam et al., 2001). Nucleotide substitutions in the genome have the potential to directly contribute to disease pathogenesis, acting in a variety of ways depending on where they occur. Gene-centric SNPs can have serious consequences for the function or structural stability of a protein if they cause its primary structure to change. Exonic SNPs that lead to amino acid substitutions are referred to as “nonsynonymous.” Exonic SNPs are the best characterized class of genetic polymorphism; they are subject to detection bias and their functional effects are often readily assayable. The relative severity of an amino acid substitution may be predicted by consideration of the
biochemical properties of the side chains in question. Reference tables and algorithms (e.g., SIFT and PolyPhen) have been developed to aid investigators in assessing the significance of amino acid substitutions (Grantham, 1974; Ng and Henikoff, 2003; Ramensky et al., 2002). Many investigators choose to prioritize the analysis of nonsynonymous SNPs in their genetic association studies, on the basis that they may be of particular biological significance. Nucleotide substitutions in the protein-coding portions of genes sometimes introduce premature stop codons that terminate protein translation. These often become alleles that are effectively null because their transcribed mRNA is rapidly degraded by nonsense-mediated decay (Lykke-Andersen, 2001).

SNPs occurring in the exons of genes that do not alter protein primary structure are called “synonymous.” Historically, while of interest to population and evolutionary geneticists, synonymous SNPs had been thought to be functionally uninteresting. Recent experimental evidence has shown, however, that they can affect mRNA stability (Capon et al., 2004; Wang et al., 2005) and alter splicing signals in genes; the latter mechanism is known to be involved in androgen-insensitivity syndrome, Glanzmann thrombasthenia, and cerebrotendinous xanthomatosis (Chamary et al., 2006).

SNPs in introns, regulatory, and gene-distant regions can also be functionally important, primarily by affecting gene regulation. A relatively common variant (MAF of 1–2%), G20210A, in the 3′ UTR of the prothrombin gene, F2, increases its expression, and carriers of the minor allele are at significantly increased risk for venous thrombosis (Poort et al., 1996). SNPs in the upstream untranslated region of neuregulin 1 have been associated with schizophrenia and, in particular, with expression levels of splice variants of the gene (Law et al., 2006).
Indeed, SNPs that occur in apparent gene deserts have been associated with disease risk: three studies have independently identified and validated variants on chromosome 8 that increase susceptibility to prostate cancer and that are located 250 kb away from the nearest gene (Gudmundsson et al., 2007; Haiman et al., 2007; Yeager et al., 2007).

In some cases, genetic variants may mediate protection from a particular disease. In the field of infectious diseases especially, there are a number of examples in which the success of a pathogen is subject to the genetic makeup of its host. Hope within the HIV research community was triggered by the discovery that a deletion variant of the chemokine receptor gene, CCR5, is highly associated with increased resistance to HIV infection, even in the face of multiple exposures to the virus (Liu et al., 1996). Similarly, SNPs in CCR5 and other chemokine receptor genes have been shown to be associated with disease progression, substantially delaying the onset of AIDS (O’Brien and Nelson, 2004; Smith et al., 1997).
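The categories of exonic SNP described in this section (synonymous, nonsynonymous, and substitutions that introduce a premature stop codon) follow directly from the standard genetic code. The Python sketch below builds the standard codon table and classifies a single-base change within a codon. It is a minimal illustration: the function name and example codons are hypothetical, reading frame and strand are assumed to be known, and it makes no attempt to predict severity in the manner of SIFT or PolyPhen.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order
# with the first codon position varying slowest; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(codon: str, position: int, alt: str) -> str:
    """Classify a single-base substitution within a codon.

    `position` is 0, 1, or 2 within the codon and `alt` is the new base.
    Returns 'synonymous', 'nonsynonymous', or 'nonsense'.
    """
    codon, alt = codon.upper(), alt.upper()
    if codon not in CODON_TABLE or alt not in BASES or position not in (0, 1, 2):
        raise ValueError("invalid codon, position, or base")
    if codon[position] == alt:
        raise ValueError("substitution does not change the codon")
    mutated = codon[:position] + alt + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        return "synonymous"
    if after == "*":
        return "nonsense"  # premature stop codon
    return "nonsynonymous"


if __name__ == "__main__":
    print(classify_coding_snp("GAA", 2, "G"))  # Glu -> Glu
    print(classify_coding_snp("GAG", 1, "T"))  # Glu -> Val
    print(classify_coding_snp("TGG", 2, "A"))  # Trp -> stop
```

A classifier of this kind only sorts coding changes into classes; as the text notes, synonymous and noncoding variants can still be functional, so the label is a starting point for prioritization rather than a verdict.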
D. Other categories of genomic variation

There are many alternative classes of DNA variation that can have an impact on human health. Short tandem repeats (STRs) and variable number tandem repeats (VNTRs), collectively termed microsatellites, are head-to-tail repeats of multiple copies of a sequence motif, with STRs comprising a smaller number of individual bases than VNTRs. They are often extremely heterogeneous within a population and as such are useful for mapping purposes and for establishing relatedness. Large expansions of trinucleotide repeats can lead to genomic instability, the classic example being fragile X syndrome. A dinucleotide repeat (DG8S737) on chromosome 8 has been shown to be strongly associated with prostate cancer in African-Americans (Cheng et al., 2008; Freedman et al., 2006), though its functional importance has yet to be established. Whether or not modest variation in STR and VNTR length has an impact on disease remains to be determined, though evidence suggests that some repeats may act as binding sites for nuclear proteins (Richards et al., 1993).

Structural variants comprising large regions of variable copy number occur in the human genome (Iafrate et al., 2004; Sebat et al., 2004) and can have MAFs > 1%. Copy number variants (CNVs) therefore comprise part of the common genetic variation in a population. It has been estimated that a pair of individuals from a population will differ by a minimum of 11 CNVs (Sebat et al., 2004). The technology required to detect and assay CNVs has not yet reached the level of accessibility and versatility available for SNPs; this is reflected experimentally in a lack of overlap between publications describing CNVs, largely due to differences in assay methodology (Eichler, 2006). CNVs may encompass entire genes including promoter regions (Iafrate et al., 2004) and therefore may have an impact on phenotype.
CNVs can have dosage effects; the CCL3L1 gene duplication in individuals highly exposed to HIV is a fine example of how varying gene dosage can alter host susceptibility to infection (Gonzalez et al., 2005). CCL3L1 copy number is inversely correlated with HIV susceptibility. We can be certain that complex disease phenotypes will involve genetic contributions from both SNPs and CNVs, but resolving those of the latter will be slower in the immediate future because of a comparative lack of available analytical resources.

Insertion and deletion polymorphisms, ranging in size from a few to several thousand kilobases of DNA, often appear to be in strong linkage disequilibrium (LD) (see below) with surrounding SNPs (Hinds et al., 2006; McCarroll et al., 2006). Existing repositories of SNP data may be explored for clues as to the whereabouts of insertion and deletion polymorphisms because they often leave telltale traces, including deviation from Hardy-Weinberg equilibrium (HWE) (McCarroll et al., 2006). Indeed, it may be prudent to reevaluate past association study data in which SNPs were excluded on the basis of deviation from
genotypic proportions expected under HWE because this may indicate underlying structural variation (Wittke-Thompson et al., 2005). Plans are under way to comprehensively map population-wide structural variation (Eichler et al., 2007).
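Deviation from HWE, mentioned above as a possible trace of underlying structural variation, is commonly assessed by comparing observed genotype counts with those expected from the allele frequencies. The Python sketch below uses a one-degree-of-freedom chi-square goodness-of-fit test, which is one simple choice among several (exact tests are often preferred for small counts). The function name and the genotype counts are hypothetical.

```python
import math


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int):
    """Chi-square test of Hardy-Weinberg proportions at a biallelic SNP.

    Returns (chi2, p_value) with one degree of freedom. A deficit of
    heterozygotes, as might be produced by an unrecognized deletion,
    inflates chi2.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Counts in HWE proportions versus counts with too few heterozygotes.
    print(hwe_chi_square(810, 180, 10))
    print(hwe_chi_square(850, 100, 50))
```

Both hypothetical samples have the same allele frequencies; only the second, with its shortage of heterozygotes, departs from the expected proportions. A SNP flagged in this way may reflect genotyping error, but it may equally point to a nearby insertion or deletion.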
III. UTILIZATION OF GENETIC VARIATION IN GENE MAPPING STUDIES

The goal of disease gene discovery projects is to identify biologically functional variants in sets of genes. However, it cannot be overemphasized that the numbers of such variants are vastly overshadowed by those of functionally silent SNPs and it is these which are of most relevance to the discussions that follow. SNPs can serve as surrogate markers for functional variants when mapping disease genes and this has fueled much of the drive toward characterizing and understanding genomic variation. A clear distinction must be drawn between SNPs with biological function and those used solely for the purpose of mapping because they are utilized differently in genetic association studies (Fig. 1.1). Historically, much emphasis has been placed on the investigation of candidate SNPs, driven by a specific hypothesis. However, with the rapid expansion of the characterization of common variants, the concept of “SNPs as markers” has emerged as the primary approach, partly because so few SNPs have been adequately characterized in laboratory evaluations. By analyzing and assaying the distribution of marker SNPs, we can capture the impact of functional ones in proximity.

Two main approaches have been used to identify disease-causing genes, namely linkage analysis and association studies. Both methods are similar in that they use genetic variation to mark genomic loci and then attempt to detect cosegregation of marker and disease. A sine qua non of any disease mapping study, whether by linkage or association, is the knowledge of the stable position and frequency of the markers that are to be used, which can be a daunting challenge in the context of the rapid evolution of content, knowledge, and tools for cataloging and display. Initial efforts to discover, validate, and catalogue SNPs were gene-centric.
APOE was one of the first targets of attempts to build high-density SNP maps for association studies in complex disease (Lai et al., 1998). Subsequently, large-scale gene resequencing projects have been established in an effort to characterize genetic variation within genes of interest for a number of diseases. The SeattleSNPs discovery resource aims to resequence genes involved in inflammatory processes (http://pga.gs.washington.edu/), whereas the Environmental Genome Project focuses upon the genetic components of diseases with a clear environmental component (Wilson and Olden, 2004). Of particular interest to cancer molecular epidemiologists is the SNP500Cancer project (Packer et al., 2006) maintained by the NCI Cancer Genome Anatomy Project (http://cgap.nci.nih.gov) that aims to resequence genes of significance in cancer
[Figure 1.1 graphic: panel (i), six common SNPs (a–f) in a population sample; panel (ii), direct test; panel (iii), indirect test; panel (iv), haplotypes 1, 2, and 3.]
Figure 1.1. Direct versus indirect association testing. Part (i) shows six common SNPs as they would be represented in a population sample. SNP-c is responsible for conferring a disease phenotype upon carriers. In a direct test (ii), SNP-c would be directly assayed and tested for association with the disease, perhaps based on prior evidence of structural or functional consequences of variation at this site. In contrast, the indirect approach (iii) is agnostic with regard to functional variation. The assayed markers need only be in LD with the causative variant to achieve a signal of association. The caveat with this method is that care must be taken to type the appropriate markers needed to ensure thorough coverage of a given region. In the hypothetical example shown, tests of association between disease status and genotype at SNP-b, SNP-e, or SNP-f would prove nonsignificant. Only SNP-a and SNP-d are indirectly associated with the disease. The reason is shown in part (iv) that illustrates the concept that SNPs arise on independent haplotypic backgrounds and that many common haplotypes exist at a given locus (three are illustrated in the example, but in reality many more are likely to be present). If we assume that SNP-c arose on haplotype 1, we can see that assaying the SNPs that define haplotypes 2 and 3 will not be useful in demonstrating an association of this locus with the disease. Instead, to fully analyze this region, we must assay at least one haplotype “tagging” SNP from each of the observed haplotypes.
in four ethnically diverse populations. A review of a small number of the most significant SNP databases, along with a brief outline of their relative merits, can be found in Table 1.1.
Table 1.1. Widely Used SNP Data Repositories

Database: dbSNP
Notes: NCBI repository for SNP data from (at present) 35 organisms. It includes small insertion/deletion polymorphisms. No stipulation as to minimum MAF, therefore many SNPs are potentially singletons.
No. of SNPs: >10 million human SNPs, 4.8 million validated
Population: Diverse (no restriction on population)
Web URL: http://www.ncbi.nlm.nih.gov/projects/SNP/
Reference: Sherry ST, et al. Nucleic Acids Res. 2001; 29: 308–311

Database: HapMap
Notes: International effort designed to catalogue common human genetic variation across the genome in four ethnically distinct populations. Its primary purpose is to aid in the identification of haplotype-tagging SNPs that may be used to facilitate association study design and as such it is of great value to medical geneticists. In addition, it has yielded much information regarding the evolutionary genetics of populations.
No. of SNPs: >5.8 million
Population: 4 populations; 30 Yoruba trios, 30 Caucasian trios, 45 Chinese individuals, and 45 Japanese individuals
Web URL: www.hapmap.org
Reference: International HapMap Consortium. Nature. 2005 Oct 27; 437 (7063): 1299–1320

Database: SNP500
Notes: The SNP500 database is home to resequencing data from genes thought to be of importance in cancer. It aims to provide a framework of use to molecular epidemiologists in the design of cancer-based association studies. Validated sequencing and genotyping assay conditions are openly available from the SNP500 website and data may be extracted preformatted for use in a number of genetic analysis programs. There is a heavy SNP selection bias in favor of putative functional polymorphism and as such, SNP500 is almost entirely gene-centric.
No. of SNPs: >13,800 (updated daily)
Population: 102 individuals from 4 ethnically diverse groups
Web URL: http://snp500cancer.nci.nih.gov/home_1.cfm
Reference: Packer et al. Nucleic Acids Res. 2006; 34: D617–D621

Database: Gene SNPs/The Environmental Genome Project (EGP)
Notes: The premise of the EGP is to identify polymorphic variation in candidate genes that are believed to be at the interface between genetics and response to environmental stimulus. Approximately 500 genes drawn from cell cycle, DNA repair, apoptosis, and signaling, among others, have been chosen for inclusion. It is hoped that the EGP will be valuable in the elucidation of the genetic components of diseases with strong environmental etiology.
No. of SNPs: Approximately 30,000
Population: 90 sample population from the polymorphism discovery resource of the Coriell Repository including European, African-Americans, Mexican, Native Americans, and Asian-Americans
Web URL: http://www.genome.utah.edu/genesnps/
Reference: Wilson and Olden. Mol. Interv. 4: 147–156

Database: Seattle SNPs
Notes: Concentrates on genes with relevance to inflammation, but also clotting and heart, lung, and blood-related phenotypes. Provides assay conditions and resources for assay design.
No. of SNPs: Over 30,700
Population: 48 African-Americans and 47 Americans with European ancestry
Web URL: http://pga.mbt.washington.edu/
Reference: NA

Database: HGMD
Notes: This database is particularly useful for physicians, researchers, and genetic counselors. It aims to collate data on genetic variation pertaining to human disease. About 70% of the lesions described are SNPs, but the remainder comprises the full mutational spectrum, from small indels to gross chromosomal abnormalities. The mutations are germ line in nature; somatic and mitochondrial variants are excluded. It is important to note that the HGMD relies on the opinion of submitters as to the pathogenic significance of their entries and as such there are likely to be many nonfunctional polymorphisms in the database.
No. of SNPs: More than 50,000 entries, approximately 35,000 SNPs
Population: No restriction on population
Web URL: http://www.hgmd.cf.ac.uk
Reference: Stenson PD, et al. Hum Mutat. 2003 Jun; 21 (6): 577–581
IV. MAPPING COMPLEX DISEASE GENES USING ASSOCIATION

A. Linkage and association: Out with the old and in with the new?

A great proportion of the success observed in Mendelian disease mapping was due to the strong heritability of the traits in families and the fact that mutation of a single gene was the driving force behind disease pathogenesis; in almost all instances, a strong genotype–phenotype correlation could be drawn. Genetic linkage analysis (Elston, 1995; Ott, 1999; Risch, 1991; Teare and Barrett, 2005), in its various guises, became established as the method of choice for finding disease-associated genes and at the beginning of the new century, its success is compelling, with over 1200 Mendelian disorders having been identified (Hamosh et al., 2005). Buoyed by these remarkable achievements, many geneticists turned their attention to the pressing problem of complex diseases that pose a much greater burden from a public health point of view.

Unfortunately, linkage analysis did not fare well when applied to diseases resulting from moderate genetic contributions from multiple loci. Risch and Merikangas (1996) published a seminal paper highlighting the shortcomings of the linkage approach when applied to complex disease. By conducting simulations of hypothetical diseases with predefined genotypic risk and causative allele frequencies, they were able to show that studies of association in populations of unrelated subjects were more powerful than linkage and also that, assuming individual genes implicated in complex disease each contribute modest risk, linkage methods will rarely detect effects in study samples of the size typically available.
Although there have been isolated cases of success such as in the identification of variants in the ALOX5AP gene that confer increased risk for myocardial infarction and stroke (Helgadottir et al., 2004), these are the exception, and it is unlikely that many complex disease genes will be found in this manner using linkage analysis.

Association testing is the preferred method in the situation where a disease is caused by a small number of genes, each with low to modest contribution. The promise of association analysis is founded in the demonstration that sample size can be reduced dramatically in comparison to that required for linkage, to within realistic ranges, and yet sufficient power to detect small genetic effects can be achieved (Morton and Collins, 2002; Risch, 2000; Risch and Merikangas, 1996). Genetic association testing most commonly uses unrelated, population-based cases and controls. The principle of association is simply to determine if there is a statistically significant difference in frequency of one or more genetic markers between the two groups. If this is shown to be the case, it is possible that the marker is itself causative, but it is more likely that it is closely associated with the causative mutation (Rothman et al., 2001). Association testing depends on the supposition that the variants contributing to
common diseases arise on ancestral genetic backbones that are shared by a large proportion of those affected. This is the basis for the common disease, common variant hypothesis of complex disease (CDCV) (Reich and Lander, 2001). Metaanalysis of over 300 independent but overlapping studies testing 25 reported associations concluded that common variants were most likely to be responsible for a significant number of the positive findings (Lohmueller et al., 2003). The impact of rare variants must not be discounted; a small number of recent studies have highlighted the importance of population-based resequencing for comprehensive association analysis (Cohen et al., 2006; Romeo et al., 2007). Until the recent GWAS revolution, there have been few cases where the reality of association study performance lived up to predictions. Analysis of candidate gene association studies highlights the paucity of confirmed findings; in fact, only a handful of associations from over 600 studies have been replicated more than twice (Hirschhorn et al., 2002; Lohmueller et al., 2003). The advent of whole genome scans has spawned the opportunity to pursue sets of candidate genes or regions marked by the GWAS and so in this regard, we will never leave candidate gene studies behind.
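The principle of association stated in this section, a comparison of marker frequency between unrelated cases and controls, can be illustrated with a simple allelic test. The Python sketch below builds a 2 × 2 table of allele counts and computes a chi-square statistic (one degree of freedom) and an allelic odds ratio. It is a deliberately minimal illustration: the function name and genotype counts are hypothetical, and genotype-based or trend tests, covariate adjustment, and correction for population structure are outside its scope.

```python
import math


def allelic_association(case_genotypes, control_genotypes):
    """Allelic chi-square test for a biallelic SNP.

    Each argument is a tuple of genotype counts (n_aa, n_ab, n_bb).
    Returns (odds_ratio, chi2, p_value) for allele B versus allele A.
    """
    def allele_counts(genotypes):
        n_aa, n_ab, n_bb = genotypes
        return 2 * n_aa + n_ab, n_ab + 2 * n_bb  # (A alleles, B alleles)

    case_a, case_b = allele_counts(case_genotypes)
    ctrl_a, ctrl_b = allele_counts(control_genotypes)
    if 0 in (case_a, case_b, ctrl_a, ctrl_b):
        raise ValueError("every cell of the 2 x 2 table must be non-zero")

    total = case_a + case_b + ctrl_a + ctrl_b
    cross = case_b * ctrl_a - case_a * ctrl_b
    chi2 = total * cross ** 2 / (
        (case_a + case_b) * (ctrl_a + ctrl_b)
        * (case_a + ctrl_a) * (case_b + ctrl_b)
    )
    odds_ratio = (case_b * ctrl_a) / (case_a * ctrl_b)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square, 1 df
    return odds_ratio, chi2, p_value


if __name__ == "__main__":
    # Hypothetical study of 1000 cases and 1000 controls.
    print(allelic_association((400, 480, 120), (490, 420, 90)))
```

A significant result from a test like this shows only that the marker allele differs in frequency between the groups; as the text emphasizes, the marker is more likely to be correlated with the causal variant than to be the causal variant itself.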
B. Genetic association testing: The direct approach

Genetic association may be tested either directly or indirectly (see Fig. 1.1). Direct association testing assumes that the polymorphism under examination is itself the disease-causing variant. This assumption poses the problem of how to sift through the millions of SNPs that have been catalogued and choose an appropriate subset for study: there are simply too many SNPs in the genome to consider testing each and every one. Instead, investigators often select variants with known or predicted functional consequences, perhaps changing protein structure or regulating gene expression, because it is easy to envisage how these might contribute to disease processes. The chance of success may be greatly improved if candidate genes, perhaps implicated in disease pathogenesis by nature of accompanying experimental evidence, can be identified. Such an example is the association of the NAT2 slow acetylation genotype in smokers with bladder cancer (Garcia-Closas et al., 2005). Biological plausibility for this association comes from the fact that aromatic monoamines, which are a constituent carcinogen found in cigarette smoke, are detoxified by N-acetylation. Subjects with the NAT2 slow acetylation genotype have an increased overall risk of bladder cancer, a finding that, crucially, has been replicated multiple times.

Recently, the candidate gene approach has yielded promising results for type 1 diabetes, in which immune deregulation is suspected: strong association with the CD25 locus was demonstrated in a large population-based case-control study (Vella et al., 2005), and highly significant findings in PTPN22 have been replicated for the same disease (Bottini et al., 2004; Smyth et al., 2004). Promising results have also been obtained in myocardial infarction, where
association with LTB4 in the leukotriene pathway has been reported (Helgadottir et al., 2006).

The candidate approach is heavily biased by biological intuition, and the overwhelming limitation of the direct testing approach is that the a priori probability of testing the correct SNP within the correct gene is very small. Current awareness of the biological functionality of non-protein-encoding genetic variation makes choosing SNPs solely on the basis of coding potential seem rather outdated. Nonetheless, while it is straightforward to identify exonic and splice boundary SNPs in a gene of interest, methodologies for prediction of functional intronic or regulatory SNPs are in their infancy. The promise of comparative genomics in this area is currently being evaluated; by identifying regions of high similarity between homologous genes or genomic regions in different species, it is possible to highlight conserved noncoding sequences (Bejerano et al., 2004, 2005). It is hypothesized that many of these are likely to be domains of functional importance and that any variation observed within them should be given higher priority when selecting variants to assay for association. Of great utility in this approach are the bioinformatics portal at the University of California Santa Cruz Genome website and the Vista comparative genomic browser (Frazer et al., 2004; Hinrichs et al., 2006; Kent et al., 2002).

Direct testing may be a satisfactory strategy when applied to one or a few genes with a high prior probability of involvement in the pathogenesis of a given disease. However, a consequence of the inherent bias of candidate studies is the publication of associations that are of moderate statistical significance and that subsequently fail to replicate. Multicenter studies using pooled samples may be a solution to this problem. In a recent Lancet Oncology article, Rothman et al.
(2006) describe an association study designed to test the hypothesis that variation in immune and inflammatory response genes influences susceptibility to non-Hodgkin lymphoma. Focusing on a small number of variants, each with putative functional significance, the authors present evidence of association of variants in TNF and IL10 with diffuse large B-cell lymphoma. The findings were highly significant in the primary pooled analysis. Had a consortium-based approach not been adopted, however, it is likely that the overall significance would not have been determined until a later meta-analysis was performed. The article also serves as an example of how study power may be improved by multicenter efforts, with concomitant reductions in type I error.
C. Genetic association testing: The indirect approach

As stated in the preceding section, the odds of selecting the true causative SNP for direct testing in an association study are generally very low. The alternative is to genotype subsets of SNPs that exploit correlation between markers, thereby reducing redundancy while providing comprehensive coverage of interesting loci. Indirect study designs offer the opportunity to map disease genes, while remaining agnostic with regard to function, by detecting causal
variants by proxy as a consequence of the correlation between a marker SNP and the true functional variant. Intermarker correlation is slowly eroded by recombination (International HapMap Consortium, 2005; Reich et al., 2002) and markers on the same chromosome that remain strongly associated with one another in the face of such recombination are said to be in LD with one another. The patterns of LD in present day populations are the result of meiotic events that occurred in previous generations. In populations of unrelated individuals, LD deteriorates rapidly as the distance between markers increases; the collective population-wide effects of recombination diminish LD such that it is maintained only over relatively short genomic regions. When LD is plotted across the genome, it appears to create blocks of common genetic variation separated by hotspots of recombination (Myers et al., 2005). Each haplotype block is characterized by having relatively low haplotypic diversity in a given population. At present, there is no evidence for the existence of mechanisms that regulate the size of haplotype block boundaries, although the size and distribution of haplotype blocks varies between populations (Conrad et al., 2006; International HapMap Consortium, 2005; Reich et al., 2001). In general, the average block size in African populations is smaller than for other populations studied (e.g., North European Caucasians or East Asians). This is the result of an ancestral population bottleneck event that occurred during the migration of modern humans out of Africa; reductions in population size led to diminished genetic diversity within that population. Some haplotypes may be relatively large, extending over hundreds of kilobases, evidence that parts of the genome are perhaps relatively protected from recombination. 
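The erosion of intermarker correlation by recombination described above can be made concrete with the standard population-genetic expectation that, under random mating, the LD coefficient D between two markers shrinks by a factor of (1 - c) each generation, where c is the recombination fraction between them. The following is a minimal sketch under that assumption; the function names and the example values of c are illustrative and are not drawn from the studies cited.

```python
import math


def ld_after_generations(d0, c, generations):
    """Expected LD coefficient D after a number of generations of random mating.

    d0 is the starting value of D; c is the recombination fraction
    between the two markers (0 < c <= 0.5).
    """
    return d0 * (1.0 - c) ** generations


def generations_to_halve_ld(c):
    """Number of generations needed for D to fall to half of its starting value."""
    return math.log(0.5) / math.log(1.0 - c)


if __name__ == "__main__":
    # Assumed, illustrative recombination fractions: tightly linked,
    # moderately linked, and unlinked markers.
    for c in (0.0001, 0.01, 0.5):
        print(c, round(generations_to_halve_ld(c), 1))
```

Tightly linked markers retain LD for thousands of generations under this model, whereas unlinked markers lose half of it in a single generation, which is consistent with LD being maintained only over short genomic regions.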
Initially, SNP discovery efforts were needed to characterize common variation at a locus of interest; this was usually achieved by resequencing in small numbers of individuals. Thankfully, with the advent of the HapMap (International HapMap Consortium, 2005), this time-consuming and expensive practice has become unnecessary for first pass analysis and currently is primarily reserved for fine mapping. The main objective of the HapMap was to genotype SNPs with sufficient density across the human genome, eventually achieving a resolution of one SNP in every one to two kilobases. Crucially, three ethnically diverse populations were included in the study. The patterns of LD in the genome vary according to population genetic history and it is important therefore to ensure that the appropriate reference population be selected when using SNP repository data in pursuing an indirect testing strategy. It is critical to consider the applicability of data from HapMap to other populations. In this regard, the transportability of markers should be assessed prior to conducting the full scale analysis (Cullen et al., 1997; de Bakker et al., 2006; Gonzalez-Neira et al., 2006; Ribas et al., 2006; Smith et al., 2006; Willer et al., 2006). This systematic approach of the HapMap in the annotation of common variation has provided the genetics community with the means to select highly informative sets of SNPs
to efficiently interrogate a region of interest. In addition, it has provided an invaluable resource for an unprecedented examination of the evolutionary history of human populations (Conrad et al., 2006).
D. tagSNPs

Knowledge of the nature of LD in the human genome, as gleaned from the HapMap, has enabled the selection of minimally redundant panels of SNP markers for use in indirect genetic association studies. These markers, called tagSNPs, significantly reduce the genotyping burden of a typical association study (Johnson et al., 2001; Sebastiani et al., 2003). Carlson et al. (2004) developed a “greedy” algorithm that groups SNPs into bins according to pairwise correlation (see below) for tag selection; a high-performance variant of this algorithm is incorporated in the NCI’s Tagzilla package (http://tagzilla.nci.nih.gov), which can select a genome-wide panel of tagSNPs from HapMap CEPH data in under 6 h on a standard desktop PC. A number of other algorithms have been written for the purpose of identifying tagSNPs, and the reader is directed to the review of Stram (2005) for detailed discussion. TagSNP selection will likely require further optimization, especially in admixed populations and for multimarker haplotype tags (e.g., where two or more SNPs act as a proxy for a single untested SNP) (de Bakker et al., 2005).
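The idea of greedy binning by pairwise correlation can be sketched as follows. This is a simplified illustration in the spirit of the approach, not the published algorithm and not the Tagzilla implementation: it repeatedly picks the SNP correlated above a threshold with the largest number of still-untagged SNPs and treats it as the single tag for that bin. The SNP names, the r² values, and the threshold of 0.8 are all made up for the example.

```python
def greedy_tag_bins(snps, r2, threshold=0.8):
    """Group SNPs into bins, each represented by one tag SNP (simplified sketch).

    snps: list of SNP identifiers.
    r2:   dict mapping frozenset({snp_a, snp_b}) to pairwise r^2;
          pairs that are absent are treated as uncorrelated.
    """
    def linked(a, b):
        return a == b or r2.get(frozenset((a, b)), 0.0) >= threshold

    untagged = list(snps)
    bins = []
    while untagged:
        # Choose the SNP that captures the most untagged SNPs (ties: first in list).
        tag = max(untagged, key=lambda s: sum(linked(s, t) for t in untagged))
        members = [t for t in untagged if linked(tag, t)]
        bins.append((tag, members))
        untagged = [t for t in untagged if t not in members]
    return bins


# Hypothetical input: five SNPs with a few strong pairwise correlations.
example_snps = ["snp1", "snp2", "snp3", "snp4", "snp5"]
example_r2 = {
    frozenset(("snp1", "snp2")): 0.95,
    frozenset(("snp2", "snp3")): 0.85,
    frozenset(("snp1", "snp3")): 0.60,
    frozenset(("snp4", "snp5")): 0.90,
}
example_bins = greedy_tag_bins(example_snps, example_r2)
```

In this toy example two tags (snp2 and snp4) stand in for all five SNPs, which is the kind of reduction in genotyping burden that tagging aims for.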
E. Quantifying LD in the genome

Several measures of pairwise LD are routinely used to describe marker-marker correlation (Devlin and Risch, 1995) and are central to SNP tagging. The two most commonly used are D′ (the LD coefficient D, standardized by its maximum attainable value) and r² (the squared correlation coefficient). Both D′ and r² have maximal values of one, indicating complete LD between markers in a two-SNP haplotype; values of less than one can be more difficult to interpret. D′ reaches its maximum value of 1 when fewer than the four possible two-SNP haplotypes are observed in a population. In contrast, r² is a direct measure of pairwise correlation and has several properties that make it particularly applicable to tagSNP selection. For r² to reach a value of one, only two of the four possible two-SNP haplotypes may be observed. The values of r² (and, to a lesser extent, D′) often appear to fluctuate erratically along a chromosome; although this seems counterintuitive, it can be explained by coalescent theory (Donnelly and Tavare, 1995). When designing indirect testing studies, r² has the useful property that the sample size required to achieve power equivalent to that of a direct test scales with the inverse of r² (Pritchard and Przeworski, 2001; Table 1.2).
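The difference between the two measures is easiest to see numerically. The sketch below computes D, |D′|, and r² from the frequency of one two-SNP haplotype and the two allele frequencies, using the standard definitions; the function name and the example frequencies are assumptions chosen for illustration.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) for two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and B at SNP 2.
    p_a:  frequency of allele A at SNP 1.
    p_b:  frequency of allele B at SNP 2.
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both SNPs must be polymorphic")
    d = p_ab - p_a * p_b
    # D is standardized by the largest magnitude it could take given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    d_prime = abs(d) / d_max
    r2 = d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    return d, d_prime, r2


# Two haplotypes observed (AB = 0.3, ab = 0.7): complete LD by both measures.
two_haplotypes = ld_measures(p_ab=0.3, p_a=0.3, p_b=0.3)

# Three haplotypes observed (AB = 0.3, aB = 0.2, ab = 0.5): D' stays at 1, r^2 does not.
three_haplotypes = ld_measures(p_ab=0.3, p_a=0.3, p_b=0.5)
```

The second case shows why r² is the stricter criterion for tagging: one of the four possible haplotypes is missing, so D′ is still 1, yet the two SNPs are far from interchangeable.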
Table 1.2. Sample Considerations to Maintain an Equivalent Level of Power to Detect Association at Differing r² Correlation Thresholds

r²      Additional samples required (%)
1.0     na
0.9     11
0.8     25
0.7     43
0.6     67
0.5     100
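The figures in Table 1.2 follow directly from the inverse relationship between r² and the required sample size: a proxy SNP with correlation r² needs roughly 1/r² times as many samples as a direct test. A short sketch of that arithmetic (the function name is an assumption):

```python
def additional_samples_pct(r2):
    """Percentage of extra samples needed to match the power of a direct test."""
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must be in the interval (0, 1]")
    return (1.0 / r2 - 1.0) * 100.0


thresholds = (0.9, 0.8, 0.7, 0.6, 0.5)
table_1_2 = [round(additional_samples_pct(r2)) for r2 in thresholds]
```

For example, at r² = 0.8 a study planned for 1,000 cases under direct testing would need about 1,250 cases to retain the same power.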
F. Testing association using haplotypes

Haplotypes are sets of alleles on a chromosome that are inherited together. They neatly encapsulate the genetic events that have taken place between a common ancestor and the present-day population. At some point in the past, mutations that influence disease susceptibility occurred on the backbone of a particular haplotype. Under certain circumstances, analysis of haplotypes may increase power to detect disease loci relative to that of single SNPs (Clark, 2004). If mutations at a given locus arise on a large number of independent haplotypes, rather than on one or a few that are common, the power of association-based studies will be reduced because they rely on a moderately low degree of allelic heterogeneity at the locus of interest (Pritchard and Cox, 2002). A possible solution to this problem may be to search for association primarily using haplotype analysis (Pritchard, 2001).

Unlike SNP genotypes, an individual’s diplotype (pair of haplotypes) cannot easily be determined using molecular methods (Xu, 2006). Current DNA sequencing platforms generate reads of diploid sequence from which it is not possible to determine haplotype phase. As an alternative to direct experimental observation, statistical algorithms for haplotype inference have been developed that estimate haplotype probabilities from population-based genotype data (Clark, 1990; Excoffier and Slatkin, 1995; Stephens et al., 2001). Methods vary subtly in their sensitivity to genotyping error rate, deviation from HWE, and various other population genetic properties, but in general all perform well and have been widely implemented (Marchini et al., 2006; Salem et al., 2005). It is important, however, that uncertainty in haplotype phase assignments be accounted for in haplotype-based tests of association. The most appropriate method for testing haplotypes for phenotypic association is much debated. In general, there is no one-size-fits-all solution.
In some circumstances, such as when a single risk haplotype is thought to be present, it may be appropriate to collapse all but the risk haplotype into a single
group and then compare the frequency of the risk haplotype with that of the collapsed group in cases and controls. In the absence of knowledge of the true risk haplotype, this process must be repeated for each haplotype in turn, and so the procedure incurs a penalty for multiple testing. Alternatively, a single global test of association may be used, which is the preferred approach if multiple haplotypes contribute to disease risk, with the caveat that power is adversely affected when the level of haplotype diversity is high. Recently, novel techniques that attempt to circumvent these shortcomings have been proposed (Li et al., 2007).
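The statistical inference of haplotype frequencies from unphased genotypes mentioned in this section can be illustrated for the simplest case of two SNPs, where only double heterozygotes have ambiguous phase. The sketch below uses an expectation-maximization scheme of the general kind cited above; it is a teaching example under the assumption of random mating, not a reimplementation of any published program, and all names and counts are hypothetical.

```python
def known_haplotypes(g1, g2):
    """Haplotype pair implied by a genotype heterozygous at no more than one SNP."""
    if g1 == 1:  # heterozygous at SNP 1, therefore homozygous at SNP 2
        second = "B" if g2 == 2 else "b"
        return ["A" + second, "a" + second]
    first = "A" if g1 == 2 else "a"
    return [first + "B"] * g2 + [first + "b"] * (2 - g2)


def em_haplotype_frequencies(genotype_counts, n_iter=50):
    """Estimate two-SNP haplotype frequencies by expectation-maximization.

    genotype_counts maps (g1, g2) to a number of individuals, where g1 and g2
    are the copies (0, 1, or 2) of allele A at SNP 1 and allele B at SNP 2.
    """
    n_chrom = 2 * sum(genotype_counts.values())
    fixed = {"AB": 0.0, "Ab": 0.0, "aB": 0.0, "ab": 0.0}
    n_double_het = genotype_counts.get((1, 1), 0)
    for (g1, g2), n in genotype_counts.items():
        if (g1, g2) != (1, 1):
            for hap in known_haplotypes(g1, g2):
                fixed[hap] += n
    freq = {hap: 0.25 for hap in fixed}
    for _ in range(n_iter):
        # E-step: probability that a double heterozygote carries AB/ab rather than Ab/aB.
        cis = freq["AB"] * freq["ab"]
        trans = freq["Ab"] * freq["aB"]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: add the expected haplotype counts to the unambiguous ones.
        counts = dict(fixed)
        counts["AB"] += n_double_het * w
        counts["ab"] += n_double_het * w
        counts["Ab"] += n_double_het * (1.0 - w)
        counts["aB"] += n_double_het * (1.0 - w)
        freq = {hap: c / n_chrom for hap, c in counts.items()}
    return freq


# Hypothetical sample: 3 AABB, 5 aabb, and 2 double heterozygotes.
estimated = em_haplotype_frequencies({(2, 2): 3, (0, 0): 5, (1, 1): 2})
```

Because the estimates are probabilities rather than observed phases, downstream association tests should carry the weight w forward instead of treating the most likely phase as certain.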
V. GENOME-WIDE ASSOCIATION STUDIES

In a relatively short space of time, genetic association studies have progressed from the study of a small number of candidate genes to candidate pathways and have now reached the point at which the entire genome can be examined. This has been realized through synergy between technological platforms (Gunderson et al., 2005; Kennedy et al., 2003; Matsuzaki et al., 2004a,b; Steemers et al., 2006), novel sophisticated algorithms for data analysis, awareness of issues relating to the storage of the immense volumes of data generated, and the completion of the International HapMap project, an invaluable resource for tagSNP selection (de Bakker et al., 2006; Pe’er et al., 2006). Now, in the aftermath of the first batch of GWAS, it is prudent to reflect on many of the issues that have given rise to considerable debate within the community. One of the major advantages of the GWAS approach is that, unlike candidate gene studies, it facilitates a hypothesis-free approach to genetic epidemiological investigations of common diseases. This has paid dividends in that novel loci have been identified in a wide spectrum of conditions, many of which are in genes that had not previously been considered to be involved in their pathogenesis (Manolio et al., 2008). Thus, new light has been shed on biological processes relevant to disease. Sobering, however, is the observation of replicable associations in regions of the genome in which there are no known genes (Amundadottir et al., 2006; Gudbjartsson et al., 2007; Moffatt et al., 2007; Stacey et al., 2007), perhaps implicating mechanisms of long-range gene regulation that will require considerable effort to elucidate. The challenge in the wake of such data is to unravel the biology underlying them; the real payoff for whole genome association studies will come when insights into novel mechanisms underlying health and disease can be established.
In a typical GWAS, a daunting number of markers must be tested to ensure adequate coverage of the entire genome. Commercially available genotyping arrays have increased their content capabilities from numbers in the low 100,000s to 1 million or more SNPs. It has been estimated that approximately half a million SNPs will ensure a high degree of coverage of the Caucasian
genome with substantial increases needed to attain a similar level of coverage in African and African-American populations (Barrett and Cardon, 2006). The majority of commercially available products for GWAS are composed of manufacturer-chosen SNPs, selected according to their suitability for LD-based indirect association testing. A small number of arrays allow for a certain amount of user-defined content in addition to the preselected SNPs. Regardless of the criteria for inclusion of a SNP on an array, the overwhelming factor in a GWAS study is the unparalleled capacity for data generation and the concomitant issues that are involved both in ensuring a consistently high level of data quality and in the need for statistical methodologies that are required to make sense of it. The requirement for robust raw genotype data cannot be overstated. Stringent quality control measures including completion rate by both sample and assay and genotype concordance between duplicate samples should be mandatory. Tests for deviation of genotype frequencies from Hardy-Weinberg proportions may be useful for the detection of systematic genotyping error. However, care should be taken because such departures may also be signatures of true association (Wittke-Thompson et al., 2005). Of particular concern is the observation that sample extraction procedures can introduce biases in genotype calls, as this has serious implications for studies that require samples from multiple sources to attain critical mass (Clayton et al., 2005). We have chosen not to labor upon analytic approaches specifically tailored toward GWAS as they have been well reviewed in a number of recent articles (McCarthy et al., 2008; Pearson and Manolio, 2008) and often are similar in all but scale to the methods used in candidate studies, which are briefly touched upon in the subsequent section. We would like though to emphasize what we feel are essential criteria for the validation of GWAS signals. 
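The test for deviation from Hardy-Weinberg proportions mentioned above as a quality control measure can be sketched with a one-degree-of-freedom chi-square statistic on the three genotype counts of a SNP. The function names and example counts below are assumptions; in practice exact tests are often preferred for rare alleles, and the choice of rejection threshold is left to the analyst.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions from genotype counts.

    Returns (chi-square statistic, p-value).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2.0 * n)  # frequency of the first allele
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return chi2, chi2_sf_1df(chi2)


# Hypothetical counts: one SNP in perfect proportions, one with a heterozygote deficit.
in_equilibrium = hwe_chi_square(25, 50, 25)
het_deficit = hwe_chi_square(40, 20, 40)
```

A marked heterozygote deficit of the second kind is a common signature of genotype-calling problems, although, as noted above, departures in cases may also reflect true association.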
The problem of statistical noise generated by conducting hundreds of thousands of tests for association is discussed below. Here, we will focus on standards that are of particular relevance to GWAS and which have been agreed upon by experts in the field, with the aim of minimizing the publication of false positive reports that have traditionally dominated complex disease genetic epidemiology (Chanock et al., 2007). Replication of interesting findings in independent studies is currently the de facto standard approach in any genetic association study, GWAS or otherwise. Care should be taken to apply the same criteria for case-control selection in replication studies so that comparable populations are utilized. Preferably, the same SNPs as in the original report, or SNPs in perfect LD with them, should be genotyped. The raw genotypic clusters for highly associated SNPs should be inspected and the genotype data should be confirmed on a secondary platform. GWAS in particular may benefit from careful meta-analysis across studies and, as such, investigators are strongly encouraged to make genotyping data available to others, subject to the confidentiality agreements required to protect the identities of individual study
participants. It is likely that in future the largest and best-funded endeavors will incorporate distinct exploratory and multiple replication phases into single studies; the NCI’s CGEMS project (http://cgems.cancer.gov) (Hunter et al., 2007a,b; Thomas et al., 2008; Yeager et al., 2007) is one such example of this approach and combines a prospective exploratory cohort with subsequent sequential validation of the top SNPs from each round of replication.
VI. STUDY DESIGN AND DATA ANALYSIS

A. Introduction

Study designs for genetic association testing follow closely the patterns laid out by traditional epidemiology. Although simple case-control designs are often adopted, some studies use nested or prospective approaches. The ability to generate substantial numbers of genotypes has created a requirement for robust statistical methods that address the hypotheses underlying the study. The simplest approach in association analysis is based on contingency tables and tests whether the allele frequencies of a marker observed in cases and controls differ from those expected under the null hypothesis of no association. The same approach can easily be expanded to consider genotype frequencies, and this is perhaps more appropriate given the nature of genetic inheritance. Further adaptation is possible so that specific genetic models may be tested, the most common being dominant, recessive, overdominant, and additive. Logistic regression models may be constructed if one wishes, as is likely in complex disease epidemiology, to incorporate environmental covariates into the analysis.
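The contingency-table approach described above can be sketched as an allelic test: genotype counts in cases and controls are collapsed to a 2 x 2 table of allele counts, from which a chi-square statistic and an odds ratio are computed. The function name and the example counts are assumptions for illustration, and the sketch uses no continuity correction.

```python
import math


def allelic_association(case_genotypes, control_genotypes):
    """Allelic chi-square test (1 df) from genotype counts.

    Each argument is a tuple (n_AA, n_Aa, n_aa).
    Returns (chi-square statistic, p-value, odds ratio for allele A).
    """
    def allele_counts(genotypes):
        n_hom_ref, n_het, n_hom_alt = genotypes
        return 2 * n_hom_ref + n_het, 2 * n_hom_alt + n_het

    a, b = allele_counts(case_genotypes)      # cases: A alleles, a alleles
    c, d = allele_counts(control_genotypes)   # controls: A alleles, a alleles
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / float((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square, 1 df
    odds_ratio = (a * d) / float(b * c)
    return chi2, p_value, odds_ratio


# Hypothetical study: 100 cases and 100 controls genotyped at one SNP.
result = allelic_association((30, 50, 20), (20, 50, 30))
```

Testing genotype frequencies or a specific genetic model changes only how the table is built (for example, a 2 x 3 table, or carriers versus non-carriers for a dominant model); adjusting for covariates requires moving to logistic regression.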
B. Type I error and the multiple testing problem

In genetic epidemiology, the challenge of sorting true findings from false positives (type I error) is daunting. Chance false positive findings are the most likely cause of failure to replicate findings in follow-up studies of association (Colhoun et al., 2003). If we consider that thousands of SNPs may be tested in a study scrutinizing hundreds of genes in a candidate pathway, or that hundreds of thousands may be tested in a whole genome association study, it quickly becomes clear that vast numbers of spurious associations will be detected at conventional thresholds of statistical significance. It is generally accepted that classical approaches to correct for multiple testing are too conservative in genetic epidemiology. Bonferroni-type corrections, which aim to limit the probability of accepting even a single false positive association, control the study-wise error rate. In other words, the price for looking is steep; overly stringent penalization of p-values may lead to truly associated markers being dropped from replication analysis.
Some have argued that it may be better to downplay this problem on the basis that it will be easier to discredit false positive findings at a later date than to resurrect false negatives after they have been discarded. An alternative approach has been to weight genes according to the prior probability that they are biologically involved in disease pathogenesis, perhaps based upon knowledge of pathways or other functional evidence (Wacholder et al., 2004). This false positive report probability (FPRP) considers each test separately and does not exact a significant penalty for testing multiple hypotheses. It takes into account the power of a study and accounts for correlative information, and it is closely related to the positive predictive value. The FPRP is, however, only as good as the subjective decision concerning the prior probability; its merit lies in using ranges of priors to determine whether a hypothesis is noteworthy. A widely applied alternative is the false discovery rate (FDR), which controls the expected proportion of type I errors among the rejected null hypotheses (Benjamini and Hochberg, 1995). The FDR limits the study-wise error rate but does not exact a high penalty for testing more hypotheses with low priors. The relative merits and shortcomings of three of these methods have been summarized in Table 1.3.
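The three corrections discussed in this section can be compared on a small set of p-values. The sketch below implements a Bonferroni cut-off, the Benjamini-Hochberg step-up procedure for the FDR, and the FPRP in its usual form, FPRP = p(1 - prior) / [p(1 - prior) + power x prior]. The function names, the example p-values, and the chosen priors and power are assumptions for illustration only.

```python
def bonferroni_rejections(pvalues, alpha=0.05):
    """Indices of tests significant at the Bonferroni-corrected threshold alpha / m."""
    threshold = alpha / len(pvalues)
    return [i for i, p in enumerate(pvalues) if p <= threshold]


def benjamini_hochberg(pvalues, q=0.05):
    """Indices of tests rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value falls under its own threshold.
        if pvalues[i] <= rank / m * q:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


def fprp(p_value, power, prior):
    """False positive report probability for one finding.

    p_value: observed p-value; power: power to detect the assumed effect;
    prior: prior probability that the association is real.
    """
    false_pos = p_value * (1.0 - prior)
    true_pos = power * prior
    return false_pos / (false_pos + true_pos)


example_p = [0.012, 0.001, 0.205, 0.008, 0.06]
bonf = bonferroni_rejections(example_p)
bh = benjamini_hochberg(example_p)
fprp_low_prior = fprp(0.001, power=0.8, prior=0.001)
fprp_high_prior = fprp(0.001, power=0.8, prior=0.1)
```

In this toy example the FDR procedure retains one more marker than Bonferroni, and the same p-value of 0.001 yields an FPRP above 0.5 for a SNP with a prior of 1 in 1,000 but close to 0.01 for a strong candidate, which illustrates how heavily the FPRP depends on the prior.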
Table 1.3. Advantages and Disadvantages of Statistical Corrections for Type I Error Commonly Used in Genetic Association Studies

Test: Bonferroni correction
Description: Derived threshold for significance determined by dividing the significance value for rejection of the null hypothesis (usually 0.05) by the number of tests conducted. For a genome-wide association study of 100,000 SNPs, the Bonferroni-corrected threshold of significance would be 5 × 10⁻⁷.
Advantages: Ease of application. High stringency for acceptance.
Disadvantages: Treats each SNP in an association study as an independent test, making no allowance for LD between markers. Allows for no weighting according to prior biological knowledge. Potentially too harsh, leading to inflated rejection of modest but real association.

Test: False positive report probability
Description: Strategy to reduce the number of false reports of positive association by enabling biological credibility to be factored into the error correction. Dependent on observed p-value, prior probability of association, and statistical power.
Advantages: Threshold FPRP levels can be adjusted to suit study design. Sensible selection of prior probability ranges enhances power to detect modest genetic association. Applicable to meta-analysis. Adaptive to the study set under examination.
Disadvantages: Difficulty in assessing prior probabilities for genes, pathways, or SNPs. Predicated on biological insight(s) not always available.

Test: False discovery rate
Description: Defines a global probability threshold as a proportion of the number of expected incidences of type I error over the total number of rejections of the null hypothesis. Yields the proportion of statistically significant findings that are actually false positives.
Advantages: Less conservative than the Bonferroni correction.
Disadvantages: Does not take LD between markers into account. Does not consider prior probabilities. Does not provide a “corrected” p-value.

C. Additional sources of error

Population stratification (substructure in the study population that leads to inflated type I error) had been thought by some to be a considerable source of concern in genetic association testing. Recent data from GWAS have largely allayed this apprehension because it seems that stratification may be avoided by careful matching of cases and controls (Wellcome Trust Case Control Consortium, 2007). In populations without substantial admixture, the effect is small to nonexistent (Freedman et al., 2004; Wacholder et al., 2000). If present, it can be corrected with relative ease (Devlin and Roeder, 1999; Epstein et al., 2007; Price et al., 2006; Pritchard and Rosenberg, 1999). Whether it arises from cryptic relatedness or from differences in population genetic history, the challenge of population stratification is likely to be confined to populations with recent admixture.

Biases in genetic epidemiology can be introduced by differences in ascertainment between cases and controls, sample handling procedures, and use of multiple genotyping platforms, among others. One must be highly selective with regard to the participant selection criteria that regulate entry into a study. Careful phenotype collection is likewise crucial. It is notable that all published examples of irrefutable association and replication have involved conditions with standardized and widely adopted classification criteria (Manolio et al., 2008). In spite of the benefit of association versus linkage with respect to power, many studies remain only moderately powered to detect association because of suboptimal sample size. There may be considerable gains in power from combining multiple individual studies, and meta-analysis has been touted as a possible way to achieve this (Ioannidis et al., 2006). The utility of meta-analysis has been comprehensively reviewed (Munafo and Flint, 2004). Meta-analysis has the potential to uncover significant associations from multiple studies that appeared nonsignificant when analyzed individually (Ntais et al., 2005; Vineis et al., 2001; Vogl et al., 2004). However, special care needs to be taken to minimize and account for heterogeneity between studies.
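How pooling can reveal an association that no single study detects, and how between-study heterogeneity can be quantified, is sketched below with a fixed-effect, inverse-variance meta-analysis of per-study log odds ratios together with Cochran's Q statistic. This is one common approach rather than the method of any study cited here; the function name and the three hypothetical study results are assumptions.

```python
import math


def fixed_effect_meta(log_odds_ratios, standard_errors):
    """Inverse-variance fixed-effect meta-analysis.

    Returns (pooled log odds ratio, its standard error,
             two-sided p-value, Cochran's Q heterogeneity statistic).
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, log_odds_ratios)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    # Q is referred to a chi-square distribution with (number of studies - 1) df.
    q_stat = sum(w * (b - pooled) ** 2 for w, b in zip(weights, log_odds_ratios))
    return pooled, pooled_se, p_value, q_stat


# Three hypothetical studies, each with log OR = 0.2 and SE = 0.15
# (individually z = 1.33, which is not significant at the 0.05 level).
pooled, pooled_se, pooled_p, q_stat = fixed_effect_meta([0.2, 0.2, 0.2], [0.15, 0.15, 0.15])
single_study_p = math.erfc((0.2 / 0.15) / math.sqrt(2.0))
```

When Q is large relative to its degrees of freedom, the fixed-effect assumption is doubtful and a random-effects model or an investigation of the sources of heterogeneity is usually more appropriate.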
VII. SIGNIFICANCE FOR PUBLIC HEALTH

A fundamental goal of the human genome project was to facilitate genomic medicine, both through the identification of novel targets for therapeutic intervention and by enabling so-called personalized medicine, in which an individual’s genomic profile could be incorporated into diagnostic algorithms and the treatment of disease. In reality, both of these objectives will likely take many years to come to fruition, at least in the setting of general practice. Although the findings from the first batch of GWAS are immensely exciting, it is crucial that enthusiasm to translate them from array to bedside be tempered by the need for caution in attempting to make risk predictions based on estimates generated from retrospective studies. Indeed, at the time of writing, the number of new loci that have been identified for any individual disease is low and identification of the full spectrum of independent variants is a long way off. Thus, for each of the diseases that have been studied, we currently have in hand a few variants, each of which explains only a very small proportion of a person’s overall risk. In addition, few of the studies so far have been designed with the intention of determining the nature of gene–environment interactions that are sure to be central in complex disease pathogenesis (Hunter, 2005). It is alarming therefore to note the rapid emergence of commercial enterprises that offer direct-to-consumer predictions of risk based solely on an individual’s genomic profile (Hunter et al., 2008). Such ventures should be regarded with a healthy dose of skepticism by the general public until the impact of new loci has been properly assessed in prospective studies.
Of the currently reported loci, only those associated with age-related macular degeneration (AMD) (Edwards et al., 2005; Haines et al., 2005; Hughes et al., 2006; Klein et al., 2005), a late-onset disease resulting in blindness, appreciably alter risk at levels that may be clinically relevant (Hughes et al., 2007). However, given that the interpretation of genetic profiles is epidemiologically complex (Ware, 2006), it seems prudent that these analyses be performed only under the direction of experienced physicians and genetic counselors. The potential for direct-to-consumer personalized genomic profiles to be misconstrued is such that legislation designed to tightly regulate their application in both commercial and clinical practice is urgently needed (Hunter et al., 2008; Offit, 2008).
Screening for a limited number of genetic variants has, however, begun to filter into the clinical decision-making process. Pharmacogenomics, the prediction of drug response and toxicity from genotype analysis, is the most prominent example (Roses, 2001). The anticancer drug CPT-11 is a clinically important agent whose efficacy is widely variable. This variability has become the focus of many pharmacogenetic studies (Ando and Hasegawa, 2005; Charasson et al., 2004; Nagar and Blanchard, 2006; O’Dwyer and Catalano, 2006). CPT-11 requires hydrolysis for metabolic activity and is degraded by members of the UDP glucuronosyltransferase 1 (UGT1) family before being excreted in the intestines, where it may be reactivated by β-glucuronidase. This reactivation can induce intestinal toxicity. Polymorphisms in the UGT1A1 gene have been associated with an increased incidence of diarrhea and neutropenia (Iyer et al., 1998). Many physicians advocate screening for UGT1A1 polymorphisms prior to treatment with CPT-11 so as to enable dose modification in an effort to reduce unwanted side effects associated with the drug (Maitland et al., 2006). Indeed, the FDA now recommends that reduced doses be given to carriers of specific UGT1A1 variants. Similar approaches to other drugs are likely to be employed, using individual genetic profiles in concert with the traditional indices used to determine appropriate treatment regimens.
VIII. CONCLUDING REMARKS

Understanding the forces that drive and shape genetic variation is central to our goal of elucidating the basis of common and uncommon diseases. The recent success of the GWAS approach has done much for the morale of those initially discouraged by the apparent lack of reproducibility of association studies. The last 5 years have seen much effort in the laying of a core set of tools to facilitate future such studies; their continuing maturation, coupled with advances in statistical methods, will likely yield many more significant findings. It would be somewhat naive to think that association studies will provide clarity for linking genotype with phenotype in every complex condition. Elucidation of the environmental components of common disorders deserves equal attention and may prove to have an even greater impact on public health. Undoubtedly, making informed lifestyle choices and reducing exposure to harmful external stimuli will be much easier than influencing or altering one’s genetic makeup. Nonetheless, the groundwork provided by past research in the field of genetic variation and disease mapping by association should have a profound impact on the well-being of generations to come, especially if we are to realize the promise of personalized medicine (Collins, 1999). One of the ironies of personalized medicine is that to define the markers that one can apply to a specific individual, the science underlying
the foundation of personalized medicine will be based on large-scale studies using populations. In this regard, genetics will continue to search for the dialectic between common markers and individual or unique risk.
References

Amundadottir, L. T., Sulem, P., Gudmundsson, J., Helgason, A., Baker, A., Agnarsson, B. A., Sigurdsson, A., Benediktsdottir, K. R., Cazier, J. B., Sainz, J., Jakobsdottir, M., Kostic, J., et al. (2006). A common variant associated with prostate cancer in European and African populations. Nat. Genet. 38, 652–658.
Ando, Y., and Hasegawa, Y. (2005). Clinical pharmacogenetics of irinotecan (CPT-11). Drug Metab. Rev. 37, 565–574.
Barrett, J. C., and Cardon, L. R. (2006). Evaluating coverage of genome-wide association studies. Nat. Genet. 38, 659–662.
Bejerano, G., Pheasant, M., Makunin, I., Stephen, S., Kent, W. J., Mattick, J. S., and Haussler, D. (2004). Ultraconserved elements in the human genome. Science 304, 1321–1325.
Bejerano, G., Siepel, A. C., Kent, W. J., and Haussler, D. (2005). Computational screening of conserved genomic DNA in search of functional noncoding elements. Nat. Methods 2, 535–545.
Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57, 289–300.
Bennett, S. T., Barnes, C., Cox, A., Davies, L., and Brown, C. (2005). Toward the 1,000 dollars human genome. Pharmacogenomics 6, 373–382.
Bersaglieri, T., Sabeti, P. C., Patterson, N., Vanderploeg, T., Schaffner, S. F., Drake, J. A., Rhodes, M., Reich, D. E., and Hirschhorn, J. N. (2004). Genetic signatures of strong recent positive selection at the lactase gene. Am. J. Hum. Genet. 74, 1111–1120.
Bertina, R. M., Koeleman, B. P., Koster, T., Rosendaal, F. R., Dirven, R. J., de Ronde, H., van der Velden, P. A., and Reitsma, P. H. (1994). Mutation in blood coagulation factor V associated with resistance to activated protein C. Nature 369, 64–67.
Binladen, J., Gilbert, M. T., Bollback, J. P., Panitz, F., Bendixen, C., Nielsen, R., and Willerslev, E. (2007). The use of coded PCR primers enables high-throughput sequencing of multiple homolog amplification products by 454 parallel sequencing. PLoS ONE 2, e197.
Bottini, N., Musumeci, L., Alonso, A., Rahmouni, S., Nika, K., Rostamkhani, M., MacMurray, J., Meloni, G. F., Lucarelli, P., Pellecchia, M., Eisenbarth, G. S., Comings, D., et al. (2004). A functional variant of lymphoid tyrosine phosphatase is associated with type I diabetes. Nat. Genet. 36, 337–338.
Capon, F., Allen, M. H., Ameen, M., Burden, A. D., Tillman, D., Barker, J. N., and Trembath, R. C. (2004). A synonymous SNP of the corneodesmosin gene leads to increased mRNA stability and demonstrates association with psoriasis across diverse ethnic groups. Hum. Mol. Genet. 13, 2361–2368.
Carlson, C. S., Eberle, M. A., Rieder, M. J., Yi, Q., Kruglyak, L., and Nickerson, D. A. (2004). Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am. J. Hum. Genet. 74, 106–120.
Chamary, J. V., Parmley, J. L., and Hurst, L. D. (2006). Hearing silence: Non-neutral evolution at synonymous sites in mammals. Nat. Rev. 7, 98–108.
Chanock, S. (2001). Candidate genes and single nucleotide polymorphisms (SNPs) in the study of human disease. Dis. Markers 17, 89–98.
Chanock, S. J., Manolio, T., Boehnke, M., Boerwinkle, E., Hunter, D. J., Thomas, G., Hirschhorn, J. N., Abecasis, G., Altshuler, D., Bailey-Wilson, J. E., Brooks, L. D., Cardon, L. R., et al. (2007). Replicating genotype-phenotype associations. Nature 447, 655–660.
Charasson, V., Bellott, R., Meynard, D., Longy, M., Gorry, P., and Robert, J. (2004). Pharmacogenetics of human carboxylesterase 2, an enzyme involved in the activation of irinotecan into SN-38. Clin. Pharmacol. Ther. 76, 528–535.
Cheng, I., Plummer, S. J., Jorgenson, E., Liu, X., Rybicki, B. A., Casey, G., and Witte, J. S. (2008). 8q24 and prostate cancer: Association with advanced disease and meta-analysis. Eur. J. Hum. Genet. 16, 496–505.
Clark, A. G. (1990). Inference of haplotypes from PCR-amplified samples of diploid populations. Mol. Biol. Evol. 7, 111–122.
Clark, A. G. (2004). The role of haplotypes in candidate gene studies. Genet. Epidemiol. 27, 321–333.
Clayton, D. G., Walker, N. M., Smyth, D. J., Pask, R., Cooper, J. D., Maier, L. M., Smink, L. J., Lam, A. C., Ovington, N. R., Stevens, H. E., Nutland, S., Howson, J. M., et al. (2005). Population structure, differential bias and genomic control in a large-scale, case-control association study. Nat. Genet. 37, 1243–1246.
Cohen, J. C., Pertsemlidis, A., Fahmi, S., Esmail, S., Vega, G. L., Grundy, S. M., and Hobbs, H. H. (2006). Multiple rare variants in NPC1L1 associated with reduced sterol absorption and plasma low-density lipoprotein levels. Proc. Natl. Acad. Sci. USA 103, 1810–1815.
Colhoun, H. M., McKeigue, P. M., and Davey Smith, G. (2003). Problems of reporting genetic associations with complex outcomes. Lancet 361, 865–872.
Collins, F. S. (1999). Shattuck lecture: Medical and societal consequences of the Human Genome Project. N. Engl. J. Med. 341, 28–37.
Conrad, D. F., Jakobsson, M., Coop, G., Wen, X., Wall, J. D., Rosenberg, N. A., and Pritchard, J. K. (2006). A worldwide survey of haplotype variation and linkage disequilibrium in the human genome. Nat. Genet. 38, 1251–1260.
Cullen, M., Noble, J., Erlich, H., Thorpe, K., Beck, S., Klitz, W., Trowsdale, J., and Carrington, M. (1997). Characterization of recombination in the HLA class II region. Am. J. Hum. Genet. 60, 397–407.
de Bakker, P. I., Yelensky, R., Pe’er, I., Gabriel, S. B., Daly, M. J., and Altshuler, D. (2005). Efficiency and power in genetic association studies. Nat. Genet. 37, 1217–1223.
de Bakker, P. I., Burtt, N. P., Graham, R. R., Guiducci, C., Yelensky, R., Drake, J. A., Bersaglieri, T., Penney, K. L., Butler, J., Young, S., Onofrio, R. C., Lyon, H. N., et al. (2006). Transferability of tag SNPs in genetic association studies in multiple populations. Nat. Genet. 38, 1298–1303.
Devlin, B., and Risch, N. (1995). A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics 29, 311–322.
Devlin, B., and Roeder, K. (1999). Genomic control for association studies. Biometrics 55, 997–1004.
Donnelly, P., and Tavare, S. (1995). Coalescents and genealogical structure under neutrality. Annu. Rev. Genet. 29, 401–421.
Edwards, A. O., Ritter, R., 3rd, Abel, K. J., Manning, A., Panhuysen, C., and Farrer, L. A. (2005). Complement factor H polymorphism and age-related macular degeneration. Science (New York, NY) 308, 421–424.
Eichler, E. E. (2006). Widening the spectrum of human genetic variation. Nat. Genet. 38, 9–11.
Eichler, E. E., Nickerson, D. A., Altshuler, D., Bowcock, A. M., Brooks, L. D., Carter, N. P., Church, D. M., Felsenfeld, A., Guyer, M., Lee, C., Lupski, J. R., Mullikin, J. C., et al. (2007). Completing the map of human genetic variation. Nature 447, 161–165.
Elston, R. C. (1995). Linkage and association to genetic markers. Exp. Clin. Immunogenet. 12, 129–140.
Epstein, M. P., Allen, A. S., and Satten, G. A. (2007). A simple and improved correction for population stratification in case-control studies. Am. J. Hum. Genet. 80, 921–930.
26
Orr and Chanock
Excoffier, L., and Slatkin, M. (1995). Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol. 12, 921–927. Frazer, K. A., Pachter, L., Poliakov, A., Rubin, E. M., and Dubchak, I. (2004). VISTA: Computational tools for comparative genomics. Nucleic acids Res. 32, W273–W279. Freedman, M. L., Haiman, C. A., Patterson, N., McDonald, G. J., Tandon, A., Waliszewska, A., Penney, K., Steen, R. G., Ardlie, K., John, E. M., Oakley-Girvan, I., Whittemore, A. S., et al. (2006). Admixture mapping identifies 8q24 as a prostate cancer risk locus in African-American men. Proc. Natl. Acad. Sci. USA 103, 14068–14073. Freedman, M. L., Reich, D., Penney, K. L., McDonald, G. J., Mignault, A. A., Patterson, N., Gabriel, S. B., Topol, E. J., Smoller, J. W., Pato, C. N., Pato, M. T., Petryshen, T. L., et al. (2004). Assessing the impact of population stratification on genetic association studies. Nat. Genet. 36, 388–393. Garcia-Closas, M., Malats, N., Silverman, D., Dosemeci, M., Kogevinas, M., Hein, D. W., Tardon, A., Serra, C., Carrato, A., Garcia-Closas, R., Lloreta, J., Castano-Vinyals, G., et al. (2005). NAT2 slow acetylation, GSTM1 null genotype, and risk of bladder cancer: Results from the Spanish Bladder Cancer Study and meta-analyses. Lancet 366, 649–659. Gonzalez, E., Kulkarni, H., Bolivar, H., Mangano, A., Sanchez, R., Catano, G., Nibbs, R. J., Freedman, B. I., Quinones, M. P., Bamshad, M. J., Murthy, K. K., Rovin, B. H., et al. (2005). The influence of CCL3L1 gene-containing segmental duplications on HIV-1/AIDS susceptibility. Science (New York, N.Y.) 307, 1434–1440. Gonzalez-Neira, A., Ke, X., Lao, O., Calafell, F., Navarro, A., Comas, D., Cann, H., Bumpstead, S., Ghori, J., Hunt, S., Deloukas, P., Dunham, I., et al. (2006). The portability of tagSNPs across populations: A worldwide survey. Genome Res. 16, 323–330. Grantham, R. (1974). Amino acid difference formula to help explain protein evolution. Science (New York, N.Y.) 
185, 862–864. Gudbjartsson, D. F., Arnar, D. O., Helgadottir, A., Gretarsdottir, S., Holm, H., Sigurdsson, A., Jonasdottir, A., Baker, A., Thorleifsson, G., Kristjansson, K., Palsson, A., Blondal, T., et al. (2007). Variants conferring risk of atrial fibrillation on chromosome 4q25. Nature 448, 353–357. Gudmundsson, J., Sulem, P., Manolescu, A., Amundadottir, L. T., Gudbjartsson, D., Helgason, A., Rafnar, T., Bergthorsson, J. T., Agnarsson, B. A., Baker, A., Sigurdsson, A., Benediktsdottir, K. R., et al. (2007). Genome-wide association study identifies a second prostate cancer susceptibility variant at 8q24. Nat. Genet. 39, 631–637. Gunderson, K. L., Steemers, F. J., Lee, G., Mendoza, L. G., and Chee, M. S. (2005). A genome-wide scalable SNP genotyping assay using microarray technology. Nat. Genet. 37, 549–554. Haiman, C. A., Patterson, N., Freedman, M. L., Myers, S. R., Pike, M. C., Waliszewska, A., Neubauer, J., Tandon, A., Schirmer, C., McDonald, G. J., Greenway, S. C., Stram, D. O., et al. (2007). Multiple regions within 8q24 independently affect risk for prostate cancer. Nat. Genet. 39, 638–644. Haines, J. L., Hauser, M. A., Schmidt, S., Scott, W. K., Olson, L. M., Gallins, P., Spencer, K. L., Kwan, S. Y., Noureddine, M., Gilbert, J. R., Schnetz-Boutaud, N., Agarwal, A., et al. (2005). Complement factor H variant increases the risk of age-related macular degeneration. Science 308, 419–421. Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A., and McKusick, V. A. (2005). Online Mendelian inheritance in man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic acids Res. 33, D514–D517. Helgadottir, A., Manolescu, A., Helgason, A., Thorleifsson, G., Thorsteinsdottir, U., Gudbjartsson, D. F., Gretarsdottir, S., Magnusson, K. P., Gudmundsson, G., Hicks, A., Jonsson, T., Grant, S. F., et al. (2006). A variant of the gene encoding leukotriene A4 hydrolase confers ethnicityspecific risk of myocardial infarction. Nat. Genet. 38, 68–74.
1. Common Genetic Variation and Human Disease
27
Helgadottir, A., Manolescu, A., Thorleifsson, G., Gretarsdottir, S., Jonsdottir, H., Thorsteinsdottir, U., Samani, N. J., Gudmundsson, G., Grant, S. F., Thorgeirsson, G., Sveinbjornsdottir, S., Valdimarsson, E. M., et al. (2004). The gene encoding 5-lipoxygenase activating protein confers risk of myocardial infarction and stroke. Nat. Genet. 36, 233–239. Hinds, D. A., Kloek, A. P., Jen, M., Chen, X., and Frazer, K. A. (2006). Common deletions and SNPs are in linkage disequilibrium in the human genome. Nat. Genet. 38, 82–85. Hinds, D. A., Stuve, L. L., Nilsen, G. B., Halperin, E., Eskin, E., Ballinger, D. G., Frazer, K. A., and Cox, D. R. (2005). Whole-genome patterns of common DNA variation in three human populations. Science 307, 1072–1079. Hinrichs, A. S., Karolchik, D., Baertsch, R., Barber, G. P., Bejerano, G., Clawson, H., Diekhans, M., Furey, T. S., Harte, R. A., Hsu, F., Hillman-Jackson, J., Kuhn, R. M., et al. (2006). The UCSC Genome Browser Database: Update 2006. Nucleic acids Res. 34, D590–D598. Hirschhorn, J. N., Lohmueller, K., Byrne, E., and Hirschhorn, K. (2002). A comprehensive review of genetic association studies. Genet. Med. 4, 45–61. Hughes, A. E., Orr, N., Esfandiary, H., Diaz-Torres, M., Goodship, T., and Chakravarthy, U. (2006). A common CFH haplotype, with deletion of CFHR1 and CFHR3, is associated with lower risk of age-related macular degeneration. Nat. Genet. 38, 1173–1177. Hughes, A. E., Orr, N., Patterson, C., Esfandiary, H., Hogg, R., McConnell, V., Silvestri, G., and Chakravarthy, U. (2007). Neovascular age-related macular degeneration risk based on CFH, LOC387715/HTRA1, and smoking. PLOS Med. 4, e355. Hughes, A. L., Packer, B., Welch, R., Bergen, A. W., Chanock, S. J., and Yeager, M. (2003). Widespread purifying selection at polymorphic sites in human protein-coding loci. Proc. Natl. Acad. Sci. USA 100, 15754–15757. Hughes, A. L., Packer, B., Welch, R., Chanock, S. J., and Yeager, M. (2005). 
High level of functional polymorphism indicates a unique role of natural selection at human immune system loci. Immunogenetics 57, 821–827. Hunter, D. J. (2005). Gene–environment interactions in human diseases. Nat. Rev. 6, 287–298. Hunter, D. J., Khoury, M. J., and Drazen, J. M. (2008). Letting the genome out of the bottle—will we get our wish? N. Engl. J. Med. 358, 105–107. Hunter, D. J., Kraft, P., Jacobs, K. B., Cox, D. G., Yeager, M., Hankinson, S. E., Wacholder, S., Wang, Z., Welch, R., Hutchinson, A., Wang, J., Yu, K., et al. (2007a). A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nat. Genet. 39, 870–874. Hunter, D. J., Thomas, G., Hoover, R. N., and Chanock, S. J. (2007b). Scanning the horizon: What is the future of genome-wide association studies in accelerating discoveries in cancer etiology and prevention? Cancer Causes Control 18, 479–484. Iafrate, A. J., Feuk, L., Rivera, M. N., Listewnik, M. L., Donahoe, P. K., Qi, Y., Scherer, S. W., and Lee, C. (2004). Detection of large-scale variation in the human genome. Nat. Genet. 36, 949–951. International HapMap Consortium (2005). A haplotype map of the human genome. Nature 437, 1299–1320. International Human Genome Sequencing Consortium (2004). Finishing the euchromatic sequence of the human genome. Nature 431, 931–945. Ioannidis, J. P., Gwinn, M., Little, J., Higgins, J. P., Bernstein, J. L., Boffetta, P., Bondy, M., Bray, M. S., Brenchley, P. E., Buffler, P. A., Casas, J. P., Chokkalingam, A., et al. (2006). A road map for efficient and reliable human genome epidemiology. Nat. Genet. 38, 3–5. Iyer, L., King, C. D., Whitington, P. F., Green, M. D., Roy, S. K., Tephly, T. R., Coffman, B. L., and Ratain, M. J. (1998). Genetic predisposition to the metabolism of irinotecan (CPT-11). Role of uridine diphosphate glucuronosyltransferase isoform 1A1 in the glucuronidation of its active metabolite (SN-38) in human liver microsomes. J. Clin. 
Invest. 101, 847–854.
28
Orr and Chanock
Johnson, G. C., Esposito, L., Barratt, B. J., Smith, A. N., Heward, J., Di Genova, G., Ueda, H., Cordell, H. J., Eaves, I. A., Dudbridge, F., Twells, R. C., Payne, F., et al. (2001). Haplotype tagging for the identification of common disease genes. Nat. Genet. 29, 233–237. Kennedy, G. C., Matsuzaki, H., Dong, S., Liu, W. M., Huang, J., Liu, G., Su, X., Cao, M., Chen, W., Zhang, J., Liu, W., Yang, G., et al. (2003). Large-scale genotyping of complex DNA. Nat. Biotechnol. 21, 1233–1237. Kent, W. J., Sugnet, C. W., Furey, T. S., Roskin, K. M., Pringle, T. H., Zahler, A. M., and Haussler, D. (2002). The human genome browser at UCSC. Genome Res. 12, 996–1006. Kerem, B., Rommens, J. M., Buchanan, J. A., Markiewicz, D., Cox, T. K., Chakravarti, A., Buchwald, M., and Tsui, L. C. (1989). Identification of the cystic fibrosis gene: Genetic analysis. Science 245, 1073–1080. Klein, R. J., Zeiss, C., Chew, E. Y., Tsai, J. Y., Sackler, R. S., Haynes, C., Henning, A. K., SanGiovanni, J. P., Mane, S. M., Mayne, S. T., Bracken, M. B., Ferris, F. L., et al. (2005). Complement factor H polymorphism in age-related macular degeneration. Science 308, 385–389. Kruglyak, L., and Nickerson, D. A. (2001). Variation is the spice of life. Nat. Genet. 27, 234–236. Lai, E., Riley, J., Purvis, I., and Roses, A. (1998). A 4-Mb high-density single nucleotide polymorphism-based map around human APOE. Genomics 54, 31–38. Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., Funke, R., Gage, D., et al. (2001). Initial sequencing and analysis of the human genome. Nature 409, 860–921. Law, A. J., Lipska, B. K., Weickert, C. S., Hyde, T. M., Straub, R. E., Hashimoto, R., Harrison, P. J., Kleinman, J. E., and Weinberger, D. R. (2006). Neuregulin 1 transcripts are differentially expressed in schizophrenia and regulated by 50 SNPs associated with the disease. Proc. Natl. Acad. Sci. USA 103, 6747–6752. Li, Y., Sung, W. K., and Liu, J. 
J. (2007). Association mapping via regularized regression analysis of single-nucleotide-polymorphism haplotypes in variable-sized sliding windows. Am. J. Hum. Genet. 80, 705–715. Liu, R., Paxton, W. A., Choe, S., Ceradini, D., Martin, S. R., Horuk, R., MacDonald, M. E., Stuhlmann, H., Koup, R. A., and Landau, N. R. (1996). Homozygous defect in HIV-1 coreceptor accounts for resistance of some multiply-exposed individuals to HIV-1 infection. Cell 86, 367–377. Lohmueller, K. E., Pearce, C. L., Pike, M., Lander, E. S., and Hirschhorn, J. N. (2003). Meta-analysis of genetic association studies supports a contribution of common variants to susceptibility to common disease. Nat. Genet. 33, 177–182. Lykke-Andersen, J. (2001). mRNA quality control: Marking the message for life or death. Curr. Biol. 11, R88–R91. Maitland, M. L., Vasisht, K., and Ratain, M. J. (2006). TPMT, UGT1A1 and DPYD: Genotyping to ensure safer cancer therapy? Trends Pharmacol. Sci. 27, 432–437. Manolio, T. A., Brooks, L. D., and Collins, F. S. (2008). A HapMap harvest of insights into the genetics of common diseases. J. Clin. Invest. 118, 1590–1605. Marchini, J., Cutler, D., Patterson, N., Stephens, M., Eskin, E., Halperin, E., Lin, S., Qin, Z. S., Munro, H. M., Abecasis, G. R., and Donnelly, P. (2006). A comparison of phasing algorithms for trios and unrelated individuals. Am. J. Hum. Genet. 78, 437–450. Margulies, M., Egholm, M., Altman, W. E., Attiya, S., Bader, J. S., Bemben, L. A., Berka, J., Braverman, M. S., Chen, Y. J., Chen, Z., Dewell, S. B., Du, L., et al. (2005). Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376–380. Matsuzaki, H., Dong, S., Loi, H., Di, X., Liu, G., Hubbell, E., Law, J., Berntsen, T., Chadha, M., Hui, H., Yang, G., Kennedy, G. C., et al. (2004a). Genotyping over 100,000 SNPs on a pair of oligonucleotide arrays. Nat. Methods 1, 109–111. Matsuzaki, H., Loi, H., Dong, S., Tsai, Y. Y., Fang, J., Law, J., Di, X., Liu, W. 
M., Yang, G., Liu, G., Huang, J., Kennedy, G. C., et al. (2004b). Parallel genotyping of over 10,000 SNPs using a oneprimer assay on a high-density oligonucleotide array. Genome Res. 14, 414–425.
1. Common Genetic Variation and Human Disease
29
McCarroll, S. A., Hadnott, T. N., Perry, G. H., Sabeti, P. C., Zody, M. C., Barrett, J. C., Dallaire, S., Gabriel, S. B., Lee, C., Daly, M. J., and Altshuler, D. M. (2006). Common deletion polymorphisms in the human genome. Nat. Genet. 38, 86–92. McCarthy, M. I., Abecasis, G. R., Cardon, L. R., Goldstein, D. B., Little, J., Ioannidis, J. P., and Hirschhorn, J. N. (2008). Genome-wide association studies for complex traits: Consensus, uncertainty and challenges. Nat. Rev. 9, 356–369. Moffatt, M. F., Kabesch, M., Liang, L., Dixon, A. L., Strachan, D., Heath, S., Depner, M., von Berg, A., Bufe, A., Rietschel, E., Heinzmann, A., Simma, B., et al. (2007). Genetic variants regulating ORMDL3 expression contribute to the risk of childhood asthma. Nature 448, 470–473. Morton, N. E., and Collins, A. (2002). Toward positional cloning with SNPs. Curr. Opin. Mol. Ther. 4, 259–264. Munafo, M. R., and Flint, J. (2004). Meta-analysis of genetic association studies. Trends Genet. 20, 439–444. Myers, S., Bottolo, L., Freeman, C., McVean, G., and Donnelly, P. (2005). A fine-scale map of recombination rates and hotspots across the human genome. Science 310, 321–324. Nagar, S., and Blanchard, R. L. (2006). Pharmacogenetics of uridine diphosphoglucuronosyltransferase (UGT) 1A family members and its role in patient response to irinotecan. Drug Metab. Rev. 38, 393–409. Ng, P. C., and Henikoff, S. (2003). SIFT: Predicting amino acid changes that affect protein function. Nucleic acids Res. 31, 3812–3814. Nielsen, R., Williamson, S., Kim, Y., Hubisz, M. J., Clark, A. G., and Bustamante, C. (2005). Genomic scans for selective sweeps using SNP data. Genome Res. 15, 1566–1575. Ntais, C., Polycarpou, A., and Ioannidis, J. P. (2005). Association of GSTM1, GSTT1, and GSTP1 gene polymorphisms with the risk of prostate cancer: A meta-analysis. Cancer Epidemiol. Biomarkers Prev. 14, 176–181. O’Brien, S. J., and Nelson, G. W. (2004). Human genes that limit AIDS. Nat. Genet. 36, 565–574. O’Dwyer, P. 
J., and Catalano, R. B. (2006). Uridine diphosphate glucuronosyltransferase (UGT) 1A1 and irinotecan: Practical pharmacogenomics arrives in cancer therapy. J. Clin. Oncol. 24, 4534–4538. Offit, K. (2008). Genomic profiles for disease risk: Predictive or premature? JAMA 299, 1353–1355. Ott, J. (1999). Methods of analysis and resources available for genetic trait mapping. J. Hered. 90, 68–70. Packer, B. R., Yeager, M., Burdett, L., Welch, R., Beerman, M., Qi, L., Sicotte, H., Staats, B., Acharya, M., Crenshaw, A., Eckert, A., Puri, V., et al. (2006). SNP500 Cancer: A public resource for sequence validation, assay development, and frequency analysis for genetic variation in candidate genes. Nucleic Acids Res. 34, D617–D621. Patterson, N., Hattangadi, N., Lane, B., Lohmueller, K. E., Hafler, D. A., Oksenberg, J. R., Hauser, S. L., Smith, M. W., O’Brien, S. J., Altshuler, D., Daly, M. J., and Reich, D. (2004). Methods for high-density admixture mapping of disease genes. Am. J. Hum. Genet. 74, 979–1000. Pe´er, I., de Bakker, P. I., Maller, J., Yelensky, R., Altshuler, D., and Daly, M. J. (2006). Evaluating and improving power in whole-genome association studies using fixed marker sets. Nat. Genet. 38, 663–667. Pearson, T. A., and Manolio, T. A. (2008). How to interpret a genome-wide association study. JAMA 299, 1335–1344. Poort, S. R., Rosendaal, F. R., Reitsma, P. H., and Bertina, R. M. (1996). A common genetic variation in the 30 -untranslated region of the prothrombin gene is associated with elevated plasma prothrombin levels and an increase in venous thrombosis. Blood 88, 3698–3703. Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., and Reich, D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909.
30
Orr and Chanock
Pritchard, J. K. (2001). Are rare variants responsible for susceptibility to complex diseases? Am. J. Hum. Genet. 69, 124–137. Pritchard, J. K., and Cox, N. J. (2002). The allelic architecture of human disease genes: Common disease-common variant . . . or not? Hum. Mol. Genet. 11, 2417–2423. Pritchard, J. K., and Przeworski, M. (2001). Linkage disequilibrium in humans: Models and data. Am. J. Hum. Genet. 69, 1–14. Pritchard, J. K., and Rosenberg, N. A. (1999). Use of unlinked genetic markers to detect population stratification in association studies. Am. J. Hum. Genet. 65, 220–228. Ramensky, V., Bork, P., and Sunyaev, S. (2002). Human non-synonymous SNPs: Server and survey. Nucleic Acids Res. 30, 3894–3900. Reich, D. E., Cargill, M., Bolk, S., Ireland, J., Sabeti, P. C., Richter, D. J., Lavery, T., Kouyoumjian, R., Farhadian, S. F., Ward, R., and Lander, E. S. (2001). Linkage disequilibrium in the human genome. Nature 411, 199–204. Reich, D. E., and Lander, E. S. (2001). On the allelic spectrum of human disease. Trends Genet. 17, 502–510. Reich, D. E., Schaffner, S. F., Daly, M. J., McVean, G., Mullikin, J. C., Higgins, J. M., Richter, D. J., Lander, E. S., and Altshuler, D. (2002). Human genome sequence variation and the influence of gene history, mutation and recombination. Nat. Genet. 32, 135–142. Ribas, G., Gonzalez-Neira, A., Salas, A., Milne, R. L., Vega, A., Carracedo, B., Gonzalez, E., Barroso, E., Fernandez, L. P., Yankilevich, P., Robledo, M., Carracedo, A., et al. (2006). Evaluating HapMap SNP data transferability in a large-scale genotyping project involving 175 cancerassociated genes. Hum. Genet. 118, 669–679. Richards, R. I., Holman, K., Yu, S., and Sutherland, G. R. (1993). Fragile X syndrome unstable element, p(CCG)n, and other simple tandem repeat sequences are binding sites for specific nuclear proteins. Hum. Mol. Genet. 2, 1429–1435. Risch, N. (1991). Developments in gene mapping with linkage methods. Curr. Opin. Genet. Dev. 1, 93–98. 
Risch, N., and Merikangas, K. (1996). The future of genetic studies of complex human diseases. Science 273, 1516–1517. Risch, N. J. (2000). Searching for genetic determinants in the new millennium. Nature 405, 847–856. Romeo, S., Pennacchio, L. A., Fu, Y., Boerwinkle, E., Tybjaerg-Hansen, A., Hobbs, H. H., and Cohen, J. C. (2007). Population-based resequencing of ANGPTL4 uncovers variations that reduce triglycerides and increase HDL. Nat. Genet. 39, 513–516. Roses, A. D. (2001). Pharmacogenetics. Hum. Mol. Genet. 10, 2261–2267. Rothman, N., Skibola, C. F., Wang, S. S., Morgan, G., Lan, Q., Smith, M. T., Spinelli, J. J., Willett, E., De Sanjose, S., Cocco, P., Berndt, S. I., Brennan, P., et al. (2006). Genetic variation in TNF and IL10 and risk of non-Hodgkin lymphoma: A report from the InterLymph Consortium. Lancet Oncol. 7, 27–38. Rothman, N., Wacholder, S., Caporaso, N. E., Garcia-Closas, M., Buetow, K., and Fraumeni, J. F., Jr. (2001). The use of common genetic polymorphisms to enhance the epidemiologic study of environmental carcinogens. Biochim. Biophys. Acta 1471, C1–C10. Royer-Pokora, B., Kunkel, L. M., Monaco, A. P., Goff, S. C., Newburger, P. E., Baehner, R. L., Cole, F. S., Curnutte, J. T., and Orkin, S. H. (1986). Cloning the gene for an inherited human disorder—chronic granulomatous disease—on the basis of its chromosomal location. Nature 322, 32–38. Sachidanandam, R., Weissman, D., Schmidt, S. C., Kakol, J. M., Stein, L. D., Marth, G., Sherry, S., Mullikin, J. C., Mortimore, B. J., Willey, D. L., Hunt, S. E., Cole, C. G., et al. (2001). A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409, 928–933.
1. Common Genetic Variation and Human Disease
31
Salem, R. M., Wessel, J., and Schork, N. J. (2005). A comprehensive literature review of haplotyping software and methods for use with unrelated individuals. Hum. Genomics 2, 39–66. Scrimshaw, N. S., and Murray, E. B. (1988). The acceptability of milk and milk products in populations with a high prevalence of lactose intolerance. Am. J. Clin. Nutr. 48, 1079–1159. Sebastiani, P., Lazarus, R., Weiss, S. T., Kunkel, L. M., Kohane, I. S., and Ramoni, M. F. (2003). Minimal haplotype tagging. Proc. Natl. Acad. Sci. USA 100, 9900–9905. Sebat, J., Lakshmi, B., Troge, J., Alexander, J., Young, J., Lundin, P., Maner, S., Massa, H., Walker, M., Chi, M., Navin, N., Lucito, R., et al. (2004). Large-scale copy number polymorphism in the human genome. Science 305, 525–528. Sherry, S. T., Ward, M. H., Kholodov, M., Baker, J., Phan, L., Smigielski, E. M., and Sirotkin, K. (2001). dbSNP: The NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311. Shriver, M. D., Mei, R., Parra, E. J., Sonpar, V., Halder, I., Tishkoff, S. A., Schurr, T. G., Zhadanov, S. I., Osipova, L. P., Brutsaert, T. D., Friedlaender, J., Jorde, L. B.., et al. (2005). Large-scale SNP analysis reveals clustered and continuous patterns of human genetic variation. Hum. Genomics 2, 81–89. Smith, E. M., Wang, X., Littrell, J., Eckert, J., Cole, R., Kissebah, A. H., and Olivier, M. (2006). Comparison of linkage disequilibrium patterns between the HapMap CEPH samples and a familybased cohort of Northern European descent. Genomics 88, 407–414. Smith, M. W., Dean, M., Carrington, M., Winkler, C., Huttley, G. A., Lomb, D. A., Goedert, J. J., O’Brien, T. R., Jacobson, L. P., Kaslow, R., Buchbinder, S., Vittinghoff, E., et al. (1997). Contrasting genetic influence of CCR2 and CCR5 variants on HIV-1 infection and disease progression. Hemophilia Growth and Development Study (HGDS), Multicenter AIDS Cohort Study (MACS), Multicenter Hemophilia Cohort Study (MHCS), San Francisco City Cohort (SFCC), ALIVE Study. 
Science 277, 959–965. Smyth, D., Cooper, J. D., Collins, J. E., Heward, J. M., Franklyn, J. A., Howson, J. M., Vella, A., Nutland, S., Rance, H. E., Maier, L., Barratt, B. J., Guja, C., et al. (2004). Replication of an association between the lymphoid tyrosine phosphatase locus (LYP/PTPN22) with type 1 diabetes, and evidence for its role as a general autoimmunity locus. Diabetes 53, 3020–3023. Stacey, S. N., Manolescu, A., Sulem, P., Rafnar, T., Gudmundsson, J., Gudjonsson, S. A., Masson, G., Jakobsdottir, M., Thorlacius, S., Helgason, A., Aben, K. K., Strobbe, L. J., et al. (2007). Common variants on chromosomes 2q35 and 16q12 confer susceptibility to estrogen receptor-positive breast cancer. Nat. Genet. 39, 865–869. Steemers, F. J., Chang, W., Lee, G., Barker, D. L., Shen, R., and Gunderson, K. L. (2006). Wholegenome genotyping with the single-base extension assay. Nat. Methods 3, 31–33. Stenson, P. D., Ball, E. V., Mort, M., Phillips, A. D., Shiel, J. A., Thomas, N. S., Abeysinghe, S., Krawczak, M., and Cooper, D. N. (2003). Human Gene Mutation Database (HGMD). Hum Mutat 21(6), 577–81. Stephens, M., Smith, N. J., and Donnelly, P. (2001). A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 68, 978–989. Stram, D. O. (2005). Software for tag single nucleotide polymorphism selection. Hum. Genomics 2, 144–151. Teare, M., and Barrett, J. H. (2005). Genetic linkage studies. Lancet 366, 1036–1044. Thomas, G., Jacobs, K. B., Yeager, M., Kraft, P., Wacholder, S., Orr, N., Yu, K., Chatterjee, N., Welch, R., Hutchinson, A., Crenshaw, A., Cancel-Tassin, G., et al. (2008). Multiple loci identified in a genome-wide association study of prostate cancer. Nat. Genet. 40, 310–315. Topal, M. D., and Fresco, J. R. (1976). Complementary base pairing and the origin of substitution mutations. Nature 263, 285–289. Vella, A., Cooper, J. D., Lowe, C. E., Walker, N., Nutland, S., Widmer, B., Jones, R., Ring, S. M., McArdle, W., Pembrey, M. 
E., Strachan, D. P., Dunger, D. B., et al. (2005). Localization of a type 1 diabetes locus in the IL2RA/CD25 region by use of tag single-nucleotide polymorphisms. Am. J. Hum. Genet. 76, 773–779.
32
Orr and Chanock
Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., Gocayne, J. D., Amanatides, P., et al. (2001). The sequence of the human genome. Science 291, 1304–1351. Vineis, P., Marinelli, D., Autrup, H., Brockmoller, J., Cascorbi, I., Daly, A. K., Golka, K., Okkels, H., Risch, A., Rothman, N., Sim, E., and Taioli, E. (2001). Current smoking, occupation, N-acetyltransferase-2 and bladder cancer: A pooled analysis of genotype-based studies. Cancer Epidemiol. Biomarkers Prev. 10, 1249–1252. Vogl, F. D., Taioli, E., Maugard, C., Zheng, W., Pinto, L. F., Ambrosone, C., Parl, F. F., NedelchevaKristensen, V., Rebbeck, T. R., Brennan, P., and Boffetta, P. (2004). Glutathione S-transferases M1, T1, and P1 and breast cancer: A pooled analysis. Cancer Epidemiol. Biomarkers Prev. 13, 1473–1479. Wacholder, S., Chanock, S., Garcia-Closas, M., El Ghormli, L., and Rothman, N. (2004). Assessing the probability that a positive report is false: An approach for molecular epidemiology studies. J. Natl. Cancer Inst. 96, 434–442. Wacholder, S., Rothman, N., and Caporaso, N. (2000). Population stratification in epidemiologic studies of common genetic variants and cancer: Quantification of bias. J. Natl. Cancer Inst. 92, 1151–1158. Wang, D., Johnson, A. D., Papp, A. C., Kroetz, D. L., and Sadee, W. (2005). Multidrug resistance polypeptide 1 (MDR1, ABCB1) variant 3435C>T affects mRNA stability. Pharmacogenet. Genomics 15, 693–704. Ware, J. H. (2006). The limitations of risk factors as prognostic tools. N. Engl. J. Med. 355, 2615–2617. Wellcome Trust Case Control Consortium (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678. Willer, C. J., Scott, L. J., Bonnycastle, L. L., Jackson, A. U., Chines, P., Pruim, R., Bark, C. W., Tsai, Y. Y., Pugh, E. W., Doheny, K. F., Kinnunen, L., Mohlke, K. L., et al. (2006). 
Tag SNP selection for Finnish individuals based on the CEPH Utah HapMap database. Genet. Epidemiol. 30, 180–190. Wilson, S. H., and Olden, K. (2004). The Environmental Genome Project: Phase I and beyond. Mol. Interv. 4, 147–156. Wittke-Thompson, J. K., Pluzhnikov, A., and Cox, N. J. (2005). Rational inferences about departures from Hardy-Weinberg equilibrium. Am. J. Hum. Genet. 76, 967–986. Xu, J. (2006). Extracting haplotypes from diploid organisms. Curr. Issues Mol. Biol. 8, 113–122. Yeager, M., Orr, N., Hayes, R. B., Jacobs, K. B., Kraft, P., Wacholder, S., Minichiello, M. J., Fearnhead, P., Yu, K., Chatterjee, N., Wang, Z., Welch, R., et al. (2007). Genome-wide association study of prostate cancer identifies a second risk locus at 8q24. Nat. Genet. 39, 645–649. Youssoufian, H., Wong, C., Aronis, S., Platokoukis, H., Kazazian, H. H., Jr., and Antonarakis, S. E. (1988). Moderately severe hemophilia A resulting from Glu—Gly substitution in exon 7 of the factor VIII gene. Am. J. Hum. Genet. 42, 867–871.