SNPing in the human genome


Christopher S Carlson*, Tera L Newman and Deborah A Nickerson†

More than a million genetic markers in the form of single nucleotide polymorphisms are now available for use in genotype–phenotype studies in humans. The application of new strategies for representational cloning and sequencing from genomes combined with the mining of high-quality sequence variations in clone overlaps of genomic and/or cDNA sequences has played an important role in generating this new resource. The focus of variation analysis is now shifting from the identification of new markers to their typing in populations, and novel typing strategies are rapidly emerging. Assay readouts on oligonucleotide arrays, in microtiter plates, gels, flow cytometers and mass spectrometers have all been developed, but decreasing cost and increasing throughput of DNA typing remain key if high-density genetic maps are to be applied on a large scale.

Addresses
Department of Molecular Biotechnology, University of Washington, Box 357330, Seattle, WA 98195-7330, USA
*e-mail: [email protected]
†e-mail: [email protected]

Current Opinion in Chemical Biology 2001, 5:78–85
1367-5931/01/$ (see front matter) © 2001 Elsevier Science Ltd. All rights reserved.

Abbreviations
DHPLC denaturing HPLC
EST expressed sequence tag
HPLC high-performance liquid chromatography
OLA oligonucleotide ligation assay
PCR polymerase chain reaction
SNP single nucleotide polymorphism
SSCP single-stranded conformational polymorphism

Introduction Analysis of the association between genotype and disease phenotype in humans is at a critical point. Hundreds of rare diseases showing Mendelian patterns of inheritance in humans have been described in the medical literature. In the past twenty years, the genes responsible for most of these diseases have been mapped and cloned, providing dramatic insights into human biology [1]. The tools that made this leap forward possible were linkage analysis, access to low-density genetic linkage maps and positional cloning. Despite the dramatic success of these techniques in identifying the genetic factors for rare disease, few genetic factors for common diseases have been identified. To identify genetic factors in common disease, a major effort is now under way to optimize the tools required for high-density marker association analysis in human populations. Here, we review recent advances in the technologies for constructing high-density genetic maps, as well as new technologies for high-throughput DNA typing of the most common form of sequence variation: single nucleotide substitutions, also known as single nucleotide polymorphisms (SNPs).

Genotype–phenotype analysis: an evolving paradigm Although twin and adoption studies consistently demonstrate that most common diseases have a significant heritable component, linkage analysis has generally failed to identify loci associated with major genetic risk. The statistical power of linkage analysis is greatest for diseases where variation in a single gene is both necessary and sufficient to cause disease, such as cystic fibrosis and Huntington’s disease. Linkage analysis is much less powerful for detecting genetic variants of low risk, even when such variants are common [2••]. The failure to identify the genetic components of common disease has led many groups to believe that multiple loci, each conferring a low level of relative risk, are involved. Low levels of relative risk can also reflect interaction between genotypes at multiple loci (epistasis), interaction between genotype and environment, or truly stochastic factors such as X-chromosome inactivation or allelic exclusion. Association studies have greater power to detect variants with low relative risk than linkage studies, but require highly dense sets of polymorphic markers that can rapidly be typed on large numbers of samples [2••]. A dense marker map is required because these methods rely on linkage disequilibrium (the non-random association between alleles at different loci in a population), which extends over much smaller regions than marker linkage in related members of a family (i.e. pedigree analysis). The exact number of markers required for a full genome scan using association methods will depend on the extent of useful levels of linkage disequilibrium in the genome. Theoretical models suggest that maps as dense as one SNP every 3–6 kb will be necessary for association studies [3••], which would mean, in the case of the human genome, a map with more than 500,000 SNPs.
Experimental evidence suggests that useful levels of linkage disequilibrium regularly extend up to tens of kilobases, with significant variance across the genome [4,5]. Nonetheless, even using a highly optimistic estimate of one SNP every 30 kb, an optimized map of 100,000 SNPs will be required for a full genome scan. Currently, this is far beyond the routine genotyping capacity of most laboratories, assuming several hundred samples and 100,000 SNPs, or on the order of tens of millions of genotypes per study. Although genome-wide association studies are not yet feasible (from the standpoint of cost and throughput), with current assay capacities it is possible to rapidly perform meaningful analyses of candidate genes and these are generating new insights into common human genotype–phenotype associations [6••,7].
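The marker and genotype counts quoted above follow from simple arithmetic. The sketch below reproduces them under two assumptions that are not stated in the text: a haploid genome length of roughly 3 × 10^9 bp and evenly spaced markers. The function names and the sample count of 300 (one reading of "several hundred") are illustrative only.

```python
GENOME_BP = 3_000_000_000  # assumption: round figure for the haploid human genome


def markers_needed(genome_bp: int, spacing_bp: int) -> int:
    """Number of evenly spaced markers at one marker per spacing_bp."""
    return genome_bp // spacing_bp


def genotypes_per_study(n_markers: int, n_samples: int) -> int:
    """Total genotypes when every sample is typed at every marker."""
    return n_markers * n_samples


# One SNP every 6 kb (the sparse end of the theoretical 3-6 kb range).
print(markers_needed(GENOME_BP, 6_000))
# One SNP every 30 kb (the optimistic experimental estimate).
print(markers_needed(GENOME_BP, 30_000))
# 100,000 SNPs typed on a hypothetical 300 samples.
print(genotypes_per_study(100_000, 300))
```

With these inputs the 6 kb spacing gives the 500,000-SNP map, the 30 kb spacing gives the 100,000-SNP map, and a study of a few hundred samples lands in the tens of millions of genotypes.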

Figure 1. Alternative methods for sequence-based SNP discovery. BAC, bacterial artificial chromosome; RRS, reduced representation sequencing. [Flowchart labels: Genomic DNA; mRNA; BAC library; RRS library; PCR; cDNA library; Clone overlap; Shotgun overlap; Sequence overlap; EST overlap; SNP discovery from sequence overlap, illustrated by aligned reads GTTTAAATA A TACTGATCA and GTTTAAATA G TACTGATCA.]

Developing high-density genetic maps To develop high-density genetic maps of the genome, studies have focused on the identification of SNPs because
they are the most common form of sequence variation. In the human genome, approximately one base pair in a thousand is variant when any two chromosomes are compared [8–10,11••]. Aside from their high frequency in the genome, SNPs are very stable genetic markers and have a relatively low mutation rate in comparison with other types of genetic markers. SNPs are also biallelic in nature, which makes them much more amenable to automated typing. Constructing a high-density genetic map relies on the ability to identify very large numbers of SNPs. A variety of methods are currently in use for SNP discovery: some are optimized for random SNP discovery across the genome, whereas others are optimized for targeted SNP discovery in candidate genes. The optimal technology for any given project will depend on the question being addressed, the scale of the project and the resources available. These techniques may further be broken down into direct sequence-based SNP discovery and indirect, sequence-unknown SNP discovery. All of the large-scale SNP identification efforts use direct sequencing to detect SNPs, and most of these use gel-based fluorescent sequencing methods. In many recent strategies, overlapping sequences from multiple individuals are computationally aligned to identify high-quality mismatches (Figure 1; [12,13,14•,15,16]). At these sites, information about alignment depth, sequence context and read quality is used to predict a statistical likelihood that the site is an actual polymorphism and not a sequencing error. This process of ‘in silico’ SNP detection has been augmented tremendously by the development of rigorous

base-calling quality analysis programs such as PHRED [17,18], which determine quantifiable differences in the quality of the sequence reads, as well as programs that flag likely polymorphic sites, such as POLYBAYES [14•]. Using the primary data generated by the Human Genome Project, ~75% of the 1,500,000 SNPs presently in the database dbSNP (see URL: http://www.ncbi.nlm.nih.gov/SNP/) were identified in silico from overlapping regions of genomic clones. These SNPs are clustered in 10–50 kb regions, but the clusters are fairly randomly distributed across the genome. A much smaller fraction of dbSNP (~4% of all dbSNP entries) comes from similar data-mining efforts using single-pass sequence reads from cDNAs (known as expressed sequence tags [ESTs]) as the raw data for in silico detection [19,20,21••]. Because the EST SNPs are clustered in processed transcripts, they are more likely to occur in coding regions that can be tested for functional consequences. However, many other molecular mechanisms exist for the control of transcription, RNA splicing, and mRNA stability that can affect the final expressed phenotype. Thus, SNPs occurring in noncoding sequences such as 5′ promoter sequences, splice-site junctions, and regulatory/enhancer elements (intronic, 5′, and/or 3′ regions) can also have significant functional effects [22•]. Most of the remaining SNPs in dbSNP have been identified by the SNP Consortium using reduced representation sequencing [23••]. In this approach, pooled DNA from a group of unrelated individuals is digested extensively with restriction enzymes and size fractionated on an agarose gel. Fragments approximately 1.5 kb in size are

Proteomics and genomics

cloned, representing the same small percentage of the genome (~1%) from all individuals. The clone library is shotgun sequenced to 2–5-fold redundant coverage and overlapping sequence traces aligned and compared to screen for mismatches (SNPs). Although the majority of SNPs in dbSNP come from the large-scale, undirected efforts outlined above, a small but important fraction come from directed SNP identification within candidate genes [11••,24••]. Directed identification of SNPs generally relies upon sequence-specific PCR amplification of genomic DNA. Once amplified, many techniques are available to scan for SNPs in the PCR products obtained from different individuals. Sensitivity and specificity vary between techniques, and some require greater cost and/or time as a result of more involved optimization procedures. The most direct approach for targeted SNP discovery is gel-based DNA resequencing, in which the amplified PCR products from different samples (individuals) are sequenced, aligned and compared [25–27]. This is one of the more robust and accurate methods used for directed SNP discovery and allows resolution of both common SNPs (minor allele frequency greater than 10%) and rare SNPs (minor allele frequency below 10%). An alternative to gel-based resequencing is hybridization-based oligonucleotide array resequencing [11••,24••,28,29]. An array of 25-mer oligonucleotides is anchored to a glass slide (usually called a chip) in sets of four. Each set of four oligos represents the same 25-nucleotide stretch of the reference sequence of the target genomic region; the four oligos differ at a central nucleotide position, where each has only one of the four possible nucleotides: A, C, T, or G. The PCR-amplified DNA product is hybridized to the oligos on the chip and in each set of four oligos, the signal is dramatically higher for the oligo or oligos that match the target amplicon perfectly.
This approach is not robust for detection of heterozygous samples, so detection of rare SNPs remains problematic, and is currently limited by chip design costs. A number of other targeted SNP identification methods can be grouped together by their dependence upon the chemical and physical properties of DNA duplexes and single strands, without sequence information. DNA heteroduplexes form when heterozygous amplicon is denatured and reannealed. Denaturing high-performance liquid chromatography (DHPLC) is a method that detects polymorphisms using the differential elution time of DNA homoduplexes and heteroduplexes (where a SNP is present) from a liquid chromatography column [30]. When a mixed population of heteroduplexes and homoduplexes is analyzed by HPLC under partially denaturing temperatures, the heteroduplexes elute from the column earlier than the homoduplexes because of their reduced melting temperature. Another indirect approach is single-stranded conformational polymorphism (SSCP) analysis, which is a simple and reasonably sensitive method to detect SNPs. In

this method, in the absence of a complementary strand, and under non-denaturing conditions, single-stranded DNA will form a secondary structure with itself. The secondary structure is sequence dependent, and can differentially affect electrophoretic mobility. In SSCP, amplified DNA is denatured, allowed to renature under conditions that favor the formation of single-stranded secondary structures, and run on a non-denaturing polyacrylamide gel. Samples with polymorphisms will fold and therefore migrate in a different way from wild-type strands, allowing detection of a SNP. There are many other less-automatable indirect SNP-detection techniques including denaturing gradient gel electrophoresis (DGGE) [31], chemical [32] or enzymatic [33] heteroduplex cleavage, enzymatic cleavage of single-stranded DNA [34], heteroduplex-mobility analysis [35] and restriction-fragment analysis [36]. These technologies will not be discussed as part of this review. It is important to note that all indirect detection methods incur some overhead from optimization of assay conditions. Hence, they are more useful in situations where many individuals will be analyzed over the same genomic region, rather than when a small number of individuals will be scanned for SNPs in many different regions of the genome. In general, there is also some overhead from sequencing for confirmation of polymorphisms discovered by indirect means. However, sample preparation costs are generally low enough for some indirect techniques (DHPLC, SSCP) that they may reasonably be used for low to medium throughput genotyping as well as SNP discovery.
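To make the sequence-based discovery step described earlier in this section more concrete, the following sketch flags a candidate SNP at one aligned position using read depth and base quality. It is a deliberately simplified heuristic, not the algorithm used by PHRED or POLYBAYES: the thresholds, function names and input layout are assumptions chosen for illustration. The only borrowed fact is the standard meaning of a Phred-style quality score Q, which corresponds to an error probability of 10^(-Q/10).

```python
from collections import Counter


def phred_to_error_prob(q: int) -> float:
    """Convert a Phred-style quality score to a base-call error probability."""
    return 10 ** (-q / 10)


def candidate_snp(column, min_quality=20, min_reads_per_allele=2):
    """Flag a candidate SNP at one aligned position.

    column: list of (base, quality) pairs, one per overlapping read.
    Bases below min_quality are ignored; an allele counts as supported
    only if at least min_reads_per_allele high-quality reads carry it.
    Returns the sorted supported alleles, or None if fewer than two remain.
    """
    counts = Counter(base for base, q in column if q >= min_quality)
    supported = sorted(b for b, n in counts.items() if n >= min_reads_per_allele)
    return supported if len(supported) >= 2 else None


# Hypothetical aligned column: two high-quality A reads, two high-quality G reads.
print(candidate_snp([("A", 40), ("A", 35), ("G", 38), ("G", 30), ("G", 8)]))
# A single low-quality mismatch is treated as a likely sequencing error.
print(candidate_snp([("A", 40), ("A", 35), ("A", 30), ("G", 9)]))
```

A probabilistic caller would weigh each read by its error probability rather than applying hard cut-offs; the hard thresholds here only illustrate why depth and quality both matter.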

SNP genotyping SNP genotyping is not yet a mature field but many promising technologies are emerging. The ideal genotyping system for SNPs should be robust, cheap, homogenous (i.e. a self-contained assay requiring minimal handling for sample analysis), highly multiplexed (i.e. many SNPs genotyped in parallel), and fully automated. There are a number of genotyping technologies, discussed here, but none have yet approached the required cost per genotype to make whole-genome-scale association studies feasible, even though the high-density maps will soon be available. The most basic genotyping assays rely simply on differential hybridization of a pair of allelic probes to amplified target (Figure 2a). These include allele-specific oligonucleotide (ASO) assays [37] and reverse-blot assays [38], which differ only in terms of whether the target or the probe is immobilized. Newer solution-based hybridization systems using hairpin probe assays (e.g. molecular beacons) are also emerging [39]. All hybridization assays must be performed under conditions that favor perfect-match binding over mismatch binding. In most cases, the polymorphic site is placed centrally within the probe. Genotype is determined by examining which probes bind perfectly matched target, and assay specificity depends on the hybridization kinetics of the probe/target system.
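As a rough illustration of how a paired allelic-probe readout might be turned into a genotype, the sketch below compares the signals from a C-allele probe and a T-allele probe for one sample. The thresholds (`min_total`, `hom_fraction`) and the function name are hypothetical; real assays set such cut-offs empirically for each probe pair.

```python
def call_genotype(signal_c: float, signal_t: float,
                  min_total: float = 100.0, hom_fraction: float = 0.8) -> str:
    """Call a C/T SNP genotype from two allelic-probe signals.

    min_total: combined signal below which the assay is treated as failed.
    hom_fraction: fraction of total signal one probe must reach for a
    homozygous call; anything in between is called heterozygous.
    Both thresholds are illustrative assumptions.
    """
    total = signal_c + signal_t
    if total < min_total:
        return "no call"
    frac_c = signal_c / total
    if frac_c >= hom_fraction:
        return "C/C"
    if frac_c <= 1 - hom_fraction:
        return "T/T"
    return "C/T"


print(call_genotype(900, 50))   # strong C probe signal only
print(call_genotype(500, 450))  # both probes bind
print(call_genotype(30, 800))   # strong T probe signal only
print(call_genotype(40, 30))    # too little signal to score
```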

Figure 2. Allelic discrimination strategies. Schematic diagrams showing the hybridization of probes to target sequence. The differential results from hybridization of the probe(s) for the C allele to C and T alleles of a hypothetical C/T SNP are shown in the two right columns. A second reaction specific for the T allele (not shown) would be necessary to score genotype. (a) Differential hybridization of a pair of allelic probes to amplified target. (b) Allele-specific PCR. (c) OLA. (d) Flap cleavage. (e) 5′ Nuclease discrimination. (f) Minisequencing. Bold arrows indicate the probe and thinner arrows indicate the target. [Panel outcomes for the C-allele probe on C-allele versus T-allele target: (a) hybridize / fail to hybridize; (b) amplify / fail to amplify; (c) ligate / fail to ligate; (d) cleave / fail to cleave; (e) degrade / fail to degrade; (f) with ddCTP, C incorporated / C fails to incorporate.]

Optimization of each hybridization system requires significant effort, but subsequent genotyping is quite efficient, so these technologies are optimal for studies of a small number of SNPs in a large number of individuals. Minisequencing is a genotyping method in which unique primers anneal immediately 5′ of the SNP and initiate a single-base extension reaction using DNA polymerase and labeled dideoxynucleotides (Figure 2f). Genotype is scored by determining which dideoxynucleotides were incorporated at the polymorphic site, and specificity is determined by the accuracy of nucleotide incorporation by the polymerase [40,41]. This technology appears to require the least optimization, and therefore should be optimal for large-scale studies with many hundreds of SNPs. Allele-specific PCR uses three primers: one allele-specific primer for each allele with the polymorphic site at the 3′ end, and a common reverse primer (Figure 2b; [42,43]). Using a polymerase without 3′→5′ proofreading activity, samples are amplified twice, once using each allele-specific primer. Genotype is scored for each sample based on which of the allele-specific amplifications were positive. Specificity is determined by the amplification kinetics of each assay and the differential efficiency of the polymerase in extending 3′ matched and 3′ mismatched primers. This assay format fails to multiplex and requires significant optimization of each assay, but uses cheap unmodified oligos and relatively cheap instrumentation.
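The primer geometry described above for minisequencing and allele-specific PCR can be sketched in a few lines. This is a toy example under stated assumptions: the sequence is the 19-base fragment from Figure 1, primers are far shorter than would be used in practice, and no melting-temperature or specificity checks are made. All function names are illustrative.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def minisequencing_primer(seq: str, snp_index: int, length: int) -> str:
    """Primer whose 3' end sits immediately 5' of the SNP (same strand as seq).

    Single-base extension then adds the dideoxynucleotide matching the
    allele present at seq[snp_index].
    """
    if snp_index < length:
        raise ValueError("not enough upstream sequence for the primer")
    return seq[snp_index - length:snp_index]


def allele_specific_primers(seq: str, snp_index: int, alleles: str, length: int) -> dict:
    """One forward primer per allele, with the polymorphic base at the 3' end."""
    upstream = minisequencing_primer(seq, snp_index, length - 1)
    return {allele: upstream + allele for allele in alleles}


seq = "GTTTAAATAATACTGATCA"  # A allele of the A/G SNP at 0-based index 9
print(minisequencing_primer(seq, 9, 9))
print(allele_specific_primers(seq, 9, "AG", 10))
# A common reverse primer is taken from the opposite strand downstream of the SNP.
print(reverse_complement(seq[10:]))
```

The two designs differ only in where the polymorphic base sits: just beyond the primer's 3′ end for minisequencing, and at the 3′ end itself for allele-specific PCR.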

Oligonucleotide ligation assays (OLA) [44] also use three oligonucleotides to type a SNP: two allele-specific oligonucleotides and one joining oligonucleotide (Figure 2c). After the oligos have had time to hybridize to the target sequence, a DNA ligase is used to join the adjacent oligos (an allele-specific oligonucleotide with the joining oligonucleotide). The fusion occurs only if the 3′ nucleotide at the ligation site complements the target DNA site. Genotype is scored by determining which of the allelic ligation reactions were successful. OLA enables multiplex genotyping of a large number of individuals for multiple SNPs [45] but requires expensive custom oligonucleotide syntheses. Flap cleavage (Invader, Figure 2d; [46]) and 5′ nuclease discrimination (TaqMan, Figure 2e; [47]) are both assays that use enzymatic cleavage to identify the presence of a SNP in a sample. During flap cleavage, an enzyme is used to cleave the structure formed when two overlapping oligonucleotides hybridize to a target DNA strand, and cleavage is dependent on a 5′ match between the cleaved oligo and target. This assay requires relatively cheap instrumentation and limited optimization, and can achieve reasonable levels of throughput. Alternatively, in the 5′ nuclease assay, a complementary probe bound to the target DNA strand is cleaved if and only if it is perfectly complementary to the SNP site. Using oligonucleotides complementary to both the wild type and the polymorphic base, the presence of both kinds of target DNA (both SNP

alleles) can be detected even in a complex mixture. The 5′ nuclease assay requires expensive fluorescent oligos, and frequently requires optimization, but provides excellent throughput once assays are developed.

Figure 3. Fluorescence-based technologies for genotype scoring. Changes in fluorescent signal intensity are an optimal readout technology for microtiter plates because fluorescence can easily be monitored in a closed-tube assay. (a) Fluorescence energy transfer occurs when two fluorophores are physically very close to one another: energy from excitation of the donor fluorophore (shown as D) is transferred to an acceptor (shown as A) that fluoresces at a different wavelength from the donor (gray arrow). Thus, biochemical processes that bring donor and acceptor into close proximity result in an increase in fluorescence at a frequency characteristic of the acceptor. (b) Conversely, processes that increase the distance between donor and acceptor result in increasing fluorescent intensity at a frequency characteristic of the donor. [Panel labels: (a) one donor, two allele-specific acceptors; minisequencing, OLA. (b) one acceptor, two allele-specific donors; TaqMan, Invader, Molecular Beacons.]

Genotype scoring formats Each SNP genotyping technology may be scored in one or more formats. Historically, radioisotopic methods were common, but they have been almost completely supplanted by colorimetric and fluorescent assays. Manually intensive methods have also become less popular as throughput considerations have increased in importance. The primary categories of genotype scoring format are gel multiplex, microtiter plates, array, and mass spectrometry. Gel electrophoresis separates DNA fragments on the basis of size, and has therefore been the technique of choice for scoring microsatellite genotypes in linkage studies. Tools developed for microsatellite analysis have been reapplied to multiplex SNP analysis in systems where the products of genotyping assays can be designed to be different sizes. Additional multiplexing is facilitated by using multiple fluorescent reporters in a single lane. Fluorescent gel multiplex has been used with minisequencing [48], OLA [49], and SSCP [50].

Microtiter plates allow many reactions to be performed in parallel on a single plate, are highly amenable to automation, and can be used with multiple readout formats, including enzymatic and fluorescent tags. Although there are methods by which the two allelic reactions may be duplexed to double the number of SNPs assessed per plate, currently there are no multiplex formats (multiple SNPs together) for microtiter readouts. Enzymatic detection methods have been used in microtiter plates with OLA, flap cleavage, and minisequencing, but require many handling steps and are more difficult to automate. Fluorescent readouts have the advantage that they may be used in closed tubes without handling, and have been used extensively in allele-specific PCR [51,52], TaqMan assays [47] and Molecular Beacon assays [39], as well as OLA [45,53] (Figure 3). On the other end of the spectrum, array formats allow genotyping of many SNPs on a single sample. Chip arrays are fundamentally reverse blots, where the amplicon is fluorescently tagged and hybridized to an immobilized array of oligonucleotides. However, because the oligo sequences are SNP specific, each chip has a unique design, and the cost and effort involved in the manufacture of unique arrays can be prohibitive. An alternative array readout is universal arrays of oligonucleotides, where each array is simply used as an address for hybridization capture of genotyping reactions (Figure 4). A unique sequence complementary to a tag on the array is synthesized on the 5′ end of each minisequencing oligo. Minisequencing reactions are run for many SNPs, pooled, and hybridized to the universal chip. Fluorescence levels at each address on the array are measured, and then genotype is scored by decoding the corresponding minisequencing reaction [54–56]. A novel alternative to two-dimensional arrays is the use of universal oligonucleotides bound to labeled beads (Figure 4; [57–59]).
Using a flow cytometer, the identity of the oligo bound to a bead is read on one wavelength, and fluorescently tagged genotype assays are scored on other wavelength(s). This readout format is compatible with OLA, minisequencing, and reverse blot hybridization. Another recently developed technology for genotype scoring is mass spectrometry, where minisequencing reactions are scored by directly measuring the mass of the extension product, or by measuring the mass of allele-specific tag molecules. The instrumentation for this technology is prohibitively expensive for small groups, but several companies have set up to provide genotyping services at reasonably competitive costs. There is a trade-off between how readily a format can be automated and how well it can be multiplexed. Array formats allow the most robust testing of many SNPs in parallel; however, automation of their throughput is difficult as there are not many robots commercially available for large-scale chip handling. Microtiter-well formats are highly

automatable, but have not yet been highly multiplexed. Hence, none of the currently available genotyping approaches meets all of the desired goals. Most provide sufficient throughput to generate several thousands of genotypes per day, although costs still remain a practical factor in scaling these assays with current costs for SNP genotyping ranging from 20–80 cents per genotype. These cost estimates, however, do not include the cost of assay development or optimization, which can vary dramatically.
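To put the quoted per-genotype cost in context, the sketch below scales it to a genome-wide study. The study size (100,000 SNPs typed on 500 samples) is an assumed example consistent with the figures discussed earlier, not a number from any specific study, and assay development and optimization costs are excluded, as in the text. Costs are handled in whole cents to avoid floating-point rounding.

```python
def study_cost_dollars(n_snps: int, n_samples: int, cents_per_genotype: int) -> int:
    """Reagent-level genotyping cost in whole dollars (assay development excluded)."""
    return n_snps * n_samples * cents_per_genotype // 100


N_SNPS = 100_000   # assumption: optimistic genome-wide map size
N_SAMPLES = 500    # assumption: "several hundred" samples

low = study_cost_dollars(N_SNPS, N_SAMPLES, 20)
high = study_cost_dollars(N_SNPS, N_SAMPLES, 80)
print(f"${low:,} to ${high:,} per study")
```

Under these assumptions a single genome scan would cost on the order of tens of millions of dollars in genotyping alone, which is why cost per genotype is treated as the limiting factor.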

Pooling strategies One way to reduce the number of assays required for a full genome scan is to multiplex samples instead of markers. In the simplest pooled study design, two sample pools are generated, representing affected and unaffected individuals [60,61]. After accurate measurement of allele frequency in each pool, the frequencies are compared for significant differences. The magnitude of the allele frequency difference is dependent on several parameters of the risk allele: allele frequency, mode of inheritance, and relative risk. It is also important to note that genetic interaction between loci and haplotype-specific risks will confound the power of pooling analysis, although the likelihood of these effects remains to be determined. Pooling samples requires a quantitative readout of genotype frequency, and the accuracy and precision of allele frequency measurements will have significant impact on the power of pooled study designs to detect disease-associated loci. At the very least, the error in pooled frequency estimates should be less than 2% across a broad spectrum of allele frequencies. Promising preliminary results have been published for pooled frequency estimation using mass spectrometry minisequencing [62], DHPLC minisequencing [63], SSCP [64], allele-specific PCR [52] and chip-array minisequencing [55]. The utility of pooled approaches remains to be proven, especially in light of some recent studies which suggest that the relative risk associated with a SNP is context-dependent on local haplotype [6••,65•]. Haplotype information is lost in pooling strategies, so such effects would be missed.
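One way to see why measurement error matters so much in pooled designs is to compare allele frequencies between two pools with a standard two-proportion z statistic and then add a measurement-error term. This is a textbook approximation swapped in for illustration, not a method taken from the studies cited above; the function name, the independence of errors between pools, and the example numbers are all assumptions.

```python
import math


def pooled_z(freq_cases: float, freq_controls: float,
             n_cases: int, n_controls: int,
             measurement_sd: float = 0.0) -> float:
    """z statistic for an allele-frequency difference between two DNA pools.

    Each pool of n diploid individuals contributes 2n chromosomes to the
    sampling variance. measurement_sd is the standard deviation of the
    frequency estimate for one pool; it is assumed independent between
    pools, so its variance is counted twice.
    """
    p_bar = (freq_cases * n_cases + freq_controls * n_controls) / (n_cases + n_controls)
    sampling_var = p_bar * (1 - p_bar) * (1 / (2 * n_cases) + 1 / (2 * n_controls))
    total_var = sampling_var + 2 * measurement_sd ** 2
    return (freq_cases - freq_controls) / math.sqrt(total_var)


# Hypothetical pools of 500 affected and 500 unaffected individuals.
z_exact = pooled_z(0.30, 0.25, 500, 500)
z_noisy = pooled_z(0.30, 0.25, 500, 500, measurement_sd=0.02)
print(round(z_exact, 2), round(z_noisy, 2))
```

In this example a 2% measurement error per pool contributes more variance than sampling does, and the test statistic drops substantially, which is consistent with the requirement that pooled frequency estimates be accurate to within about 2%.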

Figure 4. Array readouts. Universal array genotyping allows multiplexed readout by hybridization of locus-specific probes to unique addresses on chips or beads. Fluorescent-label incorporation directed by the locus-specific sequence results in an increase in fluorescence at the corresponding tag address. [Diagram labels: locus-specific sequence; cTag sequence; tag sequence; substrate; bead array; chip array with addresses Tag 1, Tag 2, Tag 3, Tag 4.]

Conclusions The availability of high-throughput SNP-detection methods creates two unprecedented tools for molecular genetics. First, even if genome-wide association studies require hundreds of thousands of SNPs, creating such maps is feasible. Second, it is now possible to rapidly identify the pattern of all common genetic variation across any single candidate gene. However, we do not yet understand how to effectively apply these tools. Full genome-association studies will require tens of thousands if not hundreds of thousands of markers per study. To make these studies fiscally comparable to modern linkage studies that use hundreds of microsatellite markers, the cost per genotype must come down by at least several
orders of magnitude. Furthermore, the statistical tools to analyze such an undertaking are not yet fully developed. The loss of statistical power associated with multiple testing of this scale is daunting, and will require clever approaches if it is to be overcome. In addition to the challenges of study design, there are still significant challenges associated with construction of a full genome association map. Assembling a useful 100,000 SNP map will require optimization and genotyping of many more than 100,000 markers to determine relative allele frequencies and levels of linkage disequilibrium between markers. In the immediate future, the utility of dense SNP maps will be most apparent in candidate-gene studies. Complete knowledge of frequent patterns of variation across candidate genes makes it possible to test the hypothesis that frequent variation is associated with common disease, with a meaningful negative result. Performing this type of analysis should keep the community busy until the full genome SNP maps are ready for deployment.
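The scale of the multiple-testing problem mentioned above can be illustrated with the Bonferroni correction, a standard conservative adjustment that is not named in the text and is used here only as an example. It treats every marker as an independent test, which linkage disequilibrium between neighbouring markers violates, so the result should be read as a rough bound rather than a recommended threshold.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


# Assumed example: a 100,000-marker genome scan at an overall alpha of 0.05.
print(bonferroni_threshold(0.05, 100_000))
```

A per-marker threshold this small requires much larger samples or stronger effects to reach, which is the loss of power the text refers to.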

Acknowledgement Thanks to M Rieder for comments on the manuscript.

References and recommended reading Papers of particular interest, published within the annual period of review, have been highlighted as:

• of special interest •• of outstanding interest 1.

Collins FS: Positional cloning moves from perditional to traditional. Nat Genet 1995, 9:347-350. [Published erratum appears in Nat Genet 1995, 11:104.]

2. Risch NJ: Searching for genetic determinants in the new •• millennium. Nature 2000, 405:847-856. This is an excellent review of the relative statistical power of linkage and association studies.

84

Proteomics and genomics

3. Kruglyak L: Prospects for whole-genome linkage disequilibrium •• mapping of common disease genes. Nat Genet 1999, 22:139-144. This paper models linkage disequilibrium and suggests that useful levels of linkage disequilibrium may only extend a few kilobases in either direction.

21. Irizarry K, Kustanovich V, Li C, Brown N, Nelson S, Wong W, Lee CJ: •• Genome-wide analysis of single-nucleotide polymorphisms in human expressed sequences. Nat Genet 2000, 26:233-236. This report describes SNP mining from large-scale EST contig assemblies.

4.

Taillon-Miller P, Bauer-Sardina I, Saccone NL, Putzel J, Laitinen T, Cao A, Kere J, Pilia G, Rice JP, Kwok PY: Juxtaposed regions of extensive and minimal linkage disequilibrium in human Xq25 and Xq28. Nat Genet 2000, 25:324-328.

22. Shen LX, Basilion JP, Stanton VP Jr: Single-nucleotide • polymorphisms can cause different structural folds of mRNA. Proc Natl Acad Sci USA 1999, 96:7871-7876. This paper reports that non-amino acid changing SNPs (synonymous SNPs) can have functional consequences.

5.

Eaves IA, Merriman TR, Barber RA, Nutland S, Tuomilehto-Wolf E, Tuomilehto J, Cucca F, Todd JA: The genetically isolated populations of Finland and Sardinia may not be a panacea for linkage disequilibrium mapping of common disease genes. Nat Genet 2000, 25:320-323.

6. ••

Horikawa Y, Oda N, Cox NJ, Li X, Orho-Melander M, Hara M, Hinokio Y, Lindner TH, Mashima H, Schwarz PE et al.: Genetic variation in the gene encoding calpain-10 is associated with type 2 diabetes mellitus. Nat Genet 2000, 26:163-175. This is an association study of a common disease, which applies linkage disequilibrium analysis and suggests that non-coding SNPs are important. 7.

8.

9.

Zhu X, McKenzie CA, Forrester T, Nickerson DA, Broeckel U, Schunkert H, Doering A, Jacob HJ, Cooper RS, Rieder MJ: Localization of a small genomic region associated with elevated ACE. Am J Hum Genet 2000, 67:1144-1153. Kwok PY, Deng Q, Zakeri H, Taylor SL, Nickerson DA: Increasing the information content of STS-based genome maps: identifying polymorphisms in mapped STSs. Genomics 1996, 31:123-126. Wang DG, Fan JB, Siao CJ, Berno A, Young P, Sapolsky R, Ghandour G, Perkins N, Winchester E, Spencer J et al.: Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome. Science 1998, 280:1077-1082.

10. Cambien F, Poirier O, Nicaud V, Herrmann SM, Mallet C, Ricard S, Behague I, Hallet V, Blanc H, Loukaci V et al.: Sequence diversity in 36 candidate genes for cardiovascular disorders. Am J Hum Genet 1999, 65:183-191.

11. •• Halushka MK, Fan JB, Bentley K, Hsie L, Shen N, Weder A, Cooper R, Lipshutz R, Chakravarti A: Patterns of single-nucleotide polymorphisms in candidate genes for blood-pressure homeostasis. Nat Genet 1999, 22:239-247. This paper looked at the patterns of SNP variation across a number of candidate genes for cardiovascular disease.

12. Taillon-Miller P, Gu Z, Li Q, Hillier L, Kwok PY: Overlapping genomic sequences: a treasure trove of single-nucleotide polymorphisms. Genome Res 1998, 8:748-754.

13. Picoult-Newberg L, Ideker TE, Pohl MG, Taylor SL, Donaldson MA, Nickerson DA, Boyce-Jacino M: Mining SNPs from EST databases. Genome Res 1999, 9:167-174.

14. • Marth GT, Korf I, Yandell MD, Yeh RT, Gu Z, Zakeri H, Stitziel NO, Hillier L, Kwok PY, Gish WR: A general approach to single-nucleotide polymorphism discovery. Nat Genet 1999, 23:452-456. This paper describes a new tool for the identification of SNPs from genomic clone overlaps.

23. •• Altshuler D, Pollara VJ, Cowles CR, van Etten WJ, Baldwin J, Linton L, Lander ES: An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 2000, 407:513-516. This paper details the reduced representation shotgun sequencing approach to SNP discovery, which has provided a significant portion of the SNPs in dbSNP at this point.

24. •• Cargill M, Altshuler D, Ireland J, Sklar P, Ardlie K, Patil N, Lane CR, Lim EP, Kalayanaraman N, Nemesh J et al.: Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat Genet 1999, 22:231-238. This paper describes the patterns of SNP variation across the coding regions of a large panel of candidate genes.

25. Kwok PY, Carlson C, Yager TD, Ankener W, Nickerson DA: Comparative analysis of human DNA variations by fluorescence-based sequencing of PCR products. Genomics 1994, 23:138-144.

26. Nickerson DA, Tobe VO, Taylor SL: PolyPhred: automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing. Nucleic Acids Res 1997, 25:2745-2751.

27. Rieder MJ, Taylor SL, Tobe VO, Nickerson DA: Automating the identification of DNA variations using quality-based fluorescence re-sequencing: analysis of the human mitochondrial genome. Nucleic Acids Res 1998, 26:967-973.

28. Chee M, Yang R, Hubbell E, Berno A, Huang XC, Stern D, Winkler J, Lockhart DJ, Morris MS, Fodor SP: Accessing genetic information with high-density DNA arrays. Science 1996, 274:610-614.

29. Hacia JG, Brody LC, Chee MS, Fodor SP, Collins FS: Detection of heterozygous mutations in BRCA1 using high density oligonucleotide arrays and two-colour fluorescence analysis. Nat Genet 1996, 14:441-447.

30. Underhill PA, Jin L, Lin AA, Mehdi SQ, Jenkins T, Vollrath D, Davis RW, Cavalli-Sforza LL, Oefner PJ: Detection of numerous Y chromosome biallelic polymorphisms by denaturing high-performance liquid chromatography [letter]. Genome Res 1997, 7:996-1005.

31. Sheffield VC, Cox DR, Lerman LS, Myers RM: Attachment of a 40-base-pair G + C-rich sequence (GC-clamp) to genomic DNA fragments by the polymerase chain reaction results in improved detection of single-base changes. Proc Natl Acad Sci USA 1989, 86:232-236.

32. Cotton RG, Rodrigues NR, Campbell RD: Reactivity of cytosine and thymine in single-base-pair mismatches with hydroxylamine and osmium tetroxide and its application to the study of mutations. Proc Natl Acad Sci USA 1988, 85:4397-4401.

15. Garg K, Green P, Nickerson DA: Identification of candidate coding region single nucleotide polymorphisms in 165 human genes using assembled expressed sequence tags. Genome Res 1999, 9:1087-1092.

33. Youil R, Kemper BW, Cotton RG: Screening for mutations by enzyme mismatch cleavage with T4 endonuclease VII. Proc Natl Acad Sci USA 1995, 92:87-91.

16. Buetow KH, Edmonson MN, Cassidy AB: Reliable identification of large numbers of candidate SNPs from public EST data. Nat Genet 1999, 21:323-325.

34. Rossetti S, Englisch S, Bresin E, Pignatti PF, Turco AE: Detection of mutations in human genes by a new rapid method: cleavage fragment length polymorphism analysis (CFLPA). Mol Cell Probes 1997, 11:155-160.

17. Ewing B, Hillier L, Wendl MC, Green P: Base-calling of automated sequencer traces using Phred. I. Accuracy assessment. Genome Res 1998, 8:175-185.

18. Ewing B, Green P: Base-calling of automated sequencer traces using Phred. II. Error probabilities. Genome Res 1998, 8:186-194.

19. Clifford R, Edmonson M, Hu Y, Nguyen C, Scherpbier T, Buetow KH: Expression-based genetic/physical maps of single-nucleotide polymorphisms identified by the cancer genome anatomy project. Genome Res 2000, 10:1259-1265.

20. Gu Z, Hillier L, Kwok PY: Single nucleotide polymorphism hunting in cyberspace. Hum Mutat 1998, 12:221-225.

35. Upchurch DA, Shankarappa R, Mullins JI: Position and degree of mismatches and the mobility of DNA heteroduplexes. Nucleic Acids Res 2000, 28:E69.

36. Botstein D, White RL, Skolnick M, Davis RW: Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am J Hum Genet 1980, 32:314-331.

37. Saiki RK, Bugawan TL, Horn GT, Mullis KB, Erlich HA: Analysis of enzymatically amplified beta-globin and HLA-DQ alpha DNA with allele-specific oligonucleotide probes. Nature 1986, 324:163-166.

38. Saiki RK, Walsh PS, Levenson CH, Erlich HA: Genetic analysis of amplified DNA with immobilized sequence-specific oligonucleotide probes. Proc Natl Acad Sci USA 1989, 86:6230-6234.

53. Chen X, Livak KJ, Kwok PY: A homogeneous, ligase-mediated DNA diagnostic test. Genome Res 1998, 8:549-556.

39. Marras SA, Kramer FR, Tyagi S: Multiplex detection of single-nucleotide variations using molecular beacons. Genet Anal 1999, 14:151-156.

54. Pastinen T, Partanen J, Syvanen AC: Multiplex, fluorescent, solid-phase minisequencing for efficient screening of DNA sequence variation. Clin Chem 1996, 42:1391-1397.

40. Nikiforov TT, Rendle RB, Goelet P, Rogers YH, Kotewicz ML, Anderson S, Trainor GL, Knapp MR: Genetic bit analysis: a solid phase method for typing single nucleotide polymorphisms. Nucleic Acids Res 1994, 22:4167-4175.

55. Fan JB, Chen X, Halushka MK, Berno A, Huang X, Ryder T, Lipshutz RJ, Lockhart DJ, Chakravarti A: Parallel genotyping of human SNPs using generic high-density oligonucleotide tag arrays. Genome Res 2000, 10:853-860.

41. Syvanen AC: From gels to chips: ‘minisequencing’ primer extension for analysis of point mutations and single nucleotide polymorphisms. Hum Mutat 1999, 13:1-10.

56. Hirschhorn JN, Sklar P, Lindblad-Toh K, Lim YM, Ruiz-Gutierrez M, Bolk S, Langhorst B, Schaffner S, Winchester E, Lander ES: SBE-TAGS: an array-based method for efficient single-nucleotide polymorphism genotyping. Proc Natl Acad Sci USA 2000, 97:12164-12169.

42. Newton CR, Graham A, Heptinstall LE, Powell SJ, Summers C, Kalsheker N, Smith JC, Markham AF: Analysis of any point mutation in DNA. The amplification refractory mutation system (ARMS). Nucleic Acids Res 1989, 17:2503-2516.

43. Sommer SS, Cassady JD, Sobell JL, Bottema CD: A novel method for detecting point mutations or polymorphisms and its application to population screening for carriers of phenylketonuria. Mayo Clin Proc 1989, 64:1361-1372.

44. Iannone MA, Taylor JD, Chen J, Li MS, Rivers P, Slentz-Kesler KA, Weiner MP: Multiplexed single nucleotide polymorphism genotyping by oligonucleotide ligation and flow cytometry. Cytometry 2000, 39:131-140.

45. Chen X, Kwok PY: Homogeneous genotyping assays for single nucleotide polymorphisms with fluorescence resonance energy transfer detection. Genet Anal 1999, 14:157-163.

46. Lyamichev V, Mast AL, Hall JG, Prudent JR, Kaiser MW, Takova T, Kwiatkowski RW, Sander TJ, de Arruda M, Arco DA et al.: Polymorphism identification and quantitative detection of genomic DNA by invasive cleavage of oligonucleotide probes. Nat Biotechnol 1999, 17:292-296.

47. Livak KJ: Allelic discrimination using fluorogenic probes and the 5′ nuclease assay. Genet Anal 1999, 14:143-149.

48. Kobayashi M, Rappaport E, Blasband A, Semeraro A, Sartore M, Surrey S, Fortina P: Fluorescence-based DNA minisequence analysis for detection of known single-base changes in genomic DNA. Mol Cell Probes 1995, 9:175-182.

49. Grossman PD, Bloch W, Brinson E, Chang CC, Eggerding FA, Fung S, Iovannisci DM, Woo S, Winn-Deen ES, Iovannisci DA: High-density multiplex detection of nucleic acid sequences: oligonucleotide ligation assay and sequence-coded separation. Nucleic Acids Res 1994, 22:4527-4534. [Published erratum appears in Nucleic Acids Res 1998, 26:5539.]

50. Iwahana H, Yoshimoto K, Mizusawa N, Kudo E, Itakura M: Multiple fluorescence-based PCR-SSCP analysis. BioTechniques 1994, 16:296-297, 300-295.

51. Germer S, Higuchi R: Single-tube genotyping without oligonucleotide probes. Genome Res 1999, 9:72-78.

52. Germer S, Holland MJ, Higuchi R: High-throughput SNP allele-frequency determination in pooled DNA samples by kinetic PCR. Genome Res 2000, 10:258-266.

57. Chen J, Iannone MA, Li MS, Taylor JD, Rivers P, Nelsen AJ, Slentz-Kesler KA, Roses A, Weiner MP: A microsphere-based assay for multiplexed single nucleotide polymorphism analysis using single base chain extension. Genome Res 2000, 10:549-557.

58. Cai H, White PS, Torney D, Deshpande A, Wang Z, Marrone B, Nolan JP: Flow cytometry-based minisequencing: a new platform for high-throughput single-nucleotide polymorphism scoring. Genomics 2000, 66:135-143.

59. Dunbar SA, Jacobson JW: Application of the luminex LabMAP in rapid screening for mutations in the cystic fibrosis transmembrane conductance regulator gene: a pilot study. Clin Chem 2000, 46:1498-1500.

60. Barcellos LF, Klitz W, Field LL, Tobias R, Bowcock AM, Wilson R, Nelson MP, Nagatomi J, Thomson G: Association mapping of disease loci, by use of a pooled DNA genomic screen. Am J Hum Genet 1997, 61:734-747.

61. Risch N, Teng J: The relative power of family-based and case-control designs for linkage disequilibrium studies of complex human diseases I. DNA pooling. Genome Res 1998, 8:1273-1288.

62. Ross P, Hall L, Haff LA: Quantitative approach to single-nucleotide polymorphism analysis using MALDI-TOF mass spectrometry. BioTechniques 2000, 29:620-629.

63. Hoogendoorn B, Norton N, Kirov G, Williams N, Hamshere M, Spurlock G, Austin J, Stephens M, Buckland P, Owen M: Cheap, accurate and rapid allele frequency estimation of single nucleotide polymorphisms by primer extension and DHPLC in DNA pools. Hum Genet 2000, in press.

64. Sasaki T, Tahira T, Suzuki A, Higasa K, Kukita Y, Baba S, Hayashi K: Precise estimation of allele frequencies of single-nucleotide polymorphisms by a quantitative SSCP analysis of pooled DNA. Am J Hum Genet 2000, 68: in press.

65. • Drysdale CM, McGraw DW, Stack CB, Stephens JC, Judson RS, Nandabalan K, Arnold K, Ruano G, Liggett SB: Complex promoter and coding region beta 2-adrenergic receptor haplotypes alter receptor expression and predict in vivo responsiveness. Proc Natl Acad Sci USA 2000, 97:10483-10488. This paper describes how the phenotypic consequences of some SNPs are affected by haplotype context.