Genetic Analysis: Biomolecular Engineering 14 (1999) 215 – 219
Differential sequencing with mass spectrometry
Joel H. Graber *, Cassandra L. Smith, Charles R. Cantor
Center for Advanced Biotechnology, Boston University, Cummington Street, Boston MA 02215, USA
Abstract
Differential or genetic sequencing requires searching sample DNA for variations with respect to a reference sequence. Conventional detection techniques are too labor-intensive and costly for use in diagnostic applications; new technologies will therefore be required. Measurement techniques based on mass spectrometry (MS) have the potential for high-throughput, high-fidelity measurement of sequence variation. Unambiguous detection of polymorphic sequences has been demonstrated, even in heterozygous samples. Automated, reproducible measurements of microscopic arrays of samples will enable the high-throughput detection required for large-scale applications. Computational simulation and analysis of experimental parameters prior to experimentation will provide the optimization necessary for the development of robust, reproducible measurements. © 1999 Elsevier Science B.V. All rights reserved.
Keywords: Differential sequencing; Mass spectrometry; DNA
* Corresponding author. Tel.: +1-617-353-8500; fax: +1-617-353-8501; e-mail: [email protected].
1050-3862/99/$ - see front matter © 1999 Elsevier Science B.V. All rights reserved. PII: S1050-3862(98)00020-5
1. Introduction
The Human Genome Project, projected for completion in the next 7 years [1], will produce a reference sequence for human DNA. The availability of the entire 3 billion bases of sequence data will open new avenues of diagnostic investigation. This reference sequence will be, as the name implies, only a reference. The DNA sequence of any individual selected at random will differ from this reference sequence at many non-detrimental sites (polymorphisms) and possibly also at detrimental sites (mutations). Characterization of the differences between a given specific sequence and the reference sequence is referred to as differential sequencing or, in the specific case of harmful mutation detection, diagnostic sequencing.
Any method for differential sequencing will be best designed if it takes advantage of the fact that most of the sequence is already known. This will greatly reduce both the data collection and analysis requirements. Most genes are thousands of nucleotides long, while variations are much smaller, potentially as small as a single base. The typical estimate of the random variability of human DNA, ignoring repeat and other known highly polymorphic regions [2], is about 0.1%, or approximately 3 million differences between a given sequence and the reference standard. The ideal technique is one that allows rapid localization of variable regions and, more importantly, elimination of all remaining sequence from the need for further analysis.
Identification and characterization of all possible variations is not feasible with conventional electrophoresis-based detection techniques, especially in a clinical setting, because of financial, time, and manpower constraints. Deleterious mutations can be highly localized, as in Huntington’s disease [3], or widely varied in both composition and location, as in the many forms of cancer [4]. Development of clinical tests for unknown sequence variation will require advances in miniaturization and parallel detection.
Significant advances have already been made in the area of miniaturization and parallel detection through the development of chip-based detection systems [5–8], in which either probe or target fragments are attached to a surface-mounted array. Selection and identification of genetic variation (harmful or otherwise) is achieved through alteration of the hybridization efficiency of the probe and target due to imperfect base pairings caused by differing sequences. Detection in these systems is typically achieved through fluorescent tagging of one or both of the single-stranded components. Sequence information, and therefore mutation information, is obtained through analysis of the pattern of successfully hybridized probes. Results from this type of analysis have been impressive, but problems remain with the use of fluorescence as the basis of detection. Some of the potential limitations include:
1. A large redundancy of probes: in order to identify a given mutation precisely, an exactly matching probe must be present.
2. Potential for false positive or false negative signals: hybridization efficiency can be affected by factors other than sequence variation, e.g. secondary structure or variations in the melting temperatures of different probes.
3. Requirement for a large volume of target sample: even for microscopic arrays, this method requires a relatively large amount of sample, as the unknown targets must be simultaneously present for hybridization at all array sites.
4. Ambiguous determination of heterozygous samples: multiple hybridization signals of the same probe cannot generally be resolved, since the presence of fluorescence only indicates relative hybridization efficiencies, rather than uniquely identifying the species present.
Mass spectrometry (MS) is a detection technique that potentially addresses most of the limitations of fluorescence-based detection listed above. In addition, MS-based measurements make possible experimental approaches that do not rely upon hybridization as the basis for differentiating variable sequences, thereby completely avoiding the difficulties listed above. These approaches are discussed further below.
2. Differential sequencing with mass spectrometry
Mass spectrometric measurements of nucleic acids were made possible through the development of the ‘gentle’ ionization techniques of matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS) and electrospray ionization mass spectrometry (ESI-MS). For a more complete description of MALDI-MS and ESI-MS, see Refs. [9–12]. In brief, an MS measurement is achieved through the creation of an ionized gas-phase analyte, followed by detection based on the mass-to-charge (m/z) ratio. A common detection scheme, and the only one discussed here, is time-of-flight measurement, in which the ionized molecules are accelerated in an electric field and then fly through a vacuum chamber (typically 1 m) to a detector. The time of flight to the detector is inversely proportional to the velocity of the particle, which is in turn proportional to $(m/z)^{-1/2}$; the flight time therefore scales as $(m/z)^{1/2}$. Currently available time-of-flight mass spectrometers are capable of mass accuracy better than 1 Dalton and mass resolution up to one part in a thousand for nucleic acid fragments as long as 80–100 nucleotides. Measurements can be made rapidly and automatically with instruments capable of analyzing arrays of samples [13]. Measurement of all sites in an array is projected to occur at 1 s per sample, which will lead to total measurement times on the order of a few minutes for arrays with hundreds to thousands of sites.
Several aspects of MS-based hybridization detection make it an attractive alternative to fluorescence-based detection. Mass spectrometric detection can usually be freed of many of the ambiguities that arise in fluorescent detection schemes, since different molecules that hybridize to the same site will have different sequences and therefore, in most cases, different masses. Since no labels must be attached to the molecules, the number of samples that can be measured in a single spectrum is limited only by the mass resolution and the maximum mass detected by the spectrometer. This is technically easier than fluorescent assays, in which multiple signals at a given array element require differently colored fluorescent tags. Conservative estimates based on current mass spectrometer capabilities suggest that spectra with 200–500 unique molecules present will be feasible. This in turn greatly reduces the size and complexity of the sample array that must be constructed for hybridization experiments prior to MS detection. An array of a few hundred to a thousand sites will be capable of detecting $10^4$ to $10^5$ different molecules, each of which would require one or more dedicated sites in a fluorescence-based array.
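As a rough numerical illustration of the time-of-flight relation stated earlier in this section, the following sketch computes the drift time of a singly charged ion in an idealized linear instrument. The 20 kV accelerating potential, the example masses, and the function name are assumptions chosen for illustration; time spent in the acceleration region, initial velocity spread, and delayed extraction are all ignored.

```python
import math

E_CHARGE = 1.602176634e-19    # elementary charge, C
DALTON = 1.66053906660e-27    # unified atomic mass unit, kg

def flight_time(mass_da, charge=1, voltage=20_000.0, length_m=1.0):
    """Drift time (s) of an ion in an idealized linear TOF tube.

    Energy balance: z*e*U = (1/2)*m*v**2, so v = sqrt(2*z*e*U/m)
    and t = L / v, which scales as sqrt(m/z).
    The 20 kV default is an assumed value, not taken from the text.
    """
    mass_kg = mass_da * DALTON
    velocity = math.sqrt(2.0 * charge * E_CHARGE * voltage / mass_kg)
    return length_m / velocity

t_20mer = flight_time(6000.0)     # roughly a 20-nucleotide fragment
t_80mer = flight_time(24000.0)    # four times the mass
print(f"{t_20mer * 1e6:.1f} us, {t_80mer * 1e6:.1f} us")
```

Under these assumptions the flight times are tens of microseconds, and quadrupling the mass doubles the flight time, as expected from the square-root dependence on m/z.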
MS measurement schemes can also be made inherently differential; the molecular mass to charge ratio is an intrinsic property of the samples under examination, dependent only upon base composition. Systematic creation of molecules for differential sequencing analysis (see Section 4 below) allows pre-computation of the expected mass. The first level of data analysis consists simply of comparing measured with predicted masses, which reduces the overall data collection by a factor equal to the average number of nucleotides in a molecule. Any molecules with no change from the predicted mass are therefore excluded from further analysis. Fig. 1 shows some of the basic schemes that can and have been implemented for detection of polymorphisms or mutations using mass spectrometry.
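The first-level comparison of measured with predicted masses can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the residue masses are commonly quoted approximate average values, the end-group correction assumes an unphosphorylated single strand, and the fragment names, sequences, and 1 Da tolerance are hypothetical.

```python
# Approximate average masses (Da) of DNA nucleotide residues within a chain.
# Commonly quoted approximations, assumed here for illustration only.
RESIDUE_MASS = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.20}
# Assumed correction for a single strand with 5'-OH and 3'-OH ends.
END_CORRECTION = -61.96

def predicted_mass(seq):
    """Average neutral mass of a DNA fragment, from base composition only."""
    return sum(RESIDUE_MASS[b] for b in seq.upper()) + END_CORRECTION

def flag_variants(reference_fragments, measured_masses, tolerance=1.0):
    """Return fragments whose measured mass deviates from the prediction.

    reference_fragments: dict name -> reference sequence
    measured_masses:     dict name -> measured mass in Da (hypothetical input)
    tolerance:           assumed mass accuracy in Da
    """
    flagged = {}
    for name, seq in reference_fragments.items():
        shift = measured_masses[name] - predicted_mass(seq)
        if abs(shift) > tolerance:
            flagged[name] = round(shift, 2)
    return flagged

reference = {"frag1": "ACGTTGCA", "frag2": "GGATCCAT"}
# Simulated measurement: frag1 unchanged, frag2 carries an A->G substitution.
measured = {
    "frag1": predicted_mass("ACGTTGCA"),
    "frag2": predicted_mass("GGGTCCAT"),
}
flagged = flag_variants(reference, measured)
print(flagged)
```

Only the flagged fragment would go on to further analysis. With these approximate residue masses, the smallest single-base substitution shift is the roughly 9 Da difference between A and T, which sets the mass accuracy such a comparison needs.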
Fig. 1. Examples of possible schemes for mutation/polymorphism detection via mass spectrometry, along with rough appearance of the expected spectra. (a) Hybridization of targets to fixed probes. MS can distinguish between perfect and imperfect hybridization, based on different masses of the targets. (b) Extension of primers through known variable regions: a probe is annealed to a single-stranded target, upstream of the variable region, after which a primer extension reaction is carried out. MS distinguishes between normal and mutated/polymorphic sequence by the different mass of the extended primers. (c) Differential cleavage by restriction enzymes: in this scheme, the presence of a recognition sequence for a restriction enzyme is exploited. If no mutation is present, the enzyme cleaves the fragment into two smaller fragments (labeled 1 and 2); in the case of mutation or polymorphism, the enzyme cannot cut, therefore a single large fragment (3) is measured via MS.
3. Recent progress in differential sequencing with mass spectrometry
MS-based mutation/polymorphism detection studies are currently being carried out by several different groups [14–20]. The Primer Oligo Base Extension (PROBE)™ technique [14–17], developed by Sequenom, Inc., is a particularly successful example of this type of measurement. In PROBE measurements, a primer sequence (20 nucleotides) is annealed to a single-stranded target, upstream from the potentially variable region of interest. The primer is then extended in the presence of one or more deoxynucleotide triphosphates (dNTPs) and one dideoxynucleotide triphosphate (ddNTP), which terminates the extension upon incorporation. Variable sequences are easily detected in this scheme, even in samples from heterozygous individuals, because the final molecular masses differ, owing either to different incorporated bases or to a different fragment length caused by a changed location of the terminating ddNTP. The PROBE technique [13–15] has been used to investigate genetic variations in several diseases, including cystic fibrosis, hypertension, and several different types of cancer. In a separate study [17], the PROBE assay was used to measure the lengths of trinucleotide repeat regions, which are interesting both as polymorphic genetic markers and for their possible association with disease genotypes.
The PinPoint™ assay [18], developed by PerSeptive Biosystems, Inc., operates on a similar principle, though to date it has only been demonstrated for interrogation of the first base immediately downstream of the annealed primer.
Fu et al. [21] used MALDI-MS to sequence exons 5 through 8 of the p53 gene, performing direct readout of Sanger sequencing ladders. The MS measurement clearly demonstrated the ability to identify heterozygous samples unambiguously through the different masses of the products. In addition, false stops, a common problem in sequencing reactions, were easily identified. The correct base could be determined from the mass included in successful downstream products.
Automated MS analysis of an array of samples was recently proved feasible by Little et al. [14]. They used arrays of nanoliter-volume pits [24], chemically etched into a silicon surface. The creation of these pits allowed reproducible and automated measurement of sub-femtomole quantities of analytes. This technique is in sharp contrast to the manual searches of crystalline samples for measurable signal that characterized the earliest MALDI-MS measurements.
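The termination rule behind primer extension assays such as PROBE, described earlier in this section, can be illustrated with a toy model. The sketch below is not Sequenom's implementation: the function name and the sequences are hypothetical, the downstream bases are written as the newly synthesized strand (already complemented), and the three non-terminating bases are assumed to be supplied as ordinary dNTPs.

```python
def extend_primer(primer, downstream, terminator):
    """Toy model of primer extension with one chain-terminating ddNTP.

    primer:     primer sequence (5'->3')
    downstream: bases that would be appended to the primer, written as
                the newly synthesized strand
    terminator: the base supplied only as a ddNTP; extension stops as
                soon as it is incorporated
    """
    added = []
    for base in downstream:
        added.append(base)
        if base == terminator:
            break  # the ddNTP lacks a 3'-OH, so no further extension
    return primer + "".join(added)

primer = "ACGTACGTACGTACGTACGT"       # hypothetical 20-nucleotide primer
alleles = {"wild": "ATGCA", "variant": "GTGCA"}   # differ at the first base

products = {name: extend_primer(primer, seq, terminator="G")
            for name, seq in alleles.items()}
for name, product in products.items():
    print(name, len(product), product[len(primer):])
```

In this example the wild-type product is extended by three bases and the variant by one, so a heterozygous sample would yield two products of clearly different length and mass in the same spectrum.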
4. Optimization: Intelligent assay design is key to maximized utility
The key to successful large-scale MS-based differential sequencing schemes will be careful design and creation of the sets of molecules to be analyzed simultaneously. Since MS is by definition a measurement of mass, any two molecules with identical composition but differing sequences will be indistinguishable. This constraint requires that experiments be designed and implemented with prior considerations or calculations [22] made to minimize the likelihood of overlap of sample masses. This is not a severe limitation, since the reference sequence is known a priori. All fragments created from the reference sequence, as well as most variations, can be predicted prior to MS measurement.
In order to make the best use of the advantages of MS-based detection, experimental schemes are being developed that need no hybridization step or sequencing reaction, but instead rely solely upon MS measurement of small fragments of DNA. Such methods involve reproducibly fragmenting a genomic sequence of several hundred nucleotides (a typical exon size) into the fragments of tens of nucleotides required for MS analysis. Several approaches to reproducible DNA fragmentation are currently under analysis (see below). The complexity of the optimization rules out manual determination for all but the simplest cases. Computational analysis is straightforward, and can therefore be used to simulate and assess all potential fragment creation schemes prior to any MS measurements. The fragmentation schemes currently being investigated through simulations include:
1. mixtures of restriction enzymes which cleave the sequence, producing fragments for MS detection (Fig. 2 shows a diagram of the steps involved in this analysis);
2. multiplexed PCR amplification of small fragments for MS detection;
3. introduction of sequence-specific cutting sites via base analogs.
In each of these cases, the optimization must be determined within a set of constraints for the resulting fragments. Constraints can include maximum number of restriction enzymes, maximum fragment mass, maximum number of fragments, minimum mass difference between fragments, maximum number of parallel PCR reactions, and others. Initial non-optimized studies of the p53 coding sequence [22] indicate that over 90% of the mutations in a publicly available database [23] can be detected through the enzyme mixture method, and nearly 100% can be obtained directly using the multiplexed PCR method.
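A simulation of the enzyme mixture scheme under such constraints might look like the following sketch. It is an illustrative simplification and not the authors' simulation code [22]: it digests a single strand, cuts immediately after each recognition site (real enzymes cut double-stranded DNA at enzyme-specific offsets), ignores end-group corrections, and uses approximate residue masses. The test sequence, the constraint values, and all function names are hypothetical; GGATCC and GAATTC are the recognition sequences of BamHI and EcoRI.

```python
from itertools import combinations

RESIDUE_MASS = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.20}  # approx., Da

def fragment_mass(seq):
    """Composition-only mass; end-group corrections are ignored here."""
    return sum(RESIDUE_MASS[b] for b in seq)

def digest(seq, sites):
    """Cut seq immediately after every occurrence of each recognition site."""
    cuts = set()
    for site in sites:
        start = seq.find(site)
        while start != -1:
            cuts.add(start + len(site))
            start = seq.find(site, start + 1)
    bounds = [0] + sorted(c for c in cuts if c < len(seq)) + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]

def satisfies_constraints(fragments, max_mass, min_delta):
    """Check the maximum fragment mass and minimum pairwise mass difference."""
    masses = [fragment_mass(f) for f in fragments]
    if max(masses) > max_mass:
        return False
    return all(abs(m1 - m2) >= min_delta for m1, m2 in combinations(masses, 2))

def detectable(reference, variant, sites, tolerance=1.0):
    """A variant is detectable if its fragment mass list differs from the reference's."""
    ref = sorted(fragment_mass(f) for f in digest(reference, sites))
    var = sorted(fragment_mass(f) for f in digest(variant, sites))
    if len(ref) != len(var):
        return True
    return any(abs(r - v) > tolerance for r, v in zip(ref, var))

SITES = ["GGATCC", "GAATTC"]
reference = "ATGGATCCTTAGCGAATTCGGCATTA"
substitution = "ATGGATCCTCAGCGAATTCGGCATTA"   # T->C in the middle fragment
rearranged = "ATGGATCCTATGCGAATTCGGCATTA"     # same composition, different order

fragments = digest(reference, SITES)
print(fragments, satisfies_constraints(fragments, max_mass=5000.0, min_delta=5.0))
print(detectable(reference, substitution, SITES))
print(detectable(reference, rearranged, SITES))
```

The single-base substitution shifts one fragment mass and is detected, whereas the rearranged sequence leaves every fragment's composition unchanged and is missed, which is the mass-overlap limitation that the optimization is meant to minimize. An optimizer would repeat this evaluation over candidate enzyme sets and keep those that satisfy the constraints while detecting the largest share of known variants.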
5. Conclusions
Mass spectrometry has the potential to become a valuable tool for diagnostic sequencing applications. Small-scale experiments to date have shown the power of this method for unambiguous detection of genetic variations, even in the difficult situation of a heterozygous sample. The further incorporation of parallel arrays of samples, combined with the implementation of various optimization schemes, is the next step toward making this technology feasible on a genomic scale and in clinical settings.
Fig. 2. Flow chart for MALDI-based mutation/polymorphism detection by multi-enzyme cleavage. Prior knowledge of the sequence of interest allows simulations to be carried out on all steps shown, which in turn facilitates the final step of comparison of predicted/expected fragments with measured fragments. Optimization of this method is primarily through choice of initial PCR primers and of sets of enzymes for multiple cleavage.
Acknowledgements
This work was supported by a DOA grant (DAMD17-94-V-414) to C.L. Smith. J.H. Graber is partially supported through a U.S. National Human Genome Research Institute training grant (T32 HG00041-03).
References
[1] Rowen L, Mahairas G, Hood L. Sequencing the human genome. Science 1997;278:605–7.
[2] Epplen C, Santos EJM, Maeueler W, van Helden P, Epplen JT. On simple repetitive DNA sequences and complex diseases. Electrophoresis 1997;18:1577–85.
[3] Huntington’s Disease Collaborative Research Group. A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington’s disease chromosomes. Cell 1993;72:971–83.
[4] Weinberg RA. How cancer arises. Sci Am 1996;275:62–70.
[5] Chee M, Yang R, Hubbell E, Berno A, Huang XC, Stern D, et al. Accessing genetic information with high-density DNA arrays. Science 1996;274:610–4.
[6] Drobyshev A, Nologina N, Shik V, Pobedimskaya D, Yershov G, Mirzabekov A. Sequence analysis by hybridization with oligonucleotide microchip: identification of beta-thalassemia mutations. Gene 1997;188:45–52.
[7] Weiler J, Gausepohy H, Hauser N, Jensen ON, Hoheisel JD. Hybridisation based DNA screening on peptide nucleic acid (PNA) oligomer arrays. Nucleic Acids Res 1997;25:2792–9.
[8] Pastinen T, Kurg A, Metspalu A, Peltonen L, Syvanen AC. Minisequencing: a specific tool for DNA analysis and diagnostics on oligonucleotide arrays. Genome Res 1997;7:606–14.
[9] Cotter RJ. Time of flight mass spectrometry instrumentation and applications in biological research. Washington, DC: American Chemical Society, 1997.
[10] Crain PF, McCloskey JA. Applications of mass spectrometry to the characterization of oligonucleotides and nucleic acids. Curr Opin Biotech 1998;9:25–34.
[11] Juhasz P, Roskey MT, Smirnov IP, Haff LA, Vestal ML, Martin SA. Applications of delayed extraction matrix-assisted laser desorption ionization time-of-flight mass spectrometry to oligonucleotide analysis. Anal Chem 1996;68:941–6.
[12] Andersen JS, Svensson B, Roepstorff P. Electrospray ionization and matrix assisted laser desorption/ionization mass spectrometry: powerful analytical tools in recombinant protein chemistry. Nat Biotechnol 1996;14:449–57.
[13] Little DP, Braun A, O’Donnell MJ, Koester H. Mass spectrometry from miniaturized arrays for full comparative DNA analysis. Nat Med 1997;3:1413–6.
[14] Little DP, Braun A, Darnhofer-Demar B, Koester H. Identification of apolipoprotein E polymorphisms using temperature cycled primer oligo base extension and mass spectrometry. Eur J Clin Chem Clin Biochem 1997;35:545–8.
[15] Little DP, Braun A, Reuter D, Darnhofer-Demar B, Frilling A, Li Y, McIver RT, Koester H. Detection of RET proto-oncogene codon 634 mutations using mass spectrometry. J Mol Med 1997;75:745–50.
[16] Braun A, Little DP, Koester H. Detecting CFTR gene mutations by using primer oligo base extension and mass spectrometry. Clin Chem 1997;43:1151–8.
[17] Haff LA, Smirnov IP. Single-nucleotide polymorphism identification assays using a thermostable DNA polymerase and delayed extraction MALDI-TOF mass spectrometry. Genome Res 1997;7:378–88.
[18] Roskey MT, Juhasz P, Smirnov IP, Takach EK, Martin SA, Haff LA. DNA sequencing by delayed extraction-matrix-assisted laser desorption/ionization time of flight mass spectrometry. Proc Natl Acad Sci USA 1996;93:4724–9.
[19] Taranenko NI, Matteson KJ, Chung CN, Zhu YF, Chang LY, Allman SL, Haff L, Martin SA, Chen CH. Laser desorption mass spectrometry for point mutation detection. Genet Anal 1996;13:87–94.
[20] Higgins GS, Little DP, Koester H. Competitive oligonucleotide single-base extension combined with mass spectrometric detection for mutation screening. BioTechniques 1997;23:710–4.
[21] Fu DJ, Tang K, Braun A, Reuter D, Darnhofer-Demar B, Little DP, O’Donnell MJ, Cantor CR, Koester H. Sequencing exons 5 to 8 of the p53 gene by MALDI-TOF mass spectrometry. Nat Biotech 1998;16:381–4.
[22] Graber JH, Fu DJ, Smith CL, Cantor CR. Computational optimization of mass spectrometry based genetic mutation localization and identification. Proceedings of Eleventh International Conference on Mathematical and Computer Modelling and Scientific Computing, vol. 8 1997 (in press).
[23] Hollstein M, Shomer B, Greenblatt M, Soussi T, Hovig E, Montesano R, Harris CC. Somatic point mutations in the p53 gene of human tumors and cell lines: updated compilation. Nucleic Acids Res 1996;24:141–6.
[24] Jespersen S, Niessen WMA, Tjaden UR, van der Greef J, Litborn E, Lindberg U, Roeraade. Automated detection of proteins by matrix-assisted laser desorption/ionization mass spectroscopy with the use of picolitre vials. J Rapid Commun Mass Spectrom 1994;8:581–4.