Statistical inference of sequence-dependent mutation rates


Mihaela Zavolan* and Thomas B Kepler†

Several lines of research are now converging towards an integrated understanding of mutational mechanisms and their evolutionary implications. Experimentally, crystal structures reveal the effect of sequence context on polymerase fidelity; large-scale sequencing projects generate vast amounts of sequence polymorphism data; and locus-specific databases are being constructed. Computationally, software and analytical tools have been developed to analyze mutational data, to identify mutational hot spots, and to compare the signatures of mutagenic agents.

Addresses
*Laboratory of Computational Genomics, The Rockefeller University, 1230 York Avenue, New York, New York 10021, USA; e-mail: [email protected]
†The Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, New Mexico 87501, USA; e-mail: [email protected]
Correspondence: Mihaela Zavolan

Current Opinion in Genetics & Development 2001, 11:612–615
0959-437X/01/$, see front matter
© 2001 Elsevier Science Ltd. All rights reserved.

Abbreviations
SC stochastic complexity
SNP single nucleotide polymorphism

Introduction

The evolutionary paradigm postulates that the observable phenotypes of organisms result from selection providing directionality to an otherwise unbiased process of genetic variation. Data have been accumulating that suggest that some directionality may be intrinsic in the mutational mechanisms themselves [1], be they extrinsic (environmental mutagens) [2,3] or intrinsic (DNA polymerases) [4]. Adaptive evolution can exploit sequence-dependent mutational biases, for example, during phase variation in bacteria [5].

Mutations feature prominently in human pathology. Germline mutations can cause genetic disease and confer susceptibility to cancers, somatic mutations may initiate malignant transformation, and drug-resistance mutations impede the treatment of infectious disease. In all of these cases, mutational hot spots have been described [6,7•,8]. In this review, we discuss the following: first, recent developments in the analysis of mutational spectra; second, recent studies in which a sequence-dependent effect on mutation rate has been observed; and third, recent studies that provide a link between the biochemistry of DNA replication and observed regularities in mutational spectra.

Analysis of mutational patterns

Mutational patterns are studied at different levels of granularity. Molecular evolution, for example, treats mutations as stochastic events (with some regularities, such as transition/transversion bias) that can be used to reconstruct the evolutionary history of biological systems [9]. On a finer-grain level, locus-specific and core mutation databases are constructed [10,11,12•,13] to assist the diagnosis and prognosis of human disease. Finally, molecular studies reveal the effects of specific sequence contexts on polymerase fidelity [4] and on DNA repair [14••]. Ideally, we would like to bridge these levels and understand the molecular basis of the observed distribution of mutations in gene and protein sequences, that is, the mutational spectrum [15,16].

Data for studying mutational spectra: a variety of resources

Highly specialized, project-specific databases

We have constructed, for example, a database of mutations that we inferred to have occurred in human processed pseudogenes, as well as a database of mutations that accumulated in non-selected regions of immunoglobulin genes during somatic hypermutation [17••]. Similar databases have been developed and used by Rogozin et al. [18] for comparing the mutational spectrum of somatic hypermutation with that of polymerase η.

Locus-specific and core databases

Locus-specific [12•,14••] and core databases [13] are being developed and maintained for web access.

Genetic polymorphism data

Notable resources became available recently in the first draft of the human genome [19,20] and the single-nucleotide polymorphism (SNP) data that have been generated as a by-product of large-scale sequencing projects [21,22]. SNP data are already being mined to uncover associations between specific loci in the human genome and disease traits, but they could also be used to study mutational patterns in the human genome. At present, almost three million SNPs are deposited in the dbSNP database of the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/SNP/).

Statistical methods

The problem that we address is that of estimating the relative mutation rate, or ‘mutability’, for a given site in a DNA molecule and testing hypotheses about relationships among the rates at different sites. The factors that influence the intrinsic mutability of a site are the identity (A, G, C or T) of the base at that site, the identities of the bases in the local neighborhood, which we shall refer to as the ‘microsequence context’, the potential for secondary structures resulting from more distant bases, the position of the site relative to relevant markers in the molecule (distance from the centromere, from the telomere), and so on. Here we focus solely on the effect of the microsequence context on mutability. Other studies addressed the role of genomic heterogeneity on some aspects of mutability [14••]. We start by creating a classification scheme such that every site represented in the data is assigned to exactly one class;


we assume that the probability of mutation is homogeneous within each class. This amounts to assuming that differences among sites within a class have no influence on mutability, whereas those that distinguish classes do. So, for example, we might assign sites to classes depending only on the wild-type nucleotide at that site. This gives four classes and four probabilities, $p_A$, $p_G$, $p_C$ and $p_T$, and realizes the hypothesis that the flanking nucleotides have no influence. A more refined classification is one in which the mutating base and the 5′ nearest neighbor determine the class, so that there are 16 classes and corresponding probabilities: $p_{AA}$, $p_{GA}$, $p_{TA}$, $p_{CA}$, $p_{AG}$, $p_{GG}$ and so on. Finally, we could add the hypothesis that the probabilities in this latter case are each given by the product of a factor depending only on the mutating base and a factor depending only on the 5′ base: $p_{GA} = q_G r_A$, and so on, where $q_G$ and $r_A$ are parameters. Then the 16 probabilities depend only on 8 numbers, and there are 8 equations of constraint that the $p$s obey, all of the form $p_{GA} p_{TG} = q_G r_A q_T r_G = q_T r_A q_G r_G = p_{TA} p_{GG}$. This last model is an example of a classification with constraints and is, in fact, the most widely used and most general, though it is not often presented as such.

The data on which the inference is based are the total number of observed sites $n_c$ (wildtype + mutated) and the frequency (count) $f_c$ of mutations in each class $c$. The basis for statistical manipulation is the log likelihood,

$$\log L(f \mid p, n) = \sum_c \left[ f_c \log p_c + (n_c - f_c) \log(1 - p_c) \right] \qquad \text{(Eqn 1)}$$

where the notation $L(f \mid p, n)$ is read ‘the likelihood of f given n and p’. Point estimation consists in maximizing $\log L$ over the parameters $p$. If there are no constraints on these parameters, their maximum likelihood estimators are given simply by $\hat{p}_c = f_c / n_c$. In the presence of constraints, the estimators usually have to be found numerically.
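To make the constrained case concrete, the following is a minimal Python sketch of a numerical fit of the multiplicative model, using the counts reported in Table 1. It is not the iterative procedure of reference [17••], whose details are not given here. As a stated assumption, it treats each count as approximately Poisson (reasonable only because all the mutation probabilities are small) and alternates closed-form updates of the two sets of factors. The names `COUNTS`, `fit_multiplicative`, `r` and `q` are illustrative, and natural logarithms are assumed in Eqn 1.

```python
import math

# (mutated, total) counts from Table 1; rows are the 5' neighbor and
# columns the target base, both in the order A, G, T, C.
COUNTS = [
    [(182, 8393), (220, 9379), (105, 6370), (97, 5660)],
    [(99, 7434), (110, 8408), (71, 6627), (146, 5056)],
    [(202, 5555), (100, 8442), (62, 7003), (50, 7758)],
    [(127, 8257), (22, 1330), (85, 8779), (43, 5566)],
]
IDX = range(4)


def log_likelihood(p):
    """Eqn 1 summed over the 16 classes; p[i][j] is the class probability."""
    return sum(
        f * math.log(p[i][j]) + (n - f) * math.log(1.0 - p[i][j])
        for i in IDX
        for j, (f, n) in enumerate(COUNTS[i])
    )


def fit_multiplicative(n_iter=500):
    """Alternating updates for p[i][j] = r[i] * q[j].

    Assumption: counts are treated as Poisson with mean n * r[i] * q[j],
    so each update is the exact maximizer given the other set of factors.
    """
    r = [1.0] * 4  # factors for the 5' neighbor
    q = [0.0] * 4  # factors for the target base (set in the first pass)
    for _ in range(n_iter):
        for j in IDX:
            q[j] = sum(COUNTS[i][j][0] for i in IDX) / sum(
                COUNTS[i][j][1] * r[i] for i in IDX
            )
        for i in IDX:
            r[i] = sum(COUNTS[i][j][0] for j in IDX) / sum(
                COUNTS[i][j][1] * q[j] for j in IDX
            )
    return r, q


r, q = fit_multiplicative()
p_mult = [[r[i] * q[j] for j in IDX] for i in IDX]   # constrained fit
p_free = [[f / n for (f, n) in row] for row in COUNTS]  # unconstrained MLE f/n

ll_mult = log_likelihood(p_mult)
ll_free = log_likelihood(p_free)
print("log L, multiplicative fit:", round(ll_mult, 2))
print("log L, unconstrained fit: ", round(ll_free, 2))
```

Only the products `r[i] * q[j]` are identified; the individual factors are defined up to a common rescaling. The printed values need not match the figures quoted in the text, which may rest on conventions (constant terms, logarithm base) that are not stated.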


Table 1

Mutated/(wildtype + mutated) counts for a somatic hypermutation dataset. Target base in columns, 5′ neighbor in rows.

5′ neighbor   A          G          T          C
A             182/8393   220/9379   105/6370   97/5660
G             99/7434    110/8408   71/6627    146/5056
T             202/5555   100/8442   62/7003    50/7758
C             127/8257   22/1330    85/8779    43/5566
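As a rough sketch of how the counts in Table 1 enter a likelihood ratio comparison of the target-base-only classification (model 1) with the 16-class target + 5′ neighbor classification (model 2), the Python below evaluates Eqn 1 at the unconstrained maximum likelihood estimates of each model. Several choices are assumptions made for illustration: natural logarithms are used, the binomial coefficients are dropped as in Eqn 1, and the chi-square tail is computed with the closed form that holds for an even number of degrees of freedom. The function and variable names are hypothetical.

```python
import math

# (mutated, total) counts from Table 1; rows are the 5' neighbor and
# columns the target base, both in the order A, G, T, C.
COUNTS = [
    [(182, 8393), (220, 9379), (105, 6370), (97, 5660)],
    [(99, 7434), (110, 8408), (71, 6627), (146, 5056)],
    [(202, 5555), (100, 8442), (62, 7003), (50, 7758)],
    [(127, 8257), (22, 1330), (85, 8779), (43, 5566)],
]
IDX = range(4)


def binom_loglik(f, n, p):
    """One term of Eqn 1: f log p + (n - f) log(1 - p)."""
    return f * math.log(p) + (n - f) * math.log(1.0 - p)


def chi2_sf_even(x, df):
    """Upper tail of the chi-square distribution, valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# Model 2: one probability per (5' neighbor, target) class, estimated by f/n.
ll_model2 = sum(binom_loglik(f, n, f / n) for row in COUNTS for (f, n) in row)

# Model 1: one probability per target base, pooling over the 5' neighbor.
ll_model1 = 0.0
for j in IDX:
    f_tot = sum(COUNTS[i][j][0] for i in IDX)
    n_tot = sum(COUNTS[i][j][1] for i in IDX)
    for i in IDX:
        f, n = COUNTS[i][j]
        ll_model1 += binom_loglik(f, n, f_tot / n_tot)

lr_stat = 2.0 * (ll_model2 - ll_model1)  # twice the log-likelihood difference
df = 16 - 4                              # difference in free parameters
p_value = chi2_sf_even(lr_stat, df)
print("LR statistic:", round(lr_stat, 1), "df:", df, "p-value:", p_value)
```

The absolute log-likelihood values produced this way depend on conventions that the text does not state, so they need not reproduce the quoted figures; the qualitative conclusion is what the sketch is meant to show.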

Take a specific example comparing the three classifications mentioned above: target base only (model 1), target + 5′ neighbor (model 2), and target + 5′ neighbor as two multiplicative factors (model 3). We will compare these models using a likelihood ratio test and the data in Table 1 (see Oprea et al. [17••] for details).

The maximum log likelihood is –3829.9 under model 1 and –3769.2 under model 2. Model 1 is ‘nested’ within model 2 (any set of parameters in model 1 can be realized by the appropriate choice of parameters in model 2), so we can use the likelihood ratio test. This test compares twice the difference in log likelihood, 121.3 in our example, to a tabulated chi-square statistic for 12 degrees of freedom (the difference in the number of degrees of freedom for the two models). The probability of obtaining such a large value under model 1 is much less than $10^{-9}$, and thus provides very strong evidence that the 5′ neighbor does indeed influence the mutability of the target nucleotide. Similarly, the constrained model (model 3) is nested within model 2 (and model 1 is nested within it). The maximum likelihood estimates can be obtained by an iterative procedure [17••]. The resulting log likelihood ratio for model 1 and the 8-parameter multiplicative model 3 is 0.68; compared to chi-square with 4 degrees of freedom, we find that this constrained model is not an improvement. Given that the 16-parameter model is an improvement, we are led to conclude that 5′ nucleotides do have an influence but that the effect depends on the identity of the mutating nucleotide.

Bayesian techniques are also gaining in popularity [6,23,24]. These treat the parameters as random variables and necessitate the specification of a prior distribution based on the state of knowledge regarding the parameters before the data are collected. Bayes’ rule then specifies how knowledge of the data changes the distribution, which is then called the posterior. Again, when the parameters are unconstrained, the evaluation of these posterior distributions is not difficult, but in the presence of constraints, approximate methods are required.

Finally, contingency table analysis [25] is appropriate for hypothesis testing in the absence of constraints. It is closely related to likelihood methods but has reliable testing procedures (Fisher’s exact test) even for very small datasets, whereas likelihood ratio tests commonly depend on large-sample properties. In fact, one can speak of a likelihood function for contingency tables, this being Equation 1, conditioned on the number of mutations. For large enough datasets, all three techniques yield similar results.

Common to all the methods is that the probability of mutation is the product of the mutation rate and the time over which mutagenesis has been acting (as long as that product is considerably smaller than one): $p_c = \mu_c \theta_c$. Without knowing the mutagenesis time, only relative rates and times can be estimated. This is why we refer to the mutability rather than the mutation rate. If the data can be reliably assumed to have the same age, the problem is unconstrained and the mutabilities are easily estimated, as above. If, on the other hand, the data come from multiple different experiments or observations differing in the mutagenesis times, $\theta$, then the classification will have two factors: the microsequence context and the time. This is then a constrained system.
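The point that only relative rates are identifiable when the mutagenesis time is unknown can be illustrated with a few lines of Python. The per-class rates in `mu` and the two values of the mutagenesis time are invented for illustration and do not come from any dataset; the class labels and function names are likewise hypothetical.

```python
# Hypothetical per-class mutation rates (arbitrary units); not from the source.
mu = {"AA": 2.0, "GA": 1.0, "TA": 3.5}


def mutation_probabilities(rates, theta):
    """p_c = mu_c * theta, meaningful only while the product is well below one."""
    return {c: rate * theta for c, rate in rates.items()}


def relative_mutabilities(p, reference):
    """Mutability of each class expressed relative to a reference class."""
    return {c: value / p[reference] for c, value in p.items()}


# Two datasets of the same classes observed after different mutagenesis times.
rel_short = relative_mutabilities(mutation_probabilities(mu, 0.001), "GA")
rel_long = relative_mutabilities(mutation_probabilities(mu, 0.004), "GA")

print(rel_short)
print(rel_long)
```

Both dictionaries hold the same ratios up to floating-point rounding, because the common factor of time cancels; this is why a single-age dataset supports estimates of mutability but not of absolute rate.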



Selection is an extrinsic factor that distorts the observed frequency of mutations and therefore biases the estimated mutabilities in a way that depends on the manner in which data are gathered. Mutations that result in disadvantageous phenotypes may not survive to be counted in natural populations. Mutations without phenotypic effect will not be observed if phenotypic change is the basis by which mutations are screened. In order to draw inferences about the intrinsic mutabilities, one has either to ensure that the data have not been distorted by selection or to devise methods to estimate this distortion. Krawczak et al. [6] have devised a method based on the dual classification of each potential mutation according to its intrinsic probability of occurring (our $p_c$ above) and its probability of being observed if it does occur. This, again, is a constrained system; both sets of parameters must be estimated and the appropriate division into classes determined. It is not clear that such a classification of selective effects is effective.

Choosing the ‘best’ model, rather than simply performing a set of pairwise comparisons, is much more difficult. The number of possible models grows very rapidly as the number of informative bases potentially considered increases. These models are not generally nested, so likelihood ratio tests are inappropriate and contingency tables cannot represent the models. More importantly, there is a huge multiple testing problem. We will surely find a model that works if we test enough, but will it be real? A very useful solution to this problem comes from a technique related to Bayesian methods called ‘stochastic complexity’ (SC) [26]. SC provides a global criterion for testing models that penalizes, as usual, large residuals (poor fits under the likelihood) but also the complexity of the model. This criterion is being adapted to the microsequence classification problem (TB Kepler, M Zavolan, LG Cowell, unpublished data).

In addition to analytical techniques, statisticians are increasingly turning to resampling methods [27]. An example of the use of resampling methods for mutation rate estimation and inference can be found in [28].
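To give a flavor of resampling, here is a minimal Python sketch of a percentile bootstrap interval for a single mutability estimate. It is a generic bootstrap in the spirit of [27], not the specific methodology of [28]. The counts 22/1330 are the smallest cell of Table 1; the number of replicates, the seed and the function name are arbitrary choices made for illustration.

```python
import random


def bootstrap_mutability(f, n, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mutability estimate f / n.

    Resampling n sites with replacement from a sample containing f mutated
    sites is equivalent to drawing each site as mutated with probability f / n.
    """
    rng = random.Random(seed)
    p_hat = f / n
    estimates = []
    for _ in range(n_boot):
        f_star = sum(1 for _ in range(n) if rng.random() < p_hat)
        estimates.append(f_star / n)
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return p_hat, lower, upper


# Smallest cell of Table 1: target G with a 5' C neighbor.
p_hat, lower, upper = bootstrap_mutability(22, 1330)
print("estimate:", p_hat, "95% interval:", (lower, upper))
```

The width of the interval conveys how uncertain a mutability estimate is when a class contains few sites, which is exactly the regime where large-sample approximations are least trustworthy.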

How strong is the effect of sequence context on mutability?

Krawczak et al. [6] estimated the effect of sequence context using a data set of mutations responsible for human genetic disease. Aside from the well-known CpG dinucleotide context [29], they found only a subtle effect of the flanking bases on the mutation rates, extending up to two bases in each direction. The positive correlation between duplex stability and relative mutation rate found in this study suggests that neighboring nucleotides may stabilize mismatches, thereby enhancing the mutation rate. Duplex stability is also known to affect the transition:transversion ratio in the plant chloroplast genome [30]. A more profound effect of sequence context on both frameshift and substitution mutation frequency has been reported by Page et al. [31] in mutagen-induced mutagenesis. In this study, the immediate neighbors of the target nucleotide changed the mutation rate up to ten-fold. Broschard et al. [32] also reported dramatic effects of the DNA sequence context on frameshift mutagenesis induced by N-2-acetylaminofluorene in Escherichia coli, and a similar effect has been described in mammalian cells by Shibutani et al. [33]. Sequence-dependent effects on adduct mutagenesis have been discussed in a recent review by Seo et al. [2].

Molecular basis of sequence-dependent effects on mutation rate

Sequence context can, in principle, affect any step of DNA replication and repair. Timsit [4] proposed that an incorrect structure of the template–primer duplex in the active site of the DNA polymerase can induce the incorporation of incorrect nucleotides. Polymerase β seems more permissive for altered template–primer structures than polymerase α, and hence it has a higher error rate. Becherel and Fuchs [34] proposed that sequence context affects the error spectrum during SOS responses. As multiple polymerases seem to be involved in SOS mutagenesis [3], these studies raise the very interesting possibility that regulation of polymerase recruitment at the site of the lesion may affect the mutation spectrum. The implications of these findings for eukaryotic biology are entirely unknown, although it is interesting to speculate that stochastic or induced variations in the level of various DNA polymerases or DNA repair might contribute to cell transformation.

Examples of hypermutable sequence contexts being exploited during adaptive evolution are known: contingency loci of Haemophilus influenzae and Neisseria meningitidis are repetitive sequence elements that are highly prone to mutation and are responsible for phase variation [5,35]. The immunoglobulin genes in mammalian immune systems seem to have evolved a finely tuned codon usage. This enhances the mutability of the regions that encode the antigen-binding site, and reduces the mutability of the so-called ‘framework’ during the somatic hypermutation process [16••,36].

Conclusions

Mutational biases have implications not only for evolutionary theory but also for identifying signatures of environmental mutagens and for quantifying genotype/phenotype correlations. Several lines of research are now converging towards an integrated understanding of the way mutations are introduced, of the effect of mutational biases on evolutionary dynamics, and of the way adaptive evolution might exploit mutational biases. The sources of data that can be used to characterize mutational biases are becoming increasingly diverse and extensive, and it is important to develop quantitative methods that can handle such data appropriately.

References and recommended reading

Papers of particular interest, published within the annual period of review, have been highlighted as:
• of special interest
•• of outstanding interest

1. Yampolsky LY, Stoltzfus A: Bias in the introduction of variation as an orienting factor in evolution. Evol Dev 2001, 3:73-83.

2. Seo KY, Jelinski SA, Loechler EL: Factors that influence the mutagenic patterns of DNA adducts from chemical carcinogens. Mutat Res 2000, 463:215-246.

3. Napolitano R, Janel-Bintz R, Wagner J, Fuchs RPP: All three SOS-inducible DNA polymerases (Pol II, Pol IV, and Pol V) are involved in induced mutagenesis. EMBO J 2000, 19:6259-6265.

4. Timsit Y: DNA structure and polymerase fidelity. J Mol Biol 1999, 293:835-853.

5. Moxon ER, Rainey PB, Nowak MA, Lenski RE: Adaptive evolution of highly mutable loci in pathogenic bacteria. Curr Biol 1994, 4:24-33.

6. Krawczak M, Ball EV, Cooper DN: Neighboring-nucleotide effects on the rates of germline single-base pair substitution in human genes. Am J Hum Genet 1998, 63:474-488.

7•. Rogozin IB, Kondrashov FA, Glazko GV: Use of mutation spectra analysis software. Hum Mutat 2001, 17:83-102.
The authors review a variety of approaches used in the analysis of mutational spectra, in particular hotspot prediction via a Simulation, Expectation, Minimization approach, and comparison of 2 or more mutational spectra via tests of homogeneity in contingency tables. Most of these approaches neglect the effect of selection on mutational spectra. This is an important caveat, as most of the available mutational data comes from sequences that are likely to have been exposed to selection pressures. The mutational hotspots inferred may thus be due either to highly mutable sequence motifs, or to preferential expansion of mutants with high fitness.

8. Keulen W, Back NK, van Wijk A, Boucher CA, Berkhout B: Initial appearance of the 184Ile variant in lamivudine-treated patients is caused by the mutational bias of human immunodeficiency virus type 1 reverse transcriptase. J Virol 1997, 71:3346-3350.

9. Li WH: Molecular Evolution. Sunderland, Massachusetts: Sinauer Associates, Inc.; 1997.

10. Cotton RG, McKusick V, Scriver CR: The HUGO mutation database initiative. Science 1998, 279:10-11.

11. Scriver CR, Nowacki PM, Lehvaslaiho H: Guidelines and recommendations for content, structure, and deployment of mutation databases. Hum Mutat 1999, 13:344-350.

12•. Beroud C, Collod-Beroud G, Boileau C, Soussi T, Junien C: UMD Universal mutation database: a generic software to build and analyze locus-specific databases. Hum Mutat 2000, 15:86-94.
The authors describe a software platform for creating locus-specific mutation databases and for performing preliminary analyses of mutation data. Standardizing the reporting of mutational data is highly desirable, given that the volume of mutation and polymorphism data will continue to grow. It will be equally important to provide easy access to the raw data such that various models might be tested as data becomes available. The methods described in this paper, for example, neglect the effect of selection on mutational spectra.

13. Krawczak M, Ball EV, Fenton I, Stenson PD, Abeysinghe S, Thomas N, Cooper DN: Human gene mutation database - a biomedical information and research resource. Hum Mutat 2000, 15:45-51.

14••. Balajee AS, Bohr VA: Genomic heterogeneity of nucleotide excision repair. Gene 2000, 250:15-30.
This review addresses the relevance of chromatin structure for repair heterogeneity, thus providing a possibly important link between observed regional variations in mutation rate and the mechanisms involved in DNA metabolism.

15. Soussi T, Dehouche K, Beroud C: p53 website and analysis of p53 gene mutations in human cancer: forging a link between epidemiology and carcinogenesis. Hum Mutat 2000, 15:105-113.

16. Hartwig A, Kasper P, Madle S, Speit G, Staedtler F, Sengstag C: The potential use of mutation spectra in cancer related genes in genetic toxicology: a statement of a GUM working group. Mutat Res 2001, 473:263-267.

17••. Oprea M, Cowell LG, Kepler TB: The targeting of somatic hypermutation closely resembles that of meiotic mutation. J Immunol 2001, 166:892-899.
We used a variety of methods including maximum likelihood, contingency tables and a novel method for estimating correlation among binomial parameters to compare the microsequence specificity of somatic hypermutation and of germline mutation (as inferred from comparing pseudogenes with their functional counterpart). We used a three-way classification scheme to account for mutagenesis time, the microsequence and the mutagen. We used data from non-selected sequences to characterize the intrinsic propensity for mutation of various sequence motifs. The estimated rates can be used as a null-model against which hypotheses about selection acting at various sites can be tested.

18. Rogozin IB, Pavlov YI, Bebenek K, Matsuda T, Kunkel TA: Somatic mutation hotspots correlate with DNA polymerase eta spectrum. Nat Immunol 2001, 2:530-536.

19. Lander ES, Linton LM, Birren EB, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W et al.: Initial sequencing and analysis of the human genome. Nature 2001, 409:860-921.

20. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA et al.: The sequence of the human genome. Science 2001, 291:1304-1351.

21. Sherry ST, Ward MH, Khodolov M, Baker J, Phan L, Smigielski EM, Sirotkin K: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001, 29:308-311.

22. Brookes A: HGBASE - a unified human SNP database. Trends Genet 2001, 17:229.

23. Glazko GV, Milanesi L, Rogozin IB: The subclass approach for mutational spectrum analysis: application of the SEM algorithm. J Theor Biol 1998, 192:475-487.

24. Dunson DB, Tindall KR: Bayesian analysis of mutational spectra. Genetics 2000, 156:1411-1418.

25. Khromov-Borisov NN, Rogozin IB, Pegas Henriques JA, de Serres FJ: Similarity pattern analysis in mutational distribution. Mutat Res 1999, 430:55-74.

26. Rissanen J: Stochastic Complexity in Statistical Inquiry. Singapore: World Scientific; 1989.

27. Efron B, Tibshirani RJ: An Introduction to the Bootstrap. New York, NY: Chapman & Hall, Inc; 1993.

28. Oprea M, Kepler TB: Genetic plasticity of V genes under somatic hypermutation: statistical analysis using a new resampling-based methodology. Genome Res 1999, 9:1294-1304.

29. Cooper DN, Youssoufian H: The CpG dinucleotide and human genetic disease. Hum Genet 1988, 78:151-155.

30. Morton BR, Oberholzer VM, Clegg MT: The influence of specific neighboring bases on substitution bias in noncoding regions of the plant chloroplast genome. J Mol Evol 1997, 45:227-231.

31. Page JE, Zajc B, Oh-hara T, Lakshman MK, Sayer JM, Jerina DM, Dipple A: Sequence context profoundly influences the mutagenic potency of trans-opened benzo[a]pyrene 7,8-diol 9,10-epoxide-purine nucleoside adducts in site-specific mutation studies. Biochemistry 1998, 37:9127-9137.

32. Broschard TH, Koffel-Schwartz N, Fuchs RPP: Sequence-dependent modulation of frameshift mutagenesis at NARI-derived mutation hot spots. J Mol Biol 1999, 288:191-199.

33•. Shibutani S, Suzuki N, Tan X, Johnson F, Grollman AP: Influence of the flanking sequence context on the mutagenicity of acetylaminofluorene-derived DNA adducts in mammalian cells. Biochemistry 2001, 40:3717-3722.
Demonstrates that the nature of the nucleotides flanking the acetylaminofluorene-derived DNA adduct strongly influences the mutational frequency in mammalian cells. Similar findings have been reported in bacteria by Broschard et al. [32]. Coupled with structural data and theoretical models, these studies will provide a mechanistic understanding of how mutational hotspots emerge.

34. Becherel OJ, Fuchs RPP: SOS mutagenesis results from up-regulation of translesion synthesis. J Mol Biol 1999, 294:299-306.

35. Bayliss CD, Field D, Moxon ER: The simple sequence contingency loci of Haemophilus influenzae and Neisseria meningitidis. J Clin Invest 2001, 107:657-662.

36. Kepler TB: Codon bias and plasticity in immunoglobulins. Mol Biol Evol 1997, 14:637-643.