Mathematical Biosciences 176 (2002) 237–252 www.elsevier.com/locate/mbs
Statistical and algorithmic methods for fluctuation analysis with SALVADOR as an implementation Qi Zheng
*
Division of Biometry and Risk Assessment, National Center for Toxicological Research, HFT-20, 3900 NCTR Road, Jefferson, AR 72079-9502, USA Received 24 April 2001; received in revised form 19 November 2001; accepted 15 December 2001
Abstract This paper aims at removing certain long-standing impediments to more effective and widespread use of fluctuation analysis. The paper presents a method of constructing confidence intervals for mutation rates using data from fluctuation experiments. The method was inspired by a rediscovery of a little-known, not fully developed method of Lea and Coulson; substantial modifications have been made both to enhance computational efficiency and to widen the scope of the original method’s applicability. A computer package named SALVADOR is presented that can be used for Monte Carlo simulation, for point and interval estimation of mutation rates, and for exploration of various hypotheses spawned by the directed mutation controversy. In addition to the maximum likelihood method, methods of considerable historical interest are also examined and included in SALVADOR to help the reader compare and assess some of the most popular methods for estimating mutation rates. Ó 2002 Elsevier Science Inc. All rights reserved. Keywords: Fluctuation experiment; Directed mutation controversy; Fisher’s information; Estimation of mutation rate
1. Introduction A fundamental tenet of evolutionary biology was repeatedly challenged and defended during the past half a century or so [1,2]. Those defending the tenet believe that mutations are random in the sense that mutations occur regardless of their effects on the survival of the organism; those challenging the tenet hold that mutations can be directed in the sense that mutations are more likely to occur when the environment favors the survival of the resulting mutants. Fluctuation *
Tel.: +1-870 543 7176; fax: +1-870 543 7662. E-mail address:
[email protected] (Q. Zheng).
0025-5564/02/$ - see front matter Ó 2002 Elsevier Science Inc. All rights reserved. PII: S 0 0 2 5 - 5 5 6 4 ( 0 2 ) 0 0 0 8 7 - 1
238
Q. Zheng / Mathematical Biosciences 176 (2002) 237–252
analysis, devised by Luria and Delbr€ uck [3] to solve this controversy, is now a time-honored technique widely used not only for exploring hypotheses originating from the directed mutation controversy, but also for estimating mutation rates [4–8]. Despite a long history of theoretical development and numerous applications, there still exist several impediments to the effective use of fluctuation analysis. One astonishing consequence of these impediments is the fact that, even in recent years, the crude method of means for estimating mutation rates was more widely applied in practice than the superior method of maximum likelihood. As Jones et al. [9] put it, ‘‘the published literature relies mainly on the less efficient but computationally trivial estimators’’. Among those impediments that occasioned this phenomenon is lack of convenient and efficient computer software written specifically for fluctuation analysis. A deeper cause is insufficient mathematical work on certain statistical and algorithmic aspects of fluctuation analysis. The goals of the present paper are to assess existing algorithmic methods and to propose new ones that are essential to widespread use of sound methods in fluctuation analysis. A novel contribution of this paper is a numerical procedure for computing confidence intervals for mutation rates, inspired by a little-known method proposed by Lea and Coulson [10, p. 280] over half a century ago. All these methods have been assembled into a computer package called SALVADOR, written in a blend of the Mathematica and the C language. (The computer code of SALVADOR and related documents are available at http://www.mathsource.com/Content/ Applications/LifeScience/0211-767.)
2. Biological background In this section we briefly review some relevant features of a typical fluctuation experiment, introducing necessary nomenclature. For simplicity we assume that the organism under study is a bacterium. From a single parent culture of non-mutants, a small inoculum consisting of N0 bacteria is taken to seed a liquid culture. A prescribed number (n) of such cultures are prepared, the resultant cultures being usually called sister cultures. Bacteria in each of the n sister cultures are allowed to grow under normal (non-selective) conditions until T time units later when the bacterial population in each sister culture reaches approximately a prescribed size NT ¼ N0 eb1 T . At this point, either a partial or the entire population of each of the n sister cultures is transferred (plated) to a fresh plate of selective medium – a solid culture containing a selective element like phage or penicillin. (Recently there has been a trend of plating the entirety of each sister culture to simplify data analysis [11,12]; we shall here consider only this type of experiment.) Each mutant plated is assumed to grow eventually into a visible clone, while all non-mutants are assumed to be killed immediately after plating. Thus the number of mutant clones at the end of the second growth stage is the same as the number of individual mutants present at the end of the first growth stage. The number of mutant clones on the n plates are counted and recorded as X1 ; . . . ; Xn . As an illustration, consider a fluctuation experiment conducted by Demerec [13]. A saturated broth culture of Staphylococcus aureus was diluted so that the dilution contained about 300 bacteria per ml. From this dilution 0.3 ml of the diluted broth was taken and dispensed into each of 30 small test tubes. (Thus n ¼ 30 and N0 ¼ 0:3 300 ¼ 90.) After the test tubes were incubated
Q. Zheng / Mathematical Biosciences 176 (2002) 237–252
239
at 37 °C for 18 h, 0.7 ml of broth was added into each of the 30 tubes (to reduce the errors due to plating). The contents from each of the 30 tubes were then plated onto a Petri dish with nutrient agar containing 0.064 Oxford units of penicillin per ml. Before plating, 10 tubes were chosen, from each of which 0.05 ml of material was taken to determine bacterium concentrations. Because each tube contained 1 ml of material and the average concentration was found to be 1:9 108 bacteria per ml, we set NT ¼ 1:9 108 . (We ignore the fact that, prior to plating, 10 of the tubes contained 5% less of material than the other tubes.) Colonies of penicillin-resistant mutants on each plate were counted and reported as follows. 33 196 28
18 66 67
839 28 730
47 17 168
13 27 44
126 37 50
48 126 583
80 33 23
9 12 17
71 44 24
As shown by the above example, the following five biological parameters are important in describing a fluctuation experiment: (i) l, mutation rate per cell per unit time; (ii) b1 , non-mutant growth rate; (iii) b2 , mutant growth rate; (iv) N0 , the number of non-mutants used to seed a sister culture; and (v) NT , the number of non-mutants present in a sister culture immediately before plating. In practice, the following composite parameters are widely used: l l m ¼ ðNT N0 Þ NT ; b1 b1 l m m ; lb ¼ ¼ b1 NT N0 NT ð1Þ b2 q¼ ; b1 N0 / ¼ 1 eb1 T ¼ 1 : NT The parameter m is the expected number of mutations (as contrasted with mutants) per sister culture. The parameter lb is usually called mutation rate per cell division, although this term is usually reserved for the related quantity lb log 2. The parameter q is due to Mandelbrot [14] who was the first to notice the important role of that parameter. The parameter / was historically set to unit to simplify the distribution, but the resultant approximating distribution often possesses divergent moments (see e.g., Ref. [15]). When exact distributions are desirable, the parameter / is indispensable. 3. Mathematical considerations Three categories of numerical procedures are important in fluctuation analysis. The first category is for Monte Carlo simulation. The second category contains convenient tools for computing the distributions of the number of mutants, which we shall call loosely mutant distributions. Mutant distributions can be used to explore various hypothesis spawned by the directed mutation controversy. The third category includes statistical procedures intended for those who carry out fluctuation experiments mainly to estimate mutation rates.
240
Q. Zheng / Mathematical Biosciences 176 (2002) 237–252
3.1. Simulation Simulation of the experimental data X1 ; . . . ; Xn can serve two purposes. It enables the biologist to explore competing hypotheses by changing values of the parameters of interest, and it permits the theorist to assess both old and new methods of estimating mutation rates by analyzing carefully simulated data. For reasons of computational efficiency, SALVADOR dose not directly adopt the generic method of inverse transform to simulate number of mutants. SALVADOR uses an intuitively straightforward approach comprising the following three steps. First, compute the mean number of mutations m from Eq. (1) and the total growth time T by T ¼ b1 1 logðNT =N0 Þ. Next, simulate the epochs at which mutations occur. This step amounts to simulating a nonhomogeneous Poisson process having intensity lN0 eb1 t for t 2 ½0; T Þ, which can be simulated in three ways [16, p. 523]. SALVADOR adopts the method based on the conditional distribution of the arrival times, for this method seems best suited to our present purposes. This step can further ~ having mean m; (2) generate m ~ be divided into two parts: (1) generate a Poisson random number m random numbers T1 ; . . . ; Tm~ , each of which has the distribution function eb1 t 1 ð0 6 t 6 T Þ: eb1 T 1 b1 T Because the inverse F 1 ðuÞ ¼ b1 1Þ þ 1Þ is easy to compute, the Ti are generated by 1 logðuðe ~ independent random using the inverse transformation method [16, p. 498]. Finally, generate m G has the geometric distribution having parameter eb2 ðT Ti Þ , and numbers G1 ; . . . ; Gm~ such that Pm~ i return the random integer k¼1 ðGk þ 1Þ as the number of mutants in a sister culture. Note that a unit must be added to each Gi because the size of a mutant clone at time s that was spawned by a 0 mutation occurring at time s0 < s has a shifted geometric distribution having parameter eb2 ðss Þ [16, Eq. (6.15)]. (To avoid confusion, we call X a geometric random variable having parameter p if E½zX ¼ p=ð1 qzÞ; we call X þ 1 a shifted geometric random variable having the same parameter p.) F ðtÞ ¼
3.2. Computation of mutant distributions The distributional form, particularly the upper tail behavior, of the number of mutants was at the heart of the directed mutation controversy; SALVADOR provides functions for computing the probability mass functions (p.m.f.s) of four mutant distributions that were intimately interwoven with the directed mutation controversy. Distributions peculiar to fluctuation analysis fall into the category of so-called Poisson P stoppedk sum distributions. Specifically, the probability generating function (p.g.f.) GðzÞ ¼ 1 k¼0 pk z of such a distribution is of the form (see, e.g., [15,17]) Z T b1 s ðgðz; T sÞ 1ÞlN0 e ds ; ð2Þ GðzÞ ¼ exp P1
0
where gðz; tÞ ¼ k¼0 ak ðtÞzk is often the p.g.f. of a birth-and-death type stochastic process. From a computational point of view, it is convenient to recast the above expression into ) ( 1 1 X X ð3Þ pk zk ¼ exp m þ q0 þ qk zk k¼0
k¼1
Q. Zheng / Mathematical Biosciences 176 (2002) 237–252
241
with qk ¼ lN0 e
b1 T
Z
T
ak ðsÞeb1 s ds
ðk P 0Þ:
ð4Þ
0
For example, if death of the organism is to be considered, the dynamics of the mutants can be modeled by a simple birth-and-death process having birth and death rates b2 and d2 (we assume b2 > d2 for simplicity). Then the ak ðtÞ are known to be [18, p. 94] a0 ðtÞ ¼ d2 AðtÞ; k1 ak ðtÞ ¼ bk1 2 ð1 d2 AðtÞÞð1 b2 AðtÞÞAðtÞ
ðk P 1Þ;
ð5Þ
with AðtÞ ¼
eðb2 d2 Þt 1 : b2 eðb2 d2 Þt d2
Moreover, the dynamics of the non-mutants can be modeled by NðtÞ ¼ N0 eðb1 d1 Þt with d1 denoting the death rate of non-mutants. Therefore, by replacing b1 with b1 d1 , we can use Eqs. (4) and (5) to compute qk . Although there is an efficient algorithm for computing fpk g from fqk g [15, Lemma 2], computing qk from ak ðÞ for large k by directly using Eq. (4) can be liable to numerical difficulties, as can be easily seen from the above example. Hence essential to computing p.m.f.’s in fluctuation analysis is to find special cases where algorithms can be devised to simplify the process of computing fqk g from fak ðÞg. SALVADOR treats two such special cases. In the first case we assume that the p.g.f. gðz; tÞ in Eq. (2) is that of a Yule process having birth rate b1 . This assumption implies that non-mutants and mutants grow at the same rate (that is, b1 ¼ b2 ). The mutant distribution as determined by Eq. (2) under this assumption has the p.g.f. m 1 Gðz; m; /Þ ¼ exp 1 logð1 /zÞ : ð6Þ / z Therefore, the fqk g sequence appearing in Eq. (3) is given by q0 ¼ 0; k1
qk ¼ m/
1 / k kþ1
ðk P 1Þ:
ð7Þ
An approximate form of this distribution (obtained by setting / ¼ 1) was first derived by Lea and Coulson [10], and is now widely known as the Luria–Delbr€ uck distribution. We shall designate the exact distribution by LDðm; /Þ, the approximate distribution either by LDðmÞ or by LDðm; 1Þ. In view of Eq. (7), the p.m.f. of an LDðm; /Þ distribution can be computed using the recursive algorithm [19,20] p0 ¼ em ; k mX j/ pk ¼ /j1 1 pkj k j¼1 jþ1
ðk P 1Þ:
ð8Þ
242
Q. Zheng / Mathematical Biosciences 176 (2002) 237–252
The second case differs from the first one only in that b2 is allowed to be different from b1 . The fqk g sequence appearing in Eq. (3) for this case assumes the form Z T qk ¼ leb2 T ð1 eb2 ðT sÞ Þk1 eðb1 þb2 Þs ds ðk P 1Þ ð9Þ 0
with q0 ¼ 0 [15, Eq. (47)]. By introducing the new variable of integration y ¼ 1 eb2 ðT sÞ we recast Eq. (9) into the form m 1 ð10Þ Bx k; 1 þ with x ¼ 1 ð1 /Þq : qk ¼ q/ q Therefore, the fqk g sequence canR be computed using standard numerical methods for the inz complete beta function Bz ða; bÞ ¼ 0 ta1 ð1 tÞb1 dt. Like in the previous case, setting / ¼ 1 leads to a significantly simplified approximate distribution. (We naturally define x ¼ 1 when / ¼ 1.) For this approximate distribution, the fqk g sequence can be computed recursively: m ; q1 ¼ 1þq k1 qk ¼ qk1 ðk P 2Þ: k þ 1=q This recursive relation, first noted by Stewart et al. [21], is used by SALVADOR in combination with Lemma 2 in Ref. [15] to compute the p.m.f. of the approximate distribution. It should be noted that the mutant distribution as determined by the fqk g sequence given in Eq. (10) can be traced back to Mandelbrot [14]. As stated in Lemma 3 of Ref. [15], the p.g.f. of a mutant distribution is expressible by GðzÞ ¼ exp fmðhðzÞ 1Þg;
ð11Þ
where hðzÞ is the p.g.f. of Y, the size of a so-called randomly chosen mutant clone. Thus, knowing the distribution of Y is equivalent to knowing the mutant distribution. Because Z x 1 1 X m X 1 j m zð1 uÞ1=q j qj z ¼ Bx j; 1 þ z ¼ du; q/ j¼1 q q/ 0 1 zu j¼1 it is clear from Eqs. (3) and (11) that Z x 1=q 1 zð1 uÞ hðzÞ ¼ du: q/ 0 1 zu
ð12Þ
Without giving a derivation, Mandelbrot [14, p. 439] offered what he called the generating function of Y in the form of Z expðb2 T Þ 1ð1lÞ=b2 1l v T ð1lÞ 1 ^ dv: ð13Þ ½1 e Y ðb; l; b2 Þ ¼ b2 vðeb 1Þ þ 1 1 1 (The exponent 1 appearing in ½1 eT ð1lÞ is added by this author to correct an obvious misprint.) Upon introducing v ¼ 1=ð1 uÞ in Eq. (13) we obtain Z 1expðb2 T Þ ð1lÞ=b2 1l ð1 uÞ T ð1lÞ 1 ^ Y ðb; l; b2 Þ ¼ ½1 e du: ð14Þ b2 eb u 0
Q. Zheng / Mathematical Biosciences 176 (2002) 237–252
243
It will help the understanding of Eqs. (13) and (14) to recall two simple historical facts. First, like Luria and Delbr€ uck [3, p. 495], Mandelbrot adopted a time scale that allowed him to assume b1 ¼ 1. Second, Mandelbrot strictly followed Formulation A in modeling the occurrence of mutations [15, p. 4]; Eqs. (13) and (14) can be made valid for Formulation B by replacing 1 l with 1 in both equations. Assuming b1 ¼ 1, we now compare Eq. (12) with Eq. (14) to obtain hðeb Þ ¼ Y^ ðb; 0; b2 Þ: We thus make the observation that Mandelbrot has in effect given a Laplace-type transform of the distribution of Y (historically it can also be called the moment generating function of Y). As the mutant distribution is uniquely determined by the distribution of Y through Eq. (11), it appears appropriate to acknowledge the origin of the mutant distribution by naming it the M distribution. Furthermore, we shall designate the exact form of the mutant distribution by Mðm; q; /Þ, and the approximate form of the mutant distribution by either Mðm; q; 1Þ or Mðm; qÞ. It is also noteworthy that Koch [22] made several important contributions to the popularity of this distribution. For example, inspired by an algorithm of Lea and Coulson [10], Koch devised the following recursive algorithm to compute the p.m.f. of an Mðm; q; 1Þ distribution: pk ¼ em
k X j¼1
Cj;k
mj j!
ðk P 1Þ;
ð15Þ
with j ðk 1Þq Cj1;k1 þ Cj;k1 ; j þ kq j þ kq 1 ¼ ; ð1 þ qÞk ðk 1Þq C1;k1 : ¼ 1 þ kq
Cj;k ¼ Ck;k C1;k
Although this algorithm is inexact mathematically, the approach is satisfactory for cases where both eb1 T and eb2 T are negligible. Due to a bug in programming, Fig. 2 in Ref. [15] displayed an exaggerated discrepancy between the Koch algorithm and an exact method. Note in passing that the Poisson distribution played the role of a benchmark in the directed mutation controversy. Moreover, the Poisson distribution can be used to generate new mutant distributions. For example, the convolution of a Poisson distribution and a Luria–Delbr€ uck distribution yields the distribution that Cairns et al. [4] used to model their experimental data. Both distributions are included in SALVADOR for the user’s convenience. 3.3. Estimation of mutation rates Estimation of the composite parameter lb ¼ l=b1 from the experimental data X1 ; . . . ; Xn has been of great interest since fluctuation analysis was first proposed [3]. Following convention, we concentrate on estimating m, for an estimate of lb can be obtained from that of m by using Eq. (1). We consider only the maximum likelihood method in this subsection.
244
Q. Zheng / Mathematical Biosciences 176 (2002) 237–252
The speed of computing a point estimate of lb depends crucially on the speed of computing the p.m.f. of a mutant distribution. Take the case q ¼ 1, for example. The essence of the maximum likelihood method is to find that value of m that maximizes the log likelihood function n X log pðXi ; m; /Þ; ð16Þ lðm; /; X1 ; . . . ; Xn Þ ¼ i¼1
where pðk; m; /Þ denotes the probability of having k mutants according to an LDðm; /Þ distribution. Note that / ¼ 1 N0 =NT is considered as a known quantity. Let U ¼ max1 6 i 6 n Xi . For fixed m and /, pð0; m; /Þ; . . . ; pðU ; m; /Þ must all be computed so that the log likelihood function in Eq. (16) can be formed by picking up relevant terms. Due to the heavy upper tail of the distribution of Xi , U is likely to be large, e.g., U ¼ 839 in the example introduced in Section 2. Moreover, as Eq. (8) indicates, the cost of computing pðU ; m; /Þ increases rapidly with U. Because the log likelihood function must be computed repeatedly, computation of p.m.f.s can easily bottleneck our efforts to search for the maximum likelihood estimate (MLE) of m. An effective strategy used by SALVADOR to alleviate this problem was to code the computation of p.m.f.s in the C language wherever possible, allocating Mathematica to perform the optimization. A long-standing obstacle in efficiently using data from fluctuation experiments is lack of methods for interval estimation of lb . The issue of computing confidence intervals for lb (or equivalently for m) had received surprisingly little attention until 1994 when two pertinent papers appeared in Genetics. In the first paper Jones et al. [9] presented Fisher’s information for m when each Xi was transformed to either 0 or 1 according as Xi ¼ 0 or Xi > 0. In the second paper Stewart [23] proposed constructing confidence intervals using graphs that he produced by a simulation study. The method of Jones et al. does not make full use of experimental data. The graphical method of Stewart appears to be accurate for practical purposes, but the method was not convenient to use and was limited to cases where q ¼ 1 and / ¼ 1. A method that circumvents these shortcomings is clearly desirable. Ironically, a clever yet not fully developed numerical method of computing confidence intervals for m has lain fallow since 1949 when Lea and Coulson [10, p. 280] first proposed it. The method of Lea and Coulson, intended originally for the LDðm; 1Þ distribution, can be extended to more general situations, and can also be modified to improve computational efficiency. We first show how to extend the method to handle the Mðm; q; 1Þ distribution in an instructive way akin to the original approach. Let pk ðm; qÞ denote the probabilities as defined by Eq. (15). The Fisher information for m is defined by 2 o log px ðm; qÞ ; IðmÞ ¼ Ex om where Ex indicates mathematical expectation with respect to x. According to the asymptotic ^ is an MLE of m, then Var½m ^ 1=ðnIðm ^ ÞÞ. theory of the maximum likelihood method, if m Therefore, an approximate confidence interval for lb with confidence coefficient 1 a is za=2 ^ m pffiffiffiffiffiffiffiffiffiffiffiffi ; NT NT nIðm ^Þ where za denotes the 100ath percentile point of a standard normal distribution. Differentiating Eq. (15) yields
Q. Zheng / Mathematical Biosciences 176 (2002) 237–252
opk ðm; qÞ ¼ pk ðm; qÞ þ tk ðm; qÞ om
245
ðk P 1Þ;
with tk ðm; qÞ ¼ em
k X j¼1
Cj;k
mj1 : ðj 1Þ!
By definition, Fisher’s information for m is 1 X ðtj ðm; qÞ pj ðm; qÞÞ2 IðmÞ ¼ ; pj ðm; qÞ j¼0
ð17Þ
where we define t0 ðm; qÞ ¼ 0. Moreover, we can extend the method to compute Fisher’s information for q. By differentiating Ci;j with respect to q, we emerge with a new set of auxiliary constants Dk;k ¼ D1;k Dj;k
k
; ð1 þ qÞkþ1 k1 ðk 1Þq ¼ C þ D1;k1 ; 2 1;k1 1 þ kq ð1 þ kqÞ kj j ðk 1Þj ðk 1Þq ¼ C þ C þ Dj1;k1 þ Dj;k1 : 2 j1;k1 2 j;k1 j þ kq j þ kq ðj þ kqÞ ðj þ kqÞ
ð18Þ
Defining sk ðm; qÞ ¼
k X opk ðm; qÞ mj ¼ em Dj;k ; oq j! j¼1
with s0 ðm; qÞ ¼ 0, we get 2 X 2 1 o log px ðm; qÞ sk ðm; qÞ ¼ IðqÞ ¼ Ex : oq pk ðm; qÞ k¼1
ð19Þ
ð20Þ
We now propose a more substantial modification that can enhance computing speed and reduce computer memory requirement. For ease of exposition, we illustrate the basic idea by treating P1 the LDðm;k /Þ distribution. Consider the p.g.f. of an LDðm; /Þ distribution, Gðz; m; /Þ ¼ k¼0 pk ðm; /Þz , as given in Eq. (6). Differentiating Gðz; m; /Þ with respect to m gives 1 X opk ðm; /Þ k ð21Þ z ¼ /1 ðz1 1Þ logð1 /zÞGðz; m; /Þ: om k¼0 P k We can expand /1 ðz1 1Þ logð1 /zÞ in the form 1 k¼0 lk ð/Þz with 1 k ¼ 0; lk ð/Þ ¼ k1 / f1=k /=ðk þ 1Þg k P 1: Define a new sequence ffk g by ffk ðm; /Þg ¼ flk ð/Þg P fpk ðm; /Þg, where the asterisk sign ‘’ denotes the convolution product. In other words, fk ¼ kj¼0 lj pkj . Equating coefficients of zk on both sides of Eq. (21) yields
246
Q. Zheng / Mathematical Biosciences 176 (2002) 237–252
opk ðm; /Þ ¼ fk ðm; /Þ; om from which it follows that Fisher’s information for m can be computed by 1 X fk2 ðm; /Þ : IðmÞ ¼ pk ðm; /Þ k¼0
ð22Þ
In practice, P we can retain only a finite number of terms in Eq. (22), that is, we compute Ik ðmÞ ¼ kj¼0 fj2 =pj as an approximate value of IðmÞ. Note that this modified method of computing Fisher’s information is in principle applicable to any Poisson stopped-sum distributions. For our present purposes, the modification has two obvious advantages. First, it permits use of the exact form of the Luria–Delbr€ uck distribution in computing IðmÞ. Second, it obviates the necessity of computing and storing a large number of auxiliary constants. Now we are in a position to treat the more general case where the number of mutants has an M distribution. It suffices to find the sequence flk g. Note that the p.g.f. of an Mðm; q; /Þ distribution can be written as ( ) 1 m X 1 k ð23Þ Gðz; m; q; /Þ ¼ exp m þ Bx k; 1 þ z : q/ k¼1 q Differentiating this identity with respect to m yields oGðz; m; q; /Þ ¼ om
! 1 1 X k Bx ðk; 1 þ 1=qÞz Gðz; m; q; /Þ: 1þ q/ k¼1
Therefore, the flk g sequence for computing IðmÞ is l0 ¼ 1; 1 Bx ðk; 1 þ 1=qÞ ðk P 1Þ: lk ¼ q/ We can significantly enhance the speed of computing Ik ðmÞ for the useful special case / ¼ 1. First, we compute flj g recursively by setting l0 ¼ 1, l1 ¼ 1=ð1 þ qÞ, and j1 lj1 ð2 6 j 6 kÞ: lj ¼ j þ 1=q Next, we compute fpj g recursively by setting p0 ¼ em and j mX pj ¼ ili pji ð1 6 j 6 kÞ: j i¼1
P Finally, we compute ffj g by convoluting flj g and fpj g to obtain Ik ðmÞ ¼ kj¼0 fj2 =pj . In the same manner we can deduce the flk g sequence needed for computing IðqÞ, but the expressions are unwieldy. We therefore concentrate on the useful special case where / ¼ 1 (and hence x ¼ 1). Differentiating Eq. (23) with respect to q gives l0 ¼ 0; lk ¼ q2 mBðk; 1 þ 1=qÞ½1 þ q1 ðwðk þ 1 þ 1=qÞ wð1 þ 1=qÞÞ ðk P 1Þ;
Q. Zheng / Mathematical Biosciences 176 (2002) 237–252
247
where w denotes the psi (digamma) function wðzÞ ¼ C0 ðzÞ=CðzÞ. Because wðx þ 1Þ ¼ wðxÞ þ 1=x, we have k X 1 : wðk þ 1 þ 1=qÞ wð1 þ 1=qÞ ¼ j þ 1=q j¼1 On the other hand, we also have k Y 1 j1 : Bðk; 1 þ 1=qÞ ¼ 1 þ 1=q j¼2 j þ 1=q These two relations suggest a simple recursive algorithm for computing the flk g sequence. First, let l1 ¼ ða1 =q 1Þb1 with a1 ¼ 1=ð1 þ 1=qÞ and b1 ¼ mq2 =ð1 þ 1=qÞ. Next, we update ak , bk and lk for k P 2 as follows: 1 ; ak ¼ ak1 þ k þ 1=q k1 bk ¼ bk1 ; k þ 1=q lk ¼ ðak =q 1Þbk : Thus, for / ¼ 1 we have two methods of computing Fisher’s information. The first method requires a large number of auxiliary constants, but the computation is simple. So the first method may be practical for computing Ik with moderate k. If k is large or computer memory is limited, the second method will outperform the first one. Moreover, the second method permits the exact form of an M distribution to be used. Both methods are included in SALVADOR; results produced by the two methods matched perfectly. ^ ¼ 5:6, NT ¼ As an illustration, consider a numerical example given in Ref. [23], where m ^ Þ ¼ 0:0672134, I30;000 ðm ^Þ ¼ 6:3 108 and n ¼ 32. Using SALVADOR, we found that I5000 ðm ^ ð m Þ ¼ 0:0672471. It therefore appears that keeping 5000 terms in the ex0:0672441 and I60;000 P 2 ^ ^ f =p yields an approximation of Ið m Þ for m ¼ 5:6, which is sufficiently pression IðmÞ ¼ 1 j j¼0 j accurate for practical purposes. Simple arithmetic calculation leads to a 95% confidence interval for lb in the form 6:768 109 < lb < 1:101 108 , which is remarkably close to Stewart’s result 6:8 109 < lb < 1:12 108 , as given in Eq. (6) of Ref. [23]. In the above example we determined the number of terms needed in the series for IðmÞ by comparing Ik ðmÞ with IkþDk ðmÞ. It would be useful to find out the decay rates of fk2 =pk . P 2 Proposition 1. If the sequence ffk g is as defined in the expression IðmÞ ¼ 1 j¼0 fj =pj and is constructed according to either an LDðm; 1Þ distribution or an Mðm; q; 1Þ distribution, then fk 1 : pk m
ð24Þ
Note we use the notation ak bk to signify ak =bk ! 1 as k ! 1. If ffk g is constructed according to an LDðm; 1Þ distribution, a deeper result can be established in a simple and instructive manner. Observe first that
248
Q. Zheng / Mathematical Biosciences 176 (2002) 237–252
1 1 exp m 1 logð1 zÞ ¼ 1 þ m 1 logð1 zÞ þ RðzÞ; z z where RðzÞ ¼ Oðð1=z 1Þ2 log2 ð1 zÞÞ as z ! 1. Therefore, in view of Eq. (21), 2 1 X 1 1 j fj z ¼ 1 logð1 zÞ þ m 1 log2 ð1 zÞ þ R ðzÞ: z z j¼0
ð25Þ
Let the notation ½zk f ðzÞ denote the coefficient of zk in the Taylor expansion of f ðzÞ. It can be shown that 2 1 4Hk1 6 k ½z 1 log2 ð1 zÞ ¼ ; ð26Þ z kðk þ 1Þðk þ 2Þ where Hk is the kth harmonic number. Observe that R ðzÞ satisfies the O-type transfer condition of Theorem 2 in Ref. [24], e.g., R ðzÞ ¼ Oðð1 zÞ3 log3 ð1 zÞÞ as z ! 1; accordingly we can infer that 3 log k k : ð27Þ ½z R ðzÞ ¼ O k4 In view of the well-known relations Hk ¼ log k þ c þ Oðk 1 Þ (where c is Euler’s constant) and ½zk ðz1 1Þ logð1 zÞ ¼ 1=kðk þ 1Þ for k > 1, it follows from Eqs. (25)–(27) that 3 1 mð4 logðk 1Þ þ 4c 6Þ log k : ð28Þ þ þO fk ¼ kðk þ 1Þ kðk þ 1Þðk þ 2Þ k4 Because pk m=k 2 (e.g., [25]), the proposition is an immediate consequence of Eq. P1(28).2 Furthermore, Eq. (24) implies that fk2 =pk m1 =k 2 , which confirms that the series k¼0 fk =pk is summable. It is worth noting that the main term of fk is independent of m; Eq. (28) gives the simplest asymptotic form of fk that involves m. The validity of Eq. (24) for arbitrary q > 0 has been proved by Pakes [29]. To use the propan estimate for osition to find an estimate for kk2 =pk , we first apply Theorem 3.1 in Ref. [26] to get P k pk . We set ak ¼ kmq1 Bðk; 1 þ 1=qÞ with a0 ¼ 0 and bk ¼ em pk , which satisfy kbk ¼ j¼0 aj bkj [15, Lemma 2]. Stirling’s formula gives m Cð1 þ 1=qÞ ; ak q k 1=q which is regularly varying of index 1=q < 0. It follows from Theorem 3.1 in Ref. [26] that bk em mq1 Cð1 þ 1=qÞ=k 1þ1=q . Therefore, pk
m Cð1 þ 1=qÞ : q k 1þ1=q
ð29Þ
Combining Eqs. (24) and (29) yields fk2 Cð1 þ 1=qÞ 1 : mq k 1þ1=q pk Note that Eq. (29) was first given by Pakes [25, p. 993] with a concise clue to its proof; this expression confirms the intuitively obvious claim made by Lenski et al. [5] that the upper tail of an
Q. Zheng / Mathematical Biosciences 176 (2002) 237–252
249
M distribution is thinner than that of a jackpot (Luria–Delbr€ uck) distribution when mutants grow more slowly than non-mutants (that is, when q < 1). 4. Historical considerations Due to historical reasons, several alternative methods are still popular. These methods deal with cases where both q ¼ 1 and / ¼ 1 are assumed. Although none of these methods is as efficient as the maximum likelihood method, some occupy a prominent place in the literature and occasionally caused considerable controversy [23] over their merits. SALVADOR includes four of these methods, allowing the user to assess them. ^ ¼ log p^0 , where p^0 is the proportion The first method, the method of P0 [3], estimates m by m of sister cultures containing no mutants. This method is inapplicable to cases where mutants appear in all sister cultures. At the crux of the second method, the method of means [3], lies the concept of a so-called likely average which we denote by M . The likely average is defined by M ¼ ðT t0 ÞlN0 eb1 T where t0 is the epoch at which the expected number of mutations is exactly 1 (and hence t0 satisfies nlN0 ðeb1 t0 1Þ=b1 ¼ 1).PHistorically, M was introduced to remedy the poor behavior of the ^ ¼ n1 n Xi as an estimator of the theoretical mean number of mutants. The sample mean M i¼1 concept of a likely average appears to have no theoretical basis; M does not seem to be an ef^ , which was caused by the heavy upper tail of the fective remedy for the poor behavior of M ^ , one can argue from the definition of mutant distribution. Nevertheless, by equating M with M M that the quantity w ¼ nlNT =b1 approximately satisfies ^ ¼ 0: w log w nM ð30Þ ^¼w ~ =n, where w ~ is the unique positive root of Eq. (30). The root An estimator of m is therefore m ~ can be computed using Newton’s method [27, p. 160]. Specifically, with an arbitrary w0 > 0, the w sequence wk defined by ^ wk log wk nM wkþ1 ¼ wk ; log wk þ 1 ~ . Another overlooked drawback of this method is that m ^ P 1=n regardless of the converges to w actual experimental data. The truth of this claim can be seen from Eq. (30) in connection with the fact that w log w < 0 for w 2 ð0; 1Þ. The third method, the method of medians, sprang from a remarkably keen observation made by Lea and Coulson [10] that the transformed random variables Yi ¼
11:6 2:02; Xi =m log m þ 4:5
ð31Þ
resembled a standard normal random variable. Because this transform is a monotone decreasing function of Xi , Y1 ; . . . ; Yn will be in descending order of magnitude if X1 ; . . . ; Xn are arranged in ascending order of magnitude. In particular, the median of X1 ; . . . ; Xn , denoted by n^0:5 , is transformed into the median of Y1 ; . . . ; Yn . In light of the approximate normality of Yi , the median of Yi is likely to be close to zero. It thus follows from Eq. (31) that the median of X1 ; . . . ; Xn approximately satisfies
250
Q. Zheng / Mathematical Biosciences 176 (2002) 237–252
n^0:5 log m 1:24 ¼ 0: m By introducing w ¼ n^0:5 =m Lea and Coulson transformed Eq. (32) into w þ log w ðlog n^0:5 þ 1:24Þ ¼ 0:
ð32Þ
ð33Þ
~. The root w ~ can be computed by ~ is a root of Eq. (33), then an estimate of m is m ^ ¼ n^0:5 =w If w solving Eq. (33) using Newton’s method. PnThe last method was perhaps inspired by the principle of maximum likelihood. Note that i¼1 Yi has an approximate N ð0; nÞ distribution. Lea Pn and Coulson [10] suggested estimating m by maximizing the probability density function of i¼1 Yi . In other words, an estimate of m was P simply a root of ni¼1 Yi ¼ 0, obtainable by numerically solving n X i¼1
11:6 2:02n ¼ 0: Xi =m log m þ 4:5
ð34Þ
5. A numerical example In this section we use SALVADOR to analyze the real-world data introduced in Section 2. Point estimates of m were found to be 18.94 (by the method of means), 11.85 (by the method of medians),
Fig. 1. Different distributions were fitted to Demerec’s [13] experimental data: the solid line is the result of fitting an LDðmÞ distribution to the data (m ¼ 10:84); the dotted line is the result of fitting an Mðm; qÞ distribution to the data (m ¼ 9:85 and q ¼ 1:12); the dashing line depicts an LDð11:85Þ distribution. The first two distributions were found by the method of maximum likelihood, while the third was found by the method of means. The assumption / ¼ 1 was adopted in all three cases.
Q. Zheng / Mathematical Biosciences 176 (2002) 237–252
251
11.55 (by solving Eq. (34)) and 10.84 (by the maximum likelihood method using an LDðm; 1Þ distribution). Therefore, a maximum likelihood estimate of the mutation rate is l^b ¼ 5:7 108 . ^ Þ 0:025, and hence the asymptotic standard deviation of m ^ is 1=ð30 0:025Þ1=2 ¼ We found Iðm 1:15. So an approximate 95% confidence interval for the mutation rate is lb 2 ð4:52 108 ; 6:89 108 Þ. Moreover, we can allow for differential growth rates by fitting an Mðm; q; 1Þ distri^ ¼ 9:85 and q^ ¼ 1:12. bution to the data. SALVADOR yields maximum likelihood estimates of m 8 Therefore, a point estimate of the mutation rate is l^b ¼ 5:19 10 . SALVADOR also finds ^ Þ ¼ 0:02657, which leads to an approximate 95% confidence interval in the form I15;000 ðm lb 2 ð4:03 108 ; 6:34 108 Þ. Note that the method of means gives an estimate of m appreciably larger than all other estimates. Fig. 1 displays the original data and three fitted distributions. 6. Concluding remarks In this paper we addressed some statistical and algorithmic issues in fluctuation analysis in hopes of making statistically sound and computationally efficient methods accessible to a wide range of researchers. With the new procedures developed here, and with the computer package SALVADOR that implemented almost all available numerical procedures in fluctuation analysis, researchers can now either analyze their experimental data or further explore competing hypotheses with relative ease. For example, on a typical computer of today (having a CPU of 500 MHz or above and at least 256 MB of RAM), SALVADOR can compute the first 3000 probabilities of a commonly used mutant distribution within a second. SALVADOR can fit either an Mðm; q; 1Þ or an LDðm; /Þ distribution to the type of data that have so far been published within half a minute, usually much less. SALVADOR not only selects the best available algorithm for a particular problem, but it also adopts a mixture of two computer languages to achieve simultaneously efficiency and flexibility. Specifically, the Mathematica language was chosen to carry out advanced computations (e.g., optimization) and to provide a convenient user interface; the C language was chosen to code simple but highly repetitive arithmetic operations. Through MathLink [28, Section 2.12], all C programs communicate with Mathematica, and the user communicates only with the versatile frontend of Mathematica. Note finally that the methods presented here can serve as a useful guideline should the need arise for implementations in other computer languages. Acknowledgements I tender my cordial thanks to two anonymous referees, one of whom made very detailed and useful comments. This paper was presented in part at a conference of the Society for Mathematical Biology held in Hilo, HI, USA, on July 16–19, 2001. References [1] R.E. Lenski, J.E. Mittler, The directed mutation controversy and neo-Darwinism, Science 259 (1993) 183. [2] P.D. Sniegowski, R.E. Lenski, Mutation and adaptation: the directed mutation controversy in evolutionary perspective, Ann. Rev. Ecol. Syst. 26 (1995) 553.
252
Q. Zheng / Mathematical Biosciences 176 (2002) 237–252
[3] S.E. Luria, M. Delbr€ uck, Mutations of bacteria from virus sensitivity to virus resistance, Genetics 28 (1943) 491. [4] J. Cairns, J. Overbaugh, S. Miller, The origin of mutants, Nature 335 (1988) 142. [5] R.E. Lenski, M. Slatkin, F.J. Ayala, Mutation and selection in bacterial populations: alternatives to the hypothesis of directed mutation, Proc. Nat. Acad. Sci. USA 86 (1989) 2775. [6] T.D. Tlsty, B.H. Margolin, K. Lum, Differences in the rates of gene amplification in non-tumorigenic and tumorigenic cell lines measured by Luria–Delbr€ uck fluctuation analysis, Proc. Nat. Acad. Sci. USA 86 (1989) 9441. [7] P.D. Sniegowski, P.J. Gerrish, R.E. Lenski, Evolution of high mutation rates in experimental populations of E. coli, Nature 387 (1997) 703. [8] C. Zeyl, J.A.G.M. DeVisser, Estimates of the rate and distribution of fitness effects of spontaneous mutation in Saccharomyces cerevisiae, Genetics 157 (2001) 53. [9] M.E. Jones, S.M. Thomas, A. Rogers, Luria–Delbr€ uck fluctuation experiments: design and analysis, Genetics 136 (1994) 1209. [10] D.E. Lea, C.A. Coulson, The distribution of the numbers of mutants in bacterial populations, J. Genetics 49 (1949) 264. [11] P.L. Foster, J. Cairns, The occurrence of heritable Mu excisions in starving cells of Escherichia coli, EMBO J. 13 (1994) 5240. [12] B.G. Hall, Adaptive evolution that requires multiple spontaneous mutations. I. Mutations involving an insertion sequence, Genetics 120 (1988) 887. [13] M. Demerec, Production of Staphylococcus strains resistant to various concentrations of penicillin, Proc. Nat. Acad. Sci. USA 31 (1945) 16. [14] B. Mandelbrot, A population birth-and-mutation process. I. Explicit distributions for the number of mutants in an old culture of bacteria, J. Appl. Prob. 11 (1974) 437. [15] Q. Zheng, Progress of a half century in the study of the Luria–Delbr€ uck distribution, Math. Biosci. 162 (1999) 1. [16] S.M. Ross, Introduction to Probability Models, 5th Ed., Academic Press, Boston, MA, 1993. [17] K.S. Crump, D.G. Hoel, Mathematical models for estimating mutation rates in cell populations, Biometrika 61 (1974) 237. [18] N.T.J. Bailey, The Elements of Stochastic Processes with Applications to the Natural Sciences, Wiley, New York, NY, 1964. [19] W.T. Ma, G.vH. Sandri, S. Sarkar, Analysis of the Luria–Delbr€ uck distribution using discrete convolution powers, J. Appl. Prob. 29 (1992) 255. [20] S. Sarkar, W.T. Ma, G.vH. Sandri, On fluctuation analysis: a new simple and efficient method for computing the expected number of mutants, Genetica 85 (1992) 173. [21] F.M. Stewart, D.M. Gordon, B.R. Levin, Fluctuation analysis: the probability distribution of the number of mutants under different conditions, Genetics 124 (1990) 175. [22] A.L. Koch, Mutation and growth rates from Luria–Delbr€ uck fluctuation tests, Mutation Res. 95 (1982) 129. [23] F.M. Stewart, Fluctuation tests: how reliable are the estimates of mutation rates?, Genetics 137 (1994) 1139. [24] P. Flajolet, A. Odlyzko, Singularity analysis of generating functions, SIAM J. Disc. Math. 3 (1990) 216. [25] A.G. Pakes, Remarks on the Luria–Delbr€ uck distribution, J. Appl. Prob. 30 (1993) 991. [26] J. Hawkes, J.D. Jenkins, Infinitely divisible sequence, Scand. Actuar. J. (1978) 65. [27] C.H. Edwards, Jr., Advanced Calculus of Several Variables, Academic Press, 1973; reprinted by Dover, 1994. [28] S. Wolfram, The Mathematica Book, 4th Ed., Wolfram Media/Cambridge University, 1999. [29] A.G. Pakes, personal communication.