Mathematical Biosciences 196 (2005) 198–214 www.elsevier.com/locate/mbs
New algorithms for Luria–Delbru¨ck fluctuation analysis Qi Zheng
*
Department of Epidemiology and Biostatistics, School of Rural Public Health, Texas A&M University System Health Science Center, Bryan, TX 72802, USA Received 23 June 2004; received in revised form 15 February 2005; accepted 2 March 2005 Available online 13 June 2005
Abstract Fluctuation analysis is the most widely used approach in estimating microbial mutation rates. Development of methods for point and interval estimation of mutation rates has long been hampered by lack of closed form expressions for the probability mass function of the number of mutants in a parallel culture. This paper uses sequence convolution to derive exact algorithms for computing the score function and observed Fisher information, leading to efficient computation of maximum likelihood estimates and profile likelihood based confidence intervals for the expected number of mutations occurring in a test tube. These algorithms and their implementation in SALVADOR 2.0 facilitate routine use of modern statistical techniques in fluctuation analysis by biologists engaged in mutation research. 2005 Elsevier Inc. All rights reserved. Keywords: Mutation rate; Newton–Raphson algorithm; Profile likelihood based confidence interval; Score function; Observed Fisher information
1. Introduction Estimation of mutation rates plays an indispensable role in genetic and evolutionary studies. The most widely used methods for estimating mutation rates have been those based on the fluctuation test protocol devised by Luria and Delbru¨ck [1] about 60 years ago. Although the major *
Tel.: +1 979 845 4379; fax: +1 979 458 1877. E-mail address:
[email protected]
0025-5564/$ - see front matter 2005 Elsevier Inc. All rights reserved. doi:10.1016/j.mbs.2005.03.011
Q. Zheng / Mathematical Biosciences 196 (2005) 198–214
199
concern of Luria and Delbru¨ck at the time was to address the issue of the origin of certain type of bacterial mutation, the Luria–Delbru¨ck fluctuation test protocol has left indelible imprint on almost every present-day method for estimating mutation rates. Successful use of the fluctuation test protocol in estimating mutation rates depends not only on efficient methods for computing the probability distribution of mutants, but also on statistically sound techniques for computing point and interval estimates of mutation rates. The past half a century has seen vigorous mathematical developments in tackling the first issue, but much less effort has been expended on the second issue. (Most of the theoretical contributions in the field are summarized in Ref. [2].) Lea and Coulson [3] were the first to outline a method of finding the maximum likelihood estimate (m.l.e.) of the fundamental parameter m – the expected number of mutations occurring in a parallel culture. They gave no specific algorithms but offered numeric tables to assist the user in finding m.l.e.s of m. Jones et al. [4] appear to be the first to give an explicit and practical algorithm for computing the m.l.e. of m. The Jones algorithm for computing m.l.e. is inexact in the sense that the log likelihood function and its derivatives are computed by truncating an infinite series representing the mutant distribution. This algorithm has not been extended to the case where growth rate of non-mutants can differ from that of mutants. (We shall refer to this case as the differential growth case.) Stewart [5] is the first to seek a systematic method for constructing confidence intervals (CIs) for m. The Stewart approach relies on computer simulation; the user consults computer generated charts or tables to construct a CI. This approach is not convenient to apply and has not extended to the differential growth case either. The first computationally feasible method for constructing CIs for m appears to be that proposed by Zheng [6]. The Zheng method uses sequence convolution to compute expected Fisher information, whence Wald type asymptotic CIs can be obtained. Theoretical considerations and simulations have shown that this method is efficient in computing CIs when growth rates of non-mutants and mutants are the same (the equal growth case) or when the two growth rates differ but the ratio is known in advance. However, there is an uncertainty in computing the expected Fisher information for the differential growth case when the ratio of the two growth rates is unknown; this uncertainty is due to lack of knowledge about the rate of convergence of certain infinite series employed by the algorithm to compute expected Fisher information. A major goal of the present report is to propose new algorithms that effectively obviate the above difficulty. The algorithms to be presented use a sequence convolution technique to compute likelihood ratio based CIs for the equal growth case and profile likelihood based CIs for the differential growth case. The new methods for computing CIs also provide a statistical procedure that can help choose between the equal growth model and the differential growth model. This paper also gives an explicit and more efficient method for computing m.l.e.s based on the Newton–Raphson algorithm. An important feature of this method is its use of the exact mutant distribution, the exact log likelihood function and its exact derivatives – a desirable feature lacking in the Jones algorithm. Finally, unlike the Jones algorithm and the Stewart approach, the methods presented here embrace the differential growth case. All these new algorithms have been incorporated into version 2.0 of SALVADOR (available at http://library.wolfram.com/infocenter). This paper does not address the issue of plating efficiency [7,8], for which two sources can cause concern. The first is deliberate partial plating. For example, in some of the first fluctuation tests [1], only a 0.05 ml sample from a 10 ml broth culture was plated. The practice of partial plating was common until the late 1980s when various assumptions and laboratory procedures were
200
Q. Zheng / Mathematical Biosciences 196 (2005) 198–214
scrutinized to enhance the validity of conclusions drawn from fluctuation experiments. As a result, complete plating now appears to be the norm. The second cause is cell death. This is a relevant issue in experiments involving eukaryotic cells, but effects of cell death seem negligible in experiments involving bacterial cells. It is straightforward in principle to extend most of the results in this paper to the case of imperfect plating, but satisfactory solution of the problem requires a separate investigation.
2. Nomenclature and notation In a slightly idealized fluctuation test experiment a small number, N0, of non-mutant cells are seeded into each of n test tubes containing broth or a similar liquid culture. After an incubation period the population of non-mutant cells in each tube expands exponentially to a size of NT, and the whole contents of each tube are then plated onto a selective and solid culture in a dish. Each dish is then incubated until every mutant cell transferred from the tube grows into a visible mutant colony. Let X = (X1, X2, . . . , Xn) be the vector of numbers of mutant colonies appearing on the n dishes. As the mutation-mutant principle [9] dictates, each of the Xi colonies on dish i is spawned by a mutant originated in tube i, and each of these Xi, mutants can be traced to a random number of mutations occurring in tube i. Note in passing that mutation and mutant are two related but different concepts. As far as the present paper is concerned, a mutation is a genetic event in a cell that causes changes in a gene (e.g. a DNA base substitution), whereas a mutant is a cell that manifests the phenotype attributable to the mutation. Barring back mutation, all of the offspring of a mutant inherit the mutation, and hence in general the number of mutations occurring in a tube is not the same as the number of mutants found in the same tube. Mathematically, the random number of mutations occurring in a tube obeys a Poisson law having mean l ð1Þ m ¼ ðN T N 0 Þ; b1 where l is the instantaneous mutation rate per cell and b1 is the growth rate of non-mutant cells (the growth rate of mutant cells will be denoted by b2 subsequently). The parameter of major biological interest is the mutation rate lb = l/b1. Because NT and N0 are assumed known, it suffices to obtain point and interval estimates for m. The probability mass function (p.m.f.) of Xi is conveniently referred to as a mutant distribution. Development of point and interval estimators has been hampered by the fact that the p.m.f. of a mutant distribution admits no explicit analytic expressions, even though the probability generating function (p.g.f.) possesses a relatively simple analytic expression. For example, in the equal growth case, the p.g.f. of the mutant distribution is (e.g. Eq. (6) in Ref. [6]) m 1 1 logð1 /zÞ ; ð2Þ Gðz; m; /Þ ¼ exp / z where / = 1 N0/NT is often set to unity for simplicity. The distribution defined by the p.g.f. in (2) is denoted by LD(m, /). An observation that will play a pivotal role in subsequent developments can be more easily appreciated by rewriting (2) in the form
Q. Zheng / Mathematical Biosciences 196 (2005) 198–214 1 X k¼0
m 1 pðk; m; /Þz ¼ exp 1 logð1 /zÞ ; / z k
201
ð3Þ
where p(k; m, /) = P(X1 = k) is the p.m.f. of an LD(m, /) distribution. It has been shown in Ref. [6] that differentiating (3) with regard to m yields an algorithm for computing expected Fisher information useful in constructing Wald type CIs for m. The thrust of the present paper is to derive practical algorithms for computing likelihood ratio based CIs by further exploring the technique used in Ref. [6]. All algorithms proposed here rely on sequence convolution. A sequence cn (n = 0, 1, . . .) is said to beP the convolution of sequences an and bn (n = 0, 1, . . .), denoted by cn = an * bn, if k cn ¼ nk¼0 ak bnk for all n P 0. A shorthand notation for an * an is a2 n and in general an ¼ aðk1Þ an for k P 2. For a given positive integer K, the computing cost of obtaining the first n K terms of an * bn is approximately K2 arithmetic operations. Sequence convolution is a powerful tool, but it should be used frugally to prevent it from bottlenecking computation. We shall use l to denote a log likelihood function throughout this paper. In the equal growth case, for instance, the log likelihood function is defined by lðm; /; X Þ ¼
n X
ð4Þ
log pðX i ; m; /Þ.
i¼1
The score function U is defined to be the gradient vector of first derivatives of l with respect to the unknown parameters. The observed Fisher information J is the Hessian matrix of minus the second partial derivatives of l. The expected Fisher information I is the mathematical expectation of observed Fisher information J. Rigorous definitions of these basic statistical concepts can be found in Ref. [10, p. 94]. As in Ref. [6], we shall use the fluctuation experiment conducted by Demerec [11] as an illustrative example. In that experiment, n = 30, N0 = 90 and NT = 1.9 · 108. The 30 observations of mutant colony counts are as follows. 33 18 839 196 66
28
47 13
126
48 80
9
71
17 27
37 126 33
12 44
28 67 730 168 44
50 583 23
17 24
3. The equal growth case When non-mutants and mutants grow at the same rate, the p.g.f. of the mutant distribution is given by (2) or (3). Because / is either assumed to be known beforehand or set to unity for simplicity, m is the only unknown parameter. The difficulty that is to be overcome by the present paper arises from the fact that the p.m.f. p(k; m, /) is not amenable to the usual algebraic manipulation. As first noticed by Ma et al. [12], the p.m.f. is defined only by a recursive relation
202
Q. Zheng / Mathematical Biosciences 196 (2005) 198–214
pð0; m; /Þ ¼ em k mX j/ pðk; m; /Þ ¼ /j1 1 pðk j; m; /Þ k j¼1 jþ1
9 > = . ðk P 1Þ > ;
ð5Þ
What has impeded both m.l.e. and CI computation is lack of an exact algorithm for computing the first and second derivatives of p(k; m, /). We offer one such algorithm below. Differentiating (3) repeatedly with respect to m yields for i = 1, 2, . . . i oi G 1 1 m 1 1 logð1 /zÞ exp 1 logð1 /zÞ . ð6Þ ¼ omi / z / z P k When a sequence hk (k = 0, 1, . . .) is defined by /1 ð1=z 1Þ logð1 /zÞ ¼ 1 k¼0 hk z , the expression in (6) can be rewritten as !i ! 1 1 1 X X X oi pðk; m; /Þ k k k ð7Þ z ¼ hk z pðk; m; /Þz omi k¼0 k¼0 k¼0 with
9 = 1 / . hk ¼ /k1 ðk P 1Þ ; k kþ1
h0 ¼ 1
Equating coefficients of zk on both sides of (7) gives pðiÞ ðk; m; /Þ ¼ ðhi k Þ pðk; m; /Þ (i)
i
ði ¼ 1; 2; . . .Þ;
ð8Þ
i
where p (k; m, /) stands for o p(k; m, /)/om . In particular, we have ) pð1Þ ðk; m; /Þ ¼ hk pðk; m; /Þ . pð2Þ ðk; m; /Þ ¼ hk pð1Þ ðk; m; /Þ
ð9Þ
Note that pð2Þ ðk; m; /Þ ¼ h2 k pðk; m; /Þ as given in (8) is mathematically correct, but the formula is recast in the form of (9) to avoid an unnecessary operation of sequence convolution. We can compute the exact score function U and observed Fisher information J by using the relations 9 n ol X pð1Þ ðX i ; m; /Þ > > ¼ U ðm; /; X Þ ¼ > > = om pðX ; m; /Þ i i¼1 " # . ð10Þ 2 n > o2 l X pð1Þ ðX i ; m; /Þ pð2Þ ðX i ; m; /Þ > > > ¼ J ðm; /; X Þ ¼ ; om2 pðX i ; m; /Þ pðX i ; m; /Þ i¼1 Thus an m.l.e. of m can be iteratively computed using the Newton–Raphson method [10, p. 441]: ~k þ ~ kþ1 ¼ m m
~ k ; /; X Þ U ðm . ~ k ; /; X Þ J ðm
ð11Þ
~ 0 , the above procedure is iterated until convergence With an appropriately chosen initial guess m to a prescribed degree of accuracy. In each iteration only the first K + 1 terms of p(k; m, /),
Q. Zheng / Mathematical Biosciences 196 (2005) 198–214
203
p(1)(k; m, /), p(2)(k; m, /) need computing, where K = maxi Xi. All these quantities can be computed using (5), (9) and (10). Finally, a convenient initial guess is the median estimator proposed by Jones et al. [13] ~0 ¼ m
^ n0.5 logð2Þ ; ^ logðn0.5 Þ logðlogð2ÞÞ
ð12Þ
where ^ n0.5 is the sample median of X1, . . . , Xn. However, when n^0.5 ¼ 0, (12) is undefined and one can use the well-known P0 method of Luria and Delbru¨ck [1] to generate an initial guess m0. We pass now to offer an algorithm for computing a likelihood ratio based asymptotic CI for m. ^ be an m.l.e. of m. It follows from statistical asymptotic theory [14, p. 234], that Let m ^ /; X Þ lðm0 ; /; X ÞÞ has an asymptotic chi-squared distribution with one degree of freedom, 2ðlðm; when m0 is the true value of m. Hence with large n we have 1 2 ^ /; X Þ va;1 1 a; ð13Þ P lðm; /; X Þ P lðm; 2 where v2a;1 is the (1 a)th quantile of the v2 distribution with one degree of freedom. Eq. (13) indicates that boundary points of an asymptotic (1 a)100% CI for m can be obtained by solving for m the equation
log likelihood of m
1 ^ /; X Þ v2a;1 . ð14Þ lðm; /; X Þ ¼ lðm; 2 We shall assume that l(m, /; X) is a unimodal function of m, and consequently Eq. (14) has exactly ^ þ ). As a rule, the two roots m ^ and m ^ þ are not equidistant from m. ^ There^ < m two roots (say m ^ ^ ^ fore, a likelihood ratio based CI for m, (m ; mþ ), is not symmetric about m. Fig. 1 illustrates this point using DemerecÕs [11] experimental data. Furthermore, we can find the two roots of (14) by using the Newton–Raphson iteration
-164 -165 -166 -167 -168 -169 -170 8
10
m
12
14
Fig. 1. This figure illustrates the method of constructing a likelihood based CI for m using an LD(m, /) mutant distribution. The unimodal curve is the log likelihood function l(m, /; X) where / is set to unity and X is DemerecÕs [11] ^ /; X Þ 12 v20.05;1 . The two dashed vertical lines indicate the experimental data. The horizontal line represents ~la ¼ lðm; two roots of the equation lðm; /; X Þ ¼ ~la , giving the two boundaries of an asymptotic 95% CI for m.
204
Q. Zheng / Mathematical Biosciences 196 (2005) 198–214
~ k ; /; X Þ ~la lðm ; ð15Þ ~ k ; /; X Þ U ðm ^ /; X Þ 12 v2a;1 . Because Uðm; ^ /; X Þ ¼ 0, an initial guess m ^ 0 that is too close to m ^ can where ~la ¼ lðm; ^ 0 be ^ 0 is to let m cause computational instability. As suggested in Ref. [15], one way to choose m ^ and a ÔhypotheticalÕ root of Eq. (14) obtained by assuming that the log likemidway between m lihood function l(m, /; X) is quadratic in m. This reasoning along with simple algebra leads to sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi v2a;1 1 ^ ~0 ¼ m . ð16Þ m ^ /; X Þ 2 J ðm; ~k ~ kþ1 ¼ m m
To assess the above algorithm, we first compare this algorithm with that proposed earlier [6] for constructing Wald type CIs for m. The expected Fisher information can be computed by IðmÞ ¼
2 1 X ½pð1Þ ðk; m; /Þ . pðk; m; /Þ k¼0
ð17Þ
According to statistical asymptotic theory [16, p. 531], an asymptotic (1 a)100% CI for m takes the form za=2 ^ pffiffiffiffiffiffiffiffiffiffiffiffi ; ð18Þ m ^ nIðmÞ where za/2 is the (1 a/2)th quantile of the standard normal distribution. With the Demerec [11] ^ ¼ 10.8438. Retaining 20 000 terms and setting / = 1 in (17), we find example, we have m.l.e. m ^ ¼ 0.02535. Therefore, Eq. (18) gives 95% and 99% Wald type CIs for m as (8.60, 13.09) IðmÞ and (7.89, 13.79), respectively. On the other hand, the new algorithm gives 95% and 99% likelihood ratio based CIs for m as (8.65, 13.19) and (8.00, 13.96), respectively. These remarkably consistent results appear to indicate that the two algorithms corroborate each other, although the new algorithm is easier to use and requires less computing time. Further evaluation of the algorithm is provided by simulations. In one simulation session, 10 000 statistically identical fluctuation experiments were simulated, each experiment consisting of n = 30 cultures. The data were simulated using parameter values N0 = 10, NT = 108, l = 109 and b1 = b2 = 0.08. Among the 10 000 likelihood ratio based CIs for m, 501 CIs missed the true parameter value m = 1.25. In another session 10 000 experiments were simulated using the same parameter values, but a 99% likelihood ratio CI was constructed for each simulated fluctuation experiment. This time 108 CIs missed the true parameter value. These simulations suggest that likelihood ratio based CIs are satisfactory even with a sample size as small as 30.
4. The differential growth case 4.1. The mutant distribution When growth rate of non-mutants b1 differs from that of mutants b2, the mutant distribution is often identified by the ratio q = b2/b1 along with m and /. Denoted by M(m, q, /), this distribution has its p.g.f. of the form (e.g. Eq. (23) in Ref. [6])
Q. Zheng / Mathematical Biosciences 196 (2005) 198–214
205
! 1 m X 1 k ð19Þ Gðz; m; q; /Þ ¼ exp m þ Bx k; 1 þ z ; q/ k¼1 q Rz where x = 1 (1 /)q and Bz ða; bÞ ¼ 0 ta1 ðl tÞb1 dt is the incomplete beta function. We shall confine our attention to the commonly used approximate case where / = 1 (and hence x = 1). Because the reciprocal of q, denoted throughout this paper by r = b1/b2, is a more convenient parametrization, we shall adopt r in place of q and use M(m, r) to denote the distribution whose p.g.f. is of the form ! 1 X k ð20Þ Bðk; 1 þ rÞz . Gðz; m; rÞ ¼ exp m þ mr k¼1
It is computationally noteworthy that the beta function sequence B(k, r + 1) (k = 1, 2, . . .) appearing in (20) and first derivatives of these beta functions can be computed with simple arithmetic operations. From elementary properties of the beta function it follows that 9 Bð1; 1 þ rÞ ¼ 1=ð1 þ rÞ > = k Y 1 j1 . ð21Þ Bðk; 1 þ rÞ ¼ ðk P 2Þ > ; 1 þ r j¼2 j þ r Thus one can compute the beta function sequence recursively and then use the recursive relation 9 pð0; m; rÞ ¼ em > = k X mr ð22Þ pðk; m; rÞ ¼ jBðj; 1 þ rÞpðk j; m; rÞ ðk P 1Þ > ; k j¼1 to compute the p.m.f. of an M(m, r) distribution. To find an algorithm for computing oB(k, 1 + r)/ or, we write B(k, 1 + r) = C(k)C(1 + r)/C(k + 1 + r) and differentiate the expression to get oBðk; 1 þ rÞ ¼ Bðk; 1 þ rÞ½wð1 þ rÞ wðk þ 1 þ rÞ ðk P 1Þ; or where w(x) = C 0 (x)/C(x) is the psi (digamma) function. Repeated use of the relation w(x + 1) = w(x) + 1/x shows that gk ðrÞ ¼ wðk þ 1 þ rÞ wð1 þ rÞ ¼
k X j¼1
1 ; jþr
ð23Þ
which can also be computed recursively. Thus we compute the derivative of B(k, 1 + r) by oBðk; 1 þ rÞ ¼ Bðk; 1 þ rÞgk ðrÞ or
ðk P 1Þ.
ð24Þ
We now digress to discuss a property of the M(m, r) distribution that seems to have been overlooked. From (21) it follows that limr!1rB(1, 1 + r) = 1 and limr!1rB(k, 1 + r) = 0 for k P 2. Therefore, in view of (20), we have limr!1G(z; m, r) = exp(m(z 1)), which is the p.g.f. of a Poisson random variable having mean m.
206
Q. Zheng / Mathematical Biosciences 196 (2005) 198–214
Proposition. As r ! 1, the mutant distribution M(m, r) as defined by its p.g.f. in Eq. (20) converges in distribution to a Poisson distribution having mean m. This proposition lends stronger and more satisfying support to a claim of Lenski et al. [17] than the argument given in Ref. [6, p. 248]. The proposition asserts that the mutant distribution resembles a Poisson distribution when non-mutants grow much faster than mutants. 4.2. The score function and observed Fisher information In this subsection we describe a method for computing the first order and second order partial derivatives of the log likelihood function, as the derivatives play a key role in computing point and interval estimates of m and r. Adapting standard notation in calculus, we use the symbol p(i,j)(k; m, r) to denote the derivative oi+jp(k; m, r)/omiorj. Note that by differentiating the p.g.f. in (20) we have 9 ! 1 X > oGðz; m; rÞ > k > > rBðk; 1 þ rÞz Gðz; m; rÞ ¼ 1 þ > = om k¼1 ! . ð25Þ 1 > X > oGðz; m; rÞ > mBðk; 1 þ rÞ½1 rgk ðrÞzk Gðz; m; rÞ > ¼ > ; or k¼1 Equating coefficients of zk on both sides of each of the two equations in (25), we arrive at ) pð1;0Þ ðk; m; rÞ ¼ h1;0 k ðrÞ pðk; m; rÞ ; pð0;1Þ ðk; m; rÞ ¼ m h0;1 k ðrÞ pðk; m; rÞ where we define for k P 1 h1;0 k ðrÞ ¼ rBðk; 1 þ rÞ
)
h0;1 k ðrÞ ¼ Bðk; 1 þ rÞð1 rgk ðrÞÞ 0;1 with h1;0 0 ðrÞ ¼ 1 and h0 ðrÞ ¼ 0. We can rewrite the second equation in (25) in the form ! 1 X oGðz; m; rÞ 0;1 mhk ðrÞzk Gðz; m; rÞ. ¼ or k¼0
ð26Þ
ð27Þ
Differentiating (27) with respect to r yields 2 ! !2 3 1 1 X o2 Gðz; m; rÞ 4 X k k 5Gðz; m; rÞ; þ ¼ mh0;2 mh0;1 k ðrÞz k ðrÞz or2 k¼0 k¼0 0;1 where h0;2 k ðrÞ ¼ ohk ðrÞ=or. Therefore, 2 2 0;1 pð0;2Þ ðk; m; rÞ ¼ ½mh0;2 k ðrÞ þ m ðhk ðrÞÞ pðk; m; rÞ 0;1 ð0;1Þ ¼ m ½h0;2 ðk; m; rÞ. k ðrÞ pðk; m; rÞ þ hk ðrÞ p
ð28Þ
Q. Zheng / Mathematical Biosciences 196 (2005) 198–214
207
Differentiating the second equation in (26) and substituting (24) and (23) into the resultant expression, we get for k P 1 " # k X j 2 0;2 ð29Þ hk ðrÞ ¼ Bðk; 1 þ rÞ rgk ðrÞ gk ðrÞ 2 j¼1 ðj þ rÞ and clearly h0;2 0 ðrÞ ¼ 0. In a similar fashion, one can differentiate with respect to m the two equations in (25) to obtain 9 2 1;0 ð1;0Þ > ðk; m; rÞ pð2;0Þ ðk; m; rÞ ¼ ðh1;0 k ðrÞÞ pðk; m; rÞ ¼ hk ðrÞ p = 0;1 0;1 1;0 ð1;1Þ ð30Þ p ðk; m; rÞ ¼ hk ðrÞ pðk; m; rÞ þ m hk ðrÞ hk ðrÞ pðk; m; rÞ . > ; 0;1 1 ð0;1Þ ð1;0Þ ¼ m p ðk; m; rÞ þ m hk ðrÞ p ðk; m; rÞ Thus the score vector U(m, r; X) = [U1(m, r; X) U2(m, r; X)] can be computed by 9 n X pð1;0Þ ðX i ; m; rÞ > > > U 1 ðm; r; X Þ ¼ > = pðX ; m; rÞ i i¼1 . n X pð0;1Þ ðX i ; m; rÞ > > > U 2 ðm; r; X Þ ¼ > ; pðX ; m; rÞ i¼1
ð31Þ
i
The 2 · 2 observed Fisher information matrix J(m, r; X) = (Ji,j) is symmetric and can be computed by 9 ! ð1;0Þ 2 n > o2 lðm; r; X Þ X p ðX i ; m; rÞ pð2;0Þ ðX i ; m; rÞ > > > ¼ J 1;1 ðm; r; X Þ ¼ > 2 > om pðX pðX ; m; rÞ ; m; rÞ > i i i¼1 > !> > > n 2 = o lðm; r; X Þ X pð1;0Þ ðX i ; m; rÞpð0;1Þ ðX i ; m; rÞ pð1;1Þ ðX i ; m; rÞ > ¼ J 1;2 ðm; r; X Þ ¼ . ð32Þ 2 omor pðX i ; m; rÞ pðX i ; m; rÞ > i¼1 > > ! > ð0;1Þ 2 > n > o2 lðm; r; X Þ X p ðX i ; m; rÞ pð0;2Þ ðX i ; m; rÞ > > > J 2;2 ðm; r; X Þ ¼ ¼ > 2 > or pðX pðX ; m; rÞ ; m; rÞ > i i i¼1 ; 4.3. Point and interval estimation With the above exact algorithm for computing the score function and expected Fisher information, it is straightforward to compute the m.l.e.s of m and r by the Newton–Raphson iteration process ~ k ; ~rk ; X Þ J 1;2 ðm ~ k ; ~rk ; X Þ 1 U 1 ðm J 1;1 ðm ~ k ; ~rk ; X Þ ~k ~ kþ1 m m ð33Þ . ¼ þ ~ k ; ~rk ; X Þ J 2;2 ðm ~ k ; ~rk ; X Þ J 1;2 ðm ~ k ; ~rk ; X Þ ~rkþ1 ~rk U 2 ðm ~ 0 the In practice, r often differs from unity moderately, so one can choose ~r0 ¼ 1 and choose m same way as in the equal growth case. In order to construct CIs for m we consider lP ðm; X Þ ¼ lðm; ~rðmÞ; X Þ;
ð34Þ
208
Q. Zheng / Mathematical Biosciences 196 (2005) 198–214
where ~rðmÞ maximizes the log likelihood function for any given m, that is, ~rðmÞ ¼ arg max lðm; r; X Þ. r>0
ð35Þ
^ and ^r be m.l.e.s of m and r, respectively. According to statistical asymptotic theory, Let m 1 ^ ^r; X Þ lP ðm; X ÞÞ has an asymptotic chi-squared distribution with one degree of freedom. ðlð m; 2 Thus a boundary point of an approximate (1 a)100% CI for m is a root of the equation 1 ^ ^r; X Þ lP ðm; X ÞÞ ¼ v2a;1 . To find such a root, we can solve simultaneously for m and r the ðlðm; 2 equations 9 lðm; r; X Þ ¼ ~la = ; ð36Þ olðm; r; X Þ ¼ 0; or ~ ^ ^r; X Þ 12 v2a;1 . In other words, if (36) has a solution m ¼ m ~ and r ¼ ~r then m ~ is taken where la ¼ lðm; as a boundary of the desired CI for m, and ~r is discarded. A CI thus obtained is known as a profile ~ þ ) are in ~ thus found (say m ~ < m likelihood based CI. As in the equal growth case, the two mÕs ^ Therefore, unlike a Wald type CI as given in (18), a general not equidistant from the m.l.e. m. ~ þ Þ, is not symmetric about m. ^ Note that numeric solutions of ~ ; m profile based CI for m, ðm (36) can be obtained by the Newton–Raphson algorithm # " ~ k ; ~rk ; X Þ U 2 ðm ~ k ; ~rk ; X Þ 1 lðm U 1 ðm ~ kþ1 ~k m m ~ k ; ~rk ; X Þ ~la . ð37Þ ¼ þ ~ k ; ~rk ; X Þ ~ k ; ~rk ; X Þ J 1;2 ðm J 2;2 ðm ~rkþ1 ~rk ~ k ; ~rk ; X Þ U 2 ðm ^ ^r; X Þ ¼ U 2 ðm; ^ ^r; X Þ ¼ 0, the 2 · 2 matrix in (37) is singular at the point ðm; ^ ^rÞ. Because U 1 ðm; ~ 0 and ~r0 . As suggested in Ref. [15], one way of choosing Hence care must be taken in choosing m ~ 0 and ~r0 is to find Dm which, under the assumption that lðm; ~rðmÞ; X Þ is quadratic in m, satisfies m ^ þ Dm; ~rðm ^ þ DmÞ; X Þ ¼ ~la . lðm
ð38Þ
With Dm thus defined, we take 9 1 > = ~0 ¼ m ^ þ Dm m 2 ; ð39Þ > 1 0 ; ^ ~r0 ¼ ^r þ Dm ~r ðmÞ 2 ^ is the first derivative of ~rðmÞ evaluated at m. ^ Let l (i,j)(m, r; X) stand for oi+jl(m, r; X)/ where ~r0 ðmÞ i j om or . It follows from the definition of ~rðmÞ in (35) that for all m > 0 lð0;1Þ ðm; ~rðmÞ; X Þ ¼ 0.
ð40Þ
Differentiating (40) with respect to m gives lð1;1Þ ðm; ~rðmÞ; X Þ þ lð0;2Þ ðm; ~rðmÞ; X Þ~r0 ðmÞ ¼ 0.
ð41Þ
Rearranging and applying (32) leads to ~r0 ðmÞ ¼
lð1;1Þ ðm; ~rðmÞ; X Þ J 1;2 ðm; ~rðmÞ; X Þ ¼ . J 2;2 ðm; ~rðmÞ; X Þ lð0;2Þ ðm; ~rðmÞ; X Þ
ð42Þ
Q. Zheng / Mathematical Biosciences 196 (2005) 198–214
209
On account of (40), one has dlðm; ~rðmÞ; X Þ ¼ lð1;0Þ ðm; ~rðmÞ; X Þ þ lð0;1Þ ðm; ~rðmÞ; X Þ~r0 ðmÞ ¼ lð1;0Þ ðm; ~rðmÞ; X Þ. dm Differentiating (43) with respect to m yields
ð43Þ
d2 lðm; ~rðmÞ; X Þ ¼ lð2;0Þ ðm; ~rðmÞ; X Þ þ lð1;1Þ ðm; ~rðmÞ; X Þ~r0 ðmÞ. ð44Þ dm2 ^ ¼ ^r and lð1;0Þ ðm; ^ ^r; X Þ ¼ 0. Expanding the left side of (38) as a quadratic function of Note that ~rðmÞ ^ and then substituting (42)–(44) and (32), one can easily solve (38) to obtain Dm about m vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ! u , u ^ ^r; X Þ2 ½J 1;2 ðm; t 2 ^ ^r; X Þ . ð45Þ J 1;1 ðm; Dm ¼ va;1 ^ ^r; X Þ J 2;2 ðm; For comparison purposes, we can compute the expected Fisher information matrix by 9 2 X 1 o log pðX ; m; rÞ ½pð1;0Þ ðk; m; rÞ2 > > > I 1;1 ðm; rÞ ¼ E ¼ > > om pðk; m; rÞ > k¼0 > > > X > 1 ð1;0Þ ð0;1Þ o log pðX ; m; rÞ o log pðX ; m; rÞ p ðk; m; rÞp ðk; m; rÞ = I 1;2 ðm; rÞ ¼ E ¼ . om or pðk; m; rÞ > > k¼0 > > > 2 X > 2 1 ð0;1Þ > > o log pðX ; m; rÞ ½p ðk; m; rÞ > > I 2;2 ðm; rÞ ¼ E ¼ ; or pðk; m; rÞ k¼0
ð46Þ
One has to truncate infinite series to obtain expected Fisher information from (46). This is a drawback for the differential growth case, because we know only the convergence rate for I1,1(m, r) [6]. If we denote the elements of the inverse matrix of I(m, r) by I1(m, r) = (Ii,j(m, r)), a (1 a)100% Wald type CI for m is sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ^ ^rÞ I 1;1 ðm; ^ za=2 m . ð47Þ n ^ ¼ 9.8526 and In fitting an M(m, r) model to the Demerec [11] experimental data, we get m.l.e.s m ^r ¼ 0.8938. A profile likelihood based 95% CI for m is (6.98, 13.01). On the other hand, keeping 100 000 terms in (46), we get 0.02659 0.2631 ^ ^rÞ ¼ Iðm; ; 5.3468 whence 1
^ ^rÞ ¼ I ðm;
73.3148 3.6077 0.3646
.
^ ^rÞ ¼ 73.3148 and m ^ ¼ 9.8526 into (47) yields (6.79, 12.92) as an approximate Substituting I 1;1 ðm; 95% CI for m.
210
Q. Zheng / Mathematical Biosciences 196 (2005) 198–214
4.4. Which model to choose? A CI for r can be a statistical device to help choose between the equal growth model and the differential growth model. The procedure for constructing a CI for r is to solve 9 lðm; r; X Þ ¼ ~la > = ð48Þ olðm; r; X Þ > ¼ 0; om ^ ^r; X Þ 12 v2a;1 . The Newton–Raphson iteration process for solving (48) is with ~la ¼ lðm; 3 " # " # " # 2 ~ k ; ~rk ; X Þ U 2 ðm ~ k ; ~rk ; X Þ 1 lðm U 1 ðm ~k ~ kþ1 m m ~ k ; ~rk ; X Þ ~la 4 5. ¼ þ ~ k ; ~rk ; X Þ ~ k ; ~rk ; X Þ J 1;1 ðm J 1;2 ðm ~rkþ1 ~rk ~ k ; ~rk ; X Þ U 1 ðm To start this iteration process, one can take 9 1 > 0 ~ ð^rÞ > ^ þ Dr m ~0 ¼ m m > = 2 ; > > 1 > ; ~r0 ¼ ^r þ Dr 2
ð49Þ
ð50Þ
where vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi !9 u , 2 u > ^ ^r; X Þ > ½J 1;2 ðm; > > ^ ^r; X Þ J 2;2 ðm; r ¼ tv2a;1 > ^ ^r; X Þ = J 1;1 ðm; . > > > > ^ ^r; X Þ J 1;2 ðm; > ; ~ 0 ð^rÞ ¼ m ^ ^r; X Þ J 1;1 ðm;
ð51Þ
With regard to the Demerec [11] data, a 95% profile likelihood based CI for r is (0.69, 1.13). Because unity is contained in the CI, the equal growth model appears sufficient to describe the experimental data.
5. A case study To determine whether theoretical mutant distributions were in agreement with experimental observations, Boe et al. [18,19] conducted perhaps the largest and most meticulously designed fluctuation experiment in history. The mutants studied were Escherichia coli cells resistant to nalidixic acid. Twenty three tests (48 cultures each) were performed under slightly different conditions, and the data were then pooled for goodness of fit test. The data are quoted below, with the notation x(k) indicating that k cultures were found to have x mutants.
Q. Zheng / Mathematical Biosciences 196 (2005) 198–214
0(543) 11(10) 23 35 66 146
1(169) 12(9) 24(2) 36 68 151
2(92) 13(5) 25 37 69 152
3(57) 14(5) 26(5) 39(2) 73(2) 192
4(42) 15 27(3) 40 74 258
5(25) 16(8) 28 41 105 265
6(23) 17(2) 29(3) 42 107 320
7(13) 18(5) 30(3) 49 116 482
8(11) 19(6) 31 52 132 513(4)
211
9(5) 20(2) 32 57 137
10(8) 21(5) 34 59 140
Prob.of no more than k mutants
Four of the 1104 cultures were believed to contain more than 512 mutants, but the exact counts were difficult to determine. For simplicity, we regard each of these four cultures as having 513 mu^ ¼ 0.74. But a v2 goodness of fit tants. Using the equal growth model Boe et al. found an m.l.e. m test yields a small p-value, leading Boe et al. to suggest that theoretical mutant distributions cannot describe experimental results. According to Boe et al. [18], imperfect plating efficiency, phenotypic delay and differential growth were all unlikely to have contributed to the poor agreement between theory and experimental results. However, by adopting the differential growth ^ ¼ 0.7139 and ^r ¼ 0.8380, which dramatically improves the fit (see Fig. 2 model, we find that m and Table 1). In addition, a profile likelihood based 95% CI for m is (0.6568, 0.7742), and that for r is (0.7637, 0.9182). A question arises at this stage; is the improved fit purely due to the addition of a free parameter? It is generally believed that mutants grow more slowly than non-mutants, which might cast doubt on the m.l.e. ^r ¼ 0.8380 < 1. Using their experimental results Boe et al. [18] concluded that growth rates for mutants and for non-mutants were roughly equal but that plating efficiency was less than perfect (95%). Could the unexpectedly small ^r be a consequence of imperfect plating efficiency? To probe this question, we simulated a plating process as follows. We consider the number of mutant colonies on a dish (Xi) as the actual number of mutants in the tube and we then mimic imperfect plating by generating for each Xi a random number Yi in accordance with a binomial distribution having index Xi and parameter 0.95. A new m.l.e. of r can then be computed using the ÔdilutedÕ data Yi. This process was simulated 10 000 times, and only 414 out of the
1 0.9 0.8 0.7 0.6 0.5 0
1
2
4 3 log of k+1
5
6
Fig. 2. Two mutant distributions were fitted to the experimental data of Boe et al. [18] using the maximum likelihood method. The dashed line represents an LD(0.7366) distribution (the equal growth model), whereas the solid line is an M(0.7139, 0.8380) distribution (the differential growth model).
212
Q. Zheng / Mathematical Biosciences 196 (2005) 198–214
Table 1 Goodness of fit for two mutant distributions fitted to the experimental data of Boe et al. [18] No. of mutants
Observed
M(0.7139, 0.8380)
LD(0.7366)
0 1 2 3 or 4 5–8 9–16 17–32 33–64 65–128 129–256 257–512 >512
543 169 92 99 72 51 41 13 9 7 4 4
540.6 176.0 90.7 93.3 76.9 52.8 32.1 18.4 10.3 5.7 3.2 4.0
528.5 194.7 100.7 100.9 78.1 48.5 26.2 13.3 6.6 3.3 1.6 1.6
v2 df P
5.72 9 0.767
25.88 10 0.0039
PearsonÕs chi-squared statistics (v2) were computed following Boe et al. [18]. The number of degrees of freedom (df) for the equal growth case was misprinted in Ref. [18] and hence is corrected.
frequency
800 600 400 200 0 0.83
0.835 0.84 0.845 0.85 0.855 0.86 maximum likelihood estimate of r
Fig. 3. A Ôre-platingÕ process was simulated as follows, using the experimental data of Boe et al. [18,19]. For each Xi, a Yi was generated according to a binomial distribution with parameter 0.95 and index Xi. An m.l.e. of r was then computed using the new data Yi. This process was simulated 10 000 times. The above histogram represents the frequencies of the 10 000 m.l.e.s of r; recall that the m.l.e. for r computed using the original data is ^r ¼ 0.838.
10 000 m.l.e.s of r were smaller than the m.l.e. computed using the original data (see Fig. 3). In a similar simulation study, plating efficiency was set to 90%. Among the 10 000 m.l.e.s of r only 81 were smaller than the m.l.e. of r obtained using the original data. Thus imperfect plating efficiency seems unlikely to cause underestimation of r. Then, could r < 1 be true? In nature mutants often suffer from reduced competitiveness, partly due to the effects of pleiotropy [17]. In a fluctuation experiment, however, mutants grow only for a short period of time and are plated before diminished resources render competition necessary for survival. The experimental evidence of Boe et al.
Q. Zheng / Mathematical Biosciences 196 (2005) 198–214
213
that b1 b2 suggests that r could be slightly larger or smaller than unity in their experiment. In fact, generation time for mutants was found to be 72 ± 5.0 min, whereas that for non-mutants was 73 ± 5.9 min [18, p. 2784]. Although model fitting alone cannot prove that r < 1 in the Boe et al. experiment, the foregoing biological information and data analysis results suggest that the chosen model having r = 0.8380 might be a plausible model. The fit in this case appears almost too good to be true (p-value = 0.767), which may be partly attributable to the unusually large sample size and painstaking laboratory work done to ensure high quality of the experiment.
6. Concluding notes Interval estimation of mutation rates has received scanty attention. The recent approach to computing Wald type asymptotic CIs for m is practical for the equal growth case, but, due to difficulties in computing infinite series, the approach is not convenient to use in the differential growth case. The present approach to computing likelihood ratio based CIs is practical for both types of cases. In practice these two methods complement each other. The method proposed here to compute m.l.e.s appears to be the most efficient, for it uses exact derivatives of the likelihood functions, and is defined by only simple arithmetic operations (e.g. no numerical differentiation is involved). Biologists, with these new algorithms at their disposal (e.g. via SALVADOR 2.0), can concentrate on extracting vital information from their experimental data, without being distracted by tedious computational details.
Acknowledgments I extend grateful appreciation to two anonymous reviewers and the handling editor. I in particular acknowledge cordially that one reviewerÕs detailed comments enabled me to improve the presentation significantly.
References [1] S.E. Luria, M. Delbru¨ck, Mutations of bacteria from virus sensitivity to virus resistance, Genetics 28 (1943) 491. [2] Q. Zheng, Progress of a half century in the study of the Luria–Delbru¨ck distribution, Math. Biosci. 162 (1999) 1. [3] D.E. Lea, C.A. Coulson, The distribution of the numbers of mutants in bacterial populations, J. Genetics 49 (1949) 264. [4] M.E. Jones, J. Wheldrake, A. Rogers, Luria–Delbru¨ck fluctuation analysis: Estimating the Poisson parameter in a compound Poisson distribution, Comput. Biol. Med. 23 (1993) 525. [5] F.M. Stewart, Fluctuation tests: How reliable are the estimates of mutation rates? Genetics 137 (1994) 1139. [6] Q. Zheng, Statistical and algorithmic methods for fluctuation analysis with SALVADOR as an implementation, Math. Biosci. 176 (2002) 237. [7] F.M. Stewart, Fluctuation analysis: The effect of plating efficiency, Genetica 54 (1991) 51. [8] M.E. Jones, Luria–Delbru¨ck fluctuation experiments; accounting simultaneously for plating efficiency and differential growth rate, J. Theor. Biol. 166 (1994) 355. [9] Q. Zheng, Mathematical issues arising from the directed mutation controversy, Genetics 164 (2003) 373. [10] J.K. Lindsey, Parametric Statistical Inference, Clarendon, Oxford, 1996.
214
Q. Zheng / Mathematical Biosciences 196 (2005) 198–214
[11] M. Demerec, Production of staphylococcus strains resistant to various concentrations of penicillin, Proc. Natl. Acad. Sci. USA 31 (1945) 16. [12] W.T. Ma, G.vH. Sandri, S. Sarkar, Analysis of the Luria–Delbru¨ck distribution using discrete convolution powers, J. Appl. Prob. 29 (1992) 255. [13] M.E. Jones, S.M. Thomas, A. Rogers, Luria–Delbru¨ck fluctuation experiments: Design and analysis, Genetics 136 (1994) 1209. [14] P.K. Sen, J.M. Singer, Large Sample Methods in Statistics, Chapman and Hall, New York, 1993. [15] D.J. Venzon, S.H. Moolgavkar, A method for computing profile-likelihood-based confidence intervals, Appl. Statist. 37 (1988) 87. [16] E.L. Lehmann, Elements of Large-sample Theory, Springer, New York, 1999. [17] R.E. Lenski, M. Slatkin, F.J. Ayala, Mutation and selection in bacterial populations: Alternatives to the hypothesis of directed mutation, Proc. Natl. Acad. Sci. USA 86 (1989) 2775. [18] L. Boe, T. Tolker-Nielsen, K.-M. Eegholm, H. Spliid, A. Vrang, Fluctuation analysis of mutations to nalidixic acid resistance in Escherichia coli, J. Bacteriol. 176 (1994) 2781. [19] L. Boe, T. Tolker-Nielsen, K.-M. Eegholm, H. Spliid, A. Vrang, Erratum, J. Bacteriol. 176 (1994) 4463.