Journal of Statistical Planning and Inference 136 (2006) 1781 – 1799 www.elsevier.com/locate/jspi
Optimal adaptive designs for acute oral toxicity assessment
Nigel Stallard∗
Medical and Pharmaceutical Statistics Research Unit, The University of Reading, P.O. Box 240, Earley Gate, Reading, RG6 6FN, UK
Available online 1 September 2005
Abstract

Acute oral toxicity studies are used to assess the toxicity to experimental animals of a single dose of the substance under investigation, assigning the substance to one of a number of classes. Animal welfare concerns have led to the development of three adaptive designs as alternatives to the traditional fixed sample design. These lead to reductions in the number of animals required in total and in the number exposed to lethal doses. In this paper, we show how designs can be constructed to optimise utility functions combining the need to classify correctly with the desire to use a small number of animals. Constrained optimal designs are also obtained in which no animal is exposed to a dose higher than that at which a death has been observed. The optimal designs lead to the correct classification with high probability whilst reducing the expected number of animal deaths relative to existing adaptive designs.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Backward induction; Bayesian decision theory; Bayesian optimal design; LD50 estimation
∗ Tel.: +44 118 378 6565; fax: +44 118 975 3169. E-mail address: [email protected].
0378-3758/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.jspi.2005.08.004

1. Background

1.1. Acute oral toxicity assessment

The acute oral toxicity test is the simplest, and often the first, toxicity test to be conducted on a new chemical substance. A single, large, dose of the substance is given to each of a number of experimental animals and it is observed whether or not they die, where death is
taken to mean either natural death or humane killing within the observation period, usually of 14 days duration. It is imagined that for each experimental animal there is some dose such that after exposure to any greater dose the animal will die and after exposure to any lesser dose the animal will survive. This is sometimes called the tolerance dose for that animal. Doses are usually measured on a per body weight basis, and expressed in mg/kg body weight, which will hereafter be abbreviated to mg/kg.

As animals may be considered as being chosen at random from some large population, and different animals have different tolerance doses, the tolerance dose for any individual animal may be treated as a random variable. It is common to assume that the logarithm of the tolerance dose follows either a normal or logistic distribution. If some dose, d, is administered to some animal, then the probability of death for that animal is the probability that its tolerance dose is less than d. The probability of death is thus given by the cumulative distribution function of the normal or logistic distribution. Historically, the normal distribution has been most used (see, for example, Finney, 1971) following Bliss (1934a, b) and other early workers in this area. In practice, however, it is difficult to distinguish between the logistic and normal distributional forms, and we will, for mathematical convenience, assume that the logarithm of the tolerance dose follows a logistic distribution. In this case the probability of death at dose d is given by

pr(death; d) = logit^{-1}(β(log(d) − l))   (1)

for some β and l, where logit(p) = log(p/(1 − p)). Considered as a function of d, this is called the dose–response curve. The parameter l is the logarithm of the mean or median tolerance dose, and e^l is usually called the median lethal dose, or LD50. This is a measure of the toxicity of the substance; a smaller value corresponds to smaller doses causing lethal toxicity, that is to a more toxic substance. The parameter β is related to the variance of the tolerance dose distribution, with smaller values of β corresponding to a larger variance. It is sometimes called the slope of the dose–response curve.

1.2. The LD50 test

Traditionally, the aim of the acute oral toxicity study has been the estimation of l, or equivalently, of the LD50. Groups of animals are exposed to doses spread around the anticipated LD50, in what is usually called the LD50 test (see, for example, Hamilton, 1991), and logistic regression analysis is used to estimate both l and β. An approximate confidence interval for l can be obtained via Fieller's theorem (see, for example, Collett, 1991, for details).

In practice, the estimate of the LD50 is used to assign the substance under investigation to one of a number of toxic classes. Under the globally harmonised system of classification (GHS) (OECD, 1998), substances are assigned to classes 1–5, with substances in class 1 being the most toxic, according to their estimated LD50 as shown in Table 1. Substances in class 5 are sometimes termed unclassified, or a separate class for substances with estimated LD50 greater than 5000 mg/kg may be used. The classification given affects restrictions on transportation and use of the substance, as well as how it must be labelled. Substances assigned to classes 1 and 2 are labelled fatal if swallowed, substances assigned to class 3
Table 1
GHS acute oral toxicity classifications

Class   Range for estimated LD50 (mg/kg)
1       LD50 ≤ 5
2       5 < LD50 ≤ 50
3       50 < LD50 ≤ 300
4       300 < LD50 ≤ 2000
5       2000 < LD50
are labelled toxic if swallowed and substances assigned to class 4 are labelled harmful if swallowed.

In the most recent version of the LD50 test (OECD, 1987) prior to the deletion of the guideline in December 2002, five or more animals are tested at each of at least three test doses. This test has been criticised for a number of reasons, including the large number of animals required and the fact that many animals are tested at doses above the LD50, leading to considerable suffering (see, for example, Balls, 1992 and Zbinden and Flury-Roversi, 1981). Through the 1980s and 1990s, three alternative approaches to the design of acute animal toxicity studies that lead to the use of fewer animals and/or testing at lower doses have been developed. These alternatives were revised in 2001 to ensure that they allow classification under the GHS classification system described above, prior to the deletion of OECD guideline 401 in 2002. The alternatives, the up-and-down procedure, the fixed dose procedure, and the acute toxic class method, are described in detail in the next section. All three methods use adaptive designs in which the doses used depend on the responses previously observed.

In this paper, we consider the construction of optimal Bayesian designs for this classification problem. These designs will optimise some utility function which will include terms expressing the desires to classify correctly and to minimise the number of animals required. The method for construction of the optimal designs is described in Section 3 below, and a comparison with the traditional LD50 test and the current adaptive designs is reported in Section 4. The paper concludes in Section 5 with discussion of the problem and the results obtained.

2. Existing adaptive designs for the assessment of acute oral toxicity

2.1. The up-and-down procedure
The up-and-down procedure (UDP) (OECD, 2001c) is based on the adaptive method proposed by Bruce (1985), which has its foundations in the method proposed by Dixon and Mood (1948). Using this method, animals are tested one at a time. The idea is that, depending on whether an animal dies or survives, the next animal is tested at a lower or higher dose. The OECD guideline proposes that, in the absence of knowledge of the compound under investigation, the first animal should receive a dose of 175 mg/kg. The logarithm of the dose used is then increased or decreased in steps of size log(10)/2 depending on the outcome of each animal. If an animal dies after receiving a dose of 1.75 mg/kg, or an animal survives
after receiving a dose of 5500 mg/kg, the same dose is used for the next animal, so that these are, respectively, the smallest and largest doses used. The procedure stops as soon as any of four criteria is met: three animals in succession die after receiving a dose of 1.75 mg/kg; three animals in succession survive after receiving a dose of 5500 mg/kg; of six consecutively dosed animals, exactly three survive and three die; or an 82% confidence interval for l has width less than or equal to 1.83 on the log(mg/kg) scale (equivalent to likelihood ratios comparing the maximum likelihood estimate of the LD50 with doses above and below this estimate by a factor of 2.5 both being less than 2.5). Following the procedure, the guideline recommends that l be estimated via probit regression assuming a fixed value for the variance of the normal tolerance distribution. Following estimation of l, a GHS classification is assigned according to the estimated LD50 value.

The procedure may be modified if properties of the substance under investigation are known prior to the acute toxicity test. For example, the starting dose may be chosen to be similar to the anticipated LD50, and the amount by which doses of consecutive animals differ may be increased or decreased according to belief about β. For the comparison reported below, however, it was supposed that no prior knowledge of the test substance was available, so that the default procedure described in the preceding paragraph was assumed to have been conducted.

2.2. The fixed dose procedure

The aim of the fixed dose procedure (FDP) (OECD, 2001a) is not to provide an estimate of l, but to assign the substance under investigation directly to one of the GHS classes. With this aim in mind, animals are tested at doses equal to the boundaries between the GHS classes. Each animal in the procedure is thus tested at one of the fixed doses, 5, 50, 300 or 2000 mg/kg.

The procedure is illustrated in Fig. 1 and starts with a sighting study to determine the starting dose for the main study. In the sighting study, animals are tested one at a time, with the first animal tested at 300 mg/kg. If an animal survives without any sign of toxicity, the next animal is tested at the next higher fixed dose. If an animal shows some sign of toxicity, possibly including death, the next animal is treated at the next lower fixed dose. The sighting study is completed if an animal survives without toxicity at 2000 mg/kg, in which case the main study starts at this dose; if an animal dies at 5 mg/kg, in which case the substance is assigned to GHS class 1 without the main study being performed; if, at any dose, an animal shows non-fatal evident toxicity, in which case the main study starts at that dose; or if an animal survives without evident toxicity at one dose and an animal dies at a higher dose, in which case the main study starts at the fixed dose immediately below the lowest at which death was observed.

In the main study, animals are tested in groups of five, except for testing at a dose already used in the sighting study. In this case the previously tested animal is included, so that only four additional animals are tested. If two or more animals die, testing continues at the next lower dose, unless main-study testing has already been conducted at that dose. If at most one animal dies, but two or more animals show some sign of toxicity, which may include death, testing stops and the substance is assigned to one of the GHS classes as shown in Fig. 1. If no signs of toxicity are observed, testing continues at the next higher dose provided that dose has not already been used for main-study testing or led to toxicity or death in a
[Fig. 1 near here: flow diagram of the FDP. Sighting study (1 animal per dose at 5, 50, 300 and 2000 mg/kg; d: death, t: non-fatal toxicity, n: no toxicity) followed by the main study (5 animals per dose, including sighting study animals; d: ≥2 deaths, t: <2 deaths with ≥1 showing toxicity, n: no toxicity), leading to a GHS classification.]

Fig. 1. The fixed dose procedure (FDP).
sighting study animal. Testing stops, and the substance is assigned to one of the GHS classes, if evident toxicity with at most one death is observed at any dose, if two or more deaths are observed at 5 mg/kg, if no toxic effects are observed at 2000 mg/kg, or if two or more deaths are observed at one dose with no toxic effects observed at the dose immediately below. Classification is as indicated in Fig. 1.

An important feature of the design of the FDP is that testing at a given dose is stopped as soon as death is observed at that dose in either the sighting study or the main study, and that the procedure stops as soon as toxicity is observed in the main study. This means that no test is conducted at or above a dose that has already been seen to lead to the death of any animal, or above a dose that has led to non-fatal toxicity in any animal. This, together with the use of non-fatal evident toxicity as an endpoint, leads to testing being generally conducted at low doses, reducing animal suffering. This is in marked contrast to the LD50 test and UDP, in which many animals are tested at doses close to or above the LD50.

In order to enable comparison with the other methods, we evaluate below a version of the FDP that ignores non-fatal toxicity, or equivalently assumes that non-fatal toxicity is not observed at any of the test doses. As considered in the discussion, ignoring non-fatal toxicity may increase the number of animals required by the procedure, but leads to results consistent with the dose–response curve for the probability of death for the substance rather than requiring specification of an additional dose–response curve for non-fatal toxicity.

2.3. The acute toxic class method

The acute toxic class (ATC) method (OECD, 2001b) is similar to the FDP in that animals are tested in groups at fixed doses corresponding to the boundaries of the GHS classes
[Fig. 2 near here: flow diagram of the ATC method. Groups of 3 animals are tested at 5, 50, 300 or 2000 mg/kg, with movement between doses determined by the number of deaths observed (0, 1 or 2–3; ≥1 on retesting after a single death), leading to a GHS classification.]

Fig. 2. The acute toxic class method (ATC).
with the aim of classifying the substance under investigation directly rather than via estimation of the LD50. Unlike the FDP, this procedure is based on observation of death alone.

The method is illustrated in Fig. 2. Animals are tested in groups of three, with a starting dose of 300 mg/kg. If no animals die, testing continues at the next higher dose. If two or three animals die, testing continues at the next lower dose. If one animal dies, three further animals are tested at this dose, with testing continuing at the next higher dose if none of these animals die, and at a lower dose otherwise. Testing continues in this way until two or more deaths are observed at the lowest dose, one or fewer deaths are observed at the highest dose, or at most one death occurs at one dose with at least two deaths at the adjacent higher dose. The substance under investigation is then assigned to one of the GHS classes as shown in Fig. 2. As for the UDP, the starting dose for either the FDP or ATC could be modified according to the anticipated LD50 value.
3. Optimal Bayesian designs

Let N denote the sample size for a trial to assess acute oral toxicity. In a fixed sample size design, such as the LD50 test, N is fixed in advance. In an adaptive design, N is a random variable, depending on the observed data. We suppose that the design is such that N takes some maximum value m. For i = 1, . . . , N, let di denote the dose at which animal i is tested, and let xi be the observed response, with xi = 0 if animal i survives, and xi = 1 if
animal i dies. A design is a set of rules which, given the data observed at any point in the trial, specify whether the trial should continue, and if so at which dose, and, if not, what classification should be given.

Let d(i) = (d1, . . . , di) and x(i) = (x1, . . . , xi) denote vectors of length i giving the doses used and observed responses for the first i animals tested. The design specifies a value for τ(i, d(i), x(i)), equal to 1 if the trial is to be stopped after observation of these doses and data and 0 otherwise. Values of τ(i, d(i), x(i)) are defined for all those d(i) and x(i) such that τ(j, (d1, . . . , dj), (x1, . . . , xj)) = 0 for all j < i, with (d1, . . . , dj) and (x1, . . . , xj) denoting the vectors of length j comprising the first j elements of d(i) and x(i), as these d(i) and x(i) are exactly those vectors of doses and responses for which the trial has not already stopped. Since m is the maximum sample size, τ(m, d(m), x(m)) = 1 for all d(m) and x(m), so that continuation beyond this number of observations is not possible. For all i = 1, . . . , m and d(i) and x(i) for which τ is defined, the design also specifies a value for di+1 if τ(i, d(i), x(i)) = 0, and for the classification that will be assigned to the substance, which will be denoted by ω(i, d(i), x(i)), if τ(i, d(i), x(i)) = 1. The UDP, FDP and ATC method described above are all examples of such designs. The LD50 test is also an example, in this case a particularly simple one, since N and d1, . . . , dN are fixed in advance.

Suppose that, for a substance with properties given by l and β, we can express the desirability of a classification ω after observation of n animals by some utility. This utility will be considered as a function of l and β for each ω and n, and written as u_{ω,n}(l, β). Two possible forms are discussed below.

In general, the sample size, N, with observed value n, and the observed data at the end of the trial, x(n), are both random variables. The classification given by the procedure, ω(n, d(n), x(n)), and hence the utility, u_{ω(n,d(n),x(n)),n}, are therefore also random variables. The distributions of these random variables depend on the design of the study and, for any given design, on the properties of the substance under investigation, that is, on the values of l and β for a dose–response curve of the form (1). For a given design, we may therefore consider the expected value of u_{ω(n,d(n),x(n)),n}(l, β) for fixed values of l and β, given by

E(u | l, β) = Σ_{j=1}^{5} Σ_{ω(n,d(n),x(n))=j} u_{j,n}(l, β) pr(n, d(n), x(n) | l, β),

where the first sum is taken over the five possible classifications, the second sum is taken over all possible n, d(n) and x(n) for which the trial may stop and lead to classification j, and pr(n, d(n), x(n) | l, β) denotes the probability for the design that the trial stops after this number of animals with these dose levels and data, given l and β.

Suppose that the substances to be tested can be considered to be drawn at random from some population of substances. This population can be summarised by a joint prior distribution for the parameters l and β specifying the properties of the substances, enabling a Bayesian approach.
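To make the expectation E(u | l, β) concrete, the sketch below estimates it by Monte Carlo for a toy sequential design (a hypothetical illustration, not one of the OECD procedures: three animals per boundary dose, starting from the lowest, stopping at the first dose with two or more deaths). The utility used is a correct-classification indicator minus a cost per animal, one of the forms discussed below; the cost value c = 0.01 and all function names are illustrative assumptions.

```python
import math
import random

DOSES = [5.0, 50.0, 300.0, 2000.0]   # GHS class boundary doses (mg/kg)
C = 0.01                             # illustrative cost per animal tested

def p_death(dose, l, beta):
    """Probability of death at `dose` under the logistic model of Eq. (1)."""
    x = max(-500.0, min(500.0, beta * (math.log(dose) - l)))
    return 1.0 / (1.0 + math.exp(-x))

def true_class(l):
    """The correct GHS class for a substance with LD50 = e^l."""
    ld50 = math.exp(l)
    for j, d in enumerate(DOSES):
        if ld50 <= d:
            return j + 1
    return 5

def toy_design(l, beta, rng):
    """Toy sequential design: 3 animals per dose from the lowest upward;
    stop and classify at the first dose showing >= 2 deaths (class 5 if
    no dose does).  Returns (classification, animals used)."""
    used = 0
    for j, d in enumerate(DOSES):
        deaths = sum(rng.random() < p_death(d, l, beta) for _ in range(3))
        used += 3
        if deaths >= 2:
            return j + 1, used
    return 5, used

def expected_utility(l, beta, sims=20000, seed=0):
    """Monte Carlo estimate of E(u | l, beta) for the toy design."""
    rng, total = random.Random(seed), 0.0
    for _ in range(sims):
        cls, used = toy_design(l, beta, rng)
        total += (1.0 if cls == true_class(l) else 0.0) - C * used
    return total / sims
```

Averaging this quantity over draws of (l, β) from the prior would give the prior expected utility considered next.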
Let π(l, β) denote the density of the joint prior distribution for l and β. For a given design, the prior expected utility is equal to

∫∫ E(u | l, β) π(l, β) dl dβ.

Possible forms for the prior distribution of l and β are discussed below. The prior expected utility for any given design can be obtained by numerical integration. The construction of a sequential design to maximise the prior expected value of a specified utility function is described in the next section.

3.1. Construction of optimal designs using dynamic programming

If both the maximum total number of animals, m, and the number of doses available for testing each animal are finite, the design to optimise a specified utility function may be obtained by a method known as dynamic programming. This approach, which is described in a general setting by, for example, Berger (1985), is based on the principle of optimality of Bellman (1957): an optimal design has the property that, at any point in the trial, the design for the remainder of the trial must be optimal. This means that the optimal design can be obtained by working backwards through the trial, as described in detail in this setting in the remainder of this section.

Consider first the case i = m, and suppose that data x(m) = (x1, . . . , xm) have been observed at doses d(m) = (d1, . . . , dm). As i = m, no further testing may be conducted, since τ(m, d(m), x(m)) = 1. The only design choice is therefore the classification ω(m, d(m), x(m)) to be given. Given the observed data, the posterior expected utility from classification j is

E(u_{j,m} | d(m), x(m)) = ∫∫ u_{j,m}(l, β) π(l, β | d(m), x(m)) dl dβ,

where π(l, β | d, x) is the posterior joint density of l and β given the observed data and doses. In the u_{j,m}(l, β) term in the integrand, it is l and β that are considered random. The value of ω(m, d(m), x(m)) is considered fixed, and is chosen to give the largest expected gain, so that the optimal value of ω(m, d(m), x(m)) is given by

ω(m, d(m), x(m)) = argmax_{j ∈ {1,...,5}} {E(u_{j,m} | d(m), x(m))},

with this optimal action giving expected utility equal to

G(m, d(m), x(m)) = max_{j ∈ {1,...,5}} {E(u_{j,m} | d(m), x(m))}.

Suppose now that i < m and that the expected utility for the optimal design from observation i′ onwards, G(i′, d(i′), x(i′)), has been found for all i′, i < i′ ≤ m, for all d(i′) and x(i′). Suppose that data x(i) have been observed at doses d(i). Design choices at this point in the trial are the specification of τ(i, d(i), x(i)), and the value of ω(i, d(i), x(i)) if τ(i, d(i), x(i)) = 1 or the value of di+1 if τ(i, d(i), x(i)) = 0. That is, the design needs to specify whether the trial should stop and classify the substance, and if so into which class, or to continue the
trial, and if so, with what dose used for the next animal. If we set τ(i, d(i), x(i)) = 1 and assign the substance under investigation to class j, the posterior expected utility is equal to

E(u_{j,i} | d(i), x(i)) = ∫∫ u_{j,i}(l, β) π(l, β | d(i), x(i)) dl dβ.

The optimal classification if the trial stops is therefore given by

argmax_{j ∈ {1,...,5}} {E(u_{j,i} | d(i), x(i))},

leading to expected posterior utility of

max_{j ∈ {1,...,5}} {E(u_{j,i} | d(i), x(i))}.   (2)

If we set τ(i, d(i), x(i)) = 0 and choose the next dose to be di+1, the expected posterior utility is given by the average expected posterior utility taken over the possible outcomes from the test of the (i + 1)th animal. If we denote by d(i)(di+1) and x(i)(xi+1) the vectors of length i + 1 equal to (d1, . . . , di, di+1) and (x1, . . . , xi, xi+1), respectively, this average is equal to

Gcont(i, d(i)(di+1), x(i)) = ∫∫ Σ_{xi+1=0}^{1} G(i + 1, d(i)(di+1), x(i)(xi+1)) pr(xi+1 | di+1, l, β) π(l, β | d(i), x(i)) dl dβ,

where pr(xi+1 | di+1, l, β) is given by (1). The best dose to use if the trial is to continue is that di+1 which maximises this posterior expected utility, giving posterior expected utility of

max_{di+1} {Gcont(i, d(i)(di+1), x(i))},   (3)

the maximum being taken over the finite set of available doses. The optimal action is either to stop and give the best classification or to continue with the best dose, depending on which of the expected utilities given by (2) or (3) is larger. The posterior expected utility from this optimal action will be denoted by G(i, d(i), x(i)), and G(i, d(i), x(i)) may be found for all x(i) and d(i), allowing backward induction to proceed. In this way, the whole optimal design may be obtained.

The dynamic programming method thus consists of a series of alternating expectations and maximisations. The computer program to obtain the optimal design reported below works backwards from the maximum sample size, m, determining the optimal action for each possible data set that might be observed. When considering the possible data sets, the order in which the animals were tested is not important. At a stage when a total of i animals are being considered, therefore, each possible pair d(i) and x(i) can be summarised by the number of animals dying and surviving at each of the four test doses. Ignoring the ordering of the tested animals, then, the number of possible data sets of size i is given by the number of ways in which the i animals can be divided into eight different groups. This is equal to (i+7 choose 7) (see
Feller, 1967, Chapter 2, Section 5). In the dynamic programming algorithm, the expected utility from continuing and testing the next animal at a certain dose is given by a weighted sum of the utilities for optimal actions for the data sets with one additional animal dying or surviving at that dose, weighted by the probabilities of death and survival. The optimal action and the corresponding expected utility for each data set of size i can thus be obtained, stored, and used to find the optimal actions for data sets of size i − 1, and so on to give the entire optimal design.

For a maximum sample size of 15 animals, this calculation is feasible, but the size of the problem increases rapidly with m. For larger m, the efficiency of the program might be improved by observing that, for some data sets, irrespective of the dose used for the next animal and the outcome observed, the study will always lead to the same classification. In this case it must be optimal to stop the study at this point, giving this classification, so that comparison of all possible options is not required.

3.2. Utility functions

The utility function, u_{ω,n}(l, β), expresses the desirability of classification ω after observation of n animals for a substance with LD50 of e^l and dose–response curve slope β. Let Ω(l) be defined to equal 1 if e^l ≤ 5, 2 if 5 < e^l ≤ 50, 3 if 50 < e^l ≤ 300, 4 if 300 < e^l ≤ 2000 and 5 if e^l > 2000, so that Ω(l) is the correct GHS classification for a substance with LD50 of e^l. We would therefore like ω to be equal to Ω(l) with high probability. This will be expressed by a utility function taking relatively high values for l such that ω = Ω(l) and relatively low values for other l. We would also like n to be small. This will be expressed by a utility function u_{ω,n} that is decreasing in n. In the comparison below we will consider utility functions of two forms. The first form has

u_{ω,n}(l, β) = I(ω = Ω(l)) − cn   (4)
for some choice of c, where I(·) is the indicator function taking value 1 when its argument is true and 0 otherwise. The value of c reflects the trade-off between the conflicting desires to classify correctly and to minimise the number of animals required. The choice of c is discussed below.

The first part of this form of the utility function reflects only the fact that we would like to classify correctly. In practice, not all incorrect classifications may be equally serious. Public health considerations mean that a classification into too stringent a class has less severe implications than classification into a less stringent class. This asymmetry is reflected in the second utility function form, in which

u_{ω,n}(l, β) = −I(ω < Ω(l)) − 2I(ω > Ω(l)) − cn   (5)
for some c, so that the loss associated with an incorrect classification into a less stringent class is twice that associated with an incorrect classification into a more stringent class.

3.3. Constrained optimal designs

A deliberate feature of the FDP is the constraint that, for animal welfare reasons, no animal may be exposed at or above any dose at which death has been observed. Designs
that maximise the prior expected utility subject to this constraint may easily be constructed using the dynamic programming method described above by limiting the range of actions over which the maximum is taken at each decision point in the algorithm.

3.4. Prior distributions

In the Bayesian framework described above, a joint prior distribution for l and β is required. Given the total sample size, n, and the doses, d(n) = (d1, . . . , dn), the data x1, . . . , xn are independent, with xi = 1 with probability logit^{-1}(β(log(di) − l)) and xi = 0 otherwise. The likelihood function is thus equal to

Π_{i=1}^{n} {logit^{-1}(β(log(di) − l))}^{xi} {1 − logit^{-1}(β(log(di) − l))}^{1−xi}.

A conjugate prior for l and β thus has density proportional to

Π_{i=−k}^{0} {logit^{-1}(β(log(di) − l))}^{xi} {1 − logit^{-1}(β(log(di) − l))}^{1−xi}

for some k, d−k, . . . , d0, x−k, . . . , x0. The posterior density for l and β given d(n) and x(n) is then proportional to

Π_{i=−k}^{n} {logit^{-1}(β(log(di) − l))}^{xi} {1 − logit^{-1}(β(log(di) − l))}^{1−xi}.

The prior information may thus be considered as being equivalent to the observation of outcomes xi, i = −k, . . . , 0, at doses di, i = −k, . . . , 0. Whitehead (1997) refers to these observations as pseudo-data. The use of a prior distribution of this sort is considered by Tsutakawa (1975).

In the comparisons below, we consider two prior distributions, both of the conjugate form just described. The first, which we shall call prior 1, has k = 9 with d−9 = · · · = d−5 = 100,000 mg/kg, x−9 = · · · = x−5 = 1, d−4 = · · · = d0 = 0.1 mg/kg and x−4 = · · · = x0 = 0. A contour plot of the density is given in Fig. 3. This prior distribution leads to a high prior probability of death at high doses and a low prior probability of death at low doses, that is, a high prior probability that β > 0. The second prior distribution, prior 2, has k = 15, with d−9, . . . , d0 and x−9, . . . , x0 as for the first prior and d−15 = 5 mg/kg, x−15 = 0, d−14 = d−13 = 50 mg/kg, x−14 = x−13 = 0, d−12 = d−11 = 300 mg/kg, x−12 = 0, x−11 = 1, and d−10 = 2000 mg/kg, x−10 = 1. A contour plot of the density of this prior distribution is shown in Fig. 4. There is a high prior probability that the value of the LD50 is close to 300 mg/kg, that is, that l is close to 5.70. Both priors were taken to have probability mass only in the ranges 0.1 ≤ e^l ≤ 100,000, that is, −2.30 ≤ l ≤ 11.51, and −10 ≤ β ≤ 10, as substances outside these ranges would be very unlikely in practice.

The designs considered will test animals at doses of 5, 50, 300, and 2000 mg/kg. These correspond to the boundaries of the GHS classes, and are the doses used in the FDP and ATC. The optimal designs obtained will thus maximise the prior expected utility as defined above in the class of designs with animals tested at these doses. This class of designs includes the
[Figs. 3 and 4 near here: contour plots of the joint prior densities, with l on the horizontal axis and β on the vertical axis.]

Fig. 3. Contour plot for density of prior 1.

Fig. 4. Contour plot for density of prior 2.
FDP (with the toxicity endpoint ignored) and the ATC, but does not include the UDP, since the UDP may use doses chosen from a larger set. The choice of doses is considered further in the discussion section below.

4. Comparison of testing procedures

In this section, optimal Bayesian designs obtained using dynamic programming as outlined in Section 3 are compared with the UDP, FDP and ATC method and the LD50 test described in Sections 1 and 2 above.
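The conjugate pseudo-data prior of Section 3.4 is straightforward to evaluate pointwise, which is all that numerical integration of the prior expected utility requires. The sketch below computes the unnormalised density of prior 1 and the corresponding unnormalised posterior; function names are illustrative assumptions, not part of the paper's implementation.

```python
import math

def logit_inv(x):
    """Inverse logit, clamped to avoid overflow in exp()."""
    x = max(-500.0, min(500.0, x))
    return 1.0 / (1.0 + math.exp(-x))

# Pseudo-data defining prior 1 (Section 3.4): five deaths at
# 100,000 mg/kg and five survivals at 0.1 mg/kg.
PRIOR1 = [(100000.0, 1)] * 5 + [(0.1, 0)] * 5

def unnormalised_density(l, beta, pseudo_data):
    """Density proportional to the likelihood of the pseudo-data,
    i.e. the conjugate prior of Section 3.4 up to a constant."""
    dens = 1.0
    for d, x in pseudo_data:
        p = logit_inv(beta * (math.log(d) - l))
        dens *= p if x == 1 else 1.0 - p
    return dens

def unnormalised_posterior(l, beta, pseudo_data, doses, outcomes):
    """Unnormalised posterior: pseudo-data augmented by observed data."""
    return unnormalised_density(l, beta,
                                list(pseudo_data) + list(zip(doses, outcomes)))
```

As expected, prior 1 favours β > 0: at l = log 100, the density at β = 1 exceeds that at β = −1 by many orders of magnitude, since a negative slope makes the pseudo-deaths at 100,000 mg/kg very unlikely.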
For each of the two prior distributions described at the end of Section 3.3, designs to optimise utility functions of the forms given by (4) and (5) were obtained, both with and without the constraint that testing should not be conducted at or above a dose at which the death of an animal has been observed. The maximum sample size, m, for the optimal tests was set to be 15, since this is comparable with the maximum sample sizes of 15, 20 and 18, respectively, for the UDP, FDP and ATC method. The traditional LD50 test was assumed to be conducted with 5 animals exposed at each of 5, 50, 300 and 2000 mg/kg, that is 20 animals in total, so that the doses cover the range of doses considered for the GHS classification.

For each design considered, expected values of the probability of a correct GHS classification, the total number of animals used in the procedure, and the number of deaths observed were calculated with substances drawn from each of the two prior distributions described in Section 3.3 above. The optimal designs obtained using each of priors 1 and 2 were thus evaluated under each of the two priors. This gives an indication of the robustness of the optimal design approach to misspecification of the prior distribution for l and β. Also calculated for each design under each prior distribution was the prior expected probability of a classification into either the correct GHS class or a more stringent class. As discussed below, this could be of interest since classification into a category more stringent than the correct GHS class might be considered much less serious than classification into a less stringent class.

For the optimal procedures, the FDP and the ATC method, these prior expected values were calculated analytically from the probabilities of all possible outcomes of the testing procedure, using the path induction method described by, for example, Hardwick and Stout (1999).
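Path induction, in the sense used for these calculations, amounts to exact enumeration of every outcome path of a sequential rule, accumulating path probabilities; Hardwick and Stout (1999) describe more efficient versions that share computation across paths. The sketch below is a hedged illustration under the assumed probit model: `toy_rule` is an invented two-animal rule for demonstration only, not one of the procedures compared here.

```python
import math

DOSES = [5.0, 50.0, 300.0, 2000.0]   # GHS class-boundary doses

def p_death(dose, l, beta):
    """Probit dose-response model: probability of death at `dose`."""
    z = beta * (math.log(dose) - l)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def exact_outcome_probs(rule, l, beta):
    """Enumerate every outcome path of a sequential rule exactly.

    `rule(history)` returns either the index of the next dose in DOSES
    or a terminal classification (a string).  The result maps each
    classification to its exact probability under the model (l, beta)."""
    probs = {}
    stack = [((), 1.0)]              # (history of (dose_idx, outcome), path prob)
    while stack:
        history, p_path = stack.pop()
        action = rule(history)
        if isinstance(action, str):  # stop and classify
            probs[action] = probs.get(action, 0.0) + p_path
            continue
        p = p_death(DOSES[action], l, beta)
        stack.append((history + ((action, 1),), p_path * p))
        stack.append((history + ((action, 0),), p_path * (1.0 - p)))
    return probs

def toy_rule(history):
    """Invented two-animal rule for illustration: dose both at 50 mg/kg."""
    if len(history) == 2:
        deaths = sum(x for _, x in history)
        return "LD50 below 50" if deaths == 2 else "LD50 above 50"
    return 1                         # index of 50 mg/kg in DOSES

probs = exact_outcome_probs(toy_rule, math.log(300.0), 2.0)
```

Averaging such exact outcome probabilities over a grid of (l, β) values weighted by the prior density gives the prior expected properties reported in the tables.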
These calculations were repeated for a range of values of l and β, and the prior expected values calculated using numerical integration over the prior density. For the UDP, estimates of the prior expected values were obtained by a similar method with the path induction replaced by simulation of 1000 runs of the procedure for each pair of l and β values used in the numerical integration. To avoid the need to fit a large number of probit regression models, the estimates of l were obtained by taking the mean of the logarithms of the doses used in the procedure following the observation of both death and survival, as suggested by Brownlee et al. (1953). To minimise the computational requirements, estimates of the prior expected values for the LD50 test were obtained by simulation of the procedure for 10,000 (l, β) pairs simulated from their joint prior distribution.

Table 2 gives the prior expected properties of the LD50 test, UDP, FDP and ATC method under the two prior distributions described above. Comparing the existing procedures, Table 2 shows that the traditional LD50 test has the highest probability of leading to the correct classification: the prior expected probability is 0.949 under prior 1 and 0.832 under prior 2. This is, however, at a considerable cost in terms of the number of animals required. The LD50 test requires a total of 20 animals, which is nearly twice the expected number required for any of the other procedures. The expected number of deaths is also higher for the LD50 test than for the other tests.

The ATC method and UDP have similar properties, requiring about 10 animals and with a probability of correct classification of about 0.9 under prior 1 and 0.7 under prior 2. The ATC method gives the correct classification with slightly higher probability than the UDP
Table 2
Prior probabilities of correct classification and prior expected sample sizes and numbers of deaths for existing procedures

Procedure      pr(correct)   pr(correct or more stringent)   E(N)   E(deaths)

Substances drawn from prior 1
LD50 test        0.95          0.97                          20.0     10.3
UDP              0.89          0.95                          10.7      5.3
FDP              0.67          0.99                           5.8      1.8
ATC method       0.92          0.99                          10.5      4.5

Substances drawn from prior 2
LD50 test        0.83          0.92                          20.0      7.4
UDP              0.73          0.80                          11.0      5.3
FDP              0.68          0.98                           7.4      2.0
ATC method       0.74          0.96                           9.7      3.3
Table 3
Prior probabilities of correct classification and prior expected sample sizes and numbers of deaths for optimal procedures based on the symmetric utility function given by (4)

Procedure                                       pr(correct)   pr(correct or more stringent)   E(N)   E(deaths)

Substances drawn from prior 1
Unconstrained optimal design assuming prior 1     0.95          0.98                           9.8      4.6
Constrained optimal design assuming prior 1       0.89          0.96                           5.0      2.2
Unconstrained optimal design assuming prior 2     0.92          0.94                          12.3      7.5
Constrained optimal design assuming prior 2       0.49          0.54                           4.2      1.5

Substances drawn from prior 2
Unconstrained optimal design assuming prior 1     0.85          0.94                          13.5      6.2
Constrained optimal design assuming prior 1       0.70          0.899                          7.4      2.0
Unconstrained optimal design assuming prior 2     0.86          0.94                           9.8      4.6
Constrained optimal design assuming prior 2       0.73          0.89                           6.2      2.0
and requires slightly fewer animals on average. The expected number of deaths is also lower for the ATC method than for the UDP, particularly under prior 2. The FDP has a considerably smaller expected sample size, but also leads to the correct classification much less often, with an expected probability of about 0.67 under both prior distributions. The number of deaths is very small for the FDP. For example, under prior 1, about a third of the animals tested using the FDP are expected to die, compared to about a half of those tested using the other procedures, indicating that lower doses are used in the FDP than in the other methods.

An important property of the FDP is illustrated by the third column of Tables 2 and 3, which shows the prior expected probability of a classification that is either correct or more stringent. This is highest for the FDP, showing that although an incorrect classification is often given, it is almost always into too stringent a class. This reflects a deliberate asymmetry in the FDP: the risk of an incorrect classification that is too stringent is favoured over the risk of one that is not stringent enough.
Table 4
Prior probabilities of correct classification and prior expected sample sizes and numbers of deaths for optimal procedures based on the asymmetric utility function given by (5)

Procedure                                       pr(correct)   pr(correct or more stringent)   E(N)   E(deaths)

Substances drawn from prior 1
Unconstrained optimal design assuming prior 1     0.95          0.99                           9.6      4.2
Constrained optimal design assuming prior 1       0.88          0.99                           5.1      2.2
Unconstrained optimal design assuming prior 2     0.93          0.96                          12.3      6.9
Constrained optimal design assuming prior 2       0.53          0.59                           6.2      2.2

Substances drawn from prior 2
Unconstrained optimal design assuming prior 1     0.83          0.97                          13.1      5.4
Constrained optimal design assuming prior 1       0.67          0.96                           7.2      2.0
Unconstrained optimal design assuming prior 2     0.85          0.96                          12.1      5.0
Constrained optimal design assuming prior 2       0.71          0.96                           5.7      2.0
We consider next the optimal procedures constructed as described above. Prior expected properties under both of the prior distributions for optimal tests, both with and without the constraint described above, using utility functions of the forms given by (4) and (5), are given in Tables 3 and 4, respectively. These designs were obtained with the cost per observation set equal to 0.0002 for the utility function given by (4) and 0.0004 for the utility function given by (5). These values were chosen to give tests with expected sample sizes similar to those in Table 2.

When the prior distribution used in the construction of the design is the same as that used in the evaluation, the unconstrained optimal designs lead to a prior probability of correct classification that is higher than for any other design considered. In addition, the expected sample size is smaller than that for any of the other procedures except the FDP under prior 1, and just exceeds that for the ATC method under prior 2. For the unconstrained designs, the use of a prior distribution that is different from that used in the evaluation does not appear to have a very large effect on the expected probability of correct classification, though there is an increase in the expected sample size and number of deaths.

The effect of the constraint that testing cannot be conducted at or above doses that have led to death is also shown in Table 3. It can be seen that the imposition of such a constraint leads to procedures in which lower doses are used. This leads to a considerable reduction in the expected number of deaths, bringing this close to that for the FDP and far below that for the other existing procedures. The expected number of animals required is also reduced. The probability of correct classification is also reduced, though it remains higher than that for the FDP, and is similar to that for the UDP under both prior distributions.
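The backward-induction construction of Section 3 can be illustrated in miniature. The sketch below is an assumption-laden toy, not the paper's implementation: it uses a small invented discrete prior grid with β fixed at 2, a utility of the symmetric form (4) (probability of correct classification minus the cost of 0.0002 per animal quoted above), and classification into the posterior-modal class on stopping.

```python
import math
from functools import lru_cache

DOSES = [5.0, 50.0, 300.0, 2000.0]              # GHS class-boundary doses
BOUNDS = [math.log(d) for d in DOSES]           # class boundaries on the l scale
COST = 0.0002                                   # cost per animal, as for utility (4)

def p_death(dose, l, beta):
    """Probit dose-response model: probability of death at `dose`."""
    z = beta * (math.log(dose) - l)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def classify(l):
    """Class index (0-4) of a substance with log-LD50 l."""
    return sum(l > b for b in BOUNDS)

# Illustrative discrete prior: 13 equally weighted values of l, beta fixed at 2.
PRIOR = [(float(i), 2.0, 1.0 / 13) for i in range(-2, 11)]

def posterior(history):
    """Reweight the prior grid by the likelihood of the observed history."""
    w = []
    for l, beta, p0 in PRIOR:
        lik = p0
        for dose_idx, x in history:
            p = p_death(DOSES[dose_idx], l, beta)
            lik *= p if x == 1 else 1.0 - p
        w.append((l, beta, lik))
    tot = sum(p for _, _, p in w)
    if tot == 0.0:                              # numerically impossible history
        return PRIOR
    return [(l, b, p / tot) for l, b, p in w]

@lru_cache(maxsize=None)
def expected_utility(history, n_left):
    """Backward induction: best expected utility achievable from this state."""
    post = posterior(history)
    class_probs = {}
    for l, _, p in post:
        c = classify(l)
        class_probs[c] = class_probs.get(c, 0.0) + p
    stop_u = max(class_probs.values())          # stop: classify into modal class
    if n_left == 0:
        return stop_u
    best = stop_u
    for dose_idx in range(len(DOSES)):          # continue: try each dose
        pd = sum(p * p_death(DOSES[dose_idx], l, b) for l, b, p in post)
        cont = (pd * expected_utility(history + ((dose_idx, 1),), n_left - 1)
                + (1.0 - pd) * expected_utility(history + ((dose_idx, 0),), n_left - 1)
                - COST)
        best = max(best, cont)
    return best
```

Here expected_utility((), m) evaluates the optimal design's prior expected utility for maximum sample size m; recording the maximising action at each state recovers the design itself. The designs reported in Tables 3 and 4 apply the same logic with the continuous conjugate prior and m = 15.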
An effect of the reduction in the number of animals is an increased reliance on the information expressed in the prior distribution. This is particularly marked when the relatively informative prior 2 is assumed at the design stage but the distribution of substances in the evaluation is given by prior 1, in which case the probability of correct classification with the design obtained is very low.

The effect of the use of a utility function of the form given by (5), as shown in Table 4, is as expected. Relative to the designs of Table 3, the optimal designs obtained tend to lead to
more frequent misclassification into a more stringent class to minimise the probability of the more heavily penalised misclassification into less stringent classes.
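The simulation-based evaluation of the UDP used in the comparisons above can be sketched as follows. This is a hedged illustration: the one-step up-and-down rule on the four fixed doses is a simplification of the actual UDP guideline, and the probit model p(d) = Φ(β(log d − l)) is assumed. The Brownlee-style estimator (the mean log dose over animals tested once both a death and a survival have been observed) follows the description in Section 4.

```python
import math
import random

DOSES = [5.0, 50.0, 300.0, 2000.0]   # GHS class-boundary doses

def p_death(dose, l, beta):
    """Probit dose-response model: probability of death at `dose`."""
    z = beta * (math.log(dose) - l)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def simulate_udp(l, beta, n_animals=15, start_idx=1, rng=random):
    """One run of a simplified up-and-down scheme on the fixed doses:
    move one dose down after a death, one dose up after a survival.
    Returns the Brownlee-style estimate of l: the mean log dose over
    animals tested once both a death and a survival have been observed."""
    idx = start_idx
    seen = {0: False, 1: False}
    all_logs, late_logs = [], []
    for _ in range(n_animals):
        log_dose = math.log(DOSES[idx])
        all_logs.append(log_dose)
        if seen[0] and seen[1]:
            late_logs.append(log_dose)
        x = 1 if rng.random() < p_death(DOSES[idx], l, beta) else 0
        seen[x] = True
        idx = max(idx - 1, 0) if x == 1 else min(idx + 1, len(DOSES) - 1)
    used = late_logs or all_logs     # fall back if one outcome never occurred
    return sum(used) / len(used)

# Monte Carlo estimate of the expected estimate of l for one (l, beta) pair.
rng = random.Random(1)
estimates = [simulate_udp(math.log(300.0), 2.0, rng=rng) for _ in range(1000)]
mean_estimate = sum(estimates) / len(estimates)
```

Repeating such simulations over a grid of (l, β) pairs and integrating over the prior density gives estimated prior expected properties of the kind reported for the UDP in Table 2.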
5. Discussion

In this paper, we have considered adaptive designs for the assessment of acute oral toxicity in experimental animals. Adaptive designs have been proposed following criticism of the traditional LD50 test. The aim of these methods is primarily not to estimate any parameter of the dose–response curve, but to assign the substance under investigation to one of a small number of classes according to its acute toxicity level. The design of the procedures is necessarily a compromise between accurate classification and minimisation of both the number of animals required and the number exposed to lethal doses. The use of adaptive designs is attractive in that both the total number of animals tested and the number tested at high doses can be reduced relative to a fixed sample design without reducing the probability of correct classification.

OECD testing guidelines describe three adaptive methods: the UDP, the FDP and the ATC method. These are compared with one another and with the traditional LD50 test in Section 4 above. The comparison shows that these methods do lead to sample sizes and expected numbers of deaths considerably smaller than those for the standard design, but that, for the designs as proposed, this is at the cost of some accuracy in classification.

Section 3 above describes how Bayesian adaptive designs may be constructed to optimise a specified utility function. Two utility functions have been considered: the first includes terms expressing the desire both to classify correctly and to minimise the total number of animals required, and the second includes an additional term expressing the difference between incorrect classifications that are too stringent and those that are not stringent enough. The optimal designs obtained are compared with the traditional LD50 test and with the existing adaptive designs in Section 4 above.
The Bayesian designs are shown to provide probabilities of correct classification similar to those of the traditional LD50 test, but with smaller expected sample sizes and expected numbers of deaths.

The use of a Bayesian decision-theoretic approach to the construction of designs provides considerable flexibility in terms of the choice of the utility function and possible constraints on the procedure. For example, properties are given in Section 4 above for designs in which testing is never repeated at doses at or above one previously observed to lead to death, and even subject to such a constraint, the optimal designs are shown to compare favourably with the existing adaptive methods. Because of the difficulty of specifying a utility function that accurately expresses the risks of incorrect classification and increased sample sizes, this flexibility could actually be one of the disadvantages of the Bayesian decision-theoretic approach in this context. It might be considered desirable to take the pragmatic approach of obtaining optimal designs for a number of possible utility functions and comparing their properties. Such an approach was taken when obtaining the designs described above to determine a value of the cost per observation that led to reasonable expected sample sizes. The computing time required for such a comparison means, however, that this is a formidable task. It also
seems likely that, as in other application areas, some reluctance to use a decision-theoretic approach is due to the need to state explicitly the aims of the experiment and to express these through the utility function.

The use of a Bayesian design method also requires the specification of a prior distribution for the properties of the substances to be investigated. Results are given above both for designs obtained using prior distributions identical to the distribution of substances used for the evaluation and for designs obtained using priors that differ from that distribution, indicating the degree to which the properties of the optimal designs depend on the prior distribution used. The results indicate that, in terms of the probability of correct classification, the use of a relatively non-informative prior leads to designs that are more robust to differences in the true distribution of substances.

The problem that we have considered is one of classification according to the LD50 value rather than estimation of the LD50. Classically, the latter problem has received more attention. Up-and-down designs such as that proposed by Bruce (1985), on which the current OECD testing guideline is based, represent one of several approaches to the design of sequential trials for estimation of the dose leading to death of half, or some other fraction, of the population. A number of authors, including, for example, Durham and Flournoy (1994), have considered randomised versions of up-and-down type designs. A well-known alternative approach to the same problem is the stochastic approximation method of Robbins and Monro (1951), in which, rather than being chosen from a finite set, the dose levels are chosen to be increasingly close to the estimated LD50 as the study progresses. A decision-theoretic approach similar to that considered above was used by Freeman (1970) to design trials for estimation of the LD50.
His designs minimise a loss function (equivalent to maximising the corresponding negative utility) composed of the sum of two parts, the first proportional to the squared difference between the estimated and true LD50 values and the second proportional to the sample size. The aims of the design are thus similar to those expressed by the utility function given by (3) above. To reduce the computational effort required, Freeman considered designs in which testing occurs at three fixed doses.

In the construction of the optimal designs considered above, a restriction has been imposed that the dose levels used correspond to the classification boundaries, that is, that only dose levels of 5, 50, 300 and 2000 mg/kg are used. Restriction to the use of such doses is, at first sight, reasonable. If it is desired to determine whether the LD50 of the substance under investigation is, for example, above or below 50 mg/kg, then in the absence of any other information the use of this dose for all animals tested is optimal. If prior information on the substance is available, however, or if testing has already been conducted at other doses, the use of some dose other than 50 mg/kg might improve the prior expected probability of correctly classifying the LD50 as being above or below 50 mg/kg. The optimal designs obtained are thus optimal only in the class of designs in which all testing is conducted at the dose levels corresponding to the classification boundaries. The construction of globally optimal designs requires, at each point at which the study may continue, the choice of the next dose from a continuous range. This might lead to designs more similar to the stochastic approximation approach, and is an area in which further research is required.

With the exception of the FDP, the methods considered in this paper are all based on the assessment of acute oral toxicity by the observation of the death of animals. The FDP is
unique in that observation of non-lethal toxicity is also used in the procedure. The comparison given in Section 4 above ignored this aspect of the FDP. The inclusion of non-fatal toxicity means that the doses at which testing occurs can be lower, with some substances classified without any deaths being observed. It would, in principle, be possible to modify other testing methods to use information on non-fatal toxicity in a similar way. The disadvantage with such an approach is that, since the correct classification is taken to depend on the parameters of the dose–response curve relating dose to the probability of death, some relationship between non-fatal and fatal toxicity must be assumed if information on the former is to be used to lead to a classification. In the FDP, classifications are given assuming that, if non-fatal toxicity occurs at one dose, the next highest fixed dose is very likely to lead to the death of two or more animals, so that the substance can be classified directly without any further testing. The use of non-fatal toxicity as an endpoint in other testing methods is also an area in which further research would be useful.

The results given above show that improvements over the existing adaptive designs can be made by the use of optimal designs. Further, if optimal designs are based on a vague prior distribution, they perform well even if the prior does not accurately describe the distribution of substances tested. It is clear, however, that the existing procedures are quite different in their properties. The traditional LD50 test gives a high degree of accuracy, but requires what is often considered to be an excessively large sample size. The FDP generally requires a very small sample size and is unlikely to lead to classification into a less stringent class than that warranted by the LD50 value, but leads to too stringent a classification fairly often.
The UDP and ATC method have properties partway between those of the FDP and the traditional LD50 test, leading to the correct classification more often than the former but requiring more animals, and leading to the correct classification less often than the latter but requiring fewer animals. If improvements over these existing adaptive tests are to be made, possibly through the use of Bayesian decision theory as discussed in this paper, a requirement is that the most desirable properties of procedures for the assessment of acute oral toxicity should be identified.

Acknowledgements

The author is grateful to Anne Whitehead, John Whitehead and two anonymous referees for their helpful comments.

References

Balls, M., 1992. Time to reform toxic tests. New Scientist, 2nd May 1992, 31–33.
Bellman, R., 1957. Dynamic Programming. Princeton University Press, Princeton, NJ.
Berger, J.O., 1985. Statistical Decision Theory and Bayesian Analysis. Springer, New York.
Bliss, C.I., 1934a. The method of probits. Science 79, 38–39.
Bliss, C.I., 1934b. The method of probits—a correction. Science 79, 409–410.
Brownlee, K.A., Hodges, J.L., Rosenblatt, M., 1953. The up-and-down method with small samples. J. Amer. Statist. Assoc. 48, 262–277.
Bruce, R.D., 1985. An up-and-down procedure for acute toxicity testing. Fundamental Appl. Toxicology 5, 151–157.
Collett, D., 1991. Modelling Binary Data. Chapman & Hall, London.
Dixon, W.J., Mood, A.M., 1948. A method for obtaining and analyzing sensitivity data. J. Amer. Statist. Assoc. 43, 109–126.
Durham, S.D., Flournoy, N., 1994. Random walks for quantile estimation. In: Berger, J., Gupta, S. (Eds.), Statistical Decision Theory and Related Topics, vol. V. Springer, New York.
Feller, W., 1967. An Introduction to Probability Theory and its Applications, vol. 1. Wiley, New York.
Finney, D.J., 1971. Probit Analysis. Cambridge University Press, Cambridge.
Freeman, P.R., 1970. Optimal Bayesian sequential estimation of the median effective dose. Biometrika 57, 79–89.
Hamilton, M.A., 1991. Estimation of the typical lethal dose in acute toxicity studies. In: Krewski, D., Franklin, C. (Eds.), Statistics in Toxicology. Gordon and Breach Science Publishers, New York.
Hardwick, J.P., Stout, Q.F., 1999. Using path induction for evaluating sequential allocation procedures. SIAM J. Sci. Comput. 21, 67–87.
OECD, 1987. Guidelines for the testing of chemical substances. No. 401 Acute oral toxicity. OECD, Paris.
OECD, 1998. Harmonized integrated hazard classification system for human health and environmental effects of chemical substances. OECD, Paris.
OECD, 2001a. Guidelines for testing of chemical substances, No. 420 Acute oral toxicity—fixed dose procedure. OECD, Paris.
OECD, 2001b. Guidelines for testing of chemical substances, No. 423 Acute oral toxicity—acute toxic class method. OECD, Paris.
OECD, 2001c. Guidelines for testing of chemical substances, No. 425 Acute oral toxicity—up-and-down procedure. OECD, Paris.
Robbins, H., Monro, S., 1951. A stochastic approximation method. Ann. Math. Statist. 22, 400–407.
Tsutakawa, R.K., 1975. Bayesian inference for bioassay. Technical Report 52, Mathematical Sciences, University of Missouri, Columbia.
Whitehead, J., 1997. Bayesian decision procedures with application to dose-finding studies. Internat. J. Pharmaceutical Medicine 11, 201–208.
Zbinden, G., Flury-Roversi, M., 1981. Significance of the LD50 test for the toxicological evaluation of chemical substances. Arch. Toxicology 47, 77–99.