CHAPTER 16
Binomial and Multinomial Distributions

Basic Biostatistics for Medical and Biomedical Practitioners
https://doi.org/10.1016/B978-0-12-817084-7.00016-4
© 2019 Elsevier Inc. All rights reserved.

BASIC CONCEPTS

Introduction

Observations that fall into one of two mutually exclusive categories are dichotomous or binary. Any population with only two categories is a binomial population: newborn infants are male or female, or the outcome of a treatment is cure or failure. We can ask certain questions about binomial populations. For example, what is the likelihood that in a family of five children there will be three girls? What is the probability that 5 patients out of 50 will die from meningitis? How likely is it that 7 out of 20 people with a certain type of seizure will develop paralysis?

The allocation to each category must be unambiguous. The number of items (counts) in one group (e.g., r successes) gives information about the proportion of the total number (n) attributed to that group. Define the proportion of the number in one of the groups to the total (r/n) as π for a theoretical population and as p for a sample from that population. Therefore

$$\pi = \frac{\text{number of newborn girls}}{\text{total number of newborns}}$$

in an infinitely large number of births, but the same proportion is p if related to a smaller number of births in a particular hospital.

Because this is a dichotomous variable in which the total probability must add to 1, if the probability of one of the two categories is π, then the probability of the other category must be 1 − π. In a sample, the probability of one category is p, and the other is 1 − p, also called q. By convention, one category is called a success and the other a failure, without equating the terms "success" and "failure" with the desirability of the outcome.

An examination of the dichotomous outcomes of a study is often termed a Bernoulli trial if
1. The number of trials is defined.
2. In each trial, the event of interest either occurs ("Success") or does not occur ("Failure").
3. The trials are independent, that is, the outcome of one trial does not affect the outcome of the next trial.
4. The probability of success does not change throughout the trial (Example 16.1).

Example 16.1
The proportion π for male births is approximately 0.5 (0.512 in the United States). If a family has five children, how often would there be three girls? No girls? Four boys? To approach this problem, consider what happens when a fair coin is flipped many times, recording whether it
comes down heads (H) or tails (T). This is a good analogy, because for heads in coin tossing π is exactly 0.5. Table 16.1 gives the possible results for the first few tosses.

Table 16.1 Binomial table

Number of tosses   Possible combinations
1                  1H       1T
2                  1HH      2HT      1TT
3                  1HHH     3HHT     3HTT      1TTT
4                  1HHHH    4HHHT    6HHTT     4HTTT     1TTTT
5                  1HHHHH   5HHHHT   10HHHTT   10HHTTT   5HTTTT   1TTTTT

H = head; T = tail. The numbers (coefficients) indicate how many possible combinations there are for each set.
Turning this figure through 90 degrees gives a tree diagram (Fig. 16.1).
Fig. 16.1 Binomial table as tree diagram.
Consider the numbers of possible combinations of heads and tails for any given number of tosses. First toss: a head (H) or a tail (T). Second toss: HH, HT, TH, or TT. The difference between the two middle combinations is the order in which the outcomes occur. If order is immaterial, the outcomes are HH, 2HT, TT.
Third toss: HHH (three heads) once; HHT, HTH, THH (two heads and one tail); HTT, THT, TTH (one head and two tails); or TTT (three tails) once. Fourth toss: four heads (HHHH) once; three heads and one tail, and there are four ways of doing this (HHHT, HHTH, HTHH, THHH); two heads and two tails, and there are six ways of doing this (HHTT, HTHT, HTTH, THHT, THTH, TTHH); one head and three tails, with four possible combinations; and four tails (TTTT) once. Fifth toss: HHHHH, with only one such combination possible; four heads and one tail, with 5 possible combinations; three heads and two tails, with 10 possible combinations (work this out for practice); two heads and three tails, also with 10 possible combinations; one head and four tails, with 5 possible combinations; and five tails, with only one such combination. Isolating the numbers of possible combinations (the coefficients) in the earlier diagram gives Table 16.2.
Table 16.2 Bernoulli coefficients

Number of tosses   Number of combinations
1                            1   1
2                          1   2   1
3                        1   3   3   1
4                      1   4   6   4   1
5                    1   5  10  10   5   1
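The coefficients in Table 16.2 can be checked by brute force. The following is a minimal sketch, not part of the original text: it lists every ordered sequence of heads and tails and counts how many contain each number of heads. The function name `count_by_heads` is a hypothetical helper introduced for illustration.

```python
from itertools import product
from math import comb

def count_by_heads(n_tosses):
    """Count the ordered H/T sequences of length n_tosses by number of heads.

    The returned list runs from all heads down to no heads,
    the same left-to-right order as Table 16.1.
    """
    counts = [0] * (n_tosses + 1)
    for sequence in product("HT", repeat=n_tosses):
        counts[sequence.count("H")] += 1
    return counts[::-1]

for n in range(1, 6):
    row = count_by_heads(n)
    # Each count should equal the binomial coefficient n! / (k! (n - k)!).
    assert row == [comb(n, k) for k in range(n, -1, -1)]
    print(n, row)
```

Enumeration grows as 2^n, so it is useful only as a check for small n; for larger n the coefficient is computed directly.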
Each line begins and ends with a 1, because there can be only one combination with all heads and one with all tails. In each line except the first, each of the other coefficients is the sum of the two nearest coefficients in the preceding line. Thus 5 for five tosses is the sum of the 1 and the 4 above it in line 4, 10 is the sum of the 4 and the 6 above it in line 4, and so on. In each line, the pattern of coefficients is symmetrical because the coefficient for, say, 5 − 1 = 4 heads is 5, and this is the same as the coefficient for 5 − 4 = 1 head. More generally, in n trials, the coefficient for k successes is the same as for n − k successes. This pattern is termed Pascal's triangle, described in a letter to Fermat in 1654, but it was known in China in the 12th century.

These numbers of combinations, also termed coefficients, are one of the two items needed to work out the probabilities. The other is the probability of getting any particular combination. The second item is easy to calculate. Assuming that the result of one toss does not influence any of the other results, the probability of any particular combination is the product of the probabilities of the individual results. The probability of getting one head is 0.5.
The probability of getting two heads is 0.5 × 0.5 = 0.5² = 0.25. The probability of getting three heads is 0.5 × 0.5 × 0.5 = 0.5³ = 0.125. The probability of getting two heads and a tail in a particular order is 0.5² × 0.5 = 0.5³ = 0.125, and so on for other numbers of tosses and combinations.

Now all we need to do is to multiply the probability of getting any particular combination by the number of times that combination occurs, that is, by the coefficient. For example, for four tosses, we get the following probabilities:

P(Four heads) = 0.5 × 0.5 × 0.5 × 0.5 = 0.5⁴ = 0.0625.
P(Three heads and one tail) = P(3 heads) × P(1 tail) × number of combinations = 0.5³ × 0.5 × 4 = 0.25, because there are 4 ways of having three heads and one tail: HHHT, HHTH, HTHH, THHH.
P(Two heads and two tails) = 0.5² × 0.5² × 6 = 0.375, because there are 6 possible combinations: HHTT, HTHT, HTTH, THHT, THTH, TTHH.
P(One head and three tails) = 0.5¹ × 0.5³ × 4 = 0.25.
P(Four tails) = 0.5⁴ = 0.0625.

Adding up all these probabilities gives 0.0625 + 0.25 + 0.375 + 0.25 + 0.0625 = 1.0000 (Example 16.2).

Example 16.2
Perform similar calculations when π is not 0.5. Take a 6-sided die, call the number 1 a success, and any of the numbers 2–6 a failure. Then the probability of getting a success is 1/6, and the probability of getting a failure is 5/6.

For one toss of the die, the probability of one success = 1/6, and the probability of one failure = 5/6.

For two tosses, the probability of two successes is 1/6 × 1/6 = (1/6)² = 1/36. The probability of one success and one failure is 1/6 × 5/6 × 2 = (1/6)¹ × (5/6)¹ × 2 = 10/36. The probability of two failures is 5/6 × 5/6 = (5/6)² = 25/36.

For three tosses, the probability of three successes is 1/6 × 1/6 × 1/6 = (1/6)³ = 1/216. The probability of two successes and one failure is 1/6 × 1/6 × 5/6 × 3 = (1/6)² × (5/6)¹ × 3 = 15/216. The probability of one success and two failures is 1/6 × 5/6 × 5/6 × 3 = (1/6)¹ × (5/6)² × 3 = 75/216. The probability of three failures is 5/6 × 5/6 × 5/6 = (5/6)³ = 125/216.

The sum of these probabilities is (1 + 15 + 75 + 125)/216 = 1, as expected.
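The coin and die calculations follow one pattern: coefficient × (probability of success)^r × (probability of failure)^(n − r). A minimal sketch using exact fractions, so that the results can be compared with 1/216, 15/216, and so on without rounding; the function name `binomial_probabilities` is an assumption made for this illustration.

```python
from fractions import Fraction
from math import comb

def binomial_probabilities(n, pi):
    """Return [P(0 successes), P(1 success), ..., P(n successes)]."""
    return [comb(n, r) * pi**r * (1 - pi) ** (n - r) for r in range(n + 1)]

# Four tosses of a fair coin, success = head.
coin = binomial_probabilities(4, Fraction(1, 2))
# Three tosses of a die, success = throwing a 1.
die = binomial_probabilities(3, Fraction(1, 6))

print([float(p) for p in coin])
print([str(p) for p in die])   # fractions are shown in lowest terms
print(sum(coin), sum(die))     # each set of probabilities sums to 1
```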
Bernoulli formula

With greater numbers of tosses, the calculations of the probabilities (especially the coefficients) become increasingly tedious. Fortunately, Jakob Bernoulli (1654–1705), one of a large, famous and disputatious family of Swiss physicists and mathematicians, showed that these expressions were the expansion of the algebraic expression $(\pi + [1 - \pi])^n$, where n represents the number of observations. For there to be r successes (with probability π of each success), there must therefore be n − r failures. The probability for any value of r and n is

$$_nC_r\,\pi^r (1 - \pi)^{n-r},$$

where $_nC_r$ (also written $\binom{n}{r}$) represents the number of possible combinations of r successes and n − r failures (i.e., the binomial coefficient). This is the probability function.

From combinatorial theory (Chapter 12), $_nC_r = \frac{n!}{r!(n-r)!}$. In our example, the probability would be

$$\frac{n!}{r!(n-r)!}\,\pi^r (1 - \pi)^{n-r}.$$

This formula has two components. The first, the binomial coefficient $\frac{n!}{r!(n-r)!}$, gives the number of possible arrangements of r successes and n − r failures as set out in part in Table 16.2. The second factor, involving π, is the probability for any one of the different ways of getting r successes and n − r failures.

The probability of getting 4 successes out of 6 tosses is:

$$\frac{6!}{4!(6-4)!}\,(1/6)^4 (5/6)^{6-4} = \frac{720}{24 \times 2} \times \frac{1}{1296} \times \frac{25}{36} = 0.008038.$$

The probability of getting one success out of 6 tosses is:

$$\frac{6!}{1!(6-1)!}\,(1/6)^1 (5/6)^{6-1} = \frac{720}{120} \times \frac{1}{6} \times \frac{3125}{7776} = 0.401878.$$

To determine the probability of getting 7 successes out of 12 total trials in which p = 0.37, this would be calculated from:

$$_{12}C_7\,(0.37)^7 (1 - 0.37)^{12-7} = \frac{12!}{7!(12-7)!} \times 0.37^7 \times (1 - 0.37)^{12-7} = 792 \times 0.0009493 \times 0.09924 = 0.07461.$$

What is the probability that 5 patients out of 50 with meningitis will die? To answer this question, we need information about the average proportion of these patients who die. Assume that this probability is p = 0.1. Then the probability that we want is:

$$_{50}C_5\,(0.1)^5 (1 - 0.1)^{50-5} = \frac{50!}{5!(50-5)!} \times (0.1)^5 \times (1 - 0.1)^{50-5} = 2118760 \times 0.000010 \times 0.008727964 = 0.1849.$$
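The four worked probabilities above can be reproduced with a few lines of code. This is a sketch only; `binom_pmf` is a hypothetical name, and `math.comb` (Python 3.8 or later) supplies the binomial coefficient.

```python
from math import comb

def binom_pmf(r, n, pi):
    """Probability of exactly r successes in n independent trials."""
    return comb(n, r) * pi**r * (1 - pi) ** (n - r)

four_of_six = binom_pmf(4, 6, 1 / 6)        # die example, 4 successes in 6 tosses
one_of_six = binom_pmf(1, 6, 1 / 6)         # die example, 1 success in 6 tosses
seven_of_twelve = binom_pmf(7, 12, 0.37)    # 7 successes in 12 trials, p = 0.37
five_of_fifty = binom_pmf(5, 50, 0.1)       # meningitis example, p assumed to be 0.1

for value in (four_of_six, one_of_six, seven_of_twelve, five_of_fifty):
    print(round(value, 6))
```

Working with the coefficient as an exact integer and only then multiplying by the floating-point powers avoids the overflow that computing 50! in floating point by hand can invite.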
Free online binomial calculators are available at http://stattrek.com/onlinecalculator/binomial.aspx, http://binomial.calculatord.com, http://ncalculators.com/statistics/binomial-distribution-calculator.htm, and http://www.statstodo.com/BinomialTest_Pgm.php.

Problem 16.1
In an influenza epidemic the mortality rate is 0.7%. What is the probability of death in 3 out of the next 50 patients?
Binomial basics

The mean of a binomial distribution is nπ, and its variance is nπ(1 − π). Because both π and 1 − π are less than one, their product is smaller than π, so that the variance nπ(1 − π) is less than the mean nπ. If the mean and variance are calculated as proportions of 1, rather than as numbers based on the sample size, then divide the mean by n and the variance by n² to give π and $\frac{n\pi(1-\pi)}{n^2} = \frac{\pi(1-\pi)}{n}$, respectively.
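The expressions nπ and nπ(1 − π) can be confirmed directly from the probability function. A minimal sketch, with n = 50 and π = 0.1 chosen only as an illustrative assumption; `binomial_mean_variance` is a hypothetical helper.

```python
from math import comb

def binomial_mean_variance(n, pi):
    """Mean and variance computed term by term from the binomial probabilities."""
    pmf = [comb(n, r) * pi**r * (1 - pi) ** (n - r) for r in range(n + 1)]
    mean = sum(r * prob for r, prob in enumerate(pmf))
    variance = sum((r - mean) ** 2 * prob for r, prob in enumerate(pmf))
    return mean, variance

mean, variance = binomial_mean_variance(50, 0.1)
print(mean, 50 * 0.1)            # compare with n * pi
print(variance, 50 * 0.1 * 0.9)  # compare with n * pi * (1 - pi)
```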
The normal approximation

The binomial distribution is approximately normal if nπ and n(1 − π) are both > 9 (Hamilton, 1990). Fig. 16.2 shows the distribution for different values of n when p = 0.1, 0.3, or 0.5.
Fitting a binomial distribution to a set of trial results

With a typical Mendelian recessive gene, for example, for Tay-Sachs disease, the probability of a child having the disease if both parents carry the recessive gene is 0.25. One hundred such families with 5 children are examined, and the children with the disease are identified (Table 16.3). Does the observed distribution fit a binomial distribution? Calculate the expected probabilities of having 0, 1, 2, 3, 4, or 5 children with the disease.

$P(r = 0) = 1 \times 0.25^0 \times 0.75^5 = 0.237305$.
$P(r = 1) = 5 \times 0.25^1 \times 0.75^4 = 0.395508$.
$P(r = 2) = \frac{5!}{2!\,3!} \times 0.25^2 \times 0.75^3 = 0.263672$.
$P(r = 3) = \frac{5!}{3!\,2!} \times 0.25^3 \times 0.75^2 = 0.087891$.
$P(r = 4) = 5 \times 0.25^4 \times 0.75^1 = 0.014648$.
$P(r = 5) = 1 \times 0.25^5 \times 0.75^0 = 0.0009766$.

These add up to 1.0000006, the difference from 1 being due to rounding errors. Now multiply each probability by N, the total number of families, and compare observed and expected numbers by a chi-square test (Chapter 14), as in Table 16.3.

Table 16.3 The expected values show the theoretical distribution for the recessive population

Number of involved children (r)   Observed f   Expected f   χ²
0                                 26           23.73        0.22
1                                 37           39.55        0.16
2                                 22           26.37        0.72
3                                 11            8.79        0.56
4                                  4            1.46        4.42
5                                  0            0.10        0.10
Total                             100                       6.18

The observed and expected values are similar. The total chi-square is 6.18 with 4 degrees of freedom, so that P = 0.1861. There is therefore no convincing evidence against the null hypothesis that these samples are drawn from a population in which the mean probability of the disease occurring is 0.25.

Fig. 16.2 The bigger the product of N and p, the closer the approximation to the normal curve.
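The goodness-of-fit calculation for the 100 families can be sketched as follows. This is an illustration rather than the author's procedure: the expected frequencies are left unrounded, so the total comes out a little below the 6.18 obtained from the rounded table entries, and the P value uses the 4 degrees of freedom stated in the text. The helper `chi2_upper_tail_even_df` is a hypothetical name; it relies on the closed form of the chi-square upper tail that holds only for even degrees of freedom.

```python
from math import comb, exp

observed = [26, 37, 22, 11, 4, 0]   # families with 0..5 affected children
n_children, pi = 5, 0.25
n_families = sum(observed)

expected = [
    n_families * comb(n_children, r) * pi**r * (1 - pi) ** (n_children - r)
    for r in range(n_children + 1)
]
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_upper_tail_even_df(x, df):
    """P(chi-square > x) for even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!."""
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= (x / 2) / j
        total += term
    return exp(-x / 2) * total

p_value = chi2_upper_tail_even_df(chi_square, 4)
print([round(e, 2) for e in expected])
print(round(chi_square, 2), round(p_value, 4))
```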
Cumulative binomial probabilities

Some problems call for an estimate of more than one probability. For example, if for a Mendelian recessive characteristic, the risk of a child with the recessive disease is 0.25, how likely is it that there will be families of 5 children with two or more diseased children? This could be calculated by working out the probabilities of having 0 or 1 affected child, adding these together and then subtracting that total from 1. If the sample sizes were larger, however, this could be very time consuming. For example, in samples of 100 people, what is the probability that there will be >20 with type B blood group, which occurs in about 14% of the population? Fortunately there are tables in which cumulative probabilities have been calculated and listed for varying values of π (McGhee, 1985; Kennedy and Neville, 1986; Hahn and Meeker, 1991) (Example 16.3). The calculations can be done easily on the free interactive websites http://onlinestatbook.com/analysis_lab/binomial_dist.html and http://www.danielsoper.com/statcalc3/calc.aspx?id=71.

Example 16.3
Using the blood group data before, n = 100, p = 0.14. From the calculator, enter 0.14, 20, and 100, and get P(X > 20) = 0.0356 and P(X ≥ 20) = 0.0614.
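Cumulative probabilities are sums of individual binomial terms, so they are easy to compute directly when tables or websites are not at hand. A minimal sketch; `binom_pmf` and `binom_upper_tail` are hypothetical helper names, and the only point to watch is whether the boundary value itself is included.

```python
from math import comb

def binom_pmf(r, n, pi):
    """Probability of exactly r successes in n independent trials."""
    return comb(n, r) * pi**r * (1 - pi) ** (n - r)

def binom_upper_tail(r_min, n, pi):
    """P(X >= r_min): sum of the terms from r_min up to n."""
    return sum(binom_pmf(r, n, pi) for r in range(r_min, n + 1))

# Families of 5 children, risk 0.25: two or more affected children.
p_two_or_more = binom_upper_tail(2, 5, 0.25)

# Type B blood group, n = 100, p = 0.14.
p_more_than_20 = binom_upper_tail(21, 100, 0.14)   # X > 20 means X >= 21
p_20_or_more = binom_upper_tail(20, 100, 0.14)     # X >= 20 includes 20 itself

print(round(p_two_or_more, 4), round(p_more_than_20, 4), round(p_20_or_more, 4))
```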
Confidence limits

Continuity correction

A binomial series consists of the probabilities of discrete events: 3 children, 7 teeth, and so on. Because these events are discrete they are represented by integers. When this probability set is examined as if it were a continuous distribution, an error appears as diagrammed in Fig. 16.3. For the sample illustrated before, with n = 20 and p = 0.3, the probability P(X ≥ 10), that is, the probability of getting 10 or more, can be calculated from the binomial theorem. This probability is 0.0480 (using the online calculator referred to previously). But with the normal approximation and the formula

$$z = \frac{X_i - Np}{\sqrt{Npq}}$$

(see the following)

$$z = \frac{10 - 20 \times 0.3}{\sqrt{20 \times 0.3 \times 0.7}} = \frac{4}{2.0494} = 1.95, \quad P = 0.0256.$$

Fig. 16.3 Diagram of continuity correction.

As shown in Fig. 16.3, this value must be too small. The value 10 really extends from 9.5 to 10.5, so that calculating the probability from the continuous curve using only 10 and above ignores the shaded area in the figure. To allow for this, subtract 0.5 from 10 and then

$$z = \frac{9.5 - 20 \times 0.3}{\sqrt{20 \times 0.3 \times 0.7}} = \frac{3.5}{2.0494} = 1.71, \quad P = 0.0436.$$

This is much closer to the true probability. When N is large, the correction is less important.

The correction is not always by subtracting 0.5. Table 16.4 summarizes the changes. For example, for X ≥ 10 you have to start at 9.5, because starting at 10.5 does not meet the requirements. If you want X < 10, again subtract 0.5 or else the distribution will end at 10.5. For X ≤ 10 add 0.5 or the distribution will end at 9.5 (see Fig. 16.3).

Table 16.4 Continuity corrections

Discrete   Continuous         Correction
X = 10     9.5 < X < 10.5     ±0.5
X > 10     X > 10.5           +0.5
X ≤ 10     X < 10.5           +0.5
X < 10     X < 9.5            −0.5
X ≥ 10     X > 9.5            −0.5

Sometimes the correction is not 0.5. For example, when calculating the confidence limits of a proportion, the numerator includes the number of events as a proportion of n. The correction has to be on the same scale and becomes $\frac{0.5}{n} = \frac{1}{2n}$.
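The effect of the continuity correction can be seen by computing all three quantities for n = 20, p = 0.3: the exact binomial tail, the normal approximation starting at 10, and the normal approximation starting at 9.5. This sketch assumes Python 3.8 or later for `statistics.NormalDist`; the variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 20, 0.3
q = 1 - p

# Exact probability of 10 or more successes.
exact = sum(comb(n, r) * p**r * q ** (n - r) for r in range(10, n + 1))

sd = sqrt(n * p * q)
z_uncorrected = (10 - n * p) / sd
z_corrected = (9.5 - n * p) / sd      # X >= 10 starts at 9.5 on the continuous scale

standard_normal = NormalDist()
approx_uncorrected = 1 - standard_normal.cdf(z_uncorrected)
approx_corrected = 1 - standard_normal.cdf(z_corrected)

print(round(exact, 4), round(approx_uncorrected, 4), round(approx_corrected, 4))
```

The corrected approximation should land closer to the exact value than the uncorrected one, which is the point of Fig. 16.3.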
Using the continuity correction, for a one-tailed test,

$$z = \frac{|r - np| - 0.5}{\sqrt{npq}}.$$

(Do not correct if |r − np| ≤ 0.5.) For two-tailed tests, delete the fractional part of |r − np| if it lies between 0 and 0.5, and replace it by 0.5 if the fractional part lies between 0.5 and 1.0. Therefore 7.3 becomes 7, 8.73 becomes 8.5, 5.0 becomes 4.5, and 3.5 becomes 3.0.

Confidence limits can be calculated online from http://statpages.org/confint.html, http://www.danielsoper.com/statcalc3/calc.aspx?id=85, http://www.graphpad.com/quickcalcs/ConfInterval2.cfm, and http://www.biyee.net/data-solution/resources/binomial-confidence-interval-calculator.aspx (the last one is only for the Windows operating system). For example, if N = 20 and p = 0.3, what are the confidence limits for p? From the calculator, the 95% confidence limits for p are 0.1189–0.5428, a very wide range because of the small sample size.

In larger samples, the binomial estimate p is approximately normally distributed about the population proportion π with standard deviation $\sqrt{\frac{\pi(1-\pi)}{n}}$. For the true but unknown standard deviation substitute the sample estimate $\sqrt{\frac{pq}{n}}$. Hence the probability is approximately 0.95 that π lies between the limits $p - 1.96\sqrt{\frac{pq}{n}}$ and $p + 1.96\sqrt{\frac{pq}{n}}$. With the correction for continuity, use $p \pm \left[1.96\sqrt{\frac{pq}{n}} + \frac{1}{2n}\right]$ (Example 16.4).
Example 16.4
If the proportion of people with group B blood in a sample of 100 people is 0.14, what are the 95% confidence limits for that proportion?

p = 0.14, q = 0.86, n = 100. The 95% confidence limits are

$$p \pm \left[1.96\sqrt{\frac{pq}{n}} + \frac{1}{2n}\right] = 0.14 \pm \left[1.96\sqrt{0.14 \times 0.86/100} + 1/200\right] = 0.14 \pm [1.96 \times 0.034699 + 0.005] = 0.14 \pm 0.073,$$

that is, 0.067 to 0.213. (The exact interval is 0.0612–0.2072, so that the approximation is close.) Other slightly more complex formulas can be used, with Wilson's method being highly accurate (see later) (Curran-Everett and Benos, 2004).
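The arithmetic of Example 16.4 can be written as a small function. A sketch only; the name `normal_ci_with_continuity` is an assumption made for this illustration, and z = 1.96 is the usual multiplier for 95% limits.

```python
from math import sqrt

def normal_ci_with_continuity(p, n, z=1.96):
    """Normal-approximation limits for a proportion, widened by 1/(2n)."""
    q = 1 - p
    half_width = z * sqrt(p * q / n) + 1 / (2 * n)
    return p - half_width, p + half_width

lower, upper = normal_ci_with_continuity(0.14, 100)
print(round(lower, 3), round(upper, 3))
```

Nothing in this formula prevents a limit below 0 or above 1 when p is close to either end of the range, which is one reason the other methods mentioned in the text are sometimes preferred.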
Problem 16.2
A study (Jousilahti et al., 1998) showed that serum cholesterol in men in 1992 had the following distribution: <5.0 mmol/L: 6%; 5–6.4 mmol/L: 37%; 6.5–7.9 mmol/L: 41%; >8.0 mmol/L: 16%. Set 95% confidence limits for those with serum cholesterol 5.0–6.4 mmol/L in a sample of 200 men.
Comparing two binomial distributions

This can be done with the chi-square test. Tables or graphs for performing these comparisons can be found in Gardner and Altman (1995). It can also be done by turning the mean values into proportions and using the free program http://vassarstats.net (see Proportions) that also provides confidence intervals.
Estimating sample size

If two binomial distributions are not substantially different, this may be because they are similar (i.e., the null hypothesis is true), or else the null hypothesis may not be true but the numbers in each sample are too small to allow us to reach that conclusion with confidence. If we have the data based on a 2 × 2 table, we can use the program referred to in the preceding paragraph. Sometimes, however, we have only the two proportions. For example, we know that the proportion of patients who die from treatment A is 0.4, and we want to know how many patients we will need to demonstrate a reduction on treatment B to 0.2. A simple way is to use the normal approximation with the formula

$$n_A = \frac{16\,p_{av}(1 - p_{av})}{(p_A - p_B)^2},$$

where $p_A$ and $p_B$ are the two proportions being compared, $p_{av}$ is their average, and $n_A$ is the number in each group (Lehr, 1992; van Belle, 2002). For our example, $p_{av} = (0.4 + 0.2)/2 = 0.3$, so that

$$n_A = \frac{16 \times 0.3(1 - 0.3)}{(0.4 - 0.2)^2} = 84.$$
Alternatively, to determine sample size, use the free online programs http://www.cct.cuhk.edu.hk/stat/proportion/Casagrande.htm, http://statulator.com/SampleSize/ss2P.html, or https://www.stat.ubc.ca/rollin/stats/ssize/b2.html. These do not give identical answers because they are based on different principles, but they are close enough for practical use. In the order given, these programs give 91, 89, and 82.
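The sixteen-times rule attributed to Lehr can be evaluated directly. In this sketch the function name `lehr_sample_size` is hypothetical, and p_av is taken as the simple average of the two proportions, exactly as the formula defines it; for 0.4 versus 0.2 that average is 0.3 and the formula evaluates to 84 per group, in line with the 82 to 91 reported by the online programs.

```python
def lehr_sample_size(p_a, p_b):
    """Approximate number per group: 16 * p_av * (1 - p_av) / (p_a - p_b)^2."""
    p_av = (p_a + p_b) / 2
    return 16 * p_av * (1 - p_av) / (p_a - p_b) ** 2

n_per_group = lehr_sample_size(0.4, 0.2)
print(round(n_per_group, 1))
```

In practice a fractional result would be rounded up to the next whole patient.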
Comparing probabilities

We may want to compare the ratio of two of the probabilities $p_i/p_j$. A 1 − α confidence interval can be created from

$$\log_e L = \log_e(p_i/p_j) - z_{1-\alpha/2}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

and

$$\log_e U = \log_e(p_i/p_j) + z_{1-\alpha/2}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}},$$

where $n_1$ and $n_2$ are the numbers observed in the two categories.
For example, consider the ratio of two blood types, A and B; $p_A/p_B$ = 0.43/0.12 = 3.58. If this ratio is based on the observation of 100 individuals, 43 with A and 12 with B, what is the 95% confidence interval for this ratio?

$$\log L = \log(43/12) - 1.96\sqrt{1/43 + 1/12}, \quad \log U = \log(43/12) + 1.96\sqrt{1/43 + 1/12}.$$

Here log(43/12) = 1.2763 and 1.96 × √(1/43 + 1/12) = 1.96 × 0.3265 = 0.6399, so that log L = 1.2763 − 0.6399 = 0.6364 and log U = 1.2763 + 0.6399 = 1.9162. The lower and the upper limits are the antilogarithms of these values, and these are 1.8897 and 6.7952. These limits would be different for different sample sizes.
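The interval for the blood group ratio can be reproduced as follows. A minimal sketch: `ratio_ci` is a hypothetical helper that takes the two observed counts and works with natural logarithms, as in the formulas above.

```python
from math import exp, log, sqrt

def ratio_ci(count_i, count_j, z=1.96):
    """Confidence limits for the ratio of two category proportions."""
    log_ratio = log(count_i / count_j)
    half_width = z * sqrt(1 / count_i + 1 / count_j)
    # Limits are computed on the log scale and then converted back.
    return exp(log_ratio - half_width), exp(log_ratio + half_width)

lower, upper = ratio_ci(43, 12)
print(round(lower, 3), round(upper, 3))
```

Because the limits are symmetrical on the log scale, they are not symmetrical about the ratio 3.58 itself.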
ADVANCED OR ALTERNATIVE CONCEPTS

Exact confidence limits

Wilson's method for determining confidence limits (Newcombe, 1998) seems to be the most accurate method available. It involves solving the two formulas for the lower (L) and upper (U) confidence limits of a binomial proportion and includes the continuity correction:

$$L = \frac{2np + z^2 - 1 - z\sqrt{z^2 - 2 - 1/n + 4p(nq + 1)}}{2(n + z^2)}$$

$$U = \frac{2np + z^2 + 1 + z\sqrt{z^2 + 2 - 1/n + 4p(nq - 1)}}{2(n + z^2)}$$

For 95% limits, z = 1.96, and the equations reduce to:

$$L = \frac{2np + 2.84 - 1.96\sqrt{1.84 - 1/n + 4p(nq + 1)}}{2(n + 3.84)}$$

$$U = \frac{2np + 4.84 + 1.96\sqrt{5.84 - 1/n + 4p(nq - 1)}}{2(n + 3.84)}$$

For the example used before (p = 0.14, n = 100), the confidence limits are 0.0814–0.2271, for an interval width of 0.146. If p is very low and the lower limit is <0, it is set at zero. If p is very high and the upper limit is >1, it is set at 1.

The continuity correction is not needed for larger sample sizes, and a simpler formula can then be used:

$$\frac{2np + z^2 \pm z\sqrt{z^2 + 4npq}}{2(n + z^2)}.$$

With the same proportions as used previously, the 95% limits are from 0.0852 to 0.2213, quite close to the continuity-corrected limits. This method seems to be more accurate than other proposed methods. It can be performed online at http://vassarstats.net/ (see Frequency data).

Problem 16.3
Calculate 95% confidence limits as in this problem, using Wilson's method.
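Wilson's limits are tedious by hand but short in code. The sketch below is an illustration under stated assumptions, not the author's program: the function name `wilson_ci` is hypothetical, and the continuity-corrected branch uses 4p(nq + 1) under the square root for the lower limit and 4p(nq − 1) for the upper limit, which is what solving the score equation |p − θ| − 1/(2n) = z√(θ(1 − θ)/n) for θ gives.

```python
from math import sqrt

def wilson_ci(p, n, z=1.96, continuity=True):
    """Wilson score limits for a proportion p observed in a sample of n."""
    q = 1 - p
    denominator = 2 * (n + z * z)
    if not continuity:
        centre = 2 * n * p + z * z
        spread = z * sqrt(z * z + 4 * n * p * q)
        return (centre - spread) / denominator, (centre + spread) / denominator
    lower = (2 * n * p + z * z - 1
             - z * sqrt(z * z - 2 - 1 / n + 4 * p * (n * q + 1))) / denominator
    upper = (2 * n * p + z * z + 1
             + z * sqrt(z * z + 2 - 1 / n + 4 * p * (n * q - 1))) / denominator
    # Limits outside the range 0 to 1 are set to the boundary.
    return max(0.0, lower), min(1.0, upper)

corrected = wilson_ci(0.14, 100)
uncorrected = wilson_ci(0.14, 100, continuity=False)
print([round(x, 4) for x in corrected])
print([round(x, 4) for x in uncorrected])
```

The corrected interval always encloses the uncorrected one, because the correction moves each limit outward by an amount related to 1/(2n).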
MULTINOMIAL DISTRIBUTION

This is an extension of the binomial distribution to more than two groups. If there are >2 mutually exclusive categories, $X_1$ is the number of trials producing type 1 outcomes, $X_2$ is the number of trials producing type 2 outcomes, ..., $X_k$ is the number of trials producing type k outcomes, then the event ($X_1 = x_1, X_2 = x_2, \ldots, X_k = x_k$) follows a multinomial distribution if there are n independent trials, each trial results in one of k mutually exclusive outcomes, the probability of a type i outcome is $p_i$, and this probability does not change from trial to trial. Then the distribution of the event ($X_1 = x_1, X_2 = x_2, \ldots, X_k = x_k$) is given by

$$P(X_1 = x_1, \ldots, X_k = x_k) = \frac{n!}{x_1!\,x_2! \cdots x_k!}\; p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k},$$

where $\sum_{i=1}^{k} p_i = 1$, and $x_1, x_2, \ldots, x_k$ are nonnegative integers satisfying $\sum_{i=1}^{k} x_i = n$. The mean of each $X_i$ is given by $\mu_i = np_i$ and its variance is given by $\sigma_i^2 = np_i(1 - p_i)$.

Consider the four main blood group phenotypes. The probabilities of A, B, AB, and O are, respectively, 0.43, 0.12, 0.05, and 0.40. In a study of 17 randomly selected subjects, what is the probability of choosing 5 As, 2 Bs, 1 AB, and 9 Os at random? By the formula,

$$P(A = 5; B = 2; AB = 1; O = 9) = \frac{17!}{5!\,2!\,1!\,9!} \times 0.43^5 \times 0.12^2 \times 0.05^1 \times 0.40^9 = 0.01133.$$

In repeated random samples of size 17, the mean number of persons with type A is 17 × 0.43 = 7.3, with variance 17 × 0.43 × 0.57 = 4.1667, and the standard deviation is about 2.04. Explanations of the formula with examples are given by Ash (1993), Hogg and Tanis (1977), Murphy (1979), and Ross (1984).
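The blood group calculation can be checked with a short function. This is a sketch; `multinomial_pmf` is a hypothetical name, and `math.prod` requires Python 3.8 or later.

```python
from math import factorial, prod, sqrt

def multinomial_pmf(counts, probs):
    """Probability of observing exactly these counts in sum(counts) trials."""
    coefficient = factorial(sum(counts))
    for x in counts:
        coefficient //= factorial(x)   # exact integer division at every step
    return coefficient * prod(p**x for p, x in zip(probs, counts))

counts = [5, 2, 1, 9]                  # A, B, AB, O
probs = [0.43, 0.12, 0.05, 0.40]
probability = multinomial_pmf(counts, probs)

n = sum(counts)
mean_a = n * probs[0]
variance_a = n * probs[0] * (1 - probs[0])
print(round(probability, 5), round(mean_a, 2), round(variance_a, 4), round(sqrt(variance_a), 2))
```

With only two categories the same function reduces to the binomial probability, since the coefficient becomes n!/(r!(n − r)!).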
Problem 16.4
Using the serum cholesterol data given earlier, in repeated random samples of 75 subjects, what is the probability of choosing, in order of ascending concentrations, 3, 26, 43, and 9, respectively?
APPENDIX

Calculation of sample size for unequal numbers: Let the sample sizes be $n_1$ and $n_2$ and let $k = n_1/n_2$. Then the modified total sample size is $N' = N(1 + k)^2/4k$. That is, calculate N for the total sample size, assuming equal sample sizes. Then compute $N'$. Then the two sample sizes are $N'/(1 + k)$ and $kN'/(1 + k)$.
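The adjustment for unequal group sizes is a two-step calculation. A minimal sketch, with hypothetical names and with N = 168 and k = 2 chosen purely as an illustrative assumption:

```python
def unequal_sample_sizes(total_equal_n, k):
    """Split an equal-allocation total N into two groups with ratio k = n1 / n2."""
    n_prime = total_equal_n * (1 + k) ** 2 / (4 * k)   # inflated total N'
    n1 = k * n_prime / (1 + k)
    n2 = n_prime / (1 + k)
    return n_prime, n1, n2

n_prime, n1, n2 = unequal_sample_sizes(168, 2)
print(n_prime, n1, n2)
```

When k = 1 the factor (1 + k)²/4k equals 1, so N′ = N; any other ratio increases the total number required.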
REFERENCES

Ash, C., 1993. The Probability Tutoring Book. An Intuitive Course for Engineers and Scientists. IEEE Press, New York, p. 470.
Curran-Everett, D., Benos, D.J., 2004. Guidelines for reporting statistics in journals published by the American Physiological Society. Am. J. Physiol. Endocrinol. Metab. 287, E189–E191.
Gardner, M.J., Altman, D.G., 1995. Statistics with Confidence: Confidence Intervals and Statistical Guidelines. British Medical Journal, London.
Hahn, G.J., Meeker, W.Q., 1991. Statistical Intervals. A Guide for Practitioners. John Wiley and Sons, Inc, New York.
Hamilton, L.C., 1990. Modern Data Analysis. A First Course in Applied Statistics. Brooks/Cole Publishing Co, Pacific Grove, CA.
Hogg, R.V., Tanis, E.A., 1977. Probability & Statistical Inference. Macmillan Publishing Company, Inc, New York.
Jousilahti, P., Vartiainen, E., Pekkanen, J., Tuomilehto, J., Sundvall, J., Puska, P., 1998. Serum cholesterol distribution and coronary heart disease risk: observations and predictions among middle-aged population in eastern Finland. Circulation 97, 1087–1094.
Kennedy, J.B., Neville, A.M., 1986. Basic Statistical Methods for Engineers and Scientists. Harper and Row, New York.
Lehr, R., 1992. Sixteen s-squared over d-squared: a relation for crude sample size estimates. Stat. Med. 11, 1099–1102.
McGhee, J.W., 1985. Introductory Statistics. West Publishing Company, St. Paul, MN.
Murphy, E.A., 1979. Probability in Medicine. Johns Hopkins University Press, Baltimore.
Newcombe, R.G., 1998. Interval estimation for the difference between independent proportions: comparison of eleven methods. Stat. Med. 17, 873–890.
Ross, S., 1984. A First Course in Probability. Macmillan Publishing Company, New York.
Van Belle, G., 2002. Statistical Rules of Thumb. Wiley Interscience, New York.