Chapter 8
Estimation
A comprehensive study of nature is the most fruitful source of mathematical discoveries.
Joseph Fourier
8.1
INTRODUCTION
As previously described, the main objective of statistical inference is to draw conclusions about a population based on data obtained from a sample, which must be representative of that population. One of the most important goals of statistical inference is the estimation of population parameters, which is the subject of this chapter. For Bussab and Morettin (2011), a parameter can be defined as a function of a set of population values; a statistic as a function of a set of sample values; and an estimate as the value assumed by the estimator in a certain sample. Parameters can be estimated through a single point (point estimation) or through an interval of values (interval estimation). The main point estimation methods are the method of moments, ordinary least squares, and maximum likelihood estimation. The main interval estimation methods, or confidence intervals (CI), are the CI for the population mean when the variance is known, the CI for the population mean when the variance is unknown, the CI for the population variance, and the CI for the proportion.
8.2
POINT AND INTERVAL ESTIMATION
Population parameters can therefore be estimated through a single point or through an interval of values. As examples of population parameters that can be estimated (by point or by interval), we can mention the mean, the variance, and the proportion.
8.2.1
Point Estimation
Point estimation is used when we want to estimate the population parameter of interest through a single value. The estimate is calculated from a sample. Hence, the sample mean ($\bar{x}$) is a point estimate of the true population mean ($\mu$). Analogously, the sample variance ($S^2$) is a point estimate of the population variance ($\sigma^2$), and the sample proportion ($\hat{p}$) is a point estimate of the population proportion ($p$).

Example 8.1: Point Estimation
Consider a luxury condominium with 702 lots. We would like to estimate the average size of the lots, their variance, as well as the proportion of lots for sale. In order to do that, a random sample of 60 lots is collected, revealing an average size of 1750 m² per lot, a variance of 420 (m²)², and a proportion of 8% of the lots for sale. Thus:
(a) $\bar{x} = 1750$ is a point estimate of the true population mean ($\mu$);
(b) $S^2 = 420$ is a point estimate of the true population variance ($\sigma^2$); and
(c) $\hat{p} = 0.08$ is a point estimate of the true population proportion ($p$).
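The three point estimates in Example 8.1 can be sketched in a few lines of Python. The raw data behind the example are not given in the text, so the small sample below (`lot_sizes`, `for_sale`) is purely hypothetical and serves only to show the calculations.

```python
from statistics import mean, variance

# Hypothetical sample (NOT the data behind Example 8.1):
# lot sizes in m^2 and a flag telling whether each lot is for sale.
lot_sizes = [1720.0, 1765.0, 1748.0, 1790.0, 1731.0, 1756.0]
for_sale = [False, True, False, False, False, False]

x_bar = mean(lot_sizes)                 # point estimate of the population mean
s2 = variance(lot_sizes)                # sample variance (divides by n - 1)
p_hat = sum(for_sale) / len(for_sale)   # sample proportion of lots for sale

print(x_bar, s2, p_hat)
```

Note that `statistics.variance` uses the divisor n - 1, which matches the usual definition of the sample variance $S^2$.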
Data Science for Business and Decision Making. https://doi.org/10.1016/B978-0-12-811216-8.00008-2 © 2019 Elsevier Inc. All rights reserved.
PART IV Statistical Inference
8.2.2
Interval Estimation
Interval estimation is used when we are interested in finding an interval of possible values in which the parameter being estimated is located, with a certain confidence level $(1 - \alpha)$, where $\alpha$ is the significance level.

Example 8.2: Interval Estimation
Consider the information in Example 8.1. However, instead of using a point estimate of each population parameter, let's use an interval estimate:
(a) The [1700–1800] interval contains the average size of the 702 condominium lots, with 99% confidence;
(b) The [400–440] interval contains the population variance of the size of the lots, with 95% confidence;
(c) The [6%–10%] interval contains the proportion of lots for sale in the condominium, with 90% confidence.
8.3
POINT ESTIMATION METHODS
The main point estimation methods are the method of moments, ordinary least squares, and maximum likelihood estimation.
8.3.1
Method of Moments
In the method of moments, the population parameters are estimated by equating population moments to the corresponding sample moments, such as the sample mean and the sample variance.
Consider a random variable $X$ with probability density function (p.d.f.) $f(x)$. Assume that $X_1, X_2, \ldots, X_n$ is a random sample of size $n$ drawn from population $X$. For $k = 1, 2, \ldots$, the $k$-th population moment of distribution $f(x)$ is:
$$E\left(X^k\right) \tag{8.1}$$
For the same random sample and $k = 1, 2, \ldots$, the $k$-th sample moment is:
$$M_k = \frac{\sum_{i=1}^{n} X_i^k}{n} \tag{8.2}$$
The estimation procedure of the method of moments is described next. Assume that $X$ is a random variable with p.d.f. $f(x, \theta_1, \ldots, \theta_m)$, in which $\theta_1, \ldots, \theta_m$ are population parameters whose values are unknown. A random sample $X_1, X_2, \ldots, X_n$ is drawn from population $X$.
The moment estimators $\hat{\theta}_1, \ldots, \hat{\theta}_m$ are obtained by matching the first $m$ sample moments to the corresponding $m$ population moments and by solving the resulting equations for $\theta_1, \ldots, \theta_m$.
Thus, the first population moment is:
$$E(X) = \mu \tag{8.3}$$
And the first sample moment is:
$$M_1 = \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \tag{8.4}$$
By matching the population and the sample moments, we have:
$$\hat{\mu} = \bar{X}$$
Therefore, the sample mean is the moment estimator of the population mean.
Table 8.1 shows how to calculate E(X) and Var(X) for different probability distributions, as also studied in Chapter 6.
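TABLE 8.1 Calculating E(X) and Var(X) for Different Probability Distributions
Distribution                                        E(X)              Var(X)                M (number of parameters)
Normal [$X \sim N(\mu, \sigma^2)$]                  $\mu$             $\sigma^2$            2
Binomial [$X \sim b(n, p)$]                         $np$              $np(1 - p)$           1
Poisson [$X \sim \text{Poisson}(\lambda)$]          $\lambda$         $\lambda$             1
Uniform [$X \sim U(a, b)$]                          $(a + b)/2$       $(b - a)^2/12$        2
Exponential [$X \sim \exp(\lambda)$]                $1/\lambda$       $1/\lambda^2$         1
Gamma [$X \sim \text{Gamma}(\alpha, \lambda)$]      $\alpha\lambda$   $\alpha\lambda^2$     2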
Example 8.3: Method of Moments
Assume that a certain random variable $X$ follows an exponential distribution with parameter $\lambda$. A random sample of 10 units is drawn from the population, and the data can be seen in Table 8.E.1. Calculate the moment estimate of $\lambda$.

TABLE 8.E.1 Data Obtained From the Sample
5.4    9.8    6.3    7.9    9.2    10.7    12.5    15.0    13.9    17.2

Solution
Matching the first population moment to the first sample moment, we have $E(X) = \bar{X}$. For an exponential distribution, since $E(X) = 1/\lambda$, we have $1/\lambda = \bar{X}$. Therefore, the moment estimator of $\lambda$ is given by $\hat{\lambda} = 1/\bar{X}$. For the data in Example 8.3, since $\bar{X} = 10.79$, the moment estimate of $\lambda$ is:
$$\hat{\lambda} = \frac{1}{\bar{X}} = \frac{1}{10.79} = 0.093$$
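The calculation in Example 8.3 can be checked with a short Python sketch. The function name `exponential_moment_estimate` is an illustrative choice, not something defined in the text; the data are those of Table 8.E.1.

```python
from statistics import mean

# Data from Table 8.E.1
sample = [5.4, 9.8, 6.3, 7.9, 9.2, 10.7, 12.5, 15.0, 13.9, 17.2]

def exponential_moment_estimate(data):
    """Match E(X) = 1/lambda to the first sample moment and solve for lambda."""
    return 1.0 / mean(data)

lambda_hat = exponential_moment_estimate(sample)
print(round(mean(sample), 2), round(lambda_hat, 3))  # prints: 10.79 0.093
```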
8.3.2
Ordinary Least Squares
A simple linear regression model is given by the following expression:
$$Y_i = \alpha + \beta \cdot X_i + \mu_i, \quad i = 1, 2, \ldots, n \tag{8.5}$$
where:
$Y_i$ is the i-th observed value of the dependent variable;
$\alpha$ is the linear coefficient of the straight line or constant (intercept);
$\beta$ is the angular coefficient of the straight line (slope);
$X_i$ is the i-th observed value of the explanatory variable;
$\mu_i$ is the random error term of the linear relationship between $Y$ and $X$.
Since parameters $\alpha$ and $\beta$ of the regression model are unknown, we would like to estimate them by using the regression line:
$$\hat{Y}_i = a + b \cdot X_i \tag{8.6}$$
where:
$\hat{Y}_i$ is the i-th value estimated or predicted by the model;
$a$ and $b$ are the estimates of parameters $\alpha$ and $\beta$ of the regression model;
$X_i$ is the i-th observed value of the explanatory variable.
However, the observed values $Y_i$ are not always equal to the values $\hat{Y}_i$ estimated by the regression model. The difference between the observed value and the estimated value for the i-th observation is the error term $\mu_i$:
$$\mu_i = Y_i - \hat{Y}_i \tag{8.7}$$
Thus, the ordinary least squares method is used to determine the straight line that best fits the points of a scatter diagram, that is, the method consists in choosing the estimates $a$ and $b$ so that the sum of squares of the residuals is the smallest possible:
$$\min \sum_{i=1}^{n} \mu_i^2 = \sum_{i=1}^{n} \left(Y_i - a - b \cdot X_i\right)^2$$
The estimators are given by:
$$b = \frac{\sum_{i=1}^{n} \left(Y_i - \bar{Y}\right)\left(X_i - \bar{X}\right)}{\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2} = \frac{\sum_{i=1}^{n} Y_i X_i - n \bar{X} \bar{Y}}{\sum_{i=1}^{n} X_i^2 - n \bar{X}^2} \tag{8.8}$$
$$a = \bar{Y} - b \cdot \bar{X} \tag{8.9}$$
In Chapter 13, we will study the estimation of a linear regression model by ordinary least squares in more detail.
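As a minimal sketch of Expressions (8.8) and (8.9), the function below computes the slope and intercept directly from the deviations around the sample means. The function name `ols_fit` and the five data points are hypothetical and were chosen only for illustration.

```python
def ols_fit(x, y):
    """Return (a, b): intercept and slope from Expressions (8.9) and (8.8)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the first form of Expression (8.8)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar  # Expression (8.9)
    return a, b

# Hypothetical observations (illustration only)
x_obs = [1.0, 2.0, 3.0, 4.0, 5.0]
y_obs = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = ols_fit(x_obs, y_obs)
print(round(a, 2), round(b, 2))  # prints: 0.05 1.99
```

For points that lie exactly on a straight line, such as x = 0, 1, 2 and y = 2, 5, 8, the same function recovers the intercept 2 and the slope 3, since all residuals are zero.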
8.3.3
Maximum Likelihood Estimation
Maximum likelihood estimation is one of the procedures used to estimate the parameters of a model from the probability distribution of the variable that represents the phenomenon being studied. These parameters are chosen in order to maximize the likelihood function, which is the objective function of a certain optimization problem (Fávero, 2015).
Consider a random variable $X$ with probability density function $f(x, \theta)$, in which vector $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$ is unknown. A random sample $X_1, X_2, \ldots, X_n$ of size $n$ is drawn from population $X$; consider $x_1, x_2, \ldots, x_n$ the values actually observed.
The likelihood function $L$ associated with $X$ is the joint probability density function given by the product of the densities of each of the observations:
$$L(\theta; x_1, x_2, \ldots, x_n) = f(x_1, \theta) \cdot f(x_2, \theta) \cdots f(x_n, \theta) = \prod_{i=1}^{n} f(x_i, \theta) \tag{8.10}$$
The maximum likelihood estimator is the vector $\hat{\theta}$ that maximizes the likelihood function.
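To illustrate, consider the exponential density $f(x, \lambda) = \lambda e^{-\lambda x}$, for which the logarithm of the likelihood function is $\ln L = n \ln \lambda - \lambda \sum x_i$. Setting its derivative to zero gives $\hat{\lambda} = n / \sum x_i = 1/\bar{X}$, which in this particular case coincides with the moment estimator. The sketch below maximizes the log-likelihood numerically for the data in Table 8.E.1 and compares the result with the closed form. The ternary search and the bounds `1e-6` and `5.0` are choices made for this illustration, not a method prescribed in the text.

```python
import math

# Data from Table 8.E.1
sample = [5.4, 9.8, 6.3, 7.9, 9.2, 10.7, 12.5, 15.0, 13.9, 17.2]

def log_likelihood(lam, data):
    """Exponential log-likelihood: n * ln(lam) - lam * sum(x)."""
    return len(data) * math.log(lam) - lam * sum(data)

def maximize_unimodal(f, lo, hi, tol=1e-10):
    """Ternary search for the maximizer of a unimodal function on [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1   # the maximum lies to the right of m1
        else:
            hi = m2   # the maximum lies to the left of m2
    return (lo + hi) / 2

lambda_mle = maximize_unimodal(lambda lam: log_likelihood(lam, sample), 1e-6, 5.0)
closed_form = len(sample) / sum(sample)  # 1 / sample mean

print(round(lambda_mle, 4), round(closed_form, 4))
```

Working with the log-likelihood rather than the likelihood itself does not change the maximizer and avoids multiplying many small densities together.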
8.4
INTERVAL ESTIMATION OR CONFIDENCE INTERVALS
In Section 8.3, the population parameters of interest were estimated through a single value (point estimation). The main limitation of point estimation is that all the information in the data is summarized in a single numeric value, which says nothing about how precise the estimate is.
As an alternative, we can use interval estimation. Instead of estimating the population parameter through a single point, we obtain an interval of likely values. That is, we define an interval of values that will contain the true population parameter with a certain confidence level $(1 - \alpha)$, where $\alpha$ is the significance level.
Consider $\hat{\theta}$ an estimator of population parameter $\theta$. An interval estimate for $\theta$ is given by the interval $]\hat{\theta} - k; \hat{\theta} + k[$, where $k$ is chosen so that $P(\theta - k < \hat{\theta} < \theta + k) = 1 - \alpha$ or, equivalently, $P(\hat{\theta} - k < \theta < \hat{\theta} + k) = 1 - \alpha$.
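One way to read the confidence level is as a long-run frequency: if the sampling is repeated many times, about $(1 - \alpha)$ of the resulting intervals should contain the true parameter. The simulation below is a rough sketch of this idea for the mean of a normal population with known variance, using the half-width $k = z_c \sigma / \sqrt{n}$ from Section 8.4.1.1. The population values (`mu=50`, `sigma=10`), the sample size, the number of trials, and the seed are all arbitrary assumptions.

```python
import random
from statistics import NormalDist

def estimate_coverage(mu, sigma, n, alpha, trials, seed=0):
    """Fraction of simulated intervals ]x_bar - k; x_bar + k[ that contain mu."""
    rng = random.Random(seed)                   # fixed seed for reproducibility
    z_c = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    k = z_c * sigma / n ** 0.5                  # half-width of the interval
    hits = 0
    for _ in range(trials):
        x_bar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if x_bar - k < mu < x_bar + k:
            hits += 1
    return hits / trials

coverage_rate = estimate_coverage(mu=50.0, sigma=10.0, n=30, alpha=0.05, trials=2000)
print(coverage_rate)  # expected to be close to 0.95
```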
8.4.1
Confidence Interval for the Population Mean ($\mu$)
The confidence interval for the population mean is constructed for two cases: when the population variance ($\sigma^2$) is known and when it is unknown.
FIG. 8.1 Standard normal distribution.
8.4.1.1 Known Population Variance ($\sigma^2$)
Let $X$ be a random variable with a normal distribution, mean $\mu$, and known variance $\sigma^2$, that is, $X \sim N(\mu, \sigma^2)$. Therefore, we have:
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim N(0, 1) \tag{8.11}$$
that is, variable $Z$ has a standard normal distribution.
Consider that the probability of variable $Z$ assuming values between $-z_c$ and $z_c$ is $1 - \alpha$, so that the critical values $-z_c$ and $z_c$ are obtained from the standard normal distribution table (Table E in the Appendix), as shown in Fig. 8.1. NR and CR mean nonrejection region and critical region of the distribution, respectively. Therefore, we have:
$$P(-z_c < Z < z_c) = 1 - \alpha \tag{8.12}$$
or:
$$P\left(-z_c < \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} < z_c\right) = 1 - \alpha \tag{8.13}$$
Thus, the confidence interval for $\mu$ is:
$$P\left(\bar{X} - z_c \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_c \frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha \tag{8.14}$$
Example 8.4: CI for the Population Mean When the Variance Is Known
We would like to estimate the average processing time of a certain part, with 95% confidence. We know that $\sigma = 1.2$. In order to do that, a random sample of 400 parts was collected, yielding a sample mean of $\bar{X} = 5.4$. Construct a 95% confidence interval for the true population mean.

Solution
We have $\sigma = 1.2$, $n = 400$, $\bar{X} = 5.4$, and CI = 95% ($\alpha = 5\%$). The critical values $-z_c$ and $z_c$ for $\alpha = 5\%$ can be obtained from Table E in the Appendix (Fig. 8.2).

FIG. 8.2 Critical values of $-z_c$ and $z_c$.

Applying Expression (8.14):
$$P\left(5.4 - 1.96 \cdot \frac{1.2}{\sqrt{400}} < \mu < 5.4 + 1.96 \cdot \frac{1.2}{\sqrt{400}}\right) = 95\%$$
that is:
$$P(5.28 < \mu < 5.52) = 95\%$$
Therefore, the [5.28;5.52] interval contains the population mean with 95% confidence.
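Example 8.4 can be reproduced with the following sketch of Expression (8.14). Instead of reading $z_c$ from Table E, it obtains the critical value from `statistics.NormalDist` in the Python standard library; the function name `mean_ci_known_variance` is an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist

def mean_ci_known_variance(x_bar, sigma, n, confidence):
    """Confidence interval for the mean with known sigma, Expression (8.14)."""
    alpha = 1 - confidence
    z_c = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95%
    half_width = z_c * sigma / sqrt(n)
    return x_bar - half_width, x_bar + half_width

lower, upper = mean_ci_known_variance(x_bar=5.4, sigma=1.2, n=400, confidence=0.95)
print(round(lower, 2), round(upper, 2))  # prints: 5.28 5.52
```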
8.4.1.2 Unknown Population Variance ($\sigma^2$)
Let $X$ be a random variable with a normal distribution, mean $\mu$, and unknown variance $\sigma^2$, that is, $X \sim N(\mu, \sigma^2)$. Since the variance is unknown, it is necessary to use an estimator ($S^2$) instead of $\sigma^2$, which gives rise to another random variable:
$$T = \frac{\bar{X} - \mu}{S / \sqrt{n}} \sim t_{n-1} \tag{8.15}$$
that is, variable $T$ follows Student's t-distribution with $n - 1$ degrees of freedom.
Consider that the probability of variable $T$ assuming values between $-t_c$ and $t_c$ is $1 - \alpha$, so that the critical values $-t_c$ and $t_c$ are obtained from Student's t-distribution table (Table B in the Appendix), as shown in Fig. 8.3. Therefore, we have:
$$P(-t_c < T < t_c) = 1 - \alpha \tag{8.16}$$
or:
$$P\left(-t_c < \frac{\bar{X} - \mu}{S / \sqrt{n}} < t_c\right) = 1 - \alpha \tag{8.17}$$
Therefore, the confidence interval for $\mu$ is:
$$P\left(\bar{X} - t_c \frac{S}{\sqrt{n}} < \mu < \bar{X} + t_c \frac{S}{\sqrt{n}}\right) = 1 - \alpha \tag{8.18}$$
FIG. 8.3 Student's t-distribution.

Example 8.5: CI for the Population Mean When the Variance Is Unknown
We would like to estimate the average weight of a given population, with 95% confidence. The random variable analyzed has a normal distribution with mean $\mu$ and unknown variance $\sigma^2$. We pick a sample of 25 individuals from the population and calculate the sample mean ($\bar{X} = 78$) and the sample variance ($S^2 = 36$). Determine the 95% confidence interval for the average weight of the population.

Solution
Since the variance is unknown, we use estimator $S^2$, which leads to variable $T$ that follows Student's t-distribution. The critical values $-t_c$ and $t_c$, obtained from Table B in the Appendix, for a significance level of $\alpha = 5\%$ and 24 degrees of freedom, can be seen in Fig. 8.4.

FIG. 8.4 Critical values of Student's t-distribution.

Applying Expression (8.18), with $S = \sqrt{36} = 6$:
$$P\left(78 - 2.064 \cdot \frac{6}{\sqrt{25}} < \mu < 78 + 2.064 \cdot \frac{6}{\sqrt{25}}\right) = 95\%$$
that is:
$$P(75.5 < \mu < 80.5) = 95\%$$
Therefore, the [75.5;80.5] interval contains the average population weight with 95% confidence.
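The arithmetic of Example 8.5 can be checked with the sketch below. The Python standard library has no quantile function for Student's t-distribution, so the critical value $t_c = 2.064$ is passed in as read from Table B; the function name `mean_ci_unknown_variance` is an illustrative choice.

```python
from math import sqrt

def mean_ci_unknown_variance(x_bar, s2, n, t_crit):
    """Confidence interval for the mean with unknown variance, Expression (8.18).

    t_crit is the critical value of Student's t with n - 1 degrees of freedom,
    taken from a table (it is not computed here).
    """
    half_width = t_crit * sqrt(s2 / n)  # t_c * S / sqrt(n)
    return x_bar - half_width, x_bar + half_width

lower, upper = mean_ci_unknown_variance(x_bar=78, s2=36, n=25, t_crit=2.064)
print(round(lower, 1), round(upper, 1))  # prints: 75.5 80.5
```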
8.4.2
Confidence Interval for Proportions
Consider $X$ a random variable that represents whether or not an element of the population has a characteristic of interest. Thus, $X$ follows a binomial distribution with parameter $p$, in which $p$ represents the probability of an element in the population presenting the characteristic we are interested in: $X \sim b(1, p)$, with mean $\mu = p$ and variance $\sigma^2 = p(1 - p)$.
A random sample $X_1, X_2, \ldots, X_n$ of size $n$ is drawn from the population. Consider $k$ the number of sample elements with the characteristic we are interested in. The estimator of population proportion $p$ ($\hat{p}$) is given by:
$$\hat{p} = \frac{k}{n} \tag{8.19}$$
If $n$ is large, we can consider that sample proportion $\hat{p}$ follows, approximately, a normal distribution with mean $p$ and variance $p(1 - p)/n$:
$$\hat{p} \sim N\left(p, \frac{p(1 - p)}{n}\right) \tag{8.20}$$
We consider that variable $Z = \dfrac{\hat{p} - p}{\sqrt{p(1 - p)/n}} \sim N(0, 1)$. Since $n$ is large, we can replace $p$ with $\hat{p}$ in the denominator:
$$Z = \frac{\hat{p} - p}{\sqrt{\dfrac{\hat{p}(1 - \hat{p})}{n}}} \sim N(0, 1) \tag{8.21}$$
Consider that the probability of variable $Z$ assuming values between $-z_c$ and $z_c$ is $1 - \alpha$, so that the critical values $-z_c$ and $z_c$ are obtained from the standard normal distribution table (Table E in the Appendix), as shown in Fig. 8.1. Thus, we have:
$$P(-z_c < Z < z_c) = 1 - \alpha \tag{8.22}$$
or:
$$P\left(-z_c < \frac{\hat{p} - p}{\sqrt{\dfrac{\hat{p}(1 - \hat{p})}{n}}} < z_c\right) = 1 - \alpha \tag{8.23}$$
Therefore, the confidence interval for $p$ is:
$$P\left(\hat{p} - z_c \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} < p < \hat{p} + z_c \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\right) = 1 - \alpha \tag{8.24}$$
Example 8.6: CI for Proportions
A factory found 230 defective parts in a batch of 1000 parts. Construct a 95% confidence interval for the true proportion of defective products.

Solution
$$n = 1{,}000$$
$$\hat{p} = \frac{k}{n} = \frac{230}{1{,}000} = 0.23$$
$$z_c = 1.96$$
Therefore, Expression (8.24) can be written as:
$$P\left(0.23 - 1.96 \sqrt{\frac{0.23 \times 0.77}{1{,}000}} < p < 0.23 + 1.96 \sqrt{\frac{0.23 \times 0.77}{1{,}000}}\right) = 95\%$$
$$P(0.204 < p < 0.256) = 95\%$$
Thus, the [20.4%;25.6%] interval contains the true proportion of defective products with 95% confidence.
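A short sketch of Expression (8.24) applied to Example 8.6 follows. The critical value is computed with `statistics.NormalDist` rather than read from Table E, and the function name `proportion_ci` is an illustrative choice. The normal approximation it relies on is only reasonable when $n$ is large, as stated above.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(k, n, confidence):
    """Large-sample confidence interval for a proportion, Expression (8.24)."""
    p_hat = k / n
    z_c = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z_c * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

lower, upper = proportion_ci(k=230, n=1000, confidence=0.95)
print(round(lower, 3), round(upper, 3))  # prints: 0.204 0.256
```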
8.4.3
Confidence Interval for the Population Variance
Let $X_i$ be a random variable with a normal distribution, mean $\mu$, and variance $\sigma^2$, that is, $X_i \sim N(\mu, \sigma^2)$. An estimator for $\sigma^2$ is the sample variance $S^2$. Thus, we consider that random variable $Q$ has a chi-square distribution with $n - 1$ degrees of freedom:
$$Q = \frac{(n - 1) \cdot S^2}{\sigma^2} \sim \chi^2_{n-1} \tag{8.25}$$
Consider that the probability of variable $Q$ assuming values between $\chi^2_{low}$ and $\chi^2_{upp}$ is $1 - \alpha$, so that the critical values $\chi^2_{low}$ and $\chi^2_{upp}$ are obtained from the chi-square distribution table (Table D in the Appendix), as shown in Fig. 8.5. Therefore, we have:
$$P\left(\chi^2_{low} < \chi^2_{n-1} < \chi^2_{upp}\right) = 1 - \alpha \tag{8.26}$$
or:
$$P\left(\chi^2_{low} < \frac{(n - 1) \cdot S^2}{\sigma^2} < \chi^2_{upp}\right) = 1 - \alpha \tag{8.27}$$

FIG. 8.5 Chi-square distribution.
Therefore, the confidence interval for $\sigma^2$ is:
$$P\left(\frac{(n - 1) \cdot S^2}{\chi^2_{upp}} < \sigma^2 < \frac{(n - 1) \cdot S^2}{\chi^2_{low}}\right) = 1 - \alpha \tag{8.28}$$
Example 8.7: CI for the Population Variance
Consider the population of Business Administration students at a public university whose variable of interest is students' ages. A sample of 101 students was obtained from the normal population and provided $S^2 = 18.22$. Construct a 90% confidence interval for the population variance.

Solution
From the $\chi^2$ distribution table (Table D in the Appendix), for 100 degrees of freedom, we have:
$$\chi^2_{low} = 77.929$$
$$\chi^2_{upp} = 124.342$$
Therefore, Expression (8.28) can be written as follows:
$$P\left(\frac{100 \times 18.22}{124.342} < \sigma^2 < \frac{100 \times 18.22}{77.929}\right) = 90\%$$
$$P\left(14.65 < \sigma^2 < 23.38\right) = 90\%$$
Thus, the [14.65;23.38] interval contains the true population variance with 90% confidence.
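The bounds in Example 8.7 can be verified with the sketch below. The Python standard library does not provide chi-square quantiles, so the two critical values are passed in as read from Table D; the function name `variance_ci` is an illustrative choice.

```python
def variance_ci(s2, n, chi2_low, chi2_upp):
    """Confidence interval for the population variance, Expression (8.28).

    chi2_low and chi2_upp are the critical values of the chi-square
    distribution with n - 1 degrees of freedom, taken from a table.
    """
    lower = (n - 1) * s2 / chi2_upp  # the larger critical value gives the lower bound
    upper = (n - 1) * s2 / chi2_low
    return lower, upper

lower, upper = variance_ci(s2=18.22, n=101, chi2_low=77.929, chi2_upp=124.342)
print(round(lower, 2), round(upper, 2))  # prints: 14.65 23.38
```

Note that the interval is not symmetric around $S^2$, because the chi-square distribution is itself asymmetric.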
8.5
FINAL REMARKS
Statistical inference is divided into three main parts: sampling, estimation of population parameters, and hypothesis tests. This chapter discussed estimation methods, which can be point or interval methods. Among the main point estimation methods, we can highlight the method of moments, ordinary least squares, and maximum likelihood estimation. Among the main interval estimation methods, we studied the confidence interval (CI) for the population mean (when the variance is known and when it is unknown), the CI for proportions, and the CI for the population variance.
8.6
EXERCISES
1) We would like to estimate the average age of a population that follows a normal distribution and has a standard deviation $\sigma = 18$. In order to do that, a sample of 120 individuals was drawn from the population and the mean obtained was 51 years old. Construct a 90% confidence interval for the true population mean.
2) We would like to estimate the average income of a certain population with a normal distribution and an unknown variance. A sample of 36 individuals was drawn from the population, presenting a mean of $\bar{X} = 5{,}400$ and a standard deviation $S = 200$. Construct a 95% confidence interval for the population mean.
3) We would like to estimate the illiteracy rate of a certain municipality. A sample of 500 inhabitants was drawn from the population, presenting an illiteracy rate of 24%. Construct a 95% confidence interval for the proportion of illiterate individuals in the municipality.
4) We would like to estimate the variability of the average time in rendering services to customers in a bank branch. A sample of 61 customers was drawn from the population with a normal distribution and it gave us $S^2 = 8$. Construct a 95% confidence interval for the population variance.