International Review of Economics Education 12 (2013) 60–79
Contents lists available at SciVerse ScienceDirect
International Review of Economics Education journal homepage: www.elsevier.com/locate/iree
Teaching basic econometric concepts using Monte Carlo simulations in Excel Genevieve Briand a,*, R. Carter Hill b a
Instructor, School of Economic Sciences, Washington State University, Pullman, Washington 99164-6210, United States Ourso Family Professor of Econometrics and Thomas Singletary Professor of Economics, Economics Department, Louisiana State University, Baton Rouge, Louisiana 70803, United States
b
A R T I C L E I N F O
A B S T R A C T
Article history: Available online 8 April 2013
Monte Carlo experiments can be a valuable pedagogical tool for undergraduate econometrics courses. Today this tool can be used in the classroom without the need to acquire any specialized econometrics software. This paper argues that Microsoft Excel, which is already available at many office and home computer stations, offers the opportunity to run meaningful Monte Carlo simulations and to successfully teach students basic econometric concepts. The reader is guided, step-by-step, through two different exercises. The first one is a repeated sampling exercise showing that least squares estimators are unbiased. The second one expands on the first to explain the true meaning of confidence interval estimates of least squares estimators. ß 2013 Elsevier Ltd. All rights reserved.
Keywords: Teaching Econometrics Excel Monte Carlo simulations
1. Introduction Monte Carlo experiments rely on repeated random sampling to simulate and compute results of interest to researchers, instructors or students. Undergraduate econometrics textbooks make use of Monte Carlo simulations to help teach basic econometric concepts. In Gujarati and Porter (2009), the authors state that the reader will be asked to conduct Monte Carlo experiments using different statistical packages (p.12). Hill et al. (2011) illustrate the sampling properties of the least squares and interval estimators in the beginning chapters of their textbook (pp. 88–93 and pp. 127–129). In later chapters, they make use of Monte Carlo simulations to explore the properties of the least
* Corresponding author. E-mail address:
[email protected] (G. Briand). 1477-3880/$ – see front matter ß 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.iree.2013.04.001
G. Briand, R.C. Hill / International Review of Economics Education 12 (2013) 60–79
61
squares estimator in the case of random independent variables (pp. 273–4, 280) and limited dependent variables (pp. 442–4). Murray (1999) points out that ‘‘Monte Carlo techniques offer a rich device for discovery learning’’ (p. 308–9). He further states that ‘‘as I explored the potential of Monte Carlo techniques for learning econometrics, I realized how much Monte Carlo methods highlight the central role of sampling properties in econometrics’’ (p. 309). In his preface for teachers, Murray (2006) explains that his textbook ‘‘starts out with the Monte Carlo approach to estimators’’ and ‘‘returns to Monte Carlo analyses to facilitate learning about heteroskedasticity, errors in variables, and consistency’’ (p. xxvii and p. xxx). Kennedy (2008, 1998a, 1998b) is a strong advocate of Monte Carlo experiments as a pedagogical tool for undergraduate econometrics courses. He suggests the use of ‘‘explain how to do a Monte Carlo study’’ problems to teach student the sampling-distribution concept which, he argues, is the ‘‘statistical lens’’ allowing students to make sense of the statistics world (1998a, 1998b). He proposes an array of such problems in Appendix D of his textbook (2008). Kennedy recommends that instructors do not ask students to actually do a Monte Carlo study (1998a). He cautions of the high opportunity cost of having them learn how to program (1998b). Although Murray (1999) ‘‘eagerly champions’’ using computers to teach econometrics and allow ‘‘a hands-on, discovery mode of learning’’, he also warns that a computer classroom can be very costly in faculty time and institutional dollars (p. 308). We, on the other hand, following Judge (1999) and Craft (2003), argue that Microsoft Excel offers the means to run meaningful Monte Carlo simulations and to successfully teach students basic econometric concepts, at a relatively low opportunity cost. Cahill and Kosicki (2000 p. 771) offer this perfect summary of arguments for using Microsoft Excel: From a practical perspective, spreadsheet software such as Excel is a natural choice to use in exploring economic models because it is widely available on most campuses. This availability eliminates the task of seeking funding for the purchase and support of specialized software packages. In addition, spreadsheet software is relatively easy to use, and its flexibility makes it useful in many different courses at all levels of the traditional economics curriculum. Most economics students almost certainly will use it after graduation in both career and personal settings. Most important, it minimizes black-box features that characterize much computerassisted learning software. Our paper differs from Judge (1999) and Craft’s (2003) in that we restrict ourselves to presenting how Monte Carlo simulations can be run in Excel. Instructors will decide for themselves how to incorporate them in their econometrics courses. Our exposition includes many screen shots and provides step-by-step instructions for using Excel to run the Monte Carlo simulation exercises. The first Monte Carlo simulation exercise we go through is a repeated sampling exercise showing that least squares estimators are unbiased. The second one expands on the first to explain the true meaning of confidence interval estimates of least squares estimators. 2. Repeated sampling and unbiasedness Following Hill et al. (2011) and Briand and Hill (2012), the examples that follow are developed around the idea of studying the relationship between household weekly income and their corresponding weekly expenditure on food. We consider the experiment of randomly selecting households from a population, and subsequently estimating the simple linear regression model: y = b1 + b2x + e, where y represents weekly food expenditure and x represents weekly income. In a Monte Carlo experiment the repeated sampling properties of estimators and tests are observed directly, by creating many samples of data, applying an estimator or test to each sample, recording the outcomes, and then summarizing the outcomes. Samples are created using a specific data generating process (DGP) by which we create a sample of N values. The ingredients of the DGP include (i) choosing the sample size N, (ii) choosing x values (which may be fixed in repeated samples, or not) (iii) selecting parameter values, and (iv) randomly selecting values of the regression error terms e from a probability distribution with a given mean and variance. Given these elements we can ‘‘create’’ an outcome y via a process that resembles a controlled experimental outcome.
G. Briand, R.C. Hill / International Review of Economics Education 12 (2013) 60–79
62
Fig. 1. Probability distribution functions of food expenditure given income level and linear relationship between expected food expenditure and income. Table 1 Monte Carlo experiment parameters.
1 2 3 4 5 6 ... 26
A
B
N= x1= x2=
40 1000 2000
x =$B$2 ... =$B$3
y
C
D
E
s= b1= b2=
50 100 0.10
We will work with random samples of N = 40 households: 20 of them with weekly income x = $1000 and 20 of them with weekly income x = $2000. These values will remain fixed in repeated samples. To illustrate the unbiasedness of the least squares estimators, the actual values of the regression parameters b1 and b2, and the variance of the regression error s2 do not matter. We choose b1 = 100 and b2 = 0.10. Then E(y|x = 1000) = $200, and E(y|x = 2000) = $300. In keeping with the assumptions of the normal linear regression model, the random errors e will be chosen to have independent normal distributions with mean 0 and constant, homoskedastic, variance s2 = 2500. Because x is fixed in repeated samples, the distribution of y is normal with mean E(y|x = 1000 or 2000) and variance s2 = 2500. These distributions and the regression function E(y|x) = b1 + b2x are shown in Fig. 1. The objective of the first Monte Carlo exercise is to show that if we draw many samples of size N = 40 using the specified data generation process, the average value of the least squares estimates b1 and b2 will be close to their true parameter values b1 and b2. If, for example, we use 1000 Monte Carlo P samples, then the sample average ð1=1000Þ 1000 s¼1 b2s ¼ b2 , where b2s is the least squares estimate of b2 in the s’th Monte Carlo sample, will be close to b2 = 0.10. The expected value of the least squares estimator is based on an infinite number of repetitions. If we could compute an infinite number of least squares estimates b1 and b2, their average value would equal their parameter values b1 and b2. A Monte Carlo simulation is only based on a finite number of repetitions, and thus the sample average b2 will not exactly equal E(b2) = b2, however the sample average b2 will converge towards b2 as the number of Monte Carlo samples is increased. To begin the Monte Carlo simulation exercise, we first enter the following labels, values and formulas in cells A1:B3,1 D1:E3, A5:B6 and A26, as shown in Table 1. We then Copy the cell reference from A6 into A7:A25 and the cell reference from A26 into A27:A45. 1
A1:B3 refers to the range of cells between A1 and B3, inclusively. For more on Excel basic skills, please see Appendix A.
G. Briand, R.C. Hill / International Review of Economics Education 12 (2013) 60–79
63
We have specified the sample size N, the x values and the parameter values of the data generating process, b1, b2 and s. Next, we randomly select error terms e and generate a random sample of households’ food expenditure values, y’s. 2.1. Generating a random sample We use Excel functions2 NORMINV and RAND to generate random values using what is called the ‘‘inversion method.’’ This technique, briefly described in Appendix B,3 can be used to generate random deviates from a probability distribution for a continuous random variable for which the cumulative distribution function, or cdf, has an inverse. Let u be a uniformly distributed random variable on the interval [0,1] and let F(x) denote the cdf for a N(m,s2) random variable X, such that P(X x) = F(x). Then a random value x from the distribution N(m,s2) is created by solving the equation u = F(x) for x as x = F1(u), where F1 denotes the inverse function. In Excel the function RAND creates a uniform random value, u, and NORMINV is the inverse function for the normal distribution, F1. Thus by nesting these two functions we can create a random value from a N(m,s2) distribution. The statistical basis for this result is discussed in Appendix B. The general syntax of the NORMINV function is: =NORMINV(probability, m, s)
Fct. (1)
The NORMINV function computes the value x of a normally distributed variable X with mean m, standard deviation s, and with probability =P(Xx), where 0 probability 1. We obtain random error values from the N(m,s2) distribution by specifying the probability argument of the NORMINV function to be the RAND function. The general syntax of the RAND function is as follows: =RAND()
Fct. (2)
The RAND function4 returns a uniformly distributed random number u, with 0 u < 1. Nesting the RAND function into the NORMINV function we have5 =NORMINV(RAND(), m, s)
Fct. (3)
Our interest here lies specifically in the normal distribution of random errors, e, with mean m = 0 and standard deviation of s = 50. Having these, the final step is to generate a random weekly food expenditure value drawn from households with one of two weekly income (x = $1000 or x = $2000) via the known linear function: y = b1 + b2x + e, where b1 = 100 and b2 = 0.10. In cells B6 and B26 we enter the equations shown in Table 2. We Copy the formula from B6 into B7:B25 and from B26 into B27:B45. The random sample that we obtain6 is presented on Screen Shot 1 and its scatter plot on Fig. 2. Fig. 2 can be compared to Fig. 1. The first group of points from the scatter plot (on the left in Fig. 2) was 2 Using Excel functions to generate random numbers is a better alternative than using the Excel Random Number Generation (RNG) Data Analysis tool. This approach has two main advantages: (1) avoiding using an add-in, which may not be installed by default in students’ computer labs, and (2) avoiding falling back into the black box trap. However, one advantage of the RNG Data Analysis tool is that it makes it easier to repeat the exercise with different sample size. 3 For a bit more see, for example, Wikipedia: http://en.wikipedia.org/wiki/Inverse_transform_sampling. 4 For a description of the algorithm that is used in the RAND function, Excel random number generator function, see http:// support.microsoft.com/kb/828795 Microsoft (2011). Microsoft acknowledges that there were some problems with RAND in previous versions of Excel, but it has been improved. For a brief introduction to the issues see, for example, Wikipedia: http:// en.wikipedia.org/wiki/Random_number_generation. 5 The Box-Muller transformation is an alternative to using the inversion method. However it is specific to generating values from the normal distribution, while the inversion method can be used to generate random values from many distributions. For more on the Box-Muller transformation see, for example, http://www.exceluser.com/explore/statsnormal.htm, Kyd (2011); or http://mathworld.wolfram.com/Box-MullerTransformation.html Weisstein (2011). 6 RAND() generates different random values each time it is calculated. Thus the user’s random sample values will differ from ours.
64
G. Briand, R.C. Hill / International Review of Economics Education 12 (2013) 60–79
Table 2 Specifying the regression function. 6 ... 26
B =$E$2+$E$3*$B$2+NORMINV(RAND(),0,$E$1) =$E$2+$E$3*$B$3+NORMINV(RAND(),0,$E$1)
Fig. 2. Scatter plot of random sample.
drawn from the probability distribution function of food expenditure for income level x = $1000 (in the front in Fig. 1). The second group of points from the scatter plot (on the right in Fig. 2) was drawn from the probability distribution function of food expenditure for income level x = $2000 (in the back in Fig. 1). The fitted regression line, which runs between the two groups of points in Fig. 2, is an estimate of the true regression line depicted in Fig. 1 Screen Shot 1:
G. Briand, R.C. Hill / International Review of Economics Education 12 (2013) 60–79
65
Table 3 How LINEST reports parameter estimates.
row 1
column 1
column 2
b2
b1
Table 4 How to report the estimates.
47 48
A
B
b1= b2=
=INDEX(LINEST(B6:B45,A6:A45),2) =INDEX(LINEST(B6:B45,A6:A45),1)
2.2. Generating least squares estimates Next, use the LINEST function to obtain the least squares estimates for the intercept and slope parameters, based on the random sample just drawn. For this purpose, the general syntax of the LINEST function is as follows: =LINEST(y’s, x’s)
Fct. (4)
The first argument of the LINEST function specifies the y values, and the second argument specifies the x values, the least squares estimates are based on. In our case, we specify7: =LINEST(B6:B45,A6:A45)
Fct. (5)
The LINEST function creates a table where the least squares estimates are stored in Excel, first the slope coefficient estimate, and then the intercept coefficient estimate. The estimates are reported as shown in Table 3. We nest the LINEST function in the INDEX function to get the estimated coefficients, one at a time. The INDEX function returns values from within a table. In the case of a table with only one row, the INDEX function general syntax is as follows: =INDEX(table of results, column_num)
Fct. (6)
The first argument of the INDEX function specifies the source table. In our case, this is the table of results generated by the LINEST function above. So, replace ‘‘table of results’’ by ‘‘LINEST(B6:B5,A6:A5)’’. The second argument indicates from which column of the table to retrieve the result of interest. If we want to retrieve the estimate of the intercept coefficient, b1, from the table above, we would indicate that it can be found in column 2 by replacing ‘‘column_num’’ by ‘‘2’’. We report the estimated coefficients at the bottom of our worksheet. In cell A47:B48 enter the labels and equations shown in Table 4. The estimates of the intercept and slope coefficients we obtain from our sample of data are: Screen Shot 2:
7 Note that because we will nest the LINEST function in the INDEX function, we will effectively be working with regular formulas, as opposed to the more complex array formulas. In addition, we choose to work with the LINEST function instead of the SLOPE and INTERCEPT functions so we can also generate standard errors estimates (see Section 3.1).
G. Briand, R.C. Hill / International Review of Economics Education 12 (2013) 60–79
66
Table 5 Key cell formulas to report estimates. Cell
Formula
Copied to
A6 A26 B6 B26 B47 B48
=$B$2 =$B$3 =$E$2+$E$3*$B$2+NORMINV(RAND(),0,$E$1) =$E$2+$E$3*$B$3+NORMINV(RAND(),0,$E$1) =INDEX(LINEST(B6:B45,A6:A45),2) =INDEX(LINEST(B6:B45,A6:A45),1)
A7:A25 A27:A45 B7:B25 B27:B45 – –
Recall that each random sample is different and will yield different estimates, which may or may not be close to the true parameter values. The property of unbiasedness is about the average values of b1 and b2 if many samples of the same size are drawn from the same population. In the next section, we thus repeat our sampling and least squares estimation exercise. Screen Shot 3 presents the generation of a random sample and its least squares estimates— accompanied by Table 5 of the formulas used and addresses where the formulas were copied to.8 Screen Shot 3:
2.3. Repeated sampling We would like to draw 9 additional random samples. For that, Copy the formula from B6 into C6:K25 and the formula from B26 into C26:K45. Next, before copying the formula to obtain coefficient estimates for the new samples, transform the Relative cell reference A6:A45 into an Absolute cell reference $A6:$A45—this retains the same 8
This presentation is drawn from Ragsdale (2008) Managerial Decision Modeling textbook.
G. Briand, R.C. Hill / International Review of Economics Education 12 (2013) 60–79
67
Table 6 Average the estimates.
47 48 50 51
A
B
b1= b2=
=INDEX(LINEST(B6:B45,$A6:$A45),2) =INDEX(LINEST(B6:B45,$A6:$A45),1)
K ... ... =AVERAGE(B47:K47) =AVERAGE(B48:K48)
x-values for the next 9 rounds of least squares estimations. Copy the formulas from B47:B48 into C47:K48. In cells K50:K51, we compute the AVERAGEs of the estimates from the 10 samples by entering the formulas shown in Table 6. The estimates and average values that we obtain for the 10 samples are: Screen Shot 4:
Screen Shot 5 presents the repeated sampling of 10 random samples and their least squares estimates—accompanied by Table 7 of the formulas used and addresses where the formulas were copied to. Screen Shot 5:
Taking the averages of estimates from many samples, the averages will approach the true parameter values b1 and b2. To show that this is the case, we repeated the exercise again—and you
68
G. Briand, R.C. Hill / International Review of Economics Education 12 (2013) 60–79
Table 7 Key cell formulas: 10 random samples. Cell
Formula
Copied to
A6 A26 B6 B26 B47 B48 L47 L48
=$B$2 =$B$3 =$E$2+$E$3*$B$2+NORMINV(RAND(),0,$E$1) =$E$2+$E$3*$B$3+NORMINV(RAND(),0,$E$1) =INDEX(LINEST(B6:B45,$A6:$A45),2) =INDEX(LINEST(B6:B45,$A6:$A45),1) =AVERAGE(B47:K47) =AVERAGE(B48:K48)
A7:A25 A27:A45 B7:B25, C6:K25 B27:B45, C26:K45 C47:K47 C48:K48 – –
Table 8 Monte Carlo average values. Number of samples
10
100
1000
Parameter Values
Average value of b1 Average value of b2
89.87141 0.105656
97.52929 0.101557
99.09254 0.100546
100 0.1
could easily ask your students to do the same.9 We encourage a reader who has so far been replicating the Excel steps along with us to continue reading, and abstain for now from repeating the exercise. The cell references that are given next in this paper assume that this is the case. In Table 8 are the average values of b1 and b2 obtained when the number of samples is increased from 10 to 100, and finally to 1000. Note that for 1000 random samples, or even 100, the average of the least squares estimates are very close to the true values. 3. Repeated sampling and interval estimation In the second Monte Carlo exercise we will illustrate the meaning of 95% ‘‘level of confidence’’ in interval estimation. A 95% interval estimator is bk t(0.975,NK)se(bk), where bk is the least squares estimator of bk, t(0.975,NK) is the 97.5 percentile from a t-distribution with NK degrees of freedom and se(bk) is the standard error of bk. In a large number of repeated samples from the same population, 95% of interval estimates will contain the true underlying population parameter. This section provides step-by-step instructions for constructing a Monte Carlo simulation template in Excel. If the following exercise were proposed to students, it would require of them to spend more time familiarizing themselves with new Excel functions as well as figuring how to use them to properly design a Monte Carlo simulation template. The authors believe that the time spent on those two activities will enhance their grasp of basic econometric concepts as well as present an opportunity for them to refine their spreadsheet software skills. This time, we would like to draw 90 additional random samples. Copy the formula from K6 into L6:CW25 and the formula from K26 into L26:CW45. 3.1. The LINEST function revisited The LINEST function will obtain the least squares estimates and their standard errors with one additional option. The general syntax of the LINEST function is:
9 As an alternative to the method used here, see Barreto and Howland (2005) add-in: http://www3.wabash.edu/ econometrics/EconometricsBook/Basic%20Tools/ExcelAddIns/MCSim.htm.
G. Briand, R.C. Hill / International Review of Economics Education 12 (2013) 60–79
69
Table 9 How LINEST reports estimates and standard errors.
row 1 row 2
column 1
column 2
b2 se(b2)
b1 se(b1)
=LINEST(y’s, x’s,,TRUE)
Fct. (7)
The first argument of LINEST function specifies the y values; the second argument specifies the x values; we ignore the third argument by putting a space between the second and third commas; and the fourth argument, TRUE, indicates that we would like LINEST to return additional regression statistics. The LINEST function creates a table where it stores the least squares and standard errors estimates. The order in which they are reported is shown in Table 9. We nest the LINEST function in the INDEX function to get the estimated coefficients, one at a time. The INDEX function returns values from within a table. The INDEX function general syntax is as follows: =INDEX(table of results, row_num, column_num)
Fct. (8)
The first argument of the INDEX function specifies the source table. The second argument and third argument indicate the intersection of a row and a column at which the result of interest can be found. The nested commands are: b1: se(b1): b2: se(b2):
=INDEX(LINEST(y-values,x-values, =INDEX(LINEST(y-values,x-values, =INDEX(LINEST(y-values,x-values, =INDEX(LINEST(y-values,x-values,
TRUE),1,2) TRUE),2,2) TRUE),1,1) TRUE),2,1)
Fct. Fct. Fct. Fct.
(9) (10) (11) (12)
3.2. The simulation template The template shown in Table 1010 reports estimated coefficients, standard errors, t-percentile values and limits of the interval estimates (Lower Limit: LL and Upper Limit: UL). Next, we count how many of the 100 interval estimates contain the true parameters’ values. Finally, we compute summary statistics for our estimated coefficients and standard errors. Specify cells A47:B68 as shown in Table 10 (some cells are outlined in different shades of gray only to distinguish groups of similar or related cells which we comment on shortly). In cells A47:B48, the sample size (N) and a value are specified, for a 100(1 a)% confidence interval. The t-distribution degrees of freedom (d.f.) and t-percentile value (tc) are computed and reported in cells A49:B50. Cells A51:B52 and A60:B61 are used to report and compute coefficient estimates (bk, k = 1,2) and standard errors (se(bk), k = 1,2). In the formula typed in cells B51:B52 and B60:B61, the cell references to the x values are in Absolute format, $A6:$A45, as opposed to Relative format, as we will be using the same x values for all 100 repetitions. Cells A53:B54 and A62:B63 are used to compute and report interval estimates. The t-critical value tc will be the same over all repetitions, so its cell reference in the formulas of the intervals limits is specified in Absolute format, $B$50. In cells A55:B55 and A64:B64, we establish and report whether the true parameters’ values are contained in the interval estimates. In cells A56:B56 and A65:B65, we keep track of the number of interval estimates that do contain the true parameters values. 10
Available from the authors upon request:
[email protected].
70
G. Briand, R.C. Hill / International Review of Economics Education 12 (2013) 60–79
Table 10 The simulation template.
47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
A
B
N= a= d.f. = tc = b1 = se(b1) = LL = UL = b1 in CI Yes’ average b1’s = std. dev. (b1’s) = average se(b1)’s = b2 = se(b2) = LL = UL = b2 in CI Yes’ average b2’s = std. dev. (b2’s) = average se(b2)’s =
40 0.05 =B47-2 =TINV(B48,B49) =INDEX(LINEST(B6:B45,$A6:$A45, TRUE),1,2) =INDEX(LINEST(B6:B45,$A6:$A45, TRUE),2,2) =B51-$B$50*B52 =B51+$B$50*B52 =IF(OR(100
B54),‘‘No’’, ‘‘Yes’’) =COUNTIF(B55:CW55, ‘‘Yes’’) =AVERAGE(B51:CW51) =STDEV(B51:CW51) =AVERAGE(B52:CW52) =INDEX(LINEST(B6:B45,$A6:$A45, TRUE),1,1) =INDEX(LINEST(B6:B45,$A6:$A45, TRUE),2,1) =B60-$B$50*B61 =B60+$B$50*B61 =IF(OR(0.1B63),‘‘No’’, ‘‘Yes’’) =COUNTIF(B64:CW64, ‘‘Yes’’) =AVERAGE(B60:CW60) =STDEV(B60:CW60) =AVERAGE(B61:CW61)
In cells A57:B58 and A66:B67 we compute and report the average and standard deviation of the 100 estimates of the intercept and slope coefficients. The average of the parameter estimates should be close to the true parameter values because the least squares estimator is unbiased. The standard deviations of the estimates are the actual finite sample variability of estimates in the Monte Carlo experiment. In cells A59:B59 and A68:B68 we compute and report the average of the 100 standard errors of the intercept and slope coefficients, which should be close to the estimates’ standard deviations. This provides an opportunity to remind students that standard errors reflect the sampling variability of the estimates. 3.3. The TINV function The TINV function returns 100(1 a/2) percentile values for a t-distribution (t-critical values) with specified degrees of freedom. The syntax of the TINV function is: =TINV(a, degrees of freedom)
Fct. (13)
where a is the two-tail probability. 3.4. The IF function Use the IF and OR logical functions to indicate, for each interval estimate, whether or not it contains the true parameter value. The general syntax for the IF function is IF(logical_test,value_if_true,value_if_false)
Fct. (14)
where: Logical_test is any value or expression that can be evaluated to be TRUE or FALSE. In this exercise we want to determine whether or not the true parameter value, bk, is within the estimated interval [LL,UL], where LL = bk tcse(bk) and UL = bk + tcse(bk). The logical expression we
G. Briand, R.C. Hill / International Review of Economics Education 12 (2013) 60–79
71
use is: if bk < LL or bk > UL. If bk is outside [LL,UL], then this expression is TRUE. Otherwise, the expression is FALSE. Value_if_true is the value that is returned if logical_test is TRUE. For example, if this argument is the text string ‘‘No’’ and the logical_test argument is TRUE, then the IF function displays the text ‘‘No’’. Value_if_false is the value that is returned if logical_test is FALSE. For example, if this argument is the text string ‘‘Yes,’’ and the logical_test argument is FALSE, then the IF function displays the text ‘‘Yes’’. 3.5. The OR function Use the OR function to write the logical_test. The general syntax of the OR function is OR(argument_1,argument_2)
Fct. (15)
If the first logical expression, argument_1, or the second logical expression, argument_2, is TRUE, then the OR function returns TRUE. It returns FALSE only if both arguments are FALSE. The general syntax for the OR function, nested in the IF function, is: IF(OR(argument_1,argument_2),value_if_true,value_if_false)
Fct. (16)
Applied to our exercise, the nested function looks like this (which is what we have in cells B55 and B64): IF(OR(bk < LL, bk > UL),‘‘No’’,‘‘Yes’’)
Fct. (17)
If bk is outside [LL,UL], then the logical_test bk < LL or bk > UL is TRUE, and ‘‘No’’ is returned to indicate that bk is not in the estimated confidence interval. Otherwise, the logical expression is FALSE, and ‘‘Yes’’ is returned to indicate that bk is in the estimated confidence interval. 3.6. The COUNTIF function Finally, we use the COUNTIF function to count the number of times bk is within the interval estimate [LL,UL]. The COUNTIF function is a statistical function that counts the number of cells within a range that meet a given criteria. Its general syntax is: COUNTIF(cell_range,criteria)
Fct. (18)
Cell_Range is one or more cells to count. Criteria is the number, expression, cell reference, or text that defines which cells will be counted. Since we are interested in counting how many interval estimates, among all the ones we construct, actually contain the true parameter value, we count the ‘‘Yes’’ that are generated following the application of our logical_test (this is what we do in cells B56 and B65): COUNTIF(cell_range,‘‘Yes’’)
Fct. (19)
Review with students the meaning of the formulas and values in B47:B68. Copy the content of B51:B55 to C51:CW55 and copy the content of B60:B64 to C60:CW64. 3.7. Results Screen Shot 6 presents the repeated sampling of 100 random samples and 95% confidence interval estimates—the accompanying table of the formulas used and addresses where the formulas were copied to can be found in Appendix C.
72
G. Briand, R.C. Hill / International Review of Economics Education 12 (2013) 60–79
Screen Shot 6:
We find that 94 out of our 100 confidence intervals contain b1, and 95 out of 100 interval estimates contained b2 (see Screen Shot 6). Note that each replication will result in different random samples, different interval estimates and thus a different number of intervals that will contain the true parameters values. With our simulation of 1000 samples, we find that 954 out of 1000 confidence intervals contained the true parameter value, both for the intercept and slope coefficients. With our simulation of 10,000 samples we find that 95% of both the intercept and slope coefficients interval estimates contained the true parameters values. Finally by computing the summary statistics of the Monte Carlo estimates we can illustrate that the standard errors se(bs,k), s = 1, . . ., S and k = 1,2, measure the sampling variation in the estimates bs,k. The
G. Briand, R.C. Hill / International Review of Economics Education 12 (2013) 60–79
73
average of the standard errors for each coefficient is close to the standard deviation of S estimates of each coefficient (B59 versus B58, and B68 versus B67 on Screen Shot 6). 3.8. Extensions The exercise developed thus far can be modified or extended to investigate additional concepts. Some suggestions are: Effect of changes in variation of the x values on se(bk); Effect of random errors from alternative distributions on the sampling distribution of the bk estimators, as shown in Appendix B; Effect of rescaling x and/or y values on bk estimates; and Effect of changes in sample size on se(bk). 4. Conclusion This paper presented a step-by-step guide to Microsoft Excel fundamentals and two Monte Carlo simulations exercises. The first Monte Carlo simulation is a repeated sampling exercise showing that least squares estimators are unbiased. The second one expands on the first to explain the true meaning of confidence interval estimates of least squares estimators. In these two exercises, the sampling distribution concept is presented in the context of the regression analysis, as Kennedy (2001) suggests it should be introduced. The use of Excel requires students to understand what they are doing. We hope that the experience of using Excel in an undergraduate econometrics course will be akin to that of programming procedures in a graduate econometrics text such as in Mittelhammer et al. (2000 p.713). The way we suggest to use Excel in an undergraduate econometrics course is different than what Barreto and Howland (2006) propose. They ‘‘use Excel workbooks powered by Visual Basic macros’’ that ‘‘enable Monte Carlo simulations to be run by students with a click of a button’’ (p. i). While Barreto and Howland use of Excel would seem paramount to what Day (1987) refers to as a ‘‘canned program’’ in the context of macro-economic simulation exercises, ours would rather presents the advantages of a ‘‘student-built model’’ (p. 351), giving students active learning opportunities. Becker and Greene (2001) point out that ‘‘the starting point for any course in statistics and econometrics is the calculation and use of descriptive statistics and mastery of basic spreadsheets skills (emphasis added)’’ (p. 173). Some students are still not familiar with Excel, and a few are even reluctant to use it; but at the end of a course using Excel to teach undergraduate econometrics, students and instructors alike, come out of the experience with a stronger understanding of core econometrics concepts and with better Excel skills. Acknowledgement We are grateful to anonymous referees for helpful comments and suggestions. Appendix A. Microsoft Excel fundamentals (Using Excel 2007) A.1. Starting Excel Find the Excel shortcut on your desktop and double click on it to start Excel (left clicks). Screen Shot A1:
74
G. Briand, R.C. Hill / International Review of Economics Education 12 (2013) 60–79
Alternatively, left-click the Windows Start button at the bottom left corner of your computer screen. Screen Shot A2:
Slide your mouse over All programs, Microsoft Office, and finally Microsoft Office Excel 2007. Left-click on the last one to start Excel. To create a shortcut instead, right-click on it; slide the mouse over Send to, and then select (i.e. drag the mouse over and left-click on) Desktop (create shortcut). From the Windows Start button, an easier way yet to start Excel is to select Run, type Excel in the Open window of the Run dialog box and select the OK button or simply press your Enter key. Screen Shot A3:
Excel opens to a new file, titled Book1. Find the name of the open file on the very top of the Excel window, on the Title bar. An Excel file like Book1 contains several sheets. By default, Excel opens to Sheet1 of Book1. Determine which sheet is open by looking at the Sheet tabs found in the lower left corner of the Excel window. Your screen might look slightly different than the one shown below. If your computer screen is bigger, Excel will automatically display more of its available options. For example, in the Styles group of command, instead of the Cell styles button, there might be a colorful display of cell styles. Screen Shot A4:
G. Briand, R.C. Hill / International Review of Economics Education 12 (2013) 60–79
75
The Active cell is surrounded by a border and is in Column A and Row 1; its Cell reference is A1. Below the title bar is a Tab list. The Home tab is the one Excel opens to. Under each tab we find groups of commands. Under the home tab, the first one is the Clipboard group of commands, named after the tasks it relates to. The wide bar including the tab list and the groups of commands is referred to as the Ribbon. The content of the Active cell shows up in the Formula bar (as shown above, there is nothing in it). Perhaps the most important of all of this is to locate the Help button on the upper right corner of the Excel window. Finally, we can use the Scroll bars and the arrows around them to navigate up-down and right-left in the worksheet. And we potentially have a long way to go: each worksheet in Microsoft Excel 2007 contains 1,048,576 rows and 16,384 columns!!!! A.2. Entering data and carrying out calculations To enter labels and data into an Excel worksheet move the cursor to a cell (i.e. drag our mouse over and left-click on) and begin typing. Type X in cell A1, then press the Enter key to get to cell A2 or navigate by moving the cursor with the mouse, or use the Arrow keys (to move right, left, up or down). Fill in the rest of the worksheet as shown below in Table A1. Excel’s primary usefulness is to carry out repeated calculations. We can add, subtract, multiply and divide; and apply mathematical and statistical functions to the data in a worksheet. To illustrate, we are going to compute the squares of the numbers we just entered. There are two main ways to perform calculations in Excel. One is to write formulas using arithmetic operators, which we demonstrate below; the other is to write formulas using mathematical functions—these and other functions will be used in the main section of this paper. Place the cursor in cell B2. We want to compute the square of the value from cell A2. Let us emphasize that the trick to using Excel efficiently is NOT to re-type values already stored in the worksheet, but instead to use references of cells where the values are stored. So, to compute the square of 10, which is the value stored in cell A2, instead of typing the formula =10*10, type the formula =A2*A2 or =A2^2 (the asterisk and the caret can be typed by simultaneously pressing the Shift key and the * or ^ key) as shown in Table A2. Press Enter. Note that: (1) a formula always starts with an equal sign; this is how Excel recognizes it is a formula, and (2) formulas are not case sensitive, so we could also have typed =a2^2 instead. The way Excel understands the instructions we gave in cell B2 is ‘‘square the value found at the address A2’’. It is important to fully understand how Excel interprets ‘‘address A2’’. To Excel ‘‘address A2’’ means ‘‘from where you are at, go left by one cell’’—because this is where A2 is located vis-a`-vis B2. In other words, an address gives directions: left-right, up-down, and distances: number of cells away—all in reference to the cell where the formula is entered. Table A1 Entering data.
1 2 3
A
B
x 10 20
y
Table A2 Carrying out calculations.
1 2 3
A
B
x 10 20
y =A2^2
76
G. Briand, R.C. Hill / International Review of Economics Education 12 (2013) 60–79
Table A3 Absolute cell reference.
1 2 3
A
B
x 10 20
y =$A$2^2
To copy the formula in cell B2 to cell B3 place the cursor back into cell B2, and move it to the southeast corner of the cell, until the fat cross turns into a skinny one, as shown below: Screen Shot A5:
Then left-click, hold it, drag it down to the next cell below, and release! Excel has copied the formula we typed in cell B2 into the cell below. The formula entered in cell B2 instructed Excel to collect the value stored one-cell away from its left, and then square it—those exact same instructions are now found in cell B3. Place your cursor back into B3, and look at the Formula bar. You can see that, in this cell, these same instructions translate into ‘‘=A3^2’’, and the value computed is thus 400. Screen Shot A6:
Now, if we had wanted the values in B2 and B3 to be equal to the square of the value stored in cell A2, then we would transform the cell reference in cell B2 from a Relative Cell Reference into an Absolute Cell Reference, before copying it to cell B3. This ensures that an address does NOT change when we copy a formula from one cell to another. A Relative Cell Reference is made into an Absolute Cell Reference by preceding both the row and column references by a dollar sign. Place the cursor back in cell B2 (i.e. move the mouse over and left-click), and in the Formula bar, place the cursor before the A, insert a dollar sign (by pressing the Shift key and the $ key at the same time); then move the cursor before the first 2 and insert another dollar sign; and finally, place the cursor at the end of the formula and press Enter. See Table A3. Copy the formula from cell B2 to cell B3. Place the cursor back into B3, and look at the Formula bar. You can see that, this time, the formula is still =$A$2^2—which means the instruction (or address) given to Excel has not changed. The value computed is thus 100. Appendix B. The inversion method for generating random deviates Using the inversion method is an opportunity to explain to students the usefulness of change-ofvariable techniques and cumulative distribution functions.
G. Briand, R.C. Hill / International Review of Economics Education 12 (2013) 60–79
77
B.1. Distributions of functions of random variables Let X be a continuous random variable with probability density function, pdf, f(x). Let Y = g(X) be a function that is strictly increasing or strictly decreasing. This condition ensures that the function is one-to-one, so that there is exactly one Y value for each X value, and exactly one X value for each Y value. The importance of this condition on g(X) is that we can solve Y = g(X) for X. That is, we can find an inverse function X = w(Y). Then the pdf for Y is given by dwðyÞ hðyÞ ¼ f ½wðyÞ (A1) dy where j j denotes the absolute value. This is called the change-of-variable technique.11 As an example, let X be a continuous random variable with pdf f(x) = 2x for 0 < x < 1. Let Y = g(X) = 2X. The inverse function is X ¼ wðYÞ ¼ Y=2 and dwðyÞ=dy ¼ 1=2, so hðyÞ ¼ f ½wðyÞ jdwðyÞ=dtj ¼ 2½y=2 j1=2j ¼ y=2 for 0 < x < 2. B.2. The inversion method for generating random values This method is an application of the change-of-variable technique. We wish to draw a random value from a continuous probability distribution with pdf f(y) and cdf F(y). We will use the fact that dF(y)/dy = f(y). Also, pdf of a uniformly distributed random variable, u, on the (0,1) interval is f(u) = 1 for 0
Draw a uniform random number u1 in the (0,1) interval. Let u1 = F(y1) where F(y) is the cdf of the desired random deviate. Solve the equation in step 2 for y1 using y1 = F1(u1) = g(u1). The value y1 is a random value from the target pdf f(y).
From the change-of-variable technique, the inverse function wðyÞ ¼ FðyÞ, so that f ½wðyÞ jdwðyÞ=dyj ¼ 1 jdFðyÞ=dyj ¼ f ðyÞ. As an illustration suppose our target distribution is the pdf f(y) = 2y for 0 < y < 1. The corresponding cdf is F(y) = y2. Set a uniformly distributed random number u1 ¼ Fðy1 Þ ¼ y21 . Solve for y1 = F1(u1) = (u1)1/2. This value is a random draw from the distribution f(y) = 2y. Exactly the same strategy is used for creating a random draw from a normal distribution. Suppose our objective is to draw a random value from a normal distribution, with mean m and variance s2. Let the pdf of the normal density be denoted f(y) and the corresponding cdf F(y). Using the inversion method we obtain a random draw from this density by finding y1 = F1(u1) where u1 is a uniform random number in the (0,1) interval. The Excel function NORMINV is F1 and RAND provides the uniform random value. It is worth noting that once this technique is in place there are endless possibilities for modifying the structure of the Monte Carlo experiment. We might have students examine the probability distribution of the least squares estimators when the regression disturbances are distributed something other than normal. Several examples for using errors with mean 0 and variance s2 are: Triangular random errors. Generate random values from the distribution f(y) = 2y for 0 < y < 1 as discussed above. The mean and variance of this distribution are 2/3 and 1/18,prespectively. For the ffiffiffiffiffiffi regression random errors to have mean 0 and variance s2 use e ¼ s ðy 2=3Þ= 18. Students will be surprised to find that even with modest sized samples that the least squares estimators’ distributions are not far from normal. Chi-square random errors.12 Generate standard normal random values y with mean 0 and variance 1, the standard normal, following the steps described above. Then y2 has a chi-square distribution
11 12
For a description of the method at the undergraduate level, see Hill et al. (2011). See, for example, Cameron and Trivedi (2010).
G. Briand, R.C. Hill / International Review of Economics Education 12 (2013) 60–79
78
with 1 degree of freedom, implying that it has mean 1 andpvariance 2. For the regression random ffiffiffi errors to have mean 0 and variance s2 use e ¼ s ðy2 1Þ= 2 Extreme value random errors. The extreme value density is f ðyÞ ¼ expðyÞ expðexpðyÞÞ
(A2)
which has cdf F(y) = exp( exp( y)). Despite its imposing form, we can obtain random values from this distribution using the inversion method, since y = F1(u) = ln( ln(u)) where u is a uniformly distributed random value. The mean and variance of this distribution are 0.57722 and p2/6.13 For the pffiffiffiffiffiffiffiffiffiffiffi regression random errors to have mean 0 and variance s2 use e ¼ s ðy 0:57722Þ= p2 =6 Random values from many distributions can be similarly formed. Books of statistical distributions provide ‘‘formulas’’ for generating random deviates given uniform random values. With Monte Carlo studies based on these alternative non-normal distributions, it is easy to illustrate using a histogram how effective the Central Limit Theorem can be, resulting in nearly normally distributed least squares estimators in even moderately sized samples. Furthermore the performance of interval estimators, or hypothesis tests, can be studied under these alternative assumptions as we have illustrated in Section 3. Appendix C. Repeated sampling and interval estimation Screen Shot 6 accompanying table of the formulas used and addresses where the formulas were copied to. Key cell formulas Cell
formula
Copied to
A6 A26 B6 B26 B49 B50 B51 B52 B53 B54 B55 B56 B57 B58 B59 B60 B61 B62 B63 B64 B65 B66 B67 B68
=$B$2 =$B$3 =$E$2+$E$3*$B$2+NORMINV(RAND(),0,$E$1) =$E$2+$E$3*$B$3+NORMINV(RAND(),0,$E$1) =B47-2 =TINV(B48,B49) =INDEX(LINEST(B6:B45,$A6:$A45,TRUE),1,2) =INDEX(LINEST(B6:B45,$A6:$A45,TRUE),2,2) =B51-$B$50*B52 =B51+$B$50*B52 =IF(OR(100B54),‘‘No’’,‘‘Yes’’) =COUNTIF(B55:CW55, ‘‘Yes’’) =AVERAGE(B51:CW51) =STD(B51:CW51) =AVERAGE(B52:CW52) =INDEX(LINEST(B6:B45,$A6:$A45,TRUE),1,1) =INDEX(LINEST(B6:B45,$A6:$A45,TRUE),2,1) =B60-$B$50*B61 =B60+$B$50*B61 =IF(OR(0.1B63),‘‘No’’, ‘‘Yes’’) =COUNTIF(B64:CW64, ‘‘Yes’’) =AVERAGE(B60:CW60) =STD(B60:CW60) =AVERAGE(B61:CW61)
A7:A25 A27:A45 B7:B25, C6:CW25 B27:B45, C26:CW45 – – C51:CW51 C52:CW52 C53:CW53 C54:CW54 C55:CW55 – – – – C60:CW60 C61:CW61 C62:CW62 C63:CW63 C64:CW64 – – – –
References Barreto, H., Howland, F.M., 2005. Excel Add-Ins for Introductory Econometrics using Monte Carlo Simulation with Microsoft Excel. Cambridge University Press, New York Available at: http://www3.wabash.edu/econometrics/EconometricsBook/ Basic%20Tools/ExcelAddIns/index.htm.
13
For example, Forbes et al. (2011).
G. Briand, R.C. Hill / International Review of Economics Education 12 (2013) 60–79
79
Barreto, H., Howland, F.M., 2006. Introductory Econometrics using Monte Carlo Simulation with Microsoft Excel. Cambridge University Press, New York. Becker, W.E., Greene, W.H., 2001. Teaching statistics and econometrics to undergraduates. Journal of Economic Perspectives 15 (4) 169–182. Briand, G., Hill, R.C., 2012. Using Excel 2007 for Principles of Econometrics, third edition. John Wiley & Sons, New York. Cahill, M., Kosicki, G., 2000. Exploring economic models using Excel. Southern Economic Journal 66 (3) 770–792. Cameron, A.C., Trivedi, P., 2010. Microeconomics Using Sata, Revised Edition. Stata Press, College Station, TX. Craft, R.K., 2003. Using spreadsheets to conduct Monte Carlo experiments for teaching introductory econometrics. Southern Economic Journal 69 (3) 716–735. Day, E., 1987. A note on simulation models in the economics classroom. Journal of Economic Education 18 (4) 351–356. Forbes, C., Evans, M., Hastings, N., Peacok, B., 2011. Statistical Distributions, fourth edition. John Wiley & Sons, Hoboken, NJ pp. 99. Gujarati, D.N., Porter, D.C., 2009. Basic Econometrics, fifth edition. McGraw-Hill/Irwin, New York. Hill, R.C., Griffiths, W.E., Lim, G.C., 2011. Principles of Econometrics, fourth edition. John Wiley & Sons, New York. Judge, G., 1999. Simple Monte Carlo studies on a spreadsheet. Computers in Higher Education Economics Review (online) 13 (2) , Available at:In: http://www.economicsnetwork.ac.uk/cheer/ch13_2/ch13_2p12.htm. Kennedy, P.E., 1998a. Using Monte Carlo studies for teaching econometrics. In: Becker, W.E., Watts, M. (Eds.), Teaching Economics to Undergraduates: Alternatives to Chalk and Talk. Edward Elgar, Cheltenham and Northampton, pp. 141–159. Kennedy, P.E., 1998b. Teaching undergraduate econometrics: a suggestion for fundamental change. American Economic Review, Papers and Proceedings 88 (2) 487–491. Kennedy, P.E., 2001. Bootstrapping Student understanding of what is going on in econometrics. Journal of Economic Education 32 (2) 110–123. Kennedy, P.E., 2008. A Guide to Econometrics, 6th edition. Blackwell Publishing, Malden, MA. Kyd, C. 2011. An Excel tutorial: an introduction to Excel’s normal distribution functions. ExcelUSer—For Business Users of Microsoft Excel. Excel User, Inc. Available at: http://www.exceluser.com/explore/statsnormal.htm. Microsoft. 2011. Description of the RAND Function in Excel. Microsoft Support. Microsoft Corporation. Available at: http:// support.microsoft.com/kb/828795. Mittelhammer, R.C., Judge, G.G., Miller, D.J., 2000. Econometric Foundations. Cambridge University Press, New York. Murray, M.P., 1999. Econometrics lectures in a computer classroom. Journal of Economic Education 30 (3) 308–321. Murray, M.P., 2006. Econometrics: A Modern Introduction. Pearson Education, Boston. Ragsdale, C.T., 2008. Managerial Decision Modeling. South-Western Cengage Learning, Mason. Weisstein, E.W. 2011. Box–Muller Transformation. Mathworld—A Wolfram Web Resource. Wolfram Research, Inc. Available at: http://mathworld.wolfram.com/Box-MullerTransformation.html. Wikipedia. 2011. Inverse Transform Sampling. Wikipedia—The Free Encyclopedia. Wikimedia Foundation, Inc. Available at: http://en.wikipedia.org/wiki/Inverse_transform_sampling. Wikipedia. 2011. Random Number Generation. Wikipedia—The Free Encyclopedia. Wikimedia Foundation, Inc. Available at: http://en.wikipedia.org/wiki/Random_number_generation.