Chapter 25
A Case Study

In this chapter we consider a case study: we analyze a dataset with a view to illustrating some of the concepts and definitions introduced in the book. We consider data from the 2016 Summer Olympics, namely the number of athletes submitted by each country and the number of medals (Gold, Silver, and Bronze) won by each country. There were a total of 11249 athletes and 974 medals.

We might suppose that this is a repeatable random experiment with random variable X = 1 if an athlete wins a medal and X = 0 if the athlete does not, and suppose that the true probability of X = 1 is p. We have a sample of n = 11249 with k = 974 successes, so we could estimate p by k/n ≈ 0.087. Is this a good model here? Athletes can win more than one medal, and the probability of Usain Bolt, say, winning a medal is obviously higher than that of other athletes. Finally, if athlete i wins a medal, then the chance of athlete j winning a medal is diminished slightly. So one might question the applicability of the random experiment model.

There were a total of 208 countries entered, although Kuwait submitted no athletes and so was dropped from the sample. Suppose we assume that all athletes come from the same homogeneous population and that independence is a good assumption. Then the probability that India wins no medals, given that they submitted 124 athletes, would be (1 − 0.087)^124 ≈ 0.0000125. In fact they won 2 medals. The probability that they would win exactly two medals is also quite low, around 0.00087. In any case, we may think that equal likelihood of winning across all athletes and all countries is just not a good assumption, and prefer to partition by country.

Let X_i be the number of athletes entered by country i and Y_i the number of medals won by country i. There are only 87 countries for which Y_i ≥ 1, so most countries did not win a thing.
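The two probabilities quoted for India follow from the binomial formula under the homogeneity and independence assumptions just stated. A minimal sketch, using the rounded estimate p = 0.087 from the text (the helper name `binomial_pmf` is ours, introduced only for illustration):

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p_hat = 0.087    # rounded estimate 974 / 11249 used in the text
n_india = 124    # athletes submitted by India

p_no_medal = binomial_pmf(0, n_india, p_hat)    # (1 - 0.087)^124
p_two_medals = binomial_pmf(2, n_india, p_hat)  # C(124, 2) * 0.087^2 * 0.913^122

print(f"P(no medal)   = {p_no_medal:.7f}")
print(f"P(two medals) = {p_two_medals:.5f}")
```

Both numbers are tiny only because every athlete is given the same success probability; the calculation says nothing about whether that assumption is reasonable.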
The data are integer valued but cover a wide range (0 ≤ Y_i ≤ 121 and 1 ≤ X_i ≤ 552), which makes them hard to categorize as discrete or continuous. At the low end there is quite a bit of discreteness, meaning that multiple countries take the same value.

Probability, Statistics and Econometrics. http://dx.doi.org/10.1016/B978-0-12-810495-8.00028-2 Copyright © 2017 Elsevier Inc. All rights reserved.
Number of countries with Y = k and with X = k:

k    Y      X
0    120    0
1    21     1
2    10     9
3    6      10
4    6      11
5    3      18
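The Y column of this table can be compared with a Poisson benchmark. If the counts were Poisson, the quantities k·n_k/n_(k−1), where n_k is the number of countries with Y = k, would be roughly the same for every k, since they estimate k Pr(Y = k)/Pr(Y = k − 1). A minimal sketch using the counts from the table (the function name is ours):

```python
# Number of countries with Y = k medals for k = 0, ..., 5, copied from the table.
medal_counts = [120, 21, 10, 6, 6, 3]

def poisson_ratio_check(counts):
    """Return k * n_k / n_{k-1} for k = 1, 2, ...; roughly constant under a Poisson law."""
    return [k * counts[k] / counts[k - 1] for k in range(1, len(counts))]

ratios = poisson_ratio_check(medal_counts)
for k, r in enumerate(ratios, start=1):
    print(f"k = {k}: {r:.3f}")
```

The ratios differ by more than an order of magnitude across k, driven largely by the very large number of zeros.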
However, even for Y at the low end, the Poisson distribution does not fit well. Under a Poisson distribution with mean λ, the ratio

$$ \frac{k \Pr(Y = k)}{\Pr(Y = k - 1)} = \lambda $$

should be constant in k, but this seems far from the case. There are too many zeros. Indeed, the Poisson distribution would not be a good fit for the whole distribution. The sample mean of X is 54.0 and the sample standard deviation is 95.4, which is a gross violation of the mean-variance law of the Poisson (the variance should equal the mean). The median is 10. This is even more the case for Y, where the mean is 4.7 and the standard deviation is 13.0, and of course there are so many zeros that the median is 0. The frequency count shows the sparsity of X: most cells contain zero entries and the second most common count is one. There are many models for discrete data that can address some of these issues.

We next look at the top end of the distribution. Recall that for the discrete Zipf distribution or the Pareto distribution we have ln Pr(X > x) = a + b ln(x) for constants a, b, where the discrete Zipf’s law applies only to integers whereas the Pareto distribution is for continuous variation. We consider this only for x > 150, for which there are 20 observations. We compute this regression and find b = −2.3, which says that Pr(X > x) ∼ x^(−2.3) for large x; this decay rate is higher than has been found for cities and stock volume (b ∼ −1). The Gini coefficients are 0.71 for X and 0.85 for Y, which indicates a highly unequal distribution (according to the World Bank, the Gini coefficient for worldwide income was 0.68 in 2005).

We next consider the association between X and Y. There does seem to be a positive relationship between them. We run the regression with all n = 207 observations. The usual OLS estimate yields an R-squared of 0.729 and the following parameter estimates:

       constant    Athletes
est    −1.5914     0.1166
se     0.5427      0.0050
wse    0.6635      0.01944
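The rows of this table (estimate, conventional standard error, and White standard error) can be computed for a regression on a constant and one regressor without any matrix algebra. The following is a minimal sketch, not the computation used in the chapter: it assumes the HC0 form of White's estimator with no small-sample correction, the function name `simple_ols` is ours, and the data are made up for illustration.

```python
from math import sqrt

def simple_ols(x, y):
    """OLS of y on a constant and x, with conventional and White (HC0) standard errors."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    rss = sum(e * e for e in resid)
    tss = sum((yi - y_bar) ** 2 for yi in y)

    # Each estimate is a weighted sum of the y_i; these are the weights.
    w_slope = [(xi - x_bar) / sxx for xi in x]
    w_const = [1 / n - x_bar * w for w in w_slope]

    # Conventional standard errors use one common error variance estimate.
    s2 = rss / (n - 2)
    se = (sqrt(s2 * sum(w * w for w in w_const)),
          sqrt(s2 * sum(w * w for w in w_slope)))
    # White (HC0): use each observation's own squared residual instead.
    wse = (sqrt(sum(w * w * e * e for w, e in zip(w_const, resid))),
           sqrt(sum(w * w * e * e for w, e in zip(w_slope, resid))))
    return {"est": (intercept, slope), "se": se, "wse": wse, "r2": 1 - rss / tss}

# Hypothetical athlete and medal counts, for illustration only.
athletes = [2, 5, 10, 40, 120, 300, 500]
medals = [0, 0, 1, 2, 15, 30, 70]
print(simple_ols(athletes, medals))
```

The two kinds of standard error differ only in how the squared residuals are weighted, so they diverge when the residual variance changes with x.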
The t-statistics are very significant. The scatter plot in Fig. 25.1 shows the raw data and the regression line. There must be some heteroskedasticity here because when X is small Y must also be small. We compute White’s standard errors (wse in the table), which are larger than the OLS standard errors, but the coefficients are still statistically significant.

FIGURE 25.1 Medal count against athletes

We might believe that the regression goes through the origin because when X = 0 we must have Y = 0. In this case we obtain the parameter estimate 0.1094 with se 0.0044, so the slope does not change very much. We might believe there is a host country advantage, so we include a dummy variable for Brazil. The fit improves slightly, to an R-squared of 0.768. The host country dummy is negative and statistically significant: relative to the number of athletes they submitted, Brazil underperformed.

       constant    Host        Athletes
est    −1.873      −38.6195    0.1252
se     0.5062      6.6429      0.0048
wse    0.6336      8.2857      0.01869
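Because the Host dummy equals one for a single observation, its coefficient has a simple reading. Including it is equivalent to fitting the line on all other countries, and the dummy coefficient is the host's medal count minus the value of that line at the host's athlete count. This is a general property of single-observation dummies. A minimal sketch with made-up numbers (the data and helper names are hypothetical, not the Olympic data):

```python
def fit_line(x, y):
    """Least-squares intercept and slope of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    return y_bar - slope * x_bar, slope

def host_dummy_coefficient(x, y, host):
    """Coefficient on a dummy for observation `host` in the regression of y on (1, x, dummy)."""
    x_rest = [a for i, a in enumerate(x) if i != host]
    y_rest = [b for i, b in enumerate(y) if i != host]
    intercept, slope = fit_line(x_rest, y_rest)
    # The dummy absorbs the host's prediction error from the line fitted without it.
    return y[host] - (intercept + slope * x[host])

# Hypothetical data: the last observation plays the role of the host country.
athletes = [10, 50, 100, 200, 400, 450]
medals = [0, 4, 11, 22, 45, 20]
print(host_dummy_coefficient(athletes, medals, host=5))
```

A negative value means the host won fewer medals than the line fitted to the other countries would predict at its number of athletes.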
Perhaps it is better to work with logarithms. In that case we must drop the countries that did not win any medals. The scatter plot appears to be more homoskedastic, but there is still an issue with an oversupply of zeros (these were the ones in the previous model, since ln 1 = 0). A common approach in econometrics to dealing with zero values is censored regression. In that case assume that

$$ y_i^* = \beta^{\top} x_i + \varepsilon_i $$

for some variable y_i^*, and that we observe y_i = y_i^* 1(y_i^* > 0), which would account for the large number of zeros observed. In this case, E(y_i | x_i) is a nonlinear function of x_i that depends on the error distribution.
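To see the nonlinearity concretely, suppose (as an assumption made only for this illustration; the text does not specify an error distribution) that ε_i is normal with mean zero and standard deviation σ. Writing μ for the linear index, the censored mean is then E(y_i | x_i) = μ Φ(μ/σ) + σ φ(μ/σ), where Φ and φ are the standard normal distribution and density functions. A minimal sketch:

```python
from math import erf, exp, pi, sqrt

def normal_pdf(z):
    """Standard normal density."""
    return exp(-0.5 * z * z) / sqrt(2 * pi)

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def censored_mean(mu, sigma):
    """E[y] when y = y* 1(y* > 0) and y* is normal with mean mu and std. dev. sigma."""
    z = mu / sigma
    return mu * normal_cdf(z) + sigma * normal_pdf(z)

for mu in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"index = {mu:+.1f}: E[y | x] = {censored_mean(mu, 1.0):.4f}")
```

The mean is close to zero when the index is very negative and close to the index itself when it is large and positive, so it cannot be linear in x_i.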
FIGURE 25.2 Alternative regression models. (For interpretation of the references to colour in this figure, the reader is referred to the web version of this chapter.)
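The models compared in Fig. 25.2 are polynomial regressions of increasing order. One way to choose among such fits is an information criterion such as BIC. The following is a minimal sketch on synthetic data, not the chapter's computation: the data are generated from a quadratic plus a small alternating disturbance, the form n ln(RSS/n) + (number of coefficients) ln n is one common version of BIC and is assumed here, and the normal equations are solved directly, which is adequate only for low orders.

```python
from math import log

def solve(a, b):
    """Solve the square system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z

def poly_rss(x, y, degree):
    """Residual sum of squares from regressing y on 1, x, ..., x^degree."""
    powers = [[xi ** j for j in range(degree + 1)] for xi in x]
    xtx = [[sum(p[i] * p[j] for p in powers) for j in range(degree + 1)]
           for i in range(degree + 1)]
    xty = [sum(p[i] * yi for p, yi in zip(powers, y)) for i in range(degree + 1)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * pj for b, pj in zip(beta, p))) ** 2
               for p, yi in zip(powers, y))

def bic(x, y, degree):
    """Assumed form of BIC: n ln(RSS / n) + (number of coefficients) ln n."""
    n = len(x)
    return n * log(poly_rss(x, y, degree) / n) + (degree + 1) * log(n)

# Synthetic data: a quadratic plus a small alternating disturbance.
x = [float(i) for i in range(10)]
y = [1 + 2 * xi + 0.5 * xi ** 2 + 0.1 * (-1) ** i for i, xi in enumerate(x)]
best = min(range(1, 4), key=lambda k: bic(x, y, k))
print("order selected by BIC:", best)
```

Adding a term always lowers the residual sum of squares, so the ln n penalty per coefficient is what stops the criterion from automatically choosing the highest order.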
We investigate whether a linear relationship is correct here; specifically, we consider polynomials. We estimated the linear regression with covariates 1, x, ..., x^k for k = 1, 2, ..., 7 and show the fitted lines along with the raw data; see Fig. 25.2. The k = 1 case is shown in red, while the higher order polynomials are all in blue. The polynomial fitting leads to an improvement in fit, mostly with regard to how it takes account of the rightmost observation, the USA, which is a substantial positive outlier relative to the linear regression fit. The BIC criterion selected k = 7 (out of k = 1, 2, ..., 14).

One might wonder whether the linear regression (or polynomial regression) is well specified here. We might think that over time the number of athletes a country submits is related to the number of medals it won in previous Olympics, and therefore that X and Y are jointly determined. However, from the point of view of predicting the number of medals obtained in an Olympics, the number of athletes submitted by each country is known some time in advance and can be treated as given for the prediction exercise of the coming games.

There is a substantial betting market on outcomes of various sorts in the Olympics. Goldman Sachs and PwC have published articles that describe their methodology for predicting Olympic success. They essentially run regressions of win percentage (number of medals won divided by total number of available medals) on a variety of covariates. The choice of dependent variable doesn’t change the essentials of the linear regression, but it does make for comparability across Olympics, where the total number of available medals has grown over time. In our case, we might regress win percentage on the ratio of the number of athletes submitted by a country to the total number of athletes. These linear transformations will not change the R-squared but will change the coefficients. It might also make the analysis more relevant for future Olympics, where there is likely to be a further inflation of both totals. GS and PwC also included GDP, population, and past performance as regressors. Our analysis has not included those variables, in which case they might be considered omitted variables and our coefficients biased estimates. We also regress the ratio of gold medals to total medals on the number of athletes and find almost no linear relationship.
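The remark that rescaling leaves the R-squared unchanged but alters the coefficients can be made concrete. If Y = a + bX is the fitted line in counts, then dividing Y by the medal total and X by the athlete total gives an intercept of a divided by the medal total and a slope of b times the ratio of the athlete total to the medal total. A minimal sketch, assuming the shares are formed with the totals quoted at the start of the chapter (11249 athletes and 974 medals); the function names and the small data set used for the R-squared check are ours and purely illustrative:

```python
def rescale_coefficients(intercept, slope, total_x, total_y):
    """Coefficients of (y / total_y) on (x / total_x) implied by a fit of y on x."""
    return intercept / total_y, slope * total_x / total_y

def r_squared(x, y):
    """R-squared of y on a constant and x, i.e. the squared sample correlation."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# Estimates from the first regression table and the totals reported in the text.
const_share, slope_share = rescale_coefficients(-1.5914, 0.1166,
                                                total_x=11249, total_y=974)
print(f"intercept in shares: {const_share:.6f}")
print(f"slope in shares:     {slope_share:.4f}")

# Hypothetical counts, used only to check that R-squared is scale invariant.
x_counts = [10, 50, 100, 200, 400]
y_counts = [0, 4, 11, 22, 45]
x_shares = [a / sum(x_counts) for a in x_counts]
y_shares = [b / sum(y_counts) for b in y_counts]
print(r_squared(x_counts, y_counts), r_squared(x_shares, y_shares))
```

In share form the slope is a pure number: it measures how a country's share of medals moves with its share of athletes, which is what makes it comparable across games with different totals.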