Journal of Corporate Finance 45 (2017) 333–341
Misspecification in event studies☆

Joseph M. Marks a, Jim Musumeci b,⁎

a Department of Finance, AAC 273, Bentley University, Waltham, MA 02452, United States
b Department of Finance, MOR 107, Bentley University, Waltham, MA 02452, United States

☆ The authors thank Charlie Hadlock, Mingfei Li, Yun Ling, Richard Sansing, and participants in the Bentley University seminar series for their helpful comments. The usual disclaimer applies.
⁎ Corresponding author. E-mail addresses: [email protected] (J.M. Marks), [email protected] (J. Musumeci).
Article history: Received 31 December 2016; received in revised form 2 May 2017; accepted 8 May 2017; available online 9 May 2017

JEL classification: G14, C10, C15
Abstract

We examine the statistical error and efficiency associated with two commonly used event-study techniques when applied to samples of various sizes. Previous research has established that the frequently used Patell (1976) test is not well specified when the event itself creates additional return variance. We find that even under ideal conditions, when the event creates no additional variance, the Patell test rejects a true null hypothesis substantially more often than the stated significance level. In contrast, the alternate test of Boehmer et al. (1991) performs well in samples of all sizes and under all conditions we consider.

© 2017 Elsevier B.V. All rights reserved.
Keywords: Event study; Standardized abnormal return; Misspecification; Simulation; Patell test; BMP test
1. Introduction

Since the introduction of event-study methods in accounting (Ball and Brown, 1968) and financial research (Fama et al., 1969), they have been the predominant method of determining whether an event is associated with a change in firm value. Typically, a researcher collects a sample of firms experiencing the event and tests whether their returns on the event day are significantly different from what would be expected absent the event. The required statistical tests usually rely on the Central Limit Theorem (CLT),1 of which there are several versions. The most common version (Feller, 1968, for example) asserts that if $\{X_i\}_{i=1}^N$ is a random sample from a population with finite mean and variance, then the sampling distribution of the mean of $\{X_i\}$ approaches a normal distribution as N → ∞. The theorem is silent concerning the rate of convergence, although it is known that convergence is faster when the skewness and kurtosis of the underlying distribution are near their values for a normal distribution. Daily stock returns, however, are known to be both skewed and fat-tailed (e.g., Mandelbrot (1963) and Fama (1965)), and this remains true of abnormal returns calculated relative to a benchmark return (Brown and Warner, 1985).

1 One method that does not rely on the CLT is Corrado's (1989) non-parametric test.
Indeed, Mandelbrot and Fama speculated that the population of stock returns might not even have a finite variance, although subsequent research favored the conjecture that stock returns are drawn from a finite mixture of normal distributions (e.g., Campbell et al. (1997), pp. 18–19; Harris (1986); or Kon (1984)). A population variance that is not finite would imply we cannot rely on the CLT at all, but even the assumption that daily stock returns are drawn from a mixture distribution with fat tails means that we cannot assume "small" samples necessarily have the properties required to apply standard parametric tests. The size required for a sample of daily stock returns to be sufficiently large for an event-study test to be well specified is primarily an empirical matter.

Brown and Warner (1985) and Boehmer et al. (1991) examined this issue, but both used only 250 simulated portfolios of 50 events each. The use of only 250 simulated portfolios leads to confidence intervals that are quite wide, making it difficult to detect misspecification of a test. For example, at a significance level of α = 5% and with only 250 simulations, the 95% confidence interval suggested by the binomial approximation of the standard deviation, $\sqrt{npq}/n$, is $0.05 \pm 1.96\sqrt{250(0.05)(0.95)}/250$, or (2.3%, 7.7%). Given that the minimum increment to the rejection frequency is 1/250 = 0.4%, a method would have to reject a true null 60% more often than it should for us to conclude that it is misspecified. The small number of simulations was a constraint imposed by the computing power of that era, but it is not a significant limitation today.

We reexamine this issue using 10,000 simulated event portfolios. This provides tighter confidence intervals and thus more powerful tests for misspecification. We find that even if the event does not alter the dispersion of standardized abnormal returns (the abnormal return normalized by the standard deviation of the estimation-period residuals), the default method in Eventus is misspecified for samples from 100 to as large as 5000 firms. Like Boehmer et al. (1991) and Harrington and Shrider (2007), we find this misspecification is markedly larger if the event causes an increase in variance, which Harrington and Shrider show it necessarily does. The method of Boehmer et al. (1991), hereafter referred to as the BMP test, is based on a cross-sectional test of standardized abnormal returns and is well specified for all portfolio sizes we considered, in both the absence and presence of event-induced variance. Furthermore, while the default test statistic in Eventus is the Patell test, the BMP test statistic can just as easily be chosen by selecting the STDCSECT option. For researchers who access Eventus through Wharton Research Data Services, step 5 of the query form provides the ability to select different test statistics, including both Patell and BMP.2

Finally, because the large number of simulations might lead us to conclude that a test is statistically misspecified even though it produces good practical results, we apply Bayes' Theorem to compare the probability the null is false given that either test rejects it. We find that, compared with the BMP test, the Patell test provides lower confidence that the null is false given that its t-statistic is significant.
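To illustrate the precision gained by moving from 250 to 10,000 simulated portfolios, the following Python sketch (our own illustration; the function name and defaults are ours, not the authors') computes the binomial-approximation confidence interval for the observed rejection rate of a well-specified test.

```python
import math

def rejection_rate_ci(p=0.05, n_portfolios=250, z=1.96):
    """95% CI for the observed rejection frequency of a well-specified
    test with true rejection probability p, using the binomial (normal)
    approximation with standard error sqrt(p*q/n)."""
    se = math.sqrt(p * (1 - p) / n_portfolios)
    return p - z * se, p + z * se

# With 250 portfolios the interval is wide: roughly (2.3%, 7.7%).
print(rejection_rate_ci(n_portfolios=250))
# With 10,000 portfolios it tightens to about (4.6%, 5.4%).
print(rejection_rate_ci(n_portfolios=10_000))
```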
2. Description of data

Provided we had sufficient data during a 120-day estimation period as described below, we used the set of all daily CRSP return observations from 1926 to 2015 and total returns on the CRSP equally weighted index as inputs for the market model to find benchmark returns,

$E(R_{i,E}) = \hat{\alpha}_i + \hat{\beta}_i R_{M,E}$    (1)
where i denotes the firm, E the event day, M the CRSP equally weighted index, and $\hat{\alpha}_i$ and $\hat{\beta}_i$ the ordinary least squares (OLS) estimates from a 120-day estimation window ending two days prior to the event day.3 In an effort to ensure meaningful estimates of $\hat{\alpha}_i$ and $\hat{\beta}_i$, we required a minimum of 100 observations in the 120-day estimation period preceding the event. This left us with 68,934,304 values of $E(R_{i,E})$ across 24,021 securities (PERMNOs).

Next, we formed a simulated effect of an event by adding to the actual return, $R_{i,E}$, a variable $\Delta_{i,E}$ with mean Δ (0 or 0.25%) and a variance of θ (0 or 1) times the variance of the market-model residuals during the stock's estimation period. For example, Δ = 0 indicates there is no mean effect of the event, and θ = 1 means that the effect of the event has a variance equal to the estimation-period variance of the residual. We add this simulated effect of the event to the actual return on the event day and then subtract the benchmark expected return to obtain a simulated abnormal return:

$AR_{i,E} = \left[R_{i,E} + \Delta_{i,E}\right] - \left[\hat{\alpha}_i + \hat{\beta}_i R_{M,E}\right]$    (2)

We then calculate the standardized abnormal return,

$SAR_{i,E} = \dfrac{AR_{i,E}}{\sigma_i \sqrt{1 + \dfrac{1}{T} + \dfrac{\left(R_{M,E} - \bar{R}_M\right)^2}{\sum_t \left(R_{M,t} - \bar{R}_M\right)^2}}}$    (3)

as per Patell (1976), with $\sigma_i$ the standard deviation of the estimation-period residuals and the remainder of the denominator an adjustment for the fact that the benchmark return is an out-of-sample prediction.
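As a concrete reading of Eqs. (1)–(3), the Python sketch below computes a market-model standardized abnormal return. It is our own minimal illustration, not the authors' Matlab implementation; the function and variable names are ours.

```python
import numpy as np

def standardized_abnormal_return(r_est, rm_est, r_event, rm_event):
    """SAR per Eqs. (1)-(3): fit the market model by OLS over the
    estimation window, then scale the event-day abnormal return by the
    estimation-period residual s.d. with the out-of-sample adjustment."""
    T = len(r_est)
    beta, alpha = np.polyfit(rm_est, r_est, 1)     # Eq. (1) OLS estimates
    resid = r_est - (alpha + beta * rm_est)
    sigma_i = resid.std(ddof=2)                    # T - 2 degrees of freedom
    ar = r_event - (alpha + beta * rm_event)       # Eq. (2); any simulated
                                                   # effect is already in r_event
    adj = 1 + 1/T + (rm_event - rm_est.mean())**2 \
          / ((rm_est - rm_est.mean())**2).sum()    # out-of-sample adjustment
    return ar / (sigma_i * np.sqrt(adj))           # Eq. (3)
```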
When Δ = θ = 0, we would expect the average abnormal return and the average standardized abnormal return to have a mean of zero, and the latter to have a standard deviation of one.

2 For researchers who do not use Eventus, a simple Matlab function that conducts event studies and calculates the Patell and BMP test statistics has been written by the authors and is available for public download at http://atc3.bentley.edu/faculty/jmarks/event_study.zip. This function can be used directly or as a guide for writing a similar function in another language.
3 With this timing, the expected return computed in Eq. (1) is based on data independent of the actual event return (i.e., the event return and the last return in the estimation window do not share a price). The results we present are unaffected by ending the estimation window on the day prior to the event date.

3. Simulations

In this section we sample without replacement from the population of 68,934,304 expected daily returns to form 10,000 randomly selected portfolios, each containing 5000 hypothetical events. For each portfolio of 5000, we then choose nested subsets of sizes 2500, 1000, 500, 250, and 100.4 Then, for each event day and each combination of Δ and θ, we calculate the abnormal return as the difference between the simulated return ($R_{i,E} + \Delta_{i,E}$) and the market-model prediction ($\hat{\alpha}_i + \hat{\beta}_i R_{M,E}$). Finally, we use the estimation-period standard error to calculate the Patell (1976) test statistic (which is the default in Eventus (e.g., see Cowan (2007)) and is commonly used)5 and the BMP test statistic.

Both tests normalize the event-day abnormal return in Eq. (2) by the standard deviation of the estimation-period residuals (after adjusting for the fact that the benchmark return is an out-of-sample prediction) to obtain the standardized abnormal return (SAR) of Eq. (3). The Patell test assumes the resulting SARs are drawn from a t-distribution with degrees of freedom equal to the number of days in the estimation period minus two, while the BMP test uses the cross-sectional standard deviation of the SARs to calculate its t-statistic. Thus, the Patell test implicitly assumes that the event itself creates no additional variance, while the BMP test allows the event to create a shock whose variance is proportional to the variance of the residuals from the estimation period.

Research cited in the introduction has found that individual stock returns are not normally distributed, but instead are skewed to the right and have fat tails. The CLT guarantees only that, if the underlying population has a finite variance, the sampling distribution of the mean will eventually converge to a normal distribution as sample size increases. Given this and Kothari and Warner's (2007) observation that "for small samples…one cannot rely on asymptotic results for the central limit theorem," one could reasonably expect improved results for larger sample sizes. In the case of Δ = θ = 0, this would mean the frequency of rejections converges to the assumed significance level as sample size increases.

3.1. Tests of specification with no variance increase (Δ = 0, θ = 0)

Individual stock returns and residuals from the market model are not normally distributed, and therefore test results might be poor when the number of events studied is small. However, the CLT suggests that the performance of the test statistics should improve as portfolio size increases. Panel A of Table 1 reports that while the Patell test indeed works poorly for small sample sizes, its misspecification actually increases as sample size grows from 100 to 500, and then drops off only slightly after that. It rejects the null hypothesis significantly more often than it should for every portfolio size up to 5000 at significance levels of α = 5% and α = 1%. In particular, when α = 5% the lowest rejection rate associated with the Patell test is 7.04% (N = 100), over 40% more often than expected.
The binomial approximation estimates the standard deviation of the rejection rate to be $\sqrt{10{,}000(0.95)(0.05)}/10{,}000 \approx 0.218\%$. Thus the observed 7.04% rejection rate is about 9.36 standard deviations above the 5% theoretical rejection rate. The Patell test's worst performance occurs with a sample size of 500, when we rejected the null hypothesis 7.75% of the time,6 about 12.62 standard deviations above the expected 5% rate. To put this 7.75% rejection rate in context, using the Patell test is similar to rejecting the null hypothesis for a well-specified test at a significance level of α = 5% not at a t-value of 1.96, but instead at 1.77. The third row of Panel A shows similar results when α = 1%. When N = 100, the rejection rate is over twice its theoretical rate at 2.15% (t = 11.56). It peaks when N = 500 (rejection rate = 2.21%, or 12.16 standard deviations above the 1% theoretical value), then falls to 1.86%, or 8.64 standard deviations above the theoretical value, when N = 5000.

In contrast, for the BMP test reported in Panel B, there is no evidence of misspecification for any sample size or significance level. In all cases, the observed frequency of rejection associated with the BMP test is within a 95% confidence interval based on the binomial approximation.

To test whether outliers are causing the Patell test's misspecification, we winsorized the event-period residuals, $R_{i,E} - (\hat{\alpha}_i + \hat{\beta}_i R_{M,E})$, at 0.5% and 99.5% for the entire set of stock returns before forming portfolios or simulating an effect of the event. If outliers are the source of the excessive rejection rates documented in Table 1, then winsorizing should greatly mitigate the problem. Table 2 shows that the problem, while not eliminated, is substantially alleviated. For example, instead of rejecting the null hypothesis that the Patell test is well specified at α = 5% with t-values ranging from 9.36 (when N = 100) to 12.62 (when N = 500) as in Table 1, the rejection rates for the winsorized returns range from only 5.52% (t = 2.39) when N = 5000 to 5.91% (t = 4.18) when N = 1000.
4 This range is a bit wider than, but roughly comparable to, the number of events in several recent event studies. Specifically, we examined four event studies and found the actual sizes ranged from 364 events in Borokhovich et al. (2014) to 1132 in Chen et al. (2014). 5 For example, David and Ginglinger (2016) and Ang and Ismail (2015) specify use of the Patell test, and given it is the default in Eventus, we infer that expressions such as “standard event-study methodology” or “conventional event-study methodology” as in Andres and Hofbaur (2017), Dutordoir et al. (2016), and Chen et al. (2014) indicate use of the Patell test as well. 6 This is roughly consistent with the 7.6% rejection rate found by Boehmer et al., but as discussed earlier, because they used only 250 simulated portfolios, their 7.6% was not sufficiently large to reject the null hypothesis that the test was well specified. Brown and Warner (1985) used only one-tailed tests, so their results are not directly comparable.
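The two test statistics can be written compactly. The sketch below is our Python rendering of one common formulation, not the authors' code: the Patell statistic sums the SARs and scales by the standard deviation of that sum under the t-distribution assumption, while the BMP statistic is an ordinary cross-sectional t-test on the SARs. The 120-day default matches the paper's estimation window; the function names are ours.

```python
import numpy as np

def patell_z(sars, est_days=120):
    """Patell (1976): each SAR is assumed t-distributed with
    (est_days - 2) df, whose variance is (est_days - 2)/(est_days - 4);
    sum the SARs and scale by the s.d. of that sum."""
    n = len(sars)
    var_sar = (est_days - 2) / (est_days - 4)
    return sars.sum() / np.sqrt(n * var_sar)

def bmp_t(sars):
    """Boehmer et al. (1991): a cross-sectional t-test on the SARs,
    so event-induced variance appears in the denominator rather than
    inflating the statistic."""
    n = len(sars)
    return sars.mean() / (sars.std(ddof=1) / np.sqrt(n))
```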
Table 1
Frequency of rejection of null hypothesis SAR = 0 with no simulated abnormal return (Δ = 0) and no event-induced variance (θ = 0).

Panel A: Patell test

                                     N = 100   N = 250   N = 500   N = 1000   N = 2500   N = 5000
|t| > 1.96                           7.04%     7.45%     7.75%     7.70%      7.69%      7.50%
t-statistic (difference from 5%)     9.360     11.241    12.618    12.388     12.343     11.471
|t| > 2.57                           2.15%     2.19%     2.21%     2.16%      2.14%      1.86%
t-statistic (difference from 1%)     11.558    11.960    12.161    11.658     11.457     8.643

Panel B: BMP test

                                     N = 100   N = 250   N = 500   N = 1000   N = 2500   N = 5000
|t| > 1.96                           5.08%     5.07%     5.17%     5.05%      5.12%      4.78%
t-statistic (difference from 5%)     0.367     0.321     0.780     0.229      0.551      −1.009
|t| > 2.57                           1.06%     1.07%     1.02%     0.97%      0.94%      0.90%
t-statistic (difference from 1%)     0.603     0.704     0.201     −0.302     −0.603     −1.005
Rejection rates absent any actual abnormal return or variance increases. Even under these most favorable conditions, the Patell test rejects substantially more often than it should.
We emphasize that we are not suggesting returns be winsorized as a solution to the misspecification problem; instead, we employ winsorizing here (under the condition of simulated abnormal returns) only for statistical diagnostic purposes. The winsorizing process used here would not even be possible in an actual event study, because we cannot directly observe $R_{i,E} - (\hat{\alpha}_i + \hat{\beta}_i R_{M,E})$, the residual return absent the event; we can only observe the abnormal return including the event's effect, $[R_{i,E} + \Delta_{i,E}] - [\hat{\alpha}_i + \hat{\beta}_i R_{M,E}]$. Thus in practice any winsorizing would also necessarily alter the effect of the event itself.
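For readers replicating this diagnostic, the clipping step described above might look like the following sketch (a hypothetical helper of ours, not part of the authors' code); it winsorizes the event-period residuals at the 0.5th and 99.5th percentiles before any simulated event effect is added.

```python
import numpy as np

def winsorize(residuals, lower_pct=0.5, upper_pct=99.5):
    """Clip event-period residuals at the given percentiles, computed
    over the entire set of returns before portfolios are formed or an
    event effect is simulated (a diagnostic only; see text)."""
    lo, hi = np.percentile(residuals, [lower_pct, upper_pct])
    return np.clip(residuals, lo, hi)
```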
3.2. Test power with no variance increase (Δ = 0.25%, θ = 0)

Next we examine test power for an abnormal return uniformly equal to 0.25% (Δ = 0.25% and θ = 0). The results are reported in Table 3. At first glance it appears that the Patell test is slightly more powerful than the BMP test; for example, it rejects the null $\frac{73.51\%}{68.69\%} - 1 = 7.02\%$ more frequently when α = 5% and the sample size is 500. To see whether the Patell test is indeed more powerful, or whether its higher rejection rates merely reflect its tendency to reject more frequently as documented in Table 1, we again winsorize the event-period residuals, $R_{i,E} - (\hat{\alpha}_i + \hat{\beta}_i R_{M,E})$, at 0.5% and 99.5%. These results are shown in Table 4 and demonstrate that the Patell test's advantage in power falls substantially. For example, when α = 5% and the sample size is 500, winsorizing causes the Patell test to have fewer rejections (72.34% as opposed to 73.51% in the unwinsorized sample). For reasons discussed in Appendix A, the BMP test has more rejections (70.93% compared with 68.69% in the unwinsorized sample). In this winsorized case, the Patell test rejects only $\frac{72.34\%}{70.93\%} - 1 \approx 2\%$ more frequently than the BMP test. This suggests the Patell test's apparently greater test power is largely driven by outliers that cause it to reject more often, whether the null is true or false. This raises the issue of confidence that the null is false given that either test rejects it, which is examined in greater detail in Section 4.
Table 2
Rejection rates of a true null hypothesis (Δ = 0) with no event-induced variance (θ = 0), winsorized values of $R_{i,E} - (\hat{\alpha}_i + \hat{\beta}_i R_{M,E})$.

Panel A: Patell test

                                     N = 100   N = 250   N = 500   N = 1000   N = 2500   N = 5000
|t| > 1.96                           5.61%     5.86%     5.90%     5.91%      5.80%      5.52%
t-statistic (difference from 5%)     2.799     3.946     4.129     4.175      3.671      2.386
|t| > 2.57                           1.49%     1.31%     1.41%     1.36%      1.28%      1.09%
t-statistic (difference from 1%)     4.925     3.116     4.121     3.618      2.814      0.905

Panel B: BMP test

                                     N = 100   N = 250   N = 500   N = 1000   N = 2500   N = 5000
|t| > 1.96                           5.26%     5.15%     5.22%     5.05%      5.08%      4.86%
t-statistic (difference from 5%)     1.193     0.688     1.009     0.229      0.367      −0.642
|t| > 2.57                           1.10%     1.08%     1.06%     1.05%      0.94%      0.86%
t-statistic (difference from 1%)     1.005     0.804     0.603     0.503      −0.603     −1.407

Rejection rates when the event-period residuals, $R_{i,E} - (\hat{\alpha}_i + \hat{\beta}_i R_{M,E})$, are winsorized before simulating an abnormal return. While the Patell test still rejects more often than it should, the rates are substantially lower than in Table 1, suggesting it is primarily event-period outliers that cause the Patell test to be misspecified.
Table 3
Test power absent event-induced variance (Δ = 0.25%, θ = 0).

Panel A: Patell test

              N = 100   N = 250   N = 500   N = 1000   N = 2500   N = 5000
|t| > 1.96    23.85%    46.21%    73.51%    94.59%     99.97%     100.00%
|t| > 2.57    10.25%    25.86%    51.85%    85.13%     99.93%     100.00%

Panel B: BMP test

              N = 100   N = 250   N = 500   N = 1000   N = 2500   N = 5000
|t| > 1.96    19.84%    41.17%    68.69%    92.64%     99.81%     99.99%
|t| > 2.57    6.72%     19.94%    44.34%    80.55%     99.43%     99.96%
Absent any event-induced variance and given a simulated abnormal performance of 0.25%, the Patell test rejects the null hypothesis more often than does the BMP test. Table 4 suggests the reason is not greater test power, but rather outliers that inflate the Patell test's rejection rates even absent an event.
3.3. Specification and test power with a variance increase (Δ = 0 and Δ = 0.25%; θ = 1)

The results from Table 3, Panels A and B, indicate that even under simple conditions (i.e., when the effect of the event is identical for all firms), the Patell test is misspecified. As noted above, the Patell test implicitly assumes that there is no variance shift, while Harrington and Shrider (2007) show that cross-sectional variation in the event return necessarily leads to a variance increase. We now introduce variance in the simulated effect of the event while continuing to allow the mean effect to assume a value of either Δ = 0 or Δ = 0.25%. Because the results in this section are similar to those of Boehmer et al. (1991) and Harrington and Shrider (2007), we summarize them very briefly.

When the mean effect of the event is zero but there is event-induced variance equal to that of the estimation-period residuals, Table 5 indicates that the Patell test is severely misspecified for all sample sizes considered. For example, using a portfolio size of N = 500 and assuming a significance level of 5%, the Patell test rejects a true null hypothesis of zero abnormal return in 18.49% of the portfolios. To put this 18.49% rejection rate in context, it would be analogous to using 1.33 instead of 1.96 as the critical value for rejecting the null hypothesis in a well-specified test at a significance level of α = 5%. When α = 1%, the misspecification is even worse, with the Patell test rejecting the null hypothesis approximately 8% of the time across sample sizes.

Table 6 reports the results when Δ = 0.25% and θ = 1. It again appears that the Patell test is somewhat more powerful than the BMP test. However, as was the case in Tables 1 and 3, the Patell test seems more powerful only because its misspecification leads it to reject more often, both when the null is true and when it is false. In the next section we explore the tradeoffs between misspecification and test power.

4. Does the misspecification make a substantive practical difference?

How important is the degree of misspecification found in the previous section? Statistically speaking, the Patell test is severely misspecified, but what is the practical consequence of this observation? This is an important question because, by virtue of being the default event-study method in Eventus, the Patell test is used frequently in research. If its misspecification is not consequential, then George Box's observation (Box and Draper (1987), p. 74), "Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful?", is applicable. For example, if a test purports to reject a true null 5% of the time but actually rejects it 5.01% of the time, it is very unlikely that this minute degree of misspecification would deter anyone from using it if it is more powerful than alternative tests. The question is not whether a test is misspecified, but rather how misspecified it must be for us to avoid using it.
Table 4
Test power absent event-induced variance (Δ = 0.25%, θ = 0), winsorized values of $R_{i,E} - (\hat{\alpha}_i + \hat{\beta}_i R_{M,E})$.

Panel A: Patell test

              N = 100   N = 250   N = 500   N = 1000   N = 2500   N = 5000
|t| > 1.96    21.86%    43.98%    72.34%    94.58%     99.98%     100.00%
|t| > 2.57    8.74%     23.15%    49.36%    84.61%     99.93%     100.00%

Panel B: BMP test

              N = 100   N = 250   N = 500   N = 1000   N = 2500   N = 5000
|t| > 1.96    20.66%    42.46%    70.93%    93.99%     99.88%     99.99%
|t| > 2.57    7.17%     20.97%    46.91%    83.01%     99.60%     99.99%
When event-period residuals are winsorized before a simulated abnormal performance of 0.25% is added, the Patell test rejects the null only very slightly more often than does the BMP test. This suggests that the Patell test appears to be more powerful not because it is better at detecting abnormal performance, but because outliers cause it to reject more frequently whether the null is true or false.
Table 5
Test specification with event-induced variance (Δ = 0%, θ = 1).

Panel A: Patell test

                                     N = 100   N = 250   N = 500   N = 1000   N = 2500   N = 5000
|t| > 1.96                           18.34%    18.47%    18.49%    18.06%     18.78%     19.34%
t-statistic (difference from 5%)     61.208    61.805    61.896    59.923     63.227     65.796
|t| > 2.57                           7.74%     8.56%     8.26%     8.04%      8.38%      8.26%
t-statistic (difference from 1%)     67.740    75.981    72.966    70.755     74.172     72.966

Panel B: BMP test

                                     N = 100   N = 250   N = 500   N = 1000   N = 2500   N = 5000
|t| > 1.96                           4.78%     5.57%     5.06%     4.84%      4.98%      4.87%
t-statistic (difference from 5%)     −1.009    2.615     0.275     −0.734     −0.092     −0.596
|t| > 2.57                           1.06%     1.24%     1.01%     1.07%      0.97%      1.04%
t-statistic (difference from 1%)     0.603     2.412     0.101     0.704      −0.302     0.402

When the null hypothesis is true but the event itself creates additional variance, the Patell test is significantly misspecified, while the BMP test rejects at the appropriate levels.

Table 6
Test power with event-induced variance (Δ = 0.25%, θ = 1).

Panel A: Patell test

              N = 100   N = 250   N = 500   N = 1000   N = 2500   N = 5000
|t| > 1.96    31.33%    47.38%    67.99%    88.96%     99.75%     100.00%
|t| > 2.57    18.08%    31.42%    52.00%    79.04%     99.04%     100.00%

Panel B: BMP test

              N = 100   N = 250   N = 500   N = 1000   N = 2500   N = 5000
|t| > 1.96    13.61%    24.08%    44.17%    72.60%     98.06%     99.99%
|t| > 2.57    4.08%     9.62%     22.92%    48.74%     92.11%     99.84%
When the null hypothesis is false and the event creates additional variance, the Patell test rejects more frequently than does the BMP test. Once again, however, this does not imply it is a more powerful test; rather, it simply rejects more frequently whether the null is true or false.
The answer to this question is necessarily subjective and relies in part on the relative costs of Type I and Type II error.7 However, we can clarify one aspect of this issue by answering the question "How confident are we that the null is really false, conditional on our having rejected it?" for any tests being considered. This requires only a straightforward application of Bayes' Theorem:

$p(\text{null false} \mid \text{null rejected}) = \dfrac{p(\text{null rejected} \mid \text{null false})\, p(\text{null false})}{p(\text{null rejected} \mid \text{null false})\, p(\text{null false}) + p(\text{null rejected} \mid \text{null true})\, p(\text{null true})}$
Suppose that 80% of our null hypotheses are in fact true, and assume that when the null is false, Δ = 0.25%. Using a rejection rate of 7.75% for the Patell test when Δ = 0 and θ = 0 (Panel A of Table 1, N = 500) and a rejection rate of 73.51% when Δ = 0.25% and θ = 0 (Panel A of Table 3, N = 500) gives us

Patell's $p(\text{null false} \mid \text{null rejected}) = \dfrac{0.7351(0.2)}{0.7351(0.2) + 0.0775(0.8)} = 70.34\%$
In contrast, given the BMP test's 5.17% false rejection rate and its 68.69% test power, the same statistic is

BMP's $p(\text{null false} \mid \text{null rejected}) = \dfrac{0.6869(0.2)}{0.6869(0.2) + 0.0517(0.8)} = 76.86\%$
Thus, absent any event-induced variance and when N = 500 firms, switching from the Patell test to the BMP test produces a $\frac{0.7686}{0.7034} - 1 = 9.27\%$ increase in the probability the null is truly false given that the test rejects it.8
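The calculation is easy to reproduce. The following Python sketch (our own illustration, using the rejection rates from Tables 1 and 3) implements the Bayes' Theorem comparison; the function name and parameterization are ours.

```python
def p_false_given_rejected(power, size, prior_false=0.2):
    """P(null false | null rejected) via Bayes' Theorem, where `size` is
    the test's rejection rate under a true null and `power` its rejection
    rate under the assumed alternative (Delta = 0.25%)."""
    prior_true = 1 - prior_false
    return power * prior_false / (power * prior_false + size * prior_true)

# N = 500, no event-induced variance (Tables 1 and 3):
print(p_false_given_rejected(power=0.7351, size=0.0775))  # Patell: ~0.7034
print(p_false_given_rejected(power=0.6869, size=0.0517))  # BMP:   ~0.7686
```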
7 For example, if the test is to diagnose cancer, we might well prefer a somewhat misspecified test with high power to a perfectly specified test with low power, especially if the test statistic is a 0–1 variable. See Winkler (1972) for a discussion of such issues. 8 Bayes' Theorem can also be used to illustrate the cost of data mining. Suppose, for example, that a test is well specified at α = 5% and has test power of 50%. Suppose also that when tests are carefully and thoughtfully designed, the null will still be true 80% of the time, but if a researcher adopts an approach of “let's try this and see if we get significant results”, then the null will be true 90% of the time. An application of Bayes' Theorem based on these parameters gives us a 71.4% probability that the null is really false when the test rejects a carefully designed hypothesis, but only a 52.6% probability in the second case.
The BMP test's advantage increases substantially when event-induced variance is present. Using the values from Tables 5 and 6, these two statistics become

Patell's $p(\text{null false} \mid \text{null rejected}) = \dfrac{0.6799(0.2)}{0.6799(0.2) + 0.1849(0.8)} = 47.90\%$
and

BMP's $p(\text{null false} \mid \text{null rejected}) = \dfrac{0.4417(0.2)}{0.4417(0.2) + 0.0506(0.8)} = 68.58\%$
In this case, switching from the Patell test to the BMP test produces a $\frac{0.6858}{0.4790} - 1 = 43.17\%$ increase in the probability the null is truly false given that the test rejects it. Table 7 compares the two methods' probabilities that the null is false given that it is rejected at α = 5% for various proportions of true and false nulls being tested.

5. Conclusions

Researchers who conduct event studies typically rely on the Central Limit Theorem to determine appropriate rejection regions for statistical tests. However, the fact that individual stock returns are characterized by skewness and kurtosis suggests convergence of the mean's sampling distribution to a normal distribution might be substantially slower than generally assumed. Our results suggest that even in the absence of event-induced variance, the Patell test is misspecified for samples of up to 5000 firms. In contrast, the BMP test is well specified and only slightly less powerful. Consistent with previous research, we document that the misspecification of the Patell test increases dramatically in the presence of event-induced variance, while the BMP test remains well specified.

Finally, we quantify the practical impact of this misspecification by documenting substantial differences in the probability the null is truly false given that a test rejects it. By this measure, the degree of confidence a researcher can have that a documented abnormal event return is truly significant is invariably higher when using the BMP test. These findings are important because, as the default test statistic in Eventus, the Patell test is often reported. Fortunately, users of Eventus can just as easily conduct the BMP test by selecting STDCSECT as the desired test statistic. For researchers who do not use Eventus, a Matlab function that conducts event studies of the type studied in this article is available at the URL in footnote 2.

Appendix A. The effect of outliers on t-statistics

This appendix examines the effect one or more outliers will have on a t-test that a sample is drawn from a population with some hypothesized mean value. We begin with the case of a single outlier and then proceed to multiple outliers.9 Any set of observations $\{w_i\}_{i=1}^N$ will have a t-statistic equal to

$t = \dfrac{\bar{w} - \text{hypothesized value under the null}}{s/\sqrt{N-1}} = \dfrac{\bar{w} - \text{hypothesized value under the null}}{\sqrt{\sum_i (w_i - \bar{w})^2 / [N(N-1)]}}$    (A1)
We model a single outlier by assuming the observation $w_N$ is replaced with $w_N + \Delta$, and to avoid confusion we refer to this distribution with an outlier as $\{x_i\}_{i=1}^N$, where $x_i = w_i$ for $i < N$ and $x_i = w_i + \Delta$ for $i = N$. The t-statistic for $\{x_i\}_{i=1}^N$ can be written as

$t\left(\{x_i\}_{i=1}^N\right) = \dfrac{\bar{x} - \text{hypothesized value under the null}}{\sqrt{\left(\sum_{i=1}^N x_i^2 - N\bar{x}^2\right) / [N(N-1)]}}$    (A2)
Expressing this in terms of the original sample $\{w_i\}_{i=1}^N$ and the outlier's parameter Δ gives us

$t\left(\{x_i\}_{i=1}^N\right) = t\left(\Delta; \{w_i\}_{i=1}^N\right) = \dfrac{\bar{w} + \Delta/N - \text{hypothesized value under the null}}{\sqrt{\left[\sum_{i=1}^{N-1} w_i^2 + (w_N + \Delta)^2 - N(\bar{w} + \Delta/N)^2\right] / [N(N-1)]}}$    (A3)
We emphasize that we are not asserting that it is necessarily appropriate to use a standard t-test in the presence of such an outlier; we are merely determining what the t-statistic, as typically calculated, will be in such a case.

9 Note that we are using standard set notation for the observations j and k, which are necessarily different whenever j ≠ k, but not for the values of the jth and kth observations. In other words, $x_1$ and $x_2$ are necessarily different observations and so appear in $\{x_i\}_{i=1}^N$ exactly once each, but they may have the same value; i.e., $x_1$ may be 100, and so might $x_2$.
Table 7
Probability null is false given null is rejected.

Panel A: α = 5%, N = 500, θ = 0, Δ = 0 or 0.25%

Population % of null hypotheses that are false:
          10%       15%       20%       25%       30%       35%       40%
Patell    51.31%    62.60%    70.34%    75.97%    80.26%    83.63%    86.35%
BMP       59.62%    70.10%    76.86%    81.58%    85.06%    87.74%    89.86%

Panel B: α = 5%, N = 500, θ = 1, Δ = 0 or 0.25%

Population % of null hypotheses that are false:
          10%       15%       20%       25%       30%       35%       40%
Patell    29.01%    39.35%    47.90%    55.07%    61.18%    66.44%    71.03%
BMP       49.24%    60.64%    68.58%    74.42%    78.91%    82.46%    85.34%
This table uses Bayes' Theorem to estimate the probability the null is truly false conditional on either test's rejecting it. Panel A is based on the rejection rates from Tables 1 and 3, while Panel B is based on rejection rates from Tables 5 and 6.
We determine the effect of the outlying parameter Δ by observing what happens if we take the limit of Eq. (A3) as Δ → ∞:

$\lim_{\Delta \to \infty} t\left(\Delta; \{x_i\}_{i=1}^N\right) = \dfrac{\Delta/N}{\sqrt{\left[\Delta^2 - N(\Delta/N)^2\right] / [N(N-1)]}} = \dfrac{\Delta/N}{\sqrt{\Delta^2 (N-1)/N / [N(N-1)]}} = \dfrac{\Delta/N}{\Delta/N} = 1$    (A4)
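The limit in Eq. (A4) is easy to verify numerically. The short Python sketch below is our own illustration, not from the paper; it plants a single, increasingly large outlier in a fixed sample and shows the one-sample t-statistic being driven toward 1.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
w = rng.normal(size=500)              # baseline sample, true mean zero

for delta in (10, 100, 1_000, 10_000):
    x = w.copy()
    x[-1] += delta                    # a single outlier of size delta
    t_stat, _ = stats.ttest_1samp(x, 0.0)
    print(delta, round(t_stat, 3))    # t drifts toward 1 as delta grows
```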
This is well below the standard benchmarks corresponding to α = 1%, α = 5%, or even α = 10%. Thus the common fear that a single extreme outlier (a large value of Δ) would lead to rejections of a true null too frequently is unfounded. Large deviations Δ actually drive the t-statistic towards one, which is much smaller than the common benchmarks for statistical significance. While the preceding discussion has focused on the fear of rejecting true null hypotheses too frequently, nowhere did we actually use the premise that the null hypothesis was true. What we have, then, is a proof that the t-statistic used to test any null hypothesis, true or false, will be driven towards 1 by a positive outlier.10 While the concern of excessive Type I error is unwarranted, it is true that a single outlier will cause a loss of test power, i.e., lead to greater Type II error. On the other hand, the presence of an extreme outlier increases the likelihood that the underlying population is characterized by heavy skewness or fat tails, either of which makes application of the CLT problematic unless the sample size is extremely large.

What if there is more than one outlier? What if, for example, there are j observations that depend on Δ but move with Δ at different rates $k_m$? Now Eq. (A4) becomes

$\lim_{\Delta \to \infty} t\left(\Delta; \{k_m\}_{m=1}^j, \{x_i\}_{i=1}^N\right) = \dfrac{\sum_{m=1}^j k_m \Delta / N}{\sqrt{\left[\sum k_m^2 \Delta^2 - N\left(\sum k_m \Delta / N\right)^2\right] / [N(N-1)]}} = \dfrac{\sum_{m=1}^j k_m \Delta / N}{(\Delta/N)\sqrt{\sum k_m^2 - \sum_{m \neq n} k_m k_n / (N-1)}} = \dfrac{\sum_{m=1}^j k_m}{\sqrt{\sum k_m^2 - \sum_{m \neq n} k_m k_n / (N-1)}}$

If the sample size N is large, then the second term under the radical is negligible and we have

$\lim_{\Delta \to \infty} t \approx \dfrac{\sum_{m=1}^j k_m}{\sqrt{\sum k_m^2}}$    (A5)
This term is maximized when all the values of $k_m$ are equal to each other, in which case four ($t = 4/\sqrt{4} = 2$) is the minimum number of outliers required for the t-statistic to exceed 1.96, and seven ($t = 7/\sqrt{7} \approx 2.65$) is the minimum number required for t to exceed 2.57.
10 Similarly, a single negative outlier will drive the t-statistic towards −1.
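The $\sqrt{j}$ bound implied by Eq. (A5) can be checked the same way. In this hypothetical sketch of ours, j equal outliers drive the t-statistic toward $j/\sqrt{j} = \sqrt{j}$, so four are needed to exceed 1.96 and seven to exceed 2.57.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
w = rng.normal(size=500)              # baseline sample, true mean zero

# Eq. (A5): with j equal-rate outliers, t -> j/sqrt(j) = sqrt(j).
for j in (1, 4, 7):
    x = w.copy()
    x[-j:] += 1e6                     # j identical, huge outliers
    t_stat, _ = stats.ttest_1samp(x, 0.0)
    print(j, round(t_stat, 2))        # approximately 1.0, 2.0, 2.6
```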
References

Andres, C., Hofbaur, U., 2017. Do what you did four quarters ago: trends and implications of quarterly dividends. J. Corp. Finan. 43, 139–158.
Ang, J., Ismail, A., 2015. What premiums do target shareholders expect? Explaining negative returns upon offer announcements. J. Corp. Finan. 30, 245–256.
Ball, R., Brown, P., 1968. An empirical evaluation of accounting income numbers. J. Account. Res. 6 (2), 159–178.
Boehmer, E., Musumeci, J., Poulsen, A., 1991. Event study methodology under conditions of event-induced variance. J. Financ. Econ. 30, 253–272.
Borokhovich, K., Boulton, T., Brunarski, K., Harman, Y., 2014. The incentives of grey directors: evidence from unexpected executive and board chair turnover. J. Corp. Finan. 28, 102–115.
Box, G., Draper, N., 1987. Empirical Model-Building and Response Surfaces. John Wiley & Sons.
Brown, S.J., Warner, J.B., 1985. Using daily stock returns: the case of event studies. J. Financ. Econ. 14, 3–31.
Campbell, J., Lo, A., MacKinlay, A., 1997. The Econometrics of Financial Markets. Princeton University Press.
Chen, G., Kang, J.K., Kim, J.M., Na, H.S., 2014. Sources of value gains in minority equity investments by private equity funds: evidence from block share acquisitions. J. Corp. Finan. 29, 449–474.
Corrado, C., 1989. A nonparametric test for abnormal security price performance. J. Financ. Econ. 23, 385–395.
Cowan, A.R., 2007. Eventus User's Guide: Software Version 8.0, Standard Edition 2.1. Cowan Research L.C.
David, T., Ginglinger, E., 2016. When cutting dividends is not bad news: the case of optional stock dividends. J. Corp. Finan. 40, 174–191.
Dutordoir, M., Li, H., Liu, F.H., Verwijmeren, P., 2016. Convertible bond announcement effects: why is Japan different? J. Corp. Finan. 37, 76–92.
Fama, E., 1965. The behavior of stock-market prices. J. Bus. 38 (1), 34–105.
Fama, E., Fisher, L., Jensen, M., Roll, R., 1969. The adjustment of stock prices to new information. Int. Econ. Rev. 10, 1–21.
Feller, W., 1968. An Introduction to Probability Theory and Its Applications, vol. 1. John Wiley and Sons, New York.
Harrington, S., Shrider, D., 2007. All events induce variance: analyzing abnormal returns when effects vary across firms. J. Financ. Quant. Anal. 42 (1), 229–256.
Harris, L., 1986. Cross-sectional tests of the mixture of distributions hypothesis. J. Financ. Quant. Anal. 21 (1), 39–46.
Kon, S., 1984. Models of stock returns—a comparison. J. Financ. 39 (1), 147–165.
Kothari, S.P., Warner, J., 2007. The econometrics of event studies. In: Eckbo, B.E. (Ed.), Handbook of Corporate Finance, vol. 1. Elsevier.
Mandelbrot, B., 1963. The variation of certain speculative prices. J. Bus. 36 (4), 394–419.
Patell, J., 1976. Corporate forecasts of earnings per share and stock price behavior: empirical tests. J. Account. Res. 14 (2), 246–276.
Winkler, R., 1972. An Introduction to Bayesian Inference and Decision. Holt McDougal.