Technological Forecasting & Social Change 80 (2013) 1222–1231
An examination of factors affecting accuracy in technology forecasts
Shannon R. Fye, Steven M. Charbonneau, Jason W. Hay, Carie A. Mullins ⁎
The Tauri Group, Alexandria, VA 22310, USA
Article history: Received 30 March 2012; Received in revised form 2 August 2012; Accepted 16 October 2012; Available online 14 December 2012.
Keywords: Technological forecasting; Accuracy; Prediction; Methods
Abstract
Private and public organizations use forecasts to inform a number of decisions, including decisions on product development, competition, and technology investments. We evaluated technological forecasts to determine how forecast methodology and eight other attributes influence accuracy. We also evaluated the degree of interpretation required to extract measurable data from forecasts. We found that, of the nine attributes, only methodology and time horizon had a statistically significant influence on accuracy. Forecasts using quantitative methods were more accurate than forecasts using qualitative methods, and forecasts predicting shorter time horizons were more accurate than those predicting longer time horizons. While quantitative methods produced the most accurate forecasts, expert sourcing methods produced the highest number of forecasts whose events had been realized, indicating that experts are best at predicting if an event will occur, while quantitative methods are best at predicting when. We also observed that forecasts are as likely to overestimate how long it will take for a predicted event to occur as they are to underestimate the time required for a prediction to come to pass. Additionally, forecasts about computers and autonomous or robotic technologies were more accurate than those about other technologies, an observation not explained by the data set. Finally, forecasts obtained from government documents required more interpretation than those derived from other sources, though they had similar success rates.
1. Introduction

Technological forecasts predict the invention, timing, characteristics, dimensions, performance, and rate of diffusion of a device, material, technique, or process [1]. These forecasts help end users predict product development, anticipate competitors' technical capabilities, and avoid technological surprise. Given the widespread reliance on technological forecasts, it is helpful for end users to know how accurate forecasts are and if there are specific factors that affect accuracy in whole or for a subset of forecasts. By understanding the factors that affect accuracy, users can determine the usefulness of any given forecast.
⁎ Corresponding author at: 6363 Walker Lane, Suite 600, Alexandria, VA 22310, USA. Tel.: +1 724 449 6179. E-mail address: [email protected] (C.A. Mullins).
http://dx.doi.org/10.1016/j.techfore.2012.10.026
Evaluating forecast accuracy is a complex task. First, there is no universally accepted measure of forecast accuracy [2], though many have been suggested, such as the mean absolute error (MAE), mean squared error (MSE), mean percentage error (MPE), and mean absolute percentage error (MAPE) [3]. Second, assessing the accuracy of forecasts is confounded by ambiguous or vague forecast language, which requires interpretation by evaluators, thereby increasing uncertainty and bias in accuracy assessments [4]. Third, to determine a forecast's accuracy, evaluators must verify whether a predicted event has occurred as well as when it occurred. This increases the level of effort required of analysts and potentially limits the pool of assessable forecasts from which to draw statistical conclusions. Despite these obstacles, a number of studies have examined forecast accuracy. Studies show that quantitative methods produce more accurate forecasts than qualitative methods do [5]. Among quantitative forecasts, research shows that those using complex methods tend to be less accurate than those using simpler methods. This inverse relationship of
complexity and accuracy also applies to forecasts obtained using qualitative methodologies [6]. Although forecasts derived from expert methods are less accurate than those derived from quantitative methods, expert methods are widely used. Armstrong found that among expert methods, those that rely on the collaborative judgment of participants produce markedly less accurate forecasts than those that average expert input to generate forecasts [7]. Furthermore, Carbone and colleagues observed that when experts reviewed and modified quantitative forecasts, accuracy decreased [8]. Forecast time horizon also influences accuracy. Research by Makridakis, Hibon, and Moser, which used 111 time series to evaluate the accuracy of forecasting methods, found that as the time horizon of forecasting increases, accuracy decreases [9]. The type of technology about which the forecast is conducted also influences accuracy. Albright evaluated 100 forecasts made in 1967 that projected to 2000 [4]. Fewer than half of these forecasts had come to pass by 2000, and the greatest number realized made predictions about computer and communication technologies. Reviewing the literature on forecast accuracy illuminates several limitations of previous studies. First, the majority of the literature assesses only the influence of methodology and forecast time horizon on accuracy, possibly ignoring other important factors, such as the type of technology or its complexity. Second, most studies are based on small sample sizes, and those that compare the accuracy of forecast methodologies are often limited to a few methodologies. Lastly, few published studies evaluate the factors that affect the accuracy of technological forecasts. The purpose of this study is to empirically examine the factors affecting accuracy in technological forecasts using a sample size large enough to allow for statistically meaningful analysis. The study compared the accuracy of eight forecast methods to determine which produces the most accurate forecasts. However, we hypothesize that eight additional attributes—technology area, time horizon, prediction type, technology maturity, technology complexity, geographic origin of the forecast, geographic region forecasted about, and publication type—may also influence forecast accuracy. Since forecast language is often vague or ambiguous, we also provide a subjective measure of the degree of interpretation required to evaluate each forecast. Using this method, we were able to determine the effects of language specificity on evaluated forecast accuracy, thereby providing a measure of confidence in individual results. 2. Method To characterize the influence of forecast methodology and other attributes on accuracy, we 1) collected forecast documents from a variety of sources, 2) extracted technology-related forecasts from the documents and characterized their attributes, 3) verified whether the forecasted events occurred, and 4) assessed the accuracy of the verified forecasts. 2.1. Document collection We collected documents with technology forecasts from academic literature; industry organizations, associations, and societies; government agencies; market research firms; strategic
analysis firms; and trade press and popular media. To be included in the study, documents needed to be old enough to allow their forecasts to be verified, contain technological forecasts, and contain original forecasts rather than re-published forecasts originating from other documents. We collected a total of 424 documents. Of these, 210 documents contained forecasts that fit our selection criteria and qualified for inclusion in the study. Table 1 shows the distribution of documents among geographic regions and publication types. Despite attempts to collect documents from a variety of geographic regions, the majority of forecast documents—almost two thirds—originated from North America. The documents were more evenly distributed among publication types, though we found only six documents from market research firms, primarily because these are sold as research reports and are not openly available.
2.2. Forecast extraction and attribute recording We extracted 975 forecasts from our collection of forecast documents; 811 were assessable. To be assessable, forecasts needed to meet five inclusion criteria. First, they had to be accompanied by a clear time horizon. Second, they needed to be old enough to allow their predictions to be evaluated. Third, they had to make predictions about technology emergence, evolution, migration to a new market or use, or market penetration. Fourth, the methodology by which they were produced needed to be identifiable. Lastly, their language needed to be specific enough to allow for subsequent analysis. Most forecast documents yielded fewer than 10 forecasts that met these criteria. Forecast language is often vague or ambiguous, requiring interpretation [4]. In many cases, extracted forecasts were specific enough for analysis but contained words or phrases that required interpretation before analysis could begin. We developed a standard lexicon to facilitate the consistent interpretation of vague terminology commonly used in forecasts. Analysis rules were included in the standard lexicon to
Table 1
Forecast document distribution among geographic regions and publication types. The majority of forecast documents originated from the Americas and were derived from academic publications.

Geographic origin                                      Documents
  Americas                                             137
  Worldwide                                            26
  Europe                                               25
  Asia                                                 16
  Multi-regional                                       5
  Oceania                                              1

Publication type                                       Documents
  Academic publications                                75
  Trade press and popular media                        42
  Government reports and roadmaps                      42
  Industry organizations, associations, and societies  25
  Strategic analysis firms                             20
  Market research firms                                6
ensure different types of forecasts, such as multi-variable and contingent forecasts,1 were evaluated consistently. Table 2 provides examples of the terminology and analysis rules captured in the standard lexicon. We characterized assessable forecasts based on the nine attributes we hypothesized may influence accuracy. Each of these attributes is detailed in Table 3. Assessable forecasts and their associated interpretations were entered into a web-based database that served as a repository for all analysis information. 2.3. Forecast verification We verified assessable forecasts by determining whether the forecasted event occurred and, if so, when it occurred. Open-source research was the primary means of verification. We attempted to verify forecasts using two independent sources to ensure information was correct and unbiased. Only sources with journalistic standards and appropriate subject matter expertise were used for verification. If sources conflicted—or if we were unable to find verifying information using open-source research—we relied on expert opinion for clarification or verification. We used a five-point Likert scale to record the degree of interpretation involved in verifying forecasts. 2 Using these methods, we verified nearly 40% of the assessable forecasts (310 of the 811). Of these 310 forecasts, 15 were outliers that were excluded from the analysis, resulting in a final sample size of 295 forecasts; 201 of these verified forecasts were realized before we completed our research in mid-2011. All verification information, including the degree of interpretation rating, analyst justifications, source citations, and page numbers, was included in the database in sufficient detail to allow each forecast's verification to be reproduced by other analysts. Table 3 provides the number of verified forecasts associated with each attribute. The remaining 501 unverified forecasts were not included in the statistical analysis. The inability to verify these forecasts is not an indictment on the forecasts themselves but instead a result of the time available to research them and the availability of verifying information. 2.4. Forecast accuracy assessment Using the 295 verified forecasts, we 1) compared forecast success rates against a control, 2) compared the success rates of different forecast methodologies, 3) compared temporal
1 Multi-variable forecasts are those that predict multiple events within the same forecast statement. We evaluated these forecasts by treating each individual event as an independent forecast. Contingent forecasts are those that indicate the forecasted event is dependent upon another event occurring. We did not include these forecasts in our study because it was not possible to determine whether forecast success could be attributed to causal factors. 2 The Likert scale included five metrics: all interpretation, a lot of interpretation, moderate interpretation, little interpretation, and no interpretation. The database included a comment field for analysts to justify their assessments.
Table 2
Sample standard lexicon and analysis rules for interpreting language relating to time horizon. The lexicon and rules helped analysts translate and assess forecasts consistently.

Forecast language                                     Translation
  Few                                                 3
  Several, multiple                                   4–10
  Multitude                                           >10
  Near term, short term, near future, soon            1–5 years
  Mid term, not-too-distant future                    6–10 years
  Long term                                           11–25 years
  "By the year…"                                      Evaluated date is the forecasted year
  "In the next x years…", "Within x years…"           A date range from the year of the forecast plus x years
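Because the lexicon is essentially a lookup from vague phrases to year ranges, it lends itself to a simple machine-readable encoding. The sketch below is illustrative only (it is not part of the study's toolchain) and assumes the entries shown in Table 2; the function name and its arguments are hypothetical.

```python
# Translation of vague forecast language into year ranges, following Table 2.
# Entries for vague terms are relative offsets in years from the forecast date.
LEXICON = {
    "few": (3, 3),
    "several": (4, 10),
    "multiple": (4, 10),
    "multitude": (11, None),       # ">10"; no upper bound given in the lexicon
    "near term": (1, 5),
    "short term": (1, 5),
    "soon": (1, 5),
    "mid term": (6, 10),
    "long term": (11, 25),
}

def translate(phrase, forecast_year=None, x=None):
    """Map a vague phrase, or an explicit 'within x years' form, to a year range."""
    phrase = phrase.lower().strip()
    if phrase in LEXICON:
        return LEXICON[phrase]                      # relative range in years
    if phrase in ("in the next x years", "within x years") and x is not None:
        # Rule from Table 2: a date range from the year of the forecast plus x years.
        return (forecast_year, forecast_year + x)   # absolute calendar years
    raise KeyError(f"no rule for: {phrase}")

print(translate("near term"))                                   # (1, 5)
print(translate("within x years", forecast_year=2005, x=7))     # (2005, 2012)
```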
error among verified forecasts, and 4) attempted to develop a predictive model of forecast performance. Fig. 1 describes the process by which we determined forecast accuracy. 2.4.1. Test against a control We used a binomial test to compare the success rates of verified forecasts with a hypothesized probability of success based on a random guess. A forecast was considered successful if the forecasted event was realized within ±30% of the total forecasted time horizon around the forecasted year (the time horizon is the number of years between the date an event is predicted to happen and the date the forecast was made). We refer to this as the allowable error. Some forecasts provided a specific temporal range for the forecasted event. In these cases, a forecast was considered successful if it fell within the given range. Forecasts occurring outside of the specified limits were not considered successful. To prevent outliers in the data from influencing the test, we based our theoretical probability of success on the distribution of statistically useful records that fell within the 99th percentile of data for time horizon—thereby excluding the extreme outliers in forecast length—and the 95th percentile for temporal error. The trimmed data set included all records with a forecast length no greater than 20 years and no forecasts with a temporal error greater than 10 years. Using these constraints, we excluded 4 records that had forecast lengths in excess of 20 years and 11 additional forecasts that had temporal errors in excess of 10 years from the original set of 310 verified forecasts, for a final sample size of 295 forecasts. All of the excluded records were failed forecasts, resulting in a conservative test in favor of the forecasting methods. Since the 30% allowable error was subjectively derived, we conducted a sensitivity analysis of the allowable error to determine whether our findings were sensitive to the selection of 30%. We varied the allowable error from 0% (event must occur in the year forecasted) to 100% (event must occur within twice the length of the forecast) to determine at which point (if any) our results changed. 2.4.2. Comparison of success rates among all attributes In addition to testing forecast methodologies against a control, we also compared forecast success rates across the
Table 3
Verified forecasts per attribute and subattribute. The number of assessable forecasts for each subattribute is included to illustrate the difficulty of verifying whether and when forecasted events occurred.

Attribute / subattribute                               Assessable forecasts    Verified forecasts
Methodology
  Expert sourcing                                      314                     122
  Expert analysis                                      133                     72
  Quantitative trend analysis                          121                     30
  Source analysis                                      67                      30
  Models                                               60                      19
  Gaming and scenarios                                 51                      16
  Qualitative trend analysis                           54                      14
  Multiple                                             11                      7
Technology area
  Communications technology                            101                     50
  Biological technology                                68                      35
  Computer technology                                  114                     34
  Space technology                                     42                      34
  Energy and power technology                          107                     33
  Ground transportation technology                     49                      28
  Autonomous/robotics technology                       63                      20
  Air transportation technology                        53                      18
  Photonics and phononics technology                   30                      14
  Sensor technology                                    42                      12
  Physical, chemical, and mechanical systems           27                      11
  Production technology                                28                      10
  Materials technology                                 51                      8
  Maritime transportation technology                   36                      3
Prediction type
  Market penetration                                   322                     111
  Technology emergence                                 191                     91
  Technology evolution                                 245                     89
  Technology migration                                 53                      19
Technology maturity
  TRL 7-9                                              452                     167
  TRL 4-6                                              179                     78
  TRL 1-3                                              147                     55
  New application of existing technology               32                      10
  Undetermined                                         1                       -
Technology complexity
  System of systems                                    271                     127
  System                                               267                     94
  Component                                            180                     49
  Subsystem                                            93                      40
Time horizon
  Short-term                                           475                     188
  Medium-term                                          176                     72
  Long-term                                            160                     50
Geographic origin of forecast
  North America                                        563                     229
  Western Europe                                       56                      28
  Multi-regional                                       57                      18
  Eastern Asia                                         66                      16
  Northern Europe                                      52                      10
  Southern Europe                                      4                       3
  Southern Asia                                        8                       2
  Australia and New Zealand                            3                       2
  Eastern Europe                                       1                       1
  Southeast Asia                                       1                       1
Geographic region forecasted about
  Multi-regional or worldwide                          424                     144
  Northern America                                     279                     128
  Eastern Asia                                         72                      28
  Southern Asia                                        11                      5
  Northern Europe                                      6                       2
  Western Europe                                       8                       1
  Eastern Africa                                       2                       1
  Australia and New Zealand                            1                       1
  Southern Europe                                      8                       -
Publication type
  Academic publications                                338                     115
  Government reports and roadmaps                      311                     82
  Trade press and popular media                        87                      41
  Industry organizations, associations, and societies  90                      39
  Strategic analysis firms                             77                      31
  Market research firms                                8                       2
nine attributes to determine whether some attributes produced more successful forecasts. For this analysis, we used Fisher's Exact Test. The general form of these hypothesis tests was:

H0: μ1 ≤ μ2
HA: μ1 > μ2

where μ1 is the success rate of the first attribute value, and μ2 is the success rate of the second attribute value.
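For readers who want to reproduce this kind of comparison, the following is a minimal sketch using scipy's implementation of Fisher's Exact Test on a 2×2 table of successful versus unsuccessful forecasts for two attribute values. The counts shown are placeholders, not figures from the study.

```python
from scipy.stats import fisher_exact

# Rows: attribute value 1, attribute value 2
# Columns: successful forecasts, unsuccessful forecasts
# Placeholder counts for illustration only.
table = [[18, 10],    # attribute value 1
         [46, 76]]    # attribute value 2

# One-sided test of H0: mu1 <= mu2 against HA: mu1 > mu2,
# i.e., the first attribute value has the higher success rate.
odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"odds ratio = {odds_ratio:.2f}, p-value = {p_value:.4f}")
```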
2.4.3. Comparison of temporal error We also evaluated temporal error for the forecasts with respect to methodology. The temporal error is the number of years between the forecast realization and the date the forecasted event was predicted to occur. We assert that forecasts with a small temporal error are more accurate than those with a large temporal error, and forecasting methodologies that produce forecasts with a small average temporal error are more accurate than methodologies associated with large average temporal errors. We used two metrics to calculate temporal error: the mean absolute error (MAE) and the mean error (ME). The MAE provides insight about the average magnitude of errors associated with each method. A large average MAE indicates that a forecast methodology is less accurate. The ME provides insight about the distribution of optimistic and pessimistic forecasts. A method with a negative ME tends to overestimate the time horizon necessary for the event. A positive average ME indicates a tendency to underestimate timing of events. The temporal error analysis was based on the 201 forecasts in the sample set that had realized events.
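The sketch below shows one way the two temporal-error metrics could be computed for realized forecasts grouped by methodology, following the sign convention above (temporal error = realization year minus predicted year, so a negative mean error indicates an overestimated time horizon). The records and column names are hypothetical.

```python
import pandas as pd

# Hypothetical realized forecasts; the error column follows the definition in
# the text: realization year minus the year the event was predicted to occur.
records = pd.DataFrame({
    "method":    ["Expert sourcing", "Expert sourcing", "Quantitative trend analysis",
                  "Quantitative trend analysis", "Models"],
    "predicted": [2005, 2010, 2008, 2012, 2009],
    "realized":  [2003, 2013, 2009, 2011, 2009],
})

records["error"] = records["realized"] - records["predicted"]

summary = records.groupby("method")["error"].agg(
    MAE=lambda e: e.abs().mean(),   # mean absolute error: magnitude of the miss
    ME="mean",                      # mean error: optimistic vs. pessimistic bias
)
print(summary)
```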
Fig. 1. Methodology for assessing forecast accuracy. Accuracy was determined by calculating realization rate, accuracy, and success rate.
Due to the non-normal nature of our data and the dissimilar distributions of the data for each methodology, neither the ANOVA nor the Kruskal–Wallis test was appropriate for analyzing temporal error. Consequently, we developed a non-parametric approximation of the Tukey–Kramer Honestly Significant Difference Test to determine whether certain forecasting methods were more accurate than others. Our test was:

H0: μQT = μi
HA: μQT ≠ μi

where μQT is the mean of the MAE for quantitative trend analysis, and μi is the mean of the MAE for forecasting method i.

We generated a 99% confidence interval for each methodology and then subtracted the confidence interval for the quantitative trend analysis method from each of the other methods. If zero was included in the resulting confidence interval, then we could not reject the null hypothesis. If zero was excluded from the resulting confidence interval, then we could reject the null hypothesis. If the lower bound of the resulting confidence interval was positive, the test implied μQT was less than μi, and if the upper bound of the difference confidence interval was negative, the test implied μQT was greater than μi.
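The following is a minimal sketch of that comparison under one added assumption: bootstrap percentile intervals are used as the non-parametric 99% confidence intervals, since the paper does not state how its intervals were constructed. The error samples are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(abs_errors, level=0.99, n_boot=10_000):
    """Non-parametric (bootstrap percentile) confidence interval for the mean MAE."""
    abs_errors = np.asarray(abs_errors, dtype=float)
    means = np.array([rng.choice(abs_errors, size=abs_errors.size, replace=True).mean()
                      for _ in range(n_boot)])
    alpha = (1.0 - level) / 2.0
    return np.quantile(means, [alpha, 1.0 - alpha])

def compare_to_quantitative(mae_method, mae_quant):
    """Subtract the quantitative-trend CI from the method's CI and test for zero."""
    lo_m, hi_m = bootstrap_ci(mae_method)
    lo_q, hi_q = bootstrap_ci(mae_quant)
    diff = (lo_m - hi_q, hi_m - lo_q)          # CI of the difference
    reject = not (diff[0] <= 0.0 <= diff[1])   # reject H0 only if zero is excluded
    return diff, reject

# Synthetic absolute temporal errors (years) for two methods, for illustration.
mae_expert = np.abs(rng.normal(3.9, 3.8, size=90))
mae_quant = np.abs(rng.normal(1.2, 2.6, size=20))
diff_ci, reject = compare_to_quantitative(mae_expert, mae_quant)
print(f"difference CI = ({diff_ci[0]:.2f}, {diff_ci[1]:.2f}), reject H0: {reject}")
```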
2.4.4. Predictive model for forecast performance

To determine if there are combinations of attributes that contribute to forecast accuracy, we developed linear and polynomial multivariate regression models using the nine attributes as variables. We also applied transformations to both the statistic of interest (ME) and length of forecast—the years between the date the forecast was made and the date that the forecast was realized. All attributes other than length of forecast were categorical variables and thus could not benefit from transformations. The intent of this test was to determine whether a forecast user could predict the ME associated with that forecast given its associated attributes.
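A sketch of one such regression under simplifying assumptions: only two of the nine attributes are shown, the column names and records are hypothetical, and ordinary least squares from statsmodels stands in for whatever fitting routine the authors used. The categorical attribute enters as dummy variables, and the White-Huber (heteroscedasticity-consistent) correction discussed in Section 3.4 is requested through the cov_type argument.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical verified-forecast records; columns are illustrative only.
data = pd.DataFrame({
    "mean_error":  [-2.0, 1.5, 0.0, 3.0, -1.0, 0.5, 2.0, -0.5, 4.0, -3.0],
    "length":      [3, 8, 5, 15, 4, 6, 10, 2, 18, 7],   # years from forecast to realization
    "methodology": ["quant", "expert", "expert", "models", "quant",
                    "expert", "models", "quant", "expert", "models"],
})

# Categorical attributes become dummy variables via C(); length enters directly.
model = smf.ols("mean_error ~ C(methodology) + length", data=data)

# White-Huber heteroscedasticity-consistent standard errors.
fit = model.fit(cov_type="HC1")
print(fit.params)
print("adjusted R^2:", fit.rsquared_adj)
```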
2.5. Forecast interpretation

In addition to evaluating forecast accuracy, we evaluated the degree of interpretation involved in verifying forecasts and its potential effect on accuracy. We ascertained the degree of interpretation involved in extracting measurable data from forecasts by comparing the interpretation rankings that analysts assigned to forecasts during the verification process across the nine attributes. This interpretation ranking provides a subjective assessment of the difficulty associated with understanding a forecast and comparing it to available data on historic events. Forecasts with a high degree of interpretation required more analyst manipulation to be assessed and may therefore have more evaluator-induced error.

3. Results and discussion

3.1. Test against a control
Our theoretical success rate is based on a randomly generated guess between one and 20 years, with an event
occurrence of one to 30 years. These upper limits on length of forecast and time to event realization align with our observed data distribution for the trimmed data set described in Section 2.4.1. Assuming a 30% allowable error, the probability of success for the random guess described above is 24.7%. Thus, we used ρ = 24.7% for the hypothesized probability of success. Our hypothesis is:

H0: μf ≥ ρ
HA: μf < ρ

where μf is the observed success rate of forecast methodology f, and ρ is the hypothesized probability of success.
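To make the control concrete, here is a minimal sketch (not the authors' code) of the baseline and the one-sided binomial test. It assumes, for simplicity, uniform draws for the guessed horizon (1–20 years) and the event time (1–30 years); the paper derived its 24.7% figure from the trimmed data distribution, so the simulated value will differ somewhat. The success counts passed to the test are placeholders.

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)

def random_guess_success_rate(n_trials=1_000_000, allowable_error=0.30):
    """Estimate the success rate of a random guess by simulation.

    Simplifying assumption (not from the paper): the guessed horizon is uniform
    on 1-20 years and the event occurs uniformly 1-30 years out. A guess succeeds
    if the realization falls within +/- allowable_error * horizon of the guess.
    """
    horizon = rng.integers(1, 21, size=n_trials)      # guessed horizon (years)
    realized = rng.integers(1, 31, size=n_trials)     # actual time to event (years)
    return float(np.mean(np.abs(realized - horizon) <= allowable_error * horizon))

print(f"simulated baseline: {random_guess_success_rate():.3f}")

# One-sided binomial test for a single methodology, using the paper's rho = 0.247.
# H0: success rate >= rho;  HA: success rate < rho.
# The counts below are placeholders, not figures from the study.
successes, n = 18, 28
result = binomtest(successes, n, p=0.247, alternative="less")
print(f"observed rate = {successes / n:.3f}, p-value = {result.pvalue:.3f}")
```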
Table 4 provides the results of the binomial test. We failed to reject the null hypothesis, supporting our assumption that forecasts are better than a random guess. When comparing each forecast methodology against the control, tests for seven of the eight methods produced very high p-values. We interpret this as evidence that these methods exhibit better success than the random guess. We note that forecasts using multiple methodologies and qualitative methodologies have a relatively small sample size (6 and 13 forecasts, respectively). While the test results do appear to strongly support the null hypothesis, we make no claims regarding their efficacy. Only source analysis has an observed success rate less than our theoretical success rate. However, we cannot reject the null hypothesis at a significance level of α = 0.10.

Table 4
Methodology success rates compared to control given 295 observations and 30% allowable error. Although qualitative trend analysis and forecasts that used multiple methodologies had high success rates, the small number of forecasts using these methodologies precluded our ability to draw meaningful statistical conclusions about their success.

Methodology                    Percent of events    Percent of successful    p-Value
                               that occurred        forecasts
  Quantitative trend analysis  67.9                 64.3                     1.000
  Expert sourcing              75.7                 38.3                     1.000
  Models                       64.7                 35.3                     0.898
  Expert analysis methods      57.7                 32.4                     0.946
  Gaming and scenarios         50.0                 31.3                     0.818
  Uninformed guess             –                    24.7                     –
  Source analysis              53.6                 14.3                     0.143
  Multiple                     50.0                 50.0                     0.964
  Qualitative trend analysis   78.6                 42.9                     0.964
  Average                      66.1                 36.9                     1.000

The quantitative trend analysis methodology exhibited the highest success rate of the methods evaluated. To determine whether this method was better than the other methods, we compared it to the next highest scoring method with a sufficiently large sample size, expert sourcing. Using Fisher's Exact Test, we compared quantitative trend analysis to expert sourcing with a resulting p-value of 0.023, providing strong statistical evidence that forecasts generated using quantitative trend analysis methods have a higher success rate than do forecasts generated using other methods. We interpret this to mean that quantitative forecasting methods perform better than qualitative forecasting methods—consistent with other research [5]. We point out, however, that while quantitative methods had the best success rate, expert sourcing had the best realization rates at 78%. This suggests that experts are best at predicting if an event will occur, while quantitative methods are best at predicting when an event will occur.

Source analysis had the lowest degree of success overall and across all time horizons. The comparatively small p-value associated with source analysis calls into question whether this method produces better results than a randomly generated forecast. Few of the verified forecasts that were generated using source analysis involved scientific source analysis methods, such as econometrics and scientometrics. Consequently, these findings can only be extrapolated to the subset of source analysis forecast methods that identify general trends in published sources, such as environmental scanning and key technology analysis.

The results of the sensitivity analysis are provided in Table 5. With the exception of gaming and scenarios, all methods responded to the sensitivity analysis as expected. The aberrant results associated with the gaming and scenarios method are due to a bimodal distribution of temporal errors. For this method, temporal errors were extremely small (less than one year) or extremely large (twice as long as the forecast period itself). The method's success rate did not benefit from the increased allowable range until the range was large enough to include the realized forecasts with large errors, at 90%. We also note that the multiple and qualitative trend analysis methodologies had an unusually large percentage of range forecasts (over 70%). Consequently, they were relatively unaffected by the effects of the sensitivity analysis. While we attempted to eliminate bias in the sensitivity analysis when increasing the allowable range, reducing the range did introduce some bias because we only gathered forecasts that were consistent with the 30% allowable error.

3.2. Comparison of success rates among attributes

Our analysis of accuracy among attributes revealed little about how prediction type, technology maturity, technology complexity, geographic origin, or geographic region forecasted about influence accuracy. However, we did observe that forecasts projected over short and medium terms were equally successful—at 38% and 39%, respectively—and were more accurate than long-term forecasts, which had a 14% success rate. When comparing long-term forecasts to theoretical distributions of randomly generated long-term forecasts (ρ = 34%), we rejected the null hypothesis and concluded there is statistical evidence that long-term forecasts have a worse success rate than a random guess. In addition to time horizon, we also observed that technology area might influence accuracy. Table 6 provides the comparison of technology area success rates derived from Fisher's Exact Test. Forecasts for computer and autonomous/robotics technologies—with 59% and 55% success rates, respectively—were statistically the most successful, followed
Table 5
Results of sensitivity analysis on allowable range. Values represent observed success rates given changes in allowable range.

Forecast method               Allowable range (%)
                              0     10    20    30    40    50    60    70    80    90    100
  Random guess                0.05  0.11  0.17  0.25  0.31  0.40  0.45  0.50  0.55  0.60  0.64
  Expert analysis methods     0.24  0.27  0.28  0.32  0.33  0.41  0.42  0.43  0.42  0.42  0.41
  Expert sourcing             0.30  0.33  0.37  0.38  0.41  0.49  0.51  0.52  0.57  0.59  0.60
  Gaming and scenarios        0.06  0.19  0.31  0.31  0.25  0.25  0.25  0.27  0.27  0.45  0.45
  Models                      0.18  0.18  0.35  0.35  0.35  0.46  0.46  0.45  0.45  0.33  0.33
  Multiple                    0.50  0.50  0.50  0.50  0.50  0.50  0.50  0.50  0.50  0.50  0.50
  Qualitative trend analysis  0.43  0.43  0.43  0.43  0.43  0.64  0.64  0.62  0.62  0.62  0.62
  Quantitative trend analysis 0.46  0.57  0.61  0.64  0.64  0.67  0.70  0.70  0.74  0.74  0.74
  Source analysis             0.11  0.11  0.11  0.14  0.18  0.18  0.18  0.23  0.23  0.20  0.20

Key (cell shading in the original): 0 < p-value ≤ .10; .10 < p-value ≤ .20; .20 < p-value < 1.0.
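To illustrate how the sweep behind Table 5 can be produced, the sketch below recomputes one method's success rate while the allowable error varies from 0% to 100% of the forecast horizon. The forecast records are hypothetical, and range forecasts (judged against their stated range rather than the allowable error) are ignored for brevity.

```python
import numpy as np

# Hypothetical verified forecasts for one method:
# horizon = forecasted years to the event, error = realization year - forecasted year.
horizon = np.array([3, 5, 8, 10, 12, 4, 6, 15, 7, 9])
error = np.array([0, -2, 1, 4, -5, 0, 3, 2, -1, 6])

for allowable in np.arange(0.0, 1.01, 0.10):
    # A forecast succeeds if the event falls within +/- allowable * horizon
    # of the forecasted year.
    success_rate = np.mean(np.abs(error) <= allowable * horizon)
    print(f"allowable error {allowable:>4.0%}: success rate {success_rate:.2f}")
```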
by communications technologies (37%). When assessed together, forecasts about autonomous/robotics technologies and computer technologies performed statistically better than the sample of all other technology areas, indicating that forecasts about these technologies are more accurate than those about other technology areas. This finding is partially consistent with Albright's findings, which revealed that long-term forecasts about computer technologies—as well as communication technologies—were more accurate than forecasts about other technologies [4]. Interestingly, nearly 60% of the forecasts in our sample that were obtained from quantitative methodologies made predictions about computer and autonomous/robotics technologies, indicating that the high level of success among these forecasts may be dependent upon methodology. However, additional data is needed to determine which attribute is dependent.
3.3. Comparison of temporal error We compared the MAE and ME of the 201 realized forecasts to determine whether it was possible to predict the accuracy of a forecast given different forecast attributes.
Table 6
Success rate by technology area. Small sample sizes precluded us from drawing statistical conclusions about the accuracy of predictions about maritime transportation; air transportation; physical, chemical, and mechanical systems; sensor; production; photonics and phononics technologies; and materials.

Technology area                                Percent of successful forecasts
  Computer technology                          59.4
  Autonomous/robotics technology               55.0
  Communications technology                    37.5
  Space technology                             35.3
  Biological technology                        26.9
  Ground transportation technology             26.5
  Energy and power technology                  25.8
  Maritime transportation technology           66.7
  Air transportation technology                42.9
  Physical, chemical, and mechanical systems   36.4
  Sensor technology                            33.3
  Production technology                        33.3
  Photonics and phononics technology           30.0
  Materials technology                         28.6
  Average                                      36.9
Table 7 provides the comparison of forecast methodology MAE and ME. Table 8 provides the results of a non-parametric approximation of the Tukey-Kramer Honestly Significant Difference Test, which was used to determine if the quantitative trend analysis method was derived from a different distribution than the other methods. We failed to reject the null hypothesis (distributions are the same) for two methods: gaming and scenarios, and multiple. A portion of this could be explained by the large variances and small sample sizes for both of these methods. However, we did reject the null hypothesis for the five other methodologies at a significance level of 0.049. Because the lower limit for the five methods was above zero, there is strong evidence to suggest that the distribution of the MAE for the quantitative trend analysis method is smaller than the other five methods. This observation is consistent with the descriptive data presented in Table 7.
3.4. Predictive model for forecast performance To determine whether a forecast user could predict the ME associated with a forecast given its associated attributes, we conducted a regression analysis. We initially generated one- and two-variable plots of the data to identify potential trends that would facilitate the development of a regression model. However, we could not identify any trends in the data with any combination of attributes; the data generally exhibited a random distribution. Fig. 2 provides the ME distribution for forecasts and their methodologies. Forecasts are equally likely to over- or underestimate event dates. Furthermore, the data are random in their placement; they do not indicate a prominent positive or negative trend and do not indicate any type of underlying non-linear trend. Even with the outliers removed, the data exhibit heteroscedasticity relative to time frame. That is, the variance of the data increases as time increases. To account for this, we checked for heteroscedasticity in each data set of our regression analysis given the attributes under consideration, and we adjusted our findings using White-Huber covariance matrices. We conducted an exhaustive regression analysis exploring both linear and non-linear relationships. We explored all
Table 7
Temporal error among forecast methods. The analysis of temporal error was based on 201 forecasts in the sample set whose predicted events had been realized.

Method                        Realized    Mean           Variance        Median          Range of ME
                              forecasts   MAE    ME      MAE     ME      MAE    ME       Min     Max
  Expert analysis methods     41          2.09   0.06    5.65    10.10   1      0.00     −8.5    10.0
  Expert sourcing             90          3.91   −0.83   14.40   29.16   3      0.00     −17.0   11.0
  Gaming and scenarios        8           2.50   1.75    7.14    10.79   1      0.50     −1.0    8.0
  Models                      11          1.86   0.41    2.80    6.44    1      0.00     −3.0    5.0
  Multiple                    4           5.13   2.88    37.90   61.90   3      1.00     −4.5    14.0
  Qualitative trend analysis  11          2.64   0.91    4.00    10.74   2.5    −1.00    −2.5    8.0
  Quantitative trend analysis 20          1.18   −0.43   6.64    7.90    0.25   0.00     −11.0   5.0
  Source analysis             16          3.72   −0.09   10.57   25.31   2.75   −0.75    −11.0   10.0
combinations up to five variables, as well as their interactions. To filter out the ill-fitting models, we established the criteria that a model had to produce a p-value of 0.05 or smaller and an adjusted R² value of 0.8 or better. No model produced any results that satisfied the criteria.

3.5. Forecast interpretation

We observed a large disparity in the clarity and specificity of forecasted events, with some forecasts containing specific enough information that they did not require interpretation and others that were too vague to be included in the analysis. Forecasts that included clear descriptions of time horizon, technology, and the predicted event, as well as those that included performance metrics, were more informative and required less interpretation. The risk of misinterpretation was greater for forecasts that described predicted outcomes using ambiguous language (e.g., "it is possible," "it could happen," "become commonplace," "better than the current generation capabilities"). Forecast statements that made predictions about multiple variables, such as the weight, size, battery type, and gas mileage of an automobile, were difficult to interpret and evaluate because predictions about each variable needed to be realized for the entire forecast to be considered realized.3 Table 9 presents the success rates associated with the specificity of these forecasts. Most of the verified forecasts in our sample required little interpretation (38%) or moderate interpretation (32%) to verify. Forecasts requiring no interpretation comprised 19% of our sample, while those requiring significant interpretation comprised 11%. Interestingly, forecasts obtained from government documents required more interpretation than those derived from other sources (p-value: 0.073), though they had a success rate (38%) similar to that of forecasts derived from other sources. Of the verified forecasts, 59% of forecasts from government sources required a lot or moderate interpretation during verification, compared to 28% to 35% for most other sources. Much of this difficulty was due to bias inherent in attempting to discern meaning in vague or imprecise forecasts. Community-wide development or adoption of a standard lexicon similar to that developed for this study would improve communication of forecast information, though it is unlikely to increase forecast success.
Contingent forecasts, which indicate the forecasted event is dependent upon another event occurring, were not included in our study because it was not possible to determine whether a contingent forecast's success could be attributed to the causal factor.
4. Conclusions

We evaluated the accuracy of more than 300 technological forecasts that were derived from a variety of sources and geographic regions. These forecasts were obtained using numerous methodologies and made predictions about several technologies of varying complexity and maturity. Our results indicate that quantitative methods produce the most accurate forecasts. However, forecasts obtained using expert sourcing had the highest realization rates, suggesting that experts are best at predicting if an event will occur, while quantitative methods are best at predicting when an event will occur. Our results also indicate that forecasts predicting shorter time horizons are more accurate than those predicting longer time horizons.
Table 8
Tukey–Kramer Honestly Significant Difference Test comparing forecast methods. The Tukey–Kramer test was used instead of ANOVA, due to the dissimilar distribution of data when partitioned by methodology.

Method                        Realized    MAE                   Difference CI
                              forecasts   Mean     Variance     Lower     Upper
  Expert analysis methods     41          2.09     5.65         0.09      1.73
  Expert sourcing             90          3.91     14.40        2.03      3.44
  Gaming and scenarios        8           2.50     7.14         −0.15     2.80
  Models                      11          1.86     2.80         0.40      0.97
  Multiple                    4           5.13     37.90        −6.06     13.96
  Qualitative trend analysis  11          2.64     4.00         1.35      1.57
  Source analysis             16          3.72     10.57        1.60      3.49
Table 9
Test to determine the influence of forecast interpretation on accuracy. There were no forecasts in the sample that were assessed as requiring "all interpretation."

Degree of interpretation      Successful    Unsuccessful    Observations    Success rate
                              forecasts     forecasts
  All interpretation          –             –               –               –
  A lot of interpretation     10            23              33              30.3%
  Moderate interpretation     34            60              94              36.2%
  Little interpretation       46            67              113             40.7%
  No interpretation           19            36              55              34.5%
  Total                       109           186             295             36.9%
Fig. 2. Distribution of ME among forecast methodologies. The distribution of error is symmetric about the abscissa, indicating that forecasts are equally likely to over- or underestimate event dates.
Therefore, forecasters should use quantitative methods when possible and project over the short term. Furthermore, predictions about computer and autonomous or robotics technologies were more accurate than those made about other technology areas, which has been partially demonstrated in previous work [4]. No statistically relevant influence on accuracy was observed in the prediction type, technology maturity, technology complexity, geographic origin, or geographic region forecasted about. We also observed that forecasts provide more accurate predictions than uninformed guesses do and are equally likely to over- or underestimate event dates.

Our study also confirms that forecast language is often vague or ambiguous and requires interpretation to be understood by end users or evaluated by researchers. Since such imprecise language increases the possibility of misinterpretation by end users, forecasters should take care to clearly describe time horizons, technologies, and predicted events or outcomes, as well as provide performance metrics, probabilities, and explicit baselines for comparison. Consumers of forecasts should place greater confidence in predictions that were generated using quantitative methods, project events with shorter time horizons, and use unambiguous language. The development of standard forecasting language guidelines could improve accessibility and reduce interpretative error.

Additional data, in the form of an increased number of verified forecasts, are needed to 1) evaluate the accuracy of collaborative expert sourcing relative to non-collaborative expert sourcing and other forecast methodologies; 2) compare sophisticated quantitative methodologies against simpler quantitative methods; 3) compare combined forecasts against other methodologies; 4) evaluate methodologies for which our sample size was too small, such as qualitative trend analysis and broad sourcing; and 5) determine if there is a causal relationship between forecast attributes and computer and autonomous or robotics technologies that influences the higher success rate in these technology areas.

Acknowledgements

We would like to thank the Assistant Secretary of Defense for Research and Development's (ASD(R&E)) Office of Technical Intelligence for funding this research. We also
thank Dr. Bernard Courtney, Kate Maliga, and Rachael Lussos for reviewing the manuscript.

References

[1] National Academies Press, Persistent Forecasting of Disruptive Technologies, National Academies Press, 2009. http://www.nap.edu/openbook.php?record_id=12557&page=17.
[2] S. Makridakis, S.C. Wheelwright, V.E. McGee, Forecasting: Methods and Applications, 2nd ed., Wiley, New York, NY, 1983.
[3] E. Mahmoud, Accuracy in forecasting: a survey, J. Forecast. 3 (2) (1984) 139–159.
[4] R.E. Albright, What can past technology forecasts tell us about the future? Technol. Forecast. Soc. Chang. 69 (5) (2002) 443–464; W. Ascher, Problems of forecasting and technology assessment, Technol. Forecast. Soc. Chang. 13 (1979) 149–156.
[5] J.S. Armstrong, M.C. Grohman, A comparative study of methods for long-range market forecasting, Manag. Sci. 19 (2) (1972) 211–227; K.S. Lorek, C.L. McDonald, D.H. Patz, A comparative examination of management forecasts and Box–Jenkins forecasts of earnings, Account. Rev. 51 (2) (1976) 321–330; J.S. Armstrong, Long-Range Forecasting: From Crystal Ball to Computer, Wiley, New York, NY, 1978; W.M. Grove, D.H. Zald, B.S. Lebow, B.E. Snitz, C. Nelson, Clinical versus mechanical prediction: a meta-analysis, Psychol. Assess. 12 (1) (2000) 19–30.
[6] W. Ascher, Problems of forecasting and technology assessment, Technol. Forecast. Soc. Chang. 13 (1979) 149–156; D.G. Goldstein, G. Gigerenzer, Fast and frugal forecasting, Int. J. Forecast. 25 (2009) 760–772; P. Goodwin, High on complexity, low on evidence: are advanced forecasting methods always as good as they seem? Foresight (2001) 10–12; S. Makridakis, M. Hibon, The M3-competition: results, conclusions and implications, Int. J. Forecast. 16 (4) (2000) 451–476; R. Carbone, A. Anderson, Y. Corriveau, P.P. Corson, Comparing for different time series methods the value of technical expertise, individualized analysis, and judgmental adjustment, Manag. Sci. 29 (5) (1983) 559–566.
[7] J.S. Armstrong, How to make better forecasts and decisions: avoid face-to-face meetings, Foresight 5 (2006) 3–15.
[8] R. Carbone, A. Anderson, Y. Corriveau, P.P. Corson, Comparing for different time series methods the value of technical expertise, individualized analysis, and judgmental adjustment, Manag. Sci. 29 (5) (1983) 559–566.
[9] S. Makridakis, M. Hibon, C. Moser, Accuracy of forecasting: an empirical investigation, J. R. Stat. Soc. Ser. A 142 (2) (1979) 97–145.

Ms. Shannon Fye is a principal analyst for The Tauri Group, where she conducts studies for various government clients. Ms. Fye has a BS in Molecular Biology and Microbiology from California State University, Sacramento and an MS in Biodefense from George Mason University, where she is also a doctoral candidate studying technical complexities associated with emerging biotechnologies.
Dr. Steven Charbonneau is a senior operations research analyst for The Tauri Group, where he leads and conducts analyses on issues ranging from resource allocation to data visualization. Dr. Charbonneau has a BS in Civil Engineering from the United States Military Academy and a PhD in Operations Research from George Mason University.

Mr. Jason Hay is a senior technology and aerospace analyst for The Tauri Group, specializing in identifying and evaluating technologies for use in space. Mr. Hay has a BS in Physics from the Georgia Institute of Technology
and an MS in Science and Technology Policy from The George Washington University. Ms. Carie Mullins is a senior engineer for The Tauri Group, where she specializes in analyzing advanced technologies and systems for NASA, the Department of Defense, and the commercial space industry. Ms. Mullins received her BS and MS in Industrial Engineering from the University of Pittsburgh.