Car resale price forecasting: The impact of regression method, private information, and heterogeneity on forecast accuracy


International Journal of Forecasting 33 (2017) 864–877
http://dx.doi.org/10.1016/j.ijforecast.2017.04.003


Stefan Lessmann a,*, Stefan Voß b

a School of Business and Economics, Humboldt-University of Berlin, Unter den Linden 6, 10099 Berlin, Germany
b Institute of Information Systems, University of Hamburg, Von-Melle-Park 5, 20146 Hamburg, Germany


Keywords: Forecasting; Data mining; Decision support systems; Automotive industry

Abstract

The paper investigates statistical models for forecasting the resale prices of used cars. An empirical study is performed to explore the contributions of different degrees of freedom in the modeling process to the forecast accuracy. First, a comparative analysis of alternative prediction methods provides evidence that random forest regression is particularly effective for resale price forecasting. It is also shown that the use of linear regression, the prevailing method in previous work, should be avoided. Second, the empirical results demonstrate the presence of heterogeneity in resale price forecasting and identify methods that can automatically overcome its detrimental effect on the forecast accuracy. Finally, the study confirms that the sellers of used cars possess informational advantages over market research agencies, which enable them to forecast resale prices more accurately. This implies that sellers have an incentive to invest in in-house forecasting solutions, instead of basing their pricing decisions on externally generated residual value estimates.

© 2017 International Institute of Forecasters. Published by Elsevier B.V. All rights reserved.

1. Introduction

The paper explores the potential of forecasting methods to support decision making in the automotive industry. More specifically, we concentrate on the second-hand market and develop empirical models for forecasting resale prices. Given that the sale of new cars is typically associated with taking back used vehicles due to, e.g., retail trade-ins, repossessions and fleet returns from car rental companies (e.g., Du et al., 2009), the used car market is strategically important for car manufacturers.

Forecasting¹ is a popular approach to improving business processes and supporting decision making (e.g., Cang & Yu, 2014; Ha & Krishnan, 2012).

* Corresponding author. Fax: +49 030 2093 5741. E-mail addresses: [email protected] (S. Lessmann), [email protected] (S. Voß).
¹ We do not distinguish between forecasting and predictive modeling here. Accordingly, the terms 'forecast' and 'prediction' are used interchangeably in the paper. Furthermore, we assume that forecasting models can be distinguished into regression and classification models that forecast continuous and discrete dependent variables, respectively.

In particular, we develop forecasting models for supporting pricing decisions. A considerable body of research has shown that sophisticated pricing strategies can increase the profitability of customer-centric operations substantially (e.g., Mantrala et al., 2006; Sharif Azadeh et al., 2015). For example, Marn et al. (2003) estimate that a 1% increase in sales prices can translate into an 8% increase in operational profits for an average S&P 500 company.

Pricing is especially important in the used car market. Given that the quantity is largely fixed (i.e., because of take-back obligations), the price is the only control variable for increasing sales revenue and profit (Du et al., 2009). In general, making effective pricing decisions requires a good estimate of the demand (e.g., Ferrer et al., 2012). In the second-hand car market, the demand depends substantially on the difference between a car's residual value and its offer price (Jerenz, 2008).



Thus, when deciding on offer prices, sellers need to estimate both residual values and resale prices. Such forecasts are also important in the new car business, where leasing has become a major sales channel (e.g., Pierce, 2012). Leasing companies set prices (i.e., leasing rates) based on expected residual values (e.g., Desai & Purohit, 1998). If the actual resale price of a car falls below expectations, the company faces a loss. Consequently, the accuracy of resale price forecasts is linked directly to the profitability of car leasing. More generally, given the impact of pricing policies on firms' performances and the dependence of pricing decisions on residual value estimates, we argue that resale price forecasting (RPF) is connected indirectly to the profitability of car selling. This suggests that attempts to increase the accuracy of resale price forecasts are managerially important and a relevant research topic.

Previous studies in the used car market have generally comprised a statistical modeling of resale prices but have rarely taken a decision support perspective. For example, only two studies have used a forecasting method other than multivariate linear regression. To the best of our knowledge, no determinants of the forecast accuracy other than the forecasting method have been explored at all (see Section 2 for details). The objective of the paper is to fill this gap. The corresponding results support car manufacturers who remarket used vehicles on a large scale, for example in conjunction with a leasing business, in their price management, as well as independent car dealers. More specifically, sellers need to decide how to obtain car residual values as inputs for their pricing strategy. One option is to purchase car residual value estimates from market research agencies such as ALG (http://www.alg.com). Alternatively, sellers can set up in-house forecasting support systems (FSS) for estimating prices. The advantages of the latter approach include full transparency and control over price estimation, and the opportunity to use all available information, including specific car characteristics. The paper clarifies the potential of such information to increase the accuracy of car resale price forecasts and, more generally, provides insights as to how best to approach the car resale price forecasting task.

The paper makes three contributions in pursuing its objectives. First, we perform a systematic comparison of several widely diverse forecasting methods under different experimental conditions. This is useful for identifying methods that are particularly well-suited for RPF and for augmenting the findings from other domains concerning the relative effectiveness of alternative forecasting methods. Second, we examine the predictive value of car-specific information and find that such (private) information allows used car dealers to increase the accuracy of their resale price forecasts. This demonstrates the merit of in-house FSS compared to relying exclusively on purchased residual value estimates. Third, we examine the amount of effort required to update and manage forecasting models as part of an in-house solution. Such an analysis provides insights into the economic rationality of in-house FSS and helps managers to make informed decisions.


In summary, our study delivers original findings concerning various methodological, organizational, and economic aspects of RPF.

The paper is organized as follows. We review related literature in Section 2 and develop our research questions in Section 3. We then describe our methodology in Sections 4 and 5 and report experimental results in Section 6. Section 7 concludes the paper. The online appendix (available as an online supplement; see Appendix A) provides additional results and describes the forecasting methods considered in the study.

2. Related work

We organize the related literature into two categories. From an application perspective, papers that consider the second-hand car market are related to this work. From a methodological point of view, our study is related to the forecasting literature.

The second-hand car market is a popular object of scientific enquiry. Many studies have focused on the informational efficiency of the market; that is, how efficiency changes over time and especially in response to online-mediated sales channels (e.g., Adams et al., 2011; Andrews & Benzing, 2007; Genesove, 1993). Market prices play a central role in such work, as they determine the informational efficiency of the market and are a key element in consumer decisions.

Market prices are also relevant for decision support. In particular, price forecasts are a crucial input for revenue management initiatives and advanced selling support systems (Sharif Azadeh et al., 2015). However, only a few studies have emphasized revenue management, and more specifically the connection between pricing decisions and revenue, in the used car business. Jerenz (2008) develops a pricing optimization system that consists of three parts: a forecasting component for estimating residual values, a statistical model for estimating price response functions (given model-estimated residual values), and a dynamic optimization engine for determining the optimal pricing strategy. Du et al. (2009) propose a related three-stage approach for identifying the optimal distribution of auction vehicles. In their approach, the expected auction price of a used vehicle and the local market elasticity are modeled using linear regression and an ARIMA process, respectively. The corresponding results represent the input to a genetic algorithm, which maximizes the net auction profit of distributing vehicles on the basis of their estimated auction prices, asset carrying costs and business constraints (Du et al., 2009).

Both studies illustrate that resale price forecasts form a starting point for systematic revenue management initiatives. More specifically, they exemplify how the effectiveness of a used car selling support system depends on the accuracy of the resale price forecasts. This suggests that better (more accurate) forecasts could make a sizeable contribution to the effectiveness of such a system. However, Du et al. (2009) and Jerenz (2008) both model resale prices by means of multivariate linear regressions.


Although explanatory regression models possess many advantages, they forecast less accurately than data-driven prediction models that are developed explicitly to generate operationally accurate forecasts (Shmueli & Koppius, 2011). Therefore, one objective of this study is to explore empirically whether 'better methods' facilitate improvements in forecast accuracy.

To the best of our knowledge, only Lessmann et al. (2010) and Lian et al. (2003) use advanced techniques for resale price forecasting. From a methodological point of view, their approaches are closely related. Both propose a forecasting framework that integrates an advanced prediction method (neural networks and support vector regression, respectively) with an evolutionary algorithm (genetic algorithm and evolution strategy, respectively) in order to adapt the meta-parameters of the forecasting method. Neither study conducts a comprehensive comparison of the proposed modeling approach to challenging benchmarks. Lessmann et al. (2010) compare the results to those of a linear regression, while Lian et al. (2003) do not perform any comparisons.

In summary, previous studies that have concentrated on decision support in the used car market have hinted at the importance of accurate resale price forecasts for informing decision making and aiding used car pricing in particular. However, advanced forecasting methods have received little attention. No systematic comparison of many alternative methods has ever been undertaken. As a consequence, there is an absence of empirical evidence on the extent to which alternative resale price forecasting methods differ in their forecast accuracies, or which approach is most effective. Corresponding results are available in other domains (e.g., Crone et al., 2011; Dejaeger et al., 2011; Loterman et al., 2012), and might generalize to RPF. However, many previous benchmark experiments have contrasted forecast accuracies in a single setting. When clarifying the conditions under which a forecasting method works well, it is preferable to compare alternative methods in different experimental settings (Armstrong & Fildes, 2006). Moreover, accuracy is not the only determinant of a forecasting model's business performance. The employment of predictive decision support models leads to various costs for model management and maintenance, for example for updating the models as new information becomes available (e.g., Zheng et al., 2016). Previous benchmarks in other domains have not provided insights into the magnitude of such modeling efforts in RPF. Performing a benchmarking experiment that is tailored to RPF allows us to examine different experimental settings that reflect characteristic aspects of this forecasting task and to provide specific recommendations as to how it is best approached.

3. Research questions

This section develops a set of research questions that aim to fill the research gaps identified above. The research questions can be split into two categories: those that concentrate on the relative effectiveness of different forecasting methods and those that focus on organizational aspects, to shed light on the efficiency of the overall forecasting process.

3.1. The accuracy of alternative resale price forecasting methods

Accuracy is an important issue in any forecasting task. It is also a key determinant of the acceptability and effectiveness of a forecasting method in corporate practice (e.g., Syntetos et al., 2016). Therefore, a general objective of this study is to identify methods that are particularly effective in RPF. Thus, our first research question (RQ) is:

RQ1: Which forecasting methods forecast resale prices most accurately?

More specifically, the resale price forecasting task exhibits several characteristics that might affect the efficacy of alternative forecasting methods. One modeling challenge in RPF concerns nonlinearity. The relationship between resale prices and explanatory variables is likely to be nonlinear because: (i) new car depreciation is high in the early usage period and decreases over time (Desai & Purohit, 1998); (ii) announcements of new car model introductions have large, unusual impacts on resale prices (Purohit, 1992); and (iii) consumers suffer cognitive biases (e.g., anchoring) that distort market prices (Kooreman & Haan, 2006). The linear regression methods that have been used predominantly in previous work require the analyst to specify nonlinear dependencies explicitly. Possible options include replacing resale prices with their natural logarithm and/or employing nonlinear transformations of covariates (e.g., Du et al., 2009). Advanced forecasting methods such as neural networks take nonlinear relationships into account in a data-driven manner. An automatic approach reduces manual efforts and the cost of predictive modeling in particular. Thus, it is of interest to examine how the RPF forecast accuracy varies across different forms of handling nonlinearity. As a consequence, we examine:

RQ1a: How do different ways of handling nonlinearity influence the resale price forecast accuracy?

Another modeling challenge stems from the fact that resale prices may depend on a variety of car attributes. It is common practice to encode specific attribute values, such as the engine type of a car, using binary dummy variables (e.g., Andrews & Benzing, 2007; Du et al., 2009; Erdem & Sentürk, 2009; Prado, 2009). As a consequence, the number of covariates will often be large (e.g., when considering many attributes and/or when individual attributes have many nominal values). High dimensionality is a challenge in forecasting. In particular, methods that are based on the principles of empirical risk minimization might experience problems in the face of high dimensionality (e.g., Hansen et al., 2006). Various different strategies, such as regularization and structural risk minimization, have been proposed in the literature to make prediction methods more robust to high dimensionality (e.g., Hastie et al., 2009). A comparison of different prediction methods helps to shed light on the effectiveness of these paradigms for RPF. It also helps to elicit the technical characteristics of a 'good' resale price forecasting method. Therefore, the study examines:

RQ1b: How effective are different regimes at coping with high dimensionality in resale price forecasting?

Finally, it is methodologically interesting to examine whether it is better to generate resale price forecasts using individual or ensemble models.


We find considerable evidence in the literature that ensemble models, which combine multiple forecasts, perform better than individual models (e.g., Baumeister & Kilian, 2015). However, there is some debate as to how to integrate multiple base model predictions (e.g., Cang & Yu, 2014), and, more generally, which combination approach is most effective (e.g., Lessmann et al., 2015). Given that the previous work on RPF has not considered ensemble models, we contribute to the literature by examining:

RQ1c: Does forecast combination increase the predictive accuracy for resale price forecasting?

3.2. Predictive value of private information

Used car dealers and leasing agencies require resale price forecasts in order to decide on offer prices (e.g., Pierce, 2012). They have the option to buy such estimates externally from market research agencies (e.g., ALG (http://www.alg.com) in North America or DAT (http://www.dat.de) in Europe). Thus, implementing an in-house forecasting system – whether as an alternative or an addition to external agency forecasts – is sensible only if it increases the forecast accuracy. Our second RQ sheds light on the relative merits of an internal forecasting approach and externally purchased agency forecasts by focusing on information asymmetries (e.g., Ozer et al., 2011).

Information asymmetries appear to be an especially important determinant of the forecast accuracy because the relevance of other, more common factors, such as the forecasting method, forecaster skill, and the complex interplay between methods, people, and decision support systems (e.g., Fildes et al., 2006; Lawrence et al., 2006), is reduced in RPF. The accuracy depends less on the forecaster because RPF requires an automated modeling approach. This is due to the large volume of the used car market, where sellers need vast numbers of forecasts to inform their pricing decisions. In addition, automation is also appropriate in order to mitigate problems with deliberate pricing misbehaviors, by forcing sales agents to base their pricing strategies on model-estimated residual values (e.g., Lessmann et al., 2010; Pierce, 2012). With respect to forecasting methods as a determinant of accuracy, it is important to recall that several alternative methods will be compared when answering RQ1. Once the most effective approach has been identified, possible methodological disparities between external market research agencies and used car sellers can be overcome.

According to hedonic theory, the value of an item is related to its constituent attributes and external factors (e.g., Narula et al., 2012). The price and rate of depreciation among vehicles of the same make and model can vary by thousands of dollars depending on factors such as the model year, mileage, condition, equipment, etc. (Du et al., 2009). The seller of a used car has full information about a car's features, whereas market research agencies use a limited set of car attributes for their residual value estimates, since they lack access to highly specific information. In what follows, the term private information (PI) is used to refer to car attributes that are known to the seller but not to market research agencies.


The availability of PI represents an informational advantage that could enable sellers to predict resale prices more accurately than market research agencies. Therefore, this study examines:

RQ2: Does private information help to predict resale prices more accurately?

3.3. Prediction model specificity

In accordance with previous work (e.g., Erdem & Sentürk, 2009; Prado, 2009), this study estimates resale prices by means of a hedonic regression framework. Heterogeneity may be a threat to such an approach. Heterogeneity means that the effects of covariates on the response variable differ across subgroups of observations (e.g., Vicari & Vichi, 2013). Heterogeneity is likely to occur in car resale price forecasting. Consider, for example, a covariate color: a red color might increase resale prices for sports cars but decrease prices for compact cars. Specific modeling procedures such as latent class models can account for heterogeneity explicitly (e.g., Halme & Kallio, 2011). However, such method-specific approaches are not applicable here, as one objective of this study is to compare a large set of different forecasting methods (see RQ1). This study examines the presence and severity of heterogeneity in a method-independent fashion by comparing prediction models that are geared specifically toward one particular car model (high specificity) to more general prediction models that forecast resale prices for all car models under study (low specificity). This leads to:

RQ3: To what extent does forecast model specificity affect the accuracy of resale price predictions?

It is worth noting that model specificity has an economic dimension. A high-specificity modeling approach requires one prediction model for each car model under study. This multiplies the overall number of prediction models and therefore the costs of model building and maintenance. In this sense, the results of RQ3 contribute toward an economic appraisal of an internal forecasting support system.

4. Resale price forecasting with regression

We can state the RPF task as a standard regression problem. Let $y \in \mathbb{R}$ denote the resale price of a used car and $x \in \mathbb{R}^m$ be a vector of m features that characterize the car (e.g., age, mileage, engine type, etc.). Regression analysis assumes that the value of the response variable y depends on the values of the covariates x in some unknown fashion. We approximate this relationship by developing a regression model $f(x)$ from a training sample $D = \{(y_i, x_i)\}_{i=1}^{n}$ of used car sales. To that end, we minimize the discrepancy between the observed response values (resale prices) and the model outputs (forecasts) over D.

A variety of regression methods are available (e.g., Hastie et al., 2009). These can be split into two groups, individual and ensemble methods. Individual methods produce forecasts using a single regression model. We can distinguish between linear and nonlinear methods based on the functional form through which they relate the covariates to the response variable.
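To make this formulation concrete, the following minimal sketch casts RPF as a regression problem in Python. The file name and column names are illustrative assumptions, not the study's confidential data, and random forest is used merely as one example of a regression method f.

```python
# Minimal sketch of the regression formulation. The response y is, as in
# the study, the resale price as a percentage of the list price; x collects
# car features. File and column names are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

sales = pd.read_csv("used_car_sales.csv")              # hypothetical file
y = 100 * sales["resale_price"] / sales["list_price"]  # response in percent
X = pd.get_dummies(sales[["age_months", "mileage_km", "engine_type"]])

f = RandomForestRegressor(n_estimators=500, random_state=0)
f.fit(X, y)               # minimize the discrepancy over the training sample D
forecasts = f.predict(X)  # model outputs f(x), i.e., resale price forecasts
```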


Fig. 1. Taxonomy of the regression methods employed in the study.

An ensemble operates in two stages: first, it creates multiple (base) models, then it combines their predictions using a pooling mechanism. A considerable amount of theoretical and empirical work has shown that forecast pooling increases the predictive accuracy (e.g., Armstrong, 2001). The success of an ensemble usually depends on the diversity (e.g., low error correlation) of its constituent base models (e.g., Brown et al., 2005). Thus, an interesting distinction between ensemble methods is how they foster diversity. We distinguish between heterogeneous and homogeneous ensembles. The former seek diversity by using a range of different regression methods when constructing the base models. The latter employ only one regression method but resample the training data from which they estimate base models (e.g., Hastie et al., 2009). Fig. 1 provides an overview of alternative regression methods for car resale price forecasting.

This study employs nineteen different regression methods. The selection draws inspiration from previous regression benchmarks (e.g., Loterman et al., 2012) and spans a range of different modeling approaches (e.g., from simple, easy-to-use methods to sophisticated state-of-the-art procedures; from parametric to non-parametric approaches; and from individual to homogeneous and heterogeneous ensemble models). Thus, it may be considered a comprehensive sample of the available regression methods.

Most methods facilitate some tuning through the selection of meta-parameter values, such as the number of hidden nodes in a neural network or the kernel parameters in support vector machines. Tuning meta-parameters is important in order to obtain a clear indication of how well a method can perform (e.g., Carrizosa et al., 2014). Therefore, we consider a set of candidate values for every meta-parameter, and create regression models with all settings. This approach yields a total of 494 models. Table 1 summarizes the different regression methods and their meta-parameter settings. We keep the paper self-contained by providing brief descriptions of the methods and their meta-parameters in an electronic companion, which accompanies the electronic version of the paper. For a comprehensive description of the methods, see Hastie et al. (2009).
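As an illustration of this exhaustive enumeration, the sketch below reproduces the ANN grid from Table 1 (18 hidden-layer sizes times 7 weight-decay values, giving 126 candidate models). scikit-learn's ParameterGrid is used purely for illustration and is not the study's actual tooling.

```python
# Enumerating candidate meta-parameter settings, illustrated with the ANN
# grid from Table 1: 18 hidden-layer sizes x 7 weight-decay values = 126
# candidate models.
from sklearn.model_selection import ParameterGrid

ann_grid = ParameterGrid({
    "hidden_nodes": list(range(3, 21)),                     # 3, ..., 20
    "weight_decay": [0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3],
})
assert len(ann_grid) == 126   # one regression model is estimated per setting

for setting in ann_grid:
    # estimate one ANN per setting; the best model per method is chosen
    # later on validation data (Section 5.3)
    pass
```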

5. Experimental design

5.1. Data

The dataset used to explore the three research questions was provided by a leading German car manufacturer, who prefers to remain anonymous. The dataset contains roughly 450,000 observations that refer to the actual sales of six different car models in the second-hand market. All six car models are from the same brand, with five belonging to the premium and one to the medium-class segment. The response variable is the ratio of a car's resale price to its original list price. Stating resale prices as percentages is common practice in the literature (e.g., Prado, 2009), and also assists with the confidentiality of sensitive information (i.e., car prices and depreciation). In the data set, each sale is described by several attributes (e.g., car age, mileage, model year, engine type, gear, etc.). We use these attributes to estimate the resale prices by means of regression. Table 2 summarizes the types of car attributes and provides some dataset statistics.

When examining RQ1 and RQ2, we split the data into six subsets, one for each of the different car models, and create prediction models for each subset. This allows us to estimate individual regression models for forecasting the resale prices of individual car models. RQ3 scrutinizes this approach by comparing car model specific regression models to regression models that are estimated from a pooled dataset that includes the sales of all six car models. In the pooled data set, we encode the different car models using five dummy variables.

The set of explanatory variables differs across experimental settings. When examining RQ1, we use all available variables. We consider high dimensionality as a characteristic of RPF, and therefore the appropriateness of a regression method for RPF depends, amongst other things, on its ability to cope with many, possibly correlated, variables. RQ1b looks specifically at the effectiveness of alternative regression methods for handling high dimensionality. Consequently, we avoid a pre-selection of variables when examining RQ1.

We explore RQ2 by categorizing the covariates into private and public information. Market research agencies base their residual value estimates on the car type, brand, model characteristics, and usage statistics (see, e.g., http://www.dat.de/fzgwerte/index.php). The sales data used here refer to passenger cars from a single car maker. Thus, for an individual car model data set, the covariates in the categories age, mileage, model year, and engine type approximate the information that research agencies employ in their forecasts. The covariates in the other categories represent private information that is available only to the sellers of used cars. Keeping everything else constant, we then compare regression models derived from data sets with the full set of variables (including PI) to those derived from data sets with a reduced set of variables, all of which capture publicly available information.
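The following sketch illustrates this data setup. The DataFrame and all column names are assumptions for illustration, since the actual attribute names are confidential; 'resale_ratio' stands for the response (resale price as a percentage of the list price).

```python
# Sketch of the data setup for RQ1/RQ2; all names are illustrative.
import pandas as pd

sales = pd.read_csv("used_car_sales.csv")       # hypothetical file

# RQ1/RQ2: one data set per car model (six subsets)
per_model = {car: df for car, df in sales.groupby("car_model")}

# RQ2: covariate categories that approximate what market research agencies
# observe (public information); everything else counts as private information
public_prefixes = ("age", "mileage", "model_year", "engine")
public_cols = [c for c in sales.columns if c.startswith(public_prefixes)]
private_cols = [c for c in sales.columns
                if c not in public_cols + ["car_model", "resale_ratio"]]
```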


Table 1. Summary of the regression methods employed in the benchmarking study.

| Regression method | No. of models | Meta-parameter(s) | Candidate values |
|---|---|---|---|
| Linear individual methods (a) | | | |
| Multivariate linear regression (MLR) (b) | 3 + 8 = 11 | Transformation of the response variable | None; natural logarithm; Box–Cox |
| | | Weighting function (for robust regression only) | Eight standard weighting functions (e.g., Huber, Cauchy, etc.) as available in the MATLAB environment (c) |
| Stepwise linear regression (SWR) | 19 ∗ 3 = 57 | Transformation of the response variable | None; natural logarithm; Box–Cox |
| | | p-value used for the t-test of regression coefficients | 0.05, 0.1, . . . , 0.95 |
| Ridge regression (RiR) | 13 ∗ 3 = 39 | Transformation of the response variable | None; natural logarithm; Box–Cox |
| | | Regularization parameter δ | 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000 |
| Least absolute shrinkage and selection operator (Lasso) | 13 ∗ 3 = 39 | Transformation of the response variable | None; natural logarithm; Box–Cox |
| | | Regularization parameter δ | 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000 |
| Nonlinear individual methods | | | |
| Multivariate adaptive regression splines (MARS) | 5 ∗ 3 ∗ 2 ∗ 2 = 60 | Max. no. of basis functions | 5, 10, 20, 50, 100 |
| | | GCV penalty per knot | 0, 3, 5 |
| | | Model type | Piecewise linear, piecewise cubic |
| | | No. of self-interactions per covariate | 1, 2 |
| Artificial neural network (ANN) | 18 ∗ 7 = 126 | No. of nodes in hidden layer | 3, . . . , 20 |
| | | Regularization parameter (weight decay) | 0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3 |
| Support vector regression (SVR) | 7 ∗ 3 ∗ 5 = 105 | ϵ-insensitive loss | 2^(−12, −10, . . . , 0) |
| | | Regularization parameter δ | 2^(0, 4, 8) |
| | | Width of the Gaussian kernel function | 2^(−6, −5, . . . , −2) |
| Regression trees (CART) | 3 ∗ 2 = 6 | Min. observations per leaf node | 10, 100, 1000 |
| | | Pruning of grown trees | yes, no |
| Nearest-neighbor regression (kNN) | 8 ∗ 2 = 16 | No. of neighbors | 3, 5, 7, 25, 50, 100, 250, 500 |
| | | Weight neighboring observations' response values by distance | yes, no |
| Homogeneous ensemble methods | | | |
| Bagged regression tree (Bag) | 9 | Ensemble size | 10, 25, 50, 100, 250, 500, 1000, 1500, 2000 |
| Bagged ANN (BANN) | 5 | Ensemble size | 5, 10, 25, 50, 100 |
| Boosted regression trees (Boost) | 11 | Ensemble size | 10, 20, . . . , 50, 100, 250, 500, 1000, 1500, 2000 |
| Random regression forest (RF) | 9 ∗ 4 = 36 | Ensemble size | 10, 25, 50, 100, 250, 500, 1000, 1500, 2000 |
| | | Size of the covariate sample per node split | √m · (0.25, 0.5, 1, 2), with m denoting the number of covariates |
| Heterogeneous ensemble methods (d) | | | |
| Ensemble selection (ES) | 1 | n.a. | |
| Simple average (SAvg) | 1 | n.a. | |
| Trimmed average (TAvg) | 1 | n.a. | |
| Weighted average (WAvg) | 1 | n.a. | |
| Stacking | 1 | n.a. | |
| Simple average over the best N models (Top-N) | 6 | No. of best performing models selected for the ensemble | 1, 5, 10, 25, 50, 100 |

(a) With respect to the linear methods, it is important to note that they are not strictly linear. The models have been estimated both from the raw data and from data where the response variable has undergone some nonlinear transformation such as a log transformation (see Section 2.1.1 in the online companion for details).
(b) Note that robust regression is considered a variant of ordinary linear regression. Thus, the settings for robust regression are shown in the meta-parameter column for MLR.
(c) See http://www.mathworks.com/help/stats/robustfit.html.
(d) All heterogeneous ensembles operate on the basis of a model library. This library consists of the 494 regression models that are created with linear/nonlinear individual methods and homogeneous ensemble methods. For example, the SAvg ensemble is simply the mean resale price forecast computed over the 494 library models.

The experiment corresponding to RQ3 also compares regression models derived from different types of data, namely car model specific data sets and one pooled data set that includes all car models. Given that the car models differ in certain variables (for example, some engine types are available for one car model only), RQ3 necessitates a pre-selection of variables. In particular, the pooled data set includes only covariates related to age, mileage and customer group (Table 2), because these are identical for all car models. Accordingly, we also restrict the car-model specific data sets in the RQ3 experiment to include only these variables.

5.2. Assessment of forecast errors

A variety of accuracy indicators and error measures are available for assessing the forecast accuracy. This study uses the mean of the absolute difference between actual and predicted resale prices (MAE):

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - f(x_i) \right| \qquad (1)$$
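For reference, a direct transcription of Eq. (1); both arguments are arrays of resale prices expressed as a percentage of the list price.

```python
# Mean absolute error as defined in Eq. (1).
import numpy as np

def mae(actual, forecast):
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(forecast))))
```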


Table 2. Characteristics of the dataset used in the empirical study.

| Car model | Observations | Attributes (total) | Age | Mileage | Customer | Model year | Engine type | Prev. usage | Lacquer | Special equipment |
|---|---|---|---|---|---|---|---|---|---|---|
| No. 1 | 124,386 | 178 | 2 | 1 | 12 | 6 | 25 | 10 | 29 | 93 |
| No. 2 | 107,109 | 215 | 2 | 1 | 12 | 8 | 40 | 10 | 26 | 116 |
| No. 3 | 132,504 | 174 | 2 | 1 | 12 | 7 | 16 | 10 | 21 | 105 |
| No. 4 | 14,410 | 147 | 2 | 1 | 12 | 3 | 17 | 10 | 12 | 90 |
| No. 5 | 70,624 | 171 | 2 | 1 | 12 | 5 | 24 | 10 | 20 | 97 |
| No. 6 | 5,585 | 110 | 2 | 1 | 12 | 7 | 9 | 10 | 12 | 57 |

Note: the columns Age through Special equipment give the number of attributes per attribute category.

Relative to the popular mean square error, the MAE is easier to interpret and its assumptions about the loss function that is associated with forecast errors are less strict (Nikolopoulos et al., 2007). For example, a MAE of 0.5 indicates that, on average, forecasts deviate from the actual values by half a percentage point. This is because the response variable is expressed as a percentage (i.e., resale price/list price). The MAE is distorted less by very small outcomes than other indicators that are based on percentage errors (e.g., the MAPE and MdAPE; see Goodwin & Lawton, 1999), which further supports its selection for this study.⁷

⁷ For the sake of completeness, the forecast accuracy was also assessed in terms of the MSE, R², and the Pearson correlation. Overall, the forecast accuracy is highly correlated across different error measures. For example, the average rank correlation between the MAE and the alternative accuracy measures (across data sets and prediction methods) is τ = 0.83. Section 1 of the e-companion provides a more detailed correlation analysis and benchmark results in terms of the above indicators of the forecasting accuracy.

5.3. Model building, selection and evaluation

Our experimental setup for creating and assessing resale price forecast models is as follows. First, we split the data randomly into a training (60%) and a test set (40%). We derive forecast models from the training data and use these to produce resale price forecasts for the observations in the test set. We then compute the MAE of a model by comparing its forecasts to the actual values using the test set.

All of the empirical results are based on MAE values from the hold-out test set.

We compare alternative prediction methods (e.g., in RQ1) by first identifying the best model per method. For example, in the case of ANN, we create 126 prediction models (see Table 1), of which we use only the best (i.e., lowest MAE) for subsequent comparisons with other methods. The identification of the best model per method, subsequently referred to as model selection, requires auxiliary validation data. Using the test set for this purpose would bias the comparison of different methods, and must thus be avoided (e.g., Zhang et al., 1999). Instead, we produce validation data by performing three-fold cross-validation of the training set. This involves two steps: (i) splitting the training set into three partitions of equal size and (ii) collecting predictions for every individual partition. We achieve the latter by estimating regression models from the union of the other two partitions for each validation partition. This approach allows the collection of hold-out forecasts for all observations in the training sample (Caruana et al., 2006).

In addition to model selection, we also use the cross-validated training set forecasts to develop heterogeneous ensemble models. In particular, we require validation data for (i) calculating base model weights in WAvg, (ii) determining the pruning level in Top-N, (iii) performing hill-climbing ensemble selection in ES, and (iv) estimating the coefficients of the stacking regression model (see Table 1 and Section 2.2.2 in the online companion).
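The following sketch illustrates this protocol under simplifying assumptions: a single method (random forest) with a small candidate grid stands in for the study's nineteen methods and 494 candidate models, and X and y denote the hypothetical covariate matrix and response from the Section 5.1 sketch.

```python
# Sketch of the evaluation protocol: 60/40 train/test split, three-fold
# cross-validation on the training set for model selection, final MAE on
# the untouched test set.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict, train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0)

candidates = [RandomForestRegressor(n_estimators=n, random_state=0)
              for n in (10, 25, 50, 100)]   # small stand-in candidate grid

# Model selection: cross-validated hold-out forecasts on the training set only
cv_mae = [np.mean(np.abs(y_train - cross_val_predict(m, X_train, y_train, cv=3)))
          for m in candidates]
best = candidates[int(np.argmin(cv_mae))]

# Final assessment: refit the selected model and evaluate on the test set
best.fit(X_train, y_train)
test_mae = np.mean(np.abs(y_test - best.predict(X_test)))
```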


Table 3. Forecast accuracies of alternative forecasting methods in terms of MAE.

| Method | No. 1 | No. 2 | No. 3 | No. 4 | No. 5 | No. 6 |
|---|---|---|---|---|---|---|
| Linear individual prediction models | | | | | | |
| MLR | 10.23 | 6.19 | 4.58 | 8.86 | 4.29 | 5.43 |
| Lasso | 11.92 | 6.10 | 6.86 | 5.93 | 7.59 | 5.60 |
| RiR | 13.06 | 6.85 | 8.24 | 5.67 | 9.65 | 6.88 |
| SWR | 8.95 | 6.06 | 4.62 | 5.13 | 4.24 | 5.24 |
| Nonlinear individual prediction models | | | | | | |
| ANN | 7.13 | 5.71 | 4.55 | 5.31 | 4.21 | 5.23 |
| CART | 6.92 | 5.82 | 4.73 | 5.33 | 4.28 | 7.03 |
| kNN | **5.83** | 6.18 | 5.00 | 5.05 | 4.69 | 8.88 |
| MARS | 8.43 | 5.75 | 4.46 | 5.05 | 4.22 | 5.91 |
| SVR | 7.51 | 5.69 | 4.40 | 4.92 | 4.13 | 6.10 |
| Homogeneous ensemble models | | | | | | |
| Bag | 6.65 | **5.18** | 4.30 | 4.61 | 4.00 | 7.15 |
| BANN | 7.06 | 5.70 | 4.51 | 5.24 | 4.12 | **5.08** |
| Boost | 7.42 | 5.66 | 4.49 | 5.01 | 4.25 | 6.30 |
| RF | 6.72 | 5.25 | 4.31 | 4.62 | 4.00 | 6.30 |
| Heterogeneous ensemble models | | | | | | |
| ES | 7.11 | 5.55 | **4.27** | 4.68 | **3.94** | 5.54 |
| SAvg | 7.63 | 5.68 | 4.48 | 4.99 | 4.22 | 5.59 |
| Stacking | 6.75 | 5.37 | 6.37 | **4.39** | 4.12 | 5.35 |
| TAvg | 7.63 | 5.68 | 4.44 | 4.97 | 4.15 | 5.56 |
| Top-N | 7.31 | 5.38 | 4.30 | 4.69 | 3.99 | 5.75 |
| WAvg | 7.65 | 5.73 | 4.53 | 4.99 | 4.30 | 5.68 |

Note: Bold text indicates the best performing method per car model data set.

6. Empirical results

6.1. Forecast accuracies of alternative resale price prediction methods

The first research question concerns the accuracy of different resale price forecasting methods. The comparison is based on a high-specificity modeling approach and full datasets with PI included. Table 3 reports the MAEs of the nineteen forecasting methods across the six car model datasets.

Table 3 illustrates some interesting patterns. For example, the magnitude of the forecast accuracy varies considerably across car models, which could indicate that the resale prices of some car models are considerably more difficult to predict. It is also apparent that, with one exception, the best-performing method for each dataset belongs to the ensemble family. This hints at the efficacy of forecast combination in resale price prediction.

We examine the statistical significance of the observed differences in MAE by performing a repeated measures analysis of variance (ANOVA). To that end, a pooled data set is created that includes the test set predictions of all prediction methods for the six car model data sets. The pooled data set contains 158,073 observations, after removing outliers that could affect the parametric statistical tests unduly. Specifically, we find car resale prices to follow a normal distribution, and remove observations with actual resale prices that are more than three standard deviations away from the mean. An observation in the pooled dataset corresponds to the actual resale price of a used car and the 19 forecasts of this price that arise from the alternative prediction methods.

The statistical analysis is based on the absolute differences between the forecasts and actuals. Let µi denote the MAE of method i on the pooled data set and let K be the total number of prediction methods (i.e., 19). ANOVA tests the null hypothesis H0: µ1 = µ2 = · · · = µK against the alternative hypothesis that the MAEs of at least two RPF methods differ. In addition to the (within-subjects) factor regression method, the factor car model is also considered as a blocking factor. This is to account for the variation in MAEs that can be attributed to the different car models.

The ANOVA results allow the rejection of the null hypothesis of equal performances with a high level of significance (F[4.2; 670,591] = 7150; MSE = 116,032; p < 0.000; using a Greenhouse-Geisser correction). Thus, it is possible to proceed with multiple comparisons in order to examine in detail which forecasting methods perform significantly differently from others. In particular, K · (K − 1)/2 pairwise comparisons are carried out to test the null hypothesis H0: µi = µj against H1: µi ≠ µj for all i, j = 1, . . . , K; i ≠ j. The corresponding results are shown in Table 4, where the forecast methods are ordered from the most to the least accurate. The pairwise differences in MAE are computed such that a negative value implies that the column method produces lower errors than the row method. Normal text indicates that the difference between two methods is significant at the 1% level. Similarly, an underline indicates significance at the 5% level. Only differences that are shown in bold are not substantial enough to reject the null hypothesis of equal performances. Note that the p-values in these tests are adjusted using Bonferroni's method in order to guarantee family-wise error levels of 1% and 5%, respectively.
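The sketch below illustrates the pairwise-comparison stage under simplifying assumptions: plain paired t-tests on absolute errors replace the paper's repeated-measures ANOVA with a blocking factor, and the error arrays are random placeholders rather than the study's pooled forecasts.

```python
# Simplified stand-in for the multiple-comparison procedure: paired t-tests
# on absolute errors for all K*(K-1)/2 method pairs, Bonferroni-adjusted.
from itertools import combinations

import numpy as np
from scipy.stats import ttest_rel
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
abs_err = {m: np.abs(rng.normal(4.2, 3.0, size=1000))   # placeholder errors
           for m in ("ES", "RF", "Top-N", "MLR")}

pairs = list(combinations(abs_err, 2))
pvals = [ttest_rel(abs_err[a], abs_err[b]).pvalue for a, b in pairs]
reject, p_adj, _, _ = multipletests(pvals, alpha=0.01, method="bonferroni")
for (a, b), p, sig in zip(pairs, p_adj, reject):
    print(f"{a} vs {b}: adjusted p = {p:.3f}, significant at 1%: {sig}")
```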

The statistical analysis is based on the absolute differences between the forecasts and actuals. Let µi denote the MAE of method i on the pooled data set and let K be the total number of prediction methods (i.e., 19). ANOVA tests the null hypothesis H0 : µ1 = µ2 =, . . . , = µK against the alternative hypothesis that the MAEs of at least two RPF methods differ. In addition to the (within-subjects) factor regression method, the factor car model is also considered as a blocking factor. This is to account for the variation in MAEs that can be attributed to the different car models. The ANOVA results allow the rejection of the null hypothesis of equal performances with a high level of significance (F [4.2; 670,591] = 7150; MSE = 116,032; p < 0.000; using a Greenhouse-Geisser correction). Thus, it is possible to proceed with multiple comparisons in order to examine in detail which forecasting methods perform significantly differently from others. In particular, K · (K − 1)/2 pairwise comparisons are carried out to test the nullhypothesis H0 : µi = µj against H1 : µi ̸ = µj ∀i, j = 1, . . . , K ; i ̸ = j. The corresponding results are shown in Table 4, where the forecast methods are ordered from the most to least accurate methods. The pairwise differences in MAE are computed such that a negative value implies that the column method produces lower errors than the row method. Normal text indicates that the difference between two methods is significant at the 1% level. Similarly, an underline indicates significance at the 5% level. Only differences that are shown in bold are not substantial enough to reject the null hypothesis of equal performances. Note that the p-values in these tests are adjusted using Bonferroni’s method in order to guarantee family-wise error levels of 1% and 5%, respectively.

872

S. Lessmann, S. Voß / International Journal of Forecasting 33 (2017) 864–877

Table 4. Pairwise differences among forecasting methods in terms of MAE.

| Method | MAE | ES | RF | Top-N | Bag | BANN | TAvg | SAvg | Stacking | SVR | ANN | WAvg | Boost | MARS | CART | SWR | kNN | MLR | Lasso |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ES | 3.97 | | | | | | | | | | | | | | | | | | |
| RF | 3.97 | 0.00 | | | | | | | | | | | | | | | | | |
| Top-N | 4.02 | −0.04 | −0.04 | | | | | | | | | | | | | | | | |
| Bag | 4.11 | −0.14 | −0.14 | −0.09 | | | | | | | | | | | | | | | |
| BANN | 4.17 | −0.19 | −0.20 | −0.15 | −0.06 | | | | | | | | | | | | | | |
| TAvg | 4.17 | −0.20 | −0.20 | −0.15 | −0.06 | 0.00 | | | | | | | | | | | | | |
| SAvg | 4.19 | −0.22 | −0.22 | −0.18 | −0.08 | −0.02 | −0.02 | | | | | | | | | | | | |
| Stacking | 4.22 | −0.24 | −0.24 | −0.20 | −0.11 | −0.04 | −0.05 | −0.02 | | | | | | | | | | | |
| SVR | 4.23 | −0.25 | −0.25 | −0.21 | −0.12 | −0.05 | −0.06 | −0.03 | −0.01 | | | | | | | | | | |
| ANN | 4.23 | −0.25 | −0.25 | −0.21 | −0.12 | −0.06 | −0.06 | −0.03 | −0.01 | 0.00 | | | | | | | | | |
| WAvg | 4.24 | −0.26 | −0.26 | −0.22 | −0.13 | −0.07 | −0.07 | −0.04 | −0.02 | −0.01 | −0.01 | | | | | | | | |
| Boost | 4.32 | −0.35 | −0.35 | −0.31 | −0.21 | −0.15 | −0.16 | −0.13 | −0.11 | −0.10 | −0.10 | −0.09 | | | | | | | |
| MARS | 4.44 | −0.46 | −0.46 | −0.42 | −0.33 | −0.27 | −0.27 | −0.24 | −0.22 | −0.21 | −0.21 | −0.20 | −0.11 | | | | | | |
| CART | 4.46 | −0.48 | −0.48 | −0.44 | −0.35 | −0.29 | −0.29 | −0.26 | −0.24 | −0.23 | −0.23 | −0.22 | −0.13 | −0.02 | | | | | |
| SWR | 4.50 | −0.53 | −0.53 | −0.48 | −0.39 | −0.33 | −0.33 | −0.31 | −0.28 | −0.27 | −0.27 | −0.26 | −0.18 | −0.06 | −0.04 | | | | |
| kNN | 4.67 | −0.69 | −0.69 | −0.65 | −0.56 | −0.49 | −0.50 | −0.47 | −0.45 | −0.44 | −0.44 | −0.43 | −0.34 | −0.23 | −0.21 | −0.17 | | | |
| MLR | 5.28 | −1.31 | −1.31 | −1.27 | −1.17 | −1.11 | −1.11 | −1.09 | −1.07 | −1.06 | −1.06 | −1.04 | −0.96 | −0.85 | −0.83 | −0.78 | −0.62 | | |
| Lasso | 5.92 | −1.95 | −1.95 | −1.90 | −1.81 | −1.75 | −1.75 | −1.73 | −1.71 | −1.70 | −1.69 | −1.68 | −1.60 | −1.49 | −1.46 | −1.42 | −1.26 | −0.64 | |
| RiR | 7.37 | −3.39 | −3.39 | −3.35 | −3.26 | −3.20 | −3.20 | −3.17 | −3.15 | −3.14 | −3.14 | −3.13 | −3.04 | −2.93 | −2.91 | −2.87 | −2.70 | −2.08 | −1.45 |

Note: the values shown in the MAE column are the estimated marginal means of the absolute forecast errors. The estimation is carried out across all observations in the pooled data set and takes into account the effect of the blocking factor car model.

Table 4 helps us answer the first set of research questions. First, it reveals that ES and RF perform almost identically (pairwise difference: −0.001), and are the overall winners of the comparison. These methods predict resale prices significantly more accurately than any of their competitors. Thus, they are particularly effective for the focal application (i.e., RQ1).

Second, Table 4 reveals that advanced methods that capture the nonlinearity in a data-driven manner typically outperform linear methods (i.e., RQ1a). The four linear methods MLR, SWR, RiR and Lasso are among the five least accurate methods and predict significantly less accurately than most of the other methods. Note that the four methods are not restricted to linear relationships in this study. Instead, they consider different representations of the response variable during model selection (see Table 1). In particular, modeling the log of resale prices allows MLR, SWR, RiR and Lasso to account for nonlinearity. This approach is used widely in the literature (e.g., Agarwal & Ratchford, 1980; Desai & Purohit, 1998; Kooreman & Haan, 2006); however, the results observed here cast doubt on its appropriateness. Data-driven procedures that do not require a manual coding of nonlinear relationships perform consistently better than the linear regression methods.

Third, some insight on RQ1b follows from the overall level of forecast accuracy (i.e., the MAE column in Table 4). Most methods produce forecasts that deviate from the actual resale prices by only four percentage points. To put this value into context, one can consider the mean of the response variable, ȳ, as a naïve benchmark. For the data used here, the MAE of ȳ (11.72) is substantially higher than those observed for most of the forecast methods. The fact that many methods predict resale prices fairly accurately suggests that high dimensionality was not a major obstacle. However, the results do not provide any clear answer to the question of how different approaches to increasing a method's robustness to high dimensionality compare to each other.

Considering the four linear methods only, a stepwise selection of covariates (SWR) performs better than either not addressing high dimensionality (MLR) or augmenting the least-squares loss function with a complexity penalty (Lasso and RiR). Theory suggests that MLR might suffer from a large number of covariates, but that Lasso and RiR should be robust (e.g., Vapnik, 1998). Surprisingly, though, the resale price forecasts of MLR are more accurate than those of Lasso and RiR. On the other hand, the two approaches that provide the best performances among all individual prediction methods, ANN⁸ and SVR, penalize model complexity with an L2 penalty, which is the same approach used in RiR. In this sense, the results do not warrant conclusions regarding the effectiveness of penalty-based regularization methods.

Considering the ensemble methods, the superior performance of RF over Bag may be taken as evidence that the random subspace approach (Ho, 1998) is a good way to deal with high dimensionality. That is, RF and Bag differ only in that the former uses this approach whereas the latter does not. However, the random subspace approach affects the predictive accuracy in multiple ways; for example, it also increases the diversity among the base models in the ensemble (Breiman, 2001). Thus, it is not possible to determine the extent to which the superior performance of RF over Bag results from a better way of handling high dimensionality.

Finally, Table 4 provides strong evidence for the superiority of ensembles over individual prediction methods (RQ1c). The eight best performing methods belong to the ensemble family, and only the weakest ensemble, Boost, is significantly inferior to the two best individual prediction methods (ANN and SVR). One can also average the predictions of all single models and all ensembles in order to compare the group means based on a paired t-test. This confirms that the difference between the average MAEs among ensemble methods (4.31) and individual methods (4.87) is significant at the 1% level (t-statistic: 255.29; p < 0.000).

⁸ The ANN models in this study are trained with weight decay (e.g., Hastie et al., 2009).


Fig. 2. Difference in MAEs between prediction models that exclude and include PI, for each forecasting method.

The performance differences between homogeneous and heterogeneous ensembles are minor and do not show any clear tendency. For example, a 95% confidence interval of the MAE, estimated across the results of all respective ensemble methods across all data sets, is [5.24; 5.42] for homogeneous ensembles, compared to [5.29; 5.41] for heterogeneous ensembles. This shows that the results do not warrant any conclusions regarding the superiority of one ensemble philosophy over the other.

6.2. Predictive value of private information

A second set of regression models is created to examine the predictive value of PI. Specifically, all 494 models are rebuilt after removing the PI covariates from the datasets. The resulting regression models differ from those used in the previous comparison only in that they do not embody PI. Consequently, any observed differences in forecast accuracy can be attributed directly to the inclusion/exclusion of PI.

We then perform a two-way repeated-measures ANOVA to test the predictive value of PI. The two (within-subjects) factors in this comparison are PI (two levels: included/excluded) and the forecasting method (nineteen levels). The dependent variable of the test is the MAE. The factor car model is included as a blocking factor because Table 3 suggests that it explains a notable degree of the variation in MAE. ANOVA confirms that PI has a significant (main) effect on the MAE (F[1; 158,067] = 386; MSE = 16,736; p < 0.000; using a Greenhouse-Geisser correction). The estimated marginal means of the MAE when including and excluding PI are 4.55 and 4.76, respectively. The difference between the two is significant at the 1% level (p < 0.000). In other words, using PI reduces the forecast errors significantly, indicating that PI has predictive value.

ANOVA also reveals a significant interaction between the factors PI and forecasting method (F[479; 756,553] = 5997; MSE = 50,935; p < 0.000; using a Greenhouse-Geisser correction). This implies that the degree to which PI improves the forecast accuracy depends on the forecasting method. Fig. 2 examines this dependency in more detail by depicting the pairwise differences between forecast errors with and without PI for each method. The differences are computed such that a positive value indicates a better performance (lower error) when including PI.
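A sketch of this ablation logic follows, assuming the train/test split and the hypothetical private_cols list from the earlier Section 5.1 sketch. Random forest serves as the example method because it benefits the most from PI in Fig. 2.

```python
# Sketch of the PI experiment: refit the same method with and without the
# private-information covariates and compare test-set MAE.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_test_mae(X_tr, y_tr, X_te, y_te):
    model = RandomForestRegressor(n_estimators=500, random_state=0)
    model.fit(X_tr, y_tr)
    return np.mean(np.abs(y_te - model.predict(X_te)))

mae_incl = rf_test_mae(X_train, y_train, X_test, y_test)
mae_excl = rf_test_mae(X_train.drop(columns=private_cols), y_train,
                       X_test.drop(columns=private_cols), y_test)
print(f"MAE difference (excl. PI minus incl. PI): {mae_excl - mae_incl:+.2f}")
```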

Fig. 2 reveals that most of the forecasting methods perform better when PI is available, with the MAE typically differing by about 0.5. Considering the estimated marginal mean of the MAE when excluding PI, 4.76 (see above), a positive difference of 0.5 points indicates that PI reduces the forecast error by about 10%. The largest improvement, of almost one point (∼20%), is observed for RF. This shows that at least some of the covariates in the PI group are related systematically to resale prices.

However, PI reduces the accuracy for MLR, Lasso and RiR, which also had the poorest performances in the previous experiment (e.g., Table 4). This performance decrease is due to MLR, Lasso and RiR being more sensitive to high dimensionality. Although PI has predictive value in general, it is likely that some of the PI covariates do not contribute much information, and/or are correlated with other covariates. Unlike the more advanced forecasting methods, MLR, Lasso and RiR appear to be unable to distinguish between informative and redundant PI covariates. When using PI with these methods, the negative effect of the higher dimensionality outweighs the positive effect of additional information, which in turn increases the forecast errors. This finding provides some additional insight on RQ1b. Specifically, it confirms the view that a stepwise covariate selection is the best way to deal with high dimensionality among the linear methods. The accuracy of SWR increases by 0.43 points (Fig. 2) when PI is included, whereas in RiR and the Lasso, the complexity penalty does not succeed in filtering out redundant information from the PI covariates.

An important question is whether the observed MAE differences, though statistically significant, are substantial enough to add to the bottom line. After all, an in-house resale price forecasting solution is necessary in order to benefit from PI. However, this question is difficult to answer. The most difficult part is probably estimating the increase in actual resale prices that better forecasts facilitate. The car manufacturer who donated the data for this study estimates an average resale price of its car models (including special equipment) of €35,000. According to the Federal Office for Motor Traffic in Germany, the number of repossessions for the car models considered here was at least 500,000 cars p.a. between 2006 and 2015. Using data from the same car maker, Jerenz (2008) estimated that a price-based revenue management program including a resale price forecasting engine is able to increase revenues by 4.6%.


Fig. 3. Differences in MAEs between car model specific and independent regression models.

Having observed MAE improvements of up to 20% (RF), one may speculate that an enhancement of the forecasting engine of such a magnitude might facilitate a 1% increase in program performance, which, based on the data above, implies an increase in revenues of €175 million in the German market alone.

6.3. Prediction model specificity

The effect of prediction model specificity on the forecast accuracy is examined in the last research question. The analysis is similar to the PI experiment, and involves a comparison of two groups of prediction models. High-specificity prediction models are created from the sales data of individual car models, whereas low-specificity models are derived from a pooled dataset of all six car models. The influence of the factor model specificity (two levels: high/low) is then tested with a repeated measures ANOVA. The second factor and the dependent variable are the prediction method (nineteen levels) and the MAE, respectively, as in the PI experiment. The blocking factor car model is also included in the statistical model.

The ANOVA results reveal that the specificity does not have a significant main effect (F[2; 158,067] = 0.78; MSE = 6.01; p = 0.38; using a Greenhouse-Geisser correction). This is also expressed in the estimated marginal means of the MAE in the two settings, which are almost identical (4.762 and 4.758 for the low- and high-specificity models, respectively). However, the ANOVA results suggest a significant interaction between model specificity and the prediction method (F[3.57; 564,733] = 1,705; MSE = 7,904; p < 0.000; using a Greenhouse-Geisser correction).

The accuracy differences between the high- and low-specificity modeling approaches have two sources: heterogeneity and sample size. A sample size effect emerges because the low- and high-specificity prediction models are estimated from all of the sales data versus that of only a single car model, respectively. The significant factor interaction indicates that the compound effect of heterogeneity and sample size on the accuracy depends on the forecasting method. This is examined in Fig. 3, which shows the pairwise difference in MAE between high- and low-specificity models for each forecasting method. A positive difference indicates that low-specificity models predict more accurately.
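The two setups can be sketched as follows, assuming a pooled data set in which the car model is encoded as a single categorical column (the study uses five dummy variables); all names are illustrative.

```python
# Sketch of the two modeling setups: one pooled (low-specificity) model
# versus one model per car model (high-specificity).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def test_mae(X_tr, y_tr, X_te, y_te):
    model = RandomForestRegressor(n_estimators=500, random_state=0)
    model.fit(X_tr, y_tr)
    return np.mean(np.abs(y_te - model.predict(X_te)))

# Low specificity: one pooled model, car model encoded via dummy variables
mae_low = test_mae(pd.get_dummies(X_train, columns=["car_model"]), y_train,
                   pd.get_dummies(X_test, columns=["car_model"]), y_test)

# High specificity: one model per car model, test errors pooled afterwards
errors = []
for car in X_train["car_model"].unique():
    tr = X_train["car_model"] == car
    te = X_test["car_model"] == car
    err = test_mae(X_train[tr].drop(columns="car_model"), y_train[tr],
                   X_test[te].drop(columns="car_model"), y_test[te])
    errors.append((err, te.sum()))
mae_high = sum(e * n for e, n in errors) / sum(n for _, n in errors)
```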

Fig. 3 reveals that seven of the nineteen methods perform better when creating car model specific prediction models. On the other hand, twelve methods perform better when the same prediction model is used to forecast resale prices for all car models.

Linear methods benefit the most from a high-specificity setup. These methods should not suffer a sample size effect. Even in the most extreme case of car model 6, a high-specificity setup leaves more than 3000 observations⁹ for estimating the regression coefficients. This suggests that heterogeneity is the main reason why the forecast accuracy improves in the high-specificity setup. Only this setup allows linear methods to capture car model specific relationships between the covariates and resale prices; it is impossible in the low-specificity setup because the regression function can only accommodate additive covariate effects.

Sample sizes play a more important role when estimating complex, nonlinear relationships between covariates and the response variable. Larger samples are also beneficial in an ensemble framework, where different modeling stages require disjoint subsamples of the original data. This is consistent with the results observed. In particular, Fig. 3 shows that most ensemble methods perform better in a low-specificity setup. More generally, the methods that perform better in a low-specificity setup consist exclusively of nonlinear regression methods. It is plausible that these methods benefit a lot from larger sample sizes. Given that the results of the linear prediction methods indicate the presence of heterogeneity, Fig. 3 also suggests that most nonlinear forecasting methods are not affected by heterogeneity. More specifically, while a negative heterogeneity effect may exist, it is not substantial enough to outweigh the positive sample size effect.

It is interesting to consider what features protect the nonlinear forecasting methods against a negative heterogeneity effect. From a mathematical point of view, a nonlinear technique can approximate arbitrarily complex relationships (Hornik et al., 1989), which could allow the method to account for heterogeneity automatically. For example, if the data suggest that the effect of a covariate on resale prices differs across car models, the forecasting methods could capture this effect by relating the covariate to the car model dummy variable.

⁹ The dataset for car model 6 has 5585 observations, of which 60% are used as a training set (Section 5.1).


It is interesting to consider what features protect the nonlinear forecasting methods against a negative heterogeneity effect. From a mathematical point of view, a nonlinear technique can approximate arbitrarily complex relationships (Hornik et al., 1989), which could allow the method to account for heterogeneity automatically. For example, if the data suggest that the effect of a covariate on resale prices differs across car models, the forecasting method could capture this effect by relating the covariate to the car model dummy variable. The observed results suggest that this theoretical advantage carries over to real data to some extent: most of the nonlinear forecasting methods benefit from a low-specificity setup, although previous results have indicated the presence of heterogeneity across car models.

However, the performances of (single and bagged) ANN learners and of stacking are not consistent with the explanation above. Instead, these methods benefit from a high-specificity modeling approach. Stacking is based on (stepwise) linear regression (i.e., to form the heterogeneous ensemble). It may be that the advantages of linear regression translate to the ensemble construction stage to some extent; that is, the high-specificity modeling approach could help to account for heterogeneity when estimating the coefficients in the second-stage regression model (see Section 2.2.2 in the online companion). The ANN results are more difficult to explain. Being a universal approximator (Hornik et al., 1989), an ANN could capture heterogeneity in the same way as other nonlinear methods. We examine this by comparing the ANN considered in Fig. 3 to an ANN with two hidden layers; a second hidden layer might help the neural network to capture discontinuities and, as such, any heterogeneity in the data (we are grateful to an anonymous reviewer for suggesting this comparison). Using the pooled dataset including all car models (i.e., the low-specificity setting), we compare the MAE of ANNs with one and two hidden layers across 500 random splits of the data into training (60%) and test (40%) sets. For each split, we estimate the connection weights of the two ANN models on the training set and calculate their MAEs on the test set; a sketch of this procedure is given below. The comparison shows the mean MAE (over the 500 random train/test splits) of the single-hidden-layer ANN to be 9.20, compared to a value of 9.27 for the ANN with two hidden layers. A paired t-test suggests that the single-hidden-layer ANN predicts significantly more accurately (t(499) = −3.6, sd = 0.41; p-value = 0.003). This supports the view that heterogeneity has not caused the ANN to perform better in the high-specificity setup. However, it also raises the question of what other factors explain the ANN result of Fig. 3, where the availability of more training data in the low-specificity setup should have improved the ANN accuracy but failed to do so. Of course, we cannot rule out the possibility that the specific meta-parameter selection considered in this study (see Table 1) was less appropriate for the data corresponding to the low-specificity setting. For example, while a value of 0.01 is the smallest candidate setting of the regularization parameter that we consider, more observations and fewer variables in the low-specificity dataset might require even lower settings. However, considering the objectives of this study, a more detailed analysis of ANN meta-parameters, and of the origin of the observed ANN results in the specificity experiment more generally, seems out of scope and is thus left for future research.

A general conclusion from Fig. 3 is that the absolute differences in MAE range between 0.02 and 0.68, which is less than in the PI experiment. However, given an average MAE of around five, the differences of 0.1 that are observed for most forecasting methods still equate to about a 2% error reduction.
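The following sketch outlines the split-and-refit comparison described above. The scikit-learn implementation, network widths, synthetic data, and the reduced number of splits are illustrative assumptions; they do not reproduce the ANN implementation or meta-parameter grid (Table 1) used in the study.

```python
# Illustrative sketch (assumed implementation, synthetic data): compare a
# one-hidden-layer ANN against a two-hidden-layer ANN over repeated random
# 60/40 train/test splits, then run a paired t-test on the test MAEs.
import numpy as np
from scipy.stats import ttest_rel
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))                       # placeholder covariates
y = X @ rng.normal(size=10) + rng.normal(size=1000)   # placeholder prices

n_splits = 50  # the study uses 500 splits; fewer keeps this sketch quick
mae_one, mae_two = [], []
for seed in range(n_splits):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=0.6, random_state=seed)
    # Layer widths and the regularization strength (alpha) are arbitrary
    # choices for this sketch, not the study's tuned settings.
    for layers, store in (((32,), mae_one), ((32, 32), mae_two)):
        net = MLPRegressor(hidden_layer_sizes=layers, alpha=0.01,
                           max_iter=2000, random_state=seed)
        net.fit(X_tr, y_tr)
        store.append(mean_absolute_error(y_te, net.predict(X_te)))

t_stat, p_val = ttest_rel(mae_one, mae_two)  # paired test across splits
print(f"mean MAE (1 hidden layer):  {np.mean(mae_one):.3f}")
print(f"mean MAE (2 hidden layers): {np.mean(mae_two):.3f}")
print(f"paired t-test: t = {t_stat:.2f}, p = {p_val:.4f}")
```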


7. Conclusions

This study has evaluated the comparative performances of several alternative regression methods for forecasting resale prices in the used car industry. Our empirical results have shown that the market value of a used car depends on a variety of factors and that the forecast accuracy benefits from incorporating these factors into resale price prediction models. One downside of information-rich forecasting approaches is that they increase the dimensionality. However, an analysis of high- versus low-dimensional forecasting settings revealed that several forecasting methods are well prepared to cope with large numbers of covariates. Depending on the specific modeling approach, the forecast accuracy can also suffer from heterogeneity. The results observed suggest that heterogeneity occurs in RPF but does not decrease the forecast accuracy for most advanced forecasting methods. Overall, ensemble methods have been found to produce the most accurate forecasts across several experimental conditions.

7.1. Implications

The empirical study has a number of implications for both theory and practice. From an academic point of view, the first implication pertains to the appealing performances of RF and ES. Our benchmarking experiment contributes both by consolidating previous modeling approaches and by serving as an anchor for future research. Subsequent studies that forecast car resale prices, or the prices of durable goods more generally, should consider at least one of the methods that performed best in this experiment. Providing a highly challenging benchmark, these methods facilitate a critical assessment of methodological advances.

A second academic implication is that the results of previous studies associated with the used car market should be revisited. The vast majority of previous studies relied exclusively on linear regression methods, while some employed standard techniques to account for nonlinearity, such as the log transformation. The results of this paper clearly show that linear regression methods (whether with or without such transformations) predict significantly less accurately than more advanced regression methods. It is also important to examine the predictability of market prices in order to validate the existing theory concerning the factors that govern price formation. A predictability gap between linear and advanced forecasting methods indicates that some of the latent determinants of market prices have not yet been uncovered, thus hinting at a gap in our structural understanding (Shmueli & Koppius, 2011). This opens the way for future research to uncover the origin and nature of these determinants. More specifically, it may be that previous studies of the used car market reached misleading conclusions because their methodology failed to account for important determinants of resale prices. Thus, future explanatory studies of the effects of information asymmetries, knowledge sharing, or other constructs in the used car market should use linear regression methods alongside advanced regression methods such as RF. This helps to determine the expressive power of an explanatory model and the degree to which it provides structural insight.



Finally, the study further supports the call of Armstrong and Fildes (2006) for forecasting experiments to explore multiple ‘reasonable hypotheses’. Each of the three research questions emphasizes a different aspect of a forecasting task. Some of the forecasting methods perform equally well in all settings. This robustness to dimensionality and specificity represents an important advantage over less robust competitors. On the other hand, linear methods performed much worse than several alternatives in the initial benchmark but were almost competitive in the low-dimensionality setting. This emphasizes that the experimental setting can have a large effect on the empirical results, and thus on the relative merits of competing prediction methods. Hence, future benchmarking experiments in predictive modeling should adopt the multiple-hypotheses approach.

The empirical results also have several managerial implications. First, the study has identified two modeling techniques, RF and ES, that are particularly suitable for predicting resale prices on the basis of car characteristics and other factors. Given that the differences in accuracy between the two are negligible, RF is preferred. This method is already available in some modeling toolboxes and is therefore easier to adopt than ES. A second implication is also related to implementation efforts. The study has shown that most forecasting methods work well in a low-specificity setup. This is especially true for RF, which improves the forecast accuracy by about 7% (Fig. 3) in a low-specificity setup. This indicates that car makers do not need to create large numbers of regression models for different car models. Provided that a proper modeling technique is employed, it is appropriate to use the same model to estimate resale prices for different car models; a sketch of such a pooled RF model is given below. This reduces the effort required to create and maintain prediction models within an in-house resale price forecasting system and contributes to the economic feasibility of such an approach. A third implication is that sellers of used cars have an informational advantage over market research agencies. In particular, the study has provided evidence of the predictive value of PI. A forecasting method with inbuilt information-filtering abilities will produce significantly more accurate forecasts when PI is available. This indicates that sellers of used cars do indeed have an incentive to invest in in-house resale price forecasting systems. By definition, external market research agencies are unable to use PI in their forecasts. As a consequence, sellers’ informational advantages allow them to improve upon the accuracy of agency forecasts.
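As an illustration of this low implementation effort, the sketch below fits a single pooled random forest across several car models. The feature set, price process, and scikit-learn defaults are hypothetical stand-ins for the study's data and tuned meta-parameters; the point is that one model, with the car model included as a covariate, can serve all car models.

```python
# Illustrative sketch (hypothetical features and data, scikit-learn in place
# of the study's tuned implementation): one pooled random forest serves all
# car models in a low-specificity setup.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 5000
df = pd.DataFrame({
    "car_model": rng.integers(1, 7, n),     # six car models, pooled together
    "age_months": rng.uniform(6, 72, n),
    "mileage_km": rng.uniform(5e3, 2e5, n),
    "horsepower": rng.uniform(60, 250, n),
})
# Hypothetical price process with car-model-specific depreciation, which the
# forest can capture by splitting on the car_model covariate.
df["resale_price"] = (30000 - 150 * df["age_months"]
                      - 0.05 * df["mileage_km"] + 40 * df["horsepower"]
                      - 500 * df["car_model"] * (df["age_months"] / 72)
                      + rng.normal(0, 500, n))

X = df.drop(columns="resale_price")
y = df["resale_price"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.6,
                                          random_state=0)

# The forest size is an arbitrary choice for this sketch.
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X_tr, y_tr)
print(f"test MAE: {mean_absolute_error(y_te, rf.predict(X_te)):.1f}")
```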

7.2. Limitations and future research

The study also has a number of limitations that need to be addressed in future work. As with any empirical study, the observed results hold true for the data employed but do not necessarily generalize to other settings. This study aims to ensure external validity by employing both a large sample of real-world data (an experimental setup that is well established for the assessment of competing prediction models) and formal statistical tools for testing the significance of the observed results. However, replicating the study using other data, e.g., from different car brands, would be a valuable contribution in order to secure the above conclusions further. A particularly important task for future research would be to examine the profitability of an in-house resale price forecasting solution using investment appraisal methods. For such an economic analysis, a key challenge is to quantify the economic value of accurate forecasts. Exploring the revenue impact of higher accuracy through simulations within a fully functional revenue management system might be one way to achieve this. However, the implementation and data requirements of such an endeavor present a formidable challenge. In this sense, an important finding of this study is that the necessary conditions for the profitability of an internal forecasting approach are fulfilled.

Acknowledgments

The authors wish to express their gratitude to the managing editor, George Kapetanios, for his continuous support and his efforts in handling the paper. Furthermore, the anonymous reviewers provided several very constructive comments, which have helped substantially in improving the paper further. The authors are very thankful for these suggestions. Finally, the authors would like to thank Stefan Gnutzmann, who provided both the data for the study and several valuable comments. Without him, this research would not have been possible.

Appendix A. Supplementary data

Supplementary material related to this article can be found online at http://dx.doi.org/10.1016/j.ijforecast.2017.04.003.

References

Adams, C. P., Hosken, L., & Newberry, P. W. (2011). Vettes and lemons on eBay. Quantitative Marketing and Economics, 9, 109–127.
Agarwal, M. K., & Ratchford, B. T. (1980). Estimating demand functions for product characteristics: The case of automobiles. Journal of Consumer Research, 7, 249–262.
Andrews, T., & Benzing, C. (2007). The determinants of price in internet auctions of used cars. Atlantic Economic Journal, 35, 43–57.
Armstrong, J. S. (2001). Combining forecasts. In J. S. Armstrong (Ed.), Principles of forecasting: a handbook for researchers and practitioners (pp. 417–439). Boston: Kluwer.
Armstrong, J. S., & Fildes, R. (2006). Making progress in forecasting. International Journal of Forecasting, 22, 433–441.
Baumeister, C., & Kilian, L. (2015). Forecasting the real price of oil in a changing world: a forecast combination approach. Journal of Business and Economic Statistics, 33, 338–351.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Brown, G., Wyatt, J. L., & Tino, P. (2005). Managing diversity in regression ensembles. Journal of Machine Learning Research (JMLR), 6, 1621–1650.
Cang, S., & Yu, H. (2014). A combination selection algorithm on forecasting. European Journal of Operational Research, 234, 127–139.
Carrizosa, E., Martín-Barragán, B., & Romero Morales, D. (2014). A nested heuristic for parameter tuning in support vector machines. Computers & Operations Research, 43, 328–334.

Caruana, R., Munson, A., & Niculescu-Mizil, A. (2006). Getting the most out of ensemble selection. In Proceedings of the 6th international conference on data mining (pp. 828–833). Hong Kong, China: IEEE Computer Society.
Crone, S. F., Hibon, M., & Nikolopoulos, K. (2011). Advances in forecasting with neural networks? Empirical evidence from the NN3 competition on time series prediction. International Journal of Forecasting, 27, 635–660.
Dejaeger, K., Verbeke, W., Martens, D., & Baesens, B. (2011). Data mining techniques for software effort estimation: A comparative study. IEEE Transactions on Software Engineering, 38, 375–397.
Desai, P., & Purohit, D. (1998). Leasing and selling: Optimal marketing strategies for a durable goods firm. Management Science, 44, 19–34.
Du, J., Xie, L., & Schroeder, S. (2009). PIN optimal distribution of auction vehicles system: Applying price forecasting, elasticity estimation, and genetic algorithms to used-vehicle distribution. Marketing Science, 28, 637–644.
Erdem, C., & Sentürk, I. (2009). A hedonic analysis of used car prices in Turkey. International Journal of Economic Perspectives, 3, 141–149.
Ferrer, J.-C., Oyarzún, D., & Vera, J. (2012). Risk averse retail pricing with robust demand forecasting. International Journal of Production Economics, 136, 151–160.
Fildes, R., Goodwin, P., & Lawrence, M. (2006). The design features of forecasting support systems and their effectiveness. Decision Support Systems, 42, 351–361.
Genesove, D. (1993). Adverse selection in the wholesale used car market. Journal of Political Economy, 101, 644–665.
Goodwin, P., & Lawton, R. (1999). On the asymmetry of the symmetric MAPE. International Journal of Forecasting, 15, 405–408.
Ha, S. H., & Krishnan, R. (2012). Predicting repayment of the credit card debt. Computers & Operations Research, 39, 765–773.
Halme, M., & Kallio, M. (2011). Estimation methods for choice-based conjoint analysis of consumer preferences. European Journal of Operational Research, 214, 160–167.
Hansen, J. V., McDonald, J. B., & Turley, R. S. (2006). Partially adaptive robust estimation of regression models and applications. European Journal of Operational Research, 170, 132–143.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning (2nd ed.). New York: Springer.
Ho, T. K. (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 832–844.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.
Jerenz, A. (2008). Revenue management and survival analysis in the automobile industry. Wiesbaden: Gabler.
Kooreman, P., & Haan, M. A. (2006). Price anomalies in the used car market. De Economist, 154, 41–62.
Lawrence, M., Goodwin, P., O’Connor, M., & Önkal, D. (2006). Judgmental forecasting: A review of progress over the last 25 years. International Journal of Forecasting, 22, 493–518.
Lessmann, S., Baesens, B., Seow, H.-V., & Thomas, L. C. (2015). Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research. European Journal of Operational Research, 247, 124–136.
Lessmann, S., Listiani, M., & Voß, S. (2010). Decision support in car leasing: a forecasting model for residual value estimation. In M. Lacity, F. Niederman, & S. March (Eds.), Proceedings of the international conference on information systems, ICIS’2010 (Paper 17). Saint Louis, MO, USA: AIS.
Lian, C., Zhao, D., & Cheng, J. (2003). A fuzzy logic based evolutionary neural network for automotive residual value forecast. In N. Ansari, F. Deek, C.-Y. Lin, & H. Yu (Eds.), Proceedings of the international conference on information technology: research and education (pp. 545–548). Newark, NJ, USA: IEEE Computer Society.
Loterman, G., Brown, I., Martens, D., Mues, C., & Baesens, B. (2012). Benchmarking regression algorithms for loss given default modeling. International Journal of Forecasting, 28, 161–170.
Mantrala, M. K., Seetharaman, P. B., Kaul, R., Gopalakrishna, S., & Stam, A. (2006). Optimal pricing strategies for an automotive aftermarket retailer. Journal of Marketing Research, 43, 588–604.
Marn, M. V., Roegner, E. V., & Zawada, C. C. (2003). The power of pricing. McKinsey Quarterly, 1, 26–39.


Narula, S. C., Wellington, J. F., & Lewis, S. A. (2012). Valuating residential real estate using parametric programming. European Journal of Operational Research, 217, 120–128.
Nikolopoulos, K., Goodwin, P., Patelis, A., & Assimakopoulos, V. (2007). Forecasting with cue information: A comparison of multiple regression with alternative forecasting approaches. European Journal of Operational Research, 180, 354–368.
Ozer, O., Zheng, Y., & Chen, K.-Y. (2011). Trust in forecast information sharing. Management Science, 57, 1111–1137.
Pierce, L. (2012). Organizational structure and the limits of knowledge sharing: Incentive conflict and agency in car leasing. Management Science, 58, 1106–1121.
Prado, S. (2009). The European used-car market at a glance: hedonic resale price valuation in automotive leasing industry. Université de Paris Ouest Nanterre La Défense. Available at http://economix.u-paris10.fr/.
Purohit, D. (1992). Exploring the relationship between the markets for new and used durable goods: The case of automobiles. Marketing Science, 11, 154–167.
Sharif Azadeh, S., Marcotte, P., & Savard, G. (2015). A non-parametric approach to demand forecasting in revenue management. Computers & Operations Research, 63, 23–31.
Shmueli, G., & Koppius, O. R. (2011). Predictive analytics in information systems research. MIS Quarterly, 35, 553–572.
Syntetos, A. A., Babai, Z., Boylan, J. E., Kolassa, S., & Nikolopoulos, K. (2016). Supply chain forecasting: Theory, practice, their gap and the future. European Journal of Operational Research, 252, 1–26.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.
Vicari, D., & Vichi, M. (2013). Multivariate linear regression for heterogeneous data. Journal of Applied Statistics, 40, 1209–1230.
Zhang, G., Hu, M. Y., Patuwo, B. E., & Indro, D. C. (1999). Artificial neural networks in bankruptcy prediction: General framework and cross-validation analysis. European Journal of Operational Research, 116, 16–32.
Zheng, M., Wu, K., & Shu, Y. (2016). Newsvendor problems with demand forecast updating and supply constraints. Computers & Operations Research, 67, 193–206.

Stefan Lessmann received a diploma in business administration and a Ph.D. from the University of Hamburg in 2002 and 2007, respectively. He worked as a lecturer and senior lecturer in business informatics at the Institute of Information Systems of the University of Hamburg. Since 2008, Stefan has also been a guest lecturer at the School of Management of the University of Southampton, where he gives under- and postgraduate courses on quantitative methods, electronic business, and web application development. Stefan completed his habilitation on decision analysis and support using ensemble forecasting models in 2012. He then joined the Humboldt-University of Berlin in 2014, where he heads the Chair of Information Systems at the School of Business and Economics. Since 2015, he has served as an associate editor in the decision analytics department of Business and Information Systems Engineering (BISE). Stefan has secured substantial amounts of research funding and published several papers in leading international journals and conferences, including the European Journal of Operational Research, IEEE Transactions on Software Engineering, and the International Conference on Information Systems. He participates actively in knowledge transfer and consulting projects with industry partners, from small start-up companies to global players and not-for-profit organizations.
Stefan Voß holds degrees in Mathematics (diploma) and Economics from the University of Hamburg and a Ph.D. and the habilitation from the University of Technology Darmstadt. His previous positions include full professor and head of the Department of Business Administration, Information Systems and Information Management at the University of Technology Braunschweig (Germany), 1995–2002. Prof. Voß’s main interests focus on the fields of information systems, supply chain management and logistics, as well as intelligent search. He has an international reputation due to his numerous publications in these fields. His current research projects include the consideration of problem formulations in the fields of information systems in transport and supply chain management, as well as the use of meta-heuristics and intelligent search algorithms in practical applications. Prof. Voß participates in advisory boards and editorships for academic journals such as the INFORMS Journal on Computing, Public Transport Journal, and the Journal of Heuristics.