Transportation Research Part B 101 (2017) 295–305
Contents lists available at ScienceDirect
Transportation Research Part B journal homepage: www.elsevier.com/locate/trb
Construction cost estimation: A parametric approach for better estimates of expected cost and variation Omar Swei a,∗, Jeremy Gregory b, Randolph Kirchain c a
U.S. Fulbright Scholar to Jordan, Department of Civil & Environmental Engineering, Material Systems Laboratory, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Building E19-695, Cambridge, MA 02139, USA b Research Scientist, Department of Civil & Environmental Engineering, Material Systems Laboratory, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Building E19-695, Cambridge, MA 02139, USA c Principal Research Scientist, Materials Processing Center, Material Systems Laboratory, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Building E19-695, Cambridge, MA 02139, USA
a r t i c l e
i n f o
Article history: Received 2 June 2016 Revised 24 April 2017 Accepted 25 April 2017
Keywords: Cost estimation Cost overruns Heteroscedasticity Least angle regression Parametric estimation Roadways
a b s t r a c t As project planners continue to move towards frameworks such as probabilistic life-cycle cost analysis to evaluate competing transportation investments, there is a need to enhance the current cost-estimation approaches that underlie these models to enable improved project selection. This paper presents an approach for cost estimation that combines a maximum likelihood estimator for data transformations with least angle regression for dimensionality reduction. The authors apply the proposed method for 15 different pavement bid items across five states in the United States. The results from the study demonstrate that the proposed approach frequently leads to consistent parametric estimates that address the structural bias and heteroscedasticity that plague the current cost-estimation procedures. Both of these aspects are particularly important for large-scale construction projects, where traditional methods tend to systematically underestimate expected construction costs and overestimate the associated variance. © 2017 Elsevier Ltd. All rights reserved.
1. Introduction Over the last few decades, transportation planners have used several quantitatively-based methods to evaluate the costeffectiveness of alternative infrastructure investments. For roadways, the focus of this research, decision-makers frequently use life-cycle cost analysis (LCCA), a framework that estimates the total cost of a project over its lifetime, to evaluate the merits of alternative designs (Walls and Smith, 1998). Implicitly, the outputs and accompanying conclusions of any LCCA depend upon assumed values for relevant input parameters. As a result, as practitioners continue to move towards probabilistic-based LCCA, there is an increasing need for techniques that provide more representative estimates of both the expected value and associated variation for input values (Walls and Smith, 1998; Tighe, 2001). This need is particularly true for inputs and life-cycle stages that contribute significantly to total expected life-cycle costs (LCC) and variation. A previous study conducted by the authors of this paper demonstrated that not only are initial-cost estimates dominant contributors to total agency LCC, but they are also the major driver of variation across a range of contextual (e.g., traffic, climate) conditions (Swei et al., 2015). As a result, the goal of this research is to benchmark the fidelity of current approaches
∗
Corresponding author. E-mail addresses:
[email protected] (O. Swei),
[email protected] (J. Gregory),
[email protected] (R. Kirchain).
http://dx.doi.org/10.1016/j.trb.2017.04.013 0191-2615/© 2017 Elsevier Ltd. All rights reserved.
296
O. Swei et al. / Transportation Research Part B 101 (2017) 295–305
Table 1 Statistical methods to estimate construction costs for buildings and bridges. Study
Approach
Performance metrics
McCaffer et al. (1984)
Linear regression (OLS and alternative approaches) Neural networks and linear regression
COV of ratio of predicted over actual
Emsley et al. (2002) Trost and Oberlender (2003) Chan and Park (2005)
Factor analysis and multiple linear regression (MLR) Principal component analysis and MLR
Lowe et al. (2006)
Stepwise MLR
Ji et al. (2010) Kim and Hong (2012)
Stepwise MLR Case-based reasoning and MLR
Coefficient of determination (R2 ) and mean absolute percent error (MAPE) Not applicable R2 , MAPE, tests for heteroscedasticity and serial correlation of the residuals R2 , MAPE, qualitatively assess bias and heteroscedasticity of the residuals Adjusted R2 , MAPE MAPE and standard deviation of MAPE
to estimate initial-costs (both expectation and variation) and to identify a modeling approach that leads to better estimates of expected initial-costs and associated variation. The authors compare the performance of the proposed modeling approach and current approaches by applying them to pay items that contribute significantly to the total cost of roadway construction. Because this study is limited in scope to roadway construction, the findings of this paper should be viewed as one contribution to the broader topic of transportation cost estimation, a domain in which several authors have documented the prevalence of inaccurate and imprecise early cost estimates (Eliasson and Fosgerau, 2013; Flyvbjerg et al., 2002; Molenaar, 2005). Such studies are primarily concerned with mega-infrastructure projects, tremendous financial investments that cost the public billions of dollars to construct (Flyvbjerg, 2003). Flyvbjerg et al. (2002), perhaps the most widely cited paper on this topic, demonstrate that early cost estimates across a range of transportation types (rail, fixed-link, and roadways) are both highly inaccurate, with average cost overruns approaching 30%, and imprecise, with a coefficient of variation (COV) for cost escalation greater than one. Flyvbjerg et al. (2002) subsequently focus their attention on the cause for inaccurate (i.e., biased) estimates and postulate that it is likely attributed to “strategic misrepresentation”, the deliberate misrepresentation of project costs, and possibly “appraisal optimism”, being overly optimistic about the outcomes of a project, on the part of project planners and promoters. In contrast, this research is concerned with roadway construction, in which the cost to build and maintain a single facility is considerably less than the aforementioned mega-projects. For multiple reasons, the authors believe that this case study is of keen interest to the infrastructure community. Planning agencies charged with maintaining large roadway networks must typically operate within fixed budgets, suggesting that their primary concern is to be efficient with available resources (Markow, 1995). Furthermore, the individual contractors that are responsible for procuring roadway projects typically have extensive experience. As a result, cost escalation stemming from strategic misrepresentation and/or appraisal optimism should be mitigated within this context, which the results of previous studies indirectly suggest. Flyvbjerg et al. (2002), for example, find that roadway projects have on average the lowest cost escalation rate (20%) of the three infrastructure types analyzed. Other studies, such as that of Shrestha and Pradhananga (2010), show an even smaller average cost escalation for pavement construction of 3.5%, while the pavement cost dataset used by Williams (2003) has a cost escalation frequency of 60%. These results are not to say that cost escalation is not present in roadway construction, but rather the systemic bias in cost estimation that stems from economic (i.e., strategic misrepresentation) and psychological (i.e., appraisal optimism) factors for mega-infrastructure projects is limited within this context. Therefore, it is of particular interest to develop novel methods for cost estimation in a domain where forecast inaccuracy and imprecision are attributed largely to the analytical tools used by planners. The idea that model misspecification can tremendously influence the fidelity of cost estimation tools is not new. As an example, Lowe et al. (2006) developed a series of regression models for large-scale building projects that, as the authors noted, systematically underestimated construction costs due to model specification. However, few studies (as will be further discussed) have searched for techniques to proactively deal with this issue. Consequently, the approaches discussed in this paper for cost estimation are potentially of great significance given (a) the increasing use of frameworks such as LCCA for pavement design and maintenance decisions and (b) the significance of highway expenditures in their aggregate, representing $165 billion in public spending in the United States in 2014 (Congressional Budget Office, 2015). An analytical approach that might offer a small improvement in model fidelity could represent large savings to the public sector.
2. Cost estimation approaches in the literature Table 1 presents a set of cost estimation research for buildings and bridges, the two most common forms of infrastructure that researchers have studied in-depth. Existing research uses both cross-sectional and panel data to develop parametric and non-parametric models to predict project-level costs. Predictor variables across the studies listed include general site conditions, location, and the geometric layout of a structure. Parametric approaches make use of multiple linear
O. Swei et al. / Transportation Research Part B 101 (2017) 295–305
297
regression combined with dimensionality reduction techniques such as stepwise (both forward and backward) regression, factor analysis, and principal component analysis to elicit the important parameters that influence construction costs. Alternatively, neural networks are a popular non-parametric approach for cost estimation that learns through iteration the response of a system given a set of external inputs (de la Garza and Rouhana, 1995; Hegazy and Ayed, 1998). Although researchers have demonstrated the accuracy of neural networks for cost estimation, such models have inherent drawbacks that parametric methods can address (Emsley et al., 2002). In particular, the ability of the parametric approach to (a) be transparent (b) explicitly account for uncertainty and (c) integrate with analytical tools such as probabilistic LCCA provides practical and theoretical advantages over non-parametric methods such as neural networks. Of course, the main advantage of an approach like a neural network is its ability to account for non-intuitive, complex, and nonlinear relationships between response and predictor variables. As a result, an important component for measuring the fidelity of regression models is to ensure that the estimates are unbiased and the residuals are homoscedastic, two basic assumptions for ordinary least squares (OLS) regression. Doing so guarantees that parametric estimates that would enter a model like LCCA properly reflect the data at hand and the construction activity it represents. Of the studies listed in Table 1, the majority measure model fidelity via conventional metrics such as the sample coefficient of determination (R2 ) and the mean absolute percent error (MAPE) of the model (Trost and Oberlender, 2003; Ji et al., 2010; Kim and Hong, 2012). Chan and Park (2005), one exception to this trend, test for the presence of heteroscedasticity and serial correlation in their regression models for building construction in Singapore. Based on those tests, both the null hypotheses of uncorrelated residuals and homoscedasticity are rejected using Engle’s test for autoregressive conditional heteroscedasticity (ARCH) and White’s test for heteroscedasticity without interaction terms. Notably, however, these test results are not viewed as sufficient reasons to abandon the OLS model of Chan and Park (2005). Lowe et al. (2006) qualitatively evaluate whether or not the residuals of a building cost regression model are homoscedastic or biased for untransformed and logarithmically transformed data. The authors mention that the models are structurally biased such that they “overestimate the cost of cheaper projects and underestimate the cost of more expensive projects.” Furthermore, the residuals are not always homoscedastic over the covariates, potentially leading to mischaracterized parametric estimates of expected construction costs and uncertainty. As the authors note, this may be due to “poor representation” of nonlinear relationships between the dependent and independent variables (Lowe et al., 2006). A smaller body of literature has focused on early cost estimation methods for pavements. Since traditional cost estimation tools for roadways tend to follow a bottom-up approach, these papers have focused more so on cost as it relates to individual pay items rather than the entire cost of a roadway (Chua and Li, 20 0 0; Hiyassat, 20 01; Williams, 20 03). Sanders et al. (1992) is the first major work to have evaluated potential methods to characterize the variation that underlies bid unitprice data. The authors develop a series of linear regression models for nine different bid items in the state of Alabama as a function of quantity. An increase in the size of a project tends to decrease the winning bid unit-price (e.g., economies of scale) for all nine bid prices. The authors measure the fidelity of the estimated regression models using the sample R2 . Shrestha et al. (2014) augments the preceding research by conducting a linear regression analysis of winning bid unit-price versus quantity for twelve different bid items in the state of Nevada, but this time considering alternative traditional data transformations to achieve linearity. The data transformation that leads to the highest sample R2 is selected for each bid item. As expected, all bid items evaluated show a statistically significant and negative correlation between bid unit-price and quantity. Although the aforementioned transformations improve the overall fit of the models, the authors do not report information on how they affect bias and/or homoscedasticity. The development of accurate and unbiased early cost estimates is an important component of comparing alternative transportation and infrastructure investments. In its current form, however, the current models are susceptible to both systematic bias and heteroscedasticity, both of which lead to incorrect assumptions about expected construction cost and uncertainty. Within the context of pavements, the focus of this study, these issues are particularly important given the mitigation of cost estimation bias resulting from the factors discussed in Flyvbjerg et al. (2002). As a result, this paper develops an approach for cost estimation that more frequently leads to unbiased estimates and homoscedasticity across the covariates. The approach merges an optimization solver for data transformations and a dimensionality reduction technique for variable selection amongst multiple covariates. The authors apply their approach to asphalt and concrete pay items, which are composed of not only material costs but also non-material costs such as transportation of materials to site, labor, and equipment. Consequently, these pay items represent a significant portion of total roadway expenditures (well over 50% for the state of Louisiana in one case study) and have also been demonstrated as significant contributors to total variation in LCCA (Swei et al., 2015). The methodological approach discussed in this paper can easily extend to other infrastructure systems and also scale to high-dimensional problems with many input factors to explain a larger portion of the variation than the bivariate models of Sanders et al. (1992) and Shrestha et al. (2014). The results from the case study analyses of pavement pay items (to follow) demonstrate that the traditional approaches for cost estimation oftentimes lead to both biased and heteroscedastic estimates that underestimate expected costs yet yield inflated variance estimates for large-scale projects. Depending upon the dataset, the approach proposed by the authors can lead to estimates that better represent expected costs and variation for initial-cost data. The authors illustrate the impact of the proposed approach for cost estimation by analytically solving for the distribution of bid unit-prices for large-scale roadway projects using both the proposed and the baseline approaches.
298
O. Swei et al. / Transportation Research Part B 101 (2017) 295–305 Table 2 Summary table of the structural form for the three approaches to model variation in bid unit-price data. The structural form selected in Approach 1 is determined by the transformation that leads to the highest R2 . Approach
Model type
Functional form
1
Log Reciprocal Power
Pi = β 0 + β q LN(Xi,q ) + ui 1 = β0 + βq Xi,q + ui Pi LN(Pi ) = β 0 + β q LN(Xi,q ) + ui
2
Box–Cox transformation (λ)
( λ2 ) + ui Pi(λ1 ) = β0 + βq Xi,q
3
Box–Cox transformation (λ), district variation (d), and number of bidders (b)
( λ2 ) + Pi(λ1 ) = β0 + βq Xi,q
L l=1
βdl Xi,dl + βb Xi,b + ui
Notes: Pi –Bid unit-price for ith sample Xi,q –Bid quantity Xi,dl –Dummy variable for lth district within a state Xi,q –Number of bidders for a project λ1 , λ2 –Optimal data transformations for bid unit-price and quantity
It must be noted that a limitation of this study is that the authors relied upon publically available bid data rather than information on the final cost of construction. With that said, previous studies in pavement cost estimation suggest that these findings should hold true for actual cost data. Williams (2005), for example, finds that bid and actual pavement costs (once logarithmically transformed) are correlated with a Pearson correlation coefficient of 0.995, and so the general trends of this paper are likely unchanged with actual cost data. 3. Methodology This study considers three possible parametric approaches to model the unit-price of bid items used in roadway construction. The baseline approach follows that of Shrestha et al. (2014), where bid unit-prices (Pi ) are regressed against bid quantity (Xi,q ) with an ordinary least squares (OLS) estimator. The data is transformed in one of three ways based upon the coefficient of determination (R2 ). The second approach is similar to the baseline except that the analysis searches for an optimal data transformation (denoted as λ). The third and final approach incorporates the optimal data transformation solution while also including variation across districts within a state (Xi,d ) and the number of bidders for a project (Xi,b ). Table 2 summarizes the three modeling approaches discussed. The decision to incorporate number of bidders is due to the results of Shrestha and Pradhananga (2010), which evaluate the deviation between bid price and early cost estimates as a function of the number of bidders for a project. The authors find a strong negative correlation between the two, meaning that for projects with little competition, total costs are significantly higher than expected. This same finding has been corroborated in the buildings community by the research of Carr (2005). Since pay items typically convolve material and non-material costs, such as transportation of materials from the plant to the site, costs are likely to differ within a given state. Therefore, this study introduces dummy variables for each district within a state to account for differences in non-material cost structures. Although other variables, such as distance of the plant from project site, are perhaps better indicators, this information was not readily available to the authors. Future research could expand the number of explanatory factors, for which this methodology can readily accommodate. An important contribution of this study is the evaluation as to whether or not the three proposed approaches adequately meet the assumptions of unbiased and homoscedastic residuals that underlie OLS regression. As such, the goodness-of-fit of the models described is not just measured using the coefficient of determination, but also the Breusch–Pagan test for heteroscedasticity and an F-test on the residuals that evaluates the presence of bias (Breusch and Pagan, 1979). Additionally, the second approach discussed requires a method to select an optimal data transformation, while the third approach necessitates a variable selection/dimensionality reduction technique. For this research, least angle regression (LAR), which is related to the least absolute shrinkage and selection operator (LASSO) algorithm, is conducted for dimensionality reduction. Box– Cox transformations combined with a maximum likelihood (ML) estimator are used to find the optimal data transformation (Box and Cox, 1964). 3.1. Testing for heteroscedasticity and bias in the OLS estimation The specification of a cost model without properly accounting for the presence of bias and/or heteroscedasticity can significantly reduce its usefulness to practitioners. A biased model, in which the expected error of the model is not equal to zero consistently, lends itself to highly inaccurate cost estimates by planners. Furthermore, although parameter estimates may remain unchanged should heteroscedasticity exist, the model will remain inefficient given that it does not properly estimate the true variance for a selected project. Consequently, it is of utmost importance to address both of these concerns to enable better investment decisions by project planners. Perhaps the most simple (yet highly effective) strategy to test for heteroscedasticity is the Breusch–Pagan test, which evaluates the null hypothesis that the residuals of an OLS regression meet the assumption of homoscedasticity. In a typical
O. Swei et al. / Transportation Research Part B 101 (2017) 295–305
299
Table 3 Estimates of λ from Box–Cox MLE and associated ‘traditional’ transformation.
λ value
Corresponding transformation
0.75 to 1.50 0.42 to 0.75 0.17 to 0.42 −0.17 to 0.17 −0.42 to −0.17 −0.75 to −0.42 −0.75 to −1.50
z1 z0.5 z0.33 Ln (z) z−0.33 z−0.5 z−1
multiple linear regression of the form:
Yi = β0 + β1 Xi,1 + . . . β p Xi,p + ui
(1)
where for each individual sample, i, the predictor variable, Yi , is a function of p covariates, Xi,p , and their associated effects, β p , one would expect that the variance of the residuals, ui , is constant across all covariates. Therefore, an auxiliary regression of the squared residuals as a function of those same covariates (with effects now denoted as γ p ) can be run to test whether or not the residuals meet the underlying assumption of homoscedasticity:
uˆ2i = γ0 + γ1 Xi,1 + . . . γ p Xi,p + vi
(2)
The null hypothesis that the parameter estimates for the covariates are equal to zero (e.g., homoscedasticity) can subsequently be tested by use of a Lagrangian multiplier (LM). Furthermore, a t-test on the covariates for the above auxiliary regression can provide insight as to which covariate should be transformed such that the assumption of homoscedasticity is met. Given that bid quantity, Xi,q , is the only covariate that is continuous across all three approaches, it is likely that if the null hypothesis of the Breusch–Pagan test is rejected, then bid quantity will have a high t-statistic for the above auxiliary regression. With that said, such t-statistics are very valuable should future research consider a larger range of factors. The concept of the Breusch–Pagan test is extended by the authors to test for bias in the residuals over bid quantity per:
uˆi = δ0 + δ1 Xi,q +
2 δ2 Xi,q + εi
(3)
Although bias in the residuals could take other forms, this specific type of bias is particularly pervasive in regression models. The F-statistic for this regression is computed to evaluate the null hypothesis that the residuals are unbiased (e.g., δ 1 = δ 2 = 0) as a function of bid quantity. 3.2. Data transformation and dimensionality reduction Logarithmic and reciprocal models are examples of simple data transformation techniques frequently used by statisticians to transform predictor variables and covariates so that homoscedastic and unbiased estimates are met in regression. A more general and flexible class of models exist that are known as Box–Cox transformations, which are of the following form:
z (λ ) =
zλ − 1
(4)
λ
λ (λ2 ) Yi( 1 ) = β0 + βn Xi,n
(5)
In Eq. 5, λ1 and λ2 represent the data transformation of the dependent variable, Y, and covariate, X, for each sample, i, per the transformation for z shown in Eq. 4. The advantage of the Box-Cox transformation is that it covers both traditional (e.g., logarithmic, reciprocal) and non-traditional data transformations frequently used by statisticians. This generality allows one to overcome the shortcoming of existing construction cost studies in which the authors only consider a select few data transformations in the analysis. Table 3 presents the values of λ that correspond with the more traditional transformations used in statistical and econometric analyses. The flexibility of the Box–Cox framework can subsequently be exploited by introducing an ML estimator to select the ‘optimal’ data transformation (i.e., parameter estimates for λ1 and λ2 ) for the dataset at hand. This ML estimator is fairly simple to construct given that the residuals of OLS regression are assumed to conform to a Gaussian distributed. The determination as to which covariate(s) need to be transformed can be inferred from the t-test in the previously mentioned auxiliary regression of the residuals. As for dimensionality reduction, LAR is a recently developed method that has grown in popularly since it can be used as an efficient solution to the LASSO algorithm and address some of the concerns with forward stepwise regression (Efron et al., 2004). Similar to forward stepwise regression, LAR functions as any greedy algorithm where a locally optimal choice is made without looking ahead to future steps. Initially, all parameter estimates are set to zero while values for the covariates are normalized. Subsequently, the covariate most correlated with the residuals is selected, and the parameter estimate is moved from zero towards its least-square coefficient until a second variable has just as much correlation with the residuals as the nth
300
O. Swei et al. / Transportation Research Part B 101 (2017) 295–305 Table 4 Summary table of datasets for analysis (Oman Systems, Inc., 2016). State
Number of districts
Bid datasets
Missouri
7
Concrete pavements HMA mixture BP-1 HMA PG64-22
Indiana
6
PCCP HMA type A surface HMA type B surface HMA type A inter.
Colorado
5
Concrete pavement HMA PG64-22
Louisiana
9
PCCP pavement HMA superpave level 1
Kentucky
12
JPC pavement HMA CL2 0.38D PG64-22 HMA CL3 0.38D PG64-22 HMA CL2 1.00D PG62-22
first covariate. This process of moving covariates until another variable is just as correlated with the residuals continues until the active set of parameters to select from is exhausted. For each new covariate added to the model, Mallow’s Cp (one best-fit measure) is estimated; the step that minimizes its value is treated as the cut-off point for the number of variables to select in model estimation. The benefit of the LAR approach compared to other greedy approaches (such as stepwise regression) is that it does not add a predictor fully to the model at once. Rather, it does so gradually, causing it to be more robust than other greedy approaches.
3.3. Description of data sets and scope of the analysis This research uses bid data for asphalt and concrete pay items collected by Oman Systems, Inc. for the states of Missouri, Colorado, Indiana, Louisiana, and Kentucky between late 2011 and early 2014 (Oman Systems, Inc,. 2016). The selection of bid data rather than actual cost information, as mentioned previously, is due to the general availability of the former in public databases. Approach 3 incorporates both district variation within a given state and the implication of number of bidders (one metric for competition in a project) on bid unit-prices. To address district variation, bid data is divided (through dummy variables as noted in Table 2) into regions according to the current districts established for each respective DOT. Table 4 summarizes the bid datasets and number of districts within each state for this analysis.
Table 5 Summary statistics for Approach 1 (baseline). Values denoted with a ‘∗ ’ reject the null hypothesis of homoscedastic (Breusch–Pagan LM test) or unbiased (residual F-test) residuals. Optimal Box–Cox transformations per MLE are listed below. State
Dataset
Coefficient estimate Estimate
Breusch–Pagan test LM
Biased residual F-statistic
t-statistic ∗
∗
Box–Cox transformation Price λopt
Quantity λopt
Missouri
Concrete pavements HMA mixture BP-1 HMA PG64-22
−0.14 −0.19 −0.05
−17.0 −24.2 −7.0
25.6 172.1∗ 35.8∗
3.1 23.9∗ 8.6∗
−0.96 −1.04 −0.92
−.02 −0.04 −0.47
Indiana
PCCP HMA type A surface HMA type B surface HMA type A inter.
−0.17 −0.17 −0.18 −0.19
−9.1 −17.7 −18.3 −18.3
15.2∗ 126.3∗ 230.5∗ 42.3∗
0.3 10.2∗ 9.1∗ 6.1∗
−0.38 −0.43 −0.46 −0.37
−.07 −0.11 −0.09 −0.06
Colorado
Concrete pavement HMA PG64-22
−0.14 −0.14
−14.1 −9.6
16.1∗ 8.2∗
1.1 4.4∗
−0.58 −0.51
0.19 −0.17
Louisiana
PCCP pavement HMA superpave level 1
−0.13 −0.18
−11.9 −40.2
29.7∗ 19.2∗
0.7 36.9∗
−0.75 −0.35
0.20 −0.09
Kentucky
JPC pavement HMA CL2 0.38D PG64-22 HMA CL3 0.38D PG64-22 HMA CL2 1.00D PG62-22
−0.12 −0.14 −0.12 −0.13
−12.0 −42.8 −10.5 −19.1
17.1∗ 706.5∗ 73.7∗ 115.3∗
1.1 261.0∗ 49.6∗ 19.0∗
0.22 −0.39 −0.46 −0.72
0.11 −0.33 −0.53 −0.11
Note: Approach 1 (Power Model): LN(Pi ) = β 0 + β q LN(Xi,q ) + ui
O. Swei et al. / Transportation Research Part B 101 (2017) 295–305
301
Table 6 Summary of model for Approach 3 (Box–Cox transformations with district variation and number of bidders covariates). The positive and negative signs indicate the existence and direction of a statistically significant relationship. State
Dataset
Price λ
Quantity λ
Quantity estimate
Bids estimate
Number of remaining dummy districts
Missouri
Concrete pavements HMA mixture BP-1 HMA PG64-22
−1 −1 −1
LN LN −0.5
– – –
– + –
4 2 2
Indiana
PCCP HMA type A surface HMA type B surface HMA type A inter.
−0.33 −0.5 −0.5 −0.33
LN LN LN LN
– – – –
–
3 5 1 3
Colorado
Concrete pavement HMA PG64-22
−0.5 −0.5
0.33 LN
– –
– –
3 3
Louisiana
PCCP pavement HMA superpave level 1
−1 −0.33
0.33 LN
– –
–
4 5
Kentucky
JPC pavement HMA CL2 0.38D PG64-22 HMA CL3 0.38D PG64-22 HMA CL2 1.00D PG62-22
0.33 −0.33 −0.5 −0.5
LN −0.33 −0.5 LN
– – – –
4 9 7 4
– –
L ( λ2 ) + Note: Approach 3: Pi(λ1 ) = β0 + βq Xi,q βdl Xi,dl + βb Xi,b + ui . l=1
Table 7 Biased F-test, Breusch Pagan heteroscedasticity test, and R2 values for Approach 1, 2, and 3. Values denoted with a ‘∗ ’ reject the null hypothesis of homoscedastic (Breusch-Pagan LM test) or unbiased (residual F-test) residuals. Biased F-stat
Breusch–Pagan LM
R2
State
Dataset
1
2
3
1
2
3
1
2
3
Missouri
Concrete pavements HMA mixture BP-1 HMA PG64-22
3.1∗ 23.9∗ 8.6∗
0.1 1.6 0.5
0.0 1.8 4.4∗
25.6∗ 172∗ 35.8∗
0.2 30.2∗ 7.3∗
0.8 29.0∗ 6.8∗
0.66 0.64 0.12
0.68 0.69 0.14
0.74 0.72 0.18
Indiana
PCCP HMA type A surface HMA type B surface HMA type A inter.
0.3 10.2∗ 9.1∗ 6.1∗
0.0 0.5 1.3 0.8
0.4 1.4 3.5∗ 0.5
15.2∗ 126∗ 231∗ 42.3∗
2.6 7.4∗ 69.6∗ 0.5
0.2 5.9∗ 25.9∗ 0.2
0.26 0.42 0.45 0.50
0.26 0.42 0.44 0.49
0.34 0.48 0.47 0.54
Colorado
Concrete pavement HMA PG64-22
1.1 4.4∗
3.5∗ 0.7
4.4∗ 0.0
16.1∗ 8.2∗
4.9∗ 1.3
1.0 2.1
0.50 0.35
0.52 0.34
0.59 0.56
Louisiana
PCCP pavement HMA superpave level 1
0.7 36.9∗
0.4 14.6∗
0.1 23.0∗
29.7∗ 19.2∗
0.2 0.8
0.9 2.1
0.49 0.74
0.56 0.74
0.64 0.78
Kentucky
JPC pavement HMA CL2 0.38D PG64-22 HMA CL3 0.38D PG64-22 HMA CL2 1.00D PG62-22
1.1 261∗ 49.6∗ 19.0∗
0.2 0.9 0.3 7.1∗
0.5 5.2∗ 0.4 14.8∗
17.1∗ 707∗ 73.7∗ 115∗
42.8∗ 101∗ 7.3∗ 27.9∗
4.3∗ 145∗ 16.0∗ 26.8∗
0.37 0.47 0.22 0.50
0.37 0.52 0.27 0.52
0.44 0.69 0.63 0.58
Notes: Approach 1 (Power Model): LN(Pi ) = β 0 + β q LN(Xi,q ) + ui
( λ2 ) Approach 2: Pi(λ1 ) = β0 + βq Xi,q + ui
L ( λ2 ) + βdl Xi,dl + βb Xi,b + ui Approach 3: Pi(λ1 ) = β0 + βq Xi,q l=1
4. Results The following results present the fidelity of the current initial-cost estimation method for pavement pay items relative to the proposed approach. 4.1. Testing the validity of the current model: approach 1 To evaluate the performance of the traditional method (Approach 1) for cost estimation, the Breusch–Pagan heteroscedasticity test and the modified F-test for biased residuals are estimated for the baseline model. For all 15 bid datasets, a power model (where the independent and dependent variables are logarithmically transformed) is the best fitting model of the three choices available per the sample coefficient of determination. Table 5 presents bid quantity coefficient estimates and t-statistics, Breusch–Pagan test results (both LM and t-statistic for bid quantity) and biased F-test values for the baseline model as well as optimal Box–Cox transformations per the ML estimator.
302
O. Swei et al. / Transportation Research Part B 101 (2017) 295–305
Fig. 1. Box-whisker plot of bid unit-price expectation (represented by the dots) and its uncertainty for the three approaches for 15,0 0 0 cubic yards of concrete. Whiskers represent the 5th and 95th percentiles of the theoretical distribution.
As can be seen, across all 15 bid items there is a strong correlation between bid unit-price and quantity. On the other hand, the null hypothesis of homoscedasticity is rejected for all datasets, while the null hypothesis of unbiased residuals is rejected in 11 instances. Furthermore, t-statistics for bid quantity in the auxiliary regression of the squared residuals are significant and negative except for one instance (the JPCP pavement model in Kentucky, which is positive). This result suggests two important aspects regarding the datasets at hand. First, the variance of the residuals decreases across bid quantity and, therefore, the OLS estimation overestimates the variance for large projects. This finding is particularly important given that LCCA is typically employed for large-scale roadway projects. Second, the residuals are structurally biased such that the regression models underestimate prices for large projects, similar to what Lowe et al. (2006) noted. These findings suggest that both the bias and heteroscedasticity in the baseline approach could potentially lead to a poor characterization of expected unit-prices and variance for roadway bid items, something that the Box-Cox transformations as presented at the end of Table 5 may address. 4.2. Box–Cox transformations (Approach 2) coupled with the incorporation of multiple covariates (Approach 3) Table 6 presents results of the dimensionality reduction analysis via the LAR algorithm. Both bid unit-price and quantity are transformed in this section by rounding the optimal Box–Cox transformations listed in Table 5 to their closest traditional transformation per Table 2. In addition to economies-of-scale, as noted in Table 5, district variation is important and significant, while the number of bidders for a given project is only significant at the 5% level for half of the datasets. Table 7 compares the Breusch–Pagan LM critical values, biased F-statistic estimates, and coefficient of determination (R2 ) of the baseline model (Approach 1), the Box–Cox transformation analysis (Approach 2), and the Box–Cox model that
O. Swei et al. / Transportation Research Part B 101 (2017) 295–305
303
Fig. 2. Box-whisker plot of bid unit-price expectation (represented by the dots) and its uncertainty for the three approaches for 15,0 0 0 tons of asphalt. Whiskers represent the 5th and 95th percentiles of the theoretical distribution.
incorporates multiple covariates (Approach 3). Although some of the models still exhibit biased residuals (6 of 15) and heteroscedasticity (8 of 15) in Approach 3, it significantly improves both aspects relative to the current benchmark. Furthermore, the improvement of the coefficient of determination (R2 ) between Approach 2 and Approach 3 signifies the ability of additional covariates to reduce unexplained variation in the data, with the Kentucky HMA CL3 0.38D PG64-22 pay item the most extreme example. These findings are particularly promising and provide strong evidence that the modeling approach proposed by the authors can aid the development of initial-cost estimates that better reflect the structural form of the actual data. To demonstrate the potential impact of the proposed method on early-cost estimates for bid items in LCCA, the unitprice of each bid item is estimated using the three approaches discussed for 15,0 0 0 cubic yards of concrete and 15,0 0 0 tons of asphalt, representative of the amount of material that goes into large-scale pavement projects. The number of bidders that are input for each regression in Approach 3 is the sample average for that individual dataset. Figs. 1 and 2 present box-and-whisker plots of the estimated bid unit prices of concrete and asphalt and the associated uncertainty for four of the datasets: Missouri and Louisiana concrete pavements and Missouri BP-1 and Indiana Type B Surface asphalt mixtures. As can be noted, the first three bid items are excellent examples of how optimized data transformations and the expansion of the number of explanatory variables both shifts the distribution mean/median and, for large projects, reduces the expected variance. The Indiana Type B Surface asphalt mixture is one example where the proposed approach has a smaller impact on the estimated cost. This finding does not render the preceding results useless, but rather demonstrates the value of the models and approaches discussed are largely a function of the structure of the data at hand. Another important point is that due to the selected data transformations, the distribution of bid unit-prices is implicitly asymmetric, which leads to expected unit-prices (blue dots) that are greater than the medians (central line) for all cases
304
O. Swei et al. / Transportation Research Part B 101 (2017) 295–305
analyzed, as shown in Figs. 1 and 2. The simplest example to demonstrate this effect is to consider the model form associated with Approach 1, which assumes that natural logarithmic unit-prices are Gaussian distributed. Based upon the properties of a log-normal distribution, expected bid unit-prices, P, given a quantity, Q, in Approach 1 would be: s2
E [P |Q ] = eβ0 +βq LN (Q )+ 2
(6)
where s2 is the unbiased estimator of the variance, σ 2 . Although our study is limited in scope, making it difficult to generalize conclusions, this discussion suggests that practitioners in roadway construction are potentially further prone to underestimate expected costs for large-scale projects because they do not account for the error term when transforming data back to its original form. Changes in practice that address these possible pitfalls could potentially reduce the bias in current roadway cost estimation procedures. 5. Conclusions and future work Probabilistic LCCA is an increasingly used tool amongst the pavement community to evaluate the merits of alternative investments. To be effective, such tools require representative parametric estimates of expectations and variation for input factors. Unfortunately, the existing cost estimation models presented in the literature can lead to biased estimates and the presence of heteroscedasticity, potentially causing planners to select less economically efficient design alternatives for a given roadway project. Consequently, this study presents one alternative approach for initial cost estimates via the integration of a maximum likelihood (ML) estimator to search for an optimal data transformation combined with least angle regression (LAR) for variable selection/reduction. The results presented in Table 7 demonstrate that the proposed approach (Approach 3) accepts the null hypothesis of unbiased and homoscedastic residuals more frequently than current practice (Approach 1). Furthermore, the introduction of new explanatory variables (e.g., district variation) in the proposed approach can lead to significantly higher R2 values as compared to the baseline bivariate models, indicating that this approach can also reduce the unexplained variation for cost estimation models. The results and contribution of the work presented are particularly important for large projects, the type of infrastructure where LCCA is used, as the traditional approach tends to lead to (a) overestimation of variance and (b) biased estimates that underestimate expected costs. Although the purpose of this paper was to address a pertinent need for the pavement management community, the methodological contribution of this work can be extended for other types of infrastructure systems. Despite the promising results from this case study, opportunities exist to further augment this work. The case study only considers asphalt and concrete pay items and does not include other inputs for pavement construction, such as excavation, that contribute significantly to total construction costs (Asmar et al., 2011). Future work should confirm if the results presented are generalizable to the broader set of pay items used in roadway construction. Furthermore, a similar study could be conducted but now using actual construction costs rather than bid prices. Lastly, future research could explore the converging/diverging solutions of alternative dimensionality reduction techniques (e.g., factor analysis, LASSO, principal component analysis) and non-linear frameworks (e.g., regression trees) to predict bid unit-prices. Acknowledgments This research was carried out as part of the Concrete Sustainability Hub at MIT, supported by the Portland Cement Association (PCA) and Ready Mixed Concrete Research and Education Foundation (RMC). References Box, G.E.P., Cox, D.R., 1964. An analysis of transformations. J. R. Stat. Soc. Series B 26 (2), 211–252. Breusch, T.S., Pagan, A.R., 1979. A simple test for heteroscedasticity and random coefficient variation. Econometrica 47 (5), 1287–1294. Carr, P.G., 2005. Investigation of bid price competition measured through prebid project estimates, actual bid prices, and number of bidders. J. Constr. Eng. Manage. 131 (11), 1165–1172. Chan, S.L., Park, M., 2005. Project cost estimating using principal component regression. Constr. Manage. Econ. 23 (3), 295–304. Chua, D.K.H., Li, D., 20 0 0. Key factors in bid reasoning model. J. Constr. Eng. Manage. 126 (5), 349–357. Congressional Budget Office, 2015. Public Spending on Transportation and Water Infrastructure, 1956 to 2014 Congress of the United States, Washington, D.C. de la Garza, J.M., Rouhana, K.G., 1995. Neural networks versus parameter-based applications in cost estimating. Cost Eng. 37 (2), 14–18. Efron, B., Hastie, T., Johnstone, I., 2004. Least angle regression. Annals Stat. 32 (2), 407–499. El Asmar, M., Hanna, A.S., Whithed, G.C., 2011. New approach to developing conceptual cost estimates for highway projects. J. Constr. Eng. Manage. 137 (11), 942–949. Eliasson, J., Fosgerau, M., 2013. Cost overruns and demand shortfalls - deception or selection? Transp. Res. Part B 57, 105–113. Emsley, M.W., Lowe, D.J., Duff, A.R., Harding, A., Hickson, A., 2002. Data modelling and the application of a neural network approach to the prediction of total construction costs. Constr. Manage. Econ. 20 (6), 465–472. Flyvbjerg, B., 2003. Megaprojects and Risk. Cambridge University Press, Cambridge, UK. Flyvbjerg, B., Holm, M.S., Buhl, S.L., 2002. Underestimating cost in public work projects - error or lie? J. Am. Plann. Assoc. 68 (3), 279–295. Hegazy, T., Ayed, A., 1998. Neural network model for parametric cost estimation of highway projects. J. Constr. Eng. Manage. 124 (3), 210–218. Hiyassat, M.A.S., 2001. Construction bid price evaluation. Can. J. Civ. Eng. 28 (2), 264–270. Ji, S.H., Park, M., et al., 2010. Data preprocessing-based parametric cost model for building projects: case studies of North Korean construction projects. J. Constr. Eng. Manage. 136 (8), 844–853. Kim, B.S., Hong, T., 2012. Revised case-based reasoning model development based on multiple regression analysis for railroad bridge construction. J. Constr. Eng. Manage. 138 (1), 154–162.
O. Swei et al. / Transportation Research Part B 101 (2017) 295–305
305
Lowe, D.J., Emsley, M.W., Harding, A., 2006. Predicting construction cost using multiple regression techniques. J. Constr. Eng. Manage. 132 (7), 750–758. Markow, M.J., 1995. Highway management systems: state of the art. J. Infrastruct. Syst. 1 (3), 186–191. McCaffer, R., McCaffrey, M.J., Thorpe, A., 1984. Predicting the tender price of buildings during early design: method and validation. J. Oper. Res. Soc. 35 (5), 415–424. Molenaar, K., 2005. Programmatic cost risk analysis for highway megaprojects. J. Constr. Eng. Manage. 131 (3), 343–353. Oman Systems, Inc., 2016. Bid tabs professional (Oct. 20, 2015). http://omanco.com. Sanders, S.R., Maxwell, R.R., Glagola, C.R., 1992. Preliminary estimating models for infrastructure projects. Cost Eng. 34 (8), 7–13. Shrestha, P.P., Pradhananga, N., 2010. Correlating bid price with the number of bidders and final construction cost of public street projects. Transp. Res. Rec. 2151, 3–10. Shrestha, P.P., Pradhananga, N., Mani, N., 2014. Correlating the quantity and bid cost of unit price items for public road projects. KSCE J. Civ. Eng. 18 (6), 1590–1598. Swei, O., Gregory, J., Kirchain, R., 2015. Probabilistic life-cycle cost analysis of pavements: drivers of variation and implications of context. Transp. Res. Rec. 2523, 47–55. Tighe, S., 2001. Guidelines for probabilistic pavement life cycle cost analysis. Transpor. Res. Rec. 1769, 28–38. Trost, S.M., Oberlender, G.D., 2003. Predicting accuracy of early cost estimates using factor analysis and multivariate regression. J. Constr. Eng. Manage. 129 (2), 198–204. Walls, J., Smith, M., 1998. Life-Cycle Cost Analysis in Pavement Design - Interim Technical Bulletin. FHWA, Washington, D.C. Williams, T.P., 2003. Predicting final cost for competitively bid construction projects using regression models. Int. J. Project Manage. 21 (8), 593–599. Williams, T.P., 2005. Bidding ratios to predict highway project costs. Eng. Constr. Archit. Manage. 12 (1), 38–51.