Cost estimation for general aviation aircrafts using regression models and variable importance in projection analysis


Journal Pre-proof

Xiaonan Chen, Jun Huang, Mingxu Yi

PII: S0959-6526(20)30695-8
DOI: https://doi.org/10.1016/j.jclepro.2020.120648
Reference: JCLP 120648
To appear in: Journal of Cleaner Production
Received Date: 24 September 2019
Revised Date: 6 February 2020
Accepted Date: 17 February 2020

Please cite this article as: Chen X, Huang J, Yi M, Cost estimation for general aviation aircrafts using regression models and variable importance in projection analysis, Journal of Cleaner Production (2020), doi: https://doi.org/10.1016/j.jclepro.2020.120648. This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2020 Published by Elsevier Ltd.

Authorship contribution statement

Xiaonan Chen: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing - original draft, Writing - review & editing.
Jun Huang: Methodology, Resources, Funding acquisition, Writing - review & editing.
Mingxu Yi: Methodology, Resources, Software, Funding acquisition, Writing - review & editing.

5183 words

Xiaonan Chen a; Jun Huang a,*; Mingxu Yi a

a School of Aeronautic Science and Technology, Beihang University, Beijing 100191, China.
* Corresponding author. Tel: +86 18810953285. E-mail address: [email protected]

Abstract: Accurately estimating the development cost of general aviation aircraft plays a key role in devising the best corporate strategy. However, studies on cost estimation for general aviation aircraft are limited. Development cost data for general aviation aircraft are commonly multi-collinear and the sample size is small, which may degrade the estimation performance of the developed model. To address this issue, a combination of the variable importance in projection (VIP) analysis method and regression models is proposed. The VIP analysis method, which is suited to strongly correlated data and small sample sizes, is utilized to select the most influential independent variables. The combined regression models tested using the selected variables include a partial least-squares regression (PLS) model and a back-propagation neural network (BPN) model. For comparison, PLS and BPN models are also established using the unfiltered raw variables. To verify the accuracy and feasibility of the proposed method, a case study utilizing general aviation aircraft cost data is presented. The results suggest that the VIP analysis method can be utilized for variable selection, and that the proposed PLS model combined with VIP analysis has better prediction accuracy than the pure regression models and the combined BPN model, with MSE, MMRE, and $R^2$ values of 1.07, 4.1%, and 98.32%, respectively. Therefore, the VIP analysis method combined with the regression model can be effectively used to estimate the preliminary cost of general aviation aircraft. More importantly, this work provides a feasible method to improve the accuracy of cost estimation for general aviation aircraft.

Keywords: General aviation aircraft; Cost estimation; Regression models; Variable importance in projection analysis; Back-propagation neural network; Partial least-squares regression

Abbreviations

VIP: variable importance in projection
PLS: partial least-squares
BPN: back-propagation neural network
ANN: artificial neural network
MSE: mean squared error
AE: absolute error
MMRE: mean magnitude of relative error
$R^2$: coefficient of determination
$M_W$: maximum take-off weight
$M_T$: maximum thrust
$M_O$: maximum oil load
$L_f$: length of the fuselage
$c_{act}$: actual development cost
$c_{pre}$: predicted development cost
$\bar{c}$: average of the actual cost
$n$: sample number
PLS1: the PLS model with unfiltered raw independent variables
PLS2: the PLS model with selected variables by VIP method
BPN1: the BPN model with unfiltered raw independent variables
BPN2: the BPN model with selected variables by VIP method

1. Introduction

The development and production of aircraft are characterized by technical difficulties, policy changes, long development cycles, and uncertain factors. Therefore, the cost of aircraft is considerable (Chen et al., 2019; Wang et al., 2014). Whether a company can successfully develop an aircraft depends on its financial viability and operations. As such, an effective estimation of the development cost is crucial. Cost estimation is an important strategic premise in product development; more importantly, as proposed by Calado et al. (2018), it affects corporate performance. It is therefore crucial to accurately estimate the development cost in the early stages of design; Yeh and Deng (2012) concluded that accurate cost prediction can facilitate decision-making. However, relevant cost estimation methods for general aviation aircraft are limited because of the small sample sizes due to short development times (Chen et al., 2019), lack of information on relevant parameters, and difficulties in data collection (Altavilla et al., 2018).

Regression models have been widely used to estimate the cost of aircraft since the 1960s. This method utilizes a certain number of independent variables (weight, size, performance, and time) that characterize the engineering system to form a cost estimation equation or model for cost prediction. Thus, the cost of an aircraft can be estimated with fewer independent variables, and a certain estimation accuracy can be achieved. Over the past few decades, many regression models have been proposed and utilized for preliminary cost estimation. However, the poor accuracy of the cost estimation methods is becoming an increasing problem, because a good cost estimation involves accurately predicting the cost of an aircraft on delivery (Tirovolis and Serghides, 2005). Currently, researchers have used partial least-squares (PLS) regression and back-propagation network (BPN) models to predict the cost of aircraft.
Cost prediction software programs developed for specific industries, such as SEER H, PRICE H, DAPCA, and COCOMO, are based on regression models. The PLS method is an analytical tool that combines principal component analysis, canonical correlation analysis of the relevant variables, and multiple linear regression. This method can be used when the number of samples is limited (Cabral and Dhar, 2019), when correlations between the variables are high (Wu et al., 2015), and when some data are missing (Yusoff et al., 2017). For example, Lockstrom and Lei (2013) used a PLS model to identify antecedents to supplier integration in China and showed that the proposed PLS model is effective and accurate when the number of samples is limited. Wu et al. (2015) investigated the effectiveness of a PLS method in overcoming cost estimation problems pertaining to complex equipment in the early design stage; however, the multi-collinearity of the variables caused an interpretation bias in the results, which affected the estimation performance.

In comparison, the BPN method has been more widely used for cost estimation, as it can better handle the nonlinear characteristics of cost data. Kourentzes (2013) employed an artificial BPN model to forecast intermittent time series. The results indicated that such intelligent estimation methods are useful and effective for intermittent demand applications. Chou et al. (2010) developed an artificial BPN model to forecast costs of thin-film transistor liquid-crystal display equipment. Günaydın and Doğan (2004) investigated the effectiveness of a BPN model in overcoming the cost prediction problems in the design process of a new building. However, Fang et al. (2018) and Zhang et al. (2019) showed that the estimation performance of the BPN model is poor when data are limited.

Because of the inherent drawbacks of the pure regression models when the sample size is small, the PLS and BPN models cannot always provide a good estimation result. Chong and Jun (2005) stated that, in order to solve prediction problems, researchers use only a few parameters that have the most critical impact on project cost. With only these parameters, it is much easier to solve the prediction problem. Careful screening of the independent variables has been proven to be effective in improving the estimation accuracy of such models. When choosing independent variables in practical applications, we should not omit important explanatory factors, and we should follow the principle of parsimony so that the independent variables are as few as possible.
This is because the model becomes complicated when there are too many independent variables, and multiple correlations between the variables are unavoidable, which often increases the estimation variance, reduces the accuracy of the model, and even invalidates the estimation method when the number of samples is small. The methods used to select independent variables mainly include correlation coefficient analysis, gray correlation analysis, and stepwise regression analysis. Although these methods are frequently utilized for variable selection because of their convenience and simplicity, the results are not accurate when multi-collinearity exists among variables or the sample size is small. For these reasons, variable importance in projection (VIP) analysis has emerged as a potential method for screening independent variables. However, the VIP analysis method has not been widely used. It is often utilized in food science (Mao et al., 2018), medical science (He et al., 2015), and chemical science (Lu et al., 2014), and few studies have utilized the VIP analysis method to deal with variable selection and prediction accuracy improvement problems for aircraft.

Because of the small sample size of general aviation aircraft and the prevalence of multi-collinearity among various design parameters, accurately predicting the development cost is particularly difficult. Thus, this study first applies the VIP analysis method to estimate the cost of general aviation aircraft. The main task of this research is to analyze whether the variables selected by the VIP analysis method are significant given the small sample size and multi-collinearity of variables, and to use the selected variables to develop an accurate cost estimation model for general aviation aircraft.

The remainder of this paper is organized as follows. Section 2 introduces two cost estimation methods: PLS and BPN. Section 3 introduces the VIP method and the modeling procedures involved in the combined VIP analysis and regression methods. Section 4 presents a case study utilizing general aviation aircraft data and compares the training performances of the different regression models. In Section 5, the results are summarized, and the advantages of the combined VIP analysis and regression methods are given. Section 6 summarizes the paper and discusses future work.

2. Cost Estimation Methods

2.1. Partial least-squares regression method

A typical PLS method combines multiple linear regression analysis, principal component analysis, and correlation analysis. Rather than regressing the dependent variable directly on the original independent variables, it extracts several new composite components that best explain the system and then uses these components for the regression analysis. The modeling steps involved in the PLS method are as follows.

We assume that the matrix $Y$ represents the dependent variable $y$ with $h$ samples, and that the independent variables $x_{ij}$ with $h$ samples are represented in the form of a matrix $X$:

$$X = [x_{ij}]_{h \times k}, \quad Y = [y_i]_{h \times 1}, \quad i = 1, 2, \ldots, h; \; j = 1, 2, \ldots, k \tag{1}$$

Step 1. $X$ and $Y$ are normalized to obtain the standardized matrices $S_0$ and $Z_0$.

$$S_0 = [x_{ij}^*]_{h \times k}, \; x_{ij}^* = \frac{x_{ij} - \bar{x}_j}{\tilde{x}_j}; \qquad Z_0 = [y_i^*]_{h \times 1}, \; y_i^* = \frac{y_i - \bar{y}}{\tilde{y}} \tag{2}$$

Here, $\bar{x}_j$ represents the average value of $x_j$, and $\tilde{x}_j$ denotes the standard deviation of $x_j$. $\bar{y}$ and $\tilde{y}$ represent the average value and standard deviation of $y$, respectively.

Step 2. The first component $t_1$ is extracted from $S_0$, which is expressed as follows.

$$t_1 = S_0 w_1, \quad w_1 = \frac{S_0^T Z_0}{\lVert S_0^T Z_0 \rVert}, \quad \lVert w_1 \rVert = 1 \tag{3}$$

The regression of $S_0$ and $Z_0$ is implemented on $t_1$, as follows.

$$S_0 = t_1 p_1^T + S_1, \quad Z_0 = t_1 f_1^T + Z_1, \quad p_1 = \frac{S_0^T t_1}{\lVert t_1 \rVert^2}, \quad f_1 = \frac{Z_0^T t_1}{\lVert t_1 \rVert^2} \tag{4}$$

Here, $p_1$ and $f_1$ are the regression coefficients, and the residual matrices ($S_1$ and $Z_1$) are expressed as follows.

$$S_1 = S_0 - t_1 p_1^T, \quad Z_1 = Z_0 - t_1 f_1^T \tag{5}$$

Step 3. Whether the regression has met the accuracy requirements is analyzed. If so, we proceed to the next step; otherwise, we set $S_0 = S_1$ and $Z_0 = Z_1$ and repeat the above steps in the same way.

Step 4. If the regression meets the accuracy requirements and $g$ components $(t_1, t_2, \ldots, t_g)$ have been obtained, the regression equation can be derived:

$$\hat{Z}_0 = f_1 t_1 + f_2 t_2 + \cdots + f_g t_g \tag{6}$$

Because each component $t_i$ is a linear transformation of $S_0$, Eq. (6) can also be written as follows.

$$\hat{Z}_0 = f_1 S_0 w_1^* + f_2 S_0 w_2^* + \cdots + f_g S_0 w_g^* \tag{7}$$

Here, $w_i^* = \prod_{j=1}^{i-1} (I - w_j p_j^T)\, w_i$ $(i = 1, 2, 3, \ldots, g)$, and $I$ is a unit matrix. Finally, we obtain

$$\hat{z}_j^* = \chi_1 s_{j1}^* + \chi_2 s_{j2}^* + \cdots + \chi_k s_{jk}^*, \quad \chi_j = \sum_{i=1}^{g} f_i w_{ij}^* \tag{8}$$

Here, the regression coefficient of $s_j^*$ is $\chi_j$, and $w_{ij}^*$ represents the $j$th element of $w_i^*$.

Step 5. According to the standardized inverse process, Eq. (8) can be converted into a regression equation with original parameters, as follows.

$$\hat{y}_j = a_1 x_{j1} + a_2 x_{j2} + \cdots + a_k x_{jk} \tag{9}$$
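The steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in the paper (which reports using SIMCA-P): the function name `pls1_fit`, the fixed `n_components` argument (standing in for the accuracy check of Step 3), and the synthetic demo data are all assumptions. The sketch also returns the intercept that results from undoing the standardization in Step 5.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """Single-response PLS following Steps 1-5 (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    k = X.shape[1]

    # Step 1: standardize X and y to obtain S0 and Z0 (Eq. 2).
    x_mean, x_std = X.mean(axis=0), X.std(axis=0, ddof=1)
    y_mean, y_std = y.mean(), y.std(ddof=1)
    S = (X - x_mean) / x_std
    Z = (y - y_mean) / y_std

    chi = np.zeros(k)    # coefficients on the standardized variables (Eq. 8)
    deflate = np.eye(k)  # running product of (I - w_j p_j^T) from Eq. (7)
    for _ in range(n_components):
        # Step 2: weight vector, component, and regression coefficients (Eqs. 3-4).
        w = S.T @ Z
        w /= np.linalg.norm(w)
        t = S @ w
        tt = t @ t
        p = S.T @ t / tt
        f = (Z @ t) / tt
        # chi accumulates f_i * w_i^*, with w_i^* = prod_{j<i}(I - w_j p_j^T) w_i.
        chi += f * (deflate @ w)
        deflate = deflate @ (np.eye(k) - np.outer(w, p))
        # Step 3: continue with the residual matrices (Eq. 5).
        S = S - np.outer(t, p)
        Z = Z - f * t

    # Step 5: undo the standardization to get coefficients on the raw variables.
    coef = chi * y_std / x_std
    intercept = y_mean - x_mean @ coef
    return intercept, coef


# Synthetic demo data (hypothetical, not from the paper).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(14, 3))
y_demo = 2.0 + X_demo @ np.array([1.5, -0.7, 0.3]) + 0.05 * rng.normal(size=14)
intercept_demo, coef_demo = pls1_fit(X_demo, y_demo, n_components=3)
print(intercept_demo, coef_demo)
```

When all $k$ components are extracted from a full-rank $X$, the result coincides with ordinary least squares, which gives a convenient sanity check; using fewer components is what provides the regularizing effect on multi-collinear data.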

2.2. Back-propagation neural network method

The artificial neural network (ANN) is among the most recognized approaches in machine learning (Shin et al., 2019). An ANN is an abstract mathematical model composed of many artificial neurons. It imitates the human brain nervous system in structure and function and is capable of self-learning, self-adaptation, and non-linear processing. The BPN approach is the most popular in this area. In this method, the target values are first compared with the output values, and the steepest gradient descent technique is then utilized to minimize the error. A typical BPN model includes output and input layers; furthermore, at least one hidden layer should be included between the output and input layers. Sodikov (2005) suggested that the minimum number of hidden nodes should be half of the total number of output and input nodes. Fig. 1 shows the typical structure of a BPN model.

Fig. 1. Architecture of the back-propagation neural network (BPN), comprising an input layer, a hidden layer, and an output layer

The BPN is trained utilizing the back-propagation training algorithm (Yeh and Deng, 2012). A detailed description of the BPN is provided below:

We assume that the training samples are denoted by $x_1, x_2, \ldots, x_p$. $\hat{y}_p$ denotes the real output value, and $y_p$ represents the expected value. We assume that the input, output, and hidden layers of the BPN model have $b$, $e$, and $d$ nodes, respectively.

Step 1. The output of the hidden layer is denoted as follows.

$$m_k = c_1\left(\sum_{i=0}^{b} v_{ki} x_i\right), \quad k = 1, 2, 3, \ldots, d \tag{10}$$

Here, $c_1$ represents the transfer function of the hidden layer, and $v_{ki}$ denotes the weights between the hidden and input layers. Furthermore, each node in the output layer produces an output $\hat{y}_p$, which can be expressed as follows.

$$\hat{y}_p = c_2\left(\sum_{k=0}^{d} w_{jk} m_k\right), \quad j = 1, 2, 3, \ldots, e \tag{11}$$

Here, $c_2$ denotes the transfer function of the output layer, and $w_{jk}$ denotes the weights between the output and hidden layers.

Step 2. The error value of the $p$th sample can be expressed as follows.

$$E_p = \frac{1}{2} \sum_{j=1}^{e} (y_p - \hat{y}_p)^2 \tag{12}$$

The global error $E$ is expressed as follows.

$$E = \frac{1}{2} \sum_{p=1}^{p} \sum_{j=1}^{e} (y_p - \hat{y}_p)^2 = \sum_{p=1}^{p} E_p \tag{13}$$

Step 3. An accumulative error algorithm is employed to revise the value of $w_{jk}$, so as to reduce the global error. Thus,

$$\Delta w_{jk} = -\eta \frac{\partial E}{\partial w_{jk}} = -\eta \frac{\partial}{\partial w_{jk}} \left(\sum_{p=1}^{p} E_p\right) = \sum_{p=1}^{p} \left(-\eta \frac{\partial E_p}{\partial w_{jk}}\right) \tag{14}$$

Here, the learning rate of the BPN model is denoted by $\eta$. Eq. (15) is used to calculate the change in $v_{ki}$.

$$\Delta v_{ki} = -\eta \frac{\partial E}{\partial v_{ki}} = \sum_{p=1}^{p} \left(-\eta \frac{\partial E_p}{\partial v_{ki}}\right) \tag{15}$$
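A minimal NumPy sketch of the accumulative (batch) update in Eqs. (10)–(15) is given below for a single output node. It is not the authors' implementation: the function name `train_bpn`, the random initialization, the tanh hidden layer with a linear output, and the convention $x_0 = m_0 = 1$ for the bias terms are assumptions made for illustration, and the demo data are synthetic.

```python
import numpy as np


def train_bpn(X, y, n_hidden, lr=0.01, epochs=500, seed=0):
    """Batch back-propagation: one tanh hidden layer, one linear output node."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    n, b = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])               # x_0 = 1 supplies the bias
    V = rng.normal(scale=0.5, size=(n_hidden, b + 1))  # weights v_ki
    W = rng.normal(scale=0.5, size=n_hidden + 1)       # weights w_jk (e = 1)
    history = []
    for _ in range(epochs):
        M = np.tanh(Xb @ V.T)                          # hidden outputs, Eq. (10)
        Mb = np.hstack([np.ones((n, 1)), M])           # m_0 = 1 supplies the bias
        y_hat = Mb @ W                                 # network output, Eq. (11)
        err = y_hat - y
        history.append(0.5 * np.sum(err ** 2))         # global error E, Eq. (13)
        grad_W = Mb.T @ err                            # dE/dw summed over samples
        delta = np.outer(err, W[1:]) * (1.0 - M ** 2)  # back-propagated error
        grad_V = delta.T @ Xb                          # dE/dv summed over samples
        W -= lr * grad_W                               # Eq. (14)
        V -= lr * grad_V                               # Eq. (15)
    return V, W, history


# Synthetic demo data (hypothetical, not from the paper).
rng = np.random.default_rng(1)
X_demo = rng.uniform(-1, 1, size=(14, 3))
y_demo = 0.5 * X_demo[:, 0] - 0.3 * X_demo[:, 1] + 0.2 * X_demo[:, 2]
V_demo, W_demo, history_demo = train_bpn(X_demo, y_demo, n_hidden=2, lr=0.005)
print(history_demo[0], history_demo[-1])
```

Because the gradients are summed over all samples before each update, the effective step grows with the sample count, so the learning rate usually needs to be small and the inputs scaled to a comparable range.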

3. Supplementary Analysis Methods

3.1. Variable importance in projection

The explanatory ability of the independent variables with respect to the dependent variable can be obtained using the traditional stepwise regression method, gray correlation analysis method, or correlation analysis method. However, these methods may not work well when several variables are multi-collinear. In such cases, some variables that have a negative impact on cost prediction are retained in the model, which can lead to greater estimation errors. For strongly correlated data and small sample sizes, the VIP method can be used to screen independent variables appropriately and accurately. He et al. (2015) proposed that the VIP value reflects not only the importance of the independent variables to the model, but also how well the dependent variable is expressed. The VIP value of the $j$th independent variable with respect to the dependent variable can be expressed as follows.

$$VIP_j = \sqrt{\frac{k}{Rd(Y; t_1, \ldots, t_h)} \sum_{i=1}^{h} Rd(Y; t_i)\, w_{ij}^2} \tag{16}$$

Here, $k$ represents the number of independent variables, $w_{ij}$ denotes the weight of the $j$th variable in the $i$th component, $h$ stands for the total number of components, $Y$ denotes the dependent variable, $Rd(Y; t_1, \ldots, t_h)$ is the interpretation ability of $t_1, t_2, \ldots, t_h$ with respect to $Y$, and $Rd(Y; t_i)$ represents the explanatory ability of $t_i$ with respect to $Y$.

$$Rd(Y; t_i) = r^2(Y; t_i), \quad Rd(Y; t_1, \ldots, t_h) = \sum_{i=1}^{h} Rd(Y; t_i) \tag{17}$$

Here, the correlation coefficient between the principal component $t_i$ and the dependent variable $Y$ is represented by $r(Y; t_i)$.

As the explanatory ability of $x_j$ with respect to $Y$ is conveyed through $t_i$, the stronger the explanatory ability of $t_i$ with respect to $Y$, the greater the interpretation ability of $x_j$ with respect to $Y$, and the higher the $VIP_j$ value. The VIP value represents the influence of independent variables on model fitting (He et al., 2015). Therefore, if all independent variables have the same ability to explain $Y$, the $VIP_j$ values are all equal to 1; otherwise, the lower the $VIP_j$ value, the poorer the explanatory ability of that independent variable with respect to the dependent variable. A VIP value less than 1 indicates an unimportant variable, which may be eliminated. However, it is not advisable to simply delete all independent variables with VIP values less than 1. Instead, the independent variable with the minimum VIP value should be discarded. If the prediction model improves, the procedure can be repeated on the reduced set of variables until no further improvement occurs. Thus, the VIP method can be used to select the independent variables.
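As a rough illustration of Eqs. (16) and (17), the sketch below computes VIP values from the component scores and weight vectors of an already fitted PLS model. The function name `vip_scores`, the array layout (scores `T` as samples by components, weights `W` as variables by components), and the random demo arrays are assumptions; each weight vector is rescaled to unit length, as in Eq. (3).

```python
import numpy as np


def vip_scores(T, W, y):
    """VIP per variable from PLS scores T (n x h) and weights W (k x h)."""
    T = np.asarray(T, dtype=float)
    W = np.asarray(W, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    k, h = W.shape
    # Rd(Y; t_i) = r^2(Y; t_i), Eq. (17)
    rd = np.array([np.corrcoef(T[:, i], y)[0, 1] ** 2 for i in range(h)])
    # Squared weights w_ij^2, with each component's weight vector at unit length
    w_sq = (W / np.linalg.norm(W, axis=0)) ** 2
    # Eq. (16): weighted share of explained variation, scaled by k
    return np.sqrt(k * (w_sq @ rd) / rd.sum())


# Hypothetical demo arrays standing in for a fitted two-component model.
rng = np.random.default_rng(0)
T_demo = rng.normal(size=(14, 2))
W_demo = rng.normal(size=(4, 2))
y_demo = T_demo @ np.array([1.0, 0.4]) + 0.1 * rng.normal(size=14)
vip_demo = vip_scores(T_demo, W_demo, y_demo)
print(vip_demo, "candidate to drop:", int(np.argmin(vip_demo)))
```

With unit-length weight vectors the squared VIP values sum to $k$, so equal importance across all variables gives a VIP of exactly 1 for each, which matches the interpretation given in the text.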

3.2. Combined VIP and regression methods modeling procedures

To overcome the low prediction accuracy of the existing models in the case of small samples, we combine the VIP analysis with the regression models. The fundamental reason for this combination is to select the most influential variables, so as to ensure the stability, accuracy, and effectiveness of the prediction models. The modeling steps are as follows:

Step 1. The VIP method is used to screen the independent variables, and the variables that have less influence on the dependent variable are deleted. The data samples are divided into training and testing groups.

Step 2. Based on the unfiltered raw independent variables, the PLS1 and BPN1 models are established using the training samples.

Step 3. Based on the independent variables selected using the VIP method, the PLS2 and BPN2 models are established using the training samples.

Step 4. Finally, the applicability of the VIP analysis and the best prediction model can be verified by analyzing the fitting and estimation results.

The modeling method used in this study is shown in Fig. 2.

Fig. 2. Adopted methodology (modeling diagram)

4. Case study: Development Cost Estimation

4.1. Description of parameters

A general aviation aircraft is typically complex, involving hundreds of design processes. Hence, there are hundreds of variables that affect the cost of the aircraft. According to Liu et al. (2014), four independent variables have been utilized as the main variables in the literature. Data on 17 general aviation aircraft from the open literature (Chen et al., 2019; Liu et al., 2014) were collected. Table 1 lists the related parameters and collected development costs. Fourteen samples were randomly selected as training samples, while the remaining three samples were utilized for testing.

Table 1 Design parameters and experimental data of different general aviation aircraft

Sample   c_act (10^4 RMB)   M_W (t)    M_T (kips)   M_O (L)   L_f (ft)
1        32                 65.09      26.3         17800     102
2        65.2               85.13      27.3         29660     138
3        200.2              395        63.3         216840    211
4        174.6              230        77.2         117340    208
5        122.2              163        53.1         120000    186
6        144.7              257        72           139100    192
7        261.9              560        70           310000    239
8        61.5               73.5       27           29680     123
9        172.4              257        34           139100    208
10       60.57              77.791     27.3         26020     119
11       115.705            181.89     63.3         90916     165
12       206.409            297.55     115.3        181280    222
13       58.807             83         33           23700     114
14       89.494             150        59           68150     140
15       30.282             33         13.79        11146     98
16       38.808             50.3       18.5         16629     109
17       157.402            247        71           138898    206

Table 1 (Continued)

Design parameter   Description               Unit       Range
M_W                Maximum take-off weight   t          33–560
M_T                Maximum thrust            kips       13.79–115.3
M_O                Maximum oil load          L          11146–310000
L_f                Length of the fuselage    ft         98–239
c_act              Actual development cost   10^4 RMB   30.282–261.9

Notes: 1 t = 10^3 kg; 1 kips = 453.59237 kg; 1 ft = 0.3048 m.

Fig. 3 shows the main independent variables influencing the development cost.

Fig. 3. Main parameters ($M_W$: maximum take-off weight; $M_T$: maximum thrust; $M_O$: maximum oil load; $L_f$: length of the fuselage) influencing the development cost of a general aviation aircraft

4.2. Estimation performance criteria

In this study, four accuracy measures, namely the mean squared error (MSE), absolute error (AE), mean magnitude of relative error (MMRE), and coefficient of determination ($R^2$), were used as the estimation performance criteria. The AE, MSE, MMRE, and $R^2$ are expressed as follows:

$$AE = \lvert c_{act} - c_{pre} \rvert \tag{18}$$

$$MSE = \frac{1}{n} \sqrt{\sum_{i=1}^{n} (c_{act} - c_{pre})^2} \tag{19}$$

$$MMRE = \frac{1}{n} \sum_{i=1}^{n} \frac{\lvert c_{act} - c_{pre} \rvert}{c_{act}} \times 100\% \tag{20}$$

$$R^2 = \frac{\sum_{i=1}^{n} (c_{pre} - \bar{c})^2}{\sum_{i=1}^{n} (c_{act} - \bar{c})^2} \times 100\% \tag{21}$$

Here, $c_{act}$ denotes the actual cost, $c_{pre}$ denotes the predicted cost, $\bar{c}$ stands for the average of the actual cost, and $n$ represents the sample number.

In the AE, MSE, and MMRE measurements, a lower value indicates a better estimation performance. $R^2$ represents the matching degree between the estimated and actual costs. The more precise the model, the closer the value of $R^2$ is to 1.
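The four criteria can be written as short NumPy functions, as sketched below. The function names and the two-sample demo values are hypothetical. Note one assumption in particular: Eq. (19) is taken here as $\frac{1}{n}$ times the square root of the summed squared errors, which differs from the textbook mean squared error.

```python
import numpy as np


def absolute_error(c_act, c_pre):
    """Eq. (18): absolute error per sample."""
    return np.abs(np.asarray(c_act, dtype=float) - np.asarray(c_pre, dtype=float))


def mse(c_act, c_pre):
    """Eq. (19) as assumed here: (1/n) * sqrt(sum of squared errors)."""
    ae = absolute_error(c_act, c_pre)
    return float(np.sqrt(np.sum(ae ** 2)) / ae.size)


def mmre(c_act, c_pre):
    """Eq. (20): mean magnitude of relative error, in percent."""
    c_act = np.asarray(c_act, dtype=float)
    return float(100.0 * np.mean(absolute_error(c_act, c_pre) / c_act))


def r_squared(c_act, c_pre):
    """Eq. (21): spread of predictions vs. actual costs about the actual mean, in percent."""
    c_act = np.asarray(c_act, dtype=float)
    c_pre = np.asarray(c_pre, dtype=float)
    c_bar = c_act.mean()
    return float(100.0 * np.sum((c_pre - c_bar) ** 2) / np.sum((c_act - c_bar) ** 2))


# Hypothetical two-sample example (not data from the paper).
actual = [100.0, 200.0]
predicted = [110.0, 190.0]
print(absolute_error(actual, predicted))  # [10. 10.]
print(round(mse(actual, predicted), 4))   # 7.0711
print(mmre(actual, predicted))            # 7.5
print(r_squared(actual, predicted))       # 64.0
```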

4.3. Development cost modeling

With the confidence level set to 95%, the training samples with a total of four raw independent variables were subjected to a PLS analysis in a SIMCA-P+12.0 environment. As shown in Fig. 4, the training sample points lie within the 95% confidence interval, confirming that the training samples can be utilized to establish the PLS1 model.

Fig. 4. Judgement ellipse of fighter observations

The linear regression model with the four unfiltered raw independent variables is denoted PLS1, and the predicted cost ($c_{pre}$) is utilized as the dependent variable. Thus, the regression equation (PLS1) can be expressed as follows.

$$c_{pre} = -27.79 + 0.135 M_W + 0.579 M_T + 0.000152 M_O + 0.418 L_f \tag{22}$$

Fig. 5 shows the comparison between $c_{act}$ and $c_{pre}$ utilizing the PLS1 model. The results indicate that PLS1 can be used to make a preliminary prediction.

Fig. 5. Comparison between actual cost (red dot) and PLS1 predicted cost (black line) based on training data (x-axis: sample, 1–14; y-axis: cost, 10^4 RMB)

Through the VIP analysis, the VIP values of $M_W$, $M_T$, $M_O$, and $L_f$ were found to be 1.03, 0.71, 1.05, and 1.06, respectively. $M_T$ has the lowest VIP value, suggesting that it contributes the least to the model and should be eliminated. After the VIP analysis, the remaining independent variables $M_W$, $M_O$, and $L_f$ were used to establish the PLS2 model. The resulting model is as follows:

$$c_{pre} = -26.382 + 0.164 M_W + 0.0002777 M_O + 0.506 L_f \tag{23}$$

Fig. 6 shows the VIP values of $M_W$, $M_O$, and $L_f$ (0.99, 1.00, and 1.01, respectively). The VIP values in the model are relatively balanced at this point, and no independent variable has a significantly lower VIP value.

Fig. 6. Variable importance in projection (VIP) values of independent variables ($M_W$, $M_O$, and $L_f$)
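As a quick consistency check, Eqs. (22) and (23) can be evaluated on the three testing samples (samples 15–17 in Table 1). The sketch below uses the coefficients exactly as printed; the function and variable names are illustrative only.

```python
import numpy as np

# Testing samples 15-17 from Table 1.
# Columns: M_W (t), M_T (kips), M_O (L), L_f (ft).
X_test = np.array([
    [33.0, 13.79, 11146.0, 98.0],
    [50.3, 18.5, 16629.0, 109.0],
    [247.0, 71.0, 138898.0, 206.0],
])


def pls1_cost(x):
    """Eq. (22): PLS1 with all four raw variables (cost in 10^4 RMB)."""
    m_w, m_t, m_o, l_f = x.T
    return -27.79 + 0.135 * m_w + 0.579 * m_t + 0.000152 * m_o + 0.418 * l_f


def pls2_cost(x):
    """Eq. (23): PLS2 after dropping M_T (cost in 10^4 RMB)."""
    m_w, _, m_o, l_f = x.T
    return -26.382 + 0.164 * m_w + 0.0002777 * m_o + 0.506 * l_f


pls1_pred = pls1_cost(X_test)
pls2_pred = pls2_cost(X_test)
print(np.round(pls1_pred, 2))
print(np.round(pls2_pred, 2))
```

Up to rounding of the printed coefficients, these values agree with the PLS1 and PLS2 predictions reported for the testing samples in Table 2.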

The absolute errors (AE) are calculated to show the training effect of these two PLS models. Fig. 7 shows the absolute errors between the predicted and actual costs.

Fig. 7. Absolute errors (AE) of PLS1 (yellow color) and PLS2 (green color) with respect to experimental data (radar chart over training samples 1–14; the radial axis represents the absolute error, 0–40, in 10^4 RMB)

Fig. 7 indicates that the absolute error of PLS2 is lower than that of PLS1, indicating that PLS2 shows a better training performance. Therefore, the training accuracy of the PLS model is improved by combining it with the VIP analysis.

Based on the number of independent variables, two BPN models are established as an alternative to the linear regression models developed. The four independent variables ($M_W$, $M_T$, $M_O$, $L_f$) are included in the input layer to build the BPN1 model, and the predicted development cost is included in the output layer. According to the selection principle for hidden layer units introduced by Sodikov (2005), the number of hidden nodes is set to 3. The BPN2 model is built utilizing the three independent variables ($M_W$, $M_O$, $L_f$) retained by the VIP analysis, and the number of hidden nodes is set to 2. For these two models, tan-sigmoid and purelin are utilized as the transfer functions of the hidden and output layers, respectively. Furthermore, the target error and learning rate are taken as 0.001 and 0.01, respectively.

The BPN model involves gradually reducing the error between the actual and expected outputs. Therefore, it represents the input–output mapping by minimizing the MSE values obtained for the training samples. Trans et al. (2019) stated that MSE is the criterion that determines whether network training should be stopped. Günaydın and Doğan (2004) suggested that the MSE is an excellent indicator of a successful training operation and is used in the training process to evaluate the performance of the BPN models. One pass over the training samples is called an epoch. When the MSE remains constant within a given number of epochs, the training should be stopped immediately to ensure that the BPN model is not overtrained. Figs. 8 and 9 show the training curves of BPN1 and BPN2, respectively.

Fig. 8. Training curve of BPN1 (best training performance is 0.000992 at epoch 455). Axes: epoch vs. mean squared error (MSE, logarithmic scale), with the training curve and the goal line; epoch limit 500, minimum MSE 0.001, final MSE 0.001

Fig. 9. Training curve of BPN2 (best training performance is 0.000998 at epoch 285). Axes: epoch vs. mean squared error (MSE, logarithmic scale), with the training curve and the goal line; epoch limit 500, minimum MSE 0.001, final MSE 0.001

The above two figures show that 500 epochs are enough to train the models BPN1 and BPN2. The training curve of BPN2 is smoother, and BPN2 converges faster than BPN1. This shows that the training speed of the BPN model is improved when it is combined with the VIP analysis; in this sense, the VIP analysis helps optimize the BPN model.
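The two configuration rules used for the BPN models, the hidden-node count and the stopping criterion, can be sketched as small helper functions. Both helpers are hypothetical illustrations: rounding the Sodikov (2005) rule upward is an assumption chosen because it reproduces the 3 and 2 hidden nodes reported above, and the `patience` and `tol` parameters of the plateau check are placeholders rather than values from the paper.

```python
import math


def hidden_nodes(n_inputs, n_outputs):
    """Half of the total number of input and output nodes, rounded up (assumed)."""
    return math.ceil((n_inputs + n_outputs) / 2)


def should_stop(mse_history, goal=0.001, patience=20, tol=1e-8):
    """Stop when the goal is reached or the MSE has stayed flat for `patience` epochs."""
    if not mse_history:
        return False
    if mse_history[-1] <= goal:
        return True
    if len(mse_history) > patience:
        recent = mse_history[-(patience + 1):]
        return max(recent) - min(recent) < tol
    return False


print(hidden_nodes(4, 1))                 # BPN1: four inputs, one output
print(hidden_nodes(3, 1))                 # BPN2: three inputs, one output
print(should_stop([1.0, 0.5, 0.000992]))  # goal of 0.001 reached
```

Checking the plateau over a window of epochs, rather than comparing only the last two values, avoids stopping on a single flat step in an otherwise decreasing curve.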

5. Comparison and Verification

The most suitable regression model can be selected by comparing the prediction performances of PLS1, PLS2, BPN1, and BPN2, which also allows the effectiveness of combining the VIP and regression methods to be confirmed. Table 2 lists the predicted and actual cost values of the testing samples obtained using the different regression models.

Table 2 Actual and predicted values of testing samples

Num   c_act (10^4 RMB)   PLS1     PLS2     BPN1     BPN2
1     30.282             27.31    31.71    32.55    31.79
2     38.808             37.80    41.64    50.52    46.71
3     157.402            153.88   156.93   167.98   156.44

As listed in Table 2, each regression model plays a preliminary role in cost estimation. In the testing phase, the prediction performances (MSE, MMRE, and $R^2$) are as follows: PLS1 (1.57, 4.88%, 97.46%); PLS2 (1.07, 4.1%, 98.32%); BPN1 (5.31, 14.80%, 94.85%); and BPN2 (2.70, 8.65%, 95.03%). Table 3 lists the prediction results of the four regression models.

Table 3 Estimation results of different models

Prediction model   Number of independent variables   MSE    MMRE (%)   R^2 (%)
PLS1               4                                 1.57   4.88       97.46
PLS2               3                                 1.07   4.10       98.32
BPN1               4                                 5.31   14.80      94.85
BPN2               3                                 2.70   8.65       95.03
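The MSE and MMRE columns of Table 3 can be recomputed from the testing-sample values in Table 2, as sketched below. The MSE here is taken as $\frac{1}{n}$ times the square root of the summed squared errors, which is an interpretation of Eq. (19) rather than the textbook mean squared error. Only MSE and MMRE are checked; $R^2$ is not recomputed in this sketch.

```python
import numpy as np

# Actual costs and model predictions for the three testing samples (Table 2).
c_act = np.array([30.282, 38.808, 157.402])
predictions = {
    "PLS1": np.array([27.31, 37.80, 153.88]),
    "PLS2": np.array([31.71, 41.64, 156.93]),
    "BPN1": np.array([32.55, 50.52, 167.98]),
    "BPN2": np.array([31.79, 46.71, 156.44]),
}

results = {}
for name, c_pre in predictions.items():
    err = c_act - c_pre
    mse = np.sqrt(np.sum(err ** 2)) / err.size   # Eq. (19), as interpreted here
    mmre = 100.0 * np.mean(np.abs(err) / c_act)  # Eq. (20), in percent
    results[name] = (float(mse), float(mmre))
    print(f"{name}: MSE = {mse:.2f}, MMRE = {mmre:.2f}%")
```

Under this interpretation, the recomputed values round to the MSE and MMRE entries listed in Table 3 for all four models.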

From Table 3, we find that all four models (PLS1, PLS2, BPN1, and BPN2) show good prediction performance: the MSE and MMRE values are relatively low, and the $R^2$ of each model is greater than 90%. The MSE value of PLS2 is the lowest, at 1.07, which implies that its predicted costs are highly reliable. The MMRE value of PLS2, 4.1%, is also lower than that of the other models, and its $R^2$ value, 98.32%, is the closest to 1 and thus better than those of PLS1, BPN1, and BPN2. In order to compare the estimation effects of each model more directly, the prediction performances for the various modeling techniques are shown in Fig. 10.

20

5183 words 25

16

14.8

MSE

MMRE(%) 14

20 12

r or re 15 de ra u sq na10 e M

10 8.65 8

6

4.88 4.1 5.31

4

5

) % (r or re e vi ta le r f o e d ut in ga m na e M

2.7 2

1.57

1.07

0

0 OREG PLS1

PLS PLS2

BPNs2 BPN2

BPNs1 BPN1

Modeling technique

(a) Mean squared error (green color) and mean magnitude of relative error (red color) for PLS1 ,

PLS2 , BPN1 , and BPN 2 based on test data 100

Coefficient of determination(%) ) 99 % ( n oi t a 98 n i m re te 97 d f o t 96 n iec if fe 95 o C

98.32 97.46

95.03 94.85

94 1PLS1

2 2 PLS

3 1 BPN

4 BPN 2

Modeling Technique (b) Coefficient of determination for PLS1 , PLS2 , BPN1 , and BPN 2 based on test data

Fig. 10. Prediction performances for various modeling techniques

As shown in Fig. 10, a model with a lower MSE also has a lower MMRE and a higher R², indicating better prediction performance. The models with three independent variables (PLS2 and BPN2) predict better than the models with the four original independent variables (PLS1 and BPN1). In other words, the regression models combined with VIP analysis are more accurate than the pure regression models. Fig. 10 also shows that PLS2 outperforms BPN2 on all three metrics (MSE 1.07 vs. 2.70, MMRE 4.10% vs. 8.65%, R² 98.32% vs. 95.03%), which indicates that the BPN model is not the best choice for prediction in this case. PLS2 is therefore selected as the most appropriate regression model.

Taken together, Fig. 10 and Tables 2 and 3 show that VIP analysis can be applied to select the independent variables, and that the combined prediction method outperforms the pure prediction methods. The proposed PLS model combined with VIP analysis (PLS2) has the best estimation accuracy.
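To make the variable-selection step concrete, the following is a minimal sketch of how VIP scores can be computed for a single-response PLS model. It uses a plain NIPALS-style PLS1 fit written with NumPy and the commonly used VIP formula. It is not the authors' implementation: the function name, the synthetic data, and the cutoff of 1.0 (a widely used rule of thumb) are all assumptions for illustration.

```python
import numpy as np


def pls1_vip(X, y, n_components=2):
    """Fit PLS1 with a NIPALS-style loop and return one VIP score per variable."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Standardize predictors and response, as is usual before PLS.
    Xr = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    yr = (y - y.mean()) / y.std(ddof=1)
    n_vars = Xr.shape[1]
    W = np.zeros((n_vars, n_components))  # unit-norm weight vectors
    explained = np.zeros(n_components)    # y-variance captured per component
    for a in range(n_components):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)
        t = Xr @ w                        # score vector
        tt = t @ t
        q = (yr @ t) / tt                 # y loading
        p = (Xr.T @ t) / tt               # X loading
        Xr = Xr - np.outer(t, p)          # deflate X
        yr = yr - q * t                   # deflate y
        W[:, a] = w
        explained[a] = q ** 2 * tt
    # VIP_j = sqrt(p * sum_a(SS_a * w_ja^2) / sum_a(SS_a))
    return np.sqrt(n_vars * (W ** 2 @ explained) / explained.sum())


# Synthetic demonstration data (not from the paper): y depends mainly on x0.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 4))
y_demo = 3.0 * X_demo[:, 0] + 0.1 * rng.normal(size=30)

vip = pls1_vip(X_demo, y_demo, n_components=2)
keep = vip >= 1.0  # assumed cutoff; variables below it are candidates to drop
print(np.round(vip, 3), keep)
```

Because each weight vector has unit norm, the squared VIP scores sum to the number of variables, so a score of 1 corresponds to average importance. That property is what makes a cutoff near 1 a natural default.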

6. Conclusion

Accurate development cost estimation for general aviation aircraft plays a key role in strategic planning: it helps companies set a reasonable selling price for their products and increase their market competitiveness. However, conventional pure regression models are limited in that they cannot estimate the cost with reasonable accuracy. This work combines regression models with the VIP analysis method to address the problems encountered in cost estimation. Compared with the pure regression models (PLS1 and BPN1), the combined regression models have better estimation accuracy, with higher R² and lower MSE and MMRE values. We also find that the PLS model has a predictive advantage over the BPN model when the sample size is small. The results indicate that the VIP analysis method can be used for variable selection when the independent variables are multicollinear and the sample size is small, and that the combined PLS2 model can be used to obtain relatively accurate cost estimates for general aviation aircraft.

As part of future research, a database of general aviation aircraft costs will be gradually established and refined, and the validity of the method proposed in this study will be verified against actual industrial cost data. In addition, other variable selection methods should be combined with the regression models to allow further comparative studies. Finally, with the spread of all-carbon-fiber electric general aviation aircraft, energy consumption patterns are changing, and society is paying more attention to environmental protection while pursuing economic growth. We plan to build a faster and more flexible cost estimation method to help companies enhance their viability and contribute to building an environmentally friendly society.

Data Availability

All data, models, and code generated or used during the study appear in the submitted article.

Acknowledgements

This work was supported by the National Postdoctoral Program for Innovative Talents, Postdoctoral Science Foundation of China (No. 2017M610740). The authors also acknowledge the support of Hefei General Aviation Research Institute, Beihang University.


Highlights

1. This work addresses the application of VIP analysis to data with a small sample size and strong correlation.
2. The findings support a lower prediction error for the PLS model than for the BPN model on general aviation aircraft cost data.
3. The combined regression model proves effective for the cost estimation of general aviation aircraft.
4. Accurate cost estimation plays a key role in devising the best strategy for companies.

Declaration of interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.