Comparison of the decision tree, artificial neural network, and linear regression methods based on the number and types of independent variables and sample size

Expert Systems with Applications 34 (2008) 1227–1234
www.elsevier.com/locate/eswa

Yong Soo Kim *

CI Division, SK telecom, 11, Euljiro 2-ga, Jung-gu, Seoul, 100-999, Republic of Korea

Abstract

In this article, the performance of data mining and statistical techniques was compared empirically while varying the number of independent variables, the types of independent variables, the number of classes of the categorical independent variables, and the sample size. Our study employed 60 simulated examples, with artificial neural networks and decision trees as the data mining techniques and linear regression as the statistical method. Using the RMSE value as the performance metric, we obtained the following findings: (i) for continuous independent variables, the statistical technique (i.e., linear regression) was superior to the data mining techniques (i.e., decision tree and artificial neural network) regardless of the number of variables and the sample size; (ii) for continuous and categorical independent variables, linear regression was best when the number of categorical variables was one, while the artificial neural network was superior when the number of categorical variables was two or more; (iii) the artificial neural network's performance improved faster than that of the other methods as the number of classes of the categorical variables increased.
© 2006 Elsevier Ltd. All rights reserved.

Keywords: Data mining; Statistical method; Artificial neural network; Decision tree; Linear regression

1. Introduction

The difficulties posed by prediction problems have given rise to a variety of problem-solving techniques. For example, data mining methods include artificial neural networks and decision trees, and statistical techniques include linear regression and stepwise polynomial regression. It is difficult, however, to compare the efficacy of these techniques and determine the best one, because their performance is data-dependent.

A few studies have compared data mining and statistical approaches to prediction problems. Gorr, Nagin, and Szczypula (1994) compared linear regression, stepwise polynomial regression, and neural networks in the context of predicting student grade point averages (GPAs).

* Tel.: +82 2 6100 5987; fax: +82 2 6100 7911. E-mail address: [email protected]

0957-4174/$ - see front matter © 2006 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2006.12.017

Although they found that linear regression performed best overall, none of the methods performed significantly better than the ordering index used by the investigator. Shuhui, Wunsch, Hair, and Giesselmann (2001) reported that neural networks performed better than linear regression for wind farm data, while Hardgrave, Wilson, and Walstrom (1994) showed experimentally that neural networks did not significantly outperform statistical techniques in predicting the academic success of students entering an MBA program. Subbanarasimha, Arinze, and Anadarajan (2000) demonstrated that linear regression performed better than neural networks when the distribution of the dependent variable was skewed, and Kumar (2005) expanded on Subbanarasimha et al.'s (2000) result by developing a hybrid method that improved prediction accuracy.

These comparison studies have mainly considered a specific data set or the distribution of the dependent variable. Other criteria that affect the performance of prediction techniques, however, remain unexplored, such as the sample size and the characteristics of the independent variables.


We empirically compared the performance of data mining and statistical techniques while varying the number of independent variables, the types of independent variables, the number of classes of the categorical independent variables, and the sample size. Our study employed 60 simulated examples, with artificial neural networks and decision trees as the data mining techniques and linear regression as the statistical method. Using the RMSE value as the metric, we determined the following: for continuous independent variables, a statistical technique (i.e., linear regression) was superior to data mining (i.e., decision tree and artificial neural network) regardless of the number of variables; for continuous and categorical independent variables, linear regression was best when the number of categorical variables was one, while the artificial neural network was superior when the number of categorical variables was two or more; and the artificial neural network improved faster than the other methods as the number of classes of the categorical variables increased.

The article is organized as follows. Section 2 illustrates the generation of the data sets and the analysis methods for the empirical study. The experimental results are described in Section 3, and conclusions and future research directions are presented in Section 4.

2. Data analysis

2.1. Data generation

In this section, we describe the 60 simulated prediction problems that we generated to evaluate the performance of the decision tree, neural network, and linear regression techniques. Table 1 shows the first 12 simulated examples, which have only continuous independent variables. These examples were obtained from a linear model in which each xi was drawn uniformly from [0, 1] and e was normally distributed with mean 0 and standard deviation 1. The number of independent variables was set to one, three, or five, and the sample size was set to 100, 500, 1000, or 10,000. (A minimal simulation sketch is given after Table 1.)

Some of the continuous variables in Table 1 were converted to categorical variables to produce the examples in Tables 2 and 3, which consider three and five independent variables, respectively. For a categorical variable with two classes, a continuous value was converted to category 'A' when it was below the 50% point of its range and to 'B' otherwise. For three classes, a value was categorized as 'A' when it was below the 25% point, as 'B' when it was above the 75% point, and as 'C' otherwise. (A binning sketch is given after Table 3.)

2.2. Data analysis methods

The artificial neural network (ANN), decision tree (DT), and linear regression (LR) techniques were applied to the 60 simulated examples to evaluate their prediction accuracy. Each example was randomly divided into a training set and a test set; the training set consisted of 70% of the data, and the remainder was assigned to the test set. For simplicity, our performance comparisons considered only the root mean square error (RMSE) of the test set. The analyses were performed using SAS Enterprise Miner.

The ANN employed in this study was a multilayer feedforward network trained by a backpropagation algorithm. The number of hidden layers was set to either one or two, and for each hidden layer the number of hidden neurons was varied between one and ten to identify the best ANN structure. The learning rate and momentum were set to 0.1 and 0.9, respectively: a low learning rate ensures a continuous descent on the error surface, while a high momentum speeds up the training process (Sarle, 1994; Yeh, Hamey, & Westcott, 1998). These values are typically used for ANN training (Ting, Yunus, & Salleh, 2002).
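For reference, the evaluation metric can be written out explicitly; the paper uses RMSE without displaying a formula, so the following is simply the standard test-set definition:

\[
\mathrm{RMSE} = \sqrt{\frac{1}{n_{\mathrm{test}}} \sum_{i=1}^{n_{\mathrm{test}}} \left( y_i - \hat{y}_i \right)^2 }
\]

where y_i is the observed response of the ith test observation and \hat{y}_i is the model's prediction for it.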

Table 1
Simulated examples with continuous independent variables

Example ID   No. of independent variables   Sample size   Relationship
S1           1                              100           y = 1 + 5x + e
S2           1                              500           y = 1 + 5x + e
S3           1                              1000          y = 1 + 5x + e
S4           1                              10,000        y = 1 + 5x + e
S5           3                              100           y = 1 + 3x1 + 2x2 + 2x3 + e
S6           3                              500           y = 1 + 3x1 + 2x2 + 2x3 + e
S7           3                              1000          y = 1 + 3x1 + 2x2 + 2x3 + e
S8           3                              10,000        y = 1 + 3x1 + 2x2 + 2x3 + e
S9           5                              100           y = 1 + 3x1 + 2x2 + 2x3 + x4 + x5 + e
S10          5                              500           y = 1 + 3x1 + 2x2 + 2x3 + x4 + x5 + e
S11          5                              1000          y = 1 + 3x1 + 2x2 + 2x3 + x4 + x5 + e
S12          5                              10,000        y = 1 + 3x1 + 2x2 + 2x3 + x4 + x5 + e

In all examples, xi ~ U(0,1) and e ~ N(0,1).
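To make the data generation concrete, here is a minimal Python sketch of how the Table 1 examples could be produced. The author's work used SAS Enterprise Miner, so this is an illustration under the stated model, not the original code; the function name simulate is ours.

import numpy as np

rng = np.random.default_rng(0)

def simulate(n, coefs, intercept=1.0):
    # One Table 1 example: y = intercept + coefs . x + e,
    # with each x_i ~ U(0, 1) and e ~ N(0, 1).
    X = rng.uniform(0.0, 1.0, size=(n, len(coefs)))
    e = rng.normal(0.0, 1.0, size=n)
    y = intercept + X @ np.asarray(coefs) + e
    return X, y

# S5-S8: three independent variables, y = 1 + 3x1 + 2x2 + 2x3 + e
for n in (100, 500, 1000, 10000):
    X, y = simulate(n, coefs=[3.0, 2.0, 2.0])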


Table 2
Simulated examples with three independent variables, including categorical variables

Example IDs   Original IDs before conversion   Sample sizes             CA   CL   Relationship                  Description
S13-S16       S5-S8                            100, 500, 1000, 10,000   1    2    y = 1 + 3x1 + 2x2 + 2c1 + e   x3 is converted to categorical variable c1; c1 takes the value 'A' or 'B'
S17-S20       S5-S8                            100, 500, 1000, 10,000   2    2    y = 1 + 3x1 + 2c1 + 2c2 + e   x2 and x3 are converted to c1 and c2, respectively; each takes 'A' or 'B'
S21-S24       S5-S8                            100, 500, 1000, 10,000   1    3    y = 1 + 3x1 + 2x2 + 2c1 + e   x3 is converted to c1; c1 takes 'A', 'B', or 'C'
S25-S28       S5-S8                            100, 500, 1000, 10,000   2    3    y = 1 + 3x1 + 2c1 + 2c2 + e   x2 and x3 are converted to c1 and c2, respectively; each takes 'A', 'B', or 'C'

CA = number of categorical variables; CL = number of classes of the categorical variables. Within each row, the four example IDs correspond, in order, to the four sample sizes.

Table 3
Simulated examples with five independent variables, including categorical variables

Example IDs   Original IDs before conversion   Sample sizes             CA   CL   Relationship                            Description
S29-S32       S9-S12                           100, 500, 1000, 10,000   1    2    y = 1 + 3x1 + 2x2 + 2x3 + x4 + c1 + e   x5 is converted to categorical variable c1; c1 takes 'A' or 'B'
S33-S36       S9-S12                           100, 500, 1000, 10,000   2    2    y = 1 + 3x1 + 2x2 + 2x3 + c1 + c2 + e   x4 and x5 are converted to c1 and c2, respectively; each takes 'A' or 'B'
S37-S40       S9-S12                           100, 500, 1000, 10,000   3    2    y = 1 + 3x1 + 2x2 + 2c1 + c2 + c3 + e   x3, x4, and x5 are converted to c1, c2, and c3, respectively; each takes 'A' or 'B'
S41-S44       S9-S12                           100, 500, 1000, 10,000   4    2    y = 1 + 3x1 + 2c1 + 2c2 + c3 + c4 + e   x2, x3, x4, and x5 are converted to c1, c2, c3, and c4, respectively; each takes 'A' or 'B'
S45-S48       S9-S12                           100, 500, 1000, 10,000   1    3    y = 1 + 3x1 + 2x2 + 2x3 + x4 + c1 + e   x5 is converted to c1; c1 takes 'A', 'B', or 'C'
S49-S52       S9-S12                           100, 500, 1000, 10,000   2    3    y = 1 + 3x1 + 2x2 + 2x3 + c1 + c2 + e   x4 and x5 are converted to c1 and c2, respectively; each takes 'A', 'B', or 'C'
S53-S56       S9-S12                           100, 500, 1000, 10,000   3    3    y = 1 + 3x1 + 2x2 + 2c1 + c2 + c3 + e   x3, x4, and x5 are converted to c1, c2, and c3, respectively; each takes 'A', 'B', or 'C'
S57-S60       S9-S12                           100, 500, 1000, 10,000   4    3    y = 1 + 3x1 + 2c1 + 2c2 + c3 + c4 + e   x2, x3, x4, and x5 are converted to c1, c2, c3, and c4, respectively; each takes 'A', 'B', or 'C'

CA = number of categorical variables; CL = number of classes of the categorical variables. Within each row, the four example IDs correspond, in order, to the four sample sizes.
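The conversions in Tables 2 and 3 amount to simple binning. Here is a sketch, under the assumption that the 50%/25%/75% thresholds refer to the corresponding points of the U(0,1) range, so the cut points are 0.5, 0.25, and 0.75:

import numpy as np

def to_classes(x, n_classes):
    # Two classes: 'A' below the 50% point, 'B' otherwise.
    # Three classes: 'A' below 25%, 'B' above 75%, 'C' otherwise.
    x = np.asarray(x)
    if n_classes == 2:
        return np.where(x < 0.5, "A", "B")
    out = np.full(x.shape, "C")
    out[x < 0.25] = "A"
    out[x > 0.75] = "B"
    return out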


For DT, we varied the splitting criterion and used two pre-pruning parameters: 'minimum number of observations in a leaf' and 'observations required for a split search'. The splitting criterion was set to either 'F-test at 2% significance level' or 'variance reduction', and the two pre-pruning parameters were set to either 5 and 10 or 10 and 20, respectively; thus, four decision trees were generated. Finally, LR used the least squares method, and all independent variables were considered.
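As a concrete (and necessarily approximate) rendering of this protocol, the sketch below uses scikit-learn stand-ins for the SAS Enterprise Miner models. The 0.1 learning rate, 0.9 momentum, 70/30 split, and 5/10 pre-pruning values follow the text; everything else (the SGD solver, iteration budget, and the single hidden-layer size shown) is an assumption:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def test_rmse(model, X, y, seed=0):
    # 70% training / 30% test split, as in the paper.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7,
                                              random_state=seed)
    model.fit(X_tr, y_tr)
    return mean_squared_error(y_te, model.predict(X_te)) ** 0.5

models = {
    # LR: least squares on all independent variables.
    "LR": LinearRegression(),
    # ANN: backpropagation MLP; in the paper the structure would be
    # searched over one or two hidden layers with 1-10 neurons each.
    "ANN": MLPRegressor(hidden_layer_sizes=(5,), solver="sgd",
                        learning_rate_init=0.1, momentum=0.9,
                        max_iter=2000),
    # DT: CART stand-in with the paper's 5/10 pre-pruning values.
    "DT": DecisionTreeRegressor(min_samples_leaf=5, min_samples_split=10),
}

# An S7-style example: n = 1000, y = 1 + 3x1 + 2x2 + 2x3 + e.
rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 3))
y = 1 + X @ np.array([3.0, 2.0, 2.0]) + rng.normal(size=1000)
for name, model in models.items():
    print(name, round(test_rmse(model, X, y), 3))

For the examples with categorical variables, the class labels c1, c2, ... would first have to be one-hot encoded before being passed to LR and ANN.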

3. Experimental evaluation

Computational results (i.e., RMSE values) are summarized in Tables 4–6, organized by prediction method (M: LR, ANN, or DT), sample size (S), number of independent variables (V), number of categorical variables (CA), and number of classes of the categorical variables (CL).

Table 4 shows the RMSE values for LR, ANN, and DT when all of the independent variables were continuous. LR and ANN performed consistently better than DT. Furthermore, in almost all cases considered, LR was superior or equal to ANN, although the differences between their RMSE values decreased as S and V increased.

Table 4
Experimental results: RMSE values for LR, ANN, and DT when the independent variables are all continuous

No. of independent variables   Sample size   LR      ANN     DT
1                              100           1.193   1.208   1.238
1                              500           1.017   1.029   1.061
1                              1000          1.006   1.009   1.055
1                              10,000        0.990   0.990   1.050
3                              100           0.931   0.953   1.282
3                              500           0.924   0.933   1.142
3                              1000          0.987   0.987   1.146
3                              10,000        0.995   0.990   1.100
5                              100           0.948   0.989   1.151
5                              500           1.084   1.076   1.285
5                              1000          1.027   1.027   1.306
5                              10,000        0.996   0.996   1.210

Table 5 gives the results for three independent variables, one or two of which were categorical. When S was small (100 or 500), LR performed better than ANN and DT, while ANN was superior to the other methods when S was large (1000 or 10,000).

Table 5
Experimental results: RMSE values for LR, ANN, and DT for three independent variables, including categorical variables

CA   CL   Sample size   LR      ANN     DT
1    2    100           1.014   1.066   1.282
1    2    500           0.977   0.978   1.161
1    2    1000          1.028   1.027   1.138
1    2    10,000        1.033   1.030   1.110
1    3    100           0.964   0.963   1.282
1    3    500           0.959   0.966   1.172
1    3    1000          1.010   1.009   1.144
1    3    10,000        1.010   1.010   1.100
2    2    100           1.090   1.066   1.282
2    2    500           1.051   1.050   1.140
2    2    1000          1.071   1.071   1.125
2    2    10,000        1.070   1.070   1.120
2    3    100           1.034   1.037   1.306
2    3    500           0.994   1.001   1.188
2    3    1000          1.009   1.007   1.083
2    3    10,000        1.062   1.030   1.100

Table 6 shows that for five independent variables, one to four of which were categorical, ANN outperformed the other methods in almost all cases. Note that when CL was set to two, the RMSE values for ANN remained smaller than those for LR as CA increased.

Table 6
Experimental results: RMSE values for LR, ANN, and DT for five independent variables, including categorical variables

CA   CL   Sample size   LR      ANN     DT
1    2    100           0.909   0.928   1.151
1    2    500           1.106   1.100   1.319
1    2    1000          1.043   1.065   1.305
1    2    10,000        1.000   1.040   1.200
1    3    100           0.933   0.928   1.151
1    3    500           1.090   1.090   1.319
1    3    1000          1.036   1.035   1.320
1    3    10,000        1.000   1.000   1.210
2    2    100           0.905   0.904   1.169
2    2    500           1.095   1.073   1.327
2    2    1000          1.049   1.034   1.313
2    2    10,000        1.009   1.008   1.200
2    3    100           0.966   0.961   1.169
2    3    500           1.101   1.098   1.322
2    3    1000          1.036   1.036   1.328
2    3    10,000        1.005   1.010   1.210
3    2    100           1.003   0.980   1.206
3    2    500           1.126   1.100   1.286
3    2    1000          1.096   1.080   1.274
3    2    10,000        1.047   1.040   1.160
3    3    100           1.010   1.010   1.135
3    3    500           1.129   1.124   1.325
3    3    1000          1.058   1.058   1.247
3    3    10,000        1.020   1.020   1.170
4    2    100           1.119   1.098   1.357
4    2    500           1.155   1.123   1.257
4    2    1000          1.109   1.092   1.205
4    2    10,000        1.084   1.080   1.180
4    3    100           1.032   1.031   1.143
4    3    500           1.145   1.145   1.244
4    3    1000          1.081   1.081   1.231
4    3    10,000        1.040   1.040   1.170

To assess the effects of the various parameters on RMSE more succinctly, we applied analysis of variance (ANOVA) to the experimental data in each table; each experimental setting can be regarded as a full factorial design (Montgomery, 2000). For Table 4, the factors were the prediction method (denoted by M, with three levels: LR, ANN, and DT), the sample size (denoted by S, with four levels: 100, 500, 1000, and 10,000), and the number of independent variables (denoted by V, with three levels: one, three, and five); Tables 5 and 6 add the number of categorical variables (denoted by CA, with two or four levels) and the number of classes of the categorical variables (denoted by CL, with two levels: two and three).

Table 7 shows the ANOVA table for the continuous independent variables. The three-way interaction effect (S × V × M) was assumed to be negligible. Note that the p-values for the main effects S, V, and M, as well as for the interaction effects S × V and V × M, were small and therefore considered statistically significant (Montgomery, 2000). This is also illustrated in Figs. 1–3, which show the three main effects and the interaction effects S × V and V × M. Since all main effects were involved in interaction effects, their influence on RMSE could not be assessed independently; that is, the effects of S, V, and M on RMSE must be assessed using the main-effect and interaction plots together. Figs. 1 and 2 show that performance was better when S = 10,000 than for the other sample sizes, although this advantage did not hold when V = 3. Note in Fig. 3 that the methods show different patterns depending on V; nevertheless, LR was superior to the other methods regardless of V.

The ANOVA results for three independent variables, including categorical variables, are summarized in Table 8.

Table 7
ANOVA table for the continuous independent variables

Source   Degrees of freedom   Sum of squares (×10²)   Mean squares (×10²)   F       p-value
S        3                    1.8532                  0.6177                3.67    0.044
V        2                    2.2834                  1.1417                6.78    0.011
M        2                    19.7029                 9.8515                58.47   0.000
S × V    6                    9.3953                  1.5659                9.29    0.001
S × M    6                    0.4397                  0.0733                0.43    0.842
V × M    4                    4.9614                  1.2403                7.36    0.003
Error    12                   2.0219                  0.1685
Total    35                   40.6578

Fig. 1. Main effect plots for the continuous independent variables.
Fig. 2. Interaction plot of S and V for the continuous independent variables.
Fig. 3. Interaction plot of V and M for the continuous independent variables.
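The factorial ANOVA itself is easy to reproduce from the published RMSE values. Here is a hedged sketch using statsmodels (not the author's tooling), with the Table 4 numbers transcribed into long format; omitting the three-way term S:V:M leaves it as the error term, matching the 12 error degrees of freedom in Table 7:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

sizes = [100, 500, 1000, 10000]
methods = ["LR", "ANN", "DT"]
# RMSE values from Table 4, one (LR, ANN, DT) triple per sample size.
table4 = {
    1: [(1.193, 1.208, 1.238), (1.017, 1.029, 1.061),
        (1.006, 1.009, 1.055), (0.990, 0.990, 1.050)],
    3: [(0.931, 0.953, 1.282), (0.924, 0.933, 1.142),
        (0.987, 0.987, 1.146), (0.995, 0.990, 1.100)],
    5: [(0.948, 0.989, 1.151), (1.084, 1.076, 1.285),
        (1.027, 1.027, 1.306), (0.996, 0.996, 1.210)],
}
rows = [(v, s, m, r)
        for v, per_size in table4.items()
        for s, triple in zip(sizes, per_size)
        for m, r in zip(methods, triple)]
df = pd.DataFrame(rows, columns=["V", "S", "M", "RMSE"])

# Main effects and all two-way interactions; S:V:M is assumed negligible.
model = ols("RMSE ~ C(S) + C(V) + C(M) + C(S):C(V) + C(S):C(M) + C(V):C(M)",
            data=df).fit()
print(sm.stats.anova_lm(model, typ=2))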

Here, the p-values indicate that all four main effects, as well as the interaction effects CL × M, S × M, and CA × M, were statistically significant (see Fig. 4 for the main effects and Figs. 5–7 for the interaction effects). Fig. 5 shows that ANN performed best when CL = 3, while LR was superior when CL = 2. In addition, Fig. 6 shows that the methods exhibited different patterns depending on S; when S = 500, performance was better than for the other values of S, except for DT. Fig. 7 shows that LR performed best when CA = 1, while ANN was best when CA = 2. In other words, smaller CA and CL facilitated better performance by LR, while larger CA and CL improved the performance of ANN (see Figs. 5 and 7).

The ANOVA results for five independent variables, including categorical variables, are shown in Table 9.


Table 8
ANOVA table for three independent variables, including categorical variables

Source    Degrees of freedom   Sum of squares (×10²)   Mean squares (×10²)   F        p-value
CL        1                    0.8190                  0.8190                29.65    0.000
S         3                    3.1084                  1.0361                37.52    0.000
CA        1                    0.8454                  0.8454                30.61    0.000
M         2                    23.1092                 11.5546               418.37   0.000
CL × S    3                    0.1227                  0.0409                1.48     0.246
CL × CA   1                    0.0266                  0.0266                0.96     0.337
CL × M    2                    0.4629                  0.2314                8.38     0.002
S × CA    3                    0.2959                  0.0986                3.57     0.030
S × M     6                    5.9384                  0.9897                35.84    0.000
CA × M    2                    0.6210                  0.3105                11.24    0.000
Error     23                   0.6352                  0.0276
Total     47                   35.9847

Fig. 4. Main effect plots for the three independent variables, including categorical variables.
Fig. 5. Interaction plot of CL and M for the three independent variables, including categorical variables.
Fig. 6. Interaction plot of S and M for the three independent variables, including categorical variables.
Fig. 7. Interaction plot of CA and M for the three independent variables, including categorical variables.

Table 9
ANOVA table for five independent variables, including categorical variables

Source    Degrees of freedom   Sum of squares (×10²)   Mean squares (×10²)   F        p-value
CL        1                    0.2262                  0.2262                3.55     0.065
S         3                    22.1784                 7.3928                116.02   0.000
CA        3                    2.4786                  0.8262                12.97    0.000
M         2                    78.2713                 39.1357               614.17   0.000
CL × S    3                    0.2286                  0.0762                1.20     0.320
CL × CA   3                    0.8667                  0.2889                4.53     0.006
CL × M    2                    0.0308                  0.0154                0.24     0.786
S × CA    9                    4.0731                  0.4526                7.10     0.000
S × M     6                    0.9853                  0.1642                2.58     0.028
CA × M    6                    4.0117                  0.6686                10.49    0.000
Error     57                   3.6321                  0.0637
Total     95                   116.9829

Fig. 8. Main effect plots for the five independent variables, including categorical variables.
Fig. 9. Interaction plot of CL and CA for the five independent variables, including categorical variables.
Fig. 10. Interaction plot of S and CA for the five independent variables, including categorical variables.
Fig. 11. Interaction plot of CA and M for the five independent variables, including categorical variables.

In Table 9, the main effects S, CA, and M and the interaction effects CL × CA, S × CA, and CA × M were statistically significant. Figs. 8 and 9 show that the prediction accuracy decreased as CA increased and that CL = 3 was superior to CL = 2 (except at CA = 2); that is, a larger number of classes facilitated better performance. Fig. 10 shows that CA influenced RMSE when the sample size was small (i.e., S = 100); when the sample size was larger than 100, however, RMSE was not influenced by CA. Finally, Fig. 11 shows that LR performed best when CA = 1, while ANN was superior when CA > 1. This mirrors the results in Figs. 5 and 7: a smaller CA led to better performance by LR, while a larger CA facilitated better performance by ANN.
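Interaction plots such as Figs. 3, 7, and 11 can be reproduced in the same spirit. A sketch using statsmodels' interaction_plot helper (again an illustration, not the author's tooling), reusing the long-format df built in the ANOVA sketch above:

import matplotlib.pyplot as plt
from statsmodels.graphics.factorplots import interaction_plot

# Mean RMSE per (V, M) cell, one trace per prediction method.
fig = interaction_plot(x=df["V"], trace=df["M"], response=df["RMSE"],
                       xlabel="No. of independent variables (V)",
                       ylabel="Mean RMSE")
plt.show()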

4. Conclusions

In this article, we present the results of an experimental comparison of data mining and statistical techniques in which we varied the number of independent variables, the types of independent variables, the number of classes of the categorical independent variables, and the sample size. To evaluate the performance of the different techniques, we generated 60 simulated problems and used RMSE as the metric. The main results are the following: when the independent variables are continuous, LR is superior to both DT and ANN regardless of the number of variables; when the independent variables are both continuous and categorical, LR performs best when the number of categorical variables is small (i.e., CA = 1), while ANN is best when the number of categorical variables is two or more; and ANN performance improves relative to LR and DT as the number of classes of the categorical variables increases.

These results were derived from simulated data and need further verification on a variety of real data sets. They are meaningful, however, in that this study provides the first comparison between statistical and data mining techniques based on the characteristics of the independent variables. In addition, the results provide insight for selecting the most appropriate prediction method for a problem based on the characteristics of its independent variables.


A promising area of future research would be applying this approach to compare the performance of classification methods.

References

Gorr, W. L., Nagin, D., & Szczypula, J. (1994). Comparative study of artificial neural network and statistical models for predicting student grade point averages. International Journal of Forecasting, 10, 17–34.
Hardgrave, B. C., Wilson, R. L., & Walstrom, K. A. (1994). Predicting graduate student success: A comparison of neural networks and traditional techniques. Computers and Operations Research, 21, 249–263.
Kumar, U. A. (2005). Comparison of neural networks and regression analysis: A new insight. Expert Systems with Applications, 29, 424–430.
Montgomery, D. C. (2000). Design and analysis of experiments. New York: Wiley.
Sarle, W. S. (1994). Neural network implementation in SAS software. In Proceedings of the nineteenth annual SAS users group international conference (pp. 1551–1573). Cary, NC.
Shuhui, L., Wunsch, D. D., Hair, E. O., & Giesselmann, M. G. (2001). Comparative analysis of regression and artificial neural network models for wind turbine power curve estimation. Journal of Solar Energy Engineering, 123, 327–332.
Subbanarasimha, P. N., Arinze, B., & Anadarajan, M. (2000). The predictive accuracy of artificial neural networks and multiple regression in the case of skewed data: Exploration of some issues. Expert Systems with Applications, 19, 117–123.
Ting, H. N., Yunus, J., & Salleh, H. (2002). Speaker-independent phonation recognition for Malay plosives using neural networks. In International joint conference on neural networks (pp. 619–623). Honolulu, HI.
Yeh, J. C. H., Hamey, L. G. C., & Westcott, T. (1998). Developing FENN applications using cross-validated validation training. In Proceedings of the second IEEE international conference on intelligent processing systems (pp. 565–569). Gold Coast.