Design of ensemble neural network using entropy theory


Zhiye Zhao *, Yun Zhang

School of Civil and Environmental Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798, Singapore

Advances in Engineering Software 42 (2011) 838–845

Article history: Received 19 May 2010; received in revised form 11 May 2011; accepted 13 May 2011; available online 14 June 2011.

Keywords: Ensemble neural network; Network architecture; Entropy theory; Peak particle velocity; Network weight optimization; Lagrange method

Abstract

Ensemble neural networks (ENNs) are commonly used neural networks in many engineering applications due to their better generalization properties. An ENN usually includes several component networks in its structure, and each component network is commonly a single feed-forward network trained with the back-propagation learning rule. As the neural network architecture has a significant influence on its generalization ability, it is crucial to develop a proper algorithm to determine the ENN architecture. In this paper, an ENN which combines the component networks using the entropy theory is proposed. The entropy-based ENN first searches for the best structure of each component network, and then employs the entropy as an automated design tool to determine the best combining weights. Two analytical functions, the peak function and the Friedman function, are used to assess the accuracy of the proposed ensemble approach. Then, the entropy-based ENN is applied to the modeling of the peak particle velocity (PPV) damage criterion for rock mass. These computational experiments verify that the proposed entropy-based ENN outperforms the simple averaging ENN and the single NN.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

The artificial neural network (NN) [1] is a mathematical or computational model for information processing inspired by biological neural networks. It has been successfully applied to a number of civil engineering problems, such as predicting soil thermal resistivity [2], fire analysis of steel frames [3], simulating seismic response [4], monitoring offshore platforms [5], and concrete strength prediction [6].

The ENN originates from Hansen and Salamon's work [7], which showed that the generalization ability of an NN system can be significantly improved by ensembling a number of NNs. Since this approach performs remarkably well, it has been applied in many areas, such as pattern recognition [8], predicting software reliability [9], financial decision applications [10], and weather analysis [11]. In general, an ENN is constructed in two steps: creating the component networks and combining the component networks in an ensemble. The key principle for generating a good ENN is to create component networks that are both accurate and diverse.

The information theory based on the entropy concept has a long history in statistical thermodynamics, quantum physics and communication engineering [12,13]. The entropy has been defined in various ways [14–17] and used to characterize communication models where signals are mixed with random noises.


The entropy, as a mathematical concept, first appeared in Shannon's paper [15] on information theory. Another important advance in mathematical entropy was made by Kullback [18] in the 1950s. The entropy, as a measure of disorder or uncertainty, has been widely used by both mathematicians and physicists.

There are several applications of the entropy in specific fields of NNs. Schraudolph [19] employed the entropy as the objective for unsupervised learning within the minimum description length and projection pursuit frameworks; its optimization in an NN is nontrivial since the entropy depends on the probability density, which is not explicit in an empirical data sample. Ng et al. [20] defined the entropy as a term used in the learning phase of an NN. Yuan et al. [21] proposed a new method for optimizing the number of hidden neurons based on the information entropy. Chakik et al. [22] presented a comprehensive maximum entropy procedure for classification tasks. Esteves et al. [23] presented a cross-entropy approach to map high-dimensional data into a low-dimensional space. Liu et al. [24] presented a maximum-entropy learning algorithm based on radial basis function NNs to control a chaotic system. Most of this development work aims to use the entropy to determine and define the complexity of the NN by (a) defining bounds for it; (b) generating the NN based on the entropy criterion; or (c) performing NN architecture optimization and pruning.

In this paper, the entropy theory is used to combine the component networks in an ENN. The proposed entropy-based ENN increases the accuracy of each component network and balances its contribution to the ENN. The entropy reduces the over-fitting of the component networks, so as to improve the overall performance of the ENN.


The rest of this paper is organized as follows. In Sections 2 and 3, ENNs and the entropy are introduced, respectively. In Section 4, the entropy-based ensemble network is proposed. In Section 5, three examples are used to verify the proposed entropy-based ENN. Finally, in Section 6, the contributions of this paper are summarized.

2. Ensemble neural network

An ENN is a collection of a finite number of NNs that are trained for the same task. Usually the networks in the ensemble are trained independently and their predictions are then combined [25]. In other words, any one of the component networks in the ENN could provide a solution or a predictor to the task by itself, but better results may be obtained by an ENN through different methods of combining the solutions or predictors produced by the component networks. The architecture of the ENN is shown in Fig. 1. The two main steps to construct an ENN are: Step 1 – creating the component networks; Step 2 – combining these component networks in the ENN.

In Step 1, good regression or classification component networks must be both accurate and diverse. To find networks which generalize differently, a number of training parameters can be manipulated. These parameters include the initial conditions, the training data, the topology of the nets, and the training algorithm [26]. The most widely used techniques for creating the training data are Bagging and Boosting. Bagging (short for 'bootstrap aggregation') was proposed by Breiman [27] based on bootstrap sampling [28], in which one available sample gives rise to many others by a re-sampling process. During the re-sampling, randomly picked data may be repeated in the new training set. A component network is then trained with this new sample, and the process is repeated until there are sufficient component networks in the ENN. Bagging is therefore useful for problems with a shortage of data. Boosting was proposed by Schapire [29] and improved by Freund and Schapire [30]. Boosting generates a set of component networks whose training sets are determined by the performance of the former component networks. Another frequently used method for creating the ENN is to change the number of hidden nodes in different component networks. Zhao et al. [31] proposed a simple procedure to define the number of hidden nodes in each component network, in which the upper and lower bounds for the number of hidden nodes are determined based on the physical problem and the training data available.

After a set of component networks has been created, the method to combine these networks has to be considered. Since the beginning of the 1990s, several methods have been proposed. Hashem [32] provided an account of methods to find optimal linear combinations of the members of an ensemble using equal combination weights. The set of outputs combined by a uniform weighting is referred to as the simple ensemble (or simple averaging method).


Perrone and Cooper [33] proposed a generalized ensemble method (GEM) to determine the optimal weights using the correlation matrix. They defined a symmetric correlation matrix using the error between the target function and the output of each component network; this is often referred to as the weighted averaging method. Some researchers have used nonlinear ensemble strategies to determine the optimal weights of the neural ensemble predictor. Lai et al. [34] proposed a nonlinear ensemble method based on the support vector machine regression (SVMR) principle [35]. Chen et al. [36] proposed a flexible neural tree ensemble technique, in which the output of the ensemble is approximated by a local polynomial model.

It is worth mentioning that when a number of NNs are available, most ensemble approaches aim to reduce the mean squared error (MSE) of each NN, which may increase the complexity of the ENN and result in an unstable performance of the ENN model. The complexity of the ENN model may increase the computational time and may lead to over-fitting. The main purpose of this paper is therefore to use the entropy to reduce both the complexity of the component NNs and the over-fitting.
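To make the linear combination schemes above concrete, the following minimal sketch contrasts simple averaging with a generic weighted combination. It is an illustration only: the component predictions and the weights are hypothetical stand-ins, not outputs of the networks studied in this paper.

```python
import numpy as np

def combine_simple(predictions):
    """Simple ensemble: uniform weights 1/m over the m component outputs."""
    return np.mean(predictions, axis=0)

def combine_weighted(predictions, weights):
    """Weighted ensemble: output = sum_i w_i * f_i(x), with the w_i summing to one."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()            # normalise so the weights sum to one
    return np.tensordot(weights, predictions, axes=1)

# Three hypothetical component predictions on four inputs (illustrative numbers only)
preds = np.array([[1.0, 2.0, 3.0, 4.0],
                  [1.2, 1.8, 3.1, 4.2],
                  [0.9, 2.1, 2.8, 3.9]])
print(combine_simple(preds))                     # uniform combination
print(combine_weighted(preds, [0.5, 0.3, 0.2]))  # weighted combination
```

In both cases the ensemble output is a convex combination of the component outputs; the schemes differ only in how the weights are chosen, which is exactly where the entropy-based method of Section 4 comes in.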

3. Entropy

The entropy was introduced in the context of the efficiency of heat engines in the early 19th century. The entropy of a random variable is defined in terms of its probability distribution and can be shown to be a good measure of randomness or uncertainty. Shannon and Weaver [37] gave a measure of uncertainty, known as Shannon's entropy, having the same mathematical expression as the entropy in statistical mechanics. According to the second law of thermodynamics, the entropy never decreases in a closed system, and it is a measure of the disorder or complexity of a system.

Let X be a random variable with sample space X = {x_1, x_2, ..., x_n}, and let the probability associated with the value x_i be p_i, i.e., P(X = x_i) = p_i, i = 1, 2, ..., n. The entropy H(X) ≡ H_n(p) of the ensemble {(x_1, p_1), (x_2, p_2), ..., (x_n, p_n)} as defined by Shannon is given by the expression

\[ H(X) \equiv H_n(p) = H_n(p_1, p_2, \ldots, p_n) = -c \sum_{i=1}^{n} p_i \log p_i \qquad (1) \]

where c is an arbitrary positive constant. The base of the logarithm is arbitrary: if the base is 2, the entropy is measured in bits; if the base is e, the entropy is in nats (natural units). The measure H, which Shannon used to quantify the uncertainty of a collection of events, reaches a maximum when p_1 = p_2 = ... = p_n = 1/n, in other words, when the probabilities are uniform.

The maximum entropy formalism published by Jaynes in 1957 is a fundamental concept in information theory [12]. The maximum entropy formalism is used to determine the probabilities underlying a random process from any available statistical data about the process. The resulting maximum entropy probability distribution corresponds to a distribution which is consistent with the given partial information but has the maximum uncertainty or entropy associated with it.
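As a small illustration of Eq. (1), the snippet below computes Shannon's entropy with c = 1 for two made-up probability vectors; the uniform distribution attains the maximum ln n, as stated above.

```python
import numpy as np

def shannon_entropy(p, base=np.e):
    """Shannon's entropy of Eq. (1) with c = 1; base e gives nats, base 2 gives bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                                   # 0 * log 0 is taken as 0
    return -np.sum(p * np.log(p)) / np.log(base)

print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))   # uniform: ln 4 = 1.386..., the maximum
print(shannon_entropy([0.70, 0.10, 0.10, 0.10]))   # less uncertain: about 0.94
```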

Fig. 1. Architecture of the ensemble neural network: the input is passed to the component networks NN 1, NN 2, ..., NN m, whose task solutions 1, 2, ..., m are combined into a better solution.

To illustrate Jaynes' principle, consider a discrete random variable X. To obtain the 'most objective' probability distribution of X, the maximum entropy principle can be used in the following procedure:

\[ \max \; S(\lambda) = -\sum_{i=1}^{n} \lambda_i \ln \lambda_i \qquad (2) \]

subject to

\[ \sum_{i=1}^{n} \lambda_i = 1 \qquad (3) \]

and the constraints

\[ \sum_{i=1}^{n} \lambda_i f_j(x_i) = \bar{f}_j, \qquad j = 1, 2, \ldots, m \qquad (4) \]

where f_j(x) is a given function of x. Using the method of Lagrange multipliers, the resulting distribution is λ_i(x) = exp(−a_0 − a_1 f_1(x_i) − a_2 f_2(x_i) − ... − a_m f_m(x_i)), i = 1, 2, ..., n, where a_0, a_1, ..., a_m are the Lagrange multipliers, determined from the (m + 1) constraints in Eqs. (3) and (4). It can be proved that the Lagrange method yields the global maximum, and the distribution obtained has a higher entropy than any other distribution satisfying the given constraints [12]. The wide applicability of Jaynes' principle in a number of fields, as observed by Tribus [38], is attributed to the following: "Jaynes' principle shows that if Shannon's measure is taken to be the measure of uncertainty, not just a measure, the formal results of statistical mechanical reasoning can be carried over to other fields."

4. Design of ensemble neural networks using the entropy

4.1. Creating the component networks

The entropy-based ENN can reduce over-fitting in the ENN. Since varying the initial random weights leads to different performance of a single network with the same hidden nodes and the same training data, the proposed method first chooses each component network with the best initial structure. By considering the network's accuracy and the model's complexity, the entropy is then used to combine these best component networks. The proposed method reduces the error of each component network first, then balances their contributions to the ENN, so as to make the ENN both accurate and stable.

Creation of the component networks can be divided into two steps. The first step is to create the training and testing data sets, and the second step is to create the component networks. During the creation of the training and testing data sets, some common ratios of training data to testing data will be used in the analyses. All training data are used for each component network. For creating the component networks, each component network is created several times, and the best structure is used in the ENN. The criterion for choosing the best structure is the smallest training MSE.

Since good regression ensemble members must be both accurate and diverse, the training of each component network should also achieve high accuracy and diversity. Thus, different numbers of hidden nodes are used in different component networks, except for cases with limited data. The procedure to define the number of hidden nodes in each component network is similar to Zhao's method [31], in which the best number of hidden nodes in a single NN is first worked out by trial and error. Since a small test MSE and a small Akaike information criterion (AIC) value indicate sufficient training and a proper number of parameters, the best

number of hidden nodes is chosen as the one for which the single NN has the smallest test MSE and the smallest AIC value. This number is selected as the maximum number of hidden nodes among the component networks. Then several other networks with fewer hidden nodes than the maximum are added. The gap in the number of hidden nodes between any two component networks should be as large as possible to increase the diversity, and the minimum number of hidden nodes of the component networks should be as small as possible while retaining sufficient accuracy. An upper boundary for the number of parameters that can be incorporated in the model is determined by the fact that it is not possible to determine more parameters than the number of samples in the data set [39]. The boundary for the number of hidden nodes is thus determined as (for the case with only one output node):

\[ N_h < (N_{tr} - 1)/(N_i + 2) \qquad (5) \]

where N_i is the number of input nodes of the component network, N_h is the number of hidden nodes, and N_tr is the number of training data. In this paper, different numbers of hidden nodes are used in the component networks whenever sufficient training data are available.

4.2. Combining the component networks with the entropy

In the past 20 years, the entropy theory has been used to solve min–max problems with constraints in different areas [40–42]. Similarly, to use the entropy concept to obtain an unbiased ENN, three parts of the problem should be optimized at the same time: maximize the entropy of the combining weights of the whole ENN; minimize the error between the mean output of the ENN and the mean target value; and minimize the difference between the standard deviation of the output of the ENN and the standard deviation of the target value. This benefits the whole ENN. The three-part optimization problem can be formulated as follows:

Problem:

\[ \max \; S(P) = -\sum_{i=1}^{m} P_i \ln P_i \]

\[ \min \; \left| \mu_{ENNoutput} - \mu_{target} \right|, \qquad \min \; \left| \sigma_{ENNoutput} - \sigma_{target} \right| \]

\[ \text{subject to} \quad \sum_{i=1}^{m} P_i = 1, \quad P_i > 0 \qquad (6) \]

where S(P) is the entropy value of the combining weights of the whole ENN; P_i is the weight of the ith component network in the ENN; m is the number of component networks; μ_target = (1/n) Σ_{j=1}^{n} T_j is the mean value of the target, with n the number of input data sets used to determine the component weights and T_j the target value for the jth input data set; μ_ENNoutput = (1/n) Σ_{j=1}^{n} Σ_{i=1}^{m} P_i f_ij(x) is the mean value of the output of the ENN, where f_ij(x) is the output of the ith component NN for the jth input data set, so that Σ_{i=1}^{m} P_i f_ij(x) is the output of the ENN for the jth input data set; σ_target = sqrt((1/n) Σ_{j=1}^{n} (T_j − μ_target)²) is the standard deviation of the target; σ_ENNoutput = Σ_{i=1}^{m} P_i σ_i is the (approximated) standard deviation of the ENN output, where σ_i = sqrt((1/n) Σ_{j=1}^{n} (f_ij − μ_i)²) is the standard deviation of the output of the ith component network and μ_i = (1/n) Σ_{j=1}^{n} f_ij(x) is the mean value of the output of the ith component network.

In statistics, when X_1, X_2, ..., X_m are dependent random variables, σ²_ENNoutput = Σ_{i=1}^{m} P_i² σ_i² + 2 Σ_{i≠j} P_i P_j cov(X_i, X_j). Here, the following approximation of σ_ENNoutput is used:


\[ \sigma_{ENNoutput}^2 = \sum_{i=1}^{m} P_i^2 \sigma_i^2 + 2 \sum_{i \neq j} P_i P_j \,\mathrm{cov}(X_i, X_j) \leq \sum_{i=1}^{m} P_i^2 \sigma_i^2 + 2 \sum_{i \neq j} P_i P_j \sigma_i \sigma_j = \left( \sum_{i=1}^{m} P_i \sigma_i \right)^{2} \qquad (7) \]

Generally, the input data sets for the component networks of the ENN are identical or similar, so the outputs of these component networks are not independent. Since the relationship between the standard deviation of the ENN and the standard deviations of the component networks cannot be exactly identified, the upper bound in Eq. (7) is taken as the approximated standard deviation of the ENN in this approach, i.e. σ_ENNoutput = Σ_{i=1}^{m} P_i σ_i. Eq. (6) can be solved by the Lagrange method as follows:

\[ \max \; L(x, P) = -\sum_{i=1}^{m} P_i \ln P_i - \lambda_0 \left( \sum_{i=1}^{m} P_i - 1 \right) - \lambda_1 \left( \mu_{ENNoutput} - \mu_{target} \right) - \lambda_2 \left( \sigma_{ENNoutput} - \sigma_{target} \right) \qquad (8) \]

where λ_0, λ_1 and λ_2 are the Lagrange multipliers. Setting ∂L(x, P)/∂P_i = 0 for i = 1, 2, ..., m, the solution of this problem is P_i = exp(−1 − λ_0 − λ_1 μ_i − λ_2 σ_i). Letting A = e^{1+λ_0}, B = e^{−λ_1}, C = e^{−λ_2}, and using Σ_{i=1}^{m} P_i = 1, the solution becomes:

\[ P_i = \frac{B^{\mu_i} C^{\sigma_i}}{\sum_{i=1}^{m} B^{\mu_i} C^{\sigma_i}} \qquad (9) \]

To obtain the weights of the component networks, Newton's method is used to solve the above equations. Since μ_target = Σ_{i=1}^{m} P_i μ_i and σ_target = Σ_{i=1}^{m} P_i σ_i, the problem can be rewritten in the following form:

\[ u(B, C) = \sum_{i=1}^{m} (\mu_{target} - \mu_i) \, B^{\mu_i} C^{\sigma_i} = 0, \qquad v(B, C) = \sum_{i=1}^{m} (\sigma_{target} - \sigma_i) \, B^{\mu_i} C^{\sigma_i} = 0 \qquad (10) \]

The iteration procedure is as follows:

\[ \begin{pmatrix} B_{n+1} \\ C_{n+1} \end{pmatrix} = \begin{pmatrix} B_n \\ C_n \end{pmatrix} - J_n^{-1} \begin{pmatrix} u \\ v \end{pmatrix} \Bigg|_{P_n} \qquad (11) \]

where P_n is the point (B_n, C_n); the initial values of B and C (i.e. B_0 and C_0) can influence the performance of the ENN. J_n, the Jacobian matrix, is the matrix of partial derivatives of u and v evaluated at the point P_n:

\[ J_n = \begin{pmatrix} \partial u / \partial B & \partial u / \partial C \\ \partial v / \partial B & \partial v / \partial C \end{pmatrix} \Bigg|_{P_n} \]
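The combining step of Eqs. (9)–(11) can be sketched as follows. This is not the authors' MATLAB implementation but a minimal reconstruction under the stated formulas: Newton's method drives u(B, C) and v(B, C) of Eq. (10) to zero, and the combining weights then follow from Eq. (9). The component statistics passed in at the end are hypothetical values used only to exercise the function.

```python
import numpy as np

def entropy_weights(mu, sigma, mu_t, sigma_t, B0=1.0, C0=1.0, tol=1e-10, max_iter=100):
    """Solve u(B, C) = v(B, C) = 0 (Eq. (10)) by Newton's method (Eq. (11)) and
    return the combining weights P_i of Eq. (9)."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    B, C = B0, C0                                      # starting values B0 and C0
    for _ in range(max_iter):
        w = B ** mu * C ** sigma                       # unnormalised weights B^mu_i C^sigma_i
        u = np.sum((mu_t - mu) * w)
        v = np.sum((sigma_t - sigma) * w)
        if abs(u) < tol and abs(v) < tol:
            break
        # Jacobian of (u, v) with respect to (B, C) at the current point
        J = np.array([[np.sum((mu_t - mu) * mu * w) / B, np.sum((mu_t - mu) * sigma * w) / C],
                      [np.sum((sigma_t - sigma) * mu * w) / B, np.sum((sigma_t - sigma) * sigma * w) / C]])
        dB, dC = np.linalg.solve(J, [u, v])            # Newton step: (B, C) <- (B, C) - J^{-1}(u, v)
        B, C = max(B - dB, 1e-12), max(C - dC, 1e-12)  # keep B = e^{-lambda_1}, C = e^{-lambda_2} positive
    w = B ** mu * C ** sigma
    return w / w.sum()                                 # Eq. (9): P_i = B^mu_i C^sigma_i / sum B^mu_i C^sigma_i

# Hypothetical mean/standard deviation of four component networks and of the target
P = entropy_weights(mu=[2.1, 1.9, 2.0, 2.2], sigma=[0.9, 1.1, 1.0, 1.2],
                    mu_t=2.0, sigma_t=1.05)
print(P, P.sum())
```

Because B and C must remain positive, the sketch clips the Newton update; as noted above, the starting values B_0 and C_0 can influence the result.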

Fig. 2. Flowchart of entropy-based ensemble neural network.

4.3. Algorithm

The major steps of the entropy-based ENN are shown in Fig. 2 and are explained further as follows.

1. The data set is split into two parts: the training data set and the testing data set. The training data set is used for developing the various component network models. The testing data set is not used in the network training; it is used for testing the performance of the trained NN model.
2. Several component networks are used in the ENN. Each component network has one input layer, one hidden layer and one output layer. The number of input/output nodes is selected according to the problem's input/output attributes. The number of hidden nodes usually needs to be optimized to obtain the best ENN. Each component network is trained with the training data set.
3. Each component network is randomly created several times. For each candidate, the MSE on the training data set is calculated, and the best structure, i.e. the one with the smallest training MSE, is selected (a sketch of steps 1–3 is given after this list).
4. The modified entropy value of each component network is used to determine the weight of each component network in the ENN. First, the Lagrange method is used to find the solution of the entropy equations. Then, Newton's method is used to determine the Lagrange multipliers. Finally, the ENN is constructed from the best component networks, with the weights given by the Lagrange solution.
5. The MSE on the testing data set is used as the performance measure of the ENN.
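The sketch below illustrates steps 1–3. It uses scikit-learn's MLPRegressor as a stand-in for the back-propagation component networks (the paper's own program is written in MATLAB), and the synthetic data, hidden-node counts and number of random restarts are illustrative assumptions only.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def build_component_networks(X, y, hidden_node_counts, restarts=4):
    """Train each component several times and keep the structure with the
    smallest training MSE (step 3)."""
    components = []
    for h in hidden_node_counts:
        best, best_mse = None, np.inf
        for seed in range(restarts):                     # different random initial weights
            net = MLPRegressor(hidden_layer_sizes=(h,), max_iter=1000,
                               random_state=seed).fit(X, y)
            mse = mean_squared_error(y, net.predict(X))
            if mse < best_mse:
                best, best_mse = net, mse
        components.append(best)
    return components

# Step 1: split into training and testing sets (synthetic data for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.05, 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 2: component networks with different numbers of hidden nodes
nets = build_component_networks(X_tr, y_tr, hidden_node_counts=[9, 11, 13, 15])
```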

5. Computational experiments


To verify the performance of the entropy-based ENN proposed in this paper, three computational experiments are carried out with an ENN program written in MATLAB. Two theoretical functions, the peak function and the Friedman function, are applied first, followed by one practical example: the modeling of the peak particle velocity (PPV) damage criterion for rock mass. For comparison purposes, a simple averaging ENN, which has the same structure as the entropy-based ENN, and a single NN, which uses the best number of hidden nodes, are also simulated using the same data.
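The tables that follow report the minimum, the mean and the standard deviation of the MSE over 20 independent runs. A small helper of the kind sketched below (an assumption about how such statistics are tabulated, not the authors' code) makes the reported quantities explicit.

```python
import numpy as np

def summarize_runs(mse_values):
    """Minimum, mean and sample standard deviation of the MSE over repeated runs."""
    mse = np.asarray(mse_values, dtype=float)
    return {"minimum": mse.min(), "mean": mse.mean(), "sd": mse.std(ddof=1)}

# Made-up MSEs of one model over 20 runs, for illustration only
print(summarize_runs(np.random.default_rng(1).uniform(0.2, 0.5, size=20)))
```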

5.1. Peak function approximation

The peak function, shown in Fig. 3, is a sample function of two variables obtained by translating and scaling Gaussian distributions.

Fig. 3. Example 1 – peak function.

Fig. 4. Comparison between the actual and predicted values for example 1 (predicted value versus actual value).

It is a typical complex two-dimensional function, used as a demonstration in MATLAB, and is given by:

\[ Z = 3(1 - x)^2 e^{-x^2 - (y+1)^2} - 10\left( \frac{x}{5} - x^3 - y^5 \right) e^{-x^2 - y^2} - \frac{1}{3} e^{-(x+1)^2 - y^2} \qquad (12) \]

The peak function with normally distributed noise (mean 0, variance 0.05) is used to generate the training data and the testing data. First, 11 × 11 evenly distributed data points along both the x-axis and the y-axis are selected from the domain [−3, 3] as the training data for the simulation. Another 10 × 10 evenly distributed points from the same domain are used as the testing data. The maximum training epoch of each component network is set to 30. There are 121 examples in the training data set and 100 test examples, and all training data are used to train all the component networks in the ENN.

All three kinds of NNs have the same input and output layers: the number of input nodes is 2 and the number of output nodes is 1. The optimal number of hidden nodes is selected as 15 by trial and error. Therefore, in the two ENNs there are four component networks, and the numbers of hidden nodes in the component networks are 9, 11, 13 and 15, respectively. Each component network is trained four times with random initializations to find the best weight configuration. The simple averaging ENN and the entropy-based ENN then combine the component networks with the best weight configuration. For the simple averaging ENN, the outputs of the ensemble networks are combined with the simple averaging method (denoted Ave-ENN), while the entropy-based ensemble method (denoted EN-ENN) uses the modified entropy value to determine the networks' weights. For a fair comparison, the single NN also takes the best result from four random runs.

The performance of the EN-ENN on the testing data is shown in Fig. 4, where all data points are within a narrow band of the 45° line, indicating a good accuracy. The statistical performance on the training data set and the testing data set for 20 runs is shown in Table 1, where 'single 9' denotes the single NN with 9 hidden nodes. It can be observed that both ENNs have better accuracy than the single NNs. For the single networks, the network with the higher number of hidden nodes has the better performance. When the four component networks are combined, the performance of the ENNs becomes better than that of any single one. These results demonstrate the better generalization property of the ENN, which has been verified by many others. Between the Ave-ENN and the EN-ENN, the latter has a much smaller MSE in terms of the mean value and the standard deviation (SD) for both the testing data and the training data, indicating a better generalization capability. Thus, this example demonstrates that the entropy-based weighted ENN outperforms both the single NNs and the simple averaging ENN.
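For reference, the data generation described at the start of this example can be sketched as follows: the target is the peak function of Eq. (12), the grids and the noise variance are those stated above, and the random seeds are arbitrary.

```python
import numpy as np

def peaks(x, y):
    """Peak function of Eq. (12)."""
    return (3 * (1 - x) ** 2 * np.exp(-x ** 2 - (y + 1) ** 2)
            - 10 * (x / 5 - x ** 3 - y ** 5) * np.exp(-x ** 2 - y ** 2)
            - np.exp(-(x + 1) ** 2 - y ** 2) / 3)

def grid_data(n, noise_var=0.05, seed=0):
    """n x n evenly spaced points over [-3, 3]^2 with Gaussian noise on the target."""
    rng = np.random.default_rng(seed)
    axis = np.linspace(-3.0, 3.0, n)
    gx, gy = np.meshgrid(axis, axis)
    X = np.column_stack([gx.ravel(), gy.ravel()])
    z = peaks(X[:, 0], X[:, 1]) + rng.normal(0.0, np.sqrt(noise_var), len(X))
    return X, z

X_train, z_train = grid_data(11, seed=0)   # 121 training examples
X_test, z_test = grid_data(10, seed=1)     # 100 testing examples
```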

5.2. Friedman function approximation

Friedman #1 is a nonlinear prediction problem which was used by Friedman [43] in his work on multivariate adaptive regression splines (MARS). It has five independent predictor variables that are uniformly distributed in [0, 1]. The following Friedman #1 function with normally distributed noise (mean 0, variance 1) is used to test the entropy-based ENN:

\[ Y = 10 \sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10 x_4 + 5 x_5 \qquad (13) \]

First, 5 × 5 × 5 × 5 × 5 evenly distributed data points along the x_1-axis to the x_5-axis are selected from the domain [0, 1] as the training data for the simulation. Another 4 × 4 × 4 × 4 × 4 evenly distributed points from the same domain are used as the testing data. The maximum training epoch for each network is set to 30. There are 3125 examples in the training data set and 1024 examples in the testing data set. The number of input nodes is 5 and the number of output nodes is 1.

The three kinds of NNs used in the first example are selected again to solve this problem. After the same processing as for the peak function, the single NNs use 4, 6, 8 and 10 hidden nodes in their hidden layers, respectively. The single NNs are trained four times randomly to find the best results for comparison. In the two ENNs, there are four component networks, and the numbers of hidden nodes in the component networks are 4, 6, 8 and 10, respectively. After choosing the best weight configuration of each component network from four random runs, the simple averaging ensemble method and the entropy-based ensemble method combine these four component networks with their respective methods.

Table 2 shows the corresponding MSE values for the testing data and the training data over 20 runs. From Table 2, it can be observed that the EN-ENN provides the best generalization in terms of the mean value and the SD of the MSEs. The relatively small SD for both ENNs indicates the main advantage of the ENN, i.e. the consistency of the NN simulation. The comparison between the actual and predicted test results of the EN-ENN is shown in Fig. 5; all data points are within a narrow band of the 45° line.
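The factorial design described above can be generated along the following lines; Eq. (13) supplies the target, the noise variance is the one stated in the text, and the seeds are arbitrary.

```python
import numpy as np

def friedman1(X):
    """Friedman #1 function of Eq. (13)."""
    x1, x2, x3, x4, x5 = X.T
    return 10 * np.sin(np.pi * x1 * x2) + 20 * (x3 - 0.5) ** 2 + 10 * x4 + 5 * x5

def factorial_grid(points_per_axis, noise_var=1.0, seed=0):
    """Evenly spaced factorial design over [0, 1]^5 with Gaussian noise on the target."""
    rng = np.random.default_rng(seed)
    axis = np.linspace(0.0, 1.0, points_per_axis)
    grids = np.meshgrid(*([axis] * 5), indexing="ij")
    X = np.column_stack([g.ravel() for g in grids])
    y = friedman1(X) + rng.normal(0.0, np.sqrt(noise_var), len(X))
    return X, y

X_train, y_train = factorial_grid(5, seed=0)   # 5^5 = 3125 training examples
X_test, y_test = factorial_grid(4, seed=1)     # 4^5 = 1024 testing examples
```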


Table 1. Results of twenty runs on the peak function with four component networks.

            Single 9   Single 11   Single 13   Single 15   Ave-ENN   EN-ENN
Test-MSE
  Minimum   0.486      0.289       0.324       0.234       0.269     0.247
  Mean      0.818      0.669       0.495       0.460       0.360     0.338
  SD        0.236      0.294       0.126       0.201       0.056     0.055
Train-MSE
  Minimum   0.418      0.217       0.171       0.100       0.157     0.132
  Mean      0.557      0.357       0.271       0.177       0.196     0.175
  SD        0.088      0.060       0.066       0.044       0.027     0.028

Table 2. Results of twenty runs on the Friedman #1 function with four component networks.

            Single 4   Single 6   Single 8   Single 10   Ave-ENN   EN-ENN
Test-MSE
  Minimum   2.941      1.075      1.540      1.052       1.410     1.285
  Mean      5.693      3.419      4.567      2.889       1.948     1.862
  SD        2.064      2.076      2.403      1.679       0.455     0.437
Train-MSE
  Minimum   2.382      0.226      0.188      0.041       0.481     0.181
  Mean      5.155      2.398      1.183      0.357       0.963     0.584
  SD        1.381      1.648      0.648      0.383       0.358     0.296

Fig. 5. Comparison between the actual and predicted values for example 2 (predicted value versus actual value).

5.3. Prediction of the peak particle velocity damage criterion for rock mass

When an explosive detonates in a blast hole, a huge amount of energy in the form of pressure and temperature is liberated instantaneously. A proportion of this total energy is utilized for the actual breakage and displacement of the rock mass, and the rest of the energy is spent on undesirable side effects such as ground vibration, air blasts, noise and back breaks. The ground vibration may induce damage to surface structures or the surrounding rock. The frequency and the PPV are the most commonly used parameters for the assessment of ground vibrations and also of the level of structural damage. However, it is quite difficult to establish a universally acceptable PPV damage criterion for rock mass because it depends on many factors, such as the level of damage, the material parameters of the intact rock and the physical properties of the rock mass, each of which in turn includes many factors. The damage can be divided into three to five levels. The material parameters of the intact rock include the rock type and the static and dynamic tensile and compressive strengths. The physical properties of the rock mass include the number of discontinuities, such as cracks and joints, and their properties, such as positions, orientations, strength and stiffness.

Because there are numerous discontinuities in a rock mass and their exact locations are not known, rock mass classification systems have been used to estimate the rock mass properties. The rock mass rating (RMR) value is used in the present study. The RMR classification considers six parameters, namely: (1) the uniaxial compressive strength (σ_ci) of the intact rock, (2) the joint spacing, (3) the rock quality designation (RQD), (4) the condition of the joints, (5) the water flow/pressure, and (6) the inclination of the discontinuities. Depending on the degree of these six parameters, rating values are assigned from the RMR table [44]. Since very limited data are available for this case study, the input is defined by three parameters: the tensile strength of the intact rock, the compressive strength of the intact rock, and the RMR of the rock mass. Although the compressive strength of the intact rock is already considered in the RMR system, rock masses with the same RMR can have very different intact rock compressive strengths due to the other factors considered in the system. Therefore, all three parameters are used in the present study. The ranges of the different input parameters have been decided from previously published field tests [45–47]. The input parameters for the neural network and their ranges are listed in Table 3.

In this paper, a total of 47 groups of data are used to predict the threshold PPV. Two thirds of the data (i.e. 32 groups) are chosen as the training data, and the remaining one third (i.e. 15 groups) is used for network testing. The maximum training epoch for each network is set to 30. Since only limited training data are available, with three inputs and one output, it is found that the optimal number of hidden nodes for the single NN is 3. In this application, the number of hidden nodes of the different component networks in the ENNs is chosen to be the same, i.e. 3, but different starting initial weights are used to provide the diversity. Each component network is trained four times randomly to find the best weight configuration.

Table 3. Parameters for the network and their ranges for example 3.

Parameter                        Range
Input parameters
  Tensile strength (MPa)         0.6–16.1
  Compressive strength (MPa)     18.0–186.0
  RMR                            20–95
Output parameter
  PPV (m/s)                      0.065–1.0



Fig. 6. Comparison between the measured and predicted PPV values for example 3 (predicted value versus actual peak particle velocity, m/s).

Table 4. Results of twenty runs on PPV (m/s) with four component networks.

                     Single 3   Single 3   Single 3   Single 3   Ave-ENN   EN-ENN
Test-MSE (×10⁻³)
  Minimum            0.57       0.35       0.15       0.59       0.92      0.93
  Mean               2.88       4.56       3.35       7.47       2.28      2.26
  SD                 2.46       4.60       2.42       12.03      1.17      1.15
Train-MSE (×10⁻³)
  Minimum            0.01       0.02       0.01       0.02       0.04      0.04
  Mean               0.24       1.62       0.69       1.10       0.41      0.39
  SD                 0.42       1.54       1.12       1.49       0.37      0.36

The single NN takes the best result from the four random runs. Fig. 6 shows the performance of the entropy-based ENN on the testing data set. Twenty runs are carried out for each type of NN, and their statistical results are summarized in Table 4. It can be observed that both ENNs have better accuracy than the single NNs. The single NN is quite sensitive to the initial weights assigned in each run, which explains the large standard deviation of the MSE for the single NN. The component networks perform differently even with the same number of hidden nodes, because the performance of a network also depends on its initial weights. Again, conclusions similar to those observed for the two analytical functions can be drawn: with limited data available, the ENNs have better accuracy than the single NN, and the best results are obtained by the entropy-based ENN.

6. Conclusions

This paper aims to improve the ENN in two ways: (1) instead of using the component NNs directly, a preliminary selection process is used to obtain the best component NNs; (2) the entropy is used to determine the weights of the component NNs in the ENN. Using the entropy to combine these best component networks can improve the performance of the ENN by balancing the contribution of each component network. Three computational experiments are used to verify the performance of the proposed ENN: the peak function, the Friedman function, and PPV modeling with limited data. From the comparison study, it can be found that the proposed entropy-based ENN outperforms the other methods. These results also show the potential of the proposed ENN to be applied to other kinds of problems.

References

[1] McCulloch WS, Pitts W. A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys 1943;5:115–33. Reprinted in Anderson & Rosenfeld [1988], 18–28.
[2] Erzin Y, Rao BH, Singh DN. Artificial neural network models for predicting soil thermal resistivity. Int J Ther Sci 2008;47:1347–58.
[3] Hozjan T, Turk G, Srpcic S. Fire analysis of steel frames with the use of artificial neural networks. J Construct Steel Res 2007;63:1396–403.
[4] Tsompanakis Y, Lagaros ND, Psarropoulos PN, Geogopoulos EC. Simulating the seismic response of embankments via artificial neural networks. Adv Eng Softw 2009;40:640–51.
[5] Mangal L, Idichandy VG, Ganapathy C. ART-based multiple neural networks for monitoring offshore platforms. Appl Ocean Res 1996;18:137–43.
[6] Jiang N, Zhao ZY, Ren LQ. Design of structural modular neural networks with genetic algorithm. Adv Eng Softw 2003;34:17–24.
[7] Hansen LK, Salamon P. Neural network ensembles. IEEE Trans Pattern Anal Mach Intel 1990;12(10):993–1001.
[8] Giacinto G, Roli F. Design of effective neural network ensembles for image classification purposes. Img Vis Comput 2001;19:699–707.
[9] Zheng J. Predicting software reliability with neural network ensembles. Exp Syst Appl 2009;36:2116–22.
[10] West D, Dellana S, Qian J. Neural network ensemble strategies for financial decision applications. Comput Oper Res 2005;32:2543–59.
[11] Maqsood I, Abraham A. Weather analysis using ensemble of connectionist learning paradigms. Appl Soft Comp 2007;7:995–1004.
[12] Jaynes ET. Information theory and statistical mechanics I. Phys Rev 1957;106:620–30.
[13] Middleton D. Topics in communication theory. New York: McGraw Hill; 1964.
[14] Hartley RV. Transmission of information. Bell Syst Technol J 1928;7:535–63.
[15] Shannon CE. A mathematical theory of communication. Bell Syst Technol J 1948;27:379–423, 623–56.
[16] Wiener N. Cybernetics. New York: John Wiley; 1961.
[17] Renyi A. On the measure of entropy and information. Proc Fourth Berkeley Sympos Math Stat Probab 1961;1:541–61.
[18] Kullback S. Information theory and statistics. New York: Wiley & Sons; 1959.
[19] Schraudolph NN. Optimization of entropy with neural networks. Ph.D. thesis, University of California; 1995.
[20] Ng GS, Wahab A, Shi D. Entropy learning and relevance criteria for neural network pruning. Int J Neural Syst 2003;13(5):291–305.
[21] Yuan HC, Xiong FL, Huai XY. A method for estimating the number of hidden neurons in feed-forward neural networks based on information entropy. Comput Electron Agr 2003;40:57–64.
[22] Chakik FE, Shahine A, Jaam J, Hasnah A. An approach for constructing complex discriminating surfaces based on Bayesian interference of the maximum entropy. Inf Sci 2004;163:275–91.
[23] Esteves PA, Figueroa CJ, Saito K. Cross-entropy embedding of high-dimensional data using the neural gas model. Neural Netw 2005;18:727–37.
[24] Liu F, Sun CX, Si-ma WX, Liao RJ, Guo F. Chaos control of ferroresonance system based on RBF-maximum entropy clustering algorithm. Phys Lett A 2006;357:218–23.
[25] Sollich P, Krogh A. Learning with ensembles: how over-fitting can be useful. In: Touretzky DS, Mozer MC, Hasselmo ME, editors. Advances in neural information processing systems, 8. Cambridge, MA: MIT Press; 1996. p. 190–6.
[26] Sharkey A, editor. Combining artificial neural nets: ensemble and modular multi-net systems. London: Springer; 1999.
[27] Breiman L. Bagging predictors. Machine Learn 1996;24(2):123–40.
[28] Efron B, Tibshirani R. An introduction to the bootstrap. New York: Chapman & Hall; 1993.
[29] Schapire RE. The strength of weak learnability. Machine Learn 1990;5(2):197–227.
[30] Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. In: Proc. EuroCOLT-94, Barcelona, Spain. Berlin: Springer; 1995. p. 23–37.
[31] Zhao ZY, Zhang Y, Liao HJ. Design of ensemble neural network using the Akaike information criterion. Eng Appl Artif Intel 2008;21:1182–8.
[32] Hashem S. Optimal linear combinations of neural networks. PhD thesis, School of Industrial Engineering, Purdue University; December 1993.
[33] Perrone MP, Cooper LN. When networks disagree: ensemble method for neural networks. In: Mammone RJ, editor. Artif Neural Netw Speech Vis. New York: Chapman & Hall; 1993. p. 126–42.
[34] Lai KK, Yu L, Wang SY, Huang W. A novel nonlinear neural network ensemble model for financial time series forecasting. Int Conf Comput Sci 2006;1:790–3.
[35] Vapnik V. The nature of statistical learning theory. New York: Springer-Verlag; 1995.
[36] Chen YH, Yang B, Abraham A. Flexible neural trees ensemble for stock index modeling. Neurocomputing 2007;70:697–703.
[37] Shannon CE, Weaver W. The mathematical theory of communication. Urbana: University of Illinois Press; 1964.
[38] Tribus M. Thirty years of information theory. In: Levine RD, Tribus M, editors. The maximum entropy formalism. MIT Press; 1978. p. 1–14.

[39] Ren LQ, Zhao ZY. An optimal neural network and concrete strength modeling. Adv Eng Softw 2002;33:117–30.
[40] Li H, He DH, Li XS. Entropy, a new measurement method for investment portfolio risk. Math Practice Theory (Chinese) 2003;33(6):16–21.
[41] Li XS. A maximum entropy method for structural optimization. Comput Struct Mech Appl (Chinese) 1989;6(1):36–46.
[42] Wang YC, Tang HW. An entropy function method for min–max problems with constraints. Numer Math A J Chin Univ (Chinese) 1999;2:132–9.
[43] Friedman JH. Multivariate adaptive regression splines. Ann Stat 1991;19(1):1–82.


[44] Bieniawski ZT. Engineering rock mass classifications. New York: Wiley; 1989.
[45] Hao H, Wu Y, Ma G, Zhou Y. Characteristics of surface ground motions induced by blasts in jointed rock mass. Soil Dyn Earthq Eng 2001;21(2):85–98.
[46] Singh PK. Blast vibration damage to underground coal mines from adjacent open-pit blasting. Int J Rock Mech Min Sci 2002;39(8):959–73.
[47] Singh M, Seshagiri RK. Empirical methods to estimate the strength of jointed rock masses. Eng Geol 2005;77(1–2):127–37.