An accelerating scheme for destructive parsimonious extreme learning machine


Yong-Ping Zhao (a,*), Bing Li (a), Ye-Bo Li (b)

(a) School of Mechanical Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
(b) AVIC Aeroengine Control Research Institute, Wuxi 214063, China
(*) Corresponding author. E-mail address: [email protected] (Y.-P. Zhao).

Article history: Received 27 December 2014; Received in revised form 10 February 2015; Accepted 2 April 2015. Communicated by G.-B. Huang.

Abstract

Constructive and destructive parsimonious extreme learning machines (CP-ELM and DP-ELM) were recently proposed to sparsify ELM. In comparison with CP-ELM, DP-ELM has the advantage in the number of hidden nodes, but it loses the edge with respect to the training time. Hence, in this paper an equivalent measure is proposed to accelerate DP-ELM (ADP-ELM). As a result, ADP-ELM not only keeps the same hidden nodes as DP-ELM but also needs less training time than CP-ELM, which is especially important for training-time-sensitive scenarios. The same idea is then extended to the regularized ELM (RELM), yielding ADP-RELM. ADP-RELM accelerates the training process of DP-RELM further, and it works better than CP-RELM in terms of the number of hidden nodes and the training time. In addition, the computational complexity of the proposed accelerating scheme is analyzed in theory. The reported results on ten benchmark data sets confirm the effectiveness and usefulness of the proposed accelerating scheme experimentally. © 2015 Elsevier B.V. All rights reserved.

Keywords: Single-hidden layer feedforward network; Extreme learning machine; Destructive algorithm; Constructive algorithm; Sparseness

1. Introduction

The widespread popularity of single-hidden layer feedforward networks (SLFNs) in extensive fields is mainly due to their power to approximate complex nonlinear mappings and their simple form. As a specific type of SLFN, the extreme learning machine (ELM) [1-3] has recently drawn a lot of interest from researchers and engineers. Generally, training an ELM consists of two main stages [4]: (1) random feature mapping and (2) solving for the linear parameters. In the first stage, ELM randomly initializes the hidden layer to map the input samples into a so-called ELM feature space with some nonlinear mapping functions, which can be any nonlinear piecewise continuous functions, such as the sigmoid and the RBF. Since the hidden node parameters in ELM are randomly generated according to a continuous probability distribution (independent of the training samples) instead of being explicitly trained, ELM has a remarkable computational advantage over regular gradient-descent backpropagation [5,6]. That is, unlike conventional learning methods, which must see the training samples before generating the hidden node parameters, ELM can generate the hidden node parameters before seeing the training samples. In the second stage, the linear parameters can be obtained by solving the Moore-Penrose generalized


inverse of the hidden layer output matrix, which reaches both the smallest training error and the smallest norm of output weights [7]. Hence, different from most algorithms proposed for feedforward neural networks, which did not consider the generalization performance when they were first proposed, ELM can achieve better generalization performance. From the interpolation perspective, Huang et al. [3] indicated that, for any given training set, there exists an ELM network which gives a sufficiently small training error in the squared error sense with probability one, and the number of hidden nodes is not larger than the number of distinct training samples. If the number of hidden nodes equals the number of distinct training samples, then the training error decreases to zero with probability one. Actually, in the implementation of ELM it is found that the generalization performance of ELM is not sensitive to the number of hidden nodes, and good performance can be reached as long as the number of hidden nodes is large enough [8]. In addition, unlike traditional SLFNs, such as radial basis function neural networks and the multilayer perceptron with one hidden layer, where the activation functions are required to be continuous or differentiable, ELM can choose the threshold function and many others as activation functions without sacrificing the universal approximation capability at all. In statistical learning theory, the Vapnik-Chervonenkis (VC) dimension theory is one of the most widely used frameworks in generalization bound analysis. According to the structural risk minimization perspective, to obtain better generalization performance on the testing set, an algorithm should not only


Table 1. The computational complexity of ADP-ELM/ADP-RELM at the ith iteration.

| Operation | Multiplication (x) | Addition (+) | Division (/) | Square root |
| (R^(i-1))^(-1) | (1/6)L_i^3 - (1/6)L_i | (1/6)L_i^3 - (1/6)L_i | (1/2)L_i^2 + (1/2)L_i | 0 |
| beta^(i-1) | (1/2)L_i(L_i + 1)m | (1/2)L_i(L_i - 1)m | 0 | 0 |
| ||beta_js^(i-1)||_2^2 / ||r'_js(i-1)||_2^2, j_s = 1, ..., L_i | (1/2)L_i^2 + (1/2)L_i + L_i m | (1/2)L_i^2 - (3/2)L_i + L_i m | L_i | 0 |
| In sum | (1/6)L_i^3 + (1/2)L_i^2 + (1/3)L_i + (1/2)L_i^2 m + (3/2)L_i m | (1/6)L_i^3 + (1/2)L_i^2 - (5/3)L_i + (1/2)L_i^2 m + (1/2)L_i m | (1/2)L_i^2 + (3/2)L_i | 0 |

Note: L_i = L - i + 1.

Table 2. The computational complexity of DP-ELM/DP-RELM at the ith iteration.

| Operation | Multiplication (x) | Addition (+) | Division (/) | Square root |
| Retriangularization via Eq. (17), j_s = 1, ..., L_i - 1 | (2/3)L_i^3 + L_i^2 - (5/3)L_i + 2L_i^2 m - 2L_i m | (2/3)L_i^3 + (1/2)L_i^2 - (5/6)L_i + L_i^2 m - L_i m | L_i^2 - L_i | (1/2)L_i^2 - (1/2)L_i |
| ||t_{-s}^(i)||_F^2, j_s = 1, ..., L_i | L_i m | L_i(m - 1) | 0 | 0 |
| In sum | (2/3)L_i^3 + L_i^2 - (5/3)L_i + 2L_i^2 m - L_i m | (2/3)L_i^3 + (1/2)L_i^2 - (11/6)L_i + L_i^2 m | L_i^2 - L_i | (1/2)L_i^2 - (1/2)L_i |

Table 3. Specifications on each benchmark data set.

| Data sets | Hidden node type | #Hidden nodes | log2(mu) | #Training | #Testing | #Inputs | #Outputs |
| Energy efficiency | Sigmoid | 190 | -15 | 450 | 318 | 8 | 2 |
| Energy efficiency | RBF | 220 | -20 | 450 | 318 | 8 | 2 |
| Sml2010 | Sigmoid | 80 | -2 | 3000 | 1137 | 16 | 2 |
| Sml2010 | RBF | 90 | -8 | 3000 | 1137 | 16 | 2 |
| Parkinsons | Sigmoid | 100 | -10 | 4000 | 1875 | 16 | 2 |
| Parkinsons | RBF | 70 | -13 | 4000 | 1875 | 16 | 2 |
| Airfoil | Sigmoid | 100 | -17 | 800 | 703 | 5 | 1 |
| Airfoil | RBF | 90 | -30 | 800 | 703 | 5 | 1 |
| Abalone | Sigmoid | 40 | -11 | 2800 | 1377 | 8 | 1 |
| Abalone | RBF | 60 | -10 | 2800 | 1377 | 8 | 1 |
| Winequality white | Sigmoid | 90 | -8 | 3000 | 961 | 11 | 1 |
| Winequality white | RBF | 70 | -11 | 3000 | 961 | 11 | 1 |
| CCPP | Sigmoid | 220 | -34 | 6000 | 3527 | 4 | 1 |
| CCPP | RBF | 230 | -35 | 6000 | 3527 | 4 | 1 |
| Ailerons | Sigmoid | 120 | -1 | 7154 | 6596 | 40 | 1 |
| Ailerons | RBF | 100 | -10 | 7154 | 6596 | 40 | 1 |
| Elevators | Sigmoid | 60 | -4 | 8752 | 7847 | 18 | 1 |
| Elevators | RBF | 50 | -11 | 8752 | 7847 | 18 | 1 |
| Pole | Sigmoid | 300 | -6 | 10000 | 4958 | 26 | 1 |
| Pole | RBF | 300 | -27 | 10000 | 4958 | 26 | 1 |

Notes: #Training represents the number of training samples, #Testing the number of testing samples, #Inputs the number of input variables, and #Outputs the number of output variables.

achieve a low training error on the training set but also have a low VC dimension. For classification tasks, Liu et al. [9] proved that the VC dimension of an ELM is equal to its number of hidden nodes with probability one, which states that ELM has a relatively low VC dimension. With respect to regression problems, the generalization ability of ELM has been comprehensively studied in [10] and [11], leading to the conclusion that ELM with some suitable activation functions, such as polynomials, the sigmoid, and the Nadaraya-Watson function, can achieve the same optimal generalization bound as an SLFN in which all parameters are tunable. From the above analyses, it is known that ELM has excellent universal approximation capability and generalization performance. However, because ELM generates hidden nodes randomly, it usually requires more hidden nodes than a traditional SLFN to obtain matched performance. A large network size always signifies more running time in the testing phase. In cost-sensitive learning, the testing time


[Figures: training time (s) versus #Hidden nodes for CP-ELM, DP-ELM, ADP-ELM, CP-RELM, DP-RELM, and ADP-RELM; plot data omitted.]
Fig. 1. The training time on Energy efficiency (a) sigmoid and (b) RBF.
Fig. 2. The training time on Sml2010 (a) sigmoid and (b) RBF.

should be minimized, which requires a compact network to meet the testing time budget. Thus, the topic of improving the compactness of ELM has recently attracted great interest. First, Huang et al. [12] proposed an incremental ELM (I-ELM), which randomly generates the hidden nodes and analytically calculates the output weights. Because I-ELM does not recalculate the output weights of all the existing nodes when a new node is added, its convergence rate can be further improved by recalculating the output weights of the existing nodes based on a convex optimization method when a new hidden node is randomly added [13]. In I-ELM there may be some hidden nodes which play a very minor role in the network output and eventually increase the network complexity. In order to avoid this problem and to obtain a more compact network, an enhanced version of I-ELM was presented [14], where in each learning step several hidden nodes are randomly generated and, among them, the hidden node leading to the largest decrease of the residual error is added to the existing network. In [15], an error-minimized ELM was proposed, which can add random hidden nodes one by one or group by group; during the growth of the network, the output weights are updated incrementally. In [16], an optimally pruned extreme learning machine was presented, which is based on the original ELM with additional steps that make it more robust and compact. Deng et al. [17] adopted a two-stage stepwise strategy to improve the compactness of the ELM. At the first stage, the selection procedure can be automatically terminated based on the leave-one-out error. At the second stage, the contribution of each hidden node is reviewed, and insignificant ones are replaced; this procedure does not terminate until no insignificant hidden nodes exist in the final model. When the Bayesian approach is applied to ELM, a Bayesian ELM is obtained [18]. This Bayesian method can not only optimize the network architecture but also make use of a priori knowledge and obtain confidence intervals. Subsequently, a new learning algorithm called bidirectional ELM was presented [19], which can greatly enhance the learning effectiveness, reduce the number of hidden nodes, and eventually further increase the learning speed. In addition, the traditional ELM was further extended by Zhang et al. [20] to use Lebesgue p-integrable hidden activation functions, which can approximate any Lebesgue p-integrable function on a compact input set. Further, a dynamic ELM was proposed [21], where the hidden nodes can be recruited or deleted dynamically according to their significance to the network performance, so that not only the parameters can be adjusted but also the architecture can be self-adapted simultaneously. Castano et al. [22] proposed a robust and pruned ELM approach, where principal component analysis is utilized to select the hidden nodes from the input features while the corresponding input weights are deterministically defined as principal components rather than random ones. To solve large and complex sample issues, a stacked ELM was designed [23], which divides a single large ELM network into multiple stacked small ELMs that are serially connected. In addition, to improve


[Figures: training time (s) versus #Hidden nodes for CP-ELM, DP-ELM, ADP-ELM, CP-RELM, DP-RELM, and ADP-RELM; plot data omitted.]
Fig. 3. The training time on Parkinsons (a) sigmoid and (b) RBF.
Fig. 4. The training time on Airfoil (a) sigmoid and (b) RBF.

the performance of ELM on high-dimension and small-sample problems, a projection vector machine [24] was proposed. Aiming at classification tasks [7,25-28], several algorithms were proposed to optimize the network size of ELM. Recently, novel constructive and destructive parsimonious ELMs (CP-ELM and DP-ELM) were proposed based on recursive orthogonal least squares [29]. CP-ELM starts with a small initial network and gradually adds new hidden nodes until a satisfactory solution is found. On the contrary, DP-ELM starts by training a larger-than-necessary network and then removes the insignificant nodes one by one. Further, Zhao et al. [30] extended CP-ELM and DP-ELM to the regularized ELM (RELM), correspondingly obtaining CP-RELM and DP-RELM. Generally speaking, compared with CP-ELM, DP-ELM can obtain a more compact network when reaching nearly the same accuracy, but it loses the edge on the training time. The main reason is that, when DP-ELM removes each insignificant hidden node, a series of Givens rotations is implemented to compute the additional residual error reduction, and the computational burden of Givens rotations is high. Hence, a scheme is proposed to accelerate DP-ELM (ADP-ELM) instead of using Givens rotations. ADP-ELM not only needs fewer hidden nodes than CP-ELM but also beats it in the training time. When this accelerating scheme is extended to DP-RELM, ADP-RELM is obtained. ADP-RELM reduces the training time of DP-RELM further and outperforms CP-RELM in terms of the number of hidden nodes and the training time. In theory, the proposed accelerating scheme achieves the same effect as Givens rotations but requires less training time when removing the insignificant hidden nodes. Finally, extensive experiments are conducted on benchmark data sets, and the results verify the effectiveness of the proposed scheme.

The remainder of this paper is organized as follows. In Section 2, ELM and RELM are briefly introduced. As two destructive algorithms, DP-ELM and DP-RELM are described in Section 3. In Section 4, an equivalent measure is proposed instead of Givens rotations to accelerate DP-ELM and DP-RELM, yielding ADP-ELM and ADP-RELM, and their computational complexity is analyzed. Experimental results on ten benchmark data sets are presented in Section 5. Finally, conclusions follow.

2. ELM and RELM

2.1. ELM

For $N$ arbitrary distinct samples $\{(x_i, t_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^n$ and $t_i \in \mathbb{R}^m$, the traditional SLFN with $L$ hidden nodes and activation function $h(x)$ can be mathematically modeled as
\[ y = \sum_{i=1}^{L} \beta_i h(x; a_i, b_i) \tag{1} \]
where $a_i \in \mathbb{R}^n$ is the input weight vector connecting the $i$th hidden node and the input nodes, $b_i \in \mathbb{R}$ is the bias of the $i$th hidden node, and $\beta_i \in \mathbb{R}^m$ is the output weight vector connecting the $i$th hidden node and the output nodes.


[Figures: training time (s) versus #Hidden nodes for CP-ELM, DP-ELM, ADP-ELM, CP-RELM, DP-RELM, and ADP-RELM; plot data omitted.]
Fig. 5. The training time on Abalone (a) sigmoid and (b) RBF.
Fig. 6. The training time on Winequality white (a) sigmoid and (b) RBF.

If the outputs of the SLFN are equal to the targets, the following compact formulation is obtained:
\[ H\beta = T \tag{2} \]
where $\beta = [\beta_1, \cdots, \beta_L]^T$, $T = [t_1, \cdots, t_N]^T$, and
\[ H = \begin{bmatrix} h(x_1; a_1, b_1) & \cdots & h(x_1; a_L, b_L) \\ \vdots & \ddots & \vdots \\ h(x_N; a_1, b_1) & \cdots & h(x_N; a_L, b_L) \end{bmatrix} \tag{3} \]
Here $H$ is the so-called hidden-layer output matrix. For the traditional SLFN, the parameters $\{(a_i, b_i)\}_{i=1}^{L}$ in (1) are usually determined using gradient-based methods. However, in ELM these parameters are randomly initialized before seeing the training samples. In this case, training an ELM simply amounts to finding the output weight matrix $\beta$ of the linear system (2). A simple solution of (2) is given explicitly as
\[ \hat{\beta} = H^{\dagger} T \tag{4} \]
where $H^{\dagger} = (H^T H)^{-1} H^T$ is the Moore-Penrose generalized inverse of $H$. The solution $\hat{\beta}$ in (4) minimizes the training error as well as the norm of the output weights [7]. In other words, ELM aims to reach better generalization performance by reaching both the smallest training error and the smallest norm of output weights. On this point, most algorithms for SLFNs did not consider the generalization performance when they were first proposed. Actually, solving (2) in the squared error sense is equivalent to minimizing the following problem:
\[ \min_{\beta} \; G_{\mathrm{ELM}} = \| H\beta - T \|_F^2 \tag{5} \]
where $\|\cdot\|_F$ denotes the Frobenius norm. Letting $dG_{\mathrm{ELM}}/d\beta = 0$ yields the same solution as (4).

2.2. RELM

Due to the random choice of the input weights and the hidden layer biases, the hidden layer output matrix $H$ may not be of full column rank, or the condition number of $H^T H$ may be too large. In these scenarios, solving $\hat{\beta}$ directly using (4) may degrade the ELM accuracy. To this end, after adding a regularization term to (5), the mathematical model of RELM is [31]
\[ \min_{\beta} \; G_{\mathrm{RELM}} = \| H\beta - T \|_F^2 + \mu \| \beta \|_F^2 \tag{6} \]
where $\mu$ is the regularization parameter. Letting $dG_{\mathrm{RELM}}/d\beta = 0$ yields
\[ \hat{\beta} = (H^T H + \mu I)^{-1} H^T T \tag{7} \]
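To make the two closed-form fits concrete, the following NumPy sketch computes the ELM solution (4) and the RELM solution (7) on synthetic data. It is only an illustration under our own assumptions (a sigmoid hidden layer with a bias term and randomly drawn data); it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_features(X, A, b):
    """Random-feature mapping H with sigmoid hidden nodes h(x) = 1/(1 + exp(-(x^T a_i + b_i)))."""
    return 1.0 / (1.0 + np.exp(-(X @ A + b)))

# Synthetic data: N samples, n inputs, m outputs, L hidden nodes.
N, n, m, L = 500, 8, 2, 40
X = rng.uniform(-1.0, 1.0, (N, n))
T = rng.standard_normal((N, m))

# Stage 1: random hidden-node parameters, generated independently of the training samples.
A = rng.uniform(-1.0, 1.0, (n, L))
b = rng.uniform(-1.0, 1.0, (1, L))
H = elm_features(X, A, b)                      # N x L hidden-layer output matrix

# Stage 2a: ELM output weights, Eq. (4): beta = pinv(H) T.
beta_elm = np.linalg.pinv(H) @ T

# Stage 2b: RELM output weights, Eq. (7): beta = (H^T H + mu I)^{-1} H^T T.
mu = 2.0 ** -15
beta_relm = np.linalg.solve(H.T @ H + mu * np.eye(L), H.T @ T)

print(np.linalg.norm(H @ beta_elm - T), np.linalg.norm(H @ beta_relm - T))
```

Both solves cost on the order of $NL^2 + L^3$ operations; the only difference between them is the ridge term $\mu I$ added to the normal equations.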


[Figures: training time (s) versus #Hidden nodes for CP-ELM, DP-ELM, ADP-ELM, CP-RELM, DP-RELM, and ADP-RELM; plot data omitted.]
Fig. 7. The training time on CCPP (a) sigmoid and (b) RBF.
Fig. 8. The training time on Ailerons (a) sigmoid and (b) RBF.

3. DP-ELM and DP-RELM

3.1. DP-ELM

3.1.1. Preliminary work
According to the conclusion in [3], it has been proved that an ELM with at most $N$ hidden nodes and with almost any nonlinear activation function can exactly learn $N$ distinct samples. In fact, in real-world applications the number of hidden nodes is always less than the number of training samples, i.e., $L < N$, and the matrix $H$ is of full column rank with probability one. Hence, the following theorem is obtained.

Theorem 1. The minimizer of Eq. (5) amounts to the following minimizer:
\[ \min_{\beta} \; G'_{\mathrm{ELM}} = \| R_{LL}\beta - \hat{T}_{Lm} \|_F^2 \tag{8} \]
where
\[ H = Q \begin{bmatrix} R_{LL} \\ 0_{(N-L)\times L} \end{bmatrix} \tag{9} \]
\[ Q^T T = \begin{bmatrix} \hat{T}_{Lm} \\ \tilde{T}_{(N-L)m} \end{bmatrix} \tag{10} \]
Here $Q$ is an $N \times N$ orthogonal matrix satisfying $Q^T Q = Q Q^T = I$, and $R_{LL}$ is an upper triangular matrix with the same rank as $H$.

Proof. Since the matrix $H$ is of full column rank, it can be decomposed as $H = Q \bigl[\begin{smallmatrix} R_{LL} \\ 0_{(N-L)\times L} \end{smallmatrix}\bigr]$ using the QR decomposition according to matrix theory [32], where $Q$ is an orthogonal matrix and $R_{LL}$ is an upper triangular matrix. Thus, we get
\[ \| H\beta - T \|_F^2 = \| Q^T H\beta - Q^T T \|_F^2 = \left\| \begin{bmatrix} R_{LL} \\ 0_{(N-L)\times L} \end{bmatrix}\beta - \begin{bmatrix} \hat{T}_{Lm} \\ \tilde{T}_{(N-L)m} \end{bmatrix} \right\|_F^2 = \| R_{LL}\beta - \hat{T}_{Lm} \|_F^2 + \| \tilde{T}_{(N-L)m} \|_F^2 \tag{11} \]
Plugging (11) into (5) and eliminating the constant part $\| \tilde{T}_{(N-L)m} \|_F^2$, Eq. (8) is obtained. This completes the proof. □

Since Eqs. (5) and (8) have the same minimizer, we do not distinguish $G_{\mathrm{ELM}}$ and $G'_{\mathrm{ELM}}$ in this paper. In addition, according to (10) we know
\[ \| T \|_F^2 = \| \hat{T}_{Lm} \|_F^2 + \| \tilde{T}_{(N-L)m} \|_F^2 \tag{12} \]
where $\| \tilde{T}_{(N-L)m} \|_F^2$ is defined as the initial residual error and $\| \hat{T}_{Lm} \|_F^2$ is the additional residual error.
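Theorem 1 is, in effect, an economy-size QR reduction: the $N \times L$ least-squares problem collapses to an $L \times L$ triangular system plus a constant residual. The following sketch of (9)-(12) uses stand-in matrices of our own choosing and NumPy/SciPy routines, not the authors' code.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
N, L, m = 500, 40, 2
H = rng.standard_normal((N, L))                # stand-in hidden-layer output matrix
T = rng.standard_normal((N, m))

# Eq. (9)-(10): full QR of H; R_LL is the L x L upper-triangular block,
# T_hat / T_tilde are the top / bottom blocks of Q^T T.
Q, R = np.linalg.qr(H, mode="complete")        # Q: N x N, R: N x L (zero below row L)
R_LL = R[:L, :L]
QtT = Q.T @ T
T_hat, T_tilde = QtT[:L, :], QtT[L:, :]

# Eq. (8)/(13): solve the triangular system R_LL beta = T_hat by back substitution.
beta = solve_triangular(R_LL, T_hat)

# Eq. (11)-(12): the total residual equals the constant part ||T_tilde||_F^2.
print(np.linalg.norm(H @ beta - T) ** 2, np.linalg.norm(T_tilde) ** 2)
```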


Fig. 9. The training time on Elevators (a) sigmoid and (b) RBF.
Fig. 10. The training time on Pole (a) sigmoid and (b) RBF.

The output weight matrix $\hat{\beta}$ in (4) can be got by solving
\[ R\hat{\beta} = \hat{T} \tag{13} \]
where $R$ and $\hat{T}$ are from (8) after omitting subscripts. Because $R$ is an upper triangular matrix, $\hat{\beta}$ in Eq. (13) is easily obtained with backward substitutions. Although an ELM can be obtained easily through (13), its solution is dense; that is, there may exist insignificant hidden nodes in the solution. Hence, we need to improve the compactness of ELM. It should be pointed out that every regressor (or column) $r_i$ of the matrix $R = [r_1, \cdots, r_L]$ in (8) corresponds to one hidden node. Hence, compacting ELM amounts to determining the significant regressors from $R$.

3.1.2. Realization
The DP-ELM starts with the full set of candidate regressors, i.e., the initial regressor matrix $R^{(0)} = R$, $T^{(0)} = \hat{T}$, and a full index set $P = \{1, 2, \cdots, L\}$; it then gradually removes the insignificant hidden nodes one by one until the stopping criterion is met. Assume that at the $i$th iteration the candidate regressor matrix $R_{-s}^{(i)}$ is obtained by removing the regressor $r_s^{(i-1)}$ from the previous regressor matrix $R^{(i-1)}$:
\[ R_{-s}^{(i)} \leftarrow R^{(i-1)} \backslash r_s^{(i-1)}, \quad s \in P \tag{14} \]
Then, $R_{-s}^{(i)}$ is retriangularized by a series of Givens rotations as
\[ R_{-s}^{(i)} = Q^{(i)} \begin{bmatrix} R^{(i)} \\ 0_{1\times(L-i)} \end{bmatrix} \tag{15} \]
together with
\[ (Q^{(i)})^T T^{(i-1)} = \begin{bmatrix} T^{(i)} \\ t_{-s}^{(i)} \end{bmatrix} \tag{16} \]
where $Q^{(i)}$ consists of a series of Givens rotations satisfying $(Q^{(i)})^T Q^{(i)} = Q^{(i)}(Q^{(i)})^T = I_{(L-i+1)\times(L-i+1)}$, $R^{(i)}$ is an upper triangular matrix, and $t_{-s}^{(i)}$ is the last row vector of the transformed targets. Obviously, $\| t_{-s}^{(i)} \|_F^2$ is the increase of the additional residual error due to the removal of the regressor $r_s^{(i-1)}$. In fact, Eq. (15) can become much simpler, since the Givens rotations only need to be performed on the columns of the submatrix that follow $r_s^{(i-1)}$ in $R^{(i-1)}$. Hence, together with (16), they are simplified as
\[ \begin{bmatrix} R_{-s,j_s}^{(i)} & T_{j_s}^{(i-1)} \end{bmatrix} = Q_{j_s}^{(i)} \begin{bmatrix} R_{j_s}^{(i)} & T_{j_s}^{(i)} \\ 0_{1\times(L-i-j_s+1)} & t_{-s}^{(i)} \end{bmatrix}, \quad j_s = 1, \cdots, L_i - 1 \tag{17} \]
where $j_s$ is the column index of the regressor $r_s^{(i-1)}$ in the matrix $R^{(i-1)}$, $R_{-s,j_s}^{(i)}$ and $R_{j_s}^{(i)}$ are the submatrices including the elements from the $j_s$th row and $j_s$th column to the end of $R_{-s}^{(i)}$ and $R^{(i)}$, respectively, and $T_{j_s}^{(i-1)}$ and $T_{j_s}^{(i)}$ are the submatrices including the elements from the $j_s$th row to the end of $T^{(i-1)}$ and $T^{(i)}$, respectively.


The orthogonal matrix $Q_{j_s}^{(i)}$ in (17) is governed by
\[ Q^{(i)} = \begin{bmatrix} I_{(j_s-1)\times(j_s-1)} & \\ & Q_{j_s}^{(i)} \end{bmatrix} \tag{18} \]
Hence, the index of the $i$th regressor to be removed is
\[ s^{\dagger} = \arg\min_{s \in P} \frac{\| t_{-s}^{(i)} \|_F^2}{\| T \|_F^2} \tag{19} \]
Eq. (19) shows that the regressor incurring the least increase of the additional residual error will be removed. After removing the $i$th regressor $r_{s^{\dagger}}^{(i-1)}$, the index set $P$ is updated as $P \leftarrow P \backslash \{s^{\dagger}\}$. DP-ELM does not stop until the following criterion is reached:
\[ \frac{\| T^{(i)} \|_F^2}{\| T \|_F^2} \leq 1 - \rho \tag{20} \]
where $0 < \rho \ll 1$, which can balance the training error against the model complexity.

Fig. 11. The testing accuracy on Energy efficiency (a) sigmoid and (b) RBF.
Fig. 12. The testing accuracy on Sml2010 (a) sigmoid and (b) RBF.

3.2. DP-RELM

Theorem 2. The minimizer of Eq. (6) is equivalent to the following minimizer:
\[ \min_{\beta} \; G'_{\mathrm{RELM}} = \| R_{LL}\beta - \hat{T}_{Lm} \|_F^2 \tag{21} \]
where
\[ \begin{bmatrix} H \\ \sqrt{\mu}\, I_{LL} \end{bmatrix} = Q \begin{bmatrix} R_{LL} \\ 0_{N\times L} \end{bmatrix} \tag{22} \]
\[ Q^T \begin{bmatrix} T \\ 0_{Lm} \end{bmatrix} = \begin{bmatrix} \hat{T}_{Lm} \\ \tilde{T}_{Nm} \end{bmatrix} \tag{23} \]
Here $Q$ is an $(N+L) \times (N+L)$ orthogonal matrix satisfying $Q^T Q = Q Q^T = I$, and $R_{LL}$ is an upper triangular matrix with rank $L$. The proof of Theorem 2 is similar to that of Theorem 1, so we omit it here. The matrices $R_{LL}$ and $\hat{T}_{Lm}$ in Theorems 1 and 2 are different, but we adopt the same symbols, which does not lead to misunderstanding. When we obtain $R_{LL}$ and $\hat{T}_{Lm}$ from Theorem 2, DP-RELM is naturally obtained by adopting the same realization procedure as DP-ELM.
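The destructive step shared by DP-ELM and DP-RELM can be prototyped directly from (14)-(17) and Theorem 2. In the sketch below a small QR factorization stands in for the explicit sequence of Givens rotations, which is numerically equivalent for reading off the residual increase $\| t_{-s}^{(i)} \|_F^2$; the matrices are synthetic stand-ins of our own, not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
L, m, mu = 6, 2, 2.0 ** -10
R = np.triu(rng.standard_normal((L, L)))       # current upper-triangular R^(i-1)
T = rng.standard_normal((L, m))                # current transformed targets T^(i-1)

def remove_regressor(R, T, j):
    """Drop column j of R and retriangularize; returns (R_new, T_new, t_last).
    A complete QR factorization stands in for the Givens sweep of Eqs. (15)-(17)."""
    R_del = np.delete(R, j, axis=1)            # square matrix minus one column
    Q, R_new = np.linalg.qr(R_del, mode="complete")
    T_rot = Q.T @ T
    return R_new[:-1, :], T_rot[:-1, :], T_rot[-1, :]

# DP-style scan: by Theorem 3, dropping column j raises the residual by ||t||_F^2.
increases = [float(np.sum(remove_regressor(R, T, j)[2] ** 2)) for j in range(L)]
print(int(np.argmin(increases)), increases)

# DP-RELM uses the same step after building R and T_hat from the augmented system of
# Eqs. (22)-(23): stack H on top of sqrt(mu) * I (and T on top of zeros) before the initial QR.
```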


Fig. 13. The testing accuracy on Parkinsons (a) sigmoid and (b) RBF.
Fig. 14. The testing accuracy on Airfoil (a) sigmoid and (b) RBF.

4. ADP-ELM and ADP-RELM

4.1. An equivalent measure

Assume that at the $(i-1)$th iteration in DP-ELM or DP-RELM, $R^{(i-1)}$ and $T^{(i-1)}$ are obtained, and let
\[ \hat{G}^{(i-1)} = \min_{\beta^{(i-1)}} \left\{ G^{(i-1)} = \| R^{(i-1)}\beta^{(i-1)} - T^{(i-1)} \|_F^2 \right\} \tag{24} \]
Evidently, $\hat{G}^{(i-1)} = 0$. When the regressor $r_s^{(i-1)}$ is removed at the $i$th iteration, Eq. (24) becomes
\[ \hat{G}_{-s}^{(i)} = \min_{\beta^{(i-1)}} \left\{ G_{-s}^{(i)} = \| R_{-s}^{(i)}\beta^{(i-1)} - T^{(i-1)} \|_F^2 \right\} \tag{25} \]
Then, define
\[ \Delta_{-s}^{(i)} = \hat{G}_{-s}^{(i)} - \hat{G}^{(i-1)} = \hat{G}_{-s}^{(i)} \tag{26} \]
Here $\Delta_{-s}^{(i)}$ represents the increase of the objective function in (24) due to the removal of the regressor $r_s^{(i-1)}$. Therefore, the following theorem is obtained.

Theorem 3. The following formula holds:
\[ \Delta_{-s}^{(i)} = \| t_{-s}^{(i)} \|_F^2 \tag{27} \]
where $t_{-s}^{(i)}$ is from (16).

Proof. Substituting (15) and (16) into (25), we get
\[ G_{-s}^{(i)} = \| R_{-s}^{(i)}\beta^{(i)} - T^{(i-1)} \|_F^2 = \left\| Q^{(i)}\begin{bmatrix} R^{(i)} \\ 0_{1\times(L-i)} \end{bmatrix}\beta^{(i)} - T^{(i-1)} \right\|_F^2 = \left\| \begin{bmatrix} R^{(i)} \\ 0_{1\times(L-i)} \end{bmatrix}\beta^{(i)} - \begin{bmatrix} T^{(i)} \\ t_{-s}^{(i)} \end{bmatrix} \right\|_F^2 = \| R^{(i)}\beta^{(i)} - T^{(i)} \|_F^2 + \| t_{-s}^{(i)} \|_F^2 \tag{28} \]
Because $\| R^{(i)}\hat{\beta}^{(i)} - T^{(i)} \|_F^2 = 0$, we have $\hat{G}_{-s}^{(i)} = \| t_{-s}^{(i)} \|_F^2$. □

To remove the regressor $r_{s^{\dagger}}^{(i-1)}$ at the $i$th iteration, we need to compute $\| t_{-s}^{(i)} \|_F^2$ a total of $L-i+1$ times, which consumes a lot of time. From Theorem 3, note that the issue of computing $\| t_{-s}^{(i)} \|_F^2$ can be sidestepped by finding the equivalent measure $\Delta_{-s}^{(i)}$. If we can seek a computationally cheap method of calculating $\Delta_{-s}^{(i)}$, then the bottleneck of computing $\| t_{-s}^{(i)} \|_F^2$ will be circumvented.

4.2. An accelerating scheme

If at the $i$th iteration each $\Delta_{-s}^{(i)}$ is computed directly according to (25), the computational burden is very heavy, no cheaper than that of computing $\| t_{-s}^{(i)} \|_F^2$. Hence, it is necessary to seek a method of computing each $\Delta_{-s}^{(i)}$ cheaply. Assume that the column index of the regressor $r_s^{(i-1)}$ to be removed in the matrix $R^{(i-1)}$ is $j_s$. Then the following theorem holds.


Fig. 15. The testing accuracy on Abalone (a) sigmoid and (b) RBF.
Fig. 16. The testing accuracy on Winequality white (a) sigmoid and (b) RBF.

Theorem 4.
\[ \Delta_{-s}^{(i)} = \frac{\| \hat{\beta}_{j_s}^{(i-1)} \|_2^2}{\Lambda_{j_s}^{(i-1)}} \tag{29} \]
holds, where $\hat{\beta}_{j_s}^{(i-1)}$ is the $j_s$th row vector of the minimizer $\hat{\beta}^{(i-1)}$ of Eq. (24), and $\Lambda_{j_s}^{(i-1)}$ is the $j_s$th diagonal element of $\left( (R^{(i-1)})^T R^{(i-1)} \right)^{-1}$.

Proof. First, consider the following optimization problem:
\[ \hat{G}_{-s}^{(i-1)} = \min_{\beta_{-s}^{(i-1)}} \left\{ G_{-s}^{(i-1)} = \| R^{(i-1)}\beta_{-s}^{(i-1)} - T^{(i-1)} \|_F^2 + \| \Theta\beta_{-s}^{(i-1)} \|_F^2 \right\} \tag{30} \]
where $\Theta = \mathrm{diag}\left( 0_1, \cdots, 0_{j_s-1}, \theta, 0_{j_s+1}, \cdots, 0_{L-i+1} \right)$. Comparing (30) with (25), note that
\[ \hat{G}_{-s}^{(i)} = \lim_{\theta\to+\infty} \hat{G}_{-s}^{(i-1)} \tag{31} \]
Letting $dG_{-s}^{(i-1)}/d\beta_{-s}^{(i-1)} = 0$ gives
\[ \hat{\beta}_{-s}^{(i-1)} = \left( (R^{(i-1)})^T R^{(i-1)} + \Theta^T\Theta \right)^{-1} (R^{(i-1)})^T T^{(i-1)} \tag{32} \]
Plugging (32) into (30) yields
\[ \hat{G}_{-s}^{(i-1)} = \| T^{(i-1)} \|_F^2 - \mathrm{tr}\left( (T^{(i-1)})^T R^{(i-1)} \left( (R^{(i-1)})^T R^{(i-1)} + \Theta^T\Theta \right)^{-1} (R^{(i-1)})^T T^{(i-1)} \right) \tag{33} \]
where $\mathrm{tr}(\cdot)$ represents the trace of a square matrix. In addition,
\[ \hat{G}^{(i-1)} = \| T^{(i-1)} \|_F^2 - \mathrm{tr}\left( (T^{(i-1)})^T R^{(i-1)} \left( (R^{(i-1)})^T R^{(i-1)} \right)^{-1} (R^{(i-1)})^T T^{(i-1)} \right) \tag{34} \]
According to the Woodbury formula [32], we get
\[ \left( (R^{(i-1)})^T R^{(i-1)} + \Theta^T\Theta \right)^{-1} = \left( (R^{(i-1)})^T R^{(i-1)} \right)^{-1} - \left( (R^{(i-1)})^T R^{(i-1)} \right)^{-1}\Theta\left( I + \Theta\left( (R^{(i-1)})^T R^{(i-1)} \right)^{-1}\Theta \right)^{-1}\Theta\left( (R^{(i-1)})^T R^{(i-1)} \right)^{-1} \tag{35} \]
So,
\[ \hat{G}_{-s}^{(i-1)} - \hat{G}^{(i-1)} = \mathrm{tr}\left( (\hat{\beta}^{(i-1)})^T\Theta\left( I + \Theta\left( (R^{(i-1)})^T R^{(i-1)} \right)^{-1}\Theta \right)^{-1}\Theta\hat{\beta}^{(i-1)} \right) \tag{36} \]
Hence
\[ \Delta_{-s}^{(i)} = \hat{G}_{-s}^{(i)} - \hat{G}^{(i-1)} = \lim_{\theta\to+\infty} \mathrm{tr}\left( (\hat{\beta}^{(i-1)})^T\Theta\left( I + \Theta\left( (R^{(i-1)})^T R^{(i-1)} \right)^{-1}\Theta \right)^{-1}\Theta\hat{\beta}^{(i-1)} \right) \tag{37} \]


which is simplified as
\[ \Delta_{-s}^{(i)} = \lim_{\theta\to+\infty} \frac{\| \hat{\beta}_{j_s}^{(i-1)} \|_2^2\, \theta^2}{1 + \Lambda_{j_s}^{(i-1)}\theta^2} = \frac{\| \hat{\beta}_{j_s}^{(i-1)} \|_2^2}{\Lambda_{j_s}^{(i-1)}} \tag{38} \]
Up to now, the proof has been finished. □

Since $R^{(i-1)}$ is an upper triangular matrix, on the basis of matrix theory $(R^{(i-1)})^{-1}$ is easily obtained with backward substitutions, and it is also an upper triangular matrix, here denoted by $\left( (R^{(i-1)})^{-1} \right)^T = \left[ r'_1(i-1), \cdots, r'_{L-i+1}(i-1) \right]$. Therefore
\[ \Lambda_{j_s}^{(i-1)} = \| r'_{j_s}(i-1) \|_2^2 \tag{39} \]
Substituting (39) into (29), we get
\[ \Delta_{-s}^{(i)} = \frac{\| \hat{\beta}_{j_s}^{(i-1)} \|_2^2}{\| r'_{j_s}(i-1) \|_2^2} \tag{40} \]
Hence, the index of the $i$th regressor to be removed for ADP-ELM and ADP-RELM is found as
\[ s^{\dagger} = \arg\min_{s \in P} \frac{\| \hat{\beta}_{j_s}^{(i-1)} \|_2^2}{\| r'_{j_s}(i-1) \|_2^2} \tag{41} \]
As for the stopping criterion, Eq. (20) is also suitable for ADP-ELM and ADP-RELM. Of course, we can also predefine a positive integer $M$ for ADP-ELM and ADP-RELM: when $M$ regressors are left, the procedure of removing the insignificant hidden nodes terminates, which is convenient for controlling the network size exactly.

Fig. 17. The testing accuracy on CCPP (a) sigmoid and (b) RBF.
Fig. 18. The testing accuracy on Ailerons (a) sigmoid and (b) RBF.

4.3. Analysis of computational complexity

Assume that ADP-ELM or ADP-RELM is implemented at the $i$th iteration. To remove the regressor $r_{s^{\dagger}}^{(i-1)}$ from the candidate matrix $R^{(i-1)}$ using Eq. (41), we first need to compute $(R^{(i-1)})^{-1}$. Due to $R^{(i-1)}$ being an upper triangular matrix, $(R^{(i-1)})^{-1}$ can easily be obtained with backward substitutions. When $(R^{(i-1)})^{-1}$ is found, $\hat{\beta}^{(i-1)}$ and $\| r'_{j_s}(i-1) \|_2^2$ can be acquired. Then, Eq. (41) is used to find the regressor $r_{s^{\dagger}}^{(i-1)}$ to be removed. For the above procedure, the computational complexity of each step is listed in Table 1.

When DP-ELM or DP-RELM is at the $i$th iteration, a series of Givens rotations needs to be constructed in order to obtain $t_{-s}^{(i)}$ with (17). For example, assume that $u$ is a two-dimensional vector denoted by $u = [u_1, u_2]^T$; then a Givens rotation is simply shown as
\[ \begin{bmatrix} v & w \\ -w & v \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} = \begin{bmatrix} \| u \|_2 \\ 0 \end{bmatrix} \tag{42} \]
where $v = u_1/r$, $w = u_2/r$, and $r = \sqrt{u_1^2 + u_2^2}$. Notice that constructing a Givens rotation needs four multiplications, one addition, two divisions, and one square root.
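Equations (39)-(41) are what make the scheme cheap: one triangular inverse yields both $\hat{\beta}^{(i-1)}$ and all the $\Lambda_{j_s}^{(i-1)}$ at once, so no Givens sweep per candidate is needed. The sketch below checks numerically that this measure coincides with the direct residual increase of Theorem 3; the test matrices are arbitrary stand-ins of our own.

```python
import numpy as np

rng = np.random.default_rng(2)
L, m = 6, 2
R = np.triu(rng.standard_normal((L, L))) + 3.0 * np.eye(L)   # well-conditioned upper-triangular R^(i-1)
T = rng.standard_normal((L, m))                               # transformed targets T^(i-1)

# ADP measure, Eqs. (39)-(41): Lambda_j is the squared norm of the j-th row of R^{-1},
# and beta_hat = R^{-1} T; both come from a single triangular inverse.
R_inv = np.linalg.inv(R)                       # backward substitution in an actual implementation
beta_hat = R_inv @ T
Lam = np.sum(R_inv ** 2, axis=1)               # Eq. (39)
delta_adp = np.sum(beta_hat ** 2, axis=1) / Lam

# Reference DP measure, Theorem 3: residual increase ||t||_F^2 from actually dropping each column.
def residual_increase(R, T, j):
    Q, _ = np.linalg.qr(np.delete(R, j, axis=1), mode="complete")
    return np.sum((Q.T @ T)[-1, :] ** 2)

delta_dp = np.array([residual_increase(R, T, j) for j in range(L)])
print(np.allclose(delta_adp, delta_dp), int(np.argmin(delta_adp)))
```

Both vectors agree, so the removal index $s^{\dagger}$ selected by (41) is identical to the one DP-ELM would pick, only obtained at a lower per-iteration cost (compare Tables 1 and 2).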


When $t_{-s}^{(i)}$ is obtained using (17), the index $s^{\dagger}$ is determined via (19). The computational complexity of DP-ELM or DP-RELM is tabulated in Table 2. Comparing Table 1 with Table 2, the conclusion is easily drawn that the computational complexity of ADP-ELM/ADP-RELM is lower than that of DP-ELM/DP-RELM for every operation. Naturally, ADP-ELM/ADP-RELM accelerates the training process of DP-ELM/DP-RELM while keeping the same accuracy.

Fig. 19. The testing accuracy on Elevators (a) sigmoid and (b) RBF.
Fig. 20. The testing accuracy on Pole (a) sigmoid and (b) RBF.

4.4. The flowchart of ADP-ELM and ADP-RELM

In summary, the flowchart of ADP-ELM/ADP-RELM is depicted as follows.

Algorithm 1. ADP-ELM/ADP-RELM
1. Initializations: input the training samples $\{(x_i, t_i)\}_{i=1}^{N}$; choose the type of activation function; choose a small $\rho > 0$ or a positive integer $M$ for controlling the number of final hidden nodes exactly; let $P = \{1, \cdots, L\}$ and $i = 1$.
2. Generate the hidden layer output matrix $H$ with $L$ random hidden nodes.
3. For ADP-RELM, determine the appropriate regularization parameter $\mu$.
4. Obtain $R$ and $\hat{T}$ according to (9) and (10) ((22) and (23)), respectively, for ADP-ELM (ADP-RELM).
5. Let $R^{(0)} = R$ and $T^{(0)} = \hat{T}$.
6. If Eq. (20) is satisfied or $i > L - M$,
7. go to step 16;
8. else
9. compute the upper triangular matrix $(R^{(i-1)})^{-1}$ based on $R^{(i-1)}$ with backward substitutions;
10. calculate $\hat{\beta}^{(i-1)} = (R^{(i-1)})^{-1} T^{(i-1)}$;
11. determine the regressor $r_{s^{\dagger}}^{(i-1)}$ to be removed according to (41);
12. obtain $R_{-s^{\dagger}}^{(i)}$ according to (14);
13. get $R^{(i)}$ and $T^{(i)}$ from (15) and (16), respectively;
14. let $P \leftarrow P \backslash \{s^{\dagger}\}$ and $i \leftarrow i + 1$, and go to step 6;
15. end if.
16. Solve $R^{(i)}\hat{\beta}^{(i)} = T^{(i)}$ with backward substitutions.
17. Output ADP-ELM/ADP-RELM: $f(x) = \sum_{j \in P} \hat{\beta}_j^{(i)} h(x; a_j, b_j)$.
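Algorithm 1 can be condensed into a short loop. The following Python sketch follows the step numbering loosely and keeps exactly M hidden nodes; the synthetic data, the sigmoid feature map with a bias term, and all variable names are our own assumptions rather than the authors' code, and a QR factorization again stands in for the Givens-based updates of (15)-(16).

```python
import numpy as np
from scipy.linalg import solve_triangular

def adp_elm(X, T, L=60, M=20, mu=None, rng=np.random.default_rng(0)):
    """Compact sketch of Algorithm 1: ADP-ELM, or ADP-RELM when mu is given."""
    N, n = X.shape
    A = rng.uniform(-1.0, 1.0, (n, L))
    b = rng.uniform(-1.0, 1.0, (1, L))
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))                   # step 2: random hidden layer
    if mu is not None:                                       # steps 3-4: Eqs. (22)-(23) for RELM
        H_aug = np.vstack([H, np.sqrt(mu) * np.eye(L)])
        T_aug = np.vstack([T, np.zeros((L, T.shape[1]))])
    else:                                                    # step 4: Eqs. (9)-(10) for ELM
        H_aug, T_aug = H, T
    Q, R = np.linalg.qr(H_aug)                               # reduced QR: R is L x L triangular
    Tt = Q.T @ T_aug                                         # step 5: R^(0), T^(0)
    P = list(range(L))
    while len(P) > M:                                        # steps 6-15: destructive loop
        R_inv = solve_triangular(R, np.eye(len(P)))          # step 9: backward substitutions
        beta = R_inv @ Tt                                    # step 10
        delta = np.sum(beta**2, axis=1) / np.sum(R_inv**2, axis=1)   # Eq. (41)
        j = int(np.argmin(delta))                            # step 11: least significant regressor
        Qg, R_new = np.linalg.qr(np.delete(R, j, axis=1), mode="complete")  # steps 12-13
        R, Tt = R_new[:-1, :], (Qg.T @ Tt)[:-1, :]
        P.pop(j)                                             # step 14
    beta = solve_triangular(R, Tt)                           # step 16
    return A[:, P], b[:, P], beta                            # step 17: kept nodes and output weights

# Hypothetical usage on synthetic data:
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (400, 8))
T = np.sin(X[:, :1]) + 0.1 * rng.standard_normal((400, 1))
A, b, beta = adp_elm(X, T, L=60, M=20, mu=2.0**-10)
```

Setting mu=None gives the ADP-ELM variant; passing a value for mu builds the augmented system of (22)-(23) and yields ADP-RELM.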

5. Experiments

In this section, we experimentally study the usefulness of the proposed scheme in accelerating the training processes of DP-ELM and DP-RELM.


Table 4 The detailed experimental results on benchmark data sets. Data sets

Hidden node type

Algorithms

#Hidden nodes

RMSE

Energy efficiency

Sigmoid

ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM

190 80 81 81 190 130 105 105 220 82 86 86 220 208 182 182

4.8952E  027 5.0472E  03 4.8988E  027 3.5454E  03 4.8711E  027 3.6369E  03 4.8711E  027 3.6369E  03 4.3066E  027 1.6941E  03 4.2996E  027 1.8785E  03 4.2993E  027 1.9261E  03 4.2993E  027 1.9261E  03 4.0442E  02 71.8273E  03 4.0288E  027 9.3419E  04 4.0388E  027 2.2874E  03 4.0388E  027 2.2874E  03 3.5343E  027 1.0892E  03 3.5393E  027 1.1035E 03 3.5380E  027 9.7920E  04 3.5380E  027 9.7920E  04

0.0187 0.002 9.622 7 0.098 16.909 7 0.361 6.930 7 0.046 0.0117 0.002 11.6477 0.337 16.0717 0.187 6.363 7 0.137 0.436 7 0.010 14.095 7 0.571 28.3487 0.857 11.7337 0.100 0.426 7 0.011 18.6147 0.670 14.3197 0.101 5.8017 0.112

0.030 7 0.002 0.0117 0.000 0.0127 0.002 0.0127 0.002 0.0097 0.001 0.006 7 0.001 0.004 7 0.000 0.0057 0.004 0.302 7 0.006 0.1157 0.009 0.1187 0.001 0.1177 0.001 0.3017 0.008 0.280 7 0.002 0.260 7 0.012 0.258 7 0.013

ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM

80 29 32 32 80 75 65 65 90 36 17 17 90 60 34 34

7.5580E  027 6.1177E 03 7.5289E  027 3.8640E  03 7.5570E  027 4.4971E  03 7.5570E  027 4.4971E  03 7.3716E  02 72.8558E  03 7.3731E  027 2.8613E  03 7.3728E  027 2.9638E  03 7.3728E  027 2.9638E  03 7.7460E  02 79.3898E  03 7.6741E  027 6.1075E 03 7.6085E  027 5.5149E  03 7.6085E  027 5.5149E  03 7.3686E  027 5.3856E  03 7.3654E  027 5.6984E  03 7.3307E  02 7 6.3927E  03 7.3307E  02 7 6.3927E  03

0.032 7 0.003 2.759 7 0.047 3.1797 0.060 2.558 7 0.015 0.0197 0.002 3.228 7 0.009 2.6787 0.071 2.439 7 0.056 1.1387 0.013 4.5317 0.019 5.0707 0.064 4.202 7 0.060 1.1167 0.006 4.932 7 0.016 4.9487 0.016 4.308 7 0.013

0.0357 0.001 0.0147 0.002 0.0157 0.002 0.0157 0.004 0.0157 0.001 0.0147 0.003 0.0137 0.003 0.0137 0.003 0.430 7 0.004 0.1777 0.006 0.087 7 0.004 0.087 7 0.004 0.430 7 0.008 0.286 7 0.003 0.1597 0.002 0.1597 0.002

ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM

100 67 52 52 100 85 75 75 70 41 46 46 70 62 59 59

2.0338E 017 2.4357E  03 2.0338E 017 1.5048E  03 2.0336E 017 1.7314E  03 2.0336E 017 1.7314E  03 2.0224E 017 1.5824E 03 2.0224E 017 1.4386E  03 2.0229E  017 1.4048E  03 2.0229E  017 1.4048E  03 2.1198E  017 2.7295E  03 2.1193E  017 1.5492E  03 2.1190E  017 2.1838E  03 2.1190E  017 2.1838E  03 2.1090E  017 1.2686E  03 2.1093E  017 1.2167E 03 2.1093E  017 1.2610E  03 2.1093E  017 1.2610E  03

0.052 7 0.009 6.2757 0.044 6.429 7 0.053 5.360 7 0.011 0.032 7 0.018 6.6337 0.011 5.8577 0.013 5.3117 0.050 1.2477 0.020 4.989 7 0.088 4.902 7 0.041 4.623 7 0.024 1.252 7 0.017 5.2187 0.031 4.7217 0.042 4.6147 0.019

0.020 7 0.001 0.0137 0.003 0.0117 0.002 0.0117 0.003 0.0217 0.001 0.0197 0.005 0.0177 0.007 0.0167 0.005 0.581 7 0.016 0.343 7 0.022 0.3787 0.003 0.3797 0.003 0.583 7 0.013 0.503 7 0.001 0.4837 0.004 0.4847 0.008

ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM

100 95 98 98 100 78 87 87 90 88 73 73 90 88 70 70

8.8070E 02 74.2246E 03 8.8091E  027 4.0819E  03 8.8077E 02 74.2395E  03 8.8077E 02 74.2395E  03 8.6679E  027 2.9691E  03 8.6737E  02 71.9706E  03 8.6713E  02 7 2.8901E  03 8.6713E  02 7 2.8901E  03 8.3086E  027 1.4600E  03 8.3040E  027 1.4769E  03 8.3085E  027 1.7361E  03 8.3085E  027 1.7361E  03 8.3111E  02 71.4513E  03 8.3101E  027 1.4383E  03 8.3117E  027 1.6314E  03 8.3117E  02 7 1.6314E  03

0.0147 0.002 1.6767 0.022 0.202 7 0.015 0.1177 0.018 0.0097 0.002 1.664 7 0.038 0.790 7 0.033 0.4157 0.010 0.3177 0.010 1.5727 0.040 1.1917 0.023 0.7317 0.007 0.309 7 0.006 1.595 7 0.019 1.1877 0.009 0.8047 0.008

0.006 7 0.004 0.0057 0.003 0.0057 0.002 0.0057 0.001 0.0147 0.001 0.0117 0.002 0.0127 0.003 0.0127 0.003 0.2797 0.009 0.269 7 0.001 0.2067 0.001 0.205 7 0.002 0.2727 0.011 0.2677 0.002 0.208 7 0.001 0.208 7 0.001

40 40 28 28 40 40 30

7.4643E  027 4.2603E  04 7.4643E  027 4.2603E  04 7.4624E 027 2.7234E  04 7.4624E 027 2.7234E  04 7.4583E  027 4.1528E  04 7.4583E  027 4.1528E  04 7.4587E  027 4.3381E 04

0.0167 0.001 1.0517 0.009 1.0157 0.048 0.953 7 0.043 0.0127 0.001 1.0757 0.003 0.999 7 0.005

0.285 7 0.001 0.286 7 0.003 0.166 7 0.003 0.1697 0.007 0.0097 0.001 0.0097 0.003 0.006 7 0.004

RBF

Sml2010

Sigmoid

RBF

Parkinsons

Sigmoid

RBF

Airfoil

Sigmoid

RBF

Abalone

Sigmoid

ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM

Training time (s)

Testing time (s)


Table 4 (continued ) Data sets

Hidden node type

RBF

Winequality white

Sigmoid

RBF

CCPP

Sigmoid

RBF

Ailerons

Sigmoid

RBF

Elevators

Sigmoid

RBF

Algorithms

#Hidden nodes

RMSE

Training time (s)

Testing time (s)

ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM

30 60 36 30 30 60 53 42 42

7.4587E  027 4.3381E 04 7.4558E  027 4.8491E  04 7.4552E  027 3.5553E  04 7.4513E  027 2.9363E  04 7.4513E  027 2.9363E  04 7.4279E 02 71.7412E  04 7.4276E  027 1.8356E 04 7.4285E  027 2.0016E  04 7.4285E  027 2.0016E  04

0.9717 0.003 0.7387 0.019 2.456 7 0.079 2.4957 0.032 2.2477 0.028 0.7137 0.010 2.5247 0.042 2.408 7 0.019 2.236 7 0.025

0.0067 0.003 0.3557 0.009 0.2147 0.005 0.1797 0.003 0.1807 0.008 0.360 7 0.024 0.3137 0.002 0.298 7 0.007 0.2977 0.012

ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM

90 58 53 53 90 69 79 79 70 50 55 55 70 58 47 47

1.2076E  017 8.1419E  04 1.2077E 017 7.3178E  04 1.2075E  017 7.9765E  04 1.2075E  017 7.9765E  04 1.2024E  017 4.6597E  04 1.2024E  017 4.7537E  04 1.2024E  017 4.4131E  04 1.2024E  017 4.4131E  04 1.2122E  01 75.1000E  04 1.2123E  01 75.0902E  04 1.2121E  017 5.1387E  04 1.2121E  017 5.1387E  04 1.2109E  017 4.6537E  04 1.2111E  017 4.5950E  04 1.2111E  017 4.6525E  04 1.2111E  017 4.6525E  04

0.032 7 0.004 3.4877 0.058 3.583 7 0.058 2.902 7 0.008 0.0247 0.016 3.6747 0.059 2.856 7 0.059 2.6417 0.016 0.923 7 0.027 4.0217 0.012 3.289 7 0.031 2.8707 0.009 0.928 7 0.035 4.1417 0.036 3.4727 0.039 3.096 7 0.022

0.030 7 0.001 0.0187 0.003 0.0167 0.004 0.0167 0.004 0.0197 0.003 0.0107 0.002 0.0107 0.003 0.0137 0.004 0.3117 0.026 0.208 7 0.004 0.236 7 0.005 0.2377 0.010 0.302 7 0.013 0.2337 0.004 0.1907 0.005 0.1917 0.006

ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM

220 211 174 174 220 206 180 180 230 208 205 205 230 189 150 150

5.3057E  027 9.3348E  05 5.3064E  027 8.5253E  05 5.3064E  027 1.0417E  04 5.3064E  027 1.0417E  04 5.3032E  027 4.6702E  05 5.3031E  027 4.2081E  05 5.3034E  027 4.9733E  05 5.3034E  027 4.9733E  05 5.2954E  027 1.0245E 04 5.2962E  027 9.6408E  05 5.2963E  027 9.2246E 05 5.2963E  027 9.2246E 05 5.3080E  027 4.3402E  05 5.3083E  027 4.7827E  05 5.3082E  027 6.4042E  05 5.3082E  027 6.4042E  05

0.305 7 0.009 23.2777 1.231 18.5117 1.532 9.0667 0.815 0.093 7 0.015 22.4987 0.250 19.654 7 2.670 8.6317 0.346 5.8167 0.037 30.308 7 0.107 19.0627 0.060 12.542 7 0.088 5.543 7 0.029 30.502 7 0.782 36.2467 1.428 18.8067 0.337

0.0407 0.002 0.039 7 0.006 0.0337 0.005 0.0337 0.004 0.0437 0.001 0.036 7 0.003 0.034 7 0.003 0.034 7 0.003 3.2357 0.024 2.960 7 0.010 2.925 7 0.011 2.9247 0.017 3.2177 0.015 2.7167 0.012 2.1177 0.006 2.1157 0.008

ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM

120 31 35 35 120 79 74 74 100 22 17 17 100 63 56 56

4.5491E  027 1.9189E  04 4.5495E  027 9.0525E  05 4.5483E  027 1.9785E  04 4.5483E  027 1.9785E  04 4.5153E  027 1.3803E  04 4.5155E  027 1.5541E  04 4.5151E  027 1.5914E  04 4.5151E  027 1.5914E  04 4.6844E  027 5.1310E  04 4.6754E 02 74.6998E  04 4.6828E  027 2.4087E  04 4.6828E  027 2.4087E  04 4.5955E  027 2.5765E  04 4.5975E 027 2.6970E  04 4.5975E 027 2.7819E  04 4.5975E 027 2.7819E  04

0.1377 0.003 19.4647 0.078 21.8047 0.155 19.5647 0.039 0.0757 0.019 21.282 7 0.123 20.8337 0.296 19.3717 0.157 3.298 7 0.205 18.6877 0.033 20.202 7 0.057 18.9157 0.031 3.1407 0.010 19.943 7 0.037 19.783 7 0.070 18.925 7 0.094

0.0467 0.001 0.028 7 0.004 0.029 7 0.005 0.029 7 0.004 0.0497 0.005 0.0487 0.012 0.0477 0.008 0.0477 0.011 2.9047 0.016 0.6777 0.005 0.596 7 0.048 0.595 7 0.007 2.905 7 0.013 1.9127 0.012 1.9017 0.010 1.902 7 0.014

ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM

60 40 28 28 60 55 43 43 50 34 28 28 50 40 34

3.9780E 027 9.2234E  04 3.9719E  027 3.4326E  04 3.9783E 027 8.2234E  04 3.9783E 027 8.2234E  04 3.9111E  027 6.3041E  04 3.9111E  027 6.0934E  04 3.9103E  02 7 5.1272E  04 3.9103E  027 5.1272E  04 4.2481E  027 7.2707E  04 4.2314E  02 71.2490E  03 4.2291E  027 1.1311E  03 4.2291E  027 1.1311E  03 4.1710E  027 6.4155E  04 4.1738E  027 9.138E  04 4.1697E  02 78.0712E  04

0.0767 0.004 13.543 7 0.011 13.5617 0.086 13.260 7 0.029 0.0357 0.002 13.759 7 0.012 13.465 7 0.071 13.3537 0.028 1.9157 0.009 13.0757 0.047 13.0797 0.068 12.883 7 0.070 1.926 7 0.022 13.2357 0.021 13.0627 0.031

0.0277 0.001 0.025 7 0.005 0.0247 0.005 0.0247 0.007 0.0357 0.001 0.029 7 0.003 0.025 7 0.008 0.025 7 0.002 1.753 7 0.090 1.1707 0.007 0.980 7 0.005 0.983 7 0.013 1.7107 0.040 1.3547 0.004 1.1797 0.011


Table 4 (continued ) Data sets

Pole

Hidden node type

Sigmoid

RBF

Algorithms

#Hidden nodes

Testing time (s)

34

4.1697E  02 78.0712E  04

12.962 7 0.063

1.1787 0.010

ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM

300 225 187 187 300 248 280 280 300 93 80 80 300 242 233 233

2.0204E  01 74.6552E  03 2.0190E  017 4.6273E  03 2.0203E  017 3.9885E  03 2.0203E  017 3.9885E  03 1.9994E  01 72.9768E  03 1.9994E  01 72.9057E  03 1.9993E  01 72.9785E 03 1.9993E  01 72.9785E 03 2.4132E  017 8.4840E  03 2.4119E  01 74.3560E  03 2.4113E  01 74.0748E  03 2.4113E  01 74.0748E  03 2.3101E  017 3.8627E 03 2.3099E  017 4.0630E  03 2.3098E  017 3.9031E  03 2.3098E  017 3.9031E  03

0.7577 0.047 60.0077 1.070 75.7077 9.695 34.563 7 1.071 0.305 7 0.036 61.292 7 2.037 28.202 7 0.162 16.946 7 0.502 13.384 7 0.263 55.5787 0.377 105.659 7 0.461 54.8747 0.263 12.8217 0.290 74.855 7 0.548 76.1527 2.083 41.526 7 1.350

0.080 7 0.007 0.0747 0.011 0.0577 0.002 0.0577 0.005 0.090 7 0.022 0.0757 0.005 0.0767 0.001 0.0767 0.002 6.323 7 0.132 2.038 7 0.007 1.666 7 0.012 1.666 7 0.009 6.3307 0.155 5.1937 0.019 5.1137 0.026 5.1137 0.139

The experimental environment consists of a Windows 7 32-bit operating system, an Intel Core i3-2310M CPU @ 2.10 GHz, 2.00 GB RAM, and the Matlab 2013b platform. Ten benchmark data sets (three multi-output data sets plus seven single-output ones) are utilized in the experiments: Energy efficiency, Sml2010, Parkinsons, Airfoil, Abalone, Winequality white, CCPP, Ailerons, Elevators, and Pole. The first seven data sets are from the well-known UCI machine learning repository (http://archive.ics.uci.edu/ml/), and the last three are available from the data collection at http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html. For each data set, before doing experiments, we remove the repeated samples and the redundant inputs. The remaining inputs are then normalized into the range [-1, 1], while the outputs are normalized into [0, 1]. In addition, we divide every data set into two parts, i.e., the training set and the testing set, whose details are shown in Table 3. Two commonly used hidden node types are chosen as the activation functions of the hidden layer, i.e., the sigmoid $h(x) = 1/(1 + \exp(-x^T a_i))$ and the RBF $h(x) = \exp(-b_i \| x - a_i \|_2^2)$, where $a_i$ is randomly chosen from the range [-1, 1] and $b_i$ is chosen from the range (0, 0.5) [12]. In this paper, we select ELM, CP-ELM, DP-ELM, ADP-ELM, RELM, CP-RELM, DP-RELM, and ADP-RELM for the tests. First of all, we employ cross validation to determine a nearly optimal number of hidden nodes (#Hidden nodes) for ELM from the set $\{10, 20, \cdots, 300\}$, and then extend it to the other algorithms. In RELM, CP-RELM, DP-RELM, and ADP-RELM there is an additional parameter, viz. the regularization parameter $\mu$, to be determined. Here it is chosen from $\{2^{-50}, \cdots, 2^{0}, \cdots, 2^{5}\}$ using the cross-validation technique [33]. To statistically obtain robust and reliable results, 30 trials are carried out with every algorithm on each data set. To facilitate comparison, the root mean square error (RMSE) is used as the performance index, defined as
\[ \mathrm{RMSE} = \sqrt{ \frac{1}{\#\mathrm{Testing} \cdot m} \sum_{i=1}^{\#\mathrm{Testing}} \| \tilde{t}_i - t_i \|_F^2 } \tag{43} \]
where #Testing denotes the number of testing samples and $\tilde{t}_i$ is the estimated vector for the testing sample $t_i$.
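For completeness, here is a small sketch of the preprocessing and of the RMSE index (43) as we read them; the min-max normalization details (per-feature scaling by the training-set range) are our assumption, since the text only states the target ranges.

```python
import numpy as np

def normalize(X_train, X_test, lo=-1.0, hi=1.0):
    """Map inputs into [lo, hi] using the training-set range (assumed min-max scaling)."""
    xmin, xmax = X_train.min(axis=0), X_train.max(axis=0)
    scale = (hi - lo) / np.maximum(xmax - xmin, 1e-12)
    return lo + (X_train - xmin) * scale, lo + (X_test - xmin) * scale

def rmse(T_pred, T_true):
    """Eq. (43): RMSE over #Testing samples and m outputs."""
    n_test, m = T_true.shape
    return np.sqrt(np.sum((T_pred - T_true) ** 2) / (n_test * m))

# Hypothetical usage:
rng = np.random.default_rng(0)
T_true = rng.standard_normal((100, 2))
T_pred = T_true + 0.05 * rng.standard_normal((100, 2))
print(rmse(T_pred, T_true))
```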

Figs. 1-10 show the training time against #Hidden nodes. For CP-ELM and CP-RELM the training time increases with increasing #Hidden nodes; these phenomena result from the fact that CP-ELM and CP-RELM start with null models and then gradually pick up hidden nodes one by one. On the contrary, for DP-ELM, ADP-ELM, DP-RELM, and ADP-RELM the training time decreases with increasing #Hidden nodes, because these destructive algorithms start with the full model and then remove the insignificant hidden nodes step by step. Under the same number of hidden nodes, ADP-ELM (ADP-RELM) needs less training time than DP-ELM (DP-RELM), which demonstrates that the accelerating scheme proposed in this paper indeed speeds up the original training process; in other words, the effectiveness of the presented scheme is verified experimentally. When choosing the same number of hidden nodes, CP-ELM (DP-ELM and ADP-ELM) is generally superior to CP-RELM (DP-RELM and ADP-RELM) with respect to the training time. One major reason is that, during the process of obtaining $R_{LL}$ and $\hat{T}_{Lm}$ with (9) and (22), respectively, RELM with the expanded matrix $[H;\ \sqrt{\mu} I_{LL}]$ usually needs more decomposition time than ELM. Before the removal of insignificant hidden nodes begins, ADP-ELM (ADP-RELM) has the same training time as DP-ELM (DP-RELM), because the accelerating scheme has not yet started to work.
Figs. 11-20 show the RMSE against #Hidden nodes. In these figures, the dashed line represents the ELM testing accuracy, and the RELM testing accuracy is denoted by the dash-dotted line. In theory, the dash-dotted line lies under the dashed line; that is, RELM should obtain better generalization performance than ELM, because ELM is a special case of RELM with the regularization parameter $\mu = 0$. However, for Airfoil and CCPP with RBF hidden nodes RELM is inferior to ELM. The reason is that RELM chooses the optimal regularization parameter from the finite set $\{2^{-50}, \cdots, 2^{0}, \cdots, 2^{5}\}$ instead of the whole real space. Note that the RMSE decreases with increasing #Hidden nodes. Generally speaking, the RMSE drops very quickly at first, but it decreases slowly when approaching the benchmark line (the dashed or dash-dotted line), because the learning capability gradually weakens as #Hidden nodes becomes larger. At the same #Hidden nodes, the testing accuracy of ADP-ELM (ADP-RELM) overlaps that of DP-ELM (DP-RELM), which demonstrates that the proposed equivalent scheme does not alter the generalization performance. When CP-ELM, DP-ELM, and ADP-ELM (CP-RELM, DP-RELM, and ADP-RELM) touch (or nearly touch) the dashed line (the dash-dotted line), we terminate them mandatorily. For these cases, Table 4 gives their statistical results. When CP-ELM and DP-ELM reach nearly the same accuracy as ELM, in general CP-ELM needs more hidden nodes than DP-ELM, but it takes the advantage in terms of the training time.


When the presented accelerating scheme is applied to DP-ELM, ADP-ELM outperforms CP-ELM with respect to both the training time and the number of hidden nodes. ADP-ELM is especially meaningful in cost-sensitive scenarios, since it has the advantage over CP-ELM in both training time and testing time. When arriving at nearly the same accuracy as RELM, DP-RELM works better than CP-RELM in terms of the training time and #Hidden nodes. ADP-RELM not only keeps the same accuracy as DP-RELM but also reduces the training time further, which is more beneficial for training-time-sensitive scenarios. In addition, ELM (RELM) requires the least training time among ELM, CP-ELM, DP-ELM, and ADP-ELM (RELM, CP-RELM, DP-RELM, and ADP-RELM), because no additional actions are performed on it.

6. Conclusions

Single-hidden layer feedforward networks have recently drawn much attention due to their excellent performance and simple form. Unlike traditional SLFNs, where the input weights and the biases of the hidden layer need to be tuned, ELM randomly generates these parameters independently of the training samples, and its output weights can be determined analytically. In this situation, ELM may generate redundant hidden nodes, so it is necessary to improve the compactness of the ELM network. CP-ELM and DP-ELM can reduce the number of hidden nodes without impairing the generalization performance. Compared with CP-ELM, DP-ELM has the edge in the number of hidden nodes, but it needs more training time. Hence, an equivalent measure is proposed in this paper to accelerate the training process of DP-ELM. ADP-ELM improves DP-ELM's performance on the training time so much that ADP-ELM needs less training time than CP-ELM; in other words, ADP-ELM is superior to CP-ELM in terms of both the training time and the number of hidden nodes, which is important for training-time-sensitive scenarios. The same holds for RELM: ADP-RELM improves the training process of DP-RELM further, and it needs fewer hidden nodes and less training time than CP-RELM. Experimental results have shown that the proposed scheme is indeed able to accelerate DP-ELM and DP-RELM, and the analysis of the computational complexity also supports this viewpoint.

Acknowledgment

This research was partially supported by the National Natural Science Foundation of China under Grant no. 51006052, and the NUST Outstanding Scholar Supporting Program. Moreover, the authors wish to thank the anonymous reviewers for their constructive comments and great help in the writing process, which improved the manuscript significantly.


Yong-Ping Zhao received his B.S. degree in thermal energy and power engineering from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in July 2004. He then worked toward the M.S. and Ph.D. degrees in kernel methods at Nanjing University of Aeronautics and Astronautics, receiving the Ph.D. degree in December 2009. Currently, he is an associate professor with the School of Mechanical Engineering, Nanjing University of Science & Technology. His research interests include machine learning and kernel methods.

Bing Li was born in 1990. He received the B.S. degree from Xinxiang University, China, in 2014. He is currently pursuing the M.Eng. degree at Nanjing University of Science and Technology, China. His research interests include machine learning, pattern recognition, etc.

Ye-Bo Li studied at Shenyang Aerospace University, China, from September 2004 to June 2008, majoring in aircraft power engineering, and received the B.S. degree. From September 2008 to October 2014, he studied at Nanjing University of Aeronautics and Astronautics, China, majoring in aerospace propulsion theory and engineering, and received the M.S. and Ph.D. degrees. He is now an engineer at AVIC Aeroengine Control Research Institute of China. His research interests include modelling, control, and fault diagnosis of aircraft engines.
