An accelerating scheme for destructive parsimonious extreme learning machine

Yong-Ping Zhao a,*, Bing Li a, Ye-Bo Li b

a School of Mechanical Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
b AVIC Aeroengine Control Research Institute, Wuxi 214063, China

* Corresponding author. E-mail address: [email protected] (Y.-P. Zhao).
Article history: Received 27 December 2014; Received in revised form 10 February 2015; Accepted 2 April 2015. Communicated by G.-B. Huang.

Abstract

Constructive and destructive parsimonious extreme learning machines (CP-ELM and DP-ELM) were recently proposed to sparsify ELM. In comparison with CP-ELM, DP-ELM has the advantage in the number of hidden nodes, but it loses the edge with respect to the training time. Hence, in this paper an equivalent measure is proposed to accelerate DP-ELM (ADP-ELM). As a result, ADP-ELM not only keeps the same hidden nodes as DP-ELM but also needs less training time than CP-ELM, which is especially important for training-time-sensitive scenarios. A similar idea is extended to the regularized ELM (RELM), yielding ADP-RELM. ADP-RELM further accelerates the training process of DP-RELM, and it works better than CP-RELM in terms of the number of hidden nodes and the training time. In addition, the computational complexity of the proposed accelerating scheme is analyzed in theory. The results reported on ten benchmark data sets confirm the effectiveness and usefulness of the proposed accelerating scheme experimentally.
© 2015 Elsevier B.V. All rights reserved.

Keywords: Single-hidden layer feedforward network; Extreme learning machine; Destructive algorithm; Constructive algorithm; Sparseness
1. Introduction

The widespread popularity of single-hidden layer feedforward networks (SLFNs) in extensive fields is mainly due to their power in approximating complex nonlinear mappings and their simple forms. As a specific type of SLFN, the extreme learning machine (ELM) [1-3] has recently drawn a lot of interest from researchers and engineers. Generally, training an ELM consists of two main stages [4]: (1) random feature mapping and (2) solving the linear parameters. In the first stage, ELM randomly initializes the hidden layer to map the input samples into a so-called ELM feature space with some nonlinear mapping functions, which can be any nonlinear piecewise continuous functions, such as the sigmoid and the RBF. Since in ELM the hidden node parameters are randomly generated (independently of the training samples) according to any continuous probability distribution without tuning, instead of being explicitly trained, it owns a remarkable computational advantage over regular gradient-descent backpropagation [5,6]. That is, unlike conventional learning methods that must see the training samples before generating the hidden node parameters, ELM can generate the hidden node parameters before seeing the training samples. In the second stage, the linear parameters can be obtained by solving the Moore-Penrose generalized
inverse of the hidden-layer output matrix, which reaches both the smallest training error and the smallest norm of output weights [7]. Hence, different from most algorithms proposed for feedforward neural networks, which did not consider the generalization performance when they were first proposed, ELM can achieve better generalization performance. From the interpolation perspective, Huang et al. [3] indicated that for any given training set, there exists an ELM network which gives a sufficiently small training error in the squared error sense with probability one, and the number of hidden nodes is not larger than the number of distinct training samples. If the number of hidden nodes equals the number of distinct training samples, then the training error decreases to zero with probability one. Actually, in the implementation of ELM, it is found that the generalization performance of ELM is not sensitive to the number of hidden nodes, and good performance can be reached as long as the number of hidden nodes is large enough [8]. In addition, unlike traditional SLFNs, such as radial basis function neural networks and the multilayer perceptron with one hidden layer, where the activation functions are required to be continuous or differentiable, ELM can choose the threshold function and many others as activation functions without sacrificing the universal approximation capability at all. In statistical learning theory, the Vapnik-Chervonenkis (VC) dimension theory is one of the most widely used frameworks in generalization bound analysis. According to the structural risk minimization perspective, to obtain better generalization performance on the testing set, an algorithm should not only
Table 1
The computational complexity of ADP-ELM/ADP-RELM at the ith iteration.

Operation                                                                          | Multiplication (x)                                                            | Addition (+)                                                                  | Division (/)              | Square root
$\left(R^{(i-1)}\right)^{-1}$                                                      | $\frac{1}{6}L_i^3-\frac{1}{6}L_i$                                             | $\frac{1}{6}L_i^3-\frac{1}{6}L_i$                                             | $\frac{1}{2}L_i^2+\frac{1}{2}L_i$ | 0
$\hat{\beta}^{(i-1)}$                                                              | $\frac{1}{2}L_i(L_i+1)m$                                                      | $\frac{1}{2}L_i(L_i-1)m$                                                      | 0                         | 0
$\|\hat{\beta}_{j_s}^{(i-1)}\|_2^2/\|r'_{j_s}(i-1)\|_2^2,\ j_s=1,\cdots,L_i$        | $\frac{1}{2}L_i^2+\frac{1}{2}L_i+L_i m$                                       | $\frac{1}{2}L_i^2-\frac{3}{2}L_i+L_i m$                                       | $L_i$                     | 0
In sum                                                                             | $\frac{1}{6}L_i^3+\frac{1}{2}L_i^2+\frac{1}{3}L_i+\frac{3}{2}L_i m+\frac{1}{2}L_i^2 m$ | $\frac{1}{6}L_i^3+\frac{1}{2}L_i^2-\frac{5}{3}L_i+\frac{1}{2}L_i^2 m+\frac{1}{2}L_i m$ | $\frac{1}{2}L_i^2+\frac{3}{2}L_i$ | 0

Note: $L_i = L - i + 1$.
Table 2
The computational complexity of DP-ELM/DP-RELM at the ith iteration.

Operation                                                            | Multiplication (x)                                                     | Addition (+)                                                   | Division (/)  | Square root
Retriangularization via Eq. (17), $j_s=1,\cdots,L_i-1$               | $\frac{2}{3}L_i^3+L_i^2-\frac{5}{3}L_i+2L_i^2 m-2L_i m$                | $\frac{2}{3}L_i^3+\frac{1}{2}L_i^2-\frac{5}{6}L_i+L_i^2 m-L_i m$ | $L_i^2-L_i$   | $\frac{1}{2}L_i^2-\frac{1}{2}L_i$
$\|t_{i_s}^{(i)}\|_F^2,\ j_s=1,\cdots,L_i$                           | $L_i m$                                                                | $L_i(m-1)$                                                     | 0             | 0
In sum                                                               | $\frac{2}{3}L_i^3+L_i^2-\frac{5}{3}L_i+2L_i^2 m-L_i m$                 | $\frac{2}{3}L_i^3+\frac{1}{2}L_i^2-\frac{11}{6}L_i+L_i^2 m$     | $L_i^2-L_i$   | $\frac{1}{2}L_i^2-\frac{1}{2}L_i$

Note: $L_i = L - i + 1$.
Table 3
Specifications on each benchmark data set.

Data set           | Hidden node type | #Hidden nodes | log2(mu) | #Training | #Testing | #Inputs | #Outputs
Energy efficiency  | Sigmoid          | 190           | -15      | 450       | 318      | 8       | 2
                   | RBF              | 220           | -20      | 450       | 318      | 8       | 2
Sml2010            | Sigmoid          | 80            | -2       | 3000      | 1137     | 16      | 2
                   | RBF              | 90            | -8       | 3000      | 1137     | 16      | 2
Parkinsons         | Sigmoid          | 100           | -10      | 4000      | 1875     | 16      | 2
                   | RBF              | 70            | -13      | 4000      | 1875     | 16      | 2
Airfoil            | Sigmoid          | 100           | -17      | 800       | 703      | 5       | 1
                   | RBF              | 90            | -30      | 800       | 703      | 5       | 1
Abalone            | Sigmoid          | 40            | -11      | 2800      | 1377     | 8       | 1
                   | RBF              | 60            | -10      | 2800      | 1377     | 8       | 1
Winequality white  | Sigmoid          | 90            | -8       | 3000      | 961      | 11      | 1
                   | RBF              | 70            | -11      | 3000      | 961      | 11      | 1
CCPP               | Sigmoid          | 220           | -34      | 6000      | 3527     | 4       | 1
                   | RBF              | 230           | -35      | 6000      | 3527     | 4       | 1
Ailerons           | Sigmoid          | 120           | -1       | 7154      | 6596     | 40      | 1
                   | RBF              | 100           | -10      | 7154      | 6596     | 40      | 1
Elevators          | Sigmoid          | 60            | -4       | 8752      | 7847     | 18      | 1
                   | RBF              | 50            | -11      | 8752      | 7847     | 18      | 1
Pole               | Sigmoid          | 300           | -6       | 10000     | 4958     | 26      | 1
                   | RBF              | 300           | -27      | 10000     | 4958     | 26      | 1

Notes: #Training represents the number of training samples, #Testing represents the number of testing samples, #Inputs represents the number of input variables, and #Outputs represents the number of output variables.
achieve a low training error on the training set, but also have a low VC dimension. Aiming at classification tasks, Liu et al. [9] proved that the VC dimension of an ELM is equal to its number of hidden nodes with probability one, which states that ELM has a relatively low VC dimension. With respect to regression problems, the generalization ability of ELM has been comprehensively studied in [10] and [11], leading to the conclusion that ELM with some suitable activation functions, such as polynomials, the sigmoid and the Nadaraya-Watson function, can achieve the same optimal generalization bound as an SLFN in which all parameters are tunable.

From the above analyses, it is known that ELM owns excellent universal approximation capability and generalization performance. However, due to the fact that ELM generates hidden nodes randomly, it usually requires more hidden nodes than a traditional SLFN to get matched performance. A large network size always signifies more running time in the testing phase. In cost-sensitive learning, the testing time
Fig. 1. The training time on Energy efficiency: (a) sigmoid and (b) RBF.

Fig. 2. The training time on Sml2010: (a) sigmoid and (b) RBF.
should be minimized, which requires a compact network to meet the testing time budget. Thus, the topic of improving the compactness of ELM has recently attracted great interest. First, Huang et al. [12] proposed an incremental ELM (I-ELM), which randomly generates the hidden nodes and analytically calculates the output weights. Since I-ELM does not recalculate the output weights of all the existing nodes when a new node is added, its convergence rate can be further improved by recalculating the output weights of the existing nodes based on a convex optimization method when a new hidden node is randomly added [13]. In I-ELM, there may be some hidden nodes which play a very minor role in the network output and eventually increase the network complexity. In order to avoid this problem and to obtain a more compact network, an enhanced version of I-ELM was presented [14], where in each learning step several hidden nodes are randomly generated and, among them, the hidden node leading to the largest residual error decrease is added to the existing network. In [15], an error-minimized ELM was proposed, which can add random hidden nodes one by one or group by group, and during the growth of the network the output weights are updated incrementally. In [16], an optimally pruned extreme learning machine was presented, which is based on the original ELM with additional steps to make it more robust and compact. Deng et al. [17] adopted a two-stage stepwise strategy to improve the compactness of the ELM. At the first stage, the selection procedure can be automatically terminated based on the leave-one-out error. At the second stage, the contribution of each hidden node is reviewed, and insignificant ones are replaced. This procedure does not terminate until no insignificant hidden nodes exist in the final model. When the Bayesian approach is applied to ELM, a Bayesian ELM is obtained [18]. This Bayesian method can not only optimize the network architecture, but also make use of a priori knowledge and obtain confidence intervals. Subsequently, a new learning algorithm called bidirectional ELM was presented [19], which can greatly enhance the learning effectiveness, reduce the number of hidden nodes, and eventually further increase the learning speed. In addition, the traditional ELM was further extended by Zhang et al. [20] to use Lebesgue p-integrable hidden activation functions, which can approximate any Lebesgue p-integrable function on a compact input set. Further, a dynamic ELM, where the hidden nodes can be recruited or deleted dynamically according to their significance to the network performance, was proposed [21], so that not only the parameters can be adjusted but also the architecture can be self-adapted simultaneously. Castano et al. [22] proposed a robust and pruned ELM approach, where principal component analysis is utilized to select the hidden nodes from the input features while the corresponding input weights are deterministically defined as principal components rather than random ones. To solve large and complex sample issues, a stacked ELM was designed [23], which divides a single large ELM network into multiple stacked small ELMs which are serially connected. In addition, to improve
Fig. 3. The training time on Parkinsons: (a) sigmoid and (b) RBF.

Fig. 4. The training time on Airfoil: (a) sigmoid and (b) RBF.
the performance of ELM on high-dimension and small-sample problems, a projection vector machine [24] was proposed. Aiming at classification tasks [7,25-28], several algorithms were proposed to optimize the network size of ELM. Recently, novel constructive and destructive parsimonious ELMs (CP-ELM and DP-ELM) were proposed based on recursive orthogonal least squares [29]. CP-ELM starts with a small initial network and gradually adds new hidden nodes until a satisfactory solution is found. On the contrary, DP-ELM starts by training a larger-than-necessary network and then removes the insignificant nodes one by one. Further, Zhao et al. [30] extended CP-ELM and DP-ELM to the regularized ELM (RELM), correspondingly obtaining CP-RELM and DP-RELM. Generally speaking, compared with CP-ELM, DP-ELM can obtain a more compact network when reaching nearly the same accuracy, but it loses the edge on the training time. The main reason is that when DP-ELM removes each insignificant hidden node, a series of Givens rotations is implemented to compute the additional residual error reduction. As is known, the computational burden of Givens rotations is high. Hence, a scheme is proposed to accelerate DP-ELM (ADP-ELM) instead of using Givens rotations. ADP-ELM not only needs fewer hidden nodes than CP-ELM, but also beats it in the training time. When this accelerating scheme is extended to DP-RELM, ADP-RELM is yielded. ADP-RELM reduces the training time of DP-RELM further, and outperforms CP-RELM in terms of the number of hidden nodes and the training time. In theory, the proposed accelerating scheme acquires the same effect as Givens rotations but requires less training time when removing the insignificant hidden nodes. Finally, extensive experiments are conducted on benchmark data sets, and the results verify the effectiveness of the proposed scheme.

The remainder of this paper is organized as follows. In Section 2, ELM and RELM are briefly introduced. As two destructive algorithms, DP-ELM and DP-RELM are described in Section 3. In Section 4, an equivalent measure is proposed instead of Givens rotations to accelerate DP-ELM and DP-RELM, yielding ADP-ELM and ADP-RELM, respectively, and their computational complexity is analyzed. Experimental results on ten benchmark data sets are presented in Section 5. Finally, conclusions follow.
2. ELM and RELM

2.1. ELM

For $N$ arbitrary distinct samples $\{(x_i, t_i)\}_{i=1}^{N}$, where $x_i \in \Re^n$ and $t_i \in \Re^m$, the traditional SLFN with $L$ hidden nodes and activation function $h(x)$ can be mathematically modeled as
$$y = \sum_{i=1}^{L} \beta_i h(x; a_i, b_i) \qquad (1)$$
where $a_i \in \Re^n$ is the input weight vector connecting the ith hidden node and the input nodes, $b_i \in \Re$ is the bias of the ith hidden node,
Fig. 5. The training time on Abalone: (a) sigmoid and (b) RBF.

Fig. 6. The training time on Winequality white: (a) sigmoid and (b) RBF.
and $\beta_i \in \Re^m$ is the output weight vector connecting the ith hidden node and the output nodes. If the outputs of the SLFN amount to the targets, the following compact formulation is obtained:
$$H\beta = T \qquad (2)$$
where $\beta = [\beta_1, \cdots, \beta_L]^T$, $T = [t_1, \cdots, t_N]^T$, and
$$H = \begin{bmatrix} h(x_1; a_1, b_1) & \cdots & h(x_1; a_L, b_L) \\ \vdots & \ddots & \vdots \\ h(x_N; a_1, b_1) & \cdots & h(x_N; a_L, b_L) \end{bmatrix} \qquad (3)$$
here $H$ is the so-called hidden-layer output matrix. For the traditional SLFN, the parameters $\{(a_i, b_i)\}_{i=1}^{L}$ in (1) are usually determined using gradient-based methods. However, in ELM these parameters are randomly initialized before seeing the training samples. In this case, training an ELM simply amounts to finding the output weight matrix $\beta$ of the linear system (2). A simple solution of (2) is given explicitly as follows:
$$\hat{\beta} = H^{\dagger} T \qquad (4)$$
where $H^{\dagger} = (H^T H)^{-1} H^T$ is the Moore-Penrose generalized inverse of $H$. The solution $\hat{\beta}$ in (4) minimizes the training error as well as the norm of the output weights [7]. In other words, ELM aims to reach better generalization performance by reaching both the smallest training error and the smallest norm of output weights. As for this point, most algorithms for SLFNs do not consider the generalization performance when they are first proposed. Actually, solving (2) in the squared error sense is equivalent to minimizing the following problem:
$$\min_{\beta} G_{ELM} = \| H\beta - T \|_F^2 \qquad (5)$$
where $\|\cdot\|_F$ denotes the Frobenius norm. Letting $dG_{ELM}/d\beta = 0$ yields the same solution as (4).

2.2. RELM

Due to the random choice of the input weights and the hidden layer biases, the hidden layer output matrix $H$ may not be of full column rank, or the condition number of $H^T H$ may be too large. In these scenarios, solving $\hat{\beta}$ directly using (4) may degrade the ELM accuracy. To this end, after adding a regularization term to (5), the mathematical model of RELM becomes [31]
$$\min_{\beta} G_{RELM} = \| H\beta - T \|_F^2 + \mu \| \beta \|_F^2 \qquad (6)$$
where $\mu$ is the regularization parameter. Letting $dG_{RELM}/d\beta = 0$ yields
$$\hat{\beta} = \left( H^T H + \mu I \right)^{-1} H^T T \qquad (7)$$
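To make the two closed-form solutions above concrete, the following minimal NumPy sketch trains an ELM through the pseudoinverse of Eq. (4) and an RELM through the regularized solution of Eq. (7). It only illustrates the formulas and is not the authors' code; the function and variable names (train_elm, hidden_output, X, T, mu) and the toy data are invented for the example, and a sigmoid hidden node is assumed.

```python
# A minimal sketch of ELM (Eqs. (1)-(4)) and RELM (Eqs. (6)-(7)) training.
import numpy as np

def hidden_output(X, A, b):
    """ELM feature mapping: H[i, j] = h(x_i; a_j, b_j) with a sigmoid node."""
    return 1.0 / (1.0 + np.exp(-(X @ A + b)))

def train_elm(X, T, L, mu=0.0, seed=None):
    """Return (A, b, beta); mu = 0 gives ELM (Eq. (4)), mu > 0 gives RELM (Eq. (7))."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    A = rng.uniform(-1.0, 1.0, size=(n, L))   # random input weights
    b = rng.uniform(-1.0, 1.0, size=L)        # random biases
    H = hidden_output(X, A, b)
    if mu == 0.0:
        beta = np.linalg.pinv(H) @ T          # beta = H^+ T, Eq. (4)
    else:
        beta = np.linalg.solve(H.T @ H + mu * np.eye(L), H.T @ T)  # Eq. (7)
    return A, b, beta

# toy usage
X = np.random.rand(200, 3); T = np.sin(X.sum(axis=1, keepdims=True))
A, b, beta = train_elm(X, T, L=40, mu=2.0**-10, seed=0)
Y = hidden_output(X, A, b) @ beta             # network output, Eq. (1)
```

Setting mu to zero recovers plain ELM, while any positive mu gives the RELM solution of (7).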
Fig. 7. The training time on CCPP: (a) sigmoid and (b) RBF.

Fig. 8. The training time on Ailerons: (a) sigmoid and (b) RBF.
3. DP-ELM and DP-RELM

3.1. DP-ELM

3.1.1. Preliminary work

According to the conclusion in [3], it has been proved that an ELM with at most $N$ hidden nodes and with almost any nonlinear activation function can exactly learn $N$ distinct samples. In fact, in real-world applications, the number of hidden nodes is always less than the number of training samples, i.e., $L < N$, and the matrix $H$ is of full column rank with probability one. Hence, the following theorem is obtained.

Theorem 1. The minimizer of Eq. (5) amounts to the following minimizer:
$$\min_{\beta} \left\{ G'_{ELM} = \| R_{L\times L}\beta - \hat{T}_{L\times m} \|_F^2 \right\} \qquad (8)$$
where
$$H = Q \begin{bmatrix} R_{L\times L} \\ 0_{(N-L)\times L} \end{bmatrix} \qquad (9)$$
$$Q^T T = \begin{bmatrix} \hat{T}_{L\times m} \\ \tilde{T}_{(N-L)\times m} \end{bmatrix} \qquad (10)$$
here $Q$ is an $N\times N$ orthogonal matrix satisfying $Q^T Q = QQ^T = I$, and $R_{L\times L}$ is an upper triangular matrix with the same rank as $H$.

Proof. Since the matrix $H$ is of full column rank, it can be decomposed as $H = Q \begin{bmatrix} R_{L\times L} \\ 0_{(N-L)\times L} \end{bmatrix}$ using the QR decomposition according to matrix theory [32], where $Q$ is an orthogonal matrix and $R_{L\times L}$ is an upper triangular matrix. Thus, we get
$$\| H\beta - T \|_F^2 = \| Q^T H\beta - Q^T T \|_F^2 = \left\| \begin{bmatrix} R_{L\times L} \\ 0_{(N-L)\times L} \end{bmatrix}\beta - \begin{bmatrix} \hat{T}_{L\times m} \\ \tilde{T}_{(N-L)\times m} \end{bmatrix} \right\|_F^2 = \left\| \begin{bmatrix} R_{L\times L}\beta - \hat{T}_{L\times m} \\ -\tilde{T}_{(N-L)\times m} \end{bmatrix} \right\|_F^2 = \| R_{L\times L}\beta - \hat{T}_{L\times m} \|_F^2 + \| \tilde{T}_{(N-L)\times m} \|_F^2 \qquad (11)$$
Plugging (11) into (5) and eliminating the constant part $\| \tilde{T}_{(N-L)\times m} \|_F^2$, Eq. (8) is obtained. Now, this proof is completed. □

Since Eqs. (5) and (8) have the same minimizer, we do not distinguish $G_{ELM}$ and $G'_{ELM}$ in this paper. In addition, according to (10) we know
$$\| T \|_F^2 = \| \hat{T}_{L\times m} \|_F^2 + \| \tilde{T}_{(N-L)\times m} \|_F^2 \qquad (12)$$
where $\| \tilde{T}_{(N-L)\times m} \|_F^2$ is defined as the initial residual error, and $\| \hat{T}_{L\times m} \|_F^2$ is the additional residual error.
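The reduction in Theorem 1 can be checked numerically: after an economy-size QR decomposition of a stand-in for $H$, solving the small triangular system of Eq. (13) below reproduces the full least-squares minimizer of Eq. (5). This is only an illustrative sketch; the randomly generated matrices and all names are not taken from the paper.

```python
# Numerical check of Theorem 1 / Eqs. (8)-(13) on random data.
import numpy as np
from scipy.linalg import solve_triangular

N, L, m = 300, 25, 2
H = np.random.randn(N, L)                 # stand-in for a hidden-layer output matrix
T = np.random.randn(N, m)

Q, R = np.linalg.qr(H, mode='reduced')    # H = Q R with R upper triangular (L x L), Eq. (9)
T_hat = Q.T @ T                           # the first L rows of Q^T T, Eq. (10)
beta_qr = solve_triangular(R, T_hat)      # backward substitution, Eq. (13)

beta_ls = np.linalg.lstsq(H, T, rcond=None)[0]   # minimizer of Eq. (5)
print(np.allclose(beta_qr, beta_ls))      # True: the two minimizers coincide
```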
Fig. 9. The training time on Elevators: (a) sigmoid and (b) RBF.

Fig. 10. The training time on Pole: (a) sigmoid and (b) RBF.

The output weight matrix $\hat{\beta}$ in (4) can be got by solving
$$R\hat{\beta} = \hat{T} \qquad (13)$$
where $R$ and $\hat{T}$ are from (8) after omitting the subscripts. Because $R$ is an upper triangular matrix, $\hat{\beta}$ in Eq. (13) is easily got with backward substitutions. Although an ELM can be obtained easily through (13), its solution is dense. That is, there may exist insignificant hidden nodes in the solution. Hence, we need to improve the compactness of ELM. It should be pointed out that every regressor (or column) $r_i$ of the matrix $R = [r_1, \cdots, r_L]$ in (8) corresponds to one hidden node. Hence, compacting ELM amounts to determining the significant regressors from $R$.

3.1.2. Realization

The DP-ELM starts with the full set of candidate regressors, i.e., let the initial regressor matrix $R^{(0)} = R$, $T^{(0)} = \hat{T}$, and take a full index set $P = \{1, 2, \cdots, L\}$; then it gradually removes the insignificant hidden nodes one by one until the stopping criterion is met. Assume that at the ith iteration the candidate regressor matrix $R_s^{(i)}$ is obtained by removing the regressor $r_s^{(i-1)}$ from the previous regressor matrix $R^{(i-1)}$:
$$R_s^{(i)} \leftarrow R^{(i-1)} \backslash r_s^{(i-1)},\quad s \in P \qquad (14)$$
Then, $R_s^{(i)}$ is retriangularized by a series of Givens rotations as
$$R_s^{(i)} = Q^{(i)} \begin{bmatrix} R^{(i)} \\ 0_{1\times(L-i)} \end{bmatrix} \qquad (15)$$
together with
$$\left(Q^{(i)}\right)^T T^{(i-1)} = \begin{bmatrix} T^{(i)} \\ t_{i_s}^{(i)} \end{bmatrix} \qquad (16)$$
where $Q^{(i)}$ consists of a series of Givens rotations satisfying $\left(Q^{(i)}\right)^T Q^{(i)} = Q^{(i)}\left(Q^{(i)}\right)^T = I_{(L-i+1)\times(L-i+1)}$, $R^{(i)}$ is an upper triangular matrix, and $t_{i_s}^{(i)}$ is the last row vector of the transformed targets. Obviously, $\| t_{i_s}^{(i)} \|_F^2$ is the increase of the additional residual error due to the removal of the regressor $r_s^{(i-1)}$. In fact, Eq. (15) can become much simpler, since the Givens rotations only need to be performed on the columns after $r_s^{(i-1)}$ in $R^{(i-1)}$. Hence, together with (16) they are simplified as
$$\left[ R_{s,j_s}^{(i)} \;\; T_{j_s}^{(i-1)} \right] = Q_{j_s}^{(i)} \begin{bmatrix} R_{j_s}^{(i)} & T_{j_s}^{(i)} \\ 0_{1\times(L-i-j_s+1)} & t_{i_s}^{(i)} \end{bmatrix} \qquad (17)$$
where $j_s$ is the column index of the regressor $r_s^{(i-1)}$ in the matrix $R^{(i-1)}$, $R_{s,j_s}^{(i)}$ and $R_{j_s}^{(i)}$ are the submatrices including elements from the $j_s$th row and $j_s$th column to the end of $R_s^{(i)}$ and $R^{(i)}$, respectively, $T_{j_s}^{(i-1)}$ and $T_{j_s}^{(i)}$ are the submatrices including elements from the $j_s$th row to the end of $T^{(i-1)}$ and $T^{(i)}$, respectively, and the orthogonal matrix $Q_{j_s}^{(i)}$ is governed by
$$Q^{(i)} = \begin{bmatrix} I_{(j_s-1)\times(j_s-1)} & \\ & Q_{j_s}^{(i)} \end{bmatrix} \qquad (18)$$
Hence, the index of the ith regressor to be removed is
$$s^{\dagger} = \arg\min_{s\in P} \frac{\| t_{i_s}^{(i)} \|_F^2}{\| T \|_F^2} \qquad (19)$$
Eq. (19) shows that the regressor incurring the least increase of the additional residual error will be removed. After removing the ith regressor $r_{s^{\dagger}}^{(i-1)}$, the index set $P$ is updated as $P \leftarrow P\backslash s^{\dagger}$. DP-ELM does not stop until the following criterion is reached:
$$\frac{\| T^{(i)} \|_F^2}{\| T \|_F^2} \le 1 - \rho \qquad (20)$$
where $0 < \rho \ll 1$, which can balance the training error against the model complexity.

Fig. 11. The testing accuracy on Energy efficiency: (a) sigmoid and (b) RBF.

Fig. 12. The testing accuracy on Sml2010: (a) sigmoid and (b) RBF.

3.2. DP-RELM

Theorem 2. The minimizer of Eq. (6) is equivalent to the following minimizer:
$$\min_{\beta} \left\{ G'_{RELM} = \| R_{L\times L}\beta - \hat{T}_{L\times m} \|_F^2 \right\} \qquad (21)$$
where
$$\begin{bmatrix} H \\ \sqrt{\mu}\, I_{L\times L} \end{bmatrix} = Q \begin{bmatrix} R_{L\times L} \\ 0_{N\times L} \end{bmatrix} \qquad (22)$$
$$Q^T \begin{bmatrix} T \\ 0_{L\times m} \end{bmatrix} = \begin{bmatrix} \hat{T}_{L\times m} \\ \tilde{T}_{N\times m} \end{bmatrix} \qquad (23)$$
here $Q$ is an $(N+L)\times(N+L)$ orthogonal matrix satisfying $Q^T Q = QQ^T = I$, and $R_{L\times L}$ is an upper triangular matrix with rank $L$.

The proof of Theorem 2 is similar to that of Theorem 1, so we omit it here. The matrices $R_{L\times L}$ and $\hat{T}_{L\times m}$ in Theorems 1 and 2 are different, but we adopt the same symbols, which does not lead to misunderstanding. When we obtain $R_{L\times L}$ and $\hat{T}_{L\times m}$ from Theorem 2, DP-RELM is naturally obtained by adopting the same realization procedure as DP-ELM.
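The destructive step described by Eqs. (14)-(19) can be sketched as follows. For simplicity a dense QR factorization is used in place of the sequence of Givens rotations in (17); it produces the same quantity $\|t_{i_s}^{(i)}\|_F^2$, namely the residual increase caused by deleting one regressor. The toy matrices and all names below are illustrative only, not the paper's implementation.

```python
# A sketch of one destructive step (Eqs. (14)-(19)).
import numpy as np

def removal_costs(R, T):
    """Return ||t_is||_F^2 for every candidate column of R (Eqs. (16)-(17))."""
    L = R.shape[1]
    costs = np.empty(L)
    for j in range(L):
        R_s = np.delete(R, j, axis=1)          # Eq. (14): drop regressor j
        Q, _ = np.linalg.qr(R_s, mode='complete')   # stands in for the Givens rotations
        t_is = (Q.T @ T)[-1]                   # last row of the transformed targets, Eq. (16)
        costs[j] = np.sum(t_is ** 2)           # increase of the additional residual error
    return costs

R = np.triu(np.random.randn(6, 6))
T = np.random.randn(6, 2)
costs = removal_costs(R, T)
s_dagger = np.argmin(costs / np.sum(T ** 2))   # Eq. (19): cheapest regressor to remove
```

In the paper the Givens-based update touches only the affected columns and is therefore cheaper than recomputing a full QR, but the quantity minimized in (19) is the same.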
Fig. 13. The testing accuracy on Parkinsons: (a) sigmoid and (b) RBF.

Fig. 14. The testing accuracy on Airfoil: (a) sigmoid and (b) RBF.

4. ADP-ELM and ADP-RELM

4.1. An equivalent measure

Assume that at the (i-1)th iteration in DP-ELM or DP-RELM, $R^{(i-1)}$ and $T^{(i-1)}$ are obtained, and let
$$\hat{G}^{(i-1)} = \min_{\beta^{(i-1)}} \left\{ G^{(i-1)} = \| R^{(i-1)}\beta^{(i-1)} - T^{(i-1)} \|_F^2 \right\} \qquad (24)$$
Evidently, $\hat{G}^{(i-1)} = 0$. When the regressor $r_s^{(i-1)}$ is removed at the ith iteration, Eq. (24) becomes
$$\hat{G}_s^{(i)} = \min_{\beta_s^{(i)}} \left\{ G_s^{(i)} = \| R_s^{(i)}\beta_s^{(i)} - T^{(i-1)} \|_F^2 \right\} \qquad (25)$$
Then, define
$$\Delta_s^{(i)} = \hat{G}_s^{(i)} - \hat{G}^{(i-1)} = \hat{G}_s^{(i)} \qquad (26)$$
here $\Delta_s^{(i)}$ represents the increase of the objective function in (24) due to the removal of the regressor $r_s^{(i-1)}$. Therefore, the following theorem is obtained.

Theorem 3. The following formula holds:
$$\Delta_s^{(i)} = \| t_{i_s}^{(i)} \|_F^2 \qquad (27)$$
where $t_{i_s}^{(i)}$ is from (16).

Proof. Substituting (15) and (16) into (25), we get
$$G_s^{(i)} = \| R_s^{(i)}\beta^{(i)} - T^{(i-1)} \|_F^2 = \left\| Q^{(i)}\begin{bmatrix} R^{(i)} \\ 0_{1\times(L-i)} \end{bmatrix}\beta^{(i)} - T^{(i-1)} \right\|_F^2 = \left\| \begin{bmatrix} R^{(i)} \\ 0_{1\times(L-i)} \end{bmatrix}\beta^{(i)} - \begin{bmatrix} T^{(i)} \\ t_{i_s}^{(i)} \end{bmatrix} \right\|_F^2 = \| R^{(i)}\beta^{(i)} - T^{(i)} \|_F^2 + \| t_{i_s}^{(i)} \|_F^2 \qquad (28)$$
Because $\| R^{(i)}\hat{\beta}^{(i)} - T^{(i)} \|_F^2 = 0$, hence $\hat{G}_s^{(i)} = \| t_{i_s}^{(i)} \|_F^2$. □

To remove the regressor $r_{s^{\dagger}}^{(i-1)}$ at the ith iteration, we need to compute $\| t_{i_s}^{(i)} \|_F^2$ a total of $L-i+1$ times, which consumes a lot of time. From Theorem 3, note that the issue of computing $\| t_{i_s}^{(i)} \|_F^2$ can be sidestepped by finding the equivalent measure $\Delta_s^{(i)}$. If we can seek a computationally cheap method of calculating $\Delta_s^{(i)}$, then the bottleneck of computing $\| t_{i_s}^{(i)} \|_F^2$ will be circumvented.

4.2. An accelerating scheme

If at the ith iteration each $\Delta_s^{(i)}$ is computed directly according to (25), the computational burden is very heavy, not cheaper than that of computing $\| t_{i_s}^{(i)} \|_F^2$. Hence, it is necessary to seek a method of computing each $\Delta_s^{(i)}$ cheaply. Assume that the column index of the regressor $r_s^{(i-1)}$ to be removed in the matrix $R^{(i-1)}$ is $j_s$; then the following theorem holds.
Fig. 15. The testing accuracy on Abalone: (a) sigmoid and (b) RBF.

Fig. 16. The testing accuracy on Winequality white: (a) sigmoid and (b) RBF.
Theorem 4. The formula
$$\Delta_s^{(i)} = \frac{\| \hat{\beta}_{j_s}^{(i-1)} \|_2^2}{\Lambda_{j_s}^{(i-1)}} \qquad (29)$$
holds, where $\hat{\beta}_{j_s}^{(i-1)}$ is the $j_s$th row vector of the minimizer $\hat{\beta}^{(i-1)}$ of Eq. (24), and $\Lambda_{j_s}^{(i-1)}$ is the $j_s$th diagonal element of $\left(\left(R^{(i-1)}\right)^T R^{(i-1)}\right)^{-1}$.

Proof. First, consider the following optimization problem:
$$\hat{G}_s^{(i-1)} = \min_{\beta_s^{(i-1)}} \left\{ G_s^{(i-1)} = \| R^{(i-1)}\beta_s^{(i-1)} - T^{(i-1)} \|_F^2 + \| \Theta\beta_s^{(i-1)} \|_F^2 \right\} \qquad (30)$$
where $\Theta = \mathrm{diag}\left( 0_1, \cdots, 0_{j_s-1}, \theta, 0_{j_s+1}, \cdots, 0_{L-i+1} \right)$. Comparing (30) with (25), note that
$$\hat{G}_s^{(i)} = \lim_{\theta\to+\infty} \hat{G}_s^{(i-1)} \qquad (31)$$
Letting $dG_s^{(i-1)}/d\beta_s^{(i-1)} = 0$ gets
$$\hat{\beta}_s^{(i-1)} = \left( \left(R^{(i-1)}\right)^T R^{(i-1)} + \Theta^T\Theta \right)^{-1} \left(R^{(i-1)}\right)^T T^{(i-1)} \qquad (32)$$
Plugging (32) into (30) yields
$$\hat{G}_s^{(i-1)} = \| T^{(i-1)} \|_F^2 - \mathrm{tr}\left( \left(T^{(i-1)}\right)^T R^{(i-1)} \left( \left(R^{(i-1)}\right)^T R^{(i-1)} + \Theta^T\Theta \right)^{-1} \left(R^{(i-1)}\right)^T T^{(i-1)} \right) \qquad (33)$$
where $\mathrm{tr}(\cdot)$ represents the trace of a square matrix. In addition,
$$\hat{G}^{(i-1)} = \| T^{(i-1)} \|_F^2 - \mathrm{tr}\left( \left(T^{(i-1)}\right)^T R^{(i-1)} \left( \left(R^{(i-1)}\right)^T R^{(i-1)} \right)^{-1} \left(R^{(i-1)}\right)^T T^{(i-1)} \right) \qquad (34)$$
According to the Woodbury formula [32], we get
$$\left( \left(R^{(i-1)}\right)^T R^{(i-1)} + \Theta^T\Theta \right)^{-1} = \left( \left(R^{(i-1)}\right)^T R^{(i-1)} \right)^{-1} - \left( \left(R^{(i-1)}\right)^T R^{(i-1)} \right)^{-1}\Theta^T \left( I + \Theta\left( \left(R^{(i-1)}\right)^T R^{(i-1)} \right)^{-1}\Theta^T \right)^{-1} \Theta\left( \left(R^{(i-1)}\right)^T R^{(i-1)} \right)^{-1} \qquad (35)$$
So,
$$\hat{G}_s^{(i-1)} - \hat{G}^{(i-1)} = \mathrm{tr}\left( \left(\hat{\beta}^{(i-1)}\right)^T \Theta^T \left( I + \Theta\left( \left(R^{(i-1)}\right)^T R^{(i-1)} \right)^{-1}\Theta^T \right)^{-1} \Theta\,\hat{\beta}^{(i-1)} \right) \qquad (36)$$
Hence,
$$\Delta_s^{(i)} = \hat{G}_s^{(i)} - \hat{G}^{(i-1)} = \lim_{\theta\to+\infty} \mathrm{tr}\left( \left(\hat{\beta}^{(i-1)}\right)^T \Theta^T \left( I + \Theta\left( \left(R^{(i-1)}\right)^T R^{(i-1)} \right)^{-1}\Theta^T \right)^{-1} \Theta\,\hat{\beta}^{(i-1)} \right) \qquad (37)$$
Fig. 17. The testing accuracy on CCPP: (a) sigmoid and (b) RBF.

Fig. 18. The testing accuracy on Ailerons: (a) sigmoid and (b) RBF.

which is simplified as
$$\Delta_s^{(i)} = \lim_{\theta\to+\infty} \frac{\| \hat{\beta}_{j_s}^{(i-1)} \|_2^2\,\theta^2}{1 + \Lambda_{j_s}^{(i-1)}\theta^2} = \frac{\| \hat{\beta}_{j_s}^{(i-1)} \|_2^2}{\Lambda_{j_s}^{(i-1)}} \qquad (38)$$
Up to now, the proof has been finished. □

Since $R^{(i-1)}$ is an upper triangular matrix, on the basis of matrix theory $\left(R^{(i-1)}\right)^{-1}$ is easily got with backward substitutions, and it is also an upper triangular matrix, here denoted by $\left(R^{(i-1)}\right)^{-1} = \left[ r'_1(i-1), \cdots, r'_{L_i}(i-1) \right]^T$. Therefore
$$\Lambda_{j_s}^{(i-1)} = \| r'_{j_s}(i-1) \|_2^2 \qquad (39)$$
Substituting (39) into (29), we get
$$\Delta_s^{(i)} = \frac{\| \hat{\beta}_{j_s}^{(i-1)} \|_2^2}{\| r'_{j_s}(i-1) \|_2^2} \qquad (40)$$
Hence, the index of the ith regressor to be removed for ADP-ELM and ADP-RELM is found as
$$s^{\dagger} = \arg\min_{s\in P} \frac{\| \hat{\beta}_{j_s}^{(i-1)} \|_2^2}{\| r'_{j_s}(i-1) \|_2^2} \qquad (41)$$
As for the stopping criterion, Eq. (20) is also suitable for ADP-ELM and ADP-RELM. Of course, we can predefine a positive integer $M$ for ADP-ELM and ADP-RELM: when $M$ regressors are left, the procedure of removing the insignificant hidden nodes terminates, which is convenient for controlling the network size.

4.3. Analysis of computational complexity

Assume that ADP-ELM or ADP-RELM is implemented at the ith iteration. To remove the regressor $r_{s^{\dagger}}^{(i-1)}$ from the candidate matrix $R^{(i-1)}$ using Eq. (41), we first need to compute $\left(R^{(i-1)}\right)^{-1}$. Due to $R^{(i-1)}$ being an upper triangular matrix, $\left(R^{(i-1)}\right)^{-1}$ can be easily got with backward substitutions. When $\left(R^{(i-1)}\right)^{-1}$ is found, $\hat{\beta}^{(i-1)}$ and $\| r'_{j_s}(i-1) \|_2^2$ can be acquired. Then, Eq. (41) is used to find the regressor $r_{s^{\dagger}}^{(i-1)}$ to be removed. For the above procedure, the computational complexity of each step is listed in Table 1.

When DP-ELM or DP-RELM is at the ith iteration, in order to obtain $t_{i_s}^{(i)}$ with (17), a series of Givens rotations need to be constructed in this process. For example, assume that $u$ is a two-dimensional vector denoted by $u = [u_1, u_2]^T$; then a Givens rotation is simply shown as
$$\begin{bmatrix} v & w \\ -w & v \end{bmatrix}\begin{bmatrix} u_1 \\ u_2 \end{bmatrix} = \begin{bmatrix} \| u \|_2 \\ 0 \end{bmatrix} \qquad (42)$$
where $v = u_1/r$, $w = u_2/r$, and $r = \sqrt{u_1^2 + u_2^2}$. Notice that constructing a Givens rotation needs four multiplications, one addition, two divisions, and one square root. When $t_{i_s}^{(i)}$ is got using (17),
then the index $s^{\dagger}$ is determined via (19). The computational complexity of DP-ELM or DP-RELM is tabulated in Table 2. Comparing Table 1 with Table 2, it is easily concluded that the computational complexity of ADP-ELM/ADP-RELM is lower than that of DP-ELM/DP-RELM for every operation. Naturally, ADP-ELM/ADP-RELM accelerates the training process of DP-ELM/DP-RELM while keeping the same accuracy.
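The equivalence established by Theorem 4 and Eqs. (39)-(41) can also be verified numerically: the ratio $\|\hat{\beta}_{j_s}\|_2^2/\|r'_{j_s}\|_2^2$ computed from the inverse of the triangular matrix coincides with the residual increase obtained through the orthogonal retriangularization route of DP-ELM. The sketch below uses a dense QR as a stand-in for the Givens rotations and randomly generated matrices; all names are illustrative rather than taken from the paper.

```python
# Numerical check that the accelerated measure (Eq. (40)) equals the Givens/QR route.
import numpy as np
from scipy.linalg import solve_triangular

L, m = 6, 2
R = np.triu(np.random.randn(L, L)) + 5 * np.eye(L)   # well-conditioned upper triangular R^(i-1)
T = np.random.randn(L, m)                             # transformed targets T^(i-1)

R_inv = solve_triangular(R, np.eye(L))                # backward substitutions; still upper triangular
beta_hat = R_inv @ T                                  # beta_hat^(i-1) = (R^(i-1))^{-1} T^(i-1)
delta = np.sum(beta_hat ** 2, axis=1) / np.sum(R_inv ** 2, axis=1)   # Eq. (40), one value per regressor

# brute force: residual increase when column j is removed (Eqs. (14)-(17))
brute = np.empty(L)
for j in range(L):
    Q, _ = np.linalg.qr(np.delete(R, j, axis=1), mode='complete')
    brute[j] = np.sum(((Q.T @ T)[-1]) ** 2)           # ||t_is||_F^2

print(np.allclose(delta, brute))                      # True: the two measures coincide
```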
4.4. The flowchart of ADP-ELM and ADP-RELM

In summary, the flowchart of ADP-ELM/ADP-RELM is depicted as follows.

Algorithm 1. ADP-ELM/ADP-RELM
1 Initializations: Input the training samples $\{(x_i, t_i)\}_{i=1}^{N}$. Choose the type of activation function. Choose a small $\rho > 0$ or a positive integer $M$ for controlling the number of final hidden nodes exactly. Let $P = \{1, \cdots, L\}$ and $i = 1$.
2 Generate the hidden layer output matrix $H$ with $L$ random hidden nodes.
3 For ADP-RELM, determine the appropriate regularization parameter $\mu$.
4 Obtain $R$ and $\hat{T}$ according to (9) and (10) ((22) and (23)), respectively, for ADP-ELM (ADP-RELM).
5 Let $R^{(0)} = R$ and $T^{(0)} = \hat{T}$.
9 Compute the upper triangular matrix $\left(R^{(i-1)}\right)^{-1}$ based on $R^{(i-1)}$ with backward substitutions.
10 Calculate $\hat{\beta}^{(i-1)} = \left(R^{(i-1)}\right)^{-1} T^{(i-1)}$.
11 Determine the regressor $r_{s^{\dagger}}^{(i-1)}$ to be removed according to (41).
12 If Eq. (20) is satisfied or $i > L - M$
     Go to step 16.
   Else
13   Obtain $R_{s^{\dagger}}^{(i)}$ according to (14).
14   Get $R^{(i)}$ and $T^{(i)}$ from (15) and (16), respectively.
15   Let $P \leftarrow P\backslash s^{\dagger}$ and $i \leftarrow i+1$, and go to step 6.
   End if
16 Solve $R^{(i)}\hat{\beta}^{(i)} = T^{(i)}$ with backward substitutions.
   Output ADP-ELM/ADP-RELM: $f(x) = \sum_{j\in P} \hat{\beta}_j^{(i)} h(x; a_j, b_j)$.

Fig. 19. The testing accuracy on Elevators: (a) sigmoid and (b) RBF.

Fig. 20. The testing accuracy on Pole: (a) sigmoid and (b) RBF.
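The following compact sketch strings the pieces of Algorithm 1 together for ADP-ELM. It uses a sigmoid feature map and replaces the QR/Givens machinery of Eqs. (9)-(10) and (14)-(16) with NumPy's dense QR, and it simply stops when a prescribed number M of nodes is left; every name, the toy data and these simplifications are illustrative assumptions rather than the paper's implementation.

```python
# A compact, simplified sketch of Algorithm 1 (ADP-ELM).
import numpy as np
from scipy.linalg import solve_triangular

def adp_elm(X, T, L, M, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.uniform(-1, 1, size=(X.shape[1], L))
    b = rng.uniform(-1, 1, size=L)
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))           # random sigmoid hidden layer
    keep = list(range(L))                            # index set P
    Q, R = np.linalg.qr(H, mode='reduced')           # Eq. (9)
    T_hat = Q.T @ T                                  # Eq. (10)
    while len(keep) > M:
        R_inv = solve_triangular(R, np.eye(len(keep)))   # backward substitutions
        beta = R_inv @ T_hat                             # beta_hat^(i-1)
        delta = np.sum(beta ** 2, axis=1) / np.sum(R_inv ** 2, axis=1)  # Eq. (41)
        j = int(np.argmin(delta))                        # regressor to remove
        keep.pop(j)                                      # P <- P \ {s}
        R_s = np.delete(R, j, axis=1)                    # Eq. (14)
        Qs, R_new = np.linalg.qr(R_s, mode='complete')   # stands in for Eqs. (15)-(16)
        R, T_hat = R_new[:-1], (Qs.T @ T_hat)[:-1]
    beta = solve_triangular(R, T_hat)                    # final output weights
    return A[:, keep], b[keep], beta

# usage: keep 15 of 60 random nodes
X = np.random.rand(400, 4); T = np.sin(X.sum(axis=1, keepdims=True))
A_k, b_k, beta = adp_elm(X, T, L=60, M=15)
Y = 1.0 / (1.0 + np.exp(-(X @ A_k + b_k))) @ beta
```

For ADP-RELM the only change would be to build $R$ and $\hat{T}$ from the expanded matrices of Eqs. (22)-(23) instead of Eqs. (9)-(10).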
5. Experiments
In this section, we experimentally study the usefulness of the proposed scheme in accelerating the training processes of DP-ELM and DP-RELM.
Table 4 The detailed experimental results on benchmark data sets. Data sets
Hidden node type
Algorithms
#Hidden nodes
RMSE
Energy efficiency
Sigmoid
ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM
190 80 81 81 190 130 105 105 220 82 86 86 220 208 182 182
4.8952E 027 5.0472E 03 4.8988E 027 3.5454E 03 4.8711E 027 3.6369E 03 4.8711E 027 3.6369E 03 4.3066E 027 1.6941E 03 4.2996E 027 1.8785E 03 4.2993E 027 1.9261E 03 4.2993E 027 1.9261E 03 4.0442E 02 71.8273E 03 4.0288E 027 9.3419E 04 4.0388E 027 2.2874E 03 4.0388E 027 2.2874E 03 3.5343E 027 1.0892E 03 3.5393E 027 1.1035E 03 3.5380E 027 9.7920E 04 3.5380E 027 9.7920E 04
0.0187 0.002 9.622 7 0.098 16.909 7 0.361 6.930 7 0.046 0.0117 0.002 11.6477 0.337 16.0717 0.187 6.363 7 0.137 0.436 7 0.010 14.095 7 0.571 28.3487 0.857 11.7337 0.100 0.426 7 0.011 18.6147 0.670 14.3197 0.101 5.8017 0.112
0.030 7 0.002 0.0117 0.000 0.0127 0.002 0.0127 0.002 0.0097 0.001 0.006 7 0.001 0.004 7 0.000 0.0057 0.004 0.302 7 0.006 0.1157 0.009 0.1187 0.001 0.1177 0.001 0.3017 0.008 0.280 7 0.002 0.260 7 0.012 0.258 7 0.013
ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM
80 29 32 32 80 75 65 65 90 36 17 17 90 60 34 34
7.5580E 027 6.1177E 03 7.5289E 027 3.8640E 03 7.5570E 027 4.4971E 03 7.5570E 027 4.4971E 03 7.3716E 02 72.8558E 03 7.3731E 027 2.8613E 03 7.3728E 027 2.9638E 03 7.3728E 027 2.9638E 03 7.7460E 02 79.3898E 03 7.6741E 027 6.1075E 03 7.6085E 027 5.5149E 03 7.6085E 027 5.5149E 03 7.3686E 027 5.3856E 03 7.3654E 027 5.6984E 03 7.3307E 02 7 6.3927E 03 7.3307E 02 7 6.3927E 03
0.032 7 0.003 2.759 7 0.047 3.1797 0.060 2.558 7 0.015 0.0197 0.002 3.228 7 0.009 2.6787 0.071 2.439 7 0.056 1.1387 0.013 4.5317 0.019 5.0707 0.064 4.202 7 0.060 1.1167 0.006 4.932 7 0.016 4.9487 0.016 4.308 7 0.013
0.0357 0.001 0.0147 0.002 0.0157 0.002 0.0157 0.004 0.0157 0.001 0.0147 0.003 0.0137 0.003 0.0137 0.003 0.430 7 0.004 0.1777 0.006 0.087 7 0.004 0.087 7 0.004 0.430 7 0.008 0.286 7 0.003 0.1597 0.002 0.1597 0.002
ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM
100 67 52 52 100 85 75 75 70 41 46 46 70 62 59 59
2.0338E 017 2.4357E 03 2.0338E 017 1.5048E 03 2.0336E 017 1.7314E 03 2.0336E 017 1.7314E 03 2.0224E 017 1.5824E 03 2.0224E 017 1.4386E 03 2.0229E 017 1.4048E 03 2.0229E 017 1.4048E 03 2.1198E 017 2.7295E 03 2.1193E 017 1.5492E 03 2.1190E 017 2.1838E 03 2.1190E 017 2.1838E 03 2.1090E 017 1.2686E 03 2.1093E 017 1.2167E 03 2.1093E 017 1.2610E 03 2.1093E 017 1.2610E 03
0.052 7 0.009 6.2757 0.044 6.429 7 0.053 5.360 7 0.011 0.032 7 0.018 6.6337 0.011 5.8577 0.013 5.3117 0.050 1.2477 0.020 4.989 7 0.088 4.902 7 0.041 4.623 7 0.024 1.252 7 0.017 5.2187 0.031 4.7217 0.042 4.6147 0.019
0.020 7 0.001 0.0137 0.003 0.0117 0.002 0.0117 0.003 0.0217 0.001 0.0197 0.005 0.0177 0.007 0.0167 0.005 0.581 7 0.016 0.343 7 0.022 0.3787 0.003 0.3797 0.003 0.583 7 0.013 0.503 7 0.001 0.4837 0.004 0.4847 0.008
ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM
100 95 98 98 100 78 87 87 90 88 73 73 90 88 70 70
8.8070E 02 74.2246E 03 8.8091E 027 4.0819E 03 8.8077E 02 74.2395E 03 8.8077E 02 74.2395E 03 8.6679E 027 2.9691E 03 8.6737E 02 71.9706E 03 8.6713E 02 7 2.8901E 03 8.6713E 02 7 2.8901E 03 8.3086E 027 1.4600E 03 8.3040E 027 1.4769E 03 8.3085E 027 1.7361E 03 8.3085E 027 1.7361E 03 8.3111E 02 71.4513E 03 8.3101E 027 1.4383E 03 8.3117E 027 1.6314E 03 8.3117E 02 7 1.6314E 03
0.0147 0.002 1.6767 0.022 0.202 7 0.015 0.1177 0.018 0.0097 0.002 1.664 7 0.038 0.790 7 0.033 0.4157 0.010 0.3177 0.010 1.5727 0.040 1.1917 0.023 0.7317 0.007 0.309 7 0.006 1.595 7 0.019 1.1877 0.009 0.8047 0.008
0.006 7 0.004 0.0057 0.003 0.0057 0.002 0.0057 0.001 0.0147 0.001 0.0117 0.002 0.0127 0.003 0.0127 0.003 0.2797 0.009 0.269 7 0.001 0.2067 0.001 0.205 7 0.002 0.2727 0.011 0.2677 0.002 0.208 7 0.001 0.208 7 0.001
40 40 28 28 40 40 30
7.4643E 027 4.2603E 04 7.4643E 027 4.2603E 04 7.4624E 027 2.7234E 04 7.4624E 027 2.7234E 04 7.4583E 027 4.1528E 04 7.4583E 027 4.1528E 04 7.4587E 027 4.3381E 04
0.0167 0.001 1.0517 0.009 1.0157 0.048 0.953 7 0.043 0.0127 0.001 1.0757 0.003 0.999 7 0.005
0.285 7 0.001 0.286 7 0.003 0.166 7 0.003 0.1697 0.007 0.0097 0.001 0.0097 0.003 0.006 7 0.004
RBF
Sml2010
Sigmoid
RBF
Parkinsons
Sigmoid
RBF
Airfoil
Sigmoid
RBF
Abalone
Sigmoid
ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM
Training time (s)
Testing time (s)
Table 4 (continued ) Data sets
Hidden node type
RBF
Winequality white
Sigmoid
RBF
CCPP
Sigmoid
RBF
Ailerons
Sigmoid
RBF
Elevators
Sigmoid
RBF
Algorithms
#Hidden nodes
RMSE
Training time (s)
Testing time (s)
ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM
30 60 36 30 30 60 53 42 42
7.4587E 027 4.3381E 04 7.4558E 027 4.8491E 04 7.4552E 027 3.5553E 04 7.4513E 027 2.9363E 04 7.4513E 027 2.9363E 04 7.4279E 02 71.7412E 04 7.4276E 027 1.8356E 04 7.4285E 027 2.0016E 04 7.4285E 027 2.0016E 04
0.9717 0.003 0.7387 0.019 2.456 7 0.079 2.4957 0.032 2.2477 0.028 0.7137 0.010 2.5247 0.042 2.408 7 0.019 2.236 7 0.025
0.0067 0.003 0.3557 0.009 0.2147 0.005 0.1797 0.003 0.1807 0.008 0.360 7 0.024 0.3137 0.002 0.298 7 0.007 0.2977 0.012
ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM
90 58 53 53 90 69 79 79 70 50 55 55 70 58 47 47
1.2076E 017 8.1419E 04 1.2077E 017 7.3178E 04 1.2075E 017 7.9765E 04 1.2075E 017 7.9765E 04 1.2024E 017 4.6597E 04 1.2024E 017 4.7537E 04 1.2024E 017 4.4131E 04 1.2024E 017 4.4131E 04 1.2122E 01 75.1000E 04 1.2123E 01 75.0902E 04 1.2121E 017 5.1387E 04 1.2121E 017 5.1387E 04 1.2109E 017 4.6537E 04 1.2111E 017 4.5950E 04 1.2111E 017 4.6525E 04 1.2111E 017 4.6525E 04
0.032 7 0.004 3.4877 0.058 3.583 7 0.058 2.902 7 0.008 0.0247 0.016 3.6747 0.059 2.856 7 0.059 2.6417 0.016 0.923 7 0.027 4.0217 0.012 3.289 7 0.031 2.8707 0.009 0.928 7 0.035 4.1417 0.036 3.4727 0.039 3.096 7 0.022
0.030 7 0.001 0.0187 0.003 0.0167 0.004 0.0167 0.004 0.0197 0.003 0.0107 0.002 0.0107 0.003 0.0137 0.004 0.3117 0.026 0.208 7 0.004 0.236 7 0.005 0.2377 0.010 0.302 7 0.013 0.2337 0.004 0.1907 0.005 0.1917 0.006
ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM
220 211 174 174 220 206 180 180 230 208 205 205 230 189 150 150
5.3057E 027 9.3348E 05 5.3064E 027 8.5253E 05 5.3064E 027 1.0417E 04 5.3064E 027 1.0417E 04 5.3032E 027 4.6702E 05 5.3031E 027 4.2081E 05 5.3034E 027 4.9733E 05 5.3034E 027 4.9733E 05 5.2954E 027 1.0245E 04 5.2962E 027 9.6408E 05 5.2963E 027 9.2246E 05 5.2963E 027 9.2246E 05 5.3080E 027 4.3402E 05 5.3083E 027 4.7827E 05 5.3082E 027 6.4042E 05 5.3082E 027 6.4042E 05
0.305 7 0.009 23.2777 1.231 18.5117 1.532 9.0667 0.815 0.093 7 0.015 22.4987 0.250 19.654 7 2.670 8.6317 0.346 5.8167 0.037 30.308 7 0.107 19.0627 0.060 12.542 7 0.088 5.543 7 0.029 30.502 7 0.782 36.2467 1.428 18.8067 0.337
0.0407 0.002 0.039 7 0.006 0.0337 0.005 0.0337 0.004 0.0437 0.001 0.036 7 0.003 0.034 7 0.003 0.034 7 0.003 3.2357 0.024 2.960 7 0.010 2.925 7 0.011 2.9247 0.017 3.2177 0.015 2.7167 0.012 2.1177 0.006 2.1157 0.008
ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM
120 31 35 35 120 79 74 74 100 22 17 17 100 63 56 56
4.5491E 027 1.9189E 04 4.5495E 027 9.0525E 05 4.5483E 027 1.9785E 04 4.5483E 027 1.9785E 04 4.5153E 027 1.3803E 04 4.5155E 027 1.5541E 04 4.5151E 027 1.5914E 04 4.5151E 027 1.5914E 04 4.6844E 027 5.1310E 04 4.6754E 02 74.6998E 04 4.6828E 027 2.4087E 04 4.6828E 027 2.4087E 04 4.5955E 027 2.5765E 04 4.5975E 027 2.6970E 04 4.5975E 027 2.7819E 04 4.5975E 027 2.7819E 04
0.1377 0.003 19.4647 0.078 21.8047 0.155 19.5647 0.039 0.0757 0.019 21.282 7 0.123 20.8337 0.296 19.3717 0.157 3.298 7 0.205 18.6877 0.033 20.202 7 0.057 18.9157 0.031 3.1407 0.010 19.943 7 0.037 19.783 7 0.070 18.925 7 0.094
0.0467 0.001 0.028 7 0.004 0.029 7 0.005 0.029 7 0.004 0.0497 0.005 0.0487 0.012 0.0477 0.008 0.0477 0.011 2.9047 0.016 0.6777 0.005 0.596 7 0.048 0.595 7 0.007 2.905 7 0.013 1.9127 0.012 1.9017 0.010 1.902 7 0.014
ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM
60 40 28 28 60 55 43 43 50 34 28 28 50 40 34
3.9780E 027 9.2234E 04 3.9719E 027 3.4326E 04 3.9783E 027 8.2234E 04 3.9783E 027 8.2234E 04 3.9111E 027 6.3041E 04 3.9111E 027 6.0934E 04 3.9103E 02 7 5.1272E 04 3.9103E 027 5.1272E 04 4.2481E 027 7.2707E 04 4.2314E 02 71.2490E 03 4.2291E 027 1.1311E 03 4.2291E 027 1.1311E 03 4.1710E 027 6.4155E 04 4.1738E 027 9.138E 04 4.1697E 02 78.0712E 04
0.0767 0.004 13.543 7 0.011 13.5617 0.086 13.260 7 0.029 0.0357 0.002 13.759 7 0.012 13.465 7 0.071 13.3537 0.028 1.9157 0.009 13.0757 0.047 13.0797 0.068 12.883 7 0.070 1.926 7 0.022 13.2357 0.021 13.0627 0.031
0.0277 0.001 0.025 7 0.005 0.0247 0.005 0.0247 0.007 0.0357 0.001 0.029 7 0.003 0.025 7 0.008 0.025 7 0.002 1.753 7 0.090 1.1707 0.007 0.980 7 0.005 0.983 7 0.013 1.7107 0.040 1.3547 0.004 1.1797 0.011
Table 4 (continued ) Data sets
Pole
Hidden node type
Sigmoid
RBF
Algorithms
#Hidden nodes
Testing time (s)
34
4.1697E 02 78.0712E 04
12.962 7 0.063
1.1787 0.010
ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM
300 225 187 187 300 248 280 280 300 93 80 80 300 242 233 233
2.0204E 01 74.6552E 03 2.0190E 017 4.6273E 03 2.0203E 017 3.9885E 03 2.0203E 017 3.9885E 03 1.9994E 01 72.9768E 03 1.9994E 01 72.9057E 03 1.9993E 01 72.9785E 03 1.9993E 01 72.9785E 03 2.4132E 017 8.4840E 03 2.4119E 01 74.3560E 03 2.4113E 01 74.0748E 03 2.4113E 01 74.0748E 03 2.3101E 017 3.8627E 03 2.3099E 017 4.0630E 03 2.3098E 017 3.9031E 03 2.3098E 017 3.9031E 03
0.7577 0.047 60.0077 1.070 75.7077 9.695 34.563 7 1.071 0.305 7 0.036 61.292 7 2.037 28.202 7 0.162 16.946 7 0.502 13.384 7 0.263 55.5787 0.377 105.659 7 0.461 54.8747 0.263 12.8217 0.290 74.855 7 0.548 76.1527 2.083 41.526 7 1.350
0.080 7 0.007 0.0747 0.011 0.0577 0.002 0.0577 0.005 0.090 7 0.022 0.0757 0.005 0.0767 0.001 0.0767 0.002 6.323 7 0.132 2.038 7 0.007 1.666 7 0.012 1.666 7 0.009 6.3307 0.155 5.1937 0.019 5.1137 0.026 5.1137 0.139
The experiment environment consists of a Windows 7 32-bit operating system, an Intel Core™ i3-2310M CPU @ 2.10 GHz, 2.00 GB RAM, and the Matlab 2013b platform. Ten benchmark data sets (three multi-output data sets plus seven single-output ones) are utilized to do the experiments, which include Energy efficiency, Sml2010, Parkinsons, Airfoil, Abalone, Winequality white, CCPP, Ailerons, Elevators, and Pole. The front seven data sets are from the well-known UCI machine learning repository (http://archive.ics.uci.edu/ml/), and the rear three are available from the data collection at http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html. For each data set, before doing experiments, we get rid of the repeated samples and the redundant inputs. Then the remaining inputs are normalized into the range [-1, 1], while the outputs are normalized into [0, 1]. In addition, we divide every data set into two parts, i.e., the training set and the testing set, whose details are shown in Table 3.

In this paper, two commonly used hidden node types are chosen as the activation functions of the hidden layer, i.e., the sigmoid $h(x) = 1/\left(1 + \exp(-x^T a_i)\right)$ and the RBF $h(x) = \exp\left(-b_i \| x - a_i \|_2^2\right)$, where $a_i$ is randomly chosen from the range [-1, 1] and $b_i$ is chosen from the range (0, 0.5) [12]. We select ELM, CP-ELM, DP-ELM, ADP-ELM, RELM, CP-RELM, DP-RELM, and ADP-RELM to implement the tests. First of all, we employ cross validation to determine a nearly optimal number of hidden nodes (#Hidden nodes) for ELM from the set $\{10, 20, \cdots, 300\}$, and then we extend it to the other algorithms. In RELM, CP-RELM, DP-RELM, and ADP-RELM, there is an additional parameter, viz. the regularization parameter $\mu$, to be determined. Here, it is chosen from $\{2^{-50}, \cdots, 2^{0}, \cdots, 2^{5}\}$ using the cross-validation technique [33]. To statistically obtain robust and reliable results, 30 trials are carried out with every algorithm on each data set. To facilitate the comparisons, the root mean square error (RMSE) is used as the performance index, which is defined as
$$\mathrm{RMSE} = \sqrt{\frac{1}{\#\mathrm{Testing}\cdot m}\sum_{i=1}^{\#\mathrm{Testing}} \| \tilde{t}_i - t_i \|_F^2} \qquad (43)$$
where #Testing denotes the number of testing samples and $\tilde{t}_i$ is the estimated vector of the testing sample $t_i$.

Figs. 1-10 show the training time against #Hidden nodes. For CP-ELM and CP-RELM, the training time increases with the increasing #Hidden nodes. These phenomena are generated by the fact that CP-ELM and CP-RELM start with null models and then gradually pick up hidden nodes one by one. In contrast, for DP-ELM, ADP-ELM, DP-RELM, and ADP-RELM, the training time decreases with the increasing #Hidden nodes. The main reason is that these destructive algorithms start with the full model and then remove the insignificant hidden nodes step by step. Under the same number of hidden nodes, ADP-ELM (ADP-RELM) needs less training time than DP-ELM (DP-RELM), which demonstrates that the accelerating scheme proposed in this paper indeed speeds up the original training process. In other words, the effectiveness of the presented scheme is verified experimentally. When choosing the same number of hidden nodes, CP-ELM (DP-ELM, and ADP-ELM) is generally superior to CP-RELM (DP-RELM, and ADP-RELM) with respect to the training time. One major reason is that during the process of obtaining $R_{L\times L}$ and $\hat{T}_{L\times m}$ with (9) and (22), respectively, RELM with the expanded matrix $\begin{bmatrix} H \\ \sqrt{\mu}\, I_{L\times L} \end{bmatrix}$ usually needs more decomposing time than ELM. Before removing the insignificant hidden nodes, ADP-ELM (ADP-RELM) has the same training time as DP-ELM (DP-RELM), because the accelerating scheme has not yet started to work.

Figs. 11-20 show the RMSE against #Hidden nodes. In these figures, the dash line represents the ELM testing accuracy, and the RELM testing accuracy is denoted by the dash-dot line. In theory, the dash-dot line should lie under the dash line. That is, RELM should obtain better generalization performance than ELM, because ELM is a special case of RELM with the regularization parameter $\mu = 0$. However, for Airfoil and CCPP with RBF hidden nodes RELM is inferior to ELM. The reason is that RELM chooses the optimal regularization parameter from the finite set $\{2^{-50}, \cdots, 2^{0}, \cdots, 2^{5}\}$ instead of the whole real space. Note that the RMSE decreases with the increasing #Hidden nodes. Generally speaking, at first the RMSE drops very quickly, but it decreases slowly when approaching the benchmark line (the dash line or the dash-dot line), because the learning capability gradually weakens as #Hidden nodes becomes larger. At the same #Hidden nodes, the testing accuracy of ADP-ELM (ADP-RELM) overlaps that of DP-ELM (DP-RELM), which demonstrates that the proposed equivalent scheme does not alter the generalization performance. When CP-ELM, DP-ELM, and ADP-ELM (CP-RELM, DP-RELM, and ADP-RELM) touch (or nearly touch) the dash line (the dash-dot line), we terminate them mandatorily. For these cases, Table 4 gives their statistical results.
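For reference, the RMSE of Eq. (43) used in all of the above comparisons can be computed with a small helper such as the following illustrative sketch (not the authors' code):

```python
# RMSE of Eq. (43); T_est and T_true have shape (#Testing, m).
import numpy as np

def rmse(T_est, T_true):
    n_test, m = T_true.shape
    return np.sqrt(np.sum((T_est - T_true) ** 2) / (n_test * m))
```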
When CP-ELM and DP-ELM reach nearly the same accuracy as ELM, in general CP-ELM needs more hidden nodes than DP-ELM, but it takes the advantage in terms of the training time. When the presented accelerating scheme is applied to DP-ELM, ADP-ELM outperforms CP-ELM with respect to both the training time and the number of hidden nodes. ADP-ELM is therefore meaningful in cost-sensitive scenarios, since it owns the priority over CP-ELM in both the training time and the testing time. While arriving at nearly the same accuracy as RELM, DP-RELM works better than CP-RELM in the training time and #Hidden nodes. ADP-RELM not only keeps the same accuracy as DP-RELM, but also reduces the training time further, which is more beneficial for training-time-sensitive scenarios. In addition, ELM (RELM) requires the least training time among ELM, CP-ELM, DP-ELM, and ADP-ELM (RELM, CP-RELM, DP-RELM, and ADP-RELM) because no additional actions are performed on it.
6. Conclusions

Single-hidden layer feedforward networks have recently drawn much attention due to their excellent performance and simple forms. Unlike the traditional SLFNs, where the input weights and the biases of the hidden layer need to be tuned, ELM randomly generates these parameters independently of the training samples, and its output weights can be analytically determined. In this situation, ELM may generate redundant hidden nodes. Hence, it is necessary to improve the compactness of the ELM network. CP-ELM and DP-ELM can reduce the number of hidden nodes without impairing the generalization performance. Compared with CP-ELM, DP-ELM excels in the number of hidden nodes, but it needs more training time. Hence, an equivalent measure is proposed in this paper to accelerate the training process of DP-ELM. ADP-ELM improves DP-ELM's performance on the training time so that ADP-ELM needs less training time than CP-ELM. In other words, ADP-ELM is superior to CP-ELM in terms of the training time and the number of hidden nodes, which is important for training-time-sensitive scenarios. The same holds for RELM: ADP-RELM improves the training process of DP-RELM further, and it needs fewer hidden nodes and less training time than CP-RELM. Experimental results have shown that the proposed scheme is indeed able to accelerate DP-ELM and DP-RELM, and the analysis of the computational complexity also supports this viewpoint.
Acknowledgment

This research was partially supported by the National Natural Science Foundation of China under Grant no. 51006052, and the NUST Outstanding Scholar Supporting Program. Moreover, the authors wish to thank the anonymous reviewers for their constructive comments and great help in the writing process, which improved the manuscript significantly.

References

[1] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: a new learning scheme of feedforward neural networks, in: Proceedings of the IEEE International Joint Conference on Neural Networks, 2004, pp. 985-990.
[2] G.-B. Huang, C.-K. Siew, Extreme learning machine: RBF network case, in: Proceedings of the 8th International Conference on Control, Automation, Robotics and Vision, 2004, pp. 1029-1036.
[3] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (1-3) (2006) 489-501.
[4] G. Huang, G.-B. Huang, S. Song, K. You, Trends in extreme learning machines: a review, Neural Netw. 61 (2015) 32-48.
[5] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning representations by back-propagating errors, Nature 323 (6088) (1986) 533-536.
[6] Y.A. LeCun, L. Bottou, G.B. Orr, K.-R. Muller, Efficient backprop, Lect. Notes Comput. Sci. (2012) 9-48.
[7] G.-B. Huang, X. Ding, H. Zhou, Optimization method based extreme learning machine for classification, Neurocomputing 74 (1-3) (2010) 155-163.
[8] G.-B. Huang, H. Zhou, X. Ding, R. Zhang, Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst. Man Cybern. Part B: Cybern. 42 (2) (2012) 513-529.
[9] X. Liu, C. Gao, P. Li, A comparative analysis of support vector machines and extreme learning machines, Neural Netw. 33 (2012) 58-66.
[10] S. Lin, X. Liu, J. Fang, Z. Xu, Is extreme learning machine feasible? A theoretical assessment (Part II), IEEE Trans. Neural Netw. Learn. Syst. 26 (1) (2015) 21-34.
[11] X. Liu, S. Lin, J. Fang, Z. Xu, Is extreme learning machine feasible? A theoretical assessment (Part I), IEEE Trans. Neural Netw. Learn. Syst. 26 (1) (2015) 7-20.
[12] G.-B. Huang, L. Chen, C.-K. Siew, Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Trans. Neural Netw. 17 (4) (2006) 879-892.
[13] G.-B. Huang, L. Chen, Convex incremental extreme learning machine, Neurocomputing 70 (16-18) (2007) 3056-3062.
[14] G.-B. Huang, L. Chen, Enhanced random search based incremental extreme learning machine, Neurocomputing 71 (2008) 3460-3468.
[15] G. Feng, G.-B. Huang, Q. Lin, R. Gay, Error minimized extreme learning machine with growth of hidden nodes and incremental learning, IEEE Trans. Neural Netw. 20 (8) (2009) 1352-1357.
[16] Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, A. Lendasse, OP-ELM: optimally pruned extreme learning machine, IEEE Trans. Neural Netw. 21 (1) (2010) 158-162.
[17] J. Deng, K. Li, G.W. Irwin, Fast automatic two-stage nonlinear model identification based on the extreme learning machine, Neurocomputing 74 (16) (2011) 2422-2429.
[18] E. Soria-Olivas, J. Gomez-Sanchis, J.D. Martin, J. Vila-Frances, M. Martinez, J.R. Magdalena, A.J. Serrano, BELM: Bayesian extreme learning machine, IEEE Trans. Neural Netw. 22 (3) (2011) 505-509.
[19] Y. Yang, Y. Wang, X. Yuan, Bidirectional extreme learning machine for regression problem and its learning effectiveness, IEEE Trans. Neural Netw. Learn. Syst. 23 (9) (2012) 1498-1505.
[20] R. Zhang, Y. Lan, G.-B. Huang, Z.-B. Xu, Universal approximation of extreme learning machine with adaptive growth of hidden nodes, IEEE Trans. Neural Netw. Learn. Syst. 23 (2) (2012) 365-371.
[21] R. Zhang, Y. Lan, G.-B. Huang, Z.-B. Xu, Y.C. Soh, Dynamic extreme learning machine and its approximation capability, IEEE Trans. Cybern. 43 (6) (2013) 2054-2065.
[22] A. Castano, F. Fernandez-Navarro, C. Hervas-Martinez, PCA-ELM: a robust and pruned extreme learning machine approach based on principal component analysis, Neural Process. Lett. 37 (3) (2013) 377-392.
[23] H. Zhou, G.-B. Huang, Z. Lin, H. Wang, Y.C. Soh, Stacked extreme learning machines, IEEE Trans. Cybern. (2015).
[24] W.-Y. Deng, Q.-H. Zheng, Z.-M. Wang, Projection vector machine, Neurocomputing 120 (2013) 490-498.
[25] H.-J. Rong, Y.-S. Ong, A.-H. Tan, Z. Zhu, A fast pruned-extreme learning machine for classification problem, Neurocomputing 72 (1-3) (2008) 359-366.
[26] D. Du, K. Li, G.W. Irwin, J. Deng, A novel automatic two-stage locally regularized classifier construction method using the extreme learning machine, Neurocomputing 102 (2013) 10-22.
[27] Z. Bai, G.-B. Huang, D. Wang, H. Wang, M.B. Westover, Sparse extreme learning machine for classification, IEEE Trans. Cybern. 44 (10) (2014) 1858-1870.
[28] J. Luo, C.-M. Vong, P.-K. Wong, Sparse Bayesian extreme learning machine for multi-classification, IEEE Trans. Neural Netw. Learn. Syst. 25 (4) (2014) 836-843.
[29] N. Wang, M.J. Er, M. Han, Parsimonious extreme learning machine using recursive orthogonal least squares, IEEE Trans. Neural Netw. Learn. Syst. 25 (10) (2014) 1828-1841.
[30] Y.-P. Zhao, K.-K. Wang, Y.-B. Li, Parsimonious regularized extreme learning machine based on orthogonal transformation, Neurocomputing 156 (2015) 280-296.
[31] W. Deng, Q. Zheng, L. Chen, Regularized extreme learning machine, in: Proceedings of the 2009 IEEE Symposium on Computational Intelligence and Data Mining, 2009, pp. 389-395.
[32] X. Zhang, Matrix Analysis and Applications, Tsinghua University Press, Beijing, 2004.
[33] Y. Zhao, K. Wang, Fast cross validation for regularized extreme learning machine, J. Syst. Eng. Electron. 25 (5) (2014) 895-900.
Yong-Ping Zhao received his B.S. degree in thermal energy and power engineering from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in July 2004. Since then, he worked toward the M.S. and Ph.D. degrees in kernel methods at Nanjing University of Aeronautics and Astronautics, and he received the Ph.D. degree in December 2009. Currently, he is an associate professor with the School of Mechanical Engineering, Nanjing University of Science & Technology. His research interests include machine learning and kernel methods.
Bing Li was born in 1990. He received the B.S. degree from Xinxiang University, China, in 2014. He is currently pursuing the M.Eng. degree at Nanjing University of Science and Technology, China. His research interests include machine learning, pattern recognition, etc.
Ye-Bo Li studied in Shenyang Aerospace University of China from September 2004 to June 2008, majored in aircraft power engineering and received the B.S. degree. From September 2008 to October 2014, he studied in Nanjing University of Aeronautics and Astronautics of China, majored in aerospace propulsion theory and engineering, and received the M.S. and Ph.D. degrees. Now, he is an engineer at AVIC Aeroengine Control Research Institute of China. His research interests include modelling, control and fault diagnosis of aircraft engines.