An accelerating scheme for destructive parsimonious extreme learning machine

Yong-Ping Zhao a,*, Bing Li a, Ye-Bo Li b

a School of Mechanical Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
b AVIC Aeroengine Control Research Institute, Wuxi 214063, China

* Corresponding author. E-mail address: [email protected] (Y.-P. Zhao).
Article history: Received 27 December 2014; Received in revised form 10 February 2015; Accepted 2 April 2015. Communicated by G.-B. Huang.

Abstract

Constructive and destructive parsimonious extreme learning machines (CP-ELM and DP-ELM) were recently proposed to sparsify ELM. In comparison with CP-ELM, DP-ELM has the advantage in the number of hidden nodes, but it loses the edge with respect to the training time. Hence, in this paper an equivalent measure is proposed to accelerate DP-ELM (ADP-ELM). As a result, ADP-ELM not only keeps the same hidden nodes as DP-ELM but also needs less training time than CP-ELM, which is especially important for training-time-sensitive scenarios. A similar idea is extended to the regularized ELM (RELM), yielding ADP-RELM. ADP-RELM further accelerates the training process of DP-RELM, and it works better than CP-RELM in terms of the number of hidden nodes and the training time. In addition, the computational complexity of the proposed accelerating scheme is analyzed in theory. The results reported on ten benchmark data sets confirm the effectiveness and usefulness of the proposed accelerating scheme experimentally.
© 2015 Elsevier B.V. All rights reserved.

Keywords: Single-hidden layer feedforward network; Extreme learning machine; Destructive algorithm; Constructive algorithm; Sparseness
1. Introduction

The widespread popularity of single-hidden layer feedforward networks (SLFNs) in extensive fields is mainly due to their power in approximating complex nonlinear mappings and their simple forms. As a specific type of SLFN, the extreme learning machine (ELM) [1-3] has recently drawn a lot of interest from researchers and engineers. Generally, training an ELM consists of two main stages [4]: (1) random feature mapping and (2) solving the linear parameters. In the first stage, ELM randomly initializes the hidden layer to map the input samples into a so-called ELM feature space with some nonlinear mapping functions, which can be any nonlinear piecewise continuous functions, such as the sigmoid and the RBF. Since in ELM the hidden node parameters are randomly generated (independently of the training samples) according to any continuous probability distribution without tuning, instead of being explicitly trained, it owns a remarkable computational advantage over regular gradient-descent backpropagation [5,6]. That is, unlike conventional learning methods that must see the training samples before generating the hidden node parameters, ELM can generate the hidden node parameters before seeing the training samples. In the second stage, the linear parameters can be obtained by solving the Moore-Penrose generalized
inverse of the hidden-layer output matrix, which reaches both the smallest training error and the smallest norm of output weights [7]. Hence, different from most algorithms proposed for feedforward neural networks, which did not consider the generalization performance when they were first proposed, ELM can achieve better generalization performance. From the interpolation perspective, Huang et al. [3] indicated that for any given training set, there exists an ELM network which gives a sufficiently small training error in the squared error sense with probability one, and the number of hidden nodes is not larger than the number of distinct training samples. If the number of hidden nodes equals the number of distinct training samples, then the training error decreases to zero with probability one. Actually, in the implementation of ELM, it is found that the generalization performance of ELM is not sensitive to the number of hidden nodes, and good performance can be reached as long as the number of hidden nodes is large enough [8]. In addition, unlike traditional SLFNs, such as radial basis function neural networks and the multilayer perceptron with one hidden layer, where the activation functions are required to be continuous or differentiable, ELM can choose the threshold function and many others as activation functions without sacrificing the universal approximation capability at all. In statistical learning theory, the Vapnik-Chervonenkis (VC) dimension theory is one of the most widely used frameworks in generalization bound analysis. According to the structural risk minimization perspective, to obtain better generalization performance on the testing set, an algorithm should not only
Table 1
The computational complexity of ADP-ELM/ADP-RELM at the ith iteration.

Operation                                                                          | Multiplication (x)                                                            | Addition (+)                                                                  | Division (/)              | Square root
$\left(R^{(i-1)}\right)^{-1}$                                                      | $\frac{1}{6}L_i^3-\frac{1}{6}L_i$                                             | $\frac{1}{6}L_i^3-\frac{1}{6}L_i$                                             | $\frac{1}{2}L_i^2+\frac{1}{2}L_i$ | 0
$\hat{\beta}^{(i-1)}$                                                              | $\frac{1}{2}L_i(L_i+1)m$                                                      | $\frac{1}{2}L_i(L_i-1)m$                                                      | 0                         | 0
$\|\hat{\beta}_{j_s}^{(i-1)}\|_2^2/\|r'_{j_s}(i-1)\|_2^2,\ j_s=1,\cdots,L_i$        | $\frac{1}{2}L_i^2+\frac{1}{2}L_i+L_i m$                                       | $\frac{1}{2}L_i^2-\frac{3}{2}L_i+L_i m$                                       | $L_i$                     | 0
In sum                                                                             | $\frac{1}{6}L_i^3+\frac{1}{2}L_i^2+\frac{1}{3}L_i+\frac{3}{2}L_i m+\frac{1}{2}L_i^2 m$ | $\frac{1}{6}L_i^3+\frac{1}{2}L_i^2-\frac{5}{3}L_i+\frac{1}{2}L_i^2 m+\frac{1}{2}L_i m$ | $\frac{1}{2}L_i^2+\frac{3}{2}L_i$ | 0

Note: $L_i = L - i + 1$.
Table 2
The computational complexity of DP-ELM/DP-RELM at the ith iteration.

Operation                                                            | Multiplication (x)                                                     | Addition (+)                                                   | Division (/)  | Square root
Retriangularization via Eq. (17), $j_s=1,\cdots,L_i-1$               | $\frac{2}{3}L_i^3+L_i^2-\frac{5}{3}L_i+2L_i^2 m-2L_i m$                | $\frac{2}{3}L_i^3+\frac{1}{2}L_i^2-\frac{5}{6}L_i+L_i^2 m-L_i m$ | $L_i^2-L_i$   | $\frac{1}{2}L_i^2-\frac{1}{2}L_i$
$\|t_{i_s}^{(i)}\|_F^2,\ j_s=1,\cdots,L_i$                           | $L_i m$                                                                | $L_i(m-1)$                                                     | 0             | 0
In sum                                                               | $\frac{2}{3}L_i^3+L_i^2-\frac{5}{3}L_i+2L_i^2 m-L_i m$                 | $\frac{2}{3}L_i^3+\frac{1}{2}L_i^2-\frac{11}{6}L_i+L_i^2 m$     | $L_i^2-L_i$   | $\frac{1}{2}L_i^2-\frac{1}{2}L_i$

Note: $L_i = L - i + 1$.
Table 3
Specifications on each benchmark data set.

Data set           | Hidden node type | #Hidden nodes | log2(mu) | #Training | #Testing | #Inputs | #Outputs
Energy efficiency  | Sigmoid          | 190           | -15      | 450       | 318      | 8       | 2
                   | RBF              | 220           | -20      | 450       | 318      | 8       | 2
Sml2010            | Sigmoid          | 80            | -2       | 3000      | 1137     | 16      | 2
                   | RBF              | 90            | -8       | 3000      | 1137     | 16      | 2
Parkinsons         | Sigmoid          | 100           | -10      | 4000      | 1875     | 16      | 2
                   | RBF              | 70            | -13      | 4000      | 1875     | 16      | 2
Airfoil            | Sigmoid          | 100           | -17      | 800       | 703      | 5       | 1
                   | RBF              | 90            | -30      | 800       | 703      | 5       | 1
Abalone            | Sigmoid          | 40            | -11      | 2800      | 1377     | 8       | 1
                   | RBF              | 60            | -10      | 2800      | 1377     | 8       | 1
Winequality white  | Sigmoid          | 90            | -8       | 3000      | 961      | 11      | 1
                   | RBF              | 70            | -11      | 3000      | 961      | 11      | 1
CCPP               | Sigmoid          | 220           | -34      | 6000      | 3527     | 4       | 1
                   | RBF              | 230           | -35      | 6000      | 3527     | 4       | 1
Ailerons           | Sigmoid          | 120           | -1       | 7154      | 6596     | 40      | 1
                   | RBF              | 100           | -10      | 7154      | 6596     | 40      | 1
Elevators          | Sigmoid          | 60            | -4       | 8752      | 7847     | 18      | 1
                   | RBF              | 50            | -11      | 8752      | 7847     | 18      | 1
Pole               | Sigmoid          | 300           | -6       | 10000     | 4958     | 26      | 1
                   | RBF              | 300           | -27      | 10000     | 4958     | 26      | 1

Notes: #Training represents the number of training samples, #Testing represents the number of testing samples, #Inputs represents the number of input variables, and #Outputs represents the number of output variables.
achieve a low training error on the training set, but also have a low VC dimension. Aiming at classification tasks, Liu et al. [9] proved that the VC dimension of an ELM is equal to its number of hidden nodes with probability one, which states that ELM has a relatively low VC dimension. With respect to regression problems, the generalization ability of ELM has been comprehensively studied in [10] and [11], leading to the conclusion that ELM with some suitable activation functions, such as polynomials, the sigmoid and the Nadaraya-Watson function, can achieve the same optimal generalization bound as an SLFN in which all parameters are tunable.

From the above analyses, it is known that ELM owns excellent universal approximation capability and generalization performance. However, due to the fact that ELM generates hidden nodes randomly, it usually requires more hidden nodes than a traditional SLFN to get matched performance. A large network size always signifies more running time in the testing phase. In cost-sensitive learning, the testing time
Fig. 1. The training time on Energy efficiency: (a) sigmoid and (b) RBF.

Fig. 2. The training time on Sml2010: (a) sigmoid and (b) RBF.
should be minimized, which requires a compact network to meet the testing time budget. Thus, the topic of improving the compactness of ELM has recently attracted great interest. First, Huang et al. [12] proposed an incremental ELM (I-ELM), which randomly generates the hidden nodes and analytically calculates the output weights. Since I-ELM does not recalculate the output weights of all the existing nodes when a new node is added, its convergence rate can be further improved by recalculating the output weights of the existing nodes based on a convex optimization method when a new hidden node is randomly added [13]. In I-ELM, there may be some hidden nodes which play a very minor role in the network output and eventually increase the network complexity. In order to avoid this problem and to obtain a more compact network, an enhanced version of I-ELM was presented [14], where in each learning step several hidden nodes are randomly generated and, among them, the hidden node leading to the largest residual error decrease is added to the existing network. In [15], an error-minimized ELM was proposed, which can add random hidden nodes one by one or group by group, and during the growth of the network the output weights are updated incrementally. In [16], an optimally pruned extreme learning machine was presented, which is based on the original ELM with additional steps to make it more robust and compact. Deng et al. [17] adopted a two-stage stepwise strategy to improve the compactness of the ELM. At the first stage, the selection procedure can be automatically terminated based on the leave-one-out error. At the second stage, the contribution of each hidden node is reviewed, and insignificant ones are replaced. This procedure does not terminate until no insignificant hidden nodes exist in the final model. When the Bayesian approach is applied to ELM, a Bayesian ELM is obtained [18]. This Bayesian method can not only optimize the network architecture, but also make use of a priori knowledge and obtain confidence intervals. Subsequently, a new learning algorithm called bidirectional ELM was presented [19], which can greatly enhance the learning effectiveness, reduce the number of hidden nodes, and eventually further increase the learning speed. In addition, the traditional ELM was further extended by Zhang et al. [20] to use Lebesgue p-integrable hidden activation functions, which can approximate any Lebesgue p-integrable function on a compact input set. Further, a dynamic ELM, where the hidden nodes can be recruited or deleted dynamically according to their significance to the network performance, was proposed [21], so that not only the parameters can be adjusted but also the architecture can be self-adapted simultaneously. Castano et al. [22] proposed a robust and pruned ELM approach, where principal component analysis is utilized to select the hidden nodes from the input features while the corresponding input weights are deterministically defined as principal components rather than random ones. To solve large and complex sample issues, a stacked ELM was designed [23], which divides a single large ELM network into multiple stacked small ELMs which are serially connected. In addition, to improve
Fig. 3. The training time on Parkinsons: (a) sigmoid and (b) RBF.

Fig. 4. The training time on Airfoil: (a) sigmoid and (b) RBF.
the performance of ELM on high-dimension and small-sample problems, a projection vector machine [24] was proposed. Aiming at classification tasks [7,25-28], several algorithms were proposed to optimize the network size of ELM. Recently, novel constructive and destructive parsimonious ELMs (CP-ELM and DP-ELM) were proposed based on recursive orthogonal least squares [29]. CP-ELM starts with a small initial network and gradually adds new hidden nodes until a satisfactory solution is found. On the contrary, DP-ELM starts by training a larger-than-necessary network and then removes the insignificant nodes one by one. Further, Zhao et al. [30] extended CP-ELM and DP-ELM to the regularized ELM (RELM), correspondingly obtaining CP-RELM and DP-RELM. Generally speaking, compared with CP-ELM, DP-ELM can obtain a more compact network when reaching nearly the same accuracy, but it loses the edge on the training time. The main reason is that when DP-ELM removes each insignificant hidden node, a series of Givens rotations is implemented to compute the additional residual error reduction. As is known, the computational burden of Givens rotations is high. Hence, a scheme is proposed to accelerate DP-ELM (ADP-ELM) instead of using Givens rotations. ADP-ELM not only needs fewer hidden nodes than CP-ELM, but also beats it in the training time. When this accelerating scheme is extended to DP-RELM, ADP-RELM is yielded. ADP-RELM reduces the training time of DP-RELM further, and outperforms CP-RELM in terms of the number of hidden nodes and the training time. In theory, the proposed accelerating scheme acquires the same effect as Givens rotations but requires less training time when removing the insignificant hidden nodes. Finally, extensive experiments are conducted on benchmark data sets, and the results verify the effectiveness of the proposed scheme.

The remainder of this paper is organized as follows. In Section 2, ELM and RELM are briefly introduced. As two destructive algorithms, DP-ELM and DP-RELM are described in Section 3. In Section 4, an equivalent measure is proposed instead of Givens rotations to accelerate DP-ELM and DP-RELM, yielding ADP-ELM and ADP-RELM, respectively, and their computational complexity is analyzed. Experimental results on ten benchmark data sets are presented in Section 5. Finally, conclusions follow.
2. ELM and RELM

2.1. ELM

For $N$ arbitrary distinct samples $\{(x_i, t_i)\}_{i=1}^{N}$, where $x_i \in \Re^n$ and $t_i \in \Re^m$, the traditional SLFN with $L$ hidden nodes and activation function $h(x)$ can be mathematically modeled as
$$y = \sum_{i=1}^{L} \beta_i h(x; a_i, b_i) \qquad (1)$$
where $a_i \in \Re^n$ is the input weight vector connecting the ith hidden node and the input nodes, $b_i \in \Re$ is the bias of the ith hidden node,
Fig. 5. The training time on Abalone: (a) sigmoid and (b) RBF.

Fig. 6. The training time on Winequality white: (a) sigmoid and (b) RBF.
and $\beta_i \in \Re^m$ is the output weight vector connecting the ith hidden node and the output nodes. If the outputs of the SLFN amount to the targets, the following compact formulation is obtained:
$$H\beta = T \qquad (2)$$
where $\beta = [\beta_1, \cdots, \beta_L]^T$, $T = [t_1, \cdots, t_N]^T$, and
$$H = \begin{bmatrix} h(x_1; a_1, b_1) & \cdots & h(x_1; a_L, b_L) \\ \vdots & \ddots & \vdots \\ h(x_N; a_1, b_1) & \cdots & h(x_N; a_L, b_L) \end{bmatrix} \qquad (3)$$
here $H$ is the so-called hidden-layer output matrix. For the traditional SLFN, the parameters $\{(a_i, b_i)\}_{i=1}^{L}$ in (1) are usually determined using gradient-based methods. However, in ELM these parameters are randomly initialized before seeing the training samples. In this case, training an ELM simply amounts to finding the output weight matrix $\beta$ of the linear system (2). A simple solution of (2) is given explicitly as follows:
$$\hat{\beta} = H^{\dagger} T \qquad (4)$$
where $H^{\dagger} = (H^T H)^{-1} H^T$ is the Moore-Penrose generalized inverse of $H$. The solution $\hat{\beta}$ in (4) minimizes the training error as well as the norm of the output weights [7]. In other words, ELM aims to reach better generalization performance by reaching both the smallest training error and the smallest norm of output weights. As for this point, most algorithms for SLFNs do not consider the generalization performance when they are first proposed. Actually, solving (2) in the squared error sense is equivalent to minimizing the following problem:
$$\min_{\beta} G_{ELM} = \| H\beta - T \|_F^2 \qquad (5)$$
where $\|\cdot\|_F$ denotes the Frobenius norm. Letting $dG_{ELM}/d\beta = 0$ yields the same solution as (4).

2.2. RELM

Due to the random choice of the input weights and the hidden layer biases, the hidden layer output matrix $H$ may not be of full column rank, or the condition number of $H^T H$ may be too large. In these scenarios, solving $\hat{\beta}$ directly using (4) may degrade the ELM accuracy. To this end, after adding a regularization term to (5), the mathematical model of RELM becomes [31]
$$\min_{\beta} G_{RELM} = \| H\beta - T \|_F^2 + \mu \| \beta \|_F^2 \qquad (6)$$
where $\mu$ is the regularization parameter. Letting $dG_{RELM}/d\beta = 0$ yields
$$\hat{\beta} = \left( H^T H + \mu I \right)^{-1} H^T T \qquad (7)$$
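To make the two closed-form solutions above concrete, the following minimal NumPy sketch trains an ELM through the pseudoinverse of Eq. (4) and an RELM through the regularized solution of Eq. (7). It only illustrates the formulas and is not the authors' code; the function and variable names (train_elm, hidden_output, X, T, mu) and the toy data are invented for the example, and a sigmoid hidden node is assumed.

```python
# A minimal sketch of ELM (Eqs. (1)-(4)) and RELM (Eqs. (6)-(7)) training.
import numpy as np

def hidden_output(X, A, b):
    """ELM feature mapping: H[i, j] = h(x_i; a_j, b_j) with a sigmoid node."""
    return 1.0 / (1.0 + np.exp(-(X @ A + b)))

def train_elm(X, T, L, mu=0.0, seed=None):
    """Return (A, b, beta); mu = 0 gives ELM (Eq. (4)), mu > 0 gives RELM (Eq. (7))."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    A = rng.uniform(-1.0, 1.0, size=(n, L))   # random input weights
    b = rng.uniform(-1.0, 1.0, size=L)        # random biases
    H = hidden_output(X, A, b)
    if mu == 0.0:
        beta = np.linalg.pinv(H) @ T          # beta = H^+ T, Eq. (4)
    else:
        beta = np.linalg.solve(H.T @ H + mu * np.eye(L), H.T @ T)  # Eq. (7)
    return A, b, beta

# toy usage
X = np.random.rand(200, 3); T = np.sin(X.sum(axis=1, keepdims=True))
A, b, beta = train_elm(X, T, L=40, mu=2.0**-10, seed=0)
Y = hidden_output(X, A, b) @ beta             # network output, Eq. (1)
```

Setting mu to zero recovers plain ELM, while any positive mu gives the RELM solution of (7).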
Fig. 7. The training time on CCPP: (a) sigmoid and (b) RBF.

Fig. 8. The training time on Ailerons: (a) sigmoid and (b) RBF.
3. DP-ELM and DP-RELM

3.1. DP-ELM

3.1.1. Preliminary work

According to the conclusion in [3], it has been proved that an ELM with at most $N$ hidden nodes and with almost any nonlinear activation function can exactly learn $N$ distinct samples. In fact, in real-world applications, the number of hidden nodes is always less than the number of training samples, i.e., $L < N$, and the matrix $H$ is of full column rank with probability one. Hence, the following theorem is obtained.

Theorem 1. The minimizer of Eq. (5) amounts to the following minimizer:
$$\min_{\beta} \left\{ G'_{ELM} = \| R_{L\times L}\beta - \hat{T}_{L\times m} \|_F^2 \right\} \qquad (8)$$
where
$$H = Q \begin{bmatrix} R_{L\times L} \\ 0_{(N-L)\times L} \end{bmatrix} \qquad (9)$$
$$Q^T T = \begin{bmatrix} \hat{T}_{L\times m} \\ \tilde{T}_{(N-L)\times m} \end{bmatrix} \qquad (10)$$
here $Q$ is an $N\times N$ orthogonal matrix satisfying $Q^T Q = QQ^T = I$, and $R_{L\times L}$ is an upper triangular matrix with the same rank as $H$.

Proof. Since the matrix $H$ is of full column rank, it can be decomposed as $H = Q \begin{bmatrix} R_{L\times L} \\ 0_{(N-L)\times L} \end{bmatrix}$ using the QR decomposition according to matrix theory [32], where $Q$ is an orthogonal matrix and $R_{L\times L}$ is an upper triangular matrix. Thus, we get
$$\| H\beta - T \|_F^2 = \| Q^T H\beta - Q^T T \|_F^2 = \left\| \begin{bmatrix} R_{L\times L} \\ 0_{(N-L)\times L} \end{bmatrix}\beta - \begin{bmatrix} \hat{T}_{L\times m} \\ \tilde{T}_{(N-L)\times m} \end{bmatrix} \right\|_F^2 = \left\| \begin{bmatrix} R_{L\times L}\beta - \hat{T}_{L\times m} \\ -\tilde{T}_{(N-L)\times m} \end{bmatrix} \right\|_F^2 = \| R_{L\times L}\beta - \hat{T}_{L\times m} \|_F^2 + \| \tilde{T}_{(N-L)\times m} \|_F^2 \qquad (11)$$
Plugging (11) into (5) and eliminating the constant part $\| \tilde{T}_{(N-L)\times m} \|_F^2$, Eq. (8) is obtained. Now, this proof is completed. □

Since Eqs. (5) and (8) have the same minimizer, we do not distinguish $G_{ELM}$ and $G'_{ELM}$ in this paper. In addition, according to (10) we know
$$\| T \|_F^2 = \| \hat{T}_{L\times m} \|_F^2 + \| \tilde{T}_{(N-L)\times m} \|_F^2 \qquad (12)$$
where $\| \tilde{T}_{(N-L)\times m} \|_F^2$ is defined as the initial residual error, and $\| \hat{T}_{L\times m} \|_F^2$ is the additional residual error.
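The reduction in Theorem 1 can be checked numerically: after an economy-size QR decomposition of a stand-in for $H$, solving the small triangular system of Eq. (13) below reproduces the full least-squares minimizer of Eq. (5). This is only an illustrative sketch; the randomly generated matrices and all names are not taken from the paper.

```python
# Numerical check of Theorem 1 / Eqs. (8)-(13) on random data.
import numpy as np
from scipy.linalg import solve_triangular

N, L, m = 300, 25, 2
H = np.random.randn(N, L)                 # stand-in for a hidden-layer output matrix
T = np.random.randn(N, m)

Q, R = np.linalg.qr(H, mode='reduced')    # H = Q R with R upper triangular (L x L), Eq. (9)
T_hat = Q.T @ T                           # the first L rows of Q^T T, Eq. (10)
beta_qr = solve_triangular(R, T_hat)      # backward substitution, Eq. (13)

beta_ls = np.linalg.lstsq(H, T, rcond=None)[0]   # minimizer of Eq. (5)
print(np.allclose(beta_qr, beta_ls))      # True: the two minimizers coincide
```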
Fig. 9. The training time on Elevators: (a) sigmoid and (b) RBF.

Fig. 10. The training time on Pole: (a) sigmoid and (b) RBF.

The output weight matrix $\hat{\beta}$ in (4) can be got by solving
$$R\hat{\beta} = \hat{T} \qquad (13)$$
where $R$ and $\hat{T}$ are from (8) after omitting the subscripts. Because $R$ is an upper triangular matrix, $\hat{\beta}$ in Eq. (13) is easily got with backward substitutions. Although an ELM can be obtained easily through (13), its solution is dense. That is, there may exist insignificant hidden nodes in the solution. Hence, we need to improve the compactness of ELM. It should be pointed out that every regressor (or column) $r_i$ of the matrix $R = [r_1, \cdots, r_L]$ in (8) corresponds to one hidden node. Hence, compacting ELM amounts to determining the significant regressors from $R$.

3.1.2. Realization

The DP-ELM starts with the full set of candidate regressors, i.e., let the initial regressor matrix $R^{(0)} = R$, $T^{(0)} = \hat{T}$, and take a full index set $P = \{1, 2, \cdots, L\}$; then it gradually removes the insignificant hidden nodes one by one until the stopping criterion is met. Assume that at the ith iteration the candidate regressor matrix $R_s^{(i)}$ is obtained by removing the regressor $r_s^{(i-1)}$ from the previous regressor matrix $R^{(i-1)}$:
$$R_s^{(i)} \leftarrow R^{(i-1)} \backslash r_s^{(i-1)},\quad s \in P \qquad (14)$$
Then, $R_s^{(i)}$ is retriangularized by a series of Givens rotations as
$$R_s^{(i)} = Q^{(i)} \begin{bmatrix} R^{(i)} \\ 0_{1\times(L-i)} \end{bmatrix} \qquad (15)$$
together with
$$\left(Q^{(i)}\right)^T T^{(i-1)} = \begin{bmatrix} T^{(i)} \\ t_{i_s}^{(i)} \end{bmatrix} \qquad (16)$$
where $Q^{(i)}$ consists of a series of Givens rotations satisfying $\left(Q^{(i)}\right)^T Q^{(i)} = Q^{(i)}\left(Q^{(i)}\right)^T = I_{(L-i+1)\times(L-i+1)}$, $R^{(i)}$ is an upper triangular matrix, and $t_{i_s}^{(i)}$ is the last row vector of the transformed targets. Obviously, $\| t_{i_s}^{(i)} \|_F^2$ is the increase of the additional residual error due to the removal of the regressor $r_s^{(i-1)}$. In fact, Eq. (15) can become much simpler, since the Givens rotations only need to be performed on the columns after $r_s^{(i-1)}$ in $R^{(i-1)}$. Hence, together with (16) they are simplified as
$$\left[ R_{s,j_s}^{(i)} \;\; T_{j_s}^{(i-1)} \right] = Q_{j_s}^{(i)} \begin{bmatrix} R_{j_s}^{(i)} & T_{j_s}^{(i)} \\ 0_{1\times(L-i-j_s+1)} & t_{i_s}^{(i)} \end{bmatrix} \qquad (17)$$
where $j_s$ is the column index of the regressor $r_s^{(i-1)}$ in the matrix $R^{(i-1)}$, $R_{s,j_s}^{(i)}$ and $R_{j_s}^{(i)}$ are the submatrices including elements from the $j_s$th row and $j_s$th column to the end of $R_s^{(i)}$ and $R^{(i)}$, respectively, $T_{j_s}^{(i-1)}$ and $T_{j_s}^{(i)}$ are the submatrices including elements from the $j_s$th row to the end of $T^{(i-1)}$ and $T^{(i)}$, respectively, and the orthogonal matrix $Q_{j_s}^{(i)}$ is governed by
$$Q^{(i)} = \begin{bmatrix} I_{(j_s-1)\times(j_s-1)} & \\ & Q_{j_s}^{(i)} \end{bmatrix} \qquad (18)$$
Hence, the index of the ith regressor to be removed is
$$s^{\dagger} = \arg\min_{s\in P} \frac{\| t_{i_s}^{(i)} \|_F^2}{\| T \|_F^2} \qquad (19)$$
Eq. (19) shows that the regressor incurring the least increase of the additional residual error will be removed. After removing the ith regressor $r_{s^{\dagger}}^{(i-1)}$, the index set $P$ is updated as $P \leftarrow P\backslash s^{\dagger}$. DP-ELM does not stop until the following criterion is reached:
$$\frac{\| T^{(i)} \|_F^2}{\| T \|_F^2} \le 1 - \rho \qquad (20)$$
where $0 < \rho \ll 1$, which can balance the training error against the model complexity.

Fig. 11. The testing accuracy on Energy efficiency: (a) sigmoid and (b) RBF.

Fig. 12. The testing accuracy on Sml2010: (a) sigmoid and (b) RBF.

3.2. DP-RELM

Theorem 2. The minimizer of Eq. (6) is equivalent to the following minimizer:
$$\min_{\beta} \left\{ G'_{RELM} = \| R_{L\times L}\beta - \hat{T}_{L\times m} \|_F^2 \right\} \qquad (21)$$
where
$$\begin{bmatrix} H \\ \sqrt{\mu}\, I_{L\times L} \end{bmatrix} = Q \begin{bmatrix} R_{L\times L} \\ 0_{N\times L} \end{bmatrix} \qquad (22)$$
$$Q^T \begin{bmatrix} T \\ 0_{L\times m} \end{bmatrix} = \begin{bmatrix} \hat{T}_{L\times m} \\ \tilde{T}_{N\times m} \end{bmatrix} \qquad (23)$$
here $Q$ is an $(N+L)\times(N+L)$ orthogonal matrix satisfying $Q^T Q = QQ^T = I$, and $R_{L\times L}$ is an upper triangular matrix with rank $L$.

The proof of Theorem 2 is similar to that of Theorem 1, so we omit it here. The matrices $R_{L\times L}$ and $\hat{T}_{L\times m}$ in Theorems 1 and 2 are different, but we adopt the same symbols, which does not lead to misunderstanding. When we obtain $R_{L\times L}$ and $\hat{T}_{L\times m}$ from Theorem 2, DP-RELM is naturally obtained by adopting the same realization procedure as DP-ELM.
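The destructive step described by Eqs. (14)-(19) can be sketched as follows. For simplicity a dense QR factorization is used in place of the sequence of Givens rotations in (17); it produces the same quantity $\|t_{i_s}^{(i)}\|_F^2$, namely the residual increase caused by deleting one regressor. The toy matrices and all names below are illustrative only, not the paper's implementation.

```python
# A sketch of one destructive step (Eqs. (14)-(19)).
import numpy as np

def removal_costs(R, T):
    """Return ||t_is||_F^2 for every candidate column of R (Eqs. (16)-(17))."""
    L = R.shape[1]
    costs = np.empty(L)
    for j in range(L):
        R_s = np.delete(R, j, axis=1)          # Eq. (14): drop regressor j
        Q, _ = np.linalg.qr(R_s, mode='complete')   # stands in for the Givens rotations
        t_is = (Q.T @ T)[-1]                   # last row of the transformed targets, Eq. (16)
        costs[j] = np.sum(t_is ** 2)           # increase of the additional residual error
    return costs

R = np.triu(np.random.randn(6, 6))
T = np.random.randn(6, 2)
costs = removal_costs(R, T)
s_dagger = np.argmin(costs / np.sum(T ** 2))   # Eq. (19): cheapest regressor to remove
```

In the paper the Givens-based update touches only the affected columns and is therefore cheaper than recomputing a full QR, but the quantity minimized in (19) is the same.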
Fig. 13. The testing accuracy on Parkinsons: (a) sigmoid and (b) RBF.

Fig. 14. The testing accuracy on Airfoil: (a) sigmoid and (b) RBF.

4. ADP-ELM and ADP-RELM

4.1. An equivalent measure

Assume that at the (i-1)th iteration in DP-ELM or DP-RELM, $R^{(i-1)}$ and $T^{(i-1)}$ are obtained, and let
$$\hat{G}^{(i-1)} = \min_{\beta^{(i-1)}} \left\{ G^{(i-1)} = \| R^{(i-1)}\beta^{(i-1)} - T^{(i-1)} \|_F^2 \right\} \qquad (24)$$
Evidently, $\hat{G}^{(i-1)} = 0$. When the regressor $r_s^{(i-1)}$ is removed at the ith iteration, Eq. (24) becomes
$$\hat{G}_s^{(i)} = \min_{\beta_s^{(i)}} \left\{ G_s^{(i)} = \| R_s^{(i)}\beta_s^{(i)} - T^{(i-1)} \|_F^2 \right\} \qquad (25)$$
Then, define
$$\Delta_s^{(i)} = \hat{G}_s^{(i)} - \hat{G}^{(i-1)} = \hat{G}_s^{(i)} \qquad (26)$$
here $\Delta_s^{(i)}$ represents the increase of the objective function in (24) due to the removal of the regressor $r_s^{(i-1)}$. Therefore, the following theorem is obtained.

Theorem 3. The following formula holds:
$$\Delta_s^{(i)} = \| t_{i_s}^{(i)} \|_F^2 \qquad (27)$$
where $t_{i_s}^{(i)}$ is from (16).

Proof. Substituting (15) and (16) into (25), we get
$$G_s^{(i)} = \| R_s^{(i)}\beta^{(i)} - T^{(i-1)} \|_F^2 = \left\| Q^{(i)}\begin{bmatrix} R^{(i)} \\ 0_{1\times(L-i)} \end{bmatrix}\beta^{(i)} - T^{(i-1)} \right\|_F^2 = \left\| \begin{bmatrix} R^{(i)} \\ 0_{1\times(L-i)} \end{bmatrix}\beta^{(i)} - \begin{bmatrix} T^{(i)} \\ t_{i_s}^{(i)} \end{bmatrix} \right\|_F^2 = \| R^{(i)}\beta^{(i)} - T^{(i)} \|_F^2 + \| t_{i_s}^{(i)} \|_F^2 \qquad (28)$$
Because $\| R^{(i)}\hat{\beta}^{(i)} - T^{(i)} \|_F^2 = 0$, hence $\hat{G}_s^{(i)} = \| t_{i_s}^{(i)} \|_F^2$. □

To remove the regressor $r_{s^{\dagger}}^{(i-1)}$ at the ith iteration, we need to compute $\| t_{i_s}^{(i)} \|_F^2$ a total of $L-i+1$ times, which consumes a lot of time. From Theorem 3, note that the issue of computing $\| t_{i_s}^{(i)} \|_F^2$ can be sidestepped by finding the equivalent measure $\Delta_s^{(i)}$. If we can seek a computationally cheap method of calculating $\Delta_s^{(i)}$, then the bottleneck of computing $\| t_{i_s}^{(i)} \|_F^2$ will be circumvented.

4.2. An accelerating scheme

If at the ith iteration each $\Delta_s^{(i)}$ is computed directly according to (25), the computational burden is very heavy, not cheaper than that of computing $\| t_{i_s}^{(i)} \|_F^2$. Hence, it is necessary to seek a method of computing each $\Delta_s^{(i)}$ cheaply. Assume that the column index of the regressor $r_s^{(i-1)}$ to be removed in the matrix $R^{(i-1)}$ is $j_s$; then the following theorem holds.
Fig. 15. The testing accuracy on Abalone: (a) sigmoid and (b) RBF.

Fig. 16. The testing accuracy on Winequality white: (a) sigmoid and (b) RBF.
Theorem 4. The formula
$$\Delta_s^{(i)} = \frac{\| \hat{\beta}_{j_s}^{(i-1)} \|_2^2}{\Lambda_{j_s}^{(i-1)}} \qquad (29)$$
holds, where $\hat{\beta}_{j_s}^{(i-1)}$ is the $j_s$th row vector of the minimizer $\hat{\beta}^{(i-1)}$ of Eq. (24), and $\Lambda_{j_s}^{(i-1)}$ is the $j_s$th diagonal element of $\left(\left(R^{(i-1)}\right)^T R^{(i-1)}\right)^{-1}$.

Proof. First, consider the following optimization problem:
$$\hat{G}_s^{(i-1)} = \min_{\beta_s^{(i-1)}} \left\{ G_s^{(i-1)} = \| R^{(i-1)}\beta_s^{(i-1)} - T^{(i-1)} \|_F^2 + \| \Theta\beta_s^{(i-1)} \|_F^2 \right\} \qquad (30)$$
where $\Theta = \mathrm{diag}\left( 0_1, \cdots, 0_{j_s-1}, \theta, 0_{j_s+1}, \cdots, 0_{L-i+1} \right)$. Comparing (30) with (25), note that
$$\hat{G}_s^{(i)} = \lim_{\theta\to+\infty} \hat{G}_s^{(i-1)} \qquad (31)$$
Letting $dG_s^{(i-1)}/d\beta_s^{(i-1)} = 0$ gets
$$\hat{\beta}_s^{(i-1)} = \left( \left(R^{(i-1)}\right)^T R^{(i-1)} + \Theta^T\Theta \right)^{-1} \left(R^{(i-1)}\right)^T T^{(i-1)} \qquad (32)$$
Plugging (32) into (30) yields
$$\hat{G}_s^{(i-1)} = \| T^{(i-1)} \|_F^2 - \mathrm{tr}\left( \left(T^{(i-1)}\right)^T R^{(i-1)} \left( \left(R^{(i-1)}\right)^T R^{(i-1)} + \Theta^T\Theta \right)^{-1} \left(R^{(i-1)}\right)^T T^{(i-1)} \right) \qquad (33)$$
where $\mathrm{tr}(\cdot)$ represents the trace of a square matrix. In addition,
$$\hat{G}^{(i-1)} = \| T^{(i-1)} \|_F^2 - \mathrm{tr}\left( \left(T^{(i-1)}\right)^T R^{(i-1)} \left( \left(R^{(i-1)}\right)^T R^{(i-1)} \right)^{-1} \left(R^{(i-1)}\right)^T T^{(i-1)} \right) \qquad (34)$$
According to the Woodbury formula [32], we get
$$\left( \left(R^{(i-1)}\right)^T R^{(i-1)} + \Theta^T\Theta \right)^{-1} = \left( \left(R^{(i-1)}\right)^T R^{(i-1)} \right)^{-1} - \left( \left(R^{(i-1)}\right)^T R^{(i-1)} \right)^{-1}\Theta^T \left( I + \Theta\left( \left(R^{(i-1)}\right)^T R^{(i-1)} \right)^{-1}\Theta^T \right)^{-1} \Theta\left( \left(R^{(i-1)}\right)^T R^{(i-1)} \right)^{-1} \qquad (35)$$
So,
$$\hat{G}_s^{(i-1)} - \hat{G}^{(i-1)} = \mathrm{tr}\left( \left(\hat{\beta}^{(i-1)}\right)^T \Theta^T \left( I + \Theta\left( \left(R^{(i-1)}\right)^T R^{(i-1)} \right)^{-1}\Theta^T \right)^{-1} \Theta\,\hat{\beta}^{(i-1)} \right) \qquad (36)$$
Hence,
$$\Delta_s^{(i)} = \hat{G}_s^{(i)} - \hat{G}^{(i-1)} = \lim_{\theta\to+\infty} \mathrm{tr}\left( \left(\hat{\beta}^{(i-1)}\right)^T \Theta^T \left( I + \Theta\left( \left(R^{(i-1)}\right)^T R^{(i-1)} \right)^{-1}\Theta^T \right)^{-1} \Theta\,\hat{\beta}^{(i-1)} \right) \qquad (37)$$
Fig. 17. The testing accuracy on CCPP: (a) sigmoid and (b) RBF.

Fig. 18. The testing accuracy on Ailerons: (a) sigmoid and (b) RBF.

which is simplified as
$$\Delta_s^{(i)} = \lim_{\theta\to+\infty} \frac{\| \hat{\beta}_{j_s}^{(i-1)} \|_2^2\,\theta^2}{1 + \Lambda_{j_s}^{(i-1)}\theta^2} = \frac{\| \hat{\beta}_{j_s}^{(i-1)} \|_2^2}{\Lambda_{j_s}^{(i-1)}} \qquad (38)$$
Up to now, the proof has been finished. □

Since $R^{(i-1)}$ is an upper triangular matrix, on the basis of matrix theory $\left(R^{(i-1)}\right)^{-1}$ is easily got with backward substitutions, and it is also an upper triangular matrix, here denoted by $\left(R^{(i-1)}\right)^{-1} = \left[ r'_1(i-1), \cdots, r'_{L_i}(i-1) \right]^T$. Therefore
$$\Lambda_{j_s}^{(i-1)} = \| r'_{j_s}(i-1) \|_2^2 \qquad (39)$$
Substituting (39) into (29), we get
$$\Delta_s^{(i)} = \frac{\| \hat{\beta}_{j_s}^{(i-1)} \|_2^2}{\| r'_{j_s}(i-1) \|_2^2} \qquad (40)$$
Hence, the index of the ith regressor to be removed for ADP-ELM and ADP-RELM is found as
$$s^{\dagger} = \arg\min_{s\in P} \frac{\| \hat{\beta}_{j_s}^{(i-1)} \|_2^2}{\| r'_{j_s}(i-1) \|_2^2} \qquad (41)$$
As for the stopping criterion, Eq. (20) is also suitable for ADP-ELM and ADP-RELM. Of course, we can predefine a positive integer $M$ for ADP-ELM and ADP-RELM: when $M$ regressors are left, the procedure of removing the insignificant hidden nodes terminates, which is convenient for controlling the network size.

4.3. Analysis of computational complexity

Assume that ADP-ELM or ADP-RELM is implemented at the ith iteration. To remove the regressor $r_{s^{\dagger}}^{(i-1)}$ from the candidate matrix $R^{(i-1)}$ using Eq. (41), we first need to compute $\left(R^{(i-1)}\right)^{-1}$. Due to $R^{(i-1)}$ being an upper triangular matrix, $\left(R^{(i-1)}\right)^{-1}$ can be easily got with backward substitutions. When $\left(R^{(i-1)}\right)^{-1}$ is found, $\hat{\beta}^{(i-1)}$ and $\| r'_{j_s}(i-1) \|_2^2$ can be acquired. Then, Eq. (41) is used to find the regressor $r_{s^{\dagger}}^{(i-1)}$ to be removed. For the above procedure, the computational complexity of each step is listed in Table 1.

When DP-ELM or DP-RELM is at the ith iteration, in order to obtain $t_{i_s}^{(i)}$ with (17), a series of Givens rotations need to be constructed in this process. For example, assume that $u$ is a two-dimensional vector denoted by $u = [u_1, u_2]^T$; then a Givens rotation is simply shown as
$$\begin{bmatrix} v & w \\ -w & v \end{bmatrix}\begin{bmatrix} u_1 \\ u_2 \end{bmatrix} = \begin{bmatrix} \| u \|_2 \\ 0 \end{bmatrix} \qquad (42)$$
where $v = u_1/r$, $w = u_2/r$, and $r = \sqrt{u_1^2 + u_2^2}$. Notice that constructing a Givens rotation needs four multiplications, one addition, two divisions, and one square root. When $t_{i_s}^{(i)}$ is got using (17),
then the index $s^{\dagger}$ is determined via (19). The computational complexity of DP-ELM or DP-RELM is tabulated in Table 2. Comparing Table 1 with Table 2, it is easily concluded that the computational complexity of ADP-ELM/ADP-RELM is lower than that of DP-ELM/DP-RELM for every operation. Naturally, ADP-ELM/ADP-RELM accelerates the training process of DP-ELM/DP-RELM while keeping the same accuracy.
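The equivalence established by Theorem 4 and Eqs. (39)-(41) can also be verified numerically: the ratio $\|\hat{\beta}_{j_s}\|_2^2/\|r'_{j_s}\|_2^2$ computed from the inverse of the triangular matrix coincides with the residual increase obtained through the orthogonal retriangularization route of DP-ELM. The sketch below uses a dense QR as a stand-in for the Givens rotations and randomly generated matrices; all names are illustrative rather than taken from the paper.

```python
# Numerical check that the accelerated measure (Eq. (40)) equals the Givens/QR route.
import numpy as np
from scipy.linalg import solve_triangular

L, m = 6, 2
R = np.triu(np.random.randn(L, L)) + 5 * np.eye(L)   # well-conditioned upper triangular R^(i-1)
T = np.random.randn(L, m)                             # transformed targets T^(i-1)

R_inv = solve_triangular(R, np.eye(L))                # backward substitutions; still upper triangular
beta_hat = R_inv @ T                                  # beta_hat^(i-1) = (R^(i-1))^{-1} T^(i-1)
delta = np.sum(beta_hat ** 2, axis=1) / np.sum(R_inv ** 2, axis=1)   # Eq. (40), one value per regressor

# brute force: residual increase when column j is removed (Eqs. (14)-(17))
brute = np.empty(L)
for j in range(L):
    Q, _ = np.linalg.qr(np.delete(R, j, axis=1), mode='complete')
    brute[j] = np.sum(((Q.T @ T)[-1]) ** 2)           # ||t_is||_F^2

print(np.allclose(delta, brute))                      # True: the two measures coincide
```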
4.4. The flowchart of ADP-ELM and ADP-RELM

In summary, the flowchart of ADP-ELM/ADP-RELM is depicted as follows.

Algorithm 1. ADP-ELM/ADP-RELM
1 Initializations: Input the training samples $\{(x_i, t_i)\}_{i=1}^{N}$. Choose the type of activation function. Choose a small $\rho > 0$ or a positive integer $M$ for controlling the number of final hidden nodes exactly. Let $P = \{1, \cdots, L\}$ and $i = 1$.
2 Generate the hidden layer output matrix $H$ with $L$ random hidden nodes.
3 For ADP-RELM, determine the appropriate regularization parameter $\mu$.
4 Obtain $R$ and $\hat{T}$ according to (9) and (10) ((22) and (23)), respectively, for ADP-ELM (ADP-RELM).
5 Let $R^{(0)} = R$ and $T^{(0)} = \hat{T}$.
9 Compute the upper triangular matrix $\left(R^{(i-1)}\right)^{-1}$ based on $R^{(i-1)}$ with backward substitutions.
10 Calculate $\hat{\beta}^{(i-1)} = \left(R^{(i-1)}\right)^{-1} T^{(i-1)}$.
11 Determine the regressor $r_{s^{\dagger}}^{(i-1)}$ to be removed according to (41).
12 If Eq. (20) is satisfied or $i > L - M$
     Go to step 16.
   Else
13   Obtain $R_{s^{\dagger}}^{(i)}$ according to (14).
14   Get $R^{(i)}$ and $T^{(i)}$ from (15) and (16), respectively.
15   Let $P \leftarrow P\backslash s^{\dagger}$ and $i \leftarrow i+1$, and go to step 6.
   End if
16 Solve $R^{(i)}\hat{\beta}^{(i)} = T^{(i)}$ with backward substitutions.
   Output ADP-ELM/ADP-RELM: $f(x) = \sum_{j\in P} \hat{\beta}_j^{(i)} h(x; a_j, b_j)$.

Fig. 19. The testing accuracy on Elevators: (a) sigmoid and (b) RBF.

Fig. 20. The testing accuracy on Pole: (a) sigmoid and (b) RBF.
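The following compact sketch strings the pieces of Algorithm 1 together for ADP-ELM. It uses a sigmoid feature map and replaces the QR/Givens machinery of Eqs. (9)-(10) and (14)-(16) with NumPy's dense QR, and it simply stops when a prescribed number M of nodes is left; every name, the toy data and these simplifications are illustrative assumptions rather than the paper's implementation.

```python
# A compact, simplified sketch of Algorithm 1 (ADP-ELM).
import numpy as np
from scipy.linalg import solve_triangular

def adp_elm(X, T, L, M, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.uniform(-1, 1, size=(X.shape[1], L))
    b = rng.uniform(-1, 1, size=L)
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))           # random sigmoid hidden layer
    keep = list(range(L))                            # index set P
    Q, R = np.linalg.qr(H, mode='reduced')           # Eq. (9)
    T_hat = Q.T @ T                                  # Eq. (10)
    while len(keep) > M:
        R_inv = solve_triangular(R, np.eye(len(keep)))   # backward substitutions
        beta = R_inv @ T_hat                             # beta_hat^(i-1)
        delta = np.sum(beta ** 2, axis=1) / np.sum(R_inv ** 2, axis=1)  # Eq. (41)
        j = int(np.argmin(delta))                        # regressor to remove
        keep.pop(j)                                      # P <- P \ {s}
        R_s = np.delete(R, j, axis=1)                    # Eq. (14)
        Qs, R_new = np.linalg.qr(R_s, mode='complete')   # stands in for Eqs. (15)-(16)
        R, T_hat = R_new[:-1], (Qs.T @ T_hat)[:-1]
    beta = solve_triangular(R, T_hat)                    # final output weights
    return A[:, keep], b[keep], beta

# usage: keep 15 of 60 random nodes
X = np.random.rand(400, 4); T = np.sin(X.sum(axis=1, keepdims=True))
A_k, b_k, beta = adp_elm(X, T, L=60, M=15)
Y = 1.0 / (1.0 + np.exp(-(X @ A_k + b_k))) @ beta
```

For ADP-RELM the only change would be to build $R$ and $\hat{T}$ from the expanded matrices of Eqs. (22)-(23) instead of Eqs. (9)-(10).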
5. Experiments
In this section, we experimentally study the usefulness of the proposed scheme in accelerating the training processes of DP-ELM and DP-RELM.
Table 4 The detailed experimental results on benchmark data sets. Data sets
Hidden node type
Algorithms
#Hidden nodes
RMSE
Energy efficiency
Sigmoid
ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM
190 80 81 81 190 130 105 105 220 82 86 86 220 208 182 182
4.8952E 027 5.0472E 03 4.8988E 027 3.5454E 03 4.8711E 027 3.6369E 03 4.8711E 027 3.6369E 03 4.3066E 027 1.6941E 03 4.2996E 027 1.8785E 03 4.2993E 027 1.9261E 03 4.2993E 027 1.9261E 03 4.0442E 02 71.8273E 03 4.0288E 027 9.3419E 04 4.0388E 027 2.2874E 03 4.0388E 027 2.2874E 03 3.5343E 027 1.0892E 03 3.5393E 027 1.1035E 03 3.5380E 027 9.7920E 04 3.5380E 027 9.7920E 04
0.0187 0.002 9.622 7 0.098 16.909 7 0.361 6.930 7 0.046 0.0117 0.002 11.6477 0.337 16.0717 0.187 6.363 7 0.137 0.436 7 0.010 14.095 7 0.571 28.3487 0.857 11.7337 0.100 0.426 7 0.011 18.6147 0.670 14.3197 0.101 5.8017 0.112
0.030 7 0.002 0.0117 0.000 0.0127 0.002 0.0127 0.002 0.0097 0.001 0.006 7 0.001 0.004 7 0.000 0.0057 0.004 0.302 7 0.006 0.1157 0.009 0.1187 0.001 0.1177 0.001 0.3017 0.008 0.280 7 0.002 0.260 7 0.012 0.258 7 0.013
ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM
80 29 32 32 80 75 65 65 90 36 17 17 90 60 34 34
7.5580E 027 6.1177E 03 7.5289E 027 3.8640E 03 7.5570E 027 4.4971E 03 7.5570E 027 4.4971E 03 7.3716E 02 72.8558E 03 7.3731E 027 2.8613E 03 7.3728E 027 2.9638E 03 7.3728E 027 2.9638E 03 7.7460E 02 79.3898E 03 7.6741E 027 6.1075E 03 7.6085E 027 5.5149E 03 7.6085E 027 5.5149E 03 7.3686E 027 5.3856E 03 7.3654E 027 5.6984E 03 7.3307E 02 7 6.3927E 03 7.3307E 02 7 6.3927E 03
0.032 7 0.003 2.759 7 0.047 3.1797 0.060 2.558 7 0.015 0.0197 0.002 3.228 7 0.009 2.6787 0.071 2.439 7 0.056 1.1387 0.013 4.5317 0.019 5.0707 0.064 4.202 7 0.060 1.1167 0.006 4.932 7 0.016 4.9487 0.016 4.308 7 0.013
0.0357 0.001 0.0147 0.002 0.0157 0.002 0.0157 0.004 0.0157 0.001 0.0147 0.003 0.0137 0.003 0.0137 0.003 0.430 7 0.004 0.1777 0.006 0.087 7 0.004 0.087 7 0.004 0.430 7 0.008 0.286 7 0.003 0.1597 0.002 0.1597 0.002
ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM
100 67 52 52 100 85 75 75 70 41 46 46 70 62 59 59
2.0338E 017 2.4357E 03 2.0338E 017 1.5048E 03 2.0336E 017 1.7314E 03 2.0336E 017 1.7314E 03 2.0224E 017 1.5824E 03 2.0224E 017 1.4386E 03 2.0229E 017 1.4048E 03 2.0229E 017 1.4048E 03 2.1198E 017 2.7295E 03 2.1193E 017 1.5492E 03 2.1190E 017 2.1838E 03 2.1190E 017 2.1838E 03 2.1090E 017 1.2686E 03 2.1093E 017 1.2167E 03 2.1093E 017 1.2610E 03 2.1093E 017 1.2610E 03
0.052 7 0.009 6.2757 0.044 6.429 7 0.053 5.360 7 0.011 0.032 7 0.018 6.6337 0.011 5.8577 0.013 5.3117 0.050 1.2477 0.020 4.989 7 0.088 4.902 7 0.041 4.623 7 0.024 1.252 7 0.017 5.2187 0.031 4.7217 0.042 4.6147 0.019
0.020 7 0.001 0.0137 0.003 0.0117 0.002 0.0117 0.003 0.0217 0.001 0.0197 0.005 0.0177 0.007 0.0167 0.005 0.581 7 0.016 0.343 7 0.022 0.3787 0.003 0.3797 0.003 0.583 7 0.013 0.503 7 0.001 0.4837 0.004 0.4847 0.008
ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM
100 95 98 98 100 78 87 87 90 88 73 73 90 88 70 70
8.8070E 02 74.2246E 03 8.8091E 027 4.0819E 03 8.8077E 02 74.2395E 03 8.8077E 02 74.2395E 03 8.6679E 027 2.9691E 03 8.6737E 02 71.9706E 03 8.6713E 02 7 2.8901E 03 8.6713E 02 7 2.8901E 03 8.3086E 027 1.4600E 03 8.3040E 027 1.4769E 03 8.3085E 027 1.7361E 03 8.3085E 027 1.7361E 03 8.3111E 02 71.4513E 03 8.3101E 027 1.4383E 03 8.3117E 027 1.6314E 03 8.3117E 02 7 1.6314E 03
0.0147 0.002 1.6767 0.022 0.202 7 0.015 0.1177 0.018 0.0097 0.002 1.664 7 0.038 0.790 7 0.033 0.4157 0.010 0.3177 0.010 1.5727 0.040 1.1917 0.023 0.7317 0.007 0.309 7 0.006 1.595 7 0.019 1.1877 0.009 0.8047 0.008
0.006 7 0.004 0.0057 0.003 0.0057 0.002 0.0057 0.001 0.0147 0.001 0.0117 0.002 0.0127 0.003 0.0127 0.003 0.2797 0.009 0.269 7 0.001 0.2067 0.001 0.205 7 0.002 0.2727 0.011 0.2677 0.002 0.208 7 0.001 0.208 7 0.001
40 40 28 28 40 40 30
7.4643E 027 4.2603E 04 7.4643E 027 4.2603E 04 7.4624E 027 2.7234E 04 7.4624E 027 2.7234E 04 7.4583E 027 4.1528E 04 7.4583E 027 4.1528E 04 7.4587E 027 4.3381E 04
0.0167 0.001 1.0517 0.009 1.0157 0.048 0.953 7 0.043 0.0127 0.001 1.0757 0.003 0.999 7 0.005
0.285 7 0.001 0.286 7 0.003 0.166 7 0.003 0.1697 0.007 0.0097 0.001 0.0097 0.003 0.006 7 0.004
RBF
Sml2010
Sigmoid
RBF
Parkinsons
Sigmoid
RBF
Airfoil
Sigmoid
RBF
Abalone
Sigmoid
ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM
Training time (s)
Testing time (s)
Table 4 (continued ) Data sets
Hidden node type
RBF
Winequality white
Sigmoid
RBF
CCPP
Sigmoid
RBF
Ailerons
Sigmoid
RBF
Elevators
Sigmoid
RBF
Algorithms
#Hidden nodes
RMSE
Training time (s)
Testing time (s)
ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM
30 60 36 30 30 60 53 42 42
7.4587E 027 4.3381E 04 7.4558E 027 4.8491E 04 7.4552E 027 3.5553E 04 7.4513E 027 2.9363E 04 7.4513E 027 2.9363E 04 7.4279E 02 71.7412E 04 7.4276E 027 1.8356E 04 7.4285E 027 2.0016E 04 7.4285E 027 2.0016E 04
0.9717 0.003 0.7387 0.019 2.456 7 0.079 2.4957 0.032 2.2477 0.028 0.7137 0.010 2.5247 0.042 2.408 7 0.019 2.236 7 0.025
0.0067 0.003 0.3557 0.009 0.2147 0.005 0.1797 0.003 0.1807 0.008 0.360 7 0.024 0.3137 0.002 0.298 7 0.007 0.2977 0.012
ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM
90 58 53 53 90 69 79 79 70 50 55 55 70 58 47 47
1.2076E 017 8.1419E 04 1.2077E 017 7.3178E 04 1.2075E 017 7.9765E 04 1.2075E 017 7.9765E 04 1.2024E 017 4.6597E 04 1.2024E 017 4.7537E 04 1.2024E 017 4.4131E 04 1.2024E 017 4.4131E 04 1.2122E 01 75.1000E 04 1.2123E 01 75.0902E 04 1.2121E 017 5.1387E 04 1.2121E 017 5.1387E 04 1.2109E 017 4.6537E 04 1.2111E 017 4.5950E 04 1.2111E 017 4.6525E 04 1.2111E 017 4.6525E 04
0.032 7 0.004 3.4877 0.058 3.583 7 0.058 2.902 7 0.008 0.0247 0.016 3.6747 0.059 2.856 7 0.059 2.6417 0.016 0.923 7 0.027 4.0217 0.012 3.289 7 0.031 2.8707 0.009 0.928 7 0.035 4.1417 0.036 3.4727 0.039 3.096 7 0.022
0.030 7 0.001 0.0187 0.003 0.0167 0.004 0.0167 0.004 0.0197 0.003 0.0107 0.002 0.0107 0.003 0.0137 0.004 0.3117 0.026 0.208 7 0.004 0.236 7 0.005 0.2377 0.010 0.302 7 0.013 0.2337 0.004 0.1907 0.005 0.1917 0.006
ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM
220 211 174 174 220 206 180 180 230 208 205 205 230 189 150 150
5.3057E 027 9.3348E 05 5.3064E 027 8.5253E 05 5.3064E 027 1.0417E 04 5.3064E 027 1.0417E 04 5.3032E 027 4.6702E 05 5.3031E 027 4.2081E 05 5.3034E 027 4.9733E 05 5.3034E 027 4.9733E 05 5.2954E 027 1.0245E 04 5.2962E 027 9.6408E 05 5.2963E 027 9.2246E 05 5.2963E 027 9.2246E 05 5.3080E 027 4.3402E 05 5.3083E 027 4.7827E 05 5.3082E 027 6.4042E 05 5.3082E 027 6.4042E 05
0.305 7 0.009 23.2777 1.231 18.5117 1.532 9.0667 0.815 0.093 7 0.015 22.4987 0.250 19.654 7 2.670 8.6317 0.346 5.8167 0.037 30.308 7 0.107 19.0627 0.060 12.542 7 0.088 5.543 7 0.029 30.502 7 0.782 36.2467 1.428 18.8067 0.337
0.0407 0.002 0.039 7 0.006 0.0337 0.005 0.0337 0.004 0.0437 0.001 0.036 7 0.003 0.034 7 0.003 0.034 7 0.003 3.2357 0.024 2.960 7 0.010 2.925 7 0.011 2.9247 0.017 3.2177 0.015 2.7167 0.012 2.1177 0.006 2.1157 0.008
ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM
120 31 35 35 120 79 74 74 100 22 17 17 100 63 56 56
4.5491E 027 1.9189E 04 4.5495E 027 9.0525E 05 4.5483E 027 1.9785E 04 4.5483E 027 1.9785E 04 4.5153E 027 1.3803E 04 4.5155E 027 1.5541E 04 4.5151E 027 1.5914E 04 4.5151E 027 1.5914E 04 4.6844E 027 5.1310E 04 4.6754E 02 74.6998E 04 4.6828E 027 2.4087E 04 4.6828E 027 2.4087E 04 4.5955E 027 2.5765E 04 4.5975E 027 2.6970E 04 4.5975E 027 2.7819E 04 4.5975E 027 2.7819E 04
0.1377 0.003 19.4647 0.078 21.8047 0.155 19.5647 0.039 0.0757 0.019 21.282 7 0.123 20.8337 0.296 19.3717 0.157 3.298 7 0.205 18.6877 0.033 20.202 7 0.057 18.9157 0.031 3.1407 0.010 19.943 7 0.037 19.783 7 0.070 18.925 7 0.094
0.0467 0.001 0.028 7 0.004 0.029 7 0.005 0.029 7 0.004 0.0497 0.005 0.0487 0.012 0.0477 0.008 0.0477 0.011 2.9047 0.016 0.6777 0.005 0.596 7 0.048 0.595 7 0.007 2.905 7 0.013 1.9127 0.012 1.9017 0.010 1.902 7 0.014
ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM
60 40 28 28 60 55 43 43 50 34 28 28 50 40 34
3.9780E 027 9.2234E 04 3.9719E 027 3.4326E 04 3.9783E 027 8.2234E 04 3.9783E 027 8.2234E 04 3.9111E 027 6.3041E 04 3.9111E 027 6.0934E 04 3.9103E 02 7 5.1272E 04 3.9103E 027 5.1272E 04 4.2481E 027 7.2707E 04 4.2314E 02 71.2490E 03 4.2291E 027 1.1311E 03 4.2291E 027 1.1311E 03 4.1710E 027 6.4155E 04 4.1738E 027 9.138E 04 4.1697E 02 78.0712E 04
0.0767 0.004 13.543 7 0.011 13.5617 0.086 13.260 7 0.029 0.0357 0.002 13.759 7 0.012 13.465 7 0.071 13.3537 0.028 1.9157 0.009 13.0757 0.047 13.0797 0.068 12.883 7 0.070 1.926 7 0.022 13.2357 0.021 13.0627 0.031
0.0277 0.001 0.025 7 0.005 0.0247 0.005 0.0247 0.007 0.0357 0.001 0.029 7 0.003 0.025 7 0.008 0.025 7 0.002 1.753 7 0.090 1.1707 0.007 0.980 7 0.005 0.983 7 0.013 1.7107 0.040 1.3547 0.004 1.1797 0.011
Table 4 (continued ) Data sets
Pole
Hidden node type
Sigmoid
RBF
Algorithms
#Hidden nodes
Testing time (s)
34
4.1697E 02 78.0712E 04
12.962 7 0.063
1.1787 0.010
ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM ELM CP-ELM DP-ELM ADP-ELM RELM CP-RELM DP-RELM ADP-RELM
300 225 187 187 300 248 280 280 300 93 80 80 300 242 233 233
2.0204E 01 74.6552E 03 2.0190E 017 4.6273E 03 2.0203E 017 3.9885E 03 2.0203E 017 3.9885E 03 1.9994E 01 72.9768E 03 1.9994E 01 72.9057E 03 1.9993E 01 72.9785E 03 1.9993E 01 72.9785E 03 2.4132E 017 8.4840E 03 2.4119E 01 74.3560E 03 2.4113E 01 74.0748E 03 2.4113E 01 74.0748E 03 2.3101E 017 3.8627E 03 2.3099E 017 4.0630E 03 2.3098E 017 3.9031E 03 2.3098E 017 3.9031E 03
0.7577 0.047 60.0077 1.070 75.7077 9.695 34.563 7 1.071 0.305 7 0.036 61.292 7 2.037 28.202 7 0.162 16.946 7 0.502 13.384 7 0.263 55.5787 0.377 105.659 7 0.461 54.8747 0.263 12.8217 0.290 74.855 7 0.548 76.1527 2.083 41.526 7 1.350
0.080 7 0.007 0.0747 0.011 0.0577 0.002 0.0577 0.005 0.090 7 0.022 0.0757 0.005 0.0767 0.001 0.0767 0.002 6.323 7 0.132 2.038 7 0.007 1.666 7 0.012 1.666 7 0.009 6.3307 0.155 5.1937 0.019 5.1137 0.026 5.1137 0.139
The experiment environment consists of a Windows 7 32-bit operating system, an Intel Core™ i3-2310M CPU @ 2.10 GHz, 2.00 GB RAM, and the Matlab 2013b platform. Ten benchmark data sets (three multi-output data sets plus seven single-output ones) are utilized to do the experiments, which include Energy efficiency, Sml2010, Parkinsons, Airfoil, Abalone, Winequality white, CCPP, Ailerons, Elevators, and Pole. The front seven data sets are from the well-known UCI machine learning repository (http://archive.ics.uci.edu/ml/), and the rear three are available from the data collection at http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html. For each data set, before doing experiments, we get rid of the repeated samples and the redundant inputs. Then the remaining inputs are normalized into the range [-1, 1], while the outputs are normalized into [0, 1]. In addition, we divide every data set into two parts, i.e., the training set and the testing set, whose details are shown in Table 3.

In this paper, two commonly used hidden node types are chosen as the activation functions of the hidden layer, i.e., the sigmoid $h(x) = 1/\left(1 + \exp(-x^T a_i)\right)$ and the RBF $h(x) = \exp\left(-b_i \| x - a_i \|_2^2\right)$, where $a_i$ is randomly chosen from the range [-1, 1] and $b_i$ is chosen from the range (0, 0.5) [12]. We select ELM, CP-ELM, DP-ELM, ADP-ELM, RELM, CP-RELM, DP-RELM, and ADP-RELM to implement the tests. First of all, we employ cross validation to determine a nearly optimal number of hidden nodes (#Hidden nodes) for ELM from the set $\{10, 20, \cdots, 300\}$, and then we extend it to the other algorithms. In RELM, CP-RELM, DP-RELM, and ADP-RELM, there is an additional parameter, viz. the regularization parameter $\mu$, to be determined. Here, it is chosen from $\{2^{-50}, \cdots, 2^{0}, \cdots, 2^{5}\}$ using the cross-validation technique [33]. To statistically obtain robust and reliable results, 30 trials are carried out with every algorithm on each data set. To facilitate the comparisons, the root mean square error (RMSE) is used as the performance index, which is defined as
$$\mathrm{RMSE} = \sqrt{\frac{1}{\#\mathrm{Testing}\cdot m}\sum_{i=1}^{\#\mathrm{Testing}} \| \tilde{t}_i - t_i \|_F^2} \qquad (43)$$
where #Testing denotes the number of testing samples and $\tilde{t}_i$ is the estimated vector of the testing sample $t_i$.

Figs. 1-10 show the training time against #Hidden nodes. For CP-ELM and CP-RELM, the training time increases with the increasing #Hidden nodes. These phenomena are generated by the fact that CP-ELM and CP-RELM start with null models and then gradually pick up hidden nodes one by one. In contrast, for DP-ELM, ADP-ELM, DP-RELM, and ADP-RELM, the training time decreases with the increasing #Hidden nodes. The main reason is that these destructive algorithms start with the full model and then remove the insignificant hidden nodes step by step. Under the same number of hidden nodes, ADP-ELM (ADP-RELM) needs less training time than DP-ELM (DP-RELM), which demonstrates that the accelerating scheme proposed in this paper indeed speeds up the original training process. In other words, the effectiveness of the presented scheme is verified experimentally. When choosing the same number of hidden nodes, CP-ELM (DP-ELM, and ADP-ELM) is generally superior to CP-RELM (DP-RELM, and ADP-RELM) with respect to the training time. One major reason is that during the process of obtaining $R_{L\times L}$ and $\hat{T}_{L\times m}$ with (9) and (22), respectively, RELM with the expanded matrix $\begin{bmatrix} H \\ \sqrt{\mu}\, I_{L\times L} \end{bmatrix}$ usually needs more decomposing time than ELM. Before removing the insignificant hidden nodes, ADP-ELM (ADP-RELM) has the same training time as DP-ELM (DP-RELM), because the accelerating scheme has not yet started to work.

Figs. 11-20 show the RMSE against #Hidden nodes. In these figures, the dash line represents the ELM testing accuracy, and the RELM testing accuracy is denoted by the dash-dot line. In theory, the dash-dot line should lie under the dash line. That is, RELM should obtain better generalization performance than ELM, because ELM is a special case of RELM with the regularization parameter $\mu = 0$. However, for Airfoil and CCPP with RBF hidden nodes RELM is inferior to ELM. The reason is that RELM chooses the optimal regularization parameter from the finite set $\{2^{-50}, \cdots, 2^{0}, \cdots, 2^{5}\}$ instead of the whole real space. Note that the RMSE decreases with the increasing #Hidden nodes. Generally speaking, at first the RMSE drops very quickly, but it decreases slowly when approaching the benchmark line (the dash line or the dash-dot line), because the learning capability gradually weakens as #Hidden nodes becomes larger. At the same #Hidden nodes, the testing accuracy of ADP-ELM (ADP-RELM) overlaps that of DP-ELM (DP-RELM), which demonstrates that the proposed equivalent scheme does not alter the generalization performance. When CP-ELM, DP-ELM, and ADP-ELM (CP-RELM, DP-RELM, and ADP-RELM) touch (or nearly touch) the dash line (the dash-dot line), we terminate them mandatorily. For these cases, Table 4 gives their statistical results.
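For reference, the RMSE of Eq. (43) used in all of the above comparisons can be computed with a small helper such as the following illustrative sketch (not the authors' code):

```python
# RMSE of Eq. (43); T_est and T_true have shape (#Testing, m).
import numpy as np

def rmse(T_est, T_true):
    n_test, m = T_true.shape
    return np.sqrt(np.sum((T_est - T_true) ** 2) / (n_test * m))
```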
When CP-ELM and DP-ELM reach nearly the same accuracy as ELM, in general CP-ELM needs more hidden nodes than DP-ELM, but it takes the advantage in terms of the training time. When the presented accelerating scheme is applied to DP-ELM, ADP-ELM outperforms CP-ELM with respect to both the training time and the number of hidden nodes. ADP-ELM is therefore meaningful in cost-sensitive scenarios, since it owns the priority over CP-ELM in both the training time and the testing time. While arriving at nearly the same accuracy as RELM, DP-RELM works better than CP-RELM in the training time and #Hidden nodes. ADP-RELM not only keeps the same accuracy as DP-RELM, but also reduces the training time further, which is more beneficial for training-time-sensitive scenarios. In addition, ELM (RELM) requires the least training time among ELM, CP-ELM, DP-ELM, and ADP-ELM (RELM, CP-RELM, DP-RELM, and ADP-RELM) because no additional actions are performed on it.
6. Conclusions

Single-hidden layer feedforward networks have recently drawn much attention due to their excellent performance and simple forms. Unlike the traditional SLFNs, where the input weights and the biases of the hidden layer need to be tuned, ELM randomly generates these parameters independently of the training samples, and its output weights can be analytically determined. In this situation, ELM may generate redundant hidden nodes. Hence, it is necessary to improve the compactness of the ELM network. CP-ELM and DP-ELM can reduce the number of hidden nodes without impairing the generalization performance. Compared with CP-ELM, DP-ELM excels in the number of hidden nodes, but it needs more training time. Hence, an equivalent measure is proposed in this paper to accelerate the training process of DP-ELM. ADP-ELM improves DP-ELM's performance on the training time so that ADP-ELM needs less training time than CP-ELM. In other words, ADP-ELM is superior to CP-ELM in terms of the training time and the number of hidden nodes, which is important for training-time-sensitive scenarios. The same holds for RELM: ADP-RELM improves the training process of DP-RELM further, and it needs fewer hidden nodes and less training time than CP-RELM. Experimental results have shown that the proposed scheme is indeed able to accelerate DP-ELM and DP-RELM, and the analysis of the computational complexity also supports this viewpoint.
Acknowledgment

This research was partially supported by the National Natural Science Foundation of China under Grant no. 51006052, and the NUST Outstanding Scholar Supporting Program. Moreover, the authors wish to thank the anonymous reviewers for their constructive comments and great help in the writing process, which improved the manuscript significantly.

References

[1] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: a new learning scheme of feedforward neural networks, in: Proceedings of the IEEE International Joint Conference on Neural Networks, 2004, pp. 985-990.
[2] G.-B. Huang, C.-K. Siew, Extreme learning machine: RBF network case, in: Proceedings of the 8th International Conference on Control, Automation, Robotics and Vision, 2004, pp. 1029-1036.
[3] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (1-3) (2006) 489-501.
[4] G. Huang, G.-B. Huang, S. Song, K. You, Trends in extreme learning machines: a review, Neural Netw. 61 (2015) 32-48.
[5] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning representations by back-propagating errors, Nature 323 (6088) (1986) 533-536.
[6] Y.A. LeCun, L. Bottou, G.B. Orr, K.-R. Muller, Efficient backprop, Lect. Notes Comput. Sci. (2012) 9-48.
[7] G.-B. Huang, X. Ding, H. Zhou, Optimization method based extreme learning machine for classification, Neurocomputing 74 (1-3) (2010) 155-163.
[8] G.-B. Huang, H. Zhou, X. Ding, R. Zhang, Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst. Man Cybern. Part B: Cybern. 42 (2) (2012) 513-529.
[9] X. Liu, C. Gao, P. Li, A comparative analysis of support vector machines and extreme learning machines, Neural Netw. 33 (2012) 58-66.
[10] S. Lin, X. Liu, J. Fang, Z. Xu, Is extreme learning machine feasible? A theoretical assessment (Part II), IEEE Trans. Neural Netw. Learn. Syst. 26 (1) (2015) 21-34.
[11] X. Liu, S. Lin, J. Fang, Z. Xu, Is extreme learning machine feasible? A theoretical assessment (Part I), IEEE Trans. Neural Netw. Learn. Syst. 26 (1) (2015) 7-20.
[12] G.-B. Huang, L. Chen, C.-K. Siew, Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Trans. Neural Netw. 17 (4) (2006) 879-892.
[13] G.-B. Huang, L. Chen, Convex incremental extreme learning machine, Neurocomputing 70 (16-18) (2007) 3056-3062.
[14] G.-B. Huang, L. Chen, Enhanced random search based incremental extreme learning machine, Neurocomputing 71 (2008) 3460-3468.
[15] G. Feng, G.-B. Huang, Q. Lin, R. Gay, Error minimized extreme learning machine with growth of hidden nodes and incremental learning, IEEE Trans. Neural Netw. 20 (8) (2009) 1352-1357.
[16] Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, A. Lendasse, OP-ELM: optimally pruned extreme learning machine, IEEE Trans. Neural Netw. 21 (1) (2010) 158-162.
[17] J. Deng, K. Li, G.W. Irwin, Fast automatic two-stage nonlinear model identification based on the extreme learning machine, Neurocomputing 74 (16) (2011) 2422-2429.
[18] E. Soria-Olivas, J. Gomez-Sanchis, J.D. Martin, J. Vila-Frances, M. Martinez, J.R. Magdalena, A.J. Serrano, BELM: Bayesian extreme learning machine, IEEE Trans. Neural Netw. 22 (3) (2011) 505-509.
[19] Y. Yang, Y. Wang, X. Yuan, Bidirectional extreme learning machine for regression problem and its learning effectiveness, IEEE Trans. Neural Netw. Learn. Syst. 23 (9) (2012) 1498-1505.
[20] R. Zhang, Y. Lan, G.-B. Huang, Z.-B. Xu, Universal approximation of extreme learning machine with adaptive growth of hidden nodes, IEEE Trans. Neural Netw. Learn. Syst. 23 (2) (2012) 365-371.
[21] R. Zhang, Y. Lan, G.-B. Huang, Z.-B. Xu, Y.C. Soh, Dynamic extreme learning machine and its approximation capability, IEEE Trans. Cybern. 43 (6) (2013) 2054-2065.
[22] A. Castano, F. Fernandez-Navarro, C. Hervas-Martinez, PCA-ELM: a robust and pruned extreme learning machine approach based on principal component analysis, Neural Process. Lett. 37 (3) (2013) 377-392.
[23] H. Zhou, G.-B. Huang, Z. Lin, H. Wang, Y.C. Soh, Stacked extreme learning machines, IEEE Trans. Cybern. (2015).
[24] W.-Y. Deng, Q.-H. Zheng, Z.-M. Wang, Projection vector machine, Neurocomputing 120 (2013) 490-498.
[25] H.-J. Rong, Y.-S. Ong, A.-H. Tan, Z. Zhu, A fast pruned-extreme learning machine for classification problem, Neurocomputing 72 (1-3) (2008) 359-366.
[26] D. Du, K. Li, G.W. Irwin, J. Deng, A novel automatic two-stage locally regularized classifier construction method using the extreme learning machine, Neurocomputing 102 (2013) 10-22.
[27] Z. Bai, G.-B. Huang, D. Wang, H. Wang, M.B. Westover, Sparse extreme learning machine for classification, IEEE Trans. Cybern. 44 (10) (2014) 1858-1870.
[28] J. Luo, C.-M. Vong, P.-K. Wong, Sparse Bayesian extreme learning machine for multi-classification, IEEE Trans. Neural Netw. Learn. Syst. 25 (4) (2014) 836-843.
[29] N. Wang, M.J. Er, M. Han, Parsimonious extreme learning machine using recursive orthogonal least squares, IEEE Trans. Neural Netw. Learn. Syst. 25 (10) (2014) 1828-1841.
[30] Y.-P. Zhao, K.-K. Wang, Y.-B. Li, Parsimonious regularized extreme learning machine based on orthogonal transformation, Neurocomputing 156 (2015) 280-296.
[31] W. Deng, Q. Zheng, L. Chen, Regularized extreme learning machine, in: Proceedings of the 2009 IEEE Symposium on Computational Intelligence and Data Mining, 2009, pp. 389-395.
[32] X. Zhang, Matrix Analysis and Applications, Tsinghua University Press, Beijing, 2004.
[33] Y. Zhao, K. Wang, Fast cross validation for regularized extreme learning machine, J. Syst. Eng. Electron. 25 (5) (2014) 895-900.
Yong-Ping Zhao received his B.S. degree in thermal energy and power engineering from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in July 2004. Since then, he worked toward the M.S. and Ph.D. degrees in kernel methods at Nanjing University of Aeronautics and Astronautics, and he received the Ph.D. degree in December 2009. Currently, he is an associate professor with the School of Mechanical Engineering, Nanjing University of Science & Technology. His research interests include machine learning and kernel methods.
Bing Li was born in 1990. He received the B.S. degree from Xinxiang University, China, in 2014. He is currently pursuing the M.Eng. degree at Nanjing University of Science and Technology, China. His research interests include machine learning, pattern recognition, etc.
Ye-Bo Li studied in Shenyang Aerospace University of China from September 2004 to June 2008, majored in aircraft power engineering and received the B.S. degree. From September 2008 to October 2014, he studied in Nanjing University of Aeronautics and Astronautics of China, majored in aerospace propulsion theory and engineering, and received the M.S. and Ph.D. degrees. Now, he is an engineer at AVIC Aeroengine Control Research Institute of China. His research interests include modelling, control and fault diagnosis of aircraft engines.