Feature selection of generalized extreme learning machine for regression problems
Yong-Ping Zhao^a, Ying-Ting Pan^a, Fang-Quan Song^a, Liguo Sun^{b,†}, Ting-Hao Chen^c

^a Jiangsu Province Key Laboratory of Aerospace Power Systems, College of Energy and Power Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
^b School of Aeronautic Science and Engineering, Beihang University, Beijing 100091, China
^c Guangdong Maritime Safety Administration, Guangzhou 510260, China
Abstract
Recently a generalized single-hidden layer feedforward network was proposed, which is an extension of the original extreme learning machine (ELM). Different from the traditional ELM, this generalized ELM (GELM) utilizes the p-order reduced polynomial functions of complete input features as output weights. According to the empirical results, there may be insignificant or redundant input features used to construct the p-order reduced polynomial function as output weights in GELM. However, to date there has been no work on selecting appropriate input features for constructing the output weights of GELM. Hence, in this paper two greedy learning algorithms, i.e., a forward feature selection algorithm (FFS-GELM) and a backward feature selection algorithm (BFS-GELM), are first proposed to tackle this issue. To reduce the computational complexity, an iterative strategy is used in FFS-GELM, and its convergence is proved. In BFS-GELM, a decreasing iteration is applied to decay this model, and in this process an accelerating scheme is proposed to speed up the computation of removing the insignificant or redundant features. To show the effectiveness of the proposed FFS-GELM and BFS-GELM, twelve benchmark data sets are employed in the experiments. The results demonstrate that both FFS-GELM and BFS-GELM can select appropriate input features to construct the p-order reduced polynomial function as output weights for GELM. FFS-GELM and BFS-GELM enhance the generalization performance and simultaneously reduce the testing time compared to the original GELM. BFS-GELM works better than FFS-GELM in terms of the sparsity ratio, the testing time and the training time. However, it slightly loses the advantage in the generalization performance over FFS-GELM.

Key words: single hidden layer feedforward network; extreme learning machine; feature selection; greedy learning; iterative updating
† Corresponding author. E-mail: [email protected]
1 Introduction
In the field of neural networks, feedforward neural networks (FNNs) have been investigated thoroughly in the past two decades due to their excellent supervised learning performance. Single hidden layer feedforward neural networks (SLFNs) are the simplest FNNs with only one hidden layer. Typical examples include multilayer perceptrons with one hidden layer and radial basis function networks, which mathematically consist of a linear combination of some basis functions. Because of their simple form and outstanding approximation capabilities [1–3], SLFNs are popular in many applications such as pattern recognition, signal processing, time-series prediction, nonlinear system modeling and control, and so on [4–13]. It has already been proven that a SLFN with N − 1 hidden nodes can give any N input-target relations exactly [14, 15]. In [16], the same conclusion that a SLFN with N − 1 sigmoidal hidden nodes can learn N distinct samples without error was confirmed again. Subsequently, Huang et al. [17] extended this result and rigorously proved that a SLFN with at most N hidden nodes and with any bounded nonlinear activation function which has a limit at one infinity can learn N distinct samples with zero error.

Generally speaking, to obtain a good SLFN there are two subproblems to be solved: optimising the tunable parameters and constructing the optimal network architecture. On one side, since backpropagation was proposed [18], a large number of gradient-based optimization methods, such as the Levenberg-Marquardt algorithm [19, 20], the scaled conjugate gradient algorithm [21], the BFGS algorithm [22], and so forth, were investigated to search for the optimal parameters of FNNs. On the other side, the greedy learning algorithms, including forward learning and backward learning, are commonly utilized to determine appropriate and compact architectures for SLFNs. The forward learning algorithms start with a small initial network and gradually add new hidden nodes until a satisfactory architecture is obtained. The forward orthogonal least squares method [4], fast recursive algorithm [23], kernel matching pursuit [24, 25], and least angle regression [26] belong to this class of algorithms. The backward learning algorithms start by training a larger than necessary network and then delete one hidden node at a time, namely the one that is least significant in terms of the reduction of the user-defined cost function. The pruning algorithms [27–33] are typical members. Generally, the forward learning algorithms tend to be more popular than the backward learning algorithms since they are computationally cheap and tend to have modest memory requirements. In contrast, the backward learning algorithms are less efficient with respect to numerical operations. But from a theoretical point of view, the backward algorithms have provable convergence properties while the forward learning algorithms offer no such known guarantees [34]. However, both the forward and backward learning algorithms may not be optimal [35].

In the learning community of SLFNs, recently there have been two sparkling members: the support vector machine (SVM) [36–38] and the extreme learning machine (ELM) [39–41]. For SVM, there are no parameters from the hidden layer to be tuned. We only require to optimise the parameters from the output layer. Although the optimisation of the tunable parameters begins with all the training samples, only a small portion of the training samples, referred to as support vectors, is found to construct the final model. SVM is a sparse machine learning algorithm in theory, but the sparsity of the solution is not as good as what we expect, which greatly blocks its practical use. Investigations [42–44] illustrated that the support vectors of SVMs are sometimes swollen, and dispensable support vectors exist, so some algorithms [42, 45] were proposed to optimise the solution of SVM further.
As another special type of SLFN, the input weights and biases of hidden nodes in ELM are generated randomly, and its output weights are analytically determined by solving a linear system. Compared with the conventional gradient-based learning algorithms, the training speed of ELM is thousands of times faster [40]. Additionally, ELM has weaker generalization ability than SVM for small samples but can generalize as well as SVM for large samples, and it gains an overwhelming superiority over SVM in computational speed, especially for large scale problems [46]. For the original ELM, there still exist a number of efforts on optimising its network architecture. An incremental ELM [41] and its improvements [47–49] were proposed in order to achieve a more compact network architecture. Moreover, an error-minimization-based method [50] was presented to grow hidden nodes for ELM one by one or group by group, and output weights are incrementally updated, which significantly reduces the computational complexity. Recently, the constructive parsimonious ELMs [51–53] were developed based on the orthogonal transformation. These aforementioned algorithms fall within the range of the forward learning algorithms. Likewise, there are backward learning algorithms to compact the ELM network. To address classification problems, a pruned-ELM was proposed [54], which begins from an initial large number of hidden nodes, and then the irrelevant nodes are pruned by considering their relevance to the class labels. Subsequently, an optimally pruned ELM methodology was presented [55], in which an original ELM is initially constructed, then the hidden nodes are ranked with the multi-response sparse regression technique [56], and the number of hidden nodes is finally decided by leave-one-out cross validation [57]. In addition, there are corresponding destructive methods of constructing parsimonious ELMs [51, 58].

Recently, a generalized SLFN (GSLFN) was investigated [59]. In the original ELM, output weights are constant, which are irrelevant to the sample inputs. In contrast, GSLFN uses the p-order reduced polynomial functions of inputs as output weights as well as any infinitely differentiable activation function as hidden nodes. Evidently, when p = 0 GSLFN degenerates into the original ELM. That is to say, the original ELM is a special case of GSLFN. Extensive experiments showed that the flexible polynomial order p renders GSLFN with a considerably compact architecture, yielding great superiority to the original ELM in terms of generalization and learning speed. In our opinion, the SLFN proposed in [59] is only an extension of the original ELM, so in this paper we name it the generalized ELM (GELM).

According to the obtained conclusion, we know that the generalization performance of the original ELM is not sensitive to the number of hidden nodes and good performance can be reached as long as the number of hidden nodes is large enough [60]. Although a large number of hidden nodes can guarantee that ELM gains good performance, its testing time is affected by this large number of hidden nodes. That is to say, more hidden nodes usually signify more testing time. The large number of hidden nodes in the ELM will impair its widespread application in scenarios where the testing time is strictly required. Hence, a lot of the algorithms stated previously were derived to make the network architecture of ELM more compact. In [59], experimental results demonstrate that GELM usually needs fewer hidden nodes but performs better than the original ELM in terms of generalization and learning speed. However, in GELM the output weights are closely related to the sample inputs, i.e., the sample features. The more features involved in the p-order reduced polynomial function as output weights, the worse the testing time becomes, just as with more hidden nodes. In addition, it has been proved that the generalization performance of GELM can be further enhanced by considering partial input
features to construct the p-order reduced polynomial function as output weights instead of complete input features, because there exist insignificant or redundant input features [59]. There are two benefits of removing the insignificant or redundant features when constructing the p-order reduced polynomial function as output layer weights: reducing the testing time and enhancing the generalization performance. It is therefore very important to select appropriate features to construct the p-order reduced polynomial function as output weights for GELM. In the naive GELM, the p-order reduced polynomial function of complete input features is chosen as output layer weights. Therefore, in this paper we first tend to make some contributions to the feature selection of the p-order reduced polynomial function in GELM.
Enlightened by the experience above on compacting the network architecture with greedy learning algorithms, here the greedy learning strategies are extended to select appropriate input features to construct the p-order reduced polynomial function as output weights for GELM. As a result, two important algorithms are obtained to select appropriate features to construct the p-order reduced polynomial function as output weights for GELM, shown as follows.

(1) The forward learning algorithm for feature selection of GELM (FFS-GELM): FFS-GELM begins with an original ELM, and then it gradually grows input features to construct the p-order reduced polynomial function as output weights according to some criterion. It does not terminate until the stopping criterion is satisfied. To reduce the training computational burden, an iterative strategy is utilized in FFS-GELM.

(2) The backward learning algorithm for feature selection of GELM (BFS-GELM): Contrary to FFS-GELM, BFS-GELM starts with a full model, i.e., a GELM model with the p-order reduced polynomial function of complete input features as output weights as in [59]. Then it removes the insignificant or redundant features from the p-order reduced polynomial function one by one until the required performance is obtained. Similar to FFS-GELM, BFS-GELM employs the iterative strategy to shrink the model gradually in order to cut down the computational complexity. Moreover, during the process of computing the criterion for removing the insignificant or redundant features, a scheme is proposed to relieve the computational burden.

To confirm the usefulness of the proposed FFS-GELM and BFS-GELM, experimental evaluations over a range of benchmark data sets are carried out. From the reported results, the effectiveness of FFS-GELM and BFS-GELM is favored.
The rest of this paper is organized as follows. In section 2, the original ELM and GELM are introduced. In section 3, FFS-GELM is proposed to select appropriate features for the original GELM gradually. In order to reduce the computational complexity, an iterative scheme is applied to accelerate FFS-GELM, and its convergence is proved. In the following section, BFS-GELM is developed, and a scheme is presented to speed up its implementation. To confirm the effectiveness and feasibility of the proposed FFS-GELM and BFS-GELM, twelve benchmark data sets are employed to do experiments in section 5. Finally, conclusions follow.
2 ELM and GELM

2.1 ELM
Given a set of training samples $\{(x_k, t_k)\}_{k=1}^{N}$, where $x_k \in \mathbb{R}^{n}$ and $t_k \in \mathbb{R}^{m}$, the output of a standard SLFN with $L$ hidden nodes is

$$y_k = \sum_{i=1}^{L} \beta_i h(a_i, b_i; x_k) \tag{1}$$

where $h(a_i, b_i; x_k)$ denotes the output of the $i$th hidden node with the hidden-node parameters $(a_i, b_i) \in \mathbb{R}^{n} \times \mathbb{R}$, and $\beta_i \in \mathbb{R}^{m}$ is the output weight connecting the $i$th hidden node with the output nodes. If the SLFN fits the $N$ training samples exactly, these equations can be written compactly as the linear system

$$H\beta = T \tag{2}$$

where

$$H = \begin{bmatrix} h(a_1, b_1; x_1) & \cdots & h(a_L, b_L; x_1) \\ h(a_1, b_1; x_2) & \cdots & h(a_L, b_L; x_2) \\ \vdots & \ddots & \vdots \\ h(a_1, b_1; x_N) & \cdots & h(a_L, b_L; x_N) \end{bmatrix} \tag{3}$$

$$\beta = \left[\beta_1, \cdots, \beta_L\right]^{T} \quad \text{and} \quad T = \left[t_1, \cdots, t_N\right]^{T} \tag{4}$$
Here, $H$ is called the hidden-layer output matrix of the network, in which the $i$th column is the output vector of the $i$th hidden node with respect to the inputs $\{x_i\}_{i=1}^{N}$, and the $j$th row is the output vector of the hidden nodes with respect to $x_j$; $\beta$ and $T$ are the corresponding matrices of output weights and targets, respectively. In ELM, the hidden-layer output matrix $H$ is randomly generated. That is, the input weights and the biases of the hidden nodes are randomly generated and they are independent of each other. Thus, training ELM simply amounts to getting the solution of the linear system (2) with respect to the output weight matrix $\beta$, specifically

$$\hat{\beta} = H^{\dagger} T \tag{5}$$

where $H^{\dagger}$ is the Moore-Penrose generalized inverse of $H$.
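As a concrete illustration of (1)–(5), the following NumPy sketch trains a basic ELM with randomly generated sigmoid hidden nodes and solves for the output weights with the Moore-Penrose pseudo-inverse. It is only an illustrative sketch under stated assumptions (the function names, the sigmoid choice and the sampling ranges are ours, not taken from the paper).

```python
import numpy as np

def train_elm(X, T, L=100, seed=0):
    """Basic ELM sketch: random sigmoid hidden layer, output weights by pseudo-inverse (eq. (5))."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    A = rng.uniform(-1.0, 1.0, size=(n, L))      # random input weights a_i
    b = rng.uniform(-1.0, 1.0, size=L)           # random biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))       # hidden-layer output matrix H (N x L)
    beta = np.linalg.pinv(H) @ T                 # beta_hat = H^+ T, eq. (5)
    return A, b, beta

def predict_elm(X, A, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    return H @ beta                              # eq. (1) in matrix form
```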
2.2 GELM
From equation (2), we know that the output weights in ELM are not explicitly related to the input features. In contrast, in GELM the output weights are defined as the $p$-order reduced polynomial functions of the input features $x_k = [x_{k1}, \cdots, x_{kn}]^{T}$, i.e.,

$$\beta_i^{T}(x_k) = \beta_{i00}^{T} + \sum_{r=1}^{p}\sum_{j=1}^{n} \beta_{ijr}^{T} x_{kj}^{r} = z_k^{T} B_i \tag{6}$$

$$z_k^{T} = [1, x_{k1}, \cdots, x_{kn}, \cdots, x_{k1}^{p}, \cdots, x_{kn}^{p}] \tag{7}$$

where $z_k \in \mathbb{R}^{M}$ and

$$B_i^{T} = [\beta_{i00}, \beta_{i11}, \cdots, \beta_{in1}, \ldots, \beta_{i1p}, \cdots, \beta_{inp}] \tag{8}$$

with $M = np + 1$ and $\beta_{ijr} \in \mathbb{R}^{m}$. The output of GELM for the input $x_k$ is then

$$y_k^{T} = \sum_{i=1}^{L} h(a_i, b_i; x_k)\, z_k^{T} B_i = h^{T}(x_k) \otimes z_k^{T}\, B \tag{9}$$

where $\otimes$ is the Kronecker product [61], and

$$B = \left[B_1^{T}, \cdots, B_L^{T}\right]^{T} \tag{10}$$

$$h^{T}(x_k) = [h(a_1, b_1; x_k), \cdots, h(a_L, b_L; x_k)] \tag{11}$$
If the outputs in (9) equal the targets, the following linear system is obtained in a compact form

$$GB = T \tag{12}$$

where $G \in \mathbb{R}^{N \times LM}$ is given by

$$G = \begin{bmatrix} h^{T}(x_1) \otimes z_1^{T} \\ \vdots \\ h^{T}(x_N) \otimes z_N^{T} \end{bmatrix} \tag{13}$$

Since the rank of $G^{T}G$ may not equal $LM$, the linear system (12) can be solved by the ridge regression estimator as follows

$$\hat{B} = (G^{T}G + \mu I)^{-1} G^{T} T \tag{14}$$

where $\mu$ is the ridge parameter. The universal approximation capability of (14) is proved in [59].
In theory, finding the solution of (14) amounts to optimizing the following problem

$$\min_{B} \left\{ J = \|B\|_F^2 + C\|E\|_F^2 \right\} \quad \text{s.t.} \quad GB = T - E \tag{15}$$

where $\|\cdot\|_F$ represents the Frobenius norm, and $C > 0$ is the user-specified ridge parameter controlling the tradeoff between the training error $E$ and the output weight norm. After eliminating $E$ from (15), we obtain (14) with $1/C$ instead of $\mu$. After obtaining $\hat{B}$, we can get GELM as follows

$$f(x_k) = h^{T}(x_k) \otimes z_k^{T}\, \hat{B} \tag{16}$$
where $\hat{B} \in \mathbb{R}^{LM \times m}$. In [59], however, little emphasis was put on the influence of input features on the testing time. More importantly, according to the experimental results in [59], it cannot be guaranteed that the generalization performance with complete input features employed to construct (7) is better than that with partial input features. Hence, it is necessary and important to select appropriate input features for (7) in GELM. Obviously, doing this will bring two benefits: reducing the testing time and enhancing the generalization performance. In this context, FFS-GELM and BFS-GELM are proposed.

Figure 1 The procedure of FFS-GELM. (Start: train an original ELM. Loop: select appropriate input features to construct the p-order reduced polynomial function as output weights according to some criterion. End: FFS-GELM is obtained when the stopping criterion is met.)

3 FFS-GELM
3.1 Preliminary Work
FFS-GELM is a forward learning algorithm, which starts with the original ELM. Then, it gradually selects features for (7) according to some criterion, and does not stop until the terminating criterion is reached. Figure 1 illustrates this procedure. In what follows, we will introduce it. Firstly, we rearrange (7) as

$$z_k^{T} = [1, x_{k1}, \cdots, x_{k1}^{p}, \cdots, x_{kn}, \cdots, x_{kn}^{p}] = [1, z_{k1}, \cdots, z_{kn}] \tag{17}$$

where $z_{ki} = [x_{ki}, \cdots, x_{ki}^{p}]$, $i = 1, \cdots, n$. Accordingly, equation (13) is rewritten as

$$G = \begin{bmatrix} h^{T}(x_1) \otimes z_1^{T} \\ \vdots \\ h^{T}(x_N) \otimes z_N^{T} \end{bmatrix} = \begin{bmatrix} h^{T}(x_1), & h^{T}(x_1) \otimes z_{11}, & \cdots, & h^{T}(x_1) \otimes z_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ h^{T}(x_N), & h^{T}(x_N) \otimes z_{N1}, & \cdots, & h^{T}(x_N) \otimes z_{Nn} \end{bmatrix} = \left[G_0, G_1, \cdots, G_n\right] \tag{18}$$

where $G_0 = H$ and $G_i = \begin{bmatrix} h^{T}(x_1) \otimes z_{1i} \\ \vdots \\ h^{T}(x_N) \otimes z_{Ni} \end{bmatrix}$, $i = 1, \cdots, n$. Correspondingly, the output weight matrix $B$ is reorganized as

$$B = \left[B_0^{T}, B_1^{T}, \cdots, B_n^{T}\right]^{T} \tag{19}$$

Here, $G_i$ and $B_i$ correspond to the $i$th feature of the inputs. Hence, the feature selection of GELM is equivalent to choosing $G_i$'s from $G$. Then, define an empty index set $P = \emptyset$ and a full index set $Q = \{1, \cdots, n\}$. Let $G_P = G_0$, $B_P = B_0$, and initialize the index $q$ with 0.
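The partitioning (17)–(19) can be made concrete with a short sketch that builds $G_0 = H$ and the per-feature blocks $G_i$ via row-wise Kronecker products. The helper names and the array layout below are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def feature_blocks(X, H, p):
    """Build G0 = H and the per-feature blocks G_i of eq. (18): the k-th row of G_i is
    the Kronecker product of h^T(x_k) with z_ki = [x_ki, ..., x_ki^p]."""
    N, n = X.shape
    powers = np.stack([X ** r for r in range(1, p + 1)], axis=2)      # N x n x p
    G0 = H
    Gi = []
    for i in range(n):
        Zi = powers[:, i, :]                                          # rows z_ki, N x p
        Gi.append((H[:, :, None] * Zi[:, None, :]).reshape(N, -1))    # row-wise Kronecker, N x (L*p)
    return G0, Gi

def assemble_GP(G0, Gi, P):
    """G_P: column-wise concatenation of G0 and the selected blocks (P is an index set)."""
    return np.hstack([G0] + [Gi[i] for i in sorted(P)])
```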
3.2 The Criterion of Feature Selection
At the $q$th iteration, similar to equation (15) we get

$$\min_{B_P} \left\{ J^{(q)} = \|B_P\|_F^2 + C\|T - G_P B_P\|_F^2 \right\} \tag{20}$$

where $G_P$ and $B_P$ are submatrices of $G$ and $B$, respectively, whose indexes are confined to the index set $P$. Letting $\frac{dJ^{(q)}}{dB_P} = 0$ yields

$$\hat{B}_P = (G_P^{T} G_P + I/C)^{-1} G_P^{T} T \tag{21}$$

If $G_i$ is selected at the $(q+1)$th iteration, we have

$$\min_{B_P, B_i} \left\{ J_i^{(q+1)} = \left\| \begin{bmatrix} B_P \\ B_i \end{bmatrix} \right\|_F^2 + C \left\| T - \left[G_P, G_i\right] \begin{bmatrix} B_P \\ B_i \end{bmatrix} \right\|_F^2 \right\} \tag{22}$$
Here we make use of the backfitting strategy [24] to compute (22). That is, $B_P$ in (22) is frozen with the value $\hat{B}_P$ of (21) at the $q$th iteration. Substituting $\hat{B}_P$ into (22), it is transformed to

$$\min_{B_i} \left\{ J_i^{(q+1)} = J_{\mathrm{const}} + C\|G_i B_i\|_F^2 + \|B_i\|_F^2 - 2C\,\mathrm{tr}\left( (T - G_P \hat{B}_P)^{T} G_i B_i \right) \right\} \tag{23}$$

where $\mathrm{tr}(\cdot)$ represents the trace of a square matrix and $J_{\mathrm{const}}$ is a constant part given by

$$J_{\mathrm{const}} = C\left\|T - G_P \hat{B}_P\right\|_F^2 + \left\|\hat{B}_P\right\|_F^2 \tag{24}$$

Letting $\frac{dJ_i^{(q+1)}}{dB_i} = 0$ gives

$$\hat{B}_i = (G_i^{T} G_i + I/C)^{-1} G_i^{T} (T - G_P \hat{B}_P) \tag{25}$$

Substituting (25) into (23), we get

$$\hat{J}_i^{(q+1)} = J_{\mathrm{const}} - C\,\mathrm{tr}\left( (T - G_P \hat{B}_P)^{T} G_i \hat{B}_i \right) \tag{26}$$

Hence, define

$$\Delta_i^{(q+1)} = \hat{J}_i^{(q+1)} - J_{\mathrm{const}} = -C\,\mathrm{tr}\left( (T - G_P \hat{B}_P)^{T} G_i \hat{B}_i \right) \tag{27}$$

Theorem 1 $\Delta_i^{(q+1)} \leq 0$ holds.

Proof Since $G_i^{T} G_i + I/C$ is a positive definite matrix, $(G_i^{T} G_i + I/C)^{-1}$ is also a positive definite matrix. Based on matrix theory, the following factorization exists

$$(G_i^{T} G_i + I/C)^{-1} = \Lambda^2 \tag{28}$$

where $\Lambda$ is also a positive definite matrix. Substituting (25) and (28) into (27), it can be expressed as

$$\Delta_i^{(q+1)} = -C\left\| \Lambda G_i^{T} (T - G_P \hat{B}_P) \right\|_F^2 \tag{29}$$

Hence, $\Delta_i^{(q+1)} \leq 0$. This completes the proof.
In fact, $\Delta_i^{(q+1)}$ indicates the reduction of the cost function of (22) incurred by the $i$th feature at the $(q+1)$th iteration. The smaller $\Delta_i^{(q+1)}$ is, the more important the $i$th feature is for constructing the $p$-order reduced polynomial function as output weights of GELM. Thus, we can find the index of the feature to be selected by

$$s = \arg\min_{i \in Q} \left\{ \Delta_i^{(q+1)} \right\} \tag{30}$$
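A minimal sketch of this selection step, assuming the residual form of (25) and (27): for every candidate feature the backfitted block weights and the corresponding $\Delta_i^{(q+1)}$ are computed, and the index with the smallest value is returned as in (30). Function and variable names are illustrative.

```python
import numpy as np

def select_feature(T, G_P, B_P, Gi_list, Q, C):
    """Backfitting selection: evaluate Delta_i^(q+1) of eq. (27) via eq. (25)
    for each candidate i in Q and return the minimiser, as in eq. (30)."""
    R = T - G_P @ B_P                     # residual with the frozen B_P of eq. (21)
    best_i, best_delta = None, np.inf
    for i in Q:
        Gi = Gi_list[i]
        Bi = np.linalg.solve(Gi.T @ Gi + np.eye(Gi.shape[1]) / C, Gi.T @ R)   # eq. (25)
        delta = -C * np.trace(R.T @ Gi @ Bi)                                  # eq. (27)
        if delta < best_delta:
            best_i, best_delta = i, delta
    return best_i, best_delta
```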
3.3 Iteratively Updating
After determining the $s$th feature to construct the $p$-order reduced polynomial function, we need to update $\hat{B}_{P\cup\{s\}}$, which is employed in (27) for selecting the next feature if FFS-GELM does not stop. However, if classical techniques, e.g., Gaussian elimination, are used to compute the matrix inverse in (21) from scratch, the computational cost is $O\left(L^2 q^2 p^2 (Lqp + N)\right)$. Hence, we develop an iterative computation of the inverse matrix in (21). Firstly, we let

$$R^{(q+1)} = \left(G_{P\cup\{s\}}^{T} G_{P\cup\{s\}} + I/C\right)^{-1} = \begin{bmatrix} G_P^{T} G_P + I/C & G_P^{T} G_s \\ G_s^{T} G_P & G_s^{T} G_s + I/C \end{bmatrix}^{-1} \tag{31}$$

Given that $R^{(q)} = (G_P^{T} G_P + I/C)^{-1}$ has already been computed at the $q$th iteration, according to the Sherman-Morrison formula [61] $R^{(q+1)}$ can be updated at a cost of $O\left(L^2 q^2 p^2 (Lp + N)\right)$:

$$R^{(q+1)} = \begin{bmatrix} R^{(q)} & 0 \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} U D U^{T} & -U D \\ -D U^{T} & D \end{bmatrix} \tag{32}$$

where the $0$'s are zero matrices of appropriate dimension,

$$U = R^{(q)} G_P^{T} G_s \tag{33}$$

$$D = \left(G_s^{T} G_s + I/C - G_s^{T} G_P U\right)^{-1} \tag{34}$$

Usually $q > 1$, so the computational cost is mitigated. After obtaining $R^{(q+1)}$, $\hat{B}_{P\cup\{s\}}$ is given by

$$\hat{B}_{P\cup\{s\}} = R^{(q+1)} \begin{bmatrix} G_P^{T} \\ G_s^{T} \end{bmatrix} T \tag{35}$$

Meanwhile, let $P \leftarrow P \cup \{s\}$, $Q \leftarrow Q\backslash\{s\}$, and $q \leftarrow q + 1$.
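The block update (32)–(35) can be sketched as follows; it forms the Schur complement of the new block instead of re-inverting $G_{P\cup\{s\}}^{T} G_{P\cup\{s\}} + I/C$ from scratch. This is a schematic rendering under the same notation, not the authors' code.

```python
import numpy as np

def grow_inverse(R_q, G_P, G_s, C):
    """Update R^(q+1) of eq. (31) from R^(q) via the block form (32)-(34)."""
    U = R_q @ (G_P.T @ G_s)                                            # eq. (33)
    D = np.linalg.inv(G_s.T @ G_s + np.eye(G_s.shape[1]) / C
                      - G_s.T @ G_P @ U)                               # eq. (34), Schur complement
    top = np.hstack([R_q + U @ D @ U.T, -U @ D])
    bottom = np.hstack([-(U @ D).T, D])
    return np.vstack([top, bottom])                                    # eq. (32)

def grow_weights(R_next, G_P, G_s, T):
    """B_hat_{P union {s}} of eq. (35)."""
    return R_next @ np.vstack([G_P.T, G_s.T]) @ T
```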
3.4 The Stopping Criterion for FFS-GELM
For a greedy learning algorithm, it is necessary to determine an appropriate stopping criterion. Here, FFS-GELM will terminate if

$$\frac{\left\|T - G_P \hat{B}_P\right\|_F^2}{\|T\|_F^2} \leq \epsilon \tag{36}$$

is satisfied, where $\epsilon$ is a small positive number. When equation (36) is met, it demonstrates that the targets $T$ have already been explained very well within the tolerated error, which can avoid overfitting and enhance the performance. That is, the generalization performance is enhanced, and the $p$-order reduced polynomial function of appropriate features is selected as the output weights of GELM, which usually means less testing time. If $\epsilon = 0$, FFS-GELM degenerates into the original GELM. Hence, the original GELM can be considered as a special case of FFS-GELM with $\epsilon = 0$. In addition, a nonnegative integer $n_{\max}$ ($0 \leq n_{\max} \leq n$) can be predefined according to our requirements. When the number of selected features reaches $n_{\max}$, FFS-GELM stops, in order to control the number of selected features conveniently and exactly.
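For completeness, the stopping test (36), together with the optional feature budget $n_{\max}$, amounts to a one-line relative-error check (a sketch; the exact tolerance handling is an assumption):

```python
import numpy as np

def ffs_should_stop(T, G_P, B_P, eps, q, n_max):
    """Stopping test of eq. (36) combined with the optional feature budget n_max."""
    err = np.linalg.norm(T - G_P @ B_P, 'fro') ** 2 / np.linalg.norm(T, 'fro') ** 2
    return err <= eps or q >= n_max
```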
3.5 The Flowchart of FFS-GELM Algorithm

The complete structure of the FFS-GELM algorithm is shown as Algorithm 1.

Algorithm 1: FFS-GELM algorithm
1: Initialize:
   • Input training samples $\{(x_k, t_k)\}_{k=1}^{N}$.
   • Choose the type of hidden nodes, and calculate the hidden layer output matrix $H$.
   • Determine the ridge parameter $1/C$.
   • Choose a small $\epsilon > 0$ or a nonnegative integer $n_{\max}$.
   • Let $P = \emptyset$, $Q = \{1, \cdots, n\}$, and $q = 0$.
   • Set $G_P = H$, $R^{(0)} = \left(G_P^{T} G_P + I/C\right)^{-1}$, and calculate $\hat{B}_P$ according to (21).
2: If equation (36) is satisfied or $q \geq n_{\max}$
3:   Go to step 11.
4: Else
5:   Determine the index $s$ according to (30).
6:   Update $R^{(q+1)}$ based on (32).
7:   Calculate $\hat{B}_{P\cup\{s\}}$ from (35).
8:   Let $P \leftarrow P \cup \{s\}$, $Q \leftarrow Q\backslash\{s\}$, and $q \leftarrow q + 1$.
9:   Go to step 2.
10: End
11: Output $y_k^{T} = h^{T}(x_k) \otimes [1, z_{kP}]\, \hat{B}_P$, where $z_{kP} = [z_{ki}, \cdots, z_{kj}]$, $i, j \in P$.
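Putting the pieces together, a driver loop mirroring Algorithm 1 might look as follows. It reuses the helper sketches given earlier (feature_blocks/assemble_GP, select_feature, grow_inverse, grow_weights, ffs_should_stop) and is therefore only runnable with those definitions in scope; it is a schematic rendering of Algorithm 1, not the authors' implementation.

```python
import numpy as np

def ffs_gelm(T, G0, Gi_list, C, eps=1e-3, n_max=None):
    """Schematic FFS-GELM driver following Algorithm 1."""
    n = len(Gi_list)
    n_max = n if n_max is None else n_max
    P, Q, q = [], list(range(n)), 0
    G_P = G0
    R = np.linalg.inv(G_P.T @ G_P + np.eye(G_P.shape[1]) / C)   # R^(0)
    B_P = R @ G_P.T @ T                                         # eq. (21) with P empty
    while not ffs_should_stop(T, G_P, B_P, eps, q, n_max):
        s, _ = select_feature(T, G_P, B_P, Gi_list, Q, C)       # eq. (30)
        R = grow_inverse(R, G_P, Gi_list[s], C)                 # eqs. (32)-(34)
        B_P = grow_weights(R, G_P, Gi_list[s], T)               # eq. (35)
        G_P = np.hstack([G_P, Gi_list[s]])
        P.append(s); Q.remove(s); q += 1
    return P, B_P
```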
3.6 Convergence
Theorem 2 The cost function $J$ in (15) monotonically decreases with an increasing number of selected features.
Proof Assume that FFS-GELM is at the $q$th iteration. Before the $s$th feature is included, the value of the cost function is

$$\hat{J}^{(q)} = \min_{B} \left\{ J = \|B\|_F^2 + C\|T - GB\|_F^2 \right\} \quad \text{s.t.} \quad B_Q = 0 \tag{37}$$

where $B_Q = \left[B_i^{T}, \cdots, B_j^{T}\right]^{T}$, $i, j \in Q$. When the $s$th feature is selected, the value of the cost function becomes

$$\hat{J}^{(q+1)} = \min_{B} \left\{ J = \|B\|_F^2 + C\|T - GB\|_F^2 \right\} \quad \text{s.t.} \quad B_{Q\backslash\{s\}} = 0 \tag{38}$$

Comparing (37) with (38), note that the constraint $B_s = 0$ in (37) does not appear in (38), which indicates that the feasible set of (37) is a subset of that of (38). Hence, we get

$$\hat{J}^{(q+1)} \leq \hat{J}^{(q)} \tag{39}$$

Now, the proof of Theorem 2 is completed.

4 BFS-GELM

BFS-GELM is a backward learning algorithm, which starts with the original GELM and removes the insignificant or redundant features one by one from the $p$-order reduced polynomial function in accordance with some criterion. This process stops when the defined terminating criterion is satisfied. Its procedure is shown in Figure 2. Define a full index set $P = \{1, \cdots, n\}$, and let $q = 0$. Then, BFS-GELM is initialized with $G_P = G$ in (18).

Figure 2 The procedure of BFS-GELM. (Start: train a full GELM. Loop: remove the insignificant or redundant features one by one from the p-order reduced polynomial function in accordance with some criterion. End: BFS-GELM is obtained when the stopping criterion is met.)

4.1 The Criterion of Removing Features
In BFS-GELM, the adopted criterion is similar to that of FFS-GELM. That is, when one feature is removed, the value of the cost function in (15) will change, so we can judge whether the corresponding feature is important or not according to the magnitude of this change. From Theorem 2, it is noted that the value of the cost function will increase with a decreasing number of features. Hence, if the increase of the cost function incurred by removing one feature is small, this feature is viewed as less significant, and vice versa. Different from FFS-GELM, where the most important features are selected, in BFS-GELM the least important feature is removed.
Now assume that BFS-GELM is at the $q$th iteration. The value of the cost function in (15) is denoted by

$$\hat{J}^{(q)} = \min_{B_P} \left\{ J^{(q)} = \|B_P\|_F^2 + C\|T - G_P B_P\|_F^2 \right\} \tag{40}$$

If the $i$th feature is removed at the $(q+1)$th iteration, it becomes

$$\hat{J}_{-i}^{(q+1)} = \min_{B_{P\backslash\{i\}}} \left\{ J_{-i}^{(q+1)} = \|B_{P\backslash\{i\}}\|_F^2 + C\|T - G_{P\backslash\{i\}} B_{P\backslash\{i\}}\|_F^2 \right\} \tag{41}$$

Hence, the index of the feature to be removed at the $(q+1)$th iteration can be obtained by

$$s = \arg\min_{i \in P} \left\{ \Gamma_{-i} = \hat{J}_{-i}^{(q+1)} - \hat{J}^{(q)} \right\} \tag{42}$$
To determine the $s$th feature to be removed, we need to solve the optimization problem (41) $n - q$ times. If each time the optimal solution $\hat{B}_{P\backslash\{i\}}$ in (41) is calculated using (21), the computational complexity is very high or even unacceptable. Hence, it is necessary to develop a method to accelerate the computation of $\Gamma_{-i}$ at each iteration.

4.2 An Accelerating Scheme
Assume that the index $i$ is the $j$th element in the set $P$. Then, consider the following optimization problem

$$\min_{B_P} \left\{ J_i^{(q)} = \|B_P\|_F^2 + \|\mathbf{W} B_P\|_F^2 + C\|T - G_P B_P\|_F^2 \right\} \tag{43}$$

where $\mathbf{W} = \mathrm{diag}\{0_1, \cdots, 0_{j-1}, W, 0_{j+1}, \cdots, 0_{|P|}\}$, $0_l$ is a zero matrix of size $Lp \times Lp$, $l = 1, \cdots, j-1, j+1, \cdots, |P|$, $|\cdot|$ represents the cardinality of a set, and $W$ is a diagonal matrix given by

$$W = \begin{bmatrix} w & & \\ & \ddots & \\ & & w \end{bmatrix}_{Lp \times Lp} \tag{44}$$

where $w > 0$. Letting $\frac{dJ_i^{(q)}}{dB_P} = 0$, we have

$$\tilde{B}_P = \left(I/C + \mathbf{W}^2/C + G_P^{T} G_P\right)^{-1} G_P^{T} T \tag{45}$$

Plugging (45) into (43) gives

$$\hat{J}_i^{(q)} = C\|T\|_F^2 - C\,\mathrm{tr}\left( T^{T} G_P \left(I/C + \mathbf{W}^2/C + G_P^{T} G_P\right)^{-1} G_P^{T} T \right) \tag{46}$$

Denoting $(G_P^{T} G_P + I/C)^{-1}$ by $A^{(q)}$, according to the Woodbury formula [61] we have

$$\left( (A^{(q)})^{-1} + \mathbf{W}^2/C \right)^{-1} = A^{(q)} - A^{(q)} \mathbf{W} \left(CI + \mathbf{W} A^{(q)} \mathbf{W}\right)^{-1} \mathbf{W} A^{(q)} \tag{47}$$

Considering (47) and (40), equation (46) is formulated as

$$\hat{J}_i^{(q)} = \hat{J}^{(q)} + C\,\mathrm{tr}\left( \hat{B}_P^{T} \mathbf{W} \left(CI + \mathbf{W} A^{(q)} \mathbf{W}\right)^{-1} \mathbf{W} \hat{B}_P \right) \tag{48}$$

where $\hat{B}_P$ is the optimal solution of (40).
Comparing (41) with (43), when $w$ goes to infinity, which is equivalent to removing the $i$th feature from (43), $J_i^{(q)}$ degenerates into $J_{-i}^{(q+1)}$. Hence

$$\hat{J}_{-i}^{(q+1)} = \lim_{w \to +\infty} \hat{J}_i^{(q)} \tag{49}$$

and

$$\Gamma_{-i} = \lim_{w \to +\infty} C\,\mathrm{tr}\left( \hat{B}_P^{T} \mathbf{W} \left(CI + \mathbf{W} A^{(q)} \mathbf{W}\right)^{-1} \mathbf{W} \hat{B}_P \right) \tag{50}$$

Equation (50) can be simplified further to

$$\Gamma_{-i} = C\,\mathrm{tr}\left( \hat{B}_i^{T} \left(A_{ii}^{(q)}\right)^{-1} \hat{B}_i \right) \tag{51}$$

where $\hat{B}_i$ and $A_{ii}^{(q)}$ are the submatrices of the optimal solution $\hat{B}_P$ of (40) and of $A^{(q)}$, respectively, corresponding to the $i$th feature.

Theorem 3 $\Gamma_{-i} \geq 0$ in (51) holds.

The proof of Theorem 3 is similar to that of Theorem 1, so we omit it here.

To obtain $\Gamma_{-i}$ directly using (42), the computational cost is $O\left(L^2 p^2 (n-q)^2 (Lp(n-q) + N)\right)$. In contrast, the computational complexity of calculating (51) is $O(L^3 p^3)$. Evidently, the computational cost of $\Gamma_{-i}$ is reduced with (51). Consequently, equation (42) is simplified as

$$s = \arg\min_{i \in P} \left\{ \Gamma_{-i} = C\,\mathrm{tr}\left( \hat{B}_i^{T} \left(A_{ii}^{(q)}\right)^{-1} \hat{B}_i \right) \right\} \tag{52}$$

4.3 Iteratively Decreasing
When the $s$th feature is determined to be removed, we need to calculate $A^{(q+1)}$ ready for the next iteration if BFS-GELM does not stop. If the classical techniques are utilized to obtain $A^{(q+1)}$, the computational cost is $O\left(L^2 p^2 (n-q-1)^2 (Lp(n-q-1) + N)\right)$. Here, we make use of the Sherman-Morrison formula [61] to obtain it at a low cost. Assume that $A^{(q)}$ is already known; then it is decomposed as

$$A^{(q)} = \begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{12}^{T} & A_{22} & A_{23} \\ A_{13}^{T} & A_{23}^{T} & A_{33} \end{bmatrix} \tag{53}$$

where the matrices $A_{12}$, $A_{22}$, and $A_{23}$ correspond to the $s$th feature. Hence

$$A^{(q+1)} = \begin{bmatrix} A_{11} & A_{13} \\ A_{13}^{T} & A_{33} \end{bmatrix} - \begin{bmatrix} A_{12} \\ A_{23}^{T} \end{bmatrix} A_{22}^{-1} \begin{bmatrix} A_{12}^{T} & A_{23} \end{bmatrix} \tag{54}$$

The computational cost of calculating $A^{(q+1)}$ is reduced to $O\left(Lp(n-q-1)^2\right)$. Obviously, the computational burden is cut down using (54). Meanwhile,

$$\hat{B}_{P\backslash\{s\}} = A^{(q+1)} G_{P\backslash\{s\}}^{T} T \tag{55}$$

In addition, let $P \leftarrow P\backslash\{s\}$ and $q \leftarrow q + 1$.
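A compact sketch of the backward step combines the simplified criterion (51)–(52) with the Schur-complement downdate (54): the score of each removable block is read off from the corresponding diagonal block of $A^{(q)}$ and the current $\hat{B}_P$, and the chosen block is then removed from $A^{(q)}$ without refactorization. Block bookkeeping (for example, keeping the constant block $G_0$) is glossed over; the names and the offset convention are illustrative assumptions.

```python
import numpy as np

def removal_scores(A_q, B_P, offsets):
    """Gamma_{-i} of eq. (51), up to the common factor C, for every removable block.
    `offsets` holds the (start, stop) column range of each block inside G_P."""
    scores = []
    for start, stop in offsets:
        A_ii = A_q[start:stop, start:stop]
        B_i = B_P[start:stop, :]
        scores.append(np.trace(B_i.T @ np.linalg.solve(A_ii, B_i)))
    return np.array(scores)

def downdate_inverse(A_q, start, stop):
    """Remove the block occupying columns [start, stop) from A^(q) via eq. (54),
    i.e. a Schur-complement update instead of a fresh inversion."""
    keep = np.r_[0:start, stop:A_q.shape[0]]
    A_kk = A_q[np.ix_(keep, keep)]
    A_ks = A_q[np.ix_(keep, np.arange(start, stop))]
    A_ss = A_q[start:stop, start:stop]
    return A_kk - A_ks @ np.linalg.solve(A_ss, A_ks.T)
```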
4.4 The Stopping Criterion for BFS-GELM
In this paper, the following stopping criterion is adopted for BFS-GELM

$$\frac{\left\|T - G_P \hat{B}_P\right\|_F^2}{\|T\|_F^2} \leq 1 - \rho \tag{56}$$

where $\rho > 0$, which can balance the overfitting against the model complexity. A larger $\rho$ means fewer features to construct the $p$-order reduced polynomial function as output weights, less testing time, and larger training errors. Conversely, a smaller $\rho$ signifies more features to construct the $p$-order reduced polynomial function as output weights, more testing time, and smaller training errors. When $\rho = 0$, there are no features to be removed. In this situation, BFS-GELM amounts to the original GELM. In addition, we can directly specify the number of features to be removed for BFS-GELM. This method is convenient for controlling the number of features used to construct the $p$-order reduced polynomial function as output weights.
4.5 The Flowchart of BFS-GELM Algorithm

In summary, the procedure of BFS-GELM is depicted as Algorithm 2.

Algorithm 2: BFS-GELM algorithm
1: Initialize:
   • Input training samples $\{(x_k, t_k)\}_{k=1}^{N}$.
   • Choose the type of hidden nodes, and calculate the hidden layer output matrix $H$.
   • Determine the ridge parameter $1/C$.
   • Choose a small $\rho > 0$ or a nonnegative integer $n_{\mathrm{remove}}$.
   • Let $P = \{1, \cdots, n\}$, and $q = 0$.
   • Set $G_P = G$, and calculate $A^{(0)} = \left(G_P^{T} G_P + I/C\right)^{-1}$ and $\hat{B}_P = A^{(0)} G_P^{T} T$.
2: If equation (56) is satisfied or $q \geq n_{\mathrm{remove}}$
3:   Go to step 11.
4: Else
5:   Determine the index $s$ according to (52).
6:   Update $A^{(q+1)}$ based on (54).
7:   Calculate $\hat{B}_{P\backslash\{s\}}$ from (55).
8:   Let $P \leftarrow P\backslash\{s\}$, and $q \leftarrow q + 1$.
9:   Go to step 2.
10: End
11: Output $y_k^{T} = h^{T}(x_k) \otimes [1, z_{kP}]\, \hat{B}_P$, where $z_{kP} = [z_{ki}, \cdots, z_{kj}]$, $i, j \in P$.
5 Experiments

In this section, the performance of the proposed FFS-GELM and BFS-GELM is tested through experiments on twelve benchmark data sets, which are divided into three groups:
(1) Multiple-output data sets: Sml2010, Parkinsons, and Concrete slump;
(2) Single-output data sets of small size: Triazines, Boston housing, Winequality red, and Winequality white;
(3) Single-output data sets of large size: Bank32NH, Puma32H, Cpu act, Ailerons, and Elevators,
where Sml2010, Parkinsons, Concrete slump, Boston housing, Winequality red, and Winequality white are obtained
Table 1 Specifications of each benchmark data set

Data sets           #Training   #Testing   #Features   #Outputs
Sml2010             400         368        8           2
Parkinsons          3000        2875       16          2
Concrete slump      60          43         7           3
Triazines           100         85         58          1
Boston housing      350         156        13          1
Winequality red     800         559        11          1
Winequality white   2500        1461       11          1
Bank32NH            4500        3692       32          1
Puma32H             4500        3692       32          1
Cpu act             4500        3692       21          1
Ailerons            7154        6596       40          1
Elevators           8752        7847       18          1

Notes: #Training represents the number of training samples, #Testing the number of testing samples, #Features the number of input features, and #Outputs the number of outputs.
from the well-known UCI repository¹, while Triazines, Bank32NH, Puma32H, Cpu act, Ailerons, and Elevators are available from the data collection². Each data set is randomly divided into two splits: the training set and the testing set. Their details are tabulated in Table 1.
In this paper, two typical hidden nodes are chosen as activation functions of the hidden layer, i.e., the Sigmoid $h(x) = 1/(1 + \exp\{-x^{T} a_i\})$ and the RBF $h(x) = \exp\left(-b_i \|x - a_i\|^2\right)$, where $a_i$ is randomly chosen from the range $[-1, 1]$ and $b_i$ is chosen from the range $(0, 0.5)$ [41]. All experiments have been carried out in the MATLAB 2013b environment running on an ordinary personal laptop with an Intel(R) Core(TM) i3-2310M CPU at 2.10 GHz and 2.00 GB RAM. To find the average performance rather than the best one, thirty trials are conducted for each data set with every algorithm. In GELM, there exist a parameter $C$ and the polynomial order $p$ to be determined before the experiments. Similar to [59], GELMs with $p = 1$ and $p = 2$ are mainly focused on. For each $p$, the ten-fold cross-validation technique is utilized to determine a nearly optimal parameter $C$ from $\{2^{-20}, 2^{-19}, \cdots, 2^{19}, 2^{20}\}$ with the original GELM, shown in Table 2, and then it is extended to FFS-GELM and BFS-GELM. To facilitate comparisons among different algorithms, the performance index root mean square error (RMSE) is defined by

$$\mathrm{RMSE} = \sqrt{\frac{1}{\#\mathrm{Testing} \times m} \sum_{i=1}^{\#\mathrm{Testing}} \left\| t_i - \hat{t}_i \right\|_F^2} \tag{57}$$

where $\hat{t}_i$ is the estimated value of $t_i$. The smaller the RMSE is, the better the generalization performance of the algorithm.
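Equation (57) translates directly into a few lines (a sketch; it simply averages the squared errors over all testing samples and output dimensions before taking the square root):

```python
import numpy as np

def rmse(T_true, T_pred):
    """RMSE of eq. (57): squared errors averaged over (#Testing x m) entries."""
    n_test, m = T_true.shape
    return np.sqrt(np.sum((T_true - T_pred) ** 2) / (n_test * m))
```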
¹ http://archive.ics.uci.edu/ml/
² http://www.dcc.fc.up.pt/%7Eltorgo/Regression/DataSets.html
Table 2 The nearly optimal parameters for C

Data sets           Hidden node type   p   log2 C
Sml2010             Sigmoid            1   7
                    Sigmoid            2   9
                    RBF                1   −2
                    RBF                2   0
Parkinsons          Sigmoid            1   3
                    Sigmoid            2   5
                    RBF                1   6
                    RBF                2   6
Concrete slump      Sigmoid            1   −1
                    Sigmoid            2   −2
                    RBF                1   −1
                    RBF                2   −2
Triazines           Sigmoid            1   −7
                    Sigmoid            2   −6
                    RBF                1   3
                    RBF                2   5
Boston housing      Sigmoid            1   4
                    Sigmoid            2   3
                    RBF                1   8
                    RBF                2   7
Winequality red     Sigmoid            1   −3
                    Sigmoid            2   −3
                    RBF                1   −2
                    RBF                2   −1
Winequality white   Sigmoid            1   −3
                    Sigmoid            2   −3
                    RBF                1   −1
                    RBF                2   2
Bank32NH            Sigmoid            1   −5
                    Sigmoid            2   −5
                    RBF                1   1
                    RBF                2   1
Puma32H             Sigmoid            1   −9
                    Sigmoid            2   −2
                    RBF                1   −2
                    RBF                2   6
Cpu act             Sigmoid            1   2
                    Sigmoid            2   0
                    RBF                1   11
                    RBF                2   9
Ailerons            Sigmoid            1   0
                    Sigmoid            2   2
                    RBF                1   0
                    RBF                2   6
Elevators           Sigmoid            1   7
                    Sigmoid            2   2
                    RBF                1   7
                    RBF                2   7
Figures 3 and 4 show the experimental results for Sml2010 with one trial. In each panel, there are three lines: the black dashed line represents the original GELM, the red solid line with circle marks represents FFS-GELM, and the blue solid line with triangle marks represents BFS-GELM. For each red or blue line, there are always some marks below the dashed line, which indicates that FFS-GELM or BFS-GELM is able to select appropriate features to construct the p-order reduced polynomial function as output weights for GELM and achieve better generalization performance than the original GELM. This verifies the idea of employing partial features instead of complete features to construct the p-order reduced polynomial function as output weights for GELM. Similar results for the other benchmark data sets are omitted due to space limitations. From these plots, notice that there is always a minimum value for FFS-GELM or BFS-GELM. In the worst cases, the minimum values of FFS-GELM and BFS-GELM are equal to the RMSEs of the original GELM. We collect these minimum values and the other experimental results of all benchmark data sets in Table 3. In practice, the optimal features to construct the p-order reduced polynomial function as output weights for GELM can be easily determined with the combination of the proposed stopping criteria and the cross-validation technique.

Table 3 Detailed experimental results of benchmark data sets
Data sets Sml2010
Hidden nodes type Sigmoid
p 1
Algorithms
#Feature (sparsity ratio)
RMSE
Training time
Testing time (s)
(sec.)
(sec.) 1.177±0.031
GELM
16 (1.000)
7.5441E-02±3.2171E-03
2.912±0.033
FFS-GELM
8 (0.500)
6.9330E-02±2.1736E-03
3.648±0.043
0.616±0.007
BFS-GELM
3 (0.188)
6.9274E-02±2.2778E-03
3.335±0.037
0.240±0.005
continued on next page
18
ACCEPTED MANUSCRIPT – continued from previous page
2
RBF
1
2
Parkinsons
Sigmoid
1
2
RBF
1
2
Concrete slump
Sigmoid
1
2
Algorithms
Sigmoid
AC
CE
Triazines
RBF
1
2
1
2
Boston
housing
Sigmoid
1
2
Testing time (s)
(sec.)
(sec.) 1.188±0.012
16 (1.000)
7.6941E-02±2.9320E-03
2.908±0.045
9 (0.563)
7.4365E-02±3.3046E-03
3.769±0.058
0.671±0.006
BFS-GELM
6 (0.375)
7.3721E-02±3.0315E-03
3.321±0.014
0.453±0.004
GELM
16 (1.000)
6.7981E-02±2.9029E-03
3.957±0.015
1.723±0.013
FFS-GELM
9 (0.563)
6.5092E-02±2.8309E-03
4.905±0.057
1.201±0.007
BFS-GELM
4 (0.250)
6.7230E-02±5.3215E-03
4.369±0.018
0.831±0.006
GELM
16 (1.000)
7.4476E-02±2.0333E-03
3.394±0.017
1.424±0.005
FFS-GELM
9 (0.563)
7.1343E-02±3.1724E-03
4.304±0.038
BFS-GELM
4 (0.250)
7.2302E-02±2.7983E-03
3.870±0.022
0.578±0.008
GELM
16 (1.000)
2.0021E-01±6.9923E-04
3.114±0.056
2.448±0.023
FFS-GELM
15 (0.938)
2.0006E-01±4.5077E-04
5.164±0.068
2.387±0.026
BFS-GELM
13 (0.813)
2.0013E-01±3.8468E-04
3.455±0.010
2.061±0.005 2.488±0.010
0.944±0.007
GELM
16 (1.000)
1.9923E-01±1.1978E-03
3.122±0.019
FFS-GELM
16 (1.000)
1.9923E-01±1.1977E-03
5.254±0.056
2.484±0.007
BFS-GELM
15 (0.938)
1.9918E-01±1.1067E-03
3.391±0.011
2.327±0.013 3.496±0.026
GELM
16 (1.000)
2.0547E-01±1.3232E-03
4.200±0.037
FFS-GELM
8 (0.500)
2.0458E-01±8.2917E-04
5.081±0.033
2.363±0.009
BFS-GELM
7 (0.438)
2.0518E-01±1.5356E-03
4.681±0.007
2.207±0.010
GELM
16 (1.000)
2.0258E-01±7.3960E-04
3.692±0.019
2.983±0.013
FFS-GELM
15 (0.938)
2.0257E-01±6.5583E-04
5.776±0.032
3.000±0.010
BFS-GELM
15 (0.938)
2.0257E-01±6.5583E-04
4.092±0.034
3.000±0.010
GELM
7 (1.0000)
1.7077E-01±9.6989E-04
0.105±0.009
0.023±0.004
FFS-GELM
7 (1.000)
1.7077E-01±9.6989E-04
0.361±0.035
0.023±0.004
BFS-GELM
7 (1.000)
1.7077E-01±9.6989E-04
0.274±0.024
0.025±0.004
7 (1.000)
1.8620E-01±1.4599E-03
0.138±0.007
0.025±0.003
3 (0.429)
1.7483E-01±1.6743E-03
0.088±0.010
0.013±0.002
2 (0.286)
1.7650E-01±1.7228E-03
0.481±0.100
0.009±0.002
GELM
7 (1.000)
1.7842E-01±1.5312E-03
0.151±0.013
0.058±0.009
3 (0.429)
1.7772E-01±1.6802E-03
0.143±0.020
0.048±0.001
BFS-GELM
2 (0.286)
1.7745E-01±1.5896E-03
0.454±0.033
0.042±0.003
GELM
7 (1.000)
1.8906E-01±2.3262E-03
0.167±0.010
0.046±0.005 0.032±0.005
ED
GELM
FFS-GELM
PT
2
Training time
GELM
FFS-GELM
1
RMSE
FFS-GELM
BFS-GELM RBF
#Feature (sparsity ratio)
CR IP T
p
AN US
type
M
Hidden nodes
Data sets
FFS-GELM
2 (0.286)
1.8164E-01±4.5755E-03
0.083±0.004
BFS-GELM
5 (0.714)
1.8354E-01±3.1910E-03
0.416±0.031
0.038±0.004
GELM
58 (1.000)
1.8796E-01±2.8187E-03
0.448±0.013
0.279±0.010
FFS-GELM
19 (0.328)
1.8623E-01±2.1782E-03
0.486±0.017
0.093±0.003
BFS-GELM
15 (0.259)
1.8697E-01±2.5380E-03
1.171±0.019
0.074±0.004
GELM
58 (1.000)
1.8854E-01±5.0281E-03
0.450±0.013
0.263±0.007
FFS-GELM
32 (0.552)
1.8715E-01±5.5982E-03
0.696±0.027
0.140±0.002
BFS-GELM
23 (0.397)
1.8546E-01±5.2845E-03
1.200±0.041
0.102±0.002
GELM
58 (1.000)
2.3454E-01±3.3161E-02
0.449±0.012
0.280±0.007
FFS-GELM
32 (0.552)
2.2071E-01±3.9278E-02
0.732±0.034
0.162±0.005
BFS-GELM
56 (0.966)
2.3448E-01±3.3182E-02
0.656±0.009
0.270±0.003
GELM
58 (1.000)
4.1096E-01±2.2298E-01
0.456±0.016
0.272±0.011
FFS-GELM
55 (0.948)
4.0983E-01±2.2392E-01
1.665±0.027
0.245±0.004
BFS-GELM
55 (0.948)
4.1077E-01±2.2319E-01
0.699±0.006
0.245±0.004 0.123±0.005
GELM
13 (1.000)
6.6357E-02±1.8047E-03
0.408±0.018
FFS-GELM
8 (0.615)
6.6211E-02±1.6074E-03
0.490±0.037
0.081±0.011
BFS-GELM
4 (0.308)
6.6000E-02±2.4952E-03
0.766±0.062
0.041±0.006
GELM
13 (1.000)
6.7950E-02±2.1322E-03
0.410±0.008
0.127±0.006
FFS-GELM
13 (1.000)
6.7950E-02±2.1322E-03
0.935±0.079
0.128±0.010
continued on next page
19
ACCEPTED MANUSCRIPT – continued from previous page
1
2
Winequality red
Sigmoid
1
2
RBF
1
2
Winequality white
Sigmoid
1
2
Algorithms
1
AC
RBF
Puma32H
1
2
1
2
Sigmoid
1
2
RBF
1
Testing time (s)
(sec.)
(sec.) 0.129±0.012
13 (1.000)
6.7950E-02±2.1322E-03
0.633±0.046
13 (1.000)
7.4286E-02±1.7882E-03
0.580±0.012
0.206±0.009
FFS-GELM
4 (0.308)
7.1813E-02±3.9564E-03
0.513±0.031
0.119±0.013
BFS-GELM
5 (0.385)
7.4116E-02±7.8505E-03
0.879±0.010
0.120±0.005
GELM
13 (1.000)
7.3376E-02±3.1256E-03
0.512±0.011
0.176±0.014
FFS-GELM
11 (0.846)
7.3055E-02±3.9283E-03
0.838±0.031
0.156±0.009
BFS-GELM
4 (0.308)
7.1807E-02±5.6049E-03
0.931±0.010
0.091±0.008
0.727±0.012
0.363±0.007
GELM
11 (1.000)
1.3071E-01±1.3585E-04
FFS-GELM
6 (0.545)
1.3052E-01±2.3555E-04
BFS-GELM
3 (0.273)
GELM
11 (1.000)
FFS-GELM BFS-GELM
0.823±0.026
0.201±0.006
1.3049E-01±1.4441E-04
1.002±0.011
0.100±0.003
1.3010E-01±1.9106E-04
0.736±0.014
0.374±0.016
9 (0.818)
1.3010E-01±2.1333E-04
1.058±0.037
0.298±0.009
3 (0.273)
1.2975E-01±5.9066E-04
1.015±0.012
0.105±0.003
GELM
11 (1.000)
1.3138E-01±2.9877E-04
1.158±0.018
0.679±0.014
FFS-GELM
11 (1.000)
1.3138E-01±2.9877E-04
1.803±0.036
0.677±0.011
BFS-GELM
5 (0.455)
GELM
11 (1.000)
1.491±0.014
0.491±0.018
1.2987E-01±3.2645E-04
1.3094E-01±3.6282E-04
0.989±0.021
0.543±0.013 0.546±0.008
FFS-GELM
11 (1.000)
1.2987E-01±3.2645E-04
1.537±0.044
BFS-GELM
11 (1.000)
1.2987E-01±3.2645E-04
1.163±0.026
0.544±0.010
GELM
11 (1.000)
1.1976E-01±1.8074E-04
2.082±0.041
0.922±0.015
FFS-GELM
10 (0.909)
1.1939E-01±1.0837E-04
3.063±0.054
0.837±0.005
BFS-GELM
8 (0.727)
1.1965E-01±1.8252E-04
2.344±0.018
0.671±0.006
GELM
11 (1.000)
1.1837E-01±2.2564E-04
2.095±0.030
0.923±0.009
FFS-GELM
7 (0.636)
1.1814E-01±2.1145E-04
2.567±0.043
0.595±0.007
1.1817E-01±2.2382E-04
2.307±0.018
0.758±0.010
11 (1.000)
1.2054E-01±3.7069E-04
3.466±0.026
1.756±0.033
9 (0.818)
1.1918E-01±1.2819E-04
4.294±0.022
1.596±0.013
GELM
9 (0.818)
9 (0.818)
1.2047E-01±4.5031E-04
3.708±0.016
1.596±0.008
GELM
11 (1.000)
1.1756E-01±4.5079E-04
2.852±0.039
1.378±0.030
FFS-GELM
11 (1.000)
1.1756E-01±4.5079E-04
3.962±0.044
1.379±0.012
BFS-GELM
11 (1.000)
1.1756E-01±4.5079E-04
2.996±0.015
1.379±0.008
32 (1.000)
1.0128E-01±1.9910E-04
8.670±0.120
6.317±0.030
ED
BFS-GELM
PT
Sigmoid
CE
Bank32NH
Training time
GELM
FFS-GELM
2
RMSE
BFS-GELM
BFS-GELM RBF
#Feature (sparsity ratio)
CR IP T
RBF
p
AN US
type
M
Hidden nodes
Data sets
GELM
FFS-GELM
28 (0.875)
1.0122E-01±1.9148E-04
16.121±0.146
5.572±0.068
BFS-GELM
17 (0.531)
1.0095E-01±2.4823E-04
9.003±0.080
3.363±0.003
GELM
32 (1.000)
1.0309E-01±6.2558E-04
8.097±0.103
5.987±0.052
FFS-GELM
21 (0.656)
1.0297E-01±6.3869E-04
13.712±0.047
3.993±0.017
BFS-GELM
17 (0.531)
1.0282E-01±7.0511E-04
8.679±0.014
3.240±0.007 6.841±0.084
GELM
32 (1.000)
1.0121E-01±6.1947E-04
9.211±0.250
FFS-GELM
31 (0.969)
1.0120E-01±6.1453E-04
17.508±0.071
6.631±0.029
BFS-GELM
19 (0.594)
1.0116E-01±6.9315E-04
9.652±0.065
4.360±0.020
GELM
32 (1.000)
1.0144E-01±1.0100E-03
8.367±0.022
6.434±0.009
FFS-GELM
32 (1.000)
1.0144E-01±1.0100E-03
16.847±0.040
6.435±0.028
BFS-GELM
32 (1.000)
1.0144E-01±1.0100E-03
8.832±0.087
6.450±0.011
GELM
32 (1.000)
1.5627E-01±3.2272E-04
8.343±0.030
6.208±0.021
FFS-GELM
12 (0.375)
1.5613E-01±3.5118E-04
10.624±0.102
2.290±0.013
BFS-GELM
15 (0.469)
1.5610E-01±3.4875E-04
8.690±0.060
2.862±0.029
GELM
32 (1.000)
1.3493E-01±1.0333E-02
7.910±0.101
5.869±0.016 1.485±0.019
FFS-GELM
8 (0.250)
1.2872E-01±9.7299E-03
8.973±0.085
BFS-GELM
9 (0.281)
1.2893E-01±9.7028E-03
8.423±0.024
1.660±0.010
GELM
32 (1.000)
1.5502E-01±1.2037E-03
9.269±0.083
6.937±0.039
continued on next page
20
ACCEPTED MANUSCRIPT – continued from previous page p
2
Cpu act
Sigmoid
1
2
RBF
1
2
Ailerons
Sigmoid
1
2
RBF
1
Algorithms
1
1.5491E-01±1.1480E-03
16.396±0.067
5.860±0.016
1.5484E-01±1.3445E-03
9.522±0.050
4.329±0.017 6.290±0.087
GELM
32 (1.000)
1.4457E-01±6.4977E-03
8.493±0.254
FFS-GELM
14 (0.438)
1.4236E-01±7.0176E-03
11.902±0.094
3.112±0.011
BFS-GELM
11 (0.344)
1.4249E-01±7.7472E-03
9.512±0.293
2.548±0.029 4.018±0.015
GELM
21 (1.000)
3.0365E-02±4.2305E-04
5.710±0.107
FFS-GELM
13 (0.619)
3.0039E-02±6.1038E-04
8.343±0.054
2.577±0.013
BFS-GELM
10 (0.476)
3.0126E-02±8.3810E-04
6.205±0.016
2.000±0.009
GELM
21 (1.000)
2.7799E-02±6.5421E-04
2
3.899±0.018
FFS-GELM
13 (0.619)
2.7546E-02±5.2142E-04
7.994±0.151
2.418±0.021
13 (0.619)
2.7572E-02±5.8409E-04
5.865±0.020
2.418±0.013
GELM
21 (1.000)
3.5123E-02±2.4610E-03
6.999±0.119
5.134±0.095
FFS-GELM
15 (0.714)
3.4269E-02±1.9096E-03
10.402±0.050
4.033±0.013
BFS-GELM
12 (0.571)
3.3974E-02±1.8052E-03
7.755±0.089
3.457±0.034
GELM
21 (1.000)
3.1535E-02±1.1989E-03
6.181±0.012
4.455±0.009
2.9740E-02±1.4879E-03
7.698±0.065
2.296±0.008
FFS-GELM
9 (0.429)
BFS-GELM
8 (0.381)
2.9987E-02±1.3073E-03
6.781±0.030
2.105±0.012
GELM
40 (1.000)
4.4834E-02±8.3945E-05
16.489±0.298
14.312±0.307
FFS-GELM
19 (0.475)
4.4766E-02±7.7205E-05
24.455±0.244
6.505±0.056
BFS-GELM
17 (0.425)
4.4722E-02±5.5490E-05
16.575±0.163
5.797±0.014 12.912±0.026
GELM
40 (1.000)
4.5229E-02±1.6300E-04
15.257±0.052
FFS-GELM
16 (0.400)
4.4872E-02±2.0545E-04
21.647±0.074
5.164±0.027
BFS-GELM
14 (0.350)
4.4916E-02±2.2684E-04
15.889±0.096
4.540±0.018
4.5031E-02±1.6822E-04
16.797±0.032
14.435±0.055
39 (0.975)
4.5031E-02±1.6822E-04
35.639±0.127
14.496±0.022
20(0.500)
4.5010E-02±1.8694E-04
17.825±0.135
7.940±0.010
40 (1.000)
4.6093E-02±8.0560E-04
16.250±0.125
13.720±0.161
FFS-GELM
17 (0.425)
4.5815E-02±1.0251E-03
23.654±0.167
6.322±0.039
BFS-GELM
21 (0.525)
4.5782E-02±9.0675E-04
17.083±0.030
7.616±0.009
GELM
18 (1.000)
3.5201E-02±3.4013E-04
9.797±0.088
7.520±0.019
14 (0.778)
3.4276E-02±1.1756E-04
14.884±0.251
5.911±0.118
11 (0.611)
3.4402E-02±1.9835E-04
10.171±0.075
4.621±0.023 7.441±0.069
GELM
GELM
PT CE AC
1
5.487±0.026
BFS-GELM
FFS-GELM
RBF
(sec.)
27 (0.844)
BFS-GELM
2
Testing time (s)
(sec.)
19 (0.594)
40 (1.000)
ED
Sigmoid
Training time
FFS-GELM
BFS-GELM
Elevators
RMSE
BFS-GELM
FFS-GELM
2
#Feature (sparsity ratio)
CR IP T
type
AN US
Hidden nodes
M
Data sets
GELM
18 (1.000)
3.4653E-02±2.7360E-04
9.795±0.236
FFS-GELM
15 (0.833)
3.4066E-02±1.0453E-04
15.208±0.186
6.282±0.060
BFS-GELM
11 (0.611)
3.4205E-02±9.4131E-05
10.015±0.041
4.542±0.024 10.170±0.047
GELM
18 (1.000)
3.5250E-02±6.2821E-04
12.846±0.196
FFS-GELM
15 (0.833)
3.4143E-02±2.5593E-04
18.293±0.652
8.936±0.162
BFS-GELM
13 (0.722)
3.4150E-02±2.4449E-04
13.025±0.143
8.050±0.061
GELM
18 (1.000)
3.4439E-02±2.9209E-04
11.155±0.018
8.737±0.016
FFS-GELM
15 (0.833)
3.3916E-02±1.5227E-04
16.574±0.073
7.585±0.009
BFS-GELM
11 (0.611)
3.3921E-02±1.5153E-04
11.943±0.356
6.034±0.144
In Table 3, the sparsity ratio is defined as the ratio of the number of selected features to the number of complete input features. Assume that the sparsity ratio is $\tau$; then, roughly according to (16), the testing time is reduced to $\frac{\tau np + 1}{np + 1}$ of the original.
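As an illustrative calculation (the numbers are chosen for concreteness and are not taken from Table 3): with $n = 16$ input features and $p = 1$, a sparsity ratio of $\tau = 0.5$ (8 selected features) gives $\frac{\tau np + 1}{np + 1} = \frac{9}{17} \approx 0.53$, i.e., the testing time drops to roughly half of the original, close to the linear estimate $\tau = 0.5$.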
This can be further simplified to $\tau$ if $np \gg 1$. From this formula, it is obvious that the relationship between the sparsity ratio and the testing time is approximately linear. The sparsity ratio of the original GELM is always equal to one, because it utilizes complete input features to construct the p-order reduced polynomial function as output weights. Statistically, BFS-GELM always needs fewer input features to construct the p-order reduced polynomial function than FFS-GELM when obtaining the best generalization performance. Fewer input features mean a smaller sparsity ratio. Hence, the sparsity ratio of BFS-GELM is smaller than that of FFS-GELM. Since the testing time is directly proportional to the sparsity ratio, BFS-GELM statistically outperforms FFS-GELM in terms of the testing time. Due to the lack of sparsity in the input features used to construct the p-order reduced polynomial function as output weights, the original GELM needs more testing time. Therefore, in those scenarios which have a strict
demand on the testing time, BFS-GELM is most preferred while still obtaining good generalization performance. The original GELM not only loses its edge on the testing time, but also possesses the poorest generalization performance. Hence, the motivations for developing FFS-GELM and BFS-GELM are confirmed. That is, the generalization performance of the original GELM can be improved by using partial input features to construct the p-order reduced polynomial function as output weights. One main underlying reason is that insignificant or redundant input features used in constructing the p-order reduced polynomial function as output weights impair the generalization performance.
Statistically, FFS-GELM is slightly superior to BFS-GELM with respect to the generalization performance, but not obviously so. In other words, the generalization performance of BFS-GELM is comparable to that of FFS-GELM. In theory, as a forward learning algorithm, FFS-GELM should need less training time than BFS-GELM. However, according to Table 3, it loses the advantage in terms of the training time. This phenomenon originates from the fact that FFS-GELM usually needs more input features for constructing the p-order reduced polynomial function as output weights when reaching the best generalization performance. The more input features to be selected, the more iterations are required, which naturally incurs more training time. Among these algorithms, the original GELM needs the least training time, the reason being that it takes no action on the input features used to construct the p-order reduced polynomial function as output weights.
All in all, both FFS-GELM and BFS-GELM improve the original GELM with respect to generalization by using partial input features to construct the p-order reduced polynomial function as output weights, and they need less testing time. In comparison with FFS-GELM, BFS-GELM is the winner in terms of the sparsity ratio, the training time, and the testing time, but slightly loses the advantage in the generalization performance.

6 Conclusions
As a popular topic in SLFNs, ELM has drawn many researchers’ interest due to its very fast learning speed, good generalization ability, and ease of implementation. Different from traditional SLFNs such as radial basis function network, ELM generates the input weights and the biases of hidden layer randomly, and its output weights are analytically determined through solving the Moore-Penrose generalized inverse of a linear system. In the original ELM, output weights are constant, which are not directly related to input features at all. However, a generalized ELM was recently proposed, in which output weights have a close relationship with input features. That is, this GELM utilizes the p-order reduced polynomial functions of complete input features to construct output weights. To be more important, the approximation capability of GELM had already been proved, and it can obtain better generalization
performance than the original ELM according to the experimental reports. However, according to the empirical results, it is found that there may exist insignificant or redundant input features for constructing the p-order reduced polynomial function as output weights in GELM. These insignificant or redundant input features not only deteriorate the generalization performance, but also increase the testing time. Hence, it is necessary and important to select appropriate input features to construct the p-order reduced polynomial function as output weights for GELM. The greedy learning algorithms are commonly-used tricks for feature selection in the machine learning community. Hence, two algorithms, i.e., FFS-GELM and BFS-GELM, are proposed in this paper. Specifically, FFS-GELM is a forward
learning algorithm, while BFS-GELM is a backward learning algorithm. Both of them can select appropriate input features for GELM. Compared with the original GELM, FFS-GELM and BFS-GELM improve the generalization performance and reduce the testing time. Moreover, the experimental results favor these conclusions. Of course, there are some open problems in the proposed FFS-GELM and BFS-GELM. For example, how to choose an appropriate sparsity ratio, which can balance the relationship between the generalization performance and the testing time well, needs further discussion. In this paper, both FFS-GELM and BFS-GELM are applied to regression problems, so it
is expected to extend them to classification problems. In addition, the influence of the number of hidden nodes L and the polynomial order p on the testing time is another good research topic.
Acknowledgments
This research was partially supported by the Fundamental Research Funds for the Central Universities under nos. NJ20160021 and NS2017013, and the National Natural Science Foundation of China under Grant no. 11502008.
References
[1] J. Park and I. W. Sandberg, “Universal approximation using radial-basis-function networks,” Neural Computation, vol. 3, no. 2, pp. 246–257, 1991. [Online]. Available: http://dx.doi.org/10.1162/neco.1991.3.2.246
[2] K. Hornik, "Approximation capabilities of multilayer feedforward networks," Neural Networks, vol. 4, no. 2, pp. 251 – 257, 1991. [Online]. Available: http://www.sciencedirect.com/science/article/pii/089360809190009T
[3] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken, "Multilayer feedforward networks with a nonpolynomial activation function can approximate any function," Neural Networks, vol. 6, no. 6, pp. 861 – 867, 1993.
[4] S. Chen, S. Billings, and W. Luo, “Orthogonal least squares methods and their application to non-linear system identification,” International Journal of Control, vol. 50, no. 5, pp. 1873 – 1896, 1989.
[5] J. Leonard and M. A. Leonard, “Radial basis function networks for classifying process faults,” IEEE Control Systems, vol. 11, no. 3, pp. 31–38, 1991. [6] H. Leung, G. Hennessey, and A. Drosopoulos, “Signal detection using the radial basis function coupled map lattice,” IEEE Transactions on Neural Networks, vol. 11, no. 5, pp. 1133 – 1151, 2000. [Online]. Available: http://dx.doi.org/10.1109/72.870045
23
ACCEPTED MANUSCRIPT [7] H. Leung, T. Lo, and S. Wang, “Prediction of noisy chaotic time series using an optimal radial basis function neural network,” IEEE Transactions on Neural Networks, vol. 12, no. 5, pp. 1163 – 1172, 2001. [Online]. Available: http://dx.doi.org/10.1109/72.950144 [8] M. Cowper, B. Mulgrew, and C. Unsworth,
“Nonlinear prediction of chaotic signals using a normalised
radial basis function network,” Signal Processing, vol. 82, no. 5, pp. 775 – 789, 2002. [Online]. Available: http://dx.doi.org/10.1016/S0165-1684(02)00155-X [9] M. C. Choy, D. Srinivasan, and R. L. Cheu, “Neural networks for continuous online learning and control,” vol. 17,
no. 6,
http://dx.doi.org/10.1109/TNN.2006.881710
pp. 1511 – 1531,
2006. [Online]. Available:
IEEE Transactions on Neural Networks,
[10] H. Nishikawa and S. Ozawa, “Radial basis function network for multitask pattern recognition,” Neural Processing Letters, vol. 33, no. 3, pp. 283–299, 2011. [Online]. Available: http://dx.doi.org/10.1007/s11063-011-9178-9 [11] N. Wang,
J.-C. Sun,
M. J. Er,
and Y.-C. Liu,
“Hybrid recursive least squares algorithm for online
http://dx.doi.org/10.1016/j.neucom.2015.09.090 [12] J. Cao,
K. Zhang,
M. Luo,
C. Yin,
sequential identification using data chunks,” Neurocomputing, vol. 174, pp. 651 – 660, 2016. [Online]. Available:
and X. Lai,
“Extreme learning machine and adaptive sparse
representation for image classification,” Neural Networks, vol. 81, pp. 91 – 102, 2016. [Online]. Available: http://dx.doi.org/10.1016/j.neunet.2016.06.001
[13] J. Cao, W. Wei, J. Wang, and R. Wang, “Excavation equipment recognition based on novel acoustic statistical features,”
M
IEEE Transactions on Cybernetics, 2017.
[14] M. A. Sartori and P. J. Antsaklis, “A simple method to derive bounds on the size and to train multilayer
ED
neural networks,” IEEE Transactions on Neural Networks, vol. 2, no. 4, pp. 467 – 471, 1991. [Online]. Available: http://dx.doi.org/10.1109/72.88168
[15] S.-C. Huang and Y.-F. Huang, “Bounds on the number of hidden neurons in multilayer perceptrons,” IEEE Transactions
PT
on Neural Networks, vol. 2, no. 1, pp. 47 – 55, 1991. [Online]. Available: http://dx.doi.org/10.1109/72.80290 [16] S. Tamura and M. Tateishi, “Capabilities of a four-layered feedforward neural network:
four layers versus
CE
three,” IEEE Transactions on Neural Networks, vol. 8, no. 2, pp. 251 – 255, 1997. [Online]. Available: http://dx.doi.org/10.1109/72.557662 [17] G.-B. Huang and H. Babri, “Upper bounds on the number of hidden neurons in feedforward networks with arbitrary
AC
bounded nonlinear activation functions,” IEEE Transactions on Neural Networks, vol. 9, no. 1, pp. 224–229, 1998.
[18] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, 1986.
[19] M. H. Hagan and M. B. Menhaj, “Training feedforward networks with the marquardt algorithm,” IEEE Transactions on Neural Networks, vol. 5, no. 6, pp. 989 – 993, 1994. [Online]. Available: http://dx.doi.org/10.1109/72.329697 [20] B. M. Wilamowski and H. Yu, “Improved computation for levenberg-marquardt training,” IEEE Transactions on Neural Networks, vol. 21, no. 6, pp. 930 – 937, 2010. [Online]. Available: http://dx.doi.org/10.1109/TNN.2010.2045657 [21] M. F. Moller, “A scaled conjugate gradient algorithm for fast supervised learning,” Neural Networks, vol. 6, no. 4, pp. 525 – 533, 1993. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0893608005800565
24
ACCEPTED MANUSCRIPT [22] R. Battiti and F. Masulli, “Bfgs optimization for faster and automated supervised learning,” in International Neural Network Conference.
Springer Netherlands, 1990, pp. 757–760.
[23] K. Li, J.-X. Peng, and G. W. Irwin, “A fast nonlinear model identification method,” IEEE Transactions on Automatic Control, vol. 50, no. 8, pp. 1211 – 1216, 2005. [Online]. Available: http://dx.doi.org/10.1109/TAC.2005.852557 [24] P. Vincent and Y. Bengio, “Kernel matching pursuit,” Machine Learning, vol. 48, no. 1-3, pp. 165 – 187, 2002. [Online]. Available: http://dx.doi.org/10.1023/A:1013955821559 [25] V. Popovici, S. Bengio, and J.-P. Thiran, “Kernel matching pursuit for large datasets,” Pattern Recognition, vol. 38,
CR IP T
no. 12, pp. 2385 – 2390, 2005. [Online]. Available: http://dx.doi.org/10.1016/j.patcog.2005.01.021 [26] Q. Zhou, S. Song, C. Wu, and G. Huang, “Kernelized lars-lasso for constructing radial basis function neural networks,” Neural Computing and Applications, vol. 23, no. 7-8, pp. 1969 – 1976, 2013. [Online]. Available: http://dx.doi.org/10.1007/s00521-012-1189-6
[27] G. Castellano, A. M. Fanelli, and M. Pelillo, “Iterative pruning algorithm for feedforward neural networks,” IEEE Transactions on Neural Networks, vol. 8, no. 3, pp. 519 – 531, 1997. [Online]. Available:
AN US
//dx.doi.org/10.1109/72.572092
http:
[28] R. Setiono, “Penalty-function approach for pruning feedforward neural networks,” Neural Computation, vol. 9, no. 1, pp. 185 – 185, 1997.
[29] X. Hong and S. Billings, “Givens rotation based fast backward elimination algorithm for rbf neural network pruning,” Control Theory and Applications, vol. 144, no. 5, pp. 381 – 384, 1997. [Online]. Available:
http://dx.doi.org/10.1049/ip-cta:19971436
M
IEE Proceedings:
[30] D. M. L. Barbato and O. Kinouchi, “Optimal pruning in neural networks,” Physical Review E - Statistical Physics,
ED
Plasmas, Fluids, and Related Interdisciplinary Topics, vol. 62, no. 6 B, pp. 8387 – 8394, 2000. [Online]. Available: http://dx.doi.org/10.1103/PhysRevE.62.8387
[31] T. D. Jorgensen, B. P. Haynes, and C. C. F. Norlund, “Pruning artificial neural networks using neural complexity
PT
measures,” International Journal of Neural Systems, vol. 18, no. 5, pp. 389 – 403, 2008. [Online]. Available: http://dx.doi.org/10.1142/S012906570800166X Qiao,
Y.
Zhang,
and
CE
[32] J.-f.
H.-g.
Han,
“Fast
unit
pruning
algorithm
for
feedforward
neural
network
design,” Applied Mathematics and Computation, vol. 205, no. 2, pp. 622 – 627, 2008. [Online]. Available:
AC
http://dx.doi.org/10.1016/j.amc.2008.05.049 [33] Y.-P. Zhao, J.-G. Sun, Z.-H. Du, Z.-A. Zhang, and H.-B. Zhang, “Pruning least objective contribution in kmse,” Neurocomputing, vol. 74, no. 17, pp. 3009 – 3018, 2011. [Online]. Available: http://dx.doi.org/10.1016/j.neucom.2011. 04.004
[34] P. B. Nair, A. Choudhury, and A. J. Keane, “Some greedy learning algorithms for sparse regression and classification with mercer kernels,” Journal of Machine Learning Research, vol. 3, no. 4-5, pp. 781 – 801, 2003. [35] A. Sherstinsky and R. W. Picard, “On the efficiency of the orthogonal least squares training method for radial basis function networks,” IEEE Transactions on Neural Networks, vol. 7, no. 1, pp. 195 – 200, 1996. [Online]. Available: http://dx.doi.org/10.1109/72.478404 [36] V. N. Vapnik, The Nature of Statistical Learning Theory.
25
New York, NY, USA: Springer-Verlag New York, Inc., 1995.
ACCEPTED MANUSCRIPT [37] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273 – 273, 1995. [Online]. Available: http://dx.doi.org/10.1023/A:1022627411411 [38] V. Vapnik, “An overview of statistical learning theory,” IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 988–999, 1999. [39] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme learning machine: A new learning scheme of feedforward neural networks,” in Proceedings of IEEE International Conference on Neural Networks, vol. 2, Budapest, Hungary, 2004, pp. 985 – 990. [Online]. Available: http://dx.doi.org/10.1109/IJCNN.2004.1380068
CR IP T
[40] ——, “Extreme learning machine: theory and applications,” Neurocomputing, vol. 70, no. 1-3, pp. 489 – 501, 2006. [Online]. Available: http://dx.doi.org/10.1016/j.neucom.2005.12.126
[41] G.-B. Huang, L. Chen, and C.-K. Siew, “Universal approximation using incremental constructive feedforward networks with random hidden nodes,” IEEE Transactions on Neural Networks, vol. 17, no. 4, pp. 879 – 892, 2006. [Online]. Available: http://dx.doi.org/10.1109/TNN.2006.875977
[42] C. J. Burges and B. Scholkopf, “Improving the accuracy and speed of support vector machines,” in Advances in Neural MIT Press, 1997, pp. 375–381.
AN US
Information Processing Systems.
[43] M. Pontil and A. Verri, “Properties of support vector machines,” Neural Computation, vol. 10, no. 4, pp. 955 – 955, 1998. [Online]. Available: http://dx.doi.org/10.1162/089976698300017575
[44] Q. Li, L. Jiao, and Y. Hao, “Adaptive simplification of solution for support vector machine,” Pattern Recognition, vol. 40, no. 3, pp. 972 – 980, 2007. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0031320306003153 R.-C. Chen,
and X. Guo,
“Pruning support vector machines without altering performances,”
M
[45] X. Liang,
IEEE Transactions on Neural Networks,
vol. 19,
no. 10,
pp. 1792 – 1803,
2008. [Online]. Available:
ED
http://dx.doi.org/10.1109/TNN.2008.2002696
[46] X. Liu, C. Gao, and P. Li, “A comparative analysis of support vector machines and extreme learning machines,”
S0893608012001086
PT
Neural Networks, vol. 33, pp. 58 – 66, 2012. [Online]. Available: http://www.sciencedirect.com/science/article/pii/
[47] G.-B. Huang and L. Chen, “Convex incremental extreme learning machine,” Neurocomputing, vol. 70, no. 16C18, pp.
CE
3056 – 3062, 2007. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925231207000677 [48] ——, “Enhanced random search based incremental extreme learning machine,” Neurocomputing, vol. 71, no. 16-18, pp.
AC
3460 – 3468, 2008. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925231207003633 [49] Y.-P.
Zhao,
Z.-Q.
Li,
P.-P.
Xi,
D.
incremental extreme learning machine,”
Liang,
L.
Sun,
Neurocomputing,
and
T.-H.
vol. 241,
Chen,
“Gramschmidt
pp. 1 – 17,
process
based
2017. [Online]. Available:
http://dx.doi.org/10.1016/j.neucom.2017.01.049
[50] G. Feng, G.-B. Huang, Q. Lin, and R. Gay, “Error minimized extreme learning machine with growth of hidden nodes and incremental learning,” IEEE Transactions on Neural Networks, vol. 20, no. 8, pp. 1352–1357, 2009. [51] N. Wang, M. J. Er, and M. Han, “Parsimonious extreme learning machine using recursive orthogonal least squares,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 10, pp. 1828 – 1841, 2014. [Online]. Available: http://dx.doi.org/10.1109/TNNLS.2013.2296048
26
ACCEPTED MANUSCRIPT [52] Y.-P. Zhao,
K.-K. Wang,
orthogonal transformation,”
and Y.-B. Li, Neurocomputing,
“Parsimonious regularized extreme learning machine based on vol. 156,
pp. 280 – 296,
2015. [Online]. Available:
http:
//dx.doi.org/10.1016/j.neucom.2014.12.046 [53] Y.-P. Zhao and R. Huerta, “Improvements on parsimonious extreme learning machine using recursive orthogonal least squares,” Neurocomputing, vol. 191, pp. 82 – 94, 2016. [Online]. Available: http://dx.doi.org/10.1016/j.neucom.2016.01. 005 [54] H.-J. Rong, Y.-S. Ong, A.-H. Tan, and Z. Zhu, “A fast pruned-extreme learning machine for classification problem,” Neurocomputing, vol. 72, no. 1-3, pp. 359 – 366, 2008. [Online]. Available: http://www.sciencedirect.com/science/
CR IP T
article/pii/S0925231208000738
[55] Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, and A. Lendasse, “Op-elm: Optimally pruned extreme learning machine,” IEEE Transactions on Neural Networks, vol. 21, no. 1, pp. 158–162, 2010.
[56] T. Simila and J. Tikka, “Multiresponse sparse regression with application to multidimensional scaling,” Warsaw, Poland, 2005, pp. 97 – 102.
Systems Engineering and Electronics,
vol. 25,
//dx.doi.org/10.1109/JSEE.2014.000103
AN US
[57] Y.-P. Zhao and K.-K. Wang, “Fast cross validation for regularized extreme learning machine,” Journal of no. 5,
pp. 895 – 900,
2013. [Online]. Available:
http:
[58] Y.-P. Zhao, B. Li, and Y.-B. Li, “An accelerating scheme for destructive parsimonious extreme learning machine,” Neurocomputing, vol. 167, pp. 671 – 687, 2015. [Online]. Available: http://dx.doi.org/10.1016/j.neucom.2015.04.002
M
[59] N. Wang, M. J. Er, and M. Han, “Generalized single-hidden layer feedforward networks for regression problems,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 6, pp. 1161 – 1176, 2015. [Online]. Available:
ED
http://dx.doi.org/10.1109/TNNLS.2014.2334366
[60] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, “Extreme learning machine for regression and multiclass classification,” IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, no. 2, pp. 513 – 529, 2012. [Online].
PT
Available: http://dx.doi.org/10.1109/TSMCB.2011.2168604 Beijing, China: Tsinghua University Press, 2004.
AC
CE
[61] X. Zhang, Matrix Analysis and Applications.
27
ACCEPTED MANUSCRIPT
Yong-Ping Zhao received his B.E. degree in thermal energy and power engineering from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in July 2004, where he then pursued the M.S. and Ph.D. degrees. He received the Ph.D. degree in December 2009, and his dissertation was nominated for the National Excellent Doctoral Dissertation Award of China in 2013. He is currently a professor with the College of Energy and Power Engineering, Nanjing University of Aeronautics and Astronautics. His research interests include aircraft engine modeling, control and fault diagnostics, machine learning, and pattern recognition.

Ying-Ting Pan received her bachelor's degree in thermal energy and power engineering from Southeast University, China, in July 2016. She won the Meritorious Winner award of COMAP's Mathematical Contest in Modeling in 2015. After graduation, she worked as an assistant engineer at Jiangsu Blue Sky Technology Co., Ltd. for one year. From September 2017, she will pursue the master's degree at Nanjing University of Aeronautics and Astronautics, majoring in control engineering.

Fang-Quan Song received the B.S. degree from Nanjing University of Aeronautics and Astronautics in 2017, where he is currently pursuing the M.S. degree. His research interests include aircraft engine modeling, control, fault diagnostics, machine learning, and pattern recognition.

Liguo Sun received his B.Sc. and M.Sc. degrees from Nanjing University of Aeronautics and Astronautics, China, in 2008 and 2010, respectively. On October 30, 2014, he received his Ph.D. degree from the Control and Simulation Group, Faculty of Aerospace Engineering, TU Delft. Since June 2015, he has been an associate professor with the Flight Dynamics and Flight Safety group, School of Aeronautic Science and Engineering, Beihang University. His current research interests include nonlinear system identification, machine learning, multivariate spline theory, fault-tolerant (nonlinear) flight control, aircraft safe-flight-envelope prediction, and propulsion control.

Ting-Hao Chen received his B.S. degree in thermal energy and power engineering from Civil Aviation Flight University of China, Guanghan, China, in 2007, and his M.S. degree in aero-engine fault diagnosis from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2010. His research interests include aero-engine fault diagnosis, machine learning, etc.
Figure 3 Experimental results for Sml2010 with the Sigmoid activation function: (a) p = 1; (b) p = 2. Both panels plot RMSE against the number of selected features (#Feature) for GELM, FFS-GELM, and BFS-GELM.
Figure 4 Experimental results for Sml2010 with the RBF activation function: (a) p = 1; (b) p = 2. Both panels plot RMSE against the number of selected features (#Feature) for GELM, FFS-GELM, and BFS-GELM.