Feature selection of generalized extreme learning machine for regression problems
Yong-Ping Zhao^a, Ying-Ting Pan^a, Fang-Quan Song^a, Liguo Sun^{b,†}, Ting-Hao Chen^c

^a Jiangsu Province Key Laboratory of Aerospace Power Systems, College of Energy and Power Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
^b School of Aeronautic Science and Engineering, Beihang University, Beijing 100091, China
^c Guangdong Maritime Safety Administration, Guangzhou 510260, China
Abstract
Recently a generalized single-hidden layer feedforward network was proposed, which is an extension of the original extreme learning machine (ELM). Different from the traditional ELM, this generalized ELM (GELM) utilizes the p-order reduced polynomial functions of complete input features as output weights. According to the empirical results, there may be insignificant or redundant input features used to construct the p-order reduced polynomial function as output weights in GELM. However, to date there has been no work on selecting appropriate input features for constructing the output weights of GELM. Hence, in this paper two greedy learning algorithms, i.e., a forward feature selection algorithm (FFS-GELM) and a backward feature selection algorithm (BFS-GELM), are first proposed to tackle this issue. To reduce the computational complexity, an iterative strategy is used in FFS-GELM, and its convergence is proved. In BFS-GELM, a decreasing iteration is applied to decay this model, and in this process an accelerating scheme is proposed to speed up the computation of removing the insignificant or redundant features. To show the effectiveness of the proposed FFS-GELM and BFS-GELM, twelve benchmark data sets are employed in the experiments. The results demonstrate that both FFS-GELM and BFS-GELM can select appropriate input features to construct the p-order reduced polynomial function as output weights for GELM. FFS-GELM and BFS-GELM enhance the generalization performance and simultaneously reduce the testing time compared to the original GELM. BFS-GELM works better than FFS-GELM in terms of the sparsity ratio, the testing time and the training time. However, it slightly loses the advantage in the generalization performance over FFS-GELM.

Key words: single hidden layer feedforward network; extreme learning machine; feature selection; greedy learning; iterative updating
† Corresponding author. E-mail: [email protected]
1 Introduction
In the field of neural networks, feedforward neural networks (FNNs) have been investigated thoroughly in the past two decades due to their excellent supervised learning performance. Single hidden layer feedforward neural networks (SLFNs) are the simplest FNNs with only one hidden layer. Typical examples include multilayer perceptrons with one hidden layer and radial basis function networks, which mathematically consist of a linear combination of some basis functions. Because of their simple form and outstanding approximation capabilities [1–3], SLFNs are popular in many applications such as pattern recognition, signal processing, time-series prediction, nonlinear system modeling and control, and so on [4–13]. It has already been proven that a SLFN with N − 1 hidden nodes can give any N input-target relations exactly [14, 15]. In [16], the same conclusion that a SLFN with N − 1 sigmoidal hidden nodes can learn N distinct samples without error was confirmed again. Subsequently, Huang et al. [17] extended this result and rigorously proved that a SLFN with at most N hidden nodes and with any bounded nonlinear activation function which has a limit at one infinity can learn N distinct samples with zero error.

Generally speaking, to obtain a good SLFN there are two subproblems to be solved: optimising the tunable parameters and constructing the optimal network architecture. On one side, since backpropagation was proposed [18], a large number of gradient-based optimization methods, such as the Levenberg-Marquardt algorithm [19, 20], the scaled conjugate gradient algorithm [21], the BFGS algorithm [22], and so forth, were investigated to search for the optimal parameters of FNNs. On the other side, the greedy learning algorithms, including forward learning and backward learning, are commonly utilized to determine appropriate and compact architectures for SLFNs. The forward learning algorithms start with a small initial network and gradually add new hidden nodes until a satisfactory architecture is obtained. The forward orthogonal least squares method [4], fast recursive algorithm [23], kernel matching pursuit [24, 25], and least angle regression [26] belong to this class of algorithms. The backward learning algorithms start by training a larger than necessary network and then delete one hidden node at a time, namely the one that is least significant in terms of the reduction of the user-defined cost function. The pruning algorithms [27–33] are typical members. Generally, the forward learning algorithms tend to be more popular than the backward learning algorithms since they are computationally cheap and tend to have modest memory requirements. In contrast, the backward learning algorithms are less efficient with respect to numerical operations. But from a theoretical point of view, the backward algorithms have provable convergence properties while the forward learning algorithms offer no such known guarantees [34]. However, both the forward and backward learning algorithms may not be optimal [35].

In the learning community of SLFNs, recently there have been two sparkling members: the support vector machine (SVM) [36–38] and the extreme learning machine (ELM) [39–41]. For SVM, there are no parameters from the hidden layer to be tuned. We only require to optimise the parameters from the output layer. Although the optimisation of the tunable parameters begins with all the training samples, only a small portion of the training samples, referred to as support vectors, is found to construct the final model. SVM is a sparse machine learning algorithm in theory, but the sparsity of the solution is not as good as what we expect, which greatly blocks its practical use. Investigations [42–44] illustrated that the support vectors of SVMs are sometimes swollen, and dispensable support vectors exist, so some algorithms [42, 45] were proposed to optimise the solution of SVM further.
As another special type of SLFN, the input weights and biases of hidden nodes in ELM are generated randomly, and its output weights are analytically determined by solving a linear system. Compared with the conventional gradient-based learning algorithms, the training speed of ELM is thousands of times faster [40]. Additionally, ELM has weaker generalization ability than SVM for small samples but can generalize as well as SVM for large samples, and it gains an overwhelming superiority over SVM in computational speed, especially for large scale problems [46]. For the original ELM, there still exist a number of efforts on optimising its network architecture. An incremental ELM [41] and its improvements [47–49] were proposed in order to achieve a more compact network architecture. Moreover, an error-minimization-based method [50] was presented to grow hidden nodes for ELM one by one or group by group, and output weights are incrementally updated, which significantly reduces the computational complexity. Recently, the constructive parsimonious ELMs [51–53] were developed based on the orthogonal transformation. These aforementioned algorithms fall within the range of the forward learning algorithms. Likewise, there are backward learning algorithms to compact the ELM network. To address classification problems, a pruned-ELM was proposed [54], which begins from an initial large number of hidden nodes, and then the irrelevant nodes are pruned by considering their relevance to the class labels. Subsequently, an optimally pruned ELM methodology was presented [55], in which an original ELM is initially constructed, then the hidden nodes are ranked with the multi-response sparse regression technique [56], and the number of hidden nodes is finally decided by leave-one-out cross validation [57]. In addition, there are corresponding destructive methods of constructing parsimonious ELMs [51, 58].

Recently, a generalized SLFN (GSLFN) was investigated [59]. In the original ELM, output weights are constant, which are irrelevant to the sample inputs. In contrast, GSLFN uses the p-order reduced polynomial functions of inputs as output weights as well as any infinitely differentiable activation function as hidden nodes. Evidently, when p = 0 GSLFN degenerates into the original ELM. That is to say, the original ELM is a special case of GSLFN. Extensive experiments showed that the flexible polynomial order p renders GSLFN with a considerably compact architecture, yielding great superiority to the original ELM in terms of generalization and learning speed. In our opinion, the SLFN proposed in [59] is only an extension of the original ELM, so in this paper we name it the generalized ELM (GELM).

According to the obtained conclusion, we know that the generalization performance of the original ELM is not sensitive to the number of hidden nodes and good performance can be reached as long as the number of hidden nodes is large enough [60]. Although a large number of hidden nodes can guarantee that ELM gains good performance, its testing time is affected by this large number of hidden nodes. That is to say, more hidden nodes usually signify more testing time. The large number of hidden nodes in the ELM will impair its widespread application in scenarios where the testing time is strictly required. Hence, a lot of the algorithms stated previously were derived to make the network architecture of ELM more compact. In [59], experimental results demonstrate that GELM usually needs fewer hidden nodes but performs better than the original ELM in terms of generalization and learning speed. However, in GELM the output weights are closely related to the sample inputs, i.e., the sample features. The more features involved in the p-order reduced polynomial function as output weights, the worse the testing time becomes, just as with more hidden nodes. In addition, it has been proved that the generalization performance of GELM can be further enhanced by considering partial input
features to construct the p-order reduced polynomial function as output weights instead of complete input features, because there exist insignificant or redundant input features [59]. There are two benefits of removing the insignificant or redundant features when constructing the p-order reduced polynomial function as output layer weights: reducing the testing time and enhancing the generalization performance. It is therefore very important to select appropriate features to construct the p-order reduced polynomial function as output weights for GELM. In the naive GELM, the p-order reduced polynomial function of complete input features is chosen as output layer weights. Therefore, in this paper we first tend to make some contributions to the feature selection of the p-order reduced polynomial function in GELM.
Enlightened by the experience above on compacting the network architecture with greedy learning algorithms, here the greedy learning strategies are extended to select appropriate input features to construct the p-order reduced polynomial function as output weights for GELM. As a result, two important algorithms are obtained to select appropriate features to construct the p-order reduced polynomial function as output weights for GELM, shown as follows.

(1) The forward learning algorithm for feature selection of GELM (FFS-GELM): FFS-GELM begins with an original ELM, and then it gradually grows input features to construct the p-order reduced polynomial function as output weights according to some criterion. It does not terminate until the stopping criterion is satisfied. To reduce the training computational burden, an iterative strategy is utilized in FFS-GELM.

(2) The backward learning algorithm for feature selection of GELM (BFS-GELM): Contrary to FFS-GELM, BFS-GELM starts with a full model, i.e., a GELM model with the p-order reduced polynomial function of complete input features as output weights as in [59]. Then it removes the insignificant or redundant features from the p-order reduced polynomial function one by one until the required performance is obtained. Similar to FFS-GELM, BFS-GELM employs the iterative strategy to shrink the model gradually in order to cut down the computational complexity. Moreover, during the process of computing the criterion for removing the insignificant or redundant features, a scheme is proposed to relieve the computational burden.

To confirm the usefulness of the proposed FFS-GELM and BFS-GELM, experimental evaluations over a range of benchmark data sets are carried out. From the reported results, the effectiveness of FFS-GELM and BFS-GELM is favored.
The rest of this paper is organized as follows. In section 2, the original ELM and GELM are introduced. In section 3, FFS-GELM is proposed to select appropriate features for the original GELM gradually. In order to reduce the computational complexity, an iterative scheme is applied to accelerate FFS-GELM, and its convergence is proved. In the following section, BFS-GELM is developed, and a scheme is presented to speed up its implementation. To confirm the effectiveness and feasibility of the proposed FFS-GELM and BFS-GELM, twelve benchmark data sets are employed to do experiments in section 5. Finally, conclusions follow.
2 ELM and GELM

2.1 ELM
Given a set of training samples $\{(x_k, t_k)\}_{k=1}^{N}$, where $x_k \in \mathbb{R}^{n}$ and $t_k \in \mathbb{R}^{m}$, the output of a standard SLFN with $L$ hidden nodes is

$$y_k = \sum_{i=1}^{L} \beta_i h(a_i, b_i; x_k) \tag{1}$$

where $h(a_i, b_i; x_k)$ denotes the output of the $i$th hidden node with the hidden-node parameters $(a_i, b_i) \in \mathbb{R}^{n} \times \mathbb{R}$, and $\beta_i \in \mathbb{R}^{m}$ is the output weight connecting the $i$th hidden node with the output nodes. If the SLFN fits the $N$ training samples exactly, these equations can be written compactly as the linear system

$$H\beta = T \tag{2}$$

where

$$H = \begin{bmatrix} h(a_1, b_1; x_1) & \cdots & h(a_L, b_L; x_1) \\ h(a_1, b_1; x_2) & \cdots & h(a_L, b_L; x_2) \\ \vdots & \ddots & \vdots \\ h(a_1, b_1; x_N) & \cdots & h(a_L, b_L; x_N) \end{bmatrix} \tag{3}$$

$$\beta = \left[\beta_1, \cdots, \beta_L\right]^{T} \quad \text{and} \quad T = \left[t_1, \cdots, t_N\right]^{T} \tag{4}$$
Here, $H$ is called the hidden-layer output matrix of the network, in which the $i$th column is the output vector of the $i$th hidden node with respect to the inputs $\{x_i\}_{i=1}^{N}$, and the $j$th row is the output vector of the hidden nodes with respect to $x_j$; $\beta$ and $T$ are the corresponding matrices of output weights and targets, respectively. In ELM, the hidden-layer output matrix $H$ is randomly generated. That is, the input weights and the biases of the hidden nodes are randomly generated and they are independent of each other. Thus, training ELM simply amounts to getting the solution of the linear system (2) with respect to the output weight matrix $\beta$, specifically

$$\hat{\beta} = H^{\dagger} T \tag{5}$$

where $H^{\dagger}$ is the Moore-Penrose generalized inverse of $H$.
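As a concrete illustration of (1)–(5), the following NumPy sketch trains a basic ELM with randomly generated sigmoid hidden nodes and solves for the output weights with the Moore-Penrose pseudo-inverse. It is only an illustrative sketch under stated assumptions (the function names, the sigmoid choice and the sampling ranges are ours, not taken from the paper).

```python
import numpy as np

def train_elm(X, T, L=100, seed=0):
    """Basic ELM sketch: random sigmoid hidden layer, output weights by pseudo-inverse (eq. (5))."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    A = rng.uniform(-1.0, 1.0, size=(n, L))      # random input weights a_i
    b = rng.uniform(-1.0, 1.0, size=L)           # random biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))       # hidden-layer output matrix H (N x L)
    beta = np.linalg.pinv(H) @ T                 # beta_hat = H^+ T, eq. (5)
    return A, b, beta

def predict_elm(X, A, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    return H @ beta                              # eq. (1) in matrix form
```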
2.2 GELM
From equation (2), we know that the output weights in ELM are not explicitly related to the input features. In contrast, in GELM the output weights are defined as the $p$-order reduced polynomial functions of the input features $x_k = [x_{k1}, \cdots, x_{kn}]^{T}$, i.e.,

$$\beta_i^{T}(x_k) = \beta_{i00}^{T} + \sum_{r=1}^{p}\sum_{j=1}^{n} \beta_{ijr}^{T} x_{kj}^{r} = z_k^{T} B_i \tag{6}$$

$$z_k^{T} = [1, x_{k1}, \cdots, x_{kn}, \cdots, x_{k1}^{p}, \cdots, x_{kn}^{p}] \tag{7}$$

where $z_k \in \mathbb{R}^{M}$ and

$$B_i^{T} = [\beta_{i00}, \beta_{i11}, \cdots, \beta_{in1}, \ldots, \beta_{i1p}, \cdots, \beta_{inp}] \tag{8}$$

with $M = np + 1$ and $\beta_{ijr} \in \mathbb{R}^{m}$. The output of GELM for the input $x_k$ is then

$$y_k^{T} = \sum_{i=1}^{L} h(a_i, b_i; x_k)\, z_k^{T} B_i = h^{T}(x_k) \otimes z_k^{T}\, B \tag{9}$$

where $\otimes$ is the Kronecker product [61], and

$$B = \left[B_1^{T}, \cdots, B_L^{T}\right]^{T} \tag{10}$$

$$h^{T}(x_k) = [h(a_1, b_1; x_k), \cdots, h(a_L, b_L; x_k)] \tag{11}$$
If the outputs in (9) equal the targets, the following linear system is obtained in a compact form

$$GB = T \tag{12}$$

where $G \in \mathbb{R}^{N \times LM}$ is given by

$$G = \begin{bmatrix} h^{T}(x_1) \otimes z_1^{T} \\ \vdots \\ h^{T}(x_N) \otimes z_N^{T} \end{bmatrix} \tag{13}$$

Since the rank of $G^{T}G$ may not equal $LM$, the linear system (12) can be solved by the ridge regression estimator as follows

$$\hat{B} = (G^{T}G + \mu I)^{-1} G^{T} T \tag{14}$$

where $\mu$ is the ridge parameter. The universal approximation capability of (14) is proved in [59].
In theory, finding the solution of (14) amounts to optimizing the following problem

$$\min_{B} \left\{ J = \|B\|_F^2 + C\|E\|_F^2 \right\} \quad \text{s.t.} \quad GB = T - E \tag{15}$$

where $\|\cdot\|_F$ represents the Frobenius norm, and $C > 0$ is the user-specified ridge parameter controlling the tradeoff between the training error $E$ and the output weight norm. After eliminating $E$ from (15), we obtain (14) with $1/C$ instead of $\mu$. After obtaining $\hat{B}$, we can get GELM as follows

$$f(x_k) = h^{T}(x_k) \otimes z_k^{T}\, \hat{B} \tag{16}$$
where $\hat{B} \in \mathbb{R}^{LM \times m}$. In [59], however, little emphasis was put on the influence of input features on the testing time. More importantly, according to the experimental results in [59], it cannot be guaranteed that the generalization performance with complete input features employed to construct (7) is better than that with partial input features. Hence, it is necessary and important to select appropriate input features for (7) in GELM. Obviously, doing this will bring two benefits: reducing the testing time and enhancing the generalization performance. In this context, FFS-GELM and BFS-GELM are proposed.

Figure 1 The procedure of FFS-GELM. (Start: train an original ELM. Loop: select appropriate input features to construct the p-order reduced polynomial function as output weights according to some criterion. End: FFS-GELM is obtained when the stopping criterion is met.)

3 FFS-GELM
3.1 Preliminary Work
FFS-GELM is a forward learning algorithm, which starts with the original ELM. Then, it gradually selects features for (7) according to some criterion, and does not stop until the terminating criterion is reached. Figure 1 illustrates this procedure. In what follows, we will introduce it. Firstly, we rearrange (7) as

$$z_k^{T} = [1, x_{k1}, \cdots, x_{k1}^{p}, \cdots, x_{kn}, \cdots, x_{kn}^{p}] = [1, z_{k1}, \cdots, z_{kn}] \tag{17}$$

where $z_{ki} = [x_{ki}, \cdots, x_{ki}^{p}]$, $i = 1, \cdots, n$. Accordingly, equation (13) is rewritten as

$$G = \begin{bmatrix} h^{T}(x_1) \otimes z_1^{T} \\ \vdots \\ h^{T}(x_N) \otimes z_N^{T} \end{bmatrix} = \begin{bmatrix} h^{T}(x_1), & h^{T}(x_1) \otimes z_{11}, & \cdots, & h^{T}(x_1) \otimes z_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ h^{T}(x_N), & h^{T}(x_N) \otimes z_{N1}, & \cdots, & h^{T}(x_N) \otimes z_{Nn} \end{bmatrix} = \left[G_0, G_1, \cdots, G_n\right] \tag{18}$$

where $G_0 = H$ and $G_i = \begin{bmatrix} h^{T}(x_1) \otimes z_{1i} \\ \vdots \\ h^{T}(x_N) \otimes z_{Ni} \end{bmatrix}$, $i = 1, \cdots, n$. Correspondingly, the output weight matrix $B$ is reorganized as

$$B = \left[B_0^{T}, B_1^{T}, \cdots, B_n^{T}\right]^{T} \tag{19}$$

Here, $G_i$ and $B_i$ correspond to the $i$th feature of the inputs. Hence, the feature selection of GELM is equivalent to choosing $G_i$'s from $G$. Then, define an empty index set $P = \emptyset$ and a full index set $Q = \{1, \cdots, n\}$. Let $G_P = G_0$, $B_P = B_0$, and initialize the index $q$ with 0.
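The partitioning (17)–(19) can be made concrete with a short sketch that builds $G_0 = H$ and the per-feature blocks $G_i$ via row-wise Kronecker products. The helper names and the array layout below are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def feature_blocks(X, H, p):
    """Build G0 = H and the per-feature blocks G_i of eq. (18): the k-th row of G_i is
    the Kronecker product of h^T(x_k) with z_ki = [x_ki, ..., x_ki^p]."""
    N, n = X.shape
    powers = np.stack([X ** r for r in range(1, p + 1)], axis=2)      # N x n x p
    G0 = H
    Gi = []
    for i in range(n):
        Zi = powers[:, i, :]                                          # rows z_ki, N x p
        Gi.append((H[:, :, None] * Zi[:, None, :]).reshape(N, -1))    # row-wise Kronecker, N x (L*p)
    return G0, Gi

def assemble_GP(G0, Gi, P):
    """G_P: column-wise concatenation of G0 and the selected blocks (P is an index set)."""
    return np.hstack([G0] + [Gi[i] for i in sorted(P)])
```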
3.2 The Criterion of Feature Selection
At the $q$th iteration, similar to equation (15) we get

$$\min_{B_P} \left\{ J^{(q)} = \|B_P\|_F^2 + C\|T - G_P B_P\|_F^2 \right\} \tag{20}$$

where $G_P$ and $B_P$ are submatrices of $G$ and $B$, respectively, whose indexes are confined to the index set $P$. Letting $\frac{dJ^{(q)}}{dB_P} = 0$ yields

$$\hat{B}_P = (G_P^{T} G_P + I/C)^{-1} G_P^{T} T \tag{21}$$

If $G_i$ is selected at the $(q+1)$th iteration, we have

$$\min_{B_P, B_i} \left\{ J_i^{(q+1)} = \left\| \begin{bmatrix} B_P \\ B_i \end{bmatrix} \right\|_F^2 + C \left\| T - \left[G_P, G_i\right] \begin{bmatrix} B_P \\ B_i \end{bmatrix} \right\|_F^2 \right\} \tag{22}$$
Here we make use of the backfitting strategy [24] to compute (22). That is, $B_P$ in (22) is frozen with the value $\hat{B}_P$ of (21) at the $q$th iteration. Substituting $\hat{B}_P$ into (22), it is transformed to

$$\min_{B_i} \left\{ J_i^{(q+1)} = J_{\mathrm{const}} + C\|G_i B_i\|_F^2 + \|B_i\|_F^2 - 2C\,\mathrm{tr}\left( (T - G_P \hat{B}_P)^{T} G_i B_i \right) \right\} \tag{23}$$

where $\mathrm{tr}(\cdot)$ represents the trace of a square matrix and $J_{\mathrm{const}}$ is a constant part given by

$$J_{\mathrm{const}} = C\left\|T - G_P \hat{B}_P\right\|_F^2 + \left\|\hat{B}_P\right\|_F^2 \tag{24}$$

Letting $\frac{dJ_i^{(q+1)}}{dB_i} = 0$ gives

$$\hat{B}_i = (G_i^{T} G_i + I/C)^{-1} G_i^{T} (T - G_P \hat{B}_P) \tag{25}$$

Substituting (25) into (23), we get

$$\hat{J}_i^{(q+1)} = J_{\mathrm{const}} - C\,\mathrm{tr}\left( (T - G_P \hat{B}_P)^{T} G_i \hat{B}_i \right) \tag{26}$$

Hence, define

$$\Delta_i^{(q+1)} = \hat{J}_i^{(q+1)} - J_{\mathrm{const}} = -C\,\mathrm{tr}\left( (T - G_P \hat{B}_P)^{T} G_i \hat{B}_i \right) \tag{27}$$

Theorem 1 $\Delta_i^{(q+1)} \leq 0$ holds.

Proof Since $G_i^{T} G_i + I/C$ is a positive definite matrix, $(G_i^{T} G_i + I/C)^{-1}$ is also a positive definite matrix. Based on matrix theory, the following factorization exists

$$(G_i^{T} G_i + I/C)^{-1} = \Lambda^2 \tag{28}$$

where $\Lambda$ is also a positive definite matrix. Substituting (25) and (28) into (27), it can be expressed as

$$\Delta_i^{(q+1)} = -C\left\| \Lambda G_i^{T} (T - G_P \hat{B}_P) \right\|_F^2 \tag{29}$$

Hence, $\Delta_i^{(q+1)} \leq 0$. This completes the proof.
In fact, $\Delta_i^{(q+1)}$ indicates the reduction of the cost function of (22) incurred by the $i$th feature at the $(q+1)$th iteration. The smaller $\Delta_i^{(q+1)}$ is, the more important the $i$th feature is for constructing the $p$-order reduced polynomial function as output weights of GELM. Thus, we can find the index of the feature to be selected by

$$s = \arg\min_{i \in Q} \left\{ \Delta_i^{(q+1)} \right\} \tag{30}$$
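A minimal sketch of this selection step, assuming the residual form of (25) and (27): for every candidate feature the backfitted block weights and the corresponding $\Delta_i^{(q+1)}$ are computed, and the index with the smallest value is returned as in (30). Function and variable names are illustrative.

```python
import numpy as np

def select_feature(T, G_P, B_P, Gi_list, Q, C):
    """Backfitting selection: evaluate Delta_i^(q+1) of eq. (27) via eq. (25)
    for each candidate i in Q and return the minimiser, as in eq. (30)."""
    R = T - G_P @ B_P                     # residual with the frozen B_P of eq. (21)
    best_i, best_delta = None, np.inf
    for i in Q:
        Gi = Gi_list[i]
        Bi = np.linalg.solve(Gi.T @ Gi + np.eye(Gi.shape[1]) / C, Gi.T @ R)   # eq. (25)
        delta = -C * np.trace(R.T @ Gi @ Bi)                                  # eq. (27)
        if delta < best_delta:
            best_i, best_delta = i, delta
    return best_i, best_delta
```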
3.3 Iteratively Updating
After determining the $s$th feature to construct the $p$-order reduced polynomial function, we need to update $\hat{B}_{P\cup\{s\}}$, which is employed in (27) for selecting the next feature if FFS-GELM does not stop. However, if classical techniques, e.g., Gaussian elimination, are used to compute the matrix inverse in (21) from scratch, the computational cost is $O\left(L^2 q^2 p^2 (Lqp + N)\right)$. Hence, we develop an iterative computation of the inverse matrix in (21). Firstly, we let

$$R^{(q+1)} = \left(G_{P\cup\{s\}}^{T} G_{P\cup\{s\}} + I/C\right)^{-1} = \begin{bmatrix} G_P^{T} G_P + I/C & G_P^{T} G_s \\ G_s^{T} G_P & G_s^{T} G_s + I/C \end{bmatrix}^{-1} \tag{31}$$

Given that $R^{(q)} = (G_P^{T} G_P + I/C)^{-1}$ has already been computed at the $q$th iteration, according to the Sherman-Morrison formula [61] $R^{(q+1)}$ can be updated at a cost of $O\left(L^2 q^2 p^2 (Lp + N)\right)$:

$$R^{(q+1)} = \begin{bmatrix} R^{(q)} & 0 \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} U D U^{T} & -U D \\ -D U^{T} & D \end{bmatrix} \tag{32}$$

where the $0$'s are zero matrices of appropriate dimension,

$$U = R^{(q)} G_P^{T} G_s \tag{33}$$

$$D = \left(G_s^{T} G_s + I/C - G_s^{T} G_P U\right)^{-1} \tag{34}$$

Usually $q > 1$, so the computational cost is mitigated. After obtaining $R^{(q+1)}$, $\hat{B}_{P\cup\{s\}}$ is given by

$$\hat{B}_{P\cup\{s\}} = R^{(q+1)} \begin{bmatrix} G_P^{T} \\ G_s^{T} \end{bmatrix} T \tag{35}$$

Meanwhile, let $P \leftarrow P \cup \{s\}$, $Q \leftarrow Q\backslash\{s\}$, and $q \leftarrow q + 1$.
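The block update (32)–(35) can be sketched as follows; it forms the Schur complement of the new block instead of re-inverting $G_{P\cup\{s\}}^{T} G_{P\cup\{s\}} + I/C$ from scratch. This is a schematic rendering under the same notation, not the authors' code.

```python
import numpy as np

def grow_inverse(R_q, G_P, G_s, C):
    """Update R^(q+1) of eq. (31) from R^(q) via the block form (32)-(34)."""
    U = R_q @ (G_P.T @ G_s)                                            # eq. (33)
    D = np.linalg.inv(G_s.T @ G_s + np.eye(G_s.shape[1]) / C
                      - G_s.T @ G_P @ U)                               # eq. (34), Schur complement
    top = np.hstack([R_q + U @ D @ U.T, -U @ D])
    bottom = np.hstack([-(U @ D).T, D])
    return np.vstack([top, bottom])                                    # eq. (32)

def grow_weights(R_next, G_P, G_s, T):
    """B_hat_{P union {s}} of eq. (35)."""
    return R_next @ np.vstack([G_P.T, G_s.T]) @ T
```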
3.4 The Stopping Criterion for FFS-GELM
For a greedy learning algorithm, it is necessary to determine an appropriate stopping criterion. Here, FFS-GELM will terminate if

$$\frac{\left\|T - G_P \hat{B}_P\right\|_F^2}{\|T\|_F^2} \leq \epsilon \tag{36}$$

is satisfied, where $\epsilon$ is a small positive number. When equation (36) is met, it demonstrates that the targets $T$ have already been explained very well within the tolerated error, which can avoid overfitting and enhance the performance. That is, the generalization performance is enhanced, and the $p$-order reduced polynomial function of appropriate features is selected as the output weights of GELM, which usually means less testing time. If $\epsilon = 0$, FFS-GELM degenerates into the original GELM. Hence, the original GELM can be considered as a special case of FFS-GELM with $\epsilon = 0$. In addition, a nonnegative integer $n_{\max}$ ($0 \leq n_{\max} \leq n$) can be predefined according to our requirements. When the number of selected features reaches $n_{\max}$, FFS-GELM stops, in order to control the number of selected features conveniently and exactly.
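For completeness, the stopping test (36), together with the optional feature budget $n_{\max}$, amounts to a one-line relative-error check (a sketch; the exact tolerance handling is an assumption):

```python
import numpy as np

def ffs_should_stop(T, G_P, B_P, eps, q, n_max):
    """Stopping test of eq. (36) combined with the optional feature budget n_max."""
    err = np.linalg.norm(T - G_P @ B_P, 'fro') ** 2 / np.linalg.norm(T, 'fro') ** 2
    return err <= eps or q >= n_max
```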
3.5 The Flowchart of FFS-GELM Algorithm

The complete structure of the FFS-GELM algorithm is shown as Algorithm 1.

Algorithm 1: FFS-GELM algorithm
1: Initialize:
   • Input training samples $\{(x_k, t_k)\}_{k=1}^{N}$.
   • Choose the type of hidden nodes, and calculate the hidden layer output matrix $H$.
   • Determine the ridge parameter $1/C$.
   • Choose a small $\epsilon > 0$ or a nonnegative integer $n_{\max}$.
   • Let $P = \emptyset$, $Q = \{1, \cdots, n\}$, and $q = 0$.
   • Set $G_P = H$, $R^{(0)} = \left(G_P^{T} G_P + I/C\right)^{-1}$, and calculate $\hat{B}_P$ according to (21).
2: If equation (36) is satisfied or $q \geq n_{\max}$
3:   Go to step 11.
4: Else
5:   Determine the index $s$ according to (30).
6:   Update $R^{(q+1)}$ based on (32).
7:   Calculate $\hat{B}_{P\cup\{s\}}$ from (35).
8:   Let $P \leftarrow P \cup \{s\}$, $Q \leftarrow Q\backslash\{s\}$, and $q \leftarrow q + 1$.
9:   Go to step 2.
10: End
11: Output $y_k^{T} = h^{T}(x_k) \otimes [1, z_{kP}]\, \hat{B}_P$, where $z_{kP} = [z_{ki}, \cdots, z_{kj}]$, $i, j \in P$.
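Putting the pieces together, a driver loop mirroring Algorithm 1 might look as follows. It reuses the helper sketches given earlier (feature_blocks/assemble_GP, select_feature, grow_inverse, grow_weights, ffs_should_stop) and is therefore only runnable with those definitions in scope; it is a schematic rendering of Algorithm 1, not the authors' implementation.

```python
import numpy as np

def ffs_gelm(T, G0, Gi_list, C, eps=1e-3, n_max=None):
    """Schematic FFS-GELM driver following Algorithm 1."""
    n = len(Gi_list)
    n_max = n if n_max is None else n_max
    P, Q, q = [], list(range(n)), 0
    G_P = G0
    R = np.linalg.inv(G_P.T @ G_P + np.eye(G_P.shape[1]) / C)   # R^(0)
    B_P = R @ G_P.T @ T                                         # eq. (21) with P empty
    while not ffs_should_stop(T, G_P, B_P, eps, q, n_max):
        s, _ = select_feature(T, G_P, B_P, Gi_list, Q, C)       # eq. (30)
        R = grow_inverse(R, G_P, Gi_list[s], C)                 # eqs. (32)-(34)
        B_P = grow_weights(R, G_P, Gi_list[s], T)               # eq. (35)
        G_P = np.hstack([G_P, Gi_list[s]])
        P.append(s); Q.remove(s); q += 1
    return P, B_P
```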
3.6 Convergence
Theorem 2 The cost function $J$ in (15) monotonically decreases with an increasing number of selected features.
Proof Assume that FFS-GELM is at the $q$th iteration. Before the $s$th feature is included, the value of the cost function is

$$\hat{J}^{(q)} = \min_{B} \left\{ J = \|B\|_F^2 + C\|T - GB\|_F^2 \right\} \quad \text{s.t.} \quad B_Q = 0 \tag{37}$$

where $B_Q = \left[B_i^{T}, \cdots, B_j^{T}\right]^{T}$, $i, j \in Q$. When the $s$th feature is selected, the value of the cost function becomes

$$\hat{J}^{(q+1)} = \min_{B} \left\{ J = \|B\|_F^2 + C\|T - GB\|_F^2 \right\} \quad \text{s.t.} \quad B_{Q\backslash\{s\}} = 0 \tag{38}$$

Comparing (37) with (38), note that the constraint $B_s = 0$ in (37) does not appear in (38), which indicates that the feasible set of (37) is a subset of that of (38). Hence, we get

$$\hat{J}^{(q+1)} \leq \hat{J}^{(q)} \tag{39}$$

Now, the proof of Theorem 2 is completed.

4 BFS-GELM

BFS-GELM is a backward learning algorithm, which starts with the original GELM and removes the insignificant or redundant features one by one from the $p$-order reduced polynomial function in accordance with some criterion. This process stops when the defined terminating criterion is satisfied. Its procedure is shown in Figure 2. Define a full index set $P = \{1, \cdots, n\}$, and let $q = 0$. Then, BFS-GELM is initialized with $G_P = G$ in (18).

Figure 2 The procedure of BFS-GELM. (Start: train a full GELM. Loop: remove the insignificant or redundant features one by one from the p-order reduced polynomial function in accordance with some criterion. End: BFS-GELM is obtained when the stopping criterion is met.)

4.1 The Criterion of Removing Features
In BFS-GELM, the adopted criterion is similar to that of FFS-GELM. That is, when one feature is removed, the value of the cost function in (15) will change, so we can judge whether the corresponding feature is important or not according to the magnitude of this change. From Theorem 2, it is noted that the value of the cost function will increase with a decreasing number of features. Hence, if the increase of the cost function incurred by removing one feature is small, this feature is viewed as less significant, and vice versa. Different from FFS-GELM, where the most important features are selected, in BFS-GELM the least important feature is removed.
Now assume that BFS-GELM is at the $q$th iteration. The value of the cost function in (15) is denoted by

$$\hat{J}^{(q)} = \min_{B_P} \left\{ J^{(q)} = \|B_P\|_F^2 + C\|T - G_P B_P\|_F^2 \right\} \tag{40}$$

If the $i$th feature is removed at the $(q+1)$th iteration, it becomes

$$\hat{J}_{-i}^{(q+1)} = \min_{B_{P\backslash\{i\}}} \left\{ J_{-i}^{(q+1)} = \|B_{P\backslash\{i\}}\|_F^2 + C\|T - G_{P\backslash\{i\}} B_{P\backslash\{i\}}\|_F^2 \right\} \tag{41}$$

Hence, the index of the feature to be removed at the $(q+1)$th iteration can be obtained by

$$s = \arg\min_{i \in P} \left\{ \Gamma_{-i} = \hat{J}_{-i}^{(q+1)} - \hat{J}^{(q)} \right\} \tag{42}$$
To determine the $s$th feature to be removed, we need to solve the optimization problem (41) $n - q$ times. If each time the optimal solution $\hat{B}_{P\backslash\{i\}}$ in (41) is calculated using (21), the computational complexity is very high or even unacceptable. Hence, it is necessary to develop a method to accelerate the computation of $\Gamma_{-i}$ at each iteration.

4.2 An Accelerating Scheme
Assume that the index $i$ is the $j$th element in the set $P$. Then, consider the following optimization problem

$$\min_{B_P} \left\{ J_i^{(q)} = \|B_P\|_F^2 + \|\mathbf{W} B_P\|_F^2 + C\|T - G_P B_P\|_F^2 \right\} \tag{43}$$

where $\mathbf{W} = \mathrm{diag}\{0_1, \cdots, 0_{j-1}, W, 0_{j+1}, \cdots, 0_{|P|}\}$, $0_l$ is a zero matrix of size $Lp \times Lp$, $l = 1, \cdots, j-1, j+1, \cdots, |P|$, $|\cdot|$ represents the cardinality of a set, and $W$ is a diagonal matrix given by

$$W = \begin{bmatrix} w & & \\ & \ddots & \\ & & w \end{bmatrix}_{Lp \times Lp} \tag{44}$$

where $w > 0$. Letting $\frac{dJ_i^{(q)}}{dB_P} = 0$, we have

$$\tilde{B}_P = \left(I/C + \mathbf{W}^2/C + G_P^{T} G_P\right)^{-1} G_P^{T} T \tag{45}$$

Plugging (45) into (43) gives

$$\hat{J}_i^{(q)} = C\|T\|_F^2 - C\,\mathrm{tr}\left( T^{T} G_P \left(I/C + \mathbf{W}^2/C + G_P^{T} G_P\right)^{-1} G_P^{T} T \right) \tag{46}$$

Denoting $(G_P^{T} G_P + I/C)^{-1}$ by $A^{(q)}$, according to the Woodbury formula [61] we have

$$\left( (A^{(q)})^{-1} + \mathbf{W}^2/C \right)^{-1} = A^{(q)} - A^{(q)} \mathbf{W} \left(CI + \mathbf{W} A^{(q)} \mathbf{W}\right)^{-1} \mathbf{W} A^{(q)} \tag{47}$$

Considering (47) and (40), equation (46) is formulated as

$$\hat{J}_i^{(q)} = \hat{J}^{(q)} + C\,\mathrm{tr}\left( \hat{B}_P^{T} \mathbf{W} \left(CI + \mathbf{W} A^{(q)} \mathbf{W}\right)^{-1} \mathbf{W} \hat{B}_P \right) \tag{48}$$

where $\hat{B}_P$ is the optimal solution of (40).
Comparing (41) with (43), when $w$ goes to infinity, which is equivalent to removing the $i$th feature from (43), $J_i^{(q)}$ degenerates into $J_{-i}^{(q+1)}$. Hence

$$\hat{J}_{-i}^{(q+1)} = \lim_{w \to +\infty} \hat{J}_i^{(q)} \tag{49}$$

and

$$\Gamma_{-i} = \lim_{w \to +\infty} C\,\mathrm{tr}\left( \hat{B}_P^{T} \mathbf{W} \left(CI + \mathbf{W} A^{(q)} \mathbf{W}\right)^{-1} \mathbf{W} \hat{B}_P \right) \tag{50}$$

Equation (50) can be simplified further to

$$\Gamma_{-i} = C\,\mathrm{tr}\left( \hat{B}_i^{T} \left(A_{ii}^{(q)}\right)^{-1} \hat{B}_i \right) \tag{51}$$

where $\hat{B}_i$ and $A_{ii}^{(q)}$ are the submatrices of the optimal solution $\hat{B}_P$ of (40) and of $A^{(q)}$, respectively, corresponding to the $i$th feature.

Theorem 3 $\Gamma_{-i} \geq 0$ in (51) holds.

The proof of Theorem 3 is similar to that of Theorem 1, so we omit it here.

To obtain $\Gamma_{-i}$ directly using (42), the computational cost is $O\left(L^2 p^2 (n-q)^2 (Lp(n-q) + N)\right)$. In contrast, the computational complexity of calculating (51) is $O(L^3 p^3)$. Evidently, the computational cost of $\Gamma_{-i}$ is reduced with (51). Consequently, equation (42) is simplified as

$$s = \arg\min_{i \in P} \left\{ \Gamma_{-i} = C\,\mathrm{tr}\left( \hat{B}_i^{T} \left(A_{ii}^{(q)}\right)^{-1} \hat{B}_i \right) \right\} \tag{52}$$

4.3 Iteratively Decreasing
When the $s$th feature is determined to be removed, we need to calculate $A^{(q+1)}$ ready for the next iteration if BFS-GELM does not stop. If the classical techniques are utilized to obtain $A^{(q+1)}$, the computational cost is $O\left(L^2 p^2 (n-q-1)^2 (Lp(n-q-1) + N)\right)$. Here, we make use of the Sherman-Morrison formula [61] to obtain it at a low cost. Assume that $A^{(q)}$ is already known; then it is decomposed as

$$A^{(q)} = \begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{12}^{T} & A_{22} & A_{23} \\ A_{13}^{T} & A_{23}^{T} & A_{33} \end{bmatrix} \tag{53}$$

where the matrices $A_{12}$, $A_{22}$, and $A_{23}$ correspond to the $s$th feature. Hence

$$A^{(q+1)} = \begin{bmatrix} A_{11} & A_{13} \\ A_{13}^{T} & A_{33} \end{bmatrix} - \begin{bmatrix} A_{12} \\ A_{23}^{T} \end{bmatrix} A_{22}^{-1} \begin{bmatrix} A_{12}^{T} & A_{23} \end{bmatrix} \tag{54}$$

The computational cost of calculating $A^{(q+1)}$ is reduced to $O\left(Lp(n-q-1)^2\right)$. Obviously, the computational burden is cut down using (54). Meanwhile,

$$\hat{B}_{P\backslash\{s\}} = A^{(q+1)} G_{P\backslash\{s\}}^{T} T \tag{55}$$

In addition, let $P \leftarrow P\backslash\{s\}$ and $q \leftarrow q + 1$.
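A compact sketch of the backward step combines the simplified criterion (51)–(52) with the Schur-complement downdate (54): the score of each removable block is read off from the corresponding diagonal block of $A^{(q)}$ and the current $\hat{B}_P$, and the chosen block is then removed from $A^{(q)}$ without refactorization. Block bookkeeping (for example, keeping the constant block $G_0$) is glossed over; the names and the offset convention are illustrative assumptions.

```python
import numpy as np

def removal_scores(A_q, B_P, offsets):
    """Gamma_{-i} of eq. (51), up to the common factor C, for every removable block.
    `offsets` holds the (start, stop) column range of each block inside G_P."""
    scores = []
    for start, stop in offsets:
        A_ii = A_q[start:stop, start:stop]
        B_i = B_P[start:stop, :]
        scores.append(np.trace(B_i.T @ np.linalg.solve(A_ii, B_i)))
    return np.array(scores)

def downdate_inverse(A_q, start, stop):
    """Remove the block occupying columns [start, stop) from A^(q) via eq. (54),
    i.e. a Schur-complement update instead of a fresh inversion."""
    keep = np.r_[0:start, stop:A_q.shape[0]]
    A_kk = A_q[np.ix_(keep, keep)]
    A_ks = A_q[np.ix_(keep, np.arange(start, stop))]
    A_ss = A_q[start:stop, start:stop]
    return A_kk - A_ks @ np.linalg.solve(A_ss, A_ks.T)
```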
4.4 The Stopping Criterion for BFS-GELM
In this paper, the following stopping criterion is adopted for BFS-GELM

$$\frac{\left\|T - G_P \hat{B}_P\right\|_F^2}{\|T\|_F^2} \leq 1 - \rho \tag{56}$$

where $\rho > 0$, which can balance the overfitting against the model complexity. A larger $\rho$ means fewer features to construct the $p$-order reduced polynomial function as output weights, less testing time, and larger training errors. Conversely, a smaller $\rho$ signifies more features to construct the $p$-order reduced polynomial function as output weights, more testing time, and smaller training errors. When $\rho = 0$, there are no features to be removed. In this situation, BFS-GELM amounts to the original GELM. In addition, we can directly specify the number of features to be removed for BFS-GELM. This method is convenient for controlling the number of features used to construct the $p$-order reduced polynomial function as output weights.
4.5 The Flowchart of BFS-GELM Algorithm

In summary, the procedure of BFS-GELM is depicted as Algorithm 2.

Algorithm 2: BFS-GELM algorithm
1: Initialize:
   • Input training samples $\{(x_k, t_k)\}_{k=1}^{N}$.
   • Choose the type of hidden nodes, and calculate the hidden layer output matrix $H$.
   • Determine the ridge parameter $1/C$.
   • Choose a small $\rho > 0$ or a nonnegative integer $n_{\mathrm{remove}}$.
   • Let $P = \{1, \cdots, n\}$, and $q = 0$.
   • Set $G_P = G$, and calculate $A^{(0)} = \left(G_P^{T} G_P + I/C\right)^{-1}$ and $\hat{B}_P = A^{(0)} G_P^{T} T$.
2: If equation (56) is satisfied or $q \geq n_{\mathrm{remove}}$
3:   Go to step 11.
4: Else
5:   Determine the index $s$ according to (52).
6:   Update $A^{(q+1)}$ based on (54).
7:   Calculate $\hat{B}_{P\backslash\{s\}}$ from (55).
8:   Let $P \leftarrow P\backslash\{s\}$, and $q \leftarrow q + 1$.
9:   Go to step 2.
10: End
11: Output $y_k^{T} = h^{T}(x_k) \otimes [1, z_{kP}]\, \hat{B}_P$, where $z_{kP} = [z_{ki}, \cdots, z_{kj}]$, $i, j \in P$.
5 Experiments

In this section, the performance of the proposed FFS-GELM and BFS-GELM is tested through experiments on twelve benchmark data sets, which are divided into three groups:
(1) Multiple-output data sets: Sml2010, Parkinsons, and Concrete slump;
(2) Single-output data sets of small size: Triazines, Boston housing, Winequality red, and Winequality white;
(3) Single-output data sets of large size: Bank32NH, Puma32H, Cpu act, Ailerons, and Elevators,
where Sml2010, Parkinsons, Concrete slump, Boston housing, Winequality red, and Winequality white are obtained
Table 1 Specifications of each benchmark data set

Data sets           #Training   #Testing   #Features   #Outputs
Sml2010             400         368        8           2
Parkinsons          3000        2875       16          2
Concrete slump      60          43         7           3
Triazines           100         85         58          1
Boston housing      350         156        13          1
Winequality red     800         559        11          1
Winequality white   2500        1461       11          1
Bank32NH            4500        3692       32          1
Puma32H             4500        3692       32          1
Cpu act             4500        3692       21          1
Ailerons            7154        6596       40          1
Elevators           8752        7847       18          1

Notes: #Training represents the number of training samples, #Testing the number of testing samples, #Features the number of input features, and #Outputs the number of outputs.
from the well-known UCI repository¹, while Triazines, Bank32NH, Puma32H, Cpu act, Ailerons, and Elevators are available from the data collection². Each data set is randomly divided into two splits: the training set and the testing set. Their details are tabulated in Table 1.
In this paper, two typical hidden nodes are chosen as activation functions of the hidden layer, i.e., the Sigmoid $h(x) = 1/(1 + \exp\{-x^{T} a_i\})$ and the RBF $h(x) = \exp\left(-b_i \|x - a_i\|^2\right)$, where $a_i$ is randomly chosen from the range $[-1, 1]$ and $b_i$ is chosen from the range $(0, 0.5)$ [41]. All experiments have been carried out in the MATLAB 2013b environment running on an ordinary personal laptop with an Intel(R) Core(TM) i3-2310M CPU at 2.10 GHz and 2.00 GB RAM. To find the average performance rather than the best one, thirty trials are conducted for each data set with every algorithm. In GELM, there exist a parameter $C$ and the polynomial order $p$ to be determined before the experiments. Similar to [59], GELMs with $p = 1$ and $p = 2$ are mainly focused on. For each $p$, the ten-fold cross-validation technique is utilized to determine a nearly optimal parameter $C$ from $\{2^{-20}, 2^{-19}, \cdots, 2^{19}, 2^{20}\}$ with the original GELM, shown in Table 2, and then it is extended to FFS-GELM and BFS-GELM. To facilitate comparisons among different algorithms, the performance index root mean square error (RMSE) is defined by

$$\mathrm{RMSE} = \sqrt{\frac{1}{\#\mathrm{Testing} \times m} \sum_{i=1}^{\#\mathrm{Testing}} \left\| t_i - \hat{t}_i \right\|_F^2} \tag{57}$$

where $\hat{t}_i$ is the estimated value of $t_i$. The smaller the RMSE is, the better the generalization performance of the algorithm.
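Equation (57) translates directly into a few lines (a sketch; it simply averages the squared errors over all testing samples and output dimensions before taking the square root):

```python
import numpy as np

def rmse(T_true, T_pred):
    """RMSE of eq. (57): squared errors averaged over (#Testing x m) entries."""
    n_test, m = T_true.shape
    return np.sqrt(np.sum((T_true - T_pred) ** 2) / (n_test * m))
```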
¹ http://archive.ics.uci.edu/ml/
² http://www.dcc.fc.up.pt/%7Eltorgo/Regression/DataSets.html
Table 2 The nearly optimal parameters for C

Data sets           Hidden node type   p   log2 C
Sml2010             Sigmoid            1   7
                    Sigmoid            2   9
                    RBF                1   −2
                    RBF                2   0
Parkinsons          Sigmoid            1   3
                    Sigmoid            2   5
                    RBF                1   6
                    RBF                2   6
Concrete slump      Sigmoid            1   −1
                    Sigmoid            2   −2
                    RBF                1   −1
                    RBF                2   −2
Triazines           Sigmoid            1   −7
                    Sigmoid            2   −6
                    RBF                1   3
                    RBF                2   5
Boston housing      Sigmoid            1   4
                    Sigmoid            2   3
                    RBF                1   8
                    RBF                2   7
Winequality red     Sigmoid            1   −3
                    Sigmoid            2   −3
                    RBF                1   −2
                    RBF                2   −1
Winequality white   Sigmoid            1   −3
                    Sigmoid            2   −3
                    RBF                1   −1
                    RBF                2   2
Bank32NH            Sigmoid            1   −5
                    Sigmoid            2   −5
                    RBF                1   1
                    RBF                2   1
Puma32H             Sigmoid            1   −9
                    Sigmoid            2   −2
                    RBF                1   −2
                    RBF                2   6
Cpu act             Sigmoid            1   2
                    Sigmoid            2   0
                    RBF                1   11
                    RBF                2   9
Ailerons            Sigmoid            1   0
                    Sigmoid            2   2
                    RBF                1   0
                    RBF                2   6
Elevators           Sigmoid            1   7
                    Sigmoid            2   2
                    RBF                1   7
                    RBF                2   7
Figures 3 and 4 show the experimental results for Sml2010 with one trial. In each panel, there are three lines: the black dashed line represents the original GELM, the red solid line with circle marks represents FFS-GELM, and the blue solid line with triangle marks represents BFS-GELM. For each red or blue line, there are always some marks below the dashed line, which indicates that FFS-GELM or BFS-GELM is able to select appropriate features to construct the p-order reduced polynomial function as output weights for GELM and achieve better generalization performance than the original GELM. This verifies the idea of employing partial features instead of complete features to construct the p-order reduced polynomial function as output weights for GELM. Similar results for the other benchmark data sets are omitted due to space limitations. From these plots, notice that there is always a minimum value for FFS-GELM or BFS-GELM. In the worst cases, the minimum values of FFS-GELM and BFS-GELM are equal to the RMSEs of the original GELM. We collect these minimum values and the other experimental results of all benchmark data sets in Table 3. In practice, the optimal features to construct the p-order reduced polynomial function as output weights for GELM can be easily determined with the combination of the proposed stopping criteria and the cross-validation technique.

Table 3 Detailed experimental results of benchmark data sets
Data sets Sml2010
Hidden nodes type Sigmoid
p 1
Algorithms
#Feature (sparsity ratio)
RMSE
Training time
Testing time (s)
(sec.)
(sec.) 1.177±0.031
GELM
16 (1.000)
7.5441E-02±3.2171E-03
2.912±0.033
FFS-GELM
8 (0.500)
6.9330E-02±2.1736E-03
3.648±0.043
0.616±0.007
BFS-GELM
3 (0.188)
6.9274E-02±2.2778E-03
3.335±0.037
0.240±0.005
continued on next page
18
ACCEPTED MANUSCRIPT – continued from previous page
2
RBF
1
2
Parkinsons
Sigmoid
1
2
RBF
1
2
Concrete slump
Sigmoid
1
2
Algorithms
Sigmoid
AC
CE
Triazines
RBF
1
2
1
2
Boston
housing
Sigmoid
1
2
Testing time (s)
(sec.)
(sec.) 1.188±0.012
16 (1.000)
7.6941E-02±2.9320E-03
2.908±0.045
9 (0.563)
7.4365E-02±3.3046E-03
3.769±0.058
0.671±0.006
BFS-GELM
6 (0.375)
7.3721E-02±3.0315E-03
3.321±0.014
0.453±0.004
GELM
16 (1.000)
6.7981E-02±2.9029E-03
3.957±0.015
1.723±0.013
FFS-GELM
9 (0.563)
6.5092E-02±2.8309E-03
4.905±0.057
1.201±0.007
BFS-GELM
4 (0.250)
6.7230E-02±5.3215E-03
4.369±0.018
0.831±0.006
GELM
16 (1.000)
7.4476E-02±2.0333E-03
3.394±0.017
1.424±0.005
FFS-GELM
9 (0.563)
7.1343E-02±3.1724E-03
4.304±0.038
BFS-GELM
4 (0.250)
7.2302E-02±2.7983E-03
3.870±0.022
0.578±0.008
GELM
16 (1.000)
2.0021E-01±6.9923E-04
3.114±0.056
2.448±0.023
FFS-GELM
15 (0.938)
2.0006E-01±4.5077E-04
5.164±0.068
2.387±0.026
BFS-GELM
13 (0.813)
2.0013E-01±3.8468E-04
3.455±0.010
2.061±0.005 2.488±0.010
0.944±0.007
GELM
16 (1.000)
1.9923E-01±1.1978E-03
3.122±0.019
FFS-GELM
16 (1.000)
1.9923E-01±1.1977E-03
5.254±0.056
2.484±0.007
BFS-GELM
15 (0.938)
1.9918E-01±1.1067E-03
3.391±0.011
2.327±0.013 3.496±0.026
GELM
16 (1.000)
2.0547E-01±1.3232E-03
4.200±0.037
FFS-GELM
8 (0.500)
2.0458E-01±8.2917E-04
5.081±0.033
2.363±0.009
BFS-GELM
7 (0.438)
2.0518E-01±1.5356E-03
4.681±0.007
2.207±0.010
GELM
16 (1.000)
2.0258E-01±7.3960E-04
3.692±0.019
2.983±0.013
FFS-GELM
15 (0.938)
2.0257E-01±6.5583E-04
5.776±0.032
3.000±0.010
BFS-GELM
15 (0.938)
2.0257E-01±6.5583E-04
4.092±0.034
3.000±0.010
GELM
7 (1.0000)
1.7077E-01±9.6989E-04
0.105±0.009
0.023±0.004
FFS-GELM
7 (1.000)
1.7077E-01±9.6989E-04
0.361±0.035
0.023±0.004
BFS-GELM
7 (1.000)
1.7077E-01±9.6989E-04
0.274±0.024
0.025±0.004
7 (1.000)
1.8620E-01±1.4599E-03
0.138±0.007
0.025±0.003
3 (0.429)
1.7483E-01±1.6743E-03
0.088±0.010
0.013±0.002
2 (0.286)
1.7650E-01±1.7228E-03
0.481±0.100
0.009±0.002
GELM
7 (1.000)
1.7842E-01±1.5312E-03
0.151±0.013
0.058±0.009
3 (0.429)
1.7772E-01±1.6802E-03
0.143±0.020
0.048±0.001
BFS-GELM
2 (0.286)
1.7745E-01±1.5896E-03
0.454±0.033
0.042±0.003
GELM
7 (1.000)
1.8906E-01±2.3262E-03
0.167±0.010
0.046±0.005 0.032±0.005
ED
GELM
FFS-GELM
PT
2
Training time
GELM
FFS-GELM
1
RMSE
FFS-GELM
BFS-GELM RBF
#Feature (sparsity ratio)
CR IP T
p
AN US
type
M
Hidden nodes
Data sets
FFS-GELM
2 (0.286)
1.8164E-01±4.5755E-03
0.083±0.004
BFS-GELM
5 (0.714)
1.8354E-01±3.1910E-03
0.416±0.031
0.038±0.004
GELM
58 (1.000)
1.8796E-01±2.8187E-03
0.448±0.013
0.279±0.010
FFS-GELM
19 (0.328)
1.8623E-01±2.1782E-03
0.486±0.017
0.093±0.003
BFS-GELM
15 (0.259)
1.8697E-01±2.5380E-03
1.171±0.019
0.074±0.004
GELM
58 (1.000)
1.8854E-01±5.0281E-03
0.450±0.013
0.263±0.007
FFS-GELM
32 (0.552)
1.8715E-01±5.5982E-03
0.696±0.027
0.140±0.002
BFS-GELM
23 (0.397)
1.8546E-01±5.2845E-03
1.200±0.041
0.102±0.002
GELM
58 (1.000)
2.3454E-01±3.3161E-02
0.449±0.012
0.280±0.007
FFS-GELM
32 (0.552)
2.2071E-01±3.9278E-02
0.732±0.034
0.162±0.005
BFS-GELM
56 (0.966)
2.3448E-01±3.3182E-02
0.656±0.009
0.270±0.003
GELM
58 (1.000)
4.1096E-01±2.2298E-01
0.456±0.016
0.272±0.011
FFS-GELM
55 (0.948)
4.0983E-01±2.2392E-01
1.665±0.027
0.245±0.004
BFS-GELM
55 (0.948)
4.1077E-01±2.2319E-01
0.699±0.006
0.245±0.004 0.123±0.005
GELM
13 (1.000)
6.6357E-02±1.8047E-03
0.408±0.018
FFS-GELM
8 (0.615)
6.6211E-02±1.6074E-03
0.490±0.037
0.081±0.011
BFS-GELM
4 (0.308)
6.6000E-02±2.4952E-03
0.766±0.062
0.041±0.006
GELM
13 (1.000)
6.7950E-02±2.1322E-03
0.410±0.008
0.127±0.006
FFS-GELM
13 (1.000)
6.7950E-02±2.1322E-03
0.935±0.079
0.128±0.010
continued on next page
19
ACCEPTED MANUSCRIPT – continued from previous page
1
2
Winequality red
Sigmoid
1
2
RBF
1
2
Winequality white
Sigmoid
1
2
Algorithms
1
AC
RBF
Puma32H
1
2
1
2
Sigmoid
1
2
RBF
1
Testing time (s)
(sec.)
(sec.) 0.129±0.012
13 (1.000)
6.7950E-02±2.1322E-03
0.633±0.046
13 (1.000)
7.4286E-02±1.7882E-03
0.580±0.012
0.206±0.009
FFS-GELM
4 (0.308)
7.1813E-02±3.9564E-03
0.513±0.031
0.119±0.013
BFS-GELM
5 (0.385)
7.4116E-02±7.8505E-03
0.879±0.010
0.120±0.005
GELM
13 (1.000)
7.3376E-02±3.1256E-03
0.512±0.011
0.176±0.014
FFS-GELM
11 (0.846)
7.3055E-02±3.9283E-03
0.838±0.031
0.156±0.009
BFS-GELM
4 (0.308)
7.1807E-02±5.6049E-03
0.931±0.010
0.091±0.008
0.727±0.012
0.363±0.007
GELM
11 (1.000)
1.3071E-01±1.3585E-04
FFS-GELM
6 (0.545)
1.3052E-01±2.3555E-04
BFS-GELM
3 (0.273)
GELM
11 (1.000)
FFS-GELM BFS-GELM
0.823±0.026
0.201±0.006
1.3049E-01±1.4441E-04
1.002±0.011
0.100±0.003
1.3010E-01±1.9106E-04
0.736±0.014
0.374±0.016
9 (0.818)
1.3010E-01±2.1333E-04
1.058±0.037
0.298±0.009
3 (0.273)
1.2975E-01±5.9066E-04
1.015±0.012
0.105±0.003
GELM
11 (1.000)
1.3138E-01±2.9877E-04
1.158±0.018
0.679±0.014
FFS-GELM
11 (1.000)
1.3138E-01±2.9877E-04
1.803±0.036
0.677±0.011
BFS-GELM
5 (0.455)
GELM
11 (1.000)
1.491±0.014
0.491±0.018
1.2987E-01±3.2645E-04
1.3094E-01±3.6282E-04
0.989±0.021
0.543±0.013 0.546±0.008
FFS-GELM
11 (1.000)
1.2987E-01±3.2645E-04
1.537±0.044
BFS-GELM
11 (1.000)
1.2987E-01±3.2645E-04
1.163±0.026
0.544±0.010
GELM
11 (1.000)
1.1976E-01±1.8074E-04
2.082±0.041
0.922±0.015
FFS-GELM
10 (0.909)
1.1939E-01±1.0837E-04
3.063±0.054
0.837±0.005
BFS-GELM
8 (0.727)
1.1965E-01±1.8252E-04
2.344±0.018
0.671±0.006
GELM
11 (1.000)
1.1837E-01±2.2564E-04
2.095±0.030
0.923±0.009
FFS-GELM
7 (0.636)
1.1814E-01±2.1145E-04
2.567±0.043
0.595±0.007
1.1817E-01±2.2382E-04
2.307±0.018
0.758±0.010
11 (1.000)
1.2054E-01±3.7069E-04
3.466±0.026
1.756±0.033
9 (0.818)
1.1918E-01±1.2819E-04
4.294±0.022
1.596±0.013
GELM
9 (0.818)
9 (0.818)
1.2047E-01±4.5031E-04
3.708±0.016
1.596±0.008
GELM
11 (1.000)
1.1756E-01±4.5079E-04
2.852±0.039
1.378±0.030
FFS-GELM
11 (1.000)
1.1756E-01±4.5079E-04
3.962±0.044
1.379±0.012
BFS-GELM
11 (1.000)
1.1756E-01±4.5079E-04
2.996±0.015
1.379±0.008
32 (1.000)
1.0128E-01±1.9910E-04
8.670±0.120
6.317±0.030
ED
BFS-GELM
PT
Sigmoid
CE
Bank32NH
Training time
GELM
FFS-GELM
2
RMSE
BFS-GELM
BFS-GELM RBF
#Feature (sparsity ratio)
CR IP T
RBF
p
AN US
type
M
Hidden nodes
Data sets
GELM
FFS-GELM
28 (0.875)
1.0122E-01±1.9148E-04
16.121±0.146
5.572±0.068
BFS-GELM
17 (0.531)
1.0095E-01±2.4823E-04
9.003±0.080
3.363±0.003
GELM
32 (1.000)
1.0309E-01±6.2558E-04
8.097±0.103
5.987±0.052
FFS-GELM
21 (0.656)
1.0297E-01±6.3869E-04
13.712±0.047
3.993±0.017
BFS-GELM
17 (0.531)
1.0282E-01±7.0511E-04
8.679±0.014
3.240±0.007 6.841±0.084
GELM
32 (1.000)
1.0121E-01±6.1947E-04
9.211±0.250
FFS-GELM
31 (0.969)
1.0120E-01±6.1453E-04
17.508±0.071
6.631±0.029
BFS-GELM
19 (0.594)
1.0116E-01±6.9315E-04
9.652±0.065
4.360±0.020
GELM
32 (1.000)
1.0144E-01±1.0100E-03
8.367±0.022
6.434±0.009
FFS-GELM
32 (1.000)
1.0144E-01±1.0100E-03
16.847±0.040
6.435±0.028
BFS-GELM
32 (1.000)
1.0144E-01±1.0100E-03
8.832±0.087
6.450±0.011
GELM
32 (1.000)
1.5627E-01±3.2272E-04
8.343±0.030
6.208±0.021
FFS-GELM
12 (0.375)
1.5613E-01±3.5118E-04
10.624±0.102
2.290±0.013
BFS-GELM
15 (0.469)
1.5610E-01±3.4875E-04
8.690±0.060
2.862±0.029
GELM
32 (1.000)
1.3493E-01±1.0333E-02
7.910±0.101
5.869±0.016 1.485±0.019
FFS-GELM
8 (0.250)
1.2872E-01±9.7299E-03
8.973±0.085
BFS-GELM
9 (0.281)
1.2893E-01±9.7028E-03
8.423±0.024
1.660±0.010
GELM
32 (1.000)
1.5502E-01±1.2037E-03
9.269±0.083
6.937±0.039
continued on next page
20
ACCEPTED MANUSCRIPT – continued from previous page p
2
Cpu act
Sigmoid
1
2
RBF
1
2
Ailerons
Sigmoid
1
2
RBF
1
Algorithms
1
1.5491E-01±1.1480E-03
16.396±0.067
5.860±0.016
1.5484E-01±1.3445E-03
9.522±0.050
4.329±0.017 6.290±0.087
GELM
32 (1.000)
1.4457E-01±6.4977E-03
8.493±0.254
FFS-GELM
14 (0.438)
1.4236E-01±7.0176E-03
11.902±0.094
3.112±0.011
BFS-GELM
11 (0.344)
1.4249E-01±7.7472E-03
9.512±0.293
2.548±0.029 4.018±0.015
GELM
21 (1.000)
3.0365E-02±4.2305E-04
5.710±0.107
FFS-GELM
13 (0.619)
3.0039E-02±6.1038E-04
8.343±0.054
2.577±0.013
BFS-GELM
10 (0.476)
3.0126E-02±8.3810E-04
6.205±0.016
2.000±0.009
GELM
21 (1.000)
2.7799E-02±6.5421E-04
2
3.899±0.018
FFS-GELM
13 (0.619)
2.7546E-02±5.2142E-04
7.994±0.151
2.418±0.021
13 (0.619)
2.7572E-02±5.8409E-04
5.865±0.020
2.418±0.013
GELM
21 (1.000)
3.5123E-02±2.4610E-03
6.999±0.119
5.134±0.095
FFS-GELM
15 (0.714)
3.4269E-02±1.9096E-03
10.402±0.050
4.033±0.013
BFS-GELM
12 (0.571)
3.3974E-02±1.8052E-03
7.755±0.089
3.457±0.034
GELM
21 (1.000)
3.1535E-02±1.1989E-03
6.181±0.012
4.455±0.009
2.9740E-02±1.4879E-03
7.698±0.065
2.296±0.008
FFS-GELM
9 (0.429)
BFS-GELM
8 (0.381)
2.9987E-02±1.3073E-03
6.781±0.030
2.105±0.012
GELM
40 (1.000)
4.4834E-02±8.3945E-05
16.489±0.298
14.312±0.307
FFS-GELM
19 (0.475)
4.4766E-02±7.7205E-05
24.455±0.244
6.505±0.056
BFS-GELM
17 (0.425)
4.4722E-02±5.5490E-05
16.575±0.163
5.797±0.014 12.912±0.026
GELM
40 (1.000)
4.5229E-02±1.6300E-04
15.257±0.052
FFS-GELM
16 (0.400)
4.4872E-02±2.0545E-04
21.647±0.074
5.164±0.027
BFS-GELM
14 (0.350)
4.4916E-02±2.2684E-04
15.889±0.096
4.540±0.018
4.5031E-02±1.6822E-04
16.797±0.032
14.435±0.055
39 (0.975)
4.5031E-02±1.6822E-04
35.639±0.127
14.496±0.022
20(0.500)
4.5010E-02±1.8694E-04
17.825±0.135
7.940±0.010
40 (1.000)
4.6093E-02±8.0560E-04
16.250±0.125
13.720±0.161
FFS-GELM
17 (0.425)
4.5815E-02±1.0251E-03
23.654±0.167
6.322±0.039
BFS-GELM
21 (0.525)
4.5782E-02±9.0675E-04
17.083±0.030
7.616±0.009
GELM
18 (1.000)
3.5201E-02±3.4013E-04
9.797±0.088
7.520±0.019
14 (0.778)
3.4276E-02±1.1756E-04
14.884±0.251
5.911±0.118
11 (0.611)
3.4402E-02±1.9835E-04
10.171±0.075
4.621±0.023 7.441±0.069
GELM
GELM
PT CE AC
1
5.487±0.026
BFS-GELM
FFS-GELM
RBF
(sec.)
27 (0.844)
BFS-GELM
2
Testing time (s)
(sec.)
19 (0.594)
40 (1.000)
ED
Sigmoid
Training time
FFS-GELM
BFS-GELM
Elevators
RMSE
BFS-GELM
FFS-GELM
2
#Feature (sparsity ratio)
CR IP T
type
AN US
Hidden nodes
M
Data sets
GELM
18 (1.000)
3.4653E-02±2.7360E-04
9.795±0.236
FFS-GELM
15 (0.833)
3.4066E-02±1.0453E-04
15.208±0.186
6.282±0.060
BFS-GELM
11 (0.611)
3.4205E-02±9.4131E-05
10.015±0.041
4.542±0.024 10.170±0.047
GELM
18 (1.000)
3.5250E-02±6.2821E-04
12.846±0.196
FFS-GELM
15 (0.833)
3.4143E-02±2.5593E-04
18.293±0.652
8.936±0.162
BFS-GELM
13 (0.722)
3.4150E-02±2.4449E-04
13.025±0.143
8.050±0.061
GELM
18 (1.000)
3.4439E-02±2.9209E-04
11.155±0.018
8.737±0.016
FFS-GELM
15 (0.833)
3.3916E-02±1.5227E-04
16.574±0.073
7.585±0.009
BFS-GELM
11 (0.611)
3.3921E-02±1.5153E-04
11.943±0.356
6.034±0.144
In Table 3, the sparsity ratio is defined as the ratio of the number of selected features to the number of complete input features. Assume that the sparsity ratio is $\tau$; then, roughly according to (16), the testing time is reduced to $\frac{\tau np + 1}{np + 1}$ of the original.
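As an illustrative calculation (the numbers are chosen for concreteness and are not taken from Table 3): with $n = 16$ input features and $p = 1$, a sparsity ratio of $\tau = 0.5$ (8 selected features) gives $\frac{\tau np + 1}{np + 1} = \frac{9}{17} \approx 0.53$, i.e., the testing time drops to roughly half of the original, close to the linear estimate $\tau = 0.5$.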
This can be further simplified to $\tau$ if $np \gg 1$. From this formula, it is obvious that the relationship between the sparsity ratio and the testing time is approximately linear. The sparsity ratio of the original GELM is always equal to one, because it utilizes complete input features to construct the p-order reduced polynomial function as output weights. Statistically, BFS-GELM always needs fewer input features to construct the p-order reduced polynomial function than FFS-GELM when obtaining the best generalization performance. Fewer input features mean a smaller sparsity ratio. Hence, the sparsity ratio of BFS-GELM is smaller than that of FFS-GELM. Since the testing time is directly proportional to the sparsity ratio, BFS-GELM statistically outperforms FFS-GELM in terms of the testing time. Due to the lack of sparsity in the input features used to construct the p-order reduced polynomial function as output weights, the original GELM needs more testing time. Therefore, in those scenarios which have a strict
demand on the testing time, BFS-GELM is most preferred while still obtaining good generalization performance. The original GELM not only loses its edge on the testing time, but also possesses the poorest generalization performance. Hence, the motivations for developing FFS-GELM and BFS-GELM are confirmed. That is, the generalization performance of the original GELM can be improved by using partial input features to construct the p-order reduced polynomial function as output weights. One main underlying reason is that insignificant or redundant input features used in constructing the p-order reduced polynomial function as output weights impair the generalization performance.
Statistically, FFS-GELM is slightly superior to BFS-GELM with respect to the generalization performance, but not obviously so. In other words, the generalization performance of BFS-GELM is comparable to that of FFS-GELM. In theory, as a forward learning algorithm, FFS-GELM should need less training time than BFS-GELM. However, according to Table 3, it loses the advantage in terms of the training time. This phenomenon originates from the fact that FFS-GELM usually needs more input features for constructing the p-order reduced polynomial function as output weights when reaching the best generalization performance. The more input features to be selected, the more iterations are required, which naturally incurs more training time. Among these algorithms, the original GELM needs the least training time, the reason being that it takes no action on the input features used to construct the p-order reduced polynomial function as output weights.
All in all, both FFS-GELM and BFS-GELM improve the original GELM with respect to generalization by using partial input features to construct the p-order reduced polynomial function as output weights, and they need less testing time. In comparison with FFS-GELM, BFS-GELM is the winner in terms of the sparsity ratio, the training time, and the testing time, but slightly loses the advantage in the generalization performance.

6 Conclusions
As a popular topic in SLFNs, ELM has drawn many researchers’ interest due to its very fast learning speed, good generalization ability, and ease of implementation. Different from traditional SLFNs such as radial basis function network, ELM generates the input weights and the biases of hidden layer randomly, and its output weights are analytically determined through solving the Moore-Penrose generalized inverse of a linear system. In the original ELM, output weights are constant, which are not directly related to input features at all. However, a generalized ELM was recently proposed, in which output weights have a close relationship with input features. That is, this GELM utilizes the p-order reduced polynomial functions of complete input features to construct output weights. To be more important, the approximation capability of GELM had already been proved, and it can obtain better generalization
performance than the original ELM according to the experimental reports. However, according to the empirical results, it is found that there may exist insignificant or redundant input features for constructing the p-order reduced polynomial function as output weights in GELM. These insignificant or redundant input features not only deteriorate the generalization performance, but also increase the testing time. Hence, it is necessary and important to select appropriate input features to construct the p-order reduced polynomial function as output weights for GELM. The greedy learning algorithms are commonly-used tricks for feature selection in the machine learning community. Hence, two algorithms, i.e., FFS-GELM and BFS-GELM, are proposed in this paper. Specifically, FFS-GELM is a forward
learning algorithm, while BFS-GELM is a backward learning algorithm. Both of them can select appropriate input features for GELM. Compared with the original GELM, FFS-GELM and BFS-GELM improve the generalization performance and reduce the testing time. Moreover, the experimental results favor these conclusions. Of course, there are some open problems in the proposed FFS-GELM and BFS-GELM. For example, how to choose an appropriate sparsity ratio, which can balance the relationship between the generalization performance and the testing time well, needs further discussion. In this paper, both FFS-GELM and BFS-GELM are applied to regression problems, so it
is expected to extend them to classification problems. In addition, the influence of the number of hidden nodes L and the polynomial order p on the testing time is another good research topic.
Acknowledgments
This research was partially supported by the Fundamental Research Funds for the Central Universities under nos. NJ20160021 and NS2017013, and the National Natural Science Foundation of China under Grant no. 11502008.
References
[1] J. Park and I. W. Sandberg, “Universal approximation using radial-basis-function networks,” Neural Computation, vol. 3, no. 2, pp. 246–257, 1991. [Online]. Available: http://dx.doi.org/10.1162/neco.1991.3.2.246
[2] K. Hornik, "Approximation capabilities of multilayer feedforward networks," Neural Networks, vol. 4, no. 2, pp. 251 – 257, 1991. [Online]. Available: http://www.sciencedirect.com/science/article/pii/089360809190009T
[3] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken, "Multilayer feedforward networks with a nonpolynomial activation function can approximate any function," Neural Networks, vol. 6, no. 6, pp. 861 – 867, 1993.
[4] S. Chen, S. Billings, and W. Luo, “Orthogonal least squares methods and their application to non-linear system identification,” International Journal of Control, vol. 50, no. 5, pp. 1873 – 1896, 1989.
[5] J. Leonard and M. A. Leonard, “Radial basis function networks for classifying process faults,” IEEE Control Systems, vol. 11, no. 3, pp. 31–38, 1991. [6] H. Leung, G. Hennessey, and A. Drosopoulos, “Signal detection using the radial basis function coupled map lattice,” IEEE Transactions on Neural Networks, vol. 11, no. 5, pp. 1133 – 1151, 2000. [Online]. Available: http://dx.doi.org/10.1109/72.870045
23
ACCEPTED MANUSCRIPT [7] H. Leung, T. Lo, and S. Wang, “Prediction of noisy chaotic time series using an optimal radial basis function neural network,” IEEE Transactions on Neural Networks, vol. 12, no. 5, pp. 1163 – 1172, 2001. [Online]. Available: http://dx.doi.org/10.1109/72.950144 [8] M. Cowper, B. Mulgrew, and C. Unsworth,
“Nonlinear prediction of chaotic signals using a normalised
radial basis function network,” Signal Processing, vol. 82, no. 5, pp. 775 – 789, 2002. [Online]. Available: http://dx.doi.org/10.1016/S0165-1684(02)00155-X [9] M. C. Choy, D. Srinivasan, and R. L. Cheu, “Neural networks for continuous online learning and control,” vol. 17,
no. 6,
http://dx.doi.org/10.1109/TNN.2006.881710
pp. 1511 – 1531,
2006. [Online]. Available:
IEEE Transactions on Neural Networks,
[10] H. Nishikawa and S. Ozawa, “Radial basis function network for multitask pattern recognition,” Neural Processing Letters, vol. 33, no. 3, pp. 283–299, 2011. [Online]. Available: http://dx.doi.org/10.1007/s11063-011-9178-9 [11] N. Wang,
J.-C. Sun,
M. J. Er,
and Y.-C. Liu,
“Hybrid recursive least squares algorithm for online
http://dx.doi.org/10.1016/j.neucom.2015.09.090 [12] J. Cao,
K. Zhang,
M. Luo,
C. Yin,
sequential identification using data chunks,” Neurocomputing, vol. 174, pp. 651 – 660, 2016. [Online]. Available:
and X. Lai,
“Extreme learning machine and adaptive sparse
representation for image classification,” Neural Networks, vol. 81, pp. 91 – 102, 2016. [Online]. Available: http://dx.doi.org/10.1016/j.neunet.2016.06.001
[13] J. Cao, W. Wei, J. Wang, and R. Wang, “Excavation equipment recognition based on novel acoustic statistical features,”
M
IEEE Transactions on Cybernetics, 2017.
[14] M. A. Sartori and P. J. Antsaklis, “A simple method to derive bounds on the size and to train multilayer
ED
neural networks,” IEEE Transactions on Neural Networks, vol. 2, no. 4, pp. 467 – 471, 1991. [Online]. Available: http://dx.doi.org/10.1109/72.88168
[15] S.-C. Huang and Y.-F. Huang, “Bounds on the number of hidden neurons in multilayer perceptrons,” IEEE Transactions
PT
on Neural Networks, vol. 2, no. 1, pp. 47 – 55, 1991. [Online]. Available: http://dx.doi.org/10.1109/72.80290 [16] S. Tamura and M. Tateishi, “Capabilities of a four-layered feedforward neural network:
four layers versus
CE
three,” IEEE Transactions on Neural Networks, vol. 8, no. 2, pp. 251 – 255, 1997. [Online]. Available: http://dx.doi.org/10.1109/72.557662 [17] G.-B. Huang and H. Babri, “Upper bounds on the number of hidden neurons in feedforward networks with arbitrary
AC
bounded nonlinear activation functions,” IEEE Transactions on Neural Networks, vol. 9, no. 1, pp. 224–229, 1998.
[18] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, 1986.
[19] M. H. Hagan and M. B. Menhaj, “Training feedforward networks with the marquardt algorithm,” IEEE Transactions on Neural Networks, vol. 5, no. 6, pp. 989 – 993, 1994. [Online]. Available: http://dx.doi.org/10.1109/72.329697 [20] B. M. Wilamowski and H. Yu, “Improved computation for levenberg-marquardt training,” IEEE Transactions on Neural Networks, vol. 21, no. 6, pp. 930 – 937, 2010. [Online]. Available: http://dx.doi.org/10.1109/TNN.2010.2045657 [21] M. F. Moller, “A scaled conjugate gradient algorithm for fast supervised learning,” Neural Networks, vol. 6, no. 4, pp. 525 – 533, 1993. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0893608005800565
24
ACCEPTED MANUSCRIPT [22] R. Battiti and F. Masulli, “Bfgs optimization for faster and automated supervised learning,” in International Neural Network Conference.
Springer Netherlands, 1990, pp. 757–760.
[23] K. Li, J.-X. Peng, and G. W. Irwin, “A fast nonlinear model identification method,” IEEE Transactions on Automatic Control, vol. 50, no. 8, pp. 1211 – 1216, 2005. [Online]. Available: http://dx.doi.org/10.1109/TAC.2005.852557 [24] P. Vincent and Y. Bengio, “Kernel matching pursuit,” Machine Learning, vol. 48, no. 1-3, pp. 165 – 187, 2002. [Online]. Available: http://dx.doi.org/10.1023/A:1013955821559 [25] V. Popovici, S. Bengio, and J.-P. Thiran, “Kernel matching pursuit for large datasets,” Pattern Recognition, vol. 38,
CR IP T
no. 12, pp. 2385 – 2390, 2005. [Online]. Available: http://dx.doi.org/10.1016/j.patcog.2005.01.021 [26] Q. Zhou, S. Song, C. Wu, and G. Huang, “Kernelized lars-lasso for constructing radial basis function neural networks,” Neural Computing and Applications, vol. 23, no. 7-8, pp. 1969 – 1976, 2013. [Online]. Available: http://dx.doi.org/10.1007/s00521-012-1189-6
[27] G. Castellano, A. M. Fanelli, and M. Pelillo, “Iterative pruning algorithm for feedforward neural networks,” IEEE Transactions on Neural Networks, vol. 8, no. 3, pp. 519 – 531, 1997. [Online]. Available:
AN US
//dx.doi.org/10.1109/72.572092
http:
[28] R. Setiono, “Penalty-function approach for pruning feedforward neural networks,” Neural Computation, vol. 9, no. 1, pp. 185 – 185, 1997.
[29] X. Hong and S. Billings, “Givens rotation based fast backward elimination algorithm for rbf neural network pruning,” Control Theory and Applications, vol. 144, no. 5, pp. 381 – 384, 1997. [Online]. Available:
http://dx.doi.org/10.1049/ip-cta:19971436
M
IEE Proceedings:
[30] D. M. L. Barbato and O. Kinouchi, “Optimal pruning in neural networks,” Physical Review E - Statistical Physics,
ED
Plasmas, Fluids, and Related Interdisciplinary Topics, vol. 62, no. 6 B, pp. 8387 – 8394, 2000. [Online]. Available: http://dx.doi.org/10.1103/PhysRevE.62.8387
[31] T. D. Jorgensen, B. P. Haynes, and C. C. F. Norlund, “Pruning artificial neural networks using neural complexity
PT
measures,” International Journal of Neural Systems, vol. 18, no. 5, pp. 389 – 403, 2008. [Online]. Available: http://dx.doi.org/10.1142/S012906570800166X Qiao,
Y.
Zhang,
and
CE
[32] J.-f.
H.-g.
Han,
“Fast
unit
pruning
algorithm
for
feedforward
neural
network
design,” Applied Mathematics and Computation, vol. 205, no. 2, pp. 622 – 627, 2008. [Online]. Available:
AC
http://dx.doi.org/10.1016/j.amc.2008.05.049 [33] Y.-P. Zhao, J.-G. Sun, Z.-H. Du, Z.-A. Zhang, and H.-B. Zhang, “Pruning least objective contribution in kmse,” Neurocomputing, vol. 74, no. 17, pp. 3009 – 3018, 2011. [Online]. Available: http://dx.doi.org/10.1016/j.neucom.2011. 04.004
[34] P. B. Nair, A. Choudhury, and A. J. Keane, “Some greedy learning algorithms for sparse regression and classification with mercer kernels,” Journal of Machine Learning Research, vol. 3, no. 4-5, pp. 781 – 801, 2003. [35] A. Sherstinsky and R. W. Picard, “On the efficiency of the orthogonal least squares training method for radial basis function networks,” IEEE Transactions on Neural Networks, vol. 7, no. 1, pp. 195 – 200, 1996. [Online]. Available: http://dx.doi.org/10.1109/72.478404 [36] V. N. Vapnik, The Nature of Statistical Learning Theory.
25
New York, NY, USA: Springer-Verlag New York, Inc., 1995.
ACCEPTED MANUSCRIPT [37] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273 – 273, 1995. [Online]. Available: http://dx.doi.org/10.1023/A:1022627411411 [38] V. Vapnik, “An overview of statistical learning theory,” IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 988–999, 1999. [39] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme learning machine: A new learning scheme of feedforward neural networks,” in Proceedings of IEEE International Conference on Neural Networks, vol. 2, Budapest, Hungary, 2004, pp. 985 – 990. [Online]. Available: http://dx.doi.org/10.1109/IJCNN.2004.1380068
CR IP T
[40] ——, “Extreme learning machine: theory and applications,” Neurocomputing, vol. 70, no. 1-3, pp. 489 – 501, 2006. [Online]. Available: http://dx.doi.org/10.1016/j.neucom.2005.12.126
[41] G.-B. Huang, L. Chen, and C.-K. Siew, “Universal approximation using incremental constructive feedforward networks with random hidden nodes,” IEEE Transactions on Neural Networks, vol. 17, no. 4, pp. 879 – 892, 2006. [Online]. Available: http://dx.doi.org/10.1109/TNN.2006.875977
[42] C. J. Burges and B. Scholkopf, “Improving the accuracy and speed of support vector machines,” in Advances in Neural MIT Press, 1997, pp. 375–381.
AN US
Information Processing Systems.
[43] M. Pontil and A. Verri, “Properties of support vector machines,” Neural Computation, vol. 10, no. 4, pp. 955 – 955, 1998. [Online]. Available: http://dx.doi.org/10.1162/089976698300017575
[44] Q. Li, L. Jiao, and Y. Hao, “Adaptive simplification of solution for support vector machine,” Pattern Recognition, vol. 40, no. 3, pp. 972 – 980, 2007. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0031320306003153 R.-C. Chen,
and X. Guo,
“Pruning support vector machines without altering performances,”
M
[45] X. Liang,
IEEE Transactions on Neural Networks,
vol. 19,
no. 10,
pp. 1792 – 1803,
2008. [Online]. Available:
ED
http://dx.doi.org/10.1109/TNN.2008.2002696
[46] X. Liu, C. Gao, and P. Li, “A comparative analysis of support vector machines and extreme learning machines,”
S0893608012001086
PT
Neural Networks, vol. 33, pp. 58 – 66, 2012. [Online]. Available: http://www.sciencedirect.com/science/article/pii/
[47] G.-B. Huang and L. Chen, “Convex incremental extreme learning machine,” Neurocomputing, vol. 70, no. 16C18, pp.
CE
3056 – 3062, 2007. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925231207000677 [48] ——, “Enhanced random search based incremental extreme learning machine,” Neurocomputing, vol. 71, no. 16-18, pp.
AC
3460 – 3468, 2008. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925231207003633 [49] Y.-P.
Zhao,
Z.-Q.
Li,
P.-P.
Xi,
D.
incremental extreme learning machine,”
Liang,
L.
Sun,
Neurocomputing,
and
T.-H.
vol. 241,
Chen,
“Gramschmidt
pp. 1 – 17,
process
based
2017. [Online]. Available:
http://dx.doi.org/10.1016/j.neucom.2017.01.049
[50] G. Feng, G.-B. Huang, Q. Lin, and R. Gay, “Error minimized extreme learning machine with growth of hidden nodes and incremental learning,” IEEE Transactions on Neural Networks, vol. 20, no. 8, pp. 1352–1357, 2009. [51] N. Wang, M. J. Er, and M. Han, “Parsimonious extreme learning machine using recursive orthogonal least squares,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 10, pp. 1828 – 1841, 2014. [Online]. Available: http://dx.doi.org/10.1109/TNNLS.2013.2296048
26
ACCEPTED MANUSCRIPT [52] Y.-P. Zhao,
K.-K. Wang,
orthogonal transformation,”
and Y.-B. Li, Neurocomputing,
“Parsimonious regularized extreme learning machine based on vol. 156,
pp. 280 – 296,
2015. [Online]. Available:
http:
//dx.doi.org/10.1016/j.neucom.2014.12.046 [53] Y.-P. Zhao and R. Huerta, “Improvements on parsimonious extreme learning machine using recursive orthogonal least squares,” Neurocomputing, vol. 191, pp. 82 – 94, 2016. [Online]. Available: http://dx.doi.org/10.1016/j.neucom.2016.01. 005 [54] H.-J. Rong, Y.-S. Ong, A.-H. Tan, and Z. Zhu, “A fast pruned-extreme learning machine for classification problem,” Neurocomputing, vol. 72, no. 1-3, pp. 359 – 366, 2008. [Online]. Available: http://www.sciencedirect.com/science/
CR IP T
article/pii/S0925231208000738
[55] Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, and A. Lendasse, “Op-elm: Optimally pruned extreme learning machine,” IEEE Transactions on Neural Networks, vol. 21, no. 1, pp. 158–162, 2010.
[56] T. Simila and J. Tikka, “Multiresponse sparse regression with application to multidimensional scaling,” Warsaw, Poland, 2005, pp. 97 – 102.
Systems Engineering and Electronics,
vol. 25,
//dx.doi.org/10.1109/JSEE.2014.000103
AN US
[57] Y.-P. Zhao and K.-K. Wang, “Fast cross validation for regularized extreme learning machine,” Journal of no. 5,
pp. 895 – 900,
2013. [Online]. Available:
http:
[58] Y.-P. Zhao, B. Li, and Y.-B. Li, “An accelerating scheme for destructive parsimonious extreme learning machine,” Neurocomputing, vol. 167, pp. 671 – 687, 2015. [Online]. Available: http://dx.doi.org/10.1016/j.neucom.2015.04.002
M
[59] N. Wang, M. J. Er, and M. Han, “Generalized single-hidden layer feedforward networks for regression problems,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 6, pp. 1161 – 1176, 2015. [Online]. Available:
ED
http://dx.doi.org/10.1109/TNNLS.2014.2334366
[60] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, “Extreme learning machine for regression and multiclass classification,” IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, no. 2, pp. 513 – 529, 2012. [Online].
PT
Available: http://dx.doi.org/10.1109/TSMCB.2011.2168604 Beijing, China: Tsinghua University Press, 2004.
AC
CE
[61] X. Zhang, Matrix Analysis and Applications.
27
ACCEPTED MANUSCRIPT
Yong-Ping Zhao received his B.E. degree in thermal energy and power engineering from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in July 2004, where he then pursued the M.S. and Ph.D. degrees. He received the Ph.D. degree in December 2009, and his dissertation was nominated for the National Excellent Doctoral Dissertation Award of China in 2013. He is currently a professor with the College of Energy and Power Engineering, Nanjing University of Aeronautics and Astronautics. His research interests include aircraft engine modeling, control and fault diagnostics, machine learning, and pattern recognition.

Ying-Ting Pan received her bachelor's degree in thermal energy and power engineering from Southeast University, China, in July 2016. She won the Meritorious Winner award of COMAP's Mathematical Contest in Modeling in 2015. After graduation, she worked as an assistant engineer at Jiangsu Blue Sky Technology Co., Ltd. for one year. From September 2017, she will pursue the master's degree at Nanjing University of Aeronautics and Astronautics, majoring in control engineering.

Fang-Quan Song received the B.S. degree from Nanjing University of Aeronautics and Astronautics in 2017, where he is currently pursuing the M.S. degree. His research interests include aircraft engine modeling, control, fault diagnostics, machine learning, and pattern recognition.

Liguo Sun received his B.Sc. and M.Sc. degrees from Nanjing University of Aeronautics and Astronautics, China, in 2008 and 2010, respectively. On October 30, 2014, he received his Ph.D. degree from the Control and Simulation Group, Faculty of Aerospace Engineering, TU Delft. Since June 2015, he has been an associate professor with the Flight Dynamics and Flight Safety group, School of Aeronautic Science and Engineering, Beihang University. His current research interests include nonlinear system identification, machine learning, multivariate spline theory, fault-tolerant (nonlinear) flight control, aircraft safe-flight-envelope prediction, and propulsion control.

Ting-Hao Chen received his B.S. degree in thermal energy and power engineering from Civil Aviation Flight University of China, Guanghan, China, in 2007, and his M.S. degree in aero-engine fault diagnosis from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2010. His research interests include aero-engine fault diagnosis, machine learning, etc.
Figure 3 Experimental results for Sml2010 with the Sigmoid activation function: (a) p = 1; (b) p = 2. Both panels plot RMSE against the number of selected features (#Feature) for GELM, FFS-GELM, and BFS-GELM.
Figure 4 Experimental results for Sml2010 with the RBF activation function: (a) p = 1; (b) p = 2. Both panels plot RMSE against the number of selected features (#Feature) for GELM, FFS-GELM, and BFS-GELM.