Improvements on parsimonious extreme learning machine using recursive orthogonal least squares

Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎ Contents lists available at ScienceDirect Neurocomputing journal homepage: www.elsevier.com/locate/neucom Improveme...

Download PDF

684KB Sizes 2 Downloads 51 Views

Report

PDF Reader
Full Text

Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎

Contents lists available at ScienceDirect

Neurocomputing journal homepage: www.elsevier.com/locate/neucom

Improvements on parsimonious extreme learning machine using recursive orthogonal least squares Yong-Ping Zhao a,n, Ramón Huerta b a b

College of Energy and Power Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, PR China BioCircuits Institute, University of California San Diego, La Jolla, CA 92093-0404, USA

art ic l e i nf o

a b s t r a c t

Article history: Received 3 March 2015 Received in revised form 10 November 2015 Accepted 5 January 2016 Communicated by Ning Wang

Recently novel constructive and destructive parsimonious extreme learning machines (CP-ELM and DPELM) arose to cope with regression problems. With these foundations, several improvements on CP-ELM and DP-ELM are suggested. CP-ELM can be improved by replacing the Givens rotation with the Householder transformation, yielding the improved CP-ELM (ICP-ELM) which results in the acceleration of the training speed without hampering the generalization performance. Subsequently, a hybrid constructive– destructive ELM (CDP-ELM) is generated integrating elements from CP-ELM and DP-ELM. The goal is to combine the advantages of training speed and parsimony from CP-ELM and DP-ELM. Finally, experiments on regression data sets and a real-world system identiﬁcation of robot arm example are done to test the feasibility and efﬁcacy of these variants including ICP-ELM and CDP-ELM. & 2016 Elsevier B.V. All rights reserved.

Keywords: Single-hidden-layer feedforward network Extreme learning machine Givens rotation Householder transformation

1. Introduction In the past two decades, the approximate capabilities of singlehidden-layer feedforward networks (SLFNs) to nonlinear mappings have been extensively researched due to their successful applications from scientiﬁc research and engineering [1,2]. A general SLFN with m outputs, n inputs, L hidden nodes can be expressed as f ðx Þ ¼

L X

βi hðai ; bi ; xÞ;

x A ℜn

ð1Þ

i¼1

h where f ¼ f 1

⋯

i h f m , βi ¼ β i;1

⋯

βi;m

iT

is the weight vec-

tor connecting the ith hidden node to the output nodes, hðai ; bi ; xÞ is the output of the ith hidden node with the hidden-node parameters ðai ; bi Þ A ℜn ℜ. Currently, a hot spot is to ﬁnd appropriate learning parameters for ai , bi , and βi such that f ðxÞ can approximate the true relationship between actual inputs and outputs to the maximum possible extent [3]. The most popular strategy to obtain the parameters is a tuning based learning strategy. In this case, as long as the appropriate hidden nodes are suitably selected and the good adjusting regulations are applied to them, any continuous function can be well approximated using an SLFN network [4]. However, there are two main drawbacks in this scheme. One is that the learning speed is slow. For a given level of generalization n

Corresponding author. E-mail addresses: [email protected], [email protected] (Y.-P. Zhao).

performance of the SLFN network, hundreds of or thousands of numerically iterative steps are required. Another hindrance is that the SLFN learning machine may easily be trapped in local minima. Unlike conventional neural network frameworks on SLFN, Huang et al. [4–6] have recently sidestepped both bottlenecks above and proved that SLFNs with nonlinear piecewise continuous hidden nodes and randomly generated hidden node parameters can work well as universal approximators. The trick consists of calculating the output weights connecting the hidden nodes to the output nodes, thus yielding extreme learning machine (ELM). According to [7], the traditional iterative techniques are not required for determining the parameters of ELM. ELM originated from neuronbased SLFNs [4,5,8] is extended to a general version using nonlinear piecewise continuous functions as hidden nodes. As a promising tool to solve SLFN problems, ELM has shown excellent performance in regression and classiﬁcation applications with extremely fast training speed and better generalization performance [6]. An open question on ELMs is how to determine an appropriate compact architecture. Some online sequential learning algorithms [7,9–16] were developed with ﬁxing the network size. To solve this problem, there are two main strategies. The ﬁrst one is referred to as constructive algorithms. Some typical representatives of this class are the incremental ELMs [4,17–25], which start with a small initial size and gradually add new hidden nodes until a required solution is obtained. Unlike the incremental algorithms, whose existing hidden nodes are frozen when the new hidden nodes are added, there still exist some adaptive algorithms where the

http://dx.doi.org/10.1016/j.neucom.2016.01.005 0925-2312/& 2016 Elsevier B.V. All rights reserved.

Please cite this article as: Y.-P. Zhao, R. Huerta, Improvements on parsimonious extreme learning machine using recursive orthogonal least squares, Neurocomputing (2016), http://dx.doi.org/10.1016/j.neucom.2016.01.005i

Y.-P. Zhao, R. Huerta / Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎

2

existing networks may be replaced by newly generated networks which have fewer hidden nodes and better generalization performance [3,26]. The second strategy is based on the so-called destructive algorithms, also known as pruning algorithms [27– 30], in which a network with a larger than necessary size is initially trained, and then the redundant or less effective hidden nodes are gradually eliminated. All in all, both strategies are expected to seek a more compact architecture for ELM. Recently, novel constructive and destructive parsimonious ELMs (CP-ELM and DP-ELM) [31] were developed for multi-input multi-output SLFNs. The initial hidden nodes for CP-ELM and DPELM are randomly generated and their corresponding hidden layer output matrixes are orthogonalized. Then the sequential partial orthogonalization (SPO) using the Givens rotation is utilized to determine appropriate networks for CP-ELM and DP-ELM. Subsequently, both CP-ELM and DP-ELM were extended to the regularized ELM (RELM), giving rise to CP-RELM and CP-DELM respectively [32]. Following the spirits of CP-ELM and DP-ELM, some variants are proposed in this paper. In the following, our main contributions are listed as:

where c ¼ cos θ, s ¼ sin θ, and the non-zero entries of G i; j; θ are determined by the following rules 8 g ll ¼ 1; l ai; j > > > > > g ii ¼ c > < g jj ¼ c ð3Þ > > > g ij ¼ s > > > :g ¼s ji

with i oj. The matrix G is unitary, i.e., GTG¼GGT ¼I. The product of rotated in G i; j; θ y indicates that the vector y is counterclockwise the (i, j) plane by θ radians, and ‖G i; j; θ y‖2 ¼ ‖y‖2 , where ‖ U‖2 designates the Euclidean " #norm. Assume that y is a twoy1 , then the rotation is simply given by dimensional vector, y ¼ y2 " # c s y1 r ¼ ð4Þ y2 s c 0 where r ¼

qﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ y21 þ y22 , c ¼ yr1 , and s ¼ yr2 .

2.2. Householder transformation 1) Acceleration of the CP-ELM, yielding the improved CP-ELM (ICPELM), by using the Householder transformation as a replacement of Givens rotation in SPO. This results in expediting the training speed of CP-ELM while keeping the same performance on prediction accuracy and the number of hidden nodes. 2) Combining concepts of CP-ELM and DP-ELM to yield a novel algorithm, referred to as the constructive-destructive parsimonious ELM (CDP-ELM), for the goal of obtaining better sparseness and faster training speed. In comparison with the original CP-ELM and DP-ELM, CDP-ELM performs better in terms of the training speed and the number of hidden nodes while retaining nearly the same prediction accuracy. The paper is organized as follows. Some preliminaries on Givens rotation and Householder transformation are brieﬂy introduced in Section 2. In Section 3, brief reviews of ELM, CP-ELM, and DP-ELM are given. In the following, ICP-ELM is obtained on the basis of the mechanism that Givens rotation used in CP-ELM is replaced with Householder transformation, and combining ideas of CP-ELM and DP-ELM yields CDP-ELM. In Section 5, the usefulness and powerfulness of the proposed algorithms, viz. ICP-ELM and CDP-ELM, are tested on twelve benchmark regression data sets and a mechanical system identiﬁcation problem. Finally, conclusions follow.

2. Preliminaries 2.1. Givens rotation Givens rotation is a rotation in the plane spanned by two coordinates axes, named after the contribution of Wallace Givens [33]. Givens rotation is an elementary matrix, which can be expressed as 2

1 6⋮ 6 6 60 6 6 G i; j; θ ¼ 6 ⋮ 6 60 6 6 4⋮ 0

⋯

0

⋯

0

⋯

⋱

⋮

⋱

⋮

⋱

⋯

c

⋯

s

⋮

⋱ ⋯

⋮ s

⋱ ⋯

⋮ c

⋱ ⋯

⋱

⋮

⋱

⋮

⋱

⋯

0

…

0

⋯

0

3

⋮7 7 7 07 7 7 ⋮7 7 07 7 7 ⋮5 1

ð2Þ

The Householder matrix is an elementary reﬂector [34,35], which is a rank-one modiﬁcation of the identity matrix and has the canonical form ð5Þ P ¼ I 2vvT = vT v where v is a given column vector. If P is a Householder transformation matrix, it meets PT ¼ P and P 1 ¼ PT . The Householder matrix is typically used in numerical algorithms for the construction of orthogonal bases for which problems take forms amenable to simple solutions. From a computational viewpoint, the Householder transformation can annihilate selected elements of vectors or matrix in an isometric mapping mode, that is, Py ¼ ‖y‖2 ek

ð6Þ

where v ¼ y ‖y‖2 ek , ek is the kth column of identity matrix. 2.3. Comparison of Givens rotation and Householder transformation Assume that a column vector y is of size N, i.e., h iT y ¼ y1 ⋯ yN . If we want to transform it into the form ‖y‖2 ek ðk A f1; ⋯; NgÞ using the Givens rotation and Householder transformation, respectively, their computational complexity is: 2.3.1. Givens rotation In this case, only one element is eliminated from the vector y when multiplied by one Givens matrix. Therefore, N 1 Givens matrixes are required to be constructed in total. To construct one Givens matrix, two multiplications, two divisions, one addition and one square root are needed as indicated in Eq. (4). The product between G and y takes 4 multiplications and 2 additions. 2.3.2. Householder transformation First, a vector v in Eq. (5) needs to be constructed, which is given by [36] ð7Þ v ¼ y þsgn yk ‖y‖2 ek where yk is the kth element of y. In this step, calculating ‖y‖2 needs N multiplications, N 1 additions, and one square root. In the process of encoding the algorithm, let v’y and vk ’sgn yk ‖y‖2 , where vk is the kth entry of v, so only one multiplication is need. If yk is equal to zero, the plus sign is adopted.

Please cite this article as: Y.-P. Zhao, R. Huerta, Improvements on parsimonious extreme learning machine using recursive orthogonal least squares, Neurocomputing (2016), http://dx.doi.org/10.1016/j.neucom.2016.01.005i

Y.-P. Zhao, R. Huerta / Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎

H. According the conclusions in [17], if the N training samples are distinct, H is of full-column rank with probability one when L rN. In many real-world problems, it is very easily satisﬁed that the number of hidden nodes is not larger than the number of training samples.

Table 1 Computational comparison between Givens and Householder. Operation

Number of operations

Multiplications Additions divisions Square roots

Givens

Householder

6(N 1) 3(N 1) 2(N 1) N1

3N þ 2 3N 1 1 1

3.2. CP-ELM Aiming at a sequentially learning problem, by virtue of SPO with Givens rotation in [31], the following equation can be obtained " # R TIARE Q L ⋯Q 1 ½H; T ¼ ð12Þ 0ðN LÞL TIRE

Hence, vT y Py ¼ y y2 y2 þ y v k

ð8Þ

Note that ‖y‖2 is calculated in Eq. (7). Here, it is easily got that Eq. (8) needs 2N þ 1 multiplications, 2N additions, one division. Table 1 sums up those costs above. From this table, it is easily observed that for a large N the Householder transformation reduces the computational complexity compared with the Givens rotation to realize (6).

3. ELM, CP-ELM and DP-ELM

3.1. ELM ELM is a simple and efﬁcient SLFN, which generates the hidden node parameters ai and bi randomly for Eq. (1). Then, given a

N training set of ðxi ; t i Þ i ¼ 1 A ℜn ℜm , if the outputs of SLFN are equal to the corresponding targets, thus we have L X

βi hðai ; bi ; xi Þ;

i ¼ 1; ⋯; N

ð9Þ

i¼1

Eq. (9) can be equivalently expressed in an economic size as Hβ ¼ T

ð10Þ

where 2 hða1 ; b1 ; x1 Þ 6 6 hða1 ; b1 ; x2 Þ H¼6 6 ⋮ 4

hða1 ; b1 ; xN Þ

2

βT

⋯ ⋯ ⋱ ⋯

hðaL ; bL ; x1 Þ

3

7 hðaL ; bL ; x2 Þ 7 7 7 ⋮ 5 hðaL ; bL ; xN Þ

¼ h1 ; h2 ; ⋯; hL ;

3

Lm

h with βi ¼ β i;1

2 ⋯

iT

t T1

3

6 T7 6 t2 7 7. Here, H is the so-called 7 4 ⋮ 5

βi;m , T ¼ 6 6

where H† ¼ HT H

1

ð14Þ

T where ‖ U ‖F is the Frobenius norm deﬁned as ¼ trace T T . Since Eqs. (10) and (13) have the same solution, hence sparsifying the solution of Eq. (10) is equivalent to ﬁnding a parsimonious solution for (13). As a constructive algorithm, CP-ELM starts with no regressors, viz. the initial regressor matrix R ðr0Þ ¼ ∅, an empty index set S ¼ ∅, a full index set P ¼ f1; ⋯; Lg, and a full candidate regressor matrix Rðc0Þ ¼ R. At the ith iteration, R ðri 1Þ recruits the most signiﬁcant regressor r ðsi 1Þ ðs A P Þ in an incremental mode from the current candidate matrix R ðci 1Þ , and then they are updated as h i 8 < Rðri 1Þ ’ Rðri 1Þ ; r ðsi 1Þ ð15Þ : Rði 1Þ ’R ði 1Þ =r ði 1Þ s c c ‖T‖2F

where r ðsi 1Þ A ℜL . Then, the regressor matrix R ðri 1Þ is triangularized together with the candidate matrix Rðci 1Þ using Givens orthogonal transformation as h i h i Q ðiÞ R ðri 1Þ ; R ðci 1Þ ¼ R ðriÞ ; RðciÞ ð16Þ " # R ðiÞ where R ðriÞ ¼ , R ðiÞ A ℜii is an upper triangular matrix, 0ðL iÞi

ðiÞ ð0Þ where TðiÞ A ℜim , Tð0Þ ¼ ∅, T^ A ℜðL iÞm , T^ ¼ TIARE . Based on the following theorem, a series of signiﬁcant regressors can be chosen from the candidate regressor matrix Rc until the stopping criterion is satisﬁed.

ð11Þ HT is the Moore-Penrose generalized inverse of

where ji is the index of last nonzero element of r ðsi 1Þ , subscript i ji

hidden layer output matrix. Hence, training ELM simply amounts to getting the solution of the linear system (10) with respect to the output weight matrix β, yielding

ð13Þ

Notice that the solution of Eq. (13) is very easily obtained via backward substitutions, because R is an upper triangular matrix. In addition, here TIRE is referred to as the initial residual error (IRE) and TIARE is the initial additional residual error (IARE), whose relationship is shown as

Theorem 1. [31]: The ith regressor r ðsi 1Þ selection amounts to the following constructive SPO (CSPO) implemented by partial Givens row rotations: 2

T 3 ðiÞ t ðisiÞ h i 6 r ii 7 ðiÞ ði 1Þ ði 1Þ 6 7 ð18Þ Q i j r s;i j ; T1 j i þ 1 ¼ 4 5 ðiÞ i i i ^ 0ðji iÞ1 T 1 ji i

t TN

β ¼ H† T

Rβ ¼ TIARE

Rð0Þ ¼ ∅, Q ðiÞ is an orthogonal matrix derived from Givens rotations. Moreover, " # " # Tði 1Þ TðiÞ Q ðiÞ ði 1Þ ¼ ð17Þ ðiÞ T^ T^

NL

6 17 6 T7 β 7 β¼6 6 27 6 ⋮ 7 4 5

βTL

where R is an upper triangular matrix, Q k ; k ¼ 1; ⋯; L are orthogonal matrices satisfying Q Tk Q k ¼ Q k Q Tk ¼ I. The solution of Eq. (10) is consistent with β of

‖TIRE ‖2F ¼ ‖T‖2F ‖TIARE ‖2F

This section brieﬂy describes the ELM [4,5], CP-ELM and DPELM, which are pertinent to our improved algorithms. The details on CP-ELM and DP-ELM are elaborated on in [31]. Here we only outline their key steps closely related to our algorithms.

ti ¼

3

Please cite this article as: Y.-P. Zhao, R. Huerta, Improvements on parsimonious extreme learning machine using recursive orthogonal least squares, Neurocomputing (2016), http://dx.doi.org/10.1016/j.neucom.2016.01.005i

Y.-P. Zhao, R. Huerta / Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎

4

T denote rows from i to ji, ji Zi, r ðiiiÞ and t ðisiÞ are the ith diagonal of R ðriÞ Þ and the ith row of TðiÞ , respectively, and the partial matrix Q ði i ji

T T Þ ðiÞ ðiÞ ðiÞ meets Q ði i Q ¼ Q Q ¼ I and is given by ji i ji i ji i ji 2 3 Iði 1Þði 1Þ 6 7 ð i Þ Qi j 7 ð19Þ Q ðiÞ ¼ 6 i 4 5 IðL ji ÞðL ji Þ

jAP

‖t ðijiÞ ‖2F

ð20Þ

‖TIARE ‖2F

2 I i1 i1 ¼4

3 Q ðiÞ

5

ð27Þ

i

The feasibility of Eq. (26) is given by corollary 3 in [31].

s ¼ argmin jAP

Then, let S’S [ fsg, and P’P=fsg. The stopping criterion of CP-ELM can be determined by ‖TðiÞ ‖2F 1 o ‖TIARE ‖2F

Q

ðiÞ

Hence, the removal criterion of regressor in DP-ELM is shown

Hence, the selection criterion of CP-ELM is given as s ¼ argmax

submatrix of Q ðiÞ , shown as

η

ð21Þ

‖t ðijiÞ ‖2F

ð28Þ

‖TIARE ‖2F

Here, the least signiﬁcant regressor is chosen and removed iteratively, and let P’P=fsg. Naturally, the stopping criterion of DP-ELM is obtained

where the predeﬁned η{1 is a very small positive number. When the CP-ELM terminates, the optimal β is easily obtained from the following equation

‖TðiÞ ‖2F r1ρ ‖TIARE ‖2F

R ðiÞ β ¼ TðiÞ

where ρ is a small positive number, which can control the number of the removed regressors. When the stopping criterion is met, DPðiÞ ELM will terminate. The weight matrix β is found from RðiÞ β ¼ Τ . To date, DP-ELM has been brieﬂy completed.

ð22Þ

via backward substitutions. Up to now, the algorithm CP-ELM is ﬁnished.

ð29Þ

3.3. DP-ELM Unlike CP-ELM, DP-ELM starts with a full regressors, i.e., Rð0Þ ¼ R and Tð0Þ ¼ TIARE , and a full index set P ¼ f1; ⋯; Lg. At the ith iteration, the candidate regressor matrix R ðciÞ is obtained by eliminating the least signiﬁcant regressor from the previous regressor matrix Rði 1Þ , that is, R ðciÞ ’Rði 1Þ =r ðsi 1Þ

ð23Þ

where s A P represents the original index of the removal regressor, r ðsi 1Þ is the removal regressor from R ði 1Þ . Then, the regressor matrix RðciÞ is recursively triangularized by Givens rotation as " # R ðiÞ ðiÞ ðiÞ ð24Þ Q Rc ¼ 01ðL iÞ and ðiÞ ði 1Þ

Q T

2 ¼4

3 TðiÞ T 5 t ðisiÞ

ð25Þ

T T where Q ðiÞ Q ðiÞ ¼ Q ðiÞ Q ðiÞ ¼ I, R ðiÞ is an upper triangular

T is the last row of the transformed output. Actually, matrix, t ðisiÞ Eqs. (24) and (25) can be realized in a simple and efﬁcient way with destructive SPO (DSPO) using partial Givens rotations, that is, 2 3 RðiÞ TðiÞ h i i i 6

T 7 ð26Þ Q ðiÞ R ðiÞ ; Tði 1Þ ¼ 4 5 i i c;i 01L i i þ 1 t ðisiÞ where i is the column index of the removal regressor from Rði 1Þ , R ðiÞ and RðiÞ are the submatrices consisting of elements from ith c;i

i

row and ith column to the end of and TðiÞ are i ði 1Þ

of T

RðciÞ

ðiÞ

and R , respectively,

Tði 1Þ i

the submatrices conﬁning to rows from ith to the end ðiÞ

and T , respectively. The orthogonal matrix

Q ðiÞ i

is also a

4. The proposed ICP-ELM and CDP-ELM 4.1. ICP-ELM According to comparisons between Givens rotation and Householder transformation in Section 2.3, it is observed that Householder transformation needs less computational complexity than Givens rotation when realizing the same transformation on a given vector. Hence, we can replace Givens rotation with Householder transformation to accelerate the training of CP-ELM. In CP-ELM, the key step is to compute Eq. (18) using partial Givens rotations. If Householder transformation is utilized to accelerate Eq. (18) instead of Givens rotations, similar to Theorem 1 the following theorem is obtained as Theorem 2. The ith regressor r ðsi 1Þ selection is equivalent to the following constructive SPO (CSPO) implemented by partial Householder row transformation: 2

T 3 ðiÞ t ðisiÞ h i 6 r ii 7 Þ ði 1Þ ði 1Þ 6 7 Q ði i ð30Þ ¼ r ; T ji s;i ji 1 ji i þ 1 4 5 ðiÞ ^ 0ðji iÞ1 T 1 ji i where ji is the index of the last nonzero elements of r ðsi 1Þ , sub T script i ji denote rows from i to ji, ji Z i, r ðiiiÞ and t ðisiÞ are the ith diagonal of RðriÞ in (16) and the ith row of TðiÞ in (17), respectively,

T T Þ ðiÞ ðiÞ Þ Þ and the partial matrix Q ði i ¼ Q ði i Q ði i j meets Q i j Q i j j j i

i

i

i

i

¼ I and is given by Þ Q ði i j ¼ I2 i

vvT vT v

ð31Þ

Please cite this article as: Y.-P. Zhao, R. Huerta, Improvements on parsimonious extreme learning machine using recursive orthogonal least squares, Neurocomputing (2016), http://dx.doi.org/10.1016/j.neucom.2016.01.005i

Y.-P. Zhao, R. Huerta / Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎ Table 2 Speciﬁcations on each benchmark data set. Data sets

#Attributes

#Outputs

#Training

#Testing

Energy efﬁciency Sml2010 Parkinsons Yacht Stock Concrete Mg Airfoil Space_ga Winequality white Kinematics Puma32H Cpu_small CCPP Ailerons

8 16 16 6 9 8 6 5 6 11 8 32 12 4 40

2 2 2 1 1 1 1 1 1 1 1 1 1 1 1

400 2764 3000 170 500 550 700 900 1800 2000 4200 4500 5000 5000 7154

368 1373 2875 138 450 455 685 603 1307 1961 3992 3692 3192 4527 6596

0.1

RMSE

0.09

0.08

0.07

0.06

0.05

0

50

100

150

#Hidden nodes

Fig. 1. The process of determining #hidden nodes for ELM with RBF via crossvalidation on Energy efﬁciency.

‖r ðs;ii 1Þj ‖2 e1 , where r ðs;ii 1Þj with v ¼ r ðs;ii 1Þj þ sgn r ðs;ii 1Þj is the i

i

1

i

i

ﬁrst element of r ðs;ii 1Þj , and i

8 > 1; r ðs;ii 1Þj 4 0 <

> i 1

¼ sgn r ðs;ii 1Þj ði 1Þ i 1 > > r0 : 1; sgn r s;i j i

simultaneously are also cut down. In CP-ELM, Eq. (18) is frequently used to determine the most signiﬁcant regressor r ðsi 1Þ from the candidate regressors matrix R ðci 1Þ , so it is wise to use the Theorem 2 replacing Theorem 1 to accelerate CP-ELM. To distinguish the original CP-ELM, here the algorithm using Theorem 2 is referred as an improved version, viz. ICP-ELM. ICP-ELM can obtain the same generalization performance as CP-ELM, meantime needing fewer training computational costs. 4.2. CDP-ELM

0.11

0.04

5

1

ð32Þ

1

Proof. This proof is similar to the proof of Theorem 3 in [31]. □ The only difference between Theorems 1 and 2 is that Theorem 1 Þ utilizes a sequence of Givens rotations to construct Q ði i ji , but Theorem 2 adopting Householder transformation. 2 3 If we want to r ðiiiÞ 5 in an isometric change the column vector r ðl i;i1Þj into4 i i 0ðji iÞ1 mode, both Givens rotation and Householder transformation can realize it. However, Givens rotation takes more computations than Householder transformation from Table 1. Similarly, the computational costs of changing Tð1i 1j Þ i þ 1 into i 2 3 ðiÞ T t 6 7 4 ðiÞ 5 ^ T 1 ji i

Notice that CP-ELM gains in training speed but loses advantage in the number of hidden nodes in comparison with DP-ELM. Although the proposed ICP-ELM accelerates the training speed of CP-ELM further, it does not cut down the number of hidden nodes while maintaining the generalization performance. Hence it is fair to say that both ICP-ELM and DP-ELM have their fair share of advantages and disadvantages. So it may be a good idea to integrate them so that we get the best of both worlds. In other words, we tend to pick up their advantages but discard their disadvantages. Integrating ICP-ELM with DP-ELM yields CDP-ELM. Compared with both DP-ELM and CP-ELM, CDP is superior with respect to the training time and the number of hidden nodes. Since the constructive algorithms, CP-ELM and ICP-ELM continue to select so-called signiﬁcant regressors until the criterion of Eq. (21) is satisﬁed, they are actually sequential forward greedy algorithms. DP-ELM, on the other hand, is a backward greedy algorithm. Sequential forward greedy algorithms and backward greedy ones do not necessarily ﬁnd an optimal solution for ELM, because they are all suboptimal search schemes [37]. Clearly, choosing the desired optimal regressors out of L candidate choices for ELM involves a combinatorial search, which is an NP-hard problem. Hence, heuristic approaches that approximate the optimal solution are unavoidable. In CP-ELM and ICP-ELM, those hidden nodes incurring the additional residual error reductions maximally will be chosen as signiﬁcant regressors, whose purpose is to generate the largest additional residual error reduction with as few hidden nodes as possible. There is an obvious shortcoming to do this. This is, if we want to continue reducing the additional residual error further, the number of so-called signiﬁcant regressors must increase. Hence, a strategy of combining the ideas of CP-ELM and DP-ELM together is proposed here, which may reduce the additional residual error further without the necessity of increasing the number of signiﬁcant regressors. Thus, the aim of inducing the same additional residual error with fewer signiﬁcant regressors is reached. When at the ith iteration a signiﬁcant regressor r ðsi 1Þ is chosen i according to Eq. (20), so far the matrixes RðriÞ in (16) " including #

TðiÞ in (17) signiﬁcant regressors r l ; ⋯; r k ; ⋯; r s l;⋯;k;⋯s A S and ðiÞ T^ are obtained, and R ðriÞ is an upper triangular matrix. At this moment, if the regressor r k is removed from R ðriÞ ðiÞ R~ r ’R ðriÞ =r k ; k A S

ð33Þ

ðiÞ R~ r

is recursively retriangularized by Givens rotations 3 ~ ðiÞ R ðiÞ ðiÞ 5 ð34Þ Q~ R~ r ¼ 4 0ðL i þ 1Þði 1Þ Then

2

and 2

iÞ Tðim ðiÞ ~ 4 Q ðiÞ T^ ðL iÞm

2

3 ðiÞ T~ 6 ði 1Þm 7 6 ðiÞ T 7 7 5¼6 6 t~ ik 7 6 7 4 ðiÞ 5 ^ T ðL iÞm 3

ðiÞ is the orthogonal matrix satisfying where Q~

ð35Þ

ðiÞ Q~

T

ðiÞ Q~ ¼

Please cite this article as: Y.-P. Zhao, R. Huerta, Improvements on parsimonious extreme learning machine using recursive orthogonal least squares, Neurocomputing (2016), http://dx.doi.org/10.1016/j.neucom.2016.01.005i

Y.-P. Zhao, R. Huerta / Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎

6

ðiÞ T

ðiÞ T ðiÞ ðiÞ ¼ ILL , R~ is an upper triangular matrix, ‖ t~ ik ‖2F is Q~ Q~ the increase of the additional residual error due to the removal of ðiÞ the regressor r . Because Q~ consists of a series of Givens rotation

3) Choose the maximum number of circles C max , and initialize q ¼0;

n 4) Get the training data pairs ðxk ; t k Þ k ¼ 1 A ℜn ℜm sequentially; 5) The recursive orthogonal least square strategy [39,40] is used to orthogonally transform the hidden node output matrix H and target matrix T into R and TIARE given by (13), respectively;

k

iÞ without matrixes such as (2), it will only affect the matrix Tðim ðiÞ ^ effects on TðL iÞm . From Eq. (26), Eqs. (34) and (35) can be obtained economically by DSPO with partial Givens rotations 2 3 ðiÞ ðiÞ R~ j T~ j h ðiÞ i ðiÞ 6 ð i Þ

ðiÞ T 7 ð36Þ Q~ j R~ r;j i ; Tj ¼ 4 5 t~ 01ði jÞ ik

ðiÞ 3where j is the column index of the regressor r k , R~ r;j i is the submatrice consisting of elements from jth to ith row and from jth colðiÞ ðiÞ umn to the end of R~ , R~ is the submatrice composed of elements r

j

ðiÞ ðiÞ from jth row and jth column to the end of R~ , Tðj iÞ and T~ j are the ðiÞ iÞ and T~ ði 1Þm , submatrices including rows from jth to the end of Tðim respectively. According to Eq. (36), the additional residual error (ARE) metric for each regressor in R ðriÞ is found as ðiÞ T 2 ; kAS ~ ð37Þ AREðr k Þ ¼ F t ik

Hence, we can seek the regressor r s† with the minimum ARE via s† ¼ arg min AREðr k Þ kAS

ð38Þ

In the following, two cases will be faced 1) s† ¼ s In this case, the regressor r s incurs the least ARE among

r l ; ⋯; r k ; ⋯; r s . If we want to continue reducing the additional residual error, the number of the selected regressors has to increase. 2) s† a s For this case, the regressor r s† can be removed from the regressor group r l ; ⋯; r k ; ⋯; r s , which amounts to the regressor r s† replaced by r s . There are two beneﬁts to do this. One is that the additional residual error continues decreasing, which usually indicates that the generalization performance on ELM can be improved further. Another one is that the number of hidden nodes remains constant. This replacement principle is consistent with the Occam's razor “plurality should not be posited without necessity” [38]. Subsequently, let 8 2 3 > ~ ðiÞ > R > ð i 1 Þ 5 >R ’4 > r > > 0ðL i þ 1Þði 1Þ > > > > > > ði 1Þ ~ ðiÞ > > ’T ði 1Þm > > ð i 1 Þ 6 is† 7 > T^ > ’4 5 > > ðiÞ > > T^ ðL iÞm > > > > > h i > ðiÞ > > RðciÞ ; r s† : Rðci 1Þ ’Q~

Then, S’S= s† , P’P [ s† . The stopping criterion of CDP-ELM can adopt the same as that of CP-ELM, i.e. Eq. (21), or the desired number of hidden nodes is reached. In summary, the ﬂowchart of CDP-ELM is shown as Algorithm 1. CDP-ELM 1. Initialization: 1) Choose the type of hidden nodes, and randomly generate L hidden nodes; 2) Choose a predeﬁned positive small value η or the desired number of hidden nodes M;

6) Let R ðr0Þ ¼ ∅, Rðc0Þ ¼ R, S ¼ ∅, P ¼ f1; ⋯; Lg, and initialize i¼1. 2. If Eq. (21) is satisﬁed or i 4 M 3. Go to step 23; 4. Else 5.

Get t ðijiÞ ðj A P Þ using Theorem 2;

6.

Get r ðsi 1Þ according to Eq. (20);

7.

Obtain R ðri 1Þ and Rðci 1Þ according to Eq. (15);

8.

Get RðriÞ and R ðciÞ according to Eq. (16);

9. 10. 11. 12. 13. 14. 15. 16.

ðiÞ Get TðiÞ and T^ according to Eq. (17); Let S’S [ fsg, and P’P=fsg; ðiÞ Obtain R~ r from Eq. (33);

ðiÞ T from Eq. (36); Get t~ ik

Get r s† according to (38); If i ¼ 1 or s† ¼ s or q ¼ C max i’i þ 1, q ¼ 0, go to step 2; Else

ði 1Þ according to Eq. 17. Obtain R ðri 1Þ , R ðci 1Þ , Tði 1Þ , and T^ (39);

18. Let S’S= s† , andP’P [ s† ; 19. q’q þ 1; 20. Go to step 2; 21. End if 22. End if 23. Obtain β solving Eq. (22) with backward substitutions; P βi hðai ; bi ; xÞ 24. Output CDP-ELM: f ðxÞ ¼ iAS

Remark 1. To avoid the dead lock phenomenon appearing in CDPELM, there is a positive integer C max to help escaping it. Generally, the dead lock rarely appears. As for the choice of C max , C max ¼ 50 is enough in our experiments. Remark 2. Since there is a replacement mechanism in CDP-ELM, compared to ICP-ELM, CDP-ELM, there is need of additional computational costs. If C max is the average number when recruiting a new regressor, the computational complexity of CDP-ELM is C max times as high as that of ICP-ELM in theory. In practice, this computational relationship between them does not hold, because CDPELM usually needs fewer hidden nodes reaching nearly the same generalization performance.

5. Experiments To show the validity of the proposed ICP-ELM and CDP-ELM, in this section, we present a comparison of ELM, CP-ELM, DP-ELM, ICP-ELM, and CDP-ELM on ﬁfteen benchmark regression data sets. For each data set, their attributes are normalized into the closed interval [ 1, 1], and their outputs are normalized into [0, 1] [5]. Then, a mechanical system identiﬁcation problem on the inverse dynamics of a ﬂexible robot arm is modeled using the involved algorithms in this paper. These

Please cite this article as: Y.-P. Zhao, R. Huerta, Improvements on parsimonious extreme learning machine using recursive orthogonal least squares, Neurocomputing (2016), http://dx.doi.org/10.1016/j.neucom.2016.01.005i

Y.-P. Zhao, R. Huerta / Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎

0.090

0.10

0.085

ELM CP-ELM ICP-ELM DP-ELM CDP-ELM

0.075

ELM CP-ELM ICP-ELM DP-ELM CDP-ELM

0.09

0.08

RMSE

RMSE

0.080

0.070

0.07

0.06

0.065

0.05

0.060

0.055

7

10

20

30

40

50

60

70

80

90

0.04

10

20

30

#Hidden nodes

40

50

60

70

80

90

#Hidden nodes

Fig. 2. RMSE vs. #hidden nodes on Energy efﬁciency (a) sigmoid (b) RBF.

experiments are run in MATLAB R2013b and the same PC with Intels Core™ i3-2310M processor and 2G RAM. To obtain robust statistical results, thirty different trials are conducted for each instance. In addition, for the purpose of comparison one performance index, i.e., the rooted mean squared errors (RMSE), is deﬁned as vﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ u u#Testing u P ‖t^ t ‖2 i i F u t i¼1 ð40Þ RMSE ¼ #Testing m where t^ i denotes the prediction value of the desired t i , #Testing is the total number of the testing samples. A smaller RMSE usually means a better prediction performance for an algorithm. In order to have a fair comparison, the number of hidden nodes (#hidden nodes) of ELM is near optimally determined from the set f10; 20; ⋯; 150g by cross-validation method (please see Fig. 1)1, and then they are extended to other algorithms as the initial candidate hidden nodes. For each algorithm, two typical hidden 1 node output functions are chosen: the sigmoid hðxÞ ¼ ðai U x þ bi Þ 1þe

and the RBF hðxÞ ¼ e bi ‖x ai ‖ . For the case of ELM with sigmoidal hidden nodes, the input weights and hidden biases are randomly chosen from the range [ 1, 1]. For the RBF activation function, the centers are randomly chosen from the range [ 1, 1] whereas the impact factor is chosen from the range (0, 0.5) [4]. 2

5.1. Benchmark data sets There are ﬁfteen data sets consisting of 12 multi-input singleoutput data sets and 3 multi-input 2-output ones. Among them, Energy efﬁciency, Sml2010, Parkinsons, Yacht, Concrete, Airfoil, Winequality white, and CCPP are obtained from the well-known UCI repository.2 For Sml2010, two indoor temperatures are predicted using the other attributes excluding the attributes date, time, and the date of the week. Stock, Kinematics, Puma32H, Cpu_small, and Ailerons are found from the data collection.3 The rest two, viz. Mg and Space_ga, are available in the personal

1 Energy efﬁciency is a benchmark data set. We will introduce it in the following section. 2 http://archive.ics.uci.edu/ml/ 3 http://www.dcc.fc.up.pt/ ltorgo/Regression/DataSets.html

webpage.4 For each data set case, if there are repeated rows (samples) or redundant columns (attributes),5 we will remove them in advance. The details on each data set is tabulated in Table 2. Fig. 2 gives the relationship between RMSE and #hidden nodes on Energy efﬁciency. Here, the RMSE value got by ELM is referred to as the benchmark line. For the other algorithms, it is easily got that RMSE decreases with the increasing number of hidden nodes. When they touch the dashed baseline, their executions are terminated. From here, we know that there are redundant hidden nodes even if there is a near initially optimal selection on hidden nodes for the ELM. The line generated by ICP-ELM overlaps the one produced by CP-ELM, which indicates that CP-ELM and ICP-ELM keep the same generalization performance. In comparison with CP-ELM and ICPELM, CDP need fewer hidden nodes, i.e., the line generated by CDP-ELM is usually beneath those produced by CP-ELM and ICPELM, which demonstrates that the mechanism of integrating DPELM into CP-ELM is feasible and effective. Moreover, CDP-ELM is also superior to DP-ELM in the number of hidden nodes. Due to the similarity or small changes on the experimental results, in order to save space here the other results are not shown. However, the experimental details are all listed in Table 3. By analyzing the results in Table 3, under nearly the same generalization performance, CDP-ELM requires least number of hidden nodes. That is to say, the parsimony ratio of CDP-ELM is best, which potentially means the best real time in the testing phase. The least testing time in Table 3 reveals this point statistically. To be more important, the viewpoint of integrating DP-ELM into CP-ELM is experimentally conﬁrmed. Compared to CP-ELM, DP-ELM needs less #hidden nodes and ICP-ELM needs the same #hidden nodes. Among all the competitors, the naive ELM needs most hidden nodes or testing time, which maybe is not suitable for the scenarios in a strict demand of testing speed, because it does nothing to optimize the architecture compactly. For the results in Table 3, for each algorithm the training RMSE is less than the testing RMSE. This is expected, since the training samples have been learned by the algorithms but the testing 4

http://www.csie.ntu.edu.tw/ cjlin/libsvmtools/datasets/regression.html These attributes own the same values, and then they will make no contributions to the ﬁnal prediction. Hence, we regard them as the redundant attributes. 5

Please cite this article as: Y.-P. Zhao, R. Huerta, Improvements on parsimonious extreme learning machine using recursive orthogonal least squares, Neurocomputing (2016), http://dx.doi.org/10.1016/j.neucom.2016.01.005i

8

Data sets

Energy efﬁciency

Hidden node type

Sigmoid

RBF

Sml2010

Sigmoid

RBF

Parkinsons

Sigmoid

RBF

Yacht

Sigmoid

RBF

Stock

Sigmoid

RBF

Concrete

Sigmoid

Algorithms

ELM CP-ELM ICP-ELM DP-ELM CDP-ELM ELM CP-ELM ICP-ELM DP-ELM CDP-ELM ELM CP-ELM ICP-ELM DP-ELM CDP-ELM ELM CP-ELM ICP-ELM DP-ELM CDP-ELM ELM CP-ELM ICP-ELM DP-ELM CDP-ELM ELM CP-ELM ICP-ELM DP-ELM CDP-ELM ELM CP-ELM ICP-ELM DP-ELM CDP-ELM ELM CP-ELM ICP-ELM DP-ELM CDP-ELM ELM CP-ELM ICP-ELM DP-ELM CDP-ELM ELM CP-ELM ICP-ELM DP-ELM CDP-ELM ELM CP-ELM

RMSE ( Mean 7 Dev.)

# Initial hidden nodes L

Testing

Training

5.925E-02 7 6.423E-03 5.920E-02 7 8.465E-03 5.920E-02 7 8.465E-03 5.949E-02 7 6.168E-03 5.944E-02 7 7.780E-03 4.601E-027 1.966E-03 4.605E-02 7 2.632E-03 4.605E-02 7 2.632E-03 4.613E-027 2.276E-03 4.609E-02 7 2.922E-03 7.280E-02 7 3.573E-03 7.269E-02 7 5.595E-03 7.269E-02 7 5.595E-03 7.283E-02 7 1.001E-02 7.222E-02 7 3.836E-03 8.075E-02 76.305E-03 8.081E-02 7 7.122E-03 8.081E-02 7 7.122E-03 8.095E-02 7 7.662E-03 8.127E-02 7 1.072E-02 2.076E-01 7 2.737E-03 2.085E-01 7 1.338E-03 2.085E-01 7 1.338E-03 2.085E-01 7 1.083E-03 2.086E-01 7 1.091E-03 2.146E-01 7 1.892E-03 2.148E-01 7 1.155E-03 2.148E-01 7 1.155E-03 2.147E-01 7 1.329E-03 2.147E-01 7 1.650E-03 5.955E-02 7 7.196E-03 6.009E-02 7 6.431E-03 6.009E-02 7 6.431E-03 6.012E-02 7 4.887E-03 5.938E-02 7 7.295E-03 3.247E-02 7 1.132E-02 3.093E-02 7 2.467E-03 3.093E-02 7 2.467E-03 3.051E-02 77.906E-03 3.178E-02 7 3.279E-03 3.981E-02 7 1.595E-03 4.006E-02 7 1.524E-03 4.006E-02 7 1.524E-03 3.985E-02 7 1.694E-03 3.991E-02 7 2.147E-03 4.030E-02 7 1.284E-03 4.054E-02 7 1.211E-03 4.054E-02 7 1.211E-03 4.040E-02 7 2.124E-03 4.040E-02 7 1.649E-03 9.341E-02 7 4.305E-03 9.350E-02 73.183E-03

3.278E-02 7 1.292E-03 3.708E-027 1.929E-03 3.708E-027 1.929E-03 3.472E-02 71.501E-03 3.902E-02 72.001E-03 3.053E-02 79.532E-04 3.420E-02 74.916E-04 3.420E-02 74.916E-04 3.468E-02 7 5.715E-04 3.681E-02 7 1.185E-03 3.603E-02 71.622E-03 3.847E-02 7 1.349E-03 3.847E-02 7 1.349E-03 3.902E-02 71.391E-03 4.045E-02 7 1.342E-03 4.062E-02 7 2.422E-03 4.209E-02 72.451E-03 4.209E-02 72.451E-03 4.585E-02 72.722E-03 4.323E-02 72.737E-03 2.023E-01 7 8.077E-04 2.094E-01 7 9.486E-04 2.094E-01 7 9.486E-04 2.088E-01 7 5.851E-04 2.096E-01 7 8.733E-04 2.131E-01 7 1.440E-03 2.172E-01 7 4.674E-04 2.172E-01 7 4.674E-04 2.161E-01 7 1.277E-03 2.165E-01 7 1.076E-03 2.665E-02 7 4.185E-03 3.383E-02 74.587E-03 3.383E-02 74.587E-03 3.665E-02 7 3.624E-03 3.353E-027 4.755E-03 6.498E-03 7 2.189E-04 1.624E-02 7 2.059E-03 1.624E-02 7 2.059E-03 1.066E-02 79.923E-04 1.449E-02 7 1.217E-03 2.675E-027 6.865E-04 3.783E-02 71.589E-03 3.783E-02 71.589E-03 3.692E-02 71.390E-03 3.681E-02 7 1.510E-03 2.233E-027 4.042E-04 3.700E-027 8.927E-04 3.700E-027 8.927E-04 3.693E-02 71.673E-03 3.718E-02 7 1.029E-03 7.254E-02 7 1.227E-03 7.495E-02 7 1.502E-03

140 140 140 140 140 130 130 130 130 130 60 60 60 60 60 60 60 60 60 60 80 80 80 80 80 50 50 50 50 50 60 60 60 60 60 80 80 80 80 80 110 110 110 110 110 140 140 140 140 140 90 90

# Final hidden nodes M

140 92 92 93 71 130 88 88 72 65 60 31 31 23 19 60 41 41 23 30 80 29 29 26 23 50 18 18 19 17 60 43 43 35 38 80 59 59 55 52 110 28 28 26 24 140 24 24 26 20 90 61

Parsimony ratio M/L

1.0000 0.6571 0.6571 0.6643 0.5071 1.0000 0.6769 0.6769 0.5538 0.5000 1.0000 0.5167 0.5167 0.3833 0.3167 1.0000 0.6833 0.6833 0.3833 0.5000 1.0000 0.3625 0.3625 0.3250 0.2875 1.0000 0.3600 0.3600 0.3800 0.3400 1.0000 0.7167 0.7167 0.5833 0.6333 1.0000 0.7375 0.7375 0.6875 0.6500 1.0000 0.2545 0.2545 0.2364 0.2182 1.0000 0.1714 0.1714 0.1857 0.1429 1.0000 0.6778

Training time (s)

Testing time (s)

Mean7 Dev.

Mean 7 Dev.

0.01117 0.0005 4.8168 7 0.0611 0.3925 7 0.0040 4.9985 7 0.0620 2.71547 0.1905 0.2365 7 0.0059 4.35817 0.1359 0.5940 7 0.0289 4.86917 0.0584 2.5222 7 0.2543 0.0205 7 0.0020 1.81147 0.0563 1.4681 7 0.0073 1.81277 0.0121 1.52617 0.0334 0.80057 0.0647 2.56737 0.0422 2.19637 0.0136 2.58447 0.0359 2.3696 7 0.0271 0.0290 7 0.0017 2.92977 0.0603 2.2699 7 0.0105 3.23327 0.0101 2.41527 0.0307 0.70027 0.0452 2.2830 7 0.0825 2.0890 7 0.0211 2.33357 0.0766 2.11027 0.0460 0.00287 0.0025 0.44637 0.0644 0.08337 0.0116 0.3980 7 0.0083 0.45417 0.0284 0.0664 7 0.0024 1.00277 0.0294 0.20727 0.0426 0.83737 0.0134 1.33307 0.1653 0.00857 0.0025 1.4482 7 0.0305 0.14407 0.0114 3.05717 0.0394 0.3053 7 0.0824 0.31127 0.0106 2.37117 0.1244 0.46647 0.0240 6.9630 7 0.0337 0.56167 0.0340 0.00767 0.0019 1.29507 0.0374

0.0038 7 0.0009 0.0020 7 0.0005 0.0020 7 0.0004 0.0022 7 0.0004 0.0016 7 0.0002 0.2194 7 0.0079 0.1661 7 0.0147 0.1652 7 0.0086 0.1320 7 0.0030 0.1144 7 0.0048 0.0151 7 0.0010 0.0076 7 0.0033 0.0076 7 0.0017 0.0071 7 0.0070 0.0068 7 0.0031 0.4157 7 0.0250 0.2732 7 0.0165 0.2600 7 0.0119 0.17357 0.0109 0.1973 7 0.0070 0.0245 7 0.0009 0.01777 0.0015 0.0179 7 0.0033 0.0140 7 0.0018 0.0139 7 0.0032 0.6553 7 0.0236 0.2444 7 0.0176 0.2449 7 0.0051 0.2564 7 0.0063 0.2306 7 0.0094 0.0012 7 0.0009 0.0009 7 0.0059 0.0009 7 0.0005 0.00077 0.0059 0.0008 7 0.0003 0.0527 7 0.0025 0.0458 7 0.0069 0.0453 7 0.0058 0.0448 7 0.0065 0.0374 7 0.0030 0.0131 7 0.0009 0.0028 7 0.0013 0.0023 7 0.0003 0.0018 7 0.0044 0.0013 7 0.0009 0.2873 7 0.0201 0.0533 7 0.0033 0.0536 7 0.0079 0.0613 7 0.0041 0.0467 7 0.0050 0.0204 7 0.0006 0.0126 7 0.0006

Y.-P. Zhao, R. Huerta / Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎

Please cite this article as: Y.-P. Zhao, R. Huerta, Improvements on parsimonious extreme learning machine using recursive orthogonal least squares, Neurocomputing (2016), http://dx.doi.org/10.1016/j.neucom.2016.01.005i

Table 3 Performance comparison among different algorithms on benchmark data sets.

Mg

Sigmoid

RBF

Airfoil

Sigmoid

RBF

Space_ga

Sigmoid

RBF

Winequality white

Sigmoid

RBF

Kinematics

Sigmoid

RBF

Puma32H

Sigmoid

7.495E-02 7 1.502E-03 7.354E-027 1.324E-03 7.496E-02 7 1.490E-03 6.337E-02 7 7.670E-04 6.792E-02 7 1.315E-03 6.792E-02 7 1.315E-03 6.702E-02 7 1.420E-03 6.569E-02 7 1.222E-03 1.286E-01 7 9.543E-04 1.327E-01 7 8.051E-04 1.327E-01 7 8.051E-04 1.297E-01 7 7.310E-04 1.323E-01 7 1.026E-03 1.250E-01 7 1.284E-03 1.317E-01 7 4.944E-04 1.317E-01 7 4.944E-04 1.287E-01 7 9.530E-04 1.316E-01 7 3.759E-04 7.491E-02 7 1.088E-03 7.555E-02 71.033E-03 7.555E-02 71.033E-03 7.519E-02 7 1.086E-03 7.592E-02 79.915E-04 7.382E-02 79.564E-04 8.014E-02 7 1.849E-03 8.014E-02 7 1.849E-03 7.451E-02 7 8.702E-04 8.074E-02 71.169E-03 3.228E-02 7 2.371E-04 3.394E-02 7 2.526E-04 3.394E-02 7 2.526E-04 3.393E-02 7 1.588E-04 3.397E-02 72.043E-04 3.289E-02 7 2.132E-04 3.322E-02 7 1.626E-04 3.322E-02 7 1.626E-04 3.320E-02 7 1.332E-04 3.344E-02 7 2.555E-04 1.180E-01 78.015E-04 1.188E-01 76.265E-04 1.188E-01 76.265E-04 1.182E-01 77.323E-04 1.188E-01 76.756E-04 1.169E-01 7 6.924E-04 1.182E-01 75.619E-04 1.182E-01 75.619E-04 1.184E-01 74.378E-04 1.183E-01 74.686E-04 9.409E-02 7 1.967E-03 9.432E-02 7 1.986E-03 9.432E-02 7 1.986E-03 9.431E-02 7 1.984E-03 9.431E-02 7 1.980E-03 9.075E-02 7 1.881E-03 9.191E-02 7 2.081E-03 9.191E-02 7 2.081E-03 9.175E-02 7 1.922E-03 9.184E-02 7 1.940E-03 1.480E-01 7 2.875E-04 1.501E-017 3.081E-04 1.501E-017 3.081E-04 1.484E-01 7 2.944E-04

90 90 90 110 110 110 110 110 60 60 60 60 60 70 70 70 70 70 120 120 120 120 120 110 110 110 110 110 60 60 60 60 60 50 50 50 50 50 50 50 50 50 50 60 60 60 60 60 150 150 150 150 150 150 150 150 150 150 100 100 100 100

61 60 53 110 76 76 61 70 60 30 30 38 27 70 36 36 37 29 120 103 103 93 85 110 70 70 80 54 60 24 24 20 18 50 38 38 30 27 50 30 30 32 23 60 31 31 24 25 150 120 120 111 111 150 114 114 96 96 100 22 22 43

0.6778 0.6667 0.5889 1.0000 0.6909 0.6909 0.5545 0.6364 1.0000 0.5000 0.5000 0.6333 0.4500 1.0000 0.5143 0.5143 0.5286 0.4143 1.0000 0.8583 0.8583 0.7750 0.7083 1.0000 0.6364 0.6364 0.7273 0.4909 1.0000 0.4000 0.4000 0.3333 0.3000 1.0000 0.7600 0.7600 0.6000 0.5400 1.0000 0.6000 0.6000 0.6400 0.4600 1.0000 0.5167 0.5167 0.4000 0.4167 1.0000 0.8000 0.8000 0.7400 0.7400 1.0000 0.7600 0.7600 0.6400 0.6400 1.0000 0.2200 0.2200 0.4300

0.1812 7 0.0051 1.1566 7 0.0082 1.0015 7 0.0402 0.3011 7 0.0741 2.5964 7 0.0132 0.5144 7 0.0089 2.7230 7 0.0240 2.8723 7 0.1509 0.0057 7 0.0013 0.4569 7 0.0357 0.1406 7 0.0104 0.4299 7 0.0314 0.2719 7 0.0317 0.2147 7 0.0052 0.8138 7 0.0394 0.3197 7 0.0108 0.8572 7 0.0230 0.5009 7 0.0300 0.0165 7 0.0023 3.1051 7 0.0498 0.41247 0.0204 2.3228 7 0.0955 4.9162 7 0.5022 0.4820 7 0.0240 2.7348 7 0.0701 0.73507 0.0247 2.3196 7 0.0509 1.9813 7 0.2003 0.0140 7 0.0018 0.9099 7 0.0409 0.6289 7 0.0111 1.0068 7 0.0164 0.6951 7 0.0295 0.3932 7 0.0178 1.1176 7 0.0158 0.9214 7 0.0173 1.0701 7 0.0203 1.0773 7 0.0609 0.0156 7 0.0028 0.8498 7 0.0555 0.6661 7 0.0064 0.7891 7 0.0120 0.7488 7 0.0246 0.5372 7 0.0226 1.6141 7 0.0247 1.3008 7 0.0119 1.6360 7 0.0165 1.4196 7 0.0260 0.0897 7 0.0017 8.4501 7 0.0535 3.1489 7 0.0303 7.5520 7 0.0455 9.9600 7 0.3881 2.7166 7 0.0110 11.0650 7 0.1224 5.7771 7 0.1186 11.2180 7 0.0920 11.5830 7 0.4702 0.0618 7 0.0013 6.9782 7 0.1311 6.1216 7 0.0661 7.9505 7 0.0306

0.01257 0.0001 0.01097 0.0045 0.00217 0.0011 0.21627 0.0065 0.16667 0.0046 0.16397 0.0057 0.13397 0.0033 0.14147 0.0028 0.00377 0.0009 0.00187 0.0004 0.00167 0.0002 0.00267 0.0009 0.00137 0.0009 0.2093 7 0.0055 0.11377 0.0076 0.11487 0.0121 0.12117 0.0098 0.0922 7 0.0069 0.0063 7 0.0034 0.00507 0.0017 0.00537 0.0012 0.0043 7 0.0022 0.00347 0.0014 0.34077 0.0444 0.2090 7 0.0105 0.2025 7 0.0124 0.23717 0.0150 0.15407 0.0054 0.01047 0.0008 0.0040 7 0.0022 0.0043 7 0.0014 0.00397 0.0028 0.00367 0.0018 0.2898 7 0.0155 0.21697 0.0070 0.21887 0.0096 0.17217 0.0119 0.15267 0.0058 0.00777 0.0022 0.00577 0.0043 0.00577 0.0025 0.0063 7 0.0032 0.00527 0.0022 0.5294 7 0.0106 0.27937 0.0110 0.27997 0.0092 0.21387 0.0090 0.2303 7 0.0077 0.04867 0.0017 0.0423 7 0.0096 0.0428 7 0.0039 0.04017 0.0023 0.0407 7 0.0018 2.5503 7 0.0188 1.95617 0.0397 1.96567 0.0454 1.6468 7 0.0156 1.6463 7 0.0098 0.0423 7 0.0007 0.01247 0.0028 0.01287 0.0031 0.02447 0.0002

9

9.350E-02 7 3.183E-03 9.385E-02 7 3.880E-03 9.373E-02 7 3.572E-03 9.111E-02 7 3.009E-03 9.318E-027 2.586E-03 9.318E-027 2.586E-03 9.309E-02 7 2.928E-03 9.305E-02 7 2.862E-03 1.376E-01 7 1.797E-03 1.378E-01 71.642E-03 1.378E-01 71.642E-03 1.379E-01 72.094E-03 1.378E-01 71.166E-03 1.368E-01 7 8.336E-04 1.371E-01 7 1.051E-03 1.371E-01 7 1.051E-03 1.371E-01 7 1.158E-03 1.372E-01 7 9.847E-04 8.591E-02 71.805E-03 8.584E-02 7 1.970E-03 8.584E-02 7 1.970E-03 8.602E-02 7 1.889E-03 8.586E-02 7 1.414E-03 8.200E-02 71.304E-03 8.206E-02 7 1.484E-03 8.206E-02 7 1.484E-03 8.211E-02 71.012E-03 8.196E-02 7 1.329E-03 3.532E-02 7 8.163E-04 3.545E-02 7 7.480E-04 3.545E-02 7 7.480E-04 3.539E-02 7 3.458E-04 3.543E-02 7 3.789E-04 3.479E-02 7 5.115E-04 3.479E-02 7 4.060E-04 3.479E-02 7 4.060E-04 3.476E-02 7 4.307E-04 3.474E-02 7 3.200E-04 1.237E-01 71.012E-03 1.238E-01 7 8.795E-04 1.238E-01 7 8.795E-04 1.238E-01 7 8.558E-04 1.238E-01 7 9.404E-04 1.231E-017 9.268E-04 1.230E-01 7 3.406E-04 1.230E-01 7 3.406E-04 1.230E-01 7 6.630E-04 1.231E-017 6.519E-04 9.982E-02 7 2.110E-03 9.996E-02 7 2.100E-03 9.996E-02 7 2.100E-03 9.998E-02 7 2.117E-03 9.995E-02 7 2.104E-03 9.807E-02 7 1.908E-03 9.902E-02 7 2.105E-03 9.902E-02 7 2.105E-03 9.901E-02 71.874E-03 9.906E-02 7 1.951E-03 1.557E-01 74.502E-04 1.558E-01 7 3.556E-04 1.558E-01 7 3.556E-04 1.559E-01 7 4.778E-04

Y.-P. Zhao, R. Huerta / Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎

Please cite this article as: Y.-P. Zhao, R. Huerta, Improvements on parsimonious extreme learning machine using recursive orthogonal least squares, Neurocomputing (2016), http://dx.doi.org/10.1016/j.neucom.2016.01.005i

RBF

ICP-ELM DP-ELM CDP-ELM ELM CP-ELM ICP-ELM DP-ELM CDP-ELM ELM CP-ELM ICP-ELM DP-ELM CDP-ELM ELM CP-ELM ICP-ELM DP-ELM CDP-ELM ELM CP-ELM ICP-ELM DP-ELM CDP-ELM ELM CP-ELM ICP-ELM DP-ELM CDP-ELM ELM CP-ELM ICP-ELM DP-ELM CDP-ELM ELM CP-ELM ICP-ELM DP-ELM CDP-ELM ELM CP-ELM ICP-ELM DP-ELM CDP-ELM ELM CP-ELM ICP-ELM DP-ELM CDP-ELM ELM CP-ELM ICP-ELM DP-ELM CDP-ELM ELM CP-ELM ICP-ELM DP-ELM CDP-ELM ELM CP-ELM ICP-ELM DP-ELM

10

Data sets

Hidden node type

RBF

Cpu_small

Sigmoid

RBF

CCPP

Sigmoid

RBF

Ailerons

Sigmoid

RBF

Algorithms

CDP-ELM ELM CP-ELM ICP-ELM DP-ELM CDP-ELM ELM CP-ELM ICP-ELM DP-ELM CDP-ELM ELM CP-ELM ICP-ELM DP-ELM CDP-ELM ELM CP-ELM ICP-ELM DP-ELM CDP-ELM ELM CP-ELM ICP-ELM DP-ELM CDP-ELM ELM CP-ELM ICP-ELM DP-ELM CDP-ELM ELM CP-ELM ICP-ELM DP-ELM CDP-ELM

RMSE ( Mean 7 Dev.)

# Initial hidden nodes L

Testing

Training

1.558E-01 7 5.359E-04 1.555E-01 7 5.267E-04 1.562E-01 7 4.159E-04 1.562E-01 7 4.159E-04 1.561E-01 74.847E-04 1.562E-01 7 7.363E-04 3.761E-02 7 2.436E-03 3.773E-02 71.021E-03 3.773E-02 71.021E-03 3.797E-02 7 1.503E-03 3.788E-02 7 8.984E-04 5.272E-02 7 7.082E-03 5.288E-02 7 7.349E-03 5.288E-02 7 7.349E-03 5.284E-02 7 4.398E-03 5.321E-02 7 5.061E-03 5.324E-02 7 2.232E-04 5.328E-02 7 1.861E-04 5.328E-02 7 1.861E-04 5.328E-02 7 1.397E-04 5.328E-02 7 2.696E-04 5.351E-02 7 1.791E-04 5.357E-027 1.140E-04 5.357E-027 1.140E-04 5.355E-02 7 1.269E-04 5.359E-02 7 1.742E-04 4.579E-02 73.828E-04 4.583E-02 7 3.265E-04 4.583E-02 7 3.265E-04 4.582E-02 7 4.607E-04 4.584E-02 7 3.768E-04 4.722E-02 7 1.025E-03 4.722E-02 7 7.577E-04 4.722E-02 7 7.577E-04 4.724E-02 76.177E-04 4.725E-02 7 4.977E-04

1.504E-01 7 3.412E-04 1.464E-01 7 4.377E-04 1.500E-01 74.207E-04 1.500E-01 74.207E-04 1.488E-01 7 2.065E-04 1.496E-01 7 4.200E-04 3.091E-02 7 3.533E-04 3.553E-02 7 6.477E-04 3.553E-02 7 6.477E-04 3.492E-02 7 9.952E-04 3.620E-02 7 6.649E-04 3.220E-02 7 3.195E-04 4.290E-02 7 1.107E-03 4.290E-02 7 1.107E-03 4.664E-02 7 3.164E-03 4.569E-02 7 2.007E-03 5.314E-02 7 7.794E-05 5.337E-02 7 9.243E-05 5.337E-02 7 9.243E-05 5.344E-027 4.652E-05 5.368E-02 7 8.584E-05 5.384E-02 7 7.117E-05 5.409E-02 7 7.242E-05 5.409E-02 7 7.242E-05 5.447E-02 71.072E-04 5.407E-02 7 8.068E-05 4.394E-02 7 1.292E-04 4.506E-02 7 2.597E-04 4.506E-02 7 2.597E-04 4.487E-02 7 2.080E-04 4.512E-02 7 2.164E-04 4.451E-02 7 2.102E-04 4.674E-02 75.809E-04 4.674E-02 75.809E-04 4.658E-02 7 4.739E-04 4.655E-02 7 3.181E-04

100 150 150 150 150 150 140 140 140 140 140 140 140 140 140 140 130 130 130 130 130 100 100 100 100 100 120 120 120 120 120 110 110 110 110 110

# Final hidden nodes M

19 150 22 22 27 21 140 39 39 37 29 140 44 44 27 29 130 100 100 79 69 100 73 73 49 64 120 29 29 28 23 110 19 19 18 18

Parsimony ratio M/L

0.1900 1.0000 0.1467 0.1467 0.1800 0.1400 1.0000 0.2786 0.2786 0.2643 0.2071 1.0000 0.3143 0.3143 0.1929 0.2071 1.0000 0.7692 0.7692 0.6077 0.5308 1.0000 0.7300 0.7300 0.4900 0.6400 1.0000 0.2417 0.2417 0.2333 0.1917 1.0000 0.1727 0.1727 0.1636 0.1636

Training time (s)

Testing time (s)

Mean 7 Dev.

Mean7 Dev.

6.2234 7 0.0703 3.2254 7 0.1797 7.70507 0.1235 5.6566 7 0.0325 14.19077 0.1793 5.9014 7 0.1434 0.1086 7 0.0036 5.6252 7 0.0771 2.7317 7 0.0287 9.16777 0.0504 3.0992 7 0.1056 3.1578 7 0.0233 9.0006 7 0.1556 5.7535 7 0.0435 12.1736 7 0.0401 6.0198 7 0.0734 0.0956 7 0.0047 5.7844 7 0.2835 2.4157 7 0.1342 5.6409 7 0.0437 5.4385 7 0.4106 2.1811 7 0.1280 11.1413 7 0.0505 9.7022 7 0.0129 11.0765 7 0.0459 12.2756 7 0.1945 0.1427 7 0.0088 20.5790 7 0.1147 19.0394 7 0.2721 22.9840 7 0.1567 19.3302 7 0.2478 3.7945 7 0.0296 22.0392 7 0.2318 21.0584 7 0.1286 23.9525 7 0.1310 21.20247 0.1426

0.01147 0.0009 2.6023 7 0.1460 0.3862 7 0.0036 0.3866 7 0.0052 0.48217 0.0064 0.3828 7 0.0220 0.0287 7 0.0125 0.01297 0.0035 0.01297 0.0020 0.01277 0.0032 0.01227 0.0029 1.98167 0.0496 0.6306 7 0.0167 0.6295 7 0.0055 0.4065 7 0.0267 0.41987 0.0076 0.0364 7 0.0090 0.03107 0.0050 0.03127 0.0020 0.0265 7 0.0056 0.01867 0.0015 1.95127 0.1234 1.3949 7 0.0154 1.37227 0.0107 0.9456 7 0.0134 1.21607 0.0091 0.0453 7 0.0019 0.0223 7 0.0048 0.02277 0.0042 0.0220 7 0.0059 0.0208 7 0.0032 3.4294 7 0.0189 0.62487 0.0373 0.62197 0.0100 0.5959 7 0.0212 0.59277 0.0087

Y.-P. Zhao, R. Huerta / Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎

Please cite this article as: Y.-P. Zhao, R. Huerta, Improvements on parsimonious extreme learning machine using recursive orthogonal least squares, Neurocomputing (2016), http://dx.doi.org/10.1016/j.neucom.2016.01.005i

Table 3 (continued )

Y.-P. Zhao, R. Huerta / Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎

11

1.0 0.8

0.4

0.6 0.4

System out di

System input ui

0.2

0.0

-0.2

0.2 0.0 -0.2 -0.4 -0.6

-0.4

-0.8 200

400

600

800

1000

-1.0

200

Sample

400

600

800

1000

Sample Fig. 3. Robot arm data set (a) input (b) output.

0.06

ELM CP-ELM ICP-ELM DP-ELM CDP-ELM

RMSE

0.05

0.04

0.03

0.02

0.01

5

10

15

20

25

30

35

#Hidden nodes

5.2. Robot arm example As an interesting problem of research, the identiﬁcation of the inverse dynamics of a ﬂexible robot arm is analyzed [41]. This robot arm is installed on an electrical motor. The input of this system is the reaction torque of the structure, and its corresponding output is the acceleration of the ﬂexible arm. This data set contains 1024 pairs of input–output samples, depicted as Fig. 3. The inverse model of this system can be learned by taking T xi ¼ ui 1 ; ui 2 ; ui 3 ; ui 4 ; ui 5 ; di 1 ; di 2 ; di 3 ; di 4

0.08 0.07

ELM CP-ELM ICP-ELM DP-ELM CDP-ELM

0.06

RMSE

0.05

0.04 0.03 0.02 0.01

10

20

30

samples have not been seen before. In addition, the training RMSE of the naive ELM is less than those of other algorithms, which results from the fact that the overﬁtting phenomena occur due to the too complex architecture in the naive ELM. According to Table 3, the ELM needs least training time among all the algorithms. Before CP-ELM, DP-ELM, ICP-ELM and CDP-ELM start to select or delete regressors, the upper triangular matrix R in Eq. (13) needs to be obtained a priori. The recursive orthogonal least square strategy [39,40] can be used to ﬁnish it, but in our experiments qr function in MATLAB toolbox is utilized as a surrogate. Hence, in our results the training time includes the time of invoking the qr function. With the exception of ELM, ICP-ELM need the least training time, since the Givens rotation was replaced by the Householder transformation. The CDP-ELM is superior to CPELM and DP-ELM in training time without impairing the generalization performance. Compared to DP-ELM, CP-ELM needs less training time, which is consistent to the conclusion in [31].

40

50

#Hidden nodes Fig. 4. Modeling effectiveness using different algorithms (a) sigmoid (b) RBF.

and t i ¼ ui .Hence, we have 1019 samples of the form ðxi ; t i Þ, whose front part including 510 data points is utilized to train the model, the rest for testing. The modeling effectiveness of this inverse identiﬁcation problem is illustrated by Fig. 4 and Table 4. In Fig. 4, it is observed that the CDP-ELM ﬁrst reaches the benchmark line generated by the ELM, which denotes that CDPELM needs least number of hidden nodes for nearly the same generalization performance. That is to say, the CDP-ELM is the winner with respect to the compact architecture, which beneﬁts the real time in the testing phase. Therefore, it is demonstrated that the idea of integrating the spirit of DP-ELM into CP-ELM is

Please cite this article as: Y.-P. Zhao, R. Huerta, Improvements on parsimonious extreme learning machine using recursive orthogonal least squares, Neurocomputing (2016), http://dx.doi.org/10.1016/j.neucom.2016.01.005i

Y.-P. Zhao, R. Huerta / Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎

Testing time (s)

Mean 7 Dev.

0.0038 7 0.0003 0.00157 0.0006 0.00157 0.0005 0.00187 0.0008 0.00107 0.0002 0.21317 0.0041 0.1539 7 0.0071 0.15177 0.0059 0.1204 7 0.0095 0.0893 7 0.0014

Mean 7Dev.

0.0096 7 0.0014 0.9765 7 0.0503 0.14677 0.0110 1.47377 0.0101 0.3249 7 0.0640 0.21437 0.0066 1.1326 7 0.0312 0.3745 7 0.0203 1.1609 7 0.0097 0.62107 0.0226

# Final hidden nodes

1.0000 0.3556 0.3556 0.3889 0.2778 1.0000 0.6875 0.6875 0.5375 0.3875

Parsimony ratio

Training time (s)

12

90 90 90 90 90 80 80 80 80 80 1.064E-02 7 1.592E-04 1.199E-02 74.338E-04 1.199E-02 74.338E-04 1.139E-02 71.156E-04 1.204E-02 7 1.641E-04 1.068E-02 7 1.061E-04 1.130E-02 75.520E-04 1.130E-02 75.520E-04 1.099E-02 71.239E-04 1.171E-027 2.000E-04 1.299E-02 7 7.742E-05 1.317E-02 7 4.668E-04 1.317E-02 7 4.668E-04 1.318E-02 71.545E-04 1.310E-02 72.820E-04 1.307E-02 71.405E-04 1.339E-02 76.694E-04 1.339E-02 76.694E-04 1.339E-02 73.008E-04 1.332E-02 74.643E-04 RBF

ELM CP-ELM ICP-ELM DP-ELM CDP-ELM ELM CP-ELM ICP-ELM DP-ELM CDP-ELM Sigmoid

Training Testing

Algorithms

RMSE ( Mean 7Dev.)

# Initial hidden nodes

90 32 32 35 25 80 55 55 43 31

6. Conclusions

Hidden node type

Table 4 Details on robot arm example by different algorithms.

feasible and effective. In addition, it is shown that ICP-ELM and CPELM obtain the same prediction accuracy under the same number of hidden nodes. From Table 4, it is seen that ICP-ELM only accelerates the training speed of CP-ELM. Table 4 showcases that the training speed of ELM is the fastest among all the algorithms, but it has a full solution, in which the redundant hidden nodes existing affects its performance on the testing speed. ICP-ELM possesses the same prediction accuracy as CP-ELM but has a faster training speed, which validates the efﬁcacy of substituting Householder transformation for Givens rotation. In comparison with CP-ELM and DP-ELM, CDP-ELM takes the advantage in the training speed. However, the training speed of CDP-ELM is lower than that of ICP-ELM. The main reason is that there is an additional optimization mechanism in CDP-ELM compared to ICP-ELM. In the algorithms, CDP-ELM requires the least #hidden nodes under nearly the same generalization performance, which indicates the success of integrating DP-ELM into CP-ELM. Moreover, the training RMSE is less than the testing RMSE due to the training samples learnt in advance.

As an emergent SLFN, ELM has drawn much recent attention because of its simple form and fast training speed. However, the property of randomly generating the hidden nodes usually makes ELM requiring much more hidden nodes to reach the required performance. Due to the existence of redundant hidden nodes in ELM, its performances on training speed and generalization deteriorate. Hence, it is necessary to develop effective tools to sparsify ELM. CP-ELM and DP-ELM are novel techniques to ﬁnd parsimonious ELMs for regression problems. CP-ELM is superior to DP-ELM in training speed, but loses its edge on the parsimony ratio. In this paper, an improved CP-ELM is proposed by replacing Givens rotation with Householder transformation, which results in a decrease of the training speed of CP-ELM. Furthermore, combining the ideas behind CP-ELM and DP-ELM yields our CDP-ELM. In contrast to CP-ELM and DP-ELM, CDP-ELM works better in training speeds and the number of hidden nodes. Experimental results on regression data sets and a robot arm example demonstrates that ICP-ELM is able to get a faster training speed comparable to CP-ELM, and CDP-ELM is superior to CP-ELM and DPELM in training speed and parsimonious structure.

References [1] C.G. Looney, Pattern Recognition Using Neural Networks: Theory and Algorithms for Engineers and Scientists, Oxford Niversity Press, Inc., New York, NY, USA, 1997. [2] N. Wang, M.J. Er, M. Han, Generalized single-hidden layer feedforward networks for regression problems, IEEE Trans. Neural Netw. Learn. Syst. 26 (6) (2015) 1161–1176. [3] R. Zhang, Y. Lan, G.-B. Huang, Z.-B. Xu, Universal approximation of extreme learning machine with adaptive growth of hidden nodes, IEEE Trans. Neural Netw. Learn. Syst. 23 (2) (2012) 365–371. [4] G.B. Huang, L. Chen, C.K. Siew, Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Trans. Neural Netw. 17 (4) (2006) 879–892. [5] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (1–3) (2006) 489–501. [6] G.-B. Huang, H. Zhou, X. Ding, R. Zhang, Extreme learning machine for regression and multiclass classiﬁcation, IEEE Trans. Syst. Man Cybern. B: Cybern. 42 (2) (2012) 513–529. [7] N.-Y. Liang, G.-B. Huang, P. Saratchandran, N. Sundararajan, A fast and accurate online sequential learning algorithm for feedforward networks, IEEE Trans. Neural Netw. 17 (6) (2006) 1411–1423. [8] G.B. Huang, Q.Y. Zhu, C.K. Siew, ieee, Extreme learning machine: a new learning scheme of feedforward neural networks, in: IEEE International Joint Conference on Neural Networks, 2004, pp. 985–990.

Please cite this article as: Y.-P. Zhao, R. Huerta, Improvements on parsimonious extreme learning machine using recursive orthogonal least squares, Neurocomputing (2016), http://dx.doi.org/10.1016/j.neucom.2016.01.005i

Y.-P. Zhao, R. Huerta / Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎ [9] G.-B. Huang, N.-Y. Liang, H.-J. Rong, P. Saratchandran, N. Sundararajan, On-line sequential extreme learning machine, in: Proceedings of the IASTED International Conference on Computational Intelligence, 2005, pp. 232–237. [10] Y. Lan, Y.C. Soh, G.-B. Huang, Ensemble of online sequential extreme learning machine, Neurocomputing 72 (13–15) (2009) 3391–3395. [11] H.T. Huynh, Y. Won, Regularized online sequential learning algorithm for single-hidden layer feedforward neural networks, Pattern Recognit. Lett. 32 (14) (2011) 1930–1935. [12] Y. Gu, J. Liu, Y. Chen, X. Jiang, H. Yu, TOSELM: timeliness online sequential extreme learning machine, Neurocomputing 128 (2014) 119–127. [13] Z. Shao, M.J. Er, An online sequential learning algorithm for regularized Extreme Learning Machine, Neurocomputing (2015). [14] W.-Y. Deng, Y.-S. Ong, P.S. Tan, Q.-H. Zheng, Online sequential reduced kernel extreme learning machine, Neurocomputing (2015). [15] J.P. Nobrega, A.L.I. Oliveira, Kalman ﬁlter-based method for Online Sequential Extreme Learning Machine for regression problems, Eng. Appl. Artif. Intell. 44 (2015) 101–110. [16] B. Wang, S. Huang, J. Qiu, Y. Liu, G. Wang, Parallel online sequential extreme learning machine based on MapReduce, Neurocomputing 149 (Part A) (2015) 224–232. [17] G.-B. Huang, L. Chen, Convex incremental extreme learning machine, Neurocomputing 70 (16–18) (2007) 3056–3062. [18] G.-B. Huang, L. Chen, Enhanced random search based incremental extreme learning machine, Neurocomputing 71 (2008) 3460–3468. [19] G. Feng, G.-B. Huang, Q. Lin, R. Gay, Error minimized extreme learning machine with growth of hidden nodes and incremental learning, IEEE Trans. Neural Netw. 20 (8) (2009) 1352–1357. [20] Y. Yang, Y. Wang, X. Yuan, Bidirectional extreme learning machine for regression problem and its learning effectiveness, IEEE Trans. Neural Netw. Learn. Syst. 23 (9) (2012) 1498–1505. [21] Y. Lan, Y.C. Soh, G.-B. Huang, Constructive hidden nodes selection of extreme learning machine for regression, Neurocomputing 73 (16–18) (2010) 3191–3199. [22] L. Guo, J.-h Hao, M. Liu, An incremental extreme learning machine for online sequential learning problems, Neurocomputing 128 (2014) 50–58. [23] Y. Ye, Y. Qin, QR factorization based incremental extreme learning machine with growth of hidden dodes, Pattern Recognit. Lett. (2015). [24] Z. Xu, M. Yao, Z. Wu, W. Dai, Incremental regularized extreme learning machine and it's enhancement, Neurocomputing (2015). [25] Z. Shao, M.J. Er, N. Wang, An efﬁcient leave-one-out cross-validation-based extreme learning machine (ELOO-ELM) with minimal user intervention, Cybern. IEEE Trans. (2015). [26] R. Zhang, Y. Lan, G.-B. Huang, Z.-B. Xu, Y.C. Soh, Dynamic extreme learning machine and its approximation capability, IEEE Trans. Cybern. 43 (6) (2013) 2054–2065. [27] Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, A. Lendasse, OP-ELM: optimally pruned extreme learning machine, IEEE Trans. Neural Netw. 21 (1) (2010) 158–162. [28] A. Castano, F. Fernandez-Navarro, C. Hervas-Martinez, PCA-ELM: a robust and pruned extreme learning machine approach based on principal component analysis, Neural Process. Lett. 37 (3) (2013) 377–392. [29] H.-J. Rong, Y.-S. Ong, A.-H. Tan, Z. Zhu, A fast pruned-extreme learning machine for classiﬁcation problem, Neurocomputing 72 (1–3) (2008) 359–366. [30] A. Grigorievskiy, Y. Miche, M. Käpylä, A. Lendasse, Singular value decomposition update and its application to (Inc)-OP-ELM, Neurocomputing (2015). [31] N. Wang, M.J. Er, M. Han, Parsimonious extreme learning machine using recursive orthogonal least squares, IEEE Trans. Neural Netw. Learn. Syst. 25 (10) (2014) 1828–1841.

13

[32] Y.-P. Zhao, K.-K. Wang, Y.-B. Li, Parsimonious regularized extreme learning machine based on orthogonal transformation, Neurocomputing 156 (2015) 280–296. [33] W. Givens, Computation of plane unitary rotations transforming a general matrix to triangular form, J. Soc. Ind. Appl. Math. 6 (1) (1958) 26–50. [34] A.S. Householder, Unitary triangularization of a nonsymmetric matrix, J. ACM 5 (4) (1958) 339–342. [35] A.A. Dubrulle, Householder transformations revisited, SIAM J. Matrix Anal. Appl. 22 (1) (2000) 33–40. [36] X. Zhang, Matrix Analysis and Applications, Tsinghua University Press, Beijing, 2004. [37] P.B. Nair, A. Choudhury, A.J. Keane, Some greedy learning algorithms for sparse regression and classiﬁcation with Mercer kernels, J. Mach. Learn. Res. 3 (4–5) (2003) 781–801. [38] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classiﬁcation, Johh Wiley & Sons Inc., UK, 2001. [39] J.E. Bobrow, W. Murray, An algorithm for RLS indentiﬁcation of parameters that vary quickly with time, IEEE Trans. Autom. Control 38 (2) (1993) 351–354. [40] D.L. Yu, J.B. Gomm, D. Williams, A recursive orthogonal least squares algorithm for training RBF networks, Neural Process. Lett. 5 (3) (1997) 167–176. [41] S. Balasundaram, D. Gupta, Kapil, Lagrangian support vector regression via unconstrained convex minimization, Neural Netw. 51 (2014) 67–79.

Yong-Ping Zhao received his BE degree in the thermal energy and power engineering ﬁeld from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in July 2004. Since then, he had been pursuing the MS and PhD degrees at Nanjing University of Aeronautics and Astronautics. In December 2009, He received PhD degree, and won the award of the Nominated for the National Excellent Doctoral Dissertation Award of China in 2013. Currently, he is a professor and with the college of energy and power engineering, Nanjing University of Aeronautics and Astronautics. His research interests include aircraft engine modeling, control and fault diagnostics, machine learning and pattern recognition.

Ramón Huerta (Ph.D., 1994 – Universidad Autónoma de Madrid) is a research scientist at the BioCircuits Institute, UC San Diego. Prior his current appointment, he was an associate professor at the Universidad Autónoma de Madrid (Spain). His areas of expertize include dynamic systems, artiﬁcial intelligence, and neuroscience. His work deals with the development algorithms for the discrimination and quantiﬁcation of complex multidimensional time series, model building to understand the information processing in the brain, and chemical sensing and machine olfaction applications based on bio-inspired technology. Huerta's research work gathers in a publication record of over 100 articles in peer-reviewed journals at the intersection of computer science, physics, and biology.

Please cite this article as: Y.-P. Zhao, R. Huerta, Improvements on parsimonious extreme learning machine using recursive orthogonal least squares, Neurocomputing (2016), http://dx.doi.org/10.1016/j.neucom.2016.01.005i

Improvements on parsimonious extreme learning machine using recursive orthogonal least squares

Improvements on parsimonious extreme learning machine using recursive orthogonal least squares

Recommend Documents