An online sequential multiple hidden layers extreme learning machine method with forgetting mechanism


Chemometrics and Intelligent Laboratory Systems (2018). DOI: 10.1016/j.chemolab.2018.01.014

Received 22 June 2017; revised 15 December 2017; accepted 29 January 2018.


Dong Xiao*, Beijing Li, Shengyong Zhang

(Information Science & Engineering School, Northeastern University, Shenyang, China, 110004)

Abstract


In many practical applications, training data are presented one-by-one or chunk-by-chunk and very frequently also have the property of timeliness. The ensemble of online sequential extreme learning machines (EOS-ELM) can learn data one-by-one or chunk-by-chunk with fixed or varying chunk size. The online sequential extreme learning machine with forgetting mechanism (FOS-ELM) can learn data with the property of timeliness. In many practical applications, such as stock forecasting or weather forecasting, the training accuracy can be improved by discarding outdated data and reducing its influence on later training. Since the real-time variations of the data are accompanied by a series of unavoidable noise signals, and to make the training output closer to the actual output, an online sequential multiple hidden layers extreme learning machine with forgetting mechanism (FOS-MELM) is proposed in this paper. The proposed FOS-MELM retains the advantages of FOS-ELM, weakens the influence of the unavoidable noise and improves the prediction accuracy. In this work, experiments have been carried out on chemical (styrene) data. The experimental results show that FOS-MELM achieves higher accuracy, better stability and better short-term prediction than FOS-ELM.

Keywords: ELM; forgetting mechanism; online sequential; modeling; FOS-MELM


1. Introduction

At present, single-hidden-layer feedforward networks (SLFNs) have been applied in many fields as simple and popular network structures [1,2]. A main reason for their popularity is that SLFNs have a universal approximation capability: any continuous function defined on a compact set can be approximated in the uniform topology by an SLFN with a bounded sigmoidal activation function. In 2004, G.B. Huang proposed the extreme learning machine (ELM), aiming to reduce the computational cost incurred by the error back-propagation procedure during training while making the network output approach the actual output [3]. Scholars have studied this concept and proposed many new ideas, such as the back-propagation neural network [4] and the rough RBF neural network [5]. Further studies on the theoretical basis of these learning algorithms can be found in [6-9].


The back-propagation (BP) algorithm initializes all training parameters randomly and updates them iteratively; it easily falls into local optima and requires considerable training time. ELM, in contrast, only needs to set the weights and biases of the hidden neurons randomly; the weights from the hidden layer to the output layer are then determined analytically with the time-efficient least-squares method. In recent years, various ELM variants have been proposed to achieve better results, including deep ELM with kernels based on the multilayer extreme learning machine algorithm (DELM) [10], the two-hidden-layer extreme learning machine (TELM) [11], the four-layered feedforward neural network [12], the multiple kernel extreme learning machine (MK-ELM) [13], the two-stage extreme learning machine [14], the online sequential extreme learning machine [15-21], and other algorithms that build on the ELM auto-encoder (ELM-AE) [22-24].

To accommodate changing data in practical applications, N.Y. Liang proposed a fast and accurate online sequential learning algorithm for feedforward networks (OS-ELM) [17]. The algorithm can learn data one-by-one or chunk-by-chunk (blocks of data) with fixed or varying chunk size. The activation function of the additive nodes in OS-ELM can be any bounded non-constant piecewise continuous function. In OS-ELM, the hidden-node parameters are randomly selected, and the output weights are analytically determined from the sequentially arriving data; the newly arriving data are then used to update the parameters of the OS-ELM structure. Compared with the corresponding batch learning algorithm, OS-ELM has been shown to achieve better generalization performance. Based on OS-ELM, the ensemble of online sequential extreme learning machines (EOS-ELM) was proposed [18]. EOS-ELM is composed of p OS-ELMs with the same output function and the same number of hidden-layer nodes; its result is the average of the p OS-ELM outputs, which improves the accuracy of the OS-ELM model.

When the training data have timeliness (for example, stock data or weather data), the fixed models established by ELM, OS-ELM and EOS-ELM cannot adapt to the changing data; over time, a fixed model will deviate from the direction in which the training data change. Therefore, outdated training data, whose effectiveness is lost after some time, should be abandoned, which is the idea of the forgetting mechanism. J.W. Zhao and Z.H. Wang proposed the online sequential extreme learning machine with forgetting mechanism (FOS-ELM) [19], which inherits part of the characteristics of OS-ELM. Compared with other models, FOS-ELM adapts well to many kinds of datasets, and it also addresses the fact that many actual data are difficult to simulate and analyze.

Based on FOS-ELM, this paper adds multiple hidden layers to the network structure and proposes the online sequential multiple hidden layers extreme learning machine with forgetting mechanism (FOS-MELM), which inherits the good generalization performance and short running time of FOS-ELM and retains the advantages of the multiple-hidden-layer ELM. The final network output approximates the expected output layer by layer.

The remainder of this paper is organized as follows. Section 2 reviews the method and framework of the multi-hidden-layer ELM, which is built on the two-hidden-layer ELM. Section 3 reviews the online sequential extreme learning machine with forgetting mechanism (FOS-ELM). Section 4 presents the proposed online sequential multiple hidden layers extreme learning machine with forgetting mechanism (FOS-MELM), and Section 5 reports and analyzes the experimental results. Finally, the conclusions are presented in Section 6.

2. Multi-hidden-layer ELM


In 2016, B.Y. Qu and B.F. Lang proposed the two-hidden-layer extreme learning machine (TELM) [11]. An improved algorithm based on TELM, the multiple-hidden-layer ELM (MELM), attempts to make the actual output of each hidden layer approach its expected output, and in this way provides a better mapping between the input and output signals. The network structure of MELM includes one input layer, multiple hidden layers and one output layer, with each hidden layer containing l hidden neurons. The activation function of the network is denoted g(x). The workflow of the MELM architecture, taking three hidden layers as an example, is depicted in Figure 1.

Figure 1. The workflow of the MELM (input X; first-hidden-layer output H with parameters W, B; second-hidden-layer output H2 with parameters W1, B1; third-hidden-layer output H4 with parameters W2, B2; output weights β).

Consider the training sample dataset $\{X, T\} = \{x_i, t_i\}$ $(i = 1, 2, \ldots, Q)$, where the matrix $X$ contains the input samples and $T$ contains the labeled samples.

MELM first treats all hidden layers as one hidden layer, so the output of the hidden layer can be expressed as $H = g(WX + B)$, with the weight $W$ and bias $B$ of the first hidden layer randomly initialized. Next, the output weight matrix $\beta$ between the hidden layer and the output layer is obtained from Eq. (1):

$$\beta = H^{+} T \qquad (1)$$

MELM then separates the hidden layers into two parts (the first hidden layer and the others), so the network has two hidden layers. According to the workflow of Figure 1, the actual output of the second hidden layer is

$$H_1 = g(W_1 H + B_1) \qquad (2)$$

where $W_1$ is the weight matrix between the first and second hidden layers, $H$ is the output matrix of the first hidden layer, and $B_1$ is the bias of the second hidden layer.

The expected output of the second hidden layer is obtained from Eq. (3):

$$H_1^{*} = T \beta^{+} \qquad (3)$$

where $\beta^{+}$ is the generalized inverse of the matrix $\beta$. Setting the expected output equal to the actual output of the hidden layer, $H_1 = H_1^{*}$, MELM defines the matrix $W_{HE} = [B_1 \; W_1]$, so the parameters of the second hidden layer can easily be obtained from Eq. (2) and the inverse of the activation function:

$$W_{HE} = g^{-1}(H_1^{*}) H_E^{+} \qquad (4)$$

where $H_E^{+}$ is the generalized inverse of $H_E = [\mathbf{1} \; H]^{T}$, $\mathbf{1}$ denotes a one-column vector of size $Q$ whose elements are the scalar unit 1, and $g^{-1}(x)$ denotes the inverse of the activation function $g(x)$.

By selecting an appropriate activation function $g(x)$ and calculating Eq. (4), the actual output of the second hidden layer is updated as

$$H_2 = g(W_{HE} H_E) \qquad (5)$$

Therefore, the weight matrix $\beta$ between the second hidden layer and the output layer is updated as

$$\beta_{new} = H_2^{+} T \qquad (6)$$

where $H_2^{+}$ is the generalized inverse of $H_2$.

MELM then separates the hidden layers into three parts (the first hidden layer, the second hidden layer and the others), so the network has three hidden layers. According to the workflow of Figure 1, the expected output of the third hidden layer can be expressed as

$$H_3^{*} = T \beta_{new}^{+} \qquad (7)$$

where $\beta_{new}^{+}$ is the generalized inverse of the weight matrix $\beta_{new}$.

MELM defines the matrix $W_{HE1} = [B_2 \; W_2]$, so the parameters of the third hidden layer can easily be obtained from Eq. (7) and the relation $H_3^{*} = g(H_2 W_2 + B_2) = g(W_{HE1} H_{E1})$:

$$W_{HE1} = g^{-1}(H_3^{*}) H_{E1}^{+} \qquad (8)$$

where $H_2$ is the actual output of the second hidden layer, $W_2$ is the weight matrix between the second and third hidden layers, $B_2$ is the bias of the third hidden layer, $H_{E1}^{+}$ is the generalized inverse of $H_{E1} = [\mathbf{1} \; H_2]^{T}$, $\mathbf{1}$ denotes a one-column vector of size $Q$ whose elements are the scalar unit 1, and $g^{-1}(x)$ denotes the inverse of the activation function $g(x)$.

To test the performance of the MELM algorithm, different activation functions can be applied for regression and classification problems. Generally, the logistic sigmoid function $g(x) = 1/(1 + e^{-x})$ is adopted, and the actual output of the third hidden layer is calculated as

$$H_4 = g(W_{HE1} H_{E1}) \qquad (9)$$

Therefore, the weight matrix $\beta_{new1}$ between the third hidden layer and the output layer is updated as

$$\beta_{new1} = H_4^{+} T \qquad (10)$$

where $H_4^{+}$ is the generalized inverse of $H_4$. The actual output of the MELM network can then be expressed as

$$f(x) = H_4 \beta_{new1} \qquad (11)$$
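To make the layer-by-layer construction concrete, the following minimal NumPy sketch follows Eqs. (1)-(11) for a three-hidden-layer MELM. It is written in the row-sample convention; the layer width, the logistic sigmoid with a clipped inverse, and all function names are our illustrative assumptions rather than the authors' original implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_inv(y, eps=1e-6):
    # Inverse of the logistic sigmoid; y is clipped away from {0, 1} for stability.
    y = np.clip(y, eps, 1.0 - eps)
    return np.log(y / (1.0 - y))

def melm_train(X, T, L=20, seed=0):
    """Three-hidden-layer MELM following Eqs. (1)-(11), row-sample convention.
    X: (Q, n) input samples, T: (Q, m) targets, L hidden neurons per layer."""
    rng = np.random.default_rng(seed)
    Q, n = X.shape
    # First hidden layer with random weights W and bias B.
    W = rng.uniform(-1, 1, (n, L))
    B = rng.uniform(-1, 1, (1, L))
    H = sigmoid(X @ W + B)                       # (Q, L)
    beta = np.linalg.pinv(H) @ T                 # Eq. (1)
    # Second hidden layer: expected output H1* = T beta+ (Eq. (3)),
    # parameters from Eq. (4) in row-sample form.
    H1_star = T @ np.linalg.pinv(beta)
    H_E = np.hstack([np.ones((Q, 1)), H])        # H_E = [1  H]
    W_HE = np.linalg.pinv(H_E) @ sigmoid_inv(H1_star)
    H2 = sigmoid(H_E @ W_HE)                     # Eq. (5)
    beta_new = np.linalg.pinv(H2) @ T            # Eq. (6)
    # Third hidden layer: H3* = T beta_new+ (Eq. (7)), parameters from Eq. (8).
    H3_star = T @ np.linalg.pinv(beta_new)
    H_E1 = np.hstack([np.ones((Q, 1)), H2])
    W_HE1 = np.linalg.pinv(H_E1) @ sigmoid_inv(H3_star)
    H4 = sigmoid(H_E1 @ W_HE1)                   # Eq. (9)
    beta_new1 = np.linalg.pinv(H4) @ T           # Eq. (10)
    return W, B, W_HE, W_HE1, beta_new1

def melm_predict(params, X):
    W, B, W_HE, W_HE1, beta_new1 = params
    H = sigmoid(X @ W + B)
    H2 = sigmoid(np.hstack([np.ones((len(X), 1)), H]) @ W_HE)
    H4 = sigmoid(np.hstack([np.ones((len(X), 1)), H2]) @ W_HE1)
    return H4 @ beta_new1                        # Eq. (11)
```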


3. The online sequential extreme learning machine with forgetting mechanism

In 2012, J.W. Zhao and Z.H. Wang proposed the online sequential extreme learning machine with forgetting mechanism (FOS-ELM) [19]. To adapt to training data that change over time, the authors assigned a validity period of several unit times to each chunk of data, which is the idea of the forgetting mechanism: outdated training data are abandoned. FOS-ELM is an ensemble of online sequential extreme learning machines (OS-ELM) with the forgetting mechanism. The activation function $g$ and the number of hidden nodes $L$ are chosen, so the output function of a hidden node is $G(a, b, x)$, where $a$ is the weight between the input layer and the hidden layer, $b$ is the bias of the hidden layer, and $x$ is the input. Suppose the training data arrive chunk-by-chunk and each chunk remains valid for $s$ unit times. Then the chunk of training data at the $k$-th unit time is given by Eq. (12), where $N_j$ is the number of training data in the $j$-th chunk, $j = 0, 1, 2, \ldots, k$:

$$\chi_k = \{(x_i, t_i)\}_{i=(\sum_{j=0}^{k-1} N_j)+1}^{\sum_{j=0}^{k} N_j} \qquad (12)$$

Each chunk of training data remains valid for $s$ unit times, i.e., $\chi_k$ remains valid during the $[k, k+s]$ unit times. Therefore, the data that are valid at the $(k+1)$-th unit time can be expressed as

$$\overline{\chi}(k+1) = \bigcup_{l=k-s+1}^{k} \chi_l = \{(x_i, t_i)\}_{i=(\sum_{j=0}^{k-s} N_j)+1}^{\sum_{j=0}^{k} N_j} \qquad (13)$$

When $k < s-1$, all the training data are valid. Now assume that $k \geq s-1$ and that the number of training data is larger than the number of hidden-layer nodes $L$. The following is the process of predicting the datum $Z_{k+1}$ at the $(k+1)$-th unit time. For each $l = k-s+1, k-s+2, \ldots, k$, the $l$-th partial hidden-layer output matrix and target matrix are

$$H_l = \begin{bmatrix} G(a_1, b_1, x_{(\sum_{j=0}^{l-1} N_j)+1}) & \cdots & G(a_L, b_L, x_{(\sum_{j=0}^{l-1} N_j)+1}) \\ \vdots & \ddots & \vdots \\ G(a_1, b_1, x_{\sum_{j=0}^{l} N_j}) & \cdots & G(a_L, b_L, x_{\sum_{j=0}^{l} N_j}) \end{bmatrix}, \quad T_l = \begin{bmatrix} t_{(\sum_{j=0}^{l-1} N_j)+1}^{T} \\ \vdots \\ t_{\sum_{j=0}^{l} N_j}^{T} \end{bmatrix} \qquad (14)$$

where the hidden-layer parameters $(a_i, b_i)$, $i = 1, \ldots, L$, are randomly set.

Using the Moore-Penrose generalized inverse, $\beta(k)$ can be calculated from

$$\begin{bmatrix} H_{k-s+1} \\ \vdots \\ H_k \end{bmatrix} \beta = \begin{bmatrix} T_{k-s+1} \\ \vdots \\ T_k \end{bmatrix}, \quad \text{so} \quad \beta(k) = \begin{bmatrix} H_{k-s+1} \\ \vdots \\ H_k \end{bmatrix}^{+} \begin{bmatrix} T_{k-s+1} \\ \vdots \\ T_k \end{bmatrix} = P_k \begin{bmatrix} H_{k-s+1} \\ \vdots \\ H_k \end{bmatrix}^{T} \begin{bmatrix} T_{k-s+1} \\ \vdots \\ T_k \end{bmatrix} \qquad (15)$$

where

$$P_k = \left( \begin{bmatrix} H_{k-s+1} \\ \vdots \\ H_k \end{bmatrix}^{T} \begin{bmatrix} H_{k-s+1} \\ \vdots \\ H_k \end{bmatrix} \right)^{-1} = \left( \sum_{l=k-s+1}^{k} H_l^{T} H_l \right)^{-1} \qquad (16)$$

If $\begin{bmatrix} H_{k-s+1} \\ \vdots \\ H_k \end{bmatrix}^{T} \begin{bmatrix} H_{k-s+1} \\ \vdots \\ H_k \end{bmatrix}$ is nonsingular, then $\begin{bmatrix} H_{k-s+1} \\ \vdots \\ H_k \end{bmatrix}^{+} = \left( \begin{bmatrix} H_{k-s+1} \\ \vdots \\ H_k \end{bmatrix}^{T} \begin{bmatrix} H_{k-s+1} \\ \vdots \\ H_k \end{bmatrix} \right)^{-1} \begin{bmatrix} H_{k-s+1} \\ \vdots \\ H_k \end{bmatrix}^{T}$.

Therefore, the final output is $f_L(x) = \sum_{i=1}^{L} (\beta(k))_i G(a_i, b_i, x)$, and the prediction of the datum $Z_{k+1}$ at the $(k+1)$-th unit time is $f_L(Z_{k+1})$.

Next, we predict the datum $Z_{k+2}$ at the $(k+2)$-th unit time after the $(k+1)$-th chunk has been presented. A process similar to the prediction of $Z_{k+1}$ at the $(k+1)$-th unit time is used, keeping the same hidden parameters $(a_i, b_i)$, $i = 1, 2, \ldots, L$. Therefore, the output weight $\beta$ of the network satisfies

$$\begin{bmatrix} H_{k-s+2} \\ \vdots \\ H_{k+1} \end{bmatrix} \beta = \begin{bmatrix} T_{k-s+2} \\ \vdots \\ T_{k+1} \end{bmatrix}, \quad \text{and} \quad \beta(k+1) = \begin{bmatrix} H_{k-s+2} \\ \vdots \\ H_{k+1} \end{bmatrix}^{+} \begin{bmatrix} T_{k-s+2} \\ \vdots \\ T_{k+1} \end{bmatrix} = P_{k+1} \begin{bmatrix} H_{k-s+2} \\ \vdots \\ H_{k+1} \end{bmatrix}^{T} \begin{bmatrix} T_{k-s+2} \\ \vdots \\ T_{k+1} \end{bmatrix} \qquad (17)$$

where

$$P_{k+1} = \left( \begin{bmatrix} H_{k-s+2} \\ \vdots \\ H_{k+1} \end{bmatrix}^{T} \begin{bmatrix} H_{k-s+2} \\ \vdots \\ H_{k+1} \end{bmatrix} \right)^{-1} = \left( \sum_{l=k-s+2}^{k+1} H_l^{T} H_l \right)^{-1} \qquad (18)$$

Then $P_{k+1}$ can be calculated directly from $P_k$ as follows:

$$P_{k+1} = \left( \sum_{l=k-s+1}^{k} H_l^{T} H_l + \begin{bmatrix} -H_{k-s+1} \\ H_{k+1} \end{bmatrix}^{T} \begin{bmatrix} H_{k-s+1} \\ H_{k+1} \end{bmatrix} \right)^{-1} = P_k - P_k \begin{bmatrix} -H_{k-s+1} \\ H_{k+1} \end{bmatrix}^{T} \left( I + \begin{bmatrix} H_{k-s+1} \\ H_{k+1} \end{bmatrix} P_k \begin{bmatrix} -H_{k-s+1} \\ H_{k+1} \end{bmatrix}^{T} \right)^{-1} \begin{bmatrix} H_{k-s+1} \\ H_{k+1} \end{bmatrix} P_k \qquad (19)$$

Additionally,

$$\begin{bmatrix} H_{k-s+2} \\ \vdots \\ H_{k+1} \end{bmatrix}^{T} \begin{bmatrix} T_{k-s+2} \\ \vdots \\ T_{k+1} \end{bmatrix} = \sum_{l=k-s+2}^{k+1} H_l^{T} T_l = P_k^{-1} \beta(k) + \begin{bmatrix} -H_{k-s+1} \\ H_{k+1} \end{bmatrix}^{T} \begin{bmatrix} T_{k-s+1} \\ T_{k+1} \end{bmatrix} = P_{k+1}^{-1} \beta(k) - \begin{bmatrix} -H_{k-s+1} \\ H_{k+1} \end{bmatrix}^{T} \begin{bmatrix} H_{k-s+1} \\ H_{k+1} \end{bmatrix} \beta(k) + \begin{bmatrix} -H_{k-s+1} \\ H_{k+1} \end{bmatrix}^{T} \begin{bmatrix} T_{k-s+1} \\ T_{k+1} \end{bmatrix} \qquad (20)$$

Then,

$$\beta(k+1) = P_{k+1} \begin{bmatrix} H_{k-s+2} \\ \vdots \\ H_{k+1} \end{bmatrix}^{T} \begin{bmatrix} T_{k-s+2} \\ \vdots \\ T_{k+1} \end{bmatrix} = \beta(k) + P_{k+1} \begin{bmatrix} -H_{k-s+1} \\ H_{k+1} \end{bmatrix}^{T} \left( \begin{bmatrix} T_{k-s+1} \\ T_{k+1} \end{bmatrix} - \begin{bmatrix} H_{k-s+1} \\ H_{k+1} \end{bmatrix} \beta(k) \right) \qquad (21)$$
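The sliding-window update of Eqs. (19) and (21) can be written compactly. The sketch below is a minimal NumPy illustration in which one outdated chunk is forgotten and one newly arrived chunk is learned per step; the function and variable names are ours, and it assumes the involved matrices are well conditioned.

```python
import numpy as np

def fos_elm_update(P, beta, H_old, T_old, H_new, T_new):
    """One FOS-ELM sliding-window step (Eqs. (19) and (21)).
    P:    current (L, L) matrix P_k = (sum_l H_l^T H_l)^-1
    beta: current (L, m) output weights beta(k)
    H_old, T_old: hidden output and targets of the chunk leaving the window
    H_new, T_new: hidden output and targets of the newly arrived chunk."""
    A = np.vstack([H_old, H_new])        # [ H_{k-s+1} ; H_{k+1} ]
    B = np.vstack([-H_old, H_new])       # [ -H_{k-s+1} ; H_{k+1} ]
    Tb = np.vstack([T_old, T_new])       # [ T_{k-s+1} ; T_{k+1} ]
    n = A.shape[0]
    # Eq. (19): P_{k+1} = P_k - P_k B^T (I + A P_k B^T)^{-1} A P_k
    K = np.linalg.inv(np.eye(n) + A @ P @ B.T)
    P_next = P - P @ B.T @ K @ A @ P
    # Eq. (21): beta(k+1) = beta(k) + P_{k+1} B^T (Tb - A beta(k))
    beta_next = beta + P_next @ B.T @ (Tb - A @ beta)
    return P_next, beta_next
```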

Suppose FOS-ELM is composed of $p$ OS-ELMs with the forgetting mechanism that share the same hidden-node output function $G(a, b, x)$ and the same number of hidden nodes $L$. The FOS-ELM output at each unit time is then the average of the $p$ individual OS-ELM outputs.

According to the above discussion, the FOS-ELM algorithm is as follows. Assume the data appear chunk-by-chunk with varying chunk size, and that the chunk $\chi_l = \{(x_i, t_i)\}_{i=(\sum_{j=0}^{l-1} N_j)+1}^{\sum_{j=0}^{l} N_j}$ presented at the $l$-th unit time remains valid for $s$ unit times.

Step 1. Assume that the chunks of data $\chi_l$ $(l = 0, 1, 2, \ldots, k)$ have been presented and that the number of data is much larger than the number of hidden nodes $L$. The datum $Z_{k+1}$ is to be predicted at the $(k+1)$-th unit time. Set $r = 1$.

(1). Randomly initialize the parameters $(a_i^{(r)}, b_i^{(r)})$, $i = 1, 2, \ldots, L$.

(2). Compute the $l$-th hidden-layer output for $l = k-s+1, \ldots, k$:

$$H_l^{(r)} = \begin{bmatrix} G(a_1^{(r)}, b_1^{(r)}, x_{(\sum_{j=0}^{l-1} N_j)+1}) & \cdots & G(a_L^{(r)}, b_L^{(r)}, x_{(\sum_{j=0}^{l-1} N_j)+1}) \\ \vdots & \ddots & \vdots \\ G(a_1^{(r)}, b_1^{(r)}, x_{\sum_{j=0}^{l} N_j}) & \cdots & G(a_L^{(r)}, b_L^{(r)}, x_{\sum_{j=0}^{l} N_j}) \end{bmatrix}, \quad T_l = \begin{bmatrix} t_{(\sum_{j=0}^{l-1} N_j)+1} \\ \vdots \\ t_{\sum_{j=0}^{l} N_j} \end{bmatrix}$$

(3). Compute $P_k^{(r)} = \left( \sum_{l=k-s+1}^{k} H_l^{(r)T} H_l^{(r)} \right)^{-1}$.

(4). Compute the output weight $\beta^{(r)}(k) = P_k^{(r)} \begin{bmatrix} H_{k-s+1}^{(r)} \\ \vdots \\ H_k^{(r)} \end{bmatrix}^{T} \begin{bmatrix} T_{k-s+1} \\ \vdots \\ T_k \end{bmatrix}$.

(5). Set $r = r + 1$. If $r \leq p$, go to (1) of Step 1.

(6). For the datum $Z_{k+1}$ at the $(k+1)$-th unit time, the final prediction is given as $f_L(Z_{k+1}) = \frac{1}{p} \sum_{r=1}^{p} f_L^{(r)}(Z_{k+1})$.

Step 2. Sequential learning phase of FOS-ELM: assume the $(k+1)$-th chunk of data $\chi(k+1) = \{(x_i, t_i)\}_{i=(\sum_{j=0}^{k} N_j)+1}^{\sum_{j=0}^{k+1} N_j}$ has arrived; then predict the output of the datum $Z_{k+2}$ at the $(k+2)$-th unit time. Set $r = 1$.

(1). Compute the $(k+1)$-th hidden-layer output $H_{k+1}^{(r)}$.

(2). Compute

$$P_{k+1}^{(r)} = P_k^{(r)} - P_k^{(r)} \begin{bmatrix} -H_{k-s+1}^{(r)} \\ H_{k+1}^{(r)} \end{bmatrix}^{T} \left( I + \begin{bmatrix} H_{k-s+1}^{(r)} \\ H_{k+1}^{(r)} \end{bmatrix} P_k^{(r)} \begin{bmatrix} -H_{k-s+1}^{(r)} \\ H_{k+1}^{(r)} \end{bmatrix}^{T} \right)^{-1} \begin{bmatrix} H_{k-s+1}^{(r)} \\ H_{k+1}^{(r)} \end{bmatrix} P_k^{(r)}$$

(3). Calculate the output weight

$$\beta^{(r)}(k+1) = \beta^{(r)}(k) + P_{k+1}^{(r)} \begin{bmatrix} -H_{k-s+1}^{(r)} \\ H_{k+1}^{(r)} \end{bmatrix}^{T} \left( \begin{bmatrix} T_{k-s+1} \\ T_{k+1} \end{bmatrix} - \begin{bmatrix} H_{k-s+1}^{(r)} \\ H_{k+1}^{(r)} \end{bmatrix} \beta^{(r)}(k) \right)$$

(4). Set $r = r + 1$. If $r \leq p$, go to (1) of Step 2.

(5). For the datum $Z_{k+2}$ at the $(k+2)$-th unit time, the final prediction is given as $f_L(Z_{k+2}) = \frac{1}{p} \sum_{r=1}^{p} f_L^{(r)}(Z_{k+2})$.
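Putting Steps 1 and 2 together, a compact sketch of one ensemble member and of the averaged FOS-ELM prediction might look as follows. The sigmoid hidden-node function, the class layout and the assumption that the stacked hidden output has full column rank are our illustrative choices, not prescribed by the paper.

```python
import numpy as np

class FOSELM:
    """Minimal sketch of one OS-ELM with forgetting mechanism (one ensemble member).
    `fit_initial` implements Step 1 (Eqs. (14)-(16)); `update` implements Step 2
    (Eqs. (19) and (21)) over a sliding window of the last s chunks."""

    def __init__(self, n_input, n_hidden, s, rng):
        self.s = s
        self.a = rng.uniform(-1, 1, (n_input, n_hidden))    # input weights a_i
        self.b = rng.uniform(-1, 1, (1, n_hidden))           # hidden biases b_i
        self.window = []                                      # (H_l, T_l) of valid chunks

    def _hidden(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.a + self.b)))  # G(a, b, x), sigmoid

    def fit_initial(self, chunks):
        """chunks: list of (X_l, T_l); only the last s chunks remain valid."""
        self.window = [(self._hidden(X), T) for X, T in chunks[-self.s:]]
        Hs = np.vstack([H for H, _ in self.window])
        Ts = np.vstack([T for _, T in self.window])
        self.P = np.linalg.inv(Hs.T @ Hs)                     # Eq. (16)
        self.beta = self.P @ Hs.T @ Ts                        # Eq. (15)

    def update(self, X_new, T_new):
        """Forget the oldest chunk and learn the newly arrived one."""
        H_new = self._hidden(X_new)
        H_old, T_old = self.window.pop(0)
        self.window.append((H_new, T_new))
        A = np.vstack([H_old, H_new])
        B = np.vstack([-H_old, H_new])
        Tb = np.vstack([T_old, T_new])
        K = np.linalg.inv(np.eye(len(A)) + A @ self.P @ B.T)
        self.P = self.P - self.P @ B.T @ K @ A @ self.P       # Eq. (19)
        self.beta = self.beta + self.P @ B.T @ (Tb - A @ self.beta)  # Eq. (21)

    def predict(self, X):
        return self._hidden(X) @ self.beta


def fos_elm_ensemble_predict(models, X):
    """FOS-ELM prediction: the average of the p ensemble members (step (6))."""
    return np.mean([m.predict(X) for m in models], axis=0)
```

In use, p such members would be created with different random seeds, each fitted and updated on the same chunks, and their predictions averaged as in steps (6) of Step 1 and (5) of Step 2.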

4. Online sequential multiple hidden layers extreme learning machine with forgetting mechanism (FOS-MELM)

To adapt to training data that change over time and to improve the accuracy of the ELM network on dynamic data, we propose an algorithm named the online sequential multiple hidden layers extreme learning machine with forgetting mechanism (FOS-MELM). The structure with three hidden layers is selected as an example: here, we use a three-hidden-layer ELM with a forgetting mechanism to analyze the FOS-MELM algorithm. First, the training samples and the three-hidden-layer network structure (each of the three hidden layers has $L$ hidden neurons) are given. Then, the activation function $g$ is chosen so that the output function of a hidden node is $G(a, b, x)$, where $a$ is the weight between the input layer and the first hidden layer, $b$ is the bias of the first hidden layer, and $x$ is the input. Suppose the data arrive chunk-by-chunk and each chunk remains valid for $s$ unit times. The chunk of training data at the $k$-th unit time is given by Eq. (22), where $N_j$ is the number of training data in the $j$-th chunk, $j = 0, 1, 2, \ldots, k$:

$$\chi_k = \{(x_i, t_i)\}_{i=(\sum_{j=0}^{k-1} N_j)+1}^{\sum_{j=0}^{k} N_j} \qquad (22)$$

Thus, $\chi_k$ remains valid during the $[k, k+s]$ unit times, and the data that are valid at the $(k+1)$-th unit time can be expressed as

$$\overline{\chi}(k+1) = \bigcup_{l=k-s+1}^{k} \chi_l = \{(x_i, t_i)\}_{i=(\sum_{j=0}^{k-s} N_j)+1}^{\sum_{j=0}^{k} N_j} \qquad (23)$$

Step 1: Now assume $k \geq s-1$ and that the number of training data is larger than the number of hidden-layer nodes $L$. The following is the process for predicting the datum $Z_{k+1}$. For each $l = k-s+1, k-s+2, \ldots, k$, the $l$-th partial output matrix of the first hidden layer is

$$H_l = \begin{bmatrix} G(a_1, b_1, x_{(\sum_{j=0}^{l-1} N_j)+1}) & \cdots & G(a_L, b_L, x_{(\sum_{j=0}^{l-1} N_j)+1}) \\ \vdots & \ddots & \vdots \\ G(a_1, b_1, x_{\sum_{j=0}^{l} N_j}) & \cdots & G(a_L, b_L, x_{\sum_{j=0}^{l} N_j}) \end{bmatrix} \qquad (24)$$

where the hidden-layer parameters $(a_i, b_i)$, $i = 1, \ldots, L$, are randomly initialized, and $T_l$ is the corresponding target matrix as in Eq. (14).

Using the Moore-Penrose generalized inverse, the first-hidden-layer output weight matrix $\beta(k)$ can be calculated as

$$\beta(k) = \begin{bmatrix} H_{k-s+1} \\ \vdots \\ H_k \end{bmatrix}^{+} \begin{bmatrix} T_{k-s+1} \\ \vdots \\ T_k \end{bmatrix} = P_k \begin{bmatrix} H_{k-s+1} \\ \vdots \\ H_k \end{bmatrix}^{T} \begin{bmatrix} T_{k-s+1} \\ \vdots \\ T_k \end{bmatrix} \qquad (25)$$

where

$$P_k = \left( \begin{bmatrix} H_{k-s+1} \\ \vdots \\ H_k \end{bmatrix}^{T} \begin{bmatrix} H_{k-s+1} \\ \vdots \\ H_k \end{bmatrix} \right)^{-1} = \left( \sum_{l=k-s+1}^{k} H_l^{T} H_l \right)^{-1} \qquad (26)$$

Now, assume the weight matrix and bias of the second hidden layer are $W_1$ and $B_1$. The expected output matrix $H_1$ of the second hidden layer is

$$H_1 = \begin{bmatrix} T_{k-s+1} \\ \vdots \\ T_k \end{bmatrix} \beta(k)^{+} \qquad (27)$$

FOS-MELM defines the matrix $W_{HE} = [B_1 \; W_1]$, so the parameters of the second hidden layer can easily be obtained from Eq. (27) and the inverse of the activation function:

$$W_{HE} = g^{-1}(H_1) H_E^{+} \qquad (28)$$

where $H_E^{+}$ is the generalized inverse of $H_E = [\mathbf{1} \; H]^{T}$, $\mathbf{1}$ denotes a one-column vector whose elements are the scalar unit 1, $g^{-1}(x)$ denotes the inverse of the activation function $g(x)$, and $H = \begin{bmatrix} H_{k-s+1} \\ \vdots \\ H_k \end{bmatrix}$.

With the appropriate activation function $g(x)$, FOS-MELM calculates the actual output $H_2$ of the second hidden layer as

$$H_2 = g(W_{HE} H_E) \qquad (29)$$

Therefore, the output weight matrix is updated as

$$\beta_{new} = H_2^{+} \begin{bmatrix} T_{k-s+1} \\ \vdots \\ T_k \end{bmatrix} \qquad (30)$$

Now, assume the weight matrix and bias of the third hidden layer are $W_2$ and $B_2$. The expected output matrix of the third hidden layer is

$$H_3 = \begin{bmatrix} T_{k-s+1} \\ \vdots \\ T_k \end{bmatrix} \beta_{new}^{+} \qquad (31)$$

FOS-MELM defines the matrix $W_{HE1} = [B_2 \; W_2]$, so the parameters of the third hidden layer can easily be obtained from Eq. (31) and the inverse of the activation function:

$$W_{HE1} = g^{-1}(H_3) H_{E1}^{+} \qquad (32)$$

where $H_{E1}^{+}$ is the generalized inverse of $H_{E1} = [\mathbf{1} \; H_2]^{T}$, $\mathbf{1}$ denotes a one-column vector whose elements are the scalar unit 1, and $g^{-1}(x)$ denotes the inverse of the activation function $g(x)$.

With the appropriate activation function $g(x)$, FOS-MELM calculates the actual output of the third hidden layer as

$$H_4 = g(W_{HE1} H_{E1}) \qquad (33)$$

Therefore, the final output weight matrix is updated as

$$\beta_{new1} = H_4^{+} \begin{bmatrix} T_{k-s+1} \\ \vdots \\ T_k \end{bmatrix} \qquad (34)$$
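For illustration, the second- and third-hidden-layer construction of Eqs. (27)-(34) on top of the FOS-ELM window solution can be sketched as follows (row-sample convention; the logistic sigmoid, its clipped inverse and the function names are our assumptions).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_inv(y, eps=1e-6):
    y = np.clip(y, eps, 1.0 - eps)   # keep the inverse activation well defined
    return np.log(y / (1.0 - y))

def fos_melm_refine(H, T, beta):
    """Build the second and third hidden layers following Eqs. (27)-(34).
    H:    stacked first-hidden-layer outputs of the valid chunks, shape (Q_w, L)
    T:    stacked targets of the valid chunks, shape (Q_w, m)
    beta: FOS-ELM output weights beta(k), shape (L, m)."""
    ones = np.ones((H.shape[0], 1))
    H1 = T @ np.linalg.pinv(beta)                  # expected 2nd-layer output, Eq. (27)
    H_E = np.hstack([ones, H])                     # H_E = [1  H]
    W_HE = np.linalg.pinv(H_E) @ sigmoid_inv(H1)   # Eq. (28)
    H2 = sigmoid(H_E @ W_HE)                       # Eq. (29)
    beta_new = np.linalg.pinv(H2) @ T              # Eq. (30)
    H3 = T @ np.linalg.pinv(beta_new)              # expected 3rd-layer output, Eq. (31)
    H_E1 = np.hstack([ones, H2])
    W_HE1 = np.linalg.pinv(H_E1) @ sigmoid_inv(H3) # Eq. (32)
    H4 = sigmoid(H_E1 @ W_HE1)                     # Eq. (33)
    beta_new1 = np.linalg.pinv(H4) @ T             # Eq. (34)
    return W_HE, W_HE1, beta_new1
```

At prediction time, a new input is passed through the first hidden layer, then through W_HE and W_HE1 with the same activation, and finally multiplied by beta_new1, mirroring Eq. (11).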

Step 2: Next, we predict the datum $Z_{k+2}$ at the $(k+2)$-th unit time after the $(k+1)$-th chunk has been presented. A process similar to the prediction of $Z_{k+1}$ at the $(k+1)$-th unit time is used, and the same hidden parameters $(a_i, b_i)$, $i = 1, \ldots, L$, are kept. Therefore, the output weight $\beta$ of the network satisfies

$$\begin{bmatrix} H_{k-s+2} \\ \vdots \\ H_{k+1} \end{bmatrix} \beta = \begin{bmatrix} T_{k-s+2} \\ \vdots \\ T_{k+1} \end{bmatrix}, \quad \text{and} \quad \beta(k+1) = \begin{bmatrix} H_{k-s+2} \\ \vdots \\ H_{k+1} \end{bmatrix}^{+} \begin{bmatrix} T_{k-s+2} \\ \vdots \\ T_{k+1} \end{bmatrix} = P_{k+1} \begin{bmatrix} H_{k-s+2} \\ \vdots \\ H_{k+1} \end{bmatrix}^{T} \begin{bmatrix} T_{k-s+2} \\ \vdots \\ T_{k+1} \end{bmatrix} \qquad (35)$$

where

$$P_{k+1} = \left( \begin{bmatrix} H_{k-s+2} \\ \vdots \\ H_{k+1} \end{bmatrix}^{T} \begin{bmatrix} H_{k-s+2} \\ \vdots \\ H_{k+1} \end{bmatrix} \right)^{-1} = \left( \sum_{l=k-s+2}^{k+1} H_l^{T} H_l \right)^{-1} \qquad (36)$$

Now, $P_{k+1}$ is calculated directly from $P_k$ as

$$P_{k+1} = \left( \sum_{l=k-s+1}^{k} H_l^{T} H_l + \begin{bmatrix} -H_{k-s+1} \\ H_{k+1} \end{bmatrix}^{T} \begin{bmatrix} H_{k-s+1} \\ H_{k+1} \end{bmatrix} \right)^{-1} = P_k - P_k \begin{bmatrix} -H_{k-s+1} \\ H_{k+1} \end{bmatrix}^{T} \left( I + \begin{bmatrix} H_{k-s+1} \\ H_{k+1} \end{bmatrix} P_k \begin{bmatrix} -H_{k-s+1} \\ H_{k+1} \end{bmatrix}^{T} \right)^{-1} \begin{bmatrix} H_{k-s+1} \\ H_{k+1} \end{bmatrix} P_k \qquad (37)$$

Additionally,

$$\begin{bmatrix} H_{k-s+2} \\ \vdots \\ H_{k+1} \end{bmatrix}^{T} \begin{bmatrix} T_{k-s+2} \\ \vdots \\ T_{k+1} \end{bmatrix} = \sum_{l=k-s+2}^{k+1} H_l^{T} T_l = P_k^{-1} \beta(k) + \begin{bmatrix} -H_{k-s+1} \\ H_{k+1} \end{bmatrix}^{T} \begin{bmatrix} T_{k-s+1} \\ T_{k+1} \end{bmatrix} = P_{k+1}^{-1} \beta(k) - \begin{bmatrix} -H_{k-s+1} \\ H_{k+1} \end{bmatrix}^{T} \begin{bmatrix} H_{k-s+1} \\ H_{k+1} \end{bmatrix} \beta(k) + \begin{bmatrix} -H_{k-s+1} \\ H_{k+1} \end{bmatrix}^{T} \begin{bmatrix} T_{k-s+1} \\ T_{k+1} \end{bmatrix} \qquad (38)$$

Then,

$$\beta(k+1) = P_{k+1} \begin{bmatrix} H_{k-s+2} \\ \vdots \\ H_{k+1} \end{bmatrix}^{T} \begin{bmatrix} T_{k-s+2} \\ \vdots \\ T_{k+1} \end{bmatrix} = \beta(k) + P_{k+1} \begin{bmatrix} -H_{k-s+1} \\ H_{k+1} \end{bmatrix}^{T} \left( \begin{bmatrix} T_{k-s+1} \\ T_{k+1} \end{bmatrix} - \begin{bmatrix} H_{k-s+1} \\ H_{k+1} \end{bmatrix} \beta(k) \right) \qquad (39)$$

Now, assume the weight matrix and bias of the second hidden layer are $W_1$ and $B_1$ at the $(l+1)$-th step. The expected output matrix of the second hidden layer is

$$H_1 = \begin{bmatrix} T_{k-s+2} \\ \vdots \\ T_{k+1} \end{bmatrix} \beta(k+1)^{+} \qquad (40)$$

FOS-MELM then defines the matrix $W_{HE} = [B_1 \; W_1]$, so the parameters of the second hidden layer can easily be obtained from Eq. (40) and the inverse of the activation function:

$$W_{HE} = g^{-1}(H_1) H_E^{+} \qquad (41)$$

where $H_E^{+}$ is the generalized inverse of $H_E = [\mathbf{1} \; H]^{T}$, $\mathbf{1}$ denotes a one-column vector whose elements are the scalar unit 1, $g^{-1}(x)$ denotes the inverse of the activation function $g(x)$, and $H = \begin{bmatrix} H_{k-s+2} \\ \vdots \\ H_{k+1} \end{bmatrix}$.

With the appropriate activation function $g(x)$, FOS-MELM calculates the actual output of the second hidden layer as

$$H_2 = g(W_{HE} H_E) \qquad (42)$$

Therefore, the output weight matrix is updated as

$$\beta_{new} = H_2^{+} \begin{bmatrix} T_{k-s+2} \\ \vdots \\ T_{k+1} \end{bmatrix} \qquad (43)$$

Now, assume the weight matrix and bias of the third hidden layer are $W_2$ and $B_2$. The expected output matrix of the third hidden layer is

$$H_3 = \begin{bmatrix} T_{k-s+2} \\ \vdots \\ T_{k+1} \end{bmatrix} \beta_{new}^{+} \qquad (44)$$

FOS-MELM then defines the matrix $W_{HE1} = [B_2 \; W_2]$, so the parameters of the third hidden layer can easily be obtained from Eq. (44) and the inverse of the activation function:

$$W_{HE1} = g^{-1}(H_3) H_{E1}^{+} \qquad (45)$$

where $H_{E1}^{+}$ is the generalized inverse of $H_{E1} = [\mathbf{1} \; H_2]^{T}$, $\mathbf{1}$ denotes a one-column vector whose elements are the scalar unit 1, and $g^{-1}(x)$ denotes the inverse of the activation function $g(x)$.

With the appropriate activation function $g(x)$, FOS-MELM calculates the actual output of the third hidden layer as

$$H_4 = g(W_{HE1} H_{E1}) \qquad (46)$$

Therefore, the output weight matrix is updated as

$$\beta_{new1} = H_4^{+} \begin{bmatrix} T_{k-s+2} \\ \vdots \\ T_{k+1} \end{bmatrix} \qquad (47)$$

All of the H matrices ($H_2$, $H_4$) must be normalized into the range $[-0.9, 0.9]$ when the maximum of the matrix is greater than 1 and the minimum of the matrix is less than −1.
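One simple way to perform this rescaling is a linear min-max mapping into [-0.9, 0.9]; the helper below is a sketch of that idea and is our own choice, since the paper does not specify the exact scaling formula.

```python
import numpy as np

def normalize_hidden(H, low=-0.9, high=0.9):
    """Linearly rescale H into [low, high] when its values leave [-1, 1],
    so that the inverse activation stays well defined (one possible scheme)."""
    h_min, h_max = H.min(), H.max()
    if h_max > 1.0 and h_min < -1.0:   # condition stated in the text above
        H = low + (H - h_min) * (high - low) / (h_max - h_min)
    return H
```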

5. Experiments

To verify the actual effect of the proposed algorithm, we performed experiments divided into three parts: regression problems, classification problems and an application in the mineral selection industry. All experiments were conducted in MATLAB 2010b on a computer with a 2.30 GHz i3 CPU.

(1) Regression problems

To test the performance on regression problems, several widely used functions are listed below. We use these functions to generate datasets, from which sufficient training samples are randomly selected and the remaining samples are used for testing. The activation function is the hyperbolic tangent function $g(x) = (1 - e^{-x})/(1 + e^{-x})$. The usual root-mean-square error (RMSE) is used as the measure of training and testing precision.

(1) $f(x) = x^{i}$, with $x \in [0, 25.0]$ and $i \in [1.5, 2.0]$.
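As an illustration of how such a synthetic regression set and the RMSE measure can be produced, consider the sketch below; the dataset size, the 70/30 split and the treatment of the exponent i as a second per-sample input are our assumptions, since the paper does not state them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Function 1: f(x) = x**i with x in [0, 25.0] and i in [1.5, 2.0].
Q = 1000                                     # illustrative dataset size
x = rng.uniform(0.0, 25.0, size=(Q, 1))
i_exp = rng.uniform(1.5, 2.0, size=(Q, 1))   # exponent drawn per sample here
X = np.hstack([x, i_exp])                    # input samples
T = x ** i_exp                               # targets

# Random split into training and testing samples.
perm = rng.permutation(Q)
train_idx, test_idx = perm[:700], perm[700:]
X_train, T_train = X[train_idx], T[train_idx]
X_test, T_test = X[test_idx], T[test_idx]

def rmse(pred, target):
    """Root-mean-square error used as the training/testing precision measure."""
    return np.sqrt(np.mean((pred - target) ** 2))
```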

Table 1. The time of the modeling

                FOS-ELM                         FOS-MELM (three hidden layers)
  L        s=2        s=4        s=6        s=2        s=4        s=6
  20   0.022575   0.024580   0.04028    0.030062   0.033281   0.042141
  40   0.027962   0.029256   0.041835   0.039083   0.046944   0.047719
  60   0.032869   0.036155   0.041653   0.051213   0.054519   0.058239

Table 2. The RMSE of function 1

                FOS-ELM                                                    FOS-MELM (three hidden layers)
                Training RMSE                Testing RMSE                  Training RMSE                   Testing RMSE
  L        s=2      s=4      s=6        s=2      s=4      s=6        s=2       s=4       s=6        s=2       s=4       s=6
  20   7.88e-4  8.97e-4   0.0013    7.35e-4  7.27e-4   0.0014    4.85e-15  2.89e-13  4.07e-16   6.49e-15  2.77e-13  4.65e-16
  40   1.56e-4  8.97e-4   0.0028    7.26e-4  7.26e-4   0.0023    4.34e-18  3.46e-18  3.46e-18   4.49e-18  3.75e-18  3.75e-18
  60   1.56e-4  8.97e-4   0.0013    7.26e-4  7.27e-4   0.0013    6.19e-18  3.81e-17  5.21e-17   5.07e-17  5.07e-17  8.73e-17

(2) Simulation experiment with a styrene polymerization reactor

In this section, the proposed method is applied to the thermally initiated bulk polymerization of styrene in a batch reactor, which exhibits typical nonlinear characteristics. The styrene polymerization reactor is the research object; the purpose is to adjust the reactor temperature so that the conversion rate of the reaction, the number of reaction products and the average chain length at the end of the reaction are close to their optimum values. The reaction mechanism model is as follows:

$$\frac{dy_1}{dt} = \frac{(r_1 + r_2 T_c)^2}{M_m}\,(1 - y_1)^2 \exp\!\left(2 y_1 + 2\chi y_1^2\right)\left(\frac{1 - y_1}{r_1 + r_2 T_c} + \frac{y_1}{r_3 + r_4 T_c}\right) A_m \exp\!\left(-\frac{E_m}{T}\right)$$

$$\frac{dy_2}{dt}\left(1 - \frac{1400\, y_2}{A_w \exp(B/T)}\right) = \frac{1}{1 + y_1}\frac{dy_1}{dt}$$

$$\frac{dy_3}{dt}\left(1 - \frac{y_3}{1500}\right) = \frac{A_w \exp(B/T)}{1 + y_1}\frac{dy_1}{dt} \qquad (48)$$

Table 3. Constant parameters of the mechanism model

  Parameter   Value
  A_m         4.266 × 10^5 l/(mol·s)
  A_w         0.33454
  B           4364.6 K
  E_m         10103.5 cal/mol
  M_m         104 g/mol
  r_1         0.9328 × 10^3 g/mol
  r_2         −0.87902 g/(l·mol)
  r_3         1.0902 × 10^3 g/mol
  r_4         −0.59 g/(l·mol)
  χ           0.33

In Table 3 and Eq. (48), $T$ is the absolute temperature of the reactor, $T_c$ is the temperature, $A_w$ and $B$ are the weight-average chain length and the coefficient of temperature correlation, respectively, and $A_m$ and $E_m$ are the frequency factor and activation energy of monomer polymerization, respectively. Finally, $r_1$-$r_4$ are the density temperature-correction values, and $M_m$ and $\chi$ are the monomer molecular weight and the polymer interaction parameter, respectively. The quality indexes of the product are the conversion rate $y_1$, the dimensionless number-average chain length $y_2$ and the weight-average chain length $y_3$. The endpoint quality index is $y = [y_1(t_f)\; y_2(t_f)\; y_3(t_f)]$, where $t_f$ denotes the endpoint. The total reaction time is set to 400 minutes, and the control variable is the reactor temperature $T$. The process is divided into 20 equal time sections, and the temperature is kept constant within each section; the equation parameters are given in Table 3. Sixty batches of process data are generated from the reactor model. The first 20 batches are taken as training samples to establish the initial model, the next 20 batches are used as update samples, and the last 20 batches are used as test samples to evaluate the prediction performance. Before modeling, the activation function is chosen as the sigmoid function.

Table 4. The time of the modeling

            FOS-ELM               FOS-MELM (three hidden layers)
  L        s=2        s=3        s=2        s=3
  20   0.018271   0.018560   0.025728   0.030546
  40   0.020561   0.021817   0.026452   0.031284
  60   0.019669   0.020893   0.029800   0.043916

Table 5. The RMSE of the simulation experiment of the styrene polymerization reactor

                FOS-ELM                                        FOS-MELM (three hidden layers)
                Training RMSE           Testing RMSE           Training RMSE           Testing RMSE
  L        s=2        s=3          s=2        s=3          s=2        s=3          s=2        s=3
  20   2.023e-05  1.839e-05    3.217e-04  2.435e-04    3.121e-5   7.765e-5     2.423e-04  1.480e-05
  40   3.287e-6   1.643e-7     4.513e-04  2.601e-04    2.995e-6   3.749e-5     1.202e-04  2.001e-04
  60   3.286e-6   1.438e-6     2.311e-04  8.342e-05    2.637e-5   1.719e-5     2.628e-04  3.034e-04

Figure 2. The RMSE of the simulation experiment of the styrene polymerization reactor: the panels show the training and testing RMSE for s=2 and s=3, plotted against the number of hidden-layer nodes (20-60), comparing FOS-ELM and FOS-MELM.

6. Conclusion

In many practical applications, the data change over time, arriving one-by-one or chunk-by-chunk, and often have timeliness. To reflect this timeliness efficiently, the forgetting mechanism discards outdated data during sequential learning to reduce their adverse effect on the network. This paper proposes a novel neural network algorithm, FOS-MELM, which makes the actual hidden-layer output approach the expected hidden-layer output layer by layer and improves the average training and testing performance. The experimental results show that, on the function approximation tasks, FOS-MELM remarkably reduces the average training and testing RMSE. In the simulation experiment on the styrene polymerization reactor, its average testing error is also distinctly lower than that of the other ELM algorithms.


Acknowledgment

This research is supported by the National Natural Science Foundation of China (Grant Nos. 41371437 and 61203214), the Fundamental Research Funds for the Central Universities (N160404008), and the National Twelfth Five-Year Plan for Science and Technology Support (2015BAB15B01), PR China.

References

1. M. Leshno, V. Y. Lin and A. Pinkus. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function[J], Neural Networks, 1993, 6(6): 861-867.
2. K. Hornik. Approximation capabilities of multilayer feedforward networks[J], Neural Networks, 1991, 4(2): 251-257.
3. G. B. Huang, Q. Y. Zhu and C. K. Siew. Extreme learning machine: a new learning scheme of feedforward neural networks[J], 2004 IEEE International Joint Conference on Neural Networks, 2004, 2: 985-990.
4. H. C. Hsin, C. C. Li, M. G. Sun and R. J. Sclabassi. An adaptive training algorithm for back-propagation neural networks[J], IEEE Transactions on Systems, Man, and Cybernetics, 1995, 25(3): 512-514.
5. S. F. Ding, G. Ma, Z. Z. Shi. A rough RBF neural network based on weighted regularized extreme learning machine[J], Neural Processing Letters, 2014, 40: 245-260.
6. J. Schmidhuber. Deep learning in neural networks: an overview[J], Neural Networks, 2015, 61: 85-117.
7. W. C. Yu, F. Z. Zhuang, Q. He, Z. Z. Shi. Learning deep representations via extreme learning machines[J], Neurocomputing, 2015, 149: 308-315.
8. K. Han, D. Yu, I. Tashev. Speech emotion recognition using deep neural network and extreme learning machine[J], INTERSPEECH 2014.
9. R. Nian, B. He, B. Zheng. Extreme learning machine towards dynamic model hypothesis in fish ethology research[J], Neurocomputing, 2014, 128: 273-284.
10. S. Ding, N. Zhang, X. Z. Xu, L. L. Guo and J. Zhang. Deep extreme learning machine and its application in EEG classification[J], Mathematical Computation, 2014, 61(1): 376-390.
11. B. Y. Qu and B. F. Lang. Two-hidden-layer extreme learning machine for regression and classification[J], Neurocomputing, 2016, 175(29): 826-834.
12. S. Tamura and M. Tateishi. Capabilities of a four-layered feedforward neural network: four layers versus three[J], IEEE Transactions on Neural Networks, 1997, 8(2): 251-255.
13. X. W. Liu and L. Wang. Multiple kernel extreme learning machine[J], Neurocomputing, 2015, 149(3): 253-264.
14. Y. Lan, Y. C. Soh and G. B. Huang. Two-stage extreme learning machine for regression[J], Neurocomputing, 2010, 73(16): 3028-3038.
15. S. Scardapane, D. Comminiello, M. Scarpiniti, A. Uncini. Online sequential extreme learning machine with kernels[J], IEEE Transactions on Neural Networks and Learning Systems, 2015, 26(9): 2214-2220.
16. H. J. Rong and G. B. Huang. Online sequential fuzzy extreme learning machine for function approximation and classification problems[J], IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 2009, 39(4): 1067-1072.
17. N. Y. Liang, G. B. Huang. A fast and accurate online sequential learning algorithm for feedforward networks[J], IEEE Transactions on Neural Networks, 2006, 17(6): 1411-1423.
18. Y. Lan, Y. C. Soh, G. B. Huang. Ensemble of online sequential extreme learning machine[J], Neurocomputing, 2009, 72: 3391-3395.
19. J. W. Zhao, Z. H. Wang and D. S. Park. Online sequential extreme learning machine with forgetting mechanism[J], Neurocomputing, 2012, 87(15): 79-89.
20. G. B. Huang, N. Y. Liang. On-line sequential extreme learning machine[J], the IASTED International Conference on Computational Intelligence (CI 2005), 2005.
21. B. Mirza, Z. P. Lin, K. A. Toh. Weighted online sequential extreme learning machine for class imbalance learning[J], Neural Processing Letters, 2013, 38: 465-486.
22. E. Cambria and G. B. Huang. Extreme learning machines: representational learning with ELMs for big data[J], IEEE Intelligent Systems, 2013, 28(6): 30-59.
23. W. T. Zhu, J. Miao, L. Y. Qing and G. B. Huang. Hierarchical extreme learning machine for unsupervised representation learning[Z], 2015 International Joint Conference on Neural Networks, 2016.
24. W. C. Yu, F. Z. Zhuang, Q. He and Z. Z. Shi. Learning deep representations via extreme learning machine[J], Neurocomputing, 2015, 149(3): 308-315.

Highlights

1. This paper proposes a new online sequential ELM algorithm whose network structure has multiple hidden layers.