Optimizing the echo state network with a binary particle swarm optimization algorithm

Heshan Wang, Xuefeng Yan*

Key Laboratory of Advanced Control and Optimization for Chemical Processes of Ministry of Education, East China University of Science and Technology, Shanghai 200237, PR China

* Corresponding author at: P.O. Box 293, MeiLong Road No. 130, Shanghai 200237, PR China. Tel./fax: +86 21 64251036. E-mail address: [email protected] (X. Yan).
Article info

Article history: Received 10 November 2014; Received in revised form 2 June 2015; Accepted 6 June 2015; Available online xxxx

Keywords: Echo state network; Reservoir computing; Artificial neural network; Binary particle swarm optimization algorithm; Time-series prediction
Abstract

The echo state network (ESN) is a novel and powerful method for the temporal processing of recurrent neural networks. It has tremendous potential for solving a variety of problems, especially real-valued, time-series modeling tasks. However, its complicated topology and random reservoir are difficult to implement in practice. For instance, the reservoir must be large enough to capture all data features given that it is generated randomly. To reduce network complexity and to improve generalization ability, we present a novel optimized ESN (O-ESN) based on binary particle swarm optimization (BPSO). Because the optimization of the output weight connection structure is a feature selection problem, and PSO has proven to be a promising method for feature selection problems, BPSO is employed to determine the optimal connection structure of the output weights in the O-ESN. First, we establish and train an ESN with sufficient internal units using training data. The connection structure of the output weights, i.e., connection or disconnection, is then optimized through BPSO with validation data. Finally, the performance of the O-ESN is evaluated on test data. This performance is demonstrated on three different types of problems, namely, a system identification task and two time-series benchmark tasks. Results show that the O-ESN outperforms the classical feature selection method, least angle regression (LAR), and that its architecture is simpler than that obtained with LAR. © 2015 Elsevier B.V. All rights reserved.
1. Introduction

Recently, reservoir computing (RC) [1,2] has drawn much attention from the machine learning community as a novel recurrent neural network (RNN) paradigm. RC differs from other algorithms in two important ways: first, a large and untrained dynamic reservoir is used; second, the desired output function is usually implemented through a linear, memory-less mapping of the full instantaneous state of a dynamical system. Some examples of popular RC methods are echo state networks (ESNs) [3-5], liquid state machines [6], back-propagation decorrelation neural networks [7], and evolution of recurrent systems with linear outputs [8]. In this study, we focus on the ESN approach, which is among the simplest effective RC forms. The ESN is characterized by the use of an RNN with a fixed, untrained reservoir and a simple linear readout. The reservoir contains a large number of randomly and sparsely connected neurons. The sole trainable component is the readout weights, which can be
[email protected] (X. Yan).
obtained through simple linear regression. ESNs have been applied successfully to a wide range of real-world domains, including time-series prediction [9,10], batch bioprocess modeling [11], nonlinear dynamic system identification [12,13], speech processing [14], mobile traffic forecasting [15], gas turbine prediction [16], stock price prediction [17], and language modeling [18]. However, the ESN is occasionally criticized for its black-box nature: the reservoir connectivity and weight structure are generated randomly beforehand; thus, establishing an optimal reservoir for a given task remains an open issue [19]. An ESN transforms an incoming time-series signal into a high-dimensional state space, and not all dimensions may contribute to the solution. The internal layer of the ESN is sparsely connected; hence, the fact that each output node is connected to all internal nodes seems contradictory [20]. Therefore, the output connections of the ESN should be optimized. Many researchers have recently focused their efforts on new ways to optimize the architecture of artificial neural networks, including pruning [21,22], constructive [23], and evolutionary [24-26] methods. Constructive algorithms start training with a small network and incrementally add hidden nodes during training when the network cannot reduce the training error. However, there are also some issues that constructive algorithms
need to overcome. For example, when a network is grown, there is no guarantee that all of the added hidden nodes are properly trained. Pruning algorithms start with an over-sized network and remove unnecessary network parameters, either during training or after convergence to a local minimum. Unfortunately, one disadvantage of pruning algorithms is their heavy computational burden, since the majority of the training time is spent on networks that are larger than necessary. Evolutionary learning algorithms have recently shown great capability in training feed-forward neural networks. In [27], Mirjalili et al. proposed a hybrid particle swarm optimization and gravitational search algorithm to train feed-forward neural networks in order to reduce the problems of trapping in local minima and the slow convergence rate of current evolutionary learning algorithms. Dutoit et al. [28] treat the optimization of the output weight connection structure as a feature selection problem and applied several classical feature selection methods, such as all subsets, forward selection, backward elimination, and least angle regression (LAR), to investigate how pruning some connections from the reservoir to the output layer can increase the generalization capability of reservoir computing. Kobialka and Kayani use a greedy feature selection algorithm to exclude irrelevant internal ESN states [20]. The optimization of the connection structure of the output weights is a problem of deciding whether internal and output layer nodes are connected; thus, it is a discrete optimization problem. Moreover, the dimension of the optimization variable is equal to the number of internal neurons, which (generally 200-1000 neurons) must be large enough to capture all data features given that the reservoir is generated at random. In sum, the optimization of the connection structure of the output weights of the ESN is a discrete, high-dimensional, complex, and strongly nonlinear feature selection problem. Existing feature selection approaches, such as greedy search algorithms, suffer from a variety of problems, such as stagnation in local optima and high computational cost [29]. Therefore, an efficient global search technique is needed to address feature selection problems. Many studies report that evolutionary computation algorithms solve such problems effectively. Evolutionary computation techniques are well known for their global search ability and have been applied to feature selection; they include particle swarm optimization (PSO) [30,31] and genetic algorithms (GAs) [32]. Compared with GAs, PSO is easier to implement, has fewer parameters, is computationally less expensive, and can converge more quickly [33]. Because of these advantages, PSO has become a promising method for feature selection problems. PSO is a population-based optimization technique that emulates the social behavior of animals, such as the swarming of insects, the flocking of birds, and the schooling of fish, when searching for food in a collaborative manner.
This technique was originally designed and introduced by Eberhart and Kennedy [34,35] and has been widely applied in fields such as optimal control and design [36,37], biomedical applications [38-40], clustering and classification [41,42], electronics and electromagnetics [43,44], bi-level pricing problems in supply chains [45], and modeling [46-48]. The original PSO operates in continuous space, where trajectories are defined as changes in position over a number of dimensions. However, many optimization problems occur in discrete space, and the traditional PSO cannot solve binary combinatorial optimization problems, such as structural topology optimization. Therefore, Kennedy and Eberhart [49] introduced a discrete binary version of PSO that can be used on discrete binary variables. In continuous PSO, trajectories are defined as changes in position over a number of dimensions. By contrast, binary particle swarm optimization
(BPSO) trajectories are changes in the probability that a coordinate will take on a value of zero or one. BPSO has been used in many applications, such as instance selection for time-series classification [50], ear detection [51], and feature selection [52]. Some flaws still exist in the original BPSO, such as entrapment in local minima and slow convergence. Ref. [53] proposed a binary version of the bat algorithm to improve on BPSO, and the results show that the binary bat algorithm significantly improves performance on the majority of the benchmark functions. Ref. [54] proposed a hybrid PSO and gravitational search algorithm (GSA), and the results show that this hybrid algorithm outperforms both PSO and GSA in terms of exploration and exploitation. Ref. [55] presented a new discrete BPSO that addresses the difficulties associated with the original BPSO. To improve the generalization performance of the ESN and to simplify its structure, the current paper introduces an optimized ESN (O-ESN) that utilizes the new BPSO algorithm [55] to optimize the structure of the connections from the reservoir to the output layer of the ESN. The remainder of this article is organized as follows: Section 2 provides a brief overview of ESN design and training. Section 3 presents a short review of the BPSO algorithm. Section 4 discusses the experimental results. Finally, Section 5 presents a brief conclusion.

2. Echo state network

2.1. Architecture of the ESN

An ESN is composed of three parts, as illustrated in Fig. 1: the left component has K input neurons, the internal (reservoir) part has N reservoir neurons, and the right component has L output neurons. The reservoir state s(t) and output o(t) at discrete time step t are described by [56]:
$$s(t) = f\big(W^{su} u(t) + W^{ss} s(t-1) + W^{so} o^{T}(t-1)\big) \qquad (1)$$

$$o(t) = s^{T}(t)\, W^{os} \qquad (2)$$
where f is the reservoir activation function (typically a hyperbolic tangent or another sigmoidal function), and u(t), s(t), and o(t) are the input, reservoir state, and output at discrete time step t, respectively. The connection weights between the input neurons and the reservoir are collected in an N × K weight matrix W^{su}.
Fig. 1. The basic architecture of the ESN, with K input units u(t), N internal (reservoir) units s(t), and L output units o(t). Dashed arrows indicate connections that are trained in the ESN approach. Shaded arrows indicate feedback connections that are possible but not required. Black solid arrows indicate connections that are randomly created and fixed during training.
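For concreteness, the reservoir construction and the state update of Eqs. (1)-(3) can be sketched in a few lines of NumPy. This is only an illustration, not the authors' code: the sizes, the 10% reservoir density, and the 0.8 scaling factor are assumptions, and feedback connections are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, L = 1, 200, 1                            # input, reservoir, output sizes (assumed)

# Random, fixed weights drawn from a uniform distribution (see Section 2.2)
W_su = rng.uniform(-1, 1, (N, K))              # input -> reservoir
W_ss = rng.uniform(-1, 1, (N, N))
W_ss[rng.random((N, N)) > 0.1] = 0.0           # sparse reservoir (assumed 10% density)
W_ss *= 0.8 / np.max(np.abs(np.linalg.eigvals(W_ss)))   # Eq. (3): spectral radius below 1

def esn_states(u_seq, W_su, W_ss):
    """Run the reservoir over an input sequence, Eq. (1) without feedback,
    and return the collected states (one row per time step)."""
    s = np.zeros(W_ss.shape[0])
    states = []
    for u in u_seq:
        s = np.tanh(W_su @ np.atleast_1d(u) + W_ss @ s)
        states.append(s.copy())
    return np.array(states)
```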
The connections between reservoir neurons are collected in an N × N weight matrix W^{ss}. The connection weights between the reservoir neurons and the output neurons are collected in an L × N weight matrix W^{os}, and the feedback weights are collected in an N × L weight matrix W^{so}. s(t) is initialized as a zero vector, and T denotes the transpose. The architecture without feedback connections (open loop) can make one-step-ahead predictions, whereas that with feedback connections (closed loop) can make iterative (multi-step-ahead) predictions.

2.2. ESN training

As discussed previously, W^{ss}, W^{so}, and W^{su} are fixed prior to training and are assigned random values drawn from a uniform distribution. The readout matrix W^{os} alone is trained. The reservoir weight matrix W^{ss} must be a sparse matrix with a spectral radius below 1 to maintain the echo state property. The reservoir connection matrix W^{ss} is typically scaled as
$$W^{ss} \leftarrow \alpha\, W^{ss} / |\lambda_{max}| \qquad (3)$$
where |λ_max| is the spectral radius of W^{ss} and 0 < α < 1 is a scaling parameter. A typical ESN does not include the feedback connections. While an ESN with feedback connections can iteratively predict time-series problems, this prediction process increases computational complexity and can induce stability problems. In an ESN without feedback connections, stability is controlled by the spectral radius of the reservoir matrix W^{ss}, which is maintained below 1. By contrast, the stability of an ESN with feedback connections depends not only on the reservoir matrix but also on the feedback and readout matrices [57]; in that case, the spectral radius of the matrix W^{ss} + W^{so} W^{os} affects the stability of the ESN. During training, the obtained reservoir states are collected in a state matrix S as follows:
$$S = \begin{bmatrix} s^{T}(1) \\ s^{T}(2) \\ \vdots \\ s^{T}(n) \end{bmatrix} \qquad (4)$$

and the corresponding target outputs are collected in a target output matrix O:

$$O = \begin{bmatrix} o(1) \\ o(2) \\ \vdots \\ o(n) \end{bmatrix} \qquad (5)$$

where n is the number of training samples. The readout matrix should then solve the linear regression problem

$$S\, W^{os} = O \qquad (6)$$

The traditional method uses the least squares solution

$$W^{os} = \arg\min_{w} \| S w - O \|^{2} \qquad (7)$$

where ||·|| denotes the Euclidean norm, and the readout matrix W^{os} is given by

$$W^{os} = (S^{T} S)^{-1} S^{T} O \qquad (8)$$

In [28], Dutoit et al. applied a regularization method called ridge regression (RR) [58] to calculate the readout matrix. This shrinkage method adds a penalty term proportional to the Euclidean norm of the readout matrix:

$$W^{os} = \arg\min_{w} \left( \| S w - O \|^{2} + \lambda \| w \|^{2} \right) \qquad (9)$$

where λ ≥ 0 is the ridge parameter, determined on a hold-out validation set. The solution for the readout matrix then becomes

$$W^{os} = (S^{T} S + \lambda^{2} I_{N})^{-1} S^{T} O \qquad (10)$$

where I_N is the identity matrix of size N. Both offline algorithms (pseudo-inverse) and online algorithms (recursive least squares) can be used to train the ESN. In the current study, the readout matrix of all proposed ESNs is calculated with RR because it is effective and can also overcome the over-fitting issue [59].

2.3. Least angle regression (LAR)

LAR is a stylized version of the stagewise procedure that uses a simple mathematical formula to accelerate computations [60]. This method generates estimates through Ô(t) = S W^{os}, where Ô(t) is the actual readout output matrix. The main steps of LAR are as follows. As with classic forward selection, LAR starts with an empty set of readout connections; at each step, among the neurons that are not yet connected to the readout, the neuron whose activity s(i) is most correlated with the current error O(t) - Ô(t) is connected to the readout. The largest possible step in the direction of this predictor is taken until another state s(j) is equally correlated with the current residual. Instead of continuing along s(i), LAR then proceeds in a direction that is equiangular to s(i) and s(j) until a third reservoir state s(m) earns its way into the most correlated set. LAR then proceeds equiangularly among s(i), s(j), and s(m), along the ''least angle direction'', until a fourth variable enters, and so on. Consequently, the weights of the current readout connections are reduced and changed. However, the weights of the current readout connections are merely increased, instead of solving the original least squares problem with the current set of readout connections, until another readout connection can be added. Therefore, this algorithm naturally combines pruning and regression. In LAR, an inverse matrix must be calculated ([58] describes this process in detail); therefore, its computation is occasionally hampered by a singular matrix.
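To illustrate the readout training of Eqs. (4)-(10), the ridge regression solution can be written compactly as follows. This is a sketch under the assumption that the reservoir states have already been harvested (for example with the state-collection loop sketched in Section 2.1); the function names and the grid of candidate ridge values are ours, not the authors'.

```python
import numpy as np

def train_readout(S, O, ridge=1e-6):
    """Ridge regression readout, Eq. (10): W_os = (S^T S + lambda^2 I)^(-1) S^T O."""
    N = S.shape[1]
    return np.linalg.solve(S.T @ S + ridge**2 * np.eye(N), S.T @ O)

def select_ridge(S_train, O_train, S_val, O_val, grid=(1e-8, 1e-6, 1e-4, 1e-2, 1.0)):
    """Pick the ridge parameter on a hold-out validation set, as described above."""
    best = None
    for lam in grid:
        W = train_readout(S_train, O_train, lam)
        err = np.mean((S_val @ W - O_val) ** 2)
        if best is None or err < best[0]:
            best = (err, lam, W)
    return best[1], best[2]
```

Solving the regularized normal equations directly, rather than inverting the matrix explicitly, is the usual numerically safer choice and matches the offline (batch) training used in this study.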
3. Regulation of the ESN based on the BPSO algorithm

3.1. Review of BPSO

The PSO algorithm, based on swarm intelligence theory, is an evolutionary computational technique. It was first proposed by Kennedy and Eberhart in 1995 [34]. Subsequently, they developed a discrete binary version of PSO in 1997, namely, BPSO, which is used in practice to solve combinatorial optimization problems. The continuous PSO technique conducts searches using a population of particles, each of which represents a candidate solution to the problem. The velocity and the position of each particle in a d-dimensional space are modified with the following equations:
$$v_{id}(t+1) = w\, v_{id}(t) + c_{1}\, rand()\,\big(pbest_{id} - x_{id}(t)\big) + c_{2}\, rand()\,\big(gbest_{d} - x_{id}(t)\big) \qquad (11)$$

$$x_{id}(t+1) = x_{id}(t) + v_{id}(t+1) \qquad (12)$$
where rand() is a random number drawn uniformly from the range [0, 1]; pbest_{id} denotes the dth component of the personal best of the ith particle; gbest_d denotes the dth component of the global best position; c_1 and c_2 are positive constants; w is the inertia weight; x_i = (x_{i1}, x_{i2}, ..., x_{iD}) represents the position of the ith
particle, and v_i = (v_{i1}, v_{i2}, ..., v_{iD}) corresponds to the velocity of particle i. In BPSO, a particle moves in a state space restricted to 0 and 1 in each dimension, because the moving velocity is defined in terms of changes in the probability that a bit will be in one state or the other. The velocity is mapped to a probability through Eq. (13), and the position is updated according to Eq. (14):
$$S(v_{id}) = \mathrm{sigmoid}(v_{id}) = \frac{1}{1 + e^{-v_{id}}} \qquad (13)$$
$$\text{if } rand() < S(v_{id}(t+1)) \text{ then } x_{id}(t+1) = 1; \text{ else } x_{id}(t+1) = 0 \qquad (14)$$

where S(v_{id}) is a logistic transformation and rand() is a quasi-random number selected from a uniform distribution in [0.0, 1.0]. In the continuous PSO, v_{id} is limited by the value v_{max}; in the discrete PSO, v_{id} is likewise limited to the range [-v_{max}, v_{max}].

3.2. Improved BPSO algorithm

The original BPSO has an obvious disadvantage related to the sigmoid function (Eq. (13)). In the standard PSO there is no difference between a large value of v_{id} in the positive direction and one in the negative direction; it merely indicates that a greater movement from the previous position is required. In binary PSO, however, a difference is introduced: increasing the velocity in the positive direction increases the probability that the particle position takes the value 1, whereas increasing it in the negative direction pushes the probability toward 0. Moreover, in the standard PSO, when the particle velocity for a particular dimension goes to zero, the particle has a suitable position in that dimension; in BPSO with the sigmoid function, the position may still change, with x_{id} taking the value 1 or 0 with probability 0.5. The transfer function is therefore the main component and the most important part of the BPSO algorithm, and the effectiveness of new transfer functions is assessed in terms of avoiding local minima and convergence speed. A transfer function should thus be selected very carefully for binary algorithms, because failure to do so may significantly degrade performance [61]. Transfer functions force particles to move in a binary space; the transfer function in a binary algorithm is responsible for mapping a continuous search space to a discrete one. In [62], Mirjalili and Lewis introduced and evaluated six s-shaped and v-shaped transfer functions, and the results show that the v-shaped family of transfer functions significantly improves the performance of the original binary PSO. In [55], Nezamabadi-pour et al. proposed a new BPSO to overcome a limitation of the original BPSO, in which the algorithm does not converge effectively. The new probability function is defined as follows:
$$\hat{S}(v_{id}) = 2\,\big|\mathrm{sigmoid}(v_{id}) - 0.5\big| \qquad (15)$$
Substituting Eq. (15) into Eq. (14) gives the new position update rule:

$$\text{if } rand() < \hat{S}(v_{id}(t+1)) \text{ then } x_{id}(t+1) = exchange(x_{id}(t)); \text{ else } x_{id}(t+1) = x_{id}(t) \qquad (16)$$

3.3. Regulate ESN with BPSO

In this study, the improved BPSO algorithm is used to optimize the connection structure of the output weights in the ESN. The connection and disconnection status of an output weight connection is encoded as 1 and 0, respectively. The optimization variable is therefore a binary vector corresponding to the connection status of the output weights of the ESN, and the dimension of each particle is the number of internal nodes. If x_i represents the connection of the ith reservoir node with the output layer, then x_i = 0 represents the disconnection status and x_i = 1 the connection status, and x = [x_1, x_2, ..., x_N] denotes the connection status of all reservoir nodes with respect to the output layer. In this study, the objective function is the normalized mean square error (NMSE):

$$NMSE = \frac{\sum_{t=1}^{N} \big(o(t) - \hat{o}(t)\big)^{2}}{N \sigma^{2}} \qquad (17)$$

where ô(t) is the actual readout output, o(t) is the desired output, σ² denotes the variance of o(t), and N is the total number of samples of o(t). The objective can also be the normalized root mean square error (NRMSE):

$$NRMSE = \sqrt{\frac{\sum_{t=1}^{N} \big(o(t) - \hat{o}(t)\big)^{2}}{N \sigma^{2}}} \qquad (18)$$

The NMSE or NRMSE of the ESN is minimized through BPSO. First, an ESN with sufficient internal units is established and trained using training data. The connection structure of its output weights is then optimized through BPSO with validation data. Finally, the performance of the O-ESN is evaluated using test data. The input, internal, and feedback connections are fixed throughout the entire training and testing time; thus, only the readout connections can be regulated. The ESN can be trained both offline and online by minimizing a given loss function. In most cases, we evaluate model performance through the NMSE; nonetheless, it is evaluated using the NRMSE in some cases. The procedure for the O-ESN with the BPSO algorithm can be summarized as follows:

1. Data are divided into three categories, namely, training, validation, and testing data.

2. An ESN is established with sufficient internal units to calculate the initial error. The internal units are standard sigmoid units with transfer function f = tanh. The input weights W^{su} are generated at random from a uniform distribution over the interval [-1, 1]. The feedback weights W^{so} are derived in a similar manner if feedback connections exist.

3. The established ESN is trained using RR with the training data. The ridge parameter is determined based on the validation set. The initial validation error NMSE_IV, the initial test error NMSE_IT, and the validation reservoir state matrix S_IV are determined.

4. The error function NMSE or NRMSE is used as the objective function, and the optimization objective is to minimize this error function. The initial population X^0 = [x^0_1, x^0_2, ..., x^0_n]^T is generated, where n is the number of particles. p_ibest, p_gbest, P^t_ibest, and P^t_gbest are initialized, where P^t_ibest denotes the personal best objective value of the ith particle and P^t_gbest denotes the global best objective value of the swarm; P^0_ibest = P^0_gbest = min(F(x^0_i)). The number of internal units N is the dimension of each particle. The positions of the particles are randomly initialized within the hypercube, with the elements of x^t_i randomly selected from the binary values 0 and 1.

5. The new reservoir state matrix S_N is calculated as

$$S_{N} = S_{IV}\, X^{t} \qquad (19)$$

and the new readout matrix is

$$W^{os} = (S_{N}^{T} S_{N} + \lambda^{2} I_{N})^{-1} S_{N}^{T} O \qquad (20)$$

The validation performance F(X^t) (NMSE or NRMSE) of each particle is evaluated at its current position x^t_i, where
Fig. 2. Global best of validation performance on the ESN test function for the 10th-order NARMA task in the optimizing process of the O-ESN when N = 200, 300, 500.
t is the number of iterations. The objective function F(X^t) is computed using the validation data.

6. The performance of each individual is compared with its best validation performance: if F(x^t_i) < P^t_ibest, then P^t_ibest = F(x^t_i) and p_ibest = x^t_i.

7. The validation performance of each particle is compared with that of the global best particle: if F(x^t_i) < P^t_gbest, then P^t_gbest = F(x^t_i) and p_gbest = x^t_i.

8. The velocity and the position of each particle are updated according to Eqs. (15) and (16).

9. Steps 5-8 are repeated until convergence is achieved.

10. The best particle according to the validation performance is determined. The O-ESN with this best particle is tested, and the testing performance NMSE_T is recorded.
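The procedure above can be condensed into a short sketch. This is our illustrative reading of steps 4-9, not the authors' implementation; the population size, iteration count, inertia weight, and velocity bound below are placeholder assumptions (see Table A1 for the settings actually used in the experiments).

```python
import numpy as np

def nbpso_select_outputs(S_val, O_val, ridge, n_particles=30, n_iter=100,
                         w=0.6, c1=2.0, c2=2.0, v_max=6.0, seed=0):
    """Select output connections (a binary mask over reservoir units) by minimizing
    the validation NMSE, using the improved BPSO update of Eqs. (15) and (16)."""
    rng = np.random.default_rng(seed)
    n_units = S_val.shape[1]

    def fitness(mask):
        Sm = S_val * mask                                   # mask reservoir states, Eq. (19)
        W = np.linalg.solve(Sm.T @ Sm + ridge**2 * np.eye(n_units), Sm.T @ O_val)  # Eq. (20)
        err = Sm @ W - O_val
        return np.sum(err**2) / (len(O_val) * np.var(O_val))    # NMSE, Eq. (17)

    X = rng.integers(0, 2, (n_particles, n_units)).astype(float)  # random binary positions
    V = rng.uniform(-1, 1, (n_particles, n_units))
    pbest, pbest_f = X.copy(), np.array([fitness(x) for x in X])
    g = np.argmin(pbest_f)
    gbest, gbest_f = pbest[g].copy(), pbest_f[g]

    for _ in range(n_iter):
        r1, r2 = rng.random(X.shape), rng.random(X.shape)
        V = np.clip(w * V + c1 * r1 * (pbest - X) + c2 * r2 * (gbest - X), -v_max, v_max)
        prob = 2.0 * np.abs(1.0 / (1.0 + np.exp(-V)) - 0.5)      # transfer function, Eq. (15)
        flip = rng.random(X.shape) < prob                        # flip the bit, Eq. (16)
        X = np.where(flip, 1.0 - X, X)
        f = np.array([fitness(x) for x in X])
        better = f < pbest_f
        pbest[better], pbest_f[better] = X[better], f[better]
        if pbest_f.min() < gbest_f:
            g = np.argmin(pbest_f)
            gbest, gbest_f = pbest[g].copy(), pbest_f[g]
    return gbest, gbest_f
```

Here S_val and O_val would be the validation reservoir states and targets harvested from the trained ESN; the returned mask indicates which reservoir units keep their readout connections.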
Table 1
Global best of testing performance on the ESN test function for the 10th-order NARMA task in the optimizing process of the O-ESN and LAR when N = 200, 300, 500 (10 runs per setting).

Reservoir size N = 200 (mean N: 171 with LAR, 141 with NBPSO; optimizing ability percentage of NBPSO: 12.11%)
Initial testing NMSE_IT: 0.0499, 0.0503, 0.0525, 0.0433, 0.0484, 0.0479, 0.0515, 0.0488, 0.0473, 0.0469 (mean 0.0487)
Testing NMSE_T of O-ESN (LAR): 0.0444, 0.0422, 0.0485, 0.0412, 0.0415, 0.0419, 0.0465, 0.0451, 0.0443, 0.0432 (mean 0.0439)
Testing NMSE_T of O-ESN (NBPSO): 0.0436, 0.0417, 0.0473, 0.0397, 0.0406, 0.0412, 0.0459, 0.0435, 0.0423, 0.0420 (mean 0.0428)

Reservoir size N = 300 (mean N: 215 with LAR, 202 with NBPSO; optimizing ability percentage of NBPSO: 18.54%)
Initial testing NMSE_IT: 0.0449, 0.0512, 0.0455, 0.0420, 0.0409, 0.0463, 0.0503, 0.0447, 0.0435, 0.0440 (mean 0.0453)
Testing NMSE_T of O-ESN (LAR): 0.0366, 0.0388, 0.0395, 0.0373, 0.0327, 0.0371, 0.0425, 0.0380, 0.0370, 0.0345 (mean 0.0374)
Testing NMSE_T of O-ESN (NBPSO): 0.0361, 0.0376, 0.0386, 0.0366, 0.0321, 0.0369, 0.0422, 0.0372, 0.0371, 0.0347 (mean 0.0369)

Reservoir size N = 500 (mean N: 332 with LAR, 282 with NBPSO; optimizing ability percentage of NBPSO: 17.22%)
Initial testing NMSE_IT: 0.0489, 0.0457, 0.0287, 0.0521, 0.0436, 0.0422, 0.0389, 0.0373, 0.0294, 0.0513 (mean 0.0418)
Testing NMSE_T of O-ESN (LAR): 0.0372, 0.0425, 0.0302, 0.0440, 0.0391, 0.0392, 0.0354, 0.0339, 0.0280, 0.0409 (mean 0.0370)
Testing NMSE_T of O-ESN (NBPSO): 0.0351, 0.0390, 0.0270, 0.0420, 0.0373, 0.0361, 0.0323, 0.0309, 0.0263, 0.0404 (mean 0.0346)
Fig. 3. A fragment of the Laser time series.
4. Experiments and results

In this section, the performance of the O-ESN is evaluated on three tasks, namely, a NARMA system, the Santa Fe laser time series, and the Mackey-Glass time series, which are widely used in ESN studies [2,63,64]. The results are compared with those of the LAR method and with those of the un-optimized ESN. All readout mappings were fitted using offline RR. For the ESN, we calculated out-of-sample (test set) performance measures over 10 simulation runs because of the randomly created reservoirs.
Fig. 4. Global best of validation performance on the ESN test function for the laser task in the optimizing process of the O-ESN when N = 200, 300, 500.
Each run is in turn based on 10 repetitions of BPSO because of the stochastic nature of BPSO. To quantify the optimization ability, we define an optimizing ability percentage (OP):

$$OP = \frac{NMSE_{IT} - NMSE_{T}}{NMSE_{IT}} \qquad (21)$$
where NMSE_IT is the NMSE before optimization and NMSE_T is the NMSE after optimization. For Wilcoxon's rank sum test [65], the ''+'', ''-'', and ''≈'' marks denote that the performance of NBPSO is significantly better than, significantly worse than, and almost similar
to the compared algorithm, respectively. The parameter settings of the selected model representatives are detailed in Appendix A.
4.1. Experimental tasks and results

4.1.1. Nonlinear auto-regressive moving average (NARMA) system

In this system, the current output depends on both the current input and the output history. The task was introduced in [66] and is given by Eq. (22):
$$o(t+1) = 0.3\,o(t) + 0.05\,o(t)\sum_{i=0}^{9} o(t-i) + 1.5\,u(t-9)\,u(t) + 0.1 \qquad (22)$$

where o(t) is the system output at time t and u(t) is the system input at time t, an i.i.d. stream with values in the range [0, 0.5]. The current output depends on both the input and the previous outputs. In general, this system is difficult to model because of its nonlinearity and potentially long memory. The networks were trained on the system identification task of producing o(t) from u(t). The NARMA sequence used in this study is 3600 items long, of which the first 1200 items are used for training, the next 1200 items as a hold-out validation set, and the remaining 1200 items for testing. The first 200 values of the training and test sequences were used as an initial washout period. The initial reservoir sizes (N) of the ESN for the 10th-order NARMA task are 200, 300, and 500. This task does not require feedback connections. The validation performance of the O-ESN given different reservoir sizes is depicted in Fig. 2, whereas the testing performance levels of the O-ESN and LAR are presented in Table 1.

Table 2
Global best of testing performance on the ESN test function for the laser task in the optimizing process of the O-ESN and LAR when N = 200, 300, 500 (10 runs per setting; ''+'' marks that NBPSO is significantly better than the compared algorithm under Wilcoxon's rank sum test).

Reservoir size N = 200 (mean N: 175 with LAR, 122 with NBPSO; optimizing ability percentage of NBPSO: 49.64%)
Initial testing NMSE_IT: 0.0285, 0.0289, 0.0297, 0.0265, 0.0288, 0.0285, 0.0277, 0.0290, 0.0286, 0.0281 (mean 0.0284)
Testing NMSE_T of O-ESN (LAR): 0.0167, 0.0195, 0.0182, 0.0211, 0.0143, 0.0184, 0.0167, 0.0196, 0.0177, 0.0206 (mean 0.0183+)
Testing NMSE_T of O-ESN (NBPSO): 0.0133, 0.0150, 0.0157, 0.0161, 0.0096, 0.0143, 0.0139, 0.0159, 0.0144, 0.0149 (mean 0.0143)

Reservoir size N = 300 (mean N: 205 with LAR, 168 with NBPSO; optimizing ability percentage of NBPSO: 47.41%)
Initial testing NMSE_IT: 0.0281, 0.0272, 0.0276, 0.0270, 0.0259, 0.0261, 0.0273, 0.0287, 0.0263, 0.0261 (mean 0.0270)
Testing NMSE_T of O-ESN (LAR): 0.0150, 0.0156, 0.0168, 0.0188, 0.0186, 0.0145, 0.0164, 0.0174, 0.0194, 0.0187 (mean 0.0171+)
Testing NMSE_T of O-ESN (NBPSO): 0.0135, 0.0139, 0.0147, 0.0158, 0.0158, 0.0120, 0.0145, 0.0134, 0.0141, 0.0142 (mean 0.0142)

Reservoir size N = 500 (mean N: 313 with LAR, 253 with NBPSO; optimizing ability percentage of NBPSO: 45.94%)
Initial testing NMSE_IT: 0.0274, 0.0257, 0.0246, 0.0251, 0.0236, 0.0241, 0.0288, 0.0273, 0.0284, 0.0242 (mean 0.0259)
Testing NMSE_T of O-ESN (LAR): 0.0162, 0.0169, 0.0157, 0.0161, 0.0123, 0.0140, 0.0184, 0.0182, 0.0152, 0.0201 (mean 0.0163+)
Testing NMSE_T of O-ESN (NBPSO): 0.0149, 0.0154, 0.0137, 0.0136, 0.0096, 0.0116, 0.0187, 0.0149, 0.0112, 0.0161 (mean 0.0140)

4.1.2. Santa Fe laser time series

The Santa Fe laser dataset [67] is a cross-cut through the periodic to chaotic intensity pulsations of a real laser. The laser time-series task is derived from the Santa Fe Competition and consists of a one-step prediction on a time series made by sampling the intensity of a far-infrared laser in a chaotic regime (Fig. 3).
However, the task is characterized by some known difficulties, including numerical round-off noise and the presence of different time scales in the time series. Prediction of this time series is particularly difficult around breakdown events. The task is to predict the subsequent value y(t+1) (one-step-ahead prediction). The dataset contains 10,000 values. In this study, the first 6600 values were used for training, the next 1700 as a hold-out validation set, and the remaining 1700 for testing. The first 200 values of the training and testing sequences were used as an initial washout period. The initial reservoir sizes of the ESN for the laser task are 200, 300, and 500. The bias input is a constant value of 0.02. This time-series task requires feedback connections. The validation performance of the O-ESN given different reservoir sizes is depicted in Fig. 4, whereas the testing performance results of the O-ESN and LAR are displayed in Table 2.

4.1.3. Mackey-Glass (MG) time series

The MG time series [68] is a standard benchmark for chaotic time-series prediction and has been used extensively with ESNs. The time series is defined by the following differential equation:
$$\frac{\partial o(t)}{\partial t} = \frac{0.2\, o(t-a)}{1 + o(t-a)^{10}} - 0.1\, o(t) \qquad (23)$$
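As a side note, the MG series is commonly generated by numerically integrating Eq. (23). The sketch below uses a simple Euler discretization with an assumed step size of 0.1 and a constant initial history; it only illustrates how such benchmark data can be produced and does not reproduce the authors' exact dataset.

```python
import numpy as np

def mackey_glass(n_samples, a=17, dt=0.1, subsample=10, x0=1.2):
    """Generate a Mackey-Glass series by Euler integration of Eq. (23).
    `a` is the delay; a = 17 gives the chaotic regime used in this paper."""
    delay_steps = int(a / dt)
    history = [x0] * delay_steps          # constant initial history (assumed)
    x = x0
    out = []
    for i in range(n_samples * subsample):
        x_delay = history[-delay_steps]
        x = x + dt * (0.2 * x_delay / (1.0 + x_delay**10) - 0.1 * x)
        history.append(x)
        if i % subsample == 0:            # keep one sample per unit of time
            out.append(x)
    return np.array(out)
```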
The most common values for a are 17 and 30; when a > 16.8, the system has a chaotic attractor. In [9], Jaeger proposed an
Fig. 5. Validation NRMSE_84 (84-step predictions) for the Mackey-Glass task in the optimizing process of the O-ESN when N = 300, 500, 700.
84-step-ahead prediction model that utilizes a 1000-unit reservoir ESN with feedback connections for the MG time-series task. As in [9], we set a = 17 (the value generally employed in the literature on the MG time series). The reservoir sizes N are 300, 500, and 700. The dataset contains 10,000 values; the first 3000 values are used for training, the next 2000 as a hold-out validation set, and the remaining 4200 values for an 84-step prediction that was reiterated 50 times. The first 1000 steps were discarded to flush out the initial transient. In the current study, the normalized root mean square error at the 84th time step, NRMSE_84, is obtained with
$$NRMSE_{84} = \sqrt{\frac{\sum_{i=1}^{N} \big(\hat{o}_{i}(84) - o_{i}(84)\big)^{2}}{N \sigma^{2}}} \qquad (24)$$
where ô_i(84) and o_i(84) are the predicted and desired output, respectively, at time step 84. The bias input is a constant value of 0.02. This time-series task requires feedback connections. The validation performance of the O-ESN given different reservoir sizes is presented in Fig. 5, whereas the testing performance of the O-ESN is exhibited in Table 3.
Table 3
Global best of testing performance on the ESN test function for the MG task in the optimizing process of the O-ESN when N = 300, 500, 700 (10 runs per setting).

Reservoir size N = 300 (mean N with NBPSO: 245; optimizing ability percentage of NBPSO: 44.72%)
Initial testing NMSE_IT: 0.007521, 0.009873, 0.006297, 0.004325, 0.005225, 0.004338, 0.005349, 0.009159, 0.008224, 0.009482 (mean NMSE_84 = 0.006979)
Testing NMSE_T of O-ESN (NBPSO): 0.003410, 0.004103, 0.002894, 0.003508, 0.005001, 0.003027, 0.002891, 0.003647, 0.005410, 0.004692 (mean NMSE_84 = 0.003858)

Reservoir size N = 500 (mean N with NBPSO: 356; optimizing ability percentage of NBPSO: 42.14%)
Initial testing NMSE_IT: 0.001984, 0.002056, 0.004219, 0.001513, 0.001649, 0.001947, 0.003482, 0.001978, 0.002194, 0.003157 (mean NMSE_84 = 0.002418)
Testing NMSE_T of O-ESN (NBPSO): 0.001106, 0.001111, 0.002743, 0.000703, 0.001183, 0.001209, 0.002245, 0.000811, 0.001535, 0.001348 (mean NMSE_84 = 0.001399)

Reservoir size N = 700 (mean N with NBPSO: 567; optimizing ability percentage of NBPSO: 13.55%)
Initial testing NMSE_IT: 3.756e-4, 2.399e-4, 1.994e-4, 4.891e-4, 2.489e-4, 3.179e-4, 3.149e-4, 4.012e-4, 2.467e-4, 1.914e-4 (mean NMSE_84 = 3.025e-4)
Testing NMSE_T of O-ESN (NBPSO): 3.007e-4, 2.007e-4, 1.014e-4, 4.665e-4, 2.086e-4, 3.056e-4, 3.002e-4, 3.479e-4, 2.047e-4, 1.789e-4 (mean NMSE_84 = 2.615e-4)
Table 4
Comparison of the O-ESN method with other methods.

Task | ESN (N = 700) | SVESM (N = 1200) | SHESN (N = 500) | Elman (N = 10) | GFNN | BRNN | O-ESN (N = 700)
NARMA | NMSE = 0.0239 | – | – | NMSE = 0.781 | – | – | NMSE = 0.0201
Laser | NMSE = 0.0202 | – | NMSE = 0.0135 | NMSE = 0.032 | – | NMSE = 0.0262 | NMSE = 0.0134
Mackey-Glass | NRMSE_84 = 3.4e-4 | NRMSE_84 = 8.5e-3 | NRMSE_84 = 7.65e-3 | NRMSE_84 = 0.787 | NRMSE_84 = 0.562 | NRMSE_84 = 0.164 | NRMSE_84 = 2.59e-4
The results of the LAR method are not presented for the MG task because a singular matrix was encountered in the pruning process. To validate the performance of the proposed method, several other methods, namely the support vector echo-state machine (SVESM) [57], the scale-free highly clustered ESN (SHESN) [69], a boosted RNN (BRNN) [70], a generalized fuzzy neural network (GFNN) [71], and the traditional Elman neural network (Elman), are implemented for comparison. The comparison between the O-ESN (minimum mean testing error) and these methods is presented in Table 4.

4.2. Discussion

The experimental results indicate that using all of the output weight connections does not achieve the best performance. Both BPSO and LAR significantly reduce the number of used reservoir units while improving performance. However, BPSO outperforms LAR on both the NARMA system and the laser dataset. This suggests that the classical linear feature selection method LAR easily falls into local optima, whereas an intelligent algorithm such as BPSO has better global optimization capability. The proposed method selects a relatively small number of output weight connections with which the ESN achieves higher accuracy than with all of the output
weight connections. This suggests that BPSO can find a subset of complementary output weight connections that improves performance, and that BPSO is a more efficient search technique for this feature selection problem than the classical feature selection method. According to Figs. 2, 4, and 5, the proposed method reduces the validation error considerably for a given reservoir size. The global best of the validation performance deteriorates at the beginning of the iterations; however, the performance starts to improve after several iteration steps. This is because BPSO has no selection operation such as that in a GA; nevertheless, owing to the global optimization capability of BPSO, the performance converges over the whole optimization process. According to the optimizing ability percentages in Tables 1-3 and Wilcoxon's rank sum test, the optimization ability of BPSO is better on time-series problems with feedback connections (the laser and MG tasks) than on problems without feedback connections (the NARMA task). According to Table A1, the output activation function for the laser and MG tasks is a nonlinear function, whereas the output activation function for the NARMA task is a linear function. The feature selection problem of the laser and MG tasks is therefore a nonlinear feature selection problem, and that of the NARMA task is a linear one. This shows that the optimization ability of BPSO, relative to LAR, is better on nonlinear feature selection problems.
Moreover, according to Tables 1 and 2, the architecture obtained with BPSO is simpler than that obtained with LAR. As shown in Fig. 5, the validation error appears unstable and oscillates during the optimization process because of the feedback connections and the multi-step prediction. The reservoir is affected by the optimization process when feedback connections are present; as discussed previously, the stability of the ESN with feedback connections depends not only on the reservoir matrix but also on the feedback and readout matrices. The stability of the iterative predictor cannot be guaranteed by simply limiting the spectral radius of W^{ss}. Given W^{ss} + W^{so} W^{os}, the stability of the iterative predictor depends not only on the internal and feedback weights but also on the output weights that result from network training; therefore, stability cannot be ensured before the readout is determined [57]. Furthermore, the need for multi-step-ahead predictions indicates a long-term prediction problem. Unlike one-step time-series prediction, long-term prediction is typically subject to increasing uncertainties from various sources; for instance, the accumulation of errors and the lack of information complicate the prediction process [72]. An effective multi-step predictor requires sufficiently accurate one-step predictions to limit the accumulated error over the iteration. Although the validation error appears unstable and oscillates in the multi-step prediction optimization process, an ideal particle can still be determined during the optimization process, and the testing performance can still be improved.
5. Conclusion

The current study presents an O-ESN in which a large reservoir is used and the connection structure of the readout weights is optimized using BPSO. Three widely used benchmark tasks are performed to demonstrate the performance of the O-ESN and of LAR. The LAR method is not applied to the MG data because the internal state matrix S becomes singular during the pruning process. Although the validation error appears unstable and oscillates in the multi-step prediction optimization process, an ideal particle can still be determined during the optimization process, and the testing performance can still be improved. All of the aforementioned applications indicate that the regulation of readout connections can indeed improve the generalization performance of the ESN. Furthermore, BPSO can help simplify the search for an efficient reservoir. The internal layer of the ESN is sparsely connected; hence, the fact that each output node is connected to all internal nodes seems contradictory. Thus, one can start with a large reservoir (large enough to ensure that the important features of the input data are observed in the reservoir) rather than with an empty set of readout connections as in LAR. One can then eliminate useless readout connections to enhance the ESN prediction model. Finally, one can obtain a relatively simple architecture while keeping the testing error acceptable. In the future, we may investigate other feature selection methods to optimize the output weight connections of the ESN. Because the reservoir connectivity and weight structure of the ESN are created randomly beforehand, they are unlikely to be optimal. In order to find an optimal reservoir for a given task and to improve the performance of the ESN, we will apply several recently proposed algorithms, such as the ions motion algorithm [73], the grey wolf optimizer [74], the ant lion optimizer [75], and the multi-verse optimizer [76], to the ESN in the future.
Appendix A. Parameter values See Table A1.
Table A1
Parameter values for the NBPSO of different tasks.

Task | Initialize particles (N_i) | Iterations | Output activation function | c1, c2 | wmin, wmax
NARMA | 100 | 300 | linear | c1 = 2, c2 = 2 | wmin = 0.1, wmax = 0.6
Laser | 100 | 300 | tanh | c1 = 2, c2 = 2 | wmin = 0.1, wmax = 0.6
Mackey-Glass | 200 | 300 | tanh | c1 = 2, c2 = 2 | wmin = 0.1, wmax = 0.6
References [1] M. LukošEvicˇIus, H. Jaeger, Survey: reservoir computing approaches to recurrent neural network training, Comput. Sci. Rev. 3 (2009) 127–149. [2] D. Verstraeten, B. Schrauwen, M. D’Haene, D. Stroobandt, An experimental unification of reservoir computing methods, Neural Networks 20 (2007) 391– 403. [3] H. Jaeger, The Echo State Approach to Analysing and Training Recurrent Neural Networks, Technology GMD technical report 148, German National Research Center for Information, Germany, 2001. [4] H. Jaeger, M. Lukoševicˇius, D. Popovici, U. Siewert, Optimization and applications of echo state networks with leaky-integrator neurons, Neural Networks 20 (2007) 335–352. [5] Y. Xia, B. Jelfs, M.M. Van Hulle, J.C. Príncipe, D.P. Mandic, An augmented echo state network for nonlinear adaptive filtering of complex noncircular signals, IEEE Trans. Neural Networks 22 (2011) 74–83. [6] W. Maass, T. Natschläger, H. Markram, Real-time computing without stable states: a new framework for neural computation based on perturbations, Neural Comput. 14 (2002) 2531–2560. [7] J.J. Steil, Backpropagation–Decorrelation: Online recurrent learning with O (N) complexity, in: Proceedings of the 2004 IEEE International Joint Conference on Neural Networks, 2004, pp. 843–848. [8] J. Schmidhuber, D. Wierstra, M. Gagliolo, F. Gomez, Training recurrent networks by evolino, Neural Comput. 19 (2007) 757–779. [9] H. Jaeger, H. Haas, Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication, Science 304 (2004) 78–80. [10] Y. Song, Y. Li, Q. Wang, C. Li, Multi-steps prediction of chaotic time series based on echo state network, in: 2010 IEEE Fifth International Conference on BioInspired Computing: Theories and Applications (BIC–TA), 2010 pp. 669–672. [11] H. Wang, X. Yan, Reservoir computing with sensitivity analysis input scaling regulation and redundant unit pruning for modeling fed-batch bioprocesses, Ind. Eng. Chem. Res. 53 (2014) 6789–6797. [12] H. Jaeger, Adaptive nonlinear system identification with echo state networks, in: Advances in Neural Information Processing Systems, 2002, pp. 593–600. [13] S.I. Han, J.M. Lee, Fuzzy echo state neural networks and funnel dynamic surface control for prescribed performance of a nonlinear dynamic system, IEEE Trans. Industr. Electron. 61 (2014) 1099–1112. [14] M.D. Skowronski, J.G. Harris, Minimum mean squared error time series classification using an echo state network prediction model, in: Proceedings of the 2006 IEEE International Symposium on Circuits and Systems, vol. 4, 2006, pp. 3156. [15] Y. Peng, M. Lei, J. Guo, Clustered complex echo state networks for traffic forecasting with prior knowledge, in: Instrumentation and Measurement Technology Conference (I2MTC), 2011, pp. 1–5. [16] X.L. Xu, T. Chen, S.H. Wang, Condition prediction of flue gas turbine based on Echo State Network, in: 2010 Sixth International Conference on Natural Computation (ICNC), 2010, pp. 1089–1092. [17] X. Lin, Z. Yang, Y. Song, Short-term stock price prediction based on echo state networks, Expert Syst. Appl. 36 (2009) 7313–7317. [18] M.H. Tong, A.D. Bickett, E.M. Christiansen, G.W. Cottrell, Learning grammatical structure with echo state networks, Neural Networks 20 (2007) 424–432. [19] M.C. Ozturk, D. Xu, J.C. Príncipe, Analysis and design of echo state networks, Neural Comput. 19 (2007) 111–138. [20] H.U. Kobialka, U. Kayani, Echo state networks with sparse output connections, in: 2010 International Conference on Artificial Neural Networks, 2010, pp. 
356–361. [21] A. Luchetta, Automatic generation of the optimum threshold for parameter weighted pruning in multiple heterogeneous output neural networks, Neurocomputing 71 (2008) 3553–3560. [22] S.H. Yang, Y.P. Chen, An evolutionary constructive and pruning algorithm for artificial neural networks and its prediction applications, Neurocomputing 86 (2012) 140–149. [23] C. MacLeod, G. Maxwell, S. Muthuraman, Incremental growth in modular neural networks, Eng. Appl. Artif. Intell. 22 (2009) 660–666. [24] D. Ballabio, M. Vasighi, V. Consonni, M. Kompany-Zareh, Genetic algorithms for architecture optimisation of counter-propagation artificial neural networks, Chemometr. Intell. Lab. Syst. 105 (2011) 56–64. [25] S. Mirjalili, S.M. Mirjalili, A. Lewis, Let a biogeography-based optimizer train your multi-layer perceptron, Inf. Sci. 269 (2014) 188–209.
[26] S. Mirjalili, How effective is the Grey Wolf optimizer in training multi-layer perceptrons, Appl. Intell. (2014) 1–12. [27] S. Mirjalili, S.ZM. Hashim, H.M. Sardroudi, Training feedforward neural networks using hybrid particle swarm optimization and gravitational search algorithm, Appl. Math. Comput. 218 (2012) 11125–11137. [28] X. Dutoit, B. Schrauwen, J. Van Campenhout, D. Stroobandt, H. Van Brussel, M. Nuttin, Pruning and regularization in reservoir computing, Neurocomputing 72 (2009) 1534–1546. [29] B. Xue, M. Zhang, W.N. Browne, Single feature ranking and binary particle swarm optimisation based feature subset ranking for feature selection, in: Proceedings of the Thirty-fifth Australasian Computer Science Conference, vol. 122, 2012, pp. 27–36. [30] A. Unler, A. Murat, A discrete particle swarm optimization method for feature selection in binary classification problems, Eur. J. Oper. Res. 206 (2010) 528– 539. [31] C.S. Yang, L.Y. Chuang, C.H. Ke, C.H. Yang, Boolean binary particle swarm optimization for feature selection, in: IEEE World Congress on Computational Intelligence Evolutionary Computation, 2008, pp. 2093–2098. [32] H. Yuan, S.S. Tseng, W. Gangshan, Z. Fuyan, A two-phase feature selection method using both filter and wrapper, in: IEEE International Conference on Systems, Man, and Cybernetics, 1999, pp. 132–136. [33] J. Kennedy, W.M. Spears, Matching algorithms to problems: an experimental test of the particle swarm and some genetic algorithms on the multimodal problem generator, in: Proceedings of the IEEE International Conference on Evolutionary Computation, 1998, pp. 78–83. [34] R.C. Eberhart, J. Kennedy, A new optimizer using particle swarm theory, in: Proceedings of the sixth International Symposium on Micro Machine and Human Science, 1995, pp. 39–43. [35] J. Kennedy, J.F. Kennedy, R.C. Eberhart, Swarm Intelligence, Morgan Kaufman, 2001. [36] M. Benedetti, R. Azaro, D. Franceschini, A. Massa, PSO-based real-time control of planar uniform circular arrays, Antennas Wireless Propag. Lett., IEEE 5 (2006) 545–548. [37] M. Donelli, R. Azaro, F.G. Natale, A. Massa, An innovative computational approach based on a particle swarm strategy for adaptive phased-arrays control, IEEE Trans. Antennas Propag. 54 (2006) 888–898. [38] W. Cedefto, D. Agraflotis, Particle swarms for drug design, in: IEEE Congress on Evolutionary Computation, 2005, pp. 1218–1225. [39] W. Cedeño, D.A. Agrafiotis, Comparison of particle swarms techniques for the development of quantitative structure-activity relationship models for drug design, in: IEEE Computational Systems Bioinformatics Conference, 2005, pp. 322–331. [40] N. Khemka, C. Jacob, G. Cole, Making soccer kicks better: a study in particle swarm optimization and evolution strategies, in: The 2005 IEEE Congress on Evolutionary Computation, 2005, pp. 735–742. [41] S.C. Cohen, L.N. Castro, Data clustering with particle swarms, in: IEEE Congress on Evolutionary Computation, 2006, pp. 1792–1798. [42] J. Dai, W. Chen, H. Gu, Y. Pan, Particle swarm algorithm for minimal attribute reduction of decision data tables, in: First International Multi-Symposiums on Computer and Computational Sciences, 2006, pp. 572–575. [43] F. Grimaccia, M. Mussetta, R.E. Zich, Genetical swarm optimization: selfadaptive hybrid evolutionary algorithm for electromagnetics, IEEE Trans. Antennas Propag. 55 (2007) 781–785. [44] C.F. Juang, C.H. Hsu, Temperature control by chip-implemented adaptive recurrent fuzzy controller designed by evolutionary algorithm, IEEE Trans. Circ. 
Syst. I Regul. Pap. 52 (2005) 2376–2384. [45] Y. Gao, G. Zhang, J. Lu, H.M. Wee, Particle swarm optimization for bi-level pricing problems in supply chains, J. Global Optim. 51 (2011) 245–254. [46] T. Amraee, B. Mozafari, A. Ranjbar, An improved model for optimal under voltage load shedding: particle swarm approach, in: IEEE Power India Conference, 2006, 2006, pp. 1632597, http://dx.doi.org/10.1109/POWERI. 2006.1632597. [47] L. Dos Santos Coelho, R.A. Krohling, Nonlinear system identification based on B-spline neural network and modified particle swarm optimization, in: IEEE International Joint Conference on Neural Networks, 2006, pp. 3748–3753. [48] A. Kalos, Modeling MIDI music as multivariate time series, in: IEEE Congress on Evolutionary Computation, 2006, pp. 2058–2064.
[49] J. Kennedy, R.C. Eberhart, A discrete binary version of the particle swarm algorithm, in: IEEE International Conference on Systems, Man, and Cybernetics, 1997, pp. 4104–4108. [50] T. Zhai, Z. He, Instance selection for time series classification based on immune binary particle swarm optimization, Knowl.-Based Syst. 49 (2013) 106–115. [51] M. Ravi Ganesh, R. Krishna, K. Manikantan, S. Ramachandran, Entropy based binary particle swarm optimization and classification for ear detection, Eng. Appl. Artif. Intell. 27 (2014) 115–128. [52] L.Y. Chuang, C.H. Yang, J.C. Li, Chaotic maps based on binary particle swarm optimization for feature selection, Appl. Soft Comput. 11 (2011) 239–248. [53] S. Mirjalili, S.M. Mirjalili, X.S. Yang, Binary bat algorithm, Neural Comput. Appl. 25 (2014) 663–681. [54] S. Mirjalili, G.G. Wang, LdS Coelho, Binary optimization using hybrid particle swarm optimization and gravitational search algorithm, Neural Comput. Appl. 25 (2014) 1423–1435. [55] H. Nezamabadi-pour, M. Rostami Shahrbabaki, M. Maghfoori-Farsangi, Binary particle swarm optimization: challenges and new solutions, CSI J. Comput. Sci. Eng. 6 (2008) 21–32. [56] H. Jaeger, Tutorial on Training Recurrent Neural Networks, Covering BPPT, RTRL, EKF and the Echo State Network Approach, Technical report GMD report 159, German National Research Center for Information Technology, 2002. [57] Z.W. Shi, M. Han, Support vector echo-state machine for chaotic time-series prediction, IEEE Trans. Neural Networks 18 (2007) 359–372. [58] A.E. Hoerl, R.W. Kennard, Ridge regression: Biased estimation for nonorthogonal problems, Technometrics 12 (1970) 55–67. [59] F. Wyffels, B. Schrauwen, D. Stroobandt, Stable output feedback in reservoir computing using ridge regression, in: International Conference on Artificial Neural Networks, 2008, pp. 808–817. [60] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least angle regression, Ann. Stat. 32 (2004) 407–499. [61] S. Saremi, S. Mirjalili, A. Lewis, How important is a transfer function in discrete heuristic algorithms, Neural Comput. Appl. (2014) 1–16. [62] S. Mirjalili, A. Lewis, S-shaped versus V-shaped transfer functions for binary particle swarm optimization, Swarm Evol. Comput. 9 (2013) 1–14. [63] B. Schrauwen, M. Wardermann, D. Verstraeten, J.J. Steil, D. Stroobandt, Improving reservoirs using intrinsic plasticity, Neurocomputing 71 (2008) 1159–1171. [64] J.J. Steil, Online reservoir adaptation by intrinsic plasticity for backpropagation–decorrelation and echo state learning, Neural Networks 20 (2007) 353–364. [65] F. Wilcoxon, Individual comparisons by ranking methods, Biometr. Bull. (1945) 80–83. [66] A.F. Atiya, A.G. Parlos, New results on recurrent network training: unifying the algorithms and accelerating convergence, IEEE Trans. Neural Networks 11 (2000) 697–709. [67] A.S. Weigend, N. Gershenfeld, Time Series Prediction: Forecasting the Future and Understanding the Past, Addison-Wesley, 1994. [68] M.C. Mackey, L. Glass, Oscillation and chaos in physiological control systems, Science 197 (1977) 287–289. [69] Z. Deng, Y. Zhang, Collective behavior of a small-world recurrent neural system with scale-free distribution, IEEE Trans. Neural Networks 18 (2007) 1364–1375. [70] M. Assaad, R. Boné, H. Cardot, A new boosting algorithm for improved timeseries forecasting with recurrent neural networks, Inform. Fus. 9 (2008) 41–55. [71] Y. Gao, M.J. Er, NARMAX time series model prediction: feedforward and recurrent fuzzy neural network approaches, Fuzzy Sets Syst. 
150 (2005) 331– 350. [72] A. Sorjamaa, J. Hao, N. Reyhani, Y.N. Ji, A. Lendasse, Methodology for long-term prediction of time series, Neurocomputing 70 (2007) 2861–2869. [73] B. Javidy, A. Hatamlou, S. Mirjalili, Ions motion algorithm for solving optimization problems, Appl. Soft Comput. 32 (2015) 72–79. [74] S. Mirjalili, S.M. Mirjalili, A. Lewis, Grey wolf optimizer, Adv. Eng. Softw. 69 (2014) 46–61. [75] S. Mirjalili, The ant lion optimizer, Adv. Eng. Softw. 83 (2015) 80–98. [76] S. Mirjalili, S.M. Mirjalili, A. Hatamlou, Multi-verse optimizer: a natureinspired algorithm for global optimization, Neural Comput. Appl. (2015) 1–19.