Online fitted policy iteration based on extreme learning machines




Knowledge-Based Systems 100 (2016) 200–211



Pablo Escandell-Montero a,∗, Delia Lorente b, José M. Martínez-Martínez a, Emilio Soria-Olivas a, Joan Vila-Francés a, José D. Martín-Guerrero a

a Intelligent Data Analysis Laboratory (IDAL), Electronics Engineering Department, University of Valencia, Av. de la Universitat s/n, Burjassot, 46100 Valencia, Spain
b Centro de Agroingeniería, Instituto Valenciano de Investigaciones Agrarias (IVIA), Crta. Moncada-Náquera km 5, Moncada, 46113 Valencia, Spain

Article info

Article history: Received 2 July 2015; Revised 29 December 2015; Accepted 8 March 2016; Available online 14 March 2016

Keywords: Reinforcement learning; Sequential decision-making; Fitted policy iteration; Extreme learning machine

Abstract

Reinforcement learning (RL) is a learning paradigm that can be useful in a wide variety of real-world applications. However, its applicability to complex problems remains problematic due to different causes. Particularly important among these are the high quantity of data required by the agent to learn useful policies and the poor scalability to high-dimensional problems due to the use of local approximators. This paper presents a novel RL algorithm, called online fitted policy iteration (OFPI), that steps forward in both directions. OFPI is based on a semi-batch scheme that increases the convergence speed by reusing data and enables the use of global approximators by reformulating the value function approximation as a standard supervised problem. The proposed method has been empirically evaluated in three benchmark problems. During the experiments, OFPI has employed a neural network trained with the extreme learning machine algorithm to approximate the value functions. Results have demonstrated the stability of OFPI using a global function approximator and also performance improvements over two baseline algorithms (SARSA and Q-learning) combined with eligibility traces and a radial basis function network. © 2016 Elsevier B.V. All rights reserved.

1. Introduction

Reinforcement learning (RL) is a learning paradigm in the field of machine learning for solving decision-making problems where decisions are made in stages [1]. This kind of problem appears in many fields, such as medicine [2,3], automatic control [4,5], artificial intelligence [6,7], or operations research [8,9]. The standard RL setting consists of an agent (or controller) in an environment (or system). Each decision (also called an action) produces an immediate reward. The agent learns to perform actions in order to maximize the reward accrued over time. The goal is defined by the user through the reward function. Contrary to other approaches, RL does not require a mathematical model of the system; it is based on experience (or data). The agent acquires experience by interacting with the environment. Fig. 1 represents the main RL elements and how they interact. At each stage or discrete time point k, the agent receives the environment's state and, on that basis, selects an action. As a consequence of its action, in the next time step the agent receives a numerical reward and the environment evolves to a new state. The agent selects actions depending on the



Corresponding author. Tel.: +34 963543421; fax: +34 963544353. E-mail address: [email protected] (P. Escandell-Montero).

http://dx.doi.org/10.1016/j.knosys.2016.03.007 0950-7051/© 2016 Elsevier B.V. All rights reserved.

environment state using a policy that assigns an action to every state. Typically, the agent modifies the policy as a result of the interactions with the environment. The objective of the agent is to find an optimal policy. Many RL algorithms rely on value functions to find optimal policies. Given a policy, the value function estimates the long-term reward obtained by the agent when it follows that policy. Classical RL methods are limited to small, discrete problems because they require exact representations of value functions. However, in most realistic problems the environment state space is large or infinite (e.g., if state variables are continuous). In such cases, value functions must be approximated. Function approximators can be classified according to their generalization capabilities as global or local. In global approximators (e.g., multilayer perceptrons or support vector machines) a change in the parameters induced by an update in a certain part of the input space might influence the values of any region of the output space. On the contrary, a change in the input space of a local approximator (e.g., a radial basis function (RBF) network or k-nearest-neighbors) only affects a localized region of the output space [10]. Although global approximators may a priori have very positive effects in combination with RL algorithms [11], they usually lead to poor results, even in very simple cases, compared to local approximators [12,13].


Fig. 1. Basic elements of RL and their flow of interaction.

This is due to the fact that, when the data acquired during the last agent-environment interaction are used to update the parameters of the approximator, the parameters are adjusted to learn the value function in one particular region; however, unpredictable changes may also occur in other regions of the value function [11]. A possible solution to tackle this limitation of global approximators is to accumulate a set of experiences large enough to be representative of the complete state space and to update the entire value function at the same time. In this sense, Ernst et al. [14], based on previous works by Gordon [15] and Ormoneit and Sen [16], proposed a new reinforcement learning scheme where the determination of the value function was reformulated as a sequence of standard batch-mode supervised learning problems. Their algorithm, called fitted Q iteration (FQI), works offline and is based on value iteration, one of the two main classes of RL algorithms. The other central class of algorithms is policy iteration. Both classes of algorithms are topics of current research and are widely used. Nevertheless, policy iteration typically converges in fewer iterations, even though it is computationally more demanding than value iteration [1,17,18].

This paper extends the main ideas of FQI to an online, policy iteration algorithm, which can be reliably combined with global approximators and speeds up learning by reusing the acquired data. In contrast to FQI, the proposed algorithm works online using a semi-batch approach. Therefore, the function approximator should be fast enough to fulfill the time restrictions imposed by online learning. In this context, during the experiments carried out to evaluate the proposed method, a neural network trained with the extreme learning machine (ELM) algorithm was employed to approximate the value functions. ELM is a method for training single-hidden layer feedforward neural networks (SLFNs) that provides an extremely fast learning speed [19].

The rest of this paper is organized as follows. Section 2 presents the required background on Markov decision processes and reinforcement learning. Section 3 briefly introduces the extreme learning machine. The details of the proposed algorithm are described in Section 4. Section 5 discusses the relation of the proposed algorithm to similar methods. In Sections 6 and 7, the experiments and results are presented, respectively. Finally, conclusions are drawn in Section 8.

2. Reinforcement learning

The reinforcement learning problem can be formalized using the framework of Markov decision processes (MDPs) [20]. Firstly, this section introduces the RL problem using MDPs. Then, the classical policy iteration algorithm is described briefly.

2.1. Markov decision processes

An MDP is defined by the tuple {S, A, P, ρ}, where S is the state space of the process, A is the action space, P: S × A × S → [0, 1] is the transition probability function that gives the probability of the next state as a result of the chosen action, and ρ: S × A × S → R is the bounded reward function [20].


Here, S models the possible states of the environment and A includes the actions that can be performed by the agent. Let k denote the current stage or time step. Once the action ak is applied in the state sk, the next state sk+1 is determined by the transition probability function P. The agent selects actions according to its policy π: S → A, which drives the action selection process. Each transition generates an immediate reward rk+1 = ρ(sk, ak, sk+1). The reward evaluates the immediate effect of the transition, but it does not provide information about its long-term effects [18]. The goal of the agent is to learn a policy that maximizes the return (e.g., the sum of rewards received over time). Such a maximizing policy, denoted by π∗, is said to be optimal. For an initial state s0 and under the policy π, the expected infinite-horizon discounted return¹ is [21]:



Rπ(s0) = lim_{K→∞} E_{sk+1|sk, π(sk)} [ Σ_{k=0}^{K} γ^k · ρ(sk, π(sk), sk+1) ]   (1)

where γ ∈ [0, 1) is the discount factor. This parameter can be intuitively interpreted as a way to balance the immediate reward and future rewards: future rewards become more relevant for the calculation of the return as γ approaches 1. The quality of a policy π can be measured using its state-action value function Qπ: S × A → R (commonly referred to as the Q-function). The Q-function is defined as the total expected discounted reward encountered when starting from state s, taking action a and thereafter following policy π [18]:

Qπ(s, a) = E_{s′|s,a} [ ρ(s, a, s′) + γ Rπ(s′) ]   (2)

where s′ is the state reached after taking action a in state s. Due to the Markov property, the Q-function of a policy π satisfies the Bellman equation [18]:

Qπ(s, a) = E_{s′|s,a} [ ρ(s, a, s′) + γ Qπ(s′, π(s′)) ]   (3)

The optimal Q-function is defined as Q∗(s, a) = maxπ Qπ(s, a), i.e., the best Q-function that can be obtained from any policy. From the optimal Q-function, an optimal policy can be easily derived by choosing in each state the action that maximizes Q∗:

π∗(s) = arg maxa Q∗(s, a)   (4)

In general, for a given Q-function, a policy that maximizes Q in this way is said to be greedy in Q. Therefore, a given MDP can be solved (i.e., an optimal policy can be found) by first finding Q∗ and then using Eq. (4) to compute a policy that is greedy in Q∗. These methods are collectively known as value function methods.

2.2. Policy iteration

Policy iteration (PI) is one of the major approaches used in RL to solve MDPs [23]. This section introduces some relevant theoretical results on the classical policy iteration algorithm, which will be used in Section 4 to derive the proposed algorithm. Policy iteration uses an iterative process to construct a sequence of policies that are monotonically improved. Starting with an arbitrary policy π0, at iteration l > 0 the algorithm computes the Q-function underlying πl (this step is called policy evaluation). Then, given Qπl, a new policy πl+1 that is greedy with respect to Qπl is found (this step is called policy improvement). Fig. 2 depicts the policy iteration algorithm schematically [24]. When the state-action space is finite and exact representations of the value function and policy are used (usually by means of lookup tables indexed by state-action pairs), the sequence of Q-functions produced by policy iteration converges asymptotically to Q∗ as l → ∞ [18]. At the same time, an optimal policy π∗ is obtained. Algorithm 1 shows the pseudocode of the classical policy iteration algorithm.

¹ Although there are several types of returns, this work focuses only on the infinite-horizon discounted return due to its useful theoretical properties. For a discussion of these properties and other types of returns, see [21,22].



Fig. 2. Diagram of the classical policy iteration algorithm. Notice the dependence of the policy evaluation step on the MDP model.

Algorithm 1 Classical policy iteration with Q-functions.
Require: initial policy π0
1: l ← 0
2: repeat
3:   find Qπl, the Q-function of πl   {policy evaluation}
4:   πl+1(s) = arg maxa Qπl(s, a)   {policy improvement}
5:   l ← l + 1
6: until πl+1 = πl
7: return π∗ = πl, Q∗ = Qπl
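As an illustration of Algorithm 1, the sketch below runs classical policy iteration with Q-functions on a small synthetic MDP. The two-state transition model, the rewards and the discount factor are hypothetical and serve only to exercise the evaluation/improvement loop.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[s, a, s'] is the transition
# probability and R[s, a, s'] the reward rho(s, a, s').
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.7, 0.3]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
gamma = 0.95
n_states, n_actions = 2, 2

def evaluate_policy(pi, n_iter=200):
    """Policy evaluation: iterate Q <- T(Q) until (approximate) convergence."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iter):
        Q_next = Q[np.arange(n_states), pi]               # Q(s', pi(s'))
        Q = np.einsum('ijk,ijk->ij', P, R + gamma * Q_next[None, None, :])
    return Q

pi = np.zeros(n_states, dtype=int)                        # arbitrary initial policy
while True:
    Q = evaluate_policy(pi)                               # policy evaluation
    pi_new = Q.argmax(axis=1)                             # policy improvement (greedy in Q)
    if np.array_equal(pi_new, pi):
        break
    pi = pi_new
print("optimal policy:", pi)
```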

The crucial step of the policy iteration algorithm is policy evaluation. At the policy evaluation step, the Q-function Qπ of policy π can be found by exploiting the fact that it satisfies the Bellman equation (3). Let Q denote the set of all Q-functions; the Bellman operator can then be defined as the mapping T: Q → Q that computes the right-hand side of the Bellman equation for any function Q [18]:

[T(Q)](s, a) = E_{s′|s,a} [ ρ(s, a, s′) + γ Q(s′, π(s′)) ]   (5)

Using (5), Eq. (3) can be rewritten as

Qπ(s, a) = [T(Qπ)](s, a)   (6)

An important property of the Bellman operator is that the sequence of Q-functions generated by the iteration Qn = T(Qn−1) for n > 0 converges to Qπ from any starting point Q0. This result, which is a consequence of the fixed point theorem, is employed to iteratively find the Q-function of each policy πl in Algorithm 1 [18].

3. Extreme learning machine

Extreme learning machine (ELM) is a simple and efficient learning algorithm for single-hidden layer feedforward neural networks (SLFNs) proposed by Huang et al. [19]. ELM has been successfully applied to a great number of problems [25–28], providing good generalization performance at an extremely fast learning speed. In [19], it is shown that the weights of the hidden layer can be initialized randomly, so that only the weights of the output layer need to be optimized. That optimization can be carried out by means of the Moore–Penrose generalized inverse [29]. Therefore, ELM reduces the computational time needed to optimize the parameters because it is not based on gradient-descent methods or global search methods.

Let P = {(xi, oi); i = 1, ..., N} be a set of N patterns, where xi ∈ R^d1 and oi ∈ R^d2, so that the goal is to find a relationship between {xi} and {oi}. For the sake of simplicity, we focus on the case of single-output regression problems, where d2 = 1. If there are M nodes in the hidden layer, the SLFN's output for the j-th pattern is given by yj:

yj = Σ_{k=1}^{M} hk · f(wk, xj)   (7)

where 1 ≤ j ≤ N, wk stands for the parameters of the k-th element of the hidden layer (weights and biases), hk is the weight that connects the k-th hidden element with the output layer, and f is the function that gives the output of the hidden layer; in the case of a multilayer perceptron, f is an activation function applied to the scalar product of the input vector and the hidden weights. Eq. (7) can be expressed in matrix notation as y = G · h, where h is the vector of weights of the output layer, y is the output vector and G is the N × M matrix given by:

G = [ f(w1, x1)  ···  f(wM, x1)
      ⋮           ⋱   ⋮
      f(w1, xN)  ···  f(wM, xN) ]   (8)

As commented above, ELM proposes a random initialization of the parameters of the hidden layer, wk. Afterwards, the weights of the output layer are obtained by the Moore–Penrose generalized inverse [29] according to the expression h = G† · o, where G† = (G^T · G)^{−1} · G^T is the pseudo-inverse matrix (superscript T denotes matrix transposition).
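A minimal NumPy sketch of the ELM training rule just described: the hidden-layer weights wk and biases are drawn at random, the hidden-layer output matrix G of Eq. (8) is formed, and the output weights h are obtained with the Moore–Penrose pseudo-inverse as h = G† · o. The class name, the sigmoid activation and the uniform initialization range are our assumptions.

```python
import numpy as np

class ELMRegressor:
    """SLFN trained with ELM: random hidden layer + pseudo-inverse output weights."""

    def __init__(self, n_hidden=80, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # f(w_k, x_j): sigmoid of an affine projection of the inputs (assumed activation).
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, o):
        d = X.shape[1]
        self.W = self.rng.uniform(-1.0, 1.0, size=(d, self.n_hidden))  # random input weights w_k
        self.b = self.rng.uniform(-1.0, 1.0, size=self.n_hidden)       # random biases
        G = self._hidden(X)                                            # hidden-layer output matrix, Eq. (8)
        self.h = np.linalg.pinv(G) @ o                                 # h = G† · o
        return self

    def predict(self, X):
        return self._hidden(np.asarray(X)) @ self.h                    # y = G · h, Eq. (7)

# Usage: fit a toy 1-D regression problem.
X = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
o = np.sin(2 * np.pi * X).ravel()
model = ELMRegressor(n_hidden=40).fit(X, o)
print("training MSE:", np.mean((model.predict(X) - o) ** 2))
```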

4. Online fitted policy iteration

4.1. Theoretical bases

Policy iteration provides a way to iteratively determine a sequence of value functions that converges to the optimal one, from which an optimal policy can be easily derived. Nevertheless, the classical PI algorithm presented in Section 2.2 has two shortcomings that limit its practical use in complex problems: i) it is based on the Bellman operator, which requires an exact model of the system dynamics; and ii) it assumes exact representations of the value functions, which is only feasible when the environment's state space is small and discrete. This section introduces a new policy iteration algorithm that circumvents the drawbacks of classical PI.

The update of the Q-function estimation in classical PI is made by applying (5) to all state-action pairs. Thus, it is necessary to generate the next state and the corresponding reward for each state-action pair, a process that requires knowing the MDP model. Algorithms based on the MDP model are known as dynamic programming or model-based RL algorithms. In contrast, model-free RL algorithms compute the optimal policy using the experience acquired during the interaction between the agent and the environment. To this end, it is useful to define the sampled Bellman operator Tˆ [30]. Assuming that, at time point k, a sample comprising the transition between states (sk, ak, sk+1) and its associated reward rk+1 is observed, then:



[Tˆ(Q)](sk, ak) = rk+1 + γ Q(sk+1, ak+1).   (9)

The sampled Bellman operator is essentially a Bellman operator where the expected reward and value of the next state-action pair are replaced with a realization sampled from the process. Thus, given a representative set of samples, Eq. (9) can be employed to


compute the policy evaluation step of PI without knowing the MDP model, using only the data collected from the process. Even though the requirement of an MDP model is avoided, the algorithm still assumes an exact representation of value functions and policies. As previously stated, in real problems the number of states and actions may be very large or infinite (when the state variables are continuous). This entails two problems. First, the value function should be represented in a compact (approximate) way due to memory constraints. Second, such a representation should facilitate generalization: in a continuous state space the agent may never encounter the same state twice and should instead generalize from experiences in nearby states when encountering a novel one [31]. While representing the policy may also be challenging, an explicit representation can be avoided by computing policy actions on demand from the approximate value function [32]. Considering function approximators as a projection operator onto the hypothesis space (any supervised learning algorithm can be seen as a projection), denoted by Π, the iteration procedure used to evaluate a policy using function approximation and the sampled Bellman operator is given by:



Qˆn = Π Tˆ(Qˆn−1)   (10)

where the notation Qˆ indicates that the Q-function is represented approximately. In Eq. (10), the estimated Q-function at iteration n, Qˆn, is computed by applying the sampled Bellman operator to the previous estimation, Qˆn−1, and projecting the result onto the hypothesis space.

The proposed method combines the idea of a semi-batch RL scheme with Eq. (10) to provide a policy iteration algorithm able to work reliably with global approximators. The pseudocode of this approach is shown in Algorithms 2 and 3.

Algorithm 2 Online fitted policy iteration.
Require: regression algorithm
1: initialize the regression algorithm arbitrarily (Qˆπ0)
2: l ← 0
3: repeat
4:   l ← l + 1
5:   πl(s) = arg maxa Qˆπl−1(s, a)
6:   collect a set of E experiences, gathered in the form of four-tuples D = {(s_k^j, a_k^j, r_{k+1}^j, s_{k+1}^j), j = 1, ..., #D}, using the (ε-greedy) current policy πl, where ε is a parameter that controls the exploration probability
7:   Qˆπl ← PolicyEvaluation(D, regression algorithm)
8: until policy iteration stopping conditions are reached

Algorithm 3 PolicyEvaluation(D, regression algorithm).
1: set the regression algorithm (Qˆ0π) to an arbitrary initial estimation (e.g., equal to zero everywhere on S × A)
2: n ← 0
3: repeat
4:   n ← n + 1
5:   compute the training set TS = {(input^j, target^j), j = 1, ..., #D}, where:
       input^j = (s_k^j, a_k^j)
       target^j = r_{k+1}^j + γ Qˆ_{n−1}^π(s_{k+1}^j, a_{k+1}^j)
6:   compute Qˆnπ(s, a) from TS using the regression algorithm
7: until policy evaluation stopping conditions are reached
8: return Qˆπ = Qˆnπ

Fig. 3. Diagram of the proposed online fitted policy iteration algorithm. The policy evaluation step is based on samples acquired during the interaction between agent and environment. Additionally, the value function is stored in a function approximator.
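The sketch below mirrors Algorithms 2 and 3 in Python under stated assumptions: `env` is a hypothetical episodic environment exposing `reset()` and `step(a)` (returning next state, reward and a termination flag), `make_regressor()` returns any supervised regressor with a `fit`/`predict` interface (e.g., one ELM network per discrete action, as in the experiments), and the stopping rules are reduced to a fixed number of evaluation sweeps. It is a schematic of the semi-batch loop rather than the authors' exact implementation.

```python
import random
from collections import deque
import numpy as np

def q_value(Q, s, a):
    """Q-hat(s, a); an action model that has not been fitted yet counts as zero everywhere."""
    return float(Q[a].predict(np.atleast_2d(s))[0]) if Q.get(a) is not None else 0.0

def greedy_action(Q, s, actions):
    return max(actions, key=lambda a: q_value(Q, s, a))

def policy_evaluation(D, make_regressor, actions, gamma, n_max=5):
    """Algorithm 3: repeatedly refit Q-hat on sampled Bellman targets built from the batch D."""
    Q = {a: None for a in actions}                          # Q-hat_0 = 0 everywhere on S x A
    for _ in range(n_max):                                  # (a ||Q_n - Q_{n-1}|| < xi test could stop earlier)
        data = {a: ([], []) for a in actions}
        for s, a, r, s2, a2 in D:                           # four-tuple plus the next action a_{k+1}
            data[a][0].append(s)
            data[a][1].append(r + gamma * q_value(Q, s2, a2))   # sampled Bellman target, Eq. (9)
        Q = {a: (make_regressor().fit(np.asarray(X), np.asarray(y)) if X else Q[a])
             for a, (X, y) in data.items()}
    return Q

def ofpi(env, make_regressor, actions, gamma=0.95, policy_iters=50,
         episodes_per_iter=1, memory_size=1000, eps=0.6, eps_decay=0.99):
    """Algorithm 2: collect experience with the eps-greedy current policy, then re-evaluate it."""
    D = deque(maxlen=memory_size)                           # FIFO experience memory (size is problem dependent)
    Q = {a: None for a in actions}
    for _ in range(policy_iters):                           # until the policy stops improving
        for _ in range(episodes_per_iter):                  # E episodes between policy evaluations
            s, done = env.reset(), False
            a = greedy_action(Q, s, actions) if random.random() > eps else random.choice(actions)
            while not done:
                s2, r, done = env.step(a)
                a2 = (greedy_action(Q, s2, actions) if random.random() > eps
                      else random.choice(actions))          # eps-greedy exploration, Eq. (11)
                D.append((s, a, r, s2, a2))                 # store the transition plus the next action
                s, a = s2, a2
            eps *= eps_decay
        Q = policy_evaluation(list(D), make_regressor, actions, gamma)
    return Q
```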

The first step is to initialize the regression algorithm used to approximate an arbitrary Q-function Qˆπ0. Then, in the policy improvement step, a new policy πl is derived from Qˆπ0. This policy is used to interact with the environment in order to collect a set of samples or experiences (agent-environment interactions) D, gathered in the form of four-tuples (sk, ak, sk+1, rk+1). D is implemented as a first-in first-out (FIFO) memory whose size depends on the problem, but it should be large enough to contain a set of samples representative of the state space. The number E of experiences collected before each policy evaluation step may vary during the learning phase, starting with low values in the early stages to allow the agent to improve its policy quickly and increasing later to reduce the computational cost. In the first iterations, D can contain experiences acquired with policies different from the one that is being evaluated. However, it is known that policy iteration algorithms generate near-optimal solutions even when the policy evaluation step is not exact [17]. Furthermore, this can be seen as an additional way to introduce exploratory actions. Next, the acquired experiences are employed to evaluate the policy. For this purpose, Algorithm 3 implements Eq. (10): first the sampled Bellman operator is applied to all experiences, and then the result is used to train ('fit') the function approximator. This process is repeated until convergence. Afterwards, a new improved policy derived from Qˆπl is used to collect a new set of experiences, and the process is repeated until new policies no longer improve. The algorithm has been termed online fitted policy iteration (OFPI) because the policy is learned while the agent interacts with the environment, and because it fits the function approximator over the whole state space following a policy iteration scheme. The main stages of the proposed algorithm are summarized in Fig. 3.

4.2. Practical issues

This section discusses some practical issues that should be taken into account and the approach employed in the experiments performed in Section 6.

4.2.1. Exploration

A generic issue with model-free RL algorithms is the exploration problem. In the case of algorithms based on PI, given a policy π, the agent has to interact with the environment using



actions selected according to that policy, but this process biases the interaction by underrepresenting states that are unlikely to occur under π. As a result, the estimation of the Q-function Qπ may be highly inaccurate [23]. The most frequently used approach to ensure exploration consists in artificially introducing some randomization in the actions selected by the policy π. A possible implementation of this solution is ε-greedy exploration, which selects actions according to:

ak = { arg maxa Qπ(sk, a),              with probability 1 − εk
     { a uniformly random action in A,  with probability εk        (11)

where εk is the exploration probability at step k. Although there exist more sophisticated methods to incorporate exploration, the experiments presented in this work were carried out using ε-greedy exploration because it provides a good trade-off between performance and complexity [33], thus being an adequate choice for our experiments.

4.2.2. Stopping conditions

Stopping conditions are required to decide at which iteration Algorithms 2 and 3 must be stopped. Theoretically, policy evaluation converges only in the limit, but in practice it must be halted. A typical stopping condition for policy evaluation is to check the difference between two consecutive Q-functions (i.e., ||Qˆn − Qˆn−1||) and stop when it decreases below a given threshold ξ > 0 [1,18]. Nevertheless, when there is no guarantee that the sequence of Qˆn functions actually converges (typically for some classes of function approximators, see Section 4.3), it is convenient to define a maximum number of iterations [14]. These stopping conditions are also valid for the policy iteration loop. The proposed algorithm was implemented using both conditions.

4.3. Convergence of online fitted policy iteration

The convergence of the classical PI algorithm is based on the contracting property of the Bellman operator T. Similarly, to ensure the convergence of OFPI, the projected sampled Bellman operator Π Tˆ must also be a contraction mapping. The characteristics of the mapping Π Tˆ are determined by the projection operator, which depends on the function approximator used. There is a class of approximators, collectively referred to as kernel-based methods [14–16], that maintain the operator Π Tˆ as a contraction. For example, the k-nearest-neighbors method, partition and multi-partition methods, locally weighted averaging, and linear and multi-linear interpolation, among others, pertain to this class of approximators. Unfortunately, many common approximators such as linear regression, artificial neural networks or regression trees are not kernel-based methods, and therefore when they are used together with the proposed algorithm, convergence to optimal policies is not ensured. Nevertheless, in the RL field it is common to find successful algorithms whose convergence is not theoretically ensured [14,34–37]. Section 6 empirically shows that the proposed algorithm gives good practical performance in combination with ELM networks despite the lack of theoretical guarantees of convergence.

5. Related work

As introduced in Section 1, the proposed algorithm is strongly related to fitted Q iteration (FQI). Both algorithms share the working principle of computing value functions using a sequence of standard batch-mode supervised learning problems. This concept was earlier proposed by Gordon [15] in the setting of model-based value iteration, and later adapted by Ormoneit and Sen [16] and Ernst et al. [14] to the model-free case.

The main difference is that, while FQI is based on value iteration, the proposed algorithm has its origins in policy iteration. Whereas FQI makes use of the optimal Bellman operator to find an optimal value function and its corresponding optimal policy, OFPI employs an online procedure where the Bellman operator is used to evaluate the current policy and to find a new improved policy at each iteration. The idea of a fitted policy iteration scheme is not completely new. The closest work is probably the algorithm presented by Antos et al. in [38]. However, the study in [38] focuses on the theoretical analysis of a fitted policy iteration algorithm that works offline, where all training data is collected in advance with a fixed policy. In [18], an algorithm for policy evaluation called fitted policy evaluation for Q-functions is outlined, but no empirical result is provided. The survey of Geist and Pietquin [30] also briefly comments on the possibility of using the Bellman operator instead of the optimal Bellman operator in the FQI algorithm. Perkins and Precup [39] analyzed the convergence properties of a batch-mode policy iteration algorithm; although their algorithm also works in batch mode, the parameters of the approximator are updated sample by sample, which can produce poor results in combination with global approximators. The algorithm proposed in this work, OFPI, extends the strengths of fitted Q iteration to a policy iteration scheme that allows online learning. It can be considered a semi-batch method because the set of samples employed to approximate the value function is updated continuously.

6. Experiments

This section describes the experiments carried out to evaluate the performance of the proposed algorithm. Three systems from the automatic feedback control field were used as benchmarks. All problems have been previously used to test RL algorithms due to their interesting properties: the dynamics of the systems are highly non-linear and the state spaces are continuous, making it necessary to employ a function approximator. The dimensionality of each system is different in order to test the proposed algorithm in different scenarios. The problems, despite being challenging, are low-dimensional, which allows extensive experimentation at reasonable computational cost. The performance of OFPI was compared with two baseline algorithms: SARSA [40] and Q-learning [41]. SARSA belongs to the same family of methods as OFPI (both are policy iteration algorithms), whereas Q-learning is a value iteration method. These two algorithms are likely the best understood RL methods and they have been widely used in the literature as baselines for assessing the performance of novel methods [14,42–47]. SARSA and Q-learning also require some kind of function approximation when used in continuous MDPs. They are usually combined with local approximators. A radial basis function (RBF) network with fixed Gaussian BFs was used as function approximator in this work. Each basis function was defined as:



BF(s) = exp( −||s − c||² / (2σ²) )   (12)

where c is the centre and σ determines the standard deviation of the Gaussian. In the three problems, the centres of the Gaussians were equally spaced over the input space. In addition, both baseline algorithms introduce two new parameters. On the one hand, the learning rate, denoted by α; specifically, the learning rate schedule used is given by:

αt = α0 · a / (a + t^β)   (13)

where α0 is the initial value, a and β are related to the shape of αt, and t is the number of interactions between the agent and the environment.
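As a concrete reference for the baselines, the snippet below implements the fixed Gaussian basis functions of Eq. (12) on an equally spaced grid of centres and the learning-rate schedule of Eq. (13); the function names are ours and the numerical values are taken from the underwater vehicle configuration (Table 1).

```python
import numpy as np

def rbf_features(s, centres, sigma):
    """Fixed Gaussian basis functions, Eq. (12): BF_i(s) = exp(-||s - c_i||^2 / (2 sigma^2))."""
    s = np.atleast_1d(s)
    d2 = np.sum((centres - s) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def learning_rate(t, alpha0, a, beta):
    """Learning-rate schedule, Eq. (13): alpha_t = alpha0 * a / (a + t^beta)."""
    return alpha0 * a / (a + t ** beta)

# Example: 15 equally spaced centres on a 1-D normalized state space.
centres = np.linspace(0.0, 1.0, 15).reshape(-1, 1)
phi = rbf_features(0.37, centres, sigma=0.3)
print(phi.round(3), learning_rate(t=100, alpha0=0.4, a=10, beta=0.5))
```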



This learning rate schedule is more flexible than other typical schedules and can be adjusted to achieve better rates of convergence [33]. On the other hand, the second parameter, denoted by λ, controls the decay rate of the eligibility traces, a technique that improves the convergence speed of SARSA and Q-learning [48]. Since the action space is discrete in the three benchmark problems, an independent approximator (an RBF network in SARSA and Q-learning, and an ELM network in OFPI) was employed for each action, a common strategy in such cases [14].

The performance of the three algorithms was evaluated by taking snapshots of the current policy at increasing moments of time and computing their average return [49]. This method leads to a curve recording the performance of the policy as it evolves with the number of episodes (or learning trials). Furthermore, each experiment was repeated 30 times in order to obtain a reliable average performance.

The reward function employed in the experiments was designed, based on [50,51], to be as general as possible. It provides a maximum reward when the variable to be controlled is near the set point, and decreases smoothly to zero as its difference with the set point increases. The exact definition is given by

ρ(s, a) = ρ(e) = 1 − tanh²(|e| · w),   with   w = tanh⁻¹(√0.95) / 0.01   (14)

where e is the control deviation, i.e., the difference between the controlled variable and the set point. Fig. 4 shows the shape of the reward in a one-dimensional setting.

Fig. 4. Reward function over a one-dimensional controlled variable.

This continuous and smooth reward function presents two advantages: 1) it enables the agent to learn more precise control policies than discrete rewards; and 2) the value functions can be approximated more accurately because they are also smooth [50]. Furthermore, due to its generality, it can be applied in a wide range of problems. The remainder of this section describes the benchmark problems and the experimental setup of the three algorithms (OFPI, SARSA and Q-learning) employed in each case.

6.1. Underwater vehicle

The first benchmark problem was proposed in [50] with the aim of assessing RL algorithms. It is a synthetic problem that has little relationship to a real system. Despite the fact that the system has a simple structure (the state space only contains one dimension), the system dynamics has highly nonlinear properties [50]. The problem represents an underwater vehicle driven by a propeller, where the goal is to regulate the velocity to a fixed set point kset = 2 m/s. The drag coefficient and the mass of the vehicle, denoted by c and m respectively, are assumed not to be constant. Instead of the drag coefficient, a drag function, c(k), is used. Similarly, the mass is replaced by a mass function m(k). These functions take into account the effects of an object moving through a fluid. Furthermore, it is assumed that the thrust produced by the propeller is not directly proportional to the control action a; instead, the effective thrust is obtained through a function, p(k, a), that models the nonlinear efficiency of the propeller at different velocities. The goal in this experiment is to act on the propeller so as to reach kset as fast as possible. The system dynamics of the underwater vehicle is given by the following dynamic equation [50]:

k̇ = f(k, a) = [ p(k, a) · a − c(k) · |k| · k ] / m(k)
c(k) = 0.2 · sin(|k|) + 1.2
m(k) = 1.5 · sin(|k|) + 3.0
p(k, a) = −0.5 · tanh[ (|c(k) · |k| · k − a| − 30) · 0.1 ] + 0.5   (15)

The underwater vehicle was simulated using the Matlab ODE solver ode45 with a time interval tODE = 0.03 s. The one-dimensional state vector is s = [v]. The underwater vehicle velocity is restricted to [−4, 4] m/s. The action space was discretized into three discrete values A = {−30, 0, 30}, and the discount factor was set to γ = 0.95. The exploration probability εk was initialized at 0.6 and was decreased exponentially with a rate of 0.9905 at the end of each episode in order to obtain an exploration probability of approximately 0.1 at the end of the experiment (after 180 episodes). Each episode is terminated when the velocity is outside the allowed range or after 50 time steps. The initial state of each episode was randomly selected from a uniform distribution over all possible states.

In the OFPI algorithm, the batch size (i.e., the quantity of experience stored) was fixed to 20 episodes. At the beginning of the learning process, the number of episodes between policy evaluations was set to E = 1, and it was incremented by one at each iteration until reaching 20 episodes between evaluations. The policy evaluation loop was halted if one of the following conditions held: the maximum number of iterations nmax = 5 was reached, or the distance between two consecutive approximations of the Q-function fell below a threshold ξ, i.e., ||Qˆn − Qˆn−1|| < ξ, where ξ = 0.07. Since Q is a continuous function, ||Qˆn − Qˆn−1|| was computed by evaluating the Q-function at 15 equidistant points distributed along the whole state space. The value functions are approximated using a multilayer perceptron trained with the ELM learning algorithm, so we refer to this method as ELM-OFPI. Before approximating the value function, state variables were normalized into the range [0, 1]. While the size of the input and output layers is determined by the problem structure, the size of the hidden layer is a parameter that must be set by the designer. In this case, the number of hidden nodes was varied from 20 to 100 in steps of 20. The best results were obtained with a network with 80 hidden nodes; thus, the performance shown in the results corresponds to such a network.

In the baseline algorithms, denoted as RBF-SARSA and RBF-Q-learning, the configuration parameters were set to the same values, since both algorithms were aimed at approximating the same function. The Gaussian standard deviation σ and the number of RBFs were hand-tuned by testing several combinations of both parameters; specifically, σ = {0.15, 0.3, 0.45, 0.6} was tested for 10, 15 and 20 basis functions. The best results were obtained with 15 RBFs and σ = 0.3. Similarly, the learning rate schedule (α0 = 0.4, a = 10, β = 0.5) was chosen to maximize the convergence speed, as well as the decay rate of the eligibility traces, which was empirically set to λ = 0.4. Table 1 summarizes the parameters used during the experiments in the underwater vehicle problem.
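A minimal simulation sketch of the underwater vehicle environment: the dynamics of Eq. (15) integrated with a fixed-step fourth-order Runge–Kutta rule (standing in for Matlab's ode45) and the reward of Eq. (14), as reconstructed above, evaluated on the velocity error. The function names, the integrator choice and the rollout below are our assumptions.

```python
import numpy as np

K_SET, DT, ACTIONS = 2.0, 0.03, (-30.0, 0.0, 30.0)

def f(k, a):
    """Velocity dynamics k_dot = f(k, a) of Eq. (15)."""
    c = 0.2 * np.sin(abs(k)) + 1.2                                    # drag function c(k)
    m = 1.5 * np.sin(abs(k)) + 3.0                                    # mass function m(k)
    p = -0.5 * np.tanh((abs(c * abs(k) * k - a) - 30.0) * 0.1) + 0.5  # propeller efficiency p(k, a)
    return (p * a - c * abs(k) * k) / m

def reward(k):
    """Smooth reward of Eq. (14) on the control deviation e = k - k_set."""
    w = np.arctanh(np.sqrt(0.95)) / 0.01
    return 1.0 - np.tanh(abs(k - K_SET) * w) ** 2

def step(k, a, dt=DT):
    """One fixed-step RK4 integration of the velocity; returns next velocity, reward, termination flag."""
    k1 = f(k, a)
    k2 = f(k + dt * k1 / 2.0, a)
    k3 = f(k + dt * k2 / 2.0, a)
    k4 = f(k + dt * k3, a)
    k_next = k + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return k_next, reward(k_next), not (-4.0 <= k_next <= 4.0)

# Example rollout: one 50-step episode with constant full thrust.
k = 0.0
for _ in range(50):
    k, r, done = step(k, ACTIONS[2])
    if done:
        break
print(round(k, 3), round(r, 3))
```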



Fig. 5. Schematic representation of (a) the inverted pendulum and (b) the aircraft pitch control problems. In (b), the base coordinate axis of the aircraft is denoted by X.

Table 1. Summary of the parameter configuration in the underwater vehicle problem.

General: γ = 0.95, ε0 = 0.6, εEND = 0.1, ε decay rate = 0.9905, tODE = 0.03
RBF-SARSA/Q-learning: #BFs = 15, σ = 0.3, α0 = 0.4, a = 10, β = 0.5, λ = 0.4
ELM-OFPI: #Hidden nodes = 80, Batch size = 20 episodes, nmax = 5, ξ = 0.07

6.2. Inverted pendulum

The inverted pendulum problem, also known as the pole-balancing or cart-pole problem, is a classical benchmark that has been widely used in automatic control [52–54] and reinforcement learning [24,55]. This control task involves learning to balance a rigid pole mounted on a mobile cart by applying appropriate forces to the cart. There are numerous versions of this problem in the literature; the version used in this work is similar to the one described by Lagoudakis et al. [24]. Fig. 5a shows a schematic representation of the problem. The cart is free to move on a one-dimensional track and the pole is free to move only in the vertical plane of the cart and track. The mass (m) and length (l) of the pendulum are unknown to the agent. Three actions are allowed: left force (−50 N), right force (50 N), and no force (0 N). All three actions are noisy; uniform noise in [−10, 10] is added to the chosen action. The state transitions are governed by the nonlinear dynamics of the following system:

θ̈ = [ g · sin(θ) − α · m · l · (θ̇)² · sin(2θ)/2 − α · cos(θ) · a ] / [ 4 · l/3 − α · m · l · cos²(θ) ]   (16)

where g is the gravity constant (g = 9.8 m/s²), m is the mass of the pendulum (m = 2 kg), M is the mass of the cart (M = 8 kg), l is the length of the pendulum (l = 0.5 m), and α = 1/(m + M). The state space of the problem consists of the vertical angle θ and the angular velocity θ̇ of the pendulum. An angle higher (in absolute value) than π/2 or a maximum number of 2400 steps signals the end of the episode. Episodes start with a state randomly selected from the subset defined by the ranges θ = [−π/2, π/2] and θ̇ = [−5, 5]. The set point is the equilibrium point, i.e., θset = 0.

The proposed algorithm was configured to start by collecting E = 1 episode between policy evaluations; afterwards, in each iteration E was increased by a factor of 5 until reaching 200. Regarding the stopping conditions, the set of discrete points Stest = {−π/2, −π/6, π/2, π/6} × {−5, −5/3, 5/3, 5} was employed to test the distance between consecutive Q-functions. The remaining parameters of ELM-OFPI, RBF-SARSA and RBF-Q-learning were selected following a procedure similar to the one described in the previous section. Table 2 shows the value of each parameter after the tuning process.

Table 2. Summary of the parameter configuration in the inverted pendulum problem.

General: γ = 0.95, ε0 = 0.6, εEND = 0.1, ε decay rate = 0.9992, tODE = 0.1
RBF-SARSA/Q-learning: #BFs = 144, σ = 0.3, α0 = 0.4, a = 15, β = 0.7, λ = 0.4
ELM-OFPI: #Hidden nodes = 210, Batch size = 200 episodes, nmax = 5, ξ = 0.01

Again, the learning performance is shown by taking snapshots of the current policy at increasing moments of time. In this case, however, instead of evaluating the policies with an estimation of their average return, they are evaluated by means of the average number of steps per episode that the pendulum is kept in equilibrium. Although both measures are equally valid, the number of equilibrium steps is the most common performance measure in the literature for this problem [24]. The average number of steps was computed over a set of initial states S0 = {−π/10, π/10} × {−0.3, 0.3} for each run.

6.3. Aircraft pitch control

The last benchmark problem deals with an automatic pilot that controls the pitch of a passenger aircraft. The original problem is described by a set of six non-linear differential equations [56], but they can be decoupled and linearized under certain assumptions. The experiments are based on a subproblem focused on pitch control at constant altitude and velocity [50]. Fig. 5b shows a diagram of the aircraft with some of the variables that determine the state of the system. The main axis of the aircraft is considered as the base coordinate system (denoted by X). The aircraft velocity, V, is assumed to be constant for any α, which denotes the angle of attack. The control task consists in providing proper elevator deflection angles, δ, in such a way that the pitch angle, θ, is as near as possible to the desired pitch angle, fixed to θset = 0.2 rad. When the elevator deflection angle is set, the agent also influences the pitch rate (denoted by q) of the aircraft and the angle of attack. The following equations describe the dynamics of the aircraft:

α̇ = 0.232 · δ + 56.7 · q − 0.313 · α
q̇ = 0.0203 · δ − 0.426 · q − 0.0139 · α
θ̇ = 56.7 · q   (17)
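A sketch of the aircraft pitch-control environment built from Eq. (17), discretized with a simple Euler step of tODE = 0.05 s (Table 3) and rewarded with Eq. (14) on the pitch error. The class, its reset()/step() interface (compatible with the OFPI sketch in Section 4) and the Euler discretization are our assumptions rather than the authors' exact simulator.

```python
import numpy as np

class AircraftPitchEnv:
    """Linearized pitch dynamics of Eq. (17), actions A = {-1.4, 1.4}, set point theta_set = 0.2 rad."""

    ACTIONS = (-1.4, 1.4)

    def __init__(self, dt=0.05, theta_set=0.2, max_steps=600, seed=0):
        self.dt, self.theta_set, self.max_steps = dt, theta_set, max_steps
        self.rng = np.random.default_rng(seed)

    def reset(self):
        # Initial state drawn uniformly from the ranges used in the experiments.
        self.alpha = self.rng.uniform(-0.3, 0.3)      # angle of attack
        self.q = self.rng.uniform(-0.015, 0.015)      # pitch rate
        self.theta = self.rng.uniform(-0.5, 0.5)      # pitch angle
        self.steps = 0
        return np.array([self.alpha, self.q, self.theta])

    def step(self, delta):
        # Euler integration of Eq. (17) with the elevator deflection delta as input.
        alpha_dot = 0.232 * delta + 56.7 * self.q - 0.313 * self.alpha
        q_dot = 0.0203 * delta - 0.426 * self.q - 0.0139 * self.alpha
        theta_dot = 56.7 * self.q
        self.alpha += self.dt * alpha_dot
        self.q += self.dt * q_dot
        self.theta += self.dt * theta_dot
        self.steps += 1
        w = np.arctanh(np.sqrt(0.95)) / 0.01
        r = 1.0 - np.tanh(abs(self.theta - self.theta_set) * w) ** 2   # Eq. (14) on the pitch error
        done = not (-0.5 <= self.theta <= 0.5) or self.steps >= self.max_steps
        return np.array([self.alpha, self.q, self.theta]), r, done

# Example: a few steps with a constant elevator deflection of -1.4.
env = AircraftPitchEnv()
s = env.reset()
for _ in range(5):
    s, r, done = env.step(env.ACTIONS[0])
print(s.round(3), round(r, 3), done)
```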


Fig. 6. Performance of (a) SARSA and (b) Q-learning using an RBF network to approximate the Q-function, and (c) OFPI using an ELM network in the underwater vehicle problem. The mean value is represented in red, the yellow-shaded area is the standard deviation, and the maximum and minimum values are drawn with blue-dashed lines. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

This set of equations assumes that θ ∈ [−0.5, 0.5] rad, because the simplified model provides an approximation of the aircraft's real behavior only for this restricted range of pitch angles. Despite the fact that the system of equations (17) is linear, the process has challenging properties. The pitch angle can be quickly modified by acting on the elevator deflection angle δ. However, the angle of attack has very slow dynamics compared to the pitch angle. In consequence, when the agent brings the pitch angle to the set point, θset, the angle of attack is still changing because it needs more time to reach a stable value. Therefore, the agent must continuously act on the elevator deflection angle in order to compensate for this effect and to maintain the pitch angle near θset. The state vector is composed of three variables, s = [α, q, θ], and there is a one-dimensional action a = [δ], which was discretized into two values A = {−1.4, 1.4}. The maximum length of each episode was fixed to 600 steps. An episode ends after reaching the maximum length or when the pitch angle is outside the allowed range. The initial state of each episode is randomly selected from a uniform distribution over the subset specified by the ranges α = [−0.3, 0.3], q = [−0.015, 0.015] and θ = [−0.5, 0.5]. In ELM-OFPI, the parameter E was kept equal to that used in the inverted pendulum problem. The distance between consecutive approximations of the Q-function was measured in a set of

discrete points that is representative of the complete Q-function, namely Stest = {−0.25, 0, 0.25} × {−0.01, 0, 0.01} × {−0.4, 0, 0.4}. The remaining parameters related to the proposed algorithm and the baseline algorithms are shown in Table 3. The learning performance was measured in terms of the average return estimated over a set S0 of representative initial states. S0 contains different values of the variable to be controlled (the pitch angle), while the initial angle of attack and pitch rate are always zero; specifically, S0 = {0} × {0} × {−0.5, −0.25, 0.25, 0.5}.

7. Results and discussion

The results of the experiments with RBF-SARSA, RBF-Q-learning and ELM-OFPI are presented in Figs. 6–8. Each graph shows the metric employed to evaluate the quality of the policy (vertical axis) versus the number of episodes processed by the algorithm (horizontal axis). High values on the vertical axis indicate better performance. In addition to the mean reward (red continuous lines) computed over 30 independent runs of the experiment, the graphs show the standard deviation (yellow-shaded area) and the maximum and minimum values (blue-dashed lines).

In the underwater vehicle problem (Fig. 6), it can be observed that all algorithms reach a similar near-optimal performance at the end of the experiment, after 180 episodes.



Fig. 7. Performance of (a) SARSA and (b) Q-learning using an RBF network to approximate the Q-function, and (c) OFPI using an ELM network in the inverted pendulum problem. The mean value is represented in red, the yellow-shaded area is the standard deviation, and the maximum and minimum values are drawn with blue-dashed lines. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 3. Summary of the parameter configuration in the aircraft pitch control problem.

General: γ = 0.95, ε0 = 0.3, εEND = 0.1, ε decay rate = 0.9995, tODE = 0.05
RBF-SARSA/Q-learning: #BFs = 125, σ = 0.6, α0 = 0.4, a = 15, β = 0.7, λ = 0.3
ELM-OFPI: #Hidden nodes = 60, Batch size = 200 episodes, nmax = 10, ξ = 0.5

However, the behavior of each algorithm is different. RBF-SARSA starts with a slightly better average performance than ELM-OFPI. At this starting point, the learning curve of RBF-SARSA also shows a high degree of variance, which decreases as the number of episodes grows. In the case of RBF-Q-learning, the initial learning speed, as well as the variance, is higher than in RBF-SARSA. On the other hand, the convergence of ELM-OFPI is faster and more reliable than that of the two baseline methods. After 60 episodes, the proposed algorithm reaches the maximum reward in all the experiments. Fig. 7 shows the results achieved in the inverted pendulum problem. The quality of the policies is measured by the number

of steps that the pendulum is maintained in equilibrium. The policy is considered optimal if it achieves 2400 balanced steps. Compared to the underwater vehicle problem, the algorithms needed almost 15 times more episodes to converge, which suggests that this problem is noticeably more complex. When comparing the performance of the baseline algorithms (Fig. 7a and b) and ELM-OFPI (Fig. 7c), some differences are apparent. During the first 200 episodes, the baseline algorithms show better performance than ELM-OFPI. On average, they are able to balance the pole for around 1250 steps, in some episodes reaching the maximum of 2400 steps in the case of RBF-SARSA and almost the maximum (more than 2300 steps) in the case of RBF-Q-learning. By contrast, ELM-OFPI obtains an average performance of around 500 balanced steps. Nevertheless, when the number of episodes increases, the situation changes. ELM-OFPI improves quickly until reaching a near-optimal policy in 600 episodes, while the baseline algorithms require significantly more episodes to learn policies of similar quality. The results of the last benchmark, the pitch control problem, are shown in Fig. 8. Although the learning curves of the three algorithms are more similar than in the previous cases, it is still clear that the proposed algorithm learns optimal policies in a more reliable way than RBF-SARSA and RBF-Q-learning, as reflected by the low variance in Fig. 8c after 800 episodes.


Fig. 8. Performance of (a) SARSA and (b) Q-learning using an RBF network to approximate the Q-function, and (c) OFPI using an ELM network in the aircraft problem. The mean value is represented in red, the yellow-shaded area is the standard deviation, and the maximum and minimum values are drawn with blue-dashed lines. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 4. Number of episodes required to reach the convergence point and performance (mean, maximum/minimum, and standard deviation) at that point.

Benchmark            Algorithm    Num. episodes   Mean      Minimum   Maximum   Standard dev.
Underwater vehicle   SARSA        80              0.94      0.89      0.96      0.02
Underwater vehicle   Q-learning   60              0.92      0.72      0.96      0.09
Underwater vehicle   OFPI         60              0.96      0.96      0.96      0
Inverted pendulum    SARSA        1200            2319.47   114       2400      422.79
Inverted pendulum    Q-learning   1400            2350.52   1782.24   2400      202.54
Inverted pendulum    OFPI         600             2318.69   1293      2400      467.90
Aircraft             SARSA        1000            0.94      0.92      0.97      0.01
Aircraft             Q-learning   800             0.93      0.86      0.96      0.02
Aircraft             OFPI         600             0.94      0.87      0.97      0.03

The qualitative analysis of the convergence curves has revealed some differences in the behavior of the three algorithms assessed. However, it is also interesting to compare the different methods in a quantitative way. To this end, the convergence point has been defined as the point at which the performance reaches 95% of the maximum performance. Table 4 shows the number of episodes required by each algorithm in the three benchmark problems to reach the convergence point, as well as the performance (mean, maximum/minimum, and standard deviation) at that point.

The most important conclusion drawn from Table 4 is the noticeable reduction in the number of episodes required to converge. Specifically, compared with RBF-SARSA, the reduction was 25% in the underwater vehicle, 50% in the inverted pendulum, and 40% in the aircraft pitch control. On the other hand, compared with RBF-Q-learning, the reduction in the inverted pendulum and the aircraft pitch control was 57.14% and 25%, respectively. In the underwater vehicle problem, both algorithms (RBF-Q-learning and ELM-OFPI) converged after the same number of episodes.



Table 5. Average execution time per step of SARSA, Q-learning and OFPI in the three benchmark problems.

Benchmark        Average time per step (ms)
                 SARSA     Q-learning   OFPI
Underwater       0.4278    0.4347       0.6701
Inv. pendulum    0.7312    0.7379       1.1944
Aircraft         0.5896    0.5904       1.0845

Nevertheless, at the convergence point, all policies obtained by ELM-OFPI are optimal (the standard deviation is zero), while in RBF-Q-learning some of the policies were still suboptimal. On average, the number of episodes required to converge was reduced by 38.33% compared with RBF-SARSA and by 27.38% compared with RBF-Q-learning. In general terms, the results suggest that both baseline algorithms outperform ELM-OFPI when the agent has processed only a few episodes, that is, in the early learning stages. On the contrary, when the number of episodes grows, ELM-OFPI learns high-quality policies earlier than the baseline algorithms. This behavior is likely due, in part, to the combination of two factors: the semi-batch nature of OFPI and the global approximator used to store the value functions, i.e., the ELM network. Some global approximators, including ELM networks, may lead to unpredictable results in zones of the input space where there are not enough data, a common situation during the first learning stages of RL. On the other hand, the semi-batch approach of OFPI allows a set of transitions to be stored. Once this set contains enough transitions to be representative of the complete state space, the ELM network can approximate the complete Q-function and OFPI converges quickly to near-optimal policies.

It should be noted that OFPI makes extensive use of the experience acquired by the agent. Instead of discarding each transition after updating the Q-function once, as occurs in traditional algorithms, OFPI stores the transition and uses it to make several updates. This data reuse allows optimal policies to be reached with less data, but it also involves a higher computational cost. Table 5 shows the average computation time² per step required by each algorithm in the benchmark problems. These measures do not include the time employed to solve the equations that describe the environment's state. The average time per step was obtained by running the algorithm for a fixed number of episodes and dividing the total time by the number of steps. In the case of OFPI, the algorithm does not perform an update at each step, as occurs in SARSA and Q-learning; instead, the parameter E controls how frequently the policy is updated. In spite of this difference, the average time per step is a good measure to compare the computational complexity of each method. As can be observed in Table 5, the times per step required by SARSA and Q-learning were very similar to each other, and noticeably lower than the one required by OFPI. Specifically, during the experiments OFPI needed 68.65% and 67.27% more time per step than SARSA and Q-learning, respectively. Despite this increase in computational time, it should be taken into account that OFPI reduced the number of iterations required to find an optimal policy, which is a critical issue in many RL applications.
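The aggregate reductions quoted above follow directly from the episode counts in Table 4, as the following quick check shows:

```python
# Episodes to the convergence point (Table 4): problem -> (SARSA, Q-learning, OFPI).
episodes = {"underwater": (80, 60, 60), "pendulum": (1200, 1400, 600), "aircraft": (1000, 800, 600)}

red_sarsa = [(s - o) / s for s, q, o in episodes.values()]    # 25%, 50%, 40%
red_qlearn = [(q - o) / q for s, q, o in episodes.values()]   # 0%, 57.14%, 25%
print(sum(red_sarsa) / 3, sum(red_qlearn) / 3)                # ~0.3833 and ~0.2738
```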

2 Average time was computed by running 180 episodes in the underwater problem and 2400 episodes in the inverted pendulum and aircraft problems. Experiments were run with Matlab 2012b on a Mac OS X platform (2.4 GHz Intel Core 2 Duo, 4GB RAM).

8. Conclusions

This paper has proposed a new reinforcement learning algorithm called online fitted policy iteration (OFPI). The algorithm is based on a semi-batch scheme that stores the experiences acquired by the agent. This mechanism provides two benefits. On the one hand, it increases the convergence speed by reusing each experience several times. On the other hand, it enables the use of global approximators by formulating the value function estimation process as a sequence of standard supervised learning problems.

The performance of OFPI has been empirically assessed on three benchmark problems. During the experiments, OFPI employed an ELM network to approximate the value functions because it provides a good trade-off between approximation error and training speed. The results have been compared with those obtained by two baseline RL algorithms, SARSA and Q-learning, combined with eligibility traces and an RBF network as function approximator. OFPI required fewer interactions between agent and environment than SARSA to learn optimal policies in the three benchmark problems. Compared with Q-learning, OFPI was also faster in all problems except in the underwater vehicle problem, in which both achieved the convergence point after 60 episodes.

The increase in data efficiency provided by OFPI comes at a higher computational cost than other RL algorithms. Therefore, OFPI may not be suitable for applications where the time available to process the data is very short. However, the practical usefulness of RL algorithms in most problems is limited by the quantity of data required to learn valid policies more than by the computational cost [57]. Thus, it is usually preferable to enhance data efficiency by using all the available computational resources.

Several lines of future research remain open. OFPI is limited to discrete action spaces, due, in part, to the fact that it is based on the fitted Q iteration algorithm, which also uses discrete actions. In [50], Hafner and Riedmiller proposed a modification of fitted Q iteration that deals with continuous actions; that modification can likely be extended to OFPI. Another possibility that we are analyzing is to improve the approximation capabilities of the ELM network through ensembles [58,59], regularization [60,61] or pruning [62]. It is worth noting, however, that OFPI is not limited to ELM networks. On the contrary, it does not impose any restriction on the approximator used to store the value functions. Thus, it would also be interesting to assess its performance with other approximators.

Acknowledgments

The authors thank the anonymous reviewers for their many suggestions for improving the quality of the manuscript.

References

[1] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, The MIT Press, 1998.
[2] J.D. Martín-Guerrero, F. Gomez, E. Soria-Olivas, J. Schmidhuber, M. Climente-Martí, N.V. Jiménez-Torres, A reinforcement learning approach for individualizing erythropoietin dosages in hemodialysis patients, Expert Syst. Appl. 36 (6) (2009) 9737–9742.
[3] P. Escandell-Montero, M. Chermisi, J.M. Martínez-Martínez, J. Gómez-Sanchis, C. Barbieri, E. Soria-Olivas, F. Mari, J. Vila-Francés, A. Stopper, E. Gatti, J.D. Martín-Guerrero, Optimization of anemia treatment in hemodialysis patients via reinforcement learning, Artif. Intell. Med. 62 (1) (2014) 47–60, doi:10.1016/j.artmed.2014.07.004.
[4] F. Hernández-del Olmo, E. Gaudioso, A. Nevado, Autonomous adaptive and active tuning up of the dissolved oxygen setpoint in a wastewater treatment plant using reinforcement learning, IEEE Trans. Syst. Man Cybern.: Part C 42 (5) (2012) 768–774, doi:10.1109/TSMCC.2011.2162401.
[5] H. Shahbazi, K. Jamshidi, A.H. Monadjemi, H. Eslami, Biologically inspired layered learning in humanoid robots, Knowl.-Based Syst. 57 (2014) 8–27, doi:10.1016/j.knosys.2013.12.003.
Acknowledgments

The authors thank the anonymous reviewers for their many suggestions for improving the quality of the manuscript.

References

[1] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, The MIT Press, 1998.
[2] J.D. Martín-Guerrero, F. Gomez, E. Soria-Olivas, J. Schmidhuber, M. Climente-Martí, N.V. Jiménez-Torres, A reinforcement learning approach for individualizing erythropoietin dosages in hemodialysis patients, Expert Syst. Appl. 36 (6) (2009) 9737–9742.
[3] P. Escandell-Montero, M. Chermisi, J.M. Martínez-Martínez, J. Gómez-Sanchis, C. Barbieri, E. Soria-Olivas, F. Mari, J. Vila-Francés, A. Stopper, E. Gatti, J.D. Martín-Guerrero, Optimization of anemia treatment in hemodialysis patients via reinforcement learning, Artif. Intell. Med. 62 (1) (2014) 47–60, doi:10.1016/j.artmed.2014.07.004.
[4] F. Hernández-del Olmo, E. Gaudioso, A. Nevado, Autonomous adaptive and active tuning up of the dissolved oxygen setpoint in a wastewater treatment plant using reinforcement learning, IEEE Trans. Syst. Man Cybern. Part C 42 (5) (2012) 768–774, doi:10.1109/TSMCC.2011.2162401.
[5] H. Shahbazi, K. Jamshidi, A.H. Monadjemi, H. Eslami, Biologically inspired layered learning in humanoid robots, Knowl.-Based Syst. 57 (2014) 8–27, doi:10.1016/j.knosys.2013.12.003.

[6] J. Peters, S. Schaal, Reinforcement learning of motor skills with policy gradients, Neural Netw. 21 (4) (2008) 682–697, doi:10.1016/j.neunet.2008.02.003.
[7] Z. Zang, D. Li, J. Wang, D. Xia, Learning classifier system with average reward reinforcement learning, Knowl.-Based Syst. 40 (2013) 58–71, doi:10.1016/j.knosys.2012.11.011.
[8] F. Bernardo, R. Agustí, J. Pérez-Romero, O. Sallent, An application of reinforcement learning for efficient spectrum usage in next-generation mobile cellular networks, IEEE Trans. Syst. Man Cybern. Part C 40 (4) (2010) 477–484, doi:10.1109/TSMCC.2010.2041230.
[9] Ö. Coşgun, U. Kula, C. Kahraman, Analysis of cross-price effects on markdown policies by using function approximation techniques, Knowl.-Based Syst. 53 (2013) 173–184, doi:10.1016/j.knosys.2013.08.029.
[10] E. Alpaydin, Introduction to Machine Learning, The MIT Press, 2004.
[11] M. Riedmiller, Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method, in: Proceedings of the 16th European Conference on Machine Learning, 2005, pp. 317–328.
[12] J.A. Boyan, A.W. Moore, Generalization in reinforcement learning: safely approximating the value function, Adv. Neural Inf. Process. Syst. 7 (1995) 369–376.
[13] R.S. Sutton, Generalization in reinforcement learning: successful examples using sparse coarse coding, Adv. Neural Inf. Process. Syst. 8 (1996) 1038–1044.
[14] D. Ernst, P. Geurts, L. Wehenkel, Tree-based batch mode reinforcement learning, J. Mach. Learn. Res. 6 (2005) 503–556.
[15] G.J. Gordon, Approximate Solutions to Markov Decision Processes, School of Computer Science, Carnegie Mellon University, 1999 (Ph.D. thesis).
[16] D. Ormoneit, Ś. Sen, Kernel-based reinforcement learning, Mach. Learn. 49 (2-3) (2002) 161–178.
[17] D.P. Bertsekas, J.N. Tsitsiklis, Neuro-Dynamic Programming, first ed., Athena Scientific, 1996.
[18] L. Busoniu, R. Babuska, B.D. Schutter, D. Ernst, Reinforcement Learning and Dynamic Programming Using Function Approximators, first ed., CRC Press, 2010.
[19] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (1-3) (2006) 489–501, doi:10.1016/j.neucom.2005.12.126.
[20] M.L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, first ed., Wiley-Interscience, 2005.
[21] D.P. Bertsekas, Dynamic Programming and Optimal Control, Vol. II, third ed., Athena Scientific, 2007.
[22] L.P. Kaelbling, M.L. Littman, A.W. Moore, Reinforcement learning: a survey, J. Artif. Intell. Res. 4 (1996) 237–285.
[23] D. Bertsekas, Approximate policy iteration: a survey and some new methods, J. Control Theory Appl. 9 (3) (2011) 310–335, doi:10.1007/s11768-011-1005-3.
[24] M.G. Lagoudakis, R. Parr, L. Bartlett, Least-squares policy iteration, J. Mach. Learn. Res. 4 (2003).
[25] D. Lorente, N. Aleixos, J. Gómez-Sanchis, S. Cubero, J. Blasco, Selection of optimal wavelength features for decay detection in citrus fruit using the ROC curve and neural networks, Food Bioprocess Technol. 6 (2) (2011) 530–541, doi:10.1007/s11947-011-0737-x.
[26] P. Escandell-Montero, J.M. Martínez-Martínez, J.D. Martín-Guerrero, E. Soria-Olivas, J. Gómez-Sanchis, Least-squares temporal difference learning based on an extreme learning machine, Neurocomputing 141 (2014) 37–45, doi:10.1016/j.neucom.2013.11.040.
[27] D. Li, N. Li, J. Wang, T. Zhu, Pornographic images recognition based on spatial pyramid partition and multi-instance ensemble learning, Knowl.-Based Syst. 84 (2015) 214–223.
[28] X. Luo, F. Liu, S. Yang, X. Wang, Z. Zhou, Joint sparse regularization based sparse semi-supervised extreme learning machine (S3ELM) for classification, Knowl.-Based Syst. 73 (2015) 149–160.
[29] C.R. Rao, S.K. Mitra, Generalized Inverse of Matrices and Its Applications, John Wiley & Sons Inc, 1972.
[30] M. Geist, O. Pietquin, A Brief Survey of Parametric Value Function Approximation, Technical Report, IMS Research Group, Supelec, France, 2010.
[31] G. Konidaris, I. Scheidwasser, A. Barto, Transfer in reinforcement learning via shared features, J. Mach. Learn. Res. 13 (2012) 1333–1371.
[32] M. Wiering, M. van Otterlo (Eds.), Reinforcement Learning: State-of-the-Art, 2012 edition, Springer, 2012.
[33] W.B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, second ed., Wiley, 2011.
[34] L.-J. Lin, Self-improving reactive agents based on reinforcement learning, planning and teaching, Mach. Learn. 8 (1992) 293–321.
[35] G. Tesauro, Temporal difference learning and TD-Gammon, Commun. ACM 38 (3) (1995) 58–68, doi:10.1145/203330.203343.
[36] M. Riedmiller, Concepts and facilities of a neural reinforcement learning control architecture for technical process control, Neural Comput. Appl. 8 (4) (2000) 323–338.
[37] J.R. Millán, D. Posenato, E. Dedieu, Continuous-action Q-learning, Mach. Learn. 49 (2) (2002) 247–265, doi:10.1023/A:1017988514716.
[38] A. Antos, C. Szepesvári, R. Munos, Value-iteration based fitted policy iteration: learning with a single trajectory, in: Proceedings of IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, ADPRL 2007, Hawaii, 2007, pp. 330–337.

[39] T.J. Perkins, D. Precup, A convergent form of approximate policy iteration, in: S. Thrun, K. Obermayer (Eds.), Advances in Neural Information Processing Systems, 15, MIT Press, Cambridge, MA, 2002, pp. 1595–1602.
[40] G. Rummery, M. Niranjan, On-line Q-learning Using Connectionist Systems, Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department, 1994.
[41] C.J.C.H. Watkins, P. Dayan, Technical note: Q-learning, Mach. Learn. 8 (3-4) (1992) 279–292.
[42] H.V. Hasselt, Double Q-learning, in: J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, A. Culotta (Eds.), Advances in Neural Information Processing Systems, 23, Curran Associates, Inc., 2010, pp. 2613–2621.
[43] H.R. Maei, C. Szepesvári, S. Bhatnagar, R.S. Sutton, Toward off-policy learning control with function approximation, in: J. Furnkranz, T. Joachims (Eds.), Proceedings of International Conference on Machine Learning, ICML, Omnipress, 2010, pp. 719–726.
[44] M. Santos, J.A. Martín, V. López, G. Botella, Dyna-H: a heuristic planning reinforcement learning algorithm applied to role-playing game strategy decision systems, Knowl.-Based Syst. 32 (2012) 28–36, doi:10.1016/j.knosys.2011.09.008.
[45] A. Fachantidis, I. Partalas, G. Tsoumakas, I. Vlahavas, Transferring task models in reinforcement learning agents, Neurocomputing 107 (2013) 23–32, doi:10.1016/j.neucom.2012.08.039.
[46] H. van Seijen, R.S. Sutton, True online TD(λ), in: Proceedings of the 31st International Conference on Machine Learning, ICML '14, 2014, pp. 692–700.
[47] J. Strahl, T. Honkela, P. Wagner, A Gaussian process reinforcement learning algorithm with adaptability and minimal tuning requirements, in: S. Wermter, C. Weber, W. Duch, T. Honkela, P. Koprinkova-Hristova, S. Magg, G. Palm, A. Villa (Eds.), Proceedings of International Conference on Artificial Neural Networks and Machine Learning, ICANN 2014, Lecture Notes in Computer Science, 8681, Springer International Publishing, 2014, pp. 371–378, doi:10.1007/978-3-319-11179-7_47.
[48] R.S. Sutton, Learning to predict by the methods of temporal differences, Mach. Learn. 3 (1) (1988) 9–44.
[49] S. Adam, L. Busoniu, R. Babuska, Experience replay for real-time reinforcement learning control, IEEE Trans. Syst. Man Cybern. Part C 42 (2) (2012) 201–212, doi:10.1109/TSMCC.2011.2106494.
[50] R. Hafner, M. Riedmiller, Reinforcement learning in feedback control, Mach. Learn. 84 (1) (2011) 137–169, doi:10.1007/s10994-011-5235-x.
[51] M.P. Deisenroth, C.E. Rasmussen, J. Peters, Gaussian process dynamic programming, Neurocomputing 72 (7-9) (2009) 1508–1524, doi:10.1016/j.neucom.2008.12.019.
[52] H. Wang, K. Tanaka, M. Griffin, An approach to fuzzy control of nonlinear systems: stability and design issues, IEEE Trans. Fuzzy Syst. 4 (1) (1996) 14–23, doi:10.1109/91.481841.
[53] X. Ruan, M. Ding, D. Gong, J. Qiao, On-line adaptive control for inverted pendulum balancing based on feedback-error-learning, Neurocomputing 70 (4-6) (2007) 770–776, doi:10.1016/j.neucom.2006.10.012.
[54] H.-J. Rong, S. Suresh, G.-S. Zhao, Stable indirect adaptive neural controller for a class of nonlinear system, Neurocomputing 74 (16) (2011) 2582–2590, doi:10.1016/j.neucom.2010.11.029.
[55] V. Gullapalli, Direct associative reinforcement learning methods for dynamic systems control, Neurocomputing 9 (3) (1995) 271–292, doi:10.1016/0925-2312(95)00035-X.
[56] University of Michigan, CTM: Digital Control Tutorial, 1996, http://www.engin.umich.edu/group/ctm/digital/digital.html (accessed 15.03.11).
[57] S. Kalyanakrishnan, P. Stone, Batch reinforcement learning in a complex domain, in: Proceedings of the Sixth International Joint Conference on Autonomous Agents and Multiagent Systems, ACM, New York, NY, USA, 2007, pp. 650–657.
[58] M. Heeswijk, Y. Miche, T. Lindh-Knuutila, P.A. Hilbers, T. Honkela, E. Oja, A. Lendasse, Adaptive ensemble models of extreme learning machines for time series prediction, in: Proceedings of the 19th International Conference on Artificial Neural Networks: Part II, ICANN '09, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 305–314, doi:10.1007/978-3-642-04277-5_31.
[59] P. Escandell-Montero, J.M. Martínez-Martínez, E. Soria-Olivas, J. Guimerá-Tomás, M. Martínez-Sober, A.J. Serrano López, Regularized committee of extreme learning machine for regression problems, in: Proceedings of the 20th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2012, Belgium, 2012, pp. 251–256.
[60] E. Soria-Olivas, J. Gómez-Sanchis, J.D. Martín, J. Vila-Francés, M. Martínez, J.R. Magdalena, A.J. Serrano, BELM: Bayesian extreme learning machine, IEEE Trans. Neural Netw. 22 (3) (2011) 505–509, doi:10.1109/TNN.2010.2103956.
[61] J.M. Martínez-Martínez, P. Escandell-Montero, E. Soria-Olivas, J.D. Martín-Guerrero, R. Magdalena-Benedito, J. Gómez-Sanchis, Regularized extreme learning machine for regression problems, Neurocomputing 74 (17) (2011) 3716–3721, doi:10.1016/j.neucom.2011.06.013.
[62] Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, A. Lendasse, OP-ELM: optimally pruned extreme learning machine, IEEE Trans. Neural Netw. 21 (1) (2010) 158–162, doi:10.1109/TNN.2009.2036259.