Journal of Process Control 87 (2020) 166–178
A model-based deep reinforcement learning method applied to finite-horizon optimal control of nonlinear control-affine system

Jong Woo Kim (a), Byung Jun Park (a), Haeun Yoo (b), Tae Hoon Oh (a), Jay H. Lee (b), Jong Min Lee (a,*)

(a) School of Chemical and Biological Engineering, Institute of Chemical Processes, Seoul National University, 1, Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
(b) Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology, 291, Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea

Article history: Received 23 September 2019; Revised 20 January 2020; Accepted 4 February 2020

Keywords: Reinforcement learning; Approximate dynamic programming; Deep neural networks; Globalized dual heuristic programming; Finite horizon optimal control problem; Hamilton–Jacobi–Bellman equation

Abstract: The Hamilton–Jacobi–Bellman (HJB) equation can be solved to obtain optimal closed-loop control policies for general nonlinear systems. As it is seldom possible to solve the HJB equation exactly for nonlinear systems, either analytically or numerically, methods to build approximate solutions through simulation-based learning have been studied under various names such as neuro-dynamic programming (NDP) and approximate dynamic programming (ADP). The aspect of learning connects these methods to reinforcement learning (RL), which also tries to learn optimal decision policies through trial-and-error based learning. This study develops a model-based RL method, which iteratively learns the solution to the HJB equation and its associated equations. We focus particularly on the control-affine system with a quadratic objective function and the finite horizon optimal control (FHOC) problem with time-varying reference trajectories. The HJB solutions for such systems involve time-varying value, costate, and policy functions subject to boundary conditions. To represent the time-varying HJB solution in a high-dimensional state space in a general and efficient way, deep neural networks (DNNs) are employed. It is shown that the use of DNNs, compared to shallow neural networks (SNNs), can significantly improve the performance of a learned policy in the presence of uncertain initial state and state noise. Examples involving a batch chemical reactor and a one-dimensional diffusion-convection-reaction system are used to demonstrate this and other key aspects of the method. © 2020 Elsevier Ltd. All rights reserved.

1. Introduction

In optimal control theory, the Hamilton–Jacobi–Bellman (HJB) equation plays a key role, not only in verifying the optimality of a given policy, but also in constructing one. The HJB equation is a partial differential equation (PDE) and its solution is referred to as the value function, which measures the 'cost-to-go' of a starting state [5]. However, since it can seldom be solved exactly for nonlinear systems, various approaches to build or implement approximate solutions have been tried. Broadly, these approaches can be classified into two categories. The first approach tries to mimic the behavior of the optimal policy by solving a finite horizon optimal control problem on-line for the encountered state at each sample time, employing various mathematical programming techniques, as is done in model predictive control (MPC) [8]. The other approach tries to construct approximate solutions to the HJB equation via



Corresponding author. E-mail addresses: [email protected] (J.H. Lee), [email protected] (J.M. Lee).

https://doi.org/10.1016/j.jprocont.2020.02.003 0959-1524/© 2020 Elsevier Ltd. All rights reserved.

simulation-based learning, as in neurodynamic programming [7], heuristic dynamic programming [26], and approximate dynamic programming (ADP) [39]. The fact that optimal control policies are learned iteratively through data connects the latter approach to the machine-learning approach of reinforcement learning (RL), where an agent also tries to learn optimal decision policies iteratively by interacting with its environment [45]. The potential advantage of the learning-based approach over the on-line mathematical programming (MP) based approach is three-fold. First, a closed-loop state feedback policy can be obtained for general stochastic control problems, while the open-loop optimal control solution in MPC, though recomputed at each time, can be much less effective in such problems [22–24]. Second, the majority of the computation is done off-line, whereas the approach of MPC can lead to exorbitant on-line computational demand [21,38]. Unlike in MP-based approaches, the system dynamics does not have to be propagated over the entire horizon in the HJB equation, and the optimization problem is decomposed into single-stage problems. Hence, the off-line computational complexity does not grow excessively in the learning-based solution of the HJB equation.


The third potential advantage is that the learning-based approach is flexible enough to work with varying levels of system knowledge, ranging from a completely known system to no system knowledge at all, i.e., having on-line data only [26]. It is worth noting that there exist several MP-based methods that aim to avoid the re-optimization and to obtain the closed-loop policy explicitly, such as methods using nonlinear optimization sensitivity [53] and parametric programming [4]. The proposed approach provides an alternative viewpoint based on interactive learning. When the system dynamics is control-affine and the cost function is quadratic, the HJB equation can be transformed and decomposed into simpler equations with respect to the value function, its first-order derivative (called the costate function), and the policy function [1]. Globalized dual heuristic programming (GDHP), a model-based RL method, is based on these surrogate equations, utilizing all such information about the system dynamics and the cost function [26,40,52,57]. Finite horizon optimal tracking control (FHOC) is the main problem of interest in this study. RL approaches for FHOC have been widely studied [2,36,51,54,56]. The FHOC problem has certain features that distinguish it from the infinite horizon problem. The solution involves time-varying value, costate, and policy functions with terminal boundary conditions. In the case that time-varying trajectories must be tracked, the usual approach has been to reformulate the tracking problem as a regulation problem by introducing the offset dynamics [34,51,54]. However, this method requires the steady-state values of the state and the control input to be known for each target reference value in order to achieve offset-free setpoint tracking. Instead, this study adopts state space augmentation including the integral action and the time variable in order to express the time dependency and to be free from obtaining such prior information [8]. One of the key elements of these learning-based methods is a function approximator, which is used to represent the value, policy, or other related function in a parametric form. A popular choice has been neural networks, given their property as universal approximators: a neural network with a single hidden layer can represent any bounded continuous function defined on any compact subset of the real space [18]. Hence, thus far, most studies have employed single-layer neural networks (SNNs) [1,14,17,35,47]. The learning method for SNNs can be formulated as a least-squares problem, by taking advantage of the linearity of the approximator. In the case of FHOC, in order to construct a time-varying approximator, Cheng [10] and Adhyaru [2] used time-varying weights, whereas Heydari [17], Zhao [56], and Mu [36] employed time-varying activation functions. Both approaches showed good results for obtaining feedback policies of the FHOC problem in small-scale problems for which the functions to represent are relatively simple. Function approximation methods with SNNs have been widely extended to formulations such as robust control [2,19,49,54], event-triggered systems [50,52,58], non-zero-sum games [28,55,59], and distributed parameter systems [31,46]. The latest trend is to explore the potential benefits of deep neural networks (DNNs), which have multiple hidden layers [15]. Though an SNN is a universal approximator, it does not scale very well, showing exponential complexity growth for certain types of functions.
Having multiple hidden layers has been shown to result in better scalability, meaning there is a right choice of depth for each function from the viewpoint of statistical efficiency. The use of DNNs has allowed model-free RL methods to be applied successfully to problems with high-dimensional continuous state and action spaces [33], and state-of-the-art deep RL (DRL) has shown remarkable performance in applications such as robotics, games, and spacecraft control [11,25,44]. Motivated by these recent successes, this study adopts DNNs as function approximators. In the context of the proposed model-based RL method, DNNs will


be shown to give more robust approximations in the presence of varying initial state and state noise, compared to SNNs. In summary, the proposed RL method incorporates DNNs into the GDHP algorithm to address the FHOC problem in which the system is nonlinear control-affine and the stage-wise cost is quadratic. A suitable DNN structure for the GDHP method, which handles certain features of the FHOC problem (e.g., time-varying value and policy functions and the presence of boundary conditions), is suggested. The HJB equation is modified to a form suitable for the delta-input formulation used to address the time-varying reference tracking requirement. In addition, several ways to alleviate difficulties arising in training DNNs are introduced and adopted. Finally, the overall method is applied to examples of a nonlinear batch reactor and a one-dimensional diffusion-convection-reaction process, which have high-dimensional state spaces. It is shown that the use of DNNs is critical in obtaining a converging policy with satisfactory performance in the presence of uncertainties. The remainder of this paper is organized as follows: Section 2 introduces the FHOC tracking problem and the derivation of the HJB equation. Section 3 presents the concept of the value, costate, and policy functions and various RL algorithms which differ by the level of utilized model knowledge. Gradient-based learning methods with DNNs, their problems, and the corresponding fixes are explained in Section 4. Section 5 illustrates the examples involving a chemical reactor and a diffusion-convection-reaction process to examine the comparative performance of the proposed method. Finally, some discussions and concluding remarks are provided in Section 6.

2. Finite horizon optimal tracking control problem and the HJB equation

Consider a stochastic nonlinear control-affine system of





$dx = \big(\tilde{f}(x) + \tilde{g}(x)u\big)\,dt + dw, \qquad y = h(x), \qquad x(0) \sim p(x_0), \qquad x \in \Omega$  (1)

where $x \in \mathbb{R}^S$ is the state, $u \in \mathbb{R}^A$ is the control, $y \in \mathbb{R}^O$ is the output vector, and $p(x_0)$ is the probability distribution of the initial state. $\Omega \subset \mathbb{R}^S$ denotes the compact state space. $\tilde{f}(x) \in \mathbb{R}^S$, $\tilde{g}(x) \in \mathbb{R}^{S\times A}$, and $h(x) \in \mathbb{R}^O$ are smooth nonlinear functions that represent the system dynamics. The state dynamics is subject to the state transition uncertainty $dw \in \mathbb{R}^S$, which is independent of both state and control. For the development and discussion of sampled-data control, it is convenient to express Eq. (1) in the discrete-time form

$x_{k+1} = f(x_k) + g(x_k)u_k + w_k, \qquad y_k = h(x_k), \qquad x \in \Omega, \qquad x_0 \sim p(x_0), \qquad k \in \{0, \ldots, T-1\}$  (2)

where the subscript k denotes the sample time index and T is the time horizon. Note that the operators $f$, $g$, and $w$ denote the explicit integration of the ODEs over the time interval $[t_k, t_k + \Delta t_k]$, either analytically or through numerical approximations. For example, $f$ and $g$ may be defined through the Runge–Kutta (RK) integration equations, which are explicit but nevertheless include some intermediate variables in the calculation. In the case of $w$, the Euler–Maruyama method, which is a stochastic version of the Euler discretization, can be used [37].
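For concreteness, the following is a minimal sketch (Python with NumPy is assumed; the paper does not specify an implementation) of one Euler–Maruyama integration step for the control-affine SDE in Eq. (1). The function names `f_tilde` and `g_tilde` and the toy dynamics in the usage example are hypothetical placeholders.

```python
import numpy as np

def euler_maruyama_step(x, u, dt, f_tilde, g_tilde, noise_std, rng):
    """One Euler-Maruyama step for dx = (f_tilde(x) + g_tilde(x) u) dt + dw.

    f_tilde(x) -> (S,), g_tilde(x) -> (S, A); the Wiener increment has
    standard deviation proportional to sqrt(dt) and is state-independent.
    """
    drift = f_tilde(x) + g_tilde(x) @ u                    # control-affine drift
    dw = noise_std * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x + drift * dt + dw

# Usage sketch with toy dynamics (purely illustrative):
rng = np.random.default_rng(0)
x_next = euler_maruyama_step(
    x=np.array([1.0, -0.5]), u=np.array([0.1, 0.0]), dt=0.01,
    f_tilde=lambda x: -x, g_tilde=lambda x: np.eye(2),
    noise_std=0.05, rng=rng)
```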


In constant reference tracking problems, it is common to adopt the delta-input $\Delta u_k = u_k - u_{k-1}$ together with an augmented state space formulation. Additionally, to express the time dependency of the FHOC problem, the time variable is included in the augmentation. The augmented state $\bar{x}_k = [x_k^\top, u_{k-1}^\top, t_k]^\top$ defines the new state-space dynamics

$\bar{x}_{k+1} = \begin{bmatrix} x_{k+1} \\ u_k \\ t_{k+1} \end{bmatrix} = \begin{bmatrix} f(x_k) + g(x_k)u_{k-1} \\ u_{k-1} \\ t_k + \Delta t_k \end{bmatrix} + \begin{bmatrix} g(x_k) \\ I \\ 0 \end{bmatrix}\Delta u_k + B\,w_k, \quad B := \begin{bmatrix} I \\ 0 \\ 0 \end{bmatrix}, \quad y_k = h(x_k, u_{k-1}, t_k), \quad \bar{x}(0) \sim p(\bar{x}_0), \quad \bar{x} \in \Omega$  (3)

to ensure integral action. Representing the distribution function for the discrete-time state transition as $p(\bar{x}_{k+1}\mid \bar{x}_k, \Delta u_k, w_k)$, the likelihood of the state trajectory $\bar{x}_{0:T} := \{\bar{x}_0, \bar{x}_1, \ldots, \bar{x}_T\}$ is obtained by the successive multiplication of the probability distributions of the initial state and the state transitions as

$p(\bar{x}_{0:T}) = p(\bar{x}_0)\prod_{k=0}^{T-1} p(\bar{x}_{k+1}\mid \bar{x}_k, \Delta u_k, w_k)$  (4)

The objective function of the FHOC problem is defined as

$J(\bar{x}_0;\, \Delta u_{0:T-1}) = \mathbb{E}_{p(\bar{x}_{0:T})}\!\left[\phi(\bar{x}_T) + \sum_{i=0}^{T-1} r(\bar{x}_i, \Delta u_i)\right]$  (5)

where T is the terminal time index, $\phi(\bar{x}_T)$ is the terminal cost, and $r(\bar{x}_k, \Delta u_k)$ is the stage-wise cost at time $t_k$. The expectation operator $\mathbb{E}[\cdot]$ is defined with respect to the state trajectory likelihood $p(\bar{x}_{0:T})$. The objective function measures the expected performance of a trajectory. The stage-wise cost and the terminal cost functions are positive definite and quadratic:

$r(\bar{x}_k, \Delta u_k) := \|y_k - \rho_k\|_Q^2 + \|\Delta u_k\|_R^2, \qquad k = 0, \ldots, T-1$  (6)

$\phi(\bar{x}_T) := \|y_T - \rho_T\|_H^2$  (7)

where $\rho_k = \rho(t_k)$ denotes the reference function for state $\bar{x}_k$, $\|\cdot\|$ is the weighted Euclidean norm, and Q, R, and H are the positive-definite weighting matrices. The time-varying 'cost-to-go' function V, also referred to as the value function, is defined as

$V(\bar{x}_k;\, \Delta u_{k:T-1}) := J(\bar{x}_k;\, \Delta u_{k:T-1}) = \mathbb{E}_{p(\bar{x}_{k:T})}\!\left[\phi(\bar{x}_T) + \sum_{i=k}^{T-1} r(\bar{x}_i, \Delta u_i)\right]$  (8)

which measures the expected value of the objective function J starting from $\bar{x}_k$ at time k. By Bellman's principle of optimality, the optimal cost-to-go function satisfies the following recursive equation

$V^*(\bar{x}_k) = \min_{\Delta u_k}\ \mathbb{E}_{p(\bar{x}_{k+1}\mid \bar{x}_k, \Delta u_k, w_k)}\!\left[r(\bar{x}_k, \Delta u_k) + V^*(\bar{x}_{k+1})\right], \qquad k = 0, \ldots, T-1$  (9)

where $V^*(\cdot)$ denotes the optimal value function. The costate function is defined as the partial derivative of the value function with respect to the augmented state $\bar{x}_k$:

$\lambda^*(\bar{x}_k) := \frac{\partial V^*(\bar{x}_k)}{\partial \bar{x}_k}$  (10)

The optimality condition expressed in terms of the costate can be obtained by differentiating both sides of the Bellman equation with respect to the state $\bar{x}_k$, as in [40]:

$\lambda^*(\bar{x}_k) = \mathbb{E}_{p(\bar{x}_{k+1}\mid \bar{x}_k, \Delta u_k, w_k)}\!\left[\frac{\partial r(\bar{x}_k, \Delta u_k)}{\partial \bar{x}_k} + \left(\frac{\partial \bar{x}_{k+1}}{\partial \bar{x}_k}\right)^{\!\top}\lambda^*(\bar{x}_{k+1})\right]$  (11)

With the state-independence assumption on the state transition uncertainty $w_k$,

$\frac{\partial r(\bar{x}_k, \Delta u_k)}{\partial \bar{x}_k} = 2\left(\frac{\partial h(x_k)}{\partial \bar{x}_k}\right)^{\!\top} Q\,(y_k - \rho_k)$  (12)

Evaluation of Eqs. (11) and (12) requires the first-order derivatives of the system dynamics functions $f(x_k)$, $g(x_k)$, and $h(x_k)$, which can be obtained either analytically or numerically. The optimal control policy function $\pi^*(\bar{x}_k)$ is a state feedback control law that minimizes the objective function. For the control-affine system with the quadratic cost function, the explicit form of the optimal control policy function can be obtained from the first-order optimality condition of Eq. (9) as

$\pi^*(\bar{x}_k) = \arg\min_{\Delta u_k}\ \mathbb{E}_{p(\bar{x}_{k+1}\mid \bar{x}_k, \Delta u_k, w_k)}\!\left[r(\bar{x}_k, \Delta u_k) + V^*(\bar{x}_{k+1})\right] = -\frac{1}{2}R^{-1}g^{\top}(\bar{x}_k)\,\mathbb{E}_{p(\bar{x}_{k+1}\mid \bar{x}_k, \Delta u_k, w_k)}\!\left[\lambda^*(\bar{x}_{k+1})\right]$  (13)

Substituting Eq. (13) into Eq. (9) yields the following discrete-time version of the Hamilton–Jacobi–Bellman (HJB) equation:

$V^*(\bar{x}_k) = \mathbb{E}_{p(\bar{x}_{k+1}\mid \bar{x}_k, \Delta u_k, w_k)}\!\left[\|y_k - \rho_k\|_Q^2 + \frac{1}{4}\lambda^{*\top}(\bar{x}_{k+1})\,g(\bar{x}_k)R^{-1}g^{\top}(\bar{x}_k)\,\lambda^*(\bar{x}_{k+1}) + V^*(\bar{x}_{k+1})\right]$  (14)

$V^*(\bar{x}_T) = \phi(\bar{x}_T), \qquad \lambda^*(\bar{x}_T) = \partial\phi(\bar{x}_T)/\partial\bar{x}_T$  (15)

whose solution is the optimal value function $V^*$. The resulting nonlinear PDE does not yield an analytical solution in general, except for the linear quadratic optimal control problem where $V^*(\bar{x}_k) = \frac{1}{2}\bar{x}_k^\top P_k \bar{x}_k$. In that case the PDE can be transformed into an ODE with respect to $P_k$, which is the well-known Riccati difference equation. For a nonlinear system, however, the PDE cannot be transformed into an ODE because of the function $g(\bar{x}_k)$, thus motivating the development of approximate numerical solution methods.

3. Model-based reinforcement learning algorithm: globalized dual heuristic programming (GDHP)

In this section, a numerical algorithm for solving the HJB equation is developed, which involves a successive, iterative evaluation of the value, costate, and policy functions. The algorithm is not new and is referred to as the GDHP algorithm in the literature [40,52,57]. Instead of solving Eq. (14) directly, the algorithm sequentially updates the individual optimality conditions with respect to the value, costate, and policy functions using the measurement data $(x_k, u_k, r_k, x_{k+1})$ obtained at each time step k. The iterative scheme of GDHP starts with initialization of the functions with zero vectors

$V^{(0)} = 0, \qquad \lambda^{(0)} = 0_S, \qquad \pi^{(0)} = 0_A$  (16)

where $0_n$ denotes a zero vector with n entries. At time step k of iteration step i, the state and output at the next time step are computed from the system dynamics under the policy $\pi^{(i)}(x_k)$ and the additive state noise $w_k$:

$x_{k+1} = f(x_k) + g(x_k)\,\pi^{(i)}(x_k) + B\,w_k$  (17)

$y_k = h(x_k)$  (18)

$r_k = \|y_k - \rho_k\|_Q^2 + \|\pi^{(i)}(x_k)\|_R^2$  (19)
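The following is a minimal sketch (not from the paper; plain NumPy, with user-supplied discrete-time maps `f`, `g`, `h` and reference `rho`, all hypothetical names) of how one transition tuple of the augmented system can be generated under the current policy, following the augmented form of Eq. (3) and the simulation step of Eqs. (17)–(19).

```python
import numpy as np

def gdhp_transition(x_aug, nx, nu, policy, f, g, h, rho, Q, R, dt, noise_std, rng):
    """One transition of the augmented system, cf. Eqs. (3) and (17)-(19).

    x_aug = [x, u_prev, t]; policy(x_aug) returns the input increment du
    (delta-input formulation); the additive noise enters only the original
    state block, i.e. B = [I; 0; 0].
    """
    x, u_prev, t = x_aug[:nx], x_aug[nx:nx + nu], x_aug[-1]
    du = policy(x_aug)                        # current policy pi^(i)
    u = u_prev + du                           # integral (delta-input) action
    w = noise_std * rng.standard_normal(nx)   # additive state noise w_k
    x_next = f(x) + g(x) @ u + w              # discrete-time dynamics
    y = h(x)                                  # Eq. (18)
    e = y - rho(t)
    r = e @ Q @ e + du @ R @ du               # Eq. (19): stage-wise cost
    x_aug_next = np.concatenate([x_next, u, [t + dt]])
    return x_aug, du, r, x_aug_next
```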

A measurement data tuple (xk , uk , rk , xk+1 ) is stored to evaluate the value, costate, and policy functions in the next iteration step.


At the ith iteration step, the value and costate functions are updated using the optimality equations (Eqs. (9) and (11)), in which the minimum values of the right-hand sides are evaluated with the policy used in the ith iteration, $\pi_k^{(i)} = \pi^{(i)}(x_k)$ [40]:

$V^{(i+1)}(x_k) = r(x_k, \pi_k^{(i)}) + V^{(i)}(x_{k+1})$  (20)

$\lambda^{(i+1)}(x_k) = \frac{\partial r(x_k, \pi_k^{(i)})}{\partial x_k} + \left(\frac{\partial x_{k+1}}{\partial x_k}\right)^{\!\top}\lambda^{(i)}(x_{k+1})$  (21)

Finally, the policy function of the next iteration step is obtained with the updated costate function:

$\pi^{(i+1)}(x_k) = -\frac{1}{2}R^{-1}g^{\top}(x_k)\,\lambda^{(i+1)}(x_{k+1})$  (22)
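A minimal sketch (not from the paper; all function names are hypothetical) of how a single sampled transition produces the iteration-(i+1) values of Eqs. (20)–(22). Here `V_i` and `lam_i` are the current value and costate estimates, while `dr_dx` and `dx_next_dx` supply the cost and dynamics derivatives discussed around Eqs. (11)–(12) and (45)–(47).

```python
def gdhp_update_point(x, du, r, x_next, V_i, lam_i, dr_dx, dx_next_dx, g, R_inv):
    """Point-wise GDHP recursion, Eqs. (20)-(22), for one sampled transition.

    Returns the (i+1)-th value, costate, and policy evaluated at this sample;
    in practice these serve as regression targets for the approximators.
    """
    V_new = r + V_i(x_next)                                       # Eq. (20)
    lam_new = dr_dx(x, du) + dx_next_dx(x, du).T @ lam_i(x_next)  # Eq. (21)
    # Eq. (22) formally uses the updated costate lambda^(i+1); in the
    # network implementation this is the just-updated costate approximator
    # evaluated at x_{k+1} (cf. Eq. (28)), approximated here by lam_i.
    pi_new = -0.5 * R_inv @ g(x).T @ lam_i(x_next)
    return V_new, lam_new, pi_new
```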

Note that the expectations with respect to $p(x_{k+1}\mid x_k, u_k, w_k)$ in the optimality equations of the value, costate, and policy functions are estimated with a single sampled data point in the GDHP algorithm, in a Monte-Carlo fashion. The above-mentioned iteration scheme is performed until the value, costate, and policy functions converge. Since the expectations are computed by the Monte-Carlo method with a single sampled data point per iteration, the procedure typically requires large amounts of data for convergence [45]. In view of this, any model knowledge incorporated into the iteration procedure (e.g., $\partial x_{k+1}/\partial x_k$ and $g(x_k)$) can be helpful. The iteration period can be set arbitrarily with respect to the time interval, such that the algorithm can be implemented either on a run-to-run basis or in batch mode, where the iteration is performed after data from several runs are collected. In this study, we perform the iteration at every sample time. GDHP is in fact a generalized version combining the ideas of heuristic dynamic programming (HDP) and dual heuristic programming (DHP) [40]. In HDP, only the equations for the value and policy functions (Eqs. (20) and (22)) are used, whereas only those for the costate and policy functions (Eqs. (21) and (22)) are employed in DHP. Since the costate function can be obtained in HDP by differentiating the value function directly, the system dynamics function $f(x_k)$ is not required. However, errors in the value function can be amplified through differentiation, which can cause unstable learning behavior throughout the HDP algorithm.

4. Function approximation and learning with deep neural networks

4.1. GDHP with a function approximator

Value, costate, and policy functions are typically highly complex nonlinear functions defined on the state space $\Omega$. One of the main difficulties in RL is the type selection and structure design of the function approximator [9]. This difficulty stems from several factors. First, the optimal value, costate, and policy functions are defined globally on $\Omega$, not only on a local region generated by a single trajectory. Thus the approximated functions should also be defined over the entire state space. Second, the nonlinearities of the functions depend on the system dynamics and the cost functions, and thus can be arbitrarily complex. Finally, the complexity may grow even larger in stochastic cases where the optimality equations Eqs. (9)–(13) contain the expectation operator. As universal function approximators, deep neural networks (DNNs) have received much attention [20]. Reinforcement learning with DNNs, often called deep RL (DRL), makes it possible to solve RL problems in very high-dimensional state spaces [27,33]. The recent success of DNNs can be attributed to the following: First, DNNs are known to provide a highly scalable structure with respect to the dimension of the function input, as they can perform automatic feature selection [6]. Second, the development of automatic differentiation


(AD), pre-training for good initialization, and GPU-based computing have significantly lowered the barrier to training DNNs. The robust interpolation capabilities of DNNs can compensate for the weakness of the GDHP algorithm in estimating the HJB equation, which is basically a sample-based and bootstrapping method. Specifically, the right-hand sides of the GDHP update rules, i.e., Eqs. (20)–(22), contain recursive function evaluations for each noisy argument $x_{k+1}$. Hence, the approximators should represent the value, costate, and policy functions by interpolating between such noisy, transient data. Since the overfitting problem is inevitable with any complex function approximator with a large number of fitting parameters, the performance of the learned control policy can degrade significantly when the uncertainties encountered on-line are significantly different from those seen during the off-line training. Methods to control the trade-off between generalizability and proneness to overfitting are discussed in the following section. Section 5 shows that the DNN approximations, when obtained using the state-of-the-art stable DRL learning methods, significantly outperform the SNN approximations used in previous studies [30] in terms of robustness to stochastic uncertainties. In the GDHP algorithm, the measurement tuple $(x_k, u_k, r_k, x_{k+1})$, which is composed of the current augmented state, control, cost, and the successor augmented state, is obtained at every time step. The elements of the tuple at the terminal time are just the terminal augmented state and terminal cost, $(x_T, \phi)$. The measurement tuple sets are expressed as







$\mathcal{D} = \left\{S_n \mid S_n = \left(x_{k_n}, u_{k_n}, r_{k_n}, x_{k_n+1}\right),\ n \in \mathbb{N}\right\}$  (23)

$\mathcal{D}_T = \left\{S_n \mid S_n = \left(x_{T_n}, \phi_n\right),\ n \in \mathbb{N}\right\}$  (24)

where Sn is the nth measurement tuple of the set. The augmented states within the set D can be alternatively expressed as a compact matrix form, XD , XD+ ∈ R(S+A+1 )×N such that XD = [xk1 , xk2 , . . . , xkN ] and XD+ = [xk1 +1 , xk2 +1 , . . . , xkN +1 ], respectively. Denote the function index set of value, costate, and policy functions as ψ ∈ {V, λ, π }. DNNs approximations of value, costate, and policy functions are written as

$\hat{\psi}(X_D) = \hat{W}_{\psi,H+1}\,\sigma_{\psi,H}\!\left(\hat{W}_{\psi,H}\,\sigma_{\psi,H-1}\!\left(\cdots\,\hat{W}_{\psi,2}\,\sigma_{\psi,1}\!\left(\hat{W}_{\psi,1}X_D\right)\cdots\right)\right)$  (25)

We assume that the three networks have the same number of layers, i.e., H, and the same widths at each layer. The weight matrices and the nonlinear activation functions of the lth layer of the function $\hat{\psi}$ are denoted as $\hat{W}_{\psi,l}$ and $\sigma_{\psi,l}(\cdot)$, respectively.
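A minimal sketch of the layered approximator in Eq. (25). The paper does not specify a software framework; PyTorch is assumed here, and the layer widths follow the hyperparameters reported later in Section 5.1.1 purely as an illustration.

```python
import torch.nn as nn

def make_network(in_dim, hidden, out_dim):
    """Fully connected network psi_hat(x) = W_{H+1} sigma_H(... sigma_1(W_1 x)),
    with leaky-ReLU activations as in Section 5.1.1."""
    layers, prev = [], in_dim
    for width in hidden:
        layers += [nn.Linear(prev, width), nn.LeakyReLU()]
        prev = width
    layers.append(nn.Linear(prev, out_dim))       # linear output layer
    return nn.Sequential(*layers)

aug_dim = 8                                       # S + A + 1 for Example 1
value_net   = make_network(aug_dim, [50, 50, 30], 1)        # V_hat
costate_net = make_network(aug_dim, [50, 50, 50], aug_dim)  # lambda_hat
policy_net  = make_network(aug_dim, [50, 50, 30], 2)        # pi_hat, A = 2
```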

In accordance with the recursion in Eqs. (20)–(22), the ‘targets’ that the current estimates of the value, costate, and policy functions (i.e., ψˆ (xk )) should satisfy regarding the optimality equations, denoted as τ V,k , τ λ,k , and τ π ,k , respectively, are evaluated each time when a single sample (xk , uk , rk , xk+1 ) becomes available [29]:

$\tau_{V,k} = r_k + \hat{V}(x_{k+1})$  (26)

$\tau_{\lambda,k} = 2\left(\frac{\partial h(x_k)}{\partial x_k}\right)^{\!\top} Q\left(h(x_k) - \rho_k\right) + \left(\frac{\partial x_{k+1}}{\partial x_k}\right)^{\!\top}\hat{\lambda}(x_{k+1})$  (27)

$\tau_{\pi,k} = -\frac{1}{2}R^{-1}g^{\top}(x_k)\,\hat{\lambda}(x_{k+1})$  (28)

In addition to the optimality targets, we use the residual which measures the deviation from the definition of the costate function in Eq. (10),

$e_{V\lambda,k} = \frac{\partial \hat{V}(x_k)}{\partial x_k} - \hat{\lambda}(x_k)$  (29)


This equation forces the value and the costate functions to satisfy the definition explicitly. The targets associated with the boundary conditions of the value and costate functions τ V,T and τ λ,T are also defined as

$\tau_{V,T} = \phi(x_T)$  (30)

$\tau_{\lambda,T} = \frac{\partial\phi(x_T)}{\partial x_T}$  (31)
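The following is a minimal sketch (PyTorch assumed; names hypothetical) of how the targets in Eqs. (26)–(28) and the costate-consistency residual in Eq. (29) can be evaluated for one sample. The matrices `dh_dx`, `dxnext_dx`, and `g_x` are the model-based derivatives discussed around Eqs. (11)–(12) and (45)–(47), supplied as tensors.

```python
import torch

def gdhp_targets(x, x_next, r, rho, Q, R_inv, g_x, dh_dx, dxnext_dx,
                 value_net, costate_net, h):
    """Targets of Eqs. (26)-(28) and residual of Eq. (29) for one sample."""
    with torch.no_grad():
        tau_V = r + value_net(x_next)                                 # Eq. (26)
        lam_next = costate_net(x_next)
        tau_lam = (2.0 * dh_dx.T @ Q @ (h(x) - rho)
                   + dxnext_dx.T @ lam_next)                          # Eq. (27)
        tau_pi = -0.5 * R_inv @ g_x.T @ lam_next                      # Eq. (28)
    # Costate-consistency residual, Eq. (29): dV/dx - lambda_hat via autograd
    x_req = x.detach().clone().requires_grad_(True)
    dV_dx = torch.autograd.grad(value_net(x_req).sum(), x_req)[0]
    e_V_lam = dV_dx - costate_net(x)
    return tau_V, tau_lam, tau_pi, e_V_lam
```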

When the algorithm proceeds online, the data collection and training can proceed concurrently. The DNN weights $\hat{W}_{\psi,l}$ become incrementally and adaptively optimized to minimize the residuals by gradient-based optimization as the measurements are obtained. Denote the vectors $\tau_V$, $\tau_\lambda$, $\tau_\pi$, and $e_{V\lambda}$ as the stacked vectors of Eqs. (26)–(29) for the set $\mathcal{D}$, i.e., $\tau_V = \mathrm{vec}\!\left([\tau_{V,k_1}, \tau_{V,k_2}, \ldots, \tau_{V,k_{|\mathcal{D}|}}]\right)$. Similarly, the vectors $\tau_{V,T}$ and $\tau_{\lambda,T}$ are the stacked vectors of Eqs. (30) and (31) for the set $\mathcal{D}_T$. Then we can define the loss functions with the squares of the Euclidean norms as

$L_V(\mathcal{D}\cup\mathcal{D}_T) = \|\hat{V}(X_D) - \tau_V\|_2^2 + \beta_{V,1}\|e_{V\lambda}\|_2^2 + \beta_{V,2}\|\hat{V}(X_{D_T}) - \tau_{V,T}\|_2^2$  (32)

$L_\lambda(\mathcal{D}\cup\mathcal{D}_T) = \|\hat{\lambda}(X_D) - \tau_\lambda\|_2^2 + \beta_{\lambda,1}\|e_{V\lambda}\|_2^2 + \beta_{\lambda,2}\|\hat{\lambda}(X_{D_T}) - \tau_{\lambda,T}\|_2^2$  (33)

$L_\pi(\mathcal{D}) = \|\hat{\pi}(X_D) - \tau_\pi\|_2^2$  (34)

where $\beta_{V,1}$, $\beta_{V,2}$, $\beta_{\lambda,1}$, and $\beta_{\lambda,2}$ are scalar weighting factors. The weights of the DNNs $\hat{W}_{\psi,l}$ are updated by the gradient descent rule

$\mathrm{vec}(\hat{W}^{+}_{\psi,l}) = \mathrm{vec}(\hat{W}_{\psi,l}) - \eta_\psi\,\frac{\partial L_\psi}{\partial\,\mathrm{vec}(\hat{W}_{\psi,l})}, \qquad l \in \{1, \ldots, H+1\},\ \psi \in \{V, \lambda, \pi\}$  (35)

where $\hat{W}_{\psi,l}$ and $\hat{W}^{+}_{\psi,l}$ denote the current and updated lth-layer weight estimates of function $\psi$, respectively, and $\eta_\psi$ denotes the learning rate. The gradient of the networks with respect to the DNN weights is calculated by the backpropagation rule. Note that previous GDHP algorithms relate the value and costate functions implicitly by sharing the first layer of their neural networks [29,40]. A single objective function is then used for learning the value and the costate:



$L = \|\hat{V}(X_D) - \tau_V\|_2^2 + \beta_1\|\hat{\lambda}(X_D) - \tau_\lambda\|_2^2 + \beta_2\|\hat{V}(X_{D_T}) - \tau_{V,T}\|_2^2 + \beta_3\|\hat{\lambda}(X_{D_T}) - \tau_{\lambda,T}\|_2^2$  (36)
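For completeness, a minimal sketch (PyTorch assumed; not the authors' code) of one training step with the separated loss functions of Eqs. (32)–(35), which is the scheme adopted in this work, followed by the soft target-network update of Eq. (42) introduced in Section 4.2.2 below. The dictionaries `nets`, `target_nets`, and `optims` (e.g., plain SGD with learning rate $\eta_\psi$, matching Eq. (35)) and the minibatch layout are hypothetical.

```python
import torch

def gdhp_training_step(nets, target_nets, optims, batch, betas, tau=0.05):
    """One update with the separate losses of Eqs. (32)-(35), then Eq. (42).

    `batch` holds stacked minibatch tensors X, X_T and pre-computed
    (detached) targets tau_V, tau_lam, tau_pi, tau_V_T, tau_lam_T.
    """
    V, lam, pi = nets["V"], nets["lam"], nets["pi"]
    bV1, bV2, bl1, bl2 = betas

    # Costate-consistency residual of Eq. (29); create_graph lets its
    # gradient flow back into both the value and costate networks.
    X = batch["X"].detach().requires_grad_(True)
    dV_dX = torch.autograd.grad(V(X).sum(), X, create_graph=True)[0]
    e_Vlam = dV_dX - lam(X)

    L_V = ((V(batch["X"]) - batch["tau_V"]) ** 2).sum() \
        + bV1 * (e_Vlam ** 2).sum() \
        + bV2 * ((V(batch["X_T"]) - batch["tau_V_T"]) ** 2).sum()        # Eq. (32)
    L_lam = ((lam(batch["X"]) - batch["tau_lam"]) ** 2).sum() \
        + bl1 * (e_Vlam ** 2).sum() \
        + bl2 * ((lam(batch["X_T"]) - batch["tau_lam_T"]) ** 2).sum()    # Eq. (33)
    L_pi = ((pi(batch["X"]) - batch["tau_pi"]) ** 2).sum()               # Eq. (34)

    # Step each network on the gradient of its own loss (Eq. (35)); grads
    # accidentally accumulated in the other networks are cleared by their
    # own zero_grad call before their backward pass.
    for loss, key in ((L_V, "V"), (L_lam, "lam"), (L_pi, "pi")):
        optims[key].zero_grad()
        loss.backward(retain_graph=True)
        optims[key].step()

    # Soft update of the target networks, Eq. (42)
    with torch.no_grad():
        for key in nets:
            for w, w_t in zip(nets[key].parameters(),
                              target_nets[key].parameters()):
                w_t.mul_(1.0 - tau).add_(tau * w)
```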

We tested both structures and concluded that the separation yields a better result for our examples. The reason a single shared network with the single objective can be problematic is that it is harder to choose the weighting factors $\beta_1$, $\beta_2$, and $\beta_3$ than those in our method. This is because the weights in a neural network should be distributed within a similar scale for numerical stability, and it is difficult to know the proper scales of the value and costate functions before the experiments. Moreover, in the DNN case, determining the number of layers to be shared creates an additional difficulty.

4.2. Stable learning of DNNs

Employing the gradient descent based optimization in its bare form can lead to problems, since DNNs with a large number of

weight parameters are vulnerable to overfitting and to the intrinsically unstable behavior of bootstrapping methods. State-of-the-art deep RL algorithms alleviate these issues with techniques such as the replay buffer and target networks [27,32,33]. In addition to those techniques, we propose pre-training with known policies and a higher-order discretization to prevent overfitting and to accelerate training.

4.2.1. Replay buffer

DNNs have a significantly larger number of weights compared to other function approximators, which can cause an overfitting problem. Various regularization techniques such as minibatch training, weight regularization, and the drop-out method are utilized to avoid the overfitting problem and reduce the generalization error [20]. In particular, in minibatch training, a batch dataset is divided into different subsets (i.e., $\mathcal{D} = \bigcup_{i=1}^{m}\mathcal{D}_i$) and the gradient descent method is applied to each $\mathcal{D}_i$. This method is a compromise between batch training and single-data training and, in particular, is more robust than single-data training. The gradient learning method using minibatches is the well-known stochastic gradient descent (SGD) method and is justified by the Robbins–Monro stochastic approximation method [5]. The replay buffer is a specific batch RL method which stores simulated data [41]. An illustration is provided in Fig. 1. At each iteration, minibatches of $\mathcal{D}$ and $\mathcal{D}_T$ are prepared independently by uniform sampling from the replay buffer to estimate the gradients in Eq. (35). The rationale for constructing the minibatch data by uniform sampling of the replay buffer is to reduce the correlations between RL data sets and make them independently and identically distributed (i.i.d.). First, contrary to supervised learning where a fixed training data set is given, a time-varying dataset is received in RL. By uniformly sampling from the replay buffer, overfitting to a recent data set is prevented, and thus the temporal correlation is reduced. Second, there also exist correlations within a data set, because it contains state trajectories generated by a control policy that maps a state to an action. Decorrelated minibatch data satisfies the i.i.d. condition, which is a crucial assumption made on machine learning training data sets [33]. Note that the size of $\mathcal{D}$ should be calibrated by observing the trade-off between stability and training speed.

4.2.2. Target network

GDHP uses bootstrapping to estimate the value and costate functions. When computing the residuals in the value and costate networks (Eqs. (26) and (27)), the weights at time steps k (current) and k + 1 (target) are both used. In the training process, as the values of the weights for the value, costate, and policy networks change to match the bootstrapped target values, the target values also vary concurrently. For example, the gradient of $\|\hat{V}(X_D) - \tau_V\|_2^2$, the first term of Eq. (32), is



$2\left(\frac{\partial\,\mathrm{vec}\!\left(\hat{V}(X_D) - \hat{V}(X_{D^+})\right)}{\partial\,\mathrm{vec}(\hat{W}_{V,l})}\right)^{\!\top}\left(\hat{V}(X_D) - R_D - \hat{V}(X_{D^+})\right)$  (37)

where $R_D$ denotes the stacked vector $[r_{k_1}, \ldots, r_{k_{|\mathcal{D}|}}]$. Since $X_D$ and $X_{D^+}$ both consist of noisy measurements, the gradient in Eq. (37) has high variance due to the arithmetic operations on stochastic terms, which can cause instability. The target network method alleviates the side-effects of the bootstrapping method [48] by evaluating the target values $\tau_V$, $\tau_\lambda$, and $\tau_\pi$ from separate networks called target networks. The weights are set differently in the current and target networks while maintaining the same network structures. Denoting $\hat{\psi}'$ as the target networks, the modified targets for Eqs. (26)–(28) are written as

$\tau'_{V,k} = r_k + \hat{V}'(x_{k+1})$  (38)


Fig. 1. Illustration of replay buffer. The data structure of the replay buffer takes first-in-first-out (FIFO) form, while the total size of the replay buffer is kept the same. Minibatch D for training is randomly sampled from the replay buffer.
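Relating to Fig. 1 and Section 4.2.1, the following is a minimal sketch (not the authors' code) of a FIFO replay buffer with uniform minibatch sampling; the buffer sizes and the 30 + 2 batch composition mirror the values reported in Section 5.1.1 only as an illustration.

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO buffer; uniform sampling decorrelates the minibatch (Section 4.2.1)."""
    def __init__(self, maxlen):
        self.data = deque(maxlen=maxlen)     # oldest tuples are discarded first

    def store(self, sample):
        self.data.append(sample)             # e.g. (x, du, r, x_next) or (x_T, phi_T)

    def sample(self, batch_size):
        return random.sample(self.data, min(batch_size, len(self.data)))

buffer = ReplayBuffer(10 * 200)              # 10 x (episode length of 200 steps)
terminal_buffer = ReplayBuffer(10)
# minibatch = buffer.sample(30) + terminal_buffer.sample(2)
```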



$\tau'_{\lambda,k} = 2\left(\frac{\partial h(x_k)}{\partial x_k}\right)^{\!\top} Q\left(h(x_k) - \rho_k\right) + \left(\frac{\partial x_{k+1}}{\partial x_k}\right)^{\!\top}\hat{\lambda}'(x_{k+1})$  (39)

$\tau'_{\pi,k} = -\frac{1}{2}R^{-1}g^{\top}(x_k)\,\hat{\lambda}'(x_{k+1})$  (40)

In this case, the gradient in Eq. (37) becomes

$2\left(\frac{\partial\,\mathrm{vec}\!\left(\hat{V}(X_D)\right)}{\partial\,\mathrm{vec}(\hat{W}_{V,l})}\right)^{\!\top}\left(\hat{V}(X_D) - R_D - \hat{V}'(X_{D^+})\right)$  (41)

since the dependency of $\hat{W}_{V,l}$ on the target value function $\hat{V}'(X_{D^+})$ is removed; hence the variance of the gradient estimate becomes smaller. Let $\hat{W}'_{\psi,l}$ denote the weights of the lth layer of the target networks. Then the learning rule for the target networks is expressed as a weighted sum of the target and current networks

$\hat{W}'^{+}_{\psi,l} = \tau\,\hat{W}^{+}_{\psi,l} + (1-\tau)\,\hat{W}'_{\psi,l}$  (42)

where $\hat{W}'^{+}_{\psi,l}$ is the updated target network weights and $\tau$ is the weighting factor which indicates how soft the update should be [33]. The rule intentionally slows down the learning by slowing the changes in the target values, which greatly improves the stability of the learning.

4.2.3. Pre-training

It is shown in Eqs. (32)–(34) that the value, costate, and policy networks are all functions of each other. This cyclic relationship makes the GDHP method prone to divergence. Consequently, choosing a good initialization of the policy network and adding exploration are important for stable learning [19]. Existing control methods such as the linear quadratic regulator (LQR) or model predictive control (MPC) provide good suboptimal starting policies which can be used for initialization [43]. In the ensuing examples, the LQR with integrator (LQI) is implemented as the initial control policy, denoted $l(x_k)$. Furthermore, a noise sequence $\mathcal{N}$ should be added to the controller's output (i.e., $u_k = u_k^* + \mathcal{N}(k)$) in order to explore the neighborhoods of the suboptimal trajectories. Choosing a proper noise sequence is essential in learning-based optimal control in order to satisfy the persistent excitation condition [19] or the exploration-exploitation trade-off [45]. We used an Ornstein–Uhlenbeck process as $\mathcal{N}$, generated from $dx_t = -\theta x_t\,dt + \sigma\,dW_t$ with $\theta = 0.15$ and $\sigma = 0.3$, where $W_t$ denotes the Wiener process [27].

4.2.4. Discretization

Obtaining the discrete-time state space model in Eq. (43) involves a numerical integration method. To reduce the numerical error, rather than the first-order Euler method (i.e., zero-order hold (ZOH)), a higher-order discretization method is implemented. We implemented the Dormand–Prince (DOPRI5) method, which is the most widely used member of the Runge–Kutta family [13,42]. Given the system dynamics $\dot{x} = f(t, x, u)$ with the initial condition $x(t_0) = x_0$, the explicit Runge–Kutta method is given by

$x_{k+1} = x_k + h\sum_{i=1}^{d} b_i p_i$  (43)

where

$p_1 = f(t_k, x_k, u_k)$
$p_2 = f\!\left(t_k + c_2 h,\ x_k + h\,a_{21}p_1,\ u_k\right)$
$p_3 = f\!\left(t_k + c_3 h,\ x_k + h(a_{31}p_1 + a_{32}p_2),\ u_k\right)$
$\ \ \vdots$
$p_j = f\!\left(t_k + c_j h,\ x_k + h\sum_{i=1}^{j-1} a_{ji}p_i,\ u_k\right) = f(t_k^{(j)}, x_k^{(j)}, u_k)$
$\ \ \vdots$
$p_d = f\!\left(t_k + c_d h,\ x_k + h\sum_{i=1}^{d-1} a_{di}p_i,\ u_k\right)$  (44)

Here $p_j$ is the jth interpolated function value at the intermediate time and state points $t_k^{(j)}$ and $x_k^{(j)}$, respectively, h denotes the discrete time step, and d denotes the number of discretization stages. The coefficients $a_{ij}$, $b_i$, and $c_i$ are given by the Butcher tableau. The DOPRI5 method uses d = 5 discretization stages. Then the derivative of the successor state with respect to the current state can be evaluated by the following explicit formula:

$\frac{\partial x_{k+1}}{\partial x_k} = I + h\sum_{i=1}^{d} b_i\,\frac{\partial p_i}{\partial x_k} = I + h\sum_{i=1}^{d} b_i\left(\frac{\partial f(t_k^{(i)}, x_k^{(i)}, u_k)}{\partial x_k^{(i)}}\right)^{\!\top}\frac{\partial x_k^{(i)}}{\partial x_k}$  (45)


where

$\frac{\partial x_k^{(j)}}{\partial x_k} = I + h\sum_{i=1}^{j-1} a_{ji}\left(\frac{\partial f(t_k^{(i)}, x_k^{(i)}, u_k)}{\partial x_k^{(i)}}\right)^{\!\top}\frac{\partial x_k^{(i)}}{\partial x_k}$  (46)

The control derivative is computed in the same way:

$\frac{\partial x_{k+1}}{\partial u_k} = h\sum_{i=1}^{d} b_i\,\frac{\partial p_i}{\partial u_k} = h\sum_{i=1}^{d} b_i\,\frac{\partial f(t_k^{(i)}, x_k^{(i)}, u_k)}{\partial u_k}$  (47)
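A sketch (NumPy; not the authors' code) of an explicit Runge–Kutta step together with the sensitivity recursions of Eqs. (45)–(47). For brevity the classical fourth-order tableau is hard-coded instead of DOPRI5, the Jacobians are written in the standard layout, and `dfdx`/`dfdu` are user-supplied derivatives of the continuous-time dynamics.

```python
import numpy as np

# Classical RK4 Butcher tableau (the paper uses DOPRI5; RK4 keeps the sketch short)
A_RK = np.array([[0.0, 0.0, 0.0, 0.0],
                 [0.5, 0.0, 0.0, 0.0],
                 [0.0, 0.5, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0]])
B_RK = np.array([1/6, 1/3, 1/3, 1/6])
C_RK = np.array([0.0, 0.5, 0.5, 1.0])

def rk_step_with_sensitivities(f, dfdx, dfdu, t, x, u, hstep):
    """One explicit RK step (Eqs. (43)-(44)) plus the sensitivities of Eqs. (45)-(47)."""
    n, m, d = len(x), len(u), len(B_RK)
    p = np.zeros((d, n))          # stage derivatives p_i
    Jx = np.zeros((d, n, n))      # dp_i / dx_k
    Ju = np.zeros((d, n, m))      # dp_i / du_k
    for i in range(d):
        xi, dxi_dx, dxi_du = x.copy(), np.eye(n), np.zeros((n, m))
        for j in range(i):        # intermediate state x_k^(i) and its sensitivities
            xi = xi + hstep * A_RK[i, j] * p[j]
            dxi_dx = dxi_dx + hstep * A_RK[i, j] * Jx[j]          # Eq. (46)
            dxi_du = dxi_du + hstep * A_RK[i, j] * Ju[j]
        ti = t + C_RK[i] * hstep
        p[i] = f(ti, xi, u)
        Jx[i] = dfdx(ti, xi, u) @ dxi_dx                          # chain rule through x_k^(i)
        Ju[i] = dfdx(ti, xi, u) @ dxi_du + dfdu(ti, xi, u)
    x_next = x + hstep * B_RK @ p                                 # Eq. (43)
    dxnext_dx = np.eye(n) + hstep * np.tensordot(B_RK, Jx, axes=1)  # Eq. (45)
    dxnext_du = hstep * np.tensordot(B_RK, Ju, axes=1)              # Eq. (47)
    return x_next, dxnext_dx, dxnext_du
```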

4.3. Overall algorithm

The overall algorithm is summarized in Algorithm 1. Structures and training schemes of the value, costate, and policy networks with the target networks incorporated are illustrated in Fig. 2. The overall schematic diagram is illustrated in Fig. 3.

Algorithm 1 GDHP with DNNs
1: procedure GDHP with DNNs
2: Initialize the critic (value), costate, and policy networks V̂, λ̂, and π̂
3: Initialize the critic, costate, and policy target networks V̂′, λ̂′, and π̂′
4: Initialize the replay buffer D
5: Initialize the terminal replay buffer D_T
6: Initialize the noise process N
7: Discretize and augment the system dynamics as in Eq. (3) and obtain the derivatives by Eqs. (43), (45), and (47)
8: for i = 1 to E do
9:   for k = 1 to T do
10:    if i ≤ (number of initial-policy episodes) then u_k = l(x_k) + N(k)
11:    else u_k = π̂^(i)(x_k) + N(k)
12:    Get the next augmented state x_{k+1} and stage-wise cost r_k with control input u_k by Eqs. (3) and (6)
13:    if k == T then store (x_T, φ_T) in D_T; else store (x_k, u_k, r_k, x_{k+1}) in D
14:    Uniformly sample a minibatch D̃ from D ∪ D_T
15:    Compute the residual (loss) functions L_V(D̃), L_λ(D̃), and L_π(D̃) by Eqs. (32)–(34)
16:    Update the weights Ŵ⁺_{V,l}, Ŵ⁺_{λ,l}, and Ŵ⁺_{π,l} by Eq. (35), and the target weights Ŵ′⁺_{V,l}, Ŵ′⁺_{λ,l}, and Ŵ′⁺_{π,l} by Eq. (42)

5. Results and discussions

In this section, numerical results of applying the GDHP algorithm with DNNs on simulated systems are presented. The main objective of this section is to show the benefit of using DNNs over SNNs in cases where the state-space dimension is high and the initial states as well as the state transitions are subject to stochastic uncertainties.

5.1. Example 1: Semi-batch reactor

A benchmark semi-batch reactor presented in [12] is used. In the reactor, the following parallel second-order reactions occur:

$A + B \xrightarrow{\ k_1\ } C, \qquad B + C \xrightarrow{\ k_2\ } D$  (48)

Table 1. Minimum and maximum values of the state, MVs, and CVs in the operating region.

Variable          Minimum    Maximum
C_A (mol/L)       0          1
C_B (mol/L)       0          1
C_C (mol/L)       0          1
T (K)             293.15     323.15
V (L)             50         150
Q_f (L/min)       0          1.5
T_K (K)           293.15     318.15
V·C_C (mol)       0          90

The system equations are derived from mass and energy balances as follows:

$\dot{C}_A = -\frac{Q_f}{V}C_A - k_1(T)C_A C_B$
$\dot{C}_B = \frac{Q_f}{V}(C_{Bf} - C_B) - k_1(T)C_A C_B - k_2(T)C_B C_C$
$\dot{C}_C = -\frac{Q_f}{V}C_C + k_1(T)C_A C_B - k_2(T)C_B C_C$
$\dot{T} = \frac{Q_f}{V}(T_f - T) - \frac{1}{\rho C_p}\big(\Delta H_1 k_1(T)C_A C_B + \Delta H_2 k_2(T)C_B C_C\big) - \frac{UA}{\rho C_p V}(T - T_K)$
$\dot{V} = Q_{fr}, \qquad Q_{fr} = \begin{cases} 0, & \text{if } t < 31\ \text{min} \\ Q_f, & \text{otherwise} \end{cases}$  (49)

where $C_A$, $C_B$, and $C_C$ denote the molar concentrations of substances A, B, and C, and T and V are the temperature and volume of the reactor, respectively. The kinetic parameters are functions of the temperature following the Arrhenius law

$k_i(T) = k_{i0}\exp\!\left(-\frac{E_i}{RT}\right), \qquad i \in \{1, 2\}$  (50)

The initial conditions of the states are $C_{A0} = 1$ mol/L, $C_{B0} = 0$ mol/L, $C_{C0} = 0$ mol/L, $T_0 = 298$ K, and $V_0 = 50$ L. The manipulated variables (MVs) are the flow rate of substance B ($Q_f(t)$) and the jacket temperature ($T_K(t)$). The controlled variables (CVs) are the reactor temperature ($T(t)$) and the product yield ($V(t)C_C(t)$). Other physical parameters are specified as: $T_f = 308$ K, $C_{Bf} = 0.9$ mol/L, $UA/\rho C_p = 3.75$ L/min, $k_{10} = 5.0969 \times 10^{16}$ L/mol min, $k_{20} = 2.2391 \times 10^{17}$ L/mol min, $E_1/R = 12305$ K, $E_2/R = 13450$ K, $\Delta H_1/\rho C_p = -28.5$ K L/mol, and $\Delta H_2/\rho C_p = -20.5$ K L/mol. The compact subsets of the state, MVs, and CVs are described in Table 1. The reactor is operated in the following scheme: while the reactant A is initially charged, the temperature is maintained at the room temperature of 298.15 K. After the start-up process (until t = 30 min), the second reactant B is fed and the reaction occurs. The temperature is regulated to 308.15 K until t = 80 min and gradually cooled down to 303.15 K until the batch terminal time. The temperature profiles during the start-up and the cool-down are relatively unimportant, so reference values are not assigned there. Finally, the product C is discharged with the target yield (42 mol). The objective is formally represented as

$J = \|V(t_T)C_C(t_T) - \varphi\|_H^2 + \sum_{i=0}^{T-1}\left(\|T(t_i) - \rho(t_i)\|_Q^2 + \|\Delta u_i\|_R^2\right)$  (51)

where the temperature reference trajectory $\rho(t)$ and the target yield $\varphi$ are specified as

$\rho(t) = \begin{cases} 298.15\ \text{K}, & 0 \le t < 3 \\ 308.15\ \text{K}, & 30 \le t < 80 \\ 303.15\ \text{K}, & 98 \le t \\ \text{not assigned}, & \text{otherwise} \end{cases}, \qquad \varphi = 42\ \text{mol}$
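For reference, a sketch (NumPy; not the authors' code) of the semi-batch reactor right-hand side of Eq. (49) with the physical parameters listed above; the state ordering and function names are hypothetical.

```python
import numpy as np

# Parameters from Section 5.1 (units as given in the text)
k10, k20 = 5.0969e16, 2.2391e17          # L/(mol min)
E1_R, E2_R = 12305.0, 13450.0            # K
dH1_rcp, dH2_rcp = -28.5, -20.5          # K L/mol (Delta H_i / rho C_p)
UA_rcp, Tf, CBf = 3.75, 308.0, 0.9       # L/min, K, mol/L

def reactor_rhs(t, state, u):
    """d/dt of [CA, CB, CC, T, V] for the semi-batch reactor, Eq. (49)."""
    CA, CB, CC, T, V = state
    Qf, TK = u                            # manipulated variables
    k1 = k10 * np.exp(-E1_R / T)          # Arrhenius law, Eq. (50)
    k2 = k20 * np.exp(-E2_R / T)
    r1, r2 = k1 * CA * CB, k2 * CB * CC
    dCA = -Qf / V * CA - r1
    dCB = Qf / V * (CBf - CB) - r1 - r2
    dCC = -Qf / V * CC + r1 - r2
    dT = Qf / V * (Tf - T) - (dH1_rcp * r1 + dH2_rcp * r2) - UA_rcp / V * (T - TK)
    dV = Qf if t >= 31.0 else 0.0         # feed only after the start-up period
    return np.array([dCA, dCB, dCC, dT, dV])
```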


Fig. 2. Structures and training schemes of the value, costate, and policy networks. Original networks are trained with the signals from the target networks, while the target networks weights are obtained by the soft update rule.

We used the weighting matrices Q = 0.1I if 0 ≤ t < 3, 30 ≤ t < 80, or 98 ≤ t, and Q = 0 otherwise, together with R = 0.01I and H = I. The sampling time was set to 0.5 min and the horizon was 100 min. The state space is normalized into $[-1, 1]$ by applying the linear transformation $X = \dfrac{2x - x_M - x_m}{x_M - x_m}$, where X is the normalized state and control value and $x_M$ and $x_m$ are the maximum and minimum values of the corresponding variables, respectively. The bounds for the variables are kept constant during the learning process.

5.1.1. Specification of hyperparameters

Five-layer DNNs were used for the value and costate networks and four-layer DNNs were used for the policy network. The numbers of nodes in the value, costate, and policy networks were (S + A + 1, 50, 50, 30, 1), (S + A + 1, 50, 50, 50, S + A + 1), and (S + A + 1, 50, 50, 30, A), respectively, referring to the network structure of [27]. Leaky ReLU activation functions were used for the hidden layers. The neural network weights were initialized with the technique referred to as He initialization [16]. The size of the replay buffer |D| was set to 10 times the length of a single episode, and the terminal replay buffer size |D_T| was 10. At each sampling time k in episode i, a minibatch $\mathcal{D}_{k,i}$ consists of 30 samples from D and 2 samples from D_T, i.e., 32 samples in total. The learning rates $\eta_V$, $\eta_\lambda$, and $\eta_\pi$ were all $10^{-4}$. The loss function weighting factors were set to $\beta_{V,1} = \beta_{\lambda,1} = 10^{-2}$ and $\beta_{V,2} = \beta_{\lambda,2} = 1$. The target network weighting factor $\tau$ was 0.05. We measured the episode loss functions $L_\psi^{(i)} = \sum_{k=0}^{T-1} L_\psi(\mathcal{D}_{k,i})$ to determine the convergence criterion $\big|L_\psi^{(i+1)} - L_\psi^{(i)}\big|/L_\psi^{(i)} < 10^{-5}$, $\psi \in \{V, \lambda\}$. In order to have an efficient initialization of the DNNs, linear quadratic integral control (LQI), which is a variant of LQR for the tracking problem, was used as the initial policy $l(x_k)$. At every time step, the system dynamics were linearized by computing the discretized Jacobian with respect to the current state $x_k$, and the LQI gain was obtained by solving the steady-state Riccati equation. The LQI controller was employed until episode 25 to update just the value and costate networks. After that, the policy function resulting from the iteration with the data from the previous episode was implemented until convergence.

Fig. 3. A schematic diagram of the GDHP algorithm.

5.1.2. Tracking results in the deterministic system case

The tracking results for the deterministic system case are discussed in this section. The trajectories of the states and CVs under control policies at different training stages of the algorithm are presented in Figs. 4 and 5. In particular, the plotted trajectories are for the 20th, 75th, 120th, and 200th episodes, each of which is operated under a different condition, i.e., the LQI policy, intermediate GDHP policies yet to converge, and the final converged policy, respectively. The trajectories of the CVs in Fig. 5 indicate that the final GDHP policy performs quite well in both tracking the reactor temperature and satisfying the boundary condition for the product yield, while the LQI policy (employed through episode 25) fails to do so. Fig. 6 shows the log values of the episode losses for the value and costate networks and the log values of the episode cost. The episode losses continue to decrease throughout the scenario, except for taking high values right after the control policy is switched from the initial LQI policy to the GDHP policy. Since the episode loss functions $L_V$ and $L_\lambda$ measure the deviations from the optimality equations Eqs. (20)–(22), the costate function definition Eq. (10), and the boundary conditions Eq. (15), we can conclude that all deviations converge to zero. Fig. 6 also indicates that the converged GDHP policy attains a smaller episode cost than the initial LQI policy.

5.1.3. Tracking results under initial state uncertainty

The model-based GDHP algorithm should be able to give a control policy that works well for multiple initial states. Let us


Fig. 4. State trajectories of GDHP at episodes 20, 75, 120, and 200.

Fig. 5. Trajectories of the CVs of GDHP at episodes 20, 75, 120, and 200. Circle mark for the first graph denotes the reference trajectory of the reactor temperature, and the second graph denotes the boundary condition for the product yield.

assume that the initial state $p(x_0)$ follows a certain probability distribution and that the initial state realized in the off-line training process is sampled from this distribution. We assumed that the initial state is subject to a uniform distribution whose support is 0.3 times the minimum and maximum values of the operating region (Table 1). We compare the tracking results under three different neural network structures: DNNs and two different SNNs. The former SNN structures for the value, costate, and policy networks were (8, 100, 1), (8, 100, 8), and (8, 100, 2), whose numbers of weights are smaller than those of the DNNs. Those of the latter SNNs are (8, 500, 1), (8, 500, 8), and (8, 500, 2), which are larger than those of the DNNs. These three networks were trained with data generated under two different conditions: one with the nominal data and the other with the uniform initial state distribution. After 200 episodes of training, the resulting six policies were tested for

scenarios with twenty different initial state conditions which were randomly generated from a uniform distribution with significantly larger support (i.e., 0.7 times the minimum and maximum values of the operating region) than that used in the training process. Fig. 7 shows the mean episode costs for the six different policies: the DNNs and two SNNs trained with randomly sampled initial states, and the DNNs and two SNNs trained with a single fixed initial state. In the cases where randomness is considered in the training phase, all three networks achieved good control performance. This shows that obtaining the policy map with respect to the initial state distribution does not require complex models such as a multi-layer structure. However, when the uncertainty is omitted in training, both SNNs result in failure, regardless of the number of nodes. The generalizability of the DNNs leads to a better feedback policy even when trained without considering the uncertainty.


Fig. 6. Log values of episode loss functions LV and Lλ and episode cost. Initial LQI controller was implemented before episode 25 and GDHP policy was applied after episode 25.

Fig. 7. Mean episode costs for episodes with 20 randomly sampled initial states under policies trained with DNNs and two SNNs with random initial states, and DNNs and two SNNs with a deterministic initial state. 'SNNs 500' denotes the SNNs having 500 nodes and 'w/o RT' denotes that the policy is trained with nominal data.

5.1.4. Tracking results under state noise

This section addresses the tracking control problem in the presence of state noise. The set of neural network structures examined is the same as in the previous section. These three networks were trained with data generated under two different conditions: one with the nominal data and the other with Gaussian state noise of $\mathcal{N}(0, 0.005)$. In the test phase, the six scenarios were tested in four cases with the noise variance ranging from 0.002 to 0.008. The performance of each scenario was measured by the mean value of 20 repeated episodes. We assume that the system starts from a fixed initial state and that uncertainties arise from the state noise only. The mean episode costs for the six test cases are illustrated in Fig. 8. Regarding the control policies trained with data containing state noise, 'DNNs' and 'SNNs 500' demonstrate similar performance, while 'SNNs' fails to achieve a good result. Unlike the experimental results for the initial state distribution, a large model complexity is required to express a robust control policy against the state noise, since the state trajectories show noisy behavior throughout the entire horizon. On the other hand, we were able to obtain a successful control policy only when DNNs were exploited in the 'w/o RT' cases. In summary, if uncertainties with respect to the initial state and the state noise are modeled and taken into account during the training process, robustness is acquired by using a complex function approximator, regardless of whether a multi-layered structure is used. However, the fact that it is possible to maintain robustness even in the case where the uncertainty is not considered during training reveals the benefit of DNNs: when the on-line uncertainty pushes the system outside the region experienced off-line, the multi-layer structure results in improved generalizability.

5.2. Example 2: Diffusion-Convection-Reaction (DCR) process

Performance of the GDHP algorithm with DNNs is examined in the context of a system with high-dimensional state and control space, such as a spatially distributed parameter system expressed by PDEs. We use the example of a typical deterministic diffusion-convection-reaction (DCR) process in one-dimensional rod [3]. Consider an exothermic reaction A → B, and the temperature is to be controlled by a distributed actuator along the same geometry. The dimensionless temperature profile denoted as x(z, t) is described by the following PDE:

$\frac{\partial x}{\partial t} = (1-\alpha_{dc})\frac{\partial}{\partial z}\!\left(k(z)\frac{\partial x}{\partial z}\right) + \alpha_{dc}\frac{\partial x}{\partial z} + \beta_T(z)\left(e^{-\gamma/(1+x)} - e^{-\gamma}\right) + \beta_U\big(u(z,t) - x\big)$  (52)

with the Dirichlet boundary conditions and the initial condition

$x(0,t) = 0, \qquad x(\pi,t) = 0, \qquad x(z,0) = 0.5$  (53)

where z is the one-dimensional spatial coordinate and u(z, t) denotes the control input profile. The terms in the PDE represent the diffusion, convection, reaction, and control effects, respectively. $\alpha_{dc}$ is the weight between the diffusion and convection terms; $k(z)$ expresses the diffusion coefficient; $\beta_T(z)$ is the heat of reaction along the coordinate; $\gamma$ is the activation energy; and $\beta_U$ is the heat transfer coefficient. All parameters are dimensionless.


Fig. 8. Mean episode costs for 20 scenarios under policies trained with DNNs and two SNNs with state noises, and DNNs and two SNNs with the deterministic system. Mean episode costs were obtained based on the results from the simulation of 20 episodes. ‘SNNs 500’ states that the SNNs having 500 number of nodes and ‘w/o RT’ denotes that it is trained with nominal data.

Fig. 9. State trajectories of the DCR process under the GDHP algorithm at episodes 10, 30, 35, and 200.

Specifically, $\alpha_{dc} = 0.5$, $k(z) = 0.5 + 0.7/(1+z)$, $\beta_T(z) = 16(\cos(z) + 1)$, $\gamma = 4$, and $\beta_U = 1$. The objective is to control the state $\bar{x}(z,t)$, $z \in [0, \pi]$, to the zero setpoint. The objective is written as

J=

π 0

x¯ (z, tT )2H +

T −1

x¯ (z, ti )2Q + u(z, ti )2R

tion time to solve the Riccati equation for the system dimensions S = 40, A = 40, and O = 40, which needed to construct the LQI, the following manually designed policy was used as the initial policy instead [31]:

 dz

(54)

i=0

The positive definite control weighting matrices Q and R were set to be identity matrices with appropriate dimensions. This system has been examined in [31] where the HDP algorithm was applied to a reduced-order model, but the control dimension was limited to one. For numerical simulations and Jacobian calculation of the PDE, we applied the finite difference discretization for the spatial derivative and used the method-of-line for the time derivative. Specifically, the central difference method was used for the second-order derivative in the diffusion term, and the upwind difference scheme (UDS) for the first-order derivative in the convection term. The spatial coordinate z was discretized into 40 equidistant intervals, and the sampling time and the horizon were chosen as 0.005 s and 4 s, respectively. Since it took too much computa-

 u(z, t ) =

−2, 0,

0.2π ≤ z ≤ 0.4π otherwise
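A sketch (NumPy; not the authors' code) of the method-of-lines semi-discretization described above: central differences for the diffusion term and a first-order upwind difference for the convection term of Eq. (52), with zero Dirichlet boundary values. The exact node indexing is illustrative; 40 interior nodes are used to match the stated state dimension S = 40.

```python
import numpy as np

n = 40                                       # number of interior nodes (S = 40)
z = np.linspace(0.0, np.pi, n + 2)           # grid including both Dirichlet ends
dz = z[1] - z[0]
alpha, gamma, betaU = 0.5, 4.0, 1.0
k_z = 0.5 + 0.7 / (1.0 + z)                  # diffusion coefficient k(z)
betaT = 16.0 * (np.cos(z) + 1.0)             # heat-of-reaction profile beta_T(z)

def dcr_rhs(x, u):
    """Semi-discretized right-hand side of Eq. (52) at the interior nodes."""
    xb = np.concatenate(([0.0], x, [0.0]))   # Dirichlet BCs x(0,t) = x(pi,t) = 0
    i = np.arange(1, n + 1)
    k_ph = 0.5 * (k_z[i] + k_z[i + 1])       # k at the half nodes
    k_mh = 0.5 * (k_z[i] + k_z[i - 1])
    diff = (k_ph * (xb[i + 1] - xb[i]) - k_mh * (xb[i] - xb[i - 1])) / dz ** 2
    conv = (xb[i] - xb[i - 1]) / dz          # first-order upwind convection term
    react = betaT[i] * (np.exp(-gamma / (1.0 + xb[i])) - np.exp(-gamma))
    return (1.0 - alpha) * diff + alpha * conv + react + betaU * (u - xb[i])

x0 = 0.5 * np.ones(n)                        # initial condition x(z, 0) = 0.5
u0 = np.where((z[1:-1] >= 0.2 * np.pi) & (z[1:-1] <= 0.4 * np.pi), -2.0, 0.0)
```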

The GDHP algorithm was applied for 200 episodes with the same algorithmic hyperparameters as in the previous example. The state and control trajectories at episodes 10, 30, 35, and 200 are shown in Figs. 9 and 10. The state profile shows a suboptimal trajectory when the GDHP policy is first implemented at episode 25, but it is quickly stabilized to the zero setpoint by episode 35. According to Fig. 11, the episode loss functions of the value and costate networks continue to decrease up to episode 200. It also indicates that the converged GDHP policy at episode 200 attains a lower episode cost than the initial policy. The GDHP algorithm with DNNs is thus shown to be effective in handling systems with a high-dimensional state space in this example.


Fig. 10. Control trajectories of the DCR process under the GDHP algorithm at episodes 10, 30, 35, and 200.

Fig. 11. Log values of episode loss functions LV and Lλ and episode cost. Initial LQI controller was implemented before episode 25 and the GDHP policy was applied after episode 25.

6. Concluding remarks

In this paper, the GDHP algorithm is formulated for the FHOC problem. To ensure a good representation of the involved functions in high-dimensional state space systems, we adopted DNNs. Various techniques to ensure stable learning, e.g., the replay buffer, the target networks, pre-training, and higher-order discretization methods, were utilized. The use of DNNs was shown to give improved robustness with respect to various stochastic uncertainties, such as those in the initial state and state transitions, especially in the context of a high-dimensional dissipative PDE system. Although the feedback policy approximated with DNNs showed improved robustness in the simulated examples, additional analysis is needed to confirm this. In particular, in cases where the uncertainties in the initial state and state transition can be quantified explicitly, the stochastic version of the HJB equation and the parametric dependency of the HJB solution should be considered. Then the conventional control-loop structure that includes state estimation and/or on-line identification can be coupled with the approach. The approach can be further extended to systems with a non-quadratic cost structure. Many batch problems consider the maximization of the production rate subject to various constraints. An

explicit form of the policy function cannot be obtained in such cases, so an additional formulation is required.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Jong Woo Kim: Conceptualization, Methodology, Software, Writing - original draft. Byung Jun Park: Writing - original draft. Haeun Yoo: Validation, Visualization. Tae Hoon Oh: Validation. Jay H. Lee: Supervision, Writing - review & editing. Jong Min Lee: Supervision, Writing - review & editing.

Acknowledgments

This research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean gov-


ernment (MSIT) (NRF-2016R1A5A1009592) and the National Research Foundation of Korea (NRF) grant (NRF-2019M3F2A1072470).