Anton A. Kiss, Edwin Zondervan, Richard Lakerveld, Leyla Özkan (Eds.), Proceedings of the 29th European Symposium on Computer Aided Process Engineering, June 16th to 19th, 2019, Eindhoven, The Netherlands. © 2019 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/B978-0-12-818634-3.50154-5
Reinforcement Learning for Batch-to-Batch Bioprocess Optimisation

P. Petsagkourakis^a, I. Orson Sandoval^b, E. Bradford^c, D. Zhang^{a,d,*} and E. A. del Rio-Chanona^{d,*}

^a School of Chemical Engineering and Analytical Science, The University of Manchester, M13 9PL, UK
^b Instituto de Ciencias Nucleares, Universidad Nacional Autónoma de México, A.P. 70543, C.P. 04510 Ciudad de México, Mexico
^c Department of Engineering Cybernetics, Norwegian University of Science and Technology, Trondheim, Norway
^d Centre for Process Systems Engineering (CPSE), Department of Chemical Engineering, Imperial College London, UK

[email protected], [email protected]
Abstract
Bioprocesses have received great attention from the scientific community as an alternative for replacing fossil-based products with counterparts synthesised by microorganisms. However, bioprocesses are generally operated at unsteady-state conditions and are stochastic from a macro-scale perspective, making their optimisation a challenging task. Furthermore, as biological systems are highly complex, plant-model mismatch is usually present. To address these challenges, in this work we propose a reinforcement learning based online optimisation strategy. We first use reinforcement learning to learn an optimal policy given a preliminary process model: we compute diverse trajectories and feed them into a recurrent neural network, resulting in a policy network that takes the states as input and returns the next optimal control action as output. Through this procedure, we capture the behaviour of the biosystem as described by the preliminary model. Subsequently, we adopt this network as the initial policy for the "real" system (the plant) and apply a batch-to-batch reinforcement learning strategy to update the network's accuracy. The plant is represented by a more complex process model embedded with adequate stochasticity to account for the perturbations of a real dynamic bioprocess. We demonstrate the effectiveness and advantages of the proposed approach in a case study, computing the optimal policy within a realistic number of batch runs.
Keywords: Reinforcement Learning, Batch Process, Recurrent Neural Networks, Bioprocesses
1. Introduction
There has been global interest in using sustainable bio-production systems to produce a broad range of chemicals and substitute fossil-derived synthetic routes (Harun et al., 2018). Bioprocesses exploit microorganisms to synthesise platform chemicals and high-value products from different types of resources (Jing et al., 2018). Compared to a traditional chemical process, a biochemical process is highly complex due to the intricate relationships between metabolic reaction networks and culture fluid dynamics (del Rio-Chanona et al., 2018).
As a result, it is difficult to construct accurate dynamic models to simulate general large-scale biosystems, and plant-model mismatch is inevitable. Furthermore, bioprocess dynamics are often stochastic because the underlying metabolic pathways are sensitive to even mild changes in operating conditions (Zhang and Vassiliadis, 2015; Thierie, 2004). Therefore, developing control and optimisation strategies for bioprocesses remains an open challenge. Given these critical limitations of physical models, in this work we propose a data-driven approach to address this challenge. We must seek a strategy that can handle both the system's stochasticity and plant-model mismatch; we have therefore opted to use reinforcement learning, and more specifically policy gradients, for the reasons explained next.

Reinforcement learning (RL) addresses the problem of solving nonlinear and stochastic optimal control problems (Bertsekas, 2000). Two main branches have been established for solving dynamic optimisation problems via RL. The first is based on dynamic programming (DP) and hence termed approximate dynamic programming (ADP). DP relies on the Hamilton-Jacobi-Bellman equation (HJBE), whose solution becomes intractable even for small problems with nonlinear dynamics and continuous states and control actions. Hence, past research has relied on ADP techniques to find (approximate) solutions to this type of problem (Sutton and Barto, 2018). The second branch uses policy gradients, which directly obtain a policy by maximising a desired performance index. This approach is well suited to problems where both the state and control spaces are continuous, and we have therefore adopted it in this work. Policy gradient methods are further explained in Section 2.2. Finally, to address plant-model mismatch, we apply a data-driven approach: although there are knowledge-based modelling strategies such as iterative learning control (Moore et al., 2006), here we propose a fully data-driven method that can learn from the true dynamics of the system while incorporating previous knowledge.
2. Preliminaries
In this section we introduce important concepts for the proposed work.
2.1. Recurrent Neural Network
Recurrent neural networks (RNNs) (Rumelhart et al., 1986) are tailored to sequential data. RNNs produce an output at each time step and have recursive connections between hidden units. This allows them to keep a "memory" of previous data and makes them well suited to modelling time series; in essence, an RNN simulates a dynamic system for a given set of parameters. In this work, RNNs are applied to parameterise the stochastic policy.
2.2. Policy Gradient Methods
Policy gradient methods are a family of RL methods that do not require an explicit estimate of the value of state-action pairs. These methods rely on a parametrised policy function πθ(·) that returns an action a given a state of the system s and a set of intrinsic parameters θ. In the case of stochastic policies, the policy function returns the defining parameters of a probability distribution over possible actions, from which the
actions are sampled:
\[ a \sim \pi_\theta(a \mid s) = \pi(a \mid s, \theta) = p(a_t = a \mid s_t = s, \theta_t = \theta). \tag{1} \]
In this work, an RNN is used as the parametrised policy: it takes states and past controls as inputs and returns a mean and a variance from which a control is drawn. In this setting, the exploitation-exploration trade-off is represented explicitly by the variance of the underlying distribution of the policy. Deterministic policies may be approached as a limiting case in which the variance fades upon convergence. Let a sequence of H states and actions be called an episode τ; an episode is generated by following the current policy, τ ∼ p(τ|θ). The expected reward R̂(θ) may then be estimated over K sampled episodes:
\[ \hat{R}(\theta) = \mathbb{E}_\theta[R(\tau)] = \int_\tau p(\tau \mid \theta)\, R(\tau)\, d\tau \approx \frac{1}{K} \sum_{k=1}^{K} R(\tau^{(k)}). \tag{2} \]
The objective of policy gradient methods is to maximise the reward function R̂(θ) using gradient ascent techniques over a continuously differentiable policy. The evolution of the intrinsic parameters at each optimisation step m is given by
\[ \theta_{m+1} = \theta_m + \alpha_m \nabla_\theta \hat{R}\,\big|_{\theta = \theta_m}. \tag{3} \]
Differentiable policies guarantee smooth changes under sufficiently small variations of the parameters.
2.3. REINFORCE Algorithm
The REINFORCE algorithm (Williams, 1992) approximates the gradient of the policy so as to maximise the expected reward with respect to the parameters θ, without the need for a dynamic model of the process. For this, it is necessary to take advantage of the gradient of a logarithm and of a decomposition of the likelihood of an H-step episode τ under a policy πθ (Sutton and Barto, 2018; Peters and Schaal, 2008),
\[ p(\tau \mid \theta) = p(s_0) \prod_{h=0}^{H} p(s_{h+1} \mid s_h, a_h)\, \pi_\theta(a_h \mid s_h), \tag{4} \]
to express the desired gradient as
\[ \nabla_\theta \hat{R}(\theta) = \int_\tau R(\tau)\, p(\tau \mid \theta)\, \nabla_\theta \log p(\tau \mid \theta)\, d\tau \approx \frac{1}{K} \sum_{k=1}^{K} R(\tau^{(k)})\, \nabla_\theta \log p(\tau^{(k)} \mid \theta) \approx \frac{1}{K} \sum_{k=1}^{K} R(\tau^{(k)}) \sum_{h=1}^{H} \nabla_\theta \log \pi_\theta\!\left(a_h^{(k)} \mid s_h^{(k)}\right). \tag{5} \]
The variance of this estimation can be reduced with the aid of an action-independent baseline b without introducing bias. A simple but effective baseline is the expectation of the reward under the current policy, approximated by the mean over the sampled paths:
\[ b = \hat{R}(\theta) \approx \frac{1}{K} \sum_{k=1}^{K} R(\tau^{(k)}), \tag{6} \]
\[ \nabla_\theta \hat{R}(\theta) \approx \frac{1}{K} \sum_{k=1}^{K} \left( R(\tau^{(k)}) - b \right) \sum_{h=1}^{H} \nabla_\theta \log \pi_\theta\!\left(a_h^{(k)} \mid s_h^{(k)}\right). \tag{7} \]
With this choice, the update increases the log-likelihood of an action in proportion to how much its reward exceeds the expected reward of the current policy.
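As an illustration of Eqs. (3), (6) and (7), the sketch below performs one REINFORCE parameter update in PyTorch (an implementation choice of ours; the paper does not provide code). It assumes episodes have already been collected with the current policy as pairs of a differentiable log-probability sum and an episode reward; the optimiser step realises the gradient ascent of Eq. (3) through a negated surrogate objective.

```python
import torch

def reinforce_update(episodes, optimizer):
    """One policy-gradient step with the mean-reward baseline (Eqs. 3, 6, 7).

    `episodes` is a list of (log_prob_sum, reward) tuples: log_prob_sum is the
    differentiable sum_h log pi_theta(a_h | s_h) of one episode, reward its R(tau).
    """
    rewards = torch.tensor([float(r) for _, r in episodes])
    baseline = rewards.mean()                          # Eq. (6)
    # Surrogate objective whose gradient is minus Eq. (7); minimising it
    # therefore performs gradient ascent on the expected reward.
    loss = -torch.stack(
        [(r - baseline) * logp for logp, r in episodes]
    ).mean()
    optimizer.zero_grad()
    loss.backward()                                    # autograd supplies the grad-log-pi terms
    optimizer.step()                                   # Eq. (3)
    return float(baseline)
```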
3. Reinforcement Learning for Bioprocess Optimisation under Uncertainty
In Stage 1 of this work, we assume that a preliminary model has been constructed to approximate the real system's dynamics. This approximate model can be used to generate a large number of episodes (different control actions sampled from the stochastic policy) for each training epoch (an epoch corresponds to a specific set of RNN parameter values) in order to initially design a control policy network that produces the optimal policy identified by the model. The control policy is therefore an RNN which takes states (including the time to termination) and past controls as inputs and returns a mean and a variance from which a control action is drawn. During Stage 2 (optimal control of the real plant), the policy network is updated directly using the real system (the plant) in a batch-to-batch framework; the accuracy of the control policy (i.e. the RNN) is thus consolidated during online implementation.

The stochastic control policy is an RNN that represents a conditional probability distribution πθ. The RNN predicts the mean and standard deviation of the next action through a deterministic map using the measurements (the previous states si and a sequence of previous actions ai with i ∈ {−N, . . . , −1}) and the time left until the end of the batch process. The proposed stochastic control policy is trained using the REINFORCE algorithm (see Section 2). The algorithm maximises a given reward (e.g. the concentration of the target product at the final time), with the mean value of this reward employed as a baseline during the update of the RNN parameter values (see Eqs. 6 and 7). The use of a baseline has been proven to be advantageous as long as it is independent of the control actions (Sutton and Barto, 2018).

Initially (Stage 1), the policy network is trained off-line using the available approximate model. The algorithm runs for several epochs and episodes until convergence. The network is then adopted to generate initial optimal policies for the real system (the plant, simulated by a more complex process model which is not available for RNN construction) in Stage 2. REINFORCE-type methods usually require a large number of episodes and epochs; a good initial solution is therefore paramount so that Stage 2 (which is assumed to be carried out on the actual plant) can be completed with few batch-to-batch runs. To keep the problem realistic, only a small number of batches is used in Stage 2 to refine the policy network. A sketch of this two-stage procedure is given below.
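The following compact sketch mirrors the two-stage scheme just described (names such as `run_episode_model` and `run_episode_plant` are hypothetical placeholders, not from the paper's code). The driver only assumes a callable that runs one episode under the current policy and returns the pair consumed by `reinforce_update` above.

```python
import torch

def train_stage(policy, run_episode, n_epochs, n_episodes, lr=1e-3):
    """Generic REINFORCE training stage: `run_episode(policy)` is assumed to
    simulate one batch and return (log_prob_sum, reward) for that episode."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(n_epochs):
        episodes = [run_episode(policy) for _ in range(n_episodes)]
        reinforce_update(episodes, optimizer)
    return policy

# Stage 1: many cheap episodes per epoch on the approximate process model (offline):
#   policy = train_stage(policy, run_episode_model, n_epochs=100, n_episodes=800)
# Stage 2: a few batch-to-batch runs on the real plant, starting from the Stage-1 policy:
#   policy = train_stage(policy, run_episode_plant, n_epochs=4, n_episodes=25)
```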
4. Computational Case Studies
The proposed methodology is applied to a fed-batch bioreactor, where the objective is to maximise the concentration of the target product (y2) at the end of the batch time, using light and an inflow rate (u1 and u2) as control variables. The plant (real photo-production system) is simulated using the following equations:
\[ \frac{dy_1}{dt} = -(u_1 + 0.5\,u_1^2)\,y_1 + 0.5\,\frac{u_2\, y_2}{y_1 + y_2}, \qquad \frac{dy_2}{dt} = u_1 y_1 - 0.7\, u_2 y_1, \tag{8} \]
where u1, u2 and y1, y2 are the control variables and the outlet concentrations of the reactant and product, respectively. The batch operation time course is normalised to 1.
Figure 1: (a) The time trajectories produced by the trained policies. (b) The reward computed for the approximate model.

Additionally, a random disturbance is assumed, given by a Gaussian distribution with mean value 0 and standard deviation 0.02. It is assumed that only the following approximate model (simplified from the complex model) is known, and the preliminary training is performed based on this model to construct the control policy network, whilst the real system model is unknown due to the complexity of the process mechanisms:
\[ \frac{dy_1}{dt} = -(u_1 + 0.5\,u_1^2)\,y_1 + u_2, \qquad \frac{dy_2}{dt} = u_1 y_1 - u_2 y_1. \tag{9} \]
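For illustration, the plant (Eq. 8) and the approximate model (Eq. 9) can be simulated as ODE steps under piecewise-constant controls. The sketch below uses SciPy and is ours, not the authors' implementation; as an assumption, the zero-mean Gaussian disturbance (standard deviation 0.02) is added to the state at the end of each control interval.

```python
import numpy as np
from scipy.integrate import solve_ivp

def plant_rhs(t, y, u1, u2):
    # "Real" photo-production system, Eq. (8).
    y1, y2 = y
    dy1 = -(u1 + 0.5 * u1**2) * y1 + 0.5 * u2 * y2 / (y1 + y2)
    dy2 = u1 * y1 - 0.7 * u2 * y1
    return [dy1, dy2]

def model_rhs(t, y, u1, u2):
    # Approximate model used for Stage-1 training, Eq. (9).
    y1, y2 = y
    return [-(u1 + 0.5 * u1**2) * y1 + u2, u1 * y1 - u2 * y1]

def step(rhs, y, u, dt, noise_std=0.0, rng=None):
    """Integrate one control interval with constant input u = (u1, u2),
    then add the zero-mean Gaussian disturbance to the resulting state."""
    rng = np.random.default_rng() if rng is None else rng
    sol = solve_ivp(rhs, (0.0, dt), y, args=tuple(u))
    y_next = sol.y[:, -1]
    return y_next + rng.normal(0.0, noise_std, size=y_next.shape)

# Example (illustrative initial state and controls):
#   step(plant_rhs, [1.0, 0.0], (2.0, 1.0), dt=0.1, noise_std=0.02)
```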
Initially, 100 epochs and 800 episodes are generated from the simplified model to search for the optimal control policy that maximises the reward for Eq. (9). The control variables are constrained to lie in [0, 5]. The control policy RNN contains 3 hidden layers, each comprising 15 neurons with a hyperbolic tangent activation function. Adam (Kingma and Ba, 2014) is employed to compute the network parameter values. It should be mentioned that the reward computed after convergence is almost the same as the result given by the optimal control problem (OCP) in Stage 1: the maximum rewards for RL and OCP are 0.637 and 0.640, respectively. The reward for each epoch is depicted in Figure 1(b) and the process trajectories after the final update of the policy network are shown in Figure 1(a). This policy is then used to initialise the REINFORCE algorithm for the plant's RL (Stage 2), where 25 episodes are used (i.e. 25 real plant batches). The solution after only 4 epochs is 0.575, whilst the stochasticity-free optimal solution identified using the unknown (complex) model of the plant is 0.583. The reward for each epoch of this stage is depicted in Figure 2(b) and the process trajectories after the last epoch are depicted in Figure 2(a).

Figure 2: (a) The time trajectories produced by the real plant. (b) The reward computed by the updated training using the plant ("real" system) for each epoch.
5. Conclusions
In this work we show that, by adapting reinforcement learning techniques to uncertain and complex bioprocesses, we are able to obtain a near-optimal policy for a stochastic system whose true dynamics are unknown. Furthermore, we obtain this result in a realistic scenario where only a modest number of batch runs is used. We emphasise that
we assume no process structure and embed both stochasticity and plant-model mismatch into the considered system, whose optimisation is generally known to be intractable. Future work will focus on more complex case studies and on exploring other RL methods, such as bias-reduction and sample-efficiency strategies.
References
D. P. Bertsekas, 2000. Dynamic Programming and Optimal Control, 2nd Edition. Athena Scientific.
E. A. del Rio-Chanona, J. L. Wagner, H. Ali, D. Zhang, K. Hellgardt, 2018. Deep learning based surrogate modelling and optimization for microalgal biofuel production and photobioreactor design. AIChE Journal, in press.
I. Harun, E. A. del Rio-Chanona, J. L. Wagner, K. J. Lauersen, D. Zhang, K. Hellgardt, 2018. Photocatalytic Production of Bisabolene from Green Microalgae Mutant: Process Analysis and Kinetic Modeling. Industrial & Engineering Chemistry Research 57 (31), 10336–10344.
K. Jing, Y. Tang, C. Yao, E. A. del Rio-Chanona, X. Ling, D. Zhang, 2018. Overproduction of L-tryptophan via simultaneous feed of glucose and anthranilic acid from recombinant Escherichia coli W3110: Kinetic modeling and process scale-up. Biotechnology and Bioengineering 115 (2), 371–381.
D. P. Kingma, J. Ba, 2014. Adam: A Method for Stochastic Optimization. arXiv:1412.6980.
K. L. Moore, Y. Chen, H. Ahn, 2006. Iterative learning control: A tutorial and big picture view. In: Proceedings of the 45th IEEE Conference on Decision and Control, pp. 2352–2357.
J. Peters, S. Schaal, 2008. Reinforcement learning of motor skills with policy gradients. Neural Networks 21 (4), 682–697.
D. E. Rumelhart, G. E. Hinton, R. J. Williams, 1986. Learning representations by back-propagating errors. Nature 323, 533.
R. Sutton, A. Barto, 2018. Reinforcement Learning: An Introduction, 2nd Edition. MIT Press.
J. Thierie, 2004. Modeling threshold phenomena, metabolic pathways switches and signals in chemostat-cultivated cells: the Crabtree effect in Saccharomyces cerevisiae. Journal of Theoretical Biology 226 (4), 483–501.
R. J. Williams, 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3-4), 229–256.
D. Zhang, V. S. Vassiliadis, 2015. Chlamydomonas reinhardtii Metabolic Pathway Analysis for Biohydrogen Production under Non-Steady-State Operation. Industrial & Engineering Chemistry Research 54 (43), 10593–10605.