Comparison of Model Predictive and Reinforcement Learning Methods for Fault Tolerant Control


10th IFAC Symposium on Fault Detection, Supervision and Safety for Technical Processes, Warsaw, Poland, August 29-31, 2018


IFAC PapersOnLine 51-24 (2018) 233–240

Ibrahim Ahmed ∗, Hamed Khorasgani ∗∗, Gautam Biswas ∗∗∗

∗ Vanderbilt University, USA (e-mail: [email protected])
∗∗ Vanderbilt University, USA (e-mail: [email protected])
∗∗∗ Institute of Software Integrated Systems, Vanderbilt University, USA (e-mail: [email protected])

Abstract: A desirable property in fault-tolerant controllers is adaptability to system changes as they evolve during system operations. An adaptive controller does not require optimal control policies to be enumerated for possible faults. Instead, it can approximate one in real-time. We present two adaptive fault-tolerant control schemes for a discrete time system based on hierarchical reinforcement learning. We compare their performance against a model predictive controller in the presence of sensor noise and persistent faults. The controllers are tested on a fuel tank model of a C-130 plane. Our experiments demonstrate that reinforcement learning-based controllers perform more robustly than model predictive controllers under faults, partially observable system models, and varying sensor noise levels.

© 2018, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved.

Keywords: Reinforcement learning control, model predictive control, fault tolerance, model-based control, hierarchical reinforcement learning

1. INTRODUCTION

Complex systems are expected to operate efficiently in diverse environments. A controller that can function within bounds despite malfunctions and degradation in the system, or changes in the environment, is safer. With the rise in automation and increasing complexity of systems, direct or timely human supervision may not always be possible. Fault Tolerant Control (FTC) seeks to guarantee stability and performance in nominal and anomalous conditions.

Blanke et al. (1997) delineates the difference between fail-safe and fault-tolerant control in their treatment of faults. Fail-safe control employs robust control approaches to avoid degradation from faults in the first place, but it comes at the cost of additional redundancy. Fault-tolerance combines redundancy management with controller reconfiguration and adaptivity to avoid system-wide failures, while keeping the system operational (to the extent possible). A fault-tolerant controller allows for graceful degradation, i.e., it may be sub-optimal after encountering faults, but it will keep the system operational for extended periods of time.

Patton (1997) surveys constituent research areas of FTC. We explore supervision methods in this paper. Supervision uses a priori knowledge of system faults to choose optimal actions. Fault-tolerant supervision has been explored in several fields including probabilistic reasoning, fuzzy logic, and genetic algorithms. This paper focuses on the evaluation of two supervision approaches for FTC: model predictive (MP) and reinforcement learning (RL)-based control. We assume system models are reasonably accurate, the fault diagnoser is correct, and the system can reconfigure in response to the supervisor.

MP and RL-based control rely on sampling the state space in real-time to find a control policy. However, such FTC approaches may be subject to additional constraints due to faults and constraints on operation. Controllers can be limited by processing power, poor observability of states due to sensor failures, and loss of functionality because of faulty components. We explore how the two control approaches fare under various conditions.

Some work has been done in designing RL-based fault adaptive controllers. Lewis et al. (2012) provides a theoretical discussion of using RL in real-time to find a control policy that converges to the optimum. Liu et al. (2017) applies RL-based control to a discrete time system and uses a fully connected shallow neural network to approximate the control policy.

The main contribution of this work is to benchmark RL-based controllers against MP control. We implement discrete-time MP control and propose two approaches for applying RL principles towards FTC. The controllers are tested on a model of the fuel transfer system of a C-130 cargo plane, where the controllers attempt to maintain a balanced fuel distribution in the presence of leaks.

The rest of the paper is organized as follows. Section 2 provides a background of MP and RL-based control. Section 3 discusses online and offline RL controller design. Our case study is presented in Section 4. The results are presented and discussed in Section 5. Sections 6 and 7 discuss future work, open problems in this domain, and the conclusions.


2. BACKGROUND

2.1 Model Predictive Control

Model Predictive Control (MPC) optimizes the current choice of action by a controller that drives the system state towards the desired state. It does this by modeling future states of the system+environment over a finite receding horizon, within which it predicts state trajectories and selects the “best” actions. At each time step, a sequence of actions is selected up until the horizon such that a cost or distance measure to a desired goal is minimized. For example, for a system with state variables (x1, x2, ..., xN) and the desired state (d1, d2, ..., dN), the cost may be a simple Euclidean distance measure between the two vectors.

A model predictive controller operates over a set of states s ∈ S, actions a ∈ A, and a state transition model of the system T : S × A → S. It generates a tree of state trajectories rooted at the system's current state. The tree has a depth equal to the specified lookahead horizon for the controller. The shortest path trajectory, that is, the sequence of states and actions producing the smallest cost, is chosen, and the first action in that trajectory is executed. The process repeats for each time step.

Garcia et al. (1989) discuss MPC in further detail, particularly in reference to continuous systems. Abdelwahed et al. (2005) presents a case study for MPC of a hybrid system. The control algorithm for the discrete subsystem is essentially a limited breadth-first search of state space. They use a distance map, which uses the Euclidean distance between the controller state and the closest goal state as the cost function. The MPC algorithm implemented for this work is adapted from Abdelwahed et al. (2005) and described in Algorithm 1.

Algorithm 1 Model Predictive Control
Require: system model T : S × A → S
Require: distance map D : S → ℝ
Require: horizon N
 1: s0 ← CurrentState
 2: Initialize state queue Queue = {s0}
 3: Initialize state set Visited = {s0}
 4: Initialize optimal state Optimal ← Null
 5: MinDistance ← ∞
 6: while Queue.length > 0 do
 7:   state = Queue.pop()
 8:   if state.depth > N then
 9:     Break
10:   end if
11:   newStates = state.neighbourStates ∩ ¬Visited
12:   for s ∈ newStates do
13:     Visited.add(s)
14:     Queue.insert(s)
15:     if D(s) < MinDistance then
16:       MinDistance = D(s)
17:       Optimal = s
18:     end if
19:   end for
20: end while
21: while Optimal.previous ≠ s0 do
22:   Optimal ← Optimal.previous
23: end while
24: action = T(Optimal.previous, ·) → Optimal

2.2 Reinforcement Learning

Reinforcement learning is the learning of behaviour by an agent, or a controller, from feedback through repeated interactions with its environment. Kaelbling et al. (1996) divide RL into two broad approaches. In the first, the genetic programming approach, an agent explores different behaviours to find ones that yield better results. In the second, the dynamic programming approach, an agent estimates the value of taking individual actions from different states in the environment. The value of each state and action dictates how an agent behaves. This section explores methods belonging to the latter approach.

An agent (controller) is connected to (interacts with) the system+environment through its actions and perceptions. Actions an agent takes cause state transitions of the system in the environment. The change is perceived as a reinforcement signal from the system+environment, and the agent's measurement of the new state.

The standard RL problem can be represented as a Markov Decision Process (MDP). A MDP consists of a set of states S, a set of actions A, a reward function R : S × A × S → ℝ, which provides reinforcement after each state change, and a state transition function T : S × A → Π(S), which determines the probabilities of going to each state after an action. This is the model for the system operating in an environment. For the purposes of this paper, the transition function is assumed to deterministically map s ∈ S and a ∈ A to the next state s′ ∈ S.
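As a concrete illustration of Algorithm 1 under this deterministic transition model, the following is a minimal Python sketch of one receding-horizon step. The function and variable names (mpc_action, transition, distance) are illustrative and are not taken from the paper's implementation.

from collections import deque

def mpc_action(s0, actions, transition, distance, horizon):
    """One receding-horizon step in the spirit of Algorithm 1: breadth-first
    expansion of successor states up to `horizon` actions deep, then return
    the first action on the path to the lowest-distance state found."""
    queue = deque([(s0, 0, None)])   # entries: (state, depth, first action from s0)
    visited = {s0}
    best_first, min_dist = None, float("inf")

    while queue:
        state, depth, first = queue.popleft()
        if depth >= horizon:                    # do not expand beyond the horizon
            continue
        for a in actions:
            nxt = transition(state, a)          # deterministic model T(s, a)
            if nxt in visited:
                continue
            visited.add(nxt)
            first_action = a if first is None else first
            queue.append((nxt, depth + 1, first_action))
            if distance(nxt) < min_dist:        # distance map D(s)
                min_dist, best_first = distance(nxt), first_action
    return best_first

The returned action is executed for one step, the new state is measured, and the search is repeated at the next time step, which gives the receding-horizon behaviour described above.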

In a MDP, everything an agent needs from the environment is encoded in the state. The history of prior states does not affect the value or reward of the following states and actions. This is known as the Markov property. An agent's objective is to maximize its total future reward. The value of a state is the maximum reward an agent can expect from that state in the future:

V(s) = \max_{a \in A} E[R(s, a, s') + \gamma V(s')],    (1)

where γ is the discount factor that weighs delayed rewards against immediate reinforcement. Alternatively, an agent can learn the value of each action, the q-value, from a state:

Q(s, a) = E[R(s, a, s') + \gamma \max_{a \in A} Q(s', a)].    (2)
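As a toy illustration (not part of the paper), equation (2) can be solved by repeated sweeps over a tabular q-function when the state and action sets are small and the transition model is deterministic, as assumed here. The names below are illustrative.

def q_iteration(states, actions, transition, reward, gamma=0.9, sweeps=100):
    """Tabular fixed-point iteration of equation (2) for a deterministic model.
    `transition(s, a)` returns the next state; `reward(s, a, s2)` is the reinforcement."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(sweeps):
        for s in states:
            for a in actions:
                s2 = transition(s, a)
                Q[(s, a)] = reward(s, a, s2) + gamma * max(Q[(s2, a2)] for a2 in actions)
    return Q

Such exhaustive sweeps become infeasible as the state space grows, which motivates the sampling-based methods discussed next.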

We use the q-value formulation for this paper. An agent can exploit either value function to derive a policy that governs its behaviour during operation. To exploit the learned values, a policy greedily or stochastically selects actions that lead to states with the highest value:


\pi_{greedy}(s) = \arg\max_{a \in A} Q(s, a), \qquad \pi_{stochastic}(s) = \Pi(A).    (3)
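A minimal Python sketch of these two selection rules follows; the stochastic policy is written in the softmax form used later during exploration in the case study. Names are illustrative.

import math, random

def greedy_action(Q, s, actions):
    """pi_greedy: pick the action with the highest learned q-value."""
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def softmax_action(Q, s, actions, temperature=1.0):
    """Stochastic policy: sample an action with probability proportional to
    exp(Q / temperature), so higher-valued actions are chosen more often."""
    prefs = [math.exp(Q.get((s, a), 0.0) / temperature) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs], k=1)[0]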

(1), known as the Bellman equation, and (2) can be solved recursively for each state using dynamic programming. However, the complexity of the solution scales exponentially with the number of states and actions. Various approaches have been proposed to approximate the value function accurately enough with minimal expenditure of resources.

Monte Carlo (MC) methods (Sutton and Barto (2017)) alleviate the Curse of Dimensionality by approximating the value function. Instead of traversing the entirety of state-space, they generate episodes of states connected by actions. For each state in an episode, the total discounted reward received after the first occurrence of that state is stored. The value of the state is then the average total reward across episodes. Generating an episode requires a choice of action for each state. The exploratory action selection policy π_exp can be uniform, partially greedy with respect to the highest valued state (ε-greedy), or proportional to the relative values of actions from a state (softmax). The MC approach still requires conclusion of an episode before updating value estimates. Additionally, it may not lend itself to problems that cannot be represented as episodes with terminal or goal states.

Temporal Difference (TD) methods provide a compromise between DP and MC approaches. Like MC, they learn episodically from experience and use different action selection policies during learning to explore the state space. Like DP, they do not wait for an episode to finish to update value estimates. TD updates the value function iteratively at each time step t:

G_t = R_{t+1} + \gamma \max_{a \in A} Q_{t-1}(s_{t+1}, a),
Q_t(s_t, a_t) = Q_{t-1}(s_t, a_t) + \alpha (G_t - Q_{t-1}(s_t, a_t)),    (4)

where α is the learning rate for the value function. G_t, called the return, is a new estimate for the value. The value function is updated using the error between the new estimate and the existing approximation. Note that G estimates the new value by only using the immediate reward and backing up the discounted previous value of the next state. This is known as TD(0). TD methods can be expanded to calculate better estimates of values by explicitly calculating rewards several steps ahead, and only backing up value estimates after that. These are known as TD(λ) methods. For example, the return for TD(2) is:

G_t^{(2)} = R_{t+1} + \gamma \left( R_{t+2} + \gamma \max_{a \in A} Q_{t-1}(s_{t+2}, a) \right)
          = R_{t+1} + \gamma R_{t+2} + \gamma^2 \max_{a \in A} Q_{t-1}(s_{t+2}, a),
Q_t(s_t, a_t) = Q_{t-1}(s_t, a_t) + \alpha (G_t^{(2)} - Q_{t-1}(s_t, a_t)).    (5)

In the extreme case, when G is calculated for all future steps until a terminal state s_T, the TD method becomes a MC method because value estimates then depend on the returns from an entire episode.
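A minimal Python sketch of the TD(0) update in (4) is given below, assuming a tabular q-function stored as a dictionary and using the paper's baseline α and γ (Table 1) as defaults. Names are illustrative.

def td0_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.75):
    """One TD(0) step: form the return G from the immediate reward plus the
    discounted best q-value of the next state, then move Q(s, a) toward it."""
    G = reward + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (G - Q.get((s, a), 0.0))
    return Q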

An improvement on TD(λ) methods is achieved by making the return G more representative of the state space. Values of actions are weighed by the agent's action selection policy. This is known as n-step Tree Backup (Precup (2000)). Defining π_exp(a|s) as the probability of exploratory actions, T as the terminal time, and:

V_t = \sum_{a} \pi_{exp}(a \mid s_t) Q_{t-1}(s_t, a),
\delta_t = R_{t+1} + \gamma V_{t+1} - Q_{t-1}(s_t, a_t),    (6)

where V_t is the expected value of s_t under the action selection policy π_exp, and δ_t is the error between the return and prior value estimate. Then the nth-step return is given by:

G_t^{(n)} = Q_{t-1}(s_t, a_t) + \sum_{k=t}^{\min(t+n-1, T-1)} \delta_k \prod_{i=t+1}^{k} \gamma \, \pi_{exp}(a_i \mid s_i).    (7)
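To make (6) and (7) concrete, the following illustrative Python sketch accumulates the n-step tree-backup return from a stored trajectory. Terminal-state handling, which Algorithm 2 treats explicitly, is omitted here for brevity, and all names are assumptions rather than the paper's code.

def tree_backup_return(t, n, T, rewards, states, actions_taken, actions, Q,
                       pi_exp, gamma=0.75):
    """n-step tree-backup return G_t^(n) from equations (6)-(7), assuming the
    trajectory is stored as lists: states[k] = s_k, actions_taken[k] = a_k,
    rewards[k] = R_{k+1}; pi_exp(a, s) gives the probability pi_exp(a | s)."""
    def V(k):
        # Expected q-value of s_k under the exploratory policy, equation (6)
        return sum(pi_exp(a, states[k]) * Q.get((states[k], a), 0.0)
                   for a in actions)

    G = Q.get((states[t], actions_taken[t]), 0.0)
    weight = 1.0                               # running product of gamma * pi_exp terms
    for k in range(t, min(t + n - 1, T - 1) + 1):
        if k > t:
            weight *= gamma * pi_exp(actions_taken[k], states[k])
        # delta_k = R_{k+1} + gamma * V_{k+1} - Q(s_k, a_k), equation (6)
        delta = (rewards[k] + gamma * V(k + 1)
                 - Q.get((states[k], actions_taken[k]), 0.0))
        G += weight * delta
    return G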

The value function can then be updated using the error G_t^{(n)} - Q_{t-1}(s_t, a_t). A longer lookahead gives a more representative estimate of value. A greedy action selection policy with n = 0 reduces this to TD(0). The following section presents a modified n-step Tree Backup algorithm tailored for adaptive fault-tolerant control.

3. REINFORCEMENT LEARNING-BASED CONTROLLER DESIGN

Temporal difference based RL methods iteratively derive a controller's behaviour by estimating the value of states and actions in a system through exploration. During operation, the controller exploits the knowledge gained through exploration by selecting actions with the highest value. This works well if the system dynamics is stationary. The agent is able to achieve optimal control by exploring over multiple episodes and iteratively converging on the value function. Conservative learning rates can be used such that the value approximation is representative of the agent's history.

When a fault occurs in a system, its dynamics change. The extant value function may not reflect the most rewarding actions in the new environment model. The agent seeks to optimize actions under the new system dynamics. The agent must, therefore, estimate a new value function by exploration every time a fault occurs. The learning process can be constrained by deadlines and sensor noise. This paper proposes the design of a RL-based controller subject to the following requirements:

Adaptivity: The controller should be able to remain functional under previously unseen faults (Goldberg et al. (1993)).
Speedy convergence: The agent's behaviour should be responsive to faults as they occur.
Sparse sampling: The system model may lose reliability following a fault due to sensor noise or incorrect diagnosis. The agent should be able to calculate value estimates representative of its local state space from minimal (but sufficient) sampling.
Generalization: The agent should be able to generalize its behaviour over states not sampled in the model during learning.

The temporal difference RL approach is inherently adaptive as it frequently updates its policy from exploration. Its responsivity can be enhanced by increasing the learning rate α. In the extreme case, α = 1 replaces the last value of an action with the new estimate at that time step. However, larger values of α may not converge the value estimate to the global optimum. RL controllers have to balance exploratory actions to discover new optima in the value function against exploitative actions that use the action values learned so far. An exploration parameter ε ∈ [0, 1) can be set, which determines the periodicity of value updates and hence the controller's responsivity to system faults.

Hierarchical RL can be used to sample the state space at various resolutions (Lampton et al. (2010)). An agent can use a hierarchy function H : S → ℝ+ that gives the step size of actions an agent takes to sample states. H can yield smaller time steps for states closer to the goal and coarser sampling for distant states. This allows the agent to update the value function more frequently with states where finer control is required.

TD RL methods can use function approximation to generalize the value function over the state space instead of using tabular methods. Gradient descent is used with the error computed by the RL algorithm to update parameters of the value function provided at design time. Function bases like radial, Fourier, and polynomial can be used to generalize a variety of value functions.

Algorithm 2 implements Variable n-step Tree Backup for a single episode. It updates the value estimate after exploring a sequence of states up to a finite depth d of actions with varying step sizes. Value estimates for each state in the sequence depend on following states up to n steps ahead. A RL-based controller explores multiple episodes to converge on a locally optimum value for actions. During operation it exploits the value function to pick the most valuable actions as shown in (3). The following subsections discuss two approaches to approximating values for a RL-based controller.

Algorithm 2 Variable n-Step Tree Backup Episode
Require: initial state s0
Require: system model T : S × A → S
Require: value function Q : S × A → ℝ
Require: reward function R : S × S → ℝ
Require: hierarchy H : S → ℝ+
Require: action selection policy π : S → Π(A)
Require: parameters: α, d, γ, n
 1: Get action a0 from πexp(s0)
 2: Q0 ← Q(s0, a0)
 3: T ← ∞, t ← 0, τ ← 0
 4: while t < d and τ < T − 1 do
 5:   if t < T then
 6:     Take action at for H(st) steps
 7:     st+1 ← T(st, at)
 8:     Rt ← R(st, at, st+1)
 9:     if st+1 is in goal states then
10:       T ← t + 1
11:       δt ← Rt − Qt
12:     else
13:       δt ← Rt + γ Σa πexp(a | st+1) Q(st+1, a) − Qt
14:       at+1 ← πexp(st+1)
15:       Qt+1 ← Q(st+1, at+1)
16:       πt+1 ← πexp(at+1 | st+1)
17:     end if
18:   end if
19:   τ ← t − n + 1
20:   if τ ≥ 0 then
21:     E ← 1
22:     G ← Qτ
23:     for k = τ, ..., min(τ + n − 1, T − 1) do
24:       G ← G + E · δk
25:       E ← γE · πk+1
26:     end for
27:     Error ← Q(sτ, aτ) − G
28:     Update Q(sτ, aτ) with α · Error
29:   end if
30:   t ← t + 1
31: end while

3.1 Offline Control

An offline RL-based controller (Algorithm 3) derives the value function by dense and deep sampling of the state space. The learning is done once each time a new value approximation is required. The agent learns from episodes over a larger sample of states in the model. Each episode lasts till a terminal state is reached. Offline control is computationally expensive. However, by employing a partially greedy exploratory action selection policy π and a suitable choice of learning rate α, it is liable to converge to the global optimum.

Algorithm 3 Offline RL Control
 1: d ← ∞
 2: for s ∈ S do
 3:   Call Episode
 4: end for
 5: while True do
 6:   s0 ← CurrentState
 7:   action = arg max_{a ∈ A} Q(s0, a)
 8: end while

3.2 Online Control

Online RL-based control (Algorithm 4) interleaves operation with multiple shorter learning phases. The agent sporadically explores via limited-depth episodes starting from the controller's current state. The sample size is small and local. Each exploratory phase is comparatively inexpensive but may only converge to a local optimum. Like MPC, online RL employs a limited lookahead horizon periodically to learn the best actions. Unlike MPC, online RL remembers and iteratively builds upon previously learned values of actions.


Algorithm 4 Online RL Control
Require: exploration rate ε
Require: sampling density ρ
 1: while True do
 2:   s0 ← CurrentState
 3:   if RandomNumber < ε then
 4:     neighbours ← {T(s0, a) : a ∈ A}
 5:     for state ∈ neighbours do
 6:       if RandomNumber < ρ then
 7:         Call Episode
 8:       end if
 9:     end for
10:   end if
11:   action = arg max_{a ∈ A} Q(s0, a)
12: end while

4. CASE STUDY

The three control schemes (MP, Offline RL, and Online RL) are applied to a simplified version of the fuel tank system of a C-130 cargo plane (see Fig. 1). There are six fuel tanks. The fuel tank geometry is symmetric and split into a left/right arrangement where each side primarily feeds one of two engines. Control of fuel transfer is required to maintain the aircraft center of gravity (CG). Changes in the fuel pump operating performance, plumbing, engine demand, and valve timing can affect the way each tank transfers fuel. The CG is a function of fuel quantities in the tanks; therefore, the fuel control system can restore the CG by turning transfer valves off or on to bring the system back into acceptable limits.

Fuel transfer is controlled by two components: pumps and valves. Pumps maintain a constant flow to the engines. Under nominal operating conditions, the fuel transfer controller is designed to pump fuel from outer tanks first, and then sequentially switch to inner tanks. Each tank has a transfer valve feeding into a shared conduit. When valves are opened, pressure differentials between tanks may result in fuel transfer to maintain the balance.

The state of the system is described by twelve variables: six fuel tank levels, which are continuous variables, and six valve states, which are discrete-valued. For simplicity, fuel levels are represented on a 0-100 min-max scale. Valve positions are binary on/off variables. A controller can switch any combination of valves on or off, giving a total of 64 possible configurations for the action space.

s : \{T_1, T_2, T_{LA}, T_{RA}, T_3, T_4, V_1, V_2, V_{LA}, V_{RA}, V_3, V_4\}
a : \{V_1, V_2, V_{LA}, V_{RA}, V_3, V_4\}    (8)

Faults in the fuel tank system are leaks in fuel tanks. When a leak occurs, a tank drains additional fuel at a rate proportional to the quantity left. It is assumed that controllers have accurate system models and the fault diagnosis results are provided in a timely manner. We make the single fault assumption, i.e., at any point in time, there is at most one fault in the system. Sensor noise is simulated by scaling the measurements of tank levels with values derived from a Gaussian distribution centered at 1 with a standard deviation σ. The noise is assumed to be inherent to the sensors.

A trial begins with all tanks filled to capacity with a possible fault in one tank. The trial concludes when no fuel is left in the tanks. For each trial, the maximum imbalance and the time integral of imbalance are used as performance metrics.

Goal states in the system are configurations where there is zero moment about the central axis. The magnitude of the moment is the imbalance in the system. The distance map D used by the MP controller is a measure of imbalance:

D(s) = \left| 3(T_1 - T_4) + 2(T_2 - T_3) + (T_{LA} - T_{RA}) \right|    (9)

RL controllers are supplied with a reward function R. The controller is rewarded for reaching low-imbalance states, and for doing it fast when there is more fuel to spare:

R(s, s') = \frac{T_1 + T_2 + T_{LA} + T_{RA} + T_3 + T_4}{600} \cdot \frac{1}{1 + D(s')}    (10)

RL controllers also hierarchically sample the state space. The further a state is from the goal, the more imbalanced the tanks are, and therefore the longer the action duration is at that state:

H(s) = 1 + \log_{10}(1 + D(s))    (11)

Value function approximation is done via a linear combination Q(s, a) of normalized fuel level and valve state products, with a bias term. At each episode, the Error is used to update the weights w by way of gradient descent:

Q(s, a) = w \cdot \left[ \frac{T_1(1 + V_1)}{200}, \dots, \frac{T_4(1 + V_4)}{200}, 1 \right]^T
\nabla_w Q = \left[ \frac{T_1(1 + V_1)}{200}, \dots, \frac{T_4(1 + V_4)}{200}, 1 \right]^T
w \leftarrow w - \alpha \cdot Error \cdot \nabla_w Q    (12)

RL controllers employ a softmax action selection policy π_exp during exploration (Doya et al. (2002)). The probability of actions is proportional to the values of those actions at that instant. This ensures that during learning, states which yield more reward are prioritized for traversal.
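A compact Python sketch of the case-study quantities in (9)-(12), as reconstructed above, is given below. The dictionary-based state and action encodings, and the use of all six tank/valve products as features, are illustrative assumptions rather than details taken from the paper's code.

import math

TANKS = ["T1", "T2", "TLA", "TRA", "T3", "T4"]
VALVES = ["V1", "V2", "VLA", "VRA", "V3", "V4"]

def imbalance(s):
    """Distance map D(s), eq. (9): moment magnitude about the central axis."""
    return abs(3 * (s["T1"] - s["T4"]) + 2 * (s["T2"] - s["T3"])
               + (s["TLA"] - s["TRA"]))

def reward(s, s_next):
    """Reward, eq. (10) as reconstructed: remaining-fuel fraction times the
    inverse imbalance of the next state."""
    total = sum(s_next[t] for t in TANKS)
    return (total / 600.0) * (1.0 / (1.0 + imbalance(s_next)))

def step_size(s):
    """Hierarchy H(s), eq. (11): coarser action duration far from the goal."""
    return 1.0 + math.log10(1.0 + imbalance(s))

def features(s, a):
    """Normalized tank-level/valve products plus a bias term, eq. (12).
    `a` is assumed to be the valve configuration being evaluated (dict of 0/1)."""
    return [s[t] * (1 + a[v]) / 200.0 for t, v in zip(TANKS, VALVES)] + [1.0]

def q_value(w, s, a):
    """Linear value approximation Q(s, a) = w . features(s, a)."""
    return sum(wi * fi for wi, fi in zip(w, features(s, a)))

def sgd_update(w, s, a, error, alpha=0.1):
    """Gradient-descent weight update, eq. (12): w <- w - alpha * Error * grad Q."""
    return [wi - alpha * error * fi for wi, fi in zip(w, features(s, a))]

The error passed to sgd_update corresponds to the quantity computed in line 27 of Algorithm 2.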

Fig. 1. Simplified C-130 fuel system schematics. The controller manages valves (dotted ovals) and can observe fuel tank levels. Net outflow to engines via pumps (solid ovals) is controlled independently.

5. RESULTS AND DISCUSSION

For each of the three control schemes, thirty trials were carried out with random faults (Ahmed (2017)). Table 1 shows the values of the baseline parameters used. For these parameters, MP and online RL controllers sampled the same number of states at each step on average.

Name               Symbol   MP    Offline RL   Online RL
Learning rate      α        N/A   0.1          0.1
Discount           γ        N/A   0.75         0.75
Depth              d        N/A   30           5
Lookahead steps    n        N/A   10           5
Exploration rate   ε        N/A   N/A          0.4
Sampling density   ρ        1     1            0.5
Horizon            N        1     N/A          N/A

Table 1. Controller baseline parameters

Under baseline conditions, Figure 2 shows the performance of the various controllers using two metrics. (1) Peak imbalance is the average maximum value of D(s) reached during trials. (2) Total imbalance is the average time integral of D(s) over each trial.

Fig. 2. Performance of controllers under varying sensor noise. RL-based controllers are more robust to sensor noise than MPC. Lower imbalances are better.

These results are representative of how controllers sample the state space to select actions. MP control samples all reachable states within a horizon at each step. It finds the locally optimal action. The choice of each action depends on the instantaneous values associated with the states in the model. In the case of sensor noise, sub-optimal actions may be chosen.

Offline RL control samples a large fraction of potential successor states once, and derives a single value function and control policy that is applied to select actions in the future. Online RL periodically samples a small, local fraction of successor states to make recurring corrections to its policy. It starts off being sub-optimal, but with each exploratory phase improves on its choice of actions till it converges to the lowest cost choices. It is possible that, in scenarios where response time is critical, online RL may not have time to sample enough states to converge to an acceptable policy.

Both RL methods rely on accumulated experience. Having a learning rate means that state values, and hence the derived control policy, depend on multiple measurements of the model. For Gaussian sensor noise, the average converges to the true measurement of the state variable. Therefore, the controller is influenced little by random variations in measurements. Furthermore, the choice of value function can regularize the policy derivation so it does not over-fit to noise.

RL methods are also sensitive to the form of the value approximation function, which may under-fit the true value of states or over-fit to noise. They forego some accuracy in state valuation by using approximations in exchange for the ability to interpolate values for unsampled states. The use of neural networks as general function approximators may allow for a more accurate representation of true state values, and therefore, better control.


Another set of 30 trials was carried out with online RL and MP controllers to simulate sampling and processing constraints during operation. State sample sizes were restricted by lowering the sampling density ρ. At each step, the MP and RL controllers could only advance a fraction ρ of available actions to their neighbouring states to explore. The exploration rate was set to ε = 0.2 to ensure equal sample sizes for both controllers. Sensor noise was reset to σ = 0. Results are shown in Figure 3. Online RL control is quick to recover from a fault and maintains smaller imbalance for the duration of the trial. By having a value function approximation, the controller estimates utilities for actions it cannot sample in the latest time step based off of prior experience. MP control, however, is constrained to choose only among actions it can observe in the model. It is, therefore, more likely to make suboptimal choices. With a decreasing sample density, MP control's performance deteriorates whereas online RL control is less sensitive to the change.

Fig. 3. Performance of controllers under varying sample densities. RL-based controllers are more robust to smaller state samples than MPC. Lower imbalances are better.

Finally, combining sensor noise and low sampling densities yielded a significant disparity in performance between the RL and MP control approaches, as shown in Table 2. The results show that reinforcement learning-based control is more robust. When the model is insufficiently observable, either due to sensor noise or due to sampling constraints, RL controllers are able to generalize behaviour and deliver consistently better performance than MP control. The complexity of both MP and RL approaches is linear in the number of states sampled. However, by discarding prior experience and sampling states anew at each step, model predictive control forfeits valuable information about the environment that RL methods exploit.

Control           MP        Online RL
Peak Imbalance    147.66    75.17
Total Imbalance   1972.89   721.57

Table 2. MP vs Online RL control (σ = 0.05, ρ = 0.5)

6. FUTURE WORK

The fuel tank model made a simplifying assumption of discrete-time dynamics. The problem of continuous-time reinforcement learning-based control remains open and has been discussed in some detail by Doya (2000) and Lewis et al. (2012).

The RL-based controllers approximated the value of actions based on a second order polynomial function, which was learned using stochastic gradient descent. The universal approximation theorem as described by Cybenko (1989) shows that single hidden-layer neural networks with sigmoidal non-linearities can approximate any continuous and real-valued function under some constraints. Using such a basis for the value function may lead to more optimal control policies that reflect the true values of actions.

Using stochastic gradient descent is computationally efficient, as a single observation is used to compute changes to the value approximation. However, it is prone to instability in convergence in case of noisy data. Experience replay, as applied to control systems by Adam et al. (2012), maintains a history of prior observations which can be randomly sampled in batches to calculate value updates. By aggregating observations, the effects of sampling errors and noise may be mitigated and an optimal control policy may be derived sooner.

7. CONCLUSION

Reinforcement learning-based control provides a competitive alternative to model predictive control, especially in situations when the system degrades over time and faults may occur during operations. RL control's independence of explicitly defined goals allows it to operate in environments where faults may render nominal goal states unfeasible. The ability to generalize behavior over unseen states and actions, and to derive control policies from accumulated experience, makes RL control the preferred candidate for system models subject to computational constraints, partially observable environments, and sensor noise. There are several avenues for exploration in the choice of hyperparameters for RL-based controllers which may yield further performance gains.

ACKNOWLEDGEMENTS

This work is supported by an Air Force SBIR Contract # FA8501-14-P-0040 to Global Technology Connection (GTC), Inc., Atlanta, GA. Vanderbilt University has worked on this project under a subcontract from GTC, Inc.

REFERENCES

Abdelwahed, S., Wu, J., Biswas, G., Ramirez, J., and Manders, E.J. (2005). Online fault adaptive control for efficient resource management in advanced life support systems. Habitation, 10(2), 105–115.
Adam, S., Busoniu, L., and Babuska, R. (2012). Experience replay for real-time reinforcement learning control. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(2), 201–212.
Ahmed, I. (2017). QLearning. URL https://github.com/hazrmard/QLearning/releases/tag/ifac2018. Github repository.
Blanke, M., Izadi-Zamanabadi, R., Bogh, S.A., and Lunau, C.P. (1997). Fault-tolerant control systems - a holistic view. Control Eng. Practice, 4(5), 693–702.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4), 303–314.
Doya, K. (2000). Reinforcement learning in continuous time and space. Neural Computation, 12(1), 219–245.
Doya, K., Samejima, K., Katagiri, K., and Kawato, M. (2002). Multiple model-based reinforcement learning. Neural Computation, 14(6), 1347–1369.
Garcia, C.E., Prett, D.M., and Morari, M. (1989). Model predictive control: Theory and practice - a survey. Automatica, 25(3), 335–348.
Goldberg, J., Greenberg, I., and Lawrence, T.F. (1993). Adaptive fault tolerance. Proceedings of the IEEE Workshop on Advances in Parallel and Distributed Systems.
Kaelbling, L.P., Littman, M.L., and Moore, A.W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285.
Lampton, A., Valasek, J., and Kumar, M. (2010). Multiresolution state-space discretization for q-learning with pseudo-randomized discretization. The 2010 International Joint Conference on Neural Networks (IJCNN), Barcelona.
Lewis, F.L., Vrabie, D., and Vamvoudakis, K.G. (2012). Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers. IEEE Control Systems, 32(6), 76–105.
Liu, L., Wang, Z., and Zhang, H. (2017). Adaptive fault-tolerant tracking control for MIMO discrete-time systems via reinforcement learning algorithm with less learning parameters. IEEE Transactions on Automation Science and Engineering, 14(1), 299–313.
Patton, R.J. (1997). Fault-tolerant control systems: The 1997 situation. IFAC Proceedings, 30(18).
Precup, D. (2000). Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, 80.
Sutton, R.S. and Barto, A.G. (2017). Reinforcement learning: An introduction. Cambridge: MIT Press, 2nd edition. Draft.
