Adaptive Generalized Policy Iteration in Active Fault Detection and Control★

9th IFAC Symposium on Fault Detection, Supervision and Safety of Technical Processes, September 2-4, 2015. Arts et Métiers ParisTech, Paris, France

IFAC-PapersOnLine 48-21 (2015) 505–510

Ivo Punčochař, Jan Škach, Miroslav Šimandl

Department of Cybernetics and NTIS - New Technologies for the Information Society, Faculty of Applied Sciences, University of West Bohemia, Univerzitní 8, 306 14 Plzeň, Czech Republic (e-mails: [email protected], [email protected], [email protected])

★ This work was supported by the Czech Science Foundation, project No. GA15-12068S.

Abstract: The paper deals with a suboptimal solution to the problem of active fault detection and control for stochastic nonlinear systems over an infinite time horizon. The design of an active fault detector and controller is formulated as a dynamic optimization problem that is solved using the generalized policy iteration algorithm. A key parameter of this algorithm that influences computational demands and the speed of convergence is the number of successive approximations used in the policy evaluation step. Although general guidelines are known, no exact algorithm for choosing this parameter exists. The paper proposes an adaptive algorithm for determining this key parameter to speed up convergence given a specified accuracy of the solution. The adaptive algorithm is demonstrated and compared with non-adaptive generalized policy iteration algorithms in a numerical example.

© 2015, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved.

Keywords: Active fault detection, optimal control, dynamic programming, generalized policy iteration.

1. INTRODUCTION

Automatic control deals with many challenging research topics including fault detection. Faults such as fluid leaks and stuck actuators often cause fatal consequences to a system and its environment. Therefore, it is important to recognize faults as soon as they occur.

Two principally different approaches can be employed to address the problem of fault detection. In the first approach, a monitored system is observed without being intentionally excited by a probing signal (Korbicz et al., 2004; Isermann, 2011). This approach to fault detection is called passive. The second approach, introduced in (Zhang, 1989), is called active fault detection (AFD) because an auxiliary input signal is injected into the system with the aim of improving the quality of fault detection. Active fault detector designs based on different frameworks and different levels of generality were introduced in (Kerestecioglu, 1993; Campbell and Nikoukhah, 2004; Niemann, 2006; Scott et al., 2014). Furthermore, the AFD problem was extended to include control objectives in (Blackmore and Williams, 2006). This kind of problem is denoted as active fault detection and control (AFDC). A unified formulation of a general AFDC problem and its special cases were considered in (Šimandl and Punčochář, 2009; Šimandl et al., 2011).

The design of the optimal active fault detector and controller, which takes the availability of future information into account, is based on finding a policy that generates decisions and inputs such that a chosen criterion is minimized. There are two approaches to solving such a complex functional problem. The first approach uses a suitably parametrized policy and optimization over the parameters to perform direct policy search (Baxter and Bartlett, 2001). The second approach, which is used more often, is based on dynamic programming (DP), where the Bellman principle of optimality is employed to transform the original functional problem into a series of simpler overlapping optimization subproblems (Bellman, 1957). The optimal policy can be obtained from the Bellman function that represents the solution to a functional equation known as the Bellman functional equation (Buşoniu et al., 2010). Since it is impossible to find this solution analytically for general systems and criteria, numerical methods of approximate dynamic programming (Powell, 2007) and reinforcement learning (Lewis and Liu, 2013) are applied. One such numerical method is based on the quantization of a continuous state space, which allows the Bellman function to be enumerated for all states. In general, the solution to the Bellman functional equation is found off-line using an algorithm that iteratively computes approximations to the Bellman function (Bertsekas, 2001).

Besides the value and policy iteration algorithms, the generalized policy iteration (GPI) algorithm is commonly used because it provides a better compromise between the speed of convergence and computational demands.

The GPI algorithm starts with an initial policy, and each iteration consists of a policy evaluation (PE) step and a policy improvement (PI) step. In the PE step, the value function of the current policy is computed non-iteratively by solving a system of linear equations or iteratively by successive approximations (Sutton and Barto, 1998). The use of successive approximations (SA) is preferable for problems with many states as the computational demands of the non-iterative solution to the system of linear equations may be prohibitive.


Although general guidelines for selecting the number of SA are known, no exact algorithm for choosing this parameter can be found in the literature. The goal of this paper is to propose an adaptive algorithm that aims at speeding up the convergence of the GPI algorithm. The idea is to observe the progress of policies and determine the number of SA based on a specified accuracy of the solution.

2. PROBLEM FORMULATION

The paper is focused on a multiple-model AFDC where one model represents the fault-free behavior of the system and the other models represent the system under faults. First, the AFDC problem is formulated as an imperfect state information problem. Then it is transformed into a perfect state information problem (Škach et al., 2014) to obtain a problem that can be directly addressed by DP.

2.1 Imperfect state information model

It is assumed that a system can be described at each time step k ∈ T = {0, 1, . . .} by the following nonlinear stochastic discrete-time model

    x_{k+1} = f(x_k, µ_k, u_k) + w_k,   (1)

where f : R^{n_x} × M × U → R^{n_x} represents the known dynamics of the system, x_k ∈ R^{n_x} is a known common system state, µ_k ∈ M = {1, 2, . . . , N_µ} is an unknown model index, x^a_k = [x^T_k, µ_k]^T ∈ R^{n_x} × M denotes a hybrid state, u_k ∈ U ⊂ R^{n_u} is an input from a discrete finite set of admissible inputs U = {u^1, u^2, . . . , u^{N_u}}, and w_k ∈ R^{n_x} is a state noise with the known conditional probability density function (pdf) p_w(w_k | x^a_k). The initial conditions x_0 and µ_0 are mutually independent. The pdf p(x_0) and the probability Π(µ_0) are known. The system dynamics depends on the actual model index µ_k and is determined by a nonlinear function f_{µ_k} : R^{n_x} × U → R^{n_x}

    f(x_k, µ_k, u_k) = f_{µ_k}(x_k, u_k),   (2)

where the fault-free behavior of the system is represented by f_1 and the faulty behaviors are defined by the functions f_i for i = 2, 3, . . . , N_µ. The switching between the fault-free and faulty behaviors of the system is defined by a stationary finite-state Markov chain with the known transition probabilities Π_{i,j} = Π(µ_{k+1} = j | µ_k = i).
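As an illustration of this hybrid setup, the following minimal sketch simulates one step of the switching model (1)-(2). The dynamics f_1 and f_2, the noise, and the transition matrix are placeholders for illustration only, not the models used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Placeholder fault-free (f1) and faulty (f2) dynamics; illustrative only.
def f1(x, u):
    return 0.9 * x + u

def f2(x, u):
    return 0.7 * x + u   # the fault alters the dynamics

models = [f1, f2]                      # f_1, ..., f_{N_mu}
Pi = np.array([[0.98, 0.02],           # transition probabilities Pi_{i,j}
               [0.02, 0.98]])

def step(x, mu, u):
    """One step of (1)-(2): x_{k+1} = f_{mu_k}(x_k, u_k) + w_k,
    followed by a Markov-chain transition of the unobserved model index."""
    w = 0.01 * rng.standard_normal(x.shape)       # state noise w_k
    x_next = models[mu](x, u) + w
    mu_next = rng.choice(len(models), p=Pi[mu])   # mu_{k+1} ~ Pi_{mu_k, .}
    return x_next, mu_next

x, mu = np.array([1.0, 0.5]), 0                   # x_0, mu_0
for k in range(5):
    x, mu = step(x, mu, u=np.zeros(2))
```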

2.2 Perfect state information model

Since the part µ_k of the hybrid state x^a_k in the model (1) is not observed directly, an imperfect state information problem is obtained. To simplify the design of the active fault detector and controller, a sufficient statistic for the model index µ_k is used to redefine the state and thus obtain a perfect state information problem. One possible sufficient statistic is the conditional probability distribution P(µ_k | x^k_0, u^{k−1}_0). Note that the notation z^j_i = [z^T_i, z^T_{i+1}, . . . , z^T_j]^T is used to denote a sequence of variables from the time step i to the time step j. The following part presents a perfect state information problem formulation using a hyper-state x̄_k (Åström, 1965).

Let the hyper-state be defined as x̄_k = [x^T_k, b^T_k]^T ∈ S = (R^{n_x} × B) ⊂ R^{n_x̄}, where b_k = [b_{k,1}, . . . , b_{k,N_µ−1}]^T ∈ B is the belief state reduced by one dimension. The components of the reduced belief state are defined as b_{k,i} = P(µ_k = i | x^k_0, u^{k−1}_0) for i = 1, 2, . . . , N_µ − 1, and the set B is specified as B = {b ∈ R^{N_µ−1} : b ≥ 0, 1^T b ≤ 1}.

Then the original model (1) is replaced by a new model with perfect state information

    x̄_{k+1} = ϕ(x̄_k, u_k, x_{k+1}),   (3)

where the system dynamics is described by a nonlinear vector function ϕ : S × U × R^{n_x} → S.
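The paper leaves the concrete form of ϕ to (Škach et al., 2014). For intuition, one plausible realization consistent with the model of Section 2.1 combines a Bayesian reweighting of the belief by the likelihood of the observed transition with a prediction through the Markov chain. The sketch below is an assumption-laden illustration (it carries the full belief vector rather than the reduced one), not the paper's construction.

```python
import numpy as np

def phi(x, b_full, u, x_next, models, Pi, pdf_w):
    """A sketch of the hyper-state transition (3): given the observed
    successor state x_{k+1}, update the belief over the model indices.
    b_full : full belief vector over all N_mu models
    pdf_w  : state-noise pdf, evaluated at a residual vector."""
    # Measurement update: weight each model by the likelihood of the
    # observed transition, b+_i ~ p_w(x_{k+1} - f_i(x_k, u_k)) * b_{k,i}.
    lik = np.array([pdf_w(x_next - f(x, u)) for f in models])
    b_post = lik * b_full
    b_post /= b_post.sum()
    # Time update through the Markov chain: b_{k+1} = Pi^T b+.
    b_next = Pi.T @ b_post
    return np.concatenate([x_next, b_next])
```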

In this paper, it is assumed that the hyper-state space is quantized by introducing a uniform grid S^g ≡ S^g_1 × S^g_2 × . . . × S^g_{n_x̄} = {s^1, s^2, . . . , s^{N_s}} of discrete states s^t ∈ R^{n_x̄}, t ∈ {1, 2, . . . , N_s}. The discrete sets S^g_j are chosen such that the uniform grid S^g covers a region of the hyper-state space with non-negligible probabilities of hyper-state trajectories. Any hyper-state x̄_k can be projected onto the set S^g using an aggregation function (Ikonen, 2007)

    s_k = g(x̄_k) = arg min_{ξ ∈ S^g} ‖x̄_k − ξ‖_2.   (4)

Note that the hyper-state space S^g is a finite discrete set.
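Because the grid is uniform, the nearest-neighbor projection (4) need not be implemented as a search over all N_s points: componentwise rounding gives the same result. A minimal sketch follows; the grid parametrization by per-axis origin, spacing, and point count is an implementation choice, not from the paper.

```python
import numpy as np

def aggregate(x_bar, lo, step, n_pts):
    """Project a hyper-state onto the uniform grid S^g, eq. (4),
    by rounding to the nearest grid point along each axis.
    lo, step, n_pts : per-axis grid origin, spacing, and point count."""
    idx = np.rint((x_bar - lo) / step)            # nearest index per axis
    idx = np.clip(idx, 0, np.asarray(n_pts) - 1)  # stay inside the grid
    return lo + idx * step

# Example with the grid used later in Section 6: S1^g x S2^g x S3^g.
lo, step = np.zeros(3), np.array([0.05, 0.05, 0.01])
n_pts = [91, 51, 101]
s = aggregate(np.array([3.12, 1.98, 0.473]), lo, step, n_pts)
# s == [3.10, 2.00, 0.47]
```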

When considering the hyper-state space quantization, the problem formulation can be seen as a Markov decision problem specified by the quadruple (S^g, U, P, L), where P represents an N_s × N_s × N_u array of conditional probabilities P_{i,j,l} = P(s_{k+1} = s^j | s_k = s^i, u_k = u^l) that satisfy P_{i,j,l} ≥ 0 for all i, j, l and

    ∑_{j=1}^{N_s} P_{i,j,l} = 1 for all i, l,   (5)

and L represents a cost per one time step. The conditional probabilities P_{i,j,l} must be determined such that the finite-state Markov process approximates the original model (3). The conditional probabilities can be defined (Ikonen and Kortela, 2008) using a neighborhood O(s^j) of a grid point s^j as

    P_{i,j,l} = P(x̄_{k+1} ∈ O(s^j) | x̄_k = s^i, u_k = u^l) = ∫_{O(s^j)} p(x̄_{k+1} | x̄_k = s^i, u_k = u^l) dx̄_{k+1},   (6)

where the neighborhood O(s^j) is defined by means of the aggregation function g as

    O(s^j) = {x̄ ∈ R^{n_x̄} : g(x̄) = s^j}.   (7)

The conditional probabilities P_{i,j,l} can be computed using several techniques, of which the Monte Carlo method is the most general and also the most computationally intensive one. However, this computation must be performed only once for a given model.
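The Monte Carlo technique mentioned above reduces to frequency counting: from each grid state s^i under each admissible input u^l, draw successors from the model (3) and record which neighborhood O(s^j) they land in. A sketch under the assumption that a successor sampler and the grid-cell index of the aggregation function g are available:

```python
import numpy as np

def estimate_P(grid, inputs, sample_next, g_index, n_mc=500):
    """Monte Carlo estimate of P_{i,j,l} in (6) by frequency counting.
    grid        : (Ns, n_xbar) array of grid points s^i
    inputs      : (Nu, n_u) array of admissible inputs u^l
    sample_next : draws one successor hyper-state from model (3)
    g_index     : maps a hyper-state to the index j of its cell O(s^j)"""
    Ns, Nu = len(grid), len(inputs)
    P = np.zeros((Ns, Ns, Nu))
    for i, s in enumerate(grid):
        for l, u in enumerate(inputs):
            for _ in range(n_mc):
                j = g_index(sample_next(s, u))
                P[i, j, l] += 1
    return P / n_mc   # each (i, l) slice sums to one, consistent with (5)
```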

2.3 AFDC objectives

The active fault detector and controller is represented as

    [d_k; u_k] = ρ(s_k) = [σ(s_k); γ(s_k)],   (8)

where d_k ∈ M is a decision about the model index µ_k, σ : S^g → M is a nonlinear function of detection, γ : S^g → U is a nonlinear vector function of control, and ρ : S^g → M × U is an unknown nonlinear vector function.


When a decision d_k and a control u_k are taken in a grid hyper-state s_k, they are evaluated by a cost function L : S^g × M × U → R that is bounded in the infinity norm and makes a compromise between detection and control objectives. This cost function is defined as

    L(s_k, d_k, u_k) = α L^d(s_k, d_k) + (1 − α) L^c(s_k, u_k),   (9)

where L^d is a detection cost function, L^c is a control cost function, and α ∈ [0, 1] is a weighting parameter. Note that with a slight abuse of notation, the simplified expression L(s_k, ρ(s_k)) will be used instead of L(s_k, σ(s_k), γ(s_k)) throughout the text. The design of the optimal active fault detector and controller is based on minimization of the total cost over an infinite number of stages

    J(ρ, s_0) = lim_{F→+∞} E { ∑_{k=0}^{F} λ^k L(s_k, d_k, u_k) },   (10)

where λ ∈ (0, 1) is a discount factor that reduces the impact of costs in the future.

3. DYNAMIC PROGRAMMING

The value function V_ρ of a policy ρ, V_ρ : S^g → R,

    V_ρ(s_k) = lim_{F→∞} E { ∑_{i=k}^{F} λ^i L(s_i, ρ(s_i)) | s_k },   (11)

returns the expected value of the future costs when starting in a state s_k and following the policy ρ. The time index will be dropped in the rest of Section 3 for convenience. Using the discounted Markov decision problem formulation and the Bellman principle of optimality (Bellman, 1957; Bertsekas, 2001), it can be shown that for a given policy ρ the value function V_ρ has to satisfy the following functional equation

    V_ρ(s^i) = L(s^i, ρ(s^i)) + λ ∑_{j=1}^{N_s} P^γ_{i,j} V_ρ(s^j),   (12)

where i = 1, . . . , N_s and P^γ_{i,j} = P(s_{k+1} = s^j | s_k = s^i, u_k = γ(s^i)) stands for the transition probabilities when the controller γ is used. The optimal value function, also called the Bellman function, is defined as

    V*(s^i) = min_ρ V_ρ(s^i)   (13)

and can be obtained by solving the Bellman functional equation

    V*(s^i) = min_ρ [ L(s^i, ρ(s^i)) + λ ∑_{j=1}^{N_s} P^γ_{i,j} V*(s^j) ].   (14)

Then, the optimal decision and control policy is given as

    ρ*(s^i) = arg min_ρ [ L(s^i, ρ(s^i)) + λ ∑_{j=1}^{N_s} P^γ_{i,j} V*(s^j) ].   (15)

Note that the solution to the Bellman functional equation (14) is found off-line. This step might be computationally demanding for large state and control spaces.

3.1 Optimal active fault detector and controller

The optimal active fault detector and controller is derived from the cost function (9), which is substituted into the Bellman functional equation (14). Since the detection cost function L^d depends only on the decision d_k together with the state s_k, and the control cost function L^c depends only on the control u_k together with the state s_k, the Bellman functional equation for the active fault detector and controller can be expressed as

    V*(s^i) = α min_{d∈M} L^d(s^i, d) + min_{u∈U} [ (1 − α) L^c(s^i, u) + λ ∑_{j=1}^{N_s} P^u_{i,j} V*(s^j) ],   (16)

where P^u_{i,j} = P(s_{k+1} = s^j | s_k = s^i, u_k = u). The optimal detector σ* and the optimal input signal generator γ* are then expressed as

    d* = σ*(s^i) = arg min_{d∈M} α L^d(s^i, d),   (17)

    u* = γ*(s^i) = arg min_{u∈U} [ (1 − α) L^c(s^i, u) + λ ∑_{j=1}^{N_s} P^u_{i,j} V*(s^j) ].   (18)

The solution to (16) is used in (17) and (18) to provide on-line detection and control.

4. GENERALIZED POLICY ITERATION

In this section, the GPI algorithm (Sutton and Barto, 1998) for solving the Bellman functional equation in the AFDC setting will be briefly described and its computational demands will be discussed. The Bellman functional equation (14) is usually solved iteratively using the GPI algorithm. The algorithm starts with an initial policy ρ^(0). At each iteration i, i = 0, 1, 2, . . . of the algorithm, the current policy ρ^(i) is evaluated in the first step (PE) and then improved in the second step (PI). The value function V_{ρ^(i)} of the policy ρ^(i) is determined by solving (12). Note that (12) can also be expressed as a system of linear equations

    (I_{N_s} − λ P^{γ^(i)}) v^{ρ^(i)} = c^{ρ^(i)},   (19)

where I_{N_s} is the identity matrix of order N_s, P^{γ^(i)} is the transition matrix induced by the policy γ^(i), v^{ρ^(i)} ∈ R^{N_s} is the vector of unknown elements defined as v^{ρ^(i)}_m = V_{ρ^(i)}(s^m), and c^{ρ^(i)} ∈ R^{N_s} is the vector defined as c^{ρ^(i)}_m = L(s^m, ρ^(i)(s^m)). The PE step can be performed either by solving the system of linear equations (19) using a noniterative method such as the Gaussian elimination method (GEM) or by solving the functional equation (12) using an iterative SA method. The iteration of the SA method is given as

    v^{ρ^(i)}_{(j+1)} = c^{ρ^(i)} + λ P^{γ^(i)} v^{ρ^(i)}_{(j)},   (20)

where j = 0, 1, 2, . . . , j̄ is an iteration index. Even though the SA method can be initiated with an arbitrary vector v^{ρ^(i)}_{(0)}, a common practice is to set v^{ρ^(i)}_{(0)} = v^{ρ^(i−1)} and v^{ρ^(0)}_{(0)} to the zero vector. During the PI step, the policy ρ^(i) is improved using the following equation

    ρ^{(i+1)}(s^m) = arg min_{d∈M, u∈U} [ L(s^m, d, u) + λ ∑_{n=1}^{N_s} P^u_{m,n} V_{ρ^(i)}(s^n) ].   (21)
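For concreteness, the whole loop (19)-(21) can be sketched for a tabulated problem. The following is a minimal NumPy rendering, assuming the transition array P_{i,j,l} and the cost tables L^d, L^c have been precomputed; it warm-starts the SA sweeps (20) from the previous value vector and exploits the split of the improvement step into the detection part (17) and the control part (18).

```python
import numpy as np

def gpi(P, Ld, Lc, alpha, lam, j_bar, max_iter=200):
    """Generalized policy iteration for the tabulated AFDC problem.
    P     : (Ns, Ns, Nu) transition probabilities P_{i,j,l}
    Ld    : (Ns, Nd) detection costs, Lc : (Ns, Nu) control costs
    j_bar : number of successive approximations (20) per PE step."""
    Ns, _, Nu = P.shape
    v = np.zeros(Ns)                        # value vector, warm-started
    gamma = np.zeros(Ns, dtype=int)         # control policy gamma^(0)
    for _ in range(max_iter):
        # Stage cost of the current policy; the detection decision is
        # state-wise myopic, consistent with (16) and (17).
        c = alpha * Ld.min(axis=1) + (1 - alpha) * Lc[np.arange(Ns), gamma]
        P_gamma = P[np.arange(Ns), :, gamma]    # (Ns, Ns) induced by gamma
        # Policy evaluation (20): j_bar sweeps of v <- c + lam * P_gamma v.
        for _ in range(j_bar):
            v = c + lam * P_gamma @ v
        # Policy improvement (21); only the control part (18) depends on v.
        Q = (1 - alpha) * Lc + lam * np.einsum('ijl,j->il', P, v)
        gamma_new = Q.argmin(axis=1)
        if np.array_equal(gamma_new, gamma):    # consecutive policies equal
            break
        gamma = gamma_new
    sigma = Ld.argmin(axis=1)                   # optimal detector (17)
    return v, sigma, gamma
```

With a single sweep per PE step the loop behaves like value iteration, while letting j̄ grow recovers policy iteration, matching the three cases discussed above.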


The PE and PI steps are repeated with a sequentially increasing iteration index i until the infinity norm of the difference between two successive value functions drops below the threshold ε_V and the minimum number of iterations i_min has been run, or two consecutive policies are the same. Note that the convergence of the GPI algorithm has been proven in (Bertsekas and Tsitsiklis, 1996). When the SA method is used in the PE step, three cases can be recognized in the GPI algorithm (Lewis and Vrabie, 2009). If j̄ = 0, the value iteration algorithm is obtained. If j̄ → ∞, the GPI algorithm results in the policy iteration algorithm. The last case is represented by a nonzero finite j̄.

When the state space is large, solving (19) by a noniterative method is time consuming. A way to get around this difficulty is to use a finite number j̄ of SA. The following approaches for selecting j̄ were proposed in (Puterman, 1994):

(a) a nonzero fixed j̄ is used at each GPI iteration i,
(b) j̄ is chosen according to a prespecified pattern (e.g. increasing j̄ for increasing i),
(c) j̄ is adapted based on a chosen performance index (e.g. a specified accuracy ‖V^{(j+1)}_{γ^(i)} − V^{(j)}_{γ^(i)}‖_∞ ≤ ε_PE, where ε_PE > 0 is a given threshold).

5. ADAPTIVE ALGORITHM

General guidelines to set the number of SA in the PE step are known. However, an exact algorithm for choosing j̄ is missing. In this section, an adaptive algorithm for determining the number of SA in the PE step will be introduced in the context of the AFDC problem. The objective of the adaptive algorithm is to speed up the convergence given a specified accuracy of the solution. In (Puterman, 2001) it was concluded that the GPI algorithm based on the SA method with an adaptive choice of rather low values of j̄ is the most computationally efficient when the number of states N_s is large.

One SA in the PE step requires N_s^2 operations. A bound on a maximum efficient j̄ can be obtained by comparing these computational requirements with the computational demands of a noniterative method for solving (19). Suppose that the GEM, requiring N_s^3/3 + N_s^2 + N_s/3 operations, is employed. Then for the SA method to be effective, it might be reasonable to consider j̄ ≤ N_s/3 for a non-sparse transition probability matrix P^{γ^(i)}.

The accuracy of SA in the PE step can be bounded (Puterman and Shin, 1978) as follows

    ‖V*_{γ^(i)} − V^{(j̄+1)}_{γ^(i)}‖_∞ ≤ (2λ^{j̄+1} / (1 − λ)) ‖L(s_k, ρ^(i)(s_k))‖_∞.   (22)

It can be seen that the number of SA decreases with decreasing λ. Let ψ > 0 be an accuracy threshold of the SA method that bounds the right-hand side of (22). By rearranging terms, it can be derived that to guarantee a desired accuracy, the number of SA j̄ has to satisfy the following inequality

    j̄ ≥ ⌈ log( (1 − λ)ψ / (2 ‖L(s_k, ρ^(i)(s_k))‖_∞) ) / log λ − 1 ⌉,   (23)

where ⌈·⌉ denotes the ceiling function. Note that the denominator log λ in (23) is negative. The sign of the corresponding numerator depends on the accuracy threshold ψ, the infinity norm ‖L(s_k, ρ^(i)(s_k))‖_∞, and λ. Therefore, j̄ increases for an increasing value of the infinity norm. A low value of this infinity norm might lead to a negative j̄, which can be understood to mean that no improvement of the value function is needed.

The adaptive GPI (AGPI) algorithm is summed up in Algorithm 1. Note that the presented algorithm was derived for the case when the GEM is used to solve (19) and it is assumed that the conditional probabilities P_{i,j,l} are represented by full matrices.

Algorithm 1 AGPI algorithm
Step 1: Initialization. Select any admissible policy ρ^(0). Define the PE accuracy threshold ψ, the GPI quality threshold ε_V, and the minimum number of GPI algorithm iterations i_min. Set i, j = 0.
Step 2: Policy evaluation. Using the adaptive rule (23), compute the number of SA j̄.
  if j̄ > N_s/3 then
    Evaluate ρ^(i) by the exact PE (19).
  else
    Set v^{ρ^(i)}_{(0)} = v^{ρ^(i−1)} (or to the zero vector for i = 0). Evaluate ρ^(i) by the iterative approach (20), j ← j + 1 until j = j̄.
  end if
  Output: V_{ρ^(i)}
Step 3: Policy improvement. Update the policy ρ^(i) by V_{ρ^(i)} using (21).
  Output: γ^(i+1)
Step 4: Stopping condition.
  if γ^(i+1) = γ^(i) or (‖V_{ρ^(i+1)} − V_{ρ^(i)}‖_∞ ≤ ε_V ∧ i ≥ i_min) then
    Terminate the GPI algorithm.
  else
    i ← i + 1 and return to Step 2.
  end if
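Step 2 of Algorithm 1 hinges on evaluating (23), which in code is only a few lines. A direct transcription follows; the vector of stage costs of the current policy is assumed to be given.

```python
import numpy as np

def num_successive_approx(c_rho, lam, psi):
    """Adaptive number of SA from (23) for an accuracy threshold psi > 0.
    c_rho : vector of stage costs L(s, rho^(i)(s)) over the grid."""
    norm = np.max(np.abs(c_rho))                 # ||L(s, rho^(i)(s))||_inf
    if norm == 0.0:
        return 0                                 # nothing to improve
    j_bar = np.ceil(np.log((1 - lam) * psi / (2 * norm)) / np.log(lam) - 1)
    return max(int(j_bar), 0)                    # negative => no sweeps needed

# Step 2 of Algorithm 1 then falls back to exact PE when SA is not cheaper:
# if num_successive_approx(c, lam, psi) > Ns / 3, solve (19) directly.
```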

6. NUMERICAL EXAMPLE

In the numerical example, the growth of bacteria in a chemostat is considered (Seborg et al., 2004). A substrate needed for the bacteria to grow is continuously fed into the chemostat. It is assumed that the chemostat is well mixed and has a drain through which the mixture of substrate and bacteria flows out. By applying the forward Euler method with the sampling period T_s = 1 hr, the chemostat dynamics can be described by the following nonlinear discrete-time model

    x_{k+1,1} = x_{k,1} + T_s c_1 x_{k,1} x_{k,2} / (c_2 + x_{k,2}) − T_s u_{k,1} x_{k,1} + w_{k,1},
    x_{k+1,2} = x_{k,2} − T_s c_1 x_{k,1} x_{k,2} / (c_3 (c_2 + x_{k,2})) + T_s u_{k,1} (u_{k,2} − x_{k,1}) + w_{k,2},   (24)

where x_k = [x_{k,1}, x_{k,2}]^T ∈ R^2 is the system state, x_{k,1} and x_{k,2} are the bacteria and substrate concentrations, respectively, u_k = [u_{k,1}, u_{k,2}]^T ∈ R^2 is the system input, u_{k,1} is the substrate flow rate divided by the chemostat volume, u_{k,2} is the input substrate concentration, and w_k = [w_{k,1}, w_{k,2}]^T ∈ R^2 is an independent state noise defined by the Laplace distribution with the location parameter η = 0 and the scale parameter β = 0.0012. The parameters c_1, c_2, and c_3 are the maximum growth rate of bacteria, the half-velocity constant, and the yield parameter, respectively.


It is assumed that the system can be described either by the fault-free model with parameters c_1 = 0.2 hr^{−1}, c_2 = 1 g/l, c_3 = 0.5 g/g, or by the faulty model where the half-velocity constant is changed to c_2 = 0.9. The transition probabilities of switching between the fault-free and faulty models are Π(µ_{k+1} = j | µ_k = i) = 0.02 for i, j ∈ M, i ≠ j. The uniform grid of 468741 points over the hyper-state space is defined by the Cartesian product S^g = S^g_1 × S^g_2 × S^g_3, where S^g_1 = {0, 0.05, . . . , 4.5}, S^g_2 = {0, 0.05, . . . , 2.5}, and S^g_3 = {0, 0.01, . . . , 1}.

The detection aim is to minimize the probability of making a wrong decision. The control aim is to minimize the control effort while the system state follows the reference r = [r_1, r_2]^T = [4.4, 1]^T. Therefore, the detection and control cost functions are chosen to be

    L^d(d_k, s_k) = { 1 − s_{k,3} if d_k = 1; s_{k,3} if d_k = 2 },
    L^c(s_k, u_k) = ∑_{i=1}^{2} |p^u_i u_{k,i}| + ∑_{i=1}^{2} p^s_i (1 − e^{−q_i (s_{k,i} − r_i)^2}),   (25)

where p^u = [p^u_1, p^u_2]^T = [0.01, 0.03]^T, p^s = [p^s_1, p^s_2]^T = [5, 1]^T, and q = [q_1, q_2]^T = [1, 1]^T are weighting parameters. Note that according to (23) no adaptation will take place in the AGPI algorithm when L^c does not depend on u_k. The set of admissible inputs is U = U_1 × U_2, U_1 = {0, 0.05, 0.1}, U_2 = {0, 1, . . . , 10}. The discount factor is λ = 0.91, the weighting factor is α = 0.5, and the conditional probabilities P_{i,j,l} are computed according to (5) using 500 Monte Carlo simulations.
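For reference, the cost functions (25) with the weights above translate directly into code. In the sketch below, s[2] stands for the third hyper-state component s_{k,3}, i.e. (by the construction of the hyper-state) the belief b_{k,1} that the fault-free model is active.

```python
import numpy as np

# Weighting parameters from the example: p^u, p^s, q, and the reference r.
p_u = np.array([0.01, 0.03])
p_s = np.array([5.0, 1.0])
q = np.array([1.0, 1.0])
r = np.array([4.4, 1.0])

def L_d(d, s):
    """Detection cost in (25): the probability of a wrong decision,
    with s[2] the belief of the fault-free model (d = 1)."""
    return 1.0 - s[2] if d == 1 else s[2]

def L_c(s, u):
    """Control cost in (25): input effort plus a saturated tracking penalty."""
    return (np.sum(np.abs(p_u * u))
            + np.sum(p_s * (1.0 - np.exp(-q * (s[:2] - r) ** 2))))
```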

Since the problem is very large and the representation of P^{γ^(i)} using a full matrix would have high memory demands, a sparse matrix is used. Consequently, all corresponding computations are performed using specialized algorithms that take advantage of the sparsity and thus also reduce computational demands. The simulation environment is MATLAB R2013a on a machine with an Intel Core i5-3570 CPU @ 3.40 GHz and 8 GB RAM.

The AGPI algorithm is compared with the following algorithms to demonstrate its performance. An EPI algorithm performs the PE step exactly by solving the system of linear equations using the backslash operator that relies on UMFPACK routines. The GPI-F6 and GPI-F12 algorithms employ a fixed number of 6 and 12 SA, respectively, within the PE step. Finally, a GPI-P algorithm employs a pattern j̄ = 2 + 2i of SA in the i-th iteration of the GPI algorithm. The AGPI algorithm uses the accuracy threshold ψ = 0.5, and the bound j̄ < 110 on the maximum number of SA iterations is chosen based on experimental results of the EPI algorithm. Note that the comparison is meaningful since the proposed AGPI algorithm determines j̄ based on the accuracy threshold ψ and applies the same technique as the EPI algorithm when it is computationally more effective. The remaining parameters for the simulations are chosen to be ε_V = 10^{−5} and i_min = 30.

Fig. 1. A comparison of typical times required to perform one PE step, all PE steps, and the whole computation for different algorithms (EPI, AGPI, GPI-P, GPI-F6, GPI-F12): a) PE computational time, b) GPI computational time.

Fig. 1a) shows the computational times required by the PE step for the algorithms considered. The EPI and AGPI algorithms require more time to perform the PE step compared to the other algorithms. However, they terminate after performing only 14 GPI iterations, which is less than the 20 iterations of the GPI-F6 algorithm and the 18 iterations of the GPI-P algorithm. Consequently, the AGPI algorithm is faster than the EPI, GPI-P, and GPI-F6 algorithms considering the total computational time, as depicted in Fig. 1b). Since the formulation of the stopping conditions can influence the total computational time, it is necessary to mention that all the algorithms were terminated by the stopping condition that compares two consecutive policies. Despite the positive results, one should realize that the computational demands for sparse structures depend on the code implementation, such as the order of multiplication.

The results also show that the AGPI algorithm is by no means the most time-efficient algorithm. As shown in Fig. 1b), the GPI-F12 algorithm converges in 13 GPI iterations and takes less time to converge than the AGPI algorithm. One can observe that the GPI-F12 algorithm converges even faster than the EPI algorithm, as the approximate value functions of some policies were probably closer to the exact Bellman function. However, finding such a favorable fixed number of SA iterations j̄ is enormously computationally demanding because it means running the GPI-F algorithm for all possible j̄. The other algorithms also have design parameters that influence the convergence speed. The GPI-P algorithm requires choosing a specific pattern, and the analysis might be even more time consuming than in the case of the GPI-F algorithm. For these reasons, it seems that the AGPI algorithm may be the most practical option since the value of ψ can be determined based on the known problem definition. Finally, note that the computational demands of the EPI algorithm are enormous when sparsity is not considered and non-sparse algorithms are employed.

A simulation of the closed-loop system might be helpful for understanding the behavior of the designed active fault detector and controller. Typical trajectories for a time horizon of 72 hr are depicted in Fig. 2. The system initially starts fault-free with the initial state x_0 = [3, 2]^T. Except for oscillations caused by the state-space quantization, both components of x_k follow the reference. The change of the model is detected with a delay of one time step. Thus, both the detection aim and the control aim are fulfilled.



Fig. 2. Typical state, decision, and input trajectories for a time horizon of 72 hr: a) state trajectories (r_1, s_{k,1}, x_{k,1}, r_2, s_{k,2}, x_{k,2}), b) system detection (µ_k, s_{k,3}, d_k), c) system input (u_{k,1}, u_{k,2}).

7. CONCLUSION

The paper dealt with the adaptive generalized policy iteration algorithm that determines the number of successive approximations to speed up its convergence. The adaptive algorithm was used to design an active fault detector and controller, and the numerical example illustrated its performance. Despite the promising results, it would be interesting to make a detailed analysis of the sparse computations that can influence the computational requirements.

REFERENCES

Åström, K.J. (1965). Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications, 10, 174–205.
Baxter, J. and Bartlett, P.L. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15, 319–350.
Bellman, R.E. (1957). Dynamic Programming. Princeton University Press, Princeton, NJ, USA, 1st edition.
Bertsekas, D.P. (2001). Dynamic Programming and Optimal Control (Volume II). Athena Scientific, Belmont, Massachusetts, 2nd edition.
Bertsekas, D.P. and Tsitsiklis, J.N. (1996). Neuro-Dynamic Programming. Athena Scientific, Belmont, Massachusetts.
Blackmore, L. and Williams, B. (2006). Finite horizon control design for optimal discrimination between several models. In Proceedings of the 45th IEEE Conference on Decision and Control, 1147–1152. IEEE, San Diego, California.
Buşoniu, L., Babuška, R., De Schutter, B., and Ernst, D. (2010). Reinforcement Learning and Dynamic Programming Using Function Approximators. CRC Press, Boca Raton, Florida.

Campbell, S.L. and Nikoukhah, R. (2004). Auxiliary Signal Design for Failure Detection. Princeton University Press.
Ikonen, E. (2007). Model-based process control via finite Markov chains. In Proceedings of the 9th IFAC Workshop on Adaptation and Learning in Control and Signal Processing, 1–6. Saint Petersburg, Russia.
Ikonen, E. and Kortela, U. (2008). Adaptive process control using controlled finite Markov chains based on multiple models. In Proceedings of the 17th IFAC World Congress, 7919–7924. Seoul, Korea.
Isermann, R. (2011). Fault-Diagnosis Applications. Springer, Heidelberg, Germany.
Kerestecioglu, F. (1993). Change Detection and Input Design in Dynamical Systems. Research Studies Press, Taunton.
Korbicz, J., Koscielny, J.M., Kowalczuk, Z., and Cholewa, W. (2004). Fault Diagnosis: Models, Artificial Intelligence, Applications. Springer.
Lewis, F.L. and Liu, D. (2013). Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. John Wiley & Sons, Inc., Hoboken, New Jersey.
Lewis, F.L. and Vrabie, D. (2009). Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits and Systems Magazine, 40–58.
Niemann, H. (2006). A setup for active fault diagnosis. IEEE Transactions on Automatic Control, 51(9), 1572–1578.
Powell, W.B. (2007). Approximate Dynamic Programming: Solving the Curses of Dimensionality. John Wiley & Sons, Inc., Hoboken, New Jersey, 1st edition.
Puterman, M.L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., Hoboken, New Jersey.
Puterman, M.L. (2001). Dynamic programming. In Encyclopedia of Physical Science and Technology, volume 4, 673–696. Academic Press, 3rd edition.
Puterman, M.L. and Shin, M.C. (1978). Modified policy iteration algorithms for discounted Markov decision problems. Management Science, 24(11), 1127–1137.
Scott, J.K., Findeisen, R., Braatz, R.D., and Raimondo, D.M. (2014). Input design for guaranteed fault diagnosis using zonotopes. Automatica, 50(6), 1580–1589.
Seborg, D.E., Edgar, T.F., and Mellichamp, D.A. (2004). Process Dynamics and Control. John Wiley & Sons, Inc., Hoboken, New Jersey, 2nd edition.
Sutton, R.S. and Barto, A.G. (1998). Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA.
Šimandl, M. and Punčochář, I. (2009). Active fault detection and control: Unified formulation and optimal design. Automatica, 45(9), 2052–2059.
Šimandl, M., Široký, J., and Punčochář, I. (2011). New special cases of general active change detection and control problem. In Proceedings of the 18th IFAC World Congress, 4260–4265. Milano, Italy.
Škach, J., Punčochář, I., and Šimandl, M. (2014). Approximate active fault detection and control. Journal of Physics: Conference Series, 570(7).
Zhang, X.J. (1989). Auxiliary Signal Design in Fault Detection and Diagnosis. Springer-Verlag, Heidelberg, Germany.