Planning structural inspection and maintenance policies via dynamic programming and Markov processes. Part I: Theory

K.G. Papakonstantinou, M. Shinozuka

Department of Civil and Environmental Engineering, University of California Irvine, Irvine, USA
Keywords: Optimal stochastic control; Partially Observable Markov Decision Processes; Uncertain observations; Belief space; Structural life-cycle cost; Infrastructure management
Abstract

To address effectively the urgent societal need for safe structures and infrastructure systems under limited resources, science-based management of assets is needed. The overall objective of this two part study is to highlight the advanced attributes, capabilities and use of stochastic control techniques, and especially Partially Observable Markov Decision Processes (POMDPs), that can address the conundrum of planning optimum inspection/monitoring and maintenance policies based on stochastic models and uncertain structural data in real time. Markov Decision Processes are in general controlled stochastic processes that move away from conventional optimization approaches in order to achieve minimum life-cycle costs and advise decision-makers to take optimum sequential decisions based on the actual results of the inspections or non-destructive testing they perform. In this first part of the study we exclusively describe, out of the vast and multipurpose stochastic control field, methods that are fitting for structural management, moving from simpler to sophisticated techniques and modern solvers. We present Markov Decision Processes (MDPs), semi-MDP and POMDP methods in an overview framework, we relate each of these to the others, and we describe POMDP solutions in many forms, including both the problematic grid-based approximations that are routinely used in structural maintenance problems and the advanced point-based solvers capable of solving large scale, realistic problems. Our approach in this paper is helpful for understanding shortcomings of the currently used methods, related complications, possible solutions and the significance different solvers have not only on the solution but also on the modeling choices of the problem. In the second part of the study we utilize almost all presented topics and notions in a very broad, infinite horizon, minimum life-cycle cost structural management example and we focus on point-based solvers implementation and comparison with simpler techniques, among others.
1. Introduction

In this paper the framework of planning and making decisions under uncertainty is analyzed, with a focus on deciding optimum maintenance and inspection actions and intervals for civil engineering structures based on the structural conditions in real time. The problem of making optimum sequential decisions has a long history in a wide variety of scientific fields, such as operations research, management, econometrics, machine maintenance, control and game theory, artificial intelligence, robotics and many more. From this immense range of problems and methods we carefully chose to analyze techniques that can particularly address the engineering and mathematical problem of structural management, and we also present them in a manner that we think is most appropriate for the potential
interested readers, who are dealing with this particular problem and/or structural safety. A large variety of different formulations can be found addressing the problem of maintenance and management of aging civil infrastructure. In an effort to very succinctly present the most prevalent methodologies we classify them in five different general categories. The first category includes methods that rely on simulation of different predefined policies and indicative works can be found by Engelund and Sorensen [1] and Alipour et al. [2]. Based on the simulation results, the solution that provides the best performance among these scenarios is chosen, which could be the one with the minimum cost or cost/benefit ratio, etc. It is evident that a problem with this approach is that the chosen policy, although better than the provided alternatives, will hardly be the optimal among all the possible ones that can actually be implemented. In the second category we include methods that are usually associated with a pre-specified reliability or risk threshold and several different procedures have been suggested in the literature. In Deodatis et al. [3] and Ito et al. [4] the structure is
maintained whenever the simulation model is reaching the reliability threshold, while in Zhu and Frangopol [5] the same logic is followed with the exception that the maintenance actions to take at the designated times are suggested by an optimization procedure. Thoft-Christensen and Sorensen [6] and Mori and Ellingwood [7] pre-assume a given number of lifetime repairs, in order to avoid the discrete nature of this variable in their nonlinear, gradient-based optimization process, and based on their modeling they identify optimum maintenance times so that the reliability remains above the specified threshold. Zhu and Frangopol [5] also followed this approach but used a genetic algorithm, which has significant computational cost however, in order to drop the assumption of pre-determined number of lifetime repairs and to be able to model the available maintenance actions in a more realistic manner. Overall, the available methods in this category provide very basic policies and the simultaneous use of optimization algorithms in a probabilistic domain, in this context, usually compels use of rudimentary models. Unfortunately, this last statement, concerning a probabilistic domain, is also valid when the problem is cast in a generic optimization formulation, which we characterize as another category although the work in [5] would also fit in. Formulations in this class usually work well with deterministic models, the available number of possible different actions is typically greater than before and a multiobjective framework is enabled. The problem is frequently solved by genetic algorithms and a Pareto front is sought. The choice of genetic algorithms, or other heuristic search methods, for solving the problem is not accidental since these methods can also tackle the discrete part of the problem, like the number of lifetime actions and the chosen action type in each maintenance period. Unavoidably, the computational cost is significant nonetheless and probabilistic formats are problematic with these techniques. Representative works can be seen in [8–10], among others. All presented methods until now rely exclusively on simulation results and in essence do not take actual data into account in order to adjust or determine the performed actions, with the works in [3,4] being some sort of exception. While this may be sufficient for a variety of purposes, it is definitely incongruous for an applied, real world structural management policy. To address the issue a possible approach is suggested in the literature which is typically, but not utterly, associated with condition based thresholds. We classify these methods in a fourth category and a representative work can be seen in Castanier et al. [11]. The main idea behind these methods is to simulate deterioration based on a continuous state stochastic model, with Gamma processes being a favored candidate, and to set certain condition thresholds based on optimization, in between which a certain action takes place. Assuming perfect inspections, the related action is thus performed as soon as the structure exceeds a certain condition state during its lifetime. As probably understood already, the main weakness of this formulation is the usually unrealistic assumption about perfect observations. Due to this, although capabilities of the formulation are generally broad and versatile, including probabilistic outcome and duration of actions, the inspection part is lacking important attributes and analogous sophistication with other parts of the approach. 
A secondary concern with this approach can be also identified in the fact that the global optimum may be hard to find in non-convex spaces, although this is not a general limitation and is dependent on the specifics of the problem and the optimization algorithm used. In the fifth category we include models that rely on stochastic control and optimum sequential decisions and these are the models of further interest in this paper. These approaches usually work in a discrete state space, and like the ones in the previously described category also take actual, real-time data into account in order to choose the best possible actions. In their most basic form
of Markov Decision Processes (MDPs) these models share the limitation of perfect observations, although they can generally provide more versatile, non-stationary policies, and taking advantage of their particular structure the search for the global optimum is typically unproblematic. Indicative of the successful implementation of MDPs in practical problems, Golabi et al. [12] and Thompson et al. [13] describe their use with fixed biannual inspection periods in PONTIS, the predominant bridge management system used in the United States. Most importantly, however, as is also shown in detail in this paper, MDPs can be further extended considerably to a large variety of models and especially to Partially Observable Markov Decision Processes (POMDPs) that can take the notion of the cost of information into account and can even address the conundrum of planning optimum policies based on uncertain structural data and stochastic models. We believe that POMDP based models are adroit methods with superior attributes for the structural maintenance problem, in comparison to all other methods. They do not impose any unjustified constraints on the policy search space, such as periodic inspections, threshold performances, pre-determined number of lifetime repairs, etc., and can instead incorporate in their framework a diverse range of formulations, including condition-based, reliability and/or risk-based problems, periodic and aperiodic inspection intervals, perfect and imperfect inspections, deterministic and probabilistic choice and/or outcome of actions, perfect and partial repair, stationary and non-stationary environments, infinite and finite horizons, and many more. Representative works with a POMDP framework can be seen in Madanat and Ben-Akiva [14], Ellis et al. [15] and Corotis et al. [16], while further references about studies based on Markov Decision Processes are also given in the rest of this paper and in the second part of this work, [17]. To illustrate schematically a POMDP policy, with a minimum life-cycle cost objective, in a general, characteristic structural inspection and maintenance problem, Fig. 1 is provided. In this figure, the actual path of the deterioration process (continuous blue line) has been simulated based on one realization of a non-stationary Gamma process and is overall unknown to the decision-maker except when he decides to take an observation action. The gray area in Fig. 1 defines the mean ± 2 standard deviations uncertainty area which is given by the used stochastic model. This probabilistic outcome of the simulation model is the only basis for maintenance planning for the decision-maker when actual observation data cannot be taken into account. Even with an accurate stochastic model, the fact that the actual deterioration process is never observed will usually result in non-optimum actions, for a certain structure, since the realized process can be, for example, in percentiles far away from the mean. Taking observation data into account, the decision-maker can update his belief about the deterioration level of the structure according to his prior knowledge and the accuracy of observations. In Fig. 1 the belief updating is shown clearly based on the outcome of the first two different observation actions (marked with + in the figure). As seen, the first observation method is more accurate (probably at a higher cost), in comparison to the second, and directs the belief more effectively toward the true state of the system.

Fig. 1. A schematic POMDP policy for a structural inspection and maintenance problem (deterioration level versus time; showing the simulated deterioration process, the mean ± 2 s.d. uncertainty band, observation actions and maintenance actions).
Although rarely the case with structural inspection/monitoring methods, if a certain observation action can identify the state of the structure with certainty, the belief is then updated to this state with probability one. As is shown in detail in the rest of this paper, POMDPs plan their policy upon the belief state-space and this key feature enables them also to suggest times for inspection/monitoring and types of observation actions, without any restrictions, unlike any other method. Concerning maintenance actions, POMDPs can again optimally suggest the type and time of actions without any modeling limitations. Two different maintenance actions are shown as an example in Fig. 1, marked by the red rectangles. The length of the rectangles
indicates the duration time of actions. The first maintenance action is preceded by a quite precise observation action and substantially improves the condition of the structure. Since the outcome of maintenance actions is usually also probabilistic, the belief of the decision-maker over the structural deterioration level after the action is updated based on the observation and the performed maintenance. In the POMDP framework observation actions are not necessarily connected to maintenance actions and hence a computed policy may suggest a maintenance action at certain instances of the belief space without observing first. Such an occasion is depicted at the second maintenance action in Fig. 1, where the decision-maker does not want to pay the cost of information to update his belief and decides to maintain the structure regardless. Based on his prior belief and the probabilistic outcome of the performed action, which is of lower maintenance quality but less time demanding and costly than the previous one, the belief of the decision-maker is updated accordingly. Mainly due to the absence of real-time data in this case, the actual deterioration level is at a somewhat extreme percentile and the remaining uncertainty, after the action, is still considerable. Despite the fact that the maintenance and inspection problem has received considerable attention along the years and that POMDPs provide such a powerful framework for its solution, this is still not widely recognized today. We believe that one possible reason for this is that until recently a very serious limitation of POMDP models was that the optimal policy was impossible to be computed for anything but very small problems. Hence, significant available works in this area primarily, if not exclusively, focused on the modeling part and the important solving part of POMDP models was degraded in these works. This depressing news for solving the POMDP models did not motivate researchers and engineers to enter the field and unfortunately this is even currently so, despite recent, significant advances of POMDP solvers, mainly in the field of robotics. Addressing this issue in this paper, we exclusively describe out of the vast and multipurpose stochastic control field, methods that are fitting for structural management, starting from simpler to sophisticated techniques, and modern solvers capable of solving large-scale, realistic problems. More specifically, in Section 2 we briefly describe MDPs as a foundation for the rest of the paper and in Section 3 we explain how state augmentation works which is a valuable technique to form non-stationary problems, among others. In Section 4 we describe semi-MDPs, which can model the duration of actions, and we relate them to the important, for structural management, decision interval notion, while in Section 5 we explain why MDPs and semi-MDPs present intrinsic
limitations for our considered problem. We then explain in Section 6 in detail how POMDPs can give answers to all these limitations, we explain the belief updating and the belief space concept and overall we present this demanding topic in a concise and clear way. In Section 7 we examine solving techniques and we deliberately present simple approximate solvers for POMDPs that are directly based on MDPs, because currently structural management programs like PONTIS only rely on MDPs and hence these programs could be straightforwardly enhanced by these methods. We also present grid-based solvers, which are almost exclusively used in the literature today for structural maintenance problems with POMDPs, and we explain their inadequacies and we finally analyze point-based solvers that have the capability to solve larger scale problems. We believe our approach in this Part I paper is helpful for understanding shortcomings of the currently used methods, related complications and possible solutions and we hope that will also help interested readers understand the significance different POMDP solvers have, not only on the solution of course but also on the modeling of the problem, and why different researchers often choose specific models based on solver availability. Based on this Part I paper, in our Part II companion paper, [17], we utilize almost all presented topics and notions in a very broad, realistic, infinite horizon, minimum life-cycle cost example, with hundreds of states, choice availability of both inspection and maintenance actions, uncertain observations and non-stationarity, and we focus on point-based solvers implementation and comparison with simpler techniques, among others. Closing this introductory part it is important to mention that several other formulations for the structural maintenance problem exist which could either be integrated in the provided categories as model variations or could perhaps form new categories. Representative examples can be found in renewal theory concepts [18–20], continuous time Markov Decision Processes [21], and in review papers with valuable references, [22–25]. Notwithstanding the pluralism, however, of the methodologies, the sophistication and adeptness of POMDP formulations are exceptional.
2. Markov Decision Processes

Markov Decision Processes (MDPs) are controlled stochastic processes in which a decision-maker is uncertain about the exact effect of executing a certain action. An MDP assumes that at any time step t the environment is in a state s ∈ S, of a finite set of states, the agent takes an action a ∈ A, of a finite set of actions, and receives a reward (or cost) R(s,a) as a result of this action, while the
environment switches to a new state s' according to a known stochastic model, with transition probability P(s'|s,a). In the rest of this work only rewards will be referred to, since costs can simply be perceived as negative rewards. Thus, an MDP is a 4-tuple (S, A, P, R) and the Markov property implies that the past of s is independent of its future, conditional on the present. More formally, the next state $s_{t+1}$ only depends on the previous state $s_t$ and action $a_t$:

$$p(s_{t+1}|s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = p(s_{t+1}|s_t, a_t) \quad (1)$$

A decision problem within the MDP framework requires that the decision-maker finds a sequence of actions that optimizes some long-term reward objective function. Since the problem is stochastic, the objective function is typically additive and based on expectations. The most common function used is maximization of the total expected discounted rewards and this is the one considered in this work. For finite horizon models it is given by:

$$\max \ \mathrm{E}\left[\sum_{t=0}^{T} \gamma^t R_t\right], \quad (2)$$
where E[·] denotes the expectation operator, T is the planning horizon and γ (0 ≤ γ < 1) is the annual discount factor by which a future reward must be multiplied in order to obtain its present value. For infinite horizon problems T is set to ∞. For a whole spectrum of models and objective functions the reader is referred to [26]. Based on the maximization of the objective function an optimal policy π* (or plan, or control) can be found. The policy maps states to actions, π: S → A, and it can be stationary, if it is independent of the particular time step at which the decision-maker is executing the policy, or non-stationary, if it varies over different time steps. The optimal policy is usually non-stationary for the finite horizon model and stationary for the infinite discounted horizon model. In this work we will only consider deterministic policies, which are policies that assign actions to states deterministically. In general, stochastic policies are also feasible and can assign actions to states according to some probability distribution [26].
2.1. Dynamic programming

Dynamic programming is a family of techniques introduced by Bellman [27] while considering solutions to the problem of acting optimally in MDPs. One way to characterize an MDP policy is to consider its value function $V^{\pi}: S \to \mathbb{R}$, which represents the expected reward of some complete policy. For every state $s$, $V^{\pi}$ estimates the amount of discounted reward the decision-maker can gather when he starts in $s$ and acts according to $\pi$ in the future:

$$V^{\pi}_{t=0}(s) = R(s, \pi_0(s)) + \mathrm{E}\left[\sum_{t=1}^{T} \gamma^t R(s, \pi_t(s))\right], \quad (3)$$

where $\pi_0$, $\pi_t$ denote the policy at decision times 0 and $t$, respectively. A nice property of Eq. (3) is that the total expected reward for some policy can be decomposed into the expected reward associated with the first policy step and the expected reward for the remaining policy steps. Although this property does not help in finding the optimum policy if we consider going forward in time (the different scenarios that appear are just too many), it is of paramount importance if we begin at the end of time $T$, or otherwise when there are $n = 0$ steps remaining to reach the end at time $T$. In that case we can write:

$$V^{\pi_n}_{n}(s) = R(s, \pi_n(s)) + \gamma \sum_{s' \in S} p(s'|s, \pi_n(s))\, V^{\pi_{n-1}}_{n-1}(s'), \quad (4)$$

where $n$ represents the remaining steps to the end and, knowing the terminal value function $V^{\pi_{n-1}}_{n-1=0}(s)$, Eq. (4) can be solved recursively. $R(s, \pi_n(s))$ in Eq. (4) is expanded as:

$$R(s, \pi_n(s)) = R(s,a) = \sum_{s' \in S} p(s'|s,a)\, r(s,a,s'), \quad (5)$$

where $r$ is the reward for performing action $a$ and resulting in state $s'$ from state $s$. Combining Eqs. (4) and (5) and based on Bellman's principle of optimality [27], which states that any sub-policy of an optimal policy must also be optimal, the optimal value function $V^*$ for the $n$ remaining steps policy starting at state $s$ is:

$$V^*_n(s) = \max_{a \in A}\left[\sum_{s' \in S} p(s'|s,a)\, r(s,a,s') + \gamma \sum_{s' \in S} p(s'|s,a)\, V^*_{n-1}(s')\right], \quad (6)$$

where $V^*_{n-1}$ is the optimal value function for the $n-1$ remaining steps policy. Following Eq. (6), the optimal policy $\pi^*$ is given by:

$$\pi^*_n(s) = \arg\max_{a \in A}\left[R(s,a) + \gamma \sum_{s' \in S} p(s'|s,a)\, V^*_{n-1}(s')\right] \quad (7)$$

Eqs. (4) and (6) can be solved recursively starting from $n = 0$, as already stated, and this dynamic programming approach is called value iteration or backward induction. The whole operation is also known as a Bellman backup and, by defining the corresponding operator as $H_{\mathrm{MDP}}$, we can write Eq. (6) as the mapping:

$$V^*_n = H_{\mathrm{MDP}} V^*_{n-1} \quad (8)$$

The optimality equations can also be written using the so-called action-value functions or Q-functions. The relation between value and Q-functions is:

$$V^*_n(s) = \max_{a \in A} Q^*_n(s,a), \quad (9)$$

and so:

$$Q^*_n(s,a) = R(s,a) + \gamma \sum_{s' \in S} p(s'|s,a)\, V^*_{n-1}(s') \quad (10)$$

For the discounted, infinite horizon problem the decision-maker is interested in policies that are independent of time. For this reason, Eq. (6) is modified to the fixed point equation:

$$V^*(s) = \max_{a \in A} Q^*(s,a) = \max_{a \in A}\left[\sum_{s' \in S} p(s'|s,a)\, r(s,a,s') + \gamma \sum_{s' \in S} p(s'|s,a)\, V^*(s')\right], \quad (11)$$

and Eqs. (7) and (8) change consistently to:

$$\pi^*(s) = \arg\max_{a \in A} Q^*(s,a) = \arg\max_{a \in A}\left[R(s,a) + \gamma \sum_{s' \in S} p(s'|s,a)\, V^*(s')\right], \quad (12)$$

and:

$$V^* = H_{\mathrm{MDP}} V^* \quad (13)$$

The infinite horizon problem can also be solved with the value iteration method. The algorithm terminates when the difference between the value functions computed at two consecutive steps is smaller than a chosen ε-value. In the context of value iteration for infinite horizon problems, Eq. (6) is still valid, with the very important difference however that $n$ is irrelevant of time and merely represents iteration steps. In the infinite horizon case $V_{n=0}(s)$ is an arbitrary initial value function and not the terminal value function used in the finite horizon case. The value iteration algorithm can be implemented with several modifications in the infinite horizon case. For example, Gauss–Seidel style iterations are possible, or asynchronous dynamic programming backups, where exhaustive state space backups are avoided and arbitrary states are backed up in arbitrary order, provided that in the limit all states are visited infinitely often [28]. Alternate approaches for the computation of the optimal policy for the discounted, infinite horizon problem are policy iteration, modified policy iteration and linear programming. We have not
used these methods in this work however since value iteration is a more general algorithm, with easier applicability in the finite horizon and partial observability cases. For further information on these methods the reader is referred to [26].
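As an illustration of the recursions above, the following minimal sketch (our own, with an assumed array layout for the transition and reward models, not taken from the paper) implements infinite-horizon value iteration, Eqs. (11) and (12), with the ε-convergence test just described:

```python
import numpy as np

def value_iteration(P, r, gamma=0.95, eps=1e-6):
    """Infinite-horizon value iteration, Eqs. (11)-(12).
    P : array (|A|, |S|, |S|), P[a, s, s2] = p(s2|s, a)
    r : array (|A|, |S|, |S|), r[a, s, s2] = r(s, a, s2)
    Returns the optimal value function V* and a greedy policy."""
    num_a, num_s, _ = P.shape
    V = np.zeros(num_s)                       # arbitrary initial value function
    while True:
        # Q[a, s] = sum_s2 p(s2|s,a) * (r(s,a,s2) + gamma * V[s2]),  Eq. (10)
        Q = np.sum(P * (r + gamma * V[None, None, :]), axis=2)
        V_new = Q.max(axis=0)                 # Bellman backup, Eq. (11)
        if np.max(np.abs(V_new - V)) < eps:   # epsilon-convergence test
            return V_new, Q.argmax(axis=0)    # V* and greedy policy, Eq. (12)
        V = V_new
```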
3. State augmentation

In general, many situations can be found where some of the assumptions of the basic problem are violated. For example, the Markov property itself, Eq. (1), that the next state $s_{t+1}$ only depends on the previous state $s_t$ and action $a_t$, may be very prohibitive. In such cases, however, it is usually possible to reformulate the problem into the basic problem format. This process is called state augmentation because it typically involves the enlargement of the state space. One useful, often encountered and simple example to demonstrate the technique is a problem where the system state $s_{t+1}$ not only depends on the preceding state $s_t$ and action $a_t$ but on earlier states and actions as well; that is, the problem is history dependent. For simplicity, assume that the future state of the system depends on the present and one time period in the past. In this case the system equation has the form:

$$s_{t+1} = f_t(s_t, a_t, s_{t-1}, a_{t-1}), \qquad s_1 = f_0(s_0, a_0), \quad (14)$$

where $t$ is the time index and $f$ any appropriate function. Introducing additional state variables $x_t = s_{t-1}$, $y_t = a_{t-1}$, Eq. (14) can be written as follows:

$$\begin{pmatrix} s_{t+1} \\ x_{t+1} \\ y_{t+1} \end{pmatrix} = \begin{pmatrix} f_t(s_t, a_t, x_t, y_t) \\ s_t \\ a_t \end{pmatrix}, \quad (15)$$

and by defining $\tilde{s}_t = (s_t, x_t, y_t)$ as the new state, we can rewrite Eq. (14) as:

$$\tilde{s}_{t+1} = \tilde{f}_t(\tilde{s}_t, a_t), \quad (16)$$

where $\tilde{f}_t$ is defined in Eq. (15). By using Eq. (16) as the system equation and by expressing the reward function in terms of the new state, the problem is successfully reduced to the basic Markovian problem. It is easily understood that state augmentation is a powerful technique that allows proper description of numerous complicated problems. Of particular importance for this work are also cases where time is encoded in the state description. Based on this, non-stationary problems can be easily modeled and also any finite horizon problem can be converted to an infinite horizon one. For the latter, the addition of an absorbing terminal state, in which no reward can be obtained any more, is further required. Unfortunately, however, state augmentation often comes at the price that the reformulated problem may have complex and high dimensional state and/or action spaces. Further information on state augmentation, non-stationarity and a variety of history dependent problems can be seen in [26,29,30].
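As a brief illustration of encoding time in the state description, the sketch below (our own construction, not from the paper) packs the pair (condition state, year) into a single augmented state index so that a non-stationary, finite horizon model with year-dependent transition matrices fits the standard stationary MDP format, with an absorbing terminal state appended:

```python
import numpy as np

def augment_with_time(P_t):
    """P_t: list of length T of arrays (|A|, |S|, |S|), the year-dependent
    transition models p_t(s'|s, a). Returns a single stationary transition
    array over the augmented state (s, year), plus an absorbing terminal state."""
    T = len(P_t)
    nA, nS, _ = P_t[0].shape
    nAug = nS * T + 1                          # (state, year) pairs + terminal state
    P_aug = np.zeros((nA, nAug, nAug))
    for t in range(T):
        for s in range(nS):
            i = t * nS + s                     # augmented index of (s, t)
            for s2 in range(nS):
                if t + 1 < T:
                    P_aug[:, i, (t + 1) * nS + s2] = P_t[t][:, s, s2]
                else:
                    # last year: all mass moves to the absorbing terminal state
                    P_aug[:, i, nAug - 1] += P_t[t][:, s, s2]
    P_aug[:, nAug - 1, nAug - 1] = 1.0         # terminal state is absorbing
    return P_aug
```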
4. Semi-Markov Decision Processes
Semi-Markov Decision Processes (SMDPs) extend discrete-time MDPs by incorporating a continuous model of time. In a discrete-time Markov process, it is assumed that the amount of time spent in each state, before a transition occurs, is a unit time. In typical SMDP models, however, actions are modeled as taking random amounts of time but are only allowed at discrete points in time when an "event", or otherwise a change of state, occurs. Thus, a semi-Markov model approximates the dynamics of a continuous-time system by decomposing the process along times in which the state change occurs, and its time model reflects fluctuations of these changes in time. Based on these properties, the connection of a semi-Markov process with a Markov renewal process can also be easily understood, [21,26,31]. More formally, a SMDP is defined as a 5-tuple (S, A, P, R, F) where, as before, S is a finite set of states, A a finite set of actions, P the state transition probabilities, R the rewards and F the probabilities of transition times for each state-action combination, so that $F(t|s,a)$ is the probability that the next decision epoch occurs within $t$ time units after the decision-maker chooses action $a$ in state $s$ at a decision epoch. Defining $f(t|s,a)$ as the distribution of stochastic transition times, the probability $\varphi$ that the system will be in state $s'$ at the next decision epoch, at or before $t$ time units after choosing action $a$ in state $s$ at a decision epoch, can, in its general form, be computed by:

$$\varphi(t, s'|s,a) = \int_0^t p(\tau, s'|s,a)\, f(\tau|s,a)\, d\tau \quad (17)$$

For the discounted, infinite horizon case, the optimal value function $V^*$ is given by:

$$V^*(s) = \max_{a \in A}\left[R(s,a) + \sum_{s' \in S} \int_0^{\infty} e^{-\gamma t}\, \varphi(dt, s'|s,a)\, V^*(s')\right], \quad (18)$$

where the exponential term stands for the continuous time discounting model and R is defined accordingly, as an instant reward due to an action taken at a decision epoch plus an accumulated reward received over the transition time between decision epochs.

4.1. Decision interval

Although SMDPs extend MDPs considerably, they only allow decisions when a change of state occurs, as already explained. In the context of management of aging infrastructure this is very restrictive, since we obviously wish to allow decisions to be taken even if the state does not change at decision epochs. In [29] a formulation can be seen where decisions can be taken, for the discrete time case, at varying multiples of the unit time interval, Δt. For the discounted, infinite horizon case, the optimal value function $V^*$ in this case is given by:

$$V^*(s) = \max_{a \in A,\ \Delta t \in [1, N],\ N < \infty}\left[R(s,a,\Delta t) + \gamma^{\Delta t} \sum_{s' \in S} p(\Delta t, s'|s,a)\, V^*(s')\right], \quad (19)$$

where $p(\Delta t, s'|s,a)$ can be calculated by basic Markov chain principles. For example, for the simplest case where the action duration is ignored, Δt = 2 and a = 0 denotes the do-nothing action, $p(\Delta t, s'|s,a)$ is given by:

$$p(\Delta t = 2, s'|s,a) = \sum_{s_1 \in S} p(s'|s_1, 0)\, p(s_1|s,a) \quad (20)$$

$R(s,a,\Delta t)$ in Eq. (19) is calculated equivalently to Eq. (18), as an instant reward due to an action taken at a decision epoch plus an accumulated reward received over the transition time, Δt, between decision epochs. For example, a possible formulation for the previous simple case is written as:

$$R(s,a,\Delta t = 2) = r(s,a) + \gamma \sum_{s_1 \in S} p(s_1|s,a)\, r(s,a,s_1) + \gamma^2 \sum_{s' \in S} p(s'|s_1, 0)\, r(s_1, 0, s'), \quad (21)$$

and $r(s,a)$ is the reward for performing action $a$ from state $s$. Comparable formulations including the duration of actions might be more complicated to apply but are nonetheless still straightforward. It all comes down to a proper evaluation of transition probabilities and rewards. Eq. (17) is certainly helpful, while other derivations based on survival functions are also applicable. For example, for the do-nothing action it is often
assumed that an infrastructure asset can deteriorate only one state at a time, resulting in a transition probability matrix of:

$$P(t) = \begin{bmatrix} p_{1,1} & p_{1,2} & 0 & \cdots & \cdots & 0 \\ 0 & p_{2,2} & p_{2,3} & 0 & \cdots & 0 \\ \cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\ 0 & \cdots & \cdots & \cdots & p_{n-1,n-1} & p_{n-1,n} \\ 0 & 0 & \cdots & \cdots & 0 & p_{n,n} \end{bmatrix}, \quad (22)$$

where $p_{1,1}$ is the transition probability from state 1 to state 1, and accordingly for the rest. For this particular type of problem, concerning structural maintenance, and for practical applications, the common modeling assumption of a possible single state deterioration in one period is usually not as restrictive as it may seem, because the unit time interval can often be conveniently chosen in such a way that the introduced error, if any, is insignificant. Knowing the probability density function of the duration time in state $s$ at time $t$, $f_s(t)$, the survival function is given by:

$$S_s(t) = 1 - \int_0^t f_s(\tau)\, d\tau, \quad (23)$$

and further declaring $f_{s \to s'}(t)$, $S_{s \to s'}(t)$ the probability density function and survival function, respectively, of the time it will take the process to transit from state $s$ to state $s'$, the probability of moving into state $s' = 2$ at time $t + \Delta t$ is given by:

$$p(s'(t+\Delta t) = 2\,|\,s(t) = 1) = \frac{f_1(t)\,\Delta t}{S_1(t)}, \quad (24)$$

and in the general case:

$$p(s'(t+\Delta t) = i+1\,|\,s(t) = i) = \frac{f_{1 \to i}(t)\,\Delta t}{S_{1 \to i}(t) - S_{1 \to (i-1)}(t)}, \quad i = \{1, 2, \ldots, n-1\} \quad (25)$$

Similar expressions for SMDPs with discrete fixed and non-fixed decision intervals concerning asset management can be seen in [32–34]. In the probable case where, under this specific modeling type, other exogenous variables or time have to be incorporated in the state-space, state augmentation techniques, as already explained, can be easily utilized. Finally, the more complicated case where decisions are allowed at any time, in a continuous time format, is not described here, since it usually does not improve modeling quality in the context of maintenance of deteriorating infrastructure. The interested reader is however referred to [35].
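To make the decision interval computations of Eqs. (19)-(21) concrete, a minimal sketch follows (our own illustrative helper, not from the paper; action durations are ignored, a = 0 is taken as the do-nothing action, and the last term of Eq. (21) is evaluated in expectation over the intermediate state):

```python
import numpy as np

def interval_model(P, r, r_inst, a, dt, gamma=0.95):
    """Decision interval helpers in the spirit of Eqs. (19)-(21).
    P[a]      : (|S|, |S|) one-step transition matrix p(s'|s, a)
    r[a]      : (|S|, |S|) one-step transition rewards r(s, a, s')
    r_inst[a] : (|S|,)     instant reward r(s, a) at the decision epoch
    Returns p(dt, s'|s, a), cf. Eq. (20), and R(s, a, dt), cf. Eq. (21)."""
    P_dt = P[a].copy()                                  # state distribution after step 1
    R = r_inst[a] + gamma * np.einsum('ij,ij->i', P[a], r[a])
    for k in range(2, dt + 1):
        # expected do-nothing reward collected at step k, discounted by gamma**k
        R += gamma**k * np.einsum('ij,jk,jk->i', P_dt, P[0], r[0])
        P_dt = P_dt @ P[0]                              # chain do-nothing steps, Eq. (20)
    return P_dt, R
```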
5. Limitations of MDPs for infrastructure management

Until now the versatile nature of MDPs has been made clear. The transition matrices can be based on a large variety of stochastic processes, and stationary and non-stationary environments, infinite and finite horizons, periodic and aperiodic inspection intervals, history dependent actions and actions' duration can all be modeled. However, a basic assumption of MDPs is the fact that inspections always reveal the true state of the system with certainty. While many problems in infrastructure management can perhaps support this feature, there are equally many, if not more, occasions where such an assumption is unrealistic. Especially for the problem we are considering in the companion paper of this work, [17], of corroding reinforced concrete structures, this assumption is indeed unrealistic since, currently, all the non-destructive techniques available can only approximately evaluate the extent of steel damage due to corrosion. A secondary limitation, originating from the perfect inspections assumption, is the fact that at every decision epoch, and prior to an action, a perfect inspection is assumed to be performed. Thus, whenever an action is performed at a decision epoch, including the do-nothing action, an inspection necessarily has to precede it. Lastly, in the MDP framework the notion of the cost of information is lost, since all inspections are perfect. In reality, inspections of infrastructure facilities usually have a non-negligible cost and more accurate inspection techniques are self-evidently more expensive than cruder inspection methods. To address these limitations one initial suggestion in the literature, [36,37], was to construct an augmented state-space, based on all the possible combinations of actions and perfect (not necessarily coupled with actions) or imperfect inspection results, so that each state consists of an information vector with a possible history of actions and observations. Having all these states and their rewards, the problem can then be solved by a dynamic programming algorithm with deterministic transitions. Apparently, this methodology is of very limited use. For problems with large or infinite horizons and, in general, common, non-trivial problems, where the number of information vectors is large (or infinite), the computational requirements are prohibitive. This disappointing situation motivated efforts to look for quantities more appropriate than the information vectors and adequate enough for planning under uncertainty. Such quantities are called sufficient statistics and are basically the beliefs that a decision-maker may have over the states of the system. Techniques to perform planning over the uncertain, belief state-space are presented in the next section under the framework of Partially Observable Markov Decision Processes.

6. Partially Observable Markov Decision Processes

Partially Observable Markov Decision Processes (POMDPs) provide a flexible and mathematically sound decision making framework for partially observable environments. In situations where inspection techniques and observations do not reveal the true state of the system with certainty, only a belief $b$ over the states of the system, S, can be obtained. This belief is a probability distribution over S and a sufficient statistic of the history of actions and observations, meaning that knowing $b$, but not the full history, provides the decision-maker with the same amount of information. All beliefs are contained in a $|S|-1$ dimensional simplex. Having an initial belief about the system and after taking an action $a$ and observing $o$, the belief is easily updated by Bayes' rule:

$$b(s') = \frac{p(o|s',a)}{p(o|b,a)} \sum_{s \in S} p(s'|s,a)\, b(s), \quad (26)$$

where $p(o|b,a)$ is the usual normalizing constant:

$$p(o|b,a) = \sum_{s' \in S} p(o|s',a) \sum_{s \in S} p(s'|s,a)\, b(s) \quad (27)$$

Thus, a POMDP is defined as a 6-tuple (S, A, P, O, Po, R), where S, A and O are finite sets of states, actions and possible observations respectively, P the state transition probabilities, Po the observation probabilities modeling the effect of actions and states on observations, and R the rewards. The optimal value function for discounted, infinite horizon POMDPs can be given by:

$$V^*(b) = \max_{a \in A}\left[\sum_{s \in S} b(s) \sum_{s' \in S} p(s'|s,a)\, r(s,a,s') + \gamma \sum_{o \in O} \sum_{s' \in S} p(o|s',a) \sum_{s \in S} p(s'|s,a)\, b(s)\, V^*(b(s'))\right], \quad (28)$$

and in a more condensed form by:

$$V^*(b) = \max_{a \in A}\left[\sum_{s \in S} b(s)\, R(s,a) + \gamma \sum_{o \in O} p(o|b,a)\, V^*(b')\right] \quad (29)$$

Further defining the corresponding operator as $H_{\mathrm{POMDP}}$, we can write Eq. (29) as the mapping:

$$V^* = H_{\mathrm{POMDP}} V^* \quad (30)$$
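As an illustration of Eqs. (26) and (27), a minimal belief update sketch follows (the array layout is our own assumption, not from the paper):

```python
import numpy as np

def belief_update(b, a, o, P, Po):
    """Bayes' rule of Eqs. (26)-(27).
    b  : (|S|,) current belief;  a, o : indices of action and observation
    P  : (|A|, |S|, |S|)  P[a, s, s2] = p(s2|s, a)
    Po : (|A|, |S|, |O|)  Po[a, s2, o] = p(o|s2, a)
    Returns the updated belief b' and the normalizer p(o|b, a)."""
    predicted = b @ P[a]                 # sum_s p(s'|s,a) b(s)
    unnormalized = Po[a][:, o] * predicted
    p_o = unnormalized.sum()             # Eq. (27), probability of observing o
    return unnormalized / p_o, p_o
```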
It is thus easily seen that the basic form of the value function is the same as in a MDP, consisting of an immediate reward due to an action and expected future rewards. However, there are a few extra complications to consider in this case. Without knowing the exact state of the system, the value function is now defined over the continuous belief state-space which, as we noted already, represents probability distributions over the finite states of the model. Furthermore, since the decision-maker has no control over the observation outcome concerning the system, the value function is cast so that all possible observations after an action are taken into account using a weighted sum, based on the observation probabilities. In Eqs. (28) and (29) the immediate rewards do not depend on the observation outcomes since, as we stated, the decision-maker has no control over them. This is the value function form we used in the two companion papers. If, however, for whatever reason, it is desirable that the immediate rewards also depend on the observation results, the two equations can be straightforwardly modified accordingly. Until now we have only dealt with the discounted, infinite horizon POMDP case and we will continue alike in this paper. The finite horizon case can be easily understood based on the details we provided earlier for the MDP and the infinite case. Anyhow, as also stated in more detail in the state augmentation section, any finite horizon problem can be converted to an infinite horizon one. Although a POMDP is conceptually similar to a MDP, accurate enough planning for a POMDP is a far more difficult problem. The main difficulty is that the size of the policy space is much larger and continuous. In a problem with |S| states the belief state-space lies on a |S|-1 dimensional continuous space. This is sometimes referred to as the curse of dimensionality. Another important difficulty is related to the size of the space of reachable belief points, where a reachable belief point is one which is obtained when a decision-maker has an initial belief and follows some policy. This space is affected by the size of the action and observation spaces, as well as the length of the exploration horizon. This is sometimes referred to as the curse of history. A significant feature of POMDP models, which is of importance for their solution as well, is that their optimal or ε-optimal value functions are piecewise linear and convex (PWLC) for a finite horizon case, and can be approximated arbitrarily well by PWLC functions for infinite horizon tasks [38]. The convexity of the value function stems from the fact that the value of a belief close to one of the corners of the belief simplex (where things are fully certain) (Figs. 2 and 3) will be high, since the less uncertainty the decision-maker has over the true state of the system, the better decisions he can make and as such receive higher rewards. The piecewise linearity of the value function means that the function is composed solely of line segments or hyperplanes. There may be many hyperplanes that are combined together to make up the function, but at any belief point there is only one hyperplane that covers it. The set of vectors that help form the value function are called α vectors, $\alpha = \{\alpha_1, \alpha_2, \ldots, \alpha_k, \ldots\}$, and each vector consists of the $|S| = M$ coefficients of one of the hyperplanes of the piecewise linear function, $\alpha_k = [\alpha_k(s=1), \alpha_k(s=2), \ldots, \alpha_k(s=M)]$.
The gradient of the value function at any belief point is given by a corresponding α vector, and based on a set of α vectors the value function is written:

$$V^*(b) = \max_{\{\alpha_i\}_i} \sum_{s \in S} b(s)\, \alpha_i(s), \quad (31)$$

and Eq. (29) becomes:

$$V^*(b) = \max_{a \in A}\left[\sum_{s \in S} b(s)\, R(s,a) + \gamma \sum_{o \in O} p(o|b,a) \max_{\{\alpha_i\}_i} \sum_{s' \in S} b(s')\, \alpha_i(s')\right] \quad (32)$$
Fig. 2. Simplex of belief space for |S| = 3 and an example belief point, b = [0.1, 0.7, 0.2].

Fig. 3. Sample value function for |S| = 2. Each α vector defines a region over the belief simplex.

With this representation the value function over the continuum of points of the belief state-space can be represented with a
finite number of items, i.e. the α vectors. As the belief space is a simplex, each vector defines a region over the simplex which represents a set of belief states. Since there is only a finite number of α vectors, there is also only a finite number of regions defined over the simplex. In Fig. 2, the simplex for the |S| = 3 case can be seen, together with one example belief point on the simplex for a two dimensional projection. The simplex for the |S| = 2 case is simply a line and a representative convex value function consisting of α vectors is seen in Fig. 3. The value function, which is plotted in bold in Fig. 3, is generally defined as the upper surface of these vectors due to the max operator in Eqs. (31) and (32). The value function for the |S| = 3 case is seen in Fig. 4 as a three dimensional surface, comprised of planes, lying above the triangle simplex of Fig. 2. The planes (α vectors) impose a partition of the belief simplex and the borders of the partitions are projected on the triangle.
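For concreteness, evaluating a value function stored as a set of α vectors, Eq. (31), reduces to a maximization over dot products, as in the following minimal sketch (array shapes are our own assumption):

```python
import numpy as np

def value_from_alphas(b, alphas):
    """Eq. (31): V(b) = max_i sum_s b(s) * alpha_i(s).
    b      : (|S|,) belief point
    alphas : (k, |S|) one alpha vector per row
    Returns the value at b and the index of the supporting alpha vector."""
    scores = alphas @ b            # dot product of each alpha vector with the belief
    best = int(np.argmax(scores))
    return scores[best], best
```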
Fig. 4. Sample value function for |S| = 3.

6.1. Bellman backups

The value iteration method, described earlier for a MDP, is also a basic tool for POMDP planning. In this section we will show how the Bellman backup of a particular belief point, together with its supporting α vector, can be calculated. Rewriting Eq. (32) for the iterative step from n to n+1, instead of the final, optimum value function, results in:

$$V_{n+1}(b) = \max_{a \in A}\left[\sum_{s \in S} b(s)\, R(s,a) + \gamma \sum_{o \in O} p(o|b,a) \max_{\{\alpha_n^i\}_i} \sum_{s' \in S} b(s')\, \alpha_n^i(s')\right] \quad (33)$$

Expanding Eq. (33) based on Eq. (26) yields:

$$\begin{aligned} V_{n+1}(b) &= \max_{a \in A}\left[\sum_{s \in S} b(s)\, R(s,a) + \gamma \sum_{o \in O} p(o|b,a) \max_{\{\alpha_n^i\}_i} \sum_{s' \in S} \frac{p(o|s',a)}{p(o|b,a)} \sum_{s \in S} p(s'|s,a)\, b(s)\, \alpha_n^i(s')\right] \\ &= \max_{a \in A}\Bigg[\sum_{s \in S} b(s)\, R(s,a) + \gamma \sum_{o \in O} \max_{\{\alpha_n^i\}_i} \sum_{s \in S} b(s) \underbrace{\sum_{s' \in S} p(o|s',a)\, p(s'|s,a)\, \alpha_n^i(s')}_{g_{ao}^i(s)}\Bigg] \\ &= \max_{a \in A}\left[\sum_{s \in S} b(s)\, R(s,a) + \gamma \sum_{o \in O} \max_{\{g_{ao}^i\}_i} \sum_{s \in S} b(s)\, g_{ao}^i(s)\right], \end{aligned} \quad (34)$$

and using the identity:

$$\max_{\{y_i\}_i} [x\, y_i] = x \arg\max_{\{y_i\}_i} [x\, y_i], \quad (35)$$

Eq. (34) becomes:

$$\begin{aligned} V_{n+1}(b) &= \max_{a \in A}\left[\sum_{s \in S} b(s)\, R(s,a) + \gamma \sum_{s \in S} b(s) \sum_{o \in O} \arg\max_{\{g_{ao}^i\}_i} \sum_{s \in S} b(s)\, g_{ao}^i(s)\right] \\ &= \max_{a \in A}\Bigg[\sum_{s \in S} b(s) \underbrace{\left(R(s,a) + \gamma \sum_{o \in O} \arg\max_{\{g_{ao}^i\}_i} \sum_{s \in S} b(s)\, g_{ao}^i(s)\right)}_{g_a^b(s)}\Bigg] \\ &= \sum_{s \in S} b(s) \arg\max_{\{g_a^b\}_{a \in A}} \sum_{s \in S} b(s)\, g_a^b(s) \end{aligned} \quad (36)$$

Comparing the value function representation in terms of α vectors:

$$V_{n+1}(b) = \max_{\{\alpha_{n+1}^i\}_i} \sum_{s \in S} b(s)\, \alpha_{n+1}^i(s), \quad (37)$$

with Eq. (36), one gets:

$$\alpha_{n+1}^b = \arg\max_{\{g_a^b\}_{a \in A}} \sum_{s \in S} b(s)\, g_a^b(s), \quad (38)$$

which is the supporting α vector for a belief point b at the n+1 iterative step. Note that in general not only the computed α vector is obtained but the optimal action to take in the current step as well, since, as seen in Eq. (38), each vector is associated with a specific action. Thus, finding a vector that defines a region over the belief simplex also informs the decision-maker which action to take, in the case that his belief about the system belongs to this specific region of the simplex. Based on this observation the policy at b can also be given by:

$$\pi_{n+1}(b) = a(\alpha_{n+1}^b) \quad (39)$$

An equivalent way to compute the Bellman backup is to take the forward-projected belief (after a possible action and observation) and to maximize directly over the $\alpha_n^i$ vectors instead of calculating $g_{ao}^i$ first. Using again the identity in Eq. (35) and defining:

$$\alpha_{ao}^b = \arg\max_{\{\alpha_n^i\}_i} \sum_{s' \in S} p(o|s',a) \sum_{s \in S} p(s'|s,a)\, b(s)\, \alpha_n^i(s'), \quad (40)$$

the value function can be written:

$$V_{n+1}(b) = \max_{a \in A} \sum_{s \in S} b(s)\left[R(s,a) + \gamma \sum_{o \in O} \sum_{s' \in S} p(o|s',a)\, p(s'|s,a)\, \alpha_{ao}^b(s')\right], \quad (41)$$

and $g_a^b$ can be computed as:

$$g_a^b(s) = R(s,a) + \gamma \sum_{o \in O} \sum_{s' \in S} p(o|s',a)\, p(s'|s,a)\, \alpha_{ao}^b(s') \quad (42)$$

After careful observation, the complexity of computing an exhaustive backup, in big O notation, using the $g_{ao}^i$ vectors (first method) or the forward-projected beliefs (second method), is found to be $O(|S||A||O|(\beta|V| + |V||S|))$ and $O(|S||A||O|(\beta|V| + \beta|S|))$ respectively, where β is the number of beliefs that need to be backed up for $V_{n+1}$ and $|V|$ is the number of α vectors that describe the value function. For a more thorough explanation the reader is referred to [39]. Without going into much detail, because of notions that we have not officially introduced in this paper, the preferable backup method depends on the considered problem each time. The most obvious remark is that if β is larger than $|V_n|$ the first method is more efficient, in general. However, other important points of concern could be the generality and extendibility of the method, the sparsity of beliefs, batch backup performance, possible computer-memory limitations, etc.
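The point-wise backup of Eqs. (33)-(38) can be sketched compactly as follows (illustrative code with array shapes of our own choosing; it returns the supporting α vector of Eq. (38) and its associated action for a single belief point):

```python
import numpy as np

def backup(b, alphas, P, Po, R, gamma=0.95):
    """Point-based Bellman backup, Eqs. (33)-(38).
    b      : (|S|,) belief point          alphas : (k, |S|) current alpha vectors
    P      : (|A|, |S|, |S|) p(s'|s,a)    Po     : (|A|, |S|, |O|) p(o|s',a)
    R      : (|A|, |S|) expected immediate rewards R(s,a)
    Returns the new supporting alpha vector at b and its action, Eqs. (38)-(39)."""
    nA, nO = Po.shape[0], Po.shape[2]
    best_val, best_alpha, best_action = -np.inf, None, None
    for a in range(nA):
        # g_ao^i(s) = sum_s' p(o|s',a) p(s'|s,a) alpha_i(s'), Eq. (34)
        g = np.einsum('no,sn,in->ios', Po[a], P[a], alphas)   # (k, |O|, |S|)
        # for each observation o, pick the g_ao^i maximizing b . g_ao^i, Eq. (35)
        best_i = np.argmax(g @ b, axis=0)                     # (|O|,)
        g_ab = R[a] + gamma * sum(g[best_i[o], o] for o in range(nO))  # Eq. (36)
        val = b @ g_ab
        if val > best_val:
            best_val, best_alpha, best_action = val, g_ab, a
    return best_alpha, best_action
```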
7. Approximate POMDP planning
Until now we have seen that the value function consists of a finite number of α vectors, $|V|$, and that, given those, the updated α vector for a specific belief can be straightforwardly computed by a Bellman backup. Locating, however, all belief points that are required to compute all the necessary α vectors or, equivalently, enumerating the possible vectors, $|\Gamma|$, and then pruning useless vectors to end up with the exact $|V|$ is very costly. One of the simplest algorithms that tries to perform exact value iterations to find the optimum value function is due to Monahan [40]. The algorithm constructs a set of Γ vectors with $|\Gamma_{n+1}| = |A||V_n|^{|O|}$, based on all possible combinations of actions, observations and α vectors, and then identifies the redundant vectors, which are completely dominated by other vectors (Fig. 5), and prunes them to isolate the useful number of vectors, $|V_{n+1}|$. Unfortunately, it has been shown in practice that, besides the potential exponential growth of the number of useful vectors, identifying these vectors
cannot be done efficiently in time either. Other exact methods perform better than Monahan's algorithm but all of them still carry a very high computational cost, making them inefficient in computing optimal policies for anything but very small problems. Due to this limitation we have concentrated our attention on approximate value iteration methods in this work, which are suboptimal but can compute successful policies for much larger problems. For more information on exact methods for POMDPs the reader is referred to the review articles [40–42], as well as to [43]. A wide variety of value function approximation methods can be found in the literature. In this section, we only consider cases where a policy is found through the approximate value function representation and not by policy search algorithms, which try to directly optimize the policy. Review papers and reports on different approximate planning techniques can be found in [41,44,45]. The main idea of the value function approximation approach is to approximate the value function by a new function of lower complexity, which is easier to compute than the exact solution. In some cases, according to the used approximation, we are able to know whether it overestimates or underestimates the optimal value function, and this information on the bounds can be used in multiple ways (for convergence properties, for example). To define the upper bound, if $H$ is the exact value function mapping and $\tilde{H}$ its approximation, $\tilde{H}$ upper-bounds $H$ for some $V$ when $\tilde{H}V(b) \geq HV(b)$ holds for every $b$ of the belief simplex. An analogous definition stands for the lower bound as well.

Fig. 5. Pruning example. Redundant linear vectors are shown in gray.

Fig. 6. QMDP approximation of the value function, which upper bounds the exact function.
7.1. Approximations based on MDP and Q-functions

Since MDP planning is much simpler than POMDP planning, several methods have been suggested for the latter that use value function approximations based on the underlying MDP. One of the easiest methods to create a policy is the Most Likely State (MLS) method [44]. The MLS method assumes full observability by finding the state of the system with the highest probability, and executing the action that would be optimal for that state in the MDP, resulting in the policy:

$$\pi_{\mathrm{MLS}}(b) = \arg\max_{a \in A} Q^*_{\mathrm{MDP}}\left(\arg\max_{s \in S} b(s),\ a\right), \quad (43)$$

where $Q^*_{\mathrm{MDP}}$ are the action-value or Q-functions defined earlier in Eq. (10):

$$Q^*_{\mathrm{MDP}}(s,a) = R(s,a) + \gamma \sum_{s' \in S} p(s'|s,a)\, V^*_{\mathrm{MDP}}(s') \quad (44)$$

Another approximation variant based on the underlying MDP approximates the value function with the Q-functions and the method itself is also known as QMDP [44]. The value function (Fig. 6) and the resulting policy in this case are given by:

$$V_{Q_{\mathrm{MDP}}}(b) = \max_{a \in A} \sum_{s \in S} b(s)\, Q^*_{\mathrm{MDP}}(s,a), \qquad \pi_{Q_{\mathrm{MDP}}}(b) = \arg\max_{a \in A} \sum_{s \in S} b(s)\, Q^*_{\mathrm{MDP}}(s,a) \quad (45)$$

$V_{Q_{\mathrm{MDP}}}$ is piecewise linear and convex with at most $|A|$ useful linear functions (the action-value, Q-functions) and the time complexity of the method is $O(|A||S|^2)$, equal to MLS, when the policy of Eq. (45) is used. The QMDP approximation upper bounds the exact value function of the partially observable case (Fig. 6). The intuition behind this, without formally proving it here, is that QMDP takes into account only the uncertainty at the current step and assumes full observability in all future steps. Since this is an optimistic assumption, the method provides an optimistic estimation of the value function, given that with less information (or more uncertainty) the decision-maker cannot find better solutions and receive higher rewards. Both MLS and QMDP methods ignore partial observability and their policies, as given here, do not select information gathering actions, that is, actions that only try to gain more information about the status of the system. To improve these approximations and account, to some degree, for partial observability, the Fast Informed Bound (FIB) method has been proposed [45], which integrates the observation probabilities into the value function approximation as follows:

$$V_{\mathrm{FIB}}(b) = \max_{a \in A}\left[\sum_{s \in S} b(s)\left(R(s,a) + \gamma \sum_{o \in O} \max_{\{\alpha_i\}_{i=1}^{|A|}} \sum_{s' \in S} p(o|s',a)\, p(s'|s,a)\, \alpha_i(s')\right)\right], \quad (46)$$

where the at most $|A|$ linear functions, based on the Q-functions, have been written as α vectors by a slight abuse of notation. From Eq. (46) the FIB policy can be straightforwardly understood, and it is also easily conceived that $V_{\mathrm{FIB}}$ is piecewise linear and convex with at most $|A|$ useful linear functions. The time complexity of the method is $O(|A|^2|S|^2|O|)$, a significant reduction compared to the exact approach. As most likely recognized already, the main idea of the FIB method is to represent the value function based on the best linear function for every observation and every current state separately, in contrast to the exact approach where linear functions are sought that give the best result for every observation and the combination of all states. The value function in this method remains an upper bound of the exact value function, according to the same intuitive explanation provided for the QMDP case; however, it is guaranteed to be a tighter bound than the QMDP one.
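A minimal sketch of the QMDP approximation of Eq. (45) follows (array shapes are our own assumption); it simply weights the precomputed MDP Q-functions by the belief:

```python
import numpy as np

def qmdp_policy(b, Q_mdp):
    """QMDP approximation, Eq. (45).
    b     : (|S|,) belief point
    Q_mdp : (|A|, |S|) optimal MDP action-value functions, Eq. (44)
    Returns the approximate value V_QMDP(b) and the greedy action."""
    scores = Q_mdp @ b            # sum_s b(s) Q*(s,a) for each action a
    a_best = int(np.argmax(scores))
    return scores[a_best], a_best
```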
7.2. Grid-based approximations

A value function over the continuous belief space can also be approximated by a finite set of grid points $G = \{b_1^G, b_2^G, \ldots, b_N^G\}$ and an interpolation–extrapolation rule that estimates the value of an arbitrary point of the belief space based on the values of the points on the grid. A general, convex interpolation–extrapolation rule can be given as:

$V(b) = \sum_{i=1}^{N} \lambda^{b}(i)\, V(b_i^G)$,   (47)

where $\lambda^{b}(i) \geq 0\ \forall i$ and $\sum_{i=1}^{N} \lambda^{b}(i) = 1$. The values of the points on the grid can be computed by the value iteration method as:

$V_{n+1}(b_i^G) = \max_{a \in A}\Bigg[\sum_{s \in S} b_i^G(s)\, R(s,a) + \gamma \sum_{o \in O} p(o|b_i^G,a)\, V_n(b^{a,o})\Bigg]$.   (48)
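As an illustration of Eqs. (47) and (48), the following minimal Python sketch (ours, with illustrative array names) performs value iteration on a fixed set of grid beliefs, using for concreteness the nearest-neighbor rule, one of the convex rules discussed next; the belief update is the standard Bayes rule for POMDPs.

```python
import numpy as np

def belief_update(b, a, o, P, O):
    """Bayes update b'(s') proportional to p(o|s',a) * sum_s p(s'|s,a) b(s);
    also returns p(o|b,a). P[a, s, s'] = p(s'|s,a); O[a, s', o] = p(o|s',a)."""
    pred = b @ P[a]                       # predicted belief over s'
    unnorm = O[a, :, o] * pred
    p_o = unnorm.sum()
    return (unnorm / p_o if p_o > 0.0 else pred), p_o

def grid_value_iteration(grid, R, P, O, gamma, n_iter=200):
    """Fixed-grid value iteration, Eq. (48), with the nearest-neighbor
    interpolation-extrapolation rule (Eq. (47) with a single nonzero lambda).
    grid : (N, |S|) belief points covering the simplex; R : (|S|, |A|)."""
    nA = P.shape[0]
    nO = O.shape[2]
    V = np.zeros(len(grid))

    def interpolate(b):
        # value of the closest grid point: lambda^b(i) = 1 for the nearest neighbor
        return V[np.argmin(np.linalg.norm(grid - b, axis=1))]

    for _ in range(n_iter):
        V_new = np.empty_like(V)
        for i, b in enumerate(grid):
            q = np.empty(nA)
            for a in range(nA):
                future = 0.0
                for o in range(nO):
                    b_next, p_o = belief_update(b, a, o, P, O)
                    future += p_o * interpolate(b_next)
                q[a] = b @ R[:, a] + gamma * future
            V_new[i] = q.max()
        V = V_new
    return V
```

Note that the values are updated only at the grid points, so the returned approximation is piecewise constant over the belief simplex, one of the drawbacks discussed below.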
The computation of the λ coefficients varies according to the convex rule used (e.g. nearest neighbor, kernel regression, linear point interpolations and many others). In [41,45] several formulations of grid-based approximations can be seen, with different convex rules and grid selection alternatives (e.g. regular and non-regular grids, fixed or adaptive grids, etc.). One of the simplest grid-based representations consists of a fixed, regular grid, which partitions the belief space evenly into equal-size regions, and the nearest-neighbor interpolation–extrapolation rule. In the nearest-neighbor approach the value of a belief point is estimated using the value at the grid point closest to it, in terms of some distance metric defined over the belief space. Thus, in this case for any belief point there is exactly one nonzero coefficient, λ^b(i) = 1, and all others are zero. These coefficients, and consequently the value function approximation rule, are fixed in a dynamic programming context, and for this grid choice and convex rule the problem can be converted into a fully observable MDP with states corresponding to the grid points in G. Due to its simplicity this method is repeatedly used in the literature in problems concerning maintenance of civil structures, e.g. [46–48]. However, it is important to note that it suffers from many undesirable attributes that lead to inferior solutions in comparison with other approximation methods. First of all, fixed, regular grids are restricted to a specific number of points, and any refinement of the grid resolution results in an exponential increase of the grid size, preventing one from using the method with high grid resolution in problems with large state spaces. Furthermore, even with dense resolutions, the representation does not focus computation on the parts of the belief space that may be more useful (i.e. frequently appearing belief subspaces), the value function is represented as a piecewise constant function (instead of the true representation, which is piecewise linear), and no information about any bound (upper or lower) can be obtained.

An alternative, similar approach to interpolation–extrapolation rules is to use curve-fitting techniques. Curve fitting relies on a set of belief–value pairs but, instead of remembering all of them, it tries to summarize them, in the best possible way, in terms of a given parametric function model and a certain criterion (e.g. least-squares fit). The drawback of this approach is that, when combined with the value iteration method, it may lead to instability and/or divergence.

7.3. Point-based solvers

Point-based value iteration solvers are relatively new approximation methods that are currently the state of the art in POMDP planning. Their increased popularity comes from the fact that they can efficiently solve large problems that were almost impossible to solve adequately a few years back. Different point-based algorithms are currently available, having different characteristics
and implementation details. However, the core of the methods is the same. The main steps of these solvers are to use a simple lower-bound initialization for the value function over the belief simplex, to iteratively collect belief points that are likely to describe the system and may support a more accurate value function overall, and to perform backups, as in Subsection 6.1, not only of the value but also of the gradient (α vector) at these points, so as to improve the value function representation over the whole belief simplex in every iteration. Hence, point-based solvers have some similarities with the grid-based approaches. In both methods the value function is approximated based on a finite number of belief points; among the variety of convex interpolation–extrapolation rules, the α vectors are chosen in this case, which preserve the piecewise linear and convex property of the value function. In comparison to exact methods, which seek the exact set of points that cover all linear vectors defining the value function, point-based solvers use an incomplete set of belief points that are much easier to locate, chosen under some heuristic selection criteria, and they lower-bound the exact value function, since the set of belief points is incomplete and the initial value function used is a lower bound as well.

The Point Based Value Iteration (PBVI) class of algorithms, described in detail in [49,50], is conceptually the most straightforward implementation of point-based solvers. In PBVI a small initial set of belief points is chosen, together with a simple initialization of the value function. At each step the set can be expanded by selecting new reachable belief points, based on different heuristic criteria, until some horizon length T is reached. PBVI then performs a series of T full backup iterations, considering all belief points collected so far and starting each iteration with a value function empty of α vectors. After the series of backups finishes, the belief set is further expanded, and this procedure continues until some convergence criterion is met. In summary, PBVI expands the reachable belief point set gradually, performs full backups over all collected points and does not require pruning, since at each backup step the new set of α vectors is initialized with no vectors.

Another recent point-based solver is Perseus. The Perseus algorithm is described in [51] and is the algorithm we use in the second part of this work [17] to solve the detailed application of a structural maintenance problem. We therefore describe the algorithm in more detail there and only highlight its main characteristics in this paper. Unlike most point-based algorithms, which iterate between steps of belief point collection and backups, Perseus builds a fixed set of reachable belief points B at the start. B is constructed with less sophistication than the heuristic approaches of other point-based algorithms, but it usually contains many more points. Since the belief point set is extensive, Perseus performs only the fewest backups required at each iteration, working only on a subset of B, until it guarantees that the value function approximation is improved for all points in B. Like PBVI, Perseus does not require pruning, given that at every backup step the new value function is initialized without any α vectors.

The Heuristic Search Value Iteration (HSVI) algorithm is presented in [52,53].
HSVI possesses certain attributes which potentially make it very promising for considerably large POMDP problems with big state spaces. HSVI uses heuristics, based on both the upper and the lower bound of the value function, to collect belief points and guide the search in the belief space. One important feature of the algorithm is that this heuristic strategy may lead to faster identification of critical belief points. As in all point-based solvers, HSVI iteratively builds the value function, starting in this case from both a lower and an upper bound. After the collection phase of one algorithmic iteration finishes, HSVI backs up only the newest belief at each step, in a Gauss–Seidel style, working all the way back to the initial belief. Unlike PBVI and Perseus, this backup technique requires α vector pruning after some
algorithmic steps, since the value function may contain vectors that were useful at some step of the algorithm but became redundant in subsequent steps and only slow down the computational performance. Finally, SARSOP (Successive Approximation of the Reachable Space under Optimal Policies), an algorithm similar to HSVI, was introduced in [54]. SARSOP follows the same framework as HSVI but uses more sophisticated collection techniques, trying to restrict backups even further to the most important belief points. SARSOP still uses both bounds in its heuristics but enhances them with entropy concepts and clustering features, among others.
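To ground the common core of these solvers, the following minimal Python sketch (ours; array names and layouts are assumptions, and all belief-collection heuristics are omitted) performs the point-based backup at a single belief point, returning the new α vector that supports the lower-bound value function there.

```python
import numpy as np

def point_based_backup(b, alphas, R, P, O, gamma):
    """Backup at belief b (cf. Subsection 6.1): returns the alpha vector, and its
    value at b, that maximizes the one-step lookahead over the current vector set.
    b : (|S|,); alphas : (n, |S|) current lower-bound vectors;
    R : (|S|, |A|); P[a, s, s'] = p(s'|s,a); O[a, s', o] = p(o|s',a)."""
    nA = P.shape[0]
    nO = O.shape[2]
    best_vec, best_val = None, -np.inf
    for a in range(nA):
        g_a = R[:, a].astype(float)
        for o in range(nO):
            # g_{a,o}^i(s) = sum_s' p(o|s',a) p(s'|s,a) alpha_i(s'), for every current alpha_i
            g_ao = alphas @ (P[a] * O[a, :, o]).T          # shape (n, |S|)
            g_a = g_a + gamma * g_ao[np.argmax(g_ao @ b)]  # keep the best one for this o
        if g_a @ b > best_val:
            best_vec, best_val = g_a, float(g_a @ b)
    return best_vec, best_val
```

Perseus, for example, repeatedly applies this backup to beliefs sampled from its fixed set B until the value of every point in B has improved, whereas PBVI applies it to all collected points at every iteration; implementation details for Perseus are given in the companion paper [17].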
8. Conclusions

In this paper, stochastic control approaches that are appropriate for infrastructure management and minimum life-cycle costs are analyzed. We briefly describe Markov Decision Processes (MDPs) as a foundation for the rest of the paper and gradually advance to more sophisticated techniques and modern solvers capable of solving large-scale, realistic problems. In particular, we present the state-augmentation procedure, semi-MDPs and their broader association with important notions, as well as Partially Observable MDPs (POMDPs), which can efficiently address significant limitations of alternative methods. We also examine solution techniques, from simple approximate POMDP solvers that are directly based on MDPs and can be straightforwardly utilized by available structural management programs, to the inadequate grid-based solvers that are excessively used for maintenance problems with POMDPs, and finally to advanced, modern point-based solvers with enhanced attributes, capable of solving larger scale problems. Overall, it can be readily recognized in the paper that POMDPs extend studies based on alternative concepts, such as classic reliability/risk-based maintenance, and that they do not impose any unjustified constraints on the policy search space, like periodic inspection intervals, threshold performances, perfect inspections and many more. A clear disadvantage of POMDPs, however, is that they are difficult to solve, especially for large models with many states, and this paper helps explain the significance different POMDP solvers have both for the solution and for the modeling of the problem. Based on this Part I paper, the companion Part II paper [17] utilizes almost all presented topics and notions in a demanding minimum life-cycle cost application, where the optimum policy for a deteriorating structure consists of a complex combination of a variety of inspection/monitoring types and intervals, as well as maintenance actions and action times.
Acknowledgments

The work reported in this paper has been partially supported by the National Science Foundation under Grant no. CMMI-1233714. This support is gratefully acknowledged.

References

[1] Engelund S, Sorensen J. A probabilistic model for chloride-ingress and initiation of corrosion in reinforced concrete structures. Struct Saf 1998:20.
[2] Alipour A, Shafei B, Shinozuka M. Capacity loss evaluation of reinforced concrete bridges located in extreme chloride-laden environments. Struct Infrastruct Eng 2013:9.
[3] Deodatis G, Fujimoto Y, Ito S, Spencer J, Itagaki H. Non-periodic inspection by Bayesian method I. Probab Eng Mech 1992:7.
[4] Ito S, Deodatis G, Fujimoto Y, Asada H, Shinozuka M. Non-periodic inspection by Bayesian method II: structures with elements subjected to different stress levels. Probab Eng Mech 1992:7.
[5] Zhu B, Frangopol DM. Risk-based approach for optimum maintenance of bridges under traffic and earthquake loads. J Struct Eng 2013:139.
[6] Thoft-Christensen P, Sorensen J. Optimal strategy for inspection and repair of structural systems. Civil Eng Syst 1987:4.
[7] Mori Y, Ellingwood B. Maintaining reliability of concrete structures. II: optimum inspection repair. J Struct Eng ASCE 1994:120.
[8] Liu M, Frangopol D. Multiobjective maintenance planning optimization for deteriorating bridges considering condition, safety, and life-cycle cost. J Struct Eng ASCE 2005:131.
[9] Furuta H. Life cycle cost and bridge management. In: Chen S-S, Ang AH-S, editors. Frontier Technologies for Infrastructure Engineering, Structures and Infrastructures Series, Vol. 4. Leiden, The Netherlands: CRC Press; 2009.
[10] Lounis Z, Daigle L. Multi-objective and probabilistic decision-making approaches to sustainable design and management of highway bridge decks. Struct Infrastruct Eng 2013:9.
[11] Castanier B, Berenguer C, Grall A. A sequential condition-based repair/replacement policy with non-periodic inspections for a system subject to continuous wear. Appl Stoch Models Bus Industry 2003:19.
[12] Golabi K, Thompson P, Hyman W. Pontis Tech Man 1992.
[13] Thompson P, Small E, Johnson M, Marshall A. The Pontis bridge management system. Struct Eng Int 1998:8.
[14] Madanat S, Ben-Akiva M. Optimal inspection and repair policies for infrastructure facilities. Transp Sci 1994:28.
[15] Ellis H, Jiang M, Corotis R. Inspection, maintenance, and repair with partial observability. J Infrastruct Syst ASCE 1995:1.
[16] Corotis R, Ellis H, Jiang M. Modeling of risk-based inspection, maintenance and life-cycle cost with partially observable Markov decision processes. Struct Infrastruct Eng 2005:1.
[17] Papakonstantinou KG, Shinozuka M. Planning structural inspection and maintenance policies via dynamic programming and Markov processes. Part II: POMDP implementation. Reliab Eng Syst Saf 2014, http://dx.doi.org/10.1016/j.ress.2014.04.006, in press.
[18] Streicher H, Rackwitz R. Time-variant reliability-oriented structural optimization and a renewal model for life-cycle costing. Probab Eng Mech 2004:19.
[19] Rackwitz R, Joanni A. Risk acceptance and maintenance optimization of aging civil engineering infrastructures. Struct Saf 2009:31.
[20] Nicolai RP, Frenk JB, Dekker R. Modelling and optimizing imperfect maintenance of coatings on steel structures. Struct Saf 2009:31.
[21] Hu Q, Yue W. Markov Decision Processes with their Applications. New York, NY, USA: Springer; 2008.
[22] Dekker R. Applications of maintenance optimization models: a review and analysis. Reliab Eng Syst Saf 1996:51.
[23] Frangopol D, Kallen M-J, Noortwijk J. Probabilistic models for life-cycle performance of deteriorating structures: review and future directions. Prog Struct Eng Mater 2004:6.
[24] Noortwijk JM. A survey of the application of gamma processes in maintenance. Reliab Eng Syst Saf 2009:94.
[25] Frangopol D. Life-cycle performance, management, and optimisation of structural systems under uncertainty: accomplishments and challenges. Struct Infrastruct Eng 2011:7.
[26] Puterman M. Markov Decision Processes: Discrete Stochastic Dynamic Programming. 2nd ed. Hoboken, NJ, USA: Wiley; 2005.
[27] Bellman RE. Dynamic Programming. Princeton, NJ, USA: Princeton University Press; 1957.
[28] Bertsekas D. Dynamic Programming and Optimal Control, Vol. II. 1st ed. Nashua, NH, USA: Athena Scientific; 1995.
[29] White DJ. Markov Decision Processes. New York, NY, USA: Wiley; 1993.
[30] Bertsekas D. Dynamic Programming and Optimal Control, Vol. I. 3rd ed. Nashua, NH, USA: Athena Scientific; 2005.
[31] Ibe O. Markov Processes for Stochastic Modeling. San Diego, CA, USA: Elsevier Academic Press; 2009.
[32] Kleiner Y. Scheduling inspection and renewal of large infrastructure assets. J Infrastruct Syst ASCE 2001:7.
[33] Mishalani R, Madanat S. Computation of infrastructure transition probabilities using stochastic duration models. J Infrastruct Syst ASCE 2002:8.
[34] Yang YN, Pam HJ, Kumaraswamy MM. Framework development of performance prediction models for concrete bridges. J Infrastruct Syst ASCE 2009:135.
[35] Guo X, Hernández-Lerma O. Continuous Time Markov Decision Processes: Theory and Applications. Heidelberg, Germany: Springer; 2009.
[36] Bertsekas D. Dynamic Programming and Stochastic Control. New York, NY, USA: Academic Press; 1976.
[37] Madanat S. Incorporating inspection decisions in pavement management. Transp Res B 1993:27B.
[38] Sondik E. The optimal control of partially observable Markov processes [PhD Thesis]. Stanford University; 1971.
[39] Spaan M. Approximate planning under uncertainty in partially observable environments [PhD Thesis]. University of Amsterdam; 2006.
[40] Monahan GE. A survey of partially observable Markov decision processes: theory, models and algorithms. Manage Sci 1982:28.
[41] Lovejoy W. A survey of algorithmic methods for partially observed Markov decision processes. Ann Oper Res 1991:28.
[42] Cassandra A. Optimal policies for partially observable Markov decision processes. Report CS-94-14, Brown University; 1994.
[43] Cassandra A, Littman ML, Zhang NL. Incremental pruning: a simple, fast, exact method for partially observable Markov decision processes. In: Proceedings of the 13th Conference on Uncertainty in Artificial Intelligence; 1997.
[44] Cassandra A, Kaelbling LP, Kurien J. Acting under uncertainty: discrete Bayesian models for mobile-robot navigation. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Vol. 2. IEEE; 1996.
[45] Hauskrecht M. Value-function approximations for partially observable Markov decision processes. J Artif Intell Res 2000:13.
[46] Madanat S. Optimizing sequential decisions under measurement and forecasting uncertainty: application to infrastructure inspection, maintenance and rehabilitation [PhD Thesis]. Massachusetts Institute of Technology; 1991.
[47] Smilowitz K, Madanat S. Optimal inspection and maintenance policies for infrastructure networks. Comput-Aided Civil Infrastruct Eng 2000:15.
[48] Faddoul R, Raphael W, Chateauneuf A. A generalised partially observable Markov decision process updated by decision trees for maintenance optimisation. Struct Infrastruct Eng 2011:7.
[49] Pineau J, Gordon G, Thrun S. Point-based value iteration: an anytime algorithm for POMDPs. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI); 2003.
[50] Pineau J, Gordon G, Thrun S. Anytime point-based approximations for large POMDPs. J Artif Intell Res 2006:27.
[51] Spaan M, Vlassis N. Perseus: randomized point-based value iteration for POMDPs. J Artif Intell Res 2005:24.
[52] Smith T, Simmons R. Heuristic search value iteration for POMDPs. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI); 2004.
[53] Smith T, Simmons R. Point-based POMDP algorithms: improved analysis and implementation. In: Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI); 2005.
[54] Kurniawati H, Hsu D, Lee W. SARSOP: efficient point-based POMDP planning by approximating optimally reachable belief spaces. In: Proceedings of Robotics: Science and Systems; 2008.