Optimal policy for structure maintenance: A deep reinforcement learning framework

Structural Safety 83 (2020) 101906

Shiyin Wei a,b,c, Yuequan Bao a,b,c, Hui Li a,b,c,⁎



a Key Lab of Smart Prevention and Mitigation of Civil Engineering Disasters of the Ministry of Industry and Information Technology, Harbin Institute of Technology, Harbin 150090, China
b Key Lab of Structures Dynamic Behavior and Control of the Ministry of Education, Harbin Institute of Technology, Harbin 150090, China
c School of Civil Engineering, Harbin Institute of Technology, Harbin 150090, China

ARTICLE INFO

Keywords: Bridge maintenance policy; Deep reinforcement learning (DRL); Markov decision process (MDP); Deep Q-network (DQN); Convolutional neural network (CNN)

ABSTRACT

The cost-effective management of aged infrastructure is an issue of worldwide concern. Markov decision process (MDP) models have been used in developing structural maintenance policies. Recent advances in the artificial intelligence (AI) community have shown that deep reinforcement learning (DRL) has the potential to solve large MDP optimization tasks. This paper proposes a novel automated DRL framework to obtain an optimized structural maintenance policy. The DRL framework contains a decision maker (AI agent) and the structure that needs to be maintained (AI task environment). The agent outputs maintenance policies and chooses maintenance actions, and the task environment determines the state transition of the structure and returns rewards to the agent under given maintenance actions. The advantages of the DRL framework include: (1) a deep neural network (DNN) is employed to learn the state-action Q value (defined as the predicted discounted expectation of the return for consequences under a given state-action pair), either based on simulations or historical data, and the policy is then obtained from the Q value; (2) optimization of the learning process is sample-based so that it can learn directly from real historical data collected from multiple bridges (i.e., big data from a large number of bridges); and (3) a general framework is used for different structure maintenance tasks with minimal changes to the neural network architecture. Case studies for a simple bridge deck with seven components and a long-span cable-stayed bridge with 263 components are performed to demonstrate the proposed procedure. The results show that the DRL is efficient at finding the optimal policy for maintenance tasks for both simple and complex structures.

1. Introduction

Civil infrastructure experiences performance degradation due to aging, environmental erosion, natural or man-made hazards, etc. Structural condition assessment and maintenance have therefore been major concerns worldwide [1,6,14]. Life-cycle management is widely used to maximize the cost-effectiveness of maintenance actions, and decisions on maintenance actions are based on the maintenance policy. Several types of policy have been developed over the last five decades. Preventive maintenance is the most important, and includes time-based maintenance (TBM) and condition-based maintenance (CBM). TBM is conducted at regular intervals, whereas CBM is performed according to the inspected or predicted structural conditions [15,30,31]. In general, CBM is better than TBM because of the additional structural information provided by an inspection.

CBM is usually based on a deterioration model, a reward model, and a core policy-decision model [6]. The deterioration model [5,7], which describes the deterioration of the infrastructure under different maintenance policies, assesses the effectiveness of a maintenance action (including no repair) by predicting the associated consequences. The reward model, which is related to system performance and objectives, assesses the costs of different maintenance actions. The core policy-decision model aims to make rational and cost-effective maintenance decisions based on the reward, the structural performance, and the constraints. Of all the decision models, Markov decision process (MDP) models are the most frequently employed by managers in practice. For example, the core methodologies behind the bridge management system (BMS) software packages PONTIS and BRIDGIT are both based on MDPs [9,18,29]. An MDP [2,10,26,28] uses a finite state space to represent the structural conditions, and the Markov property holds in this state space [16], i.e., future states are independent of the historical states given the current state.

Corresponding author at: School of Civil Engineering, Harbin Institute of Technology, Harbin 150090, China. E-mail address: [email protected] (H. Li).

https://doi.org/10.1016/j.strusafe.2019.101906 Received 10 March 2018; Received in revised form 17 October 2019; Accepted 3 November 2019 0167-4730/ © 2019 Elsevier Ltd. All rights reserved.


The decision maker chooses a plan from a finite set of maintenance actions based on the observations from an inspection of the structural condition, performs the maintenance action on the structure, and receives a reward as a consequence of the action. The effects of natural deterioration, hazards, and maintenance actions are all described by transition matrices among the possible states, which are based on a physical model, expert experience, or statistics from the bridge management history. Generally, worse conditions cause a higher risk of structural failure and financial loss. A maintenance action can improve the structural condition and reduce these risks, but has a certain cost associated with it. Therefore, this process is naturally an optimization problem. The Bellman equation provides a mathematical framework for MDPs, and dynamic programming (DP) and linear programming (LP) algorithms are frequently employed to obtain the optimal maintenance policies [11,17,22]. DP and LP algorithms require the expectation of the return reward, which is calculated using the Bellman equation. However, this calculation is expensive and inefficient for problems with large state or action spaces [26].

From the perspective of artificial intelligence (AI), the maintenance policy-making problem can be treated as a special case of reinforcement learning (RL) and solved using a family of efficient sampling algorithms, such as the bootstrapping temporal-difference (TD) method and Monte-Carlo tree search (MCTS). Hence, a general deep reinforcement learning (DRL) framework is proposed here for structural maintenance policy decisions. The optimization is sample-based, so it can learn directly from simulations or from real historical data collected from multiple bridges. A deep neural network (DNN)-structured agent enables the framework to be used for various bridges with little change to the network architecture (depth and layer size).

RL is inspired by the trial-and-error process of human learning in psychology. The agent (the decision maker in the maintenance task) learns to perform optimal actions by interacting with the task environment (the bridge in the maintenance task). In these interactions, the agent performs maintenance actions in the task environment based on the current state, and the environment responds to the agent by returning the state transition and a reward (Fig. 1). The reward is an indicator of the goodness of the action performed under the given state and is usually expressed as a financial cost or benefit. From such interactions, the agent acquires knowledge about the task environment and assigns credit to maintain a value function, from which an optimized policy is obtained. DRL approximates the value function using DNNs [19,25] for problems with large state or action spaces.

DRL has been recognized as an important component for constructing general AI systems and has been applied to various engineering tasks for decision-making and control, e.g., video games [19], robots [13], question-answering systems [4], and self-driving vehicles [23].

This study proposes a DRL framework for the automated policy making of bridge maintenance actions. Section 2 establishes the DRL framework for maintenance and introduces the optimization method for learning the optimal maintenance policy. From the perspective of DRL, as shown in Fig. 1, the task environment comprises the physical properties of the bridge structure, the agent represents the BMS, an action is one of the possible maintenance actions, and the reward corresponds to the financial cost of the maintenance action and the possible associated risks under the given conditions. Section 3 illustrates the application of the DRL framework to a general bridge maintenance task using both a simple bridge deck structure and a complex cable-stayed bridge, and compares the performance of the proposed DRL method with other hand-crafted CBM and TBM policies. The conclusions and discussion are summarized in Section 4.

2. DRL framework for structural maintenance policies

DRL is employed to obtain the maintenance policy that keeps a bridge in acceptable condition while minimizing the costs of maintenance. Deep learning (DL) approximates the value function with a DNN, and RL provides the policy-improvement mechanism.

2.1. Q-learning and Deep Q-Network (DQN)

DRL-based maintenance policy making is based on MDP models, which are frequently employed to describe the processes of structural maintenance. An MDP is a tuple ⟨S, A, P, R, γ⟩, where S is the structural state space (i.e., the discrete structural rating set according to the inspection manual, such as very good, good, fair, poor, urgent, or critical); A = {a1, a2, …, am} is the possible maintenance action space (i.e., the predefined maintenance actions with m action levels, such as no maintenance, minor maintenance, major maintenance, and replacement); and P = P(St+1 | St, At) is the state transition probability matrix, i.e., the probability that the structure transitions from state St in year t to St+1 in year t + 1 when maintenance action At is performed in year t. When the action At is 'no repair', the structural state degrades probabilistically due to natural erosion or damage; when the maintenance action At is of a higher level (minor maintenance, major maintenance, or replacement), the structural state is probabilistically improved by the maintenance. Given the state of the bridge, a maintenance decision is made according to the maintenance policy, defined as the conditional probability of an action under the given state, π(At | St) = P(At | St); a good policy should account for the consequences of a given maintenance action. Here, the reward function Rt = R(St, At) serves as an indicator of the consequence, R = E[Rt | St], and γ ∈ [0, 1] is a discount factor that accounts for long-term rewards. The reward in this study is defined as the negative of the maintenance cost plus the structural risk: effective maintenance may increase the maintenance costs while decreasing the structural risks. The goal of DRL-based maintenance policy making is to learn the maintenance policy that maximizes the total reward over the entire lifespan of the structure.
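For illustration, the sketch below casts this MDP as a minimal, single-component maintenance environment in Python. The transition matrices, cost rates, discount factor, and horizon are placeholder values chosen only to make the sketch runnable; they are not the matrices or costs used in the case studies of Section 3.

```python
import numpy as np

# Minimal sketch of the maintenance MDP <S, A, P, R, gamma>.
# All numbers below are illustrative placeholders, not the paper's values.
N_STATES, N_ACTIONS, HORIZON, GAMMA = 6, 4, 100, 0.95

# P[a] is a 6x6 matrix: row i gives P(S_{t+1} | S_t = i, A_t = a).
P = np.zeros((N_ACTIONS, N_STATES, N_STATES))
P[0] = 0.9 * np.eye(N_STATES) + 0.1 * np.eye(N_STATES, k=1)   # no repair: may degrade one level
P[0][-1, -1] = 1.0                                             # worst state is absorbing
P[1] = 0.7 * np.eye(N_STATES) + 0.3 * np.eye(N_STATES, k=-1)   # minor repair: may improve one level
P[1][0, 0] = 1.0
P[2] = 0.3 * np.eye(N_STATES) + 0.7 * np.eye(N_STATES, k=-1)   # major repair: improves more often
P[2][0, 0] = 1.0
P[3][:, 0] = 1.0                                               # replacement: back to condition 0

ACTION_COST = np.array([0.0, 0.1, 0.3, 1.0])                   # cost rate per action level
RISK_COST = np.array([0.01, 0.01, 0.02, 0.03, 0.1, 0.3])       # risk rate per condition rating

def step(state, action, rng):
    """One yearly transition of the task environment: returns (next_state, reward)."""
    next_state = rng.choice(N_STATES, p=P[action, state])
    reward = -(ACTION_COST[action] + RISK_COST[state])          # negative cost, as in the paper
    return next_state, reward

# Roll out one episode (one life cycle) under a random policy.
rng = np.random.default_rng(0)
s, total = 0, 0.0
for t in range(HORIZON):
    a = rng.integers(N_ACTIONS)
    s, r = step(s, a, rng)
    total += GAMMA ** t * r
print("discounted return of a random policy:", round(total, 3))
```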
All the states, maintenance actions, and reward functions are defined a priori based on the inspection and maintenance manual and the cost criterion, while the state transition probability matrix is obtained from practical experience or from a numerical model of the structure. A sequence of bridge states, maintenance actions, and rewards that depicts the entire history of the RL task following a certain policy π is denoted as an episode: S1, A1, R1, S2, …, RT−1, ST, AT, RT ∼ π. A probabilistic version and a sample version of a trajectory following a policy π are shown in Fig. 2(a) and (b), respectively. The optimal policy should consider all the sequences over the entire lifespan of the bridge.

Fig. 1. Schematic of reinforcement learning [26] and the mapping relation with the structural maintenance task.


Fig. 2. Trajectory following the policy π [26]. The hollow nodes denote states, the solid nodes denote actions, and the rectangular hollow node denotes the terminal state.

Therefore, the return Gt, which balances the short- and long-term rewards (where T is the lifespan of the bridge), and the state-action value function Q(St, At), which is its expectation, are introduced as:

$$G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}, \qquad Q(S_t, A_t) = \mathbb{E}[G_t \mid S_t, A_t] \tag{1}$$

Given the value of Q(·), one can find the optimal maintenance action that maximizes the expectation of the return. When the state and action spaces are relatively large, calculating Q(·) exactly is very expensive; a deep Q-network (DQN) provides an efficient way to approximate this value. The optimization method is the sample-based training illustrated in Fig. 2(b), where the sampled data are used to train the DQN. In practical applications, the data can be sourced from bridges and components in similar natural environments (a big-data source) or from simulations between the BMS agent and the bridge (a simulation data source). Field and experimental results can be embedded in the physical properties of the task environment by specifying the state transition matrices and the reward functions. In this way, the agent can account for engineering concerns (such as rare hazard events) by learning to recognize the task environment through the interactions. The DQN is written as Q_w(s, a) ≈ Q(s, a), parameterized by w. A DNN-structured Q network is employed in this study because of its powerful nonlinear representation and mapping capacity [19,24]. The incremental form of the iteration in the training is:

$$Q(S_t, A_t)^{+} = Q(S_t, A_t) + \alpha\left[R_t + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\right] \tag{2}$$

where α is the user-defined learning rate, δ = Y − Q(St, At) is the MC error, and Y = Gt = Σ_{k=0}^{T−t−1} γ^k R_{t+k+1} is the MC target. The MC target Y is generated from the interactions between the BMS agent and the bridge structure in a Monte-Carlo sampling style under the given policy. Given the structural state St of year t, the maintenance action is sampled from the policy, At ∼ π(a | St), the reward is calculated as Rt = R(St, At), and the consequent state is obtained as St+1 ∼ P(s | St, At). The simulation continues over the episode until t = T. These interactions generate a large dataset of transition tuples (St, At, Rt, St+1) for the episode, and Y is calculated thereafter. In this way, Eq. (2) implements the expectation in Eq. (1) using MC simulations. The general transition data (St, At, Rt, St+1, Y) are employed as the training dataset.

The input of the neural network is the state s = St and the output is the corresponding state-action value Q(St, a; w), where the weights w are updated by minimizing the mean-squared error (MSE):

$$J(w) = \mathbb{E}\left[\tfrac{1}{2}\left(Y - Q_w(S_t, a)\right)^2\right] \tag{3}$$

via stochastic gradient descent (SGD) [12]:

$$\Delta w = \alpha\,\mathbb{E}\left[\left(Y - Q_w(S_t, a)\right)\nabla_w Q_w(S_t, a)\right] \tag{4}$$

However, the Q-value network may not give the exact value, because it is trained on information from the MC target Y, which depends on the explored episodes. During training, it is better to explore than to exploit [26]. To explore possible policies that have not yet been encountered in the experience, the ε-greedy search method is employed during training (explore new possibilities with probability ε while exploiting the known policy with probability 1 − ε, where ε decays as learning proceeds):

$$\pi(A_t \mid S_t) = \begin{cases} 1 - \epsilon + \epsilon/m & \text{if } A_t = \arg\max_a Q(S_t, a) \\ \epsilon/m & \text{otherwise} \end{cases} \tag{5}$$

where m is the number of legal actions in each step. In addition, constraints can be imposed on the policy during training to reflect restrictions in real life. For example, if the maximum maintenance cost is limited to C_max, the ε-greedy policy may be changed to the following: if Ct ≥ C_max,

$$\pi(A_t \mid S_t) = \begin{cases} 1 & \text{if } A_t = 0 \\ 0 & \text{otherwise} \end{cases}$$

else (Ct < C_max):

$$\pi(A_t \mid S_t) = \begin{cases} 1 - \epsilon + \epsilon/m & \text{if } A_t = \arg\max_a Q(S_t, a) \\ \epsilon/m & \text{otherwise} \end{cases}$$

All the examples in Section 3 are unconstrained; however, constraints can easily be considered in this way. The pseudo-code of Q-learning comprises five steps (initialization, simulation, storing, updating, and iteration), which are detailed in Section 3.1.1.
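A minimal tabular Python sketch of this Q-learning loop (ε-greedy simulation, Monte-Carlo targets Y = Gt, and the incremental update of Eq. (2)) is given below. The single-component `step` model and its cost values are illustrative assumptions; the paper's implementation instead approximates Q with the DQN of Section 2.2 and trains it on batches drawn from a memory buffer.

```python
import numpy as np

# Tabular sketch of the sampling-based Q-learning described above:
# run episodes with an epsilon-greedy policy, compute Monte-Carlo targets
# Y = G_t, and move Q(S_t, A_t) toward Y with learning rate alpha.
N_STATES, N_ACTIONS, HORIZON = 6, 4, 100
GAMMA, ALPHA, EPS_MIN = 0.95, 0.1, 0.05

rng = np.random.default_rng(1)

def step(s, a):
    """Illustrative environment: action 0 may degrade the state, higher actions repair it."""
    if a == 0:
        s_next = min(s + (rng.random() < 0.1), N_STATES - 1)
    else:
        s_next = max(s - a, 0) if rng.random() < 0.7 else s
    cost = [0.0, 0.1, 0.3, 1.0][a] + [0.01, 0.01, 0.02, 0.03, 0.1, 0.3][s]
    return s_next, -cost                      # reward = negative cost

Q = np.zeros((N_STATES, N_ACTIONS))
for episode in range(5000):
    eps = max(EPS_MIN, 1.0 - episode / 4000)  # epsilon decays during training
    s, traj = 0, []
    for t in range(HORIZON):                  # simulate one life-cycle episode
        if rng.random() < eps:
            a = int(rng.integers(N_ACTIONS))  # explore
        else:
            a = int(np.argmax(Q[s]))          # exploit
        s_next, r = step(s, a)
        traj.append((s, a, r))
        s = s_next
    G = 0.0                                   # backward pass: MC targets Y = G_t
    for s_t, a_t, r_t in reversed(traj):
        G = r_t + GAMMA * G
        Q[s_t, a_t] += ALPHA * (G - Q[s_t, a_t])

print("greedy action per condition state:", np.argmax(Q, axis=1))
```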

2.2. CNN

DNNs can have a large number of parameters spread over deep layers. A convolutional neural network (CNN) is a kind of DNN with an architecture that can learn spatio-temporal features. It has fewer, but more efficient, shared parameters, which are in convolutional form. Besides the deep architecture, three properties contribute to the efficiency of CNNs: local connections, shared weights, and pooling [12,27,32]. Unlike a fully connected network, in which the output nodes of each layer are connected to all the input nodes in the next layer, the output nodes of a CNN are connected to a limited number of input nodes in a local patch. The connections are represented by a set of weights called the convolutional kernel. Kernels are designed to learn the locally correlated features of the input and are shared by different patches within the same layer; therefore, the shared parameters in the same layer detect the same pattern in different parts of the input. The CNN used in this paper contains two stages in each convolutional layer, a convolutional stage and a nonlinear activation stage. The convolutional operation and nonlinear activation function in each convolutional layer are

$$y = \sigma(k * x + b) \tag{6}$$

where x is the input of the layer (the encoding of the state), * is the convolutional operator, k is the trainable convolutional kernel whose size is a hyperparameter, b is the additive bias, and σ(·) is the nonlinear activation function. The ReLU function [8] is used in this paper: σ(x) = max(x, 0).
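To make the operation in Eq. (6) concrete, the following NumPy sketch spells out one convolutional stage followed by the ReLU activation for a single channel with unit stride and no padding; the input and kernel values are arbitrary.

```python
import numpy as np

def conv2d_relu(x, k, b):
    """y = relu(k * x + b): one convolutional stage plus nonlinear activation (Eq. 6).
    x: 2-D input (e.g., the encoded state), k: 2-D kernel, b: scalar bias."""
    kh, kw = k.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    y = np.empty((out_h, out_w))
    for i in range(out_h):                      # slide the shared kernel over local patches
        for j in range(out_w):
            y[i, j] = np.sum(x[i:i + kh, j:j + kw] * k) + b
    return np.maximum(y, 0.0)                   # ReLU: sigma(x) = max(x, 0)

x = np.random.default_rng(0).random((7, 7))     # e.g., a 7 x 7 encoded state
k = np.random.default_rng(1).standard_normal((3, 3))
print(conv2d_relu(x, k, b=0.1).shape)           # -> (5, 5)
```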

Fig. 3 shows the DRL architecture used in Cases 1 and 2 in Section 3. It has four stages: (1) one-hot encoding, which encodes the state and time into normalized inputs; (2) feature learning, which is implemented by the hidden convolutional layers; (3) Q-learning, which is implemented with three fully connected layers, where the last layer is reshaped to be the output Q value of the given state; and (4) policy making, which implements the ε-greedy policy in Eq. (5).

Fig. 3. DRL architecture containing the four stages: (1) state encoding, (2) feature learning by the CNN, (3) Q learning by a fully connected network, and (4) policy making using the ε-greedy method based on the output of the Q network. The input is the stack of the one-hot encoding of the structural conditions and the binary coding of the relevant year. The input state is best reshaped to approximately a square to fit the 2-D CNN.

3. Case studies

3.1. Case I: a simple bridge deck system

A bridge deck system (Fig. 4), described in an inspection report published by the Ministry of Transportation of Quebec (MTQ) [20], is the first example used to illustrate the DRL framework for maintenance policy decisions. The database is part of a comprehensive system used to manage highway structures in Quebec; only the environments and the structural materials are considered in the statistics. The dataset contains information up to the year 2000 on 9,678 province-owned bridges. The transition probability matrices P of the deck system components under a moderate environment were calculated with Bayes' rule. The maintenance task is treated at the component level, such that the states and actions are all defined per component. Seven components are considered in the MTQ deck system: the wearing surface, the drainage system, exterior face 1, exterior face 2, end portion 1, end portion 2, and the middle portion, as shown in Fig. 4. The components are numbered in this order.

Fig. 4. Representative transverse and longitudinal sections [20].

3.1.1. Problem formulation

An optimal maintenance policy usually depends on how the bridge changes under different maintenance levels, the objectives of the maintenance (i.e., bridge performance objectives and maintenance cost objectives), and the constraints. In the DRL framework, both the maintenance level and the bridge condition are discretized: maintenance actions are categorized from 0 to 3 and the condition is rated from 0 to 5. Deterioration is modeled by the state transition matrix under the various maintenance types. The objectives are considered in the reward model; they are related to the financial costs and the structural performance. Constraints can be imposed in the greedy policy, as illustrated in Eq. (5).

States. In the MTQ inspection report, the structural conditions of all the deck system components are rated from 0 to 5 according to the severity of the material defects and the percentage loss of the component's cross-section and surface area along its length: 0 is 'very good' and 5 is 'critical'. In addition, the age of the bridge is also important for the maintenance policy, since maintenance decisions may differ for the same structural conditions in different life-cycle years. For example, for a bridge with a 100-year life expectancy in a reasonably good condition, the optimal policy may be a minor repair in year 30 but no repair in year 99. Therefore, the state S in the DRL framework stacks the components' structural conditions and the age to form a vector of length 8. A trick here is to encode the structural condition of each component as a one-hot vector of length 6, so that the encoding of all the components forms a 2-D array of size 7 × 6. The age (on a scale of 100) is encoded in binary form as a vector of length 7, since 2^7 = 128; for example, the binary encoding of 49 is 0110001. In this way, the states are normalized, and the stacked state is a 2-D array of size 7 × 7. The size of the state space is |S| = 100 × 6^7 ≈ 2.8 × 10^7. An example of the encoded feature of a random state St = [0, 4, 4, 3, 1, 2, 3, 49] is shown in Fig. 5(a).

Maintenance actions. Actions are divided into four discrete levels according to the maintenance effect and cost [21]: no repair (0), minor repair (1), major repair (2), and replacement (3). Each component is assigned one of these maintenance levels, so the maintenance action can be represented by a vector of length 7, and the corresponding one-hot encoding is a 2-D array of size 7 × 4, as shown in Fig. 5(b). The size of the action space is |A| = 4^7 ≈ 1.6 × 10^4.

Fig. 5. Sample illustrations of the state and action coding: St = [0, 4, 4, 3, 1, 2, 3, 49] and At = [3, 0, 2, 3, 3, 1, 1].
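The state and action encodings just described can be written compactly as in the sketch below, which reproduces the example of Fig. 5. Appending the 7-bit age as the last column of the 7 × 7 array is an assumption about the layout; the paper only states that the condition and age encodings are stacked.

```python
import numpy as np

def encode_state(conditions, age, n_ratings=6):
    """Stack the 7 one-hot condition vectors (7 x 6) with the 7-bit binary age
    to form the 7 x 7 input array described above."""
    onehot = np.eye(n_ratings)[conditions]               # shape (7, 6)
    age_bits = [(age >> (6 - i)) & 1 for i in range(7)]  # 7-bit binary, MSB first (49 -> 0110001)
    return np.hstack([onehot, np.array(age_bits).reshape(7, 1)])  # shape (7, 7)

def encode_action(actions, n_levels=4):
    """One-hot encoding of the 7 component-level maintenance actions (7 x 4)."""
    return np.eye(n_levels)[actions]

S_t = encode_state([0, 4, 4, 3, 1, 2, 3], age=49)   # example state from Fig. 5(a)
A_t = encode_action([3, 0, 2, 3, 3, 1, 1])          # example action from Fig. 5(b)
print(S_t.shape, A_t.shape)                          # (7, 7) (7, 4)
```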

State transition matrices. The state transition matrices fully depict the dynamic properties of the bridge and can be obtained from data using Bayes' rule or from a deterioration model developed from field and laboratory experiments. In this application, the state transition matrices are obtained from the statistical results in [20]. The natural environment is considered to be moderate, and the inspection interval is adjusted to 1 year in this study. The transition matrices differ under the various action levels: the matrices for no maintenance are listed in Fig. 6, while Fig. 7 lists the matrices for the maintenance actions of the different levels. A basic and rational assumption is that, without any maintenance action, a component deteriorates by at most one level per year. The element p_ij in each matrix represents the probability that the component transitions from state i to state j under the corresponding maintenance action level. Note that although the correlation of the state transitions between different components is neglected, the CNN is powerful at representing correlations. The effect of a maintenance action is assumed to depend on the structural condition and the maintenance action, and is determined by the maintenance level via the state transition matrices. Therefore, if the condition is worse, the maintenance effect is lower for the same maintenance action; as the level of the maintenance action increases, the maintenance effect improves for the same structural condition. Replacement transforms a component from any condition to condition 0. To simplify the problem formulation, all the deck system components share the same state transition probability matrices, as shown in Fig. 7.

Fig. 6. State transition probability matrices under a moderate environment and no maintenance.

Rewards. Rewards are defined as a combination of the financial maintenance costs and the structural risks, as in Eq. (7). The negative value is taken to be consistent with the literal meaning of 'reward'. Rewards are the user-specified goals for maintaining the bridge, and the optimal policy leads to maintenance actions that maximize the total reward. Reference values can be based on an elaborate cost metric such as that listed at http://sv08data.dot.ca.gov/contractcost/, or specified by the manager. However, the exact values are not important in this study, so the costs are specified by the authors. The unit of the financial costs is set to 1, since the relative values are more important than the absolute ones. The maintenance cost of a deck system component c is assumed to depend on the state of the deck system component s and the maintenance action a as follows:

$$R(c, s, a) = \mathrm{cost}_{\mathrm{total}}(c) \times \mathrm{rate}_{\mathrm{condition}}(s) \times \mathrm{rate}_{\mathrm{action}}(a) + \mathrm{cost}_{\mathrm{total}}(c) \times \mathrm{rate}_{\mathrm{risk}}(s) \tag{7}$$

where cost_total(c), rate_condition(s), and rate_action(a) are the cost rates that depend on the deck system component c, the state of the deck system component s, and the maintenance action a, respectively; rate_risk(s) is the structural risk rate related to the bridge state, and cost_total(c) × rate_risk(s) measures the risk as a probabilistic financial cost, as listed in Table 1.

Fig. 7. State transition probability matrices for all deck system components.

Table 1. Reward rate depending on the condition rating and the maintenance level (rate_action × rate_condition).

rate_action \ rate_condition   0.80 (cond. 0)   0.85 (cond. 1)   0.90 (cond. 2)   0.95 (cond. 3)   1.0 (cond. 4)   1.0 (cond. 5)
0.0 (action 0)                 0.0 × 0.80       0.0 × 0.85       0.0 × 0.90       0.0 × 0.95       0.0 × 1.0       0.0 × 1.0
0.1 (action 1)                 0.1 × 0.80       0.1 × 0.85       0.1 × 0.90       0.1 × 0.95       0.1 × 1.0       0.1 × 1.0
0.3 (action 2)                 0.3 × 0.80       0.3 × 0.85       0.3 × 0.90       0.3 × 0.95       0.3 × 1.0       0.3 × 1.0
1.0 (action 3)                 1.0 × 0.80       1.0 × 0.85       1.0 × 0.90       1.0 × 0.95       1.0 × 1.0       1.0 × 1.0

Note: the total costs of the deck system components are 80, 60, 80, 60, 100, 120, and 100, respectively; the risk rates of all components for the six conditions are 0.01, 0.01, 0.02, 0.03, 0.1, and 0.3 times the total cost, respectively. In the next section, the rewards (costs) of each year are normalized by the sum of the total costs of all deck components (i.e., 600).
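Eq. (7), Table 1, and the note above translate into a small reward routine such as the sketch below, which hard-codes the published rates. Summing over the seven components and normalizing by the total cost of 600 follows the note; how the summation is organized inside the authors' code is an assumption.

```python
import numpy as np

TOTAL_COST = np.array([80, 60, 80, 60, 100, 120, 100])        # per component (Table 1 note)
RATE_CONDITION = np.array([0.80, 0.85, 0.90, 0.95, 1.0, 1.0]) # per condition 0..5
RATE_ACTION = np.array([0.0, 0.1, 0.3, 1.0])                  # per action level 0..3
RATE_RISK = np.array([0.01, 0.01, 0.02, 0.03, 0.1, 0.3])      # per condition 0..5

def reward(conditions, actions):
    """Eq. (7) summed over the 7 deck components, normalized by the total cost
    of all components (600), with the negative sign giving a 'reward'."""
    conditions, actions = np.asarray(conditions), np.asarray(actions)
    maintenance = TOTAL_COST * RATE_CONDITION[conditions] * RATE_ACTION[actions]
    risk = TOTAL_COST * RATE_RISK[conditions]
    return -(maintenance + risk).sum() / TOTAL_COST.sum()

print(reward([0, 4, 4, 3, 1, 2, 3], [0, 1, 2, 1, 0, 0, 1]))   # yearly reward for the Fig. 8 example
```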

DRL paradigm. The size of the Q-value table for the deck system maintenance task is |S||A| = 100 × 6^7 × 4^7, which is too large to be solved using the DP method. In the DRL framework, the state St is treated as the input and the state-action value Q(St, a) is approximated by the outputs of the network. The ε-greedy policy in Eq. (5) is employed to sample the maintenance action in the training step. The convolutional DQN architecture shown in Fig. 3(a) is employed to obtain the optimal maintenance actions for the encoded input state. The input size is 7 × 7 × 1, where the third dimension is added to make it easier to connect to the neural network. Next, there are four convolutional layers, with the sizes of the layers being 7 × 7 × 4, 4 × 4 × 16, 2 × 2 × 32, and 1 × 1 × 64, respectively. The kernel sizes of the layers are 3 × 3, 3 × 3, 3 × 3, and 2 × 2, respectively, and the stride sizes of the layers are 2 × 2, 2 × 2, 2 × 2, and 1 × 1, respectively. Fig. 8 illustrates a sample state, the CNN-approximated Q value, and the ε-greedy actions under the given state.

The training procedure for the DRL network follows the pseudo-code in Section 2.1 and comprises five steps. (1) Initialization: the training starts from arbitrarily initialized parameters w and policy π_w(·). (2) Simulation: initialize St, and run MC simulations from t = 1 to t = 100 under the given policy π_w(·) and the transition matrices P to collect the maintenance transition data (St, At, Rt, St+1). (3) Storing: calculate the MC target Y = Gt = Σ_{k=0}^{T−t−1} γ^k R_{t+k+1} and save (St, At, Rt, St+1, Y) to the memory buffer M. (4) Updating: update the network parameters w, and hence the policy π_w, using batch data sampled from M. (5) Iteration: repeat steps (2)–(4) until convergence. In step (2), the network determines the DRL policy π_w(·), which takes the state St as input and outputs Q_w(St, a); the maintenance action is sampled from the output policy, At ∼ π_w(a | St). In step (4), the network is updated using a batch sampled from the memory buffer M: the network takes the batch of states {St} as input, outputs the corresponding {Q_w(St, a)}, and {At, Y} are used to update the parameters via Eq. (4). The capacity of the memory buffer M is 10^4 and the training batch size is 10^3. Once the memory buffer is full, the newly generated simulation data (St, At, Rt, St+1, Y) overwrite the oldest entries, so that the network is always trained on the most recently updated dataset.
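For orientation, a tf.keras sketch of a convolutional DQN along the lines described in the DRL-paradigm paragraph is shown below. The filter counts follow the quoted layer sizes, but the padding, the strides (chosen here so that the feature maps actually reach the quoted sizes), and the fully connected widths are assumptions; the authors' exact implementation is available in the linked repository.

```python
import tensorflow as tf

def build_deck_dqn(n_components=7, n_actions=4):
    """Sketch of the convolutional DQN for the deck system: a 7x7x1 encoded state in,
    a 7x4 table of Q-values (one row per component, one column per action level) out."""
    inputs = tf.keras.Input(shape=(7, 7, 1))
    x = tf.keras.layers.Conv2D(4, 3, strides=1, padding="same", activation="relu")(inputs)   # 7x7x4
    x = tf.keras.layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(x)       # 4x4x16
    x = tf.keras.layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(x)       # 2x2x32
    x = tf.keras.layers.Conv2D(64, 2, strides=2, padding="same", activation="relu")(x)       # 1x1x64
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(128, activation="relu")(x)       # fully connected Q-learning stage
    x = tf.keras.layers.Dense(n_components * n_actions)(x)     # one Q-value per (component, action)
    q_values = tf.keras.layers.Reshape((n_components, n_actions))(x)
    return tf.keras.Model(inputs, q_values)

model = build_deck_dqn()
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")  # MSE objective of Eq. (3)
model.summary()
```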

Fig. 8. Sample: the state is s = [0, 4, 4, 3, 1, 2, 3, 49] and the greedy action is a = [0, 1, 2, 1, 0, 0, 1].

The CNN-structured DQN is built using TensorFlow, and the task environment is implemented according to the standard of the OpenAI Gym environment. The learning rate is set to α = 0.001, the discount factor is set to γ = 0.95, and the parameter ε decays as the iterations proceed. All the results are obtained on a desktop PC running 64-bit Ubuntu with an i7-4770 processor at 3.4 GHz and 8 GB of RAM. The detailed source code is available at https://github.com/HIT-SMC/Bridge-maintenance.

3.1.2. Results

The performance of the DRL policy during training is shown in Fig. 9. The total cost is the sum of the maintenance costs over the simulation (the first term of Eq. (7), cost_total(c) × rate_condition(s) × rate_action(a)), and the DQN loss is the mean square error J(w). The DRL policy rapidly converges from a randomly initialized policy to the policy with the lowest cost after 10,000 training steps, in approximately 25 min. The model still explores new state-action pairs in search of a better policy (represented as spikes on the cost and DQN-loss curves and marked by the rectangle) owing to the ε-greedy mechanism. In each new exploration, the costs increase together with the DQN loss. However, because the optimal policy has already been found and will not change in the specified task environment, the DRL policy quickly reverts to the optimal solution after each exploration, although exploration continues throughout training.

Fig. 9. Performance of the DRL policy during training. The rectangle indicates the exploration spikes.

Table 2 compares the normalized costs of the different maintenance policies (DRL, time-based, and condition-based) over 1,000 MC simulations of bridge maintenance from t = 1 to 100. In each simulation, the maintenance action At is sampled from the given policy under the specific state St, and the consequent state St+1 is sampled from the state transition matrices P(· | St, At). 'Time-X' denotes the time-based policy of making minor repairs on all deck components every X years (X = 5, 10, 15, 20). 'Condition-X' denotes the condition-based policy of making minor repairs on the deck system components whose conditions are worse than condition X (X = 1, 2, 3). The comparison shows that the DRL policy is optimal among these policies: it has the lowest average normalized life-cycle cost of −1.3885. The Condition-1 policy, with a similar normalized average life-cycle cost of −1.3938, is ranked second, while the other policies cost significantly more.

Table 2. Case I: total cost comparison (1,000 simulations).

Policy          DRL       Condition-1   Condition-2   Condition-3   Time-5    Time-10   Time-15   Time-20
Total cost (μ)  −1.3885   −1.3938       −1.6427       −2.0669       −2.6008   −1.9287   −1.9133   −2.1070
Total cost (σ)  0.0685    0.0663        0.0994        0.1649        0.0199    0.0988    0.2371    0.4057

The life-cycle condition distributions in Fig. 10(a) show that the DRL and Condition-1 policies behave similarly: the deck system components mainly remain in condition 1 or 0. This may be attributed to the formulation of the reward. The yearly structural risk rate related to the structural condition, rate_risk(s), is set to 0.01, 0.01, 0.02, 0.03, and 0.1 for conditions 0 to 4, which assumes that conditions 0 and 1 are the same from the structural-risk and structural-performance perspective, while the worse conditions cost more. Therefore, the top two policies opt for maintenance when the condition is worse than 1 and always keep the structural condition between 0 and 1.

Fig. 10(b) compares the number of maintenance actions and the action distributions in every 5-year period. The results show that the time-based policies perform maintenance actions uniformly at the specified interval, whereas the condition-based policies do not opt for much maintenance in the early years because of the initially good condition of the bridge. The DRL policy opts for fewer actions during both the early and the later years. This suggests that in the last few years of the life cycle the DRL policy requires less frequent maintenance, because the risk expectation reduces and the benefit of maintenance and the risk consequence of no repair may reach a balance under the terminal constraint. The Condition-1 policy, in contrast, depends only on the condition and is independent of the time (terminal constraint).

The transition details (St, At, Rt) of the MC simulations of bridge maintenance under the DRL, Condition-1, and Time-15 policies are shown in Figs. S1, S2, and S3 in the Supplementary material. The simulation cases show that the optimal DRL policy behaves like Condition-1, i.e., it chooses minor repairs for components in a condition worse than 1 but opts for no repair when the time is close to the terminal year. Major repair and replacement are not selected during the entire maintenance trajectory, which may be because, in the absence of constraints on the maintenance actions (e.g., a limit on the total maintenance cost), the DRL policy chooses frequent minor repairs to keep the bridge in good condition at all times.
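The Time-X and Condition-X baselines used in Table 2 can be evaluated with a Monte-Carlo loop of the following form. The single-component transition and cost model inside `step` is illustrative only, so the printed numbers are not those of Table 2; the structure of the comparison is what the sketch is meant to show.

```python
import numpy as np

# Monte-Carlo comparison of hand-crafted policies, in the spirit of Table 2.
N_STATES, HORIZON, N_SIM = 6, 100, 1000
ACTION_COST = [0.0, 0.1, 0.3, 1.0]
RISK_COST = [0.01, 0.01, 0.02, 0.03, 0.1, 0.3]
rng = np.random.default_rng(0)

def step(s, a):
    if a == 0:                                   # no repair: 10% chance of degrading one level
        s = min(s + (rng.random() < 0.1), N_STATES - 1)
    elif a == 1:                                 # minor repair: usually improves one level
        s = max(s - 1, 0) if rng.random() < 0.7 else s
    return s, -(ACTION_COST[a] + RISK_COST[s])

def condition_policy(threshold):                 # 'Condition-X': minor repair if state worse than X
    return lambda s, t: 1 if s > threshold else 0

def time_policy(interval):                       # 'Time-X': minor repair every X years
    return lambda s, t: 1 if t % interval == 0 else 0

def evaluate(policy):
    totals = []
    for _ in range(N_SIM):
        s, total = 0, 0.0
        for t in range(1, HORIZON + 1):
            s, r = step(s, policy(s, t))
            total += r
        totals.append(total)
    return np.mean(totals), np.std(totals)

for name, pol in [("Condition-1", condition_policy(1)), ("Time-10", time_policy(10))]:
    mu, sigma = evaluate(pol)
    print(f"{name}: mean total cost {mu:.3f}, std {sigma:.3f}")
```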

Fig. 10. Comparison of different maintenance policies over 1,000 simulations of the simple deck system.

3.2. Case II: a long-span cable-stayed bridge

A long-span cable-stayed bridge with a main span of 648 m in mainland China [3,15] is used to investigate the performance of the proposed approach for complex structures. The bridge is a two-tower cable-stayed bridge with 168 cables, 89 box-girder sections, two bridge towers, and four bridge piers, as shown in Fig. 11, giving a total of 263 components. A DRL network with deeper layers and more units is employed (Fig. 3(b)).

Fig. 11. The investigated long-span cable-stayed bridge [15].

3.2.1. Problem formulation

For simplicity, the problem formulation for the cable-stayed bridge example follows that of the deck system described in Section 3.1.1. The condition of each component is rated from 0 to 5, from very good to critical, and the optional maintenance actions are numbered from 0 to 3, from no repair to replacement. The yearly state transition matrices of the different components due to deterioration and the various maintenance action levels are shown in Figs. 14 and 15 in the appendix. The reward function, which considers the financial maintenance costs and the structural performance measured by probabilistic risk costs, has the same format as Eq. (7) and Table 1; cost_total(c) is 3, 2, 10, and 4 for the longest stay cable, the longest girder section, the tower, and the pier, respectively. The rewards are then normalized by the sum of the total costs of all components.

The DRL network architecture is similar to the network used in Section 3.1, but has deeper layers and more layer units. The input stacks the one-hot encoded structural conditions, of length 1578 (263 × 6), and the binary-encoded age, of length 7. The input is then expanded to a vector of length 1600 by zero padding and finally reshaped to a 2-D array of size 40 × 40 × 1 (as shown in Fig. 16 in the appendix). Six convolutional layers of size 40 × 40 × 8, 20 × 20 × 32, 10 × 10 × 128, 5 × 5 × 512, 3 × 3 × 1024, and 1 × 1 × 4096 are adopted in the feature-learning stage. The kernel sizes are all 3 × 3, and the stride sizes are 1 × 1, 2 × 2, 2 × 2, 2 × 2, 2 × 2, and 1 × 1, respectively. In the Q-learning stage, the output of the CNN is used as the input of the fully connected network, whose hidden layer size is set to 8,000. The model then outputs the Q values of size 263 × 4, and the maintenance actions are obtained in the ε-greedy manner. The size of the state space is |S| = 100 × 6^263 ≈ 4.5 × 10^206, and the size of the action space is |A| = 4^263 ≈ 2.2 × 10^158. The hyperparameters are set to γ = 0.95 and α = 0.0001, the capacity of the experience buffer is 10^5, and the training batch size is 10^3. The DRL network is built on TensorFlow and is trained on a 64-bit Ubuntu desktop with a GTX-1080Ti GPU. As in Section 3.1, the training procedure comprises five steps according to the pseudo-code, i.e., initialization, simulation, storing, updating, and iteration until convergence. The detailed source code is available at https://github.com/HIT-SMC/Bridge-maintenance.

Fig. 16. State encoding for the cable-stayed bridge.
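The input construction just described (1578 one-hot entries plus 7 age bits, zero-padded to 1600 and reshaped to 40 × 40 × 1) is sketched below; placing the zero padding at the end of the vector is an assumption.

```python
import numpy as np

def encode_cable_stayed_state(conditions, age, n_ratings=6):
    """One-hot the 263 component conditions (263 x 6 = 1578), append the 7-bit age,
    zero-pad to length 1600, and reshape to the 40 x 40 x 1 network input."""
    onehot = np.eye(n_ratings)[np.asarray(conditions)].ravel()      # length 1578
    age_bits = [(age >> (6 - i)) & 1 for i in range(7)]             # length 7
    flat = np.concatenate([onehot, age_bits])                       # length 1585
    padded = np.pad(flat, (0, 1600 - flat.size))                    # zero-pad to 1600
    return padded.reshape(40, 40, 1)

rng = np.random.default_rng(0)
state = encode_cable_stayed_state(rng.integers(0, 6, size=263), age=30)
print(state.shape)   # (40, 40, 1)
```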

3.2.2. Results

Fig. 12 illustrates the performance of the DRL policy during training. The simulation uses the given state transition matrices and the current-step policy, i.e., step (2) of the five training steps in the pseudo-code. The total cost is the sum of the maintenance costs over the simulation, and the DQN loss is the mean square error given by Eq. (3). The DRL policy converges from a randomly initialized policy to the policy with the lowest cost after 100,000 training steps, in approximately 3 days. The spikes indicate that the DRL agent keeps exploring, owing to the ε-greedy mechanism.

Fig. 12. Performance of the DRL policy during training. The rectangle indicates the exploration spikes.

Table 3 compares the normalized costs of the different maintenance policies (the learned DRL, time-based, and condition-based policies) over 1,000 simulations of bridge maintenance from t = 1 to 100. In each simulation, the maintenance action is sampled from the given policy and the consequent state is sampled from the state transition matrices given in Figs. 14 and 15. 'Time-X' denotes the time-based policy of making minor repairs on all components every X years, and 'Condition-X' denotes the condition-based policy of making minor repairs on the components whose conditions are worse than condition X. The DRL policy is the optimum among all the policies, with the lowest average normalized life-cycle cost of −1.267; Condition-1 is second, with an average normalized life-cycle cost of −1.279.

Table 3. Case II: total cost comparison (1,000 simulations).

Policy          DRL      Condition-1   Condition-2   Condition-3   Time-5   Time-10   Time-15   Time-20
Total cost (μ)  −1.267   −1.279        −1.567        −1.953        −2.055   −2.274    −3.216    −4.157
Total cost (σ)  0.014    0.013         0.019         0.033         0.013    0.077     0.119     0.141

Moreover, as in Fig. 10(a), both the DRL and Condition-1 policies lead to similar life-cycle condition distributions: the components mainly remain in condition 1 or 0, as shown in Fig. 13(a). Once again, the reason is that the reward function assumes that conditions 0 and 1 are the same from the structural-risk and structural-performance perspective; therefore, the DRL policy always chooses to keep the structural condition between 0 and 1. The distributions of maintenance actions for the different policies are shown in Fig. 13(b). The DRL policy opts for fewer actions during the final few years compared with Condition-1, which implies that the DRL network learns to take the age of the bridge into consideration when making policy decisions. The time-based policies have uniformly distributed maintenance actions at the specified interval, and the condition-based policies do not opt for much maintenance in the early years because of the initially good condition of the bridge. These results are similar to those described in Section 3.1.2, which implies that DRL is effective in finding the optimal policy for different maintenance tasks.

Fig. 13. Comparison of different maintenance policies over 1,000 simulations of the long-span cable-stayed bridge system.

Fig. 14. State transition matrices under a moderate environment and no maintenance for the long-span cable-stayed bridge.

Fig. 15. State transition matrices under different levels of maintenance for the long-span cable-stayed bridge.

4. Conclusion and discussions

This paper proposes a DRL framework for automatic bridge-maintenance policy decision-making. Examples of a highway bridge deck system with seven components and a long-span cable-stayed bridge are given, and the following conclusions are drawn:

(1) The proposed DRL framework is a general solution to component-level maintenance policy decision making. The structure corresponds to the task environment in the AI field, and the policy maker corresponds to the agent. The optimal policy is learned through iterations of the policy evaluation and policy improvement stages, where DL implements the policy evaluation by training the DQN. The input dataset can be sourced either from historical data for bridges in a similar natural environment or from maintenance simulations using an MC method.

(2) The DRL framework provides a general method for structures of different complexities (i.e., different numbers of components) with little change in the network architecture. The maintenance of a complex structure requires a deeper and wider network than that of a simpler structure. To build an effective DRL neural network, it is recommended that the size of the fully connected hidden layer be four times that of the output Q layer to avoid underfitting, that the number of CNN layers be determined according to the sizes of the input, kernel, and stride, and that a smaller update step be used for deeper neural networks.

(3) DRL opts for less maintenance in the last few years of the life cycle, which implies that the DRL agent takes the age of the bridge and the special engineering concerns embedded in the task environment into account.

(4) The training of the proposed DRL framework is sample-based, which means that it can learn directly from real historical data and can incorporate the real history and a physical model in the same framework. When real data are not available, an environment model is required to respond to the DRL agent's maintenance policy. Therefore, to obtain a rational maintenance policy through the learning process, it is important to have reasonable models for deterioration and cost criteria, and physical and structural concerns (such as performance under earthquakes) should be incorporated into the task environment.

Acknowledgements

This study was financially supported by the NSFC (Grant Nos. U1711265, 51638007, 51678203, 51478149 and 51678204).


Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at https://doi.org/10.1016/j.strusafe.2019.101906.

References

[1] Bao Yuequan, Chen Zhicheng, Wei Shiyin, Xu Yang, Tang Zhiyi, Li Hui. The state of the art of data science and engineering in structural health monitoring. Engineering 2019.
[2] Cesare Mark A, Santamarina Carlos, Turkstra Carl, Vanmarcke Erik H. Modeling bridge deterioration with markov chains. J Transp Eng 1992;118(6):820–33.
[3] Chen Zhicheng, Bao Yuequan, Li Hui. The random field model of the spatial distribution of heavy vehicle loads on long-span bridges. Health Monitoring of Structural and Biological Systems 2016, vol. 9805. International Society for Optics and Photonics; 2016. p. 98052Q.
[4] Ferrucci David, Brown Eric, Chu-Carroll Jennifer, Fan James, Gondek David, Kalyanpur Aditya A, Lally Adam, Murdock J William, Nyberg Eric, Prager John, et al. Building Watson: an overview of the DeepQA project. AI Magazine 2010;31(3):59–79.
[5] Frangopol Dan M. Life-cycle performance, management, and optimisation of structural systems under uncertainty: accomplishments and challenges. Struct Infrastruct Eng 2011;7(6):389–413.
[6] Frangopol Dan M, Dong You, Sabatino Samantha. Bridge life-cycle performance and cost: analysis, prediction, optimisation and decision-making. Struct Infrastruct Eng 2017;13(10):1239–57.
[7] Frangopol Dan M, Kallen Maarten-Jan, van Noortwijk Jan M. Probabilistic models for life-cycle performance of deteriorating structures: review and future directions. Progr Struct Eng Mater 2004;6(4):197–212.
[8] Glorot Xavier, Bordes Antoine, Bengio Yoshua. Deep sparse rectifier neural networks. Proceedings of the fourteenth international conference on artificial intelligence and statistics. 2011. p. 315–23.
[9] Hawk Hugh. Bridgit deterioration models. Transp Res Rec 1995;1490:19–22.
[10] Hu Qiying, Yue Wuyi. Markov decision processes with their applications, vol. 14. Springer Science & Business Media; 2007.
[11] Kuhn Kenneth D. Network-level infrastructure management using approximate dynamic programming. J Infrastruct Syst 2009;16(2):103–11.
[12] LeCun Yann, Bengio Yoshua, Hinton Geoffrey. Deep learning. Nature 2015;521(7553):436.
[13] Levine Sergey, Pastor Peter, Krizhevsky Alex, Ibarz Julian, Quillen Deirdre. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. Int J Rob Res 2018;37(4–5):421–36.
[14] Li Hui, Ou Jinping. The state of the art in structural health monitoring of cable-stayed bridges. J Civil Struct Health Monitor 2016;6(1):43–67.
[15] Li Shunlong, Wei Shiyin, Bao Yuequan, Li Hui. Condition assessment of cables by pattern recognition of vehicle-induced cable tension ratio. Eng Struct 2018;155:1–15.
[16] Madanat Samer, Ibrahim Wan Hashim Wan. Poisson regression models of infrastructure transition probabilities. J Transp Eng 1995;121(3):267–72.
[17] Medury Aditya, Madanat Samer. Simultaneous network optimization approach for pavement management systems. J Infrastruct Syst 2013;20(3):04014010.
[18] Mirzaei Zanyar, Adey Bryan T, Klatter L, Thompson PD. The IABMAS bridge management committee overview of existing bridge management systems. 2014.
[19] Mnih Volodymyr, Kavukcuoglu Koray, Silver David, Rusu Andrei A, Veness Joel, Bellemare Marc G, Graves Alex, Riedmiller Martin, Fidjeland Andreas K, Ostrovski Georg, et al. Human-level control through deep reinforcement learning. Nature 2015;518(7540):529.
[20] Morcous G. Performance prediction of bridge deck systems using markov chains. J Perform Constr Facil 2006;20(2):146–55.
[21] Papakonstantinou KG, Shinozuka M. Optimum inspection and maintenance policies for corroded structures using partially observable markov decision processes and stochastic, physically based models. Probab Eng Mech 2014;37:93–108.
[22] Robelin Charles-Antoine, Madanat Samer M. Dynamic programming based maintenance and replacement optimization for bridge decks using history-dependent deterioration models. Applications of Advanced Technology in Transportation. 2006. p. 13–8.
[23] Sallab Ahmad EL, Abdou Mohammed, Perot Etienne, Yogamani Senthil. Deep reinforcement learning framework for autonomous driving. Electron Imaging 2017;2017(19):70–6.
[24] Schmidhuber Jürgen. Deep learning in neural networks: an overview. Neural Networks 2015;61:85–117.
[25] Silver David, Huang Aja, Maddison Chris J, Guez Arthur, Sifre Laurent, Van Den Driessche George, Schrittwieser Julian, Antonoglou Ioannis, Panneershelvam Veda, Lanctot Marc, et al. Mastering the game of go with deep neural networks and tree search. Nature 2016;529(7587):484.
[26] Sutton Richard S, Barto Andrew G. Reinforcement learning: an introduction. MIT Press; 2018.
[27] Tang Zhiyi, Chen Zhicheng, Bao Yuequan, Li Hui. Convolutional neural network-based data anomaly detection method using multiple information for structural health monitoring. Struct Control Health Monitor 2019;26(1):e2296.
[28] Tao Zongwei, Corotis Ross B, Ellis J Hugh. Reliability-based bridge design and life cycle management with markov decision processes. Struct Saf 1994;16(1–2):111–32.
[29] Thompson Paul D, Small Edgar P, Johnson Michael, Marshall Allen R. The Pontis bridge management system. Struct Eng Int 1998;8(4):303–8.
[30] Van Noortwijk JM. A survey of the application of gamma processes in maintenance. Reliab Eng Syst Saf 2009;94(1):2–21.
[31] Wei Shiyin, Zhang Zhaohui, Li Shunlong, Li Hui. Strain features and condition assessment of orthotropic steel deck cable-supported bridges subjected to vehicle loads by using dense FBG strain sensors. Smart Mater Struct 2017;26(10):104007.
[32] Xu Yang, Wei Shiyin, Bao Yuequan, Li Hui. Automatic seismic damage identification of reinforced concrete columns from images by a region-based deep convolutional neural network. Struct Control Health Monitor 2019:e2313.
