Optimal control for earth pressure balance of shield machine based on action-dependent heuristic dynamic programming


ISA Transactions 94 (2019) 28–35


Research article


Xuanyu Liu, Sheng Xu (corresponding author: [email protected]), Yueyang Huang
School of Information and Control Engineering, Liaoning Shihua University, Fushun 113001, China

Highlights

• Action-dependent heuristic dynamic programming (ADHDP) is used to control the earth pressure balance in the sealed cabin.
• Based on Bellman's principle of optimality, an ADHDP controller for the earth pressure balance is designed.
• The ADHDP controller can realize earth pressure balance control, and the control process is steadier.
• The proposed method has a shorter adjusting time and strong anti-interference ability.

Article info

Article history: Received 14 January 2018; Received in revised form 8 April 2019; Accepted 12 April 2019; Available online 18 April 2019.

Keywords: Shield machine; Earth pressure balance; Optimal control; Action-dependent heuristic dynamic programming

Abstract

The earth pressure balance (EPB) shield has been widely used in underground construction. The stability of the excavation face is crucial to avoid accidents caused by EPB shield tunneling, so it is very important to propose an effective control method for the earth pressure balance in the sealed cabin. Considering that stable automatic control of the earth pressure in the shield's sealed cabin is difficult, an optimal control method for the earth pressure is proposed based on action-dependent heuristic dynamic programming (ADHDP), which can realize online autonomous learning and adaptive control in the tunneling process. According to Bellman's principle of optimality, the cost function with respect to the sealed cabin's earth pressure is given. In addition, the action network and the critic network of the ADHDP controller are constructed. The critic network approximates the cost function and feeds the error back to the action network. With the goal of minimizing the cost function, the action network utilizes the critic network's error to optimize the screw conveyor speed. The simulation results show that the earth pressure controller based on ADHDP can realize earth pressure balance control, and the control process is steadier. Moreover, the ADHDP controller has good dynamic performance and anti-interference ability.

© 2019 ISA. Published by Elsevier Ltd. All rights reserved.

1. Introduction

How to realize earth pressure balance control of the sealed cabin is one of the key problems to be solved urgently in the construction technology of the earth pressure balance (EPB) shield. In fact, imbalance of the earth pressure in the sealed cabin is the direct cause of ground deformation and serious accidents in underground tunnel engineering. Therefore, in order to ensure the safety of shield tunneling and prevent ground deformation, it is of great significance to propose an effective control method that realizes precise control of the earth pressure.

In view of EPB control in the sealed cabin, some scholars have carried out relevant research. Yeh [1] proposed an automatic control system for EPB that used back propagation neural networks (BPNN) to optimize control parameters. Yang et al. [2] used a traditional PID controller to control the sealed cabin's earth pressure. Li et al. [3] applied a robust method that adjusts the screw conveyor speed to control the earth pressure in the sealed cabin. Based on LS-SVM, Liu et al. [4] established a prediction model of the earth pressure and achieved EPB by optimizing the advance speed and the screw conveyor speed. Wang et al. [5] proposed an earth pressure control method based on feedforward–feedback compound control. The above-mentioned EPB control methods have obtained a certain control effect, but they rely on a control model with high accuracy and cannot realize autonomous learning and adaptive control of the tunneling process.

Dynamic programming (DP) is a mathematical method for optimizing a decision process. "For a multistep decision process, whichever stage and state are taken as the initial stage and initial state, the subsequent decisions must constitute an optimal strategy for this state, independent of the initial state and the initial decision"; this principle is called Bellman's principle of optimality and is the theoretical basis of dynamic programming. Adaptive dynamic programming (ADP) is the development of traditional DP, and it can effectively overcome the curse of dimensionality. In recent years, the ideas of ADP and reinforcement learning (RL) have been introduced to design adaptive controllers in order to solve the optimal control problem of complex nonlinear systems.

Typically, ADP utilizes three neural networks to approximate the cost function, compute the control policy and model the system. The single-network ADP has also been developed, where only one network is used for each player instead of the dual network used in a typical ADP architecture. In [6], a critic-only Q-learning method was developed, which aimed to design adaptive optimal tracking control to solve the model-free optimal control problem of nonaffine nonlinear discrete-time systems, and its effectiveness was demonstrated. Malla and Ni [7] proposed a new history experience replay design to avoid the use of a model network or identifier of the system, which successfully integrated history experience into the traditional ADP design. Luo et al. [8] proposed an off-policy adaptive Q-learning method with an actor–critic neural network structure to study the data-based optimal output regulation problem of discrete-time systems. An improved reinforcement learning method was applied to minimize electricity costs on the premise of satisfying the power balance and generation limits of units in a grid-connected microgrid [9]. Considering continuous-time systems with input constraints, Modares et al. [10] developed an online learning algorithm based on the policy iteration technique, which was implemented on an actor–critic structure to solve the Hamilton–Jacobi–Bellman equation.

At present, most scholars focus on theoretical research on ADP; how to apply ADP to solve the optimal control problems of actual industrial plants is rarely studied. In this paper, we combine ADP with the EPB shield to solve the optimal control problem of EPB. Action-dependent heuristic dynamic programming (ADHDP) is a form of ADP, and the convergence and stability of ADHDP have been proven in [11–13]. Here, optimal control of the controlled variable is realized through online self-learning and adaptive control, which is suitable for the optimal control of complex nonlinear systems for which an accurate mechanism model is difficult to establish. The critic network of ADHDP takes not only the earth pressure but also the screw conveyor speed as input, so better control precision can be obtained. Moreover, ADHDP does not need a model network and includes only a critic network and an action network, and the controller can act directly on the EPB shield, so the control error can be effectively reduced. Consequently, we propose an optimal control method for the earth pressure in the sealed cabin based on ADHDP.

In this paper, we first establish the control model of the sealed cabin's earth pressure. On that basis, according to Bellman's principle of optimality, the cost function with respect to the earth pressure is proposed, and the earth pressure controller is designed based on ADHDP. Finally, the effectiveness is verified by simulation experiments.

2. Principle of earth pressure balance

2.1. Control model of earth pressure in sealed cabin

The structure of the EPB shield is shown in Fig. 1. Based on the principle that the intake volume and the discharge volume of soil in the sealed cabin are equal or approximately equal, the EPB shield mainly adjusts the screw conveyor speed to change the volume of soil in the sealed cabin, so as to maintain the balance between the earth pressure in the sealed cabin and the pressure on the excavation face.


Fig. 1. Earth pressure balance shield.

Fig. 2. Principle of earth pressure balance.

The principle of EPB is shown in Fig. 2. In this paper, the earth pressure control model is taken from [14]. Here, we briefly describe the derivation of the control model; more details can be found in [14]. During the sampling time $\Delta t$, the volume of soil cut by the cutter head into the sealed cabin can be expressed as

$$V_i = A \int v \, \mathrm{d}t = A v \Delta t, \tag{1}$$

where $A$ is the cross-sectional area of the shield and $v$ is the thrust speed. At the same time, the volume of soil discharged by the screw conveyor is

$$V_0 = \int A_s \omega h \eta \, \mathrm{d}t = A_s \omega h \eta \Delta t, \tag{2}$$

where $h$ is the spiral distance of the screw conveyor, $\eta$ is the efficiency of the screw conveyor, $\omega$ is the screw conveyor speed, $A_s = \pi (r_s^2 - r_f^2)$ is the effective discharge area of the screw conveyor, $r_s$ is the radius of the screw conveyor, and $r_f$ is the radius of the shaft of the screw conveyor. Therefore, the volume increment of soil in the sealed cabin can be denoted as

$$\Delta V = V_i - V_0. \tag{3}$$

According to the constitutive properties of soil, the relation between the volume of soil and the increment of the earth pressure in sealed cabin can be approximated as

$$\Delta p = E_t \Delta \varepsilon, \tag{4}$$


where
$$\Delta \varepsilon = \frac{\Delta L}{L} = \frac{A \Delta L}{A L} = \frac{\Delta V}{V_c}, \tag{5}$$

and $\Delta \varepsilon$ is the increment of axial strain in the sealed cabin, $E_t$ is the deformation modulus of the soil, $\Delta L$ is the longitudinal increment of the soil, $L$ is the length of the sealed cabin, and $V_c$ is the volume of the sealed cabin. Hence, the relation between the increment of the earth pressure and the screw conveyor speed is

$$\Delta p = \frac{E_t}{V_c}\left[ A \int v \, \mathrm{d}t - \int A_s \omega h \eta \, \mathrm{d}t \right] = \frac{E_t}{V_c}\left( A v - A_s \omega h \eta \right) \Delta t. \tag{6}$$

Writing Eq. (6) in difference form, the control model of the earth pressure in the shield's sealed cabin is obtained as

$$p(k+1) = p(k) + \frac{E_t}{V_c}\left[ A v - A_s \omega(k) h \eta \right] \Delta t, \tag{7}$$

where the screw conveyor speed $\omega$ is the control variable (control input), the earth pressure in the sealed cabin $p$ is the controlled variable ($p$ is both the state and the output), and the other parameters are constants.
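To make the discrete-time model concrete, a minimal Python sketch of Eq. (7) is given below. This is an illustration rather than code from the paper: the parameter values are those later listed in Table 1 (Section 4), while the screw conveyor efficiency η and the sampling time Δt are assumed values, since the paper does not state them.

```python
# Minimal sketch of the sealed-cabin earth pressure model, Eq. (7).
# Parameter values follow Table 1 (Section 4); eta and dt are assumed, not from the paper.
import math

A   = 30.92          # cross-sectional area of the shield (m^2)
E_t = 2.5            # deformation modulus of the soil (MPa)
V_c = 27.86          # volume of the sealed cabin (m^3)
h   = 0.9            # spiral distance of the screw conveyor (m)
r_s, r_f = 0.372, 0.110            # radii of the screw conveyor and its shaft (m)
A_s = math.pi * (r_s**2 - r_f**2)  # effective discharge area, A_s = pi*(r_s^2 - r_f^2)
v   = 0.032          # thrust speed (m/min)
eta = 0.9            # screw conveyor efficiency (assumed)
dt  = 1.0            # sampling time (min, assumed)

def pressure_step(p_k, omega_k, delta_p=0.0):
    """Advance the earth pressure by one sampling step, Eq. (7); delta_p covers Eq. (37)."""
    return p_k + (E_t / V_c) * (A * v - A_s * omega_k * h * eta) * dt + delta_p
```

For example, pressure_step(0.19, 3.0) returns the cabin pressure after one sampling period when the screw conveyor runs at 3 rpm.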

2.2. Principle of EPB control based on ADHDP

Dynamic programming is a mathematical method for optimizing a decision process. In this paper, for simplicity of description, only the state and the control are considered one-dimensional. Obviously, the dynamic process of a discrete system is a multistep control process, and the state equation of the nonlinear discrete system is assumed to be

$$x(k+1) = f[x(k), u(k), k], \tag{8}$$

where $x \in \mathbb{R}^n$ is the system state and $u \in \mathbb{R}^m$ is the control input. In general, the cost function of dynamic programming in [15] is

$$J_N(x(0)) = \sum_{k=0}^{N-1} \gamma^{k} U[x(k), u(k), k], \tag{9}$$

where $U$ is the utility function, $\gamma \in (0, 1]$ is the discount factor, and $J_N$ is the accumulated value of the utility function. The goal of optimal control is to find the control sequence that minimizes Eq. (9). The key to treating the optimal control problem of the system with dynamic programming is to use the initial value of the system as a parameter and then use the properties of the optimal objective function to obtain the dynamic programming equation that the cost function satisfies.

The basic principle of ADHDP is to approximate the cost function of Bellman's dynamic programming equation given in Eq. (9) by iteration. Eq. (9) can also be expressed, as in [13], as

$$J(k) = \sum_{t=0}^{\infty} \gamma^{t} U(t+k). \tag{10}$$

Eq. (10) approximates the cost function with a function approximator and obtains the optimal cost function and the optimal control law by iteration [16,17]. The optimal control strategy is generated through network training, and the minimum value of Eq. (10) is obtained, so as to reach the purpose of optimal control. According to the control model in Eq. (7), Eq. (10) can be expressed as

$$J(k) = \sum_{i=k}^{\infty} \gamma^{i-k} U[p(i), \omega(i), i]. \tag{11}$$

According to Bellman's principle of optimality, if Eq. (11) is used as the indicator for network weight adjustment, the cost function can be gradually moved to its optimal value under the current control strategy through network training. Accordingly, we take Eq. (11) as the cost function of the controller and design the EPB controller based on ADHDP. Subsequently, the optimal cost function of the controller at time $k$ can be expressed as

$$J^{*}(k) = \min_{u(k+1)} \left\{ U(p(k+1), \omega(k+1), k+1) + \gamma J^{*}(k+1) \right\}, \tag{12}$$

and the optimal screw conveyor speed can be expressed as

$$\omega^{*}(k) = \arg\min_{u(k+1)} \left\{ U(p(k+1), \omega(k+1), k+1) + \gamma J^{*}(k+1) \right\}. \tag{13}$$

3. Design of the ADHDP controller

3.1. ADHDP controller structure

The structure of the ADHDP controller for the earth pressure in the sealed cabin is given in Fig. 3. As shown in Fig. 3, the ADHDP controller consists of an action network and a critic network. The flow of each signal is shown as a solid line, and the error feedback paths of the critic network and the action network are shown as dotted lines [6]. The two critic networks (#1 and #2) represent the same network at different moments. Their outputs ($J^{*}(k)$ and $J^{*}(k+1)$) together with $U(k+1)$ form an error signal that drives the update of the critic network weights. The earth pressure in the sealed cabin is the input of the action network, and its output is the screw conveyor speed. The inputs of the EPB shield are the earth pressure and the screw conveyor speed, and its output is the earth pressure at the next moment. Then, the screw conveyor speed and the earth pressure at the next moment are taken as the inputs of the critic network, which outputs the cost function of the ADHDP controller at the next moment. As a whole, the critic network approximates the cost function, and the action network adjusts the screw conveyor speed so as to minimize the cost function and thus achieve EPB control.

Fig. 3. Structure of ADHDP controller for the earth pressure in sealed cabin.

3.2. Definition of the utility function

The utility function is an important indicator directly related to the earth pressure in the sealed cabin during the design of the ADHDP controller. To some extent, the choice of utility function affects the control performance of the controller. To design a controller satisfying the system's requirements, the utility function must reflect the properties of the system itself. In this paper, the control objective is to keep the earth pressure stable within the range from 0.19 MPa to 0.23 MPa, which we achieve by defining the following utility function:

$$U(k) = \begin{cases} [p(k) - 0.21]^2, & |p(k) - 0.21| \le 0.02 \\ 1, & \text{otherwise.} \end{cases} \tag{14}$$
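As a small illustration (not from the paper), the piecewise utility of Eq. (14) maps directly to a Python function; the set point of 0.21 MPa and the ±0.02 MPa band are the values stated above, while the function name and default arguments are our own.

```python
def utility(p_k, p_ref=0.21, band=0.02):
    """Utility U(k) of Eq. (14): quadratic penalty inside the +/-0.02 MPa band, 1 outside."""
    if abs(p_k - p_ref) <= band:
        return (p_k - p_ref) ** 2
    return 1.0
```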

3.3. Training of the critic network

The selection of the number of hidden layer nodes is crucial to the training effect of the network. Therefore, using the trial-and-error method and expert experience, we first calculate the average convergence value of the cost function when the number of hidden layer nodes ranges from 3 to 15. The simulation results show that when the critic network has 5 hidden layer nodes, the average convergence value of the cost function reaches its minimum of 0.021. So the critic network is a three-layer BPNN with a 2-5-1 structure, as shown in Fig. 4. In Fig. 4, $W_{c1}$ is the weight matrix from the input layer to the hidden layer and $W_{c2}$ is the weight vector from the hidden layer to the output layer. The bipolar function and the linear function are used as the transfer functions of the hidden layer and the output layer, respectively.

Fig. 4. Structure of the critic network.

The training process of the critic network consists of forward calculation and error back propagation. The forward calculation of the critic network can be expressed as

$$c_{h1j}(k) = \sum_{i=1}^{2} [p_i(k), \omega_i(k)]^{T} \cdot W_{c1ij}(k), \tag{15}$$

$$c_{h2j}(k) = \frac{1 - e^{-c_{h1j}(k)}}{1 + e^{-c_{h1j}(k)}}, \tag{16}$$

$$J^{*}(k) = \sum_{j=1}^{5} c_{h2j}(k) \cdot W_{c2j}(k), \tag{17}$$

where $c_{h1j}(k)$ is the input of the $j$th node of the hidden layer and $c_{h2j}(k)$ is the output of the $j$th node of the hidden layer. In order to make the critic network approximate Eq. (11), the error is defined as

$$e_c(k) = J^{*}(k) - U(k+1) - \gamma J^{*}(k+1), \tag{18}$$

$$E_c(k) = \frac{1}{2} e_c^{2}(k). \tag{19}$$

When $E_c(k) = 0$ holds for all moments, Eq. (20) can be derived:

$$J^{*}(k) = U(k+1) + \gamma J^{*}(k+1) = U(k+1) + \gamma \left[ U(k+2) + \gamma J^{*}(k+2) \right] = \cdots = \sum_{i=k+1}^{\infty} \gamma^{i-k-1} U(i). \tag{20}$$

Comparing Eqs. (20) and (11), we have $J^{*}(k) = J(k+1)$. Thus, a trained critic network can be obtained by minimizing Eq. (19), and its output is an estimate of the cost function defined by Eq. (11). According to Eqs. (18) and (19), based on the gradient descent algorithm and the chain rule, the updating rules for the weights from the hidden layer to the output layer are

$$\Delta W_{c2}(k) = l_c \left[ -\frac{\partial E_c(k)}{\partial W_{c2}(k)} \right] = l_c \left[ -\frac{\partial E_c(k)}{\partial J^{*}(k)} \cdot \frac{\partial J^{*}(k)}{\partial W_{c2}(k)} \right] = -l_c \cdot e_c(k) \cdot c_{h2}(k), \tag{21}$$

$$W_{c2}(k+1) = W_{c2}(k) + \Delta W_{c2}(k). \tag{22}$$

The rules for updating the weights from the input layer to the hidden layer are

$$\Delta W_{c1}(k) = l_c \left[ -\frac{\partial E_c(k)}{\partial W_{c1}(k)} \right] = l_c \left[ -\frac{\partial E_c(k)}{\partial J^{*}(k)} \frac{\partial J^{*}(k)}{\partial c_{h2}(k)} \frac{\partial c_{h2}(k)}{\partial c_{h1}(k)} \frac{\partial c_{h1}(k)}{\partial W_{c1}(k)} \right] = -l_c \cdot e_c(k) \cdot W_{c2}(k) \cdot \frac{1}{2}\left[ 1 - c_{h2}^{2}(k) \right] \cdot [p(k), \omega(k)]^{T}, \tag{23}$$

$$W_{c1}(k+1) = W_{c1}(k) + \Delta W_{c1}(k), \tag{24}$$

where $l_c \in (0, 1]$ is the learning rate of the critic network and $\Delta W_c$ is the weight increment of the critic network. When $E_c(k) < \varepsilon_c$, the adjustment of the critic network weights is terminated, where $\varepsilon_c$ is a small positive constant. Note that $l_c$ is not constant; it is decremented by a fixed difference, i.e. $l_c^{*} = l_c - 0.01$, and if $l_c < 0.01$, then $l_c^{*} = 0.05$.
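The 2-5-1 critic network and the updates of Eqs. (15)-(24) can be sketched in NumPy as follows. This is an illustrative reconstruction, not the authors' code: the class layout, the random initialization in (−1, 1) and the method names are assumptions, and only a single gradient step is shown (the paper repeats the step until E_c(k) < ε_c).

```python
import numpy as np

class CriticNetwork:
    """2-5-1 BPNN critic: inputs [p(k), omega(k)], output J*(k); sketch of Eqs. (15)-(24)."""

    def __init__(self, n_hidden=5, lc=0.3, seed=0):
        rng = np.random.default_rng(seed)
        self.Wc1 = rng.uniform(-1.0, 1.0, size=(2, n_hidden))  # input -> hidden weights
        self.Wc2 = rng.uniform(-1.0, 1.0, size=n_hidden)       # hidden -> output weights
        self.lc = lc                                            # learning rate l_c

    def forward(self, p_k, omega_k):
        x = np.array([p_k, omega_k])
        ch1 = x @ self.Wc1                                  # Eq. (15)
        ch2 = (1 - np.exp(-ch1)) / (1 + np.exp(-ch1))       # Eq. (16), bipolar transfer function
        return float(ch2 @ self.Wc2), ch2, x                # Eq. (17): J*(k)

    def update(self, p_k, omega_k, U_next, J_next, gamma=0.9):
        """One gradient-descent step on E_c(k) = 0.5*e_c(k)^2, Eqs. (18)-(24)."""
        J, ch2, x = self.forward(p_k, omega_k)
        ec = J - U_next - gamma * J_next                    # Eq. (18)
        Wc2_old = self.Wc2.copy()
        self.Wc2 = self.Wc2 - self.lc * ec * ch2            # Eqs. (21)-(22)
        grad_hidden = ec * Wc2_old * 0.5 * (1.0 - ch2**2)   # chain rule of Eq. (23)
        self.Wc1 = self.Wc1 - self.lc * np.outer(x, grad_hidden)  # Eqs. (23)-(24)
        return 0.5 * ec**2                                  # E_c(k), Eq. (19)

    def dJ_domega(self, p_k, omega_k):
        """dJ*/d(omega) of Eq. (32), needed later by the action network update."""
        _, ch2, _ = self.forward(p_k, omega_k)
        return float(np.sum(self.Wc2 * 0.5 * (1.0 - ch2**2) * self.Wc1[1, :]))
```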

3.4. Training of the action network

The method for selecting the number of hidden layer nodes is the same as in Section 3.3. It is verified by simulation that, compared with other numbers of nodes, the optimized screw conveyor speed is closest to the actual value when the action network has 4 hidden layer nodes. Therefore, a three-layer BPNN with a 1-4-1 structure is employed in the action network, as shown in Fig. 5. In Fig. 5, $W_{a1}$ is the weight vector from the input layer to the hidden layer and $W_{a2}$ is the weight vector from the hidden layer to the output layer. The bipolar function and the linear function are used as the transfer functions of the hidden layer and the output layer, respectively.

Fig. 5. Structure of the action network.

The training process of the action network is similar to that of the critic network. The forward calculation of the action network can be expressed as

$$a_{h1j}(k) = p(k) \cdot W_{a1j}(k), \tag{25}$$

$$a_{h2j}(k) = \frac{1 - e^{-a_{h1j}(k)}}{1 + e^{-a_{h1j}(k)}}, \tag{26}$$

$$\omega(k) = \sum_{j=1}^{4} a_{h2j}(k) \cdot W_{a2j}(k), \tag{27}$$

where $a_{h1j}(k)$ is the input of the $j$th node of the hidden layer and $a_{h2j}(k)$ is the output of the $j$th node of the hidden layer. Taking the control variable constraint into account, the tuning criterion for the screw conveyor speed is

$$\omega(k) = \begin{cases} \omega_{\min}, & \text{if } \omega_i(k) < \omega_{\min} \\ \omega_{\max}, & \text{if } \omega_i(k) > \omega_{\max} \\ \omega_i(k), & \text{otherwise,} \end{cases} \tag{28}$$

where $\omega_{\min}$ and $\omega_{\max}$ are the minimum and maximum values of the screw conveyor speed, respectively. The action network aims to minimize Eq. (11) and thereby obtain the optimal screw conveyor speed, so the network error is defined as

$$e_a(k) = J^{*}(k) = U(k+1) + \gamma J^{*}(k+1), \tag{29}$$

$$E_a(k) = \frac{1}{2} e_a^{2}(k). \tag{30}$$

According to Eqs. (29) and (30), the updating rules for the weights from the hidden layer to the output layer are

$$\Delta W_{a2}(k) = l_a \left[ -\frac{\partial E_a(k)}{\partial W_{a2}(k)} \right] = l_a \left[ -e_a(k) \cdot \frac{\partial J^{*}(k)}{\partial \omega(k)} \frac{\partial \omega(k)}{\partial W_{a2}(k)} \right] = -l_a \cdot e_a(k) \cdot \frac{\partial J^{*}(k)}{\partial \omega(k)} \cdot a_{h2}(k), \tag{31}$$

$$\frac{\partial J^{*}(k)}{\partial \omega(k)} = \frac{\partial J^{*}(k)}{\partial c_{h2}(k)} \frac{\partial c_{h2}(k)}{\partial c_{h1}(k)} \frac{\partial c_{h1}(k)}{\partial \omega(k)} = W_{c2}(k) \cdot \frac{1}{2}\left[ 1 - c_{h2}^{2}(k) \right] \cdot W_{c1}(k), \tag{32}$$

$$W_{a2}(k+1) = W_{a2}(k) + \Delta W_{a2}(k). \tag{33}$$

Then the rules for updating the weights from the input layer to the hidden layer of the action network are

$$\Delta W_{a1}(k) = l_a \left[ -\frac{\partial E_a(k)}{\partial W_{a1}(k)} \right] = l_a \left[ -e_a(k) \cdot \frac{\partial J^{*}(k)}{\partial a_{h2}(k)} \frac{\partial a_{h2}(k)}{\partial a_{h1}(k)} \frac{\partial a_{h1}(k)}{\partial W_{a1}(k)} \right] = -l_a \cdot e_a(k) \cdot \frac{\partial J^{*}(k)}{\partial a_{h2}(k)} \cdot \frac{1}{2}\left[ 1 - a_{h2}^{2}(k) \right] \cdot p(k), \tag{34}$$

$$\frac{\partial J^{*}(k)}{\partial a_{h2}(k)} = \frac{\partial J^{*}(k)}{\partial \omega(k)} \frac{\partial \omega(k)}{\partial a_{h2}(k)} = \frac{\partial J^{*}(k)}{\partial \omega(k)} \cdot W_{a2}(k), \tag{35}$$

$$W_{a1}(k+1) = W_{a1}(k) + \Delta W_{a1}(k), \tag{36}$$

where $l_a \in (0, 1]$ is the learning rate of the action network and $\Delta W_a$ is the weight increment of the action network. When $E_a(k) < \varepsilon_a$, the adjustment of the action network weights is terminated, where $\varepsilon_a$ is a small positive constant. The tuning criterion for $l_a$ is similar to that for $l_c$, i.e. $l_a^{*} = l_a - 0.01$, and if $l_a < 0.01$, then $l_a^{*} = 0.05$.
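A matching sketch of the 1-4-1 action network with the clipping of Eq. (28) and the updates of Eqs. (31)-(36) is given below. Again this is an illustrative assumption rather than the paper's code; the gradient ∂J*/∂ω is expected from the critic, for example the dJ_domega helper in the previous sketch.

```python
import numpy as np

class ActionNetwork:
    """1-4-1 BPNN action network: input p(k), output omega(k); sketch of Eqs. (25)-(36)."""

    def __init__(self, n_hidden=4, la=0.5, omega_min=2.2, omega_max=9.6, seed=1):
        rng = np.random.default_rng(seed)
        self.Wa1 = rng.uniform(-1.0, 1.0, size=n_hidden)    # input -> hidden weights
        self.Wa2 = rng.uniform(-1.0, 1.0, size=n_hidden)    # hidden -> output weights
        self.la = la                                         # learning rate l_a
        self.omega_min, self.omega_max = omega_min, omega_max  # speed limits from Table 2 (rpm)

    def forward(self, p_k):
        ah1 = p_k * self.Wa1                                 # Eq. (25)
        ah2 = (1 - np.exp(-ah1)) / (1 + np.exp(-ah1))        # Eq. (26)
        omega = float(ah2 @ self.Wa2)                        # Eq. (27)
        omega = min(max(omega, self.omega_min), self.omega_max)  # Eq. (28), speed constraint
        return omega, ah2

    def update(self, p_k, ea, dJ_domega):
        """One gradient step on E_a(k) = 0.5*e_a(k)^2, Eqs. (31)-(36).
        ea is e_a(k) = J*(k) (Eq. (29)); dJ_domega is dJ*/d(omega) from the critic (Eq. (32))."""
        _, ah2 = self.forward(p_k)
        Wa2_old = self.Wa2.copy()
        self.Wa2 = self.Wa2 - self.la * ea * dJ_domega * ah2            # Eqs. (31), (33)
        grad_hidden = ea * dJ_domega * Wa2_old * 0.5 * (1.0 - ah2**2)   # Eqs. (34)-(35)
        self.Wa1 = self.Wa1 - self.la * grad_hidden * p_k               # Eq. (36)
        return 0.5 * ea**2                                              # E_a(k), Eq. (30)
```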

Convergence and stability are the two most important issues in adaptive control system design. In actual shield tunneling, changes in geological conditions or other uncertainties often cause pressure fluctuations in the chamber. Therefore, it is necessary to verify the stability and anti-interference performance of the earth pressure control method proposed in this paper. In order to verify the stability and robustness of the ADHDP controller, we add an artificial disturbance to the optimization process to simulate the uncertainty of the shield tunneling process. The shield system considering uncertainty can be expressed as

$$p(k+1) = p(k) + \frac{E_t}{V_c}\left[ A v - A_s \omega(k) h \eta \right] \Delta t + \Delta p, \tag{37}$$

where $\Delta p$ is the increment of earth pressure caused by the uncertainties in shield tunneling.
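Putting the pieces together, one closed-loop iteration of the controller, as formalized step by step in Section 3.5 below, can be sketched as follows. It reuses the pressure_step, utility, CriticNetwork and ActionNetwork illustrations from the previous sections, so it is a reading aid rather than a self-contained or definitive implementation; for brevity, the inner loops that repeat the updates until E_c < ε_c and E_a < ε_a are reduced to single steps.

```python
def adhdp_iteration(p_k, critic, actor, gamma=0.9, delta_p=0.0):
    """One closed-loop iteration (steps (2)-(7) of Section 3.5), reduced to single update steps."""
    omega_k, _ = actor.forward(p_k)                        # Eqs. (25)-(28): choose screw speed
    J_k, _, _ = critic.forward(p_k, omega_k)               # Eq. (17): J*(k)
    p_next = pressure_step(p_k, omega_k, delta_p)          # Eq. (7) / Eq. (37): plant response
    omega_next, _ = actor.forward(p_next)
    J_next, _, _ = critic.forward(p_next, omega_next)      # Eq. (17): J*(k+1)
    U_next = utility(p_next)                               # Eq. (14): U(k+1)
    critic.update(p_k, omega_k, U_next, J_next, gamma)     # Eqs. (18)-(24)
    actor.update(p_k, J_k, critic.dJ_domega(p_k, omega_k)) # Eqs. (29)-(36), e_a(k) = J*(k)
    return p_next
```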

3.5. Training strategy of the ADHDP controller

The training procedure of the ADHDP controller is as follows:

(1) Initialize the values of p(k), Wa, Wc, la, lc, γ and qmax (the maximum number of iterations).
(2) Set i = 1, feed p(k) into Eq. (25) and compute ω(k) by Eq. (27).
(3) Enter p(k) and ω(k) into the critic network (#1) and the EPB shield, and compute J*(k) and p(k+1) by Eqs. (17) and (7), respectively.
(4) Enter p(k+1) into the action network (#2) and compute ω(k+1); then compute J*(k+1) by Eq. (17).
(5) Compute U(k+1) by Eq. (14).
(6) Set j = 1; according to Eq. (19), update the weights of the critic network by Eqs. (22) and (24). If Ec(k) > εc, set j = j + 1 and l*c = lc − 0.01 and continue; otherwise, go to (7).
(7) Set j = 1; according to Eq. (30), update the weights of the action network by Eqs. (33) and (36). If Ea(k) > εa, set j = j + 1 and l*a = la − 0.01 and continue; otherwise, go to (8).
(8) Set i = i + 1.
(9) If i ≤ qmax, go to (2); otherwise, stop.

4. Results of the simulation

The simulation data are based on measured data from a subway construction site in Beijing. The main parameters of the control model in Eq. (7) are listed in Table 1. According to the actual measured data, the thresholds of the model's control variables are shown in Table 2. The initial value of p(k) is 0.19 MPa. The initial weights of the critic network and the action network are arbitrary values in the range (−1, 1), and the parameter setting of the ADHDP controller is shown in Table 3.

Table 1
Parameters of the control model.
  Parameter   Value/Unit
  rs          0.372 m
  rf          0.110 m
  h           0.9 m
  A           30.92 m²
  Et          2.5 MPa
  Vc          27.86 m³
  v           0.032 m/min

Table 2
Control variable thresholds.
  Variable    Maximum/Unit   Minimum/Unit
  p(k)        0.29 MPa       0.15 MPa
  ω(k)        9.6 rpm        2.2 rpm

Table 3
Parameter setting of the ADHDP controller.
  Parameter   Initial value
  la          0.5
  lc          0.3
  γ           0.9
  εc          10⁻⁴
  εa          10⁻⁵
  Wa          (−1, 1)
  Wc          (−1, 1)

The proposed method is simulated in MATLAB according to the training strategy of the ADHDP controller. We first analyze the earth pressure optimization process of the controller without interference. After simulation, the updating trajectories of the action network weights and the critic network weights are shown in Figs. 6 and 7, respectively.

Fig. 6. Weights updating for the action network.

Fig. 7. Weights updating for the critic network.

In Figs. 6 and 7, Wa1 = [Wa11, Wa12, Wa13, Wa14], where Wa1j is the input weight of the jth node of the hidden layer; Wa2 = [Wa21, Wa22, Wa23, Wa24], where Wa2j is the output weight of the jth node of the hidden layer; Wc1 = [Wc11, Wc12], with Wc11 = [Wc111, Wc112, Wc113, Wc114, Wc115] and Wc12 = [Wc121, Wc122, Wc123, Wc124, Wc125], where Wc1ij is the weight from the ith node of the input layer to the jth node of the hidden layer; and Wc2 = [Wc21, Wc22, Wc23, Wc24, Wc25], where Wc2j is the output weight of the jth node of the hidden layer. Figs. 6 and 7 give a clear view of the convergence of the networks' weights. In addition, the error curves of the action network and the critic network are shown in Fig. 8: after 18 iteration steps, Ea and Ec quickly converge close to zero.

Fig. 8. The error curves of the action network and critic network.

At the same time, the screw conveyor speed and the earth pressure are optimized. The cost function of the ADHDP controller without interference is shown in Fig. 9, and the optimizing trajectories of the screw conveyor speed and the earth pressure are shown in Fig. 10. It is obvious that the cost function converges quickly, and the screw conveyor speed and the earth pressure reach a steady state rapidly. Note that once the weights of the two networks are known, the optimal ADHDP controller has been found. In order to verify the anti-interference ability of the ADHDP controller, an interference was added at the 70th iteration step. Furthermore, to demonstrate the optimization ability of the ADHDP controller, this paper compares it with the control method in [18].
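Using the same sketches, an interference test of this kind could be reproduced, for instance, by injecting a pressure increment at iteration step 70; the disturbance magnitude of 0.01 MPa is an assumed value, not one reported in the paper.

```python
# Illustrative interference experiment with the sketches above (not the paper's script).
critic, actor = CriticNetwork(), ActionNetwork()
p, history = 0.19, []                    # initial earth pressure 0.19 MPa (Section 4)
for k in range(200):
    delta_p = 0.01 if k == 70 else 0.0   # single disturbance at step 70 (magnitude assumed)
    p = adhdp_iteration(p, critic, actor, delta_p=delta_p)
    history.append(p)
```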


Fig. 9. The cost function of ADHDP controller without interference.

Fig. 10. Optimizing trajectories of screw conveyor speed and earth pressure.

The cost function of the ADHDP controller for the earth pressure in the sealed cabin is shown in Fig. 11. The optimizing trajectory of the screw conveyor speed is given in Fig. 12, and the control trajectory of the earth pressure is shown in Fig. 13. Considering the pressure changes caused by uncertain factors in shield tunneling, different increments of earth pressure are manually added at steps 45, 75, 105 and 145, respectively, and the optimizing effect is shown in Fig. 14.

Fig. 11. The cost function of ADHDP controller.

Fig. 12. The optimized trajectory of screw conveyor speed.

Fig. 13. The control trajectory of the earth pressure.

Fig. 14. Optimizing effect considering unknown factors in shield tunneling.

From Fig. 11, it can be seen that the cost function converges quickly both in the initial training phase and after the interference is added, which shows that the ADHDP controller is effective for optimizing the earth pressure and has good dynamic performance. As shown in Figs. 12 and 13, whether in the initial training or after the interference is added, the screw conveyor speed reaches its optimal value and the earth pressure reaches a steady state rapidly, which satisfies the target defined in Section 3.2. Additionally, as depicted in Fig. 14, the ADHDP controller can be stabilized after being continuously disturbed, which indicates that it has strong anti-interference ability. From Fig. 13, it is obvious that, compared with the method in [18], the ADHDP controller has a shorter adjusting time without overshoot, and the optimizing process is more stable. Moreover, the adjustments of the screw conveyor speed and the earth pressure are in line with the parameter variations in the actual tunneling process of the shield.

5. Conclusions

To overcome the problem that the earth pressure in the sealed cabin is difficult to control steadily, an optimal control method for the earth pressure based on ADHDP is proposed. The simulation results show that the earth pressure controller based on ADHDP realizes EPB control of the sealed cabin and satisfies the control target for the earth pressure. Furthermore, the proposed method has the advantages of faster convergence, better stability and stronger anti-interference ability, which provides a new way for the optimization control of the shield tunneling process. Our subsequent research will focus on multi-variable control and will comprehensively investigate the effect of other control parameters on the earth pressure to improve the control accuracy of the ADHDP controller.


Conflict of interest

The authors declare that there is no conflict of interest in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant No. 61773190) and the Doctoral Scientific Research Foundation of Liaoning Province, China (Grant No. 201501104).

References

[1] Yeh I. Application of neural networks to automatic soil pressure balance control for shield tunneling. Automat Constr 1997;5(5):421–6.
[2] Yang H, Shi H, Gong G, Hu G. Earth pressure balance control for EPB shield. Sci China Ser E-Tech Sci 2009;52(10):2840–8.
[3] Li S, Qu F, Cao L, Liu B. Experimental investigation about chamber pressure control of earth pressure balance shield. J China Coal Soc 2011;36(6):934–7.
[4] Liu X, Shao C, Ma H, Liu R. Optimal earth pressure balance control for shield tunneling based on LS-SVM and PSO. Automat Constr 2011;20:321–7.
[5] Wang L, Gong G, Yang H, Hou D. Earth pressure balance control based on feedforward–feedback compound control. J Cent South Univ (Sci Technol) 2013;44(7):2726–35.
[6] Luo B, Liu D, Huang T, Wang D. Model-free optimal tracking control via critic-only Q-learning. IEEE Trans Neural Netw Learn Syst 2016;27(10):2134–44.
[7] Malla N, Ni Z. A new history experience replay design for model-free adaptive dynamic programming. Neurocomputing 2017;266:141–9.
[8] Luo B, Yang Y, Liu D. Adaptive Q-learning for data-based optimal output regulation with experience replay. IEEE Trans Cybern 2018:1–12.
[9] Li FD, Wu M, He Y, Chen X. Optimal control in microgrid using multi-agent reinforcement learning. ISA Trans 2012;51(6):743–51.
[10] Modares H, Naghibi Sistani MB, Lewis FL. A policy iteration approach to online optimal control of continuous-time constrained-input systems. ISA Trans 2013;52(5):611–21.
[11] Sokolov Y, Kozma R, Werbos L, Werbos P. Complete stability analysis of a heuristic approximate dynamic programming control design. Automatica 2015;59:9–18.
[12] Watkins C. Learning from delayed rewards (Ph.D. dissertation). Cambridge, England: Cambridge University; 1989.
[13] Prokhorov D, Wunsch D. Adaptive critic designs. IEEE Trans Neural Netw 1997;8(5):997–1007.
[14] Shangguang ZC, Li SJ, Sun W, Luan MT, Liu B. Controlling earth pressure of head chamber of earth pressure balance (EPB) shield machine. J China Coal Soc 2010;35(3):402–5.
[15] Bellman R. Dynamic programming. Science 1966;153(3731):34–7.
[16] Murray J, Cox C, Lendaris G, Saeks R. Adaptive dynamic programming. IEEE Trans Syst Man Cybern Part C 2002;32(2):140–53.
[17] Werbos P. Using ADP to understand and replicate brain intelligence: the next level design. In: IEEE international symposium on approximate dynamic programming & reinforcement learning; 2007. p. 209–16.
[18] Zhang X. Control model of earth pressure for shield tunneling based on neural network (Master dissertation). Dalian, China: Dalian University of Technology; 2009.