Improved value iteration for neural-network-based stochastic optimal control design

Mingming Liang a, Ding Wang b,c,∗, Derong Liu d

a State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
b Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
c School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
d School of Automation, Guangdong University of Technology, Guangzhou 510006, China

Neural Networks 124 (2020) 280–295. Received 19 May 2019; received in revised form 5 January 2020; accepted 7 January 2020; available online 28 January 2020.

Keywords: Adaptive critic designs; Adaptive dynamic programming; Neural networks; Optimal control; Stochastic processes; Value iteration

Abstract: In this paper, a novel value iteration adaptive dynamic programming (ADP) algorithm, called the improved value iteration ADP algorithm, is presented to obtain the optimal policy for discrete stochastic processes. In the improved value iteration ADP algorithm, we propose for the first time a new criterion to verify whether the obtained policy is stable for stochastic processes. By analyzing the convergence properties of the proposed algorithm, it is shown that the iterative value functions converge to the optimum. In addition, our algorithm allows the initial value function to be an arbitrary positive semi-definite function. Finally, two simulation examples are presented to validate the effectiveness of the developed method.

1. Introduction

Adaptive dynamic programming (ADP), first proposed in Werbos' papers (Werbos, 1977, 1991), has shown great effectiveness and feasibility in obtaining the optimal control policy for nonlinear systems (Fu, Xie, Rakheja, & Na, 2017; Jiang & Zhang, 2018; Wang, in press; Wang, Ha, & Qiao, in press; Wang, He, & Liu, 2017; Wang & Liu, 2018; Wang & Mu, 2017; Wang & Qiao, 2019; Wang & Zhong, 2019; Zhu & Zhao, 2017) and stochastic processes (Bertsekas & Tsitsiklis, 1996). The biggest breakthrough of the ADP algorithm lies in its use of neural networks to represent the iterative value function and the iterative control law. For Markov decision processes, Bertsekas (1995) presented the method of "approximate value iteration" to obtain the optimal performance index function for discrete stochastic processes with a large state space. In Bertsekas (1995), the author used a neural structure to approximate the corresponding control law and value function.

Footnote: This work was supported in part by Beijing Natural Science Foundation, China under Grant JQ19013, in part by the National Natural Science Foundation of China under Grant 61773373 and Grant 61533017, and in part by the Youth Innovation Promotion Association of the Chinese Academy of Sciences. Corresponding author: Ding Wang, Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China. E-mail addresses: [email protected] (M. Liang), [email protected] (D. Wang), [email protected] (D. Liu).

In Bertsekas (2011a), the author discussed several issues related to the approximate dynamic programming method, such as the convergence and the convergence rate of the sequence of policies. In Bertsekas (2011b), the author presented implementations of the method that combine the λ-policy iteration algorithm with the approximate dynamic programming algorithm, and reviewed issues such as the bias issue and the exploration issue. The approximate dynamic programming method has also been applied to many practical Markov decision processes, such as missile defense and interceptor allocation problems (Bertsekas, Homer, Logan, Patek, & Sandell, 2000) and retailer inventory management problems (Roy, Bertsekas, Lee, & Tsitsiklis, 1997).

Stable ADP algorithms aim at obtaining the optimal stable policy for dynamic systems. In Bertsekas (2017b), the author analyzed how to obtain the optimal stable policy for discrete-time deterministic systems. In Bertsekas (2018a), the author proved for the first time that the proper policy iteration method can be applied not only to stochastic shortest path (SSP) problems with finite state space (which can be viewed as Markov decision processes), but also to SSP problems with infinite state space. Furthermore, analyses of the relationship between the controllability of the system and the stability of the iterative control laws were also presented. In Wei and Liu (2014), the authors proposed the "θ-ADP" algorithm. It is shown that each iterative decision rule obtained by the proposed "θ-ADP" algorithm can be guaranteed to stabilize the nonlinear system.


In Bertsekas and Yu (2016), Bertsekas and Yu analyzed how to obtain the optimal stable policy for finite-state Markov decision processes. In Bertsekas and Yu (2016), the authors also proved that in some situations, the iteration method used in traditional ADP algorithms might not obtain the optimal performance index function. It is demonstrated that for some finite-state Markov decision processes the optimal policy may not even be stable. In Bertsekas (2017a), based on the study of stable policies in Markov decision processes, Bertsekas presented the novel notion of regular policies. In traditional studies, the policy iteration method and the value iteration method search for the optimum in the whole policy space, resulting in complicated theoretical analysis. The author pointed out that if we could search for the optimal policy and value function in the regular policy space, then the iterative policies and iterative value functions would exhibit favorable properties. In Bertsekas (2016), Bertsekas analyzed how to obtain the optimal stable policy for Markov decision processes with uncertainties. In Bertsekas (2018b), Bertsekas analyzed how to obtain the optimal stable policy for infinite-state Markov decision processes. In Bertsekas (2018b), the author also pointed out that the overall optimal performance index function of the Markov decision processes might not be the same as the optimal performance index function over the stable policies.

The value iteration algorithms mentioned above require that the iterative decision rule and the iterative value function be updated an infinite number of times to obtain the optimal performance index function. In practical projects, it is very difficult to meet these requirements for the conventional value iteration algorithms. First, for many practical applications, the existence of noise makes it very hard to implement the above algorithms. Second, the initial value function is usually required to be zero, or other strict conditions must be satisfied. In practical cases, it is often impossible to find such a value function that meets these strict conditions. Third, for many practical applications, it is impossible to execute the algorithm an infinite number of times. Instead, we usually terminate the algorithm within a finite number of steps. Hence, we may obtain an iterative decision rule and an iterative value function instead of the optimal ones. Here, we must be aware that although the optimal stationary policy is able to stabilize the system, the obtained iterative decision rule or iterative stationary policy may not stabilize the system. Thus it is necessary to develop methods to overcome these shortcomings of the conventional value iteration algorithm.

In the present paper, we present an improved value iteration ADP algorithm to solve optimal control problems for stochastic processes. We summarize our main contributions and novelties as follows:

(1) For the first time, we propose a new criterion to identify whether the obtained policy is stable for stochastic processes.
(2) For the first time, we combine the improved value iteration method with neural networks to overcome the problems raised in stochastic models with large state spaces, such as the curse of dimensionality and the heavy computational burden.
(3) For the first time, we show that the sequence of value functions generated by our algorithm can finally reach the optimum, using a new convergence approach.
(4) We show that the strict conditions applied to the initial value function are significantly relaxed.

This paper is organized as follows. In Section 2, we present the system dynamics and some assumptions, and derive the improved value iteration ADP algorithm for stochastic processes. In Section 3, the monotonicity and convergence properties of the iterative value function are developed. In Section 4, we illustrate how our algorithm utilizes neural networks to approximate the corresponding value functions and decision rules. In Section 5, two simulation examples are presented to validate the effectiveness of the developed method. Finally, in Section 6, we conclude this paper with a few remarks.


2. Problem formulations

2.1. System descriptions

In this paper, we focus on the following discrete-time dynamic systems:

x(k + 1) = F(x(k), a(k), ω(k), k),  k = 0, 1, 2, 3, . . . ,     (1)

where x(k) ∈ X is the system state, a(k) ∈ A_x is the action made by the decision maker observing the system state x(k), and ω(k) is the environment disturbance which is independent of ω(τ) for all τ < k. Here we denote the set of possible system states as X and the allowable actions in state x as A_x. Given the current system state x(k) and the current action a(k), according to Puterman (1994), the next system state x(k + 1) is determined by a probability distribution p(·|x(k), a(k)). We denote the decision rule as the function d(x, k): X × N → A_x, which specifies the action choice a(k) = d(x, k) when the system occupies state x at time k. Let the policy

π = (d(x, 0), d(x, 1), d(x, 2), . . . , d(x, k), . . .)     (2)

be an arbitrary sequence of decision rules, where d(x, k) denotes the decision rule at time k. The expected total reward for state x(0) under the policy π is defined as

J^π(x(0)) = E{ Σ_{k=0}^{∞} U(x(k), d(x, k)) },     (3)

where U(x(k), d(x, k)) is the utility function. The goal of the presented algorithm is to find an optimal policy to minimize the performance index function (3). For convenience of analysis, the results of this paper are based on the following assumptions.

Assumption 1 (cf. Puterman, 1994). The function F(x, a, ω, k) is Lipschitz continuous on its domain. The system state set X and the action set A_x are discrete (finite or countably infinite). The functions U(x, a) and p(·|x, a) do not vary with time. The decision rule d(x, k) satisfies d(0, k) = 0. The noise ω(k) is associated with the current system state x and the current system action d(x, k), and satisfies ω(k) = 0 when d(x, k) = 0.

Assumption 2. Let V denote the set of bounded real-valued functions on X. To compare values of policies and make statements about the monotone convergence of algorithms, we assume further that V is partially ordered, with partial order corresponding to component-wise ordering. That is, for V_a ∈ V and V_b ∈ V, if V_a(x) ≤ V_b(x) for all x ∈ X, then V_a ≤ V_b.

Define the set of decision rules at time k as D_k, and let Ξ denote the set of all policies, i.e., Ξ = D_0 × D_1 × D_2 × D_3 × · · ·, where × denotes the Cartesian product. Then, we can express the optimal performance index function as

J*(x) = inf_{π∈Ξ} { J^π(x) }.     (4)

Further, according to Puterman (1994), the optimal performance index function satisfies the Bellman equation

J*(x) = min_{a∈A_x} { U(x, a) + Σ_{j∈X} p(j|x, a) J*(j) }.     (5)

To obtain this optimal index function, an improved value iteration algorithm will be developed.
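For readers who want to connect (5) with computation, the following is a minimal sketch (not taken from the paper) of one Bellman optimality backup on a hypothetical finite model; the arrays p and U, their sizes, and the random numbers are illustrative assumptions only.

```python
# Minimal sketch of the Bellman optimality backup in (5) on a hypothetical finite MDP.
import numpy as np

n_states, n_actions = 4, 2                       # illustrative sizes
rng = np.random.default_rng(0)

p = rng.random((n_actions, n_states, n_states))  # p[a, x, j] = p(j | x, a)
p /= p.sum(axis=2, keepdims=True)                # rows sum to one
U = rng.random((n_states, n_actions))            # U[x, a] = utility to be minimized

def bellman_backup(V):
    """Apply the right-hand side of (5) to a value vector V."""
    Q = U + np.einsum('axj,j->xa', p, V)         # U(x,a) + sum_j p(j|x,a) V(j)
    return Q.min(axis=1), Q.argmin(axis=1)       # minimized value and greedy decision rule

V_new, d = bellman_backup(np.zeros(n_states))
```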


2.2. Derivations of the value iteration algorithm

In this section, we develop an improved value iteration algorithm to obtain the optimal stable policy and the optimal performance index function for the stochastic processes. For all x ∈ X, let

V_0(x) = Ψ(x),     (6)

where Ψ(x) is an arbitrary positive semi-definite function. Then, for all x ∈ X, the iterative decision rule d_0(x), which forms the stationary policy π̃_0 ≡ (d_0(x), d_0(x), d_0(x), . . .), is computed as

d_0(x) = arg min_{a∈A_x} { U(x, a) + Σ_{j∈X} p(j|x, a) V_0(j) }.     (7)

For all i = 1, 2, 3, . . ., let V_i(x) be the iterative value function that satisfies

V_i(x) = U(x, d_{i−1}(x)) + Σ_{j∈X} p(j|x, d_{i−1}(x)) V_{i−1}(j).     (8)

The iterative decision rule d_i(x), which forms the stationary policy π̃_i ≡ (d_i(x), d_i(x), d_i(x), . . .), is computed as

d_i(x) = arg min_{a∈A_x} { U(x, a) + Σ_{j∈X} p(j|x, a) V_i(j) }.     (9)

The algorithm will iterate between (8) and (9).
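As a companion to (6)–(9), the snippet below sketches how the iteration between (8) and (9) could be carried out on a small tabular model. The random model, its sizes, and the zero initial value function are assumptions for illustration; the paper additionally requires a positive semi-definite V_0, a zero-cost state x = 0, and the conditions of Assumption 1.

```python
# Tabular sketch of the iteration between (8) and (9); the model below is hypothetical.
import numpy as np

def improved_value_iteration(p, U, V0, num_iters=50):
    """p[a, x, j] = p(j|x,a); U[x, a] = utility; V0 = initial value function as in (6)."""
    V = V0.copy()
    for _ in range(num_iters):
        Q = U + np.einsum('axj,j->xa', p, V)     # U(x,a) + sum_j p(j|x,a) V_i(j)
        d = Q.argmin(axis=1)                     # decision rule d_i(x), as in (9)
        V = Q[np.arange(V.size), d]              # V_{i+1}(x), as in (8)
    return V, d

# Usage with a small random model (for illustration only)
rng = np.random.default_rng(1)
p = rng.random((2, 4, 4)); p /= p.sum(axis=2, keepdims=True)
U = rng.random((4, 2))
V, d = improved_value_iteration(p, U, V0=np.zeros(4))
```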

3. Improved value iteration ADP algorithm

3.1. Properties of the improved value iteration ADP algorithm

In the remainder of this paper, we let |X| denote the number of elements in X, U_{d(x)} denote the |X|-dimensional vector with xth component U(x, d(x)), V denote the |X|-dimensional vector with xth component V(x), and P_{d(x)} the |X| × |X| matrix with (x, j)th element given by p(j|x, d(x)). For each V(x) ∈ V, we define the norm of V(x) by

‖V(x)‖ = sup_{x∈X} |V(x)|.     (10)

For the discrete set X, we refer to elements of V as vectors. Define the norm of the vector U_{d(x)}, written ‖U_{d(x)}‖, as

‖U_{d(x)}‖ = sup_{x∈X} |U(x, d(x))|.     (11)

Define the norm of the vector V as ‖V‖, and specifically,

‖V‖ = sup_{x∈X} |V(x)|.     (12)

According to Kreyszig (1978), P_{d(x)} is a bounded linear transformation on V. Define the norm of P_{d(x)} as ‖P_{d(x)}‖, and specifically,

‖P_{d(x)}‖ = sup { ‖P_{d(x)} V‖ : ‖V‖ ≤ 1, V ∈ V }.     (13)

When X is discrete, P_{d(x)} is a matrix with components p(j|x, d(x)). This definition implies that

‖P_{d(x)}‖ = sup_{x∈X} Σ_{j∈X} p(j|x, d(x)).     (14)

When P_{d(x)} is a probability matrix, we have ‖P_{d(x)}‖ = 1. Inspired by Lincoln and Rantzer (2006) and Rantzer (2006), we present a new convergence analysis method for the present value iteration algorithm.

Theorem 1. For i = 0, 1, 2, 3, . . ., let V_i(x) and d_i(x) be obtained by (6)–(9). Given constants ζ̲, ζ̄, η̲ and η̄ that satisfy

0 ≤ ζ̲ ≤ ζ̄ < ∞,     (15)

0 ≤ η̲ ≤ η̄ < 1,     (16)

ζ̲ U(x, a) ≤ Σ_{j∈X} p(j|x, a) J*(j) ≤ ζ̄ U(x, a),     (17)

and

η̲ J*(x) ≤ V_0(x) ≤ η̄ J*(x),     (18)

respectively, for x ∈ X, the iterative value function V_i(x) satisfies

( 1 + ζ̄^i(η̲ − 1)/(1 + ζ̄)^i ) J*(x) ≤ V_i(x) ≤ ( 1 + ζ̲^i(η̄ − 1)/(1 + ζ̲)^i ) J*(x).     (19)

Proof. First, we prove that

( 1 + ζ̄^i(η̲ − 1)/(1 + ζ̄)^i ) J*(x) ≤ V_i(x)     (20)

holds for i = 0, 1, 2, . . .. According to (18), the left-hand side of the inequality (19) obviously holds for i = 0. Let i = 1. Based on the left-hand side of (18), it is easy to obtain

V_1(x) = min_{a∈A_x} { U(x, a) + Σ_{j∈X} p(j|x, a) V_0(j) }
       ≥ min_{a∈A_x} { U(x, a) + Σ_{j∈X} p(j|x, a) η̲ J*(j) }
       = min_{a∈A_x} { U(x, a) + η̲ Σ_{j∈X} p(j|x, a) J*(j) }.     (21)

By adding ζ̄(η̲ − 1)/(1 + ζ̄) U(x, a) to and subtracting the same term from (21), (21) can easily be transformed into

V_1(x) ≥ min_{a∈A_x} { U(x, a) + η̲ Σ_{j∈X} p(j|x, a) J*(j) + ζ̄(η̲ − 1)/(1 + ζ̄) U(x, a) + ζ̄(1 − η̲)/(1 + ζ̄) U(x, a) }.     (22)

Since Σ_{j∈X} p(j|x, a) J*(j) ≤ ζ̄ U(x, a), (22) can be developed into

V_1(x) ≥ min_{a∈A_x} { U(x, a) + η̲ Σ_{j∈X} p(j|x, a) J*(j) + ζ̄(η̲ − 1)/(1 + ζ̄) U(x, a) + (1 − η̲)/(1 + ζ̄) Σ_{j∈X} p(j|x, a) J*(j) }.     (23)

Combining similar terms of (23), we can obtain

V_1(x) ≥ min_{a∈A_x} { ( 1 + ζ̄(η̲ − 1)/(1 + ζ̄) ) U(x, a) + ( η̲ − (η̲ − 1)/(1 + ζ̄) ) Σ_{j∈X} p(j|x, a) J*(j) }.     (24)

According to the Bellman equation (5), (24) becomes

V_1(x) ≥ ( 1 + ζ̄(η̲ − 1)/(1 + ζ̄) ) J*(x).     (25)

Assume (20) holds for i = l − 1, l = 1, 2, 3, . . .. Then for i = l, we have

V_l(x) = min_{a∈A_x} { U(x, a) + Σ_{j∈X} p(j|x, a) V_{l−1}(j) }
       ≥ min_{a∈A_x} { U(x, a) + Σ_{j∈X} p(j|x, a) ( 1 + ζ̄^{l−1}(η̲ − 1)/(1 + ζ̄)^{l−1} ) J*(j) }.     (26)

By adding ζ̄^l(η̲ − 1)/(1 + ζ̄)^l U(x, a) to and subtracting the same term from (26), (26) can easily be transformed into

V_l(x) ≥ min_{a∈A_x} { U(x, a) + ( 1 + ζ̄^{l−1}(η̲ − 1)/(1 + ζ̄)^{l−1} ) Σ_{j∈X} p(j|x, a) J*(j) + ζ̄^l(η̲ − 1)/(1 + ζ̄)^l U(x, a) + ζ̄^l(1 − η̲)/(1 + ζ̄)^l U(x, a) }.     (27)

Since Σ_{j∈X} p(j|x, a) J*(j) ≤ ζ̄ U(x, a), (27) can be developed into

V_l(x) ≥ min_{a∈A_x} { U(x, a) + ( 1 + ζ̄^{l−1}(η̲ − 1)/(1 + ζ̄)^{l−1} ) Σ_{j∈X} p(j|x, a) J*(j) + ζ̄^l(η̲ − 1)/(1 + ζ̄)^l U(x, a) + ζ̄^{l−1}(1 − η̲)/(1 + ζ̄)^l Σ_{j∈X} p(j|x, a) J*(j) }.     (28)

Combining similar terms of (28), we can obtain

V_l(x) ≥ min_{a∈A_x} { ( 1 + ζ̄^l(η̲ − 1)/(1 + ζ̄)^l ) U(x, a) + ( 1 + ζ̄^l(η̲ − 1)/(1 + ζ̄)^l ) Σ_{j∈X} p(j|x, a) J*(j) }
       = min_{a∈A_x} { ( 1 + ζ̄^l(η̲ − 1)/(1 + ζ̄)^l ) ( U(x, a) + Σ_{j∈X} p(j|x, a) J*(j) ) }.     (29)

According to the Bellman equation (5), we obtain

V_l(x) ≥ ( 1 + ζ̄^l(η̲ − 1)/(1 + ζ̄)^l ) J*(x).     (30)

The proof of

V_i(x) ≤ ( 1 + ζ̲^i(η̄ − 1)/(1 + ζ̲)^i ) J*(x),  i = 0, 1, 2, . . .     (31)

follows similar steps. The proof is completed. □

Theorem 2. For i = 0, 1, 2, 3, . . ., let V_i(x) and d_i(x) be obtained by (6)–(9). Given constants ζ̲, ζ̄, η̲ and η̄ that satisfy

0 ≤ ζ̲ ≤ ζ̄ < ∞,     (32)

0 ≤ η̲ ≤ 1 ≤ η̄ < ∞,     (33)

(17) and (18), respectively, for i = 0, 1, 2, 3, . . ., the iterative value function V_i(x) can be restricted by

( 1 + ζ̄^i(η̲ − 1)/(1 + ζ̄)^i ) J*(x) ≤ V_i(x) ≤ ( 1 + ζ̄^i(η̄ − 1)/(1 + ζ̄)^i ) J*(x).     (34)

Proof. Using the same technique as in Theorem 1, we can obtain (34) similarly to (21)–(31), and for the sake of simplicity we do not present the detailed proof here. □

Theorem 3. For i = 0, 1, 2, 3, . . ., let V_i(x) and d_i(x) be obtained by (6)–(9). Given constants ζ̲, ζ̄, η̲ and η̄ that satisfy

0 ≤ ζ̲ ≤ ζ̄ < ∞,     (35)

1 ≤ η̲ ≤ η̄ < ∞,     (36)

ζ̲ U(x, a) ≤ Σ_{j∈X} p(j|x, a) J*(j) ≤ ζ̄ U(x, a),     (37)

and

η̲ J*(x) ≤ V_0(x) ≤ η̄ J*(x),     (38)

respectively, for x ∈ X, the iterative value function V_i(x) can be restricted by

( 1 + ζ̲^i(η̲ − 1)/(1 + ζ̲)^i ) J*(x) ≤ V_i(x) ≤ ( 1 + ζ̄^i(η̄ − 1)/(1 + ζ̄)^i ) J*(x).     (39)

Proof. We can obtain (39) similarly to the proof of Theorem 1, and for the sake of simplicity we do not present the detailed proof here. □

Theorem 4. For i = 0, 1, 2, 3, . . ., let V_i(x) and d_i(x) be obtained by (6)–(9). Given constants ζ̲, ζ̄, η̲ and η̄ that satisfy

0 ≤ ζ̲ ≤ ζ̄ < ∞,     (40)

0 ≤ η̲ ≤ η̄ < ∞,     (41)

ζ̲ U(x, a) ≤ Σ_{j∈X} p(j|x, a) J*(j) ≤ ζ̄ U(x, a),     (42)

and

η̲ J*(x) ≤ V_0(x) ≤ η̄ J*(x),     (43)

respectively, for x ∈ X, we can obtain

lim_{i→∞} V_i(x) = J*(x).     (44)

Proof. Notice that

lim_{i→∞} { ( 1 + ζ̲^i(η̲ − 1)/(1 + ζ̲)^i ) J*(x) } = J*(x)     (45)

and

lim_{i→∞} { ( 1 + ζ̄^i(η̄ − 1)/(1 + ζ̄)^i ) J*(x) } = J*(x).     (46)

According to Theorems 1, 2, 3, (45) and (46), we have (44) immediately. The proof is completed. □

Remark 1. We can easily see that there must exist constants ζ̲, ζ̄, η̲ and η̄ that satisfy (40)–(43). That is to say, as long as the initial value function is a positive semi-definite function, the sequence of iterative value functions obtained by our algorithm would converge to the optimal performance index function. Next we will analyze the convergence properties of the iterative value functions.

Theorem 5. For i = 0, 1, 2, 3, . . ., let V_i(x) and d_i(x) be obtained by (6)–(9). If the initial value function V_0(x) satisfies

V_1(x) ≤ V_0(x),     (47)

then for i = 0, 1, 2, 3, . . ., we have

0 ≤ V_{i+1}(x) ≤ V_i(x).     (48)

Proof. Based on (9) and (47), the 1st iterative value function and the 2nd iterative value function have the following relationship:

V_2(x) = min_{a∈A_x} { U(x, a) + Σ_{j∈X} p(j|x, a) V_1(j) } ≤ min_{a∈A_x} { U(x, a) + Σ_{j∈X} p(j|x, a) V_0(j) } = V_1(x).     (49)

Using mathematical induction, suppose (48) holds for i = l − 1, l = 2, 3, 4, . . .. Then for i = l we can obtain

V_{l+1}(x) = min_{a∈A_x} { U(x, a) + Σ_{j∈X} p(j|x, a) V_l(j) } ≤ min_{a∈A_x} { U(x, a) + Σ_{j∈X} p(j|x, a) V_{l−1}(j) } = V_l(x).     (50)

The proof is completed. □

Theorem 6. For i = 0, 1, 2, 3, . . ., let V_i(x) and d_i(x) be obtained by (6)–(9). If the initial value function V_0(x) satisfies

V_1(x) ≥ V_0(x),     (51)

then for i = 0, 1, 2, 3, . . ., we have

V_{i+1}(x) ≥ V_i(x) ≥ 0.     (52)

Proof. Using the same technique as in Theorem 5, we can obtain (52) similarly, and we do not present the detailed proof here. □

3.2. Derivations of the stability criteria

In this part, we provide the stability criteria.

Definition 1 (cf. Bertsekas, 2016, 2018a, 2018b). A policy π = (d(x, 0), d(x, 1), d(x, 2), . . . , d(x, k), . . .) is said to be admissible with respect to (3) if, for all k = 0, 1, 2, 3, . . ., the decision rule d(x, k) satisfies d(0, k) = 0 and, for all x ∈ X, J^π(x) is finite.

Definition 2. In a discrete-time Markov chain x(k) with a finite or infinite set of states e_1, e_2, e_3, . . . , e_i, . . ., let these system states satisfy

e_i = 0, i = 1;  e_i ≠ 0, i ≠ 1.     (53)

Let x_k = x(k) represent the state of the system at time k. Let p_k(x|e_i, π) represent the probability that at time k the system occupies the state x when the policy π = (d(x, 0), d(x, 1), d(x, 2), . . . , d(x, k), . . .) is applied to the stochastic process whose initial state is given by e_i. Let P_{k,e_i,π} be expressed by

P_{k,e_i,π} = [ p_k(e_1|e_i, π), p_k(e_2|e_i, π), p_k(e_3|e_i, π), . . . , p_k(e_i|e_i, π), . . . ]ᵀ.     (54)

Clearly, P_{k,e_i,π} is a distribution vector whose elements are all nonnegative. In addition, the elements in this vector add to 1. That is to say, for all π ∈ Ξ,

Σ_{i=1}^{∞} p_k(e_i|e_j, π) = 1     (55)

holds. Let p(x(k + 1)|x(k), d(x, k)) represent the probability that the system goes into state x(k + 1) at time k + 1 given that it was in the state x(k) at time k and the decision maker used the decision rule d(x, k). Then the one-step transition matrix P_{d(x,k)} with the decision rule d(x, k) applied at time k is given by (56). For any π = (d(x, 0), d(x, 1), d(x, 2), . . . , d(x, k), . . .), it is clear that the one-step transition matrix P_{d(x,k)} and the probability function p_k(x|e_i, π) satisfy Eq. (57). (See Box I.)

Box I.

P_{d(x,k)} =
[ p(e_1|e_1, d(e_1,k))  p(e_2|e_1, d(e_1,k))  · · ·  p(e_j|e_1, d(e_1,k))  · · · ]
[ p(e_1|e_2, d(e_2,k))  p(e_2|e_2, d(e_2,k))  · · ·  p(e_j|e_2, d(e_2,k))  · · · ]
[ p(e_1|e_3, d(e_3,k))  p(e_2|e_3, d(e_3,k))  · · ·  p(e_j|e_3, d(e_3,k))  · · · ]
[         ⋮                     ⋮             ⋱            ⋮                   ]
[ p(e_1|e_i, d(e_i,k))  p(e_2|e_i, d(e_i,k))  · · ·  p(e_j|e_i, d(e_i,k))  · · · ]
[         ⋮                     ⋮                          ⋮                   ]     (56)

Π_{m=0}^{n−1} P_{d(x,m)} =
[ p_n(e_1|e_1, π)  p_n(e_2|e_1, π)  · · ·  p_n(e_j|e_1, π)  · · · ]
[ p_n(e_1|e_2, π)  p_n(e_2|e_2, π)  · · ·  p_n(e_j|e_2, π)  · · · ]
[ p_n(e_1|e_3, π)  p_n(e_2|e_3, π)  · · ·  p_n(e_j|e_3, π)  · · · ]
[        ⋮                ⋮          ⋱          ⋮               ]
[ p_n(e_1|e_i, π)  p_n(e_2|e_i, π)  · · ·  p_n(e_j|e_i, π)  · · · ]
[        ⋮                ⋮                     ⋮               ]     (57)

Lemma 1. If a policy π = (d(x, 0), d(x, 1), d(x, 2), . . . , d(x, k), . . .) is admissible, then as the time k increases to infinity, the probability function p_k(x|e_i, π) converges to p_∞(x), which satisfies

p_∞(x) = 1, x = 0;  p_∞(x) = 0, x ≠ 0.     (58)

Proof. According to Definition 1, we have the conclusion that

J^π = U_{d(x,0)} + Σ_{l=1}^{∞} { Π_{m=0}^{l−1} P_{d(x,m)} } U_{d(x,l)} < ∞.     (59)

Writing this with the component-wise notation, we have

U(x, d(x, 0)) + Σ_{k=1}^{∞} { Σ_{j∈X} p_k(j|x, π) U(j, d(j, k)) } < ∞.     (60)

Since the positive definite function U(x, d(x, k)) is finite and the probability function p_k(x|e_i, π) satisfies

0 ≤ p_k(x|e_i, π) ≤ 1,     (61)

we can get the conclusion that

lim_{k→∞} Σ_{j∈X} p_k(j|e_i, π) U(j, d(j, k)) = 0.     (62)

Let P_∞ be expressed as

P_∞ = [ p_∞(e_1), p_∞(e_2), . . . , p_∞(e_i), . . . ]ᵀ.     (63)

Since the positive definite function U(x, d(x, k)) is finite, it is easy to obtain

lim_{k→∞} p_k(x|e_i, π) = p_∞(x|e_i, π) = 1, x = 0;  lim_{k→∞} p_k(x|e_i, π) = p_∞(x|e_i, π) = 0, x ≠ 0.     (64)

This completes the proof. □

Remark 2. According to Lemma 1, we can see that as long as the policy π = (d(x, 0), d(x, 1), d(x, 2), . . . , d(x, k), . . .) applied to the stochastic process is admissible, for large k the probability function p_k(x|e_i, π) should be independent of the initial state e_i. In other words, regardless of the initial state, the Markov chain reaches a steady or stable limiting distribution after a large number of transitions. When such limits exist, the system settles down and becomes stable.

Remark 3. It is proven in Puterman (1994) that the optimal stationary policy π* is an admissible policy. In real-world applications, however, the algorithm cannot be implemented for infinitely many iterations to obtain the optimum. The algorithm must be terminated within a finite number of steps, and an iterative decision rule will be used to control the system. However, the iterative decision rule used to control the system may not stabilize the stochastic process. To overcome this difficulty, the properties of the iterative decision rule d_i(x) will be analyzed. It is necessary to propose a termination criterion to indicate whether the obtained iterative decision rule is admissible or not.

Theorem 7. Suppose that Assumption 1 holds. For i = 0, 1, 2, . . ., let V_i(x) and d_i(x) be obtained by (6)–(9). If for any x ≠ 0, the iterative decision rule d_i(x) guarantees the inequality

V_{i+1}(x) − V_i(x) < U(x, d_i(x)),     (65)

then the iterative stationary policy π_i = (d_i(x), d_i(x), d_i(x), . . .) is an admissible policy.

Proof. According to (65), there must exist a constant −∞ < θ < 1 satisfying

V_{i+1}(x) − V_i(x) < θ U(x, d_i(x)).     (66)

According to (9), the inequality (66) can be written as

Σ_{j∈X} p(j|x, d_i(x)) V_i(j) − V_i(x) < (θ − 1) U(x, d_i(x)).     (67)

Using the vector notation, (67) can be written as

Pᵀ_{1,x,π_i} V_i − Pᵀ_{0,x,π_i} V_i < (θ − 1) Pᵀ_{0,x,π_i} U_{d_i(x)},     (68)

where Pᵀ_{1,x,π_i} = Pᵀ_{0,x,π_i} P_{d_i(x)}. Here, we should notice that (68) holds for any x ≠ 0. For the system state x = 0, we have the conclusion

Pᵀ_{1,0,π_i} V_i − Pᵀ_{0,0,π_i} V_i = 0.     (69)

Combining (68) and (69), we have the conclusion

P_{d_i(x)} V_i − V_i ≤ (θ − 1) U_{d_i(x)}.     (70)

Now we let the vector p̂_0 be expressed as

p̂_0 = P_{d_i(x)} V_i − V_i − (θ − 1) U_{d_i(x)}.     (71)

Then according to (70), we know that each element of the vector p̂_0 satisfies

p̂_0(x) ≤ 0, ∀x ∈ X,     (72)

where p̂_0(x) denotes the element of the vector p̂_0. Then for any x ∈ X, we have the following conclusion

Σ_{j∈X} p(j|x, d_i(x)) p̂_0(j) ≤ Σ_{j∈X} p(j|x, d_i(x)) · 0 = 0.     (73)

Using the vector notation, (73) can be written as

P_{d_i(x)} p̂_0 ≤ 0.     (74)

That is to say,

P²_{d_i(x)} V_i − P_{d_i(x)} V_i ≤ (θ − 1) P_{d_i(x)} U_{d_i(x)}.     (75)

By mathematical induction, we can easily prove that

P^{k+1}_{d_i(x)} V_i − P^k_{d_i(x)} V_i ≤ (θ − 1) P^k_{d_i(x)} U_{d_i(x)}     (76)

holds for any k = 0, 1, 2, 3, . . .. According to (57), we can easily show that

Pᵀ_{k+1,x,π_i} V_i − Pᵀ_{k,x,π_i} V_i ≤ (θ − 1) Pᵀ_{k,x,π_i} U_{d_i(x)}     (77)

holds for any x ≠ 0 and k = 0, 1, 2, 3, . . .. Since −∞ < θ < 1 and U_{d_i(x)} is positive, we can get

Pᵀ_{k+1,x,π_i} V_i − Pᵀ_{k,x,π_i} V_i < 0,  if Pᵀ_{k,x,π_i} ≠ Pᵀ_∞;
Pᵀ_{k+1,x,π_i} = Pᵀ_{k+2,x,π_i} = · · · = Pᵀ_∞,  if Pᵀ_{k,x,π_i} = Pᵀ_∞.     (78)

Define the set of distribution vectors as P, which is expressed as

P = { P | P = P_{k,x,π}, ∀k = 0, 1, 2, . . . , ∀x ∈ X, ∀π ∈ Ξ }.     (79)

Define the function G_i(P): P → R as

G_i(P) = Pᵀ V_i.     (80)

Since Pᵀ_{k,x,π_i} V_i is monotonically decreasing and bounded from below by zero as k increases from 0 to infinity, we have the conclusion

lim_{k→∞} Pᵀ_{k,x,π_i} V_i = c ≥ 0.     (81)

Next we will prove that

lim_{k→∞} Pᵀ_{k,x,π_i} V_i = 0.     (82)

(82) can be proven by contradiction. Assume that

lim_{k→∞} Pᵀ_{k,x,π_i} V_i > 0.     (83)

Noticing the continuity of the function G_i(P) and the fact that

G_i(P) > 0 for P ∈ P, P ≠ P_∞;  G_i(P) = 0 for P = P_∞,     (84)

we can get the conclusion that there must exist a number η > 0 which satisfies

‖P_{k,x,π_i} − P_∞‖ ≥ η, ∀k = 0, 1, 2, . . . .     (85)

Let

−ζ = max_{ {P | ‖P − P_∞‖ ≥ η} } { Pᵀ P_{d_i(x)} V_i − Pᵀ V_i }.     (86)

Then, −ζ exists because the continuous function

Pᵀ P_{d_i(x)} V_i − Pᵀ V_i     (87)

has a maximum over the compact set {P | ‖P − P_∞‖ ≥ η}. Based on (78), we have −ζ < 0. It follows that

Pᵀ_{k,x,π_i} V_i = Pᵀ_{0,x,π_i} V_i + Σ_{m=1}^{k} ( Pᵀ_{m,x,π_i} V_i − Pᵀ_{m−1,x,π_i} V_i ) ≤ Pᵀ_{0,x,π_i} V_i − ζ k.     (88)

Since the right-hand side will eventually become negative, the inequality contradicts the assumption that c > 0. Hence we have

lim_{k→∞} Pᵀ_{k,x,π_i} V_i = 0.     (89)

Noting that only P_∞ in P satisfies Pᵀ_∞ V_i = 0, we have

lim_{k→∞} Pᵀ_{k,x,π_i} = Pᵀ_∞.     (90)

On the other hand, according to (76), we can get

P_{d_i(x)} V_i − P⁰_{d_i(x)} V_i ≤ (θ − 1) P⁰_{d_i(x)} U_{d_i(x)},
P²_{d_i(x)} V_i − P_{d_i(x)} V_i ≤ (θ − 1) P_{d_i(x)} U_{d_i(x)},
. . .
P^{k+1}_{d_i(x)} V_i − P^k_{d_i(x)} V_i ≤ (θ − 1) P^k_{d_i(x)} U_{d_i(x)}.     (91)

As lim_{k→∞} Pᵀ_{k,x,π_i} V_i = 0, we have

lim_{k→∞} P^{k+1}_{d_i(x)} V_i = 0.     (92)

Letting k → ∞ and summing the inequalities in (91), we can get

P⁰_{d_i(x)} V_i ≥ (1 − θ) Σ_{m=0}^{∞} P^m_{d_i(x)} U_{d_i(x)}.     (93)

Since each component in the vector P⁰_{d_i(x)} V_i is finite and −∞ < θ < 1, we can obtain that each component in the vector Σ_{m=0}^{∞} P^m_{d_i(x)} U_{d_i(x)} is also finite, which proves the conclusion. □

Theorem 8. Suppose that Assumption 1 holds. For i = 0, 1, 2, . . ., let V_i(x) and d_i(x) be obtained by (6)–(9). Then, for all x ≠ 0, there exists a finite N > 0 satisfying

V_{N+1}(x) − V_N(x) < U(x, d_N(x)).     (94)

Proof. The conclusion can be proven by contradiction. Assume that (94) is false, i.e., for all N = 0, 1, 2, . . ., there exists an x_N ∈ X that satisfies

V_{N+1}(x_N) − V_N(x_N) ≥ U(x_N, d_N(x_N)).     (95)

Let N → ∞. According to Theorem 4, we can get lim_{N→∞}(V_{N+1}(x_N) − V_N(x_N)) = 0. According to (95), we can get

lim_{N→∞} U(x_N, d_N(x_N)) = lim_{N→∞} U(x_N, d_∞(x_N)) = 0.     (96)

This contradicts the positive definiteness of U(x, a). Hence, the assumption is false and the conclusion holds. □

3.3. Design procedure of the improved value iteration ADP algorithm

Based on the above preparations, we summarize the design procedure of the improved value iteration ADP algorithm in Algorithm 1.

Algorithm 1 Improved value iteration ADP algorithm for stochastic processes
Initialization:
  Choose a computation precision ε;
  Give an initial positive semi-definite value function V_0(x);
Iteration:
  1: Set the iteration index i = 0;
  2: Obtain the decision rule d_i(x) by solving
       d_i(x) = arg min_{a∈A_x} { U(x, a) + Σ_{j∈X} p(j|x, a) V_i(j) };
  3: Obtain V_{i+1}(x) by
       V_{i+1}(x) = U(x, d_i(x)) + Σ_{j∈X} p(j|x, d_i(x)) V_i(j);
  4: Set i = i + 1;
  5: If ‖V_i − V_{i−1}‖ < ε, go to the next step. Else, go to Step 2;
  6: If V_i(x) − V_{i−1}(x) < U(x, d_{i−1}(x)), go to the next step. Else, go to Step 2;
  7: return V_i(x) and d_i(x).
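The listing below is one possible tabular realization of Algorithm 1, given as a sketch under the assumption of a small model whose transition probabilities p(j|x, a) are available as an array and whose state index 0 plays the role of the state x = 0. It is not the authors' neural-network implementation, which is described in Section 4.

```python
# Sketch of Algorithm 1 on a hypothetical tabular model; state index 0 is the zero state.
import numpy as np

def algorithm1(p, U, V0, eps=1e-2, max_iters=10_000):
    """p[a, x, j] = p(j|x,a); U[x, a] = utility; V0 = initial positive semi-definite values."""
    V = V0.copy()
    for _ in range(max_iters):
        Q = U + np.einsum('axj,j->xa', p, V)            # Step 2: U(x,a) + sum_j p(j|x,a) V_i(j)
        d = Q.argmin(axis=1)                            # Step 2: decision rule d_i(x)
        V_next = Q[np.arange(V.size), d]                # Step 3: V_{i+1}(x)
        converged = np.max(np.abs(V_next - V)) < eps    # Step 5: ||V_i - V_{i-1}|| < eps
        # Step 6: admissibility criterion (65), checked for all x != 0
        admissible = np.all((V_next - V)[1:] < U[np.arange(1, V.size), d[1:]])
        V = V_next
        if converged and admissible:
            return V, d
    return V, d
```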

4. Neural network implementation for the improved value iteration ADP algorithm

In practical cases, the number of system states of the stochastic process is usually very large. Hence, we need to utilize approximation structures such as neural networks to approximate the iterative value functions and the iterative decision rules obtained by the improved value iteration algorithm. Here we use neural networks to approximate d_i(x) and V_i(x) for i = 0, 1, 2, . . .. We use κ to denote the hidden layer size, W_h to denote the weight matrix from the input layer to the hidden layer, and W_o to denote the weight matrix from the hidden layer to the output layer. We use b_h to denote the bias vector for the hidden layer and b_o to denote the bias vector for the output layer. Then we represent the neural network as

N̂(W_h, W_o, b_h, b_o, x) = W_oᵀ f(W_hᵀ x + b_h) + b_o,     (97)

where f(W_hᵀ x + b_h) is a κ-dimensional vector with [f(z)]_i = 2/(1 + e^{−2z_i}) − 1, i = 1, 2, 3, . . . , κ. Here we denote f(·) as the activation function.

In our algorithm, we utilize two neural networks (i.e., the critic network outputting V̂_i(x) and the action network outputting d̂_i(x)) to approximate the performance index function V_i(x) and the decision rule d_i(x), respectively. We demonstrate the overall structure diagram in Fig. 1.

Fig. 1. The structure diagram of the proposed algorithm.
Fig. 2. Iterative value functions by the improved value iteration ADP algorithm.
Fig. 3. Iterative value functions by the improved value iteration ADP algorithm.

Fig. 4. Iterative trajectories of system state x1.
Fig. 5. Iterative trajectories of system state x2.
Fig. 6. Iterative trajectories of input control.

4.1. The critic network

We utilize the critic network to approximate the iterative performance index function V_i(x). The output of the critic network can be expressed as

V̂^l_i(x) = W^{lT}_{oci} f(Wᵀ_{hc} x + b_{hc}) + b^l_{oci},     (98)

where l = 0, 1, 2, 3, . . .. Let W⁰_{oci}, W_{hc}, b_{hc} and b⁰_{oci} be random weight matrices. When we train the critic network, the hidden-output weight matrix W^l_{oci} and the bias b^l_{oci} are updated, while the input-hidden weight matrix W_{hc} and the bias b_{hc} are fixed. Here we can present the target of the critic network as

V_i(x) = U(x, d̂_{i−1}(x)) + Σ_{j∈X} p(j|x, d̂_{i−1}(x)) V̂_{i−1}(j).     (99)

Then we define the error function for the critic network as

e^l_{ci}(x) = V_i(x) − V̂^l_i(x).     (100)

Applying (98) and (99) to (100), we can obtain

e^l_{ci}(x) = { U(x, d̂_{i−1}(x)) + Σ_{j∈X} p(j|x, d̂_{i−1}(x)) V̂_{i−1}(j) } − { W^{lT}_{oci} f(Wᵀ_{hc} x + b_{hc}) + b^l_{oci} }.     (101)

Combining (98) with (101), we can define ∇_c as

∇_c ≜ ∂e^l_{ci}(x)/∂W^l_{oci} = −f(Wᵀ_{hc} x + b_{hc}).     (102)

The objective function to be minimized for the critic network is

E^l_{ci}(x) = (1/2)(e^l_{ci}(x))².     (103)

The gradient-based weight updating rule can be applied to train the critic network as

W^{l+1}_{oci} = W^l_{oci} + ∆W^l_{oci} = W^l_{oci} − ρ_{cw} { ∂E^l_{ci}(x)/∂e^l_{ci}(x) · ∂e^l_{ci}(x)/∂W^l_{oci} } = W^l_{oci} − ρ_{cw} e^l_{ci}(x) ∇_c,     (104)

b^{l+1}_{oci} = b^l_{oci} + ∆b^l_{oci} = b^l_{oci} − ρ_{cb} { ∂E^l_{ci}(x)/∂V̂^l_i(x) · ∂V̂^l_i(x)/∂b^l_{oci} } = b^l_{oci} − ρ_{cb} e^l_{ci}(x),     (105)

where ρ_{cw} > 0 and ρ_{cb} > 0 are the learning rates of the critic network. If the training precision is achieved, then we say that the performance index function V_i(x) can be approximated by the critic network.

4.2. The action network

Here we utilize the action network to approximate the iterative decision rule d_i(x). The output of the action network can be expressed as

d̂^l_i(x) = W^{lT}_{oai} f(Wᵀ_{ha} x + b_{ha}) + b^l_{oai},     (106)

where l = 0, 1, 2, . . .. Let W⁰_{oai}, W_{ha}, b_{ha} and b⁰_{oai} be random weight matrices. When we train the action network, the hidden-output weight matrix W^l_{oai} and the bias b^l_{oai} are updated, while the input-hidden weight matrix W_{ha} and the bias b_{ha} are fixed. Here we present the target function as follows

d_i(x) = arg min_{a∈A_x} { U(x, a) + Σ_{j∈X} λ p(j|x, a) V̂_i(j) }.     (107)

Then we define the error function for the action network for all x ∈ X as

e^l_{ai}(x) = d̂^l_i(x) − d_i(x).     (108)

The objective function to be minimized for the action network is

E^l_{ai}(x) = (1/2)(e^l_{ai}(x))ᵀ(e^l_{ai}(x)).     (109)

The gradient-based weight updating rule can be applied to train the action network as follows

W^{l+1}_{oai} = W^l_{oai} + ∆W^l_{oai} = W^l_{oai} − ρ_{aw} { ∂E^l_{ai}(x)/∂e^l_{ai}(x) · ∂e^l_{ai}(x)/∂d̂^l_i(x) · ∂d̂^l_i(x)/∂W^l_{oai} } = W^l_{oai} − ρ_{aw} f(Wᵀ_{ha} x + b_{ha})(e^l_{ai}(x))ᵀ,     (110)

b^{l+1}_{oai} = b^l_{oai} + ∆b^l_{oai} = b^l_{oai} − ρ_{ab} { ∂E^l_{ai}(x)/∂e^l_{ai}(x) · ∂e^l_{ai}(x)/∂d̂^l_i(x) · ∂d̂^l_i(x)/∂b^l_{oai} } = b^l_{oai} − ρ_{ab} e^l_{ai}(x),     (111)

where ρ_{aw} > 0 and ρ_{ab} > 0 are the learning rates of the action network. If the training precision is achieved, then the decision rule d_i(x) is well approximated by the action network.
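To make the weight-update equations concrete, here is a small sketch of the critic update (100)–(105) for a single state sample, using the network form (97)–(98). All sizes, learning rates, the sample state, and the target value are illustrative assumptions, and the bias step is written as a plain gradient-descent step on (103); the action-network update (110)–(111) follows the same pattern with a vector-valued output.

```python
# Sketch of one critic-network update for a single state sample (hypothetical data).
import numpy as np

rng = np.random.default_rng(0)
n_in, kappa, rho_cw, rho_cb = 2, 8, 0.05, 0.05

W_hc = rng.normal(size=(n_in, kappa)); b_hc = rng.normal(size=kappa)   # fixed input-hidden weights
W_oc = rng.normal(size=(kappa, 1));    b_oc = np.zeros(1)              # trained hidden-output weights

def f(z):                        # activation in (97): 2/(1+exp(-2z)) - 1, i.e. tanh
    return 2.0 / (1.0 + np.exp(-2.0 * z)) - 1.0

def critic(x):                   # V_hat_i^l(x) in (98)
    return (W_oc.T @ f(W_hc.T @ x + b_hc) + b_oc).item()

x = np.array([0.3, -0.2])        # a sample state (assumed)
V_target = 1.7                   # target V_i(x) from (99), assumed precomputed

e = V_target - critic(x)         # error (100)
grad = -f(W_hc.T @ x + b_hc)     # nabla_c in (102)
W_oc = W_oc - rho_cw * e * grad.reshape(-1, 1)   # hidden-output weight update, as in (104)
b_oc = b_oc + rho_cb * e                         # gradient-descent step on E_ci in (103); cf. (105)
```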

5. Examples

To validate the effectiveness of the developed method, two examples are presented for numerical implementation to obtain the optimal performance index function of the stochastic processes.

Fig. 7. Error obtained by (65) with i = 3.
Fig. 8. Error obtained by (65) with i = 6.
Fig. 9. Iterative value functions by improved value iteration ADP algorithm.
Fig. 10. Iterative value functions by improved value iteration ADP algorithm.

Example 1. We consider the torsional pendulum system. The dynamics of the pendulum is given as follows:

x1(k + 1) = x1(k) + ∆T x2(k),
x2(k + 1) = (∆T Mgl/J) sin(x1(k)) + (1 − ∆T fd/J) x2(k) − (∆T/J) u(k) + ω(k).     (112)

The utility function is chosen as

U(x1, x2, u) = k1 x1² + k2 x2² + k3 u².     (113)

The parameters of the system are given in Table 1.

Table 1. Parameters of Example 1.
Parameter    Value
M            1/3 kg
g            9.8 m/s²
l            3/2 m
J            4/3 Ml²
fd           0.2
∆T           0.1 s
k1           1
k2           1
k3           1
ω            Gaussian noise

We denote the system state x as

x = [x1, x2]ᵀ.     (114)

Let the state space be expressed as Ω = {(x1, x2) | −1 ≤ x1 ≤ 1, −1 ≤ x2 ≤ 1}. Let the initial state be x0 = [1, −1]ᵀ. Neural networks are used to implement the improved value iteration ADP algorithm. According to Bertsekas and Tsitsiklis (1996), we can build the initial value function using the critic network, where the initial value function can be expressed as (98).
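As a complement to the description above, the following is a rough simulation sketch of the pendulum model (112)–(113) with the Table 1 parameters. The noise scale, the placeholder zero control, and the sign of the control term follow the reconstruction above and are assumptions for illustration; this is not the authors' simulation code.

```python
# Sketch of simulating the discretized torsional pendulum (112) with utility (113).
import numpy as np

M, g, l = 1.0 / 3.0, 9.8, 3.0 / 2.0
J = 4.0 / 3.0 * M * l ** 2
fd, dT = 0.2, 0.1
k1 = k2 = k3 = 1.0
rng = np.random.default_rng(0)

def step(x1, x2, u):
    """One step of (112); omega(k) = 0 when the action is 0 (Assumption 1), scale assumed."""
    omega = 0.01 * rng.normal() if u != 0.0 else 0.0
    x1_next = x1 + dT * x2
    x2_next = (dT * M * g * l / J) * np.sin(x1) + (1 - dT * fd / J) * x2 - (dT / J) * u + omega
    return x1_next, x2_next

def utility(x1, x2, u):
    return k1 * x1 ** 2 + k2 * x2 ** 2 + k3 * u ** 2   # eq. (113)

x1, x2, cost = 1.0, -1.0, 0.0                          # initial state x0 = [1, -1]^T
for k in range(100):
    u = 0.0                                            # placeholder for the learned decision rule
    cost += utility(x1, x2, u)
    x1, x2 = step(x1, x2, u)
```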

Fig. 11. Iterative value functions by improved value iteration ADP algorithm.
Fig. 12. Optimal performance index function obtained by improved value iteration ADP algorithm.

In the numerical implementation, we compute the iterative decision rule and the iterative value function for 100 iterations until the computation precision ε = 0.01 is satisfied. Figs. 2 and 3 demonstrate the iterative value functions obtained by the proposed improved value iteration ADP algorithm, where "In" indicates the initial iteration item and "Lm" indicates the limiting iteration item.

Figs. 4 and 5 demonstrate the iterative system state trajectories obtained by the proposed algorithm. Fig. 6 demonstrates the iterative control trajectories obtained by the proposed algorithm. Fig. 7 shows the error obtained by (65) with i = 3. From Fig. 7, we conclude that not all x ∈ Ω satisfy V4(x) − V3(x) < U(x, d3(x)). Fig. 8 shows the error obtained by (65) with i = 6. From Fig. 8, we conclude that all x ∈ Ω satisfy V7(x) − V6(x) < U(x, d6(x)). Theorem 7 indicates that the 3rd iterative decision rule d3(x) is not admissible and the 6th iterative decision rule d6(x) is admissible. These conclusions are validated by Figs. 4–6: Figs. 4(b), 5(b), and 6(b) show that the 3rd iterative decision rule d3(x) is not admissible, while Figs. 4(c), 5(c), and 6(c) show that the 6th iterative decision rule d6(x) is admissible.

Example 2. We consider the following system

dx1/dt = −x1 + x2 u + ω1,
dx2/dt = −x2 + (1 + cos²(x1)) sin(u) + u + ω2.     (115)

We discretize the above system using the sampling interval ∆T = 0.001 s. This yields

x1(k + 1) = (1 − ∆T) x1(k) + ∆T x2(k) u(k) + ∆T ω1(k),
x2(k + 1) = (1 − ∆T) x2(k) + ∆T u(k) + ∆T ω2(k) + ∆T (1 + cos²(x1(k))) sin(u(k)).     (116)

The utility function is chosen as

U(x, u) = xᵀQx + uᵀRu,     (117)

where Q and R denote identity matrices with suitable dimensions.
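For reference, a tiny sketch of one forward-Euler step corresponding to (115)–(116) is given below; the noise scale is an illustrative assumption.

```python
# One forward-Euler step of (115) with dT = 0.001 s, matching the discretization (116).
import numpy as np

dT = 0.001
rng = np.random.default_rng(0)

def step(x1, x2, u):
    w1, w2 = 0.01 * rng.normal(size=2)                 # illustrative noise scale
    x1_next = (1 - dT) * x1 + dT * x2 * u + dT * w1
    x2_next = (1 - dT) * x2 + dT * u + dT * w2 + dT * (1 + np.cos(x1) ** 2) * np.sin(u)
    return x1_next, x2_next
```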

Fig. 13. Iterative trajectories of system state x1.

We denote the system state x the same as (114). Let the state space be expressed as Ω = {(x1, x2) | −1 ≤ x1 ≤ 1, −1 ≤ x2 ≤ 1}. Let the initial state be x0 = [1, −1]ᵀ. Neural networks are used to implement the improved value iteration ADP algorithm. According to Bertsekas and Tsitsiklis (1996), we can build the initial value function using the critic network, where the initial value function can be expressed as (98). In the numerical implementation, we compute the iterative decision rule and the iterative value function for 1000 iterations until the computation precision ε = 0.01 is satisfied. Figs. 9 and 10 demonstrate the iterative value functions obtained by the proposed algorithm. From Fig. 9, we can see V1(x) ≥ V0(x) and the iterative value functions Vi(x) form a monotonically nondecreasing sequence. From Fig. 10, we can see V1(x) ≤ V0(x) and the iterative value functions Vi(x) form a monotonically nonincreasing sequence. Figs. 11 and 12 demonstrate the iterative value functions obtained by the proposed algorithm, where Fig. 11(a) is the initial value function and Fig. 12 is the optimal performance index function. Figs. 13–15 demonstrate the iterative system state trajectories and control laws corresponding to the iterative value functions demonstrated in Figs. 11 and 12. Fig. 16 shows the error obtained by (65) with i = 224. From Fig. 16, we conclude that not all x ∈ Ω satisfy V225(x) − V224(x) < U(x, d224(x)). Fig. 17 shows the error obtained by (65) with i = 799. From Fig. 17, we conclude that all x ∈ Ω satisfy V800(x) − V799(x) < U(x, d799(x)). Figs. 13(b), 14(b) and 15(b) show that the 224th iterative decision rule d224(x) is not admissible. Figs. 13(c), 14(c) and 15(c) show that the 799th iterative decision rule d799(x) is admissible.

Remark 4. We use the "post-decision state variable" technique to avoid calculating p(j|x, a) in our simulations. The post-decision state variable technique is explained in great detail in Powell (2007). The post-decision state variable is the state of the system after we have made a decision but before any new noise has arrived. For our simulation examples it is possible to break down the effect of the decision a(k) and the noise ω(k) on the system state variable x(k). According to Powell's theory (Powell, 2007), we can break our original transition function x(k + 1) = F(x(k), a(k), ω(k)) into two steps

x^a(k) = F^{M,a}(x(k), a(k)),
x(k + 1) = F^{M,W}(x^a(k), ω(k)),     (118)

where x(k) is the state of the system before we make a decision, while x^a(k) is the state after we make a decision but before the noise ω(k) is applied to the system. For this reason we refer to x^a(k) as the post-decision state variable.
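To illustrate the split in (118), the sketch below applies it to the discretized model (116). The function names follow the F^{M,a}/F^{M,W} notation of the remark, while the concrete state, control, and noise values are assumptions for illustration.

```python
# Sketch of the post-decision split (118) applied to the discretized model (116).
import numpy as np

dT = 0.001

def F_Ma(x, u):
    """Post-decision state x^a(k) = F^{M,a}(x(k), a(k)): deterministic effect of the decision."""
    x1, x2 = x
    x1a = (1 - dT) * x1 + dT * x2 * u
    x2a = (1 - dT) * x2 + dT * u + dT * (1 + np.cos(x1) ** 2) * np.sin(u)
    return np.array([x1a, x2a])

def F_MW(xa, w):
    """x(k+1) = F^{M,W}(x^a(k), omega(k)): the noise enters only after the decision."""
    return xa + dT * np.asarray(w)

# Composing the two steps reproduces one step of (116)
x_next = F_MW(F_Ma(np.array([1.0, -1.0]), 0.5), w=[0.0, 0.0])
```

In this arrangement, the value function V^a can be learned as a function of the post-decision state x^a alone, which is what removes the explicit dependence on p(j|x, a).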

Fig. 14. Iterative trajectories of system state x2.
Fig. 15. Iterative trajectories of input control.

The standard form of Bellman's equation is

V(x) = min_{a∈A_x} { U(x, a) + Σ_{j∈X} p(j|x, a) V(j) }.     (119)

If we let

V^a(x^a(k)) = E{ V(x(k)) | x^a(k) } = Σ_{j∈X} p(j|x(k), a) V(j),     (120)

then according to Powell's theory (Powell, 2007), we can obtain the optimality equation around the post-decision state variable

V^a(x^a(k)) = E{ min_{a∈A_x} { U(x(k + 1), a) + V^a(x^a(k + 1)) } | x^a(k) }.     (121)

At each iteration, to generate the next iterative decision rule, we have

d_i(x) = arg min_{a∈A_x} { U(x, a) + V^a_i(x^a(k)) }.     (122)

Using the post-decision state variable to update the value function, we have

V^a_{i+1}(x^a(k)) = E{ U(x(k + 1), d_i(x(k + 1))) + V^a_i(x^a(k + 1)) | x^a(k) }.     (123)

The key step is that we write Σ_{j∈X} p(j|x(k), a) V(j) as a function of x^a(k) rather than x(k), where we take advantage of the fact that x^a(k) is a deterministic function of x(k). This brings a tremendous computational advantage and avoids calculating p(j|x, a).

6. Conclusion

In this paper, to obtain the optimal performance index function for stochastic processes, a novel value iteration ADP algorithm is presented. We propose a new criterion to verify whether the obtained policy is stable for stochastic processes. We prove that the sequence of iterative value functions obtained by the proposed algorithm converges to the optimum. In addition, our algorithm allows the initial value function to be an arbitrary positive semi-definite function.

Finally, we present simulations to validate that the proposed criterion is able to guarantee that the obtained policy is stable and admissible while achieving the goal of finding the optimal performance index function for the stochastic processes.

Fig. 16. Error obtained by (65) with i = 224.
Fig. 17. Error obtained by (65) with i = 799.

References

Bertsekas, D. P. (1995). Dynamic Programming and Optimal Control. Belmont, MA, USA: Athena Scientific.
Bertsekas, D. P. (2011a). Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications, 9(3), 310–335.
Bertsekas, D. P. (2011b). Lambda-policy iteration: A review and a new implementation. Laboratory for Information and Decision Systems Report LIDS-P-2874.
Bertsekas, D. P. (2016). Robust shortest path planning and semicontractive dynamic programming. Laboratory for Information and Decision Systems Report LIDS-P-2915.
Bertsekas, D. P. (2017a). Regular policies in abstract dynamic programming. SIAM Journal on Optimization, 27(3), 1694–1727.
Bertsekas, D. P. (2017b). Value and policy iteration in deterministic optimal control and adaptive dynamic programming. IEEE Transactions on Neural Networks and Learning Systems, 28(3), 500–509.
Bertsekas, D. P. (2018a). Stable optimal control and semicontractive dynamic programming. SIAM Journal on Control and Optimization, 56(1), 231–252.
Bertsekas, D. P. (2018b). Proper policies in infinite-state stochastic shortest path problems. IEEE Transactions on Automatic Control, 63(11), 3787–3792.
Bertsekas, D. P., Homer, M. L., Logan, D. A., Patek, S. D., & Sandell, N. R. (2000). Missile defense and interceptor allocation by neuro-dynamic programming. IEEE Transactions on Systems, Man, and Cybernetics, 30(1), 42–51.
Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Belmont, MA, USA: Athena Scientific.
Bertsekas, D. P., & Yu, H. (2016). Stochastic shortest path problems under weak conditions. Laboratory for Information and Decision Systems Report LIDS-P-2909.
Fu, Z., Xie, W., Rakheja, S., & Na, J. (2017). Observer-based adaptive optimal control for unknown singularly perturbed nonlinear systems with input constraints. IEEE/CAA Journal of Automatica Sinica, 4(1), 48–57.
Jiang, H., & Zhang, H. (2018). Iterative ADP learning algorithms for discrete-time multi-player games. Artificial Intelligence Review, 50(1), 75–91.
Kreyszig, E. (1978). Introductory Functional Analysis with Applications. New York, USA: John Wiley & Sons.
Lincoln, B., & Rantzer, A. (2006). Relaxing dynamic programming. IEEE Transactions on Automatic Control, 51(8), 1249–1260.
Powell, W. B. (2007). Approximate Dynamic Programming: Solving the Curses of Dimensionality. New York, USA: John Wiley & Sons.
Puterman, M. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. New York, USA: John Wiley & Sons.
Rantzer, A. (2006). Relaxed dynamic programming in switching systems. IEE Proceedings - Control Theory and Applications, 153(5), 567–574.
Roy, B. V., Bertsekas, D. P., Lee, Y., & Tsitsiklis, J. N. (1997). A neuro-dynamic programming approach to retailer inventory management. In Proceedings of the 36th IEEE Conference on Decision and Control (pp. 4052–4057). IEEE.
Wang, D. (in press). Intelligent critic control with robustness guarantee of disturbed nonlinear plants. IEEE Transactions on Cybernetics, http://dx.doi.org/10.1109/tcyb.2019.2903117.
Wang, D., Ha, M., & Qiao, J. (in press). Self-learning optimal regulation for discrete-time nonlinear systems under event-driven formulation. IEEE Transactions on Automatic Control, http://dx.doi.org/10.1109/TAC.2019.2926167.
Wang, D., He, H., & Liu, D. (2017). Adaptive critic nonlinear robust control: A survey. IEEE Transactions on Cybernetics, 47(10), 3429–3451.
Wang, D., & Liu, D. (2018). Neural robust stabilization via event-triggering mechanism and adaptive learning technique. Neural Networks, 102, 27–35.
Wang, D., & Mu, C. (2017). Developing nonlinear adaptive optimal regulators through an improved neural learning mechanism. Science China Information Sciences, 60(5), 1–3.
Wang, D., & Qiao, J. (2019). Approximate neural optimal control with reinforcement learning for a torsional pendulum device. Neural Networks, 117, 1–7.
Wang, D., & Zhong, X. (2019). Advanced policy learning near-optimal regulation. IEEE/CAA Journal of Automatica Sinica, 6(3), 743–749.
Wei, Q., & Liu, D. (2014). A novel iterative θ-adaptive dynamic programming for discrete-time nonlinear systems. IEEE Transactions on Automation Science and Engineering, 11(4), 1176–1190.
Werbos, P. J. (1977). Advanced forecasting methods for global crisis warning and models of intelligence. General System Yearbook, 22, 25–38.
Werbos, P. J. (1991). A menu of designs for reinforcement learning over time (pp. 67–95). Cambridge, MA, USA: MIT Press.
Zhu, Y., & Zhao, D. (2017). Comprehensive comparison of online ADP algorithms for continuous-time optimal control. Artificial Intelligence Review, 49(4), 531–547.