The Borkar–Meyn theorem for asynchronous stochastic approximations

Shalabh Bhatnagar
Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560 012, India

Systems & Control Letters 60 (2011) 472–478

Article history: Received 15 January 2011; Accepted 4 April 2011; Available online 2 May 2011

Abstract

In this paper, we give a generalization of a result by Borkar and Meyn (2000) [1], on the stability and convergence of synchronous-update stochastic approximation algorithms, to the case of asynchronous stochastic approximations with delays. We then describe an interesting application of the result to asynchronous distributed temporal difference (TD) learning with function approximation and delays.

Keywords: The Borkar–Meyn theorem; Asynchronous stochastic approximation with delays; Temporal difference learning

1. Introduction

In [1] (see also [2], Chapter 3), Borkar and Meyn analyze the d-dimensional stochastic recursion

$$x_{n+1} = x_n + a(n)\big(h(x_n) + M_{n+1}\big), \qquad n \ge 0. \tag{1}$$

The analysis of this recursion by Borkar and Meyn [1] is remarkable because, using exclusively an ordinary differential equation (ODE) approach, they show both the stability and the convergence of (1) under very general requirements on the function h : R^d → R^d, the step-size sequence {a(n)} and the 'noise' M_{n+1}, n ≥ 0. Proving the stability of such a scheme in a general setting is normally difficult. Approaches in the literature for showing stability of similar schemes are usually problem specific, for instance, based on constructing a stochastic Lyapunov function [3]. Further, approaches that use the ODE method often assume that the recursion (such as (1)) is almost surely bounded, and hence do not establish stability of the scheme, since it has already been assumed stable. Often, the scheme (1) is simply projected onto a closed and bounded subset of R^d after each update; in such a case, the scheme is forced to remain stable. Having established (or else assumed) stability of the iterates in (1), proving their convergence using the ODE approach is relatively straightforward. The associated ODE in this case is

$$\dot{x}(t) = h(x(t)). \tag{2}$$

By considering a scaled version of (1) and an associated scaled ODE, Borkar and Meyn [1] show that under sufficiently general assumptions, (1) remains stable. Next, they show, using an ODE analysis involving the regular ODE (2), that the scheme is convergent. The assumptions in [1] have been found to be easily verifiable in many reinforcement learning (RL) [4] schemes; see for instance the applications to RL algorithms described in [1] itself, as well as the algorithms presented in [5,6]. The recursions considered in [1], however, use synchronous updates, i.e., all the d component recursions in (1) are assumed to be updated simultaneously and there are no delays.

Many real-world engineering problems naturally require asynchronous implementations of optimization and control algorithms. For instance, consider the problem of intelligent traffic signal control at intersections [7]. The problem is to find the optimal order in which to switch traffic lights among the various signal configurations, as well as the length of time for which a signal should be kept green on each lane. It is normally assumed that sensors are deployed along the various lanes to sense the level of congestion. The sensors then communicate this information to the traffic signal junction, which decides on the schedule for traffic lights in the next cycle. If one considers controlling traffic on an entire road network involving several junctions, it is natural to consider a situation where each road traffic junction runs its own local algorithm and then shares information with other traffic junctions. In general, there will be communication delays in the information shared across junctions. Each traffic junction can also run its algorithm according to its local clock; thus some junctions might update at a faster rate than others.

A rigorous treatment of deterministic asynchronous update algorithms can be found in Chapters 6–8 of [8]. In the stochastic setting, [9] presents an asynchronous implementation of a scheme similar to (1) and analyzes its convergence. It is assumed there that each component x_n(i), i = 1, . . . , d of x_n is updated on a different processor using the local clock of that processor. Further, each processor communicates, with a random delay, its latest component update to the other processors. The convergence of the scheme in [9] is, however, shown under the assumption that the updates are uniformly bounded, i.e., that the scheme remains stable.

In this paper, using methods similar to those in [1], we provide a set of conditions under which both stability and convergence hold for asynchronous stochastic approximations with delays. We use the asynchronous update setting of [9] for this purpose. The derived conditions are seen to be verifiable in RL applications [10]. In particular, we show the convergence of asynchronous temporal difference (TD) learning [11,4] with function approximation by verifying our assumptions. Traditionally, RL algorithms have not been studied using asynchronous updates with delays; see, however, [12] for certain actor–critic algorithms in the look-up table case (i.e., with full state representations) that incorporate asynchronous updates for the critic but synchronous updates for the actor. To the best of our knowledge, TD learning with function approximation has not been analyzed previously for asynchronous update recursions with delays. One can expect that asynchronous versions of traditional RL algorithms will have important ramifications in various practical problems, such as the traffic signal control problem described above.

The rest of the paper is organized as follows. We present the framework and assumptions in Section 2. The proof of stability and convergence is given in Section 3. The application of this result to asynchronous TD with function approximation is given in Section 4. Finally, concluding remarks are provided in Section 5.

2. The framework

Consider a system with d processors labeled 1, . . . , d, and let the ith processor update the ith component of a d-dimensional parameter x = (x(1), . . . , x(d))^T. As in [9], the index n will be used to denote the 'global' clock, and ν(i, n) will denote the number of times out of n that component i has been updated. Let Y_n denote a random subset of S̄ ≜ {1, . . . , d} that indicates the subset of components updated at time n. Then

$$\nu(i, n) = \sum_{m=0}^{n} I\{i \in Y_m\}.$$

We assume that after every component update, the corresponding processor transmits the update information to every other processor. Let τ_{ij}(n) denote the 'random' delay incurred in processor j receiving processor i's output at time n. Thus, at time n, processor j knows x_{n−τ_{ij}(n)}(i) as the most recent update from i. It is possible that, because of the randomness in the delays, j might actually have a more recent update from i than x_{n−τ_{ij}(n)}(i), but it believes (and hence uses) the latter as the most recent update, because it is the latest update to have been received from i. We assume that the ith processor knows its own latest update, i.e., τ_{ii}(n) = 0 for all i ∈ S̄, n ≥ 0.

Let h : R^d → R^d be a given map, and for any x ∈ R^d write h(x) = (h_1(x), . . . , h_d(x))^T. The recursion that we will be concerned with is the following: for i = 1, . . . , d, n ≥ 0,

$$x_{n+1}(i) = x_n(i) + a(\nu(i,n))\, I\{i \in Y_n\}\Big[ h_i\big(x_{n-\tau_{1i}(n)}(1), \ldots, x_{n-\tau_{di}(n)}(d)\big) + M_{n+1}(i) \Big]. \tag{3}$$

For n ≥ 0, let F_n = σ(x_m, M_m, Y_m, τ_{ij}(m), 1 ≤ i, j ≤ d, m ≤ n) denote the associated sequence of sigma fields. Let Λ(·) be a (d × d) matrix-valued measurable function such that, for each t, Λ(t) is a diagonal matrix with nonnegative diagonal entries that are upper bounded by one. These entries correspond to the relative rates at time t (in relation to the global clock) with which the various components get updated. We make the following assumptions:

Assumption 1. lim inf_{n→∞} ν(i, n)/n > 0 for all i ∈ S̄.

Assumption 2. The step-sizes a(n), n ≥ 0 satisfy
(i) 0 < a(n) ≤ ā for all n ≥ 0, where ā > 0 is a given constant,
(ii) a(n + 1) ≤ a(n) for all n ≥ 0,
(iii) Σ_n a(n) = ∞ and Σ_n a(n)^2 < ∞.

Assumption 3. τ_{ij}(n) ≥ 0 for all i, j ∈ {1, . . . , d} and n ≥ 0. In addition, τ_{ij}(n) ≤ K for all n ≥ K and τ_{ij}(n) ≤ n otherwise (i.e., for n < K), where K > 0 is a given (deterministic) constant which could be large.

Assumption 4. (i) The function h : R^d → R^d is Lipschitz continuous with Lipschitz constant L > 0.
(ii) There is a unique globally asymptotically stable equilibrium x* ∈ R^d for the ODE

$$\dot{x}(t) = \Lambda(t)\, h(x(t)), \qquad x(0) \in R^d, \ t \ge 0.$$

(iii) The functions h_r(x) ≜ h(rx)/r, r ≥ 1, x ∈ R^d, satisfy h_r(x) → h_∞(x) as r → ∞, uniformly on compacts, for some h_∞ : R^d → R^d.
(iv) The origin in R^d is an asymptotically stable equilibrium for the ODE

$$\dot{x}(t) = \Lambda(t)\, h_\infty(x(t)), \qquad x(0) \in R^d, \ t \ge 0.$$

Assumption 5. (i) The sequence {(M_n, F_n)}, n ≥ 1, with M_n = (M_n(1), . . . , M_n(d))^T, is a martingale difference sequence. (ii) For some constant C_0 < ∞ and any initial condition x_0 ∈ R^d,

$$\|M_{n+1}\| \le C_0 (1 + \|x_n\|), \qquad n \ge 0.$$

Here and in what follows, for any vector x ∈ R^d, ‖x‖ denotes its norm. We do not restrict attention to a particular norm, as any norm on R^d can be considered. We now briefly discuss these assumptions. Assumption 1 implies that all components are updated often enough (in relation to each other). It can, for instance, be satisfied under the following condition:

Condition (⋆): {Y_n} is an ergodic Markov chain whose state space Ŝ is a set of predefined subsets of S̄ such that each component i ∈ S̄ is contained in at least one of these subsets (i.e., the states of Y_n).

Under (⋆), it follows from Corollary 8, Chapter 6 of [2] that Λ(t) = diag(Λ_1(t), . . . , Λ_d(t)) with Λ_i(t) = Σ_{A∈Ŝ : i∈A} π(A), where π(A) is the stationary probability of {Y_n} being in state A ∈ Ŝ. By ergodicity of {Y_n}, π(A) > 0 for all A ∈ Ŝ. Hence, under (⋆), Λ_i(t) > 0 for all i ∈ S̄.

The requirements in Assumption 2 are satisfied by most standard examples of diminishing step-sizes, such as a(n) = 1/n, n ≥ 1; a(n) = 1/n^α, n ≥ 1, for any α ∈ (0.5, 1); and a(n) = 1/(n log n), n ≥ 2. Assumption 3 requires the delays to be bounded by a possibly large constant. Most real-time applications require the delays to be bounded (even though they could be random). For instance, in the case of the TCP (transmission control protocol) window flow algorithm used on the Internet, a timeout period T (a constant) is set by the sources; if packets are not acknowledged by that time, the source assumes that the packet is lost. Moreover, in practice, an algorithm such as (3) is often run only for a given (large) number of epochs, say M. One may then choose K to be M itself. In [9], the delays are assumed to satisfy a moment bound and are allowed to be possibly unbounded. This, however, comes at the cost of stringent (additional) requirements on the step-sizes a(n), n ≥ 0; see [9]. By allowing delays to be bounded, we are able to relax the requirements in [9] on the step-sizes. For instance, the class of step-sizes a(n) = 1/n^α, n ≥ 1 with α ∈ (0.5, 1) does not satisfy the requirements in [9], while it is seen to satisfy Assumption 2.

An assumption similar to Assumption 4 has been used in [1], except for the additional Λ(t) terms multiplying the right-hand sides of the ODEs in Assumption 4(ii) and (iv). These terms arise because of


the asynchronous updates. It is easy to see that since h(·) is a Lipschitz continuous function, so are h_r(·), r ≥ 1, and h_∞(·); further, h_r, r ≥ 1, and h_∞ have the same Lipschitz constant L as the function h. Assumption 5(i) implies that the noise M_{n+1}, n ≥ 0 forms a martingale difference sequence. In [1], the requirement E[‖M_{n+1}‖^2 | F_n] ≤ C_0(1 + ‖x_n‖^2) has been used in place of Assumption 5(ii). The analysis in the next section can be carried out, with some extra work, under that requirement as well. However, as we show in Section 4, the requirement in Assumption 5(ii) is satisfied by TD learning, and it can also be seen to be satisfied by many other reinforcement learning algorithms.
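To make the framework concrete, the following minimal simulation sketch (our illustration, not part of the original analysis) runs recursion (3) for the linear field h(x) = x* − x, which satisfies Assumption 4 with h_∞(x) = −x, using i.i.d. random update sets Y_n (a simple special case under which Assumption 1 holds) and bounded random delays as in Assumption 3. All names and parameter values below are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, N = 5, 3, 20000            # dimension, delay bound K (Assumption 3), iterations
x_star = rng.normal(size=d)      # equilibrium of the ODE x'(t) = Lambda(t) h(x(t))

def h(x):
    return x_star - x            # Lipschitz field with h_infty(x) = -x (Assumption 4)

def a(n):
    return 1.0 / (n + 1) ** 0.7  # a(n) = 1/n^alpha, alpha in (0.5, 1) (Assumption 2)

hist = np.tile(rng.normal(size=d), (N + 1, 1))  # hist[m] stores x_m; x_0 random
nu = np.zeros(d, dtype=int)                     # nu(i, n): update count of component i
for n in range(N):
    Y = np.flatnonzero(rng.random(d) < 0.5)     # Y_n: random subset updated at time n
    hist[n + 1] = hist[n]
    for i in Y:
        tau = rng.integers(0, min(K, n) + 1, size=d)  # delays tau_ji(n) <= min(K, n)
        tau[i] = 0                                    # tau_ii(n) = 0
        x_delayed = hist[n - tau, np.arange(d)]       # j-th coordinate read with delay
        M = 0.1 * rng.normal()                        # martingale difference noise
        hist[n + 1, i] = hist[n, i] + a(nu[i]) * (h(x_delayed)[i] + M)
        nu[i] += 1

print(np.max(np.abs(hist[N] - x_star)))  # should be small: iterates approach x*
print(nu / N)  # relative update frequencies, bounded away from 0 (Assumption 1)
```

In this sketch the empirical frequencies ν(i, N)/N settle near 1/2, which is one simple way in which Assumption 1 can be met.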

3. Proof of convergence

Without loss of generality, we assume ā = 1 (cf. Assumption 2(i)). The proofs of Lemmas 1 and 2 closely follow those of Lemmas 1–2 and Corollary 3, pp. 22–24 of [2]. Let φ_∞(t, x), t ≥ 0 denote the solution to the ODE

$$\dot{x}(t) = \Lambda(t)\, h_\infty(x(t)) \quad \text{with } x(0) = x. \tag{4}$$

Similarly, for any r ≥ 1, let φ_r(t, x), t ≥ 0 denote the solution to the ODE

$$\dot{x}(t) = \Lambda(t)\, h_r(x(t)) \quad \text{with } x(0) = x. \tag{5}$$

Given the d-vector norm ‖·‖, we define the corresponding induced (d × d)-matrix norm, also denoted ‖·‖ (by an abuse of notation), as follows: for a given d × d matrix A,

$$\|A\| = \max_{\{x \in R^d \,:\, \|x\| = 1\}} \|Ax\|.$$

Let Ū denote the unit ball in R^d and ∂Ū its boundary, i.e., the unit sphere.

Lemma 1. Under Assumption 4, there exists a T > 0 such that for all initial conditions x ∈ ∂Ū, ‖φ_∞(t, x)‖ < 1/8 for all t > T.

Proof. By Assumption 4(iv), the origin is a globally asymptotically stable equilibrium for the ODE (4). Hence (from Lyapunov stability), given ϵ = 1/8, there exists a δ > 0 such that ‖x‖ < δ implies ‖φ_∞(t, x)‖ < 1/8 for all t > 0. For x(0) = x ∈ ∂Ū, let T_x be such that ‖φ_∞(T_x, x)‖ < δ/2. Also, let y ∈ ∂Ū be some other initial condition. Now

$$\varphi_\infty(t, x) = x + \int_0^t \Lambda(s)\, h_\infty(\varphi_\infty(s, x))\, ds, \qquad \varphi_\infty(t, y) = y + \int_0^t \Lambda(s)\, h_\infty(\varphi_\infty(s, y))\, ds.$$

Hence,

$$\|\varphi_\infty(t,x) - \varphi_\infty(t,y)\| \le \|x - y\| + \int_0^t \|\Lambda(s)\|\, \|h_\infty(\varphi_\infty(s,x)) - h_\infty(\varphi_\infty(s,y))\|\, ds.$$

Now, since Λ(s), for any s ≥ 0, is a diagonal matrix with its diagonal elements in (0, 1], there exists a C̄ > 0 such that ‖Λ(s)‖ ≤ C̄. From the Gronwall inequality,

$$\|\varphi_\infty(t, x) - \varphi_\infty(t, y)\| \le \|x - y\|\, e^{L\bar{C} T_x}, \qquad t \le T_x.$$

Now there exists an ϵ_0 > 0 such that ‖x − y‖ < ϵ_0 implies ‖x − y‖ e^{LC̄T_x} < δ/2. Thus,

$$\|\varphi_\infty(T_x, y)\| \le \|\varphi_\infty(T_x, x)\| + \|\varphi_\infty(T_x, y) - \varphi_\infty(T_x, x)\| \le \frac{\delta}{2} + \frac{\delta}{2} = \delta.$$

It then follows by Lyapunov stability that ‖φ_∞(t, y)‖ < 1/8 for all t ≥ T_x. Since ∂Ū is compact, it can be covered by a finite number of such open balls with corresponding times T_{x_1}, . . . , T_{x_n}. Setting T = max(T_{x_1}, . . . , T_{x_n}) gives the claim. □

Lemma 2. Under Assumption 4, there exist r_0 > 0 and T > 0 such that for all initial conditions x ∈ ∂Ū, ‖φ_r(t, x)‖ < 1/4 for t ∈ [T, T + 1] and r > r_0.

Proof. Let T > 0 be as in Lemma 1. Now

$$\varphi_r(t, x) = x + \int_0^t \Lambda(s)\, h_r(\varphi_r(s, x))\, ds, \qquad \varphi_\infty(t, x) = x + \int_0^t \Lambda(s)\, h_\infty(\varphi_\infty(s, x))\, ds.$$

Thus,

$$\|\varphi_r(t,x) - \varphi_\infty(t,x)\| \le \int_0^t \big\|\Lambda(s)\big(h_r(\varphi_r(s,x)) - h_\infty(\varphi_\infty(s,x))\big)\big\|\, ds \le \bar{C} \int_0^t \|h_r(\varphi_r(s,x)) - h_\infty(\varphi_\infty(s,x))\|\, ds.$$

Now,

$$\|h_r(\varphi_r(s,x)) - h_\infty(\varphi_\infty(s,x))\| \le \|h_r(\varphi_r(s,x)) - h_r(\varphi_\infty(s,x))\| + \|h_r(\varphi_\infty(s,x)) - h_\infty(\varphi_\infty(s,x))\| \le L\, \|\varphi_r(s,x) - \varphi_\infty(s,x)\| + \epsilon(r),$$

where, by Assumption 4(iii), ϵ(r) → 0 as r → ∞ and does not depend on x ∈ ∂Ū, because φ_∞([0, T + 1], ∂Ū) is compact: both [0, T + 1] and ∂Ū are compact sets and φ_∞ is a continuous map. Now, by the Gronwall inequality, one obtains

$$\|\varphi_r(t, x) - \varphi_\infty(t, x)\| \le \epsilon(r)\, \bar{C}\, (T+1)\, e^{L\bar{C}(T+1)} \quad \text{for all } t \le T + 1.$$

By Lemma 1, ‖φ_∞(t, x)‖ < 1/8 for all t > T and x ∈ ∂Ū. The claim now follows with r_0 chosen so that ϵ(r) C̄ (T + 1) e^{LC̄(T+1)} < 1/8 for all r > r_0. □

Let ā(n) ≜ max_{i∈Y_n} a(ν(i, n)). Then ā(n), n ≥ 0 is seen to satisfy Assumption 2(i) and (iii), though not necessarily Assumption 2(ii). Now define t(0) = 0 and t(n) = Σ_{m=0}^{n−1} ā(m), n ≥ 1. Further, let x̄(t) = (x̄_1(t), . . . , x̄_d(t))^T, t ≥ 0 be defined via x̄(t(n)) = x_n, n ≥ 0, with linear interpolation on each interval [t(n), t(n + 1)]. Let q(i, n) ≜ (a(ν(i, n))/ā(n)) I{i ∈ Y_n}. Then q(i, n) ∈ (0, 1] whenever i ∈ Y_n, and q(i, n) = 0 otherwise. Define another sequence {T_n} of time instants according to T_0 = 0 and T_{n+1} = min{t(m) | t(m) ≥ T_n + T}, n ≥ 0, where T is as in Lemma 1. Then T_{n+1} ∈ [T_n + T, T_n + T + 1] for all n. Note that there exists a subsequence {m(n)} of {n} such that T_n = t(m(n)). Let r(n) ≜ max(‖x̄(T_n)‖, 1), n ≥ 0, and let x̂(t) = (x̂_1(t), . . . , x̂_d(t))^T, t ≥ 0 be defined via x̂(t) = x̄(t)/r(n) for t ∈ [T_n, T_{n+1}). Note that ‖x̂(T_n)‖ ≤ 1 for all n.
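Although the above construction is purely analytical, it is straightforward to realize algorithmically. The following short sketch (ours; all parameter choices are illustrative) builds ā(n), the timescale t(n), and the sampling instants T_n = t(m(n)) for a simulated sequence of update sets, which can be useful for checking the construction numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, T = 4, 1000, 1.0

def a(n):                          # step-size a(n), satisfying Assumption 2
    return 1.0 / (n + 1) ** 0.7

nu = np.zeros(d, dtype=int)        # nu(i, n)
a_bar, t = [], [0.0]               # a_bar(n) and timescale t(n) = cumulative sum
for n in range(N):
    Y = np.flatnonzero(rng.random(d) < 0.5)
    if Y.size == 0:                # keep Y_n nonempty so the max below is defined
        Y = np.array([rng.integers(d)])
    a_bar.append(max(a(nu[i]) for i in Y))   # a_bar(n) = max_{i in Y_n} a(nu(i, n))
    for i in Y:
        nu[i] += 1
    t.append(t[-1] + a_bar[-1])

# T_{n+1} = min{ t(m) : t(m) >= T_n + T }; m_idx records the indices m(n)
m_idx = [0]
for m in range(1, N + 1):
    if t[m] >= t[m_idx[-1]] + T:
        m_idx.append(m)

print(t[-1], m_idx[:5])  # t(n) grows without bound; consecutive T_n are about T apart
```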

Theorem 1. Under Assumptions 1–5, sup_t ‖x̂(t)‖ < ∞ almost surely.

Proof. We rewrite the algorithm (3) as

$$x_{k+1}(i) = x_k(i) + \bar{a}(k)\, q(i,k)\big(h_i(x_k(1), \ldots, x_k(d)) + M_{k+1}(i)\big) + \epsilon_k(i), \tag{6}$$

where ϵ_k(i) = ā(k) q(i,k) (h_i(x_{k−τ_{1i}(k)}(1), . . . , x_{k−τ_{di}(k)}(d)) − h_i(x_k(1), . . . , x_k(d))). Let m(n) ≤ k < m(n + 1) for some n > 0. From Assumption 5(ii), we have that

$$|M_{k+1}(i)| \le \|M_{k+1}\| \le C_0 (1 + \|x_k\|). \tag{7}$$

For any vector x = (x(1), . . . , x(d))^T ∈ R^d, let ‖x‖_1 = Σ_{i=1}^d |x(i)| denote its ℓ_1 norm. If ‖·‖ is the ℓ_1 norm itself, then |M_{k+1}(i)| ≤ C_0(1 + Σ_{l=1}^d |x_k(l)|). If not, then there exists a C_1 > 0 such that ‖x‖ ≤ C_1 ‖x‖_1 for all x ∈ R^d (see Proposition A.9, p. 655 of [13]). Thus, in general, there exists a K_0 > 0 such that

$$|M_{k+1}(i)| \le K_0 \Big(1 + \sum_{l=1}^d |x_k(l)|\Big), \tag{8}$$

for all i = 1, . . . , d. Dividing (6) throughout by r(n), and observing that x̂_i(t(k)) = x̄_i(t(k))/r(n) = x_k(i)/r(n) for m(n) ≤ k < m(n + 1), one obtains

$$\hat{x}_i(t(k+1)) = \hat{x}_i(t(k)) + \bar{a}(k)\, q(i,k)\big(h_{r(n),i}(\hat{x}(t(k))) + \hat{M}_{k+1}(i)\big) + \hat{\epsilon}_k(i), \tag{9}$$

where h_{r(n),i}(x̂(t(k))) = h_i(r(n) x̂(t(k)))/r(n) = h_i(x̄(t(k)))/r(n), M̂_{k+1}(i) = M_{k+1}(i)/r(n) and ϵ̂_k(i) = ϵ_k(i)/r(n). Now

$$|h_{r(n),i}(\hat{x}(t(k)))| \le |h_{r(n),i}(0)| + |h_{r(n),i}(\hat{x}(t(k))) - h_{r(n),i}(0)|.$$

Further,

$$|h_{r(n),i}(\hat{x}(t(k))) - h_{r(n),i}(0)| = \frac{1}{r(n)}\, |h_i(\bar{x}(t(k))) - h_i(0)| \le \frac{K_1}{r(n)} \sum_{j=1}^d |\bar{x}_j(t(k))| = K_1 \sum_{j=1}^d |\hat{x}_j(t(k))|,$$

for some constant K_1 > 0 (since h(·) is Lipschitz continuous). Thus,

$$|h_{r(n),i}(\hat{x}(t(k)))| \le K_2 \Big(1 + \sum_{j=1}^d |\hat{x}_j(t(k))|\Big), \tag{10}$$

where K_2 = max(|h_{r(n),i}(0)|, K_1) > 0. It also follows from (8) that

$$|\hat{M}_{k+1}(i)| = \frac{|M_{k+1}(i)|}{r(n)} \le K_0 \Big(1 + \sum_{l=1}^d |\hat{x}_l(t(k))|\Big), \tag{11}$$

since r(n) ≥ 1. Further, since h(·) is Lipschitz continuous,

$$|\hat{\epsilon}_k(i)| \le K_1\, a(\nu(i,k))\, I\{i \in Y_k\}\, \frac{1}{r(n)} \sum_{j=1}^d |x_{k-\tau_{ji}(k)}(j) - x_k(j)|. \tag{12}$$

Now

$$|x_k(j) - x_{k-\tau_{ji}(k)}(j)| \le \Big|\sum_{m=k-\tau_{ji}(k)}^{k-1} a(\nu(j,m))\, I\{j \in Y_m\}\, h_j\big(x_{m-\tau_{1j}(m)}(1), \ldots, x_{m-\tau_{dj}(m)}(d)\big)\Big| + \Big|\sum_{m=k-\tau_{ji}(k)}^{k-1} a(\nu(j,m))\, I\{j \in Y_m\}\, M_{m+1}(j)\Big|.$$

Since h(·) is Lipschitz continuous,

$$|h_j(x_{m-\tau_{1j}(m)}(1), \ldots, x_{m-\tau_{dj}(m)}(d))| \le K_3 \Big(1 + \sum_{l=1}^d |x_{m-\tau_{lj}(m)}(l)|\Big),$$

for some constant K_3 > 0, and a similar argument as before also shows that |M_{m+1}(j)| ≤ K_0 (1 + Σ_{l=1}^d |x_m(l)|). Thus,

$$|x_k(j) - x_{k-\tau_{ji}(k)}(j)| \le K_3 \sum_{m=k-\tau_{ji}(k)}^{k-1} a(\nu(j,m))\, I\{j \in Y_m\} \Big(1 + \sum_{l=1}^d |x_{m-\tau_{lj}(m)}(l)|\Big) + K_0 \sum_{m=k-\tau_{ji}(k)}^{k-1} a(\nu(j,m))\, I\{j \in Y_m\} \Big(1 + \sum_{l=1}^d |x_m(l)|\Big). \tag{13}$$

Dividing (13) by r(n) and substituting in (12), and writing (by an abuse of notation) x̂_l(t(m) − τ_{lj}(t(m))) for x_{m−τ_{lj}(m)}(l)/r(n), one obtains

$$\hat{\epsilon}_k(i) \le K_4\, \bar{a}(k) \sum_{j=1}^d \sum_{m=k-\tau_{ji}(k)}^{k-1} \bar{a}(m) \Big(1 + \sum_{l=1}^d |\hat{x}_l(t(m) - \tau_{lj}(t(m)))|\Big) + K_5\, \bar{a}(k) \sum_{j=1}^d \sum_{m=k-\tau_{ji}(k)}^{k-1} \bar{a}(m) \Big(1 + \sum_{l=1}^d |\hat{x}_l(t(m))|\Big),$$

for some constants K_4, K_5 > 0. Recall here that by Assumption 3, τ_{lj}(k) ≤ K for all l, j ∈ S̄, k ≥ 0. Thus, one obtains from (9),

$$|\hat{x}_i(t(k+1))| \le |\hat{x}_i(t(m(n)))| + \sum_{l=m(n)}^k \bar{a}(l)\, K_2 \Big(1 + \sum_{j=1}^d |\hat{x}_j(t(l))|\Big) + \sum_{l=m(n)}^k \bar{a}(l)\, K_0 \Big(1 + \sum_{j=1}^d |\hat{x}_j(t(l))|\Big) + \sum_{l=m(n)}^k \hat{\epsilon}_l(i), \tag{14}$$

i = 1, . . . , d. Now, Σ_{l=m(n)}^k ā(l) ≤ Σ_{l=m(n)}^{m(n+1)−1} ā(l) ≤ (T + 1). Thus,

$$\sum_{l=m(n)}^k \hat{\epsilon}_l(i) \le K_4 (T+1) \sum_{j=1}^d \sum_{m=m(n)-K}^{k-1} \bar{a}(m) \Big(1 + \sum_{r=1}^d |\hat{x}_r(t(m) - \tau_{rj}(t(m)))|\Big) + K_5\, d\, (T+1) \sum_{m=m(n)-K}^{k-1} \bar{a}(m) \Big(1 + \sum_{r=1}^d |\hat{x}_r(t(m))|\Big).$$

Summing over i ∈ S̄ in (14), one obtains

$$\sum_{i=1}^d |\hat{x}_i(t(k+1))| \le \sum_{i=1}^d |\hat{x}_i(t(m(n)))| + (K_2 + K_0)\, d \sum_{l=m(n)}^k \bar{a}(l) \sum_{j=1}^d |\hat{x}_j(t(l))| + (K_2 + K_0)\, d\, (T+1) + K_4\, d^2 (T+1) \sum_{m=m(n)-K}^{k-1} \bar{a}(m) \Big(1 + \sum_{i=1}^d |\hat{x}_i(t(m) - \tau_{ij}(t(m)))|\Big) + K_5\, d^2 (T+1) \sum_{m=m(n)-K}^{k-1} \bar{a}(m) \Big(1 + \sum_{i=1}^d |\hat{x}_i(t(m))|\Big).$$

Let Y(t) = Σ_{i=1}^d |x̂_i(t)|. Using the fact that τ_{ij}(·) ≤ K, one obtains from the above

$$Y(t(k+1)) \le Y(t(m(n))) + (K_0 + K_2)\, d \sum_{l=m(n)}^k \bar{a}(l)\, Y(t(l)) + K_5\, d^2 (T+1) \sum_{l=m(n)-K}^k \bar{a}(l)\, Y(t(l)) + K_4\, d^2 (T+1) \sum_{l=m(n)-2K}^k \bar{a}(l)\, Y(t(l)) + (K_0 + K_2)\, d\, (T+1) + (K_4 + K_5)\, d^2 (T+1)(T+1+K).$$

Now note that Y(t(m(n))) ≤ C̄ for some constant C̄ > 0, since ‖x̂(t(m(n)))‖ ≤ 1. Thus, one obtains

$$Y(t(k+1)) \le K_6 \sum_{l=m(n)-2K}^k \bar{a}(l)\, Y(t(l)) + K_7,$$

where K_6 = (K_0 + K_2) d + (K_4 + K_5) d^2 (T+1) and K_7 = C̄ + (K_0 + K_2) d (T+1) + (K_4 + K_5) d^2 (T+1)(T+1+K). Now

$$\sum_{l=m(n)-2K}^k \bar{a}(l) \le \sum_{l=m(n)-2K}^{m(n)-1} \bar{a}(l) + \sum_{l=m(n)}^{m(n+1)-1} \bar{a}(l) \le 2K + (T+1),$$

since ā(l) ≤ ā = 1 for all l ≥ 0. By the discrete Gronwall inequality, we now have

$$Y(t(k+1)) \le K_7\, e^{(2K+T+1)K_6}, \qquad \forall\, m(n) \le k < m(n+1).$$

Hence sup_{m(n) ≤ k < m(n+1)} Y(t(k)) is bounded uniformly over n, and since x̂(·) is piecewise linear between the grid points t(k), it follows that sup_t ‖x̂(t)‖ ≤ C_1 K_7 e^{(2K+T+1)K_6} < ∞ for some C_1 > 0 as before (cf. Proposition A.9, p. 655 of [13]). The claim follows. □
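For completeness, we note the form of the discrete Gronwall inequality used in the last step (this restatement is our addition; it is a standard result): if nonnegative numbers y_k, b_k and constants C, L ≥ 0 satisfy y_0 ≤ C and

$$y_{k+1} \le C + L \sum_{m=0}^{k} b_m\, y_m \quad \forall k \ge 0, \qquad \text{then} \qquad y_{k+1} \le C \exp\Big(L \sum_{m=0}^{k} b_m\Big) \quad \forall k \ge 0.$$

In the proof above this is applied, after an index shift, with y_l = Y(t(l)), b_l = ā(l), C = K_7 and L = K_6, together with the bound Σ_{l=m(n)−2K}^{k} ā(l) ≤ 2K + T + 1.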

Let u_i(t) = q(i, n) for t ∈ [t(n), t(n + 1)), n ≥ 0. For any t ≥ 0, let λ(t) be the (d × d)-diagonal matrix with u_1(t), . . . , u_d(t) as its diagonal elements. Consider now the ODE

$$\dot{x}(t) = \lambda(t)\, h(x(t)). \tag{15}$$

Consider also the associated ODE

$$\dot{x}(t) = \lambda(t)\, h_c(x(t)), \tag{16}$$

for c ≥ 1. For n ≥ 0, let x̃_n(t), t ∈ [T_n, T_{n+1}] denote the trajectory of (16) with c = r(n) and x̃_n(T_n) = x̂(T_n). The following result now holds as a consequence of Theorem 1 and standard arguments; see for instance Lemma 1, Chapter 2 of [2].

Lemma 3. Under Assumptions 1–5, lim_{n→∞} sup_{t∈[T_n, T_{n+1}]} ‖x̂(t) − x̃_n(t)‖ = 0 with probability one.

We now have the following result on the stability of the original iterates.

Theorem 2. Under Assumptions 1–5, sup_n ‖x_n‖ < ∞ with probability one.

Proof. The proof follows in the same manner as that of Theorem 7, pp. 26–27 of [2]. □

For n ≥ 0, let x^n(t), t ∈ [T_n, T_{n+1}] denote the trajectory of (15) with x^n(T_n) = x̄(T_n). We have the following result, whose proof follows as a consequence of Theorem 2 and Lemma 1, Chapter 2 of [2].

Lemma 4. Under Assumptions 1–5, lim_{n→∞} sup_{t∈[T_n, T_{n+1}]} ‖x̄(t) − x^n(t)‖ = 0 with probability one.

Next, we have the following result.

Lemma 5. Under Assumptions 1–5, any limit point of x̄(s + ·) in C([0, ∞); R^d) as s → ∞ is a solution of the ODE ẋ(t) = Λ(t) h(x(t)).

Proof. The proof follows as in Theorem 2, pp. 81–82 of [2]. □

We finally have the following result.

Theorem 3. Under Assumptions 1–5, the sequence x_n, n ≥ 0 obtained from (3) satisfies x_n → x* as n → ∞.

Proof. The proof follows as a consequence of Theorem 2, Chapter 2 of [2]. □

4. Application to temporal difference learning

We now present an application of the above result in reinforcement learning.


Let {X_n} be a Markov decision process (MDP) with {Z_n} as its underlying control sequence. Let S be the state space of this process, and let A(i) be the set of feasible actions when the process is in state i ∈ S. Let A = ∪_{i∈S} A(i). We assume that S as well as A are finite sets. Let p(i, j, a), i, j ∈ S and a ∈ A(i), be the transition probabilities: if the state of the process at a given instant is i and action a is chosen, the next state is j with probability p(i, j, a). Further, let c_n be a random cost incurred by the agent at instant n. This depends in general on the state and action at time n. In particular, we assume that E[c_n | X_m, Z_m, m ≤ n] = C(X_n, Z_n) almost surely, where C(i, a) is the expected cost incurred by the agent when the state is i and the action chosen is a. The agent follows a given stationary deterministic policy, i.e., picks actions at any instant according to a given function π : S → A such that π(i) ∈ A(i) for all i ∈ S. The MDP {X_n} under policy π is a Markov process with transition probabilities p̂(i, j) = p(i, j, π(i)), i, j ∈ S. The problem is to evaluate the long-term cost incurred by the agent under policy π, assuming an infinite horizon discounted cost criterion with discount factor γ ∈ (0, 1). Let V(i) denote the cost-to-go in state i, defined as

$$V(i) = E\Big[\sum_{k=0}^{\infty} \gamma^k c_k \,\Big|\, X_0 = i, \pi\Big], \qquad i \in S. \tag{17}$$

We will assume, however, that the cardinality of S is large, so that estimating V(i) for all states i is computationally infeasible. Thus we approximate V(i) ≈ φ(i)^T θ, where φ(i) = (φ_1(i), . . . , φ_d(i))^T is a d-dimensional feature vector associated with state i, and θ = (θ(1), . . . , θ(d))^T is an associated parameter. Let Φ denote the |S| × d feature matrix with φ(i)^T, i ∈ S as its rows. We make the following assumptions.

Assumption 6. The process {X_n} under the given policy π is irreducible and aperiodic.

Assumption 7. The d columns of the matrix Φ are linearly independent. Further, d ≤ |S|.

By Assumption 6, {X_n} is also positive recurrent under π, because it is a finite-state chain. Let P denote the transition probability matrix of {X_n} under π and let d(i), i ∈ S be its stationary distribution. Further, let D be the diagonal matrix with entries d(i), i ∈ S along the diagonal. The regular TD algorithm is as follows:

$$\theta_{n+1} = \theta_n + a(n)\, \delta_n\, \varphi(X_n), \tag{18}$$

where δ_n is the temporal difference defined by δ_n = c_n + γ φ(X_{n+1})^T θ_n − φ(X_n)^T θ_n, n ≥ 0. Here θ_n = (θ_n(1), . . . , θ_n(d))^T. This algorithm, in the synchronous case, has been analyzed for its convergence in [11].
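As a point of reference for the asynchronous scheme presented next, here is a minimal simulation sketch of the synchronous update (18) (our illustration: the MDP, costs and features are randomly generated, and the step-size choice is one admissible example). The iterate is compared against the fixed point θ* of (21), stated in Theorem 4 below.

```python
import numpy as np

rng = np.random.default_rng(2)
S, d, gamma = 6, 3, 0.9                    # |S| states, d features, discount factor
P = rng.random((S, S)); P /= P.sum(axis=1, keepdims=True)  # P under the fixed policy
C = rng.random(S)                          # expected costs C(i, pi(i))
Phi = rng.normal(size=(S, d))              # feature matrix with rows phi(i)^T

theta = np.zeros(d)
i = 0
for n in range(200000):
    j = rng.choice(S, p=P[i])              # next state X_{n+1}
    c = C[i] + 0.1 * rng.normal()          # noisy single-stage cost with mean C(i, pi(i))
    delta = c + gamma * Phi[j] @ theta - Phi[i] @ theta   # temporal difference delta_n
    theta += (1.0 / (n + 1) ** 0.7) * delta * Phi[i]      # update (18)
    i = j

# Fixed point (21): theta* = -(Phi^T D (gamma P - I) Phi)^{-1} Phi^T D C^pi
dstat = np.linalg.matrix_power(P.T, 200)[:, 0]            # stationary distribution
D = np.diag(dstat)
theta_star = -np.linalg.solve(Phi.T @ D @ (gamma * P - np.eye(S)) @ Phi, Phi.T @ D @ C)
print(np.abs(theta - theta_star).max())   # should be small (convergence is asymptotic)
```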

We present next the asynchronous version of algorithm (18). Consider a system with d processors marked 1, . . . , d. At instant n, the jth processor observes the feature component φ_j^n ≜ φ_j(X_n), j = 1, . . . , d, as well as the single-stage cost signal c_n. Thus each processor observes c_n, n ≥ 0. Also, we assume that the processors do not observe the state at all. It is the controller that observes the state at each instant and picks actions according to policy π. Alternatively, since we consider a problem of prediction and not control, one may assume that there is no controller as such and that the policy from which actions are chosen is known to the environment; in such a case, the state need not be observed at all. Let η_j^n ≜ γ φ_j^{n+1} θ_n(j) − φ_j^n θ_n(j). The quantity η_j^n is computed by the jth processor and transmitted to all the other processors. We assume that processor i receives this quantity with a delay τ_{ji}(n). In terms of η_j^n, j = 1, . . . , d, the temporal difference δ_n corresponds to

$$\delta_n = c_n + \sum_{j=1}^d \eta_j^n.$$

The asynchronous version of TD is thus as follows. For i = 1, . . . , d,

$$\theta_{n+1}(i) = \theta_n(i) + a(\nu(i,n))\, I\{i \in Y_n\} \Big(c_n + \sum_{j=1}^d \eta_j^{n-\tau_{ji}(n)}\Big)\, \varphi_i^n. \tag{19}$$
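A corresponding sketch of the asynchronous update (19) is given below (again our illustration, with i.i.d. update sets Y_n and i.i.d. bounded delays; these choices are assumptions made for the example):

```python
import numpy as np

rng = np.random.default_rng(3)
S, d, gamma, K, N = 6, 3, 0.9, 4, 300000
P = rng.random((S, S)); P /= P.sum(axis=1, keepdims=True)
C = rng.random(S)
Phi = rng.normal(size=(S, d))

theta = np.zeros(d)
eta = np.zeros((N + 1, d))       # eta[n, j] = eta_j^n, as broadcast by processor j
nu = np.zeros(d, dtype=int)      # local update counts nu(i, n)
i_state = 0
for n in range(N):
    j_state = rng.choice(S, p=P[i_state])
    c = C[i_state] + 0.1 * rng.normal()
    # processor j computes eta_j^n = gamma*phi_j(X_{n+1})*theta_n(j) - phi_j(X_n)*theta_n(j)
    eta[n] = gamma * Phi[j_state] * theta - Phi[i_state] * theta
    Y = np.flatnonzero(rng.random(d) < 0.5)           # Y_n: components updated at time n
    for i in Y:
        tau = rng.integers(0, min(K, n) + 1, size=d)  # delays tau_ji(n) <= K
        tau[i] = 0
        delta_i = c + eta[n - tau, np.arange(d)].sum()    # c_n + sum_j eta_j^{n - tau_ji(n)}
        theta[i] += (1.0 / (nu[i] + 1) ** 0.7) * delta_i * Phi[i_state, i]  # update (19)
        nu[i] += 1
    i_state = j_state

dstat = np.linalg.matrix_power(P.T, 200)[:, 0]
D = np.diag(dstat)
theta_star = -np.linalg.solve(Phi.T @ D @ (gamma * P - np.eye(S)) @ Phi, Phi.T @ D @ C)
print(np.abs(theta - theta_star).max())   # should again shrink toward zero as N grows
```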

We now make the following assumption on {Y_n}.

Assumption 8. (i) {Y_n} is an ergodic Markov chain on Ŝ, a set of predefined subsets of S̄ such that each component i ∈ S̄ is contained in at least one of the subsets. (ii) Λ_i(t) = α for all i ∈ S̄ and some α ∈ (0, 1).

Note that Assumption 8(i) is essentially Condition (⋆). As explained previously, it follows under Assumption 8(i) that Λ_i(t), i ∈ S̄ has the form Λ_i(t) = Σ_{A∈Ŝ : i∈A} π(A), where π(A), A ∈ Ŝ is the stationary distribution of {Y_n}. Assumption 8(ii) essentially implies that all components are updated (in the asymptotic limit) with the same relative frequency. For example, when {Y_n} takes values in {{1}, {2}, . . . , {d}}, i.e., at each instant exactly one component is updated, and the transition probabilities of {Y_n} are such that the stationary probability of the chain being in any state is the same across states, then Λ_i(t) = 1/d for all i ∈ S̄. As will be seen in the proof of Theorem 4, Assumption 8(ii) will not be required if the matrix Λ(t)Φ^T D(γP − I)Φ is negative definite; we show in Theorem 4 that this matrix is negative definite under Assumption 8. This is critically required for verifying Assumption 4(ii) and (iv).

We now proceed to analyze (19). Let C^π = (C(i, π(i)), i ∈ S)^T. Define T : R^{|S|} → R^{|S|} to be the operator

$$TJ(i) = C(i, \pi(i)) + \gamma \sum_{j \in S} p(i, j, \pi(i))\, J(j), \qquad i \in S.$$

We have the following result.

Theorem 4. Under Assumptions 2, 3 and 6–8, the sequence {θ_n} governed by (19) satisfies θ_n → θ* as n → ∞ with probability one, where θ* is the unique solution to

$$\Phi^T D \Phi\, \theta^* = \Phi^T D\, T(\Phi \theta^*). \tag{20}$$

In particular,

$$\theta^* = -\big(\Phi^T D (\gamma P - I) \Phi\big)^{-1} \Phi^T D\, C^\pi. \tag{21}$$

Proof. As explained previously (recall that Assumption 8(i) is the same as Condition (⋆)), Assumption 1 follows from Assumption 8(i). We proceed by verifying Assumptions 4 and 5. The ODE associated with (19) is

$$\dot{\theta}(t) = \Lambda(t)\big(\Phi^T D\, (T(\Phi\theta(t)) - \Phi\theta(t))\big) \triangleq \Lambda(t)\, h(\theta(t)). \tag{22}$$

It is easy to see that h(θ) is Lipschitz continuous in θ. Let

$$h_\infty(\theta) \triangleq \lim_{c \to \infty} \frac{h(c\theta)}{c} = \Phi^T D (\gamma P - I) \Phi\, \theta,$$

where I is the identity matrix. Consider now the ODE

$$\dot{\theta}(t) = \Lambda(t)\, h_\infty(\theta(t)) = \Lambda(t)\, \Phi^T D (\gamma P - I) \Phi\, \theta(t). \tag{23}$$

We will show that the matrix Λ(t)Φ^T D(γP − I)Φ is negative definite under Assumptions 6–8. For any x ∈ R^{|S|}, define the weighted Euclidean norm ‖x‖_D according to ‖x‖_D = (x^T D x)^{1/2}. Note that ‖x‖_D^2 = x^T D x = ‖D^{1/2} x‖^2. Now, for any V ∈ R^{|S|} (viewed as a function on S), using the conditional Jensen inequality and the stationarity of d(·), we have

$$\|PV\|_D^2 = V^T P^T D P V = \sum_{i \in S} d(i)\, \big(E[V(X_{n+1}) \mid X_n = i, \pi]\big)^2 \le \sum_{i \in S} d(i)\, E[V^2(X_{n+1}) \mid X_n = i, \pi] = \sum_{j \in S} d(j)\, V^2(j) = \|V\|_D^2.$$

We thus have ‖γPV‖_D ≤ γ‖V‖_D. Now, by the Cauchy–Schwarz inequality,

$$V^T D\, \gamma P V = \gamma\, V^T D^{1/2} D^{1/2} P V \le \gamma\, \|D^{1/2} V\|\, \|D^{1/2} P V\| = \gamma\, \|V\|_D \|PV\|_D \le \gamma\, \|V\|_D^2 = \gamma\, V^T D V.$$

Thus,

$$V^T D (\gamma P - I) V \le (\gamma - 1)\, \|V\|_D^2 < 0, \qquad \forall\, V \ne 0,$$

implying that D(γP − I) is negative definite. Hence Φ^T D(γP − I)Φ is also negative definite, since Φ is of full rank by Assumption 7. Now, from Assumption 8, Λ(t) = αI with α > 0, and hence

$$\Lambda(t)\big(\Phi^T D (\gamma P - I) \Phi\big) = \alpha\, \big(\Phi^T D (\gamma P - I) \Phi\big)$$

is also negative definite. From the above, zero is the unique globally asymptotically stable equilibrium for the ODE (23). Let G_n, n ≥ 0 be the sequence of sigma fields G_n = σ(φ_i^m, τ_{ji}(m), Y_m, r_m(i), i, j = 1, . . . , d, m ≤ n). Now define M_{n+1}(i), i = 1, . . . , d, n ≥ 0 according to

$$M_{n+1}(i) = \delta_n \varphi_i^n - E[\delta_n \varphi_i^n \mid G_n] = \Big(c_n + \sum_{j=1}^d \eta_j^n\Big)\varphi_i^n - E\Big[\Big(c_n + \sum_{j=1}^d \eta_j^n\Big)\varphi_i^n \,\Big|\, G_n\Big].$$

It is easy to see that (M_{n+1}, G_n), n ≥ 0, with M_{n+1} = (M_{n+1}(1), . . . , M_{n+1}(d))^T for all n, is a martingale difference sequence. Further, ‖M_{n+1}‖ ≤ C_0(1 + ‖θ_n‖) for some constant C_0. Now let θ* be a solution to

$$\Lambda(t)\, \Phi^T D\, (T(\Phi\theta) - \Phi\theta) = 0. \tag{24}$$

Note that (24) corresponds to the linear system of equations

$$\alpha\big(\Phi^T D\, C^\pi + \Phi^T D (\gamma P - I) \Phi\, \theta\big) = 0. \tag{25}$$

Now, since Φ^T D(γP − I)Φ is negative definite, it is of full rank and hence invertible. Thus θ* is the unique solution to (25) and corresponds to (21). Assumptions 4 and 5 are thereby verified, and the claim follows from Theorem 3. □
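The negative definiteness of Φ^T D(γP − I)Φ established in the proof is also easy to verify numerically; a small sketch (ours, with randomly generated quantities) follows:

```python
import numpy as np

rng = np.random.default_rng(4)
S, d, gamma = 6, 3, 0.9
P = rng.random((S, S)); P /= P.sum(axis=1, keepdims=True)  # transition matrix under pi
Phi = rng.normal(size=(S, d))                              # full-rank features (a.s.)
dstat = np.linalg.matrix_power(P.T, 200)[:, 0]             # stationary distribution
D = np.diag(dstat)

A = Phi.T @ D @ (gamma * P - np.eye(S)) @ Phi    # the matrix Phi^T D (gamma P - I) Phi
sym_eigs = np.linalg.eigvalsh(A + A.T)           # x^T A x < 0 iff A + A^T is negative definite
print(sym_eigs.max())                            # should be strictly negative
```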

5. Conclusions

We presented a generalization of a result by Borkar and Meyn [1] on the stability and convergence of synchronous-update stochastic approximation to the case of asynchronous stochastic approximation with delays. While both tapering and constant step-size cases have been considered in [1], we only considered the case of diminishing step-sizes. We required step-sizes to be diminishing in order to offset the effect of delays: in particular, even though by the first part of Assumption 2(iii), t(n) → ∞ as n → ∞, it can be seen that t(n + K) − t(n) → 0 as n → ∞ for any fixed K (which, in particular, can be chosen as in Assumption 3). The case of constant step-sizes is harder to deal with; it might be possible to generalize the constant step-size scenario to the case when the delays asymptotically vanish. We also showed an application of the result to asynchronous-update temporal difference learning with function approximation and delays. We required Assumption 8(ii) in order to show the negative definiteness of an associated matrix. It would be interesting to study the application of this result to other RL algorithms as well, and to see whether Assumption 8(ii) can be relaxed in those settings.

Acknowledgment

The author thanks Prof. Vivek Borkar for helpful discussions.

References

[1] V.S. Borkar, S.P. Meyn, The O.D.E. method for convergence of stochastic approximation and reinforcement learning, SIAM Journal on Control and Optimization 38 (2) (2000) 447–469.
[2] V.S. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint, Cambridge University Press and Hindustan Book Agency, 2008.
[3] H.J. Kushner, G.G. Yin, Stochastic Approximation Algorithms and Applications, Springer Verlag, New York, 1997.
[4] D.P. Bertsekas, Dynamic Programming and Optimal Control, third ed., vol. II, Athena Scientific, Belmont, MA, 2007.
[5] S. Bhatnagar, R.S. Sutton, M. Ghavamzadeh, M. Lee, Natural actor–critic algorithms, Automatica 45 (2009) 2471–2482.
[6] S. Bhatnagar, An actor–critic algorithm with function approximation for discounted cost constrained Markov decision processes, Systems and Control Letters 59 (2010) 760–766.
[7] L.A. Prashanth, S. Bhatnagar, Reinforcement learning with function approximation for traffic signal control, IEEE Transactions on Intelligent Transportation Systems (2011), in press (doi:10.1109/TITS.2010.2091408).
[8] D.P. Bertsekas, J.N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, Prentice Hall, Englewood Cliffs, NJ, 1989.
[9] V.S. Borkar, Asynchronous stochastic approximations, SIAM Journal on Control and Optimization 36 (3) (1998) 840–851.
[10] D.P. Bertsekas, J.N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, MA, 1996.
[11] J.N. Tsitsiklis, B. Van Roy, An analysis of temporal difference learning with function approximation, IEEE Transactions on Automatic Control 42 (5) (1997) 674–690.
[12] V.R. Konda, V.S. Borkar, Actor–critic like learning algorithms for Markov decision processes, SIAM Journal on Control and Optimization 38 (1) (1999) 94–123.
[13] D.P. Bertsekas, Nonlinear Programming, Athena Scientific, Belmont, MA, 1999.