A systolic architecture for iterative LQ optimization


Automatica, Vol. 27, No. 5, pp. 799-810, 1991

0005-1098/91 $3.00 + 0.00 Pergamon Press plc © 1991 International Federation of Automatic Control

Printed in Great Britain.

A Systolic Architecture for Iterative LQ Optimization*

L. CHISCI† and G. ZAPPA†‡

A systolic architecture to iteratively compute the LQ dynamic feedback gain can be profitably used to speed up controller parameter synthesis in an adaptive control environment.

Key Words--Optimal control; parallel processing; computational methods; computer architecture.

fast and robust matrix computations (e.g. inversion, decompositions, factorization updating and eigenvalue computation) and digital signal processing (e.g. discrete Fourier transform, convolution, adaptive prediction and adaptive beamforming) (S. Y. Kung, 1988). At the same time, the design and manufacturing of systolic hardware have been undertaken [see, for instance, Fortes and Wah (1987) for a thorough review of the state of the art]. The control community was not insensitive to the opportunities offered by parallel computing. Major efforts, however, were devoted to implementing complex control tasks (e.g. nonlinear simulation, fault diagnosis, statistical analysis of data) on multiprocessor architectures, rather than to redesigning basic control algorithms so as to fully exploit the potential advantages of the fine-grain concurrency provided by systolic arrays (Irwin and Fleming, 1990). The only noticeable exception was the design of systolic Kalman filters: indeed, since the pioneering work of Andrews (1981) and Jover and Kailath (1986), several new designs have been proposed (Chen and Yao, 1986; Sung and Hu, 1987; S. Y. Kung, 1988; Gaston and Irwin, 1989, 1990), leading to remarkable performance enhancements. In the authors' opinion, adaptive control of fast dynamical systems (e.g. flexible robot links, electromechanical devices) could also benefit from a systolic design. Considering an indirect adaptive controller as the interconnection of three main blocks (i.e. a linear controller, a recursive parameter estimator and a block for the synthesis of the controller parameters), the main computational load is due to parameter estimation and controller parameter synthesis. Most of the related algorithms---whatever control law is adopted: pole-placement, linear quadratic (LQ) control, multistep-predictive

Abstract--An efficient square root algorithm and its systolic implementation for LQ dynamic output regulation of a SISO plant are presented. Due to the particular formulation adopted, based on an input-output plant description, the problem under consideration exhibits a very special structure which is fully exploited by the proposed implementation. As compared to the best parallel LQ implementations conceived for generic state-space descriptions and cost-functionals, which use O(n²) processors and take O(n) processing time per iteration, our implementation uses a trapezoidal array of d(4n + 2d − 1) processors and takes O(n/d) time, where n is the plant order and d is a design parameter trading off computing speed versus hardware complexity. In particular, if d is O(n), processing time turns out to be independent of n. The proposed architecture, being capable of providing the optimal feedback gain both in "receding horizon" and "iterations spread in time" modes, could profitably be used to speed up the controller parameter synthesis task in an adaptive control environment for applications where both high speed and high performance are demanded.

Notation. Lower-case (upper-case) letters denote scalars or vectors (matrices); A′ = transpose of A; 0_{n×m} = n × m null matrix; 0_n = null column vector of dimension n; I_n = n × n identity matrix; ‖v‖²_M ≜ v′Mv, M non-negative definite; for x ∈ ℝ, ⌈x⌉ = smallest integer greater than or equal to x.

1. INTRODUCTION

THE BREAKTHROUGH of VLSI (Very Large Scale Integration) technology had a strong impact on the development of computing architectures dedicated to specific classes of algorithms. In particular, systolic arrays have been proposed for

* Received 10 February 1988; revised 24 April 1989; received in final form 27 December 1990. The original version of this paper was not presented at any IFAC meeting. This paper was recommended for publication in revised form by Associate Editor D. W. Clarke, under the direction of Editor P. C. Parks. † Dipartimento di Sistemi e Informatica, Università di Firenze, Via di Santa Marta 3, 50139 Firenze, Italy. ‡ Author to whom all correspondence should be addressed.


control---exhibit some "regularity" and are, therefore, suitable for systolic implementation. In this paper, a systolic architecture to iteratively solve the LQ output regulation problem of a SISO plant is proposed; as such, it can be exploited for computing the optimal feedback gain according to a quadratic cost-functional. LQ optimization is related, by duality, to Kalman filtering (KF). Therefore, systolic arrays for KF could also provide the LQ optimal gain. However, they are not efficient: in fact, KF solutions concern generic state-space descriptions and noise covariance matrices, requiring O(N³) operations per iteration, where N is the state dimension. Hence, for a planar array with O(N²) processors, the processing time per iteration reduces to O(N). Conversely, in computing the LQ optimal control law, one can arbitrarily choose the basis of the state-space representation in which the Riccati iterations are carried out. In fact, a simple change of coordinates suffices to relate the feedback gain from "artificial" to "physical" variables. Canonical realizations, for instance, induce sparsity in the relevant matrices and, in the SISO output regulation case, reduce the number of operations per iteration to O(N²). Moreover, different canonical realizations induce the same kind and amount of operations but different data dependencies. This has no relevance in a sequential (single-processor) implementation, but affects the degree of pipelining, and hence the efficiency, of the systolic design. From our analysis, in particular, it turns out that the best choice is the nonminimal realization corresponding to an input-output (i/o) description of the plant. Incidentally, this is also the "natural" one in adaptive control, since it does not require state observers, and its parameters can be directly estimated from i/o data. The proposed architecture is based on the triangular array introduced by Gentleman and H. T. Kung (1981) for QR decomposition.
It consists of a trapezoidal array (2d rows and 2d + 2n − 1 columns) of processors together with a bank of 2d + 2n multiplexers, where n is the plant order and d is a design parameter which trades off computing speed vs. hardware complexity. The architecture carries out an LQ iteration in time O(n/d), instead of the O(n²) time of a single-processor implementation. In particular, if d = O(n), the processing time becomes independent of the plant order; that is to say, the processing time achieved with O(n²) processors for an nth-order plant is of the same order of magnitude as that achieved by a single processor for a first-order plant. Since the array can compute the parameters of

several control laws related to LQ optimization, and can be used in conjunction with existing arrays for linear filtering and recursive parameter estimation, it may provide the kernel of a cheap, high-performance and high-speed adaptive controller. The paper is organized as follows. In Section 2, some background material is presented, concerning algorithms for UD factorization updating and their mapping onto systolic arrays. In Section 3, the UD solution of the LQ problem (Peterka, 1986) is revisited with an eye to efficient systolic design. In Section 4, the new array is described and its performance analysed, while Section 5 contains concluding remarks concerning implementation and technological aspects.

2. BACKGROUND MATERIAL

In order to give the paper a tutorial character for control engineers, possibly not fully aware of recent developments on systolic arrays, we shall present in this section some basic facts concerning algorithms and architectures.

A. Modified Givens rotations

Let us consider a symmetric non-negative definite matrix M of dimension n and rank at most two. M can be factorized as

M = F′ΔF    (1)

where Δ = diag{δ₁, δ₂} and F = [f₁′ f₂′]′ is made up of the two row vectors f₁ and f₂. Moreover assume, without loss of generality, that the first component of the row vector f₂ is f₂₁ = 1. Then the Modified Givens Rotation (MGR) (Gentleman, 1973) is a transformation ℛ: (Δ, F) → (Δ̄, F̄) leaving M unchanged, viz.

M = F̄′Δ̄F̄

such that Δ̄ = diag{δ̄₁, δ̄₂}, f̄₁₁ = 0, f̄₂₁ = 1. It can be shown that

δ̄₂ = δ₂ + δ₁f₁₁²,    δ̄₁ = βδ₁  with  β = δ₂/δ̄₂.

Moreover,

F̄ = GF,    G = [ 1           −f₁₁ ]
                [ δ₁f₁₁/δ̄₂    β   ].
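As a concrete illustration, the MGR update above can be sketched in a few lines of Python. The function name and the guard for the degenerate zero-weight case are ours, not the authors'; this is a sketch of the dyadic reduction, not their implementation.

```python
def mgr(d1, d2, f1, f2):
    """Modified Givens Rotation (dyadic reduction): zero the first component
    of row f1 against row f2 (whose first component must be 1), keeping
    M = d1*f1'f1 + d2*f2'f2 unchanged.  Returns (d1b, d2b, f1b, f2b)."""
    assert abs(f2[0] - 1.0) < 1e-12
    c = f1[0]
    d2b = d2 + d1 * c * c                 # updated second weight
    if d2b == 0.0:                        # degenerate: both dyads vanish in column 1
        return d1, d2, list(f1), list(f2)
    beta = d2 / d2b
    d1b = beta * d1                       # updated first weight
    f1b = [a - c * b for a, b in zip(f1, f2)]   # first row: leading entry zeroed
    f2b = [(d1 * c * a + d2 * b) / d2b          # second row: leading entry stays 1
           for a, b in zip(f1, f2)]
    return d1b, d2b, f1b, f2b
```

A quick check that the weighted dyad sum M is preserved while f̄₁₁ = 0 and f̄₂₁ = 1 confirms the update formulas.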


Remark 1. The MGR is called dyadic reduction in Peterka (1986), since the matrix M can be expressed as the weighted sum of two symmetric dyads:

M = δ₁f₁′f₁ + δ₂f₂′f₂ = δ̄₁f̄₁′f̄₁ + δ̄₂f̄₂′f̄₂.  []

B. Triangularization updating

The transformation ℛ modifies the factorization (1) so that the first component of f₁ is zeroed out w.r.t. the first component of f₂. By iterating the transformation it is possible to compute the UD factorization of an n × n matrix M of rank at most m, i.e. to find M = F′ΛF where Λ is diagonal, F = [U F₁] and U ∈ ℝ^{m×m} is upper triangular with unit diagonal elements (u.u.t. matrix). In particular, it is possible to recover this structure on F in case it is lost due to the addition of a row or column vector, by using the row or column annihilation procedures described below.

B.1. Row annihilation. Let M = F′ΛF, where:

Λ = block-diag{δ₁, D},  D diagonal;
F = [ y     ]
    [ U  F₁ ],  U u.u.t.

Then M can be factorized also as M = F̄′Λ̄F̄ where

Λ̄ = block-diag{δ̄₁, D̄},  D̄ diagonal;
F̄ = [ 0′_{m−1}  ȳ ]
    [ Ū  F̄₁      ],  Ū u.u.t.

by sequentially applying m − 1 MGRs to the first row of F w.r.t. the jth row of F, j = 2, 3, …, m, so as to annihilate the components of y one by one from left to right.

B.2. Column annihilation. Let M = F′ΛF, where:

Λ = block-diag{δ₁, D},  D diagonal;
F = [ 1        f ]
    [ 0_{m−1}  U ],  U u.u.t.

Here we have assumed m = n. Then M can be factorized also as M = F̄′Λ̄F̄ where

F̄ = [ 1        0′_{m−1} ]
    [ 0_{m−1}  Ū        ],  Ū u.u.t.

by sequentially applying m − 1 MGRs to the jth row of F (j = m, m − 1, …, 2) w.r.t. the first row of F, so as to annihilate the components of f one by one from bottom up.
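The row annihilation procedure B.1 can be sketched as a loop of MGRs. Below is a minimal Python sketch under the stated assumptions (stored row j carries a unit pivot in column j, as rows of a u.u.t. matrix do); all names are ours, not the authors'.

```python
def row_annihilate(delta1, y, D, F):
    """Annihilate the first len(F) components of row y (weight delta1) against
    the rows of F (weights D, F[j] having a unit pivot in column j), keeping
    M = delta1*y'y + sum_j D[j]*F[j]'F[j] invariant.  Returns updated copies."""
    y = list(y); D = list(D); F = [list(r) for r in F]
    for j in range(len(F)):
        c = y[j]                          # component to annihilate (pivot column j)
        d2b = D[j] + delta1 * c * c
        if d2b == 0.0:
            continue                      # degenerate pair: nothing to rotate
        beta = D[j] / d2b
        # MGR of y against F[j]: new pivot row first (uses the old y and F[j])
        newpivot = [(delta1 * c * a + D[j] * b) / d2b for a, b in zip(y, F[j])]
        y = [a - c * b for a, b in zip(y, F[j])]      # y[j] becomes exactly zero
        F[j] = newpivot                               # pivot entry stays 1
        delta1, D[j] = beta * delta1, d2b
    return delta1, y, D, F
```

Because row j of a u.u.t. factor has zeros to the left of its pivot, each rotation leaves the previously annihilated components of y untouched, so the components are zeroed left to right exactly as in B.1.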

C. Systolic array for triangularization updating

Since triangularization updating represents a key ingredient in a variety of numerical linear algebra problems involving symmetric matrices, a triangular array (triarray) has been introduced


(Gentleman and H. T. Kung, 1981; McWhirter, 1983) as one of the first and most typical applications of systolic arrays. In order to gain more generality, it is convenient to consider an extension of the triarray, i.e. the trapezoidal array (trapezarray) shown in Fig. 1. To be more specific, an N₁ × N₂ (N₁ ≤ N₂) trapezarray consists of the interconnection of an N₁ × N₁ triarray with an N₁ × (N₂ − N₁) rectangular array. This array involves two kinds of elementary processing elements (PEs), diagonal and off-diagonal, whose processing functions are also displayed in Fig. 1. Diagonal PEs generate the coefficients of an MGR and propagate them rightwards to the neighboring PE. Off-diagonal PEs actually perform an MGR on the basis of the coefficients received from the neighboring left PE. Input (output) to (from) the array is provided through the diagonal input port IP₀ and the vertical input ports IP₁, IP₂, …, IP_{N₂} (the diagonal output port OP₀ and the vertical output ports OP₁, OP₂, …, OP_{N₂−N₁}). It is also convenient, for expository reasons, to denote the input to the array by the tuple (d, r), where d represents the scalar input at the diagonal port IP₀ and r represents the row vector input at the vertical ports IP₁, IP₂, …, IP_{N₂}, to be introduced in skewed order. With reference to the row annihilation procedure of Section B.1, let us assume that the entries of the diagonal matrix D are stored within diagonal PEs and that the nontrivial entries of the matrix [U F₁] are stored within off-diagonal PEs. Then, provided that (δ₁, y) is injected into the input ports, the array of Fig. 1 carries out (via MGRs): (a) annihilation of the row vector y; (b) updating (D, U, F₁) → (D̄, Ū, F̄₁) of the entries stored within PEs; (c) extraction of the transformed weight δ̄₁ and of the residual vector ȳ from the diagonal and vertical output ports, respectively. Further details can be found in Gentleman and H. T. Kung (1981) and McWhirter (1983).
Due to its different dependence-graph (S. Y. Kung, 1988), column annihilation cannot be directly implemented on the array of Fig. 1. In fact, row annihilation runs from the upper-left corner to the lower-right corner of the array, while column annihilation should run from the main diagonal to the upper-right corner. However, since the LQ optimization procedure considered in the next section is based solely on row annihilations, we shall not elaborate on modifications required to perform column annihilation.

FIG. 1. Basic systolic trapezarray for UD factorization updating. (The figure shows the input ports IP₀, IP₁, …, IP_{N₂}, the output ports OP₀, OP₁, …, OP_{N₂−N₁}, the diagonal and off-diagonal processors with their MGR processing functions, and the delay elements.)

The performance of a parallel implementation of an iterative algorithm is in general characterized by three figures of merit, i.e. latency, throughput rate and efficiency. Latency is defined as the number of time units* which elapse from the input of data until the output of the corresponding results. Throughput rate measures the speed of the implementation: it is defined as the number of iterations that are performed per time unit. Its reciprocal is called the pipelining period and is denoted by T_p. Efficiency, denoted by η, measures the utilization rate of the PEs: it is defined as the ratio of the speed-up to the number of PEs used. Speed-up is, in turn, defined as the ratio of the pipelining period of the sequential implementation to the pipelining period of the parallel implementation. It can easily be seen that the array in Fig. 1 performs triangularization updating with a latency of 2N₁ time units, a throughput rate of one iteration per time unit and 100% efficiency.

* A time unit is the time interval necessary for the slowest PE to complete one elementary cycle.
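As an illustrative numeric check (ours, not from the paper), the figures of merit for the N₁ × N₂ trapezarray can be tabulated as follows; the PE count N₁(N₁ + 1)/2 + N₁(N₂ − N₁) is simply the triarray part plus the rectangular part.

```python
def trapezarray_figures(N1, N2, seq_ops_per_iter):
    """Latency, pipelining period and efficiency of the N1 x N2 trapezarray,
    given the operation count per iteration of a sequential implementation."""
    pes = N1 * (N1 + 1) // 2 + N1 * (N2 - N1)   # triarray + rectangular PEs
    latency = 2 * N1                             # time units from input to output
    Tp = 1                                       # one iteration per time unit
    speedup = seq_ops_per_iter / Tp
    efficiency = speedup / pes                   # utilization rate of the PEs
    return pes, latency, Tp, efficiency
```

With one elementary operation per PE per time unit (so that the sequential cost per iteration equals the PE count), the efficiency evaluates to 1, matching the 100% figure quoted above.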

3. UD SQUARE ROOT ALGORITHMS FOR LQ OPTIMIZATION

UD square root algorithms for LQ optimization are briefly reported hereafter, following Peterka (1986). Emphasis is put here on the choice of the state-space realization and of the Bellman function, which have a direct relevance to the efficient systolic mapping of the algorithms. For the sake of simplicity, only LQ output regulation of a SISO plant is considered. Let us consider a SISO plant described by the ARX model

A(q⁻¹)y(t) = B(q⁻¹)u(t) + e(t)

(2)

where u(t) is the input, y(t) is the output and {e(t)} is a sequence of zero-mean i.i.d. random variables with variance σₑ²; A(·) and B(·) are polynomials in the backward shift operator q⁻¹:

A(q⁻¹) = 1 + a₁q⁻¹ + ⋯ + aₙq⁻ⁿ
B(q⁻¹) = b₁q⁻¹ + ⋯ + bₙq⁻ⁿ

where, w.l.o.g., we have assumed that deg A = deg B = n.    (3)
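For intuition, the ARX model (2) can be simulated directly from its difference-equation form. The coefficient values used below are arbitrary illustrations, not taken from the paper.

```python
def simulate_arx(a, b, u, e):
    """Simulate A(q^-1)y(t) = B(q^-1)u(t) + e(t), i.e.
    y(t) = -a1*y(t-1) - ... - an*y(t-n) + b1*u(t-1) + ... + bn*u(t-n) + e(t)."""
    n = len(a)
    y = []
    for t in range(len(u)):
        acc = e[t]
        for i in range(n):
            if t - 1 - i >= 0:
                acc += -a[i] * y[t - 1 - i] + b[i] * u[t - 1 - i]
        y.append(acc)
    return y
```

For example, a first-order plant (n = 1) with a₁ = 0.5, b₁ = 1 driven by an impulse and zero noise yields the impulse response 0, 1, −0.5, ….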

The ARX model (2) can also be represented in state-space form by

x(t + 1) = Ax(t) + bu(t) + he(t + 1)
y(t) = cx(t)    (4)

where x(t) denotes a (possibly nonminimal) state vector of dimension N ≥ n. Moreover, let us consider the T-step (T ≥ 1) quadratic cost-functional

E[𝒱(t, T) | Y^t],    𝒱(t, T) ≜ (1/T) Σ_{k=1}^{T} [y²(t + k) + ρu²(t + k − 1)],  ρ ≥ 0    (5)

together with the nonanticipative control strategy

u(t + k) ∈ σ(Y^{t+k}).    (6)

In (5)-(6), Y^τ ≜ {y(k), u(k − 1): k ≤ τ}, σ(Y^τ) is the σ-algebra generated by Y^τ, and E[· | Y^τ] denotes conditional expectation w.r.t. σ(Y^τ). According to Bellman's Dynamic Programming, the optimal input sequence {u*(t + k); k = 0, 1, …, T − 1} minimizing (5) is provided by

u*(t + k) = arg min_{u(t+k)} E[V_k | Y^{t+k}]    (7)

where the Bellman function V_k = V_k(x(t + k)) propagates backward according to

V_k = min_{u(t+k)} {E[V_{k+1} | Y^{t+k}] + ρu²(t + k)} + y²(t + k)    (8)

with the initial condition

V_T = ‖x(t + T)‖²_{P_T},    P_T = c′c.    (9)

The optimal input sequence is given by the state-feedback

u*(t + k) = −f_k x(t + k).    (10)

Moreover, V_{k+1} has the form

V_{k+1} = ‖x(t + k + 1)‖²_{P_{k+1}} + v_{k+1}    (11)

where v_{k+1} is a scalar deterministic quantity and P_{k+1} can be factorized as

P_{k+1} = U′_{k+1}D_{k+1}U_{k+1}    (12)

with D_{k+1} diagonal and U_{k+1} u.u.t. Using the factorization (12) and the representation (4) in (8), we get:

V_k = v_{k+1} + min_{u(t+k)} {E[‖x(t + k + 1)‖²_{P_{k+1}} | Y^{t+k}] + ρu²(t + k)} + y²(t + k)
    = (v_{k+1} + σₑ² h′P_{k+1}h) + min_{u(t+k)} ‖[u(t + k) x′(t + k)]′‖²_{M_k}    (13)

where

M_k = F′_k Λ_k F_k    (14)

with

Λ_k = block-diag{ρ, D_{k+1}, 1},    F_k = [ 1          0′_N     ]
                                          [ U_{k+1}b   U_{k+1}A ]
                                          [ 0          c        ].    (15)

In order to perform the minimization in (13), it is sufficient to apply the transformation procedure (Λ_k, F_k) → (Λ̄_k, F̄_k), yielding

M_k = F′_k Λ_k F_k = F̄′_k Λ̄_k F̄_k    (16)

with

Λ̄_k = block-diag{*, D_k, *},  D_k diagonal;    F̄_k = [ 1    f_k ]
                                                      [ 0_N  U_k ]
                                                      [ *    *   ],  U_k u.u.t.,

where * denote don't care entries. Due to the special structure of F̄_k, u(t + k) affects only the first component of F̄_k[u(t + k) x′(t + k)]′. Hence u*(t + k) is just that value for which this component is set to zero, viz.

u*(t + k) = −f_k x(t + k).    (17)

Thus, we get

V_k = ‖x(t + k)‖²_{P_k} + v_k

where

P_k = U′_k D_k U_k,    v_k = v_{k+1} + σₑ²(h′P_{k+1}h).

Remark 3. Computational complexity with a generic realization. For a generic state-space realization (4), this LQ update procedure requires:
1. Matrix-matrix multiplication U_{k+1}[b A];
2. Column annihilation of U_{k+1}b w.r.t. the first row of F_k;
3. Triangularization of U_{k+1}A;
4. Row annihilation of c w.r.t. the triangular matrix resulting from step 3. []
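For cross-checking, the backward recursion (8)-(17) computes the same feedback gain as the plain (non-square-root) Riccati difference equation for the cost y² + ρu² with y = cx. The sketch below, in pure Python with our own naming, is a reference implementation of that standard recursion, not of the authors' UD machinery.

```python
def mat_t(A):
    return [list(r) for r in zip(*A)]          # transpose

def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lq_gain_step(A, b, c, P, rho):
    """One backward LQ step: from P = P_{k+1} return (f_k, P_k) with
    f_k = (rho + b'Pb)^(-1) b'PA  and  P_k = c'c + A'PA - s f_k'f_k."""
    Pb = mat_mul(P, [[x] for x in b])                        # P_{k+1} b
    s = rho + sum(bi * Pb[i][0] for i, bi in enumerate(b))   # rho + b'Pb
    PA = mat_mul(P, A)
    f = [sum(bi * PA[i][j] for i, bi in enumerate(b)) / s    # gain row vector
         for j in range(len(A))]
    AtPA = mat_mul(mat_t(A), PA)
    N = len(A)
    Pk = [[AtPA[i][j] - s * f[i] * f[j] + c[i] * c[j] for j in range(N)]
          for i in range(N)]
    return f, Pk
```

On a first-order plant the iteration converges to the fixed point of the scalar Riccati equation, which can be verified directly.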

Remark 4. Computational complexity with a canonical realization. Let us consider the observer canonical realization (Kailath, 1980), viz.

A = [ 0′_{n−1}     ]        a = [aₙ ⋯ a₂ a₁]′
    [ I_{n−1}   −a ],       b = [bₙ ⋯ b₂ b₁]′

c = h′ = [0′_{n−1} 1]    (18)

(the first n − 1 columns of A are [0′_{n−1}; I_{n−1}] and its last column is −a).


Then F_k in (15) becomes

F_k = [ 1          0′_n     ]
      [ U_{k+1}b   U_{k+1}A ]
      [ 0          c        ].

In this case, the LQ iteration requires:
1. Matrix-vector multiplications U_{k+1}b, U_{k+1}a;
2. Column annihilation of U_{k+1}b w.r.t. the first row of F_k, viz. a transformation (Λ_k, F_k) → (Λ̃_k, F̃_k) such that

Λ̃_k = block-diag{*, Σ_k},  Σ_k diagonal;    F̃_k = [ 1    f̃_k ]
                                                   [ 0_n  Ω_k ],  Ω_k quasi-u.u.t.;

3. Row annihilation of y_k w.r.t. U_k, viz. a transformation (Σ_k, Ω_k) → (Σ̄_k, Ω̄_k) such that

Σ̄_k = block-diag{*, D_k},  D_k diagonal.

Therefore, the sparsity of A and c in (18) allows, w.r.t. the generic procedure of Remark 3, to simplify step 1 and to get rid of the triangularization step 3. []

However, the procedure outlined in Remark 4 is still unsuitable for efficient systolic implementation. In fact, even if the resulting three steps can be separately performed on a systolic trapezarray, their combination severely hinders pipelining of iterations. More precisely, at each iteration the inner products of step 1 represent a bottleneck for the overall computation. Moreover, the sequencing of the column and row annihilation steps 2 and 3 is also critical, as it prevents pipelining and requires a significantly more complicated array structure in order to handle both column and row annihilation procedures. Conversely, the choice of a nonminimal realization makes it possible to avoid both inner products and column annihilations. In fact, defining the parameter vector

θ = [b₁ −a₁ b₂ −a₂ ⋯ bₙ −aₙ]′    (19)

and the state vector

x(t) = [y(t) u(t − 1) y(t − 1) ⋯ u(t − n + 1) y(t − n + 1)]′    (20)

we get the nonminimal (N = 2n − 1) state-space representation

[b A] = [ θ′                   ]
        [ I_{N−1}  0_{(N−1)×2} ],    c = h′ = [1 0′_{N−1}].    (21)

Partition D_{k+1} and U_{k+1} as

D_{k+1} = block-diag{d_{k+1}, Σ_{k+1}},    U_{k+1} = [ 1        u′_{k+1} ]
                                                     [ 0_{N−1}  Ω_{k+1}  ].    (22)

Then,

U_{k+1}[b A] = [ y′_{k+1}                ]
               [ Ω_{k+1}   0_{(N−1)×2}   ]    (23)

with the "row addition"

y′_{k+1} = θ′ + [u′_{k+1} 0 0].    (24)

Taking into account (22)-(24), the matrix M_k in (13) can be factorized as in (14) with

Λ_k = block-diag{ρ, D_{k+1}, 0, 0, 1}    (25)

and F_k made up of the rows e₁ ≜ [1 0′_N], y′_{k+1}, [Ω_{k+1} 0_{(N−1)×2}], the two unit rows carrying the zero weights in Λ_k, and e₂ ≜ [0 1 0′_{N−1}]; the resulting factor Ω_k is quasi-u.u.t., viz.

Ω_k = [ e₁  ]
      [ e₂  ]
      [ U_k ],  U_k u.u.t.    (26)

With reference to the representation (19)-(21), the LQ iteration requires, therefore, the following steps:
1. Row addition (24);
2. Row annihilations of e₁, e₂ and y_k w.r.t. Ω_k.

A further improvement in efficiency can be obtained by iterating the modified Bellman function

Ṽ_{k+1} = V_{k+1} + Σ_{i=1}^{n−1} [y²(t + k + 1 − i) + ρu²(t + k + 1 − i)]    (27)

according to

Ṽ_k = min_{u(t+k)} {E[Ṽ_{k+1} | Y^{t+k}]} + ρu²(t + k + 1 − n) + y²(t + k + 1 − n)    (28)

with the initial condition

Ṽ_T = ‖x(t + T)‖²_{P_T},    P_T = diag{1, ρ, 1, ρ, …, 1, ρ, 1}.    (29)

Comparing (28) with (8), it is easily seen that Λ_k and F_k reduce to

Λ_k = block-diag{D_{k+1}, ρ, 1} = block-diag{d_{k+1}, Σ_k}    (30)

so that the annihilation of the row vectors e₁ and e₂ is no longer required. Summing up, the overall updating procedure takes the form

(Λ_k, F_k) → (Λ̄_k, F̄_k):   Λ̄_k = block-diag{*, *, D_k};    F̄_k = [ 1    f_k ]
                                                                   [ 0_N  U_k ].    (31)

Remark 5. Comparing the procedure (31) with the one of Remark 4, we find that the latter requires roughly half the number of operations and is, therefore, preferable for sequential implementation. On the other hand, the former is by far more favourable for parallel implementation, as will be seen in the next section. []

Remark 6. Use of iterative LQ optimization in adaptive control. Iterative LQ optimization provides a sequence of feedback gain vectors {fk; k = T - 1 , . . . , 1, 0}. In adaptive control, this procedure can be exploited to generate the control signal in the following way. At each sampling instant t, the current control is generated according to

u(t) = −f₀(θ̂(t))x(t)

where the feedback gain f₀(θ̂(t)) is computed from the currently available parameter estimate θ̂(t) by iterating the LQ procedure over a T-step horizon. Depending on whether the LQ iterations are reinitialized or not at each sampling instant, we may have the two following control strategies.
1. Receding Horizon (RH) control. At each sampling instant the LQ procedure is reinitialized with (29).
2. Iterations Spread in Time (IST) control. At each sampling instant, the LQ procedure starts with the matrix P (actually its factors U and D) resulting from the last iteration of the previous sampling period.
Provided that the parameter estimates converge, IST control asymptotically yields the infinite-horizon LQ optimal control. Notice that, in this case, T can be quite small, since it does not affect the closed-loop asymptotic behaviour. []
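The two strategies of Remark 6 can be sketched schematically as follows. Here a scalar first-order plant and a plain (non-square-root) Riccati step stand in for the UD machinery, and all names and numerical values are illustrative assumptions, not the authors' implementation.

```python
def riccati_step(p, a=0.5, b=1.0, c=1.0, rho=1.0):
    """One backward LQ step for a first-order plant; returns (gain, new p)."""
    s = rho + b * b * p
    f = a * b * p / s
    return f, c * c + a * a * p - s * f * f

def rh_gain(T, p_T=1.0):
    """Receding Horizon: re-initialize with P_T and iterate T times per sample."""
    p, f = p_T, 0.0
    for _ in range(T):
        f, p = riccati_step(p)
    return f

def ist_gains(samples, iters_per_sample=1, p_T=1.0):
    """Iterations Spread in Time: keep p across sampling instants."""
    p, gains = p_T, []
    for _ in range(samples):
        f = 0.0
        for _ in range(iters_per_sample):
            f, p = riccati_step(p)
        gains.append(f)
    return gains
```

For a fixed (converged) parameter estimate, the IST gain sequence approaches the same steady-state gain that RH control computes with a long horizon, illustrating the asymptotic equivalence noted in the remark.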

4. SYSTOLIC IMPLEMENTATION

The updating procedure described in the previous section via equations (26)-(31) can easily be mapped onto an (N + 2) × (N + 2) triarray (cf. Section 2C). With reference to (22), let us assume that the matrices U_{k+1} and Ω_{k+1} are stored in the N × N top-left part of the array, while the values 1, ρ and, respectively, 0 are stored in the remaining

FIG. 2. LQ iteration on the triarray: Initial configuration.

diagonal and off-diagonal PEs. This initial configuration is illustrated in Fig. 2 for the case n = 3 (N = 5). Then it is easily seen that if the input (0, [−1 0′_N]) is fed into the array, the scalar weight d_{k+1} and the vector y_k in (24) emerge from the output ports of the first row. In fact, in correspondence with a zero weight and the value −1 at the input of a diagonal PE, the resulting MGR degenerates into a data exchange which routes the stored weight and the entries of the stored row to the output ports.

Hence (d_{k+1}, y_k) is timely sent downwards into the part of the array that contains (Σ_k, Ω_k), so that the row annihilation procedure (31) is carried out. At the end, the optimal feedback gain f_k is stored within the PEs of the second row, while the matrices U_k, D_k stay in the N-dimensional bottom-right part of the array. The final configuration of the array is depicted in Fig. 3. This updating procedure can of course be iterated by moving (D_k, U_k) from the bottom-right to the top-left part, reinitializing the PEs of the last two columns with the appropriate data and unloading the feedback gain vector f_k. A suitable systolic design is shown in Fig. 4, where the entries of U, D are propagated diagonally from bottom-right to top-left instead of being updated in place. This design is, therefore, based on a "modified triarray" with additional upward diagonal connections, additional input ports for loading the initial values and additional output ports for unloading the


FIG. 3. LQ iteration on the triarray: Final configuration.

feedback gain. The resulting performance gives a pipelining period T_p = 6 and efficiency η ≈ 17%. A considerably more efficient design will now be derived, based on a cylindrical trapezarray topology for data movement. As a preliminary step towards this final array design, let us consider pipelining of multiple iterations on an enlarged triarray. For instance, two subsequent iterations can be pipelined onto an (N + 4) × (N + 4) triarray (where D_{k+1} and U_{k+1} are stored, as before, in the N × N top-left part) provided that (0, [−1 0 0 0]) and (0, [0 0 −1 0]) are subsequently input. Notice that, in conjunction with a zero value at the input of a

FIG. 4. Modified triarray design for iterative LQ optimization.

diagonal PE, no interaction occurs between the incoming data and the data stored in the PEs of the same row. Accordingly, T iterations can be performed onto a 2T × (2T + N) trapezarray. At the outset of the computation, the diagonal PEs are initialized with 1 and ρ alternately, while the off-diagonal PEs are initialized with zeroes. At the kth iteration (k = T − 1, …, 1, 0), the input (0, [0_{2(T−k−1)} −1 0 0_{2k}]) is injected into the array. As soon as all the T data rows have passed through the array, the PEs of the 2(T − k)th row contain the feedback gain f_k. In particular, f₀ is stored within the last row and can, therefore, be unloaded simply by appending the extraction input (0, [0_{2T−1} −1 0_N]) to the other data inputs. It is clear that in order to perform a new optimization the PEs must be re-initialized. This architecture, however, still suffers from some major drawbacks. First, the size of the array depends on the control horizon T. This is somewhat inconvenient, as one should be allowed to choose an arbitrarily large T. Second, it can implement RH control, where the optimization procedure is restarted at each sampling period, but cannot implement IST control, which requires that the optimization be carried out indefinitely (without restart) with possibly new plant parameter estimates at each sampling period. Third, due to the sparsity of the input matrix, actually a band matrix, the design is highly inefficient in terms of PE utilization, the more so the larger T is. In order to circumvent these difficulties, we would like to implement the gain computation for an arbitrary horizon T on a fixed-size (i.e. not depending on T) array. Partitioning (Navarro et al., 1987), namely mapping an algorithm into an array whose size is problem-independent, represents a key issue of systolic design. Similar considerations hold for the other dimension N. However, since the variability of the plant order is considerably lower, partitioning w.r.t. N will not be addressed hereafter. Hence, assume that only a 2d × (2d + N) trapezarray is available, while the optimization problem must be solved for any T > d. The problem can still be tackled provided that the array is supported by a bank of multiplexers and by circular connections allowing either data injection from the outside (host system) or data recirculation, and possibly by a delay network (of size Ld × (N + 1)) for buffering the recirculated data. The overall architecture is shown in Fig. 5. The way in which this architecture can simulate the full 2T × (2T + N) trapezarray is illustrated in Fig. 6. The active part of the full trapezarray has been partitioned into smaller trapezarrays of size 2d × (2d + N); the ith subarray (from the top) is denoted by 𝒯ᵢ.

FIG. 6. Partitioning multiple LQ iterations on a fixed-size trapezarray.

FIG. 5. Cylindrical trapezarray design for iterative LQ optimization. (Each two-way multiplexer computes o := c·i₁ + (1 − c)·i₂, c ∈ {0, 1}.)

Correspondingly, the nonzero part of the input data matrix of size T × (2T + N) has been partitioned into strips of d rows each. The ith strip of input data, denoted by 𝒮ᵢ, has therefore the following format:

𝒮ᵢ = {(0, [0_{2(T−k−1)} −1 0 0_{2k}]):  k = T − (i − 1)d − 1, …, T − id}

where data rows are input to the array starting from the bottom. Due to the particular structure of the updating procedure (two consecutive data rows interact with triangular windows of size (N + 2) × (N + 2), shifted by two positions along the main diagonal of the entire array), 𝒮ᵢ starts interacting with the array in 𝒯ᵢ. Before introducing 𝒮ᵢ in 𝒯ᵢ, however, 𝒯ᵢ must have subsequently interacted with 𝒮_{i−P+1}, …, 𝒮_{i−2}, 𝒮_{i−1}, where P = ⌈N/2d⌉ + 1. Note that P represents the number of passes through the array that a data-strip needs to perform in order to be completely annihilated, in groups of 2d columns per pass. Now consider implementation on the architecture of Fig. 5. We can consecutively input 𝒮₁, 𝒮₂, 𝒮₃, … into the array and annihilate each strip by multiple passes. As a generic strip, say


𝒮ᵢ, would emerge from the bottom ports, after a group of transformations, with a latency of (4 + L)d time units (4d being the latency of the trapezarray and Ld being the additional latency of the delay network), the next strip 𝒮_{i+1} can be injected into the array (5 + L)d time units later. The array operation is therefore cyclic modulo T_d = (5 + L)d. The input scheduling for a complete cycle is shown in Table 1. Note that, during the ith cycle, the available 2d × (2d + N) trapezarray simulates 𝒯ᵢ. Therefore, soon after 𝒮ᵢ has been fed in from the external ports IP₁, IP₂, …, IP_{N+2d} for the first pass, the strips 𝒮_{i−P+2}, …, 𝒮_{i−1}, 𝒮ᵢ are subsequently fed back through the ports OP₀, OP₁, …, OP_N, respectively for the Pth, …, 3rd, 2nd passes. In order to do this, there is a separate two-way multiplexer for each array column. Control of


TABLE 1. INPUT SCHEDULING FOR A SINGLE LQ ITERATION ON THE ARCHITECTURE OF FIG. 5

Cycle   Input strip   Pass      Time interval
i       𝒮ᵢ            1st       (i − 1)T_d, …, (i − 1)T_d + d − 1
        𝒮_{i−P+2}     Pth       (i − 1)T_d + d, …, (i − 1)T_d + 2d − 1
        𝒮_{i−P+3}     (P−1)th   (i − 1)T_d + 2d, …, (i − 1)T_d + 3d − 1
        ⋮             ⋮         ⋮
        𝒮_{i−1}       3rd       (i − 1)T_d + (P − 2)d, …, (i − 1)T_d + (P − 1)d − 1
        𝒮ᵢ            2nd       (i − 1)T_d + (P − 1)d, …, (i − 1)T_d + Pd − 1
        --            --        (i − 1)T_d + Pd, …, iT_d − 1
i + 1   𝒮_{i+1}       1st       iT_d, …, iT_d + d − 1

the multiplexers must therefore guarantee that

⌈N/2d⌉ ≤ 4 + L    (32)

namely that the overall latency, i.e. (4 + L)d, is no less than (P − 1)d, the number of data rows that must be recirculated in between the input of two consecutive data-strips.

P e r f o r m a n c e considerations

According to (32), L can be zero if the array size is large enough, precisely if d-> [N/8]. In general, for a given d, L should be chosen as small as possible for best performance, i.e. L = max {0, [N/2d] - 4}. Accordingly, the performance of our cylindrical array design is characterized by the following figures: (a) Pipelining period

Td

(33)

and (b) Efficiency 4n 2 + 6n + 2 ~ ( N , d) - 2Td(4n + 2d - 1)

1st

and the design parameter d. Examination of (33) and (34) shows that a good choice for d lies in the interval 1-< d-< [n/4]; the closer is d to the lower bound the higher is the efficiency (the lower is the throughput rate) and vice versa the closer is d to the upper bound. The array size d should therefore be properly selected so as to trade-off speed vs efficiency. In the authors' knowledge, the two array designs of Figs 4 and 5 represent the first specific attempts of parallel implementation for the LQ problem under consideration. Systolic designs of KF algorithms can be adapted to the general LQ problem but give inefficient performance for the specific LQ problem defined by (2) and (5). For instance, the array of Gaston and Irwin (1990) uses (n + 1)(n + 4)/2 PEs and takes 2n + 3 time units to perform one LQ iteration. Of course, the array of Gaston and Irwin has been designed for generic state-space representations and weight matrices in the cost-functional. Hence it cannot take advantage of the very special structure of the i/o plant model (2) and of the cost-functional (5) which induce sparsity in the system and weight matrices. A performance comparison between Gaston and Irwin's design, the Modified Triarray Design (MTD) of Fig. 4 and the Cylindrical Trapezarray

multiplexing is also cyclic modulo Td and very simple as the various multiplexers must be controlled in skewed order. For the architecture to work properly, the parameters L (buffer size) and d (array size) must satisfy the inequality

Tp = ~ = max {5, [n/d] + 1}

1

(34)

which depend on the plant order n = [N + 1]/2

Design (CTD) of Fig. 5 is reported in Table 2. It is evident that the CTD, whose efficiency depends on n and d, is far more efficient than the MTD, which, on the other hand, has a constant efficiency of about 17%. For instance, taking d = 1 and n = 5, we get, for the CTD, Tp = 6 (as for the MTD) and ε ≈ 52% (much better than the MTD). For an arbitrary n, taking d = [n/4], the CTD gives Tp = 5 and ε ranging from 28% to 61%, again much better than the MTD.

TABLE 2. PERFORMANCE COMPARISON OF LQ SYSTOLIC DESIGNS

Design         #PEs                Tp                    Efficiency
Gaston-Irwin   (n² + 5n + 4)/2     2n + 3                O(1/n) (very low)
MTD            (4n² + 6n + 2)/2    6                     16.7%
CTD            d(4n + 2d - 1)      max{5, [n/d] + 1}     (4n² + 6n + 2) / (2d max{5, [n/d] + 1}(4n + 2d - 1))
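The closed-form figures of (33)-(34) and Table 2 are easy to check numerically. The short script below (the helper names are ours, not from the paper) evaluates the CTD and MTD figures, assuming n = (N + 1)/2 and Td = (5 + L)d as in the text; it reproduces the 52% and 28% efficiency values quoted above.

```python
from math import ceil

def ctd_figures(n, d):
    """Pipelining period (33) and efficiency (34) of the Cylindrical
    Trapezarray Design for plant order n and array size d."""
    N = 2 * n - 1                        # from n = (N + 1)/2
    L = max(0, ceil(N / (2 * d)) - 4)    # smallest buffer size satisfying (32)
    Td = (5 + L) * d                     # cycle length Td = (5 + L)d
    Tp = Td // d                         # equals max{5, ceil(n/d) + 1}
    eff = (4 * n**2 + 6 * n + 2) / (2 * Td * (4 * n + 2 * d - 1))
    return Tp, eff

def mtd_figures(n):
    """MTD figures from Table 2: constant Tp = 6 and efficiency 1/6."""
    return 6, 1 / 6

# d = 1, n = 5: the CTD matches the MTD's Tp = 6 but is about
# three times as efficient (52% vs 16.7%).
print(ctd_figures(5, 1))   # (6, 0.5238...)
# d = ceil(n/4) = 2 trades efficiency for speed: Tp = 5, eff ~ 0.287.
print(ctd_figures(5, 2))
print(mtd_figures(5))
```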

Initialization and feedback gain unloading

To make the array work properly we need to perform, once at each cycle, the following tasks.

1. Reinitialize the PE memories with the appropriate values (i.e. 1 and ρ in the diagonal PEs and zero in the off-diagonal PEs) just before the recirculated data start interacting with the PEs.

2. Unload the components of the feedback gain vector that are stored in the bottom row of the trapezarray each time a strip has completed its first pass through the array.

To unload the feedback gain, we may append the extraction row (0, [0_{2d-1}  -1  0_N]) to each input data-strip. This slightly increases the pipelining period (i.e. Tp = (Td + 1)/d). A control signal can also be propagated together with the extraction row in order to reinitialize the PEs. An important feature of this fixed-size architecture is that it can easily cope with the IST adaptive control strategy, simply by modifying the input data-strip D_i as soon as a new parameter estimate θ becomes available.

5. CONCLUSIONS

A new systolic architecture for LQ optimization has been presented. It follows a UD square-root approach and exploits a particular i/o representation and Bellman function in order to increase the throughput rate. By an optimal choice of the array size, a throughput rate of 0.2 iterations per time unit can be obtained, thus outperforming other, more complex, systolic designs proposed for the dual Kalman filtering problem. The basic regulation problem considered in the paper can be generalized in several respects, e.g. inclusion of dynamic weights, reference signals, measurable disturbances and multivariable plants. The proposed array can deal with all of these extensions by suitably modifying the pattern of input data. In addition, it can be used in conjunction with existing systolic arrays for linear filtering and recursive parameter estimation in such a way that, linked together, the


three arrays can adaptively compute the LQ control signal from i/o data only (Chisci and Zappa, 1989). Other contributions to fast and parallel adaptive control computations are given in Chisci et al. (1990, 1991). Admittedly, adaptive control is not the simple interconnection of estimator, controller-parameter-synthesis and linear-controller blocks. Several supervisory and monitoring tasks must be performed, which may induce on-line interventions both in the estimation algorithm (e.g. change of the forgetting factor, filtering of i/o data and normalization) and in the control law (e.g. change of the cost-functional, constrained optimization). These tasks can be assigned to the host system: the systolic array would therefore perform only the low-level and computationally intensive part of a complex, hierarchically structured adaptive control task. These aspects are also discussed in Chisci and Zappa (1991). From a technological point of view, the proposed systolic architecture can be implemented in hardware using a variety of design solutions (Fortes and Wah, 1987); for instance, the PEs of the systolic array could be realized using commercially available chips, e.g. conventional microprocessors, digital signal processors or transputers (McCanny and McWhirter, 1987). These off-the-shelf devices, however, are not designed for the specific application and may therefore be inadequate to meet the design requirements. Even if useful for rapid prototyping, this solution is not suitable for the level of granularity of the proposed LQ implementation. In fact, the computing power of commercial chips is not well matched to the low complexity of the task executed by a single PE. A better balance between communication and computation could be achieved by considering a coarser-grain implementation, i.e. by assigning more complex tasks to the PEs. An opposite philosophy is that of designing dedicated chips.
The ultimate goal is that of implementing the proposed architecture on a single chip or, at least, a reduced chip-set. This may not be possible with the currently achievable densities of integration, but is very likely to become so by the year 2000, when the density of integration is expected to increase by a factor of 1000. To achieve highly dense and efficient layouts, the PEs of the systolic array must be carefully designed and optimized for the target application. For instance, the PEs of our LQ array could be designed using CORDIC cells. The main problem associated with the adoption of a special-purpose design philosophy is cost-effectiveness.
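To illustrate the last point, the sketch below shows the kind of shift-and-add arithmetic a CORDIC cell implements: a plane rotation, which is exactly the elementary operation a Givens-rotation PE must perform. This is a generic textbook CORDIC in rotation mode, not the paper's actual PE design.

```python
import math

def cordic_rotate(x, y, angle, n_iter=32):
    """Rotate the vector (x, y) by `angle` radians (|angle| < ~1.74)
    using only additions and scalings by powers of two."""
    alphas = [math.atan(2.0 ** -i) for i in range(n_iter)]   # elementary angles
    gain = math.prod(math.cos(a) for a in alphas)            # constant scale factor
    z = angle
    for i, a in enumerate(alphas):
        d = 1.0 if z >= 0.0 else -1.0          # drive the residual angle z to zero
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * a
    return gain * x, gain * y                  # undo the accumulated gain

# rotating (1, 0) by 45 degrees yields (cos 45, sin 45) ~ (0.7071, 0.7071)
print(cordic_rotate(1.0, 0.0, math.pi / 4))
```

In hardware, the multiplications by 2^-i become wired shifts and the `gain` correction a single constant multiply, which is why CORDIC cells suit dense VLSI layouts.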


Another solution for systolic design is the adoption of programmable systolic arrays. Nowadays, systolic-array chips with 10 to 100 simple programmable 1-bit cells are commercially available. It is to be hoped that the increase of integration density will make it possible to integrate more and more cells into a single chip, with increased wordlengths and enhanced arithmetic (possibly floating-point) capabilities. Since the tasks of the PEs and, sometimes, the array configuration can be programmed, mapping algorithms onto programmable systolic arrays is done via software. Summing up, there exist possibilities to implement the proposed LQ systolic design in hardware by assembling either commercial or dedicated chips, as well as by using dedicated or programmable systolic arrays. Given the current state of the art, none of these solutions is completely satisfactory from a cost-performance point of view for low- or medium-cost industrial automation problems. In fact, a dedicated design would be too expensive, while designs based on commercial or programmable systolic-array chips would be inefficient, as the former are too coarse-grain and the latter too fine-grain for the proposed systolic implementation. On the other hand, a dedicated design would be suitable for leading control applications (e.g. military, aircraft and space applications) where cost is of no concern. We believe, however, that the rapid advances of VLSI technologies, together with a better comprehension of adaptive control issues and the development of reliable "standard" algorithms, will fill this gap and make systolic solutions to control problems feasible and cost-effective in industrial applications as well.

Acknowledgement--This work was supported in part by the Italian MURST.

REFERENCES

Andrews, A. (1981). Parallel processing of the Kalman filter. Proc. Int. Conf. on Parallel Processing, Columbus, OH, 216-220.
Chen, M. J. and K. Yao (1986). On realizations of least-squares estimation and Kalman filtering by systolic arrays. In W. Moore, A. McCabe and R. Urquhart (Eds), Systolic Arrays. Adam Hilger, Bristol, UK.
Chisci, L., H. Lev-Ari, D. T. M. Slock and T. Kailath (1991). Fast parallel self-tuning controllers. Int. J. Control (to appear).
Chisci, L. and G. Zappa (1989). Systolic architectures for adaptive control. DSI Technical Report 20/89, University of Firenze, Italy.
Chisci, L., G. Zappa and E. Mosca (1990). On the implementation of predictive adaptive control by adaptive predictors. Int. J. Adaptive Control Signal Proc., 4, 187-206.
Fortes, J. A. B. and B. W. Wah (Eds) (1987). Special issue on systolic arrays. Computer, 20.
Gaston, F. M. F. and G. W. Irwin (1989). Systolic approach to square-root information filtering. Int. J. Control, 50, 225-248.
Gaston, F. M. F. and G. W. Irwin (1990). Systolic Kalman filtering: an overview. Proc. IEE Pt. D, 137, 235-244.
Gentleman, W. M. (1973). Least-squares computations by Givens transformations without square roots. J. Inst. Math. Applic., 12, 329-336.
Gentleman, W. M. and H. T. Kung (1981). Matrix triangularization by systolic arrays. Proc. SPIE (Real Time Signal Processing IV), 298, 19-26.
Irwin, G. W. and P. J. Fleming (Eds) (1990). Parallel processing for real-time control. Proc. IEE Pt. D, 137, 177-260.
Jover, J. M. and T. Kailath (1986). A parallel architecture for Kalman filter measurement update and parameter estimation. Automatica, 22, 43-57.
Kailath, T. (1980). Linear Systems. Prentice-Hall, Englewood Cliffs, NJ.
Kung, S. Y. (1988). VLSI Array Processors. Prentice-Hall, Englewood Cliffs, NJ.
McCanny, J. V. and J. G. McWhirter (1987). Some systolic array developments in the United Kingdom. Computer, 20, 51-63.
McWhirter, J. G. (1983). Recursive least-squares minimization using a systolic array. Proc. SPIE Symp. (Real Time Signal Processing VI), 430, 105-112.
Navarro, J. J., J. M. Llaberia and M. Valero (1987). Partitioning: An essential step in mapping algorithms into systolic array processors. Computer, 20, 77-89.
Peterka, V. (1986). Control of uncertain processes: Applied theory and algorithms. Kybernetika, 22 (suppl.), 1-102.
Sung, T. Y. and Y. H. Hu (1987). Parallel VLSI implementation of the Kalman filter. IEEE Trans. Aerospace Electron. Syst., AES-23, 215-224.