Future Generation Computer Systems 8 (1992) 349-361 North-Holland
349
Mapping uniform recurrences o n t o small size arrays * Vincent Van Dongen Centre de Recherche Informatique de Montrgal, 3744 Jean Brillant, Suite 500, Montreal, Quebec, H3T1P1 Canada
Abstract Van Dongen, V., Mapping uniform recurrences onto small size arrays, Future Generation Computer Systems 8 (1992) 349-361. Given a regular application described by a system of uniform recurrence equations, systolic arrays are commonly derived by means of an affine transformation; an affine schedule determines when the computations are performed and an affine processor allocation where they are performed. Circuit transformations are then applied on the resulting circuit when the application needs to be mapped onto a smaller size array. This method is in two steps and thus can hardly be optimized globally. We hereafter present a different method for designing small size arrays. We derive them in one step by means of an affine schedule and a near-affine processor allocation. By doing so, we can generalize the optimization technique for affine mapping to be applicable here. The method is illustrated on the band-matrix multiplication and on the convolution algorithms.
Keywords. Uniform recurrence; systolic array design; near-affine mapping; automatic partitioning.
i. I n l r o d u c l i o n Systolic arrays are p a r t i c u l a r circuits m a d e of identical processing e l e m e n t s c o n n e c t e d in a local a n d r e g u l a r m a n n e r ; a high t h r o u g h p u t is achieved by m a k i n g extensive use of p a r a l l e l i s m a n d pipelining. T h e regularity a n d locality of their c o n n e c t i o n s m a k e t h e m ideally suited for a V L S I i m p l e m e n t a t i o n . Such arrays are applicable to p r o b l e m s in signal processing, n u m e r i c a l c o m p u t ing, g r a p h theory a n d o t h e r areas [8]. G i v e n a n a p p l i c a t i o n described by a system of u n i f o r m r e c u r r e n c e e q u a t i o n s ( U R E ) , systolic arrays are derived w h e n using affine s p a c e - t i m e t r a n s f o r m a t i o n s ; a n affine schedule d e t e r m i n e s w h e n the c o m p u t a t i o n s are p e r f o r m e d a n d a n
Correspondence to: Vincent Van Dongen, Centre de Recherche Informatique de Montr6al, 3744 Jean Brillant, Suite 500, Montr6al, Qu6bec, H3T1P1 Canada * This work was done for Philips S.A. when the author was member of the Philips Research Laboratory in Belgium.
affine processor allocation w h e r e they are perf o r m e d [14,15]. This s p a c e - t i m e m a p p i n g techn i q u e is g e n e r a l i z e d in this p a p e r to derive small size arrays, i.e. arrays c o n t a i n i n g fewer cells t h a n the ones o b t a i n e d w h e n using affine m a p p i n g s . T h e p r o b l e m of realizing a u t o m a t i c a l l y small size arrays has two i m p o r t a n t applications: 1. Software compilation f o r general-purpose arrays. T h e p r o b l e m is e n c o u n t e r e d in the m a p p i n g of applications, here systems of U R E , o n fixed size g e n e r a l - p u r p o s e arrays, e.g. a n u m b e r of t r a n s p u t e r s [16] or the W a r p [1]. 2. Hardware compilation. Small size circuits may be r e q u i r e d d u e to area constraints. T h e c o m m o n a p p r o a c h for deriving small systolic arrays consists in two steps [4,5,13,6,7]. A n array is first derived by m e a n s of a n affine spacetime m a p p i n g . This array usually c o n t a i n s too m a n y processors. T h e schedule is t h e n slowed down so that less parallelism is achieved, a n d the
0376-5075/92/$05.00 © 1992 - Elsevier Science Publishers B.V. All rights reserved
V. VanDongen
350
Behavioraldescriptionusing URE
affine processor l all6cation affineschedule systolicarray
l ne~r'caffi,nerallocation affineschedule
circuit transformation
ognizing that such a description can be used in the synthesis of systolic arrays [14]. Definition 2.1. A system of uniform recurrences is a set of m recurrences of the form Z E-~"s-'~
OI(Z ) =fl(Oll(Z--
small size systolicarray
01,1) ,
O12(Z -- °0"2,1) . . . . .
OIII(Z--Oll,I)) z ~ -~s ~
Fig. 1. The common approach for deriving small systolic arrays consists in two steps. In this paper, small size arrays are derived directly from the behavioral description by means of a unique space-time mapping.
processor allocation is modified correspondingly. This second step is a particular circuit transformation. Our approach is different. We derive small size arrays directly from the behavioral description by means of a unique space-time mapping, as shown in Fig. 1. The advantage of this direct approach is that it can be optimized• Its drawback is that it only works with a particular partitioning strategy known as the Locally Sequential Globally Parallel (LSGP) partitioning scheme [4,13,10,7]. But this scheme is ideal when using a generalpurpose processor array as an array of transputers [16,3]. In the next section, systems of uniform recurrences are introduced. Then, in Section 3, nearaffine mappings are defined. In particular, it is shown that these mappings can be viewed as a set of affine transformations defined on lattices of the index space• In Section 4, constraints for dealing with these mappings are derived. In Section 5, we recall how an affine schedule can be optimized. In Section 6, we generalize the optimization method to deal with near-affine mappings• We summarize the complete methodology in Section 7, and we conclude in Section 8.
2. Uniform recurrence equations We follow the suggestion given by Karp et al. [9] of describing 'regular' algorithms (for which systolic arrays are suited) using uniform recurrence equations (URE). Quinton was first in rec-
0 2 ( z ) = f z ( O z t ( z - 01,2),
02 (z - 0 2 , 2 ) , - . . , 0212(Z -- 012,2) ) Z ~ ~s ~
Ore(Z) = L ( O m l (z -- Ol,m)
,''',Omtm(Z--Olm,m))
(1) where
Vu~[1, m],Vu~[1,
l,]-+Ou,v~Z e,
(2)
and -@s is a convex polyhedron of Z e parameterized with s (the number of operands of f~. is noted l,,). The vector z is the index; it is an integer vector of e entries. The range of admissible value for z is defined by its domain ~s. The vectors Ou.v are called the dependence vectors of (1), and Or(z) is called an instance of the variable Q . A system of recurrences only defines a set of inter-related computations that must be performed. It does not define any order of computation. The order in which the computations can be performed is going to be defined later on by means of a schedule. (See Section 3.) In addition to such a system, a set of initialization values are to be specified at the boundary of the domain ~s-
Example 2.1. Given a set of N weights w[1], w[2] .... , w[N], and an infinite sequence x[1] . . . . of input signals, a Finite Impulse Response (FIR) filter computes the infinite sequence y(1), y(2), ... of output signals out of the following formula: N
l
Y'~ w [ k ] × x [ i - k ] . k=l
(3)
Mapping uniform recurrences
It can be assumed that, when i - k < 1, x[i - k] = 0. This application (for F I R filtering) can be implemented with the following system of U R E :
i> l, l <_k _ l, l <_k l, l <_k l, l <_k
y(i, k) = y ( i , k - 1) +p(i, k) p(i, k) = w(i, k) ×x(i, k) w(i, k) = w ( i - 1, k) x(i, k) = x ( i - 1, k - 1).
(4)
351
The index domain is defined by the following constraints:
O
O
-w2
O
(7)
The domain is a polyhedron having 16 vertices. The size parameters w 1 and w 2 represent half band widths; A is of band width 2 • w~ + 1 and B is of band width 2 • w 2 + 1. The system of recurrences (6) must be initialized outside the index domain with C(i, j, k) = O, A(i, j, k) = ai, k and
B(i, j, k ) = b~,j.
The system is to be initialized as follows: i
--+y(i, k ) = 0
i
--+x(i, k) =x[i] --+x(i,k)=O --+ w(i, k) = w[k].
i=O, l <_k < N - 1 i = 0 , l <_k < N
Let us conclude this section by saying that writing systems of uniform recurrences for specific applications is a problem by itself, and it is outside the scope of this p a p e r to try solving it. Interested readers are invited to look at the specific literature such as [2].
The output sequence is given by
1 <_i, k = U - - + y ( i ) = y ( i , k). In (4), the index z is (i, k). Its domain 2 , is the semi-infinite polyhedron bounded by 1 _
Example 2.2. Given an N × N matrix A of band width wl and an N × N matrix B of band width w 2, the product C = A x B is a matrix of size N × N and of band width w 1 + w 2 - 1. The elements cio can be computed out of the following equation: ci,j= Y'~ai,k.bkj.
(5)
k
The band-matrix multiplication can be computed with the following system of U R E :
C(i, j, k) = C(i, j, k - 1) + A ( i , j, k) ×B(i, j, k) A(i, j, k) = A ( i , j - 1, k) B(i, j, k) = B ( i - 1, j, k).
(6)
3. Space-time mapping A specific circuit for performing the application defined by a system of the form (1) is defined by associating to each O,,(z) the m o m e n t (or time) and the place at which this computation will take place. Let t(z, v) denote the m o m e n t when O,.(z) will be executed; this function is called the schedule. For the place, we hereafter consider an architecture made of processors, each of which consists of a set of elementary operation units. So, the place at which the computations take place has a two-level address. At the first level, p(z, v) identifies a processor; we call p the processor allocation function. At the second level, q denotes the operation unit within p that will execute
O,(z). A circuit is then defined with a mapping from (z, v) to (t(z, v), p(z, v), q). Note that there are several conditions for a mapping to be valid. In particular, the mapping should be one-to-one so that two computations are not performed both at the same place and at the same time, the schedule must respect the del~endencies between the computations, and the range of p and q should be finite. A basic methodology for systolic array design makes use of an affine mapping of the form, i.e. one consisting of an affine schedule and an affine
352
v. Van Dongen
processor allocation [14,15]. An affine mapping is of the form:
In order to derive small size arrays, we suggest to generalize (8) and to use space-time mappings of the form
Z e + l --') z e + l : (Z, U)
Z e + l --> z e + l :
(:((;2 =
"Z+ U
,
(8)
where T ~ Z e, O/U~ Z, P ~ Z e - I X Z e, and /3~ Z e-1. Note that the dimension of p is e - 1. This means that each processor p is identified with an address of length e - 1. When e < 4, this address can be the coordinates of the Euclidian space of dimension e - 1; the resulting circuit is then an array of dimension e - 1. The ith component of p, which we note Pi, is defined with
p;(z, v) = P/.z + where Pi is the ith line of P, and [~v,i the ith entry of/3~.. In (8), both a , and fl~. are functions of v. Several design methodologies have a~, constant (independent on v) and null entries f o r / 3 , . Yet, as shown in [17], it is very useful, in particular for unidirectional array design, to allow av a n d / 3 , to be any functions of v.
0+
-.->
(Z, U)
't(z, v)
=T-z+a
p l ( z , v)
= (PI" z +/3va ) div d 1
p2(z, V)
= (P2" z +/3v,2) div d 2
fie_ I(Z, U)
= (Pe-l'Z+~v,e-1)
,q
div d e_ 1 =u
v
(9) where 'div' denotes the integer division, and d t No, Vl ~ [1, e - 1]. Both in (8) and in (9), q = v. Thus, each processor consists of the m functions f l , f 2 , . . . . fm of (1). Mappings with q=~v can be used for example to describe internal multiplexing. But this is outside the scope of this paper. So from now on, we will omit to mention that q = v whenever we give a space-time mapping. In matricial form, the mapping (9) becomes (when we omit q):
p(z, v)
( P ' z +/3~,) div d . ,
.~ y 2 - yl
wl ~ w2 ~.
........
x! ;xg
L
1control signal generator I
reset k.................
Fig. 2. LSGP partitioning: the circuit is a systolic array that uses some internal memory and a simple control mechanism.
Mapping uniform recurrences
where tip = (d l, d 2 , . . . , d e - 1) ~ N0e- 1 and 'div' represents the component wise integer division. By definition, the processor allocation p(z, v) of (10) is near-affine, a sub-class of quasi-affine mappings [17]. The schedule t(z, v) is affine. Note that when d 1 = d 2 = . . . = d e _ 1 = 1, the mapping (9) is affine. With affine mappings, the computations that are processed by the same processor are on a line perpendicular to P1, P2, etc. The hyperplane perpendicular to PI contains all the computations performed by the processors with the same lth component. With a near-affine allocation, the computations performed by the processors with the same lth component are on dt consecutive hyperplanes perpendicular to Pl. Thus, ' b a n d s ' of indexed points are m a p p e d onto the same processor. Within each band, the computations are performed in sequence by making use of the local m e m o r y of the processor. Example 2.1 (cont'd). With the mapping
353
rate of the array is 1 / R . Each processor contains one latch for y, and R + 1 latches for x. The number of switches is one for x, and one for y. Note that in the complete array, the total number of latches is N + M for x; it is proportional to N for any value of M. When R >_ N, the architecture is sequential. The processors communicate with their neighbouts once every R clock cycles only; they are loosely coupled, which is ideal when using a general-purpose fixed size array with expensive inter-processor communication. We now show that the near-affine mapping (10) transforms the system of U R E (1) into a finite set of systems of U R E defined on lattices of (t, p) ~ Z e. Let the rational approximant of (10) be: ~(z,v)
=
(P'z+fl,.)/dp]"
(12)
The mapping (10) can be written as: t(z, v) T'z+a,
P(Z,V))=((P'z+~,,-rp)/dv) p ( i , k, v)
1 k div
'
0
(11)
one maps the uniform recurrences (4) with N = 6 onto the circuit shown in Fig. 2. It works as follows. Each input to the combinational logic is a two-input switch. Every clock cycle, the switches change of input. This is achieved with a control signal whose value alternates from one to zero. The control signal can either be broadcasted or it can be pipelined through the array as shown in Fig. 2. The signal values can be generated on line by a simple circuit that initializes its values when the reset is on. This part of the circuit is called the control signal generator in Fig. 2. The rate of the circuit is one-half; every two clock cycles, a new input signal enters the array, and an output value is being produced. This can be easily generalized as follows. With the affine schedule
t(i, k, v) = R . i + k , and the processor allocation
with r p = ( P . z +/3,.) mod
dp.
(14)
The components of rp a r e ( r l , r 2 . . . . . re_l), and for all l ~ [ 1 , e - 1 ] , r / ~ [ 0 , d / - 1 ] . So, rp has d 1 × d 2 x • • • x d e_l distinct values. For each value of rp, the mapping (13) is affine over the lattice of z defined by (14). Thus, a near-affine mapping is a finite set of affine mappings defined over lattices of z. More precisely, it is a rational affine mapping followed by a translation that depends on a lattice of z; the rational approximant is followed by a perturbation function. The inverse of a near-affine mapping is a finite set of affine mappings defined over lattices of (t, p). The latter lattices can be computed as follows. The inverse of (13) is: 1 Z = -- "adJ a
(,)(
dp
×p-~,,+rp
p ( i , k, v) = k div R, where R > 2, one maps the uniform recurrences (4) onto an array of M = [ N / R ] processors. Every R clock cycles, a new input value enters the array, and a new output is produced; the I / O
× p-/3.
+ --" a adj P
(15)
~ Van Dongen
354
where a is the determinant of (~) and adj(~) denotes the adjoint matrix of (px). The index z is in Z e if and only if
3(al) and 3(bl) show the second lattice and the result of the associated transformation. The inverse of (11) is: 1
adj(T)
" (dp×;~:,+rp)m°d(a)
(16) For a fixed value of rp, (15) is an affine mapping and (16) defines a lattice of (t, p). More precisely, it is a rational affine transformation followed by a translation that depends on the lattice of (t, p).
Example 2.1 (cont'd). The mapping (11) can be rewritten as:
p ( i , k, v)
(k-r)/2
-1
t
= (:)"
'
where r = k mod 2. This mapping consists of two affine mappings defined on lattices of (i, k). Figures 3(aO) and 3(bO) show the first lattice and the result of the associated transformation. Figures
t= 2.i+k
'P
~=(k-O)/2 k-O) rood2 = 0
Thusp each uniform recurrence of (1) is transformed into a set of uniform recurrences, one for each lattice of (t, p). Furthermore, the domain 2 s is a convex polyhedron of Z e, and the image of 2 s becomes the union of polyhedrons defined on lattices of (t, p). Finally, let us consider the control that must be added to the circuit to take into account the changes in the connections upon the lattices of (t, p). It can be verified that any hyperplane z i = c, where zi is the ith component of z and c is some constant, becomes Li•
p-fl~ +rp
=c .a,
i= (t- 2.p-0)/2 k=2.p+0 (t- 2.p- 0) mod 2 =0
(bO)
(aO)
t= 2.i+k (k-l)
IP
mode=O/
p=(k-l)/2
(al)
The two lattices in (t, p) are defined by (t - 2. p - r) mod 2 =,0.
/
i=(t-2.p- 1)/2 k=2.p+l (t-2.p- 1) mod2=0
(bl)
Fig. 3. A near-affine mapping is a finite set of affine mappings: on the lattice shown in (a0), the first affine transformation gives the lattice shown in (b0), while on the lattice shown in (al), the second transformation gives the lattice shown in (bl). It can also be obtained with a rational affine transformation followed by a translation that depends on a lattice of z; the result of the rational transformation is shown in dash in (bl).
355
Mapping uniform recurrences
where L i is the ith line of adj(~). This hyperplane of (t, p) is exactly the one used to define the lattice in (16). Thus, a control signal can run along it to specify this lattice. At most e control signals will be required in the final array, one per axis.
Example 2 . 1 (cont'd). Consider the recurrence x(i, k ) = x ( ( i , k ) - (1, 1)) of (4). When k mod 2
= 0, (i, k) is in the first lattice and ((i, k) - (1, 1)) is in the second one. The application of the corresponding affine transformations gives: x ( 2 - i + k, k / 2 ) = x ( 2 - ( i - 1) + (k - 1), ( ( k 1) - 1)/2) which is equivalent to: x(t, p ) = x ( t - 3, p - 1). On the other hand, when k mod 2 = 1, we obtain: x(2.i+k, (k- 1)/2)=x(2.(i1)+(k1), (k - 1)/2) which is equivalent to: x(t, p) = x(t - 3, p). Figure 4 shows these two cases. Furthermore, the image of 2 s is the union of polyhedrons defined on lattices of (t, p). Consider the inequality i > 0 defining a boundary of the domain of (4). For the first lattice of (t, p), i = ( t - 2 -p - 0 ) / 2 , and the condition becomes: (t-2.p-O)/2>O. For the second lattice of (t, p), i = (t - 2 .p - 1)/2, and the condition becomes: ( t - 2 - p - 1 ) / 2 > 0. Figure 4 shows the two resulting boundaries in (t, p). For the control, the hyperplane i = 0 becomes t - 2 -p - r = 0. A control signal can run along it, i.e. along (t, p ) = (2, 1). (See Fig. 2.) The hyperplane k = 0 becomes 2 . p + r = 0. No control sig-
nal is needed along that plane since it does not define the lattices. (See Fig. 3.)
4. Constraints The L S G P partitioning scheme is achieved with an affine schedule. The basic constraints on the schedule are [14]: Vc ~
[1, m ] ,
Vu~[1,1,]~ T'O,,, +oL,-a,,,>_l,
(18)
V u ~ [1, m ] - ~ a ,
(19)
_> - m i n vk ~ 2,(T • Vk),
where the Vk'S are the vertices of 2 , . Furthermore, if the domain is semi-infinite along a ray R, the schedule is subject to the additional constraint: T . R >_ 1.
(20)
On the other hand, the constraint on the processor allocation, when the domain is semi-infinite along R, is VI~
[1, e-
1]
-~P,.R=O.
(21)
This must be satisfied with both affine and nearaffine processor allocations. Any two different instances of 0,. that are computed in parallel must be performed on distinct processors. In other words, the mapping (9) must be one-to-one. We now consider this problem. Let w denote the n u m b e r of d / s that are equal to one; w ~ [0, e - 1]. When w = 0, all d i s
k r }; i,;
(a)
(b)
Fig. 4. A near-affine mapping transforms a uniform recurrence into a finite set of uniform recurrences defined on lattices of (t, p). It transforms a linear constraint of ~s into a finite set of linear constraints in (t, p).
V. Van Dongen
356
are different from one, while when w = e - 1, the mapping is affine. In the following, let us assume that d I = d 2 = ... = d w = 1 and that dw+l, d w + 2 , . . . , de-1 are different from one. (One can always order the e - 1 axes of the processor allocation so that it is verified.) Furthermore, let gt denote the greatest common divisor of the components of Pt (i.e. gl = gcd(Pt)), and P[ = P J g t . Also, we define gt as gcd(T) and T ' as T ' = T / g t. Thus, T ' and P[ are integral vectors whose components are relatively prime. Let us summarize the results given in [17]. • The values gt, gl . . . . . gw have no effect on the condition for having a one-to-one mapping. • For the mapping (9) to be one-to-one, the a b s o l u t e v a l u e o f the d e t e r m i n a n t of (T', PI' . . . . . P ' , Pw+l,.. ",Pe-1) must be greater than I ] ~t ~dl . l " • When e = 2, the mapping (10) is one-to-one if and only if
Example 2.1 (cont'd). Consider the near-affine mapping (11) and its inverse (17). Here, A = (21). Its two lattices do not intersect if and only if the system
has no solution, which is the case since 1 = 2. k~ has no solution. Unidirectional arrays have specific properties. In particular, the so-called Locally Parallel Globally Sequential (LPGS) partitioning scheme is directly applicable on such arrays. Its advantage is to use the external memory instead of the local memory as with the L S G P scheme [10,12]. Unidirectional arrays can be automatically realized, by means of additional constraints. For example, the lth component of Cu,~, is positive if and only if l't • Ou,v +/3~,t - 13.,t >__O.
• When e > 3 and w = 1, a systematic method polynomial with e can be used to check that the mapping is one-to-one. Check that the mapping is one-to-one when only one value of d t is different from one. • In the general case, the mapping (10) is one-toone if and only if, Vv ~ [1, m], there exists no pair of points in some reduced domains ~ , that have the same image. Because a near-affine mapping is a set of affine transformations on lattices of z, another method for checking that it is one-to-one consists in verifying that the images of these lattices do not intersect. First, one can compute the nonempty lattices defined by (16), by solving systems of diophantine equations. Second, one can check that any pair of nonempty mappings do not intersect. This is simple to check since two lattices defined by (16) and characterized by r , = r and r , = r ' do not intersect if and only if the following system of diophantine equations of vector variable k is non-empty: A" ( r -
r') =k×a,
where A is the e × (e - 1) matrix formed with the e - 1 last columns of adj(~), and a is a vector whose e entries are equal to a.
5. Optimization of the schedule In a system of U R E , the index domain -~s is parameterized with s. Let .~ denote an instance of -~s when s has a given value. Given a finite dependence graph and an affine schedule, the latency, which will be noted At, is At = m a X z ~ ( t ( z ) )
- minz~(t(z))
+ 1
= max/~[l,h](T • V/) - min i ~[1,h](T • V/) + maxv~[1,m](av) -- minu~tl,ml(au) + 1 = AT+ maXv~[1,m](av)
- minu ~[1,ml(au)
with A T = maxj.~tl,hj(T • V/) - mini~tl,h1(T • V/) + 1. For large computation domains ~ , the minimization of AT leads to a minimal value of At. The problem of minimizing A T can be solved by means of integer programming. Let D i j be the set of all non-null vectors X such that Vz c ~
~ X - V ~ < X - z < X.Vj,
Mapping uniform recurrences
The later condition is equivalent to the following set of linear constraints: Va ~ [1, hi] ~ (Vio - Vi) " X > 0
where V,.° and Vjb denote the vertices of -~ connected respectively to Vi and to Vj. By definition of Oi,j we have that: VT ~ O , j ~ AT = T- (Vj - V/) + 1.
(23)
Thus, the optimization can be achieved for any possible pair of vertices (Vi, Vj) of .~, and the global minimum will simply be the minimum of the local solutions. Example 2.2 (cont'd). Assume that all A,~'s are one. It can be shown that the only pair of vertices whose Dij gives a solution is (V/,V~)= ((0, 0, 0 ) , ( N - l , N-l, N-l)). The optimal schedule is found by minimizing T - ( 1 , 1, 1) under the constraints T- (1, 0, 0) >_ 1, T- (0, 1, 0) _> 1, and T . (0, 0, 1) _> 1. The solution is given by:
t(i, j, k) = i + j + k +a~.,
(24)
with a A =O/B= 0 and O/c = 1. The associated number of time steps is 3 • (N - 1) + 1. Instead of looking for T that minimizes AT, one may want to find T that minimizes some other cost function. This can easily be achieved when the solution space of T is bounded. In that case, one can compute all possible solutions, and find the one that minimizes the cost. The solution space of T can be bounded by imposing AT to be in a given range. Let [3T ~, AT ~'] denote this range. Because of (23), any slope T of Did has its AT in that range if and only if: A T " - 1 < T . ( V i - V i ) < A T " - 1.
The same technique is also applicable to infinite computation domains. Yet, in that case, the quantity maxz~(t(z))
V b ~ [1, hi] ~ ( V j - Vj~) • X > 0 '
357
- min~2(t(z))
+ 1
is always infinite. Hence, it cannot be used for minimizing the schedule. Given an infinite domain .~ of vertices V1, V2,...,Vh, and of ray R, we define 2 ' as the finite convex domain of vertices Vk and V h + ~ = V ~ + R , k ~ [ 1 , h]. The edges of 2 ' are the edges of .~, plus the edges between Yk and Vh+k, plus the edges between V~+~ and Yh+j whenever there is an edge between Vi and Vj. The value of At can be defined on this restricted domain as
At = m a X z ~ , ( t ( z ) ) - m i n ~ , ( t ( z ) )
+ 1.
According to this definition, At is now a finite quantity which can be used in the comparison of different schedules of infinite dependence graphs. Example 2.1 (cont'd). The restricted domain ~ ' is here a rectangle of vertices (1, 1), (1, N), (2, N) and (2, 1). Let us assume that all ~l,,,'s are one. The only pair of vertices which lead to a valid schedule is (V/, Vj)= ((1, 1), (2, N)). The schedule that minimizes AT is found by minimizing T- (1, N - 1) under the constraints T. (1, 0) >_ 1, T-(0, 1)> 1, and T.(1, 1)>_ 1. The solution is given by:
t(i, k) =i +k +a~:, with a x = O / w = - 2 , a p = - l a n d
a~.=0.
6. Processor allocation optimization Let us define AP t as follows:
AP~= [ maxz~2(Pt'z) - m i n z ~ ( P " z )
+l]
(25)
(27)
Example 2.2 (cont'd). Let us for example compute all the slopes T whose AT ~ [30,40], when N = 8. The set of solutions is:
When the components of Pt are relatively prime, '~Pt represents the number of cells along l, for each value of v. Else, it is a good approximation of the number of cells. The value of Pt that minimizes
{(1, 1, 3), (1, 2, 2), (1, 3, 1), (2, 1, 2), (2, 2, 1), (3, 1, 1)}.
(26)
maxz~ ~(Pl. z ) - m i n z ~ ( P / , z ) + 1
(28)
~ Van Dongen
358
clearly minimizes (27). Thus, when d t is fixed, the optimization of A P t becomes similar to the one of A T . For a given Dij, we have: V P t ~ D i , j --->m a x z ~ ( P t -
z) - m i n z ~ ( P l ,
z) + 1
= Pt" ( V j - V,) + 1. This later quantity can be minimized by means of integer programming. Example 2.2 (cont'd). Assume again that s = (N, wl, W 2) = (8, 2, 3). In that case, 2 has 16 vertices as shown in Fig. 4. Consider the pair (V/, Vj) = ((0, w I + w2, Wl), (w 1 + w2, 0, We)). The associated integer programming problem is to minimize ( W 1 + W2, - - W 1 -- W2, W 2 -
no other pair of vertices can yield to a better solution. (There is only one equivalent solution, which is ( - 1, 0, 1)~dr.)
W1 ) • P/
It is also possible to find all the slopes Pt such that AP l is in a given range. Let [APt", Apf] denote this range. It can be verified that any slope Pt ~ Di,~ has its associated A P t in that range if and only if: d, "API" - d l < Pt" (Vy - Vi) < d l • DP~' - 1.
(29)
This is direct generalization of what was done on the schedule, and an application of: [a/b] >c ~a
>b.(c-
[a/b]
1) + 1
The constraint (29) bounds the solution space of
v,.
= (5, - 5 , 1) .P~, under the constraints
(1, 1, 1 ) - P t > O,
(0, - 1 , O) "Pt_> O,
Example 2.2 (cont'd). Consider the problem of finding all the slopes Pl such that A P t is in [5,10] and d t = l . When fixing (Vi, Vj) to ((0, w 1+ w 2, Wl), (w I + w 2, 0, w2)), the set of solutions is:
(0,-1,-1)-Pt_>
0,
{(1, 0, - 1 ) , (0, - 1 , 1), (2, 0, - 2 ) } .
(1, 0, 0 ) " P t > 0,
(1, 0, 1) " P t > 0.
(5, - 5 , 1) -P t>_ 1(to avoid PI= 0),
(-1,
- 1 , - 1 ) "P/_> 0,
The optimal slope is Pt = (1, 0, - 1). The associated A P t is 2 . w I + 1 = 5. The optimal rational solution is therefore P t / d t = (1, 0, - 1 ) / d t. It can be shown that this solution is a global optimum;
Constraints on T range for
AT Sol. for T
Constraints on Pl
When considering all pairs of vertices, the complete set of solutions for Pl is: {+ (1,0,-1),
+(0,-1,
+ ( 1 , 0, 0), + (0, 1, 0), + (0, 0, 1)}.
...
Constraints on
Pc-I
/1
range for ~'P~-I / ~ - l
Sol. for PI
Sol. for Pe-I
range for
AP1
1), _ + ( 2 , 0 , - 2 ) ,
One-to-one mappings
cost
Optimal solutions
Fig. 5. The main structure of the adopted methodology.
(30)
359
Mapping uniform recurrences
7. Design methodolo~' The problem is to find a mapping of the form (9) that verifies a particular number of design constraints. As explained in the previous sections, a set of slopes T that verifies a number of constraints and that minimizes AT can be found by means of integer programming. Also, the set of slopes whose AT is in a given range can be found with a similar technique. The same applies on each axis of the processor allocation separately (for each Pl). Once a set of solutions has been found for T and for each Pt, the mappings of the form (9) can be found by simply combining the different solutions. For each combination, one can verify that the mapping is one-to-one, as explained in Section 4; only the one-to-one mappings should be kept. If no compatible solution exists, one can either modify the value of one dt or modify one range of values, and restart the process. The search for one-to-one mappings is clearly an iterative process. Finally, a cost function can be evaluated on every one-to-one mapping to select the optimal solutions. The methodology is summarized in Fig. 5. It has been implemented with success in the design tool Presage [18]. Example 2.2 (cont'd). The minimal value of AT is 22. It is achieved with T = (1, 1, 1). When d l = d 2 = 1, the minimal value of A P l is 5; p~ = _+(1, 0 , 1). Yet, the mapping such that T = (1, 1, 1), Pl = (1, 0, - 1) and P2 = ( - 1, 0, 1) is not one-to-one. (The corresponding value of a' is zero.) A first solution consists in relaxing the admissible range for AP 2. For example, if this range is [5,10], the set of solutions for P2 is given by (30). Out of all combinations, four mappings only are one-to-one; they are all equivalent and given by
d 2 = 1; each processor works once every three steps only. An implementation that requires a third of the cells can be achieved with d 1 = 3, e.g. with the mapping:
T P1 P2
i+j+k+a, =
(1,0, -1).(i,
j, k ) div3 (0,1,-1)'(i,j,k)
It can be verified that this mapping is indeed one-to-one. When w 1 = 2 and w 2 = 3, the number of cells is ( ( 2 . 2 + 1) div 3) × ( 2 . 3 + 1). Example 2.2 (cont'd). In the previous example, an array of 2 × 7 cells is used. Assume that the problem is to be mapped on an array of size 8. One can either use an array of size 2 x 4 or a linear (i.e. one-D) array. In the first case, we can keep d 1 = 3 and pl(i, j, k ) = (1, 0, - 1 ) - ( i , j, k) which give A P 1 =2. We can use d 2 = 2 and p 2 ( i , j , k ) = ( O , 1, - 1 ) . ( i , j , k) d i v 2 to obtain A P 2 = 4. The schedule that minimizes the latency can then be found as follows. First, one can try with the minimal schedule (24), but the corresponding mapping is not oneto-one. One can then use a range for AT. The minimal value being 8 . 3 - 2 = 22, one can first try with the range [23,30]. It can be shown that the associated set of solutions is: {(I, I, 2), (1, 2, 1), (2, 1, 1)}. But still, none of these solutions yield to a one-toone mapping. One can then try with the range [31,40]; the set of solutions for T is given by (26). Still, no schedule gives a one-to-one mapping. One can then try with the range [41,45] whose set of solutions is: {(I, 1, 4), (1, 2, 3), (2, 1, 3), (1, 3, 2), (2, 2, 2), (3, 1, 2), (1, 4, 1), (2, 3, 1), (3, 2, 1), (4, 1, 1)}.
P, P2 with
=
+_(1,0, - 1 ) . ( i , j , k ) + ( 0 , 1, 1) (i, j,
It can be verified that any slope of T in the set:
,
{(4, 1, 1), (1, 3, 2), (2, 1, 3), (2, 3, 1)}
k)
a A = o<8 = 0 a n d a c = 1. A l l
solutions
lead
to the same well-known systolic array presented by Kung and Leiserson in [11]. Example 2.2 (cont'd). The value of a' corresponding to the previous solution is 3, while d I •
gives a one-to-one mapping. For a linear array, one can use d 1 = 5 and d 2 = 1. In that case, Pl = (1, 0, - 1 ) . (i, j, k) div 5 P2 = ( i , j , k ) " (0, 1, - 1).
(31)
360
V. Van Dongen
The same iterative process can be used to find a compatible schedule. It can be verified that any slope T in (26) yields to a one-to-one mapping.
with the latter technique, while it can when doing the space-time mapping in one pass. Most of the theory presented in this p a p e r has been implemented with success in the tool n a m e d 'Presage' [18]. In fact, the illustrative mappings were derived with its use.
8. Conclusion Given a regular application described by a system of uniform recurrence equations, systolic arrays are commonly derived by means of an affine space-time mapping. In this paper, we generalized the associated methodology to design small size arrays, by using an affine schedule and a near-affine processor allocation, a sub-class of quasi-affine mappings [17]. We showed that a near-affine processor allocation that uses a given n u m b e r of processors can be automatically derived; the optimization method is a direct extension to the one using affine mappings. In the proposed approach, the schedule and each component of the processor allocation are found independently. Ranges for the n u m b e r of time steps and for the number of processors are given by the user to delimit each solution space. Compatible solutions, i.e. one-to-one mappings, are then found by exhaustive search. A cost function finally selects the optimal solutions. The associated partitioning method is the well-known Locally Sequential Globally Parallel scheme [13,7,10]. This method requires large local memories for the processing elements. O t h e r partitioning techniques are known, buth these cannot be found directly with our space-time mapping technique. Yet the advantage of our approach c o m p a r e d to the common one where circuit transformations are applied on systolic arrays derived with affine mappings is that optimal solutions can be found automatically. As a particular case, we can optimally map e-D (i.e. e-dimensional) recurrences onto E-D arrays, where E is any value between 0 and e - 1. One simply needs to fix the n u m b e r of coordinates to 1 along e - E 1 axes of the processor allocation, and find a valid near-affine mapping. This method is to be compared with the multiprojection technique introduced in [19], which consists in applying an affine mapping k times to reduce the dimension of the array to e - k. Again, an optimal global solution can hardly be found
References [1] M. Annaratone, E. Arnould, T. Gross, H.T. Kung, M. Lam, O. Menzilcioglu and J.A. Webb, The warp computer: Architecture, implementation, and performance, IEEE Trans. Comput. C-36(12) (Dec. 1987) 1523-1538. [2] J. Bu and E. Deprettere, Converting sequential iterative algorithms to recurrent equations for automatic design of systolic arrays, in: Proc. ICASSP'88 (IEEE, 1988). [3] H. Brains, Adaptation du logiciel presage h la g6n6ration de r6seaux systoliques impl6mentables sur un r6seau de transputers, Technical Report RR 90-23, Universit6 Catholique de Louvain, Novembre 1990. [4] J. Bu, Systematic design of regular VLSI processor arrays, PhD thesis, Delft University of Technology, May 1990. [5] P. Clauss, Synth~se d'algorithmes systoliques et implantation optimale en place sur r6seaux de processeurs synchrones, PhD thesis, Universit6 de Franche-Comt6, 1990. [6] J.-M. Delosme and I.C.F. Ipsen, Sage and condense: A two-phase approach for the implementation of recurrence equations on multiprocessors architectures, in: L.W. Hoevel, ed., 21st Annual Hawai Internat. Conf. on System Sciences (1988) 126-130. [7] M. Garcia and J. Navarro, Systematic hardware adaptation of systolic algorithms, in: 16th Annual Internat. Syrup. on Comp. Architecture (Computer Society Press, Rockville, MD, 1989) 96-104. [8] H.T. Kung and C.E. Leiserson, Systolic arrays (for VLSI), in: Sparse Matrix Proc. 1978 (Society for Industrial and Applied Mathematics, 1978) 256-282. [9] R.M. Karp, R.E. Miller and S. Winograd, The organization of computations for uniform recurrence equations, J. A C M (1967). [10] S.Y. Kung, VLSI array processors, Signal and Image Processing Institute, 1987. [11] C. Mead and L. Conway, Introduction to VLSI Systems, ch. 8, Highly Concurrent Systems (Addison-Wesley Series in Computer Science, Reading, MA, 1980) 263-332. [12] D.I. Moldovan, On the design of algorithms for VLSI systolic arrays, IEEE Proc. (1983). [13] H. Nelis and E. Deprettere, Automatic design and partitioning of systolic/wavefront arrays for vlsi, Circuits Systems Signal Processing 7(2) (1988) 235-252. [14] P. Quinton, Automatic synthesis of systolic arrays from uniform recurrent equations, in: Proc. IEEE l l t h Internat. Syrup. on Computer Architecture (1984). [15] S.K. Rao, Regular iterative algorithms and their implementations on processor arrays, PhD thesis, Information Systems Lab., Stanford University, 1985.
Mapping uniform recurrences [16] The transputer databook, 2nd ed. Inmos, Bristol, 1989. [17] V. Van Dongen, From systolic to periodic array design, PhD thesis, Universit6 Catholique de Louvain, January 1991. [18] V. Van Dongen and M. Petit, Presage: A tool for the parallelization of nested loop programs, in: L. Claesen, ed., Formal VLSI Specification and Synthesis (VLSI Design Methods - I) (North-Holland, Amsterdam, 1990) 341-359. [19] Y. Wong and J.M. Delosme, Optimal systolic implementations of n-dimensional recurrences, in: IEEE Internat. Conf. on Computer Design: VLSI in Computers (Oct. 7-10 1985) 618-621.
361
Vincent 'Van Don~m was horn in Brussels, Belgium, in 1962. He obtained the degree of Engineer in Computer Science at the University of Louvain (Belgium) in 1985, and a Doctorat en Sciences Appliqu~es in 1991 at the same university. From 1985 to 1991, he was a member of the scientific staff at PRLB (Philips Research Laboratory Belgium). He is now chief researcher in parallel architectures at CRIM (Centre de Recherche Informatique de Montreal, Canada). His interests include parallel architectures, software and hardware compilation techniques, and portable parallel programming environments.