Parallel Computing 16 (1990) 351-359 North-Holland
Implementing Gaussian elimination on a matrix-matrix multiplication systolic array

Tanguy RISSET
Laboratoire LIP-IMAG, Ecole Normale Supérieure de Lyon, 46 allée d'Italie, 69364 Lyon Cedex 07, France

Received 29 June 1990
Abstract. We show that any systolic array dedicated to matrix-matrix multiplication can also execute Gaussian elimination.
Keywords. Matrix multiplication, Gaussian elimination, Systolic array.
1. Introduction
Most matrix algorithms are composed of nested 'for' loops. When mapping a matrix algorithm onto a systolic array, we give an architecture and a program for the cells. For each algorithm, the architecture is different. It would be interesting to have more parametrizable architectures, for which only the input and the cell programs would determine the algorithm executed by the array. Intuitively, many matrix algorithms are built upon the same skeleton. Some work has already been done to unify different algorithms on the same architecture [9], but this research domain is not fully explored yet.

Gaussian elimination has been implemented on various 2-D systolic arrays [1,6] and recently on a modular linear array [2]. On the other hand, an impressive number of systolic arrays [7,10,8,4,5,3,12,13] have been proposed for the matrix multiplication problem. In this note, we further explain the similarity that has been observed between the arrays implementing Gaussian elimination and matrix-matrix multiplication respectively [7]. We prove that any array capable of executing the matrix-matrix multiplication C = AB can be re-programmed so as to perform Gaussian elimination on A. Note that we modify neither the flow of the variables in the array nor the input/output characteristics; we simply change the program of the cells. Our result holds for any matrix-matrix array that respects the standard scheduling scheme based upon the usual cubic dependence graph [10,4]. In particular, most of the published arrays which use only a single copy of the input coefficients satisfy this condition and can therefore be re-programmed so as to perform Gaussian elimination.

In Section 2, we specify our hypothesis on the initial matrix-matrix array. In Section 3, we derive the new program of the cells for Gaussian elimination. In Section 4, we prove the correctness of our construction.

0167-8191/90/$03.50 © 1990 - Elsevier Science Publishers B.V. (North-Holland)
2. Original matrix-matrix array

Consider two n × n dense matrices A and B, and let C = AB. We have

c_{ij} = Σ_{k=1}^{n} a_{ik} · b_{kj},  1 ≤ i ≤ n, 1 ≤ j ≤ n.
A natural way to serialize this accumulation leads to the following system of equations:
Input equations
A(i, j, k) = a_{ik},  1 ≤ i ≤ n, 1 ≤ k ≤ n, j = 0
B(i, j, k) = b_{kj},  1 ≤ j ≤ n, 1 ≤ k ≤ n, i = 0
C(i, j, k) = c_{ij},  1 ≤ i ≤ n, 1 ≤ j ≤ n, k = 0

Computation equations (1 ≤ i ≤ n, 1 ≤ j ≤ n, 1 ≤ k ≤ n)
C(i, j, k) = C(i, j, k−1) + A(i, j−1, k) · B(i−1, j, k)
A(i, j, k) = A(i, j−1, k)
B(i, j, k) = B(i−1, j, k)

Output equations
c_{ij} = C(i, j, k),  1 ≤ i ≤ n, 1 ≤ j ≤ n, k = n
The dependence graph is the domain D_n = {(i, j, k); 1 ≤ i ≤ n, 1 ≤ j ≤ n, 1 ≤ k ≤ n} of Z³. Such a system of recurrent equations is said to be uniform, since the computation at node (i, j, k) depends only on values computed at points obtained by a translation which does not depend upon the values of i, j and k. The dependence graph G is shown in Fig. 1 for n = 3. Since the recurrent equations are uniform, we have a finite set of dependence vectors Θ = {θ_a, θ_b, θ_c} that defines where node (i, j, k) has to take its input values. For instance θ_a = (0, 1, 0), because node (i, j, k) reads its A-input from node (i, j−1, k). Similarly, θ_b = (1, 0, 0) and θ_c = (0, 0, 1). We call these vectors dependence vectors because the computation at vertex (i, j, k) depends upon the result of the computation at the vertices (i, j, k) − θ, θ ∈ Θ. Our only hypothesis on the original matrix-matrix array is that its scheduling is compatible with these dependence vectors. The program of the cells is shown in Fig. 2.
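For illustration, the system of recurrent equations above can be evaluated by a sequential sweep of D_n. The following sketch is not part of the original paper; it takes the C-inputs to be zero, and any schedule compatible with the dependence vectors gives the same result:

```python
import numpy as np

def uniform_matmul(A, B):
    """Evaluate the uniform recurrence for C = AB over the cubic domain
    D_n (1-based indices; index 0 carries the input equations)."""
    n = A.shape[0]
    Av, Bv, Cv = {}, {}, {}
    for i in range(1, n + 1):
        for k in range(1, n + 1):
            Av[i, 0, k] = A[i - 1, k - 1]   # A(i, 0, k) = a_ik
            Bv[0, i, k] = B[k - 1, i - 1]   # B(0, j, k) = b_kj
        for j in range(1, n + 1):
            Cv[i, j, 0] = 0.0               # zero C-inputs
    for k in range(1, n + 1):               # one valid schedule: any order
        for i in range(1, n + 1):           # compatible with the dependence
            for j in range(1, n + 1):       # vectors works
                Av[i, j, k] = Av[i, j - 1, k]
                Bv[i, j, k] = Bv[i - 1, j, k]
                Cv[i, j, k] = Cv[i, j, k - 1] + Av[i, j - 1, k] * Bv[i - 1, j, k]
    return np.array([[Cv[i, j, n] for j in range(1, n + 1)]
                     for i in range(1, n + 1)])
```

The triple loop is only one admissible schedule; a systolic array evaluates many nodes of D_n in parallel under a timing function compatible with the same dependence vectors.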
Fig. 1. Dependence graph for the matrix-matrix product.
a_out := a_in; b_out := b_in;
if ctrl_in = valid_operation then c_out := c_in + a_in * b_in;

Fig. 2. The program of the matrix-matrix product array cells.
Fig. 2 should not be misleading and is only for the sake of illustration: we do not assume unidirectional flow, for instance. The original algorithm is assumed to be correct. Whenever a given cell operates at a given time-step (under some control), it updates c_{ij} into c_{ij} := c_{ij} + a_{ik} * b_{kj} for some triple (i, j, k) in D_n. We modify this operation so as to perform Gaussian elimination.
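The cell program of Fig. 2 amounts to a pure function of the cell's inputs. A minimal sketch (the string 'valid_operation' is merely an illustrative encoding of the control signal):

```python
def mm_cell(a_in, b_in, c_in, ctrl_in):
    """One step of a matrix-product cell (Fig. 2): pass a and b through,
    and accumulate into c only when the control signal enables it."""
    a_out, b_out = a_in, b_in
    c_out = c_in + a_in * b_in if ctrl_in == 'valid_operation' else c_in
    return a_out, b_out, c_out
```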
3. Transformation of the program of the cells

Initially, we feed in matrix A together with undetermined (nil) values for matrix B. C-coefficients are also assumed to have the value nil, whenever they are input to the array or computed inside it. First of all, and only for the sake of introducing our final solution, we assume that we can determine which node (i, j, k) of D_n is being computed by any cell at any time-step. Of course, the cells do not have the knowledge of the indices of their A, B and C inputs. In a second step, we show how to generate the appropriate control for determining the operation of the cells. We know that the original program satisfies some constraints on the ordering of the computation, and this information will enable us to retrieve on-the-fly the knowledge of the indices being computed.
3.1 Case analysis

The uniformized algorithm for Gaussian elimination is:

Input equations
1 ≤ i ≤ n, 1 ≤ j ≤ n, k = 0 → A(i, j, k) = a_{ij}

Computation equations
1 ≤ k ≤ n, j = k, k+1 ≤ i ≤ n → A(i, j, k) = A(i, j, k−1) / U(i, j, k)
1 ≤ k ≤ n, i = k, k ≤ j ≤ n → A(i, j, k) = A(i, j, k−1)
1 ≤ k ≤ n, k+1 ≤ i ≤ n, k+1 ≤ j ≤ n → L(i, j, k) = L(i, j−1, k)
1 ≤ k ≤ n, j = k, k+1 ≤ i ≤ n → L(i, j, k) = A(i, k, k)
1 ≤ k ≤ n, k ≤ j ≤ n, k+1 ≤ i ≤ n → U(i, j, k) = U(i−1, j, k)
1 ≤ k ≤ n, i = k, k ≤ j ≤ n → U(i, j, k) = A(k, j, k)
1 ≤ k ≤ n, k+1 ≤ i ≤ n, k+1 ≤ j ≤ n → A(i, j, k) = A(i, j, k−1) − L(i, j, k) · U(i, j, k)

Output equations
1 ≤ k ≤ n, k+1 ≤ i ≤ n → l_{ik} = A(i, k, k)
1 ≤ k ≤ n, k ≤ j ≤ n → u_{kj} = A(k, j, k)
See [11] for the notion of uniformization. The computation domain is D_n = {(i, j, k); 1 ≤ k ≤ n, k ≤ i ≤ n, k ≤ j ≤ n}; the computations at indices with k > j or k > i do not modify any data. We replace the original program by the following:

if control_in = valid_operation then
  let (i, j, k) be the index of the node being computed
  if (i = j = k) then b_{k,j} := −1/a_{i,k};
  if (i = j and k > i) then b_{k,i} := a_{i,k};
  if (i > j and j = k) then c_{i,j} := a_{i,k} * b_{k,j};
  if (i > j and k > j) then a_{i,k} := a_{i,k} + c_{i,j} * b_{k,j};
Remarks:
- b_{i,j} is modified only once: at index (j, i, i).
- c_{i,j} is modified only once: at index (i, j, j).
- a_{i,j} is modified at all indices of the form (i, l, j) with l < i and l < j.
- During the computation of the elimination factor (index (i, j, j), i > j), we could have copied the value of c_{i,j} into a_{i,j} so as to compute the LU decomposition instead of simply triangularizing the matrix A (a_{i,j}, i > j, is not modified in the rest of the algorithm).
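Assuming, as in this first step, that each cell knows the index (i, j, k) it is computing, the case analysis can be emulated sequentially by sweeping the cubic domain in any order compatible with the dependence vectors. A sketch (0-based indices; not the paper's notation):

```python
import numpy as np

def gauss_case_analysis(A):
    """Sweep the cubic domain in an order compatible with the dependence
    vectors, applying the case analysis of Section 3.1 (0-based indices)."""
    n = A.shape[0]
    a = A.astype(float).copy()   # a[i, k]: coefficient carried by the A-flow
    b = np.zeros((n, n))         # b[k, j]: coefficient carried by the B-flow
    c = np.zeros((n, n))         # c[i, j]: coefficient carried by the C-flow
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if i == j == k:
                    b[k, j] = -1.0 / a[i, k]       # pivot
                elif i == j and k > i:
                    b[k, j] = a[i, k]              # copy of the pivot row
                elif i > j and j == k:
                    c[i, j] = a[i, k] * b[k, j]    # elimination factor
                elif i > j and k > j:
                    a[i, k] += c[i, j] * b[k, j]   # a_ik -= (a_ij/a_jj) * a_jk
    return a
```

On exit, the upper triangle of a holds the triangularized matrix; as noted in the remarks, the sub-diagonal entries keep their last updated values rather than being set to 0.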
3.2 New program

The actual program that will be executed by the cells is:

IF ctrl_a = P-r THEN
  b := a;            /* copy pivot-row element into b */
  ctrl_a := nil;     /* ctrl_a denotes the control signal received (or sent) with the coefficient of matrix A */
  ctrl_b := (Piv, 0);
IF ctrl_a = Piv THEN
  b := −1/a;         /* beginning of the computation of the pivot */
  ctrl_b := (Piv, 1);
  ctrl_a := nil;
IF ctrl_a = P-c and ctrl_b = (Piv, s2) THEN
  c := a * b;        /* end of the computation of the pivot */
  ctrl_b := (Piv, 0);
  ctrl_a := nil;
  IF s2 = 1 THEN ctrl_c := (Piv, 3) ELSE ctrl_c := (Piv, 1);
IF ctrl_a = Upd and ctrl_c = (Piv, s2) THEN
  a := a + b * c;    /* computation of the loop body and setting of the signals on A for the next stage */
  CASE s2 OF
    3: ctrl_a := Piv;  ctrl_c := (Piv, 2);
Fig. 3. The uniformized dependence graph of Gaussian elimination for n = 3.
    2: ctrl_a := P-r;  ctrl_c := (Piv, 2);
    1: ctrl_a := P-c;  ctrl_c := (Piv, 0);
    0: ctrl_a := Upd;  ctrl_c := (Piv, 0);

We define the configuration c(k) of control signals set on the coefficients of matrix A (Fig. 4). We will execute this program on the cells of the matrix-matrix product array, in which we feed the matrix A together with the configuration of control signals c(1). Signal Piv stands for pivot: it indicates that the coefficient a_{k,k} which accompanies Piv will be used to compute the pivot. Signal P-c indicates that this coefficient of A (a_{i,k}) is needed to compute the pivot. Signal P-r indicates that this coefficient of A is on the same row as a_{k,k}; it allows us to update the coefficient of matrix B by performing b_{i,k} := a_{k,i}. This multiplexing allows us to realize the meeting of the coefficients of the AAᵗ product while inputting only matrix A into the array. Signal Upd indicates that the coefficient of matrix A must be updated: a_{i,j} := a_{i,j} − (a_{i,k}/a_{k,k}) · a_{k,j}.
Fig. 4. Configuration c(k): rows 1 to k−1 carry nil everywhere; row k carries Piv in column k and P-r in columns k+1 to n; rows k+1 to n carry nil in columns 1 to k−1, P-c in column k, and Upd in columns k+1 to n.
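To make the control mechanism concrete, here is a sequential emulation (a sketch with 0-based indices, so configuration(n, k−1) corresponds to the text's c(k)): matrix A is fed with configuration c(1), the nodes are visited in an order compatible with the dependence vectors, each node applies the cell program of Section 3.2, and a control word (Piv, s2) is encoded simply by the integer s2:

```python
from fractions import Fraction

def configuration(n, k):
    """Signals c(k+1) on matrix A, reconstructed from Fig. 4 (0-based k)."""
    sig = [[None] * n for _ in range(n)]   # None encodes nil
    sig[k][k] = 'Piv'
    for j in range(k + 1, n):
        sig[k][j] = 'P-r'
    for i in range(k + 1, n):
        sig[i][k] = 'P-c'
        for j in range(k + 1, n):
            sig[i][j] = 'Upd'
    return sig

def eliminate(A):
    """Emulate the re-programmed array: A enters with configuration c(1);
    each node (i, j, k) applies the cell program of Section 3.2."""
    n = len(A)
    a = [[Fraction(x) for x in row] for row in A]   # A-flow registers a[i][k]
    b = [[Fraction(0)] * n for _ in range(n)]       # B-flow registers b[k][j]
    c = [[Fraction(0)] * n for _ in range(n)]       # C-flow registers c[i][j]
    ctrl_a = configuration(n, 0)                    # feed A with c(1)
    ctrl_b = [[None] * n for _ in range(n)]         # s2 values, None = nil
    ctrl_c = [[None] * n for _ in range(n)]
    for i in range(n):                              # any order compatible with
        for j in range(n):                          # the dependence vectors
            for k in range(n):
                if ctrl_a[i][k] == 'P-r':
                    b[k][j] = a[i][k]               # copy pivot-row element
                    ctrl_a[i][k] = None; ctrl_b[k][j] = 0
                elif ctrl_a[i][k] == 'Piv':
                    b[k][j] = -1 / a[i][k]          # pivot
                    ctrl_a[i][k] = None; ctrl_b[k][j] = 1
                elif ctrl_a[i][k] == 'P-c' and ctrl_b[k][j] is not None:
                    s2 = ctrl_b[k][j]
                    c[i][j] = a[i][k] * b[k][j]     # elimination factor
                    ctrl_b[k][j] = 0; ctrl_a[i][k] = None
                    ctrl_c[i][j] = 3 if s2 == 1 else 1
                elif ctrl_a[i][k] == 'Upd' and ctrl_c[i][j] is not None:
                    s2 = ctrl_c[i][j]
                    a[i][k] += b[k][j] * c[i][j]    # loop body
                    ctrl_a[i][k] = {3: 'Piv', 2: 'P-r', 1: 'P-c', 0: 'Upd'}[s2]
                    ctrl_c[i][j] = 2 if s2 >= 2 else 0
    return a
```

A real array runs many of these nodes in parallel; the sweep above is just one schedule compatible with the dependence vectors, which, by the correctness argument of Section 4, is all that matters.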
Remark. It is easy to find a way of setting control signals on matrices B and C so that the cells can perform the elimination as described before. But in our transformation matrix C is not initialised, and in some arrays computing the matrix-matrix product there is no way to initialise matrix C. In our setting, only matrix A is initialised with the control signals.

Notation. We will write (i1, j1, k1) < (i2, j2, k2) if index (i1, j1, k1) is executed before index (i2, j2, k2). We will write a_{i,j}^{(k)} for the value of a_{i,j} after k steps (of the innermost loop) of the Gaussian algorithm:

a_{i,j}^{(k)} = a_{i,j}^{(0)} − Σ_{m=1}^{min(k, i−1, j)} (a_{i,m}^{(m−1)} / a_{m,m}^{(m−1)}) · a_{m,j}^{(m−1)}

(a_{i,j}^{(0)} is the initial value of a_{i,j}).
It is easy to see that a_{i,j}^{(k)} = a_{i,j}^{(j)} if i > j and k ≥ j, and that a_{i,j}^{(k)} = a_{i,j}^{(i−1)} if j ≥ i and k ≥ i−1. We will call 'phase k' the set of indices whose second coordinate equals k, i.e. of the form (i, k, j) with i, j ≥ k, where k is fixed.
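The closed-form expression for a_{i,j}^{(k)} can be checked with exact rational arithmetic; a small sketch (naive recursion, adequate for small n):

```python
from fractions import Fraction

def a_phase(A, k):
    """a_{i,j}^{(k)} from the closed-form sum in the text
    (1-based i, j, k; exact rational arithmetic)."""
    n = len(A)
    M = [[Fraction(x) for x in row] for row in A]
    def val(i, j, kk):
        s = M[i - 1][j - 1]                      # a_{i,j}^{(0)}
        for m in range(1, min(kk, i - 1, j) + 1):
            s -= val(i, m, m - 1) / val(m, m, m - 1) * val(m, j, m - 1)
        return s
    return [[val(i, j, k) for j in range(1, n + 1)] for i in range(1, n + 1)]
```

In particular, for i > j the sum reaches m = j and zeroes the entry, so a_{i,j}^{(k)} = 0 for k ≥ j; and a_{i,j}^{(k)} stabilizes once k exceeds min(i−1, j), matching the two properties stated above.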
4. Correctness
We want to prove by recurrence on k that if, at the beginning of phase k (that is to say, at index (k, k, k)):
- matrix A is in the configuration c(k),
- a_{i,j} (i, j ≥ k) has the value a_{i,j}^{(k−1)},
then after the execution of phase k, at the beginning of phase k+1 (that is to say, at index (k+1, k+1, k+1)):
- matrix A is in the configuration c(k+1),
- a_{i,j} (i, j > k) has the value a_{i,j}^{(k)},
- a_{i,j} (i ≤ k or j ≤ k) has not been modified during phase k.
Remarks:
- At the beginning of the algorithm, matrix A is in configuration c(1) and a_{i,j} has the value a_{i,j}^{(0)}.
- During a phase, a_{i,j} is only modified once; therefore the value assigned to a_{i,j} when it is modified in phase k (if ever it is modified) will be the value of a_{i,j} at the beginning of the next phase.

We suppose (hypothesis h_k) that:
- matrix A is in the configuration c(k) at index (k, k, k);
- a_{i,j} (i, j ≥ k) has the value a_{i,j}^{(k−1)} at index (k, k, k).

At the index (k, k, k) (first index of phase k), the coefficients concerned are a_{k,k}, b_{k,k}, c_{k,k}. We have ctrl a_{k,k} = Piv and a_{k,k} = a_{k,k}^{(k−1)} (hypothesis h_k). Therefore we compute:

b_{k,k} := −1/a_{k,k} = −1/a_{k,k}^{(k−1)};
ctrl b_{k,k} := (Piv, 1); ctrl a_{k,k} := nil.

As we have pointed out before, a_{k,k}^{(k)} = a_{k,k}^{(k−1)}, and a_{k,k} is not modified until phase k+1; thus a_{k,k} will have the value a_{k,k}^{(k)} at the beginning of phase k+1.
At the index (k+1, k, k), the coefficients concerned are a_{k+1,k}, b_{k,k}, c_{k+1,k}. We have ctrl a_{k+1,k} = P-c (hypothesis h_k), and ctrl b_{k,k} = (Piv, 1) with b_{k,k} = −1/a_{k,k}^{(k−1)}, because b_{k,k} is not modified between (k, k, k) and (k+1, k, k). Therefore we compute:

c_{k+1,k} := b_{k,k} * a_{k+1,k} = −a_{k+1,k}^{(k−1)} / a_{k,k}^{(k−1)};
ctrl c_{k+1,k} := (Piv, 3); ctrl a_{k+1,k} := nil; ctrl b_{k,k} := (Piv, 0).
At the index (i, k, k), i > k+1, we have ctrl a_{i,k} = P-c (hypothesis h_k), and

ctrl b_{k,k} = (Piv, 0) and b_{k,k} = −1/a_{k,k}^{(k−1)}. (1)

We can prove (1) by recurrence: b_{k,k} and ctrl b receive these values at index (k+1, k, k); between any pair of indices (i, k, k) and (i+1, k, k), b_{k,k} and ctrl b are not modified, nor are they modified at index (i, k, k); hence (1) is true for every index (i, k, k), i > k. Therefore we compute:

c_{i,k} := b_{k,k} * a_{i,k} = −a_{i,k}^{(k−1)} / a_{k,k}^{(k−1)};
ctrl c_{i,k} := (Piv, 1); ctrl a_{i,k} := nil; ctrl b_{k,k} := (Piv, 0) (not changed).
At the index (k, k, i), i > k, we have ctrl a_{k,i} = P-r (hypothesis h_k), and a_{k,i} = a_{k,i}^{(k−1)} because a_{k,i} (i > k) has not been modified since index (k, k−1, i). Therefore we compute:

b_{i,k} := a_{k,i} (= a_{k,i}^{(k−1)}); ctrl a_{k,i} := nil; ctrl b_{i,k} := (Piv, 0).

As we have pointed out before, a_{k,i}^{(k)} = a_{k,i}^{(k−1)} (for i ≥ k), and a_{k,i} is not modified until phase k+1; thus a_{k,i} will have the value a_{k,i}^{(k)} at the beginning of phase k+1.
At the index (k+1, k, k+1), we have ctrl a_{k+1,k+1} = Upd (hypothesis h_k), ctrl c_{k+1,k} = (Piv, 3) with c_{k+1,k} = −a_{k+1,k}^{(k−1)} / a_{k,k}^{(k−1)} (cf. index (k+1, k, k); c_{k+1,k} is not modified between indices (k+1, k, k) and (k+1, k, k+1)), and b_{k+1,k} = a_{k,k+1}^{(k−1)} (cf. index (k, k, k+1); b_{k+1,k} is not modified between indices (k, k, k+1) and (k+1, k, k+1)). Therefore we compute:

a_{k+1,k+1} := a_{k+1,k+1} + c_{k+1,k} * b_{k+1,k}
            = a_{k+1,k+1}^{(k−1)} − (a_{k+1,k}^{(k−1)} / a_{k,k}^{(k−1)}) * a_{k,k+1}^{(k−1)} = a_{k+1,k+1}^{(k)};
ctrl a_{k+1,k+1} := Piv; ctrl c_{k+1,k} := (Piv, 2).
And a_{k+1,k+1} and ctrl a are not modified until phase k+1, so a_{k+1,k+1} will have the value a_{k+1,k+1}^{(k)} at the beginning of phase k+1 and will be accompanied by ctrl a = Piv.
At the index (i, k, k+1), i > k+1, we have ctrl a_{i,k+1} = Upd (hypothesis h_k), ctrl c_{i,k} = (Piv, 1) with c_{i,k} = −a_{i,k}^{(k−1)} / a_{k,k}^{(k−1)} (cf. index (i, k, k); c_{i,k} is not modified between indices (i, k, k) and (i, k, k+1)), and b_{k+1,k} = a_{k,k+1}^{(k−1)} (cf. index (k, k, k+1); b_{k+1,k} is not modified after this index in phase k). Therefore we compute:

a_{i,k+1} := a_{i,k+1} + c_{i,k} * b_{k+1,k}
          = a_{i,k+1}^{(k−1)} − (a_{i,k}^{(k−1)} / a_{k,k}^{(k−1)}) * a_{k,k+1}^{(k−1)} = a_{i,k+1}^{(k)};
ctrl a_{i,k+1} := P-c; ctrl c_{i,k} := (Piv, 0).

And a_{i,k+1} and ctrl a are not modified until phase k+1, so a_{i,k+1} will have the value a_{i,k+1}^{(k)} at the beginning of phase k+1 and will be accompanied by ctrl a = P-c.
At the index (k+1, k, i), i > k+1, we have ctrl a_{k+1,i} = Upd (hypothesis h_k), and ctrl c_{k+1,k} = (Piv, 2) with c_{k+1,k} = −a_{k+1,k}^{(k−1)} / a_{k,k}^{(k−1)} (cf. index (k+1, k, k+1); c_{k+1,k} is not modified between indices (k+1, k, j) and (k+1, k, j+1), and the signal (Piv, 2) of C is not modified at the indices (k+1, k, j), j > k+1). Therefore we compute:

a_{k+1,i} := a_{k+1,i} + c_{k+1,k} * b_{i,k}
          = a_{k+1,i}^{(k−1)} − (a_{k+1,k}^{(k−1)} / a_{k,k}^{(k−1)}) * a_{k,i}^{(k−1)} = a_{k+1,i}^{(k)};
ctrl a_{k+1,i} := P-r; ctrl c_{k+1,k} := (Piv, 2).

And a_{k+1,i} and ctrl a are not modified until phase k+1, so a_{k+1,i} will have the value a_{k+1,i}^{(k)} at the beginning of phase k+1 and will be accompanied by ctrl a = P-r.
At the index (i, k, j), i, j > k+1, we have ctrl a_{i,j} = Upd (hypothesis h_k), and ctrl c_{i,k} = (Piv, 0) with c_{i,k} = −a_{i,k}^{(k−1)} / a_{k,k}^{(k−1)} (cf. index (i, k, k+1); c_{i,k} is not modified between indices (i, k, l) and (i, k, l+1), and the signal (Piv, 0) of C is not modified at the indices (i, k, l), l > k+1). Therefore we compute:

a_{i,j} := a_{i,j} + c_{i,k} * b_{j,k}
        = a_{i,j}^{(k−1)} − (a_{i,k}^{(k−1)} / a_{k,k}^{(k−1)}) * a_{k,j}^{(k−1)} = a_{i,j}^{(k)};
ctrl a_{i,j} := Upd; ctrl c_{i,k} := (Piv, 0).

And a_{i,j} and ctrl a are not modified until phase k+1, so a_{i,j} will have the value a_{i,j}^{(k)} at the beginning of phase k+1 and will be accompanied by ctrl a = Upd.

Summary. During phase k, all a_{i,j} (i, j > k) are set to a_{i,j}^{(k)}; the a_{i,j} (i ≤ k or j ≤ k) are not modified, but:
- if i < k or j < k, a_{i,j}^{(k)} = a_{i,j}^{(k−1)};
- if i = k, a_{i,j}^{(k)} = a_{i,j}^{(k−1)};
- if i > k and j = k, a_{i,j}^{(k)} = 0 (for Gaussian elimination); in our derivation, these values are set neither to 0 nor to the values of the coefficients of L in the LU decomposition (but this is a straightforward modification of the program).

The control signals are set in the configuration c(k+1). Thus, the recurrence hypothesis is verified. In conclusion, at the end of the algorithm (at index (n, n, n)), the a_{i,j} have the values a_{i,j}^{(n)} (except if i > j, where a_{i,j}^{(n)} = 0): the array computes the Gaussian elimination of A.
5. Conclusion

In this paper, we prove that most arrays capable of executing the matrix-matrix product can be re-programmed so as to perform Gaussian elimination (because most matrix-matrix product arrays respect the three dependence vectors (0, 0, 1), (0, 1, 0), (1, 0, 0)). The Gaussian elimination array that we obtain will not be optimal, in time or in space, because it has the complexity of a matrix-matrix product. But our construction provides many new arrays to perform Gaussian elimination, especially linear modular arrays. For example, so far, in the only linear array performing Gaussian elimination [2], all the computations concerning the pivot were done in the same cell. Here, we have algorithms which perform Gaussian elimination with all the coefficients moving, which allows us to pipeline the Gaussian elimination with other computations. Another interesting aspect of the construction is that we can use the same architecture for the matrix-matrix product and for Gaussian elimination. This versatility can be valuable in many scientific computing applications.
References

[1] H.M. Ahmed, J.M. Delosme and M. Morf, Highly concurrent computing structures for matrix arithmetic and signal processing, Computer 15 (1) (1982) 65-82.
[2] A. Benaini and Y. Robert, A modular systolic linear array for Gaussian elimination, Technical Report 89-08, LIP-IMAG, Ecole Normale Supérieure de Lyon, 1989.
[3] A. Benaini and Y. Robert, An even faster matrix product on systolic array, Parallel Comput. 12 (1989) 249-254.
[4] A. Benaini and M. Tchuente, Matrix product on modular linear systolic arrays, in: M. Cosnard et al., eds., Parallel and Distributed Algorithms (North-Holland, 1989) 79-88.
[5] A. Benaini and M. Tchuente, Produit matriciel sur un réseau linéaire, Revista Matematicas Aplicadas 10 (1) (1989) 5-25.
[6] W.M. Gentleman and H.T. Kung, Matrix triangularisation by systolic arrays, in: Proc. SPIE (Society of Photo-Optical Instrumentation Engineers), Vol. 298, Real-Time Signal Processing IV (1981) 19-26.
[7] H.T. Kung and C.E. Leiserson, Systolic arrays (for VLSI), in: Sparse Matrix Proc. 1978 (Society for Industrial and Applied Mathematics, 1978) 256-282.
[8] P. Lee and Z.M. Kedem, Synthesizing linear algorithms from nested for loop algorithms, IEEE Trans. Comput. 37 (12) (1988) 1578-1598.
[9] D.R. O'Hallaron, Uniform approach for solving some classical problems on a linear array, in: Proc. Internat. Conf. ICPP'89, Vol. 3 (IEEE Press, 1989) 161-168.
[10] F.P. Preparata and J.E. Vuillemin, Area-time optimal VLSI networks for multiplying matrices, Inform. Processing Lett. 11 (2) (1980) 77-80.
[11] P. Quinton and V. Van Dongen, Uniformisation of linear recurrence equations: a step toward the automatic synthesis of systolic arrays, in: Proc. Internat. Conf. on Systolic Arrays, San Diego (May 1988) 473-482.
[12] I.V. Ramakrishnan and P.J. Varman, Modular matrix multiplication on a linear array, IEEE Trans. Comput. 33 (11) (1984) 952-958.
[13] I.V. Ramakrishnan and P.J. Varman, An optimal family of matrix multiplication algorithms on linear arrays, in: Proc. Internat. Conf. ICPP'85 (IEEE Press, 1985) 376-383.