Journal of Parallel and Distributed Computing 8, 52-59 (1990)

Area-Time Trade-Offs for Matrix-Vector Multiplication
B. CODENOTTI
Istituto di Elaborazione dell'Informazione-CNR, Via S. Maria 46, Pisa, Italy

AND

G. LOTTI AND F. ROMANI
Dipartimento di Informatica, Università di Pisa, Corso Italia 40, Pisa, Italy
The area-time complexity of the matrix-vector multiplication problem is studied in four models of VLSI computation. Both upper and lower bounds on different measures of area-time complexity, which depend on the chosen I/O conventions, are derived. We show that the VLSI complexity of matrix-vector multiplication is dominated by the information flow, by the number of arithmetic operations needed to solve the problem, or by the size of the data to be stored in the circuit. The problem of multiplying a sparse matrix by a vector is examined as well, since it is a special case of wide interest in many applications. © 1990 Academic Press, Inc.

1. INTRODUCTION

Matrix-vector multiplication plays a central role in numerical linear algebra, since matrix-vector products have to be computed in several stages of almost all numerical algorithms. Moreover, the area-time complexity of matrix-vector multiplication presents the following interesting feature. While the complexity of matrix multiplication is dominated by the AT² measure, i.e., optimal designs exist with respect to that measure for the whole range of time T ∈ (Ω(log n), O(n)) [5, 7], we will show that the investigation of the area-time complexity of matrix-vector multiplication can lead either to area-time trade-offs obtained with respect to different measures, depending on the chosen I/O conventions, or simply to area lower bounds. More precisely, the area-time complexity of matrix multiplication depends on the information flow; on the contrary, the area-time complexity of matrix-vector multiplication depends on the information flow, on the number of arithmetic operations needed to solve the problem, or on the size of the data to be stored in the circuit. An analogous phenomenon was investigated by JáJá [1] for graph problems.

As far as matrix-vector products are concerned, a lower bound has been shown in [7], using the information flow technique, while the best upper bounds have been obtained by the mesh of trees network [3] and by a linear array of processing elements working in a systolic fashion [2]. In this paper we consider matrix-vector multiplication, deriving both lower and upper bounds (Sections 3 and 4). We also consider the problem of multiplying a sparse matrix by a vector (Section 5).

2. PRELIMINARIES AND ASSUMPTIONS

Here we describe the VLSI models of computation to be used in the paper. Following JáJá [1], we assume, without loss of generality, the chip to be a rectangle, whose area A is determined by gates connected by wires. Wires have minimal width λ > 0, and each gate has area Ω(λ²). A constant number of wires can intersect at any point. Moreover, the chip is where- and when-oblivious; i.e., the position and the time at which data are available are independent of the values of the input bits. The propagation of a bit along a wire takes unit time, independent of the wire length. It is well known that the amount of communication required to solve a given problem plays a central role in its VLSI complexity. Our general problem consists of the evaluation of a set of functions whose arguments can be specified by p units of information. We are interested in deriving a lower bound on the amount of communication required to solve the problem, given that the p units of information specifying the arguments are distributed among several processors. Following [1], we assume that these p units of information are partitioned between two processors P₁ and P₂, such that all the functions are known to both P₁ and P₂. We refer to the minimum number of units of information exchanged between P₁ and P₂ as the amount of information transfer, denoted by φ.
In the following we distinguish between pipelined and not pipelined models. In our context pipelined systems are those in which the number of I/O gates is less than the number of data; i.e., multiple inputs can be provided through one port at each instance of a given problem. Four possibilities are considered for the I/O conventions:

General-pipelined model: I/O ports lie anywhere on the chip and pipelining is allowed for inputs.
General-not pipelined model: I/O ports lie anywhere on the chip and pipelining is not allowed for inputs.
Boundary-pipelined model: all I/O ports lie on the boundary of the chip and pipelining is allowed for inputs.
Boundary-not pipelined model: all I/O ports lie on the boundary of the chip and pipelining is not allowed for inputs.

Therefore, two kinds of restrictions are considered throughout the paper, namely boundary and not pipelined models. The former restriction is of particular interest to those who actually fabricate integrated circuits, and the latter is useful for chips to be used in a modular fashion and/or as components of more complicated circuits. For a specific problem, area-time lower bounds depending on the model used are derived as functions of the following quantities:

φ: the information transfer with partitioned input, as defined above;
C_s: the sequential complexity of the problem, i.e., the minimal number of elementary operations sufficient to solve the problem on sequential models;
I: the number of input data words.

In the general-pipelined model the following lower bounds hold:

    AT² = Ω(φ²), e.g., see [1, 7, 9];
    AT = Ω(C_s).

In fact, let p be the number of processing elements. The AT bound derives from the trivial relations T = Ω(C_s/p), A = Ω(p). In the boundary-pipelined model all the bounds for the general-pipelined model hold, together with AT² = Ω(φI). Indeed, let h and l be the height and the length of the rectangular chip, respectively, and let h ≥ l. Perform a cut along the short side of the chip so that roughly half of the input words are received by each side of the chip. Since the number of input words is I, it follows that T ≥ I/(2h + 2l);
i.e., T = Ω(I/h). On the other hand, since the information transfer is φ, φ words have to cross the cut. It follows that T = Ω(φ/l) and AT² = Ω(φI). In the general-not pipelined model all the bounds for the general-pipelined model hold, together with A = Ω(I). In the boundary-not pipelined model all the bounds for the other models hold, together with AT = Ω(φI). Indeed, we have I ≤ 2h + 2l; i.e., h = Ω(I). Moreover, T = Ω(φ/l), as shown for the boundary-pipelined model. In the next section we will discuss the problem of multiplying a matrix by a vector; in Section 4 we will discuss dense matrix-vector multiplication; in Section 5 the "sparse" matrix-vector multiplication will be considered, with particular interest in a special class of matrices. The notation log is used to denote the logarithm to base 2 throughout.
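For bookkeeping purposes, the four sets of bounds can be tabulated mechanically. The following Python sketch (ours, not part of the paper; every bound is to be read as an Ω lower bound) collects the bounds of this section as functions of φ, C_s, and I, and instantiates them for the dense nonresident case treated in Section 4, where φ = n, C_s = n², I = n².

    def lower_bounds(phi, Cs, I):
        """Tabulate the lower bounds of Section 2 for the four I/O models.

        Each entry is a (measure, bound) pair, read as measure = Omega(bound).
        """
        general_pipelined = [("AT^2", phi ** 2), ("AT", Cs)]
        boundary_pipelined = general_pipelined + [("AT^2", phi * I)]
        general_not_pipelined = general_pipelined + [("A", I)]
        # The boundary-not pipelined model inherits the bounds of all
        # the other models, plus AT = Omega(phi * I).
        boundary_not_pipelined = (general_not_pipelined
                                  + [("AT^2", phi * I), ("AT", phi * I)])
        return {
            "general-pipelined": general_pipelined,
            "boundary-pipelined": boundary_pipelined,
            "general-not pipelined": general_not_pipelined,
            "boundary-not pipelined": boundary_not_pipelined,
        }

    # Dense nonresident matrix-vector product (Section 4).
    n = 1024
    for model, bounds in lower_bounds(phi=n, Cs=n * n, I=n * n).items():
        print(model, bounds)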
3. THE MATRIX-VECTOR MULTIPLICATION PROBLEM

Let A = (a_{ij}) be an n × n real matrix and let x = (x_j) be an n-vector with real entries. The product Ax is the n-vector y whose components are

    y_i = \sum_{j=1}^{n} a_{ij} x_j,    i = 1, 2, ..., n.
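As a fixed point of reference for the VLSI designs that follow, here is the plain sequential algorithm (our sketch, not part of the paper); it spells out the componentwise definition above and performs the n² multiply-add operations of what Section 4 calls the straightforward sequential algorithm.

    def matvec(A, x):
        """Sequential reference: the n^2 multiply-add steps of y = Ax."""
        n = len(x)
        y = [0.0] * n
        for i in range(n):
            for j in range(n):
                y[i] += A[i][j] * x[j]   # y_i = sum_j a_ij x_j
        return y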
In numerical computations real numbers are approximated in some finite arithmetic; i.e., numbers are represented by words of fixed length d. The value d is often independent of the size of the problem; in this case the asymptotic complexity of the problem is the same in word-based and in bit-based VLSI models. Sometimes, error analysis considerations suggest that a word length depending on the size of the problem be used. This leads to different asymptotic complexity results in the two types of models. We will state our results by assuming the former situation to hold, namely that the word length is fixed. When solving this problem by VLSI circuits there are two main possibilities, namely:

(a) The matrix A is resident on the circuit and the input data are the entries of x.
(b) The input data are the entries of both A and x.

Case (a) occurs when the matrix A is related to a given transform (e.g., the discrete Fourier transform) or when the same matrix has to be multiplied by several different vectors
(e.g., iterative methods for the solution of linear systems). The entries of A are either constant and wired on the circuit or entered in a preprocessing phase. In case (a) one more lower bound holds, namely A = Ω(R), where R is the number of matrix elements which have to be stored in the circuit. Note that if the circuit has to deal with general dense matrices then R = n². If the circuit has to deal with general sparse matrices with no more than M(n) elements then R = M(n). Otherwise, if the matrix has a special structure, then R is the number of words sufficient to encode the entries of A.

In this section we present a general technique for obtaining good upper bounds; namely, we show two ways of aggregating basic matrix-vector multipliers, and we discuss the attained area-time performances. These results will be applied to "dense" and to "sparse" matrix-vector multiplication in the following sections. The technique consists in partitioning the matrix into (n/k) × (n/k) blocks of size k and performing the n × n matrix-vector product by using simpler modules for the multiplications of order k.

(1) Linear array of elementary processors. Let us consider a linear array of n modules (Fig. 1), each capable of performing the basic step y_i = y_i + a_{ij}x_j. Vector x enters sequentially via one input path on the boundary and is transmitted to the neighboring module. Vector y exits sequentially via one output path on the boundary. Each column of the matrix enters via n input paths, and some delays ensure the correctness of the network. The area used by this linear array is A(n) = O(n); the time is T(n) = O(n). This design is the classical systolic array [2, 6].

FIG. 1. (a) Linear array of modules; (b) single module.

(2) Mesh of trees. Another type of construction is the recursive design of Fig. 2, which can be derived from a mesh of trees [3]. Assume n to be a power of two. Module P_n performs the n × n matrix-vector multiplication using four P_{n/2} modules. Module P_k is capable of performing the k × k matrix-vector multiplication z = Bu in time t(k), and it has height h(k) and length l(k). If the module P_k does not use pipelining for the input of the vector u, then it has k input wires; if instead the vector enters in α steps (1 ≤ α ≤ k), it has k/α input wires.

FIG. 2. (a) Recursive construction of modules; (b) single module.

The following recurrence relations can be established for the height H(n) and the length L(n) of the resulting network:

    H(k) = h(k);  H(n) = 2H(n/2) + n/α;
    L(k) = l(k);  L(n) = 2L(n/2) + n/α;

whence

    H(n) = (n/k)h(k) + (n/α)log(n/k);
    L(n) = (n/k)l(k) + (n/α)log(n/k);

and finally,

    A(n) = n²[l(k)h(k)/k² + log(n/k)(h(k) + l(k))/(kα) + log²(n/k)/α²],   (3.1)
    T(n) = O(log(n/k) + t(k)).   (3.2)

4. THE DENSE CASE

In this section we present some lower and upper bounds on the area-time complexity of dense matrix-vector multiplication, in the models described in Section 2. For general dense matrix-vector multiplication the sequential complexity is C_s = Ω(n²) and the amount of communication required is φ = Ω(n) [7]. Moreover

    I = n, R = n², if the matrix is resident;
    I = n², otherwise.

Consider now the upper bounds which can be attained with specific networks.

(1) Basic module. The first design is the basic module of Fig. 3. It consists of an arithmetic processor with height and length of the same order; one input and one output path, one word wide; and a finite addressable memory which stores the partial results together with, in the resident case, the entries of the matrix.

FIG. 3. Basic module.

(2) Basic module + mesh of trees. The basic module can be used to perform the k × k matrix-vector multiplications in a mesh of trees, used to perform the n × n matrix-vector multiplication.

(3) Linear array of elementary processors. It is the design of Fig. 1 (see Section 3); a step-by-step simulation is sketched right after this list.

(4) Linear array + mesh of trees. The linear array of elementary processors can be used to perform the k × k matrix-vector multiplications in the mesh of trees with α = k.
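The following Python sketch (ours, not from the paper) simulates the data movement of the linear systolic array of designs (1) and (3): module i holds the accumulator y_i, the entries of x shift one module per step, and the matrix entries arrive with delays tuned so that module i meets a_{ij} exactly when x_j passes by. The inner loop serializes what the n modules do in parallel, so the 2n - 1 outer steps reflect the T(n) = O(n) running time.

    def systolic_matvec(A, x):
        """Step-level simulation of the linear systolic array.

        Module i accumulates y[i]; x[j] reaches module i at global step
        t = i + j, together with the matrix entry a_ij fed from above.
        """
        n = len(x)
        y = [0.0] * n
        for t in range(2 * n - 1):          # global clock, O(n) steps
            for i in range(n):              # the n modules, in parallel
                j = t - i
                if 0 <= j < n:
                    y[i] += A[i][j] * x[j]  # basic step y_i += a_ij x_j
        return y

Comparing systolic_matvec(A, x) with the sequential matvec of Section 3 on random inputs exercises both descriptions.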
Note that designs 3 and 4 apply only to the nonresident matrix case. In the following, we shall consider first the nonresident matrix case, and then the resident matrix case.

4.1. Nonresident Matrix Case

The following lower bounds can be stated:

In the general-pipelined model, AT = Ω(n²).
In the boundary-pipelined model, AT² = Ω(n³).
In the general-not pipelined model, A = Ω(n²).
In the boundary-not pipelined model, AT = Ω(n³).

Consider now the upper bounds obtained by the constructions described above.

(1) Basic module. In nonpipelined models only the case n = 1 is significant, because there is only one input port and pipelining is not allowed. Therefore h(1) = O(1), l(1) = O(1), A(1) = O(1), T(1) = O(1). In pipelined models, the module performs the straightforward sequential algorithm with the performances

    h(n) = O(n^{1/2}), l(n) = O(n^{1/2}), A(n) = O(n), T(n) = O(n²).

The last performance can be attained, first, by reading and storing the vector, and then by reading the matrix rowwise.

(2) Basic module + mesh of trees. This design produces the following upper bounds, obtained by substituting the proper values in (3.1) and (3.2).

General-not pipelined model: Only the choice k = 1 is significant. Hence [3]

    A = O(n² log² n), T = O(log n).

Boundary-not pipelined model: Only the choice k = 1 is significant. The circuit has to be stretched in one direction by a factor sufficient to insert the n² wires connecting the boundary to the processors (see [3, 4]); then

    H(n) = 2H(n/2) + O(n²) = O(n²),
    L(n) = 2L(n/2) + O(n) = O(n log n);

that is,

    A = O(n³ log n), T = O(log n).

In pipelined models better bounds are produced by other designs, described below.

(3) Linear array of elementary processors. The design of Fig. 1 produces, in pipelined models,

    A = O(n), T = O(n) [2].

(4) Linear array + mesh of trees. We obtain the following upper bounds. In the general-pipelined model the input ports for the entries of the matrix do not affect the area bound of the design, since they lie anywhere on the chip. From l(k) = O(1), h(k) = O(k), t(k) = O(k), by substituting in (3.1) and (3.2) we obtain

    A = O(n²/k + n²/k² log²(n/k) + n²/k log(n/k)), T = O(log(n/k) + k).

Let k = Ω(log n). Then T = O(k) and

    A = O(n²/T log(n/T)) for T ∈ (Ω(log n), o(n)).

In the case of boundary models the circuit is stretched in order to insert the wires which transfer the matrix from the boundary to the elementary processors. Since α = k, n²/k paths have to be inserted. Therefore the height of the circuit satisfies

    H(n) = 2H(n/2) + O(n²/k), H(k) = O(k),
whence

    H(n) = O(n + n²/k log(n/k)),

and from

    L(n) = 2L(n/2) + O(n/k), L(k) = O(1),

that is,

    L(n) = O(n/k + n/k log(n/k)),

we obtain

    A(n) = O(n²/k + n³/k² log²(n/k) + n²/k log(n/k)).

Then the performance in the boundary-pipelined model is

    A = O(n³/T² log²(n/T)) for T ∈ (Ω(log n), o(n)).

4.2. Resident Matrix Case

In all models we have A = Ω(R) = Ω(n²).

(1) Basic module. It performs the straightforward sequential algorithm with the following performance. Since all the entries have to be stored in the memory,

    h(n) = O(n), l(n) = O(n), A(n) = O(n²), T(n) = O(n²).

(2) Basic module + mesh of trees. It produces the following upper bounds.

Pipelined models: From h(k) = O(k), l(k) = O(k), t(k) = O(k²), α = k it follows that

    A = O(n² + n²/k² log²(n/k)), T = O(log(n/k) + k²).

Let k ∈ (Ω(log^{1/2} n), O(log n)); then

    A = O(n²/T log²(n/T)), T ∈ (Ω(log n), O(log² n)).

Not pipelined models: Only the choice k = 1 is significant, since multiple inputs are not allowed to pass through a single port. Hence, with analogous considerations, we derive [3]

    A = O(n² log² n), T = O(log n).

The results produced for the "dense" case are summarized in Table I and plotted in logarithmic scale in Fig. 4.

TABLE I

(a) Resident matrix

Model          | Lower bound | Upper bound                                      | Design
Pipelined      | A = Ω(n²)   | A = O(n²/T log²(n/T)), T ∈ (Ω(log n), O(log² n)) | (2)
Not pipelined  | A = Ω(n²)   | A = O(n² log² n), T = O(log n)                   | (2)

(b) Nonresident matrix

Model                  | Lower bound | Upper bound                                      | Design
General-pipelined      | AT = Ω(n²)  | A = O(n), T = O(n)                               | (3)
                       |             | A = O(n²/T log(n/T)), T ∈ (Ω(log n), o(n))       | (4)
Boundary-pipelined     | AT² = Ω(n³) | A = O(n), T = O(n)                               | (3)
                       |             | A = O(n³/T² log²(n/T)), T ∈ (Ω(log n), o(n))     | (4)
General-not pipelined  | A = Ω(n²)   | A = O(n² log² n), T = O(log n)                   | (2)
Boundary-not pipelined | AT = Ω(n³)  | A = O(n³ log n), T = O(log n)                    | (2)

FIG. 4. Upper and lower bounds for the case of dense matrices plotted in log-log scale. The dotted lines denote lower bounds. The points and the solid lines denote upper bounds. Note that the apparent match between dotted and solid lines is only asymptotic and is a direct consequence of using the log-log scale.
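The aggregation formulas (3.1)-(3.2) are easy to experiment with numerically. The sketch below (ours; all constant factors are set to 1, which the O-notation of the paper deliberately leaves unspecified) evaluates the solved height and length recurrences of Section 3 for a given basic module, here with h(k) = l(k) = k, t(k) = k², and α = k, mirroring design (2) in the resident case.

    import math

    def mesh_of_trees_cost(n, k, alpha, h, l, t):
        """Evaluate the solved recurrences behind (3.1)-(3.2): an n x n
        multiplier assembled from k x k modules of height h(k), length
        l(k), and time t(k), the vector entering each module in alpha
        steps.  Constant factors are set to 1.
        """
        levels = math.log2(n / k)                  # log(n/k) recursion levels
        H = (n / k) * h(k) + (n / alpha) * levels  # height H(n)
        L = (n / k) * l(k) + (n / alpha) * levels  # length L(n)
        return H * L, levels + t(k)                # area (3.1), time (3.2)

    n = 1024
    for k in (2, 8, 32):
        A, T = mesh_of_trees_cost(n, k, alpha=k,
                                  h=lambda m: m, l=lambda m: m,
                                  t=lambda m: m * m)
        print(f"k = {k:2d}:  A ~ {A:12.0f}   T ~ {T:6.0f}")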
5. BOUNDS FOR A CLASS OF SPARSE MATRICES

In this section we present some results concerning the area-time complexity of matrix-vector multiplication when the matrix is sparse. Large linear systems with sparse coefficient matrices arise frequently in many problems of applied mathematics. Sometimes the coefficient matrix has a strong structure which can be fully exploited in devising algorithms. On the other hand, if A is a general sparse matrix with weak structural properties, the algorithms suited for dense matrices are very inefficient, and "ad hoc" algorithms cannot be derived. Note that iterative methods for the solution of linear systems have to perform matrix-vector products as their principal steps. The main advantage of using iterative methods with a sparse iteration matrix is that the amount of storage can be bounded by the number of nonzero elements of the iteration matrix. Note that area is the critical resource when dealing with sparse matrices, since their size usually is very large. Our results will lead to a number of nearly optimal VLSI designs for the solution of large sparse linear systems by iterative methods.

Given an n × n matrix A, let M(n) denote the number of its nonzero entries. Let us partition A into (n/k) × (n/k) blocks of size k × k. We denote by M(k) the maximum number of nonzero entries of the k × k blocks. We define three types of sparsity:

simple sparsity: if M(n) = O(n);
weak sparsity: if M(n) = O(n) and a constant c not depending on n exists such that M(k) ≤ ck, 1 ≤ k ≤ n;
strong sparsity: if M(n) = O(n) and a constant c not depending on n exists such that M(k) ≤ c, 1 ≤ k ≤ n^{1/2}.

An example of weak sparsity is provided by band matrices, e.g., those arising from the discretization of partial differential equations. Examples of strong sparsity are some matrices arising from network problems. To exploit sparsity we will use the technique of partitioning the matrix into submatrices, and only pipelined models will be considered in the following. For general sparse matrix-vector multiplication the sequential complexity is C_s = Ω(n) and the amount of communication required is φ = Ω(n). Moreover

    I = n, R = M(n), if the matrix is resident;
    I = n + M(n), otherwise.

Consider the following constructions.

(1) Basic module. It consists of an arithmetic processor with one input path and one output path, one word wide, and a finite addressable memory. The nonzero elements of the matrix are stored columnwise in the memory, together with their row and column coordinates. The area sufficient for the computation is A = O(M(n) log n). The module performs the matrix-vector multiplication in two phases (a sketch of the two phases is given below):

(1) for j = 1, 2, ..., n do: get the element x_j; compute and store in contiguous locations the products a_{ij}x_j, a_{ij} ≠ 0, i = 1, 2, ..., n, together with their coordinates.
(2) compute the sums

    y_i = \sum_{j : a_{ij} \neq 0} a_{ij} x_j,    i = 1, 2, ..., n.

Since the elements belonging to the sets B_j = {a_{ij} : a_{ij} ≠ 0, i = 1, 2, ..., n} are stored in contiguous locations, phase (1) requires time T = O(n + M(n)). Analogously, the elements of the sets C_j = {a_{ij}x_j : a_{ij} ≠ 0, i = 1, 2, ..., n} are stored in contiguous locations, and phase (2) requires time T = O(n + M(n)) as well. Since M(n) = O(n), this module produces

    A(n) = O(n log n), T(n) = O(n),

for all the types of sparse matrices, and it can be used in pipelined models and with a resident or nonresident matrix.

We now present some upper bounds for strongly sparse matrices, where improvements can be obtained upon the results of the dense case (see Fig. 5).

(2) Basic module + mesh of trees. The basic module can be used to perform the k × k matrix-vector multiplications in the mesh of trees, with α = k. Choosing k = O(n^{1/2}), the number of nonzero elements in any k × k block is constant, so that the area A(k) = l(k)h(k) = O(log k) and the time T = O(k) suffice for the basic module. The performance is

    A = O(n²[l(k)h(k) + (l(k) + h(k))log(n/k) + log²(n/k)]/k²)

and

    T = O(log(n/k) + k).

Therefore

    A = O(n²/T² log²(n/T)) for T ∈ (Ω(log n), O(n^{1/2})).
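The two-phase scheme of the basic module (design (1) above) can be sketched as follows; this is our illustration, with the columnwise storage modeled as a Python list of (row, value) pairs per column.

    def sparse_matvec_two_phase(n, cols, x):
        """Sketch of the basic module's two-phase algorithm (design (1)).

        cols[j] lists the nonzeros of column j as (i, a_ij) pairs: the
        columnwise storage, with row coordinates, described in the text.
        """
        # Phase (1): stream x and store the products a_ij * x_j, with
        # their row coordinates, in contiguous locations; this takes
        # O(n + M(n)) steps, one per visited element.
        products = []
        for j in range(n):
            xj = x[j]                      # get the element x_j
            for i, aij in cols[j]:
                products.append((i, aij * xj))
        # Phase (2): accumulate y_i as the sum of the stored products;
        # again O(n + M(n)) steps over contiguous locations.
        y = [0.0] * n
        for i, p in products:
            y[i] += p
        return y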
(3) Grid of basic modules. In Fig. 6 another structure, which performs block matrix-vector products, is shown. The design consists of a mesh of processors of the type shown in Fig. 3. Each processor performs the multiplication of a k × k submatrix by a k-vector and accumulates partial results. Each k-vector enters in k steps, and the local computations and data transfers are overlapped.
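Before analyzing this design, note that, ignoring the timing and the overlap of communication with computation, the block decomposition performed by the grid can be sketched as follows (our serialized illustration; the (n/k)² modules actually work in parallel).

    def grid_matvec(A, x, k):
        """Block decomposition of design (3): module (I, J) multiplies
        the k x k submatrix in block position (I, J) by the J-th
        k-subvector of x and accumulates into the I-th k-subvector of y.
        """
        n = len(x)
        assert n % k == 0
        y = [0.0] * n
        for I in range(0, n, k):              # block row of modules
            for J in range(0, n, k):          # block column of modules
                for i in range(I, I + k):
                    for j in range(J, J + k):
                        y[i] += A[i][j] * x[j]
        return y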
Let k = O(n^{1/2}). Since the number of nonzero elements in any k × k block is constant, the area of one processor is O(log n). The time sufficient to perform a k × k matrix-vector multiplication is O(n^{1/2}), and the same time is sufficient for all data transfers. The whole network attains the bounds

    A = O(n log n) and T = O(n^{1/2}).

5.1. Nonresident Matrix Case

The following lower bound can be stated, in pipelined models: AT² = Ω(n²). As far as upper bounds are concerned, design 1 can be used in all the pipelined models, while designs 2 and 3 can be used with the general-pipelined model only.

5.2. Resident Matrix Case

The following lower bounds can be stated, in pipelined models:

    AT² = Ω(n²), A = Ω(R) = Ω(n).
FIG. 6. (a) Grid of modules; (b) single module.

Designs 1, 2, and 3 can be used in both pipelined models.

6. CONCLUSIONS

In this work A, AT, and AT² bounds have been developed in four different models of VLSI design for dense and sparse matrix-vector products. We have used recursive constructions in order to combine some well-known architectures. Aside from the systolic linear array of Kung and Leiserson [2], all these constructions achieve the corresponding lower bounds up to logarithmic factors. Some of the resulting designs are known, and others have not yet been considered in the existing literature. It is worth noting that results holding for matrix-vector products can be used to design networks for several numerical problems, e.g., iterative methods for eigenvalue computation or for the solution of linear systems.

ACKNOWLEDGMENT
We thank referee “c” for valuable comments.
FIG. 5. Upper and lower bounds for the case of strongly sparse matrices plotted in log-log scale.

REFERENCES
1. JáJá, J. The VLSI complexity of selected graph problems. J. Assoc. Comput. Mach. 31 (1984), 377-391.
2. Kung, H. T., and Leiserson, C. E. Algorithms for VLSI processor arrays. In Mead, C. A., and Conway, L. A. (Eds.), Introduction to VLSI Systems. Addison-Wesley, Reading, MA, 1980, pp. 271-292.
3. Leighton, F. T. New lower bound techniques for VLSI. Proc. 22nd Annual IEEE Symposium on Foundations of Computer Science, 1981, pp. 1-12.
4. Leiserson, C. E. Area-efficient graph layouts (for VLSI). Proc. 21st Annual IEEE Symposium on Foundations of Computer Science, 1980, pp. 270-281.
5. Preparata, F. P., and Vuillemin, J. Area-time optimal VLSI networks for multiplying matrices. Inform. Process. Lett. 11 (1980), 77-80.
6. Preparata, F. P., and Vuillemin, J. Optimal integrated circuit implementation of triangular matrix inversion. Proc. International Conference on Parallel Processing, Boyne, MI, 1980, pp. 211-216.
7. Savage, J. E. Area-time tradeoffs for matrix multiplications and related problems in VLSI models. J. Comput. System Sci. 22 (1981), 230-242.
8. Stoer, J., and Bulirsch, R. Introduction to Numerical Analysis. Springer, New York, 1980.
9. Thompson, C. D. Area-time complexity for VLSI. Proc. 11th Annual ACM Symposium on Theory of Computing, 1979, pp. 81-88.
Received November 25, 1986; revised December 1, 1987

BRUNO CODENOTTI was born in 1959 in Italy. He received the Laurea in computer science in 1983, 110/110 cum laude. He is currently
a researcher of the National Research Council, and has taught numerical analysis at the University of Pisa since 1985. He has published 30 papers, most of which are on parallel and VLSI computations in numerical linear algebra.

GRAZIA LOTTI was born in 1952 in Italy. She received the Laurea in computer science in 1974, 110/110 cum laude. She is currently an associate professor of computer science at the University of Pisa. She has published 20 papers on numerical analysis, computational complexity, and VLSI design.
FRANCESCO ROMANI was born in 1952 in Italy. He received the Laurea in computer science in 1974, 110/110 cum laude. He is currently a full professor of computer science at the University of Pisa. He has published 30 papers on numerical analysis, computational complexity, and parallel and VLSI algorithmics.