INFORMATION SCIENCES 45, 347-365 (1988)

Multilayered Array Computing

SUBHASH C. KAK
Department of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, Louisiana 70803
ABSTRACT

The objective of this paper is to develop algorithms for efficient and fast implementation in a multilayered mode of signal-processing tasks such as convolution, correlation, matrix multiplication, Fourier transformation, Hilbert transformation, etc., using structures built out of a large number of simple cells. Some of these designs are essentially conceptual, while others, such as the ones for matrix multiplication, can be implemented using current technology. The matrix multiplication designs are two-layered arrays, and these arrays can also be used for matrix inversion and other related problems.
I. INTRODUCTION
The objective of this paper is to develop methods for three-dimensional array designs for fast implementation of different signal-processing tasks. Consider the important problem of matrix-vector multiplication. For matrices with specific structure, such as the DFT matrix, several fast algorithms that exploit the algebraic structure of the transformation are known. These fast algorithms can be visualized by means of interconnected regular processors laid out on a plane. For current VLSI technology such algorithms seem appropriate for hardware implementation. It has been estimated that the minimum feature size in VLSI might reduce to 0.3 to 0.4 μm within a decade. Reductions below this range seem unlikely owing to fundamental quantum-mechanical limitations. For further integration it appears necessary to use multilayered or three-dimensional structures. An experimental 3-D structure was created by Texas Instruments a few years ago. Amongst other laboratories, Hughes Aircraft Company has also designed experimental 3-D microelectronic structures [1]. While several major problems need to be overcome before such circuits become commercially available, a case exists for the study of 3-D architectures that exploit the specific structure of a given signal-processing transform. Not only would such a study provide
algorithms for 3-D implementations, useful should multilayered circuits become available, but such algorithms could also be used in 3-D architectures built up from planar components. An example of the latter would be the design of superfast 3-D array computers.

There exists another approach to the design of a 3-D cellular structure, with a different end in view. This is a development that caters directly to two-dimensional signal processing and image analysis. Here one processor (in a 3-D array) is used for each data element of the input set. The study of such computer structures began with ILLIAC and is being pursued further in ICL's DAP, Goodyear's MPP, and the Hughes 3-D machine. The Hughes machine has been described in the recent paper by Grinberg, Nudd, and Etchells [2]. It can be viewed as an array of processors arranged horizontally, with the elements of each processor in the array assembled vertically. Each wafer in the stack is an N × N array of a single type of processing element, such as accumulator, memory register, or comparator. The 3-D machines, such as the one being developed at Hughes, are therefore special architectures for processing 2-D data. The same can be said for many of the new architectures being studied in the Japanese fifth-generation computing project [3].

It should be noted that 3-D array structures will be inherently different from the designs of the 3-D machines being studied at Hughes and other laboratories. A 3-D array structure should exploit the properties of a given signal-processing algorithm to give a design where data flows are processed in a three-dimensional array of cells; the data may or may not be two-dimensional. The 3-D machines referred to in the above paragraph are an extension of the usual design of the computer to process two-dimensional data in that mode; these designs do not constitute interactive arrays. Similarly the research on systolic arrays [7, 9-11], wavefront processors [18], and other VLSI and distributed designs [12-17, 19] does not explicitly seek multilayered interactive structures.

Given that planar arrays can be designed for regular computing tasks, the question arises: is something to be gained by considering processing elements in multiple layers, dedicated to different tasks, of planar arrays? This paper answers the question in the affirmative. It is shown that certain two-layered structures are computationally faster than the corresponding planar structures for the same tasks. The question of the advantages of designs with more than two layers is also taken up, and multilayered implementations of some algorithms are provided.

Before a prototype multilayered computational structure can be envisaged, one needs to solve the problems of (1) development of algorithms that can be expressed naturally in a 3-D multilayered form, (2) development of efficient algorithms to perform fundamental interlayer operations, such as shuffling of data, and (3) management of control isolation. This defines a major program, and our objective in this paper is to address merely the first component of it.

Section II considers some general considerations in array design. Section III describes multilayered array structures for circular convolution, discrete Hilbert transformation, DFT, and correlation. It is shown that different data sequences may be loaded into different layers, with operations performed based on the contents of cells in adjacent layers; the results are computed as data flow down the layers. Section IV presents two-layered designs for matrix multiplication that give a speed advantage over the standard planar array. Section V shows how the problem of band-matrix multiplication can be solved three times more efficiently on a two-layered structure than on the hexagonal array of Kung and Leiserson.

II. ARRAY DESIGN: GENERAL CONSIDERATIONS
Several distributed and array computing structures have been described in the literature in recent years. These include pipelined systems, systolic and wavefront arrays, data-flow systems, and alternatives to the von Neumann architectures. Some of the proposed systems are specific to the computation task, while some others present reconfigurable designs. To illustrate the variety of design possibilities that exist, we take the example of matrix-vector multiplication, where at least two cellular designs can be proposed. Consider the following matrix-vector product:
$$\begin{bmatrix} a_0 & a_1 & a_2 & a_3 \\ b_0 & b_1 & b_2 & b_3 \\ c_0 & c_1 & c_2 & c_3 \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} a_0 x_0 + a_1 x_1 + a_2 x_2 + a_3 x_3 \\ b_0 x_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 \\ c_0 x_0 + c_1 x_1 + c_2 x_2 + c_3 x_3 \end{bmatrix}$$
We provide a hardware design for its implementation in Figure 1. In the input mode, the input data are fed into the cells serially and the contents are shifted to the right one by one. In the operational mode, multiplication is done in place and the row adder adds up all the product terms. The result is generated in just four cycles, as shown. If a row adder is not available, a modification allows the use of an ordinary accumulator, as shown in Figure 2. The horizontal input into the array is broadcast. This design also requires four cycles.

Fig. 1. An array for a matrix-vector product.

Fig. 2. Alternative design.

In general it can be said that the speed of a computation can be increased in two ways. Given a hardware structure, one could increase the clock rate by making individual components faster. Technology is already approaching the upper limits in speed, however. The other alternative is to suitably design the interface between computation and I/O, since it is the latter that usually determines the effective speed of the overall system. Array structures try to improve the interface by breaking up a computation into smaller fragments. This allows the cells making up the computational structure to be kept busy for a longer time than a design where each computation cycle is followed by interfacing through I/O at much lower speed.
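The four-cycle behavior is easy to check in software. The following sketch is our illustration, not the paper's circuit; the function name is ours, and one plausible reading of the cell-level control is assumed (one product term accumulated per row per cycle):

```python
# A software sketch (ours) of the Fig. 1 scheme: matrix entries are
# preloaded into cells, the x values are consumed one per cycle, and
# each row adder accumulates its product term; the rows work in parallel
# in hardware, so the result is ready after n cycles.
def matvec_row_adder(A, x):
    n = len(x)
    row_sums = [0] * len(A)           # one accumulator per row adder
    for cycle in range(n):            # n cycles in the operational mode
        for r, row in enumerate(A):   # simultaneous across rows in hardware
            row_sums[r] += row[cycle] * x[cycle]
    return row_sums

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
x = [1, 0, 2, 1]
print(matvec_row_adder(A, x))  # [11, 27, 43], produced in four cycles
```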
The price that one must pay for the increased throughput is the management of the information flowing through the array. Depending on the design, information can flow through the array in different ways: the processing elements need not be identical, and data may be broadcast to several processing elements.

A general computation may be viewed as a process. Since a computation, broken into steps, can involve different operations, such as indexing, logical deduction, information abstraction, multiplication, addition, and waiting for other partial results to become available, an efficient computation array, even if reconfigurable, must have processing elements that can perform a variety of tasks. This suggests that a pure systolic array, a structure that does not use global data communication and whose processing elements operate synchronously, is likely to be a poor performer in general. For a highly structured computation, a regular array like a systolic one may, of course, be perfectly adequate. Data-flow and other distributed computing structures already provide for processing with a variety of processors; however, these structures do not consider realization of particular algorithms, and are meant for general computing tasks. Semisystolic arrays relax some of the requirements of pure systolic arrays and allow nonidentical cells as well as broadcasting, but they are limited in their arrangement to two dimensions.

We propose that there are advantages to be gained from laying out processing elements in a 3-D arrangement. Not only does this provide a generalization of the usual 2-D planar arrays, but it leaves open the possibility that the cells in each layer could be dedicated to different functions, and that data could flow differently across elements within a layer and across layers. Some elementary 3-D designs for a few signal-processing tasks are described in this paper as evidence that such an approach is feasible.

III. 3-D DESIGNS FOR A FEW SIGNAL-PROCESSING TASKS

CIRCULAR CONVOLUTION
Consider a four-point circular convolution. The convolution c = x * w can be written out as

$$c_i = \sum_{j=0}^{3} x_j\, w_{(i-j) \bmod 4}, \qquad i = 0, 1, 2, 3.$$
Fig. 3. An arrangement to circulate data.
We propose the use of four elements in the data layer, where the data are circulated as shown in Figure 3. In the three-layered structure of Figure 4, the weights w, once loaded on their layer, are not moved around, and information flows down into the accumulator, with each data-layer processing cell forming the product x_i w_j as shown. The first cycle yields c_0. Upon a shift of the data in the next cycle, one obtains c_1, and so on. This three-layered design, given an appropriate number of cells, will work for N-point circular convolution as well.

Fig. 4. A three-layer realization of circular convolution.
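A minimal software sketch of this three-layer behavior (our code; the paper specifies only the hardware arrangement) circulates the data layer one position per cycle while the weight layer stays fixed, so the accumulator emits one c value per cycle:

```python
# A sketch (ours) of the Figs. 3-4 behavior: the data layer circulates one
# position per cycle, the weights stay fixed, and the accumulator layer
# collects one output per cycle.
def circular_convolution(x, w):
    n = len(x)
    # load the data layer so that the cell under weight w_j initially
    # holds x_{(0-j) mod n}; after i circulations it holds x_{(i-j) mod n}
    data = [x[-j % n] for j in range(n)]
    out = []
    for _ in range(n):
        out.append(sum(d * wj for d, wj in zip(data, w)))  # products fall down
        data = [data[-1]] + data[:-1]                      # circulate (Fig. 3)
    return out

print(circular_convolution([1, 2, 3, 4], [1, 0, 0, 0]))  # [1, 2, 3, 4]
```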
DISCRETE HILBERT TRANSFORM
This transform [4-6] has important applications in signal processing. Consider the DHT matrix for N = 8:
$$\begin{bmatrix}
0 & 0.6 & 0 & 0.1 & 0 & -0.1 & 0 & -0.6 \\
-0.6 & 0 & 0.6 & 0 & 0.1 & 0 & -0.1 & 0 \\
0 & -0.6 & 0 & 0.6 & 0 & 0.1 & 0 & -0.1 \\
-0.1 & 0 & -0.6 & 0 & 0.6 & 0 & 0.1 & 0 \\
0 & -0.1 & 0 & -0.6 & 0 & 0.6 & 0 & 0.1 \\
0.1 & 0 & -0.1 & 0 & -0.6 & 0 & 0.6 & 0 \\
0 & 0.1 & 0 & -0.1 & 0 & -0.6 & 0 & 0.6 \\
0.6 & 0 & 0.1 & 0 & -0.1 & 0 & -0.6 & 0
\end{bmatrix}$$
We note that the matrix has considerable structure: all its nonzero entries are either 0.1 or 0.6 in magnitude. A grouping of terms allows us to reduce the number of arithmetic operations. Let EY stand for the Hilbert transform at the even-numbered points, and OY at the odd-numbered points. The constants a and b are the characteristics of the Hilbert-transform matrix for N = 8: a = 0.6, b = 0.1.
We have, for example,

$$y_0 = a(x_1 - x_7) + b(x_3 - x_5),$$

with a similar two-multiplication grouping for each of the other points of EY and OY.
Fig. 5. The eight-point DHT. (Its layers perform memory shift, subtraction, multiplication by the constants a and b, and addition.)

Figure 5 shows how the transform is computed using a four-layered structure. It has the advantage that the computation is highly parallel and the design is
modular and expandable. For a larger N (even) one has only to add more cells of the same kind. The design is very simple, and its performance is better than that of the conventional systolic architecture for the same N = 8 problem [7]. For an odd N, the movement of data during execution is different; it is shown in Figure 6.

To summarize, a preliminary look at multilayered design leads to a procedure with the following properties: (1) it is highly parallel and is expandable simply by adding more cells of the same kind; (2) it divides the hardware into components with different functions. For instance, we have three components in the DHT: one is the memory cell, which can shift its contents; the second is the multiplier; and the third is the row adder. It must be emphasized that the suggested design is meaningful only for the specific example of the DHT, and it has been included to show that a potential exists for exploiting 3-D designs to speed up execution of algorithms.
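As a check on the grouping, the sketch below (ours; the index pattern is read off the circulant DHT matrix given above) computes all eight outputs with two multiplications each, mirroring the subtraction, multiplication, and addition layers of Figure 5:

```python
# A sketch (ours) of the grouped N = 8 DHT of Fig. 5: each output needs one
# pair of subtractions and two multiplications, by the constants a and b.
a, b = 0.6, 0.1

def dht8(x):
    # the circulant structure of the DHT matrix gives, for every i,
    # y_i = a (x_{i+1} - x_{i-1}) + b (x_{i+3} - x_{i-3})  (indices mod 8)
    return [a * (x[(i + 1) % 8] - x[(i - 1) % 8])
            + b * (x[(i + 3) % 8] - x[(i - 3) % 8]) for i in range(8)]

print(dht8([1, 0, 0, 0, 0, 0, 0, 0]))  # reproduces the matrix's first column
```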
Fig. 6. A five-point (odd) DHT, built from the same subtraction, multiplication, and addition layers (forming terms such as a(x_0 - x_3) + b(x_1 - x_2)). Note that it uses more hardware than an eight-point (even) DHT.
A 16-point DHT structure is similar to the 8-point example of Figure 5. In fact, so long as the basis of the structure is the grouping of terms, this design will apply regardless of the size of the problem. Implementations of the Hadamard transform can be similarly obtained; these are slightly more complex and have more layers. We believe that neither the DHT nor the Hadamard-transform implementations qualify as starting points for a systematic procedure for 3-D architectures. All that they establish is that systematic procedures for each signal-processing task may exist.
DISCRETE FOURIER TRANSFORM
Fig. 7. A four-point DFT (a conceptual design).

This important signal-processing task can be accomplished by the three-layer design of Figure 7. The data, once loaded in the top layer, do not circulate. The weights w^i are broadcast in layer 2 in such a fashion that the cells along the
flow path have entries 1, w^i, w^{2i}, etc. This can be accomplished by letting w^i flow in one circuit in layer 2, getting multiplied by the broadcast value along the way. As an example, consider that w^2 has been broadcast. It shows up as w^2 under x_1, w^4 under x_2, w^6 under x_3, and finally w^8 = 1 under x_0. When the number of data points is prime, one can use the Rader algorithm to reduce the DFT to a cyclic convolution and use the realization of Figure 4 for its implementation. If the number is composite, one may represent the data in an array and exploit multidimensional signal-processing techniques to increase efficiency.
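The broadcast mechanism is easy to simulate. In this sketch (ours; the sign convention for w is our assumption, since the text does not fix it), broadcasting w^k builds the running entries 1, w^k, w^{2k}, ... along the flow path, and the accumulator forms X_k:

```python
# A sketch (ours) of the Fig. 7 layer-2 broadcast: for each k the running
# product places w^{jk} under x_j, and the accumulator forms
# X_k = sum_j x_j w^{jk}.
import cmath

def dft_by_broadcast(x):
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)  # principal n-th root of unity (assumed sign)
    X = []
    for k in range(n):                 # broadcast w^k to layer 2
        wk = w ** k
        cell, acc = 1 + 0j, 0j         # the entry under x_0 is always 1
        for j in range(n):
            acc += x[j] * cell         # product formed under x_j
            cell *= wk                 # multiplied by the broadcast value en route
        X.append(acc)
    return X

print(dft_by_broadcast([1, 1, 1, 1]))  # approximately [4, 0, 0, 0]
```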
CORRELATION
Let us consider correlation for sequences {w_0, w_1, ..., w_k} and {x_0, x_1, x_2, ..., x_n}. The resultant sequence {c_0, c_1, ..., c_{n-k}} is defined by

$$c_i = \sum_{l=0}^{k} w_l\, x_{l+i}, \qquad n > k.$$
Fig. 8. Multilayered structure for correlation.

Consider that k = 3. A multilayered array design is as shown in Figure 8. The top layer contains 2k cells, where the w's circulate with a gap between two
successive values. The second layer passes the x's through, again with a gap between successive values. The third layer carries out multiplication of the contents of the cells above each of its cells and adds the product to the current contents. The contents of cells 1, 2, 3, 4 of the third layer are as follows:

1st step: w_0x_0, 0, 0, 0;
2nd step: w_0x_0, w_0x_1, 0, 0;
3rd step: w_0x_0 + w_1x_1, w_0x_1, w_0x_2, 0;
4th step: w_0x_0 + w_1x_1, w_0x_1 + w_1x_2, w_0x_2, w_0x_3;
5th step: w_0x_0 + w_1x_1 + w_2x_2, w_0x_1 + w_1x_2, w_0x_2 + w_1x_3, w_0x_3;

etc. The array can be so designed that the contents of the third-layer cell with a new w_0 in the corresponding top-layer cell are pushed down to the fourth layer, and out as the sequence c_0, c_1, c_2, .... The output values begin to be sent out at the 9th step, at the rate of one each step. In general, after a startup of 2k steps, one will have the stream of correlation values coming out at the optimal rate of one per step.
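The values pushed out by the fourth layer are just the correlations defined above, which the following sketch (ours) computes directly; it can be used to check the cell contents listed for each step:

```python
# A sketch (ours) of the correlation the Fig. 8 array computes:
# c_i = sum_{l=0}^{k} w_l x_{l+i}, for i = 0, ..., n - k.
def correlate(w, x):
    k = len(w) - 1
    return [sum(w[l] * x[i + l] for l in range(k + 1))
            for i in range(len(x) - k)]

# with k = 3, the array streams these values out, one per step,
# after its startup period
print(correlate([1, 2, 1, 1], [1, 1, 2, 3, 5, 8]))  # [8, 13, 21]
```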
IV. TWO-LAYERED ARCHITECTURES
The designs in the previous section are essentially conceptual; they are not being proposed as implementable in the near future unless appropriate advances take place in microelectronic technology. The objective of those designs was to establish that multilayered implementations are possible for different kinds of problems.
Fig. 9. The standard array to multiply matrices. The numbers inside the circles (processing elements) are the indices of the product matrix.
TABLE 1
Input/Output Activity of the Standard Array (3 × 3 case)

Step    Input                               Output
1       a11; b11
2       a12, a21; b12, b21
3       a13, a22, a31; b13, b22, b31        c11
4       a23, a32; b23, b32                  c12, c21
5       a33; b33                            c13, c22, c31
6                                           c23, c32
7                                           c33
Fig. 10. The cylindrical array for matrix multiplication.

Fig. 11. Matrix multiplication on the mesh array. The double lines are connections on the top layer, and the single lines are connections on the bottom layer.
We now consider another problem, that of matrix multiplication, where the nonplanar design can be imagined to be implementable, and which furthermore offers a speed advantage over the standard planar array. For the problem of multiplying two 3 × 3 matrices, the standard array [17] is given in Figure 9. It can easily be shown that the number of steps required to solve this problem is 3n - 2, or approximately 3n. The input/output activity table for this array is given as Table 1. Clearly, the processors are not being used optimally, since several zeros are passed through as the computation proceeds.
Fig. 12. The cylindrical array for n = 10 in a two-layer implementation: (a) top view; (b) end view.
Porter and Aravena have recently shown [18] that the "cylindrical" architecture of Figure 10 gives a speedup over the standard array. The input/output table for the cylindrical array is given below:
Step    Input                                Output
1       a11, a21, a31; b11, b12, b13
2       a12, a22, a32; b21, b22, b23
3       a13, a23, a33; b31, b32, b33         c11, c22, c33
4                                            c12, c23, c31
5                                            c13, c21, c32
This array of Porter and Aravena requires only five steps, or 2n - 1 steps (approximately 2n), to solve the matrix-multiplication problem. Several issues regarding implementation, and applications to different problems such as matrix inversion (using the Faddeev algorithm) and band-matrix multiplication, are taken up in several technical reports by Porter and Aravena, El-Amawy, and the author ([18-21] and forthcoming papers).

Another design that provides the same speedup as the cylindrical one is the mesh array introduced by the author recently [19]. As shown in Figure 11, the data in a mesh array are fed like those in the cylindrical one, though the order in which the elements of the product matrix appear is different.

Both the cylindrical and the mesh designs can be implemented in two layers. The cylindrical structure can be squashed so that half the processors in each row lie on one layer, and the remaining half on the other; this is illustrated in Figure 12 for the case n = 10. The mesh array can be realized even more easily, because the processors can all be defined on the same plane, and only the connections need to be laid out in two layers. Thus the connections shown with double lines in Figure 11 could be on the top layer, and the connections with single lines could be on the bottom layer.
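The five-step schedule generalizes. In the sketch below (our formula, inferred from the table above rather than stated in the paper), output c_ij emerges at step n + ((j - i) mod n), so the last value appears at step 2n - 1, versus 3n - 2 for the standard array:

```python
# A sketch (ours) of the cylindrical/mesh output schedule inferred from the
# table above: c_ij emerges at step n + ((j - i) mod n), for 2n - 1 steps
# in all, versus 3n - 2 on the standard planar array of Fig. 9.
def cylindrical_schedule(n):
    steps = {}
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            steps.setdefault(n + (j - i) % n, []).append(f"c{i}{j}")
    return steps

for step, outs in sorted(cylindrical_schedule(3).items()):
    print(step, outs)
# 3 ['c11', 'c22', 'c33']
# 4 ['c12', 'c23', 'c31']
# 5 ['c13', 'c21', 'c32']
```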
V. MULTIPLICATION OF TWO BAND MATRICES
The Kung-Leiserson hexagonal array for matrix multiplication [17] requires the injection of two zeros between adjacent data points, and consequently its efficiency is only one-third of the maximum possible. We now show that maximum efficiency can be obtained by using the two-layered array structures of Section IV. While either the cylindrical or the mesh design can be used, we illustrate our method using the mesh design [20].
Consider C = AB, where

$$A = \begin{bmatrix}
a_{11} & a_{12} & & & & 0 \\
a_{21} & a_{22} & a_{23} & & & \\
& a_{32} & a_{33} & a_{34} & & \\
& & a_{43} & \ddots & \ddots & \\
0 & & & \ddots & \ddots & \ddots
\end{bmatrix}, \qquad
B = \begin{bmatrix}
b_{11} & b_{12} & & & & 0 \\
b_{21} & b_{22} & b_{23} & & & \\
& b_{32} & b_{33} & b_{34} & & \\
& & b_{43} & \ddots & \ddots & \\
0 & & & \ddots & \ddots & \ddots
\end{bmatrix},$$

and

$$C = \begin{bmatrix}
c_{11} & c_{12} & c_{13} & & & \\
c_{21} & c_{22} & c_{23} & c_{24} & & \\
c_{31} & c_{32} & c_{33} & c_{34} & c_{35} & \\
& c_{42} & c_{43} & c_{44} & c_{45} & c_{46} \\
& & c_{53} & c_{54} & c_{55} & c_{56} \\
& & & c_{64} & c_{65} & c_{66}
\end{bmatrix}.$$
Using a two-layered structure, the a's and b's can be fed continuously as shown in Figure 13. The nine processors generate the c values as

c11, c44, ...        c12, (c42), c45, ...        c13, (c43), c46, ...
c21, (c24), c54, ...        c22, c55, ...        c23, (c53), c56, ...
c31, (c34), c64, ...        c32, (c35), c65, ...        c33, c66, ...
The timing diagram is as follows:

t = 1: c11;
t = 2: c12, c22;
t = 3: c31;
t = 4: c33, c13, c23, c21, c32, c44.

As is clear, the c's are not generated synchronously. For every change of the index i in the a's and of j in the b's, one needs to incorporate tag information (which could be represented by a single bit) that pushes out the corresponding c. The dimensions of the array depend on the size of the band: since our matrices A and B had bandwidth 3, our example array has 9 elements.
Fig. 13. Band-matrix multiplication on a mesh array.
The Kung-Leiserson array has the processors working only one-third of the time, while in our design each processor works continuously, so the data rate is three times higher.
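The bandwidth bookkeeping behind the nine-processor array can be verified numerically. The sketch below (ours; the mod-3 processor assignment is our reading of the output streams listed above) checks that the product of two bandwidth-3 matrices has bandwidth 5, so every nonzero c_ij falls into one of nine residue classes (i mod 3, j mod 3), one per processor:

```python
# A sketch (ours) of the Section V bookkeeping: the product of two
# tridiagonal (bandwidth-3) matrices is pentadiagonal (bandwidth 5), so a
# 3 x 3 mesh with one processor per (i mod 3, j mod 3) class covers all c_ij.
import numpy as np

n = 6
rng = np.random.default_rng(0)
band = lambda m: np.triu(np.tril(m, 1), -1)   # keep the three central diagonals
A = band(rng.integers(1, 9, (n, n)))
B = band(rng.integers(1, 9, (n, n)))
C = A @ B
# every nonzero of C lies within |i - j| <= 2, i.e. bandwidth 5
assert all(abs(i - j) <= 2 for i, j in zip(*np.nonzero(C)))
print(C)
```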
VI. CONCLUSIONS
This paper presents multilayered array designs for certain computing tasks. We have seen that layers that circulate data can be used in tasks that involve a circular operation, as in circular convolution, as well as a noncircular operation, as in correlation. These designs have not been obtained by a systematic procedure, and many different designs for the same problem may exist. Practical designs for performing matrix multiplication in a two-layered array have also been presented. The main aim of the paper was to establish that multilayered designs can be obtained for given algorithms, and that these may provide a speedup over planar arrays.
REFERENCES

1. R. D. Etchells, J. Grinberg, and G. R. Nudd, Proc. Society of Photo-Optical Instrumentation Engineers, Apr. 1981, p. 64.
2. J. Grinberg, G. R. Nudd, and R. D. Etchells, A cellular VLSI architecture, Computer 17:59-81 (Jan. 1984).
3. H. S. Stone, Computer research in Japan, Computer 17:26-32 (Mar. 1984).
4. S. C. Kak, The discrete Hilbert transform, Proc. IEEE 58:585-586 (Apr. 1970).
5. S. C. Kak, The discrete finite Hilbert transform, Indian J. Pure Appl. Math. 8:1385-1390 (Nov. 1977).
6. Tse, The Discrete Hilbert Transform, M.S. Thesis, Louisiana State Univ., July 1984.
7. H. T. Kung, Why systolic architectures?, Computer 15:37-46 (Jan. 1982).
8. D. E. Dudgeon and R. M. Mersereau, Multidimensional Digital Signal Processing, Prentice-Hall, 1984.
9. W. A. Porter and J. L. Aravena, Array Architectures for Estimation and Control Functions, LSU Report, 1985.
10. H. T. Kung, Putting Inner Loops Automatically in Silicon, Carnegie-Mellon University Report, 1985.
11. Guo-Jie Li and B. Wah, The design of optimal systolic arrays, IEEE Trans. Comput. 34(1):66-77 (Jan. 1985).
12. P. R. Cappello and K. Steiglitz, Unifying VLSI array designs with geometric transformations, in Proceedings of the International Conference on Parallel Processing, 1983, pp. 448-457.
13. J. A. B. Fortes, Algorithm Transformations for Parallel Processing and VLSI Architecture Design, Ph.D. Dissertation, Univ. of Southern California, Los Angeles, Dec. 1983.
14. H. T. Kung et al. (Eds.), VLSI Systems and Computations, Computer Science Press, Rockville, Md., Oct. 1981, pp. 255-264.
15. E. Horowitz, VLSI architecture for matrix computations, in Proceedings of the International Conference on Parallel Processing, Aug. 1979, pp. 124-127.
16. K. Hwang and Y. H. Cheng, VLSI computing structure for solving large-scale linear systems of equations, in Proceedings of the International Conference on Parallel Processing, Aug. 1980, pp. 217-227.
17. C. Mead and L. Conway, Introduction to VLSI Systems, Addison-Wesley, 1980.
18. W. A. Porter and J. Aravena, Cylindrical Computation of Matrix Products, Tech. Report, Electrical and Computer Engineering Dept., Louisiana State Univ., 1985.
19. S. C. Kak, A Mesh Array for Matrix Multiplication, Tech. Report, Electrical and Computer Engineering Dept., Louisiana State Univ., 1986.
20. S. C. Kak, Multiplication of Two Band Matrices, Tech. Report, Electrical and Computer Engineering Dept., Louisiana State Univ., 7 Dec. 1985.
21. W. A. Porter and J. Aravena, Orbital architectures, Tech. Report, Electrical and Computer Engineering Dept., Louisiana State Univ., Jan. 1986.

Received 15 August 1987