INTEGRATION, the VLSI journal 16 (1993) 293-313 Elsevier
VLSI implementations of number theoretic techniques in signal processing

G.A. Jullien, N.M. Wigley and W.C. Miller
VLSI Research Group, University of Windsor, Windsor, Ontario, Canada N9B 3P4
Received 8 September 1992 Revised 8 February 1993
Abstract. This paper explores novel techniques involving number theoretic concepts to
perform real-time digital signal processing for high bandwidth data stream applications. For most data stream signal processing algorithms, the arithmetic manipulations are simple in form (cascades of additions and multiplications in a well defined structure) but the number of operations that have to be computed every second can be large. This paper discusses ways in which recently introduced number theoretic mapping techniques can be used to perform DSP operations by both reducing the amount of hardware involved in the circuitry and by allowing the construction of very benign architectures, down to the individual cells. Such architectures can be used in aggressive VLSI/ULSI implementations. We restrict ourselves to the computation of linear filter and transform algorithms, in inner product form, which probably account for the vast majority of digital signal processing functions implemented commercially.
Keywords. Polynomial rings; residue number systems; inner product computations; bit-level systolic arrays; pipelined computation; VLSI signal processors; dynamic logic
1. Introduction

There is a constant thrust to decrease the delay associated with arithmetic circuits for the purpose of increasing the rate at which processors can operate. Recent results have shown that even mature CMOS technologies are capable of synchronous pipelined arithmetic rates in the hundreds of MHz range [17]. In general purpose computational systems, there is normally a leveling factor based on the need to synchronize (either with clocks, or hand-shaking) disparate computational elements, such as those found in current DSP chips. For data
stream architectures, however, that can operate as systolic arrays, the synchronizing requirement is very straightforward; the disadvantage is the limited applicability of such special purpose architectures. With the advent of ASIC technology, silicon foundries, and the widespread use of advanced software for fast custom design, it is quite possible to consider the use of such special systems for even small production runs. Data stream, high throughput DSP systems are such a target group.
This paper is concerned with constructing integrated systems using bit-level systolic array concepts [8,12]. In such architectures, the internal pipeline rates are matched to the signal data rates, and every bit (or group of bits) is pipelined. This is in contrast to the design of cascaded combinational circuits for arithmetic processors, where the goal is to reduce the critical path time through the cascade [11].
Graham A. Jullien was educated in the UK, receiving degrees from the universities of Loughborough, Birmingham and Aston. Since 1969 he has been with the Department of Electrical Engineering at the University of Windsor, Ontario, Canada, where he currently holds the rank of University Professor. From 1975 to 1976 he held the position of Visiting Research Engineer at the Central Research Laboratories of EMI in the UK. He teaches courses in signal processing, electronic circuits, computer systems, VLSI design and number theoretic techniques. His current research interests are in the area of algorithms, arithmetic and VLSI for high speed digital signal processing, and he has published widely in this field of interest. He currently directs the VLSI Research Group at Windsor, and actively consults in the area of digital control systems, signal processing and computer systems. He was vice-chairman of the board of directors of the Canadian Microelectronics Corporation from 1991-1993, and is the Windsor University Centre Coordinator in the Micronet Network of Centres of Excellence. He has co-authored the IEEE press book "Residue Number System Arithmetic: Modern Applications in Digital Signal Processing", and recently hosted the 11th IEEE International Symposium on Computer Arithmetic, in Windsor, Ontario. Neil M. Wigley was educated at the University of California, Berkeley, where he received the BA (1959) and PhD (1963) degrees in Mathematics. He is a Professor of Mathematics at the University of Windsor, Ontario, Canada. After many years of research in Partial Differential Equations, he has recently expanded his research interests to applications of Number Theory and Algebra in Signal Processing and VLSI Architectures. He is currently leading a research team into the construction of inner product machines based on finite polynomial ring theory.
William C. Miller was born in Toronto, Ontario, Canada. He received the B.S.E. degree in Electrical Engineering from the University of Michigan, Ann Arbor, in 1960, and the M.A.Sc. and Ph.D. degrees in Electrical Engineering from the University of Waterloo, Waterloo, Ontario in 1961 and 1969, respectively. He joined the Department of Electrical Engineering at the University of Windsor in 1968, where he currently holds the rank of Professor. His research interests are oriented towards digital signal processing and the design of massively parallel VLSI processor architectures for application specific problems in the area of image processing relating to machine vision. He also teaches courses in the areas of circuit theory, signal processing and system theory. Dr. Miller is also engaged extensively in industrial consulting work. He was director of the CAD/CAM centre at the University of Windsor for a two year period ending in 1988. Dr. Miller is a registered Professional Engineer in the Province of Ontario, and a member of the IEEE.
With regard to speed requirements, data throughput rates are dependent upon the signal bandwidth. For example, audio and modem data transmission rates are in the range of tens to hundreds of kHz, and standard video and some radar system data rates are in the tens of MHz range. We see that the throughput rate for systolic solutions in these application areas is at least an order of magnitude lower than the speeds recently reported for mature CMOS technologies. Even uncompressed HDTV data rates (in the range of 100MHz) are a factor of 5 lower than reported speeds [17]. It is therefore useful to consider trading off speed for greater functionality within each pipeline stage, and reaping the benefits of reduced area and power consumption.
This paper, therefore, also explores these possibilities, by combining a TSPC (true single phase clock) pipeline latch [1] with dense multiple output NFET blocks based on minimized binary trees; we term such blocks Switching Trees [7]. The paper also introduces a new synthesizer for pipelined switching trees that allows on-the-fly module generation of the bit-level blocks.
The first part of the paper briefly reviews the number theoretic techniques we have developed to allow large dynamic range computations to be implemented over massively parallel small rings or fields. We then introduce the circuit design procedure, based on embedded switching trees, and the new module generator. We conclude with a comparison study using the module generator, a hand laid-out switching tree block, and a PLA implementation of a typical switching function required by our number theoretic computational technique.

2. Computing over finite rings
Number theoretic architectures have traditionally been based on the Residue Number System (RNS), but the disadvantages of RNS techniques (non-homogeneous data conversion architectures) outweigh the advantages of carry-free computation. We have recently introduced a modulus replication approach based on a polynomial ring mapping strategy [14]; however, unlike the algebraic integer mapping procedure [4], our technique allows simple, error-free mapping of incoming integer streams, and homogeneous conversion architectures at the output. The main body of the computation is performed in identically replicated linear bit-level pipelines; this has important ramifications in terms of fault tolerance and testability when implemented in dense technologies, such as WSI and ULSI. A new mapping strategy for modulus replication has been recently introduced [16] that allows large dynamic range computations over very small rings, and we will briefly review these techniques in this section.

2.1. Integer computations over finite rings
We deal with rings, or fields, that are used for direct computation, and rings that are isomorphic to direct products of the implementation rings or extensions
of them. A given digital signal processing algorithm is mapped from real or complex integer arithmetic to the implementation rings, the computation is carried out there, and the result is then mapped back to obtain the final answer. Let m be a positive integer. We denote by R(m) the ring of integers modulo m, i.e.

    R(m) = {S: ⊕_m, ⊗_m},   S = {0, 1, ..., m − 1}        (1)
Our notation a ⊕_m b and a ⊗_m b implies the residue reduction of a and b modulo m. If R_1 and R_2 are any two rings then we can define the cross-product ring R_1 × R_2 as the set of pairs (s_1, s_2) ∈ S_1 × S_2, with addition and multiplication defined componentwise, i.e. by

    (a_1, a_2) ⊕_{R_1×R_2} (b_1, b_2) = (a_1 ⊕_{R_1} b_1, a_2 ⊕_{R_2} b_2)
    (a_1, a_2) ⊗_{R_1×R_2} (b_1, b_2) = (a_1 ⊗_{R_1} b_1, a_2 ⊗_{R_2} b_2)        (2)
Using a set of rings, defined by modulo operations with relatively prime moduli {m_k}, M = Π_{k=1..L} m_k, there is an isomorphism between R(M) and the direct product of {R(m_k)}; this means that calculations over R(M) can be effectively carried out over each R(m_k), independently and in parallel. A final mapping to R(M) is performed at the end of a chain of calculations. We have therefore broken down a calculation set in a large dynamic range, M, to a set of L calculations set in small dynamic ranges given by the {m_k}. The final mapping is found from the CRT:

    X = Σ_M [ m̂_k ⊗_M ( m̂_k⁻¹ ⊗_{m_k} x_k ) ],   k = 1, ..., L        (3)

with m̂_k = M/m_k, X ∈ R(M), x_k ∈ R(m_k) and (·)⁻¹ the multiplicative inverse operator. The notation Σ_M indicates summation over the ring R(M).
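As a concrete illustration of (3) (not part of the original text, and using the {3, 5, 7} moduli adopted later in the paper), the short Python sketch below recombines the residue channels into an element of R(105); the function name is ours.

    # Sketch of the CRT recombination of Eq. (3) for the moduli {3, 5, 7}, M = 105.
    # Illustrative only; the function name is ours, not the authors'.
    def crt_reconstruct(residues, moduli):
        M = 1
        for m in moduli:
            M *= m
        X = 0
        for x_k, m_k in zip(residues, moduli):
            m_hat = M // m_k                       # m^_k = M / m_k
            m_hat_inv = pow(m_hat, -1, m_k)        # (m^_k)^-1 over R(m_k)
            X = (X + m_hat * ((m_hat_inv * x_k) % m_k)) % M   # summation over R(M)
        return X

    moduli = (3, 5, 7)
    X = 59                                         # element of R(105)
    assert crt_reconstruct([X % m for m in moduli], moduli) == X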
2.2. Polynomial rings and quotient rings

We let R[X] denote the ring of polynomials in the indeterminate X: R[X] = {Σ_{k=0..n} a_k X^k : a_k ∈ R, n ≥ 0}. If X_1, X_2, ..., X_s are indeterminates then we define the ring R[X_1, X_2, ..., X_s] to be the ring of multivariate polynomials in the indeterminates. For a given polynomial g(X) ∈ R[X] we consider the set (g(X)) of all (polynomial) multiples of g(X). This set (g(X)) is called the 'ideal' generated by the polynomial g(X) in the ring R[X]. The quotient ring R[X]/(g(X)) is then defined to consist of all elements of the form f(X) + (g(X)), with f(X) ∈ R[X]. The more usual way of considering the quotient ring is to consider sums and products of polynomials reduced according to the equation g(X) = 0, that is, to consider the remainder after division by g(X).
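To make the quotient ring construction concrete, the following sketch (ours, not taken from the paper) multiplies two polynomials in R[X]/(g(X)) with R = Z_m by forming the ordinary product and then keeping the remainder after division by a monic g(X), with coefficients reduced modulo m.

    # Minimal sketch of multiplication in R[X]/(g(X)) with R = Z_m.
    # Coefficient lists are lowest degree first; g is assumed monic.
    def poly_mul_mod(a, b, g, m):
        # ordinary polynomial product, coefficients reduced mod m
        prod = [0] * (len(a) + len(b) - 1)
        for i, ai in enumerate(a):
            for j, bj in enumerate(b):
                prod[i + j] = (prod[i + j] + ai * bj) % m
        # remainder after division by g(X)
        while len(prod) >= len(g):
            lead = prod[-1]
            shift = len(prod) - len(g)
            for k, gk in enumerate(g):
                prod[shift + k] = (prod[shift + k] - lead * gk) % m
            prod.pop()                     # leading term is now zero
        return prod

    # Example: (2 + 3X)(1 + 4X) in Z_5[X]/(X^2 + 1)
    print(poly_mul_mod([2, 3], [1, 4], g=[1, 0, 1], m=5))   # -> [0, 1]

For g(X) = X² + 1 and m = 5 the result [0, 1] represents X, exactly as complex multiplication would give (2 + 3i)(1 + 4i) = −10 + 11i ≡ i (mod 5).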
2.2.1. Quadratic residue number system

A special case of the use of quotient rings is in the emulation of complex data processing using computations over quadratic residue rings [5,6].
In this situation, the polynomial g(X) has the form X² + 1 = 0, so that the polynomial division is the same as that for complex numbers but, instead, takes place over a finite polynomial ring. If we choose a base field such that the roots of g(X) exist (this does not happen over the reals) then a complex number can be written out as an element of the field. Using a separate representation of the conjugate of the number (again an element of the field) we may perform addition and multiplication, componentwise, over the base field. The conjugate, and inverse, mappings represent the overhead incurred in providing complex calculations via independent computational paths. The formal definition for the ring is given below:

Quadratic residue ring: QR(m_k) = {S: ⊗, ⊕; S = {A°, A*}} with A°, A* ∈ R(m_k) and A° = a_r + j·a_i, A* = a_r − j·a_i; j = √−1; a_r, a_i, j ∈ R(m_k), m_k = Π_i p_i^{d_i}, where all prime factors have the form p_i = 4k_i + 1. A° is referred to as the normal component of element A = (A°, A*) and A* as the conjugate component of element A. The multiplication and addition operators both compute componentwise.
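For illustration (our sketch, not the authors' implementation), the QRNS mapping can be exercised over R(5), where j = 2 satisfies j² ≡ −1; the helper names below are ours.

    # QRNS sketch over R(5): map a_r + i*a_i to (A°, A*) and multiply componentwise.
    m, j = 5, 2                      # j*j % 5 == 4 == -1 (mod 5)

    def to_qrns(ar, ai):
        return ((ar + j * ai) % m, (ar - j * ai) % m)

    def from_qrns(A):
        An, Ac = A                   # normal and conjugate components
        ar = (An + Ac) * pow(2, -1, m) % m
        ai = (An - Ac) * pow(2 * j, -1, m) % m
        return ar, ai

    def qrns_mul(A, B):
        return (A[0] * B[0] % m, A[1] * B[1] % m)

    # (1 + 2i)(2 + i) = 0 + 5i, which is (0, 0) modulo 5
    A, B = to_qrns(1, 2), to_qrns(2, 1)
    print(from_qrns(qrns_mul(A, B)))     # -> (0, 0)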
2.3. Modulus replication

In the above formulations it is seen that an adequate dynamic range requires the use of large moduli. An increase in the size of the dynamic range is then effected by the incorporation of additional moduli. Each new modulus must, of course, be relatively prime to each of the existing moduli, and the new moduli must also satisfy the QRNS condition of having prime factors which are all congruent to 1 (mod 4). The Modulus Replication Residue Number System (MRRNS) [14] is a technique that allows both real and complex inner product processing with replications of a very small number of moderately sized rings (m_k ≤ 32). This technique differs from the above in that we use an algebraic formulation of the problem which leads to direct-product rings, namely quotient rings with respect to an ideal whose generators split into smaller factors. The method has the following advantages:
(1) There are no quantization problems. The data, either real or complex, are assumed to be of a given fixed bit length. No approximations or scaling are used in encoding the data; this is a major advantage compared to the algebraic integer approach [4].
(2) The polynomials used are of a general nature, so that no restrictions are placed on the prime divisors of the moduli, except in the case of a QRNS representation of complex data, in which case the condition is the usual one of p = 1 (mod 4) for prime divisors p of the modulus M.
(3) The same small moduli can be used many times, which allows VLSI implementations of systems which can process data of a large bit length, using direct products of many copies of modular rings with small moduli.
(4) Encoding is a simple matter of diverting the bits of the input data to the proper channels. Decoding is only complicated insofar as the Chinese
Remainder Theorem is used, and even then only for a limited number of small moduli. Scaling, if used in decoding, is simplified by the ring structures used; certain monomials can be ignored as they represent insignificant digits.
Input samples of the data stream are mapped to multivariate polynomials. The indeterminates represent various powers of 2, thus allowing the data to be expressed as polynomials with small coefficients. These coefficients are then mapped to a direct product ring consisting of many copies of Z_M (the integer ring, modulo M) as factors. Since the above-mentioned coefficients are small, the modulus M does not have to be very large. The direct product repeats the factor Z_M many times, so that the same prime divisors of M are used repeatedly, thus obviating the need for additional, larger primes. The mathematical derivations are somewhat tedious, and the reader is referred to a more complete description in [16]. We will, rather, illustrate the technique using an example of a complex inner product computation (say a direct implementation of a small radix DFT).

2.4. An example
We write the integers representing the real and imaginary parts of the data, together with the coefficients of the FFT, as polynomials in the variables W, X, Y and Z, where W = 2, X = 4, Y = 16, and Z = 256. With this notation, any positive integer < 2^16 can be written in a unique fashion as a sum:

    Σ_{i1,i2,i3,i4 ∈ {0,1}}  a_{i1 i2 i3 i4} W^{i1} X^{i2} Y^{i3} Z^{i4}        (4)
with the coefficients equal to 0 or 1. Similarly, any negative integer > −2^16 can be written in the same form with coefficients 0 or −1 (note that the use of 0 and ±1 implies a signed-digit representation of the coefficients). We will use a moduli set {3, 5, 7}. The moduli 3 and 7 do not support a complex unit (and so the QRNS representation cannot be used), so we add an additional indeterminate, which we call T, to represent the complex unit; we use the polynomial T(T² − 1) = 0 to define the mapping [15]. This polynomial always has three roots in any finite ring Z_m, provided m > 2; the penalty we pay for this modification is a 50% increase in the number of rings in the direct product; the advantage is that these rings are very small. Since each of the 'bit indeterminates' also forms a 1st order polynomial, we may use the same 3-root polynomial to form the direct product mapping. A notable feature of this mapping is that the complex operator and the bit operators are interchangeable, allowing a variety of binary representations of complex numbers to be simply mapped to the direct product ring. This map is performed by evaluating each of the five variables W, X, Y, Z, and T at each of the three roots 0, +1 and −1. This results in 3^5 = 243 results for each of the moduli 3, 5, and 7. Observe that the map (a tensor product) is very simple, consisting of nothing more difficult than sign changes and additions.
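A minimal sketch of this forward map (ours, not the authors' software) for a single modulus and real data is shown below: the input is written as a bit polynomial in W, X, Y, Z and every indeterminate is evaluated at the three roots 0, +1, −1 modulo m. The complex indeterminate T and the inverse interpolation are omitted; names are ours.

    # Forward MRRNS map for one modulus m, real data, indeterminates W, X, Y, Z.
    from itertools import product

    def bit_coefficients(n, nvars=4):
        """Coefficient a[(i1,..,i4)] is the bit of n at position i1 + 2*i2 + 4*i3 + 8*i4."""
        return {idx: (n >> sum(b << k for k, b in enumerate(idx))) & 1
                for idx in product((0, 1), repeat=nvars)}

    def eval_monomial(roots, idx):
        v = 1
        for r, i in zip(roots, idx):
            v *= r ** i                  # exponent is 0 or 1
        return v

    def forward_map(n, m, nvars=4):
        """Evaluate the bit polynomial at all 3**nvars root combinations, modulo m."""
        coeffs = bit_coefficients(n, nvars)
        return {roots: sum(a * eval_monomial(roots, idx) for idx, a in coeffs.items()) % m
                for roots in product((0, 1, -1), repeat=nvars)}

    channels = forward_map(0xBEEF, m=5)  # 3**4 = 81 small residues modulo 5
    print(len(channels), channels[(1, -1, 1, -1)])

As the paper notes, nothing more than sign changes and additions is involved; the decoding side (per-variable interpolation from the three sample points, the CRT, and scaling) is described in [13,16].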
Fig. 1. Mapping for an 8-bit system.

As an illustration, Fig. 1 depicts the forward mapping of an 8-bit integer to three indeterminates: W, X, Y, producing 27 elements for each bit. The mapping elements for bit 5 are shown explicitly. Each mapping layer corresponds to a separate bit; the monomials corresponding to that bit position are shown alongside the map layer. Note that the mapping will be performed for each of the three moduli: 3, 5, 7. For complex data, we include a further indeterminate that represents the complex operator. The ideal used for reduction of the polynomials based on this indeterminate is identical in form to those used for the bit position indeterminates [16]. By using an inverse mapping procedure of that shown in Fig. 1, the CRT, and a combined scaling algorithm, we can finally combine the coefficients of the data stream output to give coefficients in the ring Z_105. The input word lengths are selected in such a way that modular overflow is either not possible or has very low probability. Typical scaling and conversion techniques can be found in [13]; a discussion of statistical error distributions is found in [16].
3. VLSI implementation

If we restrict ourselves to inner product type computations, then both the mapping procedures and the main algorithmic computation can be performed using linear systolic arrays. Each of the computations is over a base field (mod 3, 5, or 7) in which each element can be represented by 3 bits (2 bits for mod 3). The most general computational block we will require is thus a 6-bit switching circuit. The blocks, unfortunately, do not have the benign decomposition properties of binary arithmetic, where it is possible to build complete arithmetic circuits from 3-bit input, 2-bit output full adders. The traditional approach for residue blocks has been to suggest the use of ROMs, and this still remains the preferred implementation procedure within the residue arithmetic community [10].
Fig. 2. Embedding a switching tree in a true single phase D-latch.
For small numbers of inputs it is possible to consider implementation procedures for look-up tables other than traditional 2-dimensional ROM decompositions. In particular we can implement the ROM with a maximum decomposition strategy (binary decision tree), such that all decoders are reduced to inverters. We may then apply a graphical minimization strategy to the tree structure in order to reduce the large number of elements required in the original truth table description. For a bit-level systolic array implementation, we need to pipeline each output bit from the switching block; it therefore makes sense to embed the minimized tree structure within a pipelining latch.
3.1. Minimized ROM structures (switching trees)

The approach we have evolved is to generate a full binary tree, program the bottom of the tree (remove unwanted transistors) and then minimize the resulting structure based on two simple graph theory rules [7]. In doing this we do not invoke any concepts from Boolean algebra, which may not yield the best transistor configurations (including PLA configurations).
Fig. 3. A full binary tree.
Restricting the trees to be n-channel blocks and evaluating only a single node, we can build massively pipelined systems, where every evaluation node is pipelined. The recently introduced true single phase clocking system [1] provides an excellent, stable, pipelining technique for quite complex trees.

3.2. Embedded single phase clocked latch

The complete single phase clocked latch, with embedded switching tree, is shown in Fig. 2. Since we are only interested in implementing n-channel logic blocks, we use a single inverter p-channel block at the output of each n-channel block. The tree is designed as an n-dimensional ROM (binary tree) where n is the number of input variables, as shown in Fig. 3. Our notation, for this figure, represents transistors whose gates are driven by the true logic input as arcs, \ ; the other arc, / , represents transistors whose gates are driven by the complement of the logic input. By removing selected transistors from the bottom of the tree, we can implement any arbitrary truth table. A full binary tree possesses interesting qualities as far as a series chain discharge block in dynamic logic is concerned. In the full tree we see that, for stable logic inputs, only a single series path connects the top node to one of the bottom nodes, and the capacitance at every node in each of the possible series paths is only 3 source/drain capacitances in parallel. Our minimization technique is based on the application of two simple graph reduction rules. We find this approach useful in that it allows a well established relationship between reduced tree structure and silicon layout that is essential for both hand custom layout and module generation approaches for complex multiple output trees.

3.3. Graph based reduction

A tree represented by a graph can be denoted as G = {X, V}, where X is a set of edges (n-channel transistors) {x_{i,j}}, and V is the vertex set of nodes {v_{i,j}}. An edge x_{i,j} consists of elements (ct, v_{k,l}), where i and k are tree levels, j ∈ [0, 2^{i+1} − 1], l ∈ [0, 2^k − 1], and connection type ct ∈ {T, F, W}. The inputs to the tree are g_i ∈ {0, 1}. If g_i = 1, then the path takes edge T if it is present. If g_i = 0, the path takes edge F if it is present. W represents an arc which is a wire, or link, connection, and is only present following the successful application of a reduction rule. A path, P_{(i,j),(k,l)}, is the connection from node v_{i,j} to node v_{k,l}, constructed by edges. A full path connects node v_{0,0} to node v_{n,l}, where n is the height of the tree. A truth table is mapped onto a full tree by removing a sub-set of edges {x_{n,j}}, j ∈ [0, 2^{n+1} − 1], from a full tree based on the set of zeros in the table.

3.3.1. Graph reduction rules

The following two rules, with proofs omitted for brevity, are used in the graph reduction technique.
Rule 1: Merging of shared sub-trees
If paths from v_{i,j} to v_{n,l} and from v_{i,k} to v_{n,m}, where j, k ∈ [0, 2^i − 1], l, m ∈ [0, 2^n − 1], contain an identical set of edges, starting at a node at level p, those nodes where the matching occurs in both sequences can be merged. Furthermore, if k = j, and i − p = 1, then the edges from node v_{i,j} to nodes v_{p,j} and v_{p,m} can be replaced by a link edge.
Rule 2: Deletion of common edges
Consider a set of edge paths, X_1, connecting a node v_{i,j} with a node at level n, and a set of edge paths, X_2, also connecting the node v_{i,j} with a node at level n. Path X_1 follows the T edge from node v_{i,j}, and path X_2 follows the F edge from node v_{i,j}. If X_2 covers X_1, then the first edge in X_1 has ct = W.

Rule 1 provides for the greatest reduction in the number of nodes by merging common subtrees. Rule 2 replaces transistor links between nodes with wire links. When merging occurs, however, accidental paths through the tree may be created which can produce false results; it is important to be able to detect these and reverse the reduction step [7].

3.3.2. Multiple output trees

In general, the pipelined switching functions have multiple outputs. The {3, 5, 7} polynomial system we have been using as an example requires 3 bits to represent each integer in the field. The original tree structure therefore becomes a set of 3 binary trees, and these trees are amenable to merging during the minimization process. The graphical rules of merging and deletion apply to multiple trees quite naturally.
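As a simplified, self-contained illustration of Rule 1 (our own sketch; the WoodChuck synthesizer [7] is considerably more elaborate and also applies Rule 2 and the false-path checks), the code below builds the programmed full binary decision tree for one output bit of the Mod 7 multiplication table used later in the paper, and merges structurally identical sub-trees by hashing.

    # Rule 1 only: merge shared sub-trees of a programmed full binary decision tree.
    from itertools import product

    def build_tree(truth, depth, assignment=()):
        """Full tree: level i branches on input i; leaves hold the table entries
        (0 means the bottom edge is removed, 1 means it is kept)."""
        if len(assignment) == depth:
            return truth[assignment]
        return (build_tree(truth, depth, assignment + (0,)),   # F edge
                build_tree(truth, depth, assignment + (1,)))   # T edge

    def merge_shared(node, table=None):
        """Hash-cons the tree so identical sub-trees become one shared node."""
        if table is None:
            table = {}
        if not isinstance(node, tuple):
            return node
        key = (merge_shared(node[0], table), merge_shared(node[1], table))
        return table.setdefault(key, key)

    def count_nodes(node, seen=None):
        seen = set() if seen is None else seen
        if isinstance(node, tuple) and id(node) not in seen:
            seen.add(id(node))
            count_nodes(node[0], seen)
            count_nodes(node[1], seen)
        return len(seen)

    # Bit 0 of A*B mod 7, input ordering {B2, B1, A2, A1, A0, B0};
    # the unused input value 7 is a don't-care, arbitrarily set to 0 here.
    truth = {}
    for b2, b1, a2, a1, a0, b0 in product((0, 1), repeat=6):
        A, B = 4*a2 + 2*a1 + a0, 4*b2 + 2*b1 + b0
        truth[(b2, b1, a2, a1, a0, b0)] = 0 if 7 in (A, B) else (A * B % 7) & 1

    full = build_tree(truth, depth=6)
    print(count_nodes(full), "->", count_nodes(merge_shared(full)))

Node sharing of this kind corresponds directly to the merging of common sub-trees visible in Fig. 7; the wire substitutions of Rule 2 and the detection of accidental paths are deliberately left out of the sketch.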
3.4. Switching tree performance

The embedded pipelined structure is dynamic and can, potentially, suffer from the effects of charge sharing; this is a particular problem when tree heights are large. Another deleterious effect, due to increased tree height, is the quadratic decrease [11] in the pull-down speed of a tree path that connects to ground. The charge sharing effect is somewhat reduced from the possible worst case (say in a Domino logic configuration) by the fact that each pipelined switching tree circuit is also driven from a similar circuit.
Fig. 4. Worst case test results for a full binary tree path.
This means that the input logic levels are held constant through both the precharge and evaluate stages of the dynamic operation of the latch and tree. For paths that can provide connections to internal tree nodes, whose capacitance can potentially share charge with the evaluate node capacitor, those paths are already established during precharge and so the internal capacitances will be partially charged. This will reduce the charge sharing during the evaluate phase of operation. Figure 4 shows worst case results for a charge share cycle followed by a pull-down cycle for a 6-high tree. The test was carried out for a full binary tree path, followed by tests for a single large capacitance load (7 transistor drains) placed at different nodes in the path; the arrows on Fig. 4 indicate a movement of the load towards the bottom of the tree. The worst case condition, as expected, is for the load placed on the drain node of the bottom transistor in the tree. The worst case load condition will still allow solid operation of the latch. Test results for a full binary tree path, for a target 3μ DLM CMOS process [3], indicate that a tree height of 6 yields acceptable charge sharing droop with pull-down times in the region of 20ns. The reduced voltage swing is acceptable provided its voltage levels remain outside the effective noise margin limits of the latch.
4. Improving performance

Even though a 6-high tree appears to perform adequately at reasonable pipeline rates (25MHz for the conservative 3μ technology), it is possible to considerably improve on this performance by further reducing charge sharing effects and by decreasing pull-down time.

4.1. Reduction of charge sharing
We use the standard technique of internal tree p-channel pull-up transistors to reduce the charge sharing effect. In the case of the bit-level pipelined blocks studied in this paper, however, a single pull-up transistor is often sufficient. We can also trade a reduction in precharge time for an increase in evaluate time, without reducing the throughput rate of the pipeline.
Fig. 5. 25MHz operation: effect of pull-up pFET and increasing evaluate/precharge duty cycle.
Fig. 6. Comparison between sized and minimum profile.

Because we are only using a single pFET inverter for the p-logic block of the TSPC latch, we also have the flexibility of adjusting the precharge/evaluate duty cycle without being concerned about the effect on the pFET slave latch; i.e. the timing limitations are governed by the nFET latch circuitry. Figure 5 shows the improvement that results when applying a single 6μ width pFET pull-up transistor separately at the 4th and 5th transistor drain nodes in a 6-high tree. There are 7 transistor drain loads at the 2nd and 5th transistors, so that the 5th drain is heavily loaded while the 4th drain has a normal binary tree capacitance loading. The evaluate/precharge ratio has been increased from 1:1 to 3:1 to allow larger voltage swings at the evaluation node. Note that the pull-up transistor is effective within one transistor of the large capacitance node, as shown by applying the pull-up at the lightly loaded 4th drain node. The reason for this is that the bit-level pipeline provides constant input logic levels during both precharge and evaluate: the transistors that connect the paths that cause charge sharing also allow precharge current to flow to the same nodes.
4.2. Decreasing pull-down time by sizing

A significant reduction in pull-down delay can be obtained by sizing the transistors [9]. This is a complex issue when looking at the interaction between pull-down delay and charge sharing. We use an approximate analytical technique [2] that obtains very close to optimal results while allowing both on-the-fly calculations and algebraic manipulations for module generator applications. Figure 6 demonstrates the improvement in pull-down time using a minimum delay sizing profile in the previous test tree (with appropriately sized load transistors) and increasing the width of the precharge and pull-up transistors to twice and four times their previous size, respectively. There is, however, about a 4 times increase in power dissipation measured over the two test clock cycles. This will increase to about 8 times if the clock speed is doubled to take advantage of the pull-down decrease. Clearly the option to size the transistor profile will have to be considered in light of the data stream throughput requirements, since the tradeoffs are quite severe.
5. Residue module generator

In this section we discuss a new approach to a module generator suitable for on-the-fly cell generation for arbitrary switching tree designs. This is essential for a design automation procedure, since it is impractical to pre-design a cell library based on the wide variety of truth table requirements that may have to be met in a typical residue computation setting. Our approach is very similar to the gate matrix or PLA concept, where a two dimensional array of transistors is generated based on a transistor network mapped from a minimized Boolean function. In our case the network is a direct mapping from a minimized binary tree. In order to illustrate the procedure we have developed, an example of Mod 7 multiplication, z = A ⊗_7 B, will be used.
5.1. A modulo 7 multiplier

The minimized tree, using our WoodChuck software package [7], is shown in Fig. 7. The order of the inputs (from the top) is {B_2, B_1, A_2, A_1, A_0, B_0}. This ordering was determined to be the best, based on a limited search, for minimizing interconnection lengths with a close to optimum reduction in the number of transistors. We will see that minimizing the number of transistors is not necessarily the best criterion for optimization. The don't care states, resulting from the fact that there are only 7 valid states in an 8-state system (3 bits), are used to help reduce the tree structure and to provide, as far as possible, local interconnections rather than cross connections. The merging of the three original trees is quite evident in Fig. 7. The True and False edges on each row represent transistors whose gates are driven by the input signal, or its complement, respectively, to the row, as discussed in Section 3. Note that the bottom gate load, of over 45 transistor gates for each of the true or complement signals in the original binary tree, has been reduced to 5 for the complement input and 6 for the true input. The maximum input signal gate load, of 9 gates, is at the A_2 row (third row from the top). This is a typical profile for a minimized tree network. We also note that some rotation of tree sections has been performed. This is part of the procedure to minimize long interconnections.
Fig. 7. Minimized tree for Mod 7 multiplication.
Fig. 8. Table of primitives for the Mod 7 multiplier tree.
5.2. Placement mapping

Our approach to the module generator is a direct mapping of tree primitives to layout primitives, using a matrix layout approach. The matrix of primitives for the Mod 7 multiplier is shown in Fig. 8. The levels correspond to rows of the tree, and each level has two rows: one for the True edges and the other for the False edges. The position of the rows alternates between adjacent levels; this is to accommodate the inverters that are used to drive the complement gate signals. The mapping of primitives to the matrix is performed by either filling, or leaving empty, the table positions shown in Fig. 8. The wire and transistor primitives are direct mappings from the switching tree; the shorting primitives are used to connect gate signals, propagated on metal 2 lines, to the polysilicon transistor gate lines. The metal 2 and polysilicon lines run horizontally across each row in the matrix, with the metal 2 lines directly on top of the polysilicon lines. By shorting the metal 2 to the polysilicon at several places across the row (ideally near a transistor gate) we can eliminate the time constants associated with the large resistivity of polysilicon and transistor gate capacitances. We use space in the table to place the shorting primitives. Because these primitives are offset from the centre of the metal 2/polysilicon lines, they have two possible vertical directions; both directions have been used in Fig. 8.
5.2.1. Placement algorithm

The algorithm used to map the tree edges to the matrix primitives is given below:
(1) Start at the top of the right hand tree, and map to the rightmost column in the matrix.
(2) Move towards the bottom of the tree, taking either right hand edges or single merged edges, mapping the edges (vertical wire links or transistors) to matrix primitives in the column. Place horizontal wire matrix primitives if a previously mapped edge (in the right hand adjacent matrix column) is connected to the currently mapped edge. The path terminates when either a left hand link is reached, or when the bottom of the tree is reached.
(3) Move to the left until the first unplaced left hand edge, at any vertical position, is reached. Terminate the algorithm if all edges have been placed.
(4) Repeat from (2), mapping to a new column to the left of the previous column.
At the termination of the algorithm, the matrix is examined for suitable placement of shorting primitives. This is a somewhat heuristic procedure since there is a trade-off between reducing the resistance of the signal path to each transistor gate, and the extra capacitance load of the shorting primitive. There is often limited space for the shorting primitives, particularly near the dense central rows. We can see, from Fig. 8, that shorting primitives could be placed within a short distance of every two or three transistors on a row; this will change with the particular function being implemented.

5.2.2. Observations

We see that the area required by the tree edge placement is given by:
    A = 2·Φ·H_Row·Θ·W_col        (5)

where Φ is the number of input lines to the switching tree (6 in the Mod 7 multiplier example); H_Row is the height of each row; Θ is the number of separate columns required in the placement mapping; and W_col is the width of each column. Notably absent from (5) is the number of transistors. Although there will be a correlation between minimizing transistors and minimizing Θ, the only direct requirement for minimizing transistors is to reduce the number of series transistors in the critical path of the tree (the path that has the maximum number of transistors between the ground plane and the evaluation node). In the Mod 7 multiplier example, the critical path is Φ.

5.3. Floor plan and layout
Figure 9 shows the floor plan and final layout of the Mod 7 multiplier, using a 3μ DLM p-well CMOS process [3]. The transistor block contains the matrix of primitives mapped from the switching tree, and also the metal 2/polysilicon signal wires. Note that the figures have been rotated by 90°. The inverters are formed by p-channel and n-channel strips, separated by the tree matrix. The matrix also includes the ground switch transistors, and the input clock signal to the switches is buffered by an inverter at the end of the inverter strip. The latch primitives are full custom layouts, and the clock signal to the latches is also buffered at the bottom of the latch column.
Fig. 9. Floor plan and layout for the Mod 7 multiplier.
The transistor array governs the height of this particular example cell, but often the latches control the size, particularly for smaller numbers of inputs, or when there is a greater decomposition of the switching function (e.g. multi-bit binary adders). For such a cell, the area is controlled only by the number of input bits (width) and the number of output bits (height). For these low area switching functions the design procedure gives greater priority to the control of cross connections than to generating optimal solutions to the tree minimization. It is to be noted that the simple algorithm in Section 5.2.1 only works with a planar tree mapping, and it is a better choice to increase Φ if a planar tree is the result.
5.4. Comparison study

Given the observation, in the previous section, that the layout area is largely independent of the switching function, it is useful to compare the automatic layout against a full custom hand layout, and also to compare the Switching Tree approach to alternative switching function implementations. We will use the Mod 7 multiplier as the example for layout comparison, and we choose a TSPC PLA design as the comparison architecture. The 3μ CMOS fabrication process is used for all comparison designs.
5.4.1. Hand layout

The approach for the hand layout is to map the three trees individually and to surround the resulting latched trees with the appropriate buffers. The transistor sizes for the trees are based on the minimum width of the drain/source area afforded by the design rules.
Fig. 10. Floor plan for the hand-layout design.
The channel widths are also set at this value (5.4μ). The floor plan is shown in Fig. 10.
5.4.2. TSPC PLA design

The PLA design uses the same TSPC latch structure as the switching tree approach. Because the PLA contains separate AND and OR planes, these have been incorporated into the n-channel block and p-channel block of the latch. This provides the ability to pipeline at the rate of the slowest of either block, rather than requiring a single evaluation of the complete structure or separating the planes into different latches. Channel widths are the same as for the Switching Tree designs. The floor plan is shown in Fig. 11, along with the multiplier core layout. The truth table for the Mod 7 multiplier does not decompose sufficiently to allow folding of the core, as seen in Fig. 11.
Fig. 11. PLA design: floor plan and Mod 7 multiplier core.
Fig. 12. Area comparison of the three designs.
5.4.3. Area comparison
The three designs are compared in Fig. 12. It is clear that the switching tree design has much lower area than the equivalent PLA design, but more surprising is the fact that the automatically generated tree design has lower area than the hand layout. Clearly the mapping of merged trees to a matrix placement is very efficient. A more useful comparison is between the core areas of the three designs, since this will eliminate differences in the design of support circuitry (buffers, latches etc.). This comparison is given in Table 1.
Table 1
Core area comparison

Design                        Core area                      Relative %
Synthesized switching tree    252μ × 204μ = 0.0514 mm²       100%
Hand-layout switching tree    163μ × 458μ = 0.0747 mm²       145%
PLA                           363μ × 457μ = 0.1659 mm²       323%
Table 2
Speed and power comparison

Design            Maximum throughput rate    Peak current at 40MHz    Average dissipation at 40MHz
Switching tree    50MHz                      2.85mA                   3.45mW
PLA               70MHz                      8.36mA                   15.4mW
5.4.4. Speed and power comparison

This study has been conducted using mask-extracted SPICE files, with level 3 models based on tuning from many fabrication experiments. The two switching tree designs perform almost identically, and so we provide a single result for the two designs. Comparison results are shown in Table 2. The power and peak current measurements are taken at a 40MHz throughput rate. We note that the PLA is able to operate at almost 50% higher throughput rates than the switching tree design; the trade-off, however, is the almost 5 times increase in power dissipation and 3 times increase in the peak current spike. This latter result can be as important as the power dissipation result, since the current spike is effectively multiplied by the number of cells on the chip for perfectly synchronized clocking (no skew between clocks arriving at the cells). This also speaks for producing architectures that allow clocks to be skewed, and the number theoretic techniques, described in the first part of the paper, are directly suitable for such skewed clocking, since the computations are carried out in independent pipelines. We have partially verified the throughput rate predictions by fabricating 6-high test switching trees and observing successful operation at the bandwidth of the output drivers (40MHz).

5.4.5. Future chip densities

An important VLSI architectural point is the massive replication required for the small rings. We have to remember that a complete multiplication is performed within a single pipeline cycle using only 6-transistor high trees. Essentially we have changed the VLSI footprint of the computational elements from the roughly square footprint of standard binary multipliers to a narrow, long rectangular footprint. Since the narrow dimension is in the temporal direction, we achieve high speed, low latency implementations. Instead of the integrated two-dimensional data flow experienced with standard binary arithmetic elements (associated with carry propagation), our architecture only communicates across the dynamic range at relatively widely spaced intervals (scaling and conversion). Between these points we have linear independent pipelines using only 3-bit variables. At the conversion points we effectively have a corner turning procedure, where the computations across the entire dynamic range are computed; these computations are also linear pipelines. The testability and fault tolerance advantages of such an architecture are not to be dismissed, particularly in critical applications; the architecture is ideally suited to current density ULSI fabrication processes. The size of the synthesized Mod 7 multiplier, using the 3μ CMOS process, is 528μ × 662μ = 0.350 mm², and if we allow a 10% overhead in routing (remember that the array is very regular) we are able to fit about 260 such cells on a 1 cm² die. With a submicron (say 0.8μ) process, this number will increase to over 2500 cells, and we expect pipeline rates in excess of 150MHz; well within the speeds required for applications such as radar, HDTV etc.
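The cell-count figures quoted above can be checked with a few lines of arithmetic (our verification, taking the 10% routing overhead and a simple quadratic area scaling from 3μ to 0.8μ as assumptions):

    cell_area = 0.528 * 0.662            # synthesized Mod 7 cell, mm^2 (3 micron process)
    effective = cell_area * 1.10         # allow 10% routing overhead
    print(round(100.0 / effective))      # cells on a 1 cm^2 die: ~260
    scale = (3.0 / 0.8) ** 2             # naive area scaling from 3 micron to 0.8 micron
    print(round(100.0 / (effective / scale)))   # ~3650, comfortably "over 2500 cells"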
More architectural investigations are required before definite predictions of performance are possible, but for the {3, 5, 7} system we can probably increase the number of cells to 3000, based on the reduced area of the Mod 3 cell. For a 20-bit computational range (say for a complex radar matched filter) we will be able to perform over 12 concurrent operations at this data rate, giving in excess of 1 GOPs (billion operations per second) of processing power on a single chip. This is in addition to the advantages, associated with processing over independent pipelines, of clock skew tolerance, ease of testing and the possibility of applying simple fault tolerance techniques [12].
6. Conclusions

In this paper we have discussed the role of number theoretic techniques in the implementation of digital signal processing systems. In particular we have concentrated on a recently introduced polynomial ring mapping technique which allows large dynamic range computations to be performed using massively parallel small finite ring computational elements. We have demonstrated that large dynamic range computations can be performed by independent residue computations with the smallest usable odd relatively prime moduli of 3, 5 and 7. The unique properties of the computational procedure are massively parallel independent pipelined computations with very small word widths (3 bits).
Because the computations themselves are not amenable to the use of replicated cells (there is no common cell that can be used to efficiently implement both Mod 5 and Mod 7 multiplication, for example) we have developed a new circuit style for efficient implementation of the wide variety of switching functions that are required. The switching function is implemented using minimized look-up tables based on binary decision trees (Switching Trees), and the pipeline function is implemented using a true-single-phase-clock dynamic latch; the minimized tree is embedded within the latch.
This paper has introduced a new synthesis procedure for Pipelined Switching Trees based on a mapping of the switching tree to a 2-dimensional matrix of layout primitives. A simple mapping algorithm is presented, and, using the example of a Mod 7 multiplier, we demonstrate that the synthesized layout is smaller than a hand layout and that the switching tree itself is 3 times smaller than a PLA core. From SPICE simulations, we find that the pipelined switching tree consumes only 20% of the power required by the PLA running at the same frequency. The PLA design, as expected, is able to run at throughput rates that are 50% higher than the switching tree.
Acknowledgments

The authors acknowledge financial support from the Natural Sciences and Engineering Research Council of Canada, the Micronet Network of Centres of
Excellence, and the fabrication and equipment loan programme of the Canadian Microelectronics Corporation, to carry out this research work. The authors are also indebted to Mr. R. Grondin, and Mr. L. Del Pup for the switching tree software package, the comparison study data and fabrication results.
References

[1] Afghahi, M. and C. Svensson, A unified single-phase clocking scheme for VLSI systems, IEEE J. Solid-State Circuits 25 (1990) 225-233.
[2] Bizzan, S., G.A. Jullien and W.C. Miller, Analytical approach to sizing NFET chains, IEE Electronics Letters 28 (14) (July 1992) 1334-1335.
[3] Canadian Microelectronics Corporation, Guide to the integrated circuit implementation services of the Canadian Microelectronics Corporation, 1986.
[4] Games, R.A., An algorithm for complex approximations in Z[e^{2πi/8}], IEEE Trans. Inform. Theory IT-32 (1986) 603-607.
[5] Jenkins, W.K. and J.V. Krogmeier, The design of dual-mode complex signal processors based on quadratic modular number codes, IEEE Trans. Circuits and Systems CAS-34 (1987) 354-364.
[6] Jullien, G.A., R. Krishnan and W.C. Miller, Complex digital signal processing over finite fields, IEEE Trans. Circuits and Systems CAS-34 (1987) 365-377.
[7] Jullien, G.A., W.C. Miller, R. Grondin, Z. Wang, D. Zhang, L. Del Pup and S. Bizzan, WoodChuck: A low-level synthesizer for dynamic pipelined DSP arithmetic logic blocks, IEEE International Symposium on Circuits and Systems 1 (1992) 176-179.
[8] McCanny, J.V. and J.G. McWhirter, Optimized bit level systolic array for convolution, IEE Proceedings, Pt. G 131 (1984) 632-637.
[9] Shoji, M., FET scaling in domino CMOS gates, IEEE J. Solid-State Circuits 20 (1985) 1067-1071.
[10] Soderstrand, M.A., W.K. Jenkins, G.A. Jullien and F.J. Taylor, Residue Number System Arithmetic: Modern Applications in Digital Signal Processing, IEEE Press, 1986.
[11] Song, P.J. and G. DeMicheli, Circuits and architecture trade-offs for high speed multiplication, IEEE J. Solid-State Circuits 26 (1991) 1184-1198.
[12] Taheri, M., G.A. Jullien and W.C. Miller, High speed signal processing using systolic arrays over finite rings, IEEE J. Selected Areas in Communications 6 (1988).
[13] Wigley, N. and G.A. Jullien, Array processing on finite polynomial rings, Proceedings of the International Conference on Application Specific Array Processors (1990) 284-295.
[14] Wigley, N.M. and G.A. Jullien, On moduli replication for residue arithmetic computations of complex inner products, IEEE Trans. Computers (1990).
[15] Wigley, N.M. and G.A. Jullien, A flexible modulus residue number system for complex digital signal processing, IEE Electronics Letters 27 (1991) 1436-1438.
[16] Wigley, N.M. and G.A. Jullien, Large dynamic range computations over small finite rings, IEEE Trans. Computers (1993) (in print).
[17] Yuan, J. and C. Svensson, High-speed CMOS circuit technique, IEEE J. Solid-State Circuits 24 (1989) 62-70.