Information Processing Letters 57 (1996) 159-163
A circuit for exact summation of floating-point numbers

Michael Müller a, Christine Rüb a,*, Wolfgang Rülling b

a Max-Planck-Institut für Informatik, 66123 Saarbrücken, Germany
b Fachhochschule Furtwangen, 78120 Furtwangen, Germany
Received 1 May 1994; Communicated by T. Lengauer
Abstract

In recent years methods for solving numerical problems have been developed which, in contrast to traditional numerical methods, compute intervals which are proven to contain the true solution of the given problem (cf. [10,11]). These methods rely on an exact evaluation of inner product expressions in order to obtain good (i.e. small) enclosure intervals. Practical experience has shown that using these methods even ill-conditioned problems can often be solved with maximum accuracy. Since the exact inner product computation is a basic operation for these methods, we developed a circuit to support the difficult part of the inner product computation: the accurate accumulation of the partial products.

Keywords: Computer architecture; Floating-point arithmetic; Inner product; Scientific computation; Numerical computation; Arithmetic chip
1. Background

Many computer applications have to deal with "real" numbers. In practice those numbers are usually represented as floating-point numbers. Since not every real number can be represented as a floating-point number, this can lead to rounding errors which in some cases make the result of the computation useless. In the past, many algorithms for the standard numerical problems have been developed which try to minimize the effect of rounding errors. But for so-called ill-conditioned problems, the results obtained by these algorithms may have nothing in common with the actual solution of the given problem. As an alternative to these traditional numerical methods, verification methods have been introduced
* Corresponding author.
(cf. [10,11]). Verification methods use interval arithmetic to compute intervals which are proven to contain the true solution of a given problem. Moreover, if inner product expressions can be evaluated exactly (with rounding only at the end of an inner product computation), the intervals can generally be made very small using only a few iterations. With these methods even most ill-conditioned problems can be solved with maximum accuracy.

In order to apply these methods it is necessary to provide the "exact inner product" of two vectors of floating-point numbers as a fifth basic operation besides the operations +, -, · and /. There are already extensions of programming languages doing this (e.g. PASCAL-SC [4], PASCAL-XSC [8], FORTRAN-SC [3]). Here, as well as in some arithmetic subroutine libraries (e.g. [6,1,2]), the exact inner product is implemented in software. But software implementations are often unsatisfactory. The exact inner product is a basic operation, and basic operations should be implemented as fast as possible. For a really fast implementation of the exact inner product, dedicated hardware seems to be necessary. Since there already exist good circuits for multiplication, we concentrate on the summation part of the inner product computation.

There have already been several proposals how to design a chip for the exact summation of the partial products (cf. [5,7,12,16,17]), but as far as we know none of them has been realized yet. In these as well as in our design the summands are added to a fixed-point accumulator. If the vector components are double precision floating-point numbers according to the IEEE standard 754, the accumulator has to be ≈ 4200 bits long. A common disadvantage of all previous approaches is that after the last summand has been input it takes many clock cycles until the exact sum rounded to a floating-point number is available. This cannot be neglected, since short inner products often occur in practical applications (cf. a statistical analysis of verifying numerical algorithms in [9]). Furthermore, some of the previously known approaches would result in circuits that are too large. Hence we aimed for a circuit which is small and makes the rounded sum available only a short time after the input of the last summand. Section 2 of this paper has been presented at the 10th IEEE Symposium on Computer Arithmetic in Grenoble, 1991 [14], while Sections 3 and 4 are taken from the Ph.D. thesis of the first author [13].
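The ≈ 4200-bit figure can be checked with a back-of-the-envelope computation (our own, not from the paper): exact products of two IEEE doubles range from the square of the smallest positive subnormal, 2^-2148, up to just below 2^2048, so a fixed-point register covering every product bit position needs about 2148 + 2048 bits. (The chip described below uses 67 × 64 = 4288 bits, which presumably includes headroom for intermediate sums.)

```python
# Back-of-the-envelope check (ours) of the ~4200-bit accumulator size
# for exact products of IEEE 754 double precision numbers.

SMALLEST_SUBNORMAL_EXP = -1074  # 2**-1074 is the smallest positive double
MAX_EXP_PLUS_1 = 1024           # every finite double is below 2**1024

def accumulator_bits():
    lowest = 2 * SMALLEST_SUBNORMAL_EXP  # -2148: lowest product bit position
    highest = 2 * MAX_EXP_PLUS_1         # products lie below 2**2048
    return highest - lowest

# accumulator_bits() evaluates to 4196, i.e. roughly 4200.
```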
2. The idea

Intuition tells us that it is not necessary to consider all accumulator digits when we add a number covering only a small part of it, i.e. the correctly adjusted mantissa of a floating-point number. This is clearly true if the number is positive and can be added without producing a carry at the left end of the covered range. If this locality held for all additions, we could maintain the contents of the accumulator in an ordinary storage which requires only little area. This could lead to a smaller layout than the previous approaches. To add a new number we read the part affected by the addition, use a moderately sized adder to do the main work and write the modified section back
intermediate result:  0010111111011
carry:              +         1
carry resolved:       0011000001011

Fig. 1. Resolving a carry.
to the storage.

To be able to use this simple procedure in all cases, we have to solve two major problems: (1) What do we do with the carries? (2) How do we treat negative summands?

Let us first ignore negative summands and concentrate on carry handling. From an analysis of the previous approaches we concluded that it seems to be essential to resolve carries immediately as they arise. If this is not done, there may be many unresolved carries after the last summand has been input, and this contributes to a long postprocessing time which we want to avoid. Later we will see that the idea we use for carry handling also helps to find the significant part of the accu quickly and hence allows us to put out the rounded result only a short time after the last summand has been added.

Resolving a carry is equivalent to adding a power of two (in the binary system), i.e. adding a number in which exactly one bit is 1 (cf. Fig. 1). This means that we have to find the next 0 left of the position from which the carry starts and invert all ones in between. Doing this sequentially is obviously too slow; however, we also cannot afford something like a binary tree covering the whole accumulator, as in a carry-lookahead adder: such a tree with more than 4000 leaves would be much too large. In order to be able to propagate a carry over many bits at the same time, we use the following idea. The accumulator consists of two parts: an ordinary storage, partitioned into words, and an indicator for each word which describes whether the word contains only ones, only zeros, or both ones and zeros. Only in the latter case does the value of a word in the storage agree with the value of the corresponding word of the accu; otherwise the indicator determines the value. We always have to take the indicators into account when we want to read a word from the accu, and we have to update them when we write a word back.
If a carry is propagated over a word which contains only ones, we only need to change the corresponding indicator; the word in the storage can remain unchanged. Only in the word which absorbs the carry is each bit treated individually.
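The word-plus-indicator bookkeeping can be sketched in a few lines of Python (our own model; the class and field names and the 4-bit word size are illustrative assumptions, not the paper's design). The carry loop below is written sequentially for clarity only; the circuit resolves it with a single auxiliary addition, as described next in the text.

```python
# Model of the accumulator scheme: words of storage plus a per-word
# indicator 'zero', 'one', or 'mix'.  Word 0 is the most significant;
# a carry propagates towards lower word indices.

WORD_BITS = 4  # tiny for illustration; the paper's chip uses 64

class Accumulator:
    def __init__(self, n_words):
        self.words = [0] * n_words      # backing storage
        self.ind = ['zero'] * n_words   # per-word indicators

    def read(self, i):
        """Value of word i; for 'zero'/'one' the indicator wins."""
        if self.ind[i] == 'zero':
            return 0
        if self.ind[i] == 'one':
            return (1 << WORD_BITS) - 1
        return self.words[i]

    def write(self, i, value):
        """Store a word and keep its indicator consistent."""
        self.words[i] = value
        if value == 0:
            self.ind[i] = 'zero'
        elif value == (1 << WORD_BITS) - 1:
            self.ind[i] = 'one'
        else:
            self.ind[i] = 'mix'

    def resolve_carry(self, i):
        """Add a carry entering word i: all-ones words merely flip
        their indicator to 'zero' (storage untouched); the first word
        that is not all ones absorbs the carry."""
        while self.ind[i] == 'one':
            self.ind[i] = 'zero'
            i -= 1
        self.write(i, self.read(i) + 1)
```

With the configuration of Fig. 2 (where the second word from the right is stored as 1111 but has indicator zero), resolving a carry entering the fourth word flips the two one-indicators and increments the second word from 1011 to 1100.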
zero   mix    one    one    mix    zero   mix
0000 | 1011 | 1111 | 1111 | 0101 | 1111 | 0101
                          + 1101 | 1011

zero   mix    one    one    mix    mix    mix
0000 | 1011 | 1111 | 1111 | 0010 | 1011 | 0101
                   + 1

I = 0011000
  + 0001000
R = 0100000
      x--       (x: word absorbing the carry; -: words passed by the carry)

zero   mix    zero   zero   mix    mix    mix
0000 | 1100 | 1111 | 1111 | 0010 | 1011 | 0101

Fig. 2. Adding a floating-point number to the accumulator.
When we add a new floating-point summand we read the words of the accu which are covered by the adjusted mantissa. After the addition the modified words are written back. If we have to resolve a carry we change the one-indicators of the words over which the carry is propagated to zero. Finally we have to increment the contents of the word which absorbs the carry. Fig. 2 shows an example of how this procedure works. In this example the second word from the right contains only ones but its indicator is zero (a carry must have been propagated over this word previously). For the carry propagation over several words we have to find out over which words the carry is propagated and in which word it is absorbed. One way of doing this fast and with little additional hardware is to reduce the problem to an addition as follows. We define a binary number I in which every bit corresponds to a word of the accumulator. A bit of I is 1 if and only if the corresponding word contains only ones. The number I can easily be generated from the indicators of the words. To I we add a number having a 1 only at the position corresponding to the word first entered by the carry. Let R be the result of
this addition. The bits at which R differs from I are exactly the bits corresponding to the words passed by the carry (marked with a bar in Fig. 2) and the word which absorbs the carry (marked with a cross). Hence it is easy to update the indicators and to address the word that absorbs the carry.

If we want to use the same adder for the addition of the mantissa to the affected words and for the addition during the carry propagation over several words, the word size should approximately equal the number of words into which the accumulator is partitioned. Thus if the accumulator has length L, the word size should be chosen ≈ √L. (For L ≈ 4200 we have √L ≈ 64.)

Up to now we have only considered positive summands. To add a negative summand to the accu we subtract the absolute value of its mantissa from the corresponding words of the accu. This subtraction is done by adding the two's complement. Note that the value of the sign bit in the two's complement representation is -2^n, where n is the number of digits used to represent the number, not counting the sign bit. Hence, if we subtract something from a section of the accumulator using two's complement representation, a negative sign bit of the result can be interpreted as a negative carry, i.e. we have to subtract this carry from the more significant bits. This negative carry can be handled in nearly the same way as a positive carry. The only difference is that it is absorbed by a 1 and propagated by a 0. The propagation of a negative carry over several words is again reduced to an addition, similar to the propagation of a positive carry.

Now the description of how we add a summand to the accumulator is complete. It remains to describe how we extract the floating-point representation of the rounded sum from the accumulator. For this purpose it is necessary to find the position of the leftmost bit of the accu which is different from the sign bit. We call this bit the most significant bit (m.s.b.).
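The reduction of multi-word carry propagation to a single addition can be sketched as follows (a Python model of our own; the bit ordering and helper names are assumptions, not the paper's notation). Bit (n-1-k) of I corresponds to word k, with word 0 the leftmost, so that a carry running towards more significant words becomes an ordinary binary carry inside I:

```python
def propagate_carry(indicators, entry):
    """indicators: list of 'zero'/'one'/'mix', index 0 = leftmost
    (most significant) word.  entry: index of the word the carry
    first enters.  Returns (words passed, word that absorbs)."""
    n = len(indicators)
    # I: bit (n-1-k) is set iff word k contains only ones.
    I = 0
    for k, ind in enumerate(indicators):
        if ind == 'one':
            I |= 1 << (n - 1 - k)
    # Add a 1 at the position of the word first entered by the carry.
    R = I + (1 << (n - 1 - entry))
    diff = I ^ R                      # bits where R differs from I
    changed = [k for k in range(n) if (diff >> (n - 1 - k)) & 1]
    absorber = min(changed)           # leftmost changed word absorbs
    passed = [k for k in changed if k != absorber]
    return passed, absorber
```

On the indicator configuration of Fig. 2, with the carry entering word 3 (0-indexed), this returns the passed words [2, 3] and absorber 1 — i.e. the two all-ones words flip their indicators and the second word is incremented.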
For the implementation of the different IEEE rounding modes it is also helpful to know the position of the rightmost bit which equals 1 (called the least significant bit, l.s.b.). Since the latter is somewhat easier, we first describe how to find the position of the l.s.b. Obviously this problem is similar to the problem of carry propagation: the position of the l.s.b. is the position where a negative carry starting at the right end of the accumulator is absorbed. Thus the problem of finding the word which contains the l.s.b. can be
reduced to an addition. Applying the same technique to this word, we can determine the position of the l.s.b. within it. In order to find the position of the m.s.b. we can apply the same principle to the reversed accumulator. Once we know this position, we can take the bit at this position and the next M - 1 bits to the right of it as the mantissa of the rounded sum and truncate the rest of the accumulator. Here M denotes the length of the mantissa of the format in which we want to represent the result. The exponent of the rounded sum can be obtained from the binary representation of the position of the m.s.b. by adding a suitable constant which depends on the format we use. Since we truncated the part of the accumulator that did not fit into the mantissa, this corresponds to rounding towards the next smaller floating-point number. If another rounding mode is desired, we may have to compute the next larger floating-point number, depending on whether what we truncated was less than, equal to, or greater than 1000...000. For this decision we need to know whether there are bits different from 0 besides the first bit in the part that we truncated. If we know the position of the l.s.b., this decision can easily be made.
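As an illustration (our own Python sketch, operating on a plain non-negative integer rather than the circuit's word array, and using Python's bit_length() where the chip uses the addition trick), round-to-nearest can be decided exactly as described: from the first truncated bit and, via the l.s.b., from whether any further truncated bit is 1:

```python
def round_nearest(acc, M):
    """Round the positive accumulator value `acc` to an M-bit
    mantissa; returns (mantissa, exponent) with the rounded value
    equal to mantissa * 2**exponent."""
    msb = acc.bit_length() - 1           # leftmost 1-bit (m.s.b.)
    exp = msb - (M - 1)                  # number of bits to truncate
    if exp <= 0:
        return acc, 0                    # value already fits in M bits
    mantissa = acc >> exp                # the m.s.b. and the next M-1 bits
    half_bit = (acc >> (exp - 1)) & 1    # first truncated bit
    lsb = (acc & -acc).bit_length() - 1  # rightmost 1-bit (l.s.b.)
    sticky = lsb < exp - 1               # truncated part exceeds 100...0?
    if half_bit and (sticky or (mantissa & 1)):
        mantissa += 1                    # round up (ties to even)
        if mantissa >> M:                # mantissa overflowed to M+1 bits
            mantissa >>= 1
            exp += 1
    return mantissa, exp
```

For example, with M = 4, the value 43 (binary 101011) rounds up to mantissa 1011, while 42 (binary 101010) sits exactly halfway and rounds to the even mantissa 1010.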
3. The implementation

We have designed an accumulator for exact products of double precision floating-point numbers according to the IEEE standard 754 (53 bits mantissa, 11 bits exponent), using a semi-custom design system with 1.2 µm CMOS technology. The accumulator is partitioned into 67 words of 64 bits each. Since the mantissa of the exact product of two doubles can be represented with 106 bits, it covers at most three words of the accumulator. Hence a summand can be added using at most five 64-bit additions: three for the local addition of the mantissa, one for finding the word that absorbs a potential carry, and one for actually resolving the carry. The circuit has a size of 8.8 mm × 8.75 mm and consists of approximately 5000 standard cells (60 000 transistors), 4 Kbit of RAM and 7 Kbit of ROM. Assuming typical conditions (supply voltage 5.0 V, temperature 25°C), our simulations yield that the addition of a floating-point summand takes about 575 ns and
the rounding and outputting of the result takes about 850 ns. Under conditions where maximal delay values have to be assumed (supply voltage 4.3 V, temperature 100°C), we estimate 1150 ns for the addition procedure and 1700 ns for the rounding and output procedure.
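The claim that the 106-bit product mantissa covers at most three accumulator words, whatever its alignment, is easy to verify (our own few lines, not from the paper):

```python
WORD = 64    # accumulator word size
MANT = 106   # mantissa bits of the exact product of two IEEE doubles

def words_covered(offset):
    """How many words a MANT-bit field overlaps when its lowest bit
    sits `offset` bit positions past a word boundary."""
    first = offset // WORD
    last = (offset + MANT - 1) // WORD
    return last - first + 1

# Worst case over all alignments within a word:
assert max(words_covered(off) for off in range(WORD)) == 3
```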
4. Possible improvements

In our implementation we focused on a simple version which does not require too much area. In this section we mention two observations which lead to considerably faster implementations at only a moderate increase in hardware cost.

The first observation is that we can pipeline the three additions that we use for adding the mantissa to the accumulator. Since the only dependence between these additions is that a carry resulting from one addition has to be considered in the next one, we can already make available the operands for an addition during the preceding addition, and we can write back the result while the next addition is carried out. Based on the experience with our implementation, we estimate that such a realization would be 30-40% faster and 18% larger.

The second observation is that we do not have to wait until the mantissa is added to the accumulator before we can determine the word that absorbs a carry. If we use a second adder, those two tasks can be treated simultaneously, and we can also increment the contents of the word which might absorb the carry before we know whether a carry indeed has to be resolved. When the addition of the mantissa is completed, we know whether there is a carry to be resolved, and then either write back the result of the carry handling procedure or ignore it. According to our estimates, such a variant would be about 55-60% faster than our implementation and 34% larger.
5. Conclusions

We have found a way to design a chip for the exact summation of floating-point numbers which, in contrast to previous designs, can always deliver the rounded sum only a short time after the last summand has been added. This makes it possible to construct appropriate hardware support for the methods mentioned in Section 1. With such hardware, the exact evaluation of inner products does not take much longer than a naive evaluation with rounding errors. Since the methods mentioned in Section 1 in general yield a result after only a few iterations, and the results are tight bounds on the actual solution, this might improve the power of computers at least with respect to applications that rely heavily on numerical computations (e.g. weather forecasts, wind tunnel simulations).

The algorithm underlying our design could also be implemented in software. But this would be slower by a considerable factor, since many things that we do on the way would require extra instructions, for instance taking into account and updating the indicators each time we modify a word of the accumulator. Since the exact inner product is a basic operation for the methods described in Section 1, such a software solution is not satisfactory for large numerical applications.
References

[1] ACRITH-XSC: IBM High-Accuracy Arithmetic - Extended Scientific Computation, Version 1, Release 1, IBM Deutschland GmbH, Department 3282, Schönaicher Straße 220, 7030 Böblingen (IBM, 1990).
[2] ARITHMOS (BS 2000), Kurzbeschreibung, Tabellenheft, Benutzerhandbuch, SIEMENS AG, Bereich Datentechnik, Postfach 83 09 51, D-8000 München 83, Bestellnummer U2900-J-287-1 (SIEMENS, 1986).
[3] J.H. Bleher, S.M. Rump, U. Kulisch, M. Metzger, Ch. Ullrich and W. Walter, FORTRAN-SC: A study of a FORTRAN extension for engineering/scientific computation with access to ACRITH, Computing 39 (1987) 93-110.
[4] G. Bohlender, L.B. Rall, Ch. Ullrich and J. Wolff von Gudenberg, PASCAL-SC: A Computer Language for Scientific Computation, Perspectives in Computing 17 (Academic Press, Orlando, 1987).
[5] P.R. Cappello and W.L. Miranker, Systolic super summation, IEEE Trans. Comput. 37 (1988) 657-676.
[6] IBM High-Accuracy Arithmetic Subroutine Library (ACRITH), IBM Deutschland GmbH, Department 3282, Schönaicher Straße 220, 7030 Böblingen (IBM, 3rd ed., 1984).
[7] R. Kirchner and U. Kulisch, Accurate arithmetic for vector processors, J. Parallel Distrib. Comput. 5 (1988) 250-270.
[8] R. Klatte, U. Kulisch, M. Neaga, D. Ratz and Ch. Ullrich, PASCAL-XSC - Language Reference with Examples (Springer, Berlin, 1992).
[9] A. Knöfel, Hardwareentwurf eines Rechenwerkes für semimorphe Skalar- und Vektoroperationen unter Berücksichtigung der Anforderungen verifizierender Algorithmen, Ph.D. Thesis, Universität Karlsruhe (1991).
[10] U. Kulisch and W.L. Miranker, Computer Arithmetic in Theory and Practice (Academic Press, New York, 1981).
[11] U. Kulisch and W.L. Miranker, eds., A New Approach to Scientific Computation (Academic Press, New York, 1983).
[12] P. Lichter, Realisierung eines VLSI-Chips für das Gleitkomma-Skalarprodukt der Kulisch-Arithmetik, Diplomarbeit, Fachbereich 10, Angewandte Mathematik und Informatik, Universität des Saarlandes (1988).
[13] M. Müller, Entwurf eines Chips für auslöschungsfreie Summation von Gleitkommazahlen, Ph.D. Thesis, Universität des Saarlandes (1993).
[14] M. Müller, Ch. Rüb and W. Rülling, Exact accumulation of floating-point numbers, in: Proc. 10th IEEE Symp. on Computer Arithmetic (1991) 64-69.
[15] S.M. Rump, Solving non-linear systems with least significant bit accuracy, Computing 29 (1982) 183-200.
[16] R.J.W.T. Tangelder, The design of chip architectures for accurate inner product computation, Ph.D. Thesis, Eindhoven University of Technology (1992).
[17] Th. Winter, Ein VLSI-Chip für Gleitkomma-Skalarprodukt mit maximaler Genauigkeit, Diplomarbeit, Fachbereich 10, Angewandte Mathematik und Informatik, Universität des Saarlandes (1985).