JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 8, 72-76 (1990)
RESEARCH NOTE
On the Modulo M Translators for the Prime Memory System
HYUNSOO YOON,* KYUNGSOOK Y. LEE, AND AMOS BAHIRI
Department of Computer and Information Science, The Ohio State University, 2036 Neil Avenue Mall, Columbus, Ohio 43210
* H. Yoon is now with Korea Institute of Science and Technology, Seoul, Korea. This work was performed while he was a graduate student at Ohio State University.

0743-7315/90 $3.00 Copyright © 1990 by Academic Press, Inc. All rights of reproduction in any form reserved.

A fast binary to modulo M translation, where M is a prime number, is necessary in many applications. The prime memory system of an array processor which allows parallel, conflict-free access to various slices of a data array is one such application. We review existing modulo M translators, including the one used in the Burroughs Scientific Processor (BSP). We then propose a very simple and fast translator for a restricted but very useful class of M which includes, as a subclass, the form suggested by P. Budnik and D. J. Kuck (IEEE Trans. Comput. C-20, 12 (Dec. 1971), 1566-1569). Our basic idea stems from the traditional “casting out nine” rule, which has been used as a quick method for checking the accuracy of remainders in the decimal number system. By generalizing this rule and extending it for the binary number system, we have obtained a new modulo M translator. The new modulo M translator is built upon a set of modulo adders, and is simpler and faster than previous modulo M translators such as the one employed in the BSP. © 1990 Academic Press, Inc.

1. INTRODUCTION

An SIMD shared-memory parallel processor consists of a number of processors and memory modules. For such a system to be effective, the memory system should be capable of providing a sufficient bandwidth for all the concurrently executing processors. For a parallel processor with P processors and M memory modules, it is desirable to fetch P words of data (called a P-vector) to be used by P processors in a single memory cycle. To that end, in addition to the provision of enough memory modules (such that $M \ge P$), a P-vector needs to be evenly distributed over memory modules so that two or more elements of the P-vector are not stored in the same memory module. If two or more elements of the required P-vector are stored in the same memory module, the P-vector cannot be fetched in one memory cycle, and the vector processing must be delayed until all the elements of the P-vector are available through subsequent memory fetches, degrading the system performance.

Although it is impossible to allow conflict-free accesses to parallel memories for all possible required vectors, it has been observed that it is possible to design a parallel memory system which provides conflict-free accesses to arrays for many useful access patterns observed in programs. In [4] Budnik and Kuck described a parallel memory system which allows conflict-free accesses to any “linear” two-dimensional-array slices such as rows, columns, diagonals, and inverse diagonals as well as square blocks. They showed that conflict-free accesses for all these patterns frequently used in matrix computations are possible when the parallel memory system consists of a prime number of memory modules M, where $M > P$ and P is a power of 2. Specifically, they recommended the use of a prime number M, $M = 2^{2L} + 1$ or $M = 2^{2L+1} - 1$ for integer L, and an array storage scheme which shifts the starting points of successive rows by the distance of $2^L$. See Fig. 1 for an example of such a memory system, where $P = 4$, $M = 5$, and the skewing distance is 2. Note that all the rows $(a_{i0}, a_{i1}, a_{i2}, a_{i3})$ (for $0 \le i \le 3$), all the columns $(a_{0j}, a_{1j}, a_{2j}, a_{3j})$ (for $0 \le j \le 3$), the diagonals such as $(a_{03}, a_{12}, a_{21}, a_{30})$, $(a_{13}, a_{22}, a_{31})$, $(a_{23}, a_{32})$, $(a_{33})$, as well as all
the inverse diagonals and $(2 \times 2)$ square blocks, can be accessed in a single memory cycle without a conflict. For more details on the parallel memory system, refer to [9, 4, 7, 8].

FIG. 1. $2^L$-skewing with $2^{2L} + 1$ memories ($L = 1$). Note that accessing rows, columns, diagonals, and square blocks are possible without a conflict.

The prime memory system has actually been adopted in some machine designs. The Burroughs Scientific Processor (BSP), which is an array processor, has P = 16 processors and M = 17 memory modules [9, 6]. The Flow Model Processor (FMP), which is an MIMD machine designed by Burroughs for the National Aeronautics and Space Administration, has P = 512 processors and M = 521 memory modules [10].

However, one major drawback of the prime memory system is the difficulty in the address generation. In any memory system involving multiple memory modules, the module number must be computed for every address generated. Determining the module number involves a modulo M calculation, where M is the total number of memory modules. Traditionally, for SISD systems with an interleaved memory, a power of 2 is used for M to speed up the modulo calculation. When M is a prime number, the modulo M calculation can be slow and expensive. Our concern in this paper is to find an easy and fast modulo M translator for a prime number M.

In Section 2, we review some existing modulo M translators, one of which was actually employed in the BSP. In Section 3, we propose a very simple and fast translator for the restricted but very useful form of M which includes, as a subclass, the form suggested by Budnik and Kuck [4]. Almost all the fast modulo M translators are based on modulo M adders. In Section 4, we show that the modulo M adder itself can be simplified for this particular form of M. Concluding remarks are given in Section 5.

2. BINARY TO MODULO M TRANSLATORS

A high-speed and low-cost binary to modulo M translator is quite desirable in many applications using the residue arithmetic techniques [13], and in particular the prime memory system. In this section, we review two existing modulo M translators. First we introduce some basic definitions and properties of the modulo arithmetic. X mod M, denoted $|X|_M$, is defined as

$$|X|_M = X - M \cdot \lfloor X/M \rfloor, \tag{1}$$

where $\lfloor Y \rfloor$ is the greatest integer less than or equal to Y, and M is a prime number. From this definition, many properties can be derived. Among them, two important characteristics for our purposes are

$$\left| \sum_{i=1}^{n} X_i \right|_M = \left| \sum_{i=1}^{n} |X_i|_M \right|_M, \tag{2}$$

$$\left| \prod_{i=1}^{n} X_i \right|_M = \left| \prod_{i=1}^{n} |X_i|_M \right|_M. \tag{3}$$

A straightforward modulo M translation of a binary number is performed by dividing the number by M and obtaining the remainder. However, the division of a large number can be slow and impractical for high-speed calculations. For a high-speed translation, the use of a division or a multiplication can be avoided by exploiting some of the modulo arithmetic characteristics. For example, a modulo M translator via modulo M adders was described in [1, 2], which we review in the following subsection.

2.1. A Translator Using Only Modulo M Adders [1, 2]

Let X be an n-bit number, i.e., $X = b_{n-1} \ldots b_1 b_0 = \sum_{i=0}^{n-1} b_i 2^i$. Then by Eq. (2), $|X|_M = \left| \sum_{i=0}^{n-1} b_i 2^i \right|_M = \left| \sum_{i=0}^{n-1} |b_i 2^i|_M \right|_M$. Since $|b_i 2^i|_M = |b_i |2^i|_M|_M$ by Eq. (3) and $|2^i|_M = c_i$ can be precomputed, $|X|_M = \left| \sum_{i=0}^{n-1} b_i c_i \right|_M$. Since $b_i = 0$ or 1 in the binary number system, $|X|_M$ can be obtained by modulo M additions, given the $c_i$'s.

EXAMPLE 1. Let X be an 8-bit number, $X = b_7 \ldots b_1 b_0$ and $M = 7$. Precompute $c_i = |2^i|_7$ to obtain 1, 2, 4, 1, 2, 4, 1, 2, for $0 \le i \le 7$. Thus $|X|_7 = \left| \sum_{i=0}^{7} c_i b_i \right|_7 = |2b_7 + b_6 + 4b_5 + 2b_4 + b_3 + 4b_2 + 2b_1 + b_0|_7$. It can be verified easily that the translator of Fig. 2 consisting only of modulo 7 adders does this computation.

Although only modulo adders are used, this scheme may not be suitable for a large number. Note that if X is an n-bit number, n terms must be added, and we need $(n - 1)$ modulo adders in $\lceil \log n \rceil$ levels.

2.2. A Translator Using ROMs and Modulo M Adders: The BSP Approach
To remedy the drawback of the previous scheme for large n, another scheme was proposed by Vora [14], which was employed in the prime memory system of the BSP [9, 6, 8]. In this scheme, a binary number is partitioned into contiguous segments of approximately k bits each, compared to one-bit segments used in the previous scheme. The modulo M translation for each segment is predetermined and stored in the individually associated ROM. Thus the modulo M translation of each k-bit segment can be obtained from the ROM using the k bits as the ROM word address.
FIG. 2. An 8-bit number to modulo 7 translator through modulo adders [2, 3].
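The adder-only translator of Section 2.1 (Example 1, Fig. 2) can be modeled in software roughly as follows. This is only a sketch under stated assumptions: the function names `residues_of_powers`, `mod_add`, and `translate_bitwise` are hypothetical, and the pairwise tree is one plausible arrangement of the $(n - 1)$ adders, not necessarily the wiring of Fig. 2.

```python
def residues_of_powers(n_bits, modulus):
    """Precompute c_i = 2**i mod modulus for i = 0 .. n_bits - 1."""
    return [pow(2, i, modulus) for i in range(n_bits)]


def mod_add(a, b, modulus):
    """Model of one modulo-M adder; both inputs are already reduced."""
    s = a + b
    return s - modulus if s >= modulus else s


def translate_bitwise(x, n_bits, modulus):
    """Reduce x modulo `modulus` using only modulo additions.

    Returns (residue, number_of_adder_levels).
    """
    c = residues_of_powers(n_bits, modulus)
    # Each term is b_i * c_i: c_i when bit i is set, 0 otherwise.
    terms = [c[i] if (x >> i) & 1 else 0 for i in range(n_bits)]
    levels = 0
    # Combine the terms pairwise, one level of adders per pass.
    while len(terms) > 1:
        nxt = [mod_add(terms[j], terms[j + 1], modulus)
               for j in range(0, len(terms) - 1, 2)]
        if len(terms) % 2:
            nxt.append(terms[-1])  # odd term passes through to the next level
        terms = nxt
        levels += 1
    return terms[0], levels
```

For the 8-bit, modulo 7 case, `residues_of_powers(8, 7)` gives the constants 1, 2, 4, 1, 2, 4, 1, 2 of Example 1, and `translate_bitwise(x, 8, 7)` combines the eight terms in three levels.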
The ROM outputs are then combined by the modulo M additions to complete the overall translation. Essentially each ROM replaces $(k - 1)$ modulo adders in $\lceil \log k \rceil$ levels of the previous scheme, speeding up the translation for large n.

EXAMPLE 2. Let X be a 23-bit number, $X = b_{22} \ldots b_1 b_0$ as in the BSP, and M be a prime number 67. X is partitioned into three segments, 8-bit, 8-bit, and 7-bit each, starting from the most significant bit side. Then by Eq. (2),

$$|X|_{67} = \left| \, |b_{22} 2^{22} + \cdots + b_{15} 2^{15}|_{67} + |b_{14} 2^{14} + \cdots + b_7 2^7|_{67} + |b_6 2^6 + \cdots + b_0|_{67} \, \right|_{67},$$

which can be computed by the translator of Fig. 3. The eight high-order bits address the $(256 \times 7)$ ROM2, which translates $|b_{22} 2^{22} + \cdots + b_{15} 2^{15}|_{67}$, and the eight middle bits address the $(256 \times 7)$ ROM1, which translates $|b_{14} 2^{14} + \cdots + b_7 2^7|_{67}$. The remaining seven low-order bits may represent numbers from 0 to 127. Thus, the modulo 67 translation of the third segment is either the binary number itself (0 to 66) or the binary number (67 to 127) minus 67. Hence, in this particular example, the last segment does not need a ROM and can be translated by one adder and one multiplexer. In general, this scheme needs $\lceil n/k \rceil$ $(2^k \times \lceil \log M \rceil)$-bit ROMs and $(\lceil n/k \rceil - 1)$ modulo adders in $\lceil \log (n/k) \rceil$ levels.

3. A TRANSLATOR WHEN M IS ALMOST A POWER OF 2

The principles of the two binary to modulo M translators of Section 2 are applicable to any prime number M. However, if the values of M are restricted, we can obtain simpler and faster translators. Recalling that the value of M for the prime memory system suggested by Budnik and Kuck [4] is of the restricted form, $2^{2L} + 1$ or $2^{2L+1} - 1$, we present in this section efficient and fast translators when M is of the form of $2^m \pm k$.

First we consider the case when the prime number M is of the form $M = 2^m - k$, where $0 < k < 2^{m-1}$. Given an n-bit number $X = b_{n-1} \ldots b_1 b_0$, consider it as a $2^m$-ary number by grouping blocks of m bits together,¹ i.e., $X = a_t \ldots a_1 a_0$, where $a_i = b_{im+m-1} \ldots b_{im+1} b_{im}$, $0 \le a_i < 2^m$ and $t = \lfloor (n - 1)/m \rfloor$. Then the following theorem holds.

THEOREM 1. For any $2^m$-ary number $X = a_t \ldots a_1 a_0$, and $M = 2^m - k$, $0 < k < 2^{m-1}$, $|X|_M = \left| \sum_{i=0}^{t} a_i |k^i|_M \right|_M$.

Proof. Let $A = 2^m$. Then it can be proved easily that $|A|_M = |A|_{A-k} = k$ from Eq. (1). Now, $|X|_M = \left| \sum_{i=0}^{t} a_i A^i \right|_M = \left| \sum_{i=0}^{t} a_i |A^i|_M \right|_M$ by Eqs. (2) and (3), and since $|A^i|_M = \left| \prod_{j=1}^{i} |A|_M \right|_M = \left| \prod_{j=1}^{i} k \right|_M = |k^i|_M$, it follows that $|X|_M = \left| \sum_{i=0}^{t} a_i |k^i|_M \right|_M$.

This theorem benefits from the fact that it is easier to compute $|k^i|_M$ than $|A^i|_M$, since $A = 2^m$ is larger than k, and the smaller the k, the easier the computation.

A similar result follows when M is of the form $M = 2^m + k$, where $0 < k < 2^{m-1}$, as shown in the following corollary, given without proof.

COROLLARY 1. For any $2^m$-ary number $X = a_t \ldots a_1 a_0$, and $M = 2^m + k$, $0 < k < 2^{m-1}$, $|X|_M = \left| \sum_{i=0}^{t} a_i |(-k)^i|_M \right|_M$.

As mentioned earlier, the usefulness of Theorem 1 and Corollary 1 becomes apparent when the value of k is small, i.e., when M is slightly greater or less than a power of 2. In particular, when the value of k is 1, i.e., $M = 2^m \pm 1$ as suggested by Budnik and Kuck [4], the translation of the binary number to modulo M becomes very easy.

COROLLARY 2. For any $2^m$-ary number $X = a_t \ldots a_1 a_0$, $|X|_M = \left| \sum_{i=0}^{t} a_i \right|_M$ if $M = 2^m - 1$, and $|X|_M = \left| \sum_{i=0}^{t} (-1)^i a_i \right|_M$ if $M = 2^m + 1$.
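The reduction stated in Theorem 1 and Corollary 1 can be checked numerically with a short sketch. The names `digits_base_2m` and `translate_near_power_of_two` and the `sign` parameter are hypothetical, introduced only for illustration. Python's `%` operator stands in for the small modulo adders and multipliers a hardware translator would use, so the sketch models the arithmetic identity rather than any circuit.

```python
def digits_base_2m(x, m):
    """Split x into m-bit digits a_0, a_1, ... (least significant first)."""
    mask = (1 << m) - 1
    digits = []
    while True:
        digits.append(x & mask)
        x >>= m
        if x == 0:
            return digits


def translate_near_power_of_two(x, m, k, sign):
    """Reduce x modulo M = 2**m - k (sign = -1) or M = 2**m + k (sign = +1)."""
    modulus = (1 << m) + sign * k
    # 2**m is congruent to k when M = 2**m - k, and to -k when M = 2**m + k.
    base = k if sign < 0 else -k
    total = 0
    weight = 1  # residue of base**i, starting with i = 0
    for a in digits_base_2m(x, m):
        total = (total + a * weight) % modulus
        weight = (weight * base) % modulus
    return total
```

With `k = 1` the weights are all 1 (for `sign = -1`) or alternate between 1 and M - 1, i.e. -1 modulo M (for `sign = +1`), which is the special case of Corollary 2.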
We note here that Theorem 1 and Corollaries 1 and 2 are generalizations of the traditional “casting out nine” rule [11], which has been used as a general method for checking the accuracy of numerical computations. This rule, which is of ancient origin but has largely gone out of use nowadays, states that when a decimal number is divided by 9, the remainder is the same as the sum of the digits modulo 9. For example, 39,827,437 mod 9 = (3 + 9 + 8 + 2 + 7 + 4 + 3 + 7) mod 9 = 43 mod 9 = (4 + 3) mod 9 = 7.

EXAMPLE 3. Again consider the problem of Example 1, where X is an 8-bit number $b_7 \ldots b_1 b_0$ and $M = 7$. Since $M = 2^3 - 1$, we can use the result of Corollary 2. Let $a_2 a_1 a_0$ be an octal representation of X. Then by Corollary 2, $|X|_7 = |a_2 + a_1 + a_0|_7$, i.e., $|b_7 b_6 + b_5 b_4 b_3 + b_2 b_1 b_0|_7$, which can be computed by the translator of Fig. 4. Compare the two translators of Figs. 4 and 2 to observe the simplicity achieved by Theorem 1 and Corollary 1.
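The decimal rule and its octal counterpart from Example 3 can be put side by side in a small sketch. The function names `cast_out_nines` and `mod7_by_octal_digits` are hypothetical. Both functions repeat the digit sum until a single digit remains and then map the all-ones digit (9 or 7) to zero.

```python
def cast_out_nines(n):
    """Remainder of n modulo 9 by repeated decimal digit sums."""
    while n >= 10:
        n = sum(int(d) for d in str(n))
    return 0 if n == 9 else n


def mod7_by_octal_digits(x):
    """Remainder of x modulo 7 by repeated octal (3-bit) digit sums."""
    while x >= 8:
        s = 0
        while x:
            s += x & 0b111  # add the lowest 3-bit digit
            x >>= 3
        x = s
    return 0 if x == 7 else x


print(cast_out_nines(39827437))  # 7, as in the worked example above
```

The structure is identical in both bases; only the digit width changes, which is why the binary version needs nothing beyond a few narrow adders.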
FIG. 3. A 23-bit number to modulo 67 translator [14].
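A rough software model of the ROM-based translator of Fig. 3 (Example 2) is sketched below. The table names `ROM2` and `ROM1` follow the text, while `mod_add` and `translate_rom` are hypothetical helpers. The tables are filled here by direct computation, which is an assumption about how the ROM contents would be prepared.

```python
M = 67

# ROM contents: residue of each 8-bit segment value, already weighted
# by the position of that segment within the 23-bit number.
ROM2 = [(v << 15) % M for v in range(256)]  # bits b22..b15
ROM1 = [(v << 7) % M for v in range(256)]   # bits b14..b7


def mod_add(a, b, modulus=M):
    """Model of one modulo-M adder; both inputs are already reduced."""
    s = a + b
    return s - modulus if s >= modulus else s


def translate_rom(x):
    """Reduce a 23-bit x modulo 67 with two ROM lookups and two adders."""
    hi = (x >> 15) & 0xFF
    mid = (x >> 7) & 0xFF
    lo = x & 0x7F
    # Low segment: the value itself (0..66) or the value minus 67 (67..127).
    lo_res = lo - M if lo >= M else lo
    return mod_add(mod_add(ROM2[hi], ROM1[mid]), lo_res)
```

Every table entry is below 67 and therefore fits in 7 bits, matching the $(256 \times 7)$ ROM size quoted in Example 2.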
¹ For the prime memory system, usually $X \gg M$ since X is the memory address and M is the number of memory modules.
FIG. 4. The 8-bit number to modulo 7 translator by Corollary 2.
FIG. 5. The 23-bit number to modulo 17 translator by Corollary 3.
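As a rough software model of what a modulo 17 translator such as the one in Fig. 5 computes, the sketch below relies on two facts: $2^4 \equiv -1 \pmod{17}$ and $2^8 \equiv 1 \pmod{17}$. The function names are hypothetical, and the final `% 17` stands in for the modulo 17 adders of the figure.

```python
def mod17_hex_alternating(x):
    """Alternating sum of 4-bit digits, reduced modulo 17 (2**4 = -1 mod 17)."""
    total = 0
    sign = 1
    while x:
        total += sign * (x & 0xF)
        sign = -sign
        x >>= 4
    return total % 17


def mod17_byte_sum(x):
    """Plain sum of 8-bit segments, reduced modulo 17 (2**8 = 1 mod 17)."""
    total = 0
    while x:
        total += x & 0xFF
        x >>= 8
    return total % 17
```

For a 23-bit input the first form combines six 4-bit digits with alternating signs, while the second adds only three segments with no sign handling, which is the cost versus time trade-off between segment sizes.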
The implication of Corollary 2 is that when M is $2^m \pm 1$, X modulo M can be obtained easily by considering the binary number X as a $2^m$-ary number, i.e., by partitioning X into m-bit segments. We note that a similar result would follow if X is partitioned into any (multiple of m)-bit segments. In terms of the decimal system, the old “casting out nine” rule can be generalized into, say, “casting out 99” or “casting out 999” or even “casting out $10^p - 1$” for any positive integer p. Depending upon the magnitude of the given number X, a particular segment size may be preferred over others, providing cost versus time trade-offs. The following corollary states this.

COROLLARY 3. For any $2^{pm}$-ary number $X = d_t \ldots d_1 d_0$, and $M = 2^m - 1$, $|X|_M = \left| \sum_{i=0}^{t} d_i \right|_M$, and when $M = 2^m + 1$, $|X|_M = \left| \sum_{i=0}^{t} (-1)^{pi} d_i \right|_M$. In particular, if p is an even number, then for $M = 2^m + 1$, $|X|_M = \left| \sum_{i=0}^{t} d_i \right|_M$.

EXAMPLE 4. Let X be a 23-bit number $b_{22} \ldots b_1 b_0$ and $M = 17$ as in the BSP. Since $M = 2^4 + 1$, we can use Corollary 2. Let $a_5 a_4 a_3 a_2 a_1 a_0$ be a hexadecimal representation of X. Then by Corollary 2, $|X|_{17} = |-a_5 + a_4 - a_3 + a_2 - a_1 + a_0|_{17}$, where $a_i = b_{4i+3} b_{4i+2} b_{4i+1} b_{4i}$, which can be computed by only five modulo 17 adders (or subtractors) in three levels. A further simplification may be obtained by Corollary 3. Let $d_2 d_1 d_0$ be a 256-ary representation of X, where $d_2 = b_{22} \ldots b_{16}$, $d_1 = b_{15} \ldots b_8$, and $d_0 = b_7 \ldots b_0$. Then by Corollary 3, $|X|_{17} = |d_2 + d_1 + d_0|_{17}$, which can be computed by two 8-bit modulo 17 adders in two levels (see Fig. 5). It is seen that this translator is much simpler and faster than the one used in the BSP, where ROMs are used as in the translator of Fig. 3.

4. A MODULO M ADDER

As we have seen in earlier sections, the modulo M adder is a key component for any fast modulo M translation. There have been many approaches to the design of a general modulo adder, such as the magnetic matrix implementation [13], the direct logical implementation [13], the rotated selection method [1], and the method based on the property of cyclic groups [15, 12]. However, the restriction in the form of M, $M = 2^m - 1$, enables us to design a very simple and fast modulo M adder as well. This technique, called modulo substitution [13], is to use an end-around-carry adder. Suppose $M = 2^m - 1$ and an m-bit ordinary adder is used. For the operands A and B of m bits, the following cases can arise.

1. $A + B < 2^m - 1$: In this case, $|A + B|_M = A + B$.
2. $A + B > 2^m - 1$: In this case, there will be a carry out, and $|A + B|_M$ is obtained by adding 1 to the result, i.e., by feeding back the carry-out signal to the carry-in input point of the adder.
3. $A + B = 2^m - 1$: In this case, the result in binary notation will be 11...1, which can be converted to the correct one, 00...0, in 2's complement number system. (If 1's complement system is used, no correction is necessary.)

An example of a modulo $2^4 - 1$ adder is shown in Fig. 6.
FIG. 6. An efficient modulo $2^4 - 1$ adder.
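The three cases of the end-around-carry adder for $M = 2^m - 1$ can be modeled in a few lines. This is a behavioral sketch only: the function name `end_around_carry_add` is hypothetical, and the code assumes both operands are already reduced, i.e. lie in the range 0 to $2^m - 2$.

```python
def end_around_carry_add(a, b, m):
    """Add a and b modulo 2**m - 1; both must lie in 0 .. 2**m - 2."""
    mask = (1 << m) - 1     # m ones; numerically equal to M = 2**m - 1
    s = a + b               # ordinary addition; bit m holds the carry out
    carry = s >> m
    s = (s & mask) + carry  # case 2: feed the carry out back as carry in
    if s == mask:           # case 3: all ones is the other encoding of zero
        s = 0
    return s                # case 1 falls through unchanged
```

For `m = 4` this reproduces addition modulo 15. Because reduced operands sum to at most $2^{m+1} - 4$, the fed-back carry can never cause a second overflow, so a single pass is enough.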
5. CONCLUSIONS
We have considered modulo M translators, especially for $M = 2^m \pm 1$. By generalizing the traditional “casting out nine” rule for the binary number system, a simple and fast translator was obtained. In many cases, the restriction of the prime number M to the form of $2^m \pm 1$ does not necessarily restrict the applications of the translator considered. We have already seen the example that $\{M \mid M = 2^m \pm 1\}$ is a superset of the suitable prime numbers M for the prime memory system of an array processor. Also, it is a superset of the Mersenne primes, the primes of the form $M = 2^p - 1$, where p is also a prime, which have been used in many applications, e.g., high-speed digital multipliers [5], where we also need modulo M translators. Since almost all computer systems use the binary number system, we prefer M to be a power of 2, $2^m$, to make a multiplication or a division by M easy. However, if the value of M must be a prime due to the characteristics of the application, the best choice of M would be the prime nearest to $2^m$, to still take advantage of the binary number system. In those cases, the modulo M translator considered in this paper would be very useful.

REFERENCES

1. Banerji, D. K. A novel implementation method for addition and subtraction in residue number systems. IEEE Trans. Comput. C-23, 1 (Jan. 1974), 106-109.
2. Banerji, D. K., and Brzozowski, J. A. Sign detection in residue number systems. IEEE Trans. Comput. C-18, 4 (Apr. 1969), 313-320.
3. Banerji, D. K., and Brzozowski, J. A. On translation algorithms in residue number systems. IEEE Trans. Comput. C-21, 12 (Dec. 1972), 1281-1285.
4. Budnik, P., and Kuck, D. J. The organization and use of parallel memories. IEEE Trans. Comput. C-20, 12 (Dec. 1971), 1566-1569.
5. Fraenkel, A. S. The use of index calculus and Mersenne primes for the design of a high-speed digital multiplier. J. Assoc. Comput. Mach. 8 (1961), 87-96.
6. Kuck, D. J., and Stokes, R. A. The Burroughs Scientific Processor (BSP). IEEE Trans. Comput. C-31, 5 (May 1982), 363-376.
7. Lawrie, D. H. Access and alignment of data in an array processor. IEEE Trans. Comput. C-24, 12 (Dec. 1975), 1145-1155.
8. Lawrie, D. H., and Vora, C. R. Multidimensional parallel access computer memory system. U.S. Patent, No. 4,051,551, Sept. 27, 1977.
9. Lawrie, D. H., and Vora, C. R. The prime memory system for array access. IEEE Trans. Comput. C-31, 5 (May 1982), 435-442.
10. Lundstrom, S. F., and Barnes, G. H. A controllable MIMD architecture. Proc. 1980 International Conference on Parallel Processing, Aug. 1980, pp. 19-27.
11. Ore, O. Number Theory and Its History. McGraw-Hill, New York, 1948, Chap. 9.
12. Pries, W., Thanailakis, A., and Card, H. C. Group properties of cellular automata and VLSI applications. IEEE Trans. Comput. C-35, 12 (Dec. 1986), 1013-1024.
13. Szabo, N. S., and Tanaka, R. I. Residue Arithmetic and Its Applications to Computer Technology. McGraw-Hill, New York, 1967.
14. Vora, C. R. Binary to modular M translation. U.S. Patent, No. 3,980,874, Sept. 14, 1976.
15. Yau, S. S., and Chung, J. On the design of modulo arithmetic units based on cyclic groups. IEEE Trans. Comput. C-25, 11 (Nov. 1976), 1057-1067.

Received October 23, 1987; revised August 2, 1988
HYUNSOO YOON received the B.S. degree in electronics engineering from the Seoul National University, Korea, in 1979, the M.S. degree in computer science from the Korea Advanced Institute of Science and Technology, in 1981, and the Ph.D. degree in computer and information science from The Ohio State University, Columbus, in 1988. From 1978 to 1980, he was with the Tongyang Broadcasting Company, Korea, and from 1980 to 1984, with the Computer Division of the Samsung Electronics Company, Korea. From 1988 to 1989 he was a member of the Technical Staff at the AT&T Bell Laboratories, Naperville, Illinois. Currently he is an assistant professor in the Department of Computer Science, Korea Advanced Institute of Science and Technology, Seoul, Korea. His main research interests include parallel computer architecture and communication protocols.

KYUNGSOOK Y. LEE received the B.S. degree in chemistry from the Sogang University, Seoul, Korea, in 1970, the M.S. degree in computer science from the University of Utah, Salt Lake City, in 1976, and the Ph.D. degree in computer science from the University of Illinois at Urbana-Champaign, in 1983. Since 1983 she has been an assistant professor in the Department of Computer and Information Science, The Ohio State University, Columbus. Her research interests are in parallel computing and parallel computer architecture.