INTEGRATION, the VLSI journal 45 (2012) 388–394
Area-time efficient end-around inverted carry adders
H.T. Vergos *
Computer Engineering & Informatics Department, University of Patras, 26 500, Greece
Article info
Article history: Received 15 March 2011; received in revised form 28 September 2011; accepted 14 November 2011; available online 22 November 2011
Keywords: Modulo $2^n + 1$ arithmetic; Diminished-1 representation; Parallel-prefix carry computation

Abstract
This manuscript proposes novel architectures for end-around inverted carry adders. They use a sparse carry computation unit that derives only some of the carries in $\log_2 n$ prefix levels, while all the rest are computed in one extra level. When used for the design of modulo $2^n + 1$ adders, the proposed designs offer significant area and power savings compared to earlier proposals, while maintaining a high operation speed. © 2011 Elsevier B.V. All rights reserved.
1. Introduction

A number of algorithms met in a variety of applications, ranging from random number generation and cryptography [1] to convolution/correlation computation without rounding and truncation errors [2], rely on modulo $2^n + 1$ arithmetic. A channel performing its operations in modulo $2^n + 1$ arithmetic is also commonly met in residue number system (RNS) [3–5] applications. The RNS has been proposed as a faster alternative to the binary representation for the design of FIR filters [6], specialized digital signal processors [7] and communication components [8]. Three-moduli bases of the form $\{2^n - 1, 2^n, 2^n + 1\}$ have received significant attention for an RNS. Therefore, the design of efficient modulo $2^n + 1$ arithmetic components is vital for RNS-based applications.

For deriving efficient components for modulo $2^n + 1$ arithmetic, several representations have been researched (for example the carry-save diminished-1 [9] and the stored unibit transfer [10]); the best known are the normal weighted representation and the diminished-1 representation [11]. Irrespective of the representation used, all these components require a two-operand adder, which is mainly built on an end-around inverted carry (EAIC) adder. More specifically, it has recently been shown [12] that a two-operand adder for the weighted representation can be designed efficiently by using an EAIC adder and a carry-save adder stage, while a diminished-1 two-operand adder requires an EAIC adder and a few more logic gates for handling zero operands and results.
* Tel.: +30 2610962924; fax: +30 2610991909.
0167-9260/$ - see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.vlsi.2011.11.003
Since the direct connection of the inverted carry output to the carry input of an integer adder would create a combinational loop and lead to an unwanted race condition, efficient designs for EAIC adders have been the focus of several research efforts. Carry lookahead (CLA) adder architectures that take into account the inverted output carry equation have appeared in [13]. In [14], modulo $2^n + 1$ adders for diminished-1 operands have been considered. These are actually EAIC adders with a parallel-prefix carry computation unit. They are based on extending the carry computation unit of an integer adder by an extra prefix level for handling the EAIC. In [13] it has been shown that the recirculation of the EAIC can be performed within the existing prefix levels of an integer adder. As a result, the extra prefix level is no longer required and parallel-prefix EAIC adders have been offered that can operate as fast as their integer counterparts, that is, they offer a logic depth of $\log_2 n$ prefix levels. Unfortunately, this level of performance requires significantly more cell and interconnect area than the solutions of [14]. In [15], select-prefix EAIC adders have been proposed that aim at reducing the area complexity of the parallel-prefix solutions. These adders offer a lower operating speed than the parallel-prefix ones of [13], but within an implementation area close to that of the CLA ones [13] or to that of the parallel-prefix ones with the extra prefix level [14]. Finally, fast implementations of EAIC adders that rely on the use of Ling carries have appeared in [16]. However, they also require increased cell and interconnect area over the solutions of [14].

In this manuscript, the architectures of [14,13] are considered as two end cases in which only one or all the carries of the EAIC addition are computed within $\log_2 n$ prefix levels, respectively. It is shown that a number of solutions exist in between, each with its own fan-out, cell and interconnect requirements. The experimental exploration of two such solutions reveals increased operation speed and significantly reduced cell and interconnect area over the solutions of [14].
The rest of the manuscript is organized as follows. Some background issues on parallel-prefix carry computation in integer and EAIC adders are briefly revisited in the next Section. In Section 3 the proposed family of adders is presented. Quantitative comparison results against previous solutions are presented in Section 4.
2. Preliminaries

2.1. Parallel-prefix addition

The addition of $A = a_{n-1} a_{n-2} \ldots a_1 a_0$ and $B = b_{n-1} b_{n-2} \ldots b_1 b_0$ in an n-bit parallel adder can be considered as a three-stage process. During the first stage the carry generate, $g_i$, the carry propagate, $p_i$, and the half-sum, $h_i$, bits are computed for every $i$, $0 \le i \le n-1$, according to $g_i = a_i \cdot b_i$, $p_i = a_i + b_i$ and $h_i = a_i \oplus b_i$. The symbols $\cdot$, $+$, $\oplus$ and an overbar are used in the following to denote logic AND, inclusive-OR, exclusive-OR and complement, respectively. Then, the second stage (also called the carry computation unit) computes the carry signals, $c_i$, for $-1 \le i < n-1$, using as its inputs the carry generate and propagate bits. Finally, in the third stage the sum $S = s_{n-1} s_{n-2} \ldots s_1 s_0$ is computed by $s_i = h_i \oplus c_{i-1}$. The prefix ($\circ$) operator [17], which associates pairs of generate and propagate signals by:

$$(g_m, p_m) \circ (g_k, p_k) = (g_m + p_m \cdot g_k,\; p_m \cdot p_k) \tag{1}$$

allows carry computation to be mapped into a prefix problem. The notation $(G_{k:j}, P_{k:j})$ is commonly used to denote the group generate and propagate signals that result after a series of consecutive generate/propagate pair associations, that is:

$$(G_{k:j}, P_{k:j}) = (g_k, p_k) \circ (g_{k-1}, p_{k-1}) \circ \cdots \circ (g_{j+1}, p_{j+1}) \circ (g_j, p_j) \tag{2}$$

Since every carry $c_i$ in an integer adder is equal to $G_{i:0}$, a number of distinct algorithms have been introduced for computing all the carries using only $\circ$ operators. Such algorithms lead to a carry computation unit composed of interconnected blocks that each implement a prefix operator; these are well known as parallel-prefix carry computation units. The algorithms are most often represented by acyclic directed graphs in which the required $\circ$ operators constitute the black nodes.

2.2. Parallel-prefix EAIC addition

The EAIC adders proposed in [14] have a carry computation unit composed of $\log_2 n + 1$ levels. The first $\log_2 n$ prefix levels are those of an integer parallel-prefix adder, and in [14] two different algorithms were considered for them, namely the Ladner–Fischer (LF) [18] and the Kogge–Stone (KS) [19] algorithms. The Kogge–Stone architecture requires the largest number of prefix operators but has the smallest possible fan-out, which is equal to 2. The low fan-out helps in achieving lower delay, at the cost of additional power due to the increased number of operators. In contrast, the Ladner–Fischer design shares the intermediate results as much as possible and thus requires the smallest number of prefix operators, but suffers from high fan-out lines that increase its delay compared to the Kogge–Stone architecture. These first $\log_2 n$ prefix levels may also be designed according to the hybrid architectures proposed by Knowles [20], which mix levels from the Kogge–Stone and the Ladner–Fischer architectures to achieve intermediate solutions: slightly slower but more area efficient than the Kogge–Stone proposal, and slightly faster but requiring more area (and power) than the approach proposed by Ladner and Fischer. The last level, which is a late carry increment stage, uses the EAIC and the group generate and propagate signals already derived at each bit position. Fig. 1(a) presents the proposal of [14] for an LF EAIC 16-bit adder. The adders proposed by [14] unfortunately suffer from a fan-out equal to $n$ at the EAIC signal.

For n-bit modular and EAIC adders, the group generate and propagate pair of terms $(G_{k:j}, P_{k:j})$ can be defined even when $j > k$, in a circular manner, as:

$$(G_{k:j}, P_{k:j}) = (G_{k:0} + P_{k:0} \cdot G_{n-1:j},\; P_{k:0} \cdot P_{n-1:j})$$

For the EAIC addition, parallel-prefix adders have been developed [13] by showing that the EAIC carries $c^*_i$, $0 \le i \le n-2$, are equal to $G^*_i$, where $(G^*_i, P^*_i)$ are computed by:

$$(G^*_i, P^*_i) = (G_{i:0}, P_{i:0}) \circ \overline{(G_{n-1:i+1}, P_{n-1:i+1})} \tag{3}$$

and $c^*_{-1} = \overline{G_{n-1:0}}$. By definition, $\overline{(g, p)}$ is equal to $(\overline{g}, p)$. It should be noted that the above equations have a cyclic form and, in contrast to integer addition, the number of generate and propagate pairs that have to be associated for each carry is equal to $n$. This means that the parallel-prefix carry computation unit of an EAIC adder has significantly higher area complexity than that of a corresponding integer adder.

Fig. 1(b) presents the proposal of [13] for a 16-bit EAIC adder. The adders proposed in [13] have a carry computation unit composed of just $\log_2 n$ prefix levels. For implementing (3), for every $i$, within $\log_2 n$ prefix levels, a transformation method was proposed in [13]. For example, $c^*_1 = (G_{1:0}, P_{1:0}) \circ \overline{(G_{15:2}, P_{15:2})}$ is equivalently computed as $c^*_1 = \overline{(\overline{p_1}, \overline{g_1}) \circ (\overline{p_0}, \overline{g_0}) \circ (G_{15:2}, P_{15:2})}$. This unfortunately leads to a parallel-prefix computation unit that needs a double computation tree. One tree is used to associate generate and propagate signals in their normal form, while the second associates their complemented form. This is indicated in Fig. 1(b) by the double operators required in some columns of the same prefix level. By comparing Fig. 1(a) and (b) it becomes obvious that the increased speed of [13] comes at the penalty of heavily increased cell and interconnect area. The same observation holds for the full parallel prefix (FPP) and reduced area parallel prefix (RAPP) architectures proposed in [16], which follow a prefix algorithm similar to that of [13] but rely on Ling carries. It should be noted that as we move to deeper sub-micron technologies, interconnect parasitics have a growing effect on the delay of a design. To this end, in the next section, novel architectures for EAIC adders are presented, with significantly reduced cell and interconnection area requirements.
3. Proposed EAIC adders

The proposed family of EAIC adders stems from considering the proposals of [14,13] as the two end cases of the number of EAIC addition carries that are computed within the first $\log_2 n$ prefix levels. In the first case, only one carry is computed within $\log_2 n$ prefix levels, while in the second case every carry is computed. The first choice leads to high fan-out, whereas the second leads to increased cell and interconnect requirements. The proposed new adders are derived by considering the alternative of computing only some of the carries within the first $\log_2 n$ prefix levels. To this end, it is first shown that $c^*_{i+1}$ can be computed based on $c^*_i$. Suppose that $c^*_i$ has been computed according to (3). It then holds that:

$$
\begin{aligned}
(g_{i+1}, p_{i+1}) \circ (G^*_i, P^*_i)
&= (g_{i+1}, p_{i+1}) \circ (G_{i:0}, P_{i:0}) \circ \overline{(G_{n-1:i+1}, P_{n-1:i+1})} \\
&= (g_{i+1}, p_{i+1}) \circ (G_{i:0}, P_{i:0}) \circ \overline{(G_{n-1:i+2}, P_{n-1:i+2}) \circ (g_{i+1}, p_{i+1})} \\
&= (g_{i+1}, p_{i+1}) \circ (G_{i:0}, P_{i:0}) \circ (\overline{G_{n-1:i+2} + P_{n-1:i+2}\, g_{i+1}},\; P_{n-1:i+2}\, p_{i+1}) \\
&= (g_{i+1} + p_{i+1} G_{i:0},\; p_{i+1} P_{i:0}) \circ (\overline{G_{n-1:i+2}}\,(\overline{P_{n-1:i+2}} + \overline{g_{i+1}}),\; P_{n-1:i+2}\, p_{i+1}) \\
&= (g_{i+1} + p_{i+1} G_{i:0} + p_{i+1} P_{i:0}\, \overline{G_{n-1:i+2}}\; \overline{P_{n-1:i+2}} + p_{i+1} P_{i:0}\, \overline{G_{n-1:i+2}}\; \overline{g_{i+1}},\; p_{i+1} P_{i:0} P_{n-1:i+2}) \\
&= (g_{i+1} + p_{i+1} G_{i:0} + p_{i+1} P_{i:0}\, \overline{G_{n-1:i+2}}\; \overline{P_{n-1:i+2}} + p_{i+1} P_{i:0}\, \overline{G_{n-1:i+2}},\; p_{i+1} P_{i:0} P_{n-1:i+2}) \\
&= (g_{i+1} + p_{i+1} G_{i:0} + p_{i+1} P_{i:0}\, \overline{G_{n-1:i+2}},\; p_{i+1} P_{i:0} P_{n-1:i+2}) \\
&= (g_{i+1}, p_{i+1}) \circ (G_{i:0} + P_{i:0}\, \overline{G_{n-1:i+2}},\; P_{i:0} P_{n-1:i+2}) \\
&= (g_{i+1}, p_{i+1}) \circ (G_{i:0}, P_{i:0}) \circ \overline{(G_{n-1:i+2}, P_{n-1:i+2})} \\
&= (G_{i+1:0}, P_{i+1:0}) \circ \overline{(G_{n-1:i+2}, P_{n-1:i+2})}
\end{aligned}
\tag{4}
$$

where the group generate term of the last relation is equal to $c^*_{i+1}$. That is, the next carry of the EAIC addition can be computed straightforwardly, by associating in a prefix operator the $(g_{i+1}, p_{i+1})$ pair of generate and propagate terms and the carry in the previous position. Since $c^*_{i+2}$ can be similarly computed using $c^*_{i+1}$, which as before can be computed based on $c^*_i$, it becomes obvious that we can compute every carry $c^*_{i+k}$ of the EAIC addition by associating $c^*_i$ and the $(G_{i+k:i+1}, P_{i+k:i+1})$ pair of group generate and propagate terms in a prefix operator. Stated otherwise, while the architecture of [14] computes the EAIC addition carry at position $i$ by the association of $(n + i)$ $(g, p)$ terms and the architecture of [13] uses $n$ such terms, relation (4) reveals that we can use any number of $(g, p)$ terms between these two extremes. As a result, we can compute in $\log_2 n$ prefix levels any number of the EAIC addition carries and then use as many of them as we wish to compute the rest in a further prefix level. In this way a whole family of EAIC adders is derived. While all adders of the family have a carry computation unit composed of $\log_2 n + 1$ prefix levels, each member has its own fan-out requirements and cell and interconnection area.

Fig. 1. The 16-bit EAIC adder of [14] (a) and [13] (b).

Fig. 2. Prop-n/2 (a) and Prop-n/4 (b) 16-bit EAIC adders.

The notation Prop-k is used in the following to denote the proposed adders in which $k$ out of the total $n$ carries of the EAIC addition are computed in the first $\log_2 n$ prefix levels. Under this definition, the adders proposed in [14] are the Prop-1 members, while the adders of [13] are the Prop-n members of the family. Fig. 2(a) and (b) presents the proposed Prop-n/2 and Prop-n/4 16-bit EAIC adders. In the Prop-n/2 adder only the odd numbered carries are computed in $\log_2 n$ prefix levels, while the even numbered ones are computed in the last prefix level. This adder offers a fan-out equal to 2 and has a similar structure to the area-time efficient adders derived by the Han-Carlson algorithm [21] for integer addition. The Prop-n/4 adder, on the other hand, has a fan-out equal to 4 but requires significantly fewer prefix operators, along with their interconnections, than the Prop-n/2 adder. Considering the number of prefix operators as a qualitative metric of the area efficiency of each adder, it can be computed that the LF and KS proposals of [14] require 47 or 64 prefix operators, respectively, with a maximum fan-out equal to $n$, the adders of [13] require 74 and the FPP and the RAPP proposals of [16] 68 and 52 prefix operators respectively, with a maximum fan-out equal
to 2. The Prop-n/2 and Prop-n/4 adders explored in this manuscript require only 45 and 39 prefix operators and offer a maximum fan-out of 2 and 4, respectively. It is noted that the elimination of several prefix operators also removes their associated interconnections. As a result the interconnect area is also significantly reduced.

4. Comparisons

Since every EAIC adder, with the addition of a few gates for handling zero operands and results, can be used for diminished-1 modulo $2^n + 1$ addition, the Prop-n/2 and Prop-n/4 EAIC adders are first quantitatively compared against the EAIC adders proposed in [13–16]. For the adders of [14], an LF or a KS prefix structure is considered for the first $\log_2 n$ prefix levels. Both the RAPP and the FPP architectures of the adders that use Ling carries [16] are also examined. For attaining the comparison data, structural Verilog descriptions for adders of 4, 8, 16 or 32 bits were first generated. Each description was then mapped in a power characterized 90 nm implementation technology [22]. For the synthesis and mapping of the designs the Synopsys® Design Compiler® tool was used in its topographical operation mode. For achieving faster timing closure, in this mode floorplanning is done in parallel with synthesis and mapping, and the design is annotated with actual wiring parasitics and fan-out capacitances coming directly from the floorplan of the design and not from a wire load model. Each adder's inputs are assumed to be driven by the outputs, and its outputs to drive the inputs, of D flip-flops of the same implementation library. A typical corner (1.2 V, 25 °C) was considered. For obtaining the power data, a simulation driven approach was followed: $2^{16}$ random input vectors were applied at a 500 MHz frequency to each netlist for deriving its average power dissipation.

The attained results for each adder are given in Table 1 under the Delay, Area and Power columns. Delay results are given in ps, area results in μm², and average power results in mW. The $A \cdot T^2$ column compares the different architectures under the well-known area × time² metric. The values of this column are normalized with respect to the best offered by any architecture for a particular adder width.

Table 1. Experimental results for EAIC adders.

| Architecture | n | Delay (ps) | Area (μm²) | Power (mW) | A·T² (normalized) |
|---|---|---|---|---|---|
| Ref. [14] LF | 4 | 299 | 768.37 | 0.245 | 1.29 |
| Ref. [14] LF | 8 | 371 | 1671.57 | 0.574 | 1.35 |
| Ref. [14] LF | 16 | 431 | 3805.50 | 1.198 | 1.54 |
| Ref. [14] LF | 32 | 509 | 8343.34 | 2.594 | 1.49 |
| Ref. [14] KS | 4 | 298 | 761.55 | 0.226 | 1.27 |
| Ref. [14] KS | 8 | 369 | 1978.23 | 0.682 | 1.58 |
| Ref. [14] KS | 16 | 445 | 4656.65 | 1.562 | 2.01 |
| Ref. [14] KS | 32 | 544 | 11 368.73 | 3.962 | 2.33 |
| Ref. [13] | 4 | 260 | 790.66 | 0.240 | 1.00 |
| Ref. [13] | 8 | 324 | 2023.46 | 0.665 | 1.25 |
| Ref. [13] | 16 | 388 | 5168.50 | 1.807 | 1.70 |
| Ref. [13] | 32 | 460 | 13 152.81 | 4.870 | 1.92 |
| Ref. [16] FPP | 4 | N/A | N/A | N/A | N/A |
| Ref. [16] FPP | 8 | 302 | 1952.76 | 0.629 | 1.05 |
| Ref. [16] FPP | 16 | 370 | 5118.35 | 1.711 | 1.53 |
| Ref. [16] FPP | 32 | 442 | 14 986.88 | 5.354 | 2.02 |
| Ref. [16] RAPP | 4 | N/A | N/A | N/A | N/A |
| Ref. [16] RAPP | 8 | 335 | 1748.09 | 0.514 | 1.15 |
| Ref. [16] RAPP | 16 | 413 | 4650.92 | 1.558 | 1.73 |
| Ref. [16] RAPP | 32 | 481 | 12 054.81 | 4.322 | 1.93 |
| Ref. [15] | 4 | N/A | N/A | N/A | N/A |
| Ref. [15] | 8 | 350 | 1513.00 | 0.456 | 1.09 |
| Ref. [15] | 16 | 411 | 3746.26 | 1.141 | 1.38 |
| Ref. [15] | 32 | 492 | 7767.00 | 2.503 | 1.30 |
| Prop-n/2 | 4 | 286 | 650.96 | 0.182 | 1.00 |
| Prop-n/2 | 8 | 342 | 1499.52 | 0.468 | 1.03 |
| Prop-n/2 | 16 | 403 | 3289.95 | 1.057 | 1.17 |
| Prop-n/2 | 32 | 471 | 7799.13 | 2.594 | 1.20 |
| Prop-n/4 | 4 | 293 | 677.77 | 0.198 | 1.09 |
| Prop-n/4 | 8 | 346 | 1421.76 | 0.429 | 1.00 |
| Prop-n/4 | 16 | 397 | 2906.73 | 0.860 | 1.00 |
| Prop-n/4 | 32 | 463 | 6750.10 | 2.080 | 1.00 |

The derived experimental data reveal that one or both examined members of the proposed family of adders outperform the earlier proposals of [14,15] in delay, area and average power consumption terms. They also outperform the adders designed according to the RAPP architecture of [16] in all terms in the two widest examined cases. However, they cannot reach the speed of the adders proposed in [13] and the ultimate speed of the FPP adders [16]. Unfortunately, in both these proposals, this level of speed performance is achieved at a very high area and average power consumption price. More specifically, the totally parallel-prefix adders of [13] require from 21% up to 95% more implementation area and consume from 32% up to 134% more power than the proposed adders, while the FPP adders of [16] require from 37% up to 122% more implementation area and consume from 47% up to 157% more power. As a result, the proposed adders are the most efficient of all examined architectures when the $A \cdot T^2$ metric is considered. Under this metric, the proposed adders are also more efficient than the LF and the KS adders of [14] by 29–54% and by 27–132% respectively. They also outperform the adders of [13] by up to 92%, the adders of [15] by up to 32% and the FPP and RAPP proposals of [16] by up to 102% and 93% respectively.

Since $2^{16} + 1$ is the Fermat number with the most practical interest, for $n = 16$ all of the above adder architectures were synthesized under several delay targets. The derived area-delay curves are plotted in Fig. 3. This area-time exploration reveals that both examined members of the proposed family of adders offer significantly smaller implementations than any previously proposed architecture at all delay targets larger than or equal to 397 ps. The area savings offered range from 19.7% up to 57.3%, depending on the architecture that the proposed adders are compared against and the delay targeted.

Fig. 3. Area-time design space exploration of 16-bit EAIC adders. (Axes: delay in ps, 365 to 540; area in μm², 1600 to 5100. Curves: [13], [14] LF, [14] KS, [15], [16] FPP, [16] RAPP, Prop-(n/2), Prop-(n/4).)

Table 2. Experimental results for weighted adders (delay in ps, area in μm², average power in mW).

| Architecture | n | Delay | Area | Power |
|---|---|---|---|---|
| Ref. [23] | 4 | 420 | 1009.68 | 0.277 |
| Ref. [23] | 8 | 484 | 2344.67 | 0.720 |
| Ref. [23] | 16 | 550 | 5709.32 | 1.902 |
| Ref. [23] | 32 | 623 | 14 161.76 | 5.049 |
| Ref. [12] using Prop-n/2 | 4 | 417 | 869.98 | 0.219 |
| Ref. [12] using Prop-n/2 | 8 | 473 | 1820.73 | 0.523 |
| Ref. [12] using Prop-n/2 | 16 | 536 | 3839.77 | 1.152 |
| Ref. [12] using Prop-n/2 | 32 | 606 | 8808.08 | 2.773 |
| Ref. [12] using Prop-n/4 | 4 | 424 | 896.79 | 0.235 |
| Ref. [12] using Prop-n/4 | 8 | 477 | 1742.97 | 0.484 |
| Ref. [12] using Prop-n/4 | 16 | 530 | 3447.55 | 0.954 |
| Ref. [12] using Prop-n/4 | 32 | 598 | 7759.05 | 2.080 |

In the unifying architecture proposed in [12], a slightly modified EAIC adder is used along with a CSA stage for building a modulo $2^n + 1$ adder for operands in the normal weighted representation. The results of Table 1 indicate that if the proposed EAIC adders are used as building blocks, the resulting weighted adders will also be more area-time efficient than those that include some other EAIC adder. Therefore, the weighted adders that result from using the proposed EAIC adders in the unifying architecture of [12] are only compared against the weighted adders of [23], which have been shown to be more efficient in both area and time terms than the earlier proposal of [24]. The comparison results given in Table 2 reveal that the weighted adders resulting from using the proposed EAIC adders are faster, smaller and consume less power on average than the adders of [23] throughout the examined range. The savings offered are about 2% in operation speed and range from 11% up to 45% and from 15% up to 59% in the required implementation area and average power consumption, respectively.
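The normalization behind the $A \cdot T^2$ column of Table 1 can be sketched in a few lines of Python. This is only an illustration of the metric as described in the text (area × delay², divided by the best value for a given width); the delay and area figures are the $n = 16$ entries copied from Table 1, while the dictionary name `TABLE1_N16` and the function `normalized_at2` are hypothetical names introduced here.

```python
# Delay (ps) and area (um^2) of the 16-bit EAIC adders, copied from Table 1.
TABLE1_N16 = {
    "[14] LF": (431, 3805.50),
    "[14] KS": (445, 4656.65),
    "[13]": (388, 5168.50),
    "[16] FPP": (370, 5118.35),
    "[16] RAPP": (413, 4650.92),
    "[15]": (411, 3746.26),
    "Prop-n/2": (403, 3289.95),
    "Prop-n/4": (397, 2906.73),
}


def normalized_at2(results):
    """Return area * delay^2 per architecture, normalized to the smallest value."""
    at2 = {name: area * delay ** 2 for name, (delay, area) in results.items()}
    best = min(at2.values())
    return {name: round(value / best, 2) for name, value in at2.items()}


if __name__ == "__main__":
    # Print the architectures from most to least efficient under A*T^2.
    for name, value in sorted(normalized_at2(TABLE1_N16).items(), key=lambda kv: kv[1]):
        print(f"{name:10s} {value:.2f}")
```

Because the metric squares the delay, a small speed advantage can outweigh a sizeable area penalty, which is why the fast but large adders of [13] still rank ahead of the KS adders of [14] at this width.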
5. Conclusions

Efficient architectures of modulo $2^n + 1$ adders are appreciated in a variety of computer systems fields, including all applications of an RNS and cryptography. A modulo $2^n + 1$ adder is built around an EAIC adder. In this manuscript a new family of EAIC adders was derived, by showing that we can use any carry $c^*_i$ of the EAIC addition along with group generate and propagate pairs of the integer addition for computing the remaining EAIC addition carries. The previous proposals of [14,13] can be considered as the two end cases of this family. Every member of the proposed family of adders has a sparse parallel-prefix carry computation unit with $\log_2 n + 1$ prefix levels, but each has its own area, power and maximum fan-out characteristics. The experimental exploration of two members of the proposed adder family revealed that they outperform, in terms of implementation area and average power consumption, all adders that can offer a similar delay, since they reduce the prefix operators required in the carry computation unit as well as their interconnections. Under the $A \cdot T^2$ metric, the proposed adders are the most efficient, heavily outperforming all previous solutions for operands wider than 8 bits. Moreover, the proposed adders can be used as building blocks for the design of area-time efficient weighted modulo $2^n + 1$ adders under the unifying approach of [12].
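The idea summarized above, that a few EAIC carries obtained from (3) suffice to derive all the others through relation (4), can be checked with a small bit-level Python model. This is a sketch under stated assumptions, not the hardware structure of the paper: group terms are folded serially rather than in a $\log_2 n$-level prefix tree, so it models only the logic functions and says nothing about depth, fan-out or area. The function names and the `step` parameter (the spacing between carries taken directly from (3)) are hypothetical; seed carries are assumed to sit at positions $-1, \mathit{step}-1, 2\,\mathit{step}-1, \ldots$, so `step = 2` mirrors the odd-numbered carries of Prop-n/2, `step = n` mirrors [14] and `step = 1` mirrors [13].

```python
def bits(a, b, n):
    """Per-bit generate, propagate and half-sum signals (first stage)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) | (b >> i)) & 1 for i in range(n)]
    h = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    return g, p, h


def prefix(left, right):
    """Prefix operator of Eq. (1): (g_m, p_m) o (g_k, p_k)."""
    return (left[0] | (left[1] & right[0]), left[1] & right[1])


def group(g, p, k, j):
    """Group pair (G_{k:j}, P_{k:j}) of Eq. (2), folded serially for k >= j."""
    acc = (g[j], p[j])
    for i in range(j + 1, k + 1):
        acc = prefix((g[i], p[i]), acc)
    return acc


def seed_carry(g, p, i):
    """EAIC carry c*_i taken directly from Eq. (3); c*_{-1} is the inverted carry out."""
    n = len(g)
    if i == -1:
        return 1 - group(g, p, n - 1, 0)[0]
    g_hi, p_hi = group(g, p, n - 1, i + 1)
    return prefix(group(g, p, i, 0), (1 - g_hi, p_hi))[0]


def eaic_add(a, b, n, step):
    """EAIC sum with one seed carry every `step` positions, the rest via relation (4)."""
    g, p, h = bits(a, b, n)
    carry = {}
    for i in range(-1, n - 1):
        if (i + 1) % step == 0:
            carry[i] = seed_carry(g, p, i)
        else:
            s = i - (i + 1) % step  # nearest seed position below i
            big_g, big_p = group(g, p, i, s + 1)
            carry[i] = big_g | (big_p & carry[s])
    # Third stage: s_i = h_i XOR c*_{i-1}
    return sum((h[i] ^ carry[i - 1]) << i for i in range(n))


def eaic_reference(a, b, n):
    """Arithmetic reference: A + B plus the inverted carry out, kept to n bits."""
    total = a + b
    return (total + 1 - (total >> n)) & ((1 << n) - 1)
```

For instance, `eaic_add(a, b, 16, 4)` can be compared against `eaic_reference(a, b, 16)` for any pair of 16-bit operands; agreement for every value of `step` illustrates that the choice of which carries to compute early affects only the cost of the circuit, not its function.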
References

[1] H. Nozaki, et al., Implementation of RSA algorithm based on RNS Montgomery multiplication, in: Proceedings of the 3rd International Workshop on Cryptographic Hardware and Embedded Systems, Lecture Notes in Computer Science, vol. 2162, Springer-Verlag, 2001, pp. 364–376.
[2] V.K. Zadiraka, E.A. Melekhina, Computer implementation of efficient discrete-convolution algorithms, Cybernetics and Systems Analysis 30 (1) (1994) 106–114.
[3] P.V.A. Mohan, Residue Number Systems: Algorithms and Architectures, Springer-Verlag, 2002.
[4] A. Omondi, B. Premkumar, Residue Number Systems: Theory and Implementations, Imperial College Press, 2007.
[5] K. Navi, A. Molahosseini, M. Esmaeildoust, How to teach residue number system to computer scientists and engineers, IEEE Transactions on Education 54 (1) (2011) 156–163.
[6] Y. Liu, E.M.-K. Lai, Moduli set selection and cost estimation for RNS-based FIR filter and filter bank design, Design Automation for Embedded Systems 9 (2) (2004) 123–139.
[7] J. Ramirez, et al., Design and implementation of high-performance RNS wavelet processors using custom IC technologies, Journal of VLSI Signal Processing 34 (3) (2003) 227–237.
[8] J. Ramirez, et al., Fast RNS FPL-based communications receiver design and implementation, in: Proceedings of the 12th Conference on Field Programmable Logic, Lecture Notes in Computer Science, vol. 2438, Springer-Verlag, 2002, pp. 472–481.
[9] S. Timarchi, K. Navi, Improved modulo $2^n + 1$ adder design, International Journal of Computer and Information Science and Engineering (2008) 158–165 (Summer).
[10] S. Timarchi, K. Navi, Arithmetic circuits of redundant SUT–RNS, IEEE Transactions on Instrumentation and Measurement 58 (9) (2009) 2959–2968.
[11] L.M. Leibowitz, A simplified binary arithmetic for the Fermat number transform, IEEE Transactions on Acoustics, Speech, and Signal Processing 24 (5) (1976) 356–359.
[12] H.T. Vergos, C. Efstathiou, A unifying approach for weighted and diminished-1 modulo $2^n + 1$ addition, IEEE Transactions on Circuits and Systems II 55 (10) (2008) 1041–1045.
[13] H.T. Vergos, C. Efstathiou, D. Nikolos, Diminished-one modulo $2^n + 1$ adder design, IEEE Transactions on Computers 51 (12) (2002) 1389–1399.
[14] R. Zimmermann, Efficient VLSI implementation of modulo $(2^n \pm 1)$ addition and multiplication, in: Proceedings of the 14th IEEE Symposium on Computer Arithmetic, 1999, pp. 158–167.
[15] C. Efstathiou, H.T. Vergos, D. Nikolos, Modulo $2^n \pm 1$ adder design using select prefix blocks, IEEE Transactions on Computers 52 (11) (2003) 1399–1406.
[16] H.T. Vergos, C. Efstathiou, Efficient modulo $2^n + 1$ adder architectures, Integration, the VLSI Journal 42 (2) (2009) 149–157.
[17] R.P. Brent, H.T. Kung, A regular layout for parallel adders, IEEE Transactions on Computers 31 (3) (1982) 260–264.
[18] R.E. Ladner, M.J. Fischer, Parallel prefix computation, Journal of the ACM 27 (4) (1980) 831–838.
[19] P.M. Kogge, H.S. Stone, A parallel algorithm for the efficient solution of a general class of recurrence equations, IEEE Transactions on Computers 22 (8) (1973) 786–792.
[20] S. Knowles, A family of adders, in: Proceedings of the 14th IEEE Symposium on Computer Arithmetic, 1999, pp. 30–34.
[21] T. Han, D. Carlson, Fast area-efficient VLSI adders, in: Proceedings of the 8th IEEE Symposium on Computer Arithmetic, 1987, pp. 49–56.
[22] Synopsys Inc., SAED 90 nm EDK. Available: https://www.synopsys.com/apps/protected/university/members.html.
[23] C. Efstathiou, H.T. Vergos, D. Nikolos, Fast parallel-prefix modulo $2^n + 1$ adders, IEEE Transactions on Computers 53 (9) (2004) 1211–1216.
[24] A. Hiasat, High-speed and reduced-area modular adder structures for RNS, IEEE Transactions on Computers 51 (1) (2002) 84–89.
Haridimos T. Vergos received his Diploma in Computer Engineering in 1991, and his Ph.D. in 1996, from the Department of Computer Engineering & Informatics of the University of Patras, Greece, where he currently holds an Associate Professor position. He was a member of Atmel Multimedia & Communications Group and worked on the development of the first IEEE 802.11 compliant wireless MAC processor. His research interests include computer arithmetic and architecture, dependable system architectures and low power design and test. Dr. Vergos holds one worldwide patent and has authored or coauthored more than 70 scientific papers.