ALU design and processor branch architecture


Microprocessing and Microprogramming 36 (1993) 259-278 North-Holland


G.B. Steven* and F.L. Steven
Division of Computer Science, University of Hertfordshire, College Lane, Hatfield, Herts, UK
Received 19 April 1993; accepted 17 June 1993

Abstract

This paper examines the role of the ALU within the context of high-performance processor design. In particular, the functional requirements of various processor branch architectures are evaluated and related to ALU design. The paper demonstrates that the traditional condition code branch mechanism is unsuitable for high-performance, multiple-instruction-issue processor implementations. First, the use of condition codes hinders code motion and therefore inhibits instruction scheduling. Second, the use of condition codes prevents the early resolution of branch conditions and therefore either increases the processor cycle time or the number of branch delay slots. Various alternative branch mechanisms are examined which remove the first restriction. Two of the branch architectures considered are also shown to remove the second problem. In both architectures the crucial factor is that only a single branch condition needs to be evaluated for each branch. Outline designs of a Relational Unit and an ALU which meet the requirements of the two high-performance branch architectures are also presented and compared with traditional ALU and comparator designs.

Keywords. ALU; branch architecture; relational unit; superscalar.

1. Introduction

Although the ALU is the traditional general-purpose workhorse of processor design, very few technical papers are devoted to the relationship between ALU design and processor architecture. Instead the overwhelming majority of authors choose to concentrate on only one aspect of ALU design, namely addition. As a result the generation of other ALU functions is ignored, as is the relationship between these functions and the branch architecture of a processor. Similarly, those authors who evaluate alternative branch architectures [8, 22] tend to ignore the impact of the different branch architectures on ALU design. This paper examines the role of the ALU within the context of high-performance processor design. Performance beyond the RISC benchmark of one instruction per cycle can be achieved in several
*Corresponding author: Email: [email protected]

ways. In a multiple-instruction-issue processor more than one instruction can be dispatched to multiple functional units for execution in each processor cycle. If the hardware determines at run time which instructions are to be issued in parallel to the functional units, then the machine is termed a superscalar processor [17]. In contrast, if a fixed number of instructions is fetched and issued in each processor cycle, then the machine is termed a VLIW (Very Long Instruction Word) processor [9, 32]. In either case optimal performance is only likely to be achieved if the compiler increases the scope for parallel instruction execution by re-ordering or scheduling machine code instructions prior to program execution [5, 13]. An alternative approach to achieving a high instruction throughput is to increase the depth of the pipeline. Recently the term superpipelining has been coined to describe this approach [18, 23]. Not surprisingly, it is in superpipelined designs that the timing constraints on the ALU tend to be particularly severe.


Many classic designs used microprogramming to interpret rich, complex instruction sets. During the interpretation process it was possible to use the ubiquitous ALU to perform a wide variety of functions at different stages of the instruction interpretation process. The ALU was therefore perceived as a multi-purpose building block which was expected to perform a wide range of operations including:
• Addition and subtraction
• Bitwise logical operations (AND, OR, etc.)
• Comparisons
• Program counter incrementation
• Branch target address computation
• Address calculations.
In addition, flag or condition code information had to be generated to update a processor status register. This multi-purpose role led to the design of multi-functional parts typified by the venerable SN74181 4-bit ALU.
More recently the interpretation of instruction sets through microcode has been abandoned for new architectures. Instead processor designs have been heavily pipelined in a sustained drive to minimise the number of machine cycles required to execute each instruction. RISC processor designs [34] attempt to reduce the number of cycles required to execute each instruction to one, while multiple-instruction-issue processors promise to convincingly breach the two instructions per cycle barrier [3, 13]. To support sufficient parallelism to sustain this increased performance, pipelined designs tend to distribute many of the traditional ALU functions to other functional units such as PC incrementers, branch target adders and address adders.
In these rapidly changing circumstances it is appropriate to re-examine the role of the ALU. This paper first highlights the shortcomings of traditional ALU designs and then focuses attention on the much neglected role of the ALU as a comparator or Relational Unit. In particular, it is demonstrated that ALU requirements are heavily influenced by the branch mechanism supported by the processor.
The next section briefly examines traditional ALU design concepts, while section three examines traditional comparator design. Section four analyses various branch architectures and relates their implementation requirements to ALU and comparator design. Section five outlines the authors' alternative Relational Unit and ALU design concepts, and section six evaluates the suitability of these concepts for implementing alternative branch mechanisms. Finally, section seven offers some concluding remarks.

2. Traditional ALU design

Traditional ALU design is exemplified by the SN74181 4-bit ALU. Not only was this TTL device widely used in its own right, it also spawned numerous functionally equivalent building blocks which are still in use today. Given the pervasive influence of the 74181, it is no surprise that the ALU data path building block provided by the Cascade silicon compiler [4] used at the University of Hertfordshire is functionally identical to the SN74181.
The 74181 is a four-bit data path building block which performs all the common arithmetic and logical operations in a single unit (Fig. 1). The device has two major advantages. First, all the functions implemented are made available on a single group of output pins. Second, the logic required to implement the bitwise logical functions is fully integrated into the initial stages of the carry generation logic. This arrangement avoids the requirement for a final multiplexing stage to select the appropriate output

Fig. 1. SN74181 4-bit ALU. (Figure: inputs A3-A0, B3-B0, Cin and function-select lines S3-S0; outputs F3-F0, Cout and A=B.)

Fig. 2. Carry generation for a 32-bit ALU. (Figure: 4-bit group propagate/generate signals P4/G4 are combined in two lookahead levels into 8-, 16- and 32-bit group signals, producing the intermediate carries C4-C28 used for final sum selection.)

and ensures that the logical functions do not significantly degrade the worst-case add time of the unit. A disappointing feature of the SN74181 is that it fails to generate an overflow signal for signed addition and subtraction. This omission is also faithfully copied on the Cascade ALU. The most straightforward way to generate an overflow signal is to exclusive-OR the carry into the most-significant bit


position with the carry from the most-significant bit position. This method produces an overflow signal marginally ahead of the final sum generation. Since the carry into the most-significant bit is not available externally, users of the 74181 are forced to use a more cumbersome method to derive overflow from the sign bits of the operands and the final result. As a result overflow information is not available until after the main ALU result. While the cost of the additional logic is trivial, the added delay can be crucial in a pipelined design, particularly if the overflow signal is required to generate a run-time interrupt. This may place the additional overflow logic on a critical path and increase the processor cycle time (see for example [25]).
The 74181 design also tacitly assumes that the device will be used to implement a processor architecture which uses condition code flags. As a result the traditional sign (N), carry (C) and zero (Z) flags are generated for each calculation. However, since these flags are generated as a side effect of the main operation, their timing was not regarded as critical and was therefore not optimised. As a result the zero flag is derived directly from the sum bits and is not available until after the main ALU result.
Building blocks similar to the 74181 can be readily combined to implement a 32-bit ALU design (Fig. 2). Classic carry lookahead adder techniques [12, 20] are used to generate carries at 4-bit intervals. In parallel with the carry generation, potential sum values are precomputed for each four-bit sum group, first on the assumption that the carry into the group is zero and second on the assumption that the carry into the group is one. When the carry into each sum group becomes available, it is used to select the correct sum values in a final multiplexing stage. The authors designed their own version of the classic 32-bit ALU [33] using the gates supported by the Cascade silicon compiler.
A fan-in of four was available allowing generate and propagate signals to be produced in successive groups of 1, 4, 16 and 32 bits. Selection of alternative functions was incorporated at the first logic level, allowing all data functions to be made available on a single set of output lines in eleven gate delays (Table 4). The opportunity was also taken to generate carry and

overflow flag outputs marginally ahead of the main data outputs. Our purpose was not to generate an optimised ALU design for a specific application. This exercise has been performed many times elsewhere and often involves exploiting the distinctive features of a particular technology [21, 27]. For example, Quach and Flynn [27] use complex CMOS gates, which each implement two levels of logic, to design a 32-bit adder (not ALU) with only four gate delays and an estimated add time of 4 ns. In contrast our objective was to expose inherent timing relationships which will affect the implementation of any high-performance processor architecture.
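The two overflow-detection methods discussed above can be contrasted with a small sketch. This is illustrative C, not the Cascade or 74181 logic, and the function names are ours:

```c
#include <assert.h>
#include <stdint.h>

/* Overflow for signed 32-bit addition via the method the text recommends:
   exclusive-OR the carry INTO the most-significant bit position with the
   carry OUT of it.  Both carries are recovered arithmetically here; in
   hardware they exist inside the carry chain marginally before the final
   sum settles. */
static int add_overflow(uint32_t a, uint32_t b) {
    uint32_t low = (a & 0x7FFFFFFFu) + (b & 0x7FFFFFFFu);
    int c_in  = (int)(low >> 31) & 1;                 /* carry into bit 31  */
    int c_out = (int)(((uint64_t)a + b) >> 32) & 1;   /* carry out of bit 31 */
    return c_in ^ c_out;
}

/* The cumbersome 74181-style alternative: derive overflow from the sign
   bits of the operands and the final result, available only after the
   full sum has been generated. */
static int add_overflow_signs(uint32_t a, uint32_t b) {
    uint32_t sum = a + b;
    return (int)((~(a ^ b) & (a ^ sum)) >> 31);
}
```

Both functions agree for all operand pairs, but in hardware the first needs only two signals that exist before the sum is complete, while the second must wait for the full result.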

3. Traditional comparator design

In practice, the underlying assumptions behind ALU design obscure the central role the ALU plays in branch implementation. Traditionally, the ALU is used as a comparator to resolve conditional branches, yet this role is obscured by the use of condition code flags. Indeed, compare is not usually listed as an ALU function. Instead the subtract operation is usually pressed into service to perform comparisons. As a result little attempt is made to optimise the performance of the ALU as a relational unit, and the use of condition code flags makes it more difficult to incorporate a traditional ALU into a high-performance pipelined design.

Fig. 3. SN7485 4-bit comparator. (Figure: inputs A3-A0 and B3-B0; outputs A>B, A=B and A<B.)


Fig. 4. A 32-bit comparator. (Figure: a two-level tree of 4-bit comparators generating GT and EQ signals, with optional sign-bit inversion for signed operands and a final stage selecting the required relation using select lines S2-S0.)

Comparator design can also be illustrated using a TTL device, the SN7485 (Fig. 3). Again the Cascade building block is functionally identical, illustrating the pervasive influence of these classic TTL parts. The 7485 compares two unsigned four-bit numbers and outputs three results, GT (greater than), LT (less than) and EQ (equal). In general, three further relationships, LE (less than or equal),


GE (greater than or equal) and NE (not equal) are also required. Each of these additional relationships can be generated by ORing two of the original three output signals. Larger numbers can be compared using a hierarchy of four-bit comparators (Fig. 4). As before, LT, GT and EQ signals are generated. Any one of the six possible relationships between unsigned numbers can then be selected using a further two logic levels:

Relation = S2.GT + S1.LT + S0.EQ.

Finally, if both signed and unsigned comparisons are required, the sign input bits must be optionally inverted. The authors designed a 32-bit comparator [33] using the above configuration and the gates supported by Cascade. The design required eleven gate delays to compute and then select a specific relationship (Table 4). Only GT and EQ signals were generated at the intermediate levels. Also, although sign inversion is shown in Fig. 4 as a separate stage, this logic was incorporated into the most-significant four-bit grouping without increasing the worst-case gate delay through the circuit. Once more our purpose was to compare timing relationships and to relate them to architectural choices, not to produce a fully optimised design using a specific technology.
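The selection equation above can be sketched in software. The select-line encoding and function names below are ours, for illustration only; the point is that GT, LT and EQ are computed once, any of the six relations then follows from two further logic levels, and signed comparison is obtained, as in Fig. 4, by inverting the sign bits of both operands:

```c
#include <assert.h>
#include <stdint.h>

/* Select lines for Relation = S2.GT + S1.LT + S0.EQ (illustrative encoding). */
enum { S_EQ = 1, S_LT = 2, S_GT = 4 };
enum { REL_NE = S_GT | S_LT, REL_LE = S_LT | S_EQ, REL_GE = S_GT | S_EQ };

/* Compute GT, LT and EQ once, then derive any of the six relations by
   ORing the selected terms.  Inverting the sign bit maps signed order
   onto unsigned order. */
static int relation(uint32_t x, uint32_t y, int sel, int is_signed) {
    if (is_signed) { x ^= 0x80000000u; y ^= 0x80000000u; }
    int gt = (x > y), lt = (x < y), eq = (x == y);
    return (((sel & S_GT) && gt) ||
            ((sel & S_LT) && lt) ||
            ((sel & S_EQ) && eq)) ? 1 : 0;
}
```

ORing two select lines gives the three derived relations (NE, LE, GE) with no extra comparator logic, mirroring the two-gate-level selection described in the text.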

4. Alternative processor branch architectures

In practice, the functional requirements of an ALU should depend heavily on the processor branch architecture. This section therefore analyses five branch mechanisms, all of which have been used in recent processor architectures:
• Traditional condition code register.
• Condition flags returned to a general-purpose register.
• Multiple condition code registers.
• Combined compare and branch instructions.
• Boolean registers.
Each mechanism is examined in the context of the following four-stage pipeline, which is capable of supporting a typical RISC load and store architecture.

• IF: Instruction Fetch
• RF: Register Fetch & Instruction Decode
• ALU/MEM: Perform ALU operation or access data cache
• WB: Write Back result to general-purpose register.

It is assumed that ALU instructions have the form:

Operation Rdst, Rsrc1, Rsrc2

ALU instructions are fetched from an instruction cache in the IF stage and access register operands from the general-purpose register file in the RF stage. The required operation is performed in the ALU stage, and a result is written to a destination register in the WB stage. Memory reference instructions have the form:

LOAD Rdst, <ea>
STORE <ea>, Rsrc

Again instructions are fetched in the IF stage, and register operands are accessed in the RF stage. The effective memory address must also be formed in the RF stage since the data cache is accessed in the ALU/MEM stage. In the case of a LOAD instruction, data from memory is written to a general-purpose register in the WB stage. Placing the data cache access in the same pipeline stage as the ALU has the major advantage of allowing the result of a load from memory to be used immediately by the next sequential instruction [30]. The disadvantage is that computing all memory addresses in the RF stage restricts the complexity of the addressing modes which can be provided. Branch instruction execution depends on the choice of branch architecture, which will now be examined. The results of this examination are summarised in Table 2.

4.1. Traditional condition codes

Condition codes or flags have been widely used on many architectures including the Motorola 68000, Intel 8086 and VAX families. Condition code flags

Table 1
Branch resolution using condition code flags

Relation               Unsigned   Signed
Greater than           C'.Z'      N.V.Z' + N'.V'.Z'
Less than or equal     C + Z      Z + (N ≠ V)
Greater than or equal  C'         (N = V)
Less than              C          N ≠ V
Not equal              Z'         Z'
Equal                  Z          Z

(A prime denotes complement; C is the borrow generated by an unsigned subtraction.)

Fig. 5. Condition code architecture branch timing. (Figure: the COMPARE passes through the IF, RF, ALU and WB stages; the following BRANCH computes PC+offset in its RF stage, so the branch resolution point falls at the end of the compare's ALU cycle.)

are set explicitly by compare instructions and then subsequently tested by a following branch instruction. Branch conditions must therefore be derived indirectly from the flag information. The logic to derive the required branch condition from the condition code flags is not completely trivial, as can be seen from Table 1. Note that the overflow flag must be used in the calculations to ensure that the correct relational result is obtained even after arithmetic overflow has occurred in the ALU.
Condition codes are also usually set implicitly as a side effect of other ALU operations. In many architectures, such as the Motorola 68000, move instructions also set the flags. Although criticised by some authors [28], this free operation is often seen as one of the main advantages of the traditional mechanism. This belief is supported by instruction set usage figures which show that programs execute significantly more conditional branch instructions than compare instructions [28, 26], thus suggesting that approximately 5% more instructions would be executed if condition codes were not set implicitly by arithmetic instructions. It is therefore important to try to preserve this advantage when alternative branch mechanisms are considered.
Now consider this traditional mechanism in our four-stage pipeline model. In general each compare instruction is immediately followed by a conditional branch instruction. The compare obtains its operands in the RF stage, carries out a subtraction in the ALU stage and loads the appropriate flag bits at the end of the ALU stage. The following branch instruction is fetched from the instruction cache one cycle after the compare and computes the branch target address in a separate branch adder during the RF

Fig. 6. Branch resolution: condition code architecture. (Figure: the ALU performs a non-specific comparison of A and B; the condition code bits update a status register, all branch conditions are computed from the flags, the required condition is selected, and the result chooses between the next sequential and branch target instruction addresses sent to the instruction cache.)

stage. The branch target address computation and the compare instruction ALU operation therefore take place in the same machine cycle (Fig. 5). If the conditional branch instruction is to be resolved at the end of the RF stage, the processor hardware must perform the following operations (Fig. 6):
(1) Generate the condition flags in the ALU.
(2) Derive relational information from the flags.


(3) Select the relationship specified in the branch instruction to determine the branch outcome.
(4) Select the appropriate next instruction address.
Resolving conditional branches at the end of the RF stage results in the following instruction timing:

CMP Rsrc1, Rsrc2
Bcc label
NOP /* branch delay slot */

A single machine cycle or branch delay slot is required before the branch is resolved. Failure to resolve the branch at the end of the RF stage will require two machine cycles or branch delay slots. Since condition codes are not, in general, available until the very end of an ALU subtraction, it is clear that the above architecture will either require the CPU cycle time to exceed the ALU cycle time or will require a branch delay of two. A traditional condition code branch architecture is therefore inappropriate for an implementation where the designers wish to reduce the processor cycle time to the ALU cycle time.
A further major disadvantage of the traditional model is that the instruction which sets the flags must, in general, immediately precede the conditional branch instruction. However, in a multiple-instruction-issue processor, code is often scheduled or re-ordered to improve performance. Many scalar processors also require some instruction re-ordering to fill branch and load delay slots. Insisting that compares and branches must remain adjacent inevitably complicates the instruction scheduling process and reduces its effectiveness.
More recent architectures which persist with the condition code model, such as the Acorn ARM [10] and the Sun SPARC [11], arrange for instructions to optionally set the condition code flags. Essentially one bit in each instruction is used to control the updating of the flags. Alternatively, on the National 32000 series [16] only compare instructions alter the flag information. Both of these mechanisms allow the compare and branch instructions to be separated and therefore aid the instruction scheduling process.
The National alternative, however, loses the performance benefits of allowing a wide range of instructions to implicitly set the flags.

4.2. Condition codes returned to a general-purpose register

One alternative to a single condition code register can be illustrated by the Motorola 88000 RISC architecture [2]. Here a compare instruction returns a 10-bit Boolean vector to one of the 32-bit general-purpose registers. The vector bits represent the truth values of all useful signed and unsigned integer comparisons. Conditional branch instructions then test the value of a specific register bit.
This system has three advantages. First, a compare instruction no longer has to immediately precede its corresponding conditional branch. Therefore an annoying restriction on instruction scheduling is removed. Second, multiple Boolean conditions can be pre-calculated for later use. Thus a compare instruction can be scheduled well ahead of its corresponding conditional branch. This additional freedom is particularly welcome when code is being scheduled for a multiple-instruction-issue processor. Third, as a direct consequence of their branch mechanism, Motorola provide a powerful bit testing mechanism.
Nonetheless, returning the Boolean vector to a general-purpose register is a mixed blessing. Read and write ports to the general-purpose register file are potentially a performance bottleneck in a high-performance processor. Since typically 10% of instructions executed are compares [26], returning the result of each comparison to the register file places a significant additional burden on the register write ports. Furthermore, although a register must be allocated for each Boolean vector, it is still not possible to treat a specific Boolean result as either a Boolean variable or as an integer. Instead an additional instruction is required to extract a Boolean value for further arithmetic processing. Such processing may be useful if a compiler uses numeric representation [1] to implement complex conditional expressions or in the implementation of languages such as C where a relational expression is expected to deliver an integer. One disadvantage is, however, avoided on the M88000.
Although condition codes are no longer set as a side effect of other ALU operations, compares against zero can still be avoided by using


Fig. 7. Branch resolution: CCs held in registers. (Figure: the ALU performs a non-specific comparison; a bit vector of branch conditions is returned to a register, and the branch type selects the required condition to choose between the next sequential and branch target instruction addresses.)

a variation of the M88000 conditional branch instructions which directly test the contents of a register against zero. If the M88000 architecture is implemented on our pipeline model, the logic required to resolve a conditional branch instruction is simplified slightly (Fig. 7). On the M88000 architecture, relational information can be calculated directly without reference to the traditional flags. A processor designer therefore has two new options. The calculation of the Boolean vector bits can be integrated into the ALU design, or they can be generated in a separate Relational Unit which operates in parallel with the ALU. However, all possible relational conditions must still be computed in parallel, so the branch resolution timing on our model is still far from ideal. It is therefore interesting to note that the MC88100 processor allows an extra half cycle after the end of the ALU cycle to resolve conditional branches [24].
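The M88000-style compare can be sketched as a Relational Unit that evaluates every useful relation in one step and returns the results as a bit vector. The bit positions and names below are illustrative only, not the actual M88000 encoding:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit positions in the 10-bit result vector. */
enum { R_EQ = 1 << 0, R_NE = 1 << 1,
       R_SGT = 1 << 2, R_SLE = 1 << 3, R_SLT = 1 << 4, R_SGE = 1 << 5,
       R_UGT = 1 << 6, R_ULE = 1 << 7, R_ULT = 1 << 8, R_UGE = 1 << 9 };

/* One compare evaluates all useful signed and unsigned relations;
   a later conditional branch then tests a single specific bit. */
static uint32_t cmp_vector(uint32_t a, uint32_t b) {
    int32_t sa = (int32_t)a, sb = (int32_t)b;
    uint32_t v = 0;
    v |= (a == b)  ? R_EQ  : R_NE;
    v |= (sa > sb) ? R_SGT : 0;  v |= (sa <= sb) ? R_SLE : 0;
    v |= (sa < sb) ? R_SLT : 0;  v |= (sa >= sb) ? R_SGE : 0;
    v |= (a > b)   ? R_UGT : 0;  v |= (a <= b)   ? R_ULE : 0;
    v |= (a < b)   ? R_ULT : 0;  v |= (a >= b)   ? R_UGE : 0;
    return v;
}
```

The sketch makes the cost visible: all ten relations are computed on every compare, even though a branch will normally test only one of them.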

4.3. Condition codes returned to multiple condition code registers

The IBM RS6000 also attempts to improve on the traditional condition code mechanism by providing a total of eight separate condition code fields within a 32-bit condition vector [14]. As in the case of the M88000, a compare instruction explicitly selects the destination for the condition code information. Only four bits of information, three relational bits


(EQ, GT & LT) and overflow, are generated for each comparison. Branch instructions then explicitly test a combination of the bits in the condition register. Two of the condition code fields are dedicated, one to the integer unit and one to the floating-point unit. As a result arithmetic operations can optionally set the dedicated condition code bits in the traditional fashion, and RS6000 compilers can still use the traditional side-effect mechanism to remove compares against zero.
Multiple condition codes provide exactly the same instruction scheduling advantages as returning condition code information to registers. Moreover, using separate condition code registers avoids the additional pressure on the register file ports of the M88000 scheme and allows IBM to relocate the condition flags in a separate branch processor chip. The RS6000 also allows Boolean operations to be performed directly between the individual bits of the condition vector. It is therefore the first architecture we have examined where the Boolean result can be used directly in further computations. Thus the AND operation in the conditional expression IF (a > b) AND (c = d) can be implemented as a single Boolean AND instruction between the appropriate condition vector bits. Unfortunately, it is still not possible to store a condition register bit directly as a Boolean variable or to use a Boolean result directly in the implementation of the C expression: b = b + (c > d).
Turning to our pipeline model, the logic required to implement the RS6000 branch mechanism is shown in Fig. 8. As can be seen, the condition vector bits represent a distinct improvement on the traditional flag mechanism. Nonetheless the ALU is still required to compute three conditional relationships (EQ, GT & LT) during each ALU cycle. Since it is unusual to test more than a single condition, most of the information computed is inevitably redundant.
In view of the above discussion, it is interesting to find that in the first RS6000 implementation a fixedpoint compare instruction followed by a conditional branch results in a disappointing branch delay of three cycles. However, it is only fair to point out that


Fig. 8. Branch resolution: multiple CC registers. (Figure: the ALU performs a non-specific comparison into one of eight condition code fields CC0-CC7 of the 32-bit condition register; the branch type and bit position then select the required condition register bit.)

Fig. 9. Branch resolution: combined branch & compare. (Figure: the ALU performs the specific comparison named by the instruction, and the single specific condition directly selects between the next sequential and branch target instruction addresses.)

the timing problems highlighted in the model are exacerbated by the original RS6000 three-chip implementation and that the branch resolution timing is expected to be improved in later versions.

4.4. Combined compare and branch instructions

The MIPS [15] and MIPS-X [6] RISC processors, along with their commercial derivatives [19], break completely away from the traditional condition code model by providing combined compare and branch instructions with the following format:

Bcc Rsrc1, src2, label

If the specified relationship holds between two registers or a single register and an immediate value then the branch is taken. Superficially, combining two instructions saves an instruction and improves performance. However, since in our pipeline the branch condition cannot be resolved until the end of the ALU pipeline stage, the branch delay is simply increased by one cycle. Performance is therefore only improved if two instructions, instead of the more usual one, can be placed in the branch delay slots. Unfortunately, studies suggest that it is considerably harder to find instructions to place in the second branch delay slot [22]. Furthermore, this branch delay cannot be reduced by scheduling the comparison ahead of the conditional branch instruction. Both compare and branch are permanently locked together in a single instruction.
Nonetheless the logic required for branch resolution is significantly simplified (Fig. 9). All the architectures previously considered require the ALU or comparator logic to generate information on multiple relationships. The actual condition required is only specified in a subsequent branch instruction. In contrast, with a compare and branch instruction the comparison required is known in advance of the ALU operation. The required condition can therefore be computed directly and the condition selection process dispensed with.
A further simplification of the branch architecture occurs in the commercial family of MIPS processors [19]. Here fully general compare and branch instructions are only provided to test for equality and inequality. The four remaining branch instructions always compare a single operand against zero. These changes simplify the computation of the branch condition and raise the possibility of resolving some, if not all, compare and branch instructions one cycle earlier at the end of the second pipeline stage. It is therefore disappointing to find that the superpipelined R4000 still requires both the RF and ALU stages to resolve compare and branch instructions [23]. Although this allows the ALU to be used to compute branch target addresses, the implementation results in three branch delay slots,


only one of which can be filled by the compiler. It should, however, be noted that the third delay slot is a direct result of the superpipelining which allows two cycles rather than one to access the instruction cache.
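The attraction of limiting general compare-and-branch instructions to equality tests, as the commercial MIPS family does, is that equality requires no carry propagation and so can be evaluated faster than a full magnitude comparison. A minimal sketch of the idea (the function names are ours):

```c
#include <assert.h>
#include <stdint.h>

/* Equality needs no carry chain: XOR the operands bitwise and check
   that no bit is set.  In hardware this is a tree of XOR gates feeding
   a wide NOR, shallower than a 32-bit carry chain, which is why
   equality branches can in principle be resolved a pipeline stage
   earlier than magnitude comparisons. */
static int equal(uint32_t a, uint32_t b) { return (a ^ b) == 0; }

/* Comparisons of one operand against zero are similarly cheap: a wide
   NOR for the zero test, or just the sign bit for a negative test. */
static int is_zero(uint32_t a)     { return a == 0; }
static int is_negative(uint32_t a) { return (int)(a >> 31); }
```

This is consistent with the MIPS restriction described above: the only two-operand branch conditions are EQ and NE, while the remaining branches test a single operand against zero.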

4.5. Boolean registers

Finally, the University of Hertfordshire HARP [31] architecture provides a group of eight Boolean registers which are explicitly set to True or False by a compare instruction. Subsequent conditional branch instructions then test a specific Boolean register. A typical conditional branch sequence is therefore:

NE B3, R6, #10 /* B3 := (R6 <> 10) */
BT B3, label /* branch if B3 is True */

Fig. 10. Branch resolution: Boolean Registers. (Figure: the ALU performs the specific comparison into one of eight Boolean registers, with B0 always zero; the branch instruction selects the required Boolean register to choose between the next sequential and branch target instruction addresses.)

The HARP architecture provides the same instruction scheduling advantages as the RS6000 and the M88000. There is no requirement for branch instructions to immediately follow compare instructions, and multiple Boolean conditions can be pre-calculated for future use. As in the case of the RS6000, Boolean registers can be manipulated in a separate Boolean unit. Also, since Boolean registers can be loaded and stored from memory, they can be used to implement Boolean variables directly. Boolean values, however, cannot be manipulated as integers. It is therefore not possible to perform the addition operation in the following expression without first explicitly moving the Boolean value (c > d) to an integer register: a = b + (c > d).
A further disadvantage of Boolean registers is that explicit comparisons against zero are always required. Boolean registers are never set "for free". This drawback could be eliminated by adding a further set of branch instructions of the form

Bcc Rsrc, label /* if (Rsrc relop 0) branch to label */

to the architecture. Implementation of the branch architecture in our example pipeline is shown in Fig. 10. Once again

branch condition selection is required, this time to select one of eight Boolean register values. Boolean registers, however, share one significant advantage with architectures which combine compare and branch instructions. In both cases the ALU or comparator which evaluates the branch conditions is only required to generate a single result. As is shown in Section 5, this advantage allows the condition required to be made available on a single output line significantly before the end of a worst case addition operation and considerably eases the implementation of the branch architecture.

4.6. Branch architecture summary

The significant characteristics of the five example branch architectures are summarised in Table 2. Several major deficiencies in the classic condition code model have been identified. First, when scheduling instructions for a high-performance processor, it is impossible to separate a compare instruction from its conditional branch. Second, it is not possible to pre-compute multiple branch conditions. Finally, the resolution of branch conditions is unnecessarily complex.

Table 2
Comparison of branch architectures

Branch mechanism           Relations  Specific  Fast     Separate    Save results  Boolean
                           computed   relation  branch   compare     of multiple   results
                           directly   computed  on zero  and branch  comparisons   directly
                                                                                   available
Condition codes            No         No        Yes      No          No            No
Acorn ARM                  No         No        Yes      Yes         No            No
SUN Sparc                  No         No        Yes      Yes         No            No
National 32000             Yes        No        No       Yes         No            No
M88000 (CC -> reg)         Yes        No        Yes      Yes         Yes           No
DEC Alpha                  Yes        Yes       Yes      Yes         Yes           ?
RS6000 (multiple CC)       Yes        No        Yes      Yes         Yes           Partially
R2000 (compare & branch)   Yes        Yes       Yes      No          No            No
HARP (Boolean registers)   Yes        Yes       No       Yes         Yes           Partially

The instruction scheduling difficulty is avoided by all the alternative architectures considered, with the exception of the combined compare and branch architecture of the R2000. The branch resolution complexity can be traced to several factors. First, the required relationships are not computed directly. Instead they are deduced indirectly from condition code flags. Second, instead of calculating a single condition for each branch, the processor is required to compute and retain sufficient information to resolve all possible signed and unsigned conditional branches. The other four example architectures simplify branch resolution by calculating branch conditions directly. Two of the architectures, the R2000 and HARP, allow further simplification by only requiring a single Boolean condition to be calculated for each conditional branch instruction.

One desirable feature of the condition code model is that tests against zero are generally avoided. The table shows that virtually all the alternative branch architectures retain this advantage and avoid the overhead of separate comparisons against zero. Finally, the table suggests that some progress has been made, particularly on the RS6000 and on HARP, in allowing the results of comparisons to be used directly as Boolean variables. We look forward to further progress in this area.

Table 2 also includes several additional architectures. One significant recent architecture, the DEC Alpha [29], simplifies the M88000 mechanism by getting compare instructions to return a single Boolean value to a general-purpose register. The DEC Alpha is therefore a third architecture which

only requires a single relationship to be computed for each conditional branch. As a result the Alpha branch resolution has the same simplicity as the R2000 and HARP. Essentially the Alpha and HARP branch architectures only differ in the location chosen for the Boolean variables. HARP chooses separate Boolean registers to reduce read/write port utilisation on the register file. Alpha chooses to integrate Boolean values into the register file to reduce the machine state.

5. A comprehensive ALU building block

Traditional ALUs produce relational information as a side effect of the subtract operation. As a result this condition code or relational information is not available until after the main ALU result. In contrast, two of the example branch architectures, those using Boolean registers and those using combined branch and compare instructions, only require the ALU to perform a single relational comparison (Table 2). As a result these branch architectures offer important implementation advantages. In this section Relational Unit and ALU designs are developed. Both have the ability to produce the required relational result on a single output line. Furthermore this single result is made available significantly ahead of the traditional sum outputs, further improving the implementation of the branch architecture. The Relational Unit is developed first. This is partly for exposition purposes, but also because the authors believe that the provision of separate relational units should be seriously considered in future high-performance processor designs.

5.1. Relational unit design

The Relational Unit takes two 32-bit operands, X and Y, as inputs and performs the ten standard signed and unsigned relational operations (Fig. 11). The required comparison is selected by a group of control signals, S3-S0 and CIN, and a single Boolean output gives the result. The theory of the carry lookahead adder [12, 20] is extended to produce a fast parallel comparator. Little endian notation is used throughout. In the following, ~X(i) denotes the complement of X(i).

Fig. 11. Relational unit design.

In order to add two numbers, X and Y, the carries into each bit position must first be obtained. For every bit position, i, the following equation must hold:

Cout(i) = G(i) + Cin(i).P(i)

where

G(i) = X(i).Y(i)
P(i) = X(i) + Y(i).

Well-known techniques exist for generating the required carry signals from G(i) and P(i). These techniques can be used in a Relational Unit design by casting the comparison equations for each bit position in the above form with alternative definitions for G(i) and P(i). The relational result can then be computed as if it were the carry from the most significant bit position. The appropriate equations for G(i) and P(i) for each comparison are developed in the following sections and summarised in Table 3. Unsigned comparisons are considered first. The design is then extended to cope with signed comparisons.

5.1.1. Greater than comparison

For an unsigned greater than comparison (GTU), the required result is obtained if the following calculation is performed in each bit position:

GTUout(i) = GT(i) + GTUin(i).EQ(i).

Essentially, the number represented by the i least significant bits of X is greater than the number represented by the i least significant bits of Y either if X(i) > Y(i) or if X(i) = Y(i) and the number represented by the first (i - 1) bits of X is greater than the number represented by the first (i - 1) bits of Y. Substituting for GT(i) and EQ(i) gives:

GTUout(i) = X(i).~Y(i) + GTUin(i).(X(i).Y(i) + ~X(i).~Y(i))

which reduces to

GTUout(i) = X(i).~Y(i) + GTUin(i).(X(i) + ~Y(i)).

To ensure correct operation in the least significant bit position, GTUin(0), or CIN, must also be set to zero. The above equations are in the required form:

Cout(i) = G(i) + Cin(i).P(i)

where

G(i) = X(i).~Y(i)
P(i) = X(i) + ~Y(i)  or  P(i) = X(i).Y(i) + ~X(i).~Y(i).

Two formulae are given for P(i) to allow either form to be used in a specific implementation. If in each bit position the above values are selected in place of G(i) and P(i), a standard carry lookahead adder will produce the required result as Cout(31), the carry out of the most significant bit position.
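The GTU recurrence can be checked directly with a short bit-serial model. This Python sketch is not from the paper: it evaluates the same G(i) and P(i) definitions with a ripple carry in place of the lookahead network, which computes an identical result, only more slowly:

```python
# Check of the GTU recurrence: G(i) = X(i).~Y(i), P(i) = (X(i) = Y(i)),
# CIN = 0, evaluated as a ripple carry from bit 0 to bit 31.

def gtu(x, y, width=32):
    c = 0                                   # GTUin(0) = CIN = 0
    for i in range(width):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        g = xi & (yi ^ 1)                   # generate: X(i) = 1, Y(i) = 0
        p = 1 if xi == yi else 0            # propagate: X(i) = Y(i)
        c = g | (p & c)                     # Cout(i) = G(i) + P(i).Cin(i)
    return c                                # carry out of bit 31 = (X > Y)

# Exhaustive check over small operands against Python's own comparison.
assert all(gtu(x, y) == (1 if x > y else 0)
           for x in range(16) for y in range(16))
```

The second, XNOR form of P(i) is used here because, as noted below, the same propagate expression serves every comparison in Table 3.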


Table 3
Unsigned comparisons

Comparison          G(i)                      P(i)                                        CIN
EQ  (=)             0                         X(i).Y(i) + ~X(i).~Y(i)                     1
GTU (> unsigned)    X(i).~Y(i)                X(i) + ~Y(i)  or  X(i).Y(i) + ~X(i).~Y(i)   0
GEU (>= unsigned)   X(i).~Y(i)                X(i) + ~Y(i)  or  X(i).Y(i) + ~X(i).~Y(i)   1
LTU (< unsigned)    ~X(i).Y(i)                ~X(i) + Y(i)  or  X(i).Y(i) + ~X(i).~Y(i)   0
LEU (<= unsigned)   ~X(i).Y(i)                ~X(i) + Y(i)  or  X(i).Y(i) + ~X(i).~Y(i)   1
NEQ (<>)            X(i).~Y(i) + ~X(i).Y(i)   1  or  X(i).Y(i) + ~X(i).~Y(i)              0

5.1.2. Greater than or equal comparison

If an unsigned greater than or equal (GEU) comparison is required, the equations linking GEUout with GEUin are identical to those relating GTUout to GTUin in the previous section. The only difference is that CIN, the carry into the least significant bit, must now be set to logic one to ensure that the result is also one when X equals Y.

Alternatively, a greater than or equal comparison can be viewed as a subtraction. If Y is subtracted from X, the carry generated from the most significant bit position is True iff X >= Y. Subtraction in twos complement is achieved by inverting the bits of Y, adding X and forcing a logic one into the least significant bit position (CIN = 1). If the bits of Y are first inverted, the generate signal, G(i) = X(i).Y(i), becomes G(i) = X(i).~Y(i) and the propagate signal, P(i) = X(i) + Y(i), becomes P(i) = X(i) + ~Y(i). These equations are precisely those deduced for G(i) and P(i) in the previous section.

5.1.3. Equal comparison

The comparisons X < Y and X <= Y introduce no further cases since the roles of X and Y are simply reversed. EQ (Equals) and NEQ (Not Equals) do, however, require separate consideration. First consider EQ. Equality can be calculated using the following equation in each bit position:

EQout(i) = EQin(i).EQ(i).

The i least significant bits of X and Y are equal iff X(i) = Y(i) and the least significant i - 1 bits of X and Y are equal. Substituting:

EQout(i) = EQin(i).(X(i).Y(i) + ~X(i).~Y(i)).

For the least significant bit to function correctly we must also set CIN = EQin(0) = 1. Here,

G(i) = 0
P(i) = X(i).Y(i) + ~X(i).~Y(i).

5.1.4. Not equal comparison

Now consider NEQ. For each bit position

NEout(i) = NE(i) + NEin(i).

The i least significant bits of X are not equal to the i least significant bits of Y if either X(i) <> Y(i) or if the i - 1 least significant bits of X differ from the i - 1 least significant bits of Y. Substituting for NE(i):

NEout(i) = (X(i).~Y(i) + ~X(i).Y(i)) + NEin(i)

or alternatively

NEout(i) = (X(i).~Y(i) + ~X(i).Y(i)) + (X(i).Y(i) + ~X(i).~Y(i)).NEin(i).

This time

G(i) = X(i).~Y(i) + ~X(i).Y(i)
P(i) = 1  or  P(i) = X(i).Y(i) + ~X(i).~Y(i)

and CIN = NEin(0) = 0. Again the second formula for P(i) may be more convenient to generate in a specific implementation.

If the only comparisons required are equality and inequality, the above mechanism is both unnecessarily cumbersome and slow; see for example a recent paper by Cortadella and Llaberia [7] where a high-performance implementation of the more general function (A + B) = K is described. If, however, all the logical comparisons are required, it is advantageous to use the same logic to generate all the results and to obtain the result of all comparisons on a single output line. The required formulae for G(i), P(i) and CIN are summarised in Table 3.

In the case of signed comparisons an additional adjustment must be made in the most significant bit position. Correct signed comparisons will be achieved if, for all signed comparisons, both sign bits are inverted on entry to the relational unit.
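Table 3 can be verified exhaustively with a short model. This Python sketch is illustrative, not from the paper: it uses the shared XNOR form of P(i), ripples the carry recurrence for each row of the table, and handles signed comparisons by inverting both sign bits on entry, as described above:

```python
# Check of Table 3: each comparison is computed as the carry out of the
# most significant bit, with G(i) and CIN chosen per row and the common
# P(i) = (X(i) = Y(i)) used throughout.
import itertools

def compare(op, x, y, width=32):
    defs = {  # op: (G(i) as a function of the bits, CIN)
        'EQ':  (lambda a, b: 0,               1),
        'NEQ': (lambda a, b: a ^ b,           0),
        'GTU': (lambda a, b: a & (b ^ 1),     0),
        'GEU': (lambda a, b: a & (b ^ 1),     1),
        'LTU': (lambda a, b: (a ^ 1) & b,     0),
        'LEU': (lambda a, b: (a ^ 1) & b,     1),
    }
    g_of, c = defs[op]
    for i in range(width):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        p = 1 if xi == yi else 0              # common P(i): X(i) = Y(i)
        c = g_of(xi, yi) | (p & c)
    return c

def compare_signed(op, x, y, width=32):
    sign = 1 << (width - 1)                   # invert both sign bits on entry
    return compare(op, x ^ sign, y ^ sign, width)

refs = {'EQ': lambda a, b: a == b, 'NEQ': lambda a, b: a != b,
        'GTU': lambda a, b: a > b, 'GEU': lambda a, b: a >= b,
        'LTU': lambda a, b: a < b, 'LEU': lambda a, b: a <= b}
for (x, y), (op, ref) in itertools.product(
        itertools.product(range(8), repeat=2), refs.items()):
    assert compare(op, x, y) == int(ref(x, y))
```

The sign-bit inversion maps twos-complement operands onto offset binary, after which the unsigned recurrences give the signed results; for example compare_signed('GTU', 1, 0xFFFFFFFF) correctly reports that 1 exceeds -1.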

5.1.5. Relational unit design

The design of a 32-bit Relational Unit based on the above ideas is outlined in Appendix 1. The first two logic levels are used to produce the appropriate generate and propagate signals for each bit position. As shown in Table 3, a single expression for P(i), X(i) = Y(i), can be used for all ten comparisons. The remaining three logic levels implement standard carry lookahead logic (Fig. 11), and produce the required branch condition in only nine gate delays. Significantly, this is faster than both the earlier traditional adder and comparator designs. All these figures would, of course, be roughly halved if the designs were implemented using complex CMOS gates of the types used by Quach [27] and Lynch [21] in their designs.

Future high-performance processors will, of necessity, require multiple integer functional units. Typically a shift register, a multiplication unit and several ALUs might be provided. We suggest that designers should also consider providing one or more separate Relational Units.

Separate Relational Units have several advantages. First, since a Relational Unit is significantly less complex and requires less fan out than an ALU, a CMOS implementation will require less area and be inherently faster than an ALU. This performance advantage can in turn be exploited to provide fast branch resolution. Second, unlike ALUs, Relational Units consume no result and bypass bus bandwidth. Furthermore, in architectures using separate Boolean registers, no additional write port capacity is required on the general-purpose register file. As an example, replacing three ALUs with two ALUs and two Relational Units is likely to provide greater performance and yet allow a reduction in the number of result and bypass buses. Against these advantages must be set the cost of any operand buses and register file read ports provided specifically for Relational Units. These costs may be significant if providing separate Relational Units does not allow a corresponding reduction in the number of ALUs.

5.2. ALU design

While the performance of the Relational Unit design developed in the previous section compares favourably with traditional comparator designs, it can also be incorporated directly into a 32-bit ALU design which implements the traditional arithmetic and logical functions (Appendix 2). The Relational Unit implementation is unchanged, with relational results now appearing on the carry output from the most significant bit position. A further logic stage is added to the carry generation logic to select the final sum output. Logical functions are generated by forcing G(i) to zero and selecting a value of P(i) equal to the required result. Generation of the groups of conditional sum values is omitted to simplify the presentation, as is the logic to generate a signed overflow signal [33].

Since the data paths are largely unchanged from the Relational Unit design, a relational operation is still nominally performed in nine gate delays. However, the complexity of the initial generate and propagate selection stage has increased. Furthermore, the requirement to generate carries at four-bit intervals has significantly increased both the complexity and fan out requirements of the carry generation logic. Thus using an ALU to perform comparisons is, in practice, inherently slower than using a separate Relational Unit. The sum formation requires eleven gate delays, again a time identical to the traditional ALU design discussed earlier. In contrast to the relational operations, the ALU results are not slowed down by the incorporation of a comparator since the data paths are identical to those of the traditional ALU design. The additional relational functionality has been achieved by simply changing the values of the control signals.

Table 4
Relational unit and ALU timing estimates

                                             Relational unit       ALU
                                             Gate     Timing       Gate     Timing
                                             delays   estimate     delays   estimate
(1) Traditional comparator                   11       30.9 ns      N/A      N/A
(2) Traditional ALU                          N/A      N/A          11       30.2 ns
(3) Relational unit (single value)           9        22.4 ns      N/A      N/A
(4) Comprehensive ALU (single relational)    9        25.3 ns      11       30.2 ns
(5) Branch condition from flags              6        12.2 ns      N/A      N/A

6. Discussion

The performance of the Relational Unit and ALU designs developed in the previous two sections is compared with more traditional designs in Table 4. Detailed timing estimates are also included. These figures are based on the worst case gate delays specified by Cascade for a typical 1.5 µm process [4]. These figures should be treated with great caution since, as emphasised earlier, detailed designs have not been undertaken for a specific technology.

These results suggest that branch architectures which only compute a single relational result can generate a branch resolution signal two gate delays ahead of the main ALU result. In contrast, in a traditional design the same relational result is only available a number of gate delays after the ALU result itself, even if it is assumed that all the condition code flags are available at the same time as the ALU result.

These results suggest that two of the branch architectures considered, the MIPS combined compare and branch architecture and the HARP Boolean register architecture, are particularly suitable for high-performance pipeline implementations which aim to cycle the ALU at the highest possible rate. In contrast, architectures which use condition codes, either in their traditional form or stored in general-purpose registers, will find it difficult to minimise ALU cycle times without adding additional branch delay slots.

7. Conclusions

The requirements of processor branch architectures have been related to ALU design, within the context of pipelined processor design. In this context traditional condition code architectures have been shown to be badly wanting. First, the use of condition codes severely impedes code motion and thus hinders instruction scheduling. Second, condition code branch mechanisms hinder early branch resolution by failing to resolve the branch condition until well after the time when the main ALU result is available.

To remove the restriction on instruction scheduling, multiple instances of the condition code information must be provided within an architecture. The M88000, RS6000 and HARP architectures all successfully achieve this replication using general-purpose registers, multiple condition code registers or Boolean registers.


To remove the second disadvantage, early resolution of the branch condition is essential. Two branch architectures, the MIPS compare and branch architecture and the HARP Boolean register mechanism, have been shown to be particularly suited to early branch resolution. Crucially both these architectures only require a single relational condition to be computed for each conditional branch. A Relational Unit and an ALU have been presented which meet the requirements of the MIPS and HARP branch architectures. As a result it has been demonstrated that either a multi-functional ALU or a separate Relational Unit can deliver the required branch condition on a single output line significantly ahead of the main ALU result. A case has also been made for providing one or more separate Relational Units within a high-performance multiple-instruction-issue processor.

Acknowledgement

The authors would like to acknowledge the support of the rest of the HARP team, in particular Rod Adams, Roger Collins, Sue Gray, Gordon Green, Simon Trainis and Liang Wang from Computer Science and Paul Findlay, Brian Johnson and Dave McHale from Electrical Engineering. They would also like to thank Dr. S.L. Stott, J.A. Davis and Dr. P. Kaye for their support throughout the HARP project. The HARP project is supported by SERC Research Grant GR/F88018.

Appendix 1. Relational unit design

Control signal settings:

Function            S3  S2  S1  S0  CIN
EQ  (=)             0   0   0   0   1
LTS (< signed)      0   1   1   0   0
LES (<= signed)     0   1   1   0   1
GTU (> unsigned)    0   1   0   1   0
GEU (>= unsigned)   0   1   0   1   1
LTU (< unsigned)    1   0   1   0   0
LEU (<= unsigned)   1   0   1   0   1
GTS (> signed)      1   0   0   1   0
GES (>= signed)     1   0   0   1   1
NEQ (<>)            1   1   1   1   0

Level 0. Generate both phases of the inputs.
For i = 0 to 31
  XN(i) = ~X(i)
  YN(i) = ~Y(i).

Level 1. Generate G(i) and P(i) for each bit position.
For i = 0 to 30
  G(i) = S0.X(i).YN(i) + S1.XN(i).Y(i)
  P(i) = X(i).Y(i) + XN(i).YN(i).
For i = 31
  G(i) = S2.X(i).YN(i) + S3.XN(i).Y(i)
  P(i) = X(i).Y(i) + XN(i).YN(i).

Level 2. Generate G4(i) and P4(i) for each four-bit group.
For i = 0 to 7
  G4(4i+3) = G(4i+3) + G(4i+2).P(4i+3) + G(4i+1).P(4i+2).P(4i+3) + G(4i).P(4i+1).P(4i+2).P(4i+3)
  P4(4i+3) = P(4i+3).P(4i+2).P(4i+1).P(4i).

Level 3. Generate G16(i) and P16(i) for each sixteen-bit group.
For i = 0 to 1
  G16(16i+15) = G4(16i+15) + G4(16i+11).P4(16i+15) + G4(16i+7).P4(16i+11).P4(16i+15) + G4(16i+3).P4(16i+7).P4(16i+11).P4(16i+15)
  P16(16i+15) = P4(16i+15).P4(16i+11).P4(16i+7).P4(16i+3).

Level 4. Generate the required relational result as the carry from the most significant bit position.
  BR = G16(31) + G16(15).P16(31) + CIN.P16(15).P16(31).

Appendix 2. Comprehensive ALU design

Control signal settings:

Function            S7  S6  S5  S4  S3  S2  S1  S0  CIN
EQ  (=)             0   0   0   0   1   1   0   0   1
LTS (< signed)      0   0   0   1   1   1   1   0   0
GEU (>= unsigned)   0   0   0   1   1   1   0   1   1
LTU (< unsigned)    0   0   1   0   1   1   1   0   0
GES (>= signed)     0   0   1   0   1   1   0   1   1
NEQ (<>)            0   0   1   1   1   1   1   1   0
ADD                 1   1   0   0   0   0   0   0   0
SUB                 0   0   0   1   1   1   0   1   1
RSUB                0   0   1   0   1   1   1   0   1
AND                 0   0   0   0   0   1   0   0   0
OR                  1   0   0   0   0   1   0   0   0
EOR                 1   0   0   0   0   0   0   0   0

Level 0
For i = 0 to 31
  XN(i) = ~X(i)
  YN(i) = ~Y(i).

Level 1
For i = 0 to 30
  G(i) = S0.X(i).YN(i) + S1.XN(i).Y(i) + S6.X(i).Y(i)
  P(i) = S2.X(i).Y(i) + S3.XN(i).YN(i) + S7.X(i).YN(i) + S7.XN(i).Y(i).
For i = 31
  G(i) = S4.X(i).YN(i) + S5.XN(i).Y(i) + S6.X(i).Y(i)
  P(i) = S2.X(i).Y(i) + S3.XN(i).YN(i) + S7.X(i).YN(i) + S7.XN(i).Y(i).

Level 2
For i = 0 to 7
  G4(4i+3) = G(4i+3) + G(4i+2).P(4i+3) + G(4i+1).P(4i+2).P(4i+3) + G(4i).P(4i+1).P(4i+2).P(4i+3)
  P4(4i+3) = P(4i+3).P(4i+2).P(4i+1).P(4i).

Level 3
For i = 0 to 1
  G16(16i+15) = G4(16i+15) + G4(16i+11).P4(16i+15) + G4(16i+7).P4(16i+11).P4(16i+15) + G4(16i+3).P4(16i+7).P4(16i+11).P4(16i+15)
  G12(16i+11) = G4(16i+11) + G4(16i+7).P4(16i+11) + G4(16i+3).P4(16i+7).P4(16i+11)
  G8(16i+7) = G4(16i+7) + G4(16i+3).P4(16i+7)
  P16(16i+15) = P4(16i+15).P4(16i+11).P4(16i+7).P4(16i+3)
  P12(16i+11) = P4(16i+11).P4(16i+7).P4(16i+3)
  P8(16i+7) = P4(16i+7).P4(16i+3).

Level 4
  BR = C31 = G16(31) + G16(15).P16(31) + CIN.P16(15).P16(31)
  C27 = G12(27) + G16(15).P12(27) + CIN.P16(15).P12(27)
  C23 = G8(23) + G16(15).P8(23) + CIN.P16(15).P8(23)
  C19 = G4(19) + G16(15).P4(19) + CIN.P16(15).P4(19)
  C15 = G16(15) + CIN.P16(15)
  C11 = G12(11) + CIN.P12(11)
  C7 = G8(7) + CIN.P8(7)
  C3 = G4(3) + CIN.P4(3).

Sum selection
A 32-bit wide 2-to-1 multiplexer is used to select the final sum bits, where S0 and S1 denote the conditional sums formed assuming a group carry-in of 0 and 1 respectively, and C(-1) = CIN.
For i = 0 to 7
If C(4i - 1) = 0
  S(4i) = S0(4i)
  S(4i+1) = S0(4i+1)
  S(4i+2) = S0(4i+2)
  S(4i+3) = S0(4i+3)
Else
  S(4i) = S1(4i)
  S(4i+1) = S1(4i+1)
  S(4i+2) = S1(4i+2)
  S(4i+3) = S1(4i+3).
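The Appendix 1 design can be exercised end to end with a small software model. This Python sketch is illustrative, not from the paper: it applies the Level 0 complements and the Level 1 G/P selection under S3-S0, then collapses Levels 2-4 into a single ripple carry, which computes the same carry out of bit 31 as the lookahead tree. The control settings are taken from the Appendix 1 table:

```python
# Bit-level model of the Appendix 1 Relational Unit.
import random

CTRL = {  # name: (S3, S2, S1, S0, CIN), per the Appendix 1 table
    'EQ':  (0, 0, 0, 0, 1), 'NEQ': (1, 1, 1, 1, 0),
    'GTU': (0, 1, 0, 1, 0), 'GEU': (0, 1, 0, 1, 1),
    'LTU': (1, 0, 1, 0, 0), 'LEU': (1, 0, 1, 0, 1),
    'GTS': (1, 0, 0, 1, 0), 'GES': (1, 0, 0, 1, 1),
    'LTS': (0, 1, 1, 0, 0), 'LES': (0, 1, 1, 0, 1),
}

def relational_unit(name, x, y, width=32):
    s3, s2, s1, s0, c = CTRL[name]
    for i in range(width):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        xn, yn = xi ^ 1, yi ^ 1                  # Level 0: both phases
        if i < width - 1:                        # Level 1, bits 0..30
            g = (s0 & xi & yn) | (s1 & xn & yi)
        else:                                    # bit 31 uses S2/S3
            g = (s2 & xi & yn) | (s3 & xn & yi)
        p = (xi & yi) | (xn & yn)                # common P(i)
        c = g | (p & c)                          # Levels 2-4: carry chain
    return c                                     # BR output

def reference(name, x, y, width=32):
    signed = lambda v: v - (1 << width) if v >> (width - 1) else v
    if name.endswith('S'):                       # signed comparisons
        x, y = signed(x), signed(y)
    return {'EQ': x == y, 'NEQ': x != y,
            'GTU': x > y, 'GEU': x >= y, 'LTU': x < y, 'LEU': x <= y,
            'GTS': x > y, 'GES': x >= y, 'LTS': x < y, 'LES': x <= y}[name]

random.seed(1)
for _ in range(200):
    x, y = random.getrandbits(32), random.getrandbits(32)
    for name in CTRL:
        assert relational_unit(name, x, y) == int(reference(name, x, y))
```

Note how the signed comparisons fall out of the S2/S3 selection alone: swapping the generate term in the most significant bit position is equivalent to inverting both sign bits on entry, so no other part of the data path changes.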

References

[1] A.V. Aho, R. Sethi and J.D. Ullman, Compilers: Principles, Techniques and Tools (Addison Wesley, Reading, MA, 1986).
[2] M. Alsup, Motorola's 88000 family architecture, IEEE Micro (June 1990) 48-66.
[3] M. Butler, T. Yeh and Y. Patt, Single instruction stream parallelism is greater than two, 18th Annual Internat. Symp. on Computer Architecture, Toronto (May 1991) 276-286.
[4] Seattle Silicon CMOS Data Book, 1988.
[5] P.P. Chang, W.Y. Chen, S.A. Mahlke and W.W. Hwu, Comparing static and dynamic code scheduling for multiple-instruction-issue processors, Proc. 24th Annual Internat. Symp. on Microarchitecture, Albuquerque, NM (Nov. 1991) 25-33.
[6] P. Chow and M. Horowitz, Architectural tradeoffs in the design of MIPS-X, 14th Annual Internat. Symp. on Computer Architecture, Pittsburgh (June 1987) 300-307.
[7] J. Cortadella and J.M. Llaberia, Evaluation of A + B = K conditions without carry propagation, IEEE Trans. Comput. 41(11) (Nov. 1992).
[8] J.A. deRosa and H.M. Levy, An evaluation of branch architectures, Proc. 14th Annual Symp. on Computer Architecture, Pittsburgh (June 1987) 10-16.
[9] J.A. Fisher, Very long instruction word architectures and the ELI-512, 10th Annual Symp. on Computer Architecture (June 1983) 140-150.
[10] S.B. Furber, VLSI RISC Architecture and Organization (Marcel Dekker, New York, 1989).
[11] R.B. Garner, A. Agrawal, F. Briggs, E.W. Brown, D. Hough, B. Joy, S. Kleiman, S. Muchnick, M. Namjoo, D. Patterson, J. Pendleton and R. Tuck, The Scalable Processor Architecture (SPARC), CompCon 88, San Francisco (Feb. 1988) 278-283.
[12] J. Gosling, Design of Arithmetic Units for Digital Computers (MacMillan, New York, 1980).
[13] S.M. Gray, R.G. Adams, G.J. Green and G.B. Steven, Static instruction scheduling for the HARP multiple-instruction-issue architecture, to appear in Microprocessors and Microsyst.
[14] R.D. Groves and R. Oehler, RISC System/6000 processor architecture, Microprocessors and Microsyst. 14(6) (July/Aug. 1990) 357-366.
[15] J.L. Hennessy, VLSI processor architecture, IEEE Trans. Comput. (Dec. 1984) 1221-1245.
[16] C.B. Hunter and E. Farquhar, Introduction to the NS16000 architecture, IEEE Micro (April 1984) 26-47.
[17] M. Johnson, Superscalar Microprocessor Design (Prentice-Hall, Englewood Cliffs, NJ, 1991).
[18] N.P. Jouppi and D.W. Wall, Available instruction-level parallelism for superscalar and superpipelined machines, ASPLOS-III, Boston (April 1989) 272-282.
[19] G. Kane, MIPS RISC Architecture (Prentice-Hall, Englewood Cliffs, NJ, 1988).
[20] D.J. Kinniment and G.B. Steven, Sequential-state binary parallel adder, Proc. IEE 117(7) (July 1970) 1211-1218.
[21] T. Lynch and E.E. Swartzlander, A spanning tree carry lookahead adder, IEEE Trans. Comput. 41(8) (Aug. 1992) 931-939.
[22] S. McFarling and J. Hennessy, Reducing the cost of branches, 13th Annual Symp. on Computer Architecture (June 1986) 396-403.
[23] S. Mirapuri, M. Woodacre and N. Vasseghi, The MIPS R4000 processor, IEEE Micro (April 1992) 10-22.
[24] MC88100 Microprocessor User Manual, Motorola (1988).
[25] T.N. Mudge, R.B. Brown, W.P. Birmingham, J.A. Dykstra, A.I. Kayssi, R.J. Lomax, O.A. Olukotun, K.A. Sakallah and R.A. Milano, The design of a microsupercomputer, IEEE Comput. (Jan. 1991) 57-64.
[26] D.A. Patterson and J.L. Hennessy, Computer Architecture: A Quantitative Approach (Morgan Kaufmann, Los Altos, CA, 1990).
[27] N.T. Quach and M.J. Flynn, High-speed addition in CMOS, IEEE Trans. Comput. 41(12) (Dec. 1992) 1612-1615.
[28] R.D. Russell, The PDP-11: A case study of how not to design condition codes, Proc. 5th Annual Symp. on Computer Architecture (April 1978) 190-194.
[29] R.M. Supnik, Digital's Alpha, CACM 36(2) (Feb. 1993) 30-44.
[30] G.B. Steven, A novel effective address calculation mechanism for RISC microprocessors, SIGARCH (Sept. 1988) 150-156.
[31] G.B. Steven, S.M. Gray and R.G. Adams, HARP: A parallel pipelined RISC processor, Microprocessors and Microsyst. 13(9) (Nov. 1989) 579-587.
[32] G.B. Steven, R.G. Adams, P.A. Findlay and S.A. Trainis, iHARP: A multiple instruction issue processor, IEE Proc. Part E, Computers and Digital Techniques 139(5) (Sept. 1992) 439-449.
[33] G.B. Steven and F.L. Steven, The relationship between ALU design and processor branch architecture, University of Hertfordshire Technical Report TR150, March 1993.
[34] D. Tabak, RISC Systems (Wiley, New York, 1990).

Gordon Steven is a Principal Lecturer and Research Leader in Computer Architecture at the University of Hertfordshire, UK. His current research interests include the development of HARP, a RISC processor with multiple parallel pipelines, and the implementation of procedural languages on microprocessor architectures. He received a BSEE in 1966 and an MSEE in 1967, both from Princeton University, USA, and his PhD in Computer Science in 1969 from Manchester University, UK. Before joining the University of Hertfordshire, he spent eight years working in the computer industry with Plessey, Computer Technology and Hawker Siddeley Dynamics Engineering. Gordon Steven is a member of Phi Beta Kappa, ACM and IEEE.

Fleur Steven is a post-doctoral research fellow at the University of Hertfordshire. She obtained a BSc (Hons) degree in zoology from the University of Hull in 1985 and an MSc and PhD in computer science from the University of Hertfordshire in 1986 and 1989 respectively. Her research interests include the relationship between high level languages and microprocessor architectures and compiler development. Fleur Steven is also a Fellow of the Royal Entomological Society.