RISC System/6000 processor architecture
Randy D Groves and Richard Oehler* describe the features and operation of IBM's second-generation RISC architecture
The 801 minicomputer project at IBM Research in Yorktown Heights, NY, USA, in 1975 pioneered many of the architectural concepts used in RISC systems, including IBM's RT System. The paper describes a second-generation RISC architecture, the POWER architecture, which is based on subsequent research by the original 801 team and is used in the recently announced RISC System/6000. The architecture was designed to support superscalar implementations which can execute multiple instructions every cycle. It provides compound-function instructions which allow application path lengths to be less than would be required on many complex instruction set computers. The architecture also exploits advances in optimizing compiler and operating system technology. An extension to the original 801 virtual memory architecture for hardware support of database storage is also described.
microprocessors
RISC
superscalar architectures
Advanced Workstation Division, IBM, 11400 Burnet Road, Austin, TX 78758, USA
*T. J. Watson Research Center, IBM, Yorktown Heights, NY, USA
Paper received: 21 February 1990. Revised: 21 May 1990

The RISC System/6000 (RS/6000) processor architecture was developed to support both engineering-scientific and commercial application environments. To achieve this objective, several goals for this second-generation RISC processor architecture were identified:
• It should exploit recent advances in ICs, software and machine organization as well as provide the ability to exploit future advances in these technologies efficiently.
• It should allow implementations to realize a balance between fixed-point and floating-point performance and to provide near-vector computer performance in numerically intensive applications, while still performing well on scalar codes.
• It should support the efficient implementation of instruction and data caches.
• It should provide improved hardware support for database storage and transaction processing.
• It should have a virtual memory architecture which allows for efficient mapping of files and shared objects among a large number of concurrently active processes.
These goals are extensions to those established for the RT processor and system architecture 2-4, with the intention of addressing higher levels of performance and function for both numerically-intensive and large, multiuser systems.

The RS/6000 processor architecture is called the POWER architecture (Performance Optimization With Enhanced RISC). The definition of the architecture was the result of a unique cooperative effort between multiple IBM locations. The seminal ideas for the architecture were developed at IBM Research in Yorktown Heights by a group consisting of most of the original 801 research team. They began work on this architecture in 1985 on a project they called America. The project was then transferred to the IBM Advanced Workstation Division in Austin, TX, in 1986, along with many of the research team members. Since then the architecture has incorporated additional inputs from the IBM Burlington and Toronto, Canada development labs.
DEVELOPMENT OF THE ARCHITECTURE
When the 801 minicomputer team reunited to examine the issues of machine organization and architecture in the America project, they reaffirmed the basic RISC design philosophies which had inspired the 801. These philosophies are that
• The hardware and software should be designed together to achieve an optimized system.
• Only instructions which can be effectively used by the compilers or operating system are included in the architecture.
• Only instructions which can be more effectively implemented in hardware than in software are included in the architecture.
0141-9331/90/06357-10 © 1990 Butterworth-Heinemann Ltd
Vol 14 No 6 July/August 1990
• The instruction set should be regular and orthogonal with minimized or explicit control over side-effects.
• As many optimizations as possible should be moved from run-time to compile-time.
• The base hardware functions should be exposed to the compiler so that it can perform better:
  • optimization by generating only the required operations;
  • elimination of common operations;
  • motion of operations to code sections with lower frequency of execution (e.g. outside of loops);
  • utilization of the hardware through scheduling of the pipelines and dependencies.
Despite the name 'RISC' becoming associated with these design philosophies, reducing the number of instructions in a computer's instruction set has been the result of applying these philosophies and not the primary goal. The application of these RISC design philosophies has produced the following set of features which are generally associated with all RISC architectures:
• A large number of general-purpose fixed- and floating-point registers (usually 32 or more)
• References to memory are restricted to separate load and store instructions (a 'load-store' architecture)
• Operations are performed on registers and returned to registers, usually involving three operands: two source registers and one destination register.
• A limited set of data types is explicitly supported by the hardware (usually bytes, characters, half-word and full-word integers, and single- and double-precision floating-point).
• Overlap of the fetch of branch target instructions with the execution of other instructions is facilitated.
• A fixed instruction size with the register fields located in the same bit positions in all formats.
• Hardwired, pipelined control of the processor rather than microcoded control
• Dependence on optimizing compilers and system software for maximum performance
Three components determine processor performance:
1. The number of instructions required to perform the task (path length).
2. The average number of cycles required to execute each instruction (cycles per instruction).
3. The cycle time of the processor.
Obviously, the performance of a processor and its architecture can be improved by reducing any one or all of these three components. Since a dataflow model of the original 801 architecture existed which had focussed on reducing the levels of logic required to perform a cycle, the America team felt that as long as this model was not significantly changed, the benefits of newer technology with higher levels of integration would reduce cycle time. Consequently, no architectural effort was focussed on reducing cycle time. Instead, emphasis was placed on reducing both the number of instructions and the number of cycles per instruction.

To reduce the number of instructions required to perform a task, high-leverage compound-function instructions were explored which could replace two or more of the original 801 instructions. Several multicycle instructions were also defined for frequently executed functions where, with additional circuits and wider buses, significant savings in total cycles could be achieved compared with a functionally equivalent sequence of several single-cycle instructions. All of these new instructions were selected according to the RISC design philosophies above.

The original design target for the 801 was to execute one instruction every cycle. The emphasis of the America team was to define an architecture whose implementations could easily execute more than one instruction per cycle, otherwise known as a 'superscalar' architecture. The POWER architecture was designed from the beginning to allow parallel execution by distinct functional units. This enables the execution of as many instructions per cycle as the processor implementation and compiler technology will allow. Finally, extensions to the 801 and RT virtual memory architectures were created to include cache memory and to manage virtual and database storage more efficiently. The result of the America research project was the definition of a second-generation RISC architecture which has now been implemented as the POWER architecture in the RS/6000 system.

FUNCTIONAL UNITS
The processor architecture is based on a logical view of the processor consisting of three independent functional units: a branch processor, a fixed-point processor and a floating-point processor. The interaction of these units with the instruction and data caches, as well as with main storage and I/O, is shown in Figure 1. The functions performed by each of these units are described in the following sections. Specific details of the implementation of these units can be found in References 5 and 6. The key feature of these functional units is that they are designed for maximum concurrency between the units. The 184 instructions were divided between the functional units and defined to minimize the interaction and synchronization of these functional units.

Branch processor
Microprocessors and Microsystems

[Figure 1. Logical view of the POWER architecture. Blocks shown: instruction cache; branch processor; fixed-point processor; floating-point processor; data cache; main memory; direct memory access; programmed I/O; I/O registers and devices. Register labels shown: CR, SRR0, SRR1, LR, CTR, MSR, TID, GPRs, RTC, MQ, DEC, XER, FPRs, FPSCR, EIS, EIM.]

The logical function of the branch processor is to process the incoming instruction stream from the instruction
cache and feed a steady flow of instructions to the fixed-point and floating-point processors. The branch processor provides all of the branching, interrupt and condition code functions within the system. It is designed to execute the seven different branch instructions and nine different condition register instructions.

As shown in Figure 1, the branch processor logically contains six special registers. The machine state register (MSR) contains vital machine state such as user or supervisor mode, interrupt enable or disable, and address relocate enable or disable. Upon interrupt, the save-restore registers (SRR0, SRR1) are used to retain the old value of the MSR and the address of the instruction where the interrupt occurred, as described in 'Interrupts' below. The return from interrupt instruction reverses the process, restoring the MSR value and resuming instruction execution based on the values in the SRRs. Saving state in the SRRs allows for very fast interrupt processing since no data memory references are required to save or restore state. The software in the interrupt handler need only save and restore the processor state that is absolutely required for the interrupt. Since all data references are handled by the fixed-point processor, architecting interrupts that can be handled without data references is required to reduce interlocking and synchronization between the branch and fixed-point processors.

The remaining three special registers in the branch processor are also key in enabling the overlap of the branch processor with the other functional units. The branch processor logically contains a 32-bit condition register (CR). The CR is located within the branch processor since access to these bits is required to resolve CR-based conditional branches. While most CPU architectures have some sort of register that contains the condition code for the results of operations in the machine, this condition register is unique.
First, the CR contains eight independent condition code fields, as shown in Figure 2, which are managed by the compiler as a special set of eight registers 7. With this technique, multiple condition codes can be retained across the region for which they are live. In addition, the compilers can move the setting of condition codes outside of loops. Since each functional unit can send its condition code to different fields in the CR, interlocks between the functional units caused by the sharing of a common condition code can be avoided by the compiler, thus increasing the parallelism that can be achieved. To further reduce synchronization between the units, the compiler has explicit control over which instructions return a condition code through the record bit included in most instructions.

Condition register fields (bit positions):
  CR0: 0-3    CR1: 4-7    CR2: 8-11    CR3: 12-15
  CR4: 16-19  CR5: 20-23  CR6: 24-27   CR7: 28-31
Fixed-point condition code:     LT  GT  EQ  SO
Floating-point condition code:  LT  GT  EQ  UO
Figure 2. Condition register and condition code formats (LT: less than; GT: greater than; EQ: equal; UO: unordered; SO: summary overflow)

The branch processor supports seven branch instructions
including the return from interrupt previously discussed. All branch instructions have a special link bit which, if set, will cause the address of the next instruction to be placed in the link register (LR) in the branch processor. This function is used to provide the return address on subroutine and supervisor calls. Two of the branch instructions support either program counter relative or absolute addressing for computing the branch target address. Two other branch instructions support indirect addressing using the contents of either the link register (LR) or the count register (CTR). These two instructions are provided for return from subroutine and for branches to addresses that are not program-counter relative. The branch processor also supports the supervisor call (SVC) instruction which is really a software interrupt. The address of the instruction following the SVC is placed in the LR if the link bit of the instruction is set. The current value of the MSR and the SVC number in the instruction are placed into the CTR. The new MSR is the same as the old MSR except that external interrupts are disabled and the machine is forced into supervisor mode. Instruction fetching begins at one of 128 different vector addresses depending on the SVC number. The return from SVC instruction is provided for reversing this process. The definitions of these two instructions allow them to execute entirely in the branch processor. In addition, an SVC can now be treated with semantics similar to a subroutine call. With this kind of SVC mechanism, simple operating system services can be performed rapidly without complicated state manipulation or synchronization between the functional units. Three of the four branch instructions are conditional. The format of the conditional branch instructions is shown in Figure 3. A conditional branch can be based on the value of any bit within the CR. The conditional branch instructions also have a count capability. 
Conditional branch instructions (fields in bit order):
  BC:         BOP | BO | BI | BD | AA | LK
  BLR, BCTR:  BOP | BO | BI | ... | LK

  BO      Branch option
  0000-   Decrement CTR, branch if CTR != 0 & CR_bit = 0
  0001-   Decrement CTR, branch if CTR = 0 & CR_bit = 0
  001--   Branch if CR_bit = 0
  0100-   Decrement CTR, branch if CTR != 0 & CR_bit = 1
  0101-   Decrement CTR, branch if CTR = 0 & CR_bit = 1
  011--   Branch if CR_bit = 1
  1-00-   Decrement CTR, branch if CTR != 0
  1-01-   Decrement CTR, branch if CTR = 0
  1-1--   Branch always
  where "-" means don't care
Figure 3. Conditional branch instruction formats and branch options (BC: branch conditional; BCTR: branch conditional to CTR; BOP: branch unit op code; BLR: branch conditional to LR; AA: absolute address; LK: set link register; BD: branch displacement; BI: CR bit tested by branch; BO: branch options)

          cmpi    cr0,r4,0      # test for length <= 0
          mtspr   CTR,r4        # set number of bytes to zero
          cal     r5,0(r0)      # value to store = 0
          bler    cr0           # return if length <= 0
          ai      r3,r3,-1      # decrement addr by 1
  zloop:  stbu    r5,1(r3)      # incr r3 and zero a byte
          bc      16,0,zloop    # dec CTR and jump if CTR != zero
          br                    # return
Figure 4. Object code listing of void bzero(addr, len);

The count feature is primarily used as the loop-closing instruction of
an innermost DO loop. When the count feature is enabled in a conditional branch, the CTR is decremented by 1 and tested to see if it is 0. The decrement and test of the CTR can be selected independent of or in concert with the test of a CR bit, thus providing a powerful conditional branch capability.

The code sequence in Figure 4 illustrates the use of the branch unit registers (and the overlap of the branch and fixed-point units). It is a simple implementation of the C subroutine void bzero(addr, len);. The purpose of this code is to set an area of memory to zero. There are two parameters to the subroutine: r3 (GPR3) contains the address of the string to be zeroed, and r4 contains the length of the string. The LR (link register) contains the return address. The cmpi (compare immediate) instruction tests the length for zero. The mtspr (move to special-purpose register) instruction moves the count to the CTR. The cal (compute address lower) instruction is an easy way to put a zero into r5. The bler is a pseudo-instruction for a conditional branch on less than or equal, where the branch target is in the link register. The ai (add immediate) instruction decrements the starting address (because the next instruction will pre-increment it). The stbu (store byte with update) instruction stores 1 byte and updates the target address (by one in this case). Updating is discussed below in the section on the fixed-point unit. The bc (branch and count) pseudo-instruction decrements the CTR and, if not zero, goes to the branch target (in this case the previous instruction). Note that the code is 'scheduled', i.e. the cmpi is separated from the bler that uses the resulting condition, and the mtspr is also separated from the bc. Also note that these separations are maximum (for this code). The architecture-implied timing of the code in the loop is one iteration per cycle, i.e. the branch unit does the bc, the fixed-point unit does the stbu, and both units are completely overlapped.
However, the current implementation requires two cycles to execute a taken branch and, therefore, requires two cycles for this loop. Another aspect of this 'scheduled' code is that the test for zero length is not done immediately, i.e. the mtspr and the cal are inserted before the bler. The (correct) assumption here is that the normal path is for non-zero length and that, with this ordering, this path has minimum cycle time, i.e. the bler is overlapped and therefore does not take a cycle.

  /* This procedure returns the middle of three numbers */
  y2: procedure(x, y, z) returns (fixed bin(31)) reorder;
     declare (x, y, z) fixed bin(31), mid fixed bin(31);
     if x=z then mid = y;
     else if x>=z then mid = z;
     else mid = x;
     return (mid);
  end y2;

                             # Start of Procedure
                             # (note r3 will contain answer on return)
        l     r0,X(r3)       # load X, Y, Z into r0, r3, r4
        l     r3,Y(r4)
        l     r4,Z(r5)
                             # Precompute conditions
        cmp   cr6,r0,r3      # Compare X and Y, saving result in cr6
        cmp   cr0,r0,r4      # Compare X and Z, saving result in cr0
        cmp   cr1,r3,r4      # Compare Y and Z, saving result in cr1
        bf    cr6,lt,%8      # Branch to Label %8 if X ge Y
        btr   cr1,lt         # Conditional return (Y lt Z) (return Y)
        lr    r3,r4          # Move Z to return register
        btr   cr0,lt         # Conditional return (X lt Z) (return Z)
        lr    r3,r0          # Move X to return register
        br                   # Return (return X)
  %8:   bfr   cr1,lt         # Conditional return (return Y)
        lr    r3,r4          # Move Z to return register
        bfr   cr0,lt         # Conditional return (return Z)
        lr    r3,r0          # Move X to return register
        br                   # Return (return X)
Figure 5. Source and object listing for mid program

The code in Figure 5 illustrates the use of multiple condition fields, showing how pretesting of the conditions enables the branch code to be no more than just the branch logic itself. On entry to this routine, r3, r4 and r5 contain the addresses of x, y and z respectively. The cmp (compare) instruction identifies the target condition field and is a fixed-point instruction. The non-register form of branch (bf, the branch false pseudo-instruction) is an instruction to perform a relative branch. The lr (load register) is a
pseudo-instruction for moving the contents of one register to another. Also note the scheduling in this code. The registers r0, r3 and r4 are loaded as early as possible before their contents are used. Also, the compares which generate the condition codes cr0, cr1 and cr6 are scheduled as far away as possible from the branches which test their values.

In addition to the branch instructions, a set of nine condition register instructions is defined that allows all possible Boolean operations to be performed on any two bits within the CR and placed into a third bit in the CR. When the compiler encounters a compound Boolean expression in an if statement, it can generate a series of compares and CR logical operations followed by a single branch instead of the equivalent series of compares and branches. By implementing two functional units in the branch processor (a CR logic unit and a branch unit), the series of compares and CR logical operations can be processed more efficiently than having to process a large series of branches. The CR logical instructions also allow Boolean variables to be assigned to bits within the CR when they are being used to control logical flow through a program. This saves fixed-point registers and allows for overlapped branching on these Boolean values.

The common characteristic of all the branch processor instructions is that they have been defined in such a way that all the information and resources required to execute the instructions are available within the branch processor itself. All the information required to perform CR logical operations, to resolve conditional branches, to determine the target address of a branch, or to take an interrupt is predefined or is contained in the branch processor special registers. The logical independence of the branch processor allows it to process the incoming instruction stream, resolving in advance all interrupts, branch and CR operations.
By doing so, the branch processor can then dispatch a steady stream of instructions to the fixed- and floating-point processors. This implies that for large
sequences of meaningful code, the cycles required for handling branches are completely overlapped. This allows implementations to achieve a zero-cycle branch easily. This capability obviates the delayed branch or branch with execute instructions which traditional RISC processors have used to minimize delays associated with branching.
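The use of multiple condition fields and CR logical operations described above can be sketched with a toy Python model. This is an illustration only, not IBM's definition of the instructions: the helper names (`compare`, `cr_logical`, `both_positive`) and the choice of destination bit are hypothetical. The field layout (eight 4-bit fields, bit 0 most significant, LT/GT/EQ/SO order) follows Figure 2, and `mid` replays the branch structure of the Figure 5 object code.

```python
# Toy model of the 32-bit condition register (CR). Bit 0 is the most
# significant bit; field n occupies bits 4n..4n+3 in LT, GT, EQ, SO order.
LT, GT, EQ, SO = 0, 1, 2, 3


def cr_get_bit(cr, bit):
    """Read CR bit `bit` (0 = most significant) as 0 or 1."""
    return (cr >> (31 - bit)) & 1


def cr_set_bit(cr, bit, value):
    """Return a copy of cr with CR bit `bit` set or cleared."""
    mask = 1 << (31 - bit)
    return (cr | mask) if value else (cr & ~mask)


def compare(cr, field, a, b):
    """Model a fixed-point compare that targets CR field `field` (0..7)."""
    base = 4 * field
    cr = cr_set_bit(cr, base + LT, a < b)
    cr = cr_set_bit(cr, base + GT, a > b)
    cr = cr_set_bit(cr, base + EQ, a == b)
    return cr_set_bit(cr, base + SO, False)  # summary overflow not modelled


def cr_logical(cr, op, dst, src_a, src_b):
    """Combine two CR bits with a Boolean function and store a third bit."""
    result = op(cr_get_bit(cr, src_a), cr_get_bit(cr, src_b))
    return cr_set_bit(cr, dst, result)


def mid(x, y, z):
    """Replay of Figure 5: all compares first, then branch logic only."""
    cr = compare(0, 6, x, y)         # cmp cr6,r0,r3
    cr = compare(cr, 0, x, z)        # cmp cr0,r0,r4
    cr = compare(cr, 1, y, z)        # cmp cr1,r3,r4
    x_lt_y = cr_get_bit(cr, 4 * 6 + LT)
    x_lt_z = cr_get_bit(cr, 4 * 0 + LT)
    y_lt_z = cr_get_bit(cr, 4 * 1 + LT)
    if x_lt_y:                       # bf cr6,lt,%8 falls through
        if y_lt_z:                   # btr cr1,lt
            return y
        if x_lt_z:                   # btr cr0,lt
            return z
        return x                     # br
    if not y_lt_z:                   # bfr cr1,lt
        return y
    if not x_lt_z:                   # bfr cr0,lt
        return z
    return x                         # br


def both_positive(a, b):
    """Evaluate `a > 0 and b > 0` with two compares, one CR logical
    operation and a single final test (destination bit 30 is arbitrary)."""
    cr = compare(0, 0, a, 0)
    cr = compare(cr, 1, b, 0)
    cr = cr_logical(cr, lambda p, q: p & q, 30, 4 * 0 + GT, 4 * 1 + GT)
    return bool(cr_get_bit(cr, 30))
```

Because each compare writes its own field, the three results in `mid` stay live together, which is what lets the compares be hoisted away from the branches that consume them.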
Fixed-point processor
The fixed-point processor (FX) is designed to support the execution of all 79 of the fixed-point arithmetic and logical operations as well as all 55 of the data reference instructions. All of the arithmetic and logical instructions include a record-bit, which the compiler can set to cause this instruction to return a condition code to CR field 0. As shown in Figure 1, the FX has thirty-two 32-bit general-purpose registers (GPRs) and five special registers. The data address register (DAR) and data storage interrupt status register (DSISR) are used to contain information to help software resolve interrupts caused by data references. The transaction identifier (TID) register is used to contain the transaction ID of the currently executing process. The TID is used by special segments as described in 'Storage control' below. The multiplier and quotient (MQ) register is used by the multiply, divide and extended shift instructions and can also be used as temporary storage by the store string instructions. The FX exception register (XER) contains special flags, such as Carry and Overflow, that are set by arithmetic operations. It also contains the byte count and comparison byte used by the string instructions.

The GPRs contain 32-bit values which can be used for addresses, signed or unsigned integers, characters or logical values. The 24 arithmetic instructions all provide an Overflow Enable bit, which controls whether this instruction will affect the Overflow bits in the XER. This allows the compiler to deal with unsigned values, such as addresses, without spurious setting of the Overflow bits. The arithmetic instructions include 14 add and subtract instructions, which provide complete support for addition and subtraction of constants and register values in both normal and extended precision. The arithmetic instructions include five instructions that support functions such as maximum, minimum and absolute value without the need for a test and branch.
In addition, five multiply and divide instructions are defined. Many RISC processors provide single-cycle assist instructions, which perform 1 to 2 bit of the multiply or divide per instruction and rely on the compiler to minimize their frequency. Even with good compiler optimization, multiply and divide still represent a measurable percentage of execution time, especially in engineering and scientific codes. By defining full-function multiply and divide instructions, the implementation has the option to make these instructions execute as rapidly as the circuit budget will allow. The FX supports 16 logical instructions that provide the capability to perform all bit-wise Boolean operations between two registers and place the results in a third register. A powerful set of 26 rotate, shift and mask instructions is provided for dealing with bit strings within a register or spanning multiple registers, as well as for performing simple multiplies and divides by powers of two. The algebraic right shift instructions set the Carry bit so that an add with zero extended instruction can be used
to implement divides of negative numbers by a power of two in conformance with the definition of division required by most high-level languages.

The FX architecture defines 13 instructions that deal with transferring information between the FX and branch processors. Included in these instructions are the four fixed-point compare instructions that compare two values and return a condition code to one of the eight fields in the CR register of the branch processor. Two trap instructions are provided that compare two values and force the branch processor to take a precise program interrupt. These instructions are used for bounds checking and for debuggers. The remaining seven instructions involve the transfer of various registers or bits of registers to and from the FX and branch processors for computation and testing or for process state save and restore. By making these transfer instructions explicit to the software, the compiler has the opportunity to schedule instructions around these inherently synchronizing operations.

The FX processor also handles all 55 data reference instructions. The architecture supports byte (or character), unsigned half-word, signed half-word and full-word data types in the FX GPRs as well as IEEE single-precision and double-precision data types in the floating-point registers. The addressing modes supported are absolute, indirect, base plus displacement and base plus index. Automatic increment and decrement of the base register is supported by the update form of these data reference instructions. This update capability is exploited by the compiler in strength reduction 7; the equivalent operation requires two instructions in most RISC architectures. Four special data reference instructions are provided for loading and storing data structures in little-endian format (for example, data created by an Intel 80386 processor). The FX architecture has extensive support for misaligned operands and for dealing with character strings.
All data references to misaligned operands (for example, a full word that is not on a full-word boundary) will be handled in hardware provided that the operand is contained within a cache line. At the option of the implementation, a misaligned operand reference that spans two cache lines can generate an alignment interrupt, forcing the reference to be completed in software. Special hooks are provided to make this software emulation as efficient as possible. For compatibility, the architecture also provides for a mode which handles misaligned addressing in the same fashion as the RT system. In support of character string data, a set of five string instructions is provided for efficient copy and compare of character strings whose alignment is unknown. Support is included for both null-terminated strings (as in C) and length-specified strings. These string instructions are defined in such a way as to allow a string copy to execute at a rate approaching the maximum bandwidth between the processor and the cache. Special care was taken to ensure that this performance is achieved even for very short strings, since the length of most strings is less than 8 byte.
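The divide-by-power-of-two technique mentioned earlier in this section (an algebraic right shift that sets Carry, followed by an add with zero extended) can be sketched in Python as follows. This is a minimal sketch, not the architected instruction definitions: the function names are hypothetical, and the Carry rule used here (set when the operand is negative and at least one 1 bit is shifted out) is an assumption about how the shift reports lost bits.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(u):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    return u - (1 << 32) if u & 0x80000000 else u


def shift_right_algebraic(value, n):
    """Arithmetic right shift of a 32-bit value by n bits (0 <= n <= 31).

    Returns (result, carry). Carry is assumed to be 1 only when the operand
    is negative and a non-zero bit was shifted out.
    """
    u = value & MASK32
    negative = bool(u & 0x80000000)
    shifted_out = u & ((1 << n) - 1)
    result = to_signed32(u) >> n          # Python's >> floors, like the shift
    carry = 1 if (negative and shifted_out) else 0
    return result, carry


def add_zero_extended(value, carry):
    """Add the Carry bit to a register value, wrapping at 32 bits."""
    return to_signed32((value + carry) & MASK32)


def divide_pow2(value, n):
    """Signed division by 2**n that truncates toward zero."""
    result, carry = shift_right_algebraic(value, n)
    return add_zero_extended(result, carry)
```

A plain arithmetic shift rounds toward minus infinity (so -7 shifted by 1 gives -4); adding the Carry corrects the result to -3, the truncating quotient that most high-level languages require.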
Floating-point processor
The floating-point processor (FP) architecture supports the execution of all 21 of the floating-point instructions. Each of these instructions includes a record-bit, which the compiler can set to cause this instruction to return a
condition code to CR field 1. As shown in Figure 1, the FP has thirty-two 64-bit floating-point registers (FPRs) and a floating-point status and control register (FPSCR). The FPSCR contains all the appropriate status information required by the IEEE 754-1985 standard 8 as well as control bits for the rounding and exception modes.

The FP processor supports floating-point operations with 13 arithmetic instructions. These include the basic add, subtract, multiply, divide, round to single and register move instructions. Unique to this architecture are the four multiply-add instructions that multiply two operands, add this product to a third operand and store the answer into a fourth operand, all with a single rounding error. These instructions not only allow for fast execution in numerically-intensive applications by performing two floating-point operations in one instruction, but also provide additional precision which can often reduce the number of instructions required to achieve a given level of accuracy in many maths routines (for example, sine, cosine and square root). The FP processor supports six instructions that control the FPSCR and provide a means for saving and restoring this register. Two floating-point compare instructions are included that compare two FPRs and return a condition code to the specified field in the CR of the branch processor.

Figure 6 illustrates the capability of the POWER architecture when all the architectural features are put together. The listing shows the first loop of an adaptive finite impulse response (FIR) digital filter program. The first two instructions place the loop count into the count register and check whether (N - 1) is less than zero. The loop contains four lfdu (load floating-point double with update) instructions which fetch the four new operands needed for each loop iteration and update the address for the next iteration.
The loop also contains three fma (floating multiply add) and one fnms (floating negative multiply subtract) instructions which perform the eight floating-point operations required in each loop iteration.
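The benefit of performing a multiply-add with a single rounding, as described above, can be illustrated with a small Python sketch. It does not use a hardware multiply-add: the single rounding is emulated by computing the exact result with `fractions.Fraction` and converting to a double once. The function names and the operand values are chosen for illustration only.

```python
from fractions import Fraction


def multiply_add_two_roundings(a, b, c):
    """Separate multiply and add: the product is rounded, then the sum."""
    return a * b + c


def multiply_add_one_rounding(a, b, c):
    """Emulated multiply-add: exact product and sum, rounded once at the end."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Operands chosen so that the exact product 1 - 2**-60 is not representable
# in double precision and rounds to 1.0 when the multiply is done separately.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

separate = multiply_add_two_roundings(a, b, c)
single = multiply_add_one_rounding(a, b, c)
```

With these operands the separate multiply and add lose the result entirely, while the singly rounded version keeps the small residual, which is the kind of extra precision that helps maths routines converge in fewer steps.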
  * Adaptive FIR Filter
        SUBROUTINE AFIR(HR,HI,XR,XI,YR,YI,BETAR,BETAI,N)
        REAL*8 HR(1024),BETAR,XR(1024),YR
        REAL*8 HI(1024),BETAI,XI(1024),YI
        YR = 0.0
        YI = 0.0
        DO 100 I=0,N-1
          YR = YR + HR(I)*XR((N-1)-I)
          YI = YI + HI(I)*XR((N-1)-I)
          YR = YR - HI(I)*XI((N-1)-I)
          YI = YI + HR(I)*XI((N-1)-I)
    100 CONTINUE

          mtspr   CTR,r11
          cmp     cr1,r0,r28
  CL.0:   lfdu    fp6,r31=hr(r31,8)
          lfdu    fp5,r29=xr(r29,-8)
          fma     fp4=fp4,fp6,fp5
          lfdu    fp3,r12=hi(r12,8)
          fma     fp2=fp2,fp5,fp3
          lfdu    fp1,r30=xi(r30,-8)
          fnms    fp4=fp4,fp3,fp1
          fma     fp2=fp2,fp6,fp1
          bctf    CL.0,cr1,0x2/gt
          stfd    yr(r7,0)=fp4
          stfd    yi(r8,0)=fp2
Figure 6. Source and object listing of the first loop of an adaptive FIR filter
The loop is terminated by a bctf (branch and count false) which examines the result of the comparison to see if (N - 1) was less than zero. If (N - 1) was greater than zero, this instruction loops until the count register is decremented to zero. After the loop, the two stfd (store floating double) instructions place the final values of YR and YI into memory. Notice that the inner loop consists of nine compound-function instructions, i.e. instructions which are performing more than one function. This loop contains four instructions which will be executed in the fixed-point unit (the lfdus), four instructions which will be executed in the floating-point unit (the fmas and the fnms), and the bctf which will be executed in the branch unit. With ideal overlap, this loop should execute in four cycles; in the current RS/6000 implementation it does. In four cycles, this loop is executing 13 operations (four fixed-point, eight floating-point and one branch operation).
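The structure of the Figure 6 loop and the operation count quoted above can be mirrored in a short Python sketch. It is a plain-Python rendering under stated assumptions: lists are indexed from 0 exactly as the DO loop indexes its arrays, the count register is assumed to start at N, and each Python multiply and add rounds separately, unlike the hardware fma. The name `afir_first_loop` and the dictionary of per-iteration counts are illustrative.

```python
# Operations per loop iteration as counted in the text: four lfdu
# (fixed-point unit), three fma plus one fnms at two floating-point
# operations each, and one bctf (branch unit).
OPS_PER_ITERATION = {"fixed_point": 4, "floating_point": 8, "branch": 1}
CYCLES_PER_ITERATION = 4  # ideal overlap, as stated for the RS/6000


def afir_first_loop(hr, hi, xr, xi, n):
    """Accumulate YR and YI over n taps, one statement per loop instruction."""
    yr = 0.0
    yi = 0.0
    ctr = n                      # mtspr CTR,... (assumed to hold N)
    i = 0
    while ctr > 0:
        h_r = hr[i]              # lfdu fp6
        x_r = xr[(n - 1) - i]    # lfdu fp5 (address steps backwards)
        yr = yr + h_r * x_r      # fma  fp4 = fp4 + fp6*fp5
        h_i = hi[i]              # lfdu fp3
        yi = yi + h_i * x_r      # fma  fp2 = fp2 + fp5*fp3
        x_i = xi[(n - 1) - i]    # lfdu fp1
        yr = yr - h_i * x_i      # fnms fp4 = fp4 - fp3*fp1
        yi = yi + h_r * x_i      # fma  fp2 = fp2 + fp6*fp1
        i += 1
        ctr -= 1                 # bctf: decrement CTR, loop while non-zero
    return yr, yi


ops_total = sum(OPS_PER_ITERATION.values())
ops_per_cycle = ops_total / CYCLES_PER_ITERATION
```

The four accumulations together form a complex multiply-accumulate, which is why eight floating-point operations fit in four compound instructions; at 13 operations in four cycles the loop sustains 3.25 operations per cycle.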
STORAGE CONTROL
The virtual memory architecture is an upward extension of that found in the RT system. It provides for a 4 petabyte (2^52) virtual address space and a 4 Gbyte (2^32) real address space made up of 4 kbyte pages. Support for memory-mapped I/O has been made more general, and the hardware locking mechanism for database storage has been enhanced to provide hardware-assisted lock granting for the most frequent cases. Finally, the architecture has been enhanced to provide for the management of software visible instruction and data caches.
Virtual address architecture
Figure 7 illustrates the process of virtual address generation and translation. Instruction and data references generated by programs are all 32-bit effective addresses. The most significant 4 bits of each effective address are used to select one of 16 segment registers. Each of these segment registers can be assigned to memory or I/O space via a bit in each segment register. If the most-significant bit of the selected segment register is a 1, this request is sent to I/O space along with the contents of the segment register. If this bit is a 0, the least-significant 24 bits of the segment register (the segment ID) are concatenated with the remaining 28 bits of the effective address to create a 52-bit virtual address. To the executing program, memory appears to be 4 Gbyte of virtual memory broken into sixteen 256-Mbyte segments. Over 16 million (2^24) unique segment IDs are available, which should easily support all the open files and active objects of a large number of concurrently active processes.

To efficiently map these large virtual addresses to their respective real page frames, an inverted page table structure is used. The page frame table (PFT) contains one entry per real page frame. The PFT entry format is shown in Figure 8. These entries contain the virtual address to which this page is currently assigned, the pointer to the next page in the search chain, the referenced and changed bits, and the protection and lock bits for this page. A virtual-to-real translation is performed by hashing into a hash anchor table (HAT). Each HAT entry contains an index into the PFT where a search for a matching virtual address is begun. If an invalid pointer is found before
finding a matching virtual address, a storage interrupt is taken. Of course, the most recent translations are maintained in translation look-aside buffers (TLBs) in the processor to avoid repeating these searches. A page in a normal memory segment has page-level protection identical to that in the RT system. This provides read and write protection at the user and supervisor level using the key bit in the segment register. The architecture provides six instructions for manipulating the segment registers and the TLBs.

Figure 7. Virtual address generation and translation

Microprocessors and Microsystems
Word 0: virtual page number (27 high-order bits)
Word 1: pointer to next PFT entry (20 bits)
Word 2: 32 TID lock bits (b)
Word 3: transaction ID (16 bits)

Figure 8. Page frame table entry (v: valid virtual page number; f: page referenced bit; c: page changed bit; pp: page protect bits (2); i: invalid pointer bit; b: lock bit (32 per page); l: lock type; w: grant write locks; r: grant read locks; a: allow read)
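The address generation and hashed page frame table search described in this section can be sketched as follows. This is a minimal model under stated assumptions, not the hardware algorithm: the segment registers are assumed to be 32 bits wide, the hash function is a hypothetical placeholder (the paper does not specify one), `None` stands in for the invalid pointer bit, and the class and function names are illustrative.

```python
PAGE_SHIFT = 12                     # 4 kbyte pages
OFFSET_IN_SEGMENT_BITS = 28         # low 28 bits of the effective address
SEGMENT_ID_MASK = (1 << 24) - 1     # least-significant 24 bits of a segment register
IO_BIT = 1 << 31                    # assumption: segment registers are 32 bits wide


def virtual_address(effective_address, segment_registers):
    """Form the 52-bit virtual address, or return None for an I/O segment."""
    seg = segment_registers[(effective_address >> 28) & 0xF]
    if seg & IO_BIT:
        return None                 # request is sent to I/O space instead
    segment_id = seg & SEGMENT_ID_MASK
    offset = effective_address & ((1 << OFFSET_IN_SEGMENT_BITS) - 1)
    return (segment_id << OFFSET_IN_SEGMENT_BITS) | offset


class PageFrameTable:
    """Inverted page table: one entry per real page frame, chained from a HAT."""

    def __init__(self, frames, hat_size):
        self.vpage = [None] * frames    # virtual page number held by each frame
        self.next = [None] * frames     # next frame in the search chain
        self.hat = [None] * hat_size    # hash anchor table: first frame of chain

    def _hash(self, vpn):
        # Hypothetical hash; the real function is not given in the paper.
        return vpn % len(self.hat)

    def map(self, vpn, frame):
        """Assign a virtual page to a real frame and link it into its chain."""
        slot = self._hash(vpn)
        self.vpage[frame] = vpn
        self.next[frame] = self.hat[slot]
        self.hat[slot] = frame

    def translate(self, vaddr):
        """Walk the chain; a missing match models the storage interrupt."""
        vpn = vaddr >> PAGE_SHIFT
        frame = self.hat[self._hash(vpn)]
        while frame is not None:
            if self.vpage[frame] == vpn:
                return (frame << PAGE_SHIFT) | (vaddr & ((1 << PAGE_SHIFT) - 1))
            frame = self.next[frame]
        raise LookupError("storage interrupt: no matching virtual address")
```

Because the table has one entry per real page frame, its size grows with real memory rather than with the 52-bit virtual address space, which is the reason the text gives for using an inverted structure.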
Vol 14 No 6 July/August 1990
Special segments
When the special bit is a 1 in a segment register, the data-locking mechanism for the pages in that segment is enabled. The architecture for special segments is an extension of that found in the RT system. The number of lock bits has been increased from 16 to 32 and the size of the transaction ID has been increased from 12 to 16 bits. In addition, granting locks for the most frequent situations can now be performed in hardware, resulting in improved performance in database and transaction processing applications.

All pages within a special segment are subdivided into thirty-two 128-byte lines, each having a corresponding lock bit in the PFT entry for this page (see Figure 8). Also contained in the PFT entry is the transaction ID assigned to these lock bits. Associated with each transaction ID are control bits. The lock type bit determines whether the lock bits represent read or write locks. The other three control bits enable automatic granting of locks in hardware or can permit read accesses by any transaction. Table 1 describes when access to a line is permitted and under which circumstances the hardware will grant locks. The symbols in Table 1 are the same as those used in Figure 8. How these features can be exploited by software is described in References 9 and 10. The addition of hardware lock granting to the special segment architecture means that lock interrupts are only
Table 1. Data locking mechanism (note 1: lock bit 'b' is set to '1'; note 2: lock bit 'b' and lock type 'l' are set to '1'; all other lock bits are set to '0')
b l    a r/w    TID match    Access permitted    Notes
Read accesses - 00 01 10 I I
000 00 00 00
no yes yes yes yes
no no yes yes yes
00 I 0 - I
01 01 01
yes yes yes
yes yes yes
- -
I -
-
yes
note 1
Write accesses - 00 01 I 0 11
0 0 0 0
no yes yes yes yes
no no yes yes yes
00 01 11
1 1 1
yes yes yes
yes yes yes
note 2 note 1
generated for pages within a special segment that are actively being shared by two different transactions. Since this case is statistically rare, most locks can be automatically granted by the hardware, resulting in improved performance for database and transaction processing applications.
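The mapping from an address to its lock bit follows directly from the line size given above. The sketch below is illustrative only: the function names are hypothetical, and the ordering of lock bits within the 32-bit field (line 0 in the least-significant bit) is an assumption, since the text does not state it. It does not model the grant rules of Table 1.

```python
LINE_SIZE = 128                          # bytes per lockable line
LINES_PER_PAGE = 32                      # one lock bit per line
PAGE_SIZE = LINE_SIZE * LINES_PER_PAGE   # 4096 bytes, i.e. one 4 kbyte page


def lock_bit_index(address):
    """Return which of the 32 lines in its page the address falls in."""
    return (address % PAGE_SIZE) // LINE_SIZE


def line_is_locked(lock_bits, address):
    """Test the lock bit for the line containing the address.

    Assumption: lock_bits is a 32-bit integer with line 0 in bit 0.
    """
    return bool((lock_bits >> lock_bit_index(address)) & 1)
```

Locking at 128-byte granularity rather than per page lets two transactions work on different lines of the same page without a lock interrupt.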
Cache architecture
Implementations of this architecture are allowed to have their instruction and data caches explicitly visible to the software. This was done to simplify the implementation of the caches and to increase the parallelism that can be achieved between the branch and fixed-point processor as well as between I/O devices and these processors. Software visible caches have two implications. The first is that any program that uses data references to create instructions (for example, loaders, debuggers and simulators) must explicitly force these instructions from the data cache into the instruction cache. The second is that device driver code that wishes to perform input or output operations on the processor memory must properly flush the necessary pages from the caches before the I/O operation can begin. The processor architecture specifies seven cache instructions to enable software to perform these functions.

Software often knows the nature of upcoming data references, so it can often use the cache instructions to improve performance and to reduce bandwidth requirements between the cache and main memory. For example, if the software knows that it is about to overwrite a cache line of memory, it can use the data cache line zero instruction to establish the line in the cache without causing the line to be fetched from main memory. As the cache line is also set to all zeros by this instruction, it can be used to implement a very fast zero-page function. Likewise, if the software knows that it will no longer need the contents of a cache line, the cache line invalidate instruction can be used to eliminate the line from the cache without creating a store-back of the line to main memory.

To allow simultaneous look-up in the cache directories and TLBs, the hardware is allowed to use the low-order 20 bits of the virtual address for this look-up. The directory address comparisons are still made on the complete real address but, since eight of the look-up bits participate in the virtual-to-real address translation, software is required to maintain the cache such that aliasing caused by referencing an object by both its virtual and real address is avoided.
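The figure of eight translated look-up bits follows from the page size: of the 20 low-order bits used for the look-up, the lowest 12 are the byte offset within a 4 kbyte page and are unchanged by translation. The sketch below shows that arithmetic and a simple check for the aliasing condition. The function names are hypothetical, and the check is an illustration of the constraint, not a description of how any operating system enforces it.

```python
PAGE_OFFSET_BITS = 12    # 4 kbyte pages: these bits are not translated
LOOKUP_BITS = 20         # low-order virtual address bits used for cache look-up
TRANSLATED_LOOKUP_BITS = LOOKUP_BITS - PAGE_OFFSET_BITS   # the "eight" in the text


def lookup_index(address):
    """Return the low-order bits the hardware may use to index the cache."""
    return address & ((1 << LOOKUP_BITS) - 1)


def may_alias(virtual_addr, real_addr):
    """True when the same object would be looked up at two different places
    depending on whether it is referenced by its virtual or its real address."""
    return lookup_index(virtual_addr) != lookup_index(real_addr)
```

An object whose virtual and real addresses agree in bits 12 to 19 is found at the same place either way, so no alias arises for it.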
Read only memory
The storage architecture explicitly reserves a portion of the real address space for ROM. As the real address map in Figure 9 shows, the upper 1 Mbyte of the real address space has been set aside for ROM. This definition allows ROM to be cached without causing aliasing problems between ROM and read-write memory.
Figure 9. Real address map (real addresses in hexadecimal: read-write memory from 00000000; read only memory from FFF00000 to FFFFFFFF)

Synchronization
Special pains were taken in the definition of the architecture to avoid unnecessary synchronization in the hardware for storage control. Two synchronization instructions are provided so that software can insert synchronization when required. For example, without explicit software synchronization, the branch processor would have to refetch all prefetched instructions after every operation that modified a segment register just in case a new virtual address was created because the segment ID for the current code segment was changed. Likewise, in the absence of software synchronization, the branch processor would have to refetch after every cache flush instruction. Explicit software synchronization implies complexity in the operating system code, but allows many common code sequences to run significantly faster than if hardware synchronization were required.
INTERRUPTS
The architecture supports nine different interrupt types. For all interrupts except the SVC, which was described above, the branch processor performs the following sequence:

1. The address of the next instruction to be executed is placed into SRR0.
2. The current value of the MSR and some interrupt-specific information is placed into SRR1.
3. Most of the bits in the MSR are cleared.
4. Instruction fetching begins at the vector address defined for the interrupt type.

The interrupt vector area is either '00000100'x to '00001fe0'x or 'fff00100'x to 'fff01fe0'x, depending on the value of the interrupt prefix (IP) bit in the MSR. Since the IP bit is set to 1 at power-on, all the interrupt vectors are initially mapped into the ROM address space. Software later changes the IP bit to 0, remapping the interrupt vectors into read-write memory.

System reset and machine check interrupts are nonmaskable interrupts provided from external sources to force reinitialization or to report suspected hardware errors, respectively. Instruction storage, data storage and alignment interrupts are precise interrupts generated by instruction and data references such as page faults. The program interrupt is provided for a variety of suspected
programming errors such as invalid or privileged operations. A floating-point available interrupt is provided so that the FPRs only need to be saved and restored for processes that really use the floating-point processor, thus saving context switch time.

The external interrupt is a maskable interrupt generated by external sources. External interrupt sources set one of the 64 bits in the external interrupt summary (EIS) register. Which bit each interrupt source sets is individually programmable, allowing software to determine priorities and to arbitrarily combine interrupts onto the same level if desired. The bits in the EIS are individually masked by the 64 bits in the external interrupt mask (EIM) register, which allows software to create its own interrupt priorities and interrupt levels. If any non-masked EIS bit is set and external interrupts are enabled, the processor takes an external interrupt. The EIS and EIM are located in memory-mapped I/O space as indicated in Figure 1.

When any interrupt is taken, the first level interrupt handler has the responsibility to save and restore all state required by the interrupt. This allows for very fast interrupt processing. For example, many of the fast supervisor calls do not have to save and restore any state, but merely perform the requested function and return.
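The vector remapping and the external interrupt condition described in this section can be sketched as below. The names are hypothetical, and one detail is an assumption: the text does not say whether a 1 in an EIM bit enables or blocks the corresponding EIS bit, so the sketch assumes that a 1 lets the interrupt through.

```python
MASK64 = (1 << 64) - 1

ROM_VECTOR_BASE = 0xFFF00000   # IP bit = 1 (power-on state)
RAM_VECTOR_BASE = 0x00000000   # IP bit = 0 (after software remaps)


def vector_address(ip_bit, offset):
    """Return the fetch address for an interrupt vector.

    offset is the position within the vector area, '100'x to '1fe0'x.
    """
    if not 0x100 <= offset <= 0x1FE0:
        raise ValueError("offset outside the interrupt vector area")
    base = ROM_VECTOR_BASE if ip_bit else RAM_VECTOR_BASE
    return base + offset


def external_interrupt_pending(eis, eim, external_enabled):
    """True when any non-masked EIS bit is set and external interrupts are enabled.

    Assumption: a 1 bit in eim enables the matching eis bit.
    """
    return bool(external_enabled) and ((eis & eim & MASK64) != 0)
```

Because each source can be programmed to set any EIS bit, software can build its own priority scheme simply by choosing which bits are enabled in the mask at each level.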
TIMER FACILITIES
The architecture provides facilities required for efficient monitoring of applications, accurate time-stamping of transactions, and the scheduling of time-dependent operations. These functions are provided by two facilities: a 64-bit real time clock and a 32-bit decrementer.

The real time clock is split into upper and lower 32-bit parts. The upper part is incremented every second; the lower part is logically incremented every nanosecond for 1 s until it resets to 0 and begins counting again. Only those bits required to achieve a resolution equivalent to the execution of 10 instructions need be implemented in the lower part of the real time clock.

The decrementer is a 32-bit register that is logically decremented every nanosecond with the same resolution requirements as the lower part of the real time clock. Whenever the most significant bit of the decrementer changes from 0 to 1, a bit is set in the EIS. If this bit in the EIS is not masked and external interrupts are not masked, then an external interrupt is generated.

By providing high resolution timer facilities that can be read in user state, applications can be monitored effectively. The values in these registers can also be used as time stamps for transactions in a distributed system. The decrementer provides the facility for the scheduling of time-dependent operations via the external interrupt mechanism.
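The behaviour of the two timer facilities can be modelled in a few lines. This is a sketch under stated assumptions rather than a register-accurate model: the function names are hypothetical, time advances in whole nanoseconds, and each step of the decrementer is assumed to be shorter than 2^31 ns so that a single 0 to 1 transition of the top bit can be detected by comparing the old and new values.

```python
NS_PER_SECOND = 1_000_000_000
MASK32 = 0xFFFFFFFF


def rtc_advance(upper, lower, elapsed_ns):
    """Advance the real time clock: lower counts nanoseconds and resets each second."""
    total = lower + elapsed_ns
    return (upper + total // NS_PER_SECOND) & MASK32, total % NS_PER_SECOND


def rtc_timestamp_ns(upper, lower):
    """Combine both halves into a single nanosecond time stamp."""
    return upper * NS_PER_SECOND + lower


def decrementer_advance(value, elapsed_ns):
    """Decrement the 32-bit register and report a 0 -> 1 change of its top bit.

    Assumption: elapsed_ns is less than 2**31.
    """
    new_value = (value - elapsed_ns) & MASK32
    sets_eis_bit = (value >> 31) == 0 and (new_value >> 31) == 1
    return new_value, sets_eis_bit
```

To schedule an event, software would load the decrementer with the delay in nanoseconds; the top bit flips when the count passes through zero, which is what raises the bit in the EIS.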
PERFORMANCE
The performance potential of this architecture is demonstrated by the first implementations available in the RS/6000 family. Table 2 compares the performance of the RS/6000 model 540 with a variety of commercially available processors on the Linpack benchmark measurement of floating-point performance 11, 12. The performance of the model 540 compares favourably with machines classified as minisupercomputers today.
Table 2. Floating point performance comparison 11, 12

Computer                     Linpack (MFLOPS), n = 100
Cray X-MP (1 proc.)          66
Cray-1S                      27
Convex C-210                 17
RS/6000 model 540            13
Alliant FX/80 (8 proc.)      10
Stellar GS 1000              9.8
Ardent Titan-4 (4 proc.)     9.4
Apollo DN1000                5.8
MIPS M/2000                  3.9
The System Performance Evaluation Cooperative (SPEC) is a consortium of computer vendors who have joined to develop a set of benchmarks and procedures for comparing the performance of advanced computer systems. SPEC Release 1.0 is a collection of 10 CPU-intensive benchmarks which were released in the fall of 1989. On this suite of benchmarks, the model 540 achieves a SPECmark of 34.7. The highest published SPECmark prior to the announcement of the RS/6000 was 17.8. The RS/6000 family also has the distinction of being the first family of processors in which all of its members have SPECmarks greater than the clock frequency (in MHz) of their processors. This can be attributed to the superscalar implementation of the architecture.
CONCLUSIONS
The POWER architecture was designed to support superscalar implementations containing multiple functional units, allowing the compiler to take advantage of fine grain parallelism. In addition, compound-function instructions were defined that reduce application path lengths. In some instances, these path lengths are actually shorter than those for many CISCs. The result is an architecture that can be implemented to provide levels of fixed-point and floating-point performance which rival those of vector computers on vector codes, and yet maintain that performance level on scalar codes. In addition, the virtual memory architecture has been extended to support an enhanced cache and virtual storage system, thus providing for additional performance at the operating system and application level.

As technology densities allow, additional functional units within each processor can be implemented to provide for additional overlap and performance. This architecture can also be implemented in lower cost and lower performance systems that save cost by not exploiting all of the possible parallelism. While significant advances have been made in the optimizing compiler support for the architecture, additional performance improvements will be possible upon this architectural base as compilers support more advanced optimizations.
ACKNOWLEDGEMENTS
As with the 801, John Cocke was the inspiration for most of the key concepts behind this architecture. Andrew Heller provided the vision. Marc Auslander, Albert Chang, Martin Hopkins, Greg Grohoski, Bill Hay, Peter and Vicky Markstein, Robert Montoye, Jack O'Quin and John O'Quin all made significant contributions to the POWER architecture.
REFERENCES
1 Radin, G 'The 801 Minicomputer' SIGARCH Computer Architecture News Vol 10 No 2 (March 1982) pp 39-47
2 Henry, G G 'IBM RT PC architecture and design decisions' IBM RT Personal Computer Technology (1986) pp 2-5
3 Hester, P D, Simpson, R O and Chang, A 'The IBM RT PC ROMP and memory management unit architecture' IBM RT Personal Computer Technology (1986) pp 48-56
4 Hester, P D and Simpson, R O 'The IBM RT PC ROMP processor and memory management unit architecture' IBM Systems J. Vol 26 No 4 (1987) pp 346-360
5 Grohoski, G, Kahle, J, Thatcher, L and Moore, C 'Branch and fixed-point instruction execution units' IBM RISC System/6000 Techn., SA23-2619, IBM Corp. (1990) pp 24-33
6 Olsson, B, Montoye, R, Markstein, P and Nguyenphu, M 'RISC System/6000 floating-point unit' IBM RISC System/6000 Techn., SA23-2619, IBM Corp. (1990) pp 34-43
7 O'Brien, K, Hay, B, Minish, J, Schaffer, H, Schloss, R, Shepard, A and Zaleski, M 'Advanced compiler technology for the RISC System/6000 architecture' IBM RISC System/6000 Techn., SA23-2619, IBM Corp. (1990) pp 154-161
8 IEEE Standard 754 for Binary Floating-Point Arithmetic IEEE, New York, NY, USA (1985)
9 Chang, A and Mergen, M F '801 storage: architecture and programming' ACM Trans. Computer Systems Vol 6 No 1 (February 1988) pp 28-50
10 Chang, A, Mergen, M, Porter, S, Rader, R and Roberts, J 'Evolution of storage facilities in the AIX System' IBM RISC System/6000 Techn., SA23-2619, IBM Corp. (1990) pp 138-143
11 Dongarra, J J 'Performance of various computers using standard linear equations software' CS-89-85, Computer Science Dept., University of Tennessee, TN, USA (January 1990)
12 'Performance brief: CPU benchmarks - issue 3.9' MIPS Computer Systems Inc. (January 1990)

Randy D Groves is a senior engineer in the Advanced Workstation Division of IBM in Austin, TX, USA and has recently completed the management of technology program at the MIT Sloan School of Management, MA, USA. He joined IBM in 1979 at Manassas, VA and transferred to Austin in 1982, joining the design team for the RT System. He managed the logic design of the instruction cache and fixed-point units for the RISC System/6000 processor. Most recently he has been responsible for the architecture of all the RISC-based processors in the Advanced Workstation Division. Groves received BS degrees in electrical engineering and business administration from Kansas State University, KA, USA in 1978 and 1979, respectively.
Richard R Oehler is manager of system structures in the advanced RISC systems department of the IBM Research Division in Yorktown Heights, NY, USA. He spent seven years at the National Security Agency working on operating system design, followed by one year with RCA working on computer architecture before joining IBM's Research Division in 1970. Since then, he has worked on computer architecture and operating system design. He was one of the original members of the 801 minicomputer project and managed the 801 architecture and simulator, tools and operating system efforts. He has had several assignments, including manager of I/O architecture in IBM Poughkeepsie, NY, USA from 1972 to 1974, and manager of architecture for IBM's System Products Division from 1981 to 1985. He was the lead architect for the RISC System/6000 from 1986 to 1987. Oehler has a BA in mathematics from St John's University, USA.