Microprocessing and Microprogramming 37 (1993) 105-108 North-Holland
INTEGER AND CONTROL UNITS FOR A GaAs 32-BIT RISC PROCESSOR

P.P. Carballo, R. Sarmiento and A. Núñez

Centro de Microelectrónica Aplicada, Universidad de Las Palmas de Gran Canaria, Campus Universitario de Tafira, E-35017 Las Palmas de Gran Canaria, Spain

In this paper we present the design and simulation of the Integer and Control Units of AsGaR, a RISC processor featuring a proprietary streamlined architecture. Running at a clock speed of 440 MHz, it delivers a peak throughput of 110 MIPS. The work reported here shows the level of complexity faced by designs of this kind and the need for a holistic approach that considers all aspects of system implementation. AsGaR has been implemented in a TriQuint GaAs process using a standard cell library. The current shortage of tools for GaAs design has been overcome by adapting the Cadence/Edge environment to this technology. Simulation was performed by describing the cell library in System Hilo and Verilog.
1. Introduction.
The feasibility of manufacturing monolithic VLSI processors in GaAs technology was thoroughly investigated during the last decade, and design methodology and implementation criteria have been established [1]. These focus on low-transistor-count systems and RISC architectures implemented with full custom techniques [2,3]. Full custom implementation is very expensive, and most demonstration projects have been funded by government strategic programmes (such as DARPA). It seems that it will take a long time before full custom monolithic VLSI design of complex systems becomes available to a wide community at reasonable prices. Hence, this research has looked into combining a standard cell technique, multichip module packaging, and a specially trimmed RISC architecture with a small instruction set, aimed at implementing hardware accelerators, graphics hardware and signal processors.

2. Description of the architecture. AsGaR is a 32-bit RISC processor featuring a Harvard architecture with a 4-stage pipeline, implemented in GaAs technology. It has a 9 ns clock cycle with 4 subcycles derived from a 440 MHz master clock, and it delivers 110 MIPS by completing one instruction per clock cycle. AsGaR instructions belong to four classes: 1) Arithmetic (ADD, SUB, AND, OR, XOR, INS, EXT); 2) Memory access (LOAD, STORE); 3) Jumps (short conditional and unconditional; long); and 4) Special (RD_PSW, WR_PSW, RD_PC, TRAP (superuser mode) and IRET). Instruction formats are depicted in Figure 1. The short format encodes three-address operations (two source registers, RF1 and RF2, and a destination register, RD). The long format encodes operations involving an immediate operand. The fields TIPO and CDO encode the instruction type and function and are convenient for hardwired instruction interpretation. JUMP and CALL instructions have a separate format in which the first two bits select the type and the remaining bits contain the jump address.
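Two quantitative details of the paragraph above can be checked with a small Python sketch: the subcycle, cycle and peak-throughput figures implied by the 440 MHz master clock alone, and the split of a JUMP/CALL word into its type and address fields. Because the bit layout of Figure 1 is not reproduced here, the sketch assumes that "the first two bits" are the two most significant bits of a 32-bit word; `encode_jump` and `decode_jump` are hypothetical helpers for illustration, not part of the original design description.

```python
"""Hedged sketch of two details from Section 2 (not the authors' code)."""

MASTER_CLOCK_HZ = 440e6    # master clock stated in the text
SUBCYCLES_PER_CYCLE = 4    # each pipeline cycle is split into 4 subcycles

subcycle_ns = 1e9 / MASTER_CLOCK_HZ                         # one master-clock period
cycle_ns = SUBCYCLES_PER_CYCLE * subcycle_ns                # one pipeline cycle
peak_mips = MASTER_CLOCK_HZ / SUBCYCLES_PER_CYCLE / 1e6     # one instruction per cycle

# JUMP/CALL format: 2 type bits followed by a 30-bit jump address.
# ASSUMPTION: "the first two bits" are the two most significant bits.
TYPE_BITS = 2
ADDR_BITS = 32 - TYPE_BITS
ADDR_MASK = (1 << ADDR_BITS) - 1


def encode_jump(jtype: int, address: int) -> int:
    """Pack a hypothetical JUMP/CALL word from its type and address fields."""
    if not 0 <= jtype < (1 << TYPE_BITS):
        raise ValueError("type field must fit in 2 bits")
    if not 0 <= address <= ADDR_MASK:
        raise ValueError("address must fit in 30 bits")
    return (jtype << ADDR_BITS) | address


def decode_jump(word: int) -> tuple[int, int]:
    """Split a 32-bit JUMP/CALL word into (type, address)."""
    return (word >> ADDR_BITS) & ((1 << TYPE_BITS) - 1), word & ADDR_MASK


if __name__ == "__main__":
    print(f"subcycle {subcycle_ns:.2f} ns, cycle {cycle_ns:.2f} ns, peak {peak_mips:.0f} MIPS")
    word = encode_jump(0b01, 0x1234)
    print(hex(word), decode_jump(word))
```

Run as a script, it prints a 2.27 ns subcycle, a 9.09 ns cycle and a peak of 110 MIPS (the text rounds the cycle to 9 ns), followed by `0x40001234 (1, 4660)` for the example word.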
Figure 1. Instruction formats: a) short format, b) long format.

As shown in Figure 2, the AsGaR instruction cycle is divided into four machine cycles: FETCH, ALU, MEM and WR. To avoid pipeline stalls due to data dependencies, two bypass levels are implemented in the datapath. The first bypass forwards the output of the ALU cycle to the input of the ALU cycle of the next instruction. The second bypass forwards the operand obtained during the MEM cycle of instruction i to the beginning of the ALU cycle of instruction i+2. Each instruction stage is performed in one clock cycle (9 ns) divided into 4 subcycles of 2.5 ns. The operations performed in each of these phases are shown in Figure 3; they have been scheduled so that datapath collisions and pipeline interlocks are avoided. When an interrupt occurs in instruction i+3, the processor saves the program counter of instruction i+1 and allows that instruction's write stage to complete. Once the interrupt has been serviced and IRET executed, the program counter is restored to instruction i+1.

3. Technology used. SCFL (Source Coupled FET Logic) [4] was the technology chosen to implement AsGaR in a TriQuint 0.8 µm process. It is similar to silicon ECL. SCFL is the easiest GaAs logic family to manufacture because its logic swing does not depend on the transistor threshold voltage; hence, strict control of the process parameters is not required. The logic swing is about 1.2 V, and the permitted threshold voltage variation is the interval from -0.6 V to +0.3 V. The lack of GaAs design tools and technology files, or their high prices, is one of the main problems for university laboratories. To overcome this, the CMA has developed its own design environment for the SC10000 family. This environment was set up by creating the libraries and menus required to capture designs in Cadence/Edge, and by creating all the description files needed for System Hilo and Verilog simulations.
Figure 2. Pipeline stages (steps of segmentation): FETCH, ALU, MEM and WR.
Figure 3. Subcycles of instruction-cycle stages.
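Returning to the pipeline of Section 2, the operand selection implied by the two bypass levels can be sketched in Python. This is a behavioural sketch under stated assumptions, not the AsGaR control logic: `OlderInstr` and `read_operand` are hypothetical names, the 16 registers with R0 = 0 are taken from Section 4, and results of older instructions (i-3 and earlier) are assumed to be written to the register file early enough within the WR subcycles to be read normally.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OlderInstr:
    """Hypothetical snapshot of an instruction that is ahead in the pipeline."""
    dest: Optional[int]   # destination register, or None if it writes no register
    alu_result: int = 0   # value produced in its ALU cycle (first bypass)
    mem_result: int = 0   # value obtained in its MEM cycle (second bypass)


def read_operand(reg: int, regfile: List[int],
                 prev1: Optional[OlderInstr],
                 prev2: Optional[OlderInstr]) -> int:
    """Select a source operand for the instruction entering its ALU cycle.

    prev1 is instruction i-1 (just finished its ALU cycle) and prev2 is
    instruction i-2 (just finished its MEM cycle). prev1 is checked first
    because it is younger and therefore holds the newest value.
    """
    if reg == 0:
        return 0                    # R0 = 0
    if prev1 is not None and prev1.dest == reg:
        return prev1.alu_result     # first bypass: ALU(i-1) -> ALU(i)
    if prev2 is not None and prev2.dest == reg:
        return prev2.mem_result     # second bypass: MEM(i-2) -> ALU(i)
    return regfile[reg]             # no pending write: normal register read


if __name__ == "__main__":
    regs = [0] * 16
    regs[3] = 7
    add_r5 = OlderInstr(dest=5, alu_result=42)    # e.g. an ADD writing R5, one stage ahead
    load_r6 = OlderInstr(dest=6, mem_result=99)   # e.g. a LOAD writing R6, two stages ahead
    print(read_operand(5, regs, add_r5, load_r6))  # 42 via the first bypass
    print(read_operand(6, regs, add_r5, load_r6))  # 99 via the second bypass
    print(read_operand(3, regs, add_r5, load_r6))  # 7 from the register file
```

The sketch does not model a LOAD whose data is needed by the very next instruction: in that case the first bypass would carry the memory address computed in the ALU cycle, not the loaded value, so such sequences must be avoided by instruction scheduling.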
4. AsGaR implementation. Both a printed circuit board implementation and a multichip module (MCM) implementation [5] were studied for AsGaR; a monolithic single-chip implementation was out of the question for a standard cell realisation. Previous work by the group [6] had shown, with quantified results, that in a board implementation performance is completely dominated by the interconnect delay between ICs; partitioning schemes and trade-offs were also quantified. The choice of an MCM implementation was therefore clear. Table 1 shows the partition of AsGaR into four modules.

Table 1. Results for the different units of AsGaR.
Chip              Side (mm)   Power (W)
Control Unit         5.76        5.0
ALU                  4.33        3.5
Register files       3.00        2.6
Control logic        4.70        4.0
MCM                 28.0        15.1

The control unit generates all the signals needed by the datapath, the Program Counter and the Processor Status Word. Three registers trace the state of the processor during instruction execution. Instructions are buffered in a queue and decoded at each pipeline stage using random logic, because no PLA modules are available in this technology. For this reason the control unit occupies 33 mm² and consumes nearly 6 W.

The register file of AsGaR consists of 16 general-purpose registers (R0 to R15), where R0 = 0 and R15 is reserved to support the CALL instruction. The block is structured as a dual-port register file with two buffer registers, RE1 and RE2; the latter is also used for write operations. This unit occupies 9.1 mm² and consumes 2.6 W. Read time is 1.96 ns and write time is 2.40 ns.

Together with the register file, the arithmetic and logic unit is critical for processor performance. The ALU performs integer additions and subtractions as well as simple logic operations. It consists of two subunits: an operand modifier and an adder. The operand modifier is built from a multiplexer that selects the input data according to the scheduled operation. This unit lies on the critical path, but only the section processing bits A0 and B0 is truly critical. For this reason, the modification of the index-0 bits in the operand modifier has been sped up to 856 ps, at a cost of 0.55 mm² of area and 135 mW of power, whereas the index-31 bits take only 0.24 mm² and consume just 35 mW. Adder design in GaAs also involves many trade-offs [9]. For this design the best choice was the structure of Figure 4, a mixed carry-select/carry-propagate adder. The first adder generates bits 0 to 24; its carry-out selects between the results of two 7-bit adders whose carry inputs are set to 0 and 1, respectively. In turn, the 25-bit adder is structured as six carry-look-ahead adders (five 4-bit adders and one 5-bit adder). Figure 5 shows the ALU delay along the different critical paths. The total propagation delay of the ALU is 3.85 ns (including zero-bit generation), with an area of 16 mm² and a power consumption of 3.1 W.

Figure 4. General schematic of the Arithmetic and Logic Unit: a six-stage carry-look-ahead adder for A[0:24], B[0:24] and two 7-bit adders for A[25:31], B[25:31], whose results are selected by a 2:1 multiplexer driven by the carry C25.
Figure 5. Delay paths of the Arithmetic and Logic Unit.
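The mixed carry-select/propagate organisation described above can be modelled at bit level. The Python sketch below is a functional model only (no gate delays): it chains six carry-look-ahead blocks for bits 0 to 24, precomputes the two 7-bit upper sums, and lets C25 pick one. The position of the 5-bit block within the chain, the internal structure of the 7-bit adders and the SUB convention of the operand modifier are assumptions; `cla_block`, `add_low`, `add32` and `alu_add_sub` are hypothetical names.

```python
import random

WORD = 32
LOW_BITS = 25                    # bits 0..24: chain of carry-look-ahead blocks
HIGH_BITS = WORD - LOW_BITS      # bits 25..31: two 7-bit adders
CLA_BLOCKS = [4, 4, 4, 4, 4, 5]  # five 4-bit blocks and one 5-bit block (order ASSUMED)
MASK32 = (1 << WORD) - 1


def cla_block(a: int, b: int, cin: int, n: int) -> tuple[int, int]:
    """n-bit carry-look-ahead block; returns (sum, carry_out).

    Every carry is built directly from generate/propagate terms and cin,
    c_i = g_{i-1} | p_{i-1}g_{i-2} | ... | p_{i-1}...p_0 cin,
    instead of being rippled from c_{i-1}.
    """
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]

    def carry_into(i: int) -> int:
        c = 0
        for src in range(-1, i):          # src == -1 stands for cin
            term = cin if src < 0 else g[src]
            for k in range(src + 1, i):   # propagate through bits src+1 .. i-1
                term &= p[k]
            c |= term
        return c

    s = 0
    for i in range(n):
        s |= (p[i] ^ carry_into(i)) << i
    return s, carry_into(n)


def add_low(a: int, b: int, cin: int) -> tuple[int, int]:
    """Bits 0..24 from the six chained CLA blocks; returns (sum, C25)."""
    s, pos, c = 0, 0, cin
    for n in CLA_BLOCKS:
        mask = (1 << n) - 1
        part, c = cla_block((a >> pos) & mask, (b >> pos) & mask, c, n)
        s |= part << pos
        pos += n
    return s, c


def add32(a: int, b: int, cin: int = 0) -> tuple[int, int]:
    """Carry-select 32-bit addition; returns (sum mod 2**32, carry_out)."""
    hi_mask = (1 << HIGH_BITS) - 1
    a_hi, b_hi = (a >> LOW_BITS) & hi_mask, (b >> LOW_BITS) & hi_mask
    # Both upper sums are precomputed (in hardware, in parallel with the low chain).
    # Their internal structure is not given; cla_block is reused for convenience.
    hi_if_c0 = cla_block(a_hi, b_hi, 0, HIGH_BITS)
    hi_if_c1 = cla_block(a_hi, b_hi, 1, HIGH_BITS)
    lo, c25 = add_low(a, b, cin)
    hi, cout = hi_if_c1 if c25 else hi_if_c0   # 2:1 multiplexer driven by C25
    return (hi << LOW_BITS) | lo, cout


def alu_add_sub(a: int, b: int, subtract: bool = False) -> int:
    """ADD/SUB through a hypothetical operand modifier in front of add32."""
    # ASSUMPTION: for SUB the modifier selects ~B and the carry-in is 1,
    # i.e. the usual two's-complement identity A - B = A + ~B + 1.
    b_mod = (~b & MASK32) if subtract else b
    result, _ = add32(a, b_mod, 1 if subtract else 0)
    return result


if __name__ == "__main__":
    for _ in range(2000):
        x, y, c = random.getrandbits(32), random.getrandbits(32), random.getrandbits(1)
        total = x + y + c
        assert add32(x, y, c) == (total & MASK32, total >> 32)
    print(hex(alu_add_sub(5, 7, subtract=True)))  # 0xfffffffe
```

The point of the organisation is visible in `add32`: neither upper sum waits for C25, so once the carry leaves the 25-bit chain the remaining delay for bits 25 to 31 is a single 2:1 multiplexer rather than another 7-bit carry chain.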
5. Conclusions. In this paper we have presented the design of a 32-bit RISC processor running at a clock speed of 440 MHz. The work reported here shows the level of complexity faced by designs of this kind and the need for a holistic approach that considers all aspects of system implementation. AsGaR has been implemented with the TriQuint SC10000 standard cell library. The lack of tools for GaAs design has been overcome by adapting the Cadence/Edge environment to this technology. Simulation was performed after describing the cell library in System Hilo. Our laboratory is now developing a full custom version of AsGaR with E/D-MESFET devices and DCFL/SDCFL logic, using the Cadence layout capture and verification tools that have been adapted to this technology. The full custom version for the 0.8 µm process aims to implement AsGaR on a single chip.

ACKNOWLEDGEMENTS. The authors wish to thank Juan F. Pérez Castellano for his work in refining (and debugging) the architecture, and Dora Viera Curbelo for her work on the implementation of AsGaR. The original architecture was defined within the framework of a cooperation agreement between the CMA and the Computer Architecture Department of the Technical University of Catalunya (Dr. Jordi Cortadella).

REFERENCES.
[1] V. Milutinovic, ed., "Special Issue on GaAs Microprocessor Technology", IEEE Computer, October 1986.
[2] B. Naused and B. Gilbert, "A 32-Bit, 200 MHz GaAs RISC", IEEE Micro, December 1987, pp. 8-27.
[3] B. Cushman, "GaAs Technology Meets RISC Architecture", VLSI System Design, vol. 9, no. 9, September 1988, pp. 86-77.
[4] T. Vu, A. Peckzalski, K. Lee and J. Conger, "The Performance of Source-Coupled FET Logic Circuits That Use GaAs MESFET's", IEEE Journal of Solid-State Circuits, vol. 23, no. 1, February 1988, pp. 267-279.
[5] H. Bakoglu, Circuits, Interconnections and Packaging for VLSI, Addison-Wesley Publishing Company, Inc., New York, 1990.
[6] R. Sarmiento, "Aportaciones al diseño de procesadores GaAs. Resultados de las técnicas de partición en función de los parámetros tecnológicos", Doctoral Thesis, Universidad de Las Palmas de Gran Canaria, July 1991.
[7] T. Mudge, R. Brown, W. Birmingham, J. Dykstra, A. Kayssi, R. Lomax, O. Olukotun, K. Sakallah and R. Milano, "The Design of a Microsupercomputer", IEEE Computer, vol. 24, no. 1, January 1991, pp. 57-64.
[8] J. McDonald, H. Greub, R. Steinvorth, B. Donlan and A.S. Bergendahl, "Wafer Scale Interconnections for GaAs Packaging Applications to RISC Architecture", IEEE Computer, April 1987, pp. 21-35.
[9] R. Sarmiento, P.P. Carballo and A. Núñez, "High Speed Primitives of Hardware Accelerators for Digital Signal Processing in GaAs Technology", IEE Proceedings G, vol. 139, no. 2, April 1992, pp. 205-216.