North-Holland
Microprocessing and Microprogramming 30 (1990) 127-134
A 32-BIT RISC CPU IMPLEMENTED IN GaAs
by William A. Geideman, Roger A. Niederland and David L. Harrington
McDonnell Douglas Electronic Systems Company, 5301 Bolsa Avenue, Huntington Beach, CA 92647 USA
This paper describes the architecture, design and performance of a 32-bit RISC microprocessor chip implemented using enhancement mode junction field effect transistor (JFET) technology in GaAs. The chip operated at a maximum speed in excess of 80 MHz with one instruction per cycle.
1. INTRODUCTION
Three years ago, at EUROMICRO 87, we reported on the design of a reduced instruction set computer (RISC) implemented in GaAs. This paper updates the earlier work and presents the test results on a fully functional CPU chip implemented in junction field effect transistor (JFET) GaAs technology. The single chip CPU has 20,249 transistors in an area of approximately 1 square centimeter (7.7 mm by 11.1 mm). Performance was demonstrated at an execution rate of 80 million instructions per second (MIPS) based on a single instruction per cycle. This performance was measured on a Trillium 286 Micromaster-Plus test system in the multiplexed mode, and the chip operated at the maximum speed of the tester. Speeds slightly above 80 MIPS were measured directly on the chip by measuring the delay in the longest path. The power consumption at this speed was 4.5 watts, giving an excellent speed-power relation for commercial application.
We have approached this program from the vantage point of the system user and have set out to build a product (the single board computer) which meets the system requirements and is not just a technology development. To this end, we have combined in a single group the processor architecture, GaAs technology, and software development, including compiler and real time operating system development. We have chosen to fabricate the CPU chip in JFET GaAs, which means we have the advantage of low power, with subsequent development to further increase the speed of operation. Our approach is to get a practical design and then increase the speed by process improvements.
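The speed-power relation quoted above can be expressed as energy per instruction. The following is a minimal sketch that uses only the 4.5 W and 80 MIPS figures reported in this paper; the function name is illustrative and not part of the original work.

```python
def energy_per_instruction_nj(power_watts: float, mips: float) -> float:
    """Return the energy spent per executed instruction, in nanojoules."""
    instructions_per_second = mips * 1e6
    joules_per_instruction = power_watts / instructions_per_second
    return joules_per_instruction * 1e9


# Figures reported in the paper: 4.5 W at 80 MIPS (one instruction per cycle).
energy = energy_per_instruction_nj(power_watts=4.5, mips=80.0)
print(f"{energy:.2f} nJ per instruction")
```

At the reported operating point this works out to roughly 56 nJ per instruction.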
W.A. Geideman et al. / A 32-bit RISC CPU implemented in GaAs
We chose a three-stage pipe architecture. The pipe length affects the number of clock cycles needed to complete an instruction if a branch occurs. For normal calculation, the rate of instruction execution is the same. However, for a branch instruction (if x = y, then do z; if x > y, then do q), the number of pipe stages is critical. Since the next instruction cannot be started until the result of the previous operation is known, wait states must be inserted into the pipe. The number of wait states depends on the length of the pipe and is generally two less than the total pipe length. Thus, our design has one wait state, while other designs have up to four wait states. This means that our architecture is more efficient in code execution and therefore has a higher ratio of instruction cycles to machine cycles. We calculate that our design can produce a sustained level of 100 MIPS at a clock rate of 160 MHz.
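The effect of pipe length on branch cost can be illustrated with a short calculation. This is a minimal sketch, assuming that every branch incurs the full number of wait states and that no other stalls occur; the branch fraction of 0.2 is an assumed value chosen for illustration and does not come from the paper, and the function names are hypothetical.

```python
def branch_wait_states(pipe_stages: int) -> int:
    """Wait states per branch, using the rule of thumb: pipe length minus two."""
    return max(pipe_stages - 2, 0)


def instructions_per_cycle(pipe_stages: int, branch_fraction: float) -> float:
    """Ratio of instruction cycles to machine cycles when only branches stall."""
    cycles_per_instruction = 1.0 + branch_fraction * branch_wait_states(pipe_stages)
    return 1.0 / cycles_per_instruction


BRANCH_FRACTION = 0.2  # assumed for illustration; not a figure from the paper

for stages in (3, 6):
    ipc = instructions_per_cycle(stages, BRANCH_FRACTION)
    print(f"{stages}-stage pipe: {branch_wait_states(stages)} wait state(s), "
          f"{ipc:.3f} instructions per cycle")
```

Under these assumptions the three-stage pipe loses one cycle per branch, while a six-stage pipe loses four, so the shorter pipe keeps a higher ratio of instruction cycles to machine cycles at the same clock rate.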
2. DESIGN ARCHITECTURE
The chip, designated the MD 484, uses a three-stage pipe and executes the instruction set designated as the US government core set of instructions. This instruction set is implemented on the MIPS Inc. R3000 and R6000 commercial microprocessors. The CPU has 17 general purpose registers, each 32 bits wide, as well as special purpose registers for status, program counters, intermediate storage of multiply and divide results, data input and data output. The basic architecture is the Harvard scheme, where the instruction data and operand data are supplied over separate buses. A common address bus is used for both data streams. Figure 1 shows the organization of the CPU with the three 32-bit wide buses, the various registers, the arithmetic logic unit (ALU), and the barrel shifter. The actual chip is shown in the photomicrograph of figure 2.
Figure 1. CPU Block Diagram
[Block diagram; legible labels: A Bus, B Bus, C Bus, Address Bus, Operand Data Bus, Instruction Data Bus, RF, BS, ALU.]
Figure 2. MDESC GaAs 32-Bit RISC Computer Chip
[Photomicrograph; annotations: Full 32-Bit Data Path; Executes Core MIPS ISA; 20,249 Transistors.]
An additional advantage of our architecture is a more flexible memory interface. Other designs use an inflexible, fixed timing interface that requires very fast (2 ns) access cache memory chips. Our memory interface combines flexible timing with software scheduling of memory references that allows the user to trade off processor performance for power consumption. The MD 484 processor will work well with high density silicon memory chips in applications that do not require the high speed, radiation hard GaAs memories. The processor architecture also includes hardware features to better support compiled Ada programs.
High clock rate RISC processors require cache memories. To improve the overall throughput of the computer, the time to access the cache memory is important. At the speed of this processor, packaging will play an important role in the speed of the final system. Packaging affects the ability to dissipate the power generated in the chip. It also governs the propagation delay associated with a memory access.
The logical improvement path for the GaAs processor is to include more functions on the chip to match the silicon designs like the MIPS R3000 processor. This entails adding cache memory and hardwired multiply circuits to the GaAs chip. We can add the memory on the chip, since the JFET is ideal for memory structures. This larger capacity chip will further increase the utilization of available pipe stages, since chip-to-chip communication will be minimized and the branch addresses may be included on the chip to speed loop calculations.
The MD 484 processor software suite includes a translator which allows the MIPS Inc. compilers developed for the R3000 to be used directly in the GaAs version. For greater efficiency, we have developed an optimizing Pascal compiler for our processor and are currently initiating development of an Ada compiler and environment targeted specifically for the GaAs processor. We are also developing a real time distributed operating system for our processor, and we are working on a multiprocessor architecture and fiber optic bus interface.
The technology necessary to increase the number of gates on a chip in GaAs relates to the development of a three layer interconnect method. We expect a 40% reduction in current chip area from the third layer of metal, so we can easily add the branch cache memory in the current area. Power is a problem, as the total chip power will increase with the gate count. Our technology has very low memory power, and we project a power increase of only 2 watts for the inclusion of the branch cache on the CPU chip.
The next generation of the GaAs RISC microprocessor will have considerable improvements over the current version. These will include extension of the register file to a full 32 words, inclusion of the branch cache for instructions on the CPU, elimination of the move to/from input/output registers, and added testability and fault tolerance. The fault tolerance changes under consideration at this time are parity on internal and external buses and the register file, support for a master/checker pair, signature codes, and instruction decode error detection.
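As an illustration of the bus parity option mentioned above, the following is a minimal sketch of even parity over one 32-bit word. The paper does not specify the parity scheme, its granularity, or how errors are reported, so the choice of a single even-parity bit per word and all names below are assumptions made for this example.

```python
def even_parity_bit(word: int) -> int:
    """Return the bit that makes the count of 1s (word plus parity bit) even."""
    word &= 0xFFFFFFFF  # restrict to one 32-bit bus word
    return bin(word).count("1") % 2


def parity_ok(word: int, parity: int) -> bool:
    """Check a received word against the parity bit that accompanied it."""
    return even_parity_bit(word) == parity


word = 0x000000F1             # five 1 bits, so the parity bit is 1
parity = even_parity_bit(word)
corrupted = word ^ (1 << 7)   # flip a single bit, as a bus fault might

print(parity_ok(word, parity), parity_ok(corrupted, parity))
```

A single parity bit detects any odd number of flipped bits in the word but cannot detect an even number of flips, which is one reason parity is usually combined with other checks such as a master/checker pair.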
3. TEST RESULTS
Figure 3 shows the test results of several fully functional CPUs. The worst case speed is based on execution of an instruction that involves a worst case add operation in the ALU, consisting of the longest path instruction setup and a complete 32-bit carry. The delay measured is translated into a maximum speed of operation. In addition to the worst case measurements, the performance of the simple ALU instructions is included. These instructions do not require a 32-bit carry, and are executed at greater than 100 MHz. In each case, the instructions require access of two 32-bit operands, transfer of the operands to the ALU, instruction decode to set up the ALU operation, and transfer of the ALU result to the temporary register, where it is clocked in. To obtain the optimum performance, the value of the load resistor was purposely modified during processing. The resulting speed, power consumption and yield could be evaluated for a range of different values. As the load resistor is reduced, speed increases, but at the cost of increased power and reduced noise margin, which reduces yield. Currently, the 4K-ohm load resistor value provides the optimum performance with a minimum impact on yield.
Each chip on the wafer was tested with a series of test vectors to verify functionality. The test program is composed of subtests which exercise various components of the CPU. This aids in characterizing the circuit by isolating any failures to a particular section of the chip. Each subtest consists of a set of test vectors that has been evaluated for fault coverage using HI-LO logic simulation. Currently the total number of vectors for all subtests is 4,000. The total fault coverage obtained by using all vectors is about 85%.
4. PACKAGING
The CPU chip was packaged in a 305 pin grid array package, as shown in figure 4, for the chip testing. This package, however, is not suitable for system applications. To take maximum advantage of the speed of GaAs circuits, it is necessary to include this chip in a multichip module together with the coprocessor chips and with the high speed cache memory. A high density GaAs processor module has been laid out as shown in figure 5. This processor module contains one CPU, two floating point units (FPU), two memory management units (MMU), a branch cache (BC), a console interface chip (CI), a system controller chip (SC), eight high speed cache RAMs with 4 Kbits per RAM, and a main memory of eight 3D memory modules fabricated with silicon memory chips to provide an on-board memory capacity of 8 Megabytes. The entire module will be packaged in an area of 5 sq. inches with power dissipation of 60-95 watts, as shown in the figure. The technology to implement this multichip module has been demonstrated at the size shown here for silicon circuits.
Figure 3. CPU-1A Speed Test Data Summary
Device      Worst Case* Estimate   Simple* Add    Power   RL
(L W Die)   of Speed* (MHz)        Speed (MHz)    (W)     (Kohm)
6 J 1-3     74                     109            4.7     3.6
6 J 2-2     76                     110            4.7     3.6
6 D 1-3     80                     95             4.2     4.6
7 G 2-5     79                     106            4.6     4.1
7 H 1-3     81                     -              4.7     4.2
7 D 2-2     75                     -              4.3     4.7
1 A 3-6     59                     -              3.5     5.7
1 A 3-5     75                     -              3.3     5.8
* At wafer probe
* Based on SUBNEG instruction
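The translation between a measured longest-path delay and a maximum operating speed can be sketched as follows. This is a minimal illustration, assuming that one clock period equals the measured worst-case path delay with no added margin; the paper does not state the exact conversion used. The worst-case speeds are the values read from Figure 3, and the function names are hypothetical.

```python
def max_clock_mhz(path_delay_ns: float) -> float:
    """Maximum clock rate (MHz) if one period equals the longest path delay."""
    return 1000.0 / path_delay_ns


def path_delay_ns(clock_mhz: float) -> float:
    """Longest path delay (ns) implied by a given maximum clock rate."""
    return 1000.0 / clock_mhz


# Worst-case speed estimates (MHz) as read from Figure 3.
worst_case_mhz = [74, 76, 80, 79, 81, 75, 59, 75]

for mhz in worst_case_mhz:
    print(f"{mhz} MHz corresponds to a {path_delay_ns(mhz):.2f} ns worst-case path")
```

Under this assumption, the 80 MHz device corresponds to a worst-case path of 12.5 ns, and the simple add instructions running above 100 MHz correspond to paths shorter than 10 ns.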
Figure 4. GaAs Processor in Pin Grid Array
Figure 5. High Density GaAs Processor Module
Area: 5 sq in. Volume: 2.5 cu in. Power: 60-95 watts (12-19 W/sq in).
Estimated part power: CPU 5-10 W; FPU 5-10 W; MMU 5-10 W; BC 5-10 W; CI 2.5-6 W; SC 2.5-4 W; RAM 8 x 1 W; MEM 1 x 10 W (active), 7 x 1 W (standby).
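The module power range can be cross-checked by summing the estimated part powers against the part counts given in the text. This is a minimal sketch; the per-part values are as read from Figure 5, where the CI label is missing and the SC range is garbled in the extracted figure, so those two rows are assumptions chosen to be consistent with the stated 60-95 watt total.

```python
# (part, count, min watts each, max watts each), as read from Figure 5.
PARTS = [
    ("CPU", 1, 5.0, 10.0),
    ("FPU", 2, 5.0, 10.0),
    ("MMU", 2, 5.0, 10.0),
    ("BC", 1, 5.0, 10.0),
    ("CI", 1, 2.5, 6.0),            # label missing in the figure; assumed
    ("SC", 1, 2.5, 4.0),            # range garbled in the figure; assumed
    ("cache RAM", 8, 1.0, 1.0),
    ("MEM (active)", 1, 10.0, 10.0),
    ("MEM (standby)", 7, 1.0, 1.0),
]
AREA_SQ_IN = 5.0  # module area stated in the text


def module_power(parts):
    """Return (minimum, maximum) total module power in watts."""
    low = sum(count * lo for _, count, lo, _ in parts)
    high = sum(count * hi for _, count, _, hi in parts)
    return low, high


low, high = module_power(PARTS)
print(f"Total power: {low:.0f}-{high:.0f} W")
print(f"Power density: {low / AREA_SQ_IN:.0f}-{high / AREA_SQ_IN:.0f} W/sq in")
```

With these values the totals agree with the 60-95 watt range and the 12-19 W/sq in density given for the module.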
The advantage of the multichip module packaging is that the interconnect between chips is nearly identical to the interconnect on each chip, in that it is implemented by metal lines on an insulator. This means that the interconnect capacitance of the package can be controlled to limit the drive power from each pin on the chips and to provide minimum length interconnects, thus saving both power and delay time.
5. CONCLUSION
This paper has shown that a 32-bit microprocessor chip can be fabricated in GaAs with very high speed operation and with reasonable power consumption. The CPU chip described here will be augmented by other chips of equal complexity to form a single board computer capable of operating at a rate of 100 MIPS for one instruction per cycle with very little need for wait states. The software to permit operation of this computer already exists, as compilers for Pascal, Fortran 77, C, and Ada are available. This computer shows that digital GaAs fabrication technology has produced VLSI circuits with commercial yields and that these circuits can be employed in advanced systems that require high speed operation or those that must operate in hostile environments such as outer space.