CMOS and ECL implementation of MIPS RISC architecture

With reference to the R3000 and R6000 processors, Ashis Khan looks at the factors affecting the implementation of a scalable RISC architecture
The primary goal of first generation RISC processors was to achieve an average execution rate of one instruction per clock cycle (CPI, or clocks per instruction). Once this goal is achieved, the same architecture can be implemented in technologies that offer very high clock rates to achieve high performance. The paper discusses two basic aspects of implementing an architecture in ECL using a case study of the MIPS R3000 and R6000 processors. The first concerns the architectural elements that make the R3000 easy to implement in ECL, and the second looks at how to resolve the problems raised by a wide gap between processor clock speed and main memory speed.

Keywords: microsystems, RISC architecture, compilers, R3000, R6000
The computing industry has recently experienced a trend towards 'scalable architecture', i.e. one architecture providing a wide range of performance. From low-end embedded controllers and desktop PCs to high-end supercomputers and massively parallel computers, the same architecture can be used and, therefore, a similar software base can be utilized across the range of systems. One way to achieve scalable performance is to use different implementations of a given architecture targeted for different performance levels.
MIPS Computer Systems, 950 De Guigne, Sunnyvale, CA 94086-3650, USA
Revised paper received: 11 April 1990

ARCHITECTURE AND IMPLEMENTATION

An architecture can be implemented in many different technologies. However, certain elements in the architecture definition can affect the selection of the right kind of technology. For example, an architecture defines a register set that may be shared by the CPU and a
coprocessor. This is not difficult to implement in a technology that allows the integration of the CPU and coprocessor on the same die, but could be difficult if the coprocessor must be built off-chip (requiring register-coherency checking). Similarly, the very large register files typically used for implementing register windowing take up a large die area and therefore may not afford the best use of chip area in a technology that offers less density. Thus, architecture and implementation are not entirely independent. Different implementation technologies offer advantages and disadvantages and, therefore, call for tradeoffs. For example, at a given point in time, CMOS technology may provide low power (2-5 W), high integration (100 k-300 k transistors) and a 25-40 MHz range of operating frequency, whereas ECL technology at the same time will provide high power dissipation (15-20 W), low-level integration (70 k-100 k transistors) and a much higher range of operating frequency (70-100 MHz). Therefore, these two technologies call for quite different implementations of the same architecture. This article describes two implementations (namely the MIPS R3000 processor in CMOS technology and the MIPS R6000 processor in bipolar ECL technology) of the same 32-bit architecture.
0141-9331/90/06367-09 © 1990 Butterworth-Heinemann Ltd, Vol 14 No 6, July/August 1990

RISC architectures and clock rate

The canonical equation for the performance of a processor can be used to show how RISC architectures benefit from a technology that offers higher clock rates. The time required to accomplish a specific task can be expressed as the product of three factors:

Time per task = C * T * I

where C = cycles per instruction, T = time per cycle and I = instructions per task. Compiler technology and operating system support reduce the factor I (instructions per task), and a pipelined implementation coupled with a good cache architecture allows a reduced C (cycles per instruction). In fact, the architectural goal of all RISC processors is to reduce this factor C and thereby reap the benefit of a reduced factor T, or clock period. A technology that allows a very fast clock therefore complements the efficiency of RISC architectures.
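To make the equation concrete, here is a toy calculation. The CPI of exactly 1 is the idealized RISC goal, not a measured figure, and the instruction count is arbitrary; only the clock rates are the R3000/R6000 figures quoted in this article.

```python
# Toy illustration of: time per task = C (cycles/instruction) * T (s/cycle) * I.

def time_per_task(cpi, clock_hz, instructions):
    """time = C * T * I, with T = 1 / clock."""
    return cpi * (1.0 / clock_hz) * instructions

instructions = 1_000_000
cmos = time_per_task(1.0, 25e6, instructions)    # 25 MHz CMOS implementation
ecl = time_per_task(1.0, 66.67e6, instructions)  # 66.67 MHz ECL implementation
print(cmos / ecl)   # ~2.67: with C and I held fixed, speedup tracks the clock ratio
```

With the compiler (I) and the pipeline/cache design (C) held constant across implementations, the speedup is exactly the ratio of the clock rates.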
Bipolar ECL technology

Bipolar ECL technology offers the advantages listed below.

• drive capability for transmission lines
• drive capability for large capacitive loads
• shorter propagation delays
• higher toggle rates
• small signal swing (800 mV)

There are two disadvantages not overcome by current technology:

• lower circuit density
• high power requirements

The new RISC processors, having relegated most complex, underused functionality to software, are not necessarily affected by these disadvantages. For example, the MIPS R3000 processor, rated at 20 VAX MIPS over a number of large application programs and benchmarks running at 25 MHz, requires only 115 k transistors in a CMOS process. Recent developments in ECL technology [1] offer tremendous advantages in increased clock speed within a reasonable level of circuit density. The same R3000 architecture, when implemented in ECL technology, runs at 66.67 MHz to provide 55 VAX MIPS.
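The ratings quoted above allow a quick sanity check: VAX MIPS per MHz comes out nearly the same for both implementations, which is exactly the 'same architecture, faster clock' scaling this article argues for.

```python
# Ratings quoted in the text: 20 VAX MIPS at 25 MHz (R3000, CMOS) and
# 55 VAX MIPS at 66.67 MHz (R6000, ECL).
r3000_per_mhz = 20 / 25.0
r6000_per_mhz = 55 / 66.67
print(round(r3000_per_mhz, 2), round(r6000_per_mhz, 2))   # 0.8 0.82
```

The near-constant MIPS-per-MHz figure suggests the architecture's effective CPI is roughly preserved across technologies, so delivered performance scales with the clock.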
ARCHITECTURAL PARTITIONING IN ECL AND CMOS
Table 1 shows the difference between the R3000 and R6000 technologies. Figure 1 shows the basic chip implementation for both the R3000 and R6000. On the right is the CPU datapath that implements the pipeline. There is a stack of functional units, including an ALU, a 32-bit shifter and an autonomous 32-bit multiply/divide unit. The register file consists of 32 general-purpose 32-bit registers, a double-word (64-bit) special register for multiply and divide results and a 32-bit program counter.
Table 1. R3000 and R6000 technologies

                                    R3000          R6000
Frequency                           25 MHz         66.67 MHz
Process                             1.5 µm CMOS    0.5 µm transistor pitch ECL
Power dissipation                   2.5 W          23 W
Number of transistors               115 000        89 000
Cache RAM access time               20 ns          7 ns
Maximum primary data/instruction
cache size                          256 k/256 k    16 k/64 k
System implementation

A typical system design using the R3000 CMOS processor is shown in Figure 2. The floating-point coprocessor (R3010) and the CPU access the cache simultaneously, synchronized by an on-chip PLL (phase locked loop) controller. The on-chip cache controller in the R3000 drives the cache (separate instruction and data) directly. The cache being write-through, write buffers are used to hold writes to the cache that are later written to main memory (when the buffers are full). The caches can be built from 4 kbyte up to 256 kbyte each of instruction and data. A typical system implementation using the R6000 is shown in Figure 3. A system configuration will consist of the R6000 CPU, the R6010 floating-point coprocessor and the R6020 system bus chip. The floating-point coprocessor does not have an on-chip multiplier, but uses a commercially available multiplier. The caches are built using off-the-shelf BiCMOS and ECL SRAMs.
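The write-through-plus-write-buffer scheme can be sketched as follows. The 4-entry buffer depth and the drain-on-full policy are illustrative assumptions, not the R3000's documented parameters.

```python
from collections import deque

# Sketch of a write-through cache with a write buffer. The 4-entry depth and
# the drain-on-full policy are assumptions for illustration.

class WriteThroughCache:
    def __init__(self, depth=4):
        self.depth = depth
        self.cache = {}          # address -> value; updated on every write
        self.memory = {}         # main memory, updated when the buffer drains
        self.buffer = deque()    # pending (address, value) writes

    def write(self, addr, value):
        self.cache[addr] = value           # write-through: cache updated now
        self.buffer.append((addr, value))  # memory update is deferred...
        if len(self.buffer) >= self.depth:
            self.drain()                   # ...until the buffer fills

    def drain(self):
        while self.buffer:
            a, v = self.buffer.popleft()
            self.memory[a] = v

c = WriteThroughCache()
for i in range(4):
    c.write(i, i * 10)       # the fourth write fills the buffer and drains it
```

The buffer decouples the processor from main-memory write latency: the CPU continues at cache speed while buffered writes trickle out to memory.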
The general-purpose registers are all directly and simultaneously addressable from all instructions. There are no mechanisms for hiding a portion of the register file, such as register windows [2], stack caches [3], separate user/kernel registers [4] or process register sets [5]. Instead, the register file is symmetrical, and various compiler techniques (e.g. interprocedural register allocation [6]) are used to reduce the overhead of saving and restoring registers. There is a separate set of 64-bit registers for the floating-point coprocessor. The small register file was designed for ease of portability of the architecture to different implementations. On the left of the figure is the datapath of the system coprocessor, which implements the memory management unit, exception handler and a set of registers (MMU registers, exception/control registers, status register, etc.) that are used by the operating system kernel.

Implementation differences
The major implementation difference between the R3000 and R6000 lies in the virtual-to-physical address translation and the accessing of the cache subsystem. Cache access is the most critical path of most instruction pipelines, and it is the area in which ECL technology provides the greatest challenge to system designers. There are two aspects to the problem. First, the processor can access the cache using either the virtual address, as shown in Figure 4b, or the physical address, as shown in Figure 4a. Virtually addressed caches present difficulties for operating system designers, since the cache is not totally transparent to the software, even in a uniprocessor design, and synonym problems [7] must be dealt with. Virtual addresses are said to be 'synonyms' or 'aliases' when they all map to the same physical address. In a virtually addressed cache, synonyms cause problems because multiple copies of the same information can be present at the same time in different cache entries. If the content of one virtual address is changed, two distinct copies of the same variable will reside in the cache. There are software and hardware solutions to the problem, but these are costly and time-consuming.
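A small simulation makes the synonym hazard concrete. The page size, line size and cache geometry below are invented for illustration; the point is only that a virtually indexed, virtually tagged cache larger than a page can hold two live copies of one physical location.

```python
# Synonym (alias) demonstration in a virtually indexed, virtually tagged
# direct-mapped cache. All sizes are made-up illustrative values.

PAGE = 4096
LINES = 512                        # 512 lines x 16-byte lines = 8 kbyte > page size

page_table = {0: 7, 5: 7}          # virtual pages 0 and 5 alias physical page 7
cache = {}                         # line index -> (virtual address, value)

def line_of(vaddr):
    return (vaddr // 16) % LINES   # index drawn from virtual address bits

def write(vaddr, value):
    cache[line_of(vaddr)] = (vaddr, value)  # no translation on the write path

va1, va2 = 0 * PAGE + 16, 5 * PAGE + 16    # synonyms: same physical byte
write(va1, 'A')
write(va2, 'B')                    # does NOT find or update the 'A' copy

copies = [v for (a, v) in cache.values() if a in (va1, va2)]
print(sorted(copies))              # ['A', 'B']: two live copies of one location
```

Note that if the cache index were drawn only from page-offset bits (cache size no larger than a page), the two writes would select the same line and the duplication could not occur; this observation underlies the R6000 data cache design described later in the article.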
Microprocessors and Microsystems
Figure 1. MIPS RISC architecture (pipeline control, control registers, general registers, shifter, multiplier/divider, memory management unit, address adder, cache control, PC incrementer and logic)
Figure 2. R3000-based system block diagram

Such synonym problems can be avoided altogether in physically addressed caches. However, the translation process, which requires a special cache (the translation lookaside buffer, or TLB), must be done rapidly. Thus, the TLB should be on-chip, large for a high hit ratio and extremely fast. The second problem in designing ECL cache address paths derives from the high speed of the processor, which produces a significant gap between processor cycle time and main memory latency; a cache miss therefore reduces performance significantly. The R6000 processor clock cycle time is 15 ns and the main memory latency is 100 ns. A cache miss penalty can therefore be in the range of 65 processor cycles, including a refill of 32 words. The cache hit rate should be very high to avoid main memory access, and at the same time the cache must provide data/instructions at the same speed as the processor clock. The R6000 architects came up with innovative solutions to both problems.

Cache memory/MMU implementation

The R3000 processor uses all the advantages of CMOS in the design of its TLB. The translation stage is pipelined with cache access, and translation is very fast (half a cycle, or 20 ns, with the processor clock at 25 MHz). It accommodates 64 entries, the page size being 4 kbyte. The on-chip TLB, which is fully-associative, takes up nearly 25% of the silicon area. The cache is fully physical (i.e. indexed with a physical address), as are the tags. Since the processor runs at only 25 MHz, the on-chip cache controller was designed to work with very large caches (up to 512 kbyte), taking advantage of commercially available SRAMs at processor speeds (20 ns access time).

Two-level TLB design for R6000

For the R6000 implementation, a full on-chip TLB with a reasonable hit rate was not feasible. To address this problem, the R6000 designers produced a two-level TLB and a two-level cache strategy. The first level cache had to be very fast and was therefore limited in size. A virtually addressed cache was chosen for two reasons.

• Virtual-to-physical address translation is primarily required on a cache miss only, and therefore no TLB translation is needed unless there is a miss. This is shown in Figure 4b.
• For the data cache, page and cache size are kept the same, so that for all practical purposes the data cache is a physically indexed cache and therefore avoids synonym problems.

The MIPS architecture does not support self-modifying code, and there are process identifier bits (an 8-bit field) associated with each entry in the virtually addressed first level instruction cache. Synonym problems are thereby avoided. The 64-entry fully-associative TLB implemented in the R3000 required 3 kbit (each entry having 40 bit), close to 25% of the die area. In the R6000, a fast but small first level TLB (16 entries), capable of translating only 6 bit of the virtual address to 6 bit of the physical address, was included on-chip (see Figure 5). This first level TLB requires only 96 bit. The main TLB backing up the first level TLB is implemented as part of the secondary cache; its implementation is discussed below. The first level cache access and first level TLB access proceed in parallel, as shown in Figure 6. If there is a miss in the primary cache, the secondary cache must be accessed. The secondary caches contain virtual tags (i.e. they are physically indexed but virtually tagged) to obviate a full TLB translation before accessing. The secondary cache architecture must produce a high hit rate to prevent TLB lookups (which are necessary only when there is a cache miss).
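The miss-penalty figure quoted earlier implies that overall throughput is extremely sensitive to the hit rate. A one-line model (treating every instruction as a single cache access, which is a simplification) shows why the designers insisted on a very high hit rate.

```python
# ~65-cycle miss penalty (figure from the text). Every instruction is
# treated as exactly one cache access - a deliberate simplification.

MISS_PENALTY = 65

def effective_cpi(base_cpi, miss_rate):
    return base_cpi + miss_rate * MISS_PENALTY

for mr in (0.0, 0.01, 0.05):
    print(mr, effective_cpi(1.0, mr))
```

Even a 1% miss rate adds 0.65 cycles to every instruction on average; at 5% the processor spends more time waiting on refills than executing, which is why the two-level design pushes the secondary hit rate so hard.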
In case of a secondary cache miss and a subsequent full TLB hit, the first level TLB partial translation must be checked. The algorithm (Figure 7) is described below.
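The decision sequence can be sketched in software form as below. The dictionaries, the 8-entry slice (the text quotes 16 entries; 8 follows from indexing with bits 14-16 only) and the omission of the alias check are simplifications for illustration, not the hardware organization.

```python
# Sketch of the two-level TLB lookup described in this section (simplified).

def lookup(vaddr, primary, slice_tlb, secondary, full_tlb):
    """Return ('primary'|'secondary'|'miss', data).

    primary:   line index (bits 0..15)  -> (virtual tag bits 16..31, data)
    slice_tlb: bits 14..16 of the vaddr -> guessed physical bits 14..19
    secondary: physical index (bits 0..19) -> (virtual tag bits 20..31, data)
    full_tlb:  virtual page number (vaddr >> 14) -> physical bits 14..19
    """
    index, vtag = vaddr & 0xFFFF, vaddr >> 16
    entry = primary.get(index)
    if entry is not None and entry[0] == vtag:
        return 'primary', entry[1]                 # primary cache hit

    guess = slice_tlb[(vaddr >> 14) & 0x7]         # partial translation (a guess)
    pindex = (vaddr & 0x3FFF) | (guess << 14)      # page offset + guessed bits
    stag, data = secondary.get(pindex, (None, None))
    if stag == vaddr >> 20:
        return 'secondary', data                   # guess confirmed by virtual tag

    real = full_tlb[vaddr >> 14]                   # consult the full (in-cache) TLB
    if real == guess:
        return 'miss', None                        # genuine secondary cache miss
    slice_tlb[(vaddr >> 14) & 0x7] = real          # first level TLB was wrong: fix it
    return lookup(vaddr, primary, slice_tlb, secondary, full_tlb)  # and retry
```

On the retry after a slice correction, the guessed physical index is now right, so the access resolves to either a genuine secondary hit or a genuine miss.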
Figure 3. Simplified system block diagram for R6000-based design (R6000 CPU, R6010 floating-point controller with B3110 FMPY, instruction and data caches, R6020 system bus chip, address/data/system buses)
Figure 4. a, physically addressed cache; b, virtually addressed cache

Figure 5. On-chip first level TLB in R6000

• Access the primary cache with the virtual index (bits 0...15 of the virtual address). Check the virtual tag (bits 16...31 of the virtual address) for the primary hit/miss decision. On a hit, proceed normally; on a miss, go to the next step.
• Access the TLB slice with the low bits of the page number (bits 14, 15 and 16) and obtain enough bits (bits 14...19) to address the secondary cache. Note that the index for secondary cache access consists of a page offset portion (bits 0...13 of the virtual address, which are the same as bits 0...13 of the physical address, since the page size is 16 kbyte) plus bits 14...19. This is shown in Figure 6. The secondary cache is 512 kbyte and two-way set-associative. This translation is a 'guess' and may be wrong. Drive the physical index out to the secondary cache RAMs for data and tag. The tag is a virtual address (bits 20...31 of the virtual address). If the tag matches the translation, the guess was correct and there is a secondary cache hit. If the tag does not match, this constitutes either a first level TLB miss or a secondary cache miss. The full TLB should be accessed to resolve the former. If there is a miss when reading the full TLB, the TLB should be refilled. If there is a hit in the full TLB, a comparison should be made between bits 19...14 of the physical address and bits 19...14 obtained in the previous step by accessing the first level TLB with bits 14...16 of the virtual address. If the comparison reveals a match, there has been a secondary cache miss. If there is no match, the first level TLB translation was incorrect, and the correct contents of bits 19...14 of the physical address should be copied into the first level TLB from the second level TLB.
• A correct translation but a wrong secondary cache tag points to an alias problem and not a secondary cache miss. The full physical address should be compared with the physical tag in the secondary cache.
• Following a secondary cache miss, the cache line should be refilled from main memory.

Figure 6. Two-level TLB and cache design for R6000

Figure 7. Two-level TLB lookup algorithm

The main TLB is implemented as shown in Figure 8. The secondary cache of 64 k words consists of 60 k words of data, 2 k words of TLB entries and 2 k words of physical tags.

Figure 8. Secondary cache contents (banks 0 and 1: data, full TLB table, physical tags)

Two-level cache design for R6000

The first level cache, consisting of 64 k of instruction cache and 16 k of data cache, and having an access time of 7 ns, operates at the same speed as the processor. The second level cache (configurable from 512 kbyte to 2 Mbyte combined) provides the desired hit rate with a one-cycle penalty for a primary cache miss. The secondary cache can be built using 15 ns SRAMs operating at 66.67 MHz. The primary caches require a fast hit time and therefore are direct-mapped; the secondary cache requires a high hit rate and hence is two-way set-associative. The primary cache is virtually addressed to allow parallel cache access and TLB lookup for address translation, as described previously. The secondary cache is physically addressed, providing a higher hit rate and eliminating synonym problems. The cache control logic for both primary and secondary caches is integrated on the R6000 chip. The algorithm is totally transparent to system designers - cache RAMs are simply connected to the processor.

R6010 floating-point coprocessor
The R6010 is similar to its counterpart, the R3010 coprocessor in CMOS technology. The MIPS architecture assumes a separate register file for the use of a floating-point coprocessor, and therefore the R6010 contains a register file of 16 64-bit registers for floating-point instructions. The R3010 and R6010 floating-point coprocessors conform to the requirements of ANSI/IEEE Standard 754-1985. The R6010 incorporates the register file, pipeline control, precise exception control logic and a 64-bit ALU for floating-point add, subtract, compare, absolute value, negate and conversion between formats. Both the R3010 and R6010 have a full 64-bit datapath for double precision computation. Existing ECL densities did not, however, permit an on-chip 64-bit fast multiplier. The R6010 has on-chip control logic for working with an ECL floating-point multiplier chip (BIT's B3110), used for multiply, divide and square root operations. The R3010 CMOS implementation does have an on-chip floating-point multiplier and divider. The MIPS architecture is aimed at high vector performance without sacrificing scalar performance. Performance for floating-point vectorized code can be increased by improving the repeat rate (i.e. the number of cycles before an operation of the same type can begin). However, a very fast repeat rate requires a deeply pipelined implementation and therefore higher latency (i.e. the number of cycles before a subsequent instruction may use the result without causing the pipeline to stall). Higher latencies are
particularly bad for scalar code; for vector code, higher latencies imply locking up registers for a prolonged time, forcing lower register usage on the compiler. A trade-off must be made to achieve a high repeat rate and low latency. A typical vector code

      DO 10 I = 1, 100
        DY(I) = DY(I) + DA*DX(I)
   10 CONTINUE

requires two loads (DX and DY), one multiply, one add and one store. Even if the floating-point multiply instruction were pipelined to produce multiplication results every cycle, such a heavy-duty multiplier could not be used effectively, since a new multiplication cannot be started until the load, add and store operations are finished. Thus the repeat rate of the multiply and add operations was matched with the load/store operations in the R3000 CMOS implementation to use the multiplier unit optimally. While one DP (double precision) multiplication is taking place, another DP add and loads/stores may be completed, ensuring readiness to start another multiply. Thus, two double precision floating-point operations (DP flops) can be completed in five cycles, achieving 10 DP Mflops (peak) for an R3000 running at 25 MHz. For the ECL implementation, the repeat rates for the operations were reduced along with the latencies. This required a proper matching with the rate of loading operands; hence, 64-bit load operations were implemented to execute in one cycle. The two floating-point operations - one DP multiply and one DP add - can be completed in four cycles, thus achieving 33.3 DP Mflops (peak) at 66.67 MHz. By concentrating on reducing latencies per operation and developing sophisticated pipeline scheduling techniques, operations may be overlapped wherever possible. For example, a divide instruction, taking the R6010 14 cycles to complete, can be overlapped with 13 other instructions (floating-point and integer) to complete 14 instructions in 14 cycles, averaging one cycle per instruction.

Some machines are designed to fetch and execute multiple instructions in a single cycle; however, there are a number of restrictions on the type of instructions that may be combined in one cycle. With its low latency per operation and sophisticated compiler pipeline scheduling techniques, the R6010 avoids the restrictions on the pairing of instructions observed in superscalar machines. There are specialized vector processing machines that use vector instructions, multiple functional units, etc., to achieve a high performance range in executing code that can be parallelized. However, performance is severely degraded for code that does not lend itself to vectorization. For real-life scientific and engineering application programs, such a high degree of parallelization is seldom observed. Table 2 lists the repeat rate and latency for both single and double precision operations for the R6010. Figures 9, 10 and 11 show how the R6010 provides very good vector performance when compared with specialized and expensive vector processors while maintaining superior scalar performance. In each of these charts, the MIPS RC3260 and RC6280 machines shown are based on the R3000 and R6000 processors respectively.
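The peak-Mflops arithmetic above can be checked directly: peak rate is flops completed per cycle multiplied by the clock frequency.

```python
# Peak rate = flops completed per cycle x clock frequency (MHz -> Mflops).
def peak_mflops(flops, cycles, clock_mhz):
    return flops / cycles * clock_mhz

r3000_peak = peak_mflops(2, 5, 25.0)     # one DP multiply + one DP add per 5 cycles
r6000_peak = peak_mflops(2, 4, 66.67)    # the ECL part completes the pair in 4 cycles
print(r3000_peak, round(r6000_peak, 1))  # 10.0 33.3
```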
Table 2. Repeat rate/latency

                                 Repeat rate        Latency
Operation                      Single   Double   Single   Double
Add, subtract, conversions        2        2        3        3
Move, absolute value, negate      1        1        2        2
Compare                           1        1        2        2
Multiply                          3        4        4        6
Divide                           13       22       14       24
Square root                      22       40       23       42
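Using the Table 2 latencies, a toy issue model (one instruction issued per cycle, completion at issue time plus latency; a simplification of the real pipeline) reproduces the divide-overlap example from the text.

```python
# Single precision latencies from Table 2; anything else is treated as a
# single-cycle integer op. One instruction issues per cycle.

LATENCY = {'divide': 14, 'multiply': 4, 'add': 3}

def cycles_to_complete(schedule):
    finish = 0
    for issue, op in enumerate(schedule):
        finish = max(finish, issue + LATENCY.get(op, 1))
    return finish

# A 14-cycle divide overlapped with 13 independent single-cycle ops:
prog = ['divide'] + ['int'] * 13
print(cycles_to_complete(prog))   # 14: 14 instructions retired in 14 cycles
```

The model assumes the 13 following operations are independent of the divide result; a dependent instruction would stall until the divide's latency had elapsed.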
Figure 9. 100 × 100 Linpack DP performance in Mflops (VAX 780, SPARC Sys 390, MIPS RC3260, Alliant FX/80-4, Alliant FX/80-8, Multiflow Trace 7/300, MIPS RC6280, MIPS R6000 (66.7 MHz), R6000 (80 MHz, scaled), Convex C-210, Convex C-220, Cray 1S)
Figure 10. Livermore loops DP performance in Mflops (VAX 780, Sun 4/200, VAX 8700, Alliant FX/8, MIPS RC3260, Convex C-210, Cray X/MP, MIPS RC6280 (66.7 MHz))
Bus controller chip

The R6020 system bus controller is a high-performance ECL chip designed as an interface between processor and memory, and between I/O and memory. The R6020 incorporates a DRAM controller, an interrupt controller and an I/O controller. The system bus, as shown in Figure 13, has a peak bandwidth of 266 Mbyte/s at 66.7 MHz, providing adequate bandwidth to memory for both the processing units and the I/O subsystem in a shared-bus microprocessor-based system.
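The quoted peak bandwidth is consistent with a 32-bit (4-byte) bus completing one transfer per 15 ns bus cycle:

```python
# Peak bus bandwidth: one 4-byte transfer per bus cycle.
bytes_per_transfer = 4          # 32-bit bus
bus_clock_hz = 66.7e6           # one transfer per 15 ns cycle
peak_bytes_per_s = bytes_per_transfer * bus_clock_hz
print(int(peak_bytes_per_s / 1e6))   # 266 (Mbyte/s, truncated)
```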
Figure 11. Doduc DP relative performance (VAX 780 = 1). Monte Carlo simulation: an example of scalar computing which does not lend itself to vectorization

SYSTEM DESIGN
MIPS Computer Systems introduced an R6000-based machine (the R6280 RISComputer) in November 1989.
Figure 12. R6000 and R3000 benchmarks in VAX mips (comp, doduc, spice, ccl, espresso, yacc, gnuchess, terse, tex, 5diff, nroff, ccom, uopt, wolf, as1; R6000 (80 MHz): 65 mips*, R2000 (16.7 MHz): 12 mips; *scaled from 66.7 MHz)
Figure 13. R6280 system block diagram

The block diagram of the system is shown in Figure 13. The CPU communicates with main memory and the I/O adapter through the system bus. The system bus is a 32-bit wide, 15 ns synchronous bus providing a 266 Mbyte/s channel to memory and I/O adapters. The system bus uses differential line drivers and a centrally distributed differential clock to minimize skew. With no net current flowing across the backplane connectors (except on transmission boundaries), a high degree of noise immunity is achieved. Each connection to the system bus is controlled by an R6020 bus controller chip that contains I/O queues and all the logic necessary to maximize bus utilization. The system can support up to 256 Mbyte of main memory. Each memory board contains 32 Mbyte of DRAM, control logic and an R6020 bus controller chip. The R6020, as mentioned earlier, handles the DRAM refresh timing and performs single-bit error correction and double-bit error detection on each 32-bit word. The eight-way interleaved memory supports all memory operations (word, partial word and block read/write and synchronization) for high-speed communication with the CPU and I/O adapters.

DEBUGGING A SYSTEM
No in-circuit emulator exists for the R6000 that runs at a speed of 66.67 MHz. In fact, it is unlikely that such ICE machines will ever be built for the high-frequency deeply pipelined RISC processors such as the R6000. However, the architecture simulator was extensively used for system modelling, code debugging and performance analysis. The Systems Programmers Package (SPP), a customizable package of software tools, all available in high-level language, was used by system designers to write standalone software systems, create new operating systems, modify existing kernels, and develop machine diagnostics prior to the existence of R6000 hardware. There are tools to download code from a host machine to the bare target machine to bring up the total functioning system. The SPP software consists of architecture simulator,
cache simulator, standalone I/O library, debug monitor, program download utilities, sample boot PROM code and diagnostics. With the integrity of the software checked by the simulator, the task of debugging the system consisted of checking proper timings. Traditional high-frequency logic analysers were used for this purpose.
PERFORMANCE
Figure 12 shows the performance of a system based on the R6000 processor, the R6010 coprocessor, 512 kbyte of secondary cache, 64 kbyte of primary instruction cache and 16 kbyte of primary data cache. The benchmarks used are listed below.

• comp : Unix program 'compress', which reduces the size of the named files using adaptive Lempel-Ziv coding
• doduc : nuclear reactor program performing a Monte Carlo simulation
• spice : circuit simulator
• ccl : C compiler
• espresso : Berkeley CAD tool
• yacc : Unix compiler-compiler
• gnuchess : chess program that simulates movements of eight queens
• terse : RTL behavioural modelling program
• tex : text processor
• 5diff : Unix program that compares two files
• nroff : text processor
• ccom : front end of the MIPS C compiler
• uopt : the UCODE optimizer in the MIPS compiler
• wolf : chip router program
• as1 : MIPS assembler for pipeline scheduling

CONCLUSION
Optimizing compilers [8] are critical to RISC processor performance, and a tremendous amount of research and development has been carried out in this area. For
example, MIPS Computer Systems has invested 100 man-years of development work in compilers alone, and a similar effort in operating system development. To maximize the utilization of software development work, the same architecture must be implemented in as many technologies as possible to address the performance needs of all segments. Gallium arsenide, silicon-on-sapphire and other exotic technologies will be natural candidates for future generation implementations. There is a tendency in the computer industry to scale performance from measurements made on a system running at low frequency. Unless the code is executed entirely from cache (in which case the code should not be used to benchmark the system to begin with), this scaling is grossly inaccurate, since the memory subsystem cannot be scaled so easily. As the processor clock speed increases, an effective cache and main memory architecture must be designed to run with the processor. Figure 14 shows how the same MIPS architecture has been implemented in different technologies to scale along the operating frequencies. As newer technologies allow the processor to run faster, innovative solutions such as multilevel cache and TLB designs must be developed. The chip architects, compiler and O/S developers, system architects and VLSI designers must work together to maximize performance in these implementations.

Figure 14. Scaling of performance in VAX mips: M/500 (R2000, 8 MHz, 1986), M/120 (R2000, 15 MHz, 1987), M/2000 (R3000, 25 MHz, 20 mips, 1988), RC6280 (R6000, 66.7 MHz, 55 mips, 1H90) and R6000 (80 MHz, 65 mips, 2H90)

REFERENCES
1 Wilson, O 'Creating low-power bipolar ECL at VLSI densities' VLSI Syst. Des. Vol 7 No 5 (May 1986) pp 84-88
2 Patterson, D A and Sequin, C H 'A VLSI RISC' IEEE Comput. Vol 15 No 9 (September 1982) pp 8-18
3 Ditzel, D and McLellan, R 'Register allocation for free: the C machine stack cache' SIGPLAN Notices Vol 17 No 4 (April 1982) pp 48-56
4 Sachs, H and Hollingsworth, W 'A high performance 846 000 transistor UNIX engine: the Fairchild CLIPPER' Proc. ICCD, IEEE (October 1985) pp 104-108
5 Ragan-Kelley, R and Clark, R 'Applying RISC theory to a large computer' Comput. Des. Vol 22 No 20 (November 1983) p 44
6 Chow, F 'Minimizing register usage penalty at procedure calls' Proc. ACM SIGPLAN '88 Symp. Programming Lang. Des. & Impl. Vol 23 No 7 (June 1988) pp 85-94
7 Lee, J M and Weinberger, A 'A solution to the synonym problem' IBM Tech. Discl. Bull. Vol 22 No 8A (1980) pp 3331-3333
8 Chow, F et al. 'Engineering a RISC compiler system' Proc. COMPCON Spring '86 pp 132-137

Ashis Khan received an MS in electrical engineering from the State University of New York, NY, USA, and a Bachelor's degree in electrical engineering from the Indian Institute of Technology in 1981. He started his career at Intel Corporation, where he was a senior design engineer in the 80386 group. Khan joined MIPS Computer Systems in 1988 as a technical specialist in the design of RISC-based systems. His research interests are computer architecture and parallel processing.