Design and implementation of the ‘Tiny RISC’ microprocessor

Design and implementation of the ‘Tiny RISC’ microprocessor

Design and implementation of the 'Tiny RISC' microprocessor Designed for optimized data path operation, Tiny RISC is a 16-bit RISC-style microprocesso...

548KB Sizes 12 Downloads 78 Views

Design and implementation of the 'Tiny RISC' microprocessor Designed for optimized data path operation, Tiny RISC is a 16-bit RISC-style microprocessor and is described by Arthur Abnous, Christopher Christensen, Jeffrey Gray, John Lenell, Andrew Naylor and Nader Bagherzadeh

This paper describes the design and implementation of "Tiny RISC" a 16-bit RISC-style microprocessor designed at UC Irvine. It was fabricated by MOSIS in a 2 p m n-well CMOS technology. Tiny RISC has been designed for optimized data path operation. A simple 32-bit extension of the data path results in a generic functional unit of a VLIW (very long instruction word) processor that is currently being developed at UC Irvine. RISC

VLSI

CMOS

pipelining multi-phaseclocking

This report presents the VLSI design and implementation of the Tiny RISC microprocessor. Tiny RISC is a 16-bit microprocessor with a RlSC-style instruction set. The chips were fabricated by MOSIS 1 in a 2 p m n-well, twolevel metal CMOS technology. It measures 3.9 by 2.6 mm and contains 12 000 transistors. The processor is pipelined, and can execute an instruction in each cycle. The instruction set is designed for efficient pipelining and decoding. Tiny RISC has been designed for optimized data path operation. A simple 32-bit extension of the data path results in a generic functional unit of a pipelined VLIW processor called VIPER that is currently being developed at UC Irvine2' 3. Tiny RISC utilizes a pipeline structure that is refined for increased performance in a processor with multiple functional units such as VIPER. Essentially, Tiny RISC is a 16-bit version of the functional units used in VIPER, designed in the form of a stand-alone microprocessor that can be used as a macro-cell for a variety of applications that require control processors with high Department of Electrical and Computer Engineering,University of California, Irvine, Irvine,CA 92717, USA Paperreceived: 21 January1992. Revised:29 April 1992

performance, low cost and small die area. An example would be the control processor of multi-media chips. A full-custom approach was taken in the design of Tiny RISC. This approach results in higher performance and smaller die area compared to automated methods of synthesizing data paths. Large control structures were implemented using PLAs in the first version of Tiny RISC. The control circuitry is currently being modified to use standard cells almost exclusively. The first version of Tiny RISC achieved a cycle time of 70 ns (14.3 MHz clock speed). This gives it a peak performance of 14 (I 6-bit) MIPS. The processor was laid out using the MAGIC VLSl layout system 4. IRSlM s was used extensively to simulate and verify the operation of the processor.

INSTRUCTION

SET

Table I outlines the instruction set of Tiny RISC. Instruction formats are shown in Figure I. All instructions, memory addresses and operands are 16 bits wide. Most instructions are of the arithmetic/logic/shift type. srcl and src2 are source registers and dest is the destination register. There are eight general-purpose registers, R0 to R7. R0 is hardwired to contain zero at all times. Writing to it is allowed but will not affect its content. Shift instructions have two versions, constant and variable. For constant shift instructions, the shift amount is specified by the lower four bits of a 5-bit immediate value (I). For variable shift instructions, the shift amount is specified by the lower four bits of the contents of src2. ADDI and SUBI are the immediate versions of ADD and SUB with src2 replaced by a 5-bit immediate value (I), which is always zeroextended. The LDI instruction loads an 8-bit immediate value (LI) into dest; LI is also zero-extended. Set instructions (SEQ, SLT, SLTU, SGE, SGEU) set or clear the

0141-9331/92/040187-07 © 1992 Butterworth-Heinemann Ltd

Vo116 No 4 "/992

187

Table 1.

Instruction set

Opcode

Operands

Description

ADD SUB AND OR XOR XNOR LSL LSR ASR

srcl ,src2,dest srcl ,src2,dest srcl ,src2,dest srcl ,src2,dest srcl ,src2,dest srcl ,src2,dest srcl ,src2,dest srcl ,src2,dest srcl ,src2,dest

SEQ SLT SLTU SGE SGEU

srcl ,src2,dest srcl ,src2,dest srcl ,src2,dest srcl ,src2,dest srcl ,src2,dest

ADDI SUBI LSLI LSRI ASRI

srcl ,#1,dest srcl ,#1,dest srcl ,#l,dest srcl ,#l,dest srcl ,#1,dest

LDI LDW STW BRF BRT JUMP CALL

#LI,dest srcl ,dest srcl ,src2 srcl ,#O srcl ,#O srcl srcl ,dest

Add Subtract Bitwise AND Bitwise OR Bitwise exclusive-OR Bitwise exclusive-NOR Logical shift left (variable) Logical shift right (variable) Arithmetic shift right (variable) Set if equal Set if less than Set if less than unsigned Set if greater than or equal Set if greater than or equal unsigned Add immediate Subtract immediate Logical shift left (constant) Logical shift right (constant) Arithmetic shift right (constant) Load immediate Load word Store word Branch if false Branch if true Jump Call

least significant bit of dest by comparing srcl and src2 an(i checking for the specified condition. The upper 15 bits are always cleared to zero. Set instructions are usually used to compute condition codes for a subsequent branch instruction. Load/Store instructions use the register indirect addressing mode. There are two instructions in this category: LDW and STW. Register indirect addressing was chosen based on the pipeline structure of Tiny RISC. For both instructions, srcl contains a data memory address. The LDW instruction loads M[srcl] into dest (in this context, M[x] refers to the content of the memory location addressed by register x). The STW instruction stores the content of src2 to M[srcl]. Tiny RISC has a Harvard architecture. There are two separate memory spaces: one for instructions and one for data. Each memory is accessed via a separate 16-bit multiplexed bus. The DATA bus provides access to the data memory, and the INST bus provides access to the instruction memory. The only difference between the operations of the two buses is that there are no write operations to the instruction memory. Branch instructions (BRF and BRT) check the least significant bit of srcl and transfer control to the target of the branch if the specified condition is met. This allows a single branch delay slot. The target address of branch instructions is computed by adding an 8-bit immediate offset (O) to the address of the instruction following the branch. O is always sign-extended. The CALL instruction is used for procedure call operations. The Program Counter is loaded with srcl, and the return address is saved into dest. The JUMP instruction loads the PC with the content of srcl. It is used to return from procedure calls.

PIPELINE STRUCTURE

5

3

Compute 5

3

Immediate

Long lmmecliate I

Branch Figure 1.

5

3

opcode

I dest [

5

3

I

3 3 2 srcl I src2 ~/,/~J 3 srcl

J

5 l<4..0>

8 LI<7..0> 3 sin1 I

5 0<4..0>

Figure 2 shows the pipeline structure of Tiny RISC. There are four pipeline stages: IF (Instruction Fetch), ID (Instruction Decode), EX (EXecute), and WB (Write Back). The instruction set was designed with this pipeline structure in mind. Each pipeline stage takes one cycle to complete its function. The timing of each cycle is derived from a non-overlapping four-phase clock. As shown in Figure 3, there are five different clock signals along with their complements: Phl, Ph2, Ph3, Ph4 and Ph123. During the IF stage, an instruction is fetched from the program memory. The instruction is decoded, and the source operands are read from the register file during the ID stage. Most control signals are generated in this stage.

I

J

j

Instruction formats Time

I

IF

I

ID IF

EX

WB

ID

EX

WB

IF

ID

EX

IF

ID

+1 EX

~

J

IF : Instrucbon Fetch ID : InslzuctionDecode EX : Execute WB : Write Back

Figure 2.

188

Pipeline structure of Tiny RISC

Microprocessors and Microsystems

Phl

I

I I

Ph2

I I

Ph3

I

--I_

Ph4 Ph 123

Figure 3.

IN

_J

I

CLK

1

OUT

CLK

Four-phase clocking scheme

The control signals that become active during later pipeline stages are delayed by the proper amount. Control transfer instructions update the PC at the end of the ID stage. This means that all control transfer instructions have a delay of one cycle. The instruction following a control transfer instruction is always fetched and executed. It is the programmer's responsibility to schedule a useful instruction in the delay slot. If such an instruction is not found a NOP should be scheduled in the delay slot. During the EX stage, the actual operation specified by the instruction is performed. LDW and STW instructions access the data memory duringthis stage. In the WB stage, the result of the instruction is written to the register file. To avoid data hazards caused by the delay between RD and WB stages, Tiny RISC uses one level of bypassing. Many RISC processors have an extra MEM stage between the EX and WB stages that is used to access the data cache 6' 7. Load/store instructions use the displacement addressing mode. The EX stage is used to add an immediate displacement value to a base register to obtain the effective address of the operand being accessed. The effective address is then used in the MEM stage to access the data cache. A side effect of this scheme is that an instruction could have a data dependency on two previous instructions in the pipeline. This is due to the fact that there are now two pipeline stages separating the ID and WB stages. An extra level of bypassing is needed to alleviate the additional read-after-write (RAW) hazards that can occur. The additional cost of an extra level of bypassing is minimal in a RISC processor. Since Tiny RISC was designed to be used as a functional unit in a VLIW processor, its pipeline structure was designed for increased performance in a processor with multiple functional units. The complexity of bypassing in a processor with n functional units grows in O(n2). By eliminating the MEM stage and using register indirect addressing, the cost of the bypassing circuitry can be reduced by half. This problem is analysed in detail in Reference 3.

DATA PATH DESIGN

The data path of the processor is divided into three units (see Figure 5): the Operand Unit, the Execution Unit and the PC Unit. The data path includes the register file, the bypassing logic, the load/store unit, the shifter, the ALU and the PC unit. Each bit-slice of the data path is 56pm wide and can accommodate seven metal2 (second-level metal) tracks running in the vertical direction. One of these tracks is used to distribute power or ground. This track is shared between two adjacent bit-slices. This is done by flipping every other bit-slice. Static flow-through

Vol 76 No 4 1992

Data path latch cell

Figure 4.

Operand Unit

Clock

Control

ExecutionUnit

PC Unit

Figure 5.

i

Floor plan of Tiny RISC

latches were used throughout the data path. Figure 4 shows the schematic diagram of the latch cell.

OPERAND

UNIT

The Operand Unit (OU) consists of the following hardware blocks: the register file, the bypassing logic and the load/store unit. The block diagram of the register file and the bypassing logic is shown in Figure 6. The block diagram of the load/store unit is shown in Figure 7. The register file is dual-ported and contains eight

789

Register File

[ Ph4 - ~

Read/Write Circuitry srcl

(test

f

DEST

I

I

src2

oP,

I

oP2

I--j v

0/-7 \,

\,

0 //

Ph3

bypass to src2 bypass to srcl IMM

k 0

SRC1

+ D

Figure 6.

1/

I sRc2

select_immediate

I"-1 Ph4

S2

S1

Block diagram of the Operand Unit

registers, R0 to R7. The core of the register file consists of an 8 by 16 array of register cells. The register cell design was based on the CMOS version of the six-transistor dualport static RAM cell with split word lines 9, which is shown in Figure 8.

D $1 $2

The timing of the register file is shown in Figure 9. During Ph2, the bit lines are precharged and the source register addresses are decoded. In Ph3, the word lines are driven by the decoders and the read operation takes place. In Ph4, the source operands are driven onto the $1 and $2 buses. Also, the register address for the write operation to follow in the next Phl is decoded. During

WORD1

1~1~DR 0S~1~ Phl Ph2

C-3

to DATA bus pads

~'-~ LOAD [

T

ATA

DIR. ._~ Ph123

WORD2

L

fromDATAbus pads Figure 7.

190

Block diagram of the Load/Store Unit

BIT2 Figure 8.

BIT1 Dual-port register file cell

Microprocessors and Microsystems

I Figure 9.

Phl

Ph2

Ph3

wdte

I prechargebit lines I I decodereadaddressesI

read

Ph4 wdte address

Timing of the register file

Phl, the desired data is driven onto the bit lines, the output of the decoders is driven onto the word lines and the write operation takes place. For load/store operations, the data memory is accessed via a 16-bit multiplexed bus (the DATA bus). When Phl rises, the memory address (srcl) is loaded into MADR (Memory Address/Data Register) and driven onto the DATA bus. This address must be latched externally. Phl sePJes as the strobe signal for the external address latch. The address should be latched when Phl falls. The DATA bus holds the address until Ph2 rises. The dead time between Phl and Ph2 provides the proper hold time for the external address latch. For load operations, when Ph2 rises, the DATA bus is tri-stated to accommodate the incoming data from memory. The data is loaded into DIR (Data In Register) with the falling edge of Ph3 and is written to the register file in the WB stage. For store instructions, when Ph2 rises, the data to be stored to memory (src2) is driven onto the DATA bus and is then written to the data memory. The instruction memory is accessed via the INST bus, which is identical to the DATA bus except that there are no write operations to the instruction memory. MADR is a special latch that has two inputs (10 and I1) and two clock (or select) signals (SO and $1). The select signals are assumed to be non-overlapping, i.e., they can never be both high at the same time. When SO goes high, I0 is driven to the output. When $1 goes high, I1 is driven to the output. When SO and $1 are both low, the output is latched through bistable action. The schematic diagram of the multiplexing latch is shown in Figure 10. This circuit achieves both the logical operation and the timing required for load/store instructions.

I1

I0

$1I

SO

OUt

Figure 10.

A multiplexing latch

D $1 $2

\

/

?

EXECUTION UNIT The block diagram of the Execution Unit is shown in Figure 11. The Execution Unit consists of two hardware units: the ALU and the shifter. The ALU computes all arithmetic, logical, and comparison operations required by the instruction set. For arithmetic and logical instructions, the specified operation is performed on the source operands and the result is written back into the destination register. For comparison operations, srcl and src2 are compared, and depending on the outcome of the comparison, a one or a zero is wdtten back into dest. The ALU utilizes a dynamic Manchester carry chain 8. The function of the shifter is to shift the operand on $1 by the amount specified by the least significant four bits of the operand on $2. The core of the shifter consists of a 16 x 16 matrix of pass transistors, forming a crossbar switch 9. The ALU and the shifter make extensive use of dynamic logic. The timing of the Execution Unit is as follows. During Ph123 the logic blocks evaluate; during Ph4 the result is driven onto the D bus, and the logic blocks are precharged.

Vol 16 No 4 1992

p

I

Figure 11.

?

I'- Ph'

Block diagram of the Execution Unit

PC UNIT The main function of the PC Unit is to execute instructions that affect the Program Counter (PC), i.e., JUMP, CALL, BRF and BRT. The block diagram of the PC Unit is shown in

191

D $1

branch offset

I

t-- P.,

I

\Zr/ 1 0

take_branch

1/

I 0

\o

1/

reset

\o

1/

jump_or reset

I

\

Ph4

PC

I V Adder

1

/

NPC

NPC_slave RPC

Ph123

Ph4 Ph123

!

To INST bus pads Figure 12.

CONCLUSION We have presented the VLSI design of a RISC-style 16-bit microprocessor. The microprocessor was fabricated in a 2 pm n-well CMOS technology. The chips were tested and found to be working on first silicon. The processor has a cycle time of 70 ns and achieves a peak performance of 14 (16-bit) MIPS. The design required about 12000 transistors.

Block diagram of the PC Unit

Figure 12. The JUMP instruction changes the PC by loading it with srcl. The CALL instruction is similar to JUMP except that it also saves a return address into dest. Branch instructions specify a 16-bit signed offset (O) that is added to the content of the PC to compute the address of the target of the branch. The branch condition is the LSB (least significant bit) of the srcl register that is specified by the instruction. For the BRT instruction, the PC is loaded with PC + O if the LSB of srcl is high; for BRF, the PC is loaded with the target address of the branch if the LSB of srcl is low.

SIMULATION

AND TESTING

In order to verify the functionality of our design, we developed a simulation environment that allowed us to

192

easily compare the expected state of the Docessor to th~ actual state produced by a circuit or switch-level simulator. The first component that was needed for this method was a means of producing the expected result. This was accomplished by writing an RTL (Register Transfer Level) simulator in C (TPSIM). The input to TPSIM is a Tiny RISC assembly program. TPSIM simulates the execution of the input program and computes the expected state of the processor at the end of each machine cycle. The output of TPSIM is written to a file. A helpful feature of TPSIM is that it allows random data to be loaded into the processor; all load instructions that access memory address zero receive a random number. This allows long, non-repeating test programs to be written conveniently using small looping sequences of instructions. IRSIM was used to simulate the extracted layout of the entire processor (including the pad frame). IRSIM was used in the linear mode. The simulation process is driven by a command file that supplies stimuli to IRSIM. The command files are generated by TPSIM, which also produces test vector files that were used to test the fabricated chips. The command files also contain commands that instruct IRSIM to produce an output file that contains the state of the processor at the end of each machine cycle. This output file, along with the expected state file generated by TPSIM, is then passed to the comparison stage of the testing system. The SIMCOMP program was written to compare the output state file produced by I RSIM to the expected state file generated by TPSIM. SIMCOMP detects and reports any mismatches that it encounters. SIMCOMP produces a file in which the expected and actual state of the processor are compared. All errors are visibly marked; this allows designers to easily track down the errors and solve the associated problems.

REFERENCES I

Tomovich, C MOSIS User Manual Release 3.1, USC/ Information Sciences Institute, Marina Del Rey, CA (1988)

2 Abnous, A, Potasman, R, Bagherzadeh, N and Nicolau, A 'A percolation based VLIW architecture' Proceedings of 1991 International Conference on Parallel Processing (1991 ) pp 144-148 3 Abnous, A 'Architecural Design of a VLIW Processor' MS Thesis, Dept. of Electrical and Computer Engineering, University of California, Irvine (1991) 4 0 u s t e r h o u t , J K, Hamachi, G T, Mayo R N, Scott, W S and Taylor, G S 'The Magic VLSI layout system' IEEE Des. Test Comp. (February 1985) 5 Salz, A and Horowitz, M 'IRSIM: An incremental MOS switch-level simulator' Proceeding sof the 26th

Microprocessors and Microsystems

Design A u t o m a t i o n Conference ACM/IEEE, Las Vegas,

Nevada (June 1989) pp 173-178 6 Chow, P and Horowilz, M 'Architectural tradeoffs in the design of MIPS-X' Proceedings of the 14th A n n u a l

a multiprocessor workstation' IEEE'J. SoL State Circuits (December 1989) pp 1688-1698 Weste, N and Eshraghian, K Principles o f C M O S VLSI Design Addison-Wesley, Reading, MA (1985)

Pittsburgh, Pennsylvania (June 1987) pp 300-308

Sherburne, R W, Katevenis, M G H, Patterson, D A and Sequin, C H 'Datapath design for RISC'

7 Lee, D D, Kong, S I, Hill, M D, Taylor, G S, Hodges, D A, Katz, R H and Patterson, D A 'A VLSI chip set of

Proceedings of the Conference on Advanced Research in VLSI MIT (January 1982)

International Symposium

Vol 36 N o 4 7992

on Computer Architecture

Arthur Abnous received BS (Summa Cum Laude) and MS degrees in electrical engineering from the University of California, Irvine, USA, in 1989 and 1991, respectively. He is currently a PhD student at the University of California, Irvine. His research interests include computer architecture, high performance processor design and VLSI system design.

Christopher Christensen was born in Los Angeles, California on 13 October /967. He received his BSEE from the University of California, Irvine, USA, where he is currently working on his MSEE.

Andy NayIor received his BS degreein electrical engineering from the University of California, Irvine, USA, in 1991.He is currently working on his MS degree at UCI. His areasof interest for researchare in computer architecture and VLSI circuit design.

John Lenell is currently a graduate student in the Department of Electrical Engineering at the University of California, Irvine, USA. He earned his BS in electrical engineering from UCI in 1990. His research interests include computer architecture, superscalar microprocessors and VLSI circuit design.

Jetlrey Gray received a BS in engineering from Harvey Mudd College in Claremont, Cafifornia, USA, in 1990. He is currently eaming an MS at the University of California at Irvine. His area of research is VLSI design and microprocessor architecture.

Nader Bagherzadeh received a PhD degree in computer engineering from the University of Texasat Austin, USA, in 198 7. Since 1988 he has been with the Department of Electrical and Computer Engineering at University of California, Irvine. His research interests are parallel processing, computer architecture and VLSI design.

193