Design of the Kydon-RISC processor


Microprocessors and Microsystems 25 (2001) 1–18

www.elsevier.nl/locate/micpro

H.Y. Yang a,*, S.J. Mertoguno b, N.G. Bourbakis a,b,c

a Image-Video and Machine Vision Laboratory, Binghamton University, Binghamton, NY 13902, USA
b Int. Systems Laboratory, Department of ECE, Technical University of Crete, Chania 73100, Crete, Greece
c AIIS Inc., Vestal, New York, NY, USA

Received 11 April 2000; revised 12 October 2000; accepted 13 November 2000

Abstract

In this paper the design of a RISC, pipelined and superscalar processor (Kydon-RISC) is presented. The Kydon-RISC mainly consists of five independent execution units (integer, floating point, branch/jump, load/store, and I/O buffers). The unique features of the Kydon-RISC are the I/O buffers, which have the ability of sending and receiving multiple data packets to/from different resources (or processors) at the same time. A data packet represents the graph forms of the objects extracted from images received and processed by the Kydon vision system. A comparative evaluation among Kydon-RISC and other RISC processors has shown that the Kydon-RISC performs 15–25% better in a set of primitive operations (summation, many multiplication, matrix multiplication, bubble sort, procedure call, etc.) used in any large scale program. © 2001 Published by Elsevier Science B.V.

Keywords: RISC processor design; Superscalar; Pipeline; I/O buffers

1. Introduction

Computer vision is one of the most challenging research fields of computing, with a variety of difficult issues such as understanding complex environments, providing interpretations of a scene, etc. In response to these challenges, researchers have designed and/or developed various system architectures, such as pyramids, trees, cubes, arrays, etc., at a theoretical or practical level of implementation [29]. A multi-layered array processor architecture (called Kydon) has been designed, and the VLSI evaluation of the first four layers was completed in Refs. [1,2]. The upper layer array processor's design was evaluated on the basis of a general-purpose microprocessor [30]. A specialized RISC design of the processor node of each array processor, at the upper layers of the Kydon system, was a necessity to improve the overall real-time performance of the Kydon system. Thus, this paper deals with the RISC processor (PE) design and its first-stage evaluation in comparison with other RISC microprocessors.

With many advantages over the CISC (complex instruction set computer), the RISC (reduced instruction set computer) seems to be the winner in the CPU competition [4,5,7,8,11,12]. Studies have shown that the most frequently used instructions are only a small portion of the big instruction set [12,15]. A big instruction set, as well as a few more fancy addressing modes, make the control unit surprisingly huge and complicated [9,12]. Owing to this, the control unit is always implemented in microcode, which is much slower than a hardwired one. Besides, a CISC usually has instruction formats with different lengths, which makes the decoding task even harder.

The idea of RISC is, first, to employ the frequently used instructions as the instruction set, while using a few instructions to achieve the same function performed by a much more complex instruction in a CISC [4,10,11,13,15]. Meanwhile, in a RISC, all instructions are of the same size (length). These factors mean that a hardwired control unit occupies a much smaller VLSI area than a microcoded implementation. Secondly, a RISC has a large number of general-purpose registers, largely reducing the frequency of the most time-consuming memory accesses. Finally, in terms of clock rate, a RISC, with its much simpler circuits, can have a higher clock rate, which again increases the performance of a processor [17,20–23,24,26,27].

In this paper, Section 2 talks briefly about the Kydon vision system [1,2] as well as the data propagation traffic in its higher layers. In Sections 3 and 4, the design of a Kydon-RISC processor is described. The Kydon-RISC processor [28] is designed for carrying out data propagation and image understanding in the higher layers of the Kydon vision system. In Section 5, the initial stage performance evaluation of the RISC processor is presented. Conclusions are given in Section 6.

* Corresponding author. 0141-9331/01/$ - see front matter © 2001 Published by Elsevier Science B.V. PII: S0141-9331(00)00101-0


H.Y. Yang et al. / Microprocessors and Microsystems 25 (2001) 1–18

Fig. 1. The global configuration of KYDON and connections among PEs. (Figure labels: layers L0, L1, L2, L3, ..., Li-1, Li+1, grouped into a lower level and an upper level.)

2. Kydon system

2.1. Global configuration

The Kydon vision system [1,2], shown in Fig. 1, is a self-organized, autonomous multi-layer architecture where all PEs or RISC-processors at the same layer are equal. This system consists of two main functional groups of layers: (i) preprocessing layers and low level understanding (lower level); and (ii) high-level understanding layers (higher level). At the low-level group, each layer is an array processor and its PE can send/receive data to/from its six neighbors in the same layer. In addition, every PE can receive data from the PE right below it (in its next lower layer) as well as send data to the PE right above it (in its next higher layer). The only exception is at the last layer of the low group, where seven PEs are connected to one PE at the first layer of the upper group. The preprocessing layers are capable of performing low-level vision processing, and the design of the PEs as well as their VLSI implementation has already been presented [31].

Fig. 2. Data propagation path directions and the detour bits. (Arrows with a (b) indicate that a PE is either in the upper or lower border.)

2.2. Upper layers: traffic [28]

Each of the higher layers of the Kydon system is an array processor of n_i × n_i PEs (n_i ≥ n_{i+1}, i ∈ Z) in a hexagonal mesh configuration. Each PE, holding some piece of the layer's distributed database, receives data from lower layers and executes some algorithms to recognize objects and understand the input images. The communication among the PEs at the same layer is similar to the one at the low group layers, while the up/down communication is based on a high-speed bus between layers for KB access and direct one-to-one or seven-to-one PE connections.

Before any image understanding operation starts, each PE in an array processor should receive all the information extracted from the input image. This means that when a PE receives data from its lower layer, it should immediately broadcast them to all other PEs in the same array. This broadcasting introduces various delays. For these delays (or propagation time) to be reduced, an effective propagation scheme should be defined to eliminate any possible redundancy, which occurs when the same data are sent to the same PE more than once. In Fig. 2, the propagation scheme is shown. Here, the direction-flow bit defines the direction in which the data flow from the PE sending the data to the PE receiving the data. In the Kydon system, the direction-flow bits are defined as follows:

• direction equals "0" (from a PE at the lower layer to a PE at the higher layer);
• direction equals "1" (from lower right to upper left direction);
• direction equals "2" (from lower to upper direction);
• direction equals "3" (from lower left to upper right direction);
• direction equals "4" (from upper left to lower right direction);
• direction equals "5" (from upper to lower direction);
• direction equals "6" (from upper right to lower left direction).

In order to make the most efficient propagation by reducing any possible redundancy, one specific bit, called the "detour bit, or direction-flow bit" in the data package, is set for tracking whether the direction of a data propagation path is ever changed. At this point, a set of rules is defined. In particular, if the direction of the propagation path is never changed, all the PEs involved should follow the rules below:

1. PEs (in a relative position of "0" in Fig. 2) which receive data from their lower layer must reset all the detour bits and send data to all six neighbors.
2. PEs in a relative position of "1" in Fig. 2 must send data to the PEs in direction "1"; if a PE is on the border, it must also send data to direction "6" and reset the detour bit.
3. PEs in a relative position of "2" in Fig. 2 must send data to the neighbors in directions "1", "2", and "3". When PEs send data to directions "1" and "3", the detour bit is set.
4. PEs in a relative position of "3" in Fig. 2 must send data to directions "3" and "4". When PEs send data to direction "4", the detour bit is set.
5. PEs in a relative position of "4" in Fig. 2 must send data to direction "4". If a PE is at the border, it must also send data to direction "3" and reset the detour bit.
6. PEs in a relative position of "5" in Fig. 2 must send data to their neighbors in directions "4", "5", and "6". When PEs send data to directions "4" and "6", the detour bit is set.
7. PEs in a relative position of "6" in Fig. 2 must send data to directions "6" and "1". When PEs send data to direction "1", the detour bit is set.

PEs whose detour bits are set just send data in the same direction as the one from which they received their data. There are two exceptions, however. PEs at the upper border that received data from direction "4" must send data to direction "3". PEs at the lower border that received data from direction "1" must send data to directions "4" and "3", "1" and "6". Both must reset the detour bits when they send data in different directions.

Fig. 3 shows a record of the directions and the detour (or direction) bits where one data packet is sent from the PE with gray color.

Fig. 3. An example of a data propagation record (note that for each (r,d) in a PE, "r" is the path direction and "d" is the detour bit).

Fig. 4 shows a traffic pattern example where only the processor at (0,0) broadcasts one data packet to all other processors in the same layer. In that figure, "PEs" means the number of PEs (processing elements) which are receiving data at specific cycles, and a cycle is defined as the time required for one data packet to be sent from one PE to a neighboring PE.

Fig. 4. The traffic pattern of one data packet from (0,0) travelling to all PEs at the same layer. (Plot: number of PEs versus cycle.)

Fig. 5 shows another example of a traffic pattern, with 20 data packets broadcast by 20 different PEs. In this example, the traffic patterns both with delays and without delays are shown. A delay happens when two PEs try to send data to each other at the same time.

Fig. 5. The traffic pattern of 20 data packets from 20 random locations (note the delays due to conflict among the PEs when exchanging packages). (Plot: number of PEs versus cycle, with and without delay.)

The main reason for this brief description of traffic patterns is to indicate the delays that occur due to the conflict among the PEs when they exchange packages at the same time. In addition, to make a point, the Kydon-RISC design will eliminate this type of traffic problem by making the information exchange among the PEs very efficient.

3. The Kydon-RISC processor

3.1. Instruction set

The Kydon-RISC's instruction set adopts instructions from Hermes-RISC [3,6], with some new ones added to it. The instruction set of the Kydon-RISC processor can be divided into five groups: integer instructions; floating point instructions; load/store instructions; branch instructions; and buffer instructions.

The integer instructions provide integer addition and subtraction operations, as well as shift and logical operations. In addition, the integer instructions also provide comparison operations that set the destination register according to the results of the operation. There is no integer multiplication or division, since these operations are rarely used in this system.

The floating point instructions provide the addition, subtraction, multiplication, and division operations. The instructions support both single and double precision floating point. The floating point instructions also have comparison operations that store the results by setting the destination integer register.

Load/store instructions are the only instructions that are capable of moving data into and out of memory, with data sizes of 8 bits, 16 bits, 32 bits, and 64 bits.

Buffer instructions are for data propagation purposes. The two instructions used for depositing data in the out-buffer (SOUT) and for retrieving data from the in-buffer (LIN) are conditional instructions. They are executed only if the out-buffer is not full and the in-buffer is not empty, respectively. Additionally, there are two instructions for resetting the buffers.

3.2. Instruction formats

R/R format:  OP(6) | RD(5) | RS1(5) | RS2(5) | Unused(11)
R/I format:  OP(6) | RD(5) | RS1(5) | Immediate(16)
BR format:   OP(6) | Code | RS1(5) | Displacement(16)
JMP format:  OP(6) | Displacement(26)
BUF format:  OP(6) | RD(5) | Unused(21)

Fig. 6. Instruction formats of the Kydon-RISC processor.
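The field layout in Fig. 6 can be illustrated with a small encoder/decoder. This is only a sketch under stated assumptions: the paper does not give numeric opcode values or bit ordering, so the code assumes that fields are packed from the most significant bit in the order drawn in Fig. 6, and that the BR "Code" field is 5 bits wide (inferred from 32 - 6 - 5 - 16). The names `FIELDS`, `encode`, and `decode` are hypothetical.

```python
# Field layouts taken from Fig. 6; packing order (most significant field first)
# and the 5-bit width of the BR "code" field are assumptions of this sketch.
FIELDS = {
    "RR":  [("op", 6), ("rd", 5), ("rs1", 5), ("rs2", 5), ("unused", 11)],
    "RI":  [("op", 6), ("rd", 5), ("rs1", 5), ("imm", 16)],
    "BR":  [("op", 6), ("code", 5), ("rs1", 5), ("disp", 16)],
    "JMP": [("op", 6), ("disp", 26)],
    "BUF": [("op", 6), ("rd", 5), ("unused", 21)],
}


def encode(fmt, **values):
    """Pack the named field values into one 32-bit instruction word."""
    word = 0
    for name, width in FIELDS[fmt]:
        value = values.get(name, 0)  # unspecified fields default to zero
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode(fmt, word):
    """Split a 32-bit instruction word back into its named fields."""
    fields = {}
    shift = 32
    for name, width in FIELDS[fmt]:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields


if __name__ == "__main__":
    # Opcode value 1 is a placeholder, not the real Kydon-RISC encoding.
    word = encode("RR", op=1, rd=2, rs1=3, rs2=4)
    print(hex(word), decode("RR", word))
```

Keeping the layouts in one table makes it easy to check that every format adds up to the fixed 32-bit instruction length stated in Section 3.2.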

As a RISC, the Kydon-RISC processor conforms to the basic characteristics of RISCs. All instructions have a fixed length of 32 bits, and there are only five simple instruction formats. These formats (shown in Fig. 6; Table 1) are: register–register (R/R), register–immediate (R/I), branch (BR), jump (JMP), and buffer (BUF). All integer and load/store instructions are supported by the R/R and R/I formats, while all floating point instructions are supported only in the R/R format. Branch instructions are supported by the BR and JMP formats. Finally, buffer instructions are supported by the BUF format. In Fig. 6, OP stands for "opcode", RD stands for "Destination Register", RS1 means "Source Register 1", and RS2 means "Source Register 2". Because both the instruction length and the width of the data bus are 32 bits (a word), the displacement fields in the BR and JMP formats are expressed in words instead of bytes.

Table 1. The Kydon-RISC processor's instruction set

Integer instructions
  OR      Logical OR
  ORI     Logical OR immediate
  ORHI    Logical OR 16-bit MSB
  AND     Logical AND
  ANDI    Logical AND immediate
  XOR     Logical XOR
  ADD     Integer addition
  ADDI    Integer addition immediate
  ADDU    Integer addition unsigned
  ADDUI   Integer addition immediate unsigned
  SUB     Integer subtraction
  SUBI    Integer subtraction immediate
  SUBU    Integer subtraction unsigned
  SUBUI   Integer subtraction unsigned immediate
  CMP     Integer comparison
  SAL     Shift left arithmetic
  SAR     Shift right arithmetic
  SLL     Shift left logical
  SLR     Shift right logical
  MPSF    Move processor status flags to register
  MTPSF   Move register to processor status flags

Floating point instructions
  ADDF    Single precision addition
  ADDD    Double precision addition
  SUBF    Single precision subtraction
  SUBD    Double precision subtraction
  MULF    Single precision multiplication
  MULD    Double precision multiplication
  DIVF    Single precision division
  DIVD    Double precision division
  CMPF    Single precision comparison
  CMPD    Double precision comparison
  CVFD    Convert from single to double precision
  CVDF    Convert from double to single precision
  CVID    Convert from integer to double precision
  CVDI    Convert from double precision to integer

Load/store instructions
  LD      Load double precision
  LDI     Load double precision immediate
  LF      Load single precision
  LFI     Load single precision immediate
  SD      Store double precision
  SDI     Store double precision immediate
  SF      Store single precision
  SFI     Store single precision immediate
  LW      Load word
  LWI     Load word immediate
  SW      Store word
  SWI     Store word immediate
  LH      Load half word
  SH      Store half word
  SHI     Store half word immediate
  LB      Load byte
  LBI     Load byte immediate
  SB      Store byte
  SBI     Store byte immediate

Branch instructions
  B&D     Branch and Decrease (it branches by using a counter value)
  BR      Branch (=, ≠, >, ≥, <, ≤). It has an extended opcode in RS2 to indicate the conditions required for branching
  JMP     Jump (note that JMP with 0 displacement is considered as return)
  JMPR    Jump register (the register holds the address)
  JSR     Jump to Subroutine (and Link)
  JSRR    Jump to Subroutine (and Link) Register

Buffer instructions
  LIN     Load word from in-buffer if in-buffer is not empty
  SOUT    Store word to out-buffer if out-buffer is not full
  RIN     Reset in-buffer
  ROUT    Reset out-buffer

3.3. Addressing modes

There are two addressing modes in a Kydon-RISC processor: one is an indexed mode (register + register) and the other is register based (register + immediate). These modes are supported by the R/R and R/I formats. The only instructions that can access memory are the load/store instructions.

Fig. 7. The global design of the Kydon-RISC processor. (Block labels: instruction cache & MMU; dispatch unit; reorder buffer; integer and FP register files; INT and FP buses; reservation stations for branch, integer, floating point, load/store, and buffer instructions; branch unit; integer unit; floating point unit; load/store unit; data cache & MMU; in/out buffers unit connected to the lower layer, the six neighbor processors, and the higher layer.)

4. Kydon-RISC processor design

The design of the Kydon-RISC processor is based on the Hermes-RISC processor [3,6]. In the Kydon-RISC processor, there are a few units that are exactly the same as the ones in the Hermes-RISC processor, viz. integer unit, floating point unit, load/store unit, branch unit, and memory management unit. The few differences between the Hermes-RISC processor and the Kydon-RISC processor are as follows. Firstly, while Hermes-RISC takes care of data dependencies such as WAR (write after read), RAW (read after write), and WAW (write after write) by using scoreboards, the Kydon-RISC processor utilizes the so-called Tomasulo's approach [4]. This features the design of reservation stations and common data buses (CDBs), to exclude those hazards by register renaming [4]. Secondly, the Kydon-RISC processor has no branch registers. Finally, there is a buffer unit in the Kydon-RISC processor, used exclusively for the data propagation in the Kydon vision system, but there is no such mechanism in the Hermes-RISC processor.
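The effect of register renaming on the three hazard classes named above can be shown on a four-instruction example. The sketch below is a generic illustration, not the Kydon-RISC hardware: instructions are modelled as hypothetical `(dest, src1, src2)` tuples, hazards are found with a simple conservative pairwise check, and renaming assigns each result a fresh name (`p0`, `p1`, ...) where the real design would use reservation station numbers.

```python
from itertools import count


def hazards(program):
    """Conservative pairwise check: return {(kind, earlier, later)} for a
    list of (dest, src1, src2) instructions in program order."""
    found = set()
    for j, (dest_j, *srcs_j) in enumerate(program):
        for i in range(j):
            dest_i, *srcs_i = program[i]
            if dest_i in srcs_j:
                found.add(("RAW", i, j))  # j reads what i writes
            if dest_j in srcs_i:
                found.add(("WAR", i, j))  # j overwrites what i reads
            if dest_j == dest_i:
                found.add(("WAW", i, j))  # j overwrites what i writes
    return found


def rename(program):
    """Give every result a fresh name and redirect later readers to it."""
    fresh = (f"p{n}" for n in count())
    latest = {}  # architectural register -> most recent fresh name
    renamed = []
    for dest, *srcs in program:
        new_srcs = [latest.get(s, s) for s in srcs]
        latest[dest] = next(fresh)
        renamed.append((latest[dest], *new_srcs))
    return renamed


program = [
    ("R1", "R2", "R3"),  # 0
    ("R4", "R1", "R5"),  # 1: reads R1 written by 0 (RAW)
    ("R2", "R6", "R7"),  # 2: writes R2 read by 0 (WAR)
    ("R1", "R8", "R9"),  # 3: writes R1 again (WAW with 0, WAR with 1)
]

if __name__ == "__main__":
    print(sorted(hazards(program)))
    print(sorted(hazards(rename(program))))
```

After renaming, only the true dependence (RAW between instructions 0 and 1) remains; that dependence is the one a reservation station resolves by waiting for the result to appear on a CDB.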

4.1. Global design of the Kydon-RISC processor

The Kydon-RISC is capable of issuing up to four instructions in a cycle if there is no data dependency and no control hazard. The processor itself is a conventional load/store processor, thus all operations are carried out in registers, except for loads and stores, which move data into or out of memory. The Kydon-RISC processor, originally designed for the Kydon multi-processor vision system, is to be used in both the second and the third functional layers with some minor differences. Fig. 7 shows the global design of the Kydon-RISC processor.

4.2. Register files

There are two register files and a few status flags in the Kydon-RISC processor: integer register file, floating point register file, and processor status flags (PSF). All integer registers and floating point registers are general-purpose registers. The integer register file has 32 registers (R0–R31), each 32 bits long. The floating point register file also has 32 registers (F0–F31); even-numbered registers are 64 bits long and odd-numbered registers are 32 bits long. The values of R0 and F0 are always zero. The PSF keeps track of the status of a processor, such as the buffer full and empty flags, and can be used as a branch condition. Every register has an extra field Ud specifying the number of the reservation station that has the instruction to produce the result for the register. A value of zero indicates that no active instruction is producing a result for the register.

4.3. Reservation stations and reorder buffer

The data in each reservation station can be divided into the following six fields.

OP: specifies the operation that will be performed on source operands V1 and V2.
U1, U2: specify the reservation stations that will produce the corresponding value for each operand. If the value of either one of them is zero, it means that the value for the operand is available (kept in V1 or V2) or unnecessary. Once both of them are zero, the instruction can be executed if the corresponding execution unit is not busy.
V1, V2: keep the values of the operands. Note that for each operand only either the Un (n means 1 or 2) field or the Vn field is valid.
Busy: indicates that the reservation station and its accompanying functional unit are occupied.

The reorder buffer is used to maintain precise interrupts, which are used for debugging purposes. When the dispatch unit sends instructions for execution, it also sends information to the reorder buffer. The information sent to the reorder buffer is used to put the results back into the original order. If an interrupt occurs, the processor knows exactly when it happened and where to stop. The reorder buffer has three fields.

Ud: specifies the number of the station which has the instruction to produce the result that should be saved in the reorder buffer.
Rd: specifies the destination register.
V: keeps the value that should be written to the destination register.

The data movement in the Kydon-RISC processor depends heavily on CDBs. Each CDB connects the result register of an execution unit to the reorder buffer as well as to those reservation stations where the value of the result might be needed. One exception is that a result (the target address) of a branch unit should be sent to the dispatch unit. The fact that there might be more than one result produced at a time, and that these results should be sent to the reorder buffer as well as some reservation stations simultaneously, means that one CDB is not enough. In fact, the number of CDBs should be not less than the number of execution units to avoid any structural hazard. The data carried on a CDB should contain at least two fields: one identifier that indicates which reservation station produces this result; and the value of the result. During execution, the reorder buffer, all the reservation stations, and the write buffer in the load/store unit monitor the output of the CDBs. For any one of them, if its identifier number matches the one on the CDB, then the value on the CDB is saved into that unit.

Fig. 8. The dispatch unit. (Block labels: instruction MMU; IAR; PCR; vector table; branch target buffer (BTB); dispatch logic; instruction & PC queue; restoring circuit; reorder buffer; integer register file; floating point register file; processor status flags; outputs to the reservation stations.)

Fig. 9. The integer unit. (Block labels: reservation station fields OP, V1, U1, V2, U2, Busy; shifter; logical unit; IRR; integer buses.)

4.4. Dynamic scheduling and register renaming

In order to achieve dynamic scheduling by utilizing a register renaming scheme, the processor status at every moment must be recorded and used as a reference for updating the next status. The pipeline stages (shown in Table 3) and their relevant actions are listed in Table 2. In Table 2, the bookkeeping is described as C-like statements. RS[s] and RS[r] represent the reservation stations which produce and use the result, respectively. RS[r].U1 indicates the U1 field of reservation station r. Wbuf is the write buffer in the load/store unit. Reg represents a register. RS1 represents source register 1, and RS2 means source register 2.

Table 2. Status and bookkeeping in each stage (modified from [4])

Status: Instruction issue
Wait until: Station or buffer empty
Bookkeeping:
  if (RS1.Ud ≠ 0) {RS[s].U1 ← RS1.Ud} else {RS[s].V1 ← RS1; RS[s].U1 ← 0;};
  if (RS2.Ud ≠ 0) {RS[s].U2 ← RS2.Ud} else {RS[s].V2 ← RS2; RS[s].U2 ← 0;};
  if (RD.Ud ≠ 0) {Wbuf[s].Ud ← RD.Ud} else {Wbuf[s].Ud ← 0; Wbuf[s].V ← RD;};
  RS[s].busy ← yes; RD.Ud ← s;
  Reorder[s].Ud ← s; Reorder[s].Rd ← 'RD';
  Wbuf[s].busy ← yes;

Status: Instruction execution
Wait until: (RS[r].U1 = 0) && (RS[r].U2 = 0)
Bookkeeping:
  None. Note: this stage may need more than one cycle depending on what type the instruction is.

Status: Write back
Wait until: Execution completed
Bookkeeping:
  ∀r (if (RS[r].U1 = s) {RS[r].V1 ← result; RS[r].U1 ← 0;});
  ∀r (if (RS[r].U2 = s) {RS[r].V2 ← result; RS[r].U2 ← 0;});
  ∀r (if (Wbuf[r].Ud = s) {Wbuf[r].V ← result; Wbuf[r].busy ← no;});
  ∀r (if (Reorder[r].Ud = s) {Reorder[r].V ← result;});
  ∀r (if (Reg[r].Ud = Reorder[r].Ud) {Reg[r] ← Reorder[r].V; Reg[r].Ud ← 0;});
  RS[s].busy ← no;
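The issue and write-back bookkeeping can be traced with a small software model. This is a minimal sketch under simplifying assumptions: it keeps only the register Ud fields and the reservation station fields (OP, V1, V2, U1, U2, Busy), leaves out the reorder buffer and the write buffer, performs execution and write back in one step, and supports only two operations. The class and method names (`Station`, `Machine`, `issue`, `write_back`) are hypothetical.

```python
import operator

OPS = {"ADD": operator.add, "SUB": operator.sub}  # illustrative subset only


class Station:
    """One reservation station with the fields named in Section 4.3."""

    def __init__(self):
        self.busy = False
        self.op = None
        self.v1 = self.v2 = None  # operand values
        self.u1 = self.u2 = 0     # producing station numbers; 0 = value ready


class Machine:
    def __init__(self, n_regs=8, n_stations=3):
        self.reg = [0] * n_regs   # register values
        self.ud = [0] * n_regs    # Ud field per register; 0 = no pending producer
        self.rs = {s: Station() for s in range(1, n_stations + 1)}

    def _read(self, r):
        """Return (value, 0) if the register is ready, else (None, producer)."""
        return (None, self.ud[r]) if self.ud[r] else (self.reg[r], 0)

    def issue(self, s, op, rd, rs1, rs2):
        """Instruction issue: capture operands or producer tags, rename rd."""
        st = self.rs[s]
        if st.busy:
            raise RuntimeError(f"station {s} is busy")
        st.v1, st.u1 = self._read(rs1)
        st.v2, st.u2 = self._read(rs2)
        st.op, st.busy = op, True
        self.ud[rd] = s

    def ready(self, s):
        st = self.rs[s]
        return st.busy and st.u1 == 0 and st.u2 == 0

    def write_back(self, s):
        """Execute station s and broadcast (s, result) as a CDB would."""
        st = self.rs[s]
        if not self.ready(s):
            raise RuntimeError(f"station {s} is still waiting for operands")
        result = OPS[st.op](st.v1, st.v2)
        for other in self.rs.values():      # waiting stations snoop the bus
            if other.u1 == s:
                other.v1, other.u1 = result, 0
            if other.u2 == s:
                other.v2, other.u2 = result, 0
        for r, tag in enumerate(self.ud):   # registers still renamed to s
            if tag == s:
                self.reg[r], self.ud[r] = result, 0
        st.busy = False
        return result


if __name__ == "__main__":
    m = Machine()
    m.reg[1], m.reg[2] = 5, 7
    m.issue(1, "ADD", rd=3, rs1=1, rs2=2)   # R3 = R1 + R2
    m.issue(2, "SUB", rd=4, rs1=3, rs2=1)   # R4 = R3 - R1, waits on station 1
    print(m.ready(2), m.write_back(1), m.ready(2), m.write_back(2))
```

The second instruction is issued before its operand R3 exists; it holds the tag of station 1 in U1 and becomes ready only when station 1 broadcasts its result.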

Fig. 10. The floating point unit. (Block labels: reservation station fields OP, V1, U1, V2, U2, Busy; operand unpacking into mantissa, exponent, and sign; parallel multiplier; shifter; divider; adder; sign and exponent logic; FP comparator; rounding and formatting logic; FP result register; integer and FP buses.)

4.5. Dispatch unit

The dispatch unit, shown in Fig. 8, is capable of fetching up to four instructions from the instruction cache at a time, depending on how many instructions were issued in the previous cycle. The fetched instructions are placed into an instruction queue, where instructions wait to be issued to the right reservation stations. To keep the bookkeeping scheme simple to implement, and to avoid complicating the task of restoring state after a branch misprediction, instructions are issued to reservation stations in order. The number of entries at a reservation station for every different functional unit should be chosen according to the instruction mix. A full reservation station for a particular unit might degrade the performance of a processor by blocking instructions of other types from being issued.

At issue time, the dispatch unit changes the Ud fields in the register files and the reorder buffer for data control purposes. When the dispatch unit detects that a branch instruction has been fetched, it checks the branch target buffer (BTB) to get the speculated address, and at the same time the restoring circuit saves the old values of the Ud fields in the register files and the write buffer in case the prediction is wrong. When the dispatch unit receives a "misprediction" signal from the branch unit, it flushes the instructions in the queue. The old values in the restoring circuit are restored, and the mispredicted entries in the reorder buffer are deleted. For mispredicted instructions already in reservation stations, the results are computed and then discarded, since their corresponding entries in the reorder buffer have been removed.
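The blocking effect of in-order issue can be seen in a few lines. The sketch below is an illustration only: the instruction queue is modelled as a list of functional unit names, `free_stations` is a hypothetical map from unit name to the number of free reservation station entries, and the issue width of four follows Section 4.1.

```python
def issue_in_order(queue, free_stations, width=4):
    """Issue instructions from the head of the queue, strictly in order.

    queue: list of unit names, one per waiting instruction (oldest first).
    free_stations: mapping of unit name to free reservation station entries.
    Returns the list of unit names issued in this cycle.
    """
    free = dict(free_stations)  # work on a copy; the caller's map is untouched
    issued = []
    for unit in queue[:width]:
        if free.get(unit, 0) == 0:
            break  # in-order issue: everything behind a blocked instruction waits
        free[unit] -= 1
        issued.append(unit)
    return issued


if __name__ == "__main__":
    queue = ["int", "fp", "fp", "int"]
    # Only one free floating point entry: the second fp instruction blocks,
    # and the integer instruction behind it cannot issue either.
    print(issue_in_order(queue, {"int": 2, "fp": 1}))
    # With two free floating point entries all four instructions issue.
    print(issue_in_order(queue, {"int": 2, "fp": 2}))
```

In the first call the last integer instruction stays in the queue even though an integer entry is free, which is the degradation that sizing the reservation stations to the instruction mix is meant to avoid.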

4.6. Integer unit

The integer unit of the Kydon-RISC processor, shown in Fig. 9, is almost the same as the one in Hermes-RISC [3,6]. The integer unit includes an arithmetic logic unit, a logical unit, and a shifter. A few things are different from the Hermes-RISC. For instance, comparison operations set flags in condition registers in Hermes-RISC processors, while these operations set flag values in other integer registers in the Kydon-RISC processor, since branch registers do not exist in the Kydon-RISC processor. In addition, the integer unit in the Hermes-RISC reads operands and saves results directly from/to the register file, while in the Kydon-RISC processor it reads operands from reservation stations and saves results through a CDB, which connects the IRR (integer result register) to the reorder buffer and those reservation stations that may need the result from the integer unit.

Table 3. Various types of pipelines (IF: instruction fetch; IS: instruction issue; EX: execution; WB: write back; BT: branch; AC: effective address compute; ME: memory access; BU: buffer access). All operations except for division are pipelined. In the original table, successive instructions are drawn staggered by one cycle to show the overlap; only the stage sequence of a single instruction is listed here.

  Unit / operation              Stages
  Branch unit pipeline          IF(IS) BT
  Buffer unit pipeline          IF IS BU
  Load/store unit pipeline      IF IS AC ME
  Integer unit pipeline         IF IS EX WB
  Floating point unit pipeline
    Multiplication              IF IS EX EX EX WB
    Division                    IF IS EX – EX WB
    Add or Sub                  IF IS EX EX WB
    Comparison                  IF IS EX WB
    Conversion                  IF IS EX WB

4.7. Floating point unit

The internal structure of the floating point unit, shown in Fig. 10, is the same as the one used in Hermes-RISC. All operations read operands from reservation stations and save data through a CDB, which connects the FRR (floating point result register) to the reorder buffer and to those reservation stations that may need the result from the floating point unit. The pipeline stages of this unit are shown in Table 3.

4.8. Branch unit

The branch unit, shown in Fig. 11, controls the execution flow by sending the target address to the dispatch unit. Here a branch target buffer (BTB), shown in the dispatch unit, with a two-bit prediction scheme is employed for the branch prediction. When a branch is taken, the target address as well as the "taken" bit (represented by 1 or 0) is written into the BTB. If a branch is never taken, then it is not stored in the BTB. For a target address in the BTB, if its corresponding branch instruction is not taken on two consecutive occasions, it is eliminated from the BTB. For a branch misprediction, the dispatch unit is notified to clear the instruction queue, and the restoring circuit is used to restore the Ud fields in the register files and to inform the reorder buffer to discard the corresponding mis-speculated entries.

Fig. 11. The branch unit. (BRR: Branch Result Register; RAR: Return Address Register. Other block labels: reservation station fields OP, V1, U1, V2, U2, Busy, code; BRPC; +4 adder; branch comparator; branch target register; output to the dispatch unit.)
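The BTB replacement policy described in Section 4.8 can be sketched as follows. This is an assumption-laden software model, not the hardware: the paper's two-bit scheme is approximated by a counter of consecutive not-taken outcomes, the BTB is an unbounded dictionary keyed by branch address, and the names `BranchTargetBuffer`, `predict`, and `update` are hypothetical.

```python
class BranchTargetBuffer:
    """Taken branches are stored; two consecutive not-taken outcomes evict."""

    def __init__(self):
        # branch PC -> [target address, consecutive not-taken count]
        self.entries = {}

    def predict(self, pc):
        """Return the speculated target, or None to predict fall-through."""
        entry = self.entries.get(pc)
        return entry[0] if entry else None

    def update(self, pc, taken, target=None):
        """Record the resolved outcome reported by the branch unit."""
        if taken:
            # A taken branch (re)enters the BTB and its miss count is cleared.
            self.entries[pc] = [target, 0]
        elif pc in self.entries:
            self.entries[pc][1] += 1
            if self.entries[pc][1] >= 2:
                del self.entries[pc]  # second consecutive not-taken outcome
        # A never-taken branch that is not in the BTB is simply not stored.


if __name__ == "__main__":
    btb = BranchTargetBuffer()
    btb.update(0x40, taken=True, target=0x100)
    btb.update(0x40, taken=False)
    print(hex(btb.predict(0x40)))   # still predicted taken after one miss
    btb.update(0x40, taken=False)
    print(btb.predict(0x40))        # evicted after the second miss
```

Tolerating one not-taken outcome before eviction keeps loop branches in the BTB across the single misprediction at loop exit.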

Fig. 12. The load/store unit. (Block labels: reservation station fields OP, V1, U1, V2, U2, Busy; MAR; MDR; write buffer; align; connections to the data MMU, the integer and FP registers, and the integer and FP buses.)

4.9. Load/Store unit

The load/store unit, shown in Fig. 12, is the only unit that can read data from or write data to memory. It supports byte, halfword, word, and double-word data movement operations. Its architecture is again similar to the one in Hermes-RISC, except that the operands are read from reservation stations and the results are sent to a CDB which, in this case, connects to all reservation stations and the reorder buffer. The values in the two fields V1 and V2 are used to calculate the effective address, which is kept in the MAR (Memory Address Register) for both loads and stores. For loads, the loaded data is put into the MDR (Memory Data Register).

4.10. In- and out-buffers

The out-buffer, shown in Fig. 13, works asynchronously and has two registers, the top pointer (IP) and the bottom pointer (OP), which indicate where the next data item can be deposited or removed, respectively. The out-buffer also has two flags (full and empty) that indicate the status of the buffer. These two flags are part of the processor status flags (PSF). The number of entries in the out-buffer is implementation-specific. The out-buffer automatically sends data transfer requests while the empty flag is not set; when it is empty, it stops sending requests to other processors. Before the out-buffer sends data to the bus(es), a decoding unit determines, from the direction and the detour bit in the data, the buses through which the current data should be sent. An out-buffer can send data in up to six directions at a time, according to the propagation scheme. To send data efficiently, the decoding unit has eight flags, shown in Fig. 13, for transfer control. Each individual flag (C0–C6) indicates whether the previous send task was completed in the corresponding direction. The overall flag (AC) is set if any direction of the previous send operation failed. Thus, a processor sends data to its neighbor(s) as soon as it gets a chance, and it never sends the same data to the same processor twice.

Fig. 13. The out-buffer (note that OID, Original Processor ID; DIR, direction; D, detour bit).

The in-buffer unit (Fig. 14) shares the same bus with the out-buffer. An in-buffer can receive up to six data items from its neighbors. Each direction has its own port and buffer entry. When data is received in an entry, the corresponding valid bit for that entry is set. A polling circuit containing a pointer register scans the valid bits cyclically. When it finds a set valid bit, the in-buffer stops scanning and the pointer is set to that entry. When a register tries to read data from the in-buffer, the value of the pointed entry is put into the register. After a register gets data from the in-buffer, the previous valid bit is reset and the polling circuit scans for the next valid bit. A set valid bit means that the data in the entry has not yet been read by a register; it also keeps valid data from being overwritten by other processors. If none of the six entries holds valid data, the AIV flag is set, which prevents registers from reading data from the in-buffer. If a processor retrieves data from the in-buffer fast enough, each entry in the in-buffer is likely to be available whenever another processor tries to send data to it. With this approach, the usage of buses can be maximized and delays can be reduced.

Fig. 14. The in-buffer.
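The cyclic polling of the in-buffer can be sketched in a few lines of Python. This is only a behavioral illustration under stated assumptions: the class and method names are hypothetical, the scan is performed lazily at read time rather than by a free-running circuit, and a sender whose target entry is still valid is simply refused.

```python
class InBuffer:
    """Behavioral sketch of the six-entry in-buffer with cyclic polling."""

    DIRECTIONS = 6

    def __init__(self):
        self.data = [None] * self.DIRECTIONS
        self.valid = [False] * self.DIRECTIONS
        self.pointer = 0  # pointer register of the polling circuit

    @property
    def aiv(self):
        """AIV flag: set when no entry holds unread data."""
        return not any(self.valid)

    def receive(self, direction, value):
        """A neighbor deposits data; refused while the entry is still unread."""
        if self.valid[direction]:
            return False
        self.data[direction] = value
        self.valid[direction] = True
        return True

    def read(self):
        """Return the next unread entry in cyclic order, or None if AIV is set."""
        if self.aiv:
            return None
        # Scan the valid bits cyclically until a set bit is found.
        while not self.valid[self.pointer]:
            self.pointer = (self.pointer + 1) % self.DIRECTIONS
        value = self.data[self.pointer]
        # Reset the valid bit and move on to the next entry.
        self.valid[self.pointer] = False
        self.pointer = (self.pointer + 1) % self.DIRECTIONS
        return value
```

Because the pointer advances past each entry after it is read, every direction is served in turn and no single neighbor can monopolize the reader.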
To obtain maximum usage of the buses, there is also one more busy flag for each bus so that it can work bi-directionally.

4.11. Memory management units

The memory management units are the same as those used in Hermes-RISC. There are two memory management units, one for instructions and one for data. The instruction memory management unit can issue four instructions at a time to the instruction queue. To keep the clock rate as high as possible, a direct-mapped architecture is used for both caches. The instruction cache is 8 kbyte in size and is divided into two 4 kbyte caches with 16-byte lines, which is an efficient way of fetching instructions (Fig. 15). The data cache is 8 kbyte in size, with 64-byte lines.

Fig. 15. The instruction cache and MMU.

4.12. Control unit

The control unit handles data movement, and the opcode of each instruction generates the control signals. The hardwired control unit consists of 12 parts: the integer pipe; the floating point pipe; the branch pipe; the load/store pipe; the in–out buffer pipe; the dispatch pipe; the interrupt handler; the dispatch logic; the reorder buffer controller; the restoring circuits controller; the data memory controller; and the instruction memory controller. Each pipe has its own opcode decoder. The control unit also takes care of interrupt handling. When an interrupt occurs, the address of a specific interrupt routine is sent to the IAR (instruction address register), and the old IAR and PC file are saved in the software interrupt handler. To prevent another interruption while an interrupt is being handled, a bit in the PSF is reset to disable additional interrupts. In the first part of the interrupt routine, the contents of the register files are saved in memory (the stack), while the last part of the interrupt routine restores the saved data to the register files. Figs. 16–21 show the control flow diagrams of the dispatch unit, integer unit, floating point unit, branch unit, load/store unit, and the in/out buffer.

5. Initial processor evaluation [3,28]

In this section, a simple performance evaluation of the proposed processor is presented and compared with a group of other processors. The initial evaluation is based on the execution of a set of simple and basic functions that exist in software programs. Although this is not a standardized approach for evaluating a new processor, it gives an initial indication of the processor's performance. To evaluate the performance of the Kydon-RISC processor, a set of machine language programs was written and simulated on a PC. The number of clock cycles needed to execute these functions can be calculated by tracking the execution of the instructions, including stalls.

The primitive functions evaluated are: summation, multiplication, matrix multiplication, bubble sorting, and searching for the maximum (or minimum) value in a set of numbers. For matrix multiplication, two matrices with dimensions of 20 × 20 were multiplied. For all the remaining functions, a set of 100 random numbers was fed to the evaluation programs. Programs such as summation, multiplication, and matrix multiplication are static programs (the instruction count does not depend on the values of the data), while bubble sorting and searching for the maximum value are dynamic programs (the instruction count may change with different sets of data). The most important thing during this evaluation is to schedule the sequence of instructions efficiently, so as to maximize the instruction-level parallelism while keeping the result unchanged. Thus, programs with a large number of instructions (i.e. matrix multiplication, sorting) have more chances than programs with a small number of instructions to take advantage of a superscalar architecture by executing more than one instruction at a time.
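The distinction between static and dynamic programs can be made concrete with a small Python sketch. It counts abstract operations (additions, comparisons, swaps) as a stand-in for executed instructions; these are not the authors' machine language programs, and the function names are hypothetical.

```python
def summation_ops(values):
    """Count the additions performed by a straightforward summation loop."""
    ops = 0
    total = 0
    for v in values:
        total += v
        ops += 1  # one addition per element, whatever its value
    return ops


def bubble_sort_ops(values):
    """Count comparisons plus swaps performed by a basic bubble sort."""
    data = list(values)
    ops = 0
    for end in range(len(data) - 1, 0, -1):
        for i in range(end):
            ops += 1  # comparison, always executed
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
                ops += 1  # swap, executed only when the pair is out of order
    return ops


ascending = list(range(100))
descending = ascending[::-1]
```

For two inputs of the same size, `summation_ops` returns the same count, whereas `bubble_sort_ops` returns a larger count for the descending input because every comparison is followed by a swap.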


Fig. 16. Integer unit's flow diagram. Operand fetch: RS[r].V1 <= Reg[RS1] or result; RS[r].V2 <= Reg[RS2], result, or Imm. The unit waits until (U1 = 0 && U2 = 0) and then writes IRR from V1 and V2 according to the operation class (addition; subtraction or comparison; shift; logic). Result: the result from a CDB.
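The "wait until (U1 = 0 && U2 = 0)" step of the integer unit can be sketched as a reservation station entry that captures results from the CDB. The sketch below is illustrative only. It assumes that a non-zero U field holds the tag of the unit that will produce the operand and that tag 0 means the V field is already valid; the class name, the small operation table, and the tag values are hypothetical.

```python
import operator
from dataclasses import dataclass

# Small illustrative operation table; the real unit has more operations.
OPS = {
    "add": operator.add,
    "sub": operator.sub,
    "and": operator.and_,
    "shl": operator.lshift,
}


@dataclass
class Station:
    """One reservation station entry with OP, V1, V2, U1 and U2 fields."""

    op: str
    v1: int = 0
    v2: int = 0
    u1: int = 0  # 0 means V1 is valid; otherwise the tag being waited on
    u2: int = 0  # 0 means V2 is valid; otherwise the tag being waited on

    def capture(self, tag, result):
        """Snoop a CDB broadcast and latch any operand waiting on this tag."""
        if tag == 0:
            return  # tag 0 is reserved for "operand ready"
        if self.u1 == tag:
            self.v1, self.u1 = result, 0
        if self.u2 == tag:
            self.v2, self.u2 = result, 0

    def ready(self):
        """True once both operands are available (U1 = 0 and U2 = 0)."""
        return self.u1 == 0 and self.u2 == 0

    def execute(self):
        """Return the value for IRR, or None while an operand is outstanding."""
        if not self.ready():
            return None
        return OPS[self.op](self.v1, self.v2)
```

An entry created with one outstanding operand stays idle until the matching tag appears on the CDB, after which `execute` produces the result.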

Fig. 17. Dispatch unit's flow diagram. Fetch: Inst. Q. <= M[IAR]; IAR <= BTR or PCR + 4, + 8, + 12, or + 16; Inst. IP advances by 1 to 4. On an interrupt: Reg[r] <= IAR; IAR <= interrupt vector; the interrupt signal is cleared. Issue: Inst. OP advances by 1 to 4, plus the bookkeeping for the issue stage in Table 3. Execution follows the integer, floating point, branch, load/store, or in–out buffer flow (Figs. 16 and 18–21), each followed by the bookkeeping for the write-back stage in Table 3.

Fig. 18. Floating point unit's flow diagram. Operand fetch: RS[r].V1 <= Reg[RS1] or result; RS[r].V2 <= Reg[RS2] or result. The unit waits until (U1 = 0 && U2 = 0). Operations: ADDD/ADDF, SUBD/SUBF, MULD/MULF, DIVD/DIVF, CMPD/CMPF, and the conversions CVDF, CVFD, CVDI, and CVID. Register transfers shown: Ch R <= V1 op V2 (chained); FRR <= Ch R, rounded and formatted (R.F), in single or double precision; FRR <= V1, converted and formatted (C.F); FRR <= V1 op V2. R: round; C: convert; F: format. Result: the result from a CDB.


Fig. 19. Branch unit's flow diagram. Setup: BRPC <= PC File; RS[r].V1 <= Reg[RS1] or result; RS[r].V2 <= Reg[code], result, or Dis. The unit waits until (U1 = 0 & U2 = 0). Instruction classes: JMP, JMPR, JSR, JSRR, BR, and B&D. Register transfers shown: BTR <= BRPC + RS[r].V2; BTR <= RS[r].V2; RAR <= BRPC + 4 with BTR <= BRPC + RS[r].V2; RAR <= BRPC + 4 with BTR <= RS[r].V2; the tests RS[r].V1 = code and RS[r].V1 = 0 (Y/N); BTR <= BRPC + Dis; and BRR <= RS[r].V1 - 1 with BTR <= BRPC + Dis. Result: the result from a CDB.

In the initial evaluation, the Kydon-RISC shows better performance than the other processors [3–5,14,15,17–19,25,28] (as shown in Table 4). Note that the RISC processors were evaluated by writing an assembly program for each processor and optimizing it for that processor's architecture. The empty entries in Table 4 indicate that we did not evaluate the cache.

Fig. 20. Load/store unit's flow diagram (note that WbufIP is the pointer to the next available entry of the write buffer). Operand fetch: RS[r].V1 <= Reg[RS1] or result; RS[r].V2 <= Reg[RS2], result, or Imm. The unit waits until (U1 = 0 && U2 = 0). Load: MAR <= V1 + V2; MDR <= M[MAR] (looping while not complete). Store: MAR <= V1 + V2; Wbuf[r].V <= Reg[RD] or result; WbufIP <= WbufIP - 1; wait until Ud = 0. Result: the result from a CDB.


Table 4
Performance evaluation (cycles) of the primitive functions

Microprocessor     Summation  Many mult.  Matrix mult.  Maximum value  Bubble sorting  Procedure call
SPARC              501        –           –             702            55,583          2280
MIPS 2000 w/2010   500        599         92,491        702            50,502          2686
HP prec. Arch      401        –           –             500            50,590          2177
i860               401        401         50,871        603            58,104          2367
MC68000            500        599         84,091        603            45,595          2686
MIRIS              801        –           –             1003           93,751          2515
IBM RS/6000        302        302         35,250        504            43,119          1530
HERMES-RISC        202        400         33,629        399            31,282          1365
KYDON-RISC         205        304         33,629        306–404        22,394          1365
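The cycle counts in Table 4 can be compared directly. As a minimal sketch, the Python fragment below computes the relative reduction in cycles of the KYDON-RISC row with respect to the HERMES-RISC row. The "maximum value" column is left out because the KYDON-RISC entry is a range; the function and variable names are hypothetical.

```python
# Cycle counts copied from the HERMES-RISC and KYDON-RISC rows of Table 4.
HERMES = {
    "summation": 202,
    "many mult.": 400,
    "matrix mult.": 33629,
    "bubble sorting": 31282,
    "procedure call": 1365,
}
KYDON = {
    "summation": 205,
    "many mult.": 304,
    "matrix mult.": 33629,
    "bubble sorting": 22394,
    "procedure call": 1365,
}


def cycle_reduction(baseline, candidate):
    """Fraction of the baseline cycles saved by the candidate (negative if slower)."""
    return (baseline - candidate) / baseline


reductions = {name: cycle_reduction(HERMES[name], KYDON[name]) for name in HERMES}

for name, value in reductions.items():
    print(f"{name:15s} {value:+.1%}")
```

A positive value means that the KYDON-RISC needs fewer cycles than the HERMES-RISC for that function; zero means the two counts are identical.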

Fig. 21. The in–out buffer unit's flow diagram. Operand fetch: RS[r].V1 <= Reg[RS1] or result. The unit waits until U1 = 0. LIN: IBR <= InBuf[OP]. SOUT: OutBuf[IP] <= RS[r].V1. Result: the result from a CDB.
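The SOUT path of Fig. 21, together with the out-buffer pointers and flags of Section 4.10, can be sketched as a small circular buffer. This is a behavioral illustration only: it assumes first-in, first-out order, a default of four entries (the paper leaves the number of entries implementation-specific), and hypothetical method names `sout` and `send`.

```python
class OutBuffer:
    """Circular out-buffer with a deposit pointer (IP) and a removal pointer (OP)."""

    def __init__(self, entries=4):
        self.slots = [None] * entries
        self.ip = 0      # where SOUT deposits the next data item
        self.op = 0      # where the next data item is removed for sending
        self.count = 0   # number of occupied entries

    @property
    def empty(self):
        """Empty flag: no data is waiting, so no transfer requests are sent."""
        return self.count == 0

    @property
    def full(self):
        """Full flag: SOUT cannot deposit until an entry is sent."""
        return self.count == len(self.slots)

    def sout(self, value):
        """OutBuf[IP] <= value; returns False when the buffer is full."""
        if self.full:
            return False
        self.slots[self.ip] = value
        self.ip = (self.ip + 1) % len(self.slots)
        self.count += 1
        return True

    def send(self):
        """Remove the oldest data item for transfer, or return None when empty."""
        if self.empty:
            return None
        value = self.slots[self.op]
        self.op = (self.op + 1) % len(self.slots)
        self.count -= 1
        return value
```

The direction decoding and the C0–C6 and AC completion flags are left out; the sketch covers only the pointer and flag behavior.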

6. Conclusions and discussion

In this paper, the design and the first-stage performance evaluation of the Kydon-RISC processor were presented. The Kydon-RISC processor is a five-way superscalar processor (with one extra in/out buffer unit used exclusively in the Kydon system) with a small and efficient set of instructions. It features register renaming, which takes care of data dependencies, as well as dynamic scheduling. The Kydon-RISC design has, however, a drawback: a relatively slow clock rate due to the complexity of the register renaming.

In a pipelined design, the slowest stage of the pipeline determines the speed of the design. The common critical path in a design with a floating point multiplier is the multiplier stage. In this paper, we assign three pipeline stages to the multiplication; this may shift the critical path to the memory (cache) access in a load/store operation. Since we assign only one stage for loading data from the cache and writing it into the destination register, this would most likely be the critical path. In a comparable design implemented with a design automation tool (not custom layout) in a 0.35 μm technology, with two pipeline stages for cache access and write-back into the register, more than 200 MHz can be achieved. Based on these data, the design here should be able to reach about 250 MHz or more in 0.35 μm when implemented using design automation tools (synthesized).

The reason for having a split instruction cache is to be able always to fetch four consecutive 32-bit words that are word aligned (not quad-word aligned). This is because the HERMES design [3,6] (and later Kydon) may accept only part of the quad-word (four instructions) fetched from the cache, depending on the availability of the queue (reservation station). Note that we issue instructions to the queue in order, for the sake of simplicity and hence speed. Implementing a set-associative cache in this case would create an unwarranted complication. In any case, instruction cache behavior is quite predictable.

The performance of the Kydon-RISC can be further improved. First, since most of the Kydon system's operations are performed by the integer unit, adding another integer unit may significantly increase the performance. Second, the floating-point unit can be divided into two parts: one for summation, subtraction, and comparison; and the other for multiplication, division, and conversion. Thus, up to two floating point instructions can be executed at a time. Although the performance can be improved, this incurs extra cost. Finally, the Kydon-RISC compiler is under development and it will provide the optimization needed for an efficient overall performance.
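The benefit of the split instruction cache discussed above can be illustrated with a short Python sketch. The cache sizes and line length are taken from Section 4.11, but the mapping is an assumption made for illustration: consecutive 16-byte lines are taken to alternate between the two 4 kbyte halves, so that a word-aligned group of four instructions touches at most one line in each half. Tag comparison and the TLB are omitted, and the function name is hypothetical.

```python
LINE_BYTES = 16                              # line size of the instruction cache
WORD_BYTES = 4                               # one 32-bit instruction
HALF_BYTES = 4 * 1024                        # each half of the 8 kbyte cache
LINES_PER_HALF = HALF_BYTES // LINE_BYTES    # 256 lines per half


def fetch_plan(address, words=4):
    """List (half, line_index, word_in_line) for each fetched instruction word."""
    plan = []
    for i in range(words):
        addr = address + i * WORD_BYTES
        line = addr // LINE_BYTES
        half = line % 2                        # assumed: even lines in half 0, odd in half 1
        index = (line // 2) % LINES_PER_HALF   # direct-mapped index within the half
        word = (addr % LINE_BYTES) // WORD_BYTES
        plan.append((half, index, word))
    return plan
```

For example, `fetch_plan(0x08)` takes the last two words of a line in half 0 and the first two words of the following line in half 1, so both halves can be read in the same access and the unwanted words masked off.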

References

[1] J.S. Mertoguno, A self-organized, autonomous, multi-layer vision system, PhD Dissertation, EE Department/AAAI Research Lab, Binghamton University (SUNY), 1995.
[2] N.G. Bourbakis, J.S. Mertoguno, Kydon: an autonomous, multi-layer image-understanding system: lower layers, J. Engng Appl. AI 9 (1) (1996) 43–52.
[3] J.S. Mertoguno, N.G. Bourbakis, Design of the Hermes-RISC processor, J. Microcomput. Appl. 18 (1995) 233–259.
[4] J.L. Hennessy, D.A. Patterson, Computer Architecture: a Quantitative Approach, 2nd ed., Morgan Kaufmann, San Francisco, CA, 1996.
[5] S.J. Siegel, Social aspects of developing emerging technologies: RISC, Master's Thesis, School of Engineering, MIT, 1991.
[6] S.J. Mertoguno, Design and evaluation of two RISC superscalar-pipeline processor, Master's Thesis, EE Department, SUNY Binghamton, May 1992.
[7] R. Weiss, RISC takes gold in processor Olympics, Comput. Design November (1996) 61–78.
[8] T.R. Halfhill, AMD vs superman, BYTE November (1994) 95–103.
[9] N.G. Bourbakis, S.J. Mertoguno, The design of the UAL processor, J. Microcomput. Appl. 16 (1993) 1–17.
[10] N.H.E. Weste, K. Eshraghian, Principles of CMOS VLSI Design, Addison-Wesley, Reading, MA, 1992.
[11] F.J. Hill, G.R. Peterson, Digital Systems: Hardware Organization and Design, Wiley, New York, 1987.
[12] D.A. Patterson, Reduced instruction set computers, CACM 28 (1985) 8–21.
[13] B.B. Brey, The Intel Microprocessors 8086/8088/80286/80386/80486: Architecture, Programming and Interfacing, Prentice-Hall, Englewood Cliffs, NJ, 1995.
[14] D. Jones, The MC88200: a cache and memory management unit for M88000 RISC processors, EE March (1989) 39–48.
[15] T. Potter, M. Vaden, J. Young, N. Ullah, Resolution of data and control-flow dependency in the Power PC 601, IEEE Micro October (1994) 18–29.
[17] N. Gaddis, J. Lotz, A quad-issue, CMOS RISC microprocessor, IEEE J. SSC 31 (11) (1996) 1697–1702.
[18] S. Mirapuri, M. Woodacre, N. Vasseghi, The MIPS R4000 processor, IEEE Micro April (1992) 10–22.
[19] K.C. Yeager, The MIPS R10000 superscalar microprocessor, IEEE Micro April (1996) 28–40.
[20] N. Vasseghi, K. Yeager, E. Sarto, M. Seddighnezhad, 200-MHz RISC processor, IEEE J. SSC 31 (11) (1996) 1675–1686.
[21] R.L. Sites, Alpha AXP architecture, CACM 36 (2) (1993) 33–44.
[22] E. McLellan, The Alpha AXP architecture and 21064 processor, IEEE Micro June (1993) 36–47.
[23] B.J. Benschneider, et al., A 300-MHz, 64-bit quad-issue CMOS RISC micro-processor, IEEE J. SSC 31 (11) (1995) 1203–1212.
[24] J.H. Edmondson, et al., Superscalar instruction execution in the 21164 Alpha microprocessor, IEEE Micro (1995) 33–43.
[25] P. Wayner, SPARC strikes back, BYTE November (1994) 105–112.
[26] T. Thompson, B. Ryan, Power PC 620 soars, BYTE November (1994) 113–120.
[27] T.R. Halfhill, T5: Brute force, BYTE November (1994) 123–128.
[28] H.Y. Yang, Kydon system: traffic analysis and RISC processor design, Master's Thesis, EE Department, Binghamton University (SUNY), January 1998.
[29] N. Bourbakis, Parallel and multiprocessor vision system architectures, Int. J. PRAI 12 (3) (1998) 263–264.
[30] S. Mertoguno, N. Bourbakis, A analytic evaluation of a fully connected multiprocessor structure, SCS Trans. Comput. Simul. 11 (1) (1994) 45–62.
[31] B. Saha, VLSI implementation of Kydon's processing elements, Master's Thesis, Image-Video Lab, Department of EE, SUNY-B, December, 1994.

H.Y. Yang received his BS in electrical engineering and physics from National Taiwan University, and his MS in computer engineering from SUNY-BU 1992 & 1995. He currently is a Senior Designer for VIA Technology Corp. designing high speed devices.

S.J. Mertoguno received his BS in electrical engineering and physics from National Technical University of Indonesia, and his MS and PhD in computer engineering from SUNY-BU 1992 & 1995. He currently is a Senior Scientist in a Networks Inc. designing VLSI switches for Internet. Previous working places: Fujitsu, Digital Video Inc. He has published more than 20 articles in refereed International Journals and Conference Proceedings, book-chapters. He is an Associate Editor in an Int. Journal on Artificial Intelligence Tools, a Program of an IEEE Symposium on Intelligent Agents 1999, and co-Chair of the International Standard Committee for High Speed Bus Design. He is conducting research in Applied Artificial Intelligence and Distributed Computing-Processors Design. He has received a Best Student paper Award IEEE ATC 1996.

Nikolaos G. Bourbakis (IEEE Fellow) received his BS in mathematics from National University of Athens, Athens, Greece, and his PhD in computer science and computer engineering, Dept. of Computer Engineering & Informatics, University of Patras, Patras, Greece, 1983. He currently is a Professor in ECE at BU and a Professor at TUC, GR, and the Director of two Research Labs. He has directed several research projects funded by government and industry. He has published extensively in refereed International Journals and Conference Proceedings. He is an author, co-author or editor of several books. He is the founder and the Editor-in-Chief of the International Journal on AI Tools, the Editor-in-Charge of a Research Series of Books in AI (WS Publisher), the Founder and General Chair of IEEE Computer Society Conferences, Symposia and Workshops, and Associate Editor in IEEE and Int. Journals and a Guest Editor in 14 special issues in IEEE and Int. Journals related to his research interests. He is conducting research in Applied Artificial Intelligence, Image and Video Processing, Biomedical Engineering and Distributed Computing-Processors Design and VLSI-CAD. His research work has been internationally recognized and granted with several prestigious awards. Some of them are: IBM Author recognition Award 1991, IEEE Outstanding Paper Award ATC 1994, IEEE Computer Society Technical Research Achievement Award 1998, IEEE ICTAI 10 years Research Contribution Award 1999.