URPR-1: A single-chip VLIW architecture


Microprocessing and Microprogramming 39 (1993) 25-41 North-Holland


Bogong Su (a), Jian Wang (b,*), Zhizhong Tang (c), Cihong Zhang (c) and Wei Zhao (d)

(a) Dept. of Computer Science, The College of Staten Island, 130 Stuyvesant Place, Staten Island, NY 10301, USA
(b) INRIA-Rocquencourt, Domaine de Voluceau, BP 105, 78153 Le Chesnay Cedex, France
(c) Dept. of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
(d) Dept. of CES, Case Western Reserve University, 10900 Euclid Ave., Cleveland, OH 44106-7071, USA

Received 4 December 1992; revised 23 March 1993; accepted 12 May 1993

Abstract. URPR-1 is a VLIW architecture which integrates nine PEs on a single chip. It adopts a pipeline register file to eliminate data anti-dependencies in the innermost loops of a program, thereby further exploiting instruction-level parallelism and increasing the execution speed of loops. The results of a preliminary evaluation on simulators show the URPR-1 architecture to have high performance. This paper first introduces the architecture of URPR-1 and its optimizing compiler and presents the basic principle of the pipeline register file; it then gives some representative examples to demonstrate the working procedure of URPR-1 and its compiler, and reports the results of preliminary experiments.

Keywords: VLIW; optimizing compiler; instruction-level parallelism; pipeline register file; architecture.

1. Introduction

Recently, instruction-level parallel (ILP) processing has been applied in microprocessor architecture; it exploits the parallel execution of the lowest-level operations and increases computer performance transparently. The VLIW architecture has proven to be a promising way to exploit instruction-level parallelism. It relies heavily on sophisticated software to generate the code required to use machine resources effectively [14]. Several VLIW processors such as iWARP and LIFE-1 have been presented [8, 9, 14], which reach high performance with the aid of their optimizing compilers. The URPR-1 architecture is a single-chip VLIW processor oriented to signal and image processing. The major optimization approaches of its compiler are Trace Scheduling [3] and URPR software pipelining [15], which can fully exploit instruction-level parallelism within innermost loops. In addition, we design a special piece of hardware, the pipeline register

*Corresponding author. Email: [email protected]

file to support the URPR software pipelining, thereby further increasing the performance of URPR-1. Area and pin count form the basis for intuitive decisions in modern processor design [5], which is particularly important when designing a single-chip VLIW architecture. The instruction words are so wide that either considerable area is consumed if the instruction memory is placed on-chip, or many pins are needed if it is placed off-chip. To reduce the number of pins we adopt on-chip instruction memory; however, its size is very limited with current VLSI technology. This decision forces us to choose optimization approaches with good time efficiency as well as good space efficiency for the compiler. The next section describes the architecture of the URPR-1 system, Section 3 describes the optimizing compiler, and Section 4 reports the results of some preliminary experiments on simulators.

2. The URPR-1 architecture

The whole URPR-1 system consists of three parts, as illustrated in Fig. 1. They are the host computer,


Fig. 1. Block diagram of the URPR-1 multiprocessor: host (SUN-4 with SPARC CPU, monitor ROM 32K x 32, RAM 128K x 64), interface unit (DMA1 host-interface, DMA2 interface-URPR), and URPR-1 processor (PE0-PE8, pipeline register file, multi-access memory, address generator, instruction memory).

interface unit and URPR-1 processor. The host computer is a SUN-4 workstation, which provides the user with a UNIX environment, compiles the applications and monitors the whole system's input and output. The interface unit transfers data and programs between the host and the URPR-1 processor. The serial portion of the application is also executed by the interface processor.

2.1. Architecture of the URPR-1 processor

The URPR-1 processor is a 16-bit fixed-point single-chip VLIW architecture, as shown in Fig. 2. Nine identical PEs, one pipeline register file shared by the PEs, one five-bank multi-access parallel data memory, one addressing unit, some control logic and on-chip memory for very long instruction words are placed on a single chip. All of this costs about 600,000 transistors in total, which is feasible with current VLSI technology; most of the transistors are for memory, registers and regularly structured processors. The chip has 128 pins and the machine cycle is 50 ns. Each PE is composed of one ALU, one parallel multiplier, and 16 registers, of which eight can be pipelined and the other eight are local, as illustrated in Fig. 3. The PE is organized around the register file. The ALU and multiplier do not access memory directly; all the operands needed and the results generated are read from and written


Fig. 2. URPR-1 architecture (5 x 18 x 16 cross-bar switch, 4K x 16 multi-access memory, 9 x 16 x 16 pipeline register file, address generators 0-4, address register file, 32-word x 568-bit instruction memory).

Fig. 3. Block diagram of PE (16-bit ALU, 16-bit multiplier).


to the pipeline register file under the control of the R/W section of the very long instruction word and the address generators. This approach can raise the processing speed with careful arrangement by the optimizing compiler at compile time. In order to provide data in a timely manner, 10 data buses between the register file and memory are established, and 16 data channels are built between the registers of adjacent PEs. Each PE can complete 13 operations within one machine cycle: one fixed-point multiplication, one arithmetic/logic operation, eight data transmissions between adjacent PEs, two memory accesses and one branch operation. The 9-PE URPR-1 processor can complete 101 operations within one machine cycle: 9 fixed-point multiplications, 9 arithmetic/logic operations, up to 72 data transmissions in the pipeline register file, 10 memory accesses and one branch operation. With a machine cycle of 50 ns, the URPR-1 processor has a peak performance of 2020 MIPS, or a peak computation rate of 360 MOPS (counting only multiplication and ALU operations), with a 160 Mword/s data transfer rate between PEs. Since the main application targets of URPR-1 are signal and image processing, the number of PEs and buses as well as the configuration of other hardware resources were decided according to our analysis of most of the routines in this field. The very long instruction word of URPR-1 is 568 bits wide, in three sections: (1) the PE section, 405 bits, consisting of nine subsections for the nine PEs; (2) the R/W section, 150 bits, containing 10 subsections for accesses to the five memory banks; (3) the BR section, 13 bits, for branch control. Only one branch can be processed within one machine cycle.

2.2. Pipeline register file

The pipeline register file is formed by connecting identically numbered registers of adjacent PEs, as shown in Fig. 4. It is the key component responsible for the high speed of URPR-1. Among the 16 registers of each PE, eight have data channels with the corresponding registers

Fig. 4. Pipeline register file.

in its adjacent PEs. There are three kinds of pipelining in the pipeline register file: (1) Vertical pipelining: data can be transferred between registers without going through the ALU, so that more CPU time can be squeezed out for other arithmetic computations. (2) Horizontal leftward pipelining: data in one register is transferred to the identically numbered register in the left adjacent PE. It is very useful for variable transmission between code segments within the same iteration. (3) Horizontal rightward pipelining: in the opposite direction, data in one register is transferred to the identically numbered register in the right adjacent PE. This kind of pipelining can be used to transfer data across iterations; it is very useful when a computed result is referenced in successive loop iterations. The pipeline register file can be used to eliminate data anti-dependencies. We have found that, given the pipeline register file and otherwise sufficient hardware resources, the length of the basic block of any non-recursive loop can be reduced to its minimum value, 1, by the URPR software pipelining algorithm; that means one iteration can be finished within one machine cycle.


for (i=1; i<=n; i++) {
    1: read(A[i]);
    2: B[i]=2*A[i];
    3: C[i]=A[i]+1;
    4: D[i]=B[i]-1;
}

Fig. 5. Principle of pipeline register file: (a) a loop example (above); (b) optimizing result of URPR; (c) optimizing result with hardware support; (d) working state of pipeline register file; (e) snapshot of loop execution.

The following example illustrates the working procedure of the pipeline register file. The loop shown in Fig. 5(a) needs 4n cycles on a conventional processor. After applying the URPR approach, this is reduced to 2n + 2 cycles, as shown in Fig. 5(b); the length of the new loop body (within the frame) is reduced to 2. However, it cannot be reduced further, even with more PEs and other hardware resources. The obstacle is due to


1: A=X=2
2: B=A*5
3: C=A-B

Fig. 6. Comparison between VME and pipeline register file approaches: (a) a loop body (above); (b) optimizing result of VME approach; (c) optimizing result of pipeline register file approach.

the data anti-dependencies. Lam proposed an approach named Variable Module Expansion (VME) for the Warp machine, which assigns different registers to the same variable in different iterations [11]. The pipeline register file we propose is a hardware approach to eliminating data anti-dependencies, so that the overlapping of different iterations can be enhanced. In the pipeline register file, a variable in one register can be transferred to another along the pipeline chain, as shown in Fig. 5(d). The number of registers one variable occupies is the length of its pipeline chain, and this length equals the variable's lifetime. The pipeline register file can greatly reduce the Register Occupying Time [12] of each register; in Fig. 5(c) it is reduced to 1. Four processors working together can finish one iteration within one machine cycle, so the total execution time of the loop is n + 5 machine cycles. Compared with a conventional processor, the speed is increased by a factor of four, and by a factor of two compared with the URPR software pipelining approach alone.

Figure 5(e) illustrates the overall execution procedure on URPR-1. The first three cycles are the prelude; from the 4th cycle to the nth cycle the loop body is overlapped by four adjacent iterations, and during this period an iteration is completed and another initiated every machine cycle. The last three cycles are the postlude. The example in Fig. 6 illustrates the difference between the pipeline register file and VME approaches. For VME, the total number of different register sets is the degree of loop unrolling [10], which is two in Fig. 6(b); the length of the optimized loop body is 2, and the execution time per iteration is one machine cycle since the operations in the loop body cover two iterations. In Fig. 6(c) the length of the optimized loop body is 1 using the pipeline register file approach. Figure 6 shows that the pipeline register file approach has the same time efficiency as Lam's VME approach but better space efficiency, which is critical for a single-chip architecture with limited on-chip instruction memory. It can also alleviate the bus traffic between PEs and memory banks and increase

the computation speed greatly for some applications such as Convolution.

Fig. 7. Block diagram of addressing unit (5 x 18 x 16-bit cross-bar switch, RAM banks, address generators, address register file).

2.3. Multi-bank parallel-access data memory

The total capacity of the on-chip data memory is 4K words, of which 2K are ROM for coefficients; the rest is RAM organized in four banks with a 25 ns working cycle. The five banks work in parallel so that there can be 10 simultaneous memory accesses in each machine cycle. All memory access operations are controlled by ten different sections of the very long instruction word. Each PE can access memory at most twice in one machine cycle via a 5 x 18 x 16-bit cross-bar switch. The structure of the data memory is shown in Fig. 7; PR0...PR8 represent the registers in the nine PEs, and AG0...AG4 are five address generators that provide the memory addresses. One of their main functions is the butterfly address transformation for the FFT. Each address generator can complete two calculations in each cycle, so at most ten addresses can be generated. The address calculations and memory accesses are executed in pipelined fashion. The addressing modes supported by URPR-1 are simple. The reason is that in the fields of signal and image processing, data are almost always processed as arrays, so in the innermost loop the data addresses are usually the values of a simple function of the iteration index. Indirect addressing can be realized by assigning register R0 in each PE as the indirect addressing register.

2.4. Interface unit

The interface unit is a bridge between the host and the URPR-1 processor. It is composed of five parts, as shown in Fig. 1. Their functions are as follows: (1) The resident monitor program in ROM, whose capacity is 32K x 32 bits. It controls the interface unit and monitors the execution state of the URPR-1 processor. (2) RAM. The very long instruction words generated by the optimizing compiler on the host SUN-4 are placed in this RAM, as are the data and results. Its capacity is 128K x 64 bits. (3) The interface processor. It is a SPARC processor, just like the CPU of the SUN-4. Two kinds of program can be executed on the interface processor: the monitor program in ROM, and the serial portion of the applications. (4) The communications unit with the host. It is composed of a DMA interface, a state register


and relevant control logic. The state register can be accessed by both the host and interface processor. (5) The communications unit with the URPR-1 processor. It is composed of a DMA interface, a backend state register and relevant control logic. The working state of the URPR-1 is stored in this register and it can be accessed simultaneously by both the interface processor and the URPR-1.

2.5. The working procedure of the URPR-1 system

The host computer has a driver to load the very long instruction words and data into the interface unit and to initiate execution of the interface by setting the relevant bit in its control register. After the interface is initiated, its processor runs the monitor in ROM, which watches the instructions loaded by the host computer. Once a loop is encountered, the monitor loads the corresponding very long instruction words and data into the URPR-1 processor and then starts it to execute the loop. If the URPR-1 processor needs more data or very long instruction words during execution, the monitor can also feed it. Once the URPR-1 processor terminates, it sends an interrupt signal to the interface. The interface processor then fetches the results, restores its state and resumes execution of the subsequent serial portion. When the whole application program is finished, the interface unit informs the host computer by setting a bit in its state register, and the host computer fetches the final results.

3. The optimizing compiler

In order to ease user programming and make full use of the parallelism provided by the URPR-1 hardware, we have developed an optimizing compiler. It converts applications in the C language into very long instruction words which can be executed efficiently on URPR-1. The compiler is composed of a front-end, data flow analysis, control flow analysis, code generation and optimization. The last is the key part of the compiler. Besides traditional local and global code compaction techniques, we adopt the two-level pipelining [17], general URPR [18] and GURPR* [19] approaches, which suit the limited capacity of the on-chip instruction memory.

3.1. Two-level pipelining

The basic characteristic of the URPR-1 architecture is the pipeline register file. It can transfer data between PEs and eliminate data anti-dependencies. But to reach these targets, the compiler not only has to carefully allocate the register resources and solve the problem of register spilling, but also has to consider the following new problems: (1) how to form a pipeline chain and assign it to the variables read and written in different PEs, in order to ensure correct data transmission; (2) how to form a pipeline chain and assign it to the same variable, in order to eliminate the inter-body data anti-dependencies. Since the traditional methods, which first do resource allocation and then code optimization, cannot solve these problems, we propose a phase-coupled two-level pipelining technique. Its basic idea can be summarized as follows: (1) Apply the URPR software pipelining technique to the intermediate code. This is the first level of pipelining, done before register allocation. The effect on code optimization of data dependencies caused by register reuse can thus be eliminated, so the inter-body and intra-body parallelism can be fully exploited. In addition, the fact that we consider only W/R inter-body dependencies but not inter-body R/W dependencies implies variable renaming, which can further eliminate some inter-body dependencies. Actually we have already considered the mapping from variables to register chains; when dealing with the R/W data dependencies, one variable can reside in more than one register belonging to different PEs. (2) Software pipelining before register allocation actually divides the intermediate code into several segments without violating the intra- and inter-body data dependencies. Each segment is


assigned to a specific PE, which determines the distribution of each variable over the PEs. This distribution directs the register allocation effectively; the principle of the allocation is that a register chain connects those PEs containing the same variable. (3) Register spilling can be solved by inserting no-op cycles into the intermediate code to transfer data between memory and registers. Besides, as the main purpose of the first-level pipelining is to exploit the parallelism among the PEs only, after register allocation we apply the second-level pipelining to further exploit the parallelism within each PE.

3.2. General URPR algorithm

A loop-carried dependency is a data dependency between operations of different iterations [6]. It has a serious impact on restricted software pipelining techniques such as URPR. We adopt the general URPR algorithm [18] to solve this problem; it maintains the good space efficiency of URPR while greatly improving time efficiency. The general URPR algorithm has three phases: (1) Pre-processing: first construct the Data Dependence Graph (DDG) of the loop and find all the cycles in it; then judge its restrictability according to those cycles; finally unroll the loop the appropriate number of times. This pre-processing eliminates the influence of inter-body dependencies on software pipelining, reaching the same time efficiency as general software pipelining approaches. (2) Operation scheduling: the purpose is to generate a loop body with a smaller inter-body distance for URPR software pipelining. The basic idea is as follows: first partition the strongly connected components of the DDG; then schedule the operations in each component (the loop-carried dependencies serve as the main heuristic); finally build an equivalent operation for each scheduled component to reconstruct a DDG, and apply list scheduling to it.


(3) Apply the URPR software pipelining algorithm to the scheduled loop body.

3.3. GURPR* algorithm

The URPR approach is only suitable for simple loops with one basic block. The GURPR approach can be applied to any loop [16]; however, its time efficiency is low. Perfect Pipelining [1] and Pipelining Scheduling [2] have good time efficiency, but their space efficiency is not suitable for an architecture with limited on-chip instruction memory like URPR-1. Lam's approach imposes many restrictions and its time efficiency is also not good. So we have adopted the GURPR* algorithm, which has good space efficiency as well as good time efficiency [19]. GURPR* adopts the basic idea of the URPR software pipelining approach, that is, UnRolling-Pipelining-Rerolling; this ensures GURPR* good space efficiency. GURPR* also adopts the concept of the Parallel Program Flow Graph [1] to treat the loop body as a whole for pipelining, so it also has good time efficiency. GURPR* contains the following steps: (1) Compact the loop body by a global compaction algorithm with some heuristics to reduce the inter-body distance D, and represent the compacted result by a parallel program flow graph [1]. (2) Build the global inter-body DDG, and determine the inter-body distance D and the number of unrolled bodies K. (3) Pipeline the K bodies while maintaining the execution order of operations determined during global compaction of the loop body; this constraint makes rerolling easier. The pipelining result is represented as an overlap of parallel program flow graphs. (4) Reroll the pipelining result, delete the redundant operations and generate a new loop body. Finally, do the bookkeeping with the operation transformation rules and construct the loop back edges, prelude and postlude of the new loop.


4. Experiments

We have done some experiments on simulators to verify the design of the URPR-1 processor and interface unit, to test the very long instruction words produced by the prototype of the optimizing compiler, and to evaluate the whole system. Here we report our preliminary experiments on signal and image

for (i=0; i<1024; i++) {
    k0 = Br[i]*Wr[i] - Bi[i]*Wi[i];
    k1 = Bi[i]*Wr[i] + Br[i]*Wi[i];
    Br'[i] = Ar[i] - k0;
    Bi'[i] = Ai[i] - k1;
    Ar'[i] = Ar[i] + k0;
    Ai'[i] = Ai[i] + k1;
}

Fig. 8. Example of the innermost loop of a radix-2 1024-point complex FFT: (a) source program (above); (b) sequential code; (c) very long instruction words; (d) working state of pipeline register file and function units.


processing applications, and demonstrate the working procedure of the compiler and the pipeline register file through three representative examples. We also give the performance of some representative signal and image processing programs. FFT: Figure 8 shows the butterfly computation of the innermost loop of a 1024-point radix-2 complex FFT on the URPR-1 processor. Figure 8(a) is the source code written in C. Figure 8(b) illustrates its intermediate code; t0...t4, d1...d3 and k3...k5 are intermediate variables, and t0 <-, k4 <-, ... are data transfer operations. The segmentation of the intermediate code is also indicated in Fig. 8(b); each segment is allocated to a single PE. Figure 8(d) shows the working state of the pipeline register file

for (i=0; i<n; i++) {
    x = k0*y[i] + k1*y[i+1] + k2*y[i+2] + k3*y[i+3] + k4*y[i+4]
      + k5*y[i+5] + k6*y[i+6] + k7*y[i+7] + k8*y[i+8];
    if (x > T) b[i] = 255; else b[i] = 0;
}

(a) Source program

and function units; the pipeline chains represent the data transmissions between PEs. The very long instruction words produced by the optimizing compiler are presented in Fig. 8(c). Cycles 1 to 10 are the prelude and cycles 1025 to 1034 are the postlude; the new loop body is executed from cycle 11 to cycle 1024. The length of the new loop body is 1, which means that one butterfly computation can be completed within one machine cycle by seven PEs. As 512 x 10 butterfly computations are needed in a 1024-point FFT, the time to complete them is 512 x 10 x 50 ns = 256 us. With the prelude and postlude time and the time for loading the data and very long instruction words, 27 us in all, added, the total time for the 1024-point FFT becomes 283 us.

(1) x=0; z=k0*t0; Load t1,y[i+1];
(2) x=x+z; z=k1*t1; Load t2,y[i+2];
(3) x=x+z; z=k2*t2; Load t3,y[i+3];
(4) x=x+z; z=k3*t3; Load t4,y[i+4];
(5) x=x+z; z=k4*t4; Load t5,y[i+5];
(6) x=x+z; z=k5*t5; Load t6,y[i+6];
(7) x=x+z; z=k6*t6; Load t7,y[i+7];
(8) x=x+z; z=k7*t7; Load t8,y[i+8];
(9) x=x+z; z=k8*t8;
(10) x=x+z;
(11) if (x>T)
(12) b[i]=255;
(13) else b[i]=0;

(b) Intermediate code

Fig. 9. Example of the zero-crossings: (a) source program; (b) intermediate code; (c) result of pipelining; (d) result of rerolling; (e) working state of pipeline register file and function units.

Fig. 9 (continued): (e) working state of pipeline register file and function units.

Find zero-crossings: Find zero-crossings is a global case, because there is a branch within the code, as shown in Figs. 9(a) and 9(b). We have applied the GURPR* algorithm for global software pipelining. The results of pipelining and rerolling are shown in Figs. 9(c) and 9(d) respectively. The length of the new loop body of very long instruction words is 3, including an additional cycle for data transmission. Figure 9(e) illustrates the working state of the function units during these three very long instruction words. The total time of Find zero-crossings for a 512 x 512 matrix is 42.3 ms. Maximum filtering: Maximum filtering is also a global case; there are three branches within the code, as shown in Figs. 10(a) and 10(b). The results of pipelining and rerolling with the GURPR* algorithm are shown in Figs. 10(c) and 10(d) respectively. Even though the length of the loop body is reduced from 10 to 3 by global software pipelining, Fig. 10(e) shows that only three PEs work together, which means that the parallelism of the hardware resources in the URPR-1 processor is not fully exploited, because URPR-1 has only one branch mechanism. Table 1 lists the performance of some representative algorithms in the signal and image processing fields. The comparison between the length of the sequential code and that of the very long instruction words shows the high efficacy of the URPR-1 optimization. The execution time in Table 1 covers only the computation of the innermost loop on the URPR-1 processor, excluding the time for the serial portion executed by the interface unit. There are 9 TAP computations within the innermost loop of the FIR Filter, which can be completed within one machine cycle on the URPR-1 processor; therefore the execution time of each FIR Filter TAP computation equals 50 ns/9, which is less than 6 ns. The loading time includes the time for data loading and unloading, as well as the time for loading the very long


instruction words. The total time is the sum of the execution time and the loading time. The computation rates in the last column of Table 1 are calculated by dividing the total number of operations in the tasks by the execution time; all operations are assumed to execute an equal number of times. From Table 1, we may find that the Convolution, FIR Filter, FFT, and Lattice Filter reach high

for (i=0; i<n; i++) {
    x = -1;
    if (x < a[i]) x = a[i];
    if (x < a[i+1]) x = a[i+1];
    if (x < a[i+2]) x = a[i+2];
    b[i] = x;
}

(a) Source program

(1) x=-1; Load t0,a[i];
(2) if (x<t0)
(3) x=t0;
(4) Load t1,a[i+1];
(5) if (x<t1)
(6) x=t1;
(7) Load t2,a[i+2];
(8) if (x<t2)
(9) x=t2;
(10) b[i]=x;

(b) Intermediate code

Fig. 10. Example of maximum filtering: (a) source program; (b) intermediate code; (c) result of pipelining; (d) result of rerolling; (e) register allocation and FU status.


Fig. 10 (continued): (e) register allocation and FU status.
computation rates. Because the number of PEs and buses, as well as the configuration of the other hardware resources, were determined from an analysis of these tasks, the parallelism of the URPR-1 processor is exploited quite fully; URPR-1 is thus well suited to such applications. Tasks with a global innermost loop, such as Compute Gradient and Laplacian Edge Detection, also reach a high computation rate, but tasks such as Maximum Filtering have a low computation rate, because the original sequential code is short and the optimization is limited by the branch. This indicates that more powerful hardware support for the branch mechanism is needed.

Table 2 lists the execution times of some signal and image processing tasks on a SUN-4 workstation and on the URPR-1 System. The speedup of the innermost loops on the URPR-1 processor over the SUN-4 workstation equals (the ratio of clock cycles per instruction) x (the ratio of machine cycle times) x (the ratio of the length of the sequential code to the length of the very long instruction words); it can reach several hundreds because of the optimization of the innermost loop length presented in Table 1. Even though the speedup of entire programs is restricted by the serial portion of the programs, according to Amdahl's Law [7], the performance of the URPR-1


Table 1
Performance on URPR-1 processor

                                               Length of innermost loop
Task                                           Seq. code  VLIW words   Execution time  Loading time  Total time  MIPS
FIR filter TAP                                    35          1        6 ns            -             -            700
IIR filter (8 coefficients)                       33          2        100 ns          -             -            330
Lattice filter                                    30          1        13 ns           -             -            600
1024 point FFT complex (radix 2)                  20          1        256 μs          27 μs         283 μs       600
Convolution (512 x 512 matrix, 3 x 3 window)      53          1        13 ms           3.3 ms        16.3 ms     1060
Image LPC coding (512 x 512 matrix)               23          2        26 ms           3.3 ms        29.3 ms      230
Find zero-crossings (512 x 512 matrix)            29          3        39 ms           3.3 ms        42.3 ms      193
Maximum filtering (512 x 512 matrix)              11          3        39.2 ms         3.3 ms        42.5 ms       73
Laplacian edge detection (512 x 512, 3 x 3)       79          4        52 ms           3.3 ms        55.3 ms      395
Compute gradient, 9 x 9 Canny (512 x 512)        112          4        52 ms           3.3 ms        55.3 ms      560

Table 2
Performance on URPR-1 system

                                               Execution time (ms)
Task                                           SUN-4      URPR-1 system
FIR filter                                        33          3.7
IIR filter                                        27          3
Lattice filter                                  4752        544
1024 point FFT complex (radix 2)                 125         22
Convolution (512 x 512 matrix, 3 x 3 window)    9766       1101
Image LPC coding (512 x 512 matrix)             4108        486
Find zero-crossings (512 x 512 matrix)         10928       1256
Maximum filtering (512 x 512 matrix)           24028       2712
Laplacian edge detection (512 x 512, 3 x 3)     9523       1113
Compute gradient, 9 x 9 Canny (512 x 512)      17818       3200

Table 3
Compile time

Task                                           Compile time (sec)
FFT                                               5.7
Convolution                                       7.7
FIR filter                                        4.8
IIR filter                                        4.7
Lattice filter                                    4.6
LPC coding                                        3.9
Laplacian edge detection                         10.6
Compute gradient (9 x 9 Canny operator)          14.8

System, which uses the URPR-1 processor as an accelerator, is quite good: the speedup of entire programs running on the URPR-1 System ranges from 5 to 9. Table 3 shows the compile times for the innermost loops of some signal and image processing applications. Owing to the low complexity of the URPR software pipelining algorithm used in the optimizing compiler, the overhead of the optimization that produces the very long instruction words is acceptable.


5. Conclusion

The URPR-1 processor is a VLIW architecture that fully exploits instruction-level parallelism to increase machine performance transparently: users of the URPR-1 System need not modify their algorithms and programs, and the compile-time overhead is tolerable. We have explored a combined approach in designing the URPR-1 System. First, it adopts a hardware support, the pipeline register file, to enhance the optimization efficacy of software pipelining and to provide a high data transfer rate between the PEs. Second, it adopts several optimization algorithms with good time and space efficiency to suit the limited capacity of the on-chip instruction memory. We have completed the architecture design of the whole URPR-1 System and the logic design of the URPR-1 processor and the interface unit, and have implemented a prototype of the optimizing compiler in the C language. The results of preliminary evaluation show that URPR-1 achieves high performance. Currently, we are planning more extensive tests.

Acknowledgement

We wish to thank Mr. Yuanlong Wang, who did most of the experimental work. We gratefully acknowledge the support of the National Science Foundation of China.

References

[1] A. Aiken and A. Nicolau, Perfect pipelining: A new loop parallelization technique, Technical Report 87-873, Dept. of Computer Science, Cornell Univ., 1987.
[2] K. Ebcioglu and A. Nicolau, A global resource-constrained parallelization technique, Proc. 3rd Internat. Conf. on Supercomputing, Crete (1989) 154-163.
[3] J.A. Fisher, Trace scheduling: A technique for global microcode compaction, IEEE Trans. Comput. C-30(7) (1981).
[4] J.A. Fisher and B.R. Rau, Instruction-level parallel processing, Science 253 (Sept. 1991) 1233-1241.
[5] M.J. Flynn, Keynote address: Instruction sets and their implementation, Proc. 23rd Annual Workshop on Microprogramming (Nov. 1990) 1-6.
[6] F. Gasperoni, Compilation techniques for VLIW architectures, Technical Report 435, New York University, Mar. 1989.
[7] J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach (Morgan Kaufmann, Los Altos, CA, 1990).
[8] J. Labrousse and G. Slavenburg, A 50 MHz microprocessor with a very long instruction word architecture, Proc. 1990 IEEE Internat. Solid State Circuits Conf. (Feb. 1990) 44-45.
[9] J. Labrousse and G. Slavenburg, CREATE-LIFE: A modular design approach for high performance ASICs, COMPCON (1990).
[10] M.S. Lam, A systolic array optimizing compiler, Ph.D. Thesis, CS Dept., Carnegie Mellon Univ., May 1987.
[11] M.S. Lam, Software pipelining: An effective scheduling technique for VLIW machines, Proc. SIGPLAN'88 Conf. on Programming Language Design and Implementation (June 1988).
[12] R. Mueller, B. Su et al., A case study in signal processing microprogramming using the URPR software pipelining technique, Proc. 19th Annual Workshop on Microprogramming (Oct. 1986) 109-115.
[13] C. Peterson, J. Sutton and P. Wiley, iWARP: A 100-MOPS, LIW microprocessor for multicomputers, IEEE Micro (June 1991) 26.
[14] H. Stone and J. Cocke, Architecture in the 1990s, Computer (Sept. 1991) 30-38.
[15] B. Su, S. Ding and J. Xia, URPR - An extension of URCR for software pipelining, Proc. 19th Annual Workshop on Microprogramming (Oct. 1986) 104-108.
[16] B. Su, J. Wang and J. Xia, Global microcode compaction with timing constraints, Proc. 21st Annual Workshop on Microprogramming (Nov. 1988).
[17] B. Su, J. Wang et al., A software pipelining based VLIW architecture and optimizing compiler, Proc. 23rd Annual Internat. Workshop on Microprogramming and Microarchitecture (Nov. 1990) 17-27.
[18] B. Su and J. Wang, Loop-carried dependence and the improved URPR software pipelining approach, Proc. 24th Annual Hawaii Internat. Conf. on System Sciences (Jan. 1991) 366-372.
[19] B. Su and J. Wang, GURPR*: A new global software pipelining algorithm, Proc. 24th Annual Internat. Symp. on Microarchitecture (Nov. 1991).


Bogong Su received the B.S. degree in computer engineering from Tsinghua University, China, in 1959. Since 1959, he has been a faculty member at Tsinghua University. He was an associate visiting scientist at the Courant Institute of New York University between 1979 and 1982, and a visiting professor in the Department of Computer Science of Colorado State University in 1986. He has been a professor in the Department of Computer Science and Technology of Tsinghua University since 1989. He is now an adjunct faculty member in the Department of Computer Science of The College of Staten Island and Queens College of The City University of New York. He has published extensively in the areas of microprogramming, VLIW architecture, optimizing compilers and distributed artificial intelligence. Prof. Su is a senior member of the IEEE and the vice chair of the Distributed Computer System and Microprogramming Society in China.

Jian Wang received the B.S., M.S. and Ph.D. degrees in computer science from Tsinghua University, China, in 1986, 1988 and 1991, respectively. Since February 1992, he has been working as a postdoctoral research scientist at INRIA (Institut National de Recherche en Informatique et en Automatique) in France; he will finish this postdoctoral program at the beginning of August 1993, and from October 1993 he will be at the Vienna University of Technology. Dr. Wang has about thirty-five publications in the areas of parallel processing, architectures and compilation techniques for fine-grain parallelism, code scheduling, loop scheduling, software pipelining, VLIW architecture and distributed artificial intelligence.

Chihong Zhang received the B.S. and M.S. degrees in computer science and technology from Tsinghua University, China, in 1984 and 1987, respectively. From 1987 to 1993, he was a lecturer in the Department of Computer Science and Technology of Tsinghua University. His research interests are in fine-grain parallel optimizing compilers and VLIW architecture.

Zhizhong Tang received the B.S. degree in autocontrol from Tsinghua University, China, in 1970. Since 1970, he has been a professor in the Department of Computer Science and Technology of Tsinghua University. His research interests are in VLIW architecture and the instruction-level parallel optimizing compiler.

Wei Zhao earned the B.S. degree and the M.S. degree in computer science from Tsinghua University, China, in 1987 and 1989, respectively. He is currently working toward the Ph.D. degree in computer science at the Department of Computer Engineering and Science, Case Western Reserve University. His research interests include parallel computer architecture, its optimizing compiler and their applications in digital signal processing and scientific computation. Mr. Zhao will complete his doctoral studies in 1996.