Microprocessing and Microprogramming 35 (1992) 595-602 North-Holland
A COMPARISON STUDY OF MINIMIZATION METHODS OF UNIT INTERCONNECTION IN VLIW PROCESSORS
C. CARRIERE, M. AUGUIN, F. BOERI, G. MENEZ LASSY, Département de l'I3S, CNRS et Université de Nice et de Sophia-Antipolis, 41, Boulevard Napoléon III, F-06041 Nice cedex, France e-mail:
[email protected]
Previous studies have shown the significant part played by interconnection optimization in architecture synthesis. During this process, operators are first allocated to functional units; a data path allocation phase follows, which consists in minimizing the numbers of registers, multiplexers and physical links. In this paper, we briefly describe four methods for solving the data path allocation problem and compare their efficiency on scalar and vector algorithms. Three of the methods are previously published techniques; the fourth is a new procedure based on the maximum compatible mechanism.
1. INTRODUCTION
The importance of ASIC development is steadily increasing. Indeed, (i) research on synthesis methods leads to more efficient tool proposals, (ii) the architecture of a processor is "optimal" only for a class of algorithms. The present trend is to build specialized circuits. Their design requires high-level knowledge and a long design cycle, and both factors tend to increase the circuit cost. High level synthesis tools reduce the design cost [1]. Typically, the inputs of these tools consist of a program written in a high-level language and a set of constraints. From that program, a compiler generates a flow graph [2], [3]; the functional units needed to execute it are then selected. The next phase is data path allocation. Its aim is to minimize the numbers of registers, multiplexers and physical links [4]. As unit interconnections constitute a significant part of the design cost [5], [6], we propose in this paper to compare four methods implemented in the CAPSYS tool [7], [8]. CAPSYS is a CAD tool for the synthesis of dedicated synchronous parallel processors.
The simplest implemented method is the greedy procedure applied directly to the edges of the flow graph. Unfortunately, it does not always give the minimum number of registers. Thus a method based on the edge colouring technique [5] is considered. However, the consequence of minimizing registers is an increase in multiplexers and physical links. To reduce this drawback, the edges of the flow graph are first grouped according to data transfers between functional units. A greedy method may then be applied again. A new minimization method based on the maximum compatible concept was also developed. These methods are compared on convolution and SOR algorithms.
This paper presents (i) the computer aided design tool CAPSYS, (ii) the data path allocation problem, (iii) the considered data path allocation methods, (iv) their comparison for scalar and vector input algorithms and (v) some concluding remarks.
2. OVERVIEW OF CAPSYS
2.1 Overall description
CAPSYS [7], [8] is a computer aided design tool (Fig. 1) that takes as input a program of a target application and a set of constraints and provides:
• an optimized dedicated parallel architecture,
• the associated object code,
• a performance report.
The functional units which may be selected by CAPSYS are described in a library. The aim of CAPSYS is to exhibit architectures by exploring a design space. User constraints limit this space so that only architectures satisfying the designer's requirements are provided.
[Figure 1: block diagram. Inputs of CAPSYS: program (LEC), constraints, library of functional units. Outputs: architectures, object code, performances.]
Fig. 1. Inputs/outputs of CAPSYS
The synthesis procedure is decomposed into two separate phases: (i) compilation, functional unit selection and operator allocation, (ii) data path allocation. The programming language (LEC) is an augmented subset of ADA, i.e., scalar operators are extended to vector operators in order to take advantage of vector processing.
2.2 The library of functional units and user constraints
The library consists of a set of functional units (FU), each one described by physical parameters (name, technology, area, consumption for example) and by a basic function list. With each function is associated: (i) a scheduling template that represents the occupancy times of the different pipeline stages used for executing the function, (ii) a routing scheme which gives the different ways to route operands to the FU's inputs. Physical parameters enable the evaluation of the overall cost of candidate architectures. User constraints specify the physical boundaries which cannot be exceeded by the synthesis process.
2.3 The predefined model of architecture in CAPSYS
CAPSYS may accept an optional predefined architecture. For example, it may be considered as a compiler for this architecture or as a synthesis tool which is able to take advantage of the introduction of new functional units in the library. The input program is translated into a flow graph which describes the data precedences (edges) and the basic functions (nodes) to be applied to data. The set of basic functions restricted to the control scheme (VLIW) and the memory management defines the generic model of CAPSYS. One representation of this model is illustrated in Fig. 2. It consists of five blocks: control unit (CU), addressing unit (AU), memory bank (MB), memory control unit (MCU), and processing unit (PU). Moreover, FUs of the generic model contain the capabilities required for connecting processors in a SIMD machine which exploits vector processing. Connections between units of the predefined architecture may have pipeline registers.
3. THE DATA PATH ALLOCATION PROBLEM
3.1 Data base
The compilation phase of CAPSYS provides a flow graph [9]. Nodes represent scheduled basic functions and edges correspond to information transfers between functional units. Three pieces of information are attached to each node: the basic function, the name of the functional unit which implements the basic function, and the scheduling times. An interesting portion of the flow graph derived from the factorial algorithm (Fig. 3) is depicted in Fig. 4.
procedure factorial is
  f, i, n : INTEGER;
begin
  f := 1; n := 6; i := 2;
  for k in 2..n loop
    f := f*i;
    i := i+1;
  end loop;
end;
Fig. 3. Factorial algorithm
Let us consider an incompletely specified architecture where the PU is empty. After selection and allocation of FUs, only the following information is obtained: (i) the set of functional units, except registers and multiplexers, (ii) the object code associated with these functional units. Fig. 6a shows a candidate architecture obtained for the factorial algorithm. The aim of the data path allocation procedure is to define the interconnections between functional units. Generally, these interconnections require registers, multiplexers and physical links [5]. In addition to the flow graph, the data path allocation procedure needs the routing schemes of the basic functions supported by each functional unit and optional information about the predefined architecture. This information may predefine physical links, registers and multiplexers in the architecture.
3.2 Problem decomposition
In the sequel, the term "bus" is redefined and denotes a sub-flow graph. Edges of a bus correspond to data transfers, and nodes define basic functions that consume data from, or produce data to, the bus. For each edge involved in a bus, we define two times: the time when a functional unit produces a data item (Tw) and the time when a functional unit consumes that data item on its inputs (Tr).
Fig. 2. The generic model of a processor
When the delay (Tr-Tw) is greater than zero, a register is required on the bus. Values Tw and Tr are also the times when the register is written and read. Physically, a bus consists of (i) physical links, multiplexers and registers which ensure information transfers, (ii) functional units which implement the basic functions of the sub-flow graph.
[Figure 4: flow graph. Each node gives the functional unit (FU), the basic function (Fct), the start cycle (ts) and the end cycle (te):
FU: cu, Fct: lit, ts=8, te=8
FU: mcu, Fct: mb_to_pu, ts=3, te=3
FU: mcu, Fct: cu_to_pu, ts=0, te=0
FU: cu, Fct: mcu_to_au, ts=8, te=8
FU: au, Fct: adr_depl, ts=8, te=8
FU: amd29c332, Fct: add, ts=4, te=8
FU: mcu, Fct: pu_to_mb, ts=9, te=9
FU: mb, Fct: lit, ts=9, te=9
Basic functions:
- lit: CU provides a value needed by PU.
- adr_depl: addressing mode of AU.
- x_to_y: these functions transfer data from FU x to FU y through MCU.
- add: addition operator.]
Fig. 4. A flow graph deduced from the factorial algorithm
Five steps are needed to get the whole architecture: (i) reduction of the graph according to the predefined buses, (ii) assignment of the remaining edges to buses, (iii) determination of registers and their controls, (iv) determination of multiplexers and their controls, (v) determination of physical links between the whole set of units (functional units, registers and multiplexers).
The second step is very important since it has a direct influence on the next steps. Only an overview of this decomposition process is presented, using the greedy method.
4. DATA PATH ALLOCATION METHODS
A brief description of the data path allocation methods used in CAPSYS is presented in this section. More details are given in [14]. Only data path allocation mechanisms for scalar input programs are considered below. These methods may be grouped into two classes. The first operates directly on the flow graph. The second preprocesses the flow graph in order to favour data transfers relative to FUs.
4.1 Direct allocation methods
4.1.1 The greedy method
If a predefined architecture contains FUs, registers, multiplexers and physical links, the flow graph can be reduced: predefined buses are then created. Edges of the flow graph may be removed if (i) the information transfers modelled by these edges use FUs in the predefined architecture, (ii) predefined physical links support these transfers and (iii), when the delay (Tr-Tw) is greater than zero, a register is located on that path. Then nodes without incident edges are removed. The reduction of the previous flow graph (Fig. 4) according to the predefined architecture of Fig. 2 is given in Fig. 5.
The remaining edges of the reduced flow graph are distributed in new buses with a greedy method. An edge Ai is associated with bus Bk if Tw(Ai)≠Tw(Aj) and Tr(Ai)≠Tr(Aj), ∀ Aj ∈ Bk, i.e., edges Ai and Aj are time compatible. If no bus satisfies these two conditions, a new one is created. The greedy algorithm applied to the graph of Fig. 4 determines two buses as described in Fig. 5.
[Figure 5: reduced flow graph. Nodes: FU: mcu, Fct: mb_to_pu, ts=3, te=3; FU: mcu, Fct: cu_to_pu, ts=0, te=0; FU: amd29c332, Fct: add, ts=4, te=8; FU: mcu, Fct: pu_to_mb, ts=9, te=9. Edges: a2: Tw=3, Tr=4; a3: Tw=0, Tr=4; a5: Tw=8, Tr=9.]
Fig. 5. The reduced flow graph and the bus list
Once buses are defined, the list of registers with their sizes is produced. A register memory address is assigned to each edge. If data routed by a bus have overlapping lifetimes, then they must be stored at different addresses. The complete definition of a bus involves connections of input and output ports of FUs onto that bus. This requires removing the ambiguity due to the commutativity of some basic functions. For example, the routing schemes associated with the add function, available in the library, may be: (data 1 -> port1 and data 2 -> port2) or (data 1 -> port2 and data 2 -> port1). Routing schemes are chosen such that the number of input ports of FUs connected to a bus is minimized.
The next step consists in determining the multiplexers not contained in the predefined architecture. There are two categories of multiplexers: multiplexers located before registers and those connected to input ports of FUs. A multiplexer is required when a register or an FU input port receives data issued from different buses. After the whole set of units is defined, the physical links and the binary object code associated with multiplexers and registers are directly deduced. The architecture is then completely specified. For example, the one obtained for the factorial algorithm is depicted in Fig. 6b.
[Figure 6a: architecture diagram.]
Fig. 6a. A candidate architecture obtained for the factorial algorithm
List of FUs involved in the PU: Amd29C332: ALU; Amd29C323: integer multiplier; R1, R2: two-word registers; R3: one-word register; MPX1: multiplexer with three inputs; MPX2, MPX3: multiplexers with two inputs. These FUs drive 32-bit data, except R3 which drives a flag encoded on one bit.
The greedy method is not an optimal procedure with respect to the number of buses. We therefore propose, in the next section, another approach based on the edge colouring method.
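The greedy bus assignment of this section can be sketched as follows. This is a minimal illustration in Python, not the CAPSYS implementation: the `Edge` type and the function names are assumptions introduced here, and only the time compatibility test and the register condition come from the text. The example uses the three edges listed for Fig. 5.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    name: str
    tw: int  # time at which the producing FU writes the data (Tw)
    tr: int  # time at which the consuming FU reads the data (Tr)


def time_compatible(a: Edge, b: Edge) -> bool:
    """Two edges may share a bus if both their Tw and their Tr differ."""
    return a.tw != b.tw and a.tr != b.tr


def needs_register(e: Edge) -> bool:
    """A register is required on the bus when the delay Tr - Tw is positive."""
    return e.tr - e.tw > 0


def greedy_buses(edges: list[Edge]) -> list[list[Edge]]:
    """Put each edge on the first bus whose edges are all time compatible with it."""
    buses: list[list[Edge]] = []
    for edge in edges:
        for bus in buses:
            if all(time_compatible(edge, other) for other in bus):
                bus.append(edge)
                break
        else:  # no existing bus accepts the edge: create a new one
            buses.append([edge])
    return buses


# Edges of the reduced flow graph of Fig. 5.
edges = [Edge("a2", 3, 4), Edge("a3", 0, 4), Edge("a5", 8, 9)]
buses = greedy_buses(edges)
bus_names = [[e.name for e in bus] for bus in buses]
print(bus_names)  # [['a2', 'a5'], ['a3']]
```

Edges a2 and a3 share Tr=4 and therefore end up on different buses, which gives the two buses mentioned for Fig. 5. The result of a greedy pass depends on the order in which edges are visited, which is one reason the method is not optimal.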
4.1.2 The edge colouring method
The method developed in [5] is based on an edge colouring algorithm applied to a bipartite state graph. This graph can be coloured with a minimum number of colours, which corresponds to the maximum number of edges incident to a node. A colour represents a bus. The flow graph is transformed into a bipartite graph. A node of the bipartite graph corresponds to a time of data production (Tw) or data consumption (Tr) by one FU. Two types of nodes are defined: end nodes associated with data production times and start nodes associated with data consumption times. To each edge of the flow graph corresponds a new edge between an end node and a start node. Let Ni and Nj be two nodes of the flow graph connected by an edge Ak. Let Fi and Fj be their associated basic functions, with production time Twi and consumption time Trj. Then end node Ei corresponding to Twi and start node Ej corresponding to Trj in the bipartite state graph are connected by an edge. A different colour is assigned to every edge incident to a node in the bipartite state graph. If an edge is not yet coloured and if its end and start nodes have a common available colour, that colour is chosen. Otherwise the graph is recoloured with the Kempe subgraph [5]. The coloured bipartite graph of the previous flow graph example is illustrated in Fig. 7.
[Figure 7: coloured bipartite graph; the colour number is 2.]
Fig. 7. A coloured bipartite graph deduced from the factorial algorithm.
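A possible way to colour the bipartite graph is sketched below. It is the textbook alternating-chain (Kempe chain) procedure for bipartite edge colouring, written here as an illustration and not taken from [5]. As a simplifying assumption, end and start nodes are identified by their times only (`("w", Tw)` and `("r", Tr)`), and the function name is hypothetical.

```python
from collections import defaultdict


def colour_edges(edges):
    """Colour bipartite edges (tw, tr) so that edges sharing a Tw or a Tr differ."""
    ends = [(("w", tw), ("r", tr)) for tw, tr in edges]
    at = defaultdict(dict)          # node -> {colour: index of the edge using it}
    colour = [None] * len(edges)

    def first_free(node):
        c = 0
        while c in at[node]:
            c += 1
        return c

    for i, (u, v) in enumerate(ends):
        a, b = first_free(u), first_free(v)
        if a in at[v]:
            # Colour a is taken at v: swap a and b along the alternating
            # (Kempe) chain that starts at v, which frees a at v.
            chain, node, c = [], v, a
            while c in at[node]:
                j = at[node][c]
                chain.append(j)
                x, y = ends[j]
                node = y if node == x else x
                c = b if c == a else a
            for j in chain:
                for n in ends[j]:
                    del at[n][colour[j]]
            for j in chain:
                colour[j] = b if colour[j] == a else a
                for n in ends[j]:
                    at[n][colour[j]] = j
        colour[i] = a
        for n in (u, v):
            at[n][a] = i
    return colour


# (Tw, Tr) pairs of edges a2, a3 and a5 of Fig. 5.
colours = colour_edges([(3, 4), (0, 4), (8, 9)])
print(colours, "number of colours:", len(set(colours)))
```

On this example two colours are used, in agreement with the colour number given for Fig. 7: a2 and a3 both arrive at the start node Tr=4 and must receive different colours.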
This edge colouring method is optimal with respect to the number of buses. However, many example studies show that minimizing registers directly on the flow graph tends to increase multiplexers and physical links.
4.2 Allocation methods with preprocessing
4.2.1 Preprocessing and greedy algorithms
Preprocessing takes into account the physical constraints due to the convergence of data transfers toward an input port of an FU. Edges are distributed in logical connections according to the following property: time compatible edges which represent information transfers to the same input port of a functional unit are grouped. This property was chosen because it gives the best results for the processed examples. A logical connection constitutes a sub-bus. Sub-buses are grouped in buses according to the greedy technique. Two connections may belong to the same bus if they are time compatible, i.e., if their edges are themselves time compatible. Unfortunately, this method is not optimal with respect to the number of buses. The edge colouring algorithm cannot be applied to logical connections because colouring an edge forces all edges of the same connection to take the same colour. In the literature, some methods used to solve the data path allocation problem are clique partitioning [4], [6] and the maximum compatible technique [3]. Clique partitioning is often used when operator allocation and data path allocation are performed simultaneously. This method does not provide a minimum number of buses. A new method based on the maximum compatible concept [14] is therefore briefly presented in the next section.
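The preprocessing step followed by the greedy grouping of sub-buses can be sketched as below. This is only an illustration under stated assumptions: the text does not say how edges are scanned when connections are formed, so a first-fit pass is assumed; the port labels (`"in1"`, `"in2"`, `"in"`) and the extra edge `x1` are hypothetical and serve only to show two edges merging into one connection.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    name: str
    tw: int      # production time Tw
    tr: int      # consumption time Tr
    dest: tuple  # (functional unit, input port); hypothetical labels


def compatible(a, b):
    """Time compatibility of two edges."""
    return a.tw != b.tw and a.tr != b.tr


def logical_connections(edges):
    """Group time compatible edges that target the same FU input port."""
    conns = []
    for e in edges:
        for conn in conns:
            if conn[0].dest == e.dest and all(compatible(e, x) for x in conn):
                conn.append(e)
                break
        else:
            conns.append([e])
    return conns


def connections_compatible(c1, c2):
    """Two connections are compatible if all their edges are pairwise compatible."""
    return all(compatible(a, b) for a in c1 for b in c2)


def greedy_buses(conns):
    """Greedy grouping of connections (sub-buses) into buses."""
    buses = []
    for conn in conns:
        for bus in buses:
            if all(connections_compatible(conn, other) for other in bus):
                bus.append(conn)
                break
        else:
            buses.append([conn])
    return buses


edges = [
    Edge("a2", 3, 4, ("amd29c332", "in1")),
    Edge("a3", 0, 4, ("amd29c332", "in2")),
    Edge("a5", 8, 9, ("mcu", "in")),
    Edge("x1", 5, 7, ("amd29c332", "in1")),  # extra, purely illustrative edge
]
conns = logical_connections(edges)
buses = greedy_buses(conns)
print([[[e.name for e in conn] for conn in bus] for bus in buses])
# [[['a2', 'x1'], ['a5']], [['a3']]]
```

Because whole connections are moved as units, the edges converging on one input port stay on the same bus, which is what limits the number of multiplexer inputs.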
4.2.2 The maximum compatible method
After logical connections are formed, they are grouped together in order to build maximum compatible classes (CM) [10], [11]. A CM is a class which cannot be contained in another CM: each one involves a maximum number of compatible connections. The set of CMs is determined with a successive decomposition technique [12]. The initial CM involves all connections. Then, for each connection Ci and each class CMj such that Ci ∈ CMj, CMj is divided into two CMs: (i) the first CM contains Ci and the connections belonging to CMj which are time compatible with Ci, and (ii) the second CM = CMj - {Ci}. Compatible classes included in one another are deleted. Applying this procedure to the previous example depicted in Fig. 5, we get the CMs presented in Fig. 8.
{CM1={C1,C2,C3}}
C1 is only compatible with C3.
{CM1={C1,C3}, CM2={C2,C3}}
C2 is compatible with C3.
{CM1={C1,C3}, CM3={C2,C3}, CM4={C3}}
CM4 is deleted because it is included in CM3.
CMs are: CM1={C1,C3} and CM3={C2,C3}
Fig. 8. Example of CM building
When all CMs are known, a branch and bound procedure [13] selects a minimum number of classes which cover all the connections; a solution tree is built. A node of this tree involves the list Lsel of selected CMs and the list Lrem which contains the remaining CMs. At the root node, Lsel=∅, Lrem={CMs}: no class is selected. At each terminal node Ni where the selected CMs do not cover all connections:
- If there exists CMj ∈ Lrem such that CMj is the only CM to cover a connection, then a node issued from Ni is created: Lsel=Lsel(Ni)∪{CMj}, Lrem=Lrem(Ni)-{CMj}.
- Else, if CMj ∈ Lrem is such that CMj involves the greatest number of connections not covered by the selected CMs, then two nodes are created: a first node where CMj is selected (Lsel=Lsel(Ni)∪{CMj}, Lrem=Lrem(Ni)-{CMj}), and a second node where CMj cannot be selected (Lsel=Lsel(Ni), Lrem=Lrem(Ni)-{CMj}).
The tree construction stops when no further node can be created. The terminal node which comprises the smallest number of selected CMs is retained. Maximum compatible classes constitute buses. However, some connections can belong to several classes. In the previous example, the connection C3 is present in the two obtained CMs. Minimizing the number of FU ports connected to a bus is the criterion used to assign a connection to only one class. The aim is to minimize the size of multiplexers. This result is achieved by applying the maximum compatible technique again. In the next section, comparison results for the four methods mentioned above are given.
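The two stages of this section, CM construction by successive decomposition and selection of a minimum cover by branch and bound, can be sketched as follows. The function names and the representation of classes as `frozenset` objects are assumptions made for the illustration; tie-breaking between equally good classes is not specified in the text and is left to Python's `max`. The final step that assigns a shared connection such as C3 to a single class is not shown.

```python
def maximum_compatibles(conns, compatible):
    """Successive decomposition: split classes until each one is compatible."""
    classes = {frozenset(conns)}
    for c in conns:
        split = set()
        for cm in classes:
            if c in cm:
                # (i) c with the members compatible with it, (ii) the class without c
                split.add(frozenset(d for d in cm if d == c or compatible(c, d)))
                split.add(cm - {c})
            else:
                split.add(cm)
        # Delete empty classes and classes included in another one.
        classes = {cm for cm in split if cm and not any(cm < o for o in split)}
    return sorted(classes, key=sorted)


def minimum_cover(classes, conns):
    """Branch and bound: fewest classes that cover all connections."""
    best = None

    def explore(selected, remaining):
        nonlocal best
        missing = set(conns).difference(*selected)
        if not missing:
            if best is None or len(selected) < len(best):
                best = selected
            return
        for c in sorted(missing):
            holders = [cm for cm in remaining if c in cm]
            if not holders:
                return  # c can no longer be covered: dead end
            if len(holders) == 1:  # only one class covers c: single child node
                cm = holders[0]
                explore(selected + [cm], [r for r in remaining if r != cm])
                return
        cm = max(remaining, key=lambda r: len(r & missing))
        rest = [r for r in remaining if r != cm]
        explore(selected + [cm], rest)  # node where cm is selected
        explore(selected, rest)         # node where cm cannot be selected

    explore([], list(classes))
    return best


# Example of Fig. 8: C3 is compatible with C1 and with C2; C1 and C2 are not.
pairs = {frozenset({"C1", "C3"}), frozenset({"C2", "C3"})}
conns = ["C1", "C2", "C3"]
classes = maximum_compatibles(conns, lambda a, b: frozenset({a, b}) in pairs)
cover = minimum_cover(classes, conns)
print([sorted(cm) for cm in classes])  # [['C1', 'C3'], ['C2', 'C3']]
print(len(cover))                      # 2
```

On the Fig. 8 example both classes are needed, since each is the only one covering C1 or C2, so the tree has a single branch.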
4.3 Comparison results
In order to evaluate the efficiency of the data path allocation methods described in the previous section, three input programs are considered: image convolution, vectorized image convolution and the SOR (successive overrelaxation) algorithm. These algorithms are given in Fig. 9.
Two architectures among those generated by CAPSYS are retained: (i) the architecture with the minimum set of functional units (ALU, FPU, integer multiplier) needed to cover all basic functions deduced from the input programs, (ii) the architecture with six functional units: 2x ALU, 2x FPU, 2x integer multiplier. These architectures also contain the FUs of the generic model (Fig. 2): CU, AU, MB and MCU. Results are presented according to three parameters: (i) the number of registers and their sizes, (ii) the number of multiplexers (and their individual numbers of inputs), (iii) the number of physical links and control bits required.
procedure convolution is
  type image_carree is array[1..256,1..256] of INTEGER;
  image : image_carree;
  image_filtree : image_carree;
  filtre : array[1..3,1..3] of INTEGER;
begin
  for k in 1..3 loop
    for l in 1..3 loop
      for i in 2..255 loop
        for j in 2..255 loop -- for is replaced by forall in the vectorized convolution
          image_filtree[i,j] := image_filtree[i,j] + filtre[k,l]*image[i-2+k,j-2+l];
        end loop;
      end loop;
    end loop;
  end loop;
end convolution;

procedure SOR is
  n : constant INTEGER := 512;
  n_1 : constant INTEGER := n-1;
  OM : constant REAL := 0.5;
  T : array[1..n,1..n] of REAL;
begin
  for i in 2..n-1 loop
    forall j in set {2..n-1} loop
      T[i,j] := OM*0.25*(T[i-1,j]+T[i+1,j]+T[i,j-1]+T[i,j+1]) - (OM-1.0)*T[i,j];
    end loop;
  end loop;
end SOR;
Fig. 9. Convolution and SOR algorithms
4.4 Comparison according to the number of registers
Table I shows registers and their sizes for the two architectures. We notice that the greedy and edge colouring methods produce similar results. Methods with preprocessing give different results: they increase the number of registers and their individual sizes. However, the maximum compatible method gives results close to those of the edge colouring algorithm. Yet this preprocessing algorithm has a running time which grows with the number of connections. This effect is due to the branch and bound procedure, which is time consuming. This procedure may be modified so that a node of the solution tree is not considered if its associated number of CMs is greater than the number of CMs of at least one leaf in the tree. Using this modified procedure, running times may be considerably improved. Preprocessing methods give solutions with more distributed register sizes. This factor tends to increase their number, but the whole memory size of registers may be reduced if the assignment of edges to addresses in registers is optimized.
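The pruning rule described in Section 4.4 can be illustrated on a small covering problem. The sketch below is an assumption-laden illustration: the four classes A to D and the connection numbers are invented for the example, the visit counter exists only to compare both variants, and "leaf" is interpreted as a node whose selected CMs already cover all connections.

```python
def minimum_cover(classes, conns, prune=True):
    """Return (best cover, number of tree nodes visited)."""
    best, visited = None, 0

    def explore(selected, remaining):
        nonlocal best, visited
        if prune and best is not None and len(selected) > len(best):
            return  # this node already uses more CMs than a known leaf
        visited += 1
        missing = set(conns).difference(*selected)
        if not missing:
            if best is None or len(selected) < len(best):
                best = selected
            return
        for c in sorted(missing):
            holders = [cm for cm in remaining if c in cm]
            if not holders:
                return  # c can no longer be covered
            if len(holders) == 1:  # only one class covers c
                rest = [r for r in remaining if r != holders[0]]
                explore(selected + holders, rest)
                return
        cm = max(remaining, key=lambda r: len(r & missing))
        rest = [r for r in remaining if r != cm]
        explore(selected + [cm], rest)  # cm selected
        explore(selected, rest)         # cm excluded

    explore([], [frozenset(cm) for cm in classes])
    return best, visited


# Hypothetical classes over four connections numbered 1 to 4.
A, B, C, D = {1, 2}, {3, 4}, {1, 3}, {2, 4}
conns = [1, 2, 3, 4]
best_full, n_full = minimum_cover([A, B, C, D], conns, prune=False)
best_pruned, n_pruned = minimum_cover([A, B, C, D], conns, prune=True)
print(len(best_full), n_full)
print(len(best_pruned), n_pruned)
```

Both variants return a cover of the same size; the pruned one simply skips nodes that cannot improve on a cover already found. A stricter test (`>=` instead of `>`) would prune more, but it goes beyond what the text states.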
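Greedy
  Scalar Convolution, 3xFUs:      Nb: 5   S: 1,2,2,9,11
  Scalar Convolution, 6xFUs:      Nb: 5   S: 1,1,3,6,9
  Vectorized Convolution, 3xFUs:  Nb: 5   S: 1,2,4,19,29
  Vectorized Convolution, 6xFUs:  Nb: 9   S: 1,1,1,2,3,5,8,8,11
  SOR, 3xFUs:                     Nb: 5   S: 1,7,11,41,46
  SOR, 6xFUs:                     Nb: 6   S: 1,1,7,9,31,39
Edge coloring
  Scalar Convolution, 3xFUs:      Nb: 5   S: 1,2,2,9,11
  Scalar Convolution, 6xFUs:      Nb: 5   S: 1,1,3,6,9
  Vectorized Convolution, 3xFUs:  Nb: 5   S: 1,2,4,19,29
  Vectorized Convolution, 6xFUs:  Nb: 9   S: 1,1,1,2,3,5,8,8,11
  SOR, 3xFUs:                     Nb: 5   S: 1,7,11,41,46
  SOR, 6xFUs:                     Nb: 5   S: 1,7,9,31,39
Methods with preprocessing
Greedy
  Scalar Convolution, 3xFUs:      Nb: 5   S: 1,3,6,8,[illegible]
  Scalar Convolution, 6xFUs:      Nb: 8   S: 1,1,2,3,3,4,5,5
  Vectorized Convolution, 3xFUs:  Nb: 6   S: 1,3,6,17,17,19
  Vectorized Convolution, 6xFUs:  Nb: 10  S: 1,2,2,3,3,4,6,7,8,8
  SOR, 3xFUs:                     Nb: 6   S: 1,12,30,31,41,46
  SOR, 6xFUs:                     Nb: 6   S: [illegible]
Maximum Compatibles
  Scalar Convolution, 3xFUs:      Nb: 5   S: 1,3,6,8,[illegible]
  Scalar Convolution, 6xFUs:      Nb: 5   S: 1,3,5,5,6
  Vectorized Convolution, 3xFUs:  Nb: 6   S: 1,3,6,17,17,19
  Vectorized Convolution, 6xFUs:  Nb: 10  S: 1,2,3,3,3,5,6,6,7,8
  SOR, 3xFUs:                     Nb: 5   S: 1,10,41,43,46
  SOR, 6xFUs:                     Nb: 6   S: 1,5,16,21,23,39
Nb: number of registers. S: sizes of registers.
Table I: Number and size of registers.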
Greedy
  Scalar Convolution, 3xFUs:      Number: 8    3x2, 5x3
  Scalar Convolution, 6xFUs:      Number: 16   8x2, 6x3, 2x4
  Vectorized Convolution, 3xFUs:  Number: 9    1x2, 5x3, 3x4
  Vectorized Convolution, 6xFUs:  Number: 19   3x2, 3x3, 8x4, 5x5
  SOR, 3xFUs:                     Number: 10   2x2, 6x3, 2x4
  SOR, 6xFUs:                     Number: 16   5x2, 2x3, 8x4, 1x5
Edge coloring
  Scalar Convolution, 3xFUs:      Number: 8    3x2, 5x3
  Scalar Convolution, 6xFUs:      Number: 16   8x2, 6x3, 2x4
  Vectorized Convolution, 3xFUs:  Number: 9    1x2, 5x3, 3x4
  Vectorized Convolution, 6xFUs:  Number: 19   3x2, 3x3, 8x4, 5x5
  SOR, 3xFUs:                     Number: 10   2x2, 6x3, 2x4
  SOR, 6xFUs:                     Number: 16   [illegible]
Methods with preprocessing
Greedy
  Scalar Convolution, 3xFUs:      Number: 4    1x2, 3x3
  Scalar Convolution, 6xFUs:      Number: 6    [illegible]
  Vectorized Convolution, 3xFUs:  Number: 7    [illegible]
  Vectorized Convolution, 6xFUs:  Number: 10   3x2, 4x3, 3x4
  SOR, 3xFUs:                     Number: 8    4x2, 4x3
  SOR, 6xFUs:                     Number: 11   4x2, 1x3, 6x4
Maximum Compatibles
  Scalar Convolution, 3xFUs:      Number: 4    1x2, 3x3
  Scalar Convolution, 6xFUs:      Number: 9    [illegible]
  Vectorized Convolution, 3xFUs:  Number: 7    [illegible]
  Vectorized Convolution, 6xFUs:  Number: 10   3x2, 4x3, 3x4
  SOR, 3xFUs:                     Number: 8    3x2, 5x3
  SOR, 6xFUs:                     Number: 12   4x2, 4x3, 4x4
"ixj": i multiplexers with j inputs
Table II: Multiplexer numbers and their inputs
4.5 Comparison according to the number of multiplexers
For each architecture, Table II gives the number of multiplexers and the number of their inputs. Data path allocation methods which operate directly on the flow graph give the same number of multiplexers. The difference in the number of multiplexer inputs in the rightmost column of Table II results from the variation in the number of registers. Otherwise no variation occurs. Preprocessing significantly improves the efficiency of the allocation methods. The total number of multiplexers may be divided by a ratio of up to 2.6. The maximum compatible method slightly increases the number of multiplexers. A solution with distributed registers reduces the interconnect units.
4.6 Comparison according to the number of physical links and control bits
In this section, we compare the four algorithms according to the number (PL) of physical links in the PU, and the number (CTRL) of control bits of registers and multiplexers. These two parameters are interesting because PL denotes the amount of registers and multiplexers, and CTRL the sizes of each one. Moreover, CTRL is a factor which directly influences the size of the program memory. Table III summarizes the results obtained for the three input programs. Methods with preprocessing provide the best results with respect to the PL parameter. The maximum compatible procedure is the only one which simultaneously reduces physical links and control bits.
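Greedy
  Scalar Convolution, 3xFUs:      PL: 33   Ctrl: 36
  Scalar Convolution, 6xFUs:      PL: 63   Ctrl: 47
  Vectorized Convolution, 3xFUs:  PL: 54   Ctrl: 55
  Vectorized Convolution, 6xFUs:  PL: 96   Ctrl: 81
  SOR, 3xFUs:                     PL: 4[illegible]   Ctrl: 61
  SOR, 6xFUs:                     PL: 75   Ctrl: 70
Edge coloring
  Scalar Convolution, 3xFUs:      PL: 33   Ctrl: 36
  Scalar Convolution, 6xFUs:      PL: 63   Ctrl: 47
  Vectorized Convolution, 3xFUs:  PL: 54   Ctrl: 55
  Vectorized Convolution, 6xFUs:  PL: 96   Ctrl: 81
  SOR, 3xFUs:                     PL: 43   Ctrl: 61
  SOR, 6xFUs:                     PL: 73   Ctrl: 68
Methods with preprocessing
Greedy
  Scalar Convolution, 3xFUs:      PL: 23   Ctrl: 34
  Scalar Convolution, 6xFUs:      PL: 47   Ctrl: 45
  Vectorized Convolution, 3xFUs:  PL: 31   Ctrl: 51
  Vectorized Convolution, 6xFUs:  PL: 66   Ctrl: 76
  SOR, 3xFUs:                     PL: 34   Ctrl: 70
  SOR, 6xFUs:                     PL: [illegible]9   Ctrl: 86
Maximum Compatibles
  Scalar Convolution, 3xFUs:      PL: 23   Ctrl: 34
  Scalar Convolution, 6xFUs:      PL: 46   Ctrl: 45
  Vectorized Convolution, 3xFUs:  PL: 31   Ctrl: 51
  Vectorized Convolution, 6xFUs:  PL: 64   Ctrl: 76
  SOR, 3xFUs:                     PL: 34   Ctrl: 52
  SOR, 6xFUs:                     PL: 58   Ctrl: 58
Table III: Physical link (PL) and control bit (Ctrl) numbers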
5. CONCLUSION
After a brief description of the CAPSYS tool, four methods for solving the data path allocation problem are presented. Their comparison is performed for scalar and vector input programs. After the phase of compilation and selection of FUs, a flow graph is provided. Edges of the flow graph are assigned to buses according to a first greedy algorithm. This algorithm is not optimal with respect to the number of registers. An optimal method based on an edge colouring principle is developed. Example studies show that: (i) the results of the greedy and edge colouring algorithms are very close, (ii) they tend to increase the number of multiplexers and physical links. The latter effect is reduced by a preprocessing step which consists in grouping edges of the flow graph into logical connections. These connections are then assigned to buses according to two procedures: (i) a greedy procedure, (ii) a maximum compatible procedure. Preprocessing reduces the number of multiplexers but unfortunately tends to increase the number of registers and their individual sizes. However, the maximum compatible method, which provides on the one hand a number of registers similar to that of the edge colouring algorithm, and on the other hand a number of multiplexers close to that of the greedy method with preprocessing, appears to be a good compromise. The number of registers may be decreased further by using multi-input-port registers. Future work should integrate this type of register into the maximum compatible method.
REFERENCES
[1] M.C. McFARLAND, A.C. PARKER, R. CAMPOSANO, The high-level synthesis of digital systems, Proceedings of the IEEE, vol. 78, no. 2, February 1990, pp. 301-318.
[2] B.M. PANGRLE, Splicer: A heuristic approach to connectivity binding, 25th IEEE Design Automation Conference, June 1988, pp. 536-541.
[3] N. BERRY, B.M. PANGRLE, SCHALLOC: An algorithm for simultaneous scheduling & connectivity binding in a datapath synthesis system, The European Design Automation Conference (EDAC), Glasgow, March 1990, pp. 78-82.
[4] C. TSENG, D.P. SIEWIOREK, Automated synthesis of data paths in digital systems, IEEE Trans. on Computer-Aided Design, vol. CAD-5, no. 3, July 1986, pp. 379-395.
[5] L. STOK, Interconnection optimization during data path allocation, EDAC, Glasgow, March 1990, pp. 141-145.
[6] P.G. PAULIN, J.P. KNIGHT, Force-directed scheduling for the behavioral synthesis of ASIC's, IEEE Trans. on Computer-Aided Design, vol. 8, no. 6, June 1989, pp. 661-679.
[7] M. AUGUIN, F. BOERI, C. CARRIERE, G. MENEZ, From program to hardware: A parallel architecture compiler, Euromicro 1990, pp. 467-474.
[8] G. MENEZ, M. AUGUIN, F. BOERI, C. CARRIERE, Génération automatique de modèles d'architectures de processeurs VLIW spécialisés, Troisième symposium sur les architectures nouvelles de machines, Paris, 19-21 juin 1991, pp. 133-153.
[9] G. MENEZ, Méthode et outil de conception d'architectures VLIW : génération automatique de modèles de processeurs spécialisés, thèse de doctorat, Université de Nice, septembre 1991.
[10] BANNISTER and WHITEHEAD, Fundamentals of digital systems (1973), chapter 7, pp. 150-196.
[11] F.T. HILL and G.R. PETERSON, Switching theory and logical design (1974), chapter 12, pp. 329-361.
[12] PERRIN, DENOUETTE, DACLIN, Systèmes logiques (1967), tome 2, chapter 8, pp. 80-90.
[13] R.B. CUTLER and A. MUROGA, Derivation of the minimal sums for completely specified functions, IEEE Trans. on Computers, vol. C-36, no. 3, March 1987, pp. 277-292.
[14] C. CARRIERE, Définition et optimisation de l'interconnexion dans un processeur VLIW, thèse de doctorat, Université de Nice, septembre 1992.