INTEGRATION, the VLSI Journal 21 (1996) 113-138
ELSEVIER
Synthesis of systems specified as interacting VHDL processes
Petru Eles a,b, Krzysztof Kuchcinski a,*, Zebo Peng a
a Department of Computer and Information Science, Linköping University, Sweden
b Computer Science and Engineering Department, Technical University of Timisoara, Romania
Received 4 May 1995; revised 6 February 1996
Abstract
This paper presents an approach to the synthesis of hardware systems specified as interacting VHDL processes. Unlike traditional high-level synthesis methodologies, our approach takes into account the interactions and interdependence between concurrent processes. Two methods have been developed. The first method supports an unrestricted use of signals and wait statements and synthesizes synchronous hardware with global control of process synchronization for signal update. The second method allows hardware synthesis without the strict synchronization required by the VHDL simulation-based semantics. In both methods VHDL system specifications are first translated into an internal design representation based on timed Petri nets, which is then synthesized into hardware implementation structures at register-transfer level. Our main objective is to preserve simulation/synthesis correspondence during synthesis and to produce hardware that operates with a high degree of parallelism. Experimental results with practical design examples demonstrate that the proposed methods are efficient in terms of both the resulting hardware and optimization time.
Keywords: System synthesis; VHDL; Concurrent processes; High-level synthesis; Process communication
1. Introduction
Complex digital systems are usually specified as a composition of interacting subsystems, each of them described by a sequential process. These processes run concurrently and interact using predefined communication mechanisms. This view of the digital system is usually captured by a system-level description [1-4]. The complexity of digital systems specified in this way has been growing in recent years and generates a need for new synthesis methodologies and tools.
High-level synthesis (HLS) design methodology has been developed to solve some of these synthesis problems. It accepts the behavioral specification of hardware at the algorithmic level. This specification describes a sequence of computations to be performed on the input data to produce the required output data. The resulting hardware structure, at RT level, usually consists of a data part controlled by an FSM controller [2,5].
* Corresponding author. Fax: 46 13 282666; e-mail: [email protected].
1 This work was supported by the Swedish National Board for Industrial and Technical Development (NUTEK).
0167-9260/96/$15.00 Copyright © 1996 Elsevier Science B.V. All rights reserved. PII S0167-9260(96)00012-0
Current HLS methodology does not solve synthesis problems for digital system specifications consisting of interacting processes, and therefore system design environments supporting design features related to process interaction are needed. For example, in addition to traditional HLS allocation and scheduling tasks, new problems such as the selection of a communication strategy, process synchronization structure and hierarchical controller structure must be addressed. Processes have to be considered globally in their interaction structure during system synthesis. This represents an essential distinction from classical high-level synthesis, where one single process is considered at a time.
VHDL is a standard hardware description language which is used as a specification language for both simulation and synthesis. Accepting VHDL as the input language for synthesis introduces, however, many additional problems [6-8]. The main problem is that we have to conform to standard VHDL semantics [9, 10] while avoiding language restrictions that are not acceptable in the context of system specification and synthesis. The HLS community solved this problem by restricting the synthesizable VHDL subset to practically sequential programs with very limited use of signals. This is, however, unacceptable if system synthesis is considered. We have to propose means for efficient synthesis of VHDL concurrent processes together with their underlying communication structures.
Our approach focuses on the synthesis of digital hardware specified as interacting concurrent processes in VHDL. The work is done in two steps. First, we translate concurrent processes into an internal design representation; synthesis is then carried out based on this representation. The main feature of our work, which distinguishes it from other similar approaches, is that we pay special attention to preserving VHDL semantics for a wide subset of the language while synthesizing concurrent processes.
1.1. Related work
VHDL is already used as an input language for many HLS systems but the language is restricted in many ways. HLS systems usually accept specifications consisting of a single VHDL process and thus only a reduced sequential subset of the language is considered. The use of an explicit clock signal is often imposed, which entails a lower level style of design specification and in some cases requires the designer to make some scheduling decisions. According to SynVHDL [7] and VSYNTH [11], for example, an architecture body may only contain a single process. Silicon 1076 [12] restricts the use of signal assignments to output ports and requires a design to contain only one process described at the architectural level. In CALLAS [13-15] the designer is required to use one explicit global clock signal. The entire behavior has to be described only in terms of variables, and the use of signals is practically limited to input and output ports. Papers dedicated specifically to questions concerning the synthesis of signals, for example [16-18], do not address the implications for synthesis of signal assignment semantics in the context of a set of interacting VHDL processes either.
The need to consider interacting VHDL processes has recently been recognized by the researchers of the HLS community. DSS [19], developed at the University of Cincinnati, supports the synthesis of interacting VHDL processes to synchronous hardware of strongly coupled FSMs with lockstep execution of processes. The synthesis system HIS [20], designed at IBM, accepts interacting processes but imposes several restrictions on the VHDL specification style. It restricts
the use of signals to the explicit clock signal in the so-called sequential synchronous model. The system accepts only very strictly defined description styles: sequential synchronous model, explicit state machine model, and a dataflow description using only concurrent signal assignments. By imposing these restrictions the global aspects of VHDL process and signal semantics can be avoided. In [21] a method is presented to translate VHDL processes containing signal assignments and wait statements into sequences that can be handled by HLS tools. Signals are replaced by several variables and the resulting processes contain wait statements only on a clock edge. This approach also handles only one single process at a time and the overall synchronization of several VHDL processes is not discussed.
Synthesis of concurrent processes, described in HardwareC, and their communication structure is discussed in [22]. The authors concentrate on optimization of communication interfaces which use blocking and non-blocking message passing. Several interface optimization techniques are proposed. In [23] dynamic scheduling and synchronization synthesis for a set of interacting processes is presented. Processes are modeled using a process algebra notation and optimization is performed based on an integer linear programming formulation.
1.2. Our approach
Our approach is to accept for synthesis system specifications consisting of interacting VHDL processes. For a system-level specification, restrictions like those imposed by the synthesis systems mentioned above cannot be accepted. The main difficulty is due to those features of VHDL that are explicitly defined in terms of simulation. Keeping simulation semantics during synthesis with a low additional implementation overhead was one of our main objectives. To achieve this goal we compile VHDL in such a way that its essential semantics with respect to process interaction is explicitly captured in our Petri net based internal design representation. The generated synthesis structures are then optimized by high-level synthesis algorithms.
Our approach supports two main specification and synthesis styles. The first one accepts an unrestricted use of signals and wait statements. The communication and synchronization mechanism can be defined by using signals and wait statements, as defined by the VHDL standard. This style can be regarded as too low level for system specifications. Hence our second solution is based on higher level communication primitives for synchronous message passing. It is done by defining specialized subprograms for send, receive and test of messages. A similar approach has been proposed independently in [24] for hardware/software co-specification. The solutions we propose have been implemented and tested with the CAMAD synthesis system [25].
This paper is divided into seven sections. Section 2 discusses the problems of VHDL simulation semantics and its consequences for synthesis. In Section 3 our design environment is briefly introduced. Sections 4 and 5 discuss two methods for VHDL specification and synthesis. They concentrate on the representation of VHDL descriptions in the extended timed Petri net representation and on further synthesis steps. In Section 6 we present some of the experimental results and finally in Section 7 we give conclusions and discuss future work.
signal a, s: integer := 0;

P1: process
  variable z: integer;
begin
  ...
  wait on s;
  ...
  z := a;
  ...
end process P1;

P2: process
  variable y1, y2: integer;
begin
  ...
  wait on a;
  ...
  y2 := a;
  ...
end process P2;
Fig. 1. An example of interacting processes.
2. VHDL as a synthesis language
VHDL has been defined as a simulation language. Thus, some of the VHDL constructs (access types, files, assert statement) are not significant from the point of view of synthesis. Other features of the language are based on a simulation model and can be only synthesized with some hardware overhead. One of the most difficult issues in this context originates from the way signal assignments and wait statements are defined. As stated in the language definition [9, 10], unlike variables, which are updated as soon as they are assigned a value, signals are only updated at the end of a simulation cycle. This means that the update of signal values must be synchronized with the execution of a wait statement by every process in the system and has to be performed simultaneously for all signals that change their values in that simulation cycle. To illustrate the main problems with VHDL semantics for signal assignment and wait statement, and to draw some conclusions concerning synthesis, we refer to the example in Fig. 1 (we assume that there is no wait statement in processes P1 and P2, except the two given explicitly). Let us consider first only process P1. The value assigned to variable z is the value that has been given to a in the previous execution iteration of the process, because the value of signal a will be updated only when executing the wait statement. At synthesis this behavior has to be captured, regardless if signal assignment to a and reference to the signal value for assignment to z will be scheduled in the same clock cycle or not. In the general case this has to be solved by latching the value assigned to the signal and by updating the signal value only when a wait statement is executed. Latching of the signal value is enough to preserve at synthesis the computational effects of the simulation cycle, only if we consider an isolated process, without any interaction to other processes. To illustrate this we look now at both processes, P1 and P2, in Fig. 1. 
At simulation, the two values for signal a referred to in process P2 will be the same for one execution iteration of the process. This is because the value of the signal cannot change unless both processes, P1 and P2, execute a wait statement. But if we consider the two processes separately and isolate them for synthesis, and then let them work together, the value of a could be changed between the two assignments to variables y1 and y2. Thus, it is not sufficient at synthesis to latch values assigned to signals and to update them when wait statements are executed by the given process. This update has to be carried out only when all processes are executing a wait statement. This behavior has to be implemented in the
synthesized hardware, as long as we are not restricting the use of signals and wait statements in the synthesizable VHDL subset.
Another important issue concerning VHDL semantics is the modeling of time. According to the language standard, strict timing is specified by after clauses in signal assignment statements and wait statements with time clauses. From the point of view of high-level and system-level synthesis, the decision on when a certain operation will be executed is left to the synthesis tool and is not strictly specified by the designer. That is the reason why synthesis tools working at these levels do not consider strict timing specifications [6, 7, 12, 19, 20]. At the system level we consider time as a notion of causality and hence we are only interested in the partial ordering relation of operations on signals and ports. Providing a certain partial order of operations on signals and ports in a VHDL specification entails introducing some synchronization between processes. According to VHDL standard (simulation) semantics, the designer can enforce this synchronization by using signal assignments and wait statements.
The synthesis strategies presented in this paper preserve this temporal relationship between the simulation model and the synthesized hardware structure. Thus we achieve simulation/synthesis correspondence, which means that both the simulation model and the synthesized hardware react with the same values (sequences of values) of the signals and ports to identical sequences of stimuli applied to the inputs. Considering correspondence at the level of signals and ports is sufficient, since we are only interested in the external behavior of the resulting hardware.
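The signal-update rule discussed in this section can be illustrated with a small behavioural model. The following Python sketch is not part of the synthesis method; the names Signal, assign, update and simulation_cycle are hypothetical, and the interleaving of the two processes of Fig. 1 is written out by hand. It assumes, for illustration only, that P1 assigns the value 5 to signal a between the two reads of a in P2.

```python
class Signal:
    """A VHDL-like signal: reads see `value`, assignments are latched in `pending`."""

    def __init__(self, value=0):
        self.value = value
        self.pending = value

    def assign(self, new_value):
        # Corresponds to a signal assignment; the new value is not visible yet.
        self.pending = new_value

    def update(self):
        """Commit the latched value; return True if the value changed (an event)."""
        changed = self.pending != self.value
        self.value = self.pending
        return changed


def simulation_cycle(signals, all_processes_waiting):
    """Update signals only when every process is suspended at a wait statement."""
    if not all_processes_waiting:
        return []
    return [name for name, sig in signals.items() if sig.update()]


signals = {"a": Signal(0), "s": Signal(0)}
a = signals["a"]

# P2 reads a, P1 assigns a (assumed value 5), P2 reads a again.
y1 = a.value
a.assign(5)
events_early = simulation_cycle(signals, all_processes_waiting=False)
y2 = a.value

# Both processes have now reached their wait statements.
events = simulation_cycle(signals, all_processes_waiting=True)
```

In this model y1 and y2 hold the same value because the update is deferred until all processes wait. Updating a as soon as P1 alone reached its wait statement would let the two reads in P2 differ, which is the hazard described above.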
3. The design environment
Our design environment is based on an extension of the CAMAD high-level synthesis system [25]. CAMAD is built around an internal design representation, called ETPN (extended timed Petri net) [25], which has been developed to capture the intermediate results during the high-level synthesis process. The use of Petri nets provides a natural description of concurrency and synchronization of the operations and processes of a hardware system. It therefore gives a natural platform to represent and manipulate concurrent processes of the VHDL specifications.
ETPN is a formal representation model derived from Petri net theory and consisting of separate but related models for control and data path. The data path is represented as a directed graph with nodes and arcs. The nodes are used to capture data manipulation units (data storage, arithmetic operators, etc.). The arcs represent the connections of the nodes. The control part of ETPN, on the other hand, is captured as a timed Petri net with restricted transition firing rules. These two parts are related by the control signals coming from the control part to the data path, and the conditional signals traveling in the opposite direction.
A simple example of ETPN is illustrated in Fig. 2. Fig. 2(b) shows the data path where each data path node is depicted as a rectangle with a label indicating the basic operation or storage identifier of the node. The arcs of the data path represent the data flow between the nodes. Flow of data from one node to another is controlled by the control signals coming from the control part. The control relation is indicated by using control place names to guard the arcs. When a control place in the Petri net holds a token, its associated arcs in the data path (arcs guarded by the corresponding label) will be open for data to flow. Fig. 2(a) depicts the Petri net which represents the control flow of the example.
A control state is represented as a marking of the Petri net, i.e., the possession of
[Figure: (a) control Petri net with places P0 to P7; (b) data path]

(c) VHDL description (the original figure annotates the statements with the control-place labels --/P0/ to --/P7/):

  ...
  while X>0 loop
    X := Inp;
    Y := Y + X;
  end loop;
  Out := Y;
  ...

Fig. 2. An example of ETPN representation and its corresponding VHDL description. (a) Control Petri net, (b) Data path and (c) VHDL description.
tokens in a subset of the places of the Petri net which are depicted as circles. The changes of control states are carried out as firings of one or several transitions of the Petri net which are depicted as bars. To express that the control flow can be guarded by results of internal computations, we use conditional signals to guard the control flow. A transition may be guarded by one or several conditions produced from the data path. A transition may be fired when it is enabled (all its input places have a token) and the guarding condition is true. If a transition has more than one guarding condition and at least one of them is true, the transition's guarding condition is true. The ETPN example is generated by CAMAD as the compilation output of the input VHDL specification given in Fig. 2(c).
The main feature of the ETPN design representation is its ability to capture the intermediate result of a design explicitly so as to allow the design algorithm to make accurate design decisions. For example, if several operations are not data-dependent and can thus be executed concurrently, the situation can be captured precisely by giving their associated control places in the Petri net a potential to hold tokens simultaneously. That is, the set of Petri net places corresponding to the operations will not have any partial ordering relation between them. For example, in Fig. 2, P2 and P7 control the loading of data to registers X and Y, respectively, which are independent operations. Therefore, P2 and P7 can hold tokens simultaneously. When it is, however, discovered later during synthesis that the potential parallelism cannot be implemented (because of resource restrictions, for example), additional partial ordering relations can be introduced. In this case some of the independent operations will be performed in sequence [25].
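The firing rule described above can be sketched in a few lines of Python. This is a minimal illustration and not CAMAD's implementation: the dictionary layout, the function names enabled and fire_all, and the small example net (with hypothetical places start, load_x, load_y, test, body and done) are assumptions made here. The sketch also assumes that each place holds at most one token, so a marking is represented as a set of place names.

```python
def enabled(transition, marking, conditions):
    """A transition may fire when all input places hold a token and its guard is true.

    With several guarding conditions, the guard is true if at least one of them is.
    An unguarded transition only needs its input tokens.
    """
    has_tokens = all(place in marking for place in transition["inputs"])
    guards = transition.get("guards", [])
    guard_ok = any(conditions[g] for g in guards) if guards else True
    return has_tokens and guard_ok


def fire_all(transitions, marking, conditions):
    """Fire every enabled transition in the same step and return the new marking."""
    to_fire = [t for t in transitions if enabled(t, marking, conditions)]
    new_marking = set(marking)
    for t in to_fire:
        new_marking -= set(t["inputs"])
    for t in to_fire:
        new_marking |= set(t["outputs"])
    return new_marking


# Hypothetical net: two independent register loads run concurrently, then a test.
transitions = [
    {"inputs": ["start"], "outputs": ["load_x", "load_y"]},
    {"inputs": ["load_x", "load_y"], "outputs": ["test"]},
    {"inputs": ["test"], "outputs": ["body"], "guards": ["x_gt_0"]},
    {"inputs": ["test"], "outputs": ["done"], "guards": ["x_le_0"]},
]
conditions = {"x_gt_0": False, "x_le_0": True}

m1 = fire_all(transitions, {"start"}, conditions)
m2 = fire_all(transitions, m1, conditions)
m3 = fire_all(transitions, m2, conditions)
```

The marking m1 holds tokens in load_x and load_y at the same time, which mirrors the way independent operations are given the potential to execute concurrently.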
ETPN is used as a unified design representation which captures the intermediate designs of the high-level synthesis process, and thus allows the synthesis algorithm to employ an iterative improvement approach to carry out the synthesis task. The basic idea is that once the VHDL
[Figure: VHDL Specification; ETPN (Control Part, Data Path); Scheduling, Allocation, Optimization; Simulation and Verification; Controller Implementation; Netlist Generation; RT-level Design]
Fig. 3. Overview of CAMAD.
specification is translated into the initial design representation, it can be viewed as a primitive implementation. Correctness-preserving transformations can then be used to successively transform the initial design into an efficient implementation. CAMAD integrates the operation scheduling, data path allocation, control allocation and, to some degree, module binding subtasks of high-level synthesis. This is achieved by developing a set of basic transformations of the design representation which deals simultaneously with partial scheduling and local data path/control allocation. An optimization algorithm is then used to analyze the (global) design and select transformations during each step of the iterative improvement process. Fig. 3 illustrates the basic structure of the CAMAD system. The first step of CAMAD is to map the VHDL specification into ETPN and to perform automatic parallelism extraction. After the transformation steps a RTL hardware implementation is generated which consists of a data path net-list and a controller specified in the form of a finite state machine. The final RTL implementation can be converted into structural VHDL which, as well as the input system specification, can be simulated for verification [26].
4. Synthesis with signal assignments and wait statements
Our approach supports two basic styles for specifying interaction of VHDL concurrent processes: one using Signal Assignments and Wait statements (SAW for short), and the other one based on synchronous Send/Receive message passing primitives (SR for short). The SAW style makes it possible for the designer to express process interaction using signals, as described in the VHDL language definition. From the point of view of synthesis SAW implies the hardware implementation of the synchronization imposed by the simulation cycle. This means that processes have to wait for each other, until all of them are executing a wait statement, in order to update the signal
[Figure: ETPN fragment for "wait on s", with the signal register nodes s and s']
Fig. 4. Representation of signals and wait statements (model with signal assignment and wait).
values. The hardware that will be synthesized according to this style is controlled either by a single FSM or by several FSMs working synchronously together.
4.1. Representation of signals and wait statements
A wait on signal statement is represented in ETPN by associating a condition to a transition in the control part corresponding to the waiting process [27]. The condition will be produced as the result of an assignment to the respective signal. In Fig. 4 we show the ETPN representation of a signal and the control part corresponding to a wait statement. According to ETPN semantics, the process will wait on transition T until condition Cs in the data path becomes true. Signals are modeled by two data path register nodes (s and s'). The value referred to by the processes accessing the signal is that in node s, while the node s' stores the last assigned value. Condition Cs indicates an event on signal s, and hence will be true only when the new value of the signal differs from the old one. The structure can also be extended to produce the condition corresponding to a transaction on the signal [27].
Updating the signal, by passing the value from node s' to node s, is controlled by a place, called Q in Fig. 4, that will hold a token only when all processes are executing wait statements (in Fig. 5 we illustrate the control ETPN which contains place Q). The control structure containing this place consists of a single FSM or a collection of FSMs, as will be discussed in the following two sections. For reasons of simplicity, in the figures throughout this paper, we use a compressed data path representation for signals depicting only the two register nodes (as, for example, in Fig. 5).
4.2. Synthesis of a collection of state machines
Synthesis of several FSMs, one for each process, is performed on a design representation containing several independent control Petri nets synchronizing through shared data path conditions. In Fig. 5 we present the ETPN structure corresponding to the example discussed in Section 2 (Fig. 1). This representation will be generated if the designer requests the synthesis of several FSMs. The supervisor process P0 is automatically generated during compilation of the VHDL description, and is responsible for the synchronized updating of signals, under the control of place Q. This place will be marked only when all processes in the system are executing a wait statement.
[Figure: ETPN with separate control Petri nets for process P0, process P1 and process P2]
Fig. 5. Design representation for generation of a collection of FSMs.
The required synchronization is achieved by using the one-bit register nodes x1, x2, ..., xk, one for each process Pi in the design. Each node xi is initially reset by the initial place P0. Setting of xi is controlled by any of the places in the respective set of places Ωi, where Ωi consists of all control places corresponding to the wait statements that belong to process Pi. Node xi will be reset under the control of any place in the set Ω'i, where Ω'i consists of the places that are direct successors of those in Ωi. In our particular example x1 is set under control of place W_P1, and is reset under control of places W'_P1 and P0; W_P2 controls setting of node x2, and W'_P2 with P0 resets the same node.
Place ε in process P0 does not control any arc. This dummy place indicates that a certain delay has to be introduced after signal update, under control of place Q, and before a new evaluation of condition C controlled by place Q' (both in process P0). This delay is necessary to complete the resetting (controlled by places in the sets Ω'i) of the one-bit nodes xi corresponding to those processes that are leaving the wait state after signal update. In this way process P0 is forced to stay in state ε until all processes leaving a wait state have executed the actions controlled by places in the sets Ω'i. Thus, all nodes xi which had to be reset have received their new value before condition C is reevaluated. This means that all processes waiting for an event that has been produced have had the necessary time to leave the wait state before process P0 reaches place Q', in which it will stay until all processes are again executing a wait statement. The delay assigned to place ε will be equal to the clock cycle time if we use a single global clock; if the individual FSMs use their own clocks, the delay will be equal to the maximum clock cycle time.
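The role of the supervisor process and of the one-bit nodes can be sketched behaviourally. The Python fragment below is an illustration under stated assumptions, not the generated hardware: the function supervisor_pass and its arguments are hypothetical, condition C is assumed here to be the conjunction of all one-bit nodes, and the delay place is modelled simply as the point where resumed processes reset their nodes. The data are taken from the example of Fig. 1, with an assumed pending value of 5 for signal a.

```python
def supervisor_pass(x, pending, current, waiting_on):
    """One pass of the supervisor: test C, update signals, let processes resume.

    x          : one-bit node per process, True while the process is in a wait
    pending    : last assigned value of each signal (node s')
    current    : value seen by the processes (node s)
    waiting_on : signal each process is waiting on
    """
    if not all(x.values()):
        # Assumed condition C is false: some process is still running.
        return []
    events = {name for name in pending if pending[name] != current[name]}
    current.update(pending)  # place Q: every signal takes its latched value
    resumed = [p for p, sig in waiting_on.items() if sig in events]
    for p in resumed:
        x[p] = False  # done during the delay, before C is evaluated again
    return resumed


x = {"P1": True, "P2": False}          # P1 is waiting, P2 is still running
waiting_on = {"P1": "s", "P2": "a"}
current = {"a": 0, "s": 0}
pending = {"a": 5, "s": 0}             # P1 has assigned 5 to signal a

first = supervisor_pass(x, pending, current, waiting_on)
value_seen_by_p2 = current["a"]        # still the old value

x["P2"] = True                          # P2 reaches its wait statement
second = supervisor_pass(x, pending, current, waiting_on)
```

The first pass leaves the signals untouched because P2 has not yet reached a wait statement. The second pass updates a and releases only P2, the process waiting on the signal that had an event.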
[Figure: single control Petri net combining process P1 and process P2]
Fig. 6. Design representation for generation of one FSM.
Since the control Petri nets corresponding to the processes are disjoint, synthesis of a separate state machine for each process is possible. Each Petri net is translated into an FSM by a reachability marking generation algorithm [25]. The algorithm follows the transition firing rules of Petri nets and supports simultaneous firing of several transitions if they are all enabled. The set of FSMs is synchronized by the FSM corresponding to process P0. With the help of the nodes x1, x2, ..., xk in the data path, P0 coordinates the global synchronization so that signals are updated only when all processes are in a wait state.
The main advantage of this solution is that the complexity of FSM generation is reduced in comparison to generation of a single FSM for the whole system. However, in this case no data path resources are shared between processes, which may result in a certain hardware overhead.
4.3. Synthesis of one state machine
If the complexity and/or the number of processes are not very large, the control structure can be synthesized to a single FSM. This state machine is generated using a design representation that differs from that in Fig. 5. If the designer requests the synthesis of the example in Fig. 1 to a single FSM, the VHDL compiler generates the ETPN structure shown in Fig. 6. Synchronization between waiting processes in order to update signals is moved entirely into the control part where it becomes explicit. The control places Q1, Q2, ..., Qk, one for each process in the system, hold a token only when the corresponding process is executing a wait statement. When all processes are waiting, the transition T can be fired; thus the places Q'1, Q'2, ..., Q'k will get tokens and the respective signals will be updated. If the condition (Cs or Ca in our particular example) associated to
[Figure: control Petri net fragment for a process with several wait statements]
Fig. 7. A process containing several wait statements.
a signal on which a process Pi is waiting becomes true, process Pi will continue (the token is passed from Q'i to the output place of the transition on which the process was waiting). If the condition associated to the signal is false (the expected event did not happen), Pi enters its waiting state again (the token is passed back from Q'i to Qi). For the example illustrated in Fig. 6 we considered only one wait statement in a process. The solution for a general situation, with several wait statements executed by a process, is depicted in Fig. 7.
Moving synchronization entirely into the control part increases the complexity of the control Petri net. But this complexity does not entail the generation of a higher number of FSM states. The additional constraints introduced into the control part are used by CAMAD to eliminate unreachable states at FSM generation and thus to avoid state explosion. For this solution no "supervisor" process P0 has to be generated and there is no need for the register nodes x1, x2, ..., xk in the data path. The control parts corresponding to the processes are tightly interconnected and thus it is appropriate to generate just one single FSM.
Handling the whole representation globally at synthesis for the design of a single FSM allows more control over the allocation of data path elements and offers the possibility of sharing hardware between different processes at synthesis. During the high-level synthesis process, hardware modules can be shared across the process boundaries if the related operations are scheduled into different time steps, which is illustrated by the synthesis results given in Section 6.
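The token flow of the single-FSM scheme can be traced with a short Python sketch. It is only an illustration: the function step, the string names of the places and in particular the place names run1 and run2 (standing for the output place of the transition on which a process was waiting) are hypothetical, and each process is assumed to have a single wait statement as in Fig. 6.

```python
def step(marking, conditions, processes):
    """One control step for the waiting places of the single-FSM scheme.

    marking    : set of marked place names
    conditions : per process, True if the awaited event occurred at the update
    processes  : process indices, for example [1, 2]
    """
    marking = set(marking)
    q = {f"Q{p}" for p in processes}
    q_prime = {f"Q'{p}" for p in processes}
    if q <= marking:
        # Transition T: all processes are waiting, so every Q'i gets a token
        # and the signals are updated.
        return (marking - q) | q_prime
    for p in processes:
        if f"Q'{p}" in marking:
            marking.discard(f"Q'{p}")
            # Event occurred: continue after the wait; otherwise wait again.
            marking.add(f"run{p}" if conditions[p] else f"Q{p}")
    return marking


processes = [1, 2]
conditions = {1: False, 2: True}   # only the event awaited by P2 occurred

blocked = step({"Q1", "run2"}, conditions, processes)   # P2 not waiting yet
updated = step({"Q1", "Q2"}, conditions, processes)     # T fires
after = step(updated, conditions, processes)            # P2 continues, P1 waits
```

Because T needs a token in every Qi, the marking in which one process is past the update while another has not yet reached its wait statement is never produced, which is the kind of constraint that keeps the number of generated FSM states down.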
5. Synthesis with synchronous message passing
The synthesis style presented in the previous section entails the implementation in hardware of the global control imposed by the VHDL simulation cycle. Thus it results in a strong synchronization of the processes that very often exceeds the level needed for the intended functionality of the synthesized system. This oversynchronization is the price paid for an unrestricted use of signals while preserving at the same time simulation semantics at synthesis. The question is whether this oversynchronization can be relaxed, allowing a higher degree of parallelism, without giving up simulation/synthesis correspondence.
When describing a system in VHDL for simulation, the designer can rely on the definition of the simulation cycle. Our SAW synthesis style guarantees, by implementing the synchronization imposed through the simulation cycle, a similar behavior of the synthesized hardware and of the simulation model. If hardware implementation of the global control imposed by the simulation cycle is not supported, a correct synthesis (as defined by simulation semantics) cannot be guaranteed in general. Such a synthesis approach would update signals and schedule processes without enforcing global synchronization, and thus can produce hardware with a functionality that differs from simulation behavior.
Relaxing oversynchronization and preserving at the same time simulation/synthesis correspondence can be solved only under the following assumption: the correct behavior of the VHDL specification to be synthesized should not rely on the implicit synchronization enforced by the simulation cycle. We call a description that conforms to this requirement well synchronized. In a well synchronized VHDL description all the assumptions that provide the proper synchronization and communication between processes are explicitly stated by operations on signals.
We will now present a synthesis strategy that does not reproduce the simulation cycle in hardware while maintaining simulation/synthesis correspondence. It accepts designs specified according to a certain description style and produces independent synchronous FSMs which work in parallel. The descriptions conforming to this style are implicitly well synchronized. When defining this specification style we started from the following main considerations:
- It is possible to produce hardware with a high degree of parallelism and asynchrony when the circuit can be described as a set of loosely connected processes. This means that the amount of communication between these processes is relatively low and that a process usually communicates with a relatively small number of other processes. The enforcement of the simulation cycle for this class of hardware can result in a considerable reduction of the potential parallelism (and consequently of the performance) and at the same time produces an increase in hardware complexity.
- For the designer, the specification style must be defined through a small number of simple rules that have to be respected for circuit description in VHDL.
5.1. The designer's view
With the message passing based SR specification style the designer describes hardware as a set of VHDL processes communicating through signals. Any number of processes can communicate through a given signal (we say that these processes are connected to the signal), but only one of these processes is allowed to assign values to it. Assignment of a value is done by a send command. Processes that refer to the signal wait until a value is assigned to it by calling a receive command. Both send and receive have the syntax of ordinary procedure calls.
A send command, denoted as send(X, e), where X is a signal, e is an expression, and e and X are type compatible, is executed by a process P in two steps:
(1) process P waits until all other processes connected to signal X are executing a receive on this signal (if all these processes are already waiting on a receive for X, then process P enters step 2 directly);
(2) expression e is evaluated and its value is assigned to signal X. This value becomes the new value of X. After that, process P continues its execution.
A receive command, denoted as receive(X), where X is a signal, causes the executing process to wait until a send on signal X is executed.
Communication with send and receive can also be achieved through several signals:
- send(X, e1, Y, e2, ...), where X, Y, ... are signals, and e1, e2, ... are expressions type compatible with the respective signals, handles the communication between the executing process and the other processes connected to signal X (which will get the value resulting from the evaluation of expression e1), to signal Y (which will get the value resulting from the evaluation of expression e2), etc. This communication implies successive synchronization with the respective groups of processes before the new value of the corresponding signal is assigned (according to the rules for simple send and receive); the order of the successive synchronizations is not predefined. Execution of the command terminates after communication on all signal arguments has succeeded.
- receive(X, Y, ...), where all the arguments are signals, causes the executing process to wait until a send on all arguments is executed; synchronization with the sending processes can be realized in any order.
The definition of the send/receive commands ensures that between the execution by a process of two consecutive receives on a given signal X, the value of this signal remains unchanged. This is due to the fact that in this interval no send on that signal can become active. This property is very important from the synthesis point of view (see Section 5.3).
To avoid undesired blocking of a process on a receive command (and possible deadlock situations), the boolean function test is provided; test(X), where X is a signal, returns true if there is a process waiting to execute send on X; otherwise the function returns false.
Communication with send and receive requires synchronization between processes.
However, it is important to note that this synchronization does not necessarily affect all processes but only those involved in a specific communication (the processes connected to a given signal). We illustrate this with the example in Fig. 8. Processes P1 and P3 are connected to signal a; processes P2, P3, and P5 are connected to signal b; processes P4 and P5 are connected to signal c. When P1 executes the send on a, it synchronizes with process P3, which executes a receive on the same signal. P3, for executing the send on b, has to synchronize with P2 and P5, which have to execute receive on b. P4 and P5 are synchronized for send and receive, respectively, on signal c. Apart from these restrictions, no other synchronization is required for the correct behavior of the system.

    signal a, b, c: integer;

    P1: process
      variable x: integer;
    begin
      ...
      send(a, x);
      ...
    end process P1;

    P2: process
    begin
      ...
      receive(b);
      ...
    end process P2;

    P3: process
      variable y: integer;
    begin
      ...
      if test(a) then
        receive(a);
        y := a + 10;
      end if;
      ...
      send(b, 2*y);
      ...
    end process P3;

    P4: process
      variable z: integer;
    begin
      ...
      send(c, z);
      ...
    end process P4;

    P5: process
    begin
      ...
      receive(b, c);
      ...
    end process P5;

Fig. 8. Example of VHDL processes interacting with send/receive.

A VHDL description corresponding to this style is transformed by a preprocessor into an equivalent standard VHDL model for simulation, by expanding send, receive, and test commands into equivalent sequences containing signal assignments and wait statements. Starting from the same initial description, the VHDL synthesis-compiler generates the ETPN internal representation that will be synthesized by the CAMAD system. Simulation/synthesis correspondence will be preserved during the synthesis process, without providing any synchronization of processes additional to the explicit synchronization required by the send/receive commands.
5.2. The simulation model
A VHDL description based on the SR style is translated by a preprocessor to a standard VHDL program for simulation [28]. The generation of the simulation model is done in two main steps:
(1) A package that exports (resolved) bit signals is generated. For each signal X declared by the designer, the package will export the bit signals P1_X, P2_X, ..., Pn_X. Each of these signals corresponds to one of the processes labeled P1, P2, ..., Pn that execute receive on signal X. The generated signals will be used for the implementation of the handshaking protocol between processes. In order to implement the test function, for each signal X that is used as argument of a test, a bit signal t_X will be exported by the generated package. Considering, for instance, the example in Fig. 8, the following package declaration is generated:

    package p_gen is
      function res(s: bit_vector) return bit;
      -- function res implements wired or
      signal P2_b, P3_a, P5_b, P5_c: res bit := '0';
      signal t_a: bit := '0';
    end p_gen;

(2) send and receive commands are expanded, based on predefined templates, to VHDL sequences that implement the handshaking protocol for synchronous message passing between processes connected to the same signals. A reference to function test(X) will be expanded to the boolean expression (t_X = '1'), and to a zero-delay wait statement (wait for 0 ns) preceding the statement that contains the reference.
The generation of standard VHDL sequences corresponding to the operations send and receive, both on one signal and on several signals, is presented in [28]. We illustrate here three such sequences in the context of the example in Fig. 8:
(a) send(b, 2*y) in process P3 will be expanded to:

    if P2_b /= '1' or P5_b /= '1' then
      wait until P2_b = '1' and P5_b = '1';   -- P2 and P5 execute receive on b;
    end if;
    P2_b <= '0';
    P5_b <= '0';
    b <= 2*y;                                 -- update signal b;
    wait for 0 ns;

(b) send(a, x) in process P1 will be expanded to:

    if P3_a /= '1' then                       -- P3 receives on a;
      t_a <= '1';                             -- for the test(a);
      wait until P3_a = '1';
    end if;
    P3_a <= '0';
    t_a <= '0';                               -- for the test(a);
    a <= x;                                   -- update signal a;
    wait for 0 ns;

(c) receive(b, c) in process P5 will be expanded to:

    P5_b <= '1';
    P5_c <= '1';
    wait until P5_b = '0' and P5_c = '0';

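The rendezvous that these expansions realize can be illustrated with a small self-checking simulation model. The sketch below is not the template set of [28]: instead of resolved bit signals it assumes one single-driver flag per receiver (p2_rdy, p5_rdy) and a toggle signal b_ack driven by the sender. These names, the entity name and the chosen values are hypothetical and introduced only for this illustration, and VHDL-93 syntax is assumed. The model covers the send on b by P3 and the receive on b by P2 and P5, with P2 deliberately arriving 10 ns late so that the sender can be seen to wait for all connected receivers.

```vhdl
-- Self-checking sketch of a send/receive rendezvous on signal b.
-- Assumption: the flag and acknowledge signals are illustrative only;
-- they are not the resolved signals generated by the preprocessor of [28].
entity sr_rendezvous_demo is
end entity sr_rendezvous_demo;

architecture sim of sr_rendezvous_demo is
  signal b      : integer := 0;  -- communication signal, written only by P3
  signal p2_rdy : bit := '0';    -- '1' while P2 is blocked in receive(b)
  signal p5_rdy : bit := '0';    -- '1' while P5 is blocked in receive(b)
  signal b_ack  : bit := '0';    -- toggled by P3 each time a send completes
begin

  -- Sender: models send(b, 2*y)
  P3 : process
    variable y : integer := 4;
  begin
    -- Step 1: wait until every connected process executes receive(b)
    if p2_rdy /= '1' or p5_rdy /= '1' then
      wait until p2_rdy = '1' and p5_rdy = '1';
    end if;
    -- Step 2: evaluate the expression and update the signal
    b     <= 2 * y;
    b_ack <= not b_ack;
    wait until p2_rdy = '0' and p5_rdy = '0';  -- both receivers released
    report "send on b completed at " & time'image(now) severity note;
    wait;  -- a single transfer is modelled
  end process P3;

  -- Receiver that reaches receive(b) late
  P2 : process
  begin
    wait for 10 ns;
    p2_rdy <= '1';
    wait on b_ack;  -- blocked until the sender updates b
    p2_rdy <= '0';
    assert b = 8 report "P2 received a wrong value" severity failure;
    assert now = 10 ns report "P2 released at the wrong time" severity failure;
    wait;
  end process P2;

  -- Receiver that reaches receive(b) immediately
  P5 : process
  begin
    p5_rdy <= '1';
    wait on b_ack;
    p5_rdy <= '0';
    assert b = 8 report "P5 received a wrong value" severity failure;
    assert now = 10 ns report "P5 was not held until P2 arrived" severity failure;
    wait;
  end process P5;

end architecture sim;
```

In this sketch the guard in front of the first wait until plays the same role as in expansion (a): without it, a sender that finds all receivers already waiting would block forever, because wait until resumes only on a new event.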
5.3. The synthesis strategy
For the SR style, based on send/receive commands, a signal is represented as a simple data path node (nodes a, b, c, for instance, in Fig. 9). After an assignment (as the result of a send executed on the respective signal) the value of the node is updated directly. The synchronization between the process assigning to a signal and all those accessing it, imposed by the send/receive mechanism, makes a two-step assignment unnecessary. The handshaking protocol between the process executing a send and those executing receive on a certain signal is implemented at synthesis using one-bit data path nodes, similar to the bit signals generated for the simulation model [28,29].
For illustration, Fig. 9 shows the ETPN representation corresponding to the example introduced in Section 5.1 (Fig. 8). For synchronization between process P3, executing send on b, and the other two processes, P2 and P5, connected to the same signal b, condition Cb is used. After both P2 and P5 are waiting on receive for b (and consequently both data path nodes P2_b and P5_b are set), Cb will be true and process P3 can go on to update the value of signal b, under control of place P37. Place P38 controls the resetting of nodes P2_b and P5_b, and thus the condition for process P2 to continue is produced (CP2_b is true). Process P5, which executes receive on b and c, is allowed to continue only after process P4 has also executed the send on c (and thus CP5_b,c becomes true). Synchronization between P1 and P3 for send and receive on signal a, and between P4 and P5 for send and receive on signal c, has been implemented in a similar way. The decision based on test with signal a, in the control part corresponding to process P3, has been implemented using condition Ct_a and its complement, produced in the data path by the one-bit register node t_a.

Fig. 9. Design representation for the send/receive model.

The SR synthesis style leads to hardware structures that work at a higher degree of parallelism than those synthesized according to the SAW style presented in Section 4. It does not require a global synchronization of all processes. Signals need not be implemented by special register nodes with additional functional elements in the data path; they are represented and updated exactly like ordinary variables. The control part can be synthesized to a single FSM, but such a synthesis enforces synchronism on a potentially highly asynchronous design and thus is not a natural approach for this style. Since process interaction is based on a handshaking protocol and does not need any global control, the Petri net controllers of the processes can be implemented as independent and loosely coupled FSMs working in parallel. This is the typical strategy for designs described according to this style.
To take advantage of this synthesis approach the designer has to adapt the description to the SR style. Using send and receive instead of wait statements and signal assignments is more natural and simpler for system level descriptions. This modelling style also facilitates the organization of the design into loosely coupled processes with a well structured interface. If such an organization is possible, this style is a natural approach for synthesis and results in efficient and highly parallel hardware implementations.
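The sender-side synchronization on signal b described in this section can be pictured as a small synchronous controller. The following is a minimal sketch under stated assumptions, not the controller that CAMAD generates: the entity, port and state names are hypothetical, the start input stands for "P3 has reached its send on b", and the mapping of the states S_UPDATE and S_CLEAR to the roles of places P37 and P38 is only an interpretation of the text. VHDL-93 syntax is assumed, and a short self-checking testbench is included.

```vhdl
-- Hypothetical sender-side controller for the send on b; names are illustrative.
entity send_b_ctrl is
  port (clk       : in  bit;
        start     : in  bit;   -- assumed: P3 has reached its send on b
        p2_b      : in  bit;   -- one-bit node, set while P2 waits in receive(b)
        p5_b      : in  bit;   -- one-bit node, set while P5 waits in receive(b)
        load_b    : out bit;   -- strobe: update data path node b
        clr_flags : out bit);  -- strobe: reset nodes P2_b and P5_b
end entity send_b_ctrl;

architecture rtl of send_b_ctrl is
  type state_t is (S_IDLE, S_WAIT, S_UPDATE, S_CLEAR);
  signal state : state_t := S_IDLE;
  signal c_b   : bit;
begin
  c_b <= p2_b and p5_b;  -- condition Cb: all receivers are waiting

  step : process (clk)
  begin
    if clk'event and clk = '1' then
      case state is
        when S_IDLE   => if start = '1' then state <= S_WAIT; end if;
        when S_WAIT   => if c_b = '1' then state <= S_UPDATE; end if;
        when S_UPDATE => state <= S_CLEAR;  -- role of place P37 (assumed)
        when S_CLEAR  => state <= S_IDLE;   -- role of place P38 (assumed)
      end case;
    end if;
  end process step;

  load_b    <= '1' when state = S_UPDATE else '0';
  clr_flags <= '1' when state = S_CLEAR  else '0';
end architecture rtl;

entity send_b_ctrl_tb is
end entity send_b_ctrl_tb;

architecture sim of send_b_ctrl_tb is
  signal clk, start, p2_b, p5_b, load_b, clr_flags : bit := '0';
begin
  dut : entity work.send_b_ctrl
    port map (clk => clk, start => start, p2_b => p2_b, p5_b => p5_b,
              load_b => load_b, clr_flags => clr_flags);

  stim : process
    procedure tick is  -- one clock period with the rising edge in the middle
    begin
      wait for 5 ns; clk <= '1';
      wait for 5 ns; clk <= '0';
    end procedure tick;
  begin
    start <= '1'; p5_b <= '1';
    tick;                       -- S_IDLE -> S_WAIT
    start <= '0';
    tick;                       -- stays in S_WAIT: P2 is not waiting yet
    assert load_b = '0' report "b updated too early" severity failure;
    p2_b <= '1';
    tick;                       -- Cb true: S_WAIT -> S_UPDATE
    assert load_b = '1' report "b not updated" severity failure;
    tick;                       -- S_UPDATE -> S_CLEAR
    assert clr_flags = '1' report "flags not reset" severity failure;
    wait;
  end process stim;
end architecture sim;
```

Because the controller depends only on the two handshake bits of the processes connected to b, it needs no information about any other process, which is the property that allows one independent FSM per process in this style.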
6. Experimental results
In order to test the hardware synthesis strategies we have proposed, several examples have been run with our design environment (Fig. 3). To be relevant from the point of view of our approach, these examples have to be relatively large and the specifications have to be formulated at the level of interacting processes. Thus, classical benchmarks used in the HLS literature, such as the elliptic filter, GCD, AM 2901, etc., are not suitable in this context. The first example we use has been presented as a synthesis benchmark for the DSS system [19], which gives the opportunity to compare our results with those obtained using another system. The second example is of medium size and models a peripheral interface circuit. The last two examples represent complex systems from the telecommunication area.
All examples have been specified in VHDL according to both synthesis styles accepted by our system: SAW, using signal assignments and wait statements, and SR, using message passing primitives. After VHDL compilation and generation of the corresponding ETPN representations, synthesis has been performed using the CAMAD system. We present synthesis results in Tables 1-4, where the complexity of the control logic is given in terms of the total number of control states and the data path cost is specified by the number and type of function units. CPU times used by CAMAD when running on a SUN SPARCstation ELC are also listed. Synthesis of the control structure to a single FSM, according to the strategy presented in Section 4.3, is illustrated only for the first example, due to the complexity of the other three designs. To illustrate the complexity of the synthesized models we give the number of lines of code and the number of processes for each VHDL specification. The number of control places and data path nodes in the internal design representation resulting from compilation is also given.
6.1. The move machine
The example known as the "move machine" [19] is capable of moving instructions and data between memory and processor. ALU operations are considered to be associated with addresses in memory; arithmetic and logic operations are side effects of moving data to and from these locations. The original VHDL description of the move machine given in [19] contains three processes: one for loading the next instruction (Process_3), the second for computing the operand address (Process_1), and the third for executing the instruction (Process_2). The three processes are activated sequentially in a loop, one after the other, which corresponds to the classical execution chain of an instruction.
The original description of the move machine architecture has been adapted to send/receive communication. We have reorganized the architecture as a control unit (represented by Process_1) connected to two memory modules (two similar processes, Process_2 and Process_3), the first one for instructions and the second one for data. The process representing the control unit loads an instruction, computes the address, and executes the instruction. During the execution of an instruction the next one is loaded, to achieve more parallelism.
The synthesis results for the move machine are presented in Table 1. We synthesized the initial description, as given in [19], with process interaction using signal assignments and wait statements, both to a single FSM and to several FSMs. The modified version, using send/receive primitives, has been synthesized to three FSMs, each corresponding to one process. Results reported in [19] for the synthesis of the move machine to synchronous hardware indicate 118 states and a CPU time of 2.27 s for controller generation (this time was obtained using a multiprocessor machine). Comparing the synthesis of a single FSM with that of several FSMs, we observe that the second approach results in the use of additional functional units (with a greater cost for the signal assignment/wait modelling style than for the send/receive one). This is a consequence of the fact that no hardware resources are shared when separate FSMs are generated (see Section 4.3). On the other hand, generating separate FSMs results in a lower complexity of the control structure and, for the SR style, in a higher degree of parallelism.

Table 1
Synthesis results for the move machine

                                          Size of design
                                          representation   Synthesis results
Specification style       Controller      Places  Nodes    Control  Function units (a)   CPU time (s)
and size                                                   states
-----------------------------------------------------------------------------------------------------
signal assignment/wait    Single FSM      42      72       48       1 ALU, 1 dec         476.88
- 63 lines of code        Several FSMs:
- 3 processes               Process_1     12      32       10       1+, 1 dec            1.82
                            Process_2     17      39       10       1+, 1 dec, 1<        31.13
                            Process_3     6       9        4                             0.33
                            Supervisor    4       30       3        1+                   1.13

send/receive                Process_1     50      71       35       1 ALU, 1 dec         7.03
- 84 lines of code          Process_2     11      18       5        1<                   2.37
- 3 processes               Process_3     11      18       5        1<                   2.32

(a) +: adder; <: comparator; dec: decoder; ALUs can be of different types.
6.2. The universal asynchronous receiver/transmitter (UART)
This model is given in [30] as an example for system specification in SpecCharts. We have rewritten it in VHDL, with identical functionality, using both the SAW and the SR modelling style. The UART interfaces an 8-bit processor with a 1-bit peripheral device. It has an 8-bit parallel input/output data port for communicating with the processor and a 1-bit serial input/output port for communicating with the peripheral. The processor can issue certain commands to the UART (load, read, and reset), specified by two externally set bits. Depending on two other bits the UART can be in one of the following modes: receive (assembles 8 bits on the serial input into an internal register, verifies parity, and signals the processor), transmit (the data byte that has been loaded into an internal register is sent on the serial output followed by a parity bit), and echo (the data byte is received serially as in the receive mode and is then automatically transmitted as in the transmit mode).
There is one process in the model taking care of the functionality in each of the three modes (Process_1, Process_2, Process_3). A fourth process (Process_4) signals to the processor when a byte has been assembled and executes the read command by making this byte accessible on the data line. In the model using send/receive operations, a fifth process (Process_5), connected to the external clock, generates an internal signal. By synchronizing with the channel represented by this signal, other processes execute operations which have to be issued synchronously with the clock. The synthesis results for the UART are presented in Table 2.

Table 2
Synthesis results for UART

                                          Size of design
                                          representation   Synthesis results
Specification style       Controller      Places  Nodes    Control  Function units (a)   CPU time (s)
and size                                                   states
-----------------------------------------------------------------------------------------------------
signal assignment/wait      Process_1     32      76       25       1+, 1 ALU, 1<        28.55
- 117 lines of code         Process_2     32      75       24       1+, 1 ALU, 1<        17.30
- 4 processes               Process_3     25      56       18       1+, 2<               22.00
                            Process_4     11      30       10       1+, 2<               3.60
                            Supervisor    4       39       2        1+                   0.83

send/receive                Process_1     39      62       28       1+, 1 ALU            23.75
- 126 lines of code         Process_2     34      68       23       1+, 2<               24.62
- 5 processes               Process_3     19      30       12       1<                   6.73
                            Process_4     15      20       8        1<                   3.12
                            Process_5     9       18       5        1 ALU                3.12

(a) +: adder; <: comparator; ALUs can be of different types.
6.3. The Ethernet network coprocessor
Like the UART, this model is also given in [30] as an example for system specification in SpecCharts. In a HardwareC version it has also been used as a hardware/software partitioning example [31]. We have rewritten it in VHDL, with identical functionality, using both the SAW and the SR modelling style. The coprocessor transmits and receives data frames over a network under the CSMA/CD (carrier sense multiple access with collision detection) protocol. Its purpose is to off-load the host CPU from managing communication activities. The host CPU programs the coprocessor for specific operations by means of eight instructions. Two of the processes in the model deal with enqueuing (Process_1) and decoding/executing (Process_2) these instructions.
Transmission to the network is performed in cooperation by three processes. The first one (Process_4) gets a memory address from the CPU and directly accesses the memory through the local bus in order to read the data. This data is forwarded successively to a second process (Process_3), which packages it in a frame according to a prescribed standard. Frames are then sent as a series of bytes to a third process (Process_5), which outputs them on the serial network line. If a collision is detected, normal transmission is suspended, a number of jam bytes are generated, and after waiting a certain time the frame is retransmitted. After a successful transmission, the unit waits for the period of time required between frames before attempting another transmission.
In parallel with the above processes, four other processes deal with reception from the network. There is one process (Process_6) continuously reading bits from the serial line of the network and sending a succession of bytes to a buffering process (Process_7), which forwards them to the process that has to filter out those frames which are destined for the host system (Process_8).
This process first waits to recognize a start-of-frame pattern and then compares the following two bytes with the address of the host. If the addresses are equal, the rest of the bytes belonging to the frame are read and sent to the next process (Process_9), which writes them to memory using a DMA protocol. The model using send/receive operations contains another process (Process_10) that is connected to the external clock. A signal generated by this process is used throughout the model when processes have to wait for certain time intervals required by the communication protocol.
The synthesis results for the Ethernet coprocessor are presented in Table 3. Several subprograms have been used for the specification of the coprocessor. At synthesis they have not been expanded in-line, but are synthesized as part of the controller corresponding to the process they are called from. For details concerning the synthesis of subprograms in our environment, see [27, 32].

Table 3
Synthesis results for Ethernet coprocessor

                                          Size of design
                                          representation   Synthesis results
Specification style       Controller      Places  Nodes    Control  Function units (a)   CPU time (s)
and size                                                   states
-----------------------------------------------------------------------------------------------------
signal assignment/wait      Process_1     24      60       20       1+, 1 ALU            14.07
- 730 lines of code         Process_2     72      178      53       1+, 1 dec, 2<        44.98
- 9 processes               Process_3     138     277      105      1 ALU, 2 dec         124.12
- 7 procedures              Process_4     122     233      87       1+, 1 dec, 2<        139.60
                            Process_5     40      69       27       1 ALU                77.90
                            Process_6     28      56       23       1 ALU, 1 dec         9.27
                            Process_7     40      71       30       1<                   17.10
                            Process_8     101     193      68       1 ALU, 2 dec         63.93
                            Process_9     44      96       27       1+, 2<               22.57
                            Supervisor    4       173      3        1+                   5.65

send/receive                Process_1     14      45       9        1+, 1 ALU            9.02
- 677 lines of code         Process_2     129     227      101      1+, 1 dec, 2<        87.55
- 10 processes              Process_3     152     201      126      1 ALU                76.68
- 2 procedures              Process_4     117     160      76       1+, 2<               114.77
                            Process_5     19      32       12       1 ALU                4.97
                            Process_6     27      48       20       1 ALU, 1 dec         7.83
                            Process_7     20      32       9        1<                   10.90
                            Process_8     113     169      91       1 ALU, 1 dec         86.17
                            Process_9     37      64       16       1+, 2<               16.43
                            Process_10    30      51       14       1<                   10.63

(a) +: adder; <: comparator; dec: decoder; ALUs can be of different types.

6.4. The operation and maintenance (OAM) block of an ATM switch
Asynchronous transfer mode (ATM) is a standard technology designed to efficiently support high-speed digital voice and data communication [33]. ATM has already been accepted in a wide spectrum of telecommunications and data communications communities as the technology of the next decades. ATM is based on a fixed-size, virtual circuit-oriented packet switching methodology. All ATM traffic is broken into a succession of cells. A cell consists of five bytes of header information and a 48-byte information field. The header field contains control information of the cell (identification, cell loss priority, routing and switching information). Of particular interest in the header are the virtual path identifier (VPI) and the virtual channel identifier (VCI). They are used to determine which cells belong to a given connection.
The OAM functions in the network are performed on five hierarchical levels associated with the ATM and Physical layers of the protocol reference model [34]. The two highest levels, F4 and F5, are part of the ATM protocol layer. They handle the OAM functionality concerning VPs and VCs, respectively [35]:
- Fault management: when the appearance of a fault is reported to the F4/F5 block, special OAM cells will be generated and sent on all affected connections; if the fault persists, the management system should be notified.
- Performance monitoring: normal functioning of the network is monitored by continuous or periodic checking of cell transmission.
- Fault localization: when a fault occurs it might be necessary to localize it further; for this purpose special loop back OAM cells are used.
- Activation/Deactivation: a special protocol for activation and deactivation of OAM functions that require active participation of several F4/F5 blocks, e.g. performance monitoring, has to be implemented.
To perform these functions, the F4/F5 block of an ATM switch deals with specially marked ATM cells, referred to as OAM cells: activation/deactivation cells, performance management cells, and fault management cells.
We have specified the functionality of the F4 block in VHDL in order to experiment with the different alternatives of hardware and hardware/software synthesis in our environment. Results obtained for hardware synthesis of our models, using both the SAW and the SR modelling style, are presented in Table 4. The models consist of four processes. The main process (Process_1) transfers received user cells to the output and generates output cells as a direct result of receiving certain incoming cells. Performance monitoring and activation/deactivation cells are generated together with response signals to the management system. This process also handles inputs from the management system and the physical layer. When, as the result of an incoming cell, other cells have to be generated at fixed time intervals, a second process (Process_2) is notified to take care of this. Buffering on each of the two output streams is controlled by two other processes (Process_3 and Process_4). Details of the functionality and structure of the F4 model are given in [36].

Table 4
Synthesis results for the F4 block

                                          Size of design
                                          representation   Synthesis results
Specification style       Controller      Places  Nodes    Control  Function units (a)     CPU time (s)
and size                                                   states
-------------------------------------------------------------------------------------------------------
signal assignment/wait      Process_1     484     1576     385      4+, 2 ALU, 1*, 3 dec   502.38 (b)
- 1277 lines of code        Process_2     64      258      49       1+, 1 ALU, 1*, 1 dec   98.12
- 4 processes               Process_3     28      83       16       1+, 1 ALU, 1*, 1<      85.25
                            Process_4     18      59       11       1+, 1 ALU, 1*, 1<      48.57
                            Supervisor    4       71       2        1+                     1.13

send/receive                Process_1     510     1603     406      3+, 3 ALU, 1*, 3 dec   680.75 (b)
- 1216 lines of code        Process_2     68      268      49       1+, 1 ALU, 1*, 1 dec   47.77
- 4 processes               Process_3     26      70       12       1+, 1 ALU, 1*          54.05
                            Process_4     5       6        2                               0.77

(a) +: adder; *: multiplier/divider; <: comparator; dec: decoder; ALUs can be of different types.
(b) This experiment has been carried out on a SPARCstation 20 (instead of a SPARCstation ELC), because of memory requirements.
6.5. Discussion
The presented examples demonstrate that our system can efficiently synthesize hardware from complex VHDL specifications consisting of interacting processes. This efficiency is shown in terms of both the resulting hardware and the optimization time. As pointed out for the first example, synthesis to a single FSM is a feasible alternative if the number and size of the processes in the model are not too large. In this case a lower cost of the data path can be obtained, at the price of a higher complexity of the control structure and a longer optimization time.
Synthesis of several FSMs can be performed according to both the SAW and the SR style. Our results indicate that models specified according to the SR style, based on the send/receive communication paradigm, are synthesized at a lower data path cost (see the number of functional units and the number of nodes in the design representation). This difference grows with the number of signals used by the model for process interaction and is mainly due to the simplified representation of signals in the SR approach. In terms of control part complexity, the hardware structures synthesized from the two specification styles are very similar, as shown by the comparison of the numbers of control states. This leads to the conclusion that the SR synthesis style produces hardware of lower cost which, as discussed in the previous section, works at a higher degree of parallelism. At the same time, system specification using these primitives is adequate for the design of medium size and large systems.
One of the main objectives in the development of our environment was not only to allow the design specification for synthesis to be formulated as a set of interacting processes, but also to produce hardware that works at a high degree of parallelism. We therefore developed a synthesis strategy which allows the relaxation of the oversynchronization resulting from VHDL simulation semantics, as discussed in Section 5. This improves the interprocess parallelism of the resulting hardware. The CAMAD system then generates controllers by making data path/controller trade-offs in order to produce an efficient implementation [25]. In Tables 1-4 this is reflected by the reduction of the total number of control states to 65%, on average, of the total number of control places in the design representation.
The above discussion does not apply to the synthesis of the move machine architecture to a single FSM. Generally speaking, collapsing several parallel control Petri nets into a single FSM leads to state explosion. Our approach, however, constrains the growth of the controller by eliminating unreachable states [25], producing a total number of control states which is of the same order of magnitude as the number of places in the initial control Petri net, as shown in Table 1.
7. Conclusions
In this paper we present a technique for system synthesis which concentrates on systems specified as VHDL concurrent processes. The selection of VHDL is justified by its wide use as well as by its general programming language features. Our main objective is to develop an efficient synthesis method which preserves the VHDL simulation-based semantics.
The work started with an analysis of the VHDL simulation semantics of the communication facilities of concurrent processes. It was noted that VHDL process interaction with signals and wait statements always involves all processes, regardless of their actual communication needs. This leads to inefficiencies in the final implementation.
We have developed two styles for synthesis: the SAW and the SR style. The SAW specification style makes it possible to use VHDL processes and communication through signals in the way defined in standard VHDL. In many cases, however, this has a big impact on the performance and/or cost of the synthesized hardware. To overcome this disadvantage the second style, based on message passing primitives, has been proposed. It is based on higher level communication facilities between concurrent processes. In this way we localized the communication between processes and obtained better designs in terms of performance and cost.
The results of this research indicate that the VHDL language can be used efficiently for system design if it is extended with higher level communication facilities, such as those proposed in this paper. The communication facilities offered by VHDL (wait statements and signals) are too low level for system specification and require, in the general case, too much implementation overhead. It is also important to note that the problem cannot be solved by handling one process at a time separately, since the overall synchronization of the system has to be considered.
We have performed extensive experiments with practical examples. Experimental results demonstrate that our system can efficiently synthesize hardware from complex VHDL specifications consisting of interacting processes.
In the context of mixed hardware/software synthesis our approach can be used both for system specification and for the synthesis of the processes assigned to the hardware partition. In [37] we discussed some aspects concerning the hardware/software co-design of VHDL specifications using the SR design style.
Future work will be dedicated to the integration of our methodologies and tools into a unified hardware/software co-design environment. Another topic for future research is the investigation of different communication facilities between concurrent processes and of their implementation costs. This also covers research on the synthesis of different communication structures and on their optimization for special cases.
Acknowledgements
The authors would like to thank Marius Minea for his contribution to the design and implementation of the VHDL front-end for the CAMAD system.
References
[1] D.D. Gajski, N.D. Dutt, A.C.-H. Wu and S.Y.-L. Lin, High-Level Synthesis, Introduction to Chip and System Design (Kluwer Academic Publishers, Boston, 1992).
[2] P. Michel, U. Lauther and P. Duzy, eds., The Synthesis Approach to Digital System Design (Kluwer Academic Publishers, Boston, 1992).
[3] G. De Micheli, Synthesis and Optimization of Digital Circuits (McGraw-Hill, New York, 1994).
[4] D.D. Gajski, F. Vahid, S. Narayan and J. Gong, Specification and Design of Embedded Systems (Prentice-Hall, Englewood Cliffs, NJ, 1994).
[5] R. Camposano, From behavior to structure: high-level synthesis, IEEE Des. Test Comput. 7 (5) (1990) 8-19.
[6] R. Camposano, L.F. Saunders and R.M. Tabet, VHDL as input for high-level synthesis, IEEE Des. Test Comput. 8 (1) (1991) 43-49.
[7] A. Postula, VHDL specific issues in high level synthesis, in: Proc. Euro-VHDL'91 (1991) pp. 70-77.
[8] W. Wolfe and R. Manno, High-level modeling and synthesis of communicating processes using VHDL, IEICE Trans. Inform. Systems E76-D (9) (1993) 1039-1046.
[9] IEEE Standard VHDL Language Reference, IEEE Std. 1076-1987 (IEEE Computer Soc. Press, Silver Spring, MD, 1987).
[10] IEEE Standard VHDL Language Reference, IEEE Std. 1076-1993, Revision of IEEE Std. 1076-1987 (IEEE Computer Soc. Press, Silver Spring, MD, 1993).
[11] P. Harper, S. Krolikoski and O. Levia, Using VHDL as a synthesis language in the Honeywell VSYNTH System, in: J.A. Darringer and F.J. Rammig, eds., Computer Hardware Description Languages and their Applications (North-Holland, Amsterdam, 1990) pp. 315-330.
[12] V. Nagasamy, N. Berry and C. Dangelo, Specification, planning and synthesis in a VHDL design environment, IEEE Des. Test Comput. 9 (1992) 58-68.
[13] S. März, K. Buchenrieder, P. Duzy, R. Kumar and T. Wecker, CALLAS: a system for automatic synthesis of digital circuits from algorithmic behavioral descriptions, in: Proc. EuroASIC Conf. (1989) pp. 131-142.
[14] P. Stoll and P. Duzy, High-level synthesis from VHDL with exact timing constraints, in: Proc. IEEE/ACM Design Automation Conf. (IEEE Computer Soc. Press, Silver Spring, MD, 1992) pp. 188-193.
[15] J. Biesenack et al., The Siemens high-level synthesis system CALLAS, IEEE Trans. Very Large Scale Integration (VLSI) 1 (3) (1993) 244-253.
[16] H. Krämer and J. Müller, Assignment of global memory elements for multi-process VHDL specifications, in: Proc. IEEE/ACM Internat. Conf. on CAD (IEEE Computer Soc. Press, Silver Spring, MD, 1992) pp. 496-501.
[17] L. Ramachandran, F. Vahid, S. Narayan and D. Gajski, Semantics and synthesis of signals in behavioral VHDL, in: Proc. EURO-DAC/VHDL'92 (IEEE Computer Soc. Press, Silver Spring, MD, 1992) pp. 616-621.
[18] J. Müller and H. Krämer, Analysis of multi-process VHDL specifications with a Petri net model, in: Proc. EURO-DAC/VHDL'93 (IEEE Computer Soc. Press, Silver Spring, MD, 1993) pp. 474-479.
[19] J. Roy, N. Kumar, R. Dutta and R. Vemuri, DSS: a distributed high-level synthesis system, IEEE Des. Test Comput. 9 (2) (1992) 18-32.
[20] R.A. Bergamaschi and A. Kuehlmann, A system for production use of high-level synthesis, IEEE Trans. Very Large Scale Integration (VLSI) 1 (3) (1993) 233-243.
[21] F. Vahid, S. Narayan and D.D. Gajski, A transformation for integrating VHDL behavioral specification with synthesis and software generation, in: Proc. EURO-DAC/VHDL'94 (IEEE Computer Soc. Press, Silver Spring, MD, 1994) pp. 552-557.
[22] D. Filo, D.C. Ku, C.N. Coelho and G. De Micheli, Interface optimization for concurrent systems under timing constraints, IEEE Trans. Very Large Scale Integration (VLSI) 1 (3) (1993) 268-281.
[23] C.N. Coelho and G. De Micheli, Dynamic scheduling and synchronization synthesis of concurrent digital systems under system-level constraints, in: Proc. IEEE/ACM Internat. Conf. on CAD (IEEE Computer Soc. Press, Silver Spring, MD, 1994).
[24] W. Ecker, Using VHDL for HW/SW co-specification, in: Proc. EURO-DAC/VHDL'93 (IEEE Computer Soc. Press, Silver Spring, MD, 1993) pp. 500-505.
[25] Z. Peng and K. Kuchcinski, Automated transformation of algorithms into register-transfer level implementation, IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems 13 (2) (1994) 150-166.
[26] Z. Peng, Digital system simulation with VHDL in a high-level synthesis system, Microprocess. Microprogram. EUROMICRO J. 34 (1-5) (1992) 263-269.
[27] P. Eles, K. Kuchcinski, Z. Peng and M. Minea, Compiling VHDL into a high-level synthesis design representation, in: Proc. EURO-DAC/VHDL'92 (IEEE Computer Soc. Press, Silver Spring, MD, 1992) pp. 604-609.
[28] P. Eles, K. Kuchcinski, Z. Peng and M. Minea, Two methods for synthesizing VHDL concurrent processes, Research Report, LiTH-IDA-R-93-22, Linköping University, 1993.
[29] P. Eles, K. Kuchcinski, Z. Peng and M. Minea, Synthesis of VHDL concurrent processes, in: Proc. EURO-DAC/VHDL'94 (IEEE Computer Soc. Press, Silver Spring, MD, 1994) pp. 540-545.
[30] S. Narayan, F. Vahid and D.D. Gajski, Modeling with SpecCharts, Technical Report #90-20, Dept. of Information and Computer Science, University of California, Irvine, 1990 (Revised 1992).
[31] R.K. Gupta and G. De Micheli, System synthesis via hardware-software co-design, Technical Report No. CSL-TR-92-548, Computer Systems Laboratory, Stanford University, 1992.
[32] P. Eles, K. Kuchcinski, Z. Peng and M. Minea, Synthesis of VHDL subprograms and processes in the CAMAD system, in: Proc. Workshop on Design Methodologies for Microelectronics and Signal Processing, Gliwice-Cracow, Poland (1993) pp. 359-367.
[33] M. De Prycker, Asynchronous Transfer Mode: Solution for Broadband ISDN (Ellis Horwood, New York, 2nd ed., 1993).
[34] B-ISDN Operation and Maintenance Principles and Functions, ITU-T Recommendation I.610, 1993.
[35] Generic Requirements for Operations for Broadband Switching Systems, Bellcore, TA-NWT-001248 issue 2, 1993.
[36] A. Doboli, J. Hallberg and P. Eles, A simulation model for the operation and maintenance functionality in ATM switches, in: Proc. Internat. Conf. on Technical Informatics, Timisoara, Romania, 4 (1994) pp. 31-40.
[37] P. Eles, Z. Peng and A. Doboli, VHDL system-level specification and partitioning in a hardware/software co-synthesis environment, in: Proc. 3rd Internat. Workshop on Hardware/Software Codesign (IEEE Computer Soc. Press, Silver Spring, MD, 1994) pp. 49-55.
Petru Eles received the M.Sc. degree in computer engineering and computer science from the Technical University Timisoara, Romania, in 1979, and the Ph.D. degree in computer engineering from the "Politehnica" University Bucuresti, Romania, in 1993. He is currently an Associate Professor at the Department of Computer Science and Engineering, the Technical University Timisoara, Romania. For several semesters he has visited the CADLAB at the Department of Computer and Information Science, Linköping University, Sweden. His research interests include high-level and system-level synthesis of digital circuits, hardware/software codesign, hardware description languages, and languages for parallel computing. He has published extensively in these areas. He is a member of the ACM and of the board of directors of Euromicro. Dr. Eles was a corecipient of best paper awards at the 1992 and 1994 European Design Automation Conference.
Krzysztof Kuchcinski received the M.Sc. and Ph.D. degrees in computer engineering and computer science from Gdansk University of Technology, Poland, in 1977 and 1984, respectively. He is currently an Associate Professor in Computer Systems at the Department of Computer and Information Science, Linköping University, which he joined in 1986. He is also the head of the CADLAB (Computer Aided Design Laboratory) group at the same university. His research interests concentrate on design methods for digital systems at the architectural level, and in particular on high-level synthesis, hardware/software co-design, design for testability, and design modeling and evaluation. He has published over 40 technical papers in these areas. He is a member of the board of directors of Euromicro. Dr. Kuchcinski was a corecipient of two best paper awards at the EURO-DAC (European Design Automation Conference) in 1992 and 1994.
Zebo Peng received the B.Sc. degree in Computer Engineering from the South China Institute of Technology, China, in 1982, and the Licentiate of Engineering and Ph.D. degrees in Computer Science from Linköping University, Sweden, in 1985 and 1987, respectively. He is currently an Associate Professor in the Department of Computer and Information Science at Linköping University. His research interests include high-level synthesis, formal description of hardware, design for testability, test synthesis, hardware/software codesign, and modeling and synthesis of real-time systems. He has published over 40 technical papers in these areas. Dr. Peng was a corecipient of two best paper awards at the European Design Automation Conference (EURO-DAC) in 1992 and 1994.