INTEGRATION, the VLSI journal 40 (2007) 94–105 www.elsevier.com/locate/vlsi
Constrained algorithmic IP design for system-on-chip P. Coussy, E. Casseau, P. Bomel, A. Baganne, E. Martin LESTER Lab. FRE2734, UBS University, BP 92116, 56321 Lorient Cedex, France

Abstract

In the system-on-chip design context, RTL design of complex digital signal processing coprocessors can be improved by using an algorithmic description as input to the synthesis process. System integration, a major step in SoC design, requires taking communication and timing constraints into account to design and integrate dedicated hardware accelerators. In this paper, we propose a design flow based on formal models that allows high-level synthesis of DSP algorithms under input/output timing constraints. Based on a generic architecture, the presented method provides automatic generation of customized hardware components. We show the effectiveness of our approach in a case study of a maximum a posteriori (MAP) algorithm for turbo decoding. © 2006 Elsevier B.V. All rights reserved.

Keywords: IP design and integration; Hardware optimization; Real-time SoC design; DSP applications; MAP algorithm

1. Introduction

Due to the complexity of today's digital signal processing (DSP) applications, the design of digital systems starting at the register transfer level (RTL) is extremely complex and time consuming. In this context, to satisfy time-to-market pressure, designers need a more direct path from the functionality down to the silicon: they need a layered system design flow and associated CAD tools to manage SoC complexity in a shorter time. Furthermore, an environment that helps the designer explore the design space more thoroughly and find optimized designs is required. Hence, in [24,21], the authors proposed approaches that use Matlab/Simulink/Stateflow tools for system specification and produce a VHDL RTL architecture of the system. Based on hardware macro-generators that use the "generic"/"generate" mechanisms, the synthesis process can be summarized as block instantiation [4]. In the same context, observing that many of the hardware functions of a SoC are well known and have already been implemented, the system design flow can be dramatically accelerated by re-using hardware blocks instead of re-designing them from scratch [14]. Current design trends give priority to

E-mail address: [email protected] (P. Coussy).
0167-9260/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.vlsi.2006.02.003

hardware and software re-use through virtual component (VC), or intellectual property (IP) core, exchange [14,16]. However, though such components may be parameterizable, they rely on a fixed architectural model with very restricted customization capabilities. This lack of flexibility in RTL blocks is especially true for the communication unit (COMU), whose sequence orders and/or timing requirements are fixed. These components are hence integrated in the final SoC architecture by using specific interfaces [13] or wrappers [19] that adapt the system communication features to the component requirements, i.e. the SoC integrator has to manage both the IP execution requirements and the integration constraints. This critical step requires a good modelling of both sets of constraints, and techniques to design the interface module [6]. Unfortunately, this adaptation increases the final SoC area and also decreases the system performance. In some cases, the I/O timing requirements cannot be respected due to the wrapper overhead, which can cause the SoC design to fail. High-level synthesis (HLS) can be used to solve these flexibility problems. HLS is analogous to software compilation transposed to the hardware domain: the source specification is written in a high-level language that models the behavior of a complex hardware component; an automatic refinement process maps the described behavior onto a specific technology target depending on optimization/design constraints [9,11]. A typical HLS tool


performs four main tasks: (1) source specification analysis (identifying computations); (2) hardware resource selection and allocation for each kind of operation; (3) operation scheduling; (4) optimized architecture generation, including a datapath and a control finite-state machine. HLS is a constraint-based synthesis flow: hardware resources are selected from technology-specific libraries of basic components (arithmetic and logic units, registers, multiplexors) where components are characterized in terms of gate count, delay, power consumption, etc. Resource selection/allocation and operation scheduling can be constrained to limit hardware complexity (i.e. the number of allocated resources) and reach a given computation rate. Each set of supported parameter values and synthesis constraints allows instantiating a different dedicated architecture that will fulfill specific functional requirements and achieve specific performance. As a result, HLS tools can be seen as a relevant approach for designing and reusing highly flexible IP cores. A general view of the system design flow including HLS and behavioral VCs is shown in Fig. 1. In the real-time data-intensive application domain, processing is performed on growing data streams. The system/architecture design therefore has to focus on avoiding bottlenecks in the buses and I/O buffers for data transfer, reducing the cost of data storage, and satisfying strict timing constraints and high data rates. A design flow based on synthesis under constraints is thus needed. This includes: (1) modelling styles to represent system constraints and algorithm requirements, (2) analysis steps checking the feasibility and the consistency of the system constraints with respect to the algorithm constraints, and (3) methods and techniques for optimal synthesis of the different IP core parts: processing unit (PU), memory unit (MU), control unit (CU) and COMU.
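As a toy illustration of task (2), resource selection from a characterized library, here is a hedged sketch; the library entries, operator names and area/delay figures are invented for illustration and are not from the paper:

```python
# Hypothetical operator library: for each operation type, several candidate
# operators characterized by area (gate count) and latency (cycles).
library = {"add": [{"name": "rca", "area": 120, "cycles": 2},
                   {"name": "cla", "area": 300, "cycles": 1}],
           "mul": [{"name": "seq", "area": 500, "cycles": 4},
                   {"name": "par", "area": 2000, "cycles": 1}]}

def select(op_type, max_cycles):
    """Pick the smallest-area operator meeting the cycle budget."""
    fits = [c for c in library[op_type] if c["cycles"] <= max_cycles]
    return min(fits, key=lambda c: c["area"])["name"]

print(select("add", 2), select("mul", 2))   # rca par
```

A real HLS tool trades off many more metrics (power, interconnect), but the principle is the same: the timing constraint prunes the library, and a cost function picks among the survivors.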

Parameters/constraints

Mathematical specification

Application parameters

Choice of algorithm(s)

Algorithmic parameters

Behavioral refinement

System-level parameters Architectural parameters

Behavioral IP Instantiation and High-Level Synthesis Soft IP

RTL synthesis

Technological parameters

Firm IP Physical synthesis Hard IP

Fig. 1. Design flow including HLS and IP.


We propose in this paper a novel approach in which we jointly design both the processing and the communication parts of the IP core using constrained synthesis. This kind of approach allows automating the integration of dataflow-dominated hardware accelerators typically used in DSP applications. Taking the IP core timing requirements into account early in the system design flow allows an optimized integration by anticipating synchronization problems. The solution using IP synthesis under constraints requires the use of HLS tools, efficient constraint modelling, and constraint analysis phases to detect inconsistencies. To our knowledge, none of the existing works tackles hardware IP design and integration oriented towards the DSP domain, where dataflow-dominated processing is common, by using HLS under I/O and timing constraints. Compared to classical component integration based on pre-designed RTL architectures and wrapper generation, the main advantages of our approach are:

• optimal synthesis and integration of the component by taking into account, in its specification, the system integration constraints, such as bus format, data size or I/O timing properties;
• wide design space exploration through automatic I/O and timing constraint analysis;
• support for variable but bounded timing constraints instead of unbounded or strict-delay operation.

The paper is organized as follows: in Section 2, we present the proposed design approach. The modelling scheme of both design constraints and IP behavior, and the analysis steps, are described. The generic architecture targeted by our synthesis flow is also presented. In Section 3, the effectiveness of our approach is shown in the case of a maximum a posteriori (MAP) design.

2. Design approach

Our methodology proposes to raise the abstraction level of IP synthesizable models with the concept of behavioral IP [25], described at the algorithmic level and specified using a hardware description language (HDL) such as SystemC [20] or SpecC [11]. Starting from the system description and its architecture model, the integrator refines and specifies, for each bus or port that connects the IP to other SoC components, the I/O protocols, data sequence orders and timing information of transfers. This new design approach is presented in Fig. 2.

1. The virtual component specification is modelled by a signal-flow graph (SFG). The difference between a SFG and a data-flow graph (DFG) consists in the delay operation, used in DSP modelling to express the use of a data computed in the previous iteration of the algorithm (iteration period R). This intermediate SFG representation is carried out by the compilation phase during the


[Fig. 2 summarizes the approach. Modelling: system specification and refinement produce, on one side, the algorithm specification (SFG), from which, together with the technological library, the constraint graph ACG is generated, and, on the other side, the communication features and IP design constraints, from which the constraint graph IOCG is specified. Analysis: rate and constraint analyses check each graph; the graphs are then merged (GCG) and the GCG is analyzed in turn. Synthesis: PU synthesis and control generation produce an RTL IP core "ready to be plugged" for system integration, alongside interface/wrapper synthesis.]

Fig. 2. Design approach.

behavioral synthesis process [10]. In a first step, an algorithmic constraint graph (ACG) is generated from the operator latencies and the data dependencies expressed in the SFG. The latencies of the operators are assigned to operation vertices of the ACG during the operator selection step of the behavioral synthesis flow. In order to support the features of communication architectures specific to DSP applications, we extend the SIF (Sequential Intermediate Format) model [17]. This new formal model, named I/O constraint graph (IOCG), supports the expression of integration constraints for each bus (i.e. port) that connects the IP to other SoC components. It allows (1) the specification of transfer-related timing constraints such as ordered transactions, relative timing specifications and min–max delays, (2) the inclusion of architecture features, and (3) the expression of variations in the data transfer times. Finally, a global constraint graph (GCG) is generated by merging the ACG with the IOCG. Merging is done by mapping the vertices and associated constraints of the IOCG onto the input and output vertex sets of the ACG. A minimum timing constraint on output vertices of the IOCG (earliest date for data transfer) is transformed in the GCG into a maximum timing constraint (latest date for data computation/production). 2. The IP behavior and IP integration constraints being described in a formal model, the feasibility between

the rate, the data dependencies of the algorithm and the technological constraints is analyzed. This first analysis checks the ACG for positive cycles to ensure that the constraint graph is feasible without considering input arrival dates. In a second analysis step, the consistency of the IP integration constraints with respect to the algorithm constraints is analyzed. Consistency analysis refers to the dynamic behavior of the GCG, i.e. whether the constraints required on output data are always satisfied given the behavior specified for input data. 3. The entry point of the IP core design task is the global constraint graph GCG. This design step relies on the synthesis of the different functional units of the IP core: PU (processing unit), MU (memory unit), CU (control unit) and COMU (communication unit). This target architecture allows us to handle the processing and communication parts separately [23].

2.1. Modelling step

2.1.1. I/O constraint graph (IOCG)

The IOCG model allows (1) the specification of transfer-related timing constraints such as ordered transactions, relative timing specifications and min–max delays, (2) the inclusion of architecture features, and (3) the expression of timing variations in the data transfer times.


Definition. An I/O constraint graph is a hierarchical oriented polar weighted graph IOCG(V, E) where the vertex set V = {v_0, ..., v_n} represents data transfers or dummy operations, v_0 and v_n being the source and the sink vertex, respectively. The edge set E = {(v_a, v_b)} represents dependencies between vertices. A weight w_ab associated with the edge (v_a, v_b) represents a timing constraint between the transfers of data v_a and v_b.
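A minimal sketch of this structure (the class and method names are ours, not the paper's): forward edges carry minimum timing constraints with positive weights, backward edges carry maximum constraints with negative weights, as formalized in the text below.

```python
# Hedged sketch of an IOCG-like weighted constraint graph.
class ConstraintGraph:
    def __init__(self):
        self.edges = {}   # (source vertex, target vertex) -> weight

    def add_min_constraint(self, va, vb, l_ab):
        """s(vb) >= s(va) + l_ab: forward edge with positive weight l_ab."""
        assert l_ab >= 0
        self.edges[(va, vb)] = l_ab

    def add_max_constraint(self, va, vb, u_ab):
        """s(vb) <= s(va) + u_ab: backward edge (vb, va) with weight -u_ab."""
        assert u_ab >= 0
        self.edges[(vb, va)] = -u_ab

g = ConstraintGraph()
g.add_min_constraint("v0", "va", 2)   # va starts at least 2 cycles after v0
g.add_max_constraint("va", "vb", 10)  # vb starts at most 10 cycles after va
print(g.edges)
```

Keeping both kinds of constraint as signed edge weights in a single graph is what later lets feasibility be checked by a single cycle test.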

[Fig. 3 shows two IOCG fragments: a dummy node y, reached from v_0 by an edge l_0 and connected to v_a by an edge l_ya, breaks the timing correlation between v_0 and v_a; the edges l_ab + δ(v_a) and -u_ab - δ(v_a) carry the minimum and maximum constraints between v_a and v_b.]

Fig. 3. Dummy node example.

v_0 and v_n represent the beginning and the end of the transfer sequence, respectively. v_0 hence represents the start signal going from the system to the IP core. An edge e_ab = (v_a, v_b) represents a temporal dependency between transfers v_a and v_b such that, for any execution of the IOCG, transfer v_a must always be completed before the transfer of v_b can be initiated. Let s(v_a) be the time at which transfer v_a can be initiated. A forward edge (v_a, v_b) in the IOCG with a positive weight w_ab = l_ab >= 0 represents a minimum timing constraint that requires s(v_b) >= s(v_a) + l_ab. A backward edge (v_b, v_a) in the IOCG with a negative weight w_ba = -u_ab <= 0 represents a maximum timing constraint u_ab >= 0 that requires s(v_b) <= s(v_a) + u_ab. In order to represent the timing variations associated with data exchanges, we define the notion of transfer delay for an input vertex v_a. As described in Section 2 (see also [15]), arbitration phases or handshaking transform transfer times into transfer intervals (timing frames). We assume that the transfer delay can be expressed by a bounded execution delay δ(v_a) ∈ [δ_min(v_a), δ_max(v_a)]. Hence, in such cases, the weight w_ab associated with an edge e_ab is decomposed into a timing constraint and a transfer delay: for a minimum timing constraint w_ab = l'_ab = l_ab + δ(v_a), and for a maximum timing constraint w_ba = -u'_ab = -u_ab - δ(v_a). For readability, the two notations v_a and a will be used interchangeably in the rest of the paper. Dummy nodes are used, for example, to break the timing correlation between two transfers. Fig. 3 shows how dummy nodes can be used to express the timing overhead introduced, for instance, by protocol or data format conversion. Architecture features (serial transfer, parallel transfer, etc.) can be modelled in the IOCG and are embedded in hierarchical vertices. Fig. 4 shows a burst transfer of a data vector V_A followed by the serial transfer of the scalar v_b.
Characteristics of the burst, such as its data transfer rate, are embedded in the hierarchical vertex using backward and forward edges between the vertices (v_a1, v_a2, v_a3). In such a case, the transfer delay δ is associated with the hierarchical node itself.
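These edge semantics can be made concrete with a small checker (helper names are ours, not the paper's): every edge (x, y) of weight w encodes the single inequality s(y) >= s(x) + w, with the transfer delay δ(v_a) folded into the weights as described above.

```python
# Hedged sketch: validate a set of transfer start times against a
# constraint graph under the uniform edge semantics of the text.
def schedule_satisfies(s, edges):
    """s: dict vertex -> start time; edges: dict (x, y) -> weight."""
    return all(s[y] >= s[x] + w for (x, y), w in edges.items())

delta_va = 1                              # transfer delay of va for this run
edges = {("va", "vb"): 3 + delta_va,      # min: s(vb) >= s(va) + l_ab + delta
         ("vb", "va"): -8 - delta_va}     # max: backward edge, w = -u_ab - delta
print(schedule_satisfies({"va": 0, "vb": 5}, edges))  # s(vb) within [4, 9]
```

With s(va) = 0 the two edges pin s(vb) into a window; moving vb outside it makes the check fail.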


Fig. 4. (a) IP core interdependency with system and (b) IOCG hierarchical vertices.

2.1.2. IP signal-flow graph (SFG)

The input of our IP synthesis task consists of a description of the IP core functionality in a HDL such as VHDL, SystemC or SpecC. This initial description is compiled into an intermediate representation that exhibits the parallelism of the application. This graph-based model is named SFG and is defined as follows.

Definition. A signal-flow graph is an oriented polar graph SFG(V, E) where the vertex set V = {v_0, ..., v_n} represents operations, v_0 and v_n being the source and the sink vertex, respectively. The edge set E = {(v_i, v_j)} represents the dependencies between operations.

A vertex in a SFG represents one of the following operations: arithmetic, logic, data or delay. As said previously, the difference between a SFG and a data-flow graph (DFG) consists in the delay operation, which is used in the DSP domain to express the use of a data computed in the previous iteration of the algorithm (see Fig. 5). Moreover, we define two vertex subsets representing read and write operations through the IP interface.

Definition. Input vertices of SFG(V, E) consist of all the data vertices written (produced) by the environment onto the IP core ports and are denoted by I ⊆ V. Output vertices of SFG(V, E) consist of all the vertices read (consumed) by the environment on the IP core ports and are denoted by O ⊆ V.

2.1.3. Algorithmic constraint graph (ACG)

An algorithmic constraint graph ACG is generated by using the operator latencies from the technological library and the data dependencies expressed in the SFG. The latency l of the operators is assigned to operation vertices of the ACG during the operator selection step of the behavioral synthesis flow. We consider real-time implementations of DSP applications, which are characterized by an iteration period R: all computations are repeated every R time units.
Real-time implementation of such applications onto SoC architectures should satisfy the following constraints:

• execution timing constraints on each component;
• I/O data transfer sequences and their timing specifications.
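As a toy illustration of the SFG model of Section 2.1.2 (the filter and all vertex names are invented by us): a first-order recursion y(n) = a·x(n) + b·y(n-1), where the z^-1 vertex is the delay operation that distinguishes an SFG from a plain DFG, and where the I/O vertex sets fall out of the edge structure.

```python
# Hedged sketch: SFG edges for y(n) = a*x(n) + b*y(n-1).
sfg_edges = [("x", "mul1"), ("a", "mul1"), ("mul1", "add"),
             ("z-1", "mul2"), ("b", "mul2"), ("mul2", "add"),
             ("add", "y"), ("add", "z-1")]   # feedback through the delay

vertices = {v for e in sfg_edges for v in e}
produced = {dst for _, dst in sfg_edges}       # has an incoming edge
consumed = {src for src, _ in sfg_edges}       # has an outgoing edge
inputs  = vertices - produced                  # I: written by the environment
outputs = {v for v in vertices if v not in consumed}  # O: read by the environment
print(sorted(inputs), sorted(outputs))         # ['a', 'b', 'x'] ['y']
```

The feedback edge through z-1 is what will later become the backward edge of weight -R in the constraint graph.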


[Fig. 5 illustrates GCG generation: an SFG with inputs a and b, a z^-1 delay, multiplications and output y is combined with the technological library (operator latencies l_1, l_2, l_3) and the IOCG graph. In the legend, v_0 is the system start, a a synchronization point, and b a synchronization and reference point. In the resulting global constraint graph, forward edges carry weights such as 2 + δ(a), a backward edge of weight -R expresses the rate constraint, and a backward edge of weight -l_3 bounds the output.]

Fig. 5. GCG generation.

For this purpose, we introduce timing constraints to model lower bounds (data dependencies and operator latencies) and upper bounds (delay operations) between the start times of two operations. Let s(v_i) be the time at which operation v_i can be initiated. A minimum timing constraint l_ij >= 0 involves s(v_j) >= s(v_i) + l_ij. A maximum timing constraint u_ij >= 0 involves s(v_j) <= s(v_i) + u_ij.

Timing properties of the behavioral model are derived from the physical implementation of the IP core using the technological library. In a HLS flow, execution delays are associated with vertices during the allocation task of hardware resources. For each operation, the delay is expressed as a number of cycles for a given clock cycle period specified by the IP integrator. This delay represents the latency of the operator chosen (during the HLS selection step) to implement the operation. In a first step, an intermediate algorithmic constraint graph ACG(V, E) is generated from the operator latencies and data dependencies expressed in the SFG by introducing a forward edge e_ij in the constraint graph with a positive weight w_ij = l_ij >= 0. A backward edge with negative weight w_ji = -R <= 0, equal to the data rate constraint of the IP core, is introduced to express a delay operation in the constraint graph (see Fig. 5). Each edge hence represents an inequality relationship between the start times of the vertices that must be satisfied in a legal schedule.

The length ρ^f(c) of a path c connecting v_i to v_j in the forward graph ACG^f is defined by ρ^f(c) = Σ_{e_mk ∈ E_c} w_mk − l(v_i), c ∈ C^f_{vi,vj}, where C^f_{vi,vj} represents the set of paths going from v_i to v_j in the forward sub-graph ACG^f. The minimum distance ρ^f(v_i, v_j) is the length of the longest path from vertex v_i to v_j; it represents the minimum delay after the completion of vertex v_i before v_j can begin its execution: ρ^f(v_i, v_j) = MAX_{c ∈ C^f_{vi,vj}} {ρ^f(c)}.

In the same way, we define in the backward graph G^b the length of a path ρ^b(c) and the maximum distance ρ^b(v_j, v_i) as follows:

ρ^b(c) = Σ_{e_mk ∈ E_c} w_mk + l(v_i), c ∈ C^b_{vj,vi}, and
ρ^b(v_j, v_i) = MAX_{c ∈ C^b_{vj,vi}} {ρ^b(c)} = MIN_{c ∈ C^b_{vj,vi}} {|ρ^b(c)|}.
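Since forward weights are positive, the longest-path distance underlying ρ^f can be computed by simple relaxation. A sketch (function and vertex names are ours), assuming a graph without positive cycles so that the iteration converges:

```python
# Hedged sketch: longest-path distances from a source by Bellman-Ford-style
# relaxation; n_iter = |V| - 1 rounds suffice on a positive-cycle-free graph.
def longest_paths(edges, src, n_iter):
    dist = {src: 0}
    for _ in range(n_iter):
        for (x, y), w in edges.items():
            if x in dist and dist.get(y, float("-inf")) < dist[x] + w:
                dist[y] = dist[x] + w
    return dist

fwd = {("v0", "a"): 2, ("a", "b"): 3, ("v0", "b"): 1}
print(longest_paths(fwd, "v0", 3))   # {'v0': 0, 'a': 2, 'b': 5}
```

Note that the longest path wins over the direct edge (v0, b) of weight 1: b cannot start before the chain through a has completed, which is exactly the "minimum distance" reading of ρ^f in the text.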

2.1.4. Constraint graphs merging

At this step, the ACG and IOCG constraint graphs are merged to produce a global constraint graph GCG. This GCG model, which represents in a single graph both the constraints coming from the algorithm and those coming from the I/O data, is used for both constraint analysis and IP design.

Definition. A global constraint graph is an oriented polar weighted graph GCG(V, E) where the vertex set V = {v_0, ..., v_n} represents operations, v_0 and v_n being the source and the sink vertex, respectively. A weight w_ij associated with the edge (v_i, v_j) represents a timing constraint between vertices v_i and v_j.

The merging of the ACG and the IOCG is realized by mapping the set of IOCG vertices and their associated constraints onto the I and O vertex sets of the ACG. Moreover, a minimum timing constraint on an output vertex of the IOCG is transformed into a maximum timing constraint in the GCG. In other words, the earliest date for data transfer in the IOCG model is considered as the latest date for data computation/production in the GCG model. The resulting constraint graph GCG(V, E), where E = E^f ∪ E^b, can be decomposed into two sub-graphs

GCG^f(V, E^f) and GCG^b(V, E^b), called the forward and backward graphs, containing, respectively, only the forward edges, whose weights are positive, and the backward edges, whose weights are negative. We define a subset of vertices, named synchronization points and reference points, that are used for specifying minimum and maximum times or timing intervals of the data transfers. These vertices serve as references for specifying the start times of operations and are used for the consistency analysis that checks whether the constraints define an (in)consistent system.
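The merge rule of Section 2.1.4, in which a minimum constraint on an output transfer becomes a backward (maximum) edge, might be sketched as follows; all names and the helper's shape are ours, not the paper's:

```python
# Hedged sketch: merge IOCG minimum constraints into an ACG.  A minimum
# constraint on an output vertex (earliest transfer date) is flipped into a
# backward edge (latest production date); constraints on inputs are kept.
def merge(acg_edges, iocg_min, outputs):
    gcg = dict(acg_edges)
    for (va, vb), l in iocg_min.items():
        if vb in outputs:
            gcg[(vb, va)] = -l    # flip: s(vb) <= s(va) + l in the GCG
        else:
            gcg[(va, vb)] = l     # keep: earliest date for reading vb
    return gcg

acg = {("a", "y"): 4}             # y computed at least 4 cycles after a
gcg = merge(acg, {("v0", "a"): 0, ("v0", "y"): 10}, outputs={"y"})
print(gcg)
```

Here "y transferred no earlier than 10 cycles after v0" becomes the backward edge (y, v0) of weight -10, i.e. y must be produced within 10 cycles of the start, while the input constraint on a stays forward.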

Definition. The synchronization points of a constraint graph GCG(V, E) consist of the source vertex v_0 and all vertices with variable arrival times, and are denoted S ⊆ I.

Definition. The synchronization point set of a vertex v_i is the subset of synchronization points S(v_i) ⊆ S such that a ∈ S(v_i) if there exists a path in GCG^f(V, E^f) from a to v_i containing at least one variable-weight edge equal to δ(a) ∈ [δ_min(a), δ_max(a)]. In other words, a synchronization point a is in the synchronization point set of v_i if the vertex v_i can begin its execution only after the arrival of the input data a.

Definition. The reference points of a constraint graph GCG(V, E) consist of all vertices used to reference a maximum timing constraint and are denoted Rp ⊆ I.

Definition. The reference point set of a vertex v_i is the subset of reference points Rp(v_i) ⊆ Rp such that a ∈ Rp(v_i) if there exists a path in GCG^b(V, E^b) from v_i to a. In other words, a reference point a is in the reference point set of v_i if the vertex v_i has to begin its execution within a maximum delay specified relative to the arrival of the input data a.

A minimum offset between two vertices a and v_i represents the minimum time after the completion of the synchronization point a before v_i can begin its execution. A maximum offset between two vertices v_i and a represents the maximum time after the completion of the reference point a before v_i has to begin its execution. In other words, the minimum and maximum offsets represent the minimum and maximum delay constraints to produce a result with respect to the arrival time of an input data. Let V_a ⊆ V be the subset of vertices containing the synchronization point a and all its successors. Let GCG^f_a(V^f_a, E^f_a) be the sub-graph induced by V^f_a.

Definition. The minimum offset of a vertex v_j ∈ V^f_a with respect to the synchronization point a is an integer value σ_a(v_j) such that σ_a(v_j) >= σ_a(v_i) + w_ij if there exists an edge of weight w_ij from v_i to v_j in GCG^f_a(V^f_a, E^f_a), and assuming that σ_a(a) is normalized to 0. If σ_a(v_j) is the minimum such value, then it is the minimum offset of v_j with respect to a and is denoted by σ^min_a(v_j) = ρ^f(a, v_j).

We now relate the offset and the start time of a vertex v_j considering a unique synchronization point a: s(v_j) >= s(a) + δ(a) + σ^min_a(v_j). Considering more than one synchronization point, the start time of a vertex v_j is defined as follows: s(v_j) >= MAX_{r ∈ S(v_j)} {s(r) + δ(r) + σ^min_r(v_j)}. In the same way, we define the maximum offset of a vertex v_j with respect to a reference point a as follows:

Definition. The maximum offset of a vertex v_j ∈ V^b_a with respect to the reference point a is an integer value σ^max_a(v_j) = ρ^b(a, v_j) such that σ^max_a(v_j) >= σ^max_a(v_i) + w_ji if there exists an edge of weight w_ji from v_j to v_i in G^b_a(V^b_a, E^b_a), and assuming that σ^max_a(a) is normalized to 0.

We relate the offset and the start time of a vertex v_j considering more than one reference point by the following inequality: s(v_j) <= MAX_{r ∈ Rp(v_j)} {s(r) + δ(r) + σ^max_r(v_j)}. We now identify the synchronization and reference points that can directly affect the start time s(v_j) of a vertex v_j, through a data dependency, by introducing the concept of relevance.

Definition. A relevant forward path, going from a synchronization point a to a vertex v_i, includes exactly one edge with a variable weight depending on δ(a).

Definition. The relevant synchronization point set of a vertex v_i is the subset of synchronization points R(v_i) = {a | a ∈ S such that a relevant forward path between a and v_i exists} in GCG^f(V, E^f).

Definition. A relevant backward path, going from a vertex v_i to a reference point a, includes exactly one edge with a variable weight depending on δ(a).

Definition. The relevant reference point set of a vertex v_i is the subset of reference points RRp(v_i) = {a | a ∈ Rp such that a relevant backward path between v_i and a exists} in GCG^b(V, E^b).

2.2. Analysis step

The IP behavior and IP integration constraints being described in a single formal model, the feasibility and the consistency of the IP integration constraints with respect to the algorithm constraints can now be analyzed.

2.2.1. Feasibility

This analysis certifies that the constraint graph contains no maximum constraint between two vertices v_i and v_j smaller than the longest path from v_i to v_j. The constraint graph is checked for positive cycles to ensure that it is feasible. Feasibility is independent of the input arrival times and can be statically determined as follows:

Theorem 1. A constraint graph ACG(V, E) is feasible if no positive cycle exists in the ACG; otherwise it is unfeasible, i.e. ∃ σ^min_a(v_j) > σ^max_a(v_j) or ∃ ρ^f(v_j, v_j) > ρ^b(v_j, v_j), with v_i, v_j ∈ V.

Similar problems have been met in layout compaction, and the theorem proof can be found for instance in [18].
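The positive-cycle test behind Theorem 1 can be sketched Bellman-Ford style (all names are ours): on longest-path relaxations, a graph without positive cycles stabilizes within |V| - 1 rounds, so any change in round |V| betrays a positive cycle.

```python
# Hedged sketch: detect a positive-weight cycle in a constraint graph.
def has_positive_cycle(vertices, edges):
    dist = {v: 0 for v in vertices}    # virtual super-source to every vertex
    changed = False
    for _ in range(len(vertices)):
        changed = False
        for (x, y), w in edges.items():
            if dist[y] < dist[x] + w:
                dist[y] = dist[x] + w
                changed = True
    return changed   # still relaxing in round |V| => positive cycle

feasible   = {("a", "b"): 3, ("b", "a"): -5}   # cycle weight -2: feasible
infeasible = {("a", "b"): 3, ("b", "a"): -2}   # cycle weight +1: unfeasible
print(has_positive_cycle({"a", "b"}, feasible),
      has_positive_cycle({"a", "b"}, infeasible))   # False True
```

The infeasible pair says "b at least 3 after a" and "b at most 2 after a" at once, which is exactly a maximum constraint smaller than the longest path.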


2.2.2. Consistency

Consistency refers to the dynamic behavior of the constraint graph relative to the behavior of its input data. The following lemma lays out a necessary and sufficient condition to determine the consistency of the maximum constraints considering input data arrival times.

Lemma. A feasible constraint graph GCG(V, E) is well-posed if and only if s(v_j) <= s(a) + δ(a) + σ^max_a(v_j), ∀v_j ∈ V^b, a ∈ RRp(v_j), ∀δ(a) ∈ [δ_min(a), δ_max(a)], i.e. s(b) + δ(b) + σ^min_b(v_j) <= s(a) + δ(a) + σ^max_a(v_j), ∀v_j ∈ V^b, b ∈ R(v_j), a ∈ RRp(v_j).

For the consistency checking it is not necessary to explore all the combinations. The study of a subset is sufficient, as established in the following theorem.

Theorem 2. A feasible maximum constraint is well-posed if and only if s^max(b) + δ^max(b) + σ^min_b(v_j) <= s^min(a) + δ^min(a) + σ^max_a(v_j), with ∀v_j ∈ V^b, ∀a ≠ b, b ∈ R(v_j), a ∈ RRp(v_j).

At this step of the IP design flow we can determine whether a valid solution for the scheduling problem exists. When the GCG is not consistent, the IP synthesis is impossible: the IP integration constraints have to be modified. This leads the system designer to target an alternative solution for the SoC architecture by reviewing the partitioning, redefining the communication architecture and/or redesigning some parts of the system.

2.3. IP hardware synthesis

The entry point of the IP core design task is the GCG presented in Section 2.1. This design step relies on the synthesis of the different IP core architectural parts: processing unit (PU), memory unit (MU), control unit (CU) and communication unit (COMU). The MU synthesis does not directly depend on the GCG modelling but is related to the operation scheduling (see for example [5]). The CU generation is a general problem, described for instance in [10,9]. In this section, we focus on both the processing and COMU design from the GCG model. The HLS tool we use is GAUT [12].
GAUT is a pipelined architectural synthesis tool dedicated to signal and image processing applications under a real-time execution constraint. The input specification language is currently a subset of behavioral C or VHDL. Synthesis leads to the generation of a structural and functional RTL VHDL description of the designed architecture. This VHDL file is a direct input for commercial logic synthesis tools such as Quartus from Altera and ISE/Foundation from Xilinx. In a HLS process, the design of a PU integrates the following tasks [9]: resource selection and allocation, operation scheduling, and the assignment of operations to the various operators (binding). GAUT first executes the selection task, then performs the allocation, and finally executes the scheduling and assignment tasks simultaneously.

During the selection step, hardware resources are selected from technology-specific libraries of components (arithmetic and logic units, registers, multiplexors) where components are characterized in terms of gate count, delay, power consumption, etc. A first allocation consists of choosing the type and the number of operators which satisfy the a priori average parallelism of the application. The scheduling step is based on a classical list-scheduling algorithm. It relies on heuristics in which ready operations (operations to be scheduled) are listed in priority order. An operation can be scheduled if the current cycle is greater than its ASAP (as soon as possible) time. With the GAUT tool, an early scheduling is performed on the GCG. In this scheduling, the priority function depends on the mobility criterion. Mobility is statically computed as the difference between ASAP and ALAP (as late as possible) times. The ASAP time of an operation is computed using the ASAP input data arrival times and the latency of the associated operator. The ALAP time of an operation is computed using the ALAP output deadline and the delay of the associated operator. For operations that have the same mobility, the priority is defined using the operation margin, defined as the difference, in number of cycles, between the current cycle and the operation deadline. An operation deadline is the minimum ALAP time allowed on the outputs that have a data dependency with this operation. Whenever two ready operations need to access the same resource (a so-called resource conflict), the operation with the lower mobility has the highest priority and is scheduled; the other is postponed. When the mobility is equal to zero, a new operator is dynamically allocated to this operation. The assignment is executed simultaneously with the scheduling.
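The mobility-driven list scheduling described above might be sketched as follows; this is a deliberately simplified model of the idea (operation names, the dict layout and the single-latency assumption are ours), not GAUT's actual implementation:

```python
# Hedged sketch: list scheduling where ready operations are ranked by
# mobility = ALAP - ASAP; on a resource conflict the lower-mobility
# operation wins and the others are postponed to the next cycle.
def list_schedule(ops, n_units):
    """ops: dict name -> {'asap', 'alap', 'latency'}; n_units: operators."""
    schedule, cycle, pending = {}, 0, dict(ops)
    while pending:
        ready = [o for o, p in pending.items() if p["asap"] <= cycle]
        ready.sort(key=lambda o: pending[o]["alap"] - pending[o]["asap"])
        for o in ready[:n_units]:          # lowest mobility scheduled first
            schedule[o] = cycle
            del pending[o]
        cycle += 1
    return schedule

ops = {"m1": {"asap": 0, "alap": 0, "latency": 1},
       "m2": {"asap": 0, "alap": 2, "latency": 1},
       "a1": {"asap": 1, "alap": 3, "latency": 1}}
print(list_schedule(ops, n_units=1))
```

With one operator, m1 (mobility 0) wins cycle 0 over m2 (mobility 2), which is postponed; a zero-mobility loser would, in the text's scheme, trigger the dynamic allocation of a new operator instead.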
The optimal assignment of a candidate operation to an available operator aims at minimizing the interconnections between operators (multiplexors, demultiplexors, wires). Resource selection/allocation and operation scheduling can be constrained to limit hardware complexity (i.e. the number of allocated resources) and reach a given computation speed. Each set of supported parameter values and synthesis constraints allows instantiating a different dedicated architecture that will fulfil specific functional requirements and achieve specific performance.

2.3.1. Processing unit

The topology of the PU model (see Fig. 6) is initially based on an elementary cell which includes an arithmetic operator (selection step) and registers associated with the inputs of this operator. After the scheduling of the operations and their assignment to the operators, several operations are mapped onto the same operator (at different cycles). Multiplexors and demultiplexors are thus added to the cell to manage operator re-use and data transfers according to the processing to be executed. This step is performed after an optimization of the number of registers


Fig. 6. IP core units architecture.

(a register can be shared if it is not used during some cycles).

2.3.2. Communication unit

The PU synthesis step generates a set of functional constraints relative to the PU I/O transfer sequences: the production and consumption dates of the I/O data. These sequences constitute the background and the schedule of conditions for the IP COMU synthesis process (see [6] for details about the constraint modelling generated by the PU synthesis step). The generic interface model includes three major components: buffers for storing I/O data (FIFOs, LIFOs and registers), interconnection components (multiplexors, demultiplexors, tristates) and busses (see Fig. 6). The busses included in the interface unit are of two types: external and internal. External busses fall into two categories. The first one is the I/O busses (ports) that represent the physical link between the IP core and the other components of the system; the number of busses of this kind is defined by the integrator in the IOCG model. The second one is the interface-PU busses (COMU2PU), whose number is determined by the synthesis step of the PU. An FSM controller implements the communication protocol defined by the SoC designer and generates the adequate control signals for the data-path COMU components. More details about the COMU and the allocation procedure of the data storage components can be found in [1].

3. Design experiment: the MAP algorithm

In this section, our design methodology is applied to the design of the maximum a posteriori (MAP) algorithm. This error control algorithm is used in digital communication against channel noise, particularly in turbo-decoding [2,8] for the soft-in soft-out (SISO) decoding of a convolutional code. It provides better results for turbo-decoding than the well-known Viterbi algorithm.

3.1. MAP algorithm overview

The MAP algorithm, also called the forward-backward (FB) algorithm, provides for each received symbol uk, k = 0,...,N-1, a soft output Le(uk). uk is composed of two

values: yk and La(uk). The soft estimate Le(uk) is computed by exhaustively exploring all possible paths in the trellis using a forward recursion and a backward recursion. To simplify the hardware implementation, we use the Max-Log-MAP algorithm described in the logarithmic domain [22]. This algorithm consists of three steps (Fig. 7(a)):

- Forward recursion: the forward state metrics Ak are recursively calculated using the symbols in increasing order, from 0 to N-1. They are stored in an internal memory in order to be used to compute the soft output.
- Backward recursion: the backward state metrics Bk are recursively computed using the symbols in decreasing order, from N-1 to 0.
- Soft-output computation: the soft output for each symbol at time k is computed using the backward state metric Bk and the corresponding forward state metric Ak-1 read from the memory.
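The three steps can be sketched as follows for a toy two-state trellis. The trellis, branch metrics and start-state assumption below are illustrative placeholders of ours, not the DVB-RCS code of Section 3; as in the Max-Log-MAP approximation, a max operation replaces the exact log-sum.

```python
import math

# Hypothetical 2-state trellis: NEXT[s][u] is the next state for input bit u.
NEXT = {0: {0: 0, 1: 1}, 1: {0: 1, 1: 0}}

def max_log_map(gamma, n_states=2):
    """gamma[k][s][u]: branch metric for input bit u from state s at time k.
    Returns the soft outputs Le(u_k) = best u=1 path metric - best u=0 one."""
    N = len(gamma)
    NEG = -math.inf
    # Forward recursion (symbols in increasing order); A is stored so that
    # the soft-output step can read the forward metrics back later.
    A = [[NEG] * n_states for _ in range(N + 1)]
    A[0][0] = 0.0                      # assume the encoder starts in state 0
    for k in range(N):
        for s in range(n_states):
            for u in (0, 1):
                s2 = NEXT[s][u]
                A[k + 1][s2] = max(A[k + 1][s2], A[k][s] + gamma[k][s][u])
    # Backward recursion (symbols in decreasing order) fused with the
    # soft-output computation, which combines B_k with the stored forward
    # metric of the preceding trellis stage.
    B = [0.0] * n_states
    Le = [0.0] * N
    for k in range(N - 1, -1, -1):
        best = [NEG, NEG]
        Bnew = [NEG] * n_states
        for s in range(n_states):
            for u in (0, 1):
                s2 = NEXT[s][u]
                best[u] = max(best[u], A[k][s] + gamma[k][s][u] + B[s2])
                Bnew[s] = max(Bnew[s], gamma[k][s][u] + B[s2])
        Le[k] = best[1] - best[0]
        B = Bnew
    return Le
```

With branch metrics that reward a target bit by +1 and penalize the other by -1, each soft output comes out as +2 or -2 with the sign of the corresponding bit.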

3.2. Algorithm architecture matching

In this section, we focus on the FB implementation by matching the algorithm to the architectural constraints (memory, latency, throughput, etc.). Due to its schedule, the backward recursion cannot start before the end of the forward recursion; hence, the maximal latency of the FB algorithm is 2.N. Moreover, N state-metric vectors need to be memorized, which requires a large amount of memory. To reduce both the memory requirements and the latency, a sliding-window (SW) algorithm is used [7]. Using the same graphical representation as in [3], our modified SW-FB algorithm with windows of size L symbols is depicted in Fig. 7. In this figure, the horizontal axis represents time, in units of a symbol period, and the vertical axis represents the received symbols. For each window working on symbols aL to (a+1)L, a forward recursion with memorization is performed from the initial state metric given by the previous window. The backward recursion is then performed on the same window from an initial state metric B(a+1)L given by the previous iteration [3]. The windows are then processed sequentially by a single elementary component, called SubMAP, performing a forward-backward algorithm

Fig. 7. Graphical representation of the (a) FB and the (b) SW-FB algorithms.

Fig. 8. SW-FB scheduling with one (a) SubMAP-H or one (b) SubMAP-V component.

on L symbols. This SW-FB algorithm requires storing L state-metric vectors and its maximal latency is reduced to 2.L symbols. The simplest and most natural solution to implement the SubMAP component consists in using the same symbols aL to (a+1)L for the forward and backward recursions. This component basically implements one forward and one backward processor operating sequentially and is denoted SubMAP-H. With the scheduling represented in Fig. 7b, the SW-FB algorithm requires two SubMAP-H components working in parallel in staggered rows: the first one uses the symbols aL to (a+1)L, while the other uses the symbols (a+1)L to (a+2)L. However, this scheduling requires storing 2.L forward state metrics and duplicating the forward and backward processors. The amount of memory can be halved by using only one SubMAP-H component with the scheduling of Fig. 8a, but the real-time constraint is then no longer met. The timing behavior and IOCG of the SubMAP-H component are shown in Figs. 9 and 10, respectively. Hierarchical vertices are used for the s forward and backward state metrics A0 and BN-1 to express the simultaneous arrival/production dates of these data (s is the number of states of the convolutional code). The 8 forward and backward state metrics A0 and BN-1 are required to start the computations. They are concurrently transmitted in two cycles: 4 state metrics of each type are simultaneously sent to the component per cycle. Weights +1/-1, i.e. minimum timing constraint equal to maximum timing constraint, are used to specify


Fig. 9. Timing behavior of the SubMAP-H component.

an exact timing constraint equal to one cycle between vertices. The values yk and La(uk) of the received symbols are concurrently read on input ports 1 and 2, respectively. The first result of Le can only be produced after the arrival of the last inputs of y and La. In order to minimize the latency of the SubMAP-H component, the relative latency has been set to one cycle. Le and the new metrics AN-1 and B0 are then sequentially provided on port 3. In order to overcome the problem of memory duplication and to meet the real-time constraint, the elementary SubMAP is modified by specifying the parallel execution of the two recursions (SubMAP-V). It implements the algorithm by working on two consecutive windows, as shown in Fig. 8b: the forward recursion works on the current window with the symbols aL to (a+1)L, while the backward recursion works on the previous window with the symbols (a-1)L to aL. In this case, the backward recursion is performed on data stored in the internal memory of the SubMAP-V component.
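The window schedules discussed above can be sketched with two small models. Both are illustrative assumptions of ours, not the actual hardware: the first simulates the state-metric storage of the SW-FB schedule, showing that only L vectors are live at any time and that they are consumed in LIFO order; the second enumerates the per-window-period activity of the SubMAP-V pipeline of Fig. 8b.

```python
def sw_fb_storage(N, L):
    """Simulate state-metric storage for the SW-FB schedule: each window's
    forward recursion pushes one vector per symbol and the backward
    recursion pops them in reverse (LIFO) order. Returns the peak number of
    stored vectors: L, instead of N for the plain FB algorithm."""
    assert N % L == 0
    stack, peak = [], 0
    for a in range(N // L):
        for k in range(a * L, (a + 1) * L):              # forward, increasing k
            stack.append(k)
            peak = max(peak, len(stack))
        for k in range((a + 1) * L - 1, a * L - 1, -1):  # backward, decreasing k
            assert stack.pop() == k                      # LIFO access order
    return peak

def subMAP_V_schedule(num_windows):
    """Per window-period activity of the SubMAP-V pipeline (Fig. 8b): the
    forward recursion processes window a while the backward recursion works
    on window a-1 from internal memory. Yields (forward, backward) window
    indices; None marks the init/end transients of the pipeline."""
    for t in range(num_windows + 1):
        fwd = t if t < num_windows else None
        bwd = t - 1 if t >= 1 else None
        yield (fwd, bwd)
```

For N = 4L the pipeline spans five window periods, matching the init, stream-processing and end phases sketched in Fig. 8b.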


Fig. 10. IOCG of the SubMAP-H component.

Fig. 11. Timing behavior of the SubMAP-V component.

The timing behavior of the SubMAP-V component is depicted in Fig. 11. The required latency of the SubMAP-V component corresponds to the delay of the critical path to produce the first soft output: 4 clock cycles.

3.3. Synthesis results


In this section, we present the results of the synthesis under constraints obtained using the HLS tool GAUT. The convolutional code used for this experiment is the duo-binary circular recursive systematic convolutional code of the DVB-RCS standard (Digital Video Broadcasting - Return Channel Satellite) [8]. The trellis is composed of s = 8 states with 4 branches leaving each state. The SW-FB is implemented with L = 32.

3.3.1. First experiment

Fig. 12 represents the IOCG of the MAP algorithm with a SubMAP-V component. The input data timing constraints are the same as in Fig. 11, and the output data order is also the same. Notice that the output production dates can now be specified relative to the vertex v0 that represents the start time. The source file describing the algorithm uses 128 lines of behavioral VHDL code to specify the entire SubMAP-V computation. The SFG includes 95 210 edges and 44 048 vertices, divided into 23 168 operations and 20 880 data. Table 1 presents the SFG operation vertices classified by function type. In a first step, an ACG has been generated from the operator latencies and the data dependencies expressed in the SFG. A GCG has then been generated by merging the ACG with the IOCG. We finally checked, at the consistency analysis step, that a valid solution to the scheduling problem exists for the required relative latency.
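The consistency analysis can be viewed as a feasibility check on a system of difference constraints. In the usual formulation (our sketch below, not GAUT's actual implementation), each edge u -> v with weight w encodes t(v) - t(u) <= w, a minimum delay t(v) - t(u) >= c becomes a reverse edge of weight -c (the +1/-1 weight pairs of the IOCG), and the system is feasible iff the constraint graph contains no negative-weight cycle, which Bellman-Ford relaxation detects.

```python
def consistent(vertices, edges):
    """edges: list of (u, v, w) meaning t(v) - t(u) <= w.
    Returns True iff some assignment of dates satisfies all constraints,
    i.e. the constraint graph has no negative-weight cycle."""
    dist = {v: 0 for v in vertices}      # implicit virtual source at distance 0
    for _ in range(len(vertices)):
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    # One extra relaxation round: any remaining improvement implies a
    # negative cycle, i.e. contradictory timing constraints.
    return all(dist[u] + w >= dist[v] for u, v, w in edges)

# "Exactly one cycle between a and b" is the +1/-1 weight pair of the IOCG:
# t(b) - t(a) >= 1  (edge b -> a, weight -1)  and  t(b) - t(a) <= 1.
ok = consistent(["a", "b"], [("b", "a", -1), ("a", "b", 1)])
# Contradictory pair: t(b) - t(a) >= 2 together with t(b) - t(a) <= 1.
bad = consistent(["a", "b"], [("b", "a", -2), ("a", "b", 1)])
```

The toy graphs above stand in for the merged GCG; on the real graph the same check tells the designer whether the required relative latency admits any schedule at all.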

Fig. 12. IOCG of the SubMAP-V component.

The VHDL RTL file obtained after the synthesis includes 2.4M lines (20 thousand times more than the source code). The complete generation took only 35 min on a 900 MHz SunBlade 2000 with 1 GB of RAM running Solaris 8. This time breaks down into 4 min to parse, compile and generate the intermediate representation ACG, 2 min for the scheduling, and 29 min to generate the architecture and its respective VHDL files. Notice that most of the time was spent in the RTL file generation. The RTL description has been simulated using Modeltech's ModelSim 5.5c simulator and functional validation has been completed by comparing the RTL results with those of the C model. The design results are shown in Table 2. We have targeted the DVB-RCS 50 Mbits/s data rate using a 0.18 μm CMOS technology. The internal clock frequency is 200 MHz. The latency of the adder, the subtractor and the comparator is less than one cycle, which leads to a state-metric computation of four cycles. The final architecture includes 64 inverters, 96 adders, 32 subtractors and 52 comparators for the arithmetic operations. The FSM controlling the PU includes 144 states and 8-bit-wide instructions. The generated RTL architecture includes 996 registers: this corresponds to an average of 3 registers per operator (our architectural model, Section 2.3.1, is based

Table 1
SFG composition

Inv.    Add.     Sub.   Comp.   Data vertices
8448    11 520   2304   896     20 880

Table 4
SubMAP-V under constraint synthesis results

FSM states   Inv.   Add.   Sub.   Comp.   Registers UT
996          8      16     8      12      1172

Table 2
SubMAP-V under constraint synthesis results

FSM states   Inv.   Add.   Sub.   Comp.   Registers UT
144          64     96     32     52      996

Table 3
Awaited amount of operators for SubMAP-V

Computation   Inv.   Add.   Sub.   Comp.
BMC           4      32     -      0
Forward       -      32     16     24
Backward      -      32     16     24
Extrinsic     8      32     -      28
Total         12     96     32     72

on one register for each input of an operator and, with the MAP algorithm, each operator has at least two inputs). Table 3 depicts the awaited amount of operators for the SubMAP-V component (an a priori manual estimate derived from the computations to be realized). The numbers of components given by the synthesis are in accordance with our estimations. Fewer comparators and far more inverters are required by our architecture, because of component reuse and of the lack of reduction of regular expressions, respectively.

3.3.2. Second experiment

In a second experiment, using the same source file, we have designed a MAP component targeting a minimum architecture cost (amount of operators) while keeping the previously described I/O sequence order. The synthesized component performs at a 4 Mbits/s data rate. The design results shown in Table 4 describe an architecture that uses 8 inverters, 16 adders, 8 subtractors and 12 comparators. The flattened FSM that drives the PU includes 996 states with 166 different instruction words. The generated RTL architecture uses 1172 registers. This amount of registers comes from the operation serialization, which requires memorizing the state-metric values.

4. Conclusion

In this paper, we proposed a design approach for hardware IP components under timing and I/O data ordering constraints. This approach relies on the formal modelling of IP integration constraints based on an IOCG graph describing external transactions in terms of data structures, transfer modes and timing properties. An analysis step that checks the feasibility and the consistency of the IP integration constraints with respect to the algorithm constraints has been presented. This analysis is essential and can lead the SoC designer to target an alternative solution for the SoC architecture by reviewing the partitioning, modifying some design parts and refining or re-defining the communication features. The approach also relies on layered synthesis techniques that allow an optimized design of the hardware component. Experimental results in the DSP domain show the interest of the methodology and of the formal modelling, which allow architectures to be synthesized from a single algorithmic specification under various I/O timing constraints.

References

[1] A. Baganne, J.L. Philippe, E. Martin, A formal technique for hardware interface design, IEEE Trans. Circuits Systems 45 (5) (1998).
[2] C. Berrou, A. Glavieux, P. Thitimajshima, Near Shannon limit error-correcting coding and decoding: turbo codes, in: Proceedings of the ICC'93, Geneva, Switzerland, 1993, pp. 1064–1070.
[3] E. Boutillon, et al., VLSI architectures for the MAP algorithm, IEEE Trans. Commun. 51 (2) (2003).
[4] Codesimulink, http://polimage.polito.it/groups/codesimulink.html
[5] G. Corre, E. Senn, N. Julien, E. Martin, Memory accesses management during high level synthesis, in: Proceedings of the IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), Stockholm, Sweden, 2004.
[6] P. Coussy, A. Baganne, E. Martin, A design methodology for IP integration, in: Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), Scottsdale, Arizona, USA, 2002.
[7] H. Dawid, H. Meyr, Real time algorithms and VLSI architectures for soft-output MAP convolutional decoding, in: Proceedings of PIMRC'95, vol. 1, 1995, pp. 193–197.
[8] C. Douillard, et al., The turbo code standard for DVB-RCS, in: Proceedings of the Second International Symposium on Turbo Codes and Related Topics, September 2000.
[9] J.P. Elliott, Understanding Behavioral Synthesis: A Practical Guide to High-Level Design, Kluwer Academic Publishers, Dordrecht, 2000.
[10] D. Gajski, et al., High-Level Synthesis: Introduction to Chip and System Design, Kluwer Academic Publishers, Dordrecht, 1991.
[11] D. Gajski, et al., SpecC: Specification Language and Methodology, Kluwer Academic Publishers, Dordrecht, 2000.
[12] GAUT tool, http://lester.univ-ubs.fr/tools/gaut/tool.htm, 2004.
[13] D. Hommais, F. Petrot, I. Auge, A practical toolbox for system level communication synthesis, in: Proceedings of the Ninth IEEE International Symposium on Hardware/Software Co-design (CODES), Copenhagen, Denmark, 2001.
[14] International Technology Roadmap for Semiconductors, 2001. (http://public.itrs.net).

[15] M. Jersak, R. Ernst, Enabling scheduling analysis of heterogeneous systems with multi-rate data dependencies and rate intervals, in: Proceedings of the IEEE International Design Automation Conference (DAC), 2003.
[16] M. Keating, P. Bricaud, Reuse Methodology Manual for System-on-a-Chip Design, third ed., Kluwer Academic Publishers, Dordrecht, 2003.
[17] D. Ku, G. De Micheli, Relative scheduling under timing constraints: algorithms for high-level synthesis of digital circuits, IEEE Trans. CAD/ICAS 11 (1992) 696–718.
[18] Y. Liao, C. Wong, An algorithm to compact VLSI symbolic layout with mixed constraints, IEEE Trans. CAD/ICAS 2 (1983) 62–69.
[19] G. Nicolescu, S. Yoo, A. Bouchhima, A.A. Jerraya, Validation in a component-based design flow for multicore SoCs, in: Proceedings of the IEEE International Symposium on System Synthesis (ISSS), Kyoto, Japan, 2002.
[20] Open SystemC Initiative (OSCI), SystemC Version 2.0 User's Guide, Technical Report, 2001.
[21] L. Reyneri, F. Cucinotta, A. Serra, L. Lavagno, A hardware/software co-design flow and IP library based on Simulink, in: Proceedings of the Design Automation Conference (DAC), 2001.
[22] P. Robertson, et al., A comparison of optimal and sub-optimal decoding algorithms in the log domain, in: Proceedings of ICC, 1995.
[23] J. Rowson, A. Sangiovanni-Vincentelli, Interface-based design, in: Proceedings of the IEEE International Design Automation Conference (DAC), 1997, pp. 178–183.
[24] J. Ruiz-Amaya, J.M. de la Rosa, et al., MATLAB/SIMULINK-based high-level synthesis of discrete-time and continuous-time ΣΔ modulators, in: Proceedings of Design Automation and Test in Europe (DATE), 2004.
[25] G. Savaton, P. Coussy, E. Casseau, E. Martin, A methodology for behavioral virtual component specification targeting SoC design with high-level synthesis tools, in: Proceedings of the Forum on Design Languages (FDL), Lyon, France, 2001.

Philippe Coussy, born in 1974, is currently an Associate Professor at the University of South Brittany (UBS). He received his Ph.D. degree in Electronics from UBS in 2003 and his M.S. degree in Computer Architecture from the University Paris 6 (ASIM/LIP6/UPMC) in 1999. He now works at the LESTER Lab.; his research interests include high-level synthesis, communication refinement and synthesis, IP core reuse, SLD languages, design methodologies for complex embedded systems and SoCs, and CAD tools.


Emmanuel Casseau received his Ph.D. degree in Electrical Engineering from the UBO University, Brest, France, in 1994 and his M.S. degree in Electronics in 1990. He worked at the French National Telecom School, ENST Bretagne, France, where he developed high-speed Viterbi decoders. He is currently an Associate Professor in the Electronics Department of the Université de Bretagne Sud, Lorient, France. He is also in charge of the IP project of the LESTER Lab. His research interests include system design, high-level synthesis, virtual components and SoCs.

Pierre Bomel, born in 1962, has been a Research Engineer at LESTER, Lorient, France, since 2000. He was an IC design Engineer and Project Manager at MATRA-MHS from 1992 to 1999. He received the M.S. degree in Software Engineering from the Joseph Fourier University, Grenoble, France, in 1985. His research interests include Globally Asynchronous Locally Synchronous (GALS) architectures, communication synthesis, hardware/software co-design of embedded systems, and CAD tools.

Adel Baganne, born in 1968, is presently an Associate Professor at the UBS University and a member of the LESTER Lab. He received his Ph.D. degree in Signal Processing and Telecommunications from the University of Rennes, France, in 1997 and the Engineer degree in Electronics from the National Superior Engineering School in Angers (ESEO), France, in 1993. His research interests include communication synthesis, co-design, co-simulation, computer architecture, VLSI design and CAD tools.

Eric Martin, born in 1961, is a Full Professor at the University of South Brittany in Lorient, where he is the director of the LESTER laboratory. His interests are in advanced electronic design automation dedicated to real-time signal-processing applications: system specification, high-level synthesis, IP reuse, low-power design, SoC and platform prototyping.