Hardware implementations of software programs based on hierarchical finite state machine models



Computers and Electrical Engineering 39 (2013) 2145–2160


Valery Sklyarov *, Iouliia Skliarova

Department of Electronics, Telecommunications and Informatics/IEETA, University of Aveiro, Aveiro 3810-193, Portugal

Article info

Article history: Received 7 January 2013 Received in revised form 20 July 2013 Accepted 22 July 2013 Available online 19 August 2013

Abstract

Advances in microelectronic devices have dissolved the boundary between software and hardware. Faster hardware circuits that enable significantly greater parallelism to be achieved have encouraged recent research efforts into high-performance computation in electronic systems without the direct use of processing cores. Standard multi-core processors undoubtedly introduce a number of constraints, such as pre-defined operand sizes and instruction sets, and limits on concurrency and parallelism. This paper suggests a way to convert methods and functions that are defined in a general-purpose programming language into hardware implementations. Thus, conventional programming techniques such as function hierarchy, recursion, passing arguments and returning values can be entirely implemented in hardware modules that execute within a hierarchical finite state machine with extended capabilities. The resulting circuits have been found to be faster than their software alternatives and this conclusion is confirmed by numerous experiments in a variety of application areas.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

The development of software and hardware is becoming more and more interrelated. The emphasis has shifted significantly from general-purpose to application-specific products in the form of embedded processing modules in areas such as communications, industrial automation, automotive computers, and home electronics. To support application-specific computations, a number of new engineering solutions and technological innovations have been proposed. There is a tendency to integrate on a single chip components that not so long ago were separate and implemented as autonomous ASICs (application-specific integrated circuits) or ASSPs (application-specific standard products). A few years ago, individual ASICs/ASSPs were assembled together with the surrounding logic, often implemented in autonomous FPGAs (field-programmable gate arrays); today all these components are coupled within the same micro-chip. For example, the Zynq-7000 [1] extensible processing platform (EPP) incorporates a processing system (PS) that combines the industry-standard ARM dual-core Cortex™-A9 32-bit RISC processor and a number of peripherals such as memory controllers, USB (universal serial bus), Gigabit Ethernet, and UART (universal asynchronous receiver/transmitter). The same micro-chip contains a built-in gate array (programmable logic – PL) from the Artix-7 or Kintex-7 FPGA families that is linked with the PS through the AXI (advanced extensible interface). EPPs like Zynq [1] can run software that interacts with parallel processing elements (PEs) that have been mapped to hardware. The main objective of any PE is to provide greater performance than an equivalent software component with similar functionality that is typically composed of a set of functions in C, or methods in Java. The relative effectiveness

Reviews processed and recommended for publication to Editor-in-Chief by Associate Editor Dr. Rene Cumplido.

* Corresponding author. Tel.: +351 234401539.

0045-7906/$ - see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.compeleceng.2013.07.019

(e.g. performance) of software modules that have been mapped to hardware PEs needs to be tested, analyzed and compared. Thus, it is important to be able to create the functionality of typical software constructions directly in hardware circuits. This paper addresses the provision of modularity, hierarchy (including recursion), and parallelism in hardware.

Modularity and hierarchy are very widely used techniques in general-purpose programming. They are supported by the majority of application-specific development systems for the design of software in single/multi-core autonomous and built-in microcontrollers, mainly originating from specifications in C, and less frequently in Java. In many practical cases, there is a need for hardware accelerators to achieve higher performance by parallelizing the most critical parts of the programs in hardware circuits. Thus, mapping such processor-intensive software fragments to hardware by exploiting potential parallelism becomes very important. There are many known methods that allow modularity, hierarchy and parallelism to be realized in hardware and a survey of some of these is presented in [2]. Our approach is based on a hierarchical finite state machine (HFSM) model, which is less constrained than potential alternatives [2], can easily be implemented in hardware, and is very consistent with the corresponding software technique. The model is also supported by known templates that are fully synthesizable [2] in commercial computer-aided design (CAD) systems. The next sections provide additional details of the model, review and compare existing alternatives, and explicitly indicate the innovations proposed in the paper through the following contributions:

1. A new HFSM model with datapath based on optimized stacks built from memory blocks (see Sections 4 and 5).
2. A regular technique permitting the values of signals (as well as pointers) to be supplied to the invoked HFSM modules and the returned values (pointers) to be accessed after terminating the modules, which makes it easier to generate hardware from software procedures (see Sections 4 and 5).
3. Concurrent execution of HFSM modules permitting broad parallelism to be supported (see Section 6).
4. Examples of practical applications clearly demonstrating the benefits that are gained from the proposed extended HFSM capabilities through experiments and comparisons (see Section 7).

The remainder of the paper is organized in seven sections. Section 2 analyzes the related work aimed at the acceleration of software through hardware, but not necessarily based on the technique proposed in this paper. Section 3 describes the HFSM model and demonstrates how this enables software methods and functions to be mapped to electronic circuits. Special attention is paid to the advantages and distinctive features of this model. Section 4 discusses the tradeoffs between software and hardware and presents in detail the proposed novel technique for converting methods and functions from general-purpose languages to hardware. Section 5 suggests a method for the optimization of HFSM memory. Section 6 considers regular solutions for parallel implementations. Section 7 is dedicated to practical applications, experiments, and comparisons. The conclusion is given in Section 8.

2. Related work

Combining the capabilities of software and hardware permits many characteristics of existing applications to be improved. The earliest work on this was done at the University of California in Los Angeles [3]. The idea was to create a computer with a fixed + variable structure by augmenting a standard processor with an array of reconfigurable logic, assuming that this logic can be utilized to solve some processor tasks faster and more efficiently. Such a combination of the flexibility of software and the speed of hardware was considered to be a new way to evolve higher performance computing from any general purpose computer. The level of technology in 1959–1960 prevented this idea from being put into practice. Today a very similar technique has been implemented on a chip that combines multi-core processors, embedded blocks, and advanced reconfigurable logic. For example, the Xilinx Zynq xc7z020 EPP [1] permits the implementation and testing of: (1) systems requiring the development of software and invoking the on-chip PS; (2) application-specific hardware in PL using embedded blocks such as DSPs (digital signal processors) and memories, and arbitrary logic composed of FPGA slices; and (3) a fixed + variable structure computational system combining the PS and the PL with high-speed data exchange between them.

Let us discuss a potential scenario for interactions between the PS and the PL. The PL implements a set of modules that are activated from the PS. A module is a hardware circuit that executes a dedicated task. It is an entity in a hardware description language (HDL) such as VHDL that potentially invokes other entities. Data exchange is provided either directly between the PS and the PL, or through a shared window in memory that is accessed from both the PS and the PL. As soon as the PS needs to initiate accelerated operations, it sends a request to the PL and either transfers data associated with the operations to the PL, or indicates an address and size for the shared memory area in which the data are stored. The PL executes the operations and informs the PS as soon as the operations have been completed. Finally, the results are transferred back, either directly or through the shared memory window. Clearly the PS and the modules implemented in the PL can work in parallel.

To accelerate software running on the PS, we need to be able to replace time-consuming software procedures (functions in C) with functionally equivalent hardware modules that are faster and to apply parallelism and pipelining. Thus, we need fast mechanisms that enable selected software functions to be converted to hardware modules that execute exactly the same operations as the software functions, but faster. This problem has been widely investigated and several common techniques have been applied. A direct approach is to take an entire software program and apply an automatic conversion to hardware. The program can be written in a general-purpose language (GPL) (most often in


C), or in a system-level specification language (SLSL), which is a modified GPL that allows hardware-targeted constructions such as variable operand sizes and parallelism to be described more efficiently. For example, Impulse-C and Handel-C add explicit constructs to the C language that specify parallelism and timing in order to create hardware. A compiler [4] generates circuits from high-level languages that support a modular bottom-up construction. It takes a subset of C without the addition of explicit parallelism and produces hardware accelerators. An alternative technique is used in such toolsets as Catapult-C and Dime-C. These target particular platforms and optimize selected subsets of C to benefit from their implementation in hardware.

The tools referenced above are undoubtedly effective and they have been widely used in engineering practice. They are commonly applied at the register-transfer level (RTL). Optimization of control-dominated applications that involve sophisticated finite state machines (FSMs) is limited. For example, in Celoxica Handel-C, FSMs were realized in a very simple way and such features as hierarchy and recursion in the FSMs were not supported. The paper [5] demonstrates that the effective use of FPGA dedicated resources allows clock rates to be increased and the microchip area to be reduced significantly. The proposal in [5] was to apply application-targeted HDL constructs. Optimization techniques are presented with implementation examples and the corresponding quantitative performance evaluation. In most cases a 50% reduction in chip area was achieved with a simultaneous speed-up. Thus, we need not only a conversion technique, but also methods that enable FPGA resources (look-up tables – LUTs, flip-flops, embedded blocks) to be used efficiently.

The paper [6] demonstrates that although directly constructed models based on HDL provide cycle-accurate performance estimates, these models are very slow. The SLSL SystemC has been used to enhance model performance. Simulation can be done in one (high-level) language and the final synthesizable implementation is done in HDL. For such purposes, an exact transformation is required from the high-level functions to the lower-level modules and vice versa.

There have been many efforts to convert from software to hardware locally, i.e. to convert a selected piece of code, which is either the most time consuming or needs to be implemented in hardware for some reason, while targeting a particular area, such as communications. For example, in [7] a method called homogeneous co-simulation was proposed, where hardware and software were modeled in VHDL. This method was applied to a typical communication system. Software was converted to hardware and vice versa. The technique [8] is applied to hardware/software interfaces and it is also application-targeted. The paper [9] claims that FPGAs achieve a significant speedup over microprocessors and their configurability offers an advantage over traditional ASICs. However, FPGAs do not enjoy high-level language programmability, as microprocessors do, which is the main obstacle. The compiler that is described generates circuits from C source code to execute on FPGAs. A comparison done in [10] for two applications (video processing algorithms that analyze the motion of objects and object features within a scene, and a wireless communication receiver that includes classical communications blocks) shows that the synthesis of FPGA-based circuits can be done from high-level tools but it produces less optimized (by a factor of at least 2–3 times) hardware circuits.

A number of publications are dedicated to hardware/software co-design for particular architectures. For example, in [11] a dynamic system reconfiguration for networks-on-chip (NoC) is proposed, which involves databases (that need to be extended for new applications) from which mapping of NoCs is provided. The database includes existing component implementations composed of hardware and software modules. This differs from the methods considered in this paper, which allow hardware modules to be constructed formally from the existing software functions. However, the constructed modules may extend the library [11], so the proposed and the existing methods complement each other.

The following general conclusions can be drawn from the analysis presented above:

1. We will consider methods that permit the direct conversion of software procedures (such as C functions) to hardware. This means that we will replicate mechanisms used in software procedures, taking advantage of faster implementations in hardware (i.e. if a hierarchy/recursion is used in a software function, the corresponding hardware module will use the same hierarchy/recursion). This approach differs from all the alternatives described above and enables rapid conversion in either direction (i.e. from software to hardware and vice versa).
2. We will apply parallelism and pipelining where explicitly identified as possible, i.e. the software designer explicitly indicates functions that can be executed concurrently and specifies under which conditions. For example, in traversing N-ary trees, different branches may be indicated to be processed in parallel, or a sorting network is requested to be pipelined (many examples can be found in [12]). This approach differs from the methods described above and gives more flexibility to the designers.
3. We allow run-time changes of the hardware modules with the aid of the methods described in [13]. This approach is easily applied to the technique proposed; it is either not valid or very limited for the methods referenced above.
4. The proposed technique is especially beneficial for software/hardware co-design and experiments in devices such as the Zynq EPP (some examples will be given in Section 7). This is because the "try, test and compare" approach may be used directly. Indeed, some software functions can be parallelized and replaced with hardware modules without modifying the rest of the software. The results can then be compared and conclusions drawn.

3. Hierarchical finite state machines

The HFSM model was proposed in [14]. The model was realized in hardware and successfully tested in a number of industrial products. Further improvements were made and consequently new practical applications have been implemented,


tested and evaluated. Theoretical and practical issues of HFSMs have been analyzed in [15–17] and taken as a comparison base for further application-specific improvements [18]. It is important that HFSMs can be used at different levels both in hardware and software, for example for local control in [19] and in the implementation of relatively complex embedded systems. Statecharts [20] specifications are also applicable to HFSMs, and they were adapted for object-oriented programming and used as part of a unified modeling language (UML).

The main objective of this paper is to develop an approach to the synthesis of digital circuits and systems from software functions/procedures in hardware modules that are executed hierarchically (also allowing recursive invocations) and, if necessary, in parallel. Since the behavior of software functions/procedures is often described by flow charts, we will use a similar specification in the form of hierarchical graph-schemes (HGSs) with the following formal description (see Fig. 1).

An HGS is a directed connected graph containing rectangular (Fig. 1a), rhomboidal (Fig. 1b), and triangular (Fig. 1c) nodes. Each HGS has one entry point, which is a rectangular node named Begin (Fig. 1d) and one exit point, which is a rectangular node named End (Fig. 1e). Other rectangular nodes contain either micro instructions (Fig. 1f) or macro instructions (Fig. 1g) or both (Fig. 1h). We will also allow micro instructions to be assigned to the nodes Begin and End if required. Any micro instruction Yj (Fig. 1f) includes a subset of micro operations from the set Y = {y1, ..., yN}. A micro operation is an output binary signal. Any macro instruction Zk incorporates a subset of macro operations from the set Z = {z1, ..., zQ} (Fig. 1g). Each macro operation is described by another HGS of a lower level called a module. If a macro instruction includes more than one macro operation then these macro operations have to be executed in parallel (Fig. 1i). Each rhomboidal node contains one element from the set X ∪ H, where X = {x1, ..., xL} is the set of logic conditions, and H = {h1, ..., hI} is the set of logic functions. A logic condition is an input signal, which communicates the result of a test. Each logic function is calculated by performing a predefined set of sequential steps that are described by an HGS (a module) of a lower level. Directed lines (arcs) connect the inputs and outputs of the nodes in the same manner as for an ordinary graph-scheme. Each triangular node contains an expression which can produce a set of one-hot values associated with the outputs of this node. As soon as the control flow passes a triangular node, exactly one output must be selected, enabling the control flow to proceed (see examples in Fig. 1c and j). The output of a rectangular node k with more than one element zi, zj, ... from the set Z is called a merging point (Fig. 1i). Control flow passes the merging point if and only if all the elements zi, zj, ... have been completed. This means that a node following the node k is only activated after all the macro operations zi, zj, ... have terminated.

Using HGSs enables any complex control algorithm to be developed step by step, concentrating the efforts at each stage on a specified level of abstraction. Each separate HGS (i.e. module) can be tested independently. It is known that a set of HGSs can be implemented in an HFSM with stack memory, which permits the execution of hierarchical algorithms. We will skip the formal mathematical definition and will describe the HFSM model informally.

Let x1, ..., xL/y1, ..., yN be sets of input/output signals. Structurally, an HFSM contains one or two stacks. In the case of two stacks, one of them (FSM_stack) keeps states and the other (M_stack) enables transitions between modules to be done. Any module is considered to be either an FSM or an HFSM. The stacks are managed by a circuit (C) that is responsible for new module invocations and state transitions in active modules that are designated by the outputs of the M_stack. Since each particular module has a unique identification code, the same HFSM states can be repeated in different modules. Any non-hierarchical (conventional) transition is performed through the change of a code only on the top register of the FSM_stack (see the corresponding mark in Fig. 2). Any hierarchical call activates a push operation and alters the states of both stacks in such a way that the M_stack will store the code for the new (called) module and the FSM_stack will be set to an initial state of the called module (see the corresponding mark in Fig. 2). Any hierarchical return just activates a pop operation without any change in the stacks (see the corresponding mark in Fig. 2). As a result, a transition to the state following the state where the terminated module was called will be executed. The stack pointer is common to both stacks. In the HFSM with datapath explored here, the circuit C has an RTL structure (see Fig. 2), enabling operations of high-level languages to be mapped either directly or in a slightly altered manner and consequently to be executed in hardware. The model depicted in Fig. 2 possesses the following advantages:

• It does not have the limitations that exist for processing cores, such as the constrained size of operands, a predefined set of instructions, limited parallelism, and the impossibility of fast combinational operations.
• It is entirely synthesizable.


Fig. 1. Nodes of a hierarchical graph-scheme.

[Fig. 2, block diagram: circuit C with inputs x1, ..., xL and outputs y1, ..., yN; M_stack (current module / new module); FSM_stack (current state / next state); control signals push, pop, clock, reset; RTL block: a set of registers and other components for count, shift, compute, accumulate, etc., with inputs (operations) and outputs (flags/states).]

Fig. 2. HFSM, which provides support for hierarchy and recursive computations.

• It implements hierarchy (including potential recursion) faster than in software [21,22], i.e. a smaller number of clock cycles is required.

4. From software to hardware

It is known that hierarchy and recursion are extremely powerful problem-solving techniques supported by HFSMs. This paper presents further improvements and describes enhanced models of HFSMs that allow different types of arguments to be passed to hardware modules and to be returned from the modules, much as in software programs. Let us consider some examples. The following Java code (where the method gcd is called recursively) finds the greatest common divisor of four positive integers A, B, C, and D:

    public static int gcd(int A, int B, int C, int D) {
        return gcd(gcd(A, B), gcd(C, D));
    }

    public static int gcd(int A, int B) {
        if (B > A) return gcd(B, A);
        else if (B == 0) return A;
        else return gcd(B, A % B);
    }
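As a small worked illustration, the two Java methods can be mirrored in C. This is only a sketch: C has no overloading, so the names gcd2 and gcd4 are assumptions introduced here, and unsigned operands are used because the inputs are stated to be positive.

```c
/* Two-argument greatest common divisor, following the Java method gcd(A, B). */
unsigned gcd2(unsigned a, unsigned b)
{
    if (b > a)
        return gcd2(b, a);      /* swap so that the first operand is not smaller */
    else if (b == 0)
        return a;               /* termination: gcd(a, 0) = a */
    else
        return gcd2(b, a % b);  /* recursive step on the remainder */
}

/* Four-argument version, following the Java method gcd(A, B, C, D). */
unsigned gcd4(unsigned a, unsigned b, unsigned c, unsigned d)
{
    return gcd2(gcd2(a, b), gcd2(c, d));
}
```

For the operands 12, 18, 24 and 30, the two inner calls each reduce to 6 (through the chains 18, 12 then 12, 6 then 6, 0 and 30, 24 then 24, 6 then 6, 0), and the outer call on 6 and 6 again yields 6.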

Fig. 3 depicts HGSs for the gcd method; the HGSs look similar to the corresponding flow charts. The first HGS (Fig. 3a) sequentially executes modules Z21 (A, B), Z21 (C, D) and Z21 (R1, R2), where R1/R2 are the results returned by the modules Z21 (A, B)/Z21 (C, D). The module Z21 calls the module Z22 (A, B) (see Fig. 3b), which returns the remainder after division of A by B (i.e. A%B). In Section 6 we will show how modules (such as Z21 (A, B) and Z21 (C, D)) can be executed in parallel. An HGS is a synthesizable specification [22]. However, we need HFSM modules with arguments and returned values to be synthesized. To provide this functionality in hardware it is essential to: (a) implement recursive calls of the function gcd with different numbers of arguments (see the modules Z41 and Z21 that will be explained later) and (b) return values (of type int for our particular example). The following C code (where the function treesort is called recursively) constructs and returns a sorted list from a binary tree (such as that studied in [22]):

Fig. 3. Description of Java methods by sequentially executed HGSs.


    #include <stdlib.h> /* malloc, NULL */

    ValueAndCounter *treesort(treenode *node) /* node is a pointer to the root of the tree */
    {
        ValueAndCounter *tmp;                 /* tmp is a temporary pointer to a list item */
        static ValueAndCounter *ttmp = NULL;  /* at the beginning the list is empty */
        if (node != NULL) {                   /* if the node exists */
            treesort(node->lnode);            /* sort left sub-tree */
            tmp = malloc(sizeof *tmp);        /* allocate memory for a new list item tmp */
            tmp->next = ttmp;                 /* store pointer to the previous list item */
            tmp->val = node->val;             /* save the value */
            tmp->count = node->count;         /* save the number of repetitions of the value node->val */
            ttmp = tmp;                       /* extend the list */
            treesort(node->rnode);            /* now sort right sub-tree */
        }
        return ttmp;                          /* the list built so far (unchanged for an empty sub-tree) */
    }

Any tree node has the following structure:

    typedef struct treenode {
        int val;                 /* value of an item of type int */
        int count;               /* number of items with the value val */
        struct treenode *lnode;  /* pointer to left sub-node */
        struct treenode *rnode;  /* pointer to right sub-node */
    } treenode;

Any list item has the following structure:

    typedef struct ValueAndCounter {
        int val;                       /* value of an item of type int */
        int count;                     /* number of items with the value val */
        struct ValueAndCounter *next;  /* pointer to the next item of type ValueAndCounter */
    } ValueAndCounter;
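The following minimal sketch shows how the two structures and the recursive traversal fit together. It is not the authors' code: the helper tree_insert is hypothetical, and tree_to_list is a variant of treesort that receives the list head as an argument instead of keeping it in a static variable, so that it can be called more than once. Because each new item is linked in front of the previous one, the head of the resulting list holds the largest value.

```c
#include <stdlib.h>

typedef struct treenode {
    int val;                 /* value stored in the node */
    int count;               /* number of occurrences of val */
    struct treenode *lnode;  /* left sub-tree (smaller values) */
    struct treenode *rnode;  /* right sub-tree (greater values) */
} treenode;

typedef struct ValueAndCounter {
    int val;
    int count;
    struct ValueAndCounter *next;
} ValueAndCounter;

/* Hypothetical helper (not from the paper): insert v into the tree,
   counting repetitions instead of duplicating nodes. */
treenode *tree_insert(treenode *node, int v)
{
    if (node == NULL) {
        node = malloc(sizeof *node);
        if (node == NULL)
            abort();
        node->val = v;
        node->count = 1;
        node->lnode = node->rnode = NULL;
    } else if (v < node->val) {
        node->lnode = tree_insert(node->lnode, v);
    } else if (v > node->val) {
        node->rnode = tree_insert(node->rnode, v);
    } else {
        node->count++;
    }
    return node;
}

/* Same visiting order as treesort (left sub-tree, node, right sub-tree);
   the list head is passed in and the new head is returned. */
ValueAndCounter *tree_to_list(const treenode *node, ValueAndCounter *head)
{
    ValueAndCounter *item;

    if (node == NULL)
        return head;                          /* empty sub-tree: list unchanged */
    head = tree_to_list(node->lnode, head);   /* smaller values first */
    item = malloc(sizeof *item);
    if (item == NULL)
        abort();
    item->val = node->val;
    item->count = node->count;
    item->next = head;                        /* link in front of the previous item */
    return tree_to_list(node->rnode, item);   /* greater values end up nearer the head */
}
```

A call such as tree_to_list(root, NULL) plays the role of treesort(root) in the text; passing the head explicitly also mirrors the hardware requirement of passing and returning pointers.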

We assume here that the tree has already been built (using, for example, the method [22]). The nodes of the tree contain four fields: a pointer to the right child node, a pointer to the left child node, a counter, and a value (an integer in our case). The nodes are maintained so that at any node, the left sub-tree contains only values that are less than the value at the node, and the right sub-tree contains only values that are greater. The counter indicates the number of occurrences of the value associated with the respective node. If we call the function with the statement beginning = treesort(root);, it returns a pointer to a list of the sorted data items. To provide similar functionality in hardware, we need to be able to: (a) pass arguments through pointers and (b) return pointers.

The C function above can easily be described in the form of HGSs and similarly, the given HGSs can be almost directly coded in programming languages (Java and C for our examples). A C function call with arguments and a returned value can be represented by an HGS rectangular node, but the existing methods do not allow such a function to be implemented in a regular way in HFSM modules. To overcome this problem, we suggest that a third stack memory (AR_stack) be introduced for arguments (A) together with an additional register for the returned value (R) as shown in Fig. 4. Now the method gcd can be converted to an HFSM as follows:

1. Stacks are described in an HDL such as VHDL using templates from [22].
2. Other blocks are described based on the HDL template from [22] and using the following additional rules:
a. Arguments passed by value are stored in the AR_stack when a module (for a method/function) is being activated.

[Fig. 4, block diagram: the RTL block (inputs: operations; outputs: flags/states) together with the M_stack + FSM_stack, the AR_stack for passing arguments, and a register for returning values.]

Fig. 4. Using additional elements for passing arguments and returning values/pointers.


b. Different numbers of arguments passed to the same function are recognized by specifying a different HDL module depending on the actual number of arguments. This can be seen as a hardware technique for replicating method/function overloading in software.
c. For each argument that is a pointer, the address is stored in the AR_stack when a module (for a method/function) is being activated.
d. A single returned value/pointer is copied to a specially allocated register when a module is terminated and all arguments previously passed to this module are destroyed.

Let us continue our examples. Note that in the calls gcd(A, B, C, D) and gcd(gcd(A, B), gcd(C, D)) above, the hardware modules are different. They are designated Z41 (for four arguments) and Z21 (for two arguments) as shown in Fig. 3. When the module Z41 is activated, the following statement has to be performed in VHDL (where the module is designated Z1_4 because subscripts and superscripts are not allowed in HDLs):

    when stateWhereTheModuleZ1_4IsActivated =>
        push <= '1'; NextModule <= Z1_4;
        pass_arguments <= A & B & C & D; -- preparing arguments for the AR_stack

Here, the signal push is used in another (concurrent) process to increment the stack pointer that is common to all three stacks; NextModule is also used in the other process to indicate the transition to the next module (Z41 in our example); pass_arguments is a signal of type std_logic_vector that holds the arguments A, B, C, and D of a specified size. The statement above is included in the process RTL for the relevant block in Fig. 4. All three stacks are described in the following VHDL process MEMORY:

    MEMORY: process(clock)
    begin
        if rising_edge(clock) then -- a0 is an initial state; z0 is a top-level module
            if reset = '1' then
                stack_pointer <= 0; FSM_stack(0) <= a0; -- synchronous reset
                M_stack(0) <= z0; stack_overflow <= '0';
                AR_stack(0) <= (others => '0');
            else
                if push = '1' then -- TYPE 1
                    if stack_pointer = stack_size then
                        stack_overflow <= '1'; -- handling stack overflow
                    else
                        stack_pointer <= stack_pointer + 1;
                        FSM_stack(stack_pointer+1) <= a0; -- initial state is always a0
                        FSM_stack(stack_pointer) <= N_S; -- N_S is the next state in the calling module
                        M_stack(stack_pointer+1) <= NextModule; -- NextModule is the next module
                        AR_stack(stack_pointer+1) <= pass_arguments; -- passing arguments
                    end if;
                elsif pop = '1' then -- TYPE 2
                    stack_pointer <= stack_pointer - 1; -- decrementing the stack_pointer when the
                                                        -- module is terminated
                else -- TYPE 3
                    FSM_stack(stack_pointer) <= N_S; -- conventional state transition to N_S
                end if;
            end if;
        end if;
    end process MEMORY;

Here, clock and reset are hardware synchronization and initialization signals, and pop is the signal that decrements the stack pointer (stack_pointer) during a hierarchical return (when the called module is being terminated and control has to be passed back to the calling module). There are three potential transitions here:

TYPE 1: hierarchical call – when the calling module activates a called module;
TYPE 2: hierarchical return – when the called module is being terminated;
TYPE 3: conventional state transition – common to non-hierarchical FSMs.

Since there is just a single value returned, it is kept in a signal that is declared as:

    signal return_value: std_logic_vector(size_of_operands-1 downto 0);

where size_of_operands is a generic constant. The two processes, RTL and MEMORY, are executed concurrently (i.e. in parallel). Thus, the RTL process prepares data for the MEMORY process and the latter activates the module Z41 for the Java method gcd(int A, int B, int C, int D) shown above, i.e. the module Z41 begins execution from the next clock cycle and receives the arguments A, B, C, D through the AR_stack. The calls of the other modules shown in Fig. 3a are exactly the same but only the first two fields of the AR_stack are used, for example:

    when stateWhereTheModuleZ1_2IsActivated =>
        push <= '1'; NextModule <= Z1_2;
        pass_arguments() <= A & B; -- preparing arguments for the AR_stack

Both modules Z41 and Z21 have to return a value. This is done in the following statements:

    when stateWhereTheResultIsProduced =>
        N_S <= indicatingTheNextState;
        return_value <= signalThatKeepsTheResult;
    when stateWhereTheCalledModuleIsTerminated =>
        pop <= '1';

Here, the first when statement prepares the returned value, the second when statement activates the signal pop, and the MEMORY process decrements the stack pointer that is common to all three stacks. The complete VHDL code for the RTL process for the module Z41 (Z1_4 in the code) looks like this:

    case M_stack(stack_pointer) is
      when Z1_4 =>
        case FSM_stack(stack_pointer) is
          when a0 => N_S <= a1;    -- initialization (this state can be skipped if it is not needed)
          when a1 => push <= '1'; NextModule <= Z1_2; N_S <= a2;    -- call of gcd(A,B)
                     pass_arguments <= -- arguments A and B;
          when a2 => push <= '1'; NextModule <= Z1_2; N_S <= a3;    -- call of gcd(C,D)
                     pass_arguments <= -- arguments C and D;
          -- below the call of gcd(R1=gcd(A,B),R2=gcd(C,D))
          when a3 => push <= '1'; NextModule <= Z1_2; N_S <= a4;
                     pass_arguments <= -- arguments R1 and R2 where R1 is a saved return_value from
                                       -- the module Z1_2 with arguments A,B, and R2 is a saved
                                       -- return_value from the module Z1_2 with arguments C,D;
          when a4 => pop <= '1'; return_value <= -- the final result;
          when others => null;
        end case;
      -- description of the module Z1_2 and other potentially available modules
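The interplay of the RTL and MEMORY processes can also be followed in a software model. The following is a minimal cycle-by-cycle sketch in C, not the authors' implementation: each loop iteration stands for one clock cycle in which the RTL part decodes the module and state on top of the stacks and the MEMORY part applies a TYPE 1, TYPE 2 or TYPE 3 transition. The function name hfsm_gcd4, the register r1, the stack depth and the body of the module Z1_2 (assumed here to be the recursive Euclidean step gcd(A, B) = gcd(B, A mod B)) are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

enum { STACK_SIZE = 64 };                      /* assumed stack depth */
typedef enum { Z1_4, Z1_2 } module_t;          /* module codes as in the VHDL */
typedef enum { A0, A1, A2, A3, A4 } state_t;   /* states a0..a4 */

/* Returns gcd(a,b,c,d); *cycles receives the number of simulated clock cycles. */
uint32_t hfsm_gcd4(uint32_t a, uint32_t b, uint32_t c, uint32_t d, unsigned *cycles)
{
    state_t  fsm_stack[STACK_SIZE];
    module_t m_stack[STACK_SIZE];
    uint32_t ar_stack[STACK_SIZE][4];
    int sp = 0;
    uint32_t return_value = 0, r1 = 0;         /* r1 keeps the saved result R1 */

    fsm_stack[0] = A0; m_stack[0] = Z1_4;      /* hypothetical: Z1_4 is started directly */
    ar_stack[0][0] = a; ar_stack[0][1] = b; ar_stack[0][2] = c; ar_stack[0][3] = d;
    *cycles = 0;

    for (;;) {
        /* RTL part: decode module/state and prepare push, pop, N_S and the arguments. */
        int push = 0, pop = 0;
        state_t n_s = fsm_stack[sp];
        module_t next_module = Z1_2;           /* every call in this example targets Z1_2 */
        uint32_t args[4] = {0, 0, 0, 0};
        const uint32_t *ar = ar_stack[sp];

        if (m_stack[sp] == Z1_4) {
            switch (fsm_stack[sp]) {
            case A0: n_s = A1; break;
            case A1: push = 1; args[0] = ar[0]; args[1] = ar[1]; n_s = A2; break;
            case A2: r1 = return_value;        /* save R1 = gcd(A,B) */
                     push = 1; args[0] = ar[2]; args[1] = ar[3]; n_s = A3; break;
            case A3: push = 1; args[0] = r1; args[1] = return_value; n_s = A4; break;
            default: pop = 1; break;           /* a4: return_value already holds the result */
            }
        } else {                               /* Z1_2: assumed recursive Euclid */
            if (fsm_stack[sp] == A0) {
                if (ar[1] == 0) { return_value = ar[0]; pop = 1; }
                else { push = 1; args[0] = ar[1]; args[1] = ar[0] % ar[1]; n_s = A1; }
            } else {
                pop = 1;                       /* pass the result up to the caller */
            }
        }

        /* MEMORY part: one of the three transition types per clock cycle. */
        ++*cycles;
        if (push) {                            /* TYPE 1: hierarchical call */
            assert(sp + 1 < STACK_SIZE);
            fsm_stack[sp] = n_s;               /* next state of the calling module */
            ++sp;
            fsm_stack[sp] = A0;
            m_stack[sp] = next_module;
            for (int i = 0; i < 4; i++) ar_stack[sp][i] = args[i];
        } else if (pop) {                      /* TYPE 2: hierarchical return */
            if (sp == 0) return return_value;  /* top-level module terminated */
            --sp;
        } else {                               /* TYPE 3: conventional transition */
            fsm_stack[sp] = n_s;
        }
    }
}
```

Calling hfsm_gcd4(12, 18, 30, 42, &n) walks through the states a0 to a4 of the module Z1_4 and activates the module Z1_2 three times, as in the VHDL code above; the cycle count n is a property of this model only and says nothing about the timing of the synthesized circuits.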

There are three hierarchical calls in the RTL process above, in the states a1, a2, and a3 (see Fig. 3a). A hierarchical return is executed in the state a4. Note that there is no data dependency between the calls gcd(A, B) and gcd(C, D). Thus, the corresponding modules (i.e. gcd(A, B) and gcd(C, D)) can be executed in parallel, which will be shown in Section 6. Similarly, other methods/functions can be executed concurrently and the number of parallel modules is only limited by the hardware resources available.

5. Optimization technique

The MEMORY process from the previous section requires excessive hardware resources when it is built as a logic block. However, it can also be constructed from embedded or distributed memories. Since the signals push, pop, clock, reset, stack_pointer are common to all the stacks, the memory can be organized as shown in Fig. 5. The VHDL code for the stacks constructed from block RAM (see RAM_block in Fig. 5) looks like this:

    EMBEDDED_or_DISTRIBUTED_MEMORY: process(clock)
    begin    -- states and modules are represented below explicitly by binary codes
      if rising_edge(clock) then
        if reset = '1' then
          stack_pointer <= 0; stack_overflow <= '0';    -- see Fig. 5a
          FSM_Register <= (others => '0');              -- see Fig. 5c
        else
          if push = '1' then        -- hierarchical call
            if stack_pointer = 2**ram_addr_bits-1 then
              stack_overflow <= '1';
            else
              stack_pointer <= stack_pointer + 1;
              -- the arguments are passed through the signal to_AR
              FSM_Register <= to_AR & N_M & (size_of_FSM_stack_words-1 downto 0 => '0');
              RAM_block(stack_pointer) <= to_AR & C_M & N_S;
            end if;
          elsif pop = '1' then      -- hierarchical return
            stack_pointer <= stack_pointer - 1;
            FSM_Register <= RAM_block(stack_pointer-1);
          else                      -- conventional transition
            FSM_Register(size_of_FSM_stack_words-1 downto 0) <= N_S;
          end if;
        end if;
      end if;
    end process EMBEDDED_or_DISTRIBUTED_MEMORY;

RAM_block is declared as an array:

    constant ram_width: integer := -- size of words for the single stack shown in Fig. 5a,b
    constant ram_addr_bits: integer := -- size of RAM addresses
    type DistributedRAM is array (2**ram_addr_bits-1 downto 0) of std_logic_vector(ram_width-1 downto 0);
    signal RAM_block: DistributedRAM;    -- Block RAM is declared similarly to distributed RAM
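The behavior of the single RAM block and the register FSM_Register can be mirrored in a few lines of C. This is only a sketch under stated assumptions: the word layout is modeled as a struct instead of a concatenated bit vector, the sizes are arbitrary, and the names word_t, hfsm_mem_t, hfsm_reset, hfsm_push, hfsm_pop and hfsm_transition are hypothetical. The stored fields follow the VHDL statements above literally (to_AR & C_M & N_S goes to the RAM, to_AR & N_M & zeros goes to the register).

```c
#include <assert.h>
#include <stdint.h>

enum { RAM_ADDR_BITS = 4, RAM_DEPTH = 1 << RAM_ADDR_BITS };  /* assumed sizes */

/* One word: the fields that the VHDL concatenates as arguments & module & state. */
typedef struct {
    uint32_t args;    /* to_AR */
    uint8_t  module;  /* N_M in the register, C_M in the RAM */
    uint8_t  state;   /* state code */
} word_t;

typedef struct {
    word_t   ram_block[RAM_DEPTH];  /* single RAM replacing the three stacks */
    word_t   fsm_register;          /* active arguments, module and state */
    unsigned stack_pointer;
    int      stack_overflow;
} hfsm_mem_t;

void hfsm_reset(hfsm_mem_t *m)
{
    m->stack_pointer = 0;
    m->stack_overflow = 0;
    m->fsm_register = (word_t){0, 0, 0};
}

/* Hierarchical call: the return word goes to the RAM, the called module to the register. */
void hfsm_push(hfsm_mem_t *m, uint32_t to_ar, uint8_t n_m, uint8_t n_s)
{
    if (m->stack_pointer == RAM_DEPTH - 1) { m->stack_overflow = 1; return; }
    uint8_t c_m = m->fsm_register.module;              /* code of the calling module */
    m->ram_block[m->stack_pointer] = (word_t){to_ar, c_m, n_s};
    m->fsm_register = (word_t){to_ar, n_m, 0};         /* initial state is all zeros */
    m->stack_pointer++;
}

/* Hierarchical return: the word saved by the matching call is copied back. */
void hfsm_pop(hfsm_mem_t *m)
{
    assert(m->stack_pointer > 0);
    m->stack_pointer--;
    m->fsm_register = m->ram_block[m->stack_pointer];
}

/* Conventional transition: only the state field of the register changes. */
void hfsm_transition(hfsm_mem_t *m, uint8_t n_s)
{
    m->fsm_register.state = n_s;
}
```

In this model the RAM is written only on a call and read only on a return, while all conventional transitions touch the register alone.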

Fig. 6 illustrates different types of transitions in the HFSM for hierarchical calls (Fig. 6a), conventional state transitions (Fig. 6b), and hierarchical returns (Fig. 6c). Note that the stack is passive in a hierarchically called module (the stack is needed just for a hierarchical return from the called module). Thus, just a register (FSM_Register) can be used for passing arguments and executing state and module transitions. As soon as a transition to the next module has to be done (in the case of a hierarchical call), a binary vector BVc = to_AR & N_M & (all zeros), i.e. the arguments (to_AR), the code of the called module (N_M) and the code of its initial state (all zeros), is copied to the register as shown in Fig. 6a. Conventional state transitions are executed similarly to an ordinary FSM using the register FSM_Register (see Fig. 6b). The arguments are taken directly from the register (FSM_Register). As soon as a hierarchical return has to be done, the binary vector (BVr) from the stack shown in Fig. 6c (containing the arguments, the code of the calling module and the code of the next state in the calling module after termination of the called module) is copied to the FSM_Register (FSM_Register <= RAM_block(stack_pointer-1);). Thus, the calling module will continue its execution. The line RAM_block(stack_pointer) <= to_AR & C_M & N_S; in the EMBEDDED_or_DISTRIBUTED_MEMORY process above sets the code of the next state N_S that is needed after the termination of the called module. As a result, after the corresponding hierarchical return, the transition to the proper HFSM state occurs (FSM_Register <= RAM_block(stack_pointer-1);). Since the next state is determined before the invocation of a module, the called module cannot change the predetermined state transition. For the majority of practical applications this does not create a problem. However, in some cases it is a problem, which must be resolved. This can be done by replacing the line above with the statement RAM_block(stack_pointer) <= to_AR & C_M & C_S; where C_S (the current state in the calling module) has to be further replaced with the N_S in the calling module that is found taking into account potentially changed conditions in the called module(s).

Fig. 5. (a) Single block of embedded/distributed RAM for the three stacks in Fig. 4, (b) active stack register, and (c) state transitions and hierarchical calls through the FSM conventional register.

Fig. 6. Three types of stack transitions in HFSM: (a) hierarchical calls; (b) conventional transitions, and (c) hierarchical returns.

6. Parallel HFSM

We have already mentioned in Section 4 that some modules (such as Z21(A, B) and Z21(C, D) in Fig. 3a) can be executed in parallel (see Fig. 7). Let us take for further study only HGS rectangular nodes with more than one macro operation making up sub-sets Z1, Z2, .... Thus, parallel execution of the macro operations assigned to each sub-set has to be provided. For example, in Fig. 7b there are three sub-sets: Z1 = {z1, z2, z3}, Z2 = {z1, z4}, and Z3 = {z2, z3, z4}. The main module Z0 = {z0} also needs to be implemented and up to three modules (see sets Z1 and Z3) need to be executed in parallel. According to the proposal in this paper, a parallel HFSM (PHFSM) can be designed by applying the following rules:

1. Macro operations from each sub-set Zi are assigned to different HFSMs running in parallel. The HFSM implementing the calling module is responsible for the parallel activation of the called modules and for verification that all called modules from the same set have been completed (i.e. execution can proceed after the relevant merging point such as the one shown in Fig. 1i). For our example in Fig. 7b, the assignment can be done as follows: HFSM1: z0, z1, z2; HFSM2: z2, z3, z4; HFSM3: z3, z4. For the example in Fig. 7a: HFSM1: Z41, Z21(A, B), Z21(R1, R2); HFSM2: Z21(C, D).
2. Each HFSMp is described as a VHDL component with three additional signals that are introduced in the next point.
3. If a calling (zq→) and a called (→zp) module (zq → zp) belong to the same HFSM component, then functionality is exactly the same as for a non-parallel HFSM (see Sections 4 and 5).
Suppose now that zq → zp and the modules zq→, →zp belong to different components HFSMq and HFSMp. To trigger a macro operation →zp from zq→, the following three additional signals are involved: (1) startp to activate the HFSMp (HFSMq → HFSMp); (2) zp to choose the module →zp in the HFSMp; (3) finishp to indicate that the module →zp is completed. The signals startp and zp are formed (assigned) in zq→ and used (tested) in →zp. The signal finishp is assigned in →zp and tested in zq→.
4. Finally, parallel execution of macro operations in the sets Z1, Z2, Z3 in Fig. 7b will be provided in the following three HFSM components: Z1 → {HFSM1(z1), HFSM2(z2), HFSM3(z3)}; Z2 → {HFSM1(z1), HFSM2(z4)}; Z3 → {HFSM1(z2), HFSM2(z3), HFSM3(z4)}. Parallel execution of macro operations from the set Z1 = {Z21(A, B), Z21(C, D)} in Fig. 7a will be provided in two HFSM components: {HFSM1(Z21(A, B)), HFSM2(Z21(C, D))}.

Fig. 7. Examples of parallel operations.

The technique described above enables any reasonable number of HFSMs mapped to VHDL components to be executed at the same time. All the HFSM features discussed in Sections 4 and 5 are entirely provided. Concurrent execution of VHDL components is combined with modularity and recursion within individual HFSMs. However, parallel calls from recursively activated modules are not allowed. The maximum number of concurrent HFSMs has to be known in advance to provide the necessary mapping to VHDL components. The graph of parallel invocations (such as Z1 → {z1, z2, z3}; Z2 → {z1, z4}; Z3 → {z2, z3, z4}) has to be a tree (i.e. cycles are not allowed for parallel invocations but they are allowed for sequential invocations considered in Sections 3-5). Thus, no called module can call any of its predecessors with parallel calls.
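The start/finish handshake between a calling component and the called components can be illustrated by a small clocked model in C. This is a hedged sketch, not the VHDL components of the paper: each called unit is assumed to perform one iterative Euclidean step per clock cycle (so that no recursion is involved in the parallel calls), and the names gcd_unit_t, unit_start, unit_clock, run_until and gcd4 are hypothetical. The caller starts the units, waits at the merging point until every started unit has raised finish, and then issues the final call gcd(R1, R2).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A called component: idle until started, one Euclid step per clock, then finish. */
typedef struct { uint32_t a, b; int busy, finish; } gcd_unit_t;

static void unit_start(gcd_unit_t *u, uint32_t a, uint32_t b)   /* models start_p */
{
    u->a = a; u->b = b; u->busy = 1; u->finish = 0;
}

static void unit_clock(gcd_unit_t *u)
{
    if (!u->busy) return;
    if (u->b == 0) { u->busy = 0; u->finish = 1; }   /* result is in u->a; models finish_p */
    else { uint32_t r = u->a % u->b; u->a = u->b; u->b = r; }
}

/* Clocks the units until all of them have finished; u2 may be NULL. */
static unsigned run_until(gcd_unit_t *u1, gcd_unit_t *u2)
{
    unsigned n = 0;
    while (!u1->finish || (u2 != NULL && !u2->finish)) {
        unit_clock(u1);
        if (u2 != NULL) unit_clock(u2);
        ++n;
    }
    return n;
}

/* gcd(A,B,C,D); if parallel is nonzero, gcd(A,B) and gcd(C,D) run in two units at once. */
uint32_t gcd4(uint32_t a, uint32_t b, uint32_t c, uint32_t d, int parallel, unsigned *cycles)
{
    gcd_unit_t hfsm1 = {0, 0, 0, 0}, hfsm2 = {0, 0, 0, 0};
    uint32_t r1, r2;

    assert(cycles != NULL);
    *cycles = 0;
    unit_start(&hfsm1, a, b);
    if (parallel) {
        unit_start(&hfsm2, c, d);
        *cycles += run_until(&hfsm1, &hfsm2);        /* merging point: both finish signals */
        r1 = hfsm1.a; r2 = hfsm2.a;
    } else {
        *cycles += run_until(&hfsm1, NULL);
        r1 = hfsm1.a;
        unit_start(&hfsm1, c, d);
        *cycles += run_until(&hfsm1, NULL);
        r2 = hfsm1.a;
    }
    unit_start(&hfsm1, r1, r2);                      /* sequential call gcd(R1, R2) */
    *cycles += run_until(&hfsm1, NULL);
    return hfsm1.a;
}
```

In the parallel case the first phase of this model costs as many cycles as the slower of the two units rather than the sum of both, which is the effect the PHFSM rules above are designed to obtain.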

7. Experiments and comparisons

Four types of experiments have been carried out. Firstly, we tested selected applications in software (in C and Java). Secondly, the synthesis and implementation of the circuits from a specification in VHDL were done in the Xilinx ISE 14.4 for Spartan-6 FPGA and Zynq xc7z020 EPP (prototyping boards Atlys and ZedBoard). Thirdly, fixed + variable structure computational systems that combine the PS and the PL (see Section 2) were tested in the Zynq xc7z020 EPP. Finally, a comparison was made with some applications analyzed in previously published papers [21,22]. The following sub-sections discuss the outcomes, taking into account the performance of the tested circuits and their hardware resources. The results are shown for a number of practical applications and one of these applications (processing N-ary trees) is described below in its entirety.

An N-ary tree is a rooted connected graph that does not contain cycles and for which any internal node has at most N children. Fig. 8 depicts an example of an N-ary tree (N = 4) that can be seen as a graph representing operations A, B, C, D, E, ..., M associated with the tree nodes a, b, c, d, e, ..., m, where relationships between the operations are shown by tree edges. Alternatively, this tree can store a set of data that are linked in accordance with given relationships. For example, the tree in Fig. 8 holds the following set of integers: 60, 12, 31, 56, 0, 9, 63, 28, 6, 1, 58, 15, 2, 62, 48, 49, 7, 29, 50, 5, 3, 30, 59, 23. Let us consider the binary codes of the integers decomposed in G-bit groups (G = 2): 11 11 00, 00 11 00, 01 11 11, 11 10 00, 00 00 00, 00 10 01, 11 11 11, 01 11 00, 00 01 10, 00 00 01, 11 10 10, 00 11 11, 00 00 10, 11 11 10, 11 00 00, 11 00 01, 00 01 11, 01 11 01, 11 00 10, 00 01 01, 00 00 11, 01 11 10, 11 10 11, 01 01 11. The first group on the left-hand side is shown in italic. Let us use this group to allocate three children of the root for all the codes found: 00, 01, and 11, leading to the children b, c and d, respectively. Now the nodes b, c and d can be considered as roots of sub-trees for which the same rules have to be applied. Items from the last group are not expanded into new tree nodes, but are just associated with the leaves at depth 2 (these are the leaves e–m). Such a tree can be traversed by applying either an iterative or a recursive procedure. Data attached to the leaves are ordered (the leftmost leaf contains the smallest set and the rightmost leaf the greatest set of data items). Thus, the tree can be used for data sorting, or for searching for particular items. For example, to check if the data item 28 (01 11 00) is in the set you can execute three tests: one for the tree root and others for the nodes c and j (see underlined codes in Fig. 8).

Fig. 8. An example of N-ary tree (N = 4).

N-ary trees are involved in numerous practical applications and we will use them for sorting data by applying two types of modules: (1) for traversing the tree, enabling all leaves to be found, and (2) for fast sorting of data associated with leaves. The first module will have two alternative implementations: iterative and recursive. The second module executes sequential (non-recursive) operations enabling reusable sorting networks [12] to be involved. The experiments were carried out for 16-bit items decomposed in two parts. The first part is divided in four two-bit groups represented by an N-ary tree of depth 4 and N is also chosen to be four. Binary codes from the second part are associated with the tree leaves and, thus, any leaf can have up to 256 unique items attached. Both types of the modules were implemented and tested in software and in hardware. Note that hardware implementation can also be done using previously published and known results. For this reason we will start in the next sub-section with the original contributions of this paper and the advantages which can be gained from these.

7.1. New contributions

Suppose an N-ary tree for sorting data has been built and it is necessary to extract the sorted data from the tree. The following recursive C function does this:

    void traverse_tree(treenode *root, int depth)
    {
        depth++;
        if (root == NULL) { depth--; return; }
        if (depth == max_depth) {
            sort_and_print_leaf_data(root);
            depth--;
            return;
        }
        for (int i = 0; i < N; i++)
            traverse_tree(root->node[i], depth);    /* recursive call */
        depth--;
    }

where treenode is the following C structure (N is a constant):

    typedef struct treenode treenode;
    struct treenode {
        int *arrayTOsort;
        int count;
        treenode *node[N];
    };
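The tree construction by G-bit groups and the iterative alternative to traverse_tree can be sketched together. The code below is an illustration under assumptions, not the authors' implementation: it uses the small example of Fig. 8 (6-bit items, two tree levels, G = 2) instead of the 16-bit items of the experiments, a fixed-size leaf array instead of the pointer arrayTOsort, unique items, and the hypothetical helpers new_node, insert_item and sort_leaf. Besides the parent pointer, a hypothetical field index records the position of a node in its parent so that backtracking knows which child to try next.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { N = 4, G = 2, MAX_DEPTH = 2, LEAF_CAP = 1 << G };   /* sizes of the Fig. 8 example */

typedef struct treenode treenode;
struct treenode {
    int       arrayTOsort[LEAF_CAP]; /* items attached to a leaf (unique items assumed) */
    int       count;
    treenode *node[N];
    treenode *parent;                /* additional field for the iterative traversal */
    int       index;                 /* hypothetical: position of this node in parent->node */
};

treenode *new_node(treenode *parent, int index)
{
    treenode *t = calloc(1, sizeof *t);
    assert(t != NULL);
    t->parent = parent;
    t->index = index;
    return t;
}

/* Descends by successive G-bit groups (most significant first) and stores the item in a leaf. */
void insert_item(treenode *root, int item)
{
    treenode *cur = root;
    for (int level = 0; level < MAX_DEPTH; level++) {
        int g = (item >> (G * (MAX_DEPTH - level))) & (N - 1);
        if (cur->node[g] == NULL) cur->node[g] = new_node(cur, g);
        cur = cur->node[g];
    }
    assert(cur->count < LEAF_CAP);
    cur->arrayTOsort[cur->count++] = item;
}

/* Insertion sort of the few items kept in one leaf. */
static void sort_leaf(treenode *leaf)
{
    for (int i = 1; i < leaf->count; i++) {
        int v = leaf->arrayTOsort[i], j = i - 1;
        while (j >= 0 && leaf->arrayTOsort[j] > v) {
            leaf->arrayTOsort[j + 1] = leaf->arrayTOsort[j];
            j--;
        }
        leaf->arrayTOsort[j + 1] = v;
    }
}

/* Visits the leaves from left to right without recursion; returns the number of items written. */
size_t iterative_traverse_tree(treenode *root, int *out)
{
    size_t n = 0;
    treenode *cur = root;
    int depth = 0, i = 0;                       /* i: next child of cur to try */
    while (cur != NULL) {
        if (depth == MAX_DEPTH) {               /* leaf: emit its sorted data, go back */
            sort_leaf(cur);
            for (int k = 0; k < cur->count; k++) out[n++] = cur->arrayTOsort[k];
            i = cur->index + 1; cur = cur->parent; depth--;
        } else if (i < N) {
            if (cur->node[i] != NULL) { cur = cur->node[i]; depth++; i = 0; }
            else i++;
        } else {                                /* all children done: backtrack */
            i = cur->index + 1; cur = cur->parent; depth--;
        }
    }
    return n;
}
```

Inserting the 24 integers listed for Fig. 8 and calling iterative_traverse_tree yields them in ascending order, because the leaves are visited from the smallest to the greatest group codes and each leaf is sorted locally. The extra pointer operations in the backtracking branches are the additional work that the iterative variant performs compared with the recursive one.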

Similarly, an iterative function void iterative_traverse_tree(treenode *root, int depth) can be built for which the treenode structure has an additional field with a pointer to the parent node. The functions traverse_tree and iterative_traverse_tree can be transformed to hardware circuits using the known methods and tools described in Section 2. However, none of these methods enables implementations in hardware that are similar to software mechanisms. In our case direct conversion is done and the relevant reverse conversion is also possible. This feature is important for various types of simulation [6] and permits the interface between software and hardware to be unified and simplified. In contrast to previous publications, the stacks of the HFSMs were optimized and built from embedded memories and any module can accept arguments (such as (treenode *root, int depth) in the C code above) and return a value.

Let us look again at Fig. 8. Different branches of the tree (such as those with the local roots b–d) can be traversed concurrently and, thus, the PHFSM described in Section 6 can be applied directly, allowing different modules to be executed in parallel. Possible data dependency between the modules is avoided by storing sub-trees in different memory blocks. Also, any module allows a pipeline to be created. For example, the function sort_and_print_leaf_data(root); in the C code above sorts data associated with the tree leaves. Examples of circuits that execute similar operations are given in [12]. Fig. 9 demonstrates a pipeline implemented in an HFSM module. As soon as the function traverse_tree finds the sub-set of data with the smallest values (e.g. node e in Fig. 8), all items are transferred to the input of the leftmost pipeline register in Fig. 9. At the next iteration, a subsequent sub-set (e.g. node f in Fig. 8) is transferred and the results of operations with the first sub-set are stored in the rightmost pipeline register of Fig. 9. It is known [12] that this type of pipeline allows the sorting of multiple sub-sets (in our case they are associated with different leaves) to be accelerated significantly. Thus, the methods proposed enable any module that is directly built from a software function to be further improved by applying various acceleration techniques (e.g. parallelism and pipelining) without changes in the interface with the rest of the implemented system, just by adjusting timing characteristics. Any HFSM module has a unified interface that is the same as the interface of the corresponding software procedure (a C function in particular). However, the implementations of modules may be different. For example, the recursive function traverse_tree(treenode *root, int depth) can easily be replaced by the iterative function iterative_traverse_tree(treenode *root, int depth). Such a technique is indispensable for experiments and comparisons.

7.2. Software/hardware co-design

Combining the capabilities of software and hardware permits many characteristics of applications to be improved. Nowadays this technique can be implemented on a chip such as the Zynq EPP and we used this for experiments (xc7z020 microchip available on the ZedBoard) as follows: the PS executes software programs that have been developed in the C language. The PL exchanges data with the PS using AXI-based high-bandwidth connectivity and executes problem-specific algorithms. Fig. 10 gives more details of the interaction, which is organized with the aid of the Xillybus Lite IP core [23]. The user software applications run in the ARM Cortex-A9 under Linux. HFSM modules are designed in Xilinx ISE 14.4 and they may interact with the Xillybus IP core as shown in Fig. 10. The latter provides data exchange with the PS through the AXI.


Fig. 9. A pipeline controlled by an HFSM module.

Let us consider a particular practical application. Suppose we want to find a minimal row cover of a given binary matrix, i.e. the minimum number of rows such that in conjunction they have at least one value '1' in each column. The approximate algorithm [24] that allows this problem to be solved requires the following sequence of steps: (1) discovering a matrix column Cmin with the minimal Hamming weight N1min (if N1min = 0 then the covering does not exist); (2) discovering a row Rmax, with the value '1' in the column Cmin, with the maximum Hamming weight N1max; (3) including the row Rmax in the solution and removing this row and all the columns which have values '1' in the Rmax; and (4) repeating the steps 1–3 until the matrix is empty. The following application has been designed, implemented and tested: (1) the PS receives the given matrix from the host PC; (2) the matrix is transmitted to the PL and all horizontal and vertical masks (which mark the deleted rows and columns) are reset to zero; (3) the PL searches for the column Cmin with the minimal Hamming weight N1min and sends the values N1min and Cmin to the PS; (4) if N1min = 0, the PS informs the host PC that there is no solution, otherwise it indicates for which rows the value N1max has to be found in the PL; (5) the PL finds Rmax and sends it to the PS; (6) the PS updates masks in the PL. The masks are used to indicate the rows and columns that have been removed and, thus, the same masked (reduced) matrix is taken for subsequent processing. The steps above are repeated until either the covering is found or it is concluded that the solution does not exist.
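The four steps of the approximate covering algorithm can be written down as a short software sketch. The version below is an illustration under assumptions rather than the implemented PS/PL application: the matrix is limited to 32 rows and 32 columns, each row is kept as a 32-bit word, ties are resolved in favor of the lowest index, and the names row_cover, low_bits and popcount32 are hypothetical. Like the hardware described above, it never modifies the matrix and only updates a row mask and a column mask.

```c
#include <assert.h>
#include <stdint.h>

static int popcount32(uint32_t x)              /* Hamming weight of a 32-bit word */
{
    int n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

static uint32_t low_bits(int n)                /* word with the n lowest bits set */
{
    return n >= 32 ? 0xFFFFFFFFu : (1u << n) - 1u;
}

/* rows[i] has bit j set if the matrix holds '1' in row i, column j.
 * Writes the chosen row indices to cover[] (room for nrows entries) and returns
 * their number, or -1 if the covering does not exist. */
int row_cover(const uint32_t *rows, int nrows, int ncols, int *cover)
{
    assert(nrows >= 1 && nrows <= 32 && ncols >= 1 && ncols <= 32);
    uint32_t row_mask = low_bits(nrows);       /* 1 = row not yet removed */
    uint32_t col_mask = low_bits(ncols);       /* 1 = column not yet removed */
    int size = 0;

    while (col_mask != 0) {                    /* step 4: repeat until the matrix is empty */
        /* Step 1: column Cmin with the minimal Hamming weight N1min. */
        int cmin = -1, wmin = nrows + 1;
        for (int j = 0; j < ncols; j++) {
            if (!((col_mask >> j) & 1u)) continue;
            int w = 0;
            for (int i = 0; i < nrows; i++)
                if (((row_mask >> i) & 1u) && ((rows[i] >> j) & 1u)) w++;
            if (w < wmin) { wmin = w; cmin = j; }
        }
        if (wmin == 0) return -1;              /* an uncovered column cannot be covered */

        /* Step 2: row Rmax with '1' in Cmin and the maximum Hamming weight N1max. */
        int rmax = -1, wmax = -1;
        for (int i = 0; i < nrows; i++) {
            if (!((row_mask >> i) & 1u) || !((rows[i] >> cmin) & 1u)) continue;
            int w = popcount32(rows[i] & col_mask);
            if (w > wmax) { wmax = w; rmax = i; }
        }

        /* Step 3: include Rmax, remove it and all the columns it covers. */
        cover[size++] = rmax;
        row_mask &= ~(1u << rmax);
        col_mask &= ~rows[rmax];
    }
    return size;
}
```

For a matrix with the rows {0,1,2}, {3,4}, {0,2} and {1,4} (sets of column indices holding '1'), the sketch first selects column 3, which has a single '1', and therefore the second row, and then completes the cover with the first row. Because the algorithm is approximate, the cover found is not guaranteed to be minimal in general.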

7.3. Performance and resources evaluation

Experiments with N-ary trees built from arbitrarily generated data sets have demonstrated that the result is found faster than in software. If we compare recursive and iterative modules, then the former have slightly better performance. This can be explained as follows. Traversing N-ary trees is directly supported by recursive calls/returns, whereas in an iterative module backtracking (traveling back from children to parents) needs to be done through additional operations over pointers (from children to parents) in the treenode structure. This is inherent to tree-based graphs, for which recursive procedures are often preferable. When we replace cycles by recursive function invocations (see examples with the greatest common divisor in Section 4), performance for the two types of implementations indicated above is almost the same. Thus, the recursive technique can be justified just by a clearer and more understandable specification. It should be noted that the paper promotes hierarchy and modularity in circuit design, and recursion is not a primary objective. The main target is hierarchy and, as a consequence, the unification and reusability of modules. The latter are not necessarily recursive but provide support for recursion. If parallel processing of N-ary tree branches (sub-trees) is applied, then acceleration is increased. Maximum acceleration is obtained for sub-trees having almost equal numbers of nodes.

Fig. 10. Implemented interaction between the PS and PL.


The speedup is achieved because: (1) software functions were accelerated in HFSM modules; (2) HFSM modules can run in parallel; (3) HFSM modules can incorporate a pipeline; and (4) HFSM modules can improve known sorting networks by allowing network segments to be reused and, thus, combinational operations are optimized through their rational combination with sequential operations. We mentioned at the beginning of this section that a comparison was also done with some applications that were analyzed in previously published papers [21,22]. Experiments have been carried out for the following problems: P1 – sorting based on a binary tree; P2 – approximate method for discovering a minimal row cover of a binary matrix. In [21] synthesis from a specification in Handel-C was done in the DK3 design suite for the xc2v1000-4fg456 FPGA (Virtex-II family of Xilinx) that is available on the RC200 prototyping board. Thus, comparison with synthesis from an SLSL can also be performed. All VHDL projects from [21,22] were re-synthesized in ISE 14.4 of Xilinx. The results are shown in Table 1 in the form Ns/ET, where Ns is the number of FPGA slices and ET is the execution time in nanoseconds for the problem P1 and in microseconds for the problem P2. All VHDL projects were implemented in a Spartan-6 FPGA for the Atlys prototyping board. Initial data were taken in exactly the same way as in [21]. Handel-C projects gave the worst results and this confirms the conclusions that were arrived at in Section 2. The best execution time is underlined in Table 1 and the best performance is achieved with the methods proposed in this paper.

Table 1 The results of experiments.

Problem    Handel-C projects [21]     VHDL projects [21]       New VHDL projects
           Recursive    Iterative     Recursive   Iterative    Recursive   Iterative
P1 (ns)    1293/1957    750/1341      475/963     463/1293     408/718     436/797
P2 (μs)    5118/7280    5118/7280     2073/920    1911/912     1719/636    1659/601

Table 2 Experiments with pure parallel and sequential circuits.

                                   Ns           Fmax (MHz)    ET (ns)
Even–odd merge sorting network     474 (6%)a    21.0          47.5
Bitonic merge sorting network      584 (8%)a    21.4          46.7
HFSM-based circuit                 279 (4%)a    122.3         32.8

a Resources used also include circuits for interactions with the host computer.

Fig. 11. Resource (a) and performance (b) evaluation for different circuits implemented and tested in the Atlys board.

Table 3 The results of the projects calculating the greatest common divisor of j unsigned 32-bit integers (see Sections 4-6 above for details).

Project (j = 4 for the first four lines     Atlys prototyping board (Xilinx FPGA Spartan-6)
and j = 8 for the last line)                Ns      Fmax (MHz)
The best previous result                    737     62.9
Sequential (without memory blocks)          616     64.2
Sequential (with memory blocks)             343     74.7
Parallel1 (with memory blocks)              446     70.9
Parallel2 (with memory blocks)              721     66.2

It is known that FPGAs give the best results for highly parallel implementations. Sorting networks can be seen as good examples. The main problem is that they require excessive FPGA resources. For example, the results of [25] show that even in the relatively advanced and expensive FPGA XC5VFX130T from the Xilinx Virtex-5 family, the maximum number of sorted data items (of size M = 32 bits) is 64. To overcome this problem we combined combinational and sequential operations [12], which makes it possible to increase the number of data items. We synthesized three circuits implementing the fastest known combinational even–odd merge and bitonic merge sorting networks and compared the results with circuits that sequentially reuse even–odd transition segments implemented in a dedicated HFSM module. Tests were done in the Atlys prototyping board for eight 32-bit data items. Table 2 presents the results. At first glance it looks strange that the sequential circuit gives the best results, but it is true because propagation delays in FPGAs have been significantly reduced. Thus, sequential circuits might be better than pure parallel combinational circuits. Let us look now at an HFSM module that implements the pipeline depicted in Fig. 9 and compare the results with the best known sorting networks (see Fig. 11 where L is the number of pipeline registers). The even–odd merge network (which is one of the least resource consuming of the known pure combinational sorting networks) can be implemented in the chosen FPGA only for up to 32 items (M = 32) and for a bigger number of items, the resources of the FPGA are not sufficient. On the other hand, we were able to implement circuits for up to 256 items, i.e. eight times larger (see Fig. 11). Table 3 gives comparison results for Xilinx projects calculating the greatest common divisor of j unsigned 32-bit integers. Here, Fmax is the maximum attainable clock frequency in MHz. The best previous result achieved by the authors is shown in the first line of the table. The recursive algorithm described in Section 4 was used. Clearly, the previous results are improved upon.

We believe that in future work it might be possible to automate the process of generating the compiled code/hardware module combination by adding a pre-processor to a C compiler. Then the original programmer would only need to know how to flag functions that potentially could be in hardware and/or executed in parallel. The programmer could then use the PS/PL facility without needing to understand in detail how the hardware modules were created. We hope that such a feature can be very important for the methods described in [11].

8. Conclusion

The paper presents new methods that enable software modules to be converted to hardware implementations. The proposed technique is based on the known model of a hierarchical finite state machine and it permits more complicated cases of hierarchy and parallelism to be realized in electronic circuits, namely: HFSM modules can accept and return data much as in software; different modules can be executed concurrently; better optimization methods can be applied. The results of numerous experiments with applications from different areas have clearly demonstrated the advantages of the proposed technique (namely, an increase in performance and a decrease in hardware resources) and its broad applicability to both autonomous devices and hardware accelerators for software products.

Acknowledgments

The authors would like to thank Ivor Horton for his very useful comments and suggestions and Artjom Rjabov for making some experiments in EPP. This research was supported by FEDER through the Operational Program Competitiveness Factors – COMPETE and National Funds through FCT – Foundation for Science and Technology in the context of the Projects FCOMP01-0124-FEDER-022682 (FCT reference PEst-C/EEI/UI0127/2011) and Incentivo/EEI/UI0127/2013.
References

[1] Santarini M. Zynq-7000 EPP sets stage for new era of innovations. Xcell J 2011(75) [second quarter].
[2] Skliarova I, Sklyarov V. Recursion in reconfigurable computing: a survey of implementation approaches. In: Proceedings of the 19th international conference on field-programmable logic and applications – FPL'2009, 2009. p. 224–9.
[3] Estrin G. Organization of computer systems – the fixed plus variable structure computer. In: Proceedings of western joint IRE-AIEE-ACM computer conference, 1960. p. 33–40.
[4] Villarreal JR, Park A, Najjar WA, Halstead R. Designing modular hardware accelerators in C with ROCCC 2.0. In: Proceedings of the 18th annual international IEEE symposium on field-programmable custom computing machines – FCCM 2010, 2010. p. 127–34.
[5] Latif K, Aziz A, Mahboob A. Optimal utilization of available reconfigurable hardware resources. Comput Electr Eng 2011;37(6):1043–57.


[6] Aziz SM. A cycle-accurate transaction level SystemC model for a serial communication bus. Comput Electr Eng 2009;35(5):790–802.
[7] Sabaei M, Dehghan M, Faez K, Ahmadi M. A VHDL-based HW/SW cosimulation of communication systems. Comput Electr Eng 2001;27:333–43.
[8] King M, Dave N, Arvind. Automatic generation of hardware/software interfaces. In: Proceedings of XVII international conference on architectural support for programming languages and operating systems, 2012. p. 325–36.
[9] Guo Z, Najjar W, Buyukkurt B. Efficient hardware code generation for FPGAs. ACM Trans Architec Code Optim 2008;5(1).
[10] Berkeley Design Techn., Inc., An independent evaluation of high-level synthesis tools for Xilinx FPGAs, 2010. .
[11] Hollstein T, Glesner M. Advanced hardware/software co-design on reconfigurable network-on-chip based hyper-platforms. Comput Electr Eng 2007;33(4):S.310–9.
[12] Sklyarov V, Skliarova I. Parallel processing in FPGA-based digital circuits and systems. TUT Press; 2013.
[13] Sklyarov V. Reconfigurable models of finite state machines and their implementation in FPGAs. J Syst Architec 2002;47:1043–64.
[14] Sklyarov V. Finite state machines with stack memory and their automatic design. In: Proceedings of USSR conference on computer-aided design of computers and systems, Part 2; 1983. p. 66–7 [in Russian].
[15] Neishaburi MH, Zilic Z. Hierarchical Trigger generation for post-silicon debugging. In: Proceedings of the international symposium on VLSI design, automation and, test – VLSI-DAT'2011, 2011. p. 1–4.
[16] Hu W, Zhang Q, Mao Y. Component-based hierarchical state machine – a reusable and flexible game AI technology. In: Proceedings of the 6th IEEE joint international conference on information technology and artificial intelligence – ITAIC, 2011. p. 319–24.
[17] Mihhailov D, Sklyarov V, Skliarova I, Sudnitson A. Acceleration of recursive data sorting over tree-based structures. Electron Electr Eng 2011;7(113):51–6.
[18] Ninos S, Dollas A. Modeling recursion data structures for FPGA-based implementation. In: Proceedings of the 18th international conference on field-programmable logic and applications – FPL'2008, 2008. p. 11–6.
[19] Muñoz DM, Llanos CH, Ayala-Rincón M, van Els RH. Distributed approach to group control of elevator systems using fuzzy logic and FPGA implementation of dispatching algorithms. Eng Appl Artif Intell 2008;21:1309–20.
[20] Harel D. Statecharts: a visual formalism for complex systems. Sci Comput Programm 1987;8:231–74.
[21] Sklyarov V, Skliarova I, Pimentel B. FPGA-based implementation and comparison of recursive and iterative algorithms. In: Proceedings of the international conference on field-programmable logic and applications – FPL'05, 2005. p. 235–40.
[22] Sklyarov V. FPGA-based implementation of recursive algorithms. Microprocess Microsyst 2004;28(5–6):197–211 [Special Issue on FPGAs: Applications and Designs].
[23] Xillybus, Xillybus Lite for Zynq-7000: easy FPGA registers with Linux. .
[24] Zakrevskij A, Pottosin Yu, Cheremisinova L. Combinatorial algorithms of discrete mathematics. Tallinn: TUT Press; 2008.
[25] Mueller R, Teubner J, Alonso G. Sorting networks on FPGAs. Int J Very Large Data Bases 2012;21(1):1–23.

Valery Sklyarov received the Ph.D. degree in computer science in 1978, the Doctor of Science degree in computer science in 1986, and the aggregation (agregação) in electrical engineering in 2001. He is currently a professor in the Department of Electronics, Telecommunications and Informatics, University of Aveiro, Portugal. He has authored and co-authored 19 books and over 300 papers.

Iouliia Skliarova received the M.Sc. degree in computer engineering in 1998, and the Ph.D. degree in electrical engineering in 2004. She is currently an assistant professor in the Department of Electronics, Telecommunications and Informatics, University of Aveiro, Portugal. She has authored and co-authored two books and over 100 papers on subjects which include reconfigurable systems, digital design, and combinatorial optimization.