Microprocessors and Microsystems 44 (2016) 2–16
Microprocessors and Microsystems journal homepage: www.elsevier.com/locate/micpro
On hardware synthesis and implementation of PLC programs in FPGAs Adam Milik∗ Institute of Electronics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
Article info
Article history: Received 28 September 2015; Revised 30 December 2015; Accepted 3 February 2016; Available online 13 February 2016
Keywords: PLC; FPGA; DSP48; LD; IL; SFC; DFG; High level synthesis; Logic synthesis; Reconfigurable hardware
Abstract
Many processes require controllers with an instant response (e.g. motor control, CNC machines). A high-performance PLC can be constructed with the use of programmable logic devices. The lack of custom synthesis tools prevents the use of standard languages widely accepted by automation designers. The paper presents a systematic process for synthesizing a PLC program into a hardware structure. The input PLC program is given according to the IEC61131-3 standard. The synthesis process has been developed for programs described in the LD and SFC languages. The essential idea of the synthesis process is to obtain a massively parallel hardware structure that significantly reduces the response time. The PLC program is translated into an originally developed, dedicated graph structure that enables a wide range of optimizations. In the next step, it is mapped into a hardware structure. In order to reduce resource requirements, a strategy with resource sharing is shown, which is an original extension of general mapping concepts. Modern FPGAs are equipped with arithmetic cores dedicated to signal processing, which inspired the development of the original DSP48 block mapping strategy. It attempts to utilize all features of the block in the pipelined calculation model. The considerations are summarized with implementation results compared against a standard PLC implementation, together with a mutual comparison of the general hardware mapping and the mapping that uses DSP48 units. © 2016 Elsevier B.V. All rights reserved.
1. Introduction
The introduction of Programmable Logic Controllers (PLCs) in the late 1960s allowed for fast development and modification of control algorithms. PLCs quickly replaced electromechanical (relay-based) control systems and revolutionized the design methods of control algorithms. They radically reduced the complexity of control system development: modifying a control algorithm requires changes only in a program stored in memory, and physical modification of the controller circuit is no longer required. Initially PLCs replaced binary control systems. Their processing abilities were quickly extended to numerical calculation. Finally, the PLC became a universal platform capable of handling a wide range of control tasks, from pure binary control to advanced numerical calculations. In a controlled system, a PLC closes a feedback loop between sensors and actuators. To be useful for proper system control, the response to an input change must be worked out within a given time. In contrast to general purpose computations, control systems work with tight time dependencies. The PLC performance is therefore a key factor in control system design. The increasing complexity of control algorithms, the large number of handled signals and specific sensors require a high-performance processing platform capable of instant
Tel.: +48 507 295 415; fax: +4832 237 22 25.
http://dx.doi.org/10.1016/j.micpro.2016.02.003 0141-9331/© 2016 Elsevier B.V. All rights reserved.
delivery of calculation results. The problem of control system response time is raised in [1,2]. The problem of predictability of the time response is discussed in [3,4]. Response time is the critical factor of a PLC. In the case of binary control, the response to signal changes is expected to be worked out as quickly as possible. For discrete time control systems, the calculation process must be completed within a given sampling period. Multiple conditional paths inside a program prevent precise determination of the execution time [3]. These limitations encourage the development of custom controllers implemented in programmable hardware structures.
1.1. Programming
The great success of PLCs is connected to the ease of programming. The first implementations used machine language, which required an understanding of the PLC architecture and operation. The introduction of the Ladder Diagram (LD) language enabled the creation of a control algorithm in almost the same way as that to which relay system designers had become accustomed. It should be noted that the LD implementation does not follow all physical features of relay circuits. The development of different languages required standardization, which started in the late 1980s. The result of this process is a standard currently known as IEC61131, which is continuously developed through subsequent revisions. The standard covers different aspects of
Fig. 1. PLC operation cycle (A) and response time (B).
a PLC construction. Part 3 of the IEC61131 describes programming languages [5–7]. There is a wide range of programming languages available. They start from the simple textual instruction list (IL) to the high level structured text (ST) language. The LD, in spite of its simplicity, is still very popular among automation designers. It has been inherited from relay-based control systems design methodology originating in electrical schematics. Languages like LD or IL are perfectly suited for describing simple control tasks. When the complexity of the control task increases and mutually parallel operations are described, the Sequential Function Chart (SFC) is used. The SFC language is based on GRAFCET [8] and enables describing the control algorithm in a form of linked steps with the ability of parallel execution, conditional control passing and synchronization of control passing between steps. Steps are linked with actions that are responsible for implementing control activities that can be described with IL or LD. The SFC introduces a control hierarchy that enables partitioning of a control problem with the respective programming representation. All these languages allow one to express a control algorithm independently from a target hardware platform. The standard describes the syntax and evaluation rules. Automation designers concentrate on solving the control problem, while the implementation details are hidden behind the standard and implementation tools. 1.2. PLC program evaluation and execution performance The standard PLC operation cycle is shown in Fig. 1A. It is based on serial-cyclic evaluation of control program. Internal operations and a program processing part can be distinguished. The control program execution is one of the components of a PLC program loop. It must not contain unconditional infinite loops or conditional loops preventing the main loop evaluation. 
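The serial-cyclic evaluation described above can be sketched in a few lines. This is only an illustration under stated assumptions, not a model of any particular controller: the names `scan_cycle` and `program`, the dictionary-based process image and the two-rung latch program are all hypothetical and were chosen for the example.

```python
from typing import Callable, Dict

Image = Dict[str, bool]  # process image: variable name -> logic value


def scan_cycle(read_inputs: Callable[[], Image],
               program: Callable[[Image, Image], None],
               outputs: Image) -> Image:
    """One serial-cyclic PLC cycle: sample inputs, run the program, emit outputs."""
    inputs = read_inputs()      # inputs are sampled once and frozen for the cycle
    program(inputs, outputs)    # the control program is evaluated serially
    return dict(outputs)        # a copy of the output image is written out


def program(i: Image, q: Image) -> None:
    """Hypothetical two-rung program: a start/stop latch and an indicator."""
    q["run"] = (i["start"] or q["run"]) and not i["stop"]
    q["lamp"] = q["run"]


if __name__ == "__main__":
    state = {"run": False, "lamp": False}
    for sample in ({"start": True, "stop": False},
                   {"start": False, "stop": False},
                   {"start": False, "stop": True}):
        print(scan_cycle(lambda s=sample: s, program, state))
```

Because the inputs are sampled only at the start of each pass, an input change that arrives just after sampling is not seen until the next cycle, which is the source of the response time variation discussed next.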
PLC response time varies depending on the moment of an input event arrival relative to the program loop execution stage (Fig. 1B). The performance improvement of logic controllers and control algorithm execution is an important problem that has been investigated extensively [1,4]. Standardized programming languages determine the input for this problem. The lack of a standard set of benchmarks, like those available for logic synthesis, makes comparison of results difficult. Possible solutions are sought using different approaches. The first is based on continuous improvement of the PLC CPU. Efforts are made to obtain a custom CPU structure that tightly adheres to the IL programming language [9]. The IL can be regarded as a fundamental language for control implementation. The serial execution of instructions is the main performance limitation. Efforts have been made to develop processing units with independent bit and word processors with tightly coupled synchronization mechanisms [10]. The architecture allows partially parallel evaluation of bit and byte instructions. This solution improves the hardware platform independently of the executed program. The parallel execution of a control program is a way to significantly increase performance and reduce response time. The parallel
computation of control tasks requires the development of a hardware platform and respective tools for translating a user program that is able to take the benefits from the platform. This approach has been proposed by several research groups [11–15]. It was even shown that controllers with advanced fuzzy algorithms are efficiently implemented in hardware [16]. The FPGA device enables the programmability of the target platform, similar to microprocessors. A circuit structure is controlled by a configuration stream that can be compared to a program stored in a memory. A specific FPGA architecture for direct implementation of a LD program has been proposed in [17]. The efficiency of hardware implementation depends on the employed high-level synthesis algorithms. Early studies of direct mapping of PLC programs to hardware structures were [11,12]. The essential problem of synthesis is extracting parallel fragments of the implemented program. Initially this problem was investigated for purpose of obtaining the SFC structure from a LD program [18]. A similar methodology for extracting parallel operations was proposed in [11]. The LD implementation has been limited to logic operations disregarding implementation of arithmetic and arithmetic-dependent operations like timers or counters. One interesting approach to LD synthesis is discussed in [15]. The LD implementation with a modular-based approach to arithmetic functions implementation was shown in [19]. A very specific approach was proposed in [13]. These authors proposed translation of IL to C language. Finally, the obtained C language was synthesized to hardware with use of a commercially available high-level synthesis tool. In order to obtain the best implementation, the authors suggest a design space search with use of different IL to C conversion strategies. The proposed approach seems to be time-consuming and unfeasible. 
An interesting hybrid implementation based on a custom hardware and software implementation was proposed in [20]. The implementation takes benefits from a fast hardware evaluation of logic statements while arithmetic operations are processed with use of a microprocessor soft-core. 1.3. Paper outline The paper describes the originally developed complex implementation process of a mixed-language PLC program. It is aimed at obtaining hardware architecture with massive parallel processing and reduced requirements for logic resources. The implementation process of a control program is shown in Fig. 2. The problems addressed in this paper cover a wide range of issues from analysis of input program to the final implementation. An originally developed graph-based method of representing control program independently from a language is demonstrated. The language-independent form is obtained by analyzing the input program using its grammar [21,22]. Formal methods have been developed for obtaining graph forms from LD and SFC languages that are an extension of studies elsewhere [23,24]. The intermediate form allows the application of optimizations for logic and arithmetic operations. It is suitable for generating a wide range of
outputs from dedicated hardware to target independent instruction sequences. The second part is devoted to the problems of mapping the language-independent structure to a wide range of FPGA resources. It is focused on the implementation of fixed point arithmetic and logic operations. Mapping strategies are considered that utilize typical resources (i.e. a general hardware mapping strategy). These were inspired by High-Level Synthesis algorithms [25–27], which were improved and adapted to the specific requirements of controller implementation. In order to reduce the resource requirements in the mapping procedure, resource sharing is used. The method determines the resource sharing cost factor for each resource. The sharing cost factor allows one to balance operation distribution among the available calculation resources. A significant reduction of resource sharing cost is observed with the use of distributed RAM blocks. Modern FPGA devices are equipped with dedicated arithmetic resources for signal processing. Here, a strategy enabling the use of DSP48 hardware cores is described [28]. These high-performance cores are used for creating an efficient pipelined calculation system. The paper concludes with implementation benchmarks of reference control programs implemented with the use of different strategies. The achieved performance, resource allocation and utilization degree of specific components are compared. In order to relate the proposed architecture and synthesis process to standard solutions and other specific PLC architectures, the execution times of selected control tasks are compared.
Fig. 2. Control program synthesis flow.
2. Language-independent PLC program representation
The efficient implementation of the control program requires the development of a language-independent representation that allows the program synthesis process to be performed until a synthesizable HDL description is obtained. The intermediate form is the result of analysis of the input program with the use of a processing model. Before introducing the language-independent intermediate representation, the LD and the SFC formal models are considered.
2.1. LD analysis model
For the purpose of analysis and synthesis, a model of the LD program has been developed. The sequential method of evaluation has been utilized for developing the model. Rungs are evaluated in top-down order; each rung is evaluated from left to right with the restriction that all preceding components (from the left) are already evaluated. The unidirectional energy flow through components is a limitation implied by the LD implementation and stands in contrast to the physical behavior of the electrical circuit. Taking into account these considerations, the LD model is initially described by the following three elements:
$$ LD = \langle I, Q, F \rangle \tag{1} $$
where: I is a set of variables associated with input signals, Q is a set of variables associated with output signals and internal markers, and F is an ordered set of functions described by rung operations. The network consisting of r rungs is described by an ordered set of functions:
$$ q_i = f_i(I_i, Q_i), \quad i = 1 \ldots r, \quad I_i \subseteq I, \quad Q_i \subseteq Q, \quad f_i \in F \tag{2} $$
The sequential nature of evaluation process suggests that the whole network is evaluated in r steps. A value of variables belonging to the set I is updated before the calculation process starts, and does not change during rung evaluation (see Fig. 1A). A value of variables belonging to the set Q is successively updated with the evaluation process progress. The sequential nature of the evaluation process can be transformed to fully parallel operation by the following substitution:
$$
\begin{aligned}
q_1 &= f_1(I_i, q_1 \ldots q_r), && i = 1 \\
q_i &= f_i(I_i, f_1 \ldots f_{i-1}, q_i \ldots q_r), && i = 2 \ldots r
\end{aligned}
\tag{3}
$$
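The effect of the substitution in (3) can be checked on a small example. The sketch below is an illustration only: the three rungs, the variable names and the helper `check_equivalence` are hypothetical and do not come from the paper. The first function updates the outputs rung by rung; the second expresses every rung purely in terms of the inputs and the previous-cycle outputs, with each earlier rung output replaced by its function.

```python
from itertools import product
from typing import Dict

State = Dict[str, bool]


def sequential(i1: bool, i2: bool, q: State) -> State:
    """Rung-by-rung evaluation; outputs are updated as each rung completes."""
    q = dict(q)
    q["q1"] = i1 or q["q3"]             # q3 not yet updated: previous-cycle value
    q["q2"] = q["q1"] and not i2        # q1 already updated: current value
    q["q3"] = q["q2"] and not q["q3"]   # own previous value on the right-hand side
    return q


def parallel(i1: bool, i2: bool, q_prev: State) -> State:
    """All rungs as functions of the inputs and previous-cycle outputs only."""
    f1 = i1 or q_prev["q3"]
    f2 = f1 and not i2                  # q1 substituted by f1
    f3 = f2 and not q_prev["q3"]        # q2 substituted by f2
    return {"q1": f1, "q2": f2, "q3": f3}


def check_equivalence() -> bool:
    """Compare both models for every input and every stored state."""
    for i1, i2, a, b, c in product([False, True], repeat=5):
        q = {"q1": a, "q2": b, "q3": c}
        if sequential(i1, i2, q) != parallel(i1, i2, q):
            return False
    return True


if __name__ == "__main__":
    print(check_equivalence())
```

In `parallel`, the intermediate names `f1`, `f2` and `f3` are ordinary combinational expressions; in hardware they correspond to logic evaluated within one step rather than to stored values.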
The substitution process replaces all preceding rungs output variables (q) with its evaluated value (f) while current and succeeding rungs output variables are taken from the memory (calculated in a previous cycle). Finally, all rungs are evaluated at once introducing discrete in time operation based on previous (qi ) and current values (fi ). The idea has been depicted with use of block diagrams. In Fig. 3A, a sequential implementation that utilizes outputs of registers as a feedback signal is shown. Registers are updated sequentially with the calculation progress. The implementation that utilizes the parallel evaluation method is depicted in Fig. 3B. This approach of translating the LD description is beneficial for parallel implementation. Finally the LD can be described by a functionally equivalent parallel evaluation model:
$$
\begin{aligned}
LD &= \langle I, Q, F, Q_0 \rangle \\
q_1^{(n)} &= f_1\big(I_i, q_1^{(n-1)} \ldots q_r^{(n-1)}\big), && i = 1 \\
q_i^{(n)} &= f_i\big(I_i, f_1 \ldots f_{i-1}, q_i^{(n-1)} \ldots q_r^{(n-1)}\big), && i = 2 \ldots r
\end{aligned}
\tag{4}
$$
The functions $f_i$ depend on the set of inputs I, on the currently calculated results ($Q^{(n)}$) and on the previously calculated results ($Q^{(n-1)}$). The set $Q_0$ holds the initial values of the variables for the first calculation step. The substitution given by (3) reveals the variables' dependencies, eliminating sequential evaluation of rungs and passing of the partial result. Finally, the processing is formulated as discrete processing, representing a set of bounded finite state machines.
2.2. SFC analysis model
The following SFC model has been proposed conforming to IEC61131-3 requirements. It is described by four elements:
$$ SFC = \langle S, T, A, S_0 \rangle : S_0 \subseteq S \tag{5} $$
where: S is a set of steps, T is a set of transitions, A is a set of actions and $S_0$ is a set of initially active steps. The transitions that connect the steps are described by a triple:
$$ t = \langle S_P, S_S, c \rangle : t \in T, \; S_P \subseteq S, \; S_S \subseteq S \tag{6} $$
where: SP is a set of preceding steps, SS is a set of succeeding steps and c is a logic condition that fires the transition. The condition is
Fig. 3. LD processing models: sequential (A) and parallel (B).
formulated with the use of other languages (LD, IL or FBD). This makes the SFC description flexible but also complicates the synthesis approach and requires the implementation of a mixed-language compiler. Each step s contains a Boolean variable x that denotes its activity, referred to as s.x [5]. A two-state FSM is an equivalent representation of each step. The set S consists of mutually bounded two-state FSMs. The synthesis process requires determination of the activation function s.xSET and the deactivation function s.xCLR for each step:
$$ s.x = \big(s.x_{SET} \wedge \overline{s.x}\big) \vee \big(\overline{s.x_{CLR}} \wedge s.x\big) \tag{7} $$
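The two-state FSM of a single step can be written directly from (7). The sketch below assumes the reading in which a step becomes active when its set function holds while it is inactive, and stays active as long as its clear function does not hold; the function name `next_step_state` is hypothetical.

```python
def next_step_state(x: bool, x_set: bool, x_clr: bool) -> bool:
    """Next value of the step activity variable s.x.

    Assumed reading of (7): set while inactive, hold unless cleared.
    """
    return (x_set and not x) or (not x_clr and x)


if __name__ == "__main__":
    # Print the full transition table of the two-state FSM.
    for x in (False, True):
        for x_set in (False, True):
            for x_clr in (False, True):
                print(x, x_set, x_clr, "->", next_step_state(x, x_set, x_clr))
```

Under this reading an active step whose clear function fires becomes inactive in the next cycle even if its set function is also true, because the set term only applies to an inactive step.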
The general synthesis model has been developed from the basic step sequences shown in Fig. 4. The linear sequence (Fig. 4A) passes activity from the preceding step $s_p$ to the succeeding step $s_s$ depending on c:
$$ s_p.x_{CLR} = c, \qquad s_s.x_{SET} = s_p.x \wedge c \tag{8} $$
The convergent sequence (Fig. 4B) is a composition of the previous case:
$$ s_{p_i}.x_{CLR} = c_i, \qquad s_s.x_{SET} = \bigvee_{i} \big( s_{p_i}.x \wedge c_i \big) : i = 1..n \tag{9} $$
The divergent sequence (Fig. 4C) represents an activity distribution. Disregarding the priority evaluation factors ($p_i$), the following equations describe the control flow:
$$ s_p.x_{CLR} = \bigvee_{i} c_i, \qquad s_{s_i}.x_{SET} = s_p.x \wedge c_i : i = 1..n \tag{10} $$
The formal design rules require that the conditions firing the transitions be mutually exclusive:
$$ c_i \wedge c_j = 0 : i = 1..n, \; j = 1..n, \; i \neq j \tag{11} $$
In order to simplify the manual design process, priority labeling of transitions is used (the $p_i$ label next to a transition). It implements an implied priority encoding of conditions (an equivalent of nested if-else statements):
$$
cp_i =
\begin{cases}
c_i & : i = 1 \\
\bigwedge_{j=1}^{i-1} \overline{c_j} \wedge c_i & : i = 2..n, \; p_{i-1} > p_i
\end{cases}
\tag{12}
$$
where $cp_i$ denotes the priority-resolved condition formula for the ith transition. The parallel activity of the steps is described with the simultaneous sequence (Fig. 4D). It is used for spreading (forking) to multiple paths and synchronizing (joining) control paths. For simplicity of description, the subsets $S_P$ and $S_S$ are used to describe the activity formula:
$$ s_p.x_{CLR} = s_s.x_{SET} = \Big( \bigwedge_{s_p \in S_P} s_p.x \Big) \wedge c \tag{13} $$
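The priority encoding of divergent conditions in (12) behaves like a chain of nested if-else tests. The following sketch illustrates this under two assumptions that are not taken from the paper: the conditions are passed already ordered from highest to lowest priority, and the function name `priority_resolve` is hypothetical.

```python
from itertools import product
from typing import List, Sequence


def priority_resolve(c: Sequence[bool]) -> List[bool]:
    """Return cp where cp[i] = c[i] and no higher-priority condition holds.

    Index 0 is assumed to carry the highest priority.
    """
    resolved: List[bool] = []
    blocked = False                 # becomes True once any earlier c[j] is true
    for ci in c:
        resolved.append(ci and not blocked)
        blocked = blocked or ci
    return resolved


def check_properties(n: int = 4) -> bool:
    """Resolved conditions are mutually exclusive and keep the same logic sum."""
    for c in product([False, True], repeat=n):
        cp = priority_resolve(c)
        if sum(cp) > 1 or any(cp) != any(c):
            return False
    return True


if __name__ == "__main__":
    print(priority_resolve([False, True, True]))
    print(check_properties())
```

The resolved conditions satisfy the mutual exclusion rule (11) by construction, even when the raw conditions overlap, and their logic sum equals the logic sum of the raw conditions.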
The SFC action describes a control activity that is linked to steps that control its execution. Each action specifies an independent control unit dependent on an action qualifier. The standard defines 9 different action qualifiers [5,6]. The action block execution depends not only on linked steps activity, but also on the time passed since activation (i.e. time delayed or limited) or calculation cycles (pulse type actions). 2.3. Language-independent representation for control program For the processing of a mixed language description, an Enhanced Data Flow Graph (EDFG) with attributed edges has been developed. This has been inspired by concepts of DFGs [25,27] and attributed edges used in BDDs [29]. The EDFG initially records only
Fig. 4. Basic step and transition sequences in the SFC.
Fig. 5. EDFG basic properties.
the processing sequence; it carries neither a mapping to clock cycles nor an assignment of hardware resources. The EDFG is created in the incremental analysis process [21,22]. It is aimed at obtaining the single loop run processing structure. A sequence of node and graph transformations allows its adaptation to hardware structure requirements. The EDFG handles unary operations like logic inversion and arithmetic complement using attributes assigned to graph edges. Another extension that has been implemented and widely utilized is multiple argument nodes for commutative operations. The conditional flow of processing is represented by the selection node. The edge attributes allow one to distinguish between arguments and the selection condition. Attributes are also utilized for non-commutative operations (e.g. division). The EDFG is given by:
$$ G = \langle V, E \rangle : V = \{v_1, \ldots, v_n\}, \; E = \{e_1, \ldots, e_m\} \tag{14} $$
where: V is a set of nodes and E is a set of directed edges with attributes. The directed edge e is described by a triple:
$$ e = \langle v_{SRC}, v_{DST}, a \rangle : v_{SRC} \in V, \; v_{DST} \in V, \; a \in A = \{a_1, \ldots, a_k\} \tag{15} $$
where: $v_{SRC}$ is the predecessor node, $v_{DST}$ is the successor node of the directed edge and a is an attribute of the edge chosen from the set A. Exemplary EDFGs are shown in Fig. 5. The attributed edge enables the combination of an assignment operation with a logic inversion or an arithmetic complement. This enhancement simplifies algorithms for graph creation, optimization and hardware mapping. The raw forms of the logic graph (A) and the arithmetic graph (C) obtained during the compilation process are shown. Next to them, Fig. 5 shows the respective graphs after the argument merge operations (B) and (D). A conditional execution case is considered in (E).
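The definitions (14) and (15) translate naturally into a small data structure. The sketch below is one possible encoding, not the author's implementation: the class names `EDFG`, `Edge` and `Attr`, the three attribute values and the string node labels are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Attr(Enum):
    PLAIN = "plain"     # value passed unchanged
    INVERT = "invert"   # logic inversion or arithmetic complement on the edge
    SELECT = "select"   # marks the selection condition of a conditional node


@dataclass(frozen=True)
class Edge:
    src: int            # index of the predecessor node (v_SRC)
    dst: int            # index of the successor node (v_DST)
    attr: Attr = Attr.PLAIN


@dataclass
class EDFG:
    nodes: List[str] = field(default_factory=list)   # operation label per node
    edges: List[Edge] = field(default_factory=list)

    def add_node(self, op: str) -> int:
        self.nodes.append(op)
        return len(self.nodes) - 1

    def connect(self, src: int, dst: int, attr: Attr = Attr.PLAIN) -> None:
        self.edges.append(Edge(src, dst, attr))

    def fan_in(self, v: int) -> List[Edge]:
        return [e for e in self.edges if e.dst == v]


def build_example() -> EDFG:
    """q = a AND (NOT b): the inversion lives on the edge, not in a node."""
    g = EDFG()
    a, b = g.add_node("read a"), g.add_node("read b")
    and_node, q = g.add_node("AND"), g.add_node("write q")
    g.connect(a, and_node)
    g.connect(b, and_node, Attr.INVERT)
    g.connect(and_node, q)
    return g
```

Keeping the inversion on the edge means that `q = a AND (NOT b)` needs four nodes rather than five, which is the kind of simplification the attributed edges are meant to provide.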
Fig. 6. EDFG optimisation problems.
The attributed edges allow one to distinguish the different arguments of the conditional node (the multiplexer-shaped node). The last case also shows the combining of arithmetic and logic operations. It should be noted that the argument merge operation enables constant propagation. The merge process utilizes De Morgan's laws for logic nodes, extending the merge procedure. Only limited logic optimization is possible at an early stage of synthesis through EDFG transformations. Unfortunately, the EDFG structure is not well suited for complete logic minimization (Fig. 6A). Nodes that terminate a chain of logic operations are subject to logic minimization. For these nodes, a minimal sum of products is calculated (Fig. 6B). This operation enables optimization of conditional arithmetic paths in the analyzed graph (Fig. 6C). The arithmetic operations are subject to elementary algebraic optimization, which is based on constant evaluation and the reduction of complementary argument pairs in additive operations. Constant evaluation enables argument reduction, or even absorption of the operation node.
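The elementary algebraic optimization of an additive node can be illustrated as follows. This is a sketch under stated assumptions rather than the paper's procedure: a multiple argument addition node is modelled as a list of signed terms, where the sign plays the role of the complement attribute on the edge; the name `simplify_sum` is hypothetical, and word-length and overflow effects of fixed point arithmetic are ignored.

```python
from collections import Counter
from typing import List, Tuple, Union

Term = Tuple[int, Union[int, str]]   # (sign, operand); sign is +1 or -1


def simplify_sum(terms: List[Term]) -> List[Term]:
    """Fold constants and cancel complementary pairs such as +x and -x."""
    const = 0
    counts: Counter = Counter()
    for sign, operand in terms:
        if isinstance(operand, int):
            const += sign * operand          # constant evaluation
        else:
            counts[operand] += sign          # +x and -x cancel each other
    result: List[Term] = []
    for name in sorted(counts):
        n = counts[name]
        result.extend([(1 if n > 0 else -1, name)] * abs(n))
    if const != 0:
        result.append((1 if const > 0 else -1, abs(const)))
    return result


if __name__ == "__main__":
    # a + 3 - a + b - 1 reduces to b + 2
    print(simplify_sum([(1, "a"), (1, 3), (-1, "a"), (1, "b"), (-1, 1)]))
```

When the simplified list is empty or holds a single term, the addition node itself can be absorbed.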
Fig. 7. EDFG mapping of basic LD components.
2.3.1. EDFG representation of LD
The LD model is used to obtain the EDFG in the compilation process. The substitution of selected elements with equivalent EDFG fragments is shown in Fig. 7. The compilation process successively delivers components and their connections from the analyzed network. Each component is connected to network nodes that are denoted by ni, where i is the node index. Switches are elementary items of the LD. Their respective subgraphs are shown in points A, B and C of Fig. 7. In contrast to the typical logic AND operation, they are connected to the following node that eventually merges energy flow from other branches. During sequential analysis, the drivers of the following node are collected one by one. A switch is therefore represented by an EDFG structure consisting of an AND node followed by an OR node. The OR node enables connection between subsequent node drivers. A coil is substituted by a value assignment. The case of value reassignment is shown in Fig. 7D. The driving node is updated by changing the source of the directed arc that connects to the sink, which has the form of a value assignment node. During construction, the EDFG records variable dependencies. After construction, EDFG nodes without load are removed in a recursive manner. Formally, the process looks for zero rows in the adjacency matrix. The vertex associated with a zero row is eliminated, which results in removing the respective row and column from the adjacency matrix. The procedure is repeated iteratively until all zero rows are eliminated and further optimization is not possible. The LD also defines complex blocks that implement timers, counters and arithmetic operations. An exemplary complex block equivalent is shown in Fig. 7E. The equivalent subgraph of the block is inserted into the EDFG structure during the analysis process. It should be noted that Boolean type nodes are allowed to be driven by multiple sources (block outputs), producing the logic OR of all drivers.
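The removal of nodes without load can be sketched on an adjacency matrix. The code below is an illustration under assumptions that are not stated in the paper: `adj[i][j] == 1` denotes a directed edge from node i to node j, so a zero row means that a node drives nothing, and variable write nodes are protected through a hypothetical `keep` set because they legitimately have no successors inside the graph.

```python
from typing import List, Set


def prune_unloaded(adj: List[List[int]], keep: Set[int]) -> Set[int]:
    """Return the indices of the nodes that survive the pruning."""
    alive = set(range(len(adj)))
    changed = True
    while changed:                     # repeat until no zero row remains
        changed = False
        for v in sorted(alive):
            if v in keep:
                continue
            # Zero row, restricted to the columns of nodes that are still alive.
            if not any(adj[v][w] for w in alive):
                alive.remove(v)        # drops row v and column v together
                changed = True
    return alive


if __name__ == "__main__":
    # 0: read a, 1: AND, 2: write q (kept), 3: unused OR, 4: node feeding only 3
    adj = [
        [0, 1, 0, 0, 1],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0],
    ]
    print(prune_unloaded(adj, keep={2}))
```

Removing node 3 turns the row of node 4 into a zero row as well, which is why the procedure has to be repeated until nothing changes.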
The numerical outputs are handled as ordinary variables in programming languages and are the subject of subsequent assignments. In order to demonstrate the LD to EDFG conversion process, the example shown in Fig. 8A is considered. The diagram depicts a control system responsible for generating a pulse of 100 time units. The pulse generation process is triggered by the input signal I1 and control output Q1. The I2 input allows for immediate breaking of the pulse generation (i.e. master reset). Fig. 8B illustrates the EDFG after completing the generation process. The patterns used for substituting the respective components of the analyzed diagram can be distinguished. The developed translation method enables the representation of logic and arithmetic operations on a common graph. The mixed representation for logic and arithmetic operations is a significant improvement over the models proposed in [11,15], where consideration has been limited to logic operations and their relationship. The example also shows variable dependency tracking in the sequential analysis process. The Q2 signal is used for generating a triggering pulse as a memory element of the last value of I1. In the first rung, the reference to the Q2 value is made by accessing its stored value through the read node. The Q2 value is updated from the I1 input signal in the next rung. Similar processing is observed for the Q3 signal, which is accessed by the first rung but updated in the third rung. The Q1 signal is an example of a signal reference after assignment. In this case, signal Q1 is referred to by the third rung, while the EDFG records it as a directed edge from the Q1 driving node to all sink nodes. Fig. 8C depicts the EDFG after applying optimization, which enables constant propagation and single argument node removal. The EDFG does not allow the application of complete logic minimization directly, but allows a reduction of the redundant nodes and applies De Morgan's laws, reducing the total number of nodes. The arithmetic operations that are generated in the form of a long chain are merged into multiple argument nodes. This allows the introduction of algebraic optimizations, balancing the arithmetic operation tree.
This problem is addressed in detail in the mapping discussion in Section 3.
2.3.2. EDFG representation of SFC
The analysis process transforms an input SFC into a functional equivalent expressed as an EDFG. It has been divided into synthesis of control flow and control activities. The step activity variable is
Fig. 8. LD to EDFG translation example. (Panels A, B and C; legend: read variable V, write variable V, simple edge, inverting edge.)
represented with the basic structure given by (7), shown in Fig. 9A. The references to the SFC are made by the step variable and two nodes responsible for setting and clearing the step activity. The step setting condition refers to the additional logic-OR node. The proposed structure enables iterative linking of multiple transitions to the step, simplifying the EDFG construction process (Fig. 9B). In this sequence, the activity can be passed from different steps that target the common step $s_s$. The set node is iteratively linked to nodes representing logic formulas passing activity from the preceding steps according to (9). The divergent sequence case is considered in Fig. 9C. The activity is passed from the preceding step according to (10). Priority encoding for the activity passing given by (12) can be reduced to the logic sum by absorbing the priority part of the encoding:
$$ s_p.x_{CLR} = c_1 \vee \bigvee_{i=2}^{n} \Big( \bigwedge_{j=1}^{i-1} \overline{c_j} \wedge c_i \Big) = \bigvee_{i=1}^{n} c_i \tag{16} $$
The clear node is fed by the conditions of the divergent transitions. The simultaneous transition is constructed according to (13). This results in the creation of a common sub-EDFG used for both subsets of steps: preceding and succeeding (Fig. 9D). An activity controller is created for each action. The basic structure for N, S and R action triggering is depicted in Fig. 10A. The S and R actions create the activity storage with a hidden variable (a1.q) with a dominant reset action, as stated by the standard. The time-controlled actions (e.g. L and D) implement a timer unit (Fig. 10B). This unit is linked to the action activity merge node. The iterative method of conditional variable assignment triggered by multiple actions is presented in Fig. 10C.
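The behaviour of the N, S and R qualifiers with a dominant reset can be sketched as a one-cycle update function. This is an illustrative assumption rather than the structure of Fig. 10A: the function name `action_control` is hypothetical, each argument stands for the logic sum of the activities of all steps linked to the action with the given qualifier, and gating the N path with R is a choice made for this sketch.

```python
from typing import Tuple


def action_control(q_prev: bool, n: bool, s: bool, r: bool) -> Tuple[bool, bool]:
    """Return (active, q) for one calculation cycle.

    q is the hidden stored state driven by S and R, with reset dominant.
    Gating the N path with R as well is an assumption of this sketch.
    """
    q = (q_prev or s) and not r      # reset wins when S and R coincide
    active = (n or q) and not r
    return active, q


if __name__ == "__main__":
    q = False
    # Each tuple is (n, s, r) for one cycle.
    for n, s, r in [(False, True, False), (False, False, False),
                    (False, True, True), (True, False, False)]:
        active, q = action_control(q, n, s, r)
        print(n, s, r, "->", active, q)
```

The stored state keeps the action active after the setting step has been left, while an N-qualified step drives the action only while that step is active.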
The presented methods of generating an EDFG from an SFC are illustrated with an exemplary design shown in Fig. 11A. The SFC diagram depicts a tank controller example from [8]. The SFC is an example of mixed language design, where conditions are described with an LD. The LD analysis process discussed earlier is applied. The final result of generation process is shown in Fig. 11B. In order to show a correspondence between the input SFC and the result EDFG, its layout has been arranged to reflect the source. The dashed-line rectangles separate parts of the EDFG that correspond to respective steps. Transition firing conditions are placed outside the rectangular step blocks. It should be noticed that the firing condition is used for both the accepting and the giving away activity. The same condition is shared by the preceding and succeeding steps linked to the same transition. In the case of convergent transition, the condition is supplemented by the preceding steps activity checking. This can be observed for transition linking steps {s1, s4} → {s2, s5}. The firing condition merges the check of activity of steps s1 and s4 and logic signal m. The actions of N type are controlled directly from the step variables. The step variable is already assigned, which enables the controlling steps activity directly from nodes delivering step activity, instead of introducing an additional assignment cycle. This generation methodology enables the directed graph structure to be obtained. It reveals all parallel operations, enabling mapping strategies utilizing parallel calculations (e.g. hardware mapping). The obtained graph (Fig. 11B) is a single cycle controller that is able to complete all calculations during one cycle. The calculation model is independent of the number of steps or conditions. Calculations are guaranteed to complete with this single step approach. The step duration expressed in clock cycles is dependent on the hardware structure used for implementation. In
Fig. 9. EDFG mapping of SFC steps.
Fig. 10. EDFG mapping of SFC actions controller (A, B) and variable drivers (C).
Fig. 11. SFC to EDFG translation example.
contrast to programmatic approaches, the proposed representation achieves a predictable and constant response time.
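The single-cycle evaluation of a convergent transition described above can be illustrated with a minimal Python sketch. It is not the synthesis tool's code: the function name `sfc_cycle`, the dictionary representation of step activity, and the restriction to the single transition {s1, s4} → {s2, s5} are assumptions made only for illustration.

```python
def sfc_cycle(steps, m):
    """One controller cycle for the convergent transition {s1, s4} -> {s2, s5}."""
    # Firing condition: every preceding step is active AND the logic signal m is set.
    fire = steps["s1"] and steps["s4"] and m

    # Every next-state value is derived from the OLD state, so all assignments
    # are independent and could be evaluated in parallel within one cycle.
    nxt = dict(steps)
    for s in ("s1", "s4"):
        nxt[s] = steps[s] and not fire  # preceding steps give away activity
    for s in ("s2", "s5"):
        nxt[s] = steps[s] or fire       # succeeding steps accept activity
    return nxt


state = {"s1": True, "s4": True, "s2": False, "s5": False}
after = sfc_cycle(state, m=True)
```

The same `fire` value drives both the deactivation of the preceding steps and the activation of the succeeding ones, which mirrors the shared firing condition in Fig. 11B.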
3. General hardware mapping strategy

The general hardware mapping strategy is dedicated to implementation in general-purpose, regular LUT-based FPGA architectures. It assumes that basic arithmetic support is available in the FPGA structure, allowing the construction of adders. The implementation of a multiplication operation is target platform-dependent and can be a hardware multiplier core or a sequential or combinatorial implementation in general-purpose logic [30]. An integer division operation is not supported directly by a hardware core and must be implemented with the available logic resources. Timing dependencies, expressed in clock cycles for each operation, and the resource allocation factor are delivered to the implementation procedure. The resource allocation factor is important in the case of sharing: it allows the cost of the sharing logic to be compared with the cost of the shared resource [31]. If the sharing cost exceeds the resource cost, it is reasonable to create a new resource instance and reschedule the entire design.

Before the mapping procedure can take place, the EDFG structure must be adapted to hardware mapping. Each arithmetic node must correspond to an elementary arithmetic hardware resource. The optimisation procedures have merged arithmetic nodes, which results in tree height reduction (see Fig. 5D). The node merge also reduces the number of constant values: after optimization (i.e. propagation), the argument set of an operation node contains at most one constant node. Two operations are applied to adapt EDFG nodes to hardware components. The first is aimed at reducing the cost of handling complemented arguments. The second balances the calculation time of multiple-argument nodes by a calculation time-driven expansion method. After the initial EDFG adaptation, the scheduling and mapping procedures take place. 
Then, the direct mapping procedure and the resource sharing approach are considered.

3.1. Complement edges transformation

EDFG complement edges are a convenient means of transforming arithmetic operations. This specific feature requires a procedure that converts them into adder nodes and, where needed, logic operations. The result is a regular EDFG structure that enables extended sharing of adder units. When all transformations (i.e. optimizations) on the arithmetic level are completed, the complement edges are translated into equivalent hardware. In the two's complement system, the complement value is given as $-a = \bar{a} + 1$, where $\bar{a}$ denotes the bitwise inversion of a. The complement operation therefore requires an adder and a logic inversion. The procedure attempts to distribute these two operations among existing nodes, if possible. The respective expansion procedures are applied to additive and multiplicative operations. They are aimed at reducing the number of additional operations but do not constrain successive processes. The transformation of a complement arc connected to an addition node is shown in Fig. 12A and B. The process changes the complement arc into a bitwise inversion arc and adds a constant 1. In keeping with the property of a single constant argument per operation node, multiple constant value nodes are merged. The transformation process is applied iteratively to all edges. The multiplicative operation node allows a reduction of the complement attribute (Fig. 13A) or its propagation to the constant (Fig. 13B). If there is only one edge with the complement attribute set, the attribute is forward- or backward-propagated. The decision depends on the successive node. If the successive node performs addition, forward propagation of the attribute is used. When forward propagation is not possible, the complement operation is expanded and back-propagated to the argument with the shorter calculation time. Selecting the path with the shorter execution time (in the sense of ASAP scheduling) avoids an increase in calculation time.

Fig. 12. Complement edge transformations for addition node.

Fig. 13. Complement edge transformations for multiplication nodes.
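The replacement of a complement edge by a bitwise inversion and a constant 1 can be checked numerically. The sketch below assumes 16-bit words (the width `WIDTH` and the function names are illustrative choices, not taken from the synthesis tool) and shows how the constant 1 may be supplied through the adder carry-in when a complemented argument feeds an addition node.

```python
WIDTH = 16                      # assumed word width for the illustration
MASK = (1 << WIDTH) - 1


def invert(a):
    """Bitwise inversion limited to WIDTH bits."""
    return ~a & MASK


def negate(a):
    """Two's complement negation: -a = ~a + 1."""
    return (invert(a) + 1) & MASK


def subtract(a, b):
    """a - b computed as a + ~b with the constant 1 applied as carry-in."""
    carry_in = 1
    return (a + invert(b) + carry_in) & MASK
```

With these definitions, `negate(5)` gives the 16-bit pattern of −5 and `subtract(7, 10)` gives the pattern of −3, so no separate adder is needed for the constant.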
3.2. Multiple arguments node expansion process

Nodes with multiple arguments (>2) are widely used in the EDFG as a result of the node merge procedure. Such a node does not correspond to an elementary hardware component (i.e. adder, multiplier). It cannot be mapped directly and must be expanded to satisfy the elementary component requirements. The expansion process creates two-argument nodes that can be mapped directly. Special handling, which attempts to utilize the carry-in input, is used for additive nodes with a constant. The developed expansion process iteratively expands the arithmetic nodes with more than two arguments. During the expansion, the estimated operation execution time is taken into consideration: the procedure balances the total calculation time of the subtree. Considering the calculation time allows for better balancing of the tree and improves the obtained scheduling result. Let us now introduce a variable t that is associated with an EDFG node and describes the operation completion time. The operation completion time (the ASAP approach) for a particular node is calculated as:
$$t_j = \max_i (t_i) + t_{pj} \qquad (17)$$
Fig. 14. Time-driven arithmetic node expansion algorithm example.
Fig. 15. Direct hardware mapping procedure: (A) An input EDFG, (B) The mapped circuit structure.
where: tj is the calculation completion time of node j, ti is the completion time of the ith argument node of node j, and tpj is the operation execution time of node j according to the operation type and implementation target. The tj value can be determined provided all ti are known. The expansion algorithm starts by initializing the execution time variable t for each operation node. In Fig. 14A, the variable t is shown as a subscript of each node. The t variables of all operation nodes are set to an uninitialized value (t = −1), except for the variable reading nodes, which are assigned t = 0 (i.e. the value is immediately available at the first cycle). The following procedure is applied to all variable assignment nodes. Starting from the variable assignment node, the procedure traces back to the argument nodes. If the t variable of the current node is not assigned, its value is calculated according to (17). If there is an argument node for which t = −1, the procedure is called recursively for it. If the adjacency matrix row of a node representing a mutable arithmetic operation contains more than two nonzero items, the node is subject to the expansion process. In a practical EDFG implementation, the node argument count is used instead. For the expansion, the two arguments with the smallest t values are selected. A new node of the same type (operation) is created, and the selected arguments are reconnected to it. The new node is assigned a t value according to (17) and becomes an argument of node j (i.e. the node under expansion). The expanded nodes are marked in grey in Fig. 14A and B. The expansion process for node j continues until the number of arguments is equal to 2. There is a special case of expansion for addition nodes with a constant. An adder is able to add two vector arguments (a, b) and a single-bit item (ci) applied to the carry in
input. When the constant value c meets a requirement:
$$y = a + b + c_i : c_i \in \{0, 1\}, \qquad c < n - 1 : c \in \mathbb{N} \Rightarrow c = \sum_{j=1}^{n} c_{i,j} \qquad (18)$$
where: c is the constant node value represented as a natural number, and n is the number of arguments of the expanded node. The constant value is distributed among the expanded nodes in the form of carry-in values.

3.3. General direct mapping procedure

An EDFG (after arithmetic operation expansion) is the subject of hardware mapping. Fig. 15 illustrates the problem given in the form of an EDFG and the direct mapping concept. The exemplary diagram shows the implementation of the following typical signal processing equations:
$$\begin{aligned} d &= \alpha (e - p) + \beta (d - e) + d \\ p &= e \end{aligned} \qquad (19)$$
where: e is a variable associated with an input signal, d is a variable associated with an output signal, p is a read-write variable associated with an internal signal, and α and β are read-only variables (parameters of the function). It has been assumed that these equations are evaluated sequentially, one after the other. The method utilizes a direct correspondence between hardware arithmetic primitives and the respective nodes of an EDFG. The EDFG is directly mapped to the respective arithmetic components, which are
separated by registers (depicted as hatched rectangles). The direct mapping process permanently binds each node to a unique hardware resource, as shown schematically in Fig. 15B. The assignment can be made with one of two greedy scheduling approaches: ASAP or ALAP [25]. The ASAP method has been selected for the considered example. The flow controller (not shown in Fig. 15B) assures an ordered data flow through the unit. The timeline of the calculation process is shown at the bottom of the schematic diagram. The problem of variable allocation does not arise, because each node is assigned to a unique hardware component. The direct mapping concept quickly exhausts the available resources and results in low hardware utilization: each arithmetic module is used once per calculation cycle. The directly mapped structure is well suited to pipelined processing systems [32], where it only requires balancing the time dependencies in all processing paths by inserting registers. A PLC or a fast feedback controller, however, requires cyclic calculation in a loop with the controlled object. Pipelined processing cannot be used, because feedback from the object is required before the next calculation cycle can start.

3.4. Resource sharing mapping

The limited number of resources (especially multiplier cores), the resource-demanding nature of dividers, and the low utilization factor of components motivate the development of a method that enables resource sharing. The set of arithmetic cores and the set of variables are reused and distributed in time. Fig. 16 depicts the result of the scheduling and mapping processes. The schematic diagram shows the obtained structure of the circuit. The table next to the circuit holds the schedule of variables passed to the arithmetic units and the respective result assignments.

Fig. 16. EDFG mapping with resource sharing.

For the systematic scheduling procedure, each operation node is described with the structure:

$$v_{SCH} = \langle c_S, c_L, c, t_0, t_1 \rangle \qquad (20)$$

where cS and cL are the ASAP and ALAP schedule cycles respectively, c is the scheduled calculation cycle, t0 is the result availability for register writing, and t1 is the last access to the result. In the literature, the scheduling process assumes unit execution time for all operations [25,27,33]. The proposed general structure allows operations with different execution times and resource availabilities to be scheduled. The scheduling procedure assigns values to the respective variables. It starts from cS and cL. The difference between cL and cS determines the operation mobility. The mobility factor is used for selecting operations for hardware resource assignment: nodes with a lower mobility factor are scheduled first. Local dependencies are used as further selection factors. The algorithm is based on a modified list scheduling approach. It has been inspired by the ideas presented in [25,27] and adapted to the specific representation of the EDFG and the controller implementation. The maximum available resource set is determined at the beginning of the procedure. The scheduling procedure assigns the operations to particular operation cycles. After the first pass of the assignment, a resource reduction step is possible, depending on operation mobility: operations are shifted within their mobility range into free time gaps. The schedule map must then be bound to physical resources. The binding procedure is aimed at reducing the overall sharing cost. This problem has been raised in [26]. The resource sharing cost is calculated for each input. To reduce the address-decoding costs, a one-hot controller is used [34]. The cost of multiplexing n arguments with k-input LUTs (look-up tables) using the one-hot approach is given by the following formula:

$$\delta_{LUT} = v_s \cdot \begin{cases} \dfrac{2n - k}{k - 1} + 1, & n > 1 \\ 0, & n = 1 \end{cases} \qquad (21)$$
where: vs is the vector size, and δLUT is the number of LUT generators. The δLUT factor exhibits low sensitivity to changes during the resource allocation procedure, which hinders the tracking of changes. A general sharing cost factor (shf) was therefore introduced for variables and constants. It records the number of inputs required for a particular hardware resource. The sharing cost value is calculated as:
$$shf_i = 2V_i + C_i \qquad (22)$$
where: shfi is the general sharing factor of the ith operation, Vi is the set of variables of the ith operation, and Ci is the set of constants of the ith operation. The variable nodes are exchanged using bipartite graph node swapping to minimize the overall multiplexing costs in a scheduled operation set. In order to reduce argument multiplexing costs, the use of distributed RAM blocks (Fig. 16) is attempted. The use of memory modules is also addressed in [35]. Currently, the use of distributed RAM is restricted to variables that implement processing parameters. Process parameters are marked with grey rectangles (α, β). Those variables are used in a read-only manner, and a separate channel is created to enable parameter updates. In summary, the shared-resources approach makes mapping much more complex than the direct approach. Its main advantage is the sharing of resources that are limited in the target device. A serious limitation of the method is the high sharing cost of adders, which in some cases makes the implementation infeasible. Utilizing distributed RAMs reduces multiplexing costs, but is currently limited to special usage cases.
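The two sharing cost measures of formulas (21) and (22) can be evaluated with a short Python sketch. Two points are assumptions rather than statements of the paper: the fraction in (21) is rounded up so that each bit uses a whole number of LUTs, and Vi and Ci in (22) are read as set sizes. The function names and the example values (k = 6, 16-bit vectors) are likewise illustrative.

```python
import math


def lut_mux_cost(n, k, vector_size):
    """LUT count for one-hot multiplexing of n arguments, following (21)."""
    if n <= 1:
        return 0  # a single argument needs no multiplexer
    # Assumption: the fraction is rounded up to a whole number of LUTs per bit.
    per_bit = math.ceil((2 * n - k) / (k - 1)) + 1
    return vector_size * per_bit


def sharing_factor(variables, constants):
    """General sharing factor shf following (22), using the set sizes."""
    return 2 * len(variables) + len(constants)


# Example: 16-bit arguments multiplexed with 6-input LUTs.
costs = {n: lut_mux_cost(n, k=6, vector_size=16) for n in (1, 2, 4, 8)}
```

Under these assumptions, the LUT cost stays constant for two and three arguments and only steps up once a further LUT level is needed, which is consistent with the low sensitivity of δLUT noted above, whereas `sharing_factor` changes with every added variable or constant.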
4. Hardware mapping dedicated for DSP48A modules

Starting with Virtex 4, Xilinx FPGA families offer DSP support through a hardwired core called DSP48. The unit is intended to implement the most common DSP operation, based on multiplication and result accumulation. The Spartan 6 family implements the DSP48A1 version of the core [28]. A generalized block diagram is shown in Fig. 17A. The unit is partially run-time programmable, which is marked on the diagram with grey rectangles. The equivalent EDFG mapping pattern is shown under the diagram (Fig. 17B). The central part and key feature of the block is the multiplier unit. The core is able to process 18-bit numbers, which completely satisfies the 16-bit integer processing requirements. The 48-bit accumulator is well suited to fixed-point number scaling and protects against overflow with a margin of 12 bits. The register structure is statically selectable during the implementation process. The pipelined structure shown in the diagram has been chosen for timing efficiency. The EDFG expanded for general hardware mapping is not suitable for the DSP48 mapping procedure. The node merge concept
Fig. 17. The DSP48A simplified block diagram (A) and EDFG mapping pattern (B).
Fig. 18. DSP48A mapping algorithm.
discussed in the EDFG transformation prepares the graph well for mapping concepts other than the general one. Using DSP48 units requires a clustered expansion to the pattern shown in Fig. 17B. There are two levels of clustering: the first merges operations into a single DSP48 unit, and the second merges operations with accumulative addition of the result. The clustering operation improves performance by enabling pipelined calculations. The mapping procedure requires two constant values, 0 and 1, to be declared; they enable bypassing of the adder and multiplier stages, respectively. Let us now consider the general expansion and mapping for addition and multiplication nodes. The expansion procedure starts from the assignment node and traces back to the argument nodes. The general properties of the mutable arithmetic operation node should be recalled: the node argument set consists of nodes of different operations, variable reads or constants. The only case that prevents nodes from being merged is when the result of an operation is an argument of multiple nodes. The mapping procedure recursively visits unassigned argument nodes. The set of nodes is ordered by calculation time according to the ASAP approach. If the node cycle identifier is not assigned, the node is subject to recursive evaluation. The DSP48 pattern is created with clustered nodes, to which nodes are reconnected. The addition node with multiple arguments (Fig. 18A) selects the pair of arguments with the smallest cycle identifiers (possibly variable read nodes). They are connected to the DSP48 cluster pattern as the B and D arguments. If there are still unmapped arguments, a new instance of DSP48 is created and connected to the previous one in an accumulative fashion. The B and D argument selection of the DSP48 unit is repeated. This operation continues until all arguments of the addition node are assigned. The final mapping procedure holds the created DSP48 instance, while tracing back to the map requesting node enables the use of the multiplier node. The mapping pattern of the multiplication node with multiple arguments is shown in Fig. 18B. It is a rare case in signal processing, but the general approach requires that this situation be addressed. As previously, a pair of nodes with the smallest ASAP time factors is selected. The clustering operation is repeated iteratively by
creating DSP48 clusters until all arguments are assigned. Note that the DSP48 defers the next calculation by one cycle in order to propagate arguments through the adder stage. Finally, Fig. 18C shows the mapping result for mixed addition and multiplication nodes. Successive unit reuse is shown with pipelined processing utilization. The procedure attempts to utilize the unit fully, without cycle stalls. The Spartan 6 (5th generation Virtex architecture) device half tile (tile – a horizontal elementary row of the architecture) layout consists of a DSP48 unit and two block RAM cores. When the memories are configured with an 18-bit data bus, which matches the width of the DSP48 data path, they are able to store 512 words. The memory content is initialized by the configuration process, which enables the placement of initial variable values and constants. A separate memory initialization block is therefore not required, which reduces hardware overhead. A schematic diagram of the calculation unit based on the DSP48 and the respective block RAMs is shown in Fig. 19. The block RAMs operate in parallel, creating a 3-argument addressable register set for the A, B and D inputs of the DSP48 core. The control unit is implemented with general-purpose logic components. Its concept is based on counter-based control units with LUT-based argument readdressing. The proposed structure uses dedicated hardware resources and minimizes the usage of general-purpose logic resources (i.e. LUTs and flip-flops). Even the smallest representatives of the Spartan 6 family (e.g. XC6SLX4) enable the implementation of 8 arithmetic blocks operating at 250–300 MHz, resulting in a peak performance of 2.0–2.4 GMAC/s.

Fig. 19. DSP48A simplified block diagram (1) and EDFG mapping pattern (2).

Table 1 Mapping strategies implementation: comparison of results.

| Type | Map alg. | FF | LUT LOG | LUT MEM | DSP48 MUL | DSP48 ALL | fMAX [MHz] | tC | TH [Mc/s] |
|---|---|---|---|---|---|---|---|---|---|
| 1C | DIR | 297 | 115 | – | 4 | – | 197.2 | 7 | 28.1 |
| 1C | SHR | 189 | 136 | – | 1 | – | 130.6 | 7 | 18.6 |
| 1C | SDR | 112 | 50 | 18 | 1 | – | 180.3 | 7 | 25.7 |
| 1C | D48D | 24 | 28 | 72 | – | 1 | 260.5 | 8 | 32.5 |
| 1C | D48B | 22 | 20 | – | – | 1 | 302.9 | 10 | 30.2 |
| 2C | DIR | 594 | 230 | – | 8 | – | 185.3 | 7 | 26.4 |
| 2C | SHR | 351 | 175 | – | 1 | – | 128.4 | 10 | 12.8 |
| 2C | SDR | 204 | 86 | 18 | 1 | – | 178.1 | 10 | 17.8 |
| 2C | D48D | 28 | 32 | 72 | – | 1 | 239.8 | 10 | 23.9 |
| 2C | D48B | 22 | 23 | – | – | 1 | 302.9 | 14 | 21.6 |
| 8T | DIR | 307 | 440 | – | – | – | 286.2 | 3 | 95.4 |
| 8T | SHR | 344 | 214 | – | – | – | 127.9 | 10 | 12.7 |
| 8T | SDR | 45 | 62 | 30 | – | – | 181.3 | 10 | 18.1 |
| 8T | D48D | 35 | 18 | 72 | – | 1 | 258.2 | 18 | 14.3 |
| 8T | D48B | 32 | 12 | – | – | 1 | 301.3 | 19 | 15.8 |
| 2C8T | DIR | 898 | 667 | – | 8 | – | 179.2 | 7 | 25.6 |
| 2C8T | SHR | 692 | 386 | – | 1 | – | 125.8 | 10 | 12.5 |
| 2C8T | SDR | 247 | 140 | 48 | 1 | – | 178.3 | 10 | 17.8 |
| 2C8T | D48D | 48 | 55 | 72 | – | 1 | 255.2 | 26 | 9.8 |
| 2C8T | D48B | 43 | 41 | – | – | 1 | 283.1 | 30 | 9.4 |

5. Controller performance

Performance comparison was made in two domains. The first comparison determines the mutual performance ratios of the different synthesis and implementation strategies. It compares the resource utilization and maximum operating frequencies of the controller structure. The second comparison shows the performance ratios of the proposed controller architectures in reference to standard PLC CPUs from the Siemens S7 line and the highly optimized academic architecture given in [9].

5.1. Implementation strategies comparison

Modern FPGAs offer a rich set of resources that are able to implement arithmetic calculations. It should be emphasized that logic operations are implemented efficiently and are evaluated during one clock period, achieving extraordinary performance. The comparison process is therefore focused on the domain of arithmetic and mixed
logic and arithmetic task implementation. The Spartan 6 architecture has been used as the reference target platform. In order to make a quantitative comparison, exemplary control system designs have been implemented with the presented strategies. PID-based systems were selected, implementing one (1C) and two (2C) instances of the PID module. The time-dependent system is represented by the 8T project, which consists of 8 timer instances. Finally, a mixed design (2C8T) consisting of 8 timers and two instances of the PID controller is considered. The results of the implementation process are gathered in Table 1. They were obtained by generating a mapped design that instantiates specific components according to the guidelines of the XST synthesis tool [36]. Table 1 lists the resource types utilized by each design and mapping strategy. Because LUTs are used in specific ways, LUTs operating as general-purpose logic components (LOG) are distinguished from those operating as distributed memories (MEM). The DSP48 unit utilization distinguishes between multiplier-only usage (MUL) and full unit utilization (ALL). In order to simplify the assessment of performance and of the influence of each method on the logic architecture, three factors are given: fMAX, tC and TH. fMAX denotes the maximum operating frequency, while tC is the number of clock cycles required for the calculations. In order to normalize the unit performance and enable mutual comparison, the throughput factor TH is used. It is calculated as $TH = f_{MAX} / t_C$. The direct mapping strategy (DIR – direct) is the simplest one. Registers dominate resource usage, as they store all variables and coefficients. LUT utilization is relatively high in this strategy. In the case of the 8T design, LUT utilization is higher than for 1C and 2C, because timers are based on addition and comparison operations. In the direct mapping approach, each node is bound to a hardware instance implementing the operation. 
This results in a very high throughput for calculations with a small number of cycles. For the 8T design, the TH factor reaches 95.4 million cycles per second. The next algorithm (SHR – shared) implements resource sharing. It reveals the problem of the additional cost of resource sharing: it reduces the calculation resources, but the overall LUT utilization is higher than in the direct approach for PID controllers (1C, 2C). The timer implementation (8T) demonstrates a reduction of LUTs, even though arguments are multiplexed. The relatively high sharing factor (8 to 1) reduces performance to 33.3% and increases the propagation delay, reducing throughput to 13.3%. Utilizing distributed RAMs (SDR – shared with distributed RAM) allows the number of registers and multiplexing resources to be reduced. The observed results show a reduced number of LUTs required as general-purpose logic components. Another positive effect is a reduction in the number of combinatorial layers, which reduces the propagation delay. An increase in the maximum frequency is observed, which results in a better TH factor. Compared to the SHR strategy, an average performance increase of about 40% is observed, while the LUT resource requirements have been reduced to about 50% on average. The aforementioned mapping strategies are based on elementary arithmetic components and can be targeted to a wide range of FPGA devices. In these strategies, the DSP48A1 core available in Spartan 6 is used only as a multiplier. The mapping strategy dedicated to the DSP48 core enables full utilization of the remaining components (viz. adder, accumulator, path multiplexers), allowing a further reduction of resource requirements. There are two methods of implementation, which differ in how variable storage is implemented. D48D uses the DSP48 with distributed RAMs, while D48B utilizes the RAMB8 block memories. In both cases, the utilization of general-purpose logic is significantly reduced, and all arithmetic computation is implemented in the DSP48 block. The D48D method performs well and gives the best trade-off between resource requirements and performance. The D48B methodology exhibits slightly lower performance, caused by a scheduling limitation: the unit stalls operation for writing a result. Benefits from pipelined calculations are observed. The number of cycles for the 1C and 2C circuits increases from 10 to 14 (only a 40% increase for doubled calculations). Similar gains can be observed in 8T and 2C8T.
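The throughput factor can be recomputed from the fMAX and tC columns of Table 1. The following Python sketch is only a plausibility check; the function name `throughput` is an illustrative choice, and the input values are copied from the 8T and D48B rows of the table.

```python
def throughput(f_max_mhz, t_c):
    """Throughput factor TH in Mc/s: TH = fMAX / tC."""
    return f_max_mhz / t_c


# 8T design: direct mapping versus resource sharing (values from Table 1).
th_dir = throughput(286.2, 3)
th_shr = throughput(127.9, 10)
shr_to_dir = th_shr / th_dir          # remaining fraction of DIR throughput

# D48B: cycle count when the number of PID instances doubles (1C -> 2C).
cycle_growth = 14 / 10
```

The ratio `shr_to_dir` comes out at roughly 13%, in line with the throughput reduction quoted for the 8-to-1 sharing case, and `cycle_growth` reproduces the 40% cycle increase for doubled calculations.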
The pipelined approach and the distribution of calculations, made possible by the scheduling procedures of the EDFG, result in a performance increase for multiple-item calculations.

5.2. Performance comparison to standard PLC families

In order to show the substantial benefits of the presented hardware implementation of logic controllers over the standard PLC implementation, a comparison of selected control tasks is made. For comparison purposes, three CPU representatives are selected from the SIMATIC S7 families, together with an academic design of a highly specialized PLC CPU architecture [9]. The same reference designs were implemented for the PLCs as for their FPGA counterparts. The execution time has been calculated using the manufacturer's data sheet [37]. The program execution times are gathered in Table 2. The calculation times are shown for hardware controllers operating with a clock signal of 100 MHz for the general LUT approach and 150 MHz for the DSP48-based designs. According to the implementation results (see Table 1), the selected operating frequencies are guaranteed with a wide margin. The performance ratio is calculated in reference to the fastest CPU, the SIMATIC S7 319 (shown in bold). It can be estimated from the instruction execution time that the CPU frequency of this controller is at least 500 MHz. The performance ratio increases with the number of tasks and calculations executed in parallel. The DIR strategy exhibits the highest performance, but also the highest hardware resource requirements. This method is applicable when the maximum performance is expected. The 8-timer design is executed 31 times faster than with the programmatic approach. Methods utilizing component sharing (SHR and SDR) show reduced performance compared to the DIR method. In cases where the critical path dominates over other tasks, the additional time cost of resource sharing is hidden by it. 
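The FPGA execution times and the relative calculation speeds follow from the cycle counts of Table 1 and the chosen clock frequencies. A minimal Python sketch of this arithmetic is given below; the function names are illustrative, and the reference times (944 ns and 249 ns) are the S7 319 values for the 8T and 1C designs.

```python
def execution_time_ns(t_c, f_clk_mhz):
    """Execution time in ns for t_c clock cycles at f_clk_mhz."""
    return t_c * 1000.0 / f_clk_mhz


def relative_speed(reference_ns, t_ns):
    """Calculation speed relative to the reference CPU (higher is faster)."""
    return reference_ns / t_ns


# 8T design, direct mapping: 3 cycles at 100 MHz against the S7 319.
t_dir_8t = execution_time_ns(3, 100)
speed_dir_8t = relative_speed(944, t_dir_8t)

# 1C design, D48D mapping: 8 cycles at 150 MHz against the S7 319.
t_d48d_1c = execution_time_ns(8, 150)
speed_d48d_1c = relative_speed(249, t_d48d_1c)
```

Rounded to two decimal places, the two ratios are 31.47 and 4.67, which are the values listed for these designs.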
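Table 2 Controllers performance comparison: execution time [ns] and calculation speed relative to the S7 319.

| Target platform | Variant | 1C time [ns] | 1C rel. speed | 2C time [ns] | 2C rel. speed | 8T time [ns] | 8T rel. speed | 2C8T time [ns] | 2C8T rel. speed |
|---|---|---|---|---|---|---|---|---|---|
| Simatic S7 | 314 | 3240 | 0.077 | 6480 | 0.077 | 10560 | 0.089 | 17040 | 0.085 |
| Simatic S7 | 317 | 950 | 0.262 | 1900 | 0.262 | 4080 | 0.231 | 5980 | 0.241 |
| Simatic S7 | **319** | **249** | **1** | **498** | **1** | **944** | **1** | **1442** | **1** |
| PLC CPU [9] | | 270 | 0.922 | 540 | 0.922 | 1120 | 0.843 | 1660 | 0.869 |
| FPGA LUT, fCLK = 100 MHz | DIR | 70.0 | 3.56 | 70.0 | 7.11 | 30.0 | 31.47 | 70.0 | 20.60 |
| FPGA LUT, fCLK = 100 MHz | SHR | 70.0 | 3.56 | 100.0 | 4.98 | 100.0 | 9.44 | 100.0 | 14.42 |
| FPGA LUT, fCLK = 100 MHz | SDR | 70.0 | 3.56 | 100.0 | 4.98 | 100.0 | 9.44 | 100.0 | 14.42 |
| FPGA DSP48, fCLK = 150 MHz | D48D | 53.3 | 4.67 | 66.7 | 7.46 | 120.0 | 7.87 | 173.3 | 8.32 |
| FPGA DSP48, fCLK = 150 MHz | D48B | 66.7 | 3.73 | 93.3 | 5.34 | 126.7 | 7.45 | 200.0 | 7.21 |

The hiding of the sharing cost by the critical path can be observed in the 2C8T design, where timer processing is equal to PID controller processing. Strategies utilizing the DSP48 component offer a slightly reduced performance but higher operating frequencies, resulting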
from using a dedicated arithmetic core and block memories (D48B). One limitation implied by block memories introduces a slight performance reduction compared to LUT-based memory utilization. The performance increase of the hardware controller is observed for designs that implement multiple parallel tasks. The scheduling procedure enables precise distribution of calculations among the calculation resources, resulting in a significantly reduced response time compared to standard PLCs, which execute a program in a serial cyclic fashion. It should be pointed out that hardware implementation enables extremely fast and parallel execution of logic instructions, which reflects the abilities of the FPGA platform well.

6. Conclusions

The paper presents the entire synthesis process of a fast FPGA-implemented PLC from a program written in standard languages. Translation models of PLC languages are shown. These models enable massively parallel calculations while assuring that the processing results are identical to those of the standard sequential approach. The language models are used for developing an original language-independent program representation, for which the EDFG graph has been developed. This representation is suitable for performing all processing toward the final implementation. The EDFG construction methods, based on the language models, have been shown. Obtaining a language-independent form of program representation enables further processing. It should be noted that the EDFG can be considered a general-purpose representation. The paper is focused on obtaining a massively parallel hardware implementation of the control program, but the EDFG can also be used for generating an optimized instruction sequence for classic PLC architectures. The EDFG representation enables optimization of a control program. During EDFG construction, program fragments that do not contribute to the outputs are eliminated. 
The optimization process is continued by simplification of logic and arithmetic operations. Finally, logic fragments are subject to logic minimization. This step allows for further optimization of the program representation. The mapping procedure starts with the adaptation of the EDFG to hardware components. The multiple-argument node expansion problem is discussed; the expansion maximizes parallel execution of the algorithm. Another problem that has been addressed is the optimization of complemented arguments. The hardware-prepared EDFG is mapped with two different approaches. The direct approach allows instant transformation into the hardware structure. Hardware component reuse has been introduced in the second method, which increases the utilization of hardware components. The component sharing concept introduces an additional implementation cost connected with argument multiplexing. The concept of reducing the multiplexing costs by swapping arguments assigned to hardware operation instances was presented. Finally, the DSP48 component mapping procedure is shown. It introduces a mapping strategy that expands operation nodes into DSP48 patterns. The method benefits from the pipelined architecture and the accumulative adder of the DSP48 unit, and is designed to achieve the highest possible utilization of the unit. The presented algorithms belong to the originally developed synthesis tool for hardware-implemented PLCs, capable of synthesizing the LD and SFC languages. The experimental results show throughput in the range of 10–95 Mc/s (million processing cycles per second), with a median processing performance of 18.3 Mc/s. These results were compared to standard PLCs. The observed performance improvement ranges from 3.5 for single tasks up to 31 for eight similar timer tasks. The comparison was made against the fastest available CPU on the market (S7 319). The comparison made for the arithmetic-based tasks demonstrated that their implementation and performance are significantly worse than those of logic operation processing. Different implementation strategies allow the resource requirements and performance to be traded off. Resource sharing slightly impairs the performance ratio but allows the hardware resource allocation to be reduced. The compilation and synthesis tool is the subject of ongoing research and development. 
It is planned to extend arithmetic support to floating-point numbers and to further improve the scheduling and mapping processes. Early research shows that the language-independent representation delivers very promising results for generating instruction streams for multi-core PLC CPUs.

Acknowledgments

This work was supported by the Ministry of Science and Higher Education funding for statutory activities (decision no. 8686/E367/S/2015 of 19 February 2015).

Adam Milik received M.Sc. and Ph.D. degrees from the Silesian University of Technology in Gliwice in 1997 and 2003, respectively. Since 2003 he has been an assistant professor at the Institute of Electronics of the Silesian University of Technology in Gliwice. His main interests and research areas are high-level logic synthesis and implementation, algorithm implementation, technology mapping in FPGA devices, and hardware high-level modeling systems based on HDLs and their integration with other tools such as MATLAB, Simulink, or SystemVue.