Enhanced Architecture for Programmable Logic Controllers Targeting Performance Improvements

Laurence Crestani Tasca (1), Edison Pignaton de Freitas (2), and Flávio Rech Wagner (3)

(1) Federal University of Rio Grande do Sul, Graduate Programme on Computing
(2) Federal University of Rio Grande do Sul, Institute of Informatics
(3) Federal University of Rio Grande do Sul, Institute of Informatics
The first PLCs used in industry replaced the old-fashioned relay logic systems. Hence, the first language adopted to program them was the ladder diagram (LD), since LD closely resembles relay circuit diagrams, with a graphical representation of contacts for inputs and coils for outputs, as presented in the diagram of Figure 1. Although other languages such as function block diagram (FBD), structured text (ST), instruction list (IL), and sequential function chart (SFC) are also available for programming PLCs, LD is still popular among PLC programmers nowadays [1]. Figure 1 presents a ladder diagram example in which each line (also called rung) controls the state of one of three outputs, named O/0, O/1, and O/2, based on the states of three inputs designated I/0, I/1, and I/2. Rungs in a ladder diagram may have normally open and normally closed contacts. In the given example, the state of O/0 in the first rung is set to "on" if input I/0 is "on" (normally open contact) and input I/1 is "off" (normally closed contact). Similarly, in the second rung, the output O/1 is set to "on" if both inputs I/0 and I/2 are "on". Finally, in the third rung, if input I/1 is "off" and input I/2 is "on", then the output O/2 is set to "on".
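To make the rung semantics concrete, the sketch below evaluates the three rungs of Figure 1 once, assuming plain boolean arrays for the input and output images; it illustrates only the ladder logic of this example, not the architecture proposed later in this paper.

// Illustrative only: the three rungs of Figure 1 evaluated in one scan,
// with inputs I[0..2] and outputs O[0..2] held in boolean arrays.
public class LadderExample1 {
    static void executeRungs(boolean[] in, boolean[] out) {
        out[0] = in[0] && !in[1];   // rung 1: I/0 (NO) and I/1 (NC) -> O/0
        out[1] = in[0] && in[2];    // rung 2: I/0 and I/2 -> O/1
        out[2] = !in[1] && in[2];   // rung 3: I/1 (NC) and I/2 -> O/2
    }

    public static void main(String[] args) {
        boolean[] inputs = {true, false, true};    // I/0 = on, I/1 = off, I/2 = on
        boolean[] outputs = new boolean[3];
        executeRungs(inputs, outputs);
        System.out.println(java.util.Arrays.toString(outputs));  // [true, true, true]
    }
}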
Abstract—Programmable logic controllers (PLCs) are massively used as the central processing control units in industrial automation solutions. Unfortunately, the limited performance of these controllers requires coupling dedicated drivers to the PLCs to enable deterministic responses in time-sensitive real-time applications, thus reducing the role of the PLC in the overall automation system. Current solutions do not focus on PLC performance, and it is therefore safer to use the design pattern with these specific drivers when required. This situation does not give PLC users full control of the provided automation and, consequently, increases the price of the entire system, as well as the need for spare parts. To put PLCs back at the center of industrial automation technology and to reduce the need for specific drivers, this work proposes a novel architecture with enhancements based mainly on the concepts of data flow computation and on the memoization technique to boost PLC performance. Evaluations of the proposed design demonstrate a reduction of up to 95% in the architecture's scan time and show a significant performance boost even in small, didactic, and straightforward examples. Moreover, the experimental evaluations also demonstrate that the performance improvements grow with the program size.
Index Terms—Programmable logic controllers, special architecture, data flow machines, memoization technique
I. INTRODUCTION
Programmable logic controllers (PLCs) represented an essential contribution to the digital revolution in industry, also known as the third industrial revolution. PLCs aided industries in reaching the high levels of productivity and efficiency found nowadays. These improvements occurred mainly due to the replacement of large and heavy electromechanical relay control panels by digital technology, with significant gains in maintenance and repeatability [1]. PLCs are systems composed of program logic that use a set of inputs to generate outputs, which, in turn, control a given process. A relevant characteristic of these systems is their scan cycle, which is composed of three operations: reading the inputs, executing the program logic, and writing the outputs. The elapsed time to run a scan cycle is called scan time and is directly related to the PLC's performance and response time [2]. A short scan time is crucial in time-sensitive applications like motion control, control of proportional-integral-derivative (PID) processes, and real-time systems, in which a deterministic response is one of the essential properties for satisfactory performance [3].
Figure 1: Ladder diagram - example 1

In a standard PLC architecture, the scan cycle execution is continuously repeated. In the example presented in Figure 1, the scan time is computed as the time to read the inputs, plus the time to execute all three rungs, plus the time to write the outputs. Since a standard PLC architecture always runs the scan cycle operations sequentially, a larger program produces a longer scan time and, consequently, lower performance and a longer response time. To improve PLC performance by reducing the scan time, this paper proposes two improvements to the PLC architecture. The first one, named verify to execute (VE), is inspired by data flow machines and digital logic simulation theory, while the second one, named search to execute (SE), is based on the operation of cache memories and on the memoization technique also explored in [4]. The VE improvement determines that a rung is scheduled to execute only if edges (signal value transitions) are detected
information to compare to the proposed design at the instruction level, which is not appropriate, since the architectures differ both in data path and in operating frequency. Since a fair comparison with any commercial architecture is not possible, comparing the improvements against the baseline provides a suitable way to evaluate them and yields valuable information for applying them to any PLC architecture in future work. Differently from [5] and [6], the approach presented here goes further in avoiding the execution of rungs whose results have already been computed. Similarly to [6], several PLC architectures and frameworks were proposed in [7], [8], [9], and [10]. These works describe the limitations and challenges of special PLC architectures and represent a valuable source of reference for the proposal described in this paper. The work presented in [7] transforms a single ladder diagram of a specific application into a hardware description language (HDL), which is not comparable to the proposal of this paper. In [8] and [10], two novel PLC architectures are designed by means of an FPGA, similarly to [6], and their instruction cycles are compared to available data from commercial PLCs. Again, like [6], [8] and [10] are not comparable to this article's proposal, since its improvements are based on skipping useless executions, not on improving hardware co-design. Finally, in [9] an FPGA design was used to enhance the performance of a specific application in replacement of a PLC, highlighting the limited performance of PLCs in some applications. However, it is difficult to compare that particular proposal to the generic one presented in this work. The idea of avoiding the execution of rungs is not properly new since, in a similar way, it is present in data flow machines and digital logic simulation theory. In this context, suitable references in those areas are available in [11], [12], [13], and [14]. However, applying this technique to enhance the performance of PLCs is an entirely new approach with high potential for exploration. The work by Iqbal et al. [15] presents an evaluation method for PLCs. Despite defining a methodology to benchmark PLCs, which is a contribution that influenced this work, their proposal does not consider the usage of third-party benchmarks, like those of the PLCopen committee [16]. Benchmarks like the one proposed by PLCopen TC3 [17] can provide an excellent alternative for evaluation instead of an ad-hoc set of test cases. Unfortunately, the most relevant part of the PLCopen TC3 effort, the application benchmarks, had not been developed and thus were not available by the time this paper was prepared. Nevertheless, the effort to propose benchmarks for PLCs is noteworthy, and it is an important initiative headed by the PLCopen committee. At last, Milik [18] describes a method to convert a PLC program into the Verilog hardware description language (HDL), such that dedicated hardware generated from the HDL may be used to implement the PLC logic. That work uses PID controllers as a test PLC application, which are sensitive to the scan time, thus showing the importance of a short scan time for specific applications. Nevertheless, frameworks to convert rung structures
II. RELATED WORK
at its inputs, similarly to [5] and [6]. The SE improvement reduces the scan time by storing the results of previously executed rungs and skipping the execution of those rungs that present the same input values as in a previous scan cycle. The VE and SE improvements were developed in three different execution cores, one for each improvement and a third core combining both VE and SE. Afterwards, the proposed execution cores were compared to a baseline standard execution core in a simulation environment, using the scan time as the comparison metric. Experimental results show up to 95% of scan time reduction even in small, didactic, and straightforward examples. Such a considerable speedup may allow PLCs to directly control applications that today require specific external drivers, like, for instance, robotics control and other motion applications. Also, even in a system where cost reduction by removing a particular driver is not possible, a reduction in PLC reaction time can benefit industrial processes by increasing production. Finally, the improvements proposed in this article make the PLC architecture relevant to a broader design space than today, thus making PLCs a key element of automation systems and reducing the need for stocks of specific spare parts in industry. The remaining sections of this paper are organized as follows. Section II discusses the main related works. A detailed description of the proposed improvements is presented in Section III, while the new architecture is detailed in Section IV. Section V reports experimental results that validate the proposed architecture, and finally the conclusions and future works are presented in Section VI.
Even though programmable logic controllers have been massively used in industry since their introduction in the 1960s, there are few proposals focused on improving PLC performance. However, a careful literature review allows identifying essential contributions in this area, as presented in the following. Suresh et al. [4] use the memoization technique to cache the results of transcendental functions like sine, cosine, and tangent. The study obtained significant reductions in execution time by benefiting from the static behavior of the transcendental functions. Similarly, the present work stores the results of the most executed rungs along with their arguments, reducing the PLC scan time by not re-executing rungs whose results are already cached. A proposal to reduce the PLC scan time by not executing rungs whose inputs did not change was introduced by Chmiel and Hrynkiewicz [5], which is similar to the VE improvement proposed in this article. As a continuation of [5], Chmiel et al. [6] reported the development of a dual 1-bit/32-bit PLC architecture in an FPGA, in which the 1-bit processor is used to detect the edges in the rungs, while the 32-bit processor executes the rungs themselves. Regarding the evaluation presented in [5], it uses a simple example in a standard PLC to prove the benefits of the technique that is also explored in this article, while [6] uses incomplete commercial PLCs'
III. THE PROPOSED IMPROVEMENTS
Table I: VE improvement verification order criteria calculated for ladder diagram example 2

Input   Impact   Probability   Total
I/0     1/4      1/10          0.35
I/1     2/4      2/10          0.70
I/2     3/4      3/10          1.05
I/3     4/4      4/10          1.40
Figure 2: Ladder diagram - example 2
The first improvement proposed in this work is called verify to execute (VE) and is based on the paradigms of data flow machines and digital logic simulation theory. The VE improvement consists of checking the rungs' inputs for the occurrence of signal value variations and scheduling the rungs for execution only if there is an "on" to "off" or an "off" to "on" transition in their inputs. When executing the ladder diagram program presented in Figure 1 with the VE improvement, an edge detected at input I/0 would schedule the first two rungs for execution. Similarly, a variation in input I/1 would schedule the first and third rungs for execution, while the last two rungs would be scheduled for execution if input I/2 had an edge transition. Differently from the ladder diagram in Figure 1, where the inputs' verification order does not matter due to the symmetry of the diagram, in the ladder diagram of Figure 2 the execution of the first rung may be delayed depending on the order in which the inputs' edges are verified. Since an edge in input I/3 would cause the execution of all four rungs, whereas a variation in input I/0 would schedule only the first rung for execution, the verification order is essential to start the execution of rungs as soon as possible.
To reduce the delay in triggering the first rung execution, the VE improvement uses an algorithm, focused on the application program, that orders the inputs according to two criteria: impact and probability. Impact is the ratio between the occurrences of a given input and the total number of rungs, while probability is the ratio between the occurrences of each input and the total number of input occurrences in all rungs. The algorithm orders the inputs in decreasing order of the sum of impact and probability. Signal variations are thus evaluated according to this ordering of inputs, and if an input in the ordering has a
signal variation, all rungs where it appears are scheduled for execution. The verification order criteria for the diagram presented in Figure 2 can be observed in Table I. Analyzing the data presented in the table, input I/3 has impact 1.0 (4/4), since it appears in all rungs, while input I/0 has impact 0.25 (1/4) since it is present in a single rung. In turn, input I/3 has probability 0.4 (4/10), since it has four occurrences over the total of ten input occurrences in the whole diagram. Finally, input I/0 has probability 0.1 (1/10), since it has only one appearance over the same ten input occurrences. As a consequence, the verification order for the inputs of the ladder diagram in Figure 2, according to this algorithm, is I/3 (sum is 1.4), I/2 (sum is 1.05), I/1 (sum is 0.7), and I/0 (sum is 0.35).
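As a concrete reading of this ordering rule, the sketch below computes the sum of impact and probability for each input of a ladder program, given as a list of rungs where each rung lists the inputs it references, and sorts the inputs in decreasing order of that sum. The specific rung layout used here is an assumption chosen to be consistent with Table I, and it reproduces the order I/3, I/2, I/1, I/0.

import java.util.*;

// Sketch of the VE verification-order rule: rank inputs by impact + probability.
// The rung layout below is an assumed one consistent with Table I; it is used
// only to show the computation.
public class VerificationOrder {
    public static void main(String[] args) {
        List<List<String>> rungs = List.of(
                List.of("I/0", "I/1", "I/2", "I/3"),
                List.of("I/1", "I/2", "I/3"),
                List.of("I/2", "I/3"),
                List.of("I/3"));

        int totalRungs = rungs.size();
        int totalOccurrences = rungs.stream().mapToInt(List::size).sum();

        // Count the occurrences of each input over all rungs.
        Map<String, Integer> occurrences = new HashMap<>();
        for (List<String> rung : rungs)
            for (String input : rung) occurrences.merge(input, 1, Integer::sum);

        // Sum of impact and probability, as defined in the text.
        Map<String, Double> score = new HashMap<>();
        occurrences.forEach((input, n) ->
                score.put(input, n / (double) totalRungs + n / (double) totalOccurrences));

        List<String> order = new ArrayList<>(score.keySet());
        order.sort((a, b) -> Double.compare(score.get(b), score.get(a)));
        System.out.println(order);   // [I/3, I/2, I/1, I/0], matching Table I (sums 1.40 ... 0.35)
    }
}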
to HDL on FPGAs to improve PLC performance are an alternative with higher speedup, but they belong to a different study area than this work, since that kind of implementation does not correspond to a general-purpose solution like the one presented here. Similarly to [18], other proposals using HDLs to improve PLC performance can be found in [19], [20], and [21].
Moreover, also in the example presented in Figure 2, once an edge occurs at I/3 in a given scan cycle, all rungs of the ladder diagram will be scheduled for execution following this detection. As a consequence, the verification of edges at the other three inputs is no longer necessary for that scan cycle. Applying this procedure not only shortens the delay in executing the first rung but also decreases the execution time of the edge verification procedure. The reduction happens because, once all rungs of a particular input are scheduled for execution, the edge verification for that input is no longer necessary. An advantage of the VE improvement is that it saves execution cycles, especially in monotonous systems in which an input edge occurs interleaved with long periods of idle time. By skipping useless executions, the VE improvement reduces the scan time in typical PLC applications that behave monotonously. Additionally, this improvement provides a faster response in time-critical applications, like motion control, for instance. As a drawback, the VE improvement may increase the scan time in applications with a high number of edges, due to the overhead of verifying several input variations. This drawback is further discussed in Section V, where it is shown that the overall performance mitigates this problem. Furthermore, the execution order under the VE improvement may differ from the standard execution order and thus produce unexpected behavior in the case of poorly implemented ladder diagrams in which the logic of different rungs overlaps. At last, the VE improvement requires a particular architecture with dedicated hardware to implement its special features. The second improvement proposed in this paper is called search to execute (SE) and is inspired by cache memory behavior and by the memoization technique also exploited in [4]. The SE improvement may complement the execution phase of either the standard or the VE improved architecture, by checking whether the input values of a given rung execution are equal
IV. THE PROPOSED ARCHITECTURE

To exploit the VE and SE improvements described in Section III, an architecture was developed in the context of this work. Similar to a standard PLC, the proposed design, presented in Figure 3, is composed of a PLC core capable of reading inputs, executing a program, and writing outputs.
Figure 3: Proposed architecture
The PLC core of the developed architecture contains a control entity that uses several signals to coordinate a memory unit and an execution core. The memory unit centralizes all data required by the architecture and is subdivided into registers to store the input and output states, a PROM for the program data, and a RAM to store the needed data values. Also in the PLC core, along with the memory unit, the execution core is responsible for running the logical/arithmetic operations defined in the program. As can be observed in Figure 3, the architecture's execution core may be implemented as a standard core, a VE improved core, an SE improved core, or a VE+SE core with both improvements. This characteristic allows the four different execution core implementations to be compared under similar conditions, since the scan cycle phases for reading the inputs and writing the outputs are executed in the same way regardless of the execution core. The standard execution core, detailed in Figure 4, behaves like a traditional PLC execution core: the input values are used to execute all the rungs and to update the outputs according to the results generated by the execution unit.
to a previously stored one and, if so, skipping the re-execution by using the stored output values. The memoization technique exploited by the SE improvement may increase the scan time in applications where the memoization memory is not large enough to store all possible execution instances, for example in a situation with a high number of different edges at the inputs. However, since the impact of the memoization memory size is not the focus of the current work, conclusions based on additional evidence can only be drawn in a future study focusing on this aspect. The memoization memory size is, of course, significant, but the experimental evaluations performed for this article consider all probabilities of memoization hits and misses, such that the memory size can be disregarded in the scope of this work. Nevertheless, when the memoization memory is large enough, the SE improved solution provides a faster response for rung executions whose most common input values are cached, even in applications with a high number of input variations. This superior speedup mitigates the impact of the memoization memory on the area and power consumption of an architecture that uses the SE improvement. In the best case, the execution time of a standard architecture executing the ladder diagram presented in Figure 2 is the sum of the execution times of all four rungs. For a design with the VE improvement, the best case occurs when there are no edges, where the system only spends the verification time for all the inputs and does not need to execute any rung. The best scenario for the SE improved architecture occurs when the execution contexts of the four rungs are already cached in the memoization memory, where the system spends only the search time for the four rungs stored in the memoization memory. Finally, in an implementation with both improvements, VE and SE, the shortest execution time is equal to that of the design with the VE improvement alone, which consists of the verification time for all four inputs when no input edges occur. Also for the ladder diagram presented in Figure 2, the worst case for a standard architecture is the same as the best case: the sum of the execution times of all four rungs. For a design with the VE improvement, the worst scenario occurs when there are edges at I/2 and I/0, in which case the system needs to spend the verification time for I/3, I/2, and I/0, plus the execution time for all four rungs. In an architecture with the SE improvement, the longest execution time occurs when there is no memoization stored for any of the rungs, and the system needs to spend the search time for all four rungs plus the execution time for all four rungs. At last, in an architecture with both improvements, the worst-case scenario occurs when there is a variation at inputs I/2 and I/0 with no stored memoization. This situation composes the worst-case execution time, with the verification time for I/3, I/2, and I/0, plus the memoization search time for all four rungs, plus the execution time for all four rungs. Finally, since the proposed improvements do not modify the execution unit of the standard architecture, the edge verification and the memoization search operations must occur in parallel, in separate dedicated hardware, to mitigate the overhead of the worst cases discussed above.
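For clarity, the cases above can be summarized symbolically. Writing t_v for the per-input verification time, t_s for the per-rung memoization search time, and t_{e,i} for the execution time of rung i (notation introduced here only for convenience), the best and worst cases discussed for the four-rung diagram of Figure 2 read roughly as follows, ignoring the input read and output write phases:

\begin{aligned}
T_{\text{std}}^{\text{best}} = T_{\text{std}}^{\text{worst}} &= \textstyle\sum_{i=1}^{4} t_{e,i},\\
T_{\text{VE}}^{\text{best}} = 4\,t_v, \qquad T_{\text{VE}}^{\text{worst}} &= 3\,t_v + \textstyle\sum_{i=1}^{4} t_{e,i},\\
T_{\text{SE}}^{\text{best}} = 4\,t_s, \qquad T_{\text{SE}}^{\text{worst}} &= 4\,t_s + \textstyle\sum_{i=1}^{4} t_{e,i},\\
T_{\text{VE+SE}}^{\text{best}} = 4\,t_v, \qquad T_{\text{VE+SE}}^{\text{worst}} &= 3\,t_v + 4\,t_s + \textstyle\sum_{i=1}^{4} t_{e,i}.
\end{aligned}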
Figure 4: Standard execution core

In the VE improved execution core, shown in Figure 5, the edge detector entity uses the rungs' definitions and the preprocessed verification order detailed in Section III to check
the same components as the VE and SE improved execution cores, in a different organization but with the same behavior already explained. The VE+SE improved execution core has almost the same data path as the VE one, except for the memoization checker, which is inserted between the rung scheduler and the execution unit. With this modification, the rung scheduler delivers the rung identification to the memoization checker, instead of delivering the rung address to the execution unit as in the VE improved execution core.
variations in the values of the inputs. If the edge detector identifies an edge, the identification of the corresponding input is passed to the rung scheduler. The rung scheduler checks which rungs contain the given input identification and, in turn, notifies the execution unit to start the execution of each such rung by passing its address. In this execution core, the rung scheduler acts as a FIFO queue buffer, allowing the edge detector to continue its verification process even if several rungs have already been scheduled for execution or, on the other side, a rung program fragment is already running in the execution unit.
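A minimal software rendering of this edge-detector/rung-scheduler pair is sketched below, assuming a precomputed mapping from each input to the rungs that reference it; in the actual design these are separate hardware units working in parallel, so the sequential code only conveys the data flow between them.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class VeScheduling {
    // Detect edges in the precomputed verification order and push the affected
    // rungs into a FIFO that the execution unit consumes; duplicates are skipped.
    static Deque<Integer> scheduleRungs(boolean[] previous, boolean[] current,
                                        int[] verificationOrder,
                                        Map<Integer, int[]> inputToRungs) {
        Deque<Integer> fifo = new ArrayDeque<>();
        Set<Integer> scheduled = new HashSet<>();
        for (int input : verificationOrder) {
            if (previous[input] != current[input]) {          // edge: "off"->"on" or "on"->"off"
                for (int rung : inputToRungs.get(input)) {
                    if (scheduled.add(rung)) fifo.addLast(rung);
                }
            }
        }
        return fifo;                                          // rung ids, in scheduling order
    }

    public static void main(String[] args) {
        // Figure 1: I/0 appears in rungs 0 and 1, I/1 in rungs 0 and 2, I/2 in rungs 1 and 2.
        Map<Integer, int[]> inputToRungs =
                Map.of(0, new int[]{0, 1}, 1, new int[]{0, 2}, 2, new int[]{1, 2});
        boolean[] previous = {false, false, false};
        boolean[] current = {true, false, false};             // edge only at I/0
        System.out.println(scheduleRungs(previous, current, new int[]{0, 1, 2}, inputToRungs));
        // prints [0, 1]: only the first two rungs are scheduled
    }
}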
Figure 7: VE+SE improved execution core
Figure 5: VE improved execution core
The main component of the SE improved execution core, detailed in Figure 6, is the memoization checker. Before passing a rung to the execution unit, the memoization checker compares the current input values of the rung to the input values indexed in its memoization memory, checking for a previously stored execution result for that rung. If the result of that particular rung execution is already stored in the memory, the cached value is used to update the output values and the execution unit is not notified, so that the rung execution is skipped. Otherwise, the rung is passed to the execution unit which, once it finishes the execution, notifies the memoization checker to cache the results for future use. Similarly to the VE improvement, the memoization checker and the execution unit are separate components running independently of each other. Therefore, a memoization search does not affect a rung execution and vice-versa.
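The memoization lookup can be pictured as in the sketch below, keyed by the rung identifier and the rung's masked input values. The table organization, its indexing, and the callback standing in for the execution unit are assumptions made for illustration, since the paper leaves the memoization memory organization and size open.

import java.util.HashMap;
import java.util.Map;
import java.util.function.IntUnaryOperator;

// Sketch of the SE memoization checker: before dispatching a rung, look up a
// previously stored result for the same (rung, masked inputs) pair and skip
// the execution on a hit. The HashMap stands in for the memoization memory.
public class MemoizationChecker {
    private final Map<Long, Integer> memo = new HashMap<>(); // (rung id, inputs) -> cached outputs

    private static long key(int rungId, int maskedInputs) {
        return ((long) rungId << 32) | (maskedInputs & 0xFFFFFFFFL);
    }

    /** Returns the rung's output bits, executing the rung only on a memoization miss. */
    int run(int rungId, int inputs, int inputMask, IntUnaryOperator executeRung) {
        int masked = inputs & inputMask;              // only the inputs the rung actually reads
        long k = key(rungId, masked);
        Integer cached = memo.get(k);
        if (cached != null) return cached;            // hit: reuse stored outputs, skip execution
        int result = executeRung.applyAsInt(masked);  // miss: run the rung and cache the result
        memo.put(k, result);
        return result;
    }
}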
Figure 6: SE improved execution core

Figure 7 presents the execution core with both VE and SE improvements. The VE+SE improved execution core reuses
The execution unit, aided by a register bank and a small control block, is responsible for fetching the instructions from the program memory, decoding them, and executing them according to the decoded value. For this purpose, a specific ISA (instruction set architecture) was developed for the execution unit to meet the requirements of PLC programs (Table II).

Table II: Instructions of the developed ISA

Instruction             Type   Format                 Cycles
And 1 bit               R      AND-1 RD RS1 RS2       3
Or 1 bit                R      OR-1 RD RS1 RS2        3
Not 1 bit               R      NOT-1 RD RS            3
Load single input       M      LI-1 RD INPUT ID       5
Store single output     M      SO-1 RS OUTPUT ID      4
Done                    C      DONE                   3

As can be observed in Table II, each instruction is characterized by a type, a format, and a number of execution cycles. Instructions of register type (R) define operations on the register bank, a memory type instruction (M) executes a data transfer from/to the memory unit, while a control type instruction (C) deviates the execution unit from its standard behavior. The instruction format defines how each instruction is organized: it is subdivided into an instruction code (represented by a mnemonic) and zero to three arguments. Finally, since the execution unit was developed as a multi-cycle architecture, each instruction takes a different number of cycles to execute. The AND-1, OR-1, and NOT-1 register type instructions execute, respectively, a boolean AND, OR, and NOT operation on the register bank. Each of these instructions has a destination register (RD) and one or two source registers (RS). All register type instructions require three cycles to fetch, decode, and execute.
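As an illustration of how a rung maps onto this ISA, the first rung of Figure 1 (O/0 driven by I/0 and the normally closed contact of I/1) could be encoded as below. The register allocation, operand order, and comment delimiter are assumptions, since the paper does not list generated code; the per-instruction cycle counts are those of Table II, and the closing DONE is added per rung only in the improved cores (in the standard core a single DONE closes the whole program).

LI-1  R1 I/0      ; load input I/0 into R1             (5 cycles)
LI-1  R2 I/1      ; load input I/1 into R2             (5 cycles)
NOT-1 R2 R2       ; R2 := NOT I/1                      (3 cycles)
AND-1 R3 R1 R2    ; R3 := I/0 AND (NOT I/1)            (3 cycles)
SO-1  R3 O/0      ; store R3 into output O/0           (4 cycles)
DONE              ; end of this rung's fragment        (3 cycles)

Under these assumptions, this rung fragment takes roughly 23 cycles in the execution unit.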
The LI-1 instruction loads the value of the input identified by INPUT ID, from the inputs register located in memory, into a destination register (RD). This type of instruction takes five cycles to accomplish the fetch, decode, address computation, memory register read, and write-back phases. The SO-1 instruction stores the value of a source register (RS) into the outputs register located in memory, at the given output identification (OUTPUT ID). Since the LI-1 and SO-1 instructions are both of memory type, the first three cycles needed to execute the SO-1 instruction are the same as for the LI-1 instruction, whereas the fourth and last cycle of the SO-1 instruction performs the write to the memory outputs register. The only control type instruction defined in the ISA is the DONE instruction. This instruction plays an essential role in the execution unit by notifying the execution core control block that the execution of a rung, or of the entire program, has finished, depending on the execution core organization. Similarly to the register type instructions, control type instructions need three cycles to fetch, decode, and execute. When a program in the developed ISA is targeted at the standard execution core, a DONE instruction must be added as the last instruction. To adapt the program to the other execution cores, a minor modification must be made, which consists of adding a DONE instruction at the end of each rung's instructions. This adjustment is required because, in the proposed improved cores, it is not mandatory that all rungs are executed in every scan cycle. In other words, the DONE instruction marks the end of the execution step of the scan cycle in the standard architecture and the end of a single rung execution in the other execution cores. Besides the additional DONE instructions, the proposed VE, SE, and VE+SE execution cores require extra data in the program memory to achieve a performance boost, differently from the program memory structure of the standard execution core, which needs only the instructions. As can be observed in Figure 8, the VE execution core requires additional program memory fields: the quantities of inputs and rungs, which determine the addresses and sizes of the other fields; the preprocessed inputs' verification order, in which the edge detector should check for edges; the identification of the rungs associated with each input, used by the rung scheduler to determine which rungs should be executed when an input edge is detected; the initial addresses of the program fragments corresponding to each rung, which are passed from the rung scheduler to the execution unit; and the program instructions themselves, in the developed ISA. Not only the execution unit consumes cycles: the entities that compose the VE improved execution core presented in Figure 5 also demand some cycles to trigger execution. The edge detector takes three cycles to check that there are no edges in all inputs, plus three additional cycles for each input in the verification list: cycle 1 loads the global or specific input values, cycle 2 determines whether there is an edge on the input value, and cycle 3 decides whether to dispatch the input identification to the rung scheduler.
Figure 8: Program memory structure of the VE core
At its side, the rung scheduler needs four cycles to schedule a rung for execution: cycle 1 detects that the FIFO is not empty, cycle 2 loads the number of rungs associated with the tested input, cycle 3 reads the rung's initial address, and cycle 4 feeds the rung's initial address into the execution core. As detailed in Figure 9, the program memory structure of the SE execution core contains the following fields: the quantity of rungs, which determines the addresses and sizes of the other fields; the rungs' input masks, used to isolate the inputs that should be checked by the memoization checker during the verification of each rung; the rungs' output masks, used to modify only the outputs related to a specific rung; the initial addresses of the program fragments corresponding to each rung, which are passed from the memoization checker to the execution unit once it detects a memoization miss; and the program instructions themselves, in the developed ISA. Similarly to the entities of the VE improved execution core, the memoization checker of the SE improved execution core presented in Figure 6 also requires some cycles to activate the execution unit. The memoization checker needs three cycles to start a rung execution: cycle 1 masks the status of the inputs for the memoization check of the current rung, cycle 2 checks the memoization memory for a cached value, and cycle 3 decides whether to dispatch the execution or to reuse the cached memoization result. Finally, the VE+SE improved core program memory structure is a merge of the presented VE and SE structures, since this execution core shares the same units as the VE and SE improved execution cores, in a different organization. The only difference is that the rung scheduler takes three cycles instead of four to dispatch the rung identification to the memoization checker since, as may be observed in Figure 7, it does not need to retrieve the initial address as it does in the VE execution core.
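A compact way to picture the extra program memory contents of the VE core (Figure 8) is the record sketched below; the field names and Java types are illustrative assumptions, and only the list of fields follows the text. The SE structure of Figure 9 would replace the verification order and per-input rung lists with per-rung input and output masks.

// Illustrative layout of the VE core program memory described in Figure 8.
// Field names and types are assumptions; only the set of fields follows the text.
public class VeProgramMemory {
    int inputCount;            // quantity of inputs (fixes addresses/sizes of the fields below)
    int rungCount;             // quantity of rungs
    int[] verificationOrder;   // preprocessed order in which the edge detector checks inputs
    int[][] rungsPerInput;     // rung identifiers associated with each input (rung scheduler)
    int[] rungStartAddress;    // initial address of each rung's program fragment
    int[] instructions;        // the program itself, encoded in the developed ISA
}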
VE+SE execution core is calculated as the arithmetic mean of the simulation results obtained for the forty-five possible combinations of input edges and memoization occurrences. Although this research uses the arithmetic mean as a standardization parameter, it is noteworthy that not all input edge and/or memoization occurrences have the same probability of happening. This measurement choice was made to capture the overall performance without introducing the bias that could arise from selecting a best- or worst-case scenario in the simulation runs. As didactic test cases, along with example 1, presented in Figure 1, which is a balanced example with rungs of the same size, and example 2, presented in Figure 2, which contains rungs of unbalanced sizes, a third test case with larger rungs, shown in the ladder diagram of example 3 in Figure 10, was also used for evaluation purposes.

Figure 9: Program memory structure of the SE core
V. EXPERIMENTAL EVALUATION
To evaluate the proposed architectures, a dedicated cycle-accurate simulation (CAS) software was developed. The simulator was designed in such a way that it can execute the standard execution core as well as the three proposed improved execution cores. Simulation was chosen as the validation procedure due to its flexibility in handling the complexity of the improved execution cores, with their several parameters: input count, rung count, rung size in cycles, and edge and/or memoization occurrences. Another reason for adopting a software simulator is that it can easily be modified to execute batches of experimental runs and can generate data organized according to the needs of the validation process. The CAS was developed in a Java framework due to the portability and flexibility of this language, allowing the CAS core and execution batches to be programmed and executed in a reasonable time. Since the CAS is independent of the coding language and the cycle period is the main characteristic of the simulator, a natural transition to an HDL, initially in an FPGA environment, is possible while keeping the benefits of the developed improvements. The results from the simulations of the VE, SE, and VE+SE execution cores were compared against each other using the standard execution core as the baseline. The total duration of the execution cycle of a program was calculated as the arithmetic mean of the number of execution cycles obtained for all the possible combinations of input edges and/or rung memoization. For instance, in example 1, presented in Figure 1, there are eight possible combinations of edge occurrences in its three inputs, so the duration of the execution cycle for the VE execution core is calculated as the arithmetic mean of the eight simulation results obtained for these eight combinations. Similarly, the duration of the SE execution cycle is calculated as the arithmetic mean of the simulation results corresponding to the eight possibilities of hit or miss in the memoization memory for each of the three rungs. Finally, the duration of the execution cycle of the
Figure 10: Ladder diagram - example 3

The results regarding the arithmetic mean of the execution cycles and the speedup for the selected didactic test cases are summarized in Tables III and IV. These results show that a substantial speedup is achieved even for example 1, which is the simplest test case. These initial results also suggest that the higher the complexity of the rungs, the larger the achieved speedup, since the best results are obtained with example 3, which has the largest and most complex rungs. This trend is further evaluated in additional experiments, shown later in this section.

Table III: Mean execution cycles of didactic test cases

            Example 1   Example 2   Example 3
Standard    69.00       114.00      109.00
VE          59.25       102.06       90.75
SE          40.50        65.50       60.50
VE+SE       47.53        74.36       64.85

Table IV: Mean speedup of didactic test cases

          Example 1   Example 2   Example 3
VE        1.16        1.12        1.20
SE        1.70        1.74        1.80
VE+SE     1.45        1.53        1.68

Tables V, VI, and VII present the results for the speedup
probability, average positive speedup, and average negative speedup, respectively. The speedup probability represents the percentage of all possible executions that have a positive speedup when compared to the standard core execution. The average positive/negative speedup represents the arithmetic mean of the speedups of all executions that result in a positive or negative speedup when compared to the standard core execution. Analyzing the data in these tables, it is noteworthy that, except for VE, all proposed execution cores have a high probability of positive speedup; and even for VE, which has the highest probability of negative speedup, the values of those negative speedups are always low.

Table V: Speedup probability of didactic test cases

          Example 1   Example 2   Example 3
VE        0.50        0.38        0.53
SE        0.88        0.94        0.88
VE+SE     0.91        0.95        0.91

Table VI: Average positive speedup of didactic test cases

          Example 1   Example 2   Example 3
VE        1.61        1.60        1.59
SE        1.92        1.84        2.05
VE+SE     1.57        1.60        1.84

Table VII: Average negative speedup of didactic test cases

          Example 1   Example 2   Example 3
VE        -0.09       -0.05       -0.06
SE        -0.04       -0.04       -0.03
VE+SE     -0.19       -0.14       -0.13

Figure 11: Results - program size vs. performance
In complement to Figure 11, analyzing the scan time reduction in Figure 12, the exponential speedup provided by the proposed execution cores is noticeable. With superior performance in this respect, the VE and VE+SE execution cores reach a top reduction of 95% when compared to the standard execution core. Moreover, the same graph also shows that the SE speedup is not as scalable as the didactic test cases had indicated. Nevertheless, the SE execution core reaches a top scan time reduction of around 43%.
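For reference, scan time reduction and speedup are two views of the same measurement, related by

\text{reduction} \;=\; 1 - \frac{T_{\text{improved}}}{T_{\text{standard}}} \;=\; 1 - \frac{1}{\text{speedup}},

so the 95% reduction reached by VE and VE+SE corresponds to a speedup of about 20x, while the roughly 43% reduction of SE is consistent with the SE speedup plateau around 175% (a ratio of about 1.75x) observed in the batch results.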
From the results of these small didactic examples, it is possible to identify a trend that SE is the best execution core regarding speedup, while VE+SE is the best one regarding speedup probability. However, it would be simplistic to draw firm inferences from such a small design space. To reach more robust conclusions, this study evaluated the proposed execution cores in a simulation batch ranging from 1 to 8 inputs, 5 to 20 rungs, and 10 to 200 cycles to execute each rung. The number of simulated programs was limited to 100,000 for each group of those criteria, due to the impossibility of simulating, in reasonable computation time, all the 1.532E+74 possibilities arising from the combinations of numbers of inputs, rungs, rung execution cycles, and edge and/or memoization occurrences. Also, programs with fewer than 5 rungs were not included in the batch, since they have fewer than 100,000 possible different programs and could therefore bias the results. The results in the graph presented in Figure 11 confirm the hypothesis raised when analyzing the didactic test cases: the higher the complexity of the rungs, the more significant the speedup. The graph also shows that the SE execution core has less speedup scalability than VE and VE+SE, since it does not benefit from scenarios with no input edges. Also, despite the very close behavior of VE and VE+SE, it is possible to state that the memoization memory boosts performance in VE+SE, by observing that it outperforms VE in programs with fewer than 600 cycles.
Figure 12: Results - program size vs. scan time reduction

The program size versus performance (positive/negative speedup) graphs shown in Figures 13a, 13d, and 13g also provide evidence that the higher the complexity of the rungs, the better the achieved positive speedup. Although these graphs reinforce this finding, SE does not turn out to be the best execution core, differently from what the three didactic test cases suggested. SE shows, in fact, a speedup plateau around 175%. On the other hand, VE and VE+SE have an exponential speedup growth as a function of the program size. At last, similarly to the
Figure 13: Batch results graphs: (a) program size vs. performance VE; (b) input count vs. performance VE; (c) rung count vs. performance VE; (d) program size vs. performance SE; (e) input count vs. performance SE; (f) rung count vs. performance SE; (g) program size vs. performance VE+SE; (h) input count vs. performance VE+SE; (i) rung count vs. performance VE+SE.
trends previously identified, when there is a negative speedup, its value is low and tends to zero as the program size increases, for all proposed cores. The graphs in Figures 13b, 13e, and 13h present the relationship between input count and performance. The results show that the performance of VE and VE+SE degrades as the input count grows. However, due to the memoization boost provided by the SE part of the VE+SE execution core, it degrades a bit more slowly than VE. The almost constant speedup of the SE execution core in Figure 13e is explained by the fact that this execution core is not affected by edge occurrences and, therefore, not by the input count variation either. Figures 13c, 13f, and 13i show the graphs of the relationship between rung count and performance. The SE speedup plateau, already observed in Figure 13d, can also be observed in Figure 13f. The zigzag pattern in the VE and VE+SE results is explained by the heterogeneity of the programs used in the batch simulation. The limitation of 100,000 programs causes a larger step in the program simulation count, since the larger the rung count, the larger the number of different possible programs. A larger simulation step makes different program forms be simulated for each rung count, thus causing the zigzag pattern. This conclusion was verified by simulating the same batch with a limit of 10,000 programs, which resulted in a more intense zigzag with the same shape and trend lines as observed in the presented graphs. Therefore, it is reasonable to conclude that a larger number of tests would reduce the zigzag pattern, following the graphs' trend lines.
Subsequently, by analyzing the minimum, average, and maximum speedup probability of the batch results presented in Table VIII, it is possible to notice that the SE core is the best one in this attribute, followed by VE+SE, which has the best average speedup probability. The VE execution core was considered the worst execution core, also because of its minimum speedup value of -11%, justified by the overhead of verifying several input edges in the worst scenario. Nevertheless, the VE core has an average performance similar to that of the other execution cores, benefiting from the widespread scenarios where no input variations occur, as can be observed in the results graphs of Figure 13.

Table VIII: Speedup probability of batch results

          Minimum   Average   Maximum
VE        39.00     96.44     100.00
SE        93.75     98.76      99.87
VE+SE     50.10     99.22     100.00

Finally, since all proposed execution cores have pros and cons, the selection of which improved core to use will be determined by the application requirements, leaving to the developer of the solution the choice of the most appropriate one. In any case, the proposed enhanced cores contribute a step ahead in enhancing PLC performance by reducing the scan time.

VI. CONCLUSION

Current commercial PLC solutions do not focus on performance, leading to the use of specific drivers in time-critical applications in which a deterministic response time is
REFERENCES
[1] W. Bolton, Programmable Logic Controllers. Newnes, 2015.
[2] J. W. Webb and R. A. Reis, Programmable Logic Controllers: Principles and Applications. Prentice Hall PTR, 1998.
[3] R. W. Lewis, Programming Industrial Control Systems Using IEC 1131-3. IET, 1998, no. 50.
[4] A. Suresh, B. N. Swamy, E. Rohou, and A. Seznec, "Intercepting functions for memoization: a case study using transcendental functions," ACM Transactions on Architecture and Code Optimization (TACO), vol. 12, no. 2, p. 18, 2015.
[5] M. Chmiel and E. Hrynkiewicz, "An idea of event-driven program tasks execution," IFAC Proceedings Volumes, vol. 42, no. 1, pp. 17-22, 2009.
[6] M. Chmiel, W. Kloska, D. Polok, and J. Mocha, "FPGA-based two-processor CPU for PLC," in Signals and Electronic Systems (ICSES), 2016 International Conference on. IEEE, 2016, pp. 247-252.
[7] D. Du, X. Xu, and K. Yamazaki, "A study on the generation of silicon-based hardware PLC by means of the direct conversion of the ladder diagram to circuit design language," The International Journal of Advanced Manufacturing Technology, vol. 49, no. 5, pp. 615-626, 2010.
[8] M. Chmiel, J. Kulisz, R. Czerwinski, A. Krzyzyk, M. Rosol, and P. Smolarek, "An IEC 61131-3-based PLC implemented by means of an FPGA," Microprocessors and Microsystems, vol. 44, pp. 28-37, 2016.
[9] C. F. Silva, C. Quintans, J. M. Lago, and E. Mandado, "An integrated system for logic controller implementation using FPGAs," in IEEE Industrial Electronics, IECON 2006 - 32nd Annual Conference on. IEEE, 2006, pp. 195-200.
[10] S. L. Carrillo, A. Z. Polo, and M. P. Esmeral, "Design and implementation of an embedded microprocessor compatible with IL language in accordance to the norm IEC 61131-3," in Reconfigurable Computing and FPGAs, 2005. ReConFig 2005. International Conference on. IEEE, 2005, 6 pp.
[11] J. B. Dennis, "Data flow supercomputers," Computer, no. 11, pp. 48-56, 1980.
[12] O. Pell, O. Mencer, K. H. Tsoi, and W. Luk, "Maximum performance computing with dataflow engines," in High-Performance Computing Using FPGAs. Springer, 2013, pp. 747-774.
[13] F. N. Najm, Circuit Simulation. John Wiley & Sons, 2010.
[14] H. Zhuang, X. Wang, Q. Chen, P. Chen, and C.-K. Cheng, "From circuit theory, simulation to SPICE Diego: A matrix exponential approach for time-domain analysis of large-scale circuits," IEEE Circuits and Systems Magazine, vol. 16, no. 2, pp. 16-34, 2016.
[15] S. Iqbal, S. A. Khan, and Z. A. Khan, "Benchmarking industrial PLC & PAC: An approach to cost effective industrial automation," in Open Source Systems and Technologies (ICOSST), 2013 International Conference on. IEEE, 2013, pp. 141-146.
[16] E. van der Wal, "PLCopen," IEEE Industrial Electronics Magazine, vol. 3, no. 4, p. 25, 2009.
[17] PLCopen, "TC3 - Certification - TF Benchmarking," 2006, [Online; accessed 05-November-2017]. Available: http://www.plcopen.org/pages/tc3_certification/benchmarking/index.htm
[18] A. Milik, "On hardware synthesis and implementation of PLC programs in FPGAs," Microprocessors and Microsystems, vol. 44, pp. 2-16, 2016.
[19] A. Milik and E. Hrynkiewicz, "On translation of LD, IL and SFC given according to IEC-61131 for hardware synthesis of reconfigurable logic controller," IFAC Proceedings Volumes, vol. 47, no. 3, pp. 4477-4483, 2014.
[20] S. Subbaraman, M. M. Patil, and P. S. Nilkund, "Novel integrated development environment for implementing PLC on FPGA by converting ladder diagram to synthesizable VHDL code," in Control Automation Robotics & Vision (ICARCV), 2010 11th International Conference on. IEEE, 2010, pp. 1791-1795.
[21] M. M. Patil, S. Subbaraman, and P. S. Nilkund, "IEC control specification to HDL synthesis: Considerations for implementing PLC on FPGA and scope for research," in Control Automation and Systems (ICCAS), 2010 International Conference on. IEEE, 2010, pp. 2170-2174.
needed. As a solution, this paper presented a novel PLC architecture proposal to achieve a performance boost by substantially reducing the scan time. Based on the paradigms of data flow machines and digital logic simulation theory, as well as on cache memories and the memoization technique, the developed improved execution cores show, in the best case, a reduction of 95% in the scan time. Moreover, the evaluation of the presented solution demonstrates that the speedup grows with the increase in the program size. An in-depth analysis of the results of the proposed cores makes it possible to conclude that each of the developed execution cores has its pros and cons. The VE improved core is the best solution for a monotonous system in which few input edges occur, with substantial periods of idle time. Using the VE proposal in this type of system compensates for its low speedup probability and benefits from its moderate use of memory resources. For systems with a high incidence of repeatable edges, the SE improved core is the best choice, since the most common edge occurrences are cached, thus keeping the scan time short and compensating for the extra use of program and memoization memory. Finally, the VE+SE improved core is the best option for systems that alternate between idle periods and moments of intense input variation, since it has the best average speedup probability, which compensates for its larger resource consumption compared to the other execution cores. Efforts to improve PLC performance like the one presented in this article are a relevant scientific and technological contribution, and several future works can unfold from this proposal. A priority for future work is to modify the execution unit from a multi-cycle implementation to a pipelined or superscalar design. Along the same line, similarly to modern multi-core architectures, a way to further boost the obtained performance enhancements is to use several execution units in parallel, reducing the PLC scan time even more. Another study direction opened by this research is the extension of the proposed architecture so that it can dynamically switch between the proposed execution cores. This switch can be performed at execution time, depending on the workload, or offline, by the application developer, who can select one of the improvements according to the project or application requirements. Lastly, a more extensive simulation batch, along with experiments in a real hardware environment obtained by porting the CAS core to an HDL on an FPGA, can contribute further evidence of how the proposed PLC improvements can help keep PLCs as the central units of industrial automation systems.
Laurence Crestani Tasca is currently an MSc student in Computer Science at UFRGS, holding a bachelor's degree in Computer Science from UCS (2011) and an Industrial Automation Technician diploma from SENAI (2003). He has over 15 years of experience in hardware and software design of industrial embedded systems, working in several industrial sectors, such as electronic signs, elevators, and weight measurement systems, and developing products for Brazilian and multinational companies that are today present on five continents. His main research interest is the design of architectures for industrial automation systems.
Prof. Dr. Edison Pignaton de Freitas holds a PhD in Computer Science and Engineering from Halmstad University, Sweden (2011), an MSc in Computer Science from UFRGS, Brazil (2007), and a Computer Engineering degree from the Military Institute of Engineering, Brazil (2003). He currently holds an associate professor position at UFRGS, affiliated with the Graduate Programs in Electrical Engineering and Computer Science, working in several research areas, mainly Embedded Systems, Real-Time Systems, Industrial Automation, Wireless Sensor Networks, and Unmanned Systems. He was an invited professor at Shanghai Dianji University, China (2017).
PT
Fl´avio R. Wagner retired in 2017 as Full Professor of the Institute of Informatics of the Federal University of Rio Grande do Sul (UFRGS), in Porto Alegre, Brazil, where he served as Dean from 2006 to 2011. He received a PhD degree in Computer Engineering from the University of Kaiserslautern, Germany (1983), and an MSc degree in Computer Science (1977) and a BS degree in Electrical Engineering (1975), both from UFRGS. He was the President of the Brazilian Computer Society (SBC) from 1999 to 2003. His main research interest is the design and architecture of electronic embedded systems.