MICPRO 2212
No. of Pages 12, Model 5G
13 May 2015 Microprocessors and Microsystems xxx (2015) xxx–xxx 1
Contents lists available at ScienceDirect
Microprocessors and Microsystems journal homepage: www.elsevier.com/locate/micpro
Hardware software partitioning of control data flow graph on system on programmable chip
Mehdi Jemai ⇑, Bouraoui Ouni
Laboratory Electronic and Microelectronic, The National Engineering School of Monastir, University of Monastir, Monastir 5000, Tunisia
Article info
Article history: Available online xxxx
Keywords: Hardware–software partitioning; SOPC; Control data flow graph; Co-design

Abstract
A System On Programmable Chip (SOPC) is a circuit that integrates all the components of an electronic system on a single chip. It may consist of memories, one or more microprocessors, interface devices, configurable logic blocks and any other components needed to achieve an intended function. In this paper, we propose a new hardware–software partitioning algorithm for control data flow graphs targeting SOPCs. The main aim of our algorithm is to find the best compromise between the hardware and software implementation of operations, in order to satisfy the design constraints of the target application in terms of latency and hardware resources. Our algorithm has been evaluated on a real hardware device: the experiments were carried out on a Virtex-5 FPGA. The results show that our algorithm provides a better performing system with the lowest possible cost compared to existing approaches. © 2015 Elsevier B.V. All rights reserved.
1. Introduction
Modern FPGAs have become far more sophisticated than their predecessors. A single device can hold a central processing unit (CPU) together with several RAM and DSP blocks. Recent FPGA devices (such as the Xilinx Virtex family) include two subsets of resources: software resources and hardware resources. The software resources may be one or more hard processors or DSPs (for example, the IBM PowerPC in Xilinx devices or the AVR in Atmel devices). The hardware resources may be transceivers, analog blocks, multiply–accumulate modules (MACs), RAM blocks and Configurable Logic Blocks (CLBs). A current FPGA is therefore a programmable System On Chip (SOC) and is called a System On Programmable Chip (SOPC). Nowadays, many designers prefer to implement their applications on an SOPC. These applications must meet design constraints such as performance, flexibility and time to market. Co-design efficiency is often tied to the hardware–software partitioning task. Hardware–software partitioning is a system-level partitioning problem: it assigns the operations of the application either to the hardware part or to the software part of the SOPC in order to obtain faster processing at the lowest cost. In this paper, we aim to solve the following problem: given a control data flow graph and an SOPC circuit, find a hardware–software partitioning of the graph on the SOPC that gives a better compromise between the hardware resources used to implement the target graph and its overall latency.

Our algorithm is based on a function called generating of partition object. At each iteration of the algorithm, this function returns a sub-graph of the original graph, called a partition object. Next, we compute the hardware–software latency and the hardware area cost of each generated partition object. This procedure is repeated until the hardware–software latency of one of the generated partition objects comes close to its hardware area cost. In other words, the function generates many partition objects (e.g. 1% hardware and 99% software, 25% hardware and 75% software, etc.). For each partition object, the latency value is calculated and a curve is drawn from these values. The area cost is computed as well and a second curve is produced. When the two curves are superimposed, a unique intersection point is obtained. That point identifies the best partition object, which is the one that should be used.

⇑ Corresponding author. E-mail addresses: [email protected] (M. Jemai), [email protected] (B. Ouni).
http://dx.doi.org/10.1016/j.micpro.2015.04.006
0141-9331/© 2015 Elsevier B.V. All rights reserved.
Please cite this article in press as: M. Jemai, B. Ouni, Hardware software partitioning of control data flow graph on system on programmable chip, Microprocess. Microsyst. (2015), http://dx.doi.org/10.1016/j.micpro.2015.04.006

2. Related works

In the literature, many designers have proposed hardware–software partitioning algorithms for System On Chip (SOC) [1,2]. In early works, hardware–software partitioning was carried out manually [3]. In reality, however, hardware–software partitioning is more complicated, and many requirements on cost, hardware effort, power dissipation and timing performance have to be taken into consideration. Efforts have therefore increased to automate the hardware–software partitioning task. For that purpose, many optimization methods have been used to come up with new algorithms, such as exact algorithms based on integer linear programming [4], dynamic programming [5] and branch-and-bound [6]. However, exact algorithms are very slow and can be applied only to small graphs. Hence, to overcome their drawbacks, researchers have turned to more flexible and efficient heuristic algorithms. Traditional heuristic algorithms are either software-oriented or hardware-oriented. In the software-oriented approach, the initial implementation of the whole application is supposed to be a software solution. Then, during partitioning, the operations of the application are migrated to hardware until the constraints are met [7]. Moreover, other hardware–software partitioning approaches based on simulated annealing [8–11], tabu search [12] and greedy algorithms [13,14] have been presented. Besides, some custom heuristics, such as expert systems [15], are also appropriate for the hardware–software partitioning problem.

In [16], the authors used a genetic algorithm to solve the hardware–software partitioning problem. They start with a set of random possible solutions called the population and select the best solution of the initial population. They then generate a new population from the previous one using techniques inspired by natural evolution, such as crossover and mutation. At each iteration, they compare the best solution of the current population to the best solution of the previous population and decide whether to stop or to go on. The algorithm returns the best solution of the latest population as the solution of the partitioning problem. In [17], the authors proposed a heuristic based on the genetic algorithm, called the combined algorithm. The combined algorithm starts with a set of initial random possible solutions. At each iteration, the cost of each solution is evaluated. Next, crossover and mutation are used to build a new population from the current one, and the cost is re-evaluated for the new generation in the same way. This process is repeated until the cost of one solution comes close to a predefined cost value.

In addition, the authors in [18] studied a hybrid of Tabu Search (TS) and the Genetic Algorithm (GA) to solve the hardware/software partitioning problem. Moreover, in [19] the authors proposed a new optimization technique based on scheduling and partitioning methods. That approach was developed to satisfy design constraints such as performance, reliability and design cost. Finally, the authors in [20] came up with a new heuristic solution based on partitioning algorithms for multi-processor systems on chip. The algorithm assigns priorities to tasks according to their out-degree and software execution time. First, the algorithm looks for the critical path in the graph; then it assigns the task with the highest benefit-to-area ratio to a hardware implementation. During the iterations, the available hardware area and the critical path are updated. The process continues until the available hardware area is insufficient to implement a software task lying on the critical path.

Fig. 1. The three kinds of partition objects: single, mix and larger partition objects.
Fig. 2. Pseudo-code of the generating of partition objects function.
Fig. 3. Graph G.
Fig. 4. The nesting level and the junction of a node.
Fig. 5. Generating of single partition objects.
Fig. 6. Generating of a larger partition object.
Fig. 7. Generating of a mix partition object (panels a to f).
3. Basic definitions
3.1. Control data flow graph
A control data flow graph $G(V, E)$ is a directed acyclic graph that describes the dependencies between the operations of an application, where $V = \{v_1, v_2, \ldots, v_n\}$ is the set of nodes and $E = \{e_{ij} \mid 1 \le i, j \le n\}$ is the set of edges. There are three types of nodes:

1- a node that contains straightforward code (no control constructs): $v_i^B$ with $v_i^B \in V_B$, $V_B \subseteq V$;
2- a node that contains the beginning of a control construct: $v_i^S$ with $v_i^S \in V_S$, $V_S \subseteq V$;
3- a node that contains the end of a control construct: $v_i^E$ with $v_i^E \in V_E$, $V_E \subseteq V$.

An edge $e_{ij} \in E$ is directed from node $v_i$ to node $v_j$; node $v_i$ is called the predecessor of node $v_j$, and node $v_j$ is called the successor of node $v_i$. Furthermore, there are two particular nodes: the start node of the graph, called the source, and the end node of the graph, called the sink.

Table 1. Node parameters.
Node        V1   V2   V3   V4   V5   V6   V7   V8   V9   V10
L_H(v_i)     4    3    3    4    4    2    3    2    3     4
L_S(v_i)     8    7    8    9   10    5    8    7   10     9
A(v_i)       6    5    5    6    7    4    8    5    3     7
3.2. Parameters of node
Given a node $v_i$ with $v_i \in V$:

$L_H(v_i)$ is the hardware latency of node $v_i$.
$L_S(v_i)$ is the software latency of node $v_i$.
$A(v_i)$ is the hardware area of node $v_i$.
Fig. 8. Pseudo-code of the proposed algorithm.

Table 2. Values of F1(Pi) and F2(Pi).
Partition object   P1   P2   P3   P4   P5   P6   P7   P8   P9   P10   P11   P12   P13   P14
F1(Pi)             72   71   70   73   71   71   60   55   55    50    39    35    34    30
F2(Pi)              6    5    7    4    5    7   25   30   30    35    43    49    50    56

Fig. 9. Curves of functions F1(Pi) and F2(Pi).

3.3. Nesting level ST

We define the nesting level $ST(v_i)$ of a node $v_i$ as follows:

$$ST(v_i) = \begin{cases} 1 & \text{if } v_i \text{ is the start node;} \\ ST(pred(v_i)) + 1 & \text{if } pred(v_i) \in V_S,\ v_i \notin V_E; \\ \max\{ST(pred(v_i))\} - 1 & \text{if } v_i \in V_E; \\ ST(pred(v_i)) & \text{else.} \end{cases}$$

As can be seen, $ST(v_i)$ equals one at the beginning of a procedure/function. It increases by one when a new control construct, such as an if–then–else or a loop construct, is entered, and it decreases by one when a control construct is exited. It remains constant at all other nodes.
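The nesting-level rule of Section 3.3 can be traced on a small example. The following Python sketch is illustrative only: the five-node chain, the node names and the kind labels "B", "S" and "E" (straightforward code, start of a control construct, end of a control construct) are hypothetical, and for a node with several predecessors the sketch assumes that the maximum over the predecessors is taken, which the definition does not spell out.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def nesting_levels(preds, kind):
    """Return ST(v) for every node of a DAG.

    preds: dict mapping each node to the set of its predecessors.
    kind:  dict mapping each node to "B", "S" or "E".
    """
    st = {}
    # static_order() yields every node after all of its predecessors.
    for v in TopologicalSorter(preds).static_order():
        p = preds.get(v, set())
        if not p:                                 # start node
            st[v] = 1
        elif kind[v] == "E":                      # a control construct is exited
            st[v] = max(st[u] for u in p) - 1
        elif any(kind[u] == "S" for u in p):      # a control construct is entered
            st[v] = max(st[u] for u in p if kind[u] == "S") + 1
        else:                                     # level unchanged
            st[v] = max(st[u] for u in p)
    return st


# Hypothetical chain: v1 -> v2 (if) -> v3 (body) -> v4 (end if) -> v5
preds = {"v1": set(), "v2": {"v1"}, "v3": {"v2"}, "v4": {"v3"}, "v5": {"v4"}}
kind = {"v1": "B", "v2": "S", "v3": "B", "v4": "E", "v5": "B"}
levels = nesting_levels(preds, kind)
print(levels)
```

In this hypothetical chain only v3, the body of the construct, is at level 2; every other node stays at level 1.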
3.4. Partition object Pi
A partition object Pi is a sub-graph G(Pi) of the original graph G such that all nodes of G(Pi) are assigned to the hardware part of the architecture. There are three kinds of partition objects, as shown in Fig. 1.

(1) Single partition object: contains a single node (a simple instruction, such as an addition or a multiplication).
(2) Larger partition object: contains whole control constructs, e.g. the nodes from loop to end loop, from if to end if, or from case to end case, or possibly whole functions/procedures.
(3) Mix partition object: can be two single partition objects, two larger partition objects, or a single and a larger partition object.

Fig. 10. Hardware–software partitioning of nodes.
3.5. Partitions object generating
The generating of partition objects of a graph G is its division into a set, POG, of partition objects Pi such that

$$PO_G = \sum_{i=1}^{n} P_i \qquad (1)$$

where n is the number of partition objects.
Fig. 11. 16 DCT data flow graph.
Fig. 12. Vector products.

Table 3. Characteristics of 16 DCT and 32 DCT graphs.
Graph     Operations (nodes)   Multiplication operations (*)   Addition operations (+)
16 DCT    224                  128                             96
32 DCT    448                  256                             192

Fig. 13. Curves of F1(Pi) and F2(Pi) for 16 DCT graph.
Fig. 14. Curves of F1(Pi) and F2(Pi) for 32 DCT graph.

3.6. Hardware area cost

We define the hardware area cost of partition object Pi as follows:

$$A_{HW}(P_i) = \sum_{v_i \in G(P_i)} A(v_i) \qquad (2)$$

where G(Pi) is the sub-graph corresponding to partition object Pi (see Fig. 2). In the same way, the hardware area cost of a graph G (if all nodes of the graph are assigned to the hardware part of the architecture) is defined as follows:

$$A_{HW}(G) = \sum_{v_i \in G} A(v_i) \qquad (3)$$
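As a small worked illustration of Eqs. (2) and (3), the sketch below sums the node areas of Table 1. The function name and the dictionary layout are illustrative choices, not part of the paper.

```python
# Hardware area A(v_i) of each node, copied from Table 1.
AREA = {"V1": 6, "V2": 5, "V3": 5, "V4": 6, "V5": 7,
        "V6": 4, "V7": 8, "V8": 5, "V9": 3, "V10": 7}


def hardware_area_cost(nodes, area=AREA):
    """Sum of A(v_i) over the given nodes, as in Eqs. (2) and (3)."""
    return sum(area[v] for v in nodes)


single = hardware_area_cost(["V1"])   # partition object holding only V1
whole = hardware_area_cost(AREA)      # every node in hardware: A_HW(G)
print(single, whole)
```

With the Table 1 values, the single-node object gives 6, the value listed for F2(P1) in Table 2, and the whole graph gives 56, the value listed for F2(P14).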
Table 4. Design results for 16 DCT.
N-P       24,800
R-T       16.515 ms
H-R       Operations (*) = 76; operations (+) = 53
S-R       Operations (*) = 52; operations (+) = 43
Po        15,370
L-G       2792 ns
%(H-R)    37.04%

Table 5. Design results for 32 DCT.
N-P       98,752
R-T       112 ms
H-R       Operations (*) = 163; operations (+) = 93
S-R       Operations (*) = 93; operations (+) = 99
Po        61,160
L-G       5556 ns
%(H-R)    76.79%

Table 6. Characteristics of 16 FFT and 64 FFT graphs.
Graph     Operations (nodes)   Subtraction operations (−)   Addition operations (+)
16 FFT    64                   32                           32
64 FFT    256                  128                          128

Table 7. Design results for 16 FFT graph.
N-P       2112
R-T       0.902 ms
H-R       Operations (−) = 19; operations (+) = 20
S-R       Operations (−) = 13; operations (+) = 12
Po        1397
L-G       648 ns
%(H-R)    1.6%

Table 8. Design results for 64 FFT graph.
N-P       33,024
R-T       25.313 ms
H-R       Operations (−) = 75; operations (+) = 79
S-R       Operations (−) = 53; operations (+) = 49
Po        1397
L-G       2608 ns
%(H-R)    6.42%

Fig. 15. 16-FFT data flow graph.
Fig. 16. Basic butterfly computation in FFT algorithm.
Fig. 17. Curves of F1(Pi) and F2(Pi) for 16 FFT graph.
Fig. 18. Curves of F1(Pi) and F2(Pi) for 64 FFT graph.

3.7. Critical path, CPG, of graph G

The critical path, CPG, of a graph G is the longest path from its source node to its sink node.

We define the hardware latency of the critical path (if all nodes of the critical path are assigned to the hardware part of the architecture) as follows:

$$L_H(CP_G) = \max_{v_i \in G} \Big[ (1 - X_m(i))\, L_H(v_i) + \sum_{v_j \in G} (1 - X_m(j))\, b_{ij}\, L_H(v_j) \Big] \qquad (4)$$

where

$$b_{ij} = \begin{cases} 1 & \text{if } v_j \text{ depends on } v_i; \\ 0 & \text{else.} \end{cases}$$

We define the software latency of the critical path (if all nodes of the critical path are assigned to the software part of the architecture) as follows:

$$L_S(CP_G) = \sum_{v_i \in CP_G} L_S(v_i) \qquad (5)$$

3.8. The junction of node $v_i$

We define the junction of node $v_i$ as follows:

$$J(v_i) = \begin{cases} 1 & \text{if } v_i \text{ is the begin or the end of a control construct;} \\ 0 & \text{else.} \end{cases}$$

4. Proposed hardware–software algorithm

Our algorithm is based on a function called generating of partition object. At each iteration of the algorithm, this function returns a sub-graph (partition object). Next, in the same iteration, our algorithm computes and plots the following functions for each generated partition object:

$$F_1(P_i) = \big(L_S(CP_G) - L_S(CP_{G(P_i)})\big) + L_H(CP_{G(P_i)}) \qquad (6)$$

F1(Pi) is the hardware–software latency of partition object Pi, where $CP_{G(P_i)}$ is the critical path of the sub-graph G(Pi) corresponding to partition object Pi.

$$F_2(P_i) = A_{HW}(P_i) \qquad (7)$$

F2(Pi) is the hardware area cost of partition object Pi. Our algorithm repeats this procedure until it generates a partition object such that:

$$A_{HW}(P_i) = A_{HW}(G) \qquad (8)$$

The intersection between the curve of function F1(Pi) and the curve of function F2(Pi) gives the best trade-off between the latency and the hardware resources of the target graph.
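The hardware–software latency F1 of Section 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation (which the paper reports was written in Java): the critical-path latency is computed as a longest path over the DAG, the diamond-shaped graph and its node names a to d are hypothetical, and the value 76 used for LS(CPG) in the last line is not stated in the paper but is the value implied by F1(P1) = 72 together with the latencies of V1 in Table 1.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path_latency(preds, latency):
    """Latency of the longest source-to-sink path of a DAG.

    preds:   dict mapping each node to the set of its predecessors.
    latency: dict mapping each node to its latency.
    """
    finish = {}
    for v in TopologicalSorter(preds).static_order():
        start = max((finish[u] for u in preds.get(v, ())), default=0)
        finish[v] = start + latency[v]
    return max(finish.values())


def f1(ls_cp_g, ls_cp_p, lh_cp_p):
    """Hardware-software latency of a partition object: the software latency
    of the whole critical path, minus the software latency of the part moved
    to hardware, plus the hardware latency of that part."""
    return (ls_cp_g - ls_cp_p) + lh_cp_p


# Hypothetical diamond graph: a -> b -> d and a -> c -> d
preds = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
ls = {"a": 8, "b": 7, "c": 8, "d": 9}   # software latencies
lh = {"a": 4, "b": 3, "c": 3, "d": 4}   # hardware latencies

ls_cp_g = critical_path_latency(preds, ls)   # path a -> c -> d
f1_c = f1(ls_cp_g, ls["c"], lh["c"])         # only node c moved to hardware

# Single partition object P1 = {V1} of the paper: L_S(V1) = 8, L_H(V1) = 4.
f1_p1 = f1(76, 8, 4)
print(ls_cp_g, f1_c, f1_p1)
```

The last call gives 72, the F1(P1) value reported in Table 2. F2 needs no separate code, since by definition it is the sum of the areas of the nodes in the partition object.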
4.1. Generating of partition object function
At each algorithm iteration, this function returns a sub-graph corresponding to one partition object.

4.1.1. Illustrative example
We apply the steps of our function to the graph shown in Fig. 3. First, we compute the nesting level $ST(v_i)$ and the junction $J(v_i)$ of each node $v_i$ (Fig. 4). If $\forall v_i \in V$ ($ST(v_i) \ge 1$ and $J(v_i) = 0$), we generate a single partition object. Six single partition objects, P1 to P6, are shown in Fig. 5. $\forall v_i$
4.2. Pseudo-code of the proposed algorithm
The pseudo-code of the proposed algorithm is given in Fig. 8.

4.2.1. Illustrative example
We apply the steps of our algorithm to the graph shown in Fig. 3. The parameters of the nodes are given in Table 1. At the first iteration, the generating of partition objects function returns the partition object P1 (see Section 4.1.1 and Fig. 5). We compute $F_1(P_1) = (L_S(CP_G) - L_S(CP_{G(P_1)})) + L_H(CP_{G(P_1)})$ and $F_2(P_1) = A_{HW}(P_1)$. In this case, $F_1(P_1) = 72$ and $F_2(P_1) = 6$. At the second iteration, the function returns the partition object P2. We compute $F_1(P_2) = (L_S(CP_G) - L_S(CP_{G(P_2)})) + L_H(CP_{G(P_2)}) = 71$ and $F_2(P_2) = A_{HW}(P_2) = 5$. We follow the same procedure for the other partition objects. Table 2 gives the values of F1(Pi) and F2(Pi) for all generated partition objects. Next, we draw the curves of F1(Pi) and F2(Pi) (Fig. 9). Based on Fig. 9, $F_1(P_i) \cap F_2(P_i) = P_{11}$. Therefore, the partition object P11 (solution "c" in Fig. 7) provides the best trade-off between the latency and the hardware resources of the graph. Hence, nodes V3, V4, V5, V6, V7 and V8 should be assigned to the hardware part of the architecture, and V1, V2, V9 and V10 should be assigned to the software part, as shown in Fig. 10.
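The selection step of this example can be reproduced from the values of Table 2. The paper reads the intersection off the plotted curves; the discrete rule used in the sketch below (take the first partition object whose area cost F2 reaches or exceeds its latency F1) is an assumption made here for illustration, and the function name is hypothetical.

```python
# F1(Pi) and F2(Pi) for P1..P14, copied from Table 2.
F1 = [72, 71, 70, 73, 71, 71, 60, 55, 55, 50, 39, 35, 34, 30]
F2 = [6, 5, 7, 4, 5, 7, 25, 30, 30, 35, 43, 49, 50, 56]


def crossing_partition(f1_values, f2_values):
    """Name of the first partition object where the F2 curve meets or
    crosses the F1 curve, or None if the curves never cross."""
    for i, (latency, area) in enumerate(zip(f1_values, f2_values), start=1):
        if area >= latency:
            return f"P{i}"
    return None


best = crossing_partition(F1, F2)
print(best)
```

On the Table 2 data this rule selects P11, the partition object identified in the text. P11 is also the object with the smallest gap between F1 and F2 (39 versus 43).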
Fig. 19. Blocks of H.264.
Fig. 20. Intra prediction graph.
Fig. 21. Curves of F1(Pi) and F2(Pi) for H.264.
5. Experiments
To validate our approach, we have implemented the DCT and FFT graphs on a Xilinx Virtex®-5 FPGA. The Xilinx Virtex®-5 development kit enables high-performance embedded design on Xilinx FPGAs. In our approach, the software resource is the PowerPC and the hardware resources are the configurable logic blocks (CLBs). To compute the parameters of each node and to access the PowerPC, we have used the Xilinx ISE and Xilinx EDK tools. These design tools produce resource and timing reports that give a comprehensive area and timing summary of the design. Our algorithm has been written in Java and executed under Windows 7 on an Acer PC (Intel Core 2 Duo T5500; 1.66 GHz; 1 GB of RAM). Design results are shown in Tables 4, 5, 7 and 8. The legends of these tables are:

- N-P: the number of partition objects generated by the generating of partition objects function.
- R-T: the execution time of the algorithm on the Acer PC.
- H-R: the operations assigned to the hardware resources (CLBs of the Virtex®-5).
- %(H-R): percentage of the hardware resources.
- S-R: the operations assigned to the software resources (PowerPC of the Virtex®-5).
- L-G: the latency of the graph.
- Po: the number of partition objects where $F_1(P_i) \cap F_2(P_i) \neq \emptyset$.
5.1. First example
The DCT (Fig. 11) is the most computationally intensive part of the CLD algorithm. The model proposed in [21,22] is based on 16 or 32 vector products. Thus, the entire DCT is a collection of 16 or 32 nodes "T1" and "T2". The structure of "T1" and "T2" is similar to the vector product of Fig. 12, but with different bit widths. Table 3 gives the characteristics of the 16-DCT and 32-DCT graphs. Figs. 13 and 14 show the curves of F1(Pi) and F2(Pi) for the 16-DCT and 32-DCT graphs respectively.

Tables 4 and 5 show the design results provided by our algorithm. First, our algorithm generates 24,800 partition objects in 16.515 ms for the 16-DCT graph and 98,752 partition objects in 112 ms for the 32-DCT graph. Furthermore, the results show that 129 of the 16-DCT operations and 256 of the 32-DCT operations are assigned to the hardware part of the architecture. Moreover, the latencies of the 16-DCT and 32-DCT graphs are 2792 ns and 5556 ns respectively.

Table 9. Design results for H.264.
N-P       35
R-T       0.24 ms
H-R       27 tasks
S-R       1 task
Po        27
L-G       1485 ns
%(H-R)    19.97%
5.2. Second example
The Fast Fourier Transform (FFT), Fig. 15, is an efficient algorithm that computes the discrete Fourier transform (DFT) and its inverse. The 16-FFT and 64-FFT, which are the 16-point and 64-point FFTs respectively, play important roles in the analysis, design and implementation of discrete-time signal processing algorithms and systems. Fig. 16 shows the basic butterfly computation of the FFT algorithm. Table 6 gives the characteristics of the 16 FFT and 64 FFT graphs, and Figs. 17 and 18 show the corresponding curves of F1(Pi) and F2(Pi). Tables 7 and 8 show the design results provided by our algorithm. The results show that our algorithm generates 2112 partition objects in 0.902 ms for the 16 FFT graph and 33,024 partition objects in 25.313 ms for the 64 FFT graph.
5.3. Third example
350
The H.264 AVC is the most recent standard for video coding, it has been developed by the ITU-T Video Coding Experts Group [23], Fig. 19. The H.264 contain an intra-prediction mode with 4 4 block and 16 16 block sizes for luma component and 8 8 block size for chroma component is used in H.264 to improve the rate-distortion performance. However, the computational complexity of H.264 encoder is drastically increased due to the various intra prediction modes. Recently efficient hardware architectures were proposed for the fast execution of H.264/AVC intra prediction mode selection [24,25]. Fig. 20, shows the basic blocks in the intra prediction graph.
351
Table 10 Design results. Algorithm
Target graph
R-T (ms)
%(H-R)
L-G (ns)
H-R(G) (CLB)
Proposed algo Combined algo Tabu algo Simulated annealing algo Genetic algo
16-DCT task graph
16.515 42.573 45.895 1.724 2.265
37 25 24 30 29
2792 3108 3116 3032 2984
2667 1836 1776 2178 2139
Proposed algo Combined algo Tabu algo Simulated annealing algo Genetic algo
32-DCT task graph
112 6.692 133.678 1.196 3.387
76.79 54 53 62 58
5556 6202 6094 6002 6078
5676 3894 3849 4482 4197
Proposed algo Combined algo Tabu algo Simulated annealing algo Genetic algo
16 FFT graph
0.902 1.114 3.167 1.056 1.056
1.6 0.92 1 1.08 1
648 799 783 767 783
117 66 72 78 72
Proposed algo Combined algo Tabu algo Simulated annealing algo Genetic algo
64 FFT graph
25.313 2.271 34.74 1.125 2.153
6.42 4.62 4.91 5.62 4.95
2608 2952 2896 2760 2888
462 333 354 405 357
Proposed algo Combined algo Tabu algo Simulated annealing algo Genetic algo
H.264 graph
0.240 0.909 28.469 2.456 1.641
19.97 18.98 13.76 10.45 18.70
1485 1550 1898 2103 1566
1438 1367 991 753 1347
Please cite this article in press as: M. Jemai, B. Ouni, Hardware software partitioning of control data flow graph on system on programmable chip, Microprocess. Microsyst. (2015), http://dx.doi.org/10.1016/j.micpro.2015.04.006
330 331 332 333 334 335 336
339 340 341 342 343 344 345 346 347 348 349
352 353 354 355 356 357 358 359 360 361
MICPRO 2212
No. of Pages 12, Model 5G
13 May 2015 M. Jemai, B. Ouni / Microprocessors and Microsystems xxx (2015) xxx–xxx Table 11 Design results.
h
Proposed algo
Combined algo
Tabu algo
Simulated annealing algo
Genetic algo
1.62
1.73
1.74
1.66
1.70
Table 12 Design results.
h
Proposed algo
Combined algo
Tabu algo
Simulated annealing algo
Genetic algo
0.27
0.53
0.55
0.45
0.49
11
hardware resources. Hence, a partitioning algorithm is classified to be good one if it decreases both: whole latency of the application and the hardware resource. Therefore, based on above equation a partitioning algorithm is classified to be good alternative if it increases the value of h. Tables 11–15 show the value of h provided by each algorithm, the target application was 16-DCT task graph, 32-DCT task graph, 16 FFT graph, 64 FFT graph and H.264 task graph respectively. Table 16 shows the average value of h provided by each algorithm. Based on the above design results shown in Table 16, we prove that our algorithm is the best one in terms of average value of h. Indeed, our algorithm provides a gain of 10, 36%, 11.92%, 12.44% and 9.84% compared to combined algorithm, Tabu algorithm, simulated annealing algorithm and Genetic algorithm respectively.
380
6. Conclusion
395
In this paper, we have proposed behavioral hardware software partitioning algorithm. Our algorithm assumes the SOPC as target architecture, but its simplicity and efficiency allow it to be used on System On a Chip (SOC). The proposed algorithms can be used, at behavioral level and in the co-design flow to provide better trade-off between latency and hardware resources. The proposed algorithm has been tested and compared to tabu, genetic, simulated annealing and combined algorithms. Results have shown significant gain since it provides the best value of h compared to others approaches.
396
References
406
[1] R.K. Gupta, G. De Micheli, Hardware–software cosynthesis for digital systems, IEEE Des. Test Comp. (September) (1993) 29–41. [2] W. Wolf, A decade of hardware/software codesign, IEEE J. Magaz., Comp. 36 (4) (2003) 38–43. [3] J. Henkel, R. Ernst, A hardware/software partitioning using a dynamically determined granularity, in: ACM Design Automation Conference (DAC 97), June 1997. [4] S. Banerjee, E. Bozorgzadeh, N.D. Dutt, Integrating physical constraints in hw– sw partitioning for architectures with partial dynamic reconfiguration, IEEE Trans. Very Large Scale Integ. (VLSI) Syst. 14 (11) (2006) 1189–1202. [5] J. Wu, T. Srikanthan, Low-complex dynamic programming algorithm for hardware/software partitioning, Inf. Process. Lett. 98 (2) (2006) 41–46. [6] K. Chatha, R. Vemuri, Hardware–software partitioning and pipelined scheduling of transformative applications, IEEE Trans. Very Large Scale Integ. (VLSI) Syst. 10 (3) (2002) 193–208. [7] F. Vahid, D. Gajski, Clustering for improved system level functional partitioning, in: Proceedings of the 8th International Symposium on System Synthesis, ACM, 1995, pp. 28–35. [8] Ji-Yang Qi, Application of improved simulated annealing algorithm in facility layout design, in: Proceedings of the 29th Chinese Control Conference, July 2010. [9] Z. Peng, K. Kuchcinski, An algorithm for partitioning of application specific systems, in: Proceedings of the European Conference on Design Automation (EDAC’93), 1993, pp. 316–321. [10] J. Henkel, R. Ernst, An approach to automated hardware/software partitioning using a flexible granularity that is driven by high-level estimation techniques, IEEE Trans. Very Large Scale Integ. Syst. 9 (2) (2001) 273–290. [11] P. Eles, Z. Peng, K. Kuchcinski, A. Doboli, System level hardware/software partitioning based on simulated annealing and tabu search, Des. Autom. Embed. Syst. 2 (1997) 5–32. [12] T. Wiangtong, P.Y.K. Cheung, W. 
Luk, Comparing three heuristic search methods for functional partitioning in hardwaresoftware codesign, Des. Autom. Embed. Syst. 6 (4) (2002) 425–449. [13] K.S. Chatha, R. Vemuri, Magellan: multiway hardware–software partitioning and scheduling for latency minimization of hierarchical control-dataflow task graphs, in: Proceedings of the Ninth International Symposium on Hardware/ Software Co-design (CODES ’01), 2001, pp. 42–47. [14] J. Grode, P.V. Knudsen, J. Madsen, Hardware resource allocation for hardware/software partitioning in the LycosSystem, in: Proceedings of the Conference on Design, Automation and Test in Europe (DATE ’98), 1998, pp. 22–27. [15] M. Lopez-Vallejo, J. Lopez, On the hardware–software partitioning problem: system modeling and partitioning techniques, ACM Trans. Des. Autom. Electron. Syst. (TODAES) 8 (3) (2003) 269–297.
407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450
381 382 383 384 385 386 387 388 389 390 391 392 393 394
Table 13 Design results.
h
Proposed algo
Combined algo
Tabu algo
Simulated annealing algo
Genetic algo
10.93
8.93
9.10
9.29
9.10
Table 14 Design results.
h
Proposed algo
Combined algo
Tabu algo
Simulated annealing algo
2.58
2.33
2.36
2.46
Genetic algo 2.37
Table 15 Design results.
h
Proposed algo
Combined algo
Tabu algo
Simulated annealing algo
Genetic algo
3.88
3.76
3.27
3.06
3.73
Table 16 Average value of h.
Algorithm                   Av. h
Proposed algo               3.86
Combined algo               3.46
Tabu algo                   3.40
Simulated annealing algo    3.38
Genetic algo                3.48
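To make the gap in Table 16 easier to read, the short sketch below expresses the average h of the proposed algorithm as a percentage gain over each of the other algorithms. It assumes that a larger h indicates a better design, which is how the reported comparison appears to use the metric. The names `avg_h` and `relative_gain` are illustrative and do not come from the paper.

```python
# Average h values transcribed from Table 16.
# Assumption: a larger h indicates a better design.
avg_h = {
    "Proposed": 3.86,
    "Combined": 3.46,
    "Tabu": 3.40,
    "Simulated annealing": 3.38,
    "Genetic": 3.48,
}


def relative_gain(reference: float, baseline: float) -> float:
    """Return the percentage by which `reference` exceeds `baseline`."""
    return 100.0 * (reference - baseline) / baseline


# Gain of the proposed algorithm over every other algorithm.
gains = {
    name: relative_gain(avg_h["Proposed"], value)
    for name, value in avg_h.items()
    if name != "Proposed"
}

# Print from the smallest to the largest gain.
for name, gain in sorted(gains.items(), key=lambda item: item[1]):
    print(f"Proposed vs {name}: +{gain:.1f}%")
```

On these figures the proposed algorithm's average h is roughly 11% to 14% higher than that of the other four algorithms, with the smallest margin over the genetic algorithm and the largest over simulated annealing.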
Fig. 21 shows the curves of F1(Pi) and F2(Pi) for H.264, and Table 9 shows the design results provided by our algorithm. The results show that our algorithm generates 35 partition objects in 0.24 ms and that the latency of H.264 is 1485 ns.
5.4. Comparison with other approaches
In this section, we compare our algorithm with existing algorithms: tabu search [12], simulated annealing [11], a genetic algorithm [16], and a combined algorithm [17]. The design results are shown in Table 10. To evaluate them, we introduce the following metric:
$$h = \frac{A}{A_H \, L_G} \qquad (9)$$
where A is the available hardware resource, A_H is the hardware resource used by the graph, and L_G is the whole latency of the graph. As generally reckoned, a design is classified as excellent if it provides the fastest possible application with the minimum
Please cite this article in press as: M. Jemai, B. Ouni, Hardware software partitioning of control data flow graph on system on programmable chip, Microprocess. Microsyst. (2015), http://dx.doi.org/10.1016/j.micpro.2015.04.006
[16] He Hongjun, Dou Qiang, Weixia Xu, Hardware/software partitioning for heterogeneous multicore SoC using genetic algorithm, IEEE Intell. Syst. Des. Eng. Appl. (ISDEA) (January) (2012) 1267–1270.
[17] Yu Jiang, Hehua Zhang, Xun Jiao, Xiaoyu Song, William N.N. Hung, Ming Gu, Jiaguang Sun, Uncertain model and algorithm for hardware/software partitioning, in: IEEE Computer Society Annual Symposium on VLSI, 2012.
[18] Guoshuai Li, Jinfu Feng, Cong Wang, Jinghua Wang, Hardware/software partitioning algorithm based on the combination of genetic algorithm and tabu search, Eng. Rev. 34 (2) (2014).
[19] Anup Das, Akash Kumar, Bharadwaj Veeravalli, Aging-aware hardware–software task partitioning for reliable reconfigurable multiprocessor systems, in: International Conference on Compilers, Architectures and Synthesis for Embedded Systems, 2013.
[20] Honglei Han, Wenju Liu, Jigang Wu, Guiyuan Jiang, Efficient algorithm for hardware/software partitioning and scheduling on MPSoC, J. Comp. 8 (January) (2013) 61–68.
[21] A. Mtibaa, B. Ouni, M. Abid, An efficient list scheduling algorithm for time placement problem, Comp. Electr. Eng. 33 (4) (July 2007) 285–298.
[22] R. Ayadi, B. Ouni, A. Mtibaa, A partitioning methodology that optimizes the communication cost for reconfigurable computing systems, Int. J. Autom. Comput. 9 (3) (2012) 280–287.
[23] B. Ouni, R. Ayadi, A. Mtibaa, Temporal partitioning of data flow graph for dynamically reconfigurable architecture, J. Syst. Arch. 57 (2011) 790–798.
[24] Li-Wei Kang, Jin-Jang Leou, An error resilient coding scheme for H.264 video transmission based on data embedding, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004, ICASSP’04, vol. 3, 2004, pp. 257–260.
[25] Liangbao Jiao, Jing Zhou, Rui Chen, Efficient parallel intra-prediction mode selection scheme for 4 × 4 blocks in H.264, in: IEEE International Conference on Intelligent Computation Technology and Automation (ICICTA), vol. 2, 2011, pp. 527–530.
Mehdi Jemai received a Diploma in Computer Engineering in 2009 from the Higher Institute of Applied Science and Technology of Sousse and a Master's degree in Microelectronics in 2011 from the Faculty of Science of Monastir. He is currently preparing his thesis at the Engineering School of Monastir; his research interests include methodology development for reconfigurable architectures.
Bouraoui Ouni is currently an Associate Professor at the National Engineering School of Sousse. He obtained his Ph.D., entitled ‘Synthesis and temporal partitioning for reconfigurable systems’, in 2008 from the Faculty of Sciences at Monastir, and his university habilitation, entitled ‘Optimisation algorithm for reconfigurable architectures’, in 2012. His research interests cover: models, methods, tools, and architectures for reconfigurable computing; simulation, debugging, synthesis, verification, and test of reconfigurable systems; field programmable gate arrays and other reconfigurable technologies; algorithms implemented on reconfigurable hardware; hardware/software codesign and cosimulation with reconfigurable hardware; and high performance reconfigurable computing.