Hardware software partitioning of control data flow graph on system on programmable chip

Microprocessors and Microsystems xxx (2015) xxx–xxx, 13 May 2015
Journal homepage: www.elsevier.com/locate/micpro

Mehdi Jemai ⇑, Bouraoui Ouni

Laboratory Electronic and Microelectronic, The National Engineering School of Monastir, University of Monastir, Monastir 5000, Tunisia

Article info

Article history: Available online xxxx

Keywords: Hardware–software partitioning; SOPC; Control data flow graph; Co-design

Abstract

A System On Programmable Chip (SOPC) is a circuit that integrates all components of an electronic system on a single chip. It may comprise memories, one or more microprocessors, interface devices, configurable logic blocks and any other components needed to achieve the intended function. In this paper, we propose a new hardware–software partitioning algorithm for control data flow graphs targeting SOPCs. The algorithm seeks the best compromise between hardware and software implementation of operations, so that the design constraints on latency and hardware resources of the target application are satisfied. We evaluated the algorithm on real hardware, a Virtex-5 FPGA. Results show that it provides a better-performing system at lower cost than existing approaches.

© 2015 Elsevier B.V. All rights reserved.

1. Introduction

Modern FPGAs have become much more sophisticated than their predecessors. A single device can hold a central processing unit (CPU), several RAMs and DSPs at the same time. New FPGA devices (such as the Xilinx Virtex family) include two subsets of resources: software resources and hardware resources. The software resources may be one or more hard processors or DSPs (for example, the IBM PowerPC in Xilinx devices or the AVR in Atmel devices). The hardware resources may be transceivers, analog blocks, multiply–accumulate modules (MACs), RAM blocks and Configurable Logic Blocks (CLBs). A current FPGA is therefore a programmable System On Chip (SOC), and is called a System On Programmable Chip (SOPC).

Many designers now prefer SOPCs for implementing their applications, which must meet design constraints such as performance, flexibility and time to market. Co-design efficiency is often tied to the hardware–software partitioning task. Hardware–software partitioning is a system-level partitioning problem: it assigns each operation of the application to either the hardware part or the software part of the SOPC in order to obtain the fastest processing at the lowest cost. In this paper, we address the following problem: given a control data flow graph and an SOPC circuit, find a hardware–software partitioning of the graph on the SOPC that gives a better compromise between the hardware resources used to implement the target graph and its overall latency.

Our algorithm is based on a function called generating of partition object. At each iteration of the algorithm, this function returns a sub-graph of the original graph, called a partition object. We then compute the hardware–software latency and the hardware area cost of each generated partition object. The procedure is repeated until the hardware–software latency of one of the generated partition objects comes close to its hardware area cost. In other words, the function generates many partition objects (e.g., 1% hardware and 99% software, 25% hardware and 75% software, etc.). For each partition object the latency is calculated, and a curve is drawn through these values. The area cost is computed as well, giving a second curve. When the two curves are superimposed, they intersect at a single point, which identifies the best partition object to use.

⇑ Corresponding author (M. Jemai).

2. Related works

Many hardware–software partitioning algorithms have been proposed in the literature for System On Chip (SOC) designs [1,2]. In early works, hardware–software partitioning was carried out manually [3]. In practice, however, the task is more complicated, and many requirements on cost, hardware effort, power dissipation and timing performance have to be taken into consideration. Efforts have therefore been made to automate hardware–software partitioning. For that purpose, many optimization methods have been used to devise new algorithms, such as exact algorithms based on integer linear programming [4], dynamic programming [5] and branch-and-bound [6]. However, exact algorithms are very slow and can be applied only to small graphs. To overcome these drawbacks, researchers have turned to more flexible and efficient heuristic algorithms.

Traditional heuristic algorithms are either software-oriented or hardware-oriented. In the software-oriented approach, the initial implementation of the whole application is assumed to be a software solution; during partitioning, operations are then migrated to hardware until the constraints are met [7]. Other hardware–software partitioning approaches based on simulated annealing [8–11], tabu search [12] and greedy algorithms [13,14] have also been presented. Some custom heuristics, such as expert systems [15], are also suitable for the hardware–software partitioning problem.

In [16], the authors used a genetic algorithm to solve the hardware–software partitioning problem. They start with a set of random candidate solutions called a population and select the best solution of this initial population. They then generate a new population from the previous one using techniques inspired by natural evolution, such as crossover and mutation. At each iteration, the best solution of the current population is compared with the best solution of the previous population to decide whether to stop or continue. The algorithm returns the best solution of the last population as the solution of the partitioning problem.

In [17], the authors proposed a heuristic based on the genetic algorithm, called the combined algorithm. The combined algorithm starts with a set of random initial candidate solutions. At each iteration, the cost of each solution is evaluated. Crossover and mutation are then used to build a new population from the current one, and the cost is re-evaluated for the new generation. This process is repeated until the cost of one solution comes close to a predefined cost value.

In addition, the authors of [18] studied Tabu Search (TS) and a hybrid algorithm based on the Genetic Algorithm (GA) to solve the hardware/software partitioning problem. In [19], the authors proposed an optimization technique based on scheduling and partitioning methods, developed to satisfy design constraints such as performance, reliability and design cost. Finally, the authors of [20] proposed a heuristic based on partitioning algorithms for multi-processor systems on chip. The algorithm assigns priorities to tasks according to their out-degree and software execution time. It first looks for the critical path in the graph, then assigns the task with the highest benefit-to-area ratio to hardware. At each iteration, the available hardware area and the critical path are updated. The process continues until the available hardware area is insufficient to implement a software task lying on the critical path.

http://dx.doi.org/10.1016/j.micpro.2015.04.006
0141-9331/© 2015 Elsevier B.V. All rights reserved.

Please cite this article in press as: M. Jemai, B. Ouni, Hardware software partitioning of control data flow graph on system on programmable chip, Microprocess. Microsyst. (2015), http://dx.doi.org/10.1016/j.micpro.2015.04.006

Fig. 1. The three kinds of partition objects: single partition object, mix partition object and larger partition object.

Fig. 2. Pseudo-code of the generating of partition objects function.

Fig. 3. Graph G (nodes V1 to V10).

Fig. 4. The nesting level (ST) and the junction (J) of each node.

Fig. 5. Generating of single partition objects (P1 to P6).

Fig. 6. Generating of larger partition objects (P7 and P8).

Fig. 7. Generating of mix partition objects (panels a to f, P9 to P14).

3. Basic definitions

3.1. Control data flow graph

A control data flow graph $G(V, E)$ is a directed acyclic graph that describes the dependencies between the operations of an application, where $V = \{v_1, v_2, \ldots, v_n\}$ is the set of nodes and $E = \{e_{ij} \mid 1 \le i, j \le n\}$ is the set of edges. There are three types of nodes:

1- a node that contains straightforward code (no control constructs): $v_i^B$ with $v_i^B \in V_B$, $V_B \subseteq V$;
2- a node that contains the beginning of a control construct: $v_i^S$ with $v_i^S \in V_S$, $V_S \subseteq V$;
3- a node that contains the end of a control construct: $v_i^E$ with $v_i^E \in V_E$, $V_E \subseteq V$.

Table 1. Node parameters.

Node        V1   V2   V3   V4   V5   V6   V7   V8   V9   V10
L_H(v_i)     4    3    3    4    4    2    3    2    3     4
L_S(v_i)     8    7    8    9   10    5    8    7   10     9
A(v_i)       6    5    5    6    7    4    8    5    3     7

An edge $e_{ij} \in E$ is directed from node $v_i$ to node $v_j$; $v_i$ is called the predecessor of $v_j$, and $v_j$ the successor of $v_i$. There are two particular nodes: the start node of the graph, called the source, and the end node of the graph, called the sink.

3.2. Parameters of node

Given a node $v_i$ with $v_i \in V$:

- $L_H(v_i)$ is the hardware latency of node $v_i$.
- $L_S(v_i)$ is the software latency of node $v_i$.
- $A(v_i)$ is the hardware area of node $v_i$.
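The three node parameters can be held in a small record per node. The following is a minimal sketch, assuming Python and the hypothetical names `NodeParams` and `TABLE_1` (neither appears in the paper); the numbers are copied from Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeParams:
    """Parameters of one CDFG node (hypothetical container, not from the paper)."""
    lh: int    # L_H(v_i): hardware latency
    ls: int    # L_S(v_i): software latency
    area: int  # A(v_i): hardware area


# Values copied from Table 1; the paper does not state their units there.
TABLE_1 = {
    "V1": NodeParams(4, 8, 6),
    "V2": NodeParams(3, 7, 5),
    "V3": NodeParams(3, 8, 5),
    "V4": NodeParams(4, 9, 6),
    "V5": NodeParams(4, 10, 7),
    "V6": NodeParams(2, 5, 4),
    "V7": NodeParams(3, 8, 8),
    "V8": NodeParams(2, 7, 5),
    "V9": NodeParams(3, 10, 3),
    "V10": NodeParams(4, 9, 7),
}

# Area needed if every node of the graph is placed in hardware.
total_area = sum(p.area for p in TABLE_1.values())
print(total_area)
```

The total area of all ten nodes is 56, which agrees with the largest F2 value listed in Table 2.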

Fig. 8. Pseudo-code of the proposed algorithm.

Table 2. Values of F1(Pi) and F2(Pi).

Partition object   P1   P2   P3   P4   P5   P6   P7   P8   P9   P10   P11   P12   P13   P14
F1(Pi)             72   71   70   73   71   71   60   55   55    50    39    35    34    30
F2(Pi)              6    5    7    4    5    7   25   30   30    35    43    49    50    56

Fig. 9. Curves of functions F1(Pi) and F2(Pi).

3.3. Nesting level ST

We define the nesting level $ST(v_i)$ of a node $v_i$ as follows:

$$
ST(v_i) =
\begin{cases}
1, & \text{if } v_i \text{ is the start node;} \\
ST(pred(v_i)) + 1, & \text{if } pred(v_i) \in V_S,\ v_i \notin V_E; \\
\max\{ST(pred(v_i))\} - 1, & \text{if } v_i \in V_E; \\
ST(pred(v_i)), & \text{else.}
\end{cases}
$$

At the beginning of a procedure or function, $ST(v_i)$ equals one. It increases by one when a new control construct, such as an if–then–else or a loop, is entered, and decreases by one when a control construct is exited. It remains constant at all other nodes.
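The nesting-level rule of Section 3.3 can be sketched as a single pass over the nodes in topological order. This is an illustrative sketch, not the authors' implementation: the function name `nesting_levels`, the `"B"`/`"S"`/`"E"` node tags and the small example graph are assumptions, and taking the maximum over several predecessors in the non-end cases is also an assumption, since the definition only states a maximum for end nodes.

```python
def nesting_levels(order, preds, kind):
    """Return ST(v) for every node, following the four cases of Section 3.3.

    order: nodes in topological order (predecessors first)
    preds: dict mapping a node to the list of its predecessors
    kind:  dict mapping a node to "B" (plain code), "S" (construct start)
           or "E" (construct end)
    """
    st = {}
    for v in order:
        p = preds.get(v, [])
        if not p:                                 # start node
            st[v] = 1
        elif kind[v] == "E":                      # a control construct is exited
            st[v] = max(st[u] for u in p) - 1
        elif any(kind[u] == "S" for u in p):      # a control construct is entered
            st[v] = max(st[u] for u in p if kind[u] == "S") + 1
        else:                                     # level unchanged
            st[v] = max(st[u] for u in p)
    return st


# Hypothetical graph: a -> if -> {b, c} -> endif -> d
order = ["a", "if", "b", "c", "endif", "d"]
preds = {"if": ["a"], "b": ["if"], "c": ["if"], "endif": ["b", "c"], "d": ["endif"]}
kind = {"a": "B", "if": "S", "b": "B", "c": "B", "endif": "E", "d": "B"}

levels = nesting_levels(order, preds, kind)
print(levels)
```

In this example the two branch nodes sit one level deeper than the nodes outside the if construct, which matches the description above.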

3.4. Partition object Pi

A partition object Pi is a sub-graph G(Pi) of the original graph G such that all nodes of G(Pi) are assigned to the hardware part of the architecture. There are three kinds of partition objects, as shown in Fig. 1:

(1) Single partition object: contains a single node (a simple instruction, such as an addition or a multiplication).
(2) Larger partition object: contains a whole control construct, e.g. the nodes from loop to end loop, from if to end if, or from case to end case, or possibly a function or procedure.
(3) Mix partition object: a combination of two single partition objects, two larger partition objects, or a single and a larger partition object.

Fig. 10. Hardware–software partitioning of nodes.

3.5. Partitions object generating

Generating the partition objects of a graph G divides it into a set, $PO_G$, of partition objects $P_i$ such that

$$
PO_G = \sum_{i=1}^{n} P_i \qquad (1)
$$

where n is the number of partition objects.

Fig. 11. 16 DCT data flow graph.

Fig. 12. Vector products.

Table 3. Characteristics of 16 DCT and 32 DCT graphs.

Graph     Operations (nodes)   Multiplication operations (*)   Addition operations (+)
16 DCT    224                  128                             96
32 DCT    448                  256                             192

Fig. 13. Curves of F1(Pi) and F2(Pi) for 16 DCT graph.

Fig. 14. Curves of F1(Pi) and F2(Pi) for 32 DCT graph.

Fig. 15. 16-FFT data flow graph.

Fig. 16. Basic butterfly computation in FFT algorithm.

Table 4. Design results for 16 DCT.

N-P       24,800
R-T       16,515 ms
H-R       Operations (*) = 76; operations (+) = 53
S-R       Operations (*) = 52; operations (+) = 43
Po        15,370
L-G       2792 ns
%(H-R)    37.04%

Table 5. Design results for 32 DCT.

N-P       98,752
R-T       112 ms
H-R       Operations (*) = 163; operations (+) = 93
S-R       Operations (*) = 93; operations (+) = 99
Po        61,160
L-G       5556 ns
%(H-R)    76.79%

Table 6. Characteristics of 16 FFT and 64 FFT graphs.

Graph     Operations (nodes)   Subtraction operations (−)   Addition operations (+)
16 FFT    64                   32                           32
64 FFT    256                  128                          128

3.6. Hardware area cost

We define the hardware area cost of partition object Pi as follows:

$$
A_{HW}(P_i) = \sum_{v_i \in G(P_i)} A(v_i) \qquad (2)
$$

where G(Pi) is the sub-graph corresponding to partition object Pi (see Fig. 2). In the same way, the hardware area cost of a graph G (if all nodes of the graph are assigned to the hardware part of the architecture) is defined as follows:

$$
A_{HW}(G) = \sum_{v_i \in G} A(v_i) \qquad (3)
$$

3.7. Critical path, CPG, of graph G

The critical path, $CP_G$, of a graph G is the longest path from its source node to its sink node.

- We define the hardware latency of the critical path (if all nodes of the critical path are assigned to the hardware part of the architecture) as follows:

$$
L_H(CP_G) = \max_{v_i \in G} \Big[ (1 - X_m(i))\, L_H(v_i) + \sum_{v_j \in G} (1 - X_m(j))\, \beta_{ij}\, L_H(v_j) \Big] \qquad (4)
$$

where

$$
\beta_{ij} =
\begin{cases}
1, & \text{if } v_j \text{ depends on } v_i; \\
0, & \text{else.}
\end{cases}
$$

- We define the software latency of the critical path (if all nodes of the critical path are assigned to the software part of the architecture) as follows:

$$
L_S(CP_G) = \sum_{v_i \in CP_G} L_S(v_i) \qquad (5)
$$

3.8. The junction of node $v_i$

We define the junction of node $v_i$ as follows:

$$
J(v_i) =
\begin{cases}
1, & \text{if } v_i \text{ is the begin or the end of a control construct;} \\
0, & \text{else.}
\end{cases}
$$
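The area cost of Section 3.6 and the critical-path latency of Section 3.7 can be sketched as follows. This is a simplified illustration under stated assumptions: the function names and the four-node diamond graph are hypothetical, the node values are taken from Table 1, and the critical-path latency is computed as an ordinary longest path over a DAG weighted by node latency rather than through the exact form of Eq. (4).

```python
def hardware_area(nodes, area):
    """Sum of A(v) over the given nodes, as in Eqs. (2) and (3)."""
    return sum(area[v] for v in nodes)


def critical_path_latency(order, preds, latency):
    """Latency of the longest source-to-sink path of a DAG.

    order:   nodes in topological order (predecessors first)
    preds:   dict mapping a node to the list of its predecessors
    latency: dict mapping a node to its latency (L_H or L_S)
    """
    finish = {}
    for v in order:
        start = max((finish[u] for u in preds.get(v, [])), default=0)
        finish[v] = start + latency[v]
    return max(finish.values())


# Hypothetical diamond graph V1 -> {V2, V3} -> V4 with values from Table 1.
order = ["V1", "V2", "V3", "V4"]
preds = {"V2": ["V1"], "V3": ["V1"], "V4": ["V2", "V3"]}
LS = {"V1": 8, "V2": 7, "V3": 8, "V4": 9}
LH = {"V1": 4, "V2": 3, "V3": 3, "V4": 4}
AREA = {"V1": 6, "V2": 5, "V3": 5, "V4": 6}

sw_latency = critical_path_latency(order, preds, LS)   # all nodes in software
hw_latency = critical_path_latency(order, preds, LH)   # all nodes in hardware
area_all = hardware_area(order, AREA)                  # whole graph in hardware
area_part = hardware_area(["V2", "V3"], AREA)          # a two-node partition object
print(sw_latency, hw_latency, area_all, area_part)
```

Passing the software latencies gives the quantity of Eq. (5) for the longest path, and passing the hardware latencies gives the all-hardware counterpart.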

Table 7. Design results for 16 FFT graph.

N-P       2112
R-T       0.902 ms
H-R       Operations (−) = 19; operations (+) = 20
S-R       Operations (−) = 13; operations (+) = 12
Po        1397
L-G       648 ns
%(H-R)    1.6%

Table 8. Design results for 64 FFT graph.

N-P       33,024
R-T       25,313 ms
H-R       Operations (−) = 75; operations (+) = 79
S-R       Operations (−) = 53; operations (+) = 49
Po        1397
L-G       2608 ns
%(H-R)    6.42%

Fig. 17. Curves of F1(Pi) and F2(Pi) for 16 FFT graph.

Fig. 18. Curves of F1(Pi) and F2(Pi) for 64 FFT graph.

4. Proposed hardware–software algorithm

Our algorithm is based on a function called generating of partition object. At each iteration of the algorithm, this function returns a sub-graph (partition object). In the same iteration, the algorithm evaluates and plots the following two functions for the generated partition object:

$$
F_1(P_i) = \big[ (L_S(CP_G) - L_S(CP_{G(P_i)})) + L_H(CP_{G(P_i)}) \big] \qquad (6)
$$

$F_1(P_i)$ is the hardware–software latency of partition object Pi, where $CP_{G(P_i)}$ is the critical path of the sub-graph G(Pi) corresponding to partition object Pi.

$$
F_2(P_i) = A_{HW}(P_i) \qquad (7)
$$

$F_2(P_i)$ is the hardware area cost of partition object Pi. The algorithm repeats this procedure until it generates a partition object such that:

$$
A_{HW}(P_i) = A_{HW}(G) \qquad (8)
$$

The intersection between the curve of $F_1(P_i)$ and the curve of $F_2(P_i)$ gives the best trade-off between the latency and the hardware resources of the target graph.

4.1. Generating of partition object function

At each iteration of the algorithm, this function returns a sub-graph corresponding to one partition object.

4.1.1. Illustrative example

We apply the steps of our function to the graph shown in Fig. 3. First, we compute the nesting level $ST(v_i)$ and the junction $J(v_i)$ of each node $v_i$ (Fig. 4). If $\forall v_i \in V$ ($ST(v_i) \ge 1$ and $J(v_i) = 0$), we generate a single partition object. Six single partition objects, P1 to P6, are shown in Fig. 5. $\forall v_i$
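The rule for single partition objects stated above (nesting level at least one and junction equal to zero) can be sketched as a simple filter. The function names and the example node tags below are hypothetical; the junction is derived from the node type as in Section 3.8.

```python
def junction(kind):
    """J(v) = 1 for nodes that begin ("S") or end ("E") a control construct, else 0."""
    return {v: int(k in ("S", "E")) for v, k in kind.items()}


def single_partition_objects(st, j):
    """One single-node partition object per node with ST(v) >= 1 and J(v) == 0."""
    return [[v] for v in st if st[v] >= 1 and j[v] == 0]


# Hypothetical nodes: plain code before, inside and after an if construct.
kind = {"a": "B", "if": "S", "b": "B", "endif": "E", "d": "B"}
st = {"a": 1, "if": 1, "b": 2, "endif": 1, "d": 1}

singles = single_partition_objects(st, junction(kind))
print(singles)
```

Only the plain-code nodes qualify; the nodes that open or close the control construct are left for the larger partition objects.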

4.2. Pseudo-code of the proposed algorithm

4.2.1. Illustrative example

We apply the steps of our algorithm (Fig. 8) to the graph shown in Fig. 3. The node parameters are given in Table 1. At the first iteration, the generating of partition objects function returns partition object P1 (see Section 4.1.1 and Fig. 5). We compute $F_1(P_1) = [(L_S(CP_G) - L_S(CP_{G(P_1)})) + L_H(CP_{G(P_1)})]$ and $F_2(P_1) = A_{HW}(P_1)$; in this case, $F_1(P_1) = 72$ and $F_2(P_1) = 6$. At the second iteration, the function returns partition object P2, and we compute $F_1(P_2) = [(L_S(CP_G) - L_S(CP_{G(P_2)})) + L_H(CP_{G(P_2)})] = 71$ and $F_2(P_2) = A_{HW}(P_2) = 5$. The same procedure is followed for the other partition objects. Table 2 gives the values of F1(Pi) and F2(Pi) for all generated partition objects. Next, we draw the curves of F1(Pi) and F2(Pi) (Fig. 9). Based on Fig. 9, the two curves intersect at P11. Therefore, partition object P11 (solution "c" in Fig. 7) provides the best trade-off between the latency and the hardware resources of the graph. Hence, nodes V3, V4, V5, V6, V7 and V8 should be assigned to the hardware part of the architecture, and V1, V2, V9 and V10 to the software part, as shown in Fig. 10.
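The intersection read from Fig. 9 can be reproduced numerically from the values in Table 2. The sketch below is one possible reading, not the authors' procedure: it assumes the intersection is the first partition object at which the area curve F2 meets or exceeds the latency curve F1, and the function name `crossover_index` is hypothetical.

```python
# Values copied from Table 2, in the order P1 ... P14.
F1 = [72, 71, 70, 73, 71, 71, 60, 55, 55, 50, 39, 35, 34, 30]
F2 = [6, 5, 7, 4, 5, 7, 25, 30, 30, 35, 43, 49, 50, 56]


def crossover_index(f1, f2):
    """Index of the first point where f2 >= f1, or None if the curves never meet."""
    for i, (latency, area) in enumerate(zip(f1, f2)):
        if area >= latency:
            return i
    return None


best = crossover_index(F1, F2)
best_name = f"P{best + 1}"
print(best_name, F1[best], F2[best])
```

With these values the crossing falls at P11 (F1 = 39, F2 = 43), the partition object selected in the example above.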

Fig. 19. Blocks of H.264.

Fig. 20. Intra prediction graph.

Fig. 21. Curves of F1(Pi) and F2(Pi) for H.264.

5. Experiments

To validate our approach, we implemented the DCT and FFT graphs on a Xilinx Virtex®-5 FPGA. The Xilinx Virtex®-5 development kit enables high-performance embedded design on Xilinx FPGAs. In our approach, the software resource is the PowerPC and the hardware resources are the configurable logic blocks (CLBs). To compute the parameters of each node and to access the PowerPC, we used the Xilinx ISE and Xilinx EDK tools. These design tools produce resource and timing reports that give a comprehensive area and timing summary of the design. Our algorithm was written in Java and executed under Windows 7 on an Acer PC (Intel Core 2 Duo T5500, 1.66 GHz, 1 GB of RAM). Design results are shown in Tables 4, 5, 7 and 8. The legends of these tables are:

- N-P: the number of partition objects generated by the generating of partition objects function.
- R-T: the execution time of the algorithm on the Acer PC.
- H-R: the operations assigned to the hardware resources (CLBs of the Virtex®-5).
- %(H-R): the percentage of hardware resources used.
- S-R: the operations assigned to the software resources (PowerPC of the Virtex®-5).
- L-G: the latency of the graph.
- Po: the number of partition objects where $F_1(P_i) \cap F_2(P_i) \neq \emptyset$.

5.1. First example

The DCT (Fig. 11) is the most computationally intensive part of the CLD algorithm. The model proposed in [21,22] is based on 16 or 32 vector products. Thus, the entire DCT is a collection of 16 or 32 nodes "T1" and "T2". The structure of "T1" and "T2" is similar to the vector product of Fig. 12, but with different bit widths. Table 3 gives the characteristics of the 16-DCT and 32-DCT graphs. Fig. 13 shows the curves of F1(Pi) and F2(Pi) for the 16-DCT graph, and Fig. 14 those for the 32-DCT graph.

Tables 4 and 5 show the design results provided by our algorithm. The algorithm generates 24,800 partition objects for the 16-DCT graph in 16,515 ms and 98,752 partition objects for the 32-DCT graph in 112 ms. Furthermore, 129 of the 16-DCT operations and 256 of the 32-DCT operations are assigned to the hardware part of the architecture. The latencies of the 16-DCT and 32-DCT graphs are 2792 ns and 5556 ns, respectively.

Table 9. Design results for H.264.

N-P       35
R-T       0.24 ms
H-R       27 tasks
S-R       1 task
Po        27
L-G       1485 ns
%(H-R)    19.97%

5.2. Second example

337

The Fast Fourier transform (FFT), Fig. 15, is an efficient algorithm that computes the discrete Fourier transform (DFT) and its inverse. 16-FFT and 64-FFT which are 16 points and 64 points of FFT respectively have important roles in analysis, design, and implementation of discrete-time signal processing algorithms and systems. Fig. 16, shows the basic butterfly computation in FFT algorithm. Table 6 gives the characteristics of 16 FFT and 64 FFT graphs (see Figs. 17 and 18). Tables 7 and 8 show the design results provided by our algorithm. Results show that our algorithm generates 2112 partition objects and 33,024 partition objects in 0.902 ms and in 25,313 ms respectively.

338

5.3. Third example

350

The H.264 AVC is the most recent standard for video coding, it has been developed by the ITU-T Video Coding Experts Group [23], Fig. 19. The H.264 contain an intra-prediction mode with 4  4 block and 16  16 block sizes for luma component and 8  8 block size for chroma component is used in H.264 to improve the rate-distortion performance. However, the computational complexity of H.264 encoder is drastically increased due to the various intra prediction modes. Recently efficient hardware architectures were proposed for the fast execution of H.264/AVC intra prediction mode selection [24,25]. Fig. 20, shows the basic blocks in the intra prediction graph.

351

Table 10. Design results.

Target graph         Algorithm                  R-T (ms)   %(H-R)   L-G (ns)   H-R(G) (CLB)
16-DCT task graph    Proposed algo              16.515     37       2792       2667
                     Combined algo              42.573     25       3108       1836
                     Tabu algo                  45.895     24       3116       1776
                     Simulated annealing algo   1.724      30       3032       2178
                     Genetic algo               2.265      29       2984       2139
32-DCT task graph    Proposed algo              112        76.79    5556       5676
                     Combined algo              6.692      54       6202       3894
                     Tabu algo                  133.678    53       6094       3849
                     Simulated annealing algo   1.196      62       6002       4482
                     Genetic algo               3.387      58       6078       4197
16 FFT graph         Proposed algo              0.902      1.6      648        117
                     Combined algo              1.114      0.92     799        66
                     Tabu algo                  3.167      1        783        72
                     Simulated annealing algo   1.056      1.08     767        78
                     Genetic algo               1.056      1        783        72
64 FFT graph         Proposed algo              25.313     6.42     2608       462
                     Combined algo              2.271      4.62     2952       333
                     Tabu algo                  34.74      4.91     2896       354
                     Simulated annealing algo   1.125      5.62     2760       405
                     Genetic algo               2.153      4.95     2888       357
H.264 graph          Proposed algo              0.240      19.97    1485       1438
                     Combined algo              0.909      18.98    1550       1367
                     Tabu algo                  28.469     13.76    1898       2103
                     Simulated annealing algo   2.456      10.45    2103       753
                     Genetic algo               1.641      18.70    1566       1347
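Section 5.4 scores each design with the ratio h = (A − A_H) / L_G, where A is the available hardware resource, A_H the hardware resource used and L_G the latency of the graph. The sketch below evaluates this ratio for the 16-DCT rows of Table 10. The paper does not state the value of A; the constant `ASSUMED_AVAILABLE_CLB = 7200` is an assumption, chosen because it reproduces the h values reported for 16-DCT in Table 11, and the function name `h_metric` is hypothetical.

```python
# ASSUMPTION: the available hardware resource A is not stated in the paper.
ASSUMED_AVAILABLE_CLB = 7200


def h_metric(available_area, used_area, latency_ns):
    """h = (A - A_H) / L_G as used in Section 5.4."""
    return (available_area - used_area) / latency_ns


# (H-R(G) in CLB, L-G in ns) for the 16-DCT task graph, copied from Table 10.
ROWS_16_DCT = {
    "Proposed": (2667, 2792),
    "Combined": (1836, 3108),
    "Tabu": (1776, 3116),
    "Simulated annealing": (2178, 3032),
    "Genetic": (2139, 2984),
}

h_values = {
    name: round(h_metric(ASSUMED_AVAILABLE_CLB, used, latency), 2)
    for name, (used, latency) in ROWS_16_DCT.items()
}
print(h_values)
```

Because h divides the unused area by the latency, a design that uses more hardware to reach a given latency scores lower than one that reaches the same latency with less hardware.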

Please cite this article in press as: M. Jemai, B. Ouni, Hardware software partitioning of control data flow graph on system on programmable chip, Microprocess. Microsyst. (2015), http://dx.doi.org/10.1016/j.micpro.2015.04.006

330 331 332 333 334 335 336

339 340 341 342 343 344 345 346 347 348 349

352 353 354 355 356 357 358 359 360 361

MICPRO 2212

No. of Pages 12, Model 5G

13 May 2015 M. Jemai, B. Ouni / Microprocessors and Microsystems xxx (2015) xxx–xxx Table 11 Design results.

h

Proposed algo

Combined algo

Tabu algo

Simulated annealing algo

Genetic algo

1.62

1.73

1.74

1.66

1.70

Table 12 Design results.

h

Proposed algo

Combined algo

Tabu algo

Simulated annealing algo

Genetic algo

0.27

0.53

0.55

0.45

0.49

11

hardware resources. Hence, a partitioning algorithm is classified to be good one if it decreases both: whole latency of the application and the hardware resource. Therefore, based on above equation a partitioning algorithm is classified to be good alternative if it increases the value of h. Tables 11–15 show the value of h provided by each algorithm, the target application was 16-DCT task graph, 32-DCT task graph, 16 FFT graph, 64 FFT graph and H.264 task graph respectively. Table 16 shows the average value of h provided by each algorithm. Based on the above design results shown in Table 16, we prove that our algorithm is the best one in terms of average value of h. Indeed, our algorithm provides a gain of 10, 36%, 11.92%, 12.44% and 9.84% compared to combined algorithm, Tabu algorithm, simulated annealing algorithm and Genetic algorithm respectively.

380

6. Conclusion

395

In this paper, we have proposed behavioral hardware software partitioning algorithm. Our algorithm assumes the SOPC as target architecture, but its simplicity and efficiency allow it to be used on System On a Chip (SOC). The proposed algorithms can be used, at behavioral level and in the co-design flow to provide better trade-off between latency and hardware resources. The proposed algorithm has been tested and compared to tabu, genetic, simulated annealing and combined algorithms. Results have shown significant gain since it provides the best value of h compared to others approaches.

396

References

406

[1] R.K. Gupta, G. De Micheli, Hardware–software cosynthesis for digital systems, IEEE Des. Test Comp. (September) (1993) 29–41. [2] W. Wolf, A decade of hardware/software codesign, IEEE J. Magaz., Comp. 36 (4) (2003) 38–43. [3] J. Henkel, R. Ernst, A hardware/software partitioning using a dynamically determined granularity, in: ACM Design Automation Conference (DAC 97), June 1997. [4] S. Banerjee, E. Bozorgzadeh, N.D. Dutt, Integrating physical constraints in hw– sw partitioning for architectures with partial dynamic reconfiguration, IEEE Trans. Very Large Scale Integ. (VLSI) Syst. 14 (11) (2006) 1189–1202. [5] J. Wu, T. Srikanthan, Low-complex dynamic programming algorithm for hardware/software partitioning, Inf. Process. Lett. 98 (2) (2006) 41–46. [6] K. Chatha, R. Vemuri, Hardware–software partitioning and pipelined scheduling of transformative applications, IEEE Trans. Very Large Scale Integ. (VLSI) Syst. 10 (3) (2002) 193–208. [7] F. Vahid, D. Gajski, Clustering for improved system level functional partitioning, in: Proceedings of the 8th International Symposium on System Synthesis, ACM, 1995, pp. 28–35. [8] Ji-Yang Qi, Application of improved simulated annealing algorithm in facility layout design, in: Proceedings of the 29th Chinese Control Conference, July 2010. [9] Z. Peng, K. Kuchcinski, An algorithm for partitioning of application specific systems, in: Proceedings of the European Conference on Design Automation (EDAC’93), 1993, pp. 316–321. [10] J. Henkel, R. Ernst, An approach to automated hardware/software partitioning using a flexible granularity that is driven by high-level estimation techniques, IEEE Trans. Very Large Scale Integ. Syst. 9 (2) (2001) 273–290. [11] P. Eles, Z. Peng, K. Kuchcinski, A. Doboli, System level hardware/software partitioning based on simulated annealing and tabu search, Des. Autom. Embed. Syst. 2 (1997) 5–32. [12] T. Wiangtong, P.Y.K. Cheung, W. 
Luk, Comparing three heuristic search methods for functional partitioning in hardwaresoftware codesign, Des. Autom. Embed. Syst. 6 (4) (2002) 425–449. [13] K.S. Chatha, R. Vemuri, Magellan: multiway hardware–software partitioning and scheduling for latency minimization of hierarchical control-dataflow task graphs, in: Proceedings of the Ninth International Symposium on Hardware/ Software Co-design (CODES ’01), 2001, pp. 42–47. [14] J. Grode, P.V. Knudsen, J. Madsen, Hardware resource allocation for hardware/software partitioning in the LycosSystem, in: Proceedings of the Conference on Design, Automation and Test in Europe (DATE ’98), 1998, pp. 22–27. [15] M. Lopez-Vallejo, J. Lopez, On the hardware–software partitioning problem: system modeling and partitioning techniques, ACM Trans. Des. Autom. Electron. Syst. (TODAES) 8 (3) (2003) 269–297.


Table 13 Design results.

Algorithm                    h
Proposed algo                10.93
Combined algo                8.93
Tabu algo                    9.10
Simulated annealing algo     9.29
Genetic algo                 9.10

Table 14 Design results.

Algorithm                    h
Proposed algo                2.58
Combined algo                2.33
Tabu algo                    2.36
Simulated annealing algo     2.46
Genetic algo                 2.37

Table 15 Design results.

Algorithm                    h
Proposed algo                3.88
Combined algo                3.76
Tabu algo                    3.27
Simulated annealing algo     3.06
Genetic algo                 3.73

Table 16 Average value of h.

Algorithm                    Av. h
Proposed algo                3.86
Combined algo                3.46
Tabu algo                    3.40
Simulated annealing algo     3.38
Genetic algo                 3.48


Fig. 21 shows the curves of F1(Pi) and F2(Pi) for H.264. Table 9 shows the design results provided by our algorithm. The results show that our algorithm generates 35 partition objects in 0.24 ms and that the latency of H.264 is 1485 ns.


5.4. Comparison with other approaches


In this part, we compare our algorithm with existing algorithms: tabu search [12], simulated annealing [11], the genetic algorithm [16], and the combined algorithm [17]. Design results are shown in Table 10. To evaluate these results, we introduce the following equation:




$$h = \frac{A - A_H}{L_G} \qquad (9)$$
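As a rough illustration of how Eq. (9) can be applied, the sketch below reads h as the unused hardware resource (A minus AH) divided by the overall latency LG, so that a larger h corresponds to a faster design that uses less hardware. The function name `design_quality` and all sample values are hypothetical and are not taken from the experiments; the units are whatever A, AH and LG are measured in.

```python
def design_quality(available, used, latency):
    """Return h = (A - AH) / LG, under the reading of Eq. (9) assumed here."""
    if latency <= 0:
        raise ValueError("latency must be positive")
    if not 0 <= used <= available:
        raise ValueError("used resource must lie between 0 and the available resource")
    return (available - used) / latency


# Hypothetical designs (not measured data): name -> (AH, LG)
designs = {"design_a": (60.0, 20.0), "design_b": (40.0, 20.0)}
A = 100.0  # hypothetical available hardware resource

# Score every candidate design and pick the one with the largest h.
scores = {name: design_quality(A, ah, lg) for name, (ah, lg) in designs.items()}
best = max(scores, key=scores.get)
```

In this made-up comparison both designs have the same latency, so the one that uses less hardware (`design_b`) obtains the larger h.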

A is the available hardware resource, A_H is the hardware resource used by the graph, and L_G is the overall latency of the graph. As generally reckoned, a design is classified as excellent if it provides the fastest possible application with the minimum

Please cite this article in press as: M. Jemai, B. Ouni, Hardware software partitioning of control data flow graph on system on programmable chip, Microprocess. Microsyst. (2015), http://dx.doi.org/10.1016/j.micpro.2015.04.006


[16] He Hongjun, Dou Qiang, Weixia Xu, Hardware/software partitioning for heterogeneous multicore SoC using genetic algorithm, IEEE Intell. Syst. Des. Eng. Appl. (ISDEA) (January) (2012) 1267–1270.
[17] Yu Jiang, Hehua Zhang, Xun Jiao, Xiaoyu Song, William N.N. Hung, Ming Gu, Jiaguang Sun, Uncertain model and algorithm for hardware/software partitioning, in: IEEE Computer Society Annual Symposium on VLSI, 2012.
[18] Guoshuai Li, Jinfu Feng, Cong Wang, Jinghua Wang, Hardware/software partitioning algorithm based on the combination of genetic algorithm and tabu search, Eng. Rev. 34 (2) (2014).
[19] Anup Das, Akash Kumar, Bharadwaj Veeravalli, Aging-aware hardware–software task partitioning for reliable reconfigurable multiprocessor systems, in: International Conference on Compilers, Architectures and Synthesis for Embedded Systems, 2013.
[20] Honglei Han, Wenju Liu, Jigang Wu, Guiyuan Jiang, Efficient algorithm for hardware/software partitioning and scheduling on MPSoC, J. Comp. 8 (January) (2013) 61–68.
[21] A. Mtibaa, B. Ouni, M. Abid, An efficient list scheduling algorithm for time placement problem, Comp. Electr. Eng. 33 (4) (July 2007) 285–298.
[22] R. Ayadi, B. Ouni, A. Mtibaa, A partitioning methodology that optimizes the communication cost for reconfigurable computing systems, Int. J. Autom. Comput. 9 (3) (2012) 280–287.
[23] B. Ouni, R. Ayadi, A. Mtibaa, Temporal partitioning of data flow graph for dynamically reconfigurable architecture, J. Syst. Arch. 57 (2011) 790–798.
[24] Li-Wei Kang, Jin-Jang Leou, An error resilient coding scheme for H.264 video transmission based on data embedding, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004, ICASSP’04, vol. 3, 2004, pp. 257–260.
[25] Liangbao Jiao, Jing Zhou, Rui Chen, Efficient parallel intra-prediction mode selection scheme for 4 × 4 blocks in H.264, in: IEEE International Conference on Intelligent Computation Technology and Automation (ICICTA), vol. 2, 2011, pp. 527–530.

Mehdi Jemai received a Diploma in Computer Engineering in 2009 from the Higher Institute of Applied Science and Technology of Sousse and a Master in Microelectronic in 2011 from the Faculty of Science of Monastir. He is currently preparing his thesis at the Engineering School of Monastir; his research interests include the development of methodologies for reconfigurable architectures.


Bouraoui Ouni is currently an Associate Professor at the National Engineering School of Sousse. He obtained his Ph.D., entitled ‘Synthesis and temporal partitioning for reconfigurable systems’, in 2008 from the Faculty of Sciences at Monastir, and his university habilitation, entitled ‘Optimisation algorithm for reconfigurable architectures’, in 2012. His research interests cover: models, methods, tools, and architectures for reconfigurable computing; simulation, debugging, synthesis, verification, and test of reconfigurable systems; field programmable gate arrays and other reconfigurable technologies; algorithms implemented on reconfigurable hardware; hardware/software codesign and cosimulation with reconfigurable hardware; and high performance reconfigurable computing.
