Mathematical and Computer Modelling 51 (2010) 974–984


Efficient heuristic algorithms for path-based hardware/software partitioning

Wu Jigang a,b,∗, Thambipillai Srikanthan b, Ting Lei b

a School of Computer Science and Software, Tianjin Polytechnic University, 300160, China
b School of Computer Engineering, Nanyang Technological University, 639798, Singapore

Keywords: Algorithm; Heuristic; Tabu search; Dynamic programming; Hardware/software partitioning

Abstract

Hardware/software (HW/SW) partitioning is one of the crucial steps in co-design systems: it determines which components of a system are implemented in hardware and which in software. In this paper, the computing model is extended to cater for path-based HW/SW partitioning with fine granularity, in which communication penalties between system components must be considered. On the new computing model, an efficient heuristic algorithm is developed in which both the speedup in hardware and the communication penalty are taken into account. In addition, an efficient tabu search algorithm is customized to refine the approximate solutions produced by the heuristic algorithm. Simulation results show that the heuristic algorithm runs fast and is able to produce high-quality approximate solutions, and that the tabu search algorithm can further refine them to nearly optimal solutions within an acceptable runtime. The difference between the approximate solutions and the optimal ones is bounded by 0.5%, and it hardly grows with the problem size. © 2009 Elsevier Ltd. All rights reserved.

1. Introduction

Hardware/software (HW/SW) co-design has become one of the primary applications of electronic system level tools and methodologies. It provides new opportunities for the development of high-speed, low-power electronic products such as embedded, communication, multimedia, and intelligent transport systems. Traditional approaches to the design of simple systems were carried out manually. However, as the systems to be designed have become more and more complex, manual approaches have become infeasible. Thus, it is now imperative to involve design automation at the highest possible level, in order to deal with the high complexity, increased time-to-market pressure and a set of possibly conflicting constraints.

In HW/SW co-design systems, application-specific hardware is usually much faster than software, but it is significantly more expensive. Software, on the other hand, is cheaper to create and to maintain, but slow. Hence, performance-critical components of the system should be realized in hardware and non-critical components in software. HW/SW partitioning decides which components of the system should be implemented in hardware and which ones in software. It has been shown that efficient techniques for HW/SW partitioning can achieve results in performance, power or energy superior to software-only solutions.

There are many different academic approaches to solving the HW/SW partitioning problem. The traditional approaches are hardware-oriented and software-oriented. The former starts with a complete hardware solution and iteratively moves parts of the system to software as long as the performance constraints are fulfilled [1–3], while the latter starts with a software program and moves pieces to hardware to improve speed until the performance of the final system meets the given constraint [4–6]. It has been shown that HW/SW partitioning is NP-hard in most cases. Thus, on the algorithmic side, simulated annealing algorithms [4,7,8], dynamic programming algorithms [9,10], integer programming approaches [11,12]



Corresponding author. E-mail addresses: [email protected], [email protected] (J. Wu).

0895-7177/$ – see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.mcm.2009.08.029


[Fig. 1 appears here. Panel (a) shows the old model for the 4 blocks B1–B4, annotated with the inherent speedups sp1 = 10, sp2 = 7, sp3 = 7, sp4 = 9, the hardware times h1 = 2, h2 = 4, h3 = 3, h4 = 5, and the extra speedups e1 = 2, e2 = 1, e3 = 2. Panel (b) shows the new model, annotated with the software execution times s1 = 12, s2 = 11, s3 = 10, s4 = 14 and the communication times on the edges between Entry, B1–B4 and Exit.]

Fig. 1. Computing model comparison for 4 blocks. (a): Old model indicated by inherent speedups and extra speedups [9,10]. (b): New model indicated by source execution times and all communication times [16].

and genetic algorithms [13,14] are generally utilized to perform the system partitioning and hardware exploration. All these approaches work well within their own co-design environments, but it is impossible to compare them, because of the large differences in their co-design environments and the lack of benchmarks [15].

Advanced profiling techniques provide us with a 'hot path', such as the body of a loop, which consists of the components executed with high frequency in a given application. The HW/SW partitioning for the whole application can thus be approximately solved by efficiently partitioning the selected hot path.

This paper focuses on efficient heuristic algorithms for HW/SW partitioning on the selected hot path. Initially, we extend the computing model such that all possible communications between neighboring components are taken into account, in order to better reflect real applications. Then we propose a fast heuristic algorithm on the new computing model to generate an approximate solution of good quality. In order to obtain nearly optimal solutions, we also customize a tabu search algorithm for the HW/SW partitioning to further refine the approximate solutions. Simulation results show that the proposed tabu search algorithm can refine the approximate solutions so that the solution error relative to the optimal ones is no more than 0.5% for the cases considered in this paper, even if the partitioning problem is of fine granularity.

This paper is organized as follows. In Section 2, we present the computing models and a formal description of the HW/SW partitioning problem. In Section 3, we describe the proposed heuristic algorithm, followed by the tabu search algorithm. In Section 4, we provide simulation results to highlight the solution quality of the proposed algorithms, by comparing them with the exact solutions. Finally, we conclude our work in Section 5.
2. Computing models and formulations

In [9,10], HW/SW partitioning is described as follows: the hot path of a given application consists of a sequence of n blocks, denoted as B = {B1, B2, …, Bn}, that may be moved between hardware and software. Each Bi is followed by Bi+1 for i = 1, 2, …, n − 1. Hardware blocks and software blocks cannot execute in parallel. Adjacent hardware blocks are assumed to be able to communicate the read/write variables they have in common directly between them, without involving the software side. H denotes the set of blocks assigned to hardware; S denotes the set of blocks assigned to software. The objective is to find a partitioning for B such that B = H ∪ S and H ∩ S = ∅, which yields the best speedup while having a total area penalty no more than the available hardware area. The corresponding computing model utilized in [9,10] is shown in Fig. 1(a), where spi denotes the inherent speedup of moving block Bi to hardware, and ei denotes the extra speedup which is incurred because two neighboring blocks can communicate directly with each other when they are both placed in hardware.

It is worth pointing out that communication time becomes more important especially for fine granularity HW/SW partitioning. However, the computing model shown in Fig. 1(a) does not completely take the communication time into consideration, e.g., when only one of two neighbors is placed in hardware. In this paper we work on a new model, shown in Fig. 1(b), for the HW/SW partitioning, in which all types of communication time are taken into account, regardless of how the blocks are implemented. We employ the following notations throughout this paper.

• si denotes the execution time of Bi in software, 1 ≤ i ≤ n.


• hi denotes the execution time of Bi in hardware, 1 ≤ i ≤ n.
• ai denotes the area penalty of moving Bi to hardware, 1 ≤ i ≤ n.
• ci^ss (ci^hh) denotes the communication time between Bi and Bi+1 if both blocks are assigned to software (hardware), 1 ≤ i < n.
• ci^sh (ci^hs) denotes the communication time between Bi and Bi+1 if Bi is assigned to software (hardware) and Bi+1 is assigned to hardware (software), 1 ≤ i < n.

All source data described above can be generated by employing the estimation tools utilized in [9]. For example, in Fig. 1(b), c1^ss = 6, c1^hh = 3, c1^sh = 5 and c1^hs = 4. Without loss of generality, we assume hi < si for 1 ≤ i ≤ n, to guarantee that a hardware implementation accelerates the execution as compared to the software implementation.

In comparison, the extra speedup indicated between two neighboring blocks in the old model is taken into account only for the case of both blocks being assigned to hardware. However, extra speedup could also exist between neighboring blocks when only one of them is assigned to hardware. The new model considers all types of communication derived from all possible HW/SW assignments of the neighboring blocks, utilizing the source data rather than the extra speedup as the mode of measurement. Thus, the new model better reflects real applications.

Let (x1, x2, …, xn) be a feasible solution of the partitioning problem and E(x1, x2, …, xn) be the corresponding execution time, which includes the inherent communication overhead, where xi ∈ {0, 1}, and xi = 1 (xi = 0) indicates that Bi is assigned to hardware (software), 1 ≤ i ≤ n. The execution time can be formalized as

E(x1, x2, …, xn) = Σ_{i=1}^{n} (xi·hi + (1 − xi)·si) + Σ_{i=1}^{n−1} Ci,

where Ci indicates the communication time between Bi and Bi+1, 1 ≤ i ≤ n − 1, calculated by

Ci = xi·xi+1·ci^hh + xi·(1 − xi+1)·ci^hs + (1 − xi)·xi+1·ci^sh + (1 − xi)·(1 − xi+1)·ci^ss.

Given the available hardware area A, the partitioning problem discussed in this paper can be modeled as the following nonlinear minimization problem:

P:  minimize   E(x1, x2, …, xn)
    subject to Σ_{i=1}^{n} ai·xi ≤ A,
               xi ∈ {0, 1}, i = 1, 2, …, n.
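The objective and constraint of P can be made concrete with a small Python sketch (our own illustration, not the authors' code), which evaluates E(x) and the area constraint for a given 0/1 assignment:

```python
def execution_time(x, s, h, c_ss, c_hh, c_sh, c_hs):
    """Total execution time E(x) for a 0/1 assignment x (1 = hardware),
    including the communication time C_i between neighboring blocks."""
    n = len(x)
    total = sum(h[i] if x[i] else s[i] for i in range(n))
    for i in range(n - 1):
        if x[i] and x[i + 1]:          # both in hardware
            total += c_hh[i]
        elif x[i] and not x[i + 1]:    # hardware -> software
            total += c_hs[i]
        elif not x[i] and x[i + 1]:    # software -> hardware
            total += c_sh[i]
        else:                          # both in software
            total += c_ss[i]
    return total

def feasible(x, a, A):
    """Area constraint of problem P: total area of hardware blocks <= A."""
    return sum(a[i] for i in range(len(x)) if x[i]) <= A
```

For instance, with the Fig. 1(b) values c1^ss = 6, c1^hh = 3, c1^sh = 5, c1^hs = 4, assigning the first two blocks both to hardware costs h1 + h2 + c1^hh.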

3. Proposed algorithms

3.1. Heuristic algorithm

First of all, we review the knapsack problem, which is closely related to the partitioning problem P. Given a knapsack capacity C and a set of items S = {1, 2, …, n}, where each item i has a weight wi and a profit pi, the target is to find a subset S′ ⊆ S that maximizes the total profit Σ_{i∈S′} pi under the constraint Σ_{i∈S′} wi ≤ C, i.e., all the chosen items fit in a knapsack of carrying capacity C. This problem is called the knapsack problem. The 0–1 knapsack problem is a special case of the general knapsack problem defined above, where each item can either be selected or not selected, but cannot be selected fractionally. Mathematically, it can be described as follows,

KP:  maximize   Σ_{i=1}^{n} pi·xi
     subject to Σ_{i=1}^{n} wi·xi ≤ C,
                xi ∈ {0, 1}, i = 1, 2, …, n

where xi is a binary variable equal to 1 if item i should be included in the knapsack and 0 otherwise. It is this 0/1 property that makes the knapsack problem NP-hard [17,18]. A well-known heuristic strategy [17–19] for filling the knapsack is as follows. The items are first ordered according to their profit-to-weight ratios,

p1/w1 ≥ p2/w2 ≥ ⋯ ≥ pn/wn.

Then, in each iteration the item with the largest profit-to-weight ratio is packed into the knapsack if it fits in the unused capacity, until the knapsack is full or no item fits in the residual capacity. Let w̄j = Σ_{i=1}^{j} wi, j = 1, 2, …, n. The break item is defined as the first item that cannot be included in the knapsack. Thus the break item t satisfies

w̄_{t−1} ≤ C < w̄_t.
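This greedy strategy, including the break item, can be sketched in a few lines of Python (an illustrative sketch, not code from [17–19]):

```python
def greedy_knapsack(profits, weights, capacity):
    """Greedy heuristic: pack items in decreasing profit-to-weight order.
    Returns the chosen index set and the break item, i.e., the first item
    (in ratio order) that does not fit into the remaining capacity."""
    order = sorted(range(len(profits)),
                   key=lambda i: profits[i] / weights[i], reverse=True)
    chosen, used, break_item = set(), 0, None
    for i in order:
        if used + weights[i] <= capacity:
            chosen.add(i)
            used += weights[i]
        elif break_item is None:
            break_item = i  # first item that cannot be included
    return chosen, break_item
```

For example, with profits (60, 100, 120), weights (10, 20, 30) and capacity 50, the ratios are 6 ≥ 5 ≥ 4, the first two items are packed, and the third is the break item.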

[Fig. 2 appears here: panels (a)–(d) show the four hardware/software states of the neighbors Bi−1 and Bi+1, with the communication edges affected when Bi is moved from software to hardware.]

Fig. 2. Calculation of δi. (a) δi = c_{i−1}^ss + ci^ss − c_{i−1}^sh − ci^hs. (b) δi = c_{i−1}^hs + ci^ss − c_{i−1}^hh − ci^hs. (c) δi = c_{i−1}^ss + ci^sh − c_{i−1}^sh − ci^hh. (d) δi = c_{i−1}^hs + ci^sh − c_{i−1}^hh − ci^hh.

It has been observed in practical studies that the heuristic solution is quite similar to the optimal integral solution, in the sense that they differ in only a few variables [20]. This simple but very efficient heuristic for the knapsack problem motivates us to develop a corresponding approach for the hardware/software partitioning, for the following reasons. It is clear that the block Bi in the problem P corresponds to the item i in KP; also, ai and A in P correspond to wi and C in KP, respectively. Moreover, P and KP have similar constraints. The only difference between P and KP is that the objective function is linear for KP while it is not for P. Hence, the problem P is also NP-hard.

It is worth pointing out that the problem P reduces to a knapsack problem when communication time is not taken into account. This is because, when all communication times are set to 0, the objective function E(x1, x2, …, xn) reduces to Σ_{i=1}^{n} (xi·hi + (1 − xi)·si). Noting that

minimize Σ_{i=1}^{n} (xi·hi + (1 − xi)·si)  ⟺  maximize Σ_{i=1}^{n} (si − hi)·xi,

the problem P is reduced to the following 0–1 knapsack problem,

     maximize   Σ_{i=1}^{n} (si − hi)·xi
     subject to Σ_{i=1}^{n} ai·xi ≤ A,
                xi ∈ {0, 1}, i = 1, 2, …, n.

Let pi = si − hi, where pi is called the profit of the block Bi. The communication profit for moving Bi to hardware, denoted as δi, is defined as

δi = comm_sw(Bi) − comm_hw(Bi),

where comm_sw(Bi) (comm_hw(Bi)) indicates the communication time of Bi to its neighbor(s) when Bi is assigned to software (hardware). Fig. 2 illustrates all the cases of the neighbor states and the corresponding calculations. For example, Fig. 2(b) shows the case in which block Bi−1 is already assigned to hardware while Bi is being considered for moving from software to hardware. In this case, comm_sw(Bi) = c_{i−1}^hs + ci^ss and comm_hw(Bi) = c_{i−1}^hh + ci^hs. Hence, δi is calculated by δi = c_{i−1}^hs + ci^ss − c_{i−1}^hh − ci^hs.

We define the profit-to-area ratio of the block Bi as (pi + δi)/ai, for i = 1, 2, …, n. Our heuristic strategy for HW/SW partitioning is based on the profit-to-area ratio of each block. Briefly,

• First of all, all blocks are assumed to be in software. The profit-to-area ratio (pi + δi)/ai is calculated for each block, where δi is computed as shown in Fig. 2(a).
• Then, the block with the maximum profit-to-area ratio, say Bk, is selected. It will be assigned to hardware if the available hardware area is enough to implement Bk.
• The neighboring block(s) of Bk update(s) the corresponding communication profits, calculated as shown in Fig. 2(b)(c)(d).
• The selection of the maximum profit-to-area ratio and the update of the communication profits are repeated on the remaining blocks, until no block fits the residual hardware area.

We explain our heuristic strategy with the aid of the example shown in Fig. 1(b) with 4 blocks. In this example, the total available hardware area is assumed to be 6. Each ai for block Bi is listed in the first row of Table 1. Each pi, calculated by si − hi, is shown in the second row, i = 1, 2, 3, 4.


Table 1
The partitioning process of the heuristic algorithm for the instance shown in Fig. 1(b). The total available hardware area is set to 6. A check mark indicates the block selected for hardware in that iteration.

Blocks                          B1          B2          B3          B4
HW areas ai                     3           1           2           3
Speedups pi                     10          7           7           9
Iteration 0: (pi + δi)/ai       (10+2)/3    (7+2)/1     (7+5)/2     (9+4)/3
Iteration 1: (pi + δi)/ai       (10+2)/3    √           (7+3)/2     (9+4)/3
Iteration 2: (pi + δi)/ai       (10+2)/3                √           (9−1)/3
Iteration 3: (pi + δi)/ai       √                                   (9−1)/3

(i) Initially, i.e., in iteration 0, each δi is calculated for i = 1, 2, 3, 4, under the assumption that all other blocks are assigned to software. Also, each profit-to-area ratio (pi + δi)/ai is calculated. For example, δ2 is calculated under the assumption that B2 is to be moved to hardware while all other blocks are kept in software, as shown in case (a) of Fig. 2. δ2 is calculated by 6 + 3 − 5 − 2, resulting in 2. Hence, (p2 + δ2)/a2 has the value (7 + 2)/1 in iteration 0.

(ii) Then, the block with the maximum profit-to-area ratio is selected from the blocks assigned to software, and it is assigned to hardware in the current iteration if this assignment generates a positive speedup in hardware. Assume block Bk has the maximum profit-to-area ratio. After Bk is selected to move to hardware, the communication profits δk−1 (if k > 1) and δk+1 (if k < n) need to be updated. It is worth pointing out that only the neighbors of block Bk need to update their communication profits, because moving Bk to hardware only affects its neighbors' communication profits. In iteration 1 of this example, block B2 is selected to move to hardware, as it has the maximum profit-to-area ratio of 9. Then, δ1 and δ3 are updated to 2 and 3 (underlined in the iteration 1 row of the table), respectively. However, δ4 does not require re-calculation, as it can be carried over directly from the last iteration.

(iii) Similarly, the block with the maximum profit-to-area ratio is selected and the communication profits of its neighbors are updated. In this example, block B3 is selected to move to hardware in iteration 2.

(iv) The algorithm terminates when the remaining hardware area is not enough for any of the remaining blocks. In this example, the remaining area becomes 0 after block B1 is moved to hardware. The whole process terminates in iteration 3, with the heuristic solution (1, 1, 1, 0).

Formally, we outline the heuristic algorithm below.
Input: Source data for the blocks B1, B2, …, Bn:
  ai — hardware area penalty of block Bi, 1 ≤ i ≤ n;
  si — execution time of block Bi in software, 1 ≤ i ≤ n;
  hi — execution time of block Bi in hardware, 1 ≤ i ≤ n;
  ci^ss, ci^hh, ci^sh, ci^hs — communication times between blocks Bi and Bi+1, 1 ≤ i < n.
Output: The heuristic solution (x1, x2, …, xn).

Algorithm HEA
/* A heuristic algorithm for HW/SW partitioning, for the given n blocks and the available hardware area A. */
begin
1  for i := 1 to n do σi := (pi + δi)/ai;  /* calculate the profit-to-area ratio for each block */
2  H := {}; S := {B1, B2, …, Bn};  /* H (S) indicates the block set assigned to hardware (software) */
   k := 1; residual_area := A; (x1, x2, …, xn) := (0, 0, …, 0);
3  repeat
   3.1  Br := the block with max_{Bi∈S} {σi};  /* select the block with the maximum profit-to-area ratio in S */
   3.2  if (ar ≤ residual_area) and (σr > 0) then  /* block Br fits in the residual area */
        begin
          xr := 1;  /* assign block Br to hardware */
          H := H ∪ {Br};  /* update H */
          Update σr−1 (if r > 1) and σr+1 (if r < n);
          residual_area := residual_area − ar;
        end;
   3.3  k := k + 1;
   3.4  S := S − {Br};  /* update S */
   until (residual_area ≤ 0) or (k > n);
4  Output (x1, x2, …, xn);
end.
Note that step 1 runs in O(n) and step 3 runs in O(k · log n) if the σi values are managed using a heap data structure [21], where k is the number of iterations of the repeat-loop, i.e., the number of selected blocks that are considered for assignment to hardware. Thus, the time complexity of the algorithm HEA is bounded by O(n + k · log n).
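The algorithm above can be sketched in Python as follows. This is an illustrative version, not the authors' implementation: it recomputes δi on demand with a linear scan, giving O(n²) time rather than the heap-based bound, and the communication values for edges 2 and 3 used in the test data are hypothetical choices consistent with Table 1:

```python
def hea(s, h, a, c_ss, c_hh, c_sh, c_hs, A):
    """Greedy HW/SW partitioning by profit-to-area ratio (p_i + delta_i)/a_i.
    x[i] = 1 means block B_i is assigned to hardware."""
    n = len(s)
    x = [0] * n
    residual = A

    def delta(i):
        # Communication profit of moving B_i to hardware, given the current
        # assignment of its neighbors (the four cases of Fig. 2).
        d = 0
        if i > 0:                      # edge (B_{i-1}, B_i)
            if x[i - 1]:               # left neighbor in hardware
                d += c_hs[i - 1] - c_hh[i - 1]
            else:                      # left neighbor in software
                d += c_ss[i - 1] - c_sh[i - 1]
        if i < n - 1:                  # edge (B_i, B_{i+1})
            if x[i + 1]:               # right neighbor in hardware
                d += c_sh[i] - c_hh[i]
            else:                      # right neighbor in software
                d += c_ss[i] - c_hs[i]
        return d

    in_sw = set(range(n))
    while in_sw and residual > 0:
        r = max(in_sw, key=lambda i: (s[i] - h[i] + delta(i)) / a[i])
        sigma = (s[r] - h[r] + delta(r)) / a[r]
        if a[r] <= residual and sigma > 0:
            x[r] = 1
            residual -= a[r]
        in_sw.remove(r)                # block is settled either way
    return x
```

Running it on the Fig. 1(b) instance with s = (12, 11, 10, 14), h = (2, 4, 3, 5), a = (3, 1, 2, 3) and A = 6 reproduces the heuristic solution (1, 1, 1, 0) of Table 1.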

[Fig. 3 appears here: a feasible solution bit-string is split by the break item into part 1 (contiguous '1's) and part 2 (scattered '1's); a neighbor is generated by randomly choosing one bit in each part and flipping both.]

Fig. 3. Generating a neighbor of a feasible solution.

3.2. Refining the heuristic solution via tabu search

Tabu search (TS) is one of the traditional heuristic-based algorithms for searching for the global optimal solution of NP-hard problems [22,23]. It is an iterative search method designed to cross boundaries of feasibility and to systematically impose and release constraints so as to allow the exploration of otherwise forbidden regions [24]. Therefore, it is possible to find a global optimum using tabu search, although the global optimum is not guaranteed by this method. In this subsection, we customize a TS algorithm, denoted as TSA, to refine the heuristic solution generated by HEA.

Generally, TS has five primary components: a local search procedure, a neighborhood structure, tabu conditions, aspiration conditions and a stopping rule. During the search, TS keeps a list of the search moves made in each iteration, called the tabu list, to restrict the local search procedure from reusing those moves. A recency-based memory stores the recent search areas in order to avoid cyclic search of a local zone; the tabu status of a search move is released after a certain time, according to the recency-based memory size. A frequency-based memory stores the frequency of searching in each area; we apply it to diversify the search. Also, an aspiration criterion is utilized so that, if a tabu move generates a better solution than all the feasible solutions obtained so far, its tabu status is neglected and the corresponding tabu area becomes eligible for searching again. The stopping rule of a tabu search may be a fixed number of iterations, a fixed amount of CPU time, a fixed number of consecutive iterations without an improvement in the best objective function value, etc. In this paper, TSA starts with the heuristic solution generated by HEA.
Note that the heuristic solution has the format '11111101000…' (similar to the solution format of the 0–1 knapsack problem), which is divided by the break item into two parts, as shown in Fig. 3. Part 1 consists of continuous '1's and part 2 has scattered '1's among '0's. In our algorithm TSA, we generate a neighbor of a given feasible solution by randomly changing (flipping) one bit in part 1 and one bit in part 2. If the generated neighbor fails to fulfill the area constraint, it is neglected and we try again, for a fixed number of times. If all attempts fail, only the flipping in part 1 is applied to the feasible solution. Assume xneib = (x1, x2, …, xn) is a neighbor of a local solution xlocal. We define a function dcost as

dcost(xneib, xlocal) = E(xneib) − E(xlocal).

Thus, the smaller the value of dcost, the better the quality of the neighbor xneib.

To illustrate how TSA works with the recency-based memory and the frequency-based memory, we employ the example shown in Fig. 4 for the first 4 iterations. To be simple and clear, the neighborhood size is set to 2. The tabu tenure (length of the tabu list) is set to 2, i.e., a pair of flipped bits can only be reconsidered for flipping after 2 iterations. The solution consists of 10 bits, and its break item is located at position 6 in this example. At iteration 1, the first neighbor is generated by flipping bit 1 and bit 7, and the second neighbor by flipping bit 2 and bit 9. Then we calculate their dcosts (assumed to be −1 and 2, respectively, in this example). After that we choose neighbor 1 as xlocal because it has the smaller dcost value. At the same time, the frequency-based memory is updated, i.e., the values at positions 1 and 7 in the table are incremented by 1. Also, the pair of flipped bits (1, 7) is recorded in the recency-based memory, being put into the (initially empty) tabu list in FIFO manner.
At iteration 2, we generate two neighbors with the flipped-bit pairs (1, 6) and (3, /), respectively, where (3, /) indicates that neighbor 2 flipped only one bit. Here, the dcost values of the two neighbors are assumed to be 4 and 2, respectively. When all non-tabu neighbors have dcost larger than 0, an award rule is applied to encourage searching a new zone. In this example, we pay the award '−Q' to the neighbors, as bits 6 and 3 have never been flipped so far, where Q is a given positive value. Neighbor 2 is chosen as xlocal after applying the award, and the corresponding memories are updated. At iteration 3, the two neighbors are (1, 6) and (1, 7). Note that (1, 7) is already in the tabu list. Therefore, we do not choose (1, 7) as xlocal even though its dcost is smaller than that of the neighbor (1, 6); xlocal is set to the neighbor (1, 6).


For the same reason, the neighbor (3, /) is not selected as xlocal in iteration 4. It is noteworthy that the neighbor (1, 7) is selected again in this iteration, since (1, 7) has by then been released from the tabu list, according to the tabu list size. As mentioned before, if a tabu-active feasible solution is better than the best-so-far solution, aspiration is applied, that is, the tabu status of the solution is neglected.

In the whole search process, a neighbor xneib may enter the tabu list many times. Assume that the latest entrance of xneib was at iteration iter_late(xneib) and the current search is at iteration iter_curr. The tabu degree of xneib, denoted as Tdegree(xneib), is defined as

Tdegree(xneib) = iter_late(xneib) + tabu_tenure − iter_curr.

The tabu degree is updated for each neighbor in each iteration. A non-negative tabu degree implies that the neighbor is tabu-active, while a negative one implies that it is not. The outline of the algorithm TSA is as follows.

Input: xheur — the heuristic solution generated by the algorithm HEA;
Output: xbest_so_far — the best solution found by tabu search;
Algorithm TSA
/* Tabu search algorithm for the problem P. M indicates the fixed number of iterations. q indicates the neighborhood size. */

begin
1  xlocal := xheur; xbest_so_far := xheur;
2  for i := 1 to M do
   begin
   2.1  Generate q neighbors of xlocal;
   2.2  Update the tabu degrees and dcosts of the q neighbors;
   2.3  if all q neighbors are tabu-active
        then xlocal := the neighbor with the minimal tabu degree
        else xlocal := the non-tabu neighbor with the minimal dcost;
   2.4  if E(xlocal) < E(xbest_so_far) then xbest_so_far := xlocal;
   2.5  Update the frequency-based memory;
   2.6  Update the recency-based memory;  /* put an item into the tabu list */
   end;
end.

It is worth pointing out that the solution refined by TSA is never worse than the initial solution provided by HEA. This is because the best-so-far solution is updated whenever TSA finds a better local solution, according to step 2.4.

4. Simulation results

4.1. The optimal solution

In order to evaluate the performance of the algorithms HEA and TSA, we introduce a dynamic programming approach, denoted as DPA in this paper, to calculate the optimal solution of the problem P. The main idea is as follows. Assuming that the optimal HW/SW partitioning for B1, B2, …, Bk−1 has been computed for every utilized hardware area less than a, we now consider how to partition the blocks B1, B2, …, Bk within the available area a. This is achieved by enumerating the partitioning possibilities based on implementing the current block Bk in software or in hardware; the optimal partitioning gives the best possible execution time. If Bk is implemented in software, the optimal partitioning for B1, B2, …, Bk for hardware area a is identical to the optimal partitioning for B1, B2, …, Bk−1 for hardware area a. If Bk is moved to hardware, the optimal partitioning for B1, B2, …, Bk can be found by examining partitionings of the blocks B1, B2, …, Bk−1 for area a − ak. Let E_op(k, a) indicate the optimal execution time achievable by moving some or all of the blocks B1, B2, …, Bk to hardware of size a.
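As an illustration, the core loop of such a tabu search might look like the Python sketch below. This is a simplified sketch, not the authors' implementation: the break-item split of Fig. 3 is approximated by a fixed halfway split, and the frequency-based memory, the award rule and the aspiration criterion are omitted:

```python
import random

def tsa(x0, cost, a, A, n_iters=1000, q=8, tenure=2, seed=0):
    """Tabu-search refinement sketch for problem P.
    cost(x) evaluates E(x); the area constraint is sum over hardware blocks <= A.
    A move is identified by the pair of flipped positions; the recency-based
    memory maps each move to the iteration of its latest entry into the tabu list."""
    rng = random.Random(seed)
    n = len(x0)
    x_local, x_best = list(x0), list(x0)
    latest = {}  # move -> iteration when it last became tabu

    def feasible(x):
        return sum(a[i] for i in range(n) if x[i]) <= A

    def neighbor():
        # Flip one bit in the first half and one in the second half
        # (standing in for the paper's part-1/part-2 split at the break item).
        i = rng.randrange(0, n // 2)
        j = rng.randrange(n // 2, n)
        y = list(x_local)
        y[i] ^= 1
        y[j] ^= 1
        return (i, j), y

    for it in range(n_iters):
        moves = [neighbor() for _ in range(q)]
        moves = [(m, y) for m, y in moves if feasible(y)]
        if not moves:
            continue

        def tdegree(m):  # Tdegree = iter_late + tenure - iter_curr
            return latest.get(m, -10**9) + tenure - it

        non_tabu = [(m, y) for m, y in moves if tdegree(m) < 0]
        if non_tabu:  # pick the non-tabu neighbor with minimal dcost
            m, y = min(non_tabu, key=lambda my: cost(my[1]) - cost(x_local))
        else:         # all tabu-active: pick the one closest to release
            m, y = min(moves, key=lambda my: tdegree(my[0]))
        x_local = y
        latest[m] = it  # recency-based memory update
        if cost(x_local) < cost(x_best):
            x_best = list(x_local)
    return x_best
```

Because the best-so-far solution only ever improves, the returned solution is never worse than the starting point.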
Let E_sw(k, a) indicate the execution time achievable by keeping Bk in software and moving some or all of the blocks B1, B2, …, Bk−1 to hardware of size a. Further, let E_hw(k, a) indicate the execution time achievable by moving Bk to hardware and then moving some or all of the blocks B1, B2, …, Bk−1 to the area a − ak. The recurrence described above can be formulated as follows (DPA):

E_sw(1, a) = s1;
E_hw(k, 0) = +∞ for k = 1, 2, …, n;
E_hw(1, a) = +∞ for a < a1; h1 otherwise;
E_sw(k, a) = min{ E_sw(k−1, a) + c_{k−1}^ss + sk,  E_hw(k−1, a) + c_{k−1}^hs + sk };
E_hw(k, a) = +∞ for a < ak; otherwise
             min{ E_sw(k−1, a−ak) + c_{k−1}^sh + hk,  E_hw(k−1, a−ak) + c_{k−1}^hh + hk };
E_op(k, a) = min{ E_sw(k, a), E_hw(k, a) };  k = 2, 3, …, n.
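The recurrence translates directly into a table-filling procedure. A Python sketch (our own illustration) with plot granularity 1, running in O(n · A) time and space:

```python
import math

def dpa(s, h, a, c_ss, c_hh, c_sh, c_hs, A):
    """Dynamic-programming sketch of DPA with area granularity 1.
    E_sw[k][ar] / E_hw[k][ar] hold the best execution time for blocks
    B1..B_{k+1} with the last block in software / hardware, using at most
    ar units of hardware area.  Returns the optimal time E_op(n, A)."""
    n = len(s)
    INF = math.inf
    E_sw = [[INF] * (A + 1) for _ in range(n)]
    E_hw = [[INF] * (A + 1) for _ in range(n)]
    for ar in range(A + 1):                      # base cases for B1
        E_sw[0][ar] = s[0]
        E_hw[0][ar] = h[0] if ar >= a[0] else INF
    for k in range(1, n):
        for ar in range(A + 1):
            E_sw[k][ar] = min(E_sw[k-1][ar] + c_ss[k-1] + s[k],
                              E_hw[k-1][ar] + c_hs[k-1] + s[k])
            if ar >= a[k]:
                E_hw[k][ar] = min(E_sw[k-1][ar - a[k]] + c_sh[k-1] + h[k],
                                  E_hw[k-1][ar - a[k]] + c_hh[k-1] + h[k])
    return min(E_sw[n-1][A], E_hw[n-1][A])
```

For the first two blocks of Fig. 1(b) (with hypothetical areas a1 = 3, a2 = 1), the optimal time is 9 when both blocks fit in hardware (A = 6) and 17 when only one fits (A = 3).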

[Fig. 4 appears here: a worked example of the tabu search with n = 10, break point = 6, 2 neighbors per iteration and tabu tenure 2, showing for iterations 1–4 the flipped-bit pairs of the generated neighbors, their dcost values, the FIFO tabu list, and the frequency-based memory.]

Fig. 4. Example for tabu search.

Unlike the heuristic algorithms, DPA depends not only on the number of blocks but also on the plot granularity of the given hardware area A. In fact, given n blocks and a list of trial areas ⟨A1, A2, …, Am⟩, its complexity is bounded by O(n · m) both in computation and in space [16]; it is O(n · A) for a plot granularity of 1. Therefore, DPA is very limited in dealing with large problems, mainly due to its computational and memory requirements in practice. However, DPA produces the optimal solution, by the dynamic programming principle of optimality [21]. We utilize it to evaluate the proposed HEA and TSA.

4.2. Performance comparisons

We simulate the algorithms HEA, TSA and DPA in C on a workstation (Intel Pentium 4, 3.5 GHz CPU, 3.5 GB of memory, running the Windows operating system). In order to verify their computational ability, without loss of generality, random instances generated in a manner similar to that employed in [25] are utilized in our simulations. This is appropriate because the main operations in each algorithm are numerical calculations, such as additions and comparisons, in memory arrays. Following the simulation approach in [25], we generate the parameters below for each block Bi, i = 1, 2, …, n.

• ai is randomly generated in [1, 40], and A is set to α · ∑i=1..n ai for 0 < α < 1.
• si is randomly generated in [1, 100].
• hi is set to λi · si, where λi is randomly generated in (0, 1).
• ciss, cish, cihs and cihh are randomly generated in [1, c], where c is called the communication cost basis, 10 < c ≤ 100. Smaller c values reflect the coarse-granularity (computation-intensive) cases, while larger c values reflect the fine-granularity (communication-intensive) cases, as described in [25].
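To make the experimental setup concrete, the parameter generation above can be sketched in Python. The function name and the returned structure are our own choices, and the open interval (0, 1) for λi is approximated by random.random():

```python
import random

def generate_instance(n, alpha=0.5, c=50):
    """Generate one random instance following the simulation setup of [25].

    Returns areas a, software times s, hardware times h, the four
    communication costs per block, and the hardware area budget A.
    (Field names are illustrative, not taken from the paper.)
    """
    a = [random.randint(1, 40) for _ in range(n)]     # hardware area of block Bi
    s = [random.randint(1, 100) for _ in range(n)]    # software execution time si
    # hi = lambda_i * si with lambda_i in (0, 1); random.random() gives [0, 1),
    # which is close enough for simulation purposes.
    h = [random.random() * si for si in s]
    # Four communication costs per block, each drawn from [1, c]:
    comm = [{k: random.randint(1, c) for k in ("ss", "sh", "hs", "hh")}
            for _ in range(n)]
    A = int(alpha * sum(a))                           # available hardware area
    return a, s, h, comm, A
```

Varying alpha reproduces the A = 20%, 50%, 80% · ∑ai settings used in the experiments, and varying c moves between the computation-intensive and communication-intensive cases.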

The quality of an approximate solution x is measured by the ratio of the solution error to the optimal solution value, denoted ε(x). Let x∗ be the optimal solution produced by DPA, and E(x∗) the solution value of x∗. Formally,

ε(x) = (E(x) − E(x∗)) / E(x∗) × 100%.

In TSA, the neighborhood size is set to n/2 and the tabu tenure is set to 2, i.e., a pair of flipped bits stays in the tabu list for 2 iterations. In our experiments, we implement TSA with two different initial solutions: one is a random feasible solution and


(The figure plots the difference (%) with the exact solution against the number of iterations, from 0 to 1400.)

Fig. 5. Solution error ε(x) between DPA and TSA(heur) with different tabu iterations, averaged over 20 random instances, n = 1000, A = 50% · ∑ai, and c = 50.

(The figure plots the solution value (×10^4) against the communication cost basis, from 10 to 100, with curves for DPA, HEA, TSA(rand) and TSA(heur).)

Fig. 6. Solution comparison on different communication costs, averaged over 20 random instances, n = 500, A = 30% · ∑ai.

the other is the heuristic solution produced by HEA. The TSA with a random initial solution is denoted TSA(rand), and the TSA with the heuristic initial solution is denoted TSA(heur) in this paper.

Fig. 5 shows the solution error (difference) between DPA and TSA(heur) over the iterations. Initially, the error between the heuristic solution and the optimal one is about 2.4%. It is reduced significantly, to about 0.6%, in the first 200 iterations, because tabu search still has many moves available to improve the relatively inefficient current solution. The heuristic solution is further refined by the subsequent iterations. However, after 500 iterations the current solution is very close to the exact one, which implies that few moves can further improve it. As a result, refining the current solution gradually becomes difficult for TSA(heur), although the difference with the exact solution keeps decreasing. Therefore, the fixed number of iterations (i.e., M in TSA) is set to 1000 in this paper.

Fig. 6 shows the solution values of the proposed algorithms. HEA is able to produce nearly optimal solutions for the computation-intensive case (c < 20). This is because HEA is developed from a heuristic strategy for solving the 0–1 knapsack problem: the block with the highest (pi + δi)/ai ratio is considered first for hardware implementation. This ratio is close to pi/ai when δi (which corresponds to the communication cost basis) is small. Therefore, the solution of the HW/SW partitioning problem considered in this paper is very close to that of the classic 0–1 knapsack problem, which is solved effectively by the heuristic algorithm that chooses items based on the pi/ai ratio.
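The greedy selection rule discussed above can be sketched as follows. Here pi and δi are placeholders for the profit and communication terms defined in the paper's model (not reproduced in this section), and the tie-breaking order is our own choice:

```python
def greedy_partition(p, delta, a, A):
    """Assign blocks to hardware in decreasing (p_i + delta_i)/a_i order,
    as long as the area budget A allows.

    Illustrative sketch only: p and delta stand in for the profit and
    communication-gain terms of the paper's model.
    """
    # Rank all blocks by the ratio used by HEA; Python's sort is stable,
    # so ties keep their original index order.
    order = sorted(range(len(a)),
                   key=lambda i: (p[i] + delta[i]) / a[i],
                   reverse=True)
    hw, used = set(), 0
    for i in order:
        if used + a[i] <= A:      # take the block only if it still fits
            hw.add(i)
            used += a[i]
    return hw                     # indices of blocks implemented in hardware
```

With delta set to all zeros this degenerates to the classic greedy rule for the 0–1 knapsack problem, which is consistent with HEA behaving almost optimally in the computation-intensive case.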


Table 2
The solution quality ε (%) and the algorithm runtime T (ms), averaged over 20 random instances with c = 50. A dash (–) marks cases where DPA could not run, so no exact value (and hence no ε) is available.

A = 20% · ∑ai:

n      DPA E(x*)  DPA T    HEA ε  HEA T  TSA ε  TSA T
100    275        1.0      1.2    0.0    0.2    12.1
150    405        2.0      1.0    0.0    0.1    19.4
200    588        3.9      0.9    0.0    0.2    27.1
250    693        5.5      0.7    0.0    0.2    35.5
300    855        7.8      0.8    0.0    0.2    44.5
500    1379       20.6     0.7    0.0    0.2    87.3
600    1718       27.7     0.7    0.0    0.2    104.3
700    1976       38.0     0.6    0.0    0.3    135.2
800    2258       48.5     0.7    0.0    0.2    154.3
900    2614       59.6     0.7    0.1    0.3    180.2
1000   2796       73.1     0.7    1.0    0.3    208.7
2000   5720       295.1    0.7    3.5    0.3    644.0
3000   8515       617.4    0.6    7.8    0.3    1153.5
4000   11173      1139.0   0.6    13.7   0.3    1975.2
5000   13873      1768.1   0.6    21.4   0.3    2971.3

A = 50% · ∑ai:

n      DPA E(x*)  DPA T    HEA ε  HEA T  TSA ε  TSA T
100    247        2.0      1.7    0.0    0.1    11.9
150    370        4.3      2.5    0.0    0.2    19.2
200    476        7.7      2.3    0.0    0.3    27.1
250    575        11.9     2.3    0.0    0.2    35.4
300    728        16.8     2.2    0.0    0.3    43.8
500    1171       44.5     2.4    0.0    0.3    81.7
600    1405       64.8     2.2    0.0    0.3    104.1
700    1690       86.0     2.3    0.2    0.3    128.3
800    1860       115.6    2.4    1.0    0.3    159.2
900    2151       142.2    2.3    1.1    0.4    185.7
1000   2299       178.1    2.4    1.1    0.3    219.5
2000   4786       724.8    2.3    6.3    0.4    680.6
3000   7127       1579.0   2.4    13.8   0.5    1163.5
4000   –          –        –      25.2   –      2060.4
5000   –          –        –      39.8   –      3025.6

A = 80% · ∑ai:

n      DPA E(x*)  DPA T    HEA ε  HEA T  TSA ε  TSA T
100    227        3.0      2.9    0.0    0.4    9.2
150    343        7.2      3.5    0.0    0.3    17.1
200    429        11.7     3.1    0.0    0.3    21.8
250    596        18.1     3.4    0.0    0.2    28.4
300    666        25.7     3.2    0.0    0.2    36.0
500    1067       69.5     3.1    0.0    0.2    67.7
600    1339       100.5    3.2    0.0    0.2    86.7
700    1572       136.5    3.1    1.0    0.2    107.4
800    1776       175.6    3.3    1.0    0.2    128.3
900    1986       222.1    3.2    1.1    0.2    153.2
1000   2188       291.2    3.2    2.0    0.2    197.2
2000   –          –        –      10.2   –      557.0
3000   –          –        –      19.3   –      1033.5
4000   –          –        –      34.5   –      2031.6
5000   –          –        –      51.3   –      2672.3

Table 3
Comparisons of DPA, HEA and TSA.

Algorithm   Time complexity    Space complexity   Solution accuracy   Property
DPA         O(n · A)           O(n · A)           ε = 0.0%            Exact
HEA         O(n + k · log n)   O(n)               ε ≤ 3.5%            Approximate
TSA         O(M · q)           O(n · q)           ε ≤ 0.5%            Approximate

On the other hand, as the communication cost basis increases, δi is no longer negligible in evaluating (pi + δi)/ai. In this case the solution of the HW/SW partitioning problem gradually deviates from that of the 0–1 knapsack problem. Hence, the gap between the heuristic solution and the optimal solution gradually becomes visible, although the heuristic solution of HEA remains better than the solution of TSA(rand), as shown in Fig. 6. It is worth pointing out that the heuristic solution can be refined by TSA(heur) to a nearly optimal one, both for the computation-intensive case and for the communication-intensive case. In addition, TSA(heur) remains better than TSA(rand): because tabu search is a meta-heuristic with memory, an efficient initial solution is crucial for obtaining a near-optimal solution.

Table 2 shows the quality of the approximate solutions and the runtime (T) of the proposed algorithms. In Table 2, the optimal solution value E(x∗) is rounded to the nearest whole number, the other attributes are rounded to one decimal place, and TSA denotes TSA(heur). As the table shows, the quality of the approximate solutions is not significantly impacted by the problem size (the number of blocks) for either HEA or TSA. For example, for the case of A = 50% · ∑ai, the ε values of HEA are 2.2% and 2.4% for n = 300 and n = 3000, respectively; the ratio fluctuates around 2.3% across problem sizes. Moreover, it can be further reduced by TSA to no more than 0.5%. Similar results appear in the other cases.

In contrast to the problem size, the available hardware area A does impact the quality of the approximate solutions: the solution error of HEA increases with increasing hardware area. For the case of n = 1000, for example, the ε value is 0.7% for A = 20% · ∑ai, but it increases to 3.2% when A increases to 80% · ∑ai. This is because more blocks may be moved to hardware when the available hardware area increases, so HEA requires more iterations to construct an approximate solution, and the solution error accumulates. It is worth pointing out that TSA successfully refines the approximate solution, reducing the ε values to 0.2% for A = 80% · ∑ai. Hence, the solution quality can be ensured by the proposed heuristic algorithms, especially by TSA.

On the other hand, HEA runs clearly faster than DPA and TSA. For the case of n = 2000, for example, HEA produces an approximate solution within about 6.3 ms for A = 50% · ∑ai, while DPA and TSA require about 724 ms and 680 ms, respectively. As the number of blocks or the available hardware area increases, DPA fails due to the computer memory limit (shown as '–' in the table), but HEA and TSA continue to work, and it is reasonable to deduce that both still produce approximate solutions of good quality.

Table 3 summarizes the performance of DPA, HEA and TSA, where n is the number of blocks, A the hardware area, k the number of selected blocks considered for hardware assignment, and M and q the total number of iterations and the neighborhood size in tabu search, respectively. In conclusion, HEA is a fast algorithm for approximately solving the problem P, while TSA can refine the approximate solution to a nearly optimal one within acceptable runtime.
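The refinement step described above can be sketched as a minimal tabu search using the parameters reported in the text (flip-bit moves, neighborhood size n/2, tabu tenure 2, M = 1000 iterations). The objective `value` and feasibility test `feasible` are caller-supplied placeholders for the paper's cost model, and the aspiration criterion is our own choice:

```python
import random

def tabu_refine(x0, value, feasible, M=1000, q=None, tenure=2):
    """Refine a binary solution by tabu search with flip-bit moves.

    At each iteration, q random candidate flips are evaluated; a move is
    tabu-active for `tenure` iterations after being made, unless it beats
    the best solution found so far (aspiration). Sketch only: `value` and
    `feasible` stand in for the paper's cost model.
    """
    n = len(x0)
    if q is None:
        q = max(1, n // 2)          # neighborhood size n/2, as in the experiments
    x = list(x0)
    best, best_val = list(x), value(x)
    tabu = {}                        # flipped bit -> iteration when it becomes free
    for it in range(M):
        candidates = random.sample(range(n), q)
        chosen, chosen_val = None, None
        for i in candidates:
            x[i] ^= 1                # tentative flip
            v, ok = value(x), feasible(x)
            x[i] ^= 1                # undo
            if not ok:
                continue
            # aspiration: a tabu-active move is allowed only if it beats the best
            if tabu.get(i, 0) > it and v <= best_val:
                continue
            if chosen is None or v > chosen_val:
                chosen, chosen_val = i, v
        if chosen is None:
            continue                 # no admissible move this iteration
        x[chosen] ^= 1
        tabu[chosen] = it + tenure   # mark the move tabu-active
        if chosen_val > best_val:
            best, best_val = list(x), chosen_val
    return best, best_val
```

Seeding this routine with the HEA solution corresponds to TSA(heur), while a random feasible bit-vector corresponds to TSA(rand).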


5. Conclusions

We have proposed a heuristic algorithm for path-based HW/SW partitioning on an extended computing model in which communication penalties between neighboring components are considered. The proposed heuristic algorithm is able to produce nearly optimal solutions for the case of coarse granularity, and it can also generate high-quality approximate solutions for the fine granularity. In addition, we have proposed an efficient algorithm based on tabu search, which is able to refine the heuristic solutions to nearly optimal ones in acceptable runtime, both for the coarse granularity and for the fine granularity.

Acknowledgements

We would like to thank the anonymous reviewers for their valuable suggestions. Part of this work was presented at the 8th IEEE/ACIS International Conference on Computer and Information Science (ICIS 2009).

References

[1] R.K. Gupta, C. Coelho, G. De Micheli, Synthesis and simulation of digital systems containing interacting hardware and software components, in: Proc. the 29th ACM/IEEE Design Automation Conference, Los Alamitos, CA, USA, June 1992, pp. 225–230.
[2] R. Gupta, G.D. Micheli, Hardware–software cosynthesis for digital systems, IEEE Design and Test of Computers 10 (3) (1993) 29–41.
[3] R. Niemann, P. Marwedel, Hardware/software partitioning using integer programming, in: Proc. the IEEE/ACM European Design Automation Conference (EDAC), Paris, France, March 1996, pp. 473–479.
[4] R. Ernst, J. Henkel, T. Benner, Hardware–software co-synthesis for micro-controllers, IEEE Design and Test of Computers 10 (4) (1993) 64–75.
[5] F. Vahid, D.D. Gajski, J. Gong, A binary-constraint search algorithm for minimizing hardware during hardware/software partitioning, in: Proc. the IEEE/ACM European Design Automation Conference (EDAC), Paris, France, February 1994, pp. 214–219.
[6] F. Vahid, D.D. Gajski, Clustering for improved system-level functional partitioning, in: Proc. the 8th International Symposium on System Synthesis, Cannes, France, September 1995, pp. 28–33.
[7] Z. Peng, K. Kuchcinski, An algorithm for partitioning of application specific systems, in: Proc. the IEEE/ACM European Design Automation Conference (EDAC), Paris, February 1993, pp. 316–321.
[8] J. Henkel, R. Ernst, An approach to automated hardware/software partitioning using a flexible granularity that is driven by high-level estimation techniques, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 9 (2) (2001) 273–289.
[9] J. Madsen, J. Grode, P.V. Knudsen, M.E. Petersen, A. Haxthausen, LYCOS: The Lyngby co-synthesis system, Design Automation for Embedded Systems 2 (1997) 195–235.
[10] Jigang Wu, T. Srikanthan, Low-complex dynamic programming algorithm for hardware/software partitioning, Information Processing Letters 98 (2006) 41–46.
[11] R. Niemann, P. Marwedel, An algorithm for hardware/software partitioning using mixed integer linear programming, Design Automation for Embedded Systems 2 (2) (1997) 165–193. Special issue on partitioning methods for embedded systems.
[12] M. Weinhardt, Integer programming for partitioning in software oriented codesign, Lecture Notes in Computer Science 975 (1995) 227–234.
[13] G. Quan, X. Hu, G.W. Greenwood, Preference-driven hierarchical hardware/software partitioning, in: Proc. IEEE International Conference on Computer Design, Austin, TX, USA, October 1999, pp. 652–657.
[14] V. Srinivasan, S. Radhakrishnan, R. Vemuri, Hardware software partitioning with integrated hardware design space exploration, in: Proc. of DATE'98, Paris, France, February 1998, pp. 28–35.
[15] S.A. Edwards, L. Lavagno, E.A. Lee, et al., Design of embedded systems: Formal models, validation, and synthesis, Proceedings of the IEEE 85 (3) (1997) 366–390.
[16] Jigang Wu, T. Srikanthan, G. Zou, New model and algorithm for hardware/software partitioning, Journal of Computer Science & Technology 23 (4) (2008) 644–651.
[17] S. Martello, P. Toth, Knapsack Problems: Algorithms and Computer Implementations, John Wiley & Sons, 1990.
[18] D. Pisinger, Algorithms for knapsack problems, Ph.D. Thesis, University of Copenhagen, 1995.
[19] E. Balas, E. Zemel, An algorithm for large zero-one knapsack problems, Operations Research 28 (1980) 1130–1154.
[20] R. Beier, B. Vöcking, Probabilistic analysis of knapsack core algorithms, in: Proc. the 15th Annual ACM-SIAM Symposium on Discrete Algorithms, Louisiana, 2004, pp. 468–477.
[21] A.V. Aho, J.E. Hopcroft, J.D. Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley, Reading, MA, 1974.
[22] F. Glover, Future paths for integer programming and links to artificial intelligence, Computers and Operations Research 13 (1986) 533–549.
[23] F. Glover, M. Laguna, Tabu Search, Kluwer Academic Publishers, 1997.
[24] V.J. Rayward-Smith, I.H. Osman, C.R. Reeves, G.D. Smith, Modern Heuristic Search Methods, John Wiley and Sons, 1996.
[25] P. Arato, Z.A. Mann, A. Orban, Algorithmic aspects of hardware/software partitioning, ACM Transactions on Design Automation of Electronic Systems 10 (1) (2005) 136–156.