$$
P^{x_k}_{(i,j)\rightarrow(i',j')} =
\begin{cases}
\dfrac{\big[u^{x_k}_{(i',j')}\big]^{\alpha}\,\big[g^{x_k}_{(i',j')}\big]^{\beta}}
{\sum_{(r,h)\in H_{x_k}[I_r]}\big[u^{x_k}_{(r,h)}\big]^{\alpha}\,\big[g^{x_k}_{(r,h)}\big]^{\beta}}, & (i',j')\in H_{x_k}[I_r]\\[2mm]
0, & \text{otherwise}
\end{cases}
\qquad (8)
$$
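To make the transition rule (8) concrete, here is a minimal Python sketch. All names are hypothetical: `pheromone` and `heuristic` map candidate nodes to the u and g values, and `feasible` plays the role of the set H_{x_k}[I_r].

```python
import random

def transition_probabilities(pheromone, heuristic, feasible, alpha, beta):
    """Per Eq. (8): weight each feasible node by u^alpha * g^beta and
    normalize; nodes outside `feasible` implicitly get probability 0."""
    weights = {n: (pheromone[n] ** alpha) * (heuristic[n] ** beta)
               for n in feasible}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}

def choose_next(pheromone, heuristic, feasible, alpha=1.0, beta=2.0):
    """Sample the ant's next node from the distribution of Eq. (8)."""
    probs = transition_probabilities(pheromone, heuristic, feasible, alpha, beta)
    nodes = list(probs)
    return random.choices(nodes, weights=[probs[n] for n in nodes])[0]
```

With α = β = 1, a candidate whose heuristic value is twice that of its competitor receives probability 2/3.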
Designing a task placement algorithm for multi-FPGA systems is difficult. Not only must it solve the fragmentation problem of FPGA resources [10], but it also faces two additional challenges. First, the number of loading ports on one FPGA unit is limited, so tasks have to be loaded to the FPGA sequentially and then reconfigured in the FPGA unit [22]. In a system with multiple FPGAs, by contrast, reconfigurations of different tasks can take place at the same time on different FPGAs, so the global makespan can potentially be shortened. As illustrated in Fig. 2, there are three independent tasks T1, T2 and T3, demanding 3, 2, and 4 logic blocks, respectively. The two FPGA units F1 and F2 each have nine resource blocks. Ga and Gb denote the global makespans of the two placement scenarios. Because of simultaneous reconfigurations in the multi-FPGA system, Gb < Ga when the tasks are assigned to different FPGA units. Second, for a multi-FPGA system it is important to balance the completion times over the different FPGA units. As shown in Fig. 2a, all three tasks are allocated to F1. Although the deadline constraint is met, the slack is very short, leaving little room for the system to reduce its speed for energy efficiency. In Fig. 2b, in contrast, one task is assigned to F1 and the rest to F2. The slack becomes much larger, which allows the system to tune down its speed and thus improve energy efficiency.

We next explain the basic procedure of Algorithm 2 (i.e., MFIT). The algorithm tries to place each newly arriving task at the bottom of the FPGA units, so that the completion time of each FPGA unit can be calculated with Cal(). The minimum completion time is then reserved. If several minimum completion times are equal, the variance of the completion times across the FPGA units is calculated for each candidate, and the placement with the lowest variance is adopted. If the variances are also equal, the FPGA unit with the fewest allocated tasks is selected. Next, the available logic blocks of all FPGAs are updated. When all tasks have been allocated, the global makespan is obtained for the ants to make a decision.
(i, j) → (i', j') represents that node (i', j') can be visited by ant x_k from node (i, j). P^{x_k}_{(i,j)→(i',j')} denotes the transition probability for ant x_k from node (i, j) to (i', j'); it is nonzero if and only if (i', j') ∈ H_{x_k}[I_r], the set of nodes that ant x_k can visit from node (i, j), and zero otherwise. u^{x_k}_{(i',j')} denotes the pheromone level of ant x_k from node (i, j) to (i', j'); g^{x_k}_{(i',j')} stands for a heuristic function, namely the reciprocal of the global makespan for ant x_k from node (i, j) to (i', j'). We develop MFIT, to be explained shortly, to compute this function. α and β determine the relative importance of the pheromone and the heuristic, respectively. Once all the ants have completed their tours, the global updating rule is used to increment the pheromone, defined as:
$$
u(n,m) \leftarrow (1-q)\,u(n,m) + Q \big/ l^{x_k}_{(si,sj)\rightarrow(ei,ej)}
\qquad (9)
$$
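A Python sketch of this update, under the common ACO reading of (9): every pheromone entry evaporates by (1 − q), and the nodes on the best tour of the round additionally receive a deposit of Q divided by the best tour's makespan (variable names are illustrative):

```python
def global_update(pheromone, best_tour, best_makespan, q, Q):
    """Global pheromone update in the spirit of Eq. (9)."""
    for node in pheromone:                 # evaporation on every entry
        pheromone[node] *= (1.0 - q)
    for node in best_tour:                 # deposit along the best tour
        pheromone[node] += Q / best_makespan
```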
where l^{x_k}_{(si,sj)→(ei,ej)} is the best solution built by ant x_k, starting from node (si, sj) and ending at (ei, ej), in the current round. Each u(n, m), ∀n ∈ [1, N], m ∈ [1, M], evaporates with the global pheromone evaporation rate unless the corresponding node is visited by an ant. With this strategy, the best solution can be found. q is the evaporation rate; a higher q accelerates the convergence of AEE. Q is a given coefficient parameter.

4.3. Heuristic function g: MFIT

Central to ACO is the heuristic function g. We design an algorithm called MFIT to compute the heuristic function, by which an ant can make a proper transition decision. Essentially, the heuristic function is a task placement algorithm for multi-FPGA systems.
Fig. 2. Illustration of task placement on a multi-FPGA system.
Please cite this article in press as: C. Jing et al., Energy-efficient scheduling on multi-FPGA reconfigurable systems, Microprocess. Microsyst. (2013), http:// dx.doi.org/10.1016/j.micpro.2013.05.001
C. Jing et al. / Microprocessors and Microsystems xxx (2013) xxx–xxx
Algorithm 1 (AEE)
Input:
  N, M: set of tasks, set of FPGA units; T: set of tasks; D: deadline constraint;
  S: set of N_S possible speeds for each FPGA unit; Q: set of A implementation types;
  B: set of L contiguous logic blocks
Output:
  Σ_{j=1}^{M} E_j: system energy consumption; schedule

1.  Initialize: S_j ← max(S), j = 1, …, M
2.  Generate a set of N ants
3.  Generate sHL to reserve T_i with type Q_a on F_j starting at b_ℓ (∀a ∈ [1, A], ∀ℓ ∈ [1, L])
4.  For Ir := 1 to CN_max
5.      Ants are positioned onto N nodes (i, j) (∀i ∈ [1, N], ∀j ∈ [1, M])
6.      For k := 1 to N
7.          Initialize U := ∅ and H_{x_k}[Ir]
8.          // U records the tasks that have been placed by x_k
9.          While H_{x_k}[Ir] ≠ ∅ do
10.             If T_i satisfies i ∈ H_{x_k}[Ir]
11.                 and the probabilistic state transition rule
12.                 U ← U ∪ {i}
13.                 H_{x_k}[Ir] ← H_{x_k}[Ir] \ {i}
14.                 sHL[Ir][k][i] ← [T_i, Q_a, F_j, b_ℓ]
15.                 Add eligible nodes to H_{x_k}[Ir]
16.                     with in-degree = 0
17.             End If
18.         End While
19.     End For (N)
20.     // Until all ants have built their complete solutions
21.     Apply the global updating rule with the best solution
22. End For (CN_max)
23. Global makespan is obtained and
24. schedule ← select the base schedule from sHL
25. // FPGA unit speed is decreased unless it violates D
26. For j := 1 to M
27.     For k := N_S to 1
28.         S_j^k ← select the next smaller level k in F_j
29.         c_j ← compute the new makespan
30.         If c_j > D
31.             Break
32.         End If
33.     End For
34. End For
35. Output:
36.     Σ_{j=1}^{M} E_j
37.     schedule: [T_i, Q_a, F_j, b_ℓ]
Algorithm 2 (MFIT)
Input:
  N, M: set of tasks, set of FPGA units; Q: set of A implementation types;
  B: set of L contiguous logic blocks on each FPGA unit
Output:
  Global makespan G

1.  Initialize: i := 1
2.  While (i ≠ N) do
3.      For j := 1 to M
4.          // Calculate the completion time c_j when T_i with type Q_a is allocated to F_j
5.          c_j ← Cal{ALLOC[(T_i, Q_a)], j} (∀a ∈ [1, A])
6.      End For
7.      // Reserve the minimum completion time c_j
8.      If several minimum completion times from min{c_1, …, c_M} are equal
9.          Reserve T_i on the F_j with the fewest allocated tasks
10.     Else
11.         Reserve T_i on the F_j with the minimum c_j
12.     End If
13.     Update [F_1(B), …, F_M(B)]
14.     i++
15. End While
16. Output: G ← max{c_1, …, c_M}
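As an illustration of Algorithm 2's selection rule, the sketch below reduces Cal() to a single running completion time per FPGA unit and uses the "fewest allocated tasks" tie-break; implementation types and logic-block bookkeeping are omitted, so this is a simplified model rather than the full algorithm.

```python
def mfit_sketch(task_times, num_units):
    """Greedy minimum-finish-time placement (simplified Algorithm 2).

    task_times: per-task execution time; returns (makespan, placement),
    where placement[i] is the FPGA unit chosen for task i."""
    finish = [0.0] * num_units   # completion time c_j of each unit
    count = [0] * num_units      # tasks allocated to each unit
    placement = []
    for t in task_times:
        cand = [finish[j] + t for j in range(num_units)]  # role of Cal()
        best = min(cand)
        # tie on minimum completion time -> fewest allocated tasks
        j = min((j for j in range(num_units) if cand[j] == best),
                key=lambda j: count[j])
        finish[j], count[j] = cand[j], count[j] + 1
        placement.append(j)
    return max(finish), placement
```

For three tasks with workloads 4, 3, and 2 on two units, the sketch spreads them over both units and reports a makespan of 5.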
4.4. Enhanced algorithm

As discussed above, AEE performs well for independent tasks, but it may not be suitable for tasks with precedence and interdependencies, because completing dependent tasks incurs a delay. For instance, suppose two tasks Ta and Tb are allocated to different FPGA units, and Tb is the successor of Ta. The successful completion of Tb depends on the data generated by Ta. Even assuming the link bandwidth is sufficient, the data from Ta must pass through the limited loading ports, which delays Tb. This delay does not exist when Ta and Tb are in the same FPGA unit: the generated data is kept in the FPGA unit and does not need to go through the reconfiguration phase. Under these assumptions, we propose in this subsection an enhanced AEE (eAEE) that takes task precedence and interdependencies into account. The general idea of eAEE is as follows (see Algorithm 3):

Algorithm 3 (eAEE)
Input:
  X(T, E): T is a set of N tasks, E represents the set of task interdependencies;
  D: deadline constraint; M: set of FPGA units;
  S: set of N_S possible speeds for each FPGA unit; Algorithm 1; Algorithm 4
Output:
  Schedule

1.  Initialization: i := 1, tSchedule ← ∅
2.  Level[i] ← reserve tasks with no predecessors at level one
3.  For tasks in Level[i]
4.      Check the task interdependencies E
5.      Suc[i] ← find the successors of Level[i]
6.      Pred[i] ← find the predecessors of Suc[i]
7.      If Pred[i] is contained in the previous levels
8.          i++
9.          Level[i] ← reserve the successor tasks
10.     End If
11. End For
12. Lv ← compute the total number of levels
13. For l := 1 to Lv
14.     tSchedule[l] ← record the schedule of tasks (Level[l], Pred[l]) using Algorithms 1 and 4 combined with tSchedule
15. End For
16. // FPGA unit speed is decreased unless it violates D
17. For k := N_S to 1
18.     S^k ← select the next smaller speed level k
19.     For j := 1 to M
20.         c_j ← compute the completion time on F_j with S^k and tSchedule
21.     End For
22.     G ← max{c_1, …, c_M}
23.     If G > D
24.         Break
25.     End If
26. End For
27. Schedule ← reserve the schedule with the minimum energy consumption under S^k and tSchedule
28. Output:
29.     Schedule

First, the tasks are separated by precedence level. Initially, tasks with no predecessor are reserved at level one, and Level(i) records the set of tasks at level i. Suc(i) keeps the successors of the tasks in level i, and Pred(i) records the predecessor tasks of Suc(i). If all predecessor tasks in Pred(i) are already in the Level task sets, the tasks in Suc(i) can be allocated to Level(i + 1). By repeating this process, the tasks in each level and the total number of levels (Lv) are obtained. Second, with the set of tasks in each level determined, we employ AEE to obtain the shortest overall makespan by running the FPGA units at maximum speed. However, due to the tasks' interdependencies, the heuristic function MFIT has to be modified to respect them (see Algorithm 4): when a task is placed onto an FPGA unit, the inter-unit delay mentioned above becomes crucial. On the basis of the modified MFIT, the global makespan is obtained, and meanwhile a base schedule is attained.
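The level-partitioning step described above can be sketched as follows; `preds` (a hypothetical name) maps every task to the set of its predecessor tasks:

```python
def partition_levels(preds):
    """Group tasks by precedence level: level one holds tasks with no
    predecessors; a task enters the next level once all of its
    predecessors are in earlier levels. Returns one task set per level."""
    remaining = set(preds)
    placed = set()
    levels = []
    while remaining:
        ready = {t for t in remaining if preds[t] <= placed}
        if not ready:
            raise ValueError("cyclic task interdependencies")
        levels.append(ready)
        placed |= ready
        remaining -= ready
    return levels
```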
Fig. 3. Energy consumption comparisons of different algorithms: (a) task number 12, L = 100; (b) task number 12, L = 200; (c) task number 15, L = 100; (d) task number 15, L = 200; (e) task number 18, L = 100; and (f) task number 18, L = 200.
Finally, the speed of each FPGA unit can be readjusted to fulfill the deadline constraint.
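This speed-readjustment step amounts to the loop below: step down through the speed levels and keep the slowest setting whose makespan still meets the deadline. The inverse speed/makespan scaling used here is an illustrative assumption, not the paper's exact model.

```python
def lowest_feasible_speed(speeds, base_makespan, deadline):
    """Return the lowest speed level whose scaled makespan meets the
    deadline; base_makespan is the makespan at the maximum speed."""
    smax = max(speeds)
    best = smax
    for s in sorted(speeds, reverse=True):
        if base_makespan * smax / s > deadline:   # deadline D violated
            break
        best = s
    return best
```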
Algorithm 4 (modified MFIT)
Input:
  Q: set of A implementation types; B: set of L contiguous logic blocks on each FPGA unit;
  M: set of FPGA units; Level(l): a set of N̂ tasks at level l to be allocated to the FPGA units;
  Pred(l): the set of task predecessors corresponding to the tasks in Level(l);
  E: set of task interdependencies; tSchedule: a schedule recording the set of placed tasks
Output:
  Global makespan G

1.  Initialize: i := 1
2.  While (i ≠ N̂) do
3.      For j := 1 to M
4.          // Calculate the completion time c_j when T_i from Level(l) is allocated to F_j,
5.          c_j ← completion time on F_j for T_i, using T_i's Pred, Q_a, tSchedule, and the interdependencies E
6.      End For
7.      // Reserve the minimum completion time c_j
8.      If several minimum completion times from min{c_1, …, c_M} are equal
9.          Reserve T_i on the F_j with the fewest allocated tasks
10.     Else
11.         Reserve T_i on the F_j with the minimum c_j
12.     End If
13.     Update [F_1(B), …, F_M(B)]
14.     i++
15. End While
16. Output: G ← max{c_1, …, c_M}
4.5. Algorithm analysis

Algorithm 2 contains two nested loops with time complexities O(N) and O(M), respectively, so its total time complexity is O(NM). In Algorithm 1, the time complexity for one ant is bounded by O(MN²). Since there are N ants in each round, the complexity of a round is O(MN³). Therefore, for the maximum number of rounds CN_max, the time complexity of AEE is O(MN³·CN_max). In Algorithm 3, the time complexity of each level is bounded by O(M·N̂³·CN_max); since the total number of levels in X(T, E) is Lv, the time complexity is O(Lv·M·N̂³·CN_max). We also analyze the complexity of Algorithm 4, whose input is N̂ tasks: its time complexity is O(M·N̂).

5. Performance evaluation

In this section we first describe the simulation setup and performance metrics, and then discuss the performance results of the proposed algorithm in comparison with alternative algorithms.

5.1. Simulation setup and metrics
Fig. 4. Comparison of global makespan with various task numbers and L: (a) L = 100 and (b) L = 200.
Fig. 5. Comparison of energy consumption of eAEE and iSA with two different sets of FPGA units: (a) 4 FPGA units and (b) 6 FPGA units.

We adopt a comparative study, comparing our algorithm with the alternative algorithms introduced in the following subsection. The performance metric is the overall energy consumption ratio, defined as follows:
$$
d = \frac{P_a t_a + \sum_{j=1}^{M} E'_j}{P_a t_a + \sum_{j=1}^{M} E_j}
\qquad (10)
$$
where Σ_{j=1}^{M} E_j is the total energy consumption when each FPGA unit runs at the maximum speed level, and Σ_{j=1}^{M} E'_j is the total energy consumption after the speed levels of the FPGA units have been adjusted by the energy-saving strategies. Comprehensive simulations have been conducted to evaluate the effectiveness of our algorithm. We simulate a multi-FPGA reconfigurable system consisting of four FPGA units, as is typical of TM-4 and BEE2. Each FPGA unit has the same number of logic blocks, and two configurations are considered: L = 100 and L = 200. The number of adjustable speed levels is at most 10, i.e., N_S ≤ 10. We vary the number of tasks in the simulations. For task configuration, we take parameters from the traces used in [4,22,26]. Each task's reconfiguration and execution cycle counts are uniformly distributed in [1, 8] and [10, 15] for L = 100 ([10, 15] and [13, 18] for L = 200), respectively. The number of logic blocks required by a task follows a uniform random distribution over [40, 80] for L = 100 ([80, 115] for L = 200). We use a DAG generator (TGFF) [16] that takes the shape of the DAG from Strassen's matrix multiplication or the Fast Fourier Transform (FFT) [18], and assign each task a random number of contiguous logic blocks in [70, 120], reconfiguration and execution cycles in [1, 19] and [20, 80], and a task interdependency delay between FPGA units in [1, 5]; all of these follow uniform random distributions. The experiments are run on a desktop computer with dual Intel(R) Xeon(R) CPUs @ 2.13 GHz and 8 GB of RAM. The algorithms are programmed in C#. The results presented below are averaged over 10 independent runs.

5.2. Compared algorithms

We compare our algorithm with three other algorithms.

5.2.1. Dynamic-programming based algorithm (DP)

We develop an algorithm based on dynamic programming for deriving the optimal schedule and the minimum energy consumption.
The main steps of the DP-based algorithm are as follows. First, the FPGA units are tuned to the maximum speed that fulfills the deadline constraint. Second, the DP strategy is applied. Initially, MES is broken into N subproblems (one per task), where each subproblem indicates the number of assigned tasks. The recursive formula is then defined as:
$$
H[i, U] = \min_{j}\{\mathrm{MFIT}[h(i, U), j]\}
\qquad (11)
$$
where i denotes a starting task from the task set T; U is the set of tasks already assigned, with {i} ∩ U = ∅, and at each step U is updated as U ← U ∪ {j}, where j is a task yet to be assigned; h(i, U) is the set of tasks that starts with T_i and continues with the assigned tasks U; MFIT is used to calculate the global makespan that begins with T_i, follows the task set U, and ends with T_j; H[i, U] records the schedule with the minimum global makespan that begins at T_i, follows the task set U, and ends at T_j. By iterating (11) down to the last subproblem, we eventually obtain the minimum global makespan and from it a base schedule. Third, the speed levels are readjusted to minimize the overall energy consumption without violating the deadline constraint. Although this algorithm finds the optimal result, it only scales to small problems: its complexity is extremely high, O(MN³·2^{N+1}). We use the DP algorithm as a baseline.
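As a simplified illustration of the dynamic-programming idea (not the authors' exact recursion), the following subset DP computes the optimal makespan for independent tasks on two FPGA units; it exhibits the exponential state space that makes DP usable only as a small-scale baseline.

```python
def optimal_makespan_two_units(task_times):
    """Optimal makespan on M = 2 units via DP over the achievable total
    loads of the first unit (a set of subset sums)."""
    total = sum(task_times)
    loads = {0}
    for t in task_times:
        loads |= {l + t for l in loads}   # either assign t to unit 1 or not
    # the other unit carries the remainder; minimize the larger side
    return min(max(l, total - l) for l in loads)
```

For three independent tasks with workloads 3, 2, and 4, this returns 5, corresponding to the split {3, 2} versus {4}.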
5.2.2. Fully-balanced algorithm (FB)

Since balancing the allocation of tasks across the FPGA units can substantially reduce the overall energy consumption, we design the fully-balanced (FB) algorithm. First, the speed of each FPGA unit is set to the maximum level, and the number of tasks per FPGA unit is allocated with balance in mind. Second, tasks are randomly selected and mapped to FPGA units by MFIT; the global makespan is then obtained and a base schedule is generated. Finally, the speed is readjusted to reduce the energy consumption without violating the deadline. The complexity of the algorithm is bounded by O(N²/M·CR), where CR is the maximum number of execution loops.
5.2.3. Simulated annealing algorithm (SA)

SA is a generic heuristic for global optimization, and we compare it with AEE. The main SA procedure is as follows. First, the speed of each FPGA unit is set to the maximum, and SA chooses a decreasing temperature schedule and randomly generates a solution. Second, for each temperature, SA iterates the following steps: (1) the update function randomly interchanges a pair of tasks on different FPGA units, producing a new solution; (2) the new solution is accepted if it is an improvement, or with a certain probability under the current temperature if it is not; (3) these steps are repeated until the system arrives at a good solution. Third, the temperature is decreased and the above steps are repeated. Fourth, when the temperature drops to zero, the SA solution is obtained. Finally, the speed of each FPGA unit is readjusted to meet the deadline constraint. Supposing the temperature schedule has P levels and the number of iterations per temperature is I, the complexity of SA is O(I·P). We also develop an improved SA (iSA) for tasks with precedence and interdependencies. Its first step is the same as in eAEE: we obtain the total number of levels and the set of tasks in each level. Then, with the modified MFIT (Algorithm 4), iSA exploits the same strategy as SA to find the overall makespan from the first level to the last. Meanwhile, the FPGA units run at maximum speed and a base schedule is generated. Eventually, the speed of each FPGA unit is tuned to minimize the energy consumption without missing the deadline. The time complexity of iSA is bounded by O(Lv·I·P).
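A minimal sketch of this SA loop for task-to-unit assignment, assuming the swap-based update described above; the cost (makespan) evaluation, temperature schedule, and parameters are all illustrative rather than the paper's exact settings:

```python
import math
import random

def sa_sketch(assign, cost, t0=10.0, cooling=0.9, iters=40, t_min=1e-3):
    """Simulated annealing over assignments: swap the units of a random
    task pair; accept improvements, and worse moves with prob. exp(-d/T)."""
    cur, cur_cost = list(assign), cost(assign)
    best, best_cost = list(cur), cur_cost
    T = t0
    while T > t_min:
        for _ in range(iters):
            a, b = random.sample(range(len(cur)), 2)
            cand = list(cur)
            cand[a], cand[b] = cand[b], cand[a]   # interchange a task pair
            d = cost(cand) - cur_cost
            if d < 0 or random.random() < math.exp(-d / T):
                cur, cur_cost = cand, cur_cost + d
                if cur_cost < best_cost:
                    best, best_cost = list(cur), cur_cost
        T *= cooling
    return best, best_cost
```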
Fig. 6. Comparison of global makespan with 100 tasks and L = 200.
Table 1. Time and memory complexity comparison.

Algorithm   Task number   L     Time        Memory
AEE         12            200   25.77 s     26.14 MB
AEE         15            200   98.65 s     27.44 MB
AEE         18            200   8.13 min    28.90 MB
DP          12            200   7.94 min    66.78 MB
DP          15            200   2.62 h      733.65 MB
DP          18            200   >12 h       5.63 GB
eAEE        100           200   12.81 min   26.02 MB
iSA         100           200   8.31 min    25.39 MB
5.3. Comparative results

We first compare the algorithms in terms of energy consumption ratio, shown in Fig. 3. In this set of simulations there are 12, 15, and 18 tasks in total. AEE consumes slightly more energy than DP does: at most 10.65% more, for both L = 100 and L = 200. However, the proposed algorithm uses much less energy than the FB algorithm. FB performs the worst because it concentrates only on load-balancing the tasks. SA is also worse than AEE, because the pheromone update rule (9) leads AEE toward the best solution in a global manner; lacking such a mechanism, SA cannot outperform AEE. Since the global makespan has a great impact on the overall energy consumption, we further investigate the global makespan of the algorithms in Fig. 4, using three task sets with different numbers of tasks. AEE produces a global makespan close to that of the DP algorithm; in Fig. 4b, with 12 tasks, AEE performs almost the same as DP. Together, Figs. 3 and 4 demonstrate that our proposed algorithm performs well in reducing energy consumption while meeting the deadline constraint. The algorithm above is designed for independent task sets. We also show the performance of eAEE for a set of tasks with precedence and interdependencies, with 100 tasks in total. Fig. 5 illustrates that eAEE outperforms iSA in energy saving, consuming up to 58.17% less energy than iSA with the set of four FPGA units. In Fig. 6, the global makespan obtained by eAEE is also shorter than that of iSA for both sets of FPGA units.

5.4. Complexity comparison

We compare AEE and DP in terms of time and memory complexity; Table 1 shows both. When L = 200 and the number of tasks equals 18, the DP algorithm consumes 5.63 GB of memory and takes more than 12 h to derive the optimal solution. In comparison, our algorithm uses far less memory (only 28.9 MB) and much less time (only 8.13 min). We also report eAEE and iSA in terms of time and memory. Even though iSA costs less time than eAEE, the result obtained by iSA is much worse. Note that eAEE costs less time than AEE because of the parameter q: for eAEE, q is set to a relatively higher value, which yields a good solution within a shorter time. The evaporation rate q is an important parameter of AEE, and we investigate its impact on performance with a set of 50 tasks. Table 2 shows that CN_max (the number of iterations of the main loop in AEE, i.e., the termination criterion) is no more than 70 for evaporation rates in the range [0.1, 0.9]. The minimum global makespan is obtained at q = 0.2. It is
Table 2. Impact of evaporation rate (task number is 50 and L = 200).

Evaporation rate   CN_max   Global makespan
0.1                65       490
0.2                30       488
0.3                19       489
0.4                13       496
0.5                 9       502
0.6                 7       502
0.7                 5       492
0.8                 4       500
0.9                 3       492
evident that q is related to the convergence speed of AEE. Note also that a relatively high q can still attain the minimum global makespan, and hence greatly reduce the algorithm's time complexity.
5.5. Impact of implementation types

We next investigate the impact of task implementation types on energy consumption, considering two task scheduling schemes. A straightforward scheme (STP) combines tasks of the same implementation type and then allocates the combined task as one virtual task. The second scheme (ATP) does not combine tasks of the same type and instead lets AEE perform the allocation and scheduling. We use two sets of tasks with the same total number of tasks but different numbers of implementation types. Task reconfiguration and execution cycles are randomly distributed in [1, 20] and [25, 45]. We test the two task sets on reconfigurable systems with 4, 6, and 8 FPGA units, respectively, and a deadline is given for each task set. As shown in Fig. 7, the proposed algorithm can effectively minimize reconfiguration overhead. In addition, STP is not better than ATP, which shows that AEE can adaptively schedule various tasks.

Fig. 7. Impact of implementation types (total task number is 50 and L = 100): (a) number of implementation types 10 and (b) number of implementation types 15.

6. Conclusion

Multi-FPGA reconfigurable systems have increasingly been adopted for high performance computing. This paper considers the energy-efficient scheduling problem with a hard deadline constraint. Multi-FPGA reconfigurable systems differ dramatically from traditional computing systems with multiple general-purpose processors: a task must be mapped to a set of contiguous logic blocks that must be configured for its execution, so reconfiguration overhead has to be considered. In addition, an FPGA unit has a limited number of loading ports, which restricts the number of tasks that can be simultaneously reconfigured. We have presented an energy-efficient scheduling algorithm called AEE. Based on ACO, this algorithm has the advantage of low complexity and can achieve high energy efficiency. We have also presented an enhanced algorithm, eAEE, that considers tasks with precedence and interdependencies. Both have been validated through comprehensive trace-driven simulations.

Acknowledgements

This research is supported by NSFC (No. 61170238, 60903190, 61027009, 60933011, 61202375, 61170237), Shanghai Pu Jiang Talents Program (10PJ1405800), Shanghai Chen Guang Program (10CG11), MIIT of China (2009ZX03006-001-01), Doctoral Fund of Ministry of Education of China (20100073120021), National 863 Program (2009AA012201 and 2011AA010500), HP IRP (CW267311), SJTU SMC Project (201120), STCSM (08dz1501600, 12ZR1414900), Singapore NRF (CREATE E2S2), and the Program for Changjiang Scholars and Innovative Research Team in Universities of China (IRT1158, PCSIRT).
[16] J. Heiner, B. Sellers, M. Wirthlin, J. Kalb, FPGA partial reconfiguration via configuration scrubbing, in: Proc. International Conference on Field Programmable Logic and Applications (FPL), 2009, pp. 99–104.
[17] T.-Y. Huang, Y.-C. Tsai, E.T.-H. Chu, A near-optimal solution for the heterogeneous multi-processor single-level voltage setup problem, in: Proc. IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2007.
[18] S. Hunold, J. Lepping, Evolutionary scheduling of parallel task graphs onto homogeneous clusters, in: Proc. IEEE International Conference on Cluster Computing (Cluster), 2011.
[19] W.Y. Lee, Energy-efficient scheduling of periodic real-time tasks on lightly loaded multicore processors, IEEE Transactions on Parallel and Distributed Systems (TPDS) 23 (3) (2012) 530–537.
[20] C.-F. Li, P.-H. Yuh, C.-L. Yang, Y.-W. Chang, Post-placement leakage optimization for partially dynamically reconfigurable FPGAs, in: Proc. ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED), 2007.
[21] F. Li, Y. Lin, L. He, J. Cong, Low-power FPGA using pre-defined dual-Vdd/dual-Vt fabrics, in: Proc. ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGAs), 2004, pp. 42–50.
[22] Y. Lu, T. Marconi, K. Bertels, G. Gaydadjiev, Online task scheduling for the FPGA-based partially reconfigurable systems, in: Proc. International Workshop on Reconfigurable Computing: Architectures, Tools and Applications, 2009.
[23] N.-C. Perng, J.-J. Chen, C.-Y. Yang, T.-W. Kuo, Energy-efficient scheduling on multi-context FPGAs, in: Proc. IEEE International Symposium on Circuits and Systems (ISCAS), 2006.
[24] F. Redaelli, M.D. Santambrogio, D. Sciuto, S.O. Memik, Reconfiguration aware task scheduling for multi-FPGA systems, in: Proc. Reconfigurable Computing Workshop on HiPEAC, 2010.
[25] E.S.H. Hou, N. Ansari, H. Ren, A genetic algorithm for multiprocessor scheduling, IEEE Transactions on Parallel and Distributed Systems (TPDS) 5 (2) (1994) 113–120.
[26] H. Walder, C. Steiger, M. Platzner, Fast online task placement on FPGAs: free space partitioning and 2D-hashing, in: Proc. IEEE Reconfigurable Architecture Workshop (RAW) on IPDPS, 2003.
[27] Y. Zhang, X.S. Hu, D.Z. Chen, Task scheduling and voltage selection for energy minimization, in: Proc. ACM Design Automation Conference, 2002.
Chao Jing is a PhD candidate in the Department of Computer Science and Engineering at Shanghai Jiao Tong University. His research interests include high performance computing (HPC) and resource management on reconfigurable systems.
References

[1] 40-nm FPGA Power Management and Advantages – Altera, Technical Report, 2008.
[2] HPCA (High Performance Computing Alliance).
Yanmin Zhu obtained his PhD in computer science from the Hong Kong University of Science and Technology in 2007, and his BEng in computer science from Xi'an Jiao Tong University in 2002. He is an Associate Professor in the Department of Computer Science and Engineering, Shanghai Jiao Tong University, China. His research interests include ad-hoc sensor networks, mobile computing, grid computing, and resource management in distributed systems.
Minglu Li: received his PhD in Computer Software from Shanghai Jiao Tong University in 1996. He is a full Professor at the Department of Computer Science and Engineering of Shanghai Jiao Tong University. Now, he is the vice dean of the School of Electronic Information and Electrical Engineering. Currently, his research interests include Grid Computing, Services Computing and Cloud Computing. He has published over 100 papers in important academic journals and international conferences.