Energy-efficient scheduling on multi-FPGA reconfigurable systems


Microprocessors and Microsystems xxx (2013) xxx–xxx

Contents lists available at SciVerse ScienceDirect

Microprocessors and Microsystems journal homepage: www.elsevier.com/locate/micpro

Chao Jing a,⇑, Yanmin Zhu a,b, Minglu Li a,b

a Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
b Shanghai Key Laboratory of Scalable Computing and Systems, Shanghai Jiao Tong University, Shanghai, China

Article info

Article history:
Available online xxxx

Keywords:
Reconfigurable systems
Multi-FPGA systems
Energy optimization
Ant colony optimization
Scheduling

Abstract

With the growing demand for high performance computing, reconfigurable computing systems built with Field Programmable Gate Arrays (FPGAs) have become increasingly popular for their reconfigurability and adaptability to applications. Although such systems promise high processing performance, their energy efficiency has become a critical issue. This paper studies the crucial problem of energy-efficient scheduling for reconfigurable systems with multiple FPGAs. Several factors make energy-efficient scheduling in multi-FPGA reconfigurable systems particularly challenging, including the spatial allocation constraint, reconfiguration overhead, limited reconfiguration ports, and deadline satisfaction, and none of the existing solutions can be directly applied. This paper takes on this challenge and proposes an energy-efficient scheduling algorithm called AEE, based on ant colony optimization, for multi-FPGA reconfigurable systems. A task placement scheme is devised to serve as the heuristic function of the ant colony algorithm; it derives the minimum global makespan, takes reconfiguration overhead into account, and places tasks so as to reduce the overall overhead. Then, based on AEE, an enhanced algorithm (eAEE) is devised to deal with tasks that have precedence constraints and interdependencies. To evaluate the effectiveness of the two proposed algorithms, comprehensive trace-driven simulations have been conducted and the results compared with other state-of-the-art algorithms. Experimental results demonstrate that AEE can successfully complete tasks without violating deadline constraints while largely reducing energy dissipation, which is no more than 10.65% higher than the optimum when the problem scale is relatively small. Also, eAEE consumes 58.17% less energy than an improved simulated annealing algorithm (iSA) at a large problem scale.

© 2013 Elsevier B.V. All rights reserved.

⇑ Corresponding author. Address: Department of Computer Science and Engineering, Shanghai Jiao Tong University, 3-126 Room, 3 SEIEE Buildings, 800 Dongchuan Rd., China, 200240. Tel.: +86 21 34205421; fax: +86 21 34205120 0. E-mail addresses: [email protected] (C. Jing), [email protected] (Y. Zhu), [email protected] (M. Li).

0141-9331/$ - see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.micpro.2013.05.001

Please cite this article in press as: C. Jing et al., Energy-efficient scheduling on multi-FPGA reconfigurable systems, Microprocess. Microsyst. (2013), http://dx.doi.org/10.1016/j.micpro.2013.05.001

1. Introduction

Recent years have witnessed the rapid development of Field Programmable Gate Array (FPGA) technology, which has been widely used in various application domains. Compared with an Application Specific Integrated Circuit (ASIC), an FPGA has advantages in terms of performance, cost, flexibility and reconfigurability. Today, some off-the-shelf commercial FPGA systems support partial runtime reconfiguration (PRTR). This allows part of the logic blocks to be reconfigured at runtime without affecting the rest of the logic blocks.

FPGAs have been used to construct high performance systems that are dynamically reconfigurable. Compared with systems built with multiple general processors, such systems are advantageous: the processing speed is higher since a task is processed with a hardware implementation. Examples of such systems include BEE2 [6] and TM-4 [11]. This paper considers multi-FPGA reconfigurable systems (as shown in Fig. 1), which have been adopted in a number of application domains [2,6,11].

Fig. 1. Illustration of a multi-FPGA reconfigurable system.

Although multi-FPGA systems are capable of providing high performance computing, their energy dissipation is a serious issue, and it is highly important to achieve energy efficiency. Firstly, less heat would be generated and the effort spent on dissipating heat could be greatly reduced. Secondly, the less energy the system dissipates, the smaller the energy bill. Lastly, less heat results in higher reliability of the whole system [1], since increasing temperature leaves FPGA fabrics more vulnerable. Fortunately, many contemporary reconfigurable platforms incorporate tuning mechanisms for energy-efficient operation. For instance, the Xilinx Virtex-5 is designed based on CMOS circuitry [3] that allows adjustable power consumption rates.

It has become evident, however, that there is an intrinsic tradeoff between processing performance and energy consumption for multi-FPGA systems. To achieve a lower power consumption rate, the system can decrease the operating frequency, which leads to a longer processing time. In practice, it is highly desirable to complete tasks within given deadlines while minimizing the overall power consumption of the whole system.

This paper studies the crucial problem of energy-efficient scheduling on multi-FPGA systems without missing the deadline constraint. To achieve energy efficiency, several great challenges have to be addressed. First, a task may use a number of logic blocks, and it is required that these blocks be located together. We define this as integral allocation of contiguous logic blocks on an FPGA to a computing task. Second, the scheduling system needs to take into account reconfiguration overhead. It takes a non-negligible amount of time to reconfigure the logic blocks in order to execute a new type of task. Moreover, an FPGA has a limited number of loading ports. This means that, at any one time, only a limited number of tasks can be accepted for which logic blocks are being reconfigured. We call this the simultaneous reconfiguration constraint. It should be noted that in a multi-FPGA system, multiple independent tasks are likely to be reconfigured simultaneously. Finally, the scheduling algorithm needs to minimize the overall energy consumption while meeting the deadline constraints of application tasks.

There have been a few studies on energy conservation in reconfigurable systems. Many of them are dedicated to hardware design at the circuit level [7,21] and at the system level [20,23]. In [21], the authors develop a dual VDD/Vt voltage scheme on a LUT-based FPGA to tune the overall energy consumption. DVS (Dynamic Voltage Scaling) [7] is applied to adjust the voltage level in FPGAs embedded with a logic delay measurement circuit. The latter methods apply scheduling techniques to reduce energy consumption in dynamically reconfigurable systems. In [20], a post-placement leakage-aware scheduler is invoked to reduce the waste of leakage power, which can be measured from the delay between task reconfiguration and execution. In [23], an optimal algorithm is proposed for the case in which a task partition over contexts is given, and approximation algorithms are developed for the case in which no task partition is given. The methods in [4,22] either neglect the reconfiguration overhead or focus on single-FPGA systems. When there are multiple FPGA units, tasks should be carefully allocated to the FPGA units, maximizing simultaneous reconfiguration and minimizing reconfiguration overhead. Most existing work on multi-FPGA systems [2,11,24] has largely concentrated on computing performance only. In summary, few existing methods can be applied to energy-efficient scheduling on multi-FPGA systems.

Energy-efficient scheduling on multi-processor or multi-core systems [5,14,17,19,25,27] has been extensively studied. However, energy-efficient scheduling on multi-FPGA systems is different and more difficult. First, in a reconfigurable system, allocating tasks to FPGAs must conform to the spatial allocation constraint: a subset of contiguous logic blocks has to be allocated to one task. There is no such constraint for systems of general purpose processors. Second, a task to be executed on a reconfigurable system must first be mapped to a given set of logic blocks before it is run. Finally, scheduling on dynamically reconfigurable systems should explicitly take reconfiguration overhead into account [13]. Thus, existing scheduling algorithms for multi-processor systems can hardly be applied directly to multi-FPGA systems.

In this paper, we show that the optimal scheduling problem for energy minimization with deadline constraints is NP-complete, even for independent tasks. We propose an energy-efficient scheduling algorithm called AEE, based on Ant Colony Optimization (ACO). We also propose an enhanced AEE algorithm (eAEE) to deal with tasks that have precedence constraints and interdependencies. These algorithms achieve high energy efficiency while meeting the deadline requirement of computing tasks. A distinctive feature of the algorithms is their low complexity. Comprehensive trace-driven simulation experiments have been conducted to evaluate the algorithms, and the results show that they can process all tasks without violating the deadline constraint while considerably reducing energy consumption.

The main technical contributions of the paper are listed as follows:

- We investigate the unique characteristics of multi-FPGA reconfigurable systems, and accordingly formulate the energy-efficient scheduling problem as an optimization problem with deadline constraints.
- We analyze the complexity of the optimal scheduling problem and rigorously prove that it is NP-complete.
- We propose AEE and eAEE, two energy-efficient scheduling algorithms based on Ant Colony Optimization, for independent and dependent tasks, respectively. In spite of their low complexity, these algorithms dramatically reduce the overall energy consumption while meeting the deadline constraint.
- Comprehensive trace-driven simulations have been conducted, and the results show that the AEE algorithm dissipates no more than 10.65% more power than the optimal solution, and that eAEE consumes 58.17% less energy than iSA.

The remainder of the paper is organized as follows. We introduce related work in Section 2. In Section 3, we present the system model and formally define the scheduling problem. In Section 4, we give the details of the two proposed scheduling algorithms. The performance evaluation methodology and evaluation results are presented in Section 5. We conclude the paper in Section 6.

2. Related work

There is much related work studying the tradeoff between performance and power on dynamically reconfigurable systems. Previous work can be divided into two classes: one class makes changes to the architecture or circuitry, and the other does not.

We first look at the first class. Li et al. [21] develop a novel architecture at the circuit level. Through the programmability of SRAM on a LUT-based FPGA, they develop pre-defined dual VDD/Vt voltages at the circuit level. A power-aware algorithm is then proposed for reducing power consumption. Chow et al. [7] present a method for commercial FPGAs that supports dynamic voltage scaling (DVS), by which power consumption can be tuned. To reduce logic delay, they also implement a measurement circuit for determining the speed.

We then discuss the second class, which typically applies scheduling techniques to achieve energy efficiency for dynamically reconfigurable systems. Recent studies have shown that scheduling can substantially further reduce system-level power consumption. In [20], a post-placement algorithm is proposed with the objective of eliminating leakage waste. The minimum-length schedule is first derived, and then a leakage-aware scheduler is invoked to decrease leakage waste while the schedule length remains fixed. However, the gain in energy saving relies on the existence of data dependency among the tasks. For tasks with little data dependency, the gain in energy saving would become marginal. In [23], optimal algorithms are developed for the case when a task partition over contexts is given, and approximation algorithms are developed for the case when no task partition is given. However, these algorithms disregard reconfiguration overheads.

Some studies have explored scheduling on single-FPGA systems and consider the reconfiguration overhead introduced by the reconfiguration ports constraint [4,22]. In [4], a simple system model is adopted, in which the system consists of a number of reconfigurable components and each task can be allocated to such a component. In addition, it is assumed that the reconfiguration overhead is identical for all tasks. In [22], the authors develop an online scheduling algorithm for single-FPGA systems. Neither of the two algorithms considers the energy efficiency issue, and both target single-FPGA systems.

Some other studies explore multi-FPGA systems, but focus on high computing performance without considering the power consumption issue [2,11,24]. In [24], the application need is given and an algorithm is then designed to minimize the use of FPGA units. FHPCA (FPGA High Performance Computing Alliance) [2] has designed a supercomputer with FPGA technology, aiming at application domains such as oil and gas prospecting, medical imagery and financial services. BEE2, a computing system built with multiple FPGA units [6], can change its clock rate and thus its power consumption.

Considerable work has been done on energy-efficient scheduling on multi-core or multi-processor systems [5,14,17,19,25,27]. In [19], based on two energy-saving techniques, the author proposes a scheduling scheme that provides a near-minimum-energy feasible schedule on lightly loaded multi-core processors. In [25], a genetic algorithm (GA) is adopted to solve the multi-processor scheduling problem. However, GA incurs a higher computational cost. To pass "good" genes to the next generation, GA applies the crossover operation, and the legality of the new genes depends on the choice of crossover operation. Illegal genes must be repaired, which requires additional computation, and after the repair the "good" genes may no longer be passed down. Besides, scheduling on multi-FPGA systems has substantial differences. Before a task can be executed, it must first be mapped to a contiguous set of FPGA logic blocks. Task allocation must conform to the spatial constraint. Scheduling on reconfigurable systems must also minimize reconfiguration overhead.

In summary, significant attention has been drawn to FPGA-based reconfigurable systems, but little work has been done on energy-efficient scheduling on multi-FPGA systems. Although there are plenty of scheduling algorithms for computing systems with general-purpose processors, these existing algorithms are not applicable to multi-FPGA systems because of their unique characteristics.


3. System model and problem description

In this section, the system model is described and the problem is formally defined.

3.1. System model

The computing system under consideration has a set of M homogeneous FPGAs that are partially reconfigurable at runtime, denoted by $F = \{F_1, \ldots, F_M\}$. The basic architecture of multi-FPGA systems is illustrated in Fig. 1. The FPGA units are interconnected via high speed links, and we assume that the link bandwidth is high enough that the link communication delay can be ignored. The central control unit implements the system management logic, e.g., task scheduling.

Each FPGA unit has a set of reconfigurable logic blocks organized in one dimension (a similar model has been adopted in [8]). Although we assume that the layout of logic blocks is one-dimensional, the algorithm developed in this paper can be extended to a two-dimensional setting with minor changes. There are L logic blocks in each FPGA unit, and the set of logic blocks on $F_j$ is denoted by $B_j = \{b_1^j, \ldots, b_L^j\}$. At any given time, a subset of contiguous logic blocks can be allocated to a task and reconfigured for the execution of this task. This is practical; for example, some recent commercial FPGAs support one-dimensional reconfigurability [15].

The number of loading ports in an FPGA unit is usually fixed and limited. This means that the number of tasks that can simultaneously be loaded for configuration is also limited. The loading ports are in charge of accessing the memory system, which introduces a delay that is accounted for in the reconfiguration phase. This poses an important constraint on task scheduling.

An application can be represented by a DAG (Directed Acyclic Graph) $X(T, E)$, where $T = \{T_i \mid i = 1, \ldots, N\}$ is the set of tasks and $E = \{e_{(i,j)} \mid (i, j) \in \{1, \ldots, N\} \times \{1, \ldots, N\}\}$ is the set of edges representing task interdependencies. Each task is non-preemptive. The processing workload of each task $T_i$ is defined by its cycle counts $W_i$, and the number of cycles is denoted by $C_i$. On an FPGA, a task undergoes two phases: reconfiguration and execution. The cycle count of a task $T_i$ is accordingly divided into reconfiguration and execution cycle counts, denoted by $W_i = [W_i^r, W_i^e], \forall i \in [1, N]$, respectively.

Suppose there is a fixed number of implementation types for the tasks, denoted by $Q = \{Q_1, \ldots, Q_A\}$, where $1 \le A \le N$. Each task $T_i$ is associated with an implementation type $Q_a \in Q$. Tasks of the same implementation type can share the same implementation logic (e.g., Verilog code); they may differ in their input data and thus produce different outputs. It should be noted that when a task executes on a set of logic blocks that has already been reconfigured for the same type of task, there is no reconfiguration overhead [13]. Tasks with the same implementation type share the same requirements on logic blocks and reconfiguration cycle counts, so they can be grouped together before or during task placement. For example, suppose that two tasks $T_q$ and $T_p$ are of the same implementation type. As long as task $T_p$ follows $T_q$ and is executed on the same set of logic blocks, the reconfiguration phase of $T_p$ can be safely omitted.

3.2. Energy model

FPGAs are implemented with CMOS technology. For a CMOS circuit, the dominant power consumption rate is defined as:

$$P(S) = K_{ef} \cdot V_{dd}^{2} \cdot S \qquad (1)$$


where $K_{ef}$, $V_{dd}$, $S$ and $P(S)$ denote the effective load capacitance, the supply voltage, the speed, and the power consumption at that speed, respectively [14]. A certain circuit delay $t_D$ is associated with an FPGA unit, and it is related to $V_{dd}$:

$$1/t_D = k_h \cdot (V_{dd} - V_T)^{m} / V_{dd} \qquad (2)$$

where the parameters $V_T$, $k_h$ and $m$ stand for the threshold voltage, the proportionality constant and the velocity saturation index, respectively. According to existing studies (e.g., [14,17]), $m$ and $V_T$ are usually set to 2 and 0, respectively. According to [23], $t_D$ equals $1/S$, where $S$ is the speed of the system:

$$S = k_h \cdot (V_{dd} - V_T)^{m} / V_{dd} \qquad (3)$$

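Combining (1) and (3) with $m = 2$ and $V_T = 0$ gives $S = k_h \cdot V_{dd}$, i.e., $V_{dd} = S / k_h$. Substituting this into (1), we can derive the power consumption rate as follows: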

$$P(S) = e \cdot S^{3} \qquad (4)$$

where $e$ is $K_{ef} / k_h^{2}$. Thus, the speed $S$ is the dominating factor of power consumption. Every logic block in an FPGA unit shares the same speed, which is adjustable. The speed level of each FPGA unit can be tuned independently. We assume that there are $N_S$ discrete speed levels in each FPGA unit, denoted by $S = \{S_1, \ldots, S_{N_S}\}$. Also, since the set of tasks is given, we assume that the system can accommodate the maximum power drawn in the worst case, in which all FPGA units are tuned to the maximum speed level.

Next, we derive the energy consumption of an FPGA unit when a set of tasks is run on it. Suppose that $n_j$ tasks run on $F_j$ at speed $S_j^k$; then the energy consumption on $F_j$ can be computed as:

$$E_j = e_j \cdot (S_j^k)^{3} \cdot \sum_{i=1}^{n_j} C_i / S_j^k = e_j \cdot (S_j^k)^{2} \cdot \sum_{i=1}^{n_j} C_i \qquad (5)$$
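To make the energy model concrete, Eqs. (4) and (5) can be evaluated numerically. The following is a minimal sketch; the function names `power` and `unit_energy` and the cycle counts are illustrative assumptions and do not come from the paper.

```python
from typing import Sequence


def power(e: float, speed: float) -> float:
    """Power consumption rate P(S) = e * S^3, as in Eq. (4)."""
    return e * speed ** 3


def unit_energy(e_j: float, speed: float, cycles: Sequence[float]) -> float:
    """Energy of one FPGA unit as in Eq. (5): power multiplied by run time."""
    run_time = sum(cycles) / speed  # time needed to process all cycles at this speed
    return power(e_j, speed) * run_time


# Hypothetical cycle counts C_i of three tasks placed on the same unit.
cycles = [4.0e6, 2.0e6, 6.0e6]
energy_fast = unit_energy(1.0, 2.0, cycles)
energy_slow = unit_energy(1.0, 1.0, cycles)
print(energy_fast / energy_slow)  # energy scales with the square of the speed
```

Because the energy grows with the square of the speed while the run time only shrinks linearly, running a unit as slowly as the deadline permits is the natural way to save energy in this model.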

Note that the completion time of the $n_j$ tasks on $F_j$ is denoted by $c_j$.

3.3. Problem description

This paper considers offline scheduling of an application. At the beginning, all tasks are released. There is a given deadline D, by which all the tasks must be completed. The energy-efficient scheduling problem under consideration in this paper is then defined as follows.

Definition 1 (Minimum-energy scheduling with deadline constraint, MES). Given an application denoted by the DAG $X(T, E)$, the set of M FPGA units, and the deadline requirement D, the minimum-energy scheduling problem is to find the optimal schedule that maps each task to a set of logic blocks in a certain FPGA unit and to the specific time instant at which the task can be loaded for execution, under the rigid deadline constraint, with the objective of minimizing the overall energy consumption, as follows:

$$\text{objective: } \min\Big\{ P_a \cdot t_a + \sum_{i=1}^{N} \sum_{j=1}^{M} \big(e_j \cdot C_i \cdot (S_j^k)^{2}\big) \cdot x(i, j, k) \Big\}$$
$$\text{s.t. } \max\{c_1, \ldots, c_M\} \le D, \quad k \in [1, \ldots, N_S] \qquad (6)$$

where $t_a$ denotes the total delay caused by task interdependencies between different FPGA units, and $P_a$ is the power consumption due to this delay. We define $x(i, j, k) = 1$ if task $T_i$ is assigned to $F_j$ at speed level k, and $x(i, j, k) = 0$ otherwise.

3.4. Complexity analysis

We analyze the complexity of the minimum-energy scheduling problem with deadline constraint described previously. (For this analysis, the MES problem is simplified by assuming that the tasks are independent.) The following theorem states this complexity.

Theorem 1. The MES problem is an NP-complete problem.

Proof. We prove it in two phases. The first phase shows that the MES problem is in NP, and the second phase shows that a known NP-complete problem is reducible to the MES problem.

Firstly, we prove that MES belongs to NP. Given an energy bound and a guessed candidate solution with energy $E_g$, we can check in polynomial time whether the given energy bound is greater than or equal to $E_g$.

Secondly, we provide a polynomial-time reduction from a known NP-complete problem, task scheduling on homogeneous multi-processors (TSM) [12], to MES. Consider an instance of TSM, that is, a set of $N'$ independent and non-preemptive tasks and a set of $M'$ homogeneous processors, each of which has $N_p$ speed levels that can be independently adjusted. $T_x(i, j, k)$ denotes the execution time of task i on processor j at speed level k, where $i \in \{1, \ldots, N'\}$, $j \in \{1, \ldots, M'\}$ and $k \in \{1, \ldots, N_p\}$. The problem is to find a schedule whose overall energy consumption is no more than a given bound $x$ ($x \ge 0$), formulated as:

$$N[N', M', N_p, T_x(i, j, k)] \le x \qquad (7)$$

Ignoring the adjustment of speed levels, we can construct an instance of MES:

- Each FPGA unit has only one available logic block (i.e., $L = 1$).
- Every task requires one logic block, and the tasks require various running times in MES.

It is clear that TSM can be reduced to MES in polynomial time. This concludes the proof. □

Having shown that the MES problem is NP-complete, we know that it is computationally infeasible to derive the optimal schedule when the number of tasks and the number of FPGA units are large.
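For very small instances, the objective in (6) can still be minimized by exhaustive search, which also makes the exponential growth of the search space concrete. The sketch below is only an illustration under strong simplifying assumptions that are not part of the paper's model: tasks are independent, spatial constraints and reconfiguration overhead are ignored, the $P_a \cdot t_a$ term is dropped, and the completion time of a unit is taken as its total cycle count divided by its speed. The function name `brute_force_mes` is hypothetical.

```python
from itertools import product


def brute_force_mes(cycles, num_units, speeds, deadline, e=1.0):
    """Enumerate every task-to-unit assignment and every per-unit speed level.

    Returns (energy, assignment, unit_speeds) for the cheapest schedule that
    meets the deadline, or None when no schedule is feasible.
    """
    best = None
    # M^N possible assignments of N tasks to M units.
    for assignment in product(range(num_units), repeat=len(cycles)):
        load = [0.0] * num_units
        for c_i, j in zip(cycles, assignment):
            load[j] += c_i
        # N_S^M possible speed settings, one level per unit.
        for unit_speeds in product(speeds, repeat=num_units):
            completion = [load[j] / unit_speeds[j] for j in range(num_units)]
            if max(completion) > deadline:
                continue  # deadline constraint of (6) violated
            energy = sum(e * load[j] * unit_speeds[j] ** 2 for j in range(num_units))
            if best is None or energy < best[0]:
                best = (energy, assignment, unit_speeds)
    return best


# Hypothetical instance: three tasks, two units, two speed levels, deadline 6.
result = brute_force_mes([4.0, 2.0, 6.0], 2, (1.0, 2.0), 6.0)
print(result)
```

The two nested enumerations visit $M^N \cdot N_S^M$ candidates, so this approach is only usable as a reference for tiny problem sizes.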

4. Algorithm design

In this section, we first present an overview of the algorithm design for independent tasks and delve into the design details. Then, we propose an enhanced algorithm for tasks with interdependencies and detail the eAEE algorithm.

4.1. Overview

The previous analysis has revealed that the MES problem is NP-complete. Given this fact, we turn to seeking an efficient scheduling algorithm that can generate a schedule meeting the deadline constraint while reducing the energy consumption of the whole reconfigurable system. Essentially, the algorithm should make the best use of all resources of the available FPGA units, cope with the simultaneous reconfiguration constraint and reduce the reconfiguration overhead.

We propose an efficient scheduling algorithm called AEE, which is based on Ant Colony Optimization (ACO) [9]. The central idea of AEE can be explained in two steps. In the first step, the algorithm finds the shortest overall makespan by tuning the FPGA units to the highest speed. In the meantime, a base schedule is generated. In the second step, the algorithm searches for the lowest system frequency for energy efficiency. Extending the base schedule, it conservatively tunes the frequency and makes sure that the makespan does not exceed the deadline constraint. It should be noted that the mapping of the tasks to FPGA logic blocks is not changed and the relative timing of all tasks remains fixed.

4.2. Design details

We first explain the fundamentals of the ACO algorithm. In ACO, each ant creates a feasible solution by repeatedly applying a probabilistic state transition rule to choose the next node to visit. Combining pheromones and a heuristic function, ants select the next node and finally complete their tours. After that, the global updating rule is used to update the pheromones on the edges of their tours in this iteration. When the termination condition holds, the best solution found is returned.

We next show the detailed design of the AEE algorithm. A set of ants $X = \{x_1, \ldots, x_N\}$ is initially deployed. $H_{x_k}[Ir]$ is the set of all eligible nodes with in-degree = 0 that $x_k$ can visit from a certain node. A node represents a task placed onto one of the FPGA units, e.g., node (i, j) denotes that $T_i$ is placed onto $F_j$. In-degree = 0 denotes the set of nodes that have not yet been visited by ant $x_k$. In AEE, when the probabilistic state transition rule is satisfied, an ant decides on the next node and thereby places the task onto that FPGA unit. After that, U and $H_{x_k}[Ir]$ are updated, where U is the set of tasks already placed. The algorithm runs until the termination criterion $CN_{max}$ is reached. Here, the probabilistic state transition rule is defined as:

$$P^{x_k}_{(i,j) \to (i',j')} = \begin{cases} \dfrac{[u^{x_k}_{(i',j')}]^{a} \, [g^{x_k}_{(i',j')}]^{b}}{\sum_{(r,h) \in H_{x_k}[Ir]} [u^{x_k}_{(r,h)}]^{a} \, [g^{x_k}_{(r,h)}]^{b}}, & (r, h) \in H_{x_k}[Ir] \\ 0, & \text{otherwise} \end{cases} \qquad (8)$$

Here $(i, j) \to (i', j')$ represents a node $(i', j')$ that can be visited by $x_k$ from $(i, j)$. $P^{x_k}_{(i,j) \to (i',j')}$ denotes the transition probability for ant $x_k$ from node (i, j) to $(i', j')$; it is non-zero if and only if $(r, h) \in H_{x_k}[Ir]$, i.e., (r, h) belongs to the set of nodes that ant $x_k$ can visit from node (i, j). Otherwise, the probability is zero. $u^{x_k}_{(i',j')}$ denotes the pheromone level for ant $x_k$ from node (i, j) to $(i', j')$; $g^{x_k}_{(i',j')}$ stands for the heuristic function, which is the reciprocal of the global makespan for ant $x_k$ from node (i, j) to $(i', j')$. We develop MFIT, which is explained shortly, to compute this function. The parameters a and b determine the relative importance of the pheromone and the heuristic, respectively.

Once all the ants have completed their tours, the global updating rule is used to increment the pheromone, which is defined as:

$$u(n, m) \leftarrow (1 - q) \, u(n, m) + Q / \zeta^{x_k}_{(si,sj) \to (ei,ej)} \qquad (9)$$

where $\zeta^{x_k}_{(si,sj) \to (ei,ej)}$ is the best solution of an ant $x_k$ that starts from node (si, sj) and ends at (ei, ej) in the current round. Each $u(n, m)$, $\forall n \in [1, N]$, $m \in [1, M]$, evaporates at the global pheromone evaporation rate unless the corresponding node is visited by an ant. With this strategy, the best solution can be found. q is the evaporation rate, where a higher q can accelerate the convergence of AEE. Q is a given coefficient parameter.

4.3. Heuristic function g: MFIT

Central to ACO is the heuristic function g. We design an algorithm called MFIT for computing the heuristic function, by which an ant can make a proper transition decision. Essentially, the heuristic function is a task placement algorithm for multi-FPGA systems.

It is difficult to design a task placement algorithm for multi-FPGA systems. It not only has to solve the problem of fragmentation of FPGA resources [10], but also faces two further challenges.

First, the number of loading ports on one FPGA unit is limited. Thus, tasks have to be loaded to the FPGA sequentially and then be reconfigured in the FPGA unit [22]. In contrast, in a system with multiple FPGAs, reconfigurations of different tasks can take place at the same time on different FPGAs. In this way, the global makespan can potentially be shortened. As illustrated in Fig. 2, there are three independent tasks T1, T2 and T3 demanding 3, 2, and 4 logic blocks, respectively. Two FPGA units F1, F2 have nine resource blocks. Ga and Gb denote the global makespans for the two placement scenarios. Because of simultaneous reconfiguration in multi-FPGA systems, it can be seen that Gb < Ga when tasks are assigned to different FPGA units.

Second, for a multi-FPGA system, it is important to balance the completion times over different FPGA units. As shown in Fig. 2a, three tasks are allocated to F1. Although the deadline constraint can be met, the slack is very short, suggesting that there is little room for the system to reduce the speed for energy efficiency. In contrast, in Fig. 2b one task is assigned to F1, and the rest are assigned to F2. Clearly, the slack becomes much larger. This allows the system to tune down the speed and thus improve energy efficiency.

We next explain the basic procedure of Algorithm 2 (i.e., MFIT). The algorithm tries to place newly arriving tasks at the bottom of the FPGA units. In this way, the completion time of each FPGA unit can be calculated with Cal(). Then, the placement with the minimum completion time is reserved. If several minimum completion times are the same, the variance of the completion times over the FPGA units is calculated for each candidate, and the candidate with the lowest variance is adopted for task placement. If the variances are also the same, the FPGA unit with the least number of allocated tasks is selected for task placement. Next, the available logic blocks of all FPGAs are updated. When all tasks are allocated, the global makespan is obtained for the ants to make a decision.
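The effect of simultaneous reconfiguration on the global makespan, as discussed for Fig. 2, can be reproduced with a small calculation. The sketch below assumes one loading port per FPGA unit, assumes that all tasks placed on a unit fit on it at the same time, and uses invented reconfiguration and execution times; none of these numbers are taken from the paper, and the helper names are hypothetical.

```python
def unit_completion(tasks):
    """Completion time of one FPGA unit with a single loading port.

    tasks: list of (reconfig_time, exec_time) in loading order.
    Reconfigurations are serialized on the port; each task starts executing
    on its own logic blocks as soon as its reconfiguration has finished.
    """
    port_free = 0.0  # time at which the loading port becomes idle
    finish = 0.0     # latest task finish time seen so far on this unit
    for reconfig, execute in tasks:
        port_free += reconfig
        finish = max(finish, port_free + execute)
    return finish


def global_makespan(placement):
    """Global makespan: the largest completion time over all units."""
    return max(unit_completion(tasks) for tasks in placement.values())


# Hypothetical (reconfiguration, execution) times for T1, T2 and T3.
t1, t2, t3 = (2.0, 5.0), (1.0, 4.0), (3.0, 6.0)
g_a = global_makespan({"F1": [t1, t2, t3], "F2": []})  # all tasks on one unit
g_b = global_makespan({"F1": [t1], "F2": [t2, t3]})    # tasks spread over two units
print(g_a, g_b)
```

With these assumed values the second placement finishes earlier, because T1 is reconfigured on F1 while T2 and T3 are loaded through the port of F2.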

Fig. 2. Illustration of tasks placement on multi-FPGA system.

Please cite this article in press as: C. Jing et al., Energy-efficient scheduling on multi-FPGA reconfigurable systems, Microprocess. Microsyst. (2013), http:// dx.doi.org/10.1016/j.micpro.2013.05.001


C. Jing et al. / Microprocessors and Microsystems xxx (2013) xxx–xxx

Algorithm 1
input: $N$, $M$: set of tasks, set of FPGA units; $T$: set of tasks; $D$: deadline constraint; $S$: set of $N_S$ possible speeds for each FPGA unit; $Q$: set of $A$ implementation types; $B$: set of $L$ contiguous logic blocks;
output: $\sum_{j=1}^{M} E_j$: system energy consumption; schedule
 1. Initialize: $S_j \leftarrow \max(S)$, $j = 1, \ldots, M$.
 2. Generate a set of $N$ ants,
 3. Generate sHL to reserve $T_i$ with type $Q_a$ on $F_j$ starting at $b_\ell$, $(\forall a \in [1, A], \forall \ell \in [1, L])$.
 4. For $Ir := 1$ to $CN_{max}$
 5.   Ants are positioned onto $N$ nodes $(i, j)$, $(\forall i \in [1, N], \forall j \in [1, M])$.
 6.   For $k := 1$ to $N$
 7.     Initialize $U := \emptyset$, and $H_{x_k}[Ir]$
 8.     // $U$ records the tasks that have been placed by $x_k$.
 9.     While $H_{x_k}[Ir] \neq \emptyset$ do
10.       If $T_i$ satisfies $i \in H_{x_k}[Ir]$
11.       and the probabilistic state transition rule.
12.         $U \leftarrow U \cup \{i\}$
13.         $H_{x_k}[Ir] \leftarrow H_{x_k}[Ir] - \{i\}$
14.         sHL$[Ir][k][i] \leftarrow [T_i, Q_a, F_j, b_\ell]$
15.         Add eligible nodes to $H_{x_k}[Ir]$
16.         with in-degree = 0.
17.       End If
18.     End While
19.   End For ($N$)
20.   // Until all ants have built their complete solution
21.   Apply the global updating rule with the best solution.
22. End For ($CN_{max}$)
23. Global makespan is obtained and
24. schedule $\leftarrow$ select the base schedule from sHL.
25. // The speed of the FPGA units is decreased unless this violates $D$
26. For $j := 1$ to $M$
27.   For $k := N_S$ to 1
28.     $S_j^k \leftarrow$ select the next smaller level $k$ in $F_j$;
29.     $c_j \leftarrow$ compute the new makespan;
30.     If $c_j > D$
31.       Break;
32.     End If
33.   End For
34. End For
35. Output:
36.   $\sum_{j=1}^{M} E_j$
37.   schedule: $[T_i, Q_a, F_j, b_\ell]$
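The speed-lowering phase at the end of Algorithm 1 (steps 25 to 34) can be illustrated with a small Python sketch. It is only an approximation under stated assumptions: the completion time of an FPGA unit is modelled as its workload in cycles divided by the selected speed, which is a hypothetical timing model and not the one used in the paper, and the sketch keeps the last level that still meets D rather than the level at which the deadline is first violated. The name `lower_speeds` and its arguments are illustrative.

```python
def lower_speeds(workloads, speed_levels, deadline):
    """Pick, per FPGA unit, the slowest speed level that still meets the deadline.

    workloads: cycles assigned to each FPGA unit by the base schedule.
    speed_levels: available speeds (any order); assumed identical on every unit.
    Assumption: completion time of a unit = cycles / speed.
    """
    levels = sorted(speed_levels, reverse=True)  # start from the maximum speed
    chosen = []
    for cycles in workloads:
        speed = levels[0]
        if cycles / speed > deadline:
            raise ValueError("deadline cannot be met even at maximum speed")
        for candidate in levels[1:]:
            if cycles / candidate > deadline:
                break  # the next smaller level would violate D
            speed = candidate
        chosen.append(speed)
    return chosen


if __name__ == "__main__":
    # Two FPGA units with 100 and 40 cycles of work and a deadline of 200.
    print(lower_speeds([100, 40], [1.0, 0.5, 0.25], 200))
```

Because the base schedule is fixed before this phase, each unit can be slowed down independently as long as its own completion time stays within D.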

Algorithm 2
input: $N$, $M$: set of tasks, set of FPGA units; $Q$: set of $A$ implementation types; $B$: set of $L$ contiguous logic blocks on each FPGA unit;
output: Global makespan $G$
 1. Initialize: $i := 1$
 2. While ($i \le N$) Do
 3.   For $j := 1$ to $M$
 4.     // Calculate the completion time $c_j$ when $T_i$ with type $Q_a$ is allocated to $F_j$.
 5.     $c_j \leftarrow$ Cal{ALLOC[$(T_i, Q_a)$, $j$]}, $(\forall a \in [1, A])$
 6.   End For
 7.   // Reserve the minimum completion time $c_j$.
 8.   If several minimum completion times from $\min\{c_1, \ldots, c_M\}$ are the same,
 9.     Reserve $T_i$ on the $F_j$ with the least number of tasks.
10.   Else
11.     Reserve $T_i$ on the $F_j$ with the minimum $c_j$.
12.   End If
13.   Update $[F_1(B), \ldots, F_M(B)]$.
14.   i++
15. End While
16. Output: $G \leftarrow \max\{c_1, \ldots, c_M\}$
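The greedy core of Algorithm 2 can be sketched in a few lines of Python. This is a deliberately simplified illustration, not MFIT itself: each FPGA unit is reduced to a single "ready time", a task is a hypothetical pair (reconfiguration cycles, execution cycles), and contiguous logic blocks, implementation types, and loading ports are ignored. Only the selection rule is kept: choose the unit with the minimum completion time and break ties by the fewest allocated tasks.

```python
def mfit_sketch(tasks, num_fpgas):
    """Greedy placement returning (global makespan, placement list).

    tasks: list of (reconfiguration_cycles, execution_cycles) pairs, in the
    order in which they are placed. Assumption: a unit runs one task at a
    time, so completion time = unit ready time + reconfiguration + execution.
    """
    ready = [0] * num_fpgas    # time at which each unit becomes free
    counts = [0] * num_fpgas   # number of tasks already placed on each unit
    placement = []
    for reconfig, execute in tasks:
        completion = [ready[j] + reconfig + execute for j in range(num_fpgas)]
        best = min(completion)
        tied = [j for j in range(num_fpgas) if completion[j] == best]
        # Tie-break: the unit with the fewest allocated tasks.
        j = min(tied, key=lambda unit: counts[unit])
        ready[j] = completion[j]
        counts[j] += 1
        placement.append(j)
    return max(ready), placement


if __name__ == "__main__":
    makespan, placement = mfit_sketch([(2, 10), (3, 12), (1, 5)], num_fpgas=2)
    print(makespan, placement)
```

The reciprocal of the returned makespan is what an ant would use as the heuristic value g for the corresponding transition.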

4.4. Enhanced algorithm

As discussed above, AEE performs well for independent tasks, but it may not be suitable for tasks with precedence and interdependencies, because the completion of dependent tasks incurs a delay. For instance, suppose two tasks $T_a$ and $T_b$ are allocated to different FPGA units, and $T_b$ is the successor of $T_a$. The successful completion of $T_b$ depends on the data generated by $T_a$. Although we assume that the link bandwidth is sufficient, the data from $T_a$ must pass through the limited loading ports, which delays $T_b$. This delay does not exist when $T_a$ and $T_b$ are on the same FPGA unit: the generated data is kept in the FPGA unit and does not need to go through the reconfiguration phase. Under these assumptions, in this subsection we propose an enhanced AEE (eAEE) that takes task precedence and interdependencies into account. The general idea of eAEE is as follows (see Algorithm 3).

Algorithm 3
input: $X(T, E)$: $T$ is a set of $N$ tasks, $E$ represents the set of task interdependencies; $D$: deadline constraint; $M$: set of FPGA units; $S$: set of $N_S$ possible speeds for each FPGA unit; Algorithm 1; Algorithm 4;
output: Schedule
 1. Initialization: $i := 1$, tSchedule $\leftarrow \emptyset$
 2. Level[i] $\leftarrow$ Reserve tasks with no predecessors to level one;
 3. For tasks in Level[i]
 4.   Check task interdependencies $E$;
 5.   Suc[i] $\leftarrow$ Find successors of Level[i];
 6.   Pred[i] $\leftarrow$ Find the predecessors of Suc[i];
 7.   If Pred[i] $\in$ Level in the previous levels
 8.     i++;
 9.     Level[i] $\leftarrow$ Reserve successor tasks;
10.   End If
11. End For
12. $Lv \leftarrow$ Compute the total number of levels;
13. For $l := 1$ to $Lv$
14.   tSchedule[l] $\leftarrow$ Record the schedule of tasks (Level[l], Pred[l]) using Algorithms 1 and 4 combined with tSchedule;
15. End For
16. // The speed of the FPGA units is decreased unless this violates $D$
17. For $k := N_S$ to 1
18.   $S^k \leftarrow$ Select the next smaller speed level $k$;
19.   For $j := 1$ to $M$
20.     $c_j \leftarrow$ Compute the completion time on $F_j$ with $S^k$ and tSchedule;
21.   End For
22.   $G \leftarrow \max\{c_1, \ldots, c_M\}$;
23.   If $G > D$
24.     Break;
25.   End If
26. End For
27. Schedule $\leftarrow$ Reserve the schedule with the minimum energy consumption with $S^k$ and tSchedule
28. Output:
29.   Schedule

Fig. 3. Energy consumption comparisons of different algorithms (energy consumption ratio versus deadline for FB, AEE, DP and SA): (a) task number 12, L = 100; (b) task number 12, L = 200; (c) task number 15, L = 100; (d) task number 15, L = 200; (e) task number 18, L = 100; and (f) task number 18, L = 200.

First, the tasks are separated by precedence level. Initially, tasks with no predecessor are assigned to level one. Level(i) records the set of tasks at level i, Suc(i) keeps the successors of the tasks in level i, and Pred(i) records the predecessor tasks of Suc(i). If all predecessor tasks in Pred(i) are already in the Level task set, then the tasks in Suc(i) can be allocated to Level(i + 1). By repeating this step, the tasks in each level and the total number of levels (Lv) are obtained.

Second, with the set of tasks in each level determined, we employ AEE to obtain the shortest overall makespan by running the FPGA units at maximum speed. However, due to the task interdependencies, the heuristic function MFIT has to be modified to satisfy the interdependencies (see Algorithm 4). When a task is placed onto an FPGA unit, the delay between FPGA units mentioned above is crucial because of the interdependencies. On the basis of the modified MFIT, the global makespan is obtained together with a base schedule.

Finally, the speed of each FPGA unit is readjusted to satisfy the deadline constraint.
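The first step of eAEE, separating the tasks into precedence levels, can be sketched as follows. This Python fragment is an illustrative reading of that step, assuming the task graph is given as a list of task identifiers plus (predecessor, successor) edges; the function name `partition_levels` and the data layout are hypothetical. A task enters a level as soon as all of its predecessors have been placed in earlier levels.

```python
def partition_levels(tasks, edges):
    """Group the tasks of a DAG into precedence levels.

    tasks: iterable of task identifiers.
    edges: iterable of (predecessor, successor) pairs.
    Returns a list of levels; level 0 holds the tasks with no predecessors.
    """
    preds = {t: set() for t in tasks}
    for before, after in edges:
        preds[after].add(before)

    placed = set()
    levels = []
    remaining = list(tasks)
    while remaining:
        # A task is eligible once all of its predecessors are in earlier levels.
        level = [t for t in remaining if preds[t] <= placed]
        if not level:
            raise ValueError("the task graph contains a cycle")
        levels.append(level)
        placed.update(level)
        remaining = [t for t in remaining if t not in placed]
    return levels


if __name__ == "__main__":
    print(partition_levels(["a", "b", "c", "d"],
                           [("a", "c"), ("b", "c"), ("c", "d")]))
```

The number of returned levels corresponds to Lv, and each level can then be scheduled in turn with the modified placement heuristic.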

Algorithm 4
input: $Q$: set of $A$ implementation types; $B$: set of $L$ contiguous logic blocks on each FPGA unit; $M$: set of FPGA units; Level(l): a set of $\hat{N}$ tasks at level $l$ that are going to be allocated to the FPGA units; Pred(l): the set of task predecessors corresponding to the tasks in Level(l); $E$: set of task interdependencies; tSchedule: a schedule that records the set of placed tasks.
output: Global makespan $G$
 1. Initialize: $i := 1$
 2. While ($i \le \hat{N}$) Do
 3.   For $j := 1$ to $M$
 4.     $c_j \leftarrow$ Calculate the completion time on $F_j$ when allocating $T_i$ from Level(l),
 5.     with $T_i$'s Pred, $Q_a$, tSchedule, and $E$ interdependencies;
 6.   End If
 7.   End For
 8.   // Reserve the minimum completion time $c_j$
 9.   If several minimum completion times from $\min\{c_1, \ldots, c_M\}$ are the same,
10.     Reserve $T_i$ on the $F_j$ with the least number of tasks
11.   Else
12.     Reserve $T_i$ on the $F_j$ with the minimum $c_j$
13.   End If
14.   Update $[F_1(B), \ldots, F_M(B)]$.
15.   i++
16. End While
17. Output: $G \leftarrow \max\{c_1, \ldots, c_M\}$

4.5. Algorithm analysis

Algorithm 2 contains two nested loops, whose time complexities are $O(N)$ and $O(M)$, respectively, so the total time complexity of this algorithm is $O(NM)$. In Algorithm 1, the time complexity for one ant is bounded by $O(MN^2)$. Since there are $N$ ants in each round, the time complexity of a round is $O(MN^3)$. Therefore, for the maximum number of rounds $CN_{max}$, the time complexity of AEE is $O(MN^3 \cdot CN_{max})$. In Algorithm 3, the time complexity for each level is bounded by $O(M\hat{N}^3 \cdot CN_{max})$. Since the total number of levels in $X(T, E)$ is $Lv$, the time complexity is $O(Lv \cdot M\hat{N}^3 \cdot CN_{max})$. We also analyze the complexity of Algorithm 4 with $\hat{N}$ input tasks: its time complexity is $O(M\hat{N})$.

5. Performance evaluation

In this section we first describe the simulation setup and performance metrics, and then discuss the performance of the proposed algorithms in comparison with alternative algorithms.

Fig. 4. Comparison of global makespan with various task numbers and L (global makespan versus number of tasks for DP, AEE, FB and SA): (a) L = 100 and (b) L = 200.

Fig. 5. Comparison of energy consumption of eAEE and iSA with two different sets of FPGA units (energy consumption ratio versus deadline): (a) 4 FPGA units and (b) 6 FPGA units.

5.1. Simulation setup and metrics

We adopt a comparative study, comparing our algorithm with the alternative algorithms introduced in the following subsection. The performance metric is the overall energy consumption ratio, which we define as follows:

$$\frac{P_a \cdot t_a + \sum_{j=1}^{M} E'_j}{P_a \cdot t_a + \sum_{j=1}^{M} E_j} \qquad (10)$$

where $\sum_{j=1}^{M} E_j$ is the total energy consumption when each FPGA unit is at the maximum speed level, and $\sum_{j=1}^{M} E'_j$ is the total energy consumption after the speed levels of the FPGA units have been adjusted by the energy-saving strategies.

Comprehensive simulations have been conducted to evaluate the effectiveness of our algorithm. We simulate a multi-FPGA reconfigurable system consisting of four FPGA units, as is typical for TM-4 and BEE2. Each FPGA unit has the same number of logic blocks. Two configurations are considered, L = 100 and L = 200. The number of adjustable speed levels is at most 10, i.e., $N_S \le 10$. We vary the number of tasks in the simulations. For task configuration, we take parameters from the traces used in [4,22,26]. The reconfiguration and execution cycle counts of each task are uniformly distributed in [1, 8] and [10, 15] for L = 100 ([10, 15] and [13, 18] for L = 200), respectively. The number of logic blocks required by a task follows a uniform random distribution in [40, 80] for L = 100 ([80, 115] for L = 200).

We use a DAG generator (TGFF) [16] that takes the shape of the DAG from Strassen's matrix multiplication or the Fast Fourier Transformation (FFT) [18], and assign each task a random number of contiguous logic blocks in [70, 120], reconfiguration and execution cycles in [1, 19] and [20, 80], respectively, and a task interdependency delay between FPGA units in [1, 5]. All of these follow a uniform random distribution.

The experiments are run on a desktop computer with dual Intel(R) Xeon(R) CPUs @ 2.13 GHz and 8 GB of RAM. The algorithms are programmed in C#. The results presented in the following are averaged over 10 independent runs.

5.2. Compared algorithms

We compare our algorithm with three other algorithms.

5.2.1. Dynamic programming based algorithm (DP)

We develop an algorithm based on dynamic programming for deriving the optimal schedule and the minimum energy consumption. The main steps of the DP based algorithm are as follows. First, the FPGA units are tuned to the maximum speed that satisfies the deadline constraint. Second, the DP strategy is adopted. Initially, MES is broken into N subproblems (N being the number of tasks), where each subproblem indicates the number of assigned tasks. Then, the recursive formula is defined as:

$$H[i, U] = \min\{\mathrm{MFIT}[h(i, U), j]\} \qquad (11)$$

where $i$ denotes a starting task in the task set $T$; $U$ is the set of tasks that have been assigned, with $i \cap U = \emptyset$, and in each step $U$ is updated as $U \leftarrow U \cup j$; $j$ is a task yet to be assigned; $h(i, U)$ is the set of tasks that starts from $T_i$ and is followed by the assigned tasks $U$; MFIT is used to calculate the global makespan of a schedule that begins with $T_i$, continues with the task set $U$ and ends with $T_j$; $H[i, U]$ records the schedule with the minimum global makespan that begins at $T_i$, continues with the task set $U$ and ends at $T_j$. By iterating (11) up to the last subproblem, we eventually obtain the minimum global makespan, which yields a base schedule. Third, the speed levels are readjusted to minimize the overall energy consumption without violating the deadline constraint. Although this algorithm is able to find the optimal result, it is only applicable to small problem scales because its complexity is extremely high, namely $O(MN^3 \cdot 2^{N+1})$. We use the DP algorithm as a baseline algorithm.

5.2.2. Fully-balanced algorithm (FB)

Balanced allocation of the tasks across the FPGA units can substantially reduce the overall energy consumption, so the fully-balanced algorithm (FB) is designed. First, the speed of each FPGA unit is set to the maximum level, and the number of tasks allocated to each FPGA unit is chosen with balance in mind. Second, the tasks are randomly selected and mapped to FPGA units by MFIT. The global makespan is then obtained and a base schedule is generated. Finally, the speed is readjusted to reduce the energy consumption without violating the deadline. The complexity of the algorithm is bounded by $O(N^2/M \cdot CR)$, where CR is the maximum number of executions of the loop.

5.2.3. Simulated annealing algorithm (SA)

SA is a generic heuristic algorithm for solving global optimization problems. We compare SA with AEE. The main SA procedure is as follows. First, the speed of each FPGA unit is set to the maximum, and SA takes a decreasing set of temperatures and randomly generates a solution. Second, for each temperature, SA iterates the following steps: (1) the update function randomly interchanges a pair of tasks on different FPGA units, which yields a new solution; (2) an updated solution that improves on the current one is accepted, and an updated solution that does not improve is accepted with an acceptance probability under the current temperature; (3) these steps are repeated until the system arrives at a good solution. Third, the temperature is decreased and the above steps are repeated. Fourth, when the temperature drops to zero, SA returns the solution. Finally, the speed of each FPGA unit is readjusted to meet the deadline constraint. Here, we suppose that the set of temperatures has size P and the number of iterations for each temperature is I, so the complexity of SA is $O(I \cdot P)$.

We also develop an improved SA (iSA) for tasks with precedence and interdependencies. Its first step is the same as in eAEE: we obtain the total number of levels and the set of tasks in each level. Then, with the modified MFIT (Algorithm 4), iSA applies the same strategy as SA to find the overall makespan from the first level to the last level. Meanwhile, the FPGA units are tuned to maximum speed, and a base schedule is generated. Eventually, the speed of each FPGA unit is tuned to minimize the energy consumption without missing the deadline constraint. The time complexity of iSA is bounded by $O(Lv \cdot I \cdot P)$.
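The basic SA loop described above (the plain SA baseline, not iSA) can be sketched in Python as follows. The sketch makes several simplifying assumptions that are not taken from the paper: a solution is a list mapping each task to an FPGA unit, the cost of a solution is the largest per-unit sum of task durations rather than the MFIT global makespan, and the acceptance probability is the usual Metropolis form exp(-delta / temperature). All names and parameter values are hypothetical.

```python
import math
import random


def makespan(assignment, durations, num_fpgas):
    """Largest total duration on any FPGA unit (a stand-in for the global makespan)."""
    load = [0.0] * num_fpgas
    for task, unit in enumerate(assignment):
        load[unit] += durations[task]
    return max(load)


def simulated_annealing(durations, num_fpgas, temperatures, iterations, seed=0):
    """Return (best assignment, best cost) for a decreasing list of positive temperatures."""
    rng = random.Random(seed)
    current = [rng.randrange(num_fpgas) for _ in durations]  # random initial solution
    current_cost = makespan(current, durations, num_fpgas)
    best, best_cost = list(current), current_cost
    for temperature in temperatures:
        for _ in range(iterations):
            a, b = rng.sample(range(len(durations)), 2)
            if current[a] == current[b]:
                continue  # only interchange tasks on different FPGA units
            candidate = list(current)
            candidate[a], candidate[b] = candidate[b], candidate[a]
            cost = makespan(candidate, durations, num_fpgas)
            delta = cost - current_cost
            # Accept improvements always, worse solutions with Metropolis probability.
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                current, current_cost = candidate, cost
                if current_cost < best_cost:
                    best, best_cost = list(current), current_cost
    return best, best_cost


if __name__ == "__main__":
    print(simulated_annealing([4, 3, 2, 7, 5, 1], 2, [10.0, 5.0, 1.0], 50))
```

The two nested loops make the I times P cost structure visible: one pass over the temperature set, with a fixed number of iterations per temperature.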

5.3. Comparative results

We first compare the algorithms in terms of energy consumption ratio, as shown in Fig. 3. In this set of simulations, there are in total 12, 15 and 18 tasks. We can see that AEE consumes slightly more energy than DP does. More specifically, AEE consumes at most 10.65% more energy than DP for L = 100 and L = 200. However, the proposed algorithm uses much less energy than the FB algorithm. FB performs the worst, because FB only concentrates on the load balance of tasks. SA is also worse than AEE, because the pheromone update rule (9) leads AEE toward the best solution in a global scheme. Lacking such a scheme, SA cannot perform better than AEE.

Since the global makespan has a great impact on the overall energy consumption, we further investigate the global makespan of the algorithms in Fig. 4. Three task sets with different numbers of tasks are used. AEE produces a global makespan close to that achieved by the DP algorithm. In Fig. 4b, when there are 12 tasks, the performance of AEE is almost the same as that of DP. Figs. 3 and 4 together demonstrate that our proposed algorithm performs well in reducing energy consumption and meeting the deadline constraint.

The algorithm discussed above is designed for independent task sets. We also show the performance of eAEE for a set of tasks with precedence and interdependencies. The total number of tasks is 100. Fig. 5 illustrates that eAEE outperforms iSA in energy saving: it consumes 58.17% less energy than iSA with a set of four FPGA units. In Fig. 6, the global makespan obtained by eAEE is also shorter than that of iSA for both sets of FPGA units.

Fig. 6. Comparison of global makespan with 100 tasks and L = 200 (global makespan versus number of FPGA units, 4 and 6, for eAEE and iSA).

5.4. Complexity comparison

We compare AEE and DP in terms of time and memory complexity. Table 1 shows the time and memory complexities of the two algorithms. When L = 200 and the number of tasks equals 18, the DP algorithm consumes 5.63 GB of memory and takes more than 12 h to derive the optimal solution. In comparison, our algorithm uses much less memory, only 28.9 MB, and takes much less time, only 8.13 min. We also report the time and memory complexity of eAEE and iSA. Even though iSA costs less time than eAEE, the result obtained by iSA is much worse than that of eAEE. It is noted that eAEE costs less time than AEE because of the parameter q: for eAEE, q is set to a relatively higher value, which obtains a better solution within a shorter time.

Table 1 Time and memory complexity comparison.

Algorithm   Task number   L     Time        Memory
AEE         12            200   25.77 s     26.14 MB
AEE         15            200   98.65 s     27.44 MB
AEE         18            200   8.13 min    28.90 MB
DP          12            200   7.94 min    66.78 MB
DP          15            200   2.62 h      733.65 MB
DP          18            200   >12 h       5.63 GB
eAEE        100           200   12.81 min   26.02 MB
iSA         100           200   8.31 min    25.39 MB

In AEE with a set of 50 tasks, the evaporation rate q is an important algorithm parameter. We investigate the impact of q on the algorithm performance. Table 2 shows that CNmax (the number of iterations of the main loop in AEE, i.e., the stopping criterion) is no more than 70 when the evaporation rate is in the range [0.1, 0.9]. The minimum global makespan is obtained when q is 0.2. It is evident that q is related to the convergence speed of AEE. It is also noted that a relatively higher q still yields a global makespan close to the minimum (492 at q = 0.7 and q = 0.9 versus 488 at q = 0.2) with far fewer iterations, and hence greatly reduces the running time of the algorithm.

Table 2 Impact of evaporation rate (task number is 50 and L = 200).

Evaporation rate   CNmax   Global makespan
0.1                65      490
0.2                30      488
0.3                19      489
0.4                13      496
0.5                9       502
0.6                7       502
0.7                5       492
0.8                4       500
0.9                3       492

5.5. Impact of implementation types

We next investigate the impact of the implementation types of tasks on energy consumption. We consider two task scheduling schemes. One straightforward scheme (STP) combines tasks of the same implementation type and then allocates the combined task as one virtual task. The second scheme (ATP) does not combine tasks of the same type and instead lets AEE perform the allocation and scheduling. We use two sets of tasks with the same total number of tasks but a different number of implementation types. The reconfiguration and execution cycles of the tasks are randomly distributed in [1, 20] and [25, 45], respectively. We test the two task sets on reconfigurable systems with 4, 6 and 8 FPGA units. Two deadlines are given for the two task sets. Fig. 7 clearly shows that the proposed algorithm can effectively minimize reconfiguration overhead. In addition, STP is not better than ATP. This shows that AEE can adaptively schedule various tasks.

Fig. 7. Impact of implementation types (total task number is 50 and L = 100; energy consumption ratio versus number of FPGA units for ATP and STP): (a) number of implementation types 10 and (b) number of implementation types 15.

6. Conclusion

Multi-FPGA reconfigurable systems have increasingly been adopted for high performance computing. This paper considers the energy-efficient scheduling problem with a hard deadline constraint. Multi-FPGA reconfigurable systems differ dramatically from traditional computing systems with multiple general-purpose processors. A task must be mapped to a set of contiguous logic blocks that must be configured for the execution of this task, so reconfiguration overhead has to be considered. In addition, an FPGA unit has a limited number of loading ports, which restricts the number of tasks that can be reconfigured simultaneously. We have presented an energy-efficient scheduling algorithm called AEE. Based on ACO, this algorithm has the advantage of low complexity and can achieve high energy efficiency. An enhanced algorithm, eAEE, is also presented for tasks with precedence and interdependencies. Both have been validated through comprehensive trace-driven simulations.

Acknowledgements

This research is supported by NSFC (No. 61170238, 60903190, 61027009, 60933011, 61202375, 61170237), Shanghai Pu Jiang Talents Program (10PJ1405800), Shanghai Chen Guang Program (10CG11), MIIT of China (2009ZX03006-001-01), Doctoral Fund of Ministry of Education of China (20100073120021), National 863 Program (2009AA012201 and 2011AA010500), HP IRP (CW267311), SJTU SMC Project (201120), STCSM (08dz1501600, 12ZR1414900), Singapore NRF (CREATE E2S2), and Program for Changjiang Scholars and Innovative Research Team in Universities of China (IRT1158, PCSIRT).

References

[1] "40-nm FPGA Power Management and Advantages – Altera", Technical Report, 2008.
[2] HPCA (High Performance Computing Alliance).
[3] "Virtex-5 Family Overview", Technical Report, Xilinx, 2009.
[4] J. Angermeier, J. Teich, Heuristics for scheduling reconfigurable devices with consideration of reconfiguration overheads, in: Proc. IEEE Reconfigurable Architecture Workshop (RAW) on IPDPS, 2008.
[5] H. Aydin, Q. Yang, Energy-aware partitioning for multiprocessor real-time systems, in: Proc. IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2003.
[6] C. Chang, J. Wawrzynek, R.W. Brodersen, BEE2: a high-end reconfigurable computing system, IEEE Design & Test of Computers 22 (2) (2005) 114–125.
[7] C. Chow, L. Tsui, P.H.W. Leong, W. Luk, S.J.E. Wilton, Dynamic voltage scaling for commercial FPGAs, in: Proc. IEEE International Conference on Field-Programmable Technology (FPT), 2005, pp. 173–180.
[8] K. Compton, S. Hauck, Reconfigurable computing: a survey of systems and software, ACM Computing Surveys 34 (2) (2002) 171–210.
[9] M. Dorigo, Optimization, Learning and Natural Algorithms, Ph.D. Dissertation, Politecnico di Milano, Italy, 1992.
[10] A.A. ElFarag, H.M. El-Boghdadi, S.I. Shaheen, Miss ratio improvement for real-time applications using fragmentation-aware placement, in: Proc. IEEE Reconfigurable Architecture Workshop (RAW) on IPDPS, 2007.
[11] J. Fender, J. Rose, D. Galloway, The Transmogrifier-4: An FPGA-Based Hardware Development System with Multi-Gigabyte Memory Capacity and High Host and Memory Bandwidth, Technical Report, The Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, Toronto, 2006.
[12] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman, 1979.
[13] S. Ghiasi, A. Nahapetian, M. Sarrafzadeh, An optimal algorithm for minimizing run-time reconfiguration delay, ACM Transactions on Embedded Computing Systems (TECS) 3 (2) (2004) 237–256.
[14] L.K. Goh, B. Veeravalli, S. Viswanathan, Design of fast and efficient energy-aware gradient-based scheduling algorithms for heterogeneous embedded multiprocessor systems, IEEE Transactions on Parallel and Distributed Systems (TPDS) 20 (1) (2009).
[15] Z. Gu, M. Yuan, X. He, Optimal static task scheduling on reconfigurable hardware devices using model-checking, in: Proc. IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), 2007.
[16] J. Heiner, B. Sellers, M. Wirthlin, J. Kalb, FPGA partial reconfiguration via configuration scrubbing, in: Proc. International Conference on Field Programmable Logic and Applications (FPL), 2009, pp. 99–104.
[17] T.-Y. Huang, Y.-C. Tsai, E.T.-H. Chu, A near-optimal solution for the heterogeneous multi-processor single-level voltage setup problem, in: Proc. IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2007.
[18] S. Hunold, J. Lepping, Evolutionary scheduling of parallel tasks graphs onto homogeneous clusters, in: Proc. IEEE International Conference on Cluster Computing (Cluster), 2011.
[19] W.Y. Lee, Energy-efficient scheduling of periodic real-time tasks on lightly loaded multicore processors, IEEE Transactions on Parallel and Distributed Systems (TPDS) 23 (3) (2012) 530–537.
[20] C.-F. Li, P.-H. Yuh, C.-L. Yang, Y.-W. Chang, Post-placement leakage optimization for partially dynamically reconfigurable FPGAs, in: Proc. ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED), 2007.
[21] F. Li, Y. Lin, L. He, J. Cong, Low-power FPGA using pre-defined dual-Vdd/dual-Vt fabrics, in: Proc. ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGAs), 2004, pp. 42–50.
[22] Y. Lu, T. Marconi, K. Bertels, G. Gaydadjiev, Online task scheduling for the FPGA-based partially reconfigurable systems, in: Proc. International Workshop on Reconfigurable Computing: Architectures, Tools and Applications, 2009.
[23] N.-C. Perng, J.-J. Chen, C.-Y. Yang, T.-W. Kuo, Energy-efficient scheduling on multi-context FPGA's, in: Proc. IEEE International Symposium on Circuits and Systems (ISCAS), 2006.
[24] F. Redaelli, M.D. Santambrogio, D. Sciuto, S.O. Memik, Reconfiguration aware task scheduling for multi-FPGA systems, in: Proc. Reconfigurable Computing Workshop on HiPEAC, 2010.
[25] E.S.H. Hou, N. Ansari, H. Ren, A genetic algorithm for multiprocessor scheduling, IEEE Transactions on Parallel and Distributed Systems (TPDS) 5 (2) (1994) 113–120.
[26] H. Walder, C. Steiger, M. Platzner, Fast online task placement on FPGAs: free space partitioning and 2D-hashing, in: Proc. IEEE Reconfigurable Architecture Workshop (RAW) on IPDPS, 2003.
[27] Y. Zhang, X.S. Hu, D.Z. Chen, Task scheduling and voltage selection for energy minimization, in: Proc. ACM Design Automation Conference, 2002.

Chao Jing is a PhD candidate in the Department of Computer Science and Engineering at Shanghai Jiao Tong University. His research interests include High Performance Computing (HPC) and resource management on reconfigurable systems.

Yanmin Zhu obtained his PhD in computer science from Hong Kong University of Science and Technology in 2007, and his BEng. in computer science from Xi'an Jiao Tong University in 2002. He is an Associate Professor in the Department of Computer Science and Engineering, Shanghai Jiao Tong University, China. His research interests include ad-hoc sensor networks, mobile computing, grid computing and resource management in distributed systems.

Minglu Li received his PhD in Computer Software from Shanghai Jiao Tong University in 1996. He is a full Professor at the Department of Computer Science and Engineering of Shanghai Jiao Tong University, and the vice dean of the School of Electronic Information and Electrical Engineering. His current research interests include Grid Computing, Services Computing and Cloud Computing. He has published over 100 papers in important academic journals and international conferences.