
HEARS: A Heterogeneous Energy-Aware Real-time Scheduler

Sanjay Moulik¹,*, Rishabh Chaudhary¹, Zinea Das¹

Abstract

Developing energy-efficient schedulers for real-time heterogeneous platforms executing periodic tasks is an onerous and computationally challenging problem. As a result, we currently face a scarcity of real-time energy-aware scheduling techniques applicable to heterogeneous platforms. Hence, this research proposes a heuristic strategy called HEARS for the DVFS-enabled energy-aware scheduling of a set of periodic tasks executing on a heterogeneous multicore system with an arbitrary number of core types. The presented scheme first applies deadline-partitioning to obtain a set of distinct time-intervals called frames. At any frame boundary, the following two-phase hierarchical operation is applied to obtain the schedule for the next frame. First, it computes the execution requirement of each task on every processing core of the platform. A task's execution requirement on a core within a frame depends on: i. the length of the ensuing frame, ii. the total execution requirement of the task instance on each core (a single task may have different execution requirements on different cores), and iii. the deadline of the task instance. Next, it simultaneously allocates each task to one or more cores and selects operating frequencies for the concerned cores such that the total execution demand of all allocated tasks is satisfied with minimum change in energy consumption for the system. Experimental results show that our proposed strategy not only achieves appreciable energy savings with respect to the state-of-the-art MaxMin (2% to 37% on average) but also enables significant improvement in resource utilization (as high as 57%).

Keywords: HEARS, Energy-Aware, Heterogeneous Platforms, Multicore, Real-Time, Scheduler

1. Introduction

Real-time Systems - Basics: A real-time system is characterized by its dual notion of correctness: logical and temporal [1]. Hence, the correctness of a real-time system is judged not only by the logical results it produces but also by the times at which they are produced; a failure on either count may have a catastrophic effect. Safety-critical applications such as reactors in nuclear plants, anti-lock braking systems in vehicles, pacemakers in health-care, fly-by-wire in aircraft etc. are examples of real-time systems. Many real-time systems support applications comprising an infinite sequence of recurrent

tasks [2]. A recurrent task typically represents a few lines of code whose execution is triggered by external events in its operating environment and must complete within a stipulated time bound called its deadline. Each execution of the task is referred to as a task instance or job. A recurrent task is said to be periodic if any two consecutive instances of the task are always separated by a fixed inter-arrival time. Periodic tasks may be executed either preemptively or non-preemptively. In preemptive execution, a job which is currently executing on a processor may be interrupted at any time and resumed later. No such interruption is allowed during non-preemptive execution.

* Corresponding author. Email address: [email protected] (Sanjay Moulik).
¹ Authors are with the Department of Computer Science and Engineering, Indian Institute of Information Technology Guwahati, Guwahati - 781015, Assam, India.

Heterogeneous Platforms: Over the years, we have witnessed a significant shift in the nature of processing platforms in real-time embedded

systems. For example, a modern System-on-Chip platform contains multicore processors with specialized digital signal processing cores, graphics processing cores, customizable FPGAs, ASIPs, ASICs etc. In this work, we have considered heterogeneous (or unrelated) platforms [3, 4], which are based on the same instruction set architecture (ISA) but contain computing elements of different specifications. On such a platform, the same piece of code may need different amounts of execution time on different processing cores. Given a set of real-time applications modeled as periodic tasks and a heterogeneous multicore processing platform, successfully satisfying all timing-related specifications is ultimately a scheduling problem.

Scheduling on Multiprocessor Heterogeneous Platforms: Traditionally, schedulers for real-time tasks on multicores/processors can be broadly classified into three categories: non-migrative, intra-migrative and fully-migrative. In non-migrative scheduling, a task is allotted to a specific core and is never allowed to migrate to any other core. Such an approach has the benefit of converting the multicore scheduling problem into a set of unicore ones. Hence, simple unicore scheduling algorithms such as Earliest Deadline First (EDF), Rate Monotonic (RM) etc. may be used. An additional advantage of this approach is that migration-related overheads, such as loss of cache affinity and inter-core task-state transfer, are completely avoided due to the strict partitioning of tasks onto distinct individual cores. However, strict partitioning has its own drawbacks. The achievable resource utilization of non-migrative approaches may be very low and is sensitive to the actual partition derived for a given task-set instance, especially on heterogeneous platforms where tasks are non-uniformly affined to different cores with respect to their execution efficiencies. Thus, an efficient partitioning strategy may deliver significantly better resource utilization than simple bin-packing schemes [5, 6] like first fit, next fit, worst fit etc. However, optimal partitioning is a well-known NP-hard problem, and hence online partitioning strategies typically have to resort to heuristic sub-optimal methods. Intra-migrative scheduling relaxes strict partitioning by allowing tasks to migrate among similar cores only. Once tasks are allocated to processing core groups, the scheduling problem boils down to a set of homogeneous multicore scheduling problems. Optimal online homogeneous multicore scheduling approaches such as Pfair [7], ERFair [8], DPFair [9] etc. may be used to schedule the tasks associated with each processing core group. Compared to a non-migrative scheme, this approach is marked by higher achievable resource utilization and lower partitioning-related overheads. Fully-migrative approaches allow tasks to migrate between all processing cores in the platform, and hence offer better schedulability than the two previous classes. In [10], the authors proposed an algorithm to optimally schedule independent preemptive aperiodic tasks on systems having m heterogeneous cores. However, the algorithm requires O(m²) preemptions/migrations in the worst case. According to [4], this strategy can be further extended to optimally schedule periodic tasks, although this results in a further increase in the number of preemptions and inter-core task migrations. Recently, Chwa et al. [11] extended the DPFair scheduling strategy and proposed Hetero-Fair, a global optimal algorithm for the scheduling of periodic tasks on heterogeneous multicores with two types of processing cores (for example, ARM's big.LITTLE [4]; referred to as a two-type platform).

Energy-Aware Scheduling: Nowadays, many embedded systems like PDAs, laptops, mobiles etc. depend upon a battery as their principal source of energy. Hence, these devices are judged not only by their real-time functional performance but also by their efficiency in terms of energy management. A lot of research has been conducted towards energy management of these devices at various levels of abstraction, from firmware and hardware to the architectural, system and even application levels. Typically, two energy management schemes are employed at the operating system level: i) Dynamic Power Management (DPM) [12], where particular parts of a system are strategically turned off when the processors are idle, and ii) Dynamic Voltage and Frequency Scaling (DVFS) [12], which reduces power dissipation by exploiting the relation between power consumption and supply voltage/execution frequency while satisfying both resource and timing constraints for a set of real-time tasks.

Our Work: In a system with heterogeneous (unrelated) cores, the weight of the same task may vary across cores, even when the cores run at the

same frequency. As the cores are unrelated, the same task will have a different share corresponding to each core within a frame. It becomes a challenge to schedule migrating tasks without any overlapped execution across multiple cores within a frame. Hence, it is difficult to provide an optimal scheduler for heterogeneous platforms having an arbitrary number of core types, and this is still an open research problem. Although some works in the literature attempt to address the problem of energy-aware scheduling for heterogeneous platforms, they do not allow migration of tasks within the system, often leading to significantly lower resource utilization. However, efficient resource utilization is a critical design criterion in many embedded systems, as it helps minimize the number of required resources, thereby reducing design cost. Thus, with efficient energy management and resource usage as primary objectives, we propose HEARS, a DVFS-enabled energy-aware scheduling strategy which offers high resource utilization on heterogeneous multicore platforms. The salient features of the proposed strategy can be summarized as follows:

• While performing task-to-core allocation, HEARS considers not only the execution and energy demands of the tasks under consideration but also the current operating frequencies of the cores. It sequentially allocates each task to one or more processing cores such that there is minimum change in energy consumption for the system.

• Using an efficient task-to-core allocation strategy, HEARS is able to achieve high resource utilization and thus deliver significant performance improvement over the state-of-the-art MaxMin [13] approach, while incurring only a bounded number of inter-core task migrations.

• By employing an energy-saving heuristic, HEARS is able to deliver appreciable energy savings compared to the state-of-the-art MaxMin [13].

• Extensive simulation-based experiments have been carried out using real-world applications taken from the Parsec [14] and Mälardalen [15] benchmarks. To verify the efficacy of our algorithm, we carried out experiments based on synthetic task sets as well. Experimental results show that our proposed strategy is not only able to significantly improve acceptance ratios of task sets on a given heterogeneous platform but also reduce dynamic energy consumption in the system.

Organization: In the next section, we present some of the related work in this field, and we then explain the specifications used in our work. Next, we present our proposed algorithm in Section 4, along with a time complexity analysis at the end of that section. In Section 5, we first present an illustrative example followed by the experimental set up; the experimental results are then discussed at the end of Section 5. Finally, in Section 6, we conclude our work.

2. Related Works

The problem of scheduling on a multiprocessor platform has been studied for many years. PFair algorithms [7, 8] form the class of the first known optimal multiprocessor scheduling strategies for homogeneous platforms. With the advent of heterogeneous multicore platforms such as Nvidia Tegra [16], ARM's big.LITTLE [4] etc., embedded systems design strategies need to adapt to these newer platforms. Raravi et al. [17] proposed a task-to-core allocation algorithm for a heterogeneous platform consisting of two different types of processing cores (such as ARM's big.LITTLE). After partitioning and assigning tasks onto the two homogeneous sub-clusters of the heterogeneous processing platform, they applied variations of optimal PFair algorithms such as ERFair [8], DPFair [9] etc. on each homogeneous sub-cluster. Recently, Baruah et al. [18] presented an Integer Linear Program (ILP) based optimal solution to the partitioning problem for heterogeneous platforms. However, solving ILPs for systems with a large number of tasks and processing cores may incur prohibitively expensive computational overheads in many real-time systems. The authors in [19] proposed a DPFair [9] based heuristic approach for the scheduling of real-time tasks executing on heterogeneous platforms. It may be noted that the works in [17, 18, 19] discussed above mainly focus on satisfying the timing-related constraints associated with real-time systems; they do not consider energy minimization as part of their scheduling objective. A lot of research has been done on energy-aware scheduling for multicore real-time systems. The authors

in [20] proposed a DVFS-based scheduling strategy called Deterministic Stretch to Fit (DSF) using the principle of inter-task slack sharing. They presented two mechanisms to reduce energy consumption in the system. In the first, they present an online algorithm to reclaim energy by adapting to variations in the actual workload of the target application tasks. In the second, they extended the algorithm with an adaptive and speculative speed adjustment strategy, which anticipates the early completion of future task instances based on information about their average workload. By utilizing both DVFS and DPM techniques, the authors in [21] proposed a Mixed Integer Linear Programming (MILP) formulation for energy optimization in real-time multicore systems. Ejlali et al. proposed a standby-sparing based energy-aware fault-tolerant scheme for the scheduling of aperiodic tasks on dual cores [22]. In order to reduce energy consumption, they employed DVFS for the main unit and DPM for the spare. Later, this work was extended by Haque et al. [23] for periodic real-time applications. Recently, Guo et al. generalised the work presented in [23] for real-time systems deployed on platforms consisting of more than two processing cores [24]. However, all these works [20, 21, 22, 23, 24] target multicore platforms consisting of homogeneous cores. Over the years, homogeneous platforms have slowly given way to heterogeneous multicore platforms in order to satisfy strict performance demands, often along with additional constraints on weight, size, power etc. But only a few works have addressed the problem of energy-aware scheduling for heterogeneous platforms. An energy-aware partitioning and scheduling algorithm for standby-sparing systems has been presented in [25]. It is essentially a primary-backup approach for a dual-core platform where the primaries and backups of tasks are always mapped to distinct cores. Given two cores V1 and V2, the primaries of all tasks which require shared resources are mapped to one of the cores (say, V1), while the primaries of the remaining tasks, which do not require shared resources, are mapped to the other core (say, V2). Another scheduling technique for standby-sparing systems has been proposed in [26]. They presented a DVFS-based energy-aware strategy to schedule fixed-priority real-time tasks. Although these strategies have attempted to address the problem of energy-aware scheduling on hetero-

geneous platforms, they restricted the number of processing core types to only two. In recent years, some research works have focused on generic heterogeneous platforms having an arbitrary number of processor types. The authors in [27] proposed two heuristics based on the Earliest Deadline First (EDF) algorithm to minimize energy consumption in heterogeneous multicore platforms. Another energy-aware partitioning strategy, called Least Loss Energy Density (LLED), has been proposed in [28]. Although this strategy is able to use DVFS and DPM mechanisms to reduce energy consumption in the system, its mapping of tasks onto processors is computationally expensive, which restricts its practical applicability. The problem of scheduling independent tasks with stochastic execution times and energy consumption on a heterogeneous platform under deadline and energy budget constraints has been studied in [29]. A two-phase energy-aware scheduler for heterogeneous systems has been proposed in [13]. In the first phase, the authors presented a DVFS-based task-to-core allocation strategy called MaxMin, and in the next phase they further refined the allocation to achieve sleep states and better energy savings. Although these works attempt to address the problem of energy-aware scheduling for heterogeneous platforms, they do not allow migration of tasks within the system, often leading to significantly lower resource utilization with respect to a fully-migrative approach. However, efficient resource utilization is a critical design criterion in many embedded systems, as it helps minimize the number of required resources, thereby reducing design cost. Thus, with efficient energy management and resource usage as primary objectives, we propose HEARS, a DVFS-enabled energy-aware scheduling strategy which offers high resource utilization on heterogeneous multicore platforms.

3. Specifications

3.1. System Model

The system under consideration consists of a set of n periodic tasks T = {T1, T2, ..., Tn} to be scheduled on a set of m heterogeneous cores V = {V1, V2, ..., Vm} which can operate on a discrete normalized set of frequencies F = {f1, f2, ..., fmax}, such that fmax represents the normalized frequency of 1 and all other frequencies lie between (f1/fmax) and 1. The frequency set (Table 1), available in the Nexus 4 with a quad-core Snapdragon S4 Pro processor, has been used to carry out the experiments. Each instance of a periodic task Ti has an execution requirement of ei,j (when executing at fmax), period pi, and utilization ui,j = (ei,j/pi) on core Vj. We assume task deadlines to be implicit, that is, the same as the period pi.
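To make the system model concrete, the following minimal Python sketch (ours, not part of the paper's implementation) derives the utilization matrix ui,j = ei,j/pi from hypothetical per-core execution requirements:

# System model sketch: n periodic tasks on m unrelated cores.
# e[i][j] is Ti's execution requirement on core Vj at f_max; deadlines are
# implicit (equal to periods). All values are illustrative, not from the paper.
e = [[7, 5], [4, 8]]   # 2 tasks, 2 cores
p = [10, 20]           # periods (= implicit deadlines)

# u[i][j] = e[i][j] / p[i]
u = [[e[i][j] / p[i] for j in range(len(e[i]))] for i in range(len(e))]
print(u)   # [[0.7, 0.5], [0.2, 0.4]]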

Table 1: Available Frequency Levels

Frequency (in MHz)   Frequency (normalized)   Frequency (in MHz)   Frequency (normalized)
384                  0.25                     1026                 0.67
486                  0.32                     1134                 0.75
605                  0.40                     1242                 0.82
702                  0.46                     1350                 0.89
810                  0.53                     1458                 0.96
918                  0.60                     1512                 1.00

3.2. Power Model

In this work, we have adopted the analytical core energy model presented in [30]. The dynamic power consumption P in a system with DVFS capability is directly proportional to the operating frequency f and the square of the supply voltage ν (i.e. P ∝ fν²). The supply voltage is, in turn, linearly proportional to the operating frequency [31]. Hence, the expression for power consumption may be written as Pf = c × f³, where f represents the operating frequency of the core, Pf represents the power consumption at operating frequency f, and c denotes the constant of proportionality.
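As a quick illustration of the cubic relation, the sketch below (assuming the normalization c = 1, which the paper does not fix) evaluates the dynamic power and the energy drawn over a frame:

# Dynamic power under the adopted model: P_f = c * f^3, where f is the
# normalized operating frequency. c = 1.0 is an assumed normalization.
def dynamic_power(f, c=1.0):
    return c * f ** 3

def dynamic_energy(f, duration, c=1.0):
    # Energy = power * time for a core held at frequency f.
    return dynamic_power(f, c) * duration

# A core busy over a 10-slot frame at f = 1.0 versus f = 0.4:
print(dynamic_energy(1.0, 10))   # 10.0
print(dynamic_energy(0.4, 10))   # 0.64 -> the cubic law rewards slowing down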

3.3. Problem Description

Given a task set T with n implicit-deadline persistent periodic tasks and a core set V having m heterogeneous cores (with DVFS support), find a feasible schedule which minimizes energy consumption while guaranteeing that all tasks always meet their timing (execution and deadline) and resource constraints.

4. Proposed Scheduling Scheme

The proposed scheduling scheme HEARS is a two-level hierarchical resource allocation strategy. At the first level, it applies deadline partitioning (similar to DPFair [9]) to compute a set of frames, where a single frame is the interval between two consecutive deadlines corresponding to the set of ready tasks. Next, within each frame, HEARS schedules tasks onto the available cores, such that each task receives its appropriate execution share and the operating frequencies of the cores are scaled appropriately.

4.1. HEARS

Following the above steps, we present our proposed scheduling strategy, HEARS (Algorithm 1), for the energy-aware scheduling of real-time tasks on heterogeneous multicores. It takes a task set T, a heterogeneous platform V, and a frequency set F as inputs. It starts by computing a set of frames G within the hyperperiod H, according to the deadline partitioning scheme. Next, for each frame Gk ∈ G, HEARS computes the allocation matrix AM. For this purpose, it internally uses EA-ALLOCATE (Algorithm 2) and EA-SCHEDULE (Algorithm 4).

Algorithm 1: HEARS
Input: T, V, F
Output: A set of schedule tables (∀Vj ∈ V in [0, H])
1  Compute hyperperiod H = lcm(p1, p2, ..., pn)
2  Using deadline-partitioning, compute the frames Gk
3  Let G be the set of frames G = {G1, G2, ...} and fr[m × 1] be the frequency level matrix for the cores
4  for each frame Gk ∈ G do
5      Let AM[n × m] and SM[n × m] be the allocation matrix and schedule matrix at Gk
6      Initialize AM and SM to φ and fr to 1
7      EA-ALLOCATE(T, V, F, Gk, AM, fr)
8      EA-SCHEDULE(T, V, AM, SM)
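The deadline-partitioning step in line 2 of Algorithm 1 can be sketched as follows; for implicit-deadline periodic tasks, the frame boundaries are the distinct multiples of the task periods within the hyperperiod. The helper name is ours:

from math import lcm

def frames(periods):
    # Deadline partitioning: split [0, H] at every distinct job deadline.
    H = lcm(*periods)
    boundaries = sorted({k * p for p in periods
                         for k in range(1, H // p + 1)} | {0})
    return list(zip(boundaries, boundaries[1:]))   # frames G1, G2, ...

print(frames([10, 20, 40]))
# [(0, 10), (10, 20), (20, 30), (30, 40)]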

4.2. EA-ALLOCATE

The pseudo-code of EA-ALLOCATE is shown in Algorithm 2. It attempts to allocate each task Ti ∈ T onto the heterogeneous platform V such that its execution requirement within the given frame Gk is satisfied and the operating frequencies of the cores are scaled appropriately. For this purpose, it first computes the share matrix sh[n × m] for the task set (Line 3). Next, it tries to find an unconsidered task Ti whose share value is minimum in the share matrix (at some core) (Line 5). Then, it attempts to find a core Vj where Ti can be fully allocated in the frame at the current frequency level fr[j] of Vj (Line 6). Since there are discrete frequency levels in the system, there may be a scenario where a core has empty slots at its current operating frequency level; hence, we may be able to allocate Ti using those idle slots. If such a core is found, then EA-ALLOCATE simply updates the allocation matrix AM with the share value (Line 8). If such a core is not found, it computes the frequency level for each core at which Ti's corresponding share can be allocated (Line 10). Then, it selects the core and the frequency which lead to the minimum change in energy consumption for the system (Lines 14 and 15), or else it adds Ti to a list of migrating tasks L1 (Line 12). Such a mechanism helps the algorithm allocate a task to the core which not only requires the minimum change in the core's frequency level but also leads to the minimum increase in the energy consumption of the system. This strategy carries on till all tasks have been considered. Finally, it calls ALLOCATE-MIGRATE (Algorithm 3) to allocate the migrating tasks from the list L1.
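The share computation in Line 3 is not spelled out as a formula in the text, but, consistent with the worked example in Section 5.1 (Table 2, where |G1| = 10), a task's share on a core is its utilization on that core scaled by the frame length. A sketch under that assumption:

import math

def share_matrix(u, frame_len):
    # sh[i][j]: time slots Ti must receive on Vj within the frame.
    # round(..., 9) guards against floating-point noise before the ceiling;
    # the ceiling itself is our assumption for the non-integral case (the
    # paper's example only produces integral shares).
    return [[math.ceil(round(uij * frame_len, 9)) for uij in row] for row in u]

u = [[0.7, 0.5, 0.9, 1.2], [0.4, 0.5, 0.8, 0.8]]   # first two rows of Table 2a
print(share_matrix(u, 10))   # [[7, 5, 9, 12], [4, 5, 8, 8]]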

Algorithm 2: EA-ALLOCATE
Input: T, V, F, Gk, AM, fr
Output: AM and fr
1  Let sh[n × m] be the share matrix at Gk
2  Initialize fr[j] = fmin for all Vj
3  Compute the share matrix sh for T in Gk
4  for i = 1 to n do
5      Find the unconsidered Ti with minimum share from sh
6      Find a core Vj where it can be allocated at the fr[j] level
7      if Vj is found then
8          Update AM[i][j] = sh[i][j]
9      else
10         Find the fr[j] level for each core Vj at which Ti can be allocated with share sh[i][j]
11         if Ti cannot be scheduled on any core then
12             Add Ti at the end of the migrating task list L1
13         else
14             Choose a Vj where it causes minimum energy change
15             Raise fr[j] to the appropriate level and set AM[i][j] = sh[i][j]
16 ALLOCATE-MIGRATE(L1, V, AM, fr, F, Gk)
17 return AM and fr

Algorithm 3: ALLOCATE-MIGRATE
Input: L1, V, AM, fr, F, Gk
Output: AM and fr
1  Let sumi denote the summation of frequency-scaled shares allocated to Ti in the system
2  while L1 is not empty do
3      Extract Ti from L1 and set sumi = 0
4      Let L2 be a list sorted in non-decreasing order of Ti's shares on the cores
5      Add ⟨Ti, j, sh[i][j]⟩ to L2, for j = 1 to m
6      Let usi be the normalized unallocated share of Ti
7      while L2 is not empty do
8          Extract the first entry ⟨i, j, sh[i][j]⟩ from L2
9          Compute the unused capacity of Vj at fmax: ucj
10         if ucj ≠ 0 then
11             Compute the unallocated share of Ti on Vj: usi
12             if usi > ucj then
13                 Set fr[j] to the highest level, i.e. fmax
14                 Set AM[i][j] = ucj and sumi += (ucj / fr[j])
15                 Update the normalized unallocated share of Ti: usi = (usi − ucj)/ui,j
16             else if sumi + ucj − ⌈usi⌉ ≤ |Gk| then
17                 Set fr[j] to the maximum level which avoids sumi + (ucj − ⌈usi⌉) > |Gk|
18                 Set AM[i][j] = (ucj − ⌈usi⌉)
19             else
20                 Declare T infeasible
21 return AM and fr
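The minimum-energy-change selection in lines 14-15 of Algorithm 2 can be pictured with the following sketch: for each candidate core, find the smallest discrete frequency that absorbs the task's share on top of the core's existing load, and pick the core whose frequency raise adds the least energy under P = f³. Inputs and helper names are illustrative assumptions:

def next_level(freqs, needed):
    # Smallest available discrete frequency >= needed, or None if infeasible.
    return next((f for f in sorted(freqs) if f >= needed), None)

def pick_min_energy_core(load, sh_row, fr, freqs, frame_len):
    # load[j]: slots already allocated on Vj; sh_row[j]: Ti's share on Vj;
    # fr[j]: current frequency level of Vj. Energy change over the frame
    # is (f_new^3 - f_old^3) * frame_len under the cubic power model.
    best = None
    for j, sh in enumerate(sh_row):
        f_new = next_level(freqs, (load[j] + sh) / frame_len)
        if f_new is None:
            continue                      # Ti does not fit on Vj alone
        delta = (f_new ** 3 - fr[j] ** 3) * frame_len
        if best is None or delta < best[1]:
            best = (j, delta, f_new)
    return best                           # None -> Ti joins the migrating list L1

freqs = [0.25, 0.32, 0.40, 0.46, 0.53, 0.60,
         0.67, 0.75, 0.82, 0.89, 0.96, 1.00]      # Table 1
print(pick_min_energy_core([0, 4], [7, 5], [0.25, 0.40], freqs, 10))
# (0, 4.0625, 0.75) -> the first core is the cheaper choice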

4.3. ALLOCATE-MIGRATE

The pseudo-code of ALLOCATE-MIGRATE is shown in Algorithm 3. It attempts to schedule tasks which require more than one core for their execution. First, it extracts a task Ti from the list L1 and creates a list L2, sorted in non-decreasing order of shares, to keep track of the normalized unallocated shares of Ti (Lines 2 to 4), where each node ⟨Ti, j, sh[i][j]⟩ in L2 stores Ti's share sh[i][j] on core Vj. Then, it extracts the first element (say, ⟨i, j, sh[i][j]⟩) from L2 and computes the unused capacity ucj of Vj at frequency fmax (Line 8). If ucj is non-zero, then ALLOCATE-MIGRATE computes the unallocated share of Ti with respect to Vj at fmax, i.e., usi (Line 10). While utilizing the unused capacity of Vj, there are two possibilities (Lines 11 to 17):

• usi > ucj: The unallocated share of Ti is greater than the unused capacity of Vj. Hence, Ti is partially scheduled on Vj, the normalized unallocated share of Ti is updated, and the frequency level of Vj is set to the maximum.

• usi ≤ ucj: The unused capacity of core Vj is sufficient to meet the unallocated demand of Ti. Hence, the task Ti is allocated on Vj and the frequency level of Vj is set appropriately. It may be noted that scaling of frequency might lead to parallel execution of a task on multiple cores. Hence, the algorithm verifies that the summation of the scaled shares of Ti does not exceed 1 across the cores. Since Ti's allocation is now complete, usi is reset to 0.

In the scenario where a task cannot be allocated even across multiple cores (Line 18), HEARS declares that the scheduling of Ti on V is infeasible.

4.4. EA-SCHEDULE

After a successful task-to-core allocation in a frame, HEARS invokes EA-SCHEDULE (Algorithm 4). EA-SCHEDULE schedules each task Ti on its allocated cores such that Ti is not executed simultaneously on more than one core. In order to do so, it uses the following set of guidelines (HEARS guidelines): i. tasks are scheduled in free slots from left to right in a frame; ii. the task with the highest number of migrations is always scheduled first; iii. a migrating task is scheduled in a stair-case fashion to avoid overlapping; and iv. the schedule of a non-migrating task is broken into multiple chunks when it cannot be scheduled continuously on its allocated core. These guidelines help generate a feasible schedule for a task set on a heterogeneous platform.

Algorithm 4: EA-SCHEDULE
Input: T, V, AM, SM
Output: Schedule matrix SM
1  Prepare a list of tasks MGR sorted by the number of migrations required for them (highest first)
2  for i = 1 to n do
3      Extract Ti from the front of MGR
4      for each Vj (j = 1 to m) with AM[i][j] ≠ 0 do
5          Schedule Ti's allocated share on Vj according to the HEARS guidelines
6  return SM

Migration Overheads: In our approach, when a task cannot be fully allocated to any single core, it is carefully allocated and scheduled across multiple cores in a time-slice such that its execution on one core does not overlap with its execution on any other core in the system. From the partitioning and scheduling mechanism discussed above, we may observe that a particular core in the system may hold a set of fixed tasks along with either: i. no other migrating tasks, ii. multiple terminating migrating tasks, iii. one non-terminating migrating task, or iv. one non-terminating migrating task and multiple terminating migrating tasks. Cores having only fixed tasks and terminating migrating tasks do not cause any migration in a time-slice. Whenever a non-terminating migrating task is scheduled on a particular core, that core will not have any remaining capacity to allot any other task. Hence, the number of migrations at a core with a non-terminating migrating task is restricted to only 1. There can be at most m − 1 cores with non-terminating migrating tasks in any time-slice. Therefore, the number of migrations in a time-slice using our strategy is bounded by m − 1.

Dynamic Task Arrivals/Departures: A periodic task can arrive or depart at any time in a dynamic system. When a new task Ti, having execution requirement eij (on a reference core Vj) and period pi, arrives at time t during the execution of a hyperperiod H (of length |H|), our algorithm first checks whether there is sufficient time to finish the task if its execution starts after the hyperperiod. The execution of Ti can be delayed till the start of the next hyperperiod only if pi − (eij + |H| − t) ≥ 0. In this scenario, Ti is added to the ready queue at the end of the current hyperperiod. Otherwise, the system suspends its execution at the current time instant, recomputes new frames taking Ti into consideration, and then prepares a new schedule for the tasks. When a task departs from the system, its entry is deleted from the list of ready tasks and it is not considered for scheduling from the next frame.

4.5. Time Complexity Analysis

As stated earlier, the algorithm HEARS is composed of four functions, namely HEARS (Algorithm 1), EA-ALLOCATE (Algorithm 2), ALLOCATE-MIGRATE (Algorithm 3) and EA-SCHEDULE (Algorithm 4). These four functions work in unison to prepare a schedule for the task set T on the heterogeneous platform V. The time complexity analysis for each algorithm is as follows:

• ALLOCATE-MIGRATE tries to allocate all tasks which must be allocated across more than one core. It iterates over the task list L1 and extracts a task from it at each iteration, which takes Θ(1) time. Then it constructs a non-decreasing order sorted list L2 for each extracted task based on their share values; L2 takes Θ(m log m) time to construct [32]. Then, for each entry ⟨Ti, j, sh[i][j]⟩ in L2, the share value of Ti is appropriately scaled and added to Vj. Since we know the allocated shares on Vj, computing the operating frequency required to allocate Ti takes Θ(1) time. Hence, the overall time complexity of ALLOCATE-MIGRATE (Algorithm 3) is Θ(n × (m log m + m)) = Θ(nm log m).

• Initially, EA-ALLOCATE sets the frequency level of each core to fmin, which takes Θ(m) time. Next, the share matrix sh[n × m] can be determined in Θ(n × m) time [19]. The algorithm then tries to allocate each task onto the available cores. In order to do so, it finds an unconsidered task with minimum share value from sh; this operation has a time complexity of Θ(mn) [19]. Next, searching for the core where the task can be allocated at the core's current frequency level and updating AM takes Θ(m) time. If such a core is not found, determining the appropriate frequency level for each core where the task can be allocated, and then searching for the core leading to the minimum change in energy consumption, can be achieved in Θ(m) time. If none of the cores can fully accommodate the task, the algorithm calls ALLOCATE-MIGRATE, which has a time complexity of Θ(nm log m), as determined earlier. Hence, the time complexity of EA-ALLOCATE (Algorithm 2) is Θ(m) + Θ(mn) + Θ(n × (mn + m)) + Θ(mn log m) = Θ(mn(n + log m)).

• EA-SCHEDULE tries to find a feasible schedule for each task (there may be at most n tasks) in a frame. To achieve this, a sorted list of tasks based on their number of migrations can be prepared in Θ(n log n) time [32]. Next, it selects the task with the highest number of migrations, which can be achieved in Θ(1) time, and schedules the task across m cores in the worst case. Hence, the time complexity of EA-SCHEDULE (Algorithm 4) is Θ(n log n) + Θ(nm) = Θ(n(m + log n)).

• At the start, HEARS finds the set of frames in a hyperperiod using deadline-partitioning, whose time complexity is Θ(n log n) [19]. Next, it calls EA-ALLOCATE and EA-SCHEDULE for each frame. Let L be the total number of frames in the time interval [0, H]. Hence, the overall time complexity of HEARS (Algorithm 1) for a hyperperiod is L × (Θ(mn(n + log m)) + Θ(n(m + log n))) = Θ(Lmn(n + log m + (log n)/m)).

5. Experimental Set Up and Results

We have implemented the HEARS algorithm and compared it against MaxMin-M, a variation of the MaxMin [13] algorithm. MaxMin is a DVFS-based energy-aware task partitioning scheme for periodic tasks executing on a heterogeneous multicore platform. Before presenting the detailed experimental set up and results, we first provide an overview of MaxMin.

Overview of MaxMin [13]: MaxMin is a heuristic energy-aware task allocation strategy targeted towards heterogeneous processing platforms. It uses a parameter called Energy Density EDij, defined as the dynamic energy consumption rate of the ith task at the highest operating frequency fmax of processing core j. The algorithm MaxMin works in three phases. In the first phase, it finds the Maximum Energy Density EDi^max (= max_{j=1..m} EDij) and the Minimum Energy Density EDi^min (= min_{j=1..m} EDij) for each task Ti over all cores, and calculates the difference EDi^diff (= EDi^max − EDi^min). Each task Ti is then inserted into a list LED, which is sorted in non-increasing order of EDi^diff. In the second phase, each task in LED (starting with the first) is allocated to its most preferred core such that its entire execution demand can be satisfied on that core. In the last phase, MaxMin finds a suitable operating frequency for each core based on the workloads assigned in the previous phase.

Modified MaxMin (MaxMin-M) Algorithm: The basic MaxMin algorithm does not allow a single task to be allocated to multiple cores, where each core satisfies a distinct fraction of its total execution demand. The literature on bin-packing strategies [5, 6] shows that such a mechanism, which does not allow inter-core task migrations, may lead to very poor resource utilization. Hence, we have used a modified version of MaxMin called MaxMin-M, which embeds the strategy within a deadline partitioning framework. The MaxMin-M algorithm divides time into frames demarcated by the arrivals/departures of all tasks in the system (also known as Deadline Partitioning (DP)). At the beginning of each frame, MaxMin-M determines the proportional execution share of each task within the frame. Inside each frame, the basic MaxMin algorithm is applied for energy-aware task allocation on the heterogeneous cores. The system re-synchronizes globally at the end of every frame. Such a strategy enables MaxMin-M to allow migration at every frame boundary and hence deliver significantly better resource utilization compared to basic MaxMin. In the next section, we explain the working principle of MaxMin-M and compare it against our proposed algorithm using an illustrative example, which demonstrates how HEARS uses its finely tuned task-to-core allocation strategy to provide better energy savings than MaxMin-M.
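The first phase of MaxMin can be sketched as below. The normalized per-task energy on a core is taken as ui,j³, which reproduces Table 3b under the cubic power model (this closed form is our inference from the worked example, not an explicit formula in [13]); the resulting order matches the LED sequence of Section 5.1:

def maxmin_order(u):
    # Phase 1 of MaxMin: energy densities at f_max and the max-min
    # difference per task; tasks are sorted by non-increasing difference.
    ed = [[uij ** 3 for uij in row] for row in u]
    diff = [max(row) - min(row) for row in ed]
    return sorted(range(len(u)), key=lambda i: -diff[i])

u = [[0.7, 0.5, 0.9, 1.2], [0.4, 0.5, 0.8, 0.8], [1.0, 0.8, 0.7, 0.4],
     [0.9, 1.3, 1.0, 0.6], [0.6, 0.6, 1.2, 1.4], [1.0, 0.6, 0.7, 1.7],
     [1.0, 0.6, 0.1, 1.7]]                          # Table 2a
print([i + 1 for i in maxmin_order(u)])   # [7, 6, 5, 4, 1, 3, 2]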

Figure 1: Schedule for the example. (a) Final energy-aware schedule using HEARS. (b) Final energy-aware schedule using MaxMin-M.

5.1. An Illustrative Example

Let us consider a system consisting of a set of seven real-time periodic tasks, T = {T1, T2, ..., T7}, to be scheduled on a heterogeneous multicore platform consisting of four cores, V = {V1, ..., V4}. The utilization matrix U[7×4] is given in Table 2a. The period (as well as deadline) of each task is: p1 = 10, p2 = 20, p3 = 10, p4 = 20, p5 = 40, p6 = 40, p7 = 40. According to HEARS (Algorithm 1), we first use deadline partitioning to compute the current frame G1 = [0, 10). Next, HEARS calls EA-ALLOCATE to allocate the task set onto the cores. EA-ALLOCATE starts by setting the frequency levels of all cores to the minimum, i.e. the frequency of every core is set to 0.25 (refer to Table 1). The algorithm proceeds by computing the share matrix sh[7×4], as shown in Table 2b. T7 is the first task to be considered because it has the lowest share value, 1, at V3. Hence, it is allocated to V3, but we do not need to change the frequency of V3 because its current operating frequency is sufficient to execute the share of T7. The energy consumption E of the system remains at the same level of 0.0625 (= 4 × 0.25³). Then T2 is considered, because it has the lowest share value, 4 (at V1), among the remaining unconsidered tasks. Hence, T2 is allocated to V1 and V1 is set to frequency 0.4, which is the nearest available frequency (from Table 1) sufficient to complete the execution of T2's share. The energy consumption E of the system becomes 0.11 (= 0.4³ + 3 × 0.25³). Next, T3 is considered, with a share value of 4 on V4. The frequency of V4 is set to 0.4 and E becomes 0.159 (= 2 × (0.4³ + 0.25³)). Then, T1 is allocated to V2 with frequency 0.53, T4 to V4 with the frequency set to fmax, i.e. 1, and T5 to V1 with frequency fmax. Next, we find that T6 cannot be fully allocated to any single core. Hence, it is added to the migrating task list L1. Since no task remains for consideration, ALLOCATE-MIGRATE is called with L1.

Table 2: Example (Part I)

(a) Utilization Matrix

U[7×4]   V1    V2    V3    V4
T1       0.7   0.5   0.9   1.2
T2       0.4   0.5   0.8   0.8
T3       1.0   0.8   0.7   0.4
T4       0.9   1.3   1.0   0.6
T5       0.6   0.6   1.2   1.4
T6       1.0   0.6   0.7   1.7
T7       1.0   0.6   0.1   1.7

(b) Share Matrix

sh[7×4]  sh_i,1  sh_i,2  sh_i,3  sh_i,4
T1          7       5       9      12
T2          4       5       8       8
T3         10       8       7       4
T4          9      13      10       6
T5          6       6      12      14
T6         10       6       7      17
T7         10       6       1      17

ALLOCATE-MIGRATE extracts T6 from L1 and creates the sorted list L2 for T6. Next, it iterates over the nodes of L2. It starts by extracting ⟨6, 2, 6⟩ from L2, but finds that the remaining capacity uc2 of V2 at fmax is only 5. Hence, it allocates T6 on V2 for 5 time-slots and sets the operating frequency of V2 to fmax. The remaining share of T6 becomes 1 (= 6 − 5). Then, the algorithm extracts the next node ⟨6, 3, 7⟩ from L2. The normalized unallocated share us6 of T6 is computed as ⌈1 × (7/6)⌉ = 2. It finds that the current frequency of V3 is not sufficient to execute T7 plus the remaining share of T6. Hence, T6 is allocated to V3, with the frequency raised to 0.4, for 2 time-slots. The allocation matrix AM for G1 is shown in Table 3a. Next, HEARS calls EA-SCHEDULE to schedule the tasks on the cores with respect to AM. The final schedule is shown in Figure 1a. Therefore, the overall percentage of fractional power saved in the system is: P = (4 − (1.0³ + 1.0³ + 0.4³ + 1.0³))/4 = 23.4%.

Table 3: Example (Part II)

(a) Allocation Matrix

AM[7×4]  V1   V2   V3   V4
T1        0    5    0    0
T2        4    0    0    0
T3        0    0    0    4
T4        0    0    0    6
T5        6    0    0    0
T6        0    5    2    0
T7        0    0    1    0

(b) Normalized energy consumption

E[7×4]   V1      V2      V3      V4      EDi^diff
T1       0.343   0.125   0.729   1.728   1.603
T2       0.064   0.125   0.512   0.512   0.448
T3       1.000   0.512   0.343   0.064   0.936
T4       0.729   2.197   1.000   0.216   1.981
T5       0.216   0.216   1.728   2.744   2.528
T6       1.000   0.216   0.343   4.913   4.697
T7       1.000   0.216   0.100   4.913   4.813

Now, let us try to solve the same example using MaxMin-M. The normalized energy consumption requirements of the tasks on the different cores, along with the values of EDi^diff, are shown in Table 3b. The sorted sequence of tasks (LED) is obtained as: T7 → T6 → T5 → T4 → T1 → T3 → T2. Using this sequence, the final energy-aware schedule is shown in Figure 1b. Under this allocation, the operating frequencies are found to be 1.0 (= (4 + 6)/10), 0.6 (= 6/10), 1.0 (= 10/10) and 1.0 (= (6 + 4)/10) for the cores V1, V2, V3 and V4, respectively. Therefore, the overall percentage of fractional power saved in the system is: P = (4 − (1.0³ + 0.6³ + 1.0³ + 1.0³))/4 = 19.6%. Hence, we can observe that, using a more finely tuned energy-aware heuristic strategy, HEARS is able to achieve better energy savings than MaxMin-M.

5.2. Experimental Set Up

All our simulations have been run for a total execution time of 100000 time slots on heterogeneous systems (Table 4). For each set of input parameters, we ran the simulation on 50 different test cases. A parameter called Utilization Factor (UF) is used to obtain a measure of the resource utilization corresponding to a given task set. UF is defined as (Σ_{i=1}^{n} avg_{j=1}^{m}(ui,j))/m, where Σ_{i=1}^{n} avg_{j=1}^{m}(ui,j) is the summation of the average utilizations of the tasks over the m cores. To create task sets with a specific UF, the randomly generated utilization values have been scaled appropriately. Results are generated for various distinct utilization factors. The following three metrics have been used to compare the performance of our proposed algorithm against MaxMin-M. The first metric, ARat, defines the acceptance ratio as the number of task sets successfully scheduled by the system against the total number of task sets submitted to it. The second metric is NPow, which represents the normalized power consumption of the system, and the last metric is CSC, which represents the average context switch cost per core per time slot (in µs). Two sets of experiments have been carried out to validate and compare our proposed algorithm with MaxMin-M. In the first set, we have used benchmark programs to test the algorithms in real-life situations, whereas in the second set, extensive simulation-based experiments have been carried out using synthetic task sets to validate the efficacy of the algorithms over the varied scenarios that may be encountered.
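The UF metric itself is a one-liner; a sketch (helper name ours):

def utilization_factor(u):
    # UF = (sum over tasks of each task's average utilization across the
    # m cores) / m, as defined above.
    m = len(u[0])
    return sum(sum(row) / m for row in u) / m

u = [[0.7, 0.5, 0.9, 1.2], [0.4, 0.5, 0.8, 0.8]]
print(utilization_factor(u))   # ~0.3625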

Framework for Benchmarks: We selected a set of 14 tasks from the Parsec [14] and Mälardalen [15] benchmarks to evaluate the performance of our algorithm on a system having 4 cores. A detailed discussion of the procedure for measuring the execution requirements of each program is presented in [33], and the results are listed in Table 5. These values have been obtained with the help of gem5 [34] for an Intel Xeon processor, 65 nm CMOS technology, operating at 1.5 GHz. To create task sets with a specific UF, the periods of these tasks have been generated appropriately.

Table 4: Micro-architectural parameters used for simulations

FSB speeds          667 MT/s to 1066 MT/s
Min. feature size   65 nm
Instruction set     x86
Microarchitecture   NetBurst
L2 cache            4 MB

Table 5: Execution requirements of programs from the Parsec [14] and Mälardalen [15] benchmarks

Application   Execution Time (in ms)   Application   Execution Time (in ms)
body          3120                     stream        11820
can           12300                    swap          34500
fluid         960                      x264          60
freq          1440                     bsort         9
duff          73                       edn           62
lms           146                      ndes          8340
qurt          130                      select        135

Framework for Synthetic Tasks: The performance of HEARS has been evaluated and compared through a set of carefully chosen simulation-based experiments on a system having 8 cores. Each task set considered is composed of randomly generated hypothetical periodic tasks whose execution requirements are drawn from a normal distribution with standard deviation σe = 10 and mean µe = 50. The size of the task sets has been varied from n = 10 to n = 50. The task sets have a high Utilization Factor (UF) of 0.9. Entries in the utilization matrix (U[n×m]) are also obtained from a normal distribution, with standard deviation σu = 0.2 and mean µu = 0.4, and they have been scaled appropriately.
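A sketch of this generation procedure (the clamping and the exact rescaling step are our assumptions; the paper only states the distribution parameters):

import random

def synthetic_task_set(n, m, uf, mu_e=50, sigma_e=10, mu_u=0.4, sigma_u=0.2):
    # Draw execution times and utilizations from the normal distributions
    # reported above, then rescale the utilization matrix to hit UF exactly.
    e = [max(1.0, random.gauss(mu_e, sigma_e)) for _ in range(n)]
    u = [[max(0.01, random.gauss(mu_u, sigma_u)) for _ in range(m)]
         for _ in range(n)]
    current_uf = sum(sum(row) / m for row in u) / m
    u = [[uij * uf / current_uf for uij in row] for row in u]
    return e, u

e, u = synthetic_task_set(n=10, m=8, uf=0.9)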

5.3. Experimental Results

5.3.1. Benchmark Program Results

To compare and analyze our results against those of MaxMin-M, we have measured ARat and NPow for different values of the utilization factor (UF). A detailed analysis of our experimental results is presented below.

Experiment 1 - Effect on Acceptance Ratio (ARat): In this experiment, we varied UF between 0.5 and 1.0. We may observe from Figure 2a that at lower UF, up to about 0.6, both MaxMin-M and HEARS exhibit similar performance. However, HEARS progressively outperforms MaxMin-M as UF increases beyond 0.6. This phenomenon may be attributed to the non-migrative execution policy of MaxMin-M within frames. At lower UF values (below 0.6), the probability that MaxMin-M can successfully allocate the entire execution share of every task onto a single core is very high. However, at higher UF values, the possibility of allocating every task onto cores without causing any migration becomes lower, and hence MaxMin-M shows poorer performance compared to HEARS. In particular, ARat reduces from 100% to 21% and from 100% to 41% for MaxMin-M and HEARS, respectively.

Experiment 2 - Effect on Normalized Power Consumption (NPow): As the value of ARat for MaxMin-M is significantly lower than that of HEARS, only those task sets which were successfully scheduled by both algorithms have been considered in this experiment. We may observe from Figure 2b that NPow is directly proportional to the UF of the system. This is because the residual capacity in the system decreases with an increase in the UF of the task set, thereby reducing the scope for lowering core operating frequencies; this leads to higher power consumption with an increase in UF. As discussed earlier, MaxMin-M sequentially allocates tasks to their most favoured cores in the order of their EDi^diff values. On the other hand, HEARS tries to allocate tasks to the cores where the change in the energy consumption of the system is minimum. This allows HEARS to perform significantly better than MaxMin-M in most scenarios where the system has some spare capacity, so that most tasks get assigned to cores where their energy efficiencies are comparatively high. In particular, the improvements in energy savings of HEARS over MaxMin-M were observed to be 37.08%, 18.15%, 7.51%, 5.18% and 2.38% for UF values of 0.5, 0.6, 0.7, 0.8 and 0.9, respectively.

Figure 2: Result comparison for benchmark programs with varying UF (m = 4). (a) Effect on Acceptance Ratio (ARat). (b) Effect on Normalized Power Consumption (NPow).

5.3.2. Synthetic Task Set Results

We have measured and compared the ARat and CSC values of HEARS and MaxMin-M by varying the number of tasks (n) between 10 and 50. The number of cores (m) and the Utilization Factor (UF) for these experiments were set to 8 and 0.9, respectively. A detailed analysis of our experimental results is given below.

Experiment 3 - Effect on Acceptance Ratio (ARat): We may observe from Figure 3a that there is a significant variation in the ARat values of HEARS and MaxMin-M as the number of tasks (n) varies. We can also observe that an increase in the number of tasks results in increased ARat values for both algorithms. As the utilization factor and the number of processing cores remain constant, an increase in the number of tasks leads to reduced individual task weights, which leads to fewer migrating tasks in the system. Also, for HEARS, there is a higher probability of tasks requiring parallel execution on more than one core when there are fewer tasks. Hence, both algorithms perform better as the number of tasks increases. In particular, the ARat values were found to be 36%, 56%, 86%, 92% and 100% for HEARS, and 25%, 32%, 37%, 46% and 57% for MaxMin-M.

Figure 3: Result comparison for synthetic tasks with varying number of tasks (m = 8 and UF = 0.9). (a) Effect on Acceptance Ratio (ARat). (b) Effect on Cost of Context Switch (CSC).

Experiment 4 - Cost of Context Switch (CSC): In our experiments, we have assumed the delay corresponding to a single context switch to be 5.24 µs, which is the actual average context switch overhead on a 24-core Intel Xeon L7455 system under typical workloads [35]. We computed the total number of context switches in the system over the entire simulation duration, and obtained the total delay due to context switches by multiplying the delay of a single context switch (5.24 µs [35]) by the total number of context switches. Next, we calculated the average context switch overhead (in µs) per core per time slot for both the HEARS and MaxMin-M algorithms. As observed from Figure 3b, MaxMin-M incurs slightly fewer context switches compared to HEARS at higher values of n. This is because MaxMin-M uses a non-migrative scheduling strategy within time-slices, whereas HEARS is a fully-migrative strategy. The higher ARat values achieved by HEARS are essentially effected by appropriately switching task executions between cores (using preemptions and migrations) so that tasks can be scheduled efficiently. To include the cost of context switches in the overall power consumption, the operating frequencies of the individual cores were adjusted accordingly. For example, from Figure 3b we observe that for 50 tasks (which is very high), HEARS suffers a context switch overhead of ∼2.543 µs while MaxMin-M suffers a context switch overhead of only ∼2.034 µs. So, HEARS incurs an extra overhead of 0.509 µs per core per time slot with respect to MaxMin-M. Given a time slot size of 1 ms, HEARS will therefore complete as much work in 1965 ms as MaxMin-M will complete in (1 ms / 0.509 µs ≈) 1964 ms. To overcome this additional context switching overhead, HEARS must execute at 1965/1964 (≈ 1.0005) times the operating frequency of MaxMin-M. As all our experiments have been conducted assuming only 12 available discrete frequency levels (refer to Table 1), this additional overhead in the calculated frequency very rarely translates into an increase in the discrete frequency level of a core. Hence, the cost of context switching did not have any significant effect on power consumption in our experiments.
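The frequency adjustment described above is simple arithmetic; a sketch reproducing the 1965/1964 figure:

def freq_multiplier(extra_overhead_us, slot_ms=1.0):
    # Extra context-switch overhead per core per time slot translates into
    # a required frequency scaling of slot / (slot - overhead); with the
    # 0.509 us measured above and a 1 ms slot this is 1965/1964 ~ 1.0005.
    slot_us = slot_ms * 1000.0
    return slot_us / (slot_us - extra_overhead_us)

print(freq_multiplier(0.509))   # ~1.000509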

6. Conclusion

In this paper, we have proposed a low-overhead heuristic strategy called HEARS for the energy-aware scheduling of a set of periodic tasks on a heterogeneous multicore platform. Experimental studies show that our proposed scheduling scheme is able to significantly improve acceptance ratios for task sets and the energy savings of the platform, compared to the state-of-the-art MaxMin [13].

References

[1] G. Buttazzo, Hard real-time computing systems: predictable scheduling algorithms and applications, Springer, 2011.
[2] S. K. Baruah, A general model for recurring real-time tasks, in: Proceedings of the 19th IEEE Real-Time Systems Symposium, 1998, pp. 114–122.
[3] R. I. Davis, A. Burns, A survey of hard real-time scheduling for multiprocessor systems, ACM Computing Surveys (CSUR) 43 (4) (2011) 35.
[4] S. Baruah, M. Bertogna, G. Buttazzo, Multiprocessor scheduling for real-time systems.
[5] S. Baruah, J. Carpenter, Multiprocessor fixed-priority scheduling with restricted interprocessor migrations, in: 15th Euromicro Conference on Real-Time Systems, 2003, pp. 195–202. doi:10.1109/EMRTS.2003.1212744.
[6] Bin packing and machine scheduling, American Mathematical Society.
[7] S. K. Baruah, N. K. Cohen, C. G. Plaxton, D. A. Varvel, Proportionate progress: A notion of fairness in resource allocation, Algorithmica 15 (6) (1996) 600–625.
[8] J. H. Anderson, A. Srinivasan, Early-release fair scheduling, in: 12th Euromicro Conference on Real-Time Systems, IEEE, 2000, pp. 35–43.
[9] S. Funk, et al., DP-Fair: a unifying theory for optimal hard real-time multiprocessor scheduling, Real-Time Systems 47 (5) (2011) 389.
[10] E. L. Lawler, J. Labetoulle, On preemptive scheduling of unrelated parallel processors by linear programming, Journal of the ACM (JACM) 25 (4) (1978) 612–619.
[11] H. S. Chwa, J. Seo, J. Lee, I. Shin, Optimal real-time scheduling on two-type heterogeneous multicore platforms, in: Real-Time Systems Symposium (RTSS), 2015, pp. 119–129. doi:10.1109/RTSS.2015.19.
[12] N. K. Jha, Low power system scheduling and synthesis, in: Proceedings of the 2001 IEEE/ACM International Conference on Computer-Aided Design, ICCAD '01, IEEE Press, Piscataway, NJ, USA, 2001, pp. 259–263.
[13] M. A. Awan, P. M. Yomsi, G. Nelissen, S. M. Petters, Energy-aware task mapping onto heterogeneous platforms using DVFS and sleep states, Real-Time Systems 52 (4) (2016) 450–485.
[14] Princeton University, Princeton application repository for shared-memory computers (PARSEC), http://parsec.cs.princeton.edu.
[15] J. Gustafsson, A. Betts, A. Ermedahl, B. Lisper, The Mälardalen WCET benchmarks – past, present and future, OCG, Brussels, Belgium, 2010, pp. 137–147.
[16] Nvidia, Tegra processors, https://www.nvidia.com/object/tegra.html.
[17] G. Raravi, et al., Task assignment algorithms for two-type heterogeneous multiprocessors, Real-Time Systems 50 (1) (2014) 87–141.
[18] S. K. Baruah, et al., ILP-based approaches to partitioning recurrent workloads upon heterogeneous multiprocessors, in: 28th Euromicro Conference on Real-Time Systems (ECRTS), IEEE, 2016, pp. 215–225.
[19] S. Moulik, R. Devaraj, A. Sarkar, Hetero-sched: A low-overhead heterogeneous multi-core scheduler for real-time periodic tasks, in: 2018 IEEE 20th International Conference on High Performance Computing and Communications (HPCC/SmartCity/DSS), 2018, pp. 659–666. doi:10.1109/HPCC/SmartCity/DSS.2018.00117.
[20] M. Bhatti, C. Belleudy, M. Auguin, An inter-task real time DVFS scheme for multiprocessor embedded systems, in: Conference on Design and Architectures for Signal and Image Processing (DASIP), 2010, pp. 136–143. doi:10.1109/DASIP.2010.5706257.
[21] G. Chen, K. Huang, A. Knoll, Energy optimization for real-time multiprocessor system-on-chip with optimal DVFS and DPM combination, in: 11th IEEE Symposium on Embedded Systems for Real-time Multimedia, 2013, pp. 40–40. doi:10.1109/ESTIMedia.2013.6704501.
[22] A. Ejlali, B. M. Al-Hashimi, P. Eles, A standby-sparing technique with low energy-overhead for fault-tolerant hard real-time systems, in: Proceedings of the 7th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis, ACM, 2009, pp. 193–202.
[23] M. A. Haque, H. Aydin, D. Zhu, Energy-aware standby-sparing technique for periodic real-time applications, in: 2011 IEEE 29th International Conference on Computer Design (ICCD), IEEE, 2011, pp. 190–197.
[24] Y. Guo, D. Zhu, H. Aydin, Generalized standby-sparing techniques for energy-efficient fault tolerance in multiprocessor real-time systems, in: 2013 IEEE 19th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), IEEE, 2013, pp. 62–71.
[25] Y.-w. Zhang, Energy-aware mixed partitioning scheduling in standby-sparing systems, Computer Standards and Interfaces 61 (2019) 129–136. doi:10.1016/j.csi.2018.06.004.
[26] V. Moghaddas, M. Fazeli, A. Patooghy, Reliability-oriented scheduling for static-priority real-time tasks in standby-sparing systems, Microprocessors and Microsystems 45 (2016) 208–215. doi:10.1016/j.micpro.2016.05.005.
[27] S. Tosun, Energy- and reliability-aware task scheduling onto heterogeneous MPSoC architectures, The Journal of Supercomputing 62 (1) (2012) 265–289.
[28] M. A. Awan, S. M. Petters, Energy-aware partitioning of tasks onto a heterogeneous multi-core platform, in: 2013 IEEE 19th Real-Time and Embedded Technology and Applications Symposium (RTAS), 2013, pp. 205–214.
[29] K. Li, X. Tang, Q. Yin, Energy aware scheduling algorithm for task execution cycles with normal distribution on heterogeneous computing systems, in: 2012 41st International Conference on Parallel Processing, 2012, pp. 40–47.
[30] S. Moulik, A. Sarkar, H. K. Kapoor, Energy aware frame based fair scheduling, Sustainable Computing: Informatics and Systems 18 (2018) 66–77. doi:10.1016/j.suscom.2018.03.003.
[31] W. Bao, C. Hong, S. Chunduri, S. Krishnamoorthy, L.-N. Pouchet, F. Rastello, P. Sadayappan, Static and dynamic frequency scaling on multicore CPUs, ACM Transactions on Architecture and Code Optimization 13 (4) (2016) 51:1–51:26. doi:10.1145/3011017.
[32] T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein, Introduction to Algorithms, 3rd Edition, The MIT Press, 2009.
[33] S. Bygde, A. Ermedahl, B. Lisper, An efficient algorithm for parametric WCET calculation, in: 2009 15th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, 2009, pp. 13–21. doi:10.1109/RTCSA.2009.9.
[34] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, et al., The gem5 simulator, ACM SIGARCH Computer Architecture News 39 (2) (2011) 1–7.
[35] A. Bastoni, B. B. Brandenburg, J. H. Anderson, Cache-related preemption and migration delays: Empirical approximation and impact on schedulability, in: 6th International Workshop on Operating Systems Platforms for Embedded Real-Time Applications (OSPERT), 2010.

Conflict of Interest

The authors declare that they have no conflict of interest.