
Building Real-Time Parallel Task Systems on Multi-cores: A Hierarchical Scheduling Approach

Tao Yang, Qingxu Deng∗, Lei Sun

Institute of Cyber Physical Systems Engineering, Northeastern University, Shenyang, China

∗Corresponding author. Email addresses: [email protected] (Tao Yang), [email protected] (Qingxu Deng), [email protected] (Lei Sun)


Abstract

Parallel real-time scheduling on multi-core platforms is widely studied in both academia and industry. The DAG (directed acyclic graph) is a widely-used parallel task model, but existing scheduling algorithms for DAG tasks suffer either from significant inaccuracy in schedulability analysis or from resource waste. In this paper, we propose a hierarchical scheduling framework for DAG tasks with constrained deadlines that provides a much more efficient resource sharing solution without distinguishing the types of tasks. We have implemented and evaluated our proposed hierarchical scheduling, and the results show that it has promising performance.


Keywords: Hierarchical, Parallel, DAG, Scheduling, Real-time

1. Introduction


Multi-cores are more and more widely used in real-time systems to meet their rapidly increasing requirements in performance and energy efficiency. The parallelism of software should be explored to fully utilize the computation capacity of multi-cores. The DAG (directed acyclic graph) is a widely-used model to describe parallel workload, where each vertex represents a piece of sequential workload and each edge represents the precedence constraint between its source and destination vertices. Many researchers have established parallel task models based on DAGs: [1] proposed the multi-DAG model and [2] proposed the conditional parallel task model. Our task model focuses only on the most basic DAG model.

Existing scheduling algorithms for real-time DAG tasks can be categorized into two classes: federated scheduling [3, 4] and global scheduling [5, 6, 7]. Under global scheduling, multiple DAG tasks share all the processors, and thus the overall resource utilization of the system is potentially high. However, the schedulability analysis for global scheduling suffers significant inaccuracy due to the complex interference among tasks. In federated scheduling, each task is assigned several dedicated processors on which it executes exclusively without being interfered by other tasks. The analysis of federated scheduling is easier, but it suffers resource waste due to fragmentation (in the worst case, half of the total resource is doomed to be wasted).

In this paper, we propose a new method to schedule real-time DAG tasks that leverages the strengths of both global scheduling and federated scheduling. Our new method adopts a hierarchical scheduling approach, which divides the overall scheduling problem into two parts:

1. scheduling DAG tasks on virtual processors;
2. scheduling virtual processors on physical processors.

More specifically, each DAG tasks is assigned with several dedicated virtual processors (and at runtime executes on them exclusively). With proper characterization of the resource provided by virtual processors, we can analyze each DAG task independently as in federated scheduling. On the other hand, virtual processors are scheduled on the physical processors at runtime which effectively enables the processor sharing among different DAG tasks. Therefore, our hierarchical scheduling approach inherits the strengths of both federated scheduling and global scheduling, and thus achieves better schedulability. We implement our proposed hierarchical scheduling on a realistic platform. The hardware is Intel(R) Core(TM) i7-7700 CPU with 3.6 GHz. There are a total of 8 processors and each one has 256-KB L1 caches and 1024-KB L2 caches and 8192kB L3 caches. The kernel with our scheduler is 4.9.30-x86 64GNU/Linux which is based on OpenMP 3.0 [8] and LitmusRT [9]. We evaluate its runtime performance. Although the introduction of hierarchical scheduling will will reduce performance, it can be tolerated. We conduct comprehensive experiments to evaluate the schedulability of our hierarchical scheduling against the state-of-the-art. The results show that our proposed method has more better performance than other state-ofthe-art analysis technicals for scheduling DAG tasks with constrained deadlines which are G-MEL and SF-XU. The remaining of the paper is organized as follows. In Section 2 we first introduce the task model. In Section 3 we give an overview of the hierarchical scheduling approach and introOctober 22, 2018

ACCEPTED MANUSCRIPT

duce some related concepts. Then we present details about how to schedule DAG tasks on virtual processors in Section 4, and then introduce the scheduling of computing resource on physical processors in Section 5. In Section 6, we introduce an implementation of our proposed method on a realistic platform, as well as the measurement of its runtime overhead. In Section 7 we present evaluations of schedulability of our proposed hierarchical scheduling approach and comparison with the state-ofthe art. We present related work in Section 8 and conclude the paper in Section 9.

Figure 1: An example of a DAG task.


2. Task Model

We consider the scheduling of a task set τ consisting of n tasks {τ_1, τ_2, ..., τ_n} executed on a multi-core platform consisting of M cores. Each task τ_k is represented by a DAG consisting of p vertices {v_k^1, v_k^2, ..., v_k^p} and several edges connecting them. Each vertex v_k^j has a WCET (worst-case execution time) c_k^j. Edges represent dependencies among vertices. A directed edge from vertex v_k^u to v_k^p means that v_k^p can only be executed after v_k^u is finished. In this case, v_k^u is called a predecessor of v_k^p, and v_k^p is called a successor of v_k^u. We say a vertex is eligible at a time instant if it is not finished and all its predecessors have been finished. Each DAG task is released recurrently with a period T_k and a relative deadline D_k. The total execution time C_k, utilization u_k and density δ_k of a DAG task τ_k are defined as

$$C_k = \sum_{v_k^j \in \tau_k} c_k^j, \qquad u_k = \frac{C_k}{T_k}, \qquad \delta_k = \frac{C_k}{D_k}.$$

We use L_k to denote the sum of c_k^u over each vertex v_k^u on the longest chain (also called the critical path) of task τ_k, i.e., the execution time of τ_k when it executes exclusively on an infinite number of cores, which can be computed in linear time with respect to the size of the DAG [10, 11]. A task τ_k releases an infinite number of jobs, and the time interval between the release times of any two successive jobs is no less than its period T_k. All jobs released by a task have the same DAG structure. Let J_{k,a} denote the a-th job instance of task τ_k, and let r_{k,a} and d_{k,a} be the absolute release time and absolute deadline of job J_{k,a}, respectively. All the vertices of J_{k,a} must be executed after the release time r_{k,a} and completed before the deadline d_{k,a}. The difference between the absolute deadline and the absolute release time of a job equals its relative deadline. In this paper, we consider tasks with constrained deadlines, i.e., D_k ≤ T_k. We call the time interval [r_{k,a}, d_{k,a}] the scheduling window of J_{k,a}. An example of a DAG task τ_k is shown in Figure 1, with D_k = 20, T_k = 21 and the WCET of each vertex marked in the figure. The total execution time of the task is C_k = 30, its utilization is u_k = 10/7 and its density is δ_k = 3/2.
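To make these definitions concrete, the following sketch computes C_k and the critical-path length L_k of a DAG stored as an adjacency matrix. It assumes vertices are indexed in a topological order (every DAG admits one); the types and names are our own illustration, not part of the paper's implementation.

#define MAXV 32

/* A DAG task: adj[u][v] = 1 iff there is an edge from vertex u to vertex v.
 * Vertices are assumed to be indexed in topological order (u < v for every
 * edge u -> v). The struct layout is an illustrative assumption. */
typedef struct {
    int nv;               /* number of vertices p */
    int c[MAXV];          /* WCET c_k^j of each vertex */
    int adj[MAXV][MAXV];  /* precedence constraints */
    int T, D;             /* period T_k and constrained deadline D_k */
} dag_task;

/* Total execution time C_k: the sum of all vertex WCETs. */
int total_c(const dag_task *tk) {
    int C = 0;
    for (int j = 0; j < tk->nv; j++)
        C += tk->c[j];
    return C;
}

/* Critical-path length L_k: one pass over the vertices in topological
 * order, i.e., linear in the size of the DAG [10, 11]. */
int critical_path(const dag_task *tk) {
    int fin[MAXV];   /* fin[v] = length of the longest chain ending at v */
    int L = 0;
    for (int v = 0; v < tk->nv; v++) {
        int start = 0;   /* latest finish among v's predecessors */
        for (int u = 0; u < v; u++)
            if (tk->adj[u][v] && fin[u] > start)
                start = fin[u];
        fin[v] = start + tk->c[v];
        if (fin[v] > L)
            L = fin[v];
    }
    return L;
}

For the task of Figure 1, these routines would return C_k = 30 and L_k = 12, giving u_k = C_k/T_k = 10/7 and δ_k = C_k/D_k = 3/2.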

3. Overview of the Approach

In this work, instead of scheduling tasks on physical processors directly, we adopt a hierarchical scheduling approach to schedule tasks. More specifically, we schedule each task on its dedicated virtual processors (VPs) and schedule all VPs together on physical processors. This approach involves an off-line design part and an on-line scheduling part:

• Off-line design: The off-line design phase decides how many VPs are assigned to each DAG task and how much resource is provided by each of these VPs.

• On-line scheduling: At runtime, the on-line scheduling consists of two components (as shown in Figure 2):

  – A dispatcher that dispatches the workload of each DAG task to its dedicated VPs.

  – A scheduler that schedules the VPs on all the physical processors.

Figure 2: Illustration of on-line scheduling.

In the following, we introduce the characterization of the service provided by VPs.

3.1. Virtual Processors

The m dedicated VPs of task τ_k ∈ τ compose a virtual platform Π = {P_1, P_2, ..., P_m}. The virtual platform Π starts to provide service along with the release of each job generated by τ_k and stops providing service when a job is finished. Without loss of generality, we assume the system starts at time 0. Then for any time instant t ∈ R+, the computing resource provided by each VP can be characterized by the resource function:

$$f_i(t) = \begin{cases} 1 & P_i \text{ is available at time } t \text{ for } \tau_k \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

Intuitively, the resource function represents whether the resource of a processor can be used by DAG task τ_k at each time instant t. If a job J_{k,a} of τ_k finishes at time t_a, we have f_i(t) = 0 for all t with t_a ≤ t ≤ r_{k,a+1}.


Moreover, from (1), the cumulative capacity provided by a VP for a task in a time interval I = [t_1, t_2] is:

$$F_i(I) = \int_{t_1}^{t_2} f_i(t)\,dt.$$

By summing up the resource functions of all m virtual processors we obtain the parallelism of the virtual platform Π, i.e., the number of available resources provided by the virtual platform at each time instant t:

$$p(t) = \sum_{i=1}^{m} f_i(t). \qquad (2)$$

Figure 3: An example of (total) resource function: (a) resource functions of three virtual processors; (b) the resulting parallelism of the virtual platform.

Figure 3(a) shows the resource functions of 3 virtual processors, denoted as f_1(t), f_2(t), f_3(t) respectively. The cumulative capacity of f_1(t) during the time interval I = [0, 20] is calculated by F_1(I) = \int_0^{20} f_1(t)\,dt = 17. The parallelism of the virtual platform consisting of these 3 virtual processors is shown in Figure 3(b), where the value of the Y-axis at each time instant is the parallelism. In the following we define the parallelism supply function of the platform during a time interval I = [t_1, t_2].

Definition 1. The parallelism supply function S(I) = {S(I, 0), S(I, 1), ..., S(I, m)} represents the cumulative length of time for each parallelism δ ∈ [0, m] in a time interval I = [t_1, t_2], which can be calculated by:

$$S(I, \delta) = \int_{t_1}^{t_2} s(\delta, t)\,dt \qquad (3)$$

where s(δ, t) is defined as:

$$s(\delta, t) = \begin{cases} 1 & p(t) = \delta \\ 0 & p(t) \neq \delta \end{cases}$$

Consider the time interval I = [0, 20] of the total resource function in Figure 3(b). The parallelism supply function is S(I) = {S(I, 0), S(I, 1), S(I, 2), S(I, 3)}. S(I, 2) is calculated as S(I, 2) = \int_0^{20} s(2, t)\,dt = 4. Similarly, we have S(I, 0) = 0, S(I, 1) = 1, and S(I, 3) = 15.

Definition 2. The total amount of capacity provided by all m processors in an interval I = [t_1, t_2] is defined as:

$$P(I) = \int_{t_1}^{t_2} p(t)\,dt \qquad (4)$$

From Definition 1, the cumulative capacity provided by the virtual platform in the time intervals under each parallelism can be calculated by S(I, δ) × δ; then from (3) and (4), we have

$$P(I) = \sum_{\delta=0}^{m} S(I, \delta) \times \delta$$

For example, the total amount of capacity corresponding to the virtual platform in Figure 3(b) is:

$$P(I) = \int_0^{20} p(t)\,dt = \sum_{\delta=0}^{3} S(I, \delta) \times \delta = 54$$
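As a concrete illustration of Definitions 1 and 2, the sketch below discretizes the resource functions into unit slots and computes S(I, δ) and P(I). The availability matrix is chosen so that it reproduces the numbers of the Figure 3 example (F_1 = 17, S(I, 1) = 1, S(I, 2) = 4, S(I, 3) = 15, P(I) = 54) and is otherwise hypothetical.

#include <stdio.h>

enum { M = 3, T = 20 };  /* 3 VPs, interval I = [0, 20) in unit slots */

/* Availability matrix: f[i][t] = 1 iff VP i is available in slot t,
 * i.e., a sampled version of the resource function of Eq. (1). */
static const int f[M][T] = {
    {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,0,0,0},  /* P1: F1 = 17 */
    {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,0,1,1,1},  /* P2: F2 = 19 */
    {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 0,1,1,1,0},  /* P3: F3 = 18 */
};

int main(void) {
    int S[M + 1] = {0};  /* S[d] = cumulative length at parallelism d (Def. 1) */
    int P = 0;           /* total capacity P(I) (Def. 2) */

    for (int t = 0; t < T; t++) {
        int p = 0;                       /* parallelism p(t), Eq. (2) */
        for (int i = 0; i < M; i++)
            p += f[i][t];
        S[p]++;                          /* one unit slot at parallelism p */
        P += p;                          /* P(I) = sum over d of S(I,d)*d */
    }
    for (int d = 0; d <= M; d++)
        printf("S(I,%d) = %d\n", d, S[d]);
    printf("P(I) = %d\n", P);            /* prints 54 for this platform */
    return 0;
}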

In the following, we explain how to guarantee schedulability when scheduling a task set through VPs.

3.2. Schedulability

Let task set τ be scheduled on M processors. Since our approach divides the overall scheduling problem into two parts, scheduling workload on VPs and scheduling VPs on physical processors, it is sufficient for task set τ to be schedulable if the following two conditions are satisfied:

• Condition 1: Each task τ_k ∈ τ can guarantee its deadline when it is executed on its dedicated virtual platform Π.

• Condition 2: All virtual platforms are successfully scheduled on the M physical processors.

Condition 1 indicates that the dispatcher needs to provide guarantees to make sure there is no deadline miss when each DAG task is scheduled on its dedicated virtual platform. Condition 2 indicates that the scheduler must divide the computing resource of each physical processor properly, to make sure that the specific computing resource required by each virtual platform, i.e., the number of virtual processors and their resource functions, is satisfied. In the rest of the paper, we focus on how to make the above two conditions satisfied.

4. Schedule Tasks on Virtual Processors

In this section, we focus on the scheduling of tasks on virtual platforms, i.e., the problem of satisfying Condition 1. In the following, we only consider the scheduling of one task. Note that each virtual platform can only be accessed by its corresponding DAG task. Thus there is no interference from other tasks when a DAG task is dispatched on its virtual platform, and the scheduling of each DAG is independent from the others. Our target is to construct the virtual platform Π (i.e., the number of virtual processors and the resource function of each of them) on which τ_k's deadline can be guaranteed when scheduled by the dispatcher. In the following, we first introduce the scheduling behavior of the dispatcher at runtime. We then introduce the construction of the virtual platform for τ_k on which its deadline can be guaranteed.

4.1. Dispatcher

The dispatcher works under any work-conserving algorithm. Since the original concept of work-conserving is for unit-speed processors, we first extend this notion to virtual processors.

Definition 3 (Work-conserving). For each time instant t, a work-conserving scheduler never allows an available virtual processor (i.e., ∀P_i ∈ Π, f_i(t) = 1) to be idle as long as there is still work ready to be executed.

Intuitively, at each time instant, the dispatcher simply chooses eligible vertices of each task τ_k and puts them on its currently available virtual processors to be executed. Moreover, we distinguish the time intervals when a job of task τ_k is dispatched on Π as follows.

Definition 4 (Busy/non-busy interval). We say a time interval [t_1, t_2] is a busy interval if, for any time instant t ∈ [t_1, t_2], all available virtual processors (i.e., ∀P_i ∈ Π, f_i(t) = 1) are busy; otherwise we say it is a non-busy interval.

We define the key path when a task is scheduled by the dispatcher:

Definition 5 (Key path). The key path λ = {v_k^1, v_k^2, ..., v_k^u} of a job J_{k,a} generated by τ_k is defined recursively when J_{k,a} is scheduled on a virtual platform by the dispatcher:

• v_k^u is the latest finished vertex among all vertices in the DAG job.

• v_k^z is the latest finished vertex among all predecessors of v_k^{z+1}, for ∀z : 1 ≤ z ≤ u − 1.

A vertex is called a key vertex if it is on the key path; otherwise it is called a non-key vertex.

Lemma 1. The time interval between the finishing time of v_k^z and the starting time of v_k^{z+1} in the key path λ is a busy interval, for ∀z : 1 ≤ z ≤ u − 1.

Proof. Assume that the finishing time of v_k^z is t_f and the starting time of v_k^{z+1} is t_s, and that there exist some time units during the interval (t_f, t_s) when some available processors are idle. As v_k^z is the latest finished vertex among all predecessors of v_k^{z+1}, v_k^{z+1} must be ready to be executed right after the time instant t_f. Then by Definition 3, v_k^{z+1} must be executed before t_s at the idle time unit, reaching a contradiction with the assumption that v_k^{z+1} starts at time t_s. □

For example, Fig. 4 shows the execution when task τ_k in Fig. 1 is scheduled by the dispatcher on the virtual platform in Fig. 3. At time 0, v_k^1 and v_k^5 are ready to be executed and are scheduled on P_1 and P_2, respectively. At time 1, after v_k^1 is finished, v_k^2, v_k^3 and v_k^4 are all ready. Suppose v_k^2 and v_k^3 are chosen to be scheduled on P_1 and P_3, respectively. At time 7, v_k^5 is finished and both P_2 and P_3 are idle; the eligible vertex v_k^4 is scheduled on P_2. At time 12, P_2 is no longer available, so v_k^4 migrates from P_2 to P_3, which is available at time 12. At time 13, v_k^4 is finished, and v_k^7 becomes eligible and is scheduled on P_1.

Figure 4: Illustration of work-conserving algorithm.
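The dispatcher's behavior in Definition 3 can be sketched as a per-slot routine: keep vertices that are already running on still-available VPs, then fill every other available VP with some eligible vertex. The sketch below works on unit time slots; all identifiers are ours, and it deliberately leaves the choice among eligible vertices open, since any work-conserving choice satisfies Definition 3. A vertex whose VP becomes unavailable returns to the pool and may resume on another VP, which models the migration seen in the example above.

#define MAXV 32
#define MAXP 8

/* One slot of a work-conserving dispatcher (Definition 3). rem[v] is the
 * remaining execution of vertex v; a vertex is eligible when unfinished
 * and all its predecessors are finished. avail[i] is the sampled resource
 * function f_i(t) for this slot; running[i] is the vertex VP i works on,
 * or -1 when idle. */
void dispatch_slot(int nv, const int adj[MAXV][MAXV], int rem[MAXV],
                   int m, const int avail[MAXP], int running[MAXP])
{
    int busy[MAXV] = {0};
    /* keep vertices that continue on an available VP */
    for (int i = 0; i < m; i++) {
        if (running[i] >= 0 && (!avail[i] || rem[running[i]] == 0))
            running[i] = -1;          /* finished, or VP went unavailable */
        if (running[i] >= 0)
            busy[running[i]] = 1;
    }
    /* fill every remaining available VP with an eligible vertex */
    for (int i = 0; i < m; i++) {
        if (!avail[i] || running[i] >= 0) continue;
        for (int v = 0; v < nv; v++) {
            if (rem[v] == 0 || busy[v]) continue;
            int eligible = 1;
            for (int u = 0; u < nv; u++)
                if (adj[u][v] && rem[u] > 0) eligible = 0;
            if (eligible) { running[i] = v; busy[v] = 1; break; }
        }
    }
    /* execute the chosen vertices for one time unit */
    for (int i = 0; i < m; i++)
        if (running[i] >= 0)
            rem[running[i]]--;
}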

Recall that our target is to construct a virtual platform Π_k on which τ_k's deadline can be guaranteed when scheduled by the dispatcher. It is unrealistic to construct such a virtual platform without any guideline. In the following, we first investigate the schedulability when a DAG task is scheduled on a given virtual platform.

4.2. Schedulability of a DAG on a Virtual Platform

Since the virtual platform Π starts and stops providing service along with the release and completion of each job of τ_k, we first investigate the schedulability of a single job on a given virtual platform in its scheduling window. Let job J_{k,a} of task τ_k be released at time r_{k,a} and scheduled on a given virtual platform Π = {P_1, P_2, ..., P_m} in its scheduling window I_{k,a} = [r_{k,a}, r_{k,a} + D_k]. In the following, we focus on whether J_{k,a} is schedulable on Π, i.e., whether J_{k,a} is guaranteed to finish before its deadline r_{k,a} + D_k.

Theorem 1. Assume that job J_{k,a} is released at time r_{k,a} and scheduled on a virtual platform Π = {P_1, P_2, ..., P_m}. Then it must be finished before r_{k,a} + D_k if the following condition is satisfied:

$$L_k + \sum_{\delta=0}^{u} S(I_{k,a}, \delta) + \frac{C_k - L_k - \sum_{\delta=0}^{u} S(I_{k,a}, \delta) \times \delta}{u+1} \le D_k \qquad (5)$$

where I_{k,a} is the scheduling window of J_{k,a} and u is the value satisfying:

$$\sum_{\delta=0}^{u} S(I_{k,a}, \delta) \times \delta \le C_k - L_k \le \sum_{\delta=0}^{u+1} S(I_{k,a}, \delta) \times \delta \qquad (6)$$

Proof. We prove the theorem by contradiction. Assume that inequalities (5) and (6) hold but J_{k,a} misses its deadline at r_{k,a} + D_k. We first explore the features of the resource functions of Π_k satisfying inequalities (5) and (6). Inequality (6) indicates the following facts with regard to the resource provided by Π:

(i) The cumulative capacity provided in all time intervals with parallelism no greater than u is no more than C_k − L_k.

(ii) The cumulative capacity provided in all time intervals with parallelism no greater than u + 1 is no less than C_k − L_k, and the amount of workload C_k − L_k − \sum_{\delta=0}^{u} S(I_{k,a}, \delta) \times \delta is no more than the cumulative capacity provided in all time intervals with parallelism of u + 1.

Let

$$L_T = \sum_{\delta=0}^{u} S(I_{k,a}, \delta) + \frac{C_k - L_k - \sum_{\delta=0}^{u} S(I_{k,a}, \delta) \times \delta}{u+1}.$$

Then the value L_T represents the maximum cumulative length of intervals with parallelism no larger than u + 1 which can provide a total capacity of C_k − L_k. In other words, the platform provides at least C_k − L_k total capacity during any set of time intervals with cumulative length L_T within [r_{k,a}, r_{k,a} + D_k]. Let Ψ denote any set of disjoint time intervals where ∀I_j ∈ Ψ, I_j ⊆ [r_{k,a}, r_{k,a} + D_k] and \sum_{I_j \in \Psi} |I_j| = L_T. Then we have:

$$\sum_{I_j \in \Psi} P(I_j) \ge C_k - L_k \qquad (7)$$

Now consider a vertex v_k^z of J_{k,a} executed at time r_{k,a} + D_k. We construct the key path of J_{k,a} that ends with v_k^z recursively, denoted as π = {v_k^1, v_k^2, ..., v_k^z}. The length of the key path π that falls in [r_{k,a}, r_{k,a} + D_k] is denoted as L_π. We use Ψ_π to denote the set of time intervals where the key vertices are executed and Ψ̄_π to denote the set of all other time intervals during [r_{k,a}, r_{k,a} + D_k]. We use W(Ψ) to denote the workload completed in a set of time intervals Ψ. Then the workload finished in [r_{k,a}, r_{k,a} + D_k] is W(Ψ_π ∪ Ψ̄_π):

$$W(\Psi_\pi \cup \bar{\Psi}_\pi) = W(\Psi_\pi) + W(\bar{\Psi}_\pi) \ge L_\pi + W(\bar{\Psi}_\pi)$$

Since L_k is the length of the longest chain of J_{k,a}, we have L_k − L_π ≥ 0. Let Ψ̄_π1 and Ψ̄_π2 denote two sets of time intervals with cumulative lengths D_k − L_k and L_k − L_π respectively, where Ψ̄_π1 ∩ Ψ̄_π2 = ∅ and Ψ̄_π1 ∪ Ψ̄_π2 = Ψ̄_π. Then we have:

$$W(\Psi_\pi \cup \bar{\Psi}_\pi) \ge L_\pi + W(\bar{\Psi}_{\pi 1} \cup \bar{\Psi}_{\pi 2}) = L_\pi + W(\bar{\Psi}_{\pi 1}) + W(\bar{\Psi}_{\pi 2})$$

From Lemma 1, all virtual processors are busy during all time intervals in Ψ̄_π, so W(Ψ̄_π2) ≥ L_k − L_π, and from (5) we have L_T ≤ D_k − L_k; then from (7) we have:

$$W(\bar{\Psi}_{\pi 1}) \ge C_k - L_k$$

Thus we have:

$$W(\Psi_\pi \cup \bar{\Psi}_\pi) \ge L_\pi + C_k - L_k + L_k - L_\pi = C_k$$

which means the workload of J_{k,a} finished in [r_{k,a}, r_{k,a} + D_k] is no less than C_k, contradicting the assumption that J_{k,a} does not complete by r_{k,a} + D_k. So if conditions (5) and (6) are satisfied, job J_{k,a} is guaranteed to finish before its deadline, and the theorem is proved. □

To clarify the proof of the theorem, consider a job J_{k,a} of the task τ_k in Fig. 1 released at time 0 and scheduled on the virtual platform Π shown in Fig. 3. As shown in Figure 5, we first present the parallelism provided by Π_k in the time interval [r_{k,a}, r_{k,a} + D_k] = [0, 20] in descending order. Since C_k − L_k = 18, L_T is obtained by summing up 1 time unit of parallelism 1, 4 time units of parallelism 2, and 3 time units of parallelism 3 (1 × 1 + 4 × 2 + 3 × 3 = 18), and the maximum cumulative length to provide a total capacity of 18 equals 8 (1 + 4 + 3 = 8). Then, according to Theorem 1, J_{k,a} of the task τ_k in Fig. 1 is schedulable on the Π_k of Fig. 5.

Figure 5: The illustration of L_T.

4.3. Construction of Virtual Platform

In the following, we introduce how to construct a virtual platform in the scheduling window of a DAG job such that its deadline is guaranteed. Theorem 1 gives us a sufficient condition that a virtual platform needs to satisfy for a DAG job to be schedulable. However, there are still various design schemes for virtual platforms conforming to Theorem 1. In this subsection, we look into how to design an efficient virtual platform.

To be more clear, we first summarize the features of a virtual platform that satisfies the conditions in Theorem 1 as follows (a numeric sketch of the corresponding test follows the conditions):

Lemma 2. Assume T_u and T_n are two non-overlapping sets of time intervals in [r_{k,a}, r_{k,a} + D_k]. Based on Theorem 1, a job J_{k,a} released at time instant r_{k,a} is finished before r_{k,a} + D_k on a virtual platform Π if the virtual platform satisfies:

(1) |T_u| = L_k, and |T_u| + |T_n| = D_k,

(2) ∀t_u ∈ T_u, t_n ∈ T_n: p(t_u) ≥ p(t_n),

(3) the cumulative capacity provided in T_n is no less than C_k − L_k.
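As a numeric illustration of Theorem 1, the sketch below searches for the value u of condition (6) and then checks condition (5) via the quantity L_T used in the proof. Function and variable names are ours.

/* Theorem 1's test. S[0..m] is the parallelism supply function
 * S(I_{k,a}, d) over the scheduling window; L, C, D are L_k, C_k, D_k.
 * Returns 1 if conditions (5) and (6) hold. */
int thm1_schedulable(const int S[], int m, int L, int C, int D)
{
    double work = C - L;        /* workload off the critical path */
    double A = 0.0, len = 0.0;  /* sums of S[d]*d and S[d] for d <= u */
    for (int u = 0; u <= m; u++) {
        A   += S[u] * (double)u;
        len += S[u];
        double Anext = A + (u < m ? S[u + 1] * (u + 1.0) : 0.0);
        if (A <= work && work <= Anext) {      /* condition (6) found u */
            double LT = len + (work - A) / (u + 1);  /* L_T of the proof */
            return L + LT <= D;                /* condition (5) */
        }
    }
    return 0;   /* no u satisfies (6): report unschedulable */
}

For the running example (S = {0, 1, 4, 15}, L_k = 12, C_k = 30, D_k = 20) the routine finds u = 2 and L_T = 8 and reports the job schedulable, matching Figure 5.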

To design such a virtual platform, we have many choices. For example, Figure 6 shows two possible virtual platforms satisfying the conditions in Lemma 2, consisting of four and three virtual processors respectively. The shaded areas with the lowest parallelism of both of them have cumulative lengths of D_k − L_k = 8 and provide no less than C_k − L_k = 18 capacity, i.e., 20 in Figure 6(a) and 18 in Figure 6(b) respectively. Intuitively, we want the total capacity that the virtual platform provides in the interval [r_{k,a}, r_{k,a} + D_k] to be as small as possible, to make full use of the computing resources. In the following we present a design scheme for a virtual platform and prove that its total amount of computing capacity is minimal.

Figure 6: Alternative virtual processors satisfying Lemma 2.

Theorem 2. Assume a DAG job J_{k,a} is released at r_{k,a}. Then it must be finished before r_{k,a} + D_k when dispatched on Π if the following conditions are satisfied:

$$m = \left\lceil \frac{C_k - L_k}{D_k - L_k} \right\rceil \quad \text{and} \quad P(I_{k,a}) = \left\lceil \frac{C_k - L_k}{D_k - L_k} \right\rceil \times L_k + C_k - L_k \qquad (8)$$

where I_{k,a} = [r_{k,a}, r_{k,a} + D_k].

Proof. We first prove that if the conditions are satisfied, the sum of the lengths of the time intervals whose parallelism equals m is no less than L_k. We prove it by contradiction. Assume the total length of the time intervals with parallelism equal to m is L_h, with L_h < L_k. Then the maximum resource provided in I_{k,a} is

$$P_{max} = m \times L_h + (D_k - L_h) \times (m - 1)$$

The equation holds since the parallelism must be an integer. Then it can be shown that P_{max} < P(I_{k,a}) as follows:

$$\begin{aligned}
P(I_{k,a}) - P_{max} &= m \times L_k + C_k - L_k - m \times L_h - (m-1) \times (D_k - L_h) \\
&= m \times (L_k - L_h) + C_k - L_k - (m-1) \times (D_k - L_h) \\
&= m \times (D_k - L_h + L_k - D_k) + C_k - L_k - (m-1) \times (D_k - L_h) \\
&= m \times (L_k - D_k) + C_k - L_k + (D_k - L_h) \\
&= C_k - L_k - m \times (D_k - L_k) + (D_k - L_h) \\
&= C_k - L_k - (m-1) \times (D_k - L_k) + (D_k - L_h) - (D_k - L_k) \\
&> 0
\end{aligned}$$

where the last expression is positive because m − 1 < (C_k − L_k)/(D_k − L_k) and L_h < L_k. Then we know that when L_h < L_k, the cumulative resource provided in I_{k,a} cannot reach ⌈(C_k − L_k)/(D_k − L_k)⌉ × L_k + C_k − L_k, contradicting the assumption. So there must be a set of time intervals whose lengths add up to L_k and in which the parallelism at each time instant equals m. We denote this set of time intervals as T_u, and the remaining time intervals in I_{k,a} as T_n. From equation (8), the cumulative resource provided in T_n is C_k − L_k. Then Lemma 2 is satisfied, and the theorem is proved. □

The following lemma shows that the total amount of capacity provided in [r_{k,a}, r_{k,a} + D_k] by a virtual platform satisfying Theorem 2 is minimal among all virtual platforms conforming to Lemma 2.

Lemma 3. To satisfy Lemma 2, it is necessary for a virtual platform to satisfy:

$$P(I_{k,a}) \ge \left\lceil \frac{C_k - L_k}{D_k - L_k} \right\rceil \times L_k + C_k - L_k$$

where I_{k,a} = [r_{k,a}, r_{k,a} + D_k].

Proof. We prove the lemma by contradiction. Assume

$$P(I_{k,a}) < \left\lceil \frac{C_k - L_k}{D_k - L_k} \right\rceil \times L_k + C_k - L_k$$

and the conditions in Lemma 2 hold. Then there must exist a set of time intervals T_n during which the cumulative resource provided equals C_k − L_k. Since P(I_{k,a}) is below the bound above, the cumulative resource provided in the remaining time intervals, denoted as T_u, is less than ⌈(C_k − L_k)/(D_k − L_k)⌉ × L_k. On the other hand, since the number of available virtual processors is an integer, the minimum number of available processors required to provide resource of amount C_k − L_k during time intervals of length D_k − L_k is ⌈(C_k − L_k)/(D_k − L_k)⌉. Based on condition (2) in Lemma 2, the parallelism in T_u is no less than that in T_n. So the resource provided in T_u is at least ⌈(C_k − L_k)/(D_k − L_k)⌉ × L_k, which contradicts the assumption above that the cumulative resource provided in T_u is less than ⌈(C_k − L_k)/(D_k − L_k)⌉ × L_k. The lemma is thus proved. □

Figure 7: Alternative virtual processors satisfying Theorem 2.

Theorem 2 only constrains the total amount of capacity provided by all m virtual processors in the interval [r_{k,a}, r_{k,a} + D_k]. Thus we are still free to design the resource function of each virtual processor, as long as condition (8) is satisfied. Two possible virtual platforms satisfying Theorem 2 are shown in Figure 7. In Figure 7(a), F_1(I_{k,a}) = F_2(I_{k,a}) = 20 and F_3(I_{k,a}) = 14. In Figure 7(b), F_1(I_{k,a}) = F_2(I_{k,a}) = F_3(I_{k,a}) = 18. Both of them provide a total capacity of 18 in the intervals with the lowest parallelism and cumulative length 8 (the shaded areas). In general, if (C_k − L_k)/(D_k − L_k) is an integer, we have a fixed platform where F_i(I_{k,a}) = D_k, 1 ≤ i ≤ m. Otherwise, the total amount of capacity satisfying condition (8) can be arbitrarily distributed over the m virtual processors in the interval I_{k,a}. It is hard to say which distribution is better, since the total capacity they provide is already minimal. In this paper, we present a simple solution by uniform distribution:

Theorem 3. Assume a DAG job J_{k,a} is released at r_{k,a}. Then it must be finished before r_{k,a} + D_k when it is scheduled on Π if the following condition is satisfied:

$$F_1(I_{k,a}) = \dots = F_m(I_{k,a}) = L_k + \frac{C_k - L_k}{m}$$

where m = ⌈(C_k − L_k)/(D_k − L_k)⌉ and I_{k,a} = [r_{k,a}, r_{k,a} + D_k].

Figure 7(b) shows the uniform distribution of Π_k, where each virtual processor provides a total capacity of 18 in the time interval [0, 20]; possible resource functions of the 3 virtual processors composing the virtual platform are shown on top of Figure 7(b).

It can be observed from Theorem 3 that a job J_{k,a} with C_k/D_k ≤ 1, i.e., a light task under the federated scheduling algorithm, is dispatched on a single virtual processor where F_1(I_{k,a}) = C_k. That is, under hierarchical scheduling we do not have to distinguish DAG tasks into heavy and light ones as in federated scheduling.

In the above we only considered the construction of the virtual platform in the scheduling window of a single DAG job. Since each DAG task τ_k generates jobs recurrently with respect to its period T_k, the corresponding virtual platform of τ_k also needs to provide computing resource consistent with the release of each job of τ_k. That is, the virtual platform Π_k corresponding to τ_k needs to satisfy Theorem 3 for all jobs J_{k,a}, a ≥ 1.

Note that we chose this design scheme because our guideline is to minimize the total amount of capacity. One can also design one's own virtual platform in accordance with different strategies, under a particular perspective. Even a platform providing much more capacity in the interval [r_{k,a}, r_{k,a} + D_k] than the minimum total amount of capacity in Lemma 3 is acceptable in terms of schedulability, for flexibility or other reasons, as long as Theorem 1 is followed.
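Combining Theorems 2 and 3, constructing the virtual platform of a DAG task reduces to two formulas, which the sketch below computes; the struct and names are ours.

#include <math.h>
#include <stdio.h>

/* Virtual platform of Theorems 2 and 3 for one DAG task: m virtual
 * processors, each providing capacity C_{k,i} = L + (C-L)/m in every
 * scheduling window, i.e., each VP becomes a sporadic sequential task
 * (C_{k,i}, D_k, T_k). */
typedef struct {
    int m;          /* number of virtual processors */
    double cap;     /* capacity C_{k,i} each VP provides per window */
} vplatform;

vplatform build_vplatform(double C, double L, double D)
{
    vplatform vp;
    vp.m = (int)ceil((C - L) / (D - L));   /* condition (8) */
    if (vp.m < 1) vp.m = 1;                /* C = L edge case: one VP */
    vp.cap = L + (C - L) / vp.m;           /* uniform distribution (Thm 3) */
    return vp;
}

int main(void) {
    /* the running example: C_k = 30, L_k = 12, D_k = 20 */
    vplatform vp = build_vplatform(30, 12, 20);
    printf("m = %d, C_k,i = %.1f\n", vp.m, vp.cap);  /* m = 3, C_k,i = 18.0 */
    return 0;
}

Note that for a light task (C_k/D_k ≤ 1) the formula yields m = 1 and cap = C_k, consistent with the observation after Theorem 3.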

5. Schedule Virtual Processors on Physical Processors

In the following, we focus on the problem of scheduling all virtual processors together on physical processors, i.e., satisfying Condition 2. Similar to the scheduling of tasks on virtual platforms, this problem is composed of two issues: (i) how to provide the specific service required by each virtual platform on physical processors, and (ii) whether the resulting virtual platforms can be successfully scheduled on the physical processors.

5.1. Schedule Virtual Platforms

Note that our construction of each virtual platform Π only requires a certain total amount of capacity from each virtual processor within a time interval: Π is constituted by m = ⌈(C_k − L_k)/(D_k − L_k)⌉ virtual processors, and each of them provides a total amount of capacity L_k + (C_k − L_k)/m during the scheduling window of each job of τ_k. Then each job of τ_k can finish before its deadline when it is scheduled on Π. That is, the specific resource function of each virtual processor, i.e., the shape of the service curve, does not affect schedulability when τ_k is dispatched on Π. In other words, the only thing we need to ensure is that F_i(I_{k,a}) = L_k + (C_k − L_k)/m for each released job J_{k,a}.

So each virtual processor P_i ∈ Π can be modeled as a sporadic sequential task with parameters (C_{k,i}, D_k, T_k). The worst-case execution time C_{k,i} = L_k + (C_k − L_k)/m is equal to the amount of capacity it needs to provide; D_k and T_k are the same as the relative deadline and period of τ_k, respectively. For example, the virtual processors in Figure 7(b) can be modeled as three sporadic sequential tasks with parameters (C_{k,i}, D_k, T_k) = (18, 20, 21). Then the process of providing computing resource to each specific virtual processor is equivalent to the execution of its corresponding sporadic sequential task, and scheduling the resource provided by each virtual processor upon physical processors is equivalent to scheduling a set of sporadic sequential tasks on multiprocessors. When a virtual processor is scheduled on a physical processor by the scheduler at a time instant, the current computing resource of that physical processor is available for this particular virtual processor, and thus can be accessed by its corresponding DAG task. Various multiprocessor scheduling algorithms for sequential sporadic tasks can be used to schedule these virtual processors, such as global EDF and partitioned EDF. In this work, we choose partitioned EDF, in particular with the First-Fit packing strategy [12], to schedule them.

5.2. Schedulability under Partitioned EDF

In partitioned EDF scheduling on multiprocessors, each task is allocated to a specific processor, and the tasks on one processor are scheduled with the EDF strategy as on a uniprocessor. Given a set of tasks τ_1, τ_2, ..., τ_n to be executed on a platform composed of m unit-speed processors π_1, π_2, ..., π_m, a partitioning algorithm based on First-Fit was proposed in [12]. Tasks are first indexed by their relative deadlines, with smaller relative deadlines getting smaller indexes. The partitioning starts from the tasks with smaller indexes, and a task τ_i is assigned to execute on the first processor π_j which satisfies the following two conditions:

$$D_i - \sum_{\tau_l \in T_{\pi_j}} DBF^*(\tau_l, D_i) \ge C_i \qquad (9)$$

$$1 - \sum_{\tau_l \in T_{\pi_j}} u_l \ge u_i \qquad (10)$$

where u_i = C_i/T_i, T_{π_j} is the set of tasks already allocated to processor π_j before τ_i, and

$$DBF^*(\tau_i, t) = \begin{cases} 0 & \text{if } t < D_i \\ C_i + (t - D_i) \times u_i & \text{otherwise} \end{cases}$$

Similarly, each virtual processor P_i with parameters (C_{k,i}, D_k, T_k) is first assigned to its dedicated physical processor. Then all virtual processors assigned to the same physical processor are scheduled by the scheduler under EDF. More specifically, when we use partitioned EDF to schedule virtual platforms, the virtual platform of each DAG is first constructed at design time, and all resulting virtual processors are partitioned onto the physical processors according to (9) and (10), in non-decreasing order of their deadlines. If all virtual processors are successfully allocated to their dedicated physical processors, the system is reported schedulable; otherwise it is reported unschedulable. At runtime, the dispatcher schedules the workload of each DAG task on its dedicated virtual platform, and the scheduler schedules each virtual processor on its dedicated physical processor under EDF.

Discussion (arbitrary deadlines). A DAG task set τ has arbitrary deadlines if the relative deadline D_i of each task τ_i may be less than, equal to, or greater than its period T_i; the arbitrary-deadline task model is thus more general than the constrained-deadline one. We use the request-bound function (RBF) to analyze the schedulability of the arbitrary-deadline task model [12], which provides the following inequalities:

$$D_i - \sum_{\tau_l \in T_{\pi_j}} RBF^*(\tau_l, D_i) \ge C_i \qquad (11)$$

$$1 - \sum_{\tau_l \in T_{\pi_j}} u_l \ge u_i \qquad (12)$$

where RBF*(τ_i, t) = C_i + u_i × t. As with constrained deadlines, we also use partitioned EDF to schedule the virtual processors; inequalities (11) and (12) ensure that all virtual processors are successfully scheduled.
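A compact sketch of the design-time partitioning step of (9) and (10): virtual processors, modeled as sporadic tasks, are taken in non-decreasing deadline order and placed First-Fit. Identifiers are ours; the DBF* formula follows [12] as given above.

#define MAXT 64
#define MAXM 8

typedef struct { double C, D, T; } stask;   /* (C_{k,i}, D_k, T_k) */

/* DBF*(tau, t) as defined above */
static double dbf(const stask *x, double t) {
    return (t < x->D) ? 0.0 : x->C + (t - x->D) * (x->C / x->T);
}

/* can task 'nw' be placed on a processor already holding set[0..n)? */
static int fits(const stask set[], int n, const stask *nw) {
    double demand = 0.0, util = 0.0;
    for (int l = 0; l < n; l++) {
        demand += dbf(&set[l], nw->D);
        util   += set[l].C / set[l].T;
    }
    return nw->D - demand >= nw->C            /* condition (9)  */
        && 1.0 - util >= nw->C / nw->T;       /* condition (10) */
}

/* First-Fit: returns 1 iff all n tasks (pre-sorted by non-decreasing
 * relative deadline) are placed on the M processors. */
int partition_ff(const stask tasks[], int n, int M,
                 stask bins[MAXM][MAXT], int cnt[MAXM])
{
    for (int j = 0; j < M; j++) cnt[j] = 0;
    for (int i = 0; i < n; i++) {
        int placed = 0;
        for (int j = 0; j < M && !placed; j++) {
            if (fits(bins[j], cnt[j], &tasks[i])) {
                bins[j][cnt[j]++] = tasks[i];
                placed = 1;
            }
        }
        if (!placed) return 0;   /* report unschedulable */
    }
    return 1;
}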

6. Virtual Processors in LitmusRT

In this section, we implement our method on a real-time system we have developed based on OpenMP [8] and LitmusRT [9]. In the following, we introduce the details of the realization of the two scheduling components, as well as the runtime performance. To better understand the architecture, we mainly introduce its two core parts: first, how processor resources are allocated to VPs; second, how OpenMP threads use VP resources. Since OpenMP threads and system threads have a one-to-one relationship, we say "openmp-thread" instead of "system-thread" throughout.

6.1. Overview

In this section we describe the architecture of the parallel shared-resource platform in detail. Each task we generate is a directed acyclic graph (DAG), and these tasks do not use processor resources directly. Instead, processor resources are used in a shared way through a middleware layer, which we define as the virtual processor (VP). Fig. 8 shows the complete, hierarchical architecture of our platform: real-time applications run on top of the OpenMP parallel API, the liblitmus API and the FeatherTrace overhead-tracing API; their processes and threads are connected by the dispatcher to the virtual domain and its virtual processors; underneath, the real-time scheduling plugins (PSN-EDF, GSN-EDF, PVP, ...) and the trace event mechanism run on the multi-core platform.

Figure 8: The architecture overview of our platform.

6.2. Scheduler

In this section, we describe how the VPs share the hardware resources. The framework is developed based on the reservation framework in LitmusRT [9]. By using the interfaces of the VP framework, we can create a number of threads and VPs in the host system. First, according to the configuration of the DAG task, we calculate the number m of virtual processors that are needed. Listing 1 is sample code that creates m VPs for a DAG task.

init_config_vp(&litmus_thread_config);
create_vp(vp_type, m, &litmus_thread_config);

Listing 1: Create VP resource

Each VP has its own struct vp_res, which is unique in the system. There are three scheduling-related parameters in the vp_res data structure, namely priority, next_replenishment and cur_budget. priority is equal to the deadline of the DAG task, and next_replenishment is equal to the period of the DAG task. cur_budget is variable and equals the execution time of a vertex of the DAG task. When a VP is created, it is already assigned to a fixed core. On each core, all the VPs are managed in the active_reservations queue in priority order: the VP with the earlier deadline has the higher priority. Based on the classic EDF scheduling algorithm, we put the VP with the highest priority on the core. Since we maintain an already-sorted priority queue, we simply take the VP to schedule from the head of the queue.
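For illustration, the scheduling-related part of struct vp_res described above can be pictured roughly as follows; the actual definition in our kernel patch contains further bookkeeping, so this is only a sketch.

typedef unsigned long long lt_t;   /* LitmusRT's time type (nanoseconds) */

/* An illustrative sketch of struct vp_res; the field names follow the
 * description above, but the exact layout is not the real one. */
struct vp_res {
    lt_t priority;            /* = deadline of the DAG task (EDF order) */
    lt_t next_replenishment;  /* = period of the DAG task */
    lt_t cur_budget;          /* remaining execution budget */
    int  vp_id;               /* unique VP identifier in the host system */
    int  cpu;                 /* the fixed core this VP is assigned to */
    struct vp_res *next;      /* link in the core's active_reservations queue */
};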

The local-level scheduling is shown in Algorithm 1.

Algorithm 1 Local-level scheduling
Input: prev VP
Output: next VP
1: state = local_cpu_state();
2: raw_spin_lock(&state->lock);
3: if state->scheduled != NULL then
4:     move prev vp_res to the back of the ready queue;
5: end if
6: figure out how much budget has been consumed and update the time;
7: find the highest-priority vp_res;
8: raw_spin_unlock(&state->lock);
9: return the next vp_res;

Only when a VP wants to join the active_reservations queue do we need to adjust the order of the elements in the queue. There are two situations that trigger an adjustment:

- a new VP is generated in the system;
- scheduling is triggered.

On each core, the highest-priority VP in the queue is always the one scheduled. And when a VP occupies a core's resources, the DAG-task threads that belong to the same VP domain will also get computing resources.

6.3. VP DOMAIN: Connecting vp_res and vp_client

In the last section, we described in detail how VPs are scheduled to use core resources according to the basic EDF priority policy. Once the highest-priority VP gets core resources, the dispatcher selects which thread will run. Different DAG tasks have their own threads and VPs in the system, and we create three main core data structures to manage them. We create a vp_domain for each DAG task, and the variable vp_domain_id identifies it uniquely in the system. There are two important pointer variables in vp_domain. One is vp_res*, which points to a queue of VPs; this queue is different from the priority queue described in the last section and is only used to organize the VPs of the same DAG task. We use the structure vp_res to represent a VP, i.e., a virtual processor resource. Each vp_res has a variable vp_domain* for quickly finding its VP domain. The other pointer is vp_client*, which points to a queue of vp_client structures. Through a vp_res, we can find the data structure that represents the actually running code, a task_struct. The architecture of the VP-based domain is shown in Fig. 9.

Figure 9: The architecture of the VP framework.

6.4. Dispatcher

According to the parameter num_wthreads, OpenMP can generate a team of threads. We set the same deadline for the threads that belong to the same DAG task. The number of threads can be up to the maximum out-degree of the DAG task, but it is best not to exceed the actual number of processors. The initial OpenMP thread can be treated as a normal Linux thread, and we must set real-time parameters for these threads so that they are scheduled by the real-time scheduler in LitmusRT. Since the OpenMP threads are not directly visible to us, we cannot access them directly. To overcome this problem we use the code in Listing 2, which relies on the static,1 schedule policy [13].

#pragma omp parallel for schedule(static, 1)
for (int i = 0; i < num_wthreads; i++) {
    init_thread(cp[i]);
}

Listing 2: Thread initialization

According to the OpenMP 4.5 specification [8] of static scheduling, each iteration is executed by exactly one OpenMP thread, so every thread can be turned into a LitmusRT thread for executing real-time work. There are two core functions in init_thread(). One is set_rt_task_param(), which is responsible for transferring the proper parameters to the host kernel. The other is task_mode(), which sets the mode of the thread. The OpenMP threads access the VPs by vp_id and vp_domain_id: vp_id specifies a unique VP in the host system, and the VPs of one DAG task share the same vp_domain_id. In this way we achieve a many-to-many relationship between OpenMP threads and VP resources.

The vp_domain also maintains another queue of vp_client structures. Each vp_client contains a pointer to the VP it currently uses. Algorithm 2 determines which vp_client may point to the VP.

Algorithm 2 List-scheduling
Input: next VP
Output: next thread
1: find the thread ready queue vp_client* according to the VP;
2: if the queue vp_client* != NULL then
3:     get the head entry of the queue vp_client*;
4:     find the next task address based on the vp_client;
5: end if
6: return the next thread;

OpenMP threads using VPs need to comply with the following constraints:

- when an openmp-thread is suspended, there must be no free VP resource;
- no preemption.

Each vp_client corresponds to one OpenMP thread in the same vp_domain through the struct task_client. Finally, we return the OpenMP thread to the host scheduler, and the OpenMP task can actually run.
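A sketch of what init_thread() has to do for each OpenMP worker, using the liblitmus calls named above (init_rt_task_param(), set_rt_task_param(), task_mode()); the parameter plumbing and the way the VP domain id is recorded are our assumptions.

#include <litmus.h>

/* Turn the calling OpenMP worker into a LitmusRT real-time task; a
 * sketch under the assumptions stated above. wcet, period and deadline
 * are in nanoseconds (lt_t). */
static void init_thread_sketch(lt_t wcet, lt_t period, lt_t deadline)
{
    struct rt_task param;

    init_rt_task_param(&param);          /* start from liblitmus defaults */
    param.exec_cost = wcet;
    param.period = period;
    param.relative_deadline = deadline;
    /* (our framework would additionally record vp_domain_id here so the
     *  thread can be matched with the VPs of its DAG task) */
    set_rt_task_param(gettid(), &param); /* hand the parameters to the kernel */
    task_mode(LITMUS_RT_TASK);           /* switch from background to RT mode */
}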


6.5. Runtime Performance

In order to verify the feasibility of the proposed approach in a real-time OS, we developed the virtual processor framework in LitmusRT [9]. We focus on how much additional kernel overhead is introduced compared to a conventional scheduler.

Platform. We ran experiments on an Intel(R) Core(TM) i7-7700 CPU at 3.6 GHz. There are a total of 8 processors, each with 256-KB L1 caches, 1024-KB L2 caches and 8192-KB L3 caches. The kernel version is 4.9.30-x86_64-GNU/Linux.

Benchmarks. We randomly generate a JSON configuration file for each DAG task; each JSON file contains the basic information of a DAG diagram. Based on liblitmus [9] and OpenMP [8], we developed a parallel workload that can run on LitmusRT. We ran each DAG task for 10 seconds under three schedulers: the partitioned virtual processor scheduler (PVP), partitioned EDF (P-EDF), and global EDF (G-EDF). We included the two conventional schedulers to provide a baseline. While running each DAG workload, we collected all overhead samples with the Feather-Trace framework [9].

Critical overheads. There are two critical kernel overheads for evaluating a scheduler in a real OS; they are shown in Figs. 10(a) and 10(b). Fig. 10(a) shows the scheduling overhead, which is the cost of determining which process (or thread) will run next; it contains the core logic of the scheduler and the synchronization overhead. The observed 99th-percentile scheduling overhead is only 2775 cycles under P-EDF, 4650 cycles under PVP, and 10775 cycles under GSN-EDF. Although putting a hierarchical scheduling framework into LitmusRT causes a drop in performance, the degradation has little impact on the system. The context-switch overhead is shown in Fig. 10(b). It is the cost of actually switching between two processes, which includes switching address spaces, prefetching the process control block, and so on. The observed 99th-percentile context-switch overhead is 4075 cycles under PVP, 3775 cycles under PSN-EDF, and 5250 cycles under GSN-EDF. The context-switch performance of our scheduler is almost the same as PSN-EDF. So our method is feasible in a real real-time system.

Figure 10: Critical overheads under three schedulers (G-EDF, P-EDF, PVP): (a) scheduling overhead; (b) context-switch overhead.

7. Evaluation

In this section, we evaluate the performance of our proposed method under synthetic workload. In particular, we compare our hierarchical scheduling method, denoted by H-YANG, with the state-of-the-art analysis techniques for scheduling DAG tasks with constrained deadlines: (i) the schedulability test based on response-time analysis for global EDF scheduling in [14], denoted by G-MEL. G-MEL was developed for a more general DAG model with conditional branching, but can be directly applied to the DAG model of this paper, which is a special case of [14]; and (ii) the schedulability test based on the processor allocation strategy of semi-federated scheduling in [4], denoted by SF-XU. Since semi-federated scheduling has shown its superiority for scheduling DAG tasks with implicit deadlines (in both theoretical and empirical evaluations), other algorithms under this model are not included in our comparison (details can be found in [3, 4, 14, 15, 7, 16, 17]).

7.1. Synthetic Workload

The task sets are generated using the Erdős-Rényi method G(n_i, p) [18]:

• Vertices and edges: For each task, the number of vertices is randomly chosen in the range [50, 250]. The worst-case execution time of each vertex is randomly picked in the range [50, 100]. For each possible edge we generate a random value in the range [0, 1] and add the edge to the graph only if the generated value is less than a predefined threshold p. In general, tasks are more sequential (i.e., the critical path of the DAG is longer) with larger p. As in [5], a minimum number of additional edges is added to make each task graph weakly connected.

• Deadlines and periods: The periods are generated as integer powers of two. We find the smallest value a such that L_i ≤ 2^a, and randomly set T_i to be one of 2^a, 2^{a+1}, or 2^{a+2}. The ratio L_i/T_i of the task is then in the range (1/2, 1], (1/4, 1/2], or (1/8, 1/4] when its period T_i is 2^a, 2^{a+1}, or 2^{a+2}, respectively. The relative deadline is uniformly selected from the range [L_i, T_i].

To generate each task set, we first generate heavy tasks until the total utilization exceeds U − 1, where U is the target total utilization, and then generate light tasks. For each parameter configuration, we generate 1000 task sets. A sketch of the generation procedure is given below.
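The sketch follows the description above: the vertex count and WCETs are drawn uniformly, and each forward edge (u, v) with u < v is added with probability p, which can never create a cycle. The connectivity fix-up and the period/deadline assignment are omitted; all names are ours.

#include <stdlib.h>

#define MAXV 256

typedef struct {
    int nv;
    int c[MAXV];
    int adj[MAXV][MAXV];
} dag;

static int rnd(int lo, int hi) {       /* uniform integer in [lo, hi] */
    return lo + rand() % (hi - lo + 1);
}

/* Erdos-Renyi G(n_i, p) generation as described above. */
void gen_dag(dag *g, double p)
{
    g->nv = rnd(50, 250);                       /* number of vertices */
    for (int v = 0; v < g->nv; v++)
        g->c[v] = rnd(50, 100);                 /* vertex WCETs */
    for (int u = 0; u < g->nv; u++)
        for (int v = u + 1; v < g->nv; v++)     /* forward edges only */
            g->adj[u][v] = ((double)rand() / RAND_MAX) < p;
}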

Figure 11: Comparison of acceptance ratio in different dimensions under synthetic workload: (a) normalized utilization; (b) tensity (T_i/L_i).

Figure 11(a) compares the acceptance ratio of task sets with different normalized utilization, where p is 0.1 and L_i/T_i is randomly generated in the range [1/8, 1]. The normalized utilization is on the x-axis, and we compare the acceptance ratio in terms of normalized utilization under different numbers of processors. It can be observed that both SF-XU and H-YANG outperform G-MEL. This is because the analysis techniques of SF-XU and H-YANG both fall under the framework of federated scheduling, where the interference among tasks is eliminated, while G-MEL is based on response-time analysis, which is much more pessimistic due to the inter-task interference. Moreover, H-YANG performs better than SF-XU, since it provides a much more efficient resource sharing solution.

Figure 11(b) compares the acceptance ratio of task sets with different tensity. It follows the same setting as Figure 11(a), but task periods are generated with different ratios T_i/L_i (corresponding to the x-axis); the tensity becomes smaller as the value on the x-axis increases. The normalized utilization of each task set is randomly chosen from [0.1, 1]. It is interesting to see that the gap between the three tests becomes larger as the tensity gets greater. This is because the resource waste problem of SF-XU becomes significant as the ratio between the period and the length of the critical path gets larger, which is eased by H-YANG to some degree.

8. Related Work

Real-time task scheduling on multi-core processors has been an active research topic in recent years, including both partitioned scheduling [19, 20] and global scheduling [21, 22, 23]. Most such prior work assumes that each task runs sequentially on one core, so the parallelism of multi-core processors can only be exploited by multiple tasks. To better utilize multi-core parallelism, the synchronous task model [24, 25, 26, 27, 28, 29, 30, 31, 32] assumes that each task may consist of multiple segments, and each segment may contain multiple parallel threads that are synchronized between segments. It can be viewed as a constrained special case of the DAG task model.

For implicit-deadline DAG tasks, [3] proved a capacity augmentation bound of (3+√5)/2 for G-EDF and 3.732 for G-RM. [3] also provides a capacity augmentation bound of 2 under federated scheduling. For mixed-criticality DAGs with implicit deadlines, Li et al. [33] proved that, for high-utilization tasks, mixed-criticality federated scheduling has capacity augmentation bounds of 2+√2 and (5+√5)/2 for dual- and multi-criticality systems, respectively. Moreover, they also derived a capacity augmentation bound of 11m/(3m−3) for dual-criticality systems with both high- and low-utilization tasks. The decomposition-based approach for the DAG model was first studied in [5], and a capacity augmentation bound of 4 was proved under G-EDF for implicit-deadline tasks. This method was then refined in [6], where a capacity augmentation bound between [2, 4] was proved, depending on the structure characteristics of the DAG. A schedulability test in [34] was provided that achieves a capacity bound lower than 4 in most cases, while in other cases above 4. In [17], a capacity bound of (3+√5)/2 was proved for some special task sets.

For constrained- and arbitrary-deadline DAG tasks, Baruah et al. [35] proved a bound of 2 under G-EDF for a single recurrent DAG task with an arbitrary deadline. Baruah [36] also proved a speed-up factor of 3 − 1/m for constrained-deadline DAG tasks and a speed-up factor of 4 − 2/m for arbitrary-deadline DAG tasks with respect to any optimal federated scheduling algorithm. Chen [37] showed that any federated scheduling algorithm has no constant speed-up factor with respect to an optimal scheduling algorithm for DAG tasks with constrained deadlines. Li et al. [7] and Bonifaci et al. [15] proved a bound of 2 − 1/m under G-EDF, and Bonifaci et al. [15] proved a bound of 3 − 1/m under deadline-monotonic scheduling. All the above bounds are with respect to an optimal scheduling algorithm. In [38] a schedulability test for arbitrary-deadline DAGs was derived based on response-time analysis. Beyond DAG task models, researchers have studied conditional parallel real-time tasks, where both fork-join and branching semantics exist in the same graph [14, 16, 2].

9. Conclusion

In this paper, we propose a hierarchical scheduling framework for DAG tasks with constrained deadlines. Under hierarchical scheduling, the scheduling of computing resource and the scheduling of the workload are independent of each other, providing a much more efficient resource sharing solution without distinguishing the types of tasks. We implement our proposed hierarchical scheduling on a realistic platform, whose runtime overhead is shown to be acceptable. We also conduct comprehensive experiments to evaluate the schedulability of our hierarchical scheduling, and the results show that our proposed method has promising performance.

There are many directions of future work for this paper. First, we would like to consider the scheduling of multiple DAG tasks on a given virtual platform under different interfaces; the scheduling of DAG task models with resource sharing is also an important direction under consideration. Second, we would like to consider energy-efficient [39] scheduling algorithms for DAGs.

References

[1] J. C. Fonseca, V. Nélis, G. Raravi, L. M. Pinho, A multi-DAG model for real-time parallel applications with conditional execution, in: Proc. 30th Annu. ACM Symp. Appl. Comput., 2015, pp. 1925–1932.
[2] S. Baruah, The federated scheduling of systems of conditional sporadic DAG tasks, in: Proceedings of the 12th International Conference on Embedded Software, IEEE Press, 2015, pp. 1–10.
[3] J. Li, J. J. Chen, K. Agrawal, C. Lu, C. Gill, A. Saifullah, Analysis of federated and global scheduling for parallel real-time tasks, in: Real-Time Systems (ECRTS), 2014 26th Euromicro Conference on, IEEE, 2014, pp. 85–96.
[4] J. Xu, G. Nan, L. Xiang, Y. Wang, Semi-federated scheduling of parallel real-time tasks on multiprocessors, in: Real-Time Systems Symposium (RTSS), 2017 IEEE 38th, IEEE, 2017, pp. 80–91.
[5] A. Saifullah, D. Ferry, J. Li, K. Agrawal, C. Lu, C. D. Gill, Parallel real-time scheduling of DAGs, Parallel and Distributed Systems, IEEE Transactions on 25 (12) (2014) 3242–3252.
[6] X. Jiang, X. Long, N. Guan, H. Wan, On the decomposition-based global EDF scheduling of parallel real-time tasks, in: Real-Time Systems Symposium (RTSS), 2016 IEEE, IEEE, 2016, pp. 237–246.
[7] J. Li, K. Agrawal, C. Lu, C. Gill, Outstanding paper award: Analysis of global EDF for parallel tasks, in: Real-Time Systems (ECRTS), 2013 25th Euromicro Conference on, IEEE, 2013, pp. 3–13.
[8] OpenMP application programming interface v4.5. URL http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf
[9] LitmusRT v2017.1. URL http://www.litmus-rt.org/
[10] P. Voudouris, P. Stenström, R. Pathan, Timing-anomaly free dynamic scheduling of task-based parallel applications, in: Real-Time and Embedded Technology and Applications Symposium (RTAS), 2017 IEEE, IEEE, 2017, pp. 365–376.
[11] T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein, Introduction to Algorithms, second edition, 2001.
[12] N. Fisher, S. Baruah, T. P. Baker, The partitioned scheduling of sporadic tasks according to static-priorities, in: Real-Time Systems, 2006. 18th Euromicro Conference on, IEEE, 2006, pp. 10 pp.
[13] J. Li, Z. Luo, D. Ferry, K. Agrawal, C. Gill, C. Lu, Global EDF scheduling for parallel real-time tasks, Real-Time Systems, Springer US, 2015, pp. 395–439.
[14] A. Melani, M. Bertogna, V. Bonifaci, A. Marchetti-Spaccamela, G. C. Buttazzo, Response-time analysis of conditional DAG tasks in multiprocessor systems, in: Real-Time Systems (ECRTS), 2015 27th Euromicro Conference on, IEEE, 2015, pp. 211–221.
[15] V. Bonifaci, A. Marchetti-Spaccamela, S. Stiller, A. Wiese, Feasibility analysis in the sporadic DAG task model, in: Real-Time Systems (ECRTS), 2013 25th Euromicro Conference on, IEEE, 2013, pp. 225–233.
[16] S. Baruah, Improved multiprocessor global schedulability analysis of sporadic DAG task systems, in: Real-Time Systems (ECRTS), 2014 26th Euromicro Conference on, IEEE, 2014, pp. 97–105.
[17] M. Qamhieh, L. George, S. Midonnet, A stretching algorithm for parallel real-time DAG tasks on multiprocessor systems, in: Proceedings of the 22nd International Conference on Real-Time Networks and Systems, ACM, 2014, p. 13.
[18] D. Cordeiro, G. Mounié, S. Perarnau, D. Trystram, J.-M. Vincent, F. Wagner, Random graph generation for scheduling simulations, in: Proceedings of the 3rd International ICST Conference on Simulation Tools and Techniques, ICST, 2010, p. 60.
[19] C. Wang, C. Dong, H. Zeng, Z. Gu, Minimizing stack memory for hard real-time applications on multicore platforms with partitioned fixed-priority or EDF scheduling, ACM Trans. Design Autom. Electr. Syst. 21 (3) (2016) 46:1–46:25.
[20] Z. Al-bayati, Q. Zhao, A. Youssef, H. Zeng, Z. Gu, Enhanced partitioned scheduling of mixed-criticality systems on multicore platforms, in: Asia and South Pacific Design Automation Conference (ASP-DAC), 2015 ACM 20th, IEEE, 2015, pp. 630–635.
[21] C. Wang, Z. Gu, H. Zeng, Global fixed priority scheduling with preemption threshold: Schedulability analysis and stack size minimization, IEEE Trans. Parallel Distrib. Syst. 27 (11) (2016) 3242–3255.
[22] N. Guan, W. Yi, Q. Deng, Z. Gu, G. Yu, Schedulability analysis for non-preemptive fixed-priority multiprocessor scheduling, Journal of Systems Architecture - Embedded Systems Design 57 (5) (2011) 536–546.
[23] N. Guan, W. Yi, Z. Gu, Q. Deng, G. Yu, New schedulability test conditions for non-preemptive scheduling on multiprocessor platforms, in: Real-Time Systems Symposium (RTSS), IEEE, 2008, pp. 137–146.
[24] G. Liu, Y. Lu, S. Wang, Z. Gu, Partitioned multiprocessor scheduling of mixed-criticality parallel jobs, in: Embedded and Real-Time Computing Systems and Applications (RTCSA), 2014 IEEE 20th International Conference on, IEEE, 2014, pp. 1–10.
[25] S. Kato, Y. Ishikawa, Gang EDF scheduling of parallel task systems, in: Real-Time Systems Symposium, 2009, RTSS 2009. 30th IEEE, IEEE, 2009, pp. 459–468.
[26] K. Lakshmanan, S. Kato, R. Rajkumar, Scheduling parallel real-time tasks on multi-core processors, in: Real-Time Systems Symposium (RTSS), 2010 IEEE 31st, IEEE, 2010, pp. 259–268.
[27] A. Saifullah, J. Li, K. Agrawal, C. Lu, C. Gill, Multi-core real-time scheduling for generalized parallel task models, Real-Time Systems 49 (4) (2013) 404–435.
[28] J. Kim, H. Kim, K. Lakshmanan, R. R. Rajkumar, Parallel scheduling for cyber-physical systems: Analysis and case study on a self-driving car, in: Proceedings of the ACM/IEEE 4th International Conference on Cyber-Physical Systems, ACM, 2013, pp. 31–40.
[29] G. Nelissen, V. Berten, J. Goossens, D. Milojevic, Techniques optimizing the number of processors to schedule multi-threaded tasks, in: Real-Time Systems (ECRTS), 2012 24th Euromicro Conference on, IEEE, 2012, pp. 321–330.
[30] C. Maia, M. Bertogna, L. Nogueira, L. M. Pinho, Response-time analysis of synchronous parallel tasks in multiprocessor systems, in: Proceedings of the 22nd International Conference on Real-Time Networks and Systems, ACM, 2014, p. 3.
[31] B. Andersson, D. de Niz, Analyzing global-EDF for multiprocessor scheduling of parallel tasks, in: Principles of Distributed Systems, Springer, 2012, pp. 16–30.
[32] P. Axer, S. Quinton, M. Neukirchner, R. Ernst, B. Dobel, H. Hartig, Response-time analysis of parallel fork-join workloads with real-time constraints, in: Real-Time Systems (ECRTS), 2013 25th Euromicro Conference on, IEEE, 2013, pp. 215–224.
[33] J. Li, D. Ferry, S. Ahuja, K. Agrawal, C. Gill, C. Lu, Mixed-criticality federated scheduling for parallel real-time tasks, Real-Time Systems 53 (5) (2017) 760–811.
[34] M. Qamhieh, F. Fauberteau, L. George, S. Midonnet, Global EDF scheduling of directed acyclic graphs on multiprocessor systems, in: Proceedings of the 21st International Conference on Real-Time Networks and Systems, ACM, 2013, pp. 287–296.
[35] S. Baruah, V. Bonifaci, A. Marchetti-Spaccamela, L. Stougie, A. Wiese, A generalized parallel task model for recurrent real-time processes, in: Real-Time Systems Symposium (RTSS), 2012 IEEE 33rd, IEEE, 2012, pp. 63–72.
[36] S. Baruah, Federated scheduling of sporadic DAG task systems, in: Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International, IEEE, 2015, pp. 179–186.
[37] J.-J. Chen, Federated scheduling admits no constant speedup factors for constrained-deadline DAG task systems, Real-Time Systems 52 (6) (2016) 833–838.
[38] A. Parri, A. Biondi, M. Marinoni, Response time analysis for G-EDF and G-DM scheduling of sporadic DAG-tasks with arbitrary deadline, in: Proceedings of the 23rd International Conference on Real Time and Networks Systems, ACM, 2015, pp. 205–214.
[39] D. Liu, J. Spasic, P. Wang, T. Stefanov, Energy-efficient scheduling of real-time tasks on heterogeneous multicores using task splitting, in: Embedded and Real-Time Computing Systems and Applications (RTCSA), 2016 IEEE 22nd International Conference on, IEEE, 2016, pp. 17–19.

Name: Tao Yang

Tao Yang is a PhD candidate at Northeastern University, Shenyang, China. He is good at system implementation. His current research interests include multiprocessor real-time scheduling and operating system architecture.

Name: Qingxu Deng

Qingxu Deng received the Ph.D. degree from Northeastern University, Shenyang, China, in 1997. He is currently a Full Professor with the School of Computer Science and Engineering, Northeastern University. His current research interests include multiprocessor real-time scheduling and formal methods in real-time system analysis.

Name: Lei Sun

Lei Sun received the MSc degree in computer science from the University of Edinburgh, UK. He is currently pursuing the Ph.D. degree at the School of Computer Science and Engineering, Northeastern University, China. His research interests include smart energy systems, cyber-physical systems, and real-time systems.
