
Real-time Scheduling of Parallel Tasks with Tight Deadlines

Xu Jiang 1,2, Nan Guan 2,*, Xiang Long 3, Yue Tang 2, Qingqiang He 2

1 University of Electronic Science and Technology of China, China
2 The Hong Kong Polytechnic University, China
3 Beihang University, China

* Corresponding author. Email addresses: [email protected] (Xu Jiang), [email protected] (Nan Guan), [email protected] (Xiang Long), [email protected] (Yue Tang), [email protected] (Qingqiang He)

Journal of Systems Architecture (2020), https://doi.org/10.1016/j.sysarc.2020.101742

Abstract

Real-time systems are shifting from single-core to multi-core processors, on which software must be parallelized to fully utilize their computation power. Recently, different types of scheduling algorithms have been proposed for parallel real-time tasks modeled as directed acyclic graphs (DAG), among which federated scheduling shows its superiority in real-time performance. However, the performance of federated scheduling degrades seriously for tasks with tight relative deadlines, i.e., when the gap between the relative deadline and the longest path length is small. In this paper, we propose new methods based on federated scheduling that solve this problem by exploring intra-task structure information. With our new methods, each heavy task is transformed into a set of independent sporadic sub-tasks under the guidance of its intra-task structure information, such that the number of processors required is reduced. We conduct experiments to evaluate our proposed approach against state-of-the-art methods of different types of scheduling algorithms. Experimental results show that our approach consistently outperforms all of the compared methods under different parameter settings, especially for task sets consisting of tasks with tight deadlines.

Keywords: parallel, real-time, multiprocessor, scheduling, DAG

1. Introduction

Real-time systems are becoming more and more computationally demanding to meet the ever increasing requirements for functionality and quality of service [1, 2, 3]. Multi-core processors are essential to fulfill this demand. However, using multi-core processors is not a free lunch: software must be properly parallelized to fully exploit the computation capacity of multi-core processors. A parallel real-time task is usually modeled as a Directed Acyclic Graph (DAG), where edges describe the interdependency constraints among the workload represented by vertices.

Recently, several approaches have been proposed for scheduling DAG tasks [4, 5, 6, 7, 8] (see Section 8 for a review). Among these approaches, federated scheduling [6] is a promising approach with good real-time performance. In federated scheduling, DAG tasks are classified into two subsets: heavy tasks and light tasks. Each heavy task exclusively executes on a subset of dedicated processors, while the light tasks are treated in the same way as traditional sequential real-time tasks and share the remaining processors. Federated scheduling not only successfully schedules a large portion of DAG task systems that are not schedulable by other approaches, but also provides the best quantitative worst-case performance guarantee [9]. Due to its good real-time performance, it has drawn much attention in recent research on real-time scheduling of DAG parallel tasks [7, 8, 9].

However, the number of processors required by each heavy task to be schedulable under federated scheduling is calculated without considering the intra-task structure information, which may result in a huge utilization loss when applied to task sets containing tasks with tight deadlines, i.e., tasks whose longest path length is close to the relative deadline (details are discussed in Section 3). In this work, we propose a new approach to solve this problem by adapting the decomposition-based method to federated scheduling: each heavy DAG task is decomposed into a set of independent sporadic sequential tasks by inserting artificial release times and deadlines for the workload released by individual vertices of the task. The task decomposition is performed according to the intra-task structure information of each heavy task, by which we can potentially smoothen a task's workload and reduce the number of processors required, and thus improve the schedulability of the whole task set. We conduct experiments with both realistic programs and randomly generated task sets to evaluate the real-time performance of our approach, and compare it with the state-of-the-art of different types of scheduling algorithms. The results show that our approach consistently outperforms existing approaches under different parameter settings. In particular, our approach can schedule a large portion of task sets consisting of tasks with tight deadlines, which are barely schedulable under other methods.

The rest of the paper is organized as follows.

In Section 2, we introduce the task model. In Section 3, we briefly review the federated scheduling approach and point out its pessimism when scheduling DAG tasks with tight deadlines. In Section 4 we give an overview of our approach. Section 5 introduces our decomposition method for each heavy task, after which we show the scheduling of the whole task set in Section 6. We evaluate our approach in Section 7. Related work is reviewed in Section 8, and Section 9 concludes the paper.

2. Task Model

We consider the scheduling problem of a task set τ consisting of n tasks {τ_1, τ_2, ..., τ_n} on m identical processors. Each task is represented by a DAG. A DAG task τ_i consists of several vertices and edges connecting them. A vertex v_{i,j} ∈ τ_i has a worst-case execution time (WCET) c_{i,j}. The total execution time of all vertices of task τ_i is denoted by C_i = Σ_{v_{i,j}∈τ_i} c_{i,j}. Edges represent interdependencies among vertices. A directed edge from vertex v_{i,j} to v_{i,k} indicates that v_{i,k} can only be executed after v_{i,j} is finished. In this case, v_{i,j} is a predecessor of v_{i,k}, while v_{i,k} is a successor of v_{i,j}. L_i denotes the sum of c_{i,j} over the vertices on the longest path of task τ_i, i.e., the execution time of task τ_i when it executes exclusively on an infinite number of processors. A DAG task is released periodically with a period T_i and must finish before its relative deadline D_i. We assume all tasks have constrained deadlines, i.e., D_i ≤ T_i. The density is defined by δ_i = C_i/D_i. We say a DAG task is heavy if δ_i > 1, and light otherwise. The laxity of task τ_i is Υ_i = D_i − L_i, and the elasticity is defined by Γ_i = L_i/D_i. We say a DAG task has a tight deadline if Γ_i → 1.

Figure 1: A DAG task example.

Fig. 1 shows a DAG task example with 10 vertices. The density of this task is C_i/D_i = 48/24 = 2, and the laxity is D_i − L_i = 2.
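To make the model concrete, the following is a minimal Python sketch of the DAG task model (the class and function names are ours, not the paper's): it computes the total execution time C_i and the longest path length L_i by a dynamic program over a topological order.

```python
from dataclasses import dataclass

@dataclass
class DagTask:
    wcet: dict       # vertex id -> WCET c_{i,j}
    edges: list      # (pred, succ) pairs; must form a DAG
    period: float    # T_i
    deadline: float  # D_i (constrained: D_i <= T_i)

    def volume(self):
        """C_i: total execution time of all vertices."""
        return sum(self.wcet.values())

    def longest_path(self):
        """L_i: longest accumulated WCET over any path, i.e. the
        execution time on infinitely many processors."""
        succs = {v: [] for v in self.wcet}
        preds = {v: [] for v in self.wcet}
        for u, v in self.edges:
            succs[u].append(v)
            preds[v].append(u)
        pending = {v: len(preds[v]) for v in self.wcet}
        order = [v for v in self.wcet if pending[v] == 0]
        finish = {}
        for v in order:  # Kahn-style topological sweep
            finish[v] = self.wcet[v] + max((finish[u] for u in preds[v]), default=0.0)
            for s in succs[v]:
                pending[s] -= 1
                if pending[s] == 0:
                    order.append(s)
        return max(finish.values())
```

For the task of Fig. 1, this computation would return C_i = 48 and L_i = 22, matching the density 48/24 = 2 and laxity 2 stated above.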

3. Federated Scheduling

In federated scheduling [6], DAG tasks are classified into heavy tasks (with densities greater than 1) and light tasks (with densities no greater than 1). Each heavy task exclusively executes on a subset of dedicated processors, while the light tasks are scheduled on the remaining processors as if they were sequential real-time tasks, by traditional real-time scheduling algorithms such as global EDF and partitioned EDF. The number of processors m_i required for a heavy task to be schedulable is calculated by:

$$m_i = \left\lceil \frac{C_i - L_i}{D_i - L_i} \right\rceil$$

This calculation is quite pessimistic for heavy tasks with tight deadlines, because it ignores the intra-task structure information. An example of a DAG task with a tight deadline is shown in Fig. 2(a), where (C_i, D_i, L_i) = (12, 9, 8) (the worst-case execution time of each vertex is marked inside the circle). Clearly, the task in Fig. 2(a) can meet its deadline when it is scheduled on two processors, as shown in Fig. 2(b). However, under federated scheduling, the number of processors required by this task to be schedulable is ⌈(C_i − L_i)/(D_i − L_i)⌉ = ⌈4/1⌉ = 4.
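As a quick check of the example above (a sketch; the function name is ours):

```python
from math import ceil

def federated_processors(C, D, L):
    # dedicated processors for a heavy task under federated scheduling [6]
    return ceil((C - L) / (D - L))

# the task of Fig. 2 with (C_i, D_i, L_i) = (12, 9, 8):
print(federated_processors(12, 9, 8))  # -> 4, although 2 processors suffice
```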

Figure 2: Illustration of a DAG task with tight deadline. (a) A DAG task τ_i with tight deadline. (b) The scheduling sequence of task τ_i on two processors.

4. Overview of Our Approach

In this paper we apply the decomposition-based method to federated scheduling, which can fully explore the intra-task structure information to help with the processor allocation of each heavy task. In our approach, each heavy task τ_i is first decomposed into several independent sequential sub-tasks (this framework was first presented in [4, 10] and then refined by [11]); these sub-tasks are then scheduled on their dedicated processors. When a heavy task is transformed into a set of independent sporadic sequential sub-tasks, each sub-task is assigned an artificial release time and deadline (relative to the release time of the entire task). The dependencies among different vertices in the task are automatically guaranteed as long as each individual sequential sub-task respects its own release time and deadline constraints, as illustrated by the sketch below. A proper decomposition can balance the workload of a DAG task and thus potentially reduce the number of processors required by the resulting sequential sub-task set to be schedulable. As in federated scheduling, all light tasks are scheduled on the remaining processors as if they were sequential tasks, by traditional real-time scheduling algorithms, e.g., partitioned EDF.
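The following is a minimal sketch of the sub-task abstraction described above (our own illustration; the concrete windows are produced by the decomposition of Section 5):

```python
from dataclasses import dataclass

@dataclass
class SubTask:
    release: float   # artificial release offset within one DAG job
    deadline: float  # artificial deadline offset within the same job
    wcet: float      # workload of this piece of a vertex

    @property
    def density(self):
        # density of the sequential sub-task over its execution window
        return self.wcet / (self.deadline - self.release)
```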


The key step of our approach is the decomposition of each heavy task. In the following, we briefly introduce the overall target of the task decomposition. Let {τ_{i,1}, τ_{i,2}, ..., τ_{i,j}} denote the resulting sequential sporadic task set of a heavy task τ_i. We use d_{i,j} to denote the relative deadline of a sub-task τ_{i,j}, and the density of this sequential sub-task is δ_{i,j} = c_{i,j}/d_{i,j}. Each sub-task is active when time reaches its release time. The instantaneous total density ℓ_i(t) of τ_i is the sum of the densities of all the active sub-tasks belonging to τ_i at time t. The number of processors required by the resulting sub-task set to be schedulable is decided by the capacity requirement (explained later in Section 6), which is computed by

$$\hbar_i = \max_{t>0}\{\ell_i(t)\} \qquad (1)$$

The overall target of task decomposition is then to make the capacity requirement of the resulting sub-task set as small as possible, thus increasing the chance for the whole task set to be schedulable on a given number of processors.

5. Decomposition of Heavy Tasks

In this section, we introduce the decomposition of each heavy task. The method consists of three steps: (i) timing diagram, (ii) segmentation, and (iii) laxity distribution. A heavy task is first represented by a timing diagram according to its inter-dependencies. Then, in the segmentation, the time window between the release time and relative deadline of a DAG task τ_i is divided into several segments according to the timing diagram, and the total amount of workload of task τ_i is assigned to these segments (a vertex may be split into several sequential sub-tasks, each assigned to a dedicated segment, as introduced later). In the laxity distribution, the laxity D_i − L_i of the task is distributed among the segments, and thus given to the sub-tasks in each segment accordingly. The overall target of these three steps is to minimize the capacity requirement of the resulting sequential sub-task set. In Section 5.1 we first introduce the construction of the timing diagram, by which the inter-dependencies in a DAG task are guaranteed. Then in Section 5.2, we introduce the basic concepts of segmentation. In Section 5.3, we first introduce how to distribute the total laxity to obtain a minimum capacity requirement under a given segmentation, after which it becomes clear how the segmentation strategy affects the capacity requirement of the resulting sequential sub-task set. We then present our segmentation strategy in Section 5.4.

5.1. Timing Diagram

To divide a task into segments, we need to construct the timing diagram of task τ_i. The timing diagram τ_i^∞ defines the earliest ready time and latest finish time of each vertex v_{i,j} in τ_i, denoted by r_{i,j} and f_{i,j} respectively, assuming that τ_i executes exclusively on sufficiently many processors and all the workload of τ_i must be finished within L_i. Assume that τ_i releases an instance at time 0. Then for each vertex v_{i,j}, r_{i,j} equals 0 if it has no predecessors, and otherwise is computed by

$$r_{i,j} = \max_{v_{i,k} \in PRE(v_{i,j})} \{c_{i,k} + r_{i,k}\}$$

where PRE(v_{i,j}) is the set of predecessors of v_{i,j}. The latest finish time f_{i,j} of v_{i,j} equals L_i if it has no successors, and otherwise is computed by

$$f_{i,j} = \min_{v_{i,k} \in SUC(v_{i,j})} \{r_{i,k}\}$$

where SUC(v_{i,j}) is the set of successors of v_{i,j}. Fig. 3 shows the timing diagram of the task τ_i in Fig. 1, where the earliest ready times and latest finish times are marked as upper and lower arrows, respectively. Following this timing diagram, all the dependency constraints of a task are automatically preserved if each vertex executes in the time window starting from its earliest ready time and ending before its latest finish time.

Figure 3: Timing diagram of the task example in Figure 1.
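A sketch of this construction in Python (function name ours), computing r_{i,j} by a forward pass in topological order and f_{i,j} directly from the successors' ready times:

```python
def timing_diagram(wcet, edges, L):
    """Earliest ready times r and latest finish times f of all vertices,
    assuming execution on sufficiently many processors within L."""
    preds = {v: [] for v in wcet}
    succs = {v: [] for v in wcet}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)
    pending = {v: len(preds[v]) for v in wcet}
    order = [v for v in wcet if pending[v] == 0]
    r = {}
    for v in order:  # forward pass in topological order
        r[v] = max((r[u] + wcet[u] for u in preds[v]), default=0.0)
        for s in succs[v]:
            pending[s] -= 1
            if pending[s] == 0:
                order.append(s)
    # sinks finish at L; others must finish before any successor becomes ready
    f = {v: (L if not succs[v] else min(r[s] for s in succs[v])) for v in wcet}
    return r, f
```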

5.2. Segmentation

With the timing diagram obtained in the last subsection, we divide a task τ_i into several segments according to the earliest ready times and latest finish times of the vertices. In general, the time window [r_{i,j}, f_{i,j}] of v_{i,j} may cross several segments. In this case, a vertex can be assigned to one or more of these segments (a vertex may be split into several parts, each assigned to one of these segments). Later, in Section 5.4, we introduce how to assign the vertices to the segments, and show how the assignment affects the capacity requirement of the resulting sub-task set. For the moment, we just assume that each vertex v_{i,j} has been assigned to one or more of the segments covered by its lifetime window [r_{i,j}, f_{i,j}]. Assume a task is divided into X segments, denoted by s_i^1, ..., s_i^X. We use d_i^x to denote the length of segment s_i^x, and e_i^x to denote the total amount of workload of every vertex (or part of a vertex) assigned to segment s_i^x. Then we know

$$L_i = \sum_{x=1}^{X} d_i^x, \qquad C_i = \sum_{x=1}^{X} e_i^x \qquad (2)$$

Definition 1. The density of a segment s_i^x is defined as:

$$h_i^x = \frac{e_i^x}{d_i^x} \qquad (3)$$

Each segment consists of several threads, and each thread corresponds to a part of a vertex over the segment (or a vertex that falls completely within this segment). Assuming that each thread must be executed in the segment where it lies, the task is thus transformed into a set of threads belonging to different segments, each with its own release time and relative deadline.
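As a small sanity-check sketch of equations (2) and (3) (names ours):

```python
def segment_densities(d, e, L, C):
    """d[x]: segment lengths; e[x]: assigned workloads.
    Checks equation (2) and returns the densities of equation (3)."""
    assert abs(sum(d) - L) < 1e-9 and abs(sum(e) - C) < 1e-9  # eq. (2)
    return [ex / dx for ex, dx in zip(e, d)]                  # eq. (3)
```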

Fig. 4 shows an example of segmentation for task τ_i in Fig. 1. The timing diagram is divided into 5 segments, where v_{i,2} covers segments s_i^2, s_i^3, s_i^4, and v_{i,6} covers s_i^2, s_i^3. In this example, the workload of v_{i,2} is split into three parts with execution times of 5, 1 and 2, assigned to s_i^2, s_i^3 and s_i^4 respectively, and v_{i,6} is split into two parts with execution times of 4 and 2, assigned to s_i^2 and s_i^3. Recall that we only use this example to illustrate the related concepts; the concrete segmentation strategy is presented later in Section 5.4.

Figure 4: A segmentation example of the task in Figure 1.

5.3. Laxity Distribution

After segmentation, we need to carefully assign the total laxity D_i − L_i to the segments to make the capacity requirement of the heavy task as small as possible. In other words, we need to distribute the total laxity D_i − L_i of the task among the individual segments in a balanced manner, to avoid a workload burst caused by any segment.

Lemma 1. The capacity requirement of the resulting sub-task set of τ_i is ℏ_i = max_{x=1..X} {h_i^x}.

Proof. Since different segments cannot execute at the same time, the instantaneous total density of the resulting sporadic sequential task set is upper bounded by the maximum density of a segment, i.e., max_{t>0} {ℓ_i(t)} = max_{x=1..X} {h_i^x}. Then, according to Equation (1), the lemma is proved.

To minimize the capacity requirement of τ_i, it is sufficient to minimize the maximum density among all segments. In other words, we will stretch the segments, such that the total length of all segments grows from L_i to D_i, so as to minimize max_{x=1..X} {h_i^x}. In the following we first introduce an optimal algorithm for laxity distribution, in the sense of minimizing max_{x=1..X} {h_i^x} under a given segmentation. Recall that in this subsection we suppose that the segmentation has already finished, i.e., all workload has been assigned to its dedicated segments. Without loss of generality, after the segmentation we order all segments by non-increasing density and denote them as {s_i^1, s_i^2, ..., s_i^X}, where h_i^x ≥ h_i^{x+1} holds for all 1 ≤ x ≤ X − 1 (this order of segments will be used in the rest of the paper). Intuitively, we can construct an ideal algorithm that distributes the laxity such that max_{x=1..X} {h_i^x} is minimized, by the following iterative steps:

• Step 1: Assign the remaining laxity (D_i − L_i at the beginning) gradually to the segment with the largest density, i.e., s_i^1, until either no laxity is left and the algorithm stops, or h_i^1 = h_i^2, in which case go to Step 2.

• Step 2: Let k denote the maximum index among all segments for which h_i^1 = h_i^2 = ... = h_i^k holds. Assign the remaining laxity to all segments s_i^1, s_i^2, ..., s_i^k simultaneously, in a manner that keeps h_i^1 = h_i^2 = ... = h_i^k, until either no laxity is left and the algorithm stops, or h_i^1 = h_i^2 = ... = h_i^k = h_i^{k+1}, in which case go to Step 2 again.

The pseudo-code of the above algorithm is shown in Algorithm 1, where ∆ is an arbitrarily small positive number. Fig. 5 shows the laxity distribution by Algorithm 1 for the segmentation shown in Fig. 4.

ALGORITHM 1: The synthetic algorithm of laxity distribution.
1: j ← 1, h ← e_i^1/d_i^1
2: repeat
3:   while h > e_i^{j+1}/d_i^{j+1} do
4:     h ← h − ∆
5:     for each segment s_i^x, x ∈ [1, ..., j] do
6:       ∆_i^x ← e_i^x/h − d_i^x, d_i^x ← d_i^x + ∆_i^x
7:       laxity ← laxity − ∆_i^x
8:     end for
9:     if laxity = 0 then
10:      break
11:    end if
12:  end while
13:  j ← j + 1
14: end repeat

Figure 5: Laxity distribution of the segmentation in Figure 4.

Let d_i'^x denote the length of each segment after laxity distribution by Algorithm 1, and let k denote the largest index j among all segments such that d_i'^j > d_i^j. Then we can classify all segments into two types:

• Segments without laxity assigned: each segment s_i^x where x > k. No laxity is assigned to these segments, and their densities remain the same after laxity distribution.

• Segments with laxity assigned: each segment s_i^x where x ≤ k. These segments are assigned some laxity, such that their densities become equal after the laxity distribution. Let ℏ_i denote their equal density; we have h_i^1 = h_i^2 = ... = h_i^k = ℏ_i and ℏ_i > h_i^x for all x > k. The laxity L_i^j assigned to segment s_i^j can be calculated by:

$$L_i^j = \frac{e_i^j}{\hbar_i} - d_i^j, \quad 1 \le j \le k \qquad (4)$$

Clearly, the maximum density among all segments, max_{x=1..X} {h_i^x}, equals ℏ_i after laxity distribution by Algorithm 1.

Theorem 1. Algorithm 1 is optimal in the sense of minimizing max_{x=1..X} {h_i^x}.

Proof. We prove the theorem by contradiction. Suppose that after the laxity distribution by Algorithm 1 we have h_i^1 = h_i^2 = ... = h_i^k = ℏ_i, and the other segments satisfy h_i^x = e_i^x/d_i^x ≤ ℏ_i for k < x ≤ X. Then we have

$$\hbar_i = \frac{\sum_{j=1}^{k} e_i^j}{\sum_{j=1}^{k} d_i^j + (D_i - L_i)} \qquad (5)$$

Let ℏ_i' denote the maximum density among all segments, and d_i'^x the length of segment s_i^x, after laxity distribution by any other method. If ℏ_i' < ℏ_i, it must be the case that for each segment s_i^x with 1 ≤ x ≤ k:

$$\frac{e_i^x}{d_i'^x} < \frac{\sum_{j=1}^{k} e_i^j}{\sum_{j=1}^{k} d_i^j + (D_i - L_i)}$$

Then we have Σ_{j=1}^{k} d_i'^j > Σ_{j=1}^{k} d_i^j + (D_i − L_i), and thus

$$\sum_{x=k+1}^{X} d_i'^x = D_i - \sum_{j=1}^{k} d_i'^j < L_i - \sum_{j=1}^{k} d_i^j = \sum_{x=k+1}^{X} d_i^x$$

reaching a contradiction: segment lengths can only grow when laxity is added, so the segments with index greater than k, to which Algorithm 1 assigns no laxity, cannot become shorter under any other method.

Since ∆ is an arbitrarily small positive number, it is unrealistic to realize the ideal laxity distribution behavior of Algorithm 1 directly. However, the only information we need is the resulting value ℏ_i of Algorithm 1, with which we can assign the laxity to each segment according to (4). Thus, in the following, we focus on how to obtain the value of ℏ_i. We first introduce some useful notions and related properties.
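Given the resulting value ℏ_i, equation (4) fixes the stretched segment lengths directly; a one-line sketch (name ours):

```python
def stretch_segments(d, e, hbar):
    """New segment lengths after laxity distribution, per equation (4):
    every segment denser than hbar grows to e/hbar, the rest keep their
    lengths; the added lengths sum to exactly D_i - L_i by Lemma 3."""
    return [ex / hbar if ex / dx > hbar else dx for ex, dx in zip(e, d)]
```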

Definition 2. Let h denote a threshold and s_i(h) denote the set of segments with density above the threshold, i.e., ∀s_i^x ∈ s_i(h): e_i^x/d_i^x > h. The overflow workload C_out(h) over threshold h is defined by

$$C_{out}(h) = \sum_{s_i^x \in s_i(h)} (e_i^x - d_i^x \times h)$$

From Definition 2, it can be seen that the value of C_out(h) is monotonically decreasing in h. Furthermore, we have the following properties.

Lemma 2. Let ∆ denote an arbitrary positive number. We have:

$$C_{out}(h) - \Delta \sum_{s_i^x \in s_i(h)} d_i^x \le C_{out}(h + \Delta)$$

Proof. From Definition 2, we have s_i(h + ∆) ⊆ s_i(h). Then:

$$C_{out}(h) - \Delta \sum_{s_i^x \in s_i(h)} d_i^x = \sum_{s_i^x \in s_i(h)} (e_i^x - (h + \Delta) d_i^x) = \sum_{s_i^x \in s_i(h) - s_i(h+\Delta)} (e_i^x - (h + \Delta) d_i^x) + C_{out}(h + \Delta)$$

Since Σ_{s_i^x ∈ s_i(h) − s_i(h+∆)} (e_i^x − (h + ∆) d_i^x) ≤ 0, the lemma is proved.

Lemma 3. The resulting ℏ_i of Algorithm 1 is the only value that satisfies

$$C_{out}(\hbar_i) = \hbar_i \times (D_i - L_i)$$

Proof. Suppose that after the laxity distribution by Algorithm 1 we have h_i^1 = ... = h_i^k = ℏ_i, thus e_i^x/d_i^x ≥ ℏ_i for 1 ≤ x ≤ k, and h_i^x = e_i^x/d_i^x ≤ ℏ_i for k < x ≤ X. From (5), we have

$$\hbar_i = \frac{\sum_{x=1}^{k} e_i^x}{\sum_{x=1}^{k} d_i^x + (D_i - L_i)}$$

Thus

$$\sum_{x=1}^{k} (e_i^x - d_i^x \times \hbar_i) = \hbar_i \times (D_i - L_i)$$

Since e_i^x − d_i^x × ℏ_i = 0 holds for each segment satisfying e_i^x/d_i^x = ℏ_i, from Definition 2:

$$C_{out}(\hbar_i) = \sum_{x=1}^{k} (e_i^x - d_i^x \times \hbar_i)$$

and we have C_out(ℏ_i) = ℏ_i × (D_i − L_i). Moreover, the value of C_out(h) is monotonically decreasing in h, while the value of h × (D_i − L_i) is monotonically increasing in h, so ℏ_i is the only value at which the two sides coincide, and the lemma is proved.

According to Lemma 3, to calculate ℏ_i we need to find a threshold h at which the overflow workload C_out(h) exactly fulfills the gap h(D_i − L_i).

Lemma 4. ℏ_i ≥ C_i/D_i holds for the resulting ℏ_i of Algorithm 1.

Proof. We prove this lemma by contradiction. Suppose that ℏ_i < C_i/D_i after the laxity distribution by Algorithm 1. Then we have e_i^x/d_i'^x < C_i/D_i for each segment. Thus

$$\sum_{x=1}^{X} e_i^x < \frac{C_i}{D_i} \sum_{x=1}^{X} d_i'^x$$

Since Σ_{x=1}^{X} d_i'^x = D_i, we have Σ_{x=1}^{X} e_i^x < C_i, reaching a contradiction.

Now, based on Lemma 3 and Lemma 4, we are ready to construct a systematic algorithm with polynomial complexity to find ℏ_i. The details of our algorithm are as follows:

• Step 1: Set the threshold h = C_i/D_i.

• Step 2: Test whether C_out(h) = h × (D_i − L_i) holds. If not, the algorithm goes to Step 3. Otherwise, we have ℏ_i = h, and the algorithm returns ℏ_i.

• Step 3: Compute the minimum incremental quantity ∆ of h by

$$\Delta = \frac{C_{out}(h) - h \times (D_i - L_i)}{(D_i - L_i) + \sum_{s_i^x \in s_i(h)} d_i^x} = \frac{C_{out}(h) - h \times (D_i - L_i)}{(D_i - L_i) + d_i^h} \qquad (6)$$

and update h to h + ∆. Then go to Step 2.

The pseudo-code of the above algorithm is shown in Algorithm 2.

ALGORITHM 2: The computation of the minimum capacity requirement.
1: h ← C_i/D_i
2: repeat
3:   C_out(h) ← 0, d_i^h ← 0
4:   for each segment s_i^x, x ∈ [1, ..., X] do
5:     if h_i^x > h then
6:       C_out(h) ← C_out(h) + e_i^x − d_i^x × h
7:       d_i^h ← d_i^h + d_i^x
8:     end if
9:   end for
10:  if C_out(h) = h × (D_i − L_i) then
11:    return h
12:  else
13:    h ← h + (C_out(h) − h × (D_i − L_i))/((D_i − L_i) + d_i^h)
14:  end if
15: end repeat

Briefly, the algorithm starts by testing the smallest possible value of h for ℏ_i. If this value of h does not turn out to be the resulting value of ℏ_i of Algorithm 1 according to Lemma 3, it tries a larger value of h, until no larger h can be found. In the following, we show that Algorithm 2 always finds the resulting value of ℏ_i of Algorithm 1 (i.e., the increment ∆ never causes the algorithm to skip over the value of ℏ_i). It is sufficient to show that ℏ_i ≥ h + ∆ after each iteration of Step 3. Let h_0 denote the threshold of the current iteration, where ℏ_i ≥ h_0 (this is safe since ℏ_i ≥ C_i/D_i at the beginning of the iteration). Then, from (6), we have:

$$(\Delta + h_0)(D_i - L_i) = C_{out}(h_0) - \Delta \sum_{s_i^x \in s_i(h_0)} d_i^x$$

and then from Lemma 2, we have:

$$(\Delta + h_0)(D_i - L_i) \le C_{out}(h_0 + \Delta)$$

From Lemma 3, we then know ℏ_i ≥ h_0 + ∆.

Once the minimum capacity requirement ℏ_i has been computed by Algorithm 2, the length of each segment d_i'^x after laxity distribution can be computed from ℏ_i. That is, for each segment with density no greater than ℏ_i, no laxity is assigned. For those segments with densities strictly greater than ℏ_i, the laxity assigned to each segment s_i^x among them is calculated according to equation (4).
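The following is a direct Python transcription of Algorithm 2 (a sketch under floating-point arithmetic; an exact rational implementation would compare without the tolerance eps):

```python
def min_capacity_requirement(d, e, D, L, eps=1e-12):
    """Compute hbar_i for segments of lengths d and workloads e of a heavy
    task with deadline D and longest path L (Algorithm 2)."""
    h = sum(e) / D                                   # Lemma 4: hbar >= C/D
    while True:
        over = [(ex, dx) for ex, dx in zip(e, d) if ex / dx > h]
        c_out = sum(ex - dx * h for ex, dx in over)  # overflow, Definition 2
        gap = h * (D - L)
        if abs(c_out - gap) <= eps:                  # fixed point of Lemma 3
            return h
        d_h = sum(dx for _, dx in over)
        h += (c_out - gap) / ((D - L) + d_h)         # increment, equation (6)
```

Each update solves the fixed-point equation exactly for the current set s_i(h), so the loop either terminates or drops at least one segment out of s_i(h); it therefore runs at most X times.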

5.4. An Optimal Segmentation

According to Lemma 3, to minimize ℏ_i we need to make C_out(h) as small as possible. Based on this, we construct a systematic algorithm to assign the workload of the vertices to the segments. We use an ordered list C_r to store the total amount of work that has not been assigned to any segment; the elements of C_r represent the remaining work of the vertices, denoted by ω_{i,j}, kept in increasing order of their latest finish times. We use C_i^x to store the work of vertices that has already been assigned to segment s_i^x (we also use C_r and C_i^x to denote the total amount of work in C_r and C_i^x, respectively). At the beginning, we assign the work of each vertex that covers only a single segment to its dedicated segment, i.e., we move these vertices from C_r to C_i^x. After this, a threshold h is initialized to C_i/D_i according to Lemma 4. We call a segment s_i^x light if it satisfies C_i^x/d_i^x < h. Now we begin to assign the remaining work in C_r to the segments by the following steps:

• Step 1: Assign the remaining work in C_r to light segments as much as possible. In this step, we do the assignment to light segments one by one, in their time order. For each light segment s_i^x, we assign the work of the vertices that cover s_i^x, according to their order in C_r. Note that the work of any vertex that can be assigned to a segment s_i^x is no more than its length, i.e., d_i^x. Let ϖ_{i,j}^x denote the amount of work of v_{i,j} that has already been assigned to s_i^x. The remaining work ω_{i,j} of a vertex can be entirely assigned to a light segment s_i^x if s_i^x still remains light after adding ω_{i,j} and ϖ_{i,j}^x + ω_{i,j} ≤ d_i^x. In this case, we move ω_{i,j} from C_r to C_i^x, and go into the next iteration to assign the next vertex in C_r that covers s_i^x. Otherwise, we need to consider two cases:

– The remaining work ω_{i,j} > d_i^x − ϖ_{i,j}^x, and s_i^x still remains light after adding at most d_i^x − ϖ_{i,j}^x work from ω_{i,j}: in this case, we assign d_i^x − ϖ_{i,j}^x work to s_i^x.

– Otherwise: we move ω'_{i,j} work from ω_{i,j} to s_i^x such that, after adding ω'_{i,j} to C_i^x, C_i^x/d_i^x exactly reaches the threshold h; in this case, the assignment of C_r to the current segment is over.

• Step 2: Compute a new threshold. Let s_i(h) denote the set of all segments with densities no smaller than h, i.e., each segment s_i^x ∈ s_i(h) has C_i^x/d_i^x ≥ h. Then we calculate C_out(h). In general, C_out(h) consists of two parts. One part involves the segments whose densities are greater than h (including each vertex that covers only a single segment), and can be calculated by Σ_{s_i^x ∈ s_i(h)} (C_i^x − h × d_i^x). The other part is the remaining unassigned work in C_r. C_out(h) is the sum of the workload of these two parts. Then we compute the minimum incremental quantity ∆ of h by equation (6). Note that s_i(h) here differs from that in equation (6), where s_i(h) is defined as the set of segments whose densities are strictly greater than h, i.e., C_i^x/d_i^x > h (this is because segments with C_i^x/d_i^x = h may still be assigned some workload and become heavy, while segments with C_i^x/d_i^x < h cannot be assigned any workload). If ∆ = 0, then we have ℏ_i = h and go to Step 3. Otherwise, h is updated to h + ∆ and we go to Step 1.

• Step 3: Assign the remaining work in C_r, if any, to the heavy segments arbitrarily. After Step 2, the remaining work in C_r can only be assigned to heavy segments. Since ℏ_i only depends on the total amount of work C_out(h), we can assign a remaining element in C_r to the segments it covers arbitrarily. The only constraint is that the work of a vertex assigned to a segment s_i^x is at most d_i^x.

The pseudo-code of our segmentation algorithm is shown in Algorithm 3.

ALGORITHM 3: The segmentation algorithm.
1: for each element ω_{i,j} in C_r do
2:   if v_{i,j} only covers a single segment s_i^x then
3:     C_i^x ← C_i^x ∪ {ω_{i,j}}, C_r ← C_r \ {ω_{i,j}}
4:   end if
5: end for
6: h ← C_i/D_i
7: repeat
8:   for each light segment s_i^x, x ∈ [1, ..., X] do
9:     for each element ω_{i,j} in C_r that covers s_i^x, in their order in C_r do
10:      if (C_i^x + ω_{i,j})/d_i^x < h and ϖ_{i,j}^x + ω_{i,j} ≤ d_i^x then
11:        C_i^x ← C_i^x ∪ {ω_{i,j}}, C_r ← C_r \ {ω_{i,j}}
12:      else if (C_i^x + d_i^x − ϖ_{i,j}^x)/d_i^x < h and ω_{i,j} > d_i^x − ϖ_{i,j}^x then
13:        C_i^x ← C_i^x ∪ {d_i^x − ϖ_{i,j}^x}, ω_{i,j} ← ω_{i,j} + ϖ_{i,j}^x − d_i^x
14:      else
15:        ϖ_{i,j}^x ← ϖ_{i,j}^x + h × d_i^x − C_i^x, ω_{i,j} ← ω_{i,j} − (h × d_i^x − C_i^x)
16:        break
17:      end if
18:    end for
19:  end for
20:  calculate ∆ according to Equation (6)
21:  if ∆ = 0 then
22:    return ℏ_i = h
23:  else
24:    h ← h + ∆
25:  end if
26: end repeat
27: if C_r is not empty then
28:   arbitrarily assign the remaining work in C_r to the segments they cover
29: end if

We use the running example in Fig. 1 and Fig. 3 to illustrate Algorithm 3. At the beginning, v_{i,1} is assigned to s_i^1; v_{i,3}, v_{i,4} and v_{i,5} to s_i^2; v_{i,7} and v_{i,8} to s_i^3; v_{i,9} to s_i^4; and v_{i,10} to s_i^5. h is initialized to C_i/D_i = 2. Then we visit each segment in time order.

• For s_i^1, s_i^2 and s_i^3, nothing can be assigned. For s_i^4, only one vertex, v_{i,2}, covers s_i^4, so 2 units of c_{i,2} are assigned to s_i^4 so that C_i^4/d_i^4 = h = 2. After that, ϖ_{i,2}^4 = 2 and ω_{i,2} = 8 − 2 = 6. For s_i^5, nothing can be assigned. At this point, the assignment of the current step is over. Now C_out(h) = 16, and h is updated to h + (C_out(h) − h(D_i − L_i))/((D_i − L_i) + Σ_{s_i^x∈s_i(h)} d_i^x) = 2 + (16 − 2 × 2)/(2 + 10) = 3.

• For s_i^1, s_i^4 and s_i^5, nothing can be assigned. There are two vertices, v_{i,2} and v_{i,6}, covering s_i^2; v_{i,6} is before v_{i,2} in C_r since f_{i,6} < f_{i,2}, so 2 units of c_{i,6} are assigned to s_i^2 and C_i^2/d_i^2 = h = 3. At this point ϖ_{i,6}^2 = 2 and ω_{i,6} = 6 − 2 = 4. For s_i^3, 2 units of c_{i,6} are assigned to s_i^3, such that ϖ_{i,6}^3 = 2 and ω_{i,6} = 4 − 2 = 2. Now h is updated to 3 + (8 − 6)/(8 + 2) = 16/5.

• In the same way as in the previous step, 6/5 units of v_{i,6} are assigned to s_i^2 such that C_i^2/d_i^2 = h = 16/5, and then ϖ_{i,6}^2 = 2 + 6/5 = 16/5 and ω_{i,6} = 4/5. For s_i^3, although both v_{i,2} and v_{i,6} cover this segment, 2/5 units of v_{i,2} are assigned to s_i^3 and ϖ_{i,2}^3 = 2/5, since ϖ_{i,6}^3 = d_i^3 = 2; we then have ω_{i,2} = 6 − 2/5 = 28/5. Now the assignment of the current step is over. At this point, as C_out(h) − h(D_i − L_i) = 32/5 − 2 × 16/5 = 0, no more values of h can be updated, and ℏ_i = 16/5.

• After the last step, ω_{i,6} = 4/5 and ω_{i,2} = 28/5; we can arbitrarily assign them to any segment they cover. In this example we assign all of them to s_i^2, so that finally ϖ_{i,6}^2 = 4 and ϖ_{i,2}^2 = 28/5.

The final result of the segmentation is shown in Figure 6.

Figure 6: Segmentation of the task example in Figure 4.

Lemma 5. The segmentation algorithm in Algorithm 3 is optimal in the sense of resulting in the minimum C_out(h) for a given h.

The proof of Lemma 5 is similar to the proof of Theorem 4 in [11] and is thus omitted here. There are two keys to the optimality of Algorithm 3. First, in each iteration, we always try to assign vertices to segments so that their densities just reach the threshold h; therefore, the assigned workload is not counted into C_out(h). Second, in Step 2, the workload of different vertices is assigned to segments in an EDF-like manner, which ensures the optimality of allocating their workload so as to make as many segments meet the threshold as possible.

Theorem 2. The segmentation algorithm in Algorithm 3 is optimal in the sense of resulting in the minimal capacity requirement ℏ_i.

Proof. The proof is by contradiction. Let ℏ_i' denote the maximum density among all segments under another segmentation, where ℏ_i' < ℏ_i, and let C_out'(·) denote the overflow workload under that segmentation. From Lemma 3, we have C_out(ℏ_i) = ℏ_i × (D_i − L_i) and

$$C_{out}'(\hbar_i') = \hbar_i' \times (D_i - L_i)$$

As ℏ_i' < ℏ_i, we have C_out'(ℏ_i') ≥ C_out'(ℏ_i), thus

$$\hbar_i' \times (D_i - L_i) \ge C_{out}'(\hbar_i)$$

From Lemma 5, C_out'(ℏ_i) ≥ C_out(ℏ_i), thus

$$\hbar_i' \times (D_i - L_i) \ge \hbar_i \times (D_i - L_i) \Rightarrow \hbar_i' \ge \hbar_i$$

reaching a contradiction.

Putting all the pieces together, we now have the final version of the decomposition algorithm (denoted by AL-DECOM) for each heavy task τ_i: (i) represent τ_i as a timing diagram, (ii) do the segmentation according to Algorithm 3 and obtain the minimum capacity requirement ℏ_i of τ_i, and (iii) distribute the total laxity to the segments according to equation (4).

6. Processor Allocation

In the following, we introduce how to assign processors to each heavy task τ_i according to ℏ_i. Before that, we first present how we schedule the workload belonging to different vertices in their dedicated segments on a given number of processors. The only constraint is that the workload of the same vertex cannot execute in parallel. Based on this, we adopt a simple algorithm, called the TH (Tail to Head) scheduling algorithm, to schedule the workload in each segment. We use our running example to demonstrate the scheduling algorithm. As shown in Fig. 7, we schedule the workload belonging to different vertices in the same segment in a head-to-tail manner. That is, we iteratively arrange the workload of each vertex on a processor from the head of the segment. If the workload of the last vertex cannot be fully arranged on this processor, its workload is split into two parts: one part is arranged on this processor at the end of the segment, while the other part is arranged on another processor at the beginning of the segment.

Figure 7: Scheduling of the workload in each segment.

Theorem 3. A heavy task τ_i is schedulable on m processors by AL-DECOM and the TH scheduling algorithm if m ≥ ⌈ℏ_i⌉.

Proof. From Algorithm 2, the density of each segment after our decomposition method is no greater than the capacity requirement, i.e., e_i^x/d_i'^x ≤ ℏ_i. Then we have m × d_i'^x ≥ ℏ_i × d_i'^x ≥ e_i^x. Since the work of a vertex assigned to s_i^x is never greater than d_i^x (from Algorithm 3), by the TH scheduling algorithm shown in Fig. 7, the workload in each segment can be arranged on m processors, and thus task τ_i meets its deadline on these processors.

According to Theorem 3, we assign ⌈ℏ_i⌉ processors to each heavy task τ_i. As introduced in Section 4, all light tasks are scheduled together on the remaining processors by any sequential scheduling algorithm, such as partitioned EDF. If the total number of processors is enough for both the heavy tasks and the light tasks to be schedulable, we can conclude that the DAG task set is schedulable. We call the overall approach (consisting of AL-DECOM, the TH scheduling algorithm and the scheduling of light tasks) decomposition-based federated scheduling.

Corollary 1. The schedulability of decomposition-based federated scheduling strictly dominates that of federated scheduling in [6].

Proof. According to [6], for each task τ_i, the number of dedicated processors is calculated by m ≥ ⌈(C_i − L_i)/(D_i − L_i)⌉. Since ℏ_i = C_out(ℏ_i)/(D_i − L_i) and C_out(ℏ_i) ≤ C_i − L_i (from Definition 2), from Theorem 3 we know that the total number of processors required by all heavy tasks under decomposition-based federated scheduling is no more than that under federated scheduling. Since light tasks are scheduled in the same way as under federated scheduling, the total number of processors required for a task set to be schedulable under decomposition-based federated scheduling is no more than that under federated scheduling. In other words, if a task set is schedulable on a given number of processors by the federated scheduling algorithm, it is also schedulable by decomposition-based federated scheduling.
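A sketch of the TH packing within a single segment (a McNaughton-style wrap-around; names ours). The wrap guarantees that the two parts of a split vertex occupy disjoint time intervals, since each vertex's work in a segment never exceeds the segment length:

```python
def th_schedule(pieces, seg_len, m):
    """pieces: list of (vertex, work) with work <= seg_len and total work
    <= m * seg_len; returns {processor: [(vertex, start, end), ...]}."""
    assert sum(w for _, w in pieces) <= m * seg_len
    sched = {p: [] for p in range(m)}
    proc, t = 0, 0.0
    for v, w in pieces:
        while w > 1e-12:
            part = min(w, seg_len - t)
            sched[proc].append((v, t, t + part))
            w -= part
            t += part
            if t >= seg_len:  # wrap to the head of the next processor
                proc, t = proc + 1, 0.0
    return sched
```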

7. Evaluations

In this section, we evaluate our approach with both synthetic workloads and realistic OpenMP programs, on a hardware platform with an Intel i7-7820HQ CPU at 2.90 GHz, a cache size of 8 MB and a total memory of 4 GB. In particular, we compare the schedulability of our method with five representative algorithms covering all three paradigms: (1) the decomposition-based EDF scheduling and analysis techniques in [11], denoted by D-XU; (2) the schedulability test based on capacity augmentation bounds for global EDF scheduling in [12], denoted by G-LI, and the schedulability test based on response time analysis for global EDF scheduling in [13], denoted by G-MEL; (3) federated scheduling: (i) the schedulability test based on the processor allocation strategy in [6], denoted by F-LI, and (ii) the schedulability test based on the processor allocation strategy in [9], denoted by SF-XU. Other methods are not included in our comparison because they are either theoretically dominated or significantly outperformed (in empirical evaluations) by one of the above methods.

7.1. Synthetic Workload

The task sets are generated using the Erdős–Rényi method G(n_i, p) [14]. For each task, the number of vertices is randomly chosen in the range [50, 250], and the worst-case execution time of each vertex is randomly picked in the range [50, 100]. We pick periods that are powers of two: we find the smallest value a such that L_i ≤ 2^a, and randomly set T_i to one of 2^a, 2^{a+1} or 2^{a+2}. The ratio L_i/T_i of the task is thus in the range [1, 1/2], (1/2, 1/4] or (1/4, 1/8] when its period T_i is 2^a, 2^{a+1} or 2^{a+2}, respectively. For each possible edge, we generate a random value in the range [0, 1] and add the edge to the graph only if the generated value is less than a predefined threshold p = 0.1. As in [4], we also add a minimum number of additional edges to make each task graph weakly connected. To generate a task set, we first generate heavy tasks until the total utilization exceeds U − 1, where U is the target total utilization, and then generate light tasks. For each parameter configuration, we generate 1000 task sets.

In Figure 8, task sets are generated with different normalized utilizations, and task periods are randomly generated such that L_i/T_i is in the range [1/8, 1]. As shown in the results, the maximum normalized utilization of a task set that can be scheduled by our method is around 0.8, while all other methods are below 0.7. It can also be observed that the schedulability of all methods decreases as the number of processors increases.

Figure 8: Comparison of acceptance ratio with different normalized utilization (m denotes the number of processors); panels (a) m = 8, (b) m = 16, (c) m = 32; curves D-OUR, D-XU, G-MEL, G-LI, F-LI, SF-XU.

Figure 9 follows the same setting as Figure 8, but task periods are generated with different ratios T_i/L_i (corresponding to the x-axis), and the normalized utilization of each task set is randomly chosen from [0.1, 1]. When T_i/L_i is very small, the tasks are difficult to schedule. It is clear that when T_i/L_i is around 1, rare task sets can be scheduled by all other methods, while a large portion of task sets can be scheduled by our method. As T_i/L_i increases, the gap between our method and federated scheduling becomes smaller, and they merge again as T_i/L_i continues to increase. This is because when T_i/L_i is very large, almost all tasks are light, in which case there is no difference between federated scheduling and our method. The global scheduling tests G-LI directly use the value of L_i/T_i, and thus exhibit sharp increases at certain L_i/T_i values.

Figure 9: Comparison of acceptance ratio with different T_i/L_i (m denotes the number of processors); panels (a) m = 8, (b) m = 16, (c) m = 32.
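A sketch of the task generation procedure (p = 0.1; the weak-connectivity repair is omitted, the deadline is set implicit here for simplicity, and DagTask is the sketch from Section 2):

```python
import math
import random

def gen_dag_task(p=0.1):
    n = random.randint(50, 250)
    wcet = {v: random.randint(50, 100) for v in range(n)}
    # orienting every edge from lower to higher index keeps the graph acyclic
    edges = [(u, v) for u in range(n) for v in range(u + 1, n)
             if random.random() < p]
    task = DagTask(wcet, edges, period=0.0, deadline=0.0)
    L = task.longest_path()
    a = math.ceil(math.log2(L))                       # smallest a with L <= 2**a
    task.period = 2.0 ** (a + random.choice([0, 1, 2]))
    task.deadline = task.period                       # assumption: D_i = T_i
    return task
```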

7.2. Realistic Benchmarks

We also collect 18 OpenMP programs written in C from several benchmark suites (see Table 1) and transform them into DAGs by a tool called ompTG [15]. Columns 3-6 show the features of each application, where LOC stands for lines of code.

Table 1: Summary of OpenMP programs in ompTGB.

Benchmark              | Application          | LOC  | Tasks | Vertices | Edges
-----------------------|----------------------|------|-------|----------|------
bots-1.1.2 [16]        | alignment.for        | 637  | 191   | 400      | 399
                       | alignment.single     | 637  | 191   | 400      | 399
                       | fft                  | 4447 | 81    | 227      | 304
                       | fib                  | 118  | 177   | 353      | 528
                       | sort                 | 237  | 66    | 130      | 193
                       | sparselu.gcc.for     | 218  | 136   | 280      | 279
                       | sparselu.gcc.single  | 213  | 137   | 301      | 435
                       | strassen             | 735  | 58    | 122      | 177
dash_1.0 [17]          | dense_algebra        | 1479 | 85    | 176      | 175
                       | finite_state_machine | 641  | 33    | 64       | 63
                       | nobody_methods       | 5478 | 151   | 320      | 469
openmpbench_C_v31 [18] | taskbench            | 167  | 109   | 216      | 311
OpenMPMicro [19]       | MatrixMultiplication | 111  | 65    | 128      | 127
                       | Square               | 246  | 151   | 300      | 299
openmpmpi [20]         | pt_to_pt_pingpong    | 309  | 204   | 408      | 604
                       | pt_to_pt_overlap     | 313  | 204   | 408      | 604
spec2012 [21]          | botsalgn             | 852  | 191   | 400      | 399
                       | botsspar             | 915  | 136   | 290      | 424

In Figure 10, for each task set we randomly choose n tasks from the benchmarks, where n is randomly chosen from {2, 5, 8}. T_i is generated by the same method as in Figure 8, such that L_i/T_i is in the range [1/8, 1]. The number of processors is calculated by m = ⌈(Σ C_i/T_i)/U⌉, where U is the normalized utilization, indicated on the x-axis of Figure 10. For each configuration, we generate 1000 task sets. The results show the same trend as Figure 8, while all methods perform even better than under the synthetic workload. This is because the workload of the real parallel tasks is relatively uniform, and few tasks have extreme structures in comparison with randomly generated task sets.

Figure 10: Comparison of acceptance ratio with different normalized utilization; panels (a) n = 2, (b) n = 5, (c) n = 8.

In Figure 11, for each task set we randomly choose n tasks from the benchmarks, where n is randomly chosen from {2, 5, 8}. The periods of the tasks are generated in the same way as in Figure 9. The number of processors is calculated by m = ⌈(Σ C_i/T_i)/U⌉, where U is the normalized utilization, randomly chosen from [0.1, 1]. In comparison with the synthetic task sets, the gap between our method and federated scheduling is much bigger. This indicates that our method can take full advantage of the intra-task structure information of these realistic programs and make them easier to schedule. In contrast, due to its ignorance of such information, the schedulability of federated scheduling is quite pessimistic, even though the resource waste problem has been eased to some extent by semi-federated scheduling.

Figure 11: Comparison of acceptance ratio with different T_i/L_i; panels (a) n = 2, (b) n = 5, (c) n = 8.



8. Related Work

Early work on real-time scheduling of parallel tasks assumed constrained task structures [22, 23, 24, 25]. For example, a Gang EDF scheduling algorithm was proposed in [24] for moldable parallel tasks, where a task can only execute on a fixed number of processors. The synchronous task model was studied in [25, 10, 26, 27, 28, 29, 30].

Recently, the scheduling problem of multiple DAGs has been studied with decomposition-based algorithms. In [4], a capacity augmentation bound of 4 was proved under GEDF. A schedulability test in [5] achieves a capacity bound lower than 4 in most cases, while in other cases above 4. In [31], a capacity bound of (3+√5)/2 was proved for some special task sets. For global scheduling without decomposition, a resource augmentation bound of 2 was proved in [32] for a single arbitrary-deadline DAG task. In [33] and [12], a resource augmentation bound of 2 − 1/m and a capacity augmentation bound of 4 − 2/m were proved. A pseudo-polynomial time sufficient schedulability test was presented in [33], which was later generalized and dominated by [34]. [12] proved the capacity augmentation bound (3+√5)/2 for GEDF (which is proved to be tight) and 3.732 for GRM. For federated scheduling, [12] provided a capacity augmentation bound of 2 for implicit-deadline tasks, which is also proved to be tight. In [35], a schedulability test for arbitrary-deadline DAGs was derived based on response-time analysis.

Among the three main paradigms of scheduling algorithms for parallel real-time tasks, none can claim to be a clear winner; each has its own strengths and weaknesses. All tasks may share processors under global scheduling algorithms (with or without decomposition), but the interference among different tasks is quite hard to analyze; moreover, one bad task, e.g., a task with a tight deadline, may cause the whole task set to be unschedulable. Under federated scheduling, there is no interference among different tasks; however, the state-of-the-art federated scheduling approach ignores the intra-task information of parallel tasks and has difficulty scheduling tasks with tight deadlines. Besides DAG task models, researchers have also studied conditional parallel real-time tasks, where both fork-join and branching semantics exist in the same graph [13, 34, 8].

9. Conclusions

In this paper, we study the scheduling of parallel real-time tasks modeled by DAGs. We propose a new approach that significantly improves the schedulability of federated scheduling. The minimum capacity requirement of each task is calculated individually by our decomposition method, with the guidance of the intra-task structure information, and each task then executes exclusively on its dedicated processors, without interference from other tasks. Experiments with both realistic programs and randomly generated task sets show that our approach consistently outperforms the state-of-the-art, including both global scheduling and federated scheduling, not only for task sets consisting of tasks with tight deadlines but also for task sets under a wide range of parameter settings.

Note that, of the steps in the decomposition method that determine the minimum capacity requirement of each heavy task, the laxity distribution and the segmentation are proved to be optimal, while the timing diagram construction is not. If schedulability is to be improved further, heuristic strategies for this step should be investigated. This will be the direction of our future work.



References

[1] Q. Zhao, Z. Gu, H. Zeng, N. Zheng, Schedulability analysis and stack size minimization with preemption thresholds and mixed-criticality scheduling, Journal of Systems Architecture 83 (2018) 57–74.
[2] J. Zhou, J. Yan, K. Cao, Y. Tan, T. Wei, M. Chen, G. Zhang, X. Chen, S. Hu, Thermal-aware correlated two-level scheduling of real-time tasks with reduced processor energy on heterogeneous mpsocs, Journal of Systems Architecture 82 (2018) 1–11.
[3] J. Zhou, T. Wang, P. Cong, P. Lu, T. Wei, M. Chen, Cost and makespan-aware workflow scheduling in hybrid clouds, Journal of Systems Architecture 100 (2019) 101631.
[4] A. Saifullah, D. Ferry, J. Li, K. Agrawal, C. Lu, C. D. Gill, Parallel real-time scheduling of DAGs, IEEE Transactions on Parallel and Distributed Systems 25 (12) (2014) 3242–3252.
[5] M. Qamhieh, F. Fauberteau, L. George, S. Midonnet, Global EDF scheduling of directed acyclic graphs on multiprocessor systems, in: Proceedings of the 21st International Conference on Real-Time Networks and Systems, ACM, 2013, pp. 287–296.
[6] J. Li, J. J. Chen, K. Agrawal, C. Lu, C. Gill, A. Saifullah, Analysis of federated and global scheduling for parallel real-time tasks, in: Real-Time Systems (ECRTS), 2014 26th Euromicro Conference on, IEEE, 2014, pp. 85–96.
[7] S. Baruah, Federated scheduling of sporadic DAG task systems, in: Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International, IEEE, 2015, pp. 179–186.
[8] S. Baruah, The federated scheduling of systems of conditional sporadic DAG tasks, in: Proceedings of the 12th International Conference on Embedded Software, IEEE Press, 2015, pp. 1–10.
[9] X. Jiang, N. Guan, X. Long, W. Yi, Semi-federated scheduling of parallel real-time tasks on multiprocessors, in: 2017 IEEE Real-Time Systems Symposium (RTSS), 2017.
[10] A. Saifullah, J. Li, K. Agrawal, C. Lu, C. Gill, Multi-core real-time scheduling for generalized parallel task models, Real-Time Systems 49 (4) (2013) 404–435.
[11] X. Jiang, X. Long, N. Guan, H. Wan, On the decomposition-based global EDF scheduling of parallel real-time tasks, in: Real-Time Systems Symposium (RTSS), 2016 IEEE, IEEE, 2016, pp. 237–246.
[12] J. Li, K. Agrawal, C. Lu, C. Gill, Outstanding paper award: Analysis of global EDF for parallel tasks, in: Real-Time Systems (ECRTS), 2013 25th Euromicro Conference on, IEEE, 2013, pp. 3–13.
[13] A. Melani, M. Bertogna, V. Bonifaci, A. Marchetti-Spaccamela, G. C. Buttazzo, Response-time analysis of conditional DAG tasks in multiprocessor systems, in: Real-Time Systems (ECRTS), 2015 27th Euromicro Conference on, IEEE, 2015, pp. 211–221.
[14] D. Cordeiro, G. Mounié, S. Perarnau, D. Trystram, J.-M. Vincent, F. Wagner, Random graph generation for scheduling simulations, in: Proceedings of the 3rd International ICST Conference on Simulation Tools and Techniques, ICST, 2010, p. 60.
[15] Y. Wang, N. Guan, J. Sun, M. Lv, Q. He, T. He, W. Yi, Benchmarking OpenMP programs for real-time scheduling, in: RTCSA, IEEE, 2017, pp. 1–10.
[16] A. Duran, X. Teruel, R. Ferrer, X. Martorell, E. Ayguade, Barcelona OpenMP tasks suite: A set of benchmarks targeting the exploitation of task parallelism in OpenMP, in: Parallel Processing, 2009. ICPP'09. International Conference on, IEEE, 2009, pp. 124–131.
[17] V. Gajinov, S. Stipić, I. Erić, O. S. Unsal, E. Ayguadé, A. Cristal, DaSH: A benchmark suite for hybrid dataflow and shared memory programming models, in: Proceedings of the 11th ACM Conference on Computing Frontiers, ACM, 2014, p. 4.
[18] J. M. Bull, F. Reid, N. McDonnell, A microbenchmark suite for OpenMP tasks, in: International Workshop on OpenMP, Springer, 2012, pp. 271–274.
[19] V. V. Dimakopoulos, P. E. Hadjidoukas, G. C. Philos, A microbenchmark study of OpenMP overheads under nested parallelism, in: International Workshop on OpenMP, Springer, 2008, pp. 1–12.
[20] J. M. Bull, J. P. Enright, N. Ameer, A microbenchmark suite for mixed-mode OpenMP/MPI, in: IWOMP, Springer, 2009, pp. 118–131.
[21] M. S. Müller, J. Baron, W. C. Brantley, H. Feng, D. Hackenberg, R. Henschel, G. Jost, D. Molka, C. Parrott, J. Robichaux, et al., SPEC OMP2012 — an application benchmark suite for parallel systems using OpenMP, in: International Workshop on OpenMP, Springer, 2012, pp. 223–236.
[22] G. Manimaran, C. S. R. Murthy, K. Ramamritham, A new approach for scheduling of parallelizable tasks in real-time multiprocessor systems, Real-Time Systems 15 (1) (1998) 39–60.
[23] W. Y. Lee, L. Heejo, Optimal scheduling for real-time parallel tasks, IEICE Transactions on Information and Systems 89 (6) (2006) 1962–1966.
[24] S. Kato, Y. Ishikawa, Gang EDF scheduling of parallel task systems, in: Real-Time Systems Symposium, 2009, RTSS 2009. 30th IEEE, IEEE, 2009, pp. 459–468.
[25] K. Lakshmanan, S. Kato, R. Rajkumar, Scheduling parallel real-time tasks on multi-core processors, in: Real-Time Systems Symposium (RTSS), 2010 IEEE 31st, IEEE, 2010, pp. 259–268.
[26] J. Kim, H. Kim, K. Lakshmanan, R. R. Rajkumar, Parallel scheduling for cyber-physical systems: Analysis and case study on a self-driving car, in: Proceedings of the ACM/IEEE 4th International Conference on Cyber-Physical Systems, ACM, 2013, pp. 31–40.
[27] G. Nelissen, V. Berten, J. Goossens, D. Milojevic, Techniques optimizing the number of processors to schedule multi-threaded tasks, in: Real-Time Systems (ECRTS), 2012 24th Euromicro Conference on, IEEE, 2012, pp. 321–330.
[28] C. Maia, M. Bertogna, L. Nogueira, L. M. Pinho, Response-time analysis of synchronous parallel tasks in multiprocessor systems, in: Proceedings of the 22nd International Conference on Real-Time Networks and Systems, ACM, 2014, p. 3.
[29] B. Andersson, D. de Niz, Analyzing global-EDF for multiprocessor scheduling of parallel tasks, in: Principles of Distributed Systems, Springer, 2012, pp. 16–30.
[30] P. Axer, S. Quinton, M. Neukirchner, R. Ernst, B. Dobel, H. Hartig, Response-time analysis of parallel fork-join workloads with real-time constraints, in: Real-Time Systems (ECRTS), 2013 25th Euromicro Conference on, IEEE, 2013, pp. 215–224.
[31] M. Qamhieh, L. George, S. Midonnet, A stretching algorithm for parallel real-time DAG tasks on multiprocessor systems, in: Proceedings of the 22nd International Conference on Real-Time Networks and Systems, ACM, 2014, p. 13.
[32] S. Baruah, V. Bonifaci, A. Marchetti-Spaccamela, L. Stougie, A. Wiese, A generalized parallel task model for recurrent real-time processes, in: Real-Time Systems Symposium (RTSS), 2012 IEEE 33rd, IEEE, 2012, pp. 63–72.
[33] V. Bonifaci, A. Marchetti-Spaccamela, S. Stiller, A. Wiese, Feasibility analysis in the sporadic DAG task model, in: Real-Time Systems (ECRTS), 2013 25th Euromicro Conference on, IEEE, 2013, pp. 225–233.
[34] S. Baruah, Improved multiprocessor global schedulability analysis of sporadic DAG task systems, in: Real-Time Systems (ECRTS), 2014 26th Euromicro Conference on, IEEE, 2014, pp. 97–105.
[35] A. Parri, A. Biondi, M. Marinoni, Response time analysis for G-EDF and G-DM scheduling of sporadic DAG-tasks with arbitrary deadline, in: Proceedings of the 23rd International Conference on Real Time and Networks Systems, ACM, 2015, pp. 205–214.

Xu Jiang is currently working at the School of Computer Science and Engineering, University of Electronic Science and Technology of China. Dr. Jiang received his BS degree in computer science from Northwestern Polytechnical University, China in 2009, the MS degree in computer architecture from the Graduate School of the Second Research Institute of China Aerospace Science and Industry Corporation, China in 2012, and his PhD from Beihang University, China in 2018. His research interests include real-time systems, parallel and distributed systems, and embedded systems.

Nan Guan is currently an assistant professor at the Department of Computing, The Hong Kong Polytechnic University. Dr. Guan received his BE and MS from Northeastern University, China in 2003 and 2006 respectively, and a PhD from Uppsala University, Sweden in 2013. Before joining PolyU in 2015, he worked as a faculty member at Northeastern University, China. His research interests include real-time embedded systems and cyber-physical systems. He received the EDAA Outstanding Dissertation Award in 2014, the Best Paper Award of the IEEE Real-Time Systems Symposium (RTSS) in 2009, and the Best Paper Award of the Conference on Design Automation and Test in Europe (DATE) in 2013.

Xiang Long received his BS degree in Mathematics from Peking University, China in 1985, and the MS and PhD degrees in Computer Science from Beihang University, China in 1988 and 1994. He has been a professor at Beihang University since 1999. His research interests include parallel and distributed systems, computer architecture, real-time systems, embedded systems, and multi-/many-core oriented operating systems.

Yue Tang received the BS and MS degrees in computer science from Northeastern University, China in 2013 and 2015. She is currently a PhD student in the Department of Computing, The Hong Kong Polytechnic University. Her current research interests include modeling and analysis of real-time systems.

Qingqiang He received the BS degree in computer science and technology from Northeastern University, China, in 2014, and the MS degree in computer software and theory from Northeastern University, China, in 2017. He is now working toward the PhD degree at The Hong Kong Polytechnic University. His research interests include embedded real-time systems, real-time scheduling theory, and distributed ledgers.

Conflict of Interest

Weichen Liu, School of Computer Science and Engineering, Nanyang Technological University, Singapore, [email protected]

Wang Yi, Dept of Information Technology, Uppsala University, [email protected]