Joint task assignment and cache partitioning with cache locking for WCET minimization on MPSoC

J. Parallel Distrib. Comput. 71 (2011) 1473–1483

Contents lists available at SciVerse ScienceDirect

J. Parallel Distrib. Comput. journal homepage: www.elsevier.com/locate/jpdc

Joint task assignment and cache partitioning with cache locking for WCET minimization on MPSoC✩ Tiantian Liu ∗ , Yingchao Zhao, Minming Li, Chun Jason Xue Department of Computer Science, City University of Hong Kong, Hong Kong, China

Article info

Article history: Received 11 September 2010 Received in revised form 12 April 2011 Accepted 11 May 2011 Available online 19 May 2011 Keywords: MPSoC Worst-case analysis Task assignment Cache partitioning

Abstract: The cache locking technique is often utilized to guarantee a tighter prediction of the Worst-Case Execution Time (WCET), one of the most important performance metrics for embedded systems. However, in Multi-Processor System-on-Chip (MPSoC) systems with multiple tasks, the Level 2 (L2) cache is often shared among different tasks and cores, which makes cache behavior even harder to predict. Task assignment inherently affects cache behavior, while cache behavior in turn affects the efficiency of task assignment; together, task assignment and cache behavior have a dramatic influence on the overall WCET of an MPSoC. This paper proposes joint task assignment and cache partitioning techniques to minimize the overall WCET of MPSoC systems. Cache locking is applied to each task to guarantee a precise WCET. We prove that the joint problem is NP-hard and propose several efficient algorithms. Experimental results show that the proposed algorithms consistently reduce the overall WCET compared to previous techniques. © 2011 Elsevier Inc. All rights reserved.

1. Introduction

Caches are known for their effectiveness in bridging the gap between processor and memory speed, but are notorious for their unpredictability. Modern processors, such as the ARM9 [2] and MIPS32 series [23], often provide a cache locking capability, which allows the cache to be managed in a controllable fashion. Cache locking selects and locks some content of a specific program or data in the cache and prevents this content from being replaced at runtime, so that the cache is used in a more efficient and predictable way. Much research has been done on cache locking in single-core systems, but little in multi-core systems. With the advancement of semiconductors, the Multi-Processor System-on-Chip (MPSoC) architecture has become increasingly common in embedded systems, for example, ARM Cortex-A9 [2], MIPS32 74K [23] and IBM Xenon [35]. The MPSoC architecture is often utilized to support multi-task applications. Different tasks are assigned and scheduled on different cores based on predefined metrics. The Level 1 (L1) cache is usually a private cache used by the tasks on the same core, while the Level 2 (L2) cache is shared by multiple tasks executing on different cores simultaneously. The throughput, consistency, predictability and fairness of the shared cache have a dramatic impact on an MPSoC system's performance. In this paper,

✩ A preliminary version of this article appeared in Proceeding of ICPP2010, pp. 573–582. ∗ Corresponding author. E-mail addresses: [email protected] (T. Liu), [email protected] (Y. Zhao), [email protected] (M. Li), [email protected] (C.J. Xue).

0743-7315/$ – see front matter © 2011 Elsevier Inc. All rights reserved. doi:10.1016/j.jpdc.2011.05.006

task assignment with cache partitioning and locking is performed to minimize the overall Worst-Case Execution Time (WCET) on MPSoC architectures. The WCET is one of the most important performance metrics for embedded systems: schedulability analysis and resource deployment rely on the assumption that the systems' WCETs are known. However, caches introduce dynamic behavior that is hard to predict, thus leading to an unpredictable WCET. With embedded MPSoC systems becoming a trend, overall WCET estimation and reduction are both important and challenging. The overall WCET denotes the worst-case finishing time of a set of tasks running on multiple cores. Cache locking techniques can guarantee a tighter WCET prediction for single-core systems [25,21] by selecting the most significant code or data to preload into the cache and locking it there for a period. However, this technique cannot be directly applied to MPSoC systems. The presence of multiple cores and a shared cache offers the potential for multiple cores to enjoy fast accesses to shared code/data, but it also requires more complex resource management. To apply cache locking effectively, tasks should be assigned to cores using cache-aware assignment methods, and the shared L2 cache should be partitioned among cores so that cache locking can be applied to each core with the most effective amount of cache. Research efforts have been invested in task assignment and cache partitioning on MPSoC architectures independently; such decoupled techniques may not lead to a global optimum. The appropriate partitioning of the shared cache depends on the tasks assigned to the cores, and task assignment should in turn take into consideration the variation of a task's execution time with the cache size allocated. Therefore, task assignment and cache partitioning are deeply related and they


should be jointly considered to optimize the overall performance. By applying cache locking and partitioning together, intra-task and inter-task cache conflict misses and capacity misses can be eliminated, and cold misses can be predicted efficiently. By jointly considering task assignment and cache partitioning, we make task assignment decisions based on a more precise estimation of cache usage, reduce contention for the L2 cache, and further improve the system's performance. Part of the work in this paper was discussed in [22], where we analyze the relationship between task assignment and cache partitioning and propose algorithms to perform them jointly to minimize the overall WCET. Cache locking is utilized for each task to guarantee a predictable WCET, which is formulated as a function of the cache size available to be locked. This information then serves as the guide during the joint task assignment and cache partitioning process. In this paper, we propose two more algorithms for task assignment and cache partitioning, which have better time complexities than the algorithms proposed in [22]. We compare the newly proposed algorithms with previous work in the experimental section. We also discuss several critical issues about how the locking technique can be applied in practical systems, for example how cache conflicts are resolved. We additionally evaluate the execution times of the proposed algorithms to show that they are efficient enough for practical embedded systems. The contributions of this paper are as follows:
1. Consider task assignment, L2 cache partitioning and cache locking jointly to minimize the overall WCET for MPSoC embedded systems.
2. Prove that the joint problem of task assignment and L2 cache partitioning with cache locking is NP-hard, and propose several efficient algorithms for solving it.
3. Propose two optimal algorithms to solve the L2 cache partitioning problem under a fixed task assignment.
4.
Infer from experimental results how to determine certain architecture parameters under different settings.

The remainder of this paper is organized as follows. Section 2 introduces related work. Section 3 presents the MPSoC architecture, as well as the characteristics of the cache and tasks. In Section 4, we analyze the joint problem of task assignment and L2 cache partitioning, present a motivating example, and prove that the joint problem is NP-hard. Heuristic algorithms are proposed in Section 5. Section 6 shows the experimental results. Finally, the paper is concluded in Section 7.

2. Related work

Joint task assignment and cache partitioning. Li et al. [19] describe a task assignment and cache partitioning algorithm for heterogeneous MPSoCs. They first allocate tasks to cores to balance the workload, and then use software-based cache partitioning to guarantee cache hits for some tasks. Their target is not to reduce the WCET, and their partitioning algorithm is not optimal. [31,26] focus on the integrated problem of task scheduling and scratchpad memory (SPM) partitioning, which is similar to our problem. Suhendra et al. [31] propose a technique based on integer linear programming (ILP) for the problem. They assume that each core has a private SPM and that a core can access another core's private SPM, albeit with increased latency. In [26], Salamy and Ramanujam integrate task assignment and SPM partitioning to reduce the execution time of embedded applications. They first separate the two steps and finally adjust the solution; however, they only adjust the cache partition and do not adjust the task assignment.

Task assignment. We only discuss task assignments that consider cache behavior. Fedorova et al. [10] propose an assignment algorithm that improves the management of the L2 cache.

It co-schedules threads that collectively achieve a low L2 miss rate and gives higher priority to threads that do not require much L2 cache. In [9], they present a new cache-fair scheduling algorithm that addresses non-uniform cache allocation and reduces co-runner-dependent performance variability. Calandrino and Anderson [4] aim to improve shared cache performance by co-scheduling tasks that reference a common memory region and avoiding co-scheduling tasks that would thrash the cache.

Cache partitioning. In [27], a dynamic programming algorithm is proposed to obtain the optimal memory/cache partition for utilization minimization on a single-core architecture. For cache partitioning on MPSoCs, prior work has pursued different targets. Suh et al. [29] aim to reduce the average cache miss rate of the concurrently executing processes. Kim et al. [18] focus on the importance of fair multi-core caching. Iyer [17] addresses the QoS of the cache and presents a cache management framework to control the shared L2 cache. Chang and Sohi [6] present Cooperative Cache Partitioning (CCP), which resolves cache contention with multiple time-sharing partitions that can exploit the benefits of LRU-based latency optimizations.

Cache locking. Puaut et al. propose greedy algorithms for instruction cache (I-cache) locking, while Campoy et al. use genetic algorithms [5]. Falk et al. [8] take the changing of the worst-case execution path into consideration during each step of optimization. Liu et al. [21] propose optimal algorithms for the static I-cache locking problem to minimize the WCET of single-task single-core systems; they also study the locking problems of multi-task single-core systems in [20]. Vera et al. conduct work on data cache (D-cache) locking in [33,34]. Suhendra and Mitra [30] compare different combinations of cache partitioning and locking techniques.

Task assignment and cache partitioning on MPSoC systems are often carried out independently [10,9,4,27,29,18,17,6].
Suhendra et al. [31] consider them together, but their method targets an architecture in which one core can access another core's SPM, which is not suitable for cache architectures. The work of [19,26] can be further improved when a cache locking technique is available and utilized. Suhendra and Mitra [30] consider cache partitioning and locking together, but their goal differs from that of this paper. In this paper, we first apply the cache locking technique to each task to obtain its WCET as a function of the locked cache size. This function is then used as the input for the joint task assignment and cache partitioning problem. We propose novel algorithms for task assignment and cache partitioning to minimize the overall WCET. The work proposed in this paper can also be applied to MPSoC architectures with SPM.

3. Multi-task MPSoC architecture with shared cache

This section introduces the notations and assumptions used in this paper. The characteristics of the MPSoC architecture, as well as of the cache and tasks, are presented.

3.1. MPSoC architecture with shared cache

The MPSoC architecture with a 2-level cache explored in this paper is shown in Fig. 1. The system has a core set C = {c1, c2, ..., cn}. Each core ci has its own private L1 cache (I-cache and D-cache) of size S1i. All the cores share an L2 cache of size S2. This architecture is common in several commercial CPUs and offers good scalability as well as low access latency. Cache partitioning aims to allocate cache resources exclusively to different cores or tasks according to specific criteria. In this way, cache pollution can be avoided and the timing effect of each task or core can be estimated separately. However, cache

Fig. 1. MPSoC architecture with private L1 cache and shared L2 cache: m tasks run on n cores (Core 1, Core 2, ..., Core n); each core ci has private L1 instruction and data caches of size S1i; all cores share an L2 cache of size S2, backed by main memory.

partitioning reduces the effective capacity available to each core or task. Cache locking aims to select and lock some content of a program or data in the cache. It brings significant performance benefits, especially on application-specific embedded systems: the application-specific characteristic enables us to analyze the application's properties and make informed decisions and optimizations before execution. Two cache locking schemes are possible: static and dynamic. In the static locking scheme, the cache content is locked from the beginning to the end. In the dynamic locking scheme, the locked cache content can be changed at specific reload points. The selection of the content to be locked has a dramatic influence on system performance. By using cache locking techniques, cache conflict misses and capacity misses are eliminated, while the number of cold misses on the unlocked content can be predicted. By locking selected content in the cache, previous research shows that the cache miss rate may increase [32], but the WCET of programs can be predicted and reduced by up to 76% [8,21]. In this work, we use the locking techniques proposed in [21] and [34] for the I-cache and D-cache, respectively, to select the locking content so as to obtain a minimized WCET for each single task. In [21], the static I-cache locking problem is modeled as an ILP problem, and optimal algorithms are proposed for subsets of the general problem with special properties. In [34], Vera et al. first find some merging regions [11] to lock, and then put the variables with more memory references in these regions into the D-cache. The following characteristics are assumed for the cache and for the partitioning and locking techniques:
1. The cache can be either direct-mapped or set-associative.
2. Cache partitioning is performed per core so that there is no cache interference between cores. The granularity of partitioning is one cache line.
3.
Cache locking is applied to each task using the static cache locking techniques in [21,34] for the I-cache and D-cache, respectively. The granularity of locking is one cache line.
4. A general function is used throughout this paper to represent the WCET under cache locking, following the results of [21,34], without differentiating between locking of the I-cache and of the D-cache.
5. Possible cache conflicts within the locking selection are resolved by previously proposed compilation techniques, such as procedure placement [14], padding [16] and code positioning [36], as discussed in Section 5.5.

3.2. Task characteristics

There are m tasks T = {t1, t2, ..., tm} in the MPSoC system, as shown in Fig. 1. Each task tj is associated with functions fij (x, y)

Table 1
Notations used in this paper.

Notation      Description
n             The number of cores.
ci            A core.
S1i           Size of ci's L1 cache.
S2            Total size of the L2 cache.
S2i           Size of ci's partitioned L2 cache.
m             The number of tasks.
tj            A task.
WCETij        WCET of task tj when executed on core ci without cache.
fij(x, y)     WCET of task tj when executed on core ci using L1 cache of size x and L2 cache of size y.
WCETci        WCET of core ci; equal to Σ_{tj assigned to ci} fij(S1i, S2i).
Total_WCET    Overall WCET of the system; equal to max_{ci∈C} WCETci.

which represent the WCET of task tj when it executes on core ci with a locked L1 cache of size x and a locked L2 cache of size y. The cache locking algorithms in [21,34] are used to select the locking content to minimize the WCET of each single task. Since cache locking manages the cache in a predictable way, and cache interference and conflicts can be avoided by cache partitioning and compilation techniques, each fij(x, y) offers a precise WCET estimation for task tj under one cache configuration. This function is then used as the input for the joint task assignment and cache partitioning problem. It can be seen from the results of [21,34] that, by using cache locking techniques, the more cache a task is assigned, the shorter its execution time. Therefore, we assume that each function fij(x, y) is non-negative and non-increasing in both x and y when cache locking [21,34] is applied. Task assignment assigns each task to a certain core. Once a task is assigned to a core, it executes exclusively on that core. For simplicity, tasks in this paper are not associated with real-time properties. We assume that schedulability between tasks is guaranteed and that the arrival times are the same for all tasks. Thus, any uniprocessor scheduling algorithm can be applied on each core after tasks are assigned. The set of tasks assigned to a core will execute sequentially on that core. Tasks use the private L1 cache of that core as well as the shared L2 cache.

4. Problem analysis

The joint task assignment and L2 cache partitioning problem on an MPSoC to minimize the overall WCET can be defined as follows. Input: Given m tasks T = {t1, t2, ..., tm}, an MPSoC architecture with n cores C = {c1, c2, ..., cn} and an L2 cache of size S2. In a realistic situation, we have m ≥ n. The WCET of task tj when executed on core ci without L1 or L2 cache is WCETij. Each core ci has a private L1 cache of size S1i and shares the L2 cache.
Each task tj has n non-negative and non-increasing functions fij(x, y), which represent the WCET when task tj executes on core ci with L1 cache of size x and L2 cache of size y. Output: Design an assignment of the m tasks and a partition of the L2 cache such that the Total_WCET of the MPSoC system is minimized. The tasks assigned to a core execute sequentially according to some single-core scheduling method. Assuming that the L2 cache partitioned to core ci is S2i, the WCET of core ci can be calculated as WCETci = Σ_{tj assigned to ci} fij(S1i, S2i). The overall WCET of the whole system is the latest finishing time of the tasks on the different cores, also called the makespan or Total_WCET. We have Total_WCET = max_{ci∈C} WCETci. The notations used in this paper are summarized in Table 1.
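To make the objective concrete, the following sketch (hypothetical names, not the paper's code) evaluates Total_WCET for a candidate assignment and partition, with each fij supplied as a callable:

```python
# Sketch: evaluating the objective of the joint problem. f[i][j] is a
# non-increasing function f_ij(x, y) giving the WCET of task t_j on core
# c_i with a locked L1 cache of size x and a locked L2 cache of size y.

def total_wcet(f, assignment, S1, S2_part):
    """assignment[j] = core index of task t_j; S1[i] = L1 size of core c_i;
    S2_part[i] = L2 cache units partitioned to core c_i."""
    n = len(S1)
    # WCET_ci = sum of f_ij(S1_i, S2_i) over the tasks assigned to core c_i
    wcet_core = [0.0] * n
    for j, i in enumerate(assignment):
        wcet_core[i] += f[i][j](S1[i], S2_part[i])
    # Total_WCET = max over cores (the makespan)
    return max(wcet_core)
```

A solver for the joint problem searches over `assignment` and `S2_part` to minimize this value.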


Table 2
Attributes of the example task set.

Task    WCETij    fij(x, y)
t1      24        (1 − x/20)(1 − y/45) · WCET01
t2      12        (1 − x/30)(1 − y/60) · WCET02
t3      22        (1 − x/35)(1 − y/45) · WCET03
t4      25        (1 − x/25)(1 − y/34) · WCET04
t5      18        (1 − x/24)(1 − y/32) · WCET05

4.1. NP-hardness

Lemma 1. The joint problem of task assignment and L2 cache partitioning for overall WCET minimization is NP-hard.

Proof. When the total size of the L2 cache is S2 = 0, the task assignment and cache partitioning problem is exactly the Scheduling on Unrelated Parallel Machines problem [15], which is proved to be NP-complete. Since this special case is NP-complete, the general problem of task assignment and L2 cache partitioning is also NP-hard.

Overview of the proposed strategies. Since the problem is NP-hard, a heuristic method is applied in the main strategy. We assign tasks and partition the cache based on the cache locking properties. First, tasks are assigned to balance the workload among cores; then the L2 cache is partitioned to further reduce the Total_WCET. Once a solution is obtained, we can still carry out adjustments until a near-optimal solution is achieved.

4.2. Motivating example

Consider a task set T = {t1, t2, t3, t4, t5} whose attributes are listed in Table 2. The MPSoC architecture in this example is a dual-core architecture with core set C = {c1, c2}. Each core has an L1 cache of size 10. The total L2 cache, of size 10, is shared by the cores. For simplicity, a homogeneous architecture is used in this example, so the WCETs of an individual task are the same on different cores. In other words, WCETij and fij(x, y) do not depend on the core (the value of i); thus in Table 2, the initial digit ''0'' in the subscripts ''01'', ''02'', ..., ''05'' of WCET is a fictitious core number. In this motivating example, we use linear functions for each fij(x, y), since in our previous experiments with cache locking techniques [21,34] in aiT [1], we obtained similar linear functions for some benchmarks. In practical systems, some fij(x, y) could be nonlinear. The approaches compared here are:

(a) EQUAL: decoupled task assignment and cache partitioning. First equally partition the L2 cache among cores.
Then use the task assignment algorithm proposed in Section 5.1 to assign tasks.
(b) INTEG: the state-of-the-art algorithm for integrated task assignment and SPM partitioning in [26], where SPM has very similar characteristics to a locked cache.
(c) TACP: the proposed algorithm for joint task assignment and L2 cache partitioning in Section 5.
(d) MTACP: the proposed modified algorithm for joint task assignment and L2 cache partitioning in Section 5.4.

The solutions for this example using the above approaches are shown in Fig. 2(a–d). We also give the optimal solution obtained by a brute-force method, shown in Fig. 2(e). In each figure, the Y-axis represents the cores and the X-axis represents time. For example, in Fig. 2(a), tasks t3, t5 and t2 are assigned to c1, while tasks t4 and t1 are assigned to c2; WCETc1 = 30.16 and WCETc2 = 23.45, so Total_WCET = 30.16. The cache partitioning solution is given on the right of each figure, where the upper number is the size partitioned to c1 and the lower the size partitioned to c2.

Technique (a) equally partitions the L2 cache among the cores. It then uses the task assignment algorithm (Algorithm 3) to assign tasks, where the currently longest task is assigned to the core with the currently shortest WCETci. Therefore, t3 is assigned to c1 first, t4 to c2 second, then t1 to c2, and so on. Technique (b) also first partitions the L2 cache equally. It then assigns the shortest task to a core so as to minimize the current Total_WCET. After each assignment, it adjusts the cache sizes according to the ''elasticity'' of a task, which reflects the sensitivity of the task to cache size. In this example, the shortest task t2 is first assigned to c1, and two more cache lines are added to c1. Then t5 is assigned to c2, t1 to c1, t4 to c2, and t3 to c1, with a cache size adjustment following each step of assignment. After the last adjustment, all the L2 cache is partitioned to c1 to minimize the Total_WCET. The proposed TACP first assumes that tasks do not use the L2 cache, and assigns the currently longest task to the core with the currently shortest WCETci; it produces the same task assignment as Technique (a). Then the L2 cache is given to the cores step by step to minimize the Total_WCET, resulting in c1 having 9 units and c2 having 1 unit. No further adjustments occur for this solution. The proposed MTACP first carries out Technique (a); then the L2 cache is given to the cores step by step to minimize the Total_WCET, and adjustments of the task assignment take place last. As can be seen from Fig. 2, TACP and MTACP achieve better overall WCET solutions than EQUAL and INTEG. The INTEG algorithm in [26] uses steps similar to the proposed strategies, but does not utilize cache locking techniques; after the assignment and partitioning, it also adjusts the solution, but only the cache partition. MTACP also obtains the optimal solution, shown in Fig. 2(e), for this example.
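The example is small enough to verify by exhaustive search. The sketch below (illustrative, not the authors' code) assumes the linear fij from Table 2, a fully locked L1 of size 10 on each core, and an L2 cache of 10 units partitioned at unit granularity; it enumerates all 2^5 assignments and 11 partitions:

```python
from itertools import product

# Table 2 parameters: WCET_0j and the divisors A_j, B_j of
# f_j(x, y) = (1 - x/A_j)(1 - y/B_j) * WCET_0j  (homogeneous cores).
W = [24, 12, 22, 25, 18]
A = [20, 30, 35, 25, 24]
B = [45, 60, 45, 34, 32]

def f(j, x, y):
    return (1 - x / A[j]) * (1 - y / B[j]) * W[j]

def makespan(assign, part):
    """assign[j] in {0, 1} = core of task t_{j+1}; part[i] = L2 units of core i."""
    core = [0.0, 0.0]
    for j, i in enumerate(assign):
        core[i] += f(j, 10, part[i])
    return max(core)

# EQUAL from Fig. 2(a): t2, t3, t5 on c1; t1, t4 on c2; L2 split 5/5.
equal = makespan((1, 0, 0, 1, 0), (5, 5))   # about 30.16

# Exhaustive search over all assignments and integer L2 partitions.
opt = min(makespan(a, (y, 10 - y))
          for a in product((0, 1), repeat=5) for y in range(11))
```

Under these assumptions the search reproduces the EQUAL makespan of about 30.16 and an optimum roughly 12% lower, consistent with Fig. 2(e).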
INTEG, TACP and MTACP achieve 5.5%, 10.7% and 12% Total_WCET reductions, respectively, compared to EQUAL for this example.

5. Algorithms for task assignment and cache partitioning

The main algorithm proposed in this paper consists of three steps. The first step is to assign the m tasks to the n cores. The second step is to partition the L2 cache among the cores such that the Total_WCET is minimized for the assignment of the first step. The last step is to adjust the task assignment generated by the first step and the cache partition generated by the second step to further reduce the Total_WCET. The main algorithm TACP (Algorithm for Joint Task Assignment and L2 Cache Partitioning) is shown in Algorithm 1. Before the three main steps, we first need to calculate the functions fij(x, y) (Step 0, lines 1–2). For m tasks and n cores, cache locking has to be carried out for every L1 cache size of each core and every L2 cache size. The complexity of the static locking algorithm in [21] is O(|V|^2 + |V|S), where |V| is the number of basic blocks in a task and S is the cache size. Thus the worst-case complexity is O(m(Σ_{1≤i≤n} S1i)S2(|V|^2 + |V|S)). For each core, we only consider one L1 cache size (S1i), since the whole L1 cache is always utilized; the complexity can thus be reduced to O(nmS2(|V|^2 + |V|S)). In the experiments, all the fij(x, y) results were obtained from our previous experiments with aiT. An fij(x, y) table is constructed and looked up when needed, which introduces negligible overhead to the whole process. The algorithms for Step 1 (lines 3–4), Step 2 (lines 5–6) and Step 3 (lines 7–8) are discussed in Sections 5.1–5.3, respectively.

5.1. Task assignment based on L1 cache

The algorithms in this section give an initial task assignment for the successive steps of Algorithm 1. We assume that each task on core ci only uses ci's L1 cache of size S1i. Therefore, the WCET of task tj if assigned to core ci using locked L1 cache is fij (S1i , 0)

Fig. 2. Motivating example: (a) EQUAL; (b) INTEG [26]; (c) TACP; (d) MTACP; (e) OPT.

which is a fixed value. The WCET of core ci can be calculated as WCETci = Σ_{tj assigned to ci} fij(S1i, 0).

Algorithm 1 TACP: Algorithm for Joint Task Assignment and L2 Cache Partitioning.
Require: A set of m tasks T = {t1, ..., tm}; A set of n cores C = {c1, ..., cn}, each of which has an L1 cache of size S1i; A shared L2 cache of size S2.
Ensure: Assign tasks to cores and partition the L2 cache to cores to minimize the Total_WCET.
1: /* Step 0. Calculate fij(x, y) */
2: For each task tj on each core ci, calculate the function fij(x, y) using the cache locking approaches in [21,34];
3: /* Step 1. Assign the tasks */
4: Assume that each task only uses L1 cache, and give an initial task assignment using Algorithm 2 or Algorithm 3; // In Section 5.1
5: /* Step 2. Partition the cache */
6: Obtain an optimal L2 cache partition for the initial task assignment using Algorithm 5; // In Section 5.2
7: /* Step 3. Adjustment of assignment and partition */
8: Re-assign tasks and re-partition the cache to further reduce the Total_WCET using Algorithm 6; // In Section 5.3
9: Return the task assignment and L2 cache partition.

The cores in an MPSoC can be homogeneous or heterogeneous. The decision problem of determining whether a given set of tasks can be assigned to homogeneous cores with Total_WCET smaller than a given bound is the Minimum Makespan Problem, which is proved to be NP-complete in [12]. When the cores are heterogeneous, this generalized version of the minimum makespan problem is also NP-hard [15]. We first focus on heterogeneous cores. Because cores have different processing abilities and L1 cache sizes, the WCETs of one task on different cores differ. A WCET-Matrix W = (wij) ∈ Z+^{n×m} is calculated, where wij = fij(S1i, 0) represents the WCET of task tj when executed on core ci using ci's L1 cache. Based on this matrix, Davis and Jaffe [7] present a polynomial-time algorithm that delivers a solution within a factor of √n of the optimum.
They use an ''efficiency'' attribute to measure how good a decision is when assigning task tj to core ci, and use a greedy method to obtain an approximate solution. We use this approximation algorithm TAHeC (Algorithm for Task Assignment on Heterogeneous Cores) for our assignment, shown in Algorithm 2.

Algorithm 2 TAHeC: Algorithm for Task Assignment on Heterogeneous Cores.
Require: A set of m tasks T = {t1, ..., tm}, each of which has fij(x, y); A set of n cores C = {c1, ..., cn}, each of which has an L1 cache of size S1i.
Ensure: Assign tasks to cores to minimize the Total_WCET.
1: /* Step 1. Obtain efficiency */
2: Calculate the WCET-Matrix W, where wij = fij(S1i, 0);
3: for each task tj do
4:   b(j) = min_{1≤i≤n} wij;
5:   efij = b(j)/wij;
6: end for
7: /* Step 2. Initialization */
8: for each core ci do
9:   Create a list of tasks sorted in non-increasing order of efij;
10:  Mark ci as ''active'' and each task as ''unassigned'';
11: end for
12: /* Step 3. Assignment */
13: while not all tasks are assigned do
14:   Find the core with the shortest WCET among active cores: ci;
15:   Find the next unassigned task on ci's list of tasks: tj;
16:   if task tj does not exist or efij < 1/√n then
17:     Mark ci as ''inactive'';
18:   else
19:     Assign task tj to core ci;
20:     Mark task tj as ''assigned'';
21:     Update the WCET for core ci;
22:   end if
23: end while
24: Return the task assignment.

As shown in Algorithm 2, the efficiency for task tj on core ci is calculated as efij = b(j)/wij, where b(j) is the shortest WCET of

task tj using only L1 cache, over all cores (lines 3–6). For each core, we sort the tasks in non-increasing order of efij (lines 8–11). There are then m iterations of task assignment (lines 13–23). In each iteration, we first choose the core with the shortest WCET (line 14), and then decide whether to assign the next task on its list (line 15) based on the task's ''efficiency'' efij (lines 16–22). If the task is assigned, the WCET of that core is updated (lines 18–21). If quick-sort and a ''Min_Heap'' are used, the complexity of Algorithm 2 is O(nm + nm log m + n + m log n) = O(nm log m). When the cores are homogeneous, we have another algorithm TAHoC (Algorithm for Task Assignment on Homogeneous Cores), which has a better complexity. We adopt the algorithm proposed

1478

in [3] which has an approximation ratio of is shown in Algorithm 3.

T. Liu et al. / J. Parallel Distrib. Comput. 71 (2011) 1473–1483 4 3



1 . 3n

The algorithm

Algorithm 3 TAHoC: Algorithm for Task Assignment on Homogeneous Cores.
Require: A set of m tasks T = {t1, ..., tm}, each of which has fij(x, y); a set of n cores C = {c1, ..., cn}, each of which has an L1 cache of size S1.
Ensure: Assign tasks to cores to minimize the Total_WCET.
1: /* Step 1. Initialization */
2: Compute the WCET of each task tj using L1 cache of size S1: fij(S1, 0);
3: Sort tasks into a list L by fij(S1, 0) non-increasingly;
4: /* Step 2. Assignment */
5: while L is not empty do
6:   Assign the first task in L to the core with the shortest WCET (ties are broken arbitrarily);
7:   Remove this task from L;
8:   Update the WCET of the corresponding core;
9: end while
10: Return the task assignment.

As shown in Algorithm 3, we first compute the WCET of each task using only L1 cache (line 2). Since the architecture is homogeneous, each task has the same WCET on every core. We sort the WCETs of the tasks in non-increasing order (line 3). Then there are m iterations to perform task assignment (lines 5–9). In each iteration, we choose the core with the shortest WCET and assign the longest remaining task to it (line 6). If quick-sort is used to sort tasks and a "Min_Heap" is used to maintain the cores' WCET values, the complexity of Algorithm 3 is O(m log m + n + m log n) = O(m log m).

5.2. L2 cache partitioning

In this section, we introduce two optimal algorithms for L2 cache partitioning when tasks are already assigned to cores. Suppose task tj is assigned to core ci; then the WCET of tj is fij(S1i, S2i) if the size of L2 cache partitioned to ci is S2i. The WCET of core ci is calculated as WCETci = sum over {tj | tj is assigned to ci} of fij(S1i, S2i). The more cache a core is assigned, the less time the set of tasks on that core takes. Since we want to minimize the Total_WCET, we should give more cache to the core with the longest WCET.

We first give a greedy algorithm, GCP (Greedy Algorithm for L2 Cache Partitioning on Cores with a Fixed Task Assignment), as shown in Algorithm 4. It assigns L2 cache units one by one to the core with the longest WCET under the current cache partition.

Algorithm 4 GCP: Greedy Algorithm for L2 Cache Partitioning on Cores with a Fixed Task Assignment.
Require: A set of n cores C = {c1, ..., cn}, each of which has a set of tasks assigned to it; a shared L2 cache of size S2.
Ensure: Partition the L2 cache among the cores to minimize the Total_WCET.
1: /* x is the size of unfixed L2 cache in each iteration */
2: Let x = S2;
3: while x > 0 do
4:   Add one unit of L2 cache to the core with the longest WCET;
5:   Update the WCET of that core with the new partition;
6:   x = x − 1;
7: end while
8: Return the L2 cache partition.

There are S2 iterations in Algorithm 4 (lines 3–7). In each iteration, we need to find the core with the longest WCET (line 4), update the WCET of that core (line 5), and update the size of unfixed cache (line 6). If a "Max_Heap" is used, the complexity of Algorithm 4 is O(n + S2 log n).

Lemma 2. Algorithm 4 outputs an optimal cache partition for a fixed task assignment.

Proof. In order to decrease the Total_WCET, we have to increase the cache size of the core with the longest WCET. Hence, each iteration in Algorithm 4 is necessary, and the assigned cache should be fixed on the corresponding core.

Usually, S2 is a large number and n << S2, hence the complexity of Algorithm 4 is O(S2 log n). Here we give another algorithm, ACP (Algorithm for L2 Cache Partitioning on Cores with a Fixed Task Assignment), shown in Algorithm 5, whose running time is better.

Algorithm 5 ACP: Algorithm for L2 Cache Partitioning on Cores with a Fixed Task Assignment.
Require: A set of n cores C = {c1, ..., cn}, each of which has a set of tasks assigned to it; a shared L2 cache of size S2.
Ensure: Partition the L2 cache among the cores to minimize the Total_WCET.
1: /* x is the size of unfixed L2 cache in each iteration */
2: Let x = S2;
3: /* When x >= n */
4: while x >= n do
5:   Temporarily add floor(x/n) L2 cache to each of the first n − 1 cores;
6:   Temporarily add x − (n − 1) * floor(x/n) L2 cache to the last core;
7:   Compute the WCET of each core with the current cache partition;
8:   Let ci be the core with the longest WCET, and let S2'i be the cache newly assigned to it in this iteration;
9:   Fix the S2'i L2 cache on core ci;
10:  Release the temporarily added L2 cache on the other cores;
11:  x = x − S2'i;
12: end while
13: /* When x < n */
14: while x > 0 do
15:   Use Algorithm 4;
16: end while
17: Return the L2 cache partition.

In Algorithm 5, there are at most log_{n/(n−1)}(S2/n) + 1 iterations when x >= n (lines 4–12). In each iteration, we first temporarily distribute the remaining size evenly over the cores (lines 5–6) and update the WCET of each core (line 7). Then we find the core with the longest WCET (line 8) and confirm its update (line 9); the other cores are not really updated. If a "Max_Heap" is used, the complexity of these steps is O(n + (log_{n/(n−1)}(S2/n) + 1) * (log n + n)) = O(n log_{n/(n−1)}(S2/n)) = O(n * log(S2/n) / log(n/(n−1))). Since log(n/(n−1)) = Θ(1/(n−1)), this is O(n^2 log(S2/n)). When x < n, the complexity is O(n + n log n) according to Algorithm 4. So the complexity of Algorithm 5 is O(n^2 log(S2/n) + n + n log n) = O(n^2 log(S2/n)) in total, which is better than that of Algorithm 4.
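As a concrete sketch, the two phases of the partitioner might look as follows in Python. This is an illustration under an assumed model: `wcet_fn(i, u)` is a hypothetical per-core WCET function (standing in for the sum of fij over the tasks on core ci), non-increasing in the number u of L2 units; the final loop is the greedy fallback of Algorithm 4.

```python
import heapq

def acp(wcet_fn, n, s2):
    # wcet_fn(i, u): WCET of core c_i holding u L2 cache units,
    # assumed non-increasing in u
    fixed = [0] * n
    x = s2
    while x >= n:                               # lines 4-12 of Algorithm 5
        share = x // n
        tmp = [share] * (n - 1) + [x - (n - 1) * share]
        w = [wcet_fn(i, fixed[i] + tmp[i]) for i in range(n)]
        i = max(range(n), key=lambda k: w[k])   # longest-WCET core
        fixed[i] += tmp[i]                      # fix its temporary units
        x -= tmp[i]                             # the others are released
    heap = [(-wcet_fn(i, fixed[i]), i) for i in range(n)]
    heapq.heapify(heap)                         # max-heap via negation
    for _ in range(x):                          # remaining x < n units:
        _, i = heapq.heappop(heap)              # the greedy loop of Algorithm 4
        fixed[i] += 1
        heapq.heappush(heap, (-wcet_fn(i, fixed[i]), i))
    return fixed
```

For instance, with two cores whose WCETs follow the toy model `base[i] / (1 + units)` for `base = [100, 50]` and three cache units, the first core (the longer one) receives two units and the second receives one.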

Lemma 3. Algorithm 5 outputs an optimal cache partition for a fixed task assignment.

Proof. In each iteration, we consider the core with the longest WCET after the temporary cache partition. Let ci be such a core. Suppose that the size of L2 cache on ci is S2'i and the current WCET of ci is WCETci. Then we claim that in the optimal cache partition, the size of L2 cache on core ci is no less than S2'i. If there were less than S2'i cache on core ci, the real WCET of core ci would exceed the current WCETci, which means the Total_WCET of the optimal cache partition would be larger than WCETci. This contradicts the assumption that this cache partition is optimal. Therefore, the L2 cache on the core with the longest WCET is needed; hence each iteration in Algorithm 5 is necessary, which proves Lemma 3.

5.3. Re-assign and re-partition

Step 1 cannot guarantee an optimal solution, while Step 2 can. For a joint problem, even if each step is locally optimal, the solution may not be globally optimal. To improve the solution, we use an adjustment process as shown in Algorithm 6.
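This adjustment loop can be sketched as follows. The sketch is deliberately simplified: per-task WCETs are held fixed, so the re-partitioning via Algorithm 5 in line 6 is skipped and Total_WCET reduces to the longest per-core sum; all names are hypothetical.

```python
def atacp_adjust(assigned, wcet, n, alpha=50):
    # assigned[j]: core of task j; wcet[j]: task j's WCET.  Simplified model:
    # per-task WCETs are fixed, so Total_WCET is just the longest core load.
    def loads():
        L = [0.0] * n
        for j, i in enumerate(assigned):
            L[i] += wcet[j]
        return L

    best = max(loads())
    for _ in range(alpha):                      # iteration threshold (line 2)
        L = loads()
        cx = max(range(n), key=lambda i: L[i])  # longest-WCET core (line 3)
        cy = min(range(n), key=lambda i: L[i])  # shortest-WCET core
        on_cx = [j for j, i in enumerate(assigned) if i == cx]
        if not on_cx:
            break
        j = min(on_cx, key=lambda t: wcet[t])   # shortest task on cx (line 4)
        assigned[j] = cy                        # tentative move (line 5)
        new = max(loads())
        if new < best:
            best = new                          # keep the move (lines 7-9)
        else:
            assigned[j] = cx                    # revert and stop (line 11)
            break
    return assigned, best
```

On four tasks with WCETs {5, 4, 3, 2} initially split 3/1 over two cores, a single accepted move drops the makespan from 12 to 9, after which no further move helps and the loop stops.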

Algorithm 6 ATACP: Adjustment Algorithm for Task Assignment and L2 Cache Partitioning.
Require: A task assignment; an L2 cache partition; an iteration threshold α.
Ensure: Adjust the task assignment and L2 cache partition to further reduce the Total_WCET.
1: count = 0;
2: while count < α do
3:   Let cx and cy be the cores with the longest and the shortest WCET, respectively;
4:   Let tj be the task with the shortest WCET on core cx;
5:   Tentatively move task tj to core cy (not really moved);
6:   Obtain a new L2 cache partition for the current assignment using Algorithm 5;
7:   if this solution decreases the Total_WCET then
8:     Update the Total_WCET;
9:     Save the task assignment and L2 cache partition of this solution;
10:  else
11:    The algorithm ends;
12:  end if
13:  count++;
14: end while
15: Return the saved task assignment and L2 cache partition.

In Algorithm 6, we use an adjustment threshold α to finish the algorithm in reasonable time (line 2); in our experiments, α = 50 proved sufficient. In each iteration, we find the cores with the longest and the shortest WCET (line 3), and the task with the shortest WCET on the former (line 4). We then try to move this task to the core with the shortest WCET (line 5) and carry out cache partitioning once based on the newly obtained task assignment (line 6). If this movement reduces the Total_WCET, we confirm it (lines 7–9). The complexity of Algorithm 6 is O(α(log n + m + n^2 log(S2/n))) = O(m + n^2 log(S2/n)).

5.4. A modified algorithm

The first step in Algorithm 1 may not guarantee a good initial assignment because it does not consider the WCET variation caused by the L2 cache. Therefore, we add one more step to Algorithm 1: we first partition the L2 cache equally among the cores, then assign tasks based on the WCET of tasks using both L1 cache and the equally partitioned L2 cache, partition the cache again, and finally adjust the solution. The modified algorithm is named MTACP (A Modified Algorithm for Joint Task Assignment and Cache Partitioning), as shown in Algorithm 7.

Algorithm 7 MTACP: A Modified Algorithm for Joint Task Assignment and Cache Partitioning.
Require: A set of m tasks T = {t1, ..., tm}; a set of n cores C = {c1, ..., cn}, each of which has an L1 cache of size S1i; a shared L2 cache of size S2.
Ensure: Assign the tasks to cores and partition the L2 cache among the cores to minimize the Total_WCET.
1: /* Step 0. Calculate fij(x, y) */
2: For each task tj on each core ci, calculate the function fij(x, y) using the cache locking approach in [21,34];
3: /* Step 1. Equally partition the L2 cache */
4: Assign floor(S2/n) L2 cache to each of the first n − 1 cores;
5: Assign S2 − (n − 1) * floor(S2/n) L2 cache to the last core;
6: /* Step 2. Assign the tasks */
7: Assuming that each task uses both L1 and L2 cache, compute an initial task assignment using Algorithm 2 or Algorithm 3;
8: /* Step 3. Partition the cache */
9: Obtain an optimal L2 cache partition for the initial task assignment using Algorithm 5;
10: /* Step 4. Adjust the assignment and partition */
11: Re-assign tasks and re-partition the cache to further reduce the Total_WCET using Algorithm 6;
12: Return the task assignment and L2 cache partition.

In the task assignment step (Step 2, lines 6–7), Algorithm 7 uses fij(S1i, S2i) to calculate the WCET of task tj on core ci with L1 cache and the equally partitioned L2 cache (Step 1, lines 3–5). According to the complexity analysis of the algorithms in Sections 5.1–5.3 and of Step 0, the complexities of Algorithms 1 and 7 are O(nmS2(|V|^2 + |V|S)) + nm log m + n^2 log(S2/n) + m + n^2 log(S2/n) = O(nmS2(|V|^2 + |V|S) + nm log m + n^2 log(S2/n)).

5.5. Cache conflicts

Even though we use cache locking and respect the size limitation at every step, we may still encounter the cache conflict problem, where selected content is mapped to the same cache line. We need to avoid these intra-task cache conflicts so that cache locking can be applied in practice. Inter-core cache conflicts are eliminated by cache partitioning between cores, and inter-task cache conflicts do not exist since tasks are assumed to execute on cores sequentially without interrupting each other.

Some research [13,14,16,36] has been conducted on the cache conflict problem, but it is geared toward a different goal: minimizing the conflict rate of a program by carefully selecting functions into cache or repositioning parts of the program. The code placement [14,36] and padding [16] techniques can be adapted to the problem in this work. The code placement technique has two phases: cache placement and memory placement. The cache placement phase assigns cache addresses to minimize cache conflicts; it decides a placement for each selected block in cache. The memory placement phase assigns memory addresses under the cache placement constraints. In this paper, we conduct these two phases as follows. First, we carry out the cache placement phase. A simple method is used to place the selected content in cache: consecutively, following memory index order, without overlaps or padding. Second, we carry out the memory placement phase. Because the placement units in this work are basic blocks or basic data structures, they are placed sequentially in memory following their index order. So, according to the determined cache placement, some padding is inserted before the selected content in memory so that the selected blocks or data can be mapped to the intended cache lines.
In this paper, we simply insert as much padding as needed each time to adjust the mapping address, which may not yield the memory placement with the smallest total padding for a given cache selection. The joint cache-memory placement problem, which must consider cache placement and memory placement together, is another difficult optimization problem and will be studied in the future.
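The two-phase policy described above can be sketched as follows for a direct-mapped cache. This is an illustrative model under our stated simple policy, not the paper's implementation: item names, sizes, and the tuple layout are hypothetical.

```python
def place_with_padding(items, cache_size):
    # items: (name, size, selected) triples in memory index order, for a
    # direct-mapped cache of cache_size bytes.  Selected items are the
    # locked content.
    # Phase 1 (cache placement): lay selected items out consecutively in
    # cache, following memory index order, without overlap or padding.
    cache_off, target = 0, {}
    for name, size, sel in items:
        if sel:
            target[name] = cache_off % cache_size
            cache_off += size
    # Phase 2 (memory placement): pad before each selected item until its
    # memory address maps (mod cache_size) onto its assigned cache offset.
    addr, layout = 0, []
    for name, size, sel in items:
        pad = (target[name] - addr) % cache_size if sel else 0
        layout.append((name, addr + pad, pad))   # (name, address, padding)
        addr += pad + size
    return layout
```

For example, with a 128-byte cache and selected blocks "a" (32 B) and "b" (32 B) separated by 40 B of unselected content, "b" is assigned cache offset 32 but would naturally sit at address 72, so 88 bytes of padding move it to address 160, which maps back to offset 32.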


Table 3
Benchmarks used in previous works.

Name        Source     Size (B)
Fibonac     SNU        3 499
FFT         berkeley   6 244
Square      SNU        3 567
BubSort     MRTC       2 779
FIR1        IJAE       11 965
FIR2        MRTC       10 563
MatriMulti  MRTC       3 037
BinaSear    SNU        4 248
LCDOutput   MRTC       1 678
SearMatrix  MRTC       936

Table 4
Results of proposed algorithms and related work. Algorithms: Total_WCET (10^-6 s).

Cores  Tasks  EQUAL   INTEG [26]  TACP    MTACP
2      5      29.51   26.88       26.21   25.79
2      10     140.40  139.52      138.83  136.02
2      20     236.42  234.48      234.48  227.50
2      50     602.98  590.19      576.02  576.02
4      5      19.66   15.48       15.48   13.11
4      10     90.26   77.73       75.57   72.79
4      20     130.10  126.11      122.37  120.02
4      50     322.36  314.09      300.09  300.09
8      10     63.77   58.66       53.45   49.09
8      20     89.15   67.38       65.05   63.23
8      50     180.18  162.70      162.30  160.61
16     20     65.60   59.86       57.62   55.77
16     50     110.84  83.55       83.29   81.66
AvgR                  9.91%       11.25%  15.11%

6. Experimental results

This section presents the experimental results to illustrate the effectiveness of the proposed algorithms. We use homogeneous cores and direct-mapped caches in our experiments. The granularity of cache partitioning and locking is one cache line. The shared L2 cache is set to several different values using different configurations, as mentioned later. We simulate the cache control mechanism to support cache locking and implement the locking algorithms in [21,34]. The proposed task assignment and cache partitioning algorithms and the previous related techniques are implemented as a post-compiler optimizer.

A total of 50 tasks are used in our experiments. Ten of the tasks are benchmarks used in previous WCET studies by various research groups (SNU [28], MRTC [24]), as shown in Table 3. The function fij(x, y) for these benchmarks is obtained from aiT [1] equipped with the cache locking ability of [21,34]. The other 40 tasks are synthetic benchmarks with preset profile parameters. For generated tasks, we use a function fij(x, y) = (1 − x/(S1i * size(tj) * k))(1 − y/(S2 * g)), where size(tj) is the size (Byte) of task tj, and k, g are parameters used to measure how the L1 and L2 cache affect the WCET, respectively. This function is also constructed based on the experimental curves obtained by applying the techniques of [21,34] in aiT.

Experiments are conducted on different sets of tasks under different conditions. We fix the L2 cache size to be 10% of the total size of each task set. We do not use a unified L2 cache size because the total size changes dramatically across task sets. For example, the task set of 50 tasks in our experiments has a size of about 5.8 M, while the task set of 5 tasks only has a size of about 500 K. By using 10% of the total size, we can simulate an L2 cache size of about 64 K–512 K, which is a typical L2 cache size for embedded chips. For example, ARM1136 has an L2 cache of 128 K, and XScale3 has an L2 cache of 512 K. Some chips may be armed with a bigger L2 cache; therefore we also try a 30% ratio and find that the results are similar in both settings. A bigger cache size may cost longer optimization time, as shown in Section 6.4.

The approaches under comparison are:
1. EQUAL: Decoupled task assignment and equal cache partitioning among cores.
2. INTEG: The heuristic algorithm in [26]. This algorithm first partitions the L2 cache equally among the cores. Then it carries out task assignment following an increasing order of the tasks' execution times. Based on the "elasticity" of a task, which reflects how sensitive the task is to cache size, it adjusts the cache partition after each assignment.
3. TACP: The proposed algorithm in Section 5.
4. MTACP: The modified algorithm in Section 5.4.

6.1. Under different MPSoC architectures

First, task sets are simulated to test the proposed algorithms and related works under different MPSoC architectures. We use core sets of 2, 4, 8, 16 cores and task sets of 5, 10, 20, 50 tasks for each algorithm. In Table 4, columns "Cores" and "Tasks" give the number of cores and tasks, respectively. Columns "EQUAL",

"INTEG", "TACP" and "MTACP" give the Total_WCET (10^-6 s) achieved when using algorithms EQUAL, INTEG, TACP and MTACP, respectively. Row "AvgR" gives the average reduction rate of Total_WCET using INTEG, TACP and MTACP compared to EQUAL. The Total_WCET contains the overhead caused by cache locking.

As can be seen from Table 4, the proposed TACP and MTACP achieve significant WCET reductions for all MPSoC architectures compared to EQUAL. INTEG also obtains better solutions than EQUAL and can sometimes match TACP, for example, for 5 tasks on 4 cores. But it cannot compete with the proposed techniques because it only adjusts the cache partition and does not adjust the task assignment. MTACP always performs best among the four algorithms. On average, INTEG, TACP and MTACP achieve about 9.91%, 11.25% and 15.11% Total_WCET reductions, respectively, compared to algorithm EQUAL. The table in this section shows the Total_WCET obtained under different architectures; more detailed comparisons between the algorithms are given in the following sections.

6.2. Under a fixed number of cores

Fig. 3(a–d) shows the results when fixing the core number at 2, 4, 8 and 16, respectively. In each figure, the Y-axis represents the normalized Total_WCET, and the X-axis represents different task sets expressed by the number of tasks. The basis of comparison is the corresponding result of EQUAL as shown in Table 4. For the dual-core architecture, we also give the normalized optimal results using a brute-force method. This method takes 14 h to solve the problem with 50 tasks on 2 cores; it is impractical to simulate the brute-force method for more than 2 cores. As can be seen from Fig. 3, TACP performs better than or equal to INTEG, while MTACP performs better than or equal to TACP for all task sets under different MPSoC architectures. MTACP sometimes achieves the optimal solution on the dual-core architecture.
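The normalization used in these figures is simply each algorithm's Total_WCET divided by EQUAL's for the same configuration. For instance, taking the dual-core TACP rows of Table 4:

```python
# Dual-core rows of Table 4: EQUAL and TACP Total_WCET values (1e-6 s)
equal = [29.51, 140.40, 236.42, 602.98]   # 5, 10, 20, 50 tasks
tacp = [26.21, 138.83, 234.48, 576.02]

# Normalized Total_WCET: each value divided by the EQUAL baseline
normalized = [round(t / e, 3) for t, e in zip(tacp, equal)]
print(normalized)
```

These ratios are the bar heights TACP would take in Fig. 3(a).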
6.3. Under a fixed number of tasks

We fix the number of tasks to see how the number of cores affects the overall WCET. Fig. 4(a) uses the task set of 20 tasks with the number of cores varying among {2, 4, 8, 16}, while Fig. 4(b) uses the task set of 50 tasks with the number of cores varying among {2, 4, 8, 16, 32}. The Y-axis represents the normalized Total_WCET, and the X-axis represents different numbers of cores. The basis of comparison is the EQUAL algorithm. The circle points give the normalized optimal results under the dual-core architecture.

[Fig. 3. Comparison between algorithms under a fixed number of cores: (a) 2 cores; (b) 4 cores; (c) 8 cores; (d) 16 cores.]

[Fig. 4. Comparison between algorithms under a fixed number of tasks: (a) 20 tasks; (b) 50 tasks.]

We can see that for the task set of 20 tasks, it is not necessarily true that a larger core number yields a better WCET reduction ratio. When the core number is 8, each algorithm achieves its best WCET reduction ratio. This is because with a large core number, the number of tasks on each core is reduced, which tends to decrease the Total_WCET, but the L2 cache partitioned to each core is also limited, which tends to increase the Total_WCET; and vice versa for a small core number. The same phenomenon appears for the task set of 50 tasks, whose best core number is 16 among {2, 4, 8, 16, 32}. There is thus a best core number for an individual task set. This conclusion suggests how to determine how many cores are necessary to achieve a specific WCET bound for an application-specific embedded system.

6.4. Algorithms' overhead evaluation

We evaluate the execution times of the presented algorithms in this section. Based on the time complexities of the algorithms, the L2 cache size and the numbers of tasks and cores all affect the execution time. Therefore, we test our algorithms under different combinations of these configuration parameters and show the execution times for the two proposed algorithms in

Table 5. For the L2 cache size, we choose 10% and 30% of the size of each task set. For the task sets with 20 and 50 tasks, we use both the 8-core and the 16-core architectures. From the complexity analysis, we conclude that the overhead of Step 0 (the fij calculation) may dominate, so we need to count this overhead in. However, for all benchmarks, the cache locking results were already computed by previous work and can be reused directly when combining different task sets; in the experiments, they are read from a cache locking result table, causing very little overhead. Table 5 therefore shows the execution times both when including the time to calculate fij and when including only the time to look it up from an fij table, in columns "F" and "NoF", respectively. The time unit is the second.

We can see from Table 5 that for all configurations, all the proposed algorithms finish within eight seconds when looking up from an fij table. Step 0 takes a longer time, as shown in column "F"; this complies with the complexity analysis. The cache size dominates the impact on the execution time when excluding the time for Step 0: when the cache size is tripled, the execution time increases dramatically. On average, TACP and MTACP need 3.9 s and 4.5 s, respectively, when excluding the time for Step 0.


Table 5
Overhead evaluation (seconds) under different configurations.

                       20 tasks                  50 tasks
                       8 cores      16 cores     8 cores      16 cores
Cache size  Alg        F      NoF   F      NoF   F      NoF   F      NoF
10%         TACP       10.7   1.51  21.1   2.18  28.3   2.21  32.2   2.79
10%         MTACP      11.0   1.78  21.6   2.45  28.9   2.69  32.7   3.02
30%         TACP       39.2   4.12  50.5   5.87  61.2   6.23  69.5   7.16
30%         MTACP      40.1   4.90  50.9   6.31  61.9   6.89  70.4   7.92

("F" includes the time to calculate fij; "NoF" includes only the fij table lookup time.)

This overhead is not a big issue: first, a typical L2 cache in real embedded chips is not very big; furthermore, because we target embedded systems with specified applications, these offline decisions need to be made only once. Therefore, the proposed algorithms are efficient enough to be applied.

7. Conclusion

This paper focuses on the joint problem of task assignment and L2 cache partitioning for MPSoC embedded systems. The goal is to minimize the overall WCET of the system. Cache locking is utilized to guarantee a predictable, minimized WCET for each task. We prove that the joint problem is NP-hard and propose efficient algorithms to perform task assignment and L2 cache partitioning jointly. We also propose two optimal algorithms for L2 cache partitioning with a fixed task assignment. Experimental results show that the proposed algorithms can reduce the overall WCET efficiently. Suggestions on how to determine specific architecture parameters under different situations are also inferred from the experimental results. In the future, we will study the sub-task assignment and cache partitioning problem for single-task MPSoC systems, where the sub-tasks of a task have dependences on each other.

Acknowledgment

This work is supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region [Project No. CityU 123609] and a grant from City University of Hong Kong [Project No. 7008042].

References

[1] AbsInt, 2010. http://www.absint.com/ait/.
[2] ARM, 2010. http://www.arm.com/.
[3] G. Ausiello, P. Crescenzi, G. Gambosi, V. Kann, A. Marchetti-Spaccamela, M. Protasi, Complexity and Approximation: Combinatorial Optimization Problems and their Approximability, Springer, New York, NY, 1999.
[4] J.M. Calandrino, J.H. Anderson, Cache-aware real-time scheduling on multicore platforms: heuristics and a case study, in: Proceedings of the 20th Euromicro Conference on Real-Time Systems, ECRTS08, pp. 299–308.
[5] A. Campoy, A.P. Ivars, F. Rodriguez, J.V. Busquets-Mataix, Static use of locking caches vs. dynamic use of locking caches for real-time systems, in: Proceedings of the IEEE Canadian Conference on Electrical and Computer Engineering, pp. 1283–1286.
[6] J. Chang, G.S. Sohi, Cooperative cache partitioning for chip multiprocessors, in: Proceedings of the 21st ACM International Conference on Supercomputing, pp. 242–252.
[7] E. Davis, J. Jaffe, Algorithms for scheduling tasks on unrelated processors, Journal of the Association for Computing Machinery 28 (1981) 721–736.
[8] H. Falk, S. Plazar, H. Theiling, Compile-time decided instruction cache locking using worst-case execution paths, in: Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis, CODES+ISSS07, pp. 143–148.
[9] A. Fedorova, M. Seltzer, M.D. Smith, Cache-fair thread scheduling for multicore processors, Technical Report TR-17-06, Division of Engineering and Applied Sciences, Harvard University, 2006.
[10] A. Fedorova, M. Seltzer, M.D. Smith, C. Small, CASC: a cache-aware scheduling algorithm for multithreaded chip multiprocessors, Technical Report TR-2005-0142, Sun Labs, 2005. http://research.sun.com/scalable/pubs/CASC.pdf.
[11] C. Ferdinand, R. Wilhelm, Efficient and precise cache behavior prediction for real-time systems, Real-Time Systems 17 (1999) 131–181.
[12] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, Freeman, New York, NY, 1979.
[13] A. Gordon-Ross, F. Vahid, N. Dutt, A first look at the interplay of code reordering and configurable caches, in: Proceedings of the 15th ACM Great Lakes Symposium on VLSI, pp. 416–421.
[14] C. Guillon, F. Rastello, T. Bidault, F. Bouchez, Procedure placement using temporal-ordering information: dealing with code size expansion, Journal of Embedded Computing 1 (2004) 437–459.
[15] O.H. Ibarra, C.E. Kim, Heuristic algorithms for scheduling independent tasks on nonidentical processors, Journal of the ACM 24 (1977) 280–289.
[16] K. Ishizaka, M. Obata, H. Kasahara, Cache optimization for coarse grain task parallel processing using inter-array padding, Languages and Compilers for Parallel Computing 2958 (2004) 64–76.
[17] R. Iyer, CQoS: a framework for enabling QoS in shared caches of CMP platforms, in: Proceedings of the 18th ACM International Conference on Supercomputing, pp. 257–266.
[18] S. Kim, D. Chandra, Y. Solihin, Fair cache sharing and partitioning in a chip multiprocessor architecture, in: Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, pp. 111–122.
[19] Y. Li, W. Wolf, A task-level hierarchical memory model for system synthesis of multiprocessors, in: Proceedings of the 34th Design Automation Conference, DAC97, pp. 153–156.
[20] T. Liu, M. Li, C.J. Xue, Instruction cache locking for real-time embedded systems with multi-tasks, in: Proceedings of the 15th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, RTCSA09, pp. 494–499.
[21] T. Liu, M. Li, C.J. Xue, Minimizing WCET for real-time embedded systems via static instruction cache locking, in: Proceedings of the 15th IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS09, pp. 35–44.
[22] T. Liu, Y. Zhao, M. Li, C.J. Xue, Task assignment with cache partitioning and locking for WCET minimization on MPSoC, in: Proceedings of the 39th International Conference on Parallel Processing, ICPP 2010.
[23] MIPS, 2010. http://www.mips.com/.
[24] MRTC, 2010. http://www.mrtc.mdh.se/projects/wcet/home.html.
[25] I. Puaut, D. Decotigny, Low-complexity algorithms for static cache locking in multitasking hard real-time systems, in: Proceedings of the 23rd IEEE Real-Time Systems Symposium, RTSS02, pp. 114–123.
[26] H. Salamy, J. Ramanujam, A framework for task scheduling and memory partitioning for multi-processor system-on-chip, in: Proceedings of the 4th International Conference on High Performance and Embedded Architectures and Compilers, HiPEAC09, pp. 263–277.
[27] J.E. Sasinowski, J.K. Strosnider, A dynamic programming algorithm for cache/memory partitioning for real-time systems, IEEE Transactions on Computers 42 (1993) 997–1001.
[28] SNU, Seoul National University, 2010. http://www.useoul.edu/.
[29] G.E. Suh, L. Rudolph, S. Devadas, Dynamic partitioning of shared cache memory, Journal of Supercomputing 28 (2004) 7–26.
[30] V. Suhendra, T. Mitra, Exploring locking & partitioning for predictable shared caches on multi-cores, in: Proceedings of the 45th Design Automation Conference, DAC08, pp. 300–304.
[31] V. Suhendra, C. Raghavan, T. Mitra, Integrated scratchpad memory optimization and task scheduling for MPSoC architectures, in: Proceedings of the 2006 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, CASES06, pp. 401–410.
[32] X. Vera, B. Lisper, J. Xue, Data cache locking for higher program predictability, in: Proceedings of the 2003 ACM International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS03, pp. 272–282.
[33] X. Vera, B. Lisper, J. Xue, Data caches in multitasking hard real-time systems, in: Proceedings of the 24th IEEE Real-Time Systems Symposium, RTSS03, pp. 154–165.
[34] X. Vera, B. Lisper, J. Xue, Data cache locking for tight timing calculations, ACM Transactions on Embedded Computing Systems 7 (2007) Article 4.
[35] Xenon, 2010. http://domino.research.ibm.com/comm/research_projects.nsf/pages/multicore.Xbox360.html.
[36] W. Zhao, D. Whalley, C. Healy, F. Mueller, Improving WCET by applying a WC code-positioning optimization, ACM Transactions on Architecture and Code Optimization 2 (2005) 335–365.


Tiantian Liu is currently a Ph.D. student in the Department of Computer Science, City University of Hong Kong. She received her M.E. and B.E. degrees from the Department of Computer Science and Technology at Shandong University in China in 2008 and 2005 respectively. Her research interests include embedded and real-time systems, parallel systems, and memory optimizations.

Minming Li is currently an assistant professor in the Department of Computer Science, City University of Hong Kong. He received his Ph.D. and B.E. degrees from the Department of Computer Science and Technology at Tsinghua University in 2006 and 2002, respectively. His research interests include algorithm design and analysis in wireless networks and energy-efficient scheduling, combinatorial optimization, and computational economics.

Yingchao Zhao is currently a postdoctoral fellow in the Department of Industrial Engineering and Logistics Management at HKUST. She received her Ph.D. and B.E. degrees from the Department of Computer Science and Technology at Tsinghua University in 2009 and 2004, respectively. Her research interests include algorithmic game theory, algorithm design, computational complexity analysis, and scheduling.

Chun Jason Xue is currently an assistant professor in the Department of Computer Science, City University of Hong Kong. He received his Ph.D. and M.E. degrees from the Department of Computer Science at the University of Texas at Dallas in 2007 and 2003, respectively. He received his B.E. degree from the Department of Computer Science and Engineering, University of Texas at Arlington, in 1997. His research interests include optimization for parallel embedded systems, optimization for DSPs with VLIW or multi-core architectures, and hardware/software co-design.