Non-preemptive and SRP-based fully-preemptive scheduling of real-time Software Transactional Memory



Journal of Systems Architecture 61 (2015) 553–566

Contents lists available at ScienceDirect

Journal of Systems Architecture
journal homepage: www.elsevier.com/locate/sysarc

Non-preemptive and SRP-based fully-preemptive scheduling of real-time Software Transactional Memory

António Barros *, Luís Miguel Pinho, Patrick Meumeu Yomsi
CISTER Research Centre, ISEP/IPP, Rua Dr. António Bernardino de Almeida 431, 4200-072 Porto, Portugal

ARTICLE INFO

Article history: Received 1 October 2014; Received in revised form 3 May 2015; Accepted 1 July 2015; Available online 26 July 2015

Keywords: Real-time systems; Synchronization mechanisms; Software Transactional Memory; Non-preemptive scheduling; Stack Resource Protocol; Cache non-coherency; Multi-core platforms; Contention management

ABSTRACT

Recent embedded processor architectures containing multiple heterogeneous cores and non-coherent caches have renewed attention to the use of Software Transactional Memory (STM) as a building block for developing parallel applications. STM promises to ease concurrent and parallel software development, but relies on the possibility of aborting conflicting transactions to maintain data consistency, which in turn affects the execution time of tasks carrying transactions. As a consequence, the timing behaviour of the task set may not be predictable, so it is crucial to limit the execution time overheads resulting from aborts. In this paper we formalise a FIFO-based algorithm to order the sequence of commits of concurrent transactions. Then, we propose and evaluate two non-preemptive and one SRP-based fully-preemptive scheduling strategies, in order to avoid transaction starvation.

© 2015 Elsevier B.V. All rights reserved.

1. Introduction

The current trend of manufacturing chips with multiple processor cores to increase processing power has provided the ability to execute concurrent software in parallel. This tendency towards ever larger numbers of processor cores will further impact the way systems are developed. Some recently proposed architectures for embedded systems, such as Kalray's MPPA [1] (up to 1024 cores; the current version has 256 cores) and the Tilera tile CPU [2] (shipping versions feature 64 cores), make it possible to (i) consolidate multiple applications onto the same processor, maximising hardware utilisation and reducing cost, size, weight, and power requirements, and (ii) improve application performance by exploiting parallelism at the application level. Nevertheless, such architectures raise several problems, due to core interconnection and the memory hierarchy: (1) cache coherency is being challenged [3]; although some solutions scale up to dozens of cores [4], some chips provide only software-based solutions; (2) buses do not scale, and the paradigm is shifting to networks-on-chip (NoC); finally, (3) platforms can be homogeneous, with either symmetric or asymmetric multiprocessing, or heterogeneous, with different core types. All

* Corresponding author. E-mail addresses: [email protected] (A. Barros), [email protected] (L.M. Pinho), [email protected] (P. Meumeu Yomsi).
http://dx.doi.org/10.1016/j.sysarc.2015.07.008
1383-7621/© 2015 Elsevier B.V. All rights reserved.

put together, all these issues substantially influence the way applications share data. Lock-based synchronisation solutions are popularly used to avoid race conditions when data are shared between tasks via parallel threads. We assume that each task may generate a potentially infinite number of instances, also referred to as jobs. However, in multiprocessor systems, coarse-grained locks serialise non-conflicting operations that could progress in parallel, degrading system throughput, while fine-grained locks increase the complexity of system development, degrading system composability. Alternatively, non-blocking approaches present strong conceptual advantages [5] and have been shown to perform better than lock-based ones in several cases [6]. Software Transactional Memory (STM) [7] is a concept in which a critical section – also referred to as a transaction – executes in isolation, without blocking, regardless of other simultaneous transactions. Here, an optimistic concurrency control mechanism is responsible for serialising concurrent transactions, maintaining the consistency of shared data objects. Conflicts are solved by applying a contention policy that selects the transaction that will commit, while its contenders will most likely abort and retry. Hence, solutions must be devised that reduce contention. The time overhead resulting from aborts affects the worst-case execution time (WCET) of a task that executes a transaction. Therefore, the timing behaviour of a task can only be predictable if the transaction overhead is bounded, making it possible to determine


the WCET and the utilisation of the task. Additionally, minimising the number of aborts reduces wasted execution time. In this paper, we formalise a FIFO-based contention management algorithm and two non-preemptive scheduling strategies that provide predictability and prevent transaction starvation. We also present a new fully-preemptive scheduling strategy, based on the Stack Resource Protocol (SRP) [8], to tackle the issues identified in the non-preemptive approaches. We evaluate the behaviour of these three strategies by simulation, analysing the introduced overhead and the consequent impact on schedulability, and comparing their performance with pure partitioned EDF (P-EDF) and a lock-based synchronisation mechanism intended for multi-core architectures. Paper structure: The paper is structured as follows. Section 2 presents the background and relevant published work on using STM in embedded real-time systems based on parallel architectures. Section 3 sets the system model and the assumptions made in this work. Section 4 presents our main contribution: it formalises a decentralised algorithm, referred to as FIFO-CRT, to manage conflicts between concurrent transactions; this algorithm is more effective if transactions are not preempted, as we show in Section 5. Section 6 identifies the issues of the two non-preemptive scheduling approaches and defines an SRP-based fully-preemptive alternative, referred to as SRPTM, that addresses those issues. The results from simulations that compare the performance of FIFO-CRT, under the two proposed scheduling strategies and SRPTM, against pure partitioned EDF (P-EDF) and a lock-based synchronisation mechanism designed for multi-core systems are presented in Section 7. The paper terminates with conclusions and perspectives in Section 8.
2. Background and related work

The transactional memory framework, in which the programmer must indicate which piece of code forms a transaction, promises to ease concurrent programming: it relies on a mechanism that maintains the consistency of shared data objects located in the transactional memory. In an idealised world, multiple transactions may always execute in parallel. However, when conflicts on shared objects occur between concurrent transactions (either read-write or write-write conflicts), a contention management policy is applied in order to guarantee the serialisation of the concurrent transactions. Usually, this allows one transaction to complete while the remaining contenders are aborted and forced to retry. This approach has proven to scale well on multiprocessors [9], as it delivers higher throughput than coarse-grained locks. Also, it does not increase design complexity as fine-grained locks do [10]. STM achieves better performance when contention is low, which causes a low transaction abort ratio. Thus, STM behaves very well in systems exhibiting the following characteristics [11]: (1) a predominance of read-only transactions, (2) short-running transactions and (3) a low ratio of context switches during the execution of a transaction. Some transactions may present characteristics (e.g. long-running, low priority) that can potentially expose them to starvation. In the parallel systems literature, the main concern about STM is system throughput, and the contention management policy often has the role of preventing livelock (a pair of transactions indefinitely aborting each other) and starvation (one transaction being constantly aborted by its contenders), so that each transaction will eventually commit and the system will progress.
In real-time systems, the guarantee that a transaction will eventually conclude is not sufficient to assure the timing requirements that are critical to such systems: how long a transaction takes to commit must be known, otherwise the worst-case execution time of the task cannot be calculated. The schedulability of the task set depends on the WCET of each task. As transactions are part of task executions, this WCET can only be calculated if the maximum

time used to commit the included transaction is known. As such, STM is relevant in real-time systems as long as the employed contention management policy provides guarantees on the maximum number of retries of each transaction. Although the concept of STM is not new and numerous works have been published, only a few have dealt with it in the context of real-time systems. Manson et al. [12] proposed a data access mechanism for uniprocessor platforms – the Preemptible Atomic Regions – together with an analysis to bound the response time of jobs. In this approach, the atomicity of a transaction is guaranteed by assuring that no concurrent job will execute between the beginning and the end of the transaction. If a job is preempted by a higher-priority job while performing a transaction, the transaction is immediately aborted and its effects are undone. This policy matches the Abort-and-Restart model [13]. As no concurrent transactions are allowed in the system, it is unfortunately impractical for multiprocessor systems. Anderson et al. [14] established scheduling conditions for lock-free transactions under Earliest Deadline First (EDF) [15] and Deadline Monotonic (DM) [16], but exclusively for uniprocessor systems. In the same direction, Anderson et al. [17] provided a different approach to support transactions in multiprocessor systems: a wait-free mechanism that relies on a helping scheme. This mechanism provides an upper bound on the transaction execution time. However, an arriving transaction must help pending transactions1 before being able to proceed, even if no conflicts occur, so the upper bound is likely to increase with the number of processors in the system. Fahmy et al. [18] described an algorithm to calculate an upper bound on the worst-case response time2 of tasks on a multiprocessor system using STM, when tasks are scheduled by following the Pfair algorithm.
The authors assume that each task can have multiple atomic regions, and that concurrent transactions can interfere with each other. Conflicts are detected and solved during the commit phase. Unfortunately, this analysis applies only to small atomic regions, as it assumes that any transaction will execute in, at most, two quanta. Sarni et al. [19] proposed real-time scheduling of concurrent transactions for soft real-time systems. The authors characterised transactions with scheduling parameters, namely a relative deadline which expresses the maximum amount of time desirable for the transaction to commit once it has started. When an instance of a transaction starts, it is assigned an absolute deadline determined by the sum of its release time and relative deadline. Conflicting transactions are serialised based exclusively on their absolute deadlines. However, this may have a negative effect on transactions with later deadlines. The authors adapted a practical STM to run on a real-time kernel, and modified the contention manager to apply their proposed policy. The experimental results showed that using scheduling data can improve the number of transactions that meet their deadlines. Barros and Pinho [20] defended a FIFO-based approach to serialise concurrent transactions as a means to predict the time required to commit, but only provided a sketch of the decision algorithm. Furthermore, that paper does not take into account the considerable effect of preempting a transaction on the predictability of the time required to commit. Cotard [21] developed a wait-free STM for hard real-time embedded systems on multi-core platforms, in which transactions help their contenders to commit. This work assumes that

1 Helping means that the job executes the transactional code of another task's job that is progressing in the system, but has not committed yet.
2 The worst-case response time is the longest time that can elapse since the job is released until it finishes its execution.


transactions are homogeneous, i.e. the data set of a transaction is formed exclusively by the read set (read-set transaction) or by the write set (write-set transaction). Consequently, a task can have multiple transactions of different natures: e.g., a read-set transaction to acquire data and a write-set transaction to later output the results of a computation. The system model does not consider a specific scheduling algorithm, but requires transactions to be executed non-preemptively until they commit. In addition, write-set transactions that have intersecting write sets must be allocated to the same core; since transactions cannot be preempted, it is then impossible to have a write-write conflict between transactions. This policy provides two results: (i) a write-set transaction will never abort, because there are no write-write conflicts, and (ii) a read-set transaction will abort at most once, because concurrent updates help read-set transactions that have previously aborted. Heterogeneous transactions that combine read and write operations in one single atomic section are unfortunately not considered in this work, although they may be required in more complex applications. El-Shambakey and Ravindran [22] analysed the response time of two contention managers that take the scheduling priorities of jobs as the decision criterion: the EDF contention manager (ECM), for task sets scheduled by following global earliest deadline first (G-EDF), and the RMA contention manager (RCM), for task sets scheduled by following global rate monotonic assignment (G-RMA). When a conflict is detected, the ECM selects the transaction with the earlier deadline to commit, whereas the RCM selects the transaction with the higher priority. The system model is limited, as it assumes that each transaction accesses one single STM object; each task can have multiple transactions, but they cannot be nested.
The restriction on the number of accessed objects per transaction simplifies the response time analysis and allows the authors to establish a direct comparison with lock-free retry-loop mechanisms [23] and to determine under which conditions STM provides lower retry costs than the lock-free approach. Unfortunately, this same restriction severely reduces the more general notion of an atomic section, in which multiple operations on multiple data must be viewed as one atomic operation. In the same vein, the same authors [24] propose the length-based contention manager (LCM), to be used together with G-EDF or G-RMA. In LCM, a transaction which detects a conflict voluntarily aborts under two circumstances: (i) it has a lower priority than the concurrent transaction, or (ii) the concurrent transaction has executed for a pre-defined fraction of its total execution. This work assumes the same system model as [22], namely one object accessed per transaction and multiple non-nested transactions per task. Consequently, it presents the same limitations. In addition, upon the detection of a transaction conflict, the contention manager must keep track of the execution time of all jobs executing on all cores in order to decide which transaction will commit. Unfortunately, this may lead to non-negligible execution time overheads. The same authors [25] propose the Priority contention manager with Negative values and First access (PNF) to circumvent the aforementioned issues. This system model allows transactions to access multiple objects. PNF distributes transactions in progress into two groups: the m-set (whose maximum size is m, the number of cores), in which transactions execute non-preemptively and will commit immediately, and the n-set (whose maximum size is n, the number of tasks), in which transactions execute with the lowest priority defined by the scheduler and will abort.
This approach relies heavily on a centralised contention manager and, therefore, may unfortunately not scale with an increasing number of cores. In order to circumvent this last limitation, the same authors [26] provide a more dynamic and decentralised approach, in which the contention manager does not need to know beforehand the

data set of each transaction. The contention manager is called First Bounded, Last Timestamp (FBLT). In this approach, each transaction is assigned a maximum number of aborts, and conflicts are solved by following the same principle as in LCM. When a transaction reaches its maximum number of aborts, FBLT registers the timestamp and raises the priority of the job to a non-preemptable level. Unlike PNF, conflicts between non-preemptable transactions are possible; FBLT selects the transaction with the earliest registered timestamp to commit. The main limitation of this approach is that the maximum number of aborts per transaction is assumed to be given; this information may not be available for complex real-world applications. The approaches presented by El-Shambakey and Ravindran use global schedulers.3 Although these schedulers provide interesting features, such as better load balancing and higher flexibility in terms of schedulability, they introduce serious overheads (migrations and preemptions), which are not desirable in the context of hard real-time systems, where applications have stringent timing requirements. Furthermore, Bastoni et al. [27] showed that on a platform with twenty-four cores global EDF is not a viable choice, while clusters of size six closely approximated global approaches. For these reasons, this work adopts fully partitioned schedulers.4 These schedulers take advantage of the entire body of knowledge, intuitions and techniques that have been developed for single-core platforms over the past forty years. Furthermore, they do not suffer from the potential scheduling anomalies that may affect global schedulers. The aforementioned works provide some interesting perspectives on how to deal with STM in the context of real-time systems. However, it is clear that there are many pending issues, and further research is necessary to take advantage of future parallel architectures.
This paper contributes to paving the way towards new scheduling approaches that make the contention manager predictable and fair, by using on-line information. The purpose is to reduce the overall number of transaction retries, improving their responsiveness and reducing wasted processor usage, while ensuring that deadlines are met.

3. System model

Task specification. We assume that the workload is carried by a set of $n$ periodic tasks $\tau \stackrel{\text{def}}{=} \{\tau_1, \ldots, \tau_n\}$. Each task $\tau_i$ releases a potentially infinite number of jobs and is characterised by a worst-case execution time $C_i$, a relative deadline $D_i$, and a period $T_i$. These parameters are given with the following interpretation. The $j$-th job of task $\tau_i$ executing on processor $p_k$, referred to as $\tau_{i,j}^k$, is characterised by its release time $r_{i,j}$ such that

$$r_{i,j+1} \stackrel{\text{def}}{=} r_{i,j} + T_i, \quad \forall i \in \{1, \ldots, n\}, \; \forall j \geq 1,$$

and an absolute deadline $d_{i,j} \stackrel{\text{def}}{=} r_{i,j} + D_i$. We assume that the system is synchronous, i.e., all tasks in $\tau$ release their first jobs at the same time instant $t = 0$. Formally, we assume that $r_{i,1} = 0, \; \forall i$. For such a system, the interval $[0, P)$, where $P$ is the hyper-period ($P = \mathrm{lcm}(T_1, \ldots, T_n)$ – the least common multiple of the periods of all tasks), is a feasibility interval [28]. For the sake of clarity, we assume that each task $\tau_i$ performs at most one transaction, denoted as $x_i$. Nevertheless, the results obtained throughout this paper are extensible, with minor effort, to tasks which execute multiple non-nested transactions. The transaction of $\tau_i$ is characterised by:

3 In global scheduling, all tasks are stored in a single priority-ordered queue. The global scheduler selects for execution the highest priority tasks from this queue.
4 In fully partitioned schedulers, tasks are first assigned statically to processors. Once the task assignment is done, each processor independently uses its local scheduler.


• $C_{x_i}$: the maximum time required to execute the sequential code of $x_i$ once, without any external interference from other tasks of the system, and attempt to commit.
• The data set ($DS_i$): the collection of shared objects that are accessed by $x_i$. This data set can be partitioned into two subsets, $RS_i$ and $WS_i$, where:
  – The read set ($RS_i$) is the subset of objects that are accessed by $x_i$ solely for reading.
  – The write set ($WS_i$) is the subset of objects that are modified by $x_i$ during its execution.
The sizes of the data set, read set and write set are denoted as $|DS_i|$, $|RS_i|$ and $|WS_i|$, respectively.

Platform and scheduler specifications. We assume that all the jobs are executed on a multi-core platform $\pi \stackrel{\text{def}}{=} \{p_1, \ldots, p_m\}$ composed of $m$ homogeneous cores, i.e., all cores have the same computing capabilities and are interchangeable. The task set $\tau$ is scheduled by following a partitioned EDF (P-EDF) scheduler, i.e., each task is statically assigned to a specific core at design time, and each core schedules its subset of tasks at run time by following the classical EDF scheduler: at each time instant the job with the earliest absolute deadline is selected for execution and ties are broken in an arbitrary manner. Note: task migrations from one core to another at runtime are not allowed.

STM specification. We assume that a collection of $p$ STM objects $O \stackrel{\text{def}}{=} \{o_1, \ldots, o_p\}$ is located in the globally shared memory and accessible to all tasks carrying a transaction, independently of the core on which they are executing. Multiple simultaneous transactions are supported, and for each object there is a chronologically ordered list that records all transactions currently accessing the object. The number of transactions that have a specific object, say $o_j$, in their data sets is denoted as $|o_j|$. Contention occurs when two or more transactions, executing in parallel, have intersecting data sets. Hence, we can map contentions by using a graph $G$ in which vertices represent transactions and edges connect transactions with intersecting data sets.
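The graph $G$ need not be materialised explicitly: the sets of transitively connected transactions (the contention groups defined below) can be computed directly from the data sets with a union-find pass. The following is a minimal sketch, assuming data sets are given as Python sets; all function and variable names are illustrative, not from the paper.

```python
def contention_groups(data_sets):
    """Partition transactions into contention groups: the connected components
    of the graph whose vertices are transactions and whose edges join
    transactions with intersecting data sets.
    `data_sets` maps a transaction id to its set of accessed STM objects."""
    ids = list(data_sets)
    parent = {i: i for i in ids}

    def find(i):
        # Root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        parent[find(a)] = find(b)

    # An edge exists between direct contenders: intersecting data sets.
    for a in ids:
        for b in ids:
            if a < b and data_sets[a] & data_sets[b]:
                union(a, b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), set()).add(i)
    return list(groups.values())

# The example of Fig. 1: five transactions sharing three objects,
# yielding the two groups {x1, x5} and {x2, x3, x4}.
ds = {1: {"o1"}, 2: {"o2"}, 3: {"o2", "o3"}, 4: {"o3"}, 5: {"o1"}}
```

Note that transactions 2 and 4 share no object, yet end up in the same group through transaction 3: they are indirect contenders.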

Definition 1 (Contention group). Given a contention graph $G$, a contention group, denoted as $X_k$ (with $k \geq 0$), is defined as a set of connected vertices of $G$ in which any two transactions are connected by a path.

Fig. 1 illustrates a very simple example of contention groups, in which a set of five transactions $\{x_1, x_2, x_3, x_4, x_5\}$ share a set of three STM objects $\{o_1, o_2, o_3\}$. In this example, the data sets of the transactions form two distinct contention groups: $X_1 = \{x_1, x_5\}$ and $X_2 = \{x_2, x_3, x_4\}$.

Each instance of a transaction has a life cycle that follows the states represented in Fig. 2, in which the transaction is in progress. When a transaction arrives, it executes the transaction code and commits if no conflicts are detected; otherwise it may be aborted, retrying to commit immediately. Note: a transaction may be aborted multiple times until it successfully commits.

Definition 2 (Direct contender of a transaction). A direct contender of a transaction $x_i$ is a transaction $x_j$ that shares at least one STM object with $x_i$. Formally, $DS_i \cap DS_j \neq \emptyset$.

Definition 3 (Indirect contender of a transaction). An indirect contender of a transaction $x_i$ is a transaction $x_j$ that does not share any STM object with $x_i$, but belongs to the same contention group as $x_i$.

Definition 4 (Independent transactions). Two transactions $x_i$ and $x_j$ are said to be independent when they belong to different contention groups.

Definition 5 (Transaction overhead of a job). The transaction overhead of job $\tau_{i,j}$, denoted as $W_{i,j}$, is defined as the time wasted in executing aborted commit attempts of $x_i$. Formally, $W_{i,j}$ is given by:

$$W_{i,j} \stackrel{\text{def}}{=} A_{i,j} \cdot C_{x_i} \qquad (1)$$

In Eq. (1), $A_{i,j}$ represents the number of failed attempts of $x_i$ before it commits.

Definition 6 (Execution time of jobs of a task executing a transaction). Given task $\tau_i$ executing a transaction $x_i$, the execution time of the $j$-th job, denoted as $C_{i,j}$, is defined as the sum of four terms: (1) the time ($C_{pre\,x_i}$) required to execute the code of $\tau_i$ before $x_i$ starts, (2) the time ($C_{pos\,x_i}$) required to execute the code of $\tau_i$ after $x_i$ has committed, (3) the execution time ($C_{x_i}$) of a successful transaction of $x_i$, and (4) the transaction overhead ($W_{i,j}$) of that job. Formally, $C_{i,j}$ is given by:

$$C_{i,j} \stackrel{\text{def}}{=} C_{pre\,x_i} + C_{x_i} + C_{pos\,x_i} + W_{i,j} \qquad (2)$$

Definition 7 (Task utilisation). The utilisation of task $\tau_i$, denoted as $U_i$, is defined as the average execution ratio of the jobs of $\tau_i$ within one hyper-period. Formally:

$$U_i = \frac{1}{P/T_i} \sum_{j=1}^{P/T_i} \frac{C_{i,j}}{T_i} \qquad (3)$$

Definition 8 (System utilisation). The utilisation of system $\tau$, denoted as $U_S$, is defined as the sum of the utilisations of all tasks in $\tau$. Formally:

$$U_S = \sum_{i=1}^{n} U_i \qquad (4)$$
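As a concrete illustration, the hyper-period and the quantities of Definitions 5–8 translate directly into code. The following is a minimal sketch with illustrative names; it assumes the per-job execution times $C_{i,j}$ of one hyper-period are already known.

```python
from math import lcm  # variadic lcm requires Python 3.9+

def hyper_period(periods):
    """P = lcm(T_1, ..., T_n): length of the feasibility interval [0, P)."""
    return lcm(*periods)

def transaction_overhead(A_ij, C_x):
    """Eq. (1): time wasted in A_ij failed commit attempts, each costing C_x."""
    return A_ij * C_x

def job_execution_time(C_pre, C_x, C_pos, W_ij):
    """Eq. (2): pre-transaction code + one successful attempt + post code
    + the transaction overhead of the job."""
    return C_pre + C_x + C_pos + W_ij

def task_utilisation(C_ij_list, T_i):
    """Eq. (3): average of C_ij / T_i over the P/T_i jobs in one hyper-period.
    `C_ij_list` holds the execution time of each of those jobs."""
    return sum(C / T_i for C in C_ij_list) / len(C_ij_list)

def system_utilisation(task_utilisations):
    """Eq. (4): sum of all task utilisations."""
    return sum(task_utilisations)
```

For instance, a transaction with $C_{x_i} = 3$ that fails twice wastes $W_{i,j} = 6$ time units, which is added to the job's execution time in Eq. (2).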

4. Contention management policy: FIFO-CRT

An STM contention management policy oriented towards real-time systems must meet the following three requirements.

• Predictability. When a transaction arrives, it must be guaranteed that its execution does not exceed a predefined time limit to commit. This requirement imposes an upper bound on the number of aborts that the transaction under analysis can undergo.
• Starvation avoidance. The ability to commit must be distributed in a fair manner among the set of contending transactions, so that no task will have an excessive abort overhead.
• Decentralised contention management. The algorithm that implements the contention management policy should preferably be decentralised and executed by each transaction at the moment it tries to commit, not on a dedicated processor. This way, the overall system overhead is drastically limited.

Barros and Pinho [20] promoted the idea that these requirements are covered by a policy which sequences the contending transactions according to their chronological order of arrival, i.e. the moment at which each transaction executes its first attempt.


The algorithm that we formalise in this work considers static parameters only. Arrival times of transactions and core IDs are examples of the metrics we use, so that the decentralised consensus required above is simple to reach. The time until commit of the target transaction should depend solely on the set of running/ready transactions at the moment this transaction starts. Furthermore, this time must be independent of the future arrivals of other transactions.
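One way such static parameters yield a decentralised consensus is to pair each transaction's arrival time with the ID of its core as a tie-breaker, giving a total chronological order that every core computes identically. This pairing is our illustration (the text names the metrics but not the exact rule), with illustrative names throughout.

```python
from collections import namedtuple

# Illustrative transaction record: arrival time of its first attempt,
# plus the ID of the core on which the carrying job executes.
Tx = namedtuple("Tx", ["arrival", "core_id"])

def older(a, b):
    """Total chronological order on transactions: the earlier first attempt
    wins; core IDs break ties between simultaneous arrivals on different
    cores, so every pair of cores reaches the same verdict."""
    return (a.arrival, a.core_id) < (b.arrival, b.core_id)
```

Because the comparison uses only values fixed at the transaction's first attempt, no shared or centralised state is needed to agree on the ordering.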

Fig. 1. Transaction dependencies by object concurrency.

Fig. 2. State diagram of a transaction.

Algorithm 1 details the operations executed by every transaction when trying to commit. If the transaction is not in the zombie state, it takes ownership of the objects in its data set; a transaction has to wait until an object is released before it can take ownership of that object. For every object in its write subset, the transaction checks whether it is the oldest active transaction accessing it. This condition ensures that the transaction wins against all contenders on that object, following the FIFO-based approach. If the transaction fails on one object, it immediately releases all the objects it currently owns, aborts and retries. Otherwise, if all conflicts are won, the transaction immediately releases the objects in its read subset, commits its updates, and marks the contenders in its write subset as zombies, before releasing the objects. When checking for conflicts, the target transaction only considers the subset of contending transactions that are active at that moment and whose job is running (i.e., executing on some core). A transaction in the zombie state can be overtaken by newer transactions without affecting its number of aborts.
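The decision step just described can be sketched as follows. This is a minimal model of the conflict check of Algorithm 1 only: object ownership during commit, the actual data copy-back and the interaction with the scheduler are all omitted, and every class and function name is ours, not the paper's.

```python
from dataclasses import dataclass

@dataclass
class Txn:
    arrival: int            # chronological order of the first attempt
    read_set: frozenset     # objects accessed solely for reading
    write_set: frozenset    # objects modified by the transaction
    running: bool = True    # the carrying job is currently executing on a core
    zombie: bool = False    # marked doomed by a committed contender

def wins_commit(tx, active):
    """A non-zombie transaction wins iff, for every object in its write set,
    it is the oldest active *running* non-zombie transaction accessing it."""
    if tx.zombie:
        return False
    for obj in tx.write_set:
        for c in active:
            if c is tx or not c.running or c.zombie:
                continue        # preempted jobs and zombies are ignored
            if obj in (c.read_set | c.write_set) and c.arrival < tx.arrival:
                return False    # an older running contender wins this conflict
    return True

def commit(tx, active):
    """On success, every contender whose data set intersects the winner's
    write set is marked as a zombie."""
    if not wins_commit(tx, active):
        return False
    for c in active:
        if c is not tx and (c.read_set | c.write_set) & tx.write_set:
            c.zombie = True
    return True
```

A short usage example: with two write-write contenders on the same object, the older one commits and the newer one is doomed until it restarts as a fresh attempt.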

The zombie rule is justified by the fact that such a transaction is going to inevitably abort in the end. Thus, the algorithm optimises the chances of success by ignoring older transactions that are already doomed to fail. Non-running (i.e., preempted) jobs are ignored as a measure to avoid deadlock between two concurrent transactions on the same core. Consider the case in which a job is preempted by a higher-priority job during the execution of a transaction, and the higher-priority job starts a conflicting transaction: the new transaction would endlessly abort-and-retry, waiting for the older transaction to commit, which would never happen because the preempted job would never get the chance to execute again. Unlike locking solutions, the shared objects are owned only during the commit process, and not during the whole transactional section, which improves parallelism. Note that the ownership process is controlled by the STM and should be transparent to the programmer. This feature improves system composability. However, if we allow preemptions during the execution of transactions, this behaviour can become seriously unpredictable, as we demonstrate in Section 5.

5. Non-preemptive scheduling of transactions

Our algorithm permits a transaction to overtake a preempted transaction, avoiding deadlock between conflicting transactions executing on the same core. Furthermore, this allows sound transactions running on other cores to commit, without having to wait (successively aborting-and-retrying) for an arbitrary amount of time for the preempted transaction to resume execution and eventually commit. However, this reduces the probability of committing transactions that are prone to be preempted (e.g. a long transaction, or one carried by a low-priority job). In the same vein, a frequently preempted


transaction may fail to commit due to contending transactions that are more recent but are allowed to commit while the transaction is preempted, artificially inverting the intended behaviour of the system. Temporarily cancelling preemptions during the execution of a transaction solves the problems of long-transaction starvation and unpredictability stated above. If a transaction is guaranteed not to be preempted, then its success depends solely on the contention management policy. This solution can be compared with priority boosting [29]: raising the priority of a job to the highest level during a critical section effectively cancels preemptions. The Flexible Multiprocessor Locking Protocol (FMLP) [30] follows an approach similar to the one presented in this paper, executing critical sections non-preemptively, in their order of arrival. Unfortunately, FMLP can serialise critical sections with non-intersecting data sets that access objects in the same group; STM allows such transactions to proceed and commit in parallel. In this paper, we consider two non-preemptive approaches to schedule transactions:

• Non-preemptible until commit (NPUC), in which the job is assured to be scheduled from the moment the transaction starts until it successfully commits.
• Non-preemptible during attempt (NPDA), in which the job is non-preemptible during the transaction, but has preemption points between attempts. Any higher-priority job can be scheduled at any of these points.

5.1. Non-preemptible until commit: NPUC

Under NPUC, each transaction takes over the core on which it is executing, and will fail until all active direct contenders (transactions whose data accesses conflict with the accesses of the transaction in consideration) that arrived earlier have committed and finished. In Fig. 3, although job $\tau_2$ has an earlier deadline, it is only able to preempt $\tau_3$ after the transaction successfully commits.
NPUC is totally predictable, as the time required for a transaction to successfully commit depends solely on the transactions that are already executing when the transaction arrives. Since direct contenders (transactions that have at least one conflict with the write set) will also wait for their own earlier direct contenders to finish, contention is propagated in a chain. So, in the worst case, a transaction will have to wait for (m − 1) transactions to complete, assuming that every other core is already executing one transaction. The predictability given by NPUC comes at a cost: higher-priority tasks will have their responsiveness reduced due to lower-priority blocking.

5.2. Non-preemptible during attempt: NPDA

Under NPDA, preemptions during a transaction are limited to preemption points inserted between consecutive attempts, when the transaction has been aborted and prepares to retry. This policy assures that the success of each commit attempt depends only on the running transactions, and reduces lower-priority blocking compared with NPUC, improving the responsiveness of higher-priority tasks. Fig. 4 represents the same situation as Fig. 3, but scheduled under NPDA. In this case, job τ2 is able to preempt τ3 earlier, immediately after the transaction aborts. Since jobs can be preempted between transaction attempts, one core can hold more than one transaction in progress at any given time. This means that the number of earlier conflicting transactions for a given transaction is not limited to the number of remaining cores (m − 1), as in NPUC. In the depicted example, job τ1 in core π1 aborts its transaction twice because of the

Fig. 3. Transactions scheduled under NPUC.

transactions carried by jobs τ2 (first abort) and τ3 (second abort), which both execute in core π2. So, although NPDA improves the responsiveness of higher-priority jobs, it is less predictable than NPUC.

6. SRP-based scheduling of transactions: SRPTM

Although the non-preemptive approaches described in Section 5 enforce the intended behaviour of the contention management policy, both have their pros and cons.

- Non-preemptible until commit (NPUC) provides a fully predictable serialisation of transactions, derived from the fact that there can only be one transaction in progress per core. However, it has a negative impact on the responsiveness of urgent tasks.
- Non-preemptible during attempt (NPDA) improves the responsiveness of the more urgent tasks. However, it is a non-trivial exercise to determine a tight upper bound on the transaction overheads, because the system allows multiple simultaneous transactions in progress in the same core.

The approach described in this section tackles the drawbacks of NPUC and NPDA: (1) it provides a predictable serialisation of transactions and (2) it maintains the responsiveness of jobs with shorter absolute deadlines. To this end, at most one transaction per core is allowed. In addition, preemptions during the execution of any transaction are not disabled. Before going into the technical details behind our proposed approach, let us summarise the main idea. This research is based on the Stack Resource Protocol (SRP) [8]. Each task is assigned a preemption level by decreasing order of relative deadline, so that λ_i < λ_j if and only if D_i > D_j. Preemption levels are assigned globally, meaning that tasks are ordered by relative deadline independently of the core in which they execute. Additionally, each transaction is also assigned a preemption level, which can be higher than the preemption level of the task that carries the transaction.
The transaction preemption level expresses the highest preemption level of a task that can be affected by the outcome of that particular transaction. For example, suppose two transactions that are direct contenders (see Definition 2), one executed by a job with a short relative deadline and the other by a job with a longer relative deadline. The first transaction may have to wait for the second transaction to commit, so the preemption level of the second transaction should reflect the preemption level of the first job. Similarly to SRP, a job can be preempted by a released job (with an earlier absolute deadline) while executing a transaction only if


Fig. 4. Transactions scheduled under NPDA.

Fig. 5. Contention cascades.

the latter job has a higher preemption level than the running transaction. This way, the scheduler ensures that preempting a transaction will not delay a potential job with a shorter relative deadline executing in another core. Besides the interference of direct contenders, transactions can be delayed by parallel transactions with non-intersecting data sets, when serialised in FIFO order, as seen in Fig. 5. This example illustrates a transaction of job τ3 waiting for a transaction of job τ2 to commit, which in turn is waiting for the transaction of job τ1 to commit. The transactions of jobs τ1 and τ3 are indirect contenders (see Definition 3) in this sequence because, although their data sets do not intersect, x1 delays x3. Therefore, the preemption level set for a particular transaction must reflect the highest preemption level of all tasks that execute transactions that could have to wait on its success. These transactions are either direct or indirect contenders, that is, all transactions in the same contention group.

6.1. Assigning preemption levels to transactions

Main idea. We initially assign sequential preemption levels to tasks by decreasing order of their relative deadlines, so that each task τ_i has a task preemption level, denoted λ_i. Before going into further details, let us introduce a couple of key definitions.

Definition 9 (Ceiling of a STM object). Each transactional object o_k is assigned a ceiling, ceil(o_k), that is the maximum preemption level of all tasks that have a transaction that accesses (either reading or writing) that object. The object ceiling indicates how short the relative deadline of a job τ_i potentially waiting, through object o_k, for a transaction x_i to commit can be. Formally, this ceiling is defined as

ceil(o_k) ≝ max { λ_i | x_i uses o_k }    (5)

Definition 10 (Ceiling of a contention group). Each contention group X_k is assigned a ceiling that is the maximum preemption level of all tasks that belong to contention group X_k (see Definition 1). The ceiling of a contention group indicates how short the relative deadline of a job τ_i potentially waiting for a transaction x_i to commit can be, dependent on any other transaction (either a direct or an indirect contender) in the same contention group X_k as x_i. Formally, this ceiling is defined as

ceil(X_k) ≝ max { λ_i | x_i ∈ X_k }    (6)

As already mentioned, since transactions are serialised by order of arrival, a transaction x_i may have to wait for a sequence of direct and indirect contenders to commit before x_i eventually commits. The critical case occurs when x_i is at the end of the sequence and job τ_i has the shortest relative deadline in the sequence: the other jobs in the sequence may have greater relative deadlines and can introduce unbearable delays if preempted. Thus, each transaction x_i must be aware of the task in the same contention group with the shortest relative deadline that could possibly wait for x_i to finish. If this information is available, then the scheduler in the core in which x_i is executing knows the possible implications for the schedulability of jobs in other cores when deciding whether to preempt τ_i.

Definition 11 (Preemption level of a transaction). The preemption level of transaction x_i is given by the ceiling of the contention group to which x_i belongs, and is formally defined as:

λ_{x_i} = ceil(X_k),   x_i ∈ X_k    (7)
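As a concrete check of Definitions 9–11, the following sketch (our own illustration; variable names are not from the paper) recomputes the transaction preemption levels for the six-task example of Table 1. The per-task data sets are the ones implied by the example discussion: τ2 accesses only o1, τ3 links o1 and o2, τ4 accesses o2, τ1 and τ5 share o3, and τ6 carries no transaction.

```python
# Relative deadlines and data sets of the Table 1 example.
D = {1: 50, 2: 100, 3: 80, 4: 70, 5: 120, 6: 60}
ds = {1: {"o3"}, 2: {"o1"}, 3: {"o1", "o2"}, 4: {"o2"}, 5: {"o3"}, 6: set()}

# Task preemption levels: sequential, by decreasing relative deadline.
lam = {t: r + 1 for r, t in enumerate(sorted(D, key=lambda i: -D[i]))}

# Contention groups: transitive closure of data-set intersection, via a
# union-find over objects (objects used by one transaction share a group).
parent = {o: o for objs in ds.values() for o in objs}
def find(o):
    while parent[o] != o:
        parent[o] = parent[parent[o]]      # path halving
        o = parent[o]
    return o
for objs in ds.values():
    objs = sorted(objs)
    for o in objs[1:]:
        parent[find(o)] = find(objs[0])

def group(t):
    return find(sorted(ds[t])[0])

# Eq. (7): transaction preemption level = ceiling of its contention group;
# tasks without a transaction get level zero.
lam_x = {t: (max(lam[u] for u in ds if ds[u] and group(u) == group(t))
             if ds[t] else 0)
         for t in ds}
```

Running this reproduces the table: task levels λ = (6, 2, 3, 4, 1, 5) and transaction levels λ_x = (6, 4, 4, 4, 6, 0), with the level of τ4 propagated to x2 through the group {o1, o2}.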

For practical implementation purposes, the transaction preemption level of a task that does not carry a transaction is set to zero.

Example. Table 1 illustrates the assignment of preemption levels to tasks and transactions as described, for the simple task set composed of six tasks represented in Fig. 1. For each task we present the relative deadline (column 2) and the corresponding preemption level (column 3). The ceiling of each transactional object o_i (i = 1, 2, 3), in the bottom line of Table 1, is set to the maximum preemption level of all tasks that access it. The preemption level of each transaction (column 7) is the highest preemption level found in the group to which the transaction belongs. In this case, transaction x2 accesses no more than object o1, whose ceiling is 3, but has a preemption level of 4 because x2 has possibly dependent transactions that access o2, whose ceiling is 4. Thus, the preemption level of task τ4 is propagated to transaction x2, even though x2 and x4 have non-intersecting data sets. Observe that task τ6 does not execute a transaction, so its transaction preemption level is zero.

Table 1
Example of calculation of preemption levels.

Task        D_i   λ_i   o1   o2   o3   λ_{x_i}
τ1          50    6     –    –    •    6
τ2          100   2     •    –    –    4
τ3          80    3     •    •    –    4
τ4          70    4     –    •    –    4
τ5          120   1     –    –    •    6
τ6          60    5     –    –    –    0
ceil(o_i)               3    4    6

6.2. Scheduling policy

Our scheduling policy is based on P-EDF; we only introduce additional rules when a job is executing a transaction.

Rule 1 (Core ceiling). We associate with each core π_k a non-negative core ceiling Λ_k that reflects the urgency of completing the transaction in progress, expressed by the preemption level of that transaction. When there is no transaction in progress, the core ceiling is zero. Therefore, any released job τ_{j,b} is able to preempt an already-running job τ_{i,a} on core π_k as long as it has an earlier absolute deadline (d_{j,b} < d_{i,a}) and the running job is not executing a transaction (Λ_k = 0).

Rule 2 (Core ceiling reset). When a job τ_{i,a} starts executing a transaction, it sets the core ceiling to its transaction preemption level (Λ_k ← λ_{x_i}). Conversely, when the transaction successfully finishes, the job resets the core ceiling to zero (Λ_k ← 0), returning the scheduler to conventional P-EDF operation. So, if job τ_{i,a} is executing a transaction, the scheduler must also ensure that the preemption level of task τ_j is higher than the current ceiling of that core (λ_j > Λ_k). This rule holds back τ_{j,b} from delaying the transaction in progress, and from propagating this delay to parallel dependent transactions, some of which could be part of more urgent jobs than τ_{j,b}.

Rule 3 (Unique transaction in progress per core). At any instant, each core is allowed only one transaction in progress. A released job τ_{j,b} containing a transaction x_j will not start to execute until the transaction in progress in the core has committed, even if τ_{j,b} has an earlier absolute deadline than the running job and a higher preemption level than the core ceiling at the moment of release r_{j,b}. The impossibility of having multiple transactions progressing in the same core provides the immediate result of avoiding deadlocks due to circular dependency chains. Thus, the condition of FIFO-CRT under which a transaction in a preempted job can be killed by a more recent contender (see Algorithm 1: line 7) is no longer necessary, and is dropped. Consequently, a transaction in SRPTM is protected from any newer transaction, even if its job is preempted. Rule 3 improves the predictability of the progress of transactions: each transaction will have to wait, at most, for (m − 1) simultaneous transactions in progress before it can commit. Additionally, it also reduces preemption overhead because, as in SRP, τ_{j,b} is blocked at most once, at release time.

Rule 4 (Priority inversion prevention). When a released job τ_{j,b} is prevented from running because of a transaction x_i in progress in the same core π_k, the core ceiling is raised to the preemption level of τ_j, and the scheduler ensures that the status of job τ_{i,a} is set to running in place of τ_{j,b}. In the case covered by Rule 4, job τ_{j,b} becomes blocked but ensures that the job executing the transaction in progress has the means to be scheduled and commit x_i as soon as possible. Raising the core ceiling thus eliminates the possibility of priority inversion, as it prevents τ_{i,a} from being preempted, before finishing x_i, by jobs with lower preemption levels and no transactions.
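Rules 1–4 amount to a small decision procedure on each job release. The following hedged sketch is our own reading of those rules with illustrative field names; it is not the paper's Algorithm 2.

```python
from dataclasses import dataclass

@dataclass
class Job:
    abs_deadline: int
    level: int                   # task preemption level (lambda)
    has_transaction: bool = False

def on_release(rel, run, ceiling):
    """Decide what happens when job `rel` is released while `run` executes
    on a core whose current ceiling is `ceiling` (0 = no transaction).
    Returns (job to run, new core ceiling)."""
    if rel.abs_deadline >= run.abs_deadline:
        return run, ceiling                     # plain P-EDF: rel is not more urgent
    if ceiling == 0:
        return rel, 0                           # Rule 1: no transaction in progress
    if rel.has_transaction:                     # Rule 3: one transaction per core,
        return run, max(ceiling, rel.level)     # Rule 4: raise ceiling, keep running
    if rel.level > ceiling:
        return rel, ceiling                     # Rule 2: urgent enough to preempt
    return run, max(ceiling, rel.level)         # blocked; Rule 4 raises the ceiling
```

For example, with a transaction of preemption level 4 in progress (ceiling 4), an earlier-deadline job with λ = 6 and no transaction preempts, while an equally urgent job carrying a transaction is blocked and raises the ceiling to 6.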

Algorithm 2 details the decision tree that the scheduler follows when a job of τ_j is released while a job of τ_i is currently running in the core. For the sake of brevity, we redirect the interested reader to the technical report [31], where the analytical results of the scheduling schemes are presented.

7. Simulation results and discussion

This section reports on the simulation results obtained for each scheduling approach: NPUC, NPDA and SRPTM. These results are benchmarked against P-EDF and FMLP. Fig. 6 illustrates the flowchart of the simulation process. Specifically, a profile with the target system characteristics allows us to generate multiple task sets using the task set generator. For each task set, the simulator executes all the scheduling approaches together with FIFO-CRT, except for FMLP, which is executed together with P-EDF. The executions produce the result files, which are then compared with each other. The source code and these materials are available at https://zenodo.org/record/17302.

7.1. Simulation setup

Platform. We developed a simulation environment to test the proposed contention management algorithm under different scheduling policies – pure P-EDF, P-EDF with NPUC, P-EDF with NPDA and P-EDF with SRPTM – in systems containing from 2 to 64 cores. Additionally, we implemented the FMLP rules over P-EDF, in order to compare our approaches with a state-of-the-art lock-based synchronisation mechanism. We chose FMLP because of its similarities with NPUC in the way atomic sections are scheduled (non-preemptively) and serialised (by order of arrival). However, we are aware of the differences in the application scenarios between FMLP and STM. FMLP performs better in systems in which nested critical sections (i.e. atomic sections accessing multiple shared objects) are rare.
The performance of FMLP degrades when the number of critical sections and resources aggregated in contention groups increases, because it decreases the granularity of locks; however, heavy contention also has a negative impact on STM. FMLP suspends jobs when they wait on a long resource (a resource that is accessed by a critical section that takes a long


execution time to perform), and busy-waits on short resources. In our simulation we implemented busy waiting only, as transactions are expected to be short; this option has a negative impact on FMLP when a task has a long atomic section, but, again, long atomic sections also have a negative impact on STM. So, we include FMLP to compare how the various synchronisation and scheduling strategies deal with more generic data-sharing scenarios.

Generation of task sets. For our simulations, we defined 12 task set profiles that share some common characteristics. The task sets are synchronous (all tasks release their first job at time t = 0) and, for the sake of simplicity and clarity in the discussion, we assume implicit deadlines for all tasks (i.e. D_i = T_i for task τ_i). We generated task sets such that each of them would ideally use 75% of the system capacity (a system utilisation of U_S = 0.75 · m), neglecting the execution time overhead related to atomic sections (aborts and retries with STM, and busy waiting with lock-based mechanisms). The individual utilisation of a task (neglecting possible atomic section overheads) is randomly selected from the interval (0.0, 0.3], with uniform distribution. In order to keep the hyper-period reasonably small, the period of each task is determined as the product of three values, T_i = a_i · b_i · c_i, in which a_i is randomly chosen from {2, 4, 8, 16}, b_i from {3, 6, 9, 12} and c_i from {5, 10, 15}, as recommended by Nellis et al. [32]. The WCET of the task is directly given by C_i ≝ max(U_i · T_i, 1); if the task has an atomic section, then the minimum WCET is 2, as one time unit is required to flag the beginning of the atomic section and another time unit to flag the end. Tasks were allocated to cores according to the worst-fit algorithm, by decreasing order of utilisation ratio. We randomly selected approximately 75% of the tasks to have one atomic section, and approximately half of the atomic sections modify some element in their data sets. The size of the data set of an atomic section ranges from 1 to 3, with discrete uniform distribution. For each profile, we calculated the number of transactional objects/resources required so that the expected number of tasks accessing each object o_i is E[|o|] = 3. The average number of tasks accessing an object is given by

|o| ≝ (Σ_{i=1}^{n} |DS_i|) / p,    (8)

where p represents the number of objects. Hence p can be calculated as the expected sum of the sizes of the data sets divided by the expected number of accesses per object:

p = (U_S · P(τ_i has x_i) · E[|DS|]) / (E[U_i] · E[|o|])    (9)

We observed that purely randomly generated transaction data sets ended up creating one single global contention group (or a single global group lock, in FMLP terms). To counter this effect, we artificially created contention groups by organising the objects into g subsets: (g − 1) subsets of size |X_i| ≝ ⌊p/g⌋ each, and a last group of size |X_last| = p − ⌊p/g⌋ · (g − 1). Each transaction randomly joins a contention group, and its data set is confined to the objects of this contention group. Thus, a transaction is prevented from competing with transactions of other groups. We did not balance the allocation of transactions to groups, so that contention would not be artificially softened, as such balance may not occur in real applications. Consequently, some groups may have more tasks than others. We defined two groups of profiles in order to observe the behaviour of the synchronisation mechanisms and scheduling strategies for varying system sizes and varying granularity of contention.
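The generation procedure above can be sketched as follows. The constants (75% system load, 75% of tasks with a transaction, E[|DS|] = 2, E[U_i] = 0.15, E[|o|] = 3) are the profile values from the text; the function names are our own.

```python
import random

def make_task(rng):
    """One synthetic task: utilisation in (0.0, 0.3], period T = a*b*c."""
    u = rng.uniform(0.0, 0.3)
    t = (rng.choice([2, 4, 8, 16]) * rng.choice([3, 6, 9, 12])
         * rng.choice([5, 10, 15]))
    c = max(int(u * t), 1)                        # WCET, at least one time unit
    return c, t

def num_objects(m, p_tx=0.75, e_ds=2.0, e_u=0.15, e_o=3.0):
    """Eq. (9): objects needed so each is accessed by E[|o|] tasks on average."""
    u_s = 0.75 * m                                # target system utilisation
    return round(u_s * p_tx * e_ds / (e_u * e_o))

def worst_fit(tasks, m):
    """Worst-fit allocation of (C, T) tasks by decreasing utilisation ratio."""
    load = [0.0] * m
    alloc = [[] for _ in range(m)]
    for c, t in sorted(tasks, key=lambda ct: -(ct[0] / ct[1])):
        k = min(range(m), key=lambda i: load[i])  # least-loaded core first
        load[k] += c / t
        alloc[k].append((c, t))
    return alloc
```

Note that `num_objects` reproduces the profile values used in Section 7.2, e.g. 5 objects for m = 2 and 160 for m = 64.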


- In the first group of profiles, we vary the number of cores in the range m ∈ {2, 4, 8, 16, 32, 64}. The number of shared objects varies proportionally with the number of cores, in the range p ∈ {5, 10, 20, 40, 80, 160}, so the expected number of tasks accessing each object remains constant and equal to 3. Additionally, we vary the number of contention groups proportionally with the number of cores, in the range g ∈ {1, 2, 4, 8, 16, 32}, so that contention groups maintain approximately the same size (number of objects) and expected number of tasks. This group of profiles allows us to observe the effects of the size of the system (number of cores and tasks) on the scheduling of tasks, for equivalent levels of contention.
- In the second group of profiles, the number of cores and the number of shared objects were kept constant: m = 64 and p = 160, respectively. However, the number of contention groups varies in the range g ∈ {1, 2, 4, 8, 16, 32, 64}. This group of profiles allows us to observe the effects of the granularity of contention, for an equivalent system size (cores, tasks and shared objects).

For each profile we generated 50 random synchronous task sets. Each task set was tested for a time equivalent to its hyper-period, which is sufficient to verify the feasibility of the schedule (no deadline is missed and none will ever be missed). Every released job was executed until it finished, even if a deadline was missed. Deadline misses were only detected at the end of the execution of a job, so deadline-miss propagation is allowed under heavy utilisation due to large atomic section overheads. Under heavy utilisation, a backlog can pass to the following hyper-period, so jobs not released or not finished before the end of the hyper-period were accounted as deadline misses. The simulation environment was implemented in the ISO C++ programming language and compiled with Clang version 3.5 for the target system Mac OS X 10.9.5.
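Simulating up to the hyper-period relies on computing it as the least common multiple of the task periods. A minimal helper (assuming Python 3.9+ for `math.lcm`):

```python
from math import lcm

def hyperperiod(periods):
    """Least common multiple of all task periods: the simulation horizon."""
    h = 1
    for t in periods:
        h = lcm(h, t)
    return h
```

With the period factors above, e.g. periods 30 (= 2·3·5), 60 and 45 give a hyper-period of 180, and any generated period divides lcm(16, 12, 15) scaled products such as 2880 (= 16·12·15).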
The simulations were carried out on an iMac with a 3.06 GHz Intel Core 2 Duo processor.

7.2. Results of simulations

In this section, we report on the simulation results. In each simulation, we recorded for each task the number of jobs with missed deadlines (including jobs that were expected to finish inside the hyper-period). We also recorded the time overhead of atomic sections (due to aborts or busy waits) performed inside completed jobs, and the total system capacity used by the task set during the simulation time.

Experiment 1 (Varying system size). The goal of this experiment is to compare how well the different scheduling approaches (P-EDF, NPUC, NPDA, SRPTM and FMLP) are able to feasibly schedule the task sets in group 1. Fig. 7 presents the feasibility rates for group 1 and shows that P-EDF has higher success rates than the other scheduling strategies. Inspecting individual task sets, we observed that many tasks missed deadlines because their periods were smaller than, or only slightly greater than, the execution time of transactions of tasks executing in the same core. Non-preemptive approaches were not able to schedule jobs of these tasks while a transaction was executing and the deadline lay inside the transaction execution time, in which case the result was inexorably a deadline miss. The fully preemptive behaviour of P-EDF makes it possible to schedule these short-period tasks independently of transactions in progress, and SRPTM can also schedule some of these short-period tasks (those that do not have transactions) while a transaction is in progress, hence its superior results relative to the non-preemptive approaches.


Fig. 6. Simulation flowchart.

Fig. 7. Feasibility rates: varying number of cores.

Fig. 8. Feasibility rates: varying number of cores with short atomic sections.

As the system size increases, so does the probability of longer serialisation sequences of transactions; thus the P-EDF feasibility rate also drops to zero when transactions may have to wait longer than the available time to commit. We analysed the behaviour of the same task sets but with the duration of atomic sections reduced: for each atomic section with an original execution time longer than 20 time units, we generated a random value with discrete uniform distribution and decided, with a 95% chance, to reduce its execution time to 20 time units. We chose the value of 20 time units because the smallest possible period in a task set is 30 time units (2 · 3 · 5 = 30), and the maximum execution time for such a task is 9 time units, so there is a slack of 21 time units in which to wait for a concurrent job to execute one transaction attempt. As expected, the feasibility rates increased, as seen in Fig. 8. We can observe that the approaches that do not allow preemption points until the atomic section finishes (NPUC and FMLP) find it harder to schedule the task sets. We can also observe that SRPTM and NPDA have similar feasibility rates and, in smaller systems, behaved better than P-EDF. With short transactions, short-period tasks have more opportunities

to have their jobs scheduled under NPDA, hence its results getting closer to SRPTM. We analysed the overall number of deadline misses, in order to observe how badly each approach failed to schedule the task sets. Fig. 9 represents the total number of deadline misses (including jobs not released or not completed) for each group, with long atomic sections (Fig. 9(a)) and with atomic sections with limited execution time (Fig. 9(b)). We see that SRPTM tends to miss fewer deadlines, even if it does not present the highest feasibility rate. The maximum number of aborts that a transaction suffers poses the highest stress on the effort to meet deadlines, both for the job that is executing the atomic section and for the jobs that are waiting for the atomic section to complete. These can be ready jobs in the same core or jobs with contending transactions on other cores. Fig. 10 presents the maximum number of aborts per job for each task set profile. We can see that NPUC and NPDA present, on average, lower maximum numbers of aborts. Under NPUC, transactions are not exposed to delays caused by preempted contenders in other cores, so it tends to present the lowest values. However, the delays caused by preempted contenders have a negative impact on SRPTM,


Fig. 9. Total deadlines missed (50 simulations).

Fig. 10. Maximum number of aborts per job (average).

Fig. 11. Time overhead per atomic section (average).

increasing the maximum number of aborts experienced by the waiting transactions. In P-EDF, it is the possibility of aborting an older transaction that is preempted (in order to avoid deadlocks) that affects this value. Although the maximum number of aborts experienced per job conditions the schedulability of the system, we also analysed the

overall average performance of atomic sections. We measured the execution time overhead due to aborts in STM, and busy waiting in FMLP, for jobs that were effectively released and completed execution. Fig. 11 illustrates the extra time required to execute, on average, one job that has an atomic section: Fig. 11(a) presents the results for long atomic sections and Fig. 11(b) for limited-length atomic sections. We see that SRPTM presents, on average, smaller overheads than the other approaches. This is explained by the possibility of a transaction being preempted and yielding the processor to other tasks while it waits for its contenders on other cores to commit, which reduces the number of aborts and increases the amount of work performed on the core. Unlike in P-EDF, the transaction is not killed by newer contenders when the job is preempted. The effect of this protection of the last transaction attempt in SRPTM is visible in the gap that separates it from P-EDF in this chart.

Non-preemptive approaches produce better results for smaller systems, since they tend to force transactions to commit as soon as possible. However, as systems grow, the possibility of longer serialisation sequences of transactions also rises, which is reflected in the increased number of aborts before commit. FMLP presents higher overheads because it does not profit from intra-group contention granularity, as transactions do (non-intersecting data sets inside the same contention group).

Fig. 12. Feasibility rates: varying number of cores with short atomic sections.

Experiment 2 (Varying granularity of contention). In the second batch of simulations, we tested P-EDF, NPUC, NPDA, SRPTM and FMLP in systems of similar size (similar numbers of cores, tasks, atomic sections and shared objects), but with varying granularity of contention. For 64 cores and long transactions, no task set could be feasibly scheduled for any number of groups. For exactly the same task sets but with transactions with limited execution times, some of the task sets could be scheduled under P-EDF, NPDA and SRPTM, as seen in Fig. 12. Again, we see that fully preemptive P-EDF has the highest ability to schedule task sets. Now, let us turn our attention again to the number of missed deadlines and of unreleased or unfinished jobs, in order to observe how badly the different scheduling strategies failed to schedule the task sets. Fig. 13(a) presents the sum of missed deadlines over all 50 simulations for each profile with long transactions, and Fig. 13(b) for the same task sets with limited-execution-time transactions. In both cases, SRPTM always presents the lowest values for

Fig. 13. Total deadlines missed (50 simulations).

Fig. 14. Maximum number of aborts per job (average).


Fig. 15. Time overhead per atomic section (average).

all profiles. We can also observe that the STM-based mechanisms present a slight increase in missed deadlines as the number of contention groups becomes smaller. The averages of the maximum number of aborts per job are represented in Fig. 14, where we can observe that the non-preemptive approaches suffer fewer aborts than P-EDF and SRPTM. The difference is more pronounced when transactions are long (Fig. 14(a)), and reduced when transactions are short (Fig. 14(b)). Although the non-preemptive approaches tend to perform better in the worst cases, SRPTM tends to present lower overheads in the long run, as seen in Fig. 15. In Fig. 15(a) we observe that the behaviour of FMLP is heavily penalised when the granularity of contention is coarse, and becomes drastically distant from the behaviour of the STM-based approaches. Fig. 15(b) presents the same chart, excluding FMLP; in this figure we observe that NPUC and NPDA require more execution time to execute their jobs with transactions.

8. Conclusions and further work

In this paper we proposed a decentralised contention management algorithm for Software Transactional Memory (STM) in multi-core real-time systems, in which conflicting transactions are serialised by their chronological order of arrival. This algorithm is fair and avoids starvation across transactions. However, preempting a transaction reduces its probability of committing successfully, so we proposed two approaches that limit preemptions: non-preemptible until commit (NPUC) and non-preemptible during attempt (NPDA). NPUC is more predictable, while NPDA improves the responsiveness of more urgent tasks. Both non-preemptive approaches have drawbacks, related either to the negative impact on the responsiveness of urgent tasks (NPUC) or to ensuring a computationally feasible response-time analysis (NPDA).
We provided an alternative scheduling strategy, referred to as SRPTM, which is based on SRP, does not require preemptions to be disabled, and addresses the issues of both non-preemptive approaches. Simulation results show that the non-preemptive approaches effectively reduce the maximum number of aborts, but SRPTM achieves higher feasibility rates and lower transaction overhead. However, judicious processor allocation is required for tasks that have small laxity to accommodate transaction retries.

Acknowledgement

This work was partially supported by National Funds through FCT (Portuguese Foundation for Science and Technology) and by

ERDF (European Regional Development Fund) through COMPETE (Operational Programme ‘Thematic Factors of Competitiveness’), within projects FCOMP-01–0124-FEDER-015006 (VIPCORE) and FCOMP-01–0124-FEDER-037281 (CISTER); by FCT and EU ARTEMIS JU, within project ARTEMIS/0003/2012, JU grant nr. 333053 (CONCERTO).

References

[1] Kalray, MPPA 256 – Many-core processors, 2012.
[2] Tilera, Tilera TILEPro64 processor, product brief, 2012.
[3] B. Choi, R. Komuravelli, H. Sung, R. Smolinski, N. Honarmand, S. Adve, V. Adve, N. Carter, C.-T. Chou, DeNovo: rethinking the memory hierarchy for disciplined parallelism, in: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), 2011, pp. 155–166.
[4] M.M.K. Martin, M.D. Hill, D.J. Sorin, Why on-chip cache coherence is here to stay, Commun. ACM 55 (7) (2012) 78–89.
[5] P. Tsigas, Y. Zhang, Non-blocking data sharing in multiprocessor real-time systems, in: Proceedings of the 6th IEEE International Conference on Real-Time Computing Systems and Applications (RTCSA), 1999, pp. 247–254.
[6] B.B. Brandenburg, J.M. Calandrino, A. Block, H. Leontyev, J.H. Anderson, Real-time synchronization on multiprocessors: to block or not to block, to suspend or spin?, in: Proceedings of the 14th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), 2008, pp. 342–353.
[7] N. Shavit, D. Touitou, Software transactional memory, in: Proceedings of the 14th Annual ACM Symposium on Principles of Distributed Computing (PODC), 1995, pp. 204–213.
[8] T.P. Baker, Stack-based scheduling of realtime processes, Real-Time Syst. 3 (1) (1991) 67–99.
[9] A. Dragojević, P. Felber, V. Gramoli, R. Guerraoui, Why STM can be more than a research toy, Commun. ACM 54 (4) (2011) 70–77.
[10] C.J. Rossbach, O.S. Hofmann, E. Witchel, Is transactional programming actually easier?, in: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2010, pp. 47–56.
[11] W. Maldonado, P. Marlier, P. Felber, A. Suissa, D. Hendler, A. Fedorova, J.L. Lawall, G. Muller, Scheduling support for transactional memory contention management, in: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2010, pp. 79–90.
[12] J. Manson, J. Baker, A. Cunei, S. Jagannathan, M. Prochazka, B. Xin, J. Vitek, Preemptible atomic regions for real-time Java, in: Proceedings of the 26th IEEE International Real-Time Systems Symposium (RTSS), USA, 2005, pp. 62–71.
[13] J. Ras, A.M.K. Cheng, Response time analysis for the abort-and-restart event handlers of the priority-based functional reactive programming (P-FRP) paradigm, in: Proceedings of the 15th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), 2009, pp. 305–314.
[14] J.H. Anderson, S. Ramamurthy, M. Moir, K. Jeffay, Lock-free transactions for real-time systems, in: Real-Time Database Systems: Issues and Applications, Kluwer Academic Publishers, USA, 1997, pp. 215–234.
[15] C.L. Liu, J.W. Layland, Scheduling algorithms for multiprogramming in a hard-real-time environment, J. ACM 20 (1) (1973) 46–61.
[16] J.Y.-T. Leung, J. Whitehead, On the complexity of fixed-priority scheduling of periodic, real-time tasks, Perform. Eval. 2 (4) (1982) 237–250.
[17] J.H. Anderson, R. Jain, S. Ramamurthy, Implementing hard real-time transactions on multiprocessors, in: Real-Time Database and Information Systems: Research Advances, Kluwer Academic Publishers, USA, 1997, pp. 247–260.

566

A. Barros et al. / Journal of Systems Architecture 61 (2015) 553–566

[18] S.F. Fahmy, B. Ravindran, E.D. Jensen, On bounding response times under software transactional memory in distributed multiprocessor real-time systems, in: Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), 2009, pp. 688–693. [19] T. Sarni, A. Queudet, P. Valduriez, Real-time support for software transactional memory, in: Proceedings of the 15th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), 2009, pp. 477–485. [20] A. Barros, L. M. Pinho, Software transactional memory as a building block for parallel embedded real-time systems, in: Proceedings of the 37th EUROMICRO Conference on Software Engineering and Advanced Applications (SEAA), Finland, 2011. [21] S. Cotard, Contribuition á la robustesse des systémes temps réel embarqués multicoeur automobile (Ph.D. thesis), Université de Nantes, 2013. [22] M. El-Shambakey, B. Ravindran, STM concurrency control for multicore embedded real-time software: time bounds and tradeoffs, in: Proceedings of the 27th Annual ACM Symposium on Applied Computing (SAC), ACM, Italy, 2012, pp. 1602–1609. [23] U. C. Devi, H. Leontyev, J. H. Anderson, Efficient Synchronization under Global EDF Scheduling on Multiprocessors, in: Proceedings of the 18th Euromicro Conference on Real-Time Systems (ECRTS), Germany, 2006, pp. 75–84. [24] M. El-Shambakey, B. Ravindran, STM concurrency control for embedded realtime software with tighter time bounds, in: Proceedings of the 49th Annual Design Automation Conference (DAC), ACM, USA, 2012, pp. 437–446. [25] M. El-Shambakey, B. Ravindran, On real-time STM concurrency control for embedded software with improved schedulability, in: Proceedings of the 18th Asia and South Pacific Design Automation Conference (ASP-DAC), Japan, 2013, pp. 47–52. [26] M. El-Shambakey, B. 
Ravindran, FBLT: a real-time contention manager with improved schedulability, in: Proceedings of the Conference on Design, Automation and Test in Europe (DATE), EDA Consortium, France, 2013, pp. 1325–1330. [27] A. Bastoni, B. B. Brandenburg, J.H. Anderson, An empirical comparison of global, partitioned, and clustered multiprocessor EDF schedulers, in: Proceedings of the 31st IEEE Real-Time Systems Symposium (RTSS), USA, 2010, pp. 14–24. [28] L. Cucu-Grosjean, J. Goossens, Exact schedulability tests for real-time scheduling of periodic tasks on unrelated multiprocessor platforms, J. Syst. Archit. 57 (5) (2011) 561–569. [29] R. Rajkumar, Real-time synchronization protocols for shared memory multiprocessors, in: Proceedings of the 10th International Conference on Distributed Computing Systems, 1990, pp. 116–123. [30] A. Block, H. Leontyev, B. B. Brandenburg, J. H. Anderson, A flexible real-time locking protocol for multiprocessors, in: Proceedings of the 13th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), South Korea, 2007, pp. 47–56. [31] A. Barros, P. M. Yomsi, L. M. Pinho, Response time analysis for hard real-time tasks with STM transactions on multi-core platforms, Tech. rep., CISTER – Research Centre in Real-time and Embedded Computing Systems, Portugal, 2015. [32] V. Nelis, P.M. Yomsi, J. Goossens, Feasibility intervals for homogeneous multicores, asynchronous periodic tasks, and FJP schedulers, in: Proceedings of the 21st International Conference on Real-Time Networks and Systems (RTNS), ACM, France, 2013, pp. 277–286.

António Manuel de Sousa Barros holds a five-year degree (1997) and an MSc (2009) in Electrical and Computer Engineering from the University of Porto. Since 2001 he has been a teaching assistant at the Department of Computer Engineering – School of Engineering of the Polytechnic Institute of Porto. He has been a member of the CISTER research centre since 2005 and a Ph.D. candidate since 2009. Before joining CISTER, António worked as a freelancer in the fields of electronics and computer programming from 1997 to 1998. From 1998 to 2001, he was a researcher at the Biomedical Engineering Institute (University of Porto). His main research interests include real-time communication, real-time scheduling and reliable software.

Luís Miguel Pinho holds an MSc (1997) and a Ph.D. (2001) in Electrical and Computer Engineering from the University of Porto. He is Coordinator Professor at the School of Engineering of the Polytechnic Institute of Porto, and Vice-Director and Research Associate at the CISTER (Research Centre in Real-Time and Embedded Computing Systems) research unit, where he currently leads activities in, among others, real-time programming models, scheduling of real-time parallel tasks and real-time middleware. He has participated in more than 15 R&D projects, serving as Coordinator of the FP7 European R&D project P-SOCRATES and of several national projects, and as CISTER coordinator and work-package leader in several European and national projects. He has published more than 100 papers in international venues and journals in the area. He was a Senior Researcher of the ArtistDesign Network of Excellence (NoE) and is a member of the HiPEAC NoE. He was Keynote Speaker at RTCSA 2010, Conference Chair and Program Co-Chair of Ada-Europe 2006, Program Co-Chair of Ada-Europe 2012, and General Co-Chair of ARCS 2015. He is Editor-in-Chief of the Ada User Journal.

Patrick Meumeu Yomsi is an expert in the field of real-time scheduling theory and applications. His research blends software engineering principles with detailed domain knowledge to devise new models and accompanying analyses that provide developers with early evaluations of safety-critical systems. Patrick received his Ph.D. in 2009 from Paris-Sud University, Orsay, France. Prior to joining the CISTER Research Centre in Porto, Portugal, as a Research Associate in 2012, he was successively a member of the AOSTE research unit at the French National Institute for Research in Computer Science and Control (INRIA) in Paris-Rocquencourt, France, of the PARTS research unit at the University of Brussels (ULB) in Brussels, Belgium, and finally of the TRIO research unit at INRIA in Nancy, France. His research interests include real-time scheduling theory, real-time communication and real-time operating systems. His work has been recognised on several occasions, with Best Paper Awards in 2009 and 2010 and a Best Presentation Award in 2013.