
Highlights

• We propose adaptive thread mapping strategies based on single metrics.

• We propose a new strategy based on association rule learning.

• We implement all the proposed adaptive strategies in a TM system.

• We achieved performance improvements of up to 64.4% on a set of synthetic applications.

• We achieved performance improvements of up to 16.5% on the STAMP benchmark suite.

Adaptive Thread Mapping Strategies for Transactional Memory Applications

Márcio Castro (a), Luís Fabrício W. Góes (b), Jean-François Méhaut (c)

a Federal University of Rio Grande do Sul, Institute of Informatics, Av. Bento Gonçalves, 9500 - Campus do Vale - 91501-970 - Porto Alegre - Brazil
b PUC-Minas, Computer Science Department, Avenida Dom José Gaspar, 500 - 30535-610 - Belo Horizonte - Brazil
c University of Grenoble, CEA-DRT - LIG Laboratory, ZIRST 51 Avenue Jean Kuntzmann - 38330 - Montbonnot - France

Abstract

Transactional Memory (TM) is a programmer-friendly alternative to traditional lock-based concurrency. Although it intends to simplify concurrent programming, the performance of the applications still relies on how frequently they synchronize and the way they access shared data. These aspects must be taken into consideration if one intends to exploit the full potential of modern multicore platforms. Since these platforms feature complex memory hierarchies composed of different levels of cache, applications may suffer from memory latencies and bandwidth problems if threads are not properly placed on cores. An interesting approach to efficiently exploit the memory hierarchy is called thread mapping. However, a single fixed thread mapping cannot deliver the best performance when dealing with a large range of transactional workloads, TM systems and platforms. In this article, we propose and implement in a TM system a set of adaptive thread mapping strategies for TM applications to tackle this problem. They range from simple strategies that do not require any prior knowledge to strategies based on Machine Learning techniques. Taking the Linux default strategy as baseline, we achieved performance improvements of up to 64.4% on a set of synthetic applications and an overall performance improvement of up to 16.5% on the standard STAMP benchmark suite.

Keywords: Transactional memory, thread mapping, adaptivity, multicore.

1. Introduction

There was a 30-year period in which the advances in semiconductor technology and computer architectures improved the performance of a single processor at a high annual rate of 40% to 50% [1]. However, issues such as dissipating heat from increasingly densely packed transistors began to limit the rate at which processor frequencies could be increased. This was one of the main reasons why most semiconductor manufacturers are now investing in multicore processors [2]. Consequently, applications must now evolve to efficiently exploit the potential of multicore platforms. Sequential applications thus need to be split into pieces (e.g., tasks) that can be executed in parallel by threads, each one running on a processor/core [3].

The side effect is that the application data, which were accessed by a single thread in a sequential application, are now shared among several concurrent threads. Thus, it is necessary to use synchronization mechanisms to coordinate concurrent accesses to these shared data. Traditional synchronization mechanisms such as locks, mutexes and semaphores have proven to be error-prone [4], largely due to well-known problems such as deadlocks and livelocks [5], and are difficult to manage in large scale systems. Due to those issues, researchers have been looking for alternative mechanisms. One such mechanism that has been the subject of intense research in recent years is Transactional Memory (TM) [6, 7].

The TM programming model allows programmers to write parallel portions of the code as transactions, which are guaranteed to execute atomically and in isolation regardless of eventual data races. At runtime, transactions are executed speculatively and the TM runtime system continuously keeps track of concurrent accesses and detects conflicts. Conflicts are then solved by re-executing conflicting transactions. This model removes from the programmer the burden of correct thread synchronization and provides a straightforward way of extracting parallelism from applications.

Although the TM programming model simplifies concurrent programming, the performance of TM applications on multicores still relies on how frequently they synchronize, the amount of contention (conflicts between transactions) and the way transactions access shared data in memory (memory access pattern) [8]. In order to alleviate the cost of accessing the main memory, multicore processors usually feature complex memory hierarchies composed of different levels of cache (private and shared). As a drawback, this can potentially increase memory access latency and degrade bandwidth usage if threads are not properly placed on cores. An appealing approach to efficiently exploit the memory hierarchy and alleviate these drawbacks is called thread mapping [9], which places threads on specific cores according to a predetermined strategy.

However, the efficiency obtained from a thread mapping strategy relies upon matching the behavior of the application with the underlying system and platform characteristics. This issue becomes much more complex in TM for two reasons: (i) the TM model uses speculation, hence TM applications present an irregular behavior (data dependencies between threads are only known at runtime); and (ii) each TM system implements its own mechanisms to detect and solve conflicts, so the same TM application can behave differently when the underlying TM system is changed [10, 11]. Due to the aforementioned issues, a single fixed thread mapping (which does not adapt itself to the current workload) cannot deliver the best performance in all cases.
For instance, in some workloads it would be better to place threads on cores as close as possible in the cache hierarchy to increase cache sharing, while for others it would be better to distribute threads among different processors to reduce memory contention. Because of that, adaptivity becomes a key feature to increase performance for a wide range of different workload characteristics and platforms.

Adaptivity has been studied in different contexts as a means of: performing dynamic load balancing on MPI [12]; generating and selecting a specific multithreaded version for a given loop at runtime on OpenMP [13]; and automatically selecting a TM algorithm adapted to the workload [14]. As opposed to those previously cited adaptive approaches, in this article we exploit adaptivity in the context of thread mapping. Since we are particularly interested in TM, we propose different adaptive thread mapping strategies that consider information from the TM application, TM system and multicore platform. These strategies can be split into two categories: (i) strategies that do not require any prior knowledge; and (ii) strategies that require prior knowledge based on Machine Learning (ML) techniques.

Castro et al. [10, 15] proposed and evaluated a strategy that used ML to perform static and dynamic thread mapping on TM applications, showing promising results. However, that approach was evaluated against simple non-adaptive thread mapping strategies that do not consider any information from the TM application, TM system and platform. In this article, we perform a more thorough evaluation of these previous works, comparing them to other new adaptive approaches. Overall, the contributions of this article are:

• We propose two adaptive thread mapping strategies that do not require any prior knowledge from TM applications;

• We propose a new strategy based on association rule learning;

• We extend the work presented in [15] by comparing its performance results to those obtained with the new adaptive strategies;

• We implement all the proposed adaptive strategies in a TM system, so TM applications can benefit from adaptive thread mapping without any source code modification.

The rest of this paper is organized as follows. Section 2 presents the background and motivation for this research. Section 3 describes the proposed adaptive thread mapping strategies. Section 4 discusses the implementation details on a state-of-the-art TM system. Section 5 outlines our experimental methodology while Section 6 presents results. Finally, Section 7 discusses related work and Section 8 concludes the paper and points out future work.

2. Background and Motivation

We first present the basic concepts of Transactional Memory in Section 2.1. Then, we discuss thread mapping and motivate this research in Section 2.2.

2.1. Transactional Memory (TM)

Transactional Memory is an alternative synchronization solution to classic mechanisms such as locks and mutexes [5, 6]. It makes it easier to write parallel programs by providing the programmer with a higher-level abstraction for synchronization, while leaving the implementation of the mechanism that provides this abstraction to the underlying system. Moreover, it provides an efficient model for extracting parallelism from applications [1].

Transactions are portions of code that are executed atomically and in isolation. Concurrent transactions commit successfully if their accesses to shared data do not conflict with each other. When two or more concurrent transactions conflict, only one transaction will commit whereas the others will abort and none of their actions will become visible to other threads [1]. Conflicts can be detected during the execution of transactions when the TM system uses an eager conflict detection policy, whereas they are detected at commit time when the system uses a lazy conflict detection policy. However, some TM systems also allow lazy transactions to detect conflicts before committing: this may be the case in which one conflicting transaction commits while the other is still running. In this case, the TM system may abort the running transaction due to the conflict with the committing transaction. The TM system is in charge of re-executing aborted transactions. The choice among the conflicting transactions is made according to the conflict resolution policies implemented in the runtime system. Two common alternatives are to squash the transaction that discovers the conflict immediately (suicide strategy) or to wait for a time interval before restarting the conflicting transaction (backoff strategy).

Transactional Memory can be software-only (STM), hardware-only (HTM) or hybrid (HyTM) [6]. In this article we focus on STM, since it offers flexibility in implementing different mechanisms to detect and resolve conflicts and it does not require any specific hardware. STM allows us to carry out experiments on actual multicore platforms without relying on simulations.
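
To make the model concrete, the following sketch shows how a critical section is expressed with transactions. The tm_* wrappers are hypothetical placeholders for an STM API (they are not the interface of TinySTM or any specific system); the point is that the programmer only delimits the transaction, while conflict detection and re-execution are handled by the runtime:

    #include <stdint.h>

    /* Hypothetical STM entry points (illustrative only, not a real API). */
    extern void     tm_begin(void);                        /* start a transaction          */
    extern intptr_t tm_load(intptr_t *addr);               /* transactional read           */
    extern void     tm_store(intptr_t *addr, intptr_t v);  /* transactional write          */
    extern void     tm_commit(void);                       /* commit, or abort and restart */

    static intptr_t account_a, account_b;

    void transfer(intptr_t amount)
    {
        /* Executed atomically and in isolation: if a concurrent transfer
         * touches the same accounts, the conflict is detected (eagerly or
         * lazily) and one of the transactions aborts and is re-executed. */
        tm_begin();
        tm_store(&account_a, tm_load(&account_a) - amount);
        tm_store(&account_b, tm_load(&account_b) + amount);
        tm_commit();
    }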

2.2. Thread Mapping

Thread mapping has been extensively used as a technique to efficiently exploit the memory hierarchy of modern multicore platforms. It allows multithreaded applications to minimize memory latency and/or reduce memory contention by improving data locality. Figure 1 shows three static thread mappings (also known as thread pinning strategies) that have different goals. For the sake of simplicity, we consider in Figure 1 two threads running on a platform composed of two quad-core processors. Scatter distributes threads across different processors, avoiding cache sharing between cores in order to reduce memory contention (Figure 1a). Compact places threads on sibling cores; this reduces data access latency, since concurrent threads can share all levels of shared caches (Figure 1b). Finally, round-robin is an intermediate solution in which threads share higher levels of cache but not the lower ones (Figure 1c). In contrast to these thread pinning strategies, the Linux operating system implements its own dynamic priority-based scheduling strategy, which allows threads to migrate to idle cores to balance the run queues.

Figure 1: Three thread pinning strategies: (a) scatter, (b) compact, (c) round-robin. Two quad-core processors are shown, with cores C0-C7, one L2 cache per pair of cores and one L3 cache per processor; highlighted cores hold pinned threads and the remaining cores are idle.
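
As a rough illustration, the three pinning strategies of Figure 1 can be expressed as simple index computations, assuming cores are numbered consecutively per processor as in the figure (the helper names are ours; the actual implementation derives the real topology with hwloc, as described in Section 4.2):

    /* Target core for thread t, given n_procs processors, cores_per_proc
     * cores per processor and cores_per_l2 cores per L2 cache.
     * Illustrative sketch only. */
    int compact_core(int t)
    {
        return t;                      /* fill sibling cores first: C0, C1, ... */
    }

    int scatter_core(int t, int n_procs, int cores_per_proc)
    {
        /* alternate processors: thread 0 -> C0, thread 1 -> C4, ... */
        return (t % n_procs) * cores_per_proc + t / n_procs;
    }

    int round_robin_core(int t, int cores_per_proc, int cores_per_l2)
    {
        /* one thread per L2 group of a processor before reusing siblings:
         * thread 0 -> C0, thread 1 -> C2 (distinct L2s, shared L3) */
        int groups = cores_per_proc / cores_per_l2;
        int proc   = t / cores_per_proc;
        int slot   = t % cores_per_proc;
        return proc * cores_per_proc + (slot % groups) * cores_per_l2 + slot / groups;
    }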

Unfortunately, finding a thread mapping that best fits the current workload relies upon matching the behavior of the application with system and platform characteristics. In the context of TM, the choice of a suitable thread mapping for an application is even more challenging [10]. This stems from the fact that the same TM application may behave differently depending on the STM system configuration. Configurations such as conflict detection and resolution policies may modify the behavior of the application, affecting the right choice of the thread mapping to be applied [15].

To illustrate this point, let us consider a TM application composed of three distinct phases, each one having different degrees of contention and transaction lengths, based on the characterization of real-world TM applications presented in [16]. We executed this application with 8 threads on a multicore platform that features different levels of shared caches (this platform is further described in Section 5.1). In addition to that, we used two combinations of conflict detection/resolution policies (eager-suicide and lazy-backoff). Figure 2 compares the execution times obtained with each one of the thread pinning strategies previously described, the Linux default strategy and an adaptive thread mapping approach, which chooses the best thread mapping for each phase.


Figure 2: Performance of thread pinning against adaptive thread mapping on a synthetic benchmark.

As can be observed in Figure 2, the adaptive thread mapping approach outperforms all individual thread pinning strategies. This occurs because a thread pinning strategy may perform well for a specific workload (phase) but may result in poor performance for others. Figure 2 also emphasizes that different conflict detection/resolution policies affect the thread mapping strategies. In particular, compact presented the best performance among all thread pinning strategies for the lazy-backoff STM configuration whereas round-robin was the best for the eager-suicide one. Finally, Linux presented performance similar to that obtained with the thread pinning strategies, as shown in [11].

In the next section, we propose four different strategies to perform adaptive thread mapping on TM applications. These strategies require different amounts of knowledge from the TM application and platform to decide which is the best thread mapping to be applied.

3. Adaptive Thread Mapping for Transactional Memory Applications

In this article we exploit adaptivity in the context of thread mapping applied to TM applications. To increase performance on a wide range of scenarios, adaptive strategies must fulfill two requirements. First, they must work across different TM applications and STM system configurations, selecting an appropriate thread mapping for each case. Second, the decision of the thread mapping to be applied cannot be static, since the workload behavior may change throughout the execution. In the next sections, we present the global view of our approach and we discuss different adaptive thread mapping strategies that aim at satisfying these two requirements.

3.1. Overview

Our adaptive thread mapping strategies require different amounts of information from the workload to decide the thread mapping to be applied. We split the adaptive thread mapping strategies into two categories: (i) single metric-based strategies that do not require any prior knowledge from applications (Section 3.2); and (ii) Machine Learning-based strategies that require prior knowledge from applications (Section 3.3). Even though each adaptive thread mapping strategy considers a different number of metrics to make decisions, they all follow the same steps, presented in Figure 3. Since the workload characteristics of TM applications may change at runtime, applications must be profiled throughout the execution. This is done through a cyclic process composed of three steps: profiling, decision and deployment.

Figure 3: Adaptive thread mapping for TM applications.

The application starts running with a default thread mapping and is profiled during p committed transactions at runtime. We use the number of transactions instead of a fixed time interval to guarantee that our measurements occur during the execution of transactions. This profiling step allows us to gather the information needed by the adaptive thread mapping strategies. Next, in the decision step, the profiled information is used to decide which one of the three static thread mappings discussed in Section 2.2 (scatter, compact or round-robin) should be applied. This decision is made by the adaptive thread mapping strategy. Then, the selected thread mapping is applied (deployment step) and remains unchanged during the next r committed transactions. Since these three steps are repeated throughout the execution, the thread mapping can be changed at runtime to match the current workload behavior.
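
A minimal sketch of this cycle, driven from a commit event, could look as follows (the helper functions and the single-threaded counter handling are simplifications of ours, not the actual mod_atm code; the values of p and r used in the experiments are given in Section 6):

    /* Cyclic profiling/decision/deployment process of Figure 3. */
    enum phase { PROFILING, DEPLOYED };

    extern void profile_start(void);          /* hypothetical helpers */
    extern void profile_stop(void);
    extern int  decide_mapping(void);         /* decision step        */
    extern void apply_mapping(int mapping);   /* deployment step      */

    static enum phase state = PROFILING;
    static long commits = 0;
    static const long P = 1000;               /* commits profiled per cycle   */
    static const long R = 20000;              /* commits between re-decisions */

    void on_commit(void)                      /* invoked after each commit */
    {
        if (++commits < (state == PROFILING ? P : R))
            return;
        commits = 0;
        if (state == PROFILING) {
            profile_stop();                   /* freeze the gathered metrics */
            apply_mapping(decide_mapping());
            state = DEPLOYED;
        } else {
            profile_start();                  /* open a new profiling window */
            state = PROFILING;
        }
    }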

3.2. Single Metric-based Strategies

We first propose two adaptive thread mapping strategies that do not require any prior knowledge from applications. All decisions taken by these strategies are based on a single metric collected at runtime.

3.2.1. Conflicting Dataset

Our first adaptive thread mapping strategy (named Conflict) is based on the intuition that the amount of conflicts can be used to decide the thread mapping to be applied. For instance, consider a TM application in which transactions constantly access the same portion of shared data at the same time. In this high-conflicting scenario, placing threads on sibling cores might reduce the latency because the data will probably be available in the cache they share. Now, consider an opposite case where a TM application is composed of several transactions that usually access a large portion of disjoint data. In this low-conflicting scenario, distributing threads across different processors (thus avoiding cache sharing) might reduce the contention on the same cache, making more cache available for each thread.

We use the abort ratio to estimate the amount of conflicts of concurrent transactions at runtime. The abort ratio is the fraction of the number of aborts to the number of transactions issued (aborted + committed). Applications whose transactions tend to concurrently access the same data will present a high abort ratio, since transactions will be constantly in conflict. Analogously, applications whose transactions usually access disjoint data will present a low abort ratio, since transactions rarely conflict. Our Conflict adaptive strategy follows these intuitions, selecting the thread mapping based on the abort ratio as shown in Equation 1.

    Conflict(A_p) = scatter        if A_p ≤ α
                    round-robin    if α < A_p ≤ β                    (1)
                    compact        if A_p > β

We compute the abort ratio (A_p) after p committed transactions during the profiling step, as shown in Section 3.1. Then, the abort ratio is used to decide which thread mapping should be applied. The variables α and β are constants that define different amounts of contention (low, moderate or high). This strategy selects scatter when the abort ratio is low and compact when the abort ratio is high. In case of a moderate abort ratio, an intermediate thread mapping is selected (i.e., round-robin).

It is important to mention that Conflict may lead to wrong decisions in some specific scenarios. On the one hand, metadata serialization may result in low abort ratios while applications suffer from metadata ping-ponging among caches. In these cases, compact would be beneficial, achieving better performance than scatter. On the other hand, Conflict may recommend compact when threads touch lots of private data but frequently conflict on a single shared datum or data structure. In this case, the application will present a high abort ratio but compact will likely decrease the locality of private accesses.
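
A direct transcription of Equation 1 is sketched below, using the α and β values adopted in the experiments of Section 6 (the function and type names are ours):

    typedef enum { SCATTER, ROUND_ROBIN, COMPACT } mapping_t;

    /* Conflict strategy (Equation 1): pick a mapping from the abort ratio. */
    mapping_t conflict_decide(long aborts, long commits)
    {
        const double alpha = 0.33, beta = 0.66;
        double a_p = (double)aborts / (double)(aborts + commits); /* abort ratio */

        if (a_p <= alpha) return SCATTER;      /* low contention      */
        if (a_p <= beta)  return ROUND_ROBIN;  /* moderate contention */
        return COMPACT;                        /* high contention     */
    }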

3.2.2. Test-and-Map

In our second adaptive thread mapping strategy (named Test), each one of the three static thread mappings (compact, round-robin and scatter) is applied during a period of p/3 committed transactions and the execution time obtained with each thread mapping is measured. Then, the thread mapping that achieved the lowest execution time (Equation 2) is selected and applied. The Test strategy is described in Equation 3.

    Min = min(C_p, R_p, S_p)                                         (2)

    Test(C_p, R_p, S_p) = scatter        if S_p = Min
                          round-robin    if R_p = Min                (3)
                          compact        if C_p = Min

The basic idea behind this strategy is the following: measuring the time spent with each one of the three thread mappings during a fixed number of committed transactions may give us a good estimate of their performance. Based on that, the strategy selects the thread mapping that gives the best performance, i.e., the lowest execution time. This strategy is interesting since it measures the actual performance of each individual thread mapping at runtime.

The main drawback of this strategy comes from the overhead of thread migrations, since we need to test all thread mappings during the profiling step. The cost associated with thread migrations depends on a wide variety of factors, such as the memory access pattern, the cache hierarchy and whether a thread moves between cores of the same chip or cores on different chips. Because of that, this cost is very difficult to obtain precisely and may vary considerably from one application to another. According to our experiments, the time taken to migrate a single thread can be as low as a few microseconds (2.6µs on average) and may go up to tens of microseconds (11.3µs on average) on the Intel-based platform used in our experiments (this multicore platform is further described in Section 5.1). We try to reduce the number of thread migrations in Test when possible. For instance, we guarantee that the thread mapping currently in use before the profiling step will be the first to be profiled. Thus, we do not change the thread mapping more than 3 times between each profiling step.
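
Equations 2 and 3 reduce to a three-way minimum; a sketch (with names of ours, timing collection omitted):

    typedef enum { SCATTER, ROUND_ROBIN, COMPACT } mapping_t;

    /* Test strategy: c_p, r_p and s_p are the execution times measured
     * while running p/3 committed transactions under compact, round-robin
     * and scatter, respectively (Equations 2 and 3). */
    mapping_t test_decide(double c_p, double r_p, double s_p)
    {
        double min = c_p;
        if (r_p < min) min = r_p;
        if (s_p < min) min = s_p;

        if (min == s_p) return SCATTER;
        if (min == r_p) return ROUND_ROBIN;
        return COMPACT;
    }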

3.3. Machine Learning-based Strategies

The adaptive thread mapping strategies discussed until now rely on a single metric to decide the thread mapping to be applied. In this section, we propose adaptive thread mapping strategies that consider several metrics at the same time. These strategies are based on Machine Learning techniques to construct a predictor capable of making good decisions at runtime.


Figure 4: Generic framework to construct thread mapping predictors based on Machine Learning algorithms.

To do so, we use the framework proposed in [10] as the basis for testing the performance of different learning algorithms. This framework is depicted in Figure 4. Initially, a set of TM applications with distinct transactional characteristics is selected for the training phase. These applications are then executed with all possible combinations of different TM system configurations, thread counts and thread mappings. In this article, we consider four possible TM system configurations: two conflict detection mechanisms (eager and lazy) and two conflict resolution policies (suicide and backoff), discussed in Section 2.1. We also consider the three thread mappings discussed in Section 2.2. During the training phase, each combination of application, TM configuration, thread count and thread mapping (also known as an input instance) is profiled during the execution. Since the performance of a TM application is not only governed by its characteristics, we must also gather information from the TM system and platform. After carrying out several experiments, we selected a set of features that have an important impact on the performance of TM applications based on empirical evidence:

• Transactional time ratio: fraction of the time spent inside transactions to the total execution time;

• Abort ratio: fraction of the number of aborts to the number of transactions issued;

• TM configuration: conflict detection (eager or lazy) and conflict resolution (suicide or backoff);

• Last-level cache (LLC) miss ratio: fraction of the number of LLC misses to the number of LLC accesses.

TM features such as the conflict detection and resolution mechanisms are inherently qualitative. On the other hand, application features such as the abort ratio and cache miss ratio are quantitative properties, but they may be grouped into categories. This is necessary because the ML algorithms used in this article work with qualitative properties. Thus, we converted each ratio value within a range [x; y] into one of the three following qualitative properties: low, medium or high. These three qualitative properties were chosen based on the characterization of STAMP applications in [16]. Since TinySTM was not included in [16], we profiled all STAMP applications while running them with different TinySTM configurations (conflict detection/resolution policies) and different thread counts to collect three metrics: transactional time ratio, abort ratio and cache miss ratio.

To convert quantitative numeric values into qualitative ones, we applied a classical discretization technique over a statistical distribution of values. Considering the set of applications used in the paper, we observed that the results for each metric followed a uniform distribution U(0, 1). Thus, we applied a uniform linear discretization, breaking the interval [0; 1] into three uniform subintervals: low (from 0 to 0.33), medium (from 0.34 to 0.66) and high (from 0.67 to 1). This discretization allows us to convert each continuous ratio value within the range [0; 1] into one of these three qualitative properties. In order to reduce ambiguity among input data samples, it would be possible to increase the number of qualitative values. Although we could have employed a more fine-grained linear discretization, we concluded that three values were enough to avoid biased input and, consequently, under-generalization of the resulting prediction model for the tested applications.

The collected data is then used to feed a learning algorithm, which outputs a predictor. Then, the predictor can be used to infer thread mappings for new unobserved instances.
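
The discretization described above amounts to a simple threshold function; a sketch (names of ours):

    typedef enum { LOW, MEDIUM, HIGH } level_t;

    /* Uniform linear discretization of a ratio in [0; 1] into the three
     * qualitative properties used by the ML-based strategies. */
    level_t discretize(double ratio)
    {
        if (ratio <= 0.33) return LOW;
        if (ratio <= 0.66) return MEDIUM;
        return HIGH;
    }
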
In the next sections, we describe each one of the ML algorithms used in this study.

3.3.1. Decision Tree Learning

Decision tree learning uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value. One of the best-known decision tree learning algorithms is the Iterative Dichotomiser 3 (ID3) [17], which outputs a decision tree as a predictor based on categorical discrete input instances.

The ID3 algorithm is presented in Algorithm 1. Given a set S of input instances and a set F of features, the ID3 learning process starts by choosing the "best" feature to be the root node. By best, we mean the feature that splits S into the most homogeneous subsets, that is, subsets where the majority of input instances contain the same target variable value (in our context, the target variable is the thread mapping). This recursive process is repeated for each successor node until all leaf nodes contain only input instances with the same target variable value.

Algorithm 1: ID3(S, F)
    input : Input instances S, Features F
    output: Decision tree T
    begin
        Fmax ← the feature Fi maximizing Gain(S, Fi);
        Split S into subsets Sj by each value Vj ∈ Fmax;
        Create a subtree U where the root node is Fmax;
        foreach Vj ∈ Fmax do
            Add an edge labeled Vj to U;
            Connect a successor node Nj containing Sj;
            Add to T the subtree U;
            if Entropy(Nj) = 0 then
                Set label of Nj to the target variable value;
                Remove Sj instances from S;
            else
                ID3(Sj, F)
        end
    end

Homogeneity is measured by the entropy defined in Equation 4, where p_i is the probability that a given instance is in a subset with the same target variable value. The gain is derived from the entropy by computing the difference between the entropy of S and the size-weighted sum of the entropies of its subsets S_i, as shown in Equation 5. The feature with the highest gain is chosen to be the root node.

    Entropy(S) = −Σ_i p_i log2(p_i)                                  (4)

    Gain(S, F) = Entropy(S) − Σ_{i∈F} (|S_i| / |S|) · Entropy(S_i)   (5)
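
For reference, Equations 4 and 5 can be computed as follows (a sketch with names of ours; counts[i] holds the number of instances labeled with the i-th thread mapping):

    #include <math.h>

    /* Entropy(S) over n_classes target values (Equation 4). */
    double entropy(const long counts[], int n_classes, long total)
    {
        double h = 0.0;
        for (int i = 0; i < n_classes; i++) {
            if (counts[i] == 0) continue;   /* 0 * log2(0) taken as 0 */
            double p = (double)counts[i] / (double)total;
            h -= p * log2(p);
        }
        return h;
    }

    /* Gain(S, F) = Entropy(S) - sum_i (|S_i| / |S|) * Entropy(S_i),
     * where subset_h[i] and subset_sz[i] are the entropy and size of the
     * subset S_i induced by the i-th value of feature F (Equation 5). */
    double gain(double entropy_s, const double subset_h[],
                const long subset_sz[], int n_vals, long total)
    {
        double g = entropy_s;
        for (int i = 0; i < n_vals; i++)
            g -= ((double)subset_sz[i] / (double)total) * subset_h[i];
        return g;
    }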

The use of this approach to perform thread mapping on TM applications was first proposed in [10] for static thread mapping and extended in [15] to perform dynamic thread mapping. However, the performance obtained was only compared with static thread mappings, which do not consider any information from the application, TM system or platform. In this article, we compare the dynamic approach proposed in [15] to other adaptive thread mapping strategies.

3.3.2. Association Rule Learning

Association rule learning is a very common method used to discover relations between variables in large databases. Apriori is a well-known association rule learning technique used for frequent itemset mining [18]. Algorithm 2 presents the Apriori algorithm. Given a list of itemsets, it identifies association rules between those items based on their frequency. These rules reveal subsets of items that frequently occur together in the same itemsets. The algorithm is driven by the following rule: all non-empty subsets of a frequent itemset must also be frequent. This rule allows the algorithm to eliminate all itemsets that are not composed of frequent item subsets, reducing the search space significantly. For each association rule S_i → S_j, where S_i and S_j are subsets of items of a frequent itemset, the Apriori algorithm calculates its confidence (Conf), presented in Equation 6. A high confidence level means that most of the time an itemset S_i is present in a frequent itemset, the itemset S_j is also there.

Algorithm 2: Apriori(I, M)
    input : Input instances I, Minimum support M
    output: Set of rules R
    begin
        S ← itemsets of size K = 1 items from I;
        while S is not empty do
            Count occurrences of each itemset Si ∈ S;
            Remove itemsets Si with Support(Si) < M;
            forall permutations of K itemsets Si, Sj ∈ S do
                Insert rule Si → Sj in R with Conf(Si, Sj);
                K = K + 1;
            end
            Insert permutations of size K + 1 items with the remaining itemsets Si ∈ S;
        end
    end

    Conf(S_i, S_j) = support(S_i ∧ S_j) / support(S_i)               (6)
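
Equation 6 is a simple ratio of support counts; a sketch (names of ours):

    /* Confidence of the rule S_i -> S_j (Equation 6): support counts how
     * many input instances contain the given itemset. */
    double confidence(long support_i_and_j, long support_i)
    {
        return support_i > 0
             ? (double)support_i_and_j / (double)support_i
             : 0.0;
    }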

In our work, this technique can be used to identify which feature values are more frequently associated with a certain thread mapping. After the Apriori algorithm finds the association rules, we select only the ones that have a thread mapping in the itemset S_j with a high confidence level.

4. Implementation Details

We implemented our adaptive thread mapping strategies within an STM system. Thus, TM applications can profit from adaptive thread mapping without any kind of source code change. We first introduce the STM system used in our experiments (Section 4.1). Then, we discuss the implementation details of the adaptive thread mapping module (Section 4.2).

4.1. TinySTM

TinySTM is a well-known STM system that uses a global version clock approach to control the conflicts between transactions and a shared array of locks to manage concurrent accesses to memory locations [19]. It implements different conflict detection designs, including eager (i.e., encounter-time locking) and lazy (commit-time locking). In addition to that, it can be configured to use different conflict resolution policies, such as suicide and backoff. We chose TinySTM among other STM systems because it is lightweight, efficient and its implementation has a modular structure that can be easily extended with new features [11, 19].

Figure 5 depicts the global view of TinySTM. Basically, it is composed of an STM core, in which most of the STM code is implemented, and some additional modules. These modules implement basic features such as dynamic memory management (mod_mem) and transaction statistics (mod_stats). We added a new module called mod_atm that extends TinySTM to perform adaptive thread mapping transparently. We used the most recent version of TinySTM (1.0.4) to implement our module as well as to carry out all experiments.


Figure 5: Overview of TinySTM.

4.2. Adaptive Thread Mapping in TinySTM

Our adaptive thread mapping module (mod_atm) combines three main components: a topology analyzer, a transaction profiler and an adaptive strategy.

The topology analyzer component gathers useful information from the hierarchical topology of the underlying multicore platform, such as sockets, shared caches and cores. In order to avoid operating system and architectural idiosyncrasies, we use the Hardware Locality (hwloc) library [20]. In addition to the portable abstraction of the hierarchical topology, this library provides functions to bind threads to cores. For instance, the function hwloc_set_thread_cpubind() binds a thread to the set of cores described by a physical bitmap. We use this feature to implement and apply the static thread mappings described in Section 2.2.

The transaction profiler component performs the application runtime profiling to gather information from hardware counters and basic transactional statistics. The collected data is used as input to the adaptive thread mapping strategies. TinySTM allows the registration of application-specific callbacks that are triggered each time particular events occur. We use this feature to register callbacks that are triggered when transactions start, commit and abort. Moreover, we use a function interposition technique to replace calls to the POSIX Threads library (Pthreads) with calls to our own wrappers. This is useful for knowing when threads are created and destroyed during the execution. To compute the abort ratio, we maintain two variables that count the number of commits and aborts. The transactional time ratio is an approximation obtained by measuring the time spent inside transactions (between start and commit/abort operations) and outside transactions. The LLC miss ratio is obtained from the Performance Application Programming Interface (PAPI) [21], which gives access to the two hardware counters needed to compute it. For example, if the platform has L3 caches, we use the following counters: PAPI_L3_DCM (L3 data cache misses) and PAPI_L3_DCA (L3 data cache accesses). When the platform does not have L3 caches, we use the corresponding counters for the L2 caches.

Finally, the adaptive strategy component implements all the adaptive thread mapping strategies discussed in Section 3, i.e., Conflict, Test, ID3 and Apriori. During the decision step, the adaptive thread mapping strategy uses the information obtained from the transaction profiler to decide which thread mapping should be applied.

In order to reduce intrusiveness (i.e., to not change the behavior of the application) and to avoid extra synchronization mechanisms to guarantee reliable measurements among concurrent threads, we assume that the workload of TM applications is uniformly distributed among the threads. This means that profiling a single thread during a certain period can give us a good approximation of what would be captured if all threads were profiled during the same period. This assumption simplifies the profiling process, since we can use only one thread for profiling the TM application at a time. However, this can lead to suboptimal choices of thread mappings if the workload is non-uniform.
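
As an illustration of how the topology analyzer and the deployment step fit together, the following sketch binds the calling thread to a given core using the hwloc calls mentioned above (error handling omitted; the translation from a thread mapping to a core index is assumed to be done elsewhere):

    #include <hwloc.h>
    #include <pthread.h>

    static hwloc_topology_t topo;    /* initialized once at startup */

    void topology_init(void)
    {
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);   /* discover sockets, caches, cores */
    }

    void bind_self_to_core(int core_idx)
    {
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, core_idx);
        if (core == NULL) return;

        hwloc_cpuset_t set = hwloc_bitmap_dup(core->cpuset);
        hwloc_bitmap_singlify(set);  /* keep one PU, avoid intra-core moves */
        hwloc_set_thread_cpubind(topo, pthread_self(), set, HWLOC_CPUBIND_THREAD);
        hwloc_bitmap_free(set);
    }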

5. Experimental Setup

We first detail the multicore platform used in our experiments (Section 5.1). Then, in Section 5.2, we discuss the applications that were used to evaluate the performance of the adaptive thread mapping strategies. Finally, we discuss the predictors obtained from the ML-based strategies (Section 5.3).

5.1. Multicore Platform

Our experiments were conducted on a multicore platform based on four six-core Intel Xeon X7460 processors (Table 1). Each group of two cores shares an L2 cache (3MB) and each group of six cores shares an L3 cache (16MB).

Table 1: Summary of the hardware/software characteristics.

Hardware
    Processor:            Intel Xeon X7460
    Number of cores:      24
    Number of sockets:    4
    Clock (GHz):          2.66
    L2 capacity (MB):     3
    L3 capacity (MB):     16
    DRAM capacity (GB):   64

Software
    Linux kernel:         3.2.0-4
    GCC:                  4.7.2
    TinySTM:              1.0.3
    STAMP benchmarks:     0.9.10
    EigenBench:           0.8.0

All experiments were carried out with exclusive access to this platform and each experiment was executed up to 30 times to guarantee a confidence level of 95%. We used TinySTM for all experiments and we configured it to use different conflict detection and resolution policies.

5.2. Transactional Memory Applications

We used two sets of applications in our experiments. The first set is based on applications composed of synthetic workloads created with EigenBench [22]. EigenBench is a lightweight microbenchmark that allows users to perform a thorough exploration of the orthogonal space of TM application characteristics. These synthetic workloads allow us to test the adaptive thread mapping strategies on a large yet controlled design space. EigenBench can be easily configured to mimic a wide variety of workloads. Since varying all possible orthogonal TM characteristics involves a high-dimensional search space, we decided to vary three orthogonal characteristics that govern the behavior of TM applications:

• Transaction length: number of shared accesses per transaction (short/long);

• Contention: probability of conflict (low-conflicting/contentious);

• Density: fraction of the time spent inside transactions to the total execution time (sparse/dense).

Since we assume two possible discrete values for each characteristic, we can create a total of 2^3 = 8 distinct workloads (named W1, W2, ..., W8) by combining these values. We then used these workloads to create TM applications composed of different phases (workloads).

The second set is based on realistic applications. These applications come from the Stanford Transactional Applications for Multi-Processing (STAMP) [16], which includes 8 applications covering a variety of algorithms and application domains. These applications exhibit a wide range of transactional behaviors, such as different transaction sizes, amounts of contention, sizes of read and write sets, and coarse-grain and fine-grain transactions. In our experiments, we used the largest inputs recommended in [16] for non-simulated runs for all applications. In kmeans and vacation, we selected the low contention configuration to cover a wider range of transactional characteristics.

5.3. Machine Learning-based Predictors

The ML-based adaptive strategies must be fed with input instances for practical usage. This is an important step, since the predictor will learn from data collected from the input instances. If the input instances do not represent good samples, the predictor may output many incorrect predictions. Because of that, we decided to use STAMP applications as input instances for the ML-based strategies. We believe that these realistic applications can be useful to construct a predictor capable of capturing the behavior of real-world TM applications.

The overall time of the training phase is directly related to the time needed to execute all the input instances. In our case it took about 4 hours to run all possible combinations of applications, conflict detection/resolution policies, thread counts and thread mappings on our multicore platform. However, since this training phase is only performed once, it is a one-off cost.

We used all applications from STAMP to train the ML-based strategies while evaluating them on EigenBench synthetic workloads. However, when we evaluated the ML-based strategies on a specific STAMP application, we excluded it from the training phase. For instance, to evaluate a ML-based strategy on intruder, we used the information from all STAMP applications other than intruder to feed the learning process. This leave-one-out cross-validation guarantees that our ML-based strategies are evaluated on unobserved instances.

Figure 6 shows the decision tree generated by ID3 when all STAMP applications are used as input instances. We can derive some overall conclusions from this decision tree. First, it finds a possible correlation between the abort ratio and the LLC miss ratio: when an application presents a high abort ratio, the LLC miss ratio is taken into account to decide the thread mapping to be applied. Second, when the abort ratio is low, the predictor tends to select a strategy that places threads far from each other (e.g., scatter). A low abort ratio means that transactions rarely access the same shared data at the same time; thus, the contention generated by several threads accessing the same cache can be alleviated by applying such strategies. Finally, when the conflict resolution policy used is backoff, the decision tree tends to avoid scatter: backoff forces aborted transactions to wait some time before being re-executed, which reduces the amount of contention on the cache and makes it possible to place threads on sibling cores to amortize the access latency.
Figure 6: Decision tree generated by ID3 when using all STAMP applications as input instances. Internal nodes represent input features (transactional time ratio, abort ratio, LLC miss ratio, TM conflict detection and TM conflict resolution); leaves represent the thread mapping strategy to be applied.
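
At runtime, consulting such a tree is just a chain of feature tests. The fragment below is purely illustrative: it encodes only the trends stated above (a low abort ratio favors scatter, a high abort ratio makes the LLC miss ratio decisive, and backoff steers away from scatter), not the exact branches of Figure 6:

    typedef enum { LOW, MEDIUM, HIGH } level_t;
    typedef enum { SCATTER, ROUND_ROBIN, COMPACT } mapping_t;

    /* Simplified, hand-written excerpt of a learned predictor (not the
     * full tree of Figure 6). */
    mapping_t tree_predict(level_t abort_ratio, level_t llc_miss, int backoff)
    {
        if (abort_ratio == LOW)
            return backoff ? ROUND_ROBIN : SCATTER;  /* avoid scatter under backoff */
        if (abort_ratio == HIGH)
            return llc_miss == LOW ? SCATTER : COMPACT;
        return ROUND_ROBIN;                          /* moderate contention */
    }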

Figure 7 shows the set of rules generated by Apriori when considering all STAMP applications as input instances. The attributes "time", "abort" and "llcm" can be read as transactional time ratio, abort ratio and LLC miss ratio, respectively. Values on the right side of the rules represent the number of times the subsets of items occur together in the same itemsets. All rules have confidence equal to 100%.

1. time=high, abort=low → scatter (8)
2. abort=low, llcm=med → scatter (8)
3. time=high, abort=low, llcm=med → scatter (8)
4. time=med → compact (4)
5. llcm=low → scatter (4)
6. time=high, llcm=low → scatter (4)
7. time=med, abort=high → compact (4)
8. time=med, llcm=high → compact (4)
9. abort=high, llcm=high → compact (4)
10. abort=high, llcm=low → scatter (4)

Figure 7: Rules generated by Apriori using all STAMP applications as input instances.

The set of rules generated by Apriori also reveals a possible correlation between the abort ratio and thread mappings. Again, the predictor usually selects scatter when the abort ratio is low and compact when the abort ratio is high. An exception is found in the last rule, suggesting that scatter is better when the LLC miss ratio is low even if the abort ratio is high.

Despite the previously mentioned similarities, the results obtained from the two ML-based algorithms differ in two aspects. Decision trees cover all possible combinations of features whereas association rules may not: we can always use a decision tree to make a decision regardless of the profiled information, while association rules may not have a rule for every possible case. Because of that, we select a default thread mapping strategy (i.e., compact, in our case) when the profiled information does not match any of the rules. Another important aspect concerns TM configurations (conflict detection and conflict resolution): these configurations were taken into account by ID3 whereas they were completely excluded by Apriori. Thus, we expect different behaviors when applying these ML-based strategies.

6. Performance Evaluation

In this section, we evaluate the performance obtained with each one of the adaptive thread mapping strategies over a large set of transactional applications. We first evaluate them on synthetic TM applications in which the transactional workload changes considerably during the execution (Section 6.1). This allows us to observe whether our strategies correctly adapt the thread mapping at runtime when the workload changes. Then, we evaluate them on all applications available in STAMP (Section 6.2) to assess whether our strategies can also achieve good results over a wide range of realistic transactional workloads. Finally, we present a peek into the adaptive strategy in action (Section 6.3).

Speedups are calculated based on the execution time of the sequential (hence transactionless) version of each application. We used the original version of TinySTM for all runs with the Linux default strategy, thus avoiding any overhead added by our adaptive thread mapping module (mod_atm). We set the Conflict strategy parameters to α = 0.33 and β = 0.66 based on the same characterization of STAMP applications described in Section 3.3.

Choosing the best p and r parameters automatically for an arbitrary application is challenging because we do not assume any prior knowledge about the number of transactions executed by the application. However, we assumed the following invariants when choosing them for the set of applications considered in the paper: (i) r ≫ p; and (ii) p is sufficiently large to obtain statistically relevant measurements. The first invariant intends to reduce the overhead of the profiling, decision and deployment steps. The second invariant aims at reducing statistical noise in the profiling step. After performing several experiments, we fixed the parameters p and r of all adaptive thread mapping strategies based on empirical evidence. We used p = 1,000 and r = 20,000 for our experiments on all applications other than bayes and labyrinth. This means that 1,000 committed transactions are profiled during profiling steps and thread mappings remain unchanged during the next 20,000 committed transactions. For bayes and labyrinth, we fixed p = 50 and r = 500 because they execute few transactions (up to 3,000 transactions).

6.1. EigenBench

Our first set of experiments explores the effectiveness of our adaptive strategies on a set of synthetic TM applications. We derived these applications from the 8 workloads discussed in Section 5.2. We fixed the number of phases at 3, so each application is composed of three distinct workloads. The number of possible applications composed of three distinct workloads is given by the number of 3-combinations of a set of 8 elements, i.e., C(8,3) = 56 applications (named A1, A2, ..., A56). Each workload executes 1,000,000 transactions, so applications composed of 3 phases execute 3,000,000 transactions. The set of applications is represented as follows: A1 = {W1, W2, W3}, A2 = {W1, W2, W4}, ..., A56 = {W5, W6, W7}.

Applications were executed with each one of the adaptive thread mapping strategies (i.e., Conflict, Test, ID3 and Apriori) and with the Linux default strategy. All applications were executed with 8 threads and TinySTM was configured with lazy conflict detection and backoff conflict resolution. Figure 8 presents the percentage improvement obtained with each adaptive thread mapping strategy over the Linux default scheduling strategy.
Figure 8: Percentage improvement over Linux with synthetic applications composed of three phases (applications A01-A56; one group of bars per application for Conflict, Test, Apriori and ID3).

We can draw some important conclusions from these results. First, the thread mapping had an important impact on the performance of these applications: we observed performance improvements of up to 64.4%. Second, our adaptive thread mapping strategies improved the performance of the applications in all cases. This confirms that our strategies can adapt the thread mapping to match the characteristics of transactional workloads. Finally, we observed that the ML-based strategies usually achieved better performance gains than the single metric-based ones. The reason for that is two-fold: (i) ML-based strategies use more than one metric to make decisions; and (ii) predictors carry some knowledge obtained through the learning process on data collected from realistic TM applications.

Figure 9 shows the distribution of performance improvements of each adaptive thread mapping strategy over Linux. Each graph represents a density histogram, showing the percentage of data occurrences (y-axis) over discrete intervals (x-axis). Taking ID3 as an example, we can conclude that in most cases (approximately 40%) it achieved performance gains between 50% and 60% over the Linux default strategy.

Figure 9: Density histograms of performance improvements over Linux (%), one panel per strategy (Conflict, Test, Apriori and ID3).

Overall, Conflict did not achieve any performance improvement greater than 50% with an average performance improvement of 32.3%. Although these are clearly significant performance improvements, this indicates that the choice of the thread mapping cannot rely only on the abort ratio if we intend to achieve the best performance possible. In many cases, Test achieved per17

formance gains close to those obtained with the ML-based strategies. However, the overhead of testing different thread mappings throughout the execution of the applications was very high in some cases, decreasing the average performance of this adaptive strategy to approximately 39%. This overhead was mainly observed on applications presenting low execution times (few seconds), such as A27 , A31 and A47 (Figure 8). The ML-based strategies presented the best results, which confirms that more than one metric should be taken into account to achieve better performance. ID3 achieved performance improvements of up to 62% and 45.1% on average whereas Apriori achieved performance improvements of up to 64.4% and 46.6% on average. Although the density histograms of both strategies follow approximately the same distribution, we observe more cases where Apriori achieved performance improvements superior to 60%. The results obtained with the adaptive thread mapping strategies on this set of synthetic applications are promising. This evaluation was useful in order to observe how much we can improve the performance of TM applications with adaptive thread mapping on applications composed of very distinct transactional workloads. We now proceed with the evaluation of these strategies on realistic applications from STAMP. 6.2. STAMP In our second set of experiments we aim at analyzing the performance improvements of adaptive thread mapping strategies on realistic TM applications such as those available in STAMP [16]. We considered the following applications and configurations: • STAMP applications: bayes, genome, intruder, kmeans, labyrinth, ssca2, vacation and yada; • TinySTM policies: conflict detection (eager and lazy) and conflict resolution (suicide and backoff); • Thread counts: 2, 4, 8 and 16. We then executed each one of the 128 possible combinations (8 applications × 4 TM configurations × 4 thread counts) using our four adaptive strategies as well as with the Linux default scheduling strategy. We first discuss four interesting case studies in Section 6.2.1 and then we show the overall results of the adaptive thread mapping strategies in Section 6.2.2. 6.2.1. Case Studies Figure 10 presents the speedups obtained from four interesting cases (labyrinth, kmeans, intruder and genome) when varying the TM configurations and thread counts. Labyrinth. This application is a variant of Lee’s routing algorithm implemented with transactions [16]. The main data structure employed in this application is a three-dimensional uniform grid representing a maze. Each thread grabs start and end points that it must connect by a path of adjacent maze grid points. Manipulations on the maze grid are performed inside transactions, thus conflicts occur when two or more threads pick paths that overlap. This application executes long transactions and presents a low abort ratio with low thread counts and a moderate to high abort ratio with high thread counts. We observed different results while varying the conflict resolution policies (Figure 10). With suicide, most of the performance improvements were obtained with low thread counts. In these cases, Conflict, ID3 and Apriori chose scatter: since this application has a very low abort ratio with low thread counts, it is beneficial to spread threads across different processors to reduce memory contention. On the other hand, with 8 and 16 18

[Figure 10 contains one speedup panel per application and TM configuration (eager/lazy conflict detection × suicide/backoff conflict resolution); x-axis: number of threads (2, 4, 8, 16); y-axis: speedup; legend: Linux, Conflict, Test, Apriori, ID3.]
Figure 10: Performance of adaptive thread mapping strategies on genome, labyrinth, kmeans and intruder.

On the other hand, with 8 and 16 threads the abort ratio was moderate to high, considerably increasing the contention on shared caches. This problem was alleviated with backoff, since aborted transactions wait for a time interval before restarting. Compact is beneficial when the contention on the shared caches is not high, since threads can reuse cached data. With high thread counts, Apriori always chose compact, achieving the best results. Conflict applied round-robin in most cases and compact only when the abort ratio was high. ID3 did not apply compact in some cases with high thread counts, thus reducing its gains. Test achieved the worst results due to wrong estimations of the performance of individual thread mappings in the profiling phases: in labyrinth, we could profile very few transactions, since it executes a limited number of transactions overall.

Kmeans. This application implements a clustering algorithm that tries to group similar elements into K clusters [16]. It iterates over a set of elements and calculates the distance between these elements and their centroids. Since threads only occasionally update the same centroid concurrently, this algorithm is well suited for TM. It has short transactions and presents a low to moderate abort ratio with few threads and a high abort ratio with 16 threads. Our adaptive thread mapping strategies usually improved the performance of kmeans (Figure 10). However, we observed that Conflict was the least performant. To investigate this, we analyzed the profiled information obtained during the execution of kmeans with different thread mappings. We then noticed that the LLC miss ratio was considerably impacted by the thread mappings: it was very high with scatter whereas it was low to moderate with compact (thus improving the overall performance of kmeans). Since Conflict does not take the LLC miss ratio into account, it only applied compact when the abort ratio was high. On the contrary, Apriori caught this phenomenon and applied compact in most cases, achieving the best performance.
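As an illustration of how such a metric can be gathered at runtime, the sketch below reads LLC accesses and misses around a profiling period using PAPI's high-level counter interface [21]. This is a simplified sketch: error checking is omitted, and the PAPI_L3_* preset events assumed here are not available on every processor, so the proper event set would have to be chosen after inspecting the platform.

#include <papi.h>

/* Measure the LLC (here, L3) miss ratio over one profiling period.
 * Simplified sketch: return codes should be checked, and the
 * PAPI_L3_* presets may be unavailable on some processors. */
static double llc_miss_ratio(void (*profiling_period)(void))
{
    int events[2] = { PAPI_L3_TCM, PAPI_L3_TCA };  /* misses, accesses */
    long long counts[2] = { 0, 0 };

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_start_counters(events, 2);
    profiling_period();                    /* execute some transactions */
    PAPI_stop_counters(counts, 2);

    return counts[1] > 0 ? (double)counts[0] / (double)counts[1] : 0.0;
}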

Intruder. This application emulates Design 5 of the signature-based network intrusion detection system, which scans network packets to detect a known set of intrusion signatures [16]. It is composed of three phases: capture, reassembly and detection. Both the capture and reassembly phases are enclosed by transactions. Different shared data structures are used depending on the phase: a FIFO queue is used in the capture phase whereas a dictionary implemented by a self-balancing tree is used in the reassembly phase. This application presents a variable abort ratio during the execution. At the beginning, scatter achieves better performance because the abort ratio is considerably low (transactions tend to access disjoint data). However, threads operate on far fewer nodes of the self-balancing tree as the execution progresses, considerably increasing the abort ratio: the probability of having transactions accessing the same node increases (temporal locality). In this case, compact achieves better performance. Since the abort ratio is moderate most of the time, Conflict usually applied round-robin whereas ID3 usually applied compact (because the transactional time ratio in intruder is high). Apriori achieved slightly better results than ID3, applying scatter most of the time and compact when the abort ratio became high.

Genome. This application takes a large number of DNA segments as its input and tries to match them to reconstruct the original source genome [16]. Additions to the structure containing unique segments and all accesses to the global pool of unmatched segments are enclosed by transactions to allow concurrent accesses. Although most of its execution is spent inside transactions, this application presents a very low abort ratio. This means that transactions frequently access disjoint data, so avoiding cache sharing is beneficial: the key here is to avoid compact in order to reduce the contention on the same cache and to make more cache space available to each thread. This is exactly what our adaptive thread mapping strategies did. However, we observed that Test incurred some performance losses (Figure 10). Since genome has a very short execution time (about a second), the overhead of thread migrations in Test may surpass the gains obtained from thread mapping. We confirmed this assumption by performing experiments with a larger input data set: in these cases, Test always presented performance improvements comparable to the other adaptive thread mapping strategies.

6.2.2. Overall Results

Figure 11 presents the average performance improvement of each adaptive strategy per application when considering all TM configurations and thread counts, normalized to Linux (averages were calculated using arithmetic means). Overall, our adaptive thread mapping strategies improved the performance of STAMP applications. The only application in which adaptive thread mapping did not achieve any performance improvement was vacation.
Vacation emulates an on-line travel reservation system, in which clients execute operations (transactions) against a database server. The reason all adaptive thread mapping strategies performed similarly to the Linux default scheduling strategy lies in the way transactions access the database: requests coming from clients are generated randomly in a distributed fashion, so threads end up executing transactions that uniformly access the database rows in a balanced way.

[Figure 11 is a bar chart with one group per STAMP application (bayes, genome, intruder, kmeans, labyrinth, ssca2, vacation and yada); y-axis: overall performance normalized to Linux (0.90 to 1.25); legend: Conflict, Test, Apriori and ID3.]

Figure 11: Overall performance of adaptive thread mapping strategies normalized to Linux.

Figure 12 shows the distribution of performance gains and losses of each adaptive strategy when considering all TM configurations and thread counts. Positive intervals on the x-axis mean performance gains whereas negative intervals mean performance losses. Concerning the single metric-based strategies, Conflict presented better results than Test on average. However, we observed again that the abort ratio, taken as a single metric, is not capable of achieving the best performance: in approximately 30% of the cases, Conflict resulted in performance losses of up to 10%.

[Figure 12 consists of four density histograms, one per strategy (Conflict, Test, Apriori and ID3); x-axis: performance normalized to Linux (%), from -60 to 60; y-axis: density (%), from 0 to 50.]

Figure 12: Density histogram of performance gains and losses normalized to Linux.

The ML-based adaptive strategies achieved better performance gains than the single metric-based ones while limiting the performance losses. Overall, Apriori achieved performance gains of up to 51.7% (8.7% on average) whereas ID3 achieved performance gains of up to 43% (6.9% on average) when compared to Linux. Apriori performed better than ID3 since it incurred very few performance losses (most of them limited to 10%) while doubling the number of performance improvements in the range from 10% to 20%.
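To give an idea of how rules mined by Apriori can be consulted at runtime, the sketch below matches discretized profiled metrics against a learned rule table. The rule structure, the example rules and their ordering are hypothetical and shown only to convey the mechanism; the actual rules are learned offline from training data.

/* Discretized levels for the profiled metrics; ANY is a wildcard. */
typedef enum { LOW, MODERATE, HIGH, ANY } level_t;
typedef enum { COMPACT, SCATTER, ROUND_ROBIN, LINUX_DEFAULT } mapping_t;

/* A learned rule: if the metrics match the antecedent, apply the mapping. */
typedef struct {
    level_t abort_ratio, tx_time_ratio, llc_miss_ratio;  /* antecedent */
    mapping_t mapping;                                   /* consequent */
} rule_t;

/* Hypothetical rules, in the spirit of those mined by Apriori. */
static const rule_t rules[] = {
    { LOW,  ANY,  HIGH, COMPACT },   /* e.g., a kmeans-like situation */
    { LOW,  HIGH, LOW,  SCATTER },   /* e.g., a genome-like situation */
    { HIGH, ANY,  ANY,  COMPACT },
};

static int matches(level_t wanted, level_t observed)
{
    return wanted == ANY || wanted == observed;
}

/* Rules are assumed sorted by confidence; the first match wins. */
static mapping_t predict(level_t ar, level_t ttr, level_t llc)
{
    for (unsigned i = 0; i < sizeof rules / sizeof rules[0]; i++)
        if (matches(rules[i].abort_ratio, ar) &&
            matches(rules[i].tx_time_ratio, ttr) &&
            matches(rules[i].llc_miss_ratio, llc))
            return rules[i].mapping;
    return LINUX_DEFAULT;   /* no rule matched: keep the OS scheduler */
}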


6.3. Dynamic Thread Mapping in Action

In order to observe how our adaptive thread mapping strategies react when they encounter several different phases, we created a single application composed of all 8 distinct workloads discussed in Section 5.2. We then executed this application with one of our ML-based strategies (ID3) while tracing the information obtained by the transaction profiler at the end of each profiling period. Figure 13 shows the variation of the profiled metrics during the execution with 4 threads. Vertical bars represent the intervals in which each thread mapping was applied.

[Figure 13 plots the profiled ratios (abort ratio, transactional time ratio and LLC miss ratio, in %) against the number of committed transactions (0 to 8 × 10^5); vertical bars delimit the intervals in which compact, scatter or round-robin (r-r) was applied.]

Figure 13: Profiled metrics during the execution of an application with 8 phases. Vertical bars represent the thread mapping applied by ID3 during the execution.

At the beginning, ID3 applies compact as its default strategy and profiles some transactions. After the first profiling period, the predictor decided to apply compact, so the thread mapping remained the same. This behavior was repeated until the application reached a different phase, near 1 × 10^5 committed transactions, at which point the predictor switched to scatter. Overall, the predictor detected more than 8 phases due to the variation of some profiled metrics, but it still correctly detected the 8 main phase changes, reacting by applying a suitable thread mapping to each phase. We can also observe that the variation of the profiled metrics confirms that the 8 workloads have distinct characteristics.
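The control flow just described can be summarized by the loop sketched below, which would run inside the STM runtime. All type and function names here are illustrative placeholders rather than actual TinySTM hooks, and the synchronization protecting the shared state is omitted for brevity.

#include <stdatomic.h>

typedef enum { COMPACT, SCATTER, ROUND_ROBIN } mapping_t;
typedef struct { double abort_ratio, tx_time_ratio, llc_miss_ratio; } metrics_t;

/* Placeholder runtime services, sketched earlier in the text. */
metrics_t collect_profiled_metrics(void);
mapping_t predict_mapping(metrics_t m);
void      remap_all_threads(mapping_t m);

enum { PROFILING_PERIOD = 10000 };   /* assumed tuning parameter */

/* Invoked on every transaction commit: after each profiling period,
 * recompute the metrics, query the predictor and remap the threads
 * only when the predicted mapping changes (a phase change). */
void on_commit(void)
{
    static atomic_long committed;
    static mapping_t current = COMPACT;   /* default initial mapping */

    long n = atomic_fetch_add(&committed, 1) + 1;
    if (n % PROFILING_PERIOD != 0)
        return;

    metrics_t m = collect_profiled_metrics();
    mapping_t next = predict_mapping(m);
    if (next != current) {
        remap_all_threads(next);
        current = next;
    }
}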

7. Related Work

We discuss in this section the most relevant related work concerning the main topics explored in this article. Section 7.1 presents works that make use of thread or process mapping mechanisms whereas Section 7.2 discusses works that use Machine Learning (ML) techniques to improve the performance of parallel applications.

7.1. Thread and Process Mapping

Although the focus may differ depending on the programming model, thread or process mapping approaches are based on heuristics to map threads or processes to cores.


In this section, we summarize some works that rely on thread mapping approaches to improve the performance of parallel applications.

H. Chen et al. [23] proposed a profile-guided parallel process placement to minimize the cost of point-to-point communications in MPI applications. First, the communication profile of an MPI application and the network topology of the cluster platform are collected. This information is then used as the input of a graph mapping algorithm, which maps the communication graph of parallel applications to the system topology graph. J. Zhang et al. [24] extended the approach proposed in [23] to consider MPI collective communications. Basically, they proposed to transform collective communications into a series of point-to-point communications, so that existing approaches could be used to find mapping schemes optimized for both point-to-point and collective communications. The authors evaluated this approach on a set of benchmarks that include MPI collective communications. Their works differ from ours in some key aspects. They use traces to compute the communication graph of the application after a previous execution whereas we use profiling at runtime. Another difference concerns the scheme adopted to map processes/threads to cores: they rely on graph partitioning heuristics whereas we use Machine Learning.

S. Hong et al. [9] proposed a dynamic thread mapping strategy for data parallel applications implemented with OpenMP. The proposed approach has three alternating phases, which are periodically applied inside parallel loops. First, threads are profiled during some iterations to compute their loads. Then, threads are mapped in such a way that the load is distributed among the processors. Finally, the new mapping remains unchanged during a fixed number of iterations. The proposed strategy was evaluated using a simulator and the average improvement obtained was about 13%. Our work differs from this prior study because they mostly use the processor load (number of cycles) to take decisions whereas our ML-based strategies consider different metrics from the application, runtime system and platform at the same time.

M. Diener et al. [25] examined data sharing patterns between threads in different OpenMP workloads and used those patterns to map processes. Their algorithms relied on memory traces extracted from benchmarks using a simulator. The authors achieved moderate improvements in the common cases and considerable improvements in some cases, reducing execution time by up to 45% in comparison to Linux. Similarly, E. Cruz et al. [26] used memory traces to perform thread mapping on OpenMP applications. They proposed and evaluated a process mapping technique that binds the threads of a given application to cores and allocates their data on close memory banks. This method used different metrics and a heuristic to obtain the mapping. Their results showed performance gains of up to 75% compared to the Linux scheduler and memory allocator. In contrast to these works, we targeted TM applications, which tend to present a more dynamic behavior than those suited to OpenMP. Moreover, we did not rely on simulations to gather information about the applications: instead, we use hardware counters and software libraries to gather information and take decisions at runtime.
7.2. Performance Improvement of Parallel Applications with Machine Learning

Machine Learning has been extensively used as a predictive mechanism to solve a wide range of problems. In this section, we briefly describe some works that rely on ML techniques to improve the performance of parallel applications.

D. Grewe et al. [27] proposed a ML-based compiler model that accurately predicts the best partitioning of data-parallel OpenCL tasks. Static analysis is used to extract code features from OpenCL programs. These features feed a ML algorithm responsible for predicting the best task partitioning. The proposed model achieved a speedup of 1.57 over a state-of-the-art dynamic runtime approach. In contrast to our work, they focused on task partitioning of OpenCL applications on CPU-GPU systems rather than thread mapping of transactional applications on homogeneous architectures.

G. Tournavitis et al. [28] proposed a two-stage parallelization approach combining profiling-driven parallelism detection and ML-based mapping to generate OpenMP-annotated parallel programs. This method involves collecting static and dynamic features from the sequential version of the program to identify portions of code that can be parallelized. Then, a previously trained ML-based predictor is applied to each parallel loop candidate to select an OpenMP scheduling policy (i.e., cyclic, dynamic, guided or static). Unlike our approach, G. Tournavitis et al. use a different ML algorithm called Support Vector Machines (SVM) to decide whether or not to parallelize a loop candidate and which scheduling strategy should be applied. In addition, user intervention may be needed to approve or disapprove decisions taken by the predictor when correctness cannot be proven by static analysis. A similar work was done by Z. Wang et al. [29], focusing on determining the best number of threads for a parallel program and how the parallelism should be scheduled. The proposed approach is also based on SVM to infer the performance of particular mappings, selecting the best one.

Machine Learning has also been recently studied in the context of transactional memory. Q. Wang et al. [14] introduced ML-based predictive mechanisms that can select an STM algorithm adapted to the TM application workload at runtime. Two adaptive policies were proposed, one based on Case-Based Reasoning and the other based on Neural Networks. These policies were evaluated on all applications from the STAMP benchmark suite, achieving performance improvements very close to an oracle that always chooses the best algorithm. D. Rughetti et al. [30] proposed a ML-based solution to dynamically select the optimal concurrency level (number of concurrent threads) of TM applications. The approach is based on Neural Networks and works as follows. First, the Neural Network is trained using a data set obtained by profiling the application. Then, at runtime, a statistical characterization of the application workload is periodically used as input to the previously trained Neural Network. These predictions are finally exploited by a control algorithm, which regulates the number of threads to maximize the application throughput.

In contrast to these works, we are interested in the impact of thread mapping on the performance of TM applications. Although our ML-based strategies and these previous works share the idea of using ML techniques to improve the performance of TM applications, they differ in several aspects. First, they use ML for different goals, so our approach is complementary to those works. Second, they differ in terms of the ML algorithms used for prediction. Finally, the information collected during the learning phase and runtime profiling also differs: we do not need to collect information in single-threaded mode, which would add a considerable overhead in dynamic applications, since it would be necessary to profile several transactions in single-threaded mode to capture the workload behavior.

8. Conclusion and Perspectives

The performance of TM applications can be improved at different levels, ranging from the TM application level to the underlying platform level.
Diverging from several previous works that focused on improving the performance at a single level, we showed that the performance of TM applications can be improved if we match their characteristics (along with the characteristics of the TM system) to the underlying multicore platform. More precisely, we were interested in gathering useful information from applications to better exploit the memory hierarchy of modern multicores.

We used thread mapping to do so, which aims at mapping threads to cores in order to improve the use of resources such as cache memories. However, STM systems make this task even more difficult due to the runtime system: TM applications can behave differently depending on the STM system configuration. As a result, it is not trivial to predict a suitable thread mapping strategy for a specific TM application, STM system configuration and platform.

We proposed different adaptive strategies that aim at inferring suitable thread mappings for TM applications. All strategies were implemented in an STM system, making them transparent to the TM applications. Our results showed that adaptive thread mapping is important and can considerably improve the performance of TM applications. Overall, ML-based strategies achieved better performance gains than single metric-based ones. We achieved performance improvements over the Linux default scheduling strategy of up to 64.4% on a set of synthetic applications and 16.5% on the standard STAMP benchmark suite.

This research can be extended in some directions. Recent architectures and compilers supporting TM will broaden its audience and will also significantly contribute to the appearance of new real-world TM applications. In addition, it is expected that those new architectures will adopt the Non-Uniform Memory Access (NUMA) design. Thus, a medium- to long-term research possibility is to apply our adaptive strategies to those new architectures and applications. However, the way data is allocated and distributed among the memory banks of NUMA platforms also influences the overall performance of applications. Because of that, it would be necessary to consider not only thread mapping but also memory allocation policies to make better use of the NUMA memory banks.

References

[1] J. Larus, C. Kozyrakis, Transactional Memory: Is TM the Answer for Improving Parallel Programming?, Communications of the ACM 51 (2008) 80–88.
[2] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel, K. Yelick, A View of the Parallel Computing Landscape, Communications of the ACM 52 (2009) 56–67.
[3] M. Herlihy, N. Shavit, The Art of Multiprocessor Programming, Morgan Kaufmann, Burlington, USA, 2008.
[4] A. Discolo, T. Harris, S. Marlow, S. P. Jones, S. Singh, Lock Free Data Structures Using STM in Haskell, Functional and Logic Programming 3945 (2006) 65–80.
[5] K.-C. Tai, Definitions and Detection of Deadlock, Livelock, and Starvation in Concurrent Programs, in: International Conference on Parallel Processing (ICPP), volume 2, IEEE Computer Society, Raleigh, USA, 1994, pp. 69–72.
[6] T. Harris, J. Larus, R. Rajwar, Transactional Memory: Synthesis Lectures on Computer Architecture, volume 5, Morgan & Claypool Publishers, Madison, USA, 2nd edition, 2010.
[7] M. Couceiro, P. Romano, Where Does Transactional Memory Research Stand and What Challenges Lie Ahead?, ACM SIGOPS Operating Systems Review 46 (2012) 87–92.
[8] M. Castro, K. Georgiev, V. Marangonzova-Martin, J.-F. Méhaut, L. G. Fernandes, M. Santana, Analysis and Tracing of Applications Based on Software Transactional Memory on Multicore Architectures, in: Euromicro International Conference on Parallel, Distributed and Network-Based Computing (PDP), IEEE Computer Society, Ayia Napa, Cyprus, 2011, pp. 199–206.
[9] S. Hong, S. H. K. Narayanan, M. T. Kandemir, Ö. Öztürk, Process Variation Aware Thread Mapping for Chip Multiprocessors, in: Design, Automation & Test in Europe (DATE), IEEE Computer Society, Nice, France, 2009, pp. 821–826.
[10] M. Castro, L. F. W. Góes, C. P. Ribeiro, M. Cole, M. Cintra, J.-F. Méhaut, A Machine Learning-Based Approach for Thread Mapping on Transactional Memory Applications, in: High Performance Computing Conference (HiPC), IEEE Computer Society, Bangalore, India, 2011, pp. 1–10.
[11] M. Castro, Improving the Performance of Transactional Memory Applications on Multicores: A Machine Learning-Based Approach, Ph.D. thesis, Université de Grenoble, Grenoble, 2012.


[12] C. Huang, O. Lawlor, L. V. Kalé, Adaptive MPI, in: International Workshop on Languages and Compilers for Parallel Computing (LCPC), volume 2958 of Lecture Notes in Computer Science (LNCS), Springer, Texas, USA, 2003, pp. 306–322.
[13] X. Chen, S. Long, Adaptive Multi-versioning for OpenMP Parallelization via Machine Learning, in: International Conference on Parallel and Distributed Systems (ICPADS), IEEE Computer Society, Shenzhen, China, 2009, pp. 907–912.
[14] Q. Wang, S. Kulkarni, J. Cavazos, M. Spear, Towards Applying Machine Learning to Adaptive Transactional Memory, in: ACM SIGPLAN Workshop on Transactional Computing (TRANSACT), ACM, San Jose, USA, 2011.
[15] M. Castro, L. F. W. Góes, L. G. Fernandes, J.-F. Méhaut, Dynamic Thread Mapping Based on Machine Learning for Transactional Memory Applications, in: International European Conference on Parallel and Distributed Computing (Euro-Par), volume 7484 of Lecture Notes in Computer Science (LNCS), Springer-Verlag, Rhodes Island, Greece, 2012, pp. 465–476.
[16] C. C. Minh, J. Chung, C. Kozyrakis, K. Olukotun, STAMP: Stanford Transactional Applications for Multi-Processing, in: IEEE International Symposium on Workload Characterization (IISWC), IEEE Computer Society, Seattle, USA, 2008, pp. 35–46.
[17] J. R. Quinlan, Induction of Decision Trees, Machine Learning 1 (1986) 81–106.
[18] R. Agrawal, R. Srikant, Fast Algorithms for Mining Association Rules in Large Databases, in: International Conference on Very Large Data Bases (VLDB), Morgan Kaufmann Publishers Inc., Santiago de Chile, Chile, 1994, pp. 487–499.
[19] P. Felber, C. Fetzer, T. Riegel, Dynamic Performance Tuning of Word-Based Software Transactional Memory, in: ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), ACM, Salt Lake City, USA, 2008, pp. 237–246.
[20] F. Broquedis, J. Clet-Ortega, S. Moreaud, B. Goglin, G. Mercier, S. Thibault, hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications, in: Euromicro International Conference on Parallel, Distributed and Network-Based Computing (PDP), IEEE Computer Society, Pisa, Italy, 2010, pp. 180–186.
[21] D. Terpstra, H. Jagode, H. You, J. Dongarra, Collecting Performance Data with PAPI-C, in: Tools for High Performance Computing, Springer-Verlag, 2010, pp. 157–173.
[22] S. Hong, T. Oguntebi, J. Casper, N. Bronson, C. Kozyrakis, K. Olukotun, EigenBench: A Simple Exploration Tool for Orthogonal TM Characteristics, in: IEEE International Symposium on Workload Characterization (IISWC), IEEE Computer Society, Atlanta, USA, 2010, pp. 1–11.
[23] H. Chen, W. Chen, J. Huang, B. Robert, H. Kuhn, MPIPP: An Automatic Profile-Guided Parallel Process Placement Toolset for SMP Clusters and Multiclusters, in: International Conference on Supercomputing (ICS), ACM, Cairns, Australia, 2006, pp. 353–360.
[24] J. Zhang, J. Zhai, W. Chen, W. Zheng, Process Mapping for MPI Collective Communications, in: International European Conference on Parallel and Distributed Computing (Euro-Par), volume 5704 of Lecture Notes in Computer Science (LNCS), Springer-Verlag, Ischia, Italy, 2009, pp. 81–92.
[25] M. Diener, F. L. Madruga, E. R. Rodrigues, M. A. Z. Alves, J. Schneider, P. O. A. Navaux, H.-U. Heiss, Evaluating Thread Placement Based on Memory Access Patterns for Multi-core Processors, in: IEEE International Conference on High Performance Computing and Communications (HPCC), IEEE Computer Society, Melbourne, Australia, 2010, pp. 491–496.
[26] E. H. M. da Cruz, M. A. Z. Alves, A. Carissimi, P. O. A. Navaux, C. P. Ribeiro, J.-F. Méhaut, Using Memory Access Traces to Map Threads and Data on Hierarchical Multi-core Platforms, in: IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), IEEE Computer Society, 2011, pp. 551–558.
[27] D. Grewe, M. F. P. O'Boyle, A Static Task Partitioning Approach for Heterogeneous Systems Using OpenCL, in: International Conference on Compiler Construction (CC), Springer-Verlag, Saarbrücken, Germany, 2011, pp. 286–305.
[28] G. Tournavitis, Z. Wang, B. Franke, M. F. P. O'Boyle, Towards a Holistic Approach to Auto-Parallelization: Integrating Profile-Driven Parallelism Detection and Machine-Learning Based Mapping, ACM SIGPLAN Notices 44 (2009) 177–187.
[29] Z. Wang, M. F. P. O'Boyle, Mapping Parallelism to Multi-cores: A Machine Learning Based Approach, ACM SIGPLAN Notices 44 (2009) 75–84.
[30] D. Rughetti, P. D. Sanzo, B. Ciciani, F. Quaglia, Machine Learning-Based Self-Adjusting Concurrency in Software Transactional Memory Systems, in: IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), IEEE Computer Society, Washington, USA, 2012, pp. 278–285.


*Author Biography & Photograph

Márcio Castro is currently a postdoctoral researcher at the Federal University of Rio Grande do Sul (UFRGS), Brazil. He received his Ph.D. degree from the University of Grenoble in 2012, and his M.Sc. and B.Sc. degrees in Computer Science from PUCRS in 2009 and 2006, respectively. In 2006, he earned an honor distinction (summa cum laude) from PUCRS and the Best Student Award from the Brazilian Computer Society (SBC). His research interests include transactional memory, thread and memory affinity, multicore and manycore programming, and parallel scientific applications. More information about his current research activities can be found at http://www.marciocastro.com

Luís Fabrício W. Góes is currently an associate professor of computer science at PUC Minas in Brazil. He received his Ph.D. degree from the University of Edinburgh in 2012, his M.Sc. degree in Electrical Engineering from PUC Minas in 2004 and his B.Sc. in Computer Science from PUC Minas in 2002. He is the leader of the Parallel Programming Team (PART) at PUC Minas and a reviewer of the Journal of Parallel and Distributed Computing (JPDC). His research interests include parallel programming patterns, software transactional memory and parallel job scheduling.

Jean-François Méhaut is a Professor of Computer Science at the Université Joseph Fourier (UJF) since 2003. He currently holds a research position at CEA, on secondment from UJF. His current research includes embedded systems as well as all aspects of high performance computing, including runtime systems, multithreading and memory management in NUMA multiprocessors, multicore and hybrid programming. He has spent one year at the Argonne National Laboratory of the Department of Energy (DOE, Illinois, USA). He is involved in the Mont-Blanc FP7 European project to explore new approaches for exascale computing based on low power and embedded processors.