Exploring performance and energy tradeoffs for irregular applications: A case study on the Tilera many-core architecture

Ajay Panyala∗, Daniel Chavarría-Miranda, Joseph B. Manzano, Antonino Tumeo, Mahantesh Halappanavar

Advanced Computing, Mathematics and Data Sciences Division, Pacific Northwest National Laboratory, 902 Battelle Blvd., Richland, WA 99352, USA
Highlights

• Optimizing irregular applications for modern many-core architectures is challenging.
• Study energy and performance efficiency on Tilera using two irregular applications.
• Using auto-tuning, we explore the optimization design space along three dimensions.
• Show whole-node energy savings and performance improvements of up to 49.6% and 60%.
Article info

Article history: Received 20 June 2015; Received in revised form 19 May 2016; Accepted 16 June 2016; Available online xxxx

Keywords: Auto tuning; Irregular applications; Community detection; Sparse conjugate gradient; Energy optimization; Many-core processors; Data layouts
Abstract

High performance, parallel applications with irregular data accesses are becoming a critical workload class for modern systems. In particular, the execution of such workloads on emerging many-core systems is expected to be a significant component of applications in data mining, machine learning, scientific computing and graph analytics. However, power and energy constraints limit the capabilities of individual cores, memory hierarchy and on-chip interconnect of such systems, thus leading to architectural and software trade-offs that must be understood in the context of the intended application's behavior. Irregular applications are notoriously hard to optimize given their data-dependent access patterns, lack of structured locality and complex data structures and code patterns. We have ported two irregular applications, graph community detection using the Louvain method (Grappolo) and high-performance conjugate gradient (HPCCG), to the Tilera many-core system and have conducted a detailed study of platform-independent and platform-specific optimizations that improve their performance as well as reduce their overall energy consumption. To conduct this study, we employ an auto-tuning based approach that explores the optimization design space along three dimensions: memory layout schemes, GCC compiler flag choices and OpenMP loop scheduling options. We leverage MIT's OpenTuner auto-tuning framework to explore and recommend energy optimal choices for different combinations of parameters. We then conduct an in-depth architectural characterization to understand the memory behavior of the selected workloads. Finally, we perform a correlation study to demonstrate the interplay between the hardware behavior and application characteristics. Using auto-tuning, we demonstrate whole-node energy savings and performance improvements of up to 49.6% and 60% relative to a baseline instantiation, and up to 31% and 45.4% relative to manually optimized variants.
∗ Corresponding author.
E-mail addresses: [email protected] (A. Panyala), [email protected] (D. Chavarría-Miranda), [email protected] (J.B. Manzano), [email protected] (A. Tumeo), [email protected] (M. Halappanavar).
http://dx.doi.org/10.1016/j.jpdc.2016.06.006

1. Introduction
Future computing systems will be significantly power constrained. Processors' power densities have reached new highs and cannot grow anymore without creating concerns for heat dissipation. Because of the cost of energy, systems will need to be power capped [5].
Fig. 1. An overview of the auto-tuning process explored in this paper to optimize for energy and performance.
Many-core architectures, integrating a large number of low-power cores, appear to be a promising solution to increase performance (by increasing thread-level parallelism) while remaining within the power constraints. Optimizing applications simultaneously for energy and performance on these architectures is a complex problem. A programmer needs to extract sufficient parallelism, finely orchestrate communication and data transfers, maximize locality, and explore a variety of other parameters to obtain the highest performance.

Parallel applications with irregular data accesses are becoming major workloads on high performance computing systems. These irregular applications pertain to areas such as data mining, machine learning, scientific computing, and graph analytics. Irregular applications are notoriously hard to optimize, given their data-dependent access patterns, lack of structured locality, and complex data structures and code patterns [11,27]. Performance- and energy-efficient implementation of these kernels on modern, energy-efficient multicore and many-core platforms is therefore an important and challenging problem. Power and energy constraints limit the capabilities of individual cores, the memory hierarchy, and the on-chip interconnect of many-core designs, thus leading to architectural and software trade-offs that must be understood in the context of the intended application's behavior. Although increasing performance usually reduces overall energy dissipation, the dynamic behavior of a program may dramatically change the (instantaneous) power consumption, leading to non-linear, or not easily understandable, relations.

In this paper, we have conducted a detailed study on how to optimize two representative irregular applications to achieve high-performance and energy-efficient execution on the Tilera many-core platform. The two applications are graph clustering or community detection using a parallel Louvain method (Grappolo), and a sparse conjugate gradient (HPCCG) based solver. The Tilera many-core platforms have been designed for low-power computing through a combination of conservative, in-order processing core designs and flexible network-on-chip interconnections. In particular, we target the TILE-Gx36 processor, which integrates 36 64-bit simple cores interconnected through six different Network-on-Chip (NoC) fabrics. The 36 cores can access two distinct memory controllers and exploit a configurable second-level cache (dubbed Distributed Dynamic Cache), which can act as a private coherent level-2 and/or a shared level-3 cache. Data accesses and data structure layout have a significant impact on the performance, and therefore the energy dissipation, of the processor because of the communication through the NoCs and the distributed cache. Irregular applications are a prime candidate for this study because of their complex memory patterns. We demonstrate the performance/energy trade-offs for irregular applications and their impact on the memory hierarchy on the Tilera many-core platform.
Our recent work [8] details how we ported OpenMP multithreaded versions of the community detection application and the sparse conjugate gradient kernel to the TILE-Gx36 platform and performed a set of "best practices" manual optimizations. These include platform- and application-specific data structure optimizations, as well as the careful selection of the scheduling policies for OpenMP parallel regions to reduce the load imbalance, induced by irregularity, across threads. In addition to these manual optimizations, we explored the design space along three dimensions, as illustrated in Fig. 1: 32 memory layout schemes (four per data structure), about 300 compiler flag choices, and approximately 200 OpenMP loop scheduling options. That work demonstrated the rationale for the various optimizations and, given that manual exploration of the design space is infeasible (32 × 300 × 200 combinations), we extended MIT's OpenTuner auto-tuning framework [2] to explore and recommend energy-optimal choices for different combinations of parameters.

Fig. 1 presents an overview of our auto-tuning approach. Given a distinct configuration selected by the OpenTuner search space driver, our transformations rewrite the source code of the application to conform to the requested configuration and compile the code with the selected GCC optimization flags. The application binary is then executed on the actual Tilera system. The energy and time are reported back to the OpenTuner framework as feedback for the design space exploration process. We have configured the OpenTuner framework to minimize energy consumption and maximize performance simultaneously. We adapted a Tilera-provided external power sensor utility (tile-btk), with a resolution of ≈250 ms, to drive our manual and automatic energy optimization.

Building on [8], we have expanded the study with a larger collection of input datasets (22 datasets in total) for community detection. The expanded datasets allow us to characterize the behavior of a large number of graph types under our auto-tuning framework. Using selected results (based on performance and/or the input graph's characteristics) from the datasets, we conducted an in-depth architectural characterization. We analyzed performance counters to understand the memory behavior of the selected workloads. Finally, we performed a correlation study to gain insight into the interplay between the hardware behavior (represented by the performance counters), the datasets' characteristics (number of vertices, number of edges, degree per vertex, etc.), performance and energy. Our results demonstrate whole-node energy savings and performance improvements of up to 49.6% and 60% relative to a baseline instantiation, and up to 31% and 45.4% relative to manually optimized variants. We analyze the various energy/performance trade-offs, correlating them with the application behavior on the processor. In this work, we make the following contributions:
(i) An expanded energy-centric optimization study of two irregular applications on the Tilera many-core low-power platform;
(ii) Automatic exploration of several energy-oriented optimization parameters relevant to irregular applications;
(iii) Quantification of the impact of non-determinism on the overall trends in performance and energy consumption for irregular workloads, using hardware performance counters, time, and energy as metrics;
(iv) Analysis of the interplay between the underlying architecture's characteristics and the properties of the input sets.

Building upon our experience, we expand the lessons learned in previous studies with a more in-depth look at application behavior and at how different workload characteristics affect the performance and energy optimization process. We believe that the lessons learned from our study will influence how irregular applications can be optimized for performance and energy on Tilera and similar platforms.

The paper is organized as follows: Section 2 provides details on the TILE-Gx36 processor and the two applications; Section 3 discusses the optimizations and the design space exploration; Section 4 presents the experimental results; Section 5 discusses related work; and, finally, Section 6 concludes the paper.
2. Tilera platform and application workload

This section provides background information on our target platform, the Tilera TILE-Gx36 processor, and the two reference irregular applications.

2.1. Tilera many-core architecture

The Tilera TILE-Gx36 is a many-core processor composed of 36 processing cores (called "tiles" in Tilera's terminology) interconnected through multiple Network-on-Chip (NoC) fabrics in a two-dimensional mesh topology. Each tile includes a processor engine, a cache engine and a switch engine. The processor engine is a 64-bit, 3-way VLIW unit running at 1.2 GHz. The cache engine includes a private 32 kB, 2-way set-associative L1 data cache, a private 32 kB, direct-mapped instruction cache and a 256 kB, 8-way set-associative L2 cache. Cache lines are 64 bytes across all cache levels. The switch engine connects the tiles to six different NoCs: two networks for memory traffic, a network for coherency, a static message-passing network, a user-programmable dynamic message-passing network, and a dedicated I/O network. The TILE-Gx36 processor integrates two independent DDR3 memory controllers. Memory pages can be assigned to specific memory controllers or striped across them. Our system has 32 GB of memory.

The L2 caches act as a Dynamic Distributed Cache (DDC). When accessing data, if a tile misses in its own L2, it can access the L2 of another tile before going to memory, thus reducing the latency (comparable to accessing an L3 cache) if there is a hit. In this work, we exploit the ability to finely control caching policies. In a Tilera system, each memory address has a home tile. The home tile tracks sharing and coherence information for that address. Other tiles that write to that address must send their coherence updates over the NoC to the home tile in order for their updates to be realized. A programmer can specify different homing policies for memory allocation. There are two principal homing strategies that work at the granularity of an entire memory page:

• Local homing: A particular tile is the home tile for the whole page.
• Hash-for-home: Individual cache lines in a page (64 bytes) are distributed in a round-robin manner to the L2 caches of a set of tiles (by default, all the tiles in the processor).

The default memory allocation policy hashes all pages, except the pages that correspond to the program's stack (allbutstack).

2.1.1. Power & energy monitoring on the Tilera platform

Tilera provides a comprehensive set of tools for software development on their many-core platforms. The software runs on an x86 host and is configured for cross-compilation to the Tilera VLIW instruction format. The tile-monitor tool and the tile-btk interface are two key components of this environment. The tile-monitor tool enables uploading code or data from the x86 host to the nodes (and downloading it back), as well as executing applications on the Tilera platforms. The tile-btk (or Board Test Kit) interface provides a number of monitoring services on the x86 host, intended to analyze the dynamic behavior of a Tilera system. One of the key services provided by tile-btk is power monitoring across five sensors: the 12 V power supply to the Tilera node's motherboard, the line supplying power to the many-core socket, as well as the memory interface, the network ports, and the other I/O ports, including PCI Express. The average sampling time that we have observed using this interface on our system is ≈250 ms. Given this sampling granularity, we focus our efforts on characterizing overall application energy dissipation for the whole platform, rather than on finer-grained effects at the socket or memory interface levels. We also cannot reliably measure (or correlate power measurements to) application details (subroutines, individual loops or code fragments) that do not execute for appreciable time (≫250 ms).
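One way to turn the coarse power samples reported by tile-btk into a whole-node energy figure is trapezoidal integration over time. The sketch below illustrates this computation; the PowerSample structure, function name, and sample values are illustrative and are not part of the Tilera tooling.

// Minimal sketch: estimate energy from coarse power samples (≈250 ms apart)
// via trapezoidal integration. Sample acquisition is abstracted away.
#include <vector>
#include <cstdio>

struct PowerSample {
    double t_sec;   // timestamp in seconds
    double watts;   // instantaneous power reading (e.g., 12 V board supply)
};

double energyJoules(const std::vector<PowerSample>& s) {
    double e = 0.0;
    for (std::size_t i = 1; i < s.size(); ++i)
        e += 0.5 * (s[i].watts + s[i - 1].watts) * (s[i].t_sec - s[i - 1].t_sec);
    return e;
}

int main() {
    // With a ~250 ms sampling period, only application phases much longer than
    // the period can be attributed reliably (as noted in Section 2.1.1).
    std::vector<PowerSample> trace = {{0.00, 41.2}, {0.25, 44.8}, {0.50, 43.1}};
    std::printf("estimated energy: %.2f J\n", energyJoules(trace));
    return 0;
}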
2.2. Parallel community detection (Grappolo)

Consider a weighted graph G = (V, E) with vertex set V, edge set E, and a weight function w : E → R that associates real-valued non-negative weights with edges. The process of grouping vertices into disjoint sets, such that vertices in the same set are highly correlated among themselves but sparsely correlated with vertices in other sets, is called graph clustering or community detection. Although community detection is an NP-hard problem, several fast and efficient heuristics exist. One such heuristic is the Louvain method [6], which is based on optimizing a metric called modularity [21].

The Louvain algorithm progresses in iterations and phases. Within a phase, the algorithm starts by assigning each vertex to a community of its own. The algorithm then iterates until the gain in modularity is below a user-specified threshold. In each iteration, the algorithm processes each vertex and makes a greedy decision, based on the gain in modularity, on whether the vertex should remain in its current community or migrate to the community of one of its neighbors. At the end of a given phase, a new graph is constructed by collapsing each community from the last iteration of the previous phase into a meta-vertex and adding edges to reflect the strengths of intra-community (self-edges) and inter-community (edges between two meta-vertices) connections. This new graph becomes the input to the next phase, and the process is repeated until a user-specified threshold gain in modularity is obtained.

The Louvain method is highly irregular, with data-dependent accesses in the traversal and neighbor-based exploration of communities in the graph. The construction of the next-level graph (collapsing) is also irregular, since it follows the topology of the original graph and the naturally induced communities found in it. We designed and implemented a version of the Louvain method for Tilera that we call Grappolo [7]. We take this version as a basis and further optimize it for energy.

Fig. 2(a) illustrates the key access patterns to the graph and community data structures that induce irregularity in this application. The letters represent the node identities in the graph, while the numbers represent the community identities to which each node belongs. The lines represent edges that connect node A to its neighbors. Nodes E and S belong to the same community (2), and all other nodes are in different communities. This pattern generates irregular accesses when the algorithm explores nodes by communities and not by neighborhoods.
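To make the community-indexed access pattern of Fig. 2(a) concrete, the sketch below shows one simplified greedy sweep over a CSR graph. This is not Grappolo's implementation: the modularity-gain test is replaced by a simple heaviest-neighboring-community rule and the concurrent-update protocol is omitted; the point is only the data-dependent accesses through the community array.

// Simplified Louvain-style sweep over a CSR graph (illustrative only).
#include <vector>
#include <unordered_map>
#include <cstdint>

struct CSRGraph {
    std::vector<int64_t> rowPtr;   // edge-list start per vertex, size |V|+1
    std::vector<int64_t> adj;      // destination vertex per edge
    std::vector<double>  wt;       // weight per edge
};

void greedySweep(const CSRGraph& g, std::vector<int64_t>& comm) {
    const int64_t n = static_cast<int64_t>(g.rowPtr.size()) - 1;
    #pragma omp parallel for schedule(dynamic, 64)   // per-vertex work is irregular
    for (int64_t v = 0; v < n; ++v) {
        std::unordered_map<int64_t, double> wToComm;  // neighboring community -> weight
        for (int64_t e = g.rowPtr[v]; e < g.rowPtr[v + 1]; ++e) {
            // comm[adj[e]] is the data-dependent, community-indexed access
            // highlighted in Fig. 2(a): neighbors map to arbitrary community ids.
            wToComm[comm[g.adj[e]]] += g.wt[e];
        }
        int64_t best = comm[v];
        double bestW = 0.0;
        for (const auto& kv : wToComm)
            if (kv.second > bestW) { bestW = kv.second; best = kv.first; }
        comm[v] = best;   // concurrent updates would need the protocol Grappolo uses
    }
}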
Fig. 2. Sources of application irregularity: (a) community detection; (b) HPCCG.
2.3. High performance conjugate gradient

The High Performance Conjugate Gradient (HPCCG) compact application is part of Sandia National Laboratories' Mantevo benchmark suite [12,4]. This is a set of benchmarks intended to guide co-design processes for upcoming exascale applications and systems, as part of the US Dept. of Energy's (DOE) HPC research activities. HPCCG mimics the finite element generation, assembly and solution for an unstructured-grid physical problem. The domain is a three-dimensional physical box. A sparse matrix representation is used for this problem, with 27 non-zeros per row corresponding to the interaction of the finite elements in 3D space. The computational focus of HPCCG is on the solution of a linear system through a sparse conjugate gradient solver. Most of the execution time is spent in three key kernels (in order of increasing importance): dense vector dot product, element-wise scaled dense vector sum, and sparse matrix dense vector product.

In contrast to community detection, HPCCG is "less irregular", in the sense that many data access operations in its code are actually dense, contiguous, predictable accesses to one-dimensional vectors. However, the sparse matrix product induces irregularity into the accesses of the dense vector used in the product x = Ab. HPCCG stores its sparse matrix using the standard Compressed Sparse Row (CSR) format, in which all non-zeros are stored contiguously in a single array, together with an array of the column indices for each non-zero, and a row-indexed array that indicates the beginning and end of the non-zeros for each row. Given the 3D physical nature of the problem being represented by the HPCCG code, the irregularity present in the data accesses is predictable and repeatable across rows of the sparse matrix. Fig. 2(b) illustrates the irregular access pattern induced on the dense right-hand-side (RHS) vector, due to the indirection in the column index array, to match each non-zero to the corresponding element in the array. We exploit these characteristics in our port of the HPCCG application to the Tilera platform.
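The kernel at the heart of HPCCG, a CSR sparse matrix times dense vector product, can be sketched as follows. Variable names are illustrative rather than HPCCG's own; the indirect access into the right-hand-side vector is the source of irregularity shown in Fig. 2(b).

// Sketch of a CSR sparse matrix-vector product (x = A*b) with OpenMP.
#include <vector>
#include <cstdint>

struct CSRMatrix {
    std::vector<int64_t> rowStart;  // start of each row's non-zeros, size nrows+1
    std::vector<int64_t> colIdx;    // column index per non-zero
    std::vector<double>  val;       // non-zero values (27 per row for HPCCG's stencil)
};

void spmv(const CSRMatrix& A, const std::vector<double>& b, std::vector<double>& x) {
    const int64_t nrows = static_cast<int64_t>(A.rowStart.size()) - 1;
    #pragma omp parallel for schedule(static)   // rows carry near-uniform work in HPCCG
    for (int64_t i = 0; i < nrows; ++i) {
        double sum = 0.0;
        for (int64_t k = A.rowStart[i]; k < A.rowStart[i + 1]; ++k)
            sum += A.val[k] * b[A.colIdx[k]];   // indirect access into b:
                                                // the irregular pattern of Fig. 2(b)
        x[i] = sum;
    }
}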
2.4. Non-determinism in parallel irregular applications

Parallel applications are difficult to develop, debug and optimize due to the non-determinism introduced by concurrent execution. This can manifest as correctness problems, such as data races during debugging, or as performance variability during execution. Regular application workloads, such as dense linear algebra and signal processing algorithms, offer the desirable property that the expected outputs can be reproduced deterministically across multiple runs with the same input. Runs using different numbers of threads lead to the same deterministic result. Parallel irregular applications have less deterministic behavior by nature: which thread gets which section of the workload, and in which relative order it gets processed, can affect the result (e.g., the order in which locks/atomic operations get resolved). This variability has non-trivial performance and energy dissipation implications: it is difficult to pinpoint exactly what the observed performance (or energy) of a particular input and system configuration should be. Furthermore, these values can vary between runs with exactly the same configuration and input.

The community detection application is an example of a non-deterministic parallel application: the resulting communities and, therefore, the observed performance and energy dissipated can vary between runs with the same number of threads and inputs. The entire spectrum of results is equally valid (given the modularity optimization), but the differences are non-trivial. On the other hand, the HPCCG application exhibits differences in its numerical results due to the interaction of concurrency and the properties of IEEE floating-point numbers: summations of numbers of widely different magnitudes are less accurate than summations of numbers in tight intervals. However, the structure of the sparse matrix and dense vectors does not change over time, in contrast to community detection, thus making HPCCG sit somewhere in between fully regular and highly irregular codes. For these reasons, we have designed a detailed experimental methodology that requires running each application a large number of times to identify upper and lower bounds for performance and energy, and to enable a correct analysis of the results.

3. Energy optimizations for irregular applications

This section focuses on the different optimizations used on our test applications. These optimizations are broadly classified into architecture-specific data layouts, OpenMP parallel loop scheduling, and GCC compiler options.

3.1. Tilera-specific optimizations

To optimize the selected irregular applications on Tilera, we used OpenMP and platform-specific extensions for memory layout control and private heap management, thread synchronization and atomic operations, as well as pinning logical OpenMP software threads to specific hardware cores on the chip. We have also designed flexible data layouts that take advantage of the platform's capabilities for locality control.
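As a minimal illustration of pinning logical OpenMP threads to fixed cores, the sketch below uses the standard Linux affinity interface; the Tilera environment provides its own affinity utilities, which are not reproduced here, and depending on the OpenMP runtime a similar effect can often be obtained through environment variables such as OMP_PROC_BIND.

// Sketch: bind OpenMP thread i to core i so that data homing decisions stay
// meaningful across the run (assumes a Linux/glibc environment, g++ -fopenmp).
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <omp.h>
#include <pthread.h>
#include <sched.h>

void pin_threads_round_robin(int ncores) {
    #pragma omp parallel
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(omp_get_thread_num() % ncores, &set);   // thread i -> core i
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }
}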
Table 1
Memory layout schemes used in this work.

Scheme      | Description                                                                     | Comments
Hashed      | Hash-for-home: data distributed in round-robin blocks of 64 bytes across tiles | Limited data locality
Local       | Local-homing for private per-thread data, hashed for everything else           | Exploits locality where it is clearly available
Segmented   | Principal arrays partitioned into np chunks, private data uses local-homing    | Chunks are rounded up to the 64 kB page size; memory accesses are NOT aligned to match page boundaries
Partitioned | Sparse matrix rows are partitioned uniformly across the number of cores        | Local-homing is used for each partition (HPCCG only)
Fig. 3. Data layouts for 4 tiles: (a) segmented; (b) partitioned.
3.1.1. Memory layout schemes on Tilera

The primary data structures used in community detection and HPCCG are arrays and hash tables. Graph data in community detection is represented using a Compressed Sparse Row (CSR) format with two arrays: a vertex pointer array and an edge array. The vertex pointer array indicates the beginning position of the list of edges for each vertex in the edge array. Each element in the edge array stores the destination vertex and weight of the edge (the source vertex is implicit). Other auxiliary arrays are used to store the community membership and the degree of each vertex. Hash tables are used to build the transformed graphs required for each phase. The sparse matrix in HPCCG is also stored in the CSR format, using a vector of non-zeros, a vector of column indices for each non-zero, and a vector indicating the starting position of the non-zeros and column indices of each row in the matrix. Other arrays store the dense vectors corresponding to the CG method. The local, segmented, hashed and partitioned layouts are used for the CSR and auxiliary arrays, as well as the hash tables. Table 1 summarizes the different data layouts explored in our design.

We implemented the Tilera versions of the applications in C++, taking advantage of the flexibility of the data structures in C++'s Standard Template Library (STL) to use custom memory allocators [14]. We designed two Tilera-specific memory allocators that are plug-and-play compatible with the standard C++ allocator to support local and segmented allocation. Fig. 3(a) illustrates a segmented data layout, while Fig. 3(b) illustrates the use of the local allocator to create a partitioned layout across 4 cores/threads.
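A minimal sketch of an STL-compatible allocator in the spirit of the allocators described above follows. Page homing is deliberately abstracted: on Tilera, the pages backing each chunk would additionally be given a local or hash-for-home attribute through the platform's allocation API, which is not reproduced here; this sketch only guarantees page-aligned, page-granular chunks, matching the 64 kB granularity at which homing policies apply.

// Sketch: a C++11 allocator that hands out 64 kB-aligned, page-granular chunks.
#include <cstdlib>
#include <new>
#include <vector>

template <typename T>
struct PageAlignedAllocator {
    using value_type = T;

    PageAlignedAllocator() = default;
    template <typename U> PageAlignedAllocator(const PageAlignedAllocator<U>&) {}

    T* allocate(std::size_t n) {
        constexpr std::size_t kPage = 64 * 1024;               // Tilera page granularity
        std::size_t bytes = ((n * sizeof(T) + kPage - 1) / kPage) * kPage;
        void* p = nullptr;
        if (posix_memalign(&p, kPage, bytes) != 0) throw std::bad_alloc();
        // A Tilera-specific call would go here to request local or hash-for-home
        // homing for the pages in [p, p + bytes) (platform API not shown).
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) noexcept { std::free(p); }
};

template <typename T, typename U>
bool operator==(const PageAlignedAllocator<T>&, const PageAlignedAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const PageAlignedAllocator<T>&, const PageAlignedAllocator<U>&) { return false; }

// Usage: a per-thread vector whose backing pages can then be homed locally.
using LocalVector = std::vector<double, PageAlignedAllocator<double>>;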
3.2. Auto-tuning framework and parameters

Mapping irregular applications to systems with a large number of tunable parameters, such as Tilera, involves a number of software implementation decisions that are not easy to determine while the code is developed. One of these parameters is the data layout to use for key data structures in the application. The irregular access patterns and the relatively large number of data structures make this selection even more difficult. Another key parameter is the scheduling policy for OpenMP parallel loops that operate on unbalanced iterations, which are typical in irregular data processing (i.e., the work depends on the degree of the graph node or the number of non-zeros in the row). The software developer will make the best decision possible, given previous experience with similar codes. However, the interaction of multiple data structures with possibly different layouts, as well as multiple parallel loops with different workloads, makes the selection much more difficult. To further complicate these decisions, if we intend to optimize consumed energy rather than execution time, the developer is not likely to make an optimal choice of parameters.

For this reason, we have developed an automated procedure to explore the best combination of parameters for data structure layouts, OpenMP scheduling policies and blocking factors for parallel loops, as well as GCC compiler settings [15]. We have developed our procedures on top of MIT's OpenTuner auto-tuning framework [2], which uses state-of-the-art design space exploration techniques to solve multi-objective optimization problems. The key design space exploration strategy used by OpenTuner is the multi-armed bandit with sliding window, area-under-the-curve credit assignment, which is deemed to be highly effective for complex, multi-objective optimization.

As shown in Fig. 1, we have extended the OpenTuner framework with automated code-rewriting transformations that can realize specific combinations of data layouts and OpenMP scheduling parameter settings. Given a distinct configuration selected by the OpenTuner search space driver, our transformations rewrite the source code of the application to conform to the requested configuration and compile the code with the selected GCC compiler flags. The application binary is then executed on the actual Tilera system. We use the external power monitoring system described in Section 2.1.1 to obtain the energy consumed by the specific configuration. The energy and time are reported back to the OpenTuner framework as feedback for the design space exploration process. We have configured the OpenTuner framework to minimize energy consumption and maximize performance simultaneously.

Each application has OpenMP parallel regions and data structures that can be targeted by our optimization process.
Table 2
Input characteristics for the community detection datasets used in our experimental study. The first 14 datasets (up to hugetrace) performed best with solutions obtained by using uk-2002 as the training set; the remaining 8 datasets (from cnr onwards) performed best with solutions obtained by using MG2 as the training set.

Input graph      | Num. vertices | Num. edges  | Max. degree | Avg. degree | #Iter/#Phases | Num. clusters
gsm              | 589,446       | 10,584,739  | 105         | 35.92       | 52/6          | 36
coPapersDBLP     | 540,486       | 15,245,729  | 3,299       | 56.42       | 31/7          | 118
af_shell10       | 1,508,065     | 25,582,130  | 34          | 33.93       | 131/11        | 79
StocF            | 1,465,137     | 9,770,126   | 188         | 13.34       | 86/7          | 44
Hook_1498        | 1,498,023     | 29,709,711  | 92          | 39.67       | 76/6          | 42
CubeCoupDT6      | 2,164,760     | 62,520,692  | 67          | 57.77       | 78/6          | 38
netherlands      | 2,216,688     | 2,441,238   | 7           | 2.21        | 60/10         | 628
NLR              | 4,163,763     | 12,487,976  | 20          | 6.00        | 63/9          | 166
hollywood09      | 1,139,905     | 56,375,711  | 11,467      | 98.92       | 27/5          | 12,913
channel          | 4,802,000     | 42,681,372  | 18          | 17.78       | 98/7          | 55
delaunay         | 16,777,216    | 50,331,601  | 26          | 6.00        | 69/10         | 286
uk-2002          | 18,520,486    | 261,787,258 | 194,955     | 28.28       | 32/7          | 4,162
italy            | 6,686,493     | 7,013,978   | 9           | 2.10        | 320/16        | 1,815
hugetrace        | 16,002,413    | 23,998,813  | 3           | 3.00        | 82/11         | 357

cnr              | 325,557       | 2,738,970   | 18,236      | 16.83       | 32/6          | 195
Geo_1438         | 1,437,960     | 30,859,365  | 56          | 42.93       | 72/6          | 31,417
rggn224          | 16,777,216    | 132,557,200 | 40          | 15.81       | 58/9          | 2,967
soc-LiveJournal1 | 4,847,571     | 68,475,391  | 22,887      | 28.26       | 45/6          | 1,425
roadUSA          | 23,947,347    | 28,854,312  | 9           | 2.41        | 64/11         | 391
hugeBubbles20    | 21,198,119    | 31,790,179  | 3           | 3.00        | 87/11         | 1,251
MG2              | 11,005,829    | 674,142,381 | 5,466       | 122.51      | 20/5          | 997
europe_osm       | 50,912,018    | 54,054,660  | 13          | 2.13        | 331/15        | 6,646
For community detection and HPCCG, we targeted their OpenMP parallel loops (six in each application) and their key data structures (eight for community detection and ten for HPCCG). For the OpenMP scheduling policies and iteration blocking factors, we let the auto-tuner select from three policies (static, dynamic and guided) and allow the blocking factor to vary from 1 to 2048 (in powers-of-two increments), in addition to the policy-default block size. Beyond these sets of optimization elements, the GCC compiler for Tilera provides a total of 335 configurable optimization flags available from the command line (as of version 4.8.2), 154 of which require additional parameters, for example a loop unrolling factor. We stress that the total number of combinations of variants for loops and data structures is already non-trivial for a developer to optimize manually, even before considering the overwhelming number of compiler options.

Through auto-tuning, we explore 15 distinct optimization orderings, which include all possible combinations of the three types of optimizations (layout, GCC, OpenMP). The auto-tuned versions use the best automatically derived configuration for the set of optimizations selected. For example, the ordering OMP-GCC-Layout optimizes for all three, in that particular order. In this case, the auto-tuner first searches for the best OpenMP optimization configuration that minimizes the overall energy used by the application and simultaneously improves performance. Next, the best OpenMP configuration found is fixed, and the auto-tuner proceeds to find the best combination of the 335 configurable GCC compiler flags that further minimizes energy and improves performance. Finally, the best OpenMP and GCC configurations found so far are fixed, and the auto-tuner searches for the best data structure layout that minimizes energy and maximizes performance. Different orderings result in different optimization configurations. These 15 auto-tuned optimization orderings also include the individual optimizations by themselves (for example, GCC-Only). Section 4 demonstrates the impact of each of these orderings and selects the ordering that has the most impact on minimizing energy and maximizing performance as the best case solution.
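The OpenMP dimension of this search space can be illustrated with a small sketch. Our framework realizes each configuration by rewriting the schedule clauses in the source before compilation; the code below instead uses OpenMP's runtime schedule interface to show the same policy/chunk-size parameter space in a self-contained way (the loop body is a placeholder).

// Sketch: one point in the (policy, chunk) search space, selected at run time.
#include <omp.h>
#include <vector>
#include <cstddef>

void tunable_loop(std::vector<double>& work,
                  omp_sched_t policy,   // omp_sched_static / dynamic / guided
                  int chunk)            // e.g., 1..2048 in powers of two
{
    omp_set_schedule(policy, chunk);            // pick the schedule for this run
    #pragma omp parallel for schedule(runtime)  // honor the runtime schedule
    for (std::size_t i = 0; i < work.size(); ++i) {
        work[i] *= 2.0;                         // placeholder for irregular per-iteration work
    }
}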
4. Experimental results

To quantify the impact of the optimizations that our auto-tuning process found, we have conducted an extensive experimental evaluation of the two selected applications, using twenty-two graph datasets for community detection and three problem sizes for HPCCG. Our hardware testbed consists of a TILE-Gx36 node used for all of our experiments, with an external power monitoring driver running on the x86 host. We have used the power sensor corresponding to the 12 V supply to the Tilera node's motherboard, as we intended to capture the effect on whole-node energy consumption (not just the cores). The TILE-Gx36 node runs a custom version of Linux, adapted for Tilera's hardware. The compiler and runtime environment are adapted from GCC 4.8.2 and retargeted for the TILE-Gx's 64-bit VLIW cores.

Table 2 describes the graph inputs we have used for the community detection application. With the exception of MG2, all other inputs were downloaded from the DIMACS10 challenge website [3] and the University of Florida sparse matrix collection [10]. MG2 is constructed from ocean metagenomics data, using the construction procedure described in [29]. Our inputs are selected with the intent to encompass a broad range of graph characteristics, in terms of number of vertices, number of edges and degree distribution. We have used the uk-2002 and MG2 inputs as the training sets for auto-tuning. We selected these two as training sets because of their relatively long runtimes (representative of real-world problems) and because they represent a large class of graphs. For the HPCCG benchmark, we used three problem sizes, 180³, 290³ and 360³, aimed at occupying increasing fractions of the Tilera platform's memory (the size corresponds to the dimensional extent of the physical domain). We used the 180³ problem size as the training set for auto-tuning.

For each training set (uk-2002, MG2 and 180³), we explore 15 optimization orderings through auto-tuning, which include all possible combinations of the three types of optimizations (layout, GCC, OpenMP), as described in Section 3.2. Note that for community detection, we do not auto-tune for layout choices, since the optimal data structure layout is already known from our own previous work on optimizing the community detection code for performance on Tilera [7]. The auto-tuning effort presented in this paper has also confirmed that the layout configuration reported in [7] is indeed always the best choice.
Fig. 4. Community detection: Performance and energy results for inputs run with solutions obtained by using MG2 as the training set (y-axis corresponds to time in seconds or energy in Joules, respectively). Panels: (a) HugeBubbles performance, (b) HugeBubbles energy, (c) MG2 performance, (d) MG2 energy, (e) Europe performance, (f) Europe energy.
To understand the behavioral trends of these non-deterministic, irregular applications, we ran each experiment, consisting of 17 code variants per input (the 15 auto-tuned optimization orderings plus the baseline and manually optimized implementations), a total of 50 times per variant for the training datasets uk-2002 and 180³, and 10 times per variant for all other datasets, including MG2. We also collect the hardware performance counter information (using the Linux perf tool), energy and time metrics for each run. We have discarded from our reported results the time and energy spent in reading input files and pre-processing. The reported results correspond to the execution times and energy spent on the actual application.

In the case of community detection, we present performance and energy results for all 22 inputs. However, we down-select to six representative inputs to conduct a deeper analysis. Three (delaunay, uk-2002, hugetrace) of these six inputs yielded the highest improvements in performance and energy when run with the solutions (i.e., optimization configurations) obtained by using uk-2002 as the training set. These three inputs are
selected from the first 14 inputs shown in Table 2 by considering inputs with more than 15 million vertices and a runtime exceeding 40 s. The remaining three (of the six representative) inputs (hugeBubbles20, MG2, europe_osm) yielded the highest improvements in performance and energy when run with the solutions obtained by using MG2 as the training set. These three inputs were selected from the last eight inputs shown in Table 2 by considering inputs with more than 10 million vertices and a runtime exceeding 100 s.

We present these results in Figs. 4 and 5 for community detection and in Fig. 6 for HPCCG. The x-axis in the figures represents the different optimization configurations, while the y-axis indicates the execution time in seconds or the energy dissipated in Joules. The error bars represent the range of observed execution times and dissipated energy over those experiments.
Fig. 5. Community detection: Performance and energy results for inputs with solutions obtained by using uk-2002 as the training set (y-axis corresponds to time in seconds or energy in Joules, respectively). Panels: (a) uk-2002 performance, (b) uk-2002 energy, (c) delaunay performance, (d) delaunay energy, (e) hugetrace performance, (f) hugetrace energy.
For each dataset, we present 17 configurations: the 15 auto-tuned optimization orderings, plus the baseline implementation (Baseline) and a version manually tuned by an expert developer (Manual). The baseline and manual implementations have been compiled with the "-Ofast" optimization flag. The manual version includes developer-assigned layouts for the key data structures. Moreover, the OpenMP scheduling policies and block sizes have been carefully chosen for this version by an expert developer to maximize performance and minimize energy dissipation for each application.

The corresponding raw numbers for time (in seconds) and energy (in Joules) for the baseline, manual and best auto-tuned implementations are presented in Table 3 for community detection and in Table 4 for HPCCG. For community detection, the best case results (time and energy) presented in the first 14 rows of Table 3 (up to hugetrace) were achieved with the solutions obtained by using uk-2002 as the training set. The best case results presented in the last 8 rows of this table (from the input cnr onwards) were achieved with the solutions obtained by using MG2 as the training set. The distinction
between these two categories of inputs is marked by the separation between the two blocks of rows in the tables. From this point onwards, we refer to the auto-tuned code variant (usually one of the 15 auto-tuned optimization orderings) that results in the least energy dissipation as the "best solution" or "best case".

All runs were executed using the full 36 cores of the Tilera processor. We also experimented with varying the number of cores active for different OpenMP parallel regions, given that not all of them scale equally as more cores are added. However, these runs performed worse and consumed more energy than runs using the full 36 cores of the chip. All parallel regions in both codes continue to scale, albeit not equally, as more cores are added to the execution.

From Figs. 4-6, we observe that the auto-tuned solutions with the highest performance exhibit the lowest energy dissipation. This may appear to be an obvious observation, because in general reducing the execution time should also reduce the energy used for the computation.
Fig. 6. HPCCG kernel performance and energy results for the three problem sizes using 180³ as the training set (y-axis corresponds to time in seconds or energy in Joules, respectively). Panels: (a) 180³ performance, (b) 180³ energy, (c) 290³ performance, (d) 290³ energy, (e) 360³ performance, (f) 360³ energy.
However, this may not be true if certain optimizations, or the use of certain architectural features, generate unexpected peaks in instantaneous power consumption, thus increasing energy dissipation. For the 6 representative inputs used in community detection and the 3 problem sizes used as inputs in HPCCG, our results appear to confirm that this is not the case on the Tilera processor, demonstrating its effectiveness in low-power operation. For the remaining 16 graph inputs (community detection) omitted from the discussion, the best solutions may differ between performance and energy for a given input. This behavior is acceptable, since the actual gain observed between the best solution for performance and the best solution for energy is very small (i.e., within the error margin). Hence, for such an input, choosing either of these two best solutions as the overall solution for optimizing both performance and energy is acceptable. For MG2 (community detection) and the three problem sizes used as inputs in HPCCG,
we observe similar behavior. For these cases, we choose the best solution for energy as the overall solution for optimizing both performance and energy.

The percentage improvements (for both performance and energy) of the best solutions over the baseline implementation are presented in Table 6. Table 7 presents the percentage improvements of the best solutions over the manual implementation for community detection. Finally, Table 8 presents the percentage improvements of the best solutions over the manual and baseline implementations for HPCCG. Tables 6 and 7 also show the difference in percentage improvements (in performance and energy) between the best solution obtained using uk-2002 as the training set and the best solution obtained using MG2 as the training set. As seen in the Difference columns of both tables, for the first 3 graph inputs, the solutions obtained by using uk-2002 as the training set provide up to 5% improvement in performance and up to 5.7% improvement in energy dissipation compared to the solutions obtained by using MG2.
Table 3
Community detection: Time (s) and energy (J) raw numbers for the Baseline, Manual and Best cases. The first 14 datasets (up to hugetrace) performed best with solutions obtained by using uk-2002 as the training set; the remaining 8 datasets (from cnr onwards) performed best with solutions obtained by using MG2 as the training set.

Dataset          | Baseline Time (s) | Baseline Energy (J) | Manual Time (s) | Manual Energy (J) | Best case Time (s) | Best case Energy (J)
gsm              | 3.33   | 170.64    | 2.70   | 142.57    | 2.63   | 133.16
coPapersDBLP     | 4.11   | 201.38    | 3.27   | 162.44    | 3.19   | 158.83
af_shell10       | 4.64   | 237.17    | 3.54   | 188.76    | 3.41   | 164.12
StocF            | 5.97   | 306.79    | 4.45   | 236.94    | 4.25   | 226.77
Hook_1498        | 6.32   | 345.52    | 4.82   | 262.51    | 4.71   | 248.66
CubeCoupDT6      | 11.52  | 603.12    | 8.70   | 477.37    | 8.50   | 448.24
netherlands      | 11.78  | 534.39    | 8.83   | 400.12    | 8.50   | 385.17
NLR              | 17.27  | 854.14    | 11.94  | 605.71    | 11.59  | 579.19
hollywood09      | 16.56  | 800.03    | 13.74  | 673.27    | 13.64  | 649.23
channel          | 25.54  | 1346.51   | 18.92  | 1037.18   | 18.14  | 979.27
delaunay         | 57.56  | 2735.76   | 45.86  | 2183.73   | 44.27  | 2091.39
uk-2002          | 64.25  | 3436.28   | 50.74  | 2842.24   | 49.59  | 2657.18
italy            | 74.98  | 3574.48   | 52.83  | 2586.59   | 50.50  | 2468.87
hugetrace        | 97.30  | 4625.07   | 83.05  | 3934.04   | 74.29  | 3546.09

cnr              | 1.14   | 46.14     | 0.89   | 33.68     | 0.87   | 27.40
Geo_1438         | 6.51   | 347.39    | 4.89   | 271.58    | 4.88   | 263.41
rggn224          | 55.57  | 2941.19   | 45.74  | 2436.59   | 43.14  | 2236.93
soc-LiveJournal1 | 68.29  | 2998.50   | 55.35  | 2492.44   | 50.52  | 2230.33
roadUSA          | 126.40 | 5788.49   | 101.06 | 4603.24   | 96.95  | 4402.87
hugeBubbles20    | 155.58 | 7545.09   | 111.96 | 5537.59   | 105.94 | 5201.84
MG2              | 183.13 | 9573.25   | 156.35 | 8208.76   | 150.60 | 7637.88
europe_osm       | 546.34 | 26,772.34 | 377.52 | 19,426.78 | 349.95 | 17,892.48
Table 4
HPCCG: Time (s) and energy (J) raw numbers for the Baseline, Manual and Best cases.

Dataset | Baseline Time (s) | Baseline Energy (J) | Manual Time (s) | Manual Energy (J) | Best case Time (s) | Best case Energy (J)
180³    | 42.66  | 2336.09   | 40.24  | 2202.03   | 28.15  | 1715.75
290³    | 186.14 | 10,018.81 | 168.82 | 9281.10   | 116.30 | 7098.38
360³    | 344.44 | 18,786.40 | 323.85 | 17,865.24 | 222.76 | 13,648.30

Table 5
Overall improvements in performance and energy.

% Improvement over | Comm. det. Time | Comm. det. Energy | HPCCG Time | HPCCG Energy
Manual             | 11.8%           | 11.8%             | 45.4%      | 31.0%
Baseline           | 56.1%           | 49.6%             | 60.0%      | 41.2%
For the remaining 3 inputs, we observe that the solutions obtained by using MG2 as the training set provide up to 5% improvement in performance and up to 3.8% in energy dissipation compared to when uk-2002 is used as the training set. Overall, the best solutions from the training sets for all inputs achieve up to 56% improvement in performance and up to 49.6% in energy dissipation over the baseline version. We achieve up to 11.8% improvement in performance and in energy dissipation over the manually optimized version. For HPCCG, we achieve up to 60% improvement in performance and up to 41% in energy dissipation over the baseline version. Finally, we achieve up to 45.4% improvement in performance and up to 31% improvement in energy dissipation over the manually optimized version. These results are summarized in Table 5.

4.1. Impact of the optimizations

Our previous work on optimizing the community detection code for performance on Tilera provided valuable guidance for the manual optimizations to apply in terms of optimal data structure layout [7]. We suggested the use of hashed memory for global graph data structures and local memory for private, per-thread data. This combination results in very good performance and low energy usage for community detection. We confirmed
this hypothesis by using our auto-tuning process for data layout selection on community detection, which settled on the previously mentioned data layout combination as requiring the lowest energy. Intuitively, this combination makes sense, since the global graph structures have the most irregular access patterns. Hashing enables each core to own multiple cache lines (64 bytes) spread from the beginning to the end of the full data, rather than a concentrated contiguous block of data, which may or may not receive a large number of accesses. The auto-tuning process found combinations of OpenMP scheduling policies and block sizes, together with GCC compiler options, that further enhance performance and reduce energy across all datasets. However, the major improvements come from the data layout selection that the Manual version already captures.

We see that the HPCCG application exhibits a lower level of non-determinism than community detection, which can be attributed directly to its modest level of irregularity. In general, for both applications, the fully optimized versions exhibit a lower level of non-determinism (tighter ranges). We attribute this to more precise data structure layouts that are better aligned with the application's access pattern, as well as to lower levels of runtime jitter in the OpenMP scheduling layers, due to the grouping of parallel iterations into larger chunks (rather than more frequent dynamic scheduling).

The Manual versions are close, in terms of low energy consumption, to the best auto-tuned versions for the community detection application. For HPCCG, on the other hand, the manually optimized versions consume significantly more energy than the fully optimized versions: the Manual version does not capture a significant fraction of the performance enhancement and energy savings achieved by the fully auto-tuned version.
Table 6
Community detection: Best case percentage improvement over the Baseline version for the 6 representative graph inputs. The first 3 datasets (up to hugetrace) performed best with solutions obtained by using uk-2002 as the training set; the remaining 3 datasets performed best with solutions obtained by using MG2 as the training set.

Dataset       | uk-2002 Time | uk-2002 Energy | MG2 Time | MG2 Energy | Difference (uk-MG2) Time | Difference (uk-MG2) Energy
delaunay      | 30.02% | 30.82% | 27.65% | 28.44% | 2.37%  | 2.38%
uk-2002       | 29.56% | 29.33% | 26.93% | 23.65% | 2.63%  | 5.68%
hugetrace     | 30.98% | 30.43% | 26.04% | 26.31% | 4.94%  | 4.12%
hugeBubbles20 | 43.78% | 41.25% | 46.87% | 45.05% | −3.09% | −3.80%
MG2           | 20.90% | 24.79% | 21.61% | 25.34% | −0.71% | −0.55%
europe_osm    | 51.16% | 46.00% | 56.13% | 49.63% | −4.97% | −3.63%
Table 7
Community detection: Best case percentage improvement over the Manual version for the 6 representative graph inputs. The first 3 datasets (up to hugetrace) performed best with solutions obtained by using uk-2002 as the training set; the remaining 3 datasets performed best with solutions obtained by using MG2 as the training set.

Dataset       | uk-2002 Time | uk-2002 Energy | MG2 Time | MG2 Energy | Difference (uk-MG2) Time | Difference (uk-MG2) Energy
delaunay      | 3.59%  | 4.42%  | 1.70% | 2.53% | 1.89%  | 1.89%
uk-2002       | 2.33%  | 6.97%  | 0.26% | 2.28% | 2.07%  | 4.69%
hugetrace     | 11.79% | 10.95% | 7.57% | 7.44% | 4.22%  | 3.51%
hugeBubbles20 | 3.47%  | 3.67%  | 5.69% | 6.46% | −2.22% | −2.79%
MG2           | 3.22%  | 7.00%  | 3.83% | 7.48% | −0.61% | −0.48%
europe_osm    | 4.45%  | 5.94%  | 7.88% | 8.58% | −3.43% | −2.64%
Table 8
HPCCG: Best case performance improvements over the Baseline and Manual versions.

Dataset | % Imp. over Baseline: Time | % Imp. over Baseline: Energy | % Imp. over Manual: Time | % Imp. over Manual: Energy
180³    | 51.56% | 36.16% | 42.97% | 28.35%
290³    | 60.06% | 41.15% | 45.17% | 30.75%
360³    | 54.63% | 37.65% | 45.39% | 30.90%
The main reason is that the optimized selection of data layouts for the ten key data structures in HPCCG is not intuitive. The partitioned layout selection for the sparse matrix is the only layout that can be derived directly from an understanding of the access patterns in the code. The other nine data structures (dense one-dimensional vectors) are used in several different manners across its three key subroutines (dense vector dot product, scaled dense vector sum and sparse matrix dense vector multiply). For three of the nine remaining data structures (the vector of non-zero elements, the vector x and a temporary vector), the auto-tuner found that the segmented layout was the best choice, because these structures can be accessed by all 36 cores simultaneously (each core accesses its own private chunk) without interference. For the remaining six data structures (input vectors, solution vectors and the auxiliary data structures required for maintaining the CSR format), the auto-tuner found that the hashed layout was the best choice. These data structures are involved in other cores' computations either as input or output and are hard to partition, and hence have to be globally accessible. The auto-tuner obtained significantly improved layouts for the three vectors in comparison to the manually selected ones (where all nine vectors use the hashed layout). In contrast to the community detection code, the manually selected OpenMP scheduling parameters for HPCCG were already optimal according to our auto-tuning search. Further improvement was achieved by the optimization of GCC command-line options.

We observe that the selected training sets for auto-tuning (uk-2002, MG2 and 180³) and the optimized configurations found from them produce repeatable and very similar results for the other datasets, with more variation for community detection.
4.2. Analysis of the non-deterministic behavior on the Tilera architecture

As one of the major contributors to power, the memory hierarchy is the target of most optimizations. This is due to the behavior of the caches and the cost of accessing off-chip memory. Caches may have a higher cost in instantaneous power, but they can potentially improve performance. This may result in a net reduction in energy, since power is additive and cannot be hidden in the same way latencies can. With this in mind, we conducted an in-depth study of the memory hierarchy usage while running the best solution found by our auto-tuner.

The hardware counter information for the best solution, normalized with respect to the baseline, is shown in Fig. 7 for community detection and in Fig. 9 for HPCCG. The raw counter values obtained are averaged over 5 runs and represent the entire run of the applications (not only their computational kernels). These figures show Instruction Stalls (IS), Load Stalls (LS), L1 read (L1RM) and write misses (L1WM), L2 read (L2RM) and write misses (L2WM), and remote (distributed) L2 read (RL2RM) and write misses (RL2WM). The results presented in these figures hint at the irregular accesses that these solutions exhibit. The large reduction in memory transactions over all graphs with respect to the baseline implementations (represented by the remote cache misses) shows that the best solution, thanks to its layout selection, is more effective at bringing data into the on-chip caches, as demonstrated by the Distributed Dynamic Cache behavior in Tables 9 and 11. However, the disproportionately high number of local L2 misses shown in Figs. 7 and 9 reveals that the access pattern is still irregular for both applications. The improvement in L1 accesses for community detection seems moderate in the figures, but its impact is significant when looking at the difference in raw performance counters with respect to the baseline in Table 9. For HPCCG, the deterioration in L1 accesses (see Table 11) does not have a significant impact on performance and energy, because it is amortized by the lower latencies when accessing data in the distributed dynamic cache.

The hardware counters normalized with respect to the manual implementation are shown in Fig. 8 for community detection and Fig. 10 for HPCCG.
Fig. 7. Community detection: Performance counters normalized to the Baseline version for the 6 representative graph inputs. Panels: (a) HugeBubbles, (b) MG2, (c) Europe, (d) uk-2002, (e) delaunay, (f) hugetrace.
Table 9
Community detection: Difference in raw performance counters (in millions, billions and trillions) between the Best case and Baseline versions for the 6 representative graph inputs. A positive value means that the best case has that many more stalls/misses. The first 3 datasets (up to hugetrace) performed best with solutions obtained by using uk-2002 as the training set; the remaining 3 datasets performed best with solutions obtained by using MG2 as the training set.

Dataset       | IS       | LS       | L1RM    | L1WM     | L2RM    | L2WM   | RL2RM   | RL2WM
delaunay      | −424.5 b | −348.5 b | −3.8 b  | −287.5 m | 515.3 m | 24.8 m | −4.4 b  | −1.1 b
uk-2002       | −446.2 b | −346.3 b | −10.7 b | −450.3 m | 87.2 m  | 8.2 m  | −4.0 b  | −1.2 b
hugetrace     | −658.5 b | −550.9 b | −8.9 b  | −821.0 m | 1.2 b   | 63.0 m | −6.1 b  | −1.5 b
hugeBubbles20 | −1.1 t   | −923.0 b | −21.3 b | −771.0 m | 1.6 b   | 89.7 m | −9.9 b  | −1.9 b
MG2           | −906.4 b | −686.5 b | −23.8 b | −2.6 b   | 98.3 m  | 49.8 m | −9.3 b  | −3.5 b
europe_osm    | −4.6 t   | −3.6 t   | −31.0 b | −597.1 m | 301.8 m | 29.6 m | −45.3 b | −2.6 b
Table 10
Community detection: Difference in raw performance counters (in millions, billions and trillions) between the Best case and Manual versions for the 6 representative graph inputs. A positive value means that the best case has that many more stalls/misses. The first 3 datasets (up to hugetrace) performed best with solutions obtained by using uk-2002 as the training set; the remaining 3 datasets performed best with solutions obtained by using MG2 as the training set. Note that for delaunay, the Manual implementation is the best case.

Dataset       | IS       | LS       | L1RM    | L1WM     | L2RM     | L2WM     | RL2RM    | RL2WM
delaunay      | 0        | 0        | 0       | 0        | 0        | 0        | 0        | 0
uk-2002       | −24.8 b  | −22.3 b  | −5.4 b  | 83.2 m   | −3.7 m   | −10,635  | −13.2 m  | 44.3 m
hugetrace     | −120.3 b | −43.7 b  | 2.0 b   | −139.9 m | −89.2 m  | −1.8 m   | −388.5 m | −15.6 m
hugeBubbles20 | −6.7 b   | −10.3 b  | 653.4 m | 599.9 m  | −21.6 m  | −1.8 m   | −365.7 m | 84.8 m
MG2           | −228.0 b | −221.3 b | 3.9 b   | 648.0 m  | −260.2 m | −11.0 m  | −3.8 b   | 56.0 m
europe_osm    | −153.4 b | −12.4 b  | −25.0 b | −2.3 b   | −39.5 m  | −338,667 | −101.8 m | 53.1 m
Fig. 8. Community detection: Performance counters normalized with respect to the Manual implementation for the 6 representative graph inputs. Panels: (a) HugeBubbles, (b) MG2, (c) Europe, (d) uk-2002, (e) delaunay, (f) hugetrace.
Fig. 8. Community detection: Performance counters normalized with respect to the Manual implementation for the 6 representative graph inputs.

Table 11
HPCCG: Difference in raw performance counters (m = millions, b = billions, t = trillions) between the best case and Baseline versions. A positive value means that the best case has that many more stalls/misses.

Dataset   IS         L1RM      L1WM       L2RM     L2WM      LS         RL2RM     RL2WM
180³      −501.7 b   12.9 b    −282.0 m   5.2 b    81.0 m    −542.2 b   −5.5 b    −894.2 m
290³      −2.2 t     53.3 b    −1.8 b     22.4 b   369.1 m   −2.3 t     −24.5 b   −4.2 b
360³      −4.2 t     101.9 b   −3.3 b     42.8 b   684.1 m   −4.6 t     −47.2 b   −7.3 b
Table 12
HPCCG: Difference in raw performance counters (m = millions, b = billions, t = trillions) between the best case and Manual versions. A positive value means that the best case has that many more stalls/misses.

Dataset   IS         L1RM      L1WM     L2RM      L2WM      LS         RL2RM      RL2WM
180³      −365.1 b   13.5 b    16.2 m   915.6 m   85.6 m    −407.7 b   −896.0 m   −664.0 m
290³      −1.5 t     56.8 b    15.4 m   4.6 b     421.2 m   −1.7 t     −5.2 b     −3.3 b
360³      −3.1 t     109.6 b   11.6 m   8.8 b     777.8 m   −3.4 t     −10.3 b    −6.1 b
For community detection, the best solution improves performance and energy by up to 11.8% compared to the Manual implementation, and we observe a decrease of up to 20% in stalls/misses. However, the average improvement in the hardware counters across all datasets is only around 6%; the raw counter numbers supporting these claims can be found in Table 10. This is because the layout used in the Manual implementation is already optimal, as discussed in Section 4.1. Hence, the only improvements over the Manual implementation come from the GCC options and the OpenMP loop scheduling policies. Contrary to community detection, we observe a significant improvement (up to 45% in performance and 31% in energy) over the Manual implementation for HPCCG. This is reflected in the hardware counters (Fig. 10 and Table 12), where we observe a trend similar to the comparison against the Baseline. The reason is that the data layouts found by the auto-tuner (which give the most improvement) are not intuitive even for an expert developer to determine manually, as explained in Section 4.1.
Fig. 9. HPCCG: Performance counters normalized to the Baseline version for the three problem sizes: (a) 180³, (b) 290³, (c) 360³. [Bar charts omitted from this text rendering; bars exceeding the 140% axis are annotated with their values in the original figure.]
Fig. 10. HPCCG: Performance counters normalized with respect to the Manual implementation for the three problem sizes: (a) 180³, (b) 290³, (c) 360³. [Bar charts omitted from this text rendering; bars exceeding the 140% axis are annotated with their values in the original figure.]
4.3. Correlations between hardware counters, graph characteristics and energy

For community detection, the correlation between time, energy and the various performance counters is presented as a heat map in Fig. 11(a), and the correlation between time, energy and the various graph characteristics is presented as a heat map in Fig. 11(b). In these figures, the color scales on the right range from light to dark and represent the range of correlation coefficients: a coefficient of 1 indicates a strong positive correlation, −1 a strong negative correlation, and 0 the absence of correlation between the variables. We use the Pearson correlation coefficient to measure the relationship between two variables, and the correlations are calculated using all 22 graph inputs.

Fig. 11. Community detection: Correlations of (a) performance counters and (b) graph properties with time and energy, represented using heat maps. The color scales on the right show a set of colors ranging from light to dark representing a range of correlation coefficients. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) [Heat maps omitted from this text rendering.]
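As an illustration of this analysis (a minimal sketch, not the scripts used in the study), the Pearson correlation matrix behind a heat map such as Fig. 11(a) can be computed directly from a table with one row per graph input; the column names, the stand-in data and the plotting choices below are assumptions.

import numpy as np
import matplotlib.pyplot as plt

# Columns assumed for illustration: wall time, whole-node energy and the eight counters.
COLS = ["time", "energy", "IS", "LS", "L1RM", "L1WM", "L2RM", "L2WM", "RL2RM", "RL2WM"]

def pearson_heatmap(data):
    """data: (n_inputs x len(COLS)) array, one row per graph input (22 in the study)."""
    corr = np.corrcoef(data.T)  # np.corrcoef expects variables in rows, so transpose
    fig, ax = plt.subplots()
    im = ax.imshow(corr, vmin=-1.0, vmax=1.0, cmap="viridis")
    ax.set_xticks(range(len(COLS)))
    ax.set_xticklabels(COLS, rotation=90)
    ax.set_yticks(range(len(COLS)))
    ax.set_yticklabels(COLS)
    fig.colorbar(im, ax=ax, label="Pearson correlation coefficient")
    return corr

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake = rng.random((22, len(COLS)))      # placeholder for the 22 graph inputs
    corr = pearson_heatmap(fake)
    print(corr[COLS.index("time"), COLS.index("energy")])
    plt.savefig("correlation_heatmap.png")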
As shown in the figures, performance and energy are highly correlated with each other. When comparing the graph characteristics against time and energy (Fig. 11(b)), only the number of vertices has a high correlation with time and energy. This is in line with how the algorithm computes and distributes the work based on the number of vertices (i.e., all vertices are visited in every iteration). Another interesting result is the high correlation between the number of clusters (i.e., communities) formed and the number of edges. This correlation provides insight into how the algorithm forms communities based on the connectivity of their edges.

When analyzing the correlation between the hardware counters and time and energy (Fig. 11(a)), one insight is that instruction and load stalls are correlated with time and energy. Moreover, instruction stalls are highly correlated with load stalls, meaning that they increase or decrease at the same rate on all the tested datasets. This tells us that stalls originating in the memory subsystem strongly affect performance and energy on our datasets, and it gives credence to the energy gains that we obtain from the layout optimizations. Along the same lines, we find that the L1 misses (both reads and writes) are highly correlated with performance and energy. These results confirm the data from Section 4.2, in which a reduction in L1 cache misses results in an increase in performance. Finally, the other parts of the memory subsystem (L2RM, L2WM, RL2RM and RL2WM) are less correlated, but still show a substantial correlation (between 0.6 and 0.87) on the current datasets. Among these, the most highly correlated counter is remote L2 write misses. This behavior is connected to the fact that a remote L2 write miss involves at least two operations: one that loads the old line from main memory and one that writes the new value into the cache. Hence, we see larger gains in performance and energy from the reduction in remote write misses, even though the percentage improvement in remote read misses (relative to the Baseline) is much higher (as seen in Section 4.2) than that of the remote write misses.

For HPCCG, all the hardware counters are highly correlated with energy due to the application and input data characteristics: the structure of the sparse matrix and dense vectors does not change over time, making HPCCG "less" irregular than community detection.

5. Related work

Optimizing applications for performance and power requires exploring a large design space. Many works have looked at introducing (semi-)automated solutions to minimize the programmer effort in such exploration. This section discusses the current state-of-the-art related to the exploration of compiler options and to the more general exploration of hardware and software parameters for low-power systems.

5.1. Compiler options and heuristics exploration

There exists a significant amount of work on optimization selection for applications. These techniques concentrate on an optimization feature set (i.e., compilation phases) and use machine learning algorithms to create predictive models based on representative training sets. In [15], the authors use machine learning techniques to select "good" optimization orderings per method within a dynamic compiler (Jikes RVM). Under this model, an artificial neural network is constructed and trained to predict beneficial optimization orderings for a given piece of code. Although it is implemented in a Java virtual machine, the technique used for compiler selection mirrors our auto-tuner's exploration of compiler options. However, the authors consider neither energy nor irregular applications in their analysis.
The analysis presented in [26] is a study of auto-tuning with respect to performance, energy, energy × delay and energy × delay² products using software-level tunable parameters, such as cache tiling factors and loop unrolling. The authors use the Active Harmony [25] auto-tuner to obtain the best parameter sets for the given application. The work analyzes several software aspects in terms of energy and performance, but the authors consider neither irregular applications nor a large space of compiler techniques.

The study in [22] presents a framework to select the best combination of polyhedral and optimization techniques based on six machine learning techniques. The process has two steps. In the first step, a down-select is performed using static cost models on specific high-level optimizations. In the second step, the performance of the optimization sets is predicted by training the models with selected inputs. Improvements are, on average, between 3.2× and 8.7× over non-trained sets. However, by focusing on the polyhedral space, this framework is limited to regular, affine-type problems and does not consider energy.

The research presented in [28] is a study of the correlation between the energy usage and the execution time of an application. The authors use the Polyhedral Compiler Collection (PoCC) to create a large number of test cases and explore which ones provide the best performance, and they use RENCI's RCRtool [1] to collect energy and performance information. The setup of that study mirrors ours (we use OpenTuner instead of PoCC and tile-btk instead of RCRtool), but it focuses on regular applications, which allow for affine transformations.

Several researchers have explored the effect of different compiler parameters on High-Level Synthesis (HLS) [13]. The register-transfer-level designs produced by HLS tools are heavily dependent on the quality of the intermediate representation of the original high-level source code. In turn, the quality of the intermediate representation changes depending on the compiler optimizations applied and their ordering. Thus, compiler optimizations can significantly influence hardware metrics such as the area, maximum frequency and execution cycles of the resulting circuits, and exploring these parameters is critical to obtain the best results. Our work does not address HLS and looks in particular at power/performance trade-offs; however, it is similar in spirit, recognizing the potential impact of compiler optimizations on these metrics.

SPIRAL [23] is another popular auto-tuning framework, which generates platform-tuned implementations of signal processing transforms. It defines a complete algebra to express the transforms and a set of tunable parameters for each potential target platform. The framework can accept power, energy and/or performance as the objective function. The use of the algebra limits SPIRAL to digital signal processing algorithms, because it cannot express data-intensive kernels such as those in irregular applications.

The Optimized Sparse Kernel Interface (OSKI) [24] is an auto-tuning framework for sparse matrices and sparse operations. It searches for a data structure that minimizes the size of the sparse matrices and for an efficient kernel implementation with that specific data layout on the target machine. The library can also perform continuous profiling and periodic tuning. OSKI can be guided through user-provided hints, or can monitor selected kernels to understand the characteristics of the workload. OSKI is currently being ported to several parallel and accelerator architectures.
In contrast to OSKI, our work targets two representative irregular compact applications (graphs and sparse linear algebra) on a specific many-core platform.

5.2. Design space exploration

Many research groups have looked at Design Space Exploration (DSE) approaches where power is, together with performance, one of the main objectives. The proposed approaches cover a wide spectrum of solutions, including static techniques that
explore the optimal number of components (e.g., the number of cores) and architectural parameters (e.g., cache sizes) through simulation and power models for a given set of applications [16], and that extend the exploration to thermal constraints [18] and floorplans [20]. Some approaches, such as [17], also propose heuristics for dynamically adapting the number of active cores, voltages and frequencies depending on the parallelism of the applications and the power constraints. Our work is orthogonal to these: we start from a given homogeneous many-core architecture with physical instrumentation and perform software-level power/performance design space exploration. However, we did explore the degree of parallelism exploited at runtime by the application, looking for interesting Pareto points in the power/performance space.

There are many frameworks that explore promising design points in multidimensional spaces, including performance and power metrics, for integrated hardware–software co-design (e.g., [30,19]). These utilize a variety of exploration methods, such as regression models or heuristic search algorithms. Lately, research has focused on DSE of power/performance trade-offs in custom accelerators [9]. However, these tools require fast power models to evaluate many design points without synthesis or simulation. Although our work performs design space exploration over several parameters, it only modifies the software. Furthermore, we do not estimate power and energy, but measure them directly on the target platform.

6. Conclusions

We presented an empirical study on energy and performance optimizations for irregular applications on the Tilera many-core platform. Our approach focused on using a customized auto-tuning framework to automatically explore the impact of different data layouts, OpenMP loop scheduling policies and blocking factors, and GCC command-line compiler options. The automatically discovered parameter settings improve performance and reduce energy consumption for our two target applications, community detection and HPCCG, over a range of representative input sets. We have observed whole-node energy savings and performance improvements of up to 49.6% and 60% relative to a baseline instantiation, and up to 31% and 45.4% relative to manually optimized variants.

Future work includes plans to perform a more detailed study of the impact of individual compiler optimizations on energy reduction, as well as adapting existing, and designing new, static compiler analyses and transformations that can target energy consumption. The auto-tuning process for GCC compiler options reports a large combination of flags as part of the best configuration, but only a few of these flags may contribute significantly to minimizing energy and improving performance. For example, for all optimization orderings involving auto-tuning for GCC, the best configuration found by the auto-tuner is a combination of all 335 compiler flags, out of which 154 flags requiring additional parameters differ between the orderings. For specific datasets, we know that only a few of these 335 flags account for 99% of the impact on energy/performance, and the rest are not needed. In such cases, including the remaining flags might increase energy consumption or reduce performance on other (untrained) datasets. Therefore, we believe that identifying the impact of individual compiler options will help refine the overall optimization process further.
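To make this kind of flag exploration concrete, the following minimal sketch follows OpenTuner's standard MeasurementInterface pattern to express a GCC flag search space; the flag subset, build/run commands, problem size and the use of wall time as the objective are illustrative assumptions, not the ATPER tooling or energy probe described above.

import opentuner
from opentuner import ConfigurationManipulator, EnumParameter, IntegerParameter
from opentuner import MeasurementInterface, Result

# Tiny subset standing in for the ~335 -f flags explored in the paper.
GCC_FLAGS = ["align-functions", "tree-vectorize", "unroll-loops", "ivopts"]

class GccFlagsTuner(MeasurementInterface):
    def manipulator(self):
        m = ConfigurationManipulator()
        m.add_parameter(IntegerParameter("opt_level", 0, 3))
        for f in GCC_FLAGS:
            # 'default' leaves the flag at whatever -O<level> implies.
            m.add_parameter(EnumParameter(f, ["on", "off", "default"]))
        return m

    def run(self, desired_result, input, limit):
        cfg = desired_result.configuration.data
        flags = ["-O{0}".format(cfg["opt_level"])]
        for f in GCC_FLAGS:
            if cfg[f] == "on":
                flags.append("-f" + f)
            elif cfg[f] == "off":
                flags.append("-fno-" + f)
        build = self.call_program("g++ -fopenmp {0} hpccg.cpp -o hpccg".format(" ".join(flags)))
        assert build["returncode"] == 0
        run_result = self.call_program("./hpccg 180 180 180")  # placeholder problem size
        # The study minimizes measured whole-node energy; wall time is used here
        # only as a stand-in objective, since the energy probe is platform specific.
        return Result(time=run_result["time"])

if __name__ == "__main__":
    GccFlagsTuner.main(opentuner.default_argparser().parse_args())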
Regarding layout optimizations, although Tilera provides programmatic control of the cache, the layout optimizations explored in this paper can also benefit cache architectures in general, where control of the memory subsystem is more implicit. We also plan to apply and extend the insights gained from this case study on Tilera to newer many-core systems, such as Intel's Knights Landing, which exhibit a plethora of optimization opportunities in a complex space with heterogeneous memories.
Role of the funding source

The research for this paper was funded by DoD under the "Integrated Compiler and Runtime Autotuning Infrastructure for Power, Energy, and Resilience (ATPER)" project 63810 at Pacific Northwest National Laboratory (PNNL). The work was conducted solely by PNNL personnel. PNNL is operated by Battelle Memorial Institute under Contract DE-AC06-76RL01830.

References

[1] A. Porterfield, R. Fowler, M.Y. Lim, RCRTool: Design document; version 0.1, Technical Report, RENCI, North Carolina, USA, 2010.
[2] J. Ansel, S. Kamil, K. Veeramachaneni, J. Ragan-Kelley, J. Bosboom, U.-M. O'Reilly, S. Amarasinghe, OpenTuner: An extensible framework for program autotuning, in: PACT'14: 23rd International Conference on Parallel Architectures and Compilation, 2014, pp. 303–316.
[3] D. Bader, H. Meyerhenke, P. Sanders, D. Wagner, Graph partitioning and graph clustering: 10th DIMACS implementation challenge workshop, Contemp. Math. 588 (2012).
[4] R.F. Barrett, M.A. Heroux, P.T. Lin, C.T. Vaughan, A.B. Williams, Poster: Mini-applications: Vehicles for co-design, in: SC'11 Companion, 2011, pp. 1–2.
[5] P. Beckman, et al., Exascale operating systems and runtime software report, 2012.
[6] V. Blondel, J.-L. Guillaume, R. Lambiotte, E. Lefebvre, Fast unfolding of communities in large networks, J. Stat. Mech. Theory Exp. (2008) P10008.
[7] D. Chavarría-Miranda, M. Halappanavar, A. Kalyanaraman, Scaling graph community detection on the Tilera many-core architecture, in: HiPC'14, 2014.
[8] D. Chavarría-Miranda, A. Panyala, M. Halappanavar, J.B. Manzano, A. Tumeo, Optimizing irregular applications for energy and performance on the Tilera many-core architecture, in: Proceedings of the ACM International Conference on Computing Frontiers, CF, Ischia, Italy, May 2015.
[9] D. Chen, et al., High-level power estimation and low-power design space exploration for FPGAs, in: ASP-DAC'07, 2007, pp. 529–534.
[10] T.A. Davis, Y. Hu, The University of Florida sparse matrix collection, ACM Trans. Math. Softw. 38 (1) (2011) 1–25.
[11] J. Feo, O. Villa, A. Tumeo, S. Secchi, Irregular applications: Architectures & algorithms, in: Proceedings of the First Workshop on Irregular Applications: Architectures and Algorithms, IAAA'11, ACM, New York, NY, USA, 2011, pp. 1–2.
[12] M.A. Heroux, D.W. Doerfler, P.S. Crozier, J.W. Willenbring, H.C. Edwards, A. Williams, M. Rajan, E.R. Keiter, H.K. Thornquist, R.W. Numrich, Improving performance via mini-applications, Technical Report SAND2009-5574, Sandia National Laboratories, 2009.
[13] Q. Huang, R. Lian, A. Canis, J. Choi, R. Xi, N. Calagar, S. Brown, J. Anderson, The effect of compiler optimizations on high-level synthesis for FPGAs, in: 2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM, April 2013, pp. 89–96.
[14] N.M. Josuttis, The C++ Standard Library: A Tutorial and Reference, Addison-Wesley Professional, 2012.
[15] S. Kulkarni, J. Cavazos, Mitigating the compiler optimization phase-ordering problem using machine learning, SIGPLAN Not. 47 (10) (2012) 147–162.
[16] J. Li, J.F. Martinez, Power-performance implications of thread-level parallelism on chip multiprocessors, in: ISPASS'05, 2005, pp. 124–134.
[17] J. Li, J. Martinez, Dynamic power-performance adaptation of parallel computation on chip multiprocessors, in: HPCA-12, 2006, pp. 77–87.
[18] Y. Li, K. Skadron, D. Brooks, Z. Hu, Performance, energy, and thermal considerations for SMT and CMP architectures, in: HPCA-11, 2005, pp. 71–82.
[19] S. Mohanty, V.K. Prasanna, S. Neema, J. Davis, Rapid design space exploration of heterogeneous embedded systems using symbolic search and multi-granular simulation, in: LCTES/SCOPES'02, 2002, pp. 18–27.
[20] M. Monchiero, R. Canal, A. González, Design space exploration for multicore architectures: A power/performance/thermal view, in: ICS'06, 2006, pp. 177–186.
[21] M.E.J. Newman, M. Girvan, Finding and evaluating community structure in networks, Phys. Rev. E 69 (2) (2004) 026113.
[22] E. Park, L.-N. Pouchet, J. Cavazos, A. Cohen, P. Sadayappan, Predictive modeling in a polyhedral optimization space, in: Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO'11, IEEE Computer Society, Washington, DC, USA, 2011, pp. 119–129.
[23] M. Püschel, F. Franchetti, Y. Voronenko, Spiral, in: Encyclopedia of Parallel Computing, Springer, 2011 (Chapter).
[24] R. Vuduc, J. Demmel, K. Yelick, The Optimized Sparse Kernel Interface (OSKI) library, Technical Report, Berkeley Benchmarking and Optimization Group, University of California, Berkeley, USA, 2007.
[25] C. Ţăpuş, I.-H. Chung, J.K. Hollingsworth, Active Harmony: Towards automated performance tuning, in: SC'02, 2002, pp. 1–11.
[26] A. Tiwari, M.A. Laurenzano, L. Carrington, A. Snavely, Auto-tuning for energy usage in scientific applications, in: Euro-Par'11, 2012, pp. 178–187.
[27] A. Tumeo, J. Feo, O. Villa, S. Secchi, T.G. Mattson, Special issue on architectures and algorithms for irregular applications (AAIA) - guest editors' introduction, J. Parallel Distrib. Comput. 76 (2015) 1–2.
[28] W. Wang, J. Cavazos, A. Porterfield, Energy auto-tuning using the polyhedral approach, in: 4th International Workshop on Polyhedral Compilation Techniques, 2014.
[29] C. Wu, A. Kalyanaraman, W.R. Cannon, pGraph: Efficient parallel construction of large-scale protein sequence homology graphs, IEEE Trans. Parallel Distrib. Syst. 23 (10) (2012) 1923–1933.
[30] V. Zaccaria, G. Palermo, F. Castro, C. Silvano, G. Mariani, Multicube Explorer: An open source framework for design space exploration of chip multi-processors, in: ARCS'10, 2010, pp. 1–7.
Joseph B. Manzano received the B.S. degree in Computer Science from William Paterson University of New Jersey in 2003 and the Ph.D. degree in Electrical and Computer Engineering from the University of Delaware in 2011. He is currently a Research Scientist at Pacific Northwest National Laboratory. His research interests include exascale runtime systems, system software for many-core and multi-core machines, auto-tuning for energy, and emerging benchmarks for new generations of embedded hardware.
Ajay Panyala received the B.Tech. degree from Jawaharlal Nehru Technological University, Hyderabad, India in 2007 and the Ph.D. degree from Louisiana State University in 2014, both in Computer Science. He is currently a postdoctoral research associate at Pacific Northwest National Laboratory. His research interests are in compiler optimizations for high performance computing.
Antonino Tumeo received the M.S. degree in Informatic Engineering, in 2005, and the Ph.D. degree in Computer Engineering, in 2009, both from Politecnico di Milano, Italy. He is currently a Senior Research Scientist at Pacific Northwest National Laboratory. His research interests are in irregular applications, modeling and simulation of high performance architectures, hardware–software co-design, FPGA prototyping and GPGPU computing.
Daniel Chavarría-Miranda received the B.S. degree from Universidad de Costa Rica in 1994, the M.S. degree from the Monterrey Institute of Technology and Higher Education, Monterrey, Mexico in 1998 and a combined M.S./Ph.D. degree from Rice University in 2004, all in Computer Science. He is currently a Senior Research Scientist at Pacific Northwest National Laboratory. His research interests are in parallel and distributed systems, programming models for high-performance and parallel computing, programming languages, and the interactions of architectural features with software systems.
Mahantesh Halappanavar received the M.S. and Ph.D. degrees from Old Dominion University in 2003 and 2009, both in Computer Science. He is currently a Senior Research Scientist at Pacific Northwest National Laboratory. His research interests include the interplay of algorithm design, architectural features, and input characteristics targeting massively multithreaded architectures such as the Cray XMT and emerging multi-core and many-core platforms.