Tuning linear algebra for energy efficiency on multicore machines by adapting the ATLAS library

Thomas Jakobs, Jens Lang, Gudula Rünger, Paul Stöcker
Department of Computer Science, Technische Universität Chemnitz, 09107 Chemnitz, Germany

Highlights

• We present a version of ATLAS that minimises energy instead of time.
• When ATLAS is tuned for minimal energy consumption, 10% of energy can be saved.
• Some tuning parameters show differences when optimised for time and for energy.
• Being compute-bound, DGEMM profits most from the tuning for energy.

Article history: Received 29 February 2016; Received in revised form 21 January 2017; Accepted 7 March 2017; Available online xxxx.

Keywords: Energy autotuning; Energy efficiency; Power; Basic linear algebra subroutines (BLAS); ATLAS

Abstract

While automated tuning is an established method for minimising the execution time of scientific applications, it has rarely been used for an automated minimisation of the energy consumption. This article presents a study on how to adapt the auto-tuned linear algebra library ATLAS to consider the energy consumption of the execution in its tuning decision. For different tuning parameters of ATLAS, it investigates which differences occur in the tuning results when ATLAS is tuned for a minimal execution time or for a minimal energy consumption. The tuning parameters include the matrix size for the low-level matrix multiplication, loop unrolling factors, crossover points for different matrix-multiplication implementations, the minimum size for matrices to be transposed, or blocking sizes for the last-level cache. Also, parameters for multithreaded execution, such as the number of threads and thread affinity, are investigated. The emphasis of this article is on a proposed method with which a tuning process for execution time can be replaced by a tuning for energy consumption, especially in the parallel case. ATLAS serves as a prominent example for a tuned library. Furthermore, the article draws conclusions on how to design an energy-optimising autotuning package and how to choose tuning parameters. The article also discusses why the matrix-matrix multiplication has a potential for increasing the energy efficiency while the time efficiency remains constant, whereas other routines have been shown to improve their energy efficiency by reducing the execution time. © 2017 Elsevier B.V. All rights reserved.

1. Introduction

Most simulations from the field of scientific computing can be decomposed into subroutines from a set of commonly used numerical and computational methods [1]. One of the most important classes of such subroutines is dense linear algebra. As a standard interface for dense linear algebra libraries, BLAS (Basic Linear Algebra Subroutines) has become established over the last decades [2]. Since keeping pace with the constantly improving hardware is important,
automated tuning or autotuning methods are widely used and have also been used for packages implementing the BLAS interface [3]. To date, autotuning has mainly focused on minimising the execution time of an application [4]. However, energy efficiency of scientific codes will become an important design goal in the near future [5]. Currently, there is no autotuning package for linear algebra available that considers energy as an optimisation goal. This article proposes a method for creating a BLAS library which is automatically tuned for a minimum energy consumption.

One of the most popular autotuned BLAS packages is ATLAS (Automatically Tuned Linear Algebra Software) [3]. ATLAS investigates a large number of code variants for different BLAS operations and selects those showing the best performance, partly even considering data characteristics. It finds the optimal values for several tuning parameters, such as block sizes, pre-fetch distances, or
performance crossover points of different implementations. The ATLAS library, however, currently only optimises for execution time and does not consider the energy consumption. Although it has often been found that optimising the execution time also reduces the energy consumption by nearly the same percentage [6–8], this is not always true; in [7,9,10], it is shown that a deviation of up to 7% is possible.

This article presents some approaches to introduce energy as an optimisation goal into the ATLAS autotuner. The aim is to create an autotuned BLAS library which uses as little energy as possible, for example by favouring low-energy implementation variants over fast variants in cases in which a conflict between the two goals of execution time and energy occurs. Thus, our approach is a proposal to tackle the problem of a multi-objective optimisation by enhancing the autotuning approach in ATLAS. An important requirement for our approach is that running the adapted ATLAS routines shall not have any side effects that may influence other running processes or non-ATLAS threads. This prevents the use of widely known and effective techniques, such as frequency manipulation or clock gating [5]. The parameters that are evaluated during the autotuning process of ATLAS for execution time are also evaluated by using energy measurements. This enables us to point out specific differences in the effect of, e.g., matrix block sizes on timing and energy behaviour.

Some parts of this investigation have been presented earlier in [11], which focused on parameters for singlethreaded execution of the BLAS routines. In this article, we extend this work by including parameters that are mainly relevant for a multithreaded execution and by considering a multi-objective optimisation. The additional contributions of this article are given in the following. It investigates whether using simultaneous multithreading and thread affinity yields a higher performance in terms of time or energy for multithreaded execution. Additionally, the influence of different execution variants on cache usage boundaries is discussed. Thus, the article presents an ATLAS variant for multithreaded execution, which has been completely tuned for energy consumption. Hence, this ATLAS variant should exhibit the minimum energy consumption of all variants producible by the autotuner. In summary, the contributions of this article are:

• the modification of the autotuning process of ATLAS with the goal to exploit different objectives, i.e. execution time and energy efficiency;
• a concise and thorough empirical study of the differences in execution time and energy behaviour of the ATLAS library with respect to several parameters, such as matrix block size or cache boundaries;
• a first step towards a multi-objective autotuning process concentrating on the two optimisation goals time and energy by picking the implementation variant with the best result for both criteria;
• the evaluation of the autotuning process for multithreaded execution.

The rest of this article is structured as follows: Section 2 gives a summary of the autotuning mechanism of ATLAS and proposes methods for introducing energy awareness into the autotuner. Section 3 investigates several tuning parameters and their behaviour when tuned for execution time and for energy consumption. Section 4 investigates the effect of tuning the complete package for energy consumption. Section 5 presents related work, and Section 6 concludes the article.

2. Autotuning in ATLAS

Autotuning is an ‘‘automated process, guided by experiments, of selecting one from among a set of candidate program implementations to achieve some performance goal’’ [12]. Autotuning usually consists of three steps [12]:

1. Creation of a number of candidate implementations for the operation to be tuned;
2. Experimental evaluation of the performance of the variants;
3. Selection of the best variant for the practical execution of the code.

As the performance of a code normally strongly depends on the hardware on which it is executed, the autotuning procedure is, in most cases, performed during the installation of the software on a new hardware.

2.1. The ATLAS autotuning process

A good review of the autotuning process of ATLAS is given in [3,13]. Fig. 1 visualises the steps of the ATLAS autotuning process, the data transferred between the steps, and the parameters which can be extracted from each step (arrows pointing down). The first step of ATLAS is the detection of certain hardware parameters, including the size of the level-1 cache, the latency of a multiply operation, the number of floating-point registers, and the presence of the fused multiply–add (FMA) hardware instruction. In order to save tuning time, ATLAS provides a library of hardware parameters for the most common architectures. If the library values are used, the hardware detection step is omitted. Hardware detection can, however, be enforced (by giving the parameter -Si archdef 0 to the configure script).

ATLAS uses two different methods for creating candidate implementations. The first method is the source code adaptation (second box in Fig. 1), which is used when different source codes are needed for implementing the candidates. When using source code adaptation, different algorithms for one operation may be evaluated or, if the code is generated automatically, different levels of loop unrolling or different loop nestings can be tested. The incorporated search engine determines which code variants and which parameter settings exhibit the best performance. The parameters to be tuned include the block size NB for the blocked matrix multiplication, the loop unrolling factors nu, mu and ku, and a parameter indicating whether fused multiply–add is used. The search engine hands over the parameter set to be investigated to the code generator, which generates the source code considering these parameters and compiles it. During the execution, the performance of the generated routine is measured and returned to the search engine which then, based on the result, refines the parameter set. ATLAS offers various timers for the evaluation of the performance, including wall clock timers and cycle-accurate CPU timers.

The second method for creating candidate implementations is the parametrised adaptation (third box in Fig. 1), which means that a piece of code contains a parameter which controls its execution. These parameters are tuned after the source code generation. The parameters include the copy/no-copy threshold, which controls whether the matrices are copied and rearranged in memory before the operation, and the CacheEdge, which optimises the cache usage of higher-level caches.

This article will mainly (but not solely) focus on operations of BLAS level 3, i.e. matrix-matrix operations. For level-3 kernels, ATLAS uses a two-stage optimisation [3], which works as follows: The high-level stage splits up the operation into smaller, blockwise matrix-matrix multiply operations. These matrix-matrix multiplications are processed in the low-level stage. For the low-level stage, there exists a highly optimised kernel which performs a matrix multiplication of the form C ← AᵀB + βC.
All matrices are square matrices with their dimensions (N, M and K) being NB. The value for NB is chosen in a way such that the operation becomes cache-contained, i.e. the data needed fits into the level-1 cache. Furthermore, the article will only deal with tuning parameters that are already present in ATLAS. New parameters, such as the operating frequency of the CPU, are not introduced. Works investigating dynamic voltage and frequency scaling for linear-algebra routines are, e.g., [14,7].
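As an illustration of this kernel, the following routine performs C ← AᵀB + βC on NB × NB blocks in column-major storage. It is our own minimal sketch, not code generated by ATLAS; ATLAS produces many unrolled, register-blocked and prefetching variants of this loop nest and keeps the best-performing one.

    /* Minimal sketch of a cache-contained block kernel C <- A^T * B + beta*C
     * for nb x nb blocks in column-major storage (illustration only, not an
     * ATLAS-generated kernel; no unrolling, no prefetching). */
    void block_gemm_atb(int nb, const double *A, const double *B,
                        double beta, double *C)
    {
        for (int j = 0; j < nb; j++) {           /* N loop over columns of C */
            for (int i = 0; i < nb; i++) {       /* M loop over rows of C    */
                double acc = 0.0;
                for (int k = 0; k < nb; k++)     /* K loop: dot product      */
                    acc += A[i * nb + k] * B[j * nb + k];
                C[j * nb + i] = beta * C[j * nb + i] + acc;
            }
        }
    }

Because A enters in transposed form, the innermost loop of this sketch reads both A and B with stride one.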

Fig. 1. Autotuning process in the ATLAS library based on [13] showing the parameters and steps used and obtained during the autotuning process.

2.2. Energy awareness for the ATLAS autotuning

In order to create an energy-optimised ATLAS library, we use the energy consumption as performance measure for the candidate implementations. The existing performance measurement, which measures execution time, can be replaced or complemented by the measurement of the energy consumption. Both strategies have been followed. The energy consumption is measured using the Running Average Power Limit (RAPL) [15] software interface provided by recent Intel CPUs. The RAPL interface has already been used for studies on the energy consumption of linear algebra routines, e.g. [16–18]. In various works, it has been shown that the values returned by RAPL correspond sufficiently well to values obtained by hardware energy meters, e.g. [19,15,20]. For this work, the PAPI [21] package is used for accessing the RAPL interface. The PAPI event PACKAGE_ENERGY:PACKAGE0, which maps to the RAPL register MSR_PKG_ENERGY_STATUS, is used. This register measures the energy consumption of the complete CPU package, including core, cache and uncore energy consumption. The measurement value is updated every 1 ms and the register holding it overflows after roughly 60 s if the machine is heavily loaded [22, vol. 3B, ch. 34.7.2].

The performance of codes considering the execution time is called the time efficiency in this article. It is measured in flop/s, i.e. the number of floating point instructions which are executed in one second. The corresponding performance measure for energy consumption is the energy efficiency, measured in flop/J. The flop/J value measures how many floating point instructions can be executed using the energy of one joule.

2.2.1. Measuring energy and execution time

For some parameters, the ATLAS autotuner obtains a table of different parameter settings and the corresponding performance values. From the table, the autotuner selects the parameter setting with the highest performance, i.e. the highest flop/s value. In these cases, we perform the energy measurement in addition to the time measurement, thus supporting a multi-objective optimisation. Both values, flop/s and flop/J, are then included in the table. In the source code, the energy measurement is started before the time measurement and stopped after it. The ATLAS version 3.10.2 has been used for these experiments.

2.2.2. Measuring energy instead of execution time

In some cases, e.g. for the evaluation of the generated matrix multiplication routines, the result of the performance evaluation is the input for the next iteration of the optimisation in ATLAS. In this case, measuring time and energy at once is not useful as ATLAS only optimises for one criterion. Thus, we replace the time measurement by an energy measurement. This replacement affects the whole ATLAS package so that ATLAS is only optimised for energy, not for execution time. In detail, we modify the routine ATL_cputime in the file tune/sysinfo/ATL_cputime.c such that it returns the energy consumption. For the measurement, the PAPI library is used as described above. It has to be ensured that no overflow occurs in the measurement register. However, we have found that all measurements need less than one minute to complete so that the overflow will not occur. The ATLAS version 3.10.1 has been used for these experiments.
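The following sketch illustrates such an energy-returning replacement. It is our own illustration, not the actual modified ATL_cputime: it assumes that the PAPI RAPL component exposes the package energy under the event name rapl:::PACKAGE_ENERGY:PACKAGE0 (reported in nanojoules), and it omits the error and overflow handling a real implementation needs.

    #include <papi.h>

    /* Sketch of a drop-in substitute for the time measurement: instead of a
     * timestamp, the routine returns the accumulated package energy in
     * joules, read via the PAPI RAPL component. Illustration only; error
     * handling and counter-overflow handling are omitted. */
    double energy_instead_of_time(void)
    {
        static int eventset = PAPI_NULL;
        static int initialised = 0;
        long long value;

        if (!initialised) {
            PAPI_library_init(PAPI_VER_CURRENT);
            PAPI_create_eventset(&eventset);
            /* event name as exposed by the PAPI RAPL component; the exact
             * name may depend on the PAPI version and configuration */
            PAPI_add_named_event(eventset, "rapl:::PACKAGE_ENERGY:PACKAGE0");
            PAPI_start(eventset);
            initialised = 1;
        }
        PAPI_read(eventset, &value);   /* counter value in nanojoules */
        return (double)value * 1.0e-9; /* report joules instead of seconds */
    }

With this substitution, the flop/s values that ATLAS computes internally effectively become flop/J values, so the unchanged search machinery minimises the energy consumption instead of the execution time.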

In this modified version of ATLAS, the hardware parameters number of registers, latency and size of level-1 cache have been fixed to the values obtained by the original ATLAS version. This ensures that all optimisations start with the same conditions and effects arising from different initial conditions are avoided. As the energy measurement shows a greater variance than the time measurement, the number of measurements per kernel has been increased from 3 to 5. Furthermore, the number of repetitions per measurement, which is between 40 and 18,500, has been increased by a factor of 100.

3. Evaluation of single parameters

A selection of the large number of parameters tuned during the autotuning process of ATLAS is investigated in this section. During the tuning of each parameter, the performance of ATLAS with the different parameter settings is evaluated. This section presents measurements and results for both kinds of tuning, for time and for energy, and attempts to show their relations for each parameter. While this section evaluates the time efficiency versus the energy efficiency of single parameters, such as the block size and loop unrolling factors, Section 4 investigates the differences that arise when the complete package is tuned for energy consumption, i.e. without concentrating on single parameters.

For the experiments, a machine with a quad-core Intel Core i7-4770K processor has been used. The processor has a clock frequency of 3.5 GHz and cache sizes of 32 KiB level-1 data cache, 32 KiB level-1 instruction cache, 256 KiB level-2 cache (all per core), and 8 MiB level-3 cache (shared). The architecture offers 256-bit fused multiply–add units [23]. As the detection of the hardware parameters shows differences between optimisations for performance and energy consumption, the following hardware parameters have been set to the values obtained by the optimisation run of the time-optimised version: the cache size is 128 KiB, the multiply latency is 5, and fused multiply–add is available.

3.1. Block size for low-level matrix multiplication

For the low-level matrix multiplication, from which the high-level operations are composed, the optimal size NB of the block matrices is determined. The low-level matrix multiplication performs operations of the form C ← AᵀB + βC with all dimensions N, M and K being NB. The operation is cache-contained, which means that NB is chosen such that the entire matrix A and 2 columns of B fit into the level-1 cache. Additionally, one cache line remains for storing the parts of C currently needed [3]. The differences for NB resulting from an energy optimisation and a time optimisation of ATLAS are evaluated using the original ATLAS library and the ATLAS version replacing the time measurement by energy measurement according to Section 2.2.2.
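As a rough plausibility check (our own back-of-the-envelope estimate, not part of the ATLAS tuning output): for single precision with 4-byte elements, the cache-containment condition requires 4 · (NB² + 2 · NB) bytes plus one 64-byte cache line to fit into the 32 KiB level-1 data cache of the test machine, i.e. NB² + 2 · NB ≤ 8176, which gives NB ≤ 89. The block sizes found by the searches below (NB = 76 for energy and NB = 80 for time, see Fig. 2) lie below this bound.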


Fig. 2. Efficiency for different values of the low-level matrix multiplication block size NB of the energy-optimised variant (horizontal axis) and the time-optimised variant (vertical axis); the outlier NB = 16 is omitted (data type: float).

In both versions, the autotuning process performs two consecutive searches for the best NB: In the first search, a preliminary value is determined which is used during the optimisation of the values of the loop unrolling factors mu and nu. In the second search, which this section deals with, the final value is determined. The second search uses the loop unrolling factors found in the first step. The search itself is performed iteratively, i.e. the next candidate value of NB is determined based on the performance of the current value. Therefore, the searches for the time-optimal and for the energy-optimal values for NB may diverge.

The time and energy efficiency for the NB candidate values (as created by the search algorithm) of the single-precision matrix multiplication are shown in Fig. 2. The horizontal axis shows the performance of the energy-optimised variant for the given block size NB, the vertical axis shows its performance with the time-optimised variant. So, the rightmost value, i.e. NB = 76, exhibits the best energy efficiency, the uppermost value, i.e. NB = 80, exhibits the best time efficiency. It may be questioned, however, whether the differences in the cluster {60, 64, 68, 76, 80} are really significant, as they only differ in a range of roughly 2%. The value NB = 72 deserves special attention: Its time efficiency is less than 1% worse than the best of NB = 80; but its energy efficiency is roughly 9% worse than for the energy-optimal value NB = 76. So, under slightly different conditions, e.g. by using a processor with a slightly higher or lower frequency, the search for the time-optimal NB could yield NB = 72 as the optimum. That would substantially degrade the energy efficiency. These large differences in the energy efficiency for parameter settings exhibiting a similar time efficiency show that a dedicated optimisation for energy efficiency is useful and interesting.

For obtaining values comparable between the time-optimising and the energy-optimising variant, besides the hardware parameters, the following parameters have been fixed in these experiments: The loop unrolling factors are mu = 4, nu = 1, ku = 1, and fused multiply–add is used. These values have been obtained by the time-optimising variant.

3.2. Loop unrolling factors for low-level matrix multiplication

The cache-contained matrix multiplication uses a triple-nested loop running over the dimensions N, M and K of the matrices. The K loop is the outermost loop, the M and N loops are the inner loops. The inner loops are unrolled by the factors mu and nu, such that all local variables can be stored in the registers available. Different settings for (mu, nu) are probed, and ku is chosen in a subsequent step. As with NB, the search for optimal unrolling factors is performed twice. This section deals with the values obtained in the second search, i.e. after the final value for NB has been obtained.
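For illustration, the following register-blocked variant of the sketch from Section 2.1 uses mu = 2 and nu = 2. It is again our own example, not an ATLAS-generated kernel: the M and N loops advance in steps of mu and nu, and a mu × nu block of C is accumulated in local variables so that the compiler can keep it in registers, which is what the unrolling factors control.

    /* Sketch of register blocking with unrolling factors mu = 2, nu = 2
     * for the kernel C <- A^T * B + beta*C (column-major, nb assumed to be
     * divisible by 2). Illustration only, not generated ATLAS code. */
    void block_gemm_atb_2x2(int nb, const double *A, const double *B,
                            double beta, double *C)
    {
        for (int j = 0; j < nb; j += 2) {        /* N loop, unrolled by nu */
            for (int i = 0; i < nb; i += 2) {    /* M loop, unrolled by mu */
                double c00 = 0.0, c10 = 0.0, c01 = 0.0, c11 = 0.0;
                for (int k = 0; k < nb; k++) {   /* K loop                 */
                    double a0 = A[i * nb + k],       a1 = A[(i + 1) * nb + k];
                    double b0 = B[j * nb + k],       b1 = B[(j + 1) * nb + k];
                    c00 += a0 * b0;  c10 += a1 * b0;
                    c01 += a0 * b1;  c11 += a1 * b1;
                }
                C[j * nb + i]           = beta * C[j * nb + i]           + c00;
                C[j * nb + i + 1]       = beta * C[j * nb + i + 1]       + c10;
                C[(j + 1) * nb + i]     = beta * C[(j + 1) * nb + i]     + c01;
                C[(j + 1) * nb + i + 1] = beta * C[(j + 1) * nb + i + 1] + c11;
            }
        }
    }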




Fig. 3. Efficiency for the (mu, nu) loop unrolling factors for the energy-optimised (horizontal axis) and the time-optimised (vertical axis) variants (data type: double, NB = 56).

Table 1
CacheEdge for the time-optimised and the energy-optimised version of ATLAS with singlethreaded and multithreaded execution.

                    Energy-optimised (KiB)    Time-optimised (KiB)
  Singlethreaded    4096                      3072
  Multithreaded     512                       2048

Fig. 3 shows the time efficiency and the energy efficiency for different settings of (mu, nu) for the double-precision matrix multiplication. The axes indicate the energy efficiency and the time efficiency for the given values using the ATLAS variants optimised for the respective criterion. The figure shows that there is again a cluster of parameter settings which are efficient in both time and energy. This cluster comprises the values {(4, 1), (10, 1), (8, 1), (12, 1), (6, 1)}, i.e. all values with nu = 1. There is an outlier from this cluster, (mu, nu) = (4, 3), which has a slightly worse time efficiency (1%), but a clearly worse energy efficiency (12%). So if, due to slightly changed conditions, the N loop is unrolled by a factor of 3 instead of by a factor of 1 when optimising for time, the energy efficiency degrades by 12%. This again shows the need for an additional energy-aware tuning process. This finding is supported by [9], where it was also found that loop unrolling factors have to be set differently for time and energy optimisation.

3.3. CacheEdge

Besides the cache-contained low-level matrix multiplication, which optimises for the level-1 cache and has been discussed in the previous sections, ATLAS offers an optimisation for higher levels of cache, i.e. the level-2 or level-3 cache. For this optimisation, which amongst others influences the GEMM-based operations, the parameter CacheEdge is tuned. This parameter is determined empirically and represents the amount of the cache that is usable by ATLAS for its particular kind of blocking [3]. The values obtained by the ATLAS versions replacing time by energy measurement according to Section 2.2.2 are shown in Table 1. The matrix sizes for the measurements are fixed to M × N × K = 112 × 1960 × 9362. The CacheEdge values are taken from the file include/atlas_cacheedge.h.

Table 1 shows lower CacheEdge values for multithreaded execution, independent of the optimisation goal. This effect results from the fact that the level-3 cache is shared between the cores. Thus, a smaller portion of the cache remains for each thread. In Fig. 4, the time efficiency and energy efficiency of the four ATLAS versions are shown. The biggest difference is between the multithreaded and the singlethreaded version of the time-optimised ATLAS: using more processors hardly adds a penalty on the execution time as the maximum execution time of all processors is the decisive factor. In contrast, the energy consumption of all running processors has to be summed up for achieving the total energy consumption. Furthermore, also the communication between the processors needs energy. This might be the reason why the singlethreaded version of the energy-optimised ATLAS is more energy efficient than its multithreaded counterpart.

Fig. 4. Time efficiency and energy efficiency for the CacheEdge measurements of Table 1 for the time-optimised and the energy-optimised version of ATLAS with singlethreaded and multithreaded execution.

3.4. Small/large crossover

For an efficient handling of small and large matrices, it might be worthwhile to use different, specialised implementations which consider their specific characteristics. It is important to find out the performance crossover point between the implementations. This crossover point is the matrix size for which both implementations yield the same performance. Smaller matrices should be processed using the implementation for small matrices, and larger matrices should be processed using the implementation for large matrices. ATLAS provides such implementation variants for small and large matrices and determines the small/large crossover points. For finding out whether there is a difference between time-optimised and energy-optimised small/large crossover points, the file tune/blas/gemm/tfc.c has been adapted. The time efficiency measurement has been complemented by an energy efficiency measurement according to Section 2.2.1.

Fig. 5 shows the ratio of the efficiency of the large to the efficiency of the small implementation variant. Both optimisation goals, energy efficiency and time efficiency, have been measured in roughly 120 tests. If (and only if) the ratio considering time is greater than one and the ratio considering energy is smaller than one, or vice versa, the crossover points differ for time and for energy optimisation. This is true in only a few cases, e.g. no. 9, 68 and 96. Even then, the differences are very small. Thus, for optimising energy efficiency, the small/large crossover points from the time-optimised variant should be a good approximation.

Fig. 5. Execution-time ratios and energy ratios between implementations for large and small data sizes of different BLAS routines. The objective of the experiment is to find the crossover data size, i.e. the size at which the ratio flips from below 1.0 to above 1.0.

3.5. Copy/no-copy crossover

Table 2
Choice of copy/no-copy parameter for gemm depending on the matrix shapes defined by N, M and K and their size defined by M · N · K.

Matrix shapes                         Copy matrices if M · N · K ...
  K < 3NB                             ≤ MNK_K
  K ≥ 3NB, M < 3NB, N < 3NB           ≤ MNK_MN
  K ≥ 3NB, M < 3NB, N ≥ 3NB           ≤ MNK_N
  K ≥ 3NB, M ≥ 3NB, N < 3NB           ≤ MNK_M
  K ≥ 3NB, M ≥ 3NB, N ≥ 3NB           ≤ MNK_GE
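The conditions of Table 2 amount to the decision function sketched below. This is our own illustration, not the ATLAS source in src/blas/gemm/ATL_gemmXX.c; the constant values are hypothetical placeholders for the empirically tuned crossover constants described in the following paragraphs.

    #include <stddef.h>

    /* Hypothetical example values for the tuned crossover constants; in
     * ATLAS they are determined empirically and carry an additional prefix
     * indicating the transposition of the input matrices. */
    enum { MNK_K = 50000, MNK_MN = 150000, MNK_N = 300000,
           MNK_M = 300000, MNK_GE = 500000 };

    /* Returns non-zero if the matrices of an M x N x K gemm call should be
     * copied into block-major format before the multiplication (cf. Table 2). */
    int copy_matrices(size_t M, size_t N, size_t K, size_t NB)
    {
        size_t mnk = M * N * K;

        if (K < 3 * NB)
            return mnk <= MNK_K;
        if (M < 3 * NB)
            return mnk <= (N < 3 * NB ? MNK_MN : MNK_N);
        return mnk <= (N < 3 * NB ? MNK_M : MNK_GE);
    }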

Before the execution of a GEMM-based operation, ATLAS copies and rearranges large matrices in memory for improving the efficiency. Alongside the copying, matrices are transposed (if necessary), the scalar factor α is applied, the data is broken up into contiguous blocks of size NB × NB, and reordered into a block-major format. After the copy operation, the matrices can be multiplied using the low-level (cache-contained) kernel. The copy operation has costs of O(n²), compared to the costs of O(n³) for the multiplication operation for large matrices. Thus, for large matrices, the copy operation does not have a substantial impact on the execution time, whereas for small matrices, the copy operation has a relevant impact. There is a specific matrix size, called the crossover point, which reflects the boundary between the copy and the no-copy strategy: If the matrices are larger than the crossover point, they are copied. If the matrices are smaller, they are not copied. According to [3], this crossover point strongly depends on the hardware architecture, thus making a tuning of this parameter necessary.

The crossover point not only depends on the hardware architecture, but also on the shape of the matrices being processed. Table 2 indicates under which conditions the matrices are copied (as coded in src/blas/gemm/ATL_gemmXX.c). For example, if K is smaller than 3 · NB, then the matrices are copied if and only if the product M · N · K is not greater than the crossover constant MNK_K. If all dimensions are greater than or equal to 3 · NB, then the matrices are copied if and only if their product is not greater than the crossover constant MNK_GE. These crossover constants are determined empirically. Each of them receives a prefix indicating whether the matrices involved are transposed (T) or not transposed (N). The code for the determination of the copy/no-copy crossover points can be found in tune/blas/gemm/mmcuncpsearch.c. When using the energy consumption as optimisation goal, the time measurement has been replaced by energy measurement in the file bin/gemmtst.c. No precautions are taken that the other tuning parameters are identical in the time-optimising and the energy-optimising variants as the copy/no-copy search is hard to isolate.

The results of the experiment are shown in Fig. 6. The bars indicate the mean values of the crossover points for the different matrix shapes as obtained in 5 runs. Additionally, the 90% confidence intervals are given. For each matrix shape, there is one crossover point for the time-optimised ATLAS and one crossover point for the energy-optimised ATLAS. There is a tendency towards smaller crossover points for the energy-optimised variant. Due to the large error intervals, this tendency is, however, not significant in most cases.

The reason for the large error intervals is the wide variation in the results for each matrix shape. Only for the MNK_MN matrix shapes, the crossover points for the energy-optimised variant are significantly smaller than for the time-optimised variant. This means that, for an energy-optimised operation, sometimes a relatively small matrix is copied which would not have been copied for a time-optimised operation. The results indicate that a regular execution scheme, which is ensured by the matrix copying, is more beneficial for energy efficiency than for time efficiency. As a consequence, more matrices are copied if ATLAS is optimised for energy consumption. This partly contradicts findings that intra-node data movement has a large impact on the energy consumption [17,24]. This effect, however, might not be visible here as the energy consumed by the RAM is not considered by the RAPL register used.

Fig. 6. Copy/no-copy crossover points for double-precision operations with different matrix shapes (90% confidence intervals given).

3.6. Simultaneous multithreading

Simultaneous multithreading, or hyperthreading, is a technique for improving the efficiency of CPUs by providing a number of logical cores for each physical core. The threads may use different components of the core concurrently, e.g. one thread uses the integer and logical unit, another thread uses the floating-point unit. Thus, using simultaneous multithreading can decrease the execution time. Whether it also decreases the energy consumption is investigated in this section. ATLAS offers the possibility to define a maximum number of threads to be used for multithreaded execution, which is considered during the tuning. For the experiments, this number has once been set to the number of threads reported by the CPU and once to half of that number, i.e. once with and once without simultaneous multithreading.

The results of an experiment investigating the execution time and the energy consumption for the dgemm multiplication of N × N matrices with and without simultaneous multithreading are presented in Fig. 7(a). It can be seen that the energy consumption decreases by 10% when simultaneous multithreading is not used, whereas the execution time remains nearly constant. The results for the dgemv routine multiplying an N × N matrix with a vector of length N are shown in Fig. 7(b). These results, in contrast to the results of the dgemm routine, do not show any substantial differences between simultaneous multithreading switched on and off.

3.7. Thread affinity

The Linux operating system allows a thread affinity to be assigned to each thread. The thread affinity specifies on which of the available CPU cores a thread may be executed. By setting the thread affinity, one can prevent a thread from migrating between cores, which always causes a penalty due to the need to transfer register data and to rebuild the caches. On the other hand, migrating a thread to another core allows the exploitation of computing resources which otherwise might be idle. Whether preventing thread migration does or does not yield a performance benefit is investigated in this section. Simultaneous multithreading has been turned off during the experiments.

The results of the experiment are shown in Fig. 8. The chart indicates that there is no relevant difference between optimising for energy or for execution time with respect to the thread affinity. For the dgemm routine, the execution using thread migration exhibits roughly 12% decreased values for both execution time and energy. The performance of the dgemv routine is not influenced by allowing or preventing thread migration.
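For illustration, a fixed thread affinity under Linux can be set as sketched below. This shows the general mechanism discussed above via the pthread interface; it is not the code ATLAS uses internally to set its affinity masks.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Pin the calling thread to a single CPU core so that the scheduler
     * cannot migrate it (one way to realise a fixed thread affinity).
     * Returns 0 on success, an error number otherwise. */
    int pin_current_thread_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }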

3.8. Block size for the TRSM routine

Besides the NB for the cache-contained matrix multiplication, there exist similar quantities for other routines, such as TRSM. TRSM implements a triangular solver and solves systems of the form Ax = b, where A is a square triangular matrix [25]. TRSM is, for example, needed for the Cholesky factorisation [26]. Fig. 9 shows the energy efficiency and the time efficiency for different values of the TRSM NB; this figure is analogous to Fig. 2 for the matrix multiplication. In contrast to the matrix multiplication, the values are much closer to a function approximated by εt = 42 J/s · εE, where εt is the time efficiency and εE is the energy efficiency. This indicates that the differences in the energy consumption are mainly caused by the differences in execution time. A value of P = 42 W as found by the regression is a credible value for the power of the CPU package.

4. Complete-package energy autotuning

This section presents the results of the complete adaptation of ATLAS for energy optimisation. The routine ATL_cputime measuring the time is replaced by a routine measuring the energy as discussed in Section 2.2.2. So the optimisations minimise the energy consumption instead of the execution time. A selection of four routines representing the BLAS levels 1, 2, and 3 has been investigated with the complete-package energy autotuning. Fig. 10 shows the results of the measurements. The routines investigated are the gemm general matrix-matrix multiplication, the gemv general matrix–vector multiplication, the spmv symmetric packed matrix–vector multiplication, and the axpy scaled vector addition. The charts show the execution time and the energy consumption for one call of the respective routine with different input data sizes, as well as the resulting electrical power of the CPU. The measurements have been taken both for operations with single precision and for operations with double precision floating point numbers.

Surprisingly, the results show that the tuning for energy, compared to tuning for time, often improves both the time efficiency and the energy efficiency. For instance, for the gemm routine with double precision, the energy consumption decreases by roughly 6% while the execution time decreases by roughly 4%. This results in a decreased power (i.e. the ratio between energy and time) of roughly 6%. With single precision, time and energy consumption decrease by roughly 4%. The biggest reduction in the energy consumption is observed for the gemv routine, which has a decrease of roughly 10% to 15% for both data types. A relatively low decrease of roughly 2% is observed for the spmv routine, which might be caused by irregularities in the memory access due to the packed storage format of the matrix.


Fig. 7. Execution with and without simultaneous multithreading (hyperthreading), showing the execution time (left axis) and the consumed energy (right axis) for the DGEMM routine (a) and the DGEMV routine (b) depending on the matrix and vector sizes.


Fig. 8. Execution with different thread affinity masks allowing or preventing thread migration between CPU cores, showing the execution time (left axis) and the consumed energy (right axis) for the DGEMM routine (a) and the DGEMV routine (b) depending on the matrix and vector sizes.

Fig. 9. Time efficiency vs. energy efficiency for different values for the TRSM NB block size parameter.

In contrast, the gemm and gemv routines are routines with very regular memory access patterns, which can be optimised more easily. A fluctuating behaviour between an increase and a decrease in energy consumption of up to 9% is observed for the axpy routine. The axpy routine is memory-bound, which might explain the behaviour differing from the other routines.

In some cases, e.g. the single-precision gemm, the energy-optimised variant needs a shorter execution time than the time-optimised variant. The reason for that might be the more precise measurement due to the increased number of repetitions employed for the energy optimisation. However, also in such cases, the energy consumption decreases by a higher percentage than the execution time. This is shown by the fact that a lower power can be observed.

The observed power values, see Fig. 10 bottom, are spread rather than approximating a linear behaviour. On the one hand, this can in some cases be deduced from small steps or jumps in the energy and time measurement values. On the other hand, this shows that time and energy do not correlate completely but have small deviations for most of the measured values. In other words, the non-linearity of the power values indicates that energy consumption and execution time of these routines do not fully, though still very strongly, correlate. The goal of the investigation does not lie in the optimisation of ATLAS routines for power usage. Rather, the focus is on the generation of energy-saving routines in general. The values obtained for power consumption thus do not reflect the results and goals of our work but are given as additional information for completeness.

5. Related work

Energy efficiency for the execution of linear algebra routines has been investigated in several works: [14] measures the core, uncore and memory energy consumption for the BLAS-3 routine dsyr2k, for the BLAS-2 routine dsymv, and for the LAPACK routine dsytrd using the RAPL interface. The work considers different numbers of threads and dynamic voltage and frequency scaling. The main finding is that for achieving the most energy-efficient solution, a number of parameters such as problem size and arithmetic intensity have to be considered. [17] measures the execution time and the energy consumption of the dgemm BLAS routine considering dynamic voltage and frequency scaling. [18] presents power profiles for Cholesky and QR factorisations on distributed-memory machines with the libraries DPLASMA and ScaLAPACK.




Fig. 10. Execution time, energy consumption and power for the BLAS routines gemm, gemv, spmv and axpy for the time-optimised and the energy-optimised variants of ATLAS. The power values are additionally given to provide a detailed picture of the measurement results and were not used for the autotuning process.

The main finding is that the tile-based DPLASMA is up to two thirds more energy efficient than the block-based ScaLAPACK. [27] investigates the energy efficiency of the implementations of Cholesky, QR and LU factorisation in the LAPACK and PLASMA libraries. It investigates differences arising from the implementation as block algorithms in LAPACK and as tile algorithms in PLASMA. These results also show that tile algorithms, which are designed for a higher degree of parallelism, have a much lower energy consumption than block-based algorithms.

For improving the energy efficiency at the hardware level, [28] presents a CPU/FPGA combination for energy-efficient linear algebra.

The FPGA implements the vector–vector dot product and a matrix–vector product with various data types, which makes it possible to execute a matrix-matrix multiplication. Offloading the computation to the FPGA increases the energy efficiency and causes no substantial difference in the execution time. [29] discusses the trade-off between execution time and energy consumption for matrix inversion on three different machines: a conventional general-purpose multicore processor, a low-power CPU and a low-power CPU complemented by a GPU.

Autotuning methods for minimising the energy consumption of an application have, for example, been investigated in [24,4,6,9]: [24] minimises the energy consumption of an LU factorisation by
an automated minimisation of the memory traffic based on a model for the number of memory accesses. [4] minimises the energy consumption of the Fast Multipole Method, a particle simulation, by selecting the optimal depth of the particle tree dividing the interactions into near-field and far-field interactions. [6] presents an autotuning approach using polyhedral optimisation. Three scientific kernels from the optimiser's test suite are tuned for minimal energy consumption on the Xeon Phi. [9] autotunes the unrolling factors for the two loops of a stencil operation within a Poisson's equation solver. The objective function is the energy-delay product. None of the above works deals with tuning BLAS operations for energy consumption as the present work does.

Multi-objective optimisation, viz. optimisation for energy consumption and a further criterion, is dealt with in [7,30,31]. In [7], such a multi-objective autotuner is presented. The article investigates the behaviour of three different hardware architectures, a Xeon Phi, a multi-core Xeon, and a Blue Gene, when executing a number of scientific kernels with dynamic voltage and frequency scaling. Voltage and frequency scaling has not been investigated in the present work as we were mainly interested in the tuning parameters which are already available in ATLAS and in a concise study of the differences in behaviour for time and energy. [30] presents an autotuner with multi-objective optimisation for time, energy and resource usage. A number of scientific kernels, including gemm and syr2k, are investigated. A detailed analysis of the parameters is only given for gemm. The article mainly focuses on the number of CPU cores and the frequency as parameters, which is complementary to our investigation goal. In [31], the gemm, gemv and ger BLAS routines are investigated. Different orders of exploring the search space, i.e. the order of applying optimisations, such as blocking or parallelisation, are investigated. For our research, we keep the order of optimisations unchanged and focus mainly on the influence of optimisation parameters.

A comprehensive survey on the recent developments in the field of energy-aware and energy-efficient linear algebra methods is given in [32]. So far, there has not been a thorough investigation providing detailed data of the optimisation for time or for energy as given here.

6. Conclusion and future work

This article has evaluated the possibilities to autotune the ATLAS linear algebra package for a minimum energy consumption. The influence of the chosen tuning parameters on the energy consumption has been investigated. Some tuning parameters show a strong influence, some only show a small influence on the overall energy consumption. The study shows that tuning for energy consumption is necessary for obtaining an energy-optimised ATLAS library because time and energy optimisation do not entirely correlate, as has been shown. For several parameters, including the block size NB for the matrix multiplication and the unrolling factors, there are parameter settings which yield similar execution times, but cause large differences (up to 10%) in the energy consumption. For these parameters, an explicit analysis of the energy consumption is needed in order to avoid that the time-optimising variant selects a non-optimal parameter setting regarding the energy consumption.
A remarkable outcome is that especially parameters affecting the matrix-matrix multiplication need different settings when optimised for time and for energy consumption. This includes the block size and unrolling factors for the low-level matrix multiplication as well as the usage of simultaneous multithreading for dgemm. The reason for that could be that the matrix-matrix multiplication is optimised so well that it is compute-bound. This means that the execution time mainly depends on the time that the CPUs need to perform the necessary calculations. For a given
data size, the number of floating-point operations is constant, and consequently also the execution time, no matter which further optimisations are applied. The energy efficiency, however, can still be increased by reducing the number of memory accesses, as the roofline model for energy shows [10]. As several parameter settings yield a compute-bound matrix multiplication for ATLAS, one can select from those settings which have the lowest energy consumption, thus providing a multi-objective optimisation.

Other parameters, such as the trsm NB or the thread affinity for dgemv, do not show substantial differences between time efficiency and energy efficiency for the different parameter settings. One can assume that here the execution is memory-bound so that the largest potential in saving energy comes from saving execution time. Then, a time-efficient parameter setting implies an energy-efficient parameter setting.

In addition to evaluating the influence of single parameters, a version of ATLAS has been created which replaces the time measurement by an energy measurement. This variant of the library is then completely tuned for a minimal energy consumption. This energy-tuned ATLAS library consists of routines which need less energy, from a few percent up to 10%, than the routines of an ATLAS library tuned for time. Also here, the gemm stands out as its energy-optimised variant has substantially lower power values than its time-optimised counterpart for all matrix sizes and for single precision as well as for double precision. No changes were made to the algorithms and transformations already present in ATLAS for performance tuning. Only the parameters were set to values for which the test results showed a lower energy consumption. Hence, the enrichment of ATLAS by codes and transformations taking energy-beneficial variants into account might improve the obtained results even further.

In this study, the energy consumption was the optimisation criterion for the adapted ATLAS. The results can be used as a base for applying compound scalar metrics such as energy-delay products. If ATLAS is to be adapted to perform fully automated multi-objective optimisation, e.g. for energy and execution time, some fundamental changes will have to be made to the architecture of the optimiser. The results of this study may also serve as a base for a version of ATLAS considering a more comprehensive set of parameters that have a greater impact on the energy consumption, such as frequency scaling or the amount of memory traffic. Including these parameters into the autotuning decision might result in an even higher saving in energy consumption.

Future work includes an investigation of other autotuned linear-algebra packages, such as MAGMA [33]. Research is needed for assessing whether the same energy tuning methods as used here can be applied to MAGMA. Especially, the heterogeneity of machines, which is considered by MAGMA, will need special attention. From comparing different autotuned linear-algebra packages, it might be possible to set up general, evidence-based rules for energy optimisation if parameters such as unrolling factors or cache usage show a similar behaviour in these packages.

Acknowledgements

This work has partially been supported by Deutsche Forschungsgemeinschaft, grant number RU 591/10-2, and the German Federal Ministry of Education and Research (BMBF), project number 01IH16012B.

References

[1] K. Asanovic, R. Bodik, B.C. Catanzaro, J.J. Gebis, P. Husbands, K. Keutzer, D.A. Patterson, W.L. Plishker, J. Shalf, S.W. Williams, K.A. Yelick, The landscape of parallel computing research: A view from Berkeley, Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, 2006. URL http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf.


[2] R. van de Geijn, K. Goto, BLAS (Basic linear algebra subprograms), in: D. Padua (Ed.), Encyclopedia of Parallel Computing, Springer, 2011, pp. 157–164.
[3] R.C. Whaley, A. Petitet, J.J. Dongarra, Automated empirical optimizations of software and the ATLAS project, Parallel Comput. 27 (1–2) (2001) 3–35.
[4] J. Lang, Energie- und Ausführungszeitmodelle zur effizienten Ausführung wissenschaftlicher Simulationen (Ph.D. thesis), Technische Universität Chemnitz, Department of Computer Science, Chemnitz, Germany, 2014.
[5] J. Carretero, S. Distefano, D. Petcu, D. Pop, T. Rauber, G. Rünger, D. Singh, Energy-efficient algorithms for ultrascale systems, Supercomput. Front. Innov. 2 (2) (2015). http://dx.doi.org/10.14529/jsfi150205.
[6] W. Wang, J. Cavazos, A. Porterfield, Energy auto-tuning using the polyhedral approach, in: Proc. International Workshop on Polyhedral Compilation Techniques at HiPEAC 2014 conference, 2014.
[7] P. Balaprakash, A. Tiwari, S. Wild, Multi-objective optimization of HPC kernels for performance, power, and energy, in: S.A. Jarvis, S.A. Wright, S.D. Hammond (Eds.), High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation, in: Lecture Notes in Computer Science, Springer, 2014, pp. 239–260.
[8] T. Katagiri, C. Luo, R. Suda, S. Hirasawa, S. Ohshima, Energy optimization for scientific programs using auto-tuning language ppOpen-AT, in: Embedded Multicore SoCs, MCSoC, 2013 IEEE 7th International Symposium on, 2013, pp. 123–128.
[9] A. Tiwari, M.A. Laurenzano, L. Carrington, A. Snavely, Auto-tuning for energy usage in scientific applications, in: Proceedings of the 2011 International Conference on Parallel Processing - Volume 2, Euro-Par'11, Springer, 2012, pp. 178–187.
[10] J.W. Choi, D. Bedard, R. Fowler, R. Vuduc, A roofline model of energy, in: 2013 IEEE 27th International Symposium on Parallel Distributed Processing, IPDPS, 2013, pp. 661–672. http://dx.doi.org/10.1109/IPDPS.2013.77.
[11] J. Lang, G. Rünger, P. Stöcker, Towards energy-efficient linear algebra with an ATLAS library tuned for energy consumption, in: The 2015 International Conference on High Performance Computing & Simulation, HPCS 2015, 2015.
[12] R.W. Vuduc, Autotuning, in: D. Padua (Ed.), Encyclopedia of Parallel Computing, Springer, 2011, pp. 102–105.
[13] K. Yotov, X. Li, G. Ren, M. Garzaran, D. Padua, K. Pingali, P. Stodghill, Is search really necessary to generate high-performance BLAS? Proc. IEEE 93 (2) (2005) 358–386.
[14] J.I. Aliaga, M. Barreda, M.F. Dolz, E.S. Quintana-Ortí, Are our dense linear algebra libraries energy-friendly? Comput. Sci.-Res. Dev. (2014) 1–10.
[15] E. Rotem, A. Naveh, A. Ananthakrishnan, D. Rajwan, E. Weissmann, Power-management architecture of the Intel microarchitecture code-named Sandy Bridge, IEEE Micro 32 (2) (2012) 20–27.
[16] J. Lang, G. Rünger, An execution time and energy model for an energy-aware execution of a conjugate gradient method with CPU/GPU collaboration, J. Parallel Distrib. Comput. 74 (9) (2014) 2884–2897. http://dx.doi.org/10.1016/j.jpdc.2014.06.001.
[17] J. Demmel, A. Gearhart, Instrumenting linear algebra energy consumption via on-chip energy counters, Tech. Rep. UCB/EECS-2012-168, EECS Dept., University of California, Berkeley, 2012.
[18] G. Bosilca, H. Ltaief, J. Dongarra, Power profiling of Cholesky and QR factorizations on distributed memory systems, Comput. Sci.-Res. Dev. 29 (2) (2014) 139–147.
[19] T. Rauber, G. Rünger, M. Schwind, H. Xu, S. Melzner, Energy measurement, modeling, and prediction for processors with frequency scaling, J. Supercomput. 70 (3) (2014) 1451–1476. http://dx.doi.org/10.1007/s11227-014-1236-4.
[20] D. Hackenberg, T. Ilsche, R. Schöne, D. Molka, M. Schmidt, W.E. Nagel, Power measurement techniques on standard compute nodes: A quantitative comparison, in: 2013 IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS, 2013, pp. 194–204. http://dx.doi.org/10.1109/ISPASS.2013.6557170.
[21] S. Browne, J. Dongarra, N. Garner, G. Ho, P. Mucci, A portable programming interface for performance evaluation on modern processors, Int. J. High Perform. Comput. Appl. 14 (3) (2000) 189–204.
[22] Intel, Intel 64 and IA-32 Architectures Software Developers Manual, May 2012.




[23] C. Angelini, The Core i7-4770K review: Haswell is faster; desktop enthusiasts yawn, Jun. 2013. URL http://www.tomshardware.com/reviews/core-i7-4770k-haswell-review,3521.html [accessed 18.12.14].
[24] E. Garcia, J. Arteaga, R. Pavel, G.R. Gao, Optimizing the LU factorization for energy efficiency on a many-core architecture, in: C. Caşcaval, P. Montesinos (Eds.), Languages and Compilers for Parallel Computing, in: Lecture Notes in Computer Science, Springer, 2014, pp. 237–251.
[25] J.J. Dongarra, J. Du Croz, S. Hammarling, I.S. Duff, A set of level 3 basic linear algebra subprograms, ACM Trans. Math. Software 16 (1) (1990) 1–17.
[26] E. Jeannot, Symbolic mapping and allocation for the Cholesky factorization on NUMA machines: Results and optimizations, Int. J. High Perform. Comput. Appl. 27 (3) (2013) 283–290.
[27] H. Ltaief, P. Luszczek, J. Dongarra, Profiling high performance dense linear algebra algorithms on multicore architectures for power and energy efficiency, Comput. Sci.-Res. Dev. 27 (4) (2012) 277–287.
[28] H. Giefers, R. Polig, C. Hagleitner, Analyzing the energy-efficiency of dense linear algebra kernels by power-profiling a hybrid CPU/FPGA system, in: 2014 IEEE 25th International Conference on Application-specific Systems, Architectures and Processors, ASAP, 2014, pp. 92–99.
[29] P. Benner, P. Ezzatti, E.S. Quintana-Ortí, A. Remón, Trading off performance for power-energy in dense linear algebra operations, in: Proceedings of VI Latin American Symposium on High Performance Computing, HPCLatAm 2013, 2013.
[30] P. Gschwandtner, J.J. Durillo, T. Fahringer, Multi-objective auto-tuning with Insieme: Optimization and trade-off analysis for time, energy and resource usage, in: F. Dutra, V. Santos Costa (Eds.), Euro-Par 2014 Parallel Processing, in: Lecture Notes in Computer Science, vol. 8632, Springer, 2014, pp. 87–98.
[31] S.F. Rahman, J. Guo, Q. Yi, Automated empirical tuning of scientific codes for performance and power consumption, in: Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers, HiPEAC'11, ACM, 2011, pp. 107–116.
[32] L. Tan, S. Kothapalli, L. Chen, O. Hussaini, R. Bissiri, Z. Chen, A survey of power and energy efficient techniques for high performance numerical linear algebra operations, Parallel Comput. 40 (10) (2014) 559–573.
[33] Y. Li, J. Dongarra, S. Tomov, A note on auto-tuning GEMM for GPUs, in: Proceedings of the 9th International Conference on Computational Science: Part I, ICCS'09, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 884–892. http://dx.doi.org/10.1007/978-3-642-01970-8_89.

Thomas Jakobs is a research assistant at the chair of Practical Computer Science at the Faculty of Computer Science at Technische Universität Chemnitz. His current research covers energy- and power-aware vectorisation and the use of hardware transactional memory in scientific and parallel computing environments.

Jens Lang is a senior researcher at the chair of Practical Computer Science at the Faculty of Computer Science at Technische Universität Chemnitz. His research interests include scientific simulations with a focus on autotuning for an efficient and energy-aware execution on parallel computers as well as on graphics processing units.

Gudula Rünger received a doctoral degree in Mathematics from the University of Cologne in 1989 and a habilitation in Computer Science from Saarland University, Saarbrücken, in 1996. From 1997 to 2000 she was a Professor of Parallel Computing and Complex Systems at Leipzig University. Since 2000, she has been a full Professor of Computer Science at Technische Universität Chemnitz. Her research interests include parallel and distributed programming, parallel algorithms for scientific applications as well as software development in science and engineering.

Paul Stöcker is a master's student at Technische Universität Chemnitz.