
Comprehensive cache performance tuning with a toolset

Jie Tao
Steinbuch Centre for Computing, Karlsruhe Institute of Technology, Germany

Article history: Received 30 September 2008; received in revised form 26 February 2009; accepted 22 July 2009; available online 8 August 2009.

Abstract: Processor speed is increasing exponentially, while memory speed improves much more slowly. As a result, the overall performance of a computing system is increasingly constrained by memory performance. This paper describes an approach for improving the cache hit ratio and thereby the efficiency of the memory system. The approach is based on a set of performance tools capable of presenting the cache problems, the reasons for them, and the solutions for tackling them.

© 2009 Elsevier B.V. All rights reserved.

Keywords: Memory performance; Code optimization; Performance visualization; Cache simulation; Prefetching

1. Introduction

Cache memory was introduced into computer systems to bridge the speed disparity between the processor and main memory. Since the appearance of the first on-chip cache in 1985, cache capacity has grown by several orders of magnitude. However, the size of applications has grown at the same time, by an even larger factor than the cache. As a consequence, the working set of a typical application cannot be held in the cache, and many applications still suffer from a large number of cache misses. This causes significant performance degradation, because an access to main memory takes hundreds of cycles while a cache access needs only a few. According to [1], the SPEC2000 benchmarks running on a modern, high-performance microprocessor spend over half of their time stalling for loads that miss in the last-level cache. Hence, researchers have been exploring cache optimization techniques.

For years, compiler developers have used various optimization schemes [2–5] to restructure code and data for better runtime cache performance. A more direct way, however, is to apply these optimization techniques to the application source code itself. Unlike the compiler approach of trying different transformations without knowing the underlying problem, explicit optimization targets the problems exactly and can therefore deliver a larger performance gain.

Efficient source code optimization relies on knowledge about the access pattern of a program.




However, this knowledge usually cannot be obtained simply by analysing the program code. We therefore provide a set of performance tools that help the user understand the data access characteristics and the runtime cache behaviour of an application. In contrast to existing tools, which are mostly limited to presenting the problems, this toolset additionally shows the reasons for the problems and the means for tackling them.

The toolset contains several components: a data profiler, a cache simulator, and a pattern analyser for data acquisition; a visualization tool for data presentation; a prefetching tool; and a graphical user interface for integration. The profiling tool exploits the hardware monitoring support of modern processors: it collects the cache misses captured by the performance counters and computes the access hotspots. The cache simulator delivers information about the runtime cache operations and analyses the character of the cache misses. The pattern analyser detects simple, regular access patterns. The gathered performance data are presented to the user by the visualization tool with high-level graphical views. These views are specifically designed to guide programmers step by step to the problem and its solution, and to the concrete optimization targets where the code structure has to be modified. To further simplify the user's work, we provide a prefetching tool that automatically inserts prefetching instructions into the source code.

The functionality of the toolset has been verified with several benchmarks and realistic applications. In this paper we choose an image compression code as an example to demonstrate the process of applying the toolset to optimize a program.

The remainder of the paper is organized as follows. Section 2 gives a brief overview of the tool infrastructure.


Fig. 1. The proposed tool architecture: a code instrumentor, data profiler, cache simulator, and pattern analyser for data acquisition; a pre-compiler that emits code with prefetching instructions; a visualization tool presenting access patterns, bottlenecks, and miss reasons; and the PDK front end.

Sections 3 and 4 describe how the tools both automatically improve the source code and help users achieve this goal themselves. Section 5 shows the experimental results. In Section 6 related work is discussed. The paper concludes in Section 7 with a short summary and some future directions.

2. The tool infrastructure

The goal of any performance toolkit is to help users understand the runtime behaviour of an application. The design of the graphical interface is one factor that influences how much knowledge users can acquire and how easily they can acquire it. However, the quality of the collected performance data is a more important factor, because runtime data are the basis for any graphical presentation. Therefore, data acquisition and visualization are the focus of this work. For the former, comprehensive information is acquired with hardware counters, simulation, and automatic analysis; for the latter, a visualization tool is specially designed to present the runtime operations of both the cache and the application in a user-understandable way.

Fig. 1 shows the developed tools and their interactions. The left side of the figure depicts the components for gathering performance data: a code instrumentor, a data profiler, a cache simulator, and a pattern analyser. The code instrumentor does not produce performance data directly; rather, it delivers memory reference traces which are the input of both the pattern analyser and the cache simulator. The instrumentor is a modified version of an existing one called Doctor. Doctor was originally developed as part of Augmint [6], a multiprocessor simulation toolkit for Intel x86 architectures. We modified Doctor to obtain a standalone instrumentor that is independent of the simulation environment.

The memory reference trace is used by the pattern analyser to discover regular access patterns. Currently, this component delivers two kinds of affinity information: repeated access sequences and access strides. The former shows which memory addresses are frequently accessed in succession, while the latter indicates which element in a data array is the next access target. These access patterns can be used to guide the user in preloading required data. However, inserting prefetching instructions into the source code is tedious work for the user. Therefore, we have developed a precompiler, currently targeting C/C++ codes, that automatically extends the source program with prefetching.

The memory reference trace is also used by the simulator to model the functionality of the cache memory and thereby to collect detailed information about cache events.

This information is an extension of the statistical data acquired using the hardware counters.

The gathered performance data are then delivered to the visualization tool, where the low-level information is presented with understandable graphical views. These presentations allow the user to easily understand the access pattern, to detect mapping conflicts or capacity problems, and to find optimization schemes. The last component in the tool infrastructure is a program development kit (PDK). It provides an interface for users to operate the tools and to develop their applications: programmers can use it to start the tools, understand the runtime access behaviour, modify, build and run the program, and finally evaluate the optimization impact.

We build two paths to deal with the different kinds of cache misses. Traditionally, cache misses are divided into three categories: compulsory, capacity, and conflict. Compulsory misses occur when data are accessed for the first time. Conflict misses occur when a data block has to be removed from the cache due to mapping overlaps. Capacity misses occur when the cache is smaller than the working set. Our first optimization path involves the pattern analyser and the precompiler, and aims at reducing compulsory misses; the following section describes this path and its associated tools in detail. The other path aims at helping programmers tackle conflict and capacity misses. It uses the data profiler and cache simulator to acquire performance data and the visualization tool to present the access pattern and the miss characteristics. Section 4 demonstrates the procedure of using this path to optimize an image compression code.
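The three miss categories above can be made concrete with a small, hypothetical C fragment. The array sizes and the power-of-two layout are illustrative assumptions, not taken from the paper; they merely show a case where each category of miss would typically appear.

```c
/* Two large arrays; 4 MB each, so together they exceed typical L1/L2 caches. */
float a[1 << 20];
float b[1 << 20];

float miss_examples(void)
{
    float s = 0.0f;

    /* Compulsory misses: the first touch of each cache line of a[] must miss. */
    for (int i = 0; i < (1 << 20); i++)
        s += a[i];

    /* Capacity misses: a[] is revisited, but 4 MB no longer fits in the cache,
     * so the reused lines have already been evicted.                           */
    for (int i = 0; i < (1 << 20); i++)
        s += a[i];

    /* Conflict misses: when a[i] and b[i] happen to lie a large power-of-two
     * distance apart, they can map to the same cache set and evict each other,
     * even though the data touched per iteration is tiny.                      */
    for (int i = 0; i < (1 << 20); i++)
        s += a[i] + b[i];

    return s;
}
```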


3. Guided prefetching based on access patterns

The most efficient way to reduce compulsory misses is prefetching [7]. This technique attempts to load data into the cache before they are requested. A key issue is to avoid prefetching data that are not required soon: such inefficient prefetching may cause additional cache misses, because the prefetched data can evict frequently reused data from the cache. We propose a form of guided prefetching which uses the access pattern of an application to prevent such inefficiency.

3.1. Pattern acquisition

For acquiring access patterns we have developed an analysis tool which currently detects both access chains and access strides. An access chain is a group of accesses that repeatedly occur together but target different memory locations; an example is the successive access to the same component of different instances of a struct. This information directly shows the next requested data, which is exactly the prefetching target. For detecting access chains the analysis tool deploys Teiresias [8], an algorithm often used for pattern recognition in bioinformatics. For pattern discovery, the algorithm first performs a scan/extension phase that generates small patterns of a predefined length. It then combines small patterns in which the prefix of one pattern is the suffix of another into larger ones; for example, from the patterns DFCAPT and APTSE the pattern DFCAPTSE is generated.

For detecting access strides, the analyser applies an algorithm similar to that described in [9]. This algorithm uses a search window to record the difference between each pair of references. Accesses with the same difference are combined to form an access stride of the form ⟨start_address, stride, repeating_times⟩. For example, a stride ⟨200, 100, 50⟩ indicates that, starting with address 200, memory locations with an address difference of 100 (i.e. 300, 400, 500, etc.) are also requested, and this pattern repeats 50 times.

The detected access patterns are stored in an XML file. Together with each access chain or stride, the position in the source code is also recorded. This allows us to track the patterns in the program source and thereby to find the location for inserting prefetching code.
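The stride detection can be summarized with a small sketch. The version below is a deliberately simplified assumption: it only compares consecutive references, whereas the analyser described above uses a search window as in [9] to tolerate interleaved access streams.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch of stride detection: runs of equal address differences
 * in a reference trace are reported as <start_address, stride, repeating_times>. */
void detect_strides(const uintptr_t *trace, int n)
{
    if (n < 2)
        return;

    uintptr_t start  = trace[0];
    long      stride = (long)(trace[1] - trace[0]);
    int       count  = 1;                       /* repetitions of the current stride */

    for (int i = 2; i < n; i++) {
        long diff = (long)(trace[i] - trace[i - 1]);
        if (diff == stride) {
            count++;                            /* same difference: extend the stride */
        } else {
            if (count > 1)                      /* report only repeated patterns */
                printf("<%lu, %ld, %d>\n", (unsigned long)start, stride, count);
            start  = trace[i - 1];              /* start a new candidate stride */
            stride = diff;
            count  = 1;
        }
    }
    if (count > 1)
        printf("<%lu, %ld, %d>\n", (unsigned long)start, stride, count);
}
```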
3.2. Automatic prefetching

With the access patterns in hand, we need a precompiler to generate an optimized source program with prefetching inserted. This is essentially the task of a parser. We could have deployed the parser of an existing compiler and built the prefetching on top of it; however, such parsers are usually of a complexity that is not necessary for this work. Hence, we developed a simple parser specifically for the purpose of prefetching.

The key issue for the parser is to determine what and where to prefetch. For the former, the access chains and strides clearly give the answer. However, the access patterns are expressed in virtual addresses, while the parser works with variables. Therefore, an additional component transforms the addresses in a pattern into the corresponding data structures in the program. Previously, we relied on specific macros to register variables and generate a mapping table at runtime; currently, we obtain the mapping table automatically by extracting the static data structures from the debugging information and instrumenting malloc calls for dynamic ones. Based on the variable names, the parser can build prefetching instructions like prefetch(p), where p is the variable to prefetch. For arrays and linked references, such as the struct data structure in C/C++ programs, it must additionally examine the corresponding code line to acquire further information, for example about the dimensions of an array.

The other important issue is to decide the location where the prefetching instruction is inserted, since this location directly determines the prefetching efficiency. If the data are loaded into the cache too early, they can be evicted before being used; if the prefetch is not issued early enough, the data may not be in the cache when the processor requests them. It is, however, quite tedious to compute the prefetching location accurately. For simplicity, in this initial work we insert the prefetching instruction several iterations ahead of the access, and users are asked to specify this prefetch distance.
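As an illustration of the kind of output such a pass could produce, the fragment below prefetches array elements a fixed distance ahead of their use. The loop, the distance of 8 iterations, and the use of GCC/Clang's __builtin_prefetch are assumptions for this sketch; the toolset itself emits its own prefetch(p) calls as described above.

```c
#define PREFETCH_DISTANCE 8   /* user-specified distance, in loop iterations */

/* Hypothetical result of the prefetching pass for a strided read of in[]. */
double sum_strided(const double *in, int n, int stride)
{
    double sum = 0.0;
    for (int i = 0; i < n; i += stride) {
        /* Inserted prefetch: request the element that will be used
         * PREFETCH_DISTANCE iterations from now (read access, low locality). */
        if (i + PREFETCH_DISTANCE * stride < n)
            __builtin_prefetch(&in[i + PREFETCH_DISTANCE * stride], 0, 1);
        sum += in[i];
    }
    return sum;
}
```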


Knowing what to prefetch, the parser also has to identify the individual code line in the program where the prefetch is to be inserted. For this, the source code is scanned for keywords that mark the start or end of a code line, such as if, for, and ;. Comments and parallelization directives are removed before the scanning process for efficiency and reinserted into the resulting program afterwards.

Overall, based on the parser and the other components, an optimized version of an application is generated in which each prefetching instruction corresponds to a detected access pattern. This new version can be compiled in the usual way and executed on a target architecture that supports software prefetching.

4. Detecting optimization means with visualization

The primary goal of the developed toolset is to support user-level code optimization. For this, we not only show users which data structure or code segment causes the most cache misses but also tell them how the cache misses are generated. This knowledge helps users find an appropriate optimization technique and its associated parameters.

The visualization component [10] is responsible for the data presentation. It provides a set of graphical views to present the bottlenecks, the miss reasons, and the access pattern. We use a Discrete Wavelet Transform (DWT) code to show how this tool guides the user step by step to the problem and its solution. The DWT is an algorithm commonly used for image and video compression; it performs the wavelet transformation by sampling the wavelet discretely. The wavelet representation of a discrete signal X consisting of N samples can be computed by convolving X with low-pass and high-pass filters and down-sampling the output signal by 2. This process decomposes the original image into two sub-bands: a lower band and a higher band. A two-dimensional (2D) DWT is conducted by first performing a 1D DWT on each row of the input image (horizontal filtering) and storing the intermediate coefficients in the output image, followed by a 1D DWT on each column of the output image (vertical filtering) and storing the results in the input image. It was found that the vertical filtering is much slower than the horizontal filtering [11,12], but the exact reason was not clear. Hence, we analyse this code in terms of its cache behaviour.

4.1. Problem analysis

The first step is to observe the cache performance of both the vertical and the horizontal filtering. We use a 3D phase view to perform this task; Fig. 2 shows an example. The 3D presentation shows the cache performance of a data structure in program phases, where a program phase can be a function, a subroutine, or a code region. Within the diagram, a data structure, ou_image in this concrete view, is divided into blocks of user-specified size and distributed along the x-axis. The y-axis shows the cache misses that occurred when accessing each data block of the data structure. Program phases are distributed along the z-axis. This example contains two program phases, one for the horizontal filtering (closer to the x-axis) and the other for the vertical filtering. From the diagram it can clearly be seen that a significant number of misses occur in the vertical filtering, while the horizontal filtering shows only a small number of cache misses. We also examined the input image, but did not find many cache misses for this data structure. Hence, we conclude that the access to the output image in the vertical filtering has cache problems.
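The locality issue behind this observation is the column-wise traversal in the vertical filtering. The fragment below is only a schematic reconstruction (the real DWT kernel of [11,12] is more involved, and the array names and filter coefficients are assumptions): a row-major output image is walked down its columns, so consecutive iterations touch addresses one full row apart.

```c
#define N 1024          /* image height (rows)    */
#define M 1024          /* image width  (columns) */

/* Schematic vertical filtering: for each column j, the output image is read
 * down the column.  With row-major storage, ou_image[i][j] and
 * ou_image[i+1][j] are M*sizeof(float) bytes apart, so every iteration
 * touches a new cache line, and for power-of-two M many of these lines
 * map to the same cache set.                                              */
void vertical_filter(float ou_image[N][M], float in_image[N][M])
{
    for (int j = 0; j < M; j++)             /* loop over columns    */
        for (int i = 1; i < N - 1; i++)     /* walk down one column */
            in_image[i][j] = 0.25f * ou_image[i - 1][j]
                           + 0.50f * ou_image[i][j]
                           + 0.25f * ou_image[i + 1][j];
}
```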


Fig. 2. 3D phase information.

Fig. 3. Variable Miss Overview.

Fig. 4. Variable Trace: the first 15 accesses to the output image.

The next step is to determine the reason for the cache misses: are they caused by first references, mapping conflicts, or insufficient cache size? To answer this question we use the Variable Miss Overview; Fig. 3 shows a sample view for the DWT. The Variable Miss Overview uses coloured bars to present the miss statistics of the variables in the program. The bars, from left to right, represent the total misses, the compulsory misses, the conflict misses, and the capacity misses, where the total is the sum of the other three. It can be seen that both images introduce cache misses; however, more than 80% of the misses are caused by accessing ou_image, and the primary reason for them is mapping conflict.

This leads us to use the Variable Trace to observe how the cache misses are formed. The Variable Trace exhibits the access pattern of an array by showing the references to all of its elements in the order in which they are accessed. This allows the programmer to find the reused data and the character of the reuse. As depicted in Fig. 4, 20 fields are used to depict the access trace. A display field gives the access type (indicated by the colour), the name of the array, the number of the accessed block (left number), and the concrete array element (right number). The access type can be a load operation, a replacement, or a cache hit. The accesses are filled from left to right: the left access is performed after the right one. A data block here has the size of a cache line and contains several elements, depending on the variable type.

Fig. 5. The Cache Set view of set 1 in L1.

A first observation for the DWT is a reuse behaviour: the same data block is repeatedly accessed. For example, the first access, presented in field 15, is performed on element 0 of block 0; this block is accessed again (the eighth access, shown in field 8), this time by reading element 1, which also lies in block 0. However, the colour shows that the second reference is a cache miss, indicating that block 0 has been evicted from the cache before element 1 is accessed. The same behaviour can be seen with all other data blocks. Since the accesses to the same data block lie close together, we can conclude that these misses are caused by mapping conflicts. Examining the remaining accesses, we found another reuse feature: some data blocks are used again, but much later. This data reuse likewise does not result in a cache hit; such a miss must be a capacity miss.

Our next task is to find out how the conflict is formed and which data block has evicted the reused one. For this we apply the Cache Set view. As shown in Fig. 5, this view depicts the cache operations and the content updates performed in a single cache set, Set 1 in this example. As we simulated a 4-way cache, each cache set contains four lines. Again, the operations are indicated by the colour of the field. The content updates are presented in the chronological order of the fields, with the left one replacing the right one.


It can be seen that Set 1 is first filled with block 0, block 4, block 8, and block 12. According to the Variable Trace view in Fig. 4, these blocks are reused. However, other requested and reused blocks, i.e. blocks 16, 20, and 24, are also mapped into the same set, and the accesses to them evict the previously used ones. As a consequence, the second access to a given data block is a miss, and this access in turn replaces another reused block. Therefore, all repeated accesses to the same data block are cache misses.

Using the Cache Set view, we also found the reason for the capacity misses. A capacity miss usually occurs with nested loops, where the outer loop reuses the data of the inner loop. If the inner loop processes more data than the cache can hold, the reused data are removed from the cache before the next outer iteration starts. Such misses can be avoided by reducing the length of the inner loop so that the outer loop restarts while the reused data are still in the cache. Hence, the key issue with a capacity miss is to find the optimal length of the inner loop. Using the Variable Trace we detect the first data block reused in the outer loop, and with the Cache Set view we know which matrix element has replaced it. To keep the reused data resident, this element must not be accessed in the inner loop; hence the column number (when looping over columns) or row number (when looping over rows) determines the appropriate length of the inner loop.

Overall, the tools have helped us to find the concrete cache problems and the reasons for them. For each application the analysis procedure can differ, but in general programmers can rely on the following steps to detect the optimization targets. A program usually contains individual segments, such as functions and loops; the first step is therefore to find the code regions with cache bottlenecks, for which the 3D view can be applied. In the next step, attention is focused on the hot code regions. As cache misses are caused by accessing data, the goal of this step is to detect the data structures that show the most cache problems; the Variable Miss Overview can be applied for this task. Finally, it is necessary to know how the cache misses are formed, because this knowledge directly decides which optimization technique will be effective; the Variable Trace and Cache Set views are appropriate for obtaining this knowledge.
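The conflict pattern observed above can be reproduced with a short set-index calculation. The cache geometry below (16 KB, 4-way, 64 B lines, hence 64 sets) is an illustrative assumption rather than the exact simulated configuration; the point is that blocks whose addresses differ by a multiple of (number of sets × line size) map to the same set, which is exactly what happens for vertically adjacent elements of a row-major image with a power-of-two row size.

```c
#include <stdio.h>

/* Illustrative cache geometry (assumed, not the paper's exact simulation). */
#define LINE_SIZE 64                      /* bytes per cache line            */
#define NUM_SETS  64                      /* 16 KB, 4-way: 16384/(64*4) sets */

static unsigned set_index(unsigned long addr)
{
    return (addr / LINE_SIZE) % NUM_SETS; /* which set a block maps to */
}

int main(void)
{
    const unsigned long row_bytes = 1024 * sizeof(float); /* 4 KB per row */

    /* Walking down one column of a row-major, 1024-wide float image:
     * consecutive accesses are exactly row_bytes apart, a multiple of
     * NUM_SETS * LINE_SIZE (4 KB), so every access maps to the same set
     * and a 4-way set overflows after four rows.                        */
    for (int row = 0; row < 8; row++)
        printf("row %d -> set %u\n", row, set_index(row * row_bytes));
    return 0;
}
```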


4.2. Optimization

As shown above, a conflict miss occurs when several reused data blocks replace each other. A natural solution is to partition the calculation into two steps, with one step processing the first four blocks and the other step handling the rest of the data. This technique is called loop fission [5]. Fig. 7 shows the optimized code segment, while the original implementation is depicted in Fig. 6. The DWT code contains two parts: the first part performs the horizontal filtering, in which the input image is processed and the output image is produced; the second part takes the output of the first part as input, performs the vertical filtering on it, and stores the resulting image in the output image. Because the horizontal filtering does not introduce many cache misses, our optimization focuses on the second part of the DWT code, and Fig. 6 therefore gives only the code segment for the vertical filtering. In contrast to the original implementation, the optimized code uses two loops to perform the same work. In this case, no mapping conflict occurs within either loop and the data are held in the cache. However, the additional loop introduces extra instructions.

Fig. 6. The original implementation of a code segment in the vertical filtering.

Fig. 7. Optimization with loop fission.

Another common approach to tackling conflict misses is array padding [4]. This scheme removes the mapping conflict by enlarging a matrix dimension and thereby changing the mapping behaviour. For this optimization, the matrix ou_image is defined as ou_image[N][M + LineSize] rather than ou_image[N][M]. In this case, the reused data blocks no longer map to the same cache set. This approach works better than loop fission because it has no runtime overhead.

As mentioned, to remove capacity misses the inner loop must be shortened; to maintain the functionality of the original code, an additional loop has to be inserted. This technique is called loop tiling [13]. Fig. 8 shows the optimized code segment with this scheme. In contrast to the original implementation, a loop with index k is added to the code. This loop divides the computation of the j loop into smaller blocks (note the changed loop condition on j), so that loop j processes only blocksize elements rather than the whole row.

Fig. 8. Optimization with loop tiling.
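Figs. 6–8 themselves are not reproduced in this text, so the fragment below sketches what the three transformations could look like on a simplified vertical-filtering loop. The filter body, the BLOCKSIZE value, and the exact loop bounds are assumptions for illustration; the paper's actual code, shown in the original figures, differs in detail.

```c
#define N         1024
#define M         1024
#define LINESIZE  16     /* padding in elements, roughly one cache line of floats */
#define BLOCKSIZE 64     /* tile length for the inner loop (assumed value)        */

/* Array padding: the extra LINESIZE columns shift the rows so that
 * vertically adjacent blocks no longer map to the same cache set.     */
float ou_image[N][M + LINESIZE];
float in_image[N][M];

/* Loop fission as described in the text: the row range is split into two
 * loops, so fewer reused blocks compete for the same cache set at a time. */
void vertical_filter_fission(void)
{
    for (int j = 0; j < M; j++) {
        for (int i = 1; i < N / 2; i++)
            in_image[i][j] = 0.5f * (ou_image[i - 1][j] + ou_image[i + 1][j]);
        for (int i = N / 2; i < N - 1; i++)
            in_image[i][j] = 0.5f * (ou_image[i - 1][j] + ou_image[i + 1][j]);
    }
}

/* Loop tiling: an added k loop divides the work of the j loop into chunks
 * of BLOCKSIZE elements, so the lines reused across consecutive rows still
 * fit in the cache.                                                        */
void vertical_filter_tiled(void)
{
    for (int k = 0; k < M; k += BLOCKSIZE)
        for (int i = 1; i < N - 1; i++)
            for (int j = k; j < k + BLOCKSIZE && j < M; j++)
                in_image[i][j] = 0.5f * (ou_image[i - 1][j] + ou_image[i + 1][j]);
}
```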

5. Experimental results

The performance analysis described above enables us to derive optimized versions of the DWT code. The optimization is based on a single run with a certain data size and cache configuration. For capacity misses, the optimal length of the inner loop can be calculated for different cache sizes, and the blocksize in the optimized code is adapted to the cache configuration. For conflict misses, however, the optimized code is probably not optimal for other working sets and cache organizations, because the mapping behaviour changes. Nevertheless, in order to examine the applicability of this approach, we ran the same optimized version with different data sizes.

The programs were run on four architectures: a Pentium 3 with an L1 cache of 16 KB, 2-way, with 32 B cache lines; a Pentium 4 with an L1 of 8 KB, 4-way, with 64 B lines; an AMD Opteron with an L1 of 64 KB, 2-way, with 64 B lines; and an Itanium II with an L1 of 16 KB, 4-way, with 64 B lines. Figs. 9–12 depict the experimental results.

First, we look at the prefetching result. As the DWT does not exhibit many compulsory misses, we achieved only a slight performance improvement on some machines, owing to the prefetching overhead. Here, we choose the two best cases for demonstration. Fig. 9 depicts the improvement in execution time of the prefetching version over the original implementation.

Fig. 9. Speedup with prefetching.

Fig. 10. Speedup achieved by the padding technique.

Fig. 11. Speedup by using loop fission.

Fig. 12. Speedup with loop tiling.

(Figs. 9–12 plot the results over image sizes from 256 × 256 to 4096 × 4096 for the Pentium 3, Pentium 4, AMD, and Itanium machines.)
In the figure, the y-axis depicts the improvement of the prefetching version over the original one, while the x-axis gives the image size; a value of 0.18, for example, indicates that the optimized version runs 18% faster. The curves in the figure demonstrate how the optimization works on the different target architectures. First, several peaks can be seen for both machines. These are the results for image sizes that are a power of two. This can be explained by the fact that these cases suffer significantly from conflict misses, so more misses can be eliminated by the optimization (the DWT does not show many conflict misses for data sizes that are not a power of two). Overall, we achieved the best improvement, 18%, with an image size of 4096.

Fig. 10 shows the speedup of the padding version. Here, the speedup is calculated as the execution time of the original implementation divided by the execution time of the optimized version. It can be seen that for image sizes that are a power of two, a clear speedup is achieved. Overall, this result indicates that the DWT has a regular access pattern, and that for such applications an optimization based on a single run also works for other configurations.

As presented in Fig. 11, a similar result is obtained with the fission technique, where images with a size that is a power of two again show better speedup. In contrast to the padding scheme, however, slowdowns can be seen with this technique. The slowdown comes from the additional instructions introduced by the fission strategy; running these instructions causes a performance loss.

If the gain from the reduced memory access penalty cannot compensate for this loss, the optimization does not bring any speedup. This is also the reason why the padding scheme works better: a speedup as high as 3.19 was achieved, while the best speedup with fission is 2.4.

Fig. 12 depicts the results with the tiling technique. It is not surprising that the speedup is low, because the DWT does not show many capacity misses. Like loop fission, tiling also introduces additional instructions, and for the same reason slowdowns have been observed. However, in cases where the cache problem is critical, such as with image sizes that are a power of two and with large images, the gain in memory access time can still cover the performance loss, and the optimization shows its advantage.

6. Related work

In recent years, an increasing number of tools have been developed for performance analysis. Most of them target parallel applications, especially those parallelized with MPI or OpenMP, but tools specifically designed for cache optimization also exist. These performance analysers vary mainly in which information they collect and how the bottlenecks are presented.

Smaller performance analysis tools simply show the result in text form, without graphical interfaces. Examples are gprof [14], mpiP [15], and ompP [16]. The first was developed to measure the CPU time and the number of calls of each procedure in a program. mpiP is a profiling library for MPI applications that delivers detailed information about MPI function calls. ompP is similar to mpiP; it profiles the OpenMP constructs and measures their overhead and the time spent in parallel and sequential code segments.

Larger performance analysers, on the other hand, make the runtime information more understandable with graphical presentations. Well-known products include Vampir, KOJAK, TAU, Paraver, and VTune. Vampir [17] was originally developed to show the communication behaviour of MPI applications, but has been extended to depict cache performance. For MPI programs it uses timelines to show the inter-process traffic over the whole execution, allowing the user to find communication bottlenecks; for the cache it shows the number of cache misses with respect to the source code, enabling the detection of cache-critical code regions. Paraver [18] is a performance visualization and analysis tool that can be used to understand the runtime execution behaviour of parallel applications, operating system activities, resource utilization, cache performance, and so on. It uses a simple trace format that can easily be produced by various trace generators, such as an instrumentor, a simulator, or a counter-based profiler. For visualization it provides graphical views showing the behaviour over time, textual views giving the details, and analysis views for quantitative data; an example is the miss ratio timeline, where the miss rates of a specified cache in different time slots are depicted.


KOJAK [19] is an automatic performance evaluation system for parallel applications. It deploys both an instrumentation approach and a counter interface to gather information about the communication transactions of message-passing programs, the synchronization and parallelism of shared-memory codes, and system events recorded by hardware counters. The generated trace is analysed off-line to detect inefficient behaviour, which is classified into a call-path profile; the call-path profile is viewed with the CUBE performance browser. The focus of this research work is data collection and analysis, especially on large parallel systems. The Tuning and Analysis Utilities (TAU) [20] toolkit is also a performance analysis toolkit for parallel applications. It collects performance information with instrumentation, which can be inserted into the source code using automatic tools or manually via the instrumentation API. Its visualization component provides graphical displays of the performance data, focusing on functions and events; for example, a bar view shows the time spent by each function for a specific MPI event, allowing the user to quickly find the hot MPI transactions and their location.

In the commercial area, Intel is the dominant tool developer. The Thread Profiler [21] is a recent product developed to help the user understand what threads are doing and how they interact at runtime. VTune [22] is an earlier product for performance analysis. It provides several views to help programmers identify bottlenecks. For cache metrics it shows the absolute number of cache misses with respect to the code region, allowing the user to detect functions and even code lines that introduce the most cache misses.

Overall, cache performance has been addressed in the development of general performance analysis systems. However, only a few visualization tools are specifically designed for exhibiting the cache access behaviour. CVT [23] can be regarded as the result of initial investigations. It is a visualization program aimed at presenting the cache activities. It uses a main window showing the active content of the whole cache memory and a second window presenting statistics on cache misses in terms of arrays. It also provides information about the operations in the cache during a program run and supplies various statistics. CACHEVIZ [24] is another visualization tool, intended to show the cache access behaviour of a complete program in a single picture. It provides three main views (a density view, a reuse distance view, and a histogram view) to present individual references and their features, allowing the detection of regular structures such as spatial locality. KCachegrind [25] is a visualization tool for presenting the data gathered by a profiling tool based on the Valgrind [26] simulation framework. The focus of this tool is on individual functions; it shows various information about them, including the call graph, their execution time, and the cache misses related to them.

In summary, existing performance analysis tools, whether general-purpose or cache-specific, aim at helping the user to understand the runtime behaviour of the complete program. As a consequence, the graphical views, such as timelines and call graphs, often present the entire information in great detail. This kind of visualization is useful for debugging but does not highlight the bottlenecks for optimization. Our visualizer is specifically designed for optimization, and therefore concentrates on depicting the problem and its reason; this information is more useful than a plain count of cache misses.
7. Conclusions

This paper introduces an approach to optimizing cache performance with the support of a toolset.


Unlike conventional approaches, which try to solve cache problems without knowing their reason, we start from the problem, then analyse the reason for the cache misses, and finally select appropriate optimization schemes for the different kinds of cache misses. The toolset supports such efficient cache optimization with its comprehensive data acquisition components and its visualization tool. While the former deliver detailed information about the runtime data and cache operations, the visualization tool presents the data in such a way that the access pattern and the miss reason can be easily detected. The optimization of a sample code, the Discrete Wavelet Transform, shows the efficiency of this approach.

We are currently porting the tools to many-core architectures. Cache optimization on such machines is more challenging due to the potentially longer penalty of accessing the shared main memory.

Acknowledgements

This work was done while the author was a staff member of the Institut für Technische Informatik, Universität Karlsruhe. The author thanks Professor Wolfgang Karl for enabling this research work, and also thanks Asadollah Shahbahrami of the Computer Engineering Laboratory at Delft University of Technology for providing the DWT code.

References

[1] W. Lin, S. Reinhardt, D. Burger, Reducing DRAM latencies with an integrated memory hierarchy design, in: Proceedings of the 7th International Symposium on High-Performance Computer Architecture, January 2001, pp. 301–312.
[2] C. Dubach, et al., Fast compiler optimisation evaluation using code-feature based performance prediction, in: Proceedings of the International Conference on Computing Frontiers, May 2007.
[3] Z. Pan, R. Eigenmann, Fast and effective orchestration of compiler optimizations for automatic performance tuning, in: Proceedings of the International Symposium on Code Generation and Optimization, March 2006, pp. 319–332.
[4] G. Rivera, C.W. Tseng, Data transformations for eliminating conflict misses, in: Proc. ACM Int. Conf. on Programming Language Design and Implementation, Montreal, Canada, 1998, pp. 38–49.
[5] Z. Wang, E. Sha, X. Hu, Combined partitioning and data padding for scheduling multiple loop nests, in: Proc. Int. Conf. on Compilers, Architecture, and Synthesis for Embedded Systems, 2001, pp. 67–75.
[6] A.-T. Nguyen, et al., The Augmint multiprocessor simulation toolkit for Intel x86 architectures, in: Proceedings of the 1996 International Conference on Computer Design, October 1996.
[7] S.G. Berg, Cache prefetching, Technical Report UW-CSE 02-02-04, Department of Computer Science & Engineering, University of Washington, February 2004.
[8] I. Rigoutsos, A. Floratos, Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm, Bioinformatics 14 (1) (1998) 55–67.
[9] T. Mohan, et al., Identifying and exploiting spatial regularity in data memory references, in: Supercomputing 2003, November 2003.
[10] B. Quaing, J. Tao, W. Karl, YACO: A user conducted visualization tool for supporting cache optimization, in: High Performance Computing and Communications: First International Conference, HPCC 2005, in: Lecture Notes in Computer Science, vol. 3726, Sorrento, Italy, September 2005, pp. 694–703.
[11] D. Chaver, C. Tenllado, L. Pinuel, M. Prieto, F. Tirado, 2-D wavelet transform enhancement on general-purpose microprocessors: Memory hierarchy and SIMD parallelism exploitation, in: Proc. Int. Conf. on High Performance Computing, December 2002.
[12] Y.D. Lee, B.D. Choi, J.K. Cho, S.J. Ko, Cache management for wavelet lifting in JPEG 2000 running on DSP, Electronics Letters 40 (6) (2004).
[13] S. Ghosh, M. Martonosi, S. Malik, Automated cache optimizations using CME driven diagnosis, in: Proc. Int. Conf. on Supercomputing, May 2000, pp. 316–326.
[14] J. Fenlason, R. Stallman, GNU gprof: The GNU profiler. Available at: http://www.gnu.org/software/binutils/manual/gprof-2.9.1/html_mono/gprof.html.
[15] J. Vetter, C. Chambreau, mpiP: Lightweight, scalable MPI profiling, April 2007. Available at: http://mpip.sourceforge.net/.
[16] K. Fürlinger, M. Gerndt, ompP: A profiling tool for OpenMP, in: Proceedings of the First International Workshop on OpenMP, IWOMP 2005, Eugene, USA, May 2005.
[17] H. Brunst, H.-Ch. Hoppe, W.E. Nagel, M. Winkler, Performance optimization for large scale computing: The scalable VAMPIR approach, in: Proceedings of ICCS 2001, 2001, pp. 751–760.
[18] European Center for Parallelism of Barcelona (CEPBA), Paraver – parallel program visualization and analysis tool – reference manual. Available at: http://www.cepba.upc.es/paraver.
[19] B. Mohr, F. Wolf, KOJAK — A tool set for automatic performance analysis of parallel applications, in: Proceedings of the European Conference on Parallel Computing, EuroPar, in: LNCS, vol. 2790, 2003, pp. 1301–1304.


[20] S. Shende, A.D. Malony, The TAU parallel performance system, International Journal of High Performance Computing Applications 20 (2) (2006) 287–331.
[21] Intel Corporation, Intel Thread Profiler. Available at: http://www.intel.com/cd/software/products/asmo-na/eng/threading/286749.htm.
[22] Intel Corporation, Intel VTune Performance Analyzer. Available at: http://www.intel.com/cd/software/products/asmo-na/eng/vtune/vlin/239145.htm.
[23] E. van der Deijl, G. Kanbier, O. Temam, E.D. Granston, A cache visualization tool, IEEE Computer 30 (7) (1997) 71–78.
[24] Y. Yu, K. Beyls, E.H. D'Hollander, Visualizing the impact of the cache on program execution, in: Proceedings of the 5th International Conference on Information Visualization, IV'01, July 2001, pp. 336–341.
[25] J. Weidendorfer, M. Kowarschik, C. Trinitis, A tool suite for simulation based analysis of memory access behavior, in: Proceedings of ICCS 2004, 2004, pp. 440–447.

[26] N. Nethercote, et al., Valgrind. Available at: http://www.valgrind.org/.

Jie Tao received her master's degree in computer science from Jilin University, China, in 1989. She then worked at the same university, first as an assistant and later as an associate professor. In 2002 she received her Ph.D. at the Chair for Computer Technology and Computer Organization, Department of Computer Science, Munich University of Technology. Her research interests include shared memory programming, architecture simulation, performance tools, monitoring, cluster and Grid computing, autonomous computing, and data locality optimization.