Analyzing data locality in GPU kernels using memory footprint analysis

Analyzing data locality in GPU kernels using memory footprint analysis

Accepted Manuscript Analyzing Data Locality in GPU Kernels Using Memory Footprint Analysis Mohsen Kiani, Amir Rajabzadeh PII: DOI: Reference: S1569-...

940KB Sizes 1 Downloads 95 Views

Accepted Manuscript

Analyzing Data Locality in GPU Kernels Using Memory Footprint Analysis Mohsen Kiani, Amir Rajabzadeh PII: DOI: Reference:

S1569-190X(18)30184-9 https://doi.org/10.1016/j.simpat.2018.12.003 SIMPAT 1889

To appear in:

Simulation Modelling Practice and Theory

Received date: Revised date: Accepted date:

15 April 2018 29 November 2018 3 December 2018

Please cite this article as: Mohsen Kiani, Amir Rajabzadeh, Analyzing Data Locality in GPU Kernels Using Memory Footprint Analysis, Simulation Modelling Practice and Theory (2018), doi: https://doi.org/10.1016/j.simpat.2018.12.003

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT

Analyzing Data Locality in GPU Kernels Using Memory Footprint Analysis

CR IP T

Mohsen Kiani, Amir Rajabzadeh

School of Computer Engineering and Information Technology, Razi University, Kermanshah, IRAN

Abstract

AC

CE

PT

ED

M

AN US

Memory footprint is a metric for quantifying data reuse in memory trace. It can also be used to approximate cache performance, especially in shared cache systems. Memory footprint is acquired through memory footprint analysis (FPA). However, its main limitation is that, for a memory trace of n accesses, the all-window FPA algorithm requires O(n3 ) time. Therefore, in this paper, we propose an analytical algorithm for FPA, whereby the average footprints are calculated in O(n2 ). The proposed algorithm can also be employed for window distribution analysis. Moreover, we propose a framework to enable the application of FPA to GPU kernels and model the performance of L1 cache memories. The results of experimental evaluations indicate that our proposed framework functions 1.55X slower than the Xiang’s formula, as a fast average FPA method, while it can also be utilized for window distribution analysis. In the context of FPA-based cache performance estimation, the experimental results indicate a fair correlation between the estimated L1 miss rates and those of the native GPU executions. On average, the proposed framework has 23.8% error in the estimation of L1 cache miss rates. Further, our algorithm runs 125X slower than the reuse distance analysis (RDA) when analyzing a single kernel. However, the proposed method outperforms RDA in modeling shared caches and multiple kernel executions in GPUs.

Preprint submitted to Simulation Modeling Practice and Theory

December 4, 2018

ACCEPTED MANUSCRIPT

Analyzing Data Locality in GPU Kernels Using Memory Footprint Analysis

CR IP T

Mohsen Kiani, Amir Rajabzadeh

School of Computer Engineering and Information Technology, Razi University, Kermanshah, IRAN

AN US

Keywords: Performance modeling, GPU, Data locality, Footprint analysis, Cache memory 1. Introduction

AC

CE

PT

ED

M

GPUs have become one of the most popular computing resources in modern computing systems as they deliver high performance and have become easier to program. In GPUs, the thread-switching technique is employed to tolerate the latency incurred by the data transfers through the memory hierarchy. However, long memory latency, caused by global load and store instructions, can still impose memory stall cycles, thereby limiting the overall performance. Consequently, the memory system has appeared as a performance bottleneck in many applications [1, 2, 3]. Employing cache memories is regarded as a major solution to resolve the memory latency problem [4]. GPU manufacturers have implemented multiple levels of hardware-managed private and shared cache memories. Further, architecting and managing cache memories in GPUs have been actively under study during the past years [5, 6, 7, 8, 3]. Many researchers have tackled the challenge of memory performance bottleneck and have proposed different techniques and algorithms to boost the memory performance [9]. Despite the employment of cache memories, memory performance is still considered as a major problem because the cache share per physical core is very limited in GPUs [10]. Furthermore, the number of threads a typical GPU runs concurrently can be significantly higher than the available physical cores and thus the per-thread cache share is even smaller than that of the physical core. Therefore, the massive thread-level parallelism in GPUs can

Preprint submitted to Simulation Modeling Practice and Theory

December 4, 2018

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

M

AN US

CR IP T

impose severe cache contentions [7]. As a result, the optimization of GPU applications in terms of memory performance is crucial. Cache memories exploit the data locality exposed by an application. Therefore, data locality is a major parameter in performance optimization. The first step towards optimizing the data locality of an application is the measurement of data locality. In this regard, proper design and analysis tools are required for locality measurement. The measured data locality can then be used as a basis to approximate cache performance. Therefore, such analysis tools can help with comprehending the memory behavior, designing more efficient cache memories, and developing more efficient applications. Different locality metrics have been developed, among which the reuse distance (or stack distance) and memory footprint are more popular [11]. The conversion between the locality metrics and cache miss rate curve were well-established in previous studies [12]. There are many algorithms and methods for measuring the data locality of an application [11]. In this context, trace-driven simulation is a popular method. For example, the reuse distance [13] can be calculated by analyzing memory traces, where the traces are collected from the execution of a given application on a specific processor. Reuse distance is calculated based on the number of unique data accessed between two accesses to the same datum. Measuring the reuse distance of all accesses in a trace is called reuse distance analysis (RDA), which generates a reuse histogram that demonstrates the frequency of different reuse distance values as a measure of locality. The reuse histograms can be used to calculate the miss rate of fully associative caches of all size. Memory footprint is another popular metric for measuring the data locality [11]. In a memory trace consisting of n references, a window of l consecutive references over the trace contains a given number of unique data, which is known as the footprint or the working set size of the window. Based on the footprint analysis (FPA), calculating the footprints of the desired window length, l, includes measuring the footprints of all n − l + 1 overlapping windows of length l. Let wf pi (l) denote the measured footprint of the i-th window of length l (i ranges 1 to n − l + 1). The average footprint of all windows of length l, denoted by f p(l), is obtained through summing up the footprints of all windows of length l, i.e. wf pi (l)s, and then dividing the result by the total number of l-length windows. Indeed, there are some analytical methods that allow for the calculation of average footprints without separately measuring the footprints in each window [14]. Such methods are 3

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

M

AN US

CR IP T

distinguished by the term average footprint analysis from the former method which is termed all-window footprint analysis. Although average footprint analysis can largely accelerate the footprint calculations, it might not be applicable to investigate the footprints of individual windows of the same size, which allows for deeper insights into the locality of the application under study. The miss rate in a cache of a given capacity can be calculated based on the average footprints [12]. In comparison with the reuse distance, one of the most interesting properties of the FPA is that the footprint is composable so the joint footprint of a set of programs running on a shared cache environment can be estimated from the individual (or solo) footprints of the programs [11]. The footprint’s composability has led to its growing popularity in locality measurement and cache performance analysis in shared cache systems [15, 16]. The main limitation of the FPA is the complexity of the all-window calculation. In a trace of length n, there are O(n2 ) windows that contain O(n) accesses. As a result, O(n3 ) calculations are required to calculate the footprints of all windows. On the other hand, the average footprints can be calculated in O(n2 ) time. Xiang’s formula [14] is an example of the average footprint calculation. However, because only the average footprints are calculated, it cannot be utilized for window distribution analysis, e.g., calculating the minimum and maximum footprint values for a given window length [11]. Window distribution analysis requires all-window footprints. In the context of GPUs, although the reuse distance analysis has been adapted for GPUs [17, 18, 19], to the best of our knowledge, no prior study has been done for calculating the memory footprints of GPU kernels. The motivation of the present study is calculating the footprints in GPU kernels and modeling the GPUs cache performance based on the footprint theory. First, we propose a fast FPA algorithm that is comparable to Xiang’s formula in terms of execution efficiency. While Xiang’s formula cannot be employed for window distribution analysis, our proposed algorithm can be employed for this purpose. Further, to enable the application of FPA to GPU kernels, we develop a framework to generate the L1 cache memory traces in GPUs (e.g., the reference sequences to the L1 cache memories). Then, the FPA is applied to the traces to calculate the memory footprints and finally, the measured footprints are used to estimate the cache performance. We employ the analytical model developed in [12] for converting the footprints to cache miss rates and also use the model proposed in [15] to approximate the joint footprints of co-run kernels. 4

ACCEPTED MANUSCRIPT

CR IP T

In addition to cache performance estimation, the footprint can be used for kernel characterization. Specifically, the patterns in which a kernel accesses the cache memory can be studied through the use of the window distribution analysis. The main contributions of this article are as follows:

AN US

1. An analytical model is proposed for footprint analysis to calculate memory footprints in affordable times. The proposed model can be utilized for the average footprint analysis or window distribution analysis. 2. A framework is developed to enable the application of FPA to GPU kernels. 3. Experimental evaluations are performed on more than 20 GPU kernels to acquire their memory footprints and approximate cache miss rates. The results acquired by the framework are compared with those of the real GPU executions to validate the framework. Further, the cache performance of 55 two-kernel execution combinations is evaluated.

PT

2. Background

ED

M

In the following sections, the baseline GPU architecture is first introduced, and the footprint theory is briefly presented (see Section 2). Then, the proposed footprint algorithm is described (see Section 3). Thereafter, in Section 4, we explain our proposed framework, called FGCA, aiming at GPU cache performance analysis based on the footprint theory. Later, in Section 5, we present our experimental results and discuss our findings. Next, in section 6, we review the related work and compare the similar work with ours. Finally, we conclude our study in Section 7.

CE

In this section, we introduce the baseline GPU architecture and the footprint theory.

AC

2.1. Baseline GPU Architecture This paper considers NVIDIA terminology [20]. Fig. 1 shows an overview of the baseline GPU architecture considered in the present work. A GPU has a large number, typically hundreds of cores organized as several streaming multiprocessors (SMs). Each SM has several schedulers to schedule the threads on the processing cores. Some of the processing cores are special function units that are used to perform the complex computing operations. Moreover, load and store instructions are performed by the load/store units. 5

ACCEPTED MANUSCRIPT

Scheduler Scheduler

SM1

GPU SM2

SMP

CR IP T

Register file

Interconnect L1 cache Shared memory

L2

L2

MC

MC

MC

Global memory

AN US

Processing cores Special function units Load/store cores MC: Memory controller

L2

Figure 1: Overview of the baseline GPU architecture

AC

CE

PT

ED

M

In addition to the processing cores, the hardware-managed cache memories (L1) and programmer-managed scratchpads (called shared memory) are implemented on each SM to accelerate the memory transfers. Generally, the SMs’ L1 cache is shared among all the running threads in the SM. The number of co-running threads (also called the in-flight threads) is a machinedependent parameter. Based on the CUDA programming model, developed by NVIDIA for GPU parallel programming [21], a CUDA kernel is a function performed by GPUs. Each kernel contains a large number of threads organized as several thread blocks. The number of thread blocks in the kernel is called the grid dimension and the number of threads in each thread block is called the block dimension. Once a kernel is launched, the GPU divides the thread blocks into several 32-thread groups, called warps. Each GPU has limited capacity in the number of warps and the number of thread blocks it can run concurrently [20]. If the number of thread blocks mapped to an SM exceeds the machine-specific limits, the execution of the thread blocks is performed in several steps where a subset of the thread blocks is performed in each step. 2.2. Footprint Theory The footprint theory concerns measuring the locality in a memory reference sequence called trace. The memory references in a trace are collected based on the order of their execution by the processor (here, at the cache 6

ACCEPTED MANUSCRIPT

AN US

CR IP T

block granularity). A consecutive interval in a trace containing a specific number of references is called a window. The footprint measures the number of unique data accessed within a window of given length (also called the working set size of the window). In the rest of the paper, l represents the window length, i.e. the number of references in a window of interest. In a trace of n accesses, there are n − l + 1 windows of length l. These n − l + 1 windows are overlapping, i.e., one access in a trace may be enclosed in several windows. The footprint of the i-th window of length l is denoted by wf pi (l). The average footprint of all l-length windows is denoted by f p(l), which is calculated by averaging the footprints of all l-length windows, i.e. wf pi (l) for all i ∈ {1, .., n − l + 1}. Based on the footprint theory, the average footprints of all window lengths (e.g., f p(l) for all l ∈ {1, ..., n}) is calculated to analyze the locality in a given trace. Note that the total number of windows of all lengths is of an order of O(n2 ) [12].

M

2.2.1. Footprint-based Cache Performance Analysis The relationship between the footprint and other locality metrics has been studied in [12]. For a given application, the miss rate of a cache with c blocks (fully-associative LRU cache), denoted by mr(c), can be calculated based on the average footprints of the application, i.e. f p(l), l ∈ {1, ..., n}:

ED

mr(c) = f p(l0 + 1) − f p(l0 ),

(1)

AC

CE

PT

where c = f p(l0 ). The expression c = f p(l0 ) represents the window length whose average footprint equals c. Equation (1) shows the experimental usefulness of the theory. In the present work, Equation (1) is applied for the estimation of miss rates. Fig 2 shows a sample trace (to cache blocks) and stack distance of each reference along with the average footprints of the window lengths ranging from one to four. The individual footprints of the windows of length four are also illustrated in the figure. In addition, the miss rate of a fully-associative LRU cache with one block is estimated based on the reuse stack distances and Equation (1). 2.2.2. Comparison of FPA with RDA Reuse distance (RD), is an extensively popular locality metric that measures the working set size of a window starting and ending at references to 7

ACCEPTED MANUSCRIPT

2 3 4 5 6 7 b a c a c c 3

Stack distace(sd)

3

2

∞ ∞ 1 ∞ 1 1 0

2

f p(1) = 1 f p(2) = 1.83 f p(3) = 2.2 f p(4) = 2.5

CR IP T

1 i a T race

Stack distance : mr(c) = 1 − P (sd < c) → mr(1) = 1 − 71 = 0.86 F ootprint : mr(c) = f p(c + 1) − f p(c) → mr(1) = 1.83 − 1 = 0.83

Figure 2: Footprint values for four windows of length four in a sample trace

PT

ED

M

AN US

the same datum. In RD analysis, working set sizes of all such windows are calculated, thus it is a subset of the footprint analysis. Basically, the RD is a metric for measuring the locality of single program execution. To measure the RD in a shared cache multi-threaded environment, in which the cache is shared among several programs or threads, the threads’ data sharing and interleaving should be defined to calculate RDs of the memory accesses in each thread. However, footprint analysis is more efficient in shared cache environments because the footprint has a principal specification called composability. Composability states that the miss rate of a cache shared among p programs can be acquired from the individual footprints of the programs [11]. Consequently, any execution combination can be analyzed based on the individual footprints. On the contrary, in RD analysis, the reuse histograms should be calculated for every execution combination, which can result in significant analysis times especially for large values of p. Below, RDA and FPA are concisely compared:

AC

CE

- Calculation complexity: A naive FPA method requires O(n3 ) time to calculate the all-window footprints in a trace of n accesses; however, faster methods have been proposed that are capable of calculating the average footprints in O(n2 ) [14]. Further, in many cases, a subset of window lengths, or even a single window length is adequate. Thus, the complexity can further be alleviated. On the other hand, the basic RDA method requires O(n2 ) time. In addition, some faster RDA methods have also been proposed to gain reuse profiles in O(n log m) or O(n log log n) time [22], where m is the data size in the trace. - Accuracy: In RDA, the reuse distance is calculated per access and then the miss rate of a fully-associative LRU cache of any size can be 8

ACCEPTED MANUSCRIPT

-

CR IP T

CE

PT

-

ED

M

-

AN US

-

calculated accurately. FPA, on the other hand, averages the working set sizes of all windows with the same length and thus the locality phase changes are filtered. As a result, the miss rates calculated based on the average footprints are potentially less accurate. It should be noted that window distribution analysis can be used to identify locality phase changes at the cost of time. Composability: The footprint is composable [11] whereas the reuse distance is not. This is the main benefit of using the footprint instead of the reuse distance. Composability allows for the estimation of the cache miss rates of co-run programs from their solo-run footprints. This is important especially since modern CPUs are multi-core processors with layers of shared caches, and also GPUs can run several kernels at the same time. Thus, the footprint can be used for modeling cache sharing in CPUs and GPUs. Comprehensiveness: Both RDA and FPA can be performed once and the results can be used to approximate the miss rates of all cache sizes. However, because FPA is composable, it is more comprehensive. Accelerating calculations by means of parallel executions: Several studies have been dedicated to accelerating the RDA and FPA calculations through exploiting parallel execution. However, parallel execution of the FPA calculations requires less effort than that of the RDA (see Section 3) and parallel execution of RDA calculations can also impose errors in the yielded results [23]. Sampling: Various studies have investigated the sampling-based analysis in RDA [22, 24] and FPA [25, 12] to alleviate the calculation complexity. The results have revealed that sampling could be performed in both RDA and FPA and considerable speedups were achieved while the incurred errors were marginal. However, some researchers have found that footprint sampling is simpler than reuse distance sampling [11].

AC

2.2.3. Footprint Analysis in Shared Caches Let P be a set of programs that can be run on a shared cache system. Further, let p be the cardinality of P (p = |P |). Additionally, let Q ⊆ P denote a set of programs chosen to concurrently run on a shared cache and assume q = |Q|. Programs contained in Q are called co-run programs and Q itself a co-run. Running co-run programs on a shared cache alters the footprints of the programs. Let f p(pi , l) be the average footprint of program pi , i ∈ {1, ..., p}. Function f p(pi , l) is calculated when pi is run individually 9

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

M

AN US

CR IP T

thus it is not a function of any co-run. Further, assume that program pi be the j-th member of Q. Moreover, let cf pj (pi , Q, l) (j ∈ {1, .., q}) be the co-run footprint of pi in Q. cf pj (pi , Q, l) represents the footprint of program pi (which is the j-th co-run program in Q) when it is performed simultaneously with its co-run peers on a shared cache. Because all q corun programs contained in Q share the same cache, a single footprint can also be calculated for the trace that is generated by all the co-run programs, collaboratively (the trace seen by the shared cache). This trace contains all accesses of the co-run programs to the shared cache. The footprint of a co-run, Q, is called stretched footprint and denoted by sf p(Q, l). sf p(Q, l) can be used to calculate the miss rate of the cache shared by the co-runs programs. Since FP is composable, sf p(Q, l) can be calculated from the individual footprints of the co-run programs, i.e. f p(pi , l)s [15]. In other words, there is no need to synthesize a collaborative trace containing accesses of all corun programs and then analyze it by FPA. Instead, the same footprints calculated for programs when they ran individually are used to analyze any co-run combination. Hu et al. [15] showed that sf p(Q, l) can be calculated through applying simple mathematical formulas. First, co-run footprints of the programs contained in Q, i.e. cf pj (pi , Q, l)s, are calculated through applying a conversion to the solo footprints, f p(pi , l)s. Then, cf pj (pi , Q, l)s are converted to the stretched footprint, i.e. sf p(Q, l). This conversion is   p valuable because for p programs and q co-run programs, there are q co-run combinations, and instead of simulating all of the possible co-runs, they can be analyzed based on p footprints [11, 12]. In the present work, we only consider co-runs of two programs (i.e., q=2). The basis of calculating the stretched footprint is the rate at which each co-run program generates its memory references. Let ari be the memory access rate generated by program pi when it is performed individually. The memory access rate of a program can be obtained through profiling (e.g. the average number of memory accesses generated within a given time unit). In the model proposed by Hu et al. [15], Equation (2) is applied to convert individual to co-run footprints: cf pj (pi , Q, l) = f p(pi , li ),

(2)

where li is an integer value defined based on the memory access rates of pi P and its co-run programs. For each program, pi , li equals f (l × ari / i∈Q ari ) 10

ACCEPTED MANUSCRIPT

j∈Q

CR IP T

where f () is a function that assigns an integer value to li of each co-run P program so that l = i∈Q li . In other words, f () determines the share, i.e. the number of accesses, of each co-run program in a window of l accesses in the co-run’s trace. For two programs, f () can be a rounding function. Then, P a special case is for ar1 = ar2 that the fraction (l × ari / i∈Q ari ) generates two numbers as x.50 and y.50. In this case, x.50 is rounded to x+1 and y.50 to P y hence satisfying the condition l = i∈Q li . After calculating cf pj (pi , Q, l)s, Equation (3) is applied to them to obtain the stretched footprint of co-run, Q: X sf p(Q, l) = cf pj (pi , Q, l). (3)

M

AN US

By substituting sf p(Q, l) in Equation (1), the miss rate of a shared cache can be estimated. Thus, the main challenge in the footprint theory is calculating the individual footprints. The above model functions as a logical trace interleaving as if the co-runs are running concurrently and their collaborative trace is recorded and then its footprint is calculated. Thus, this model substitutes the co-run simulation with a simple mathematical model thereby avoiding the need for simulating every possible co-run execution. 3. The Proposed Footprint Calculation Model

ED

In this section, we develop an analytical model to calculate the footprints of the desired window lengths.

AC

CE

PT

3.1. Motivation Based on a naive all-window footprint calculation algorithm, all O(n2 ) windows in a trace of n accesses are enumerated and their footprints are calculated one by one. Each window contains up to n accesses. This naive algorithm, called the counting-based algorithm, requires O(n3 ) time [26]. The average footprint is calculated for each window length, l, through averaging the footprints of all l-length windows. Because n is typically very large, this cubic time complexity is not affordable in practical usage. Different methods have been proposed to alleviate the complexity of footprint calculations, including trace reduction and statistical sampling [26, 11]. Another appealing method is utilizing analytical models, which are developed based on mathematical formulas whereby the calculations can be significantly accelerated. Xiang’s formula [14] is an example that highly accelerates the footprint calculations. However, Xiang’s formula calculates only the average footprints 11

ACCEPTED MANUSCRIPT

CR IP T

and cannot provide window distribution analysis, by which more insight can be obtained into the program’s behavior in terms of data locality. For instance, the minimum and maximum working set sizes among all windows of the same length, or any desired statistical parameter can be calculated by means of all-window footprint analysis. Below, we first explain our proposed analytical model, which can be used for calculating the average footprints of a given window length. Thereafter, we extend the model for window distribution analysis.

AN US

3.2. Analytic Model for Footprint Calculation First, we introduce the parameters used in the proposed model. Then, several definitions are introduced and the model is derived. The main parameters used in the model are as follows:

ED

M

- n: denotes the number of accesses in the trace. - i: represents the access index in the trace, which is in the range between one to n. We use terms access and index, interchangeably. - ai : denotes the i-th access in the trace. - l: represents the window length of interest and ranges from one to n. - f p(l): denotes the average footprint calculated for all the l-length windows. f p(l) takes real values between one and l. - wf pj (l): denotes the footprint calculated for the j-th window among all the windows of length l where j ranges from one to n−l +1. wf pj (l) takes integer values between one and l.

AC

CE

PT

Consider access i, denoted by ai , accessing a datum, d. When ai is the first to access d in a window, it increases the footprint of that window. In this article, such an access is termed contributing access (see Definition 3.2). The main idea based on which our model is built is counting the number of windows wherein ai is contributing. We employ two auxiliary parameters to define f p(l) as a function of contributing accesses. The parameters are introduced here and explained in more detail in the following paragraphs. - f ci (l): represents the number of windows of length l within which access ai is a contributing access, and ai is not the first access in any of the windows (see Section 3.2.1). - f c(l): the total number of contributing accesses in all windows of length l. 12

ACCEPTED MANUSCRIPT

The proposed model aims to calculate both f p(l) (i.e., the average footprints) and wf pj (l)s for the desired l (i.e., the all-window footprints). In what follows, we derive a model to meet these goals. First, we introduce several definitions and then we explain our proposed model.

CR IP T

Definition 3.1. Previous clean interval: For access ai accessing a datum d, the previous clean interval is defined as the largest continuous sequence of accesses in the trace that (1) terminates at ai , and (2) does not contain any other access to d.

AN US

Let cli denote the length of the clean interval of ai . Parameter cli counts the length of the access sequence ending at i and either beginning after the previous access to d, or stretching back to the beginning of the trace (when no access to d has taken place up to index i). Let ˆi be the index of the previous access to d where ˆi < i, and ˆi = 0 when ai is the first access to d. We have cli = i − ˆi − 1. When ai is the first access to d in the trace, we get cli = i − 1. As an example, in the trace shown in Fig. 3a, the length of the clean intervals of the fifth (i=5) and the seventh (i=7) accesses are equal to two and six, respectively.

M

Property 3.1. Based on the definition of previous clean interval, we have cli < i for all i ∈ {1, ..., n}.

ED

Definition 3.2. Contributing access: In a window of l accesses enclosing {j, ..., j + l − 1}, access ai , i ∈ {j, ..., j + l − 1}, is a contributing access (CA), when it is the first access in the window that refers to d.

AC

CE

PT

When cli ≥ i − j, then ai is a CA because no previous access is located within the window of interest. In a given window, an access that is not a CA is called neutral access. The number of CAs in a window represents the footprint of the window. Further, the average footprint of all l-length windows can be expressed as a function of CAs. First, CAs in all windows of length l, denoted by f c(l), is calculated and then it is divided by the total number of windows of length l: f c(l) , (4) n−l+1 where the denominator represents the total number of windows of length l in a trace of n accesses. The next step towards calculating the average footprints is deriving an expression for f c(l) as a function of CAs. f p(l) =

13

ACCEPTED MANUSCRIPT

Property 3.2. The first access in a window of any length is always a CA.



X

i∈{1,...,n}



f ci (l) + n − l + 1,

(5)

AN US

f c(l) =

CR IP T

Property 3.2 suggests that the minimum number of CAs in all windows of length l equals n−l +1, that is the number of windows of length l. Hence, we have f c(l) ≥ n − l + 1 and f p(l) ≥ 1. In addition, accesses located within the windows at locations other than the first index might also be contributing. Instead of counting CAs in each window, it is more efficient to count the number of windows within which each access is a CA. Parameter f c(l) that defines the total number of CAs in windows of length l can be calculated using Equation (5):

CE

PT

ED

M

where function f ci (l) represents the number of windows of size l that contain ai as a non-starting CA (NSCA). An access in a window is an NSCA when it is a CA not located at the beginning of the window. The second term, n − l + 1, counts all CAs located at the beginning of all windows of size l (see Property 3.2). Note that since there are l overlapping windows of length l at the most at each index, each access in a trace appears in up to l windows among the total of n − l + 1 windows (assuming n − l + 1 ≥ l). With that in mind, we derive an expression for f ci (l) to efficiently calculate NSCAs. We show that f ci (l) can be calculated based on several parameters as a function of l and i. We divide a trace into three consecutive intervals to define the determinative parameters for f ci (l) at each index and for each window length. After deriving an expression for f ci (l), we present an algorithm to traverse a trace and calculate f ci (l) to define f c(l)s and f p(l)s of the desired window lengths. 3.2.1. Defining f ci (l) Let ai be an access not at the beginning of a window, w.

AC

Property 3.3. Access ai is an NSCA in w if cli ≥ i − j where j is the index of the lower bound of w and j < i. Instead of defining CAs window by window, f ci (l) defines the number of windows which contain ai as NSCA whereby enhancing the calculation efficiency. For the sake of simplicity, let’s assume that there are l overlapping 14

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

M

AN US

CR IP T

windows enclosing ai . The first window contains ai as its last access while the l-th window contains ai as its first access. Note that f ci (l) only counts NSCAs. Among the l − 1 overlapping windows that contain ai as a nonstarting access, the number of windows that enclose ai as an NSCA is a function of cli . Property 3.3 suggests that when cli = r, then ai is NSCA in r windows among the l − 1 windows (given that r ≤ l − 1). In other words, f ci (l) is limited to cli and the number of overlapping windows that contain ai as a non-starting access. As explained later, Property 3.3 is not always applicable to define f ci (l) because the number of overlapping windows is not always l. Fig. 3 shows the relation between cli and f ci (l) for three window lengths 2, 3, and 4. As shown in Fig. 3a, a5 is located within four windows including w2 to w5 (other windows are shown as dashed arrows). In w5 , shown as a black arrow, a5 is the starting CA. Such cases are counted by the constant term in Equation (5). Further, a5 is not a CA in w2 because a3 , as a previous access, is located in this window. Such windows are illustrated as gray arrows. Further, a5 is NSCA in w3 and w4 , illustrated by bold gray arrows. Fig. 3b and Fig. 3c depict the same for windows of length three and two. In Fig. 3c, f ci (l) is limited by l − 1 rather than by cli . When we assume that the number of overlapping windows is l, among which ai is a non-starting access within l − 1 windows, then, f ci (l) can be defined as min(l − 1, cli ). However, these two parameters are not enough to accurately define f ci (l) throughout a trace. The reason is that the number of overlapping windows is not always l. Instead, the number of overlapping windows is variable throughout the trace. Below, we develop a general expression for f ci (l) that is applicable to all accesses in a trace to define NSCAs. In general, f ci (l) is a function of two main parameters: (1) the number of windows containing ai as a non-starting access, denoted by nwi (l), and (2) a parameter that models the effects of the distance between this and the previous reference to the same datum, denoted by pdi : f ci (l) = min(nwi (l), pdi ).

(6)

The parameter pdi is used to define among nwi (l) windows those within which ai is the first access to its datum. Since the number of overlapping windows at each access is l at the most where one of them contains the access at its first index, in the basic form, nwi (l) equals l − 1 and pdi equals cli . 15

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

M

AN US

CR IP T

However, as discussed below, in some cases, l − 1 and cli are not adequate for the definition of f ci (l). The parameters nwi (l) and pdi depend on the pattern of the overlapping windows. This pattern is not the same at all accesses along the trace. We divide a trace into several intervals to appropriately define nwi (l) and pdi within each interval. In the first interval, a new window starts at each access while no window ends thus the number of overlapping windows increments when moving forward through the interval. In the second interval, the number of overlapping windows remains the same throughout the interval. Lastly, in the third interval, a window ends at each access and no new window begins and consequently, the number of overlapping windows is diminishing when moving forward within the interval.

16

ACCEPTED MANUSCRIPT

1 2 3 4 5 6 7 8 9 a a b c a d e f c w1

w2

w3

w4

w5

n = 9, l = 4 cl5 = 2 f c5 (4) = 2

w6

(a) Window size=4

w1

w2

w3

w4

AN US

1 2 3 4 5 6 7 8 9 a a b c a d e f c

i T race

CR IP T

i T race

w5

w6

n = 9, l = 3 cl5 = 2 f c5 (3) = 2

w7

(b) Window size=3

w2

w3

PT

ED

w1

M

1 2 3 4 5 6 7 8 9 a a b c a d e f c

i T race

w4

w5

w6

w7

w8

n = 9, l = 2 cl5 = 2 f c5 (2) = 1

(c) Window size=2

CE

Figure 3: An example to show the dependence of f ci (l) to cli and window length. The bold gray arrows show the windows in which a5 is an NSCA

AC

Consider a trace of n accesses and windows of length l. The first window starts at the first access and a new window starts at each subsequent access until access n − l + 1 where the last window starts. As a result, throughout a given interval in the trace, the number of overlapping windows is growing until it reaches a peak. Further, in some interval in the trace, the number of overlapping windows is decreasing because after a given index no new window is initiated while the previous windows start ending. Since the model 17

ACCEPTED MANUSCRIPT

CR IP T

requires the number of overlapping windows at each index, this changing pattern should be considered to accurately define nwi (l) at each index. On this ground, we divide a trace into three intervals as follows. Later, we define the bounds of each interval as a function of l. Let s and e be the lower and upper bounds of an interval and p be an integer value, and let owi (l) denote the number of l-length overlapping windows at the i-th access in an interval. Definition 3.3. Incremental interval: An incremental interval (IncInt) is a sequence of accesses in a trace in which owi+1 (l) = owi (l) + 1 for all i ∈ {s, ..., e}.

AN US

Definition 3.4. Steady interval: A steady interval (StdInt) is a sequence of accesses in a trace in which owi (l) = p for all i ∈ {s, ..., e} where p is a constant integer value.

M

Definition 3.5. Diminishing interval: A diminishing interval (DimInt) is a sequence of accesses in a trace within which the number of overlapping windows decreases at the rate of one window per access so that owi (l) = p − j where i ∈ {s, ..., e} and p is a constant value that represents the number of overlapping windows at the beginning of the interval and j is the intra-interval index (j ∈ {0, ..., e − s − 1}). Further, owi+1 (l) = owi (l) − 1.

AC

CE

PT

ED

In an incremental interval, a new window is initiated at each access but no window is ended in the interval. Note that when ow1 (l) = 1, we obtain owi = i. Thus, the number of overlapping windows grows at the rate of one. In a steady interval, the number of overlapping windows is constant since, at each access in the interval, no new window is either initiated or ended, or if a new window is initiated, another window is ended. Hence, the number of overlapping windows remains constant. When l << n, this interval is significantly larger than other intervals. In a diminishing interval, no new window is initiated but one window is ended at each access. Hence, the number of overlapping windows decreases at the constant rate of one. 3.2.2. Defining Interval Bounds and nwi (l) As mentioned before, a trace is treated as three non-overlapping consecutive intervals, which includes an incremental, a steady, and a diminishing interval. The purpose of partitioning the trace is to derive expressions for nwi (l) and pdi applicable to all accesses located within the same interval. 18

ACCEPTED MANUSCRIPT

AN US

CR IP T

However, nwi (l) and pdi may have different expressions within different intervals. Further, for some window lengths, one or two intervals can be empty. In the following, we define the bounds of the intervals as functions of l and n. Changing from one interval to another is due to a change in the pattern of the overlapping windows, which itself is dictated by the values of l and n. For window length l, a window is initiated at every access up to index n−l+1. Hence, first, the number of the overlapping windows is increasing up to a maximum value. Since each window encloses l accesses, the maximum is limited to l. Afterward, the number of the overlapping windows either starts diminishing, or it is more likely to remain at the maximum for a while and then starts diminishing until it eventually decreases to one at the end of the trace. Below, two definitions are provided to be used as a basis to define the bounds of the intervals. Definition 3.6. First-peak index: The first-peak index, denoted by pi, is the smallest index at which the number of overlapping windows, owpi (l), reaches the maximum number of overlapping windows of length l.

M

Definition 3.7. Last-peak index: The last-peak index, denoted by pi0 , is the biggest index at which the number of overlapping windows, owpi0 (l), has the maximum value of overlapping windows of length l.

AC

CE

PT

ED

By definition, we have pi ≤ pi0 and owpi = owpi0 . Further, for all i < pi and i > pi0 , we have owi (l) < owpi (l). Furthermore, for all pi ≤ i ≤ pi0 we have owi (l) = owpi (l). Peak indexes define bounds of the intervals. For example, an incremental interval ends at the first-peak index, where a steady interval (if any) begins. Moreover, all indexes between the first- and the last-peak indexes constitute a steady interval. Similarly, a diminishing interval begins at the last-peak index. The first- and last-peak indexes can be defined based on n and l. Here, we show that when l < n − l + 1, pi and pi0 are equal to l and n − l + 1, respectively. Index l is the last index enclosed by the first window and index n − l + 1 is the starting index in the last window. We further show that in cases in which l ≥ n − l + 1, we have pi = n − l + 1 and pi0 = l. Let wj be the j-th window of length l where j ∈ {1, ..., n−l+1}. The first window, w1 , encloses {1, ..., l} and the last window contains {n − l + 1, ..., n}. Further, wj begins at index j and ends at index j + l − 1. 19

ACCEPTED MANUSCRIPT

ED

M

AN US

CR IP T

Let’s assume that l = n/2. At index n − l + 1, (i.e., at index l + 1 because l = n/2), the last window begins while w1 was ended at the previous index. Thus, the first and the last windows do not overlap. In this case, at index l, the number of overlapping windows equals l since one window is started at each index while no window is ended up to this point. Similarly, for all l < n/2, since l < n − l + 1, the overlapping windows first reaches l at the l-th index. Thus, pi = l and owpi = l. Further, the number of overlapping windows remains equal to l up to index n − l + 1 where the last window is initiated. After index n − l + 1, no new window is initiated while one window in ended at each index hence the number of overlapping windows starts diminishing after index n − l + 1. Consequently, pi0 = n − l + 1. In summary, when l ≤ n/2 (i.e., when l < n − l + 1), then pi = l and pi0 = n − l + 1. In addition, we have owpi = owpi0 = l. On the other hand, for all l > n/2 (i.e., when l ≥ n − l + 1), the maximum number of overlapping windows is less than l since the last window is initiated at index n − l + 1. Consequently, when l > n/2, we have pi = n − l + 1. The last-peak index, in this case, is l, where the windows eventually start ending one after another. Hence, we have pi0 = l. Thus, when l > n/2, we have pi = n − l + 1 and pi0 = l. Further, we have owpi = owpi0 = n − l + 1. To put it another way, pi = min(l, n − l + 1), pi0 = max(l, n − l + 1), and owpi = owpi0 = pi. Further, according to Definition 3.3 to 3.5, the intervals can be determined based on the peak indexes as follows.

PT

- An incremental interval, denoted by IncInt, includes {1, ..., pi}. - A steady interval, denoted by StdInt, includes {pi + 1, ..., pi0 }. - A diminishing interval, denoted by DimInt, includes {pi0 + 1, ..., n}.

CE

Below we define the bounds of the intervals for two cases of l ≤ n/2 and l > n/2, and determine nwi (l) within each interval.

AC

Definition 3.8. Inverse index: The inverse index in a trace of n accesses, denoted by i0 , is defined as n − i + 1.

The parameter i0 equals n at the beginning of the trace and equals one at the end of the trace.

1. l ≤ n/2: In this case, we have pi = l and pi0 = n − l + 1. (a) IncInt: An incremental interval includes {1, ..., l}. One window starts at each access in this interval whereas no window ends. 20

ACCEPTED MANUSCRIPT

CE

PT

ED

M

AN US

CR IP T

Further, for all i ∈ {1, ..., l} we have owi (l) = i, and because the starting CAs are not included in f ci (l) (see Equation 5), nwi (l) = i − 1. (b) StdInt: A steady interval contains {l + 1, ..., n − l + 1}. In this interval, one window is initiated and another is ended at each access. Thus, the number of overlapping windows remains equal to l, i.e. owpi (l). One window among the l overlapping windows at each index contains the access as its first access. Thus, by definition, in this interval, we have nwi (l) = l − 1. (c) DimInt: A diminishing interval includes {n−l +2, ..., n}. Further, nwi (l) = i0 (see Definition 3.5 and Definition 3.8). This interval contains the last l − 1 accesses in the trace and no new window is initiated within this interval. Further, at the first access, an−l+2 , the number of overlapping windows is l − 1 and it diminishes to one at the last access. Hence, the number of overlapping windows can be expressed as i0 . This is the same expression as p − i in Definition 3.5, where p equals l − 1, and trace index is converted to the intra-interval index by applying i − (n − l + 2). Thus nwi (l) equals l − 1 − i + n − l + 2 or n − i + 1, which is defined as i0 . 2. l > n/2: In this case, we have pi = n − l + 1 and pi0 = l. (a) IncInt: An incremental interval contains {1, ..., n−l +1}. Further, we have nwi (l) = i − 1. (b) StdInt: A steady interval includes {n−l+2, ..., l} and nwi (l) = n− l + 1. Note that since no window in initiated within this interval, none of the accesses within this interval appear as a starting access. (c) DimInt: A diminishing interval contains {l + 1, ..., n}. In this interval, we have nwi (l) = i0 (see Definition 3.5 and Definition 3.8). In fact, p in this interval equals n − l, and i is converted to an intra-interval index as i − (l + 1), yielding n − l − i + l + 1 and hence nwi (l) = i0 .

AC

As an example, Fig. 4 illustrates the limits of the intervals for two window lengths including l=3 (l ≤ n/2 case) and l=8 (l > n/2 case).

3.2.3. Defining pdi in the Intervals Given the number of windows containing ai as a non-starting access (i.e., nwi (l)), the parameter pdi is used to specify the number of windows containing ai as an NSCA by applying Equation 6. This parameter has the same 21

ACCEPTED MANUSCRIPT

AN US

CR IP T

expression at all accesses within the same interval, but it can be expressed differently from one interval to another. Below, pdi is determined in each interval. Assume that one window is initiated from the start of a trace up to index i. In this case, based on Definition 3.1, cli = r implies that the datum accessed by ai was last accessed at index i − r − 1. Consequently, it indicates that up to r windows, among windows of the same length, contain ai as an NSCA which include windows initiated from index i − r to i − 1 (the number of the windows is also limited by nwi (l), see Equation 6). In this case, we have pdi = cli . However, the assumption of initiating one window at each new access is true only for the first n − l + 1 accesses. After this index, no new window is initiated and thus cli cannot be applied to determine f ci (l). Instead, we define a new parameter as follows. Definition 3.9. Attuned clean interval length: The attuned clean interval length, denoted by cli0 , is defined as max(cli + (n − l + 2) − i, 0).

M

The max() operator in Definition 3.9 is to ensure non-negative values. The parameter cli0 is used for all indexes i > n − l + 1 as a substitute for cli to compensates the loss of initiating new windows. The expressions of pdi in different intervals are defined as follows.

AC

CE

PT

ED

1. l ≤ n/2 (a) pdi in an IncInt: The parameter pdi in an incremental interval equals cli since cli = r means that among the windows containing ai as a non-starting access, ai is contributing within at most r windows. (b) pdi in an StdInt: Similar to the previous case, pdi is equal to cli in this case. (c) pdi in a DimInt: The parameter pdi is cli0 in this case. The reason for using cli0 is that in a decreasing interval, no new window is initiated while one window is ended at each access. Index n − l + 1, i.e. the last member of the StdInt, is where the last window is initiated. When cln−l+2 = r, an−l+2 still appears in up to r windows as an NSCA. However, at access n − l + 3, cli = r means that ai appears as an NSCA in at most r − 1 windows and this value continues to diminish in the subsequent accesses. Thus cli is substituted with cli0 . 2. l > n/2 22

ACCEPTED MANUSCRIPT

CR IP T

(a) pdi in an IncInt: The parameter pdi in this interval is the same as in an IncInt when l ≤ n/2. Thus, pdi = cli . (b) pdi in an StdInt: In this interval, no new window in initiated. The lower bound of this interval is the same as in the DimInt for l ≤ n/2. Thus, in this case pdi = cli0 . (c) pdi in a DimInt: Since no new window is initiated from the beginning of the StdInt, cli0 is also applied to this interval as pdi .

AN US

3.2.4. Putting it All Together Given nwi (l) and pdi as the limiting parameters (see Section 3.2.2 and 3.2.3), the final expression of f ci (l), as a function of i, l, and n, can be obtained. Parameters nwi (l) and pdi defined in each interval are placed into Equation (6) to obtain f ci (l) in the respective interval. 1. l ≤ n/2 (a) f ci (l) in an IncInt: In this interval, nwi (l) equals i − 1 and the maximum value it can take equals l −1. Further, cli is used as pdi . Consequently, as suggested by Property 3.1, given that cli ≤ i − 1, cli is the only limiting factor in this case:

M

∀l ≤ n/2, i ∈ {1, ..., l} :

f ci (l) = cli .

(7)

PT

ED

Example. As depicted by Fig. 4a, the IncInt for l = 3 includes {1, 2, 3} and f c1 (3) and f c2 (3) are equal to zero and one, both limited by cli . (b) f ci (l) in an StdInt: The parameter nwi (l) equals l − 1 and pdi equals cli : ∀l ≤ n/2, i ∈ {l + 1, ..., n − l + 1} :

f ci (l) = min(l − 1, cli ). (8)

AC

CE

Example. In Fig. 4a, the StdInt for l = 3 includes {4, ..., 9} and f c4 (3) to f c9 (3) are equal to the minimum value between cli and l − 1. (c) f ci (l) in a DimInt: The parameter nwi (l) equals i0 and pdi equals cli0 : ∀l ≤ n/2, i ∈ {n − l + 2, ..., n} :

f ci (l) = min(i0 , cli0 ).

(9)

Example. In Fig. 4a, the DimInt for l = 3 includes {10, 11} and f c10 (3) and f c11 (3) are equal to zero and one, respectively. 23

ACCEPTED MANUSCRIPT

CR IP T

2. l > n/2 (a) f ci (l) in an IncInt: In this interval, nwi (l) = i − 1 and pdi = cli . Thus, similar to the IncInt in l ≤ n/2, the limiting parameter is cli : ∀l > n/2, i ∈ {1, ..., n − l + 1} : f ci (l) = cli . (10) Example. As illustrated in Fig. 4b, the IncInt for l = 8 includes {1, ..., 4} and f c1 (8) to f c4 (8) are equal to cli . (b) f ci (l) in an StdInt: In this interval, nwi (l) = n−l+1 and pdi = cli0 . The parameter cli0 is the limiting factor because in this interval, cli0 ≤ nwi (l): f ci (l) = cli0 .

AN US

∀l > n/2, i ∈ {n − l + 2, ..., l} :

(11)

Example. In Fig. 4b, the StdInt for l = 8 includes {5, ..., 8} and f c5 (8) = 4, which is limited by cli0 . (c) f ci (l) in a DimInt: The parameter nwi (l) equals i0 and pdi equals cli0 :

M

∀l > n/2, i ∈ {l + 1, ..., n} :

f ci (l) = min(i0 , cli0 ).

(12)

ED

Example. In Fig. 4b, the DimInt for l = 8 includes {9, 10, 11} and nw9 (8) is equal to i0 = 3 and f c9 (8) = 0, limited by cli0 .

PT

The limiting parameters in each interval yield the minimum value among all the parameters in the model. Thus, Equations (7) to (12) can be reduced to Equation (13). ∀i, l ∈ {1, ..., n} :

f ci (l) = min(l − 1, i0 , cli , cli0 ).

(13)

AC

CE

Fig. 4, shows an example for calculating the average footprints for window lengths of 3 and 4 in a sample trace. In this figure, the values of f ci (3) and f ci (8) are calculated using Equation (13). Then, f c(3) and f c(8) are calculated by using Equation (5). Note that the bold values in each column represent the minimum values. As illustrated in the figure, the proposed model accurately calculates the footprint values. The results of the model are compared against those of the direct counting-based method. Since parameter cli is independent of window length, once calculated, it can be applied to all window sizes. Consequently, the proposed model can compute f p(l)s of the desired window lengths through a single trace traversal. 24

ACCEPTED MANUSCRIPT

f c(l) =

n−l+1 X j+l−1 X  j=1



CR IP T

3.3. Comparison with Direct Counting Let ˆi be the position of the last access to the same datum accessed by ai . Further, let ˆi = 0 when ai is the first access so far. The direct counting method counts the data size of all l-length windows as follows: B(ˆi < j) ,

i=j

where B(x) is a Boolean operator and it equals one when condition x it true and zero otherwise. Applying Property 3.2, we obtain the following equation.

j=1



B(ˆi < j) + n − l + 1.

AN US

f c(l) =

n−l+1 X j+l−1 X  i=j+1

The above equation counts the data sizes of one window in O(n2 ). On the other hand, in the proposed model, f c(l) has the following expression. f c(l) =

X i∈I



min(l − 1, i0 , cli , cli0 ) + n − l + 1,

ED

M

where I represents the trace. Comparing the two above models, when the number of overlapping windows is large, which is the case when the cache performance modeling is of concern, the proposed model can significantly accelerate the calculation.

AC

CE

PT

3.4. Algorithm for Footprint Calculation In this section, an algorithm is proposed for the average footprint calculation based on the presented model in the previous section. Algorithm 1 shows the details of the footprint calculation in a serial trace denoted by I. The trace I contains n accesses. The maximum window length can be limited to lmax accesses to alleviate the time complexity. The accesses in the trace are analyzed consecutively and f ci (l)s are calculated for each desired window length. For access ai , the index of the previous access to the same datum, i.e. ˆi, is first acquired (line 9) and cli is then calculated (line 10). A hash table can be used to find ˆi in O(1) time and O(m) space (m is the number of unique data in the trace). Note that for the first access to a datum, we have ˆi = 0. After calculating cli , for each desired window, first, f ci (l) is calculated based on Equation 13 and then, f c(l) is updated based on Equation 11 (line 11-15). The state of the hash table is updated after each access (line 16). 25

1 2 3 4 5 6 7 8 9 10 11 a b a c d e a e d d b 2

0 9 11 2 0

cli cli0 i0 l−1 f ci (3)

1 9 10 2 1

1 8 9 2 1

3

3 9 8 2 2

4 9 7 2 2

3

5 9 6 2 2

3

3

3 6 5 2 2

2

1 5 4 2 1

3 4 3 2 2

f p(3) = 3

2

0 0 2 2 0

2

8 7 1 2 1

23 9

AN US

i T race

CR IP T

ACCEPTED MANUSCRIPT

IncInt(l = 3) StdInt(l = 3) DimInt(l = 3) 23 9

f c(3) = 14 + 9 ⇒ f p(3) =

(a) Window Size of Three (l = 3)

M

1 2 3 4 5 6 7 8 9 10 11 a b a c d e a e d d b

0 4 11 7 0

CE

PT

cli cli0 i0 l−1 f ci (8)

ED

i T race

AC

f c(8) =

1 4 10 7 1

1 3 9 7 1

3 4 8 7 3

4 4 7 7 4

5 4 6 7 4

3 1 5 7 1

1 0 4 7 0

5

3 0 3 7 0

5

0 0 2 7 0

4

f p(8) =

19 4

5

8 2 1 7 1

IncInt(l = 8) StdInt(l = 8) DimInt(l = 8) 15 + 4 ⇒ f p(8) = 19 4

(b) Window Size of Eight (l = 8)

Figure 4: An example showing the calculation of f c(3) and f c(8) using the proposed model and its comparison with direct counting

26

ACCEPTED MANUSCRIPT

Finally, the footprint of each window length is calculated based on Equation (4) (line 18-20).


3.4.1. Calculation Complexity of the Algorithm

The proposed algorithm requires a single trace traversal. For each access, $cl_i$ should be calculated, which takes O(1) time using a hash table with one entry per datum; calculating the $cl_i$ values throughout a trace therefore requires O(n) time and O(m) space. Additionally, the proposed model should be applied to calculate the $fc(l)$ values, which requires up to $n$ iterations per access. Accordingly, the total complexity is $O(n^2)$ for all window lengths, instead of $O(n^3)$ for the direct counting algorithm. It should be noted, however, that the operation performed for calculating $fc_i(l)$ is a rather simple minimum computation, and that the footprint calculations of different window lengths are independent and can be performed in parallel. In a practical sense, therefore, the proposed algorithm results in affordable execution times, and the calculations can be further accelerated through parallel execution.
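As a hedged illustration of this parallelization (the per-length recurrence is the one used in Algorithm 1 below; the helper and pool setup are our own sketch):

```python
from concurrent.futures import ProcessPoolExecutor

def fp_one_length(args):
    """fp(l) for a single window length l, in one trace pass."""
    trace, l = args
    n, fc, last = len(trace), len(trace) - l + 1, {}
    for i in range(1, n + 1):
        cl_i = i - last.get(trace[i - 1], 0) - 1
        cl_ip = max(cl_i + (n - l + 2) - i, 0)
        fc += min(l - 1, n - i + 1, cl_i, cl_ip)   # fc_i(l)
        last[trace[i - 1]] = i
    return fc / (n - l + 1)

def parallel_fp(trace, lengths, workers=4):
    """Window lengths are independent, so each fp(l) can be computed
    in its own worker process (guard the caller with
    if __name__ == "__main__" on platforms that spawn processes)."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(zip(lengths, pool.map(fp_one_length,
                                          [(trace, l) for l in lengths])))
```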


3.5. Window Distribution Analysis

Window distribution analysis aims at gaining insight into the locality of an application by inspecting the footprints of all windows of the same length. For example, statistical parameters, such as the standard deviation (STD), can be calculated among the footprints of the windows of length $l$ to study the locality phase changes exhibited by the application. Hence, this analysis can provide useful insight into the application's data access pattern [11]. However, window distribution analysis requires all-window footprints. Compared with Xiang's formula, the main benefit of our proposed algorithm is that it can be employed for all-window footprint analysis. Other algorithms, e.g., the algorithm presented in [26], either have significant time complexities or alleviate their time complexity at the cost of accuracy. Our method, in contrast, enables the analysis in affordable times and generates the footprints without any loss of accuracy. To apply window distribution analysis to windows of length $l$, the footprint values of all $n - l + 1$ windows should be known. Let $wfp_j(l)$ be the footprint of the $j$-th window of length $l$, that is, the number of CAs in the window. In our presented algorithm, Algorithm 1, for each access, $fc_i(l)$ gives the number of windows in which $a_i$ is an NSCA. Thus, a total of $fc_i(l)$ windows exist whose footprints should be incremented by one.


Algorithm 1 Footprint calculation algorithm
Input: I, n, lmax
Output: fp(l), ∀l ∈ {1, ..., lmax}; m
 1: Procedure initialize()
 2:   fc(l) = (n − l + 1), ∀l ∈ {1, ..., lmax};
 3:   Hash(i) = 0, ∀i ∈ {1, ..., m};
 4: end Procedure
 5: Begin
 6: initialize();
 7: for all i in 1 to n do
 8:   a_i = I(i);                   {get a new access}
 9:   î = Hash(a_i);                {get the index of the previous access}
10:   cl_i = i − î − 1;
11:   for all l in 1 to lmax do
12:     cl_i' = max(cl_i + (n − l + 2) − i, 0); i' = n − i + 1;
13:     fc_i(l) = min(l − 1, i', cl_i, cl_i');   {for window length l, count the windows in which a_i is an NSCA}
14:     fc(l) = fc(l) + fc_i(l);    {add the result to the running total}
15:   end for
16:   Hash(a_i) = i;                {update the hash table}
17: end for
18: for all l in 1 to lmax do
19:   fp(l) = fc(l)/(n − l + 1);    {calculate the average footprints}
20: end for
21: end
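For readers who prefer running code, a minimal Python transcription of Algorithm 1 follows (variable names are ours; the hash table is a Python dict); on the Fig. 4 trace it reproduces fp(3) = 23/9 and fp(8) = 19/4:

```python
def average_footprints(trace, lmax):
    """Algorithm 1: average footprints fp(l) for l in {1, ..., lmax},
    computed in a single traversal of the trace (O(n * lmax) total)."""
    n = len(trace)
    fc = [0] + [n - l + 1 for l in range(1, lmax + 1)]   # line 2; fc[0] unused
    hash_tab = {}                                        # line 3
    for i in range(1, n + 1):                            # lines 7-17
        a_i = trace[i - 1]
        i_hat = hash_tab.get(a_i, 0)                     # 0 for a first access
        cl_i = i - i_hat - 1                             # line 10
        i_prime = n - i + 1
        for l in range(1, lmax + 1):                     # lines 11-15
            cl_ip = max(cl_i + (n - l + 2) - i, 0)
            fc[l] += min(l - 1, i_prime, cl_i, cl_ip)    # fc_i(l)
        hash_tab[a_i] = i                                # line 16
    return {l: fc[l] / (n - l + 1) for l in range(1, lmax + 1)}  # lines 18-20

fp = average_footprints(list("abacdeaeddb"), 8)
print(fp[3], fp[8])   # 2.555... (= 23/9) and 4.75 (= 19/4)
```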


Among all windows of length $l$, $a_i$ increments the footprint values of those windows that satisfy two conditions: (1) the window contains $a_i$, and (2) $a_i$ is a CA in the window. These two conditions can be checked based on $l$, $i$, and $fc_i(l)$. Fig. 5 illustrates an example showing the relation between $fc_i(l)$ and the per-window footprints. In this figure, all windows of length four are numbered $w_1$ to $w_5$. At index $i = 2$, we have $fc_2(4) = 1$, and the footprint of the first window, $w_1$, should be incremented by one. In this case, $a_2$ is the first access in $w_2$, which is treated separately by incrementing the working set size of every window by one to account for the starting CAs. As another example, at $a_6$, $fc_6(4)$ equals three; thus, the footprints of $w_3$ to $w_5$ should each be incremented by one. In general, for all $i \le n - l + 1$ (up to where the last window is initiated), when $fc_i(l) = p$ with $p > 0$, the footprint values of $w_{i-p}$ to $w_{i-1}$ should be incremented by one. Conversely, for $i > n - l + 1$, when $fc_i(l) = p$, the footprint values of $w_{n-l+2-p}$ to $w_{n-l+1}$, i.e., the last $p$ windows, should be incremented by one. Algorithm 2 presents the procedure for calculating the per-window footprints of windows of length $l$, in which the footprints of the windows are updated based on the calculated $fc_i(l)$. The presented procedure is invoked after the calculation of $fc_i(l)$ in Algorithm 1. The min() operator generates the indexes of the windows whose footprints should be incremented: when $i \le n - l + 1$, it returns $i - k$, and when $i > n - l + 1$, it returns $n - l + 2 - k$. After traversing the trace, the footprint value of each window is incremented by one (not shown in the algorithm) to include the effect of the first access in each window. The calculation complexity of the procedure is O(l). Note that the procedure is only applied to calculate the per-window footprints of the desired window lengths.

4. Framework for Memory Footprint Analysis in GPU Kernels


In this section, we propose a framework to approximate the performance of cache memories in GPUs; in this work, we consider the L1 caches located within the SMs. The framework is named FGCA, short for footprint-based GPU cache performance analysis. In FGCA, footprint analysis is utilized as the basis for providing the miss rates of the cache memories. Fig. 6 illustrates the work-flow of FGCA. In order to apply footprint analysis to a kernel running on a GPU, first, the access traces of the cache memories should be provided.


Algorithm 2 Per-window footprint calculation procedure for l-length windows
Input: l, i, fc_i
Output: updated wfp[j], ∀j ∈ {1, ..., n − l + 1}
1: for all k in fc_i down to 1 do
2:   j = min(i − k, n − l + 2 − k);   {generate the index of a window that contains a_i as a CA}
3:   wfp[j]++;   {update the footprint of that window}
4: end for

i        : 1  2  3  4  5  6  7  8
Trace    : a  b  c  d  a  e  d  b
fc_i(4)  : 0  1  2  3  3  3  1  1

(windows of length four: w_1 = a_1..a_4 through w_5 = a_5..a_8)

Figure 5: An example showing the relation between fc_i(l) and the per-window footprints.
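A minimal Python sketch combining Algorithm 1's per-access step with Algorithm 2's window updates (names are ours); on the Fig. 5 trace it yields the per-window footprints 4, 4, 4, 3, 4:

```python
def per_window_footprints(trace, l):
    """Per-window footprints wfp[j] for all windows of length l
    (1-based window index j), via Algorithms 1 and 2."""
    n = len(trace)
    wfp = [0] * (n - l + 2)                   # wfp[1..n-l+1]
    last = {}
    for i in range(1, n + 1):
        cl_i = i - last.get(trace[i - 1], 0) - 1
        cl_ip = max(cl_i + (n - l + 2) - i, 0)
        fc_i = min(l - 1, n - i + 1, cl_i, cl_ip)
        for k in range(fc_i, 0, -1):          # Algorithm 2
            j = min(i - k, n - l + 2 - k)     # window holding a_i as a CA
            wfp[j] += 1
        last[trace[i - 1]] = i
    return [w + 1 for w in wfp[1:]]           # +1 per window: its starting CA

print(per_window_footprints(list("abcdaedb"), 4))   # [4, 4, 4, 3, 4]
```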


A cache trace contains serialized and ordered accesses to the cache; the thread-level parallelism of the GPU must therefore be converted into serialized traces that reflect it. Generally, in each SM of a GPU, an L1 cache is shared among the running threads. Thus, to analyze the cache performance of a GPU that runs a given kernel, the cache trace files should first be generated so that they simulate the access sequences to the L1 cache memories of the GPU. In FGCA, the L1 cache traces are generated from the raw information extracted from the kernel. Then, the footprint analysis is applied to the traces. Once the footprints of the generated traces are calculated, they can be used to estimate the miss rates of all cache sizes using Equation (1).

4.1. Generating L1 Cache Memory Traces

FGCA generates the cache memory traces of a kernel in such a way that the order of the accesses stored in the traces reflects the execution of the kernel on a specific GPU; it employs the same method proposed in [27]. GPUs execute threads


in a SIMT (single instruction, multiple threads) fashion. In the terminology introduced by NVIDIA, a group of co-running threads is called a warp; in current GPUs, a warp consists of 32 threads. After a CUDA kernel is launched, its thread blocks are distributed among the SMs. Then, depending on the number of threads per thread block, each thread block is divided into one or more warps. Only a limited number of thread blocks and warps can be executed in parallel on each SM, since all the co-running threads share the resources available on the SM. At least one L1 cache memory is implemented on each SM, which is used to cache the read accesses executed by the warps. In some GPUs, e.g., the Maxwell-generation GPUs [28], more than one L1 partition is implemented on each SM; however, the L1 partitions are independent and do not share any data. L1 caches only cache read accesses (they work under a write-invalidation policy) [20], so no cache coherence protocol is required at the L1 level. Below, we explain the functions performed in each step of FGCA, as illustrated in Fig. 6.


a) First, per-thread raw memory access information is extracted by manually instrumenting and then executing the kernel. The extracted information is raw and GPU-independent: only a counter for each memory access, along with the indexes of the active threads, is recorded, but not the order in which the threads are scheduled on the cores. The final order of execution is assigned in the next step. Hence, the raw information is extracted once and can be applied to simulate any desired GPU machine.

b) In the next step, given the GPU specification (SM count, maximum thread blocks and warps per SM) and the cache block size, the raw information is used to generate the cache memory traces. Memory accesses are processed based on the number of inflight thread blocks and the counter values assigned to the memory accesses. Further, each warp memory access is coalesced [20] to generate cache-block-sized accesses, called transactions (a sketch of this coalescing step follows the list). All the generated transactions of the same warp memory access are ordered based on their thread indexes. Once all the memory accesses of the inflight thread blocks have been processed, the accesses of the next group of inflight blocks (if any) are processed, until all the memory accesses are converted to cache transactions. It should be noted that since warps share data, their footprints are not composable; thus, it is not possible to model warp-level cache sharing using the footprints of individual warps. The


output of this step is the ordered cache traces of each SM.

c) Finally, the cache traces are analyzed using the introduced FPA. Note that the cache access traces are specific to the given GPU processor; for a GPU with different characteristics, the framework should re-generate the ordered access traces (by re-performing the second step of the framework). The outputs of FGCA include footprints, parameters obtained through window distribution analysis, and miss rates.
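As a rough illustration of the coalescing performed in step (b) (the function, the 128-byte line size, and the input format are our assumptions, not FGCA's actual implementation):

```python
def coalesce_warp_access(addresses, line_size=128):
    """Map the byte addresses issued by one warp memory access onto
    cache-block-sized transactions, preserving thread-index order."""
    seen, transactions = set(), []
    for addr in addresses:                    # ordered by thread index
        line = (addr // line_size) * line_size
        if line not in seen:                  # one transaction per block
            seen.add(line)
            transactions.append(line)
    return transactions

# A fully coalesced warp (32 consecutive 4-byte words) -> 1 transaction;
# a 128-byte-strided warp -> 32 transactions.
print(len(coalesce_warp_access([4 * t for t in range(32)])))     # 1
print(len(coalesce_warp_access([128 * t for t in range(32)])))   # 32
```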


There are many cache-related organizational parameters that could be included in the analysis. However, being a hardware-independent algorithm, footprint analysis does not take the effects of hardware-related parameters into account. Hardware independence enables the application of the same simulation results to many processor configurations; however, it comes at the cost of accuracy. It should be noted that, in the analysis method introduced here, the traces are GPU-dependent in order to model the GPU's thread-level parallelism, whereas the results of one simulation are applicable to all cache sizes. Nevertheless, the footprint analysis itself ignores many cache-related organizational parameters, including cache associativity, the replacement policy (if non-LRU), the cache indexing function, and miss status holding registers. This can lead to serious deviations in the estimated miss rates, especially for kernels with conflicting accesses. However, the results acquired by footprint analysis are representative enough to measure and understand the data locality of GPU kernels and can be utilized as a basis for estimating cache performance. Further, the proposed method can be used for analyzing multiple simultaneous kernel executions, which is an important field of study in modern GPUs.

5. Evaluation Methodology


In this section, we explain our evaluation methodology for investigating the proposed algorithm, and then we report and discuss our findings. We validate FGCA in terms of miss rate calculation against real hardware executions, using NVIDIA NVPROF to collect the miss rate values recorded on the GPU's performance counters. In this step, to compare FPA with a similar trace-driven simulation, we also include the detailed reuse distance analysis (RDA) proposed in [17]. Further, we compare FGCA with Xiang's formula [14]

Figure 6: Work-flow of FGCA. (a) Extracting raw memory access information by executing the instrumented kernel on the GPU; (b) per-cache-memory trace generation from the raw information and the GPU specification (#SMs, max warps, ...); (c) footprint analysis (FPA) of the cache traces, producing average footprints, per-window footprints, cache miss rates, and distribution analysis results.

and the model proposed in [26] (which we call the CKlogM algorithm) in terms of execution time. The footprint values generated by FGCA and Xiang's formula are essentially identical, as both models calculate the precise footprint values. However, because the CKlogM algorithm introduces some error, we also compare CKlogM and FGCA in terms of accuracy.


5.1. Evaluation Setup

We employed an NVIDIA GTX 970 GPU to evaluate the accuracy of FGCA. The GPU has 13 SMs, each having two 24 KB L1 cache memory slices. Further, we performed the experiments on a system with a Core i5 processor, 8 GB of RAM, and Ubuntu 14.04. To evaluate our algorithm, we utilized benchmarks from the Polybench/GPU [29] and Rodinia [30] benchmark suites, as shown in Table 1. We compiled the applications using CUDA 7.0 and GCC 4.8.4. In performing the simulations, we applied the FPA and the RDA to analyze only one of the L1 cache memories.


5.2. Validation and Evaluation of FGCA

Fig. 7 compares the miss rates estimated by FGCA with the performance counters recorded on the GTX 970 GPU and with the simulation results obtained by the RDA method proposed in [17]. The miss rates in FGCA are calculated using Equation (1). The average differences in the miss rates generated by RDA and FGCA with respect to NVPROF are 17.54% and 23.78%, respectively. As can be seen, the results of FGCA are correlated with those of the other methods. However, the miss rate error is significant in some kernels; we have further studied those cases by means of window distribution analysis, and the results are presented and discussed in Section 5.3.
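Equation (1) itself is defined earlier in the paper; purely as a hedged sketch, the conversion suggested by the $l_0$/$l_0 + 1$ experiment below takes the footprint growth around the window length that fills the cache as the miss rate (this is our reading, in the spirit of the HOTL theory [12], not necessarily the exact Equation (1)):

```python
def miss_rate_from_fp(fp, cache_size):
    """Approximate miss rate for a cache of the given size from an
    average footprint curve (fp[l] = average footprint of l-length
    windows, with fp[0] unused): find l0 with fp(l0) >= cache size and
    take the growth fp(l0 + 1) - fp(l0) as the expected misses per access."""
    l0 = next((l for l in range(1, len(fp) - 1) if fp[l] >= cache_size),
              None)
    return 0.0 if l0 is None else fp[l0 + 1] - fp[l0]
```

Note that when the footprint never reaches the cache size, this conversion yields a zero miss rate, which is exactly the COR behavior discussed in Section 5.5.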


Table 1: Benchmarks used in the evaluations

Application (kernel input size)    Kernel names
2Dconvolution† (4096)              2DC
Matrix Mult.† (384)                MM
3DConvolution† (256)               3DC
ATAX† (4096)                       ATX1, ATX2
BICG† (4096)                       BI1, BI2
Correlation† (256)                 COR
FDTD† (2048)                       FD1, FD2, FD3
GESUMMV† (4096)                    GES
MVT† (4096)                        MVT1, MVT2
SYR2K† (256)                       SYR
Backprop.‡ (262144)                BP1, BP2
CFD‡ (0.2M)                        CFD
Hotspot‡ (1024)                    HS
NW‡ (4096)                         NW1, NW2
SRAD_V2‡ (2048)                    SRD1, SRD2

† From the Polybench/GPU suite [29]. ‡ From the Rodinia V3.1 suite [30].


In the next step, we compare FGCA with RDA and Xiang's formula in terms of execution time. Fig. 8a and Fig. 8b depict the slowdowns of FGCA relative to the RDA method and Xiang's model, respectively. On average, FGCA is 124.6X and 1.55X slower than the RDA method and Xiang's formula, respectively. We also examined FGCA's performance in calculating only the window lengths $\{1, ..., l_0\}$, where $fp(l_0) = c$, i.e., up to the window length whose average footprint equals the cache size $c$. In this case, FGCA functions 47.5X slower than the RDA method. It should be noted that the applied RDA method requires O(n log(n)) time for calculating reuse profiles and miss rates [17], which explains why it is an order of magnitude faster than FGCA. Additionally, when FGCA was applied for only two window lengths, $l_0$ and $l_0 + 1$, it functioned 32.7X faster than the RDA method. Next, FGCA is compared with the CKlogM algorithm [26]. The CKlogM algorithm is a fast all-window footprint calculation method that employs analytical methods and uses scaled trees [22] to accelerate footprint calculations. As illustrated in Fig. 8c, FGCA has an average speedup of 26.36X over the CKlogM algorithm. In this experiment, two parameters of the CKlogM algorithm, i.e., C and the accuracy, were set to 64 and 99%, respectively. Further, the errors are calculated for one out of every one hundred window lengths, up to a window length of 100000.

Figure 7: L1 cache miss rates: GTX 970 GPU (NVPROF), RDA [17], and FGCA (y-axis: miss rate (%), per benchmark kernel).


The average error in the calculated footprints by the CKlogM algorithm is 0.95% for 99% accuracy. Performing the same experiment with 95% accuracy results in a speedup of 25.1X and an average error of 3.36%. Fig. 9 illustrates the footprints of several sample kernels calculated by FGCA and the CKlogM algorithm. Note that FGCA calculates footprints precisely, without any loss of accuracy, whereas the CKlogM algorithm imposes some error due to its use of scaled trees. However, the evaluation results indicate that the imposed errors are marginal.


5.3. Window Distribution Analysis

As stated in Section 3.5, the main benefit of the model proposed in this paper is that it can be employed for window distribution analysis. In comparison with average footprint analysis, window distribution analysis can provide the user with more detailed information on the kernel's locality. For example, in some kernels, the memory access pattern is not uniform, and in some execution phases, the kernel exhibits abrupt changes in the exposed locality. Such behaviors cannot be detected using average footprint analysis, since they are filtered out when the window footprints are averaged. In contrast, window distribution analysis can be employed to reveal such behaviors.
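Using the per_window_footprints sketch from Section 3.5, the distribution statistic used below is a one-liner (we use the population standard deviation; the paper does not specify which variant):

```python
import statistics

def window_footprint_std(trace, l):
    """STD among the footprints of all l-length windows, the
    phase-change indicator plotted in Fig. 10."""
    return statistics.pstdev(per_window_footprints(trace, l))
```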

Figure 8: Performance of FGCA with respect to (a) RDA, (b) Xiang's formula, and (c) the CKlogM algorithm (with C = 64 and 99% accuracy). Per-benchmark slowdowns are shown in (a) and (b) and speedups in (c); the averages are 124.6X, 1.55X, and 26.36X, respectively.

Figure 9: The average footprint plots (average footprint vs. window size) for 95% accuracy (the blue line, CKlogM) and the accurate footprints (the red line, FGCA) for sample kernels 2DC, MM, BP2, and GES; the average errors range from 3.4% to 4.98%.


Fig. 10 presents the standard deviation (STD) among the per-window footprints for windows of length $l_0$ (where $fp(l_0) = c$). The figure also illustrates the correlation between the calculated standard deviations and the miss rate errors. Note that the errors are calculated with respect to the results of the real GPU executions.


5.4. Multiple Kernel Execution

In this section, we present the results of modeling cache performance in multiple kernel executions. The aim is to approximate cache conflicts when the cache is shared among the co-run kernels (see Section 2.2.3). In the evaluations presented here, two simultaneous kernels are considered: eleven kernels were selected, and all of the two-kernel combinations were examined. The co-run footprints and miss rates are estimated based on Equation (3) and Equation (1), respectively. Moreover, the memory access rates of the kernels, required for estimating the joint footprints (see Equation (2)), are acquired from the GPU's performance counters. Figure 11 shows the resultant cache performance for the two-kernel combinations. For each combination, the weighted miss rates are also given; the weighted miss rates are calculated by averaging the individual miss rates weighted by the cache transaction counts of each kernel.

Figure 10: Standard deviation (STD) of the footprints of the $l_0$-length windows, where $fp(l_0) = c$, and its correlation to the miss rate errors (left axis: STD; right axis: miss rate error (%), per benchmark kernel).


Note that the weighted miss rates represent the conflict-free cache performance; thus, the difference between the weighted and the co-run miss rates is indicative of the cache conflict imposed by the simultaneous execution of the kernels. We observe that, in some execution combinations, the simultaneous kernel execution has led to notable cache conflicts: 2DC-GES, MM-HS, BI1-GES, GES-HS, and GES-NW1 are examples that result in large cache conflicts. In some cases, however, the cache conflicts are negligible, and the co-run miss rates are close to those of the individual executions; examples are MM-ATX1, BI1-BP1, and BP1-NW1. Overall, compared with the individual executions, the increase in the miss rates is within 5% in 16 out of 55 cases, whereas in 38 cases the miss rate variations are higher than 5%. On average, with respect to the weighted miss rates, the co-run miss rates are increased by 26.86%.
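Equations (2) and (3) are given earlier in the paper; purely as an illustration of why composability makes co-run analysis cheap, the sketch below composes two solo-run footprint curves under the common assumption that a shared window is split between the kernels in proportion to their access rates (our simplification, not necessarily the paper's exact formulation):

```python
def corun_footprint(fp_a, fp_b, rate_a, rate_b, l):
    """Joint footprint of two co-run kernels over a shared window of
    length l: each kernel contributes its solo-run footprint over its
    rate-proportional share of the window. fp lists are indexed by
    window length, with fp[0] = 0."""
    la = min(round(l * rate_a / (rate_a + rate_b)), len(fp_a) - 1)
    lb = min(l - la, len(fp_b) - 1)
    return fp_a[la] + fp_b[lb]
```

No additional trace analysis is needed: the two solo-run curves are reused for every pairing, which is why all 55 combinations above can be modeled from eleven solo-run analyses.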


5.5. Discussion

The results generated by FGCA indicate that, in addition to being an informative locality metric, the footprint can be used as a basis for the estimation of cache performance. However, since the effects of many hardware-related parameters are not modeled in FPA, and since the average footprints are applied for the miss rate calculations, significant skews can be encountered in the estimated performance of some benchmarks; examples are BI2, COR, GES, and MVT1. In some of these benchmarks, even the detailed RDA method fails to accurately predict the miss rates, although in the applied RDA method many performance-influencing parameters, such as cache associativity, cache indexing, memory latency, and miss status holding registers, have been modeled [17].

Figure 11: Simulating cache conflict in simultaneous kernel executions sharing the same cache memory (weighted solo-run vs. co-run miss rate (%) for each two-kernel combination).

BI2, GES, and MVT1 are examples in which both FPA and RDA fail to accurately predict the miss rates; this indicates that these benchmarks are hard to predict even when the detailed hardware parameters are modeled. As illustrated in Fig. 10, performing a window distribution analysis reveals that, in the majority of the kernels, the correlation between the miss rate error and the standard deviation is strong. The COR kernel is an exception, for which the calculated standard deviation is zero but the estimation error is large; in this case, the standard deviation calculated using the window distribution analysis does not provide any information about why the miss rate error is so high. To investigate the reason for this counter-intuitive behavior, we plotted the average footprints of COR for a subset of window lengths in Fig. 12. As can be observed, the average footprint first increases rapidly and then levels out. At the largest window length, the average footprint is still less than the cache size; as a result, the calculated miss rate equals zero. We also performed this benchmark on a GPU with 512 KB of L2 cache and the L1 disabled; in this case, the resultant hit rate was 99.68%.

Figure 12: Average footprint plot of the Correlation (COR) kernel (average footprint vs. window size).

This shows that even though the miss rate of COR on a 24 KB cache is close to one hundred percent, possibly due to its thrashing behavior, its data size is rather small, and its miss rate on a larger cache is close to zero. This indicates that ignoring the detailed hardware parameters can cause significant errors in some kernels; note that in this case, the detailed RDA provides an accurate estimate. In Section 5.4, we presented the evaluation results of multiple kernel executions. The results indicate that the most salient benefit of FPA stems from the composability property of the footprint: the footprints of co-run kernels can be estimated without any additional footprint analysis. In the case of the RDA method, all co-run combinations would have to be simulated to obtain the joint cache performance, which is impractical when the kernel set is large. Consequently, using FPA for the analysis of shared cache systems is considerably more efficient than using RDA.


6. Related Work


6.1. Footprint-based Cache Performance Modeling

Xiang et al. [12] studied the mathematical conversion between locality metrics, including footprint and reuse distance, and their conversion to miss rate curves. The authors proved that the footprint is concave and can thus be mathematically converted to other metrics. Acquiring the footprints in a trace of n accesses is time-consuming. Further, an algorithm was proposed


to alleviate the calculation of all-window footprints [26]. Later, Xiang et al. [14] developed a fast analytical model to calculate the average footprints instead of the all-window footprints. Since the footprint metric is composable, employing it to model cache sharing can significantly decrease the simulation time, because many execution combinations of co-run tasks can be modeled from their solo-run footprints. Xiang et al. [31] used the footprint and reuse distance to build a model for estimating the cache performance of co-run tasks on multicore CPUs and for re-grouping tasks to optimize performance; the authors used a sampling technique to measure the locality metrics. Later, Hu et al. [15] developed a model that uses only footprints to approximate the shared cache performance of a group of co-run tasks. To measure the footprints efficiently, the authors introduced the adaptive bursty footprint (ABF) sampling method.


6.2. Reuse Distance-based Cache Performance Modeling

Reuse distance (stack distance) was pioneered by Mattson et al. [32] as a metric to model storage configurations; their proposed algorithm required O(nm) time. Beyls and D'Hollander utilized the reuse distance as a locality metric to model cache performance [13]. Several authors proposed more efficient algorithms for measuring reuse distance [33, 34]. Further, sampling methods [35], approximate calculation [36], and parallel execution [23] have been utilized to further alleviate its time complexity. RDA has also been adapted for multicore CPUs in several studies [37, 24, 38]. The reuse distances acquired for a private cache of a multicore CPU are called private reuse distances (PRDs), and the reuse distances of a shared cache are called concurrent reuse distances (CRDs). In acquiring PRDs and CRDs, the thread-level interleaving should be modeled; thus, the CRDs of interacting threads are execution-dependent. Wu & Yeung [39] assumed that the CRD profiles obtained at loop-level parallelism for threads exposing similar memory behaviors are virtually independent of the interleaving. The authors utilized CRD profiles to analyze cache performance when scaling the core count in large-scale chip multiprocessors (LCMPs). They also utilized PRD profiles to estimate the average memory access time and investigated different cache configurations in multicore CPUs [40, 41]. Recently, Badamo et al. [42] used RDA to approximate performance and power consumption in LCMPs.



6.3. Cache Performance Modeling in GPUs

Hong & Kim developed an analytical model for predicting GPU performance [43]; however, the effects of cache memories were ignored in their model. Later, Sim et al. [44] extended the model proposed in [43] and included the effects of cache memories; their model requires cache miss rates to predict GPU performance. Baghsorkhi et al. [45] proposed a hierarchical memory model based on statistical sampling and trace-file analysis to estimate the performance of a given GPU memory hierarchy. Further, Huang et al. introduced GPUMech [46], which employs interval analysis to estimate GPU performance. Dao et al. [47] showed that considering cache memory effects is necessary to accurately predict GPU performance. In [48], a model was proposed to approximate the locality in CUDA kernels with regular access patterns. Most GPU performance models require cache miss rates as an input to approximate the overall GPU performance. Further, the mentioned cache models predict the cache performance, in terms of miss rate, for a single cache configuration. RDA and FPA, on the other hand, are locality metrics convertible to other metrics based on a well-defined theory [12], which can be used for predicting arbitrary cache sizes.


6.3.1. RDA for GPU Cache Performance Modeling

In several recent studies, RDA has been adapted for GPU cache performance modeling. Tang et al. [49] used the RDA algorithm for estimating cache performance in GPU kernels; however, they did not model the GPU execution behavior, and simple thread and block scheduling schemes were assumed in their work. Later, Nugteren et al. [17] developed a detailed RDA model for the GPU's L1 cache memories in which different hardware parameters were modeled. Wang and Xiao [18] applied RDA to identify memory access patterns in GPU kernels; their method relied on cycle-accurate simulations to acquire GPU memory traces. Recently, RDA has been utilized to model simultaneous kernel executions in GPUs [19]; the authors experimented with coarse-grained SM allocation schemes, based on which the GPU SMs were partitioned among the co-run kernels. Kiani and Rajabzadeh [27, 50] developed RDA-based models for GPUs in which both the L1 and L2 cache memories were modeled. To the best of the authors' knowledge, no prior study has addressed the application of FPA to GPU cache performance modeling.


7. Conclusion and Future Work


In this paper, we proposed a framework, called FGCA, to enable the application of FPA to GPU kernels. FGCA generates the access traces of a kernel running on a GPU and applies an FPA algorithm to calculate the kernel's footprints; it then uses the calculated footprints to estimate the miss rates of the GPU's L1 cache memories. Our work is the first to apply footprint analysis to GPU kernels. Moreover, we developed an analytical model for footprint calculation in O(n^2) time. The performance of our model is comparable to that of the fastest average footprint analysis model, while, in addition to the average footprint calculation, it can be used to perform window distribution analysis. The experimental results reveal that our proposed framework generates footprints in affordable times, and that the approximated cache miss rates correlate with those of hardware executions. Since the footprint is composable, any co-run kernel combination can be analyzed based on the solo-run footprints; when the number of kernels in a co-run is large, FPA is thus significantly more efficient than RDA. We employed FPA for analyzing multiple GPU kernel executions, whereby the performance of shared cache memories is estimated. Since analytical models can be used to convert solo-run to co-run footprints, the footprint outperforms the reuse distance in modeling the performance of shared cache systems in terms of efficiency. In the future, we plan to integrate sampling methods into our framework to further accelerate the footprint calculations.

References

[1] X. Mei, X. Chu, Dissecting gpu memory hierarchy through microbenchmarking, IEEE Transactions on Parallel and Distributed Systems 28 (1) (2017) 72–86.


[2] B. Jang, D. Schaa, P. Mistry, D. Kaeli, Exploiting memory access patterns to improve memory performance in data-parallel architectures, IEEE Transactions on Parallel and Distributed Systems 22 (1) (2011) 105–118. [3] Benchmarking the gpu memory at the warp level, Parallel Computing 71 (2018) 23–41. doi:10.1016/j.parco.2017.11.003.


[4] D. Patterson, The top 10 innovations in the new nvidia fermi architecture, and the top 3 next challenges, NVIDIA Whitepaper 47.


[5] M. Khairy, M. Zahran, A. G. Wassal, Efficient utilization of gpgpu cache hierarchy, in: Proceedings of the 8th Workshop on General Purpose Processing Using GPUs, GPGPU-8, 2015, pp. 36–47.

[6] C. Li, S. L. Song, H. Dai, A. Sidelnik, S. K. S. Hari, H. Zhou, Locality-driven dynamic gpu cache bypassing, in: Proceedings of the 29th ACM on International Conference on Supercomputing, ACM, 2015, pp. 67–77.


[7] Y. Liang, X. Xie, Y. Wang, G. Sun, T. Wang, Optimizing cache bypassing and warp scheduling for gpus, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems PP (99) (2017) 1–1. doi:10.1109/TCAD.2017.2764886.


[8] G. Koo, Y. Oh, W. W. Ro, M. Annavaram, Access pattern-aware cache management for improving data utilization in gpu, in: Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA ’17, 2017, pp. 307–319.


[9] S. Mittal, A survey of techniques for managing and leveraging caches in gpus, Journal of Circuits, Systems, and Computers 23 (08) (2014) 1430002.


[10] W.-m. Hwu, What is ahead for parallel computing, Journal of Parallel and Distributed Computing 74 (7) (2014) 2574–2581.


[11] C. Ding, X. Xiang, B. Bao, H. Luo, Y.-W. Luo, X.-L. Wang, Performance metrics and models for shared cache, Journal of Computer Science and Technology 29 (4) (2014) 692–712. doi:10.1007/s11390-014-1460-7.


[12] X. Xiang, C. Ding, H. Luo, B. Bao, Hotl: A higher order theory of locality, in: Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’13, 2013, pp. 343–356. [13] K. Beyls, E. D’Hollander, Reuse distance as a metric for cache behavior, in: Proceedings of the IASTED Conference on Parallel and Distributed Computing and Systems, Vol. 14, 2001, pp. 350–360.


[14] X. Xiang, B. Bao, C. Ding, Y. Gao, Linear-time modeling of program working set in shared cache, in: 2011 International Conference on Parallel Architectures and Compilation Techniques, 2011, pp. 350–360. doi:10.1109/PACT.2011.66.


[15] X. Hu, X. Wang, Y. Li, Y. Luo, C. Ding, Z. Wang, Optimal symbiosis and fair scheduling in shared cache, IEEE Transactions on Parallel and Distributed Systems 28 (4) (2017) 1134–1148. doi:10.1109/TPDS.2016.2611572.


[16] H. Luo, P. Li, C. Ding, Thread data sharing in cache: Theory and measurement, in: Proceedings of the 22Nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’17, 2017, pp. 103–115. [17] C. Nugteren, G.-J. van den Braak, H. Corporaal, H. Bal, A detailed gpu cache model based on reuse distance theory, in: High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on, IEEE, 2014, pp. 37–48.


[18] D. Wang, W. Xiao, A reuse distance based performance analysis on gpu l1 data cache, in: Performance Computing and Communications Conference (IPCCC), 2016 IEEE 35th International, IEEE, 2016, pp. 1–8.


[19] M. Kiani, A. Rajabzadeh, Skerd: Reuse distance analysis for simultaneous multiple gpu kernel executions, in: 2017 19th International Symposium on Computer Architecture and Digital Systems (CADS), 2017, pp. 1–6. doi:10.1109/CADS.2017.8310677.


[20] NVIDIA, CUDA C programming guide, version 4.0, NVIDIA Corporation.


[21] NVIDIA, Cuda best practice, [Online, accessed 12-January-2017] (2017). URL http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#axzz4ZymoX69i/ [22] Y. Zhong, X. Shen, C. Ding, Program locality analysis using reuse distance, ACM Trans. Program. Lang. Syst. 31 (6) (2009) 20:1–20:39. [23] H. Cui, Q. Yi, J. Xue, L. Wang, Y. Yang, X. Feng, A highly parallel reuse distance analysis algorithm on gpus, in: Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, IEEE, 2012, pp. 1080–1092.


[24] D. L. Schuff, M. Kulkarni, V. S. Pai, Accelerating multicore reuse distance analysis with sampling and parallelization, in: Parallel Architectures and Compilation Techniques (PACT), 2010 19th International Conference on, IEEE, 2010, pp. 53–63. [25] C. Ding, T. Chilimbi, All-window profiling of concurrent executions, in: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’08, 2008, pp. 265–266.


[26] X. Xiang, B. Bao, T. Bai, C. Ding, T. Chilimbi, All-window profiling and composable models of cache sharing, in: Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP ’11, 2011, pp. 91–102.


[27] M. Kiani, A. Rajabzadeh, Efficient cache performance modeling in gpus using reuse distance analysis, ACM Transactions on Architecture and Code Optimization, in press.


[28] NVIDIA, Maxwell tuning guide, [Online, accessed 18-February-2017] (2017). URL http://docs.nvidia.com/cuda/maxwell-tuning-guide/#axzz4ZymoX69i/


[29] L.-N. Pouchet, Polybench: The polyhedral benchmark suite.


[30] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, K. Skadron, Rodinia: A benchmark suite for heterogeneous computing, in: Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on, IEEE, 2009, pp. 44–54.


[31] X. Xiang, B. Bao, C. Ding, K. Shen, Cache conscious task regrouping on multicore processors, in: Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), CCGRID ’12, 2012, pp. 603–611. [32] R. L. Mattson, J. Gecsei, D. R. Slutz, I. L. Traiger, Evaluation techniques for storage hierarchies, IBM Systems Journal 9 (2) (1970) 78–117. doi:10.1147/sj.92.0078.


[33] B. T. Bennett, V. J. Kruskal, Lru stack processing, IBM Journal of Research and Development 19 (4) (1975) 353–357. doi:10.1147/rd.194.0353.


[34] G. Almási, C. Caşcaval, D. A. Padua, Calculating stack distances efficiently, in: Proceedings of the 2002 Workshop on Memory System Performance, MSP ’02, ACM, New York, NY, USA, 2002, pp. 37–43. [35] Y. Zhong, W. Chang, Sampling-based program locality approximation, in: Proceedings of the 7th international symposium on Memory management, ACM, 2008, pp. 91–100.


[36] C. Ding, Y. Zhong, Predicting whole-program locality through reuse distance analysis, in: Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation, PLDI ’03, ACM, New York, NY, USA, 2003, pp. 245–257.


[37] C. Ding, T. Chilimbi, A composable model for analyzing locality of multi-threaded programs, Tech. rep., Technical Report MSR-TR-2009-107, Microsoft Research (2009).


[38] Y. Jiang, E. Zhang, K. Tian, X. Shen, Is reuse distance applicable to data locality analysis on chip multiprocessors?, in: Compiler Construction, Springer, 2010, pp. 264–282.


[39] M.-J. Wu, D. Yeung, Coherent profiles: Enabling efficient reuse distance analysis of multicore scaling for loop-based parallel programs, in: Parallel Architectures and Compilation Techniques (PACT), 2011 International Conference on, IEEE, 2011, pp. 264–275.


[40] M.-J. Wu, D. Yeung, Identifying optimal multicore cache hierarchies for loop-based parallel programs via reuse distance analysis, in: Proceedings of the 2012 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness, ACM, 2012, pp. 2–11. [41] M.-J. Wu, M. Zhao, D. Yeung, Studying multicore processor scaling via reuse distance analysis, in: Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA ’13, ACM, New York, NY, USA, 2013, pp. 499–510.


[42] M. Badamo, J. Casarona, M. Zhao, D. Yeung, Identifying power-efficient multicore cache hierarchies via reuse distance analysis, ACM Transactions on Computer Systems (TOCS) 34 (1) (2016) 3.


[43] S. Hong, H. Kim, An analytical model for a gpu architecture with memory-level and thread-level parallelism awareness, in: Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA ’09, ACM, New York, NY, USA, 2009, pp. 152–163.


[44] J. Sim, A. Dasgupta, H. Kim, R. Vuduc, A performance analysis framework for identifying potential benefits in gpgpu applications, in: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’12, ACM, New York, NY, USA, 2012, pp. 11–22.


[45] S. S. Baghsorkhi, I. Gelado, M. Delahaye, W.-m. W. Hwu, Efficient performance evaluation of memory hierarchy for highly multithreaded graphics processors, in: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’12, ACM, New York, NY, USA, 2012, pp. 23–34.


[46] J.-C. Huang, J. H. Lee, H. Kim, H.-H. S. Lee, Gpumech: Gpu performance modeling technique based on interval analysis, in: Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, IEEE Computer Society, 2014, pp. 268–279.


[47] T. T. Dao, J. Kim, S. Seo, B. Egger, J. Lee, A performance model for gpus with caches, IEEE Transactions on Parallel and Distributed Systems 26 (7) (2015) 1800–1813.


[48] M. Kiani, A. Rajabzadeh, Vlag: A very fast locality approximation model for gpu kernels with regular access patterns, in: 2017 7th International Conference on Computer and Knowledge Engineering (ICCKE), 2017, pp. 260–265. doi:10.1109/ICCKE.2017.8167887. [49] T. Tang, X. Yang, Y. Lin, Cache miss analysis for gpu programs based on stack distance profile, in: Distributed Computing Systems (ICDCS), 2011 31st International Conference on, IEEE, 2011, pp. 623–634. [50] M. Kiani, A. Rajabzadeh, Rdgs: A reuse distance-based approach to gpu cache performance analysis, Computing and Informatics, in press.