Parallel Computing 27 (2001) 1173–1195
www.elsevier.com/locate/parco
Adaptive software prefetching in scalable multiprocessors using cache information

Daeyeon Park a,*, Byeong Hag Seong a, Rafael H. Saavedra b

a Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, 373-1 Kusong-dong Yusong-gu, Taejon 305-701, South Korea
b Computer Science Department, University of Southern California, Los Angeles, CA 90089-0781, USA
Received 7 September 1998; received in revised form 24 February 2000; accepted 12 January 2001
Abstract

Scalable multiprocessors present special challenges to static software prefetching because on these systems the memory access latency is not completely determined at compile time. Furthermore, dynamic software prefetching cannot do much better because individual nodes on large-scale multiprocessors tend to experience different remote memory delays over time. A fixed prefetch distance, even when computed at run-time, cannot perform well for the whole duration of a software pipeline. Here we present an adaptive scheme for software prefetching that makes it possible for nodes to dynamically change not only the amount of prefetching, but the prefetch distance as well. We show how simple performance data collected by hardware monitors can allow programs to observe, evaluate and change their prefetching policies. Our results show that adaptive prefetching (APF) is capable of improving performance over static and dynamic prefetching by 10%–60%. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Multiprocessor; Distributed shared memory; Software prefetching; Adaptive prefetching; Adaptive execution
1. Introduction

There has been a great deal of research on bridging the ever-widening gap between processor and memory speeds. One of the most promising approaches is prefetching.
* Corresponding author. Fax: +82-42-869-80-69. E-mail address: [email protected] (D. Park).
0167-8191/01/$ - see front matter © 2001 Elsevier Science B.V. All rights reserved. PII: S0167-8191(01)00085-0
Prefetching hides memory latency by fetching data into a higher level of the memory hierarchy before it is actually used. Prefetching research has proceeded in two directions: hardware-based approaches [2,4] and software-based approaches [8,9]. Hardware-based approaches use extra hardware mechanisms to dynamically predict future data usage. Software-based approaches insert explicit prefetch instructions based on program analysis during compilation. Although software prefetching suffers from run-time overhead due to the inserted prefetch code, it is often preferred to hardware approaches because it requires only a small amount of hardware support and can cover a wide class of reference patterns [9].

The effectiveness of software-based prefetching techniques is often limited by the ability of the compiler to predict the optimal values of the performance parameters affecting the run-time behavior of a program, such as the prefetch distance and the prefetch degree. Predicting a good value at compile time is inherently difficult for several reasons. First, some critical information may not be available at compile time; moreover, this information usually depends on the input data. Second, the dynamic behavior of the program makes compile-time prediction difficult. Third, some aspects of parallel machines, such as network contention, synchronization, cache coherence and communication delays, are almost impossible for the compiler to predict at compile time. Finally, the optimal value of a parameter tends to vary as the program executes, and this variance can be quite large on some parallel machines.

The optimal value of the prefetch distance varies depending on dynamic behavior such as the actual remote memory latency and the amount of prefetch cancellation. Because the actual latency experienced by prefetches is not constant, but depends on the run-time behavior of prefetching itself and of other machine components, such as the caches, the network interconnect, and the memory modules, it is not always possible to minimize CPU stall time by using a fixed compile-time prefetch distance. The other important parameter affecting the optimal prefetch distance is prefetch cancellation. Prefetch cancellation has a significant effect on performance because the processor stalls for the whole duration of a miss when a prefetch is cancelled. Traditional static prefetching (where the prefetch distance is fixed) suffers from two problems. First, even if there is some amount of prefetch cancellation, the same distance is used during the whole execution. Second, the fixed prefetch distance may not be optimal because it is difficult to predict at compile time the actual value of the memory latency and the amount of cache interference caused by prefetching.

We describe an adaptive algorithm for software prefetching that uses simple performance data collected by a hardware monitor to dynamically adjust the prefetch distance. We present simulation results which show that our adaptive algorithm is capable of hiding significantly more latency than other prefetching algorithms in which the prefetch distance is constant. In some cases the reduction of stall time can be as large as 60%. Furthermore, the only hardware support required by our adaptive scheme is three counters: one measuring the number of prefetch requests, another counting the number of late prefetches (the ones arriving after the processor has requested the data), and the third measuring the number of prefetches killed as a result of cache conflicts.
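To make the hardware requirement concrete, the following minimal sketch shows the kind of monitor interface the scheme assumes; the struct and function names are ours and do not come from any particular machine.

    /* Hypothetical view of the three event counters the adaptive scheme relies on
       (see Section 3.1); reading them is assumed to be a cheap user-level access. */
    struct pf_monitor {
        unsigned long c_prefetch;   /* prefetch requests issued                        */
        unsigned long c_late;       /* prefetches arriving after the data is requested */
        unsigned long c_cancel;     /* prefetches killed by cache conflicts            */
    };

    /* Copy and clear the counters at the start of each measurement interval. */
    static void pf_monitor_sample(volatile struct pf_monitor *hw, struct pf_monitor *out)
    {
        out->c_prefetch = hw->c_prefetch;  hw->c_prefetch = 0;
        out->c_late     = hw->c_late;      hw->c_late     = 0;
        out->c_cancel   = hw->c_cancel;    hw->c_cancel   = 0;
    }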
The simplicity of this scheme underlines the
potential performance benefits that can be obtained by adaptively exploiting run-time program and machine performance information.

Although here we only explicitly compare adaptive prefetching (APF) against static prefetching (SPF), our results also apply to dynamic prefetching. The reason is that dynamic prefetching, where the prefetch distance is set at run-time but kept constant throughout the execution of the software pipeline, cannot perform better than an optimal static scheme, which is in turn outperformed by adaptive prefetching. Our results show that adaptive prefetching consistently outperforms static prefetching over all prefetch distances. In other words, the main advantage of adaptive prefetching comes from its ability to periodically change the value of the prefetch distance and in this way minimize cache pollution and maximize the coverage of latency, something that neither static nor dynamic prefetching can do.

The rest of the paper is organized as follows. In Section 2, we discuss several software prefetching approaches and their shortcomings. Section 3 deals with the adaptive prefetching algorithm and the required hardware. The simulation framework used to study the performance of adaptive prefetching is presented in Section 4, while Section 5 presents experimental results. Section 6 describes related work. The conclusions are given in Section 7.

2. Software prefetching

In this section, we discuss how software prefetching can exhibit adaptive behavior. We start by presenting a very simple dynamic scheme in which one of two versions, each prefetching a different amount of data, is executed depending on the value of a run-time argument. We then introduce an adaptive scheme to adjust the value of the prefetch distance based on information collected using a hardware monitor.

2.1. Static and dynamic prefetching

Consider the code excerpt given in Fig. 1(a). This is a simple kernel in which the sum of a sliding window of elements of array b is accumulated on array a. In order to compute the correct prefetch predicates, i.e., the subset of iterations that require inserting explicit data prefetches in the code, the compiler needs to know the value of parameter n. It is clear that if n is small enough that the data accessed inside the innermost loop by both references is guaranteed to fit in the cache without cross-interference, then all the prefetches for a and almost all for b have to be issued during the first iteration of the outermost loop, as shown in Fig. 1(b). As the outermost loop advances, a new element of b is read which was not referenced in previous iterations, so an explicit prefetch has to be issued to cover its latency. If this is not the case, the compiler has to assume that the data loaded into the cache at the beginning of the innermost loop will be displaced by the data loaded at the end, and vice versa. Therefore, data prefetches have to be inserted on all iterations of the outermost loop, as Fig. 1(c) shows.
Fig. 1. Software pipelining schemes based on the run-time value of argument n. The Prefetch( ) function prefetches one cache line, and subfigures (b) and (c) assume a 32-byte cache line.
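Since Fig. 1 itself is not reproduced here, the following sketch suggests what the kernel of Fig. 1(a) and the conservative pipeline of Fig. 1(c) might look like; the exact loop structure, the Prefetch() primitive and the prologue handling are our assumptions based on the description above (32-byte lines holding four 8-byte elements).

    #define PF_DIST 16                      /* prefetch distance in iterations (illustrative) */
    extern void Prefetch(void *addr);       /* fetches the cache line containing addr */

    /* Fig. 1(a): the sum of a sliding window of b is accumulated on a. */
    void kernel(double *a, double *b, int n, int m)
    {
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                a[j] += b[i + j];
    }

    /* Fig. 1(c): prefetches inserted on every iteration of the outermost loop,
       because the compiler must assume a and b displace each other in the cache. */
    void kernel_spf(double *a, double *b, int n, int m)
    {
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < PF_DIST && j < n; j += 4) {   /* prologue: one line covers 4 elements */
                Prefetch(&a[j]);
                Prefetch(&b[i + j]);
            }
            for (int j = 0; j < n; j++) {
                if ((j % 4) == 0 && j + PF_DIST < n) {        /* steady state: fetch PF_DIST iterations ahead */
                    Prefetch(&a[j + PF_DIST]);
                    Prefetch(&b[i + j + PF_DIST]);
                }
                a[j] += b[i + j];
            }
        }
    }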
In a traditional compiler, where all decisions are based on static analysis, the compiler could apply inter-procedural data-flow analysis in an attempt to determine the value of argument n. Several situations can make the compiler fail: (1) the function is called from an external module which is compiled separately; (2) the value of n is determined at run-time (either by reading it or by explicitly computing it); or (3) the function is called at different places, each time with different arguments. If any of these conditions is true, the compiler cannot decide which version is best to execute in all possible situations.
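As the next paragraph explains, a dynamic scheme can sidestep this by emitting both versions and deciding between them at run time. A minimal sketch, assuming the two pipelines of Fig. 1(b) and (c) are generated as separate functions and that the compiler supplies the cache-capacity test on n (the names and the test are ours):

    #define CACHE_BYTES 4096                                 /* illustrative cache size */
    #define FITS_IN_CACHE(n) ((2UL * (n) * sizeof(double)) <= CACHE_BYTES)

    extern void kernel_spf_prologue_only(double *a, double *b, int n, int m);  /* Fig. 1(b) */
    extern void kernel_spf(double *a, double *b, int n, int m);                /* Fig. 1(c) */

    /* Run-time agent: test n once and dispatch to the better-matching version. */
    void kernel_dynamic(double *a, double *b, int n, int m)
    {
        if (FITS_IN_CACHE(n))
            kernel_spf_prologue_only(a, b, n, m);   /* prefetch mostly in the first outer iteration */
        else
            kernel_spf(a, b, n, m);                 /* prefetch on every outer iteration */
    }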
These kinds of problems can be solved by dynamic prefetching. In a dynamic approach, the compiler can generate both versions plus extra code representing a software agent whose responsibility is to test n at the beginning of the function and then select the version which better matches the actual argument. The important observation here is that much less analysis is required to find out the run-time conditions needed to determine the best version to execute than inter-procedural data-flow analysis requires. Moreover, an agent can always make sure that the most efficient version is executed, independently of how many times the function is called and of the actual values of the arguments. However, dynamic approaches cannot do better than static approaches in solving problems such as the one presented in the next subsection.

2.2. The effects of prefetch cancellation

Another important problem for the compiler is to decide which value to use for the prefetch distance (variable PF_DIST in Fig. 1(b) and (c)). Finding this ``optimal value'' requires predicting the run-time magnitude of the memory latency, which is difficult on today's scalable multiprocessor architectures. To illustrate the inherent difficulties associated with predicting the best value for the prefetch distance, and to motivate the need for an adaptive scheme, let us focus on how prefetch cancellation affects the determination of the optimal prefetch distance.

Prefetch cancellation occurs when a memory operation collides in the cache with an in-flight prefetch or with data loaded by previous prefetches. The amount of prefetch cancellation is affected by the number of instructions executed between a prefetch and its associated first use, and by the inherent mapping conflicts of the data references. For the specific code excerpt given in Fig. 1, it is the particular cache positions of references a[0] and b[i] that determine whether or not the prefetch streams of arrays a and b will cancel each other. If the distance between these two references, measured in cache lines, is larger than the prefetch distance, then no prefetch cancellation will occur. It is clear that as the outermost loop advances the cache mapping of b[i] moves relative to a[0], so unless m is very small, the prefetch streams will eventually interfere with each other.

Fig. 2(a) and (b) illustrate these two scenarios. The assumptions made in the figure are: (1) the cache is direct-mapped and of an unspecified size; (2) cache lines can hold four consecutive elements; and (3) the base memory latency (no network, bus, or directory contention) requires a prefetch distance of sixteen iterations (four cache lines per reference). In Fig. 2(a) we first assume that the cache mapping distance between a[0] and b[i] is much larger than the prefetch distance. Hence, none of the prefetches issued inside the innermost loop are cancelled, and consequently most of the latency can be hidden from the processor. In contrast, Fig. 2(b) shows what happens when the mapping distance is less than the prefetch distance. On a direct-mapped cache it is not possible to satisfy two prefetches mapping to the same line. Therefore, one of the two prefetches is always cancelled, which in turn causes the processor to suffer stall time when it tries to read the missing data.

It is clear that better performance can be obtained by reducing the prefetch distance, even when doing this makes it impossible to hide all the latency. As Fig. 2(c)
Fig. 2. The effect of cache mapping conflicts on the effectiveness of prefetching.
shows, if instead the prefetch distance is reduced to only eight iterations (prefetching two cache lines ahead of time), then all the prefetch cancellation is eliminated at the cost of incurring a small amount of stall time. Two observations can be made here. First, only the first prefetch in each group of four (two for each array) suffers from uncovered latency. This is because the other prefetches in the group ``take
advantage'' of the stall time incurred by the leading prefetch (the first one in each group) to complete. Here, for the sake of argument, we assume that prefetches arrive in the same order in which they were issued. Non-leading references can also induce a small amount of stall time when for some reason their relative distance from the leader increases. Second, it should be clear that any adaptive scheme that changes the value of the prefetch distance at run-time so that no prefetch cancellation occurs, while attempting at the same time to maximize the amount of memory latency covered, should outperform any scheme that uses a fixed prefetch distance.

2.3. Adapting the prefetch distance

In the previous section, we showed that when the cache positions of elements a[0] and b[i] are close enough to each other to fall within their respective windows of interference, the innermost loop suffers a significant amount of prefetch cancellation. From this we can conclude that a simple adaptive scheme in which the prefetch distance is computed by the software agent at the beginning of the loop, using the mappings of a[0] and b[i], can do better than any static scheme. This approach, however, works well only on caches that are accessed using virtual addresses, or when the compiler (through the agent) can obtain the actual physical mappings. It is also difficult to apply this approach to loop nests which contain several references that are traversed using different patterns. It is very difficult, if not impossible, to characterize all the possible patterns that can result in prefetch cancellation when computing the optimal prefetch distance. This means that a general and effective adaptive scheme for the prefetch distance should not rely on always knowing the exact cache mappings of the data.

We now present a general adaptive scheme for software pipelining that minimizes the amount of stall time by dynamically changing the value of the prefetch distance. The main idea is to change the value of the prefetch distance when performance information indicates that the local ``optimal'' prefetch distance (as observed from each thread) changes. This can be accomplished by checking the relevant event counters at a certain frequency and using them to quantify how much, and in which direction, the prefetch distance should be changed. It is clear that any adaptive scheme has to react quickly when prefetch cancellation occurs or when the magnitude of the memory latency suddenly changes. This implies that the agent may need to modify the prefetch distance several times during the software pipeline. To accomplish this we split the loop containing the software pipeline body (the prologue and epilogue remain the same) into a two-loop nest, in a similar way to other loop optimizations such as tiling (blocking). Fig. 3(c) shows how this could be done on the software pipeline given in Fig. 1(c). In order to highlight only the important aspects of the algorithm, Fig. 3 assumes a cache line size of one element. The number of iterations in the controlling loop (the one for index j1) determines how frequently the prefetch distance is adjusted. This controlling loop evaluates the effectiveness of the current prefetch distance, computes the new distance for the next sub-pipeline interval, and increases or decreases the number of outstanding prefetches in order to match the new distance.
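A sketch of the loop splitting just described, in the spirit of Fig. 3(c) (one-element cache lines, as the figure assumes); CHECK_PERIOD, adjust_prefetch_distance() and the clamping of prefetch addresses to the loop bound are our naming and simplifications.

    #define CHECK_PERIOD 64        /* inner (j2) iterations per distance adjustment (illustrative) */

    extern void Prefetch(void *addr);
    extern int  adjust_prefetch_distance(int pf_dist);   /* agent step built on the function of Section 3.2 */

    /* One outer iteration of the Fig. 1(c) kernel restructured as a two-loop nest:
       the controlling loop (j1) re-evaluates the prefetch distance between
       sub-pipelines of CHECK_PERIOD iterations. Matching the number of outstanding
       prefetches to a new, larger distance is omitted for brevity. */
    void inner_pipeline_apf(double *a, double *b, int i, int n, int pf_dist)
    {
        for (int j = 0; j < pf_dist && j < n; j++) {      /* prologue */
            Prefetch(&a[j]);
            Prefetch(&b[i + j]);
        }
        for (int j1 = 0; j1 < n; j1 += CHECK_PERIOD) {    /* controlling loop */
            int end = (j1 + CHECK_PERIOD < n) ? j1 + CHECK_PERIOD : n;
            for (int j2 = j1; j2 < end; j2++) {           /* sub-pipeline body */
                if (j2 + pf_dist < n) {
                    Prefetch(&a[j2 + pf_dist]);
                    Prefetch(&b[i + j2 + pf_dist]);
                }
                a[j2] += b[i + j2];
            }
            pf_dist = adjust_prefetch_distance(pf_dist);  /* distance for the next sub-pipeline */
        }
    }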
Fig. 3. Static and adaptive software pipelining.
3. Adaptive prefetching

3.1. Cache information

For adaptive prefetch distance calculation, a software agent monitors cache information related to prefetching. To obtain this information the cache is modified
in a small way to maintain information about in-flight prefetches. This is done by allocating a cache frame to a prefetch at issue time and setting a status bit in the frame itself to indicate that the prefetch is in-flight (i.e., in a transit state). If a new prefetch with a different address maps to the same cache frame, the second prefetch is cancelled, and if a load maps to the same frame, the original prefetch is cancelled. The results of a prefetch command can be classified as follows:
· Useful prefetch: the result of the prefetch command arrives in time and is used.
· Late prefetch: the result of the prefetch command arrives late, and the load or store command has to block for the data.
· Cancelled prefetch: the prefetch command is cancelled due to a conflicting load/store or prefetch.
· Unused prefetch: the result of the prefetch command arrives, but the data is not used because the cache line is replaced or invalidated before use.

The objective of the adaptive prefetching algorithm is to reduce the non-useful prefetches (that is, late prefetches, cancelled prefetches, and unused prefetches). To reduce cancelled and unused prefetches, the prefetch distance should be reduced; for late prefetches it has to be increased. However, our simple hardware makes it difficult to count the numbers of useful and unused prefetches. Because our hardware maintains information only about in-flight prefetches, events which occur after the arrival of prefetched data cannot be monitored; we can monitor only late and cancelled prefetches. We use the following three counters to monitor the hardware state:
· C_prefetch: counts the number of prefetch commands issued.
· C_cancel: counts the number of cancelled prefetches.
· C_late: counts the number of late prefetches.

In the next section, we show how the fraction of delayed prefetches (frac_late = C_late / C_prefetch) and the fraction of cancelled prefetches (frac_cancel = C_cancel / C_prefetch) are used to adjust the prefetch distance.

3.2. A simple adaptive algorithm for the prefetch distance

In Section 2.2, we illustrated that decreasing the prefetch distance to avoid prefetch cancellation is more effective than increasing it to cover more memory latency. Therefore, any effective adaptive algorithm for the prefetch distance should give priority to the former over the latter, as in the following function:

    int Delta_Pref_Distance(double frac_cancel, double frac_late, int PF_DIST)
    {
        if (frac_cancel > a_cancel)
            return (-b_cancel * PF_DIST);   /* shrink the distance multiplicatively */
        if (frac_late > a_late)
            return (b_late);                /* grow the distance by a small constant */
        return (0);
    }
This function computes the change to the prefetch distance from the current cache information. The particular values of a_cancel, a_late, b_cancel, and b_late are determined statically at compile time. The exponential decrease policy for the prefetch distance highlights the importance of eliminating prefetch cancellation, even if the amount of covered latency is reduced.
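A sketch of the adjustment step the software agent might perform at each check point, combining the counters of Section 3.1 with Delta_Pref_Distance(); the monitor-access routine and the clamping bounds are our assumptions.

    extern void read_and_reset_counters(unsigned long *c_prefetch,
                                        unsigned long *c_cancel,
                                        unsigned long *c_late);   /* monitor access */
    extern int  Delta_Pref_Distance(double frac_cancel, double frac_late, int PF_DIST);

    #define PF_DIST_MIN 1          /* illustrative clamping bounds */
    #define PF_DIST_MAX 1024

    /* Compute the fractions from the counters, apply Delta_Pref_Distance(),
       and clamp the result to a sane range. */
    int adjust_prefetch_distance(int pf_dist)
    {
        unsigned long c_prefetch, c_cancel, c_late;
        read_and_reset_counters(&c_prefetch, &c_cancel, &c_late);
        if (c_prefetch == 0)
            return pf_dist;                 /* no prefetches issued, nothing to learn */

        double frac_cancel = (double)c_cancel / (double)c_prefetch;
        double frac_late   = (double)c_late   / (double)c_prefetch;

        pf_dist += Delta_Pref_Distance(frac_cancel, frac_late, pf_dist);
        if (pf_dist < PF_DIST_MIN) pf_dist = PF_DIST_MIN;
        if (pf_dist > PF_DIST_MAX) pf_dist = PF_DIST_MAX;
        return pf_dist;
    }

For instance, with the constants used in the experiments of Section 5 (a_cancel = a_late = 0.1, b_cancel = 0.5, b_late = 1) and a current distance of 200, a hypothetical interval with frac_cancel = 0.15 would halve the distance to 100, while an interval where only frac_late exceeds its threshold would raise it by a single iteration.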
4. Simulation methodology

This section presents the simulation framework used to study the performance of APF. In Section 4.1, we present the architectural assumptions and simulation environment. In Section 4.2, we describe the benchmark programs.

4.1. Architectural assumptions

We simulate 16- and 64-node CC-NUMA multiprocessors using an execution-driven simulator called Trojan [10,11]. Trojan, which is an extended version of MIT Proteus [3], allows detailed modeling of the various hardware components. All components except the network are explicitly simulated; the amount of network contention suffered by messages is computed analytically using the model proposed by Agarwal [1]. Each node contains 4 Mbytes of main memory and a 4 Kbyte cache. The cache is direct-mapped and write-back, with a line size of 16 bytes. The release memory consistency model is enforced by hardware. Cache coherence for shared memory is maintained using a distributed directory- and invalidation-based protocol. Non-memory machine instructions are assumed to take one cycle. Prefetch instructions, which were inserted manually, take 2 cycles to execute. More detailed architectural parameters are shown in Table 1.

Table 1
Baseline architectural parameters

                      Parameter       Value
  Main memory         Size            4 MB per node
                      Latency         35 cycles
  Cache               Size            4 KB per node
                      Latency         1 cycle
                      Line size       16 bytes
                      Associativity   Direct mapped
  Cache consistency   Model           Release consistency
                      Directory       Full-map directory
  Network             Structure       Mesh
                      Routing         Wormhole routing
                      Links           Bi-directional, 16 bit
                      Switch delay    2 cycles
                      Wire delay      2 cycles

With these parameters, the average read latencies
without contention are: 1 cycle for a cache hit, 120 cycles for a 2-hop global access, and 140 cycles for a 3-hop global memory access.

4.2. Applications

In order to understand the relative performance of adaptive prefetching, we use two different types of applications. First, we use a synthetic benchmark to investigate in detail the effectiveness and limitations of APF; in these experiments, in particular, we are interested in obtaining good approximations for the parameters b_cancel, b_late, a_cancel and a_late. In the other experiments we use real applications, some of which are taken from SPLASH-2 [12], to quantify the amount of improvement over the static approach that our algorithm offers. Table 2 presents the programs and their input data sets. Each program exhibits a different class of computation, in the sense that each program suffers from a different type of cache interference: Jacobi suffers from self-interference, MMul and LU occasionally suffer from cross-interference between pairs of arrays, and Ocean suffers from cross-interference among several arrays. To avoid pathological cache conflicts, we manually changed the alignment of some matrices.

Table 2
Application characteristics

  Program   Description                               Input data set
  Jacobi    Successive over-relaxation                522 × 1000
  LU        LU decomposition                          260 × 260
  MMul      Matrix multiplication                     260 × 260
  Ocean     Ocean simulation with multi-grid solver   258 × 258
5. Experimental results and discussion

In this section, we present experimental results obtained using the two different types of experiments discussed in Section 4. Section 5.1 presents our experimental results for a synthetic benchmark. In Section 5.2, we give experimental results for real applications with the default machine parameters. In Section 5.3, we discuss experimental results with architectural variations.

5.1. Controlled experiments

The synthetic benchmark partitions the machine nodes into two disjoint sets: the control group and the interference group. Nodes in the control group (control nodes) execute a loop nest that generates remote requests at a constant rate and uses software pipelining, either with a static or an adaptive prefetch distance, to hide as much latency as possible from the processors. In contrast, nodes in the interference group (interference nodes) work together to generate a certain amount of bisection network
traffic by controlling the rate at which they generate remote requests. The traffic is kept constant for some interval of time and then is changed to a new value, which is unknown to the control nodes, according to a predefined script running on the interference nodes. The key observation here is that by changing the amount of traffic in the interconnect, the interference nodes modify the latency of the remote requests issued by the control nodes.

The results shown in this subsection are for a single control node that is placed under the control of various static and adaptive policies; the number of interference nodes is 63. The static policies (denoted in the figures as SP-i, where i is the prefetch distance) use a constant value ranging from 60 to 1000 cycles for the prefetch distance. Our adaptive policy (APF) starts with a prefetch distance of 200 and adjusts it using the algorithm given in the last section. In particular, APF reduces the prefetch distance to one half of its current value (b_cancel = 0.50) when a significant number of prefetches are cancelled. At the same time, if prefetches arrive late and there is little prefetch cancellation, the prefetch distance is increased by one iteration of the innermost loop (b_late = 1). The values of a_cancel and a_late determine how sensitive the adaptive scheme is. Throughout the experiments, 0.1 is used for both a_cancel and a_late; these values showed reasonable performance, and small deviations from them have little effect on the overall performance.

Fig. 4 shows the remote latency pattern generated by the interference nodes as a function of the channel utilization. As explained above, the interference nodes cooperate among themselves to maintain a constant channel utilization for some amount of time by controlling their remote request rate. As the figure shows, varying the channel utilization within the range 0.18–0.90 causes the remote latency to vary from 100 to 425 cycles.

Fig. 5 shows the average stall time that the static and adaptive prefetching schemes experience as a function of the interference pattern. The results clearly show that no static prefetch distance can perform well over all possible latency values.
Fig. 4. Remote latency pattern induced by interference nodes.
Fig. 5. Average stall time per prefetch as a function of prefetching policy.
The adaptive scheme, however, is capable of adjusting to the changes so as to keep the stall time close to its minimum. APF is capable of detecting when attempts to increase the prefetch distance result in higher levels of prefetch cancellation; the correct action is then to reduce the distance until no significant amount of cancellation occurs.

The important conclusion to draw from Fig. 5 is not that the adaptive scheme is better than the static ones simply because the former is able to adjust to large variations of the latency. In reality, the latency suffered by a program can be either small or large, but in most cases it tends to stay within a narrow range. The real problem is that compilers cannot predict sufficiently well the actual run-time value of the latency. Furthermore, using prefetching only makes things worse. To illustrate this last point, consider using prefetching on a program that spends 80% of its time waiting for remote requests to complete. Assume that under these conditions the average channel utilization and remote latency are 0.30 and 110 cycles, respectively. Now, everything else being equal, if prefetching is 50% effective in reducing stall time, then the channel utilization and remote latency will increase to 0.50 and 135 cycles. But if prefetching is 80% effective, the channel utilization and remote latency will jump to 0.85 and 300 cycles. It is this uncertainty in the effectiveness of prefetching that makes it impossible for the compiler to select the correct prefetch distance in every case. Adaptive prefetching, on the other hand, is guaranteed to produce the best performance because it adjusts itself to the particular run-time conditions.

Fig. 6 gives results on the normalized total execution time. Here, the stall time has been broken into two components: delay caused by the residual latency (pf-lissue: stall time induced by late prefetches) and delay caused by cache misses resulting from cancelled prefetches (pf-cancel). In addition, the adaptive schemes include the extra overhead involved in adjusting the prefetch distance, which consists of two parts: reading the counters in the performance monitor, and increasing or decreasing the prefetch distance. Here we assume a tightly coupled hardware monitor whose counters can be read directly from a user thread without the intervention of the operating system. The results show that by using an adaptive policy we can minimize
Fig. 6. Effect of static and adaptive policies on the execution time.
both components of the stall time. Furthermore, the overhead incurred in running the adaptive algorithm does not overly degrade performance. The static schemes, on the other hand, can eliminate either the residual delay or the cancelled prefetches, but not both.

The effectiveness of a fully adaptive algorithm is strongly dependent on the parameters b_cancel and b_late. Intuitively, reducing the occurrence of cancelled prefetches is more important than trying to cover all the latency. This was illustrated in Section 2.2, where we showed that when prefetches suffer from residual latency, the overlapping of prefetches amortizes the corresponding stall time of the leading references amongst all outstanding prefetches. This is why the algorithm we presented in Section 3.2 increases the prefetch distance conservatively by a small constant, but decreases it rapidly when a significant amount of prefetch cancellation occurs. We ran experiments varying both b_cancel and b_late, and the results are shown in Fig. 7. As expected, the best results were obtained when the prefetch distance is increased as little as possible (by only one iteration) and when it is decreased to 50% of its current value.

5.2. Effectiveness of adaptive prefetching in complete applications

In this section, we evaluate the effectiveness of APF on real benchmarks by comparing its performance with that of SPF with various prefetch distances. All the results were collected by simulating a 16-node CC-NUMA multiprocessor. We start by presenting in Table 3 some relevant cache and network statistics on the behavior of prefetching. The table shows miss rates, prefetch cancellation rates, average miss penalties, and speedups; the latter are reported relative to policy SP-200. The miss rates account for all load references, independently of whether or not a prefetch was issued in an attempt to eliminate a miss. Prefetch cancellation (pf-cancel) corresponds to the fraction of all prefetches issued that are ineffective in covering latency due to cache conflicts.
Fig. 7. Effect of b_cancel and b_late.
Table 3
Performance comparison between static and adaptive prefetching

  Benchmark   Prefetch    Miss rate   Pf-cancel   Average miss       Speedup
              distance    (%)         rate (%)    penalty (cycles)
  Jacobi      SP-100       8.3        17.3        219                1.14
              SP-200      11.9        66.0        179                1.00
              SP-300      11.9        66.0        165                1.08
              APF          6.2         5.8        192                1.63
  LU          SP-100      10.8         2.1         67                0.95
              SP-200       6.7         3.5         96                1.00
              SP-300       5.1         5.1        131                0.99
              APF          3.6         1.3        131                1.16
  MMul        SP-100       6.3         2.8        132                0.95
              SP-200       3.6         4.5        211                1.00
              SP-300       3.0         5.6        256                1.00
              APF          2.8         2.6        229                1.09
  Ocean       SP-100      13.6        15.5        154                1.00
              SP-200      12.3        16.7        169                1.00
              SP-300      12.7        19.2        172                0.97
              APF         11.3         7.0        146                1.18
The table clearly shows that adaptive prefetching reduces not only the cache miss rates, but also the prefetch cancellation rates. Consequently, adaptive prefetching manages to improve performance relative to the static policies by from 10% (MMul) to more than 60% (Jacobi).

Measuring the effectiveness of APF only by looking at overall speedups is somewhat misleading, because the maximum amount of improvement that
prefetching can produce depends on many factors not affected by whether prefetching is static or adaptive. For example, if the original program does not suffer from high miss ratios, or if not enough prefetches are inserted to cover misses, adaptive prefetching cannot overcome this deficiency. Fig. 8 tries to factor this out by focusing on the fraction of the stall time not covered by SPF that can be successfully eliminated by using an adaptive scheme instead. The stall time induced by prefetching has two sources: (1) cache misses occurring as a result of cancelled prefetches (pf-cancel) and (2) prefetches that complete but arrive late (pf-lissue). In Fig. 8 we see that, relative to SP-200, APF can eliminate between 20% and 45% of the stall time suffered by static prefetching. Moreover, adaptive prefetching reduces not only the stall time coming from cancelled prefetches, but also that induced by late prefetches.

Fig. 9 shows the relative execution times for all the prefetching schemes, normalized with respect to no prefetching (no-pf). The normalized execution time is broken down as follows: (1) busy time spent executing instructions (busy), (2) stall time, and (3) the overheads of prefetching and adaptive prefetching. The stall time is further decomposed into the following components: (2.1) stall time caused by not issuing a prefetch (no-pf), (2.2) time stalled due to prefetch cancellations (pf-cancel), (2.3) stall time because of late prefetch issue (pf-lissue), (2.4) stall time for synchronization operations such as barriers and locks (sync), and (2.5) stall time due to the output buffer being full (obuffer-full).

As shown in the figures, compared with SP-200, the speedup of adaptive prefetching ranges from 9% (MMul) to 63% (Jacobi). Table 3 indicates that this is accomplished by reducing both the cache miss rate and the prefetch cancellation rate. Moreover, although in most cases static prefetching can hide the long memory latency, it induces significant stall time resulting from prefetch cancellation. Furthermore, as the figure shows, the effectiveness of prefetching depends on the prefetch distance, and this dependence changes from program to program.
Fig. 8. Overall performance of adaptive prefetching.
Fig. 9. The effectiveness of adaptive prefetching: (a) Jacobi, (b) LU, (c) MMul, (d) Ocean.
For example, the best static prefetch distance for Jacobi is 100 cycles, whereas for LU and MMul it is 200 cycles.

Having analyzed the benefits of adaptive prefetching, we now focus on its costs. The overhead of adaptive prefetching can be decomposed into the following: (1) the overhead of reading the performance monitor registers; (2) the overhead of adjusting the prefetch distance; and (3) the overhead of executing the software agent. The actual cost of adaptive prefetching depends primarily on how the software agent and the performance monitor interact (i.e., polling, interrupt, or active message) and on the frequency of interaction. There is a trade-off between the measuring period and the overhead: a longer (alternatively, shorter) period means less (more) adaptation with less (more) overhead. In our experiments, we used a static polling scheme for checking the performance monitor, with an overhead of 30 cycles. Our simulation results show that this scheme incurs an overhead of no more than 5% of the total execution time. This is quite low considering that adaptivity improves performance over the static schemes by a much larger amount (10%–40%).
The good news is that this overhead can be completely or partially overlapped with the stall time components. If the processor is blocked as a result of a cache miss or a synchronization operation, it simply stalls, because we use static polling. Even so, the stall time due to late prefetch issue can help reduce the overhead to some extent. For example, if the processor has to stall 20 cycles because of a late prefetch issue, the overhead is overlapped with the stall time and only 10 cycles are added to the adaptive prefetching overhead, since after 30 cycles the processor no longer stalls. The stall time caused by a prefetch cancellation does not help in reducing the overhead, because only after finishing the extra work can the processor issue the cancelled memory access. In our experiments, around 30% of the overhead is overlapped with pf-lissue.

In summary, our results show: (1) if the prefetch distance is too large (alternatively, too short), the number of cancelled prefetches increases (decreases), while the number of late prefetches decreases (increases); (2) the amount of prefetch cancellation has a significant effect on performance; (3) the best static prefetch distance is application dependent and is affected by several machine parameters; (4) the overhead of adaptive prefetching is small; and (5) adaptive prefetching works well when it is difficult to predict the best static prefetch distance.

5.3. Effects of architectural variations

In this subsection, we report on the impact of several architectural variations on the performance of adaptive prefetching. We focus on the effects of changing the cache size, the machine size, the network latency, and the memory consistency model. Here we compare the performance of adaptive prefetching relative to that of static prefetching directly (i.e., the execution time of adaptive prefetching is normalized to that of static prefetching) to see more clearly how much performance improvement is achieved by adaptive prefetching. For SPF, because the prefetch distance 200 is best in most cases, as shown in the previous subsection, this distance is used, except for the variations of network latency and machine size where 300 is used to cover the increased latency.

5.3.1. Cache size

In our baseline machine the cache size is set to 4 Kbytes. In order to test whether adaptive prefetching continues to work well with larger caches, we ran additional experiments in which the cache size is increased to 16 Kbytes. For the larger cache size, we use the same data sets because we want to see how performance changes with larger caches. The results of these experiments are presented in Fig. 10. They show that for Jacobi, APF reduces the execution time by 39% on a 4 Kbyte cache and that this number drops to 26% on a 16 Kbyte cache. This reduction is the result of a corresponding reduction in the amount of self-interference in the program, which in turn reduces the number of cancelled prefetches for both static and adaptive prefetching. Because static prefetching suffers more from self-interference, it benefits more from a larger cache. On Ocean, the relative performance does not change significantly when the cache size increases.
Fig. 10. Cache size variation.
The reason is that the pf-cancel component caused by cross-interference is not sensitive to the cache size, because the relative positions of the arrays in memory remain the same.

5.3.2. Machine size

Increasing the machine size from 16 to 64 nodes has a direct impact on latency, by increasing both the average distance between nodes and the bisection bandwidth. We use larger data sets in these experiments to make sure that every processor is assigned a similar amount of work: for LU and Ocean, 460 × 460 and 514 × 514 matrices are used instead of the 260 × 260 and 258 × 258 matrices used for 16 processors, respectively. As the number of processors increases, the variance of the network latency becomes larger since the average hop count goes up. Thus, it is interesting to see how the effectiveness of APF is affected by the variation in the number of processors.

Fig. 11 shows the results. We see that the pf-cancel component increases for static prefetching (from 31% to 44% for LU and from 47% to 49% for Ocean) with 64 processors, which means that prefetch cancellation is more critical for larger numbers of processors. In contrast, the pf-lissue component increases for APF (from 21% to 27% for LU and from 13% to 15% for Ocean) with 64 processors. This results from the fact that APF decreases the prefetch distance so that prefetch cancellation does not occur, which can increase pf-lissue. Due to the longer network latency, this strategy affects pf-lissue more with 64 processors than with 16 processors. Overall, even though pf-lissue increases on a larger machine, APF is more effective as the number of processors increases.

5.3.3. Network latency

Another machine variation that also affects latency is a change in the characteristics of the interconnect. Here, instead of assuming a switch and wire delay of 2 cycles each, we double the delay to 4 cycles.
Fig. 11. Number of processors variation.
Fig. 12. Network latency variation (N1: default; N2: doubled switch and wire delay).
Doing this increases the no-contention 3-hop remote memory access latency from 140 to 230 cycles. Fig. 12 shows the effect of this change on the execution time of Ocean and LU. We see that APF works better under longer network latency, as was the case with 64 processors. As the network latency increases, as a result of a larger hop count or increased wire (or switch) delays, the pf-cancel component becomes larger and thus there is more room for improvement.

5.3.4. Memory consistency

Finally, we study the effects on execution time under sequential consistency.
Fig. 13. Memory consistency variation.
The main difference between sequential and release consistency is that, under release consistency, write stall time can be overlapped with computation by using a write buffer. Because the processor is stalled on a write miss under sequential consistency, the stall time components (pf-lissue and no-pf) should increase. Fig. 13 presents the resulting execution times for Jacobi and LU. They show that, for Jacobi, there is little difference between release and sequential consistency. In this program prefetches are issued for all memory references and most of the stall time is caused by prefetch cancellation, which is not affected by the memory consistency model. For LU, however, the performance improvement of APF under sequential consistency is not as good as it is under release consistency. The reason is that because pf-cancel decreases but no-pf and pf-lissue increase, there is not much room left for APF to improve performance. If we increase the prefetch distance to cover pf-lissue, performance gets worse because of the increased pf-cancel.

In summary, the effectiveness of APF is affected by architectural variations. Increasing the network latency, either by enlarging the machine or by slowing the network, improves the effectiveness of adaptivity.

6. Related work

While there have been many prefetching schemes, only a few of them can be considered adaptive, and most of the adaptive schemes are hardware based. During our research we could not find any prior work on adaptive software prefetching. In this section, we summarize previous work on adaptive hardware schemes and on integrated hardware/software schemes.

Dahlgren et al. [5] proposed an adaptive hardware prefetching method based on sequential prefetching. Sequential prefetching prefetches consecutive blocks following the block that misses in the cache. Dahlgren et al.'s scheme adapts the number of prefetched blocks (the prefetch degree) according to the effectiveness of previous
prefetches. Associating two bits per cache line and three counters per cache, their scheme measures the effectiveness of prefetching. At regular intervals, it calculates the fraction of useful prefetches; if this fraction is higher than a predefined threshold, the prefetch degree is incremented, and if it is lower than the threshold, the prefetch degree is decremented.

Ki and Knowles's scheme [7] is also an adaptive hardware prefetching scheme and is quite similar to Dahlgren et al.'s. Their method uses a tagged prefetching scheme in which the setting of a tag bit associated with each cache line initiates prefetching for the next cache lines. A tag bit is cleared when its cache line is brought into the cache, and set when the cache line is referenced. Their scheme adapts the prefetch distance and prefetch degree using cache usage information similar to that of our method: useful prefetches, late prefetches, replaced prefetches, and invalidated prefetches. Their adaptation function, however, does not consider prefetch cancellation effects and adapts the prefetch distance and degree linearly.

Gornish and Veidenbaum [6] proposed an integrated hardware/software prefetching method. By making the hardware handle only simple, small constant-stride access patterns, they reduced the complexity of the hardware prefetching mechanism. They also reduced the run-time overhead of software prefetching by restricting the software to handle only large constant-stride and non-constant-stride access streams. Complex access patterns are handled by software, but most of the prefetching is handled by simple hardware. However, their scheme cannot dynamically adapt to run-time parameter variations; it can adapt the prefetch degree only on a program-by-program and stream-by-stream basis.

7. Conclusion

The two main ideas behind adaptive prefetching which set it apart from static and dynamic schemes are: (1) that the determination at run-time of parameters having an effect on performance should be made based not only on program behavior, but on machine performance information as well; and (2) that in order to maintain a high level of optimality, some of these parameters have to be re-evaluated periodically based on measurements taken by a performance monitor. We have argued that the efficiency of certain optimizations, especially those dealing with memory latency, network contention, and remote communication, can benefit from adaptive schemes.

In this paper, we have studied how adaptive schemes can be applied to software prefetching algorithms to make them more effective in covering latency and reducing cache pollution. We have presented a practical adaptive prefetching algorithm capable of changing the prefetch distance of individual prefetch instructions relative to their corresponding loads by combining information collected about the number of late prefetches and the number of cancelled prefetches. We have shown that in order to minimize stall time it is imperative to give priority to the latter over the former. Finally, our simulation results showed that adaptive prefetching can improve the performance of static schemes by a significant amount, ranging from 10% to 60%.
Even more important is the 20% to 45% reduction in stall time that adaptive prefetching provides over static and dynamic schemes.
References

[1] A. Agarwal, Limits on interconnection network performance, IEEE Transactions on Parallel and Distributed Systems 2 (1991) 398–412.
[2] J. Baer, T. Chen, An effective on-chip preloading scheme to reduce data access penalty, in: Proceedings of Supercomputing '91, November 1991, pp. 176–186.
[3] E. Brewer, C. Dellarocas, A. Colbrook, W. Weihl, PROTEUS: A high-performance parallel-architecture simulator, Technical Report MIT/LCS/TR-516, MIT, 1991.
[4] T. Chen, J. Baer, A performance study of software and hardware data prefetching schemes, in: Proceedings of the 21st Annual International Symposium on Computer Architecture, April 1994, pp. 223–232.
[5] F. Dahlgren, M. Dubois, P. Stenstrom, Fixed and adaptive sequential prefetching in shared-memory multiprocessors, in: Proceedings of the 1993 International Conference on Parallel Processing, Vol. I, August 1993, pp. 56–63.
[6] E. Gornish, A. Veidenbaum, An integrated hardware/software data prefetching scheme for shared-memory multiprocessors, in: Proceedings of the 1994 International Conference on Parallel Processing, Vol. II, August 1994, pp. 281–284.
[7] A. Ki, A. Knowles, Adaptive data prefetching using cache information, in: Proceedings of the 11th International Conference on Supercomputing, 1997, pp. 204–212.
[8] T. Mowry, Tolerating latency through software-controlled prefetching in shared-memory multiprocessors, Journal of Parallel and Distributed Computing 12 (1991) 87–106.
[9] T. Mowry, Tolerating latency in multiprocessors through compiler-inserted prefetching, ACM Transactions on Computer Systems 16 (1998) 55–92.
[10] D. Park, Adaptive execution: Improving performance through the runtime adaptation of performance parameters, Ph.D. thesis, University of Southern California, 1996.
[11] D. Park, R. Saavedra, Trojan: A high-performance simulator for shared-memory architectures, in: Proceedings of the 29th Annual Simulation Symposium, April 1996, pp. 44–53.
[12] S. Woo, J. Singh, The SPLASH-2 programs: Characterization and methodological considerations, in: Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pp. 43–63.