Improvement of energy-efficiency in off-chip caches by selective prefetching


Microprocessors and Microsystems 26 (2002) 107–121

www.elsevier.com/locate/micpro

Jonas Jalminger*, Per Stenström
Department of Computer Engineering, Chalmers University of Technology, SE-412 96 Göteborg, Sweden
Received 11 September 2001; revised 26 November 2001; accepted 14 December 2001

Abstract

The line size/performance trade-offs in off-chip second-level caches are revisited in light of energy-efficiency. Based on a mix of applications representing server and mobile computer system usage, we show that while the large line sizes (128 bytes) typically used maximize performance, they result in high power dissipation owing to the limited exploitation of spatial locality. In contrast, small blocks (32 bytes) are found to cut the energy-delay by more than a factor of 2 with only a moderate performance loss of less than 25%. As a remedy, prefetching, if applied selectively, is shown to avoid the performance losses of small blocks while keeping power consumption low. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Caches; Energy-efficiency; Energy-delay; Performance evaluation; Prefetching

1. Introduction

In the quest towards higher performance levels, architectural design trade-offs have largely neglected to factor in the energy used to execute the program. While reducing the power dissipation in high-end microprocessors is clearly important, it is even more important for mobile computers such as laptops and personal digital assistants (PDAs) [3,12,14,16]. As the use of mobile computers becomes widespread, the computational demands are increasing; typically, mobile computers must support a wide range of performance-demanding applications such as video/audio decoding, computer graphics, and handwriting recognition.

In making architectural trade-offs for high-end computers, a key goal is to reduce the energy used to execute a program without any loss in performance. While this is certainly also important for mobile computers, another useful optimization criterion is minimization of the energy-delay, i.e. the product of the execution time and the energy used [7]. Previous studies have indicated that between 25 and 50% of the total chip power of a microprocessor is dissipated in the caches [14]. Therefore, understanding cache architectural design trade-offs that minimize the energy-delay, or the energy without any performance losses, is of paramount importance.

* Corresponding author. E-mail addresses: [email protected] (J. Jalminger), [email protected] (P. Stenström).

In this paper, we revisit the cache line size trade-offs with a focus on off-chip caches to understand whether commonly used line sizes, which are typically in the range 64–256 bytes, are good trade-offs in light of energy dissipation. With a set of seven applications that are relevant for two classes of systems (mobile computers and high-end servers), we have performed architectural simulation in conjunction with a previously published model of energy dissipation.

Our first contribution is the observation that commonly used line sizes can result in devastating energy dissipation; in fact, the energy dissipated for line sizes in the range 128–256 bytes can be twice as much as for a line size of 32 bytes, especially if the cache space is not sufficient to accommodate all working sets. The intuitive explanation is that while the hit ratio typically improves as the line size is increased, the spatial locality is not high enough, which leads to severe energy losses from bringing useless data into the cache with only marginal improvements in performance. Our data suggest that a smaller line size than what is commonly used is desirable in order to minimize the energy-delay. Previous work that has studied the line size trade-off [1,7,12,14,16,17,28,33] did not notice this effect.

This observation prompted us to consider techniques that exploit the spatial locality more selectively to compensate for the performance losses of small cache blocks without affecting the energy used. Software-controlled data prefetching [22] can improve the hit ratio by bringing data into the cache prior to its use. Since prefetching can also bring useless data into the cache, it has to be applied with care.


We used the performance-debugging facility in SimICS [21] to identify the load instructions in the applications that contribute most of the cache misses. Remarkably, typically fewer than 10 load instructions caused half of the total number of misses. Experiments based on selectively prefetching only blocks touched by instructions that result in misses with high likelihood yielded interesting results. Assuming a line size of 32 bytes, we show that it is possible to get higher performance using selective prefetching than using a larger line size without prefetching. Moreover, this performance gain is possible with only a slight increase in energy dissipation.

The rest of the paper is organized as follows. In Section 2, we first establish the architectural framework and the evaluation methodology used to study performance/energy trade-offs. In Section 3, we then present the results concerning performance/energy trade-offs for line sizes in second-level (L2) off-chip caches. Motivated by the observation that small line sizes minimize energy dissipation, we consider selective prefetching as a means to improve performance without increasing the energy dissipation in Section 4 and present our results for selective prefetching. This is followed by a discussion of related work in Section 5 before we conclude in Section 6.

2. Architecture framework and performance/energy models

This section establishes the architecture framework in Section 2.1, followed by the performance and energy models in Section 2.2. We then present the overall experimental methodology and the applications we have used to study line size trade-offs for L2 off-chip caches in light of performance and energy in Section 2.3.

Table 1
Default architectural parameters in the experiments

Cache parameter      Value
L1 cache size        16 KB
L1 associativity     2-way
L1 line size         32 bytes
L2 cache size        64 KB
L2 associativity     2-way

2.1. Architecture framework

Fig. 1. Architecture model. The energy model is derived for the dashed box to the right.

We consider a processor/memory architecture according to Fig. 1, in which the processor and the first-level (L1) cache are on-chip and the L2 cache is off-chip. In this architecture, the bandwidth into and out of the chip is limited, which is why the line size of the L1 cache is typically as small as 32 bytes. The L2 cache interfaces to memory. The line size of the L2 cache is not constrained by the limited chip bandwidth; with a fully interleaved memory, the available memory bandwidth makes larger line sizes feasible. In addition, off-chip caches can afford to be huge, which adds to the benefit of large line sizes. Therefore, many contemporary designs use line sizes of 128 bytes or larger.

The parameter space for the cache hierarchy is obviously huge. While we have studied the impact of typical parameters such as the cache size and the associativity, we will focus in the following on results for a set of default parameters, which are listed in Table 1. Next we justify our choice of these parameters.

First, while today's processors have two-level on-chip caches, we model a single-level on-chip cache. However, the miss rate for the on-chip cache would not be lower if we modeled a two-level on-chip cache, given a fixed transistor budget. Therefore, the energy used in the off-chip cache as predicted by our model is lower than in a contemporary system. Secondly, the default sizes of the L1 and L2 caches are quite small compared to contemporary systems. However, at this operating point, the primary working sets typically fit into both caches; we will also study the impact of larger L2 caches on the results. Thirdly, given the limited bandwidth across the chip boundary, we assume a fixed line size of 32 bytes for the L1 cache, whereas we vary the line size of the L2 cache from 32 to 256 bytes. Finally, as for other memory system assumptions, we consider the instruction cache to be ideal, so the processor does not stall on instruction cache misses. In addition, the caches are assumed to be write-back, write-allocate. Also, when fetching a memory block from main memory, the data is placed in both levels to maintain inclusion.

The next two simplifying assumptions tend to exaggerate the performance difference between small and large blocks in favor of large blocks. We will later show that this makes our experimental findings stronger. The first simplifying assumption is that the L1 and L2 miss penalties are constant and unaffected by the line sizes. In a real system, the miss penalty typically increases with the line size, either because the bus is not wide enough (unless a critical-word-first approach is adopted) or because of contention effects in the memory system. The miss penalties we assumed are 10 and 100 cycles for L1 and L2 misses, respectively. The second simplifying assumption is that we assume an in-order single-issue processor. A multiple-issue processor has the ability to cut some of the miss penalty by overlapping miss latency with other misses or with processing, which again means that our results will tend to make the performance difference between small and large blocks bigger than in a real system.
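To make these assumptions concrete, the sketch below shows a minimal two-level, write-allocate cache model with the Table 1 geometry and the constant 10/100-cycle penalties of a blocking, in-order processor. It is our own illustration, not the authors' simulator: the 2-way LRU bookkeeping, the omission of write-back traffic, and the small strided driver in main() are simplifications we assume for brevity (both levels use 32-byte lines here).

```c
#include <stdint.h>
#include <stdio.h>

#define WAYS 2
#define LINE 32                                   /* bytes; both levels use 32-byte lines here */

typedef struct { uint64_t tag[WAYS]; int valid[WAYS]; int lru[WAYS]; } Set;
typedef struct { Set *sets; int num_sets; long hits, misses; } Cache;

static Set l1_sets[16 * 1024 / (LINE * WAYS)];    /* 16 KB L1, 2-way (Table 1) */
static Set l2_sets[64 * 1024 / (LINE * WAYS)];    /* 64 KB L2, 2-way (Table 1) */
static Cache l1 = { l1_sets, 16 * 1024 / (LINE * WAYS) };
static Cache l2 = { l2_sets, 64 * 1024 / (LINE * WAYS) };

/* Probe one level; on a miss the line is filled (write-allocate, and filling
   both levels on an L2 miss keeps the hierarchy inclusive). */
static int lookup(Cache *c, uint64_t addr)
{
    uint64_t line = addr / LINE;
    Set *s = &c->sets[line % c->num_sets];
    int w, victim = 0;

    for (w = 0; w < WAYS; w++)
        if (s->valid[w] && s->tag[w] == line) {   /* hit: refresh 2-way LRU state */
            s->lru[w] = 0; s->lru[1 - w] = 1;
            c->hits++;
            return 1;
        }
    for (w = 0; w < WAYS; w++)                    /* miss: pick an invalid or the LRU way */
        if (!s->valid[w] || s->lru[w]) victim = w;
    s->tag[victim] = line; s->valid[victim] = 1;
    s->lru[victim] = 0; s->lru[1 - victim] = 1;
    c->misses++;
    return 0;
}

/* Stall cycles of one reference under the blocking, in-order model. */
static long access_cycles(uint64_t addr)
{
    if (lookup(&l1, addr)) return 0;              /* L1 hit                  */
    if (lookup(&l2, addr)) return 10;             /* L1 miss penalty         */
    return 10 + 100;                              /* plus L2 miss penalty    */
}

int main(void)
{
    long stall = 0, refs = 1000000;
    for (long i = 0; i < refs; i++)               /* hypothetical strided access pattern */
        stall += access_cycles((uint64_t)(i * 64) % (1 << 22));
    printf("L1 misses %ld, L2 misses %ld, stall cycles %ld\n",
           l1.misses, l2.misses, stall);
    return 0;
}
```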

Table 2
Technology parameters used in the L2 cache energy model

Parameter                          Unit    Value          Notation
Feature size                       µm      0.35           –
External power supply              V       3.3            Vext
Internal power supply              V       1.5            Vint
Bank width                         bit     128            nbl
Bank height                        bit     512            nwl
Bit line swing (read)              V       0.5            Vread
Bit line swing (write)             V       1.5            Vwrite
Bit line capacitance per cell      F       2.5 × 10^-15   Cbl,cell
Word line capacitance per cell     F       5.2 × 10^-15   Cwl,cell
Sense amplifier current            A       1.5 × 10^-4    Isense
Column control transistor gate     F       4.0 × 10^-14   Cgate,cell
Sense amplifier active time        s       2.0 × 10^-9    tactive

2.2. Energy model

The energy model is adopted from Fromm et al. [8]. This model accounts both for the energy used in the memory arrays (and the associated tag array) of the L2 cache and for the energy used in transferring cache blocks between the L2 cache and main memory. It also accounts for main-memory energy usage. What it does not account for is the energy dissipated by the processor and the L1 cache, which is justified later.

The L2 cache consists of several sub-arrays (or banks) in order to keep the latency and energy usage down. We assume that within the 10-cycle access time it is possible to decode the address so that only the selected sub-arrays are enabled, where each sub-array consists of 512 × 128 bits. We do not model the impact of bit flips on energy usage in detail; instead we assume that half of the bits flip on each transfer on the bus. When accessing the L2 cache, all memory sub-arrays are precharged but only the selected sub-arrays are discharged to read or write an L1 cache line. The energy use accounts for precharging, discharging of the selected sub-arrays, and amplifying the sensed data. The same approach is applied to the tags. A summary of the technology parameters we use is found in Table 2.

Table 3
Bus and main memory access energies

Bus loads                          Unit    Value         Notation
Data line capacitance              F       4 × 10^-12    Cdl
Data pad capacitance               F       7 × 10^-12    Cdp
Data line driver capacitance       F       5 × 10^-12    Cdldriver
Address line capacitance           F       4 × 10^-12    Cal
Address pad capacitance            F       5 × 10^-12    Cap
Address line driver capacitance    F       5 × 10^-12    Caldriver

Main memory energies
Energy for a row access            J       3.2 × 10^-8   Erow
Energy for a column access         J       6 × 10^-9     Ecol

Let us now present the formulas used to derive the energies. The capacitance to drive when accessing one row in a sub-array is

C_{wl,tot} = n_{bl} C_{wl,cell}    (1)

and thus the energy needed to cycle an entire word line is

E_{wl} = (1/2) C_{wl,tot} V_{int}^2    (2)

Analogously, to calculate the energy used to activate a bit line, the following equations are used:

C_{bl,tot} = n_{wl} C_{bl,cell}    (3)

E_{bl,precharge} = (1/2) C_{bl,tot} n_{bl} n_{sub-arrays} V_{read} V_{int}    (4)

where n_{sub-arrays} denotes the number of sub-arrays needed for a certain cache size, and

E_{bl,discharge} = (1/2) C_{bl,tot} n_{bl} n_{sub,enable} V_{read} V_{int}    (5)

where n_{sub,enable} is the number of sub-arrays needed for a given L1 cache line size and V_{read} denotes the bit line swing for a read operation; for a write operation, V_{read} is simply exchanged for V_{write}. Thus, a complete read or write cycle first precharges the sub-arrays, denoted by E_{bl,precharge}, and then reads or writes the enabled array or arrays, denoted by E_{bl,discharge}, which can be summarized with the following equation:

E_{bl} = E_{bl,precharge} + E_{bl,discharge}    (6)

The contribution of the sense amplifiers to the energy is modeled as follows:

E_{sense} = I_{sense} V_{int} n_{bl} t_{active}    (7)

where t_{active} is the time the sense amplifiers are active, which is the same as one processor cycle. This active time decreases with increased operating frequency and therefore this contribution decreases with faster processors. The last contribution stems from the gate transistors that connect the bit lines to the sense amplifiers:

E_{gate} = n_{bl} C_{gate,cell} V_{int}^2    (8)

Let us now turn to the energy used for the bus and main memory accesses. Table 3 lists the different capacitances and access energies for the various parts of the bus and of main memory. The bus energy is obtained from the following equation:

E_{bus} = [ n_{dbits} (C_{dl} + 2 C_{dp} + C_{dldriver}) (1/2) + n_{abits} (C_{al} + 2 C_{ap} + C_{aldriver}) (1/2) ] (1/2) V_{ext}^2    (9)

where n_{dbits} and n_{abits} are the number of data and address bits, respectively. The right-most constant in the above equation comes from the assumption that half of the bits flip on average.

The DRAM access energy is divided into a row address part E_{row} and a column address part E_{col}, whose values can be found in Table 3. To access main memory, we need to do one row access and several column accesses to get the number of bytes we are interested in, where each access delivers 32 bits (4 bytes). We are now ready to express the contributions from the different parts of the memory hierarchy in the following way:

Energy_{L2} = E_{wl} + E_{bl} + E_{sense} + E_{gate}    (10)

Energy_{mm} = E_{row} + ( n_{block size} / 4 − 1 ) E_{col}    (11)

Using the above equations, we calculated the energies shown in Table 4, where the numbers are the average of the read and write energy. The numbers also include the bus energy. As mentioned earlier, we are only interested in the energies in the data cache, mainly because instruction caches exhibit high hit ratios compared to data caches.

Table 4
Average energies (in Joule) associated with memory accesses as a function of L2 line size (using the default L2 size in Table 1)

Type of access                                         32 byte        64 byte       128 byte      256 byte
L1 to L2 read/write (Energy_{L2,average} + E_{bus})    1.8 × 10^-8    –             –             –
L2 to main memory read/write (Energy_{mm} + E_{bus})   0.98 × 10^-7   1.7 × 10^-7   3.1 × 10^-7   6.0 × 10^-7

2.3. Experimental methodology and applications

The methodology we use to derive execution times and the energies used is based on the SimICS tool [21]. SimICS is a highly efficient instruction-set simulator capable of executing application binaries on top of an operating system, in this case Linux 2.0.35. We have interfaced the two-level memory hierarchy of Fig. 1 to SimICS. On every memory access, SimICS calls the memory system simulator, where the statistics needed are collected. To model energy usage, the memory events of interest, i.e. the number of L2 cache hits and misses, are captured and drive the energy model, which looks as follows:

E = L2_{hits} ( Energy_{L2} + E_{bus} ) + L2_{miss} ( Energy_{L2} + Energy_{mm} + E_{bus} )    (12)
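As a cross-check of the model, the sketch below plugs the Table 2 and Table 3 parameters into Eqs. (1)-(12). The sub-array counts (8 sub-arrays for the 64 KB default cache, 2 enabled for a 32-byte L1 line) and the bus widths (256 data bits, 32 address bits) are not stated in the paper and are our assumptions, so the printed values only approximate Table 4.

```c
/* Sketch of the Section 2.2 energy model, Eqs. (1)-(12), with Table 2/3 values.
   Sub-array counts and bus widths are assumed, so the output is approximate. */
#include <stdio.h>

/* Table 2: technology parameters */
static const double V_ext = 3.3, V_int = 1.5, V_read = 0.5;
static const double n_bl = 128.0, n_wl = 512.0;             /* bank width/height in bits */
static const double C_bl_cell = 2.5e-15, C_wl_cell = 5.2e-15;
static const double I_sense = 1.5e-4, C_gate_cell = 4.0e-14, t_active = 2.0e-9;

/* Table 3: bus and main-memory parameters */
static const double C_dl = 4e-12, C_dp = 7e-12, C_dldriver = 5e-12;
static const double C_al = 4e-12, C_ap = 5e-12, C_aldriver = 5e-12;
static const double E_row = 3.2e-8, E_col = 6e-9;

/* Eqs. (1)-(8) and (10): energy of one L2 access (read/write of one L1 line). */
static double energy_L2(double n_subarrays, double n_sub_enable)
{
    double C_wl_tot = n_bl * C_wl_cell;                                       /* (1) */
    double E_wl     = 0.5 * C_wl_tot * V_int * V_int;                         /* (2) */
    double C_bl_tot = n_wl * C_bl_cell;                                       /* (3) */
    double E_bl_pre = 0.5 * C_bl_tot * n_bl * n_subarrays * V_read * V_int;   /* (4) */
    double E_bl_dis = 0.5 * C_bl_tot * n_bl * n_sub_enable * V_read * V_int;  /* (5) */
    double E_bl     = E_bl_pre + E_bl_dis;                                    /* (6) */
    double E_sense  = I_sense * V_int * n_bl * t_active;                      /* (7) */
    double E_gate   = n_bl * C_gate_cell * V_int * V_int;                     /* (8) */
    return E_wl + E_bl + E_sense + E_gate;                                    /* (10) */
}

/* Eq. (9): bus energy; the trailing 0.5 is the half-the-bits-flip assumption. */
static double energy_bus(double n_dbits, double n_abits)
{
    return (n_dbits * (C_dl + 2.0 * C_dp + C_dldriver) * 0.5 +
            n_abits * (C_al + 2.0 * C_ap + C_aldriver) * 0.5) * 0.5 * V_ext * V_ext;
}

/* Eq. (11): one row access plus 4-byte column accesses for the rest of the block. */
static double energy_mm(double block_bytes)
{
    return E_row + (block_bytes / 4.0 - 1.0) * E_col;
}

int main(void)
{
    /* Assumed geometry: a 64 KB cache of 512 x 128-bit sub-arrays (8 KB each)
       gives 8 sub-arrays, and a 32-byte L1 line spans 2 of them. */
    const double n_subarrays = 8.0, n_sub_enable = 2.0;
    const double e_l2  = energy_L2(n_subarrays, n_sub_enable);
    const double e_bus = energy_bus(256.0, 32.0);       /* assumed 256 data / 32 address bits */

    printf("L1<->L2 access:           %.2e J\n", e_l2 + e_bus);
    for (int bs = 32; bs <= 256; bs *= 2)                /* the L2 line sizes of Table 4 */
        printf("L2<->memory, %3d bytes:  %.2e J\n",
               bs, energy_mm(bs) + energy_bus(bs * 8.0, 32.0));

    /* Eq. (12): total energy for hypothetical L2 hit/miss counts. */
    const double hits = 1.0e6, misses = 1.0e5;
    printf("E (example run):          %.2e J\n",
           hits * (e_l2 + e_bus) + misses * (e_l2 + energy_mm(32.0) + e_bus));
    return 0;
}
```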

Seven applications that span the spectrum from typical mobile-computer applications to high-end server applications drive the performance and energy models discussed above to evaluate the energy-delay trade-offs. The applications representing mobile computers are Quake, a popular game [13]; Handwriting recognition, a program that translates the Declaration of Independence from handwriting to ASCII text [25]; MPEG-2, an MPEG decoder [24]; and Raytracing, a program that raytraces a car [5]. The applications representing high-end servers include the TPC-D queries [31] DSS-Q3 and DSS-Q6, run on the MySQL [29] decision support system (DSS), and OLTP, the on-line transaction processing benchmark from TPC-B [31], run on MySQL [29]. The number of simulated instructions and the numbers of reads and writes are summarized in Table 5. Clearly, the traces are sufficiently long to warm up even L2 caches of several megabytes.

Table 5
Characteristics of the simulated benchmarks

                               Quake          Handwriting    DSS-Q3        DSS-Q6         OLTP          Raytrace      MPEG-2
                                              recognition
No. of simulated instructions  30.53 × 10^8   7.40 × 10^8    3.09 × 10^8   16.34 × 10^8   1.39 × 10^8   7.66 × 10^8   1.30 × 10^8
No. of references              10.28 × 10^8   2.07 × 10^8    0.73 × 10^8   3.60 × 10^8    0.30 × 10^8   2.55 × 10^8   0.44 × 10^8
No. of reads                   7.38 × 10^8    1.55 × 10^8    0.53 × 10^8   2.38 × 10^8    0.23 × 10^8   2.43 × 10^8   0.28 × 10^8
No. of writes                  2.90 × 10^8    0.52 × 10^8    0.20 × 10^8   1.22 × 10^8    0.08 × 10^8   0.12 × 10^8   0.17 × 10^8

As for the metrics used to quantify energy-efficiency, we first consider the energy used in the L2 cache and memory during the execution of a program. We also consider the energy-delay, i.e. the product of the energy used and the execution time, as another important metric. To help compare different energy-delay values, keep in mind that a decrease in energy-delay is an increase in energy-efficiency and vice versa.

3. Impact of line size on energy-efficiency

In this section, we consider how the line size in the L2 off-chip cache affects the energy-efficiency, measured as the total energy used to execute a program or as the energy-delay. As a base for the investigation, we start in Section 3.1 by revisiting how the line size affects the performance of our benchmarks. We then study the impact of the line size on the energy dissipated during the execution in Section 3.2. The key observation is that the optimal line size is considerably smaller than what is commonly used for off-chip caches, especially if the cache space is not sufficient. Our findings are then generalized in Section 3.3.

Fig. 2. Global miss rate for different L2 cache line sizes. The y-axis shows the miss ratio in per cent of the total number of accesses.

3.1. Impact of L2 line size on performance

In Fig. 2, we show the miss rates of the two-level cache hierarchy as a function of the L2 line size for all seven applications. First, for DSS-Q3, OLTP, and Handwriting recognition, the miss rate first declines as expected and then increases as a result of the limited cache space to accommodate all working sets. For the rest of the applications, however, the miss rates monotonically decrease and reach their minimum for 256 byte blocks.

Fig. 3. Normalized execution time for different L2 line sizes. All numbers are normalized to the case with 256 byte blocks.

In Fig. 3, the impact of the L2 line size on the execution times of the benchmarks is shown. These curves follow, as expected, the same trends as the miss rates. We note that the performance difference between L2 caches with 32 and 256 byte blocks is at most 25%. We next consider how big the difference in energy usage is between 32 and 256 byte blocks.

3.2. Impact of L2 line size on energy used

Fig. 4. Relative energies for different L2 line sizes.

In Fig. 4, we show the energy used during the execution of the benchmarks as a function of the L2 line size, relative to the energy used for 256 byte lines.

Fig. 5. Normalized energy-delay for different L2 line sizes.


Fig. 6. Cache block utilization for different L2 cache line sizes. The utilization is measured in per cent.

For all the applications, we note that the energy monotonically increases as we go to larger line sizes. Indeed, the energy used for 256 byte blocks is 2–5 times higher than for 32 byte blocks.

In Fig. 5, we factor in the execution time and consider the energy-delay (the product of the energy used and the execution time). In general, the energy-efficiency deteriorates with the line size. This is because the improvement in execution time is considerably smaller than the increase in energy. For five out of the seven benchmarks the energy-delay steadily increases with the line size. Indeed, the energy-delay is at least twice as high for 256 byte blocks compared to 32 byte blocks for six out of the seven applications, whereas the difference is much smaller for DSS-Q6. Large blocks are only beneficial if the spatial locality is high enough that a majority of the words in the block are really accessed.

Fig. 7. Energy-delay ratio between 256 and 32 byte blocks.


Fig. 8. Energy-delay ratio between 256 and 32 byte blocks with increasing latency tolerance.

A useful metric for spatial locality is therefore the block utilization, i.e. the fraction of the words in a block that are actually accessed. Using this metric, which is shown in Fig. 6, we can now explain the increase in energy used for large blocks. Across all the benchmarks, the block utilization monotonically decreases as we move to larger blocks. For DSS-Q6, which does a sequential scan, a large block size still exhibits fairly high block utilization. The low utilization of large blocks results in only modest improvements in the miss rate at the expense of the energy used to bring useless data into the cache.
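The block-utilization numbers of Fig. 6 can be gathered with very little bookkeeping. The hooks below are an illustration of the metric rather than the authors' instrumentation: a per-line bitmask records which 4-byte words were touched, and the fraction of used words is accumulated when the line leaves the cache.

```c
#include <stdint.h>

#define LINE_BYTES 256                          /* L2 line size under study (32-256 bytes) */
#define WORDS      (LINE_BYTES / 4)

typedef struct { uint64_t touched[(WORDS + 63) / 64]; } LineUse;

static double used_words, evicted_lines;

/* Call on every access that hits a resident L2 line. */
static void mark_touch(LineUse *lu, uint64_t addr)
{
    unsigned word = (unsigned)((addr % LINE_BYTES) / 4);
    lu->touched[word / 64] |= 1ull << (word % 64);
}

/* Call when the line is evicted (and once per resident line at the end of the run). */
static void account_eviction(LineUse *lu)
{
    for (unsigned w = 0; w < WORDS; w++)
        if (lu->touched[w / 64] & (1ull << (w % 64)))
            used_words += 1.0;
    evicted_lines += 1.0;
}

/* Block utilization as plotted in Fig. 6: average per cent of words used per line. */
static double block_utilization(void)
{
    return 100.0 * used_words / (evicted_lines * WORDS);
}
```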

3.3. Variation analysis

We have shown that large blocks typically exploit most of the spatial locality in the applications but result in poor block utilization, which makes the energy increase drastically. While these observations were made under a fixed set of architectural assumptions, we justify in this section why we believe the qualitative observations are valid over a broader range of technology assumptions.

First, we only considered the energy used in the L2/memory part of the memory hierarchy. However, we expect the energy used in the L1 cache to have a marginal impact on the result. As the energy used for a main memory access (10^-7 J) is typically three orders of magnitude higher than the energy used for an L1 hit (10^-10 J), the results would only be slightly different. A simple back-of-the-envelope calculation shows that the L1 cache energy is only about 5% of the energy used in the rest of the memory hierarchy.

Secondly, and presumably more importantly, the default cache sizes are small compared to what one would find especially in high-end servers. Therefore, it becomes important to study how our observations change if we consider larger L2 caches. In Fig. 7, the bars for each application show the ratio of the energy-delay for 256 and 32 byte blocks for L2 cache sizes ranging from 64KB to 2MB. While this ratio is more than 2 for all of the applications for the default L2 cache size, it decreases monotonically as we move to larger L2 caches for all applications except Quake. The reason for this is quite obvious; if we can afford to make the cache big enough to accommodate all working sets, the energy will be dictated by the cold miss rate. The sudden increase in the energy-delay ratio for the Quake benchmark is attributed to the fact that more working sets fit in the cache for a 32 byte block, whereas for a 256 byte block some conflict misses are still present when the cache size is increased. The DSS-Q6 benchmark exhibits so much spatial locality that for caches greater than 64KB, the energy-efficiency is higher for 256 byte blocks. We note that for big L2 caches (say 2MB), 256 byte blocks yield about the same energy-efficiency as 32 byte blocks for all the applications except Quake, DSS-Q3, and Raytrace, where 32 byte blocks still yield between 20 and 35% better energy-efficiency. On the other hand, for moderate L2 cache sizes (say 512KB), as one typically finds in mobile computers, the energy-efficiency difference is quite significant; it ranges from 8 to 78% and averages 35%. Thus, a small block size is a good trade-off if the energy-delay is to be minimized for moderate L2 cache sizes.

Finally, we have used a simplified processor model that issues a single instruction per cycle and blocks on cache misses. Neither of these assumptions holds for processors being designed today.

However, for a processor that supports multiple outstanding memory operations we would expect a smaller improvement in execution time, compared to a single-issue processor, as we move to larger blocks, because the miss latency could then be partly hidden. The energy used, however, would not be affected. This would result in a bigger difference between the energy-delay for small and large blocks. Therefore, the data derived for the simplified processor model are expected to be conservative, and we would expect a bigger difference for a multiple-issue processor. To test these hypotheses, we now assume a processor that can hide the miss latency by other misses or by computation, assuming the default cache parameters. In Fig. 8, we consider three cases for each application: (1) no (or 0%) latency tolerance; (2) partial (50%) latency tolerance; (3) full (100%) latency tolerance. As in the previous experiment, we consider the ratio of the energy-delay between 256 and 32 byte blocks. For all benchmarks except DSS-Q3, the ratio is constant or increases as we tolerate more latency. The reason that DSS-Q3 does not show this behavior is that the miss rate for 256 byte blocks is greater than that for 32 byte blocks, and thus there is more latency to hide.

In summary, we have seen that a low energy-delay is favored by a small line size, typically 32 bytes. This should be contrasted with the commonly used line sizes of 64–256 bytes for off-chip L2 caches. While the effect is small for the huge L2 caches typically found in high-end servers, it is significant for the moderately sized L2 caches that one may find in mobile computers. The downside is that 32 byte blocks lead to lower performance owing to the fact that we do not fully exploit the inherent spatial locality. This implies that if we can exploit the spatial locality more selectively, it would be possible to both improve performance and decrease the energy needed. This opportunity is studied next in the context of the applications that represent mobile computer usage.

4. Selective software prefetching

The approach we have considered to boost the performance of caches with small blocks, without affecting their low energy dissipation, is prefetching. With prefetching, a block that is expected to be accessed in the future is requested before it is actually needed. Ideally, it is fetched into the cache enough cycles in advance to completely hide the memory latency. In order not to affect the energy dissipation, it is important to only prefetch data into the cache that is actually used and to avoid displacing actively used data from the cache. In addition, a key challenge for prefetching is to minimize useless memory traffic.

Previously proposed prefetch schemes are either hardware-based [6] or software-based [22]. Since hardware-based prefetching typically wastes more memory traffic than software-based prefetching, we have pursued a software-based approach in this study. Also, most if not all high-end processors support prefetch instructions. In addition, if a methodology is used to insert them carefully, it is possible to improve performance with a minor increase in energy dissipation. Another contribution reported in this section concerns the performance impact of selective prefetching as a methodology to fulfill this goal. In Section 4.1, we describe the prefetch annotation methodology. In Section 4.2, we apply the methodology to the benchmarks representing mobile computer usage. Finally, in Sections 4.3 and 4.4, we report on the performance and energy results using the methodology.

4.1. Methodology for prefetch annotations

Ideally, insertion of prefetch instructions is supported by a compiler. Despite the fact that there exist good static analysis frameworks for selectively inserting prefetches into the code [22], they do not apply to applications where information critical to the scheduling of prefetch instructions is only known at run-time. For such applications, we advocate a methodology in which prefetch instructions are inserted selectively by identifying the memory instructions in the code that contribute most of the cache misses. We used the SimICS tool to count the number of misses caused by individual load and store instructions. By having access to the addresses of the memory instructions that cause a majority of the cache misses, it is possible to track the constructs at the source-code level that cause them, using a source-code debugger such as GNU's gdb. This methodology thus supports insertion of prefetch annotations at the source-code level.
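A per-instruction miss profile of the kind described above needs only a counter keyed by the program counter of the faulting load or store. The hook below is a generic sketch of that idea, not SimICS's actual profiling interface; the fixed-size hash table and the report format are our own choices.

```c
#include <stdint.h>
#include <stdio.h>

#define SLOTS 4096                          /* hash table of per-PC miss counters      */

static struct { uint64_t pc; long misses; } prof[SLOTS];

/* Call from the memory-system model whenever a reference misses in the cache.
   The table is assumed large enough never to fill up. */
static void record_miss(uint64_t pc)
{
    unsigned i = (unsigned)(pc % SLOTS);
    while (prof[i].pc != 0 && prof[i].pc != pc) /* linear probing                        */
        i = (i + 1) % SLOTS;
    prof[i].pc = pc;
    prof[i].misses++;
}

/* Report the handful of instructions responsible for most misses (cf. Table 6). */
static void report_top(int n)
{
    for (; n > 0; n--) {
        int best = -1;
        for (int i = 0; i < SLOTS; i++)
            if (prof[i].pc && (best < 0 || prof[i].misses > prof[best].misses))
                best = i;
        if (best < 0) return;
        printf("pc %#llx: %ld misses\n",
               (unsigned long long)prof[best].pc, prof[best].misses);
        prof[best].pc = 0;                  /* drop it so the next pass finds the runner-up */
    }
}
```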


In Table 6, we show statistics on the percentage of cache misses caused by the 1, 3, 5 and 10 memory instructions that cause most of the cache misses. The interesting result is that for five benchmarks, between 46 and 80% of the misses are generated by at most 10 different instructions. In light of this observation, it seems feasible to manually insert prefetches selectively to decrease the miss rate. Inserting prefetches selectively and scheduling them properly are the main issues to address regardless of whether prefetching is done using a performance-debugging tool such as SimICS or using a compiler.

Table 6
Percentage of all cache misses generated by a number of different memory instructions

Benchmark                 1 Instruction (%)   3 Instructions (%)   5 Instructions (%)   10 Instructions (%)
Quake                     10                  18                   33                   46
Handwriting recognition   42                  72                   78                   >80
DSS-Q3                    2                   7                    12                   19
DSS-Q6                    6                   19                   32                   59
OLTP                      2                   4                    6                    10
Raytrace                  8                   22                   33                   49
MPEG-2                    30                  63                   74                   80

A potential problem with any profiling methodology is that it may not be valid for another set of input data. We carefully inspected the load instructions that cause a majority of the misses and noted that their effective addresses are not affected by the input data. On the other hand, another set of input data may result in another execution path, and hence in another set of load instructions causing a majority of the misses. While this may happen, it will nevertheless not cause any useless prefetches to be issued and thus will not affect the energy negatively.

4.2. Case study

In the case study we restricted ourselves to the four applications that reflect mobile computer usage: Quake, Handwriting recognition, Raytrace, and MPEG-2. For the server-class applications, one can usually afford to make the L2 cache big enough, and since the block size is then not critical for the energy-delay of DSS-Q3, DSS-Q6 and OLTP, this is another reason not to include these benchmarks. The architecture and energy models are the same as before. One simplifying assumption that could potentially skew our results is that we do not model memory system contention. However, as we will see, this only marginally affects the results, as the prefetch efficiency we have observed is close to 100%.

Our methodology to insert prefetches uses the following equation to calculate how many iterations in advance we need to prefetch data:

n_{prefetch} = l / s    (13)

where n_{prefetch} denotes the number of iterations to prefetch ahead, l the expected memory latency, and s the shortest path through each iteration. We use the default memory latency l = 100 cycles.

For the Quake benchmark, we inserted three prefetch instructions to cover the misses in two loops. To cover the misses in the first loop, two prefetches were inserted. The reason why it would not be possible to automatically insert prefetches in this loop is that the number of iterations depends on the input data. Therefore, apart from inserting prefetch instructions, code must be inserted that guarantees that they are executed only if there are more iterations to execute. In fact, previous work by Santhanam et al. [27] uses either compiler heuristics or profile information from previous runs and feeds this information back to help the prefetch engine handle run-time control paths and iteration spaces, just as we do when profiling the applications. As the number of iterations varied, two run-time checks were needed: one to prefetch only every fourth iteration and one that stops prefetching if not enough iterations remain. As the shortest path through the loop was five instructions, Eq. (13) yielded that a prefetch instruction should be inserted 20 iterations ahead of the point where the data is used. The other prefetch instruction was inserted to cover the misses generated when the loop has only a few iterations. For the last loop, only one prefetch instruction was inserted to cover the misses, as the loop iterated at most eight times. The instruction overhead caused by prefetch-related code reached 2.2%.

Handwriting recognition was found to be a much easier application in which to insert prefetches. The two loops that accounted for a majority of the misses only required a single prefetch each. Just as in Quake, the iteration bounds were not statically known; therefore we could not use loop unrolling or software pipelining as a means to cut the overhead. Instead, the same control construct used in Quake was needed to issue a prefetch every fourth iteration. This construct was used in both loops. Fortunately, these loops had iteration counts between 128 and 192, and thus prefetching data for iterations outside the loop's iteration space resulted in only small overheads. In fact, it actually covered misses, since the loop was repeated. The instruction overhead for these loops reached 8.7%.

The misses for the Raytrace application were not generated in any loop. Instead, a function that checks whether a triangle and a ray intersect was identified as the key source of misses. Two prefetches were inserted to cover the misses. As the function is called with input data, the need to selectively consider what to prefetch is evident. It turned out that fetching only two control pointers was enough to enhance performance. In this case, no loop unrolling or control statements were needed. The instruction overhead incurred was 0.5%.

Finally, we considered MPEG-2 and found that we needed to insert three prefetch instructions in two different nested loops. The two nested loops were very similar and therefore we only describe one of them. The outer loop runs between four and eight times and the inner loop between 8 and 16 times. As the inner loop had so few iterations, there was no point in using software pipelining. Instead, before executing the inner loop, the data needed for the next set of iterations were fetched. Furthermore, because of the relatively few iterations in the outer loop, we removed the last iteration from the loop; instead of doing n iterations, only n − 1 iterations were kept in the loop and the last iteration was executed separately. The instruction overhead was 0.8%.

To conclude, for all benchmarks it was fairly straightforward to find where to insert the prefetches. For the benchmarks we have looked into, no more than one day was spent on each of them, and this is mostly because we were not familiar with the code.
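The guarded, strided insertion described for the first Quake loop looks roughly like the sketch below. The loop body and names are hypothetical and GCC's __builtin_prefetch stands in for the prefetch instruction assumed by the paper; only the insertion pattern (a prefetch every fourth iteration, Eq. (13)'s 100/5 = 20 iterations ahead, and only while enough iterations remain) follows the text.

```c
#include <stddef.h>

/* Hypothetical loop standing in for the Quake loop of the case study. */
void process(const float *in, float *out, size_t n)   /* n is only known at run-time */
{
    const size_t dist = 20;        /* Eq. (13): l / s = 100 cycles / 5 instructions */

    for (size_t i = 0; i < n; i++) {
        /* The two run-time checks from Section 4.2: issue a prefetch only every
           fourth iteration (one 32-byte line covers several elements) and only
           if the prefetched iteration is still inside the loop's bounds. */
        if ((i & 3) == 0 && i + dist < n)
            __builtin_prefetch(&in[i + dist], 0 /* read */, 1 /* low temporal locality */);

        out[i] = in[i] * 0.5f;     /* placeholder for the real loop body */
    }
}
```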

Fig. 9. Global miss rate with prefetching. For every benchmark, the first four bars are without prefetching and the rightmost bar is with prefetching.

4.3. Impact on execution time

In terms of keeping the energy dissipation at the same level as without prefetching, the number of useless prefetches should be kept to a minimum. However, the instruction overhead will increase the energy dissipation through extra processing and more pressure on the instruction cache. In our processor/cache model of Fig. 1, we assume that the issued prefetches place the data into both the L2 and L1 caches. The execution time includes the extra overhead of the prefetches, but the overhead in terms of energy is not included. We conjecture that this effect is small, however, as the hit rate in instruction caches is usually high. In addition, the instruction overhead, as we have seen, is quite small.

Fig. 10. Normalized execution time after insertion of prefetch instructions.

Fig. 11. Execution time split into memory stall time, instruction overhead and busy time with prefetching. Busy time means that the processor has something to do, instruction overhead is the extra instructions executed due to prefetching, and memory stall is the time spent by the processor waiting for data.

In Fig. 9, we show the miss rate of the two-level cache hierarchy across all the block sizes, together with the system using a 32 byte L2 line size and prefetching. A first observation is that the system with prefetching achieves a lower miss rate than the system with the same line size but no prefetching. A second observation is that in two of the four benchmarks, the miss rate of the system with prefetching is lower than that of the system using 64 byte lines in the L2 cache. An important remark is that even though the MPEG-2 benchmark does not seem to benefit that much from prefetching, all prefetches that do not eliminate a miss are counted as misses even if part of the miss latency is hidden. This means that even though the prefetches do not notably lower the miss rate, the performance improvement might still be better than the miss rate numbers indicate. Finally, for one of the benchmarks, Handwriting recognition, the miss rate is actually better for the system with prefetching than for the system with 256 byte blocks (which is without prefetching).

Let us now study how the improved miss ratio affects the execution time. In Fig. 10, we show the relative execution times for all L2 cache systems from 32 to 256 byte blocks and for the system with 32 byte blocks and prefetching. We see that in three of the benchmarks we are able to increase the performance to that of the larger block sizes.

In the Handwriting recognition benchmark we increased the performance even further. Unfortunately, in the Quake benchmark it was found to be hard to insert the prefetches far enough in advance to hide the latency, and thus the performance improvement is much smaller.

An interesting question is how the instruction overhead limits further performance improvements. In Fig. 11, we show the relative execution time of the systems with 32 byte L2 line sizes with and without prefetching. The execution-time bars are decomposed into busy time, instruction overhead for prefetching, and memory stall time. The most important observation is that the instruction overhead is typically small for all four benchmarks. While it is negligible for all the applications except Handwriting recognition, the instruction overhead for that application is small compared to the reduction in execution time, which is 23%. Thus, despite the complex annotations needed, the instruction overhead is not an important issue.

In summary, on average we were able to enhance performance by 9% compared to 32 byte blocks with no prefetching, keeping in mind that we have covered the misses of only between two and four memory instructions in these benchmarks. Consequently, we would expect even better performance if we had covered more of the instructions that generate cache misses. Even though it is thus possible to increase performance, it is important to see whether we were able to keep the prefetch efficiency high, i.e. not waste any energy bringing useless data into the cache. This issue is studied in Section 4.4.


Fig. 12. Normalized energy-delay with prefetching.

4.4. Impact on energy-delay

The energy dissipated after insertion of the prefetch instructions is almost identical to the energy dissipated for 32 byte blocks in Fig. 4, and therefore no separate figure is included.

The most important observation is that the system with prefetch annotations manages to maintain as low an energy dissipation as the system with no prefetching and the same line size. For Raytrace and MPEG-2, we were able to keep the energy dissipation constant, and for Quake and Handwriting recognition, the increase was only 2.6 and 3.6%, respectively. We have also measured the prefetch efficiency, i.e. the fraction of all issued prefetches that bring blocks into the cache that are later accessed; both partially and fully covered misses are included.

Fig. 13. Cache word usage after insertion of prefetch instructions.


For Handwriting recognition, Raytrace and MPEG-2, the prefetch efficiency could be kept between 98 and 99.4%; for the Quake benchmark, again, we were not selective enough and reached an efficiency of 71%.

Finally, let us consider the energy-delay and how it is affected by the system with prefetching. Fig. 12 shows the energy-delay for all five systems. The most important observation is that the energy-delay of the system with prefetching is superior to that of all the other systems. As we have shown earlier, it also has the lowest energy dissipation and about the same performance as the system with the largest line size. The main reason for this can be seen in Fig. 13, where we show the cache block utilization. The system with prefetching has nearly as high a cache block utilization as the system with small 32 byte blocks in the L2 cache. Interestingly, Handwriting recognition exhibits a higher block utilization with prefetching. We tracked this down to the fact that the compiler makes different decisions at compile time with the prefetches inserted. The difference is, however, typically lower than 0.6%.

In summary, we have shown that large blocks can be devastating to the energy dissipation owing to their low cache block utilization, which brings a lot of useless data into the cache and causes energy dissipation peaks. By using small blocks, the energy-delay is minimized. Unfortunately, small blocks result in lower performance. We have shown that selective prefetching can be used to offset the performance discrepancy between small and large blocks without affecting the energy dissipation. We have also shown how emerging performance-debugging tools such as SimICS can be used to aid such a methodology.

5. Related work

Several other studies have looked at energy-efficiency issues in memory hierarchies. Our study differs from those in that we focus on an off-chip cache between the processor and main memory and vary the line size of the off-chip cache. Alberta and Bahar [1] looked at the design trade-offs of different techniques, such as adding a victim cache and different buffers, but did not look at trade-offs when varying the L2 line size. Hicks et al. [12] looked at the trade-offs of different cache and line sizes. They found that increasing the line size in the L1 cache tends to improve energy-efficiency: although the overall energy increased, the improved hit rate outweighed this penalty. One of the things they failed to notice was the detrimental effect of low cache block usage at increased block sizes. This effect has recently been observed by Kumar and Wilkerson [19]; however, they did not study its implications for energy-efficiency. Kin et al. [16] looked at a structure called the filter cache, which essentially acts as a small buffer between the L1 cache and the processor core to filter L1 cache accesses.

By doing this they were able to reduce the energy dissipation by 58% while reducing performance by 21%. This study only considered on-chip accesses and did not pay any attention to the rest of the memory hierarchy. Kamble and Ghose [14,15], Ko et al. [17], and Su and Despain [28] have investigated similar approaches in the sense that they filtered references with a smaller cache in front of the L1 cache.

Prefetching has been studied extensively in the literature in the context of enhancing performance. However, this study is, to the best of our knowledge, the first one to consider prefetching as a technique to improve energy-efficiency. Mowry and Gupta [23] also studied the gains of inserting prefetch instructions into the source code by hand. While some performance could be gained with just a few instructions, in accordance with what we have found, a considerable effort was needed to remove a majority of the misses. Baer and Chen [2] and Dahlgren et al. [6] considered hardware-based approaches to prefetching. While these approaches can be quite effective, they in general result in up to 30% more memory traffic, which would be completely inadequate for improving energy-efficiency. This motivated us to consider highly selective prefetching.

6. Conclusion

The trend towards larger block sizes in the L2 caches of modern computer systems continues as these caches get larger and there is more spatial locality to exploit. Indeed, it is not uncommon to find systems with 128 byte blocks [8,15]. However, we have found that an increased line size decreases the block utilization, which we showed can be devastating for the energy used. On average, we have found in this study that the energy used when moving from 32 to 128 byte blocks increased by 93%, with a peak of 165%. At the same time, the increase in performance was on average only 9% and never exceeded 18%. While the energy increase becomes smaller in the huge caches found in high-end servers, it is an issue for systems that cannot afford huge caches owing to energy, cost, or area constraints.

These observations motivated us to look into the possibility of inserting prefetch instructions into the code to enhance the performance of caches with small blocks without, or with only a negligible, increase in energy dissipation. We inserted prefetches in the benchmarks and on average observed an increase in performance of 9%, thereby achieving the low energy dissipation of small blocks and the high performance of large blocks. We have also demonstrated how performance-debugging tools can support a methodology for inserting prefetch annotations manually in cases where it is not possible to use a fully automated approach, i.e. a compiler, to insert prefetch instructions selectively.


Acknowledgements

We would like to extend our gratitude to a few people. First, we are grateful to Stylianos Perissakis at UC Berkeley, who provided us with the memory energy dissipation model. Secondly, thanks to Prof. Bertil Svensson, who has encouraged this research. This research is sponsored by the Swedish Foundation for Strategic Research (SSF) under the program Smart Sensors, by equipment grants from the Swedish Planning and Coordination of Research (FRN), and by Sun Microsystems Inc.

References

[1] G. Alberta, R.I. Bahar, Power and performance tradeoffs using various cache configurations, in: Proceedings of the Power-Driven Microarchitecture Workshop in Conjunction with ISCA'98, 28 June 1998.
[2] J.L. Baer, T.F. Chen, An effective on-chip preloading scheme to reduce data access penalty, Proc. Supercomputing (1991) 176-186.
[3] R.S. Bajwa, N. Schumann, H. Kojima, Power analysis of a 32-bit RISC microcontroller integrated with a 16-bit DSP, Proc. ISLPED'97 (1997) 137-142.
[5] S.C. Woo, M. Ohara, E. Torrie, J.P. Singh, A. Gupta, The SPLASH-2 programs: characterization and methodological considerations, in: Proceedings of the 22nd International Symposium on Computer Architecture, June 1995, pp. 24-36.
[6] F. Dahlgren, M. Dubois, P. Stenström, Sequential hardware prefetching in shared-memory multiprocessors, IEEE Trans. Parallel Distribut. Syst. 6 (7) (1995) 733-746.
[7] R.J. Evans, P.D. Franzon, Energy consumption modeling and optimization for SRAMs, IEEE J. Solid State Circuits 30 (5) (1995) 571-579.
[8] R. Fromm, S. Perissakis, N. Cardwell, B. McGaughy, C. Kozyrakis, D. Patterson, T. Anderson, K. Yelick, The energy efficiency of IRAM architectures, Proc. ISCA'97 (1997) 327-337.
[12] P. Hicks, M. Walnock, R.M. Owens, Analysis of power consumption in memory hierarchies, Proc. ISLPED (1997) 239-242.
[13] ID Software, Quake, October 1997. http://www.idsoftware.com.
[14] M.B. Kamble, K. Ghose, Analytical energy dissipation models for low power caches, Proc. ISLPED (1997) 143-148.
[15] M.B. Kamble, K. Ghose, Energy-efficiency of VLSI caches: a comparative study, in: Proceedings of the 10th International Conference on VLSI Design, pp. 261-267.
[16] J. Kin, M. Gupta, W.H. Mangione-Smith, The filter cache: an energy efficient memory structure, in: Proceedings of the 30th International Symposium on Microarchitecture, 1997, pp. 184-193.
[17] U. Ko, P.T. Balsara, A.K. Nanda, Energy optimization of multilevel cache architectures for RISC and CISC processors, IEEE Trans. Very Large Scale Integration Syst. 6 (2) (1998) 299-308.
[19] S. Kumar, C. Wilkerson, Exploiting spatial locality in data caches using spatial footprints, Ann. Int. Symp. Comput. Architect. (1998) 357-368.
[21] P. Magnusson, F. Larsson, A. Moestedt, B. Werner, F. Dahlgren, M. Karlsson, F. Lundholm, J. Nilsson, P. Stenström, H. Grahn, SimICS/sun4m: a virtual workstation, in: Proceedings of the USENIX 1998 Annual Technical Conference, 1998, pp. 119-130.
[22] T.C. Mowry, Tolerating latency through software-controlled data prefetching, PhD Dissertation, March 1994.
[23] T.C. Mowry, A. Gupta, Tolerating latency through software-controlled prefetching in scalable shared-memory multiprocessors, J. Parallel Distribut. Comput. 12 (2) (1991) 87-106.
[24] MPEG Organization, MPEG2 decoder, January 1996. http://www.mpeg.org.
[25] National Institute of Standards and Technology, Public domain OCR: NIST form-based handprint recognition system, February 1997. http://www.nist.gov/itl/div394/894.03/databases/defs/nist.ocr.html.
[27] V. Santhanam, E.H. Gornish, W.-C. Hsu, Data prefetching on the HP PA-8000, Ann. Int. Symp. Comput. Architect. (1997) 264-273.
[28] C.-L. Su, A.M. Despain, Cache designs for energy efficiency, Proc. Annu. Hawaii Int. Conf. Syst. Sci. (1995) 306-315.
[29] TcX AB, Detron HB and Monty Program KB, MySQL v3.22 Reference Manual, September 1998.
[31] Transaction Processing Performance Council, TPC benchmark D (decision support), Standard Specification Rev. 1.1, December 1995.
[33] J.T. Zawodny, E.W. Johnson, J.B. Brockman, P.M. Kogge, Cache-in-memory: a lower power alternative? in: Proceedings of the Power-Driven Microarchitecture Workshop in Conjunction with ISCA'98, 28 June 1998.

Jonas Jalminger received his MS in computer science and engineering in 1998 from Chalmers University of Technology, Sweden, where he is currently pursuing his PhD in computer engineering. His research interests include cache memory systems and techniques to enhance energy efficiency in such systems. He is a student member of the IEEE and the ACM.

Per Stenström is a Professor of Computer Engineering and vice-dean of the School of Computer Science and Engineering at Chalmers University of Technology, Sweden. His research interests are particularly devoted to design principles for high-performance and power-aware memory systems. He serves regularly on program committees of major conferences in the area and is an editor for IEEE Transactions on Computers and the Journal of Parallel and Distributed Computing. He was the General Chair of the ACM/IEEE 28th International Symposium on Computer Architecture. Dr. Stenström is a Senior Member of the IEEE and a member of the ACM.