Microprocessors and Microsystems 30 (2006) 268–279 www.elsevier.com/locate/micpro
PP-cache: A partitioned power-aware instruction cache architecture

Cheol Hong Kim a,*, Sung Woo Chung b, Chu Shik Jhon a

a Mobile System Laboratory, Department of Electrical Engineering and Computer Science, Seoul National University, Shilim-dong, Kwanak-gu, Seoul 151-742, South Korea
b Department of Computer Science and Engineering, Korea University, Anamdong, Sungbuk-ku, Seoul 136-701, Korea

Available online 6 January 2006

* Corresponding author. Tel.: +82 2 880 1829; fax: +82 2 888 1048. E-mail address: [email protected] (C.H. Kim).
Abstract

Microarchitects should consider energy consumption, together with performance, when designing instruction cache architecture, especially in embedded processors. This paper proposes a new instruction cache architecture, named Partitioned Power-aware instruction cache (PP-cache), for reducing dynamic energy consumption in the instruction cache by partitioning it into small sub-caches. When a request comes into the PP-cache, only one sub-cache is accessed by exploiting the locality of applications; the other sub-caches are not activated. The PP-cache reduces dynamic energy consumption by reducing the activated cache size and eliminating the energy consumed in tag matching. Simulation results show that the PP-cache reduces dynamic energy consumption by 34–56%. This paper also proposes a technique to reduce leakage energy consumption in the PP-cache, which dynamically turns off the lines that do not hold valid data. Simulation results show that the proposed technique reduces leakage energy consumption in the PP-cache by 74–85%.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Low-power design; Instruction cache; Partitioned cache; Dynamic energy; Leakage energy
1. Introduction

As applications continue to require more computational power, energy consumption in a processor increases dramatically. Unfortunately, high energy consumption in a processor requires high cooling and packaging costs. Moreover, it reduces the battery life of embedded systems. In general, caches consume a significant fraction of total processor energy [1]. For this reason, much research has focused on the energy efficiency of caches to reduce energy consumption in a processor. In caches, energy is consumed whenever caches are accessed (dynamic and leakage energy), and even when caches are idle (leakage energy).
Many techniques have been proposed to reduce dynamic energy consumption in the cache. The filter cache trades performance for energy consumption by filtering power-costly regular cache accesses through an extremely small cache [2]. Bellas et al. proposed a technique that uses an additional mini cache located between the L1 instruction cache and the CPU core in order to reduce signal switching activity and dissipated energy with the help of the compiler [3]. The selective-way cache provides the ability to
disable a subset of the ways in a set-associative cache during periods of modest cache activity to reduce energy consumption, while the full cache may remain operational for more cache-intensive periods [4]. Way-predicting set-associative caches initially access only one tag and data array based on a prediction mechanism, and access the other arrays only when the initial array does not produce a match, which leads to less energy consumption at the expense of longer access time [5,6].
As the number of transistors employed in a processor increases, leakage energy consumption becomes comparable to dynamic energy consumption. Reducing leakage energy in caches is especially important, because caches account for much of a chip's transistor count. Various techniques have been suggested to reduce leakage energy consumption in the cache. Powell et al. proposed gated-Vdd, a circuit-level technique to gate the supply voltage and reduce leakage energy in unused memory cells [7]. They also proposed the Dynamically Resizable Instruction (DRI) cache [7], which reduces leakage energy dissipation by resizing the cache dynamically, based on the gated-Vdd technique. Cache decay reduces leakage energy by invalidating and turning off cache lines when they hold data that are not likely to be reused, also based on the gated-Vdd technique [8]. The drowsy cache scheme, with a normal mode and a drowsy mode for each cache line, reduces leakage energy consumption with multi-level supply voltages [9].
In this paper, we focus on methods to reduce dynamic and leakage energy consumption in the L1 instruction
cache (iL1). The iL1 consumes a significant portion of total processor energy [10]. We propose a hardware technique to reduce dynamic energy consumption in the iL1 by partitioning it into several sub-caches. When a request comes into the proposed cache, only one predicted sub-cache is accessed; the other sub-caches are not activated, leading to dynamic energy reduction. To reduce leakage energy consumption in the proposed cache, we propose a technique that dynamically turns off the lines that do not hold valid data. We also propose a prefetching technique to reduce the miss rates of the proposed cache.
The rest of this paper is organized as follows. Section 2 describes the motivation of this paper and related research. Sections 3 and 4 present the traditional cache architecture and the proposed cache architecture, respectively. Section 5 discusses our evaluation methodology and shows detailed evaluation results. Finally, Section 6 concludes this paper.

2. Background

2.1. Motivation

The most important program property that we usually exploit is locality. Temporal locality means that recently accessed blocks are likely to be accessed in the near future. Spatial locality means that blocks whose addresses are near one another tend to be referenced close together in time. In this paper, we try to exploit the temporal and spatial locality of applications. Fig. 1 shows the probability that a cache request accesses the latest accessed page, obtained from our simulations using SimpleScalar [11]. From these observations, we found that almost all cache requests coming into the iL1 fall within the latest accessed page. This behaviour is even more pronounced in floating point applications (Fig. 1(b)).
To reduce the per-access dynamic energy consumption of the iL1 by exploiting this high temporal/spatial locality of applications, we propose a new instruction cache architecture called the Partitioned Power-aware instruction cache (PP-cache). The PP-cache is composed of several sub-caches, and each sub-cache is dedicated to one page. When a request from the processor comes into the PP-cache, only one sub-cache, which
was accessed just before, is accessed. The predicted sub-cache is expected to contain the requested instruction with very high probability, because it contains the page to which the previous instruction belongs. The proposed PP-cache targets embedded processors where tasks are not frequently switched. The PP-cache is not applicable to data caches, since temporal/spatial locality in data caches is inferior to that in instruction caches, as shown in Fig. 1.

2.2. Related research

One study closely related to ours is the partitioned instruction cache architecture (referred to as the PI-cache scheme throughout this section) proposed by Kim et al. [12]. In the PI-cache scheme, the iL1 is split into several sub-caches. All cache lines from a particular page are mapped to a specific sub-cache indicated by the TLB. When the iL1 is accessed, only one sub-cache, selected by an MRU mechanism, is accessed to reduce the per-access energy cost of the iL1. Each sub-cache in the PI-cache scheme contains multiple pages. In contrast, each sub-cache in our scheme is dedicated to only one page. Therefore, each sub-cache in the PI-cache scheme has to contain a tag array to identify the cache line, whereas each sub-cache in our scheme does not, leading to a larger reduction of per-access energy than the PI-cache scheme by eliminating tag lookup and comparison within the iL1. A cache design that reduces tag area cost by partitioning the cache has been proposed by Chang et al. [13]. They divide the cache into a set of partitions, and each partition is dedicated to a small number of pages in the TLB to reduce the tag area cost in the iL1. However, they did not consider energy consumption: when the cache is accessed, all partitions are accessed concurrently in their design.
As described in Section 2.1, almost all requests coming into the iL1 fall into the page that was accessed just before. Therefore, our scheme is expected to reduce energy consumption in the iL1 more than the PI-cache scheme, with little performance degradation.
Fig. 1. Probability that a cache request accesses the latest accessed page.
The PI-cache scheme can slightly reduce cache miss rates in some applications, but it requires very complex dynamic remapping, which may incur negative slack in the critical path. Each sub-cache in the PI-cache scheme has its own counter to count the number of cache misses that occur in the sub-cache. If the number of misses in a sub-cache exceeds a specified threshold, the PI-cache scheme remaps the page that caused the last miss in that sub-cache into the sub-cache with the fewest cache misses, to reduce conflict misses in the iL1. In contrast to the PI-cache scheme, conflict misses do not occur in the sub-caches of our scheme, since only one page is mapped to each sub-cache, whose size is equal to the page size. Therefore, our scheme is implemented with less hardware overhead than the PI-cache scheme, which makes the PP-cache scheme more suitable for embedded processors. Furthermore, the PI-cache scheme does not consider leakage energy consumption. In this paper, we propose a technique to reduce leakage energy consumption in the partitioned cache. Moreover, we propose a prefetching technique to reduce the miss rates of the partitioned cache.
3. Traditional instruction cache architecture

Fig. 2 depicts the traditional instruction cache architecture, including the instruction TLB. The instruction cache structure considered in this paper is a Virtually-Indexed, Physically-Tagged (VI-PT) cache. The VI-PT cache is used in many current processors to remove the TLB access from the critical path. In the VI-PT cache, the virtual address is used to index the iL1 while the TLB is concurrently looked up to obtain the physical address. After that, the tag from the physical address is compared with the corresponding tag bits from each block to find the block actually requested. The two-level TLB, a very common technique in low power processors (e.g. ARM11 [14]), consists of a micro TLB and a main TLB. The micro TLB is placed in front of the main TLB to filter accesses to the main TLB for low power consumption. When a miss occurs in the micro TLB, an additional cycle is required to access the main TLB. The purpose of the main TLB is to maintain high hit rates.
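To make the two-level TLB behaviour concrete, the following C sketch models a micro TLB that filters accesses to the main TLB. It is our own illustration rather than code from an actual design; the main TLB size follows Table 1, the micro TLB entry count is an assumption consistent with the configurations in Section 5, and all identifiers are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

#define MICRO_TLB_ENTRIES 4   /* assumed: four entries, as in the 16 KB configuration of Section 5 */
#define MAIN_TLB_ENTRIES  32  /* 32 entries, fully associative, as in Table 1 */

typedef struct {
    bool     valid;
    uint32_t vpn;   /* virtual page number   */
    uint32_t pfn;   /* physical frame number */
} tlb_entry_t;

static tlb_entry_t micro_tlb[MICRO_TLB_ENTRIES];
static tlb_entry_t main_tlb[MAIN_TLB_ENTRIES];

/* Translate a virtual page number; 'cycles' accumulates the lookup latency.
 * A micro TLB hit costs one cycle; a micro TLB miss adds one more cycle to
 * probe the main TLB (a main TLB miss would start a page-table walk, which
 * is omitted here). */
static bool translate(uint32_t vpn, uint32_t *pfn, unsigned *cycles)
{
    *cycles += 1;                               /* micro TLB lookup */
    for (int i = 0; i < MICRO_TLB_ENTRIES; i++)
        if (micro_tlb[i].valid && micro_tlb[i].vpn == vpn) {
            *pfn = micro_tlb[i].pfn;
            return true;
        }

    *cycles += 1;                               /* extra cycle for the main TLB */
    for (int i = 0; i < MAIN_TLB_ENTRIES; i++)
        if (main_tlb[i].valid && main_tlb[i].vpn == vpn) {
            *pfn = main_tlb[i].pfn;
            /* the translation would also be refilled into the micro TLB here */
            return true;
        }
    return false;                               /* main TLB miss */
}
```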
Fig. 2. Traditional instruction cache architecture (shaded part of storage array consumes dynamic energy in cache access).
Fig. 3. Cache access flow in the traditional cache.
The cache access flow in the traditional cache is summarized in Fig. 3. When an instruction fetch request from the processor comes into the iL1, the virtual address is used to determine the set. If none of the selected blocks is valid, the cache access is a miss. If there is a valid block, its tag is compared with the physical address obtained from the TLB to check whether it is really the requested block. If they match, the cache access is a hit.
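The same flow can be expressed as a short C sketch; it is an illustration under assumed parameters, not the simulated hardware. The set index is taken from the virtual address while the tag comparison uses the physical address supplied by the parallel TLB lookup. A 16 KB, four-way configuration with 32 byte lines is assumed here only for concreteness.

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_SIZE 32     /* bytes, as in Table 1                  */
#define NUM_SETS  128    /* assumed: 16 KB, four-way, for example */
#define NUM_WAYS  4

typedef struct {
    bool     valid;
    uint32_t tag;                 /* physical tag */
    uint8_t  data[LINE_SIZE];
} cache_line_t;

static cache_line_t icache[NUM_SETS][NUM_WAYS];

/* VI-PT lookup: the set index comes from the virtual address, while the tag
 * comparison uses the physical address obtained from the TLB lookup that is
 * performed in parallel. Returns true on a hit. */
static bool icache_lookup(uint32_t vaddr, uint32_t paddr, const uint8_t **line)
{
    uint32_t set  = (vaddr / LINE_SIZE) % NUM_SETS;    /* virtual index */
    uint32_t ptag = paddr / (LINE_SIZE * NUM_SETS);    /* physical tag  */

    for (int way = 0; way < NUM_WAYS; way++) {
        cache_line_t *l = &icache[set][way];
        if (l->valid && l->tag == ptag) {   /* valid block with matching tag */
            *line = l->data;
            return true;                    /* hit */
        }
    }
    return false;                           /* no valid matching block: miss */
}
```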
4. Partitioned power-aware instruction cache architecture

4.1. Reducing dynamic energy consumption

Energy consumption in the cache depends mainly on the cache configuration, such as cache size and associativity. In general, a small cache consumes less energy than a large cache. However, a small cache increases cache miss rates, which results in performance degradation. Thus, a large cache is inevitable for performance. The PP-cache is composed of small sub-caches in order to combine the advantages of a small cache and a large cache. When a request comes into the PP-cache, only the sub-cache that is predicted to hold the requested data is accessed, and the other sub-caches are not activated. Each sub-cache is dedicated to one page allocated in the micro TLB. The number of sub-caches in the PP-cache is equal to the number of entries in the micro TLB.
Therefore, there is a one-to-one correspondence between sub-caches and micro TLB entries.
Increasing the associativity of the cache to improve the hit rates has a negative effect on energy efficiency, since set-associative caches consume more energy than direct-mapped caches by reading data from all the lines that share the same index. For example, a four-way set-associative cache precharges and reads four ways but selects only one of them on a cache hit, wasting the dynamic energy dissipated in the other three ways. In the PP-cache, each sub-cache is therefore configured as a direct-mapped cache to improve energy efficiency. Each sub-cache is equal in size to a page. Therefore, the tag array in each sub-cache can be eliminated, because all the blocks within one page are mapped to exactly one sub-cache.
Fig. 4 depicts the proposed PP-cache architecture. There are three major changes compared with the traditional cache architecture. (1) An id field is added to each micro TLB entry to denote the sub-cache that corresponds to that entry; the id field indicates the sub-cache to which all cache blocks within the page are mapped. (2) There is a register called PSC (predicted sub-cache id) that stores the id of the latest accessed sub-cache. The access to the PP-cache is performed based on the information in the PSC register. (3) Tag arrays are eliminated in the PP-cache.
Instruction fetch in the PP-cache is performed as shown in Fig. 5.
Fig. 4. Partitioned Power-aware instruction Cache architecture (shaded part of storage array consumes dynamic energy in cache access).
Fig. 5. Cache access flow in the PP-cache.
When an instruction fetch request from the processor comes into the PP-cache, only one sub-cache, the one that was accessed just before, is accessed, based on the information stored in the PSC register. At the same time, the instruction TLB is accessed. If the access to the micro TLB is a hit, the requested data belongs to one of the pages mapped to the PP-cache. In the case of a hit in the micro TLB, the id of the matched micro TLB entry is compared with the value in the PSC register to verify the prediction. When the prediction is correct (the sub-cache id corresponding to the matched micro TLB entry is the same as that stored in the PSC register), a normal cache hit occurs if the data is found in the predicted sub-cache and a normal cache miss occurs if it is not. A normal cache hit and a normal cache miss mean a cache hit and a cache miss without penalty (the delay of another sub-cache access), respectively. If the prediction is not correct, the requested data belongs to another page in the PP-cache. In this case, the correct sub-cache is also accessed, which incurs an additional cache access penalty. If the data is found in the correct sub-cache, a cache hit with penalty occurs. If a cache miss occurs even in the correct sub-cache, a cache miss with penalty occurs. If a miss occurs in the micro TLB, the requested data does not belong to the pages mapped to the PP-cache, so a cache miss occurs. In this case, the sub-cache that corresponds to the page replaced from the micro TLB is flushed entirely (the replacement algorithm for the micro TLB is LRU). Each sub-cache can be easily flushed by resetting the valid bits of all cache blocks, because the instruction cache only allows read operations (no write-back is required). Then, incoming cache blocks that correspond to the newly allocated entry in the micro TLB are placed into the flushed sub-cache.
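The following C sketch summarizes this access flow for a 16 KB PP-cache with four 4 KB direct-mapped sub-caches (one per micro TLB entry). It is our own behavioural illustration of the mechanism described above; identifiers such as psc and id are hypothetical, and the LRU victim selection and L2 refill are omitted.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define PAGE_SIZE     4096                 /* sub-cache size = page size           */
#define LINE_SIZE     32
#define LINES_PER_SC  (PAGE_SIZE / LINE_SIZE)
#define NUM_SUBCACHES 4                    /* = number of micro TLB entries (16 KB) */

typedef struct {
    bool     valid;
    uint32_t vpn;                          /* virtual page number                  */
    uint32_t pfn;                          /* physical frame number                */
    uint8_t  id;                           /* id of the dedicated sub-cache        */
} utlb_entry_t;

typedef struct {
    bool    valid[LINES_PER_SC];
    uint8_t data[LINES_PER_SC][LINE_SIZE]; /* note: no tag array is needed         */
} subcache_t;

static utlb_entry_t micro_tlb[NUM_SUBCACHES];
static subcache_t   subcache[NUM_SUBCACHES];
static uint8_t      psc;                   /* PSC register: latest accessed sub-cache id */

typedef enum { HIT, HIT_WITH_PENALTY, MISS, MISS_WITH_PENALTY } access_result_t;

static access_result_t ppcache_fetch(uint32_t vaddr)
{
    uint32_t vpn  = vaddr / PAGE_SIZE;
    uint32_t line = (vaddr % PAGE_SIZE) / LINE_SIZE;

    /* The predicted sub-cache (PSC) and the micro TLB are probed in parallel. */
    bool predicted_hit = subcache[psc].valid[line];

    int entry = -1;
    for (int i = 0; i < NUM_SUBCACHES; i++)
        if (micro_tlb[i].valid && micro_tlb[i].vpn == vpn) { entry = i; break; }

    if (entry < 0) {
        /* Micro TLB miss: the sub-cache of the replaced page is flushed (all
         * valid bits reset) and re-dedicated to the newly allocated page. */
        int victim = 0;                    /* a real design would pick the LRU entry */
        memset(subcache[victim].valid, 0, sizeof subcache[victim].valid);
        /* allocate micro_tlb[victim] for vpn and fetch the missed line from L2 ... */
        psc = (uint8_t)victim;
        return MISS;
    }

    if (micro_tlb[entry].id == psc)        /* prediction verified against the PSC  */
        return predicted_hit ? HIT : MISS;

    /* Misprediction: the correct sub-cache is accessed in an extra cycle. */
    psc = micro_tlb[entry].id;
    return subcache[psc].valid[line] ? HIT_WITH_PENALTY : MISS_WITH_PENALTY;
}
```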
The PP-cache incurs little hardware overhead. The traditional micro TLB must be extended to incorporate the id for each entry. However, the number of bits for the id field is typically small: 2, 3, or 4 bits. One register is required for the PSC and one comparator is required to check the sub-cache prediction. This overhead is negligible.
The PP-cache is expected to reduce dynamic energy consumption by reducing the accessed cache size and eliminating the energy consumed in tag lookup and comparison. If the miss rates of the PP-cache do not increase much compared to those of the traditional cache, the PP-cache can be an energy-efficient alternative for the iL1. We expect that the miss rates of the PP-cache do not increase much, because we observed that accesses to the iL1 fall within the latest accessed page with very high probability, as shown in Section 2. Moreover, the PP-cache consumes less chip area than the traditional instruction cache by eliminating the tag arrays.

4.2. Prefetching for the PP-cache

In the PP-cache, there are no conflict misses, since one page is mapped to one sub-cache whose size is equal to the page size. However, there are more compulsory misses than in the traditional cache, because a sub-cache in the PP-cache is flushed whenever the corresponding entry in the micro TLB is replaced. Compulsory misses could be eliminated if the PP-cache transferred all cache blocks in a page simultaneously when the page is allocated in the micro TLB. However, the PP-cache does not transfer the whole page at once, because doing so may incur serious bus contention and significantly increase energy consumption in the lower level memory (L2 cache, main memory). For this reason, the PP-cache transfers a cache block from the lower level memory only when it is requested.

Fig. 6. Examples of allocation of a missed block.

To reduce compulsory misses in the PP-cache, we propose a simple hardware-based prefetching technique for the PP-cache, which is similar to the prefetch-on-miss technique [15]. Prefetching from the L2 cache is one of the techniques used to reduce the miss rates of the first level cache. The prefetch-on-miss scheme, proposed by Smith, simply prefetches the next block when a block is required due to a cache miss. For example, if block A00 is required due to a cache miss in the iL1, A00 is transferred to the iL1 and A01, the next block after the missed block, is prefetched to the iL1 (Fig. 6(a)). However, this scheme may increase the number of accesses to the L2 cache, resulting in more energy consumption in the L2 cache. In the prefetch-on-miss scheme, if A03 (the last sub-block of the L2 cache block) is required due to a cache miss, the next block (A04) is prefetched to the iL1 (Fig. 6(b)). In this case, the L2 cache has to be accessed again to find the block containing A04, because A03 and A04 are not located in the same L2 cache block. If A04 is not referenced by the processor, prefetching it dissipates unnecessary energy. To prevent this, the proposed prefetching technique prefetches the next block only when the requested block and the next block are located in the same L2 cache block. If the requested block is the last sub-block of the L2 cache block, our scheme does not prefetch the next block. We propose this simple prefetching technique, which incurs little area overhead, because the PP-cache targets embedded processors.
It is easy to determine whether the missed block is the last sub-block of the L2 cache block. The block address is divided into a tag, an index, and a block offset (Fig. 7(a)). The block offset of the L2 cache is divided into two fields: the lower field is the part overlapped by the block offset of the iL1, and the upper field is the remaining part, which indicates the sub-block within the L2 cache line. Fig. 7 shows an example. Fig. 7(b) and (c) depict the block offset masks of the iL1 and the L2 cache, respectively.
Fig. 7. Method to enable prefetch signal.
In this example, we assume that the line size of the iL1 is 32 bytes and that of the L2 cache is 128 bytes. Therefore, the bit width of the iL1 block offset is 5 bits and that of the L2 cache block offset is 7 bits. The lower field (lower 5 bits) of the L2 cache block offset is not used in the L2 cache. The L2 cache is searched by the upper field (shaded bits in Fig. 7(c)) to find the sub-block requested by the iL1. If the upper field is filled with '1's, the requested block is the last sub-block of the L2 cache block. As shown in Fig. 7(d), if we apply a bit-wise NAND to the upper field of the block offset to generate the prefetch enable signal, the prefetch enable signal is deasserted when the upper field of the block offset is filled with '1's. Consequently, the prefetch enable signal is asserted only when the missed block is not the last sub-block of the L2 cache block. If the prefetch enable signal is asserted, the next block after the missed block is prefetched to the PP-cache. By using this prefetching technique, the miss rates of the PP-cache are expected to decrease by reducing compulsory misses in the PP-cache, unless a taken branch is encountered.
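In software terms, the prefetch-enable decision of Fig. 7 reduces to checking whether the upper field of the L2 block offset is all ones, which the hardware implements as a bit-wise NAND. Below is a minimal C sketch under the line sizes assumed above (32 byte iL1 lines, 128 byte L2 lines); the function name is illustrative only.

```c
#include <stdint.h>
#include <stdbool.h>

#define IL1_LINE_SIZE 32    /* iL1 block offset: 5 bits */
#define L2_LINE_SIZE  128   /* L2 block offset: 7 bits  */

/* The upper field of the L2 block offset (2 bits here) selects the sub-block
 * inside an L2 line. Prefetching the next iL1 block is enabled only when the
 * missed block is NOT the last sub-block, i.e. when the upper field is not
 * all ones -- exactly the bit-wise NAND of those bits. */
static bool prefetch_enable(uint32_t miss_addr)
{
    uint32_t upper_field   = (miss_addr % L2_LINE_SIZE) / IL1_LINE_SIZE;
    uint32_t last_subblock = L2_LINE_SIZE / IL1_LINE_SIZE - 1;  /* field of all 1s */
    return upper_field != last_subblock;  /* NAND output: 0 only when all bits are 1 */
}

/* On an iL1 miss for miss_addr, the missed block is fetched from the L2 cache
 * and, if prefetch_enable() holds, the next sequential block (which lies in
 * the same L2 line) is prefetched as well, without a second L2 access. */
```

In hardware, this amounts to a NAND of the two upper offset bits on the existing miss path, which is consistent with the small area overhead claimed above.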
4.3. Reducing leakage energy consumption

Up to this point, we have focused on dynamic energy consumption, traditionally the major component of cache energy in current technology. However, leakage energy is expected to become more significant as threshold voltages decrease along with smaller supply voltages in upcoming chip generations [16]. There may be a number of lines that do not hold valid data in the PP-cache, since the sub-caches in the PP-cache are frequently flushed. Based on this observation, we propose a technique to reduce leakage energy consumption in the PP-cache by dynamically turning off the unused memory cells. To turn off the lines in the PP-cache, we use the circuit-level technique called gated-Vdd proposed by Powell et al. [7]. The key idea in this technique is to insert a 'sleep' transistor between the ground and the SRAM cells of a cache line. When a line is turned off by using this technique, the leakage energy dissipation of the cache line can be considered negligible, with a small area overhead, which is reported to be about 5% [7].
In the PP-cache, if a miss occurs in the micro TLB, the sub-cache that corresponds to the page replaced from the micro TLB is flushed. At this time, all cache lines in the flushed sub-cache are turned off using the gated-Vdd technique. At initialization, all cache lines in the PP-cache are turned off to save leakage energy. Then, each cache line is turned on when the first access to the line is requested. In other words, each cache line is turned on while data is being read from the lower level memory after a compulsory miss. After that, the cache line is not turned off until the sub-cache is flushed. At the time of the first access to each cache line, a compulsory miss occurs inevitably anyway. Therefore, leakage energy consumption in the PP-cache is expected to be reduced with no performance degradation by this technique. Moreover, no hardware overhead other than the gated-Vdd is required for this technique.
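A behavioural sketch of this line management is given below. Gated-Vdd itself is a circuit-level mechanism; the sketch only tracks, per line, whether its sleep transistor is assumed to be gated, using the sub-cache layout assumed earlier. All identifiers are hypothetical.

```c
#include <stdbool.h>
#include <string.h>

#define SC_LINES 128   /* 4 KB sub-cache / 32 byte lines */

typedef struct {
    bool valid[SC_LINES];
    bool powered[SC_LINES];   /* models the state of the gated-Vdd sleep transistor */
} subcache_leak_t;

/* At initialization, and whenever a sub-cache is flushed because its micro TLB
 * entry is replaced, every line is invalidated and turned off. */
static void flush_and_power_down(subcache_leak_t *sc)
{
    memset(sc->valid,   0, sizeof sc->valid);
    memset(sc->powered, 0, sizeof sc->powered);
}

/* A line is turned on only while its compulsory miss is being serviced from
 * the lower level memory, so no extra performance penalty is introduced; it
 * then stays on until the next flush of the sub-cache. */
static void fill_line(subcache_leak_t *sc, int line)
{
    sc->powered[line] = true;   /* wake the line while refilling it */
    sc->valid[line]   = true;
}
```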
5. Experiments

In order to determine the characteristics of the proposed scheme with respect to the traditional cache schemes, we simulated various benchmarks using the SimpleScalar simulator [11]. The CACTI cache energy model was used to collect the power parameters, assuming a 0.18 μm technology [17]. The simulated applications are selected from the SPEC CPU2000 suite [18]. The main processor and memory hierarchy parameters used in this simulation are shown in Table 1. The simulated processor is a two-way superscalar processor with an L2 cache, which is expected to be similar to the next generation embedded processor by ARM [19].

Table 1
System parameters

Parameter                      Value
Functional units               2 INT ALUs, 2 FP ALUs, 1 INT multiplier/divider, 1 FP multiplier/divider
Fetch, decode, issue, commit   Two instructions/cycle
Branch predictor               Bimodal
Micro TLB                      Fully associative, one cycle latency
Main TLB                       32 entries, fully associative, one cycle latency, 30 cycle miss penalty
L1 I-cache                     16 and 32 KB, one-way to eight-way, 32 byte lines, one cycle latency
Sub-cache in the PP-cache      4 KB (page size), one-way, 32 byte lines, one cycle latency
L1 D-cache                     32 KB, four-way, 32 byte lines, one cycle latency, write-back
L2 cache                       256 KB unified, four-way, 128 byte lines, eight cycle latency, write-back
Memory                         64 cycle latency

5.1. Instruction fetch delay and IPC

The normalized delay to fetch instructions is given in Fig. 8. In the graphs, ppc denotes the proposed PP-cache and ppcp denotes the PP-cache with prefetching. The L1 Lookup portion of each bar represents the cycles required for iL1 accesses. The L1 Miss portion denotes the cycles incurred by iL1 misses. The Overhead in ppc portion denotes the cycles of delay incurred by sub-cache misprediction in the PP-cache scheme, namely the penalty of accessing another sub-cache after a misprediction. Note that the traditional cache schemes have no Overhead in ppc portion. As shown in these graphs, set-associative caches show less delay than direct-mapped caches by reducing the delay due to cache misses. This comes from the fact that set-associative caches improve the hit rates compared to direct-mapped caches by reducing conflict misses. Therefore, the instruction fetch delay decreases as the degree of associativity increases. As shown in Fig. 8(a) and (b), the instruction fetch delay of the 16 KB PP-cache is increased by 18% (CINT) and 12% (CFP) on average compared to that of the traditional direct-mapped cache. The instruction fetch delay of the 16 KB PP-cache with prefetching is increased by 12% (CINT) and 9%
(CFP) on average. As shown in the graphs, the delay due to cache misses is reduced by the proposed prefetching technique. The reduction of instruction fetch delay by prefetching is most evident in applications where the branch instruction ratio (total number of branch instructions/total number of instructions) is small, because the proposed technique prefetches the block following the missed block on the assumption that instructions are accessed sequentially. The instruction fetch delay of the 32 KB PP-cache is increased by 10% (CINT) and 5% (CFP) on average. The instruction fetch delay of the 32 KB PP-cache with prefetching is increased by 8% (CINT) and 3% (CFP). This increase of instruction fetch delay has two causes: one is the degradation of the hit rates caused by restricting the blocks that can be allocated in the PP-cache to those within the pages mapped to the micro TLB entries. The other is sub-cache misprediction, which incurs an additional sub-cache access delay, indicated by the Overhead in ppc portion.
The normalized Instructions Per Cycle (IPC) for each cache model is shown in Fig. 9. As shown in these graphs, the performance of the processor with the 16 KB PP-cache is degraded by 9% (CINT) and 5.9% (CFP) on average compared to that with the 16 KB traditional direct-mapped cache. The performance of the processor with the 16 KB PP-cache and prefetching is degraded by 6.1% (CINT) and 4.2% (CFP) on average.
Fig. 8. Normalized instruction fetch delay.
Fig. 9. Normalized IPC.
The performance of the processor with the 32 KB PP-cache is degraded by 4.9% (CINT) and 2.8% (CFP) on average. The performance of the processor with the 32 KB PP-cache and prefetching is degraded by 3.6% (CINT) and 2.0% (CFP) on average.
As shown in Figs. 8 and 9, the performance gap between the traditional cache and the PP-cache decreases as the cache size increases. This is because the hit rates of the PP-cache improve as the number of pages mapped to the PP-cache increases: the 32 KB PP-cache has eight sub-caches, compared with four sub-caches for the 16 KB PP-cache. As shown in Fig. 8, the L1 Miss portion of the bars decreases significantly in the 32 KB PP-cache compared to the 16 KB PP-cache. From these results, the PP-cache is expected to be more efficient as the cache size becomes larger. The PP-cache shows better performance in CFP applications than in CINT applications (Figs. 8 and 9), because floating point applications have better locality than integer applications, as seen in Section 2. Notably, the PP-cache shows better performance than the traditional direct-mapped cache on the gzip application. This performance gain comes from improved hit rates due to fewer conflict misses.
The results shown in Figs. 8 and 9 are obtained with the configurations in Table 1. We simulated all cache configurations with the same cache access latency (one cycle).
In fact, the physical access time varies with the cache configuration. Table 2 gives the physical access time for each cache model, obtained from the CACTI model. For the traditional cache models, the physical cache access time generally increases with the degree of associativity. Way-prediction designs have been proposed for fast L1 cache access in set-associative caches [5,6]. Way-prediction enables the adoption of set-associative caches in upcoming processor designs even though iL1 access time is timing-critical. However, we do not consider way-prediction for the traditional set-associative caches in calculating physical cache access time, since we do not consider the negative effects on the hit rates caused by way-prediction, which may lead to performance degradation. The physical access time of the PP-cache is "one AND gate delay (required to enable the sub-cache indicated by the PSC register; 0.114 ns, obtained from the ASIC STD130 DATABOOK by Samsung Electronics [20]) + sub-cache access latency (0.873 ns, obtained from CACTI) + output selection latency (obtained from CACTI)".

Table 2
Physical cache access time

Access time (ns)   One-way   Two-way   Four-way   Eight-way   PP-cache
16 KB              1.129     1.312     1.316      1.383       1.042
32 KB              1.312     1.455     1.430      1.457       1.049

As shown in Table 2, the physical access time of the PP-cache is less than that of the traditional direct-mapped cache, because the accessed cache size in the PP-cache is smaller than that in the traditional cache. This advantage is more pronounced for the 32 KB iL1 than for the 16 KB iL1. The physical access time of the traditional cache increases as the cache size increases. However, the physical access time of the PP-cache is mainly dependent on the sub-cache size. Consequently, the access time of the PP-cache increases only slightly as the cache size increases (the increase comes from increased output selection latency). Therefore, if the processor clock speeds up or the cache size increases in the future, the proposed PP-cache is expected to be even more favorable.
The performance of the PP-cache is likely to be degraded significantly compared to the traditional cache if two functions in different pages call each other repeatedly. In this case, the accuracy of sub-cache prediction in the PP-cache will decrease significantly. To prevent this case, a technique such as scatter loading, which can place the instructions of functions that call each other repeatedly into the same page, can be used. On the other hand, architecture-level techniques to improve the hit rates of the PP-cache in this case could be a promising direction for future work.

5.2. Dynamic energy consumption

Table 3 shows per-access dynamic energy consumption for each cache model, obtained from CACTI.

Table 3
Per-access dynamic energy consumption

Energy (nJ)   One-way   Two-way   Four-way   Eight-way   PP-cache
16 KB         0.473     0.634     0.935      1.516       0.247
32 KB         0.621     0.759     1.059      1.666       0.250

In the traditional cache, the energy consumed by the cache increases as the degree of associativity increases. The increase in associativity implies an increase in output drivers, comparators, and sense amplifiers, and consequently an increase in total energy. The PP-cache consumes less per-access dynamic energy than traditional caches. There are two reasons for this better energy efficiency: one is that the accessed cache size in the PP-cache is smaller than that in the traditional cache, because the cache is partitioned into several sub-caches. The other is the elimination of accesses to the tag arrays. As shown in Table 3, per-access dynamic energy in the traditional caches increases as the cache size increases.
Fig. 10. Normalized dynamic energy consumption.
By contrast, per-access dynamic energy in the PP-cache grows little as the cache size increases, since per-access dynamic energy in the PP-cache is mainly dependent on the sub-cache size. This property is especially favorable in a large cache.
Detailed dynamic energy consumption, obtained from SimpleScalar and CACTI together, is shown in Fig. 10. The total dynamic energy consumption presented in these graphs is the sum of the dynamic energy consumed during instruction fetch. In these graphs, the L1 Access portion denotes the dynamic energy consumed during iL1 accesses, and the L1 Miss portion represents the dynamic energy consumed while accessing lower level memory due to misses in the iL1. As shown in the graphs, the PP-cache shows less dynamic energy consumption than the traditional caches. The 16 KB PP-cache reduces dynamic energy consumption by 34% (CINT) and 38% (CFP) on average compared to the traditional direct-mapped cache. The 32 KB PP-cache reduces dynamic energy consumption by 54% (CINT) and 56% (CFP) on average, as shown in Fig. 10. As expected, the PP-cache is more energy efficient with a large cache. Moreover, the PP-cache is expected to become even more energy efficient in the near future, since the physical access time of the PP-cache is shorter than that of the traditional cache.
In order to determine the characteristics of the PP-cache with respect to the traditional caches when the cache size or the application size is small, we also simulated various small benchmarks (SPEC test input data) with a small iL1 (4 and 8 KB) using SimpleScalar. According to the simulations, a small PP-cache (4 or 8 KB) shows a worse dynamic energy-delay product (= average per-access cache dynamic energy consumption × average cache access delay) than small traditional caches (4 and 8 KB), because the number of pages in the PP-cache is too small. However, the 16 KB PP-cache shows a better energy-delay product than smaller traditional caches (4 and 8 KB). It also shows a better energy-delay product than the same-size (16 KB) traditional cache (Figs. 8 and 10). The 32 KB PP-cache shows the best energy-delay product. Therefore, the proposed PP-cache is expected to be energy-efficient in various environments.

5.3. Leakage energy consumption

Fig. 11 shows the normalized leakage energy consumption in the PP-cache. In these graphs, ppc denotes the original PP-cache scheme and ppcl denotes the PP-cache scheme adopting the technique to reduce leakage energy consumption. Each bar in the graphs is obtained by assuming that leakage energy is proportional to ((average number of lines turned on × average turned-on time)/(total number of lines in the iL1 × execution time)), based on the results from simulations with SimpleScalar.
Fig. 11. Normalized leakage energy consumption.
According to these graphs, the proposed technique reduces leakage energy consumption in the 16 KB PP-cache by 85% (CINT) and 77% (CFP) on average compared to the original PP-cache. It reduces leakage energy consumption in the 32 KB PP-cache by 74% (CINT) and 75% (CFP) on average. The leakage energy reduction is larger in the 16 KB PP-cache, because each sub-cache in the 16 KB PP-cache is flushed more frequently than in the 32 KB PP-cache. From these observations, we found that a large portion of the PP-cache is turned off during execution. The sub-cache size in the PP-cache could be made smaller than the page size to reduce these unused memory cells. However, if we decrease the sub-cache size, tag comparison becomes necessary in the PP-cache and additional control logic has to be inserted. Moreover, performance degradation will be incurred due to conflict misses in the sub-caches. Methods to decrease the sub-cache size with little effect on performance could also be a promising direction for future work.

6. Conclusions

We have introduced a new instruction cache design called the Partitioned Power-aware instruction cache (PP-cache) to reduce energy consumption in a processor. The PP-cache is composed of several sub-caches, and each sub-cache is dedicated to only one page. When a request from the processor comes into the PP-cache, only one predicted sub-cache is accessed for energy efficiency. The proposed PP-cache reduces dynamic energy consumption significantly compared to traditional caches. The energy efficiency of the PP-cache comes from two sources: one is reducing the accessed cache size and the other is eliminating tag lookup and comparison. Moreover, the performance loss is small, considering the reduced physical cache access time. We also proposed a simple prefetching technique to decrease the miss rates of the PP-cache by reducing the number of compulsory misses. To reduce leakage energy consumption in the PP-cache, we proposed a technique that dynamically turns off lines that do not hold valid data. The proposed method reduces leakage energy consumption in the PP-cache significantly without any performance loss. Therefore, the PP-cache is expected to be a scalable solution for large instruction caches.

References

[1] S. Segars, Low power design techniques for microprocessors, in: Proceedings of the International Solid-State Circuits Conference, 2001.
[2] J. Kin, M. Gupta, W. Mangione-Smith, The filter cache: an energy efficient memory structure, in: Proceedings of the International Symposium on Microarchitecture, 1997, pp. 184–193.
[3] N. Bellas, I. Hajj, C. Polychronopoulos, Using dynamic cache management techniques to reduce energy in a high-performance processor, in: Proceedings of the International Symposium on Low Power Electronics and Design, 2000, pp. 90–95.
[4] D.H. Albonesi, Selective cache ways: on-demand cache resource allocation, in: Proceedings of the International Symposium on Microarchitecture, 1999, pp. 70–75.
[5] K. Inoue, T. Ishihara, K. Murakami, Way-predicting set-associative cache for high performance and low energy consumption, in: Proceedings of the International Symposium on Low Power Electronics and Design, 1999, pp. 273–275.
[6] M. Powell, A. Agarwal, T.N. Vijaykumar, B. Falsafi, K. Roy, Reducing set-associative cache energy via way-prediction and selective direct-mapping, in: Proceedings of the International Symposium on Microarchitecture, 2001, pp. 54–65.
[7] M. Powell, S.H. Yang, B. Falsafi, K. Roy, T.N. Vijaykumar, Gated-Vdd: a circuit technique to reduce leakage in deep-submicron cache memories, in: Proceedings of the International Symposium on Low Power Electronics and Design, 1999, pp. 64–69.
[8] S. Kaxiras, Z. Hu, M. Martonosi, Cache decay: exploiting generational behavior to reduce leakage power, in: Proceedings of the International Symposium on Computer Architecture, 2001, pp. 148–157.
[9] K. Flautner, N.S. Kim, S. Martin, D. Blaauw, T. Mudge, Drowsy caches: simple techniques for reducing leakage power, in: Proceedings of the International Symposium on Computer Architecture, 2002, pp. 148–157.
[10] J. Montanaro, et al., A 160 MHz, 32 b, 0.5 W CMOS RISC microprocessor, in: Proceedings of the International Solid-State Circuits Conference, 1996, pp. 214–229.
[11] D. Burger, T.M. Austin, S. Bennett, Evaluating future microprocessors: the SimpleScalar tool set, Tech. Rep. TR-1308, University of Wisconsin-Madison Computer Sciences Department, 1997.
[12] S.T. Kim, N. Vijaykrishnan, M. Kandemir, A. Sivasubramaniam, M.J. Irwin, Partitioned instruction cache architecture for energy efficiency, ACM Transactions on Embedded Computing Systems 2 (2) (2003) 163–185.
[13] Y.J. Chang, F. Lai, S.J. Ruan, Cache design for eliminating the address translation bottleneck and reducing the tag area cost, in: Proceedings of the International Conference on Computer Design, 2002, pp. 334–339.
[14] ARM, ARM1136J(F)-S, http://www.arm.com/products/CPUs/ARM1136JF-S.html
[15] A.J. Smith, Cache memories, Computing Surveys 14 (3) (1982) 473–530.
[16] S. Borkar, Design challenges of technology scaling, IEEE Micro (1999) 23–29.
[17] P. Shivakumar, N.P. Jouppi, CACTI 3.0: an integrated cache timing, power, and area model, Tech. Rep. TR-WRL-2001-2, Compaq Western Research Laboratory, 2001.
[18] SPEC, SPEC CPU2000 Benchmarks, http://www.specbench.org
[19] M. Muller, At the heart of innovation, http://www.arm.com/miscPDFs/6871.pdf
[20] Samsung Electronics, ASIC STD130 Databook, http://www.samsung.com