Reducing cache and TLB power by exploiting memory region and privilege level semantics

Zhen Fang a,*, Li Zhao b, Xiaowei Jiang b, Shih-lien Lu b, Ravi Iyer b, Tong Li b, Seung Eun Lee c

a AMD Corp., 7171 Southwest Parkway, Austin, TX 78735, USA
b Intel Corp., 2111 NE 25th Ave, Hillsboro, OR 97124, USA
c Dept. of Electronic and Information Engineering, Seoul National University of Science and Technology, South Korea

Article info

Article history: Received 7 December 2012; Received in revised form 19 February 2013; Accepted 8 April 2013; Available online 23 April 2013.

Keywords: First-level cache; Translation lookaside buffer; Memory regions; Ring level; Simulation

Abstract

The L1 cache in today's high-performance processors accesses all ways of a selected set in parallel. This constitutes a major source of energy inefficiency: at most one of the N fetched blocks can be useful in an N-way set-associative cache. The other N-1 cachelines will all be tag mismatches and subsequently discarded. We propose to eliminate unnecessary associative fetches by exploiting certain software semantics in cache design, thus reducing dynamic power consumption. Specifically, we use memory region information to eliminate unnecessary fetches in the data cache, and ring level information to optimize fetches in the instruction cache. We present a design that is performance-neutral, transparent to applications, and incurs a space overhead of a mere 0.41% of the L1 cache. We show significantly reduced cache lookups with benchmarks including SPEC CPU, SPECjbb, SPECjAppServer, PARSEC, and Apache. For example, for SPEC CPU 2006, the proposed mechanism reduces cache block fetches from the data and instruction caches by an average of 29% and 53% respectively, resulting in power savings of 17% and 35% in the caches, compared to the aggressively clock-gated baselines.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

In the past decade, the general-purpose microprocessor industry has gone through a shift from performance-first to energy-efficient computing. Computer architects have invented and investigated methods to optimize the power consumption of virtually every component of the processor. In contrast with most other units of the chip, the first-level cache in commercial general-purpose processors has seen rather limited application of micro-architectural power efficiency optimizations. The primary reason for this lack of change is the L1 cache's critical role in system performance. In fact, most architecture-level power optimizations to caches are applied to the lower levels (i.e., L2 and L3), where decent power savings at the cost of slightly prolonged hit latencies and marginally decreased hit rates are considered acceptable.

Footnote: This is an extension of the paper "Reducing L1 Caches Power By Exploiting Software Semantics", published in the International Symposium on Low Power Electronics and Design (ISLPED), July 30 – August 1, 2012. Compared with the ISLPED paper, new material in this submission mainly includes: optimizations to the instruction TLB; detailed analysis of the counter-intuitive phenomenon of high I-cache occupancy by kernel code; detailed analysis of the cache occupancy and reactivity of memory accesses with different semantics; evaluation with a much richer set of benchmarks; and sensitivity studies including SMT, OoO, and different cache setups. This work was done when all authors were employed by Intel. Corresponding author: Tel.: +1 5037506614; E-mail: [email protected] (Z. Fang).


These compromises in performance are usually not acceptable for L1 caches because of the risk of impacting overall system performance. After all, every code fetch and every data memory reference rely on the L1 cache's timely response to keep the computing pipeline busy. In this study, we attempt to improve the power efficiency of the L1 cache and Translation Lookaside Buffer (TLB) without lengthening the access cycle or impacting the hit rate. The key idea is to eliminate unnecessary lookups in the cache by exploiting software semantics. In our proposal, a cached data block is fetched only if its memory region matches that of the incoming request, and a cached instruction block is fetched only if its ring level matches that of the incoming request. Compared with existing power reduction solutions for high-performance L1 caches, a distinctive feature of our design is that it is performance-neutral: the optimizations are achieved without reducing effective set-associativity or serializing cache access phases.

1.1. Background

In this section, we give an overview of the L1 caches in today's high-performance microprocessors and explain why the prevailing design is inefficient. We also introduce the software semantics that will be exploited.

1.1.1. High-performance L1 caches

L1 caches in high-performance processors have been aggressively optimized to achieve low latency for hits. Fig. 1 shows representative timing diagrams of a read hit in a physically-tagged set-associative L1 cache. In Fig. 1, the index bits come from the page offset, so the cache is both virtually- and physically-indexed. Because the cache tags use physical address bits, tag comparison cannot start until the TLB has completed the virtual-to-physical address translation. To meet the tight timing target, an N-way set-associative L1 cache accesses all N data blocks of the selected set in parallel, irrespective of the tag lookup result. At least N-1 blocks will be discarded later, resulting in low power efficiency. This differs from L2/L3 or low-power caches, where tag check and data readout are serialized. The L1 data and instruction caches of virtually all modern general-purpose microprocessors are implemented in this manner.

Virtual caching (VC) can remove the TLB access from the L1 cache hit's critical path by using virtual address bits to tag the cache blocks [1,2]. However, VC is rarely found in real products due to a number of issues, including synonyms (multiple virtual addresses in the same process map to the same physical page), homonyms (the same virtual address exists in multiple processes, each mapped to a different physical page), physical address-based cache coherency, and page mapping/permission changes. A physically-tagged cache is not subject to these challenges, but in order to meet the performance target for read hits, it has to speculatively fetch all blocks in a set before the tag comparison result is available. Parallel reads from the data RAM result in low power efficiency. In later sections of the paper, we eliminate some of these speculative cacheline fetches by utilizing two software semantics, overviewed next.

For the power consumption and low-level implementation of a cache, we refer readers to a tutorial from Western Research Lab [3]. Two contributors dominate dynamic cache power consumption: bitline discharge and the sense amplifiers. Both scale with the number of blocks read from the cache arrays, so the dynamic power dissipation of a cache is largely proportional to the number of fetches it performs.
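The proportionality noted above can be illustrated with a rough, back-of-the-envelope sketch (not taken from the paper); the associativity is the paper's 8-way example, and the per-block energy value is an arbitrary placeholder used only to show the scaling.

```python
# Illustrative sketch: per-read block-fetch counts for a conventional parallel-lookup L1
# versus an ideal lookup that touches only the (at most one) hitting way.
WAYS = 8                    # N-way set-associative L1
ENERGY_PER_BLOCK_READ = 1.0  # arbitrary unit; dynamic energy scales with blocks read

def read_energy(blocks_fetched):
    """Dynamic read energy is roughly proportional to the number of blocks fetched."""
    return blocks_fetched * ENERGY_PER_BLOCK_READ

parallel = read_energy(WAYS)  # conventional L1: all N ways read in parallel
ideal = read_energy(1)        # at most one way can actually hit
print(f"wasted fraction: {(parallel - ideal) / parallel:.0%}")  # 88% for 8 ways
```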

1.1.2. Bifurcation of the address space

We will leverage the virtual memory region partitioning assumed by software to optimize the data cache. Fig. 2(a) shows a representative memory region definition on 32-bit machines; 64-bit address spaces are similar. In most applications, dynamic usage of virtual memory goes predominantly to the user stack and heap regions. The stack starts from a high address and grows downward, while the heap starts from a low address and grows upward. The bifurcated design of this virtual memory system implies that, in the virtual address space, one bit can differentiate the stack and heap regions. On most platforms, this bit is the most significant bit of the address, as in Fig. 2(a).
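A minimal sketch of this classification follows, assuming a 32-bit virtual address space in which the most significant bit separates the high region (user stack and kernel) from the low region (heap, BSS, data), as in Fig. 2(a); the example addresses are typical of a 32-bit Linux layout and are illustrative only.

```python
VA_BITS = 32

def vtag(vaddr: int) -> int:
    """Return the 1-bit memory-region tag: the most significant virtual address bit."""
    return (vaddr >> (VA_BITS - 1)) & 1

print(vtag(0xBFFF_F000))  # 1 -> high region (user stack grows down from near here)
print(vtag(0x0804_A000))  # 0 -> low region (heap/BSS/data grow upward)
```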

The virtual memory division semantics are completely lost in the physical memory space, due to the way the operating system manages page frames. As a result, physically-tagged caches are not cognizant of the semantics of the stored data. In reality, except for pathological cases, stack and heap do not share physical pages. Fig. 2(b) shows that on average, about 40% of all data memory references in SPEC CPU2000 are accesses to the stack (the experimental methodology is explained in Section 4.1). The remaining 60% consists of accesses to the heap, BSS and data segments, with the heap being the dominant component. In the rest of the paper, we use the term 'heap' to refer to all three non-stack segments: heap, BSS and data. Also shown in Fig. 2(b) is the occupancy of each application's stack memory in a 32 KB, 8-way set-associative cache. Although the stack accounts for 40% of all memory instructions, on average it occupies only 20% of the cache capacity, a result of the stack's relatively small memory footprint [4]. For each read from stack memory, an N-way set-associative L1 cache has to check all N ways in the selected set. Statistically, for each stack access the majority of the N lookups are unnecessary (based on non-stack data's 80% occupancy), since none of the ways holding heap data will generate a match. A similar scenario occurs for heap memory references. This observation motivates our optimization to the data cache.

1.1.3. Ring levels and instruction cache sharing

Ring level is the software semantic that we use to optimize the instruction cache. It is a mechanism by which the operating system (OS) and the processor hardware cooperate to restrict what user-mode programs can do. All mainstream processors, including x86, Power, SPARC, and Itanium, use a similar mechanism; for brevity, we base our discussion on the x86 architecture. When a user program executes, the processor is set to ring 3, the least privileged level. When the kernel executes, the processor is set to ring 0, the most privileged level. Rings 1 and 2 have been largely unused except by virtualization layers. The processor's current privilege level is stored in the 2-bit CPL field of the Code Segment register. Segment protection hardware uses the CPL value to determine whether the current process has enough privilege to execute a machine instruction or access a particular memory address.

User applications and the OS do not share code, but they share the instruction cache. There have been a number of papers studying the interaction of the OS and the microarchitecture [5–9]. Most of them focus on the total execution time or overall cache miss rates of kernel instructions. One question related to instruction cache sharing between user and kernel modes remains unanswered: at any given time, what percentage of the I-cache capacity is used by kernel code? Computation-intensive applications spend the vast majority of their execution time in user mode. Intuition would suggest that the instruction cache is constantly filled with user application code, disturbed only occasionally by kernel instructions brought in by events like system calls. We found that this intuition is not correct. Fig. 3 shows histograms of the dynamic instruction counts and instruction cache occupancies of kernel-mode code in steady stages of four SPEC2000 benchmarks. The cache is 32 KB, 8-way set-associative; the data collection methodology is explained later in Section 4.1.
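The two quantities plotted in Fig. 3 can be derived from a simulator roughly as in the hedged sketch below: per interval, compare the share of retired instructions executed in ring 0 with the share of I-cache lines whose last fill came from ring-0 code. The function names and the toy numbers are illustrative, not from the paper's tools.

```python
def kernel_instr_share(interval_instrs):
    """interval_instrs: ring level (0 = kernel, 3 = user) of each retired instruction."""
    return sum(1 for r in interval_instrs if r == 0) / len(interval_instrs)

def kernel_occupancy(cache_snapshot):
    """cache_snapshot: per-line ring level at interval end (None for invalid lines)."""
    valid = [r for r in cache_snapshot if r is not None]
    return sum(1 for r in valid if r == 0) / len(valid)

# Toy interval: 2.5% of instructions are kernel code, yet kernel lines fill ~60% of a
# 512-line (32 KB, 64 B/line) I-cache -- the counter-intuitive pattern seen in Fig. 3.
instrs = [0] * 25 + [3] * 975
snapshot = [0] * 307 + [3] * 205
print(f"{kernel_instr_share(instrs):.1%}, {kernel_occupancy(snapshot):.1%}")
```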

Fig. 1. Read pipeline in a high-performance L1 cache.


Fig. 2. Virtual memory: software’s view and SPEC CPU2000 profiling results.

Each data point in the figures represents the average value over an interval of 25 microseconds. In each figure, we also give the arithmetic mean values of both bands of data averaged over the 3400 intervals. As can be seen from the figures, although user code dominates the instruction count, kernel instructions occupy a significant portion of the cache, clearly contradicting the expectation that the instruction cache is filled with application code. In gzip and equake, most of the instruction cachelines actually contain kernel code. The phenomenon can be explained by the different characteristics of kernel and user programs. User code usually spends the majority of its time in nested loops, while the OS is more likely to traverse non-iterative decision trees. Detailed analysis of the instruction memory footprint and loop behavior of the OS can be found in [10,11]. For a comparable number of dynamic instructions, the kernel usually has a much larger instruction memory footprint and less locality, and therefore occupies more I-cache entries. The tight loops in user code such as the SPEC CPU benchmarks, on the other hand, have rather small static instruction memory sizes; user application execution tends to repeatedly fetch instructions from a few hot ways of hot sets, which allows the (rarely used) kernel code to stay for a long time before being evicted.

To see how increased pressure for cache space may affect kernel code occupancy, we ran a multiprogrammed mix of SPEC workloads on a simultaneous multithreading (SMT) core. The data are shown in Fig. 4, in which 1T, 2T and 4T stand for 1-thread, 2-thread and 4-thread respectively. Below each bar the application mix is shown; for example, applu + vpr is a 2T test while applu + vpr + apsi + twolf is a 4T test. Most experiments use our default instruction cache size of 32 KB, except the '4T, 16 KB I-cache' bars. With two user-dominated threads sharing the instruction cache, the kernel occupancy does not show a clear increase or decrease compared with when each thread had full ownership of the cache. In the aggressive 4T/core tests, kernel code occupancies are only a few percent lower than in the 2T/core tests. Only when we shrink the cache size by half do we see a noticeable drop in kernel code occupancies. Compared with the SPEC applications, Apache and SPECjbb are different: most of the instructions executed in Apache are in kernel mode, while SPECjbb spends the vast majority of its time in user-level libraries. When Apache + SPECjbb run on two hardware threads in a core, a balanced dynamic division of the shared I-cache is created.

I-caches do not take advantage of ring level information; an incoming instruction fetch searches the I-cache purely on a (physical) memory address basis. Though a user-mode instruction fetch can never hit a cacheline that contains kernel-mode code, all ways in the selected cache set are still searched in parallel.

[Fig. 3 plots, over time (×25 microseconds), the fraction of dynamic instructions that are kernel code and the fraction of I-cache lines holding kernel code. Averages: (a) gcc: kernel-instruction occupancy 16.0%, kernel-instruction count 2.5%; (b) gzip: 62.5%, 1.9%; (c) equake: 75.6%, 2.4%; (d) mesa: 37.1%, 1.9%.]

Fig. 3. Kernel code's high occupancies in L1 I-cache, in contrast with its low dynamic instruction counts.

[Fig. 4 plots, for each workload configuration (1T, 2T and 4T mixes of SPEC benchmarks, plus Apache, SPECjbb, and Apache + SPECjbb, including 4T runs with a 16 KB I-cache), the kernel-instruction occupancy and the kernel-instruction dynamic count.]

Fig. 4. Kernel code’s occupancies in the instruction cache (1T, 2T and 4T).

For example, based on the profiling data of equake shown in Fig. 3(c), statistically about (1 − 2.4%) × 75.6% + 2.4% × (1 − 75.6%) = 74.4% of all accesses to I-cache RAMs will result in tag mismatches and the fetched instruction cachelines being discarded. This observation motivates our optimization to the instruction cache.

1.2. Contributions of the paper

In this paper, we make four contributions:

- We present findings on how stack/heap data share the L1 data cache (DL1) capacity, and how user/kernel code share the L1 instruction cache (IL1) capacity. The characteristics of user/kernel IL1 sharing are counter-intuitive.
- We propose to introduce virtual memory and ring level semantics into the first-level caches of a high-performance microprocessor in order to improve energy efficiency by eliminating unnecessary lookups.
- We present detailed designs of the software semantics-aware cache. One optimizes data array accesses in the DL1; the other optimizes both tag checks and data fetches in the IL1.
- We evaluate the efficacy of the two semantic tag designs using standard benchmarks running on architecture-level performance and power simulators.

The rest of the paper is organized as follows. Related work is surveyed in Section 2. We present detailed designs for L1 caches in Section 3, and evaluation results in Section 4. Section 5 summarizes the paper.

2. Related work

Caches in general-purpose processors are typically designed oblivious of software semantics, with a few exceptions, e.g., building separate caches for user data versus OS data [12], and stack versus non-stack data [13,4]. Nellans et al. [12] suggest dedicated data caches for user applications and the OS. Based on some unique characteristics of stack memory, researchers [14,15] designed specialized stack caches to exploit these software semantics for performance or power. Observing that stack data are not actually shared with other threads, Ballapuram et al. [16] propose not to snoop stack accesses in cache coherency protocols.

The Semantic-Aware Partitioned (SAP) cache has been proposed by Lee [4] and Cho [13]. In SAP, a dedicated data cache is used for each virtual memory region. For example, in [4], one quarter of the total L1 data cache capacity is used exclusively by the high-memory region, which actually includes the user stack and kernel data; we will refer to it as the stack partition. Hard partitioning the L1 cache has two drawbacks. First, the rigid partitioning inevitably leads to low utilization of the cache capacity.

Second, a stack variable and a heap variable can be backed by the same physical page. This is the synonym problem, a well-known challenge with virtual caching [1]. To ensure program correctness, cache coherency needs to be maintained between the partitions, at a cost in both hardware complexity and performance.

Opportunistic Virtual Caching (OVC) [17] also saves dynamic power by partitioning the L1 cache. In OVC, the OS and oftentimes the application programmer determine which memory regions are not subject to read-write synonyms and are thus safe to be stored and accessed with virtual tags in the cache. For these pages, OVC partitions the cache based on certain virtual address bits, which effectively turns the cache into multiple smaller ones. As a superset of a true virtual caching system, OVC adds 10% space overhead to the baseline physical cache in order to handle the attendant complexities, including physical-to-virtual reverse translation, read/write and privilege checks, and address space identification. The TLB may be consulted either before a request accesses the L1 or after it misses the L1; this adds an extra source of requests to the TLB and poses a challenge to the timing of the tight TLB and L1 cache pipelines [18]. When synonyms are created to a page frame, the kernel needs to communicate this to hardware so that physical caching is used for the page from then on. OVC uses the highest-order virtual address bit to carry this information. Since an application could request synonyms at any time during execution, the kernel would need to be able to modify the application code after the program starts so that all affected virtual addresses are changed. These addresses include dynamically generated application addresses and memory pointers. The dual virtual-physical tag cache [19] has a similar hardware organization to OVC. To avoid the prohibitively high complexity of the kernel dynamically modifying application code, Zhou et al. rely on explicit directives inserted by application programmers.

Another related body of work is virtual address bit-based way prediction. Sun UltraSPARC-III [20] employs a technique called microtag to improve the read hit latency of L1 caches. In this design, some low-order bits of the cache tag (the 'loTag') are checked first, and the data of the predicted hit way is speculatively fetched and forwarded to the processor. The remaining bits of the tag (the 'hiTag') are also checked to determine whether the access was a true hit or whether the word should be discarded. Note that cache power is not reduced; all ways of data are still fetched from storage [21]. What is accelerated by a correct speculation is the N-to-1 selection of the fetched blocks and the data forwarding stages. Since the microtag scheme predicts at most one block to hit, it requires that the loTags of all the ways in each set be distinct. To enforce uniqueness, if an incoming block has the same loTag bits as an existing block in the set, the cache replacement logic has to evict the latter before installing the incoming line. Eviction of useful data increases conflict misses. The solution to premature evictions is to use more bits for the loTag: 8 bits are used in UltraSPARC-III [20]. Although microtag is used


as a performance optimization in UltraSPARC-III, we could use it for power reduction purposes. We will discuss this hypothetical application of the microtag technique further in the experiment section of the paper.

Mostly for embedded applications and microprocessors, researchers have investigated software-assisted selective lookup in set-associative caches. The Direct Addressed Cache [22] uses a special register in each load/store instruction to memoize the way number to be used by subsequent load/store instructions. Compiler-controlled way-placement [23,24] saves cache energy by avoiding all-way lookups. Researchers also place a small filter cache [25] ahead of the L1 to reduce accesses to the latter. A phased-access cache serializes the tag check and the data fetch by doing them in a two-step sequence, improving power efficiency at the cost of longer cache hit latency. Kessler et al. [26] compare a subset of the stored tag bits against the incoming request, and examine the remaining bits only if the partial tag comparison is a hit. In [27], all the tags in a cache set are examined in parallel in their full width, and a data block is accessed only if it is in the hit way. Clearly, a phased cache is more suitable for lower-level caches than for the L1 cache of a modern high-performance processor. The way-halting cache [28] is a special form of the phased-access cache that can avoid the performance impact caused by two-step tag checking. In this design, a few low-order virtual address bits are checked first to detect cache misses early, before the rest of the tag bits are checked. This assumes that the cache is virtually tagged. If the cache is physically tagged (as in most general-purpose microprocessors), the way-halting design would require a special page allocation scheme in the OS. The paper did not discuss the ramifications for software stack portability or for the performance of large applications.

Another form of phased-access cache is the MRU (Most Recently Used) algorithm used in a number of way predictors, including the L2 cache of the MIPS R10000 [29] and the L1 instruction cache of the Alpha 21264 [30]. In each set, the MRU block is speculatively read from the data array. If the tag comparison proves that the speculation is incorrect, the remaining blocks of the set are fetched in subsequent clock cycles. In practice, simply predicting the MRU way only works well for 2-way set-associative caches. Researchers have invented more complex prediction mechanisms for caches with higher associativity. Usually a prediction table is used to log history information so that a speculative prediction can be made about which way is more likely to match the incoming request. Prediction must be based on information that is available before the virtual address is generated. Various prediction sources have been tried, including the program counter (Reactive Associative Cache, RSA [31,32]) and register numbers (Predictive Sequential Associative Cache, PSA [33]). The fundamental difference between way prediction and our work is that way prediction is speculative in nature. When prediction accuracy is low, the mispredicted accesses suffer from longer hit latencies and also consume more of the valuable L1 cache read bandwidth due to additional lookups. In order to achieve satisfactory prediction accuracy, way prediction methods introduce fairly high hardware complexity: for a 32 KB, 64 B/block cache, RSA needs about 8–16 KB of storage for prediction [31], while the PSA uses a 2048-entry table [33].
Compared with the large number of published studies on the data cache, the instruction cache in general-purpose CPUs is a relatively less explored area. On the user/kernel instruction mix, as discussed in Section 1.1.3, researchers have presented detailed information that helps with design issues such as I-cache size and associativity [5,6,34,7–9,11]. Redstone et al. [9] found that less than 5% of processor cycles were spent executing kernel instructions for SPECint. Our experience with SPEC CPU on a modern machine (Intel Sandy Bridge processor + Linux 2.6) confirmed that the basic picture presented in [9] has not changed. However, these


studies provide little insight into how user code and kernel code divide the I-cache capacity. To the best of our knowledge, this is the first published study that provides a quantitative characterization of the instruction cache occupancies of user-mode and kernel-mode code, and also the first work that exploits user/kernel mode information to optimize the instruction cache.

To cope with the high heat density of the TLB in multi-issue processors, researchers have proposed power-optimized TLBs. Fan [35] reduces TLB RAM accesses by exploiting applications' data access locality. Ekman [36] adds a Page Sharing Table to avoid unnecessary TLB lookups when servicing snoops. Delaluz [37] allows the DTLB to be dynamically resized in response to varying application characteristics.

3. Design of software semantics-aware caches

3.1. Overview

We propose to exploit software semantics in the cache design to avoid unnecessary associative searches, thus reducing dynamic power consumption. Specifically, we present two software semantics-aware cache designs, for the data cache and the instruction cache, respectively. In both designs, further search operations in the cache set are continued only if the semantic tag comparison gives a match.

- Vtag (1 bit per cacheline): tag each L1 data cacheline with virtual memory region information. Using the most significant bit (MSb) of the virtual address eliminates all the cross-checks between stack and heap data.
- Rtag (2 bits per cacheline): tag each L1 instruction cacheline with ring level information, namely the processor's ring level status bits at the time the instruction is fetched and cached. For a user-level instruction fetch, for example, this obviates the need to search the ways that contain kernel-level code.

Compared with schemes that rely on cache partitioning (SAP [4]) or way prediction [38] to save power, the proposed mechanism has a key advantage in performance. The proposed design achieves power savings comparable to existing solutions (OVC [17], RSA [31], and PSA [33]), with hardware overhead that is an order of magnitude lower. It is fully compatible with existing application binaries, and does not require the OS kernel to manipulate their code or addresses.

3.2. Augmenting the data cache with Vtags

Fig. 5 shows an implementation of the Vtag-augmented L1 data cache after the virtual address is generated and processed by the Address Generation Unit (AGU) and Memory Ordering Buffer (MOB). Bold lines denote new logic that we propose to add; for brevity, we omit some logic such as the cacheline status check. The virtually-indexed, physically-tagged cache used as an example is 32 KB in capacity and 8-way set-associative. The virtual page number (VPN, virtual address bits [MSb:12]) is translated by the TLB into a physical page number (PPN), which serves as the physical tag. VA bits [11:6] select one of the 64 sets, and for each of the N ways (N = 8 in our example) one line is read out of the tag store and one out of the data store. The N tags are all compared with the incoming physical tag, and at most one comparison can be a hit. Data store accesses are performed in parallel with the N tag store accesses. In a traditional cache, all N data subbanks are accessed and the data blocks are staged in the sense amplifiers. In a Vtag-optimized cache, data readout from the SRAM arrays is selective. To achieve this, we augment each cacheline with the most significant bit of its virtual address.
When a data read arrives, the Vtag check is performed in parallel with the TLB access, in addition to the regular physical tag check.


Fig. 5. Eliminating unnecessary L1D’s data array accesses using Vtags. Selective data readout from the SRAM arrays is realized by qualifying WL/BL operations with Vtag check results, eliminating unnecessary bitline discharges.

Data subbank i (1 ≤ i ≤ N) is accessed only if the Vtag check gives a 'match' for way i; if the Vtag check gives a 'mismatch' for way i, the access to data subbank i is avoided. The largest components of cache power dissipation, bitline discharge and sense-amplifier operation, are thereby eliminated for the Vtag-mismatching ways. Because it does not need the TLB stage, the Vtag check can finish early enough (explained next) that we can skip accesses to the ways whose Vtags do not match the query Vtag. Cache hit assertion logic is not changed. Vtag only optimizes reads from the data array, not writes into it, since tag check and data array write are already serialized in the baseline cache. Snoop port accesses could be optimized in the same way as the read port, depending on how the cache coherence state machine is implemented. The cache coherency protocol and coherence-related performance are not affected by Vtag.

The success of Vtag relies critically on one question: can the Vtag logic finish early enough to turn data store accesses on and off without impacting the hit latency? In the example in Fig. 5, the Vtag store is 64 bits for each of the N subbanks, so the Vtag logic is merely a 64-bit vector access followed by a 1-bit XOR. We implemented the Vtag store access and comparison in 32 nm CMOS technology; it completes in well under 0.06 ns. This leaves enough time to propagate the way selection signals to the data store, since the check proceeds in parallel with the TLB lookup. The extra delay fits into the first phase of the clock even at frequencies up to 4 GHz. Though theoretically possible, optimizing the tag array accesses in a similar manner is challenging because of the tight timing constraints. Therefore, the regular physical tag check is performed regardless of whether the Vtag is a hit or a miss. This is reflected in Fig. 5, where the traditional tag comparison determines the hit/miss signal, while the Vtag result is used to control the wordlines and bitlines of the data subarray.

There is no conceptual obstacle that prevents the design shown in Fig. 5 from being applied to the data TLB. The challenge is that timing for the TLB is even tighter than for the cache tag, so directly applying our design to the DTLB would be impractical. Fig. 6 compares the read hit pipeline before and after applying the semantic tag technique. The solid vertical bar in Fig. 6(b) represents the added Vtag filtering logic.
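The Vtag filtering decision of Fig. 5 can be summarized with the following hedged sketch (illustrative behavior only, not RTL): for each way of the selected set, the stored 1-bit Vtag is compared with the MSb of the incoming virtual address, and only matching ways have their data subbanks enabled; the physical tag comparison after TLB translation still decides hit/miss exactly as in the baseline.

```python
def data_subbank_enables(stored_vtags, incoming_msb):
    """stored_vtags: one bit per way of the selected set; returns per-way WL/BL enables."""
    return [v == incoming_msb for v in stored_vtags]

set_vtags = [1, 0, 0, 1, 0, 0, 0, 1]   # 3 stack lines, 5 heap lines in this set
enables = data_subbank_enables(set_vtags, incoming_msb=1)  # a stack (high-region) read
print(enables.count(True), "of", len(enables), "data subbanks are read")  # 3 of 8
```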

3.3. Augmenting the instruction cache and instruction TLB with Rtags

Conceptually, the instruction cache's Rtag optimization is rather similar to the data cache's Vtag optimization, except that Rtag uses the ring level bits of the processor while Vtag uses the highest-order virtual address bit of the incoming request. One could derive an implementation of an Rtag-augmented instruction cache directly from the Vtag design in the data cache. With such a design, I-cache data array accesses could be reduced. Fig. 7(a) shows how it would reduce accesses to the data store in a 4-way set-associative instruction cache. In this example, we assume only two privilege levels, and the processor is executing a user-level program. Of the four instruction blocks cached in the selected set, two hold user code and two hold kernel code; only the blocks whose Rtags match the incoming Rtag are read out. In such a design, the Rtag comparison is done after cache set indexing and has to be performed on a per-access basis. As in the Vtag case, we would only be able to optimize accesses to the data array, not the tag array. If we can remove the ring-level check from the critical path of I-cache tag array accesses, we can reduce tag array accesses as well. This is what our optimized design, presented next, achieves.

We observe that dynamically generating ring level match signals for every instruction fetch is not necessary, because these signals for a cache set do not change unless a cacheline replacement occurs in the set or the ring level of the current thread changes. Based on this observation, we introduce a bitmask for each set, in addition to the ring level Rtag vector. The bitmask directly serves as the way selection to enable or disable wordlines and bitlines, as shown in Fig. 7(b). Bitmask generation and usage are decoupled: the left half of Fig. 7(b) illustrates how a bitmask is generated, while the right half shows how it is used. Compared with Fig. 7(a), the key difference is that the Rtag comparison is performed when a cacheline is brought in, not when it is looked up. When a cache set is selected by the indexing logic, its bitmask is ready to use, without the need for an Rtag comparison. In other words, the bitmask for each cache set is generated before any read is made to the set. Compared with the basic implementation, in the decoupled mechanism the generation of the bitmask is no longer on the critical path of tag array reads; it is done as part of cache miss handling. When the processor's ring level changes, every bit in every bitmask is flipped. In an SMT processor with M hardware threads, we need M bitmasks for each set. Selecting the correct bitmask is done in the thread select cycle, usually half a clock period ahead of the I-cache access. On an M-context SMT core supporting R ring levels, the total storage for ring levels and bitmasks is M × log2(R) × (number of cachelines) bits for the I-cache and M × log2(R) × (number of entries) bits for the I-TLB. Using the Intel Sandy Bridge processor as an example, with M = 2, R = 4, a 32 KB I-cache with 64 B lines, and a 128-entry ITLB, the total storage overhead is 2048 bits + 512 bits = 320 bytes. Fig. 8(b) shows how the traditional instruction fetch pipeline in Fig. 8(a) is changed by adding the Rtag filtering logic.
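The storage-overhead figures quoted above follow directly from the formula in the text; the short sketch below simply reproduces that arithmetic for the Sandy Bridge-like configuration (M = 2 SMT contexts, R = 4 ring levels, 32 KB / 64 B-line I-cache, 128-entry ITLB).

```python
from math import log2

M, R = 2, 4
icache_lines = (32 * 1024) // 64          # 512 lines
itlb_entries = 128

icache_bits = int(M * log2(R) * icache_lines)   # 2048 bits
itlb_bits = int(M * log2(R) * itlb_entries)     # 512 bits
print((icache_bits + itlb_bits) // 8, "bytes")  # 320 bytes total
```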

Fig. 6. Data cache read pipeline before and after the Vtag optimization: (a) baseline pipeline; (b) Vtag-optimized pipeline with reduced data array accesses.

Fig. 7. Using Rtags to optimize I-cache/ITLB accesses. (a) Basic implementation: way-select (WS) generation is on the critical path of the tag access. (b) Optimized implementation: the bitmask of a set is generated by the LRU logic on the last miss to the set, and no Rtag comparison is needed after set selection.

Because of the bitmask design, we are able to place the Rtag check logic before the physical tag and ITLB VPN processing, thus optimizing the power consumption not only of the cache data arrays, but also of the cache tag arrays and the I-TLB. It is legitimate for the OS to execute user code in kernel mode. This is harmless except for self-modifying code; we can simply disable the proposed function in such a rare usage case.
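A hedged sketch of the decoupled bitmask scheme of Fig. 7(b) follows (illustrative only, not the authors' RTL): the per-set bitmask is rebuilt only on a line fill and inverted wholesale when the current ring level changes, so no ring-level comparison sits on the I-cache lookup critical path. The class and method names are ours.

```python
class RtagBitmasks:
    def __init__(self, ways, num_sets, current_ring):
        self.ring = current_ring
        self.rtags = [[None] * ways for _ in range(num_sets)]   # ring level per cached line
        self.masks = [[False] * ways for _ in range(num_sets)]  # way-enable bits per set

    def on_fill(self, s, way, line_ring):
        """Miss handling: record the new line's ring level and refresh its mask bit."""
        self.rtags[s][way] = line_ring
        self.masks[s][way] = (line_ring == self.ring)

    def on_ring_change(self, new_ring):
        """Privilege switch: flip every bit (as in the paper, assuming two ring levels in use)."""
        self.ring = new_ring
        self.masks = [[not b for b in mask] for mask in self.masks]

    def ways_to_search(self, s):
        """Lookup: the precomputed mask directly gates tag and data accesses for set s."""
        return [w for w, en in enumerate(self.masks[s]) if en]
```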

3.4. Synonyms

The Vtag mechanism does not change how the baseline system architecture deals with synonyms, except when a high virtual page and a low virtual page are backed by the same physical page. This scenario could happen, for example, when the OS maps a user heap data structure into the kernel address space. A false match by Vtags cannot occur, since the physical tag comparison is still performed. A false miss, however, could occur and cause program execution errors if one of the synonymed pages is written to. For instance, if the L1 cacheline is in the M (modified) state, a false miss would cause obsolete data from lower-level memory to be used. Since intentional virtual address aliasing is rare in practice (unintentional aliasing would indicate a bug in the OS), we only need to ensure program correctness when a false miss occurs; the performance of false-miss correction is a non-issue.

Fig. 8. Instruction cache read pipeline before and after the Rtag optimization: (a) baseline pipeline; (b) Rtag-optimized pipeline with reduced I-TLB, tag array and data array accesses.

Table 1. Summary of cost and benefits.

| Scheme | Extra bits | Optimizing | Tag array? | Data array? | Overhead (B) |
| Vtag | MSb of virtual address | D-cache | Not optimized | Optimized | 64 bytes |
| Rtag | Ring level and bitmask | I-cache and ITLB | Optimized | Optimized | 320 bytes |

Total overhead: 0.41% of the L1 caches and ITLB.

There are two solutions, one in software and one in hardware. In the software solution, since synonym creation has to go through the OS memory manager, the OS can simply disable Vtag usage when an alias mapping is created. The hardware solution is to cancel the consequent L1 miss request (e.g., in the last clock cycle in Fig. 5) and re-execute the data array access if the physical tag comparison does not agree with the Vtag result. The performance penalty for re-execution affects only aliased reads, and does not affect hit or miss latencies for non-aliasing cases.
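The hardware fallback can be sketched as follows (a hedged illustration, with names of our own choosing): if the physical tag comparison reports a hit on a way whose data subbank was skipped because its Vtag mismatched, the "miss" is false (a synonym); the L1 miss request is cancelled and the read is replayed with the data array access forced on.

```python
def resolve_access(ptag_hits, vtag_matches):
    """ptag_hits / vtag_matches: per-way booleans for the selected set."""
    for way, (p, v) in enumerate(zip(ptag_hits, vtag_matches)):
        if p and v:
            return ("hit", way)      # normal hit: data was fetched from this way
        if p and not v:
            return ("replay", way)   # false miss: cancel L1 miss, re-read data array
    return ("miss", None)            # genuine miss: fetch from the next cache level
```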

3.5. Hardware overhead

Table 1 summarizes the optimizations that we have discussed. The storage overhead examples are based on the organization of the Intel Sandy Bridge processor; the overhead numbers are for the entire L1D/L1I caches. In the data cache, Vtag uses a 1-bit XOR with the potential to eliminate a full cacheline data fetch. In the instruction cache, for the vast majority of instruction fetches, Rtag uses a 1-bit way-selection mask with the potential to eliminate both the tag comparison and the data array access. Vtag and Rtag add a space overhead of 0.41% to the L1 caches. Our design requires no changes to the cacheline replacement logic, does not change cache hit/miss rates, and has no impact on cache hit latency. Being only a small extension to most L1 cache implementations in general-purpose CPUs, we believe that the total cost of adoption will be low.


3.6. Further understanding of power saving opportunities

In this section, we offer insight into the following question: what type of applications have more opportunity to save power with Vtag- and Rtag-augmented caches? The efficacy of the proposed optimizations primarily depends on the overall reactivity of the semantic tags throughout the application's execution. For a cache access, reactivity is defined as the number of semantic tag matches divided by the cache associativity. For example, if the Vtag of an incoming data read matches the stored Vtags of 2 of the 8 ways in the set, the reactivity is 25% for this access. A lower reactivity value is desirable, since it eliminates more parallel lookups. While occupancy reflects the degree of dominance of a certain type of memory in the cache, reactivity captures the relationship between incoming requests and stored cachelines.

The example in Fig. 9 contains a 2-set, 8-way cache in which 6 of its 16 cachelines contain stack data. Thus Occupancy_stack = 6/16 and Occupancy_heap = 10/16; obviously Occupancy_stack + Occupancy_heap is always 100%. Suppose the incoming stack read lands in set 1 and the heap read lands in set 2. Five of the 8 stored cachelines in set 1 are stack data, so reactivity_stack = 5/8. Similarly, reactivity_heap = 7/8. reactivity_stack + reactivity_heap has a range of [0, 200%].
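The two definitions can be replayed on the Fig. 9 example with the short sketch below (illustrative code, not from the paper): a 2-set, 8-way cache holding 6 stack and 10 heap lines, with the stack read landing in set 1 (5 stack lines) and the heap read landing in set 2 (7 heap lines).

```python
def occupancy(cache, kind):
    lines = [line for s in cache for line in s]
    return sum(1 for line in lines if line == kind) / len(lines)

def reactivity(cache, set_idx, kind):
    """Fraction of ways in the indexed set whose semantic tag matches the request."""
    ways = cache[set_idx]
    return sum(1 for line in ways if line == kind) / len(ways)

cache = [["stack"] * 5 + ["heap"] * 3,      # set 1
         ["stack"] * 1 + ["heap"] * 7]      # set 2
print(occupancy(cache, "stack"))            # 6/16 = 0.375
print(reactivity(cache, 0, "stack"))        # 5/8  = 0.625 for the stack read
print(reactivity(cache, 1, "heap"))         # 7/8  = 0.875 for the heap read
```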


Fig. 9. Cache occupancy versus semantic tag reactivity: an example.

In Section 1.1.3, we showed the occupancy profiles of four SPEC2000 benchmarks (Fig. 3). Fig. 10 visualizes the percentage of time each cacheline holds user and kernel contents, offering a different angle of view than Fig. 3 on occupancies in the L1I for gcc and equake. The 512 cachelines (64 sets × 8 ways) are first sorted by way number and then by set number, so a sequence of contiguous instruction memory blocks maps to adjacent positions along the X-axis in Fig. 10. For gcc, the majority of cachelines contain user code most of the time, while the opposite holds for equake, though in both applications there are sporadic cachelines that tend to hold kernel (respectively, user) instructions. The equake figure is representative of many SPEC CPU programs in which kernel instructions occupy most of the L1I capacity. The high occupancy and long residency of kernel instructions is a result of the small memory footprint of loop-dominated user programs.

While occupancy and reactivity are correlated, different applications can have quite different relationships between the two values. Fig. 11 compares the occupancy and reactivity of each application. Only selected applications are shown in the figure due to page width limits, although the AVG bars represent the average occupancy and reactivity over all SPEC 2000 benchmarks. In Fig. 11(a), Vtag gives a 58% average reactivity for incoming stack reads while the average stack data occupancy is only 20%. Vtag reactivity for heap reads is 82%, close to the heap occupancy of 80%. Comparing Fig. 11(a) and (b), we can see that the occupancy and reactivity values are closer to each other in the user/kernel code case. Another observation from Fig. 11 is that reactivity_stack + reactivity_heap is around 140% for most applications, and reactivity_kernel + reactivity_user is around 105%. These invariants will be used in the analytical model, presented next.

3.7. Power savings

The benefits of Vtag and Rtag come from reductions in power consumption in different structures of the cache, as shown in Table 1. In this section we qualitatively discuss where the power savings come from in our implementation of the software semantics-aware cache; the equations will also be used in the quantitative evaluation in the next section. At a high level, total cache power consists of five components: tag array leakage power, tag array dynamic power, data array leakage power, data array dynamic power, and address decoder power. That is,

P_total = P_tag_lkg + P_tag_dyn + P_data_lkg + P_data_dyn + P_dec.

Because the data store is an order of magnitude larger than the tag store, P_data_dyn is much larger than P_tag_dyn. Down to the 32 nm CMOS process, dynamic power dominates in a typical cache, though leakage power becomes more significant at smaller feature sizes. Vtag optimizes P_data_dyn of the data cache. Rtag optimizes P_tag_dyn and P_data_dyn of the instruction cache and the instruction TLB. These savings are all achieved through clock gating: in an N-way set-associative array, accessing M of the N ways consumes approximately M/N of the baseline dynamic power, by gating the clocks to the bitlines and segmented wordlines of the remaining N − M subarrays. Equations (a) and (b) in Fig. 12 give the power consumption of each individual cache access. When there is no cache access in a clock cycle, the cache structure is clock-gated; this is true for both the baseline and our optimized cache, so the cache power then consists of leakage and some residual dynamic power, given by Equation (c). Aggregating Equations (a) and (c), or (b) and (c), over the whole application simulation gives the overall average power of the data cache or instruction cache structure.

4. Evaluation

In this section, we describe our experimental methodology and present the evaluation results for Vtag and Rtag.

4.1. Simulation methodology

The experiment consists of two separate steps. First, we collect the workloads' instruction traces. Second, the traces are fed to a performance and power simulator to obtain the experimental results.

4.1.1. Workloads and traces

Our traces have been collected using SoftSDV [39]. SoftSDV is a full-system emulator and simulator that has primarily been used for porting and/or development of operating systems, device drivers and key applications on x86 and IA-64. We configured SoftSDV to emulate an x86 platform with 2 GB of main memory. Fedora 10 was booted on top of SoftSDV and ran the benchmarks. After skipping the warm-up phases, SoftSDV captured all fetched instructions and dumped them into traces. These traces faithfully contain the dynamic instruction sequence of all user and kernel instructions. In most experiments, we run each trace for 1 billion instructions in a steady stage of the application. We use SPEC CPU2000, SPEC CPU2006, PARSEC, SPECjAppServer2004, SPECjbb2005 and the Apache web server to evaluate the proposed cache optimizations. PARSEC, from Princeton University, primarily includes RMS (recognition, mining, synthesis) and other emerging workloads. SPECjAppServer2004 is a J2EE 1.3 application server benchmark: a multi-tier e-commerce benchmark that emulates an automobile manufacturing company and its dealers. SPECjbb2005 is a Java-based server benchmark that represents a warehouse serving a number of districts.

[Fig. 10 plots, for each of the 512 cachelines grouped by way (W0–W7), the percentage of time the line holds user versus kernel code. Panels: (a) gcc, (b) equake.]

Fig. 10. Propensity of each cacheline to hold user/kernel instructions on a 32 KB, 8-way I-cache.

[Fig. 11 plots, for selected SPEC 2000 benchmarks (applu, apsi, art, bzip2, eon, equake, gap, gcc) and the SPEC 2000 average (AVG), paired occupancy ('occup') and reactivity ('sel') bars.]

Fig. 11. Overall occupancy and reactivity: (a) stack vs. heap data; (b) kernel vs. user code.

[Fig. 12(a) gives the analytical model for the number of remaining lookups: T = S × P_stack × Selectivity_stack + S × P_heap × Selectivity_heap, with P_stack + P_heap = 1.0 and Selectivity_stack + Selectivity_heap = C, where C = 1.40 for Vtag and 1.05 for Rtag. Fig. 12(b) is a graphical presentation of T/S as a function of P_stack and Selectivity_stack (lower T is better), with gap, gcc and apsi marked.]

Fig. 12. Minimizing the number of cache lookups.
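The analytical model reconstructed in the Fig. 12 placeholder above can be evaluated with the small sketch below (our own illustration; "selectivity" in the figure corresponds to the reactivity of Section 3.6, and the sample inputs are the SPEC 2000 averages quoted in Sections 1.1.2 and 4.3).

```python
def remaining_lookup_fraction(p_stack, reactivity_stack, C=1.40):
    """T/S = P_stack * reactivity_stack + P_heap * reactivity_heap, with the two
    reactivities constrained to sum to C (1.40 for Vtag, 1.05 for Rtag)."""
    p_heap = 1.0 - p_stack
    reactivity_heap = C - reactivity_stack
    return p_stack * reactivity_stack + p_heap * reactivity_heap

# ~40% of references are stack accesses, stack reactivity ~58% (SPEC 2000 averages):
print(f"{remaining_lookup_fraction(0.40, 0.58):.0%} of ways still fetched")  # ~72%
```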

Apache is one of the most widely deployed web servers. SPECjbb and Apache are run together on a 2-way SMT core to mimic a real-life workload consisting of a data engine and a web server.

4.1.2. Simulator and setup

We use a trace-driven platform simulator, ManySim [40], to model the performance aspects of the architecture. ManySim has an abstract core module to improve simulation speed, but accurately simulates the memory subsystem, including the cache and TLB microarchitectures. We set up and tune the core module in ManySim to model two microprocessors: one aggressive, big core and one in-order, small core. Table 2 shows the key simulation parameters that we use. In particular, the L1 cache parameters are representative of the latest high-performance processors such as Intel Sandy Bridge and IBM Power6. Unless otherwise specified, the default values in Table 2 are used in our experiments throughout the paper. (In a virtually indexed, physically tagged cache, the index bits need to come from the page offset. This dictates that Associativity ≥ CacheSize / PageSize. For a 32 KB cache and a 4 KB page size, for instance, the minimum set-associativity is 8. In our simulation, the 4-way setup is used only as a sensitivity exploration.)
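The constraint in the parenthetical note is a one-line calculation; the tiny sketch below simply checks it for the paper's configuration (illustrative code, not part of the simulator).

```python
def min_vipt_associativity(cache_size, page_size):
    """Minimum ways for a VIPT cache: all index bits must fit within the page offset."""
    return -(-cache_size // page_size)   # ceiling division

print(min_vipt_associativity(32 * 1024, 4 * 1024))  # 8
```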

To estimate power dissipation, we integrated Wattch [41] into ManySim. Wattch is an architecture-level power estimation tool that provides dynamic power estimates of microarchitectural structures. The version of Wattch that we downloaded only had technology parameters down to 100 nm, so we plugged newer CMOS technology parameters for 45 nm (obtained through MOSIS) into Wattch. Leakage is assumed to be 20% of the overall baseline power for the L1 cache, and is not reduced by our design. We calculate the energy consumption of the caches using the model in Section 3.7 on a cycle-by-cycle basis and integrate the numbers to derive the average power.

4.2. Experiment results: overall results

Table 3 contains the key experimental results for all the benchmarks that we tested. For each benchmark, we show how many cacheline fetches are eliminated by the optimizations, and the consequent power reduction. Overall, semantic tags filter out a higher percentage of readouts in the instruction cache than in the data cache.

Table 2. Simulation setup. Where multiple options are listed, the first is the default.

| Parameter | Value |
| Processor core | 3.2 GHz, out-of-order; 1.6 GHz, in-order |
| L1 I-cache | 32 KB, 64 B lines, 8-way and 4-way, LRU |
| L1 D-cache | 32 KB, 64 B lines, 8-way and 4-way, LRU |
| TLB | Data: 128 entries, 8-way; Instruction: 96 entries, 8-way and fully-associative |
| L2 cache | 64 B lines, 8-way, 4 MB for big core and 512 KB for small core |
| Cache latency | L1 = 2 cycles; L2 = 12 cycles |
| Replacement | LRU for all caches and TLBs |
| DRAM | 110 ns latency |

Between different categories of applications, SPEC CPU and PARSEC tend to benefit more from the optimizations than server workloads. For instance, for SPEC 2000, SPEC 2006 and PARSEC, eliminated data cache accesses average around 28%, and instruction cache accesses are reduced by over 35%. In the following sections, we present a detailed characterization of semantic tag-augmented caches, including sensitivity to set-associativity, simultaneous multithreading and out-of-order execution. Due to space limits, we only present data from a subset of the benchmarks in Table 3; these are the benchmarks that some of the closely related work used. The insight we gain from them applies well to the other applications in Table 3.

4.3. Experiment results: Vtag on the data cache

Fig. 13(a) and (b) show the distribution of Vtag matching ways using SPEC CPU2000 and Apache + SPECjbb on two setups: 1-thread and 2-thread. As discussed in Section 3.7, a lower Vtag match percentage (i.e., smaller reactivity) in the figures implies higher power saving opportunities. Because Vtag does not optimize write instructions, all writes count as 8-way matches in the profiling data. It is worth noting that by using the MSb of the virtual address as the Vtag, we effectively classify all memory references into two categories. Category one (Vtag = 0) includes user heap, user BSS and user data. Category two (Vtag = 1) includes the user stack and all kernel data. Which of the two categories dynamic libraries belong to is trace-dependent.

Fig. 13(a) shows that on average, about 19% of memory references have no more than 2 ways of Vtag matches in their respective cache sets, denoted by the bottom segment of the AVERAGE bar. This means that for 19% of all accesses, only up to two data subbanks need to be activated in our 8-way cache. Only a little over half of the accesses need to fetch 7 or 8 data blocks. High match percentages are usually a result of unbalanced numbers of stack/heap accesses. In contrast to the counter-intuitive relationship between kernel/user code frequency and occupancy in the instruction cache, in the data cache the heap/stack data frequency and occupancy correlate quite positively, as we have seen in Fig. 2(b). For example, in mcf, stack reads account for only 4% of all memory references, so most lines in the L1 D-cache contain heap data. Incoming heap reads see their Vtag match most of the Vtags in the selected sets, while incoming stack reads mostly have low reactivity. Since reads are predominantly heap accesses, the result is the dominance of the '7 or 8 matches' segment for mcf. By contrast, 56% of apsi's data accesses are user stack reads; user stack writes, user non-stack accesses and kernel data structures account for the remaining 44%. As a result, we observe a rather good distribution for apsi in Fig. 13(a). Overall, Vtag is able to eliminate 27% of all fetches from the data array for SPEC 2000.

When two threads share a cache (Fig. 13(b)), the overall distribution is similar to the 1T cases, although the server workload, Apache + SPECjbb, shows higher Vtag match percentages.


Data reads in Apache are mostly in the kernel region, while SPECjbb data loads are mostly to the user stack and to shared libraries, which are loaded at high memory locations in our trace. These data addresses all have their MSb equal to 1, so we get high Vtag match numbers for Apache + SPECjbb.

Using the equations discussed in Section 3.7, we estimate the total data cache power consumption for the benchmarks. As shown in Table 3, the 27% average reduction in cache lookups leads to a cache power saving of 16% for SPEC 2000 with our default setup (8-way, 3.2 GHz, out-of-order). In Fig. 14 we present estimated power reduction percentages for two set-associativities (4-way and 8-way) and two processor models (out-of-order, 3.2 GHz and in-order, 1.6 GHz). Vtag is more effective on an 8-way cache than on a 4-way cache because of the finer control granularity. The power savings correlate with the Vtag reactivities in Fig. 13(a): a large '7 or 8 matches' bar implies small power savings (e.g., gap and mcf), while a large '0 to 2 matches' bar implies bigger power savings (e.g., apsi and mesa).

Dynamic power consumption is a function of the switching activity of a circuit. If we compare the out-of-order numbers with the in-order ones in Fig. 14, we can see how the activity factor affects the efficacy of Vtag. Power reduction from Vtag is more noticeable in a server-class processor than in a netbook-class processor, largely due to their different dynamic power usage levels resulting from different pressure on the cache. When two applications run on a 2-thread SMT core, the competition between the applications can pull Vtag's benefits in two diverging directions. On one hand, increased activity can boost dynamic power usage, just as out-of-order execution increases pressure on the caches over in-order execution; a high dynamic power in the baseline usually implies bigger relative benefits from using Vtags. On the other hand, an application that has a clean stack/heap partition between cache ways (i.e., good Vtag reactivity) can be perturbed by another application on the same core that has a large, unbalanced working set (i.e., bad Vtag reactivity). Comparing Figs. 14 and 15, we see a slight drop in the power reduction percentage, which is the net effect of these two conflicting factors.

4.4. Experiment results: Rtag on the instruction cache and I-TLB

The distribution of Rtag reactivities for SPEC CPU2000 and SPECjAppServer is presented in Fig. 16(a). Due to the user applications' small instruction memory footprints, the very low-reactivity segment (0 to 2 matches) is the most common case for a number of benchmarks, including art, equake, gzip, mcf, mgrid, swim and vpr. On average, an instruction fetch in SPEC 2000 has a 38% probability of having the same ring level as up to 2 of the 8 ways, and another 30% probability of matching 3 or 4 ways. Averaged over the SPEC 2000 benchmarks, less than 10% of all code fetches actually have a 7 or 8-way match.

In an SMT core running a multiprogrammed mix of SPEC workloads, one may wonder whether we will see substantially smaller occupancies by kernel code, because the aggregated user code memory footprint could cover a much bigger share of the I-cache. We found that this is generally not true. With two user-dominated threads sharing the I-cache, in most cases the kernel occupancy is simply about the mean of the two threads' occupancies when each had full ownership of the cache. Apache + SPECjbb is a unique case.
Most of the instructions executed in Apache are in kernel mode, while SPECjbb spends the vast majority of its time in user-level libraries. When such a workload runs on two hardware threads, it offers a balanced partition of the shared I-cache. Detailed reactivity distribution data for the two-thread workloads are presented in Fig. 16(b). With Rtag, the relative power efficiency gain in the I-cache is even better than Vtag's effect on the D-cache, as shown in Fig. 17.


Table 3
Power efficiency improvement with Vtag- and Rtag-augmented caches.

                        Data cache                          Instruction cache
Benchmark               Readouts saved     Power            Readouts saved     Power
                        by Vtag (%)        reduction (%)    by Rtag (%)        reduction (%)

SPEC 2000
  164.gzip              25                 15               63                 41
  171.swim              41                 18               74                 48
  172.mgrid             33                 20               69                 45
  173.applu             22                 10               56                 36
  175.vpr               37                 23               71                 46
  176.gcc               21                 13               29                 19
  177.mesa              41                 26               40                 26
  179.art               29                 15               83                 54
  181.mcf               5                  2                80                 52
  183.equake            25                 16               74                 48
  197.parser            32                 20               34                 22
  252.eon               28                 18               22                 14
  254.gap               6                  3                61                 39
  256.bzip2             25                 16               50                 32
  300.twolf             22                 13               47                 31
  301.apsi              41                 27               45                 29
  AVERAGE               27                 16               56                 36

SPEC 2006
  400.perlbench         29                 16               44                 29
  401.bzip2             26                 15               51                 41
  403.gcc               22                 13               26                 18
  429.mcf               7                  2                77                 52
  433.milc              37                 24               67                 44
  444.namd              28                 18               72                 48
  445.gobmk             42                 25               67                 44
  450.soplex            21                 11               27                 18
  453.povray            34                 21               45                 28
  456.hmmer             29                 14               62                 42
  458.sjeng             18                 9                53                 35
  462.libquantum        23                 14               76                 48
  464.h264ref           45                 27               75                 47
  470.lbm               39                 21               27                 17
  471.omnetpp           19                 13               19                 13
  473.astar             33                 21               42                 27
  482.sphinx3           13                 11               65                 43
  483.xalancbmk         41                 26               51                 31
  AVERAGE               29                 17               53                 35

PARSEC
  Bodytrack             27                 18               43                 28
  Canneal               29                 18               66                 43
  Dedup                 39                 21               52                 34
  Ferret                40                 25               59                 38
  Fluidanimate          12                 7                79                 51
  Streamcluster         21                 13               85                 55
  AVERAGE               28                 17               64                 42

Server
  Apache                11                 7                16                 11
  SPECjbb               13                 8                30                 19
  SPECjAppServer        22                 14               29                 19
  AVERAGE               15                 10               25                 16

In the 1-thread experiment, the average power reduction for SPEC2000 is 36% and 31%, respectively, for the 8-way and 4-way I-cache. Comparing Fig. 17 against Fig. 14, we can see that Rtag is more effective on the I-cache than Vtag is on the D-cache. There are two reasons for the difference. The first is that Vtag only reduces data array accesses, but Rtag reduces both data array accesses and tag comparisons, as we have discussed in Section 3.3. A more important reason is that Rtag has much lower reactivities than Vtag (compare Fig. 16(a) with Fig. 13(a)). A direct comparison of eliminated cache structure accesses is shown in Table 3, where Rtag eliminates 56% of all fetches from the I-cache's tag and data arrays for SPEC00. Rtag is only able to save under 20% of the power for SPECjAppServer because this application has a much larger user-level instruction memory footprint than the SPEC2000 programs; the user instructions are able to wipe most kernel code out of the cache. Similar to what we have shown for the data cache, Rtag achieves a higher percentage of power savings on an aggressive out-of-order processor than on an in-order processor. Fig. 18 compares the default, out-of-order numbers with those from an in-order processor model.
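To make concrete how the reactivity distributions above translate into the readout savings reported in Table 3, the following is a minimal, illustrative sketch (this is not the simulator used in this study, and all identifiers are hypothetical) of tallying, for every lookup, how many ways of the selected set carry the same semantic bit as the access, i.e., the ring level for Rtag or the address MSb for Vtag:

/* Illustrative reactivity tally for an 8-way cache with a 1-bit semantic tag
 * (ring level for Rtag, virtual-address MSb for Vtag) stored per way.
 * Hypothetical names; shown only to clarify how the histograms and the
 * "readouts saved" columns of Table 3 can be derived.                      */
#define NUM_SETS 64
#define NUM_WAYS 8

static unsigned char sem_tag[NUM_SETS][NUM_WAYS]; /* semantic bit of each resident line */
static unsigned long histogram[NUM_WAYS + 1];     /* histogram[k]: lookups with exactly
                                                      k matching ways (the reactivity)  */
static unsigned long total_lookups;

/* Record one lookup: only the ways whose stored semantic bit equals the
 * access's bit need their data subbanks (and, for Rtag, tag comparators)
 * activated.  For Vtag, a write would be recorded as NUM_WAYS matches,
 * since writes are not filtered.                                          */
void record_lookup(unsigned set, unsigned access_bit)
{
    unsigned matches = 0;
    for (unsigned w = 0; w < NUM_WAYS; w++)
        if (sem_tag[set][w] == access_bit)
            matches++;
    histogram[matches]++;
    total_lookups++;
}

/* Fraction of way readouts eliminated versus a conventional cache that
 * always activates all NUM_WAYS ways of the selected set.                */
double readouts_saved(void)
{
    unsigned long readouts = 0;
    if (total_lookups == 0)
        return 0.0;
    for (unsigned k = 0; k <= NUM_WAYS; k++)
        readouts += (unsigned long)k * histogram[k];
    return 1.0 - (double)readouts / ((double)total_lookups * NUM_WAYS);
}

Under this accounting, a lower average reactivity (more mass in the '0 to 2 matches' segment) directly yields a larger fraction of eliminated readouts, which is why the power savings track the distributions in Figs. 13 and 16.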


Finally, we present the power savings of the instruction TLB in Fig. 19. Using the default simulation setup (8-way set-associative), we observe an average of 27% power reduction for both the 1-thread and 2-thread experiments. The power saving increases to 33% when we use a fully-associative ITLB.

4.5. Comparison with semantic-aware cache partitioning and way prediction

In this subsection, we quantitatively compare Vtag with two closely related bodies of work, namely the semantic-aware partitioned cache (SAP) [13,4] and way prediction (WP) based on virtual address bits [20]. We introduced both approaches in Section 2. In the Sun UltraSPARC-III, WP is used purely to reduce cache hit latency [20,21]. We could use the microtag-based technique in a slightly different manner, for power reduction purposes. To achieve this, we would need to use the microtag comparison results to gate fetches to the data array.
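As an illustration of that idea, the sketch below (hypothetical identifiers; not the UltraSPARC-III implementation, and only one set is shown for brevity) stores the low bits of the virtual page number as a per-way microtag and enables only the matching ways' data subbanks; widening the microtag thins the enable mask but lengthens the match path that sits in front of the subbank enables:

/* Hedged sketch of reusing microtag-based way prediction to gate data-array
 * fetches.  Parameters follow the WP-3b/WP-8b variants evaluated below.    */
#include <stdint.h>

#define NUM_WAYS      8
#define MICROTAG_BITS 3        /* 3 for WP-3b, 8 for WP-8b   */
#define PAGE_SHIFT    12       /* 4 KB pages assumed         */

static uint8_t microtag[NUM_WAYS];   /* low virtual-page-number bits per way */

/* Return a mask of ways whose stored microtag matches the incoming access;
 * only these ways have their data subbanks enabled for the readout.        */
uint8_t ways_to_enable(uint64_t vaddr)
{
    uint8_t access_utag = (uint8_t)((vaddr >> PAGE_SHIFT) &
                                    ((1u << MICROTAG_BITS) - 1));
    uint8_t mask = 0;
    for (unsigned w = 0; w < NUM_WAYS; w++)
        if (microtag[w] == access_utag)
            mask |= (uint8_t)(1u << w);
    return mask;
}

In the original WP design the microtag is additionally kept unique within a set so that at most one way is predicted, which is the source of the early-eviction effect noted later in this section.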


Fig. 13. Power consumption components. Reactivity is the key parameter to be obtained through simulation.

Fig. 14. Distribution of Vtag reactivity in L1D ((a) 1-thread, (b) 2-thread SMT; each bar breaks accesses into 0-2, 3-4, 5-6 and 7-8 matching ways). A lower average is better.

Fig. 15. Power reduction of the data cache using Vtag (1-thread; 4-way and 8-way, out-of-order and in-order). The baseline is a conventional cache with aggressive clock gating.

Fig. 16. Power reduction of the data cache using Vtag (2-thread SMT; 4-way and 8-way).

Fig. 17. Distribution of Rtag reactivity in the instruction cache ((a) 1-thread, (b) 2-thread SMT). A lower average is better.

But since the microtag check is on the critical path of a cache hit, too many microtag bits will add extra delay to L1 accesses. In fact, using the cache organization in Sections 1.1.1 and 3.2, our gate-level simulation in a 32 nm, 3.0 GHz process revealed that a microtag of 3 bits or more affects the timing of a cache hit in a significant way: an extra clock cycle needs to be added to the read hit case. We present two sets of data in Figs. 20 and 21, one on application performance and the other on cache power.2

2 We do not use the metrics Energy × Time or Energy × Time² to compare the efficiency of these schemes in a high-performance system because Energy would need to include a very large number of system components, which all dissipate additional power as Time is lengthened.

The experiment setup is based on our default setup in Table 2. We added hardware cache coherency between the cache partitions in SAP in order to handle the synonym problem (discussed in Section 2). Fig. 20 presents the performance aspect of SAP and microtag-based WP when compared to the baseline (Vtag and a conventional cache). WP represents our hypothetical application of the original WP for power-saving purposes. For SAP, we tried different partitions and reconfirmed Lee's 8 KB/24 KB partition between stack and heap regions as the optimal split [4]. Using this division of the data cache, with hardware cache coherency maintained between the two partitions, we get an average performance loss of over 2%.
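To make the partition being compared concrete, the following minimal sketch (hypothetical identifiers; a 32-bit address width is assumed purely for illustration) routes each data access by the MSb of its virtual address, the upper/lower-half split suggested by the Apache and SPECjbb behavior discussed below, with partition sizes following the 8 KB/24 KB split of [4]:

/* Hedged sketch of the SAP-style routing evaluated here, assuming an
 * address-MSb split: user stack and kernel data fall in the upper half,
 * user heap/BSS/static data in the lower half (the same category split
 * used for Vtag in Section 4.3).  Not the original SAP implementation.  */
#include <stdint.h>

#define VADDR_BITS 32    /* address width assumed only for illustration */

typedef enum {
    STACK_PARTITION_8KB,   /* upper half: user stack and kernel data     */
    HEAP_PARTITION_24KB    /* lower half: user heap, BSS and static data */
} partition_t;

static inline partition_t select_partition(uint64_t vaddr)
{
    /* The MSb of the virtual address selects the partition. */
    return ((vaddr >> (VADDR_BITS - 1)) & 1u) ? STACK_PARTITION_8KB
                                              : HEAP_PARTITION_24KB;
}

Under such a split, Apache's kernel-heavy reads and SPECjbb's high-address library and stack reads all land in the small partition, which is the source of the miss-rate problems discussed below.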

Fig. 18. Power reduction of the instruction cache using Rtag. Left: 1T (4-way and 8-way). Right: 2T (8-way).

Fig. 19. Power reduction in IL1 on big core vs. small core.

Fig. 20. Instruction TLB power savings using Rtag.

Severe performance degradation has been observed with certain benchmarks, especially Apache and SPECjbb. Most of the data reads in Apache are to kernel memory, which belongs to the upper half of the virtual memory space and is thus serviced by the small stack cache, causing an excessive number of cache misses. SPECjbb has a significant portion of memory reads to shared libraries, which are loaded to high memory locations in our traces, in addition to its stack accesses. Therefore, this application also suffers from high stack cache miss rates.

WP-3b and WP-8b use the lowest 3 and 8 bits of the virtual page number, respectively, as the way prediction source. Both WP-3b and WP-8b delay read hits by one clock cycle. Fig. 20 shows that microtag-based way predictions introduce a perceptible performance loss. The performance loss with WP is primarily due to the extra clock cycle of read latency. A secondary reason is the early eviction of useful cachelines to maintain prediction uniqueness.

Fig. 22 compares the cache power savings of the several solutions. SAP is not as effective as Vtag because of the overhead of maintaining coherency between the stack cache and the heap cache. Besides the power consumption of the cache coherency circuit, our bigger concern with SAP is the design and validation cost of cache coherency at the L1 level, which is hard to quantify here. WP achieves slightly higher (about 0.2%) power savings than Vtag

because using more bits to filter data array fetches reduces reactivity. This, however, comes at a cost in performance.

4.6. Threats to validity

In this section we discuss the weaknesses of this study that we are aware of. One issue is the accuracy of the power numbers; this is one of the fundamental challenges faced by the entire computer architecture community. We use high-level microarchitecture tools instead of more accurate, circuit-level simulation to project power consumption. Gate-level simulation data would be more convincing, but that approach has severe limitations in simulation speed. This study requires SPEC-like benchmarks and simulations on the order of billions of instructions, and we would not be able to finish that amount of simulation in a realistic time frame. As most researchers in this field do, we use cycle-precise simulators for lack of a better methodology. These simulators have been validated, published and used by many researchers before. However, due to concerns about the accuracy of the power estimates, instead of presenting absolute numbers (in joules, for example), we use relative percentages when discussing power consumption.

Fig. 21. Performance impact of SAP and WP, measured as IPC degradation relative to the baseline (Vtag and conventional cache = 0%); lower is better.

Fig. 22. Cache power savings of Vtag, SAP and WP relative to the baseline (= 0%); higher is better.

We hope that the general conclusions we draw from these numbers would still hold if we had lower-level simulation results.

Another potential issue is that there may be occasional instances of operating systems that do not have a bifurcated address space definition as we described in Section 1.1.2. Although we have surveyed most major platforms, we cannot claim that an operating system with a conflicting implementation does not exist.

5. Summary and future work

In this paper we present a detailed design and evaluation of Vtag and Rtag to optimize the L1 data cache and instruction cache. We show that augmenting cache tags with virtual memory region information and ring level information can filter out a significant percentage of cache lookups, saving power without performance impact. For next steps, we plan to extend the basic idea to other computation units and the communication substrate. Another effort we would like to make is a more thorough survey of the address space definitions in modern operating systems. For example, if address space layout randomization [42] is used, Vtag needs to be disabled, while the Rtag feature remains valid.

References

[1] M. Cekleov, M. Dubois, Virtual-address caches. Part 1: Problems and solutions in uniprocessors, IEEE Comput. 17 (5) (1997) 64-71.
[2] D. Wood, et al., An in-cache address translation mechanism, in: ISCA-13, 1986.
[3] G. Reinman, N.P. Jouppi, Cacti 2.0: an integrated cache timing and power model, Tech. Rep. WRL-2000-7, COMPAQ (Feb.), 2000.
[4] H.-H.S. Lee, C.S. Ballapuram, Energy efficient D-TLB and data cache using semantic-aware multilateral partitioning, in: ISLPED, 2003, pp. 306-311.
[5] A. Alameldeen, C. Mauer, M. Xu, P. Harper, M. Martin, D. Sorin, M. Hill, D. Wood, Evaluating non-deterministic commercial workloads, in: Fifth Computer Architecture Evaluation using Commercial Workloads, 2002.
[6] T.E. Anderson, H.M. Levy, B.N. Bershad, E.D. Lazowska, The interaction of architecture and operating system design, in: ASPLOS, 1991.
[7] T. Li, L.K. John, A. Sivasubramaniam, N. Vijaykrishnan, J. Rubio, Understanding and improving operating system effects in control flow prediction, in: ASPLOS, 2002, pp. 68-80.
[8] A.M.G. Maynard, C.M. Donnelly, B.R. Olszewski, Contrasting characteristics and cache performance of technical and multi-user commercial workloads, in: ASPLOS, 1994, pp. 145-156.
[9] J.A. Redstone, S.J. Eggers, H.M. Levy, An analysis of operating system behavior on a simultaneous multithreaded architecture, in: ASPLOS, 2000, pp. 245-256.
[10] D. Chanet, B.D. Sutter, B.D. Bus, L.V. Put, K.D. Bosschere, Automated reduction of the memory footprint of the Linux kernel, ACM Trans. Embedded Comput. Syst. 6 (4) (2007) 23.
[11] J. Torrellas, C. Xia, R.L. Daigle, Optimizing the instruction cache performance of the operating system, IEEE Trans. Comput. 47 (12) (1998) 1363-1381.

[12] D. Nellans, R. Balasubramonian, E. Brunvand, OS execution on multi-cores: is out-sourcing worthwhile?, Oper. Syst. Rev. 43 (2) (2009) 104-105.
[13] S. Cho, P.-C. Yew, G. Lee, Decoupling local variable accesses in a wide-issue superscalar processor, in: ISCA-26, 1999, pp. 100-110.
[14] M. Huang, J. Renau, S.-M. Yoo, J. Torrellas, L1 data cache decomposition for energy efficiency, in: ISLPED, 2001, pp. 10-15.
[15] H.-H.S. Lee, M. Smelyanskiy, C.J. Newburn, G.S. Tyson, Stack value file: Custom microarchitecture for the stack, in: HPCA-7, 2001, pp. 5-14.
[16] C.S. Ballapuram, A. Sharif, H.-H.S. Lee, Exploiting access semantics and program behavior to reduce snoop power in chip multiprocessors, in: ASPLOS, 2008, pp. 60-69.
[17] A. Basu, M.D. Hill, M.M. Swift, Reducing memory reference energy with opportunistic virtual caching, in: ISCA-39, 2012.
[18] J.R. Haigh, M.W. Wilkerson, J.B. Miller, T.S. Beatty, S.J. Strazdus, L.T. Clark, A low-power 2.5-GHz 90-nm level 1 cache and memory management unit, IEEE J. Solid State Circuits 40 (5) (2005) 1190-1199.
[19] X. Zhou, P. Petrov, Low-power cache organization through selective tag translation for embedded processors with virtual memory support, in: GLSVLSI-16, 2006.
[20] T. Horel, G. Lauterbach, UltraSPARC-III: designing third-generation 64-bit performance, IEEE Micro 19 (3) (1999) 73-85.
[21] R. Heald, K. Shin, V. Reddy, I.-F. Kao, M. Khan, W.L. Lynch, G. Lauterbach, J. Petolino, 64-KByte sum-addressed-memory cache with 1.6-ns cycle and 2.6-ns latency, IEEE J. Solid State Circuits 33 (11) (1998) 1682-1689.
[22] E. Witchel, S. Larsen, C.S. Ananian, K. Asanovic, Direct addressed caches for reduced power consumption, in: MICRO, 2001, pp. 124-133.
[23] T.M. Jones, S. Bartolini, B. De Bus, J. Cavazos, M.F.P. O'Boyle, Instruction cache energy saving through compiler way-placement, in: Proc. Conference on Design, Automation and Test in Europe, 2008, pp. 1196-1201.
[24] R. Ravindran, M. Chu, S. Mahlke, Compiler-managed partitioned data caches for low power, in: International Conference on Languages, Compilers and Tools for Embedded Systems, 2007, pp. 237-247.
[25] Y. Etsion, D.G. Feitelson, L1 cache filtering through random selection of memory references, in: PACT-16, 2007, pp. 235-244.
[26] R.E. Kessler, R. Jooss, A. Lebeck, M.D. Hill, Inexpensive implementations of set-associativity, in: ISCA-16, 1989, pp. 131-139.
[27] A. Hasegawa, I. Kawasaki, K. Yamada, S. Yoshioka, S. Kawasaki, P. Biswas, SH3: high code density, low power, IEEE Micro 15 (6) (1995) 11-19.
[28] C. Zhang, F. Vahid, J. Yang, W. Najjar, A way-halting cache for low-energy high-performance systems, ACM Trans. Archit. Code Optim. 2 (1) (2005) 34-54.
[29] K.C. Yeager, The MIPS R10000 superscalar microprocessor, IEEE Micro 16 (2) (1996) 28-41.
[30] R.E. Kessler, The Alpha 21264 microprocessor, IEEE Micro 19 (2) (1999) 24-36.
[31] B. Batson, T.N. Vijaykumar, Reactive-associative caches, in: PACT, 2001, pp. 49-60.
[32] M.D. Powell, S.-H. Yang, B. Falsafi, K. Roy, T.N. Vijaykumar, Gated-Vdd: A circuit technique to reduce leakage in deep-submicron cache memories, in: ISLPED, 2000, pp. 90-95.
[33] B. Calder, D. Grunwald, J. Emer, Predictive sequential associative cache, in: HPCA-2, 1996, pp. 244-253.
[34] L.A. Barroso, K. Gharachorloo, E. Bugnion, Memory system characterization of commercial workloads, in: ISCA-25, 1998, pp. 3-14.
[35] D. Fan, Z. Tang, H. Huang, G.R. Gao, An energy efficient TLB design methodology, in: ISLPED, 2005, pp. 351-356.

[36] M. Ekman, F. Dahlgren, P. Stenstrom, TLB and snoop energy-reduction using virtual caches in low-power chip-multiprocessors, in: ISLPED, 2002, pp. 243-246.
[37] V. Delaluz, et al., Reducing dTLB energy through dynamic resizing, in: ICCD, 2003.
[38] G.W. Shen, C. Nelson, MicroTag for reducing power in a processor, Oct., 2006.
[39] R. Uhlig, R. Fishtein, O. Gershon, I. Hirsh, H. Wang, SoftSDV: A presilicon software development environment for the IA-64 architecture, Intel Technol. J. 3 (4) (1999).
[40] L. Zhao, R. Iyer, J. Moses, R. Illikkal, S. Makineni, D. Newell, Exploring large-scale CMP architecture using ManySim, IEEE Micro 27 (4) (2007) 21-33.
[41] D. Brooks, V. Tiwari, M. Martonosi, Wattch: A framework for architectural-level power analysis and optimizations, in: ISCA-27, 2000, pp. 83-94.
[42] Symantec Corp., An Analysis of Address Space Layout Randomization on Windows Vista. URL: .

Zhen Fang is a Senior Member of Technical Staff at AMD. His research interests include processor architecture, memory architecture, performance modeling and workload analysis. Fang has a PhD in computer science from the University of Utah.

Li Zhao is a Staff Research Scientist at Intel. Her research interests include processor architecture, memory architecture, and performance evaluation. Zhao received her Ph.D. degree in computer science from the University of California, Riverside.

Xiaowei Jiang is a Staff Research Scientist at Intel. His research interests include microarchitecture, architectural support for enhancing operating system performance and security, performance modeling, and cache memory systems. Jiang received his PhD degree in computer engineering from North Carolina State University.


Shih-lien Lu is a Principal Researcher at Intel. His research interests include computer architecture and microarchitecture, power efficient computing, and VLSI design. Lu leads the Oregon Architecture Lab. Lu received his Ph.D from University of California, Los Angeles.

Ravi Iyer is a Senior Principal Engineer and Director of the SoC Platform Architecture Research Group at Intel. His research focuses on future system-on-chip (SoC) and chip multiprocessor (CMP) architectures, especially small cores, accelerators, cache and memory hierarchies, fabrics, quality of service, emerging applications, and performance evaluation. Iyer has a PhD from Texas A&M University.

Tong Li is a CPU Architect at Intel. His research interests include processor microarchitecture and operating systems. Li has a PhD in computer science from Duke University.

Seung Eun Lee is an assistant professor in the Department of Electronic and Information Engineering at Seoul National University of Science and Technology. His research interests include computer architecture, multiprocessor SoCs, networks on chips (NoCs), and VLSI design. Lee has a PhD from the University of California, Irvine.