Linked instruction caches for enhancing power efficiency of embedded systems


Microprocessors and Microsystems 38 (2014) 197–207


Chang-Jung Ku, Ching-Wen Chen*, An Hsia, Chun-Lin Chen
Department of Information Engineering and Computer Science, Feng Chia University, Taichung City 40724, Taiwan
* Corresponding author. E-mail: [email protected]. doi: 10.1016/j.micpro.2014.01.006

Article history: Available online 15 February 2014.
Keywords: Instruction cache; Low power; Branch target buffer; Embedded systems

Abstract

The power consumed by memory systems accounts for about 45% of the total power consumed by an embedded system, and the power consumed by a memory access is about 10 times that of a cache access. Increasing the cache hit rate can therefore effectively reduce the power consumption of the memory system and improve system performance. In this study, we increase the cache hit rate and reduce the cache-access power consumption with a new cache architecture, the single linked cache (SLC), which stores frequently executed instructions. By adding a new link field, SLC combines the low power consumption and low access delay of a direct-mapped cache with a cache hit rate close to that of a two-way set-associative cache. In addition, we developed a further design, multiple linked caches (MLC), to reduce the power consumed by each cache access and to avoid unnecessary cache accesses when the requested data is absent from the cache. In MLC, the linked cache is split into several small linked caches that store the frequently executed instructions, reducing the power consumed by each access. To avoid unnecessary cache accesses when a requested instruction is not in the linked caches, the addresses of the frequently executed blocks are recorded in the branch target buffer (BTB). By consulting the BTB, a processor can access the memory directly to obtain the requested instruction if it is not in the cache. In our simulations, the proposed method performed better than selective compression, the traditional cache, and the filter cache in terms of cache hit rate, power consumption, and execution time.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Embedded systems are essentially application-specific microcomputers. Embedded products such as mobile phones, GPS receivers, and MP3 players are now widely used. Most embedded systems are designed as mobile devices or are embedded in other devices whose power is supplied mainly by batteries. Therefore, in addition to performance and cost, power consumption is a significant design factor, and it is important to design power-saving embedded systems to prolong the battery life of such products. According to previous studies, memory system accesses account for about 45% [6,11,21-23] of the total power consumed by embedded systems, so decreasing the power consumed by the memory system would decrease the total power consumed by a system significantly.

According to [8], the power consumed for each memory access is 8-10 times that consumed for each cache access, so reducing the number of memory accesses would greatly improve the power consumption of an embedded system. Many previous studies have aimed to reduce the power consumption of memory systems [1-5,7-8,10,15-18,21,24-29]. We categorize these works into three types: (1) adding a small cache between the CPU and the L1 cache to reduce the access power [2,15-16,24-25]; previous authors [15] have stated that although the additional small cache can reduce memory system power consumption by 40-50%, it has a higher miss rate because it is too small to capture the locality of the executed programs. (2) Increasing the degree of cache associativity [10,21] to reduce the number of conflict misses; this also increases the number of tag comparisons, so the power consumed for each cache access increases as the degree of associativity increases. (3) Using a compression cache to reduce the number of memory accesses [1,3-5,7-8,17-18,26-29]; however, this method incurs extra power and performance penalties when decompressing the instructions fetched from the compression cache.


In addition, the addresses of the compressed instructions differ from those of the original instructions, so an address must be translated before accessing the compression cache, which leads to extra power consumption.

According to the above analysis, the following issues should be considered when designing a power-aware cache-memory hierarchy: the cache hit rate, the power consumed by a single cache access, and the avoidance of address translation. In this paper, we propose a new cache architecture called the single linked cache (SLC), which addresses these problems to deliver power savings. First, a reference table is used to store the frequently executed instructions and increase the hit rate. We also provide a mapping relationship between the addresses and the frequently executed instructions in the reference table, so a processor can fetch the requested instructions from the table without address translation. Second, we reduce the power consumed when accessing the reference table by using a low-associativity strategy, while a mechanism of entry linkage is used to resolve the conflict misses caused by the low associativity. Finally, although frequently executed instructions comprise 10-20% of the total instructions in a program, only some of them are used during any short period of time. Thus, to further reduce the power consumed by a single access to the frequently executed instruction reference table, we designed multiple linked caches (MLC). In addition, to avoid unnecessary cache accesses when a requested instruction is not in the linked caches, the addresses of the frequently executed blocks are also recorded in the branch target buffer (BTB). By consulting the BTB, a processor can access the memory to obtain the requested instruction directly if the instruction is not in the cache, which avoids unnecessary cache accesses.

This paper is organized as follows. Section 2 reviews related works. Section 3 explains how the proposed SLC and MLC work. Section 4 presents the experimental results. Section 5 concludes this paper.

2. Related works

Previous studies of power-saving embedded systems have focused on reducing the power consumption of the processors and the memory system interface, mainly by reducing the number of memory accesses and the power consumption of the memory bus. In addition to saving power, reducing the number of memory accesses can also reduce the execution time. Therefore, Benini and Yoshida proposed selective compression to reduce the number of memory accesses [4,29]. The method of [29] compresses frequently executed instructions and packages multiple compressed instructions into a virtual instruction; when a processor fetches a virtual instruction from the cache, it decompresses it to obtain multiple instructions, reducing the number of memory accesses. The method of [4] stores frequently executed instructions in the memory without packaging; when a compressed instruction is fetched, the bus switches from its original 32-bit transmission to 8-bit transmission to reduce the power consumption. These studies mainly reduce the number of memory accesses by compressing frequently executed instructions. However, to decompress the compressed instructions, the system must maintain a frequently executed instruction reference table that stores the original frequently executed instructions. When a virtual or compressed instruction is fetched, it is decompressed into multiple instructions or a single uncompressed instruction by looking up this reference table. Because of the compression, the address of an instruction in the memory differs from the instruction address issued by the processor, so the system also needs a look-ahead table (LAT) to record the mapping relationship between the address of a compressed instruction in the memory and the address of the original instruction. Accessing the LAT before each instruction fetch from memory therefore introduces overhead.

Other researchers [2,15-16,24-25] have suggested replacing a large cache with several small caches, because the power consumed when accessing a large cache is considerably larger than that consumed when accessing a small one. In these approaches, a small cache, known as a filter cache [15], is added between the processor and the cache. When the processor sends an address request, the filter cache is accessed first, with low power, and the requested instruction is fetched from it if the access hits; if a miss occurs in the filter cache, the cache is accessed. Utilizing a filter cache reduces the access power, but it does not improve the access time, because there is little difference between the access times of a large cache and a small one, and the limited size of the filter cache results in a significant number of conflict misses, which degrade system performance. In addition, Kim [14] added a small cache, called the LPT-cache, to the memory system to capture most cache accesses and thereby reduce power consumption. Basic blocks of instructions are stored in the LPT-cache instead of cache lines mapped in the traditional way, and the related information of the stored basic blocks is recorded in the branch target buffer, from which the processor learns where to fetch the needed instructions in the LPT-cache. Zhang [30] pointed out that the utilization of the sets in a cache is non-uniform, so some sets are seldom used, which decreases the cache hit rate; Zhang therefore proposed a new cache architecture called the Efficient Cache, which modifies the cache decoders to balance the utilization of each set, improving the hit rate and reducing the power consumption of the memory system.

3. Proposed method

In this section, we introduce our proposed linked cache, which reduces the power consumption of the memory system. The proposed linked cache increases the cache hit rate and avoids address translation, while also reducing the power consumption. A direct mapping strategy is used in the linked cache to reduce the access power, while a link field is added to each cache block to reduce conflict misses. To further reduce the power consumption, we designed multiple linked caches, in which the linked cache is divided into several parts to reduce the access power. To avoid unnecessary cache accesses when the requested instructions are not in the linked caches, the addresses of the frequently executed instruction blocks are recorded in the branch target buffer (BTB). By consulting the BTB, a processor can access the memory to obtain the requested instruction directly if it is not in the cache, which avoids unnecessary cache accesses. Section 3.1 explains the design of the single linked cache (SLC) and Section 3.2 describes the design of multiple linked caches (MLC).

3.1. Design of the single linked cache (SLC)

In this section, we explain the design of our proposed SLC. To increase the cache hit rate and avoid address translation before fetching an instruction, we analyzed trace files and stored the instructions of the most frequently executed addresses in the linked cache using direct mapping and linking. Unlike previous studies [3], which selected the most executed instructions as the frequently executed instructions and required costly address translation before accessing them, we collected the corresponding instructions of the most frequently executed addresses from the trace files of applications and stored them in the SLC.


Therefore, a processor can fetch the frequently executed instructions from the SLC using the requested addresses directly, without address translation. A direct mapping strategy is used in SLC for power saving, and a link field is added to each block in the linked cache to reduce conflict misses. If there are n blocks in SLC, the n most frequently executed instructions of an application are selected for placement in SLC. The link field stores the index of the block that holds a conflicting instruction. Thus, each block in SLC has four fields: the valid bit, tag, instruction, and link.

Next, we introduce how the frequently executed addresses and instructions are stored in the SLC. Before they are stored, the frequently executed addresses and their corresponding instructions are sorted according to their execution frequencies, which are obtained from the trace files. The instructions are then stored in the SLC in two stages. During the first stage, the frequently executed instructions are placed in the SLC according to the sorted results if no conflict occurs; if a conflict occurs, the instruction with the lower execution frequency is set aside to be processed during the second stage. The first stage continues until every frequently executed instruction has either been stored in the SLC or reserved for the second stage. During the second stage, the conflicting instructions reserved in the first stage are placed in the blank blocks of the SLC in order of their execution frequency. When a reserved instruction is placed in the SLC, the valid field of its block is set to N. The link field of the block to which this instruction was originally mapped is also checked. If that link field is NULL, the block encountered a conflict during the first stage, so the index of the newly allocated block is written to this field.

However, if the link field of the original mapped block is not NULL, we follow the links from the original mapped block until a block with a NULL link field is found, and then write the index of the newly allocated block into that field.

We provide an example of the proposed two-stage placement as follows (a C sketch of this placement procedure follows Fig. 1b). Assume that there are eight blocks in SLC and that all valid fields are initially set to N. The eight most frequently executed addresses and their corresponding instructions are shown in Figs. 1a and 1b. In the first stage, three addresses and their corresponding instructions are placed in blocks 100, 101, and 011 without any conflict using the direct mapping strategy, and the valid field of each of these three blocks is set to T (true). When the instruction of address 4 is to be placed in block 100, a conflict occurs because the valid field of block 100 has already been set to T; the instruction of this address is therefore reserved for the second stage. Addresses 5, 6, and 7 all map to block 000, so only address 5 can be placed in block 000 and addresses 6 and 7 are reserved. Finally, because the valid field of block 001 is N and address 8 maps to block 001, its instruction is placed in block 001. The instructions placed in SLC after the first stage are shown in Fig. 1a. In the second stage, addresses 4, 6, and 7 are placed in the blank blocks of the SLC. The blank block with index 010 is used to hold instruction 4, and the valid field of this block is set to N. In addition, the link field of the block to which instruction 4 was originally mapped, i.e., the link field of block 100, is set to 010, as shown in Fig. 1b. Likewise, instruction 6 is placed in block 110 and the link field of block 000, where instruction 6 was originally mapped, is set to 110. Finally, instruction 7 is placed in block 111, and the link field of block 000, its original mapped block, should be set to 111. However, the link field of block 000 is not NULL, so we follow the link from block 000 to the link field of block 110. Because the link field of block 110 is NULL, it is set to the index of block 111, as shown in Fig. 1b.

Frequently executed addresses and instructions:
  address 1: 0000200D  3afffffc
  address 2: 00002374  e0d1c0f2
  address 3: 0000200B  e1530001
  address 4: 0000200C  34832004
  address 5: 00002378  bafffffa
  address 6: 00002508  e2400b01
  address 7: 00002870  b2600000
  address 8: 000025A1  e1a04001

Single linked cache (SLC) after the first stage:
  Index  Valid  Tag      Instruction  Link
  000    T      000011B  bafffffa     Null
  001    T      000012D  e1a04001     Null
  010    N      -        -            Null
  011    T      0000100  e1530001     Null
  100    T      000011B  e0d1c0f2     Null
  101    T      0000100  3afffffc     Null
  110    N      -        -            Null
  111    N      -        -            Null

Fig. 1a. Placement of frequently executed instructions in SLC during the first stage.

Single linked cache (SLC) after the second stage (frequently executed addresses and instructions as in Fig. 1a):
  Index  Valid  Tag      Instruction  Link
  000    T      000011B  bafffffa     110
  001    T      000012D  e1a04001     Null
  010    N      0000100  34832004     Null
  011    T      0000100  e1530001     Null
  100    T      000011B  e0d1c0f2     010
  101    T      0000100  3afffffc     Null
  110    N      0000128  e2400b01     111
  111    N      0000143  b2600000     Null

Fig. 1b. Placement of frequently executed instructions in SLC during the second stage.
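The two-stage placement described above can be sketched in C as follows. This is a minimal illustration under our own assumptions (an slc_block_t record with the four fields of an SLC block, a profile array already sorted by execution frequency, and an index function IDX taken from the low-order address bits); it is not the authors' implementation.

#include <stdint.h>
#include <stddef.h>

#define SLC_BLOCKS 8                         /* 8 blocks, as in the example of Fig. 1 */
#define IDX(addr)  ((addr) % SLC_BLOCKS)     /* assumed direct-mapping index function */
#define NO_LINK    0xFF                      /* "Null" link value */

typedef struct {
    int      valid;                          /* T (1) only for instructions placed in stage 1 */
    uint32_t tag;
    uint32_t instruction;
    uint8_t  link;                           /* index of the block holding a conflicting entry */
} slc_block_t;

typedef struct { uint32_t addr, instr; } profile_entry_t;   /* assumed profiling record */

/* prof[] holds the n (n <= SLC_BLOCKS) most frequently executed addresses and
 * instructions, sorted by descending execution frequency from the trace files. */
static void slc_place(slc_block_t slc[SLC_BLOCKS], const profile_entry_t *prof, size_t n)
{
    int used[SLC_BLOCKS] = {0};
    size_t reserved[SLC_BLOCKS];
    size_t n_res = 0;

    for (uint8_t b = 0; b < SLC_BLOCKS; b++) {
        slc[b].valid = 0;                    /* all valid fields start as N */
        slc[b].link  = NO_LINK;              /* all link fields start as Null */
    }

    /* Stage 1: place each instruction in its directly mapped block;
     * on a conflict, reserve the less frequent instruction for stage 2. */
    for (size_t i = 0; i < n; i++) {
        uint8_t b = IDX(prof[i].addr);
        if (!used[b]) {
            used[b] = 1;
            slc[b].valid       = 1;          /* set to T */
            slc[b].tag         = prof[i].addr / SLC_BLOCKS;
            slc[b].instruction = prof[i].instr;
        } else {
            reserved[n_res++] = i;
        }
    }

    /* Stage 2: put reserved instructions into blank blocks (valid stays N)
     * and chain them from their originally mapped block via the link field. */
    for (size_t r = 0; r < n_res; r++) {
        const profile_entry_t *e = &prof[reserved[r]];
        uint8_t free_b = NO_LINK;
        for (uint8_t b = 0; b < SLC_BLOCKS; b++)
            if (!used[b]) { free_b = b; break; }
        if (free_b == NO_LINK)
            break;                           /* no blank block left */

        used[free_b]            = 1;
        slc[free_b].tag         = e->addr / SLC_BLOCKS;
        slc[free_b].instruction = e->instr;

        uint8_t b = IDX(e->addr);            /* walk the chain from the original mapped block */
        while (slc[b].link != NO_LINK)
            b = slc[b].link;
        slc[b].link = free_b;                /* append the newly allocated block */
    }
}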


After the addition of the link fields, the conflict misses caused by the low-associativity mapping mechanism can be resolved efficiently. In the following, we explain how instructions are fetched correctly using the link fields.

After the frequently executed addresses and their corresponding instructions have been placed in SLC, an instruction is fetched from SLC without address transformation as follows. An address can be divided into tag, index, and block-offset fields. When the CPU sends an address request, the mapped block in SLC is checked according to the index field of the issued address. If the valid field of the mapped block is N, the requested instruction is not in SLC, and the processor accesses the memory for the requested instruction. If the valid field is T, the requested instruction may be in SLC, so the tag of the issued address is compared with the tag of the mapped block. If the two tags are the same, a cache hit occurs and the instruction is fetched and sent to the processor. If the two tags differ, the link field of the mapped block is checked. If the link field is NULL, the requested instruction is not in SLC, i.e., a cache miss occurs, and the processor fetches the instruction from the memory. If the link field is not NULL, we follow the links and compare the tag of each block pointed to by a link field with the tag of the address, until the tags match or a NULL link field is reached.

This design may cause multiple SLC accesses when a processor fetches a conflicting instruction, but it does not greatly increase the power consumption or degrade system performance, because the instructions reached through the link fields conflict with instructions that are executed more often than they are. If the linked-cache mechanism were not used, a miss would occur when accessing such a conflicting instruction, and the processor would have to access the memory to fetch it. With SLC, the link fields reduce the miss rate, so the proposed method fetches the frequently executed instructions from SLC efficiently and significantly reduces the number of memory accesses. The improvement in cache hit rates obtained with the link field is shown in Section 4; the results show that SLC reduces conflict misses and improves the cache hit rate by about 10% on average.
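The lookup procedure just described (valid check, tag comparison, and link following) can be sketched as follows, reusing the hypothetical slc_block_t layout from the placement sketch; it illustrates the described behaviour rather than the actual hardware.

#include <stdint.h>
#include <stdbool.h>

#define SLC_BLOCKS 8
#define IDX(addr)  ((addr) % SLC_BLOCKS)     /* same assumed index function as above */
#define NO_LINK    0xFF

typedef struct { int valid; uint32_t tag, instruction; uint8_t link; } slc_block_t;

/* Returns true and writes the instruction to *out on an SLC hit; returns false
 * on an SLC miss, in which case the processor fetches the instruction from memory. */
static bool slc_lookup(const slc_block_t slc[SLC_BLOCKS], uint32_t addr, uint32_t *out)
{
    uint8_t  b   = IDX(addr);                /* directly mapped block */
    uint32_t tag = addr / SLC_BLOCKS;

    if (!slc[b].valid)                       /* valid == N: the instruction is not in SLC */
        return false;

    for (;;) {
        if (slc[b].tag == tag) {             /* tags match: SLC hit */
            *out = slc[b].instruction;
            return true;
        }
        if (slc[b].link == NO_LINK)          /* end of the chain: SLC miss */
            return false;
        b = slc[b].link;                     /* follow the link to the conflicting block */
    }
}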

3.2. Design of multiple linked caches (MLC)

In this section, we explain the design of MLC, which further reduces the power consumed when accessing the linked cache. In Section 3.1, the most frequently executed instructions were selected by analyzing the trace files and stored in SLC to reduce the number of memory accesses and the power consumption. However, each of the most frequently executed instructions is not executed frequently throughout the entire running time of a program; only some of the frequently executed instructions are executed frequently during any given period, so some blocks are accessed frequently in a specific interval whereas other blocks remain idle. Because the power consumed when accessing a memory unit grows with its size, we store the frequently executed instructions of different intervals in different small caches to further reduce the power consumed when accessing SLC. Moreover, if a requested instruction is not in the linked cache, the processor still queries the linked cache before it accesses memory, which increases the power consumption and degrades performance. We therefore also introduce a design that avoids querying the linked cache when the desired instruction is not there, which saves power.

To store the frequently executed instructions of different execution intervals in different small linked caches, we divide the frequently executed instructions into units of basic blocks, as shown in Fig. 2 (a sketch of this division is given below). A basic block [12] is a straight-line code sequence with no branches into it except at its entry (the target address) and no branches out of it except at its exit. We refer to a basic block formed from the frequently executed instructions as a frequently executed basic block (FEBB). The start address of a FEBB is the target address of a branch instruction, which identifies the basic block that may be executed after that branch, as shown in Fig. 2. By allocating the FEBBs to different small linked caches, the power consumed when accessing a small linked cache can be reduced effectively.

To avoid querying the linked caches, and thereby save power when a desired instruction is not in them, we use a branch target buffer (BTB) to record which linked cache contains the frequently executed basic block to which a branch instruction will jump. This allows us to determine whether the basic block that needs to be executed is in the linked caches. If it is, the processor fetches the desired instruction from the linked cache indicated by the BTB; otherwise, the processor fetches the desired instruction from the memory directly. Fig. 3 shows the MLC design and Fig. 4 shows the two fields added to the BTB. In Fig. 3, the multiple linked caches are marked as linked cache #1 to linked cache #N, and a MUX is added to the architecture to allow the processor to fetch a desired instruction from MLC, as shown in Fig. 3.
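As an illustration of the division into basic blocks described above, the sketch below marks basic-block leaders in an instruction listing (the first instruction, every branch target, and every instruction that follows a branch); an FEBB then runs from one leader to the next. The insn_t trace representation is an assumption made for illustration only.

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical representation of one profiled instruction. */
typedef struct {
    uint32_t addr;
    bool     is_branch;
    uint32_t target;                         /* branch target address, meaningful when is_branch */
} insn_t;

/* Sets leader[i] = true when code[i] starts a basic block. */
static void mark_basic_block_leaders(const insn_t *code, size_t n, bool *leader)
{
    for (size_t i = 0; i < n; i++)
        leader[i] = false;
    if (n > 0)
        leader[0] = true;                    /* the entry instruction starts a block */

    for (size_t i = 0; i < n; i++) {
        if (!code[i].is_branch)
            continue;
        if (i + 1 < n)
            leader[i + 1] = true;            /* the fall-through instruction after a branch */
        for (size_t j = 0; j < n; j++)       /* the branch target starts a block */
            if (code[j].addr == code[i].target)
                leader[j] = true;
    }
}
/* Each run of instructions from one leader up to the next forms a basic block;
 * the frequently executed ones (FEBBs) are assigned to the small linked caches,
 * and their locations are recorded in the BTB as described below. */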

Fig. 2. Dividing frequently executed instructions into basic blocks.

Table 1. The selected benchmarks from MediaBench II and MiBench.

  MediaBench II
    h.263      A video coder and decoder based on the H.263 standard
    h.264      A video coder and decoder based on the H.264 standard
    jpeg       An image coder and decoder based on the JPEG standard
    jpeg2000   An image coder and decoder based on the JPEG-2000 standard
    mpeg2      A video coder and decoder based on MPEG-2
    mpeg4      A video coder and decoder based on MPEG-4
  MiBench
    basicmath  The basic math test performs simple mathematical calculations
    FFT        A Fast Fourier Transform
    FFT_Inv    A Fast Fourier inverse transform
    qsort      The qsort test sorts a large array of strings into ascending order
    cjpeg      Lossy compression image format
    susan      An image recognition package

Fig. 3. Multiple linked caches (MLC) architecture.

If the processor sends an address request and the requested instruction is in the linked caches, MUX2 selects the appropriate linked cache from the N linked caches and MUX1 selects the data from MUX2; if the requested instruction is not in the linked caches, MUX1 selects the data from the main memory. To ensure that MUX1 and MUX2 select the correct data from the linked caches or the main memory, two fields, the taken linked cache number (TLCN) and the not-taken linked cache number (NTLCN), are added to the branch target buffer (BTB); they record the number of the linked cache in which the basic block of the taken or not-taken target address is stored, respectively. If the target basic block is not in the linked caches, the corresponding BTB field records a null value. Therefore, MUX2 can fetch a desired instruction from the correct linked cache using the data stored in the BTB. If a null value is fetched from the BTB, the basic block that needs to be executed is not in the linked caches, so the processor obtains the desired instruction from the memory via MUX1 without querying the linked caches. This design effectively prevents the processor from accessing the linked caches when the desired instruction is absent from them. Fig. 4 shows an example in which the BTB records either a null value or the number of the linked cache where a frequently executed basic block is stored. Using the data recorded in the BTB, we can determine which linked cache holds the basic block that needs to be executed. There are two types of branch instructions: direct branches and indirect branches.


For a direct branch, e.g., an if-else statement or a for loop, we can determine the possible target address of the branch instruction by trace profiling. For an indirect branch, however, we cannot identify the target address in the same way, because the target address of an indirect branch is not computed until the program executes. When an indirect branch instruction is executed, the fetched TLCN and NTLCN values are null, and the processor fetches the requested instruction from memory. In addition, MLC uses the BTB to record the locations in which the frequently executed basic blocks are stored. According to [13], a basic block contains about 4 to 16 instructions; moreover, in today's embedded processors the BTB usually has 2 K entries, and the number of frequently executed instructions in an application is about 256 [29]. Therefore, the number of BTB entries needed for MLC is smaller than the number of entries available in the BTB. In our method, MLC consists of multiple linked caches and saves power by accessing small linked caches; according to CACTI [19], the power consumed by accessing a smaller linked cache is less than that of a bigger one. However, if a linked cache is too small to store a whole basic block, the power consumption increases. Since, according to [13], most basic blocks contain between 4 and 16 instructions, we set the size of each linked cache in our MLC design to 64 bytes, which can store at most 16 instructions.
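The fetch-path decision made with the TLCN and NTLCN fields can be sketched as follows. The btb_entry_t layout, the LC_NULL marker, and the two helper functions standing in for the MUX1/MUX2 datapaths are assumptions made purely for illustration.

#include <stdint.h>
#include <stdbool.h>

#define LC_NULL 0xFF                         /* "null" linked-cache number recorded in the BTB */

/* A BTB entry extended with the two new fields (Fig. 4). */
typedef struct {
    uint32_t branch_pc;
    uint32_t taken_target;
    uint8_t  tlcn;                           /* linked cache holding the taken-path FEBB, or LC_NULL */
    uint8_t  ntlcn;                          /* linked cache holding the not-taken-path FEBB, or LC_NULL */
} btb_entry_t;

/* Stubs standing in for the real datapaths (assumptions for illustration). */
static uint32_t linked_cache_fetch(uint8_t cache_no, uint32_t addr) { (void)cache_no; (void)addr; return 0; }
static uint32_t memory_fetch(uint32_t addr) { (void)addr; return 0; }

/* Fetch the first instruction of the basic block executed after a branch.
 * 'taken' is the resolved (or predicted) branch direction. */
static uint32_t fetch_after_branch(const btb_entry_t *e, bool taken, uint32_t fall_through)
{
    uint8_t  lc   = taken ? e->tlcn : e->ntlcn;
    uint32_t addr = taken ? e->taken_target : fall_through;

    if (lc == LC_NULL)                       /* target FEBB is not in any linked cache:      */
        return memory_fetch(addr);           /* go to memory directly, without a cache query */

    return linked_cache_fetch(lc, addr);     /* MUX2 selects linked cache #lc; MUX1 takes its output */
}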

Fig. 4. Values of TLCN and NTLCN in BTB after frequently executed basic blocks are stored in linked caches.

Fig. 5. Hit rates of the traditional cache, filter cache, selective compression, SLC, and MLC for each benchmark using different cache sizes: (a) 512 B, (b) 1 KB, (c) 2 KB, (d) 4 KB.

Fig. 6. Comparison of the normalized power consumption of the traditional cache, filter cache, selective compression, SLC, and MLC for each benchmark using different cache sizes: (a) 512 B, (b) 1 KB, (c) 2 KB, (d) 4 KB.


Table 2. System configuration parameters used in the simulations.

  Parameter                        Value
  CPU clock speed                  1 GHz
  I-cache                          512 B, 1 KB, 2 KB, 4 KB, direct-mapped
  I-cache latency                  1 cycle
  D-cache                          512 B, 1 KB, 2 KB, 4 KB, direct-mapped
  D-cache latency                  1 cycle
  Memory latency                   100 cycles
  BTB                              2 K entries, 2-way
  Size of a linked cache in MLC    64 bytes

Table 3. Hit rate improvements of SLC over the traditional cache, filter cache, and selective compression.

  Comparison                 512 B       1 KB        2 KB        4 KB
  Improving range
    Traditional cache        5-36%       0-34%       0-30%       0-25%
    Filter cache             6-20%       3-14%       2-15%       1-8%
    Selective compression    -1 to 0%    -1 to 0%    -1 to 0%    -1 to 0%
  Improving average
    Traditional cache        25%         17%         13%         9%
    Filter cache             11%         10%         9%          4%
    Selective compression    0.3%        0.3%        0.3%        0.2%

4. Simulation results

In this section, we compare the proposed SLC and MLC designs with the traditional cache, selective compression [3], and filter cache [24] in terms of cache hit rate, power consumption, and execution time for different cache sizes. We used ARM SDT 2.5 to simulate an ARM system environment. Six benchmarks from MediaBench II and six from MiBench were selected and run in their entirety in our simulations, as shown in Table 1; the other system configurations are listed in Table 2. In addition, the power simulation tool CACTI 5.3 [19] was used to determine the dynamic and static power consumption. The power consumption evaluation followed published methods [9,20] and included both dynamic and static energy, as shown in Eq. (1). The dynamic energy accounts for the energy of each cache access and each memory access, as shown in Eq. (2), while the static energy is the leakage energy consumed even when the processor does not execute any computation, as shown in Eq. (3).

Total Energy = Dynamic Energy + Static Energy    (1)

Dynamic Energy = (number of cache accesses × cache access energy) + (number of memory accesses × memory access energy)    (2)

Static Energy = (execution time × leakage power of cache) + (execution time × leakage power of memory)    (3)
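As a small worked example of Eqs. (1)-(3), the sketch below computes the total energy from simulated access counts and CACTI-style per-access and leakage figures; the numeric values are placeholders, not results from our simulations.

#include <stdio.h>

/* Inputs that would come from the simulator and from CACTI 5.3; the values
 * used in main() are placeholders for illustration only. */
struct energy_inputs {
    double cache_accesses, memory_accesses;            /* access counts          */
    double cache_access_energy, memory_access_energy;  /* energy per access (J)  */
    double cache_leakage_power, memory_leakage_power;  /* leakage power (W)      */
    double execution_time;                             /* execution time (s)     */
};

static double total_energy(const struct energy_inputs *in)
{
    double dynamic_energy = in->cache_accesses  * in->cache_access_energy      /* Eq. (2) */
                          + in->memory_accesses * in->memory_access_energy;
    double static_energy  = in->execution_time * in->cache_leakage_power       /* Eq. (3) */
                          + in->execution_time * in->memory_leakage_power;
    return dynamic_energy + static_energy;                                     /* Eq. (1) */
}

int main(void)
{
    struct energy_inputs in = {
        .cache_accesses = 1.0e7,        .memory_accesses = 1.0e6,
        .cache_access_energy = 5.0e-11, .memory_access_energy = 5.0e-10,
        .cache_leakage_power = 1.0e-3,  .memory_leakage_power = 1.0e-2,
        .execution_time = 0.05,
    };
    printf("total energy = %.3e J\n", total_energy(&in));
    return 0;
}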

Fig. 5 shows the instruction cache hit rates of the traditional cache, filter cache, selective compression, SLC, and MLC for different cache sizes. The average hit rates with SLC were 55%, 71%, 84%, and 92% for the 512 B, 1 KB, 2 KB, and 4 KB caches, respectively, which were 9-25% better on average than the traditional cache and 4-11% better on average than the filter cache, as shown in Table 3.

Because the proposed MLC design uses the BTB to determine whether the requested instruction is in MLC and thus avoids unnecessary accesses, the hit rate of MLC was 100%. In the traditional cache, filter cache, SLC, and MLC designs, when the processor issues an instruction address, the structure is accessed to determine whether the desired instruction is present, so these designs can be evaluated directly by their hit rates. In the selective compression design, however, the authors stored frequently executed instructions in a frequently executed instruction table through profiling, encoded the frequently executed instructions in memory as indices, and compressed multiple indices into one compressed instruction. When the processor issues an instruction address, the address is translated to access memory. If a compressed instruction is fetched from memory, the frequently executed instruction table is accessed to obtain several instructions from the indices extracted from the fetched compressed instruction; if a normal instruction is fetched, it is sent to the processor for execution and the table is not accessed. Therefore, the hit rate reported for the selective compression method is the ratio of the number of accesses to the frequently executed instruction table to the number of instruction requests issued by the processor.

Fig. 6 compares the power consumption of the traditional cache, filter cache, selective compression, SLC, and MLC using direct mapping with different cache sizes. Table 4 shows the improvement of SLC over the traditional cache, filter cache, and selective compression, while Table 5 shows that of MLC over these related works. Fig. 6 also shows that SLC consumed slightly more power than the traditional cache for the h.263 benchmark when the cache size is 4 KB. This is because the hit rates of the traditional cache and SLC are almost the same at 4 KB, so the hit rate improvement obtained with the link fields for h.263 was limited, and the power consumption of SLC increased because a hit may require more than one access. MLC also uses linked caches to improve the hit rate, but its design effectively reduces the number of MLC accesses by using the BTB to avoid unnecessary accesses, and it also reduces the access power because a small linked cache is accessed; thus, the total power consumption of MLC was lower than that of the other methods. The filter cache could not reduce the number of conflict misses, which increased the number of memory accesses and the power consumption, whereas SLC and MLC utilized the linked caches to reduce conflict misses and save power. Selective compression decreases the power consumption by reducing the number of memory accesses, but when the processor sends a memory request it must perform address translation by looking up the LAT, which consumes more power than SLC and MLC.

Fig. 7 compares the execution times of the traditional cache, filter cache, selective compression, SLC, and MLC. Table 6 shows the improvement in execution time of SLC over the traditional cache, filter cache, and selective compression, while Table 7 shows that of MLC over these related works. Fig. 7 shows that storing the frequently executed instructions in SLC and MLC to increase the hit rate decreased the number of memory accesses required, thereby reducing the execution time. The filter cache increased the number of memory accesses required, which lengthened the execution time, because it was unable to reduce the number of conflict misses. The selective compression method compressed the instruction codes to reduce the number of memory accesses and the power consumption, but it needed to translate a requested address before accessing the memory, which incurred a long memory access delay and increased the execution time. Fig. 7 also shows that the average execution time of MLC was about 1% shorter than that of SLC.

205

C.-J. Ku et al. / Microprocessors and Microsystems 38 (2014) 197–207

Normalized execution times (%)

Traditional Cache Selective Compression MLC

Filter Cache SLC

125% 100% 75% 50% 25% 0% h.263

h.264

jpeg

jpe2000

mpeg2

mepg4

basicmath

FFT

FFT_Inv

qsort

cjpeg

susan

(a) With 512 B cache

Normalized execution times (%)

Traditional Cache Selective Compression MLC

Filter Cache SLC

150% 125% 100% 75% 50% 25% 0% h.263

h.264

jpeg

jpe2000

mpeg2

mepg4

basicmath

FFT

FFT_Inv

qsort

cjpeg

susan

(b) With 1 KB cache Traditional Cache Selective Compression MLC

Normalized execution times (%)

250%

Filter Cache SLC

200%

150%

100%

50%

0% h.263

h.264

jpeg

jpe2000

mpeg2

mepg4

basicmath

FFT

FFT_Inv

qsort

cjpeg

susan

(c) With 2 KB cache

Normalized execution times (%)

Traditional Cache Selective Compression MLC

Filter Cache SLC

500% 400% 300% 200% 100% 0% h.263

h.264

jpeg

jpe2000

mpeg2

mepg4

basicmath

FFT

FFT_Inv

qsort

cjpeg

susan

(d) With 4 KB cache Fig. 7. Comparison of the normalized execution times with traditional cache, filter cache, selective compression, SLC and MLC using different cache sizes.


Table 4. Power consumption improvements of SLC over the traditional cache, filter cache, and selective compression.

  Comparison                 512 B       1 KB       2 KB       4 KB
  Improving range
    Traditional cache        8-38%       5-30%      3-36%      -8 to 35%
    Filter cache             3-10%       4-18%      2-25%      1-38%
    Selective compression    7-26%       6-32%      23-56%     32-64%
  Improving average
    Traditional cache        23%         18%        18%        13%
    Filter cache             6%          9%         14%        14%
    Selective compression    13%         22%        40%        49%

Table 5. Power consumption improvements of MLC over the traditional cache, filter cache, and selective compression.

  Comparison                 512 B       1 KB       2 KB       4 KB
  Improving range
    Traditional cache        9-38%       7-33%      5-39%      4-40%
    Filter cache             5-14%       7-20%      5-28%      7-43%
    Selective compression    7-28%       8-33%      25-58%     35-68%
  Improving average
    Traditional cache        25%         20%        22%        20%
    Filter cache             8%          11%        18%        21%
    Selective compression    14%         24%        43%        53%

Table 6. Execution time improvements of SLC over the traditional cache, filter cache, and selective compression.

  Comparison                 512 B       1 KB       2 KB       4 KB
  Improving range
    Traditional cache        1-38%       1-32%      1-42%      1-49%
    Filter cache             2-14%       3-28%      3-32%      1-20%
    Selective compression    7-36%       16-44%     31-73%     44-84%
  Improving average
    Traditional cache        21%         17%        22%        23%
    Filter cache             8%          9%         16%        10%
    Selective compression    19%         32%        54%        67%

Table 7. Execution time improvements of MLC over the traditional cache, filter cache, and selective compression.

  Comparison                 512 B       1 KB       2 KB       4 KB
  Improving range
    Traditional cache        2-38%       1-32%      1-42%      1-49%
    Filter cache             3-14%       3-29%      3-33%      1-20%
    Selective compression    7-36%       16-45%     31-73%     44-84%
  Improving average
    Traditional cache        22%         17%        22%        23%
    Filter cache             8%          10%        16%        10%
    Selective compression    20%         32%        54%        67%

This is because MLC avoids unnecessary linked cache accesses by consulting the BTB, and because a small linked cache has a lower access time than a large one. However, the penalty of a cache miss is much greater than the time needed to access a linked cache, so the execution time improvement of MLC over SLC is not significant. Comparing the power consumption of SLC and MLC with the simulation tool CACTI 5.3 [19] showed that the dynamic power grows faster than the access time as the cache capacity increases. Therefore, splitting the linked cache into several small linked caches and avoiding unnecessary accesses to the linked cache allowed MLC to reduce the power consumption by 8% compared with SLC, as shown in Fig. 6.

5. Conclusion

In this paper, we proposed two power-saving cache designs, SLC and MLC, which increase the cache hit rate, improve system performance, and reduce system power consumption. SLC stores the frequently executed instructions using direct mapping, so its power consumption and access delay are similar to those of a direct-mapped cache, while the added link fields resolve the conflict misses caused by the low associativity and give a hit rate close to that of a two-way set-associative cache. These properties ensure that SLC has a high cache hit rate, which saves power. MLC was proposed to further reduce the power consumed by each cache access and to prevent the processor from accessing the cache when the desired instruction is not there. In MLC, the frequently executed instruction blocks are stored in different linked caches according to their locality, and the linked cache numbers of the frequently executed instruction blocks are recorded in the BTB. Thus, the dynamic power of a linked cache access is reduced because a smaller linked cache is accessed. By consulting the BTB, the processor can determine whether the requested instruction is in MLC; if it is not, the processor accesses the memory directly, avoiding unnecessary cache queries and thereby saving power and improving performance.

In the simulations, we compared SLC and MLC with selective compression, the traditional cache, and the filter cache in terms of hit rate, power consumption, and execution time. The results showed that SLC improved the hit rate by 9-25% and 4-11% compared with the traditional cache and filter cache, respectively, while MLC improved the hit rate by 16-70% and 11-55% compared with the traditional cache and filter cache, respectively. In terms of power consumption, SLC consumed 69%, 82%, and 89% of the power consumed by selective compression, the traditional cache, and the filter cache, respectively, while MLC consumed 66%, 78%, and 85%, respectively. In terms of execution time, SLC and MLC were almost the same and required 57%, 79%, and 89% of the execution time of selective compression, the traditional cache, and the filter cache, respectively.

Acknowledgement

This research was supported by the National Science Council under grants NSC101-2221-E-035-067 and NSC102-2221-E-035-030-MY2.

References

[1] Alaa R. Alameldeen, David A. Wood, Frequent Pattern Compression: A Significance-Based Compression Scheme for L2 Caches, Department of Computer Sciences Technical Report CS-TR-2004-1500, 2004.
[2] Kashif Ali, Mokhtar Aboelaze, Suprakash Datta, Modified Hotspot Cache Architecture: A Low Energy Fast Cache for Embedded Processors, in: International Conference on Embedded Computer Systems, 2006, pp. 35-42.
[3] Luca Benini, Alberto Macii, Alberto Nannarelli, Code Compression Architecture for Cache Energy Minimization in Embedded Systems, in: International Symposium on Low Power Electronics and Design, 2001, pp. 322-327.
[4] Luca Benini, Alberto Macii, Enrico Macii, Massimo Poncino, Selective Instruction Compression for Memory Energy Reduction in Embedded Systems, in: IEEE/ACM Proceedings of the International Symposium on Low Power Electronics and Design, 1999, pp. 206-211.

[5] Luca Benini, Francesco Menichelli, Mauro Olivieri, A class of code compression schemes for reducing power consumption in embedded microprocessor systems, IEEE Trans. Comput. 53 (4) (2004) 467-482.
[6] Ozgur Celebican, Tajana Simunic Rosing, Vincent J. Mooney, Energy Estimation of Peripheral Devices in Embedded Systems, in: Proceedings of the 14th ACM Great Lakes Symposium on VLSI, 2004, pp. 430-435.
[7] Po-Yueh Chen, Chao-Chin Wu, Ying-Jie Jiang, Bitmask-based code compression methods for balancing power consumption and code size for hard real-time embedded systems, Microprocess. Microsyst. 36 (3) (2012) 267-279.
[8] Ching-Wen Chen, Chih-Hung Chang, Chang-Jung Ku, A low power-consuming embedded system design by reducing memory access frequencies with multiple reference tables and encoding the most executed instructions, IEICE Trans. Inform. Syst. E88-D (12) (2005) 2748-2756.
[9] Ching-Wen Chen, Chang-Jung Ku, A tagless cache design for power saving in embedded systems, J. Supercomput. 62 (1) (2012) 174-198.
[10] Eui-Young Chung, Cheol Hong Kim, Sung Woo Chung, An accurate and energy-efficient way determination technique for instruction caches by using early tag matching, in: 4th IEEE International Symposium on Electronic Design, Test & Applications, 2008, pp. 190-195.
[11] Vincent W. Freeh, Feng Pan, Nandini Kappiah, David K. Lowenthal, Rob Springer, Exploring the Energy-Time Tradeoff in MPI Programs on a Power-Scalable Cluster, in: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium, 2005, pp. 4-13.
[12] John L. Hennessy, David A. Patterson, Computer Architecture: A Quantitative Approach, fifth ed., 2011.
[13] Jian Huang, David Lilja, Exploiting Basic Block Value Locality with Block Reuse, in: Proceedings of the 5th International Symposium on High Performance Computer Architecture, 1999, pp. 106-114.
[14] Jong-Myon Kim, Sung Woo Chung, Cheol Hong Kim, Energy-aware instruction cache design using small trace cache, IET Comput. Digit. Tech. 4 (4) (2010) 293-305.
[15] Johnson Kin, Munish Gupta, William H. Mangione-Smith, The Filter Cache: An Energy Efficient Memory Structure, in: Proceedings of the 30th International Symposium on Microarchitecture, 1997, pp. 184-193.
[16] Lea Hwang Lee, William Moyer, John Arends, Instruction Fetch Energy Reduction Using Loop Caches for Embedded Applications with Small Tight Loops, in: Proceedings of the International Symposium on Low Power Electronics and Design, 1999, pp. 267-269.
[17] Jang-Soo Lee, Won-Kee Hong, Shin-Dug Kim, An on-chip cache compression technique to reduce decompression overhead and design complexity, J. Syst. Architect.: EUROMICRO J. 46 (15) (2000) 1365-1382.
[18] Chang Hong Lin, Yuan Xie, Wayne Wolf, Code compression for VLIW embedded systems using a self-generating table, IEEE Trans. Very Large Scale Integr. Syst. 15 (10) (2007).
[19] Naveen Muralimanohar, Rajeev Balasubramonian, Norman P. Jouppi, CACTI 5.3, HP Laboratories, Palo Alto, 2007.
[20] Marisha Rawlins, Ann Gordon-Ross, Lightweight runtime control flow analysis for adaptive loop caching, in: Proceedings of the 20th Great Lakes Symposium on VLSI, 2010, pp. 239-244.
[21] Biju K. Raveendran, T.S.B. Sudarshan, Avinash Patil, Komal B. Randive, S. Gurunarayanan, Predictive placement scheme in set-associative cache for energy efficient embedded systems, in: International Conference on Signal Processing, Communications and Networking, 2008, pp. 152-157.
[22] Dinesh C. Suresh, Walid A. Najjar, Jun Yang, Power efficient instruction caches for embedded systems, in: SAMOS 2005, LNCS, vol. 3553, 2005, pp. 182-191.
[23] Kugan Vivekanandarajah, Thambipillai Srikanthan, Saurav Bhattacharyya, Dynamic filter cache for low power instruction memory hierarchy, in: Proceedings of the EUROMICRO Symposium on Digital System Design, 2004, pp. 607-610.
[24] Kugan Vivekanandarajah, Thambipillai Srikanthan, Custom Instruction Filter Cache Synthesis for Low-Power Embedded Systems, in: Proceedings of the 16th International Workshop on Rapid System Prototyping, 2005, pp. 151-157.
[25] Kugan Vivekanandarajah, Thambipillai Srikanthan, Saurav Bhattacharyya, Dynamic Filter Cache for Low Power Instruction Memory Hierarchy, in: Proceedings of the EUROMICRO Symposium on Digital System Design, 2004, pp. 607-610.
[26] Jun Yang, Rajiv Gupta, Frequent value locality and its applications, ACM Trans. Embedded Comput. Syst. 1 (1) (2002) 79-105.
[27] Jun Yang, Rajiv Gupta, Energy Efficient Frequent Value Data Cache Design, in: Proceedings of the 35th ACM/IEEE International Symposium on Microarchitecture, 2002.
[28] Jun Yang, Youtao Zhang, Rajiv Gupta, Frequent Value Compression in Data Caches, in: Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture, 2000, pp. 258-265.


[29] Yukihiro Yoshida, Bao-Yu Song, Hiroyuki Okuhata, Takao Onoye, Isao Shirakawa, An Object Code Compression Approach to Embedded Processors, in: ACM/IEEE International Symposium on Low Power Electronics and Design, 1997, pp. 265-268.
[30] Chuanjun Zhang, An Efficient Direct Mapped Instruction Cache for Application-Specific Embedded Systems, in: Third IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, 2005, pp. 45-50.

Chang-Jung Ku received his B.S. and M.S. degrees from Chaoyang University of Technology, Taiwan, in 2004 and 2006, respectively. He is currently a Ph.D. candidate in the Department of Information Engineering and Computer Science at Feng Chia University, Taiwan. His research interests include computer architecture, parallel processing, and embedded systems.

Ching-Wen Chen received his M.S. degree in Computer Science from National Tsing-Hua University, Taiwan, in 1995, and his Ph.D. in Computer Science and Information Engineering from National Chiao-Tung University, Taiwan, in 2002. He was an Assistant Professor at Chaoyang University of Technology, Taiwan (2005-2007). He is currently an Associate Professor in the Department of Information Engineering and Computer Science at Feng Chia University, Taiwan. His research interests include computer architecture, parallel processing, embedded systems, mobile computing, and wireless sensor networks.

An Hsia received his B.S. and M.S. degrees from Feng Chia University, Taiwan, in 2008 and 2010, respectively. He is currently a Ph.D. candidate in the Department of Information Engineering and Computer Science at Feng Chia University, Taiwan. His research interests include computer architecture, parallel processing, and embedded systems.

Chun-Lin Chen received his B.S. degree from the Department of Computer Science and Information Engineering, Dayeh University, Taiwan, in 2009. His research interests include computer architecture and embedded systems.