
Journal of Systems Architecture 56 (2010) 151–162


Improving cache locking performance of modern embedded systems via the addition of a miss table at the L2 cache level

Abu Asaduzzaman a, Fadi N. Sibai b,*, Manira Rani a

a Dept. of Computer Science and Engineering, Florida Atlantic University, Boca Raton, Florida, USA
b UAE University, Computer Systems Design, P.O. Box 17551, CIT, Al Ain, United Arab Emirates

Article history: Received 18 May 2009; received in revised form 30 January 2010; accepted 13 February 2010; available online 18 February 2010.

Keywords: Cache locking; Miss table; Multi-core architecture; Performance/power ratio; Timing predictability

Abstract

To confer robustness and a high quality of service, modern computing architectures running real-time applications should provide high system performance and high timing predictability. Cache memory is used to improve performance by bridging the speed gap between the main memory and the CPU. However, the cache introduces timing unpredictability, creating serious challenges for real-time applications. Herein, we introduce a miss table (MT) based cache locking scheme at the level-2 (L2) cache to further improve the timing predictability and the system performance/power ratio. The MT holds the addresses of the blocks, for the application being processed, that cause the most cache misses if not locked. Information in the MT is used for efficient selection of the blocks to be locked and of the victim blocks to be replaced. This MT-based approach improves timing predictability by keeping the important blocks with the highest numbers of misses locked inside the cache for the entire execution time. In addition, this technique decreases the average delay per task and the total power consumption by reducing cache misses and avoiding unnecessary data transfers. The MT-based solution is effective for both uniprocessors and multicores. We evaluate the proposed MT-based cache locking scheme by simulating an 8-core processor with 2 levels of caches using MPEG4 decoding, H.264/AVC decoding, FFT, and MI workloads. Experimental results show that, in addition to improving the predictability, a reduction of 21% in mean delay per task and a reduction of 18% in total power consumption are achieved for MPEG4 (and H.264/AVC) by using the MT and locking 25% of the L2. The MT results in about 5% delay and power reductions on these video applications, possibly more on applications with worse cache behavior. For the FFT and MI (and other) applications whose code fits inside the level-1 instruction (I1) cache, the mean delay per task increases only by 3% and the total power consumption increases by 2% due to the addition of the MT.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

More processing speed and high predictability are needed for processing real-time applications on embedded systems. As an example, real-time video conferencing uses a visual paradigm rather than a conventional text paradigm and deals with large image files, including video sequences. The algorithms for future multimedia systems are more complicated than those currently in use and require huge amounts of processing speed in real time. According to the newly emerged multi-core design paradigm, two cores running at one half of the frequency can approach the performance of a single core running at full frequency, while the dual core consumes significantly less power. Because of their high performance/power ratio, multi-core processors offer new possibilities for system designers in implementing highly complex multimedia computing algorithms, as described by Suhendra and Mitra [18].

* Corresponding author. Tel.: +971 37135589. E-mail address: [email protected] (F.N. Sibai). doi:10.1016/j.sysarc.2010.02.002. © 2010 Elsevier B.V. All rights reserved.

In a multi-core architecture, two or more independent cores are combined into a die. In most cases, each core has its own private level-1 (L1) cache memory. Normally, the L1 is split into instruction (I1) and data (D1) caches. Also, multi-core processors may have one shared level-2 cache (L2) or multiple distributed and dedicated L2s. Asaduzzaman and Mahgoub [3] showed that cache parameters (such as cache size, line size, and associativity level) significantly influence multimedia system performance. Multi-core architectures are more suitable for real-time multimedia applications because concurrent execution of tasks on a single processor is inadequate for achieving the required level of performance and reliability. The integration of billions of transistors in a single chip is now possible. As a result, the multi-core design trend is expected to grow for the next decade. However, cache memory introduces timing unpredictability, while real-time applications demand timing predictability and cannot afford to miss deadlines. Therefore, it becomes a great challenge to support real-time applications on a system in the presence of caches.


A better use of the cache is needed in order to improve the predictability and the performance/power ratio. Cache locking shows promise in improving predictability, as shown by Asaduzzaman et al. [2,4], Puaut [11], Arnaud and Puaut [1], Puaut and Pais [12], and Tamura et al. [20]. Cache locking is defined as the ability to prevent some or all of the data or instruction cache from being overwritten. Cache entries can be locked either for individual ways within the cache or for the entire cache. In entire cache locking, cache hits are treated in the same manner as hits to an unlocked cache. Cache misses are treated as cache-inhibited accesses. Invalid cache entries at the time of the locking remain invalid and inaccessible until the cache is unlocked. Entire cache locking is inefficient when the size of the data or the number of instructions to be locked is small compared to the cache size. In way locking (a.k.a. set locking and partial locking), only a portion of the cache is locked by locking ways within the cache. Unlocked ways of the cache behave normally. Harrison [8] and Stokes [16,17] show that way locking may improve predictability and performance. Using way locking, the Xbox 360's Xenon processor achieves the performance of the local storage used by the Synergistic Processing Elements (SPEs) in the IBM Cell processor. Way locking at the L1 cache is not permitted on some processors (like the PowerPC 750GX), but way locking at the L2 cache is possible. In existing single-core way locking techniques, the blocks to be locked are selected by off-line analysis of the applications. The information collected by off-line analysis can be post-processed and re-used in single-core and/or multi-core systems to facilitate the efficient selection of the blocks to be locked and/or the victim blocks to be replaced. In this work, we introduce a miss table (MT) at the L2 cache level to hold information about the block addresses that cause most or all cache misses when cache locking is not used. This technique is effective for both uniprocessors and multi-core systems. This paper shows that using a miss table together with cache locking improves predictability and decreases delay and total power consumption by reducing cache misses, as the right cache blocks are locked and/or replaced.

This paper is organized as follows. In Section 2, related literature is discussed. The proposed miss table (in a multi-core architecture) is introduced in Section 3. In Section 4, the cache locking scheme with the miss table is evaluated by presenting and analyzing some important simulation results. Finally, this work is concluded in Section 5.

2. Literature survey

Improving the predictability and performance/power ratio by cache locking has become an important research topic in recent years. A lot of work has been done to improve predictability in single-core systems by cache locking. In this section, we first briefly discuss some cache memory hierarchies, followed by various existing single-core cache locking techniques. Then, we present a number of cache memory hierarchies used in contemporary and popular multi-core processors before we introduce our proposed idea of adding a miss table at the L2 cache level to improve the predictability and performance/power ratio by improving cache locking performance.

Cache memory has a very rich history in the evolution of modern computing, as summarized in Jacob et al. [9] and Smart Cache [24]. Cache memory first appeared in the IBM System/360 Model 85 in the 1960s. In the 1980s, the Intel 486DX microprocessor introduced an on-chip 8 KB L1 cache for the first time. In the early 1990s, an off-chip L2 cache appeared with the 486DX4 and Pentium microprocessor chips. Today's microprocessors usually have 128 KB or more of L1, 512 KB or more of L2, and an optional 2 MB or more of L3 (third-level cache).

The L1 cache can be split into I1 and D1 in order to improve performance, as shown by Romachenko [13,14]. Torres [21] discusses the Intel Pentium 4 processor, one of the most popular single-core processors that use an inclusive cache architecture. A typical inclusive cache architecture is shown in Fig. 1. The L2 cache contains every block that the L1 (i.e., I1 and D1) cache may contain. In the case of an L1 miss followed by an L2 miss, the block is first brought into the L2 from main memory, then into the L1 from the L2. The Intel Pentium 4 Willamette is a single-core processor that has an on-die 256 KB inclusive level-2 cache, with an 8 KB level-1 trace/instruction cache (I1) and an 8 KB level-1 data cache (D1).

Locking entries in the cache is a common and effective technique for reducing cache misses and therefore helps reduce the execution time as well as execution time fluctuations (over back-to-back runs), which leads to better predictability and lower power consumption. The idea is that by locking important blocks in the cache, future accesses to these blocks hit in the cache; thereby better predictability should be achieved with reduced memory access time and reduced total power consumption. Vera and Lister [23] proposed a memory hierarchy with a dynamic locking cache to provide high performance combined with high predictability for complex systems. They conclude in an intuitive way that faster execution times are possible. However, it is very difficult to provide the necessary locking information in the program code, since both the hardware and the software may be involved. The methodology introduced by Arnaud and Puaut [1] may improve predictability by selecting the target set of instructions efficiently. However, the algorithm used in this scheme can also be misleading, because it uses neither information about the task structure nor the problem parameters. Tamura et al. [19] studied the impact of three different fitness functions. These fitness functions are used in a genetic algorithm that selects the contents of a static locking cache memory in a real-time system. Results indicate that none of the fitness functions perform well in the two basic metrics used. Tamura et al. [20] combined static cache analysis with data cache locking to estimate the worst-case memory performance in a safe, tight, and fast way. Experimental results show that this scheme is more predictable. However, a better analysis that classifies the cache accesses as misses or hits and locks fewer regions could be beneficial. Campoy et al. [6] proposed a dynamic cache locking algorithm which partitions the task into a set of regions. Each region statically owns a locked cache content determined offline. A sharp improvement is observed compared to a system without caches. However, this technique is not capable of power estimation – a crucial design factor for embedded systems. Furthermore, the above-mentioned techniques were developed to evaluate predictability in a single-core system, and they are not adequate for analyzing the performance, power consumption, and predictability of multi-core real-time systems.

Most manufacturers are adopting multi-core processors to acquire the high processing power required by future multimedia computing systems. Popular multi-core processors from Intel, AMD, and IBM have multilevel caches, as discussed in Romachenko [13,14], Multicore (2008), and Every [7].

Fig. 1. Inclusive cache architecture. (Figure: a CPU with a split L1 (I1/D1) backed by a unified L2, connected to main memory over a shared bus.)


For example, as shown in Fig. 2, the Intel quad-core Xeon DP has 128 KB I1, 128 KB D1, and 8 MB of shared L2. The AMD quad-core Opteron has 256 KB I1, 256 KB D1, 2 MB of distributed and dedicated L2, and 2 MB (Santa Rosa) or 4 MB (Deerhound) of shared L3. IBM's Cell multi-core processor has a Primary Processing Entity (PPE), consisting of an IBM dual-threaded PowerPC, and 8 Cells. Stokes [16,17], Blanchford [5], and Vance [22] summarize the Cell-like multi-core architecture. Each Cell is also called a Synergistic Processing Element (SPE). The PPE contains a 32 KB I1 and a 32 KB D1 cache. A 512 KB L2 is shared by the PPE and the SPEs. Primarily, the PowerPC PPE keeps the processor compatible with a large base of applications. Each SPE has 256 KB of SRAM, a 4 × 128-bit ALU (Arithmetic Logic Unit, which does the math in a processor), and 128 128-bit registers. The Element Interconnect Bus (EIB) is the communication bus internal to the Cell processor which connects the various on-chip system elements: the PPE processor, the memory controller (MIC), the eight SPE coprocessors, and two off-chip I/O interfaces. Sibai [15] studies and compares various shared versus private cache memory organizations. The impact of sharing and privatizing caches on the performance of homogeneous multi-core architectures depends on cache and workload parameters (especially on cache sizes, miss rates, and interconnect switch congestion and latency).

Fig. 2. Intel quad-core Xeon DP architecture. (Figure: four cores, each with private I1 and D1 caches, sharing one L2 connected to main memory over a shared bus.)

3. Proposed miss table


In the previous section, we discussed the cache memory hierarchy of some popular single-core and multi-core processors and some existing cache locking techniques. In this section, we introduce the proposed miss table (MT). Even though the MT is suitable for both single-core and multi-core systems, we discuss the schematic diagram of the MT in a multi-core system (see Fig. 3). The miss table is populated with information about the blocks that cause cache misses (without cache locking), obtained by post-processing the results of worst-case execution time (WCET) analysis of the applications. In the MT, block addresses are sorted in descending order of the number of cache misses. The addition of the MT opens an opportunity for the cache locking scheme to lock blocks from the top of the MT (those with the most misses), which would otherwise create more misses – and thus lower timing predictability and increase power consumption. Similarly, the pre-loading mechanism can load those blocks which are identified as creating the most misses (those in the top entries of the MT). Also, the cache replacement strategy can exclude blocks with the maximum number of misses. Thus the contents of the MT can guide the pre-loading and replacement strategies as needed. Locking guarantees a cache hit for a locked block and prevents its replacement, extending the block's useful lifetime in the cache. When a block hits in the cache, predictability improves because, consistently, no additional steps are required to locate and fetch the addressed item from higher and slower levels of the memory hierarchy. As a result, a system architecture with the MT makes better usage of the caches and should significantly improve the predictability and the performance. In the following subsections, we discuss the architecture, the miss table, the cache replacement policy, and the multi-core cache locking algorithm.
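Conceptually, each MT way is a small array of (block address, miss count) pairs kept sorted in descending miss order. The following C sketch illustrates one possible layout and load step; the type and function names (mt_entry_t, mt_way_t, mt_load) and the capacity constant are our own illustrative choices, not taken from the paper:

```c
/* Minimal sketch of one MT way: (block address, miss count) pairs,
 * sorted in descending order of misses, as described above.
 * MT_WAY_ENTRIES, mt_entry_t, and mt_load are illustrative names. */
#include <stdint.h>
#include <stdlib.h>

#define MT_WAY_ENTRIES 256          /* assumed capacity of one MT way */

typedef struct {
    uint32_t block_addr;            /* cache block address              */
    uint32_t miss_count;            /* misses observed without locking  */
} mt_entry_t;

typedef struct {
    mt_entry_t entry[MT_WAY_ENTRIES];
    size_t     used;
} mt_way_t;

/* Comparator: larger miss counts sort first (descending order). */
static int mt_cmp(const void *a, const void *b)
{
    const mt_entry_t *x = a, *y = b;
    return (x->miss_count < y->miss_count) - (x->miss_count > y->miss_count);
}

/* Load one MT way from post-processed WCET analysis output. */
void mt_load(mt_way_t *way, const mt_entry_t *data, size_t n)
{
    way->used = (n < MT_WAY_ENTRIES) ? n : MT_WAY_ENTRIES;
    for (size_t i = 0; i < way->used; i++)
        way->entry[i] = data[i];
    qsort(way->entry, way->used, sizeof(mt_entry_t), mt_cmp);
}
```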




3.1. Architecture

We consider a multi-core architecture with 8 cores where the L2 is shared by all cores (like the Intel Xeon processor). In this architecture, each core has its own private L1, which is split into I1 and D1 for improved performance.

Fig. 3. Simulated 8-core architecture with MT. (Figure: eight cores, each with a private I1 and D1, connected by Bus1 to the shared L2 and the MT; the L2 connects to main memory through Bus2 over the system bus.)


The shared L2 may be partitioned into parts, to reduce bus contention, so that only a few cores can access each part. The miss table (MT) is considered – and resides – at cache level-2 (but not inside the L2) so that it can be accessed from the L1 (to guide the victim block selection) and from the L2 (to guide the cache locking in the L2). The schematic diagram of an 8-core architecture with MT is shown in Fig. 3.

3.2. Miss table

A miss table (MT) is a table that contains the addresses of all, or the top-most, blocks that cause cache misses; block addresses are sorted in descending order of their miss counts. For each application (or code segment), after post-processing the tree-graph generated by the Heptane simulation tool, the data (block address, number of misses, etc.) is prepared for the MT. The MT can be implemented at any cache level (L1 or L2). In this work, the MT is a separate fast SRAM memory that resides at the L2 cache level and is shared by all 8 cores' I1 caches and the L2 cache. In order to allow the MT operation to scale with the number of cores, the scalable MT organization of Fig. 4 is proposed. In this organization, the MT has 3 ports. One port is shared by the 4 cores (I1 caches) on the left, one of the other 2 ports is shared by the 4 cores (I1 caches) on the right, and the third port is exclusively used by the L2. The 3 MT ports allow for 3 simultaneous lookups of the MT, thereby significantly reducing the MT's delay overhead. Further MT lookup latency improvements, at additional hardware cost, can be achieved by dedicating an unshared link between each core and its MT way.

3.2.1. Miss table workflow

The MT is partitioned into 2 sets of MT tables, or ways, where each set has 4 MT ways, one way per core. The MT ways are loaded at the start of the task run with cache block addresses in descending number of misses for the pertaining task. If a task is preempted, the associated MT way is loaded with the MT information of the new task allocated to that core. When a core is idle, its associated MT way is turned off to save power, and turned on again when the core becomes active. Port 3 enables the L2 to access all of the MT ways. As each MT way contains the block addresses sorted in descending order of misses per task, and as the L2 requires information on which of its entries to lock, a simple and fast MT implementation is for the L2 to lock the block addresses stored in the top entries of all the MT ways. This solution does not require sorting the block addresses in the MT ways all as one bundle (nor statically sorting the blocks with the most misses across many applications running simultaneously, as the combination of applications running simultaneously may not be known before run time) but considers the top blocks from each MT way at run time. When some MT ways are turned off due to inactive cores, more top entries from the active MT ways can be selected by the L2 for L2 locking. A lookup request coming from the left (or right) side results in the lookup of an MT way in the left (or right) set, precisely the MT way corresponding to the requesting core number. Each MT way is dual ported to allow serving one core, belonging to a set of 4 cores, and the L2 simultaneously. As the MT is structured in smaller ways rather than one single big table, the MT lookup time is reduced, as each way has 1/8 the number of entries of one single big MT table (for an 8-core system). In addition to using the MT for block selection for level-2 cache locking, the MT is also used for level-1 cache victim block selection.

3.2.2. Miss table overhead

The impact of the MT overhead delay and power on the mean delay per task and the power consumption, respectively, needs to be well justified. For a system with N cores, the MT overhead delay due to 25% L2 cache locking can be determined using the following assumptions or estimates (a numeric sketch is given below):

- Number of entries per MT way = 25% of the number of entries in the L2/N, as the MT is composed of N ways.
- Overhead delay of an MT lookup by the L2 = L1 lookup delay × number of entries in an MT way/number of entries in the L1.
- MT overhead delay due to the L1 = 2 × MT overhead delay due to the L2, owing to the proximity of the MT to the L2.

For a system with N cores and 25% L2 cache locking, the additional power needed to operate the MT can be determined using the following assumption (an upper bound on the MT overhead power):

- MT overhead power = (25% L2 size/N) × (MT way size/L2 size).

The performance and power consumption overheads of the proposed MT-based solution are not significant, as the MT is much smaller than the cache memories and the MT lookup is done simultaneously with, and thus overlaps in time with, the I1 lookup. The proposed MT organization provides a scalable solution, as more MT ports can be added when the number of cores is increased.
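To make the estimates above concrete, here is a small numeric sketch in C. The cache entry counts and the unit L1 lookup delay are illustrative assumptions, not values from the paper; for the power bound we use the 0.25/N ratio that Section 4.1 later adopts as a conservative upper limit:

```c
/* Numeric sketch of the Section 3.2.2 estimates for N = 8 cores and
 * 25% L2 locking. Entry counts correspond to a 1 MB L2 and a 16 KB I1
 * with 128 B lines (illustrative; delays are in arbitrary units). */
#include <stdio.h>

int main(void)
{
    const int    n_cores    = 8;
    const double l2_entries = 8192.0;  /* 1 MB / 128 B lines (assumed)   */
    const double l1_entries = 128.0;   /* 16 KB / 128 B lines (assumed)  */
    const double l1_lookup  = 1.0;     /* unit L1 lookup delay (assumed) */

    /* Entries per MT way = 25% of the L2 entries / N. */
    double mt_way_entries = 0.25 * l2_entries / n_cores;

    /* MT lookup delay seen by the L2. */
    double mt_delay_l2 = l1_lookup * mt_way_entries / l1_entries;

    /* The L1 sees double the L2's MT delay (the MT sits near the L2). */
    double mt_delay_l1 = 2.0 * mt_delay_l2;

    /* Power upper bound: N ways covering 25% of the L2 means each way
     * is at most (0.25 / N) of the L2, i.e. 0.03125 x L2 for N = 8;
     * Section 4.1 uses the same ratio as the MT power bound. */
    double mt_power_ratio = 0.25 / n_cores;

    printf("entries per MT way: %.0f\n", mt_way_entries);
    printf("MT delay: L2 side %.2f, L1 side %.2f\n", mt_delay_l2, mt_delay_l1);
    printf("MT power <= %.5f x L2 power\n", mt_power_ratio);
    return 0;
}
```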

Fig. 4. Scalable MT organization. (Figure: the MT is split into two sets of 4 MT ways; Port 1 serves the set of 4 cores on the left, Port 2 serves the set of 4 cores on the right, and Port 3 serves the L2 cache.)


3.3. Cache replacement policy using miss table

LRU (least recently used) is an efficient cache replacement strategy. However, cache locking may bring performance improvements when LRU cannot manage a few blocks efficiently. Guided by the MT, an improved LRU cache replacement policy is adopted for the L2, as cache locking is implemented in the L2. The goal of this cache replacement policy is to evict the cache block with the minimum number of misses. This cache replacement policy always selects unlocked blocks with the minimum number of misses for eviction and replacement. In the case of a tie in the number of misses, a block belonging to the group of blocks tied for the least number of misses is selected for eviction and replacement using the LRU scheme. One benefit of the improved LRU replacement policy is that it does not affect the locked cache blocks, which are needed to help increase predictability, and it avoids replacing blocks with a high number of misses as much as possible, thereby contributing to the reduction of the miss rate of the blocks identified as having a high number of misses. The regular LRU cache replacement policy is used in the L1.
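A minimal C sketch of this victim selection, assuming each L2 line carries a lock bit, an LRU age counter, and a miss count obtained from the MT (zero if the block is not listed); the type and function names are illustrative:

```c
/* MT-guided victim selection for one L2 set: pick the unlocked line
 * with the fewest misses; break ties by evicting the least recently
 * used line. line_t and pick_victim are illustrative names. */
#include <stdint.h>

typedef struct {
    uint32_t block_addr;
    uint32_t miss_count;   /* from the MT; 0 if the block is not listed */
    uint32_t lru_age;      /* larger value = less recently used         */
    int      locked;       /* locked lines are never evicted            */
} line_t;

/* Return the index of the victim line among `ways` lines of one set,
 * or -1 if every line is locked (cannot happen with partial locking). */
int pick_victim(const line_t *set, int ways)
{
    int victim = -1;
    for (int i = 0; i < ways; i++) {
        if (set[i].locked)
            continue;                      /* never evict locked blocks */
        if (victim < 0 ||
            set[i].miss_count < set[victim].miss_count ||
            (set[i].miss_count == set[victim].miss_count &&
             set[i].lru_age > set[victim].lru_age))    /* LRU tie-break */
            victim = i;
    }
    return victim;
}
```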

3.4. Cache locking using the miss table

In this subsection, we present and discuss the proposed miss table-based cache locking scheme at the L2 cache level for a multi-core architecture, as cache locking in multi-core systems is more complicated than, and therefore supersedes, cache locking in uniprocessor systems. Fig. 5 illustrates the workflow of this scheme using the MT. In this scheme, the L2 and L1 caches are pre-loaded with the selected blocks using the MT. In the case of cache locking, cache blocks are selected for locking using the MT (a sketch of this selection is given at the end of this subsection). For a system with N cores, the simulator generates N tasks at a time (or fewer if no more tasks remain to execute) and assigns one task to a core. For a uniprocessor system, N = 1.

The state of a busy core is expressed as Core.K, where 1 ≤ K ≤ N. Each busy core processes its task using its private L1 (i.e., I1 and D1) and the shared L2. Right after finishing the task, the state of core Core.K becomes Core.K′. The simulation is completed when all tasks are processed. For each group of N tasks, the task that takes the maximum amount of time to finish is used to obtain the mean delay per task. In our simulations, the time required to access the MT from the L1 is assumed to be double the time needed to access the MT from the L2, owing to the MT's proximity to the L2. The power consumed by the system to complete all tasks is used to obtain the total power consumption. As we are only simulating the execution of a single-threaded application on one core, only one of the MT ways is active while the other 7 (see Fig. 4) are off; the active MT way consumes a very small amount of power. Since the totality of the 8 MT ways should indicate which 25% of the L2 cache entries to lock, each MT way size is at most equal to (25% of the L2 cache size)/8, i.e., 0.03125 × the L2 cache size. Hence, for this study, we stay on the conservative side by assuming that the power consumed by the MT is equal to 0.03125 × the power consumed by the L2 cache. We adopt a high upper bound on the MT power overhead to show that the MT should not impact the total system power. Thus, the delay and power overheads due to the addition of the MT are incorporated in our simulated system and are accounted for in the simulation results. In the case of cache locking without a miss table, memory blocks are selected randomly for locking. For an I1 cache miss, the victim block is selected with the minimum number of misses (assuming the I1 is full) using the MT.
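The block selection for L2 locking can be sketched as taking the top entries of each active MT way until the locking budget (25% of the L2) is exhausted. In the following C sketch, l2_lock_block is a hypothetical hook standing in for the hardware's per-block lock operation, and the round-robin order across ways is our own illustrative choice:

```c
/* MT-guided L2 lock selection: lock the top entries of each active MT
 * way, round-robin across ways, until `budget` blocks (25% of the L2)
 * are locked. Ways belonging to idle cores are skipped, so more top
 * entries are taken from the active ways, as described in Section 3.2.1. */
#include <stdint.h>
#include <stddef.h>

#define N_CORES 8

typedef struct {
    uint32_t block_addr;
    uint32_t miss_count;
} mt_entry_t;

typedef struct {
    const mt_entry_t *entry;   /* sorted, most misses first      */
    size_t            used;
    int               active;  /* 0 if the owning core is idle   */
} mt_way_t;

extern void l2_lock_block(uint32_t block_addr);  /* hypothetical hook */

void mt_lock_l2(const mt_way_t way[N_CORES], size_t budget)
{
    size_t pos = 0, locked = 0;
    while (locked < budget) {
        int progressed = 0;
        for (int w = 0; w < N_CORES && locked < budget; w++) {
            if (!way[w].active || pos >= way[w].used)
                continue;
            l2_lock_block(way[w].entry[pos].block_addr);
            locked++;
            progressed = 1;
        }
        if (!progressed)
            break;             /* every active way is exhausted */
        pos++;                 /* move to the next-ranked entry */
    }
}
```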

Fig. 5. Workflow diagram of the MT-based cache locking scheme in an N-core system. (Flowchart: pre-load the L2 and L1 with blocks selected using the MT; if cache locking is enabled, prepare to lock up to 25% of the L2 using the MT; generate N tasks for a system with N cores and assign Task.1..Task.N to Core.1..Core.N; each core processes its task; Delay = SUM(MAX(delay for each group of N cores)) and Power = SUM(power for all N cores); repeat until all tasks are done; Mean Delay per Task = Total delay ÷ Number of tasks; Total Power = total power for all tasks.)


It is noted that cache locking has been used in commercial processors and does not cause tasks to deadlock as in the classic problem of racing for mutex-controlled memory locations. Our locking algorithm, using the MT at the L2 cache level and victimizing the blocks with the least number of misses, does not deadlock, as the entire cache is not locked and as the replacement algorithm can always pick an unlocked block with the least number of misses for eviction. Once the unlocked block with the least number of misses is chosen for replacement, another block will be victimized on the next cache miss, unless the former block is accessed before the latter block, a case which occurs with low probability given that the former block was not chosen to be locked. Entire cache locking is not suitable when the number of instructions or the size of the data to be locked is small compared to the cache size. Also, way locking (a.k.a. set locking and partial locking) at the L1 cache is not permitted on some processors (like the PowerPC 750GX), but way locking at the level-2 cache is possible on those processors. The cache locking algorithm used by Arnaud and Puaut [1] to improve predictability can be misleading, because it uses neither information about the task structure nor the problem parameters. Most existing cache locking techniques (like the techniques developed by Tamura et al. [20], Campoy et al. [6], and Asaduzzaman et al. [2]) are not capable of power estimation. Furthermore, most existing schemes were developed to evaluate predictability in a single-core system, and they are not adequate for analyzing performance, power consumption, and predictability at the same time. Our proposed MT-based cache locking scheme is effective for analyzing the performance, power consumption, and predictability of both uniprocessor and multi-core systems. Also, our proposed MT-based cache locking scheme can be flexibly implemented at either the level-1 or the level-2 cache.

4. Evaluation

We develop a simulation platform to evaluate the impact of the proposed miss table-based cache locking on performance, power consumption, and predictability. We introduce the MT mainly to improve cache locking performance by wisely selecting the blocks to be locked. Even though the MT is effective for, and applicable to, both uniprocessor and multi-core systems, we model and simulate an 8-core architecture with 2 levels of caches. In this work, we consider the miss table at the L2 cache level and we implement cache locking of the L2. In the following subsections, we briefly discuss the simulation details and present some important simulation results to evaluate the proposed miss table-based cache locking scheme.

4.1. Simulation details

We simulate a system where multiple (single- or multi-threaded) applications can run on multiple cores simultaneously. The cache locking with the miss table is evaluated using the Moving Picture Experts Group's MPEG4 (part-2) decoding, Advanced Video Coding – widely known as H.264/AVC – decoding, Fast Fourier Transform (FFT), and Matrix Inversion (MI) workloads. The choice of these workloads is based on their popularity in embedded systems and on their diversity. While each workload consists of a popular application/algorithm, one clear distinction is the varying code size of the applications, which clearly has performance and predictability implications with respect to the cache size and cache locking. Other differences include the number of memory operations and the number of cache misses, as shown in Table 1.
In this simulation, we select the application/algorithm code segments in such a way that some of them (namely, FFT and MI) entirely fit in the initial size of the I1 (4 KB) but others (namely, MPEG4 and H.264/AVC) do not.

Table 1
Important characteristics of MPEG4, H.264/AVC, FFT, and MI.

Characteristics                                MPEG4 decoder   H.264/AVC decoder   FFT    MI
Code size (KB)                                 93.04           78.85               2.34   1.47
Number of instructions (K)                     16,207          11,374              365    227
Number of I1 cache misses (K)
  (I1 = D1 = 16 KB, L2 = 1 MB,
   line = 128 B, 8-way)                        1,160           783                 17     12
Fits in I1 (4–64 KB)?                          No              No                  Yes    Yes

This selection provides the opportunity to investigate the impact of the miss table on cache locking for different application code sizes. By single-core WCET analysis, we generate a tree-graph for each application that shows the blocks that cause misses. By post-processing the tree-graph, we create one data file for each application. We use VisualSim to develop the simulation platform, to run the simulation program, and to collect the results. In this subsection, we briefly discuss some assumptions, the simulation tools, and the important input and output parameters.

4.1.1. Assumptions

We make the following important assumptions in this work (a configuration sketch follows the list):

- Cache locking at the L2 cache, with and without the miss table, is implemented in this work. When the MT is not used, cache blocks are selected randomly for locking.
- The total MT size is expected not to exceed 25% of the L2 cache size (for L2 cache locking). The MT is pre-loaded only once, at the start.
- The MT is equally shared by all the cores, as depicted in Fig. 4.
- A modified LRU cache replacement strategy is used for the L2 to select the victim blocks using MT information. The regular LRU cache replacement strategy is used for the L1.
- A write-back memory update policy is used.
- The delay introduced by the bus that connects the L2 and the main memory is 15 times longer than the delay due to the bus that connects the L1 and the L2.
- The cache hit ratio is adjusted/increased when the cache size, line size, or associativity is changed.
- The changes in cache hit/miss times due to the increase in cache size, line size, and associativity are assumed to be negligible compared to the obtained reduction in mean delay per task.
- We simulate a multi-core system with 8 cores – each core processes its respective task using its private L1 (I1 and D1) and the shared L2. The workload we use to run the simulation program is single-threaded, and the simulator simulates one application running on one core at a time. The other cores are idle at that time. As shown by Mezetti et al. [10], most cache effects disappear when the threads have very little in common and the code and data are properly placed in the cache memories to avoid conflict misses. With a private L1 for each core, a large L2 to accommodate all 8 cores, and a proper allocation of code to the I1 and L2 caches, we assume that multiple-cache effects are, or can be made, negligible and should not impact our simulation results. Moreover, our scalable MT organization includes a separate MT way for each core, needed to equally support each core's victim block selection and cache locking strategy.
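For reference, the simulated configuration can be summarized in a small C structure; the struct and its field names are our own shorthand for the parameters of Table 2 and the assumptions above, initialized here with the baseline values used for Table 1's miss counts:

```c
/* Illustrative summary of the simulated configuration (Table 2 and
 * the assumptions above); struct and field names are our shorthand. */
#include <stdio.h>

typedef struct {
    int i1_kb, d1_kb;         /* private L1 sizes: 4, 8, 16, 32, or 64 KB */
    int line_bytes;           /* L1/L2 line size: 16..256 B               */
    int assoc;                /* L1/L2 associativity: 1..16-way           */
    int l2_mb;                /* shared L2 size: fixed at 1 MB            */
    int cores;                /* number of cores: fixed at 8              */
    int lock_pct_l2;          /* locked portion of the L2, e.g. 25%       */
    int mem_bus_delay_ratio;  /* L2<->memory bus delay = 15x L1<->L2 bus  */
    int write_back;           /* write-back memory update policy          */
} sim_cfg_t;

int main(void)
{
    /* Baseline configuration from Table 1's miss-count row. */
    sim_cfg_t cfg = {
        .i1_kb = 16, .d1_kb = 16, .line_bytes = 128, .assoc = 8,
        .l2_mb = 1, .cores = 8, .lock_pct_l2 = 25,
        .mem_bus_delay_ratio = 15, .write_back = 1
    };
    printf("I1/D1 %d KB, line %d B, %d-way, L2 %d MB, %d%% of L2 locked\n",
           cfg.i1_kb, cfg.line_bytes, cfg.assoc, cfg.l2_mb, cfg.lock_pct_l2);
    return 0;
}
```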

4.1.2. Simulation tools

VisualSim and Heptane simulation tools are used in this work. VisualSim (a.k.a. VisualSim Architect) from Mirabilis Design is a graphical system-level simulation tool [26].


We installed VisualSim on a Dell PowerEdge 1600SC computer under Windows XP. Using VisualSim, we model and run the simulation program and obtain the simulation results. The results (from VisualSim) are stored as text and/or graph files. Heptane (Hades Embedded Processor Timing ANalyzEr) is a WCET analysis tool for embedded systems (see Heptane [25]). Prior to running simulations on VisualSim, we configured Heptane under Linux Fedora 10 on the same Dell PowerEdge 1600SC PC (where we have VisualSim under Windows XP). The results from Heptane, containing the instruction cache miss information of the considered workloads, are placed in the directory specified in the configuration file and are viewed through a Web browser by opening the file "HTML/index.html". A limitation of Heptane is that it only gives instruction miss information, not data miss information.

4.1.3. Input and output parameters

Important input parameters are shown in Table 2. We simulate an 8-core system with I1, D1, and L2 caches as previously described. We keep the L2 size fixed at 1 MB and vary the I1/D1 size, the L1/L2 line size, the associativity level, and the locked L2 portion size (percentage of the total L2). Output parameters include the average delay per task and the total power consumption. Delay is defined as the time between the start of execution of a task and its end. In this simulation, for each group of 8 tasks, the maximum amount of time to finish a task is used to obtain the mean delay per task. In this work, we perform an activity-based power analysis using the VisualSim simulation tool. VisualSim has a Power Manager Module (PMM) that calculates the total power consumption. PMM properties can be changed to add/remove components (core, cache, main memory, etc.) and to reset the amount of power consumed by each component. Three power states are considered – active (full on), standby (on), and sleep (off). In the active state, a component receives full (adequate) power from the system to deliver full functionality to the user. For a task i, a component j consumes the amount of power Pij(active) to be fully functional. In the standby state, a component is partially powered, with automatic wakeup on request. For the same task i, the same component j consumes the amount of power Pij(standby) while remaining in the standby state. In the sleep state, a component is turned off and should not consume any significant energy. The total power consumption can be expressed by Eq. (1) below.

$$P(\mathrm{total}) = \sum_{i=1,\, j=1}^{X,\, Y} \left( P_{ij}(\mathrm{active}) + P_{ij}(\mathrm{standby}) \right) \qquad (1)$$

In this equation, X is the total number of tasks and Y is the total number of components (like processing cores, caches, buses, and main memory).
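Computationally, Eq. (1) is a double sum over tasks and components. A minimal C transcription, assuming per-task, per-component active and standby power matrices (the array names and shapes are illustrative):

```c
/* Direct transcription of Eq. (1): total power is the sum of the
 * active and standby power of every component j during every task i.
 * X = number of tasks, Y = number of components (cores, caches, ...). */
double total_power(int X, int Y,
                   double p_active[X][Y], double p_standby[X][Y])
{
    double total = 0.0;
    for (int i = 0; i < X; i++)          /* over tasks      */
        for (int j = 0; j < Y; j++)      /* over components */
            total += p_active[i][j] + p_standby[i][j];
    return total;
}
```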

4.2. Results

In this work, we propose a miss table at the L2 cache level to enhance the timing predictability and performance/power ratio by improving cache locking performance for embedded systems. We model a system with 8 processing cores and run the simulation program under MPEG4 decoding, H.264/AVC decoding, FFT, and MI workloads. The L1, which is split into I1 and D1, is private to each core, and the L2 is shared by all cores.

Table 2
Important input parameters.

Parameter                     Value
I1 (/D1) cache size (KB)      4, 8, 16, 32, or 64
L1/L2 line size (bytes)       16, 32, 64, 128, or 256
L1/L2 associativity level     1-, 2-, 4-, 8-, or 16-way
L2 cache size (MB)            1 (fixed)
Number of cores               8 (fixed)


We obtain results (average delay per task and total power consumption) by varying the I1 cache size, I1 line size, and I1 associativity level for various L2 locked cache sizes, with and without the miss table. We notice that the results for MPEG4 and H.264/AVC follow the same pattern, as shown in Figs. 6 and 7, although H.264/AVC shows slightly less improvement in mean delay per task and total power consumption, as it is a less stressful workload. Similarly, the results for FFT and MI follow exactly the same pattern (see Figs. 6 and 7). For that reason, we discuss the impact of the miss table on cache locking by presenting the results for only MPEG4 and FFT in Figs. 8–11. In the following subsections, we present the results illustrating the impact of cache locking (with and without the miss table) and of the cache parameters (L1 cache size, line size, and associativity level) on the mean delay per task and the total power consumption. The power consumption correlates with the mean delay, but we chose to include the power results for accuracy and completeness.

4.2.1. Impact of miss table

First, we present the impact of the miss table with cache locking on the mean delay per task and the total power consumption. With an increase in the number of locked blocks, on the one hand, predictability increases, as the cache blocks with the most misses are locked; on the other hand, cache misses may increase, as the effective cache size decreases. The addition of the miss table helps improve the hit ratio by selecting the blocks with the most misses for cache locking and by selecting a block with the fewest misses for cache replacement.

Experimental results show that using the miss table with cache locking decreases the mean delay per task for the MPEG4 and H.264/AVC decoding algorithms. As illustrated in Fig. 6, it is observed that the mean delay per task starts decreasing with the increase in the number of locked blocks for the MPEG4 and H.264/AVC applications. Beyond 25% cache locking, and for the MPEG4 and H.264/AVC applications, the mean delay per task increases with the increase in the number of locked blocks, as the effective cache size and the cache hits decrease. It is also observed that for smaller applications like FFT (or MI), which entirely fit in the I1 cache, the mean delay per task slightly increases (by about 3%) due to the overhead of adding the miss table. The FFT/(MI) No-Lock No-MT curve is not visible in this figure as it overlaps with the FFT/(MI) locking-without-MT curve. Note that the FFT/(MI) locking-with-MT curve is right on top of the FFT/(MI) locking-without-MT curve, showing a minor increase in mean delay per task.

Using a miss table with cache locking has a positive impact on the total power consumption for some multimedia applications. From the experimental results we observe that the total power consumption starts decreasing with the increase in the number of locked blocks for the MPEG4 and H.264/AVC applications (see Fig. 7). We also observe that the total power consumption decreases more when the miss table is used with cache locking (up to 25% for MPEG4 and H.264/AVC). However, for FFT (and MI), the total power consumption slightly increases (by about 2%) due to the addition of the miss table.

Based on Figs. 6 and 7, we report results with 25% cache locking in the next subsections, as 25% cache locking produces the best performance/power ratio. Note that the L1 and L2 cache parameters used in this experiment (Figs. 6 and 7) were set to the best values (those which minimize delay and power), as will be presented in the next subsections.

4.2.2. Impact of I1 cache size

We obtain the average delay per task and the total power consumption for various I1 cache sizes for no locking and 25% L2 locking using the MPEG4 decoding, H.264/AVC decoding, FFT, and MI workloads.


Fig. 6. Mean delay per task versus L2 locked cache size (with and without miss table).

Fig. 7. Total power consumption versus L2 locked cache size (with and without miss table).

Fig. 8. Mean delay per task versus I1 cache size.

Fig. 9. Total power consumption versus I1 cache size.


Fig. 10. Mean delay per task versus I1 line size.

Fig. 11. Total power consumption versus I1 line size.

Locking 25% of the L2 is chosen because 25% cache locking was shown to perform best, as discussed earlier. We do not show the results for H.264/AVC, as they follow the same pattern as MPEG4; similarly, we do not show the results for MI, as they follow the same pattern as FFT. As shown in Fig. 8, the mean delay per task for MPEG4 decoding decreases when cache locking and the miss table are applied, for all cache sizes. Only the I1 cache size is considered, as the Heptane package can simulate only the instruction cache and not the data cache. Results also show that for MPEG4 decoding, the average delay per task decreases with the increase in I1 cache size. As the FFT code fits inside the I1, cache locking does not have any positive impact on the mean delay per task; the miss table slightly increases the mean delay per task (by about 3%) for FFT.

Similarly, experimental results show that regardless of the I1 cache size, the total power consumption due to MPEG4 decoding decreases when we compare no locking with cache locking (see Fig. 9). The total power consumption for MPEG4 decoding decreases even more when the miss table is used along with cache locking. Like the mean delay per task, the total power consumption is also not influenced by cache locking and I1 size for FFT (and MI); the total power consumption slightly increases (by about 2%) due to the addition of the miss table.

4.2.3. Impact of I1 line size

A larger line size helps improve performance by reducing compulsory cache misses (a compulsory miss occurs on the first access to a block) in embedded systems. However, too large a line size may increase capacity cache misses (a capacity miss occurs when blocks are discarded from the cache because the cache cannot contain all the blocks needed for program execution) and may thereby reduce performance by increasing the mean delay per task and the power requirement.

In this experiment, we use the same line size for the I1, D1, and L2. Fig. 10 shows the average delay per task versus the I1 line size for no locking and for 25% L2 cache locking with and without the MT. We observe that the average delay per task goes down as cache locking and the miss table are used and the I1 line size is increased from 16 bytes (up to 128 bytes) for MPEG4 (and H.264/AVC). It is noted that the average delay per task is slightly increased (by about 3%) due to the addition of the miss table for FFT (and MI). Simulation results show that the total power consumption goes down for MPEG4, regardless of the line size, when the miss table is used with cache locking, as depicted in Fig. 11. Simulation results also show that for the MPEG4 decoder the total power consumption decreases with increasing line size, leveling off at 128 B. For line sizes greater than 128 B, the total power consumption increases for MPEG4 due to cache pollution. It is also noted that, like the mean delay per task, the total power consumption is slightly increased (by about 2%) due to the addition of the miss table for FFT (and MI).

4.2.4. Impact of I1 associativity level

In the case of direct-mapped or set-associative block placement strategies, conflict cache misses occur when several blocks are mapped to the same set or block frame. In this subsection, we discuss the impact of the I1 associativity level on the average delay per task and the total power consumption for no locking and for 25% L2 locking. The impact of no locking and of 25% L2 cache locking on the mean delay per task for varying I1 associativity levels is shown in Fig. 12. Experimental results show that for any I1 associativity level, the mean delay per task for MPEG4 decreases when we move from no locking to 25% L2 cache locking (with the miss table).


Fig. 12. Mean delay per task versus I1 associativity level.

Fig. 13. Total power consumption versus I1 associativity level.

Table 3
Percent changes of delay and power for the MPEG4 and FFT applications.

                                       Application   Delay (%)   Power (%)
25% L2 cache locking                   MPEG4         (−) 16      (−) 14
                                       FFT           0           0
25% L2 cache locking and miss table    MPEG4         (−) 21      (−) 18
                                       FFT           (+) 3       (+) 2

The decrease in the mean delay per task is significant for smaller levels of associativity (1-way direct mapping to 2-way). Beyond 4-way, the mean delay per task remains almost the same. It is noted that the addition of the miss table helps decrease the mean delay per task for MPEG4. However, for the various I1 associativity levels, the average delay per task increases a little (by about 3%) when the miss table is used with cache locking for FFT.

Simulation results show that regardless of the I1 associativity level, the total power consumption due to the MPEG4 decoder decreases when we compare no locking with 25% L2 cache locking with the miss table (see Fig. 13). It is noted that the decrease in total power consumption due to an increasing associativity level is significant between 1-way (direct mapping) and 4-way. It is also noted that for the different I1 associativity levels the total power consumption increases a little (by about 2%) when the miss table is used with cache locking for FFT, due to the overhead from the miss table.

Finally, we summarize the impact of cache locking and the miss table on the mean delay per task and the total power consumption in Table 3.

The maximum decrement (−) in the mean delay per task and the total power consumption from their initial values with No-Lock No-MT (the values of the respective parameters are given in Figs. 6 and 7) is considered. As shown in Table 3, for MPEG4, a 21% reduction in mean delay per task and an 18% reduction in total power consumption are achieved using the miss table-based cache locking technique. It is noted that using the miss table with cache locking has no positive impact on performance and power for FFT (and MI) in this experiment; instead, the mean delay per task slightly increases by 3% and the total power consumption by 2% due to the miss table overhead.

5. Conclusion

A cache memory improves performance by bridging the speed gap between the main memory and the processing core(s). However, the presence of cache memory (in both single-core and multi-core architectures) makes the execution time more unpredictable. Studies show that cache locking may improve timing predictability at the cost of performance. In this work, we propose a miss table-based cache locking technique for embedded systems to increase the timing predictability and the performance/power ratio by improving cache locking performance. This technique is suitable for both single-core and multi-core architectures. MT-based locking reduces cache misses by locking/replacing the right cache blocks, thereby boosting predictability and reducing the average delay per task and the total power consumption. We evaluate the proposed miss table at cache level-2 by simulating an 8-core processor that has 2 levels of caches.


We obtain results by using MPEG4 decoding, H.264/AVC decoding, FFT, and MI workloads. Experimental results show that the use of the miss table improves cache locking as well as the overall performance for the MPEG4 and H.264/AVC decoders for all I1 cache sizes, line sizes, associativity levels, and L2 locked cache sizes. For MPEG4 (and H.264/AVC) decoding, a reduction of up to 21% in mean delay per task and a reduction of up to 18% in total power consumption are achieved by using the miss table with 25% L2 cache locking. Experimental results also show that for FFT (and MI), cache locking has no positive impact on the mean delay per task and the total power consumption, as their code fits completely inside the I1 cache. Due to the miss table overhead, the mean delay per task slightly increases by 3% and the total power consumption increases by 2% for FFT. Even though the performance gain and power saving due to the addition of the MT are smaller than those offered by cache locking in this experiment, we believe that the MT should lead to a higher performance gain and lower power consumption for applications with worse code cache behavior than MPEG4 and H.264/AVC. Also, the additional cost and power consumption due to the addition of the MT are affordable. Modern embedded systems are expected to support all kinds of applications, including those with the worst cache behavior. Therefore, we consider the addition of the MT a viable solution to further improve cache locking performance.

This MT-based locking technique is applicable to both uniprocessors and multi-cores, but in this paper, we simulate a multi-core architecture. The simulation of a multi-core architecture, although with a single-threaded application running on a single core, allows for the simulation of the right sizing of the caches and realistic bus delays, as in a real multi-core system. Moreover, in a multi-core architecture it is possible to run several single-threaded applications at the same time on multiple cores. When the threads are independent and have little in common and the code and data are properly placed in the cache memories to avoid conflicts, as described by Mezetti et al. [10], most cache effects disappear. Given that our simulated multi-core architecture has a large L2 to accommodate at least 8 cores, that each core has a private L1 cache, and that code is properly allocated to the I1 and L2 caches, we believe that multiple-cache effects would be minimal and would not impact the results. The total MT size is much smaller than the cache structures, and the MT is looked up at the same time as the L1; so the performance and power consumption overheads of the proposed MT-based solution are minor, as confirmed by our results. The MT organization can support multiple applications running on multiple cores by providing one MT way per core. Furthermore, the proposed MT organization is scalable, as more ports can be added to it to serve a larger number of cores. As an extension of this work, we plan to investigate the impact of using a private miss table for each core on the timing predictability and performance/power ratio of a multi-core system with a shared L2 and with private L2s.

References

[1] A. Arnaud, I. Puaut, Dynamic Instruction Cache Locking in Hard Real-Time Systems, in: Proc. of the 14th International Conference on Real-Time and Network Systems (RTNS), IEEE, Poitiers, France, May 2006.
[2] A. Asaduzzaman, N. Limbachiya, I. Mahgoub, F.N. Sibai, Evaluation of I-Cache Locking Technique for Real-Time Embedded Systems, in: Proc. of the IEEE Int. Conference on Innovations in IT (IIT'07), IEEE, UAE, 2007, pp. 342–346.
[3] A. Asaduzzaman, I. Mahgoub, Cache modeling and optimization for portable devices running MPEG-4 video decoder, Int. J. Multimedia Tools Appl. 28 (1) (2006) 239–256.
[4] A. Asaduzzaman, I. Mahgoub, F.N. Sibai, Impact of L1 Entire Locking and L2 Way Locking on Performance, Power Consumption, and Predictability of Multicore Real-time Systems, in: Proc. of the IEEE Int. Conference on Computer Systems and Applications (AICCSA'09), IEEE, Rabat, Morocco, 2009, pp. 705–711.


[5] N. Blanchford, Cell Architecture Explained, Version 2.
[6] A. Campoy, E. Tamura, S. Saez, F. Rodriguez, J. Busquets-Mataix, On Using Locking Caches in Embedded Real-time Systems, in: Proc. of ICESS-05, LNCS 3820, Springer, Berlin, 2005, pp. 150–159.
[7] D.K. Every, IBM's Cell Processor: The Next Generation of Computing? Shareware Press, 2005.
[8] C. Harrison, Programming the Cache on the PowerPC 750GX/FX – Use Cache Management Instructions to Improve Performance, IBM Microcontroller Applications Group, 2005.
[9] B. Jacob, S. Ng, D. Wang, Memory Systems: Cache, DRAM, Disk, Morgan Kaufmann, 2007.
[10] E. Mezetti, H. Holsti, A. Colin, G. Bernat, T. Vardanega, Attacking the Sources of Unpredictability in the Instruction Cache Behavior, in: Proc. of the 16th Int. Conference on Real-Time and Network Systems (RTNS), 2008, pp. 151–160.
[11] I. Puaut, Cache Analysis vs. Static Cache Locking for Schedulability Analysis in Multitasking Real-time Systems, in: Proc. of the 2nd International Workshop on Worst-Case Execution Time Analysis, in conjunction with the 14th Euromicro Conference on Real-Time Systems, Vienna, Austria, June 2002.
[12] I. Puaut, C. Pais, Scratchpad Memories vs. Locked Caches in Hard Real-time Systems: A Quantitative Comparison, in: Proc. of the Design, Automation and Test in Europe Conference and Exhibition, 2007, pp. 1–6.
[13] V. Romachenko, Quad-Core Opteron: Architecture and Roadmaps, 2006.
[14] V. Romachenko, Evaluation of the Multi-core Processor Architecture Intel Core: Conroe, Kentsfield, 2006.
[15] F.N. Sibai, On the Performance Benefits of Sharing and Privatizing Second and Third Level Cache Memories in Homogeneous Multi-Core Architectures, Microprocessors and Microsystems: Embedded Hardware Design, vol. 32, Elsevier, 2008, pp. 405–412.
[16] J. Stokes, Xenon's L2 vs. Cell's Local Storage, and Some Notes on IBM/Nintendo's Gekko, Ars Technica, 2005.
[17] J. Stokes, Introducing the IBM/Sony/Toshiba Cell Processor – Part II: The Cell Architecture, Ars Technica, 2005.
[18] V. Suhendra, T. Mitra, Exploring Locking and Partitioning for Predictable Shared Caches on Multi-cores, in: Proc. of the Design Automation Conference (DAC'08), ACM, Anaheim, USA, 2008, pp. 300–303.
[19] E. Tamura, J. Busquets-Mataix, J. Martin, A. Campoy, A Comparison of Three Genetic Algorithms for Locking-Cache Contents Selection in Real-time Systems, in: Proc. of the International Conference on Adaptive and Natural Computing Algorithms, Coimbra, Portugal, 2005, pp. 462–465.
[20] E. Tamura, F. Rodriguez, J. Busquets-Mataix, A. Campoy, High Performance Memory Architectures with Dynamic Locking Cache for Real-time Systems, in: Proc. of the 16th Euromicro Conference on Real-Time Systems, Euromicro, Italy, 2004, pp. 1–4.
[21] G. Torres, Inside Pentium 4 Architecture, Hardware Secrets, LLC, 2005.
[22] A. Vance, Cell Processor Goes Commando, The Register, 2006.
[23] X. Vera, B. Lister, Data Cache Locking for Higher Program Predictability, in: Proc. of the 2003 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, ACM, 2003, pp. 272–282.
[24] Smart Cache, Cache – Smart Computing Encyclopedia, 2008.
[25] Heptane (Hades Embedded Processor Timing ANalyzEr), A Tree-based WCET Analysis Tool, 2009.
[26] VisualSim (Short for VisualSim Architect), A System-Level Simulator from Mirabilis Design, Inc., 2009.

Abu Asaduzzaman received the PhD and MS degrees, both in Computer Engineering, from Florida Atlantic University (FAU) in 2009 and 1997, respectively, and the BS degree in Electrical Engineering from Bangladesh University of Engineering and Technology (BUET) in 1993. Currently, he is working at FAU as a Specialist in Computer Applications. Previously (2003–2006), he taught various hardware/software classes for the FAU Computer and Electrical Engineering and Computer Science (CEECS) Department. His research interests include modeling and simulation, real-time embedded systems, multi-core architecture, and cache optimization. He has published several journal and conference papers in these areas. He is a member of the IEEE and of the honor societies Phi Kappa Phi, Tau Beta Pi, Upsilon Phi Epsilon, and Golden Key. His biography is published in the 1997 edition of Who's Who Among Students in American Universities and Colleges and in the 2010 edition of Who's Who in America.



Fadi N. Sibai received the PhD and MS degrees from Texas A&M University and the BS degree from the University of Texas at Austin, all in Electrical Engineering. He has been directing the Computer Systems Design program (since 2006) and the IBM Cell Center of Competency (since 2008) at the UAE University, Al Ain, United Arab Emirates. During 1996–2006, he worked for Intel Corporation, Santa Clara, California, USA, and during 1990–1996 he was an Assistant Professor of Electrical Engineering at the University of Akron, Ohio, USA. His research interests include computer architecture, parallel and distributed computing, multi-core embedded systems, and performance evaluation. He has published over 100 papers and reports and has organized or served on the program committees of 18 international conferences. He is a member of Eta Kappa Nu. Dr. Sibai's biography is published in the 2010 edition of Who's Who in The World.

Manira Rani is currently a graduate student in the Department of Computer and Electrical Engineering and Computer Science at Florida Atlantic University. Ms. Rani received the MS degree in Fisheries from the University of Tromso, Norway, in 2005 and the BS degree in Zoology from Dhaka University, Bangladesh, in 2003. Her current research interests include performance evaluation, architecture exploration, and multicore embedded systems. She has published several research papers in these areas. She is a member of the honor societies Tau Beta Pi, Upsilon Phi Epsilon, and Golden Key.