Computing Systems in Engineering, Vol. 6, Nos 4/5, pp. 459-464, 1995
Copyright © 1995 Elsevier Science Ltd. Printed in Great Britain. All rights reserved
0956-0521/95 $9.50 + 0.00
0956-0561(95)00048-8
A SMART CACHE FOR IMPROVED VECTOR PERFORMANCE

MICHAEL K. GSCHWIND† and THOMAS J. PIETSCH‡

†Institut für Technische Informatik, Technische Universität Wien, Treitlstraße 3-182-2, A-1040 Wien, Austria
‡UNISYS Österreich GmbH, Information Services, Seidengasse 33-35, A-1071 Wien, Austria

(Received 9 December 1993; accepted in revised form 30 June 1995)
Abstract—As the speed of microprocessors increases at a breathtaking rate, the gap between processor and memory system performance is getting worse. To alleviate this problem, all modern processors contain caches, but even with caches, processors cannot achieve their peak performance. We propose a mechanism, smart caching, which extends the power of conventional memory subsystems by including a prefetch unit. This prefetch unit is responsible for using the available memory bandwidth efficiently by fetching memory data before they are actually needed. Prefetching allows high-level application knowledge to be used to increase memory performance, which currently constrains the performance of most systems. While prefetching does not reduce the latency of memory accesses, it hides this latency by overlapping memory access and instruction execution.
1. INTRODUCTION

In recent years, the memory system has begun to constrain overall system performance as new architecture styles have been introduced. At the same time, DRAM has benefited only gradually from technological improvements. Amdahl's Law shows that further improvements in system performance can only be made if this disparity is addressed.[1]

Today, memory read latency is one of the most significant bottlenecks in computer system design. While CPU performance has increased drastically over the past few years,[2,3] the improvement in DRAM performance has been only gradual: an average annual increase in CPU performance of 25% to 100% has been paralleled by an increase of only 10% in DRAM performance. While throughput can be addressed simply by widening data paths and by using several memory banks, no such simple strategy is available for reducing latency.

Cache memories reduce memory access latency, but they are only effective if values stored in the cache are accessed several times, amortizing the cache loading overhead. The first reference to a value will always result in a cache miss, unless the value resides in a cache line already resident in the cache. A cache miss can result in a delay equivalent to the execution of 25 or more instructions. With 20% of instructions being memory access instructions, and a presumed hit rate of 97%, this translates to a slowdown from the potential peak performance of as much as 15%!
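This figure follows from a simple back-of-the-envelope calculation (our arithmetic, assuming one instruction per cycle apart from miss stalls and a 25-cycle miss penalty):

    stall cycles per instruction = 0.20 × (1 − 0.97) × 25 = 0.15

i.e. cache misses alone add roughly 15% to the execution time of an otherwise peak-speed program.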
To prefetch the appropriate values, high-level semantic knowledge is necessary to guide the prefetching mechanism. We claim that this is best achieved by adding instructions which express this high-level knowledge in a form that can be processed by the CPU. This is by far preferable to inferring stride data at run time, as stride is static information which can be deduced once, at compile time.

2. REDUCING MEMORY LATENCY THROUGH PREFETCHING

To measure the benefit which can be derived from intelligent prefetching, we decided to add a prefetching mechanism to a popular RISC processor, the MIPS R3000.[4] Originally, these extensions were targeted towards improving the execution of functional and logic programming languages such as Prolog,[5] but it soon occurred to us that the results obtained therewith could equally well be applied to speed up list and vector processing.[6]

A large portion of many programs is made up of list and/or vector processing. These operations cannot be adequately sped up by caches, as they exhibit poor locality and a large working set. However, access to this type of data is extremely predictable: after accessing an element, the probability is high that either the next or the previous element will be accessed. Thus we decided to implement instructions which, when one array/list element is accessed, automatically prefetch the next one. This prefetching occurs concurrently with the execution of the actual program. Thus, the penalty incurred by accessing uncached data is effectively "hidden" under normal program execution time.
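A rough software analogue of this overlap is a hand-pipelined loop in which the load of the next element is issued before the current one is consumed (a C sketch for illustration only; our mechanism performs this overlap transparently in hardware, without the extra code):

    /* Software-pipelined sum over an array X of n >= 1 elements: the
     * load of X[j+1] is issued before the add that consumes X[j], so
     * load latency overlaps computation on an in-order machine. */
    long sum_pipelined(const long *X, int n)
    {
        long sum = 0;
        long cur = X[0];
        for (int j = 0; j + 1 < n; j++) {
            long next = X[j + 1];   /* start fetching the next element */
            sum += cur;             /* meanwhile, use the current one  */
            cur = next;
        }
        return sum + cur;
    }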
Due to spatial locality, caches can exhibit prefetching behavior if several of the accessed elements reside in the same cache line. In lists and vectors, however, the next element does not necessarily reside in a physically adjacent memory location. For this reason, vector programs exhibit poor spatial locality, limiting the gains which can be derived from the use of a cache. Consider two elements B(i, j) and B(i + 1, j) of a row-major matrix: while they are logically adjacent, in physical memory they are separated by a stride of element_size × row_size bytes. Vector data are therefore not in the class of data accesses which can derive benefits from the use of a cache memory. Especially when vectors are larger than the cache, vector data which have been loaded into the cache will not be accessed again while they remain in the cache. They may, however, push out other, useful data, a process which is called cache pollution. Since vector accesses are highly predictable, the location of the next access can almost always be predicted correctly. This translates into a high effectiveness of prefetch strategies.
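To make the stride concrete, the following C fragment (dimensions are ours, chosen for illustration) sums one column of a row-major matrix; consecutive accesses are COLS × sizeof(long) = 4000 bytes apart on a 32-bit machine, so virtually every access falls into a different cache line:

    #define ROWS 1000
    #define COLS 1000

    /* Sum column j of a row-major matrix B. Successive accesses
     * B[i][j] and B[i+1][j] are COLS * sizeof(long) bytes apart, so
     * almost every access misses the cache while polluting it with
     * unused neighbors from each fetched line. */
    long sum_column(long B[ROWS][COLS], int j)
    {
        long sum = 0;
        for (int i = 0; i < ROWS; i++)
            sum += B[i][j];
        return sum;
    }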
3. THE EXTENDED INSTRUCTION SET

We extended the MIPS R3000 instruction set to include instructions to

• transmit static information about the stride behavior of a series of accesses,
• supply the address of the first access,
• specify the data type of the prefetched value,
• start the prefetching mechanism, and, finally,
• use a prefetched value and issue the next prefetch command to the prefetch unit.

Issuing a new prefetch command consists of several steps: first, the new memory address is computed by adding the stride factor to the address of the last access. Then, this address is stored back to the general register file for later usage. Finally, a prefetch request is sent off to the prefetch unit. (A C sketch of these semantics follows the assembly example below.)

To allow prefetching of elements with non-unit stride, we introduce a stride register file. This register file supplies the offset to be added upon execution of an instruction supporting the prefetch mode. It is loaded by the addis rd, rt, immediate instruction, which adds the immediate constant to the contents of the general purpose register rt and stores the result in stride register rd.

Prefetching is initiated by the fetch.dt rd, offset(base) instruction. The address operand specifies the address of the first memory access; this address is stored in the general purpose register rd. The dt tag specifies the data type of the values being prefetched. The extended instruction set also contains a full complement of arithmetic and logic instructions to be used in conjunction with prefetched values.

These prefetched values are stored in the smart cache. Each prefetched value is stored in the cache cell corresponding to the general purpose register holding its memory address. To initiate the next prefetch, the memory address stored in the general purpose register file is incremented† by the stride value (stored in the corresponding stride register). Since address computation is integrated into the instruction execution, this feature has the added benefit of greatly reducing pathlength. A restricted version of this pathlength reduction feature is already present in the HP Precision Architecture, where load instructions using a base register and an offset can write the result of the address calculation back into the base register.[7] A detailed instruction set description can be found in Ref. [6].

A simple example will serve to illustrate the usage of the extended instruction set:

    for (j = 0; j < 100; j++)
        sum += X[j];
Using the extended instruction set, this code fragment is translated into the following machine code sequence:

        addis   $3, $0, 4        # stride register 3 <- 4 (word size)
        la      $5, X            # load address of the first element
        fetch.w $3, 0($5)        # set base register $3, start prefetching
    L8: add     $16, ($3)+, $16  # use prefetched value, prefetch next one
        addu    $4, $4, 1        # increment loop counter j
        slt     $2, $4, 100      # set $2 if j < 100
        bne     $2, $0, L8       # loop while j < 100
The first instruction in this sequence initializes the stride register to the stride between two consecutively accessed elements; in our particular case, this is the size of a machine word (4 bytes). The next instruction loads the address of the first element which will be accessed into register $5. This register is then used to set up register $3 as the base register for prefetching data of type .w, i.e. 32-bit integers.

Table 1. Summary of the instruction set extensions

    Instruction                  Description
    addis rd, rt, offset         Set up stride register.
    fetch.dt rd, offset(base)    Start prefetching. dt can be any of:
                                 w (word), h (halfword unsigned),
                                 hs (halfword signed), b (byte unsigned),
                                 bs (byte signed).
    alop rd, (rt)+, rs           Arithmetic and logic instructions using
                                 prefetched values (operand (rt)+).
    md (rt)+, rs                 Multiply and divide instructions using
                                 prefetched values.

† We also support negative stride values.
[Fig. 1. Block diagram of the extended MIPS R3000: general purpose registers (32 × 32), smart cache (31 entries) and stride register file (31 entries); the added parts include the address adder (ADA) beside the ALU.]
The add instruction after label L8 adds the prefetched operand (referenced as ($3)+) to the sum, which is kept in register $16. The add instruction also initiates the prefetch for the next value. The remaining instructions control the number of iterations and are executed in parallel with the prefetch operation.
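The C sketch promised above models the semantics of the setup instructions and of a prefetched (rt)+ operand as described in this section (all names are ours; this is a sequential behavioural model, not the hardware, so a "prefetch" here is simply an immediate read):

    #include <stdint.h>

    #define MEM_WORDS 4096
    uint32_t mem[MEM_WORDS];   /* word-addressable toy memory          */
    uint32_t gpr[32];          /* general purpose registers            */
    uint32_t stride[32];       /* stride register file                 */
    uint32_t smart_cache[32];  /* one prefetched value per register    */

    /* In hardware this request is handled asynchronously by the
     * prefetch unit; the model performs the read at once.             */
    static void issue_prefetch(int r, uint32_t addr)
    {
        smart_cache[r] = mem[(addr / 4) % MEM_WORDS];
    }

    /* addis rd, rt, imm: stride register rd <- gpr[rt] + imm          */
    static void addis(int rd, int rt, int32_t imm)
    {
        stride[rd] = gpr[rt] + imm;
    }

    /* fetch.w rd, offset(base): store the first address in gpr[rd]
     * and issue the first prefetch request.                           */
    static void fetch_w(int rd, int32_t offset, int base)
    {
        gpr[rd] = gpr[base] + offset;
        issue_prefetch(rd, gpr[rd]);
    }

    /* Operand (rt)+: use the prefetched value, advance the address by
     * the stride, and issue the next prefetch, all in a single step.  */
    static uint32_t operand_postinc(int rt)
    {
        uint32_t value = smart_cache[rt];
        gpr[rt] += stride[rt];
        issue_prefetch(rt, gpr[rt]);
        return value;
    }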
4. ARCHITECTURE

The internal architecture of the MIPS processor remains largely unchanged. To support prefetching, we introduced two extra register files, the stride register file and the smart cache. To facilitate the prefetch address arithmetic, an address adder is introduced (see Fig. 1).

Once the address for a data prefetch has been computed, a prefetch request is issued to the prefetch unit, which operates independently of the main processor. While instruction execution continues in the main processor, the prefetch unit tries to carry out the prefetch requests. The main processor only stalls if an instruction accesses the smart cache and finds that a value is not yet available; the CPU then stalls until the value in question has been loaded into the smart cache, whereupon it resumes normal execution. The prefetch unit, upon receiving a request, queues it for future processing. Prefetch requests are queued until the memory bank holding the required data is available. To supply the bandwidth required for prefetching to be effective, we introduced memory banking into the architecture (see Fig. 2).
[Fig. 2. Prefetching architecture: the CPU with its prefetch queue, data cache and read/write buffer is connected via the system bus to several memory modules.]
Table 2. Summary of execution times (in CPU cycles)

                  Redraw   Mult-All      MM100       LK1        LK9
    Optimum       159760      16016   22100617   1079993   14600051
    Baseline      192072      17798   32624270   1181319   17325296
    Prefetch 1    143413      15053   27798422   1409916   15200078
    Prefetch 4    143413      15056   19922902    990053   12600157
    Prefetch 8    143413      15060   18856029    990053   12600158
    Prefetch 16   143413      15068   18542056    990053   12600159
Several prefetch requests can be outstanding at the same time, one for each memory bank. For simulation purposes, we tested banking factors from 1 to 16. Once a memory bank becomes idle, the prefetch unit schedules the next prefetch access to memory.

During the processing of a prefetch request, any number of exceptional conditions can arise, such as a TLB miss or a bus error. There are three possible strategies for dealing with exceptional conditions:

• Generate a CPU exception immediately.
• Generate an exception when the prefetched value is accessed.
• Force the read to be executed when the prefetched value is being accessed.

For performance reasons, and to avoid spurious exceptions, we chose to store exceptional conditions and only raise an exception when the faulted memory address is actually accessed.

The smart cache associates a tag with each prefetch request, indicating the current status of the request. A prefetch request can be in one of three states:

finished. Data are available for instructions requiring this value.
pending. A prefetch request has been issued to the prefetch queue, but the value has not yet arrived in the smart cache.
faulted. A memory access has been attempted, but caused an exceptional condition (such as a TLB miss, a bus error, etc.).

When the processor tries to access a value tagged as faulted, it raises an exception so that the operating system kernel can rectify the situation.
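The access-side handling of these tags can be sketched in C by refining the smart_cache array of the earlier model into tagged cells (the names and the stall/trap hooks are ours, for illustration; the tag is checked whenever an instruction consumes a prefetched operand):

    #include <stdint.h>

    enum tag { FINISHED, PENDING, FAULTED };

    struct cache_cell {
        enum tag state;   /* status of the outstanding prefetch       */
        uint32_t value;   /* valid only when state == FINISHED        */
        int      fault;   /* exception code when state == FAULTED     */
    };

    struct cache_cell cell[32];            /* one cell per register   */

    extern void stall_until_done(int r);   /* wait for prefetch unit  */
    extern void raise_exception(int code); /* trap to the OS kernel   */

    /* Consume the prefetched value associated with register r.       */
    uint32_t smart_cache_read(int r)
    {
        if (cell[r].state == PENDING)      /* value still in flight:  */
            stall_until_done(r);           /* tag becomes FINISHED or */
                                           /* FAULTED                 */
        if (cell[r].state == FAULTED)      /* deferred exception is   */
            raise_exception(cell[r].fault);/* raised only on use      */
        return cell[r].value;              /* the common case         */
    }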
5. SIMULATION RESULTS

To evaluate the performance benefit derived from smart caching, we simulated several programs with different memory subsystems. The performance of these models is then compared to two models using the conventional MIPS R3000 instruction set, the baseline and optimal models. The baseline model has the performance characteristics of a conventional MIPS R3000 workstation, with a suitable memory subsystem consisting of page-mode DRAMs and a data cache. The optimal model executes the same MIPS program, but with an idealized memory subsystem. This subsystem is assumed to supply every requested memory value immediately and represents the optimum performance achievable by the MIPS R3000 processor.

The simulators for the different processor models are based on the publicly available MIPS simulator SPIM by James Larus,[8] as extended by Rogers and Rosenberg.[9] We compiled the benchmarks with the freely redistributable GNU project C compiler, gcc (version 2.3.3, with the options -O -fstrength-reduce).[10] Note that cycle counts, pathlength and related measures are, as might be expected, sensitive to the compiler used. We must caution readers that your actual mileage may vary.

To quantify the performance benefit derived from smart caching, we executed five programs on the unmodified architecture and on the prefetching model, using 1 to 16 memory banks:

Redraw. This fragment is taken from the screen redraw function of a popular CAD application.
Mult-All. This function computes the product of 1000 vector elements.
MM100. Multiplies two 100 × 100 matrices.
LK1. An integer version of Livermore kernel 1, from the Livermore benchmark suite.[11]†
LK9. An integer version of Livermore kernel 9.†

† Even though the results presented here were measured using integer operations, the concept can be generalized to floating point operations.

We report the simulation results as overall execution time in machine cycles (see Table 2) and as memory latency per reference, as proposed by Klaiber and Levy.[12]

In the Redraw benchmark (see Fig. 3), enough computation time is available to completely hide all memory accesses. Due to the pathlength reduction feature of our extended architecture, the average cost per memory access, compared to the original, unmodified MIPS R3000 version, is below zero. The pathlength reduction feature can reduce the cost of each memory access by up to two instructions: one for the address computation and one for issuing the load instruction.

The results for the Mult-All benchmark also show that all memory accesses are executed in parallel with the computation. Here the pathlength reduction is only one instruction, as the original MIPS version performed the address computation for the next iteration concurrently with the integer multiply operation. For the Redraw and Mult-All benchmarks, increasing the number of memory banks does not reduce the average cost per memory access, as no memory access parallelism is available.

MM100 accesses several data words per iteration. Thus, sufficient data parallelism is available for an increase in the number of memory banks to have an effect. The prefetch architecture executing on a single memory bank, while improving upon the original version, does not show any pathlength reduction benefit. This performance benefit is lost by the CPU executing stall cycles until the memory data become available.
[Fig. 3. Average cost per memory access in cycles, for the baseline configuration (a standard MIPS R3000 system) and for prefetching with 1, 4, 8 and 16 memory banks. A perfect memory system has an average cost of 0, i.e. it delivers one word per machine cycle. Negative values are the result of pathlength reduction: prefetching requires fewer instructions to compute array addresses and to issue memory requests.]
The Livermore kernel 1 actually loses performance on the prefetching architecture when using a single memory bank. Here, the data cache prefetches a significant amount of data from memory, as multiple data items are fetched whenever the data cache loads a line. Such a line is transferred from memory in a single burst, which is more efficient than the single requests issued by the prefetching mechanism. As up to three data items can be prefetched at the same time, the availability of multiple memory banks improves performance significantly.

Livermore kernel 9 executes 10 memory references per iteration and performs a significant amount of arithmetic on them. Thus, performance is bounded either by the available memory bandwidth (in the case of a single memory bank) or by the time required to perform the operations on the data set (in the case of multiple memory banks).

The presented results show that, as might be expected, the efficiency of multiple memory banks depends on the availability of data parallelism to keep these memory banks busy. Additional parallelism can be obtained by unrolling the computation loops present in all of these codes and executing operations from several original loop iterations concurrently, as sketched below.
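For illustration (our example, not taken from the measured codes), four-way unrolling of the earlier summation loop yields four independent access streams, so up to four prefetch requests can be outstanding at once, one per memory bank:

    /* Four-way unrolled sum over 100 elements: four independent
     * partial sums create four concurrent access streams that can
     * keep four memory banks busy at the same time. */
    long sum_unrolled(const long X[100])
    {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int j = 0; j < 100; j += 4) {
            s0 += X[j];
            s1 += X[j + 1];
            s2 += X[j + 2];
            s3 += X[j + 3];
        }
        return s0 + s1 + s2 + s3;
    }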
6. RELATED WORK
The CDC-6600 had separate address and data register sets; memory values were loaded into the data registers by storing their address in the respective address registers. We do not know how well compilers exploited this capability to prefetch values.
Jouppi[13] proposes stream buffers to prefetch consecutive cache lines. This mechanism cannot detect stride behavior and is therefore only useful for data which are accessed consecutively, or for small problem sets where the data are stored in adjacent cache lines. Fu et al.[14] introduce a hardware prefetching scheme in which the prefetching unit compares (instruction, memory address) tuples to deduce stride behavior. This approach brings significant performance gains in some cases, but defers program analysis from compile time to run time; thus, program transformations to improve prefetching effectiveness cannot be applied. Rogers and Li[15] use an explicit prefetch instruction with minimal hardware support. By prefetching into the regular register file, register pressure can become a significant problem.

7. CONCLUSION
We have shown that the addition of a small prefetch buffer can enhance performance dramatically. This mechanism effectively hides the latency of RAM by overlapping memory access and instruction execution. In our model, the time for executing a program is bounded by the slower of the available memory bandwidth and the time required to execute the code. To alleviate possible memory bandwidth bottlenecks, our prefetch mechanism efficiently supports the use of multiple memory banks. The prefetch unit is controlled by several new instructions, which allow the address of the first reference and the stride value to be set. While our work uses a specific architecture, the MIPS R3000, to prove the concept, this mechanism can easily be added to any other existing architecture.
REFERENCES
1. G. M. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," Proceedings AFIPS 1967 Spring Joint Computer Conference, pp. 483-485, Atlantic City, NJ, April 1967.
2. J. L. Hennessy and N. P. Jouppi, "Computer technology and architecture: an evolving interaction," IEEE Computer 24(9), 18-29 (1991).
3. J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. Morgan Kaufmann, San Mateo, CA, 1990.
4. G. Kane, MIPS RISC Architecture. Prentice Hall, 1989.
5. A. Krall and T. Pietsch, "R3000 extensions for the support of logic and functional programming languages." Technical report, Abteilung für Programmiersprachen, Technische Universität Wien, 1992.
6. M. K. Gschwind and T. J. Pietsch, "Smart cache prefetching for improved performance," Proceedings Austro-Chip-93, Bad Waltersdorf, Österreich, June 1993.
7. R. Lee, M. Mahon and D. Morris, "Pathlength reduction features in the PA-RISC architecture," Compcon Proceedings, IEEE, 1992.
8. J. R. Larus, "SPIM S20: a MIPS R2000 simulator." Technical Report 966, University of Wisconsin-Madison, September 1990.
9. A. Rogers and S. Rosenberg, "Cycle level SPIM." Technical report, Department of Computer Science, Princeton University, 1993.
10. R. Stallman, Using and Porting GNU CC. Free Software Foundation, Cambridge, MA, 1993.
11. F. H. McMahon, "Lawrence Livermore National Laboratory FORTRAN Kernels Test: MFLOPS." FORTRAN source code, September 1991.
12. A. C. Klaiber and H. M. Levy, "An architecture for software-controlled data prefetching," Proceedings of the 18th Annual International Symposium on Computer Architecture, pp. 43-53, May 1991.
13. N. P. Jouppi, "Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers," Proceedings of the 17th Annual International Symposium on Computer Architecture, pp. 364-373, August 1990.
14. J. W. C. Fu, J. H. Patel and B. L. Janssens, "Stride directed prefetching in scalar processors," Proceedings of the 25th Annual International Symposium on Microarchitecture, pp. 102-110, 1992.
15. A. Rogers and K. Li, "Software support for speculative loads," Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems, 1992.