Journal of Systems Architecture 45 (1998) 305–322
Performance of a context cache for a multithreaded pipeline

Amos R. Omondi a,*, Michael Horne b

a Department of Computer Science, Flinders University, GPO Box 2100, Adelaide, SA 5001, Australia
b EDS, 68 Jervois Quay, Wellington, New Zealand

Received 20 August 1996; received in revised form 21 March 1997; accepted 7 October 1997
Abstract

Hardware-multithreading is a technique in the design of high performance machines that is currently the subject of much active research. Such machines are characterised by the exploitation of instruction-level concurrency, extracted from simultaneously active multiple threads of control. The main advantage of this technique is in the elimination of performance-debilitating processor latencies, and the technique is therefore useful in the design of both single-processor systems and parallel-processor systems. In this paper we discuss the design and simulated performance of a novel operand-buffering system for a high-performance multithreaded pipelined uniprocessor. © 1998 Elsevier Science B.V. All rights reserved.

Keywords: Instruction pipeline; Multithreading; Microthreading; Context cache; Renaming
* Corresponding author. E-mail: amos@ist.flinders.edu.au

1. Introduction

Multithreading is a technique in the design of high performance machines that is currently the subject of much active research; in addition, at least one commercial company is currently marketing multithreaded machines [1,2,16]. Such machines are characterised by the exploitation of low-level concurrency, extracted from simultaneously active multiple threads of control. The main advantage of this technique is in the elimination of performance-debilitating processor latencies at both low levels (instruction-level) and high levels (process-level). Consequently, the technique is useful in the design of both single-processor systems and large parallel-processor systems. In this paper we shall discuss the design and simulated performance of a new type of operand-buffering system for a high-performance pipelined single processor with multithreading.
Ideally, the performance (as measured by the speed-up or throughput) of a pipelined processor is proportional to the number of pipeline stages. In practice, however, this is usually difficult to achieve because of dependencies (in the production and consumption of data, and in control) between different instructions in the pipeline. A number of high-performance machines use a variety of sophisticated techniques to deal with these difficulties, but such techniques tend to be complex and costly, as they must be implemented in hardware. A simple solution is possible with multithreading: by having at least as many concurrent threads of control as there are pipeline stages, and by interleaving instructions from the different threads into the pipeline, it can be ensured that at any given moment all the instructions in the pipeline are independent [2]. In what follows we shall show how this ``naive'' approach is modified to yield a more efficient design.

The issues to be dealt with in the design of a multithreaded processor include the following [2]:
· Processor design to support interleaved threads.
· Processor design for fast context-switching.
· Synchronization and scheduling of threads.
The first of these is concerned with the general design of the processor and includes such aspects as the partitioning of processor functions among the pipeline stages, the role of registers and caches, the identification of threads, and so forth. The second is concerned with the design of the architecture and the storage name-space implementation in order to support the fast context-switching that is implied by the low-level interleaving of instructions from several threads. And the third is concerned with inter-thread communication and with the selection and implementation of appropriate policies for injecting instructions into the processor pipeline. In what follows we shall describe, with reference to these issues, a new design for a multithreaded pipeline processor and report on the results of a preliminary evaluation of one important aspect – the operand-buffering system – of this processor design.
The rest of the paper is organised as follows. Section 2 is a brief review of the problem of hazards in pipelined machines; the limitations of the standard ways of dealing with hazards and the role of operand-renaming are reviewed in order to place multithreading in context. Section 3 consists of a description of a new multithreaded-pipeline design and discusses the architectural and implementation issues that are fundamental to this design. Section 4 describes the operand-buffering system (a context cache) that has been devised for this pipeline, and Section 5 is a discussion of the performance of this operand buffer. Section 6 discusses related work, and Section 7 is a summary.

2. The problem of pipeline hazards

Various hazards (dependencies) that can arise between instructions in a typical pipeline may prevent it from completing one instruction in each cycle, as it ideally should, and so lead to a throughput that is far from ideal. The three types of hazard are: read-after-write, in which an instruction's progress depends on its reading a storage element (cache line, memory word, data register, program counter, condition-code register, etc.) that is yet to be updated by an earlier instruction; write-after-read, in which an instruction's progress depends on its being able to write to a location that is yet to be read by an earlier instruction; and write-after-write, in which an instruction's progress depends on its being able to write to a storage location that is also due to be written by an instruction that appears earlier. Within these three categories, we may further distinguish between computational hazards, which involve only computational units, store hazards, which involve memory, and control hazards, which arise from branch instructions.
The write-after-read and write-after-write hazards are artificial, in the sense that they can be eliminated by renaming destination operands. The essential idea in renaming is that by changing each destination name, and each subsequent reference to the same operand as a source, up to (but excluding) the next reference to the same name as a destination, all the artificial hazards are completely eliminated. As an example, renaming the third and fourth references to R0 in the code fragment of Table 1(a), in which all three types of hazard occur, yields the code fragment in Table 1(b), in which only read-after-write hazards remain. Evidently, renaming can be done in hardware or software; however, renaming in hardware offers more flexibility, as it is effective on computed addresses as well, and, moreover, the names used need not be limited to those available to the programmer or compiler (i.e. to the architectural names) but can be of whatever number yields the best performance in the hardware.

Read-after-write hazards – and these include all control hazards – are, however, more difficult to deal with. The simplest solution is to stall the pipeline whenever a hazard is detected, but this is at variance with the general performance goals of pipelining. Excluding other partial solutions, the only way to deal with them is to allow instructions to be processed in an order that is different from that in which they enter the pipeline, and data to be processed in an order other than that in which they are requested; this, then, essentially extracts spatial parallelism from what was a linear instruction stream.

Table 1
Renaming to eliminate hazards

(a) Original code        (b) Renamed code
R0 ⇐ R1/R2               R0 ⇐ R1/R2
R3 ⇐ R0 * R4             R3 ⇐ R0 * R4
R0 ⇐ R2 + R5             R7 ⇐ R2 + R5
R2 ⇐ R0 - R3             R8 ⇐ R7 - R3
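To make the renaming step concrete, the following C sketch (entirely ours and purely illustrative; the paper gives no code) applies the mapping-table idea to the four instructions of Table 1: every write to an architectural name receives a fresh physical name, and later reads are redirected through the table, so that only the true read-after-write dependencies survive. The hardware scheme of Section 4 does the same thing, but with cache lines standing in for the physical names.

```c
#include <stdio.h>

#define ARCH_REGS 16          /* architectural names R0..R15 (illustrative)      */

typedef struct { int dest, src1, src2; } Instr;   /* dest <- src1 op src2        */

static int map[ARCH_REGS];    /* architectural name -> current physical name     */
static int next_free = ARCH_REGS;  /* next unused physical name (pool not checked) */

/* Rename one instruction in place: sources are redirected first, then the
 * destination gets a fresh physical name, which removes WAR and WAW hazards. */
static void rename(Instr *i)
{
    i->src1 = map[i->src1];
    i->src2 = map[i->src2];
    map[i->dest] = next_free++;
    i->dest = map[i->dest];
}

int main(void)
{
    /* The four instructions of Table 1(a); operators omitted for brevity. */
    Instr code[4] = { {0, 1, 2}, {3, 0, 4}, {0, 2, 5}, {2, 0, 3} };

    for (int r = 0; r < ARCH_REGS; r++) map[r] = r;   /* identity mapping */

    for (int k = 0; k < 4; k++) {
        rename(&code[k]);
        printf("P%d <- P%d, P%d\n", code[k].dest, code[k].src1, code[k].src2);
    }
    return 0;
}
```

Running the sketch shows the structure of Table 1(b): the third and fourth writes acquire new names, and the only surviving dependencies are reads of names written earlier.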
Nevertheless, the usefulness of such out-of-order processing is limited by the fact that the amount of spatial parallelism in a single instruction stream is usually quite small and is of decreasing value as pipeline length grows. Moreover, the technique is not particularly useful with control hazards, unless additional measures (such as costly hardware branch-direction prediction) are employed. To make progress beyond this point, more than one instruction stream is required, and this is where multithreading comes in.

3. A multithreaded pipeline

A straightforward use of multithreading to deal with the dependencies discussed above is to have at least as many instruction streams as there are pipeline stages. Then, by ensuring that in each cycle the instruction entering the pipeline is from a different stream, the pipeline can be kept full of independent instructions. There are, however, three main problems with this approach: first, it requires that there always be a high degree of thread-level concurrency, and this is, in general, not possible at all times, especially with long-latency pipelines; second, it requires a complete replication of a number of resources – a costly business; third, as we shall argue below, it does not make very efficient use of resources such as computational registers. We have therefore developed a pipeline that extracts intra-thread (i.e. inter-instruction) concurrency in the first instance and resorts to inter-thread concurrency only when absolutely necessary; this is in sharp contrast with almost all other implementations of multithreading (in which the emphasis is on inter-thread concurrency). Basically, in this new design a thread-switch takes place only when a latency-inducing hazard is detected, rather than on every cycle. This, of course, again requires that appropriate mechanisms exist to detect and resolve inter-instruction hazards, and we
have developed new techniques to deal with this. (The general approach has been termed microthreading [3].) The new pipeline design has been developed according to the following general strategy for dealing with potential pipeline stalls. Each stage of the pipeline has a special buffer, a stall buffer, attached to it. Whenever an instruction arrives at a stage and is unable to make progress, it is dropped into the stall buffer at that stage, and an instruction capable of making progress – status signals from other parts of the pipeline eventually change the status of instructions in a buffer from ``unable to progress'' to ``able to progress'' – is picked from the buffer and forwarded to the next stage of the pipeline. (This gives the same effect as having each stage always process, and forward to the succeeding stage, each instruction that it receives – the ideal situation.) The instruction that is picked from the buffer to replace the one that is unable to proceed may be from a different stream than the instruction it replaces, and this is how inter-thread concurrency is exploited. The complete pipeline is described in detail in [4], as are various issues in the architecture–implementation relationship; here we give only a brief overview.

The instruction-preprocessing part of the pipeline we have been studying is shown in Fig. 1. The main units are as follows. The Instruction Fetch Unit fetches instructions from memory and also carries out low-level scheduling of threads. It does these by making use of four tables that hold, for each thread, a pointer to a code-space, a pointer to a work-space, the address of the next instruction block to be fetched, and a buffer to hold the current instruction block. Each line in each table has an associated status bit that indicates whether the corresponding thread is active or not. The stall buffer for this stage is just the collection of instruction-prefetch buffers, and ``unable to progress'' means that a control-transfer instruction has been decoded but the target instructions are not yet
available. The Decoding Unit carries out the initial decoding for all instructions. The unit has two tables that hold stack-pointer and program-counter values for each thread; the stall buffer here is another table with a one-instruction slot for each thread, since at any given moment a thread may have just one instruction undergoing address calculation. The third stage is the Control Point and Addressing Unit, at which the program counter and stack pointer are updated and the full addresses used in operand-fetching are calculated. The Destination Reservation Unit reserves lines in the context cache for destination operands. The stall buffer in this case is just a simple queue of instructions waiting for a destination line to be allocated. The Operand-Fetch Units fetch operands from the context cache, which is also pipelined, and store back results. They are the most complex units, as this is where most of the intra-thread and inter-thread spatial concurrency is extracted. The stall buffer here holds all instructions that are involved in read-after-write hazards or are awaiting data from memory. The Control-Transfer-Test Unit carries out all the testing that is required for conditional branching. It is located here for two main reasons: the first is that condition-testing generally does not require a full arithmetic unit and can be carried out faster in a specialised unit; the second is that by having such tests carried out as early as possible, branching latency is minimized and, in consequence, so are the number of waiting instructions and the number of threads required to achieve ideal performance. The main subject of this paper is the design and performance of the context cache.
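The per-stage stall-buffer policy described above can be pictured with the following C sketch. It is ours, not the authors' implementation; all type and field names are illustrative, and overflow handling is omitted. An instruction that cannot make progress is parked, and any parked instruction whose status has since changed – possibly one belonging to another thread – is forwarded in its place.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int  thread;              /* owning thread                                */
    bool able_to_progress;    /* updated by status signals from other stages  */
    /* opcode, operand names, etc. omitted                                    */
} Instr;

#define BUF_LINES 8           /* illustrative stall-buffer size               */

typedef struct {
    Instr slots[BUF_LINES];
    bool  used[BUF_LINES];
} StallBuffer;

/* Called once per cycle with the instruction arriving at this stage (or NULL).
 * Returns the instruction to forward to the next stage, or NULL for a bubble. */
Instr *stage_advance(StallBuffer *sb, Instr *incoming)
{
    static Instr out;                       /* simplification for the sketch  */

    if (incoming && incoming->able_to_progress)
        return incoming;                    /* common case: pass straight through */

    if (incoming) {                         /* park the stalled instruction   */
        for (int i = 0; i < BUF_LINES; i++)
            if (!sb->used[i]) { sb->slots[i] = *incoming; sb->used[i] = true; break; }
        /* buffer-full handling omitted from the sketch */
    }

    /* Wake any buffered instruction that has become able to progress; it may
     * belong to a different thread, which is how inter-thread concurrency is
     * exploited only when it is actually needed. */
    for (int i = 0; i < BUF_LINES; i++)
        if (sb->used[i] && sb->slots[i].able_to_progress) {
            sb->used[i] = false;
            out = sb->slots[i];
            return &out;
        }
    return NULL;                            /* nothing can progress: a bubble */
}
```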
Fig. 1. Instruction pipeline.
4. A multipurpose context cache

The role of the context cache is best appreciated by first considering the relevant architectural and implementation issues that led to its design. The starting point here is the observation that multithreading requires a fast and efficient method for low-level context switching. Now, if we compare a machine that has general-purpose registers, replicated once for each active thread, with a machine that has no registers but employs a cache, then we can observe that for the multiple register sets:
· Context-switching is fast because it is mostly a matter of changing a pointer from one register set to another; but resources are not used efficiently, since there will be times when a thread is idle but still has a register set assigned to it.
· A thread may be using little of its register set while another thread needs more registers. This will affect performance, in addition to being another inefficient use of resources.
· Depending on the thread granularity, the overheads of saving and restoring registers can be incurred at levels as low as procedure boundaries and below. In any case, irrespective of the thread granularity, re-assigning register sets is periodically necessary in any system with high levels of concurrency.
· In general, it is not possible to be selective about the loading and unloading of registers: all of a register set must be loaded or unloaded in one go, even though a process may need only a small portion of its register set in order to make progress. (It has been shown that in some cases it is possible for the compiler to determine exactly which registers need to be saved and restored, but such methods are never entirely successful and in some cases fail up to 80% of the time [5].) This affects performance by introducing avoidable latencies.
On the other hand, for the pure cache system:
· Resources are always used efficiently, since the contents of cache lines are automatically ejected to make way for more (immediately useful) data; moreover, cache lines are allocated only in response to some demand.
· The assignment and re-assignment of cache lines can be done on a very selective basis. Also, any context-switching overhead can be completely masked by doing it piecemeal during idle cycles of the cache–memory interface.
· The (cache part of the) dynamic context of a thread is built and taken apart in a piecemeal fashion, according to demand and use. A thread may therefore start to make progress as soon as only a small – i.e. the most recently demanded – part of its dynamic context has been loaded.
· A thread can be descheduled simply by changing its status; it is not necessary to wait for its entire dynamic context (in cache) to be unloaded.
Based on the above observations, we rejected a register-architecture in favour of a memory-architecture. Nevertheless, most machines are based on register-architectures for at least three good reasons: short addresses (and, therefore, compact code), speed of implementation, and the performance advantages of explicit (i.e. program-controlled) placement and displacement of data in high-speed storage. In order to retain these advantages, we decided that the top eight locations of each process's workspace should be addressable using only displacements from the top of the workspace, and that the context cache be implemented as a small fully associative cache. (This takes care of the first two advantages of registers; the third is dealt with in a manner described in Section 4.2.) The decision on addressing is crucial and has very far-reaching significance: effectively it embeds a register name-space into a memory name-space – a little reflection will show that most of the problems with registers arise from a separation of the two name-spaces – and allows the same hardware structures to be used for renaming and for the detection and resolution of hazards, both computational and store hazards; these (and the concept, discussed below, of ``cache locking'') are among the most interesting aspects of the context cache. Moreover, all this is in addition to a cache's conventional role as a high-speed buffer between memory and processor. Further discussion of the architectural issues will be found in [4].
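The embedding of the register name-space in the memory name-space amounts to nothing more than address arithmetic, as the following small C sketch illustrates; it is ours, assumes a word-addressed workspace, and the names are purely illustrative.

```c
#include <stdint.h>

#define NUM_SHORT_REGS 8      /* the top eight workspace locations act as "registers" */

/* Full memory address of short "register" name r (0..7) for a thread whose
 * workspace begins at wsp.  Because the result is an ordinary memory address,
 * the context cache can rename it, lock it and check hazards on it with the
 * same hardware it uses for any other operand. */
static inline uint32_t reg_address(uint32_t wsp, unsigned r)
{
    return wsp + r;           /* word-addressed workspace assumed in this sketch */
}
```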
The context cache is, in essence, a small fully associative copy-back cache that has been modified to carry out the multiple functions described above (Fig. 2). Each line of the cache consists of some status bits, an address tag, and a data slot that holds just one memory word; the replacement algorithm used is the standard Not-Recently-Used one, but any other algorithm would do. The special features of the cache come from the extra status bits, the cache-management algorithms, and the attached stall buffer. We shall describe the cache and its use in terms of these, excluding the V (Validity), U (Use), and A (Altered) status bits; the latter are just the usual bits required to indicate validity, to implement a Not-Recently-Used replacement algorithm, and to indicate (for the copy-back) whether or not the contents of a line have been changed.
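As a summary of the line format just described, the following C struct collects the conventional bits together with the special-purpose bits introduced in the subsections that follow (M, L, F, R and W). The field names are ours and purely illustrative; the design documents give only the bit functions, not a layout.

```c
#include <stdint.h>
#include <stdbool.h>

/* One line of the context cache, as described in the text. */
typedef struct {
    bool     V;      /* Validity                                              */
    bool     U;      /* Use (for Not-Recently-Used replacement)               */
    bool     A;      /* Altered (line must be copied back to memory)          */
    bool     M;      /* Mapping: this address is currently mapped to this line */
    bool     L;      /* Locked: line imitates a register and is not ejectable */
    bool     F;      /* Forwarding: pending pure "register load"              */
    bool     R;      /* Read reservation: an instruction waits to read        */
    bool     W;      /* Write reservation: a write is pending                 */
    uint32_t tag;    /* memory address of the word held in the line           */
    uint32_t data;   /* the single memory word                                */
} CacheLine;

#define CACHE_LINES 64       /* size used in the multithreading simulations   */
static CacheLine context_cache[CACHE_LINES];
```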
4.1. Renaming

The M bit is the Mapping bit and is used during renaming. To see how renaming can be carried out within a cache, consider first the standard implementation of register-renaming. In this, a distinction is made between the names of the registers and the registers themselves, and a table is then used to map the names onto the registers, i.e. to indicate which register is associated with which name at any given instant. (Note that the number of physical registers need not be the same as the number of register names.) Thus, for example, the renaming of Table 1 would be accomplished by the sequence shown in Fig. 3.
Fig. 2. Context cache.
If we apply this scheme directly to a cache, then we end up with Fig. 4, in which there is an obvious duplication of information: the address that labels a mapping-table entry is the same address that appears in the tag field of the corresponding cache line, and the address that labels the cache line is the same one that appears in the mapping-table entry for the line. We can therefore eliminate the mapping table by having a bit (the M bit) to indicate that ``the address contained within this line is currently mapped onto this line''. The cache then processes each destination name as follows. First, association is carried out on the incoming address against the cache tag field and the M field. If non-equivalence occurs, then a new line is acquired for the destination and the M bit on that line is set. Otherwise, the same procedure is carried out, but the M bit on the line giving equivalence is reset. The address in the instruction is then replaced with the address of the newly acquired cache line. Evidently, some means of maintaining ordering is needed to ensure consistency when two or more entries exist for the same address; we use an ordering matrix for this [4].
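The destination-name processing just described can be sketched as follows; the code is ours, uses only the fields needed for this step, and substitutes a trivial allocator for the real replacement policy.

```c
#include <stdint.h>
#include <stdbool.h>

/* Minimal line state for this sketch; field names are ours. */
typedef struct { bool valid, M; uint32_t tag; } Line;

#define LINES 64
static Line cache[LINES];

/* Trivial allocator, a stand-in for the Not-Recently-Used policy (which in
 * the real design also skips locked and reserved lines, Sections 4.2-4.3). */
static int allocate_line(void)
{
    static int next = 0;
    return next++ % LINES;
}

/* Process a destination name: the in-cache renaming step.  Any existing
 * mapping for the address keeps its data but loses its M bit; a fresh line
 * is acquired, tagged and marked as the current mapping, and its index
 * replaces the address in the instruction. */
int rename_destination(uint32_t addr)
{
    for (int i = 0; i < LINES; i++)
        if (cache[i].valid && cache[i].M && cache[i].tag == addr)
            cache[i].M = false;             /* old entry is no longer the mapping */

    int j = allocate_line();
    cache[j].valid = true;
    cache[j].tag   = addr;
    cache[j].M     = true;                  /* addr is now mapped onto line j */
    return j;                               /* the new destination "name"     */
}
```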
4.2. Locking and forwarding

One of the performance advantages of registers over a cache comes from the fact that once data is placed in a register, it stays there until it is explicitly removed (i.e. overwritten), whereas in a cache data can be purged (according to the replacement algorithm used) even when it is still needed. In order to retain this advantage of registers while still retaining a cache, we introduce the concept of cache locking: when one of the top eight memory locations of a workspace is mapped into the cache – recall that these are addressed by short names and are intended to imitate registers – a Locking bit (the L status bit) is set in the corresponding line. Subsequently, the replacement algorithm does not eject any line whose L bit is set, thus ensuring that data stays in the cache as long as it is needed. Unlocking is carried out when a thread terminates or when a line's contents are no longer required; the latter is determined in a manner described below.

It is also possible to do forwarding within the cache. Suppose Lr is a ``register location'' in some workspace and L is an ordinary memory location. If, during the processing of two instructions such as Lr ⇐ L and Lr ⇐ Lr + 1, a cache miss occurs on L, then ordinarily three cache lines would be required: two for the Lr destinations and one for L. By recognizing, during the processing of the second instruction, that the first is a pure ``register load'' operation, the renaming can be avoided and the number of cache lines used reduced to two; effectively, this replaces the two instructions by the single instruction Lr ⇐ L + 1. To detect such a case, a Forwarding (F) status bit is set when processing the first instruction; subsequently, when processing the second instruction, instead of renaming, the F bit is reset and the first instruction is deleted from the cache stall buffer. Similar actions could be taken for ``store'' followed by ``load'' instructions, but we have not investigated this, as such sequences are unlikely to be frequent.
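A possible shape for the lock-aware replacement pass is sketched below; it is ours, assumes a simple two-pass Not-Recently-Used scan, and is not the authors' algorithm.

```c
#include <stdbool.h>

/* Minimal per-line state for this sketch; names are ours. */
typedef struct { bool valid, U /* recently used */, L /* locked */; } Line;

#define LINES 64

/* Victim selection: a Not-Recently-Used scan that refuses to eject locked
 * lines, so data mapped from the top eight workspace locations stays
 * cache-resident for as long as a register would hold it.  (Section 4.3
 * extends the same test to lines with a pending read or write reservation.)
 * Returns -1 only if every valid line is locked; the requester then waits. */
int choose_victim(Line cache[LINES])
{
    for (int pass = 0; pass < 2; pass++) {
        for (int i = 0; i < LINES; i++) {
            if (!cache[i].valid) return i;        /* free line: use it      */
            if (cache[i].L)      continue;        /* locked: never ejected  */
            if (!cache[i].U)     return i;        /* not recently used      */
        }
        for (int i = 0; i < LINES; i++)           /* give a second chance   */
            if (!cache[i].L) cache[i].U = false;
    }
    return -1;
}
```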
Fig. 3. Register renaming.
Fig. 4. Cache renaming.
4.3. Detecting and resolving data hazards

In general, instructions, even within a single thread, leave the cache in an order that is different from their initiation sequence, and, therefore, some means is required to detect any intra-thread hazards – the only ones that remain after renaming are the read-after-write ones – and to ensure proper sequencing. This is accomplished by the use of two ``reservation'' status bits: the Read (R) bit on a line is set whenever there is an instruction waiting (in the stall buffer) to read that line, and the Write (W) bit is set whenever there is a pending ``write'' operation. The R bit is actually required only to ensure that a line is not used while it is waiting for data; a read-after-write hazard is detected if a ``read'' operation is attempted on a line that has its W bit set. We have accordingly modified the ordinary Not-Recently-Used replacement algorithm so that (in addition to lines that are locked) any line that is reserved for reading or writing is not subject to ejection. When a line is eventually read from or written to, the corresponding reservation bit is reset. At the same time, both the other reservation bit and the mapping bit are examined, and if neither is set, then the line is unlocked.
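The reservation-bit protocol can be pictured with the following C sketch; it is ours, and the function names and the integer data type are assumptions made purely for illustration.

```c
#include <stdbool.h>

/* Per-line state needed for this sketch; names are ours. */
typedef struct { bool M, L, R, W; int data; } Line;

/* Attempt to read a source operand from line `ln`.  A set W bit means that an
 * earlier instruction has yet to write the line: a true read-after-write
 * hazard, so the reader is parked in the stall buffer with the R bit set. */
bool try_read(Line *ln, int *value)
{
    if (ln->W) {           /* pending write: RAW hazard detected              */
        ln->R = true;      /* reserve the line for the waiting reader         */
        return false;      /* caller places the instruction in the stall buffer */
    }
    *value = ln->data;
    return true;
}

/* Called when the pending write finally reaches the line: clear the W
 * reservation and, if nothing else holds the line (no waiting reader, no
 * current mapping), unlock it so that it becomes ejectable again. */
void complete_write(Line *ln, int value)
{
    ln->data = value;
    ln->W = false;
    if (!ln->R && !ln->M)
        ln->L = false;     /* line no longer needed: unlock */
}
```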
4.4. Stall buffer

The stall buffer consists of three main units: the Update-and-Dispatch Unit, the Instruction Store, and an Ordering Matrix. The Update-and-Dispatch Unit inserts into the Instruction Store any instructions that are unable to make progress, monitors and updates their status (according to changes in the cache), and dispatches for execution those instructions that no longer need to wait. Each line of the Instruction Store consists of a number of status bits, the addresses of the cache lines associated with the instruction, and a data slot. The Ordering Matrix is a square matrix of binary values that is used to indicate hazards between instructions in the stall buffer. A typical case where the matrix is useful is when a number of instructions are waiting to read from one line; by being able to detect this situation, the status of all such instructions can be changed with one operation when the awaited data becomes available. Ordering matrices were first suggested by Tjaden and Flynn [6], but those were not of a practical nature; we have formulated a new type of ordering matrix that has none of the problems associated with the earlier suggestions. We shall not discuss here the details of the stall buffer and its management; our main concern in what follows is the effect of the size of the buffer – specifically, the number of lines in the Instruction Store – on performance.
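The generic bookkeeping behind an ordering matrix, in the spirit of Tjaden and Flynn [6], is sketched below. The authors' own formulation is new and its details are not given here, so this is only the basic idea, with names and sizes of our choosing.

```c
#include <stdbool.h>

#define SLOTS 16                        /* Instruction Store lines (illustrative) */

/* dep[i][j] == true means the instruction in slot i must wait for the
 * instruction in slot j (for example, a read waiting on j's pending write). */
static bool dep[SLOTS][SLOTS];

/* When the instruction in slot j completes (its result is written to the
 * cache), clearing column j wakes every waiter in a single operation. */
void complete(int j)
{
    for (int i = 0; i < SLOTS; i++)
        dep[i][j] = false;
}

/* A stalled instruction may be dispatched once its row is empty, i.e.
 * nothing it depends on is still outstanding. */
bool ready(int i)
{
    for (int j = 0; j < SLOTS; j++)
        if (dep[i][j]) return false;
    return true;
}
```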
5. Performance of the context cache

To evaluate the performance and usefulness of the context cache, we carried out simulations to measure the effects of multithreading, context-cache size, and cache stall-buffer size. The simulations were carried out by building a reconfigurable pipeline simulator and then running through it traces obtained from the Lawrence Livermore Loop kernels; the results are averaged over the 13 kernels. The performance studies covered all aspects of the pipeline, but we shall here report on only two measurements: the buffer utilization at the context cache and the pipeline speed-up (ideally, seven) over an equivalent non-pipelined machine, for both single-threaded and multithreaded operation. Comparisons are made between the context cache and conventional operand buffers. Because a fully associative cache has to be quite small in order to have a performance comparable with that of registers, the sizes used in the studies are accordingly small. (Note, however, that with a suitable architecture a small fully associative cache can give very good performance, in terms of hit ratio, if all operands are named.) Nevertheless, since direct-mapped caches are simpler than set-associative caches, which in turn are simpler than fully associative caches, and, for a fixed cost, simpler caches can be larger than more complex ones, slightly larger sizes were also simulated for the direct-mapped and set-associative caches than for the more complex caches. In what follows, rather than tabulate large sets of values, we have chosen to present just enough to show the trends that were revealed; more detailed performance figures will be found in [7]. A common feature of all these results is that the speed-ups are, in general, relatively low. This is due to a combination of the architecture (as opposed to the implementation) and the relatively naive compilation of the code; we intend to change both in future studies, but neither affects the comparative value of the performance figures. Also, as will be seen below, with an appropriate choice of parameters the performance of the context cache is very close to the ideal.
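For reference, the speed-up quoted throughout is the usual ratio of non-pipelined to pipelined execution time, bounded above by the number of pipeline stages (seven here); the symbols below are ours.

```latex
S \;=\; \frac{T_{\text{non-pipelined}}}{T_{\text{pipelined}}} \;\le\; k,
\qquad k = 7 \ \text{(pipeline stages)}.
```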
5.1. Single-thread performance

Table 2 summarises some of the results from simulating a unithreaded pipeline with various types of operand buffer of different sizes; the figures for the set-associative caches are averages over set sizes ranging from two to sixteen. The performance figures for the register file showed very little change with size (mainly as a consequence of the type of code generated). Two trends that are to be expected are also apparent in Table 2: the first is that buffer utilization decreases with buffer size; the second is that speed-up increases with buffer size. (The result of ordering the conventional caches by performance is also as we should expect.) These trends hold for the context cache too, but it will be observed that in this case the drops in utilization are smaller than for the other buffers and the speed-ups are uniformly higher.

Table 2
Overall single-thread pipeline performance

Operand buffer            Buffer size (lines)   Buffer utilization (%)   Pipeline speed-up
Direct-mapped cache        8                     89.4                     1.13
                          16                     68.6                     1.63
                          24                     52.8                     1.68
                          32                     35.9                     1.75
                          40                     35.9                     1.76
Set-associative cache      8                     96.6                     1.41
                          16                     72.2                     1.67
                          24                     54.9                     1.71
                          32                     43.7                     1.72
Fully associative cache    8                     99.7                     1.53
                          16                     72.6                     1.72
                          24                     54.8                     1.76
Register file              8                     99.9                     1.18
Context cache              8                     85.1                     1.94
                          16                     80.5                     2.92
                          24                     80.5                     3.49
There are two factors that could explain this: the renaming and the cache locking. However, a detailed examination of the functioning of the cache showed that cache unlocking did not occur as frequently as we had expected – usually the mapping bit was still set when both reservation bits were reset – thus producing greater latencies than had been anticipated. This is partly a consequence of naive compilation: for example, although it happened rarely, it was sometimes the case that a cache line was not renamed and was, therefore, never unlocked until the corresponding thread terminated. We therefore attribute most of the higher performance to the renaming alone and intend to change the locking/unlocking protocol in future designs.

A complete and thorough comparison of the operand buffers should take into account the relative complexity and cost of the buffers (i.e. be based on cost:performance ratios), but this is not possible without detailed implementation studies. We can, however, point to a detailed VLSI-implementation study of a similar structure that has been carried out by Nuth and Dally [8,9]. This study shows that a structure such as the context cache will cost little more (relative to total chip area) than a register file of similar size but will have much better performance. Given its design, the cost of the stall buffer will be approximately equal to that of the cache, and therefore the total cost of the proposed system will be about two to three times that of a register file or conventional cache of similar size. This is quite reasonable, given that for the best-case performances the context cache is at least twice as good as the other operand buffers. Moreover, when we look at the multithreading case, a definitely positive conclusion, in favour of the context cache, is reached.

Having confirmed the superiority of the context cache, we next studied various aspects of its performance. The most important of these were the variations in pipeline speed-up and stall-buffer utilization as the size of the stall buffer was increased.
Some of these results are summarised in Table 3. Again the results are as we expected, in so far as utilization drops with increasing buffer size but performance increases. The increase in performance is, however, not linear: at about the point at which the stall buffer has the same number of lines as the cache, the rate of increase drops to almost nil and does not change thereafter. This shows that the size of the stall buffer is essentially something that has to be traded off against cost and performance: according to the figures, the best trade-off occurs when the stall buffer is approximately half the size of the cache. The figures for the stall-buffer performance also give some idea of the amount of concurrency that is being extracted from a single thread: analysis of the figures shows that the degree is, on average, slightly above two.

Table 3
Stall-buffer performance in a single-thread pipeline

Context-cache size (lines)   Stall-buffer size (lines)   Stall-buffer utilization (%)   Pipeline speed-up
 8                            2                           90.0                           1.35
                              4                           86.7                           2.08
                              6                           80.3                           2.52
16                            2                           88.4                           1.59
                              4                           86.0                           2.53
                              6                           85.5                           3.19
                              8                           80.9                           3.48
                             10                           76.3                           3.71
                             12                           67.0                           3.74
24                            2                           88.3                           1.63
                              4                           86.6                           2.62
                              6                           84.2                           3.45
                              8                           84.7                           3.87
                             10                           80.8                           4.33
                             12                           75.2                           4.50
                             14                           73.3                           4.72
                             16                           67.0                           4.82
5.2. Multithreading performance

A summary of some of the multithreading results, comparing the different types of buffer, is given in Table 4. For the figures shown, the simulated context cache and fully associative cache were 64 lines each, and the direct-mapped cache was twice as large. For the set-associative cache, the total cache size was 96 lines, and the results are averaged over a number of set sizes (ranging from 2 to 48 lines). In the case of the register file, the size was proportional to the number of threads, but the registers were not segmented on a per-thread basis.
Table 4
Overall multithreaded-pipeline performance

Operand buffer             No. of threads   Buffer utilization (%)   Pipeline speed-up
Direct-mapped cache         2                33.8                     1.40
                            4                50.4                     1.04
                            6                59.0                     1.00
                            8                75.8                     0.88
Fully associative cache     2                77.1                     1.81
                            4                99.7                     1.65
                            6                99.6                     1.17
                            8                99.5                     0.97
Set-associative cache       2                47.5                     1.91
                            4                82.0                     2.04
                            6                93.5                     1.67
                            8                98.9                     1.34
Register file               2                67.2                     1.15
                            4                84.6                     1.14
                            6                82.7                     0.89
                            8                82.9                     0.84
Context cache               2                96.9                     2.88
                            4                99.5                     3.84
                            6                99.9                     3.06
                            8                99.9                     2.55
As we would expect, for a given number of threads, buffer utilization increases with the degree of cache associativity; when the number of threads is increased, buffer utilization also increases; and the utilization of the context cache is uniformly higher than that of the other buffers. The speed-ups, on the other hand, show a different trend: excluding the context cache and the set-associative cache, the speed-up decreases as the number of threads is increased. This may be explained by the fact that even though buffer utilization goes up, increasing the number of threads increases the number of inter-thread collisions in the cache, so that individual threads experience greater latencies – a confirmation of analytical and experimental studies by others [10,11,15]. In the case of the set-associative cache, the initial increase in speed-up is due to the fact that with only two threads changing the set sizes did not have much effect on the speed-up, and this lowers the averaged value. For the context cache, there is initially a big increase in speed-up, before the inevitable decrease, which shows that the context cache is better suited to the exploitation of multiple threads than the other buffers are. The figures for the register file are given with a caveat: perhaps a fairer comparison would have been with a file segmented according to thread.

The figures in Table 4 also show that, for a single processor and a given type of buffer, a few (2–4) threads are sufficient to give the best pipeline performance. This result too is in line with analytical and experimental studies that have been carried out by others [10,11]. It is worth noting that this agreement, as well as that indicated above, with the results of other researchers holds even though we have simulated very small caches whereas the other studies use much larger caches. Other related performance studies will be found in [12].

As with the single-thread case, once the superiority of the context cache over other buffers had
been established, we concentrated on further experiments with just the context cache. The results of some of these experiments are shown in Tables 5–7, and Fig. 5 gives a graphical representation of the general trends. (The context-cache utilization was almost perfect in each case.)

Table 5
Performance of the context cache relative to stall-buffer size

No. of threads   Stall-buffer lines   Pipeline speed-up
2                 4                    2.53
                  8                    3.63
                 12                    3.90
                 16                    4.00
3                 4                    2.85
                  8                    4.51
                 12                    5.13
                 16                    5.24
4                 4                    3.88
                  8                    5.44
                 12                    5.82
                 16                    6.01
5                 4                    2.91
                  8                    4.53
                 12                    5.10
                 16                    5.17
6                 4                    2.78
                  8                    4.06
                 12                    4.26
                 16                    4.30
7                 4                    2.73
                  8                    4.06
                 12                    4.28
                 16                    4.29
8                 4                    2.26
                  8                    3.00
                 12                    3.00
                 16                    3.01

Table 6
Performance of the context cache with the optimal stall-buffer size

No. of threads   Buffer utilization (%)   Pipeline speed-up
2                 77.0                     4.00
3                 64.2                     5.24
4                 62.7                     6.01
5                 59.7                     5.17
6                 46.9                     4.30
7                 44.4                     4.27
8                 39.6                     3.01

Table 7
Performance of the stall buffer with the optimal number of threads

Buffer size (lines)   Buffer utilization (%)   Pipeline speed-up
 2                     86.4                     1.87
 3                     86.0                     2.52
 4                     84.4                     3.20
 5                     84.1                     3.88
 6                     82.6                     4.39
 7                     81.6                     4.87
 8                     80.3                     5.20
 9                     78.7                     5.45
10                     77.1                     5.65
12                     74.5                     5.82
14                     69.0                     5.96
16                     62.7                     6.01
Table 5 shows how performance changes with stall-buffer size, up to the best speed-up in each case: there is an increase in speed-up as the buffer size increases, but the rate of increase quickly flattens out as the number of threads increases; this is to be expected from the greater cache interference between threads. Table 6 shows the performance obtained with a 64-line context cache and the best stall-buffer size. Again we see that as the number of threads increases, the buffer utilization drops, and the speed-up initially increases and then drops. Notice, however, that the best speed-up (which occurs with four threads) is very close to the ideal.
Fig. 5. Summary of context cache performance.
The stall-buffer size used to obtain the results in Table 6 was 16, i.e. one-fourth the size of the context cache. This is less than in the corresponding single-thread case (for which the best size was about one-half that of the context cache) and is due to two factors. The first is that even though the average number (64/4 = 16) of context-cache lines per thread is smaller than the largest size tested in the single-thread case, the context cache is used very efficiently in the multithreading case, according to the needs of each thread. (This is in line with one of the main claims made in Section 4.) The second is that whereas multithreading means that there is enough spatial parallelism to make good use of the cache, in the single-thread case the attempt – which is inherent in continuing to process instructions in one stream until an independent one is found – to extract spatial parallelism that may not be there simply leads to a large number of stalled instructions, rather than contributing to any appreciable
increase in performance. We may therefore conclude that the size of the stall buffer can be traded off against the size of the context cache and the number of threads, and, more importantly, that its cost can be substantially less than that of the cache. This is confirmed by the additional results in Table 6.

6. Related work

Of the work by other researchers, that most closely related to our own (i.e. on context caches) will be found in [13,8,9]. The main differences between the context caches discussed there and ours lie in the following features, which do not appear in the other work: the embedding of a register name-space within a memory name-space, hazard detection and resolution in the cache, forwarding in the cache, locking of data items, and the associated use of the stall buffers. Because of these features,
we expect that our context cache will give better performance than the other two.

Soundararajan and Agarwal [14] have also investigated dribbling registers, which are similar to our context cache in that a context switch involves the piecemeal loading or unloading, the ``dribbling'', of the context registers. It would be useful to have a thorough comparison between the dribbling registers and our context cache, but even in the absence of that, some differences are apparent. One difference is that the dribbling registers are true registers and are used in association with a cache, whereas the context cache, as discussed above, combines both the essence of registers and the essence of a cache; the dribbling in the former case requires the use of an extra port to the cache, and the designers estimate that this reduces performance by a factor of 3–5%. Other differences between the two are those listed above comparing our context cache with other context caches, and, for similar reasons, we expect our context cache to give better performance. Some indication of this is in the reported figure of 70% processor utilization for the dribbling registers, compared with 86% in our case (speed-up of 6.01, with a maximum of 7) when the cache parameters are chosen appropriately, although not much can be made of this, given the different simulation environments. [14] includes a comparison of dribbling registers and context caches, but this covers only the context caches of the designs described in [13,8,9] and is not applicable in the present case: for example, it is shown in [14] that with the other types of context cache the processor utilization does not change with the number of contexts, whereas we have seen above that it does with our design. A third difference between the dribbling registers and any context cache is that the effectiveness of the former depends on the number of load/store operations in each process, which is not the case with the latter.
7. Summary and conclusions

We have described a new type of cache for multithreaded pipelines. This cache is novel in the following respects: it makes good use of an architecture in which a register name-space is embedded in a memory name-space; it contains the first implementation of renaming and forwarding within a cache; it is the first use of a cache to uniformly detect and resolve data dependencies; and it introduces the concept of automatically locking data into, and unlocking it out of, a cache. We have carried out a number of performance studies of this context cache, and (in addition to confirming analyses and experiments that have been carried out by others on the use of caches with multithreading) what we have learned includes the following:
· A multithreaded pipeline is very sensitive to the type of operand buffer used and to the management (e.g. sharing) of that buffer.
· The context cache does indeed give better performance than other types of operand buffer.
· We gave a lot of emphasis to implementation and not enough to architecture and compilation. This was a mistake, as has been shown by others: for example, multithreading works best with code that is optimized to exploit locality [11].
· The locking system of the cache, although a good idea in principle, did not perform as well as we expected (especially with more than a few threads) and needs to be changed. Some of the problems associated with it can be lessened by good compilation, but the best solution seems to be to make it less conservative – e.g. to remove the requirement that the mapping bit be reset before unlocking.
· Implementing the simulator showed that the secondary units associated with the context cache are probably too complex in structure
and management and need to be simplified, especially if a realization is to be carried out.

Our current and planned work has three main components: the first is to develop a better base architecture than the one we started with; the second is to simplify the whole context-cache structure, retaining the good ideas and correcting the problematic ones, and then to carry out new simulations; and the third is finally to carry out a VLSI realization, in order to more accurately assess cost and performance.
References

[1] R. Alverson, The TERA computer system, in: Proc. International Symposium on Supercomputing, ACM, New York, 1990, pp. 1–6.
[2] J.B. Dennis, G.R. Gao, Multithreaded architectures: Principles, projects, and issues, in: Multithreaded Computer Architecture: A Summary of the State of the Art, Kluwer Academic Publishers, Boston, 1994, pp. 1–72.
[3] A. Bolychevsky, C.R. Jesshope, V.B. Muchnik, Dynamic scheduling in a RISC architecture, IEE Proceedings, Computers and Digital Techniques 193 (1996) 309–317.
[4] A.R. Omondi, Ideas for the design of multithreaded pipelines, in: Multithreaded Computer Architecture: A Summary of the State of the Art, Kluwer Academic Publishers, Boston, 1994, pp. 213–251.
[5] D. Grunwald, R. Neves, Whole-program optimization for time and space efficient threads, in: Proc. 7th International Conference on Architectural Support for Programming Languages and Operating Systems, ACM, New York, 1996, pp. 50–59.
[6] G.S. Tjaden, M.J. Flynn, Representation and detection of concurrency using ordering matrices, IEEE Transactions on Computers 22 (1973) 752–761.
[7] M. Horne, Operand-buffering in pipelined machines, M.Sc. Thesis, Department of Computer Science, Victoria University of Wellington, New Zealand, 1995.
[8] P.R. Nuth, W.J. Dally, Named state and efficient context switching, in: Multithreaded Computer Architecture: A Summary of the State of the Art, Kluwer Academic Publishers, Boston, 1994, pp. 201–212.
[9] P.R. Nuth, W.J. Dally, The named-state register file: Implementation and performance, in: Proc. 1st IEEE Symposium on High Performance Computer Architecture, IEEE Press, New York, 1995, pp. 4–13.
[10] A. Agarwal, Performance tradeoffs in multithreaded processors, IEEE Transactions on Parallel and Distributed Systems 3 (1992) 525–539.
[11] R. Thekkath, S.J. Eggers, The effectiveness of multiple hardware contexts, in: Proc. 6th International Conference on Architectural Support for Programming Languages and Operating Systems, ACM, New York, 1994, pp. 328–337.
[12] D.E. Culler, Multithreading: Fundamental limits, potential gains, and alternatives, in: Multithreaded Computer Architecture: A Summary of the State of the Art, Kluwer Academic Publishers, Boston, 1994, pp. 97–137.
[13] H.H.J. Hum, G.R. Gao, Concurrent execution of heterogeneous threads in the super-actor machine, in: Multithreaded Computer Architecture: A Summary of the State of the Art, Kluwer Academic Publishers, Boston, 1994, pp. 317–347.
[14] V. Soundararajan, A. Agarwal, Dribble registers: A mechanism for reducing context switch latency in large-scale multiprocessors, Laboratory for Computer Science, Massachusetts Institute of Technology, 1992.
[15] D.E. Culler, M. Gunter, J.C. Lee, Analysis of multithreaded microprocessors under multiprogramming, in: Multithreaded Computer Architecture: A Summary of the State of the Art, Kluwer Academic Publishers, Boston, 1994, pp. 351–372.
[16] R.A. Iannucci, G.R. Gao, R.H. Halstead, B. Smith (Eds.), Multithreaded Computer Architecture: A Summary of the State of the Art, Kluwer Academic Publishers, Boston, 1994.
Amos Omondi received a B.Sc. (Hons) in computer science from the University of Manchester (UK) and a Ph.D. in computer science from the University of North Carolina – Chapel Hill (USA). He currently teaches at the Flinders University of South Australia. His research interests are in computer architecture, computer arithmetic, and parallel processing.
Michael Horne received a Master of Computer Science in 1994 from Victoria University, Wellington, New Zealand. He has been working for EDS New Zealand for three years as part of the UNIX Support Group, where he provides second-level support for UNIX problems and technical consultancy for internal and external clients. He maintains an ongoing interest in advances in commercially available hardware and operating systems.