Journal of Systems Architecture 46 (2000) 1469–1486

www.elsevier.com/locate/sysarc

Early design stage exploration of fixed-length block structured architectures

Lieven Eeckhout *, Henk Neefs, Koen De Bosschere

Department of Electronics and Information Systems (ELIS), Ghent University, Sint-Pietersnieuwstraat 41, 9000 Ghent, Belgium

Received 1 March 2000; accepted 1 September 2000

Abstract

An important challenge concerning the design of future microprocessors is that current design methodologies are becoming impractical due to long simulation runs and due to the fact that chip layout considerations are not incorporated in early design stages. In this paper, we show that statistical modeling can be used to speed up the architectural simulations and is thus viable for early design stage explorations of new microarchitectures. In addition, we argue that processor layouts should be considered in early design stages in order to tackle the growing importance of interconnects in future technologies. In order to show the applicability of our methodology, which combines statistical modeling and processor layout considerations in an early design stage, we have applied our method to a novel architectural paradigm, namely a fixed-length block structured architecture. A fixed-length block structured architecture is an answer to the scalability problem of current architectures. Two important factors prevent contemporary out-of-order architectures from being scalable to higher levels of parallelism in future deep-submicron technologies: the increased complexity and the growing domination of interconnect delays. In this paper, we show by using statistical modeling and processor layout considerations that a fixed-length block structured architecture is a viable architectural paradigm for future microprocessors in future technologies, thanks to the introduction of decentralization and a reduced register file pressure. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Computer architecture; Block structured architecture; Early design stage modeling; Statistical simulation

1. Introduction

Current design methodologies for the evaluation of microprocessors are becoming impractical due to the time-consuming simulation runs. Most contemporary design methods are based on high-level architectural simulations which deliver IPC

* Corresponding author. E-mail addresses: [email protected] (L. Eeckhout), [email protected] (H. Neefs), [email protected] (K. De Bosschere).

(number of instructions executed per clock cycle) results and which are cycle-accurate. A cycle-accurate high-level simulator typically runs at a simulation speed of 50K instructions per second [2], which means that simulating one second of a target processor with an IPC of 2.5 running at 1 GHz will take approximately 14 hours. If we take into account that several processor configurations need to be simulated for several workloads in order to obtain the optimal processor, we can conclude that this methodology will be infeasible in the near future. In this paper, we show that statistical modeling [7] can be used to perform early design

1383-7621/00/$ - see front matter © 2000 Elsevier Science B.V. All rights reserved. PII: S1383-7621(00)00036-9


stage explorations of new microarchitectures. In statistical modeling, distributions are extracted from a benchmark trace and are used to generate a synthetic trace which yields quite accurate performance estimations. Thanks to the statistical nature of this technique, performance characteristics quickly converge to a steady-state solution when simulating this synthetic trace, which makes this technique suitable for fast design space explorations. Once an interesting area of processor configurations is determined through statistical modeling in an early design stage, more accurate simulation techniques can be used to perform detailed simulations.

Another important consideration is that the impact of chip layout and interconnects on performance becomes significant. Next generation chip technologies will offer us several hundreds of millions or even billions of transistors in the near future running at gigahertz frequencies. The most important technical challenge for these deep-submicron technologies will be to deal with the increasing impact of interconnects on performance, since wire delays do not scale with feature size [13]. So, designers will have to design microprocessors which lack long wires in critical paths. Due to the growing importance of interconnects in future technologies, computer engineers should incorporate these design issues in early design stages to avoid surprises in later phases of the design process [3]. In this paper, we use viable processor layouts to estimate how clock cycle period scales with architectural parameters. In addition, the influence of technology scaling on overall performance is studied.

In order to show the applicability of our methodology, which combines statistical modeling and processor layout considerations in an early design stage, we have applied our method to a new architectural paradigm, namely a fixed-length block structured architecture.
The concept of this architectural paradigm originated from the consideration that scaling contemporary (out-of-order) architectures to higher levels of parallelism tends to be infeasible due to the increasing complexity and the ever growing impact of interconnects on performance. Enlarging the number of instructions being processed in parallel requires

bigger instruction windows, more functional units and larger register files. Palacharla et al. [35] have shown that window logic and bypass logic tend to be a possible bottleneck in future out-of-order architectures. Besides window logic and bypassing, the register file is another critical structure. Increasing the issue width increases the number of register file ports required. Due to the impact of the number of ports on the register file access time [11], DEC was forced to partition the register file in the Alpha 21264 [19].

So, computer architects will need to answer the following question: how can these billions of transistors be applied in future technologies to attain maximum performance while limiting the complexity (and its impact on clock period) and avoiding long interconnects in critical paths? It is clear that this will require new architectural paradigms. The architectural paradigm which is evaluated in this paper heavily relies on decentralization or partitioning [37]. The basic concept of a decentralized microarchitecture is quite simple: instead of implementing one big and thus slow processor core, decentralized microarchitectures are made up of several small-scale and thus very fast processor engines communicating through relatively slow interconnects. The idea is to keep as much inter-instruction communication as possible local in one processor engine so that the slow communication paths between the engines do not harm performance too much. As a result, a net overall performance gain will be attained due to the higher clock frequencies.

There are two main contributions in this paper. First, we show that the design process can be sped up by applying statistical modeling in an early design stage. We also argue that processor layout should be considered in early stages of the design process due to the increasing importance of interconnects on performance.
Second, we show that a fixed-length block structured architecture is an example of a viable architectural paradigm for future deep-submicron technologies thanks to the reduced complexity, the avoidance of long interconnects and the reduced register file pressure.

This paper is organized as follows. Section 2 presents the block structured architecture. In Section 3, the methodology is elaborated: statistical


simulation is detailed and various processor layouts are investigated. In Section 4, a broad design space of block structured architectures is explored. Section 5 discusses related architectures in detail. Finally, this paper is concluded in Section 6.

2. Block structured architecture

A block structured architecture (BSA) is an architectural paradigm which was first presented by Melvin and Patt in [30]. A particular form of BSAs, namely fixed-length BSAs,1 has been further refined by Neefs [31]. In a BSA, instructions are statically grouped into fixed-length blocks. The number of instructions included in a block is bounded by a maximum, e.g. 16 instructions, and the instructions can be taken from various basic blocks. Since no control flow is allowed within a block, predication [28,36] is needed to transform intra-block control flow into data flow. Once the instructions to be included in a block are determined, register renaming is performed by the compiler to obtain a static single assignment form, which maximizes the attainable parallelism within a block. Besides an instructions section, a BSA block also contains an in section, an out section and a branches section. The in section itemizes the inter-block registers which are used by the instructions in the block. The out section specifies the inter-block registers which are defined and live on exit; i.e., the out section maps intra-block registers to inter-block registers. The in and out sections take care of the inter-block data communication, while the branches section specifies the inter-block control flow. To illustrate the principles of a block, an example is shown in Fig. 1. The registers denoted as r and i are inter-block and intra-block registers, respectively. Inter-block registers are architecturally visible, whereas intra-block registers are only visible within a block. In the instructions section, when i1 equals zero, predicate register p1 is set;

1 For the remainder of this paper, we only consider fixed-length BSAs, which will be denoted as BSA.

Fig. 1. Example: source code (on the left) and BSA code (on the right).

otherwise it is cleared. If p1 (also called the guard) is true, one is added to r1 and r2; otherwise 2 is added to r1. In the out section, when p1 is true, i3 is mapped to r1, and i4 to r2; otherwise, i5 is mapped to r1. In all cases, i1 is mapped to r3.

2.1. Microarchitecture

When executing only one block at a time, performance will be low. Therefore, we have chosen a microarchitecture that allows for parallel execution of multiple blocks, leading us to a particular implementation of the control-dependence based decentralized paradigm [37]. This means that several units of work of a sequential program are (speculatively) executed in parallel on separate processing elements. The atomic unit of work in a BSA is called a block, and a processing element is called a block engine, see Fig. 2. In a BSA, a head and a tail pointer indicate the block engines that execute the earliest and the latest assigned blocks, respectively. At any time, there is only one block engine executing a block non-speculatively, which is indicated by the head pointer. The branch predictor predicts the follower block of the one currently being executed in the block engine indicated by the tail pointer. The predicted block is then fetched and assigned by a sequencer to the follower block engine indicated by the tail pointer. This head-and-tail mechanism is also used in multiscalar architectures [42] and trace processors [40]. As suggested by the use of intra-block and inter-block registers in a BSA, a distinction is made between intra-block and inter-block communication. Inter-block communication is concerned with


Fig. 2. Microarchitecture of a BSA; the darkest rectangle represents an instruction selection window; the floorplan of a block engine used in Section 3.2 is given on the right.

the propagation of data values between different block engines. Between adjacent block engines, data values are propagated through associative logic; between non-adjacent block engines, values are propagated through the register file, see Fig. 2. Intra-block communication, on the other hand, is related to the communication flow within a single block engine, and follows the data flow execution policy, which is made possible thanks to the static single assignment form. An instruction that resides in the instruction window is selected to be executed on a functional unit when all its operands are available (data flow). The scope of selection in the instruction window is restricted to the instruction selection window. Bypassing is implemented to guarantee back-to-back execution of data-dependent instructions. After the execution, the computed register values are sent to all the instructions in the instruction window; an instruction that has that register as a source operand then needs to store that value locally. Indeed, within a block engine no register file is provided. Another important feature concerning the intra-block communication is the speculative execution of predicated instructions. This means that predicated instructions can be executed before the value of their guard is known,2 eliminating the

2 That is, instructions are only selected for execution when the guard is true or is still unknown.

true data dependency introduced by predication [28,36]. Correct program semantics are guaranteed by mapping the correct data values in the out section, see Fig. 1.

2.2. Advantages

A block structured architecture has several advantages over traditional superscalar processors:

Decentralization. First of all, a BSA is an answer to the scalability problem of traditional out-of-order architectures. In future chip technologies wiring delay will become a major obstacle to boosting clock frequencies in traditional superscalar architectures due to large window sizes and wide issue widths [34]. The central idea is to have small (and thus very fast) block engines interconnected by relatively long (and thus slow) interconnections. Notice that a second level of decentralization is provided in the instruction window topology: an instruction window consists of multiple selection windows, see Fig. 2, in order to reduce the complexity of the selection logic [27,34].

Easier fetching. Since the length of a BSA block is fixed, fetching instructions will be easier. We do not need to predict multiple branches in a single clock cycle. Moreover, we do not need to fetch from different parts of the instruction cache within a single cycle. Easier fetching was the major motivation for Melvin and Patt [30] to come up with a


block structured ISA, see also [21]. A dynamic scheme to overcome the fetching complexity problem is the trace cache [39].

Predication. Predication was proven to be an interesting technique to eliminate unbiased branches and to expose multiple execution paths to the processor [28,36]. Indeed, predicated instructions from multiple paths are speculatively executed regardless of the guard's value; the correct register values are then committed in the out section if the corresponding guards are true.

Fewer register file ports. Franklin and Sohi [14] showed that the lifetime of register instances is restricted; i.e., the temporal distance, measured in the number of instructions in the trace, between the use of a register instance and its creation is restricted. Thus, grouping nearby instructions into blocks will keep most inter-operation communication within block boundaries. Since in a BSA only inter-block communication passes to the register file, the register file will need fewer access ports, resulting in a smaller register file access time [11]. Similar ideas were proposed in [43,47].

2.3. Disadvantages

A block structured architecture also has some disadvantages:

Lower IPC for a given virtual window size. Since instructions are committed per block (instead of individually, as is the case in an out-of-order architecture), the virtual window will be less efficiently utilized, resulting in a lower IPC for a given virtual window size. The smaller the maximum number of instructions in a block, the smaller the IPC degradation will be.

Slower inter-block communication. The inter-block communication in a multi-block BSA will be slower (measured in the number of cycles) than the intra-window communication in traditional architectures. This will decrease IPC, especially for small blocks due to a larger amount of inter-block communication.

Higher memory bandwidths and larger instruction cache required.
Predication, register renaming by the compiler, and the inclusion of various sections in a block require more encoding bits per instruction. As a consequence, instruction caches of larger sizes are required, as well as higher memory bandwidths. And as is shown in [26], larger caches lead to higher access times. One possible solution to overcome this disadvantage is to organize the instruction cache in several cache banks. Each cache bank contains a part of a BSA block. Fetching a block then involves accessing the various cache banks (with the same address) to fetch the various parts of the block from the appropriate cache banks and coalescing them to form the wanted BSA block.

Compilation is hard. To obtain good performance results, blocks should be filled with useful instructions as much as possible. This is certainly a challenge for the compiler. A preliminary study of the compiler requirements is described in [32].

3. Methodology

Since both IPC (number of instructions executed per clock cycle) and clock cycle period determine performance, both are used in our methodology to validate the feasibility of the block structured paradigm. IPC is estimated through statistical modeling; the influence of microarchitectural parameters on clock cycle is estimated by examining viable processor layouts.

3.1. Statistical modeling

To determine IPC in a reliable way, detailed simulations are required on a cycle-accurate functional simulator executing optimal BSA code. There are two problems with this technique: first, a highly optimizing compiler as well as a detailed simulator need to be developed, which are time-consuming tasks. Second, once we have set up this experimental environment, a huge number of simulations need to be done, which is time-consuming as well, since several hundreds of millions or even billions of instructions need to be simulated for various processor configurations for various workloads.
Therefore, we decided to perform an early design stage evaluation using a novel technique, namely statistical simulation [7], which allows fast simulation (a steady-state solution is reached after simulating only a few million instructions) and quite accurate IPC estimates.
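The appeal of the Monte Carlo approach can be illustrated with a toy experiment. The sketch below is our own illustration, not the paper's model: the instruction mix and latencies are invented placeholders, and only the convergence behaviour is the point.

```python
import random

# Illustrative only: an invented instruction mix (fraction, latency in cycles),
# not the paper's measured SPECint95 profiles.
MIX = [("int", 0.60, 1), ("load", 0.25, 3), ("mul", 0.10, 8), ("fp", 0.05, 4)]

def sample_latency(rng):
    """Draw one instruction's latency from the (invented) mix."""
    r, acc = rng.random(), 0.0
    for _, frac, lat in MIX:
        acc += frac
        if r < acc:
            return lat
    return MIX[-1][2]

def mean_latency(n, seed=42):
    """Mean latency estimated from n Monte Carlo samples."""
    rng = random.Random(seed)
    return sum(sample_latency(rng) for _ in range(n)) / n

# The estimate settles quickly: a few hundred thousand synthetic samples
# already agree closely with each other, whereas a real trace needs hundreds
# of millions of instructions before its behaviour is representative.
print(mean_latency(100_000), mean_latency(500_000))  # both close to 2.35
```

The same reasoning carries over to the synthetic traces above: characteristics drawn from a statistical profile converge to a steady state after a few million instructions.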


Basically, statistical simulation [8] consists of three phases, see Fig. 3. First, programs are analyzed to extract a statistical profile. In the second phase, a synthetic instruction trace is produced à la Monte Carlo using that statistical profile. Subsequently, this instruction trace is fed into a trace-driven simulator modeling the architecture, which yields performance characteristics, such as IPC (number of useful instructions executed per clock cycle). During statistical profiling (phase 1), several program execution statistics are extracted from a benchmark trace: the instruction mix, the distribution of the dependency distance between instructions (i.e., the distance counted in the number of instructions in the trace between the producer and the consumer of a register instance), the probability that a memory operation is dependent

Fig. 3. Statistical modeling and simulation. The various ISA-specific and microarchitectural parameters will be varied in the experiments of Section 4.

on the xth memory operation before it in the trace through a memory data value, the distribution of the basic block size, and the probability that a branch instruction is biased to the same branch outcome. The traces used to collect statistical profiles, see Table 1, were generated from the SPECint95 benchmarks on a DEC 500au station with an Alpha 21164 processor. The Alpha architecture is a load/store architecture with 32 integer and 32 floating-point registers, each of which is 64 bits wide. The SPECint95 benchmarks have been compiled with the DEC cc compiler version 5.6 with the optimization flag set to -O4. The traces were carefully selected not to include initialization code. A synthetic BSA-trace is generated in phase 2. This is done in two steps: synthetic BSA-blocks are built up first; then, a BSA-trace is created by pointing out the actually executed control flow path in each (synthetic) BSA-block. The latter step will indicate the useful instructions included in the BSA-block. In the first step, the synthetic BSA-block construction, basic blocks are added to the BSA-block until the maximum BSA-block size is reached; basic blocks are added along the most likely control flow path. The synthetic BSA-block formation is illustrated in Fig. 4 through an example. First basic block a (path probability p = 1.0) is included in the BSA-block, then b (p = 0.65), then e (p = 0.40), then c (p = 0.35), and finally d (p = 0.25). Notice that an actual compiler would add predicates to the instructions of the basic blocks to guarantee correct program semantics, i.e., p1 to b, ¬p1 to c, p1&p2 to d, and p1&¬p2 to e. The path probabilities used by the

Table 1
The SPECint95 benchmarks used, the input files, the number of initial instructions skipped, the number of instructions incorporated in the trace and whether or not the program was completed after Nskipped + Ninstrs instructions

Benchmark   Input             Nskipped   Ninstrs   Completed
li          train.lsp         0M         226M      yes
go          50 9 2stone9.in   250M       200M      no
compress                      0M         217M      yes
gcc                           0M         182M      yes
m88ksim                       350M       200M      no
ijpeg                         100M       170M      yes
perl                          600M       200M      no
vortex                        800M       200M      no


Fig. 4. Building up a synthetic BSA-block as a collection of basic blocks. The probabilities determine the probability that the arrow is part of the actually executed path. The numbers in the circles show the order in which the corresponding basic blocks are included. The basic blocks shown in dark gray are marked to be part of the correct control flow path.

compiler to form BSA-blocks can be derived from profile information [4,12] or from static program analysis [20]. Note that the synthetic trace generation algorithm presented here is only applicable for architectures for which the compiler uses fixed-length structures to schedule instructions, as is the case for a fixed-length block structured architecture. Compilers and schedulers for VLIW architectures, on the other hand, typically use variable-length structures, such as superblocks [23] or hyperblocks [29], in order to enlarge the scope for scheduling operations into (fixed-length) instruction words. This does not affect the generality of the synthetic trace generation technique presented here, since the algorithm can be modified to generate traces using variable-length scheduling structures as well. However, this is out of the scope of this paper. The last phase in the statistical modeling methodology is trace-driven simulation: synthetic BSA-traces are simulated on a trace-driven simulator which yields performance results, namely IPC. The IPC results reported in Section 4 are geometric average IPC results over the various benchmarks. The following instruction latencies were assumed: integer one cycle, load three cycles (this includes address calculation and data cache access), multiply eight cycles, FP operation four cycles, single and double precision FP divide 18 and 31 cycles, respectively. All operations are fully pipelined, except for the divide. In addition, we assumed that clearing a block from a block engine and starting the execution of a new block takes three clock cycles: (i) dispatching the new block to the block engine, (ii) reading the register values from the register file and (iii) selecting instructions to be executed.

3.2. Estimating clock cycle

To estimate clock cycle period, we have assumed that the critical pipeline stage is the execution stage. This assumption is acceptable if the instruction window size and issue width are kept relatively small [34], e.g., for processors with windows containing less than 32 instructions and issue widths smaller than 4. We came to the same conclusion after automatic synthesis in [9]. The execution stage, shown in Fig. 5, contains the actual instruction execution as well as the bypassing. So, the processor's cycle time Tcycle can be calculated using the following formula:

Tcycle = Texecute + Tbypass.    (1)

Bypassing is responsible for making results produced by functional units available to data-dependent instructions in the next few clock cycles.
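This mechanism can be pictured with a minimal sketch of our own (not the paper's actual bypass logic): results of the most recent cycles live on a small set of result buses, and a consumer checks those buses before falling back to the register file.

```python
# Minimal, illustrative model of operand bypassing: values produced in the
# last few cycles are still on result buses, so a data-dependent instruction
# can read them before any register write-back completes.

class BypassNetwork:
    def __init__(self, n_buses=2):
        self.buses = []          # (reg, value) pairs from the most recent cycles
        self.n_buses = n_buses

    def write_result(self, reg, value):
        """A functional unit broadcasts a freshly computed result."""
        self.buses.insert(0, (reg, value))
        self.buses = self.buses[:self.n_buses]  # only the last n cycles live here

    def read_operand(self, reg, regfile):
        """Consumers prefer the newest bypassed value over the register file."""
        for r, v in self.buses:
            if r == reg:
                return v
        return regfile[reg]      # value already written back

net = BypassNetwork()
regfile = {"r1": 0}
net.write_result("r1", 7)                     # produced this cycle
assert net.read_operand("r1", regfile) == 7   # consumer sees it immediately
```

The two-bus default mirrors the observation in the text that results of the last two pipeline stages have not yet been stored.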

Fig. 5. The execution stage: bypassing and instruction execution.


The number of result buses needed is determined by the number of pipeline stages containing not yet stored register values; in our case, two result buses are required [9]. The delay due to bypassing is determined by the multiplexor's width and the wiring length. The latter is strongly dependent on the chosen layout. So, the delay due to bypassing Tbypass can be computed as follows:

Tbypass = Twiring + Tmux.    (2)

Since in future technologies wiring delays will be far more important than gate delays [13], we approximate Tbypass by taking into account only the wiring delay:

Tbypass ≈ Twiring.    (3)

Processor layout. A viable layout should be chosen to be able to estimate Tbypass. Two possible layouts are shown in Fig. 6: a 1-D (on the left) and a 2-D layout (on the right). The functional units are placed in a one-dimensional array in the 1-D layout; in the 2-D layout, the functional units are placed in a two-dimensional array. Consequently, the wiring length Lbypass for bypassing data values equals

Lbypass = i · Hfu    (4)

for the 1-D layout, with Hfu the height of a functional unit and i the issue width or the number of functional units. For the 2-D layout, Lbypass equals

Lbypass = 2 · sqrt(i) · Hfu.    (5)

Thus the 2-D layout is better in terms of minimal bypass delay if 2 · sqrt(i) < i, i.e., if i > 4. The width of a functional unit was chosen Hfu = 3200λ, with λ half the feature size of the IC technology being used [34].

Fig. 6. Possible layouts: a 1-D (on the left) and a 2-D (on the right) layout.

Wiring delay. Palacharla et al. [34] assumed a quadratic relationship between wiring length and wiring delay:

Twiring = (1/2) · Rm · Cm · Lwiring²,    (6)

with Rm and Cm the resistance and the capacitance per unit of length, respectively. This assumption is valid for unbuffered wires. When a buffer is inserted every nth fraction of the wire, the wiring delay will be

Twiring = n · (τ + (1/2) · Rm · Cm · (Lwiring/n)²),    (7)

with τ the delay of a buffer, i.e., an inverter. This wiring delay can now be minimized by choosing an appropriate number of buffers n to be inserted. In other words, the optimal delay can be obtained by minimizing Eq. (7) as a function of n. This is easily done and the optimal delay equals

Twiring = sqrt(2 · τ · Rm · Cm) · Lwiring,    (8)

which expresses a linear relationship between wiring length and wiring delay. It is interesting to note that this optimal wiring delay is also obtained by applying the following rule: insert a buffer if the wiring delay of a line segment (from the last buffer being inserted) equals the delay of a buffer. By comparing Eqs. (6) and (8) it can be verified that inserting buffers is only advisable if the following inequality is fulfilled:

Lwiring > sqrt(8 · τ / (Rm · Cm)).    (9)

1-D layout. For the 1-D layout, Tbypass thus equals

Tbypass = (1/2) · Rm · Cm · i² · Hfu²    (10)

for unbuffered wires; for buffered wires, Tbypass equals

Tbypass = sqrt(2 · τ · Rm · Cm) · i · Hfu.    (11)
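Eqs. (6)-(9) can be checked numerically. The sketch below is our own sanity check, using the 0.25 µm parameters of Table 2 and an arbitrarily chosen wire length:

```python
import math

# Wire parameters for the 0.25 um technology of Table 2:
# Rm in Ohm/um, Cm in F/um, buffer delay tau in s; lengths in um, delays in s.
RM, CM, TAU = 0.064, 0.88e-15, 0.04e-9

def t_unbuffered(length):
    """Eq. (6): quadratic delay of an unbuffered wire."""
    return 0.5 * RM * CM * length ** 2

def t_buffered(length, n):
    """Eq. (7): delay when a buffer is inserted every 1/n-th of the wire."""
    return n * (TAU + 0.5 * RM * CM * (length / n) ** 2)

def t_optimal(length):
    """Eq. (8): delay at the optimal buffer count (linear in length)."""
    return math.sqrt(2 * TAU * RM * CM) * length

L = 5000.0  # an arbitrary 5 mm wire for illustration
# Minimizing Eq. (7) over integer buffer counts approaches the closed
# form of Eq. (8), as the derivation in the text states.
best = min(t_buffered(L, n) for n in range(1, 200))
print(best, t_optimal(L))  # nearly identical

# Eq. (9): buffering only pays off beyond this critical length, at which
# the unbuffered and optimally buffered delays coincide (both equal 4*tau).
L_crit = math.sqrt(8 * TAU / (RM * CM))
print(L_crit)  # roughly 2.4e3 um for these parameters
```

The critical length of roughly 2.4 mm for the 0.25 µm parameters is consistent with the observation below that buffering only starts to help at rather large issue widths in that technology.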


2-D layout. The bypass delay for the 2-D layout with unbuffered wires is

Tbypass = 2 · Rm · Cm · i · Hfu²;    (12)

the bypass delay for the 2-D layout with buffered wires is

Tbypass = sqrt(8 · τ · Rm · Cm · i) · Hfu.    (13)

Another way of reducing the complexity of the bypassing logic is by using incomplete bypass schemes [1], which only implement the frequently used bypass paths and use interlocks in the remaining situations.

Technology scaling. To analyze the effect of technology scaling on the importance of wiring delays in future technologies, we base our analysis on assumptions made in [34]. Consider for example two technologies whose feature sizes scale with a scaling factor S; then gate delays will scale down with a factor S. Delays of unbuffered wires, on the other hand, do not scale, because the resistance and the capacitance increase with a factor S whereas the wiring length decreases with a factor S, see Eq. (6); the delay of buffered wires will scale with a factor sqrt(S), see Eq. (8). Although buffering long wires makes wires more scalable to future technologies to some extent, the influence of wiring delays will be more and more significant in future technologies.

Bypass delay. Now we will quantify how bypass delay is affected by issue width and technology scaling. We have considered two chip technologies, a 0.25 µm and a 0.10 µm technology. Each time, the best implementation (buffered vs. unbuffered wires and 1-D vs. 2-D layout) in terms of minimal bypass delay is used for every processor configuration. The technology-dependent parameters are

Table 2
Technology-dependent parameters

                0.25 µm   0.10 µm
Rm (Ω/µm)       0.064     0.16
Cm (fF/µm)      0.88      2.20
λ (µm)          0.125     0.05
τ (ns)          0.04      0.016
Texecute (ns)   1.0       0.5

Fig. 7. The increase in cycle time due to bypassing as a function of issue width and chip technology.

given in Table 2: Rm and Cm are based on numbers reported in [34], and τ is based on the delay of an inverter from a highly advanced 0.25 µm standard cell CMOS technology; we chose the delay to execute an instruction Texecute = 1.0 ns and Texecute = 0.5 ns for the 0.25 µm and 0.10 µm technology, respectively, which seems reasonable. The results are shown in Fig. 7. From this graph, it is clear that for future technologies wiring delay will have a more significant impact on performance; e.g., for an issue width i = 4, bypassing increases clock cycle time by 7.2% and 13.6% for the 0.25 µm and the 0.10 µm technology, respectively. As discussed before, a 2-D layout is preferable to a 1-D layout if more than four functional units are considered. By using Eqs. (4), (5) and (9) it can be verified that buffered wires are only useful for i > 8.88 and i > 3.77 for the 0.25 µm and the 0.10 µm technology, respectively.
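The 7.2% and 13.6% figures can be reproduced from Eqs. (4)-(13) and the parameters of Table 2. The sketch below is our own check, picking the minimum over the four layout/buffering combinations:

```python
import math

# Technology parameters from Table 2: (Rm [Ohm/um], Cm [F/um], lambda [um],
# tau [s], Texecute [s]) for the 0.25 um and 0.10 um technologies.
TECH = {
    "0.25um": (0.064, 0.88e-15, 0.125, 0.04e-9, 1.0e-9),
    "0.10um": (0.16, 2.20e-15, 0.05, 0.016e-9, 0.5e-9),
}

def bypass_increase(tech, issue_width):
    """Cycle-time increase due to bypassing, as a fraction of Texecute.

    Takes the best of 1-D vs. 2-D layout and buffered vs. unbuffered wires,
    following Eqs. (4)-(13)."""
    rm, cm, lam, tau, t_exec = TECH[tech]
    h_fu = 3200 * lam                             # functional-unit dimension
    lengths = (issue_width * h_fu,                # Eq. (4): 1-D layout
               2 * math.sqrt(issue_width) * h_fu) # Eq. (5): 2-D layout
    delays = []
    for length in lengths:
        delays.append(0.5 * rm * cm * length ** 2)            # unbuffered, Eq. (6)
        delays.append(math.sqrt(2 * tau * rm * cm) * length)  # buffered, Eq. (8)
    return min(delays) / t_exec

print(round(100 * bypass_increase("0.25um", 4), 1))  # 7.2 (%)
print(round(100 * bypass_increase("0.10um", 4), 1))  # 13.6 (%)
```

Note that at i = 4 the 0.25 µm wire stays below the critical length of Eq. (9), so the unbuffered delay wins there, while in the 0.10 µm technology buffering already pays off.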

4. Evaluation

In this section, the block structured architectural paradigm is evaluated using the methodology detailed in Section 3. Several important aspects are addressed: the fraction of useful instructions included in a block, performance as a function of technology scaling, the impact of branch prediction accuracy, the influence of inter-block communication latency, the reduced register file pressure and the memory operation issue strategy.


4.1. Block formation

A first interesting aspect is to examine how many useful instructions will be included in a BSA-block, i.e., what fraction of instructions in a BSA-block belong to the actually executed control flow path. In other words, we want to check whether the compiler is capable of filling BSA-blocks with useful instructions. It is clear that the fraction of useful instructions w.r.t. the total number of instructions included in a block will decrease as block size increases. This is verified in Fig. 8: for block sizes of 8, 16 and 32 instructions, the average fraction of useful instructions is 99%, 96% and 91%, respectively. These high percentages should not be surprising, since the average basic block size is 5–6 instructions and the average static branch prediction accuracy is approximately 85% [22].

4.2. Performance

In Figs. 9 and 10, performance is evaluated for various BSA processor configurations for the two chip technologies, a 0.25 µm and a 0.10 µm technology. Performance P (instructions executed per unit of time) is measured as follows:

P = IPC / Tcycle    (14)

with IPC measured through statistical simulation, see Section 3.1, and Tcycle the clock cycle time measured using the formulas of Section 3.2. Each cluster of results in these graphs represents a set of processor con®gurations with equal hardware resources; i.e., the virtual window size and the total issue width are kept constant. Within a cluster, we varied the number of block engines, the BSAblock size and the issue width per block engine. The processor con®gurations in Fig. 10 have double the total issue width as in Fig. 9. Several optimal con®gurations can be pointed out, namely 16 or 32 block engines executing blocks of 16 instructions with 1 or 2 functional units per block engine. There are several forces operating which result in processor con®gurations with optimal performance given a certain hardware budget: · IPC will increase for smaller block sizes due to the better utilization of the virtual window size; a block engine is only available for computation if an entire block is executed and this favors processor con®gurations with smaller block sizes. In other words, BSAs with smaller blocks do have a ®ner granularity. · Clock frequency also increases when distributing the available functional units to several block engines since this reduces the hardware complexity. · On the other hand, performance will degrade with smaller blocks because smaller blocks

Fig. 8. The fraction of instructions included which belong to the actually executed control flow path.


Fig. 9. Performance, measured in billions of instructions per second, as a function of the number of block engines, BSA-block size and issue width for two chip technologies. The BSA processor configurations are denoted as follows: BSA (e:b:i) with e the number of block engines, b the block size and i the issue width per block engine. Perfect branch prediction was assumed at the inter-block level.

imply more inter-block communication, which is relatively slow compared to intra-block communication.
· In addition, the execution time of small blocks might be smaller than the time required to distribute blocks over the block engines; e.g., in a configuration with 64 block engines, it takes 64 cycles to distribute 64 blocks to 64 block engines; since the execution of small blocks will only take a few cycles, many block engines will be idle during long periods of time. This explains why performance drops for configurations with a block size of eight instructions.

Another important note is that performance increases much faster for smaller blocks (and thus more block engines) in the 0.10 µm technology than for smaller blocks in the 0.25 µm technology. This is due to the fact that the importance of bypass delay is higher in the 0.10 µm technology, see Section 3.2. Notice also that the performance gain obtained by doubling the total issue width (compare Fig. 9 with Fig. 10) is marginal (17% and 7% for a virtual window of 256 and 512 instructions, respectively) compared to the increased hardware resources needed. This suggests that doubling the total issue width once again would be useless due to the marginal IPC increase and the cycle time decrease. As a result, the optimal configurations only require 1 or 2 functional units per block engine, which means (see Section 3.2) that a 1-D layout with unbuffered wires will suffice to implement

Fig. 10. Same as Fig. 9, but with the total issue width doubled.


bypassing in an optimal way: minimum bypass delay with the simplest hardware design!

4.3. Branch prediction accuracy

In Fig. 11, we quantify how performance is affected by the branch prediction accuracy. This is done for three BSA configurations, each having a block size of 16 instructions and one functional unit per block engine. When a (multi-way) branch executing on block engine x is mispredicted, all block engines in the ring from block engine x up to the one indicated by the tail pointer have to be cleared, and all these block engines will have to get new blocks assigned. In our simulations we have assumed a five-cycle branch misprediction penalty: (i) calculating the new block address, (ii) fetching the correct block from the I-cache, (iii) conducting the fetched block to the appropriate block engine, (iv) reading data values from the register file and (v) selecting the instructions to be executed. It is clear from the graph in Fig. 11 that the impact of branch prediction accuracy is significant, especially for architectures with high degrees of machine parallelism. So, from this graph we can conclude that highly accurate branch predictors for multi-way branches [24,25] should be a design issue for future microarchitectures. Similar results were obtained in [33] for traditional superscalar architectures.
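The interaction between Eq. (14) and the five-cycle misprediction penalty can be sketched with a simple first-order model. The numbers below (base IPC, one multi-way branch per 16-instruction block, a 0.5 ns cycle) are illustrative assumptions, not the paper's simulation results:

```python
# Hedged sketch combining Eq. (14), P = IPC / T_cycle, with a
# first-order model of the five-cycle misprediction penalty.
# All numeric values are illustrative assumptions.

def performance_gips(ipc, t_cycle_ns):
    """Eq. (14): instructions per second, in billions (1/ns == 10^9/s)."""
    return ipc / t_cycle_ns

def degraded_ipc(base_ipc, branches_per_instr, accuracy, penalty_cycles=5):
    """Scale IPC down by the stall cycles lost to mispredictions.

    New CPI = 1/IPC + (mispredictions per instruction) * penalty.
    """
    stalls_per_instr = branches_per_instr * (1.0 - accuracy) * penalty_cycles
    return base_ipc / (1.0 + base_ipc * stalls_per_instr)

# Assumed: one multi-way branch per 16-instruction BSA-block.
for acc in (0.90, 0.95, 0.99, 1.00):
    ipc = degraded_ipc(base_ipc=8.0, branches_per_instr=1 / 16, accuracy=acc)
    print(f"accuracy={acc:.2f}: {performance_gips(ipc, 0.5):.2f} GIPS")
```

Even this crude model reproduces the qualitative conclusion of Fig. 11: the higher the machine parallelism (base IPC), the larger the fraction of cycles wasted per misprediction, so accurate multi-way branch prediction becomes increasingly critical.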

Fig. 11. Performance as a function of the branch prediction accuracy and the number of block engines for BSA configurations with a block size of 16 instructions and one functional unit per block engine.

4.4. Inter-block communication

As stated in Section 2, in a BSA a distinction is made between intra- and inter-block communication. To quantify the impact of this design option on performance, we have done some experiments varying the latency of the inter-block communication. We have compared a situation where the inter-block communication between engines takes one or two clock cycles, to the ideal case where inter-block communication is as fast as intra-block communication. In these experiments, performance was degraded by 2.8% and 5.9%, respectively, for a BSA configuration with 16 block engines executing blocks containing 16 instructions and having one functional unit per block engine; the branch prediction accuracy was set to 97%. These results show that the influence of the slower inter-block communication on overall performance is small. Throughout all the other experiments we assumed the following inter-block communication latencies: one clock cycle between adjacent block engines and two cycles between non-adjacent block engines. This corresponds to the organization of Fig. 2, where communication between adjacent block engines is conducted through associative logic and communication between non-adjacent block engines is done via the register file.

4.5. Register file pressure

As stated in the introduction, the register file is likely to become a bottleneck in future superscalar architectures. In a BSA, the register file pressure is reduced due to the fact that much communication is kept local within a block engine. This is reflected in Fig. 12: the reduction of the number of register file transfers is shown as a function of the number of useful instructions in a BSA-block. Given the fact that a block of 16 instructions contains 15 useful instructions on average (see Fig. 8), this means that the number of register file reads and writes is reduced by 46% and 33%, respectively. The results of Fig. 12 are calculated using the following formula:

(1/n) · Σ_{k=0}^{n−1} Σ_{i=0}^{k} p(i)    (15)


Fig. 12. The reduction in the number of register file transfers (reads and writes) as a function of the number of useful instructions in a block.

with n the number of instructions considered and p(i) the probability density function of the age-of-register-operands distribution (the distribution of the number of instructions between the production and the consumption of a register instance) and of the register-lifetime distribution (the distribution of the number of instructions between two writes to the same architectural register) [14], respectively.

4.6. Memory operations

Another important design issue is the memory operation issue strategy. Here, we distinguish three design strategies:
(a) Loads and stores are issued in-order. A store can only be issued when all previous loads and stores have been issued; a load can be issued when all previous stores have been issued.
(b) Stores are issued out-of-order; loads in-order. This scheme requires a buffer that holds speculative store values; these speculative values are then committed in program order.
(c) Loads and stores are issued out-of-order, but when a memory dependency violation is detected, the violating instruction needs to be re-executed, as well as all its dependent instructions (dynamic memory disambiguation with re-execution).

Table 3
Relative performance w.r.t. out-of-order execution of memory operations (without re-execution)

                 (a)    (b)    (c)
BSA (16:16:1)    42%    45%    97%
BSA (32:16:1)    34%    37%    97%

From Table 3 it is clear that the memory operation issue strategy is an important design issue and that

a dynamic memory disambiguation scheme is required to obtain nearly optimal performance results. Similar results were obtained in [33] for traditional superscalar processors. Possible implementations which support dynamic memory disambiguation in the case of a control-dependence based decentralized architecture are the Address Resolution Buffer (ARB) [15] and the Speculative Versioning Cache (SVC) [17].
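The register-transfer reduction formula of Section 4.5, Eq. (15), can be explored numerically. In this hedged sketch a geometric distribution stands in for the measured age-of-register-operands data, so the percentages printed are illustrative only:

```python
# Hedged sketch of Eq. (15): the fraction of register communications
# that stay local when a block holds n useful instructions. The
# geometric p(i) below is a stand-in assumption, not the paper's
# measured age-of-register-operands distribution.

def local_fraction(n, p):
    """(1/n) * sum_{k=0}^{n-1} sum_{i=0}^{k} p(i)  -- Eq. (15)."""
    return sum(sum(p(i) for i in range(k + 1)) for k in range(n)) / n

def p_geometric(i, q=0.3):
    # Assumed probability that producer and consumer are i instructions apart.
    return q * (1.0 - q) ** i

for n in (8, 15, 32):
    print(f"n={n:2d}: {local_fraction(n, p_geometric):.2%} of transfers stay local")
```

As expected from Fig. 12, the locally satisfied fraction grows monotonically with the number of useful instructions per block, since more producer–consumer pairs then fall within a single block engine.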

5. Related architectures

Most contemporary superscalar architectures are out-of-order architectures, which in essence means that the hardware dynamically schedules instructions. Although these architectures are capable of executing instructions out-of-order, in-order retirement guarantees correct program semantics and precise interrupts. According to Palacharla et al. [35], there are two types of out-of-order organizations: one where the register file only contains non-speculative register values, which we will call the non-speculative register file model, and one where the register file contains speculative as well as non-speculative register values, which will be denoted as the speculative register file model. To clearly understand the design trade-offs between both implementations, we will briefly discuss both organizations.

In the non-speculative register file model (implemented in the HP PA-8000 [41], Intel's Pentium Pro [18], the HAL SPARC V [6] and AMD's Athlon [5]) register data values are read from the register file before an instruction is dispatched into the instruction window. Instructions residing in the instruction window wait for their operands to become available. When all its operands are available, an instruction is selected to be executed in the next cycle. The data values on which


the instruction operates are either read from the bypass paths or from the instruction window. Note that since the register file does not contain speculative register values, the speculative data values need to be stored in the instruction window.

In the speculative register file model (implemented in the DEC Alpha 21264 [19] and the MIPS R10000 [48]) the data values are either read from the bypass paths or from the register file. In this implementation, accessing the register file requires an extra pipeline stage between selecting and executing the instruction.

Both organizations have advantages and disadvantages in terms of cycle time.³ The main advantage of the non-speculative register file model is that the register file can be kept small, i.e., the number of physical registers is limited to the number of architectural registers. Farkas et al. [11] have shown that the access time of the register file is negatively affected by the number of registers and the number of ports. An important disadvantage of the speculative register file model is that the bypass delay tends to be higher, since the bypass wires need to pass over the register file, as was pointed out by Palacharla et al. [35] and implemented (per cluster) in the DEC Alpha 21264 [16]. The underlying reason for this is that in order to keep the cycle time short (and thus also the register file read stage), the register file is placed in the middle of the functional units to minimize the wires which distribute the data values from the register file to the functional units. The main disadvantage of the non-speculative register file model is that more area is required to implement the instruction window, because (speculative) data values have to be stored in the instruction window. The organization chosen in the BSA (per block engine) is the non-speculative register file model [9].

Palacharla et al. [35] have quantified how cycle time will be affected by scaling out-of-order architectures to more parallelism in future technologies. They identified the instruction window

³ Both organizations will also influence IPC. However, this is outside the scope of this paper and will therefore not be discussed here.

logic and the bypass logic as the most critical structures, due to the increased complexity and wiring delays. The instruction window logic consists of two steps, namely wake-up and instruction selection. The wake-up logic broadcasts to all the instructions in the window which registers will become available in the next clock cycle; the selection logic selects an instruction to be executed in the next clock cycle. Note that these two steps have to be done in one clock cycle in order to be able to execute data-dependent instructions in consecutive clock cycles [21]. Another structure which is, or may become, a bottleneck is the register file, since DEC had to duplicate its register file to meet the design goals in the Alpha 21264 [19].

Only recently have researchers shown interest in decentralizing or partitioning superscalar architectures in order to tackle the complexity problem. A good overview is given by Ranganathan and Franklin [37], where the various propositions are categorized into three classes: execution-dependence based, data-dependence based and control-dependence based decentralized architectures. We will now discuss each of these classes in terms of its hardware complexity and locate the BSA in this classification.

In an execution-dependence based decentralized architecture, instructions are assigned to reservation stations based on the functional unit on which the instruction will be executed. This partitioning technique was first implemented in the IBM 360/91 by Tomasulo [44]. Other commercial implementations of this technique are the HP PA-8000 [41], the MIPS R10000 [48] and the HAL SPARC V [6]. Jourdan et al. [27] examined the influence of several instruction window topologies on IPC. When considering hardware complexity, we can expect that including reservation stations will reduce the complexity of the instruction selection logic, since the scope for selecting a firable instruction is restricted. However, we believe that this will not be enough for future designs because the wake-up logic still needs to broadcast register tags to a major part of the instructions in all the reservation stations; and this might be a problem for architectures with huge instruction windows since long wires


will be involved. In addition, the register file bottleneck is not addressed in this paradigm.

In a data-dependence based decentralized architecture, instructions are assigned near where their data dependencies will be resolved. Examples are the Alpha 21264 [19], clustered dependency-based microarchitectures as described in [35], the multicluster architecture [10], the PEWs microarchitecture [38] and the MISC (Multiple Instruction Stream Computer) [46]. The Alpha 21264 [19] and the clustered dependency-based microarchitecture [35] are organized into two clusters, each containing a copy of the register file. Consistency of both register files is guaranteed by broadcasting data values over inter-cluster bypasses, which is hardly scalable to more than two clusters. In addition, duplicating the register file only divides the number of register file read ports; the number of write ports is unaffected due to the inter-cluster broadcast.

In a control-dependence based decentralized paradigm, instructions are assigned near where their control dependencies will be resolved, which seems to be the most scalable paradigm if the engines are organized in a unidirectional ring [37]. Examples are the multiscalar architecture [42], trace processors [40] and the superthreaded architecture [45]. The architectures which most closely resemble the BSA are the multiscalar architecture and the trace processor. The compiler of a multiscalar architecture partitions a program into tasks which comprise multiple basic blocks. The main difference between a BSA-block and a task, however, is that a BSA-block has a fixed length (the number of instructions in a task is unbounded) and that no control flow is allowed within a BSA-block (control flow is converted by the compiler into data flow through predication). Another important difference is that predicated execution is supported within a BSA-block, mitigating the consequences of mispredicting branches. Note that BSAs also have some similarities with trace processors [40,47]. In both architectures, the unit of work (a block vs. a trace) has a fixed length and can contain several basic blocks. But the main difference is that a trace is constructed at runtime and contains only one flow of control; BSA-blocks, on the other hand, are constructed statically and can contain multiple flows of control.

6. Conclusions

An important challenge in the design of future microprocessors is that current methodologies are becoming impractical due to the reduced time-to-market. First, architectural simulations using contemporary methodologies are too time-consuming in an early design stage. Second, processor layout considerations will need to be incorporated in the early design stages due to the ever growing importance of interconnects on performance in future deep-submicron technologies. In this paper, we have shown that statistical modeling can be used to estimate IPC in an early design stage thanks to its fast simulation property. In addition, we have used viable processor layouts in order to quantify the impact of microarchitectural parameters as well as technology scaling on the clock cycle period in an early design stage.

We have applied our methodology to a novel architectural paradigm, namely a fixed-length block structured architecture. This architecture originated from the idea that scaling contemporary superscalar architectures to higher levels of performance and to future deep-submicron technologies requires new microarchitectural paradigms to overcome the increasing complexity and the increasing importance of interconnects. In this paper, we have shown that fixed-length block structured architectures are capable of reducing the processor core's complexity by taking appropriate microarchitectural and implementational design decisions, namely by introducing decentralization and by reducing the register file pressure. As a result, this paper shows that a fixed-length block structured architecture is a viable architectural paradigm for future microprocessors.

Acknowledgements

Support was given by the FWO project G.0036.99 on block structured architectures for multimedia signal processing.


References

[1] P.S. Ahuja, D.W. Clark, A. Rogers, The performance impact of incomplete bypassing in processor pipelines, in: Proceedings of the 28th Annual International Symposium on Microarchitecture (MICRO-28), November 1995, pp. 36–45.
[2] P. Bose, Performance evaluation and validation of microprocessors, in: Proceedings of the International Conference on Measurement and Modeling of Computer Systems (ACM SIGMETRICS'99), May 1999, pp. 226–227.
[3] P. Bose, T.M. Conte, T.M. Austin, Challenges in processor modeling and validation, IEEE Micro 19 (3) (1999) 9–14.
[4] P.P. Chang, S.A. Mahlke, W.W. Hwu, Using profile information to assist classic code optimizations, Software Practice and Experience 21 (12), 1991.
[5] K. Diefendorff, K7 challenges Intel, Microprocessor Report 12 (14), 1998.
[6] K. Diefendorff, HAL makes SPARCs fly, Microprocessor Report 13 (15), 1999.
[7] L. Eeckhout, K. De Bosschere, H. Neefs, Performance analysis through synthetic trace generation, in: Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS-2000), April 2000, pp. 1–6.
[8] L. Eeckhout, H. Neefs, K. De Bosschere, Estimating IPC of a block structured instruction set architecture in an early design stage, in: Parallel Computing: Fundamentals and Applications; Proceedings of the International Conference ParCo99, January 2000, pp. 468–475.
[9] L. Eeckhout, H. Neefs, K. De Bosschere, J. Van Campenhout, Investigating the implementation of a block structured processor architecture in an early design stage, in: Proceedings of the 25th Euromicro Conference, vol. 1, September 1999, pp. 186–193.
[10] K.I. Farkas, P. Chow, N.P. Jouppi, Z. Vranesic, The multicluster architecture: reducing cycle time through partitioning, in: Proceedings of the 30th Annual International Symposium on Microarchitecture (MICRO-30), December 1997, pp. 149–159.
[11] K.I. Farkas, N.P. Jouppi, P. Chow, Register file design considerations in dynamically scheduled processors, Technical Report WRL 95/10, Digital Western Research Laboratory, November 1995.
[12] J.A. Fisher, S.M. Freudenberger, Predicting conditional branch directions from previous runs of a program, in: Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-V), October 1992, pp. 85–95.
[13] M.J. Flynn, P. Hung, K.W. Rudd, Deep-submicron microprocessor design issues, IEEE Micro 19 (4) (1999) 11–22.
[14] M. Franklin, G.S. Sohi, Register traffic analysis for streamlining inter-operation communication in fine-grain parallel processors, in: Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO-25), December 1992.

[15] M. Franklin, G.S. Sohi, ARB: a hardware mechanism for dynamic reordering of memory references, IEEE Transactions on Computers 45 (5) (1996) 552–571.
[16] B.A. Gieseke et al., A 600-MHz superscalar RISC microprocessor with out-of-order execution, in: Proceedings of the 1997 IEEE International Solid-State Circuits Conference, February 1997, pp. 176–177.
[17] S. Gopal, T.N. Vijaykumar, J.E. Smith, G.S. Sohi, Speculative versioning cache, in: Proceedings of the Fourth International Symposium on High-Performance Computer Architecture (HPCA-4), 1998.
[18] L. Gwennap, Intel's P6 uses decoupled superscalar design, Microprocessor Report 9 (2), 1995.
[19] L. Gwennap, Digital 21264 sets new standard, Microprocessor Report 10 (14) (1996) 1–6.
[20] R.E. Hank, S.A. Mahlke, R.A. Bringmann, J.C. Gyllenhaal, W.W. Hwu, Superblock formation using static program analysis, in: Proceedings of the 26th Annual International Symposium on Microarchitecture (MICRO-26), December 1993, pp. 247–255.
[21] E. Hao, P.-Y. Chang, M. Evers, Y.N. Patt, Increasing the instruction fetch rate via block-structured instruction set architectures, in: Proceedings of the 29th Annual International Symposium on Microarchitecture (MICRO-29), December 1996, pp. 191–200.
[22] J.L. Hennessy, D.A. Patterson, Computer Architecture: A Quantitative Approach, second ed., Morgan Kaufmann Publishers, Los Altos, 1996.
[23] W.W. Hwu, S.A. Mahlke, W.Y. Chen, P.P. Chang, N.J. Warter, R.A. Bringmann, R.G. Ouelette, R.E. Hank, T. Kiyohara, G.E. Haab, J.G. Holm, D.M. Lavery, The superblock: an effective technique for VLIW and superscalar compilation, Journal of Supercomputing 7 (1993) 9–50.
[24] Q. Jacobson, S. Bennett, N. Sharma, J.E. Smith, Control flow speculation in multiscalar processors, in: Proceedings of the Third International Symposium on High-Performance Computer Architecture (HPCA-3), February 1997, pp. 218–229.
[25] Q. Jacobson, E. Rotenberg, J.E. Smith, Path-based next trace prediction, in: Proceedings of the 30th Annual International Symposium on Microarchitecture (MICRO-30), December 1997, pp. 14–23.
[26] N.P. Jouppi, S.J.E. Wilton, An enhanced access and cycle time model for on-chip caches, Technical Report WRL 93/5, Digital Western Research Laboratory, July 1994.
[27] S. Jourdan, P. Sainrat, D. Litaize, An investigation of the performance of various instruction-issue buffer topologies, in: Proceedings of the 28th Annual International Symposium on Microarchitecture (MICRO-28), November 1995, pp. 279–284.
[28] S.A. Mahlke, R.E. Hank, J.E. McCormick, D.I. August, W.W. Hwu, A comparison of full and partial predicated execution support for ILP processors, in: Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA-22), June 1995, pp. 138–149.

[29] S.A. Mahlke, D.C. Lin, W.Y. Chen, R.E. Hank, R.A. Bringmann, Effective compiler support for predicated execution using the hyperblock, in: Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO-25), December 1992, pp. 45–54.
[30] S. Melvin, Y. Patt, Enhancing instruction scheduling with a block-structured ISA, International Journal of Parallel Programming 23 (3) (1995) 221–243.
[31] H. Neefs, A preliminary study of a fixed-length block structured instruction set architecture, Technical Report 96-07, Department of Electronics and Information Systems (ELIS), Ghent University, November 1996. Available through http://www.elis.rug.ac.be/neefs.
[32] H. Neefs, K. De Bosschere, J. Van Campenhout, Issues in compilation for fixed-length block structured instruction set architectures, in: Proceedings of the Workshop on Interaction between Compilers and Computer Architectures, held in conjunction with the Third International Symposium on High-Performance Computer Architecture (HPCA-3), February 1997.
[33] H. Neefs, K. De Bosschere, J. Van Campenhout, Exploitable levels of ILP in future processors, Journal of Systems Architecture 45 (9) (1999) 687–708.
[34] S. Palacharla, N.P. Jouppi, J.E. Smith, Quantifying the complexity of superscalar processors, Technical Report CS-TR-96-1328, University of Wisconsin-Madison, November 1996.
[35] S. Palacharla, N.P. Jouppi, J.E. Smith, Complexity-effective superscalar processors, in: Proceedings of the 24th Annual International Symposium on Computer Architecture (ISCA-24), June 1997, pp. 206–218.
[36] D.N. Pnevmatikatos, G.S. Sohi, Guarded execution and dynamic branch prediction in dynamic ILP processors, in: Proceedings of the 21st Annual International Symposium on Computer Architecture (ISCA-21), April 1994, pp. 120–129.
[37] N. Ranganathan, M. Franklin, An empirical study of decentralized ILP execution models, in: Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), October 1998, pp. 272–281.
[38] N. Ranganathan, M. Franklin, The PEWs microarchitecture: reducing complexity through data-dependence based decentralization, Microprocessors and Microsystems 22 (6) (1998) 333–343.
[39] E. Rotenberg, S. Bennett, J.E. Smith, Trace cache: a low latency approach to high bandwidth instruction fetching, in: Proceedings of the 29th Annual International Symposium on Microarchitecture (MICRO-29), December 1996, pp. 24–35.
[40] E. Rotenberg, Q. Jacobson, Y. Sazeides, J. Smith, Trace processors, in: Proceedings of the 30th Annual International Symposium on Microarchitecture (MICRO-30), December 1997, pp. 138–148.


[41] A.P. Scott, K.P. Burkhart, A. Kumar, R.M. Blumberg, G.L. Ranson, Four-way superscalar PA-RISC processors, Hewlett-Packard Journal 48 (4), 1997.
[42] G.S. Sohi, S.E. Breach, T.N. Vijaykumar, Multiscalar processors, in: Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA-22), June 1995, pp. 414–425.
[43] E. Sprangle, Y. Patt, Facilitating superscalar processing via a combined static/dynamic register renaming scheme, in: Proceedings of the 27th Annual International Symposium on Microarchitecture (MICRO-27), December 1994, pp. 143–147.
[44] R.M. Tomasulo, An efficient algorithm for exploiting multiple arithmetic units, IBM Journal 11 (1967) 25–33.
[45] J.-Y. Tsai, J. Huang, C. Amlo, D.J. Lilja, P.-C. Yew, The superthreaded processor architecture, IEEE Transactions on Computers 48 (9), 1999.
[46] G. Tyson, M. Farrens, A.R. Pleszkun, MISC: a multiple instruction stream computer, in: Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO-25), December 1992, pp. 193–196.
[47] S. Vajapeyam, T. Mitra, Improving superscalar instruction dispatch and issue by exploiting dynamic code sequences, in: Proceedings of the 24th Annual International Symposium on Computer Architecture (ISCA-24), May 1997, pp. 1–12.
[48] K.C. Yeager, MIPS R10000 superscalar microprocessor, IEEE Micro 16 (2), 1996.

Lieven Eeckhout was born in Kortrijk, Belgium in 1975. He received the Engineering degree in Computer Science from Ghent University, Belgium, in 1998. Since then, he has been working towards a Ph.D. at the same university. He is supported by a grant from the Flemish Institute for the Promotion of Scientific-Technological Research in the Industry (IWT). His research interests include computer architecture, performance analysis and workload characterization.

Henk Neefs was born in Assenede, Belgium in 1970. He received the Engineering degree in Physics from Ghent University, Belgium, in 1993. He obtained a Ph.D. in Computer Science from the same university in 2000. He also took courses in molecular biology at the Free University of Brussels, Belgium. Currently, he is a hardware engineer at COMPAQ's Palo Alto Design Center. His current research interests include architectural techniques to increase the performance of computers, optical interconnects and the use of simulation in molecular biology. Henk Neefs is a member of IEEE and OSA.


Koen De Bosschere was born in Oudenaarde, Belgium in 1963. He received the degrees of Electrotechnical Engineering and Computer Science from Ghent University, Belgium, in 1986 and 1987, respectively. He obtained his Ph.D. from the same university in 1992. Since 1993, he has been teaching at the Faculty of Applied Sciences where he currently teaches courses in computer architecture, operating systems and declarative programming languages. His research interests include logic programming, system programming, parallelism and debugging.