The Journal of Systems and Software 73 (2004) 45–62 www.elsevier.com/locate/jss

How accurate should early design stage power/performance tools be? A case study with statistical simulation

Lieven Eeckhout *, Koen De Bosschere
Department of Electronics and Information Systems, Ghent University, Sint-Pietersnieuwstraat 41, B-9000 Gent, Belgium
Received 10 July 2003; received in revised form 1 August 2003; accepted 6 August 2003. Available online 25 December 2003.

Abstract

To cope with the widening design gap, the ever increasing impact of technology, reflected in increased interconnect delay and power consumption, and the time-consuming simulations needed to define the architecture of a microprocessor, computer engineers need techniques to explore the design space efficiently in an early design stage. These techniques should be able to identify a region of interest with desirable characteristics in terms of performance, power consumption and cycle time. In addition, they should be fast, since the design space is huge and the design time is limited. In this paper, we study how accurate early design stage techniques should be to make correct design decisions. In this analysis we focus on relative accuracy, which is more important than absolute accuracy at the earliest stages of the design flow. As a case study we demonstrate that statistical simulation is capable of making viable microprocessor design decisions efficiently in early stages of a microprocessor design while considering performance, power consumption and cycle time.
© 2003 Elsevier Inc. All rights reserved.

Keywords: Computer architecture; Early design stage methods; Statistical simulation; Architectural power modeling

1. Introduction

Moore's law states that the number of transistors available on a chip doubles every 18 months. This is due to the ever increasing capabilities of the chip-processing technology. The innovation in CAD tools, on the other hand, does not evolve that fast, resulting in the well known design gap. In the case of high-performance microprocessor designs, where most, if not all, time-critical structures are designed full custom, ever growing design teams are needed to make use of the huge amounts of transistors becoming available. These large design teams might become hard to manage in the near future. Another important issue that needs to be considered nowadays when designing high-performance microprocessors is that interconnect delay is of major concern as feature sizes continue to shrink and clock frequencies

* Corresponding author. Tel.: +3292643405; fax: +3292643594. E-mail addresses: [email protected] (L. Eeckhout), [email protected] (K. De Bosschere).
© 2003 Elsevier Inc. All rights reserved. doi:10.1016/S0164-1212(03)00247-4

continue to increase. And this certainly has its implications for the microarchitecture of a microprocessor. For example, in the Pentium 4 (Hinton et al., 2001) and the Alpha 21264 (Kessler et al., 1998), additional pipeline stages are inserted in the microarchitecture to transport data from one place to another on the chip, resulting in a smaller instruction throughput but a higher clock frequency. Furthermore, power consumption is becoming a major issue for embedded as well as for high-performance microprocessor designs. Power consumption should be kept to a minimum to keep the cooling and packaging cost reasonable and to guarantee a long operating time of battery-powered devices. These two issues, the ever increasing design gap and the impact of technology on microprocessor design, require computer architects and computer engineers to make viable design decisions early in the design cycle. In other words, a priori design techniques are needed that provide reasonable estimates of performance in relation to technological aspects. Also, estimates should be obtained efficiently since the design space is huge and the design time is limited.


Until now, the microarchitecture of a microprocessor has been defined through extensive architectural simulations which are extremely time-consuming: a simulation time of several months or even years is not exceptional. Indeed, microarchitectures continue to become more and more complex, which increases the simulation time per instruction significantly. In addition, the applications that need to be simulated are becoming increasingly complex, requiring billions of instructions to be simulated to evaluate a single processor configuration for a single benchmark. As a result, evaluating a huge design space for a set of benchmarks (that should be representative of the anticipated microprocessor workload) through architectural simulations is infeasible. Therefore, there is a recent interest in early design stage techniques that provide computer engineers quick estimates of important execution characteristics, such as performance, power consumption, chip area, cycle time, etc. The main objective of these techniques is to identify a region of interest with desirable characteristics; this region of interest can then be further analyzed through more detailed and thus slower architectural simulations. The final result will be a reduction of the overall design time and design cost. In the recent past, various researchers have proposed early design stage techniques to estimate a variety of important characteristics, such as performance in terms of IPC or number of instructions executed per cycle (Noonburg and Shen, 1997; Carl and Smith, 1998; Hsieh and Pedram, 1998; Brooks et al., 2000b; Ofelt and Hennessy, 2000; Oskin et al., 2000; Loh, 2001; Nussbaum and Smith, 2001; Eeckhout, 2002), power consumption (Eeckhout and De Bosschere, 2001; Srinivasan et al., 2002; Eeckhout, 2002), cycle time (Hrishikesh et al., 2002), chip area (Steinhaus et al., 2001), etc. All researchers working on early design stage tools try to make their method as accurate as possible; however, they have no clear indication of how accurate their technique should be. This has two possible implications for the tools they are developing: (i) their technique might not be accurate enough to be useful, or (ii) their technique might not need this high level of accuracy to be useful. Obviously, this potentially has severe implications for the computer engineers using these tools: (i) inaccurate estimates might be obtained using these tools, leading to wrong design decisions, or (ii) obtaining estimates using 'over-accurate' tools might be relatively slow due to the complexity of these estimation tools. We conclude that having a clear view on the accuracy of these early design stage techniques is paramount for using them with confidence. However, to the best of our knowledge very little work has been done on this issue. In this paper, we address the question of how accurate these early design stage techniques should be. First of all, we argue that relative accuracy is more important than absolute accuracy in case of an early design stage

technique. In other words, predicting a performance trend between various design points is more important than predicting the performance of one single design point. For example, when a computer engineer wants to determine a region of energy-efficient architectures, i.e., maximizing performance with reasonable power consumption, the optimum or a region of near-optimal designs should be identified in the design space rather than determining the absolute energy-efficiency factor. Second, we study the impact of the relative accuracy in estimating instructions per cycle (IPC) and energy consumed per cycle (EPC) on the evaluation of energy-efficient microprocessors. Third, as a case study of an early design stage technique we consider statistical simulation (Carl and Smith, 1998; Oskin et al., 2000; Nussbaum and Smith, 2001; Eeckhout, 2002; Eeckhout et al., 2003). Statistical simulation is a recently introduced technique capable of estimating IPC and EPC both accurately and quickly. In particular, we show that statistical simulation (i) is fast, (ii) attains good absolute accuracy with errors that are no larger than 10% on average, (iii) attains excellent relative accuracy, making it a technique that (iv) is capable of identifying a region of energy-efficient microprocessor designs. Finally, we also present an experiment showing that statistical simulation is useful in combination with an early floorplanner to address timing issues in an early design stage. This paper is an extended version of Eeckhout et al. (2003). The main contribution of this extended version is the detailed analysis of the importance of relative accuracy for a design technique. We show how relative accuracy relates to making correct design decisions early in the design cycle. We address this issue for early design stage power/performance tools in general and for statistical simulation in particular. The paper is organized as follows. First, we discuss how accurate early design stage techniques should be to be useful for identifying energy-efficient microarchitectures. Section 3 details the statistical simulation framework. The subsequent section discusses how statistical simulation can be used in combination with energy estimation tools and timing estimation tools. The experimental setup used in this paper is discussed in Section 5. Statistical simulation for microprocessor design is evaluated in Section 6. Finally, we conclude in Section 7.

2. How accurate should early design stage tools be?

Designing a high performance microprocessor is a time-consuming task (typically 5–7 years) that consists of a number of design steps (Bose and Conte, 1998). In the first step, a workload is composed by selecting a number of benchmarks with corresponding input data


sets. It is important that this workload is representative of the target domain of operation of the microprocessor under development. For example, the workload for a general-purpose microprocessor should be broad enough to cover a representative part of the program behavior that is typically observed on a general-purpose computer system, such as a web browser, a spreadsheet, a word processing application, etc. In the second step, the huge design space is explored at a very high level by interdisciplinary experts examining potential designs in terms of performance, power consumption, cycle time, chip area, pin count, packaging cost, etc. The techniques that are used in this step are called early design stage exploration tools. In the third step, this bounded design space (note that this design space is still very large in spite of the previous step) is evaluated using detailed architectural simulations. Based on the results of these architectural simulations, computer architects will actually define the microarchitecture. In the fourth step, this microarchitecture is further refined and a microarchitecture model is built at the register-transfer level (RTL). This level of abstraction guarantees cycle-accurate simulations, in contrast to the previous steps (Black and Shen, 1998; Gibson et al., 2000; Desikan et al., 2001). Note that using this cycle-accurate simulator for doing architectural simulations would be infeasible for the following reasons: (i) the development of a cycle-accurate model takes much longer than the development of an architectural simulator, (ii) simulations using cycle-accurate simulators are much slower than using architectural simulators, making them infeasible for exploring a large design space in limited time, and (iii) this high level of accuracy is probably not needed in previous steps of the design cycle to make correct design decisions (which is the problem this paper deals with). In the subsequent steps of the design flow, the RTL model is further translated to the logic level, the gate level, the circuit level and the layout level. Finally, verification is done at all levels. An important question that needs to be answered in light of this design flow is how accurate these design tools should be. Obviously, this question is not only relevant for early design stage exploration techniques but also for architectural simulation, i.e., design steps that do not guarantee cycle-accurate simulation results. In this paper, we impose two limitations: (i) we limit ourselves to early design stage tools, although the approach taken here is also applicable to evaluate the required accuracy of architectural simulations; and (ii) the optimization criterion used in this paper is energy-efficiency, i.e., maximum performance for minimum power consumption. Again, the methodology used in this section is also applicable to other optimization criteria, e.g., maximum performance for minimum chip area. As suggested previously, most tool developers as well as computer engineers using early design stage tools


focus on its absolute accuracy, i.e., accuracy in one single design point. However, computer architects do not need absolute accuracy in all cases. In addition, expecting a high absolute accuracy in early design stages is not realistic given the number of unknowns at that point in the design cycle. Relative accuracy, which quantifies how well a performance trend between various design points is predicted, can be sufficient to make correct design decisions, i.e., to determine the optimal configuration for a given optimization criterion. The next section discusses the concepts of absolute and relative accuracy in detail.

2.1. Absolute versus relative accuracy

Consider two estimation techniques, A and B, in which A is the reference technique, i.e., the most accurate one. In this paper, A is architectural simulation and B is an early design stage technique, e.g., statistical simulation. The absolute error AE, which is a measure for the absolute prediction accuracy, is defined as the prediction error for a given metric M in one single design point P:

$AE_P = \frac{M_{B,P} - M_{A,P}}{M_{A,P}}.$    (1)

The absolute error thus measures the percentage difference in metric M between the two techniques. Note that a positive error means overestimation whereas a negative error means underestimation. The metric M can be IPC, EPC, etc. The relative error RE is a measure for the relative prediction accuracy and quantifies the prediction error over two design points P and Q:

$RE_{P,Q} = \frac{M_{B,Q}}{M_{B,P}} - \frac{M_{A,Q}}{M_{A,P}}.$    (2)

For example, if technique B estimates that Q attains a speedup of 1.2 over P and technique A estimates that Q attains a speedup of 1.15 over P, the relative error is 5%. The basic idea behind this definition is that if the estimated performance increase is 20% and the relative error is 5%, then the real performance increase is 15%. An alternative definition for the relative error would have been to consider

$RE_{P,Q} = \frac{M_{B,Q}/M_{B,P} - M_{A,Q}/M_{A,P}}{M_{A,Q}/M_{A,P}}.$    (3)

However, we believe that the definition given in formula 2 makes more sense for the following reason. Note that the use of relative accuracy is especially valuable in the knee of a performance curve, e.g., in the region where the increased hardware cost does not justify the marginally increasing performance. In this region, we are dealing with small changes in performance, of at most a few tens of percent. As such, the alternative definition in formula 3 is not very well suited for this application. For example, estimating a small


increase of 1% as 1.5% seems fairly accurate to an early design stage designer. Formula 2 shows a relative error of 0.5%. The alternative definition in formula 3, on the other hand, claims a relative error of 50%, which obviously makes little sense given the small performance increases. Consequently, we will use the definition given in formula 2 throughout this paper. Obviously, absolute and relative accuracy are related to each other. To clarify this relationship, we can make the following statements.
• Perfect absolute accuracy (AE = 0%) implies perfect relative accuracy (RE = 0%).
• An absolute prediction accuracy that is constant over the complete design space still implies a perfect relative accuracy. Indeed, a constant absolute accuracy implies that the absolute error in design point P equals the absolute error in Q:

$AE_P = AE_Q \Rightarrow \frac{M_{B,P}}{M_{A,P}} = \frac{M_{B,Q}}{M_{A,Q}} \Rightarrow \frac{M_{B,Q}}{M_{B,P}} = \frac{M_{A,Q}}{M_{A,P}} \Rightarrow RE_{P,Q} = 0.$    (4)

This implies that the relative error is zero, i.e., perfect relative accuracy is attained.
• An absolute prediction accuracy that is not constant over the design space might lead to a relative prediction error. However, if the absolute accuracy does not vary wildly over the design space, we can still expect to achieve reasonable relative accuracy.
Unfortunately, the latter case (a non-constant absolute accuracy) seems to be reality since, to the best of our knowledge, no early design stage technique exists that achieves an absolute accuracy that is zero or a non-zero constant over the complete design space. Consequently, a thorough evaluation of the impact of a non-zero relative accuracy needs to be done. As stated previously, we will focus on evaluating the energy-efficiency of a microprocessor. Obviously, to measure the energy-efficiency of a microprocessor we need viable power/performance metrics. These will be discussed in the next section. In the subsequent section, we will discuss how the relative accuracy, both in terms of performance and power, affects design decisions concerning the energy-efficiency of a given microarchitecture.

2.2. Power/performance metrics

Several metrics can be used to evaluate the power/performance characteristics of a given microprocessor. In this paper, we use two energy-efficiency metrics, namely the energy-delay product and the energy-delay-square product. These two metrics were found to be reasonable metrics for evaluating the energy efficiency of midrange and high-end microprocessor designs (Brooks et al., 2000a). The energy-delay product (EDP) is defined as follows:

$EDP = \frac{\text{energy}}{\text{instruction}} \cdot \frac{\text{cycles}}{\text{instruction}} = \left(\frac{1}{IPC}\right)^2 \cdot EPC$    (5)

and the energy-delay-square product (ED²P) is defined as follows:

$ED^2P = \frac{\text{energy}}{\text{instruction}} \cdot \left(\frac{\text{cycles}}{\text{instruction}}\right)^2 = \left(\frac{1}{IPC}\right)^3 \cdot EPC,$    (6)

with IPC the number of instructions executed per cycle and EPC the amount of energy consumed per cycle. Note that EDP and ED²P are fused metrics that measure power and performance together. Both metrics are smaller-is-better, i.e., lower energy consumption for more performance. Brooks et al. (2000a) argue that the EDP is appropriate for high-end systems, for example workstations. The ED²P is found to be appropriate for the highest performance server-class machines, since the ED²P gives an even higher weight to performance than the EDP does. The inverses of the EDP and the ED²P are proportional to the so-called MIPS²/Watt and MIPS³/Watt metrics, respectively. In these metrics, MIPS stands for millions of instructions executed per second and Watt for the energy consumed per second. Note that the selection of appropriate metrics to quantify the power/performance characteristics of a microprocessor depends on the application domain. As stated above, EDP and ED²P are targeted at high performance microprocessors. However, other metrics can be used for other target domains:
• The total amount of energy consumed during a program execution is another important metric. The energy consumption E can be calculated as follows:

$E = \bar{P} \cdot T,$    (7)

with T the execution time of the computer program. In other words, the total energy consumption of a program execution is proportional to the average power dissipation $\bar{P}$. Note that energy consumption is not only important for battery-powered devices; for higher end systems that are connected to the electricity grid, energy consumption is directly related to the electricity bill, to the total heat production (important for server farms) and the associated cooling cost.
• Another metric that might be appropriate for lower end systems is energy per instruction (EPI), calculated as follows:

$EPI = \frac{1}{IPC} \cdot EPC.$    (8)


This metric can also be estimated given an IPC prediction and an energy per cycle prediction. The inverse of this metric is proportional to the MIPS/Watt metric.
• Two metrics that are more related to heating, packaging cost, cooling cost and reliability are maximum power dissipation and power density, i.e., power dissipation per unit of area.
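All of the metrics above follow directly from an IPC estimate and an EPC estimate. The minimal sketch below (in Python, with made-up example values that are not taken from the paper) illustrates the arithmetic of Eqs. (5)–(8):

```python
# Hypothetical IPC and EPC estimates for one design point (illustrative only).
ipc = 2.0             # instructions per cycle
epc = 40.0            # energy per cycle, in nJ/cycle
frequency = 600e6     # cycles per second
instructions = 200e6  # dynamic instruction count of the workload

edp  = (1.0 / ipc) ** 2 * epc   # energy-delay product, Eq. (5), smaller is better
ed2p = (1.0 / ipc) ** 3 * epc   # energy-delay-square product, Eq. (6)
epi  = (1.0 / ipc) * epc        # energy per instruction, Eq. (8), in nJ

# Total energy E = P_avg * T, Eq. (7):
cycles = instructions / ipc
time_s = cycles / frequency               # execution time T in seconds
p_avg  = epc * 1e-9 * frequency           # average power in watts (J/cycle * cycles/s)
energy = p_avg * time_s                   # equivalently: instructions * epi * 1e-9

print(f"EDP={edp:.2f}  ED2P={ed2p:.2f}  EPI={epi:.1f} nJ  E={energy:.3f} J")
```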

2.3. The impact of relative accuracy

This section details the impact of relative accuracy, both in terms of IPC and EPC, on making correct design decisions concerning the energy-efficiency (quantified as EDP and ED²P) of a given microarchitecture. In other words, we will quantify how small the relative error in IPC and EPC should be so that the most energy-efficient microarchitecture is still correctly identified, i.e., the microarchitecture with the minimal EDP or the minimal ED²P. Therefore, we consider a case study in which we vary the instruction window size¹ as well as the processor width² of an out-of-order microarchitecture. In Fig. 1, the IPC as well as the EPC is shown as a function of these two microarchitectural parameters. These data were obtained through detailed architectural simulations using the SPECint95 benchmarks on a trace-driven simulator. For a more detailed discussion of the methodology used for obtaining these results, we refer to Section 5 later on in this paper. To get a better understanding of what relative accuracy means and how it affects IPC and EPC predictions, we refer to Fig. 2. In these graphs, the IPC and the EPC are shown as a function of window size for a 6-wide processor. This is done for varying levels of accuracy: (i) zero absolute and zero relative prediction error, i.e., obtained through architectural simulation, see also Fig. 1, (ii) an absolute error of 5% and a zero relative error, (iii) a relative error of 0.2% on top of the data in (ii); the relative error is measured between two processor configurations with a window size of 16·n and 16·(n+1) instructions, with 2 ≤ n ≤ 9, (iv) a relative error of 0.5%, and (v) a relative error of 1%. The data represented in these graphs under (ii) to (v) are obtained using the following formula:

$M_{B,Q} = \left(RE_{P,Q} + \frac{M_{A,Q}}{M_{A,P}}\right) \cdot M_{B,P},$    (9)

with $M_{A,P}$ and $M_{A,Q}$ the data obtained through detailed architectural simulation. These graphs show that even a small relative error can lead to severe mispredictions due to an accumulation effect, i.e., $M_{B,Q}$ is calculated from

¹ This equals the number of in-flight instructions.
² This equals the fetch width, the decode width, the rename width, the issue width and the reorder width.


$M_{B,P}$, $M_{B,R}$ is calculated from $M_{B,Q}$, etc. This effect can be formalized in the following formula:

$M_{B,R} = \left(RE_{Q,R} + \frac{M_{A,R}}{M_{A,Q}}\right) \cdot M_{B,Q} = \left(RE_{Q,R} + \frac{M_{A,R}}{M_{A,Q}}\right) \cdot \left(RE_{P,Q} + \frac{M_{A,Q}}{M_{A,P}}\right) \cdot M_{B,P} = \left(RE_{Q,R} RE_{P,Q} + RE_{Q,R} \frac{M_{A,Q}}{M_{A,P}} + RE_{P,Q} \frac{M_{A,R}}{M_{A,Q}} + \frac{M_{A,R}}{M_{A,P}}\right) \cdot M_{B,P}.$    (10)

In case the relative accuracy is constant over the design space, as we assume here, this formula can be simplified to:

$M_{B,R} = \left(RE^2 + RE \frac{M_{A,Q}}{M_{A,P}} + RE \frac{M_{A,R}}{M_{A,Q}} + \frac{M_{A,R}}{M_{A,P}}\right) \cdot M_{B,P}.$    (11)

Using the data presented in Fig. 1 as a case study implies there are four parameters in these experiments: (i) the relative accuracy on IPC along the window size axis, (ii) the relative accuracy on IPC along the processor width axis, (iii) the relative accuracy on EPC along the window size axis, and (iv) the relative accuracy on EPC along the processor width axis. In the set of experiments we will discuss now, these four relative accuracies will be varied and it will be verified whether the optimal microarchitecture, i.e., the one with minimal EDP or minimal ED²P, is identified. In the first set of experiments, we vary the relative prediction error in IPC and EPC along the window size axis for a 6-wide superscalar microarchitecture. The IPC and EPC numbers are calculated using formula 9. For each level of relative accuracy we verify whether the microarchitecture with the minimal EDP and ED²P is identified. The results of this experiment are shown in Tables 1 and 2 for EDP and ED²P, respectively. We observe that in case the relative error in IPC is zero, the EDP-optimal design is identified as long as the relative error in EPC varies between −1% and 4%. For correctly identifying the ED²P-optimal design, the relative error in EPC should be smaller, i.e., between −0.5% and 0%. Interestingly, note also that in some cases where the relative error in IPC is non-zero, e.g., 2%, the relative error in EPC should also be non-zero, at least 3%, to correctly identify the EDP-optimal design. In the second set of experiments we vary the relative error in IPC along the window axis and the processor width axis, see Tables 3 and 4. The relative error in EPC along the window axis and the processor width axis is assumed to be zero. Since in these experiments we vary the relative error along two parameter axes, we have used formula 10 in which $RE_{Q,R}$ and $RE_{P,Q}$ are the relative accuracies along the window size axis and the processor width axis, respectively. These results show that in order to


Fig. 1. IPC (on the left) and EPC (on the right) as a function of instruction window size and processor width, obtained through detailed architectural simulations.

Fig. 2. IPC (on the left) and EPC (on the right) as a function of instruction window size under various conditions: (i) real numbers as obtained through detailed architectural simulations, (ii) RE = 0% (with AE = 5%), (iii) RE = 0.2%, (iv) RE = 0.5% and (v) RE = 1%.

Table 1. The impact on predicting the optimal EDP-microarchitecture by varying the relative accuracy in IPC (rows, −5% to 4%) and in EPC (columns, −5% to 5%) along the window size axis. A 'Y' denotes that the optimal EDP-configuration is identified.


Table 2. The impact on predicting the optimal ED²P-microarchitecture by varying the relative accuracy in IPC (rows, −0.3% to 0.3%) and in EPC (columns, −0.6% to 0.6%) along the window size axis. A 'Y' denotes that the optimal ED²P-configuration is identified.

Table 3. The impact on predicting the optimal EDP-microarchitecture by varying the relative accuracy in IPC along the processor width axis (rows, −5% to 2%) and along the window size axis (columns, −2.0% to 1.2%). Perfect EPC relative accuracy is assumed. A 'Y' denotes that the optimal EDP-configuration is identified.

Table 4. The impact on predicting the optimal ED²P-microarchitecture by varying the relative accuracy in IPC along the processor width axis (rows, −1.8% to 0.2%) and along the window size axis (columns, −0.4% to 0.3%). Perfect EPC relative accuracy is assumed. A 'Y' denotes that the optimal ED²P-configuration is identified.

correctly identify the EDP-optimal design, see Table 3, the RE in IPC along the window size axis should be larger than −1.6% and smaller than 0.8%; the RE in IPC along the processor width axis should be larger than −4% and smaller than 1%. A comparable result can be obtained from Table 4: −0.3% ≤ RE in IPC along the window size axis ≤ 0.2%; and −1.6% ≤ RE in IPC along the processor width axis ≤ 0%. From these experiments, we can make a number of important conclusions.

• First of all, we have shown that it is possible to reason about relative accuracy and its impact on making correct design decisions. This is important for tool developers to gain confidence in the tools they are developing. Obviously, this is also important information for the researchers using those tools.
• The level of (relative) accuracy that is needed to make correct design decisions depends on the optimization criterion. From our experiments, we observe that finding the ED²P-optimal microarchitecture is more


sensitive to relative accuracy than is the case for EDP, i.e., smaller relative errors are required to make correct design decisions for ED²P than for EDP.
• The results are more sensitive to the relative accuracy in IPC than to the relative accuracy in EPC. This is observed in Tables 1 and 2, where the number of Y's is smaller in the vertical direction than in the horizontal direction. Note that this sensitivity is even more pronounced for ED²P-optimal designs than for EDP-optimal designs. Indeed, the ratio of the number of Y's in the vertical direction to the number of Y's in the horizontal direction is smaller for the ED²P-optimal design, see Table 2, than for the EDP-optimal design, see Table 1. This is important information for tool developers, showing that more emphasis should be given to accurate performance modeling rather than to accurate power modeling, especially for the design of high performance server-class systems. In other words, a less accurate power modeling tool will be as useful as a more accurate one.
• The same relative accuracy is not required along all parameter axes. For example, Tables 3 and 4 clearly show that the design decisions are more sensitive to the relative accuracy along the window size axis than along the processor width axis.
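To make the mechanics of this experiment concrete, the following sketch applies formula 9 with a constant relative error along one parameter axis to a set of reference IPC values and checks whether the EDP-optimal design point is still identified. The reference numbers below are hypothetical placeholders, not the SPECint95 data behind Tables 1–4; only the procedure follows the text.

```python
# Sketch of the Section 2.3 experiment: inject a constant relative error per
# design-point step (Eq. (9)) and check whether the EDP-optimal design survives.

window_sizes = [32, 48, 64, 80, 96, 112, 128, 144, 160]
ipc_ref = [1.60, 1.75, 1.86, 1.94, 2.00, 2.04, 2.07, 2.09, 2.10]   # hypothetical
epc_ref = [30.0, 34.0, 38.0, 42.0, 46.0, 50.0, 54.0, 58.0, 62.0]   # hypothetical, nJ/cycle

def inject_relative_error(ref, re_step, ae0=0.05):
    """Walk along one parameter axis applying Eq. (9):
    M_B,Q = (RE_{P,Q} + M_A,Q / M_A,P) * M_B,P, starting from an absolute error ae0."""
    est = [ref[0] * (1.0 + ae0)]
    for prev, cur in zip(ref, ref[1:]):
        est.append((re_step + cur / prev) * est[-1])   # errors accumulate multiplicatively
    return est

def edp_optimal_index(ipc, epc):
    return min(range(len(ipc)), key=lambda i: epc[i] / ipc[i] ** 2)  # Eq. (5)

ref_opt = edp_optimal_index(ipc_ref, epc_ref)
for re_ipc in (0.0, 0.005, 0.01, 0.02):
    ipc_est = inject_relative_error(ipc_ref, re_ipc)
    est_opt = edp_optimal_index(ipc_est, epc_ref)      # perfect EPC accuracy assumed here
    print(f"RE in IPC = {re_ipc:.1%}: EDP optimum at window "
          f"{window_sizes[est_opt]} (reference: {window_sizes[ref_opt]})")
```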

3. Statistical simulation

As mentioned in the introduction of this paper, we consider statistical simulation as a case study of an early design stage technique. In this section, we discuss the statistical simulation methodology and how it can be used for design space explorations while considering performance, power consumption as well as cycle time issues. In Section 6, we will evaluate the absolute and, in particular, the relative accuracy of the proposed technique. As such, we will verify whether statistical simulation can be used as an efficient method for making correct design decisions early in the design cycle. The statistical simulation methodology (Carl and Smith, 1998; Oskin et al., 2000; Nussbaum and Smith, 2001; Eeckhout, 2002; Eeckhout et al., 2003) consists of three steps, see Fig. 3: statistical profiling, synthetic trace generation and trace-driven simulation.

3.1. Statistical profiling

During the statistical profiling step, a real program trace, i.e., a stream of instructions as they are executed instruction per instruction by a (single-issue in-order) microprocessor, is analyzed by a microarchitecture-dependent profiling tool and a microarchitecture-independent profiling tool. The complete set of statistics collected during statistical profiling is called a statistical

Fig. 3. Statistical simulation: a real trace is analyzed by a microarchitecture-dependent and a microarchitecture-independent profiling tool into a statistical profile (branch, cache and program statistics); a synthetic trace generator produces a synthetic trace from this profile, which is fed to a trace-driven simulator that yields power/performance characteristics.

profile. The microarchitecture-independent profiling tool extracts (i) a distribution of the instruction mix (we identify 19 instruction classes according to their semantics and the number of source registers), (ii) the distribution of the age of the input register instances (i.e., the number of dynamic instructions between writing and reading a register instance; measured per instruction class and per source register; 22 distributions in total) and (iii) the age of memory instances (i.e., the number of load instructions between writing and reading the same memory location). The age distribution of register and memory instances only captures read-after-write (RAW) dependencies. Write-after-write (WAW) and write-after-read (WAR) dependencies are not considered since we assume perfect (hardware supported) register renaming, i.e., there are enough physical registers to remove all WAW and WAR dependencies dynamically. Note that this is not unrealistic since it is implemented in the Alpha 21264 (Kessler et al., 1998). If measuring performance as a function of the number of physical registers were needed, the methodology presented in this paper could easily be extended for this purpose by modeling WAR and WAW dependencies as well. The microarchitecture-dependent profiling tool only extracts statistics concerning the branch and cache behavior of the program trace for a specific branch predictor and a specific cache organization. The branch statistics consist of seven probabilities: (i) the conditional branch target prediction accuracy, (ii) the conditional branch (taken/not-taken) prediction accuracy, (iii) the relative branch target prediction accuracy, (iv) the relative call target prediction accuracy, (v) the indirect jump target prediction accuracy, (vi) the indirect call target prediction accuracy and (vii) the return target prediction accuracy. The reason to distinguish between these seven probabilities is that the prediction accuracies

greatly vary among the various branch classes. In addition, the penalties introduced by these mispredictions are completely different. A misprediction in cases (i), (iii) and (iv) only introduces a single-cycle bubble in the pipeline. Cases (ii), (v), (vi) and (vii), on the other hand, will cause the entire processor pipeline to be flushed and refilled when the mispredicted branch is executed. The cache statistics include two sets of distributions: the data cache and the instruction cache statistics. The data cache statistics contain two probabilities for a load operation, namely (i) the probability that a load needs to access the level-2 (L2) cache, as a result of a level-1 (L1) cache miss, and (ii) the probability that main memory needs to be accessed to get its data, as a result of a level-2 (L2) cache miss; idem for the instruction cache statistics. A statistical profile can be computed from an actual trace, but it is more convenient to compute it on-the-fly from either an instrumented functional simulator or from an instrumented version of the benchmark program running on a real system, which eliminates the need to store huge traces. A second note is that although computing a statistical profile might take a long time, it should be done only once for each benchmark with a given input. And since statistical simulation is a fast analysis technique, computing a statistical profile will be worthwhile. A third important note is that measuring microarchitecture-dependent characteristics such as branch prediction accuracy and cache miss rates implies that statistical simulation cannot be used to study branch predictors or cache organizations. Other microarchitectural parameters, however, can be varied freely. We believe this is not a major limitation since, e.g., cache miss rates for various cache sizes can be computed simultaneously using the cheetah simulator (Sugumar and Abraham, 1993).

Fig. 4. (a) Determining a program characteristic using random number generation and (b) generating a synthetic trace.

3.2. Synthetic trace generation and simulation

Once a statistical profile is computed, a synthetic trace is generated by a synthetic trace generator using this statistical profile. This is based on a Monte Carlo method: a random number is generated between 0 and 1 that will determine a program characteristic using a cumulative distribution function, see Fig. 4(a). The generation of a synthetic trace itself works on an instruction-by-instruction basis. Consider the generation of instruction x in the synthetic instruction stream, see Fig. 4(b):
(1) Determine the instruction type using the instruction-mix distribution; e.g., an add, a store, etc. were generated in Fig. 4(b).
(2) For each source operand, determine the instruction that creates this register instance using the age of register instances distribution. Notice that when a

dependency is created in this step, the demand of syntactic correctness does not allow us to assign a destination operand to a store or a conditional branch instruction.³ For example, in Fig. 4(b) the load instruction cannot be made dependent on the preceding branch. However, using the Monte Carlo method we cannot assure that the instruction that is the creator of that register instance is neither a store nor a conditional branch instruction. This problem is solved as follows: look for another creator instruction until we get one that is neither a store nor a conditional branch. If after a certain maximum number of trials still no dependency is found that is not supposedly created by a store or a conditional branch instruction, the dependency is simply removed.
(3) If instruction x is a load instruction, use the age of memory instances distribution to determine whether a store instruction w (before instruction x in the trace, i.e., w < x) accesses the same memory address; e.g., a read-after-write dependency is imposed through memory between the load and the store in Fig. 4(b). This will have its repercussions when simulating these instructions. In our simulator we assume speculative out-of-order execution of memory operations. This means that when a load x that

³ Relative jumps, indirect jumps and returns do not have destination operands either. However, we will not mention them in the remainder of this paper, although we take this into account.



accesses the same memory location as a previous store w (w < x) is executed earlier than the store, the load would get the wrong data. To prevent this, a table is kept in the processor to keep track of memory dependencies. When the store w is executed later, it will detect in that table that load x has accessed the same memory location. In that case, the load and all its dependent instructions need to be re-executed.
(4) If instruction x is a branch, determine whether the branch and its target will be correctly predicted using the branch statistics. In order to model resource contention due to branch mispredictions, we take the following action while simulating a synthetically generated trace: when a 'mispredicted'-labeled branch is inserted in the processor pipeline, instructions are injected in the pipeline (also synthetically generated) to model the fetching from a misspeculated control flow path. These instructions are then marked as coming from a misspeculated path. When the misspeculated branch is executed, the instructions of the misspeculated path are removed, and new instructions are fetched (again synthetically generated) and marked as coming from the correct control flow path.
(5) If instruction x is a load instruction, determine whether the load will cause an L1 cache hit/miss or L2 cache hit/miss using the data cache statistics. When an 'L1 or L2 cache miss'-labeled load instruction is executed in the pipeline, the simulator assigns an execution latency according to the type of the cache miss. In case of an L1 cache miss, the L2 cache access time will be assigned; in case of an L2 cache miss, the memory access time will be assigned.
(6) Determine whether or not instruction x will cause an instruction cache hit/miss at the L1 or L2 level. In Fig. 4(b), the first and the last instruction get the labels 'L2 I$ miss' and 'L1 I$ miss', respectively. When an 'L1 or L2 cache miss'-labeled instruction is inserted into the pipeline, the processor will stop inserting new instructions into the pipeline during a number of cycles. This number of cycles is the L2 cache access time or the memory access time in case of an L1 cache miss or an L2 cache miss, respectively.
The last phase of the statistical simulation method is the trace-driven simulation of the synthetic trace, which yields estimates of power and/or performance characteristics. An important performance characteristic is the average number of instructions executed per cycle (IPC), which can easily be calculated by dividing the number of instructions simulated by the number of execution cycles. How power characteristics are calculated is discussed in detail in Section 4.1.
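As an illustration of the generation loop described above, the sketch below implements a stripped-down Monte Carlo trace generator. The distributions are tiny hypothetical stand-ins for a real statistical profile (which contains 19 instruction classes, per-class dependency-age distributions, the seven branch probabilities and the cache statistics), and the retry logic for invalid dependencies is reduced to a single check; it shows the mechanics only, not the authors' tool.

```python
import bisect
import random

# Hypothetical stand-ins for a statistical profile (illustrative values only).
instruction_mix = {"add": 0.40, "load": 0.25, "store": 0.15, "branch": 0.20}
dep_age_cdf = [(1, 0.35), (2, 0.60), (4, 0.80), (8, 0.95), (16, 1.00)]  # cumulative P(age <= n)
p_l1_dmiss = 0.05      # probability a load misses in the L1 data cache
p_l2_dmiss = 0.01      # probability a load also misses in L2 (goes to memory)
p_mispredict = 0.06    # conditional branch misprediction probability

def sample_cdf(cdf):
    """Inverse-transform sampling: map a random number in [0, 1) through a
    cumulative distribution function (Fig. 4(a))."""
    r = random.random()
    idx = bisect.bisect_left([p for _, p in cdf], r)
    return cdf[idx][0]

def generate_instruction(trace):
    # Step (1): pick the instruction type from the instruction-mix distribution.
    r, acc = random.random(), 0.0
    for itype, prob in instruction_mix.items():
        acc += prob
        if r < acc:
            break
    insn = {"type": itype, "deps": [], "l1_miss": False, "l2_miss": False, "mispredict": False}
    # Step (2): pick a RAW dependency using the age-of-register-instances distribution.
    # A real generator retries a few times when the producer is a store or a branch
    # (which have no destination register); this sketch simply drops the dependency.
    producer = len(trace) - sample_cdf(dep_age_cdf)
    if producer >= 0 and trace[producer]["type"] not in ("store", "branch"):
        insn["deps"].append(producer)
    # Steps (4)-(6): attach branch-misprediction and cache-miss labels.
    if itype == "branch":
        insn["mispredict"] = random.random() < p_mispredict
    if itype == "load":
        insn["l1_miss"] = random.random() < p_l1_dmiss
        insn["l2_miss"] = insn["l1_miss"] and random.random() < p_l2_dmiss / p_l1_dmiss
    return insn

trace = []
for _ in range(1_000_000):   # about a million synthetic instructions suffices (Section 6.1)
    trace.append(generate_instruction(trace))
```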

4. Application: microprocessor design

The most obvious application of statistical simulation is microprocessor performance evaluation. Since this methodology is fast and quite accurate (as will be shown in the evaluation section of this paper), performance estimates can be obtained quickly. A huge design space can thus be explored in limited time to identify the optimal design or, in case statistical simulation fails in identifying the optimal design, a region of interest with near-optimal performance. This region of interest can then be further analyzed using more detailed (and thus slower) architectural simulations. As such, the microprocessor architectural design time will decrease significantly, leading to a shorter time-to-market. As discussed previously, performance should not be the only concern when designing a microprocessor. Power consumption and the impact of interconnects on cycle time are also of major concern today, and we can only expect this to grow in the near future. How statistical simulation can be used to consider these additional issues in the early stages of the design is discussed in the next two subsections.

4.1. Power consumption

Since power consumption is becoming a major issue in microprocessor design, it should be incorporated in the early design stages so that computer engineers are not confronted with unexpected power consumption in a late design phase, increasing the packaging and cooling cost considerably. This issue can be handled with statistical simulation by incorporating an architectural power model in the trace-driven simulator. As a result, power characteristics can be estimated early in the design flow (Eeckhout and De Bosschere, 2001). Several architectural-level power estimation models have been proposed in the last few years, e.g., Wattch (Brooks et al., 2000d), SimplePower (Ye et al., 2000), PowerTimer (Brooks et al., 2000c) and TEM²P²EST (Dhodapkar et al., 2000). In this study, we used Wattch (Brooks et al., 2000d) as the power estimation model because of its public availability and its flexibility for architectural design space explorations; the power models included in Wattch are fully parameterizable, which provides its high flexibility. According to the authors, Wattch also provides good relative accuracy, which is required for doing architectural design space explorations. Wattch (Brooks et al., 2000d) calculates the dynamic power consumption P of a processor unit (e.g., functional unit, instruction cache, data cache, register file, clock distribution, etc.) using the formula $P = C V^2 a f$, where C, V, a and f are the load capacitance, the supply voltage, the activity factor (0 < a < 1) and the processor frequency, respectively. V and f are



technology dependent: in this study, we assume V = 2.5 V, f = 600 MHz and a 0.35 µm technology. We used the technology parameters for which Wattch was validated against reported data (Brooks et al., 2000d). Wattch estimates the capacitance C, which is composed of gate and line capacitances, based on circuit models. The activity factor a measures how often clock ticks lead to switching activity on average. In this study, we assumed a base activity factor of 1/2 modeling random switching activity, which we believe is a reasonable approximation in an early design stage.⁴ The activity factor a can be further lowered by clock-gating unneeded units. In this study, we assumed an aggressive conditional clocking scheme in which the power estimate is linearly scaled from the maximum power consumption with port or unit usage; unused units dissipate 10% of their maximum power. Measuring the port or unit usage is done by inserting so-called activity counters in the simulator that keep track of the number of accesses per clock cycle to the various units. For this study, we have integrated Wattch in our trace-driven simulator by inserting the Wattch activity counters in our simulator. Further, we also assume 3 extra pipeline stages between the fetch and the issue stage, as is done in Wattch. In addition, the power models in Wattch closely resemble the microarchitecture modeled in our simulator, which is important for the validity of the results, according to the analysis done in Ghiasi and Grunwald (2000).
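A minimal sketch of this per-unit power estimate is given below; it represents one way of reading the clock-gating description, not Wattch's actual implementation. The unit capacitances and usage numbers are hypothetical placeholders; only the formula P = C·V²·a·f, the 2.5 V / 600 MHz operating point, the base activity factor of 1/2 and the 10%-of-maximum idle power are taken from the text.

```python
# Per-unit dynamic power estimate in the style of Section 4.1:
# P = C * V^2 * a * f, with conditional clock gating that scales power linearly
# with port/unit usage and charges unused units 10% of their maximum power.

V = 2.5               # supply voltage (V)
F = 600e6             # clock frequency (Hz)
BASE_ACTIVITY = 0.5   # random switching activity assumed in an early design stage

unit_capacitance = {  # farads; illustrative values only, not Wattch's circuit models
    "icache": 4.0e-9,
    "dcache": 5.0e-9,
    "window": 3.0e-9,
    "alu":    1.5e-9,
    "clock":  6.0e-9,
}

def unit_power(cap, usage):
    """usage: fraction of cycles the unit (or its ports) is accessed, taken from
    the activity counters in the trace-driven simulator (0.0 <= usage <= 1.0)."""
    p_max = cap * V * V * BASE_ACTIVITY * F               # P = C * V^2 * a * f
    return usage * p_max + (1.0 - usage) * 0.10 * p_max   # idle fraction dissipates 10%

usage_per_unit = {"icache": 0.9, "dcache": 0.4, "window": 0.7, "alu": 0.6, "clock": 1.0}
total_power = sum(unit_power(unit_capacitance[u], usage_per_unit[u]) for u in unit_capacitance)
epc_nj = total_power / F * 1e9   # energy per cycle (EPC) in nJ/cycle
print(f"dynamic power: {total_power:.1f} W, EPC: {epc_nj:.1f} nJ/cycle")
```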

4.2. Timing issues

Another very important design issue, next to power consumption, obviously is cycle time. Indeed, the total execution time of a computer program is the product of CPI (the reciprocal of IPC, or the number of clock cycles per instruction), times the number of instructions executed, times the cycle time (measured in fractions of nanoseconds per cycle). As such, accurate cycle time estimates should be available in the earliest design stages to have a clear view on the overall performance. Whereas in older chip technologies estimating the cycle time of a hardware structure was fairly simple, predicting the cycle time for current chip technologies is not that straightforward. Indeed, in older chip technologies the delay of the gates was sufficient to make an accurate prediction of the total cycle time. Nowadays, the impact of the interconnects becomes more important than gate delays. This is due to the non-equal growth in speed of gates and interconnects, making interconnects relatively more important than gates in current chip technologies. Technology experts expect this trend to continue in the near future. This clearly has an important impact on the microarchitectural design. Consequently, floorplanning issues should be considered early in the design flow, using a tool such as PEPPER (Narayananan et al., 1995), which was successfully used within IBM for the design of high performance microprocessors. A floorplan determines the length and thus the delay of the wires interconnecting the various high-level structures on the chip. For example, if interconnecting two hardware structures takes too much time to be done within one clock cycle, an additional clock cycle might be needed, which increases the pipeline depth. Increasing the pipeline depth modifies the microarchitecture and thus the number of instructions executed per clock cycle, in turn affecting the overall performance. To measure this impact on overall performance, new architectural simulations need to be run. As such, considering these timing issues early in the design cycle is extremely important in order not to be surprised in terms of overall performance in a late design stage. In addition, making correct design decisions early in the design flow reduces the total simulation time, since this eliminates the necessity of running detailed and slow architectural simulations later on. In Section 6.4, we will show that statistical simulation can be used in combination with early floorplanning.

5. Methodology

This section discusses the benchmarks and the microarchitectures we have used in our evaluation.

4 An alternative approach would be to use a distribution of the data values produced in a program execution for generating synthetic data values in the synthetic trace. These synthetically generated data values could then be used to measure the switching activity in various structures.

5.1. Benchmarks In this study we used the SPECint95 benchmark suite 5 and the IBS benchmark suite (Uhlig et al., 1995) to evaluate. The SPECint95 traces were generated on a DEC 500ua station with an Alpha 21164 processor and were compiled with the DEC cc compiler version 5.6 with the optimization flags set to -O4 and linked statically using the -non_shared flag. Each trace contains approximately 200 million dynamic instructions and were carefully selected not to include initialization code. The IBS traces were generated on a MIPS-based DEC 3100 system running the Mach 3.0 operating system. The IBS traces are known to have a larger instruction footprint (due to the inclusion of significant amounts of operating system activity) and to stress the memory subsystem more than SPECint benchmarks do (Uhlig et al., 1995).

5

http://www.spec.org.

L. Eeckhout, K. De Bosschere / The Journal of Systems and Software 73 (2004) 45–62

instr 2 ...

instr 3 ...

...

instr w

...

crossbar

...

register renaming unit

...

fetch unit

instruction-cache

instr 1

mem unit 1 ...

...

mem unit m

data cache

issue bandwidth i

instruction window

instruction selection unit

56

non-mem unit 1 ... non-mem unit n

fetch bandwidth f ... register file

...

reorder buffer

...

reorder bandwidth r

Fig. 5. An out-of-order microarchitecture.

5.2. Out-of-order architecture

To validate the model, an out-of-order superscalar architecture was assumed, an architectural paradigm that is implemented in most contemporary microprocessors, such as the Alpha 21264 (Kessler et al., 1998), the Pentium 4 (Hinton et al., 2001), the MIPS R10000 (Yeager, 1996), etc. In an out-of-order architecture, see also Fig. 5, instructions are fetched from the I-cache, after which register renaming is performed. Register renaming eliminates WAR and WAW dependencies from the instruction stream; only real RAW data dependencies remain.⁶ Once the instructions are transformed into a static single assignment form, they are dispatched to the instruction window, where the instructions wait for their source operands to become available (data-flow execution). Each clock cycle, ready instructions are selected to be executed on a functional unit. The number of instructions that can be selected in one clock cycle is restricted to the issue bandwidth. Further, perfect bypassing was assumed; in other words, data-dependent instructions can be executed in consecutive cycles. The latencies of the instruction types are given in Table 5. All operations are fully pipelined, except for the divide. Once an instruction is executed, it can be retired when all previous instructions from the sequential instruction stream are retired. The number of instructions that can be retired in one clock cycle is restricted to the reorder bandwidth, and retiring takes one additional clock cycle. The parameters involved in our trace-driven simulator are the fetch bandwidth f, the window size w, the issue bandwidth i, the number of memory units m (in our case, m = i/2) and the reorder

⁶ As stated before, we assume perfect register renaming.

Table 5. Out-of-order architecture.
Instruction latencies: integer (1), load (3), multiply (8), FP (4), divide (18/31); fully pipelined except divide
Branch predictor: 8-bit gshare (4 KB), 4 KB bi-modal, 4 KB meta; 4-way 512-set BTB; 8-entry RAS
Caches: 32 KB DM L1 I$, 64 KB 2WSA L1 D$, 256 KB 4WSA L2; 10/80 cycles access time to L2/memory

bandwidth r. In this paper, the instruction window size is varied from 32 to 256 instructions; the issue width, which is the maximum number of instructions that can be selected to be executed per cycle, is varied from 4 to 12. The fetch bandwidth and the reorder bandwidth were chosen to be the same as the issue width, referred to as the processor width for short. More details on the architectures simulated can be found in Table 5. The reason why we chose wide-resource machines in our evaluation is that on such microarchitectures performance is more limited by program parallelism than by machine parallelism. This way, the capability of the statistical simulation methodology for modeling program parallelism is stressed appropriately. Consequently, if the technique is accurate for wide-resource machines, we can expect that the technique will also be useful for processors with less machine parallelism. The data presented in the evaluation section of this paper confirm that the prediction errors are smaller for small-resource machines than for wide-resource machines. As such, we can conclude that statistical simulation is useful for embedded system designs (typically small-resource machines due to power considerations) as well as for high-performance designs (typically wide-resource machines for optimal performance).
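The design space swept in the evaluation can be written down compactly, as in the sketch below: each configuration is an (instruction window size, processor width) pair, with the fetch and reorder bandwidth equal to the issue width and m = i/2 memory units. The step sizes (16 instructions for the window, 2 for the width) are an assumption based on the figures; the text itself only gives the 32–256 and 4–12 ranges.

```python
def design_space(window_sizes=range(32, 161, 16), widths=range(4, 11, 2)):
    """Enumerate the simulated configurations (step sizes are assumed, see above)."""
    for w in window_sizes:
        for i in widths:
            yield {
                "window": w,          # number of in-flight instructions
                "fetch": i,           # fetch bandwidth f
                "issue": i,           # issue bandwidth i
                "reorder": i,         # reorder bandwidth r
                "mem_units": i // 2,  # m = i / 2
            }

configs = list(design_space())
print(len(configs), "configurations")   # 9 window sizes x 4 widths = 36 design points
```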


6. Evaluation relative standard deviation

1.4%

This evaluation section is organized as follows. First, we will show that statistical simulation indeed is a fast simulation technique making it feasible in the earliest design stages. Second, we will discuss the absolute accuracy of statistical simulation for estimating performance and power. Third, we will evaluate its ability to explore design spaces reliably by focusing on the relative accuracy. Fourth, it will be shown that statistical simulation is useful when timing issues get involved.

6.1. Simulation speed

To evaluate the speed of statistical simulation, we have set up the following experiment. We have generated 40 different synthetic traces for each original trace using different random seeds. For each of these traces, performance was measured as a function of the number of instructions simulated. In Fig. 6, the relative standard deviation is plotted for several benchmarks as a function of the number of instructions simulated so far. The relative standard deviation is computed using the formula s_i/x_i, with s_i and x_i the standard deviation and the average IPC, respectively, after simulating i synthetically generated instructions. This graph shows that the standard deviation is less than 1% after simulating half a million instructions. Architectural simulations using real traces, on the other hand, require the simulation of billions of instructions, which leads us to the conclusion that statistical simulation indeed is a fast simulation technique.

Fig. 6. Relative standard deviation as a function of the number of synthetic instructions simulated (in millions) for the SPECint95 benchmarks. Comparable results were obtained for the IBS traces.
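As a rough illustration of this convergence experiment, the sketch below computes the relative standard deviation of IPC across synthetic traces generated with different random seeds. The simulate_synthetic_trace function, the seed range and the instruction budgets are purely illustrative placeholders, not part of the paper's tool.

```python
import statistics

def relative_stddev(ipc_samples):
    """Relative standard deviation s/x of the IPC measured across random seeds."""
    return statistics.stdev(ipc_samples) / statistics.mean(ipc_samples)

def convergence_curve(simulate, seeds, instruction_counts):
    """For each instruction budget n, run one synthetic trace per seed and
    record the relative standard deviation of the resulting IPC values."""
    curve = {}
    for n in instruction_counts:
        ipcs = [simulate(seed=s, num_instructions=n) for s in seeds]
        curve[n] = relative_stddev(ipcs)
    return curve

# Hypothetical usage: simulate_synthetic_trace stands in for the statistical
# simulator; stop growing the synthetic trace once the spread drops below 1%.
# curve = convergence_curve(simulate_synthetic_trace, seeds=range(40),
#                           instruction_counts=[100_000 * k for k in range(1, 11)])
# enough = min(n for n, rsd in curve.items() if rsd < 0.01)
```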

6.2. Absolute accuracy

In this section, we concentrate on the absolute power and performance prediction accuracy. Fig. 7 presents 'real' and 'estimated' measurements for IPC (top graphs) as well as for EPC (bottom graphs), obtained by simulating the real traces and their corresponding synthetic traces. In addition to the raw performance and energy-per-cycle estimates, these graphs also show the prediction errors (measured on the right Y axes). This was done for three microarchitecture configurations: a 4-issue, 32-entry window processor (on the left), a 6-issue, 64-entry window processor (in the middle) and an 8-issue, 128-entry window processor (on the right). These graphs show that statistical simulation is indeed a quite accurate technique. The performance prediction errors are no larger than 12%; the energy prediction errors are no larger than 5%. Note that, as we pointed out in Section 5, the IPC prediction error increases for wider-resource machines (6% for w = 32 and i = 4, 8% for w = 64 and i = 6, and 12% for w = 128 and i = 8); the energy per cycle prediction error increases from 2% to 5%. Note also that IPC is generally overestimated. This is because long critical paths in computer programs are poorly modeled through average dependencies (as imposed by statistical simulation), leading to many short critical paths and thus to IPC overestimations.

Fig. 7. Real and estimated IPC (upper graphs) and EPC (lower graphs), measured on the left Y axis, for the IBS traces and for three microarchitectures with window size w and issue width i (w = 32, i = 4; w = 64, i = 6; w = 128, i = 8). The prediction error is shown on the right Y axis.
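For clarity, the prediction error plotted on the right Y axes can be read as the signed relative difference between the estimated and the 'real' (detailed-simulation) value, which is consistent with the overestimated IPC showing up as positive errors. The helper below is a minimal sketch under that assumption; the per-benchmark dictionaries are hypothetical inputs.

```python
def prediction_error(estimated, real):
    """Signed relative prediction error; positive means the metric is overestimated."""
    return (estimated - real) / real

def summarize_errors(estimated_by_bench, real_by_bench):
    """Per-benchmark errors plus the average error magnitude across benchmarks."""
    errors = {b: prediction_error(estimated_by_bench[b], real_by_bench[b])
              for b in real_by_bench}
    return errors, sum(abs(e) for e in errors.values()) / len(errors)
```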

6.3. Relative accuracy

Being an early design stage method, statistical simulation is aimed at exploring huge design spaces in limited time, for which relative accuracy (the relationship between multiple design points) is more important than absolute accuracy (in one single design point). Figs. 8 and 9 show the relative errors in IPC and EPC along the processor width axis and the window size axis, respectively. For example, in Fig. 8 on the left, the relative error in IPC when increasing the processor width from 8 to 10 in a 160-entry window machine is about -0.25%; the relative error in IPC when increasing the processor width from 4 to 6 in a 32-entry window machine is about 1.8%. From Figs. 8 and 9, we conclude that (i) the relative error in IPC is no larger than 2% and the relative error in EPC is no larger than 2.5% in absolute terms; and (ii) the relative error along the processor width axis is larger than along the window size axis. Recall that in Section 2 we concluded that the relative accuracy along the window size axis is more critical than the relative accuracy along the processor width axis. As such, this result is beneficial for statistical simulation.
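The exact definition of the relative error is given earlier in the paper; one plausible formulation, sketched below, is the error on the estimated ratio between two neighbouring design points. The configuration keys and metric dictionaries are illustrative assumptions, not the paper's interface.

```python
def relative_error(metric_est, metric_real, config_a, config_b):
    """Error on the predicted change when moving from config_a to config_b.

    metric_est and metric_real map a configuration key, e.g. (width, window),
    to the estimated and detailed-simulation value of IPC or EPC. The RE is
    assumed here to compare the estimated ratio B/A against the real ratio B/A.
    """
    est_ratio = metric_est[config_b] / metric_est[config_a]
    real_ratio = metric_real[config_b] / metric_real[config_a]
    return est_ratio / real_ratio - 1.0

# Hypothetical sweep along the processor width axis at a fixed 32-entry window:
# for a, b in [((4, 32), (6, 32)), ((6, 32), (8, 32)), ((8, 32), (10, 32))]:
#     print(a, "->", b, relative_error(ipc_est, ipc_real, a, b))
```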


Fig. 8. The relative error of statistical simulation in IPC (on the left) and EPC (on the right) along the processor width axis.


Fig. 9. The relative error of statistical simulation in IPC (on the left) and EPC (on the right) along the window size axis.



Fig. 10. EDP as a function of window size and processor width for detailed simulation (on the left) and statistical simulation (on the right). The white bar denotes the optimal configuration.

Note also that the relative accuracy is not constant over the complete design space, in contrast to the theoretical study given in Section 2. In Fig. 10, the 'real' EDP (obtained through detailed architectural simulations, on the left) and the 'estimated' EDP (obtained through statistical simulation, on the right) are shown as a function of processor width and window size. These data are averages over the SPECint95 traces. The most energy-efficient architecture is the one with the lowest energy-delay product, thus maximizing performance at a reasonable energy per cycle. The 'real' EDP numbers identify the 6-wide, 48-entry window configuration as the most energy-efficient microarchitecture. This configuration is shown by means of a white bar in Fig. 10. The same configuration is identified by statistical simulation, on the right in Fig. 10. Note that, in spite of the non-constant relative error in IPC and EPC, a correct design decision is made.

In Fig. 11 the same is shown in case the optimization criterion is ED2P. Detailed architectural simulation identifies the 8-wide, 80-entry window machine as the most energy-efficient; statistical simulation, on the other hand, identifies the 10-wide, 96-entry window machine. Statistical simulation clearly fails to identify the optimal design. Note, however, that the ED2P surface around the optimal configuration is flat, making it difficult to identify the most energy-efficient design. Statistical simulation nevertheless succeeds in identifying the region of near-optimal designs.
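As a sketch of how such a decision could be automated, the code below scores each (width, window size) configuration by an energy-efficiency metric and picks the minimum. The formulations EDP = EPC/IPC^2 and ED2P = EPC/IPC^3 (energy per instruction times delay per instruction, squared for ED2P) are common conventions assumed here; the paper's own definitions appear earlier in the text.

```python
def edp(ipc, epc):
    """Energy-delay product per instruction, assuming EDP = (EPC / IPC) * (1 / IPC)."""
    return epc / ipc ** 2

def ed2p(ipc, epc):
    """Energy-delay-squared product per instruction: (EPC / IPC) * (1 / IPC) ** 2."""
    return epc / ipc ** 3

def most_efficient(results, metric=edp):
    """results maps (width, window_size) -> (ipc, epc); return the configuration
    that minimizes the chosen energy-efficiency metric."""
    return min(results, key=lambda cfg: metric(*results[cfg]))

# The design decision is 'correct' when detailed and statistical simulation agree:
# most_efficient(real_results, ed2p) == most_efficient(estimated_results, ed2p)
```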


Fig. 11. ED2P as a function of window size and processor width for detailed simulation (on the left) and statistical simulation (on the right). The white bar denotes the optimal configuration.


6.4. Timing issues

As discussed before, the impact of interconnect delay, and thus of floorplanning, is rapidly increasing as feature sizes get smaller and smaller in current chip technologies. Consider the case where an early floorplanner determines that, in order to obtain a certain clock frequency, additional pipeline stages need to be inserted in the microprocessor pipeline to transport data from one place to another on the chip.


Fig. 12. Performance degradation due to the insertion of (i) one pipeline stage between the data-cache and the memory units in the processor core (upper graph); (ii) one pipeline stage in the front-end pipeline (middle graph); and (iii) two pipeline stages in the front-end pipeline (bottom graph). These data were obtained for a 12-issue 256-entry window machine.


For example, adding a pipeline stage in the front-end pipeline of a microprocessor (e.g., for transporting data from the instruction-cache to the processor core) does not have the same impact on performance as a pipeline stage in the back-end (e.g., for transporting data from the data-cache to the memory units in the processor). Note that these two design options were taken in the Intel Pentium 4 (Hinton et al., 2001) and the Compaq Alpha 21264 (Kessler et al., 1998), respectively. In this section, we demonstrate that statistical simulation is accurate in determining where additional pipeline stages can be inserted in the microprocessor with the least performance degradation. Fig. 12 shows (i) the performance degradation due to inserting a pipeline stage between a memory unit and the data-cache, effectively increasing the latency of load instructions by one cycle (upper graph); (ii) the performance degradation due to inserting one pipeline stage in the front-end pipeline (middle graph); and (iii) the performance degradation due to inserting two pipeline stages in the front-end pipeline (bottom graph). These data show that statistical simulation is capable of accurately estimating the performance impact of these microarchitectural modifications. For example, inserting one pipeline stage in the front-end degrades performance less than inserting one in the back-end for the IBS traces; the opposite is true for the SPECint95 benchmarks. In both cases, the statistical simulation technique makes the correct decision.
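A minimal sketch of the decision being checked here: given the baseline IPC and the IPCs after each modification (obtained from either the real trace or the synthetic trace), compute the fractional performance degradation and pick the placement that hurts least. The function names and inputs are illustrative assumptions.

```python
def degradation(base_ipc, modified_ipc):
    """Fractional performance loss caused by a microarchitectural change."""
    return (base_ipc - modified_ipc) / base_ipc

def preferred_placement(base_ipc, ipc_front_stage, ipc_back_stage):
    """Return where the extra pipeline stage hurts performance least."""
    front = degradation(base_ipc, ipc_front_stage)
    back = degradation(base_ipc, ipc_back_stage)
    return "front-end" if front <= back else "back-end"

# Statistical simulation makes the right call when it returns the same placement
# as detailed simulation of the real trace, for each benchmark suite.
```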

7. Conclusion

It is increasingly important to expose the impact of technology effects to computer architects while defining the microarchitecture of a new microprocessor, since these aspects have repercussions at the architectural level. Indeed, power consumption and interconnect delay are becoming major issues in current and future microprocessor designs. In addition, the simulation time, even at the architectural level, increases dramatically as architectures and applications become more and more complex. Early design stage tools are therefore useful to explore these huge design spaces efficiently. In this paper, we have addressed the question of how accurate these design tools should be in order to be useful. In this analysis, we focused on relative accuracy, which is more relevant than absolute accuracy at such an early stage of the design. We conclude that it is possible to reason about the required accuracy of development tools and that the required accuracy depends on the optimization criterion, which is extremely valuable information for tool developers and for designers using these tools. As a case study, we have also evaluated the relative accuracy of statistical simulation as an early design stage method. We have verified that viable design decisions can be made in limited time using statistical simulation while considering instruction throughput, cycle time and power consumption.

References

Black, B., Shen, J.P., 1998. Calibration of microprocessor performance models. IEEE Computer 31 (5), 59–65.
Bose, P., Conte, T.M., 1998. Performance analysis and its impact on design. IEEE Computer 31 (5), 41–49.
Brooks, D., Bose, P., Schuster, S.E., Jacobson, H., Kudva, P.N., Buyuktosunoglu, A., Wellman, J.-D., Zyuban, V., Gupta, M., Cook, P.W., 2000a. Power-aware microarchitecture: Design and modeling challenges for next-generation microprocessors. IEEE Micro 20 (6), 26–44.
Brooks, D., Martonosi, M., Bose, P., 2000b. Abstraction via separable components: An empirical study of absolute and relative accuracy in processor performance modeling. Tech. Rep. RC 21909, IBM Research Division, T.J. Watson Research Center, December.
Brooks, D., Martonosi, M., Wellman, J.-D., Bose, P., 2000c. Power-performance modeling and tradeoff analysis for a high end microprocessor. In: Proceedings of the Power-Aware Computer Systems (PACS'00) held in conjunction with ASPLOS-IX, November.
Brooks, D., Tiwari, V., Martonosi, M., 2000d. Wattch: A framework for architectural-level power analysis and optimizations. In: Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA-27), June, pp. 83–94.
Carl, R., Smith, J.E., 1998. Modeling superscalar processors via statistical simulation. In: Workshop on Performance Analysis and its Impact on Design (PAID-98), held in conjunction with the 25th Annual International Symposium on Computer Architecture (ISCA-25), June.
Desikan, R., Burger, D., Keckler, S.W., 2001. Measuring experimental error in microprocessor simulation. In: Proceedings of the 28th Annual International Symposium on Computer Architecture (ISCA-28), July, pp. 266–277.
Dhodapkar, A., Lim, C.H., Cai, G., Daasch, W.R., 2000. TEM2P2EST: A thermal enabled multi-model power/performance estimator. In: Proceedings of the Power-Aware Computer Systems (PACS'00) held in conjunction with ASPLOS-IX, November.
Eeckhout, L., 2002. Accurate statistical workload modeling. Ph.D. thesis, Ghent University, Belgium, available at http://www.elis.rug.ac.be/~leeckhou, December.
Eeckhout, L., De Bosschere, K., 2001. Early design phase power/performance modeling through statistical simulation. In: Proceedings of the 2001 International IEEE Symposium on Performance Analysis of Systems and Software (ISPASS-2001), November, pp. 10–17.
Eeckhout, L., Stroobandt, D., De Bosschere, K., 2003. Efficient microprocessor design space exploration through statistical simulation. In: Proceedings of the 36th Annual Simulation Symposium, April, pp. 233–240.
Ghiasi, S., Grunwald, D., 2000. A comparison of two architectural power models. In: Proceedings of the Power-Aware Computer Systems (PACS'00) held in conjunction with ASPLOS-IX, November.
Gibson, J., Kunz, R., Ofelt, D., Horowitz, M., Hennessy, J., Heinrich, M., 2000. FLASH vs. (simulated) FLASH: Closing the simulation loop. In: Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX), November, pp. 49–58.
Hinton, G., Sager, D., Upton, M., Boggs, D., Carmean, D., Kyker, A., Roussel, P., 2001. The microarchitecture of the Pentium 4 processor. Intel Technology Journal (Q1).
Hrishikesh, M.S., Burger, D., Jouppi, N.P., Keckler, S.W., Farkas, K.I., Shivakumar, P., 2002. The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays. In: Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA-29), May, pp. 14–24.
Hsieh, C., Pedram, M., 1998. Micro-processor power estimation using profile-driven program synthesis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 17 (11), 1080–1089.
Kessler, R.E., McLellan, E.J., Webb, D.A., 1998. The Alpha 21264 microprocessor architecture. In: Proceedings of the 1998 International Conference on Computer Design (ICCD-98), October, pp. 90–95.
Loh, G., 2001. A time-stamping algorithm for efficient performance estimation of superscalar processors. In: Proceedings of the International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS-2001), May, pp. 72–81.
Narayananan, V., LaPotin, D., Gupta, R., Vijayan, G., 1995. PEPPER – a timing driven early floorplanner. In: Proceedings of the 1995 International Conference on Computer Design: VLSI in Computers and Processors (ICCD-1995), October, pp. 230–235.
Noonburg, D.B., Shen, J.P., 1997. A framework for statistical modeling of superscalar processor performance. In: Proceedings of the Third International Symposium on High-Performance Computer Architecture (HPCA-3), February, pp. 298–309.
Nussbaum, S., Smith, J.E., 2001. Modeling superscalar processors via statistical simulation. In: Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques (PACT-2001), September, pp. 15–24.
Ofelt, D.J., Hennessy, J.L., 2000. Efficient performance prediction for modern microprocessors. In: Proceedings of the International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS-2000), June, pp. 229–239.
Oskin, M., Chong, F.T., Farrens, M., 2000. HLS: Combining statistical and symbolic simulation to guide microprocessor design. In: Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA-27), June, pp. 71–82.
Srinivasan, V., Brooks, D., Gschwind, M., Bose, P., Zyuban, V., Strenski, P.N., Emma, P.G., 2002. Optimizing pipelines for power and performance. In: Proceedings of the 35th Annual International Symposium on Microarchitecture (MICRO-35), November, pp. 333–344.
Steinhaus, M., Kolla, R., Larriba-Pey, J.L., Ungerer, T., Valero, M., 2001. Transistor count and chip-space estimation of SimpleScalar-based microprocessor models. In: Proceedings of the 2001 Workshop on Complexity-Effective Design held in conjunction with the 28th Annual International Symposium on Computer Architecture, June.
Sugumar, R.A., Abraham, S.G., 1993. Efficient simulation of caches under optimal replacement with applications to miss characterization. In: Proceedings of the 1993 ACM Conference on Measurement and Modeling of Computer Systems (SIGMETRICS'93), pp. 24–35.
Uhlig, R., Nagle, D., Mudge, T., Sechrest, S., Emer, J., 1995. Instruction fetching: Coping with code bloat. In: Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA-22), June, pp. 345–356.
Ye, W., Vijaykrishnan, N., Kandemir, M., Irwin, M.J., 2000. The design and use of SimplePower: A cycle-accurate energy estimation tool. In: Proceedings of the 37th Design Automation Conference, June, pp. 340–345.
Yeager, K.C., 1996. MIPS R10000 superscalar microprocessor. IEEE Micro 16 (2).