Path-based next N trace prefetch in trace processors


Kai-feng Wang*, Zhen-zhou Ji, Ming-zeng Hu

School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, People's Republic of China

Received 31 July 2004; revised 27 October 2004; accepted 31 October 2004. Available online 26 November 2004.

* Corresponding author. Tel.: +86 0451 86416614. E-mail address: [email protected] (K.-f. Wang).

Microprocessors and Microsystems 29 (2005) 273-288. doi:10.1016/j.micpro.2004.10.009

Abstract

The performance of a trace processor depends to a great extent on trace cache efficiency. A high trace cache miss rate reduces performance significantly, because no traces can be dispatched to the back-end PEs from the moment a trace cache miss occurs until the missing trace has been constructed. When running applications with large working sets, a high capacity miss rate is inevitable given the relatively small capacity of the trace cache, and with the ever-increasing scale of conventional applications this problem will become more severe. To address the high capacity miss rate, this paper augments the conventional one-level trace cache with a two-level trace cache. We found that adding a two-level trace cache improves performance only to a limited degree because of its long access latency. To reduce the access latency of the two-level trace cache, we propose a path-based next N trace prefetch mechanism, which prefetches the next N trace after the currently running trace with the help of path-based next N trace prediction, an extension of the path-based next trace predictor. Simulation results show that the path-based next N trace prefetch mechanism with prefetch distance three attains an 11.3% performance improvement over the conventional one-level trace cache mechanism on eight SPECint95 benchmarks. © 2004 Elsevier B.V. All rights reserved.

Keywords: Trace processor; Performance metric; Prefetch mechanism

1. Introduction

In a trace processor [11], the trace cache [8,13] plays an important role in providing traces to the back-end PEs. During application execution, the trace cache is accessed with the predicted next trace_id supplied by the next-trace predictor [9]. If the desired trace exists in the trace cache, it is fetched and dispatched as fast as possible. Otherwise, a trace construction process is issued, and trace fetch and dispatch must stall until the missing trace has been constructed. The construction process lasts from several to a few hundred cycles, depending on the instruction cache latency and the trace selection method. During those cycles, no traces can be dispatched to the idle back-end PEs, which reduces trace processor performance significantly. A lower trace cache miss rate ensures higher performance. Unfortunately, the efficiency of the trace cache is affected by application behavior in addition to trace cache organization,

storage management and replacement policy. Although much research has addressed trace cache efficiency [1-5], the desire for ever more powerful computation keeps increasing the scale of conventional applications at an amazing speed, which means that a great number of traces are generated when executing those applications. Given the limited trace cache capacity, applications with large trace working sets put heavy pressure on the trace cache and inevitably cause higher miss rates. For example, Table 1 shows the hit rate of a 1K-entry (128 KB), two-way one-level trace cache for eight SPECint95 benchmarks (the trace selection method, trace cache structure and benchmark descriptions are given in Section 3). go and gcc have the two largest static trace working sets among the eight benchmarks and, as expected, the lowest trace cache hit rates. Conventionally, trace cache misses can be classified into two types: compulsory misses and capacity misses. A compulsory miss occurs when a given dynamic trace has not been observed before, because a new instruction path is followed or a new code section is entered. On the other hand, during


Table 1
Trace cache hit rate

Benchmark   Hit rate (%)   Static traces   Dynamic traces
go          55.47          55,593          3,670,327
gcc         75.41          49,809          4,170,184
li          89.36          1121            5,069,981
m88ksim     96.25          1649            4,259,171
perl        87.75          3958            4,714,830
vortex      91.91          8959            3,920,621
ijpeg       89.44          5742            3,222,682
compress    87.71          698             4,287,312

the execution of an application, old traces are replaced by newly generated traces because of the limited trace cache capacity, and when a replaced trace is needed again, a capacity miss occurs. Compulsory misses can be partially addressed with trace pre-construction [6] techniques. Capacity misses can be reduced by increasing trace cache capacity and set associativity, adding a victim cache, and other similar techniques. But out of consideration for cycle time, the room for increasing trace cache capacity and set associativity is very limited. Clearly, optimizing one-level trace cache efficiency alone cannot solve the capacity problem and gives only limited benefits. In this paper, a trace cache hierarchy is evaluated that introduces an additional two-level trace cache to address the capacity problem. To boost performance further, a path-based next N trace prefetch mechanism is proposed to reduce the long access latency of the two-level trace cache. The prefetch mechanism is very simple and easy to implement: in addition to the two-level trace cache and a small fully associative trace prefetch buffer, only a small modification of the path-based next trace predictor [9] is required. Finally, a small trace execution history buffer is used to capture the trace history. This paper is organized as follows. Section 2 introduces related work. Section 3 describes the simulation environment, benchmark selection and the metrics used to evaluate the path-based next N trace prefetch mechanism. Section 4 discusses the trace cache hierarchy and the principle of the path-based next N trace prefetch mechanism. Section 5 shows the experimental results. Section 6 discusses some design space issues of path-based next N trace prefetch. Section 7 summarizes and concludes the paper.

2. Related work

The trace cache, proposed by Rotenberg et al. [8], can provide multiple basic blocks per cycle to satisfy the increasing instruction demand of high issue-bandwidth superscalar processors. It is a special instruction cache that caches snapshots of the dynamic instruction stream, referred to as traces. A trace is fully specified by a starting

address and a sequence of branch outcomes, which describe the path followed. Because of the temporal locality and biased branch behavior of conventional applications, an executed trace is likely to be executed again in the near future. So the executed trace is kept in the trace cache, and when it is needed again the whole trace can be fetched from the trace cache in one cycle instead of a multi-cycle instruction cache access, which improves fetch bandwidth significantly. To improve the accuracy of next-trace prediction, Jacobson et al. [9] proposed a path-based next trace predictor, which treats traces as basic units and explicitly predicts sequences of traces. The trace sequence history is collected with history registers, and a correlated predictor makes predictions based on these histories. To reduce performance losses due to cold starts and aliasing, a small alternate predictor is used, and to avoid losing path history across procedure calls and returns, a return history stack keeps the path history information. Some traces may have the same starting address but different internal branch outcomes: they are different instruction sequences that start at the same point but follow different paths. When the desired trace matches one of the trace instances in the trace cache with the same starting address but different internal branch outcomes, a partial match occurs. In [3,12], two techniques referred to as partial matching and inactive issue are evaluated. In those papers, traces with the same starting address but different internal branch outcomes can reside in the trace cache simultaneously. When the trace cache can only provide a partially matched trace to the back-end PEs, all the blocks within the trace are still issued whether or not they match the predicted path. The blocks that do not match the prediction are said to be issued inactively and are considered not valid for the subsequent issue cycle. The advantage arises when the prediction is incorrect: the instructions on the other path are already in the pipeline instead of having to be fetched again from the instruction cache or trace cache. Two further techniques, branch promotion and trace packing, are explored in [2]. The former dynamically converts strongly biased branches into branches with static predictions, and the latter packs trace segments with as many instructions as will fit, without the block boundary limitation. In [5], Ramirez, Larriba-Pey and Valero proposed a trace filter mechanism, which filters out traces that contain no taken branches, since they contribute little to improving instruction fetch bandwidth. In [1], Friendly et al. explore several fill unit optimizations for trace cache processors, including register moves, reassociation, scaled adds and instruction placement. To reduce the heavy redundancy in conventional trace caches, a block-based trace cache is proposed by Black et al. in [4]. Furthermore, a completion-time Tree-based Multiple Branch Predictor (TMP) [7] is proposed by Rakvic et al. to address the poor accuracy of conventional multiple-branch predictors.

3. Simulation environment

3.1. Trace processor model

In this paper, a detailed, fully execution-driven trace processor [11] simulator is used to evaluate the path-based next N trace prefetch mechanism. The simulator is built on top of the SimpleScalar toolset [10]. It has been modified to support multiple processing elements, a trace cache [8], a path-based next trace predictor [9], a data value predictor [15] and an ARB [14] structure. The PISA instruction set is adopted, and the simulator accurately models the execution of every instruction at each pipeline stage. It supports running multiple processing elements simultaneously with out-of-order execution within each processing element, broadcasting local and global values, speculating on data values, and selectively reissuing instructions when a misspeculation occurs. The default configuration of the trace processor is listed in Table 2. We choose a relatively large one-level trace cache with 1024 entries (128 KB) in anticipation of future techniques. A relatively large trace predictor (64K entries for both the correlated and alternate tables) is used to explore the potential benefits of the path-based next N trace prefetch mechanism. The correlated table index is formed from the sequence of past hashed_ids (hashed versions of trace_ids) with the hash function referred to as 'DOLC' in [9]: 'D'epth is the number of hashed_ids, besides the most recent one, that are used to form the index; 'O'lder is the number of bits taken from each hashed_id except the two most recent; 'L'ast is the number of bits selected from the second most recent hashed_id; and 'C'urrent is the number of bits selected from the most recent hashed_id.
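As an illustration, the sketch below folds a history of hashed_ids into a correlated table index using the DOLC = {7,5,7,11} configuration of Table 2. The bit widths follow the definition above, but the XOR-folding and bit placement are assumptions; the paper does not specify how the selected bits are combined.

```c
#include <stdint.h>

/* DOLC = {7,5,7,11}, the default configuration in Table 2. */
#define DEPTH    7   /* hashed_ids used besides the most recent one     */
#define OLDER    5   /* bits taken from each older hashed_id            */
#define LAST     7   /* bits taken from the second most recent id       */
#define CURRENT 11   /* bits taken from the most recent id              */
#define IDX_BITS 16  /* index width for the 64K-entry correlated table  */

/* history[0] is the most recent hashed_id, history[1] the second most
 * recent, and so on; at least DEPTH + 1 entries must be valid. */
uint32_t dolc_index(const uint16_t *history)
{
    /* CURRENT bits of the most recent id form the base of the index */
    uint32_t idx = history[0] & ((1u << CURRENT) - 1);
    /* XOR-fold LAST bits of the second most recent id at an offset */
    idx ^= (uint32_t)(history[1] & ((1u << LAST) - 1)) << 5;
    /* fold OLDER bits of each remaining id in at rotating positions */
    for (int i = 2; i <= DEPTH; i++) {
        uint32_t bits = history[i] & ((1u << OLDER) - 1);
        idx ^= bits << ((i * OLDER) % (IDX_BITS - OLDER));
    }
    return idx & ((1u << IDX_BITS) - 1);
}
```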


The trace prefetch buffer size is fixed at 16 entries (about 2 KB). All the experiments in Section 5 use this default configuration. In Section 6, when we explore the design space of path-based next N trace prefetch, different configurations of the trace cache, the path-based next trace predictor and the 'DOLC' mechanism will be used; the details of those configurations are discussed there. Obviously, the trace selection method impacts the quality of traces and the efficiency of all trace processor components, and different trace selection methods will no doubt yield different results when evaluating the path-based next N trace prefetch mechanism. The impact of trace selection methods on prefetch efficiency deserves further exploration. The objective of this paper is to introduce the principle of the path-based next N trace prefetch mechanism; a study of the relation between trace selection and next N trace prediction is beyond its scope, so a detailed discussion of trace selection methods is not included here. In this paper, we chose a simple trace selection method: (1) the maximum length of any trace is 32 instructions; (2) trace construction ends upon meeting an indirect branch, trap or syscall instruction. It must be noted that this simple trace selection method is not specially tuned for our path-based next N trace prefetch study, and other trace selection methods are also feasible (a sketch of this rule follows Table 2).

3.2. Benchmarks

All SPECint95 benchmarks were selected for in-depth analysis and presentation in our studies because most applications in daily use are integer programs. Although the evaluation of floating-point benchmarks is omitted, we believe that our path-based next N trace prefetch mechanism also works on those programs. It is expected that higher path-based next N trace prediction accuracy can be achieved on floating-point programs than on integer programs because of their simpler control-flow behavior, which benefits the prefetch mechanism by reducing the number of wrong prefetch operations. All the benchmarks used in this paper are compiled with the SimpleScalar compiler, a derivative of gcc-2.7.2, and all are simulated for their first 100 million instructions. The input data sets are listed in Table 3. Since the train run of the benchmark perl was shorter than 100 million instructions, the train input file scrabble.in was modified: two words, abodome and evilds, were added, so the file now contains {zed, veil, vanity, abodome, evilds}.

Table 2
Default configuration of trace processor

Parameter                   Value
Trace predictor             64K-entry correlated table, DOLC = {7,5,7,11}; 64K-entry alternate table; 32-entry RHS
One-level trace cache       1024-entry, 32 insts per trace, two-way mapped, LRU
Two-level trace cache       8K-entry, 32 insts per trace, eight-way mapped, LRU
Trace prefetch buffer       16-entry, 32 insts per trace, fully associative, LRU
Branch predictor            64K-entry bimodal predictor
L1 ICache                   64 KB, 64-byte blocks, two-way mapped
L1 DCache                   64 KB, 64-byte blocks, four-way mapped
L2 Cache                    2 MB, 128-byte blocks, four-way mapped
Memory                      Ideal
Cache latencies             IL1 hit: 1 cycle; IL1 miss: 12 cycles; DL1 hit: 1 cycle; DL1 miss: 12 cycles
Memory latencies            Memory latency: 80, 1; TLB latency: 30
Global physical registers   Unlimited
PE number                   4
Function units per PE       8 IALU, 8 IMUL, 8 MEMPORT, 8 FALU, 8 FMUL
Decode width per PE         8
Issue width per PE          4
Instruction latencies       Integer multiplication, 6-6; integer division, 35-35; all other integer instructions, 1-1; floating-point convert, 2-1; floating-point multiplication, 2-1; floating-point division, 12-14; floating-point sqrt, 18-20; all other floating-point instructions, 2-1
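Returning to the trace selection rule of Section 3.1, here is a minimal sketch: traces grow to at most 32 instructions and end early at an indirect branch, trap or syscall. The classify() and next_pc() callbacks are hypothetical stand-ins for the simulator's decoder; they are not part of the paper.

```c
#define MAX_TRACE_LEN 32

enum inst_kind { INST_OTHER, INST_INDIRECT_BR, INST_TRAP, INST_SYSCALL };

/* Fills trace_pcs with the PCs selected into one trace and returns the
 * trace length. classify() and next_pc() are assumed decoder hooks;
 * next_pc() follows the predicted path through the code. */
int select_trace(unsigned long start_pc, unsigned long *trace_pcs,
                 enum inst_kind (*classify)(unsigned long),
                 unsigned long (*next_pc)(unsigned long))
{
    int len = 0;
    unsigned long pc = start_pc;
    while (len < MAX_TRACE_LEN) {        /* rule (1): at most 32 insts */
        trace_pcs[len++] = pc;
        enum inst_kind k = classify(pc);
        if (k == INST_INDIRECT_BR || k == INST_TRAP || k == INST_SYSCALL)
            break;                       /* rule (2): end construction */
        pc = next_pc(pc);
    }
    return len;
}
```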


Table 3
Benchmarks

Benchmark   Spec dir.   Input data set            Simulation insts
go          -           99                        First 100 million
gcc         ref         -O3 genrecog.i            First 100 million
compress    -           40000 e 2231              First 100 million
m88ksim     train       -c < ctl.in               First 100 million
li          test        test.lsq (queens 7)       First 100 million
perl        train       scrable.pl < scrable.in   First 100 million
vortex      train       vortex.in                 First 100 million
ijpeg       train       vigo.ppm                  First 100 million

3.3. Performance metrics

In addition to IPC, we use three other important metrics to evaluate the effectiveness of the path-based next N trace prefetch mechanism: prefetch accuracy, prefetch coverage rate and prefetch timeliness. Prefetch accuracy indicates the fraction of prefetched traces that were actually used. A good prefetch mechanism should have high prefetch accuracy; otherwise most prefetched traces do not help to improve performance. Prefetch accuracy is defined as

$$\mathrm{Accuracy} = \frac{\mathit{pref}_{hit}}{\mathit{pref}_{total}} \times 100$$

where pref_hit is the number of prefetches issued that result in a trace prefetch buffer hit and pref_total is the total number of prefetches issued. While a highly accurate prefetch mechanism is desirable, it is of little use if it prefetches only a small number of traces. Prefetch coverage rate quantifies this effect by measuring the number of useful prefetches as a fraction of the total one-level trace cache misses. Prefetch coverage rate is defined as

$$\mathrm{Coverage} = \frac{\mathit{pref}_{hit}}{\mathit{Trace\_Cache\_Miss}_{L1}} \times 100$$

where pref_hit is as above and Trace_Cache_Miss_L1 is the total number of one-level trace cache misses. In addition to high accuracy and coverage rate, a good prefetcher should provide timely traces, which is reflected by the third metric, the prefetch timeliness rate. This metric indicates the fraction of traces offered by the trace prefetcher that arrive before they are needed, but not so early that they must be discarded before they can be used.
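These three ratios map directly onto event counters a simulator can keep. A minimal sketch with hypothetical counter names (the paper defines only the ratios; Table 4 reports late, timely and useless prefetches over issued prefetches, which is the base used for timeliness here):

```c
/* Hypothetical event counters accumulated during simulation. */
struct prefetch_stats {
    unsigned long pref_total;  /* prefetch operations issued            */
    unsigned long pref_hit;    /* prefetches that hit in the buffer     */
    unsigned long pref_timely; /* returned before first access and used */
    unsigned long l1_tc_miss;  /* one-level trace cache misses          */
};

double accuracy(const struct prefetch_stats *s)
{
    return 100.0 * s->pref_hit / s->pref_total;
}

double coverage(const struct prefetch_stats *s)
{
    return 100.0 * s->pref_hit / s->l1_tc_miss;
}

double timeliness(const struct prefetch_stats *s)
{
    /* the complement of the late and useless fractions in Table 4 */
    return 100.0 * s->pref_timely / s->pref_total;
}
```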

4. Trace cache hierarchy and path-based next N trace prefetch

4.1. Trace cache hierarchy

A trace cache hierarchy is a natural choice for solving the limited-capacity problem of the one-level trace cache without significantly increasing machine cycle time. In this paper, we evaluate a trace cache hierarchy consisting of a one-level trace cache and a two-level trace cache. The lower two-level trace cache has large capacity and set associativity but long access latency; the one-level trace cache has relatively small capacity and set associativity but shorter access latency. The two-level trace cache keeps traces that are replaced out of the one-level trace cache. The diagram of the trace cache hierarchy is shown in Fig. 1. In the trace fetch stage, the two-level trace cache is accessed simultaneously with the one-level trace cache in order to reduce the one-level trace cache miss penalty. If the one-level trace cache misses but the desired trace exists in the two-level trace cache, it is supplied by the two-level trace cache, provided the two-level trace cache can deliver it faster than the trace construction unit. For performance reasons, when the one-level trace cache misses, the trace construction process and the two-level trace cache access proceed simultaneously. If the trace construction process completes before the two-level trace cache access, the trace is provided by the trace construction unit and the unfinished two-level trace cache access is cancelled. Otherwise, the trace is provided by the two-level trace cache and the rest of the trace construction process is squashed.
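The selection between the two suppliers reduces to a few lines. A minimal sketch, assuming both latencies are known up front; in hardware the two units simply race and the loser is cancelled, as Fig. 1 and the text describe:

```c
#include <stdbool.h>

/* On a one-level trace cache miss, the two-level access and trace
 * construction start together and the faster one supplies the trace.
 * Returns the cycles until the trace is ready; *from_l2 reports the
 * winner (the loser is cancelled/squashed). */
int handle_l1_miss(bool in_l2, int l2_latency, int build_latency,
                   bool *from_l2)
{
    if (in_l2 && l2_latency < build_latency) {
        *from_l2 = true;            /* L2 wins; squash construction   */
        return l2_latency;
    }
    *from_l2 = false;               /* fill unit wins; cancel L2 read */
    return build_latency;
}
```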

Fig. 1. Trace cache hierarchy fetch mechanism.


Fig. 2. Performance improvement with trace cache hierarchy.

Simulation results in Fig. 2 show that adding a two-level trace cache improves performance for some benchmarks, but the performance improvement is bounded by the long access latency of the two-level trace cache. In Fig. 2, only three benchmarks (go, gcc and vortex) show significant improvements (5.7% for go, 5.8% for gcc and 2.7% for vortex); the other five benchmarks show no improvement at all. The reason for this phenomenon is that even if the trace exists in the two-level trace cache, detecting and fetching it may consume more than 10 cycles, which is longer than the trace construction latency in most cases. This conclusion is supported by the data in Figs. 3 and 4. Fig. 3 shows the distribution of traces according to their construction latency, and Fig. 4 shows the distribution of trace construction cycles according to their construction latency. Fig. 3 shows that only a few traces cost more than 12 cycles to construct, which means the two-level trace cache has little chance to be utilized, because most traces are provided by the trace construction unit. From Fig. 4, we see that go, gcc and vortex have a relatively large fraction of construction cycles spent on trace constructions whose latencies exceed 12 cycles. This explains why only go, gcc and

Fig. 3. Trace construction distribution.


Fig. 4. Trace construction cycle distribution.

vortex show significant IPC improvements while the other five do not.

4.2. Path-based next N trace prediction

The data in Fig. 2 show that the trace cache hierarchy solves the capacity problem only partially: it cannot deliver much performance improvement because of the long access latency of the two-level trace cache. To put the two-level trace cache to better use, we propose a path-based next N trace prefetch mechanism that reduces the effective access latency of the two-level trace cache by prefetching traces from the two-level trace cache into a small trace prefetch buffer, which has the same access latency as the one-level trace cache. The path-based next N trace prefetch mechanism works on top of path-based next N trace prediction; the key idea is to predict the next N trace that will be executed and prefetch it into the trace prefetch buffer if it is in neither the one-level trace cache nor the trace prefetch buffer but does exist in the two-level trace cache. For the purpose of predicting the next N trace, we must correlate a trace with its next N trace. Consider a trace commit sequence T_m, T_{m+1}, ..., T_{m+n-1}, T_{m+n} as shown in Fig. 5. If the whole sequence from T_m to T_{m+n} is kept by some hardware structure, then when T_{m+n} commits, we know that the next N trace of T_m is T_{m+n}. Such hardware is named the trace execution history buffer (TEHB) in this paper and is shown in Fig. 6. It has N+1 entries recording the traces from T_m to T_{m+n}.

Fig. 5. Correlation of trace_m and trace_{m+n}.


Fig. 6. Extension to path-based next trace predictor.

With the TEHB we can correlate T_m (the first entry in the TEHB) with its next N trace T_{m+n} (the last entry). There are three fields in each TEHB entry:

† trace_id: the identifier that distinguishes a trace from others.
† which_pred: the predictor table used when the prediction of trace_{x+1} (the next trace of trace_x) was made. If the correlated table was used, which_pred is set to 0; otherwise it is set to 1.
† index: the index into the predictor table used when the prediction of trace_{x+1} was made.

In addition to the TEHB, a small modification of the path-based next trace predictor is also needed. Two fields, pref_trace_id and pref_cnt, are added to the correlated table and the alternate table as shown in Fig. 6. pref_trace_id is the trace_id of the next N trace after the current one, and pref_cnt is a three-bit saturating counter that reflects the probability that the next N trace prediction is correct: the larger the pref_cnt value, the higher the probability.

4.2.1. Collection of next N trace prediction information

The collection of next N trace prediction information is accomplished at the trace commit stage. Whenever a trace commits, every entry in the TEHB steps forward one slot, and the newly committed trace is pushed into the last entry of the TEHB. In order to fill the which_pred and index fields, that information must be retained from the next trace prediction process. With the TEHB, we can correlate trace_id_m and trace_id_{m+n}, since trace_id_{m+n} identifies the next N trace after trace_id_m. For convenience, the first trace in the TEHB is referred to as Trace_first, and the currently committing trace as Trace_current. Each time a newly committed trace is pushed into the TEHB, the trace_id, which_pred and index of Trace_first, together with the trace_id of Trace_current, are used to update the next trace predictor: if which_pred is 0, the correlated table is updated; if which_pred is 1, the alternate table is updated. Updating the prediction table involves checking whether the pref_trace_id in the entry whose index equals the index field of Trace_first matches the trace_id of Trace_current. On a match, if pref_cnt is less than 7, pref_cnt is incremented by 1. On a mismatch, pref_cnt is decremented by 4 if its value is larger than 4; if its value is less than 4, pref_cnt is set to 0 and the pref_trace_id field of that entry is replaced by the trace_id of Trace_current.
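A sketch of this commit-time bookkeeping, assuming the Table 2 sizes and prefetch distance 3. Warm-up of a partially filled TEHB is omitted, and the text does not spell out the mismatch case with pref_cnt exactly 4; this sketch folds it into the reset-and-replace branch.

```c
#include <stdint.h>
#include <string.h>

#define N_DIST   3             /* prefetch distance N                  */
#define TEHB_LEN (N_DIST + 1)  /* entries hold T_m .. T_{m+N}          */

struct tehb_entry { uint32_t trace_id; int which_pred; uint32_t index; };
struct pred_entry { uint32_t pref_trace_id; uint8_t pref_cnt; /* 3 bits */ };

static struct tehb_entry tehb[TEHB_LEN];
static struct pred_entry correlated[65536], alternate_tbl[65536];

void on_trace_commit(struct tehb_entry committed)
{
    struct tehb_entry first = tehb[0];           /* Trace_first        */
    /* step every entry forward; push the committed trace at the tail */
    memmove(&tehb[0], &tehb[1], (TEHB_LEN - 1) * sizeof tehb[0]);
    tehb[TEHB_LEN - 1] = committed;              /* Trace_current      */

    struct pred_entry *tbl = first.which_pred ? alternate_tbl : correlated;
    struct pred_entry *e = &tbl[first.index];
    if (e->pref_trace_id == committed.trace_id) {
        if (e->pref_cnt < 7)
            e->pref_cnt++;                       /* saturate at 7      */
    } else if (e->pref_cnt > 4) {
        e->pref_cnt -= 4;
    } else {
        e->pref_cnt = 0;                         /* reset and retrain  */
        e->pref_trace_id = committed.trace_id;
    }
}
```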

4.2.2. Next N trace prediction

Prediction of the next N trace is carried out simultaneously with the prediction of the next trace, as shown in Fig. 6. When predicting the next trace, pref_trace_id and pref_cnt are also examined. If pref_cnt is not less than 1, pref_trace_id is taken as the next N trace of the current trace, and an operation prefetching the trace identified by pref_trace_id is issued immediately. If pref_cnt is 0, pref_trace_id has a low probability of being the next N trace, so no prefetch operation is issued. Because next N trace prediction information is updated at a different stage (trace commit) from next trace prediction information (trace fetch), both the correlated table and the alternate table need an additional read port and write port. A good prefetch mechanism must issue prefetch operations early enough; otherwise the prefetched trace is not available when it is first accessed, which not only causes a trace cache miss but also pollutes the trace prefetch buffer by replacing other useful prefetched traces. In the path-based next N trace prefetch mechanism, sufficiently early prefetch can be guaranteed as long as N (the prefetch distance) is large enough. For example, in this study the maximum trace length is 32 instructions and each PE's issue bandwidth is 4. We assume the access latency of the two-level trace cache is 12 cycles, so one trace containing 28 instructions takes at least seven cycles to complete. If we prefetch the next N trace with N >= 3, we can be sure that the next N trace will return before it is first accessed in most cases.
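Written out with the Table 2 parameters (the 28-instruction trace is the paper's own example):

$$\left\lceil \frac{28}{4} \right\rceil = 7 \ \text{cycles per trace}, \qquad (N-1)\times 7 = 2\times 7 = 14 > 12 = t_{L2}$$

so even counting only the traces dispatched between the prefetch and the predicted trace, the assumed 12-cycle two-level access latency is covered; counting the current trace as well gives 21 cycles of slack.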


Fig. 7. Trace processor with path-based next N trace prefetch.

4.3. Path-based next N trace prefetch

Fig. 7 shows the trace processor diagram after adding the trace prefetch buffer, the two-level trace cache and the path-based next N trace prefetch mechanism. The trace prefetch buffer is a small, fully associative trace cache with the same access latency as the conventional one-level trace cache. Each time, the trace_id of the next trace generated by the path-based next trace predictor is sent to the trace prefetch buffer, the two-level trace cache and the one-level trace cache. If the desired trace exists in neither the trace prefetch buffer nor the one-level trace cache, the trace construction process is issued immediately. If the desired trace exists in the two-level trace cache, it is fetched from there; at the same time, the trace construction process also runs to construct the missing trace. The faster of the two-level trace cache access and the trace construction process provides the desired trace: if the trace construction latency is larger than the two-level trace cache access latency, the trace is provided by the two-level trace cache; otherwise it is provided by the trace construction unit. Once the desired trace is provided by one of the two processes, the other, uncompleted process is cancelled. Simultaneously with the next trace prediction, the pref_trace_id and pref_cnt values in the same entry are also

checked. If pref_cnt is not less than one, pref_trace_id is sent to the trace prefetch buffer, the two-level trace cache and the one-level trace cache simultaneously. If the desired trace exists in the two-level trace cache, it is prefetched from there into the trace prefetch buffer. If the trace already exists in the one-level trace cache or the trace prefetch buffer, the prefetch operation is cancelled. The one-level trace cache, the two-level trace cache and the trace prefetch buffer each need an additional read port for this trace existence probing.
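Putting Sections 4.2.2 and 4.3 together, the per-prediction prefetch decision reduces to the control flow below. The probe and issue functions are hypothetical stand-ins for the extra read ports and the two-level-to-buffer transfer; only the decision logic follows the paper.

```c
#include <stdbool.h>
#include <stdint.h>

struct pred_entry { uint32_t pref_trace_id; uint8_t pref_cnt; };

/* Hypothetical probes for the extra read ports on the three trace
 * stores, and the transfer that moves a trace from the two-level
 * trace cache into the prefetch buffer. */
bool l1_tc_probe(uint32_t trace_id);
bool l2_tc_probe(uint32_t trace_id);
bool pref_buf_probe(uint32_t trace_id);
void issue_l2_prefetch(uint32_t trace_id);

void maybe_prefetch(const struct pred_entry *e)
{
    if (e->pref_cnt < 1)
        return;                     /* low confidence: no prefetch    */
    uint32_t id = e->pref_trace_id;
    if (l1_tc_probe(id) || pref_buf_probe(id))
        return;                     /* trace already close: cancel    */
    if (l2_tc_probe(id))
        issue_l2_prefetch(id);      /* pull the next N trace forward  */
}
```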

5. Primary simulation results

In this section, we explore the effectiveness of the path-based next N trace prefetch mechanism. Prediction accuracy, IPC improvement and the three prefetch metrics (prefetch accuracy, prefetch coverage rate and prefetch timeliness) are evaluated with the prefetch distance varying from 1 to 12. All experiments in this section use the default configuration listed in Table 2.

5.1. Path-based next N trace prediction accuracy

We first investigate the path-based next N trace prediction accuracy with different prediction distances. The path-based next N trace prediction accuracy significantly affects the effectiveness of trace prefetch, because lower prediction accuracy issues useless prefetches, which not only cause trace cache misses but also pollute the trace prefetch buffer. Fig. 8 shows the next N trace prediction accuracies for eight SPECint95 benchmarks with the prediction distance varying from 1 to 12. For all benchmarks, the prediction accuracy falls as the prediction distance increases.


Fig. 8. Next N trace prediction accuracy.

The average prediction accuracies are 80.13% for next 1 trace prediction, 76.20% for next 2, 72.98% for next 3, 70.19% for next 4, 68.03% for next 5, 66.13% for next 6, 64.47% for next 7, 63.05% for next 8, 61.80% for next 9, 60.60% for next 10, 59.50% for next 11 and 58.48% for next 12 trace prediction. Among the eight benchmarks, the prediction accuracy of compress drops more quickly than the others, from 65.96% at prediction distance 1 to 23.45% at prediction distance 12. This shows that although compress has a small number of static traces, its dynamic trace behavior is very irregular and hard for our path-based next N trace predictor to capture. On the contrary, the regular dynamic trace behavior of m88ksim makes its prediction accuracy drop slowly as the prediction distance changes from 1 to 12 (94.84-86.22%).

5.2. Path-based next N trace prefetch accuracy

Fig. 9 shows the prefetch accuracy of the path-based next N trace prefetch mechanism with the prefetch distance varying from 1 to 12. This metric indicates the fraction of prefetched traces that were actually used. Obviously,

a good prefetch mechanism should have high prefetch accuracy. The results in Fig. 9 show that the prefetch accuracy is very low at prefetch distance 1 or 2 for all benchmarks except compress. Most prefetch operations that prefetch the next or the next-two trace are issued too late and cannot return before the desired trace is accessed, and those late-returning prefetched traces are soon replaced by newly prefetched traces because of the small capacity of the trace prefetch buffer. compress, however, has a low frequency of prefetch operations, so even if the late prefetched traces miss their first chance to be used, they are not replaced immediately and can remain in the trace prefetch buffer for a long time; when a second access to one of those late traces happens, a trace prefetch buffer hit occurs. We see that when the prefetch distance grows beyond 3, the prefetch accuracy increases significantly for all benchmarks except compress. This means that with a prefetch distance beyond 3, most prefetch operations are accomplished in time, before the first access to the prefetched trace. The average prefetch accuracies are 51.1, 63.7, 77.3, 78.4, 79.2, 79.3, 78.9, 79.2, 78.2, 77.9, 78.2 and 77.7% for prefetch distances 1 to 12, respectively.

Fig. 9. Trace prefetch accuracy.


Fig. 10. Trace prefetch coverage rate.

5.3. Path-based next N trace prefetch coverage rate

Fig. 10 shows the prefetch coverage rate of the path-based next N trace prefetch mechanism. The prefetch coverage rate indicates the fraction of traces supplied by the trace prefetcher rather than demand-fetched. As discussed in Section 3.3, while a highly accurate prefetch mechanism is desirable, it is of little use if it prefetches only a small number of traces; a good prefetch mechanism should also have a high coverage rate. We see that the prefetch coverage rates first increase with the prefetch distance, soon reach their highest points, and then fall as the prefetch distance increases further. For gcc, perl, vortex and ijpeg, the prefetch coverage rate is small at prefetch distance 1 or 2 because most prefetch operations are late prefetches. With its poor one-level trace cache hit rate, all the prefetch coverage rates for go are below 22%. With its high one-level trace cache hit rate and small number of prefetch operations, compress behaves differently from the others: its coverage rate is highest at prefetch distance 1. The average prefetch coverage rates are 59.6, 65.6, 70.5, 71.7, 70.9, 70.9, 70.2, 69.7, 68.7, 67.4, 66.9 and 66.2% for prefetch distances 1 to 12, respectively.

5.4. Path-based next N trace prefetch timeliness

Any prefetch can be classified into one of three types: late, timely or useless. Late prefetches are issued too late and do not return before they are needed. Timely prefetches return before their first access and are accessed at least once before being replaced. All other prefetches are useless, because they were already replaced by newer prefetches by the time they were needed. A good prefetch mechanism should have few late and useless prefetches. Table 4 shows the distribution of late, timely

and useless prefetches as the prefetch distance varies from 1 to 12. From the table, we see that when the prefetch distance grows beyond 3, the late prefetch rate falls below 2.0% for all benchmarks. The useless prefetch rate drops slowly with increasing prefetch distance for all benchmarks except ijpeg: as the prefetch distance grows from 1 to 3, the useless prefetch rate of ijpeg drops sharply, from 36.02 to 3.99%. This means that a large fraction of prefetched traces return in time if the prefetch operations are issued early enough. Beyond a prefetch distance of 3, the prefetch timeliness rate changes little. The average prefetch timeliness rates are 70.6, 74.7, 80.8, 81.3, 82.1, 82.8, 82.3, 82.8, 81.7, 82.2, 81.5 and 82.4% for prefetch distances 1 to 12, respectively.

5.5. IPC improvement

Fig. 11 shows the performance of four models with different hardware configurations:

† 1TC model: a trace processor with only a one-level trace cache (1K-entry, two-way). Both tables in the path-based next trace predictor have 64K entries.
† 2TC model: a trace processor with a one-level trace cache (1K-entry, two-way) and a two-level trace cache (8K-entry, eight-way). Both tables in the path-based next trace predictor have 64K entries.
† PREFx model: a trace processor with a one-level trace cache (1K-entry, two-way), a two-level trace cache (8K-entry, eight-way) and the path-based next N trace prefetch mechanism with prefetch distance x. Both tables in the path-based next trace predictor have 64K entries, and the trace prefetch buffer has 16 entries.
† BASE model: a trace processor with a one-level trace cache (1K-entry, two-way), a two-level trace cache (8K-entry, eight-way) and no path-based next N trace prefetch mechanism. Because the path-based next trace predictor used in the PREFx model is modified to contain the prefetch information, it is fair to compare the performance of trace processors with the same hardware budget for the next-trace predictor. The total size of the path-based next trace predictor used in PREFx is about 2277 KB (correlated table, 65,536 × (64+2+10+64+3)/8 bytes; alternate table, 65,536 × (64+4+64+3)/8 bytes). So in the BASE model, we use a relatively large correlated table with 256K entries, for a total size of about 3047 KB (correlated table, 256 × 1024 × (64+2+10)/8 bytes; alternate table, 65,536 × (64+4)/8 bytes).


Table 4
Distribution of prefetches (Late, Timely and Useless in %)

Prefetch distance:   1                      2                      3                      4
Benchmark    Late   Timely  Useless   Late   Timely  Useless   Late   Timely  Useless   Late   Timely  Useless
go           5.99   45.06   48.95     3.23   49.26   47.51     1.49   52.30   46.21     0.53   54.04   45.43
gcc          5.62   52.30   42.08     2.97   57.76   39.27     1.22   62.71   36.07     0.50   64.28   35.22
li           5.87   78.87   15.26     2.63   80.02   17.35     0.73   82.33   16.94     0.15   77.47   22.38
m88ksim      4.69   84.17   11.14     4.88   81.17   13.95     1.94   85.99   12.07     0.07   90.00   9.93
perl         3.23   91.53   5.24      1.64   92.60   5.76      0.17   96.12   3.71      0.10   94.93   4.97
vortex       10.12  69.33   20.55     4.05   75.60   20.35     1.15   80.41   18.44     0.35   82.85   16.8
ijpeg        8.68   55.30   36.02     3.24   73.40   23.36     0.10   95.91   3.99      0.03   96.16   3.81
compress     2.59   88.47   8.94      1.35   87.83   10.82     1.13   90.66   8.21      1.08   90.58   8.34

Prefetch distance:   5                      6                      7                      8
Benchmark    Late   Timely  Useless   Late   Timely  Useless   Late   Timely  Useless   Late   Timely  Useless
go           0.17   54.57   45.26     0.11   54.68   45.21     0.10   54.94   44.96     0.11   55.13   44.76
gcc          0.20   65.16   34.64     0.17   65.04   34.79     0.15   64.60   35.25     0.13   64.02   35.85
li           0.23   83.92   15.85     0.17   85.63   14.2      0.20   83.43   16.37     0.44   87.68   11.88
m88ksim      0.70   92.31   6.99      0.01   91.98   8.01      0.01   89.59   10.4      0.00   90.90   9.1
perl         0.10   94.19   5.71      0.01   96.39   3.6       0.10   95.49   4.41      0.00   94.34   5.66
vortex       0.08   83.87   16.05     0.13   84.04   15.83     0.07   82.23   17.7      0.07   83.06   16.87
ijpeg        0.02   96.10   3.88      0.02   96.24   3.74      0.02   96.03   3.95      0.02   95.80   4.18
compress     0.72   86.48   12.8      1.12   88.43   10.45     0.56   92.04   7.4       0.06   91.77   8.17

Prefetch distance:   9                      10                     11                     12
Benchmark    Late   Timely  Useless   Late   Timely  Useless   Late   Timely  Useless   Late   Timely  Useless
go           0.09   55.23   44.68     0.09   54.83   45.08     0.13   54.58   45.29     0.11   54.46   45.43
gcc          0.13   63.78   36.09     0.10   63.32   36.27     0.12   62.91   36.97     0.12   62.79   37.09
li           3.37   81.93   14.7      0.31   81.50   18.19     0.51   80.66   18.83     0.32   88.69   10.99
m88ksim      0.00   85.69   14.31     0.02   92.53   7.45      0.00   89.40   10.6      0.02   89.71   10.27
perl         0.06   95.27   4.67      0.03   94.80   5.17      0.12   94.80   5.08      0.09   94.19   5.72
vortex       0.21   82.74   17.05     0.08   83.18   16.74     0.07   83.01   16.92     0.08   83.72   16.2
ijpeg        0.02   95.39   4.59      0.03   94.73   5.24      0.02   94.50   5.48      0.03   94.12   5.85
compress     0.49   93.38   6.13      0.68   92.61   6.71      0.42   91.76   7.82      1.79   91.79   6.42


Fig. 11. Performance improvement.


The results in Fig. 11 show that adding the path-based next N trace prefetch mechanism boosts performance significantly. When the prefetch distance is 3, the IPC improvements over the 1TC model are 19.6% for go, 20.2% for gcc, 12.1% for li, 3.7% for m88ksim, 16.7% for perl, 7.6% for vortex, 2.4% for ijpeg and 7.7% for compress; the average improvement is 11.3%. Even compared to the BASE model, the IPC improvements are still significant: 12.1% for go, 12.8% for gcc, 11.2% for li, 3.8% for m88ksim, 15.8% for perl, 4.1% for vortex, 1.2% for ijpeg and 7.5% for compress; the average improvement is 8.6%. We also see from the figure that IPC drops slowly as the prefetch distance grows beyond 3 for go and gcc, the two largest-scale programs among the eight benchmarks; there are no obvious IPC changes for the other six benchmarks. This phenomenon can be explained as follows. The performance improvement of our path-based next N trace prefetch mechanism comes from two sources: (1) prefetching traces that are in neither the one-level trace cache nor the trace prefetch buffer before they are needed; (2) the small fully associative trace prefetch buffer can hold some 'hot' traces that cannot all be cached in the one-level trace cache simultaneously because of its relatively small set associativity. For small-scale programs such as compress, the real performance bottleneck is not the small capacity of the one-level trace cache (in our default configuration, the one-level trace cache has 1024 entries, while compress has only 698 static traces) but the small set associativity (two-way in the default configuration): a few 'hot' traces generate heavy conflicts on several trace cache sets. Adding a small fully associative buffer alleviates this problem, as those hot traces can be kept in the trace prefetch buffer. Because accesses to those 'hot' traces dominate the accesses to the trace prefetch buffer, the performance of those small programs is not sensitive to changes in the prefetch distance. We believe it is important to focus on the behavior of the benchmarks with large trace working sets, go and gcc, because the other six benchmarks probably have relatively small working sets compared to most realistic programs.


6. Design space of the path-based next N trace predictor

In this section, we explore the design space to give more insight into the path-based next N trace prefetch mechanism. First, we evaluate the next N trace prediction accuracy with different predictor table configurations and trace history depths. Second, we explore the contribution of each component of the path-based next trace predictor to next N trace prediction accuracy. Finally, we examine the effectiveness of our prefetch mechanism in trace processors with different one-level trace cache configurations. Due to space limitations, we do not show results for all prefetch distances from 1 to 12; a prefetch distance of 3 is used in the following experiments.

6.1. Impact of correlated table size and trace history depth

Because the correlated table is the most important component of the path-based next trace predictor, and the next N trace prediction accuracy is most sensitive to the correlated table size, we first study the next N trace prediction accuracy with different correlated table configurations and trace history depths. The correlated table is also the component most sensitive to aliasing in the predictor, so the prediction accuracy with an infinite correlated table is also studied; the infinite configuration ensures that each unique sequence of hashed_ids maps to its own table entry (no aliasing). Different DOLC configurations are used for the different correlated table sizes; they are listed in Table 5. In this subsection, we study the next N trace prediction accuracy with 4K, 16K, 64K, 256K and infinite-entry correlated tables, while keeping the alternate table size fixed at 64K entries and the return history stack size fixed at 32 entries. go, gcc, ijpeg and vortex are the four largest-scale benchmarks, as described in Section 1, and their prediction accuracies are the most sensitive to the correlated table size. As the correlated table capacity increases from 4K entries to infinite, the prediction accuracy increases significantly. When the trace history depth grows beyond 1, the prediction accuracies of those four benchmarks start to increase, which means that prediction based on multiple-trace history information outperforms prediction based on just the most recent trace. As the trace history depth increases further, the prediction accuracies fall slowly.

Table 5
Index generation configuration

Depth   DOLC for 4K      DOLC for 16K     DOLC for 64K     DOLC for 256K
1       1–0–5–7(1p)      1–0–6–8(1p)      1–0–7–9(1p)      1–0–7–11(1p)
3       3–5–6–8(2p)      3–5–7–11(2p)     3–5–9–13(2p)     3–7–9–13(2p)
5       5–2–5–11(2p)     5–3–5–11(2p)     5–5–5–7(2p)      5–5–7–9(2p)
7       7–4–5–7(3p)      7–4–7–11(3p)     7–5–7–11(3p)     7–5–11–13(3p)
9       9–3–5–7(3p)      9–3–7–11(3p)     9–4–7–9(3p)      9–4–9–13(3p)


The reason is that with a large trace history depth, the aliasing problem becomes severe because of the limited capacity of the predictor tables. Aliasing puts more pressure on small predictors than on large ones: the smaller the predictor, the earlier the detrimental effect of aliasing emerges. For go, with a 4K-entry predictor, the prediction accuracy using the three most recent traces is higher than using the five most recent; but with a 256K-entry predictor, the prediction accuracy begins to fall only after the trace history depth grows beyond 5. Since go, gcc, ijpeg and vortex have large static trace working sets, the aliasing problem becomes prominent as the trace history depth increases and finally hampers the prediction accuracy. For the other four benchmarks, with their small trace working sets, the aliasing problem is not very prominent, and their prediction accuracies increase with the trace history depth. When the trace history depth changes from 1 to 3, the prediction accuracies of these four benchmarks do not change significantly with the correlated table capacity; beyond a depth of 5, the smaller predictors are quickly affected by aliasing, so the prediction accuracy begins to change significantly. Among the eight benchmarks, the prediction accuracy of compress is insensitive to the table size because of its small trace working set of only 698 traces, as mentioned in Section 1: at trace history depth 7, the prediction accuracy with a 16K-entry correlated table is just 2% higher than with a 4K-entry one. We found that even with an infinite correlated table (no aliasing), the prediction accuracy is not obviously higher than with a 64K or 256K-entry correlated table; the average prediction accuracies are 67.3, 71.0, 73.1, 73.1 and 74.0% for the table configurations from 4K to infinite at trace history depth 7. The main reason is that for large-capacity predictor tables, the impact of aliasing is not very significant, so its effect on prediction accuracy is relatively small. We also found that for go, gcc, vortex and ijpeg, the prediction accuracies fall with increasing trace history depth even with the infinite correlated table configuration. This tells us that a prediction based on more trace history is not always more accurate than one based on less. How to determine the optimal trace history depth dynamically to achieve higher prediction accuracy is a problem that deserves further exploration (Fig. 12).

6.2. Impact of the alternate predictor and RHS on prediction accuracy

All predictors using path-based information must face the cold-start problem. When a program begins to execute or a new procedure is called, useful history information has not yet been collected, and most predictions will be

wrong. Although the concept of cold starts is clear, measuring its impact on prediction accuracy directly is very hard: determining whether a path lies in previously untouched code, and determining dynamically when the cold-start phase has finished, are both tricky problems. In the path-based next trace predictor, the alternate table and the return history stack are the components used to reduce the cold-start problem [9]. The index of the alternate table is formed from the last trace_id alone, which gives it a short learning curve and makes it less sensitive to cold starts. The return history stack preserves the original trace history when a procedure is entered and restores it on return. In this section, we study the contribution of each component to prediction accuracy rather than measuring the impact of cold starts directly. Three configurations of the path-based next trace predictor are used in the following experiments:

† C+A+R: the predictor contains the correlated table (64K-entry), the alternate table (64K-entry) and the return history stack (32-entry).
† C+A: the predictor contains the correlated table (64K-entry) and the alternate table (64K-entry).
† C: the predictor contains only the correlated table (64K-entry).

We found from Fig. 13 that for go, gcc and ijpeg, the alternate table successfully reduces the negative impact of cold starts on prediction accuracy. An interesting thing happens for go, gcc and m88ksim: when the trace history depth grows beyond a threshold, adding the return history stack hampers the prediction accuracy. Those thresholds are 3 for go, 7 for gcc and 7 for m88ksim. This suggests that for those programs, discarding the trace history when entering a new procedure is more important than keeping it. The return history stack does not benefit prediction accuracy for ijpeg and compress; for those two benchmarks, keeping and restoring the trace history does not help. We also found that for li, m88ksim and perl, adding the alternate table gives little prediction accuracy improvement; for those programs, the impact of cold starts is very small. Finally, we found that the return history stack plays an important role for vortex when the trace history depth is between 3 and 7. From all these observations we conclude that for programs with large static trace working sets (go, gcc, vortex and ijpeg), adding the alternate table and the return history stack gains significant prediction accuracy, while programs with small static trace working sets gain little.

6.3. Effectiveness of path-based next N trace prefetch with different trace cache sizes

Table 6 shows the performance of the four models described in Section 5.5 with different one-level trace cache configurations.


Fig. 12. Prediction accuracy with different configuration of correlated table size and trace history depth.


Fig. 13. Prediction accuracy with different configuration of predictor components.


Table 6
IPC with different one-level trace cache configurations

             256-Entry                    512-Entry
Benchmark    1TC    2TC    BASE   PREF3   1TC    2TC    BASE   PREF3
go           1.51   1.59   1.60   1.84    1.55   1.65   1.66   1.88
gcc          1.77   1.93   1.94   2.45    2.02   2.17   2.18   2.58
li           2.78   2.78   2.79   3.16    2.81   2.81   2.84   3.20
m88ksim      3.29   3.29   3.29   3.56    3.46   3.46   3.46   3.63
perl         2.38   2.38   2.38   3.44    2.79   2.80   2.80   3.55
vortex       2.87   3.04   3.06   3.49    3.12   3.26   3.28   3.55
ijpeg        3.88   3.87   3.91   4.19    4.05   4.05   4.09   4.23
compress     2.34   2.34   2.34   2.53    2.35   2.35   2.34   2.53

             1024-Entry                   2048-Entry
Benchmark    1TC    2TC    BASE   PREF3   1TC    2TC    BASE   PREF3
go           1.60   1.70   1.71   1.92    1.69   1.78   1.79   1.98
gcc          2.24   2.37   2.39   2.70    2.43   2.53   2.54   2.78
li           2.90   2.90   2.93   3.25    2.92   2.91   2.96   3.29
m88ksim      3.52   3.52   3.52   3.65    3.58   3.58   3.58   3.66
perl         3.07   3.10   3.11   3.60    3.23   3.23   3.24   3.66
vortex       3.34   3.42   3.45   3.59    3.48   3.54   3.57   3.61
ijpeg        4.15   4.16   4.20   4.25    4.19   4.19   4.24   4.27
compress     2.35   2.35   2.35   2.53    2.35   2.35   2.36   2.53

This experiment reveals the effectiveness of our prefetch mechanism with different one-level trace cache configurations. We see from the table that our path-based next N trace prefetch mechanism is more effective when the trace processor has a relatively small one-level trace cache. With a prefetch distance of 3, the performance of the PREF3 model is 20.5, 15.1, 11.3 and 9.1% higher than the 1TC model for the one-level trace cache configurations from 256-entry to 2048-entry, respectively. Compared to the BASE model, the IPC improvements are 17.1, 12.0, 9.1 and 6.9%, respectively. For a trace processor with a small one-level trace cache, the higher one-level trace cache miss rate makes the prefetch mechanism work better.

7. Conclusions

The relatively small one-level trace cache cannot satisfy the storage requirements of ever-larger applications, and a higher trace cache miss rate lowers the fetch bandwidth, which reduces processor performance significantly. To address this problem, we proposed a trace cache hierarchy to remedy the capacity problem of the one-level trace cache. The hierarchy has two levels of trace cache: the first level has small capacity and short access latency, while the second has large capacity but long access latency. The simulation results show that adding the two-level trace cache alone cannot bring much performance improvement because of its long access latency. To reduce the access latency of the two-level trace cache, we proposed a path-based next N trace prefetch mechanism that relies on a next N trace prediction technique. The key idea is that while the current trace is executing, the next N trace is predicted, and if it is likely to be executed in the future, it is prefetched from the two-level trace cache, provided the trace resides there. Our final simulation results show that with a small fully associative 16-entry trace prefetch buffer and a prefetch distance of 3, the path-based next N trace prefetch mechanism performs 11.3% better than the conventional one-level trace cache mechanism on eight benchmarks from SPECint95.

Acknowledgements

The authors wish to thank the anonymous reviewers for their detailed reviews and many constructive suggestions, which have improved the paper significantly.

References

[1] D.H. Friendly, S.J. Patel, Y.N. Patt, Putting the fill unit to work: dynamic optimizations for trace cache microprocessors, Proceedings of the 31st International Symposium on Microarchitecture (MICRO-31), Dec. 1998. [2] S.J. Patel, M. Evers, Y.N. Patt, Improving trace cache effectiveness with branch promotion and trace packing, Proceedings of the 25th International Symposium on Computer Architecture (ISCA-25), June 1998. [3] D.H. Friendly, S.J. Patel, Y.N. Patt, Alternative fetch and issue policies for the trace cache fetch mechanism, Proceedings of the 30th International Symposium on Microarchitecture (MICRO-30), Dec. 1997. [4] B. Black, B. Rychlik, J.P. Shen, The block-based trace cache, Proceedings of the 26th International Symposium on Computer Architecture (ISCA-26), May 1999. [5] A. Ramirez, J.L. Larriba-Pey, M. Valero, Trace cache redundancy: red and blue traces, Proceedings of the Sixth International Symposium on High Performance Computer Architecture (HPCA-6), 2000. [6] Q. Jacobson, J.E. Smith, Trace preconstruction, Proceedings of the 27th International Symposium on Computer Architecture (ISCA-27), 2000. [7] R. Rakvic, B. Black, J.P. Shen, Completion time multiple branch prediction for enhancing trace cache performance, Proceedings of the 27th International Symposium on Computer Architecture (ISCA-27), June 2000.


[8] E. Rotenberg, S. Bennett, J.E. Smith, Trace cache: a low latency approach to high bandwidth instruction fetching, Proceedings of the 29th International Symposium on Microarchitecture (MICRO-29), Dec. 1996. [9] Q. Jacobson, E. Rotenberg, J.E. Smith, Path-based next trace prediction, Proceedings of the 30th International Symposium on Microarchitecture (MICRO-30), Dec. 1997. [10] D. Burger, T. Austin, The SimpleScalar Tool Set, Version 2.0, Technical Report CS-TR-97-1342, Univ. of Wisconsin, Madison, 1997. [11] E. Rotenberg, Q. Jacobson, Y. Sazeides, J.E. Smith, Trace processors, Proceedings of the 30th International Symposium on Microarchitecture (MICRO-30), Dec. 1997. [12] S.J. Patel, D.H. Friendly, Y.N. Patt, Evaluation of design options for the trace cache fetch mechanism, IEEE Trans. Comput. 48 (2) (1999). [13] E. Rotenberg, S. Bennett, J.E. Smith, A trace cache microarchitecture and evaluation, IEEE Trans. Comput. 48 (2) (1999). [14] M. Franklin, G.S. Sohi, ARB: a hardware mechanism for dynamic reordering of memory references, IEEE Trans. Comput. 45 (5) (1996). [15] K. Wang, M. Franklin, Highly accurate data value prediction using hybrid predictors, Proceedings of the 30th International Symposium on Microarchitecture (MICRO-30), Dec. 1997.

Kai-feng Wang was born in 1976. He is currently a PhD candidate in the School of Computer Science and Technology, Harbin Institute of Technology. He received his BS degree in 1998 and his MS degree in 2001. His research interests include trace caches, trace processors, simultaneous multithreading, value prediction and data value reuse.

Zhen-zhou Ji was born in 1965. He is a professor and doctoral supervisor in the Department of Computer Science and Engineering, Harbin Institute of Technology. His current research interests include parallel architecture.

Ming-zeng Hu was born in 1935. He is a professor and doctoral supervisor in the Department of Computer Science and Engineering, Harbin Institute of Technology. His current research interests include computer architecture, parallel computing and network security.