Performance Evaluation 60 (2005) 51–72
On the performance of trace locality of reference

A. Mahjur∗, A.H. Jahangir, A.H. Gholamipour

Department of Computer Engineering, Sharif University of Technology, Tehran, Iran

Available online 2 December 2004
Abstract

In this paper, trace locality of reference (LoR) is identified as a mechanism to predict the behavior of a variety of systems. If two objects were accessed near each other in the past and the first one is accessed again, trace LoR predicts that the second one will be accessed in the near future. To capture trace LoR, the trace graph is introduced. Although trace LoR can be observed in a variety of systems, the focus of this paper is to characterize it for data accesses in memory management systems. In this field, it is compared with recency-based prediction (LRU stack), and it is shown that the model is not only much simpler but also outperforms recency-based prediction in all cases. The paper examines various parameters affecting trace LoR, such as object size, caching effects (address reference stream versus miss address stream), and access type (read, write, or both). It shows that object size does not have a meaningful effect on trace LoR; that, on average, the predictability of the miss address stream is 30% better than that of the address reference stream; and that identifying the access type can increase predictability. Finally, two enhancements are introduced to the model: history and multiple LRU prediction. A main contribution of this paper is the introduction of n-stride prediction¹. For a prediction to be useful, there must be sufficient time to load the object, and n-stride prediction shows that trace LoR can predict an access well ahead of its occurrence.

© 2004 Elsevier B.V. All rights reserved.

Keywords: Performance evaluation; Memory hierarchy; Locality of reference; Data prefetching
∗ Corresponding author.
E-mail addresses: [email protected] (A. Mahjur), [email protected] (A.H. Jahangir), [email protected] (A.H. Gholamipour).

¹ It should be noted that the n-stride prediction introduced in this paper differs from the stride prefetching introduced in other research.

doi:10.1016/j.peva.2004.10.018
1. Introduction

Hierarchical systems are used extensively in computing. Most systems use a cache to decrease their access time; examples include file systems, network servers, web proxies, network clients, and memory management systems. In a hierarchical model, objects are fetched and placed into a layer when a miss occurs, or they are prefetched. If there is no room for the new object, a replacement algorithm selects a victim object. Both prefetching and replacement algorithms are based on predicting the system behavior, and the success of a hierarchical system depends on predicting that behavior correctly. Although the concepts developed in this paper are general and can be applied to any hierarchical system, this paper focuses on memory management systems and studies their data access patterns. The results presented here can be used for caching (cache miss handling), address translation (TLB management), and virtual memory management (page fault handling).

Various methods have been studied to model system behavior and predict the objects required in the near future. Locality of reference [11] is among these methods. Two LoR types are defined in the literature: spatial and temporal. Most replacement algorithms, including LRU, and the effectiveness of hierarchical systems in general, are based on temporal LoR [9,43]; meanwhile, some prefetching algorithms, such as sequential prefetching [42] or stream buffers [31], are based on spatial LoR. Similarly, fetching a whole cache block or page frame instead of just the requested word, which is itself a form of prefetching, is based on spatial LoR [44]. However, there is a variety of access patterns not covered by the traditional LoR types [18,29].

The fact that the traditional LoR types are not sufficient to capture all access patterns has motivated many prediction algorithms. Examples are prefetch algorithms for array-based programs [3,5] and algorithms for pointer-intensive programs [4,10,21,22,24–27,36,37]. The main problem with these approaches is that they are tied to special cases: the program segment exhibiting the expected behavior must be identified first, and only then can the algorithm be applied to it. Most of them are offline algorithms that require compiler support [3,4].

A group of prediction methods uses the system trace, defined as the sequence of accessed objects. If an object was accessed previously and is accessed again, such methods predict that its nearby objects in the trace will be accessed too. Branch prediction algorithms use the past outcomes of a branch instruction to predict whether it will be taken the next time. Trace caches and trace processors [19,20,32,34,35] extend the idea of branch prediction to predict the next basic blocks of code to be executed. In data access patterns, recency-based prediction² [38], based on the LRU stack, and a number of frequency-based graph algorithms [13,17,22] use the system trace to predict future behavior. The graph algorithms are the Markov predictor [22,30], the access graph [13], and the probability graph [17]. Markov prediction has been used in the context of cache prefetching [22] and I/O prediction [30]; the access graph has been used in the context of virtual memory management, and the probability graph in the context of file systems.

This paper formalizes the concept of trace LoR in general and shows how it can be used to predict future data accesses in memory management systems. Section 2 reviews the related work.
Section 3 defines trace LoR as a general aspect of most systems. Section 4 introduces the trace graph for capturing trace LoR. Section 5 presents the benchmark results, including the effects of system configuration. Section 6 introduces extensions to the trace graph for predicting the correct behavior when more than one trace is associated with an object. Section 7 defines n-stride prediction and evaluates its usefulness. Finally, Section 8 concludes the paper.

² Throughout the paper, recency-based prediction and LRU stack are used interchangeably.
2. Related work

In the context of memory management, various models have been suggested to predict a program's behavior and prefetch an object. There are three types of prefetching techniques: offline, online, and hybrid. Offline techniques rely on the compiler to analyze the program and insert prefetching instructions in the code [6,15,46]. Online algorithms detect access patterns at runtime, and hybrid methods [7,8,16,24,40,45,47] use both the compiler and runtime behavior to predict the program behavior.

The goal of a prediction method is to determine the objects that are going to be accessed in the near future. To achieve this goal, some online algorithms use only the stream of past object accesses to predict future accesses, while others use some metadata in addition to that stream. Most of the time, this metadata is the address of the instruction that accessed the object [12,39].

On the other hand, online methods can be grouped by the type of the predicted object, that is, whether the predicted object was accessed previously or not. Some prediction methods predict an object access only if the object was accessed previously. The other group may predict an object access even if it was not; these methods are based on detecting regular patterns in accesses [2,7,14,23,41]. Most of them are based on spatial LoR. In the simplest case, when an object is accessed, the objects following it in the address space are prefetched, but more sophisticated schemes have been developed to detect more complicated access patterns. As a drawback, these methods fail when a program does not have a regular behavior. For example, in pointer-intensive programs, the program structures are spread throughout memory, and accessing them does not follow a regular pattern.

The method presented in this paper uses only the stream of accessed objects to predict future behavior, and it predicts an object only if it was accessed previously. The work most pertinent to ours is recency-based prediction, which is based on the LRU stack; a detailed comparison of our model with recency-based prediction is presented in Section 4. Other works belonging to this category are graph-based models: the Markov predictor [22,30], the access graph [13,33], and the probability graph [17]. All of these methods create a weighted directed graph to store the past behavior of a program, in which each node represents an object. They differ in the way they assign a weight to each edge.

The simplest case is the Markov predictor. The authors of [22] used it to predict cache misses, and the authors of [30] used it to predict I/O accesses. In the Markov model, each object has a corresponding node. For example, when a1 and a2 are accessed consecutively for the first time, an edge is made from a1 to a2. The weight of the edge is initially set to one and is increased whenever a1 and a2 are accessed in succession again. However, it should be noted that the Markov model is not practical [45]: if a system has n objects, the space required to store its information is O(n²). Even the authors of [22], who proposed this usage of the Markov model, were aware of this problem.
Fig. 1. Markov model sample.
To solve this problem, they suggested storing only four edges per node. But since the model is frequency-based, it is impossible to limit the out-degree of the nodes in this way. To clarify the problem, consider the Markov model shown in Fig. 1, where the out-degree of each node is limited to 4 and the weights are 26, 32, 78, and 91. Now, suppose that a0, a5 appears in the trace. Should we add a new edge from a0 to a5? If yes, we have to remove an existing edge (in this case a0 → a1) and replace it with a0 → a5; but then we have removed a frequently used edge in favor of a less used one, which may disturb future predictions. Otherwise, if we do not add it and a0, a5 appears many times in the trace, its weight should become greater than that of the other edges; but since we did not store it, we cannot detect this.

The second graph-based model, called the dynamic access graph [13,33], is similar to the Markov model, except that the edge weights are decreased periodically. In this manner, the model diminishes the effect of the distant past on prediction. As the access graph is an extension of the Markov model, it inherits the same problems, and it has an extra one as well: since all weights are decreased periodically, the cost of the update is very high, and it is not practical to stop the system to update the weights of the graph edges.

The last work, called the probability graph [17], was done in the context of file systems. It considers an object related to another one if it is accessed after the first one within a lookahead period. An edge is drawn from the first object to the second to show that they are related, and the weight of the edge shows how strongly they are related. This is another extension of the Markov model. To clarify the difference between this method and the Markov predictor, consider the sequence a1, a2, a3, a4. The Markov predictor only makes an edge from a1 to a2, but the probability graph creates three edges: one from a1 to a2, one from a1 to a3, and one from a1 to a4. In addition to the drawbacks of the Markov model, this model has another problem. The extension introduced in the probability graph is to consider a2, a3, a4 as predictions for a1 in the trace sequence a1, a2, a3, a4. However, our measurements showed that in almost all cases, the hit ratio for these predictions is zero or near zero. Therefore, this extension only adds false predictions to the Markov model and so decreases its predictability.
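To make the out-degree dilemma concrete, the following C sketch shows a Markov node capped at four outgoing edges, in the spirit of [22]. The data layout and names are illustrative assumptions, not the implementation of [22]; the eviction branch is exactly where the problem described above arises.

#include <stdint.h>

#define MAX_EDGES 4                      /* cap suggested in [22] */

/* One Markov node: up to MAX_EDGES weighted successor edges. */
struct markov_node {
    uint64_t succ[MAX_EDGES];            /* successor object ids   */
    unsigned weight[MAX_EDGES];          /* transition frequencies */
    int      nsucc;
};

/* Record the observed transition node -> next. */
void markov_observe(struct markov_node *n, uint64_t next)
{
    int i, min = 0;

    for (i = 0; i < n->nsucc; i++) {
        if (n->succ[i] == next) {        /* known edge: bump weight */
            n->weight[i]++;
            return;
        }
        if (n->weight[i] < n->weight[min])
            min = i;                     /* remember lightest edge */
    }
    if (n->nsucc < MAX_EDGES) {          /* room for a new edge */
        n->succ[n->nsucc]     = next;
        n->weight[n->nsucc++] = 1;
    } else {
        /* The dilemma: evicting the lightest edge may discard a
         * frequently used transition, while refusing to evict means a
         * transition that later becomes frequent can never be counted. */
        n->succ[min]   = next;
        n->weight[min] = 1;
    }
}

Either branch of the full-table case loses information, which is why capping the out-degree of a frequency-based graph is problematic.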
3. Trace locality of reference

Consider the algorithm used to find an item in a linked list. This algorithm works as follows. It first examines the first element. If it matches the requested item, the search completes and the element is returned.
Fig. 2. SimpleScalar main loop.
Otherwise, the second element is examined. This process continues until the list is exhausted or the item is found. This behavior is repeated whenever an item is searched in a linked list. In fact, most linked list processing functions have the same behavior. As another example, consider the parts of the SimpleScalar [1] main loop shown in Fig. 2. It performs the following actions:

• Initializes regs.regs_R[MD_REG_ZERO].
• Initializes regs.regs_F.d[MD_REG_ZERO].
• If itlb is true, calls cache_access.
• If itlb is true, calls cache_access.
• Calls MD_FETCH_INST.
• Accesses sim_num_insn.
• Accesses addr.
• Accesses is_write.
• Accesses fault.
Clearly, every time this loop is executed, it accesses almost the same sequence of objects. For example, as soon as sim_num_insn is accessed, it can be concluded that addr, is_write, and fault will be accessed too. Therefore, given the trace of the previous iteration of this loop, one can predict the trace of the current iteration. The following definition formalizes this property.

Trace locality of reference. Suppose two objects were accessed near each other in the past. The system has trace LoR if, whenever the first one is accessed, the second one will be accessed in the near future.

To clarify the concept, consider the sequence of accesses in Fig. 3. Trace LoR predicts that the rest of this trace will be one of the following:
Fig. 3. Trace LoR.
• a2, a3, a4,
• a5, a6, a7.
From the definition of trace LoR, the following can be concluded:

(1) Ordering is preserved.
(2) It is not restricted to consecutive objects.
(3) When an object is accessed, any occurrence of that object in the trace can be used to predict the rest of the trace.

Trace LoR is different from the traditional LoR types, both of which are static predictors. Whenever an object is accessed, temporal LoR predicts that it will be accessed again in the near future, and spatial LoR predicts that its nearby objects will be accessed in the near future. It is clear that trace LoR differs from temporal LoR. To see how it differs from spatial LoR, note that in spatial LoR, prediction is based on the position of objects in the program address space (a static property), whereas in trace LoR, prediction is based on the position of objects in the program's execution trace, which is a dynamic property. For example, consider the linked list example again. As the linked list elements are spread throughout memory, spatial LoR cannot make any prediction when the list is traversed, but after the first traversal, trace LoR predicts future traversals.
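As an illustration of the linked list argument, consider the following minimal C search routine, a hypothetical stand-in for the functions discussed above. The nodes may be scattered anywhere in memory, yet every search touches them in the same order, so the node-address trace repeats.

#include <stddef.h>

struct node {
    int          key;
    struct node *next;   /* nodes may be scattered across the heap */
};

/* Every call walks the nodes in the same order n1, n2, n3, ...
 * Spatial LoR sees unrelated addresses, but trace LoR learns that an
 * access to n1 is followed by n2, then n3, and can predict the whole
 * traversal after it has been observed once. */
struct node *list_find(struct node *head, int key)
{
    for (struct node *p = head; p != NULL; p = p->next)
        if (p->key == key)
            return p;
    return NULL;
}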
4. The basic model: trace graph

4.1. Definition

To make predictions using trace LoR, one should store the trace of the system. However, this information should be stored in a way that can be used for prediction. In the simplest case, one wishes to predict only the next object and to use only the last occurrence of the current object. For this purpose, the trace graph is introduced.

Trace graph. For each object in the object space, one node is created. Each node has at most one outgoing edge. If the trace contains the sequence a1, a2, an edge is made from a1 to a2, replacing the previous edge leaving a1.

The construction of the graph for the trace a1, a2, a3, a4, a2, a5 is shown in Fig. 4. As the example shows, the graph is constructed incrementally. Initially the graph is empty. When a new object is accessed, a node representing that object is added to the graph. For two consecutive objects, for example a1, a2, the outgoing edge of a1 is set to a2.
Fig. 4. Trace graph construction.
The prediction algorithm using the trace graph is as follows. When an object is accessed, it is looked up in the trace graph. If it is not found, this is the first access to the object and no prediction is made. Otherwise, the object was accessed previously, and the destination node of its outgoing edge is the prediction.

4.2. Analysis

In this section, the trace graph is compared with recency-based prediction (LRU stack) [38]. The LRU stack is maintained as follows: whenever an object is accessed, it is removed from its current location in the stack and pushed on top of the stack. Recency-based prediction then works as follows: whenever an object is removed from the stack, the object that was on top of it is predicted as the next object to be accessed.

The construction of the LRU stack for a1, a2, a3, a4, a2, a5 is shown in Fig. 5. The first row shows the accessed object, the second row shows the stack contents, and the third row shows the prediction for a1. Recency-based prediction fails whenever an object is removed from the stack. In the above example, as long as a2 is not accessed for the second time, the LRU stack works correctly: the trace contains the sequence a1, a2, and the LRU stack predicts a1 → a2. However, when a2 is accessed for the second time, it is removed from its current location, and the LRU stack then makes the wrong prediction a1 → a3.

The LRU stack must be implemented as a doubly linked list, so for each object a next and a previous pointer should be stored. Fig. 6 compares the costs of the LRU stack and the trace graph. It shows that the cost of updating the LRU stack is independent of the prediction result and requires six to eight instructions, depending on whether the accessed item is at the top of the list or in the middle. When the prediction is correct, the trace graph requires two instructions; when it is wrong, it requires three.
Fig. 5. LRU stack construction and its failure.
Fig. 6 also shows that updating the trace graph requires one memory read, whereas the LRU stack requires two. Finally, if the prediction is correct, the trace graph does not require any write access; otherwise, one write access must be done. The LRU stack, however, requires four write accesses in both cases.
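To ground these cost figures, the following C sketches show one plausible implementation of each structure. The exact instruction and access tallies above refer to the authors' measured implementations, so these sketches are only indicative; object ids are assumed to come from a bounded space so the graph can be a flat array.

#include <stdint.h>

#define NOBJ 4096u               /* illustrative bounded object space */
#define NONE UINT32_MAX

/* --- Trace graph: one outgoing edge per node -------------------- */
static uint32_t edge[NOBJ];      /* edge[x]: last observed successor */
static uint32_t last = NONE;     /* previously accessed object       */

void trace_graph_init(void)
{
    for (uint32_t i = 0; i < NOBJ; i++)
        edge[i] = NONE;
}

/* Returns the prediction for the object after obj (NONE if obj was
 * never seen). One read of edge[obj]; a write to edge[last] happens
 * only when the previous prediction was wrong, matching the analysis
 * above. */
uint32_t trace_graph_access(uint32_t obj)
{
    uint32_t pred = edge[obj];
    if (last != NONE && edge[last] != obj)
        edge[last] = obj;        /* replace the single outgoing edge */
    last = obj;
    return pred;
}

/* --- LRU stack: doubly linked list ------------------------------- */
struct lru_node { struct lru_node *prev, *next; };
static struct lru_node *top;     /* most recently used element */

/* Move n to the top of the stack. The splice-out and relink rewrite
 * pointers on every access, regardless of the prediction outcome. */
void lru_touch(struct lru_node *n)
{
    if (n == top)
        return;
    if (n->prev) n->prev->next = n->next;   /* splice n out */
    if (n->next) n->next->prev = n->prev;
    n->prev = NULL;                          /* relink at the head */
    n->next = top;
    if (top) top->prev = n;
    top = n;
}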
Fig. 6. Comparison of LRU stack and trace graph.
Table 1
Used benchmarks

1  ammp             10  eon (rushmeier)    19  gzip (graphic)   28  perlbmk (makerand)
2  art.1            11  equake             20  gzip (log)       29  perlbmk (splitmail)
3  art.2            12  facerec            21  gzip (program)   30  perlbmk (perfect)
4  bzip2 (graphic)  13  fma3d              22  gzip (random)    31  sixtrack
5  bzip2 (program)  14  gcc (166)          23  gzip (source)    32  twolf
6  bzip2 (source)   15  gcc (200)          24  mcf              33  vpr
7  crafty           16  gcc (expr)         25  mesa             34  wupwise
8  eon (cook)       17  gcc (integrate)    26  parser
9  eon (kajiya)     18  gcc (scilab)       27  perlbmk (diffmail)
5. Experiments

To evaluate the effect of prediction using trace LoR on system performance, SimpleScalar [1] was used, extended to extract the required information. The benchmarks were taken from the SPEC CPU 2000 suite, and each was run for 20,000,000 addresses. Table 1 lists the benchmarks used. The prediction accuracy (the percentage of correct predictions out of the total number of predictions) was measured for each program and each of its input sets. Then, for each program, an average was calculated over its different executions (different inputs). Finally, an average prediction accuracy was calculated over all programs. For each benchmark, measurements were made for various configurations. The configuration parameters are:

• Caching. Two cases were considered. In the first case, addresses were delivered to the trace graph before being submitted to the cache, so the benchmark results show the predictability of the address reference stream. In the second case, addresses were submitted to the cache and the miss addresses were submitted to the trace graph; in this manner, the cache absorbs most of the addresses generated by the program [28], yielding a different trace, so the benchmark results show the predictability of the miss address stream. The cache size was 2K and its block size was 16 bytes.
• Access type. Read accesses, write accesses, and mixed (read and write) accesses were considered.
• Object size. Three object sizes were considered. In the first case, byte addresses were used directly. In the second case, they were grouped into 16-byte blocks (corresponding to cache lines), and in the last case, into 4096-byte blocks (corresponding to page frames).

In Section 2, it was discussed that the frequency-based graph models are not practical. Therefore, the only feasible model similar to the trace graph is recency-based prediction (LRU stack). Fig. 7 shows the benchmark results comparing the trace graph and the LRU stack for the different configurations (access type, object size, and caching method). For each configuration, the first bar shows the prediction accuracy of the trace graph, and the second shows that of recency-based prediction. As the figure shows, the trace graph performs better than the LRU stack in all configurations. The largest gap occurs for [address reference stream, read & write, 4096-byte object size]: in this case, 20.3% of LRU stack predictions and 38.8% of trace graph predictions are correct, so the trace graph makes approximately twice as many correct predictions as the LRU stack.
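To make the methodology concrete, the following sketch replays a byte-address trace at a chosen object granularity (shifting away low-order address bits) and scores trace-graph predictions. The bounded, masked object table is an illustrative simplification; this is not the instrumentation actually added to SimpleScalar.

#include <stdint.h>
#include <stddef.h>

#define TBL  (1u << 20)          /* illustrative bounded object table */
#define NONE UINT32_MAX

/* Replay a byte-address trace with objects of 2^size_log2 bytes
 * (0 -> 1-byte, 4 -> 16-byte, 12 -> 4096-byte objects) and return the
 * trace-graph prediction accuracy: correct / total predictions made. */
double measure_accuracy(const uint64_t *trace, size_t n, unsigned size_log2)
{
    static uint32_t edge[TBL];
    size_t made = 0, correct = 0;
    uint32_t last = NONE;

    for (size_t i = 0; i < TBL; i++)
        edge[i] = NONE;
    for (size_t i = 0; i < n; i++) {
        /* group the byte address into an object id */
        uint32_t obj = (uint32_t)(trace[i] >> size_log2) & (TBL - 1);
        if (last != NONE) {
            if (edge[last] != NONE) {          /* a prediction existed */
                made++;
                if (edge[last] == obj)
                    correct++;
            }
            edge[last] = obj;                  /* learn the transition */
        }
        last = obj;
    }
    return made ? (double)correct / made : 0.0;
}

For the miss address stream configurations, the same routine would be fed the addresses that miss in a simulated cache instead of the raw reference stream.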
Fig. 7. Comparing trace graph and LRU stack.
[Fully associative cache, read access, 16-byte objects] is the best configuration for both predictors: 66.8% of LRU stack predictions and 70% of trace graph predictions are correct.

Fig. 8 shows the benchmark results for the trace graph. An interesting result of this study is that the miss address stream is more predictable than the address reference stream for the read and mixed cases. For example, considering [read & write, 16-byte object size], 34.9% of predictions for the address reference stream are correct, while 59.2% of predictions for the miss address stream of a fully associative cache are correct.

As mentioned before, the source of trace LoR is the program behavior. However, when the program is executed, only the addresses generated to access memory locations are available, i.e., byte addresses. Of course, they can be grasped at the cache block level or the memory page frame level. Therefore, there are three different sets of traces: the byte (address) trace, the cache block trace, and the page trace. None of them is the same as the object trace. At first glance, it seems better to use the address trace, as it carries the most detailed information. However, it is not practical: in a real system, it is impossible to store one prediction for each byte. Therefore, we are faced with the prediction granularity problem: what is the best size for the predicted items?

The pitfall of selecting coarse-grain objects is false sharing, that is, two unrelated objects having the same address. In that case, just by watching addresses, the accessed object cannot be identified. This problem is the same as accessing an object from different places, so the same solution can be applied here too. The following example clarifies the situation.
Fig. 8. Trace graph.
Suppose that a1 and a3 are in page p1, a2 in page p2, and a4 in page p3, and consider the following sequence of accesses: a1, a2, a3, a4. The corresponding page trace is p1, p2, p1, p3. With the address trace, after seeing a3, no prediction can be made; but at the page level, p2 is predicted, which is wrong.

On the other hand, using fine-grain items causes some valid predictions to be lost. Suppose array A is located on page pA and array B on page pB, and consider a code fragment that adds the corresponding entries of the two arrays. It produces the trace a1, b1, a2, b2, a3, b3, . . ., whose corresponding page trace is pA, pB, pA, pB, pA, pB, . . .. The address trace fails to show any trace LoR, while the page trace detects it at the array level.

However, as Fig. 8 shows, no uniform conclusion can be drawn about object size: in some cases, increasing the object size increases predictability, while in other cases it decreases it. Note that for the miss address stream, the corresponding bars for the 1-byte and 16-byte object sizes are the same. This is because the cache block size is 16 bytes.

Finally, the figure shows the effect of access type. For the address reference stream, write accesses have the best predictability and mixed accesses the worst. For example, for 4096-byte objects, 53.3% of write accesses, 47.4% of read accesses, and 38.8% of mixed accesses are predictable. However, the situation changes for the miss address stream: in this case, read accesses have the best predictability and write accesses the worst.
Table 2 contains detailed benchmark results. The benchmarks are numbered as in Table 1, and different executions of the same benchmark (with varying inputs) appear as consecutive rows. For each benchmark, results are provided for the following configurations:

• Caching. Two cases are shown: the address reference stream and the miss address stream of a fully associative cache. The results for the four-way set associative cache are very similar to those for the fully associative cache and are not shown.
• Access type. Results are provided for read, write, and mixed access types.

The selected object size is 16 bytes; there are some minor changes if 1-byte or 4096-byte object sizes are considered instead.

Considering the address reference stream, the table shows that the read accesses of the programs have uniform behavior, but the predictability of write accesses is very program-dependent. bzip2, gzip, and mcf are the worst cases: their write accesses are not predictable at all. Excluding these cases, write accesses are much more predictable than read accesses. Mixed accesses have the worst predictability; in other words, differentiating by access type increases predictability.

The situation changes when the miss address stream is considered. In this case, the predictability of read accesses is greatly improved, but the predictability of write accesses decreases in some cases. Again, the bzip2 and gzip programs do not show any write predictability. The table shows that write predictability depends strongly on the specific benchmark; even for a given benchmark, changing the input may affect predictability greatly.
6. Enhancing the model

There are two situations in which more than one trace exists for an object. To illustrate the first, note that a program may not follow its previous behavior exactly: based on the input parameters, a code fragment accesses different objects and follows different execution paths, so the same code fragment produces different traces. As an example, consider the linked list search again, and suppose that each item key is an ordered pair (k1, k2). The code to find an item in this list is shown in Fig. 9. For each element in the linked list, it compares the first components of the keys; if they match, the second components are compared too, and if those match as well, the requested item is found. Suppose the list consists of the elements {A(3, 4), B(3, 10), C(2, 7), D(7, 5)}. The comparisons performed to find (3, 12) and (2, 15) are shown by asterisks in Fig. 10. It is clear that the program trace for finding the second item differs from its trace for finding the first.
Fig. 9. Finding an element in a linked list.
Table 2
Trace graph results for 16-byte objects

            Address reference stream        Miss address stream
Benchmark   R      W      RW               R      W      RW
1           39.1   68.5   33.0             95.1   50.4   93.1
2           43.1   78.6   35.0             66.8   83.4   64.7
3           43.1   78.6   35.0             66.8   83.4   64.7
4           28.6   0      24.5             75.0   0      26.8
5           32.9   0      28.5             77.3   0      28.4
6           27.7   0      24.1             99.8   0      32.4
7           35.9   52.3   33.4             66.1   44.1   65.7
8           46.1   52.4   29.5             73.2   98.1   67.8
9           35.8   47.0   27.7             37.7   69.9   36.8
10          46.7   53.6   33.1             68.1   97.5   63.4
11          44.9   67.7   37.2             48.8   19.8   14.7
12          54.0   66.7   41.9             45.3   92.9   18.7
13          41.3   60.4   31.4             96.4   44.6   95.9
14          41.9   56.7   35.8             69.6   52.6   69.6
15          34.1   59.5   32.9             52.3   65.6   54.8
16          35.9   55.7   32.4             57.8   57.4   59.0
17          33.9   58.8   32.6             54.4   63.2   56.8
18          33.5   59.3   32.8             50.4   62.3   52.8
19          28.6   0      24.5             75.1   0      26.6
20          28.3   0      23.7             83.3   0      26.0
21          32.9   0      28.4             77.3   0      28.5
22          38.0   0      34.9             81.0   0      24.4
23          27.7   0      23.9             99.8   0      32.2
24          38.8   0      0                20.6   0      0
25          46.5   70.6   36.0             90.5   75.5   56.9
26          31.8   37.1   29.3             78.1   27.7   61.0
27          34.0   47.2   26.1             55.8   39.1   56.4
28          38.4   49.9   34.6             43.2   27.3   42.9
29          32.5   47.0   25.7             52.7   41.2   54.0
30          40.0   52.5   33.4             92.9   41.7   92.6
31          53.8   64.1   41.7             81.7   96.4   77.1
32          42.1   68.1   34.4             66.2   38.0   65.9
33          47.6   59.1   42.9             70.2   20.3   64.4
34          99.3   66.8   85.3             83.0   0      87.5
Fig. 10. Traces made for different items.
Another situation in which there is more than one trace for an object occurs when the object is accessed from different places in the program. A good example is looking up an item in a sparse matrix. In a sparse matrix, the elements of each row and each column are stored in separate linked lists, so each matrix entry is located in two different linked lists. Now, if an entry is accessed via its row number, a trace is produced which differs from the one produced when it is accessed via its column number.

This situation also arises when a memory location is used to store different objects in different phases of program execution. The stack is such a case. When a function is called, the memory on top of the stack is allocated to its local variables (and parameters); on return, the memory is freed. If another function is then called, the same memory is allocated to its local variables (and parameters). Conversely, when a function is called again, another frame on top of the stack is allocated to store its local variables (and parameters), and it is quite possible that the newly allocated space differs from the previous one. Thus, different memory locations are used to store the same variable in different phases of program execution, and trace LoR cannot detect the previous behavior.

Excluding the last case (assigning different locations to the same object), we are left with more than one trace per object. To detect the correct trace, some extra information can be used, such as the history of the trace. In this approach, to predict the correct trace, we consider not only the last object seen but also some of its predecessors, so predictions take the form (a1, a2) → a3: if a1 and a2 are both seen (in succession), a3 is predicted as the next object. If an object is accessed from different places in the program, the history can be used to differentiate between the traces. For example, with a 2-history, after observing the trace a1, a2, a3, the prediction (a1, a2) → a3 is stored. Now, when the sequence a1, a2 occurs, a3 is predicted, but no prediction is made for the sequence a10, a2. A minimal sketch of such a 2-history predictor is given after the configuration list below.

It is possible that the behavior of a code fragment changes when its input parameters change. However, most of the time there is a limited number of possibilities for this behavior, so if these possibilities are stored, the current trace will be one of them. An example is looking up an item in a binary search tree. The traversed path depends on the value of the searched item and the values stored in the tree, but for each entry in the tree there are just two possible paths, so storing two predictions suffices.

The results of the benchmarks for history-based prediction are shown in Fig. 11. The measured history lengths are 1, 2, 4, and 8. For each case, the following parameters are used:

• Cache: fully associative, four-way set associative.
• Access type: read, write, mixed (both read and write).
• Object size: 16, 4096 bytes.
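As promised above, here is a minimal 2-history sketch in C. The hash-table size, the pair-key encoding, and the overwrite-on-collision policy are illustrative assumptions; a real predictor table would tune all three.

#include <stdint.h>

#define BITS 16
#define SIZE (1u << BITS)
#define NONE UINT64_MAX

/* tbl maps the pair (previous object, current object) to the object
 * that followed that pair last time: (a1, a2) -> a3. */
static struct { uint64_t key, next; } tbl[SIZE];
static uint64_t prev2 = NONE, prev1 = NONE;

static uint64_t pair_key(uint64_t a, uint64_t b)
{
    return a ^ (b << 1);                 /* cheap, collision-prone tag */
}

static unsigned pair_hash(uint64_t a, uint64_t b)
{
    return (unsigned)((a * 0x9e3779b97f4a7c15u ^ b) >> (64 - BITS));
}

/* Record (prev2, prev1) -> obj, then predict the successor of the new
 * pair (prev1, obj), or NONE if that pair has not been seen. */
uint64_t history2_access(uint64_t obj)
{
    unsigned h;

    if (prev2 != NONE) {                 /* learn (a1, a2) -> a3 */
        h = pair_hash(prev2, prev1);
        tbl[h].key  = pair_key(prev2, prev1);
        tbl[h].next = obj;
    }
    prev2 = prev1;                       /* slide the 2-access window */
    prev1 = obj;
    if (prev2 == NONE)
        return NONE;                     /* not enough history yet */
    h = pair_hash(prev2, prev1);
    return tbl[h].key == pair_key(prev2, prev1) ? tbl[h].next : NONE;
}

Because the key is the whole pair, the sequence a10, a2 hashes to a different entry than a1, a2, which is how the history separates traces that share a suffix.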
The main purpose of this set of benchmarks is to evaluate the effect of history-based prediction. The first bar of each group shows the results for the 1-history case, which is the same as the trace graph; the second bar shows the results for the 2-history case, and so forth. As the figure shows, the best results are obtained for the 4-history case. In the best case, the write access type, regardless of the other parameters, 4-history performs twice as well as 1-history. For example, for [fully associative cache, write access, 4096-byte objects], only 32.8% of the 1-history predictions are correct, but 77.8% of the 4-history predictions are correct.

It is noticeable that the 8-history results are much worse than the 4-history results; in fact, they are very close to the 1-history results. Again considering [fully associative cache, write access type, 4096-byte objects], 32.8% of 1-history predictions and 33.8% of 8-history predictions are correct.
Fig. 11. History-based prediction.
In some cases, 8-history even performs worse than 1-history. This shows that very long histories reduce predictability; in other words, an access does not depend on its very distant history.

As the figure shows, the miss address stream resulting from a fully associative cache is slightly more predictable. For example, considering [4-history, read access type, 16-byte object size], 88.6% of predictions are correct for the fully associative cache, while 80.6% are correct for the four-way set associative cache.

Also, comparing the different object sizes, in all cases except the configuration [2-history, four-way set associative cache, read access], 16-byte objects have better predictability than 4096-byte objects. For example, considering [2-history, fully associative cache, write access type], 78.3% of predictions are correct for 16-byte objects and 59.1% for 4096-byte objects.

The order of access types by predictability is read, mixed, write; that is, focusing only on read accesses gives the best predictability. Note that in some cases the difference between the predictability of read and write accesses is large. The worst case occurs for [1-history, fully associative cache]: for 16-byte objects, 70.0% of read access predictions are correct but only 49.0% of write access predictions; for 4096-byte objects, 60.7% of read access predictions are correct but only 32.8% of write access predictions. This means that an elegant use of access type can increase predictability.

Finally, it should be noted that history-based prediction increases the space requirements drastically. For an object a1, the trace graph stores only one node in the graph.
Fig. 12. Using a prediction table.
For the 2-history case, however, a node is needed for every value of x such that the sequence ax, a1 appears in the system trace. Therefore, there is a tradeoff between space usage and predictability.

In the next set of benchmarks (Fig. 12), each object has a prediction table whose entries are maintained with an LRU policy. For each configuration, three bars are shown. The first bar is the prediction accuracy when the table size is 1, which is the same as the trace graph. The second bar shows the accuracy improvement when the table size is increased to 2, that is, the difference between the prediction accuracies of the 2-entry and 1-entry tables. The last bar shows the difference between the prediction accuracies of the 16-entry and 2-entry tables.

As the figure shows, the first bar is an order of magnitude larger than the others. For example, for [fully associative cache, read access, 16-byte objects], the value of the first bar is 70%, the second is 7.7%, and the last is 5.3%; that is, the prediction accuracy of the 16-entry prediction table is 83% (70% + 7.7% + 5.3%). Therefore, if only one prediction is stored per object (the trace graph), 70% of predictions are correct; if two predictions are stored and both are considered, 77.7% are correct; and if 16 predictions are stored and all are considered, 83% are correct. This observation indicates that the first entry alone is sufficient, and the other entries in the prediction table are largely redundant.

Considering the algorithm costs, the LRU prediction table turns out to be useless: if two predictions are stored for an object, both must be loaded, yet one of them is almost always useless.
Considering the above configuration again, 70% of the predictions come from the first entry and only 7.7% from the second. In this manner, the algorithm cost is doubled, but the number of correct predictions increases by only 7.7%.
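A sketch of the per-object LRU prediction table, to make the cost argument concrete (sizes and layout are illustrative assumptions). Checking a prediction must touch every entry, which is why doubling the table roughly doubles the lookup cost while adding only a few percentage points of accuracy.

#include <stdint.h>
#include <stdbool.h>

#define ENTRIES 2                /* the 2-entry variant from the text */
#define NONE UINT64_MAX

struct pred_table {
    uint64_t entry[ENTRIES];     /* entry[0] is the most recent;
                                    slots are assumed initialized to NONE */
};

/* Does any stored prediction for this object match `next`?
 * All ENTRIES slots must be loaded to answer, even though the
 * measurements above show the first slot supplies almost all hits. */
bool pred_check(const struct pred_table *t, uint64_t next)
{
    for (int i = 0; i < ENTRIES; i++)
        if (t->entry[i] == next)
            return true;
    return false;
}

/* After observing that `next` followed this object, move it to the
 * front of the table; on a miss, the least recent entry is evicted. */
void pred_update(struct pred_table *t, uint64_t next)
{
    int i = 0;
    while (i < ENTRIES - 1 && t->entry[i] != next)
        i++;                     /* slot to shift down from */
    for (; i > 0; i--)
        t->entry[i] = t->entry[i - 1];
    t->entry[0] = next;
}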
7. n-Stride prediction

The trace graph is the simplest use of trace LoR for predicting a system's behavior. This section explores a more elaborate use of trace LoR, which we call n-stride prediction. It should be noted that it differs from the stride prediction discussed in [14]. n-Stride prediction shows that an access can be predicted well before it happens. Note that for a prediction to be useful, not only must the right prediction be made, but there must also be sufficient time to fetch the predicted object. n-Stride prediction provides the ability to predict an object access far ahead of the actual access.

To clarify the concept, suppose that the sequence a0, a1, a2, a3, a4, . . . is in the system trace. When a0 is accessed, the prediction algorithm should predict the xi's in the sequence a0, x1, x2, x3, x4, . . .. In general, the possible predictions for x1 are a1, a2, a3, and so forth. It is also possible to skip x1 and predict x2, whose possible predictions are again a1, a2, a3, and so forth; the situation is the same for x3, x4, and so on. Note that each prediction is independent of the others. Our measurements show that when i ≠ j, the prediction xi = aj is seldom correct; in fact, except for some special cases, the prediction accuracy is zero or near zero. We call predicting xi = ai (i = 1, 2, 3, . . .), based on observing just a0, n-stride prediction. To clarify the concept, suppose that the sequence a0, a1, a2, a3, a4, . . . has happened before, and now a0 is seen (Fig. 13). One-stride means predicting a1 just after a0; two-stride means predicting a2, two accesses after a0, regardless of the object between them; and so on.

Fig. 14 shows the benchmark results of n-stride prediction for i ranging from 1 to 5 (left to right). When i = 1, this is the same as the trace graph. As the figure shows, as the distance between the current object and the predicted object (the value of i) increases, the probability of a correct prediction decreases. However, even for i = 5, it has reasonable values. Over all configurations, the best case is [fully associative cache, read access, 16-byte objects]: for i = 5, 54% of predictions are correct, whereas the one-stride value is 70%. Comparing these values shows that although predictability decreases with increasing i, it usually remains valuable. Interestingly, for 4096-byte objects, two-stride prediction is in most cases better than one-stride prediction. Fig. 14 also shows the effect of the other parameters on n-stride predictability.
Fig. 13. The concept of n-stride.
Fig. 14. n-Stride prediction.
It shows that the miss address stream resulting from a fully associative cache is more predictable than that resulting from a four-way set associative cache, although the difference is small. For example, for [five-stride prediction, read access, 16-byte objects], 54% of predictions are correct for a fully associative cache and 50.4% for a four-way set associative cache.

As Fig. 14 shows, for read or write accesses, 16-byte objects are more predictable than 4096-byte objects. For example, for [five-stride prediction, fully associative cache, write access], 41% of predictions for 16-byte objects are correct, but only 21.8% of predictions for 4096-byte objects. When mixed accesses are considered, for two-stride and four-stride prediction, 4096-byte objects are more predictable.

The other consequence of Fig. 14 is that read accesses are more predictable than write or mixed accesses. For the configuration [five-stride prediction, fully associative cache, 4096-byte objects], 46.9% of read predictions, 23.9% of write predictions, and 32.5% of mixed predictions are correct. Comparing write and mixed accesses, in most cases mixed accesses are better predicted than write accesses, but in some cases write accesses are slightly better predicted than mixed accesses.
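A minimal C sketch of the n-stride bookkeeping, under the same bounded-object-space assumption as the earlier sketches (names and sizes are illustrative). Each object stores the objects observed 1 to n accesses after its last occurrence; on an access, those slots yield predictions for 1 to n steps ahead, giving lead time to prefetch.

#include <stdint.h>

#define NOBJ    (1u << 16)       /* illustrative bounded object space */
#define NSTRIDE 5
#define NONE    UINT32_MAX

/* succ[x][i]: the object seen i+1 accesses after x last time. */
static uint32_t succ[NOBJ][NSTRIDE];
static uint32_t recent[NSTRIDE]; /* recent[i]: object i+1 steps back */
static int nrecent;

void nstride_init(void)
{
    for (uint32_t x = 0; x < NOBJ; x++)
        for (int i = 0; i < NSTRIDE; i++)
            succ[x][i] = NONE;
    nrecent = 0;
}

/* Record obj as the stride-(i+1) successor of each recent object,
 * then fill pred[i] with the forecast for i+1 accesses ahead. */
void nstride_access(uint32_t obj, uint32_t pred[NSTRIDE])
{
    for (int i = 0; i < nrecent; i++)
        succ[recent[i]][i] = obj;            /* learn strides 1..n */
    for (int i = NSTRIDE - 1; i > 0; i--)    /* slide the window */
        recent[i] = recent[i - 1];
    recent[0] = obj;
    if (nrecent < NSTRIDE)
        nrecent++;
    for (int i = 0; i < NSTRIDE; i++)
        pred[i] = succ[obj][i];              /* xi = ai predictions */
}

On a later access to a0, pred[0..4] reproduce a1 through a5 from the previous occurrence, so the stride-5 prediction arrives four accesses before the access it predicts.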
8. Conclusion

In this paper, the concept of trace LoR was developed. If a system's behavior has the trace LoR property, the trace of its accesses can be stored and used to predict its future behavior. A simple model, called the trace graph, was developed for this purpose. The model was then enhanced to improve its effectiveness, and the effects of system parameters on the trace graph were measured.

If there is more than one occurrence of an object in the trace, it is not clear which occurrence should be used. To overcome this problem, the trace graph was enhanced with history and with multiple LRU prediction.

Trace LoR cannot make any prediction when an object is not in the program trace. Clearly, there must be other ways to predict such objects. From the information theory point of view, a program cannot generate random accesses; therefore, there must be a way to detect new objects in the trace.

An important observation of this paper was n-stride prediction. It shows that an access to an object can be detected well before the actual access: by observing an object in the trace, not only the next object but also the subsequent object accesses can be predicted with good probability. This suggests that implementing a data trace cache is worthwhile, although algorithms for predicting the sequence of accesses should be studied in more detail.

To use trace LoR in a real system, the size of the trace graph should be limited. In the cases of paging systems and TLB prediction, the trace graph is satisfactory as is, but for cache prediction its size should be limited, and a replacement algorithm is then required. This is another open issue.

Finally, when more than one thread executes in an address space (for example, in multiprocessor systems), it is an open question whether a separate trace graph should be allocated for each thread. If one trace graph is shared by all threads, their different behaviors may conflict and reduce predictability; on the other hand, they may benefit from each other's past history and make better predictions. This shows that extending our model (or any other model that stores the past history of the system) to multithreaded systems requires careful attention.
References

[1] T. Austin, E. Larson, D. Ernst, SimpleScalar: an infrastructure for computer system modeling, IEEE Comput. 35 (2) (2002) 59–67.
[2] J.L. Baer, T.F. Chen, An effective on-chip preloading scheme to reduce data access penalty, in: Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, New Mexico, USA, November 1991, pp. 176–186.
[3] B.D. Cahoon, K.S. McKinley, Tolerating latency by prefetching Java objects, in: Proceedings of the Workshop on Hardware Support for Objects and Microarchitectures for Java, Texas, USA, October 1999.
[4] B.D. Cahoon, K.S. McKinley, Data flow analysis for software prefetching linked data structures in Java, in: Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques, Barcelona, Spain, September 2001, pp. 280–291.
[5] B. Cahoon, K.S. McKinley, Simple and effective array prefetching in Java, in: Proceedings of the 2002 Joint ACM–ISCOPE Conference on Java Grande, Washington, USA, November 2002, pp. 86–95.
[6] D. Callahan, K. Kennedy, A. Porterfield, Software prefetching, in: Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, CA, USA, April 1991, pp. 40–52.
[7] T.F. Chen, J.L. Baer, Effective hardware-based data prefetching for high performance processors, IEEE Trans. Comput. 44 (5) (1995) 609–623.
[8] T.F. Chen, An effective programmable prefetch engine for high performance processors, in: Proceedings of the 29th International Symposium on Microarchitecture, Ann Arbor, MI, November 1995.
[9] C. Noga, LRU is better than FIFO, in: Proceedings of the SODA: ACM–SIAM Symposium on Discrete Algorithms (A Conference on Theoretical and Experimental Analysis of Discrete Algorithms), 1998.
[10] R. Cooksey, S. Jourdan, D. Grunwald, A stateless, content-directed data prefetching mechanism, in: Proceedings of the 10th Annual International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, CA, October 2002, pp. 279–290.
[11] P.J. Denning, The working set model for program behavior, Commun. ACM 11 (5) (1968) 323–333.
[12] K. Farkas, P. Chow, N. Jouppi, Z. Vranesic, Memory-system design considerations for dynamically scheduled processors, in: Proceedings of the 24th Annual International Symposium on Computer Architecture, June 1997.
[13] F. Rosen, Experimental studies of access graph-based heuristics: beating the LRU standard, in: Proceedings of the SODA: ACM–SIAM Symposium on Discrete Algorithms (A Conference on Theoretical and Experimental Analysis of Discrete Algorithms), 1997.
[14] J.W.C. Fu, J.H. Patel, B.L. Janssens, Stride directed prefetching in scalar processors, in: Proceedings of the 25th International Symposium on Microarchitecture, Portland, OR, December 1992, pp. 102–110.
[15] S. Ghosh, M. Martonosi, S. Malik, Precise miss analysis for program transformations with caches of arbitrary associativity, in: Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, CA, October 1998, pp. 228–239.
[16] E.H. Gornish, A.V. Veidenbaum, An integrated hardware/software data prefetching scheme for shared-memory multiprocessors, in: Proceedings of the 1994 International Conference on Parallel Processing, St. Charles, IL, 1994, pp. 281–284.
[17] J. Griffioen, R. Appleton, Reducing file system latency using a predictive approach, in: Proceedings of the USENIX Summer, 1994, pp. 197–207.
[18] H. Han, C.W. Tseng, A comparison of locality transformations for irregular codes, in: Proceedings of the Fifth Workshop on Languages, Compilers, and Runtime Systems for Scalable Computers, May 2000.
[19] Q. Jacobson, E. Rotenberg, J.E. Smith, Path-based next trace prediction, in: Proceedings of the International Symposium on Microarchitecture, 1997, pp. 14–23.
[20] Q. Jacobson, J.E. Smith, Trace preconstruction, in: Proceedings of the 27th Annual International Symposium on Computer Architecture, 2000, pp. 37–46.
[21] T.L. Johnson, D.A. Connors, M.C. Merten, W.W. Hwu, Run-time cache bypassing, IEEE Trans. Comput. 48 (12) (1999) 1338–1354.
[22] D. Joseph, D. Grunwald, Prefetching using Markov predictors, IEEE Trans. Comput. 48 (2) (1999) 121–133.
[23] N. Jouppi, Improving direct-mapped cache performance by the addition of a small fully associative cache and prefetch buffers, in: Proceedings of the 17th Annual International Symposium on Computer Architecture, May 1990.
[24] M. Karlsson, F. Dahlgren, P. Stenstrom, A prefetching technique for irregular accesses to linked data structures, in: Proceedings of the HPCA, 2000, pp. 206–217.
[25] N. Kohout, S. Choi, K. Dongkeun, D. Yeung, Multi-chain prefetching: effective exploitation of inter-chain memory parallelism for pointer-chasing codes, in: Proceedings of the 10th Annual International Conference on Parallel Architectures and Compilation Techniques (PACT'01), September 2001.
[26] C. Lattner, V. Adve, Automatic pool allocation for disjoint data structures, in: Proceedings of the ACM SIGPLAN Workshop on Memory System Performance, Berlin, Germany, June 2002.
[27] C. Luk, T.C. Mowry, Compiler-based prefetching for recursive data structures, in: Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, MA, October 1996, pp. 222–233.
[28] A. Milenkovic, M. Milenkovic, N. Barnes, A performance evaluation of memory hierarchy in embedded systems, in: Proceedings of the IEEE Southeastern Conference on System Theory, Morgantown, WV, USA, March 2003.
[29] T.C. Mowry, S. Lam, A. Gupta, Design and evaluation of a compiler algorithm for prefetching, in: Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, Boston, MA, September 1992.
[30] J. Oly, D.A. Reed, Markov model prediction of I/O requests for scientific applications, in: Proceedings of the First Conference on File and Storage Technologies, January 2002.
[31] S. Palacharla, R.E. Kessler, Evaluating stream buffers as a secondary cache replacement, in: Proceedings of the 21st International Symposium on Computer Architecture, April 1994.
[32] A. Ramirez, J.L. Larriba-Pey, C. Navarro, J. Torrellas, M. Valero, Software trace cache, in: Proceedings of the International Conference on Supercomputing, 1999, pp. 119–126.
[33] Z. Rosen, Access graph-based heuristics for on-line paging algorithms, Master's Thesis, Science Department, Tel-Aviv University, December 1996.
[34] E. Rotenberg, S. Bennett, J.E. Smith, Trace cache: a low latency approach to high bandwidth instruction fetching, in: Proceedings of the International Symposium on Microarchitecture, 1996, pp. 24–35.
[35] E. Rotenberg, Q. Jacobson, Y. Sazeides, J. Smith, Trace processors, in: Proceedings of the International Symposium on Microarchitecture, 1997, pp. 138–148.
[36] A. Roth, A. Moshovos, G.S. Sohi, Dependence based prefetching for linked data structures, ACM SIGPLAN Notices 33 (11) (1998) 115–126.
[37] A. Roth, G. Sohi, Effective jump-pointer prefetching for linked data structures, in: Proceedings of the 26th International Symposium on Computer Architecture, Atlanta, GA, May 1999, pp. 111–121.
[38] A. Saulsbury, F. Dahlgren, P. Stenstrom, Recency-based TLB preloading, ACM SIGARCH Comput. Architect. News 28 (2) (May 2000).
[39] T. Sherwood, S. Sair, B. Calder, Predictor-directed stream buffers, in: Proceedings of the 33rd International Symposium on Microarchitecture, Monterey, CA, December 2000, pp. 42–53.
[40] J. Skeppstedt, M. Dubois, Hybrid compiler/hardware prefetching for multiprocessors using low-overhead cache miss traps, in: Proceedings of the 1997 International Conference on Parallel Processing, Bloomington, August 1997, pp. 298–307.
[41] I. Sklenar, Prefetch unit for vector operations on scalar computers, in: Proceedings of the 19th International Symposium on Computer Architecture, Gold Coast, Qld., Australia, May 1992, pp. 31–37.
[42] A.J. Smith, Cache memories, Comput. Surveys 14 (3) (1982).
[43] S.P. VanderWiel, D.J. Lilja, A compiler-assisted data prefetch controller, in: Proceedings of the International Conference on Computer Design, October 1999.
[44] S.P. VanderWiel, D.J. Lilja, Data prefetch mechanisms, ACM Comput. Surveys 32 (2) (2000) 174–199.
[45] Z. Wang, D. Burger, et al., Guided region prefetching: a cooperative hardware/software approach, in: Proceedings of the 30th Annual International Symposium on Computer Architecture, San Diego, CA, June 2003.
[46] Y. Wu, Efficient discovery of regular stride patterns in irregular programs and its use in compiler prefetching, in: Proceedings of the SIGPLAN 2002 Conference on Programming Language Design and Implementation, Berlin, Germany, June 2002, pp. 210–221.
[47] Z. Zhang, J. Torrellas, Speeding up irregular applications in shared memory multiprocessors: memory binding and group prefetching, in: Proceedings of the 22nd International Symposium on Computer Architecture, Santa Margherita Ligure, Italy, June 1995, pp. 1–19.
Ali Mahjur received his B.S. and M.S. degrees in computer engineering from Sharif University of Technology (SUT), Iran, in 1996 and 1998, respectively. He has been a Ph.D. student in computer engineering at SUT since then. His research interests include Computer Architecture, Operating Systems, Memory Management Systems, and Programming Languages.
Amir Hossein Jahangir received his Ph.D. degree in industrial informatics from the Department of Electrical Engineering, Institut National des Sciences Appliquées, Toulouse, France, in 1989. Since then, he has been with the Department of Computer Engineering, Sharif University of Technology, Iran, where he has taught several hardware architecture courses and supervised related research projects. From 1990 to 1994 he was the head of the department and has held several other responsibilities since. His research interests include High Performance Computer Architectures, Analysis of Network Devices, and the design of Real-Time and Fault-Tolerant systems.
Amir Hossein Gholamipour will receive his B.Sc. from the Department of Computer Engineering, Sharif University of Technology, Iran, by June 2005. His research interests include Computer Architecture, Real-Time Systems, and Embedded Systems.