Performance Evaluation 65 (2008) 382–395 www.elsevier.com/locate/peva

Adaptive prefetching algorithm in disk controllers

Qi Zhu a,∗, Erol Gelenbe b, Ying Qiao c

a University of Houston-Victoria, TX 77901, United States
b Imperial College London, SW7 2BT, United Kingdom
c Virginia Polytechnic Institute and State University, VA 24060, United States

Received 29 June 2005; received in revised form 28 September 2007; accepted 4 October 2007. Available online 9 October 2007.

Abstract

A disk caching algorithm is presented that uses an adaptive prefetching scheme to optimize the system performance in disk controllers for traces with different data localities. The algorithm uses on-line measurements of disk transfer times and of inter-page fault rates to adjust the level of prefetching dynamically, and its performance is evaluated through trace-driven simulations using real workloads. The results confirm the effectiveness and efficiency of the new adaptive prefetching algorithm. Published by Elsevier B.V.

Keywords: Caching; Prefetching; Trace-driven; Disk controllers; Performance analysis

∗ Corresponding address: Department of Computer Science, University of Houston-Victoria, 3007 N Ben Wilson, Victoria, TX 77901, United States. Tel.: +1 361 570 4312; fax: +1 361 570 4207. E-mail addresses: [email protected] (Q. Zhu), [email protected] (E. Gelenbe), [email protected] (Y. Qiao).

1. Introduction

As a result of the ever-widening gap between microprocessors and disks, a computer system's performance is increasingly limited by disk I/O activity [16]. Thus, disk caching, prefetching, and other techniques are used to improve disk I/O performance.

1.1. Background

Modern hard disk drives have their own built-in cache used for many tasks such as prefetching, buffering, and caching [7,10]. Today, a 2 MB built-in cache is common in retail low-end IDE/ATA drives, and some SCSI drives are now available with 16 MB. Disks have special electronic processors, called controllers, that manage the storage and retrieval of data to and from the disk and affect disk caching efficiency and scheduling optimization [17]. Previous research has found that caching and prefetching can be used in disk controllers to improve I/O performance and shorten the latency of references to the I/O subsystem [4,21]. Caching strategies try to keep actively referenced blocks in the cache. Prefetching is an improvement built on caching: prefetching strategies load missing memory blocks from slow secondary storage, e.g. a disk, into memory before they are actually referenced. The performance of prefetching depends strongly on the accuracy of the predictions.

If the disk cache can accurately predict future references and load the appropriate data from the disk in advance, it reduces the miss ratio and the I/O speed lag. It has been shown [25,28] that prefetching can improve the performance of I/O-intensive systems. However, if the accuracy of the prediction is poor, performance can actually decrease due to cache pollution and channel congestion [21].

1.2. Related work

Our paper concentrates on the analysis of prefetching, which can be implemented in an individual disk controller or in multi-device disk controllers. Prefetching involves two fundamental decisions: deciding when to begin a prefetch operation and deciding which page to replace. Most early cache prefetching approaches, such as sequential look-ahead [2,26], prefetch explicit data either through adjacent blocks or through the use of large cache block sizes. However, if the spatial locality of disk references is poor (for example, when accesses to data are random rather than patterned), such simple prefetching schemes actually decrease performance, since they prefetch blocks that are never referenced [27]. More recently, new approaches to caching, paging, and prefetching (see [4,22] for example) based on the framework of competitive analysis [9] have led to new insights and practical improvements. In particular, [4,5] provided a basic framework based on four guiding principles for developing efficient prefetching algorithms and proposed the Aggressive and Conservative prefetching algorithms based on these principles. Subsequently, [22] proposed the Reverse Aggressive prefetching algorithm for multiple disks used in parallel. After that, [1] used sophisticated techniques to devise an optimal prefetching schedule for any given sequence in the stall-time model for the single-disk case. The algorithms proposed in [1,22] are inherently infeasible in practice because of the complex techniques they use to determine when to issue a fetch command. Hence, as practical alternatives to Conservative and Aggressive, researchers have proposed new prefetching algorithms such as forestall (FS) [23], Just-In-Time (JIT) [25], and Bounded Depth (BD) [18], which vary in the way they apply the guiding principles of [4]. The common feature of these algorithms is that they use rules to guide each fetch/eviction decision. Unfortunately, the experiments did not show promising performance in practice.

Most papers on disk prefetching have assumed a locality of reference very similar to that of memory references generated by the CPU. In contrast, [14] proposed an adaptive prefetching approach that performs predictive loading based on historical timing information. An adaptive table is used to control the prefetching, storing information about the order of past disk accesses, which is used to accurately predict future access sequences. Their results show that a cache with the adaptive prefetching mechanism can reduce the average time to service a disk request by a factor of up to three, relative to an identical disk cache without prefetching. However, this algorithm is complicated, since it needs to know parameters such as the branch factor, look-ahead level, weight ceiling, weighting function, and fetch threshold. Moreover, their adaptive table needs too much space for storing parameter-related data. In [13], we proposed a simple adaptive prefetching approach for a single-process execution model. The algorithm uses on-line measurements of disk transfer times and of inter-page fault rates to adjust the level of prefetching dynamically. The simulations show that it can provide effective performance improvements for some synthetic workloads. In this paper, we further study this algorithm for multiprocess execution models and use trace-driven simulations to evaluate its effectiveness on real workloads.

2. Adaptive prefetching algorithm

The effectiveness of disk caching schemes has often been measured in terms of hit rate. However, hit rate is a relatively poor metric because it cannot be converted directly into a run-time value without knowing a great deal about the structure of the application and the simulation model. In general, hit rate may not be well correlated with real speedup, especially in multiprocessors with non-uniform memory architecture (NUMA), where an increased hit rate may come at an unacceptably high cost in terms of memory bandwidth. Furthermore, hit rate does not cover all significant aspects of a caching scheme that includes prefetching of data. In our paper the system efficiency η, defined as the process (CPU) time divided by the total real run time, is used as the performance metric. The total run time for a process is the total reference time plus the sum of the I/O blocking times. Thus the goal of a prefetching policy is to maximize η, or equivalently to minimize the ratio of the total elapsed time to the total computing time.
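To make the metric concrete, the following minimal sketch (Python; the function name and variables are illustrative and not taken from the paper) computes η from a measured compute time and the individual I/O blocking intervals of one process.

```python
def system_efficiency(compute_time_ms, io_blocking_times_ms):
    """Efficiency eta = process (CPU) time / total real run time,
    where the total run time is compute time plus all I/O blocking time."""
    total_run_time = compute_time_ms + sum(io_blocking_times_ms)
    return compute_time_ms / total_run_time

# Example: 800 ms of compute and 200 ms of I/O blocking gives eta = 0.8.
print(system_efficiency(800.0, [120.0, 50.0, 30.0]))
```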

We first summarize the adaptive control prefetching algorithm for the single programmed virtual memory model that we proposed in [13]. Subsequently, we extend this adaptive control prefetching algorithm to multiprogramming models using a "parallel processing" approximation.

2.1. Single programmed memory modeling

In [3,11], the life-time function e(s) is defined as the mean process time spent by a process within an address space of size s, between two successive page faults of the process outside s. This quantity can be determined experimentally either from a single program or from a benchmark of programs. When a single process is running on a virtual memory machine, an amount of space s_m is allocated in main memory and s_c in the disk cache for the process. In [13], an equation is given for the efficiency η as the ratio of "useful" time spent by the program in the compute phase to the total time spent, including the time spent transferring pages to and from secondary memory. Let η = 1/(1 + H), where H is a performance metric representing the ratio of the time spent in page transfers to the life-time spent within main memory and disk cache (Eq. (7) in the Appendix). In order to optimize the program's execution by reducing the time the program spends in page transfers, we need to maximize η or, equivalently, minimize the quantity H. We therefore consider the variation of η with the degree of prefetching f to obtain an adaptive design of a prefetching algorithm. Here the number f of pages following the missing page defines the degree of prefetching. Through the derivations and approximations shown in [13] (see the Appendix), we have H′ < 0 (H′ is the derivative of H with respect to f) if and only if

∂T_D/∂f < [∂e(s_m + s_c)/∂f] · T_D/e(s_m + s_c),    (1)

where T_D is the time taken in page transfers, from disk to main memory, as a result of a page fault.

2.2. The adaptive prefetching algorithm

The purpose of the adaptive prefetching algorithm is to vary f so as to minimize H (maximize η). We use the well-known adaptive control rule

f_new = f_old − δ (∂H/∂f)|_{f = f_old},    (2)

where δ is a positive constant. This approach implies that when the difference of disk transfer times is larger than the difference of inter-page fault times, that is, when the growth rate of the life-time is less than the growth rate of the disk service time, we have H′ > 0 from Eq. (1); Eq. (2) then recommends that f be decreased, because more prefetched pages will not help to improve the performance. On the other hand, if the difference of disk transfer times is smaller than the difference of inter-page fault times, i.e. the life-time grows faster than the disk service time, we have H′ < 0 and f should be increased to improve the performance. In practice, the adaptive algorithm makes successive decisions about the value of f. Let these successive values be denoted by {f^1, f^2, . . . , f^k, . . .}. Suppose that the measured values of e(s_m + s_c) and T_D in the intervals between these changes are denoted by {e^1, e^2, . . . , e^k, . . .} and {T^1, T^2, . . . , T^k, . . .}, so that e^1 results from the value f^1, etc. Then the algorithm proceeds as follows to determine f^{k+1}:

– If [T^k − T^{k−1}] < [e^k − e^{k−1}] · T^k/e^k − ε, then f^{k+1} > f^k.
– If [T^k − T^{k−1}] > [e^k − e^{k−1}] · T^k/e^k + ε, then f^{k+1} < f^k.
– If |[T^k − T^{k−1}] − [e^k − e^{k−1}] · T^k/e^k| ≤ ε, then f^{k+1} = f^k,

where ε is a small positive constant that defines a range of values of the difference between the left- and right-hand sides of inequality (1) within which it is not worth making any change in the degree of prefetching. The simulations we report in this paper set ε = 0 and change the number of prefetched pages in steps of 1.
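As an illustration of these rules, here is a minimal sketch of the adjustment step (Python; the function name, the clamping bounds f_min and f_max, and the step argument are assumptions added here for safety and are not part of the paper's algorithm).

```python
def adjust_prefetch_depth(f, T_prev, T_curr, e_prev, e_curr,
                          eps=0.0, step=1, f_min=0, f_max=64):
    """One adaptation step: compare the change in measured disk transfer
    time T with the change in measured life-time e, scaled by T^k / e^k,
    and move the prefetch depth f by one step accordingly."""
    dT = T_curr - T_prev                      # [T^k - T^{k-1}]
    dE = (e_curr - e_prev) * T_curr / e_curr  # [e^k - e^{k-1}] * T^k / e^k
    if dT < dE - eps:
        return min(f + step, f_max)   # life-time grows faster: prefetch more
    if dT > dE + eps:
        return max(f - step, f_min)   # transfer cost grows faster: prefetch less
    return f                          # inside the dead band: keep f unchanged
```

With ε = 0 and a step of 1, as in the simulations reported below, the rule simply moves f up or down by one page per measurement interval.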

2.3. Multiprogrammed memory modeling

To ensure the efficient use of resources, multiprogramming is employed in computer systems to allow a number of users simultaneous access to shared facilities [6,8]. We therefore extend the prefetching algorithm to multiprogrammed memory models. To simplify the problem, a parallel processing approximation [12] is assumed in our analysis. We assume that all processes execute independently at all resources (e.g., CPU, disk), while the controller and the memory are shared. That is, each process can execute at a resource as soon as it is ready, without queuing. This assumption is valid if we have a machine with several CPUs, or one where processes are not CPU-bound and disk declustering or striping has been used for I/O parallelism on multiple disks, as in RAID (Redundant Array of Independent Disks).

2.3.1. Identical programs

Suppose there are K identical programs that execute independently. The system throughput is then

θ ≅ (1/c) [1 − (1 − η)^K],

where c is the average total process execution time, which is independent of the degree of prefetching f, and η is the CPU utilization of a single process. The variation of θ with the degree of prefetching f is then

∂θ/∂f ≅ ∂{(1/c)[1 − (1 − η)^K]}/∂f = (K/c) (1 − η)^{K−1} ∂η/∂f.

Since 0 ≤ η ≤ 1, the sign of ∂θ/∂f is the same as the sign of ∂η/∂f. Since η = 1/(1 + H), we have

∂η/∂f = −[1/(1 + H)²] · ∂H/∂f.

From the above equation and Eq. (1), θ′ > 0 if and only if H′ < 0, that is, when

∂T_D/∂f < [∂e(s_m + s_c)/∂f] · T_D/e(s_m + s_c).    (3)

Comparing inequality (3) with inequality (1), we see that the algorithm obtained with the parallel processing approximation for identical programs is the same as the algorithm for the single programmed model.

2.3.2. Modeling for different program characteristics

When the K programs have different life-time functions e_1(s_m + s_c), e_2(s_m + s_c), . . . , e_K(s_m + s_c), the approximate analysis conducted above takes a somewhat different form. The system throughput is

θ ≈ (1/c) [1 − Π_{i=1}^{K} (1 − η_i)]    (4)

and

∂θ/∂f ≈ (1/c) [ (1 − η_2)(1 − η_3) · · · (1 − η_K) ∂η_1/∂f + (1 − η_1)(1 − η_3) · · · (1 − η_K) ∂η_2/∂f + · · · + (1 − η_1) · · · (1 − η_{K−1}) ∂η_K/∂f ].    (5)

From (5), we conclude that there is no necessary and sufficient condition for ∂θ/∂f > 0 when all the different life-time functions share a single scalar prefetching variable f. However, if each program i (1 ≤ i ≤ K) has its own prefetching variable f_i, giving a vector f = (f_1, f_2, . . . , f_K) for the K programs, we have the following theorem.

Theorem. The multiprogrammed system throughput θ in Eq. (4) is maximized by a vector f = (f_1, f_2, . . . , f_K) for the K processes in which each f_i (1 ≤ i ≤ K) independently maximizes the corresponding single-process efficiency η_i.

Proof. As discussed in Section 2.2, when the adaptive prefetching algorithm is applied to a single process, each f_i makes the corresponding η_i optimal. Therefore, the vector f guarantees that each of η_1, η_2, . . . , η_K is maximized independently. Since 0 ≤ η_i ≤ 1, each (1 − η_i) is minimal, and thus the product Π_{i=1}^{K} (1 − η_i) is minimal. By Eq. (4), the system throughput θ is then maximum. □

3. Simulations using real workloads

We present our simulation results in this section. The simulations were conducted on a Sun SPARC Ultra AXI server with 1 GB RAM running Solaris 8.

3.1. Simulation models

For the single programmed memory model, the computer system's disk is simulated in detail, taking the following variables into account:

– Access time: The access time has two components, seek time and rotational latency. The simulator uses a 3000-rpm disk with an average seek time of 20 ms and an average rotational latency of 10 ms.
– Transfer time: In our disk model, we assume that the size of a disk sector (or disk page) is 512 B and the transfer rate is 50 MB/s, so it takes 10 ms to transfer one sector of data.
– Drive I/O queue: In our simulation, the disk drive has an FCFS queue of pending I/O requests associated with it. The depth of this queue is a function of the speed with which requests can be serviced as well as the rate at which new requests are added to the queue.

For the multiprogrammed memory model, we use a RAID-5 (Redundant Array of Independent Disks) model, which provides storage with high parallel I/O performance [7]. For every disk in the RAID we use the same set of disk parameters as in the single programmed memory model. Furthermore, two shared-memory processors are used to satisfy our assumption of the parallel processing approximation.

3.2. Real workload generation

Trace-driven memory simulation uses a program that simulates the behavior of a proposed memory system design; a sequence of actual memory references is applied to the simulator to mimic the way a real processor would exercise the design during program execution [30]. Code annotation at the executable level is probably the most popular form of trace collection because of its low cost, convenience to the end user, and high speed. Furthermore, recent developments enable it to collect multiprocess and kernel references. With code annotation at the executable level, it is not necessary to annotate a collection of source and/or object files to produce the final program. Instead, a single command applied to one executable file image generates the desired annotated program.

3.2.1. VMTrace

The tool with which we trace each program is VMTrace, a portable tracing tool for SPARC-v8/Solaris and i386/Linux machines developed by Kaplan [19,20]. VMTrace emits a complete copy of each page as it is referenced, producing page-image traces. The page-image traces are really LRU behavior sequences. An LRU behavior sequence contains the paging traffic for an LRU memory of some fixed size. Thus, for an M-page memory, for every reference that misses in that memory, the behavior sequence contains a ⟨fetch, evict⟩ pair. That is, each record in the behavior sequence contains both the page that was referenced, and therefore must be fetched, and the page that would be evicted from the M-page LRU memory. (A minimal sketch of how such a behavior sequence is produced is given at the end of Section 3.2.2.)

3.2.2. Test suite

Benchmarks are widely used as a performance evaluation tool because of industrial interest in their usage, the realistic nature of the benchmarks, and acceptable code portability.
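To make the ⟨fetch, evict⟩ records of Section 3.2.1 concrete, the following is a minimal sketch (Python) of how a behavior sequence can be derived from a raw page reference string for an M-page fully associative LRU memory; it only illustrates the concept and is not VMTrace itself.

```python
from collections import OrderedDict

def behavior_sequence(references, M):
    """Return the <fetch, evict> pairs produced by an M-page LRU memory;
    evict is None while the memory is still filling up."""
    lru = OrderedDict()   # page -> None, ordered from least to most recently used
    sequence = []
    for page in references:
        if page in lru:
            lru.move_to_end(page)                 # hit: refresh recency only
            continue
        evicted = None
        if len(lru) >= M:
            evicted, _ = lru.popitem(last=False)  # drop the least recently used page
        lru[page] = None
        sequence.append((page, evicted))          # one record per miss
    return sequence

# Example with a 2-page memory:
print(behavior_sequence([1, 2, 1, 3, 4, 2], M=2))
# -> [(1, None), (2, None), (3, 2), (4, 1), (2, 3)]
```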

Table 1
Parameters of traces

Parameter   Gcc       Eqntott   Tomcatv
L           150       101       783
R           242 661   230 241   384 456
M           10        10        10
C           30        30        30
LN          5         200       5

For the evaluation we conducted, the following program traces from SPEC (the Standard Performance Evaluation Corporation) [29] are gathered and input into our simulator. Each of the benchmark programs is linked with VMTrace, given an input, and executed to provide us with the reference traces we need. These traces are:
– gcc: the GNU C compiler compiling preprocessed source files;
– eqntott: a program that builds a truth table from a boolean expression;
– tomcatv: a vectorized mesh generation program.
Since the purpose of our simulation is to evaluate the proposed adaptive prefetching algorithm, we are only concerned with data reference traces; thus, the instruction traces are removed from our trace sets. Table 1 lists the significant parameters of the three traces and some parameters we used for the simulation: L is the total number of logical pages; R is the total number of data references; M and C are the numbers of pages allocated in main memory and in the disk cache; LN is the number of times each trace is looped, chosen to guarantee that we obtain enough misses during the simulation.

3.3. Selecting the pages to be prefetched

Typically, program references tend to dwell in relatively small areas over fairly long time periods; this is locality of reference [15]. A good descriptor of locality of reference is the working set, which identifies the pages of a program being referenced during a relatively long period of program execution. The working-set size indicates how many distinct pages are referenced during that period. In order to increase the accuracy of prefetching, the sequence of fetches performed in the past must be analyzed. Ideally, for each disk block we would keep a record of which blocks were accessed immediately after it in the past. However, if the window size of the past history is too big, the prefetching will needlessly increase the space–time execution cost. Our scheme uses a heuristic mechanism to predict the next most probable disk accesses using the recent reference history. The approach uses a large table structure with one entry for each of the blocks on disk; the content of each cell is a block number, as described in Fig. 1. This table is directly accessed using the current block number as the index. Each entry holds a stack of the m most probable successors, and the nearer a field is to the left, the higher the probability that it will be referenced after the current index block is accessed. For a 40 MB program, using a block size of 512 B, m = 10 and an integer (2 B) to record each block number, we need (40 MB/0.5 KB) × 10 × 2 B = 1600 KB = 1.6 MB of storage space, which is 4% of the program size. Of course, the data in the table are not static: they are constantly adapted to the disk accesses that are made. An extension of the stack algorithm governs the update of the table. For a sequence of references (. . . , r_{i−m}, r_{i−m+1}, . . . , r_{i−1}, r_i, . . .), which is the order of physical block requests to the disk, the pseudo-code of the updating algorithm on a two-dimensional array A[ ][ ] is shown in Fig. 2 for the case where the current disk reference is r_i; an illustrative sketch of the update rule is also given at the end of this subsection.

Here is an example of the adaptive table updating. Suppose m = 4 and the sequence of references is (. . . , 135, 52, 4, 121, 79, . . .); the adaptive tables before and after the disk access to block 79 are shown in Fig. 3. Since 79 is the first successor of block 121, it is moved from the previous 4th probable successor to the 1st probable successor of 121 in the table. Since 79 is the second successor of block 4 and was not originally in that entry, it is inserted as the 2nd probable successor. As the third successor of block 52, 79 keeps its position in the table unchanged. Finally, since 79 is the fourth successor of block 135, it is added in the 4th probable place in the table. Using this algorithm, the adaptive table always contains entries updated from the most recent reference history to predict the most probable next blocks for each block on disk. When the prefetching algorithm is applied, it checks the adaptive table and prefetches the pages in the first f fields of the current block's entry.
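Since the pseudo-code of Fig. 2 is not reproduced here, the following is a hedged sketch (Python) of one plausible reading of the update rule, reconstructed from the worked example of Fig. 3: for the block referenced j steps earlier, the current block is promoted to, or inserted at, position j of that block's successor row. The class name, the bounded history, and this exact rule are assumptions, not the authors' code.

```python
from collections import defaultdict, deque

class AdaptiveTable:
    """Successor table: one row per disk block, holding up to m probable
    successors ordered from most to least probable (left to right)."""

    def __init__(self, m=10):
        self.m = m                      # successors kept per block
        self.table = defaultdict(list)  # block number -> ordered successor list
        self.history = deque(maxlen=m)  # last m referenced blocks, most recent first

    def reference(self, block):
        # Update the rows of the m previously referenced blocks.
        for j, prev in enumerate(self.history, start=1):
            row = self.table[prev]
            if block in row:
                p = row.index(block) + 1      # current 1-based position
                if p > j:                     # too far right: promote to position j
                    row.remove(block)
                    row.insert(j - 1, block)
            else:
                row.insert(j - 1, block)      # not present: insert at position j
                del row[self.m:]              # keep at most m successors
        self.history.appendleft(block)

    def prefetch_candidates(self, block, f):
        # The first f fields of the current block's row are the prefetch targets.
        return self.table.get(block, [])[:f]
```

Applied to the table state shown in Fig. 3, this rule reproduces the four cases described above: promotion in the row of 121, insertion at position 2 in the row of 4, no change in the row of 52, and insertion at position 4 in the row of 135.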

Fig. 1. The adaptive table with m most probable next blocks.

Fig. 2. The algorithm for the updates of the adaptive table.

3.4. Simulations for a single process

In this section, experimental results are presented to evaluate the proposed adaptive prefetching algorithm. We first simulate the adaptive prefetching algorithm for a single process. Figs. 4–9 show the resulting life-time of the prefetched pages and the system efficiency (η) versus the number of prefetched pages (f) under the static and adaptive prefetching algorithms for programs gcc, tomcatv, and eqntott. Figs. 4, 6 and 8 show the life-time versus f for programs gcc, tomcatv, and eqntott, respectively. At first all the curves rise, reaching their maxima at f = 4, 2, 6, respectively; beyond these points the curves oscillate around a stable value, showing that prefetching more pages would not improve the performance. Figs. 5, 7

Fig. 3. Example of the adaptive table when disk block 79 is referenced in the reference sequence (. . . , 135, 52, 4, 121, 79, . . .).

Fig. 4. Gcc: Life-time versus f .

Fig. 5. Gcc: System efficiency versus f .

Fig. 6. Tomcatv: Life-time versus f .

Fig. 7. Tomcatv: System efficiency versus f .

and 9 show the system efficiency versus f. The maximum values are attained at f = 4, 2, 6, respectively. Beyond these points the efficiency decreases, because the life-time values become smaller while the disk service times increase nearly linearly as f increases. The values achieved by the adaptive prefetching algorithm (the horizontal lines) in the six figures are all near-optimal. The average numbers of prefetched pages produced by the adaptive prefetching algorithm for programs gcc, eqntott, and tomcatv are 3.9, 5.3, and 1.8, respectively.

We selected programs gcc, eqntott, and tomcatv because they represent different levels of data locality: the data locality of eqntott is good, that of gcc is average, and that of tomcatv is the worst. Through the simulations, we want to check whether our adaptive algorithm can select the prefetch depth adaptively for programs with different data locality and whether the resulting performance is optimal. In [24], Lipasti et al. proposed a concept called value locality, defined as "the likelihood of a previously-seen value recurring repeatedly within a storage location." In their research, they also used our test benchmark set to explore value locality. Their results showed that the value locality of gcc is 43%, of eqntott is 70%, and of tomcatv is

Fig. 8. Eqntott: Life-time versus f .

Fig. 9. Eqntott: System efficiency versus f .

10%. This gives the value-locality ordering eqntott > gcc > tomcatv, which is exactly the ordering we obtain for the average number of prefetched pages, supporting our simulation results.

3.5. Simulations for multiple processes

Unfortunately, the tool VMTrace is not capable of monitoring multiprocess workloads or the operating system kernel. However, since we are only interested in testing the adaptive prefetching algorithm in the presence of data traces, we simply combine several single data traces into a multiprocess data trace, which is used in our simulation to verify the adaptive prefetching algorithm in a multiprogrammed model. We input the three traces (Gcc, Eqntott, and Tomcatv) into our simulator concurrently, using the same parameters as in Table 1. In Fig. 10, we show the life-time functions versus f for all three programs. In order to keep the three curves within one figure, we multiply the values by 10 for program Gcc and by 100 for Tomcatv. The curves show that the highest points for Gcc, Eqntott, and Tomcatv are at f = 4, 8, 3, respectively. Fig. 11 displays each

Fig. 10. System life-time versus f .

Fig. 11. Program efficiency versus f .

program's efficiency versus the static number of prefetched pages f. For programs Gcc, Eqntott and Tomcatv, the maximum values of the program's efficiency are at f = 4, 8, 3, respectively. These values are nearly the same as the results obtained from the single-program simulations, and the relative ordering of the three programs again corresponds to the result in [24]. Other important results of using the adaptive prefetching algorithm are shown in Table 2. From all the simulations on real traces, in both the single programmed and the multiprogrammed environments, we conclude that the adaptive prefetching algorithm keeps the system efficiency and total run time close to optimal.

4. Conclusions

We proposed a simple adaptive prefetching approach for single-process as well as multiprocess execution models. The adaptive prefetching algorithm uses on-line measurements of disk transfer times and of

Table 2
Results of multiprocess simulation

Parameter               Gcc      Eqntott   Tomcatv
Life-time               0.0286   0.5331    0.0049
System efficiency (%)   0.0907   0.8388    0.0233
Avg f                   3.95     7.83      2.75

inter-page fault times to adjust the level of prefetching dynamically. It can find the most appropriate degree of prefetching for different data localities, both for a process with good data locality, such as eqntott, and for a process with bad data locality, such as tomcatv. The simulations also show that it provides nearly optimal performance for real workloads, for both single-process and multiprocess execution models. In this paper, we studied the optimal degree of prefetching (f) only as a static property of a process, fixed for the life-time of the program execution. However, the adaptive algorithm also works for processes whose degree of locality changes dynamically: the algorithm will adjust the prefetching degree f and migrate from one optimal value to another. Our future research will address the following issues:
– To validate the theoretical developments on practical system designs and modify them to incorporate specific aspects of existing systems.
– To implement the adaptive prefetching algorithm at the hardware level instead of the software level.
– To apply the new adaptive prefetching algorithm to complete traces, including data and instruction traces, user-level processes, and the operating system kernel, for multiprocessor workloads.
– To apply the new adaptive prefetching algorithm to program traces collected from external hardware probes.

Appendix. Variation of η with f for single programmed modeling

Using the life-time function e(s) defined in Section 2.1 and considering a single process running on a virtual memory machine, an amount of space s_m is allocated in main memory and s_c in the disk controller for the process. The effectiveness of the execution is measured by the ratio of "useful" time spent by the program in the compute phase to the total time spent, including the time spent transferring pages to and from secondary memory:

η = e(s_m + s_c) / [e(s_m + s_c) + T_1 + T_2]    (6)
T_1 = T_C p_o [p_d p_Fm (1 − p_Fc) + (1 − p_d) p_Lm (1 − p_Lc)] · e(s_m + s_c)/e(s_m)
T_2 = T_D [p_o p_d p_Fm p_Fc + p_o (1 − p_d) p_Lm p_Lc + (1 − p_o)],

where:
– p_o is the probability that the page whose reference caused the page fault is an "old" page which had previously been loaded into memory;
– p_d is the probability that the page was "dirty", i.e. modified, during its last period of residence in main memory;
– T_C and T_D are the times taken to transfer the pages resulting from a page fault from the disk cache to main memory and from the disk to main memory, respectively;
– p_Fm and p_Fc are the probabilities that a dirty page is flushed to disk from main memory and from the disk cache, respectively. Notice that "flushing" is a systematic policy decision which is implemented at the level of the operating system;
– p_Lm and p_Lc are the probabilities that a "clean" page is over-written, in main memory or in the disk cache respectively, by an incoming page. Notice that this will depend on how tight the space is, and how active other processes may be in bringing in new pages.
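The step from Eq. (6) to the compact form used below is implicit in the paper; written out, it is simply a division of the numerator and denominator by e(s_m + s_c):

```latex
% Bridge between Eq. (6) and Eq. (7): divide through by e(s_m + s_c).
\eta = \frac{e(s_m+s_c)}{e(s_m+s_c)+T_1+T_2}
     = \frac{1}{1+H},
\qquad
H = \frac{T_1+T_2}{e(s_m+s_c)} .
```

Substituting T_1 and T_2 from Eq. (6) and collecting the factors α = p_o p_d p_Fm and β = p_o (1 − p_d) p_Lm then yields the expanded expression for H given in Eq. (7).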

Eq. (6) can be more conveniently written as η = 1/(1 + H), where

H = α [T_C (1 − p_Fc)/e(s_m) + T_D p_Fc/e(s_m + s_c)] + β [T_C (1 − p_Lc)/e(s_m) + T_D p_Lc/e(s_m + s_c)] + T_D (1 − p_o)/e(s_m + s_c)    (7)
α = p_o p_d p_Fm

β = p_o (1 − p_d) p_Lm.

To optimize the program's execution by reducing the time the program spends in page transfers, we need to maximize η, which is equivalent to minimizing the quantity H. The expression for H in Eq. (7) provides insight into an adaptive design of a prefetching algorithm. Considering the variation of H with the degree of prefetching f, the probabilities p_o, p_d, p_Fm, p_Lm will not be affected by f unless an explicit choice is made to link the transfer of f pages at a time between main memory and the disk cache. Thus we can assume that α and β in (7) do not depend on f. On the other hand, the life-time function e(s) will depend on f, since prefetching will impact the frequency of page faults: a good prefetching strategy should in fact reduce page faults, while a poor choice of the prefetched pages would increase the page fault rate. T_C and T_D will be impacted by f, since transferring more pages cannot in any manner take less time for each transfer. Also, one may expect that the rate at which dirty pages are flushed from the disk cache will increase as f increases, since there will be a need for more free space, resulting in a variation of p_Fc with f. Notice that e(s_m) will be of the same order of magnitude as e(s_m + s_c), while T_C will be much smaller than T_D, most likely by a factor of at least 10^−2 or 10^−3, since T_C will be at most in the microseconds range while T_D will be in the milliseconds range. Ignoring the terms containing T_C, we can write

H = α T_D p_Fc/e(s_m + s_c) + β T_D p_Lc/e(s_m + s_c) + T_D (1 − p_o)/e(s_m + s_c).    (8)

p_Fc will be far less sensitive to f than some of the other quantities in Eq. (8), since the number of pages being prefetched will typically be small compared to the total number of pages (s_m + s_c). Denoting by a "prime" the derivative of each variable with respect to f, we have the following approximate expression:

H′ = α T_D′ p_Fc/e(s_m + s_c) + β T_D′ p_Lc/e(s_m + s_c) + T_D′ (1 − p_o)/e(s_m + s_c) − [T_D e′(s_m + s_c)/[e(s_m + s_c)]²] · [1 − p_o + α p_Fc + β p_Lc].    (9)

And H′ < 0 if and only if

∂T_D/∂f < [∂e(s_m + s_c)/∂f] · T_D/e(s_m + s_c).

References

[1] S. Albers, N. Garg, S. Leonardi, Minimizing stall time in single and parallel disk systems, Journal of the ACM 47 (2000) 969–986.
[2] J.L. Baer, G.R. Sager, Dynamic improvement of locality in virtual memory systems, IEEE Transactions on Software Engineering SE-2 (1976) 54–62.
[3] L.A. Belady, C.J. Kuehner, Dynamic space sharing in computer systems, Communications of the ACM 12 (5) (1969) 15–21.
[4] P. Cao, E.W. Felten, A.R. Karlin, K. Li, A study of integrated prefetching and caching strategies, in: Proc. of ACM SIGMETRICS, 1995, pp. 188–197.
[5] P. Cao, E.W. Felten, A.R. Karlin, K. Li, Implementation and performance of integrated application-controlled caching, prefetching and disk scheduling, ACM Transactions on Computer Systems (TOCS) 14 (4) (1996) 311–343.
[6] A. Chesnais, E. Gelenbe, I. Mitrani, On the modelling of parallel access to shared data, Communications of the ACM 26 (3) (1983) 196–202.
[7] P.M. Chen, E.K. Lee, G.A. Gibson, R.H. Katz, D.A. Patterson, RAID: High-performance, reliable secondary storage, ACM Computing Surveys 26 (2) (1994) 145–185.
[8] E.G. Coffman, H.O. Pollack, E. Gelenbe, R. Wood, Analysis of parallel read sequential write systems, Performance Evaluation 1 (1981) 62–69.
[9] A. Fiat, R.M. Karp, M. Luby, L.A. McGeoch, D.D. Sleator, N.E. Young, On competitive algorithms for paging problems, Journal of Algorithms 12 (1991) 685–699.
[10] Jun Gao, Zhiming Wu, Zhiping Jiang, A key technology of design in RAID system, Computer Peripherals Review 24 (1) (2000) 5–8.
[11] E. Gelenbe, The distribution of a program in primary and fast buffer storage, Communications of the ACM 16 (7) (1973) 431–434.
[12] E. Gelenbe, G. Pujolle, Introduction to Networks of Queues, 2nd ed., J. Wiley & Sons, Chichester, New York, 1998.
[13] E. Gelenbe, Q. Zhu, Adaptive control of pre-fetching, Performance Evaluation 46 (2–3) (2001) 177–192.
[14] K.S. Grimsrud, J.K. Archibald, B.E. Nelson, Multiple prefetch adaptive disk caching, IEEE Transactions on Knowledge and Data Engineering 5 (1) (1993) 88–103.

[15] S.J. Hartley, An analysis of some problems in managing virtual memory systems with fast secondary storage devices, IEEE Transactions on Software Engineering 14 (8) (1988) 1176–1187.
[16] J. Hennessy, D. Patterson, Computer Architecture: A Quantitative Approach, 3rd ed., Morgan Kaufmann, 2003.
[17] J. Jeppesen, W. Allen, S. Anderson, M. Pilsl, Hard disk controller: The disk drive's brain and body, in: IEEE Proc. of the Intl. Conf. on Computer Design: VLSI in Computers & Processors, ICCD, 2001, pp. 262–267.
[18] W. Jin, X. Sun, J.S. Chase, Depth prefetching, Research Report Duke-CS-00-07, Duke University, 2000.
[19] S.F. Kaplan, Y. Smaragdakis, P.R. Wilson, Trace reduction for virtual memory simulations, in: Proc. of ACM SIGMETRICS, 1999, pp. 47–58.
[20] S.F. Kaplan, Compressed caching and modern virtual memory simulation, Ph.D. Dissertation, 2001.
[21] R. Karedla, J.S. Love, B. Wherry, Caching strategies to improve disk system performance, IEEE Computer 27 (3) (1994) 38–46.
[22] T. Kimbrel, A.R. Karlin, Near-optimal parallel prefetching and caching, in: Proc. of the Symposium on Foundations of Computer Science, 1996, pp. 540–549.
[23] T. Kimbrel, A. Tomkins, R.H. Patterson, B. Bershad, P. Cao, E. Felten, G. Gibson, A.R. Karlin, K. Li, A trace-driven comparison of algorithms for parallel prefetching and caching, in: Proc. of the Symposium on Operating Systems Design and Implementation, 1996, pp. 19–34.
[24] M.H. Lipasti, C.B. Wilkerson, J.P. Shen, Value locality and load value prediction, in: Proc. of ASPLOS, 1996, pp. 138–147.
[25] R.H. Patterson, G.A. Gibson, E. Ginting, D. Stodolsky, J. Zelenka, Informed prefetching and caching, in: Proc. of the Fifteenth ACM Symposium on Operating Systems Principles, 1995, pp. 79–95.
[26] A.J. Smith, Sequential program prefetching in memory hierarchies, IEEE Computer 11 (2) (1978) 7–21.
[27] A.J. Smith, Cache memories, Computing Surveys 14 (3) (1982) 473–530.
[28] V. Soloviev, Prefetching in segmented disk cache for multi-disk systems, in: Proc. of the 4th Workshop on I/O in Parallel and Distributed Systems, 1996, pp. 69–82.
[29] SPEC, Standard Performance Evaluation Corporation, http://www.specbench.org, 2002.
[30] R.A. Uhlig, T.N. Mudge, Trace-driven memory simulation: A survey, ACM Computing Surveys 29 (2) (1997) 128–170.

Qi Zhu earned his Ph.D. in Computer Science from the University of Central Florida in 2002. Currently he is an Assistant Professor at the University of Houston-Victoria. His research interests include operating systems, queuing systems, system modeling and analysis, software engineering, and real-time systems.

Erol Gelenbe is a Fellow of ACM, IEEE, and IEE. He is the Dennis Gabor Chair in Electrical and Electronic Engineering at Imperial College London. Prof. Gelenbe is currently conducting research on self-aware networks and quality of service, future data fusion systems, and distributed data and information systems.

Ying Qiao is currently an employee of Hurco Inc. She earned her Master's degree in Computer Science from Virginia Polytechnic Institute and State University in 2006.