Computer Networks 37 (2001) 711–716
www.elsevier.com/locate/comnet
Parameters of cache systems based on a Zipf-like distribution
Dmitry G. Dolgikh, Andrei M. Sukhov *
Samara State Aerospace University, Moscowskoe sh., 34a, Samara 443086, Russia
Abstract
The object of this paper is to explain a system of Internet traffic caching. The task is to create an analytical model of a cache system that links its size with other parameters through boundary conditions. A definition of a dynamic cache model is introduced. The parameters of a cache system are calculated using Zipf's first law and a Zipf-like distribution. The correspondence between the size of a cache system and the aggregated bandwidth of external links is derived. © 2001 Elsevier Science B.V. All rights reserved.
Keywords: Proxy caches; Zipf-like distribution; ICP systems; Size of cache; Analytical cache model
1. Introduction
Until recently, proxy caches were an optional service for users who voluntarily configured their browsers to redirect requests through a proxy. Today, Internet service providers impose caching transparently at the edges of the network to help meet the exponentially growing demand for Web-based services. In order to optimize the benefits of cache systems it is important [1] to develop an analytical model of cache behavior. An accurate model must include expressions for the following parameters: cache size, maximum hit rate, and mean lifetime of a document, as well as the best cache parameter settings. Such a model was proposed by Breslau et al. [5]. In the present paper this model is extended analytically to define the cache size from boundary conditions, the hit rates for single and collective users, and the relations between different cache parameters. Our model assumes independent references following a Zipf-like distribution and no correlation between request frequency and response size or rate of change. The results obtained can easily be generalized to any distribution using the methods proposed here.
This work is based on the experience of the Samara Region Network for Science and Education supported by the Samara State Aerospace University (SSAU). It uses materials of the Second Web Cache Managers Workshop, organized by TERENA in Budapest, Hungary, on 2–3 March 2000 on behalf of the DESIRE II project, where one of the co-authors gave a report.
2. Optimal size of a cache system in Internet applications
*
Corresponding author. E-mail addresses:
[email protected] (D.G. Dolgikh),
[email protected] (A.M. Sukhov).
1389-1286/01/$ - see front matter © 2001 Elsevier Science B.V. All rights reserved. PII: S1389-1286(01)00243-2
Users of a local network request information from the global network. This information is delivered in portions called documents. Some documents are requested repeatedly and therefore should be held in the cache system. The network community is assumed to operate as a collective user (see Fig. 1), which sends requests to the global network through a cache system.
Fig. 1. Scheme of proxy caching.
Fig. 2. Illustration of Zipf distribution.
Suppose a user asks for $k$ documents in the time $t$. Obviously the function
$$k(\nu_{int}, t) = C\,\nu_{int}\,t \tag{1}$$
is linear. The result depends on the aggregated bandwidth of the external links $\nu_{int}$ and on the time $t$. Let $p$ equal the number of unique documents in the cache system. The efficiency of the system for a collective user, $\eta_{col}$, can be defined as the ratio of non-unique documents $k - p$ to their total number $k$:
$$\eta_{col} = \frac{k - p}{k}. \tag{2}$$
It is known that users' requests are distributed non-uniformly. As noted in Refs. [2,6,8], several researchers have observed that the relative frequency with which Web pages are requested follows Zipf's first law [11]. Zipf's law states that the relative probability of a request for the $n$th most popular page is proportional to $1/n$:
$$\vartheta_n = \frac{A}{n}, \tag{3}$$
where $A = \vartheta_1$ is the probability of the most popular item. Subsequent studies have concluded that this law is applicable to the collective user [3,5,9]. The distribution of page requests generally follows a Zipf-like distribution, where the relative probability of a request for the $n$th most popular page is
$$\vartheta_n = \frac{A}{n^{\alpha}}, \tag{4}$$
with $\alpha$ typically taking on some positive value less than unity. In order to obtain transparent analytic expressions, Zipf's first law is used in this section. The equations for the general Zipf-like law are derived in Section 4. Fig. 2 illustrates Zipf's distribution.
It is easy to find the value of $A$ from the calibration condition: the sum of the probabilities of requesting all $p$ unique documents equals unity,
$$\sum_{n=1}^{p} \frac{A}{n} = 1. \tag{5}$$
Such an approach was applied for the first time in Ref. [5]. To work with simple analytical expressions, the sum in Eq. (5) can be replaced by an integral:
$$\int_{1}^{p} \frac{A}{x}\,dx = 1, \tag{6}$$
$$A = \frac{1}{\ln p}. \tag{7}$$
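The size of the error introduced by replacing the sum in Eq. (5) with the integral in Eq. (6) can be checked numerically. The following is a minimal sketch, not part of the original derivation; the function names and the sample values of $p$ are illustrative assumptions.

```python
import math


def zipf_constant_exact(p: int) -> float:
    """A from the discrete calibration condition, Eq. (5): sum_{n=1..p} A/n = 1."""
    harmonic = sum(1.0 / n for n in range(1, p + 1))
    return 1.0 / harmonic


def zipf_constant_integral(p: int) -> float:
    """A from the integral approximation, Eqs. (6)-(7): A = 1 / ln p."""
    return 1.0 / math.log(p)


if __name__ == "__main__":
    # Sample sizes are hypothetical; they only illustrate the trend.
    for p in (10**3, 10**5, 10**6):
        exact = zipf_constant_exact(p)
        approx = zipf_constant_integral(p)
        gap = (approx - exact) / exact
        print(f"p={p:>8}  exact A={exact:.5f}  1/ln p={approx:.5f}  gap={gap:.2%}")
```

The integral form slightly overestimates $A$, because the harmonic sum exceeds $\ln p$ by roughly Euler's constant; the relative gap shrinks as $p$ grows.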
Let m equal the quantity of documents that should remain in the cache system. These documents will have been requested two or more times.
Fig. 4. The scheme of the dynamical model.
Fig. 3. Boundary conditions.
During the first request a document is cached by the system, and for the second and subsequent requests it is delivered to the user from the local cache. Applying Zipf's law to the point with coordinates $(2, m)$ in Fig. 3, and using the value $A$ of the citation index of the first document from Eq. (7), gives
$$2m = kA, \qquad m = \frac{k}{2\ln p}, \tag{8}$$
and substituting $k = 2m\ln p$ into Eq. (2) gives
$$\eta_{col} = 1 - \frac{p}{2m\ln p}. \tag{9}$$
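The boundary condition at the point $(2, m)$ can be illustrated with a short numerical sketch. It assumes only Eq. (2), Eq. (7) and the relation $m = k/(2\ln p)$; the function names and the sample values of $k$ and $p$ are hypothetical and are not taken from the paper.

```python
import math


def expected_requests(n: float, k: float, p: float) -> float:
    """Expected number of requests for the n-th most popular document: k*A/n, A = 1/ln p."""
    return k / (math.log(p) * n)


def cache_size_zipf(k: float, p: float) -> float:
    """Rank m at which a document is requested exactly twice: m = k / (2 ln p)."""
    return k / (2.0 * math.log(p))


def collective_efficiency(k: float, p: float) -> float:
    """Eq. (2): share of the k requests that are repeats of already seen documents."""
    return (k - p) / k


if __name__ == "__main__":
    k, p = 1.5e6, 1.0e5  # hypothetical request count and number of unique documents
    m = cache_size_zipf(k, p)
    print(f"m = {m:.0f} documents")
    print(f"requests for the m-th document = {expected_requests(m, k, p):.3f}")
    print(f"eta_col = {collective_efficiency(k, p):.3f}")
```

Documents ranked above $m$ are requested at least twice and are therefore worth keeping; those ranked below $m$ are requested fewer than two times.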
3. System's efficiency
The aim of this section is to estimate the maximum value of the system efficiency for a single user. A session model with two parameters describes the user's behavior: the session duration $\Delta t$ and the number of documents $l$ requested during the session. Let the cache system under investigation satisfy conditions (8) and (9); then its parameters can be fixed, $t = T$ and $m = M$, where
$$\Delta t \ll T, \qquad l \ll M. \tag{10}$$
In addition, the time $t$ since the system's installation must be much greater than the cache parameter $T$. This condition converts the static model of the cache system into a dynamic one. In the dynamic model (see Fig. 4) each user session can, without loss of generality, be taken to occupy the interval $(T - \Delta t, T)$. The probability of requesting a document twice during a session is proportional to $1/k^2$ and tends to zero for documents that are not yet in the cache. The user's session can therefore be considered as a set of independent events [7].
For the dynamic model it is possible to use a Poisson distribution [4] to find the efficiency of the cache system for a single user during a session. This is justified because the probability $A$ from Eqs. (7) and (8) of finding the most popular item in the cache system is small ($A \le 0.1$) and the number of requested documents $l$ is large enough ($l \ge 100$). According to the Poisson distribution, the probability of finding the most popular item in the cache system is
$$P_1 = 1 - P_1(0), \tag{11}$$
where $P_1(0)$ is the probability that such a document is requested directly from the global network during a session:
$$P_1(0) = e^{-Al}. \tag{12}$$
The probability of finding the second most popular document is calculated in the same way:
$$P_2 = 1 - e^{-(A/2)l}. \tag{13}$$
The number of successful events is obtained by summing these values over all $M$ documents in the cache:
$$\sum_{n=1}^{M}\left(1 - e^{-(A/n)l}\right). \tag{14}$$
The efficiency for a single user can then be calculated as
$$\eta_s = \frac{1}{l}\sum_{n=1}^{M}\left(1 - e^{-(A/n)l}\right). \tag{15}$$
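The single-user efficiency of Eq. (15) is a finite sum and can be evaluated directly. The sketch below is one possible implementation; the function name and the sample values of $A$, $l$ and $M$ are assumptions chosen only to respect $A \le 0.1$ and $l \ge 100$.

```python
import math


def eta_s_zipf_sum(A: float, l: int, M: int) -> float:
    """Eq. (15): eta_s = (1/l) * sum_{n=1..M} (1 - exp(-(A/n) * l))."""
    # -expm1(-x) equals 1 - exp(-x) and stays accurate when x is very small,
    # which is the case for the many unpopular documents with large n.
    hits = sum(-math.expm1(-A * l / n) for n in range(1, M + 1))
    return hits / l


if __name__ == "__main__":
    A, l, M = 0.08, 100, 10**5  # hypothetical session and cache parameters
    print(f"eta_s = {eta_s_zipf_sum(A, l, M):.3f}")
```

Each term of the sum is the Poisson probability that the $n$th document is requested at least once during the session, so the sum counts the expected number of distinct cached documents the session touches.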
4. Systems that follow Zipf-like distribution
This section is devoted to other approaches to modeling cache systems. First, this concerns the calibration condition (5). Attention should be paid to the considerable difference between the probabilities defined on the interval $(m, p)$. From Zipf's first law it follows that
$$\int_{m}^{p} \frac{A}{x}\,dx = 1 - \log_p m, \tag{16}$$
whereas the real discrete distribution gives (see Fig. 3)
$$\frac{p - m}{k} = 1 - \eta_{col}. \tag{17}$$
The following expression seems to be better:
$$\int_{1}^{M} \frac{A}{x^{\alpha}}\,dx = \eta_{col} = H(R). \tag{18}$$
This condition gives a definition of $H(R) = \eta_{col}$, alternative to that of Eq. (2), which is more suitable for the dynamic model. In other words, it is the sum of the probabilities of requesting each document in the cache or, as it is usually called, the hit ratio $H(R)$. A Zipf-like distribution has been applied in Eq. (18). Now the key expressions (7), (9), and (15) can easily be generalized:
$$A = \frac{(1-\alpha)H(R)}{M^{1-\alpha} - 1}, \tag{19}$$
$$M - M^{\alpha} = \frac{(1-\alpha)H(R)}{2}\,k. \tag{20}$$
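Equation (20) has no closed-form solution for $M$, but its left-hand side grows monotonically for $M \ge 1$ and $0 < \alpha < 1$, so it can be solved by bisection. The sketch below is an illustration under that assumption; the function names and the sample values of $k$, $\alpha$ and $H(R)$ are hypothetical.

```python
def zipf_like_constant(M: float, alpha: float, hit_ratio: float) -> float:
    """Eq. (19): A = (1 - alpha) * H(R) / (M**(1 - alpha) - 1)."""
    return (1.0 - alpha) * hit_ratio / (M ** (1.0 - alpha) - 1.0)


def cache_size(k: float, alpha: float, hit_ratio: float, tol: float = 1e-9) -> float:
    """Solve Eq. (20), M - M**alpha = (1 - alpha) * H(R) * k / 2, for M > 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    target = (1.0 - alpha) * hit_ratio * k / 2.0
    if target <= 0.0:
        raise ValueError("k and hit_ratio must be positive")
    lo, hi = 1.0, 2.0
    while hi - hi ** alpha < target:  # grow the bracket until it contains the root
        hi *= 2.0
    while hi - lo > tol * hi:  # bisection on the increasing function M - M**alpha
        mid = 0.5 * (lo + hi)
        if mid - mid ** alpha < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    k, alpha, hit_ratio = 1.5e6, 0.7, 0.45  # hypothetical inputs
    M = cache_size(k, alpha, hit_ratio)
    A = zipf_like_constant(M, alpha, hit_ratio)
    print(f"M = {M:.0f}, A = {A:.5f}, k/M = {k / M:.2f}")
    print(f"requests for the M-th document = {k * A / M ** alpha:.3f}")
```

The last printed value checks the boundary condition behind Eq. (20): with $A$ taken from Eq. (19), the $M$th most popular document is requested twice.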
For a real cache system $\alpha$ varies from 0.63 to 0.83 and $M \ge 10^5$ [5]; then $M^{\alpha}$ is negligible compared with $M$, Eq. (20) reduces to
$$M = \frac{(1-\alpha)\,\eta_{col}}{2}\,k, \tag{21}$$
and the efficiency for a single user becomes
$$\eta_s = \frac{1}{l}\int_{1}^{M}\left(1 - e^{-Al/x^{\alpha}}\right)dx. \tag{22}$$
5. Recommendations for Internet caching systems
This section gives some recommendations for choosing the best cache parameter settings. The relationships between the parameters of a real system based on the Internet Cache Protocol (ICP) are also discussed.
First of all, the size of a cache system depends on the period of time $T$ (see Fig. 4) during which the system is being filled. The time $T$ from Eq. (21) corresponds to the "reference-age" parameter of real ICP systems. This parameter is the mean lifetime $T$ of documents replaced in the cache, i.e., the time from when a document enters the cache until it is replaced, averaged over all replaced documents. The only condition on this parameter is that it must be much greater than the duration of a user's session, see Eq. (10). Its value is one month in our network. This value is based on the results shown in Ref. [10]. The authors of that paper assert that the frequency with which a document changes depends on its popularity, and that more popular documents change more frequently.
In Eq. (1) the product of the aggregated bandwidth of the external links $\nu_{int}$ and the filling time $T$ is the volume of traffic passed through the cache system:
$$S_{eff} = \nu_{int} T. \tag{23}$$
The constant $C$ from Eq. (1) is the inverse of the mean size $B_{eff}$ of the documents in the cache system:
$$C = \frac{1}{B_{eff}}. \tag{24}$$
Experimental data [5] give 45% for the maximum possible hit rate. Such a system usually holds at least $10^5$ documents, and $0.63 \le \alpha \le 0.83$ as mentioned above. The following conclusion can then be drawn:
$$M = \frac{k}{15}. \tag{25}$$
This means that only every 15th document among all those requested by users from the global network should be stored in the cache. This main proportion may be related to different parameters of ICP systems that describe the rate at which documents flow into the cache, because the correlation between frequency of requests and document size is very weak and can be ignored [5]:
$$\frac{S_{eff}}{\nu_{int}} \sim 10^{5}\ \text{s, or several days,} \tag{26}$$
which means that the ratio of the effective cache system size $S_{eff}$ to the aggregated bandwidth of the external links $\nu_{int}$ can be considered constant within wide limits.
The next question is how to obtain the maximum possible hit rate for cache systems. Expression (22) can be investigated using computer algebra tools, but it describes the behavior of a single user during a session. The maximum possible hit rate for the system efficiency for a collective user can be found by applying condition (22) to the whole system. To do this, all parameters in Eq. (22) must be estimated carefully. Let us start with $l$. Condition (25) can be rewritten as
$$l = k = 15M. \tag{27}$$
For practical purposes, however, a less strict condition should be used:
$$l = M. \tag{28}$$
To estimate a value for $A$, let us suppose that $\eta_{col} = H(R) = 1.0$, which means that almost every file is in the cache. With the values of $M$ and $\alpha$ taken from the experimental data given above, the following restriction appears:
$$\eta_{col} = \eta_s(l = M) \le 0.47. \tag{29}$$
If smaller values of $l$ are used, the value of $\eta_s$ will differ from the experimental values of the system efficiency for a collective user, $\eta_{col}$.
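The estimate in Eq. (29) can be reproduced numerically instead of with computer algebra tools. The sketch below evaluates Eq. (22) with $l = M$ and with $A$ taken from Eq. (19) for $H(R) = 1$. The function names, the Simpson integration scheme, and the choice $\alpha = 0.7$ (an assumed mid-range value) with $M = 10^5$ are illustrative assumptions, not the authors' procedure.

```python
import math


def zipf_like_constant(M: float, alpha: float, hit_ratio: float) -> float:
    """Eq. (19): A = (1 - alpha) * H(R) / (M**(1 - alpha) - 1)."""
    return (1.0 - alpha) * hit_ratio / (M ** (1.0 - alpha) - 1.0)


def eta_s_zipf_like(A: float, l: float, M: float, alpha: float, steps: int = 20000) -> float:
    """Eq. (22): eta_s = (1/l) * integral from 1 to M of (1 - exp(-A*l / x**alpha)) dx.

    Uses the composite Simpson rule after substituting x = exp(s), so that one
    uniform grid in s covers the several orders of magnitude between 1 and M.
    """
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of intervals
    upper = math.log(M)
    h = upper / steps

    def g(s: float) -> float:
        x = math.exp(s)
        # (1 - exp(-A*l/x**alpha)) * dx/ds, with dx/ds = x
        return -math.expm1(-A * l / x ** alpha) * x

    total = g(0.0) + g(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * g(i * h)
    return total * h / 3.0 / l


if __name__ == "__main__":
    M, alpha = 1.0e5, 0.7  # assumed cache size and Zipf exponent
    A = zipf_like_constant(M, alpha, hit_ratio=1.0)
    print(f"A = {A:.5f}, eta_s(l = M) = {eta_s_zipf_like(A, M, M, alpha):.3f}")
```

For these assumed inputs the result lies close to the value 0.47 quoted in Eq. (29); other values of $\alpha$ within the experimental range shift it somewhat.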
6. Discussion
An analytical model of a cache system has been created and tested. The model was used to find the size, the maximum efficiency, and the request frequency of the most popular document of a cache system. The practical recommendations given here can help to tune real systems. One of the main results is the formula that relates the size of a cache system to the aggregated bandwidth of the external links. All equations were derived for Zipf's first law and for Zipf-like distributions.
At present the logs available to us come mostly from nationwide top-level caches such as NLANR, JANET, DFN, etc. Studies of these logs have shown that the hit ratio grows in a log-like fashion as a function of cache size. For further work it is necessary to acquire new data. First of all, the dynamic dependencies of the Zipf parameter $\alpha(M)$ and of the cache hit ratio $H(R)$ should be investigated for various communities of clients. We need experimental values of $\alpha(M)$ and $H(R) = \eta_{col}(M)$ as the cache size grows from $10^5$ to $10^{10}$ documents.
In order to investigate a cache hierarchy, an independent cache system of a small department should be chosen. The weekly traffic of this department would consist of $10^6$ documents obtained from the global network. In the first stage we could vary the size of the local cache proxy. Later it should be linked step by step to the next levels of the hierarchy (department–university–state–national) to increase the number of cached documents up to $10^{10}$. This expansion would be realized in sibling mode, in which documents from the top cache are not held in the local cache. After the results are obtained, the conditions for $\alpha$, $\eta$, $M$, $A$ can be optimized and more accurate analytical expressions can be defined for a collective user. The effectiveness of cooperative caching should be included in the derivation of these equations.
Acknowledgements We would like to thank Ian Ravenscroft, an International Support Engineer of Delcam International PLC, for invaluable help with preparing the English version of this paper.
References
[1] M. Abrams, C. Standridge, G. Abdulla, S. Williams, E. Fox, Caching proxies: limitations and potentials, Proceedings of the Fourth International WWW Conference, 1995.
[2] V. Almeida, A. Bestavros, M. Crovella, A. De Oliveira, Characterizing reference locality in the WWW, IEEE International Conference in Parallel and Distributed Information Systems, Miami Beach, FL, USA, December 1996.
[3] V. Almeida, M. Cesirio, R. Canado, W. Junior, C. Murta, Analyzing the behavior of a proxy server in the light of regional and cultural issues, 3rd International WWW Caching Workshop, Manchester, England, June 1998.
[4] J. Boucher, Voice Teletraffic Systems Engineering, Artech House, Norwood, MA, 1988 (Chapter 2).
[5] L. Breslau, P. Cao, L. Fan, G. Phillips, S. Shenker, Web caching and Zipf-like distribution: evidence and implications, IEEE Infocom XX (V) (1999) 1–9.
[6] S. Glassman, A caching relay for the World Wide Web, First International Conference on the World-Wide Web, CERN, Geneva, Switzerland, May 1999.
[7] E.G. Coffman, P.J. Denning, Operating Systems Theory, Prentice-Hall, Englewood Cliffs, NJ, 1973.
[8] C. Cunha, A. Bestavros, M. Crovella, Characteristics of WWW client-based traces, Technical report TR-95-010, Computer Science Department, Boston University, Boston, MA 02215, USA, April 1995.
[9] N. Nishikawa, T. Hosokawa, Ya. Mori, K. Yoshida, H. Tsuji, Memory-based architecture for distributed WWW caching proxy, The Seventh WWW Conference, April 1998.
[10] A. Wolman, G. Voelker, N. Sharma, N. Cardwell, A. Karlin, H. Levy, On the scale and performance of cooperative Web proxy caching, Operating Syst. Rev. 34 (5) (1999) 16–31.
[11] G.K. Zipf, Relative Frequency as a Determinant of Phonetic Change, Harvard Studies in Classical Philology, vol. XL, 1929 (reprint).
Dmitry Dolgikh is a postgraduate student of Samara State Aerospace University. His area of research is Internet caching systems. He has been a system administrator of the Samara Regional Network for Science and Education since 1996, working on its development and support from project start to its current state.
Andrei Sukhov is an Associate Professor of Samara State Aerospace University, Russia, and was awarded a PhD in Physics and Mathematics in Moscow in 1993. Over the last 10 years he has acted as an investigator for more than 10 telecommunication projects supported by the Russian government, INTAS, NATO, ESA, US Information Agency, etc. These include the construction of the Samara Regional Network for Science and Education, the first Russian provincial network providing a digital connection to Moscow; the telecommunication support of international space projects, which helped rescue the ESA experiments on the FOTON-12 aircraft; and other projects. He is the founder of the largest regional ISP and an expert in the area of telecommunication investments in Russia.