A CASE STUDY OF CACHING STRATEGIES FOR A DISTRIBUTED FULL TEXT RETRIEVAL SYSTEM*

T. PATRICK MARTIN†, IAN A. MACLEOD, and JUDY I. RUSSELL
Department of Computing & Information Science, Queen's University, Kingston, Ontario K7L 3N6

and

KEN LEESE and BRETT FOSTER
Fulcrum Technologies Inc., 560 Rochester Street, Ottawa, Ontario K1S 5K2

(Received 3 February 1989; accepted in final form 23 March 1989)

Information Processing & Management, Vol. 26, No. 2, pp. 227-247, 1990. Copyright © 1990 Pergamon Press plc. Printed in Great Britain.

Abstract: We describe the development of a model for the workload of a full text retrieval system and of a simulation model for a distributed version of an existing system, Ful/Text. These two models are then used to predict the effectiveness of different caching strategies for the given system and workload, and to evaluate the impact of various parameters on the strategies' performance.

1. INTRODUCTION

Full text retrieval systems are a prime candidate for conversion to a distributed environment, especially with their growing importance in applications such as office information systems. A major concern of these conversions must be system performance: we do not want to achieve the advantages of a distributed system at the cost of diminished system performance. The objective of our research has been to develop strategies for building distributed full text retrieval systems and then to predict the effect of these strategies on system performance. We report here on simulation experiments we conducted to predict the effect different caching strategies would have on the performance of a distributed full text retrieval system. The effects of different strategies for distributing the components of the system are reported elsewhere [1].

Data caching has become a common technique for reducing the cost of accessing data in a distributed system [2]. By maintaining local caches of recently acquired data, programs can amortize the high cost of querying remote data servers over several references to the same data. Caching is used to great advantage in a number of network-based file servers such as Sun's Network File System [3] and the Sprite network file system [4]. Data caching is not used extensively in distributed database systems because of the problem of maintaining the consistency of cached data. A cached copy of a data object is not an "official" copy and so is not automatically subject to any updates on that object. Special arrangements must be made to receive these updates or, more typically, a cached copy is given a "lifetime" which indicates the length of time the data is expected to remain valid. Distributed full text retrieval systems, on the other hand, are a more promising candidate for caching than distributed database systems because documents are usually never updated and rarely deleted.

Simpson and Alonso [5] have examined the use of data caching in information retrieval systems. The focus of their work is the integration of PCs and workstations into widely distributed information retrieval systems, while our focus is performance improvement. Caching is proposed as a means to achieve both goals. They also studied caching using a simulation model, but their model is a general representation of a distributed information system. Service demands and workload characteristics for their model were arrived at by reading manuals and using sample systems for a short time. The model was intended to predict general trends and effects. Our work, on the other hand, is a detailed performance analysis of one particular system. Actual workloads and system measurements were used to provide the parameters to the simulation model. Our model was built to provide insights into the relationships between workload and system characteristics and the effectiveness of a number of different caching strategies, and to predict the impact these caching strategies would have on the performance of the system.

*This work was supported by the Natural Science and Engineering Research Council of Canada under Cooperative Research and Development grant CRD39069.
†Correspondence should be addressed to T.P. Martin.

2. FUL/TEXT

The system studied is Ful/Text from Fulcrum Technologies [6]. The structure of the Ful/Text system is shown in Fig. 1. Since we are only concerned with document query and retrieval, we do not discuss any of the other aspects of Ful/Text, such as indexing or collection management. Information on these other topics can be found in the Ful/Text documentation.

[Fig. 1. Ful/Text system structure.]

A document is any set of textual information. It can be stored entirely within the Ful/Text Catalog file or in a separate operating system file called the Document File. Documents are grouped into collections and each collection has its own set of system files. The Ful/Text system files consist of the Catalog and Catalog Map files, which contain information about each document in the collection and perhaps all or part of the contents of the document; the Dictionary and Reference files, which are used by the Search Engine to locate the documents which satisfy incoming queries; and other auxiliary files such as a stop word file and a thesaurus file.

The interface between the Ful/Text system and application programs or an interactive user front-end is a set of functions called the Programmer Interface (PI). All requests are translated by the PI into a set of function calls to one or more of the underlying system components.

A prototype distributed version of Ful/Text was built by the Fulcrum members of the project. This prototype works in a simple client-server fashion: the PI resides on the client and the file processing capabilities reside on the server. The prototype operates in an environment consisting of a DEC server, running a version of UNIX, and a PC AT client, running DOS, connected by DECNET. Calls to the different PI functions are mapped by the client software to requests which are sent to the server for processing. The prototype served two very useful purposes. First, it allowed Fulcrum to study the effects a distributed environment would likely have on the structure of the system and the composition of the system's Programmer Interface. Second, the prototype provided a verifiable initial configuration for the simulation model.

3. CACHING STRATEGIES

The two most common text retrieval operations are a collection search and a document browse. Our choices of caching strategies are intended to improve the performance of one or both of these transaction types within the distributed Ful/Text environment.

A search transaction accesses the Ful/Text system files to produce a set of pointers to documents that satisfy the search condition. It involves very little communication and a significant amount of disk I/O. Caching the system files at a client does not make sense. The files are very large, so downloading a system file would take a significant amount of time and the file would occupy a substantial amount of local storage. The system files are also dynamic: they are updated every time a new document is added to the collection, so cached copies of these files would become out-of-date and require refreshing. More importantly, caching at the client site does not reduce the major cost component, namely the disk I/O. The amount of disk I/O at the server for a search transaction should instead be reduced by a main memory cache at the server node.

A browse transaction accesses the document files and returns a formatted page of a document for viewing. It involves both communication and disk I/O. The use of relatively slow communication lines, such as phone lines, means that communication will be the dominant cost. Caching document files at the client sites could reduce the amount of communication involved over a complete session. Caching document files is viable since the contents of a document do not change and we would only have to cache a subset of all the documents, and perhaps only portions of individual documents. The key factor for the success of document caching is that, in a typical session, a user accesses a small percentage of the document data a large percentage of the time. The use of a client cache means that raw data from the document files has to be shipped to the client site, where it is then processed to produce the formatted page. Shipping the raw data is necessary because a formatted page can include highlighting of query terms, and if the same data is used in a subsequent query and browse, this highlighting will have to change.

We considered four basic caching strategies: a server cache and three variations of a client cache. These strategies were tried with different hardware configurations, namely different cache sizes, different locations for the cache (main memory, hard disk, floppy disk), and different line speeds. We also experimented with combinations of the server cache and the various client caches.

3.1 Server cache
The server site is given a main memory cache to buffer blocks retrieved from the disk. File systems such as Sun's NFS already use large caches to improve I/O performance. This server cache can be used to model such a file system cache and it can also be used to model an application-controlled cache. The advantage of the full text retrieval system controlling the cache is that more specialized block replacement policies can be used that are tailored to the application. Stonebraker [7] has pointed out the disadvantages of using a general replacement policy for a database management system, and some of these arguments apply here. Also, a file system cache is shared by all active applications, so system load has a major impact on the effectiveness of caching for any particular application.

We experimented with several replacement policies but, given the absence of any updates to the database, we found that they all had similar results. The experiments discussed in this paper use a simple Least Recently Used (LRU) replacement policy; that is, when a new block must be placed in the cache, the current block that has not been referenced for the longest period of time is replaced. Since we are not considering updates, a replaced block does not have to be written back to the disk and can just be discarded.

We also considered the use of shared and disjoint caches. In a shared cache, blocks are allocated to clients in a dynamic fashion and the entire cache is searched for every block access. In a disjoint cache, each client is assigned a fixed number of blocks in the cache and only that subset is searched when a block access is required by that client. We found that, because clients often require the same blocks, especially when dealing with system files, and because clients' requirements vary from transaction to transaction, a shared cache was the best option. The experiments discussed below use a shared cache.

3.2 Client cache
The use of a client cache involves shifting the browse processing from the server to the client sites. This distribution of function is intended to reduce the overall communication costs and the load at the server site. The server's involvement in the browse function is reduced to supplying document blocks as they are requested by the clients. We assume that client sites will be personal computers or workstations with limited main memory, so the client cache must be stored on local disk. This has the advantage that the contents of the cache can persist between client sessions.

We considered three implementations of a client cache. The first implementation uses a simple LRU replacement policy. The second implementation uses a prefetching scheme, One Block Lookahead (OBL), in combination with LRU replacement. OBL tries to anticipate the next block needed by a transaction. When a document block access is made and that block is not in the local cache, the client sends a request to the server for that block and for the next block in the document file. We assume that document files are physically contiguous, so the cost of getting the next block is just the transfer time for that block.
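As an illustration of the cache logic just described, the following sketch (ours, not Fulcrum's code) combines LRU replacement with optional one-block lookahead; fetch_from_server is a hypothetical stand-in for the block request to the server:

```python
from collections import OrderedDict

class ClientCache:
    """LRU block cache with optional One Block Lookahead (OBL)."""

    def __init__(self, capacity_blocks: int, obl: bool = False):
        self.capacity = capacity_blocks
        self.obl = obl
        self.blocks = OrderedDict()  # block id -> block data, in LRU order

    def _install(self, block_id, data):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)
            return
        if len(self.blocks) >= self.capacity:
            # Evict the least recently used block; with a read-only
            # document store there is nothing to write back.
            self.blocks.popitem(last=False)
        self.blocks[block_id] = data

    def read(self, block_id, fetch_from_server):
        if block_id in self.blocks:            # cache hit
            self.blocks.move_to_end(block_id)
            return self.blocks[block_id]
        data = fetch_from_server(block_id)     # cache miss
        self._install(block_id, data)
        if self.obl and block_id + 1 not in self.blocks:
            # Prefetch the next sequential block of the document file.
            self._install(block_id + 1, fetch_from_server(block_id + 1))
        return data
```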
The third implementation uses another prefetching scheme we call Think Time Prefetch (TTP) in combination with LRU replacement. TTP attempts to anticipate future block requests and carries out the prefetching during the think time between transactions, when a client site is idle waiting for the user to formulate the next transaction. After a search transaction, we transfer the beginning blocks from the first document in the search result. After a browse transaction, we transfer blocks following the block used in that browse transaction. The number of blocks that are transferred for each prefetch is a function of the length of the think time, the speed of the communication link and the speed of the client disk subsystem. For example, during a think time of 15 seconds, with a 9600 baud line and a floppy disk, we can transfer just under three 4096-byte blocks.
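The prefetch budget can be estimated as follows (a minimal sketch; the 10 bits per data byte of line framing overhead and the effective floppy write rate are our assumptions, not figures from the prototype):

```python
def ttp_prefetch_budget(think_time_s: float,
                        line_baud: int = 9600,
                        block_bytes: int = 4096,
                        floppy_bytes_per_s: float = 4096.0) -> float:
    """Blocks transferable during a think time: each block must be
    received over the line and then written to the local cache disk."""
    line_s = block_bytes / (line_baud / 10)    # ~10 bits per byte on the wire
    disk_s = block_bytes / floppy_bytes_per_s  # effective write rate (assumed)
    return think_time_s / (line_s + disk_s)

print(ttp_prefetch_budget(15.0))  # ~2.8 blocks: just under three
```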

4. SIMULATION MODEL

The performance analysis was carried out using a simulation model of the prototype distributed Ful/Text. The model was verified against this prototype and then extended to include the different caching strategies. The main tasks we faced were characterizing the workload of a Ful/Text system in a way that could be used in the simulation, building the actual model, and incorporating the different caching strategies into the model.

4.1 Workload characterization
Workload characterization is the task of producing a quantitative description of a workload in terms of parameters that affect system behaviour [8]. We chose to define a resource-oriented characterization of the workload. We identified three types of transactions: a collection open, which signifies the start of a session; a search; and a browse. Each transaction was then characterized by its demands on the system resources, namely CPUs, disks and networks. The amount of demand on the network was based upon observing the number and sizes of messages exchanged during a transaction. The amount of demand on the disk was derived from the number, and types, of accesses to the different system files. The files were classified as either sequential or random access in order to assign appropriate overheads.

The sample workload was provided by the taxation branch of Revenue Canada, a Ful/Text user. We were given copies of their document and system files and log files for three months' transactions against their database. The document collections consist of relevant taxation documentation and legislation in both French and English. Subsequent changes to the database, and the confidentiality of some documents, unfortunately meant that we were not given the database in a state corresponding to the time of the transactions. Numerous errors and inconsistencies were introduced when the transactions were rerun, and these had to be manually corrected or eliminated. The final workload upon which the characterization was based consisted of 579 search transactions, 640 browse transactions and 48 collection open transactions. The transactions were generated by 9 different users and occurred at various times during normal business hours.

The characterization represents the sample workload in a form that can be input to our simulation model. The properties collected for each transaction type were:

1. The frequency of occurrence.
2. An average think time.
3. For each PI function called, frequency distributions for the number of accesses to each of the file types.
4. For each PI function called, frequency distributions for the number and sizes of messages exchanged.

In a large number of cases, especially with the number and sizes of messages, we found that a constant could be used in place of a distribution. We observed that CPU demand, relative to the demands on the disks and the networks, was not a significant factor, and we chose to ignore it. This is a property of our workload, which contained mostly simple boolean queries that do not require any significant processing of the results. This assumption may not hold for another type of workload, such as one containing weighted queries that require the results to be sorted before presentation to the user.
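To make this characterization concrete, each transaction class can be summarized in a record along the following lines (a sketch; the field names and the 15-second think time are illustrative, and the real model input is a NETWORK II.5 description rather than code):

```python
from dataclasses import dataclass, field

@dataclass
class TransactionProfile:
    """Resource-oriented characterization of one transaction class."""
    name: str                      # "open", "search" or "browse"
    frequency: float               # fraction of all transactions
    think_time_s: float            # average user think time
    # Per PI function: distributions over the number of accesses to each
    # file type, e.g. {"pi_search": {"dictionary": [...], "reference": [...]}}
    file_accesses: dict = field(default_factory=dict)
    # Per PI function: message counts and sizes in bytes; many of these
    # turned out to be constants in the sample workload.
    messages: dict = field(default_factory=dict)

# Frequencies follow from the sample counts: 579 + 640 + 48 = 1267 transactions.
search = TransactionProfile("search", frequency=579 / 1267, think_time_s=15.0)
```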
4.2 Model description
The simulation model was written using a simulation package called NETWORK II.5 [9]. This package has a number of built-in features to facilitate the representation of communication networks and provides detailed reports on resource usage and other aspects of the simulation run.

The structure of the model is shown in Fig. 2. The initial configuration was intended to reflect a typical architecture for a distributed Ful/Text system. There is a single server machine holding the collection files which is connected to one or more client sites by 9600 baud communication lines. The client machines are personal computers which are capable of running a version of Ful/Text and which have some amount of local disk storage. The server is a medium-sized system, for instance a Sun 3/60, with ample fast disk storage. The disk storage is modelled as two devices to account for the differences in overhead for sequential and random access files. The random access files will, in general, have much larger overheads because of the increased average seek times. The client CPU is approximately ten times slower than the server CPU.

[Fig. 2. Structure of the simulation model.]

The client sites generate instances of the three transaction types based on the frequencies from the workload characterization. Each transaction type causes some sequence of events to occur in the simulation. The sequence of events for a transaction type represents the resource demands on the system provided by the workload characterization.
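A client site in the model then behaves roughly like the following loop (our sketch, reusing the TransactionProfile record above; the exponential think-time distribution is an assumption for illustration):

```python
import random

def client_session(profiles, n_transactions: int, seed: int = 0):
    """Yield (time, transaction name) pairs with a workload-driven mix."""
    rng = random.Random(seed)
    weights = [p.frequency for p in profiles]
    clock = 0.0
    for _ in range(n_transactions):
        p = rng.choices(profiles, weights=weights)[0]
        # The model replays p's characteristic sequence of disk and
        # network events at this point.
        yield clock, p.name
        clock += rng.expovariate(1.0 / p.think_time_s)  # assumed exponential
```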


4.3 Cache model
The caching strategies were modelled in a separate C program that simulated the actions of the different strategies. The resulting hit ratios, that is, the percentage of block accesses that could be serviced from the cache for each strategy, were then incorporated into the simulation model of distributed Ful/Text.

The input to the cache simulation program is a reduced logical block reference string (RLBRS) [10] which was derived from the sample workload. A RLBRS is a sequence of logical disk block references with all immediate rereferences removed. The string of logical block references for a user was produced from the set of seek and read/write operations recorded in the workload for that user. We had to count references to logical file blocks, that is, block numbers relative to the start of the file, since we had no data on the actual physical blocks accessed. We assumed for simplicity that the file system used only a single block buffer. We used a reduced reference string since immediate rereferences to the same block would be handled by the file system buffer.

The effectiveness of a particular caching strategy, and even of caching as a whole, is dependent upon two properties of the workload: the degree of locality and the degree of sequentiality in the block references. Locality is a measure of the degree of concentration of the set of block accesses upon some subset of the blocks. The stronger the locality, the more effective caching is likely to be. Locality of reference has been observed and studied in a number of related areas such as program execution [11], database systems [12], and other aspects of information retrieval [13]. We can expect a high degree of locality of block accesses during a typical session. Most of a user's accesses would be concentrated within a small portion of the database, called a "hot spot." Users of full text retrieval systems tend to ask a set of queries about a particular topic and many queries are simply refinements of previous ones. Even over long periods of time, for a given application, some small set of blocks will be more frequently accessed than the remainder of the database.

Our sample workload was found to have a very high degree of locality: approximately 1% of the database received most of the block accesses. We felt that this was an abnormal situation caused by the characteristics of either the application or the particular workload, and decided that a modified workload would produce a more useful result. We found that the cumulative frequency distribution for the block accesses followed a Bradford-Zipf distribution [13] and so adopted Simpson and Alonso's method for approximating different degrees of locality within a workload [5]. The Bradford-Zipf distribution implies that, given some collection of blocks arranged in decreasing order of productivity (number of references), if we partition the blocks into k groups each containing the same number of productive blocks (i.e., each group is accessed with equal frequency), then the numbers of blocks in the groups fit the proportion 1 : n : n^2 : ... : n^(k-1). The Bradford multiplier (n) identifies the degree of locality. Table 1, taken from [5], shows the changing concentration of accesses for different values of n and k equal to 3.

Table 1. Degrees of locality (k = 3; each group of blocks receives one third of the accesses)

    Degree of locality (n)    Percent of data in each group
                              Group 1    Group 2    Group 3
     1                        33.3       33.3       33.3
     2                        57         29         14
     3                        69         23          8
     5                        81         16          3
    10                        90          9          1
    15                        93.4        6.2        0.4

For our experiments we assumed n = 3 and that a typical user session accesses 5% of the document database, that is, 203 blocks.
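As a cross-check on Table 1, the group sizes follow directly from the Bradford multiplier: with k groups in the proportion 1 : n : n^2 : ... : n^(k-1), and each group receiving an equal share of the accesses, the fraction of data in a group is its weight divided by the sum of the weights. A few lines of code (ours, not from the paper) reproduce the table rows:

```python
def bradford_groups(n: float, k: int = 3) -> list[float]:
    """Percent of data in each of k equally-accessed groups, ordered from
    the largest (least productive blocks) down, for Bradford multiplier n."""
    weights = [n ** i for i in range(k)]   # 1, n, n^2, ...
    total = sum(weights)
    # The largest group corresponds to the highest power of n.
    return [100 * w / total for w in reversed(weights)]

for n in (1, 2, 3, 5, 10, 15):
    print(n, [round(p, 1) for p in bradford_groups(n)])
# n = 3 gives [69.2, 23.1, 7.7], matching the Table 1 row.
```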
Sequentiality of access is the tendency to access runs of consecutive blocks. This is a common characteristic in full text systems and is especially prevalent during the browsing of documents. The Ful/Text system provides a user with the ability to browse particular "pages" of a document, where a page is a formatted screen of text. Ful/Text documents are stored in a compressed format, so browsing a page of a document involves sequentially reading that document from the beginning and producing all pages up to and including the desired page. These pages are retained until the next search transaction. A consistently sequential pattern of access allows us to anticipate which data blocks are likely to be accessed next and to fetch them before they are required.
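This design makes browse block traffic inherently sequential. A schematic of the access pattern (our sketch; decompress_block and the 1920-character page are illustrative stand-ins for Ful/Text's actual decompression and formatting):

```python
PAGE_CHARS = 1920  # one 24x80 screen of text; illustrative page size

def decompress_block(block: bytes) -> str:
    """Stand-in for Ful/Text's document decompression (illustrative)."""
    return block.decode("latin-1")

def browse_page(doc_blocks: list[bytes], page_no: int, page_cache: dict) -> str:
    """Return page page_no, reading the document sequentially from the
    start and retaining every earlier page until the next search."""
    if page_no in page_cache:
        return page_cache[page_no]
    text = ""
    for block in doc_blocks:                  # sequential block accesses
        text += decompress_block(block)
        while len(text) >= PAGE_CHARS:
            page_cache[len(page_cache)] = text[:PAGE_CHARS]
            text = text[PAGE_CHARS:]
            if page_no in page_cache:
                return page_cache[page_no]
    page_cache[len(page_cache)] = text        # final partial page
    return page_cache.get(page_no, "")
```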


Sequentiality in a workload is represented by runs, that is, references to consecutively numbered blocks. A RLBRS, therefore, can be viewed as a series of these runs. The length of a run is the number of references in that consecutive sequence. A run of length one is a block reference that is neither preceded by a reference to that block's predecessor nor succeeded by a reference to that block's successor. The majority of run lengths in the sample workload were uniformly distributed over the range one to nine. The run lengths in our modified workload also follow this distribution.
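Both the reduction and the run statistics are straightforward to compute; a sketch of how we interpret these definitions (our code, not the original analysis program):

```python
def reduce_reference_string(refs):
    """Drop immediate rereferences: they are absorbed by the file
    system's single block buffer."""
    out = []
    for r in refs:
        if not out or out[-1] != r:
            out.append(r)
    return out

def run_lengths(rlbrs):
    """Lengths of maximal runs of consecutively numbered blocks."""
    lengths, current = [], 1
    for prev, cur in zip(rlbrs, rlbrs[1:]):
        if cur == prev + 1:
            current += 1
        else:
            lengths.append(current)
            current = 1
    lengths.append(current)
    return lengths

print(run_lengths(reduce_reference_string([7, 7, 8, 9, 3, 4, 4, 5, 1])))
# -> [3, 3, 1]
```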


236

T. PATRICKMARTIN 5,

EXPERIMENTS

et al.

AND RESLKTS

Performance can be measured in a number of ways. The most obvious measure for an interactive information retrieval system is response time. Another important performance measure is throughput, that is, the number of transactions per second processed by the system. We considered both response time and throughput when evaluating the different caching strategies.

Our first set of experiments measures the effectiveness of the different caching strategies for a client configuration with a local 360 Kbyte floppy disk used to hold the cache.

[Fig. 3. Search average response times (a) and throughputs (b) vs. number of clients (floppy disk). Cases: Base, Server Cache, Client Browse, LRU, OBL, TTP.]

[Fig. 4. Browse average response times (a) and throughputs (b) vs. number of clients (floppy disk). Cases: Base, Server Cache, Client Browse, LRU, OBL, TTP.]

The communication lines between clients and server are 9600 baud. We ran the simulation varying the number of active clients from 1 to 24; hardware and software restrictions precluded any larger simulations. This is not a major disadvantage since the Sun 3/60, which our server machine represents, actually supports a maximum of 15 active user terminals at any one time.

The results for this set of experiments are contained in Figures 3 and 4. Figure 3 (a) and (b) show the average response times and throughputs for the search transactions; Figure 4 (a) and (b) show the average response times and throughputs for the browse transactions. Six cases are shown in each graph: the four caching strategies, labelled "Server Cache," "LRU," "OBL" and "TTP," plus a "Base" case and a "Client Browse" case. The "Base" case is the original simulation of distributed Ful/Text where all transaction processing is performed on the server machine. The "Client Browse" case has the browse processing moved to the client machines but no caching is used. The four graphs must be viewed concurrently in order to interpret the effect caching has on system performance.

The change from "Base" Ful/Text to the "Client Browse" case causes a large increase in browse response time because now all data used to process a browse request must be moved to the client site. This extra time spent transmitting data forces a decrease in the throughput for both searches and browses. The smaller number of searches and browses decreases the competition for the disk at the server and results in lower search response times with the larger numbers of clients.

The inclusion of caching at the client site offsets, to a degree, the effect of introducing browse processing at the clients. There is a large decrease in browse response times and a slight increase in search response times. The throughputs for both transaction classes also show slight improvement. The client caching strategies achieve better browse response times than the Client Browse case because a fraction of all block requests (54%, 74% and 72% for LRU, OBL and TTP, respectively) are handled without moving data from the server. The two prefetching strategies are able to satisfy more requests locally because they try to use the sequentiality of the workload as well as the locality. The effectiveness of TTP is limited in this case by the length of the think time: the system is only able to transfer two blocks within the given think time, which is less than the average run length, so a longer think time would improve the performance of TTP.

The Server Cache experiments assume a 500 Kbyte cache in main memory, which results in a hit ratio of 0.76. This scheme shows definite performance improvements over the base case. The search response time drops dramatically, while the throughputs for both transaction classes improve slightly and the browse response times remain unchanged. The server cache has the largest impact on the search transactions because they are dominated by disk I/O; in fact, for the base case approximately 85% of all disk accesses are attributed to searches. The Server Cache reduces the demand on the server disk (for example, with 24 clients the utilization is reduced from 80% to 27%), which means that all disk accesses encounter much less contention for the server disk. Browse response time is unaffected because the main demand from browse transactions is on the communication line.
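A rough sanity check on that reduction (our arithmetic, not from the simulation): if a fraction h of disk accesses hit the cache, disk utilization should scale by roughly 1 - h at a fixed transaction rate:

```python
base_util = 0.80   # server disk utilization, base case, 24 clients
hit_ratio = 0.76   # 500 Kbyte server cache
print(f"{base_util * (1 - hit_ratio):.2f}")  # ~0.19
```

The simulated figure of 27% is somewhat higher than this first-order estimate, which is consistent with the cache also allowing slightly more transactions, and hence more disk work, to be processed per unit time.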

5.2 Other parameters
We next experimented with modifications to several system parameters to see what impact these changes would have on the effectiveness of caching for the given workload. Figure 5 (a) and (b) show the average response times and throughputs for the search transaction class, and Figure 6 (a) and (b) show the average response times and throughputs for the browse transaction class, for two of these modifications.

The first change, labelled "TTP-HD" in the graphs, places the client cache on a local hard disk and increases the size of the cache to 1 MB. A cache of this size, using the TTP strategy, obtains a hit ratio of 0.98 for our workload. We can see that this change brings the performance of the split Ful/Text (browse and search processing on separate nodes) in line with the base case where all processing is done on the server. This combination of cache and TTP offsets the large increase in data transfer involved with distributing the browse processing. For TTP-HD versus the base case, browse and search throughputs are slightly less, browse response times are slightly longer, and search response times are virtually the same.

The second change, labelled "TTP-HD/Server Cache," combines the TTP-HD version of the client cache with the server cache. This strategy performs much better than the Base case and the TTP-HD strategy, but performs slightly worse than just the Server Cache. Search response times are dominated by the presence of the server cache and so they are identical for the Server Cache and TTP-HD/Server Cache cases. Browse response times are dominated by the client cache and so they are the same for the TTP-HD and TTP-HD/Server Cache cases. Search and browse throughputs are slightly worse for TTP-HD/Server Cache with respect to the Server Cache because of the longer browse times.

[Fig. 5. Search average response times (a) and throughputs (b) vs. number of clients (hard disk). Cases: Base, Server Cache, TTP-HD, TTP-HD/Server Cache.]

[Fig. 6. Browse average response times (a) and throughputs (b) vs. number of clients (hard disk). Cases: Base, Server Cache, TTP-HD, TTP-HD/Server Cache.]

The third modification is to the speed of the communication lines connecting the clients to the server. The number of clients in the experiments was fixed at 12 and the different caching strategies were run with line speeds of 1200, 2400, 4800, 9600 and 19,200 baud. Figures 7 and 8 show average response times and throughputs for the search and browse transactions, respectively. The figures show the performance for the Base case, TTP-HD and TTP-HD/Server Cache. We do not include the performance measurements for TTP with a floppy disk since they are not comparable to the other cases and we feel that a floppy disk is not a viable storage medium at any of these line speeds.

[Fig. 7. Search average response times (a) and throughputs (b) vs. network line speed (12 clients). Cases: Base, TTP-HD, TTP-HD/Server Cache.]

[Fig. 8. Browse average response times (a) and throughputs (b) vs. network line speed (12 clients). Cases: Base, TTP-HD, TTP-HD/Server Cache.]

The experiments show that, in general, the performance of the two caching strategies improves as line speeds increase. The higher line speeds also improve the performance of TTP-HD for browse transactions relative to the base case. The browse response times for TTP-HD approach, within a couple of seconds, those of the base case as the line speed increases. The browse throughputs for TTP-HD exceed those of the base case at 19,200 baud. Search response times and throughputs for TTP-HD are slightly poorer than the base case at the higher line speeds because of the better browse performance.

6. CONCLUSIONS

We have described our study of the performance of a prototype distributed full text retrieval system assuming various caching strategies. Our analysis of a particular workload for this system indicated that the major resource demands by search and browse transactions are on the server disk and the communication network, respectively. Different types of caches are required to alleviate these two different demands. We looked at two basic types of cache, namely a server cache and a client cache. A server cache is located on the server site and reduces the demand on the server disk. A client cache is located at a client site and is intended to reduce the traffic on the communication network.

The use of a client cache implies that the processing of document files for browse transactions is moved to the client sites, which involves a substantial increase in the amount of data moved to the client sites, since now all the raw data must be transferred to the clients and not just the requested document pages. A client caching strategy must make a significant reduction in block requests to balance this increase in network traffic. A server cache of 500 Kbytes provides significant improvements in search response time and in throughput of both search and browse transactions. The effectiveness of a client cache, however, is very dependent upon the size of the cache and the speed of the communication network. The large increase in the amount of data transferred to perform browse processing at the client sites meant that, for our workload, a 360 Kbyte cache on a floppy disk is not practical at all and a 1 Mbyte cache on a hard disk does not improve on the base case unless the communication speed is greater than 9600 baud.


We experimented with three block replacement algorithms for the client cache: Least Recently Used (LRU), LRU with One Block Lookahead (OBL) and LRU with Think Time Prefetch (TTP). All three algorithms take advantage of the locality in a workload, and the two prefetching algorithms (OBL and TTP) also try to take advantage of the sequentiality in a workload. OBL and TTP achieved higher hit ratios than LRU for our workload. In general, an LRU algorithm works well as long as locality is strong and/or the number of productive blocks is small, because the most recently used blocks have a good chance of being in the favoured subset of the cache. Once these facts change, however, LRU becomes less effective and we need a replacement policy more in line with the Bradford-Zipf distribution followed by the workload.

The two prefetching algorithms, OBL and TTP, have better hit ratios than plain LRU because they also take advantage of the sequentiality present in the workload. TTP has the advantage that it does its prefetching during the user's think time rather than during the execution of a transaction, which minimizes response time as much as possible. The usefulness of TTP is dependent upon the degree of sequentiality in the workload, the speed of the communication network and the length of the think time. A hidden cost of the prefetching is the number of unused blocks transferred to the client cache. This cost will increase as the variance in the run lengths increases.

We found that for the given workload a server cache provides definite performance advantages, but the value of a client cache, strictly on performance grounds, is questionable. Other factors, such as the expected growth in the number of users, the desirability of users' keeping local copies of data, and other applications contributing to the load on the server, would have to be considered before advocating the use of a client cache. For the given workload, the size of the cache was more important than the replacement algorithm used.

We believe that our test workload is not representative of many full text workloads and further experimentation is required to evaluate the effectiveness of caching. The service demands for the search, browse and collection open transaction classes are reasonable, but the collection was relatively small and the accesses very concentrated on a small number of the documents. We are developing a more general workload model that will allow us to experiment with such parameters as locality, sequentiality, think time and the amount of demand from other types of workloads.

Acknowledgements: We would like to thank Revenue Canada Taxation for supplying the test database and sample workload. We would also like to thank Dr. J. T. Smith of the Queen's University Statlab for his help with the data analysis.

REFERENCES
1. Macleod, I.A.; Martin, T.P.; Nordin, B.; Phillips, J.R. Strategies for building distributed information retrieval systems. Information Processing & Management 23(6): 511-528; 1987.
2. Terry, D.B. Caching hints in distributed systems. IEEE Transactions on Software Engineering SE-13(1): 48-54; January 1987.
3. Sandberg, R.; Goldberg, D.; Kleiman, S.; Walsh, D.; Lyon, B. Design and implementation of the Sun network file system. Proceedings of the USENIX 1985 Summer Conference; Berkeley, CA; 1985: 119-130.
4. Nelson, M.N.; Welch, B.B.; Ousterhout, J.K. Caching in the Sprite network file system. ACM Transactions on Computer Systems 6(1): 134-154; February 1988.
5. Simpson, P.; Alonso, R. Data caching in information retrieval systems. Proceedings of the 1987 ACM Conference on Research and Development in Information Retrieval; 1987: 296-305.
6. Fulcrum Technologies Inc. Ful/Text Programmer's Guide, Version 4.5. Ottawa, Ontario: Fulcrum Technologies Inc.; June 1988.
7. Stonebraker, M. Operating system support for database management. Communications of the ACM 24(7): 412-418; July 1981.
8. Ferrari, D.; Serazzi, G.; Zeigner, A. Measurement and tuning of computer systems. London: Prentice-Hall Inc.; 1983.
9. Garrison, W.J. NETWORK II.5 User's Manual, Version 2. CACI; December 1985.
10. Smith, A.J. Sequentiality and prefetching in database systems. ACM Transactions on Database Systems 3(3): 223-247; September 1978.
11. Coffman, E.G.; Denning, P.J. Operating systems theory. London: Prentice-Hall Inc.; 1973.
12. Casas, I.R. PROPHET: A layered analytical model for performance prediction of database systems. CSRI Tech. Report CSRI-180. Toronto, Ontario: Computer Systems Research Institute; 1986.
13. Brookes, B.C. Bradford's law and the bibliography of science. Nature 224(5223): 953-956; December 1969.