Information Systems Vol. 16, No. 1, pp. 1-11, 1991
0306-4379/91 $3.00 + 0.00
Copyright © 1991 Pergamon Press plc. Printed in Great Britain. All rights reserved.
DATA CACHING STRATEGIES FOR DISTRIBUTED FULL TEXT RETRIEVAL SYSTEMS

T. PATRICK MARTIN and JUDY I. RUSSELL
Department of Computing and Information Science, Queen's University, Kingston, Ontario K7L 3N6, Canada

(Received 6 November 1989; received for publication 26 July 1990)

Abstract-We discuss and evaluate several strategies for caching data in a distributed full text retrieval system. The evaluation of the strategies is carried out using a simulation model of a distributed full text retrieval system. We describe the development of this simulation model, which was based on a distributed prototype of an existing full text retrieval system, namely Ful/Text [1]. The simulation model was used to analyze the effect data caching would have on the performance of such a system and to show the relationship between several properties of the workload and the effectiveness of data caching.

Key words: Full text retrieval systems, distributed systems, simulation, data caching

1. INTRODUCTION

The popularity of text databases, or full text retrieval systems, has grown rapidly in recent years. There are companies which offer access to very large text databases such as newspaper stories or court case records, and many organizations now retain data such as office correspondence, or program source and documentation, in text databases. The use, and size, of text databases can be expected to expand even further with the general acceptance of applications like office information systems and hypertext. Full text retrieval systems, however, are very I/O-intensive and place a very heavy demand on system resources such as secondary storage and communication networks. One possible solution to this overloading is to move to a distributed system environment where the load can be divided among multiple machines. There are a number of ways the load could be divided. For example,

- the document collection could be partitioned across multiple server machines;
- the system workload could be decomposed so that each server handles a different type of request;
- some of the processing could be moved from the server machine to the various client machines.

We favor the last option because with this strategy the system is able to take advantage of properties of the workload such as the degree of locality, and the amount of sequentiality, in the access to the data. We believe that, regardless of performance advantages, the popularity of distributed systems will force the development of distributed versions of full text retrieval systems as it did with database management systems.

The overall objective of our research has been to define strategies for developing distributed full text retrieval systems and to predict the effect of these strategies on system performance. Our approach to accomplishing this objective has been to develop simulation models of a distributed full text retrieval system. In this paper we discuss the results of a study which was carried out to analyze the performance of different data caching strategies for a distributed full text retrieval system. First, we summarize our previous work on distributed full text retrieval systems and survey related work. Second, we describe our simulation model. Third, we describe several data caching strategies. Fourth, using the simulation model, we analyze the performance of a system employing these caching strategies. We also show the relationships between the effectiveness of the caching strategies and parameters of the workload. Finally, we discuss the results of the experiments and draw some conclusions.

2. RELATED WORK

We have studied several techniques for efficiently distributing full text retrieval systems. In one study [2], we examined the effectiveness of distributing and replicating system components across multiple sites. This type of strategy reduces the load on the database server but increases the amount of network traffic. We found neither distributing the system components, nor replication, to be particularly cost-effective. Strategies employing these concepts tended to provide relatively small performance gains while greatly increasing the complexity of the system.

We believe that the most likely configuration for a distributed full text retrieval system will be multiple autonomous database servers, each server holding one or more document collections. The system would provide an interface which would allow a single query to access these multiple collections concurrently. In order to support this kind of system configuration, we propose several caching schemes in this paper and analyze their performance. Caching takes advantage of the trend for users to replace terminals with personal computers or workstations by moving some of the work from the server to these client workstations. Caching also takes advantage of a number of properties found in typical full text retrieval workloads. Our view that data caching is a useful technique in a distributed full text retrieval system is supported by a case study we performed which examined the performance of a prototype distributed full text retrieval system and a particular user workload [3]. We extended a simulation model of this prototype with several caching strategies and found that performance was improved for the given workload.

Simpson and Alonso [4] have also studied caching as a means of integrating PCs and workstations into widely distributed information retrieval systems. They studied caching using a simulation model but their model is a general representation of a distributed information system. Service demands and workload characteristics for their model were arrived at by reading manuals and using sample systems for a short time. Our work, on the other hand, is based upon a detailed performance analysis of an existing system. Actual workloads and system measurements were used to provide the parameters to the simulation model.

Data caching has become a common technique for reducing the cost of accessing data in a distributed system [5]. By maintaining local caches of recently acquired data, programs can amortize the high cost of querying remote data servers over several references to the same data. Caching is used to great advantage in a number of network based file servers such as Sun's Network File System [6] and the Sprite Network File System [7]. Data caching is not used extensively in distributed database systems because of the problem of maintaining the consistency of cached data. A cached copy of a data object is not an "official" copy and so is not automatically subject to any updates on that object. Special arrangements must be made to receive these updates, or more typically, a cached copy is given a "lifetime" which indicates the length of time the data is expected to remain valid. Distributed full text retrieval systems, on the other hand, are a more promising candidate for caching than distributed database systems because documents are usually never updated and rarely deleted.

3. SIMULATION MODEL

The simulation model described in this paper is based upon an existing system, Ful/Text from Fulcrum Technologies [1]. The structure of the Ful/Text system is shown in Fig. 1. This structure is typical of information retrieval systems which use inverted files as the indexing mechanism. A document is any set of textual information. It can be stored entirely within the Ful/Text Catalog file or in a separate operating system file called a Document File. Documents are grouped into collections and each collection has its own set of system files. The Ful/Text system files consist of the Catalog and Catalog Map files, which contain information about each document in the collection, and perhaps all or part of the contents of the document; the Dictionary and Reference files, which are used by the Search Engine to locate the documents which satisfy the incoming queries; and other auxiliary files such as a stop word file and a thesaurus file. A toy sketch of the inverted-file organization is given after Fig. 1.

Fig. 1. Ful/Text system structure (Document Files; Catalog and Catalog Map; Dictionary and Reference files).
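To make the role of the Dictionary and Reference files concrete, the sketch below shows one conventional way an inverted-file index can be organized: a dictionary mapping each term to a postings (reference) list of document identifiers. The class and method names are illustrative only; this is not Ful/Text's actual file layout.

```python
from collections import defaultdict

class InvertedFileIndex:
    """Toy inverted-file index: a 'dictionary' of terms, each pointing to a
    'reference' list (postings) of document ids. Illustrative only; not the
    Ful/Text Dictionary/Reference file formats."""

    def __init__(self):
        self.dictionary = defaultdict(set)   # term -> set of document ids
        self.catalog = {}                    # document id -> stored text

    def add_document(self, doc_id, text):
        self.catalog[doc_id] = text
        for term in text.lower().split():
            self.dictionary[term].add(doc_id)

    def search(self, *terms):
        """Simple Boolean AND search: documents containing every query term."""
        postings = [self.dictionary.get(t.lower(), set()) for t in terms]
        return set.intersection(*postings) if postings else set()

# Example: index two documents and run a conjunctive query.
index = InvertedFileIndex()
index.add_document(1, "caching strategies for distributed retrieval")
index.add_document(2, "distributed database systems")
print(index.search("distributed", "retrieval"))   # -> {1}
```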

The interface between the Ful/Text system and application programs or an interactive user front-end is a set of functions called the Programmer Interface (PI). All requests are translated by the PI into a set of function calls to one or more of the underlying system components.

3.1. Workload model

We define a resource-oriented model of a full text workload which characterizes the workload in terms of its demands on the system resources, namely CPUs, disks and networks. We consider retrieval to consist of three types of transactions: a collection open, a search and a browse. A collection open signifies the start of a retrieval session. It involves a small amount of disk I/O and communication. A search transaction accesses the system files to produce a set of pointers to documents that satisfy the search condition. It involves very little communication and a significant amount of disk I/O. A browse transaction accesses the document files and returns a formatted page of a document for viewing. It involves both communication and disk I/O, but in a wide area network with relatively slow communication lines the communication costs will dominate.

The resource demands associated with each transaction type were derived from a sample workload [3]. The demand on the network was based upon observing the number, and sizes, of messages exchanged during a transaction. The demand on the disk was calculated from the number, and types, of accesses to the different files. The files were classified as either sequential or random access in order to assign appropriate overheads. CPU demand, relative to the demands on the disks and the networks, was not a significant factor in our sample workload, which contained mostly simple Boolean queries that do not require any significant processing of the results. An average think time was also associated with each transaction type.

The disk block accesses associated with a typical session are represented with a reduced logical block reference string (RLBRS) [8]. An RLBRS is a sequence of logical disk block references, that is, references to blocks of a file numbered sequentially from the start of the file, with all immediately repeated references removed. We assume, for simplicity, that the file system uses only a single block buffer. We use a reduced reference string since immediate references to the same block would be handled by the file system buffer. A small sketch of how such a string can be derived from a raw reference string follows.
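As a minimal sketch (in Python, with an illustrative helper name), the reduction simply drops any reference that repeats the immediately preceding block, which is exactly what a single-block file system buffer would absorb.

```python
def reduce_reference_string(references):
    """Build a reduced logical block reference string (RLBRS):
    drop any reference to the block referenced immediately before it,
    since a single-block file system buffer would satisfy that access."""
    reduced = []
    previous = None
    for block in references:
        if block != previous:          # keep only non-immediate repeats
            reduced.append(block)
        previous = block
    return reduced

# Example: logical block references within one file.
raw = [5, 5, 6, 7, 7, 7, 8, 2, 2, 3]
print(reduce_reference_string(raw))    # -> [5, 6, 7, 8, 2, 3]
```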


The string of block accesses for a session is characterized by two important properties: its degree of locality, and its degree of sequentiality. Locality is a measure of the degree of concentration within the set of block accesses upon some subset of the blocks. The stronger the locality, the more effective caching is likely to become. Locality of reference has been observed and studied in a number of related areas such as program execution [9], database systems [10], and other aspects of information retrieval [11]. We can expect a high degree of locality of block accesses during a typical session. Most of a user's accesses would be concentrated within a small portion of the database, called a "hot spot". Users of full text retrieval systems tend to ask a set of queries about a particular topic and many queries are simply refinements of previous ones. Even over long periods of time, for a given application, some small set of blocks will be more frequently accessed than the remainder of the database.

We model the frequency distribution of block accesses in a session with a Bradford-Zipf distribution [11]. The Bradford-Zipf distribution implies that, given some collection of blocks arranged in decreasing order of productivity (number of references), if we partition the blocks into k groups each accounting for the same number of references (i.e. each group is accessed with equal frequency), then the numbers of blocks in the groups fit the proportion 1 : n : n^2 : ... : n^(k-1). The Bradford multiplier (n) identifies the degree of locality. Table 1, taken from Ref. [4], shows the changing concentration of accesses for different values of n when k is equal to 3. A short sketch of this computation is given after the workload assumptions below.

Sequentiality of access is the tendency to access runs of consecutive blocks. This is a common characteristic in full text systems and is especially prevalent during the browsing of documents. The Ful/Text system provides a user with the ability to browse particular "pages" of a document, where a page is a formatted screen of text. Ful/Text documents are stored in a compressed format so browsing a page of a document involves sequentially reading that document from the beginning and producing all pages up to, and including, the desired page. These pages are retained until the next search transaction. A consistently sequential pattern of access will allow us to anticipate which data blocks are likely to be accessed next and to fetch them before they are required. Sequentiality in a workload is represented by runs, that is, references to consecutively numbered blocks. An RLBRS, therefore, can be viewed as a series of these runs. The length of a run is the number of references in that consecutive sequence. A run of length one is a block reference that is neither preceded by a reference to that block's predecessor nor succeeded by a reference to that block's successor.

We make the following assumptions in constructing the block access strings for our workload model:

- The database contains around 30,000 documents. The average document length is 10 4-Kbyte blocks.
- Over a session, which spans on the order of 15 days, a user accesses 1% of the documents, that is 300 documents or 3000 blocks. We chose a large observation interval for a session to make sure that the number of productive blocks was much larger than the typical size of a client cache.
- A user accesses 30 documents a day, which gives an average of 50,000 block accesses in a session. These accesses follow a Bradford-Zipf distribution.
- Initially, think time is 30 s for both search and browse transactions; run lengths are uniformly distributed over the range 1-10, and the locality number for the block access distribution is n = 3.
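As an illustration of how the locality number translates into data concentration, the small sketch below (our own Python illustration) computes the fraction of the blocks in each of the k equal-frequency groups from the proportion 1 : n : n^2 : ... : n^(k-1); for k = 3 it reproduces the percentages shown in Table 1.

```python
def bradford_zipf_groups(n, k=3):
    """Fraction of the blocks in each of k groups that receive equal numbers
    of references, when group sizes follow 1 : n : n^2 : ... : n^(k-1).
    Returned largest group first, i.e. the least-densely accessed data first."""
    sizes = [n ** i for i in range(k)]          # 1, n, n^2, ...
    total = sum(sizes)
    return [100.0 * s / total for s in reversed(sizes)]

# For k = 3 this matches Table 1: e.g. n = 3 gives roughly 69%, 23% and 8%
# of the data each receiving one third of the accesses.
for n in (1, 2, 3, 5, 10, 15):
    print(n, [round(g, 1) for g in bradford_zipf_groups(n)])
```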

Table 1. Degrees of locality

Degree of locality (n)    Percent of data accessed 1/3 of the time
 1                        33.3    33.3    33.3
 2                        57      29      14
 3                        69      23       8
 5                        81      16       3
10                        90       9       1
15                        93.4     6.2     0.4


3.2. Simulation program

The simulation program is written in a simulation package called NETWORK II.5 [12]. This package has a number of built-in features to facilitate the representation of communication networks and provides detailed reports on resource usage and other aspects of the simulation run.

The system configuration consists of a single server machine holding the collection files which is connected to 12 client sites by 9600 baud communication lines. The client machines are personal computers which are capable of running a version of Ful/Text and which have some amount of local disk storage. The number of client sites is set at 12 because we found, experimentally, that this number introduced a sufficient amount of contention for resources and simulation runs could still be completed in a reasonable amount of time. The server is a medium-sized system with ample fast disk storage. The disk storage is modelled as two devices to account for the differences in overhead for sequential and random access files. The random access files will, in general, have much larger overheads because of the increased average seek times. The client CPU is approx. 10 times slower than the server CPU.

The client sites generate instances of the three transaction types based on the frequencies from the workload characterization. Each transaction type causes some sequence of events to occur in the simulation. The sequence of events for a transaction type represents the resource demands on the system provided by the workload characterization. A rough timing illustration follows.
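As a back-of-the-envelope sketch (our own illustration, not part of the NETWORK II.5 model), the configuration above already suggests why communication dominates browse cost when the server is lightly loaded: a single raw 4-Kbyte document block needs several seconds on a 9600 baud line.

```python
# Rough per-block timing for the configuration above (illustrative only).
LINE_BAUD = 9600         # bits per second on each client-server line
BLOCK_BYTES = 4 * 1024   # 4-Kbyte document blocks assumed by the workload model

block_transfer_s = BLOCK_BYTES * 8 / LINE_BAUD
print(f"transferring one raw document block takes ~{block_transfer_s:.1f} s")  # ~3.4 s
```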

4. CACHING STRATEGIES

The two most common text retrieval operations are a collection search and a document browse. Our choices of caching strategies are intended to improve the performance of one or both of these operations.

A search transaction accesses system files to produce a set of pointers to documents that satisfy the search condition. It involves very little communication and a significant amount of disk I/O. Caching the system files at a client does not make sense. The files are very large so downloading a system file would take a significant amount of time and the file would occupy a substantial amount of local storage. The system files are also dynamic. They are updated every time a new document is added to the collection so cached copies of these files would become out-of-date and require refreshing.

A browse transaction accesses the document files and returns a formatted page of a document for viewing. It involves both communication and disk I/O. The use of relatively slow communication lines, such as phone lines, means that communication will be the dominant cost. Caching document files at the client sites could reduce the amount of communication involved over a complete session. Caching document files is viable since the contents of a document do not change and we would only have to cache a subset of all the documents, and perhaps only portions of individual documents. The key factor for the success of document caching is that, in a typical session, a user accesses a small percentage of the document data a large percentage of the time.

The use of a client cache involves shifting the browse processing from the server to the client sites. This distribution of function is hopefully a way to reduce the overall communication costs and the load at the server site. The server's involvement in the browse function is reduced to supplying document blocks as they are requested by the clients. We assume that client sites will be personal computers or workstations with a limited main memory so that the client cache is stored on a local hard disk. This has the advantage that the contents of the cache can persist over long sessions and between sessions.

We experimented with several replacement policies but, given the absence of any updates to the database, we found that they all had similar results. The experiments discussed in this paper use a simple least recently used (LRU) replacement policy, that is, when a new block must be placed in the cache, the current block that has not been referenced for the longest period of time is replaced. Since we are not considering updates, a replaced block does not have to be written back to the disk and can just be discarded.

We analyze three implementations of a client cache. The first implementation uses a simple LRU replacement policy. The second implementation uses a prefetching scheme, one block lookahead (OBL), in combination with the LRU replacement. OBL tries to anticipate the next block needed by a transaction. When a document block access is made and that block is not in the local cache, the client sends a request to the server for that block and for the next block in the document file. We assume that document files are physically contiguous so the cost of getting the next block is just the transfer time for that block. The third implementation uses another prefetching scheme we call think time prefetch (TTP), in combination with LRU replacement. TTP attempts to anticipate future block requests and carries out the prefetching during the think time between transactions, when a client site is idle waiting for the user to formulate the next transaction. After a search transaction, we transfer the beginning blocks from the first document in the search result. After a browse transaction, we transfer blocks following the block used in that browse transaction. The number of blocks that are transferred for each prefetch is a function of the length of the think time, the speed of the communication link and the speed of the client disk subsystem.

The caching strategies are modelled in a separate C program that simulates the actions of the different strategies. The resulting hit ratios, that is, the percentage of system accesses that can be serviced from the cache for each strategy, are then incorporated into the simulation program. The hit ratios are used to reduce the number of requests for document blocks issued by a client to the server. A simplified sketch of the three strategies is given below.
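The following sketch (in Python rather than the original C, with illustrative names and a hypothetical prefetch budget) shows the essence of the three client cache variants: plain LRU replacement, LRU with one block lookahead, and LRU with think-time prefetching. It is a minimal illustration of the mechanisms described above, not the authors' simulator.

```python
from collections import OrderedDict

class ClientCache:
    """LRU client cache over document blocks, with optional OBL or TTP prefetching.
    Illustrative sketch only; the original study used a separate C simulator."""

    def __init__(self, capacity_blocks, strategy="LRU", prefetch_budget=8):
        self.capacity = capacity_blocks
        self.strategy = strategy                # "LRU", "OBL" or "TTP"
        self.prefetch_budget = prefetch_budget  # blocks movable per think time (TTP)
        self.blocks = OrderedDict()             # block id -> True, kept in LRU order
        self.hits = 0
        self.requests = 0

    def _install(self, block):
        """Place a block in the cache, evicting the least recently used one."""
        if block in self.blocks:
            self.blocks.move_to_end(block)
            return
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)     # discard LRU block (no write-back)
        self.blocks[block] = True

    def browse_reference(self, block):
        """One document block reference issued by a browse transaction."""
        self.requests += 1
        if block in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block)
            return
        self._install(block)                    # fetch the missing block from the server
        if self.strategy == "OBL":
            self._install(block + 1)            # one block lookahead: fetch successor too

    def think_time_prefetch(self, anticipated_blocks):
        """TTP only: during user think time, prefetch anticipated blocks
        (start of the first result document, or successors of the last browse)."""
        if self.strategy != "TTP":
            return
        for block in anticipated_blocks[: self.prefetch_budget]:
            self._install(block)

    def hit_ratio(self):
        return self.hits / self.requests if self.requests else 0.0

# Example: a short run of sequential browse references with TTP prefetching.
cache = ClientCache(capacity_blocks=256, strategy="TTP")
cache.think_time_prefetch(list(range(100, 108)))   # anticipate blocks 100..107
for block in range(100, 110):                      # browse ten consecutive blocks
    cache.browse_reference(block)
print(f"hit ratio: {cache.hit_ratio():.2f}")       # 8 of 10 references hit the cache
```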

5. EXPERIMENTS

We present the results of a number of simulation experiments. These experiments demonstrate the effect that different system and workload parameters have on the performance of a distributed full text retrieval system. They also show the effect that the different caching strategies have upon the response time performance of the system as the values of these parameters change. The parameters we consider are the load on the server, the degree of locality in the workload, the degree of sequentiality in the workload, and the average length of a user's think time between transactions. In all of the experiments we assume that a client site has a 1 MB cache on a hard disk.

5.1. Server load

In this set of experiments, we assume that the degree of locality is n = 3, the degree of sequentiality, or average runlength, is 9, and the think time is 30 s. Figures 2 and 3 show the browse and search response times, respectively, for the three client caching strategies as the load on the server machine is increased. The graph Base shows the response time for the basic system where all processing is performed on the server. An increase in the load on the server is modelled in the simulation by a slowdown in the service times of the different resources at the server site. The x-axis in the two figures represents the factor by which the total load on the server is increased, and hence the corresponding slowdown in service times of its resources.

When the load on the server is light, the response time for a browse is dominated by the demand on the communication network. The base case has the smallest response time because it transfers less data than the caching strategies. The base case transfers formatted document pages while the caching strategies transfer raw document blocks. TTP has a better average response time than LRU or OBL because it transfers the prefetched data during the think times between transactions rather than as part of a transaction. When the load on the server increases, the major factor in the response time becomes the server disk, and the response times for the Base, LRU and OBL cases increase substantially. The response times for the TTP case, on the other hand, remain relatively constant. TTP has the highest hit ratio of the three caching strategies and requires the smallest number of disk accesses within a transaction of all the cases. It is interesting to note that, at higher server loads, even though OBL has a higher hit ratio than LRU, it experiences longer response times. A block access that cannot be serviced from the cache forces two block accesses in OBL: the required block and its successor. When block access times become large, this prefetching negates any advantage in hit ratios OBL may have had over LRU.

Search transactions make heavy use of the server disk so increasing the server load causes dramatic increases in the response times. We can also see that the client caching strategies have a negative effect on the search response time. A shorter average browse response time means a longer average search response time when the server is heavily loaded. The shorter browse times imply that a client can issue more search transactions, which increases the contention for the disk on the server, and hence increases the delay experienced by the transactions.

Fig. 2. Browse response time vs server load.
Fig. 3. Search response time vs server load.

5.2. Degree of locality

The effect of the degree of locality of a workload upon the browse response times achieved by the different client caching strategies is shown in Fig. 4. We assume that the server load factor is 40, the average runlength is 9, and the average think time is 30 s.

Fig. 4. Browse response time vs locality.
Fig. 5. Browse response time vs runlength.

The browse response times for the Base case, which cannot take advantage of any locality, remain constant. The browse response times for all three caching strategies improve as locality increases because their hit ratios increase and the number of block requests to the server decreases. The browse response times for TTP become increasingly better than the base case as the degree of locality grows. The hit ratios associated with both TTP and OBL improve more than those of LRU as locality increases because of the prefetching being performed. As discussed above, however, the high server load offsets some of the gains made by OBL.

The locality of accesses to document blocks has no direct effect on the search response times. There is, however, an indirect effect apparent in the simulation runs. As browse transactions become shorter, the number of search transactions issued increases. The increase in search transactions means there is more contention for the heavily loaded resources at the server site, which results in added delays and increases in the search response times.

5.3. Degree of sequentiality

The effects of the degree of sequentiality of a workload upon the browse response times are shown in Fig. 5. The degree of sequentiality is represented by the average runlength, which appears on the x-axis of the figure. The average runlength is directly related to the average length of documents in the collection. The longer the documents, the longer the expected runlengths. The simulation model assumes that every browse transaction accesses four document blocks, so a longer average runlength reflects a larger number of consecutive browses to the same document. We assume that the degree of locality is n = 3, the server load factor is 40, and the average think time is 30 s.

The Base system does not take any advantage of sequentiality and browse response times remain constant as the average runlength increases. LRU and OBL initially show some improvement in browse response times as the average runlength is increased but their response times flatten out and, in the case of OBL, even increase slightly. This indicates that LRU and OBL cannot take advantage of the longer runlengths. Their initial decreases are most likely due to changes in the frequency distribution of the block accesses brought on by the increased runlengths. TTP achieves significant response time decreases and outperforms the Base case for average runlengths of approx. 15 or greater. TTP's prefetching takes advantage of the sequentiality in the workload to reduce the number of block requests to the server during a browse transaction. The improvements made by TTP are limited by the think time, which determines how many blocks can be prefetched before each transaction. The TTP caching strategy ensures that only the first browse in every run requires a relatively large number of blocks to be transferred, so the longer the runs the less frequently these expensive browse transactions occur.

The runlength, like the locality, of the document accesses has no direct effect on search response times. These response times increased in the simulation runs because more search transactions were able to be generated as the response times for the browse transactions decreased.

5.4. Think time

Figures 6 and 7 demonstrate the effects of increasing the think time upon browse and search response times, respectively. The think time is the average amount of time a user takes between transactions.
We assume that the degree of locality is n = 3, the average runlength is 9 and the server load factor is 40. The only significant effect of increasing the average think time between transactions is a reduction of the load on the server. This results in decreases in the search response times in all cases. The think time does not have a meaningful effect on the browse response times. The decreases in the times for the Base, OBL and LRU cases are due to the decreases in the load on the server. TTP, the one strategy that makes use of the think time, is the least affected by the increases. This indicates that TTP can take advantage of longer think times only with longer average runlengths.

Fig. 6. Browse response time vs think time.
Fig. 7. Search response time vs think time.

6. CONCLUSIONS

We have described some of our work towards developing distributed full text retrieval systems. We outlined a simulation model built to study these systems. Using this model, we saw that achieving good performance in distributed full text systems will depend upon how we utilize the server site and upon the amount of data transferred to the client sites. We proposed several data caching strategies to address these problems and described experiments which studied the effect that server load, workload locality, workload sequentiality and think time had on the performance of these caching strategies.

The caching strategies we discussed all employed a local cache at each client site. The use of a client cache implies that the processing of document files for browse transactions is moved to the client sites. This change requires a substantial increase in the amount of data moved to the client sites since now all the raw data must be transferred to the clients and not just the requested document pages. A client caching strategy must make a significant reduction in block requests to balance this increase in network traffic. We experimented with three block replacement algorithms for the client cache: plain LRU, LRU with one block lookahead (OBL) and LRU with think time prefetch (TTP). All three algorithms take advantage of the locality in a workload. The two prefetching algorithms, OBL and TTP, have better hit ratios than plain LRU because they also take advantage of the sequentiality present in the workload. TTP has the advantage that it does its prefetching during the user's think time rather than during the execution of a transaction, which minimizes response time as much as possible. The usefulness of TTP is dependent upon the degree of sequentiality in the workload, the speed of the communication network and the length of the think time. A hidden cost of the prefetching is the number of unused blocks transferred to the client cache. This cost will increase as the variance in the run lengths increases. In general, an LRU algorithm works well as long as locality is strong, and/or the number of productive blocks is small, because the most recently used blocks have a good chance of being in the favored subset of the cache. Once these facts change, however, LRU becomes less effective and we need a replacement policy more in line with the Bradford-Zipf distribution followed by the workload.

Server load has a major impact on the performance of a distributed full text retrieval system and steps must be taken to lighten a heavily loaded server. Looking at the performance of browse transactions, we saw that the performance of the TTP client caching strategy improved relative to the Base case as server load increased. At light server loads, the browse response time is dominated by communication but at heavy loads the disk accesses become the major part of response time.


The TTP algorithm tries to execute most of its disk accesses during the think time between transactions where they do not directly affect response time.

The degree of locality within a workload is an important property which can be exploited by caching to make significant improvements in response time. We saw that, in a heavily loaded system, a client cache using TTP overcomes the extra block accesses implied by performing browse processing at the client, and gives better average browse response times than the Base case for moderate to strong locality. The locality within the workload from a full text retrieval system user is typically very high so we expect that client caching will prove cost-effective.

The degree of sequentiality within a workload is another important property which can be exploited by data caching. TTP performs better in a heavily loaded system than the Base case for runlengths of approx. 15 or greater. Thus, in a system where the average document is relatively large and users tend to browse a large part of a retrieved document, a caching strategy like TTP will help the browse performance. We also saw that the other two strategies, which use little or no prefetching, cannot take advantage of the sequentiality.

The load on the server site is reduced when the average think time between transactions is increased. TTP is the only strategy that makes use of this idle time and stands to benefit from the increases. However, the usefulness of this time is linked to the average runlength in the block accesses. It is cost-effective to fill up the think time prefetching blocks only if all these blocks are likely to be accessed within the next few browse transactions.

Client caching, while improving the average browse response time in a heavily loaded system, worsens the average search response time. Users are able to issue more searches, which increases the delay caused by contention at the server. If, in addition to the client cache, a server cache is implemented, then this problem can be overcome. Figures 8 and 9 show the search and browse response times, respectively, achieved by combining a 5 MB server cache and a 5 MB client cache using TTP for varying degrees of locality. We assume a server load factor of 40, a runlength of 9 and a think time of 30 s. Comparing this combination, labelled SCTTP in the figures, with the Base case, we see that it has better search and browse response times for all degrees of locality. The improvements to the search response times range from 35 to 81% and the improvements to the browse response times range from 40 to 80%.

Fig. 8. Search response time for TTP with server cache.
Fig. 9. Browse response time for TTP with server cache.

Our experiments indicate that, with the intelligent use of data caching, distributed versions of full text retrieval systems which share the work among server and client machines are viable and cost-effective. The caching is necessary to offset the increased data transfers implied by moving browse processing to the client machines. Data caching should be employed at both server and client sites. The server cache should be application-controlled and not just the cache supplied by the file system. The client cache should employ a strategy like TTP which exploits the locality and sequentiality of the workload by prefetching document blocks but which does so during the user think time between transactions so that the cost of transferring the extra blocks is not directly reflected in the system response times.

Acknowledgement: This work was supported by the Natural Sciences and Engineering Research Council of Canada.

REFERENCES

[1] Fulcrum Technologies Inc. Ful/Text Programmers Guide, Version 4.5. Fulcrum Technologies Inc., Ottawa, Ontario, June (1988).
[2] I. A. Macleod, T. P. Martin, B. Nordin and J. R. Phillips. Strategies for building distributed information retrieval systems. Information Process. Mgmt 23(6), 511-528 (1987).
[3] T. P. Martin, I. A. Macleod, J. I. Russell, K. Leese and B. Foster. A case study of caching strategies for a distributed full text retrieval system. Information Process. Mgmt 26(2), 227-247 (1990).
[4] P. Simpson and R. Alonso. Data caching in information retrieval systems. Proc. 1987 ACM Conf. Research and Development in Information Retrieval, pp. 296-305 (1987).
[5] D. B. Terry. Caching hints in distributed systems. IEEE Trans. Softw. Engng SE-13(1), 48-54 (1987).
[6] R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh and B. Lyon. Design and implementation of the Sun Network File System. Proc. USENIX 1985 Summer Conf., pp. 119-130 (1985).
[7] M. N. Nelson, B. B. Welch and J. K. Ousterhout. Caching in the Sprite Network File System. ACM Trans. Comput. Systems 6(1), 134-154 (1988).
[8] A. J. Smith. Sequentiality and prefetching in database systems. ACM Trans. Database Systems 3(3), 223-247 (1978).
[9] E. G. Coffman and P. J. Denning. Operating Systems Theory. Prentice-Hall, Englewood Cliffs, N.J. (1973).
[10] I. R. Casas. PROPHET: a layered analytical model for performance prediction of database systems. CSRI Technical Report CSRI-180, Computer Systems Research Institute, Toronto, Ontario (1986).
[11] B. C. Brookes. Bradford's law and the bibliography of science. Nature 224(5223), 953-956 (1969).
[12] W. J. Garrison. NETWORK II.5 User's Manual, Version 2. CACI, Los Angeles, Calif. (1985).