Hypermedia and free text retrieval


Information Processing & Management, Vol. 29, No. 3, pp. 287-298, 1993. Printed in Great Britain. 0306-4573/93 $6.00 + .00. © 1993 Pergamon Press Ltd.

M.D. DUNLOP* and C.J. VAN RIJSBERGEN
Department of Computing Science, University of Glasgow, Glasgow G12 8QQ, U.K.

Abstract: This paper discusses aspects of multimedia document bases and how access to documents held on a computer-based system can be achieved; in particular, the current access methods of hypermedia and free text information retrieval are discussed. Browsing-based hypermedia systems provide ease of use for novice users and equal access to any media; however, they typically perform poorly with very large document bases. In contrast, query-based free text retrieval systems are typically designed to work with very large document bases, but have very poor multimedia capabilities. This paper presents a hybrid between these two traditional fields of information retrieval, together with a technique for using contextual information to provide access, through query, to documents that cannot be accessed by content (e.g., images). Two experiments are then presented that were carried out to test this approach. Finally, the paper gives a brief discussion of a prototype implementation, which provides access to mixed media information by query or browsing, and user-interface issues are discussed.

INTRODUCTION

The research reported in this paper was conducted with the main aim of providing a general method for query-based access to non-textual documents. The motivation behind this work comes from two directions: firstly, the requirement to provide access to non-textual documents held in large computer-based document bases; secondly, consideration of the current methods for accessing such document bases. Computers have traditionally been used for processing numerical and textual information, with the vast majority of computers now used almost exclusively for processing textual documents. There are, however, many fields of work that require access to non-textual information; for example, medics require access to x-rays, architects to building plans, ornithologists to bird calls, and estate agents to property photographs. In such fields the non-textual information is at least as important as the textual information that may accompany it; in other areas the non-textual information is used to highlight details or to give alternative views. With recent advances, in quality and price, of display and storage technology, computers are being used more regularly for the production of images, animation, and music. It is becoming apparent to many computer users that it is possible to create large libraries of documents that contain mixed textual and non-textual information. Most existing non-textual libraries are held on non-computer media (e.g., public libraries often have an extensive music selection held on cassette, vinyl, or compact disc). Within these libraries items are either indexed by artist, title, or by a rough classification (i.e., textual identifiers that are used to describe the non-textual medium). Libraries traditionally accessed textual documents by the same process; for example, novels are typically indexed by author and title.
In recent years, however, there has been a significant growth in computer-based library systems that have access to the entire text of the document (or at least a paragraph or two extracted from the document). This not only allows the searcher to partially examine the content of the documents (e.g., academic papers) without going to the shelf, but permits searching of the documents' content. In a textual environment, reasonably effective general purpose algorithms have been developed that allow a user to input a natural language sentence, which is then matched against all the documents in the document base (van Rijsbergen, 1979; Croft & Harper, 1979). However, no such general purpose algorithms exist, or are likely to exist, for the automatic matching of non-textual documents.

A version of this paper was presented at the RIAO 91 conference in Barcelona (Dunlop & van Rijsbergen, 1991). This paper completely supersedes the conference paper in the development and testing of the model; however, the conference paper gives more details on the prototype application and wider issues described at the end of this paper.
*May be reached by e-mail to [email protected].

The problem of accessing non-textual nodes by query has typically been solved with domain specific solutions or by associating a piece of text with each non-textual document. Domain specific solutions have generally been split into two categories: searching a limited pictorial language (e.g., Constantopoulos et al., 1991, for document retrieval, and Kurlander & Bier, 1988, for searching and replacing within a larger drawing) or in specific application domains (e.g., Andrew et al., 1990, for access to images held within a document base of medical scans, and Hirabayshi et al., 1988, who used impressions such as bright, flamboyant, and formal, to describe images held within a document base of fashion photographs). The use of a textual entry representing a non-textual document (e.g., Al-Hawamdeh et al., 1991), though at first promising, is likely to lead to many problems, since these descriptions must be created by an indexer. This is not only a time-consuming task, but is also likely to be unreliable, since human indexers may produce inconsistent and biased descriptions due to their own perspective and level of domain knowledge; for example, an auctioneer would index the Mona Lisa very differently from a fine art student. Partly to provide access to multimedia document bases, there has been a rapid increase in the popularity of browsing-based hypermedia* systems in recent years. These systems, based on early work by Bush (1945) and Nelson (1967), allow the user to browse through the document base using a series of paths (or links) connecting documents together. These paths need not be restricted to accessing textual nodes and often access nodes in many media, providing a very natural environment for the storage and retrieval of non-textual information.
Browsing-based retrieval systems are, however, restricted in scale due to the undirected approach users must take. General reviews of work in hypermedia can be found in a special edition of CACM (Smith & Weiss, 1988) and in Conklin (1987). Begoray (1990) and Neilsen (1990) present more recent surveys together with discussion of the design decisions involved in creating a hypermedia system. The contrast between browsing and querying may be expressed in terms of a book library. When using a small library (e.g., a small department library) or when looking within a field with which one is very famiIiar, it is often easier to simply browse through the bookcases looking for books that are useful. The organisation of the library and any labels provided will help locate the required books. Alternatively, when looking for books in an unknown domain in a large library, it is much easier to start with a query, either to the librarian for help or to the catalogue system (whether computerised or not). This distinction has led to the inclusion of query routines in hypermedia systems, so that the user can issue a query to locate the approximate areas to browse. These queries cannot, however, provide direct access to non-textual nodes, leaving users to browse to these from textual nodes. In such systems, non-textual documents must be linked on paths from textual documents; otherwise it would not be possible to access them. These links could then be used to provide access to non-textual nodes directly by query, based on the non-textual document’s context in the document base; for example, if a digitised image were linked with various textual nodes and a reasonable proportion of these textual nodes were relevant to the user’s query, it is likely that the image itself would be relevant. To make use of this form of access to non-textual information, a combined hypermedia and free text retrieval model of information retrieval must be used. 
Frisse (1988) developed such a hybrid model and used it in the development of a medical handbook that allowed access through query and browsing. The model used by Frisse takes account of the hypermedia structure when performing a query, as opposed to the approach here, which treats the two access methods as almost orthogonal. The model developed here will use hypermedia links to give an approximation to the content of a non-textual node; this approximation, or descriptor, is then treated as the document's content, with the retrieval engine having no understanding of links.

*Throughout this paper the term hypermedia will be used as a general term that encompasses mixed media and text-only variants of the access method. The text-only variant is also referred to, elsewhere, as hypertext.

A MODEL FOR ACCESSING NON-TEXTUAL NODES

Within a hypermedia system composed of many nodes of different media, it is likely that the structure shown in Fig. 1 would occur naturally. Indeed, if access to the non-textual node is to be permitted, then links must exist between textual and non-textual nodes. In a retrieval system that provides free text querying and hypermedia browsing, it would be possible to issue a textual query that would result in one (or more) of the textual nodes being presented to the user. The user could then use the links to browse from the matched document to the non-textual document. These links can also be used to calculate an artificial descriptor for the image* node, which would permit it to be retrieved directly from a textual query. The textual nodes connected to the image node can be considered as forming a cluster: a group of nodes (or documents) that are very closely related. Cluster techniques to calculate the average meaning, or cluster representative, of the cluster can then be applied to establish the overall content of the documents that compose the cluster. As these nodes are all linked with the image node, the cluster representative can be assigned to the image, giving the image a retrieval content equal to the combined content of the nodes connected to it. This approach provides a method for automatically calculating a descriptor for non-textual nodes; this descriptor can then be used to perform information retrieval operations (e.g., querying and relevance feedback) directly on the non-textual node.

A simple cluster-based model of retrieval (Level 1 cut-off)

The descriptor of a non-textual node can be calculated by considering each document, d, as an N-dimensional vector, where each term occurring in the document base is considered as a dimension, and the value of d_i is the weight of term i in document d. The cluster centroid algorithm (Croft, 1978) can then be used to calculate the point in N-dimensional space that is at the centroid, or centre, of the points representing the documents in the given cluster. The algorithm essentially averages the weight of terms in neighbouring documents, and can be expressed as follows:

    ∀i ∈ [1...N]: W_{d,i} = ( Σ_{c ∈ C_d} c_i ) / |C_d|

where

W_{d,i} = cluster-based weight of term i of document d, and
C_d = cluster of documents linked to, and from, d.
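The level 1 centroid can be sketched in a few lines. This is a minimal illustration, assuming each document is represented as a dictionary mapping terms to weights; the function name is for illustration only.

```python
def level1_descriptor(cluster):
    """Average the term weights of the documents directly linked to a
    non-textual node (its level-1 cluster), giving its cluster centroid."""
    centroid = {}
    for doc in cluster:
        for term, weight in doc.items():
            centroid[term] = centroid.get(term, 0.0) + weight
    n = len(cluster)
    # Each dimension is divided by the cluster size, as in the formula above.
    return {term: total / n for term, total in centroid.items()}
```

The resulting dictionary can then be assigned to the non-textual node as its retrieval descriptor.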

*To increase clarity, this discussion considers an image node to be connected with many textual nodes. There is no requirement for the central node to be an image, and it could be composed of any non-textual medium.

Fig. 1. Non-textual object linked to textual objects.

Supporting a wider context (Level 2 cut-off)

The level 1 cut-off, described above, only takes into account the immediate neighbours of a node in the hypermedia network. It may be useful to consider more of the context of the nodes when calculating their cluster-based descriptors. The basic model is extended below to take into account all nodes that can be reached from the node in question by following at most two links, and vice-versa: nodes that can reach the non-textual node within two links (since links need not be symmetrical).

    ∀i ∈ [1...N]: W_{d,i} = ( Σ_{c ∈ C_d} c_i + k · Σ_{c ∈ C'_d} c_i ) / ( |C_d| + k · |C'_d| )

where

W_{d,i} = cluster-based weight of term i for document d,
C_d = cluster of documents linked to, and from, document d,
C'_d = cluster of documents linked to, and from, document d by exactly two links, C'_d = ( ∪_{e ∈ C_d} C_e ) − C_d − {d}, and
k = a constant, 0 ≤ k ≤ 1, defining the relative strength of the more remote neighbours.

The model can be continually extended to include more and more distant nodes; however, the impact of these extra nodes would be small, as they are quite distant from the node of interest. The model can also be extended by taking into account different strengths of links between nodes, for example, as a result of the type of link or the media of the destination node. Dunlop (1991) discusses these issues in more detail.
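The level 2 cut-off can be sketched as follows. This is an illustrative reading of the extended formula, assuming links are stored as a node-to-neighbours map with both directions already merged, and that only indexable nodes appear in the `docs` map; all names are illustrative.

```python
def level2_descriptor(node, links, docs, k=0.5):
    """links: dict node -> set of linked nodes (to and from, merged);
    docs: dict node -> {term: weight} for indexable nodes only;
    k: relative strength of the more remote (two-link) neighbours."""
    level1 = links.get(node, set())
    # Nodes reachable in exactly two links, excluding level-1 nodes and
    # the node itself (C'_d in the formula above).
    level2 = set()
    for n in level1:
        level2 |= links.get(n, set())
    level2 -= level1 | {node}

    def total(nodes):
        acc = {}
        for n in nodes:
            for term, w in docs.get(n, {}).items():
                acc[term] = acc.get(term, 0.0) + w
        return acc

    near, far = total(level1), total(level2)
    denom = len(level1) + k * len(level2)
    terms = set(near) | set(far)
    return {t: (near.get(t, 0.0) + k * far.get(t, 0.0)) / denom for t in terms}
```

With k = 0, this reduces to the level 1 centroid; raising k gives the two-link neighbours more influence on the descriptor.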

Limitations of the model

Although the model developed here provides a general purpose method of indexing non-textual nodes for access by textual query, it does have some limitations. These are mainly connected with the quality and quantity of links available in a given document base. The model assumes that the document base contains many nodes that can have descriptors assigned from their content (in a traditional document base this set is restricted to textual nodes). This restricts use of the model to document bases having a reasonable ratio of indexable (e.g., textual) to non-indexable nodes. In order for the cluster-based algorithms to provide some benefit, each non-indexable node should be linked to at least two indexable nodes. If only one link is available, then the model degrades to a poor variant of the traditional method by which non-indexable media are retrieved by indexing a hidden textual description. This restriction does not, however, require the network to be composed of twice the number of indexable nodes than non-indexable nodes. As an absolute minimum requirement for level 1 cut-off, each non-indexable node in the network must be connected directly to at least one indexable node. If the document base uses level 2 cut-off, then the requirement is reduced to a state in which every non-indexable node must be connected to an indexable node with no more than one intervening node on the path. The precise ratio of indexable to non-indexable nodes required for reasonable cluster descriptors to be created is not clear, and will vary among document bases. In general, a ratio of 1:1 could be considered a reasonable minimum. As with many areas of free text information retrieval, there is no solid cut-off point, and the model can be used with lower indexable to non-indexable ratios; but the effectiveness of the retrieval engine will decrease with the percentage of indexable nodes.
Likewise, the model benefits from increased ratio of indexable to non-indexable nodes and an increased number of links-so long as the number of links from each node is still a small subset of the document base. As the number of nodes used to calculate a cluster description approaches the total number of nodes in the document base, the effectiveness of the retrieval engine also reduces; as the cluster descriptor becomes a descriptor for the entire document base, its ability to distinguish documents within the base is reduced. As with plain hypermedia, the quality of links is also very important. The links must


connect the node being indexed to other similar nodes. If the links are to very similar nodes, then the descriptors will more precisely describe the content of the cluster and therefore more accurately describe the non-textual node. This issue is addressed in more detail later.

TESTING THE CLUSTER DESCRIPTOR

To test the validity of this approach, two experiments were carried out using a text-only document base. The experiments initially calculated the descriptor of each document by traditional statistical analysis and then a second descriptor for each document based on the cluster technique described above. The test collection was composed of 3204 records derived from the Communications of the ACM (Fox, 1991). Each record was composed of (amongst other fields) a title, an abstract, a list of keywords, and a list of references to other records in the collection. Many of the records in the collection did not have any references to, or from, other records, and so could not be used for these comparative experiments; this resulted in only 1751 records being used. Initially each document had a descriptor calculated for it using traditional statistical analysis of the words composing the document. These words were passed through a stop word filter (van Rijsbergen, 1979, pp. 28-29) to remove most words that have no meaning when taken out of context. The remaining words were then conflated (Porter, 1980) so that inflectional suffixes were removed. The words were then assigned a score to state their ability to distinguish relevant from non-relevant documents (Sparck Jones, 1971). A cluster descriptor was then calculated for each document based on the documents that were cited by it and that cited it. This descriptor was calculated by use of the simpler, level 1 cut-off method described above.
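The content-descriptor pipeline just described can be sketched as below. This is only an illustration of the three stages (stop-word removal, suffix conflation, term weighting): the tiny stop list and the crude suffix stripper stand in for the full stop list and the Porter (1980) algorithm, and the inverse-document-frequency weight stands in for the Sparck Jones (1971) scoring.

```python
import math
import re

# Illustrative stand-ins for the real stop list and stemmer.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "for"}

def crude_stem(word):
    """Strip a few common inflectional suffixes (a toy Porter stand-in)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    return [crude_stem(w) for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS]

def idf_descriptors(texts):
    """Return one {term: weight} vector per document, weighting each term
    by log(N / document-frequency) + 1 so that rarer terms score higher."""
    token_lists = [tokenize(t) for t in texts]
    n = len(texts)
    df = {}
    for tokens in token_lists:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    return [{t: math.log(n / df[t]) + 1 for t in set(tokens)}
            for tokens in token_lists]
```

These content-based vectors play the role of the "traditional" descriptors against which the cluster descriptors were compared.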

Experiment 1 - Comparing different clusters

When each document had a descriptor calculated for it by traditional and by cluster techniques, these two descriptors were compared using the cosine coefficient (van Rijsbergen, 1979, p. 39). This coefficient considers the two N-dimensional vectors (which start at the origin) representing the descriptors and calculates the cosine of the angle between them. If the value is 0, this corresponds to two perpendicular (θ = 90°) vectors, or two documents with no common terms; a value of 1 (θ = 0°) corresponds to two identical* documents. The cosine coefficient was calculated between each document's traditional, content-based descriptor and its cluster descriptor. As a base case, the same process was carried out with cluster descriptors being calculated with random links (as opposed to using citations as links). Figure 2 shows the correspondence between the two sets of coefficient values obtained by comparing each traditional descriptor with the corresponding citation-linked cluster descriptor and a random-linked cluster descriptor. The distribution of links for random cluster assignment was devised so as to approximate the distribution of the actual citation links, but was not a perfect match; hence the three outlying points (which, in these cases, represent single documents). The document base was heavily biased towards records having very few links associated with them. Figure 2 averages the correlation values for each integral number of links; consequently the mean values need not be the mean of the points displayed. The cluster descriptors based on citations achieved an average 23% similarity with the original textual descriptor, whereas random-based cluster descriptors achieved only 4% accuracy. This shows that the citation-based cluster descriptors perform significantly (approximately six times) better than the random case; it also shows that the cluster technique is likely to provide descriptors of suitable quality to be used by a retrieval system.
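The cosine coefficient used above can be sketched as follows, assuming the same dictionary-of-weights representation for descriptors; the function name is illustrative.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-weight vectors u and v
    (dicts mapping term -> weight): 0 for no common terms, 1 for
    identical descriptors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

The same coefficient serves both for comparing descriptors in this experiment and for matching queries against documents in the next.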

Experiment 2 - Assessing the effect on precision and recall

To gain an impression of how cluster-based retrieval would perform in practice, an experiment was run to compare the precision and recall of a given retrieval engine on the

*As far as the retrieval system is concerned.


Fig. 2. Comparison of random and link-based cluster descriptors.

CACM collection. The experiment was based on the 64 standard queries (with relevant documents, or answers) that are provided with the CACM collection. Initially, the main text of each query was matched, using the cosine coefficient, against the descriptor for every document in the CACM collection. This produced a sorted list of matched documents. This set of 64 queries was then repeated, but using citation-based clusters to represent the documents. The outcome of this experiment was two sets of query solutions, each solution stating the query number and which documents were considered the best matches (in decreasing order of matching value). From these lists, the top M documents were selected, where M refers to the number of documents given in the CACM sample answers, and compared with the sample answers using the standard measures of precision and recall to produce recall and precision figures for each query. Similar figures were produced for the top 2M and 3M documents. As a base case the experiment was then repeated using clusters based on random links, as opposed to the citations used to calculate the main clusters. Finally, the recall and precision figures were averaged over the number of documents retrieved (same as in sample solutions, twice as many, or thrice as many) and presented in Figs. 3 and 4. From these graphs it is quite clear that citation-based clusters provided almost as good retrieval performance as directly using the records' content (approximately 70% of the performance). Considering that the approach is intended to retrieve documents that cannot be retrieved directly, this quality of result is very encouraging, and confirms that these descriptors are of suitable quality for use in a retrieval engine. Not surprisingly, clusters based on random links provided a very low retrieval performance (approximately 4% when compared with content-based retrieval).
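The evaluation step above (taking the top M, 2M, or 3M matches and scoring them against the sample answers) can be sketched as a small helper; names are illustrative.

```python
def precision_recall(ranked, relevant, cutoff):
    """ranked: list of doc ids in decreasing order of matching value;
    relevant: set of doc ids from the sample answers;
    cutoff: number of top documents to score (M, 2M, or 3M)."""
    retrieved = ranked[:cutoff]
    hits = sum(1 for d in retrieved if d in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Averaging these figures over the 64 queries at each cutoff gives the points plotted in Figs. 3 and 4.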
Statistical analysis of the performance figures was carried out using the Wilcoxon signed pairs test (Siegel, 1956). For this analysis, the E-measure of retrieval performance (van Rijsbergen, 1979) was used; this provides a single value that describes the performance of the retrieval and can be defined as below. The E-measure produces a number in the range 0 to 1, where 0 represents a perfect retrieval (recall and precision 1), and 1 represents a complete failure (recall and precision 0).

    E_q = 1 − 1 / ( α/P_q + (1 − α)/R_q )
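The E-measure can be sketched directly from its definition; the default α = 0.5 weights precision and recall equally, as in these experiments.

```python
def e_measure(precision, recall, alpha=0.5):
    """van Rijsbergen's E-measure: 0 is perfect retrieval, 1 is complete
    failure; alpha controls the relative weight of precision and recall."""
    if precision == 0.0 or recall == 0.0:
        return 1.0  # complete failure by definition
    return 1.0 - 1.0 / (alpha / precision + (1.0 - alpha) / recall)
```

One E-value per query and condition gives the paired samples fed to the Wilcoxon test.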

Fig. 3. Graph of recall against number of documents retrieved. (Key: Real, Citations, Random; x-axis: Single, Double, Triple.)

where

E_q = E-measure for query q,
P_q = precision for query q,
R_q = recall for query q, and
α = a scaling factor such that 0 ≤ α ≤ 1; for these results α = 0.5.

Fig. 4. Graph of precision against number of retrieved documents. (Key: Real, Citations, Random; x-axis: Single, Double, Triple.)

Table 1. E-measure values: mean values with standard deviations in brackets

                     Single          Double          Triple
Real data            0.753 (0.225)   0.778 (0.167)   0.804 (0.133)
Citation clusters    0.840 (0.166)   0.843 (0.143)   0.861 (0.116)
Random clusters      0.995 (0.016)   0.991 (0.020)   0.997 (0.016)

Table 1 shows the mean E-values, for the 64 queries, together with their standard deviations. Significance figures were then calculated based on pairs of E-measures, where one value was extracted from queries based on document content and one from context-based descriptors. The test was carried out for the three levels of retrieval (single, double, and triple) and for both cluster definitions (citations and random). A null hypothesis that the retrieval performance was the same for cluster- and content-based descriptors was adopted. Since it was strongly suspected that the content descriptors would perform better than cluster descriptors, a one-tail test was carried out. As can be seen below, this hypothesis was rejected for all cases with random clusters (for any level of significance). At the 0.05 significance level, the null hypothesis is also rejected for citation-based clusters. However, at the 0.01 significance level, the null hypothesis cannot be rejected for single and double retrieval levels; at this level of significance, it cannot be stated that content-based descriptors are better than citation-cluster-based descriptors (see Table 2 for probabilities). Although not a strong result, when assessing equality of retrieval performance, the experiment does show that context-based cluster retrieval has similar performance to content-based retrieval when the links are meaningful. The differences in performance shown here are also within the range of differences typically found in information retrieval experiments. The results given here for citation-based clustering are based on use of the entire CACM document base, of which 45% of documents cannot be accessed by citation-based indexing, since they have no citations (within the document base) and are not cited. The figures shown here are, therefore, somewhat pessimistic for fully connected document bases; separate experiments need to be carried out on such a document base.

Limitations of these experiments

The experiments described here have attempted to show that the cluster-based model of descriptor calculation is worthwhile. To provide a test situation in which the effects could be compared against an automatically derived base case, a text-only document base was used. This raises two issues about the validity of these experiments: will the results extend to a non-textual environment and will links in hypermedia document bases have the same properties as the citations used here? The usage of the text document base was designed so that when calculating the cluster-based descriptor of a node, its content was completely ignored and the node was treated as if it were non-textual. When considering the calculations carried out with respect to a particular node, the only time its contents were used* was in the calculation of the content-based descriptors for comparison with the cluster-based descriptor. As a result of this, there is no reason why the results shown here cannot be extended to a multi-media environment in which the content-based descriptor cannot be calculated. A stronger challenge to the validity of this experiment comes when considering the relationship between the citations used here and links in a hypermedia document base. These different forms of connection between nodes have two important properties in common: They are created by human users, and there is no single formal definition of the relationship between the two connected nodes. When writing a scientific paper, authors will cite other work for various reasons; for example, citations might be to similar work, contradictory work, interesting work in another field, source for methods used, or for deeper discussions on topics briefly covered. Although much of the work cited by authors is in the same subject area, this is not always the case, and citations are not always a strong indicator of the subject content of a paper. Likewise in hypermedia networks, authors will include links to many nodes that they think users might find interesting. The motivation for creating, or following, many of the links in a hypermedia environment would appear

*Excluding when it was used to calculate the cluster descriptors for other nodes.

Table 2. Probabilities that cluster-based retrieval has the same performance as content-based

                     Single     Double     Triple
Citation clusters    0.0287     0.0307     0.0016
Random clusters      0.00000    0.00000    0.00000


to be very similar to that for citations; both are simply connections to something the reader may find useful or interesting. It would appear that the relationship between the experimental conditions and those in which the cluster-based algorithm will be used is similar enough (in the areas of importance) that the results shown here will be valid in a hypermedia document base. The relatively low correlations achieved in the first experiment and low matching quality in the second will, in part, be due to the simplistic retrieval engine used in the tests. The retrieval engine was based on simple information retrieval techniques and did not include many features, such as a thesaurus, that would improve the ability of the retrieval engine to match documents. The effects of improving the retrieval engine would have no, or very little, effect on the correlations (and retrieval performance) for randomly created clusters, thus increasing the difference between random and citation-based correlations.

WIDER ISSUES So far, this paper has concentrated on the development and testing of a hypothesis that contextual information can be used to retrieve documents that cannot be retrieved by content. This section extends the argument to consider more general issues concerned with the combination of query-based and browsing-based retrieval. These issues are all considered further in Dunlop (1991).

Improvement of textual descriptions So far the discussion has been based on the premise that contextual information would be used to retrieve documents which cannot be retrieved by content (e.g., non-textual nodes). Although the main purpose of the approach presented here is to provide access to non-textual nodes, it may also be used to improve the quality of textual nodes by defining them partly upon their context in the document base (i.e., merging the content-based and context-based descriptors). This is similar to work by Sparck Jones (1971) on query expansion by use of clustering information.

Effect on relevance feedback In query-based systems that provide relevance feedback, the power of feedback is potentially reduced by the fact that users can only pass judgment on documents that matched their most recent query, as these are the only documents presented.* However, when query and browsing access are both provided, users can browse through the document base after issuing a query. This approach treats the query as a method for providing starting points for a browsing-based search in a similar manner to early work by Croft (1978). When browsing, users can view, and hence give relevance feedback on, documents that do not match the current query. The strength of positive feedback on such nonmatched documents could be expected to be higher than for matched documents, as the user is, in some sense, correcting the retrieval decision.

Removal of header nodes Hypermedia systems often require users to select a document from the underlying file system, for example, Guide by OWL (Brown, 1986), or they provide a method for users to browse to the actual documents/nodes they are concerned with, for example, the Home Card of HyperCard (Goodman, 1989). When the document base is large, these manual access techniques become harder to use, and, in the case of specific access nodes, very difficult to create. After the initial document/node is chosen, the users can, however, browse freely using links between documents. The provision of query-based access alleviates the requirement for such manual access methods, since users can provide the starting point(s) for a browse by issuing a query.

*This is technically an interface issue and is not necessarily the case for query-only access; however, it is very typical of query-only systems. As a counter-example, Sanderson and van Rijsbergen (1991) permit users to access documents from previous queries and to give feedback on these for the current query.

Use in single search style systems

So far the discussion has assumed that this model of retrieval will be accessed through

both queries and browsing. However, this need not be the case-the hybrid model of document bases can be used “behind the scenes” to provide some query-style features to users who can only browse, and vice versa. When the hybrid model of document bases is presented through a browsing-only interface, nearest neighbour links (van Rijsbergen, 1979) could be provided between nodes. These links would provide a method for browsing between two nodes that, although not directly linked, have very similar content. This reproduces some of the effects of queries, but within an entirely browsing environment. The idea of using information about documents to create links is not new, and was developed by Ivie (1966) for search routines. It would also be possible to use access to nodes as a mild form of relevance feedback and, consequently, build an impression of the users’ interests. This impression could then be used either to provide extra links or highlight links to nodes that match this impression. However, the dynamic creation of links would have to be very rapid to prevent degrading the browsing performance of the system. A free-text retrieval engine could also be used by authors of the hypermedia base to assist in the creation of links by suggesting nodes very similar to the node they are adding. This would reduce the theoretical requirement for authors to scan the entire document base for nodes to be connected with the new node. In a system restricted to providing query-only access (e.g., due to compatibility constraints), links may still exist in the underlying document base. These could be used to provide context-based access to non-textual nodes and to improve the description of textual nodes-indeed, the development of links may be done entirely for these purposes. 
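Generating nearest neighbour links of this kind reduces to computing a similarity between every pair of node descriptions and linking each node to its closest neighbour above some threshold. The sketch below uses cosine similarity over `{term: weight}` vectors; the function names, the threshold value, and the brute-force pairwise comparison are assumptions for illustration (a real system would precompute these links, given the performance concern noted above).

```python
import math

def cosine(a, b):
    """Cosine similarity between two {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def nearest_neighbour_links(vectors, threshold=0.3):
    """For each node, link to its single most similar other node,
    provided the similarity clears a threshold (illustrative sketch)."""
    links = {}
    for node, vec in vectors.items():
        best = max(((cosine(vec, other), n)
                    for n, other in vectors.items() if n != node),
                   default=(0.0, None))
        if best[0] >= threshold:
            links[node] = best[1]
    return links

# Hypothetical miniature document base.
vectors = {"rule-88": {"turning": 1.0, "right": 1.0},
           "rule-90": {"turning": 1.0, "lane": 1.0},
           "rule-12": {"pedestrian": 1.0}}
links = nearest_neighbour_links(vectors)
```

Here the two content-similar rules become mutually linked while the unrelated node gets no generated link, giving a browsing-only user some of the effect of a content query.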
It is not clear whether a user interface that provided access by a combination of queries and browsing would perform as well as, or better than, a single search style system that had been supplemented as described above. Although a hybrid (query and browsing) search strategy has potentially more power than either browsing-only or query-only searches, it is also a more complex model of retrieval, which may reduce users' effectiveness. A set of user tests (Dunlop, 1991) showed very small differences in user effectiveness, with respect to time and quality of results, when using a basic integration of querying and browsing; better results are expected for the tighter integration of the access methods suggested above. The tests were carried out by 30 users (mainly undergraduate students with limited ability in using a computer and no knowledge of computing science) over approximately 1.5 hours per user, using the prototype system mmIR (multimedia Information Retriever) described briefly below.

A prototype system, mmIR, has been developed, based on the ideas presented in this paper, to provide access to the British Highway Code (Department of Transport, 1987) by browsing, querying, or a combination of both. The Highway Code is a mixed media document composed of 198 rules, which may reference each other, and 31 images, which are required to fully understand the textual rules. Figure 5 presents a typical screen from mmIR, showing the fifth best matched node for the query "position in road when turning right." The Navigator window provides users with most commands for browsing around the document base and/or the matched document list, while links to other nodes are presented as a list in the main window (see Fig. 6). Whereas the main window in Fig. 5 shows an image node that matched the query, Fig. 6 shows a non-matching textual node, which was accessed by link traversal and is itself linked to node 55, which may or may not match the current query.
mmIR is described more fully in Dunlop and van Rijsbergen (1991) and Dunlop (1991).

Fig. 5. A typical screen shot.

CONCLUSIONS

This paper builds upon a combined model of information retrieval which encompasses the principles and benefits of both free text retrieval and hypermedia. The model allows users to have access to large document bases with limited structure, but allows them to browse using whatever structure exists. The paper developed a model for approximating the meaning (or content) of documents that cannot be retrieved by content (e.g., images), by use of contextual information extracted from a hypermedia network. Two experiments, in a text-only environment, have shown that the model provides descriptions of documents that are reasonably similar to content-based descriptions and that, when issuing queries, context-based descriptions performed reasonably effectively when compared with content-based descriptions.

Further work needs to be carried out into the use of these methods within a large multimedia document base; however, the testing of such a system would be more difficult since no standard collection (with relevance judgments) is available. It would also be useful to experiment with different document bases that make different use of non-textual media. For example, queries to fine art catalogues would probably be expected to retrieve images directly, whereas following links is perfectly acceptable for a bar chart in a company report that is merely highlighting the text. Work also needs to be carried out to compare the different levels of cluster-based descriptions presented here. Overall, the experiments, and observation of a small prototype system, have shown that use of context information from a hypermedia network to retrieve non-textual nodes by query is effective and of reasonable quality.

Acknowledgement: The authors would like to acknowledge the support of the Science and Engineering Research Council, who funded the main author's Ph.D. work, from which this paper is mainly derived. The authors would also like to thank Robert Oddy for his extensive help during the writing of this paper.

Fig. 6. Display of a non-matching node (with link).

REFERENCES

Al-Hawamdeh, S., Ang, Y.H., Hui, L., Ooi, B.C., Price, R., & Tng, T.H. (1991). Nearest neighbour searching in a picture archive system. Proceedings of the ACM SIGIR International Conference on Multimedia Information Systems, National University of Singapore.
Andrew, M.L., Bose, D.K., & Couby, S. (1990). Scene analysis via Galois lattices. Proceedings of the IMA Conference on A Maths Revolution Due to Computing, Oxford University Press.
Begoray, J.A. (1990). An introduction to hypermedia issues, systems and application areas. International Journal of Man-Machine Studies, 33, 121-131.
Brown, P.J. (1986). Interactive documentation. Software: Practice and Experience, 16(3), 291-299.
Bush, V. (1945). As we may think. Atlantic Monthly, 176(1), 101-108.
Conklin, J. (1987). Hypertext: An introduction and survey. IEEE Computer, 20(9), 17-41.
Constantopoulos, P., Drakopoulos, J., & Yeorgaroudakis, Y. (1991). Retrieval of multimedia documents by pictorial content: A prototype system. Proceedings of the ACM SIGIR International Conference on Multimedia Information Systems, National University of Singapore.
Croft, W.B. (1978). Organising and searching large files of document descriptions. Ph.D. Thesis, University of Cambridge.
Croft, W.B., & Harper, D.J. (1979). Using probabilistic models of document retrieval without relevance information. Journal of Documentation, 35, 285-295.
Department of Transport (1987). The Highway Code. London: Her Majesty's Stationery Office.
Dunlop, M.D., & van Rijsbergen, C.J. (1991). Hypermedia & free text retrieval. Proceedings of the RIAO 91 Conference on Intelligent Text and Image Handling (pp. 337-356). Universitat Autonoma de Barcelona, Catalunya, Spain.
Dunlop, M.D. (1991). Multimedia Information Retrieval. Ph.D. Thesis, Computing Science Department, University of Glasgow. Report 1991/R11.
Fox, E. (1991). CACM document collection. Virginia Polytechnic Institute and State University.
Frisse, M.E. (1988). Searching for information in a medical handbook. Communications of the ACM, 31(7), 880-886.
Goodman, D. (1989). The complete HyperCard handbook. New York: Bantam Computer Books.
Hirabayashi, F., Matoba, H., & Kasahara, Y. (1988). Information retrieval using impression of documents as a clue. Proceedings of the 1988 ACM SIGIR Conference.
Ivie, E.L. (1966). Search procedures based on measures of relatedness between documents. Ph.D. Thesis, M.I.T., Cambridge, MA. Report MAC-TR-29.
Kurlander, D., & Bier, E.A. (1988). Graphical search and replace. Computer Graphics, 22(4), 113-120.
Nelson, T.H. (1967). Getting it out of our system. In G. Schechter et al. (Eds.), Information retrieval: A critical review (pp. 191-210). Washington, DC: Thompson Books.
Nielsen, J. (1990). Hypertext and Hypermedia. San Diego, CA: Academic Press.
Porter, M.F. (1980). An algorithm for suffix stripping. Program, 14(3), 130-137.
Sanderson, M., & van Rijsbergen, C.J. (1991). NRT (New Retrieval Tool). Electronic Publishing: Origination, Dissemination and Design, 4(4), 205-217.
Siegel, S. (1956). Nonparametric statistics: For the behavioral sciences. New York: McGraw-Hill.
Smith, J.B., & Weiss, S.F. (1988). Hypertext. (Introduction to the special issue on hypermedia.) Communications of the ACM, 31(7).
Sparck Jones, K. (1971). Automatic keyword classification for information retrieval. London: Butterworths.
van Rijsbergen, C.J. (1979). Information retrieval (second edition). London: Butterworths.