Online text retrieval via browsing


Information Processing & Management Vol. 24, No. 1, pp. 31-37, 1988. Printed in Great Britain.
0306-4573/88 $3.00 + .00 © 1988 Pergamon Journals Ltd.

J. F. COVE and B. C. WALSH
Department of Computer Science, University of Liverpool, Liverpool L69 3BX, England

(Received 16 October 1986; accepted 9 April 1987)

Abstract: Browsing refers to information retrieval where the initial search criteria are generally quite vague. The fundamentals of browsing are explored as a basis for the creation of an intelligent computer system to assist with the retrieval of online information. Browsing actions via a computer terminal are examined, together with new methods of accessing text and satisfying user queries. Initial tests with a prototype system illustrated the use of different retrieval strategies when accessing online information of varying structure. The results suggest the construction of a more intelligent processing component to provide expanded capabilities for content extraction and navigation within text documents.

Keywords: Browsing, Intelligent tools, Associated words, User strategies, Online retrieval.

INTRODUCTION

The term “browsing” is usually applied to the act of moving about a library and dipping into books, picking out pieces of information of all kinds. The term derives from the feeding behavior of deer as they select fresh young shoots, and so carries the connotation of selecting worthwhile and useful information. For an online information system, browsing provides a suitable paradigm that may be employed as a useful tool. It is related to searching in which the initial search criteria are only partly defined [5]. It is a purposeful activity occasioned by a felt information need or interest. In addition, because recognition is easier than recall, a further way of describing browsing is as the art of not knowing what one wants until one finds it.

In conventional information retrieval systems, the user is required to formulate a query on which to search, which demands familiarity with the special terms and keywords that appear in a text. Browsing, by contrast, should give a priori details about the information under investigation.

The ultimate aim of work in this area is to develop an intelligent system that will complement and enhance human browsing but will not replace it. An intelligent system will be able to route some of the difficult problems to the user. This is not an undesirable feature, for a browsing system is a tool that leaves the user to think about and understand the material being processed. Browser systems should not be developed to mimic some of a human browser’s functions. Their intelligent function is that of setting out the material to be browsed in an easily accessible form, which includes preparing the material and giving guidance during browsing.

Browsing is essentially visual and therefore has a strong “direct access” feature. It can be associated with “shapes” and patterns, both in pictures and in the distribution of text on the page or on the VDU screen.
There is a wide variety of browsable items now available on computer systems: for example, books, files of text, indexes, catalogues, data bases, and knowledge bases. Library catalogue browsing has been discussed by Palay and Fox [3]. For online documents, a substantial amount of intelligent processing is needed to produce the actions that assist a human browser. A description of a document’s contents at a high level forms a framework to guide the user to useful areas. More detailed structure relates to paragraphs and sections within the document, and a summary or abstract gives the overall sense of its contents.

THE PROTOTYPE BROWSER EYEBROWS

The principal features required for browsing are structure, navigation, and semantics. Figure 1 illustrates a decomposition of these features in a top-down design manner. Each lower box has an action heading and a body containing more detail or a specific application. This decomposition is useful for discussion of browser functionality, but the most important features are not completely independent of each other.

Structure

Structure is related to the type of material being examined. It may have a library catalogue (treelike) structure or be organized into chapters and sections. Structure allows a document to be appraised at the top level before a more in-depth analysis of content is carried out.

Navigation

Navigation assists in identifying where the user has been and what directions the search may take in the future. It is a combination of physical position in the document (e.g., in the first page), position in terms of the document’s structure (e.g., at the end of the introduction), and semantic content (e.g., the place the user has reached in the theme of the document).

Fig. 1. Elements of browsing. [Figure not reproduced; legible labels include overview, sentences, paragraph, features, history, stemming, and word meanings.]


Semantics

Semantics relates to the underlying themes or propositions of the document. An abstract or summary is a concentrated view of these items. The semantics is not necessarily related in any strong way to the structure or navigational aspects of the document, but it is helpful if there is some relationship.

A prototype browsing system (EYEBROWS) was designed to act as a test environment to develop ideas and techniques and to highlight problems in the browsing area. Aims considered during the design of the system included its simplicity of use and the capacity to browse the text without prior knowledge of its content. In Figure 1, the unshaded lower boxes labeled “movement,” “items,” “overview,” “local,” and “stemming” represent the functions provided in EYEBROWS.

Overview

One major design goal was the provision of an overview of the text to the user in the form of sentences extracted from paragraphs. The purpose of this was to outline the contents of the text and give an idea of its structure and theme. The facility was provided by using a simple method of selecting “important” words within the text and presenting the user with the sentences in which the words occurred, together with the first and last sentences of the paragraphs containing these sentences. An important word is one that occurs frequently throughout the text but is not a common connective word (is, as, not, the, etc.). The response to the first call of the overview facility is an overview comprising three paragraph extracts. Each later call returns one paragraph extract, drawn from different paragraphs and created using alternative important words. Actual words from the text with which the user could commence her or his browse are provided as part of this facility.
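The overview method described above can be sketched in a few lines. This is a minimal Python illustration under stated assumptions, not the EYEBROWS implementation (which was written in Common LISP): the function names, the connective list, and the sentence-splitting rule are all hypothetical, and stemming is left out for brevity.

```python
import re
from collections import Counter

# Assumed connective list; the paper gives only "is, as, not, the, etc."
CONNECTIVES = {"is", "as", "not", "the", "a", "an", "and", "of", "to", "in", "it"}


def split_sentences(paragraph):
    """Split on whitespace that follows ., ! or ? (a simplifying assumption)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def words_of(text):
    """Lower-case alphabetic words of a piece of text."""
    return re.findall(r"[a-z]+", text.lower())


def important_words(paragraphs, count=3):
    """Most frequent words in the whole text, ignoring common connectives."""
    freq = Counter(
        w for p in paragraphs for w in words_of(p) if w not in CONNECTIVES
    )
    return [w for w, _ in freq.most_common(count)]


def paragraph_extract(paragraph, word):
    """Sentences containing `word`, plus the paragraph's first and last sentences.

    Returns None when the word does not occur in the paragraph.
    """
    sentences = split_sentences(paragraph)
    hits = [i for i, s in enumerate(sentences) if word in words_of(s)]
    if not hits:
        return None
    keep = sorted({0, len(sentences) - 1, *hits})
    return [sentences[i] for i in keep]


def overview(paragraphs, word):
    """One extract per paragraph in which the important word occurs."""
    extracts = (paragraph_extract(p, word) for p in paragraphs)
    return [e for e in extracts if e is not None]
```

Calling `overview(paragraphs, w)` for each word returned by `important_words` gives one candidate extract per paragraph. How EYEBROWS chose which three extracts to show first is not stated in the text, so that selection is left to the caller.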

Word association

It is hypothesized that for the majority of words the strongest context is given by the phrase in which the word is embedded, with lesser contributions from the sentence and surrounding paragraph. After the user has seen the overview, she or he may wish to select a word on which to browse. On selecting such a word, the user is given a list of words that are associated with the browse word; the list comprises the nearest neighbors (except common connectives) of each appearance of the word. This helps the user to see the word in its correct context and gives more insight into the contents of the text. Several metrics were considered for word association: for example, alphabetic distances, special syntactic positions to the left or right of the selected word, nearest neighbors when common connective words are excluded, and themes based on semantic propositions. A purely random selection from words used in the text may also be of value to the browser. Using nearest neighbors proved a fast and useful option, so this was adopted for use during the evaluation.
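The nearest-neighbor metric adopted for the evaluation might look like the following Python sketch. The function name, the connective list, the restriction of neighbors to the same sentence, and the ranking by frequency are assumptions made for illustration; the text says only that the list comprises the nearest neighbors of each appearance of the word once common connectives are excluded.

```python
import re
from collections import Counter

# Assumed connective list; the paper gives only "is, as, not, the, etc."
CONNECTIVES = {"is", "as", "not", "the", "a", "an", "and", "of", "to", "in", "it"}


def associated_words(sentences, browse_word, connectives=CONNECTIVES):
    """Words adjacent to each occurrence of `browse_word`, connectives removed.

    Neighbors are taken within a sentence and returned most frequent first.
    """
    neighbours = Counter()
    for sentence in sentences:
        # Drop connectives first so that the nearest content words become adjacent.
        content = [
            w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in connectives
        ]
        for i, word in enumerate(content):
            if word != browse_word:
                continue
            for j in (i - 1, i + 1):
                if 0 <= j < len(content) and content[j] != browse_word:
                    neighbours[content[j]] += 1
    return [w for w, _ in neighbours.most_common()]
```

Removing the connectives before looking left and right is what makes “error” and “fatal” neighbors in a sentence such as “An error is not fatal.”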

Narrow/widen search

A further design goal was to allow the user to narrow or widen her or his search at any stage by adding to or deleting from the keyword list or starting a new keyword list.
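Narrowing and widening can be illustrated with a keyword list whose members must all appear in a matching sentence, so that adding a key narrows the result and deleting one widens it. The Python sketch below is an illustration under assumptions: the helper names are hypothetical, matching is on whole lower-cased words, and sentences are numbered from 1.

```python
import re


def sentences_with_all(sentences, keys):
    """Numbers (from 1) of the sentences that contain every key."""
    if not keys:
        return []
    wanted = {k.lower() for k in keys}
    return [
        number
        for number, sentence in enumerate(sentences, start=1)
        if wanted <= set(re.findall(r"[a-z]+", sentence.lower()))
    ]


def narrow(keys, word):
    """Return a new key list with `word` added (narrows the search)."""
    return keys if word in keys else keys + [word]


def widen(keys, word):
    """Return a new key list with `word` deleted (widens the search)."""
    return [k for k in keys if k != word]
```

Starting a new keyword list is then simply a matter of passing a fresh list to `sentences_with_all`.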

Backtracking

A facility to backtrack was also provided, enabling the user to return to a stage of the browse reached previously, analogous to leaving a bookmark in a printed text.

Unstructured text was considered as the target material for EYEBROWS. It did not contain any chapter and section headings, layout, contents, index, abstract, or summary. Some preprocessing of the text is necessary before it can be browsed. Complete sentences are taken as the basic data structures, together with an inverted index of words pointing to the sentences. A stemming algorithm [4] is used so that words with identical stems but different endings can be matched. For example, if the user is browsing on the word “compiling,” he or she will receive feedback concerning any occurrences within the text of the words compile, compiled, compiling, and compilation. Similarly, word frequency measures are based on the occurrence of the word stem, independent of ending. Property lists are set up to store information concerning the positioning and contents of the text’s sentences.

EYEBROWS is written in Common LISP and runs on a VAX 11/780 at the University of Liverpool. It is a command-driven system, allowing the user to browse a document by selecting from the commands shown in Table 1. On commencing a browsing session, the user is automatically given an overview of the text comprising “segments” of paragraphs, themselves containing “special terms” used within the text and words with which the user may wish to start her or his browsing session. The user can browse on a word or phrase of interest and can add words to and delete words from the current search list by using the four word-oriented commands, which perform string matching. The current search words (keys) are displayed, but if these are forgotten during individual sentence examination the “keys” command will redisplay them. The user can also request further overviews if he or she requires more details concerning text content.

Navigation includes physical positional location, which is established after a sentence has been selected using the “S N” command. Once a position has been established, the simple “>” and “<” commands allow local movement by scrolling forwards or backwards through sentences local to the one selected. A return to the last sentence selected by an “S” command is available using the “back” command. When new words are searched for, the positional information is lost so that a global view of the text is maintained.
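The preprocessing step, sentences as the basic unit plus an inverted index from word stems to sentences, can be sketched as follows. This Python illustration makes several assumptions: `crude_stem` is a deliberately naive suffix stripper standing in for the stemming algorithm of reference [4], not a reimplementation of it, and the function names and sentence-splitting rule are hypothetical.

```python
import re
from collections import defaultdict

# Naive suffix list, longest first. A stand-in for a real stemmer, not Porter's rules.
SUFFIXES = ("ation", "ing", "ed", "es", "e", "s")


def crude_stem(word):
    """Strip one common suffix, keeping at least three letters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def split_sentences(text):
    """Split on whitespace that follows ., ! or ? (a simplifying assumption)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_index(sentences):
    """Inverted index: stem -> set of sentence numbers (counted from 1)."""
    index = defaultdict(set)
    for number, sentence in enumerate(sentences, start=1):
        for word in re.findall(r"[A-Za-z]+", sentence):
            index[crude_stem(word)].add(number)
    return index


def find(index, word):
    """Sentence numbers whose words share a stem with `word`."""
    return sorted(index.get(crude_stem(word), set()))
```

Because both the indexed words and the query pass through the same stemmer, a search on “compiling” also reaches sentences containing compile, compiled, and compilation, which is the behavior described above. The length of each index entry gives a stem-based frequency count.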

Table 1. The commands available in the EYEBROWS system

Command                          Action
<word>                           Finds all occurrences of <word> within the text and lists these occurrences and the words “associated” with them.
+<word1>                         Displays all sentences (and sentence references) containing existing search words plus the new word <word1>.
-<word2>                         Displays all sentences (and sentence references) containing existing search words excluding the word <word2>.
(<word3> <word4> ... <wordN>)    Displays all sentences (and sentence references) containing all words of the list.
keys                             Prints a list of all words currently being searched on.
S <N>                            Prints sentence <N>.
>                                Prints the next sentence.
<                                Prints the previous sentence.
back                             Prints the original sentence selected before a series of > and < commands.
*                                Gives an overview of the “content” of the text.

EVALUATING THE SYSTEM

EYEBROWS was not intended to be used as a finished piece of software. It acted as a test environment to develop ideas and techniques. A test was devised to see how acceptable this system was to general users, and whether differences in browsing activity could be detected. Thirteen people from various backgrounds were invited to use EYEBROWS to examine a small range of texts, and conclusions were drawn from their comments together with logs of the browsing sessions.

Two different texts were studied, one a relatively unstructured file comprising a short story, the other a slightly structured file containing factual information. The story was a lighthearted account of a mystery at a party and described the adventures of a group of people. Many of the users appeared as characters in the story, which did not have a strong plot line but did contain a series of events that focused on two of the characters. Users were not given any specific directives when browsing over the party text. Yet the fact that it contained a mystery and its solution, and that each user had a role in the story (since the names of all users were purposely introduced into the story), gave each user at least a vague goal when moving through the story. In this situation the user had unstatable goals when moving through the text.

The other text was extracted from HELP files concerning the VAX VMS commands, and it contained a list of major commands in alphabetic order. Each command description was designed to be completely displayed on the VDU screen and contained the command format and a description of its action. A series of seven questions faced the users when browsing over this text. The questions were vague and required the users to move over various sections of the text in order to find the answers. This type of browsing could be compared with the situation where the user has an imprecise query.

Statistics on the use of the commands issued by the users were calculated. The browsing sessions on both of the files were split into three parts: start, middle, and end (Fig. 2). On studying these it seemed that for both of the files there was a particular strategy used to commence the browse. The initial command was invariably a word search, no doubt inspired by the compulsory overview given as soon as the session is started. The search on the more structured file then took the form of a combination of “word” and “S” commands, whereas that on the unstructured file comprised word searches and overview commands. The actions taken during the middle and end parts of the sessions were dissimilar for the two files.
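The kind of per-section command statistics described here could be computed from a session log roughly as follows. This Python sketch is an assumption-laden illustration: the text does not say how the start, middle, and end parts were delimited, so equal thirds by command count are assumed, and the function names and the short log shown in the comment are invented for the example rather than taken from the study.

```python
from collections import Counter


def split_in_three(log):
    """Split a list of commands into start, middle, and end parts.

    Assumes equal thirds by command count; any remainder goes to later parts.
    """
    n = len(log)
    first_cut, second_cut = n // 3, (2 * n) // 3
    return log[:first_cut], log[first_cut:second_cut], log[second_cut:]


def command_percentages(commands):
    """Percentage share of each command kind within one part of a session."""
    counts = Counter(commands)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {command: 100.0 * k / total for command, k in counts.items()}


# Invented example log, using the command names of Table 1:
# ["word", "word", "S", ">", ">", ">"]
```

Applying `command_percentages` to each of the three parts, separately for each text, yields the sort of comparison between session phases that the following discussion draws on.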
It appeared that during the middle and end of the browse on the structured file, the user issued commands that suggested she or he had a particular goal in mind. Yet during this same period, the user browsing the unstructured file seemed to be using a more random activity, perhaps indicating that he or she had only a vague goal, or maybe no goal, in mind. During the latter part of the browse, the user searching the unstructured file often carried out the activity of scrolling back and forth within the text, while the browser of the structured text was more concerned with searching for words and viewing the sentences containing them.

Fig. 2. Frequency of EYEBROWS commands. [Chart not reproduced; legible labels: Story; First Section, Middle Section, End Section.]

The overall percentage of each instruction used was also determined. The most popular command issued on the structured file was the “word” command. This was usually followed by a multiple word (string) search, or by a move to the sentence(s) containing the word. The “>” command was most frequently used when browsing the unstructured file. This was probably because the file had little structure and was itself a story, causing the user to scroll through its contents and read them linearly in order to understand them fully. The overview command was also used much more often with the story file. This too may be due to the differing structure of the files. With the story file, the user may issue an overview command when in difficulty or when needing a completely new direction in which to browse. The user appeared to have more direction when searching the structured file and therefore required little prompting from the system and was less prone to becoming lost.

It was found that, in general, a more directed search was carried out on the structured file, where goals were clearer, than on the unstructured file, where a more random activity was used to browse. The users were satisfied with EYEBROWS and supported the importance of the overview, word association, and navigation facilities. The users used EYEBROWS successfully in answering the set questions and in solving the mystery of the story.

INTERPRETATION

Following a survey of the literature on browsing, Cove and Walsh [1] suggested that browsing can be divided into three broad categories:

1. Search browsing, which is a closely directed and structured activity where the desired product or goal is known.
2. General purpose browsing, which is an activity that consults specified sources on a regular basis because it is highly probable the sources contain items of interest.
3. Serendipity browsing, which is a purely random, unstructured, and undirected activity.

We can see that the above experiment encouraged users to use different patterns of browsing when interacting with the EYEBROWS system. The patterns that were demonstrated by the same users at the different tasks fell into two groups. It appears that the browsing activity carried out on the HELP files was indeed search browsing; the user no doubt had a specific intent or goal but no knowledge of how to achieve it. The browse on the story file represented general purpose browsing: the user was examining the source in an unstructured way in the hope of discovering useful information. The suggested function of browsing systems introduced here seems to be borne out with the use of the EYEBROWS system. It appears worthwhile to pursue and refine the following:

- Structure
- Navigation
- Nature of current goals
- User strategies

CONCLUSION

The uniqueness of EYEBROWS lies in its emphasis on word association and in its ability to allow the user to browse through an unfamiliar text, giving the user an insight into its structure via overviews. It appears that feedback comprising overviews and associated words together with sentences does spark off new trails for the user to follow, providing her or him with fresh words and new directions with which to browse. Consequently, EYEBROWS is the subject of further research. The functional features of EYEBROWS are algorithmic in design; further developments indicate a heuristic approach. Therefore, future interests lie in the expansion of the system to include inductive techniques. The aim is the construction of a more intelligent processing component to provide expanded capabilities for content extraction and navigation facilities within documents.

REFERENCES

1. Cove, J.F.; Walsh, B.C. A taxonomy of browsing. Working Paper 85/2, Liverpool, England: Computer Science Department, University of Liverpool; April 1985.
2. Hildreth, C. The concept and mechanics of browsing in an online library catalog. Proceedings of the National Online Meeting 1982: 18-196.
3. Palay, A.J.; Fox, M.S. Browsing through databases. In: Oddy, R.N. et al., eds. London: Butterworths; 1981.
4. Porter, M.F. An algorithm for suffix stripping. Program 14(3):130-137; July 1980.
5. Sloman, A. Intelligent browsers. Workshop on tools for intelligent front ends, July 1984.