Storage and retrieval of structured documents

Information Processing & Management, Vol. 26, No. 2, pp. 197-206, 1990. Printed in Great Britain. 0306-4573/90 $3.00 + .00. Copyright © 1990 Pergamon Press plc.



STORAGE AND RETRIEVAL OF STRUCTURED DOCUMENTS*

IAN A. MACLEOD
Department of Computing and Information Science, Queen's University, Kingston, Ontario K7L 3N6, Canada

(Received 29 August 1988; accepted in final form 15 March 1989)

Abstract - There have been a number of important document-related activities which suggest the need for a new model for text. ISO standards for document description have recently been developed. These standards view documents as hierarchical objects, and it is likely that languages such as SGML will become widely used in the near future for document markup. As structured documents become available, there will be a need to evolve tools that take advantage of structural knowledge. The goal of the work described here is to develop such tools. A conceptual model for bibliographic data has been designed. The model is known as Maestro (Management Environment for Structured Text Retrieval and Organization). It supports structured documents and provides a query language to retrieve and link information contained in these structures. In this paper, an overview of Maestro is presented together with an outline of the basic implementation strategy.

1. INTRODUCTION

The relational model was, and continues to be, extremely important in the advancement of database understanding. The model is simple, powerful, easy to use and appropriate for a wide range of applications. There is a need for an analogous model for document management. The model of documents supported by text retrieval systems such as STAIRS [1] is a simple one. A document is defined as having content and attributes. The content is represented, at least conceptually, as a list of words, each of which can be looked up in an index. As well as the words, the index contains, for each word, a list of the documents containing it and the positions of the word within each document. While this is undoubtedly a simple model, it is certainly not a powerful one, either for retrieval operations or as a basis for applications other than pure retrieval. The purpose of the work described in this paper is to develop a more flexible and powerful model for document management than currently exists.
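The index described above can be illustrated with a minimal sketch. It is not the STAIRS implementation; the function name `build_index`, the whitespace tokenization and the two sample documents are assumptions made only for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to {doc_id: [positions]}, with positions counted from 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split(), start=1):
            index[word].setdefault(doc_id, []).append(pos)
    return index

# Two invented documents, used only to exercise the sketch.
docs = {
    "d1": "structured documents need structured retrieval",
    "d2": "retrieval of plain text",
}
index = build_index(docs)

# Looking up one word gives the documents containing it and its positions.
postings = index["structured"]

# A conjunctive query intersects the document lists of its search terms.
both = sorted(set(index["retrieval"]) & set(index["structured"]))
```

Everything such a model knows about a document is contained in these word lists, which is why retrieval cannot be qualified by any structure finer than the document itself.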

1.1 Traditional text retrieval system

Systems to assist in the retrieval of documents have existed for some time. Typically, these operate by assigning index terms to the document. This can be done either manually or automatically, using the actual words of the document as indexes. Documents are retrieved on the basis of the correspondence between search terms expressed in a query and index terms of the document. Usually documents are held in collections, and queries are directed towards specific collections.

These systems view a document as being a set of words and a collection as an unstructured set of documents. A very restrictive view of text is supported, namely as a string of "words" with the concept of a word being built into the system, except that some systems, STAIRS for example, allow the text to be partitioned into sections known as paragraphs. Paragraphs cannot themselves be partitioned. This facility permits a search to be limited to some part of the document. Typically, such systems also maintain information on word location and word frequency. This allows retrieval based on relative position as well as the construction of weighting algorithms based on word frequencies. Basically, the only capability that these types of systems provide is that of iterative retrieval, together with a simple update capability usually limited to document addition and deletion.

These systems differentiate between content and attributes. How document information is distributed between these is up to the individual database designer. However, in a typical usage the content would be the text of the document and the attributes would hold information relating to the physical instantiation of the document. Typical attributes might be the name of the publication containing the document (a journal name for example), the date of publication and so on.

*This work has been partially supported by the Natural Sciences and Engineering Research Council of Canada under Grant No. A2398.

1.2 Current trends

This view of text as being a sequence of words, and of collections as being simply a named group of physically similar documents, is one that is being superseded by current trends, particularly in international standards. Two sets of standards which are pertinent to text are the ODA and SGML standards. There is a certain amount of overlap in the areas covered by these standards and it is not the purpose of this paper to provide a detailed discussion of their merits. Two of the most important aspects of these standards relate to the organization of document collections and to the structuring of information within documents. These are discussed in the ODA DFR proposals [2] and the SGML standards [3], respectively. These standards have wide ranging implications for document organization strategies and for retrieval.

The DFR proposal relates to the storage of document collections. It sees collections as having a hierarchical structure of the type exemplified by the filing cabinet structures found in offices. Instead of being held in a collection of other documents of the same type, a document might be contained in a file, which might be a subfile of some other file and so on.
A typical mode of behaviour in this type of environment is where a user might wish to browse through a hierarchy of named file folders.

The SGML standards, on the other hand, deal with document content. They view documents as a hierarchic structure. So instead of a paper being an ordered list of words, it might be considered as having a title; information about authors, such as their names and addresses; a set of sections each of which may in turn be subsectioned; each subsection may be made up of paragraphs and/or figures; paragraphs contain sentences; sentences contain words. It is easy to foresee circumstances where a user might, for example, wish to retrieve papers containing a certain citation.

Another emerging area of interest is hypertext. The sudden explosion of activity in this area might be viewed with a certain amount of skepticism. However, the concepts of hypertext are basically quite simple and useful. The most popular view of a hyperdocument is as a collection of linkages between nodes in a hierarchically represented set of documents [4]. An example of a very simple and potentially very useful hypertext application in document retrieval would be one where citations within a document could be used to display the cited documents, or where a document could be used to locate all documents cited by it from within the collection.

What these trends seem to indicate is that a more complex view of document content is required, together with the ability to link together arbitrary groups of documents in a variety of ways. Existing document retrieval systems do not have the flexibility to handle these extended views of documents and their organization, nor do existing database management systems.

1.3 Related work

The limitations of STAIRS-like systems have been apparent for some time [5]. Considerable effort has been expended on developing more powerful tools.
This effort has largely been centered on the integration of text handling capabilities into existing database models. Examples of the work here include Macleod's proposed extensions to SQL [6], the modifications to INGRES for text handling [7], and Lynch and Stonebraker's proposed modifications to the INGRES system to handle document indexes [8]. While this idea of modifying or extending an existing data model to handle text does address some of the problems of STAIRS, most notably its lack of flexibility, it is by no means an ideal solution. Text has special characteristics which do not allow it to be handled easily by the conventional database operators. Most importantly, text has structure, and this structure cannot be easily captured by the record based data models in current use.

This becomes apparent in trying to map even a simple structure into such a model. Take the structure supported by STAIRS. A STAIRS document can consist of attributes plus text. The text can be considered as a fixed sequence of "paragraphs," where a paragraph is a logical component of the document such as an abstract. If this structure were to be generalized slightly to allow multiple paragraphs of the same type, then the representation of this document in a record based model, such as the relational one, becomes quite complex. As is illustrated in Fig. 1, a document consisting of attributes and three different paragraph types would require at least four separate relations for its representation. There would also need to be additional structures to allow for the fact that retrieval qualified by the paragraph type is optional in STAIRS-like systems. Further, retrieval would require the specification of which tuples (or records) in the various tables have to be brought together to recreate the original document. That is, the basic problem is that the user must be aware of how the document has been decomposed and must also have the ability to reassemble it.

[Figure: a document made up of attributes and paragraphs of types Para 1, Para 2 and Para 3, together with the corresponding relations keyed by Document Id.]

Fig. 1. A 1-level hierarchic document and its relational representation.
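The decomposition problem can be made concrete with a small sketch. It uses SQLite through Python's standard `sqlite3` module purely as a stand-in for a relational system; the table names (`document`, `para1`, `para2`, `para3`), the column names and the sample rows are invented for illustration and do not come from the paper.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# One relation for the attributes plus one per paragraph type: four in all.
con.executescript("""
    CREATE TABLE document (doc_id INTEGER PRIMARY KEY, date TEXT);
    CREATE TABLE para1 (doc_id INTEGER, seq INTEGER, body TEXT);
    CREATE TABLE para2 (doc_id INTEGER, seq INTEGER, body TEXT);
    CREATE TABLE para3 (doc_id INTEGER, seq INTEGER, body TEXT);
""")
con.execute("INSERT INTO document VALUES (1, '1990')")
con.execute("INSERT INTO para1 VALUES (1, 1, 'An abstract.')")
con.executemany(
    "INSERT INTO para2 VALUES (?, ?, ?)",
    [(1, 1, "First body paragraph."), (1, 2, "Second body paragraph.")],
)
con.execute("INSERT INTO para3 VALUES (1, 1, 'A reference.')")

def reassemble(doc_id):
    """Rebuild one document; the caller has to know every table it was split into."""
    (date,) = con.execute(
        "SELECT date FROM document WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    parts = {}
    for table in ("para1", "para2", "para3"):
        rows = con.execute(
            f"SELECT body FROM {table} WHERE doc_id = ? ORDER BY seq", (doc_id,)
        )
        parts[table] = [body for (body,) in rows]
    return {"date": date, **parts}

doc = reassemble(1)
```

The list of table names hard-coded in `reassemble` is exactly the knowledge of the decomposition that the user of a record based model is forced to carry.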


The limitations of the record based systems are also apparent for many applications besides document management [9]. This has led to the development of a new class of database models, generally known as the conceptual models [10], and also to the evolution of an overlapping set of models known as object oriented models [11]. (Some authors tend to use the terms "conceptual" and "object oriented" interchangeably.) What all these approaches have in common is the ability to provide a natural representation of the real world objects being modelled. In some ways, the early text retrieval systems can be regarded as primitive conceptual models. They are conceptual in the sense that their basic data type is a document, and primitive in that they permit a very restricted representation of document structures and a limited set of operations on them. Attempts to extend traditional data models to handle text run counter to the current trends in database research towards providing capabilities to handle a richer set of data objects than is currently used.

Efforts have been made to develop data models more appropriate for document representation. Both Schek and Pistor [12] and Macleod [13] have suggested approaches based on hierarchies. However, in both cases the underlying storage model is that of the relational model, and these models cannot easily provide the straightforward text retrieval operations of the STAIRS type of model. More interesting is the work of Gonnet and Tompa [14]. This illustrates the need for a structural retrieval capability within the framework of the Oxford English Dictionary Project. Their solution to the lack of suitable tools for this type of retrieval has been the development of a procedural language capable of processing text whose structure is defined by a context free grammar. This language is an extension of the well known Maple language.
One result of their work has been to illustrate that even short documents, such as entries in the Oxford English Dictionary, have considerable structure. Also of interest is the MULTOS project [15], an application aimed at the development of an environment for multi-media storage and retrieval. Rather than use or extend an existing data model, a new model has been developed which permits the structure and content to be explicitly represented. In effect, a specialized object oriented model has been constructed.

It seems apparent that any future models for document management should continue to be based on the conceptual modelling and object oriented philosophy rather than on extensions to conventional database management systems. The approach taken in this work is to work towards a new model based on an extension of the traditional text storage and retrieval model. While the STAIRS framework has its limitations, it seems desirable to investigate the possibility of developing more sophisticated models built on top of these capabilities rather than start completely afresh. Thus progress towards a new model is being made in an evolutionary manner, and the initial aims of the project are to develop a prototype storage structure suitable for structured text and a general query language which can take advantage of the structure to perform context based retrieval.

The work described here is viewed as being evolutionary rather than revolutionary, and it is not intended to develop the ultimate model for handling textual data (whatever that might be). As was noted earlier, STAIRS and similar systems view documents as being made up of two distinct types of components, namely content and attributes. This view is consistent with that of the ISO DFR proposals and it is not proposed that it should be radically changed. However, the DFR proposals are based on the assumption that links to documents can be constructed. Hypertext applications also require interdocument linkages. As is further discussed below, it seems apparent that such linkages can be accommodated within the base model by storing them as attributes.

The STAIRS perception of content is a simplistic one. On the other hand, SGML does provide a very powerful tool for describing document structure, and what seems to be required is a technique for representing such structures within a text retrieval system. What is being proposed here is a model which brings together the major concepts inherent in the DFR and SGML proposals without losing sight of the fact that STAIRS-like systems, conceptually simple though they may be, must inevitably influence future designs. However, it must be recognized that the query languages associated with current text systems are necessarily extremely limited, and that any serious new model will need to provide a query language facility that is much more flexible and powerful than any currently available.

The system presented in this work is known as Maestro (Management Environment for Structured Text Retrieval and Organization). It has two components, a definitional facility and a high level language for handling queries and updates. An overview of these is given in the following section. An outline of the implementation strategy is then presented and a report on the current status of the project is given.

2. DEFINITION AND REPRESENTATION OF STRUCTURED DOCUMENTS

In the definitional component of Maestro, the objective is to provide a description of a document which can then be used to obtain the structured representation of a document as produced by an SGML compiler. The actual compiler being used is a commercially available one known as XGML [16].

SGML is a markup language. Inserted into the text of a document are markup tags which show the structure of the document. The allowable set of document structures for a particular collection of documents is defined by a document type definition. A document type definition is similar to, though usually somewhat simpler than, a grammar for a formal language. The notation used in SGML, and in Maestro, is basically a modified form of BNF. For example, suppose there exists a set of documents with titles where the document is made up of sections. A section has a heading and may be made up of either a sequence of paragraphs or a sequence of titled subsections which are themselves composed of paragraphs. Paragraphs consist of a sequence of sentences. This informal description is formalized in Maestro as shown in Fig. 2. This is similar to the notation which would be used in an SGML definition, except that the actual SGML description would almost certainly contain significantly more detail. Figure 3 shows an example of a document conforming to this description. The unformatted version of this document might be marked up in SGML (using the OMITTAG feature to reduce markup), as shown in Fig. 4.

The primary motivation for the existence of markup languages such as SGML is of course to define a structure for the document which is independent of any particular output device. It is supposed that the canonical representation of the document produced by an SGML parser can be further processed by device dependent drivers to produce a suitable document representation for any particular output device. This in itself would obviously be of considerable benefit in a document retrieval system.
The same internal document representation could be used to display output suitable for a simple terminal, a bit mapped display, a lineprinter, a laser printer and so on. More importantly, for some applications the fact that a structured representation of the document is being maintained allows more context sensitive retrieval operations to be defined.

In Maestro, document input is performed using the XGML toolkit [17]. This enables the output of the parser to be processed, stripping out unnecessary (for this application) detail, in order to build a structure suitable for retrieval purposes. For example, the structure produced from parsing the example document would be the tree shown in Fig. 5.

In addition to content, a document will normally have attributes. The document definition has two components, one listing the document attributes and the other, the structure of the text. For example, suppose the class of documents to which the example document belongs also has the attributes "date" and "keywords." In the current syntax, this class of documents would be defined as follows:

MyCollection is collection Content with Date, *Keywords

The "*" symbol denotes "Keywords" as being a list of values. All attributes are assumed to consist of text.

Content is Title, *Section
Section is SectionHeading, (*SubSection | *Paragraph)
SubSection is SubSectionHeading, *Paragraph
Paragraph is *Sentence

Fig. 2. Document definition.

Example Document

1. First Heading
First sentence. Second sentence.
Third sentence.

2. Second Heading
2.1 First SubHeading
Fourth sentence.
2.2 Second SubHeading
Fifth sentence.
Sixth sentence. Seventh sentence.

3. Third Heading
3.1 Third SubHeading
Eighth sentence.

Fig. 3. Example document.

Fig. 4. Marked-up document.
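As a rough illustration of how a definition like that of Fig. 2 might be held in a program, the sketch below transcribes it into a Python dictionary and enumerates the ways one element can be nested inside another. The dictionary layout and the helper names `children` and `paths` are assumptions made for this sketch, not part of Maestro.

```python
# Fig. 2 transcribed as data: element -> alternative child sequences.
# A leading "*" marks a repeated (list) element, as in the Maestro notation.
DEFINITION = {
    "Content": [["Title", "*Section"]],
    "Section": [["SectionHeading", "*SubSection"],
                ["SectionHeading", "*Paragraph"]],
    "SubSection": [["SubSectionHeading", "*Paragraph"]],
    "Paragraph": [["*Sentence"]],
}

def children(element):
    """Distinct child element names of `element`, with the '*' marker removed."""
    names = []
    for alternative in DEFINITION.get(element, []):
        for child in alternative:
            name = child.lstrip("*")
            if name not in names:
                names.append(name)
    return names

def paths(ancestor, descendant):
    """All downward paths from `ancestor` to `descendant`.

    The recursion ends because this definition is not recursive.
    """
    found = []
    for child in children(ancestor):
        if child == descendant:
            found.append([ancestor, child])
        for tail in paths(child, descendant):
            found.append([ancestor] + tail)
    return found

section_to_paragraph = paths("Section", "Paragraph")
```

Here `section_to_paragraph` holds two paths, one directly from a section and one through a subsection, which shows that a single element name can be reached in more than one way within the same definition.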

Content
    Title: Example Document
    Section
        SectionHeading: First Heading
        Paragraph
            Sentence: First sentence.
            Sentence: Second sentence.
        Paragraph
            Sentence: Third sentence.
    Section
        SectionHeading: Second Heading
        SubSection
            SubSectionHeading: First SubHeading
            Paragraph
                Sentence: Fourth sentence.
        SubSection
            SubSectionHeading: Second SubHeading
            Paragraph
                Sentence: Fifth sentence.
            Paragraph
                Sentence: Sixth sentence.
                Sentence: Seventh sentence.
    Section
        SectionHeading: Third Heading
        SubSection
            SubSectionHeading: Third SubHeading
            Paragraph
                Sentence: Eighth sentence.

Fig. 5. Document tree representation.

3. LANGUAGE CAPABILITIES

What has been described is of limited usefulness for retrieval (though not display) purposes unless an appropriate query language exists. While there has been some work in query languages for semantic models which addresses the question of structure [13], apart from the Multos project [15], there has been no significant work in the area of query languages for structured documents. The language described here has evolved from previously published work [19]. It allows the selection of arbitrary forests of subtrees from forests of documents. It also includes high level updating capabilities. This work is intended to validate our expectation that such a language can be efficiently implemented, and that it can be used in a "natural" way to specify a large class of queries more easily than can current standard languages.

The syntax of this language is influenced by both Nial, a language for processing heterogeneous arrays [20], and SQL, the most popular of the relational query languages. A long-term objective of this work is to develop a complete language for text processing applications. We would like to achieve this by integrating the language facilities described here with Nial. An outline of some of the language capabilities is given below. A more detailed description is given in [21].

3.1 Query language overview

A fairly complex example of a query taking advantage of the structure of the example document description is the following. Here subsections are being retrieved conditionally upon any paragraph of the subsection containing the word "database" and upon the word "retrieval" appearing in the heading of the section containing the subsection.

List gets Subsection (having any Paragraph where "database" in first Sentence)
    of Section where "retrieval" in Heading

This example illustrates some of the more salient features of the language. As in conventional text retrieval systems (and in contrast to relational systems), the result retrieved is not a copy of the information but rather a set of pointers to it. The term reference is used to denote such a retrieved pointer. Information can be retrieved both from collections and from previously retrieved reference lists. For example, the previous result can be used to select only those subsections where the first paragraph satisfies the condition.

NewList gets Subsection (having first Paragraph where "database" in first Sentence)
    of List
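The behaviour of these two queries can be mimicked over an in-memory tree. The sketch below is only an illustration of the semantics described in the text: the helper names (`node`, `kids_of`, `has_word`, `first_sentence_has`) and the sample document are invented, and nothing here reflects how Maestro evaluates queries internally.

```python
def node(kind, text="", *kids):
    """A document element: its type, its own text and its child elements."""
    return {"kind": kind, "text": text, "kids": list(kids)}

def kids_of(n, kind):
    return [k for k in n["kids"] if k["kind"] == kind]

def has_word(n, word):
    """True if `word` occurs in the text of `n` or of any element beneath it."""
    return word in n["text"].lower().split() or any(has_word(k, word) for k in n["kids"])

def first_sentence_has(paragraph, word):
    sentences = kids_of(paragraph, "Sentence")
    return bool(sentences) and has_word(sentences[0], word)

def subsections_any(content):
    """Subsections with ANY qualifying paragraph, in sections whose heading has 'retrieval'."""
    hits = []
    for section in kids_of(content, "Section"):
        heading = kids_of(section, "Heading")
        if not (heading and has_word(heading[0], "retrieval")):
            continue
        for sub in kids_of(section, "Subsection"):
            if any(first_sentence_has(p, "database") for p in kids_of(sub, "Paragraph")):
                hits.append(sub)          # the node itself, not a copy
    return hits

def refine_first(subsections):
    """Keep only subsections whose FIRST paragraph qualifies."""
    kept = []
    for sub in subsections:
        paragraphs = kids_of(sub, "Paragraph")
        if paragraphs and first_sentence_has(paragraphs[0], "database"):
            kept.append(sub)
    return kept

doc = node("Content", "",
    node("Section", "",
        node("Heading", "text retrieval"),
        node("Subsection", "",
            node("Paragraph", "", node("Sentence", "a database stores records")),
            node("Paragraph", "", node("Sentence", "indexes help"))),
        node("Subsection", "",
            node("Paragraph", "", node("Sentence", "files are simple")),
            node("Paragraph", "", node("Sentence", "a database is not")))))

result = subsections_any(doc)      # both subsections qualify
new_list = refine_first(result)    # only the first subsection remains
```

Because the lists hold the nodes themselves, the second selection works on the result of the first without copying any text, which loosely parallels retrieval from a previously retrieved reference list.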

Unlike conventional retrieval systems where only references to entire documents are retrieved, it is possible to reference any element within a document. In the above examples, the references will be to "SubSections." The only case where a reference is not retrieved is when the query asks for an attribute. In this case, the actual value of the attribute is retrieved. A document and its content have different references. A document reference allows access to the document attributes whereas a content reference excludes this information.

Where an element is a list, the number of elements required to satisfy a condition may be specified. This is called quantification. In the previous query, any is an example of a quantifier. Alternatives are all and notany. Further, both any and notany may be qualified with integer values to allow constructs such as "any 2."

It is also possible to refer to specific list elements. In the examples, "first Sentence" was used to denote a specific sentence. This is an alternative form of the more general "Sentence@(1)." In general, any list element can be qualified in this way with a list of integer values specifying particular elements of the list.

Where an element is a sub-element of another element which is a list, it is possible to restrict a condition to sub-elements of certain of the containing elements. In the example, subsections are only examined in the context of particular sections, namely only those having the word "retrieval" in the heading. In the same vein, where an element contains a list of sub-elements, it is possible to retrieve or examine an element conditionally upon some expression on one or more of the sub-elements. Here the operator having is effectively used to anchor a match at a particular element so that subsequent conditions are applied within the context of that element.

3.2 Building structures

Structures are built using references. Retrieved references may be stored as attribute values of other documents. For example, suppose the following collection was declared:

File is collection with FileName, *FileContents

In this example, the attribute "FileContents" will be used to contain a list of values which are references to other objects in the database. References can be obtained either by a retrieval operation or through the creation of a new document. A new instance of a collection can be built using the operator new. All attributes and content are initially null. The result of the operator is a reference to the instance. Values of an instance can be updated using the gets operator. In the following example, an instance of a "file folder" is created. The folder is named and objects are placed into it.

Ref gets new File

Once a document has been created, its attributes may be updated as in the following example:

FileName of Ref gets "Database File"

Now all the documents containing "database" in the title are retrieved.

List gets MyCollection where "database" in Keywords

These values can then be stored in the new "File."

FileContents of Ref gets List

Once a reference has been stored in this way, a referenced document can be retrieved in exactly the same way as if it were being retrieved from the original class. For example, the following statement locates all documents in the "Database File" with the word "text" in the title.


BList gets FileContents (where "text" in Title) of Files where FileName = "Database File"

Alternatively, if "Ref" is still available:

BList gets FileContents (where "text" in Title) of Ref

This example is a simple one but illustrates how structures may be built and retrieval performed within them. Obviously, for retrieval from potentially very complex documents such as hyperdocuments, where there will be both cross links and hierarchical links among the various components, there will be a need for tools to assist in navigation through the object. Such tools can be built as higher level aids on top of the query language.

4. IMPLEMENTATION ASPECTS

The system is being built on top of an existing system known as Ful/Text [22]. This is a full text retrieval system marketed by Fulcrum Technologies Inc. of Ottawa. It is conceptually similar to IBM's STAIRS system with a number of important differences. First, its functionality is available through a programmer interface. This means that specialized applications can be built on top of the basic system. Second, it allows on-line updates. Such updates are indexed by a separately maintained index which is merged with the main index at periodic intervals. Third, Ful/Text is available on a range of equipment and runs under both DOS and UNIX.

The text can be subdivided into what are called zones. A zone (the corresponding feature in STAIRS is called a paragraph) will normally correspond to a logical element of the text such as the abstract. When a document is being entered, the start of a zone is indicated by a control sequence followed by a valid zone number. The contents of a zone need not be contiguous.

Like STAIRS, Ful/Text has three main components. The first is the index; the second is the catalogue; and the third is the actual text. The index contains each word in the document collection and the corresponding catalogue id, the position of each word and the number of the zone containing it. Each document has a single catalogue entry. This contains attribute data, which need not be fixed length, and the location of the actual text.

The SGML compiler, XGML [16], is being used as the front end parser for document content. Maestro and XGML are integrated in the sense that both know about document definitions. When a document is being added to the database, it will be parsed by XGML to produce a preordered representation of the parse tree. An input procedure scans this tree, discarding detail not contained in the corresponding Maestro definition. It reconstructs the original document and builds the table showing the structure of the tree.
The flat document is then added and the abstract syntax tree generated by the parse is stored as a separate "document." A reference to an element of a document will be the catalogue id of the plain document plus the corresponding node number in the tree. Where a reference is to the entire document, the node number is zero.

The internal structure representing the abstract syntax tree for the document will be a table. Figure 6 shows the logical internal representation for the example document. (The text shown in this figure is included for illustrative purposes and is not actually present in the actual representation.) Each element of the table will contain:

1. an identification of the corresponding document element;
2. an index number for list items starting at origin 1 (a 0 indicates a non-list item);
3. an offset showing the start of the substructure;
4. the length of the substructure (redundant but convenient);
5. the location of the parent element.
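A table with the five fields listed above can be sketched as follows, together with a lookup that maps a word position to the elements enclosing it. This is an illustrative reconstruction, not the Maestro code: the names `build_table`, `path_at` and `LIST_ELEMENTS`, the tuple layout of the tree and the use of Python's `bisect` module are all assumptions, and offsets are counted in words from 1.

```python
from bisect import bisect_right

# Elements that are starred (repeated) in the document definition.
LIST_ELEMENTS = {"Section", "SubSection", "Paragraph", "Sentence"}

def build_table(tree):
    """Flatten (name, payload) nodes in preorder; payload is text or a list of children."""
    rows = []                                   # rows[i] describes node i + 1

    def visit(node, index, parent, offset):
        name, payload = node
        row = {"element": name, "index": index, "offset": offset,
               "length": 0, "parent": parent}
        rows.append(row)
        number = len(rows)                      # this node's number
        if isinstance(payload, str):
            row["length"] = len(payload.split())
        else:
            seen = {}
            for child in payload:
                seen[child[0]] = seen.get(child[0], 0) + 1
                child_index = seen[child[0]] if child[0] in LIST_ELEMENTS else 0
                row["length"] += visit(child, child_index, number,
                                       offset + row["length"])
        return row["length"]

    visit(tree, 0, 0, 1)
    return rows

def path_at(rows, position):
    """Innermost element at a word position, then its ancestors via parent links."""
    offsets = [r["offset"] for r in rows]       # already in ascending order
    number = bisect_right(offsets, position)    # last node starting at or before position
    parts = []
    while number > 1:                           # stop at the root, node 1
        row = rows[number - 1]
        label = row["element"]
        if row["index"]:
            label += f"[{row['index']}]"
        parts.append(label)
        number = row["parent"]
    return " of ".join(parts)

def S(text):
    return ("Sentence", text)

def P(*sentences):
    return ("Paragraph", list(sentences))

EXAMPLE = ("Content", [
    ("Title", "Example Document"),
    ("Section", [("SectionHeading", "First Heading"),
                 P(S("First Sentence"), S("Second Sentence")),
                 P(S("Third sentence"))]),
    ("Section", [("SectionHeading", "Second Heading"),
                 ("SubSection", [("SubSectionHeading", "First SubHeading"),
                                 P(S("Fourth Sentence"))]),
                 ("SubSection", [("SubSectionHeading", "Second SubHeading"),
                                 P(S("Fifth Sentence")),
                                 P(S("Sixth Sentence"), S("Seventh Sentence"))])]),
    ("Section", [("SectionHeading", "Third Heading"),
                 ("SubSection", [("SubSectionHeading", "Third SubHeading"),
                                 P(S("Eighth Sentence"))])]),
])

rows = build_table(EXAMPLE)
third_paths = [path_at(rows, p) for p in (9, 25, 27)]
```

The lookup relies on the rows being in preorder, so that among rows sharing an offset the deepest element comes last; a production version would also need to handle positions outside the document.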

Node  Element            Index  Offset  Length  Parent  Text
  1   Content              0       1      30       0
  2   Title                0       1       2       1    Example Document
  3   Section              1       3       8       1
  4   SectionHeading       0       3       2       3    First Heading
  5   Paragraph            1       5       4       3
  6   Sentence             1       5       2       5    First Sentence
  7   Sentence             2       7       2       5    Second Sentence
  8   Paragraph            2       9       2       3
  9   Sentence             1       9       2       8    Third sentence
 10   Section              2      11      14       1
 11   SectionHeading       0      11       2      10    Second Heading
 12   SubSection           1      13       4      10
 13   SubSectionHeading    0      13       2      12    First SubHeading
 14   Paragraph            1      15       2      12
 15   Sentence             1      15       2      14    Fourth Sentence
 16   SubSection           2      17       8      10
 17   SubSectionHeading    0      17       2      16    Second SubHeading
 18   Paragraph            1      19       2      16
 19   Sentence             1      19       2      18    Fifth Sentence
 20   Paragraph            2      21       4      16
 21   Sentence             1      21       2      20    Sixth Sentence
 22   Sentence             2      23       2      20    Seventh Sentence
 23   Section              3      25       6       1
 24   SectionHeading       0      25       2      23    Third Heading
 25   SubSection           1      27       4      23
 26   SubSectionHeading    0      27       2      25    Third SubHeading
 27   Paragraph            1      29       2      25
 28   Sentence             1      29       2      27    Eighth Sentence

Fig. 6. Internal tree representation.

In addition to the information available in the table, the inverted indexes for the plain documents are also available. By combining the information in the index and the table, it is possible to determine within which elements a word is located. For example, looking up the word "Third" would find that, for this document, the word is located at positions 9, 25 and 27. For each of these positions, the corresponding path is found by looking up the offset in the table and using the parent link to identify the path. The corresponding paths are:

Sentence[1] of Paragraph[2] of Section[1]
SectionHeading of Section[3]
SubSectionHeading of SubSection[1] of Section[3]

Current work includes the investigation of efficient techniques for looking up the offset in the table. Initially, since the table entries are ordered by offset, a binary search is being employed.

When a query is being processed, it will first be validated for semantic correctness against the definition. The literal information contained in the query is then used to retrieve potentially relevant documents. The final stage is to select from this set those documents whose structure corresponds to the structure implied by the query. Thus the following major steps are required in query processing:

1. The query is validated. This means that the query must be processed to determine that the document structure implicit in the query is a valid substructure of the document definition. So for example, the query:

a gets Content having Section having Paragraph where "x" in Sentence

implies that "Section" is an element of "Content"; that "Paragraph" is an element contained, possibly indirectly, in "Section"; and similarly, that "Sentence" is a possibly indirect element of "Paragraph." It must therefore be determined whether any paths implicit in the query are valid, and it is also necessary to identify multiple paths. This query may, for example, be ambiguous because "Paragraph" is an element that occurs more than once in the model for "Content."

2. The next stage is to identify all those documents containing a valid path. Note again that a query path may correspond to several document paths where an element is a repeated element.

3. The final stage is to extract, from the list of potential documents identified in (2), the documents containing the actual values specified in the query.

Say we had the following query:

a gets Subsection (having Paragraph where "Fifth" in Sentence)
    of Section where "Second" in SectionHeading

A simple strategy for processing this query is:

1. For each document containing the term “Second,” get the positions of the term inside the document;
2. For each position, check whether it corresponds to the position of a “SectionHeading” as found in the table;
3. Retrieve all documents containing “Fifth”;
4. Intersect this result with the previous;
5. For each remaining document, get the positions of “Fifth” within the document;
6. For each document, test if any position corresponds to a “Sentence” and, if it does, test to see if this occurs within the context of the section(s) previously located;
7. Finally, for each document, locate the appropriate subsections.

This strategy is an initial position and will need some refinement. For example, take the following query:

List gets Content where “data” in SectionHeading
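As a rough illustration of how the simple strategy handles a query of this form, the following Python sketch looks a word up in an inverted index, then uses a binary search over an offset-ordered structure table and the parent links to decide whether each occurrence lies in the required element. The table layout, the names `element_index`, `path` and `word_in_element`, and the three sample documents are all assumptions made for illustration; they are not Maestro's actual data structures.

```python
from bisect import bisect_right

# Hypothetical structure table, one per document. Entries are in document
# order, sorted by the word offset at which the element starts:
# (start offset, element name, index of parent entry or None).
# Assumption: all words lie in leaf elements and no leaf is empty.
HEADING_2_WORDS = [
    (0, "Content", None),
    (0, "Section", 0),
    (0, "SectionHeading", 1),
    (2, "Paragraph", 1),
    (2, "Sentence", 3),
]
HEADING_1_WORD = [
    (0, "Content", None),
    (0, "Section", 0),
    (0, "SectionHeading", 1),
    (1, "Paragraph", 1),
    (1, "Sentence", 3),
]
TABLES = {1: HEADING_2_WORDS, 2: HEADING_2_WORDS, 3: HEADING_1_WORD}

# Hypothetical inverted index: word -> {document id: word positions}.
POSTINGS = {"data": {1: [0, 5], 2: [2], 3: [2]}}


def element_index(table, position):
    """Binary-search for the last entry starting at or before position."""
    offsets = [offset for offset, _, _ in table]
    return bisect_right(offsets, position) - 1


def path(table, entry):
    """Follow parent links from an entry to the root, innermost first."""
    names = []
    while entry is not None:
        _, name, parent = table[entry]
        names.append(name)
        entry = parent
    return names


def word_in_element(postings, tables, word, element):
    """Return (ids of matching documents, number of documents examined)."""
    matches, examined = [], 0
    for doc_id, positions in postings.get(word, {}).items():
        examined += 1
        table = tables[doc_id]
        if any(element in path(table, element_index(table, p)) for p in positions):
            matches.append(doc_id)
    return matches, examined


matches, examined = word_in_element(POSTINGS, TABLES, "data", "SectionHeading")
print(matches, examined)
```

In this toy collection the word occurs in every document but in a section heading only once, so the number of documents examined exceeds the number of matches.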

Using the above strategy, all documents containing the word “data” are retrieved and these are then sequentially examined to find if the word occurs in the appropriate context. It is easy to visualize a database where “data” was an extremely common word so that while very few documents might satisfy the actual query, a very large number might be examined by the search algorithm.

5. STATUS AND SUMMARY

The work described here is currently in progress. At the present time, the system is only partially implemented. Facilities are available through an applications programmer interface and the query language exists only in an embryonic form. The updating features have not been fully implemented. In particular, no attempt has yet been made to ensure referential integrity. Experiments are continuing to reduce the amount of overhead involved in processing queries. As noted above, there is at least one situation where the current implementation is inefficient: a search for a word within a short element when the word is commonly used throughout the database. This situation can be handled by the zone construct of Ful/Text. This is a more general construct than the paragraph facility of STAIRS. Any piece of text can be placed in any zone and the contents of zones need not be contiguous within the document. The implementation is currently being modified so that each element type is associated with a single zone. Since zoning information is kept within the index, this should greatly increase the efficiency of some of the search operations.
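The effect of associating each element type with a zone can be sketched as follows. This is a hypothetical Python illustration, not the Ful/Text interface: the function `build_zone_index`, the `(word, zone)` key and the sample documents are assumptions. The point is only that, once the zone is part of the index key, a lookup returns the documents with the word in that element type without examining the others.

```python
from collections import defaultdict


def build_zone_index(documents):
    """Map (word, zone) to the set of documents with that word in that zone.

    documents: {doc_id: [(element_type, text), ...]}; each element type is
    treated as one zone (an assumption made for this sketch).
    """
    index = defaultdict(set)
    for doc_id, elements in documents.items():
        for element_type, text in elements:
            for word in text.lower().split():
                index[(word, element_type)].add(doc_id)
    return index


# Hypothetical sample collection.
DOCUMENTS = {
    1: [("SectionHeading", "Data models"), ("Sentence", "Text retrieval uses data")],
    2: [("SectionHeading", "Query processing"), ("Sentence", "Data is stored")],
    3: [("SectionHeading", "Introduction"), ("Sentence", "The data set")],
}

zone_index = build_zone_index(DOCUMENTS)

# Only documents with "data" inside a SectionHeading are returned.
in_heading = sorted(zone_index.get(("data", "SectionHeading"), set()))
in_sentence = sorted(zone_index.get(("data", "Sentence"), set()))
print(in_heading, in_sentence)
```

The trade-off in this sketch is a larger index, since a word that appears in several element types gets one entry per zone.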


The importance of this work is that it enhances the usefulness of SGML as a document processing standard. It provides more powerful and flexible retrieval capabilities than currently exist in unstructured text retrieval. At the same time, because the document content and structure are stored separately and we are basing the implementation upon a conventional text retrieval system, traditional retrieval operations can still be performed on the data. What has so far been demonstrated is that it is feasible to construct a retrieval system that is much more flexible and powerful than those currently extant. A great deal of work remains to be done both at the interface level and in improving performance.

REFERENCES

1. IBM Corporation. Information Management System/Virtual Storage General Information Manual. IBM Form No. GH20-1260, New York; 1984.
2. ISO. Information Processing Systems-Document Filing and Retrieval (DFR) Part 2: Protocol Specification, ISO/TC 97/SC 18/WG 4, Version 1, Tokyo; 1987.
3. ISO. Information processing-Text and office systems-Standard generalized markup language (SGML). ISO 8879-1986(E), International Organization for Standardization; 1986.
4. Conklin, J. Hypertext: An introduction and survey. IEEE Computer, 20(9): 17-41; 1987.
5. Macleod, I.A. Three approaches to informational retrieval. Proceedings of the RIAO Conference, Grenoble; 1985.
6. Macleod, I.A. SEQUEL as a language for document retrieval. Journal of the American Society for Information Science, 30: 243-249; 1979.
7. Stonebraker, M. Document processing in a relational database system. ACM Transactions on Office Information Systems, 1(2): 143-158; 1983.
8. Lynch, C.A.; Stonebraker, M. Extending user-defined indexing with application to textual databases. Proceedings of the Fourteenth Conference on Very Large Data Bases, Los Angeles, 306-317; 1988.
9. Borgida, A. Features of languages for the development of information systems at the conceptual level. IEEE Software, 2(1): 63-72; 1985.
10. Brodie, M.L.; Mylopoulos, J.; Schmidt, J.W. On conceptual modelling-perspectives from artificial intelligence, databases and programming languages. New York: Springer-Verlag; 1984.
11. Lochovsky, F. Special issue on object-oriented systems. ACM Transactions on Office Information Systems, 5(1); 1987.
12. Schek, H.J.; Pistor, P. Data structures for an integrated data base management and information retrieval system. Proceedings of the Eighth International Conference on Very Large Data Bases, Mexico City, 197-207; 1982.
13. Macleod, I.A. A model for integrated information systems. Proceedings of the 9th Conference on Very Large Data Bases, Florence, 280-289; 1983.
14. Gonnet, G.H.; Tompa, F.W. Mind your grammar: a new approach to modelling text. Proceedings of the Thirteenth Conference on Very Large Data Bases, Brighton, 339-346; 1987.
15. Bertino, E.; Rabitti, F.; Gibbs, S. Query processing in a multi-media environment. ACM Transactions on Office Information Systems, 6(1): 1-41; 1988.
16. Exoterica Inc. XGML Application Developer’s Reference Manual, Ottawa, Canada; 1987.
17. Barnard, D.T.; Grosso, P.; Hayter, R.; Ligurs, R.; Macleod, I.A.; McFadden, J.; Wilmot, S. A programmer’s toolkit for accessing parsed documents. Technical Report, Queen’s University, Kingston; 1988.
18. Hull, R.; King, R. Semantic database modeling: survey, applications, and research issues. ACM Computing Surveys, 19(3): 201-260; 1987.
19. Macleod, I.A.; Reuber, A.R. The array model: A conceptual modeling approach to document retrieval. Journal of the American Society for Information Science, 38(3): 162-170; 1987.
20. Jenkins, M.A.; Glasgow, J.I.; McCrosky, C.D. Programming styles in Nial. IEEE Software, 3(1): 46-55; 1986.
21. Macleod, I.A. A query language for retrieving text from hierarchic structures. Technical Report, Queen’s University, Kingston; 1988.
22. Fulcrum Technologies Inc. Ful/Text Programmer’s Guide Version 4.2. Ottawa, Canada; 1987.
