Information Processing & Management, Vol. 29, No. 3, pp. 387-396, 1993. Printed in Great Britain.
0306-4573/93 $6.00 + .00. Copyright © 1993 Pergamon Press Ltd
QUERYING A HYPERTEXT INFORMATION RETRIEVAL SYSTEM BY THE USE OF CLASSIFICATION

M. ABOUD, C. CHRISMENT, R. RAZOUK, F. SEDES, C. SOULE-DUPUY
I.R.I.T./S.I.G., Université Toulouse III, 118, Route de Narbonne, 31062 Toulouse Cedex, France

(Received 30 July 1991; accepted in final form 24 April 1992)

Abstract-We present in this paper a navigation approach using a combination of functionalities encountered in classification processes, Hypertext Systems and Information Retrieval Systems. Its originality lies in the cooperation of these mechanisms to restrict the consultation universe, to locate the searched information faster, and to tackle the problem of disorientation when consulting the restricted Hypergraph of retrieved information. A first version of the SYRIUS system has been developed integrating both Hypertext and Information Retrieval functionalities, which we have called a Hypertext Information Retrieval System (H.I.R.S.). This version has been extended using classification mechanisms. The graphic interface of this new system version is presented here. Querying the system is done through a common visual representation of the database Hypergraph. The visualization of the Hypergraph can be parameterized focusing on
several levels (classes, links, . . .).

INTRODUCTION

“Hypertext” designates the organization of an information set permitting multiple search methods. This organization implies a nonclassical interaction by a nonlinear browsing called navigation through a Hypergraph which represents links between chunks of information (Conklin, 1987; Julien, 1988; Nielsen, 1990). This approach presents the disadvantage of enforcing a restricted number of entry points in the Hypergraph. Another disadvantage may be the one of disorientation when the browsing space becomes too large. Indeed, the wish to integrate different information types and to simulate the associative nature of this information makes the information base a real labyrinth in which the user can become lost (Utting & Yankelovich, 1990). Several strategies can be considered to tackle this problem (Tague et al., 1991; Garg, 1988):

* enriching the Hypergraph with labels, marks, and tags to locate familiar places,
* integrating more efficient browsers in order to inform the user of his or her position on the global web,
* implementing other methods of idea association,
* creating extra links to locate the searched information as fast as possible.
So the disorientation problem can be partly solved by means of two concepts: local “webs” (visualization of a sub-Hypergraph through different abstraction levels) and guided browsing (aid for the navigation). The problem of locating an entry point can be handled by using some techniques offered by Information Retrieval Systems (Afrati & Koutras, 1990; Rabitti & Savino, 1990). This implies a particular organization through which a descriptor is associated with each node. In this way, query formulation in natural-like language will permit one, through a matching process, to retrieve the nodes relevant for the user who is querying. The ordering of such selected nodes according to their degrees of relevance to the query permits the favoring of some entry points.

A version of this paper was presented at the RIAO’91 conference, Barcelona, Spain, 2-5 April 1991.
Nevertheless, the browsing space may remain quite large. One can therefore consider a restriction of this space by using the properties extracted from node descriptors, which permits one to gather them into classes. Then, during an inquiry stage, the user can specify in the query the interesting classes. Our approach can be summarized by making three inquiry methods cooperate:
* navigation through a graphic representation of the Hypergraph schema, as in Hypertext Systems (Nielsen, 1990; Schneiderman, 1991),
* natural-like language retrieval, as in Information Retrieval Systems (Salton & McGill, 1983; van Rijsbergen, 1979),
* multi-criteria retrieval, as in the classification context.

We will now present the user’s graphic interface of the SYRIUS system, which integrates these three functionalities.
The stored information takes the form of multimedia documents (Razouk, 1990). A first partition of the information base is generated according to the notion of file, which corresponds to a homogeneous set of documents. Documents are ranked and partitioned into subsets called files. A file groups documents according to a given domain or theme, a particular organization, etc.; this implies a semantic content for each file, which will be chosen by the user who expresses a domain of interest.

A document consists of a body, which groups information elements of different types, with which is associated a textual descriptor that describes the document content. The automatic indexing process analyses this descriptor in order to associate a set of indexing terms with it. This set will constitute a model for future retrieval and automatic document classification. The extracted terms are organized into a thesaurus, which groups the set of terms known by the system; one thesaurus is attached to each file in order to ensure semantic consistency. A semantic weight is associated with each term in the thesaurus. It measures the discriminative power of the term, which then helps us to identify a document in the file.

To make information sharing easier, we have adopted a structure in which the information granularity is chosen by the user, so that a unit stored in a file need not be a whole document but can be a typed chunk of a document, which can be shared by several other documents. This leads us to the definition of the concept of concrete node. A concrete node represents one typed elementary piece of information. In this context, a concrete node is a part of text, a diagram, or a picture connected to one or several documents by a structural link (“is_a_part_of”). Each information chunk of a node can be referenced independently of any other element (Chrisment et al., 1985). This modularity avoids redundancy, increases database consistency, and makes it possible to parameterize information granularity.
Structural links permit the construction of objects by aggregation and the design of the organization schema. The link “is_a_part_of” constructs a document from its concrete nodes and a file by aggregation on the set of documents. Such objects (documents, files, . . .) created by aggregation are abstract nodes. Other types of nodes and links are introduced so as to implement semantic links between the different database components. We are now going to present these different types of nodes and links. Then we will introduce the construction and manipulation of these objects.
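The aggregation of concrete nodes into documents, and of documents into files, can be illustrated with a minimal Python sketch. The class and field names (`ConcreteNode`, `Document`, `File`, `parts`, `documents_sharing`) are hypothetical and chosen only for illustration; they are not taken from the SYRIUS implementation, which was written in C.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ConcreteNode:
    """One typed elementary piece of information (text, diagram, picture)."""
    node_id: str
    media_type: str
    content: str


@dataclass
class Document:
    """Abstract node: a descriptor plus the concrete nodes that are part of it."""
    doc_id: str
    descriptor: str
    parts: List[ConcreteNode] = field(default_factory=list)  # "is_a_part_of" links


@dataclass
class File:
    """Abstract node: a homogeneous set of documents."""
    name: str
    documents: List[Document] = field(default_factory=list)  # aggregation of documents


def documents_sharing(file: File, node_id: str) -> List[str]:
    """Return the ids of the documents of `file` that include the given concrete node."""
    return [
        doc.doc_id
        for doc in file.documents
        if any(part.node_id == node_id for part in doc.parts)
    ]


# A diagram stored once and shared by two documents (no redundant copy).
diagram = ConcreteNode("n1", "graphic", "architecture diagram")
intro = ConcreteNode("n2", "text", "introductory paragraph")
doc_a = Document("d1", "system overview", [intro, diagram])
doc_b = Document("d2", "design notes", [diagram])
manuals = File("manuals", [doc_a, doc_b])
print(documents_sharing(manuals, "n1"))  # ['d1', 'd2']
```

Because both documents hold a reference to the same `ConcreteNode` object, the chunk is stored once, which is the kind of sharing the modular organization is meant to allow.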
Information is organized into two kinds of nodes: concrete and abstract nodes. Abstract nodes are generated by aggregation over concrete or abstract nodes.

1.1.1 Concrete nodes. In our approach, a concrete node is associated with information of a single type (text, graphic, image, . . .). Concrete nodes are specified by interaction between the system and the database administrator; by default, a specification is automatically proposed, which the user can modify.
1.1.2 Abstract nodes. Abstract nodes (document, file, directory, . . .) are constructed by aggregation on the database nodes or by applying database organization mechanisms (thesaurus, class, . . .).

Document nodes: A document should include a unique identifier (generated by the system) and a textual descriptor. A document descriptor is automatically built from the textual content of the document (when it is created), to which the administrator can add comments and information about graphic or image nodes. It is the starting point for document indexing and classification. The following list of actions is associated with this kind of node:

* creation of the document with reference to graphic data or images and construction of its descriptor,
* automatic indexing, which consists in analysing the descriptor to extract the relevant indexing terms from the document content,
* automatic classification, which consists in ranking the document in one of the classes of a classification criterion,
* visualization of the electronic document (printing a document requires computing its linear structure).
Thesaurus nodes: A thesaurus node includes a set of indexing terms extracted from the documents of one file as well as the semantic relations among these terms. In our approach, an indexing term corresponds to a group of keywords, and semantic relations to relations of synonymy, generalization, and specialization between groups of words (Soule-Dupuy, 1990). The list of indexing terms attached to a document is integrated into the thesaurus of the corresponding file when a new document is created. In the thesaurus, a semantic weight is associated with each indexing term extracted from the document descriptor. This weight is computed from the occurrence frequency of the indexing term in a collection of documents. We have integrated three kinds of semantic relations to support an expanding process when querying the system. The synonymy relation is an equivalence relation that groups the indexing terms of a thesaurus into equivalence classes called synonymy classes. The impact of this relation on retrieval by means of a natural-like language formulation is an expanding process; each term extracted from the query is replaced by its synonymy class, and a semantic weight is estimated for this class from its appearance frequency (Aboud, 1990). The semantic weight of a synonymy class is estimated using the formula:
$$\mathrm{Weight}(cl_k) = \bigl(1 - \mathrm{Freq}(cl_k)\bigr)\,[\ldots] = \Bigl(1 - \frac{n_k}{N}\Bigr)\,[\ldots]$$

where $n_k$ is the number of documents indexed by at least one term of the synonymy class $cl_k$ and $N$ is the number of documents in the collection. The coefficient [...] is a parameter that adjusts the importance degree of the synonymy class (see Crampes, 1980, for further information).

The hierarchical relation enables us to include hierarchy notions such as generalization and specialization between synonymy classes. During retrieval, these notions expand the query with more general or more specific concepts.

The association (or proximity) relation is a contextual relation that constructs links between the terms we want to associate. A weight indicates a distance between connected terms and can also be used in a query-expanding process.

Class nodes: A class node gathers similar documents according to a criterion. It is characterized by:
* its name, which refers to a concept called reference term; this term represents the concepts common to the class documents,
* the nonformal description, which is used to convey briefly the meaning of the class,
* a list of terms associated with the reference term, which are extracted from document descriptors attached to the class.
Therefore, a class can be viewed as a virtual document reduced to a descriptor without any body. The classification process can be automatic (Celeux, 1989; Prieto-Diaz & Freeman, 1987).

1.2 The links

According to the previous architecture, we distinguish four basic types of links, but our system is open and makes it possible to define any other type of link.

Structural links: Structural links permit the construction of aggregated objects by connecting the concrete and/or abstract nodes that compose them. For example:

* A link “is_a_part_of”, from concrete to abstract nodes, constructs a document (abstract node) upon the set of concrete nodes that are its components.
* A link “belongs_to”, from abstract to abstract nodes, generates a file from the set of documents included in it.
Referring links: Referring links reference (directly or not) different nodes at several levels: “see_also”, . . .

Indexing links: Indexing links connect thesaurus nodes to document nodes. They are activated when retrieving documents from indexing terms according to the information retrieval principle. Indexing links are created during the document indexing process; terms extracted from the document are added to the thesaurus (if they do not already appear in it) and links between the thesaurus node of the file and the document are created.

Classification links: Classification links aggregate documents according to criteria. They associate a document with a class with respect to the corresponding criterion.

Figure 1 illustrates the information organization integrating these types of links.

Fig. 1. The information base organization (legend: structural link, referring link, index link, classification link).

Links can be created explicitly or implicitly, but they can also be induced. “Explicit” creation occurs when the connection is specified from one node to another by the use of an editor (concept of “button”). Referring links can only be created in this way. “Implicit” creation is the case when the nodes are automatically connected. This method is especially suitable when creating index and classification nodes; thesaurus nodes and classification nodes are implicitly created by indexing and classification processes (Razouk, 1990). The induced links are created, for example, when querying the database to chain the retrieved documents in decreasing order of their similarity with the query. Once the displaying operation is over, these links disappear. The former (explicit and implicit links) are static links, which describe the predefined organization of the database. The latter (induced links) are dynamic links, which make it possible to create links while the database is operating and to introduce dynamicity into the database design.
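As an illustration of implicit link creation and of the temporary nature of induced links, the following Python sketch indexes a document, adds any unseen terms to the thesaurus, and records indexing links. It is a sketch under stated assumptions: the names `Link`, `extract_terms`, `index_document`, and `drop_induced` are hypothetical, and the term extraction (lowercase words longer than three letters) is a deliberately naive stand-in for the automatic indexing process, which the text does not detail.

```python
import re
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Link:
    link_type: str   # "structural", "referring", "index", "classification", "next", ...
    creation: str    # "explicit", "implicit", or "induced"
    source: str
    target: str


def extract_terms(descriptor: str) -> Set[str]:
    """Naive placeholder for automatic indexing (an assumption, not the SYRIUS method)."""
    return {w for w in re.findall(r"[a-z]+", descriptor.lower()) if len(w) > 3}


def index_document(doc_id: str, descriptor: str,
                   thesaurus: Set[str], links: List[Link]) -> Set[str]:
    """Add the document's terms to the thesaurus and create implicit indexing links.

    Returns the terms that were new to the thesaurus.
    """
    terms = extract_terms(descriptor)
    new_terms = terms - thesaurus
    thesaurus |= terms  # terms already present are left untouched
    for term in sorted(terms):
        links.append(Link("index", "implicit", term, doc_id))
    return new_terms


def drop_induced(links: List[Link]) -> List[Link]:
    """Induced (dynamic) links disappear once the display is over; static links remain."""
    return [link for link in links if link.creation != "induced"]


thesaurus: Set[str] = set()
links: List[Link] = []
index_document("d1", "Hypertext navigation system", thesaurus, links)
added = index_document("d2", "Hypertext retrieval", thesaurus, links)
print(sorted(added))  # ['retrieval']
```

Keeping the creation mode on each link makes the static/dynamic distinction a simple filter rather than a separate data structure.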
3. DATABASE QUERYING

When performing a search, the user has three tools, which can be used separately or in a cooperative way (Parsaye et al., 1989):

* classification to restrict the investigation domain,
* indexing mechanisms to retrieve documents from natural-like language queries,
* links to navigate through the retrieved sub-Hypergraph.
The first tool makes it possible to reduce the global Hypergraph to a sub-Hypergraph by choosing one or several classes. With the second one, a natural-like language query locates a relevant entry point (the list owner in the reduced sub-graph of retrieved documents). The third one allows the user to navigate from the list owner to the “neighborhood” of retrieved nodes through the reduced graph, which an expanding process can enhance. A second strategy can be the following one:

* the user can choose, from a number of classification criteria, those which seem to be the most relevant and best fitted to the consultation context. The choice of a class in a classification criterion narrows the scope of the inquiry (natural-like language search or navigation) to the set of documents attached to this class,
* the user can also submit, or not, a request in natural-like language focusing on a set of documents of the database or on the sample identified by selecting classes of the different indicated criteria,
* the user can finally view a sub-Hypergraph constructed from the document nodes found by natural-like language retrieval and class selection. This allows for more efficient navigation and prevents the user from straying from the path in the Hypergraph.
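The three steps of this strategy can be sketched in Python as follows. This is a minimal sketch, not the SYRIUS implementation: the function names are hypothetical, the matching score is a plain count of shared terms (the paper's matching process is not reproduced here), and the sub-Hypergraph keeps only referring links, read as directed from a retrieved document to the document it references, in line with the construction formalized in Section 3.3.

```python
from typing import Dict, List, Set, Tuple


def restrict_by_classes(collection: Set[str], selections: List[Set[str]]) -> Set[str]:
    """Step 1: keep the documents attached to the chosen classes of every criterion.

    Each element of `selections` is the set of documents attached to the
    class(es) picked in one classification criterion.
    """
    sample = set(collection)
    for docs_of_chosen_classes in selections:
        sample &= docs_of_chosen_classes
    return sample


def retrieve(sample: Set[str], descriptors: Dict[str, Set[str]],
             query_terms: Set[str]) -> List[str]:
    """Step 2: rank the sample by number of shared terms (assumed similarity)."""
    scored = [(len(descriptors[d] & query_terms), d) for d in sample]
    hits = [(score, d) for score, d in scored if score > 0]
    hits.sort(key=lambda pair: (-pair[0], pair[1]))
    return [d for _, d in hits]


def sub_hypergraph(retrieved: List[str],
                   referring: Set[Tuple[str, str]]) -> Tuple[Set[str], Set[Tuple[str, str]]]:
    """Step 3: retrieved documents plus those they reference, and the links among them."""
    found = set(retrieved)
    nodes = set(found)
    for source, target in referring:
        if source in found:
            nodes.add(target)
    links = {(s, t) for s, t in referring if s in nodes and t in nodes}
    return nodes, links


collection = {"d1", "d2", "d3", "d4"}
descriptors = {
    "d1": {"hypertext"},
    "d2": {"hypertext", "retrieval"},
    "d3": {"classification"},
    "d4": {"retrieval"},
}
referring = {("d2", "d4"), ("d4", "d1"), ("d3", "d1")}

sample = restrict_by_classes(collection, [{"d1", "d2", "d3"}, {"d2", "d3", "d4"}])
retrieved = retrieve(sample, descriptors, {"hypertext", "retrieval"})
nodes, links = sub_hypergraph(retrieved, referring)
```

Note that a document outside the restricted sample (here `d4`) can still appear in the sub-Hypergraph when a retrieved document refers to it; this follows from including referring-link neighbors in the node set.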
We are now going to present the three consultation modes and the part played by each one in the successive restriction of the selected sample of documents.

3.1 Multi-criteria search

A classification criterion is viewed as a tree-like structure of classes (left window in Fig. 2). The documents attached to the chosen class (the highlighted one) are displayed in the right window. The user chooses one or several classification criteria that seem the most relevant and well suited to the current consultation context. The selected criteria are viewed in a tree-like structure in which the user can screen one or several classes. Indicating classes in the criterion hierarchy triggers visualization of general information about the sample (number and descriptors of documents). If the sample is too large, the process of class selection can be repeated on the classes until the sample is of a size to be examined document by document. The user can scan and choose the classes according to the semantics that proceeds from the structure of the selected classification criterion.
Fig. 2. A multi-criteria search.
After a number of class criterion consultations, the user can directly identify the suitable document from the elements of the obtained set, or can ask for a search on other classification criteria in order to restrict this set. The selection process is performed in the same way on the classes of the other criteria. The result is a set of documents that satisfies all the requested criteria.

3.2 Querying by natural-like language

Such a search is based on a matching process between the query and the different document descriptors (Salton & McGill, 1983; van Rijsbergen, 1979; Soule-Dupuy, 1990). The retrieved documents are returned in order of decreasing similarity; a sequential consultation must first return the most relevant documents. So it is possible to create two induced links between the retrieved document nodes:

* the link “next” connects a document to its successor according to the order of their similarity degree,
* the link “previous” connects a document to its predecessor according to the order of their similarity degree.

These two links are saved during consultation and are deleted when the entry of a new query occurs or when the consultation is over.
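The ranking and the two induced links can be sketched in Python as below. Several points are assumptions made for illustration only: each term is weighted by 1 - n/N (the frequency-based part of the thesaurus weight, applied per term and without the adjusting coefficient), the similarity of a document is the sum of the weights of the terms it shares with the query, and the function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def term_weights(descriptors: Dict[str, Set[str]]) -> Dict[str, float]:
    """Weight each term by 1 - n/N, where n documents out of N contain it."""
    n_docs = len(descriptors)
    counts: Dict[str, int] = {}
    for terms in descriptors.values():
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    return {term: 1.0 - n / n_docs for term, n in counts.items()}


def rank(query: Set[str], descriptors: Dict[str, Set[str]],
         weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (document, similarity) pairs in order of decreasing similarity."""
    scored = []
    for doc_id, terms in descriptors.items():
        score = sum(weights.get(term, 0.0) for term in terms & query)
        if score > 0:
            scored.append((doc_id, score))
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


def induced_links(ranked: List[Tuple[str, float]]) -> List[Tuple[str, str, str]]:
    """Chain the ranked documents with temporary "next" and "previous" links."""
    ids = [doc_id for doc_id, _ in ranked]
    links = []
    for current, successor in zip(ids, ids[1:]):
        links.append(("next", current, successor))
        links.append(("previous", successor, current))
    return links


descriptors = {
    "d1": {"hypertext", "navigation"},
    "d2": {"hypertext", "retrieval"},
    "d3": {"retrieval", "classification"},
    "d4": {"navigation"},
}
weights = term_weights(descriptors)
ranked = rank({"retrieval", "classification"}, descriptors, weights)
links = induced_links(ranked)
print(ranked)  # [('d3', 1.25), ('d2', 0.5)]
```

The first element of `ranked` plays the role of the list owner, that is, the entry point from which navigation starts; discarding `links` when a new query arrives mirrors the deletion of the induced links.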
3.3 Consultation by navigation

The starting point of the consultation by navigation is the selection of a document from the database (Watters & Shepherd, 1990). This selection can be made in three ways:

* selection of a document among a list of documents as a result of the consultation of one or several classification criteria,
* consultation of the list of documents proposed as an answer to a natural-like language query,
* indicating a node in the Hypergraph of the collection or in a sub-Hypergraph obtained by successive restrictions of the Hypergraph after multi-criteria and/or natural-like language inquiries (see Fig. 3).

Fig. 3. Principle of a sub-Hypergraph building (legend: list owner; induced link “next”).
Let:

* $E$ be the sample obtained by a multi-criteria search (if there is no multi-criteria search, $E = D$, the whole collection of documents),
* $Rep = \{(d_i, ress_i)\}$ be the set of documents retrieved by a natural-like language search on the sample $E$ (if there is no natural-like language search, $Rep = E$),
* $SHG = (N, L)$ be the sub-Hypergraph obtained by the restriction of the Hypergraph of the collection.

SHG is built as follows:

$$N = Rep \cup \{\, d_j \mid \exists \text{ a referring link between } d_i \text{ and } d_j \text{ and } d_i \in Rep \,\},$$

and $L$ is the set of links having documents of $N$ as extremities and destinations.

4. THE SYRIUS PROTOTYPE
Such an organization made it possible to implement a Hypertext multimedia information retrieval system. SYRIUS offers the main functionalities of a Hypertext System, includes a free-language indexing method, and dynamically classifies documents according to user-defined criteria. These three organization methods of a document collection cooperate to accelerate retrieval and to improve the relevance and friendliness of viewing the collection contents. The prototype has been developed on SUN workstations, in the C language under UNIX, which allowed us to design a dynamic and flexible structure, ensuring modularity and wide portability (Razouk, 1990; Rames, 1991).

4.1 Database querying

Querying a database file (see Fig. 4) triggers the display of three windows: (a) a view of the global Hypergraph, in which the vertices represent node identifiers and the arcs represent the links between them; (b) the list of available criteria; (c) a window intended for the input of a natural-like language query.

Fig. 4. Querying a database file.

4.2 Refining the querying by restriction

Referring to Fig. 5, the following windows are available: (a) list of criteria among which the highlighted one is chosen, (b) tree-like structure of the classes corresponding to the selected criterion, (c) the query that is applied on the documents belonging to the selected class(es), (d) descriptors of the selected documents, corresponding to the query, (e) corresponding sub-Hypergraph of the previous documents.

Fig. 5. Refining the query by restriction.

An experiment is in progress that is focused on the reuse of document chunks to design new ones (Onuegbe, 1987; ESF Newsletter, 1990; Chrisment et al., 1991). The context of application is electronic document management: “Proposals for Invitation to Tender.” These documents are structured according to the SGML standard (Bryan, 1988; MacLeod, 1990), and the marks associated with the document content show the embedded structure: chapters, sections, subsections, paragraphs. The retrieval unit for reusability is the paragraph. A comment that contains the descriptor is associated with each paragraph, section, . . . mark. When creating a new document, the system classifies its paragraphs according to different criteria. Past experience is valuable when building up a proposal for a new Invitation to Tender. Previously drafted Invitations to Tender are retrieved from the topics of the future Invitation to Tender and the classification criteria. The system localizes the most specific parts of the proposal for Invitation to Tender that are related to the new proposal. Users can recover the text and adapt it according to their specific needs.

CONCLUSION
A model for a Hypertext Information Retrieval System has been introduced which provides a complete representation of the different types of objects and links that can be encountered and, more specifically, of various semantic relations. Such an Information Retrieval System consists of a database of multimedia documents and of links among objects of the database, that is to say, between documents, indexes, thesauri, classification criteria, etc. These links can be defined a priori (static links) or in a dynamic way. This dynamicity allows us to take into account the evolution of the database concepts. The originality also lies in the implementation of the querying process, which integrates three approaches: classical information retrieval by natural-like language queries, multi-criteria retrieval through an automatic classification schema, and navigation through a multimedia information Hypergraph. The complementarity of the three approaches makes querying easier and more effective. It provides more powerful and flexible retrieval capabilities than currently exist in classical full-text retrieval systems. As presented, the system brings a further aid to the classical expanding process. Instead of reformulating a query, one can use links attached to the retrieved documents, for example by extending the retrieved document set to the documents connected by referring links (“see_also”). The three inquiry and browsing modes can be combined according to different strategies, but the user is always guided, in querying as in searching. This enables us to solve for the most part the problems of disorientation and of getting lost in a web (or labyrinth). However, the modularity of the architecture permits one to use only the structures of interest, corresponding to particular functions, according to the contexts and applications concerned.
Indeed, while the structural hierarchy is unique (documents gathered into files), several classification hierarchies may coexist in order to represent multi-criteria classification, used or not according to the search. The main interest of these structures lies in the dynamicity of the links (deduced from semantic properties), which permits the automatic creation of implicit or deduced links, and then the implementation of propagation mechanisms between the nodes of the database.

REFERENCES

Aboud, M. (1990). Systèmes de recherche d’informations: thesaurus et classification. Doctoral thesis, Paul Sabatier University (in Computer Science), Toulouse III, France.
Afrati, F., & Koutras, C. (1990, November). A hypertext model supporting query mechanisms. Proceedings of the European Conference on Hypertext ECHT’90.
Bryan, M. (1988). SGML: An author’s guide to the Standard Generalized Markup Language. Wokingham: Addison-Wesley.
Celeux, G. (1989). Classification Automatique des Données. Paris: DUNOD.
Chrisment, C., Crampes, B., & Zurfluh, G. (1985). Bases d’informations généralisées. Paris: DUNOD.
Chrisment, C., Comparot, C., Julien, C., & Richard, P. (1991, May). Hypertext-databases: Design, study, development, creation of technical documentation. Proceedings of the conference CIL’91.
Conklin, J. (1987). Hypertext: An introduction and survey. IEEE Computer, 20(9), 17-41.
Crampes, B. (1980). Aide à l’interrogation d’un dictionnaire de données. RAIRO Informatique/Computer Science, 14(1), 86-95.
ESF Newsletter (1990). Reuse with natural language-Classification and search of reusable elements in ROSE. ESF Newsletter, 3, 5-20.
Garg, P.K. (1988). Abstraction mechanisms in hypertext. Communications of the ACM, 31(7), 862-870.
Julien, C. (1988). Base d’informations généralisées: Contribution à l’étude des mécanismes de consultation d’objets multimédia. Doctoral dissertation, Paul Sabatier University (in Computer Science), Toulouse III, France.
MacLeod, I.A. (1990). Storage and retrieval of structured documents. Information Processing & Management, 26(2), 197-208.
Nielsen, J. (1990). The art of navigation through Hypertext. Communications of the ACM, 33(3), 298-321.
Onuegbe, E. (1987). Software classification as an aid to reuse: Initial use as part of a rapid prototyping system. Proceedings of the 20th Annual Hawaii International Conference on System Sciences (pp. 521-529).
Parsaye, K., Chignell, M., Khoshafian, S., & Wong, H. (1989). Intelligent databases. New York: Wiley.
Prieto-Diaz, R., & Freeman, P. (1987). Classifying software for reusability. IEEE Software, March, 6-16.
Rabitti, F., & Savino, P. (1990). Retrieval of multimedia documents by imprecise query specification. Proceedings of EDBT’90, Venice (pp. 203-217).
Rames, E. (1991). Sur la réutilisation de composants logiciels. Doctoral dissertation, Paul Sabatier University (in Computer Science), Toulouse III, France.
Razouk, R. (1990). Bases d’informations généralisées: Hypermédia et classification. Doctoral dissertation, Paul Sabatier University (in Computer Science), Toulouse III, France.
Salton, G., & McGill, M.J. (1983). Introduction to modern information retrieval. New York: McGraw-Hill.
Schneiderman, B. (1991). Designing to facilitate browsing: A look back at the Hyperties workstation browser. Hypermedia, 3(2), 101-117.
Soule-Dupuy, C. (1990). Système de Recherche d’Informations: Le système vidéotex INFODIAB. Mécanismes d’indexation et d’interrogation. Doctoral dissertation, Paul Sabatier University (in Computer Science), Toulouse III, France.
Tague, J., Salminen, A., & McClellan, C. (1991). Complete formal model for information retrieval systems. Proceedings of ACM conference SIGIR’91 (pp. 14-20).
Utting, K., & Yankelovich, N. (1990). Context and orientation in hypermedia network. ACM TOOIS, 58-84.
van Rijsbergen, C.J. (1979). Information retrieval, 2nd edition. London: Butterworths.
Watters, C., & Shepherd, A. (1990). Hypertext access and the New Oxford English Dictionary. Hypermedia, 3(1), 59-79.