Fuzzy Sets and Systems 38 (1990) 223-240 North-Holland
Knowledge engineering for a document retrieval system
Ramón López de Mántaras 1, Ulises Cortés 1,2, Jaume Manero 2,3, Enric Plaza 1
1 Grup de Recerca en Intel·ligència Artificial i Lògica, Centre d'Estudis Avançats de Blanes, Camí de Santa Bàrbara, 17300 Blanes, Spain
2 Departament de Llenguatges i Sistemes Informàtics, Facultat d'Informàtica, Universitat Politècnica de Catalunya, Pau Gargallo 5, 08028 Barcelona, Spain
Received June 1988
Revised December 1989
Abstract: A document retrieval system based on fuzzy set theory is described in this paper. Weights, as fuzzy characteristic functions, can be assigned to descriptors in the query expression and to index terms in the document description. Fuzzy set theory allows one to calculate a relevance value for each document from weights assigned to documents and queries. The relevance value for each document is calculated following different models. An elicitation mechanism is used to generate and to enhance the thesaurus structure. The thesaurus guides the retrieval operations.
Keywords: Information retrieval; fuzzy sets theory; knowledge engineering; thesaurus generation; document relevance values; personal construct theory; concept identification.
Research supported by CSIC/CAICYT 836 project.
3 Partially supported by Catalonia Government grant CIRIT AR87.
0165-0114/90/$03.50 © 1990 Elsevier Science Publishers B.V. (North-Holland)
1. Introduction
An information retrieval system (IR system) is a computer system that allows users to receive information from a document collection stored in a database [15]. Conventional information retrieval systems are built for fast and efficient document database administration. The classical IR system builds an answer set (document list) from a query made by the user. Most classical IR systems are built within the framework defined by Boolean set theory. In this framework a book is described by a list of descriptors (index terms) and the query is a list of descriptors with some Boolean relations constructed upon them. The answer set is an unordered list of documents. This output conflicts with the commonsense idea that documents differ in their relevance to a user's query (it is usually admitted that some books are 'better' for one's need than others); the binary character of Boolean sets, in which a document is either relevant or not relevant, conflicts with the reasoning used by people. It is desirable to rank items by an estimation of their relative importance for the user. To provide ranking capabilities in an information retrieval system a variable is defined, called relevance, which characterizes the matching
degree for a document with a user's query, and there are different approaches to provide such relevance estimation in a classical Boolean framework. To retrieve means to specify descriptors and to use rules for combining them in a sequence that conveys meaning. The skill with which these descriptors are used determines the effectiveness of the retrieval process. Some approaches are based on some kind of weighting, in the index terms (index weights) or in the query terms (query term weights). Others are based on similarity measures (departing from the Boolean framework). The weights in index terms indicate how well the descriptors fit the item content. Weights in the query terms indicate the importance of a concept in a query. Each approach uses an interpretation model to compute the relevance of the items in the resulting set. Traditional Boolean retrieval can be viewed as a special case of weighted retrieval. With zero-one weights (present or absent terms) the model will behave as a classical Boolean retrieval system, retrieving a document set with relevance values equal to one.
Fuzzy set theory defines a set as a class with fuzzy boundaries. Such a class may be characterized by associating a grade of membership in the class to every object that might be in the class. This concept is useful for defining a new approach to extending Boolean retrieval to weighted systems. Under this approach, the index term weight is the characteristic function of the fuzzy set of indexing terms describing the item. The important point is that descriptors can fit an information item to varying degrees. Relevance values are calculated in the framework defined by this theory, as is described later in this paper. Some authors have emphasized this approach, as in [17, 11]. Other work in the area of information retrieval is related to the manipulation of large amounts of data, which is related to efficient access times and simplified file access.
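The claim that Boolean retrieval is a special case of weighted retrieval can be illustrated with a small sketch. It assumes the standard fuzzy-set operators (min for AND, max for OR, 1 - x for NOT); the paper evaluates relevance under several models that are not reproduced here, and the function names and sample weights are hypothetical.

```python
# Sketch only: min/max/complement are the standard fuzzy-set operators,
# assumed here for illustration; they are not claimed to be the paper's model.

def relevance(query, weights):
    """Evaluate a nested query against one document's index-term weights.

    query   -- a descriptor string, or a tuple ("and" | "or", sub1, sub2, ...)
               or ("not", sub)
    weights -- dict mapping descriptor -> membership grade in [0, 1]
    """
    if isinstance(query, str):
        return weights.get(query, 0.0)  # an absent term has grade 0
    op, *args = query
    grades = [relevance(a, weights) for a in args]
    if op == "and":
        return min(grades)
    if op == "or":
        return max(grades)
    if op == "not":
        return 1.0 - grades[0]
    raise ValueError(f"unknown operator: {op}")


def rank(query, collection):
    """Return (doc_id, relevance) pairs with non-zero relevance, best first."""
    scored = ((doc_id, relevance(query, w)) for doc_id, w in collection.items())
    return sorted((p for p in scored if p[1] > 0), key=lambda p: -p[1])


# Hypothetical weighted collection.
collection = {
    "doc1": {"fuzzy sets": 0.9, "retrieval": 0.4},
    "doc2": {"fuzzy sets": 0.3, "retrieval": 0.8, "thesaurus": 0.7},
    "doc3": {"thesaurus": 1.0},
}
query = ("and", "fuzzy sets", ("or", "retrieval", "thesaurus"))
ranking = rank(query, collection)

# With zero-one weights the same code behaves like Boolean retrieval:
# every retrieved document has relevance exactly 1.
crisp = {"a": {"x": 1.0, "y": 1.0}, "b": {"x": 1.0}}
boolean_answer = rank(("and", "x", "y"), crisp)
```

With graded weights the answer set is ordered; with crisp weights it collapses to the unordered Boolean answer set in which every member has relevance one.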
Item clustering is an approach used in most libraries (books on the shelves are a manual clustering of items). Clustering is strongly related to classification, and classification methods are used in two different applications: first, to classify the set of index terms (in a thesaurus-like structure); second, to classify the documents into classes (related items in the same class). In our framework (fuzzy set theory knowledge databases), fuzzy classification is essential, as noted in [10, 14].
The Boolean and the fuzzy set approaches share a principal drawback. Both are based on the same access structure (query formulation which generates a retrieved set). The system itself has no strategy; the strategy is imposed by the user or the library intermediary. The better the strategy, the more useful the retrieved document set. One important improvement will be the integration of retrieval strategies in the system (retrieval rules and metarules). This integration is made in the framework defined by a KBRS (knowledge-based document retrieval information system).
2. The cognitive framework for information retrieval
Cognitive sciences are interested in the study of subjects such as problem solving, communication, perception or learning. The IR process can be viewed as a
knowledge communication process, which involves learning and problem solving strategies. In this way the IR process can be studied in the framework defined by the cognitive sciences, and this study can help to model an IR system.
2.1. Cognitive sciences and information retrieval
The study of the role of knowledge structures in information processing and the meaning of information is the core of the cognitive sciences. There is a trend which considers the information sciences as cognitive sciences, with a connection to related disciplines such as Artificial Intelligence, Psychology, or Linguistics (see Figure 1). The reason for this classification can be found in the way information systems are used. Information systems are used to solve problems (caused by a need for information). This solution process is concerned with problem solving strategies, learning processes and knowledge structures, so that it can be viewed as one of the cognitive science disciplines.
An IR system controls the knowledge flow between documents (conceptual knowledge) and the user (with an information need). This process interacts with different knowledge structures (KS), those stored in the document database and those present in individual users. The same knowledge structures are presented by the system to different individuals, and each one of them makes a different adaptation to his internal knowledge structures. The 'learning' process proceeds differently depending on the user's personal KS, or 'world knowledge'. An expert user is different from a novice, because an expert has a stronger background and a richer vocabulary.
The principal concern of the IR system is the transmission of the internal KS to the user. The IR communication process succeeds when the user understands the information presented by the system. The concepts communicated by the system, if they are not coordinated with the user's personal structures, are not understood. If the IR system is conceptualized as a mechanical-computerized system and an intermediary using it, this intermediary can establish a clearer dialogue with the user, but current computer systems are not able to carry out such a process.
Fig. 1. Cognitive sciences [7]. (Diagram labels: philosophy, mathematics, systems theory, engineering, computer science, cognitive sciences, psychology, linguistics, sociology, information science, documentation, IR, librarianship.)
2.2. Cognitive structures in IR
The IR system embodies different kinds of knowledge structures, some defined at design time (command language, database structure) and others added on-line to the system (knowledge representations about documents). Users and intermediaries must adapt their internal knowledge structures to those of the system, by learning the system characteristics; in this IR system approach the user depends on the system. The user has these two kinds of knowledge structures (IR knowledge and conceptual knowledge respectively) in different quantity and quality depending on his expertise. In highly specialized users those structures are well developed, but they cannot be found in novice users.
2.3. The intermediary in IR
Usually the interaction with an IR system is not made directly between the user and the system. There is an 'expert' in the retrieval field, who drives the retrieval process: the intermediary. The intermediary is a special and important user in the IR process. The intermediary is usually a librarian, or a person who carries out the interaction with the IR system. He has a deep understanding of the IR knowledge structures. He mediates between the final user and the computer system, reformulating user requests in query language format or rebuilding the query expression from the feedback acquired from the retrieved set. The intermediary can be regarded as a part of the IR system. Without his task the IR operation would be impossible. An improvement to current IR systems could be the incorporation of the intermediary in the system, as a front-end between the user and the retrieval mechanism. In order to make this incorporation the system must have a conceptual knowledge model, and this modelling is the most important point in the design and construction of Intelligent IR systems.
3. The knowledge-based approach
In the previous sections we briefly described a classical information retrieval system (IRS), and the study of the IR process in the framework defined by the cognitive sciences. Now as a challenge we propose, in the first place, to study a device for IRS using Artificial Intelligence tools, based on explicit knowledge about document content, and in the second place, we will present our own conception of an 'intelligent IRS'.
The idea of a system capable of using its knowledge about a domain to solve difficult situations is not new. In the case of document retrieval this knowledge is referred to as a 'good' characterization of the documents' contents. The possible characterization of a document consists of a finite set of descriptors; each one of the descriptors attached to a document characterizes the degree to which the document deals with a certain topic. This characterization is a kind of representation of the knowledge contained in the document or the representation of the available knowledge about it. The problem of knowledge representation has been an important research topic for Artificial Intelligence (AI), so it seems natural to face the knowledge-based retrieval problem from the AI standpoint.
The use of a knowledge base system allows the system to reproduce the behavior of an expert in a given domain looking for a concrete document (or set of documents) under certain conditions (a query). It is possible to identify a knowledge-based retrieval system (KBRS) through the study of its behavior. This behavior is induced by the knowledge base, which is at the heart of this approach. A knowledge base for a KBRS is (conceptually) partitioned into three modules: (i) a thesaurus (or static knowledge), (ii) a set of interpretation rules (or dynamic knowledge), (iii) a database where a document is represented as an identifier plus a set of weighted descriptors (or indexing knowledge; see Figure 2).
Other important modules in a prototypical KBRS are the Natural Language based Interface (NLI), and the inference mechanism (IM). The NLI is capable of analyzing various kinds of constrained natural-language queries presented to the retrieval system and translates them into an intermediate form ready to be used by the inference mechanism. A clear separation is maintained between the knowledge base, which contains the domain-specific knowledge, and the IM, which typically works in two phases: first, a retrieval module which computes a measure of the relevance between document characterizations and the user's query, and second, a ranking procedure that orders documents by their relevance to the query in terms of the measure obtained in the first step.
Fig. 2. Knowledge-assisted document retrieval [3]. (Blocks shown: user, expert, system controller, natural language interface, knowledge base, retrieval mechanism, ranking mechanism, document database.)
Finally, the system controller provides the link between the inputs from the NLI module and the IM module. In some systems this module also provides the search strategies that improve the response of the KBRS (see [3]). In a knowledge-based information retrieval system some advantages are apparent. First, it is possible to divide the database into several smaller ones depending on a meaningful taxonomy. Second, the search process is guided by the rules of the knowledge base or by the system controller module. Third, the knowledge base gives the opportunity to retrieve more accurate sets of documents. The knowledge-based approach is used to incorporate some of the expert's tasks into the framework of the IRS.
3.1. A knowledge-based document retrieval organization
Our work is based on a different view of the architecture from that shown in Figure 2. The generic tasks represented in our proposal were designed taking into account our past experience in the design and development of knowledge representation schemes [6] and our present work on IRS [13]. This proposal (see Figure 3) is divided into two levels. The first level is an external natural language interface (NLI) module that is used to interpret and translate queries into a more tractable expression. This front-end interface allows users to define their own interpretation of linguistic terms by adjusting default values coded in the system. This module will be explained extensively in Section 5. The second level is the KBRS core proper. It is decomposed into two levels that interact to resolve the request of the NLI.
Fig. 3. A proposal for a KBRS. (Blocks shown: user, expert, natural language interface, and, inside the knowledge based retrieval system, the document relevance evaluator, inference mechanism, retrieval mechanism, static knowledge (thesaurus) and dynamic knowledge (retrieval rules).)
Inside the KBRS the first level is organized in three modules: (i) the Document Relevance Evaluator (DRE), (ii) the Inference Mechanism (IM), (iii) the Retrieval Mechanism (RM). The knowledge base is in the second level of the KBRS and is partitioned into two parts: (i) the thesaurus (or static knowledge), (ii) the set of rules (or dynamic knowledge). The document database (DB), which contains extended information about the documents and the index knowledge, is also located at this level. The behavior of the knowledge base will be detailed in Section 4.
The IM receives an intermediate expression from the NLI module and, using the information stored in the thesaurus (see Section 4.2), it constructs a new expression and asks the RM to look for a set of identifiers (one identifier for each document retrieved) which satisfies the user's requirements. The RM module, using the interpretation rules, builds up a list of indexes, extracted from the index knowledge base, and sends the list to the DRE. This module selects and orders the final set of identifiers to be presented to the user. If the user accepts this set then the DRE looks in the database for more complete information; otherwise the DRE calls the IM with a new set of requirements to modify the previous ones. In the DRE the relevance is evaluated using fuzzy measures in order to calculate the partial membership (relevance) r of a document to a given query q (see [3, 13]).
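The division of labour among the IM, RM and DRE described above might be sketched as follows. This is a toy illustration, not the authors' implementation: the thesaurus entries, index weights, threshold and scoring rule (max over terms of the min of query strength and index weight) are all assumptions introduced for the example.

```python
# Hypothetical miniature of the IM -> RM -> DRE flow; all names and data
# are illustrative assumptions.

THESAURUS = {  # descriptor -> {related descriptor: synonymy strength in [0, 1]}
    "expert systems": {"knowledge-based systems": 0.9},
}

INDEX = {  # document identifier -> weighted descriptors (index knowledge)
    "d1": {"expert systems": 0.8},
    "d2": {"knowledge-based systems": 0.7},
    "d3": {"natural language": 0.9},
}


def inference_mechanism(terms):
    """IM: build a new expression by adding thesaurus synonyms of each term."""
    expanded = {}
    for term in terms:
        expanded[term] = 1.0
        for synonym, strength in THESAURUS.get(term, {}).items():
            expanded[synonym] = max(expanded.get(synonym, 0.0), strength)
    return expanded


def retrieval_mechanism(expanded):
    """RM: collect index entries sharing at least one descriptor with the query."""
    return {doc: w for doc, w in INDEX.items() if expanded.keys() & w.keys()}


def document_relevance_evaluator(expanded, candidates, threshold=0.5):
    """DRE: score candidates, drop those under the threshold, order the rest."""
    scored = []
    for doc, weights in candidates.items():
        score = max(min(q, weights[t]) for t, q in expanded.items() if t in weights)
        if score >= threshold:
            scored.append((doc, score))
    return sorted(scored, key=lambda pair: -pair[1])


expanded = inference_mechanism(["expert systems"])
answer = document_relevance_evaluator(expanded, retrieval_mechanism(expanded))
```

The feedback loop (the DRE calling the IM again when the user rejects the answer set) is omitted; in this sketch it would amount to calling `inference_mechanism` with a modified term list.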
4. The knowledge base
The knowledge base plays an essential role in the definition and creation of 'intelligent systems'. The choice of a given knowledge representation scheme and the definition of the primitives that manipulate all this knowledge have a deep impact on the expressiveness and power of the system.
4.1. Prototype search
The existence of a taxonomic structure between objects (documents) in the knowledge base allows the definition of a new object, an abstraction of a given set of objects, called a prototype. The prototype definition is given by the following convention:
    prototype ::= (identifier kernel list_of_elements)
    kernel ::= ((descriptor_1 . value_1) ... (descriptor_n . value_n))
    list_of_elements ::= (identifier_1 ... identifier_n)
    identifier ::= atom
    descriptor ::= atom
    value ::= number
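The prototype convention can be mirrored in a small data structure, together with the way a prototype's list_of_elements might be used to narrow a search to the members of matching prototypes. This is a minimal sketch under stated assumptions: the matching degree (weakest agreement over the kernel descriptors), the threshold and the sample prototypes are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Prototype:
    identifier: str
    kernel: dict    # descriptor -> value
    elements: list  # identifiers of the objects the prototype abstracts


def match(query, kernel):
    """Assumed matching degree: weakest agreement over the kernel descriptors."""
    return min(min(query.get(d, 0.0), v) for d, v in kernel.items())


def candidates(query, prototypes, threshold=0.5):
    """Identifiers listed under every prototype the query matches well enough."""
    found = []
    for p in prototypes:
        if match(query, p.kernel) >= threshold:
            found.extend(e for e in p.elements if e not in found)
    return found


# Hypothetical prototypes; "doc4" belongs to both classes.
prototypes = [
    Prototype("knowledge-representation",
              {"frames": 0.8, "semantic nets": 0.6}, ["doc1", "doc4"]),
    Prototype("ai-programming-languages", {"LISP": 0.9}, ["doc2", "doc4"]),
]
query = {"frames": 0.9, "semantic nets": 0.7}
shortlist = candidates(query, prototypes)
```

Only the documents in `shortlist` would then need to be scored individually, which is the saving in search effort that the list_of_elements is meant to provide.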
The kernel of a prototype is defined as the minimum set of descriptors which allows identifying an object as a member of the class of objects that the prototype represents. The list_of_elements is included to optimize the search effort (i.e. if a given query q matches the prototype p_i to a certain degree then the search task will be started only with the members of the list_of_elements of p_i). Thus the prototype allows the inference mechanism to guide the search process to obtain better answers to a given request. Some problems related to this point are those concerning automatic prototype creation and the maintenance of the set of prototypes. The representation of an object is defined as
    object ::= ((identifier) ((descriptor_1 . value_1) (descriptor_2 . value_2) ... (descriptor_n . value_n)))
It is possible to treat the prototypes as objects and then produce a taxonomy of abstractions that allows organizing the database (i.e. a document on frame-oriented languages that could be included in the knowledge representation prototype and also in the AI programming-languages prototypes is included in a higher level of abstraction named the AI prototype).
4.2. Knowledge elicitation
The thesaurus of the knowledge base could be considered as a concept knowledge base where each descriptor is treated as a concept. In this structure a set of descriptors describes the significant entities of a given domain. This set is T ::= {d_1, ..., d_n} (i.e. d_1 ::= LISP, d_2 ::= Prolog, etc.). The thesaurus structure allows us to describe binary relations between descriptors such as synonymy and
implication. Document retrieval systems base their search and evaluation process on the conceptual knowledge coded in thesauri. The thesaurus knowledge is a relational structure on the concepts (descriptors) used to characterize and discriminate among the stored documents. The retrieval system uses this knowledge together with the relationships between descriptors and the documental objects.
We will now explain a methodology for acquiring relational structures for specialized domains using knowledge engineering techniques in the framework of fuzzy set theory. The goal of this work is to build specialized domain thesauri in which the descriptors used are those concepts effectively used by the experts in each domain. Specifically, the goal is two-fold: first, to acquire a concept repertoire rich enough to effectively characterize and discriminate the documental objects for fine-grained tasks such as those of an expert searching documents in his domain of experience; and second, to acquire the relational structure that forms the thesaurus for this concept repertoire. We view the elicitation process of the concepts the experts effectively use in the framework of knowledge engineering. The acquisition of knowledge for thesaurus construction is the conceptualization
stage in knowledge engineering, in which the concepts and their relationships in a domain are elicited (see Section 4.2.1). This task is assisted by a system for knowledge elicitation [1] that interviews a domain expert. The result of this process is the incremental construction of a fuzzy relational network among documental objects and expert-defined concepts as well as a fuzzy relational network (thesaurus) among concept descriptors (with fuzzy relations such as synonymy, superordination, and subordination). Furthermore, the system constructs a document classification assisted by the domain expert and automatically learns the symbolic characterization of document classes (see Section 4.2.2). The classification is useful in itself but it is also useful as a validation tool for knowledge elicitation since it allows the domain expert to verify that class characterizations, based on object characterizations, are also correct.
4.2.1. The concept network elicitation process
The first stage in knowledge acquisition consists of the Identification and Conceptualization stages [9]. The identification stage defines the task goals and requirements, usually in a conversation between an expert and a knowledge engineer. In specialized thesaurus construction, the librarian negotiates with the domain expert the exact definition of an appropriate characterization of domain documental objects in order to fulfill the requirements in document search and retrieval applications. The conceptualization stage is supported by the interactive program EAR* (Elicit-Analyze-Refine) [14]. The system interviews an expert, incrementally constructs a concept network, validates the concept relationships, and finally automatically generates a specialized thesaurus. The man-machine dialogue consists of several interaction modes. Initially the domain expert only needs to specify a set of document objects in his domain.
Nevertheless, if there are some initial concepts for object characterization, the expert may also supply them voluntarily. The set of concepts or attributes of an object constitutes its characterization, and the set of objects having an attribute constitutes its domain. The attribute/object relation is given by a fuzzy proposition of the form "X is Q A", where X is a documental object, A an attribute, and Q a fuzzy linguistic label from a set of legal values called the contrastive set (e.g. {very | rather | medium | quite | little}). The linguistic values are represented as possibility distributions over [0, 1], where 1 represents a concept (e.g. 'Theoretical', 'Oriented to Pattern Recognition Techniques') and 0 represents the user-defined opposite, which may be an antonym (e.g. 'Applied') or a negation (e.g. 'Not Oriented to Pattern Recognition Techniques'). The possibility distribution defines the attribute/object relation value and is interpreted as the degree of relevancy of the attribute in the characterization of the object (e.g. "BOOK1 is Quite Theoretical" or "BOOK2 is Very Oriented to Pattern Recognition Techniques"). The set of linguistic values is formed by the contrastive set that the expert uses in order to characterize and discriminate among the domain objects. The contrastive set depends on the context, and its semantics is user-defined and represented by a set of possibility distributions.
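As an illustration of linguistic labels represented by possibility distributions over [0, 1], the sketch below uses triangular shapes for a hypothetical three-term contrastive set. The shapes, breakpoints and label names are assumptions made for the example (in the paper the semantics are user-defined); the antonym is taken as the mirror image V(1 - x), the form the paper cites from [18].

```python
def triangular(a, b, c):
    """Return a triangular possibility distribution on [0, 1] peaking at b."""
    def mu(x):
        if x == b:
            return 1.0
        if a < x < b:
            return (x - a) / (b - a)
        if b < x < c:
            return (c - x) / (c - b)
        return 0.0
    return mu


# Hypothetical contrastive set; placeholders, not the paper's distributions.
LABELS = {
    "low": triangular(0.0, 0.0, 0.5),
    "medium": triangular(0.0, 0.5, 1.0),
    "high": triangular(0.5, 1.0, 1.0),
}


def antonym(mu):
    """Antonym of a linguistic value: Antonym(V)(x) = V(1 - x)."""
    return lambda x: mu(1.0 - x)


# A fuzzy proposition "X is Q A" stored as (object, attribute) -> label.
characterization = {("BOOK1", "Theoretical"): "high"}

# The opposite pole ('Applied') implicitly receives the antonymic value.
applied_for_book1 = antonym(LABELS[characterization[("BOOK1", "Theoretical")]])
```

With these symmetric placeholder shapes the antonym of "high" coincides with "low", which is the behaviour one would expect of two opposite poles on the same scale.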
Conceptual knowledge elicitation and validation
The first interaction mode for concept elicitation is useful in domains where an initial set of concepts for characterization is not known clearly or on which no consensus exists. When this is not the case, it can be skipped and the expert may proceed with the second interaction mode after providing the concept repertoire and the domain documental objects.
Elicitation mode: The elicitation mode takes the objects in groups of three and asks the expert to state an attribute common to two of the objects and distinguishing them from the third one. Then it asks the user for the opposite trait of the third object and also asks him to evaluate linguistically the relevance of the new attribute with respect to the objects (specifically, not just the three evaluated objects but all the objects defined so far; in fact this process can be deferred to the point when the user wants to deal with this matter). An example of dialogue is the following:
    WHICH ARE THE MOST SIMILAR TEXTS AMONG Minsky, Carter, AND Hayes-Roth?
    ** Minsky and Carter
    WHAT CONCEPT HAVE Minsky AND Carter IN COMMON?
    ** Research Work
    WHAT CONCEPT DISTINGUISHES Hayes-Roth FROM THEM?
    ** Divulgative Work
    TO WHAT DEGREE IS Minsky A RESEARCH WORK?
    ** High
    TO WHAT DEGREE IS Carter A RESEARCH WORK?
    ** Quite High
    TO WHAT DEGREE IS Hayes-Roth A RESEARCH WORK?
    ** Low
    TO WHAT DEGREE IS Walker A RESEARCH WORK?
The two concepts introduced in the example dialogue are antonymic and form the two opposite poles in the 'Research/Divulgative' subjective domain. The degrees to which the objects verify the concepts are expressed by linguistic values from a user-defined set of contrastive terms. Implicitly, the opposite concept ('Divulgative' in our example) is assigned the fuzzy antonym of the fuzzy set representing the linguistic value: Antonym(V(x)) = V(1 - x) [18].
Refinement mode: The elicitation and dialogue techniques in this mode have two goals: (a) eliciting a repertoire of unambiguous concepts with sufficient discrimination power for the tasks at hand (the retrieval and classification tasks), and (b) acquiring a set of domain objects (documents) that is complete or at least representative of the actual domain. These two goals are achieved using several analyses that generate different knowledge relationships: (a) the implicational analysis, which generates heuristic and abstraction relations that will form the hierarchical relations in the thesaurus, (b) the object similarity analysis, used to obtain a rich and discriminant construct repertory, and (c) the construct similarity
analysis, used to obtain representative domain objects not yet acquired and to generate the synonymy relations for the thesaurus. We will now review these dialogue strategies and the different knowledge relations elicited.
Object similarity analysis computes an indistinguishability relation among the objects. The dialogue strategy ranks the most similar objects and displays them menu-like, asking the expert if he disagrees with any similarity. From the semantics of similarity relations and of the construct repertory, the system interprets a disagreement in this context as a lack of completeness in the construct repertory. The system therefore requires the expert to prove his disagreement by asking him to state a new construct that distinguishes between the 'similar' objects. For example, once the expert has chosen a similarity relation with which he disagrees, the dialogue proceeds in the following way (much of data acquisition is expedited using menus, but here it is shown as typed for simplicity):
    YOU HAVE SELECTED Rich AND Sowa. THIS MEANS THAT THEY SHOULD NOT BE SO SIMILAR, BUT FROM THE INFORMATION ALREADY ACQUIRED I DEDUCE THAT THEY ARE 0.98 SIMILAR WITHIN THE PRESENT REPERTORY OF CONSTRUCTS. CAN YOU STATE A NEW CONSTRUCT THAT DISTINGUISHES THEM? (THAT IS TO SAY SOMETHING THAT Sowa IS AND Rich IS NOT OR VICE VERSA)
    ** Cognitive Science
    WHAT IS THE OPPOSITE OF Cognitive Science? [DEFAULT Not Cognitive Science]
    ** Not Cognitive Science
    TO WHAT DEGREE IS Sowa A Cognitive Science BOOK?
    ** High
    TO WHAT DEGREE IS Rich A Cognitive Science BOOK?
    ** Very low
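The effect of this dialogue can be illustrated numerically. The sketch below assumes each linguistic value has been reduced to a single grade in [0, 1] and measures similarity as one minus the mean absolute difference over shared constructs; both choices are simplifying assumptions made for the example, since the paper does not state the indistinguishability measure it uses, and all grades are hypothetical.

```python
from itertools import combinations

# Hypothetical grades: object -> {construct: grade in [0, 1]}.
grades = {
    "Rich": {"Research Work": 0.75, "Theoretical": 0.5},
    "Sowa": {"Research Work": 0.75, "Theoretical": 0.5},
    "Minsky": {"Research Work": 1.0, "Theoretical": 0.25},
}


def similarity(a, b):
    """Assumed measure: 1 - mean absolute difference over shared constructs."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    return 1.0 - sum(abs(a[c] - b[c]) for c in shared) / len(shared)


def ranked_pairs(table):
    """All pairs of rows with their similarity, most similar first."""
    pairs = [(similarity(table[x], table[y]), x, y)
             for x, y in combinations(sorted(table), 2)]
    return sorted(pairs, reverse=True)


def transpose(table):
    """Swap rows and columns, so the same code compares constructs."""
    flipped = {}
    for obj, row in table.items():
        for construct, grade in row.items():
            flipped.setdefault(construct, {})[obj] = grade
    return flipped


before = ranked_pairs(grades)[0]

# The expert disagrees and supplies a distinguishing construct.
grades["Sowa"]["Cognitive Science"] = 0.75  # "High" (placeholder grade)
grades["Rich"]["Cognitive Science"] = 0.0   # "Very low" (placeholder grade)
after = similarity(grades["Rich"], grades["Sowa"])
```

Applying `ranked_pairs(transpose(grades))` compares constructs over the objects instead, which is the dual computation used for the construct similarity analysis.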
Construct similarity analysis is dual to object similarity analysis. It computes an indistinguishability relation among the concepts. These similarities will form the synonymy relations among concepts in the thesaurus, but first they are validated at this knowledge acquisition stage. The dialogue strategy ranks the most similar constructs and displays them menu-like, asking the expert if he disagrees with any similarity. From the semantics of similarity relations and of the object characterizations the system interprets a disagreement in this context as a lack of completeness (or representativeness if we work with a limited set of objects) of the domain objects. The system therefore requires the expert to prove his disagreement by asking him to state a new object for which these constructs behave differently:
    YOU HAVE SELECTED Knowledge Representation AND Expert Systems. THIS MEANS THAT THEY SHOULD NOT BE SO SIMILAR, BUT FROM THE INFORMATION ALREADY ACQUIRED I DEDUCE THAT THEY ARE 0.89 SIMILAR WITHIN THE PRESENT REPERTORY OF BOOKS. CAN YOU STATE A NEW OBJECT THAT IS Knowledge Representation BUT NOT Expert Systems (OR VICE VERSA)?
    NAME ME A BOOK ABOUT Knowledge Representation
    ** Marr
    TO WHAT DEGREE IS Marr A BOOK ABOUT Knowledge Representation?
    ** High
    TO WHAT DEGREE IS Marr A BOOK ABOUT Expert Systems?
    ** Low
    ...
Implication analysis elucidates the hierarchical relations among constructs. The representations of concepts and objects are type-2 fuzzy sets [8] and EAR* uses generalizations of the fuzzy implication relationship for the implication analysis [12]. Its result is a tangled hierarchy of constructs that later will be part of the data abstraction structure of the thesaurus. The data abstraction structure of hierarchical relationships elucidated by the implication analysis is also validated by the expert. In a similar way to the former examples, the results of the implicational analysis are graphically displayed to the expert. Upon any disagreement of the expert, the system asks him to prove his argument by providing a counter-example, e.g.:
    YOU HAVE SELECTED THIS IMPLICATION AS A WRONG ONE
    Expert Systems → Knowledge Representation
    COULD YOU THINK OF A COUNTER EXAMPLE, I.E. A BOOK THAT DEALS HIGHLY WITH Expert Systems BUT NOT WITH Knowledge Representation?
    ** A.I. Applications to Industry
    TO WHAT DEGREE IS A.I. Applications to Industry A BOOK ABOUT Knowledge Representation?
    ** Very low
    TO WHAT DEGREE IS A.I. Applications to Industry A BOOK ABOUT Expert Systems?
    ** High
    ...
The hierarchical relations elicited by this analysis may be of different kinds. When the implication has a certainty factor less than 1 it is a heuristic relation based on knowledge about typicality, e.g. "Knowledge Representation Schemes (usually) imply Theoretical Work" expresses a correlation between the certainty of Knowledge Representation Schemes and the Theoretical component of books. When the implication is completely certain it embodies an abstract relationship of either definition-type (e.g. "Frame-based Language is a Knowledge Representation Scheme") or generalization-type (e.g. "KRL is a frame-based Language"). These hierarchical relationships are represented in the thesaurus by the fuzzy relations Broader Term and Narrower Term.
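A simplified picture of how a counter-example lowers the certainty of an implication is sketched below. It is deliberately not the paper's method: EAR* works on type-2 fuzzy sets with generalized implications [12], whereas this sketch assumes single grades in [0, 1], the Lukasiewicz implication, and aggregation by minimum over the objects. The grades and the name "book A" are hypothetical.

```python
def implies(a, b):
    """Lukasiewicz implication on grades in [0, 1] (an assumed choice)."""
    return min(1.0, 1.0 - a + b)


def implication_degree(table, antecedent, consequent):
    """Certainty of 'antecedent implies consequent': weakest case over objects."""
    return min(implies(row[antecedent], row[consequent]) for row in table.values())


def counter_examples(table, antecedent, consequent, tolerance=0.5):
    """Objects on which the implication holds to a degree below the tolerance."""
    return [obj for obj, row in table.items()
            if implies(row[antecedent], row[consequent]) < tolerance]


ES, KR = "Expert Systems", "Knowledge Representation"

# Hypothetical grades for two books.
books = {
    "Marr": {ES: 0.25, KR: 1.0},
    "book A": {ES: 1.0, KR: 0.75},
}
before = implication_degree(books, ES, KR)

# The expert's counter-example: high on Expert Systems, very low on
# Knowledge Representation (placeholder grades).
books["A.I. Applications to Industry"] = {ES: 1.0, KR: 0.0}
after = implication_degree(books, ES, KR)
offenders = counter_examples(books, ES, KR)
```

In this toy setting an implication with degree 1 would correspond to the abstract (definition or generalization) relations, and a degree below 1 to the heuristic, typicality-based ones.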
4.2.2. Classification construction and validation
The system uses the object similarity analysis to automatically build a tentative document classification by means of a clustering algorithm. Since similarity relationships are reflexive and symmetric but not necessarily transitive, and transitivity is needed in order to
obtain a partition tree for classification, EAR* computes the transitive closure of the object similarity matrix through max-min composition. The partition tree obtained defines the set of fuzzy partitions of the documents into classes. Each partition corresponds to a minimum level of the similarity degree of the objects within each class. These classes are shown to the expert in a summarized way and he validates the classes that he deems meaningful. We will see the process in detail.
EAR* incorporates a method for learning from examples (or concept identification). This method automatically builds a conceptual description of the classes of objects, generating a prototype that embodies the typicality knowledge. Furthermore, the relationships among these classes form the classification structure. Category psychology shows that natural classes or categories are not well-defined (in the sense of having a set of necessary and sufficient features describing them). Rather, categories are commonly defined by family resemblances with respect to a scheme or prototype that specifies the typical features of the class members. A prototype in EAR* is a conceptual description of a class, i.e. a set of attributes relevant to the class characterization. We consider an attribute relevant when it is common to most of the class members. Moreover, the notion of prototype in EAR* involves a measure of the variation of values inside the class with respect to the typical value. This is implemented in the notion of the prototype as a set of attributes summarizing the class members. The fuzzy summarization of a class attribute for construct A_i is a statement of the form "A_i is [summarizer] [quantifier]", i.e. it is formed by a summarizer and a quantifier. An example of summarization is "Natural Language is usually low or very low in class Knowledge Engineering", where 'low or very low' is the summarizer and 'usually' is the quantifier. Summarization is a heuristic process comprising two contradictory desiderata.
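Returning to the transitive closure step described at the start of this subsection, the following sketch shows max-min composition iterated to a fixed point and the partitions obtained by cutting the closure at decreasing similarity levels. The function names and the 3-by-3 similarity matrix are hypothetical; only the max-min closure itself is taken from the text.

```python
def max_min_compose(r, s):
    """Max-min composition of two square fuzzy relations given as matrices."""
    n = len(r)
    return [[max(min(r[i][k], s[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]


def transitive_closure(r):
    """Iterate R <- max(R, R o R) until nothing changes."""
    closure = [row[:] for row in r]
    while True:
        composed = max_min_compose(closure, closure)
        merged = [[max(a, b) for a, b in zip(row1, row2)]
                  for row1, row2 in zip(closure, composed)]
        if merged == closure:
            return closure
        closure = merged


def partition(closure, level):
    """Classes of objects whose closed similarity is at least `level`."""
    classes = []
    for i in range(len(closure)):
        for cls in classes:
            if closure[cls[0]][i] >= level:
                cls.append(i)
                break
        else:
            classes.append([i])
    return classes


# Hypothetical reflexive, symmetric similarity matrix over three documents.
similarity = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.6],
    [0.2, 0.6, 1.0],
]
closure = transitive_closure(similarity)
tree = {level: partition(closure, level) for level in (1.0, 0.8, 0.6)}
```

Each entry of `tree` is one level of the partition tree: lowering the similarity level merges classes, which is why each partition corresponds to a minimum similarity degree within its classes.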
The first is that summaries should be specific, i.e. a summary is more informative when the summarizer is more restrictive (in the limit, when it has only one contrastive term). The second desideratum is summary validity, that is, the summarizer should cover all or most of the class members; in general, however, the more specific a summarizer is, the fewer objects it can cover.

The concept identification method is used to support the expert's task of structuring his knowledge in a multiple hierarchy of solution classes. Either the system proposes classes or the expert himself explores groupings of objects (possible classes); EAR* automatically represents them as prototypes, which are displayed to the expert, who decides which are meaningful and rejects the others (see Figure 4). The expert can easily explore different approaches to hierarchically structuring the solution classes, using the learning algorithm as a tool to analyze, display and validate knowledge about typicality.

The structure relating the classes is not in general a strict hierarchy. Classes can be grouped in different hierarchies, and EAR* allows the definition of multiple hierarchies, called taxonomies. The creation of taxonomies and the acceptance of prototypes as meaningful class descriptions proceed in a disciplined way. The existence of multiple hierarchies is meaningful if they stem from different classification criteria embodying distinct perspectives on the basic objects. When the expert organizes
the classes in different forms, the system creates new taxonomies to allow him to express, in a natural way, several structuring styles. Those multiple-criteria, conceptually-described hierarchical levels provide a sound data structuring (based on typicality knowledge) for intelligent search procedures implemented in document databases.

[Figure 4: tree diagram of the ARTIFICIAL INTELLIGENCE DOMAIN classification. Legible node labels include Knowledge Representation, K-R-Models, K-R-Foundations, K-R-Languages and Knowledge Engineering, together with author names such as Minsky, Winston, Bibel, Freksa, Fahlman, Touretzky, Winograd, Charniak, Bobrow, Waterman, Hayes-Roth, Aikins, Shortliffe and Chandrasekaran.]

Fig. 4. Partial view of the classification on AI books. Double boxed nodes denote not yet validated classes, single boxed nodes denote validated classes and machine-learned prototypes.

5. The retrieval mechanism
The KBRS core is formed by a retrieval mechanism based on fuzzy set theory and validated in previous work [13]. This is a retrieval model with expression of
weights in the query elements and in the index terms. Both aspects will be discussed. In classical retrieval an item is defined by its index terms (descriptors). In the fuzzy set model each information item is described by its index terms and by a weight associated with each of them. The index term weight is the characteristic function of the fuzzy set of index terms describing the item. An item is not simply described or not described by an index term; it is described to some degree by every index term in the thesaurus (for each item, several index terms have degree zero). The goal of the system is to calculate, for each document, a relevance value for a given query. This is done through operations on the fuzzy values, within the framework of fuzzy set theory.
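As a minimal sketch of this weighted description, assuming a plain Python dictionary per document, a term that is absent from a document's description is simply read back with degree zero. The names `fuzzy_index`, `membership` and `boolean_view`, as well as the sample terms and weights, are hypothetical and are not taken from the system described here.

```python
# Hypothetical fuzzy index: each document maps index terms to weights in [0, 1].
fuzzy_index = {
    "doc1": {"expert systems": 0.9, "fuzzy logic": 0.4},
    "doc2": {"fuzzy logic": 1.0},
}


def membership(index, doc_id, term):
    """Degree to which `term` describes `doc_id`; unlisted terms have degree 0."""
    return index[doc_id].get(term, 0.0)


def boolean_view(index, doc_id):
    """Classical (Boolean) description: the terms with a non-zero degree."""
    return {term for term, weight in index[doc_id].items() if weight > 0.0}
```

The Boolean description used in classical retrieval is recovered as the special case in which only the presence or absence of a non-zero weight is kept.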
5.1. A mathematical model of IR based on the theory of fuzzy sets

We define the following basic notions: a quadruple

$$I = (D, Q, T, \delta)$$

where $D$ is the set of documents, $Q$ the set of queries and $T$ the set of terms, with

$$\|D\| = n, \quad \|Q\| = m, \quad \|T\| = k,$$

and $\delta$ is a matching function of the form

$$\delta : D^* \times Q^* \to R. \qquad (1)$$

$R$ is the interval $[0, 1]$ and can be interpreted as the set of possible values that the matching function can return. As an answer to a given query the system returns the following set:

$$F = \{(d, t, \mu_F(d, t)) \mid d \in D \cup Q,\ t \in T\}$$

where

$$\mu_F : D \times T \to [0, 1]$$

is a function determining, for each pair $(d, t)$, the importance of the descriptor $t$ in the description of the document $d \in D$. In this model the queries are generated by the formulation of expressions formed by descriptors and the operations AND, OR, NOT, interpreted as fuzzy connectives. The matching function $\delta$ is defined as in (1), and the query evaluating process follows this algorithm:

(i) if $q$ is a descriptor,

$$\delta(q, d) = \mu_F(d, t)$$

for the term $t$ corresponding to $q$;

(ii) if $p$ and $q$ are descriptors,

$$\delta(p \wedge q, d) = \min(\delta(q, d), \delta(p, d)),$$
$$\delta(p \vee q, d) = \max(\delta(q, d), \delta(p, d)),$$
$$\delta(\neg p, d) = 1 - \delta(p, d).$$
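Rules (i) and (ii) can be sketched as a small recursive evaluator. This is an illustration under stated assumptions, not the authors' implementation: the table `MU_F`, its values, and the class names `Term`, `Not`, `And`, `Or` are hypothetical, and applying rule (ii) recursively to compound sub-expressions is an assumption that goes slightly beyond the descriptor-only wording of the rules.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

# Hypothetical values of mu_F(d, t); terms not listed have degree 0.
MU_F: Dict[str, Dict[str, float]] = {
    "d1": {"fuzzy": 0.8, "retrieval": 0.3},
    "d2": {"fuzzy": 0.2, "retrieval": 0.9},
}


@dataclass(frozen=True)
class Term:
    name: str


@dataclass(frozen=True)
class Not:
    arg: "Query"


@dataclass(frozen=True)
class And:
    left: "Query"
    right: "Query"


@dataclass(frozen=True)
class Or:
    left: "Query"
    right: "Query"


Query = Union[Term, Not, And, Or]


def delta(q: Query, d: str) -> float:
    """Matching value of query q for document d, following rules (i) and (ii)."""
    if isinstance(q, Term):  # rule (i): look up mu_F(d, t)
        return MU_F[d].get(q.name, 0.0)
    if isinstance(q, Not):  # complement
        return 1.0 - delta(q.arg, d)
    if isinstance(q, And):  # conjunction as min
        return min(delta(q.left, d), delta(q.right, d))
    if isinstance(q, Or):  # disjunction as max
        return max(delta(q.left, d), delta(q.right, d))
    raise TypeError(f"unsupported query node: {q!r}")


def rank(q: Query) -> List[Tuple[str, float]]:
    """Documents sorted by decreasing matching value."""
    return sorted(((d, delta(q, d)) for d in MU_F), key=lambda pair: -pair[1])


query = And(Term("fuzzy"), Not(Term("retrieval")))
ranking = rank(query)  # d1 is ranked above d2 for this query
```

Unlike the unordered Boolean answer set, the result is an ordered list, which is the ranking capability that the fuzzy model is meant to provide.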
5.2. Expression of the relative relevance of the query terms

Weights can be assigned to descriptors to characterize the correspondence between a term and a document (2.1), but weights can also be assigned in the query expression to express the relative importance of the query terms (the descriptors in the query expression) from the user's point of view. The weights assigned to terms thus influence the retrieval values of the documents, with higher values for documents whose description is similar to the descriptors in the query. In this retrieval model the query expression has the following structure:

$$q : \{\lambda_1 q_1 \ \mathrm{op}\ \lambda_2 q_2 \ \mathrm{op}\ \lambda_3 q_3 \ \mathrm{op} \cdots \mathrm{op}\ \lambda_n q_n\}$$

where

$\mathrm{op} \in \{\neg, \wedge, \vee\}$ (set of operators),
$\lambda_i \in [0, 1]$ (query term weight),
$q_i \in Q$ (set of queries).

These weights are evaluated in order to calculate the document retrieval relevance value following different approaches [1, 13], each interpretation model having some advantages and some drawbacks.

5.3. Expression of the weights as threshold values

It is possible to choose a threshold value $\lambda$ for each descriptor present in the query. At any point of the process we consider as elements of the fuzzy set associated with each descriptor those whose membership value is greater than $\lambda$. For each descriptor,
$A(t)$ is the fuzzy set for the descriptor $t \in T$, where $d \in D$ and $t$ is a descriptor in the query:

$$d \in A(t) \text{ with degree } \mu_F(d, t) \text{ if } \mu_F(d, t) > \lambda,$$
$$d \in A(t) \text{ with degree } 0 \text{ if } \mu_F(d, t) \le \lambda.$$

A term with a small associated $\lambda$ will have a great impact on the query, and vice versa. This approach presents some problems. For example, if the descriptor A has $\lambda_1 = 1$, consider the operation A OR B; in this case A has no impact on the answer set, but if the operation is A AND B the answer set is empty and the impact of B is missed. This approach has been studied in [5]. Among the different models that cope with this problem, we choose the nought model proposed by Sanchez [16]. In this model, if we define the term $V \in T$ that represents a descriptor with $\lambda = 0$, then

$$A \wedge V = A, \qquad A \vee V = A.$$
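The nought model defines OR and AND as follows, where $x$ is a document and $d_1$, $d_2$ are descriptors with weights $\lambda_1$, $\lambda_2$:

$$d_1\lambda_1 \vee d_2\lambda_2 = (\lambda_1 \wedge \mu_F(x, d_1)) \vee (\lambda_2 \wedge \mu_F(x, d_2)),$$
$$d_1\lambda_1 \wedge d_2\lambda_2 = ((1 - \lambda_1) \vee \mu_F(x, d_1)) \wedge ((1 - \lambda_2) \vee \mu_F(x, d_2)).$$

A descriptor with $\lambda = 0$ thus contributes 0 to a disjunction and 1 to a conjunction, as the two identities $A \wedge V = A$ and $A \vee V = A$ require. Figure 5 shows the performance of the model as implemented in [13].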
Documents                    d1    d2    d3    d4

A                            0.2   0.3   0.4   0.5
B                            0.8   0.3   1.0   0.2

A OR B                       0.8   1.0   1.0   0.3
0.4A                         0.2   0.4   0.4   0.3
0.4A OR B                    0.8   0.4   1.0   0.3

A AND B                      0.2   0.3   0.8   0.2
0.3A                         0.7   1.0   0.8   0.7
0.3A AND B                   0.7   0.3   0.8   0.2

0.5A OR 0.5B OR (A AND B)    0.5   0.5   0.8   0.3

Fig. 5. Nought model performance.
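The weighted connectives of the nought model can be sketched as follows. The sketch assumes that an operand of OR contributes min(λ, μ) and an operand of AND contributes max(1 − λ, μ), which is the reading under which a descriptor with λ = 0 is neutral (A ∧ V = A, A ∨ V = A); it also assumes that an unweighted operand carries λ = 1. The function names and the membership degrees chosen for A and B are hypothetical inputs for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Assumed membership degrees mu_F(x, A) and mu_F(x, B) for four documents.
A: Dict[str, float] = {"d1": 0.2, "d2": 1.0, "d3": 0.8, "d4": 0.3}
B: Dict[str, float] = {"d1": 0.8, "d2": 0.3, "d3": 1.0, "d4": 0.2}
DOCS = ("d1", "d2", "d3", "d4")

Weighted = Tuple[float, float]  # (query weight lambda, membership degree mu)


def weighted_or(*operands: Weighted) -> float:
    """OR: each operand contributes min(lambda, mu); lambda = 0 gives 0."""
    return max(min(lam, mu) for lam, mu in operands)


def weighted_and(*operands: Weighted) -> float:
    """AND: each operand contributes max(1 - lambda, mu); lambda = 0 gives 1."""
    return min(max(1.0 - lam, mu) for lam, mu in operands)


def row(query: Callable[[str], float]) -> List[float]:
    """Evaluate a query for every document, rounded for display."""
    return [round(query(x), 2) for x in DOCS]


# 0.4A OR B (B is unweighted, so it is given lambda = 1 by assumption)
or_row = row(lambda x: weighted_or((0.4, A[x]), (1.0, B[x])))

# 0.3A AND B
and_row = row(lambda x: weighted_and((0.3, A[x]), (1.0, B[x])))

# 0.5A OR 0.5B OR (A AND B)
mixed_row = row(
    lambda x: weighted_or(
        (0.5, A[x]),
        (0.5, B[x]),
        (1.0, weighted_and((1.0, A[x]), (1.0, B[x]))),
    )
)
```

With these assumed inputs the sketch yields 0.5, 0.5, 0.8, 0.3 for the last query, the values listed in the bottom row of Fig. 5. A descriptor with weight 0 leaves both connectives unchanged, for example `weighted_or((1.0, 0.6), (0.0, 0.9))` and `weighted_and((1.0, 0.6), (0.0, 0.9))` both return 0.6.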
5.4. The retrieval mechanism algorithm

The inference mechanism builds a query from the information need expressed by the user (a query as defined in 5.1, 5.2 and 5.3). This query is used by the retrieval mechanism to retrieve the document set. The interpretation model for the descriptors in the query expression is chosen using knowledge stored in the KDB, and depends on the user's expertise and on the nature of the requested retrieval process. This system module is based on previous system implementations, as described in [13].
6. The application at the Computer Science School

Our work has been tested on small document collections with interesting results, but a question arises from those results: how does this approach behave for a real library? The application is designed to answer this question. We propose to use a library (the Computer Science School Library) and data presently available in computer-readable format. A standard thesaurus (with relations established by a classification module) will also be used. The Computer Science Library has around 2500 documents (monographs). Our university (UPC, Universitat Politècnica de Catalunya) has a library computer system based on the MARC formats. On-line retrieval is available to the library personnel, and standardized cataloguing is being carried out (based on AACR2, the Anglo-American Cataloguing Rules, 2nd edition). The data catalogued by the university librarians will be used, with the inherent drawbacks this choice can have, such as deficient descriptor assignment or overly broad classification (poor use of specific terms).
7. Concluding remarks

This paper reports some research results and a proposal for a system based on those results. We have studied the information retrieval process in the framework of Cognitive Science and in the light of our own experience. We present a new approach to the retrieval process based on knowledge, and a current implementation using state-of-the-art technology and theory in Artificial Intelligence. The use of fuzzy reasoning and fuzzy document characterization, the inferencing mechanism that applies the retrieval rules, and the thesaurus definition with fuzzy relations based on synonymy and implication are the outstanding structural notions of our approach.
References

[1] M. Bärtschi, An overview of information retrieval subjects, IEEE Computer 18(5) (1985) 67-84.
[2] J. Bezdek, G. Biswas and Li-Ya Huang, Transitive closures on fuzzy thesauri for information-retrieval systems, Internat. J. Man-Machine Stud. 25 (1986) 343-356.
[3] G. Biswas, J.C. Bezdek, M. Marques and V. Subramanian, Knowledge-assisted document retrieval: Part I and Part II, J. Amer. Soc. Inform. Sci. 38(2) (1987) 83-96.
[4] A. Bookstein, Probability and fuzzy-set applications to information retrieval, Ann. Rev. Inform. Sci. Technol. (ARIST) 20 (1985).
[5] D.A. Buell and D.H. Kraft, Threshold values and Boolean retrieval systems, Fuzzy Sets and Systems 7 (1981) 35-42.
[6] U. Cortés, Esquema multinivel para la adquisición y tratamiento de información en escenarios 2D y 3D, Ph.D. Thesis, Facultat d'Informàtica, Universitat Politècnica de Catalunya, Barcelona (1984).
[7] R. Davis, Intelligent Information Systems. Progress and Prospects (Ellis Horwood, Chichester, 1986).
[8] D. Dubois and H. Prade, Théorie des Possibilités (Masson, Paris, 1985).
[9] F. Hayes-Roth, Building Expert Systems (Addison Wesley, Reading, MA, 1983).
[10] J. Jacas, Contribució a l'estudi de les relacions d'indistingibilitat i a les seves aplicacions en els processos de classificació, Ph.D. Thesis, Facultat d'Informàtica, Universitat Politècnica de Catalunya, Barcelona (1987).
[11] J. Kacprzyk and A. Ziolkowski, Database queries with fuzzy linguistic quantifiers, IEEE Trans. Systems Man Cybernet. 11(12) (1986) 816-821.
[12] R. López de Mántaras, J. Agustí, E. Plaza and C. Sierra, Milord: A fuzzy expert system shell, in: A. Kandel, Ed., Fuzzy Expert Systems (CRC Press, Boca Raton, FL, 1990).
[13] J. Manero and U. Cortés, New results in classification and document retrieval using fuzzy tools, Proceedings Fall International Seminar on Applied Logic, Mallorca (University of Mallorca Press, 1987).
[14] E. Plaza and R. López de Mántaras, Model-based knowledge acquisition for heuristic classification systems, Proceedings ECAI-European Conference on Artificial Intelligence (Pitman, London, 1988).
[15] G. Salton and M.J. McGill, Introduction to Modern Information Retrieval (McGraw-Hill, New York, 1983).
[16] E. Sanchez, Thinking about nought. Personal communication, NATO-ASI Seminar on Fuzzy Sets Theory and Applications, Louvain-la-Neuve, Belgium (1985).
[17] R. Yager, A logical on-line bibliographic searcher: An application of fuzzy sets, IEEE Trans. Systems Man Cybernet. 10(1) (1980).
[18] L.A. Zadeh, PRUF: A meaning representation language for natural language, Internat. J. Man-Machine Stud. 10 (1978) 395-460.