Enhancing information systems management with natural language processing techniques

Elisabeth Metais
CEDRIC Laboratory, CNAM of Paris, 292 rue Saint Martin, 75141 Paris Cedex 03, France
E-mail address: [email protected] (E. Metais).

Data & Knowledge Engineering 41 (2002) 247–272

Received 4 December 2001; received in revised form 10 December 2001; accepted 19 December 2001

Abstract

Natural language and databases are core components of information systems. They are related to each other because they share the same purpose: conceptualizing aspects of the real world in order to deal with them in some way. Natural language processing (NLP) techniques may substantially enhance most phases of the information system lifecycle, starting with requirements analysis, specification and validation, and going up to conflict resolution, result processing and presentation. Furthermore, natural language based query languages and user interfaces facilitate access to information for anyone and allow for new paradigms in the usage of computerized services. This paper investigates the use of NLP techniques in the design phase of information systems. It then reports on database querying and information retrieval enhanced with NLP. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Natural language processing; Natural language; Information systems; Database design; Database querying

1. Introduction

‘‘Sharing information’’ is the emerging theme in the evolution of information systems. Information is nowadays exchanged as it never has been before. Shared access to data is expected via the Internet or intranets. Much effort is put into finding new exchange formats such as XML. Inside organizations, operational data are gathered for common storage in Enterprise Resource Planning systems. For decision-support purposes, legacy data, external data and operational data are extracted into Data Warehouses.




Gathering and sharing information leads to an explosion of activities like information retrieval, data integration, data mining and text mining. Such complex activities cannot be performed without a strong understanding of the manipulated data. We particularly need a better understanding of:
• the real world, for an adequate storage of its representation;
• the information stored in the information system, to allow its exploitation;
• the user’s needs, to fulfill his/her requirements.

Managing this large amount of information is only possible if CASE tools are used. However, users are faced with many problems when using these CASE tools. Most of these problems are due to misunderstanding between the tools and users, and this can occur at different levels:
• tools ignore the meaning of the words which represent the names of the objects or the content of the objects;
• users may have problems in understanding models and languages (e.g. abstract modeling constructs and concepts, SQL queries);
• in the stored data, much is lost from what was in the user’s mind when introducing these data (e.g. in a conceptual schema).

Heterogeneity, both in form and content, is another problem faced by information systems. However, new research in CASE tools has addressed this issue and attempted to solve this problem in an integrated and global way, usually by middleware in an architecture based on several tiers. Techniques for solving linguistic differences have to be found for both monolingual and multilingual systems. New matching algorithms and a way to measure semantic distances between words, or more complex structures, will be the research challenge of the next years in the data- and knowledge-engineering area.

Another important feature we can notice in the trend of information systems is the increasing size of the volumes of data stored. The information system field does not have a monopoly on this tendency, and it is usually considered a very positive point thanks to advances in storage support research. However, in our field, it entails a need for intensive research on techniques to deal with this large volume. Larger volumes also entail more frequent changes; thus attention has to be paid to real time algorithms and event-based programming. New techniques for indexing, filtering, and summarization will be needed.

The increasing size and complexity of information systems not only require modular modeling approaches but also motivate the reuse of existing schemas for the construction of the respective modules. The main difficulty met during the exploitation of existing schema definitions is to find the most suitable schema that satisfies the designer’s requirements. Consequently, a reuse approach must rely on techniques developed to capture a schema’s semantics and to supply semantic retrieval mechanisms.

Information systems are highly user oriented. Users supply, check and validate requirements. Users are also part of their operation and the key to their success. The tendency is to open up more toward the general public, while the latter becomes more demanding, more aware,


more competent and more critical. However, new technologies are more and more complex and need high-level specialists to be operational. Thus, new methodologies and tools need to include a better way of communication between the different stakeholders. Linguistic theories and natural language processing (NLP) techniques are increasingly used as a means for designing and realizing more user-oriented information and communication systems.

Natural language and databases are core components of information systems. They are related to each other because they share the same purpose: conceptualizing aspects of the real world in order to deal with them in some way. Natural language is used by humans both for communication and for thinking (conceptualization, memorization, linking views). In information systems, natural language progressively plays similar roles, first by facilitating the user dialogue and second by an automated understanding of the semantics of the stored information (e.g. for checking, reusing or integrating conceptual schemas). For several years, the trend in natural language research has been oriented towards the elaboration of huge linguistic dictionaries and ontologies, including relations between concepts and common sense. The exploitation of such dictionaries, together with sophisticated parsers, fulfills some ‘‘understanding’’ requirements:
• a better understanding by the user (specification interface, validation, documentation);
• a better understanding by the tool (catching the semantics of schemas, data and metadata).

NLP techniques may substantially enhance most phases of the information system lifecycle, starting with requirements analysis, specification and validation, and going up to conflict resolution, result processing and presentation. Furthermore, natural language based query languages and user interfaces facilitate access to information for anyone and allow for new paradigms in the usage of computerized services. This is exactly what today’s users expect from any kind of information service and tool, namely a high level of ‘‘intelligence’’. The ability to deal with natural language is naturally the first visible sign of such intelligence. An increasing number of research projects have been undertaken in this field, mainly since the early 1990s. They all promise to be stronger and more productive in the coming years.

The remainder of this paper is organized as follows: Section 2 reports on the use of NLP techniques in the design phase of information systems. Section 3 focuses on information system querying; it shows how specific problems in federated systems and information retrieval may be addressed with NLP.

2. Natural language processing techniques in database design

Database design is mainly based on CASE tools. In addition to storage and graphical interface functionalities, these tools are expected to be more intelligent and somehow behave like experts in the field. This is made possible by contributions from other fields and particularly by advances in natural language knowledge and techniques. Nowadays, CASE tools are expected to have some knowledge beyond the information stored for the current application. This knowledge ranges from simple word meanings to the complex understanding of contexts.


2.1. Natural language specifications

Using natural language as an explicit input for conceptual design is an old challenge. With the entity–relationship model, Chen already argued that entities and relationships represent nouns and verbs, respectively, from the natural language description of the model [17]. Moreno and van de Riet have shown the equivalence between linguistic patterns and their corresponding conceptual patterns in [61] by using a common translation comprising first order logic and set theory. Tools have to deal with natural language inputs according to a syntactic analysis, a semantic analysis and a pragmatic interpretation (the latter aims to relate the result of the analysis to entities, relationships and other ER structures).

SECSI [5] allows defining a MORSE conceptual database schema from a restricted subset of natural language. In the ANAPURNA project [28] binary relational schemas are built from natural language sentences. The ACME system [47] accepts a domain description using natural language and derives an extended entity–relationship model. AMADEUS [4] draws for each input natural language sentence the corresponding NIAM schema. ISTDA [8], MODELER [77] and RADD [9] are other design tools which accept natural language sentences from the user and produce a database schema. On-going for several years, COLOR-X, NL-OOPS and NIBA are very promising projects that have gathered teams of linguists and teams of computer scientists. Initiated in 1992 by van de Riet, the first big project aiming to use natural language for information system design is the COLOR-X project [12,14]. COLOR-X is based on strong linguistic theories and addresses both the dynamic and the static aspects of systems. COLOR-X is part of the LICS project within the LIKE project [80]. The LIKE project aims to use linguistic instruments in the area of knowledge engineering; LICS focuses on information systems. The NL-OOPS project [58], started in 1994, aims at building a tool that generates object-oriented models from natural language requirements. Directed by Mayr, the NIBA project [31] deals with natural language based specifications and focuses on the KCPM intermediate model.

Extracting data structures from natural language text is a hard problem which may differ from natural language understanding or natural language translation. Indeed, within a text written in natural language, database models capture only part of the global semantics. Other aspects, which deal with the processing and dynamics of the described information system, are not captured in static data models. Extracting knowledge relevant to conceptual modeling mainly consists of solving two problems: sorting relevant from irrelevant assertions, and stating correspondences between natural language concepts and conceptual modeling concepts. Within the semantic part which can be captured by a conceptual data model, one of the hardest problems when analyzing natural language sentences is to decide whether a term should be treated as an attribute, an object, a relationship or an integrity constraint. None of the classical techniques used in NLP can solve this problem; only expert rules added to these techniques may produce appropriate results. At first glance, a sentence is usually turned into a conceptual schema by abstracting verbs into relationships, subjects and complements into participating entities, and adverbs and adjectives into attributes.
Some particular verbs are recognized as well-known relationships; for example the verb ‘‘to be’’ usually indicates a generalization link, the verb ‘‘to have’’ indicates a relationship role or a link between an entity (or a relationship) and its attribute.
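To make these abstraction heuristics concrete, the following minimal sketch (in Python) maps pre-parsed (subject, verb, complement) triples onto entities, generalization links, attribute links and relationships. The triple format, the rule set and the names are illustrative assumptions, not the algorithm of any of the tools cited above.

# Minimal sketch of the abstraction heuristics described above. It assumes a
# prior syntactic analysis has produced (subject, verb, complement) triples;
# the rules and data structures are illustrative, not those of a cited tool.
def triples_to_schema(triples):
    schema = {"entities": set(), "relationships": [], "is_a": [], "attributes": []}
    for subject, verb, complement in triples:
        if verb in ("is", "is a", "is an", "are"):
            # the verb "to be" usually indicates a generalization link
            schema["entities"].update([subject, complement])
            schema["is_a"].append((subject, complement))
        elif verb in ("has", "have"):
            # the verb "to have" links an entity to one of its attributes
            schema["entities"].add(subject)
            schema["attributes"].append((subject, complement))
        else:
            # any other verb is abstracted into a relationship
            schema["entities"].update([subject, complement])
            schema["relationships"].append((subject, verb, complement))
    return schema

if __name__ == "__main__":
    parsed = [("employee", "is a", "person"),
              ("person", "has", "name"),
              ("supplier", "delivers", "product")]
    print(triples_to_schema(parsed))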


Sentences may be interpreted as independent units, but they also appear in the context of a global text. The interpretation of a given sentence can be modified by the interpretation of subsequent sentences. For example, from the following sentence, ‘‘a product has a number, a unit-price, and a supplier’’, we understand that there is an entity named ‘‘product’’ characterized by three attributes: ‘‘number’’, ‘‘unit-price’’, and ‘‘supplier’’. By adding a new sentence, such as ‘‘each product’s supplier, described by his/her name and his/her address, supplies one to ten parts’’, we modify the previous interpretation by transforming the attribute ‘‘supplier’’ into an entity described by two attributes (‘‘name’’ and ‘‘address’’) and a relationship (‘‘supplies’’) which links it to ‘‘product’’ (a small sketch of this promotion is given at the end of this passage). Furthermore, the second sentence introduces additional complexity related to the usage of synonyms (‘‘product’’ and ‘‘parts’’) which has to be resolved by using a dictionary.

Redundancy is a frequent problem in textual specifications. Some new sentences, although true, do not augment the semantics of the application, as the new facts can be inferred from the previous ones. For example, in the following description, the third sentence is redundant with the first two: ‘‘A person has a name and an age. An employee is a person. An employee has a name and an age.’’ In the specification ‘‘Employees and secretaries are persons. A secretary is an employee’’, the second sentence makes part of the first one redundant.

Techniques have been proposed to disclose the dynamic part of the specification. In the DATAID project [25], natural language sentences are transformed into a formal description of data, operations, events and conditions. OICSI [67] enhanced the Fillmore cases with some more modeling-oriented cases (e.g. ‘‘owner’’, ‘‘owned’’). In addition, verbs are classified from a modeling point of view (e.g. a class of verbs expressing an ownership, a class of verbs corresponding to an action). RADD-NLI [10] extracts from a German specification the behavior of the information system in order to build a dynamic model. Verbs corresponding to an action are recognized by the use of a lexicon. Antonymy of verbs (e.g. to borrow and to return) is exploited to draw up the succession of actions in a process. Vadera and Meziane propose an interactive approach to transform English sentences into formal specifications [79]. An approach to support the use of natural language in order to capture textual scenarios and the construction of Use Case specifications, based on a Case Grammar, is presented by Rolland and Ben Achour in [68].

To reduce the complexity of natural language, often only a restricted grammar is parsed, leading to a technical jargon that is easy for the designer to specify and easy for the CASE tool to understand. In the KASPER project [63], a very restricted language called ‘‘normalized language’’, using a standard grammar and standard terms, is enforced. This makes it understandable by people using different languages (translations are trivial), and the CASE tool can easily transform it into conceptual structures. However, some experts (e.g. Sabah in [69]) may argue that this simplicity provides only the appearance of a natural language: it is not the usual natural language, which deals with the three essential aspects of polysemy (homonymy, homotaxy), paraphrase (synonymy, allotaxy and definition) and relation to the context (anaphora, implicit, trope and spot).
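Returning to the ‘‘product’’/‘‘supplier’’ example above, the following minimal sketch shows how such an incremental re-interpretation could be recorded: the attribute is promoted to an entity with its own attributes and linked back by a relationship. The data structures and the function name are illustrative assumptions.

# Sketch of the incremental re-interpretation discussed above: the attribute
# ''supplier'' of ''product'' is promoted to an entity with its own attributes
# and linked back to ''product'' by the relationship ''supplies''.
def promote_attribute_to_entity(schema, owner, attribute, new_attributes, relationship):
    schema["attributes"][owner].remove(attribute)        # no longer a mere attribute
    schema["attributes"][attribute] = list(new_attributes)
    schema["relationships"].append((attribute, relationship, owner))
    return schema

schema = {"attributes": {"product": ["number", "unit-price", "supplier"]},
          "relationships": []}
promote_attribute_to_entity(schema, "product", "supplier",
                            ["name", "address"], "supplies")
print(schema)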
Some research projects, such as DMG [78] and NIBA [31], have extended their languages to more complex sentences, and their CASE tools have integrated recent advances in parsing techniques.


As described previously, the interpretation of a natural language specification is not only a syntactic process, but a very high-level semantic and pragmatic process based on expert knowledge coming from research in NLP and from database modeling. Nevertheless, natural language interfaces are appealing components of CASE tools and should be integrated. Research in conceptual modeling from the natural language interface perspective concerns many aspects. This includes better natural language parsing, better knowledge elicitation, and the sorting and recovery of pertinent information with respect to conceptual modeling. To answer this multi-faceted problem, the trend is to propose an intermediate-level representation that models the result of the linguistic analysis of the texts (syntactic and semantic analysis) and serves as an input for the conceptual schema elaboration (pragmatic analysis). The underlying model may be created for that purpose, such as the CPL model [26], based on mathematics and linguistics, for COLOR-X, or the KCPM model in the NIBA project [33]. Traditional representation models in NLP such as Sowa’s conceptual graphs have been used in the KHEOPS project [2]. The tendency in recent years is to use description logics such as KL-ONE [42].

The need to understand natural language specifications in CASE tools entails new issues in NLP. All improvements in NLP techniques will have an impact on the use of natural language specifications by CASE tools. Among the latest developments in NLP used for conceptual modeling are the recognition of person names, company names and geographical names [64,83] and the understanding of technical specifications with an ambiguity measure [59]. However, some specific needs have to be addressed, such as the elicitation of cardinalities [33]. Methodologies for incremental parsing should be stated to fit the information systems design cycle. The possibility of revising an interpretation through new statements or through a dialogue should be included in the parsing, as proposed in [9]. Similarly, any progress in the elaboration of dictionaries will be beneficial to the understanding of natural language specifications by providing a better understanding of the information stored on each concept. It would be very useful to have special features on verbs concerning their categorization into the static or dynamic aspect of the system, or stating whether they could be considered synonymous with ‘‘being a property of’’ or ‘‘being a generic of’’ other concepts. Similarly, some usual cardinalities could be stored as a complement to relationships. A promising extension of the WordNet dictionary (called WordNet++), which contains a number of special types of relationships not available in WordNet, is presented in [72]. The intensive use of WordNet in the information system area should influence its future.

Concerning the pragmatic analysis, conceptual models are becoming more and more complex. As an example, with the UML notation, subtle distinctions have to be established in order to choose between ‘‘relationship’’, ‘‘aggregation’’ and ‘‘composition’’. On the other hand, following hundreds of proposals for new conceptual models, UML seems to be becoming a standard in industry. Consequently, a strong investment in natural language specification for UML schemas should be made. Other tools can be proposed to support natural language specification understanding. Building tools based on forms is not new [1,20], but the standard representations and high availability currently provided by the Internet will quickly lead to powerful industrial CASE tools dealing with structured and semi-structured knowledge.
The wide success of the Internet leads to the use of many HTML or XML documents as input knowledge sources for conceptual modeling. Both tags and data have to be parsed. Reciprocally, XML may be considered as the DBMS target model for a natural language text input; pertinent semantic tags then have to be generated by a deep analysis of the text.


2.2. Schema validation

Conceptual schema validation deals with the conformance of a given schema with respect to user requirements. It is one of the important issues that contribute to deciding whether a conceptual schema is good or not. One of the techniques frequently proposed for validation is checking the conformance of the conceptual schema with the application processes. This is much more suited to checking completeness than to checking whether conceptual entities and relationships effectively represent the semantics in users’ minds.

A technique explored for several years by research tools consists in paraphrasing conceptual schemas in natural language. The validation process of the conceptual schema is transformed into a validation process of a text, which is much better adapted to end users than abstract data structures. Paraphrasing may be split into two theoretical parts, namely ‘‘deep generation’’ and ‘‘surface generation’’ [53]. The deep generation corresponds to the question ‘‘What to say?’’ and the surface generation corresponds to the question ‘‘How to say it?’’.

In the deep generation, the paraphrasing algorithm strongly depends on the conceptual model. To each model is associated a set of rules linking the concepts of the model to linguistic component types. Patterns of sentences are also elaborated for the translation of cardinalities. However, paraphrasing an ER-like model, without any other knowledge source, may produce only generic sentences like ‘‘Leasing is a relationship between agency, person and vehicle’’, instead of a more natural and pertinent one like ‘‘A person leases a vehicle to an agency’’. To reach this level of paraphrasing, three kinds of solutions have been explored: (i) the first one consists in enriching the conceptual model with linguistic considerations, (ii) the second one consists in adding a linguistic level between the specification and the conceptual level, (iii) the third one (LIKE [11], COLOR-X [14], KISS [37], KHEOPS [2]) aims to deduce from external sources, such as linguistic dictionaries, the linguistic information which has been lost during conceptual modeling.

Several large-scale projects aiming to build electronic linguistic dictionaries have been proposed: the CYC project [49,50], which also provides some ‘‘common sense’’ knowledge, the EDR project [27], WordNet [29], which has the advantage of being currently available and free in the public domain, and EuroWordNet [84], an adaptation of WordNet to European languages. Beyond syntactic and morphological information, these dictionaries provide semantic links between concepts such as synonym, antonym, hyponym/hypernym (is-a) and meronym/holonym (part-of). An extensive discussion of relation types is presented in [73]. The ‘‘is-a’’ links are particularly useful and are organized into a ‘‘hierarchy of concepts’’ such as the one in Fig. 1. In this hierarchy, nodes at the first level under the top may be considered as types and are called ‘‘semantic constraints’’.

Other useful links between concepts are supported by the ‘‘canonical graphs’’. The canonical graphs of Sowa [71] are structures that specify the expected relationships between concepts. For this purpose, they use the ‘‘semantic cases’’ introduced by Fillmore [30]. Semantic cases express the role played by each noun in a sentence (e.g. in the sentence ‘‘Workers manufacture shoes with leather’’, ‘‘Workers’’ has the semantic case ‘‘agent’’, ‘‘shoes’’ has the semantic case ‘‘result’’ and ‘‘leather’’ plays the role of ‘‘material’’).
These cases are more semantic than the usual syntactic cases (subject, direct object, adverbial phrase of place, etc.) in that they are not dependent upon the surface structure of the sentence (i.e. the syntactic construction of the sentence). Canonical graphs provide the ‘‘semantic constraint’’ expected from each ‘‘semantic case’’.


Fig. 1. A sample of the hierarchy of concepts.

Fig. 2. Sample of canonical graph.

Fig. 2 shows an example of a canonical graph, where ‘‘agent’’, ‘‘object’’ and ‘‘location’’ are the ‘‘semantic cases’’, and ‘‘Human’’, ‘‘Physical-object’’ and ‘‘Place’’ are the ‘‘semantic constraints’’. This canonical graph means that if we have a sentence ‘‘X delivers Y at Z’’, X is the one who delivers (the agent of the action) and must satisfy the semantic constraint ‘‘Human’’, Y is the thing to be delivered (the object of the action) and must be a ‘‘Physical-object’’, and Z is the location and must be a ‘‘Place’’.

A contribution of these linguistic tools to conceptual schema paraphrasing is to deduce the semantic role played by each entity in a relationship. Roles may be deduced by using two types of information stored in the semantic dictionary: the canonical graph of the relationship and the hierarchy of concepts that includes the entities. A matching is then tried between each entity and a semantic case of the canonical graph. To fulfill this correspondence, the entity has to verify the semantic constraint assigned to the case; this checking is performed on the hierarchy of concepts. We illustrate this technique with the help of the example of Fig. 3(a) and (b). The objective is to show how to deduce the missing roles in the schema of Fig. 3(a); a small sketch of the underlying constraint check is given below.
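The sketch below illustrates this constraint check with WordNet standing in for the hierarchy of concepts of Fig. 1; the constraint synsets chosen for ‘‘Human’’, ‘‘Physical-object’’ and ‘‘Place’’ are assumptions made for illustration, and the detailed walkthrough of the Fig. 3 example follows.

# Minimal sketch: an entity satisfies a semantic constraint if one of its noun
# senses has the constraint synset among its hypernyms. WordNet (requires
# nltk.download('wordnet')) stands in for the hierarchy of concepts of Fig. 1;
# the constraint synsets below only approximate ''Human'', ''Physical-object''
# and ''Place'', so an entity may satisfy several constraints and further
# preference rules would be needed to keep a single role.
from nltk.corpus import wordnet as wn

CONSTRAINTS = {
    "agent": wn.synset("person.n.01"),
    "object": wn.synset("physical_object.n.01"),
    "location": wn.synset("location.n.01"),
}

def satisfies(entity_name, constraint):
    for sense in wn.synsets(entity_name, pos=wn.NOUN):
        if constraint == sense or constraint in sense.closure(lambda s: s.hypernyms()):
            return True
    return False

for entity in ("supplier", "product", "warehouse"):
    roles = [case for case, constraint in CONSTRAINTS.items()
             if satisfies(entity, constraint)]
    print(entity, roles)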

Fig. 3. (a) A conceptual schema; (b) the schema enriched with semantic roles.


In the canonical graph of Fig. 2, among all the semantic constraints of the verb ‘‘deliver’’ (‘‘Human’’, ‘‘Physical-object’’ and ‘‘Place’’), the entity name ‘‘Supplier’’ verifies the first one, ‘‘Human’’, because ‘‘Human’’ is a generic of ‘‘Supplier’’ in the hierarchy of concepts (Fig. 1). Consequently, ‘‘Supplier’’ can only play the role of ‘‘agent’’. In a similar way, ‘‘Product’’ can only play the role of ‘‘object’’ because the only semantic constraint it verifies is ‘‘Physical-object’’, and ‘‘Warehouse’’ can only play the role of ‘‘location’’.

In the surface generation, other linguistic knowledge is needed for building the sentences. It includes syntactic, morphologic and pragmatic knowledge developed for NLP. For example, to generate French sentences, we have to know the gender of the nouns, which depends on their meaning, while homonyms may have different genders. Some researchers have worked on the coherence and readability of the discourse. Dalianis proposes a method (specific to conceptual model paraphrasing) to aggregate atomic sentences in order to present a synthetic discourse [24]. The presentation of the discourse is important because the purpose of paraphrasing is to bring the conceptual schema closer to non-expert users. The paraphrasing may be triggered for a part of the schema by a hypertext system (e.g. HypER [51]) which allows navigating through the conceptual schema and associating with a node everything related to it.

Paraphrasing techniques are particularly easy and powerful to use when the CASE tool already offers a natural language interface. The user can get feedback from the system and can compare his/her initial specification text to the one re-engineered by the CASE tool. He/she can modify it and submit it again to the design tool, which will generate another schema refinement. The process is repeated until a satisfactory schema is obtained.

Formal representations are required for both the static and the dynamic schema. The introduction of some intelligence in CASE tools entails sophisticated models for supporting reasoning, including the representation of sophisticated constraints and relationships between objects. Mathematical foundations and all kinds of logics (temporal logic, modal logic, description logic) are required. On the other hand, new applications involve end-users in the early stages of the design. Paraphrasing seems to be an appropriate approach to solve this problem and, in spite of the fact that very little work has been performed on this topic, paraphrasing a conceptual schema should be seen as a great challenge for researchers in NLP. Much missing information in the schema has to be deduced by intelligent rules exploiting both NLP resources and modeling information such as cardinalities or object types. Summary techniques and multilingual aspects also need to be included in this research.
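As an illustration of the surface generation discussed above, a template-based sketch for the enriched relationship of Fig. 3(b) could look as follows; the sentence pattern and role names are illustrative assumptions, not the generation method of [24] or of the tools cited above.

# Minimal sketch of a template-based surface generation step: once the semantic
# roles of a relationship are known, a sentence pattern turns the schema
# fragment into plain language. The pattern is an illustrative assumption.
PATTERN = "A {agent} {verb}s a {object} at a {location}."

def paraphrase(verb, roles):
    return PATTERN.format(verb=verb, **{case: name.lower() for case, name in roles.items()})

roles = {"agent": "Supplier", "object": "Product", "location": "Warehouse"}
print(paraphrase("deliver", roles))   # -> A supplier delivers a product at a warehouse.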
2.3. Specification checking

A good conceptual schema has to be intrinsically correct, consistent, complete and non-redundant. Consistency is defined both with respect to the syntactic rules of the conceptual model and with respect to the semantic rules. A schema is syntactically consistent if it satisfies the construction rules of the model. A conceptual schema is semantically consistent with respect to a conceptual model if the concepts are used according to their definition and if no contradiction can be found among the concepts of the schema. The first part of the definition is in general hard to verify.

Whereas numeric data carry well-known semantics embedded in computational components (for example, dividing by zero is not accepted by computers), the non-numeric data used by information systems remain hermetically closed to traditional treatments.


Usually, CASE tools starting from graphical specifications or from natural language do not check the semantics of the words. Specifying that ‘‘books eat bottles’’ would certainly not entail any reaction in traditional information systems management tools. An attempt to check semantic correctness automatically is performed for the COLOR-X project in [82]. The idea is to establish some mandatory semantic features for the names of the different concepts. For example, the concept of ‘‘action’’ in a state/transition diagram has to be named by a ‘‘verb’’ having the three features ‘‘telic’’, ‘‘dynamic’’ and ‘‘controlled’’ and not having the feature ‘‘durative’’. The domains of these features are the ones used by linguistic dictionaries in order to describe their entries. Automatic checking is then possible. All research results in NLP concerning the semantic inconsistency of a sentence have to be adapted to the information systems field. Semantic checking based on the meaning of the words is an exciting issue and will strongly contribute to making CASE tools more intelligent.

2.4. Quality assessment

Quality assessment of a conceptual schema deals with desired properties which cannot be proved by a logic-based approach. These properties define a subjective evaluation of a conceptual schema and assign a value which is checked against the user’s expectations. Among these properties, we can mention the readability and reusability of conceptual schemas. The readability of a schema may be measured by two criteria: the percentage of the schema that the reader may understand, and the time needed for this understanding. In order to fulfill these criteria, a schema has to be as close as possible to the real world it is supposed to represent. A schema is close to the real world if (i) the names of entities and relationships correspond to usual names and verbs used in the application domain, (ii) the construction rules are used in order to reflect the real world, and (iii) the objects are clustered in the graphical representation with respect to semantic criteria.

The starting point to evaluate the readability of names may be a general electronic dictionary or a specialized business domain dictionary. Each name given to an entity in the conceptual schema must exist in the dictionary (a small sketch of such a check is given at the end of this subsection). If the entity name does not match any entry in the dictionary, this means that either the dictionary is not complete or the entity name is not a concept of the real world. In the first case, a new entry has to be added to the dictionary; in the second case, the name is rejected and the entity has to be renamed with an explicit and recognized term. Until now, data dictionaries have been used by CASE tools only to accept and manage abbreviations. The introduction of general linguistic dictionaries may stimulate the use of understandable names for entities, roles and relationships.

The graphical presentation of complex schemas also influences their readability. Graphical algorithms aiming to minimize the number of crossings of arcs are not sufficient because they do not take any semantics into account. A semantic classification is required to divide large schemas into relatively small semantic units by grouping entities and relationships which deal with the same subject in the real world. Clustering techniques based on the exploitation of cardinalities are defined by Comyn et al. [23]. A new issue which would lead to visible results is the elaboration of clustering algorithms based on the semantic analysis of the objects’ names by using NLP techniques.
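The name check mentioned above can be sketched minimally as follows, with WordNet standing in for the general (or business-domain) dictionary; the use of WordNet and the example names are assumptions made for illustration.

# Minimal sketch of the dictionary-based name check: an entity name that matches
# no dictionary entry is either missing from the dictionary or not a real-world
# concept and should be renamed. WordNet (nltk.download('wordnet')) stands in
# for the general or business-domain dictionary.
from nltk.corpus import wordnet as wn

def check_entity_names(names):
    # underscores let multi-word names such as "purchase order" be looked up
    return {name: bool(wn.synsets(name.replace(" ", "_"), pos=wn.NOUN))
            for name in names}

print(check_entity_names(["supplier", "warehouse", "tbl_supx"]))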


2.5. Conceptual design by schema integration

Modeling large database schemas, with hundreds of entities and relationships, is a complex problem. The natural approach is to divide the application domain into subdomains, model each subdomain separately and integrate the results to form the global conceptual schema. Integration is also necessary when reusing existing database schemas or when defining a global schema on top of distributed databases. In all cases, we are faced with the same problem of integrating semantically heterogeneous schemas which may overlap and contradict each other on some of their elements. Schema integration is one of the design activities where the contribution of CASE tools is best appreciated, due to the volume of knowledge to integrate and the complexity of its heterogeneity.

Detection and resolution of conflicts between heterogeneous schemas are the core problems in schema integration. Among these are terminology conflicts. A terminology conflict occurs when the same real world object is referred to using different names or when two different real world objects are referred to using the same name. This includes synonyms and homonyms. However, in the context of database design, they mainly correspond to different levels of perception, such as ‘‘person’’ and ‘‘employee’’, or converse verbs such as ‘‘sell’’ and ‘‘buy’’.

Some researchers have proposed the use of linguistic instruments in the integration process. This stream brought a new way of solving recurrent problems due to the loss of the correspondence between the real world concept and its conceptual representation. Frankhauser et al. [36] propose the use of fuzzy and incomplete terminological knowledge to determine the correspondence assertions between the compared objects. Johannesson [44,45] determines correspondence assertions between the schemas to integrate by using case grammar and speech act theory. Metais et al. [56,57] suggest the use of semantic electronic dictionaries storing concept hierarchies and canonical graphs to retrieve the semantics of the schemas. Mirbel [60] defines a model of fuzzy thesaurus drawn from linguistic tools; this thesaurus aims at dealing with the meaning of words and gives a fuzzy semantic-nearness coefficient between concepts. The latest tendency in research on schema integration uses description logics, which allow the inclusion of the knowledge of linguistic dictionaries (hierarchy of subsumptions of concepts, canonical graphs of the verbs) in a powerful logic reasoning mechanism [15,35].

The main idea of these approaches is to bring solutions to the basic problem of schema integration, that is, ‘‘given two conceptual fragments of data types, do they represent the same concept in the real world?’’, assuming that the real world is quite well portrayed by some models, thesauri or dictionaries issued from long research in the linguistic domain. The use of a dictionary allows one to state, for example, that the two entities ‘‘Worker’’ and ‘‘Employee’’ in Fig. 4 are semantically equivalent, although they have different names and different attributes, while the two entities ‘‘Worker’’ and ‘‘Department’’ are not semantically equivalent despite having identical attributes (a minimal sketch of such a test is given below).
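The sketch below illustrates this ‘‘same real-world concept?’’ test with WordNet standing in for the semantic dictionary; the similarity threshold and the decision rule are illustrative assumptions.

# Minimal sketch of the ''same real-world concept?'' test: two entity names are
# considered semantically equivalent or closely related if some of their noun
# senses coincide, are linked by hypernymy, or are very similar. WordNet
# (nltk.download('wordnet')) stands in for the semantic dictionary; the 0.9
# threshold is arbitrary.
from nltk.corpus import wordnet as wn

def semantically_equivalent(name1, name2, threshold=0.9):
    for s1 in wn.synsets(name1, pos=wn.NOUN):
        for s2 in wn.synsets(name2, pos=wn.NOUN):
            related = (s1 == s2
                       or s2 in s1.closure(lambda s: s.hypernyms())
                       or s1 in s2.closure(lambda s: s.hypernyms()))
            if related or (s1.wup_similarity(s2) or 0) >= threshold:
                return True
    return False

print(semantically_equivalent("worker", "employee"))    # expected: related (hypernymy)
print(semantically_equivalent("worker", "department"))  # expected: not related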
Another example of the contribution of linguistic knowledge to schema comparison is help in the selection of the pairs of entities to be compared. In order to unify two relationships, the algorithm has to unify their participating entities. The system has to compare each pair of elements of the Cartesian product of the two entity sets.


Fig. 4. Two initial schemas.

Fig. 5. Selection of couples of entities led by linguistic cases.

In the example of Fig. 5, to unify the relationship ‘‘buy’’ of schema 1 and the relationship ‘‘buy’’ of schema 2, the algorithm needs between 3 and 9 comparisons. This process may be time-consuming because, recursively, each of these comparisons could imply a Cartesian product of the attributes of the entities. Thanks to the linguistic roles deduced from the linguistic dictionary, we can significantly reduce the number of comparisons. Indeed, the use of the semantic cases permits the identification of the roles played by each entity (see Section 2.2). Once these roles are stated, only those entities that have the same role will be compared. In the example of Fig. 5, it is now obvious that only the pair (Person, Customer), which plays the role of ‘‘agent’’, and the pair (Car, Vehicle), which plays the role of ‘‘object’’, are worth comparing. This means that only two comparisons are applied, significantly reducing the number of comparisons and the processing time. Furthermore, this increases the quality of the result by avoiding some wrong unifications.

Schema merging may be enhanced by some semantic knowledge on the objects to be merged. We will point out an example of this contribution that we call ‘‘detection of hidden generalizations’’. Generalization is now a widely admitted extension of the entity/relationship model. Most schema integration algorithms have some solution to integrate an entity coming from one schema and another entity coming from a second schema if they explicitly have the same generic in the schemas, or if one of the two entities is a generic of the other one. The contribution of natural language techniques happens each time this semantic link is not explicitly specified by a generalization arc, because such an arc is not meaningful when each schema is considered alone. A common generic may then be found in the linguistic dictionary using the hierarchy of concepts; a minimal sketch of this lookup is given below, and Fig. 6
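The minimal sketch below performs this common-generic lookup with WordNet's hypernym hierarchy standing in for the hierarchy of concepts (an assumption made for illustration); the walkthrough of the Fig. 6 example follows.

# Minimal sketch of the detection of hidden generalizations: the most specific
# common hypernym of the two entity names is proposed as a new generic concept.
# WordNet (nltk.download('wordnet')) stands in for the hierarchy of concepts.
from nltk.corpus import wordnet as wn

def hidden_generalization(name1, name2):
    best = None
    for s1 in wn.synsets(name1, pos=wn.NOUN):
        for s2 in wn.synsets(name2, pos=wn.NOUN):
            for common in s1.lowest_common_hypernyms(s2):
                if best is None or common.min_depth() > best.min_depth():
                    best = common            # keep the most specific common generic
    return best

# For ''car'' and ''motorcycle'' a motor-vehicle/vehicle synset is expected,
# which can enrich the integrated schema as a new generic entity.
print(hidden_generalization("car", "motorcycle"))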


Fig. 6. Detection of hidden generalizations.

shows an example of this process. The two relationships ‘‘repair’’ differ in their object component, which is ‘‘motorcycle’’ in the first schema and ‘‘car’’ in the second one. The hierarchy of concepts generalizes the two concepts into a ‘‘vehicle’’. Thus the integrated schema can be enriched with this new generic concept. Without the hierarchy of concepts supplied by the linguistic dictionary, this example could not have been resolved automatically.

All these improvements by linguistic knowledge exploitation are typical contributions that a CASE tool can provide for a complex problem such as schema comparison. However, in spite of many years of research, schema integration problems are far from solved. This task intensively involves the understanding of the semantics of the schemas. That is why the introduction of NLP techniques appears as the most promising solution, and many hopes are based on future research in this domain. Object comparison could benefit from NLP techniques dealing with words, and pattern comparison could benefit from NLP techniques dealing with sentences and context.

2.6. Conceptual design based on reusable components

Object-oriented methods recommend reusing existing components in the design of new systems. Translated into the context of database design, and in contrast to the classical design approach where a conceptual schema is constructed directly from user requirements or from legacy systems, reuse implies that the designer must endeavor to construct a conceptual schema mainly from existing elements. There is a difference between design by reuse and design by integration. In design by integration, all the integrated schemas concern the same real world application, while design by reuse means customization of elements which have been designed for different purposes. An example which highlights this difference is a conceptual schema devoted to flight booking which can be reused for train booking or hotel booking. Designing by reuse thus means: (i) searching for one or several schemas which have similar purposes (but not necessarily in the same application domain), (ii) customizing these schemas to adapt them to the new application, and (iii) integrating them, when there are several, into one unique global schema. Three main kinds of reusable artifacts have been proposed for building conceptual schemas: schema components, patterns and ontologies. They induce three kinds of approaches.


Fig. 7. Example of component (tag + schema).

The first approach, called the ‘‘component based approach’’, consists of both (i) designing for reuse, which means storing each elaborated conceptual schema as a future reusable one, and (ii) designing by reuse, i.e. retrieval of reusable components, composition and customization. To each stored schema component is assigned a description that abstracts its semantics. Without this description, the effort and time necessary to understand the content of a schema, to analyze its relevance and to select the best adapted one would make reuse inefficient. An example of such a description, as proposed by Ambrosio et al. [2], is shown in Fig. 7. Besides a matching with the graphic, retrieval techniques will be performed on the textual part describing the component.

The second approach, called the ‘‘pattern based approach’’, consists in the elaboration of generic patterns which have to be customized (e.g. those of Gamma [38] or Purao and Storey [65]). A conceptual pattern may for example portray the conceptual schema of an insurance company. Starting with such a frame saves time in the requirements analysis phase and helps in being exhaustive. We can say that this approach helps in specification, validation and checking because patterns are supposed to be consistent. Several works have proposed different kinds of patterns [21,34,65]. Johannesson and Wohed [46] have proposed new specification patterns based on a deontic perspective. Deontic objects are those which entail obligations (such as booking, employment, marriage). These patterns describe the obligations, the parties involved in these obligations and their respective roles. As an example, a ‘‘booking’’ is ‘‘made by’’ a ‘‘party’’, ‘‘for a’’ ‘‘given reason’’ and ‘‘involves’’ some ‘‘resource’’. Deontic objects are also linked to the speech acts that create and delete them. The speech act which can delete an object from the booking class may be the sentence ‘‘I cancel my booking for Paris tomorrow’’.

The current trend in reuse for conceptual design is to aim not only at reusing schema components or schema patterns, but also at reusing relationships between objects, by means of ontologies elaborated for NLP. For example, in the COLOR-X project [13], the lexicon includes general knowledge about words provided by WordNet, domain knowledge incremented at each design, and application knowledge. In addition to their help in designing from scratch, these three kinds of knowledge obviously have a purpose for both designing by reuse and designing for reuse.


Retrieval mechanisms for reusable artifacts intensively involve NLP techniques. A first approach (e.g. Ambrosio et al. [2]) provides a mechanism for flexible queries. Flexible querying is obtained by the automatic modification of the query statements through the relaxation of query conditions, so as to recover concepts within a certain semantic distance according to the semantic relations, i.e. synonymy, hyponymy/hypernymy (is-a, gen-of), meronymy/holonymy (part-of, composed-of) and similarity (a small sketch of such a relaxation is given at the end of this subsection). A second approach, by Purao and Storey [65], allows the user to describe his/her problem in a natural language specification. A parser extracts keywords. A generalization is performed in order to retrieve a possible pattern. After that, the reverse operation (i.e. the specialization which is the reverse of the applied generalization) is performed on the selected pattern in order to customize it. As an example of a third approach, Wohed [86] proposes a dialogue tool called the ‘‘Modeling Wizard Tool’’ in order to select the right pattern. Several versions of the pattern are stored. The best one is selected step by step according to the answers given to questions such as ‘‘Does the booking consist of (1) one object or (2) may it consist of several objects?’’ and ‘‘Does a booking concern a (1) concrete object or (2) does it rather specify the character of the object?’’.

Open issues in reuse by components may concern: (i) the automatic generation of descriptions for the components, (ii) the characterization of a good component and (iii) a flexible way to retrieve the desired components. Natural language techniques may help on the first point by finding the application domain through an analysis of the entity names (e.g. ‘‘education’’ for the entities ‘‘professor’’, ‘‘student’’, ‘‘lectures’’) and the designed functionality by analyzing the relationships (e.g. ‘‘booking’’, ‘‘selling’’). The second point is complex and involves many quality criteria, usually similar to those qualifying a good conceptual schema in general. Among the typical qualities of reusable components is ‘‘unity of purpose’’; metrics have to be found in order to measure the semantic connectivity inside the component. The third point is shared by all research problems in textual information retrieval. A specificity of reuse is the need to limit the retrieval scope in order not to spend more time in retrieval and customization than in designing from scratch.

Pattern and ontology based reuse approaches will directly benefit from any advance in these domains. They may also influence the elaboration of general ontologies by not only focusing on the representation and description of objects and relationships, but also by including the representation of generic macro-actions such as ‘‘booking’’ or ‘‘event-organization’’. All attempts to store ‘‘common sense’’, as CYC does, may positively impact reusability. For both approaches, indexing techniques as in textual databases could speed up the retrieval of artifacts.

Enterprise resource planning (ERP) systems are replacing programming from scratch in most business processes. They are delivered with pre-implemented modules that only have to be customized to the particular needs of the organization. However, an ERP system encompasses several functional areas and consists of thousands of tables and procedures. Gulla and Brasethvik suggest a linguistic component for searching all the pieces of the ERP to be customized, starting from natural language queries such as ‘‘create purchase orders’’ [41].
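The flexible-query relaxation described at the beginning of this subsection can be sketched as follows, with WordNet standing in for the semantic dictionary of the retrieval mechanism; the relations used and the size limit are illustrative assumptions.

# Minimal sketch of query relaxation for component retrieval: a query term is
# expanded with synonyms and is-a neighbours (hypernyms/hyponyms) taken from
# WordNet (nltk.download('wordnet')), which stands in for the semantic
# dictionary of the retrieval mechanism.
from nltk.corpus import wordnet as wn

def relax_term(term, max_terms=10):
    expanded = {term}
    for sense in wn.synsets(term, pos=wn.NOUN):
        expanded.update(lemma.name() for lemma in sense.lemmas())       # synonyms
        for related in sense.hypernyms() + sense.hyponyms():            # is-a links
            expanded.update(lemma.name() for lemma in related.lemmas())
    return sorted(expanded)[:max_terms]

# A query on ''booking'' components could thus also retrieve components indexed
# under ''reservation'' or under more specific kinds of bookings.
print(relax_term("booking"))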
2.7. Web site design

The importance of the Internet no longer needs to be demonstrated, and the impact of a company's Web site on its results may turn out to be important.


Web sites and e-commerce sites supply shared access to data. As a recent domain, Web engineering can benefit from years of research in the information systems domain for the structuring of data units. The definition of the logical units that constitute Web pages is very similar to the normalization phase in the design of relational databases. Normalization theory groups attributes so that each relation represents what is related to one semantic concept (denormalization then combines relations which are accessed together). The inputs of this process are very formal elements, namely attributes, keys and functional dependencies. In the problem of cutting a site into Web pages, no keys or formal dependencies are available to help. A solution appears thanks to NLP techniques: similarities and semantic closeness between the elements of the text have to be found in order to define logical pages. A new family of dependencies and new algorithms, based on linguistics, have to be found. A notion of key has to be redefined in order to identify each piece of information.

In [75], where Thalheim and Düsterhöft compare site development to database design (including structures, functionalities and interfaces), pages are identified by metaphors. Metaphors are used for the exchange of semantic units. The exchange depends on the receiver, the sender, the dominant properties and the context. Metaphors for sites can be developed based on the achievements of linguistic research. Stand-alone metaphors are already used in sites; however, Thalheim and Düsterhöft also address the open issue of a consistent development of metaphors in an integrated manner. Metaphors, or any other kind of page identifier, have the double purpose of guiding both the structuring of the site and the querying phase.

3. Querying of information systems enhanced with NLP

The ability to interact in a natural human language for querying and getting answers is the first expectation of users facing an ‘‘intelligent’’ database management system. The main challenge for all managers is time. A good understanding of human language by the tools obviously appears as a gain of time during both the learning phase (reduced to no new learning) and the operational phase (by reducing the involvement of specialists).

3.1. Querying relational databases

Traditional database management systems are characterized by a fixed structuring of data, mainly supported by the relational model. Paradoxically, while extending the relational model with object-orientation produces schemas that are quite close to the real world, it has the consequence of making SQL more and more complex. Moreover, some user requests imply the use of an embedded SQL program that is not end-user oriented. As predefined menus are not always suitable, natural language querying is considered a promising solution. Queries are first syntactically analyzed, usually by means of augmented transition networks: the input query is converted into a hierarchical structure consisting of sentence units. The parsing output is then converted into semantic structures. Subsequently, these semantic structures are translated into SQL queries. The mapping from the natural language entities to the SQL entities can be the one proposed by Bastawala and Bhattacharyya [3]: nouns and pronouns correspond to database tables and attributes; adjectives correspond to select and project operations; verbs correspond to join operations; and prepositions correspond to select and join operations.
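The toy sketch below applies this noun-to-table, adjective-to-selection mapping to a tagged query; the token format, the mini-schema and the join rule are illustrative assumptions, not the system of [3].

# Toy sketch of the mapping described above: nouns select tables, adjectives
# become selection predicates, and joins link the tables found in the query
# (here the join condition is taken from the mini-schema, a simplification of
# the verb-to-join rule). All names and the schema are illustrative assumptions.
SCHEMA = {                       # noun -> (table, join condition to a previous table)
    "supplier": ("SUPPLIER", None),
    "product": ("PRODUCT", "SUPPLIER.id = PRODUCT.supplier_id"),
}
ADJ_FILTERS = {"red": "PRODUCT.color = 'red'"}   # adjective -> selection predicate

def to_sql(tokens):
    """tokens: list of (word, pos) pairs produced by a prior tagger."""
    tables, conditions = [], []
    for word, pos in tokens:
        if pos == "NOUN" and word in SCHEMA:
            table, join = SCHEMA[word]
            tables.append(table)
            if join:
                conditions.append(join)
        elif pos == "ADJ" and word in ADJ_FILTERS:
            conditions.append(ADJ_FILTERS[word])
    where = " AND ".join(conditions) or "1=1"
    return f"SELECT * FROM {', '.join(tables)} WHERE {where};"

print(to_sql([("supplier", "NOUN"), ("deliver", "VERB"),
              ("red", "ADJ"), ("product", "NOUN")]))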


Numerous natural language interfaces were proposed in the mid-eighties that used powerful syntactical treatments but dramatically lacked semantic knowledge. A new generation of tools appeared in the nineties, mainly addressing the semantic aspects thanks to the availability of ontologies (a recent state of the art by Thalheim and Kobiena may be found in [76]). Researchers have considered many particularities of natural language queries; for example, most recent tools recognize ellipses. The main challenge consists in finding the right mapping between the entities of the sentences and the two following target entities: (1) the objects of the database (e.g. tables) and (2) the operations to be performed on these objects (e.g. joins). If the initial entity–relationship schema of the application is available to the user and if the querying interface has knowledge about the transformations performed by the CASE tool for generating the implemented relational schema, then this mapping is easier and will have a higher success rate. This ideal situation is not unrealistic for the future. Indeed, we noticed in previous sections how traceability is a key concept in database design for all new needs (reusability, interoperability, extraction for DWs and ERPs). However, until now, the mapping to an SQL query usually has to be performed without knowledge of the transformation from the conceptual to the implemented level, and the user does not have any idea about the conceptual schema.

The mapping from the units of the natural language query to objects of the database schema may be deduced thanks to powerful NLP dictionaries. At least a lemmatization is performed. Generalization, specialization and the search for synonyms and semantically close concepts are now systematically applied. Finding paths in the SQL query (e.g. the choice of joins) is also difficult. Heuristics based on the user's profile may help in conflict resolution. Some query tools available on the market do not allow natural language queries but automatically generate joins; they would be enhanced by applying the research results obtained in natural language querying, for example by broadening the ‘‘natural join’’ to semantically similar concepts.

Open issues in the specific domain of querying relational databases through natural language sentences mainly concern aspects where there is a gap between the natural language way and the database way of conceptualizing, or where there is no direct equivalence. In a natural language query, part of the semantic specification is carried by a careful choice of tenses and adverbs of time. A lot of work has been performed to enhance database management systems with temporal features and querying languages including temporal operators. The different times of the database (transaction, validity, loading) are well identified and known. However, in spite of some previous works [54,70], finding an exact translation of what is implicit in the query into concepts supported by the DBMS is an open issue. The increasing importance of decisional treatments should boost this research.

Natural language querying interfaces also have to take into account the ‘‘inserting’’, ‘‘updating’’ and ‘‘deleting’’ SQL statements that are part of the life of an information system and may be required by end users.
This would allow the treatment of sentences like ‘‘I want to sell my Ferrari Testarosa, red, 50,000 km, 23,000€’’. Analysing these sentences is difficult because the matching has to be done starting from instances (‘‘Ferrari Testarosa’’, ‘‘red’’, ‘‘50,000’’) to retrieve elements of the schema (‘‘car type’’, ‘‘color’’, ‘‘mileage’’) and instances are hard to find in a dictionary.
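A minimal sketch of this instance-to-schema matching, assuming that the attribute domains (lists of known values) of the database are available, could look as follows; the domains and example values are hypothetical.

# Sketch of instance-to-schema matching: values mentioned in the sentence are
# looked up in the (assumed available) attribute domains of the database to
# recover the schema elements they instantiate. Domains and values are
# hypothetical examples.
DOMAINS = {
    "car_type": {"ferrari testarossa", "renault clio"},
    "color": {"red", "blue"},
}

def match_instances(phrases):
    matches = {}
    for phrase in phrases:
        for attribute, values in DOMAINS.items():
            if phrase.lower() in values:
                matches[phrase] = attribute
    return matches

print(match_instances(["Ferrari Testarossa", "red", "50,000 km"]))
# ''50,000 km'' finds no textual match: numeric values and units need dedicated rules.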


Pragmatic analysis could call on data mining techniques to make the most of the instances of the database. The notion of ‘‘transaction’’ is very important in the functioning of an information system. If the treatment is expressed by means of natural language sentences, the parsing process has to include a phase for transaction building. To avoid the trivial solution where one SQL statement corresponds to one transaction, a deep understanding of the text may help in this structuring.

3.2. Natural language processing for multi-source information systems

Information sharing from multiple heterogeneous sources is a challenging issue. Multi-source information systems and data warehouses have been the subject of important research projects (CARNOT [22], OBSERVER [55], DWQ [43], TSIMMIS [16], Information Manifold [52], XYLEME [87]). NLP may help in solving semantic heterogeneity at two levels of the mediation: in providing a unified view of the data through a global schema, and in data reconciliation when a resulting item is computed from several sources.

Building a global schema calls for techniques similar to those presented for the schema integration step in database design. However, in a multi-source context, the purpose is not to integrate the schemas once and for all, but to state a framework for repeated integrations. Consequently, linguistic resources cannot simply be external tools enhanced with expert rules and interactive human help, but have to be considered as part of the global information system, preferably using the same formalism as the schemas and queries. There is now evidence that the very tool for dealing with semantically heterogeneous data is the ontology. Numerous definitions may be found for ‘‘ontology’’ depending on the field (databases, mathematics, linguistics, philosophy). However, one of these definitions is becoming predominant in the field of information systems: ‘‘an ontology is a formal conceptualization of a real world, sharing a common understanding of this real world’’. This definition fits the semantic heterogeneity problem so well that the latter is nowadays the main client for ontologies, and reciprocally the main application of ontologies is heterogeneous data sharing.

Wache et al. present an excellent survey of existing approaches to intelligent information integration after analyzing 25 multi-source information systems [85]. In general, three different directions can be identified: single ontology approaches, multiple ontology approaches and hybrid approaches. Single ontology approaches use one global ontology providing a shared vocabulary for the specification of the semantics (e.g. SIMS). In multiple ontology approaches, each information source is described by its own ontology, and inter-ontology mappings are provided (e.g. OBSERVER). In hybrid approaches, local ontologies may be built as concrete views of the global ontology. The reverse solution, i.e. the global ontology built upon the local ontologies, is possible but implies updating the global ontology each time a source changes. The main research issues now concern formal languages that deal jointly with ontologies, metadata and data. Description logics seem to be the most promising because (i) they are very suitable for representing concepts and hierarchies of concepts, (ii) they can also formalize conceptual graphs and (iii) they provide powerful reasoning mechanisms.
If a consensus emerges on this formalization of ontologies, the remaining open issue, namely the semantic integration of several ontologies, will be intensively explored.
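To make the hybrid approach more concrete, here is a minimal sketch; the concept names, source names and mappings are illustrative assumptions, not taken from any of the systems cited above. Each source keeps its own vocabulary, but every local term is defined against a shared global ontology, and two terms from different sources match when their global concepts coincide or are related by hyponymy.

  # Hypothetical sketch of the hybrid ontology approach: local vocabularies
  # are mapped onto a shared global concept hierarchy. All names are made up.
  GLOBAL_IS_A = {            # child concept -> parent concept
      "SportsCar": "Car",
      "Car": "Vehicle",
      "Truck": "Vehicle",
  }

  LOCAL_TO_GLOBAL = {        # (source, local term) -> global concept
      ("source_A", "automobile"): "Car",
      ("source_B", "sports_car"): "SportsCar",
      ("source_B", "lorry"): "Truck",
  }

  def ancestors(concept):
      """Walk up the global hierarchy from a concept."""
      while concept in GLOBAL_IS_A:
          concept = GLOBAL_IS_A[concept]
          yield concept

  def terms_match(source1, term1, source2, term2) -> bool:
      c1 = LOCAL_TO_GLOBAL[(source1, term1)]
      c2 = LOCAL_TO_GLOBAL[(source2, term2)]
      return c1 == c2 or c1 in ancestors(c2) or c2 in ancestors(c1)

  print(terms_match("source_A", "automobile", "source_B", "sports_car"))  # True
  print(terms_match("source_A", "automobile", "source_B", "lorry"))       # False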

At the instance level, data reconciliation is required each time a property of an object has different values depending on the source it comes from. Famous examples are the coding of snail mail addresses, which can be written in many different ways. Small errors may also occur in proper names. Different levels of precision may be used in the different sources (e.g. ‘‘azure’’ and ‘‘blue’’ for coding the same color). Numerous NLP techniques may help to adjust the data, among them:
• context-dependent semantic matching (e.g. semantically close values such as ‘‘azure’’ and ‘‘blue’’ may match; relationships provided by ontologies, such as synonymy or hyponymy, may make the matching succeed);
• context-free semantic matching (e.g. when two values differ by only one letter, the distance on the keyboard between these two letters is computed; if this distance is very small we can conclude that it is a typing error and thus the semantic matching succeeds);
• comparison of values after lemmatization, or comparison of stems;
• spelling correctors.

The Web also serves as a platform for many distributed applications. The previous methods based on a global ontology are unimaginable in this context. XML (eXtensible Markup Language) documents carry some semantics in their data thanks to the semantic tags beginning and ending each data element (e.g. ‘‘Gardarin’’ and ‘‘France’’, each enclosed between an opening and a closing tag). These tags obviously facilitate data integration by giving a structure, and this structure is usually understandable both by humans and by NLP tools. To take full advantage of XML as an exchange model, a large number of communities are developing shared ways of tagging data in their domains. This is a great step toward interoperability. However, to go further, Tim Berners-Lee has imagined a new research field called the ‘‘semantic web’’. The objective is to encapsulate its own semantics in each unit of data having a Universal Resource Identifier (URI). This unit may be a web page or only an element of the page; it is called a ‘‘resource’’. For that purpose, the Resource Description Framework (RDF) adds metadata to each resource through predicates of the form property(resource, value), e.g. is-a (‘http://. . .’, ‘Country’). The value may also be a resource, so RDF builds a graph across resources. On top of RDF, the Ontology Inference Layer (OIL) allows ontologies to be specified. (See also the article by Dieter Fensel et al. in this issue.)

Other work concerning metadata on Web resources has been carried out for other models. Simple HTML Ontology Extensions (SHOE) is a small extension to HTML which allows Web page authors to annotate their web documents with machine-readable knowledge. The goal is to make intelligent agent software on the web possible. This leads to exciting new research directions such as Web mining (based on information retrieval on the Web), the Semantic Web (which aims to encapsulate meaning in web data) and Web intelligence (which encompasses all impacts of artificial intelligence on web management).

3.3. Information retrieval

Information retrieval is a discipline in which vague or imprecise user queries are matched against text documents in order to rank the documents likely to be relevant to the user’s information need (as defined by Richardson and Smeaton in [66]). The World Wide Web, where structured information represents only a small part of the available data, is the main customer of content-based retrieval.

The conventional approach to information retrieval is to compute a measure of similarity between a query and each document. Search engines provide the following services to ease the retrieval: (1) they access heterogeneous data and extract terms in order to prepare an index of terms, (2) they accept queries containing keywords, (3) they match each query against the index and rank the results according to their matching score. The precision measure is computed as the number of retrieved relevant documents divided by the total number of retrieved documents: the closer this measure is to 100%, the fewer irrelevant documents are given as answers. The recall measure is computed as the number of retrieved relevant documents divided by the total number of relevant documents existing in the corpus: the closer this measure is to 100%, the fewer relevant documents have been missed by the engine.

NLP techniques are obviously the very means for enhancing a search engine’s capabilities, increasing both precision and recall. The main causes of a bad recall score are synonymy, hyponymy, hypernymy, paraphrase and implicit information. A bad precision score may be due to homonymy. The roles played by nouns are also important. As an example, a search on ‘‘our president’’ would get a low precision score because many documents concern other presidents, and it would get a low recall score because in numerous documents ‘‘our president’’ is only referred to by his or her proper name.

Both document indexing and query processing may be enhanced by NLP techniques. Morphological analysis and stemming are now routinely applied for indexing. An example of sophisticated, linguistically motivated indexing is shown in [74] for the text ‘‘There are urban centers. These are decaying.’’, which obtains the index description (CENTRES that are URBAN are DECAYING) instead of just (CENTRES that are URBAN) (DECAYING) or a single-term indexation. Query processing usually encompasses a ‘‘query reduction’’ phase, where useless words are removed, and a ‘‘query expansion’’ phase, where the scope of the search is extended to semantically close terms or frequently co-occurring terms. Ontological tools such as the linguistic dictionary WordNet are widely used for guiding the expansion of the query; examples of this approach are OntoSeek [40] and Smart Web Query (SWQ) [19]. WordNet is also increasingly used for computing semantic similarity values between a user’s query and the document description [66].

Image retrieval is an important issue for the next years. A new way of consuming TV broadcasts is arriving: spectators will have to choose what they want to load among the stock of programs delivered by channels. Requests such as ‘‘the movie where a blue horse breaks into a gallop under the rain’’ should be satisfied. On the other side, advertisers will have to choose which advertising spot should be launched depending on the immediately preceding news or sporting event results. As video is indexed with words, this will be an important application of information retrieval. Data management systems are already able to store video records. Images may also take their place in the ontology, as visual metaphors or as part of the description; an attempt is made in the CYC project [49].
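Returning to the query expansion step mentioned above, a minimal sketch could look as follows. It assumes the NLTK interface to WordNet, which is only one possible toolkit; the paper does not prescribe any particular implementation.

  # Minimal query-expansion sketch using WordNet through NLTK (an assumption,
  # not a tool prescribed by the paper).
  # Requires: pip install nltk  and  nltk.download("wordnet")
  from nltk.corpus import wordnet as wn

  def expand(term, max_new=5):
      """Collect synonyms plus direct hypernyms/hyponyms of a query term."""
      candidates = set()
      for synset in wn.synsets(term):
          candidates.update(l.replace("_", " ") for l in synset.lemma_names())
          for related in synset.hypernyms() + synset.hyponyms():
              candidates.update(l.replace("_", " ") for l in related.lemma_names())
      candidates.discard(term)
      return sorted(candidates)[:max_new]

  query = ["president", "election"]
  print({t: expand(t) for t in query})
  # e.g. 'president' may expand to 'chair', 'chairman', 'head of state', ...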
The links between concepts and their graphical representations have to be explored if we want to go further in dealing with semantics. Chen has noticed in [18] the power of
ideogram-based languages in conceptualization; they have fruitfully inspired conceptual models and they could help in all user interfaces.

Dealing with multilingual documents is also a burning question in the information retrieval research domain. Ontologies dealing with multilingual sets of synonyms gathered in the same hierarchy could be an alternative to the elaboration of monolingual ontologies related by translation links.

3.4. Natural language processing for dealing with answers

From the user’s point of view, querying huge and heterogeneous sources of unstructured information is only the first and easiest part of the retrieval task. Dealing with the answers is the second one, and by far the most difficult. Because of the increasing number of answers, filtering, ranking, clustering, intelligent integration and summarizing of documents are very important issues, which intensively involve all NLP techniques. Filtering and ranking may take into account the level of matching with the elements of the query, the context, the user profile and the intrinsic quality of the document, including readability and freshness. Integration and summarizing of documents appear to be extremely difficult, but these functionalities would be so valuable that they are worth researching.

E-mail messages make up a large portion of the free-form documents available today. Auto-responder systems return canned documents according to the presence of keywords in the subject or the body. More sophisticated systems identify a subset of the documents and, within this subset, a set of passages that may contain the answer. The system presented by Kosseim et al. relies on information extraction to analyze the user message, a decision tree to determine the content of the answer, and template-based natural language generation to produce the surface form in a customized and cohesive way [48]. Natural language answering could also be used to generate readable answers from traditional relational databases. Natural language answers should also be provided for queries about queries, such as ‘‘why did I not get any answers?’’.
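The general idea of keyword-triggered, template-based answering can be illustrated with a deliberately simplified sketch; it is not the system of Kosseim et al., and the rules, templates and extraction pattern are illustrative assumptions only.

  # Simplified keyword-triggered, template-based e-mail answering sketch.
  # Rules, templates and the name-extraction pattern are made-up examples.
  import re

  RULES = [
      (re.compile(r"\b(password|login)\b", re.IGNORECASE),
       "Hello {name},\nTo reset your password, please visit your account page."),
      (re.compile(r"\b(invoice|billing)\b", re.IGNORECASE),
       "Hello {name},\nYour latest invoice has been sent to your registered address."),
  ]
  FALLBACK = "Hello {name},\nYour message has been forwarded to a human operator."

  def extract_name(message: str) -> str:
      """Very crude information extraction: look for a signature line."""
      match = re.search(r"(?:regards|thanks),?\s*\n\s*(\w+)", message, re.IGNORECASE)
      return match.group(1) if match else "customer"

  def answer(message: str) -> str:
      name = extract_name(message)
      for pattern, template in RULES:
          if pattern.search(message):
              return template.format(name=name)
      return FALLBACK.format(name=name)

  print(answer("I lost my login details.\nRegards,\nAlice"))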

4. Conclusion

Modern applications such as web information systems, e-commerce, and heterogeneous and multimedia databases introduce a high level of complexity in their design and evolution, and thus increase the need for powerful aids. CASE tools dealing with semantic matching, ontology engineering, semi-structured data, and audio and video data are far from being a reality in the market place, although a strong research effort has been invested in these different areas. Far from acting only as an interface, natural language understanding gives more intelligence to the tool by giving sense to the manipulated data. Therefore, in the next 10 years, the integration of NLP and information systems engineering will be a major area of both research and technology transfer.

Some recurrent problems in information systems engineering require NLP. The major one is the semantic matching of textual objects. This functionality is essential for database querying, information retrieval, data adjustment, view integration and schema reuse. Besides structure rewriting, the main techniques proposed for semantic matching are
ontology-based. Thus ontologies are nowadays the backbone of many applications in e-commerce, knowledge management, enterprise modeling, knowledge discovery and interoperability between systems. The intensive use of ontologies for information systems engineering requires specific extensions of these ontologies, such as new features and new relationships. Computing semantic distances between two concepts in a hierarchy is one of the most crucial points to be solved: numerous proposals can be found, but none of them is really convincing. Interoperability, which is the key concept in current database applications, raises a promising issue, namely the integration of the different ontologies coming either from each application or added on top of several applications for the purpose of sharing.

The main problem of the third millennium is a shortage of time, and a new kind of frontier appears to be pushed back. How much will the introduction of NLP into information systems engineering help in gaining time? The elaboration of evaluation measures and specific benchmarks has to be encouraged in order to assess the concrete results of this exciting research area.

Acknowledgements

I would like to thank Mokrane Bouzeghoub from the University of Versailles for all the work we have performed together over the years on the application of NLP techniques to information systems; part of this work is reported in this paper. I also sincerely thank Zoubida Kedad and Assia Soukane. I thank Farid Meziane for his kind help with the English language. I am very indebted to my colleagues from the CEDRIC laboratory, Jacky Akoka and Isabelle Comyn-Wattiau, for their efficient and warm encouragement. Finally, I am very grateful to Reind van de Riet for the corrections he made on this paper and for his kind support.

References

[1] Z. Alimazighi, C. Rolland, Conceptual modeling: an approach based on the analysis of documents and scenarios, Networking and Information Systems Journal 1 (4–5) (1998).
[2] A.P. Ambrosio, E. Metais, J.N. Meunier, The linguistic level of the KHEOPS CASE tool, in [6], 1995.
[3] M. Bastawala, P. Bhattacharyya, Natural language interface to an object-relational database management system, 3rd International Workshop on Applications of Natural Language to Information Systems (NLDB’97), Vancouver, Canada, 1997.
[4] W.J. Black, Acquisition of conceptual data models from natural language descriptions, 3rd Conference on European Chapter of the ACL, Copenhagen, 1987.
[5] M. Bouzeghoub, G. Gardarin, The design of an expert system for database design, in: G. Gardarin, E. Gelenbe (Eds.), International Workshop on New Applications of Databases, Cambridge, UK, Academic Press, New York, 1983.
[6] M. Bouzeghoub, E. Metais (Eds.), Proceedings of the 1st International Workshop on Applications of Natural Language to Data Bases (NLDB’95), 1995, ISBN 2-903677139-3.
[7] M. Bouzeghoub, Z. Kedad, E. Metais (Eds.), Proceedings of the 5th International Conference on Applications of Natural Language to Information Systems (NLDB’2000), Lecture Notes in Computer Science no. 1959, Springer, Berlin, 2001, ISBN 3-540-41943-8.
[8] G. Bracchi, S. Ceri, G. Pelagatti, A set of integrated tools for the conceptual design of database schemas and transactions, IEEE Transactions on Software Engineering 14 (1) (1988).
[9] E. Buchholz, H. Cyriaks, A. Düsterhöft, H. Mehlan, B. Thalheim, Applying a natural language dialogue tool for designing databases, in [6].
[10] E. Buchholz, A. Düsterhöft, B. Thalheim, Capturing information on behavior with the RADD-NLI: a linguistic and knowledge-based approach, Data and Knowledge Engineering International Review 23 (1) (1997).
[11] P. Buitelaar, R.P. van de Riet, The use of a Lexicon to interpret ER diagrams: a LIKE project, Proceedings of the ER Approach International Conference, Karlsruhe, 1992.
[12] J.F. Burg, R.P. van de Riet, The impact of linguistics on conceptual models: consistency and understandability, in [6].
[13] J.F. Burg, R.P. van de Riet, COLOR-X: using knowledge from WordNet for conceptual modeling, in: C. Fellbaum (Ed.), WordNet: An Electronic Reference System and Some of its Applications, MIT Press, Cambridge, MA, 1996.
[14] J.F. Burg, Linguistic Instruments in Requirements Engineering, IOS Press, 1997, ISBN 90-5199-316-1.
[15] D. Calvanese, G. de Giacomo, M. Lenzerini, D. Nardi, R. Rosati, A principled approach to data integration and reconciliation in data warehousing, Workshop on Design and Management of Data Warehouses (DMDW’99), 1999.
[16] S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, J. Widom, The TSIMMIS project: integration of heterogeneous information sources, IPSJ’94, 1994.
[17] P.P. Chen, The entity–relationship model: toward a unified view of data, ACM Transactions on Database Systems 1 (1) (1976).
[18] P.P. Chen, B. Thalheim, L.Y. Wong, Future directions of conceptual modeling, Lecture Notes in Computer Science, vol. 1565, Springer, Berlin, 1999.
[19] R.H.L. Chiang, C.E.H. Chua, V.C. Storey, A smart Web query for semantic retrieval of Web data, Data and Knowledge Engineering 38 (1) (2001).
[20] J. Choobineh, M. Mannino, J. Nunamaker, B. Konsynski, An expert database design system based on analysis of forms, IEEE Transactions on Software Engineering 14 (2) (1988).
[21] P. Coad, Object Models: Strategies, Patterns and Applications, Yourdon Press, Prentice Hall, 1995.
[22] C. Collet, M.N. Huhns, W.-M. Shen, Resource integration using a large knowledge base in Carnot, IEEE, 1991.
[23] I. Comyn-Wattiau, J. Akoka, Z. Kedad, Combining view integration and schema clustering to improve database design, 14èmes Journées Bases de Données Avancées, Hammamet, Tunisia, October 1998.
[24] H. Dalianis, Aggregation, formal specification and natural language generation, in [6], 1995.
[25] V. de Antonellis, B. Demo, Requirement collection and analysis, IEEE Transactions on Software Engineering 14 (1) (1988).
[26] F.P.M. Dignum, R.P. van de Riet, How the modeling of knowledge bases can be based on linguistics and founded in logic, Data and Knowledge Engineering Journal 7 (1991) 1–34.
[27] EDR Electronic Dictionary Technical Guide, Japan Electronic Dictionary Research Institute, Ltd., Mita-Kokusai Bldg. Annex, Mita 1-4-28, Minato-Ku, Tokyo 108, Japan, August 1993.
[28] C. Eick, From natural language requirements to good data base definitions: a data base design methodology, Data Engineering, IEEE, Los Angeles, CA, USA, 1984.
[29] C. Fellbaum, WordNet, an Electronic Lexical Database, The MIT Press, Cambridge, 1998, ISBN 0-262-06197-X.
[30] C. Fillmore, The case for case, in: Universals in Linguistic Theory, Bach and Harms, Holt, Rinehart and Winston, New York, 1968.
[31] G. Fliedl, C. Kop, W. Mayerthaler, H.C. Mayr, C. Winkler, NTS-based derivation of KCPM cardinalities: from natural language to conceptual predesign, in [81], 1996.
[32] G. Fliedl, H.C. Mayr (Eds.), Proceedings of the 4th International Conference on the Applications of Natural Language to Information Systems (NLDB’99), 1999, ISBN 3-85403-129-7.
[33] G. Fliedl, C. Kop, W. Mayerthaler, H.C. Mayr, C. Winkler, The NIBA approach to quantity settings and conceptual predesign, in [62], 2001.
[34] M. Fowler, Analysis Patterns: Reusable Object Models, Addison-Wesley, Reading, 1997.
[35] E. Franconi, Logical form and knowledge representation: toward a reconciliation, Working Notes of the AAAI Fall Symposium on Knowledge Representation Systems based on Natural Language, Cambridge, MA, November 1996.
[36] P. Frankhauser, M. Kracker, E.J. Neuhold, Semantic vs. structural resemblance of classes, SIGMOD Record 20 (4) (1991).
[37] M.O. Frölich, R.P. van de Riet, Conceptual models as knowledge resources for text generation, 3rd International Workshop on Applications of Natural Language to Information Systems (NLDB’97), Vancouver, Canada, 1997.
[38] E. Gamma, R. Helm, R. Johnson, J. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley, Reading, MA, 1995.
[39] N. Guarino (Ed.), Formal Ontology in Information Systems, IOS Press, London, 1998, ISBN 90-5199-399-4.
[40] N. Guarino, OntoSeek: content-based access to the Web, IEEE Intelligent Systems, May/June 1999.
[41] J.A. Gulla, T. Brasethvik, Model-driven business management – the linguistic perspective, in [62], 2001.
[42] H. Horacek, An approach to building domain models interactively, in [62], 2001.
[43] M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis (Eds.), Fundamentals of Data Warehouses, Springer, Berlin, 1999.
[44] P. Johannesson, Logic based approach to schema integration, 10th Entity/Relationship Approach, San Mateo, CA, USA, 1991.
[45] P. Johannesson, Using conceptual graph theory to support schema integration, 12th Entity/Relationship Approach, USA, December 1993.
[46] P. Johannesson, P. Wohed, The deontic patterns – a framework for domain analysis in information systems design, Data and Knowledge Engineering 31 (1999).
[47] M.L. Kersten, H. Weigand, F. Dignum, J. Boom, A conceptual modeling expert system, ER’86, Dijon, 1986.
[48] L. Kosseim, S. Beauregard, G. Lapalme, Using information extraction and natural language generation to answer e-mail, Data and Knowledge Engineering 38 (1) (2001).
[49] D.B. Lenat, CYC: a large-scale investment in knowledge infrastructure, CACM 38 (11) (1995) 32–38.
[50] D.B. Lenat, G.A. Miller, T. Yokoi, CYC, WordNet, and EDR: critiques and responses, CACM 38 (11) (1995) 45–48.
[51] G. Levreau, M. Bouzeghoub, HypER: an extended E/R model with hypertext facilities, in [81], 1996.
[52] A.Y. Levy, A. Rajaraman, J.J. Ordille, Querying heterogeneous information sources using source descriptions, VLDB’96, 1996.
[53] K.R. McKeown, W.R. Swartout, Language generation and explanation, in: M. Zock, G. Sabah (Eds.), Advances in Natural Language Generation, Communication in AI, vol. 1, 1988.
[54] D. Maurel, M. Mohri, Computation of French temporal expressions to query databases, in [6].
[55] E. Mena, V. Kashyap, A. Illarramendi, A. Sheth, Domain specific ontologies for semantic information brokering on the global information infrastructure, in [39], 1998.
[56] E. Metais, J.-N. Meunier, G. Levreau, Database schema design: a perspective from natural language techniques to validation and view integration, 12th International Conference on the Entity/Relationship Approach, Dallas, Texas, December 1993.
[57] E. Metais, Z. Kedad, I. Comyn-Wattiau, M. Bouzeghoub, Implementation of a third generation tool, in [81], 1996.
[58] L. Mich, R. Garigliano, The NL-OOPS project: object oriented modeling using the natural language processing system LOLITA, in [32], 1999.
[59] L. Mich, On the use of ambiguity measures in requirements analysis, in [62], 2001.
[60] I. Mirbel, Semantic integration of conceptual schemas, in [6], 1995.
[61] A.M. Moreno, R.P. van de Riet, Justification of the equivalence between linguistic and conceptual patterns for the object model, 3rd International Workshop on Applications of Natural Language to Information Systems (NLDB’97), Vancouver, Canada, 1997.
[62] A.M. Moreno, R.P. van de Riet (Eds.), Proceedings of the 6th International Workshop on the Applications of Natural Language to Information Systems (NLDB’01), Lecture Notes in Informatics, 2001, ISBN 3-88579-332-6.
[63] E. Ortner, B. Schienmann, Normative language approach – a framework for understanding, 15th International Conference on Entity–Relationship Approach (ER’96), 1996.
[64] O. Piton, D. Maurel, Beijing frowns and Washington pays close attention: computer processing of relations between geographical proper names in foreign affairs, in [7], 2001.
[65] S. Purao, V. Storey, Intelligent support for retrieval and synthesis of patterns for object oriented design, in: D.W. Embley, R.C. Goldstein (Eds.), CAISE’97: International Conference on Advanced Information Systems Engineering, LNCS no. 1331, Springer-Verlag, Berlin, 1997.
[66] R. Richardson, A. Smeaton, An information retrieval approach to locating information in large scale federated database systems, in [81], 1996.
[67] C. Rolland, C. Proix, Natural language approach to conceptual modeling, in: P. Loucopoulos, R. Zicari (Eds.), Conceptual Modeling, Databases and CASE, Wiley Professional Computing Publisher, New York, 1992, ISBN 0-471-55462-6.
[68] C. Rolland, C. Ben Achour, Guiding the construction of textual use case specifications, Data and Knowledge Engineering 25 (1–2) (1998).
[69] G. Sabah, L’intelligence artificielle et le langage, Hermès, 1988.
[70] S. Schwer, Temporal granularity enlightened by knowledge, in [7], 2001.
[71] J.F. Sowa, Conceptual Structures: Information Processing in Mind and Machine, Addison-Wesley Publishing Company, Reading, MA, 1984.
[72] A.A.G. Steuten, F. Dehne, R.P. van de Riet, WordNet++: a lexicon supporting the COLOR-X method, in [7], 2001.
[73] V.C. Storey, Understanding semantic relationships, VLDB Journal 2 (1993) 455–488.
[74] K. Sparck Jones, What is the role of NLP in text retrieval?, in: T. Strzalkowski (Ed.), Natural Language Information Retrieval, Kluwer Academic Publishers, Dordrecht, 1999.
[75] B. Thalheim, A. Düsterhöft, Metaphor development for Internet sites, in [32], 1999.
[76] B. Thalheim, T. Kobiena, From NL DB request to intelligent NL DB answer, Preprint I-6-2001, BTU Cottbus, Computer Science, 2001 (available through http://www.informatik.tu-cottbus.de/thalheim).
[77] B. Tauzovich, An expert system for conceptual data modeling, ER’89, Toronto, Canada.
[78] A.M. Tjoa, L. Berger, Transformation of requirement specifications expressed in natural language into an EER model, 12th International Conference on Entity–Relationship Approach, 1993.
[79] S. Vadera, F. Meziane, From English to formal specifications, The Computer Journal 37 (9) (1994).
[80] R.P. van de Riet, R.A. Meersman, Linguistic instruments in knowledge engineering, in: R.P. van de Riet, R.A. Meersman (Eds.), Proceedings of the 1991 Workshop on LIKE, Tilburg, North Holland, The Netherlands, 1991.
[81] R.P. van de Riet, J.F.M. Burg, A.J. van de Vos (Eds.), Proceedings of the Second International Workshop on the Application of Natural Language to Databases (NLDB’96), IOS Press, 1996, ISBN 90-51-99-273-4.
[82] A.J. van der Vos, J.A. Gulla, R.P. van de Riet, Verification of conceptual models based on linguistic knowledge, in [6], 1995.
[83] M. Volk, S. Clematide, Learn – filter – apply – forget: mixed approaches to named entity recognition, in [62], 2001.
[84] P. Vossen, EuroWordNet – a multilingual database with lexical semantic networks, Kluwer Academic Publishers, Dordrecht, 1998.
[85] H. Wache, T. Vögele, U. Visser, H. Stuckenschmidt, G. Schuster, H. Neumann, S. Hübner, Ontology-based integration of information – a survey of existing approaches, in: G. Stumme, A. Maedche, S. Staab (Eds.), Proceedings of the IJCAI’2001 Workshop on Ontologies and Information Sharing, vol. 47, CEUR-WS, Seattle, USA, 2001.
[86] P. Wohed, Tool for reuse of analysis patterns – a case study, 19th International Conference on Conceptual Modeling, ER’2000, Salt Lake City, October 2000.
[87] http://www.xyleme.com.

Elisabeth Metais (1959) is a full professor at the CNAM University (Paris, France) and a researcher in the CEDRIC Laboratory since September 2000. Up to 2000 she was an Associate Professor at the University of Versailles (France), working in the PRiSM Laboratory, and she previously was a researcher at the University of Paris VI (France), where she obtained her Ph.D. in Computer Science (1987). Her main line of research has been database design. She participated in the definition of SECSI, the first expert system for database design, and has been interested since the early nineties in applying natural language techniques to database design. She is currently working on data warehousing, focusing on semantic heterogeneity problems in data cleaning and data integration. She directed the EVOLUTION French working group on data warehouse design and is managing the REANIMATIC project, which aims to build a warehouse from data collected in intensive care units. She has been a member of about 40 program committees and has been involved in the organisation of EDBT’96 and ER’99. She organised the 1st International Workshop on Applications of Natural Language to Data Bases (NLDB’95) and the 5th International Conference on Applications of Natural Language to Information Systems (NLDB’2000) in Versailles (France).