Natural language analysis for semantic document modeling


Data & Knowledge Engineering 38 (2001) 45–62

www.elsevier.com/locate/datak

Terje Brasethvik *, Jon Atle Gulla

Department of Computer and Information Science (IDI), Norwegian University of Science and Technology (NTNU), O.S. Bragstads plass 2F, 7491 Trondheim, Norway

Received 27 March 2001; received in revised form 27 March 2001; accepted 27 March 2001

Abstract

To ease the retrieval of documents published on the Web, the documents should be classified in a way that users find helpful and meaningful. This paper presents an approach to semantic document classification and retrieval based on natural language analysis and conceptual modeling. Users may define their own conceptual domain model, which is then used in combination with linguistic tools to define a controlled vocabulary for a document collection. Users may browse this domain model and interactively classify documents by selecting model fragments that describe the contents of the documents. Natural language tools are used to analyze the text of the documents and propose relevant model fragments in terms of selected domain model concepts and named relations. The proposed fragments are refined by the users and stored as document descriptions in RDF–XML format. For document retrieval, lexical analysis is used to preprocess search expressions and map these to the domain model for manual query refinement. A prototype of the system is described, and the approach is illustrated with examples from a document collection published by the Norwegian Centre for Medical Informatics (KITH). © 2001 Published by Elsevier Science B.V.

Keywords: Semantic modeling; Document classification; Linguistic tools

1. Introduction

Project groups, communities and organizations today use the Web to distribute and exchange information. While the Web makes it easy to publish documents, it is more difficult to find an efficient way to organize, describe, classify and present the documents for the benefit of later retrieval and use. One of the most challenging tasks is semantic classification – the representation of document contents.

Semantic classification is usually done using a mixture of text-analysis methods, carefully defined vocabularies or ontologies, and various schemes for applying the vocabularies in indexing tasks. With controlled vocabularies, more work can be put into creating meaningful attributes for document classification and building up document indices. However, even though the words in a controlled vocabulary are selected on the basis of their semantic content, this semantic part is

* Corresponding author. Tel.: +47-73-59-36-71. E-mail addresses: [email protected] (T. Brasethvik), [email protected] (J. Atle Gulla).



normally not apparent in the classification and retrieval tasks. As far as the information retrieval system is concerned, the vocabulary tends to be a list of terms that are syntactically matched with terms in the documents. The inherent meanings or structures of the terms in the vocabulary are not used to classify and retrieve documents, and we are still left with a syntactic search approach to information retrieval.

There have been a few attempts to add a semantic component to document retrieval. With the inclusion of semantically oriented dictionaries like WordNet, it has been possible to link search terms to related terms and use these links to expand or restrict the search query in a semantic way. Stemming is a technique for reducing a conjugated word to its base form, which may also be used when classifying documents and can be regarded as a way of deducing the underlying concept from the surface text string. There is a tendency, though, for linguistically oriented retrieval systems that include such techniques not to combine easily with controlled vocabularies.

An interesting approach to this problem would be to define a controlled vocabulary of terms that can be semantically interpreted by the information retrieval system. For this to be useful, though, the semantic descriptions have to be derived from the way these terms are actually used in the document collection. A general multi-purpose dictionary would give us definitions that in most cases are too generic and would work against the whole principle of using controlled vocabularies. Also, the semantic aspects have to be given a representation that the retrieval system can understand and make use of.

In this paper, we present a document retrieval system that brings together semantic retrieval and controlled vocabularies. Particular to this system is the way conceptual modeling and linguistics are combined to create conceptual models that serve as controlled vocabularies. Section 2 motivates our use of a semantic modeling language as a vehicle to define and visualize the semantics of the concepts in the controlled vocabulary that will be used for classification. Section 3 introduces our system and the components of its architecture. The construction of the vocabulary is discussed in Section 4, and Sections 5 and 6 present the processes of classifying and retrieving documents, respectively. A comparison with related work is given in Section 7, and the main conclusions of our work are summed up in Section 8.

2. Semantic modeling of documents

Today, sharing of documents within a project group or a community is often done by making them available electronically, for example on some kind of shared web site, directory or workspace. In the absence of web librarians or library-like tools, the users are left with the task of organizing and describing these documents themselves. Our goal is to provide the users with tools that aid the semantic classification of documents, using concepts from a controlled vocabulary.

To be able to classify a document semantically by way of concepts from a controlled vocabulary, the users must be familiar with this vocabulary. When classifying a document by a selection of concepts, the semantic interpretation of these concepts should be apparent to the user and should reflect his/her view of the documents to be classified.
Likewise, to retrieve documents using a controlled vocabulary, users must enter a query string containing – or referring to – the concepts whose interpretation reflects their information retrieval goals. Our claim is that to aid the users in these tasks, the applied vocabulary should be apparent and directly visible in the provided tools.


Within project groups or communities that maintain such shared or "common information spaces", the semantics of information is often locally constructed and reflects the users' "shared agreement" on the meaning of information [2,32]. Successful use of concepts from a vocabulary to communicate the semantics of documents within such a community hence depends on the users' shared semantic interpretation of these concepts. We do not need a "globally correct vocabulary", as creating one seems an impossible task. Rather, we are interested in letting the users define and use their own domain-specific vocabulary, in order to achieve this locally "shared agreement" on meaning. Hence, in building a retrieval system, our goal is to allow the users to define their own vocabulary of concepts and to utilize this vocabulary directly in an interface for classifying and retrieving documents.

A fundamental assumption for our document retrieval system is that the subject domain of the documents to be classified can be characterized by a vocabulary of well-defined, inter-related concepts, and that each document can be classified by way of a selection of these concepts [3,4]. Such a vocabulary forms the terminology of the domain and may be represented as a conceptual model of entities and relationships that we will refer to as the domain model. The definition of such domain-specific terminologies is quite commonplace today – even standardized [19,38] – within several subject domains. Work on a terminology document is also often mentioned as an early deliverable in project management handbooks.

Semantic modeling languages (see, for example, [6,18,27]) are by nature intended to define and express representations of concepts. Furthermore, they are intended to represent semantics at a semi-formal level between human and computer understandability.

Fig. 1. Using the domain model as interface to semantic document classi®cation.


The graphical representation of such languages also gives us the ability to visualize the concepts, their definitions and their relations. This enables us to use the domain model as part of the user interface, letting users explore and interact directly with the model in order to classify and retrieve documents. Fig. 1 illustrates the principle. Successful use of conceptual modeling in such a setting, however, imposes requirements on the supporting document retrieval system, as indicated in the figure. In order to perform semantic document retrieval, our approach relies on linguistic tools both in the preparation of the domain model and in the classification and retrieval processes. We also need to connect to a retrieval system that is able to store and manage our model-based classifications and query them at a later stage.

3. Semantic document retrieval

Semantic document retrieval implies that the documents are classified using interpretable concepts. As opposed to the syntactic retrieval mechanisms of traditional information retrieval systems, the semantic approach needs an understanding of the domain addressed by the documents. Whereas syntactic retrieval is based on a simple matching of search strings with document attributes, semantic retrieval requires that the underlying meaning of the search expression be compared to the semantic descriptions of the documents. During this comparison, various logical, linguistic or conceptual theories may be applied. Our semantic retrieval approach rests on the following major principles:
· The contents of documents are described using domain model fragments that have clear semantic interpretations. These classifications may be more precise than the domain model itself, as they label the relationships between the domain concepts.
· Document queries must refer to the concepts included in the domain model. The query is regarded as a fragment of the domain model, including the labeled relationships from the document classifications. Query refinement is done graphically by selecting and deselecting elements of this extended domain model.

The four-tier architecture of the information retrieval system is shown in Fig. 2. Using the retrieval system, the user accesses a model viewer implemented in Java that integrates the components of the system. In addition to classifying and retrieving documents, this graphical model viewer is also used to construct the domain model for the documents. The models shown in the model viewer are based on the Referent modeling language [36], an ER-like language with strong abstraction mechanisms and a sound formal basis.

The Java servlets at the next level define the overall functionality of the system. They are invoked from the model viewer and coordinate the linguistic tools incorporated into the system. This way of accessing the components both makes them generally accessible and gives us a loose coupling between them, which offers a flexible way of using a variation or combination of underlying tools.

At the third level, a number of linguistic tools analyze natural language phrases and provide the necessary input to construct domain vocabularies and to classify and retrieve documents. The word frequency analyzer from WordSmith is a commercially available application for counting word frequencies in documents and producing various statistical analyses. A Finnish company, Lingsoft, provides two tools for analyzing nominal phrases and tagging sentences that are needed for the classification and retrieval of documents.


Fig. 2. Overview of system architecture.

A smaller Prolog application for analyzing relations between concepts in a sentence is being developed internally at our department. This application assumes a disambiguated tagged sentence and proposes relations between the concepts on the basis of classification-specific rules. As an alternative to the tagger and the relation analyzer, we are considering a parser linked to an extensive semantic dictionary.

Finally, the documents and their classifications are stored in HTML and XML format, respectively. The domain model is also stored in XML and can be continuously maintained to reflect the vocabulary used in the documents being added to the document collection. The linguistic tools rest on lexical information that is partly stored in XML and partly integrated with the tools themselves.

The functionality of the document system includes three fundamental processes:
· the construction of the domain model on the basis of selected documents,
· the classification of new documents using linguistic analysis and conceptual comparison, and
· the retrieval of documents with linguistic pre-processing and graphical refinement of search queries.

In the following sections, we go into these processes in more detail. Examples from the Norwegian Centre for Medical Informatics (KITH) will be used to illustrate our approach. KITH has the editorial responsibility for creating and publishing ontologies covering various medical domains, such as physiotherapy, somatic hospitals, psychological healthcare and even the general domain of medical services. These ontologies are today created on the basis of documents from the domain and take the form of a list of selected terms from the domain, their textual definitions (possibly with examples) and cross-references (Fig. 4).


KITH's goal is to be able to use these medical ontologies directly in a web-based interface to classify and browse domain documents. Our work is related to the ontology for somatic hospitals [28], which is currently under construction.

4. Constructing domain models from document collections

Conceptual modeling is mainly a manual process. However, when the conceptual models are constructed on the basis of a set of documents, textual analysis may be used to aid the process. Our approach is outlined in Fig. 3.

The first step of the process is to run the document collection through a word analysis with the purpose of proposing a list of concept candidates for the manual modeling process. In the KITH case, the WordSmith toolkit [34] is used for this purpose. This analysis filters away stop words, and it also compares the text of the documents with what is denoted "a reference set of documents", assumed to contain words and phrases from average Norwegian language. The result of this analysis is a list of the most frequent terms occurring in documents from this domain. The WordSmith tool offers a word concordance analysis that, for a given term, may display example uses of this term in phrases from the text and also compute words that are co-located with this term in the text.

After the word analysis, the conceptual model is created by carefully selecting terms from the proposed list through a manual and cooperative process. In the KITH example, this work is performed by a committee consisting of computer scientists, medical doctors and other stakeholders from the particular domain. Terms are selected, defined – possibly with examples from their occurrences in the document text – and related to each other.

The manual concept definition process performed by KITH is a perfect example of the local definition of a domain vocabulary mentioned in Section 2. This is not always a process starting from scratch, however; typical sources of input for this work are guidelines for terminological work, existing domain-specific thesauri, dictionaries, etc.

Fig. 3. Constructing the domain-specific conceptual model.
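The word analysis described above can be approximated in a few lines of code. The sketch below is not WordSmith itself, only a minimal illustration of the underlying idea under our own assumptions: count domain terms, filter stop words, and rank terms by how over-represented they are relative to a reference corpus. The function names, the stop list and the smoothing constant are all our own choices.

```python
from collections import Counter
import re

# Tiny illustrative stop list; a real analysis would use a full Norwegian one.
STOP_WORDS = {"og", "i", "av", "for", "med", "som", "er", "en", "et"}

def tokenize(text):
    """Lowercased word tokens; a proper tokenizer would do better."""
    return re.findall(r"[a-zA-ZæøåÆØÅ]+", text.lower())

def candidate_terms(domain_texts, reference_counts, reference_total, top_n=50):
    """Rank terms by how over-represented they are in the domain corpus
    compared with a reference corpus of average language."""
    counts = Counter(tok for text in domain_texts for tok in tokenize(text)
                     if tok not in STOP_WORDS)
    total = sum(counts.values())

    def keyness(term, freq):
        rel_domain = freq / total
        rel_reference = reference_counts.get(term, 0.5) / reference_total  # smooth unseen terms
        return rel_domain / rel_reference

    return sorted(counts.items(), key=lambda tf: keyness(*tf), reverse=True)[:top_n]
```

The ranked output corresponds to the candidate term list handed over to the modeling committee; a production version would add lemmatization and concordance views.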


We are working on an approach to import existing definitions into our models and generate an initial graphical representation [40], which will give users a possible starting point for their conceptual modeling work.

In the KITH approach, a visual conceptual modeling language is used to keep an overview of the central concepts and their relations. In the example model from the domain of somatic hospitals, the initial word analysis proposed more than 700 terms, and the final result is an MS Word document listing 131 well-defined concepts together with some overview conceptual models, as shown in Fig. 4. As shown in the figure, the final domain "model" contains the terms, their definitions and their relations to other terms. Relations between concepts in this model are not well-defined; they are merely an indication of "relatedness" and are denoted "cross-references".

The WordSmith tool is based on statistical analysis. An alternative is to use a linguistically motivated tool like Lingsoft's NPTool [49], which extracts complete noun phrases from running text. Such an analysis may propose more complete noun phrases as candidates for concepts in the model and may also handle more linguistic variation. Lingsoft's NPTool uses hand-coded linguistic rules in its morphological analysis and constraint grammar parsing, before two different parallel finite-state parsers are applied to propose a complete set of noun phrases from a corpus [21].

Fig. 4. Example concept definitions and model from the domain of somatic hospitals [28] (our translation).
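NPTool itself is proprietary and based on constraint grammar. As a loose illustration of what noun-phrase extraction contributes to the candidate list, the following sketch uses NLTK's pattern-based chunker on English text; the grammar and the example sentence are ours and are far cruder than NPTool's analysis.

```python
import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' data packages

# A crude noun-phrase pattern: optional determiner, adjectives, then nouns.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

def noun_phrases(sentence):
    """Return the noun phrases found by pattern-based chunking."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    tree = chunker.parse(tagged)
    return [" ".join(word for word, _ in subtree.leaves())
            for subtree in tree.subtrees() if subtree.label() == "NP"]

print(noun_phrases("The hospital records a new diagnosis for the patient."))
# Tagger-dependent, but roughly: ['The hospital', 'a new diagnosis', 'the patient']
```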


To be able to experiment further with KITH's term-definition document in our approach to semantic document classification, we have chosen to use our own conceptual modeling language and tool, the Referent Model Language. We translate the input from KITH into a complete Referent model and embed the textual definitions within this model. We also perform what we have denoted a "linguistic refinement" of the model in order to prepare for the model-based classification of documents. For each of the concepts in the model, we add a list of terms that we accept as textual designators of this concept. This list of terms may be extracted from a synonym dictionary, e.g. [25]; today this is performed manually. We then run the model through an electronic dictionary and extract all conjugations for each of the concepts in the model and its corresponding list of terms.

The resulting Referent model is stored as an XML file and is now available for browsing and interaction across the Web in a Java-based Referent Model viewer. Fig. 5 shows the exploration of a fragment of the somatic hospital domain model. Using the model viewer, we are able to visualize for the users the conceptual model that will be used to classify documents.
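The "linguistic refinement" described above amounts to attaching, for every concept, its accepted terms and all their word forms. A minimal sketch of such a data structure is given below; the two lookup callables stand in for the synonym dictionary and the electronic dictionary used in our system, and are hypothetical placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    name: str                                  # canonical name, e.g. "pasient" (patient)
    definition: str                            # textual definition embedded in the model
    terms: set = field(default_factory=set)    # accepted designators (synonyms)
    forms: set = field(default_factory=set)    # all conjugations of all accepted terms

def refine(concept, synonym_lookup, conjugation_lookup):
    """Attach synonyms and their inflected forms to a model concept.
    Both lookup functions are hypothetical stand-ins for the synonym
    dictionary and the electronic dictionary."""
    concept.terms = {concept.name} | set(synonym_lookup(concept.name))
    for term in concept.terms:
        concept.forms |= set(conjugation_lookup(term))
    return concept
```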

Fig. 5. Exploration of concepts and definitions in the Referent model viewer.


As shown, the user may interact with the model: by clicking on a concept, he/she may explore the concept definitions as well as the list of terms and conjugations for this concept.

5. Classifying documents with conceptual model fragments

In our approach, a document is semantically classified by selecting a fragment of the conceptual model that reflects the document's content. Fig. 6 shows how linguistic analysis is used to match the document against the conceptual model and hence help the user classify the document.

The user provides the URL of the document to be classified. The document is downloaded by our document analysis servlet, which matches the document text with the concepts occurring in the model. This matching is done by comparing a sequence of words (a concept name in the model may consist of more than one word) from the document with the concepts in the model, i.e., it uses the conjugations found in each concept's term list. The result of this matching is a list of all concepts found in the document, sorted according to the number of occurrences, as well as a list of the document sentences in which the concepts were found. The concepts found are shown to the user as a selection in our Referent Model viewer. The selection shown in the model viewer of Fig. 5 (the greyed-out Referents: Patient, Health service and Treatment) is the result of matching the document whose URL is indicated at the top right corner of the model viewer (Fig. 7).

The user may then manually change the selection of concepts according to his/her interpretation of the document. The user may also select relations between concepts in order to classify a document with a complete model fragment, rather than with individually selected concepts. Once the user is satisfied with the selected model fragment, this fragment is sent back to the document servlet.
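The concept-matching step can be sketched as follows. This is our own simplified reconstruction, not the servlet's actual code: it treats the term lists as plain strings (multi-word forms included), splits the document into sentences with a naive rule, and returns both the occurrence counts and the evidence sentences that are later reused for relation naming.

```python
import re

def match_document(text, concepts):
    """Match document text against each concept's word-form list.
    `concepts` maps a concept name to its set of accepted word forms
    (multi-word forms included). Returns concepts ranked by occurrence
    count, plus the sentences in which each concept was found."""
    sentences = re.split(r"(?<=[.!?])\s+", text)  # naive sentence splitter
    counts, evidence = {}, {}
    for name, forms in concepts.items():
        # Longest forms first, so multi-word matches win over their parts.
        alternation = "|".join(re.escape(f) for f in sorted(forms, key=len, reverse=True))
        pattern = re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)
        hits = [s for s in sentences if pattern.search(s)]
        if hits:
            counts[name] = sum(len(pattern.findall(s)) for s in hits)
            evidence[name] = hits
    ranked = sorted(counts, key=counts.get, reverse=True)
    return ranked, evidence
```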

Fig. 6. Semantic classification through lexical analysis and conceptual model interaction.


Fig. 7. Suggesting relation names – example sentence analysis.

In order to add some semantics to the selected relations, it is important also to provide relevant relation names. For each of the selected relations in the model fragment, the servlet chooses all the sentences extracted from the document that contain both concepts participating in the relation. These sentences are first sent to a syntactic tagger [24] and from there to a semantic analyzer in Prolog. The semantic analyzer uses a set of semantic sentence rules to extract suggested names for each relation from the set of sentences.

As an example, take one of the documents classified in our document retrieval system. There are 11 sentences in this document that contain the domain concepts "helsetjeneste" (health service) and "pasient" (patient). One of the sentences is "Medarbeidere i helsetjenesten må stadig utvikle nye muligheter for å fremme helse, hjelpe pasientene og bedre rutinene" (The employees of the health service must steadily develop new methods to promote health, help the patients and improve the routines), and the results of tagging this sentence are shown in Fig. 7. When the disambiguation part of the tagger is completed, some syntactic roles are also added to the tagged sentence.

Using the results from the tagger as a basis, the system tries to propose a relation between the two concepts, health service and patient. Currently, the relation is formed by the part of the sentence linking the two concepts that remains after removing irrelevant sentence constituents. Constituents to remove are, for example, attributive adjectives ("new"), modal verbs ("must"), non-sentential adverbs, and most types of sentential adverbs ("steadily"). We also remove prepositional phrases and paratactic constructions that do not contain either of the two concepts in question ("promote health" and "improve the routines"). For the sentence above, the system proposes to use the following words to form a relation between the concepts (proposed words in bold face): The employees of the health service must steadily develop new methods to promote health, help the patients and improve the routines.

Fig. 8 shows how a user, based on the initial selection from Fig. 5, has selected a complete model fragment (the greyed Referents and relations that form the path from "Treatment" through "Diagnosis", "Condition" and "Patient" to "Health service"). In order to name a relation, a user must select a path between two of the found concepts. The example shows the output of the sentence analysis shown previously. The concept occurrences and the proposed relation names are marked with color, and the discarded parts of the sentence are greyed out. The user is provided with a list of all proposed relation names extracted from the found sentences, and may select the desired ones from this list and refine the proposed names.
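The constituent-removal heuristic can be sketched as a filter over the tagged sentence. The tag names below are invented for the illustration; the actual system works on Lingsoft's tagger output and applies richer, rule-based filtering (e.g., for prepositional phrases and paratactic constructions) that a flat tag filter cannot fully express.

```python
# Tag names here are invented; the real tagset of Lingsoft's Norwegian tagger differs.
DISCARD_TAGS = {"MODAL", "ADJ_ATTR", "ADV"}  # modal verbs, attributive adjectives, adverbs

def propose_relation_name(tagged_sentence, concept_a, concept_b):
    """Propose a relation name from the words linking two concepts,
    after discarding irrelevant constituents. `tagged_sentence` is a
    list of (word, tag, concept) triples, where concept is the matched
    model concept or None."""
    positions = [i for i, (_, _, c) in enumerate(tagged_sentence)
                 if c in (concept_a, concept_b)]
    if len(positions) < 2:
        return None  # both concepts must occur in the sentence
    # Keep only the span between (and including) the concept occurrences.
    span = tagged_sentence[min(positions):max(positions) + 1]
    kept = [word for word, tag, _ in span if tag not in DISCARD_TAGS]
    return " ".join(kept)
```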


Fig. 8. Refine model selection and suggest relation names.

To facilitate the naming of relations, users may also provide their own names for any derived relation path. All the relation names used when classifying a document are stored along with the conceptual model; they are available for browsing and can later be added to the list of proposed relation names when another user selects the same relation.

The completed model fragment with relation names now serves as the classification of the document. Our goal is to store classifications in conformance with the W3C-proposed RDF standard for Web-document descriptions, the Resource Description Framework [50]. Thus, we may translate the selected Referent model fragment into a set of RDF statements. By using the RDF syntax, we hope to be able to use available internet search machinery at a later stage. For now, however, we store the classifications in our own Referent Model XML format and use our own document servlet for searching.

In order to give a more complete presentation of the document, users should also specify a set of contextual meta-data attributes such as Document Title, Creator, Classification Date, URL, etc. These meta-data attributes are in a way orthogonal to the semantic classification of documents that has been our main concern, but they yield important additional information and may be used to answer different kinds of queries. Specific attributes may be needed according to the document types in question or depending on the actual retrieval system in use. In our system today, we use a selection of meta-data attributes from the Dublin Core (DC) [51] proposed standard for digital libraries, and we use DC's guidelines for generating RDF–XML syntax. The DC proposal contains an attribute called "subject", which is the one we use to store the representations of the semantic model fragment.
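As an illustration of the intended RDF–XML form, the sketch below emits a Dublin Core description with the model fragment flattened into dc:subject values. The statement encoding in the strings is our own invention; the actual classifications in the prototype use the Referent Model XML format described above.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)

def describe(url, title, creator, fragment_statements):
    """Serialize a document description as RDF/XML with Dublin Core
    attributes; the model fragment goes into dc:subject values."""
    root = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": url})
    ET.SubElement(desc, f"{{{DC}}}title").text = title
    ET.SubElement(desc, f"{{{DC}}}creator").text = creator
    for statement in fragment_statements:   # e.g. "Health service -- helps --> Patient"
        ET.SubElement(desc, f"{{{DC}}}subject").text = statement
    return ET.tostring(root, encoding="unicode")

print(describe("http://example.org/doc.html", "Somatic hospitals", "KITH",
               ["Health service -- helps --> Patient"]))
```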


6. Retrieving documents with NL queries and model browsing

Our document retrieval component includes graphical mechanisms for analyzing queries with respect to the domain vocabulary. The retrieval interface combines the analysis of natural language query input with the subsequent refinement of conceptual model query representations. Fig. 9 gives an overview of the retrieval process.

The users enter a natural language query phrase, which is matched against the conceptual model as in the classification process. The domain model concepts found in this search phrase (if any) are extracted and used to search the stored document descriptions. Verbs found between the concepts in the search phrase are extracted and matched against the stored relation names. Note that the relations are given a simpler analysis than in classification; using the same sentence analysis here would require the search phrase to be written almost exactly the same way as the original document sentence in order to produce a match.
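Since the paper's prototype servlet only uses a simple weighting scheme that we do not specify in detail, the following sketch merely illustrates the general idea: score each stored classification by its overlap with the query's concepts and matched relation names, and rank the documents by that score. The weights and the data layout are assumptions.

```python
def rank_documents(query_concepts, query_verbs, classifications):
    """Score each stored classification against the concepts and verbs
    extracted from the query phrase. `classifications` maps a document
    URL to its model fragment: a set of `concepts` and a set of labeled
    `relations` as (concept, name, concept) triples."""
    scores = {}
    for url, fragment in classifications.items():
        score = 2 * len(query_concepts & fragment["concepts"])        # concept overlap
        score += sum(1 for (_, name, _) in fragment["relations"]
                     if name in query_verbs)                          # relation-name hits
        if score:
            scores[url] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```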

Fig. 9. NL and model-based document retrieval.


Such relaxations have proven necessary in other work as well [37]. Our goal is not to develop new indexing or retrieval methods at this level. By storing our document descriptions in a standard format, we intend to use available or standard indexing and retrieval machinery as it becomes available. For the prototype interface shown in this paper, we use our own document servlet, which applies a simple weighting algorithm to match, sort and present documents.

Users may now refine their search by interacting with the domain model. Concepts found in the search phrase are marked in the model. Relation names found in the stored classifications are presented to the user in a list. The search may be narrowed by (a) selecting several concepts from the model, (b) following the generalization hierarchies of the model and selecting more specific concepts, or (c) selecting specific relation names from the list. Likewise, the search may be widened by selecting more general concepts or by selecting a smaller model fragment, i.e., fewer relations and concepts.

Fig. 10 shows our prototype search interface. The figure shows how users may interact with the model and retrieve a ranked list of documents. Documents are presented in a Web-browser interface using a selection of the stored DC attributes. The figure also shows what we have denoted the "enhanced document reader": when reading a document, each term in the document that matches a model concept is marked as a hyperlink. The target of this link is a "sidebar" with the information regarding this concept that is stored in the domain model, i.e., the definition of the concept, the list of accepted terms for this concept and its relations to other concepts. If relevant, this sidebar also contains a list of documents classified according to this concept.

Fig. 10. Selecting referents in the model → ranked list of documents → enhanced document reading.


This way, a user may navigate among the stored documents by following the links presented in this sidebar. Such a user interface visualizes the connection between the documents and the domain model, and may help the user become acquainted with the domain model. Note that the "enhanced document reader" not only works on classified documents, but is implemented as a servlet that accepts any document URL as input.

7. Related work

Shared or common information space systems in the area of CSCW, like BSCW [5], ICE [8] or FirstClass [11], mostly use (small) static contextual meta-data schemes, connecting documents to, for example, people, tasks or projects. These systems often use a "desktop-like" metaphor with folders as the main way of organizing documents, and then rely on freely selected keywords or free-text descriptions for the semantic classification. TeamWave Workplace [45] uses a "concept map" tool, where users may collaboratively define and outline concepts and ideas as a way of structuring the discussion. There is, however, no explicit way of utilizing this concept graph in the classification of information. An approach to this is found in the ConceptIndex system [48], where concepts are defined by attaching them to phrases, or text fragments, selected from their occurrences in the text. This way the concept definitions also serve as an index of the text fragments. The concepts are only defined by their various appearances, however, and the actual domain vocabulary is not visible in the approach. Some single-user systems like the Navigational Brain allow users to manually drag and drop their desktop files onto a free-hand drawn mind map. The TopicMap ISO standard [20] offers a way of linking various kinds of "instances" (files, pictures, text fragments, etc.) to a topic, and then navigating this material by following associations between the topics. Topic maps may be stored using SGML.

Ontologies [14,16,47] are nowadays being collaboratively created [7,15] across the Web and applied to the search and classification of documents. Ontobroker [9,10] and Ontosaurus [44] allow users to search and also annotate HTML documents with "ontological information". Domain-specific ontologies or thesauri are used to improve search expressions. The medical domain calls for precise access to information, which is reflected in several terminological projects, such as [13,29,35,39]. The formal ontological approaches used in connection with Web documents are often based on frame logic and are thus connected to a sound apparatus for reasoning. In our approach, the modeling language used to express the domain model has a sound formal basis, but what is more important in our case is the graphical notation. It is necessary for us to make the domain model available to the users and to make it actively useful in the classification and search interface.

Naturally, in order to facilitate information exchange and discovery, several "Web standards" are also emerging. The DC [51] initiative gives a recommendation of 15 attributes for describing networked resources. W3C's Resource Description Framework [50] applies a semantic-network-inspired language that can be used to issue meta-data statements about published documents. RDF statements may be serialized in XML and stored with the document itself. Pure RDF does, however, not include a facility for specifying the vocabulary used in the meta-data statements.
This is, however, under development through what is denoted RDF Schema [52], and there is interesting work on document query facilities that exploit the information in the RDF Schema when querying for documents [22].


Natural language analysis enables the use of phrases or semantic units, rather than "simple" words, in retrieval. Phrase extraction from text has led to advances in IR and has been the basis for linguistically motivated indexing (LMI) [33,41–43]. Arampatzis et al. [1] provide a survey of NLP techniques and methods, such as stemming, query expansion and word sense disambiguation, used to deal with morphological, lexical, syntactic and semantic variation when indexing text. A multiple-stream indexing architecture [41] shows how several such "simple" LMI techniques may be applied simultaneously and effectively in the TREC-7 full-text ad hoc task. In some systems, search expressions are matched with a conceptual structure reflecting a linguistic or logical theory, e.g. [23,30,31].

Understanding natural language specifications and turning them into formal statements of some kind in a CASE tool has been a challenge for many years [26]. For ER-like conceptual models, the naive approach is to use nouns and verbs as candidates for entities and relations, respectively. The NIAM conceptual modeling language [17] illustrates this approach when generating natural language explanations from the model. Some CASE tools adopt a particular grammar, or accept only a selective subset of natural language, in order to produce model statements. When constructing models from a larger set of documents, however, the system needs more sophisticated techniques for handling linguistic variation when proposing model elements. [12,46] give examples of CASE tools that have integrated advanced parsing and understanding [26].

8. Concluding remarks

We have presented an approach to semantic classification of documents that takes advantage of conceptual modeling and natural language analysis. A conceptual modeling language is used to create a domain-specific vocabulary for describing documents. Natural language analysis tools are used as an interface to the domain model: to aid users when constructing the model, to suggest classifications of documents based on the model, and to pre-process free-text search expressions. The conceptual modeling language visualizes the domain vocabulary and allows the users to use this vocabulary interactively when classifying documents. Users may browse the domain model and select a model fragment to represent the classification of a document. Natural language tools are used to analyze the text of a document and to propose a relevant model fragment in terms of concepts and named relations.

Our goal is to develop a method for domain-specific document classification and retrieval that can be targeted at the situations that are so common in today's organizations, where huge amounts of documents are produced and made available through various types of intranet solutions. As long as these documents are not deemed final and taken care of by the company's library system, for example while a project is ongoing, the users, or authors, often have to manage these documents themselves. For such tasks there are no library tools available, and standard web technology gives no answer. Our goal is to provide a tool that can aid the users with such local document management tasks. Our focus has been the semantic classification and retrieval of documents. Naturally, a classification and retrieval method alone is not a stand-alone tool. To be useful, it should be integrated into the already available intranet solution, and it should be connected to proper storage and retrieval machinery.
Our model-based classification and retrieval method is implemented in a "Web-style" prototype, which also includes a simple document management system. The user interface is built


around a visual model viewer implemented as a Java applet. Much of the system functionality is implemented as Java servlets built around a Web server. These servlets send the document texts to the relevant language analysis tools as needed. The user-created document descriptions are translated from a fragment in our visual modeling language into the RDF–XML serialization syntax. XML is also used in the communication between the model viewer and the document servlets. By adhering to this open standard, we hope to be able to use standard indexing and retrieval machinery in the future.

This paper reports on work in progress, and further work is needed in several directions. To improve the interface between the modeling tool and the linguistic tools, we want to speed up the construction of domain models with more automatic support. To enhance the matching of document texts against the conceptual model, further work is needed on the term lists used to match each of the model concepts. Today, synonyms are only added manually, and word forms are extracted from a dictionary. There is, however, work in linguistic IR that uses advanced algorithms for expanding on noun phrases and generating term lists from a corpus. Furthermore, our approach must be interfaced with proper indexing and retrieval machinery, so that it can be tested on a full-scale case. We are currently working on setting up an evaluation project together with a Norwegian company. In this project, the approach must be evaluated both with respect to the usability of the system and with respect to the quality of the resulting document classifications.

Acknowledgements

This work is supervised by Professor Arne Sølvberg at the Information Systems Group at IDI, NTNU. Special thanks to Professor Torbjørn Norgård, Department of Linguistics, NTNU, for providing the Norwegian dictionaries and for helping with the linguistic analysis. Thanks also to Hallvard Trætteberg and Thomas F. Gundersen for most of the Java programming, and to Ørjan Mjelde and Torstein Gjengedal for the Java-Prolog interface. Parts of this work are funded by the CAGIS (Cooperative Agents in the Global Information Space) project, sponsored by the Norwegian Research Council (NFR).

References

[1] A.T. Arampatzis, T.P. van der Weide, P. van Bommel, C.H.A. Koster, Linguistically motivated information retrieval, Technical Report CSI-R9918, University of Nijmegen, The Netherlands.
[2] L. Bannon, S. Bødker, Constructing common information spaces, in: 5th European Conference on CSCW, Kluwer Academic Publishers, Lancaster, UK, 1997.
[3] T. Brasethvik, A semantic modeling approach to meta-data, Journal of Internet Research 8 (5) (1998) 377–386.
[4] T. Brasethvik, J.A. Gulla, Semantically accessing documents using conceptual model descriptions, in: World Wide Web & Conceptual Modeling (WebCM'99), Paris, France, 1999.
[5] BSCW, Basic support for cooperative work on the WWW, http://bscw.gmd.de (accessed: May 1999).
[6] J. Bubenko, B. Johanneson, B. Wangler, Conceptual Modeling, Prentice-Hall, Englewood Cliffs, NJ, 1997.
[7] J. Domingue, Tadzebao and WebOnto: discussing, browsing, and editing ontologies on the web, in: 11th Banff Knowledge Acquisition for Knowledge-based Systems Workshop, Banff, Canada, 1998.
[8] B.A. Farshchian, ICE: an object-oriented toolkit for tailoring collaborative web-applications, in: IFIP WG8.1 Conference on Information Systems in the WWW Environment, Beijing, China, 1998.


[9] D. Fensel, J. Angele, S. Decker, M. Erdmann, H.-P. Schnurr, On2broker: improving access to information sources at the WWW, http://www.aifb.uni-karlsruhe.de/WBS/www-broker/o2/o2.pdf (accessed: May 1999).
[10] D. Fensel, S. Decker, M. Erdmann, R. Studer, Ontobroker: how to make the web intelligent, in: 11th Banff Knowledge Acquisition for Knowledge-based Systems Workshop, Banff, Canada, 1998.
[11] FirstClass, FirstClass collaborative classroom, http://www.schools.softarc.com/ (accessed: May 1999).
[12] G. Fliedl, C. Kop, W. Mayerthaler, H.C. May, C. Winkler, NTS-based derivation of KCPM perspective determiners, in: 3rd International Workshop on Applications of Natural Language to Information Systems (NLDB'97), Vancouver, Canada, 1997.
[13] Galen, Why Galen – the need for integrated medical systems, http://www.galen-organisation.com/approach.html (accessed: March 2000).
[14] T. Gruber, Towards principles for the design of ontologies used for knowledge sharing, Human and Computer Studies 43 (5/6) (1995) 907–928.
[15] T.R. Gruber, Ontolingua – a mechanism to support portable ontologies, Technical Report KSL91-66, Knowledge Systems Lab, Stanford University.
[16] N. Guarino, Ontologies and Knowledge Bases, IOS Press, Amsterdam, 1995.
[17] T. Halpin, Conceptual Schema and Relational Database Design, second ed., Prentice-Hall, Sydney, Australia, 1995.
[18] R. Hull, R. King, Semantic database modeling: survey, applications and research issues, ACM Computing Surveys 19 (3) (1986).
[19] ISO/DIS, Terminology work – principles and methods, ISO/DIS no. 704.
[20] ISO/IEC, Information technology – document description and processing languages, http://www.ornl.gov/sgml/sc34/document/0058.htm (accessed: March 2000).
[21] F. Karlsson, A. Voutilainen, J. Heikkilä, A. Anttila, Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text, Mouton de Gruyter, Berlin, 1995.
[22] G. Karvounarakis, V. Christophides, D. Plexousakis, Querying semistructured (meta)data and schemas on the web: the case of RDF and RDFS, http://www.ics.forth.gr/proj/isst/RDF/rdfquerying.pdf (accessed: September 2000).
[23] B. Katz, From sentence processing to information access on the world wide web, in: AAAI Spring Symposium on Natural Language Processing for the World Wide Web, Stanford University, Stanford, CA, 1997.
[24] Lingsoft, Lingsoft indexing and retrieval – morphological analysis, http://www.lingsoft.fi/en/indexing/ (accessed: March 2000).
[25] Lingsoft, NORTHES Norwegian thesauri, http://www.lingsoft.fi/cgi-pub/northes (accessed: March 2000).
[26] E. Metais, The role of knowledge and reasoning in CASE tools, HDR Thesis, University of Versailles.
[27] J. Mylopoulos, Information modeling in the time of the revolution, Journal of Information Systems 23 (3/4) (1998) 127–157.
[28] I. Nordhuus, Definisjonskatalog for somatiske sykehus (in Norwegian), http://www.kith.no/kodeverk/definisjonskatalog/defkat_somatiske/default.htm (accessed: March 2000).
[29] OMNI, OMNI: organising medical networked information, http://www.omni.ac.uk/ (accessed: May 1999).
[30] A. Puder, Service trading using conceptual structures, in: International Conference on Conceptual Structures (ICCS'95), Springer, Berlin, 1995.
[31] L.F. Rau, Knowledge organization and access in a conceptual information system, Information Processing and Management 21 (4) (1987) 269–283.
[32] K. Schmidt, L. Bannon, Taking CSCW seriously, CSCW 1 (1–2) (1992) 7–40.
[33] B. Schneiderman, D. Byrd, W. Bruce Croft, Clarifying search: a user-interface framework for text searches, D-Lib Magazine, January 1997.
[34] M. Scott, WordSmith tools, http://www.liv.ac.uk/ms2928/wordsmit.htm (accessed: January 1998).
[35] L. Soares de Lima, A.H.F. Laender, B.A. Ribeiro-Neto, A hierarchical approach to the automatic categorization of medical documents, in: CIKM'98, ACM, Bethesda, MD, USA, 1998.
[36] A. Sølvberg, Data and what they refer to, in: Conceptual Modeling: Historical Perspectives and Future Trends, in conjunction with the 16th International Conference on Conceptual Modeling, Los Angeles, CA, USA, 1998.
[37] K. Sparck-Jones, What is the role of NLP in information retrieval?, in: T. Strzalkowski (Ed.), Natural Language Information Retrieval, Kluwer Academic Publishers, Dordrecht, 1999.
[38] SPRI, Methods and principles in terminological work (in Swedish), Technical Report, SPRI rapport 481, Hälso- och sjukvårdens utvecklingsinstitut.
[39] Spriterm, Spriterm – hälso- och sjukvårdens gemensamma fakta- och termdatabas, http://www.spri.se/i/Spriterm/i-prg2.htm (accessed: March 2000).
[40] A. Steinholm, Automatisk graf-utlegg (automatic layout generation of graphs, in Norwegian), Spring Project Report, IDI, NTNU.
[41] T. Strzalkowski (Ed.), Natural Language Information Retrieval, Kluwer Academic Publishers, Dordrecht, 1999.
[42] T. Strzalkowski, F. Lin, J. Perez-Carballo, Natural language information retrieval: TREC-6 report, in: 6th Text Retrieval Conference (TREC-6), Gaithersburg, November 1997.
[43] T. Strzalkowski, G. Stein, G. Bowden-Wise, J. Perez-Carballo, P. Tapanainen, T. Jarvinen, A. Voutilainen, J. Karlgren, Natural language information retrieval: TREC-7 report, in: TREC-7, 1998.


[44] B. Swartout, R. Patil, K. Knight, T. Russ, Ontosaurus: a tool for browsing and editing ontologies, in: 9th Banff Knowledge Acquisition for Knowledge-based Systems Workshop, Banff, Canada, 1996.
[45] TeamWave, TeamWave Workplace overview, http://www.teamwave.com (accessed: May 1999).
[46] A.M. Tjoa, L. Berger, Transformation of requirement specifications expressed in natural language into an EER model, in: 12th International Conference on Entity-Relationship Approach, 1993.
[47] M. Uschold, Building ontologies: towards a unified methodology, in: 16th Annual Conference of the British Computer Society Specialist Group on Expert Systems, Cambridge, UK, 1996.
[48] A. Voss, K. Nakata, M. Juhnke, T. Schardt, Collaborative information management using concepts, in: 2nd International Workshop IIIS-99, post-proceedings published by IGP, Copenhagen, Denmark, 1999.
[49] A. Voutilainen, A short introduction to the NP tool, http://www.lingsoft.fi/doc/nptool/intro (accessed: March 2000).
[50] W3C, Resource Description Framework – working draft, http://www.w3.org/Metadata/RDF/ (accessed: March 2000).
[51] S. Weibel, E. Millner, The Dublin Core metadata element set home page, http://purl.oclc.org/dc/ (accessed: May 1999).
[52] W3C, Resource Description Framework Schema specification 1.0, http://www.w3c.org/TR/2000/CR-rdf-schema-20000327/ (accessed: May 2000).

Terje Brasethvik received his Master's degree in Computer Science from the Norwegian Institute of Technology in 1996. He was then employed by Norway's largest oil company, Statoil, to work on the research project HOD (Heterogeneous Organization of Data), before he went back to the university in 1997 to work on his Ph.D. His research interests are conceptual modeling, information systems engineering and computer supported cooperative work (CSCW).

Jon Atle Gulla received his Master's degree and Ph.D. in Computer Science from the Norwegian Institute of Technology in 1988 and 1993, respectively. In 1995 he finished his Master's degree in Linguistics at the University of Trondheim. He was a research scientist at the German National Research Center for Information Technology (GMD) in Darmstadt from 1995 to 1996, where he was involved in the development of a dialogue user interface to information retrieval systems. From 1997 to 1999 he worked as a senior consultant and project leader for Norsk Hydro's SAP project in Brussels. He is now a manager at FAST Search and Transfer and also serves as associate professor at the Norwegian University of Science and Technology in Trondheim. His research interests are in the areas of conceptual modeling, natural language semantics, and business engineering.