A Dutch medical language processor


International Journal of Bio-Medical Computing 41 (1996) 181-205


Peter Spyns*, Georges De Moor

Division of Medical Informatics, State University Gent, De Pintelaan 185 (5K3), B-9000 Gent, Belgium

Received 12 September 1995; accepted 22 April 1996

Abstract

This paper describes the current state of a medical language processor for Dutch. The goal is to implement a language-specific front-end compatible with existing applications that aim at the intelligent extraction and processing of information from patient discharge summaries. The ultimate result will be a complete chain for processing and understanding Dutch medical documents. The text focuses mainly on the language-specific aspects of the language processing chain. Evaluation results of the components that already function are given, as well as an outline of future developments and enhancements. To familiarise the non-experienced reader with the basic notions of computational linguistics, a short theoretical background precedes the description of each component (cf. also [1-3]: Rossi Mori et al., Proc. SCAMC 90, 1990, pp. 185-189; Wingert, in: Informatics and Medicine, an advanced course, Springer-Verlag, 1977, pp. 579-646; Wingert, Proc. MEDINFO 80, 1980, pp. 1321-1331).

Keywords: Medical language processing; Computational linguistics

1. Introduction

* Corresponding author, e-mail: [email protected].

Abbreviations: APT, Annotated Parse Tree; CEN TC251 PT002S, Centre for European Normalisation, Task Committee 251, Project Team 2S; CG, Conceptual Graph; DAG, Directed Acyclic Graph; DCG, Definite Clause Grammar; DMLP, Dutch Medical Language Processor; EMR, Electronic Medical Record; HPSG, Head-driven Phrase Structure Grammar; ICD-9-CM, International Classification of Diseases, 9th revision, Clinical Modification; KR, Knowledge Representation; LSP-MLP, Linguistic String Project Medical Language Processor; MENELAS, an Access System for Medical Records using Natural Language; NLP, Natural Language Processing; NLU, Natural Language Understanding; PDS, Patient Discharge Summary; RDBMS, Relational Database Management System; RG, Restriction Grammar; SNOMED, Systematised Nomenclature of Medicine; SQL, Structured Query Language; UMLS, Unified Medical Language System.

0020-7101/95/$15.00 © 1996 Elsevier Science Ireland Ltd. All rights reserved. PII S0020-7101(96)01198-1

1.1. The medical context

At scientific congresses and in the various medical informatics journals, a lot of attention has recently been paid to the electronic medical record (EMR) [4,5]. The EMR is expected to perform better than the current paper-based record with respect to physical availability, ease of reading, completeness and, more importantly, accessibility of data [6]. As a general rule, the physician uses documents of the medical record as a memory support when the patient returns to the hospital. The patient discharge summary, as a synthesis of the (previous) patient stay, is very well suited for such a task. A lot of information (especially the patient discharge summary) is stored in free text form. The use of natural language does not facilitate easy access to the wealth of information in the EMR. However, natural language is still the most frequently used and easiest way to transmit complex messages [4]. Hence, some authors consider the study and application of Natural Language Processing (NLP) in medicine as one of the most challenging issues in the field of medical information retrieval [7-10]. Natural Language Processing in medicine is a promising research area that has already delivered some important solutions [9,11-17].

The primary objective of the Dutch Medical Language Processor (DMLP) is to make the information in a patient discharge summary (PDS) available through advanced NLP and Knowledge Representation (KR) techniques [23]. The information of the PDSs will be stored in a logical representation (Conceptual Graphs), which allows a rule-based inference engine to deduce implicit knowledge and make it explicit. Information from the PDSs will be retrieved in an intelligent way: instead of using pattern recognition, the system will compare the 'meaning of the query' with the 'meaning of the data'. More complete and accurate answers to the queries will thus be formulated. As a corollary, the DMLP can encode PDSs according to the ICD-9-CM nomenclature [18]. Recent legislation in Western Europe concerning the encoding of PDSs stresses the importance of research in this field.

1.2. The linguistic context

Language is a form of communication between humans. A sender emits a message, represented by a specific combination of acoustic or graphic signs, to another person (the receiver), who shares some common-sense knowledge with the sender that should enable the receiver to understand the message. As written language is considered to be a representation of spoken language, most authors in theoretical linguistics take spoken language as the starting point for their theories. It was the philosopher Charles Morris who introduced the triplet 'syntax-semantics-pragmatics'. He stated that the study of pragmatics encompasses the complete environment of a person who speaks or hears. This includes semantics, that is, the study of meaning. Syntax examines the properties and structure of a language. The lexicon and morphology are sublevels of the syntactic level and concern the study of words and word changes (inflection, derivation and compounding).

The DMLP comprises morphological [24,25], syntactic [26,27], semantic and pragmatic analysis levels [28] for the Dutch medical sublanguage. The DMLP also makes use of a Knowledge Representation formalism (Conceptual Graphs) [29,30] and a Production System [31]. The knowledge to be represented and modelled [32,33] only concerns the medical subdisciplines cardiology and cardiac surgery. One of the (practical) objectives of the DMLP is to reuse existing resources and research results of other groups as much as possible¹. This policy holds for the language-dependent modules as well as for the language-independent parts.

Each level can produce ambiguities that can be partly resolved by the subsequent levels. Some ambiguities can only be resolved by knowledge of the real state of affairs described by the sentence (cf. infra). A good and complete NLP system should be able to handle all the mentioned levels (up to a certain extent). The question is how to integrate the three main levels. Some researchers propose a cascaded architecture in which the ambiguities of the lower levels are resolved subsequently by the higher levels. Another possibility is to work on one level and activate the 'linguistic machinery' of the other levels only when the default level leaves too many ambiguities unresolved. More information about this issue, as well as an overview of the most important NLP systems in medicine, can be found in [34].

¹ The DMLP reuses and integrates resources from several origins: some master's degree theses [19,20] as well as results from the PROTON I and II (internal K.U. Leuven funding [21]), EUROTRA (ET-CA-B and ET-II-CA-B [22]) and MENELAS (AIM #2023 [17,23]) projects (financed by the Directorate General XIII of the European Union).


Firstly, the various levels of analysis are presented from a theoretical as well as an implementational point of view. After a discussion of the linguistic levels (word (Section 2.1) and sentence (Section 2.2) analysis), the knowledge levels (meaning (Section 2.3) and pragmatic (Section 2.4) layers) are presented. Each time, the most important linguistic notions (Sections 2.1.1, 2.2.1, 2.3.1 and 2.4.1) are explained and their current implementation in the DMLP is clarified (Sections 2.1.2, 2.2.2, 2.3.2 and 2.4.2). Partial evaluation results (Section 3) are given together with an overview of remaining work (Section 4) before the conclusion (Section 5). Fig. 1 provides an overview of the DMLP architecture and the flow of information (showing only the language-specific knowledge sources). The numbers in the figure refer to the sections of this paper devoted to the corresponding items.

[Figure not reproduced. Labels still legible in the extracted diagram: the levels 2.1 Word, 2.2 Sentence and 2.4 Context; the components disambiguator, DAG, transducer, pragmatic analyser and index generator; and section numbers attached to each component.]

Fig. 1. Overview of the DMLP architecture and (partial) content table of the paper.

2. Current achievements

2.1. The word level

2.1.1. The linguistic background

2.1.1.1. The lexicon. With each word, a set of characteristics can be associated that determines the linguistic behaviour of that word, especially its lexical category. At the lexical sublevel, the words and their characteristics can be found in a dictionary. This information (syntactic or semantic) is necessary for a parser to analyse a sentence successfully. For example, a verb can be conjugated (person, tense and aspect) and a noun can be declined (plural and sometimes case), but an adverb remains invariable. The lexical category also reduces the set of possible syntactic functions a word can fulfil in the sentence (in principle, an adverb can never be the syntactic subject of a sentence). The linguistic information can be represented as a set of feature-value pairs or feature bundles (cf. Section 2.1.2.2). Words often have different meanings. The context of the sentence helps to clarify which meaning of the word is to be retained. Lexical ambiguity has to be resolved by the subsequent levels.

2.1.1.2. Inflection. This (sub)level takes care of mapping an inflected form (or surface form) to its canonical form (or base or lexical form) and of determining the inflectional characteristics. From now on, we will use the term inflection as a general term covering the declension of nouns and adjectives and the conjugation of verbs. The morphological knowledge used comprises lists of base forms, lists of suffixes, morphological rules that describe how base forms and suffixes can be combined into words, and lists of irregularities (e.g. 'has' is the 3rd person singular of the simple present of the infinitive (or canonical form) 'to have': the 3rd person singular present is normally marked by +s, so this form appears amongst the irregularities). Ambiguities arise when one form has several different morphological 'readings' (e.g. the infinitive of a regular English verb can also be the 1st or 2nd person singular or any plural person of the simple present). Of course, lexical and morphological ambiguities can be combined. For example, in Dutch the surface form 'was' can be (i) the simple past of 'to be', as well as (ii) a singular noun meaning 'wax' or (iii) a singular noun meaning 'laundry'. The other levels of analysis are needed for disambiguation: the syntactic level can suffice to discriminate between (i) and (ii)-(iii), but semantic knowledge is needed to distinguish between (ii) and (iii).
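The interplay of base forms, suffix rules and irregularities described above can be illustrated with a minimal sketch. It is written in Python purely for illustration (the DMLP itself is implemented in Prolog), and the word lists, the rule table and the function name `analyse` are hypothetical toy data, not the knowledge sources of the system.

```python
from typing import List, Tuple

# (canonical form, category, inflectional reading)
Reading = Tuple[str, str, str]

# Hypothetical toy knowledge sources, assumed for illustration only.
BASE_FORMS = {"walk": ["verb"], "work": ["verb", "noun"], "valve": ["noun"]}
SUFFIX_RULES = [
    # (suffix, category it applies to, reading it signals)
    ("s", "verb", "3sg present"),
    ("s", "noun", "plural"),
    ("ed", "verb", "simple past"),
    ("", "verb", "infinitive"),
    ("", "verb", "present, non-3sg"),
    ("", "noun", "singular"),
]
IRREGULAR = {"has": [("have", "verb", "3sg present")]}


def analyse(surface: str) -> List[Reading]:
    """Return every morphological reading of a surface form."""
    # Irregular forms are listed explicitly and looked up first.
    readings = list(IRREGULAR.get(surface, []))
    for suffix, category, reading in SUFFIX_RULES:
        if suffix and not surface.endswith(suffix):
            continue
        # Strip the suffix and check that the remainder is a known base form
        # of the category the rule applies to.
        base = surface[: len(surface) - len(suffix)]
        if category in BASE_FORMS.get(base, []):
            readings.append((base, category, reading))
    return readings


for form in ("has", "walk", "works"):
    print(form, analyse(form))
```

In this toy setting 'walk' receives two verbal readings and 'works' one verbal and one nominal reading, which is the kind of ambiguity that the later levels have to resolve.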

2.1.1.3. Derivation. This linguistic activity concerns the creation of new words. A suffix (-itis) or prefix (ab-) is added at the end or the beginning, respectively, of an already existing autonomous word or unbound morpheme (e.g. the suffix -al makes new adjectives (spinal) from nouns (spine)). Some extra rules have to be obeyed (spine → spin-al: the e is dropped, and the suffix -al is mostly used with Latinate nouns). When a suffix is used, the lexical category of the new word is determined by the suffix and differs from the category of the original word. The right part (suffix) is considered to be the head of the derivation, i.e. responsible for the grammatical characteristics. Words that are lacking in the dictionary can in many cases be (partly) characterised if the head is recognised (cf. Section 2.1.2.3). Specific to medicine (and terminologies in general) are confixes or neo-classical root forms (e.g. myo-card-itis), which are bound morphemes but have a proper meaning.

2.1.1.4. Compounding. When two (or more) unbound morphemes (or lexemes) are combined, a new compound is created. The spelling of a compound sometimes varies: written as one word (tummyache), combined with a hyphen (sickle-cell) or separated by a blank (stomach pain). Here again, the right compounding lexeme is considered to be the head in most cases. Compounding as a linguistic phenomenon is situated in the grey border zone between morphology and syntax, especially where complex compounds separated by blanks are concerned. Neo-classical compounding or confixation is a variant of compounding that only involves confixes (hepato-cholangio-cysto-duodenostomy). Derivation and (neo-classical) compounding can be used jointly to create new words, especially in medicine (e.g. ((liver)((cirrh)(osis)))).

2.1.1.5. Contextual disambiguation. A linguistic tool that combines lexical look-up and morphological analysis without making a syntactic analysis is called a tagger/lemmatiser. A tagger/lemmatiser uses contextual rules to reduce the ambiguities (but without assigning syntactic functions). Clearly, a sequence 'determiner noun' is to be preferred to the impossible sequence 'determiner conjugated verb'. A good tagger/lemmatiser can save a parser (cf. Section 2.2.2.2) a lot of work: since many ambiguities have disappeared, the parser has fewer combinations to examine. Because a parser has a much more complicated control structure (housekeeping variables, data structure for the parse tree, ...), fewer alternatives to try out means faster execution and better performance (depending on the parse strategy). However, not all ambiguities can be resolved.

2.1.2. The DMLP implementation

2.1.2.1. The lexical database. The syntactic lexicon for Dutch was built from several resources: an existing electronic valency dictionary for general Dutch (based on 'Van Dale's Handwoordenboek') and a list of words extracted from a medical corpus (415 anonymous cardiology PDSs). The existing electronic dictionary (resulting from the PROTON I and II projects [21]) and the newly coded entries were converted and merged into a common representation in a relational database. Amongst linguists, there seems to be a tendency to use a canonical form dictionary coupled to an inflectional rule base. This combination analyses the inflected forms by means of the morphological knowledge captured by the rules. However, we preferred a full-form dictionary to this approach because of the advantages offered by a Relational DataBase Management System (RDBMS), such as quick response time and ease of creation and maintenance. In our experiments, the 'full-form approach' clearly outperformed the 'canonical approach' by a factor of 10: the time to retrieve an inflected entry together with its syntactic information is much less than the time needed to perform 'normal' morphological analysis. Other advantages also hold (cf. [35,36]).


Columns: lexeme, category, category_type, canonical, verb_id, vg_rootpart2, vg_perf_hebben, vg_perf_zijn, vg_reflexive, vg_modal, vs_tense, vs_form, vs_number, vs_person, vs_preposition, as_conjugated, as_comparion, ng_de_het, ng_semantics, ng_sex, ns_number, ns_dimin, numeral_number, pron_deic, prondet_pers, prondet_sex, prondet_neg, pron_sem, pron_case, det_gen

de determiner def NULL NULL NULL 0 0 NULL 0 NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL no NULL NULL nneut

patient noun common patient NULL NULL 0 0 NULL 0 NULL NULL NULL NULL NULL NULL NULL de human malefemale sing no NULL NULL NULL NULL NULL NULL NULL NULL NULL

heeft verb aux hebben 18590 NULL 0 0 no 0 pres finite sing 3 NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL

heeft verb main hebben 18610 NULL 1 0 no 0 pres finite sing 3 NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL

stenose noun common stenose NULL NULL 0 0 NULL 0 NULL NULL NULL NULL NULL NULL NULL de ??? female sing no NULL NULL NULL NULL NULL NULL NULL NULL NULL

Fig. 2. Entries of the full form database dictionary for 'de patient heeft stenose'.

Every word in the main table is assigned to a category. The main table is not normalised: each row contains a surface form together with its canonical form and linguistic information. A second table contains the verbal subcategorisation rules. A numerical identifier functions as a foreign key from the main table to the valency table. As a verb can have several subcategorisation rules, there is no unique index for this table. The result of morphological analysis is stored with every word in the main table. If a word can be assigned to more than one category and/or receives multiple morphological analyses, it appears in more than one row of the main table (in other words, the word itself is not a unique key for the main table) [37]. Currently, there are some 100000 full forms in the lexical database (corresponding to some 8000 non-inflected forms) [27]. For the moment, the database contains mostly simple word forms. Neither complex word forms nor idiomatic expressions are yet handled in a conclusive manner. All the verb entries contain valency information. As an example, the database entries for 'de patient heeft stenose' (the patient has stenosis) are given in Fig. 2. Due to space constraints, we will not elaborate on the exact meaning of the linguistic features. The main features are lexeme (full form entry), category (syntactic class) and canonical (base form).

A software layer between the database dictionary and the syntactic parser shapes the dictionary information into the format needed by the parser (on the linguistic level as well as on the computational level, cf. also [36]). This enhances the reusability and independence of the data contained in the dictionary.
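The two-table layout described above (a non-normalised main table in which the surface form is not a unique key, plus a valency table reached through a numerical identifier) can be sketched as follows. This is only an illustration under stated assumptions: it uses Python with an in-memory SQLite database instead of the Sybase system of the DMLP, keeps only five of the columns of Fig. 2, and stores the valency frame as a plain string; the table names, the index and the function `lookup` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Main table: one row per reading of a full form, so lexeme is NOT unique.
    CREATE TABLE lexicon (
        lexeme        TEXT NOT NULL,
        category      TEXT NOT NULL,
        category_type TEXT,
        canonical     TEXT,
        verb_id       INTEGER
    );
    CREATE INDEX lexicon_lexeme ON lexicon (lexeme);
    -- Valency table: a verb_id may own several subcategorisation rules.
    CREATE TABLE valency (
        verb_id INTEGER NOT NULL,
        frame   TEXT NOT NULL
    );
    """
)
# Rows taken from Fig. 2 (reduced to the five columns kept in this sketch).
conn.executemany(
    "INSERT INTO lexicon VALUES (?, ?, ?, ?, ?)",
    [
        ("de", "determiner", "def", None, None),
        ("patient", "noun", "common", "patient", None),
        ("heeft", "verb", "aux", "hebben", 18590),
        ("heeft", "verb", "main", "hebben", 18610),
        ("stenose", "noun", "common", "stenose", None),
    ],
)
# Frame of the main-verb reading as shown in Fig. 3, stored here as text.
conn.execute(
    "INSERT INTO valency VALUES (?, ?)",
    (18610, "subj:human_nonhuman, dir_obj:human_nonhuman"),
)


def lookup(surface_form):
    """Return every reading of a surface form, with its valency frame if any."""
    return conn.execute(
        """
        SELECT l.category, l.category_type, l.canonical, v.frame
        FROM lexicon AS l
        LEFT JOIN valency AS v ON v.verb_id = l.verb_id
        WHERE l.lexeme = ?
        ORDER BY l.category_type
        """,
        (surface_form,),
    ).fetchall()


for word in "de patient heeft stenose".split():
    print(word, lookup(word))
```

A single indexed retrieval returns both readings of 'heeft' without any morphological analysis at run time, which is the trade-off the full-form approach relies on.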


[positie:1, nl_lu:de, lex:de, cat:det, pers:3, nb:_, nl_gender:nneut, sex:_, dtype:def | _ ]

[positie:2, nl_lu:patient, lex:patient, cat:n, pers:3, nb:sing, sex:malefemale, nl_gender:nneut, ntype:ord, nclass:common, frame:human | _ ]

[positie:3, nl_lu:hebben, lex:heeft, cat:v, parttype:no, verbpart:no, pers:3, nb:sing, nl_tense:pres, nl_vform:finite, verv:hebben, refltype:no, vtype:aux, modal:no | _ ]

[positie:3, nl_lu:hebben, lex:heeft, cat:v, parttype:no, verbpart:no, pers:3, nb:sing, nl_tense:pres, nl_vform:finite, verv:hebben, refltype:no, vtype:main, modal:no, frame:[subj:human_nonhuman, dir_obj:human_nonhuman | _ ] | _ ]

[positie:4, nl_lu:stenose, lex:stenose, cat:n, pers:3, nb:sing, sex:female, nl_gender:nneut, ntype:ord, nclass:common, frame:_ | _ ]

Fig. 3. DAG representation of feature bundles for linguistic information of Fig. 2. We will not explain in detail the meaning of all the various morphological features, but the first lexical entry concerns the surface form (lex:de) of the definite (dtype:def) determiner (cat:det) 'de' (nl_lu:de) for singular or plural (nb:_) masculine or feminine nouns (nl_gender:nneut and sex:_).

From the computational point of view, the extra software layer hides the database (in this case Sybase 4.9.1) and transforms the information from the database format into a feature bundle (cf. Figs. 2 and 3 and infra) containing the application-specific features. The other software components can only access the database by means of one public predicate of the software layer. That way, the underlying implementation of the lexical database can easily be adapted without the entire system having to be changed². From a linguistic angle, the software layer restricts and adapts the 'view' (just like SQL views) that the programs have on the content of a lexical entry. Thanks to this method, a lexical entry in the database can contain all sorts of linguistic information, while only the items relevant for a specific NLP application are passed through a 'software filter'. Besides the qualitative aspect, the filter can also affect the quantitative aspect by collapsing or expanding certain entries or excluding specific combinations after examination of the input (e.g. since the 2nd and 3rd persons of the present singular of many Dutch verbs are identical, they constitute one entry in the dictionary database, but are expanded into 2 separate forms).

² The tight coupling of the database and the Prolog layer is assured by the existing Sybase OpenClient/C module. Many SQL statements can thus be used directly as Prolog predicates, while the database tables can be considered as Prolog facts, which allows transparent backtracking on the data.
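The role of the software layer, and the use of feature bundles as filters by unification (treated in Section 2.1.2.2), can be made concrete with a small sketch. It is written in Python for illustration only, whereas the DMLP uses Prolog. Several points are assumptions rather than details from the paper: the column-to-feature mapping, the encoding of a shared 2nd/3rd person entry as "2/3", the example row for 'loopt', and the function names. Bundles are flat dictionaries here, so nested frames and shared variables are not covered.

```python
from typing import Dict, List, Optional

Bundle = Dict[str, str]

# Hypothetical 'view': only the listed database columns reach the application,
# renamed to application-specific feature names.
COLUMN_TO_FEATURE = {
    "lexeme": "lex",
    "canonical": "nl_lu",
    "category": "cat",
    "vs_number": "nb",
    "vs_person": "pers",
}


def row_to_bundles(row: Dict[str, Optional[str]]) -> List[Bundle]:
    """Filter and rename the columns of one row; expand collapsed persons."""
    bundle = {
        COLUMN_TO_FEATURE[column]: value
        for column, value in row.items()
        if column in COLUMN_TO_FEATURE and value is not None
    }
    # Assumed encoding: "2/3" stands for one stored entry covering two persons.
    persons = bundle.get("pers", "").split("/")
    if len(persons) > 1:
        return [dict(bundle, pers=person) for person in persons]
    return [bundle]


def unify(left: Bundle, right: Bundle) -> Optional[Bundle]:
    """Unify two flat bundles; an absent feature behaves like an unbound variable."""
    result = dict(left)
    for feature, value in right.items():
        if result.get(feature, value) != value:
            return None  # conflicting values: unification fails
        result[feature] = value
    return result


def select(bundles: List[Bundle], restriction: Bundle) -> List[Bundle]:
    """Keep the bundles that are unifiable with the 'filter' bundle."""
    return [b for b in bundles if unify(b, restriction) is not None]


row = {"lexeme": "loopt", "canonical": "lopen", "category": "verb",
       "vs_number": "sing", "vs_person": "2/3", "ng_sex": None}
bundles = row_to_bundles(row)
third_person = select(bundles, {"cat": "verb", "pers": "3"})
print(bundles)
print(third_person)
```

Because features are looked up by name rather than by position, adding a new feature to a bundle does not disturb existing filters, which is the flexibility attributed to graph unification.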

2.1.2.2. The feature bundles. The feature bundles constitute the main data structure of the morphological component. They are conceived as Directed Acyclic Graphs (DAGs), which are implemented as open-ended Prolog lists [38]. This 'low level' implementation is only known to the predicates that make up the interface. The interface consists of 'public' operators that can access (observers) or change (modifiers) the data structure. In the other modules of the DMLP, solely the public predicates of the DAG data structure are used. The knowledge about how the DAGs are actually implemented is confined to the DAG module itself. The feature-value pairs cannot be accessed directly as elements of an open-ended list (data abstraction and encapsulation). A special predicate performs the graph-unification operation, which turns out to be more flexible than the 'structure-unification' of Prolog, since the feature-value pairs need not occupy a fixed position within the structure. This is very handy for carrying out modifications (e.g. addition of new features). Also, graph-unification provides a neat and easy way to impose various restrictions. A linguistic restriction can be expressed in terms of feature-value pairs, which in turn can be represented as a DAG. This DAG acts as a filter towards other DAGs: the DAGs that are unifiable with the 'filter DAG' meet the imposed restriction. The only thing to do is to define the appropriate filters. Fig. 3 gives the DAG representation of the database entries shown before (cf. Fig. 2). These feature bundles are asserted in the dynamic database of Prolog so that they are easily accessible to the subsequent modules (cf. also [39]). The lexicon is also implemented as an abstract data type with only a few public predicates, so that the actual implementation of the data structure remains hidden. A lexicon entry consists of a triplet containing the token, its uniquely identifying number and the DAG with the linguistic information (cf. Fig. 4).

lexicon(de, 145686, [positie:1, nl_lu:de, lex:de, cat:det, pers:3, nb:_, nl_gender:nneut, sex:_, dtype:def | _ ])

lexicon(patient, 742539, [positie:2, nl_lu:patient, lex:patient, cat:n, pers:3, nb:sing, sex:malefemale, nl_gender:nneut, ntype:ord, nclass:common, frame:human | _ ])

lexicon(heeft, 1433770, [positie:3, nl_lu:hebben, lex:heeft, cat:v, parttype:no, verbpart:no, pers:3, nb:sing, nl_tense:pres, nl_vform:finite, verv:hebben, refltype:no, vtype:main, modal:no, frame:[subj:human_nonhuman, dir_obj:human_nonhuman | _ ] | _ ])

lexicon(stenose, 218765, [positie:4, nl_lu:stenose, lex:stenose, cat:n, pers:3, nb:sing, sex:female, nl_gender:nneut, ntype:ord, nclass:common, frame:_ | _ ])

Fig. 4. Morphological information asserted in the Prolog dynamic database to be used by the compounder and the syntactical parser after contextual disambiguation (cf. infra).

test_papa_present(_input, _) :-
    look_ahead(_input, _word, _, pastpart, nl_vform), !.
test_papa_present(_, _aux_num) :-
    delete_lex_v_dag(_, _aux_num, _).

Fig. 5. A contextual disambiguating rule (in Prolog). The predicate delete_lex_v_dag/3 is a public predicate of the abstract data types 'DAG' and lexicon. It removes a verb lexicon entry from the Prolog dynamic database (using its identifying number).

2.1.2.3. The category guesser. The lexical database contains all the inflected forms of the words of the corpus, so that morphological analysis is no longer necessary. However, since an exhaustive dictionary is an unrealistic assumption, a category guesser handles all the unknown word forms (cf. also [39]). Briefly, when an unknown inflected word is encountered, inflectional analysis (using separate routines for verbs and other categories) tries to generate a hypothetical canonical form. By means of several algorithms, these forms are split into subparts (cf. also [40,41]). This approach takes advantage of the fact that medical terminology is often created by agglutinating (derivation and (neo-classical) compounding) Latin and/or Greek word parts (e.g. peri-card-itis). To this end, the necessary lists of medical morphemes were created (76 prefixes, 414 confixes and 124 suffixes) (cf. also [42,43]). They comprise some general language suffixes and many medical terminals (cf. [1], p. 188). Words from the dictionary can also be used during this segmentation phase, since Dutch medical monolithic compounds are mostly created by simply agglutinating the composing lexemes. In contrast to standard Dutch, binding phonemes (like -e-, -s-, -en- or -er-) between the composing lexemes are seldom used [19]. The initial set of hypotheses is reduced by a cascading priority system, which takes into account inflectional rules, recognition by means of particular medical morphemes and typical sequences of ending characters. If no segmentation is possible, an extra routine checks whether the end string allows characterisation of the unknown word (e.g. in Dutch the end sequence -dt uniquely marks a 3rd person singular of the present simple [19]). When these knowledge sources do not permit identification of the unknown form, the hypothetical forms are considered to be nouns by default and are passed as pure guesses to the syntactic analyser in order to prevent an immediate failure. This module is implemented as an integrated category guesser [24].
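The cascade of the category guesser (segmentation into medical morphemes, then an end-string check, then a default noun guess) can be sketched as follows. The sketch is in Python for illustration, not in the Prolog of the DMLP. The morpheme lists are tiny hypothetical excerpts, the inflectional step that precedes segmentation is left out, and the names `segment` and `guess_category` are assumptions.

```python
from typing import List, Optional, Tuple

# Hypothetical excerpts; the paper reports 76 prefixes, 414 confixes, 124 suffixes.
PREFIXES = {"peri", "endo"}
CONFIXES = {"card", "myo", "hepat"}
SUFFIXES = {"itis": "noun", "aal": "adjective"}  # suffix (head) -> category


def _split_stem(stem: str, allow_prefix: bool) -> Optional[List[str]]:
    """Split a stem into an optional prefix followed by confixes."""
    if stem == "":
        return []
    candidates = CONFIXES | (PREFIXES if allow_prefix else set())
    # Try longer morphemes first so that the result is deterministic.
    for morpheme in sorted(candidates, key=len, reverse=True):
        if stem.startswith(morpheme):
            rest = _split_stem(stem[len(morpheme):], allow_prefix=False)
            if rest is not None:
                return [morpheme] + rest
    return None


def segment(word: str) -> Optional[List[str]]:
    """Return the morphemes of a word ending in a known suffix, or None."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            parts = _split_stem(word[: -len(suffix)], allow_prefix=True)
            if parts is not None:
                return parts + [suffix]
    return None


def guess_category(word: str) -> Tuple[str, str]:
    """Return (category, knowledge source used) for an unknown word."""
    parts = segment(word)
    if parts is not None:
        return SUFFIXES[parts[-1]], "segmentation"  # the suffix is the head
    if word.endswith("dt"):
        return "verb (3sg present)", "end string"
    return "noun", "default"  # pure guess, prevents immediate parser failure


for unknown in ("pericarditis", "myocarditis", "wordt", "xyz"):
    print(unknown, segment(unknown), guess_category(unknown))
```

The order of the three tests mirrors the priority idea: the most informative knowledge source is consulted first and the default noun reading is only a last resort.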

2.1.2.4. The contextual disambiguator. Once the morphological look-up is finished, contextual rules can be activated to make the ambiguous morphological analyses unambiguous (or at least to reduce the number of morphological readings) [45,46]. The DMLP takes into account not only the immediate left and/or right neighbour of a word in the sentence, but also the complete left or right context of that word, depending on the contextual disambiguation rule. For example, if a simple form of the verb 'hebben' (have) appears, the auxiliary reading is kept only if a past participle is present in the context. Unlike English, where the past participle is the immediate neighbour of the auxiliary, the Dutch past participle can be located at the end of the sentence. An extra difficulty for the Dutch medical sublanguage is the relatively high rate of omission of determiners (and even of auxiliaries of the passive voice). An example of such a contextual rule is given in Fig. 5. The auxiliary reading of 'heeft' (3rd person singular simple present) (vtype:aux) is removed (delete_lex_v_dag/3) in favour of the 'main' verb reading (vtype:main) because no past participle (nl_vform = pastpart), potentially requiring an auxiliary, is present in the input sentence (look_ahead/5). The result of applying this disambiguation rule (cf. Fig. 5) to the data of Fig. 3 is shown in Fig. 4.
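The logic of the rule in Fig. 5 can be restated as a small sketch: keep the auxiliary reading only when a past participle occurs somewhere in the sentence. The sketch uses Python for illustration and simplifies the original in ways that are assumptions: readings are plain dictionaries instead of lexicon entries in the Prolog dynamic database, the rule is applied to all auxiliary readings at once, and a token is never left without any reading. The example readings for 'gerookt' are hypothetical.

```python
from typing import Dict, List

Reading = Dict[str, str]
Sentence = List[List[Reading]]  # one list of candidate readings per token


def has_past_participle(sentence: Sentence) -> bool:
    """Scan the complete context, since a Dutch participle may come last."""
    return any(
        reading.get("nl_vform") == "pastpart"
        for readings in sentence
        for reading in readings
    )


def drop_unsupported_auxiliaries(sentence: Sentence) -> Sentence:
    """Remove auxiliary readings when no past participle could require them."""
    if has_past_participle(sentence):
        return sentence
    result = []
    for readings in sentence:
        kept = [r for r in readings if r.get("vtype") != "aux"]
        result.append(kept or readings)  # assumption: keep at least one reading
    return result


heeft = [
    {"lex": "heeft", "cat": "v", "vtype": "aux", "nl_vform": "finite"},
    {"lex": "heeft", "cat": "v", "vtype": "main", "nl_vform": "finite"},
]
# 'de patient heeft stenose': no participle, so only the main reading survives.
without_participle = drop_unsupported_auxiliaries(
    [[{"lex": "de", "cat": "det"}], [{"lex": "patient", "cat": "n"}],
     heeft, [{"lex": "stenose", "cat": "n"}]]
)
# 'de patient heeft gerookt': the participle licenses the auxiliary reading.
with_participle = drop_unsupported_auxiliaries(
    [[{"lex": "de", "cat": "det"}], [{"lex": "patient", "cat": "n"}],
     heeft, [{"lex": "gerookt", "cat": "v", "nl_vform": "pastpart"}]]
)
print(without_participle[2])
print(with_participle[2])
```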

2.1.2.5. The compounder. After the dictionary look-up (or activation of the category guesser and contextual disambiguator) for the words of the input stream, all the nouns are checked for adjacent 'noun neighbours' (N N compounding) (e.g. blood pressure) [25]. In general, Dutch compounds are monolithic, but under the influence of English, the compounding parts are sometimes separated by a blank [44]. Also, as the documents are written by non-linguists, errors can occur (e.g. woensdagavond vs. donderdag morgen (Wednesday evening vs. Thursday morning)) (Fig. 6). The rightmost word is considered to be the head and the compound inherits its grammatical features³. If the head part has one or more hypothetical analyses, the morphological analysis of the compound also consists of one or more possibilities. For the moment, we only consider a very limited number of compound rules (mainly involving proper nouns). This means that erroneous compounds can occur.

³ We are fully aware that linguistic reality is more complex. Sometimes, the left compounding part is the head [44]. Such a noun could have a specific lexical mark that identifies it as a head.

[positie:1, nl_lu:donderdag, lex:donderdag, cat:n, morf:dict, pers:3, nb:sing, sex:male, nl_gender:nneut, ntype:ntime, nclass:common, frame:abstract | _ ]

[positie:2, nl_lu:morgen, lex:morgen, cat:n, morf:dict, pers:3, nb:sing, sex:male, nl_gender:nneut, ntype:ntime, nclass:common, frame:abstract | _ ]

[positie:1, nl_lu:donderdag-morgen, lex:donderdag morgen, cat:n, pers:3, nb:sing, sex:male, nl_gender:nneut, ntype:ntime, nclass:common, frame:abstract, compound_nb:2 | _ ]

Fig. 6. The feature bundles of the composing lexemes and their compound.
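The compounding step illustrated in Fig. 6 (adjacent nouns are merged and the rightmost noun acts as head) can be sketched as follows. This is an illustrative Python rendering under stated assumptions, not the Prolog implementation: bundles are plain dictionaries, every pair of adjacent nouns is merged unconditionally (the paper applies only a limited set of compound rules), and the positions of later tokens are not renumbered. The function name `compound` is hypothetical.

```python
from typing import Dict, List

Bundle = Dict[str, object]


def compound(tokens: List[Bundle]) -> List[Bundle]:
    """Merge runs of adjacent nouns; the right-hand noun supplies the features."""
    result: List[Bundle] = []
    for bundle in tokens:
        previous = result[-1] if result else None
        if previous and previous.get("cat") == "n" and bundle.get("cat") == "n":
            left = result.pop()
            merged = dict(bundle)  # inherit grammatical features from the head
            merged["nl_lu"] = f"{left['nl_lu']}-{bundle['nl_lu']}"
            merged["lex"] = f"{left['lex']} {bundle['lex']}"
            merged["positie"] = left["positie"]
            merged["compound_nb"] = int(left.get("compound_nb", 1)) + 1
            result.append(merged)
        else:
            result.append(bundle)
    return result


tokens = [
    {"positie": 1, "nl_lu": "donderdag", "lex": "donderdag", "cat": "n",
     "nb": "sing", "ntype": "ntime"},
    {"positie": 2, "nl_lu": "morgen", "lex": "morgen", "cat": "n",
     "nb": "sing", "ntype": "ntime"},
]
merged_tokens = compound(tokens)
print(merged_tokens)
```

Merging unconditionally over-generates, which is consistent with the remark above that erroneous compounds can occur; a fuller version would consult compound rules before merging.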

2.2. The sentence level

2.2.1. The linguistic background

The syntactic analyser groups the words provided by the lexicon that have passed the morphological phase into constituents or syntagmas (noun phrase, relative clause, ...) and assigns a syntactic function to those constituents (e.g. subject, direct object or indirect object):

    ((the doctor)((gives)(the patient)(a drug))) = ((np:subject)((v)(np:indirect object)(np:direct object)))

Certain relationships between the constituents of a sentence should be respected (e.g. the syntactic subject should have the same person and number as the main verb). The admitted combinations between the constituents and the relationships to be respected (i.e. the syntactic knowledge) are encoded in grammar rules according to a predefined convention (the grammar formalism, cf. Section 2.2.2.1). The parser (cf. Section 2.2.2.2) is the mechanism that checks whether the submitted sentence can be analysed with the given grammar and represents the dependencies of the constituents by means of a tree (also called a parse tree). Other data structures and grammar formalisms can be used as well (e.g. a chart (hence chart parsers) or a kind of matrix for unification grammars). The main idea is to separate the grammatical knowledge (declarative knowledge: what) from the evaluation mechanism (inference engine: how). This independence allows one of the two to be changed without having to change the other (up to a certain extent, of course).

The syntactic level can clear up some ambiguities of the previous level. However, it still cannot solve purely lexical ambiguities ('was' as 'wax' or 'laundry'). For example, consider (i) operating patients is hazardous versus (ii) operating patients are hazardous (in analogy with [47], p. 179), where 'operating' can be considered as a verb or as an adjective. As we know that the syntactic subject and the main verb must have the same number, the number of the verb indicates unambiguously that for (i) the verbal reading holds (the present participle with 'patients' as its direct object is the subject of the sentence), while for (ii) the adjectival reading is valid ('operating' is an adjective modifying 'patients', the subject of the sentence).

Ambiguities specific to this level are called structural ambiguities. Typical sources of structural ambiguity are conjunctions, compound nouns and prepositional phrases. For example, the noun phrase 'ernstig aorta- en mitraliskleplijden' ('severe aortic and mitral valve suffering') leaves open the question whether it is only the aortic valve suffering that is severe or not. The syntactic level does not take any contextual knowledge into account⁴ (cf. Section 2.4.1). Sometimes possible compound nouns are extremely difficult to distinguish from their non-compound reading. A related problem (typical for medicine) is that of phrasal terms, i.e. strings of lexemes that identify a single medical concept (e.g. first metacarpal bone) ([1], p. 187). Should they be included as a single entity in the syntactic dictionary, analysed compositionally as a compound, or joined at the syntactic level in a complex nominal group (e.g. 'the left artery of the head' vs. 'the left artery')? On the syntactic level it is not always clear where in the parse tree a prepositional phrase should be attached (cf. Fig. 7). For example, the doctor examined the patient with a stethoscope. Once again, contextual knowledge allows us to interpret this sentence correctly, but a purely syntactic parser fails to disambiguate it (3 possible readings are given).
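The agreement argument for sentences (i) and (ii) can be illustrated with a minimal Python sketch. It is not part of the DMLP: the toy lexicon `VERB_NUMBER` and the helpers `subject_readings` and `disambiguate` are hypothetical names introduced only for this illustration, and the sketch assumes a three-word prefix "-ing form, plural noun, finite verb".

```python
# Hypothetical toy lexicon: grammatical number of two finite verb forms.
VERB_NUMBER = {"is": "sg", "are": "pl"}


def subject_readings(participle, noun):
    """Return both analyses of an '-ing form + plural noun' subject."""
    return [
        # Verbal reading: the -ing form heads the subject, which is singular.
        {"reading": "verbal", "head": participle, "number": "sg"},
        # Adjectival reading: the plural noun heads the subject.
        {"reading": "adjectival", "head": noun, "number": "pl"},
    ]


def disambiguate(sentence):
    """Keep only the readings whose number agrees with the finite verb."""
    participle, noun, verb = sentence.lower().split()[:3]
    return [
        r["reading"]
        for r in subject_readings(participle, noun)
        if r["number"] == VERB_NUMBER[verb]
    ]


print(disambiguate("operating patients is hazardous"))   # ['verbal']
print(disambiguate("operating patients are hazardous"))  # ['adjectival']
```

The prepositional phrase example cannot be settled this way, since all of its readings satisfy the agreement constraint.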

⁴ NLP systems with an integrated architecture combining the syntactic level with the semantic and pragmatic levels are able to pick out the correct interpretation immediately.

2.2.2. The DMLP implementation

2.2.2.1. The grammar formalism. In order to reuse existing pieces of lingware, the English grammar developed by the New York Linguistic String Project (LSP) [48] was taken as a starting point. The syntactic analyser for Dutch uses the Restriction Grammar (RG) as the underlying grammar formalism, which is the Prolog version of String Grammar [49]. An example of RG-rules can be found in Fig. 8. The grammar rules state that all the sentences can be introduced by some connectives ('en', 'maar', 'of', 'want'), have to be concluded by a punctuation mark ('.', '?', '!'), and that a particular realisation of a sentence ('assertion1') consists of a sequence of a 'subject', a tensed verb ('ltvr') and an 'object', between which agreement ('w_agree') and verbal valency ('w_frame') restrictions hold.

    [Parse tree: sentence = subject (det:the)(n:doctor) + verb phrase (v:examines) with object (det:the)(n:patient); the prep_object (p:with) with its object (det:a)(n:stethoscope) can be attached at any of the nodes marked '?'.]

Fig. 7. Example of a parse tree containing a structural ambiguity (prepositional phrase attachment).

    sentence ::= introductor, center, endmark.
    introductor ::= [en]; [of]; [maar]; [want]; [].
    endmark ::= [.]; [?]; [!].
    center ::= assertion1; assertion2.
    assertion1 ::= subject, sa, ltvr, {w_agree}, sa, object, {w_frame}, sa.

Fig. 8. Excerpt of the RG rules.

RG-rules consist of a set of context-free production rules interspersed with Prolog predicates that restrict the combinatorial possibilities of the production rules. The restrictions operate on the context-sensitive information contained in the parse tree under construction and the input stream. The restrictions are implemented by means of special functions (locating routines, e.g. core) that permit navigation from one node in the tree to another. Fig. 9 shows the restriction that validates or rejects the attachment of a prepositional phrase (_pn_node) introduced by the preposition _prep under the verb phrase node (cf. Fig. 7) as a prepositional object. If the valency frame _v_frame of the verb (cf. Section 2.3.1.1 and Fig. 14) contains that particular preposition (fetch_value(_v_frame, prep, _prep)), the restriction checks if the semantic feature (_n_feature) of the head or core noun of the prepositional phrase (_head_pn_node) is compatible (check_features/2) with the semantic feature (_v_feature) of the prep_phr slot of the verbal valency frame (fetch_value(_v_frame, prep_phr, _v_feature)).

RG is also used in the PUNDIT-system [50] which constitutes the basis of other NLP research projects (e.g. SPECIALIST [51]). As the LSP grammar has proved to lead to useful large scale NLP programs, we preferred the RG formalism and its older underlying linguistic theory to the more up-to-date formalisms (like LFG and HPSG) that are only slowly leaving the experimental environment of toy grammars⁵. A modest grammar for the Dutch medical language has been built as the result of the study of some 100 sentences of PDSs [20]. The grammar was refined (but not exhaustively) after a test involving some 40 extra PDSs. Currently, there are some 208 RG-rules and 130 restrictions. Special emphasis was put on nominal constructions since these appear very frequently in the medical sublanguage. The main advantage of using verbal valency information (cf. Section 2.3.2.1) is the rejection at the syntactic level of some incorrect parse trees, especially concerning the attachment of prepositional phrases (cf. Fig. 7).

⁵ A discussion of the most important NLP in Medicine projects and an overview of their theoretical and implementational characteristics are offered in [34]. In this respect, it is useful to consider again the ideas of K. Jensen about linguistic theories and practical applications [52].

2.2.2.2. The parser. Basically, the RG-parser does not differ substantially from a DCG-parser. It is a top-down, left-to-right parser that currently returns only one complete parse tree. The data structure used for the parse tree is the one proposed by Hirschman and Puder [53]. The parse tree together with its operators is defined as an abstract data type. The implementation of the data structure is hidden so that the other program modules can only act upon the parse tree by means of some public functions which constitute the interface of the abstract data type [26]. This software engineering technique limits the cascade of changes that are unavoidable when the implementation of the data structure is updated.

2.2.2.3. The parse tree transducer. As the development of the DMLP is to be situated in a larger polylingual NLP-project (MENELAS A.I.M. #2023) [17,27], the DMLP has a staged or cascaded architecture. The language specific levels (morphology and morphosyntax) are treated before the more language independent levels (semantics and pragmatics). Therefore, a transducer is needed to convert the various parse trees (for Dutch and French) into a common structure [54]. Each RG-parse tree is transduced into an Annotated Parse Tree (APT) [27]. While building the APT, some linguistic information (concerning long distance dependencies) is rearranged. This information (e.g. about the antecedents of a relative clause), which in the RG parse tree is dispersed over multiple nodes, is regrouped in the APT. The APT, in fact, contains a regularised and rearranged parse tree (e.g. fixed slots for the various syntactic constituents). An example of an APT can be found in Fig. 10. Of course, it is possible to rearrange the RG parse tree in such a way that other NLP programs can make use of the result after sentence analysis (e.g. the Specialist system [51] or the LSP-MLP [55]). The degree of reusability will depend on the compatibility of the morphosyntactic (and semantic) labels and features used by both systems.

2.3. The meaning level

2.3.1. The linguistic background

The semantic level relates the words to concepts and checks whether the semantic relationships hold. Therefore, a dictionary entry should also contain semantic features. We can distinguish between features like (human, non-human, abstract) and the thematic roles (or cases) like agent, object or patient (non-medical meaning), instrument.

    test_prep_object(_v_frame, _pn_node, _prep) :-
        fetch_value(_v_frame, prep, _prep),
        core(_pn_node, _head_pn_node),
        tree_features3(_head_pn_node, _n_feature, frame),
        fetch_value(_v_frame, prep_phr, _v_feature),
        check_features(_v_feature, _n_feature).

Fig. 9. Excerpt of the RG restriction rules.
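For readers less familiar with Prolog, the restriction of Fig. 9 (discussed in Section 2.2.2.1) can be paraphrased in Python. This is only a sketch under stated assumptions: the dictionary layout of the frame and of the prepositional phrase node, the name `prep_object_allowed`, and in particular the compatibility test inside `check_features` (equal features, or a verb slot that accepts both human and nonhuman fillers) are guesses made for illustration, not the DMLP code.

```python
def fetch_value(frame, slot):
    """Return the value stored in a slot of a valency frame (None if empty)."""
    return frame.get(slot)


def check_features(v_feature, n_feature):
    """Assumed compatibility test: equal features, or the verb accepts both."""
    return v_feature == n_feature or v_feature == "human_nonhuman"


def prep_object_allowed(v_frame, pn_node, prep):
    """Accept the PP as prepositional object only if preposition and feature fit."""
    if fetch_value(v_frame, "prep") != prep:
        return False
    head = pn_node["core"]  # stands in for the 'core' locating routine
    return check_features(fetch_value(v_frame, "prep_phr"), head["feature"])


# Frame loosely modelled on the 'hebben ... met' reading of Fig. 14.
HEBBEN_MET = {"subject": "human", "prep": "met", "prep_phr": "human"}
pp_patient = {"prep": "met", "core": {"lemma": "patient", "feature": "human"}}
pp_stenose = {"prep": "met", "core": {"lemma": "stenose", "feature": "nonhuman"}}

print(prep_object_allowed(HEBBEN_MET, pp_patient, "met"))  # True
print(prep_object_allowed(HEBBEN_MET, pp_stenose, "met"))  # False
```

As in the Prolog original, a failed check does not raise an error; it simply rejects this attachment so that the parser can try another one.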

    [[sent(decl,indicative,present,-],
      np([m,s,3], det(de,[_,_,3],def,nil,1,2), nil,
         noun(subs, patient, O, [m,s,3], nil,2,3), nil,nil,nil,nil,2,3),
      nil,nil,nil,nil,-,nil,nil,1,3),
      vp(verb(other,indicative,present,active,_,[_,s,3],hebben,_,3,4), nil,
         np([f,s,3], det(dummy,nil,nil,nil,_,_),
            noun(subs(stenose,O,[f,s,3],nil,4,5),[f,s,3],nil,4,5),
            nil,nil,nil,nil,-,nil,nil,4,5),
         nil,nil,nil,nil,-,nil,nil,nil,1,5), nil,-,nil,nil,1,6)]]

Fig. 10. Example of a Menelas APT for the sentence 'de patient heeft stenose'.

2.3.1.1. The semantic features. The semantic features characterise the nouns and pronouns. The subcategorisation rules of adjectives and verbs determine which syntactic elements can appear in their distribution or context. Chomsky [56] distinguished two types of subcategorisation rules: strict subcategorisation and selectional rules. The former are more syntactically oriented and determine whether a verb is intransitive, direct or indirect transitive (prepositional object), or whether the object is an indirect question or a subordinate clause introduced by 'that'. The latter are more concerned with whether the (pro)nouns in the subject and various object positions possess the features (non)human or abstract. The distribution of a particular verb or adjective as it is defined by the combination of both types of rules is called the valency (cf. Section 2.3.2.1). The valency of adjectives and verbs allows resolution of many of the structural ambiguities (cf. Figs. 7 and 11). Here, the border between pure syntax and semantics cannot be clear cut, because the valency rules can be determined in a purely formal and distributional way (thus without taking the meaning into account), in the same way as is done for syntax rules. Nevertheless, the valency rules belong to semantics, since sentences can be syntactically correct but semantically meaningless. For example, consider the well-known example of Chomsky: colourless green ideas sleep furiously.
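The interplay of the two rule types can be sketched in Python for the verb 'give' (cf. Fig. 11). The frame list `GIVE_FRAMES` and the function `licensed` are hypothetical and cover only two of the distributions; the sketch merely shows that strict subcategorisation constrains which slots are present, while the selectional rules constrain the features of their fillers.

```python
# Hypothetical valency entries for 'give', loosely following Fig. 11.
GIVE_FRAMES = [
    {"subject": "human", "dir_obj": "nonhuman"},
    {"subject": "human", "dir_obj": "nonhuman", "indir_obj": "human"},
]


def licensed(frames, roles):
    """True if some frame matches both the slots and the features of `roles`."""
    for frame in frames:
        # Strict subcategorisation: exactly the same syntactic slots.
        same_slots = frame.keys() == roles.keys()
        # Selectional rules: every filler carries the required feature.
        if same_slots and all(frame[slot] == roles[slot] for slot in frame):
            return True
    return False


# 'the doctor gives a drug'
print(licensed(GIVE_FRAMES, {"subject": "human", "dir_obj": "nonhuman"}))  # True
# 'a drug gives the doctor'
print(licensed(GIVE_FRAMES, {"subject": "nonhuman", "dir_obj": "human"}))  # False
```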

    the doctor gives                        [subject: human and no objects (intransitive)]        (?)
    the doctor gives a drug                 [subject: human and direct object: nonhuman]          (1)
    the doctor gives a patient a drug       [(1) + indirect object: human]                        (2)
    the doctor gives a drug to a patient    [(2) + preposition: to]                               =>
    the doctor gives a drug in the morning  [(1) + sentence adjunct]                              => (1)
    a drug gives the doctor                 [subject: nonhuman and direct object: human]          (*)

Fig. 11. Example of some distributions for the verb give. In linguistics, "(?)" is used to indicate "deviating but accepted" sentences, while "(*)" marks ungrammatical constructions or interpretations.

2.3.1.2. The conceptual structure. As we are concerned with the understanding and processing of the meaning of a sentence, it is necessary to represent that meaning by some (language-independent) conceptual structure. In general, such a conceptual structure consists of attribute-value pairs linked to each other by conceptual relationships represented in a kind of logical formalism for which general logic operations are defined (e.g. an operator that combines two conceptual structures). A semantic concept is e.g. 'agent of the action' or 'instrument'. A powerful formalism allows new concepts and relationships to be defined using a set of predefined concepts and relationships. A semantic representation consists of three basic elements: the formalism (and its logical operators), the set of concepts and the set of relationships between the concepts.

2.3.1.2.1. The representation formalism. The representation formalism functions as a carrier for the meaningful elements. Together with the formalism, an inference engine is implemented that is able to work with the information represented by means of the formalism (cf. the section on the grammar and the parser). A good formalism allows the meaningful elements to be easily combined by operators into new and more complex conceptual structures (e.g. the join operator unifies two conceptual structures). All subsequent examples will use the Conceptual Graph formalism (CG) [30]⁶.

⁶ In a conceptual graph, boxes or square brackets indicate concepts, and circles or parentheses indicate conceptual relations.

2.3.1.2.2. The concepts. The meaningful elements are the primitive concepts discovered when modelling the application domain. The concepts are ordered by means of the is_a and/or part_of relationships (e.g. a human is a living being and its head forms part of its body). Mechanisms such as property inheritance become applicable if the concepts are related to each other by the mentioned relations (in a tree or lattice structure). Assertions about living beings are then by default valid for humans (unless the default knowledge of the parent node is overruled by more specific knowledge of the child node). It will be clear that a good selection and definition of the concepts is extremely important. This is a job for the knowledge engineer. An excerpt of a concept type lattice is given in Fig. 12, where x < y stands for x is_a y.

2.3.1.2.3. The semantic relationship. The relationships between the concepts also have to be carefully selected and defined, taking into account the already established relationships (e.g. the relation 'loc_front' [is located in front of] is more specific than 'loc' [is located somewhere]). The relationships have to be semantically restricted. Such semantic restrictions reduce the number of possible combinations of concepts and relationships. The restriction mentioned in Fig. 13 states that an action or process (state_change_producer) is helped (inst.tool) by a physical object [57]. For example, the graph [drug_treatment] --> (inst.tool) --> [aspirin] conforms to the mentioned semantic restriction because in the concept hierarchy drug_treatment is a state_change_producer and aspirin is a physical_object (cf. Fig. 12)⁶.

    drug_treatment < non_invasive_treatment < treatment < procedure < state_change_producer
    aspirin < anticoagulant < therapeutic_substance < medical_tool < physical_object

Fig. 12. Excerpt of the Menelas Concept Type Lattice.

    POSITIVE CATALOG R AB inst.tool is
        [state_change_producer] --> (inst.tool) --> [physical_object]

Fig. 13. Example of a Menelas semantic relationship.

2.3.1.2.4. The intensional semantic network. We can speak of a semantic network when the concepts are interrelated by means of semantic relationships. Another, more philosophical, name for the set of primitive concepts and relationships is the ontology. The definition of an ontology is a far from trivial task and many pitfalls are to be avoided [58]. Some large scale projects concerning domain modelling already exist [12,14,32,33]. The field of medicine has the great advantage that its concepts and relationships are largely universal. This advantage is proved and reinforced by the existence of various international nomenclatures (e.g. SNOMED III), classifications (ICD-9/10), normalisation committees (CEN TC 251 for medicine) [59,60] and an already existing large semantic network (UMLS) [61]. The mentioned ontologies do not include instances of the concepts that refer to entities in the real world. Therefore, these networks are said to be intensional or conceptual.

2.3.1.2.5. The state of affairs. Sentences and meaning do not, in many cases, exist on their own but refer to something, i.e. an object in reality or a state of affairs. The item referred to is called the referent or extension. For example, in medicine, a patient receives a unique number and this number always refers to that real patient. In his discharge summary, one can read expressions like 'the patient', 'he', 'our patient', 'Mr. X'. They all refer to the same entity in reality, the referent. Definite noun phrases, pronouns and relative clauses refer to items (already mentioned in the text). In the context of the discourse (e.g. a patient discharge summary), the knowledge contained in the text together with a certain amount of common sense knowledge allows resolution of the referents, i.e. coupling some 'undetermined' semantic expression to a 'unique point of reference' in the extra-linguistic world.

2.3.2. The DMLP implementation

2.3.2.1. The valency information. As already mentioned (cf. Fig. 9), the valency rules intervene at the morphosyntactic level. As RG allows the insertion of any restriction inside the grammar rules, it was rather straightforward to add this kind of semantic restriction to the syntactic grammar. This strategy limits the number of possible parses as early as possible, so that the subsequent analysers (semantic and pragmatic) do not waste processor time on semantically meaningless parses. Because the valency information in many cases allows prepositional phrases to be attached to the correct node in the parse tree, the number of structural ambiguities is also reduced drastically, which again saves processor time. A kind of integration between syntactic and semantic processing is thus achieved. As the valency information is language specific and partly domain independent, activating this type of semantic knowledge on the syntactic level fits perfectly in the architecture of the DMLP. An example of the verbal valency information available in the dictionary database shows the difference between 'hebben' as a direct transitive verb and as an indirect transitive verb requiring the preposition 'met'⁷. The indirect transitive reading is only given when the required preposition is indeed present in the sentence (cf. Fig. 3). This is an example of the pruning functionality of the software layer between the dictionary database and the parser (cf. Section 2.1.2.1; Fig. 14).

⁷ The auxiliary reading with verb id 18590 is not shown here, but is present in the dictionary (cf. Fig. 2).

2.3.2.2. The conceptual structure. The semantic analyser takes as input the APTs generated by the morphosyntactic component (cf. Section 2.2.2.3) for a given sentence and transforms them into one (or more) CGs that represent the content of the sentence [63]. Whether one or several CGs are delivered for a given sentence depends on whether a full parse has been produced after the syntactic analysis or not, and on the remaining syntactic or semantic ambiguities. The semantic analyser functions on a language independent⁸ base but needs language dependent knowledge, like the entries of the semantic lexicon. An entry in the semantic lexicon couples a lexeme to a language dependent concept (characterised by the _d suffix). Some examples of the semantic dictionary entries are shown in Fig. 15. The main function of the semantic analyser is to link together the CGs of the semantic dictionary entries⁹, respecting the syntactic information provided by the APT (e.g. the obj and subj labels). Therefore, the linking process evaluates in a language independent way the semantic composition rules, which are partly language dependent. An example of such a compositional rule can be found in Fig. 16. The example rule shows that the CG for the sentence results from the directed join (LL) of the CG of the syntactic subject (subj >> np >> graph: the APT has a subj slot containing an np slot) with the CG of the vp slot in the APT (vp >> graph), the CGs being built recursively using the CGs of the semantic dictionary entries. The operation tries to match the first graph, which must contain a concept coupled to the label 'head' (syntactic subject), with the second graph, which has to include a concept linked to the label 'subj' (semantic object¹⁰), and joins the two graphs (into the new graph '_gsent') on the resulting concept with label 'obj', while comparable edges in the resulting graph are folded. In more comprehensible words, the syntactic subject of the passive voice becomes the semantic object.

⁸ The language independent evaluation mechanism has been implemented by the team of the IBM Centre Scientifique de Paris, 68-76 Quai de la Rapée, F-75592 Paris Cedex 12 (cf. [29,62]).

⁹ The information needed for the determiners is provided entirely by the APT and some general rules (and not by a dictionary entry).

¹⁰ Another semantic composition rule has already switched the values of the subject and object slots of the semantic lexical entry for the verb.

The CG in Fig. 17 results from the semantic analysis combining the semantic lexical entries (cf. Fig. 15) with the APT (cf. Fig. 10) as defined by the semantic composition rules. The example implies that an identical semantic structure can correspond to various different syntactic and/or lexical expressions (syntactic and semantic paraphrase ([64], p. 4); Fig. 18).

Another important function of the semantic analyser is to compute all the concept referents which have been left undefined up till then. The anaphora (or referent) resolver creates and handles a stack of concepts and associated integers which correspond to the world individuals encountered before in the discourse [63]. When an unresolved referent occurs, the anaphora resolver searches the stack for all the candidates whose concept types are compatible with the concept to be resolved. Here again, the evaluation mechanism is language independent but allows the rules to be parameterised for a particular language. Figs. 19 and 20 show what happens when, after the sentence of Figs. 17 and 18, a new sentence 'hij werd behandeld' (he was treated) is analysed. The pronoun refers to a masculine singular noun. There is only one candidate referent ('patient') since the other candidate ('stenose') in the reference stack is a feminine noun. Instead of the personal pronoun, the CG contains the subgraph of 'patient' (cf. Fig. 15).

Modelling of the domain (here cardiology) resulted in a set of concepts (the ontology) that are ordered semantically (Concept Type Lattice) as well as hierarchically (Relation Tree) [33] (cf. Figs. 12 and 13). Several semantic relationships are taken into account (e.g. possessor, agens, patiens, cause, ...). An extra normalisation step was introduced to ensure that a canonical language independent representation of the informational content of the sentences is produced. Each language dependent concept is defined in terms of a language independent concept or by a conceptual definition (Linguistic Type Definition) (Fig. 21). Both definitions use the CG notation. This approach facilitates the construction and reuse of the semantic dictionaries for the languages involved [65]. Linguistic Relation Definitions (Fig. 22) provide the link between language dependent prepositions and language independent paths of conceptual relations by searching for the best path in the knowledge models between concepts linked by a linguistic relation. Each linguistic relation is defined by its preferences for certain conceptual relations used in these paths [57]. For implementation reasons, this semantic normalisation step (the processing of Linguistic Types and Relations) has been shifted to the pragmatic level.
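Two of the mechanisms described in this section lend themselves to a compact illustration in Python: the check of a semantic restriction against the concept type lattice (Figs. 12 and 13) and the stack-based referent resolver. The sketch below is not the Menelas implementation. The names `IS_A`, `is_a`, `RESTRICTIONS`, `conforms`, `REFERENCE_STACK` and `resolve` are hypothetical, the lattice is reduced to the two chains of Fig. 12, and the resolver compares only gender and number, as in the 'hij' example, rather than full concept types.

```python
# Parent links copied from the two chains of Fig. 12 (x < y means x is_a y).
IS_A = {
    "drug_treatment": "non_invasive_treatment",
    "non_invasive_treatment": "treatment",
    "treatment": "procedure",
    "procedure": "state_change_producer",
    "aspirin": "anticoagulant",
    "anticoagulant": "therapeutic_substance",
    "therapeutic_substance": "medical_tool",
    "medical_tool": "physical_object",
}


def is_a(concept, ancestor):
    """Walk up the parent links until the ancestor is found or the top is reached."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = IS_A.get(concept)
    return False


# Restriction of Fig. 13: (source type, target type) for each relation.
RESTRICTIONS = {"inst.tool": ("state_change_producer", "physical_object")}


def conforms(source, relation, target):
    """Check [source] --> (relation) --> [target] against the restriction."""
    source_type, target_type = RESTRICTIONS[relation]
    return is_a(source, source_type) and is_a(target, target_type)


# Simplified reference stack: (lemma, (gender, number)), oldest entry first.
REFERENCE_STACK = [("patient", ("m", "s")), ("stenose", ("f", "s"))]


def resolve(pronoun_agreement, stack):
    """Return compatible candidates, most recently mentioned first."""
    return [lemma for lemma, agreement in reversed(stack)
            if agreement == pronoun_agreement]


print(conforms("drug_treatment", "inst.tool", "aspirin"))  # True
print(resolve(("m", "s"), REFERENCE_STACK))                # ['patient']
```

Representing the lattice by single parent links is a simplification: a true lattice allows several parents per concept, in which case `is_a` would have to explore all of them.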

    verb_id  subject         dir_obj         indir_obj  prep  prep_phr
    18610    human_nonhuman  human_nonhuman  NULL       NULL  NULL
    18610    human           NULL            NULL       met   human

Fig. 14. Example of a verb valency frame (for the verb hebben 'to have').

2.4. The pragmatic level and the index generator

2.4.1. The linguistic background

Although a conceptual structure can express exactly the content of a sentence, this does not mean that all information concerning the sentence is present. Knowledge does not have to be stated when it is common sense knowledge or specific knowledge shared by the emitter and receiver of a message. For example, in the situation of a physician writing a patient discharge summary to a general practitioner, the discharging physician does not need to give all the details explicitly, since the GP also has a lot of medical knowledge or is acquainted with the way of working in the hospital. However, in order to obtain a good and complete knowledge representation of a document, this implicit or background knowledge needs to be made explicit. To this aim, frames and scripts are used. A good knowledge representation language offers the possibility to define and modify a frame as well as retrieval operations based on matching and inheritance.

Frames [66] are special data structures that have slots that can be filled with other frames or default values. Each slot describes an aspect of the concept. Each frame can be represented as a node in a semantic framework. A frame describes a part of the state of affairs and thus creates a context in which actions and objects that are referred to in sentences can be interpreted. Scripts [67] are sequences of frames. Often they concern stereotyped situations in which a particular object or event activates the background knowledge and the hypotheses about the events that will take place and which are represented by the script. So, issues related to deep understanding or deduction of background knowledge (e.g. by means of scripts or frames) are to be situated on the pragmatic level (cf. also [47]). For example, a series of medical tests taken one after another or a visit to the doctor can be represented as a script.

A special case of implicit knowledge is formed by the temporal and causal relationships on the discourse level. By this we mean that the events of a situation as it is described in a document can be interrelated on grounds of causality or temporality. For example, a patient discharge summary is mostly introduced by a special paragraph dedicated to the anamnesis, but allusions to past hospitalisations or treatments can also be made in the body of the text. These events are added to the proper temporality of the actual subject of the document (current hospitalisation or consultation). All these events are then to be situated on one absolute time-axis which represents the complete chronology of the situation (including the previous hospitalisations and consultations) described in the discharge summary.
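The notions of frame and script introduced above can be made concrete with a small Python sketch. Everything in it is hypothetical (the class `Frame`, the slot names, the script `HOSPITAL_STAY` and the helper `events_before`); it only illustrates slots with inherited default values and a script as an ordered sequence of frames, and does not reproduce the Menelas knowledge representation.

```python
class Frame:
    """A frame: named slots plus an optional parent to inherit defaults from."""

    def __init__(self, name, parent=None, **slots):
        self.name = name
        self.parent = parent
        self.slots = dict(slots)

    def get(self, slot):
        """Look the slot up locally, then in the chain of parent frames."""
        frame = self
        while frame is not None:
            if slot in frame.slots:
                return frame.slots[slot]
            frame = frame.parent
        raise KeyError(slot)


# Illustrative frames; 'location' is a default inherited from the parent frame.
event = Frame("event", location="hospital")
admission = Frame("admission", parent=event)
ecg = Frame("ecg", parent=event, instrument="electrocardiograph")
discharge = Frame("discharge", parent=event)

# A script is an ordered sequence of frames.
HOSPITAL_STAY = [admission, ecg, discharge]


def events_before(script, name):
    """Names of the frames that precede the named frame in the script."""
    names = [frame.name for frame in script]
    return names[:names.index(name)]


print(ecg.get("location"))                        # hospital
print(events_before(HOSPITAL_STAY, "discharge"))  # ['admission', 'ecg']
```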

Scripts show how the events are interlinked. Even if only a few of the essential events or objects of a script are actually mentioned in a text, one may assume that the preceding events expressed by that script have taken place and that the future events will not fail to happen.

2.4.2. The DMLP implementation

As other partners of the MENELAS consortium took care of these language independent parts (including the conceptual modelling and the semantic normalisation step mentioned above)¹¹, we will not elaborate on them. We only mention this level to complete the picture. In general, the pragmatic analyser of the Menelas system functions in the way that was sketched in the previous section. More details are available in [68]. The index generator fulfils two main functions. The first one is to generate for the processed discharge summary a set of codes according to an international medical nomenclature (ICD-9-CM) [18]. The second objective is to produce acceleration indices which will be used by the retrieval component of the system [68]. Many more details can be found in Zweigenbaum [65,69].

¹¹ Mainly, the Service d'Informatique Médicale de l'Hôpital La Pitié-Salpêtrière, 91 Boulevard de l'Hôpital, F-75634 Paris Cedex 13 was responsible for these parts.

3. Evaluation

In the present state of affairs of the DMLP, not all the components are equally well tested. Certain phenomena are difficult to test or improve if the previous level is not yet finished. On the other hand, it is important to have the different levels working together as soon as possible to figure out what the consequences of an implementation decision on one level entail for the subsequent level(s). For this reason, we have not paid serious attention to the problem of idiomatic expressions and locutions. The same reason made us ignore conjunctions and interrogative constructions. Sometimes, it is not so simple to define good criteria for the independent evaluation of a separate component (this holds for the pragmatic analyser in particular, as this component is implemented elsewhere). For these parts, more details about the evaluation can be found in [65].

    subs('patient', 1) := [patient_d:head].                                            [patient]
    verb('hebben', active, 1) :=                                                       [have]
        [hebben_d:_<=>verb] / (subj_d) --> [entity:_<=>subj]
                            % (obj_d)  --> [entity:_<=>obj].
    subs('stenose', 1) := [stenosis_d:head].                                           [stenosis]
    det([_,3,s],def,[],_type,_r) := [lam(_x,[_type:_x<=>det]):df(_r,_)<=>head].        [the (sing.)]
    det([_,3,s],undef,[],_type,_r) := [lam(_x,[_type:_x<=>det]):undef(_r,_)<=>head].   [a (sing.)]

Fig. 15. Examples of entries (or definitions) of the language dependent semantic dictionary.

3.1. The category guesser

To examine the effectiveness of the category guesser, all the words from the corpus not appearing in the dictionary were submitted to the analyser. The total number of unknown words was 2832. Manual categorisation revealed the presence of 679 adjectives, 2056 nouns and 82 verbs. The 2832 unique unknown forms lead to the generation of 6342 hypothetical analyses, which means that for every unknown form 2.4 possible canonical forms are retained. When an unknown surface form receives more than two different categories, we consider the result a guess. Guesses are always interpreted as bad. If the category guesser is not able to attribute a correct category, the result is regarded as bad. Once a correct category, even concurrently with an incorrect one, is assigned to the submitted word, the outcome is perceived as good. Sometimes the surface form alone does not permit an unequivocal categorisation (e.g. in principle, a Dutch noun formally equals the first person singular present of a regular verb). As the main concern lies with the syntactic characteristics, we did not consider an erroneously calculated canonical form as a reason to reject the complete feature bundle. Manual examination of the results permits us to state that 83.4% of the unknown forms are correctly identified. We consider the result fairly good and are convinced that refinements can lead to an even better result. The linguistic coverage can still be improved by adding rules that treat comparatives and superlatives. Also, the application of contextual disambiguation rules will certainly improve this result, since the unknown words were submitted to the category guesser as an alphabetically ordered list, i.e. without any context.

3.2. The lemmatiser/tagger

We tested the tagger/lemmatiser on a small corpus (not used during the dictionary build-up) since this would allow a thorough manual analysis of the results. The corpus consists of 69 sentences which comprise 480 non-unique surface forms (including punctuation). Thirty-eight ( = 7.91%) words did not appear in the dictionary database. Seventy-eight forms ( - 16.25%) are truly ambiguous. They cannot be


vp>>verb>>voice = passive ==> _gsent = subj>>np>>graph LL vp>>graph CONS (obj is head + subj)

Fig. 16. Language dependent semantic composition rule indicating how to combine the subgraphs of a passive sentence into a single CG for the entire sentence.

disambiguated since the context sensitive module is not yet entirely finished, and are thus not taken into account. Twenty (= 4.16%) attributed categories were wrong. Sixteen of these forms (= 3.33%) are not present in the dictionary. They are marked as pure guesses, which means that these forms are labelled as noun, adjective and verb. Some declined Latin and/or Greek forms are not handled (e.g. 'margo ani'). Because the tagger provides all the available syntactic information, many additional ambiguities appear (e.g. the verb 'hebben' (to have) can be an auxiliary or a main verb). Fifty-four (= 11.25%) forms are ambiguous due to the specifiers (e.g. verb valency, pronoun case information). This richness of syntactic information is necessary because the results of the tagger are directly used by the syntactic parser. When we add this latter category to that of unequivocally (and correctly) attributed tags, the tagger attains a success level of some 75.85%.

Only two surface forms were attributed to different canonical forms ('was': 'zijn/wassen/was' (to be/wax/laundry) and 'zagen': 'zien/zagen' (to see/to saw)). The lemmatiser generates a wrong canonical form for 36 surface forms (7.7%). These are all unknown words. Therefore, we examined how many correct canonical forms the lemmatiser generates for the 2832 unique unknown surface forms. Pure guesses are always considered to be bad (350 = 12.62%). Many of the erroneously generated canonical forms (299 = 10.78%) are Latin forms (1.76%), result from a lack of phonological information (3.35%; e.g. in Dutch, stress determines whether certain consonants are to be doubled or not) or are badly segmented past participles (1.01%). In these cases, two canonical forms are generated, one of which is bad. The remaining errors could not be classified. In total, 76.2% of the generated canonical forms are thus correct, which is a satisfying result.
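The scoring convention used for unknown words in Sections 3.1 and 3.2 (a pure guess is always bad; otherwise one correct category suffices) can be sketched as follows. This is only an illustrative sketch: the class and function names and the four sample items are hypothetical and are not taken from the DMLP code or corpus.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class GuesserOutput:
    """One unknown surface form with the guesser's proposal and the manual label."""
    form: str
    proposed: FrozenSet[str]  # categories proposed by the category guesser
    gold: str                 # category assigned by manual categorisation


def is_pure_guess(item: GuesserOutput) -> bool:
    # More than two proposed categories (noun, adjective and verb) counts as a guess.
    return len(item.proposed) > 2


def is_good(item: GuesserOutput) -> bool:
    # Guesses are always bad; otherwise one correct category is enough,
    # even if an incorrect category is proposed alongside it.
    if is_pure_guess(item):
        return False
    return item.gold in item.proposed


def share_good(items: Iterable[GuesserOutput]) -> float:
    items = list(items)
    if not items:
        return 0.0
    return sum(is_good(i) for i in items) / len(items)


# Hypothetical sample, invented for illustration only.
SAMPLE = [
    GuesserOutput("form_a", frozenset({"noun"}), "noun"),                       # good
    GuesserOutput("form_b", frozenset({"noun", "verb"}), "verb"),               # good
    GuesserOutput("form_c", frozenset({"noun", "adjective", "verb"}), "noun"),  # pure guess
    GuesserOutput("form_d", frozenset({"noun"}), "verb"),                       # wrong category
]

if __name__ == "__main__":
    print(f"share of good outcomes: {share_good(SAMPLE):.2f}")
```

Applied to the manually categorised list of unknown forms, a function of this kind would yield the type of percentage reported above, provided the same convention is used for every item.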

3.3. The compounder

The generally admitted problem with compounds is 'that interpreting them requires inference, and specifically pragmatic inference, in an unpredictable way' [70]. For practical reasons (mainly combinatorial explosion), the aggregation module only takes into account noun-noun plurilithical compounds (i.e. compounds whose parts are separated by a blank). Since in Dutch noun-noun compounds make up 50-60% of the total number of compounds [44], and since the use of plurilithical compounds has only recently become popular under the influence of English spelling, most of the cases are assumed to be covered. The monolithical compounds are treated during the segmentation process of the category guesser if they are not present in the dictionary. As for the other cases (e.g. adjective-noun, as in 'yellow fever'), some authors do not consider them real compounds but lexicalised syntagmas [71]. Moreover, a non-compound reading remains possible for most of such constructs. Therefore, they are treated as non-compounds. The initial problem nevertheless remains: it is merely restated as the problem of discriminating between compounds and lexicalised syntagmas. The practical computational arguments mentioned here to defend the adopted strategy are not unique to the DMLP but are shared by many in the NLP community (e.g. consider [72], p. 198).
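The restriction to noun-noun compounds separated by a blank can be illustrated with a minimal sketch. It assumes the input is a list of already tagged tokens; the function name, the tag labels and the Dutch example phrase are hypothetical and only serve to show the selection criterion, not the actual aggregation module.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, syntactic category)


def noun_noun_candidates(tokens: List[Token]) -> List[Tuple[int, str]]:
    """Return (start index, joined form) for every adjacent noun-noun pair.

    Only blank-separated noun-noun sequences are proposed as compound
    candidates; adjective-noun sequences are deliberately left alone,
    since they are treated as non-compounds.
    """
    candidates = []
    for i in range(len(tokens) - 1):
        (left, left_cat), (right, right_cat) = tokens[i], tokens[i + 1]
        if left_cat == "noun" and right_cat == "noun":
            candidates.append((i, f"{left} {right}"))
    return candidates


# Invented example: 'bypass operatie' is a noun-noun pair,
# 'gele koorts' (yellow fever) is adjective-noun and is not aggregated.
EXAMPLE = [
    ("de", "determiner"),
    ("bypass", "noun"),
    ("operatie", "noun"),
    ("en", "conjunction"),
    ("gele", "adjective"),
    ("koorts", "noun"),
]

if __name__ == "__main__":
    print(noun_noun_candidates(EXAMPLE))
```

Limiting the scan to one category pattern keeps the number of candidates linear in the sentence length, which is one way to read the remark about avoiding combinatorial explosion.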

3.4. The grammar

The Restriction Grammar rules were tested on a corpus of 40 PDSs containing together some 2253 sentences. These PDSs were randomly selected from the Dutch MENELAS corpus containing 415 PDSs from the cardiology department. Because we did not examine thoroughly (which equals manually) the results of the analysis of the 2253 sentences, we can only give


[proposition:
  [hebben_d:_1 <=> verb] -
    (obj_d) --> [lam(_2, [word:stenose] <--(name)<-- [entity:_2]): undef 2 <=> obj] %
    (subj_d) --> [lam(_3, [patient_d:_3]): df 1 <=> subj] -/
  <=> head ].

Fig. 17. Example of a 'linguistic' CG representing the content of a sentence (the patient has stenosis).

the percentage of successfully analysed sentences (61.3%). A disadvantage is that conjunctions and disjunctions cannot be handled. These constructions alone are responsible for 25% of the failures at the syntactic level. Another 25% of the sentences that were not analysed showed a very irregular (or even agrammatical) syntactic construction (mere juxtaposition of nouns and omission of the sentence subject). Presumably no regular syntactic analyser can cope with these constructions. The corrected score probably lies around 80%. Other problems concern the difference between the standard language and the medical sublanguage with respect to the verb valency frames. Random tests showed that this factor must be counted among the reasons for failure of the syntactic analysis. Another problem in some cases is the high number of 'choice points', which causes the parser control mechanism to raise time-out or stack overflow errors. A more robust parsing mechanism (e.g. a chart parser) is needed. Once the contextual tagging rules are completed, the general performance of the syntactic parser will undoubtedly be enhanced, since the number of alternative (and erroneous) parses will be reduced. Parsing speed will also improve: the fewer alternatives, the faster a good parse will be attained.

[Graph of Fig. 18: node hebben_d, linked by (subj_d) to node patient_d and by (obj_d) to node word:stenose.]
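The text of Section 3.4 does not state how the corrected score of about 80% is derived from the 61.3% success rate. The sketch below shows two possible readings, both of which are assumptions: in the first, the two failure classes (25% each of the failures) are simply counted as resolved; in the second, the conjunction failures are counted as resolved and the irregular sentences are removed from the test set. The function names are hypothetical.

```python
from typing import Sequence


def corrected_additive(success_pct: float, excluded_shares: Sequence[float]) -> float:
    """Reading 1 (assumption): failures of the listed classes count as successes."""
    failures = 100.0 - success_pct
    return success_pct + failures * sum(excluded_shares)


def corrected_mixed(success_pct: float, fixable_share: float, unparsable_share: float) -> float:
    """Reading 2 (assumption): fixable failures count as successes,
    unparsable sentences are dropped from the denominator."""
    failures = 100.0 - success_pct
    fixable = failures * fixable_share
    unparsable = failures * unparsable_share
    return 100.0 * (success_pct + fixable) / (100.0 - unparsable)


if __name__ == "__main__":
    # 61.3% parsed; 25% of failures due to conjunctions/disjunctions,
    # another 25% due to irregular constructions (figures from Section 3.4).
    print(f"reading 1: {corrected_additive(61.3, [0.25, 0.25]):.2f}%")
    print(f"reading 2: {corrected_mixed(61.3, 0.25, 0.25):.2f}%")
```

Both readings give a value close to 80% (roughly 80.7% and 78.6% respectively), which is consistent with the estimate quoted above but does not establish which calculation was actually intended.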

3.5. The semantic and pragmatic components

For the moment, the APT-transducer is able to transform the larger part of the RG-parse trees into the APT format. However, a systematic quantitative evaluation has not yet taken place, as the grammar is not yet stable. A change in the grammar rules can cause changes in the APT-transducer code (the grammar rules being necessarily embedded, up to a certain extent, directly in the APT-transducer code). The semantic compositional rules have not yet been evaluated systematically. A sample of some 9 PDSs served as a first test corpus. The same PDSs are currently being used to test the pragmatic module and the index generator.

4. Further work

4.1. The lexical and morphological components

At the lexical and morphological levels, a great part of the work is done. However, more attention should be paid to the treatment of idiomatic expressions. The compounds should also be treated more thoroughly. There is always room for extensions and optimisations, for example the finalisation of a module that automatically creates new dictionary entries.

Fig. 18. Graphical representation of Fig. 17. For simplicity, we have omitted the linguistic annotations (mainly concerning quantifying aspects and the syntactic function) in the graphical representations.

[[proposition: (past) -->
  [behandelen_d:_1 <=> verb] -
    (obj_d) --> [lam(_2, [patient_d:_2]): df 1 <=> subj] %
    (subj_d) --> [entity:_3 <=> door] -/
  <=> head ].

Fig. 19. Example of a CG after automatic reference resolution for the sentence 'hij werd behandeld' (he has been treated).

[Graph of Fig. 20: node behandelen_d, marked (past), with a (subj_d) link and an (obj_d) link, the latter leading to the patient node.]

Fig. 20. Graphical representation of Fig. 19.

Another possible addition is the integration of the PROTON valency dictionary for adjectives [73] in the DMLP lexical database. Concerning the lexicon, the semantic features (human, nonhuman, human-nonhuman and abstract) that determine the verbal valency should be checked. As the PROTON dictionaries were conceived for the standard language, specific constructions of the medical sublanguage are sometimes not recognised as valid. We strongly believe that adapting them would increase the current syntactic coverage by a few percent. But this is a tedious job and should be preceded by a large corpus study.

The rule evaluation mechanism for a contextual disambiguation module as outlined in [45,46], as well as many rules, are already implemented (e.g. when a noun singular and a first person singular of the simple present formally coincide, the verbal reading is discarded if there is no first person category (in practice, a personal pronoun) present in the sentence). The most important thing to do now is a large scale evaluation of the contextual rules and the extension of their linguistic coverage. Work on this aspect has already started and the contextual disambiguation rules are being evaluated. A first test on the corpus described in Section 3.2 already showed an improvement of some 10%, which makes the tagging score rise to 85.94%. We hope to validate this tagging accuracy on a larger corpus. The more lexical ambiguities are resolved, the less useless effort the parser will have to perform. The combination rules for the compounder should be refined (e.g. cases of left headed compounds) and extended to include idiomatic expressions or other 'multi-unit noun expressions' as well.
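The contextual disambiguation rule given as an example in Section 4.1 (drop the first person singular present reading of a form that is also a singular noun when the sentence contains no first person pronoun) can be sketched as follows. The data structures, names and the example tokens are hypothetical; the actual rule formalism follows [45,46] and is not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Reading:
    """One candidate analysis of a surface form (hypothetical feature set)."""
    category: str
    person: Optional[int] = None
    number: Optional[str] = None
    tense: Optional[str] = None


Token = Tuple[str, List[Reading]]  # (surface form, candidate readings)


def _is_first_sg_present_verb(r: Reading) -> bool:
    return r.category == "verb" and r.person == 1 and r.number == "sg" and r.tense == "present"


def _is_noun_sg(r: Reading) -> bool:
    return r.category == "noun" and r.number == "sg"


def _is_first_person_pronoun(r: Reading) -> bool:
    return r.category == "pronoun" and r.person == 1


def discard_unlicensed_verb_reading(sentence: List[Token]) -> List[Token]:
    """Remove the 1st person sg present verb reading from noun/verb-ambiguous
    forms when no first person pronoun occurs anywhere in the sentence."""
    licensed = any(_is_first_person_pronoun(r) for _, readings in sentence for r in readings)
    if licensed:
        return list(sentence)
    result = []
    for form, readings in sentence:
        ambiguous = any(_is_noun_sg(r) for r in readings) and any(
            _is_first_sg_present_verb(r) for r in readings
        )
        if ambiguous:
            readings = [r for r in readings if not _is_first_sg_present_verb(r)]
        result.append((form, readings))
    return result


# Invented example: Dutch 'val' is both a noun ('fall') and 'I fall'.
VAL = [Reading("noun", number="sg"), Reading("verb", 1, "sg", "present")]
NO_PRONOUN = [("de", [Reading("determiner")]), ("val", VAL)]
WITH_PRONOUN = [("ik", [Reading("pronoun", 1, "sg")]), ("val", VAL)]

if __name__ == "__main__":
    print(discard_unlicensed_verb_reading(NO_PRONOUN)[1])
    print(discard_unlicensed_verb_reading(WITH_PRONOUN)[1])
```

Each reading removed in this way is one alternative the syntactic parser no longer has to explore, which is the effect on parsing effort described in the paragraph above.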

4.2. The morphosyntactic component

Here, the first concern is to replace the current parser by a more robust chart parser while keeping as many as possible of the already defined grammar rules and restrictions. On the architectural level, the necessity of an autonomous syntactic level can be questioned. As many nearly agrammatical constructions are present in the medical sublanguage (e.g. juxtaposition of nouns without determiners or prepositions), some authors prefer to replace the morphosyntactic analyser by a semantic parser that performs local syntactic checking when needed (cf. [34] for an overview). But of course, this would require a complete change in the DMLP architecture. As a complementary approach, the MULTI-TALE strategy [74] could be adopted. This comprises a kind of regrouping of words into larger syntactic units that are semantically labelled afterwards. If the semantic labels are represented by CGs, they could be fed directly to the MENELAS [75] or RECIT [9] understanding module. An alternative could be the implementation of proximity rules [76] for Dutch. More research on the feasibility of this remains to be done.

As regards syntactic coverage, it is necessary to integrate the treatment of conjunctions and disjunctions (cf. Section 3.4). Other items that are currently insufficiently treated are some subordinate clauses and some complex verbal constructions involving (semi-)modals and pivot constructions (although the necessary information is available in the DMLP dictionary). Impersonal constructions need to be tackled on a large scale and in a consistent manner. Relative clauses are handled to a certain extent. There is still no mechanism that passes syntactic characteristics from the antecedent and the relative pronoun to the 'gap position' (although some information is passed during the generation of the APT-structure). Interrogatives are not treated since no interrogative constructions appear in the corpus. Some 'general language grammatical relations' are not fully checked, since they occur very rarely in the


Linguistic_Type patient_d(_x) is
  [human_being:_x]
    (defines_cultural_function) --> [medical_subfunction] --(cultural_role)--> [patient^def]
    (attr) --> [sex] --(val)--> [male] %.
Linguistic_Type hebben_d(_x) is
  [Physical_object^def] <--(state_of)-- [state_of_physical_object:_x].
Linguistic_Type stenosis_d(_x) is
  [stenosis:_x].

Fig. 21. Examples of Linguistic Type Definitions that define the language dependent concepts (cf. Fig. 15) by means of language independent concepts.

medical language. In principle, the mentioned items should be realisable, since the PUNDIT-system [50] that is based on the same theoretical basis successfully handles the described syntactic phenomena [77].

4.3. The semantic parts

To rapidly increase the performance of the semantic analyser, the first requirement is to add new entries to the semantic dictionary and to build new composition rules. Currently, the Dutch semantic dictionary comprises some 250 entries. Concerning the verb valency schemes, a more thorough and systematic adaptation to the medical sublanguage is necessary because a sublanguage typically allows selectional rules that differ from those of the general language [78].

5. Conclusion

Linguistic_Relation met_d
  :preference inst meth manr part inst tool acc
:end

Fig. 22. Example of Linguistic Relation Definition for the preposition 'met' (with).

The DMLP is an important step towards the effective use of the information encoded in a Dutch Patient Discharge Summary by means of Natural Language. Projects of this kind constitute important opportunities for the future of (medical) information extraction. The most important result is the realisation of a complete processing chain for the Dutch language, while integrating and reusing available resources as much as possible. To our knowledge, there are few systems (probably none) that perform a complete analysis (morphological, syntactic, semantic and pragmatic) of Dutch medical PDSs. Thanks to the DMLP architecture, it is possible to extend and modify the modules separately. More specifically for Dutch, the lexical front-end in particular can become the core of a future electronic medical dictionary environment including various tools like (semi-)automatic dictionary extenders or tagger-lemmatisers. The grammar for the Dutch medical sublanguage is, as far as we know, unique. Special attention is being paid to building bridges to other medical NLP systems. A specific transducer reshapes the output of the lexical and morphological components so that its results are (re)usable by the MULTI-TALE semantic tagger [74], which uses the CEN ENV 1828:1995 (standard in Medical Informatics) semantic labels [79]. In the same spirit, the DMLP can be connected to the MENELAS language independent understanding parts (as was originally foreseen) thanks to the APT transducer and the language specific semantic knowledge. Subsequently, as the knowledge is represented under the form of CGs,


[Fig. 23 block diagram. Recoverable box labels: Lexical-Morphological Component; Tag Converter; Syntactical Component; Semantic Tagger; APT Transducer; Tree Transducer; Dutch Semantic Knowledge; Semantic Component; Pragmatic Component; Indexing Component; Storage Component; Selection Module; Transformation Module; Regularisation Module; Information Formatter; Storage Module. System labels: MultiTale, Menelas, LSP-MLP.]

Fig. 23. Visualisation of the DMLP as a language specific front-end for domain specific information processing.

many other programs (e.g. for automatic encoding, translation, or other purposes) using the CG formalism could process the MENELAS results. The major problem will be the harmonisation of the semantic labels or concepts. As the syntactic level uses the same grammar formalism (and in many cases even the same grammatical labels), the output of the Dutch syntactic module could be redirected to the domain-dependent but language-independent modules of the LSP-MLP [15]. As the medical co-occurrence patterns of the LSP-MLP are practically identical for English, French and German [55], applying these patterns to Dutch parse trees can lead to interesting results, namely the feasibility of reusing the non-language-specific parts of the LSP-MLP for Dutch medical NLP. This possibility is jointly being examined by both the LSP-MLP and DMLP teams (see [80] for a preliminary report on this activity). This goal of multiple connectivity explains the choice of a more traditional layered analysis architecture instead of the integrated approach advocated in the RECIT project [81]. Fig. 23 illustrates the current integration opportunities of the DMLP with other medical NLP systems.

Potential applications for a medical NLP system are information processing tasks of various natures and goals. These include, amongst others, automated encoding of discharge summaries [10], determination of clinical patient profiles [82], health-care quality assurance [83] and queries of different kinds on a patient discharge summary knowledge base (cf. [7,84] for an overview of possible services an NLP-based system can offer for medical information processing). Next to the mentioned linguistic enhancements, the important hurdle that the DMLP has to take next is a clinical trial and assessment in a production environment. The mentioned realisations can give an impetus to research and industrial development concerning medical language processing for Dutch. This aspect is of paramount interest in the light of the approaching era of the (medical) information highway and of European unification. In such a context, it is extremely important that the 'smaller' languages can benefit from large scale research projects in order to avoid a 20th century electronic Gutenberg effect (footnote 2) by developing high quality information systems in which not only the medium but also the message becomes computationally treatable.

Acknowledgements

We would like to thank Luc Dehaspe and the partners (especially the S.I.M. and I.B.M. teams) of the Menelas project (A.I.M. #2023, supported by the E.U.) for their collaboration during the past few years, as well as the reviewers for their suggestions to improve this paper.

Footnote 2. With the advent of book printing (invented by Gutenberg), the cultures of 'non-printed languages' disappeared much more easily during the course of history. In the second millennium, the same thing could happen to 'non-automated languages'.

References

[1] Rossi Mori A, Thornton A and Gangemi A: An Entity-Relationship Model for a European Machine-Dictionary of Medicine, Proceedings of SCAMC 90, 1990, pp. 185-189.
[2] Wingert F: In: Informatics and Medicine, an advanced course (Eds: P Reichertz and G Goos), Springer-Verlag, 1977, pp. 579-646.
[3] Wingert F: Medical Linguistics: a Review. In: Proceedings of MEDINFO 80, 1980, pp. 1321-1331.
[4] Safran C, Chute C and Scherrer JR (eds.): Natural Language and Medical Concept Representation, Vevey, (IMIA WG6 Conference), Methods Inf Med, 34 (1995) 1/2.
[5] Scherrer JR, Côté R and Mandil S: Computerized Natural Medical Language Processing for Medical Knowledge, Elsevier Science Publishers (North Holland), IMIA, 1989.
[6] van Ginneken AM: Electronic Health Record (Synopsis). In: Yearbook of Medical Informatics 94 (Eds: J van Bemmel and A McCray), IMIA, Schattauer, 1994, pp. 173-175.
[7] Baud R, Rassinoux AM and Scherrer JR: Natural Language Processing and Medical Records. In: Seventh World Congress on Medical Informatics, MEDINFO 92, Geneva, (Eds: K Lun, P Degoulet, T Pierre and O Rienhoff), North Holland, 1992, pp. 1362-1367.
[8] Friedman C and Johnson S: Medical Text Processing: Past achievements, future directions. In: Aspects of the Computer-based Patient Record, (Eds: M Ball and M Collen), Springer-Verlag, 1992, pp. 212-228.
[9] Rassinoux AM, Michel PA, Juge C, Baud R and Scherrer JR: Natural Language Processing of medical texts within the HELIOS environment, Comput Methods Programs Biomed, 45 (1994) 79-96 (Suppl.).
[10] Sager N, Lyman M, Nhan NT and Tick L: Medical Language Processing: Applications to Patient Data Representation and Automatic Encoding, Methods Inf Med, 34 (1995) 140-146.
[11] McCray A: Natural language processing for intelligent information retrieval, Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology, 1991, pp. 1160-1161.
[12] McCray A and Nelson S: The Representation of Meaning in the UMLS, Methods Inf Med, 34 (1995) 193-201.
[13] Pietrzyck P: A Medical Text Analysis System for German Syntax Analysis, Methods Inf Med, 30 (1991) 275-283.
[14] Rector A, Solomon D, Nowlan W, Rush T, Zanstra P and Claassen W: A Terminology Server for Medical Language and Medical Information Systems, Methods Inf Med, 34 (1995) 147-157.
[15] Sager N, Friedman C and Lyman M: Medical Language Processing: Computer Management of Narrative Data, Addison-Wesley, Reading, MA, 1987.
[16] Schröder M: Knowledge Based Processing of Medical Language: A Language Engineering Approach. In: Proceedings of the Sixteenth German Workshop on AI (GWAI 92), (Ed: HJ Ohlbach), Springer-Verlag, Berlin, 1992, pp. 221-234.
[17] Zweigenbaum P et al: MENELAS, an access system for medical records using natural language, Comput Methods Programs Biomed, 45 (1994) 117-120.
[18] Delamarre D, Burgun A, Seka LP and Le Beux P: Automated coding system of patient discharge summaries using conceptual graphs, Methods Inf Med, 34 (1995) 354-351.
[19] Lemmens M: A critical study of the Defaulters in the Belgian METAL © system, and a design of a morphologically guided category guesser, Master's thesis, K.U. Leuven, (in Dutch), 1989.
[20] Spyns P: A Prototype for semi-automatic encoding, Master's thesis (in Dutch), Leuven, 1991.
[21] Dehaspe L and Van Langendonck W: An Automated Valency Dictionary for Dutch Verbs, Leuven, 1991.
[22] Dorrepaal J, Florenza M, Heylen H, Hoekstra D, Pohlman R, van der Wouden T and Krauwer S: Implementation Report Eurotra NL/B, Leuven-Utrecht, 1991.
[23] Spyns P and Willems JL: Dutch Medical Language Processing: discussion of a prototype. In: Proceedings of MEDINFO 95, HC and CC, Edmonton, 1995, pp. 37-40.
[24] Spyns P: A robust Category Guesser for Dutch Medical Language. In: Proceedings of ANLP 94, ACL, 1994, pp. 150-155.
[25] Spyns P and De Wachter L: Morphological Analysis of Dutch Medical Compounds and Derivations, ITL Rev Appl Linguistics, 109-110 (1995) 19-35.
[26] Spyns P and Adriaens G: Applying and Improving the Restriction Grammar Approach for Dutch Patient Discharge Summaries. In: Proceedings of COLING 92, 1992, pp. 1164-1168.
[27] Spyns P, Dehaspe L and Willems JL: The Menelas Syntactic Analysis Component for Dutch, Menelas Deliverable #6, Leuven, 1993.
[28] Cavazza M, Doré L and Zweigenbaum P: Model-based Natural Understanding in Medicine. In: Proc. of MEDINFO 92, Elsevier Science Publishers, 1992, pp. 1356-1361.
[29] Fargues J, Landau MC, Dugourd A and Catach L: Conceptual Graphs for semantics and knowledge Processing, IBM J Res Dev, 30 (1986) 70-79.
[30] Sowa JF: Conceptual Structures: Information Processing in Mind and Machine, Addison-Wesley, London, 1984.
[31] Bouaud J and Zweigenbaum P: A reconstruction of conceptual graphs on top of a production system. In: Proceedings of the 7th Annual Workshop on Conceptual Graphs, Las Cruces, 1992.
[32] Volot F, Zweigenbaum P, Bachimont B, Ben Saïd M, Bouaud J, Fieschi M and Boisvieux JF: Structuration and acquisition of medical knowledge: using UMLS in the Conceptual Graph formalism. In: Proceedings of SCAMC 93, McGraw Hill, 1993, pp. 710-714.
[33] Zweigenbaum P, Bachimont B, Bouaud J, Charlet J and Boisvieux JF: Issues in the Structuration and Acquisition of an Ontology for Medical Language Understanding, Methods Inf Med, 34 (1995) 15-24.
[34] Spyns P: Natural Language Processing in Medicine: An Overview, Methods Inf Med, (1996) in press.
[35] Bläser B, Schwall U and Storrer A: A Reusable Lexical Database Tool for Machine Translation. In: Proceedings of COLING 92, 1992, pp. 510-516.
[36] Michiels A: Feeding LDOCE Entries into Horatio. In: Lexical Issues in Machine Translation, (Eds: P Alberto and P Bennet), Luxembourg, 1994, pp. 93-116.
[37] Dehaspe L: Report on the building of the Menelas lexical database, Technical Report 93-002, K.U. Leuven - Dept. of Medical Informatics, 1993.
[38] Gazdar G and Mellish C: Natural Language Processing in Prolog: an introduction to computational linguistics, Addison-Wesley, 1989.
[39] McCray A, Sponsler J, Brylawski B and Browne A: The Role of Lexical Knowledge in Biomedical Text Understanding. In: Proceedings of SCAMC 87, IEEE Computer Society Press, 1987, pp. 103-107.
[40] Wingert F: Morphological Analysis of Medical Compound Word Forms. In: Computational Linguistics in Medicine, (Eds: W Schneider and AL Sågvall Hein), 1977, pp. 79-89.
[41] Wingert F: Morphologic Analysis of Compound Words, Methods Inf Med, 24 (1985) 155-162.
[42] Dujols P, Baylon Chr and Chein M: Projet LIME: Linguistique et Langage Médical. In: Informatique et Santé, vol. 5: Nouvelles Méthodes de Traitement de l'Information Médicale, (Eds: Degoulet P et al.), Springer-Verlag France, 1992, pp. 126-138.
[43] Wolff S: The use of morphosemantic regularities in the medical vocabulary for automatic lexical coding, Methods Inf Med, 23 (1984) 195-203.
[44] De Wachter L and Provoost J: A Computational Interpretation of Compounds, Working Paper in NLP 11, Leuven-Utrecht, 1994.
[45] Martin W, Heymans R and Platteau F: Dilemma, an automatic Lemmatizer, Colingua, 1 (1988) 5-62.
[46] Paulussen H and Martin W: Dilemma-2: a Lemmatizer-Tagger for medical abstracts. In: Proceedings of ANLP 92, ACL, Trento, 1992, pp. 141-146.
[47] Nijholt A: Computers and Languages, Theory and Practice, (Studies in Computer Science and Artificial Intelligence 4), Elsevier Science Publishers B.V., Amsterdam, 1988.
[48] Sager N: Natural Language Information Processing: a computer grammar of English and its applications, Addison-Wesley, Reading, MA, 1981.
[49] Hirschman L and Puder K: Restriction Grammar in Prolog. In: Proceedings of the First International Logic Programming Conference Marseilles, (Ed: M van Caneghem), 1982, pp. 85-90.
[50] Hirschman L, Palmer M, Dowding J, Dahl D et al.: The Pundit NLP System, AI Systems in Government Conference, Computer Society of the IEEE, March 1989.
[51] McCray A: Inferencing in Information Retrieval, in DARPA Proceedings, 1992, pp. 218-223.
[52] Jensen K: Language Engineering: The Real Bottle Neck of Natural Language Processing (panel). In: Proceedings of COLING 88, 1988, pp. 448-453.
[53] Hirschman L and Puder K: Restriction Grammar: a Prolog Implementation. In: Logic Programming and its Applications, (Eds: M van Caneghem and D Warren), Ablex Publishing Corporation, Norwood, NJ, 1986, pp. 244-261.
[54] Zweigenbaum P (ed.): Initial Choices and Specifications, Menelas Deliverable #1, Paris, 1992.
[55] Sager N, Nhàn NT, Lyman M and Tick L: Computer Analysis of Clinical Narrative: why, how, what, when. In: Proceedings of BIRA 95, Gent, 1995, pp. 22-53.
[56] Chomsky N: Aspects of the Theory of Syntax, MIT Press, Cambridge, MA, 1965.
[57] Zweigenbaum P et al: Linguistic and Medical Knowledge Bases, Menelas Deliverable #9, Paris, 1993.
[58] Ceusters W, Deville G and Buekens F: The Chimera of Purpose and Language Independent Concept System in Health Care. In: Proceedings of MIE 94, 1994, pp. 208-212.
[59] De Moor G, McDonald C and Noothoven van Goor J (eds.): Progress in Standardisation in Health Care Informatics, IOS Press, Amsterdam, 1993.
[60] Rossi Mori A: Co-operative Development of a shared Ontology for Medicine in CEN/TC251/WG2. In: Natural Language and Medical Concept Representation (Preprints of the IMIA WG6 Conference), (Eds: Safran C, Chute C and Scherrer JR), Vevey, 1994 (supplementary paper).
[61] Humphreys B and Lindberg D: The Unified Medical Language System Project: a distributed experiment in improving access to biomedical information. In: Seventh World Congress on Medical Informatics, MEDINFO 92, Geneva, (Eds: K Lun, P Degoulet, T Pierre and O Rienhoff), North Holland, 1992, pp. 1496-1500.
[62] Bérard-Dugourd A, Fargues J, Landau MC and Rogala JP: Un système d'analyse de texte et question/réponse basé sur les graphes conceptuels. In: Informatique et Santé, vol. 1: Informatique et Gestion des Unités de Soins, (Eds: P Degoulet et al.), Springer-Verlag France, 1989, pp. 223-233.
[63] Guillotin T et al.: MENELAS Understanding Components: Pre-integration, Menelas Deliverable #7, Paris, 1993.
[64] Rassinoux AM, Wagner J, Baud R and Scherrer JR: Utilisation des graphes conceptuels pour le traitement du language médical, Proceedings of 'Journée Graphes Conceptuels', P.R.C.-G.D.R. Intelligence Artificielle, Montpellier, 1994, pp. 1-24.
[65] Zweigenbaum P (ed.): MENELAS, The Final Report, Menelas Deliverable #17, Paris, 1995.
[66] Minsky M: A framework for representing knowledge. In: The Psychology of Computer Vision, (Ed: PH Winston), McGraw-Hill, New York, 1975, pp. 211-277.
[67] Schank R and Abelson R: Scripts, Plans, Goals and Understanding, Lawrence Erlbaum Associates, Publishers, Hillsdale, NY, 1977.
[68] Zweigenbaum P et al: MENELAS, Coding and Information Retrieval from Natural Language Patient Discharge Summaries. In: Health in the New Communication Age, (Eds: M Laires, M Ladeira and J Christensen), IOS Press, Amsterdam, 1995, pp. 82-89.
[69] Nangle B and Keane M: Effective retrieval in Hospital Information Systems: The use of context in answering queries to Patient Discharge Summaries, Artif Intell Med, 6 (1994) 207-227.
[70] Sparck-Jones K: So what about parsing compound nouns. In: Automatic Natural Language Parsing, (Eds: K Sparck-Jones and Y Wilks), Ellis Horwood Limited, 1982, pp. 164-168.
[71] Decaluwe J: Dutch nominal compounds from a functionalist perspective, Ph.D. thesis (in Dutch), R.U. Gent, 1988.
[72] Ritchie G, Russel G, Black A and Pulman S: Computational Morphology: Practical Mechanisms for the English Lexicon, MIT, 1992.
[73] Van Roosbroek W: A Valency Dictionary for Dutch Adjectives, Master's thesis, K.U. Leuven, (in Dutch), 1988.
[74] Ceusters W, Deville G and De Moor G: Automated extraction of neurosurgical procedure expressions from full text reports: the Multi-TALE experience, Proc. MIE 96, (1996) in press.
[75] Zweigenbaum P, Bachimont B, Bouaud J, Charlet J and Boisvieux JF: A Multi-lingual Architecture for Building a Normalised Conceptual Representation from Medical Language. In: Proceedings of SCAMC 95, New Orleans, 1995, pp. 357-361.
[76] Morel-Guillemaz AM, Baud R and Scherrer JR: Proximity Processing of Medical Text. In: Proceedings of MIE 90, 1990, Springer-Verlag, pp. 625-630.
[77] Hirschman L: Meta-Rules for Conjunction in Restriction Grammar, J Logic Program, 4 (1986) 299-328.
[78] Kittredge R and Lehrberger J: Sublanguage, De Gruyter, Berlin, 1982.
[79] De Moor G: Standardisation in Health Care Informatics and Telematics in Europe: CEN TC 251 Activities. In: Progress in Standardisation in Health Care Informatics (Eds: G De Moor, C McDonald and J Noothoven van Goor), IOS Press, Amsterdam, 1993, pp. 1-13.
[80] Spyns P and De Moor G: Medical Language Processing and Reusability of resources: a case study applied to Dutch, Proc. MIE 96, (1996) in press.
[81] Rassinoux AM: Extraction et Représentation de la Connaissance tirée des Textes Médicaux, Ph.D. thesis, Département d'Informatique, Université de Genève, 1994.
[82] Borst F, Lyman M, Nhan NT, Tick L, Sager N and Scherrer JR: Textinfo: A Tool for Automatic Determination of Patient Clinical Profiles Using Text Analysis. In: Proceedings of SCAMC 91, AMIA, McGraw-Hill, 1991, pp. 63-67.
[83] Lyman M, Sager N, Tick L, Nhàn NT, Su Y, Borst F and Scherrer JR: The application of natural-language processing to healthcare quality assessment, Medical Decision Making, 11 (1991) (suppl.) S65-S68.
[84] Baud R, Rassinoux AM and Scherrer JR: Natural Language Processing and Semantical Representation of Medical Texts, Methods Inf Med, 31 (1992) 177-125.