International Journal of Medical Informatics (2006) 75, 418—429
Developing a corpus of clinical notes manually annotated for part-of-speech

Serguei V. Pakhomov a,∗, Anni Coden b, Christopher G. Chute a

a Division of Biomedical Informatics, Mayo College of Medicine, Rochester, MN 55905, USA
b IBM, T.J. Watson Research Center, Hawthorne, NY 10532, USA

Received 6 April 2005; received in revised form 23 July 2005; accepted 10 August 2005

KEYWORDS: Natural language processing; Statistical part-of-speech tagging; Domain adaptation; Medical domain; Text analysis; Manual text annotation

Abstract

Purpose: This paper presents a project whose main goal is to construct a corpus of clinical text manually annotated for part-of-speech (POS) information. We describe and discuss the process of training three domain experts to perform linguistic annotation.

Methods: Three domain experts were trained to perform manual annotation of a corpus of clinical notes. A part of this corpus was combined with the Penn Treebank corpus of general purpose English text and another part was set aside for testing. The corpora were then used for training and testing statistical part-of-speech taggers. We list some of the challenges as well as encouraging results pertaining to inter-rater agreement and consistency of annotation.

Results: We used the Trigrams'n'Tags (TnT) tagger [T. Brants, TnT: a statistical part-of-speech tagger, in: Proceedings of NAACL/ANLP-2000 Symposium, 2000] trained on general English data to achieve 89.79% correctness. The same tagger trained on a portion of the medical data annotated for this project improved the performance to 94.69%. Furthermore, we find that discriminating between different types of discourse represented by different sections of clinical text may be very beneficial to improving the correctness of POS tagging.

Conclusion: Our preliminary experimental results indicate the necessity of adapting state-of-the-art POS taggers to the sublanguage domain of clinical text.

© 2005 Elsevier Ireland Ltd. All rights reserved.
∗ Corresponding author. Tel.: +1 507 538 1774.
E-mail addresses: [email protected] (S.V. Pakhomov), [email protected] (A. Coden), [email protected] (C.G. Chute).

1. Introduction

Natural language processing (NLP) has relatively recently emerged within Medical and Biomedical
Informatics as a viable approach to processing large amounts of textual data stored in electronic form in biomedical literature as well as clinical reports. One of the goals of NLP research and development is to provide means for identifying structural elements contained within unstructured text of clinical documents. For example, the information can be structured in terms of disorders, signs or symptoms a particular patient may have as well as other
facts about a patient such as whether they drink or smoke or whether a particular disorder is present in any of their family members. This is just a small sample of the kind of information that can be mined from the free text of clinical records. The obvious long-term benefit of doing so lies in subsequently combining this "phenotypic" data mined from text with genomic data. NLP is critical to extracting such "phenotypic" information from text. Typically, NLP is used to provide linguistic information necessary for low level tasks such as term identification [1-3], word sense ambiguity resolution [4,17] and negation recognition [5], as well as higher level tasks such as automatic question answering [6], data mining and knowledge discovery [7]. Many of these tasks rely on the availability of accurate POS information.

Words of any given language can be categorized according to their formal characteristics. The notion of part-of-speech (POS) pertains to such categorization. For example, certain English words can have a plural form (e.g. nouns) while others cannot (e.g. adjectives, articles, prepositions, etc.). Given a number of such observations about a language, we can classify words into large POS categories containing other words with similar characteristics. Automatic POS tagging is the task of assigning POS categories to terms from a predefined set of categories. The granularity of the POS classification system used for automatic POS tag assignment typically depends on the NLP tasks for which the POS information will subsequently be used. Some POS systems contain as few as 12 categories (the NLM MetaMap tagset, http://mmtx.nlm.nih.gov/), while others have 45 (the Penn Treebank tagset [8], or 36 if punctuation tags are not counted) or over 130 (the CLAWS tagset, http://www.comp.lancs.ac.uk/ucrel/claws/). The higher the granularity of the tagset, the more fine-grained distinctions can be made for subsequent processes; however, the tradeoff is that granular classifications introduce a greater potential for error, and thus more training data are needed to maintain accuracy.

Having reliable part-of-speech (POS) information is critical to successful implementation of various natural language processing techniques for processing unrestricted text in the biomedical domain. State-of-the-art automated POS taggers achieve accuracy of 93-98%, and the most successful implementations are based on statistical approaches to POS tagging. Taggers based on Hidden Markov Model (HMM) technology currently appear to be in the lead. Prime public domain examples of such
implementations include the Trigrams'n'Tags (TnT) tagger [9], the Xerox tagger [10] and the LT POS tagger [11]. Maximum Entropy (MaxEnt) based taggers also seem to perform very well [12,22].

One of the issues with statistical POS taggers is that most of them need a representative amount of hand-labeled training data, either in the form of a comprehensive lexicon and a corpus of untagged data, or a large corpus of text annotated for POS, or a combination of the two. Currently, most POS tagger accuracy reports are based on experiments involving Penn Treebank data, which consists of manually tagged corpora. These corpora include the Department of Energy abstracts, the Brown corpus, Department of Agriculture bulletins, Library of America texts, MUC-3 texts, sentences from IBM computer manuals and the ATIS corpus of spontaneous speech transcriptions from the Air Travel Information Systems project [8]. The purpose of this particular selection of sources in Treebank is to represent the general English domain as completely as possible; however, it is not entirely clear how representative general English vocabulary and structure are of a specialized subdomain such as clinical reports. This article partly addresses this issue.

Based on the observable differences between clinical and general English discourse and on POS tagging accuracy results on unknown vocabulary, it is reasonable to assume that a tagger trained on general English may not perform as well on clinical notes, where the percentage of unknown words is greater. However, in order to test this assumption, a "gold standard" corpus of clinical notes needs to be manually annotated for POS information. The issues with the annotation process constitute the primary focus of this article.

In the remainder of the article, we describe an effort to train three medical coding experts to mark the text of clinical notes for part-of-speech information. The motivation for using medical coders rather than trained linguists is three-fold. First of all, due to confidentiality restrictions, in order to develop a corpus of hand-labeled data from clinical notes one can only use personnel authorized to access patient information. The only way to avoid this restriction is to anonymize the notes prior to POS tagging, which is in itself a difficult and expensive process [13]. Second, medical coding experts are well familiar with clinical discourse, which helps especially with annotating medicine-specific vocabulary. Third, the fact that POS tagging can be viewed as a classification task makes medical coding experts highly suitable because their primary occupation and expertise is in classifying patient records for subsequent retrieval.
We discuss the training process, various issues that are due to certain specifics of the clinical notes discourse, and the results of a pilot study that evaluates the level of inter-rater agreement between the three annotators. We show that given a good set of guidelines, medical coding experts can be trained in a limited amount of time to perform a linguistic task such as POS annotation at a high level of agreement on both clinical notes and the Penn Treebank Wall Street Journal data subset (referred to in the rest of the paper as Penn Treebank). Finally, we report on a set of training experiments performed with the TnT tagger [9] using Penn Treebank as well as the newly developed medical corpus.
2. Background

A well-recognized problem is that the accuracy of all current POS taggers drops dramatically on unknown words. For example, while the TnT tagger performs at 97% accuracy on known words in the Treebank, the accuracy drops to 89% on unknown words [9]. The LT POS tagger is reported to perform at 93.6-94.3% accuracy on known words and at 87.7-88.7% on unknown words using a cascading unknown word "guesser" [11]. The overall results for both of these taggers are much closer to the high end of the spectrum because the rate of unknown words in the tests performed on the Penn Treebank corpus is generally relatively low, 2.9% [9]. From these results, we can derive a testable hypothesis that the higher the rate of unknown vocabulary, the lower the overall accuracy will be, necessitating adaptation of taggers trained on Penn Treebank to sublanguage domains whose vocabulary is substantially different from that found in the Penn Treebank corpus.

A number of people have addressed this issue in the past. Campbell and Johnson [14] show that adapting a POS tagger to medical data improves its performance. Similar results have been obtained by [15], where a stochastic HMM-based POS tagger (MedPost) was trained on MEDLINE abstracts. Smith et al. [15] emphasize the importance of having a wide coverage lexicon for open class words such as nouns, verbs, adjectives, etc. available to achieve high accuracy. With a medical lexicon of 10,000 words, Smith et al. [15] are able to achieve over 97% accuracy on MEDLINE data. Another aspect of lexical items in biomedical texts, ambiguity, was investigated by Ruch et al. [17]. They find that lexical items in French biomedical corpora are less ambiguous (15.8%) overall with respect to their POS category as well as word sense than in general
French text (29.1%). They also conclude that adaptation is necessary when going from a general purpose to a specialized domain.

Wertmer and Hahn [16], on the other hand, find evidence to the contrary: they experiment with two statistical POS taggers (Brill and TnT) trained on German newspaper data and dispute Campbell and Johnson's [14] conclusions on the necessity of generating domain specific training data and adapting POS taggers. Wertmer and Hahn [16] applied the POS taggers to a variety of clinical documents that included discharge summaries, pathology reports and surgical reports. They find that the POS taggers trained on general German data perform very well on the medical corpus (close to 97% accuracy) and attribute this surprising result to the significant overlap in POS n-gram pattern types between the newspaper and the medical corpora. They conclude that statistical POS taggers trained on general purpose data can be readily used to process medical documents.

While there appears to be ample evidence to validate their conclusion for the German language, it is unclear whether it would generalize to other languages such as English. These experiments were conducted on a relatively small corpus (little over 100,000 tokens) of German text with restricted coverage: it represents only three types of clinical reports. Another important characteristic of their data that may have contributed to the surprising results is the fact that German has a very rich inflectional morphology, where the inflection in many cases tends to be a good predictor of the grammatical class to which the word belongs. The statistical package that performs best on the medical data in their experiments is the TnT tagger [9], which has an algorithm for predicting the category of unknown words (those not seen in the training corpus). This algorithm predicts the POS category based on affixation, i.e. the last N characters of a given unknown word. The experiments conducted by Brants [9] on the NEGRA corpus of German newspaper articles show that the accuracy of predicting the correct category for unknown German words is 89%, which is 4% higher than in similar experiments by Brants [9] on the Penn Treebank English data (85%). At the very least, these numbers suggest that the TnT tagger has built-in functionality that handles unknown words in German much better than in English. The Brill tagger has similar functionality for generalizing to unknown words. We do not have the corresponding numbers for the Brill tagger on German data; however, the evidence obtained with the TnT tagger seems to point in the direction of German inflectional morphology possibly contributing to the surprising results found by Wertmer and Hahn
[16]. In light of these facts, it is unclear whether Wertmer and Hahn's conclusion would generalize to English medical texts, because English has a relatively poor and underspecified morphology compared to German and thus does not lend itself as easily to the existing algorithms for unknown word category prediction.

Given the state of the art in POS tagging that we are aware of, there still appears to be a need for medical textual corpora of POS annotated data that can be used both for testing hypotheses and tools and for training and adapting general purpose tools to the domain. It is also important to note that the data we are working with consist of clinical notes collected at the Mayo Clinic. The language of the clinical notes is markedly different from that of pathology, surgical and radiology reports and, certainly, biomedical articles, which provides further justification for creating an annotated corpus for the clinical notes domain. We outline some of the characteristic features of clinical notes in the next section.
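The suffix-based handling of unknown words described above can be made concrete with a small sketch. This is not the actual TnT implementation (which uses smoothed suffix statistics and tag priors); it is a minimal Python illustration, under our own assumptions, of predicting a tag from the last N characters of a word, and the toy training data below are invented for the example.

```python
from collections import Counter, defaultdict

def train_suffix_model(tagged_words, max_suffix=4):
    """Count how often each word-final suffix (1..max_suffix characters) co-occurs with each POS tag."""
    suffix_tags = defaultdict(Counter)
    for word, tag in tagged_words:
        for n in range(1, min(max_suffix, len(word)) + 1):
            suffix_tags[word[-n:].lower()][tag] += 1
    return suffix_tags

def guess_tag(word, suffix_tags, max_suffix=4, default="NN"):
    """Back off from the longest matching suffix to shorter ones; fall back to a default tag."""
    for n in range(min(max_suffix, len(word)), 0, -1):
        counts = suffix_tags.get(word[-n:].lower())
        if counts:
            return counts.most_common(1)[0][0]
    return default

# Toy training data (illustrative only, not from any corpus used in the paper).
training = [("bilateral", "JJ"), ("femoral", "JJ"), ("renal", "JJ"),
            ("stenosis", "NN"), ("fibrosis", "NN"), ("nephrosis", "NN"),
            ("examined", "VBN"), ("prescribed", "VBN")]

model = train_suffix_model(training)
print(guess_tag("duodenal", model))   # JJ, via the "-enal"/"-al" suffixes
print(guess_tag("cirrhosis", model))  # NN, via the "-osis" suffix
```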
3. Clinical notes and quasi-spontaneous discourse

Clinical notes at the Mayo Clinic represent a record of an encounter between a physician and a patient. After each patient's visit, physicians are required by law to file a report that describes the findings and diagnoses discovered during the visit as well as other information that may be pertinent. From the language processing perspective, the main challenge presented by the clinical notes is that they represent quasi-spontaneous discourse [18]. Clinical notes are mostly dictated from memory and sometimes from partial handwritten notes. The characteristic features of this discourse include incomplete statements (see example (1)), tabular/templated statements (see examples (2) and (7)), inverted constructions (examples (3) and (4)), and marginally grammatical conversational statements (see example (5)). In addition, the process of transcription creates a noisy filter where various idiosyncratic elements are introduced. These include but are not limited to misspellings, shorthand notations (see examples (8) and (9)), and unsanctioned abbreviations and acronyms (see example (9)).

Some examples of statements found in clinical notes:

(1) Patient appreciative.
(2) ACTIVITY: as tolerated.
(3) Attending was Dr. LAST NAME.
(4) Cataract, OS, moderate OS, stable.
(5) Today is wife's retirement party.
(6) To assist with D/C plans and eventual homegoing.
(7) Beconase nasal spray, one squirt each nostril q.d.
(8) Lacerations × 2. ("two lacerations")
(9) S/P L hip #. ("status post left hip fracture")

In short, clinical notes combine the characteristics of conversational speech with the shorthand efficiency devices introduced by transcriptionists. This combination makes clinical note processing particularly challenging as compared to other kinds of medical texts. The fact that clinical notes recorded at the Mayo Clinic are loosely structured using the HL7 Clinical Document Architecture (CDA) specification is also important to note, as it may help in some text analytic tasks including POS tagging and word sense disambiguation. (Health Level 7 is a medical standards organization part of whose purpose is to establish and maintain various standards applicable to medical information management.) Fig. 1 shows a typical clinical note. Under the CDA, clinical notes are partitioned into sections such as Chief Complaint, Current Medications, History of Present Illness, Social History, Impression/Report/Plan, among many others. One can easily imagine that, for example, the language of a Social History section would be vastly different from the language of a Current Medications section and that these two sections should probably be treated very differently from an NLP standpoint.
4. Materials and methods

Prior to this study, the three annotators who participated in it had substantial experience in coding clinical diagnoses but virtually no experience in POS markup. The training process consisted of a general and rather superficial introduction to the issues in linguistics as well as some formal training using the POS tagging guidelines developed by Santorini [19] for tagging Penn Treebank data. Table 1 displays all the POS tags present in the Penn Treebank data and their short descriptions. The formal training was followed by informal discussions of the data and difficult cases pertinent to the clinical notes domain. The discussions often resulted in slight modifications to the Penn Treebank guidelines. Below is a list of such cases and the modifications.
Fig. 1 A clinical note sample (this is a note for a non-existent test patient).
Table 1 Penn Treebank tagset

Tag    Description                              Tag    Description
CC     Coordinating conjunction                 POS    Possessive ending
CD     Cardinal number                          PRP    Personal pronoun
DT     Determiner                               PRP$   Possessive pronoun
EX     Existential there                        RB     Adverb
FW     Foreign word                             RBR    Adverb, comparative
IN     Preposition/subordinating conjunction    RBS    Adverb, superlative
JJ     Adjective                                RP     Particle
JJR    Adjective, comparative                   TO     To
JJS    Adjective, superlative                   UH     Interjection
LS     List marker                              VB     Verb, base form
MD     Modal                                    VBD    Verb, past tense
NN     Noun, singular or mass                   VBG    Verb, gerund/present participle
NNS    Noun, plural                             VBN    Verb, past participle
NNP    Proper noun, singular                    VBP    Verb, non-3rd person singular, present
NNPS   Proper noun, plural                      VBZ    Verb, 3rd person singular, present
PDT    Predeterminer                            WDT    Wh-determiner
                                                WP     Wh-pronoun

4.1. Special symbols

Medical transcriptionists often use a single keystroke shorthand for words that occur relatively frequently. For example, "+" is often used to mean "positive", as in "positive throat culture", and "−" for "negative throat culture." The pound sign "#" is often used to mean "pounds" or "fracture", and "x" is often used to mean "scar", as in "x 2" (two scars). It is not entirely clear at this point whether it would be more beneficial to treat these symbols as actual symbols (SYM) or as a special kind of abbreviation. In the latter case, they would be tagged as if the actual word were used in place of the symbol. For the time being, we rather arbitrarily decided to mark them as symbols: "x/SYM 2/CD."
4.2. Drug names

Drug names are tagged as singular proper nouns (NNP) regardless of whether they are capitalized. Numeric and other types of attributes of drug names are considered to be part of the drug name and are tagged as NNP as well (e.g. "Extra/NNP Strength/NNP Tylenol/NNP #4/NNP"). The choice of the NNP label was fairly arbitrary and largely driven by the informal observation that drug names appear to be similar to other named entities such as geographical locations and persons in their distribution and orthography. The choice was also partly driven by the desire to avoid introducing new tags, for the sake of future compatibility with other NLP tools. At a later point the token "#4" in this example, and other instances of a symbol followed by a numeral, were converted to two tokens, "#" and the numeral, and re-tagged as a symbol (SYM) and a cardinal number (CD), respectively. It can be argued that this decision may adversely affect the results of an experiment where a tagger trained on the Penn Treebank data is tested on the clinical notes data; however, our experiments show that such an effect is unlikely. We address this issue in more detail in Section 6.
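The conversion of fused symbol-numeral tokens mentioned above (e.g. "#4" becoming "#"/SYM and "4"/CD) amounts to a simple post-processing pass over the tagged tokens. The pattern and tag choices below are our own sketch of such a pass, not the project's actual conversion script.

```python
import re

# Assumed pattern: one of the shorthand symbols fused with a number, e.g. "#4" or "x2".
SYMBOL_NUMERAL = re.compile(r"^(#|\+|-|x)(\d+)$")

def split_symbol_numeral(tagged_tokens):
    """Split tokens like ('#4', 'NNP') into [('#', 'SYM'), ('4', 'CD')]; leave other tokens untouched."""
    out = []
    for token, tag in tagged_tokens:
        m = SYMBOL_NUMERAL.match(token)
        if m:
            out.append((m.group(1), "SYM"))
            out.append((m.group(2), "CD"))
        else:
            out.append((token, tag))
    return out

print(split_symbol_numeral([("Tylenol", "NNP"), ("#4", "NNP")]))
# [('Tylenol', 'NNP'), ('#', 'SYM'), ('4', 'CD')]
```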
4.3. Dosages

Dosages have a recognizable structure which, in most cases, consists of the numeric magnitude of the dose (e.g. 500), the measurement unit (e.g. mg), the method of delivery (e.g. orally) and the interval of delivery (e.g. b.i.d., twice a day, etc.). The amount is tagged as a cardinal number (CD), the measurement unit is tagged as NNS or NN depending on whether the measurements are of a plural
or singular entity, and the method of delivery is tagged the same as normal text. A more problematic case is the interval of delivery, which may include actual phrases ("twice daily") or their Latin equivalents ("b.i.d."). The former is tagged as if it were normal text, the latter is tagged as a foreign word (FW).
4.4. Foreign words

The medical domain is permeated with words of Latin and Greek origin, and making a clear distinction between a medical word of foreign origin and a medical word that is still a foreign word is rather difficult, even in cases where the word has retained its foreign pronunciation, e.g. "polymyalgia rheumatica." In the majority of such cases, the words in question are somehow related to a condition, procedure or some other clinically related phenomenon and are normally not used separately. Based on that, we tag potential candidates for foreign words as if they formed a unit. By this token, "polymyalgia rheumatica" is tagged as a noun followed by another noun, forming a compound noun: polymyalgia/NN rheumatica/NN.
4.5. Annotation structure and format

Each clinical note is represented as an Extensible Markup Language (XML) document with the underlying schema shown in Fig. 2. The raw clinical notes go through automatic POS tagging using the Maximum Entropy POS tagger [22], whose results are rendered in XML format and are then presented in a graphical XML editor for correction.

Fig. 2 XML schema that represents an annotated document.

The top node in the XML schema represents the whole document and can have a number of "section" nodes under it that represent various subsections of a clinical note such as History of Present Illness (HPI), Chief Complaint (CC) and other standard sections compliant with the HL7 Clinical Document Architecture specification. The section nodes branch out into sentence nodes, and each sentence node can branch out directly into W nodes, which are combinations of a word's orthographic representation and its part-of-speech information. The W nodes can be arranged into phrases with various types of attributes. These can be either traditional linguistic phrases such as NP, VP, etc. or specialized medical phrases such as "drug mention." The "phrase" nodes group the W nodes under them into meaningful chunks. The "meaningfulness" of the chunks, as well as their boundaries, is left up to the domain experts to determine. The "phrase" node also allows simultaneous anonymization of data by introducing "phrase" nodes with special attributes to group patient identifying information. Fig. 3 shows what the actual text of the note looks like after it has been marked up with XML according to the schema given in Fig. 2. The "dual residence" of the W node under both the "phrase" node and the "sentence" node in the schema reflects the fact that we are aiming at working with incomplete shallow parses rather than full parses. In the former case, not all phrases that comprise a sentence have to be identified and, therefore, not every word has to be part of a phrase.

Fig. 3 A sample clinical notes text marked up in XML for POS information.

The data presented to the annotators are pre-processed before annotation. The pre-processing includes sentence boundary detection, tokenization and priming with part-of-speech tags generated by a MaxEnt tagger (Maxent 1.2.4 package (Baldridge et al.)) trained on Penn Treebank data. To expedite the annotation process, an in-house Java-based editor with a graphical user interface was developed.
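To make the annotation structure more concrete, the sketch below builds a minimal document following the hierarchy described above (document, section, sentence, phrase and W nodes). The element and attribute names are our own plausible rendering of the schema in Fig. 2, not the project's exact schema, and the content is invented.

```python
import xml.etree.ElementTree as ET

# Build a minimal document following the described hierarchy:
# document -> section -> sentence -> (phrase ->) W
doc = ET.Element("document", id="note-0001")
section = ET.SubElement(doc, "section", name="CM")       # e.g. Current Medications
sentence = ET.SubElement(section, "sentence", id="s1")

# A phrase node grouping a chunk of W nodes (here, a drug mention).
phrase = ET.SubElement(sentence, "phrase", type="drug_mention")
for orth, pos in [("Extra", "NNP"), ("Strength", "NNP"), ("Tylenol", "NNP")]:
    w = ET.SubElement(phrase, "W", pos=pos)
    w.text = orth

# Words outside any phrase hang directly off the sentence node
# (the "incomplete shallow parse" case described in the text).
w = ET.SubElement(sentence, "W", pos="VBD")
w.text = "helped"

print(ET.tostring(doc, encoding="unicode"))
```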
4.6. Annotator agreement

In order to establish the reliability of the data, we need to ensure internal as well as external consistency of the annotation. First of all, we need to make sure that the annotators agree amongst themselves (internal consistency) on how they mark
up text for part-of-speech information. Second, we need to find out how closely the annotators generating data for this study agree with the annotators of an established project such as Penn Treebank (external consistency). If both tests show relatively high levels of agreement, then we can safely assume that the annotators in this study are able to generate part-of-speech tags for biomedical data that will be consistent with a widely recognized standard and can work independently of each other, thus tripling the amount of manually annotated data.
4.7. Methods

Two types of measures of consistency were computed: absolute agreement and the Kappa coefficient. The absolute agreement was calculated according to formula (1):

    Abs_agr = 100 × (M / T)    (1)

where M is the total number of times all annotators agreed on a tag and T is the total number of tags. The Kappa coefficient is given in (2) [20]:

    Kappa = (P(A) − P(E)) / (1 − P(E))    (2)

where P(A) is the proportion of times the annotators actually agree and P(E) is the proportion of times the annotators are expected to agree due to chance. A very detailed explanation of the terms used in the formula for Kappa computation, as well as concrete examples of how it is computed, is provided in Poessio and Vieira [21].
The absolute agreement is most informative when computed over several sets of labels where one of the sets represents the "authoritative" set. In this case, the ratio of matches among all the sets, including the "authoritative" set, to the total number of labels shows how close the other sets are to the "authoritative" one. The Kappa statistic is useful in measuring how consistent the annotators are compared to each other, as opposed to an authority standard.
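To illustrate the two measures, the sketch below computes absolute agreement as in formula (1) and a chance-corrected Kappa for three annotators. The paper does not state which multi-rater formulation of Kappa was used; the version here estimates P(A) as average pairwise agreement and P(E) from the pooled tag distribution (a Fleiss-style formulation), so it should be read as an assumption rather than the authors' exact computation.

```python
from collections import Counter

def absolute_agreement(annotations):
    """annotations: list of per-annotator tag sequences of equal length.
    Percentage of positions where all annotators chose the same tag, as in formula (1)."""
    matches = sum(1 for tags in zip(*annotations) if len(set(tags)) == 1)
    return 100.0 * matches / len(annotations[0])

def kappa(annotations):
    """Chance-corrected agreement for k annotators over N items (Fleiss-style)."""
    items = list(zip(*annotations))
    n_items, n_raters = len(items), len(annotations)

    # P(A): average pairwise agreement per item.
    per_item = []
    for tags in items:
        counts = Counter(tags)
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item.append(agreeing_pairs / (n_raters * (n_raters - 1)))
    p_a = sum(per_item) / n_items

    # P(E): expected agreement from the pooled tag distribution.
    pooled = Counter(tag for tags in items for tag in tags)
    total = sum(pooled.values())
    p_e = sum((c / total) ** 2 for c in pooled.values())

    return (p_a - p_e) / (1 - p_e)

# Toy example: three annotators, six tokens.
a1 = ["NN", "IN", "JJ", "NN", "DT", "NN"]
a2 = ["NN", "IN", "JJ", "NN", "DT", "NNP"]
a3 = ["NN", "IN", "JJ", "NNS", "DT", "NN"]
print(absolute_agreement([a1, a2, a3]))  # about 66.7 (all three agree on 4 of 6 tokens)
print(kappa([a1, a2, a3]))
```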
4.8. Internal consistency

In order to test for internal consistency, we analyzed inter-annotator agreement where the three annotators tagged the same small corpus of clinical dictations. The results were compared and the Kappa statistic was used to calculate the inter-annotator agreement. The results of this experiment are summarized in Table 2. For the absolute agreement, we computed the ratio of how many times all three annotators agreed on a tag for a given token to the total number of tags. Based on the small pilot sample of five clinical notes (2686 words), the Kappa test showed a very high agreement coefficient of 0.93. An acceptable agreement for most NLP classification tasks lies between 0.7 and 0.8 [20,21]. The absolute agreement numbers are consistent with the high Kappa, as they show that an average of 90% of all tags in the test documents were assigned exactly the same way by all three annotators.
Table 2 Annotator agreement results based on five clinical notes

Note ID         Abs agr (%)   Kappa    N samples
1137689         93.24         0.9527   755
1165875         94.59         0.9622   795
1283904         89.79         0.9302   392
1284881         90.42         0.9328   397
1307526         84.43         0.8943   347
Total/Average   90.49         0.9344   2686

4.9. External consistency

The external consistency with the Penn Treebank annotation was computed using a small sample of 939 words from the Penn Treebank-2 WSJ Corpus annotated for POS information. The sample had not been seen by the annotators prior to the test. The annotators were asked to make POS judgments on the WSJ corpus sample just as they would on the clinical notes. The labels were compared to the Penn Treebank annotation individually by annotator, and the results of the comparison are presented in Table 3. The results indicate that the three annotators who participated in this project are on average 88% consistent with the annotators of the Penn Treebank corpus.

Table 3 Absolute agreement results based on five clinical notes with an "authority" label set

Annotator   Abs agr (%)
A1          88.17
A2          87.85
A3          87.85
Average     87.95
5. Results and discussion

The annotation process resulted in a corpus of 273 clinical notes annotated with POS tags. The corpus contains 100,650 tokens from 8702 types distributed across 7299 sentences. Table 4 displays frequency counts for the most frequent syntactic categories. The distribution of syntactic categories suggests the predominance of nominal categories, which is consistent with the nature of clinical notes reporting on various patient characteristics such as disorders, signs and symptoms. Another important descriptive characteristic of this corpus is that the average sentence length is 13.79 tokens per sentence, which is relatively short as compared to the Penn Treebank corpus, where the sentence length is 24.16 tokens per sentence. This supports our informal observation that the clinical notes data contain multiple sentence fragments and short diagnostic statements. Shorter sentence length implies a greater number of inter-sentential transitions and therefore is likely to present a challenge for a tagger based on n-gram transitions such as an HMM tagger.

Table 4 Syntactic category distribution in the corpus of clinical notes

Category   Count   Total (%)
NN         18372   18
IN         8963    9
JJ         8851    9
DT         6796    7
NNP        4794    5
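Given the XML format described in Section 4.5, descriptive statistics of this kind (token, type and sentence counts, average sentence length) are easy to recompute. The sketch below assumes sentence and W element names like those in our earlier schema illustration; the real files may differ.

```python
import xml.etree.ElementTree as ET

def corpus_stats(paths):
    """Token, type and sentence counts plus average sentence length over a set of annotated notes."""
    tokens, types, sentences = 0, set(), 0
    for path in paths:
        root = ET.parse(path).getroot()
        for sent in root.iter("sentence"):
            sentences += 1
            for w in sent.iter("W"):
                tokens += 1
                types.add((w.text or "").lower())
    return {"tokens": tokens,
            "types": len(types),
            "sentences": sentences,
            "avg_sentence_length": tokens / sentences if sentences else 0.0}

# Usage (file names are placeholders): corpus_stats(["note_0001.xml", "note_0002.xml"])
```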
5.1. Results of training and testing a POS tagger on medical data
In order to test some of our assumptions regarding how the differences between general English and the language of clinical notes may affect POS tagging, we trained the HMM-based TnT tagger [9] with default parameters at the trigram level both on Penn Treebank and on the clinical notes data. We should also note that the tagger relies on a sophisticated "unknown" word guessing algorithm, which computes the likelihood of a tag based on the last N letters of the word and is meant to leverage the word's morphology in a purely statistical manner.

The clinical notes data were split at random 10 times in an 80/20 fashion, where 80% of the sentences were used for training and 20% were used for testing. This technique is a variation on classic 10-fold cross-validation and appears to be more suitable for smaller amounts of data. As a result, we produced 10 training clinical notes corpora and 10 testing clinical notes corpora. We conducted two experiments. First, we trained a tagger on the Treebank data and tested it on each of the 10 testing clinical notes corpora. We tested the tagger trained on Treebank on the 10 testing corpora, rather than on the whole corpus of clinical notes, in order to produce correctness results on exactly the same test data as would be used for testing the tagger trained on the 10 clinical notes training corpora. Correctness was computed simply as the percentage of correct tag assignments of the POS tagger (hits) out of the total number of tokens in the test set:

    Correctness = 100 × (Hits / Total)    (3)

Table 5 summarizes the results of testing the tagger trained on the Penn Treebank data, while Table 6 summarizes the testing results for the tagger trained on the clinical notes. The average correctness of the tagger trained on Penn Treebank and tested on clinical notes is ∼90%, which is considerably lower than the state-of-the-art performance of the TnT tagger, ∼96%. Training the tagger on a relatively small amount of clinical notes data brings the performance much closer to the state of the art, ∼95%.

Table 5 Correctness results for the tagger trained on Penn Treebank

Split     Hits      Total   Correctness (%)
1         21560     23872   90.32
2         22122     24665   89.69
3         21417     23923   89.52
4         21970     24461   89.82
5         22079     24665   89.52
6         21649     24049   90.02
7         21598     24040   89.84
8         21379     23882   89.52
9         22131     24610   89.93
10        22358     24923   89.71
Average   21826.3   24309   89.79

Table 6 Correctness results for the tagger trained on the clinical notes

Split     Hits      Total   Correctness (%)
1         22654     23872   94.90
2         23332     24665   94.60
3         22645     23923   94.66
4         23206     24461   94.87
5         23326     24665   94.57
6         22732     24049   94.52
7         22807     24040   94.87
8         22603     23882   94.64
9         23316     24610   94.74
10        23563     24923   94.54
Average   23018.4   24309   94.69

Given the diversity of language used in various sections of the clinical notes, we also wanted to investigate whether we would find fluctuations in accuracy results depending on the section. In order to do this, we split the corpus of medical data into 10 subcorpora, each containing data from a single section. The choice of the 10 corpora was dictated by the amount of available data. The token counts for each subcorpus are given in Table 7. The smallest corpus (ALL/Allergies) contains 392 tokens, while the largest corpus (IP/Impression-Plan) contains 43,633 tokens. We applied a model trained on Penn Treebank to each of the 10 corpora in turn. This model is distributed with the TnT tagger package. The results are presented in Table 7 and sorted in increasing order of correctness, where the Current Medications (CM), Special Instructions (SI) and Allergies (ALL) sections have the lowest correctness and the History of Present Illness (HPI), Social History (SH) and Family History (FH) sections have the highest correctness. The results for these sections are greater than the mean by more than one standard deviation.

Table 7 Correctness results for the general English model tested on separate sections of the medical corpus

Section                            Token ambiguity   Known token ambiguity   Percent unknown   Correctness (%)   Token count
CM (Current Medications)           5.87              1.90                    33.22             74.70             3853
SI (Special Instructions)          7.03              2.41                    23.26             77.20             1741
ALL (Allergies)                    4.51              1.99                    27.30             79.34             392
PSH (Past Surgical History)        3.77              2.07                    22.12             84.42             3061
CC (Chief Complaint)               3.67              2.12                    19.39             86.32             2997
IP (Impression/Plan)               3.51              2.33                    14.67             87.40             43633
ROS (Review of Systems)            3.04              2.12                    12.31             89.25             3469
HPI (History of Present Illness)   3.31              2.30                    11.38             90.34             37262
SH (Social History)                2.61              2.18                    5.64              91.36             1400
FH (Family History)                2.92              2.10                    9.25              92.62             854
Total/Average                      4.024             2.152                   17.85             85.30             98662

Clearly, since the IP and HPI sections have the most representation in the corpus, the overall results of testing a Penn Treebank tagger on the entire medical corpus reported in Table 5 (90%) tend to be aligned somewhere between the results obtained from the IP section (87.4%) and the HPI section (90.34%). We would like to emphasize the point that the results obtained on each individual
section vary to a large extent, which suggests that further adaptation with more training data for a particular section of a clinical note may be desirable or even necessary, as the correctness numbers with the CM, SI and ALL sections seem to indicate.

The results presented in Table 7 also illustrate the correlation between the percentage of tokens previously unseen in the training data ('percent unknown' column) and the correctness of the POS tagger. The two variables negatively correlate at −0.94. The correlation between correctness and the overall POS ambiguity ('token ambiguity' column) in the corpus is also very high, −0.91; however, the correlation between correctness and the ambiguity of only the known (previously seen in the training data) tokens is 0.27. This observation is interesting because it indicates that there is a relationship between unknown tokens and ambiguity in terms of how both of these factors affect POS performance. In fact, one could view a token that the POS tagger never 'saw' in the training data as one that is as ambiguous as the entire POS tagset. Since the tagger has no statistical information about it, there is nothing to restrict the set of possible POS tags. In the case of the Penn Treebank tagset, an unknown token could be potentially ambiguous in 45 ways, where the only restriction is the prior probability of each POS tag.

Since the top five most frequent POS categories in the corpus of clinical notes are nouns (NN), prepositions (IN), adjectives (JJ), determiners (DT) and proper nouns (NNP), as shown in Table 4, one would expect that the majority of the errors involving unknown words would result in one of these top categories being assigned to the unknown token. We tested this hypothesis by determining the relative frequency of the top 10 most frequent POS tags assigned to the unknown tokens. The results are summarized in Table 8 and confirm the hypothesis.
Table 8 Relative frequency of POS categories assigned to "unknown" tokens by the TnT tagger

N      POS tag   Relative frequency (%)
154    VB        1.03
163    RB        1.09
174    VBN       1.16
175    VBG       1.17
228    FW        1.52
626    CD        4.17
1003   NNS       6.68
2693   JJ        17.93
4736   NNP       31.53
4738   NN        31.54
The highest relative frequency of POS tags assigned to unknown tokens belongs to nouns and proper nouns (both at 31%), followed by adjectives (17%). After adjectives, the relative frequency drops to 7% with plural nouns (NNS). The relative frequency of prepositions (IN) and determiners (DT) did not even make it into the top 10 list and is less than 1%. This is not unexpected because both DT and IN belong to a closed class of POS tags and are relatively rarely assigned erroneously.
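The correlations discussed above can be recomputed directly from the columns of Table 7. The sketch below uses a plain Pearson correlation; the values are transcribed from the table, and the results should come out close to the reported figures (−0.94, −0.91 and 0.27), allowing for rounding.

```python
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Columns of Table 7, one value per section (CM, SI, ALL, PSH, CC, IP, ROS, HPI, SH, FH).
token_ambiguity = [5.87, 7.03, 4.51, 3.77, 3.67, 3.51, 3.04, 3.31, 2.61, 2.92]
known_ambiguity = [1.90, 2.41, 1.99, 2.07, 2.12, 2.33, 2.12, 2.30, 2.18, 2.10]
pct_unknown     = [33.22, 23.26, 27.30, 22.12, 19.39, 14.67, 12.31, 11.38, 5.64, 9.25]
correctness     = [74.70, 77.20, 79.34, 84.42, 86.32, 87.40, 89.25, 90.34, 91.36, 92.62]

print(pearson(pct_unknown, correctness))      # strongly negative (reported as -0.94)
print(pearson(token_ambiguity, correctness))  # strongly negative (reported as -0.91)
print(pearson(known_ambiguity, correctness))  # weakly positive (reported as 0.27)
```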
6. Discussion

This paper was intended to share the goals and challenges experienced during annotation and analysis of a small sample of clinical notes data for part-of-speech information. The annotation was performed by experts in the domain of indexing medical content who were minimally trained to label medical texts for part-of-speech. We have outlined some of the challenging issues in clinical note annotation as well as some of the more outstanding differences between clinical notes and other types of written
and spoken discourse widely used in training NLP applications.

The results of this pilot project are encouraging. It is clear that with appropriate supervision, people who are well familiar with medical content can be reliably trained to carry out some of the tasks traditionally done by trained linguists. We have shown that the three annotators in this study have been able to achieve relatively high levels of inter-rater agreement (Kappa ∼0.93) as well as compliance with an authoritative Penn Treebank annotation (absolute agreement ∼89%).

This study also indicates that a statistical POS tagger trained on data that does not include clinical documents may not perform as well as a tagger trained on at least some data from the clinical domain. There are a number of factors that contribute to this. "Unknown" out-of-vocabulary items are only one of the factors responsible for the decreased accuracy of POS taggers trained on the general English domain when they are used in the clinical notes domain. The differences in the actual distribution of tokens and POS categories in general English and the clinical domain seem to be another major contributing factor. A comparison between the Treebank and the clinical notes data shows that the clinical notes corpus contains 3239 lexical items that are not found in Penn Treebank, the Penn Treebank corpus contains over 40,000 lexical items that are not found in the corpus of clinical notes, and 5463 lexical items are found in both corpora. In addition to this 37% out-of-vocabulary rate (words in clinical notes but not in the Penn Treebank corpus), the picture is further complicated by the differences between the n-gram tag transitions within the two corpora. For example, the likelihood of a DT → NN bigram is 1 in Penn Treebank and 0.75 in the clinical notes corpus. On the other hand, the JJ → NN transition in the clinical notes is 1, but in the Penn Treebank corpus it has a likelihood of 0.73.

We also addressed the issue of having lowercased medication names annotated as proper nouns by comparing the errors produced by the two taggers. The confusion between NN and NNP produced by the tagger trained on Penn Treebank and tested on the clinical notes accounted for 14% of all errors, as opposed to 6% of the errors when the tagger is tested on Penn Treebank data. This result suggests that our decision to use the NNP tag for all medication names, including the lowercased ones, may have introduced some confusion into the "unknown" word guessing built into the TnT tagger. In order to test this hypothesis, we changed the tags of all lowercase words that were initially tagged as NNP to NN and tested the tagger trained on Penn Treebank data again on the clinical notes.
This resulted in NN-NNP confusion being responsible for 12% of the errors. The overall correctness results remained unaffected (89.99%), which indicates that the NN-NNP confusion is likely due to the fact that most medication names are not in the Penn Treebank's vocabulary. Since most of them do not comply with regular English morphology, an unknown word guessing algorithm is likely to do poorly on this particular category of words.

Finally, interesting results emerged from our experiments applying the general purpose POS tagger to various sections of the clinical notes. We see lower than overall correctness on the SI, CM and ALL sections and higher than overall correctness on the HPI, FH and SH sections. When we examine the contents of the different sections, we find that the SI, CM and ALL sections differ substantially from the HPI, FH and SH sections in both vocabulary and structure. For example, the SI section contains a lot of names of local places around the Mayo Clinic and tends to be telegraphic in format:

SI:
    TESTING ONLY 10-8-98
    To correspondence pending biopsy
    Please have iron studies routed to secretary (e-mail sent)

A typical CM section contains medication names, dosages and instructions, mostly in an abbreviated form:

CM:
    Glucotrol XL 10 mg one by mouth two times a day
    Folic acid 1 mg one p.o. q.d.

In contrast, while the HPI, FH and SH sections also contain specialized medical vocabulary, they tend to resemble the regular English one may find in print or in casual conversation much more closely than the CM, ALL and SI sections. Here is an example of an SH section:

SH:
    She drinks approximately one to two caffeinated beverages a day
    She does not use tobacco, alcohol, recreational, or street drugs
    She denies any history of abuse
    She wears seatbelts 100% of the time
These examples are fairly representative of the content of the various sections in clinical notes. One possible explanation for why the results are so much better on the SH, FH and HPI sections compared to the CM, SI and ALL sections is that the vocabulary and the language patterns in the former are much closer to the vocabulary and the language patterns found in general English data. Due to the relatively small size of the available hand-labeled data, it is hard to draw final conclusions. However, it is clear at this point that, at least for the domain of clinical notes, it is necessary to obtain domain specific data in order to train state-of-the-art POS taggers.
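The distributional differences discussed in this section (out-of-vocabulary rates and tag-bigram transition patterns) can be measured with a few lines of code. The sketch below compares vocabulary overlap and conditional tag-transition estimates between two POS-tagged corpora; the corpus representation is our own assumption, and the "likelihood" figures quoted above may have been normalized differently by the authors.

```python
from collections import Counter, defaultdict

def vocabulary(tagged_sentences):
    """Set of distinct lowercased word forms in a corpus of (word, tag) sentences."""
    return {word.lower() for sent in tagged_sentences for word, _ in sent}

def transition_probs(tagged_sentences):
    """Estimate P(next_tag | tag) from tag-bigram counts."""
    counts = defaultdict(Counter)
    for sent in tagged_sentences:
        tags = [tag for _, tag in sent]
        for prev, nxt in zip(tags, tags[1:]):
            counts[prev][nxt] += 1
    return {prev: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
            for prev, nxts in counts.items()}

def compare(corpus_a, corpus_b):
    """Out-of-vocabulary rate of corpus_b relative to corpus_a, plus an example transition estimate."""
    vocab_a, vocab_b = vocabulary(corpus_a), vocabulary(corpus_b)
    oov_rate = len(vocab_b - vocab_a) / len(vocab_b)
    p_a, p_b = transition_probs(corpus_a), transition_probs(corpus_b)
    return oov_rate, p_a.get("DT", {}).get("NN", 0.0), p_b.get("DT", {}).get("NN", 0.0)

# corpus_a / corpus_b are lists of sentences, each a list of (word, tag) pairs,
# e.g. [[("The", "DT"), ("patient", "NN"), ("walks", "VBZ")]].
```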
7. Conclusion

We have trained three medical index coders to annotate a relatively small corpus of clinical notes at the Mayo Clinic for POS information. We have also performed a series of experiments with a state-of-the-art part-of-speech tagger (TnT) trained on general English and on a portion of the annotated clinical notes corpus. We find that using domain data is beneficial to improving POS tagging correctness and, furthermore, that adaptation to the various discourse types represented in the sections of clinical documents may also be of substantial benefit.

Several questions remain unresolved. First of all, it is unclear how much domain specific data are enough to achieve state-of-the-art performance on POS tagging. Second, given that it is somewhat easier to develop lexicons for POS tagging than to annotate corpora, we need to find out how important corpus statistics are as opposed to a domain specific lexicon. In other words, the question is whether we can achieve state-of-the-art performance in a specialized domain by simply adding the vocabulary from the domain to the POS tagger's lexicon. We intend to address both of these questions with further experimentation.
Acknowledgements

We would like to thank our medical index experts Barbara Abbot, Pauline Funk and Debora Albrecht for their persistent efforts in the difficult task of corpus annotation. This work was done in part under the NLM Training grant (# T15 LM07041-19).
References

[1] G. Savova, M.R. Harris, T. Johnson, S.V. Pakhomov, C.G. Chute, A data-driven approach for extracting "the most specific term" for ontology development, in: Proceedings of the AMIA Symposium, 2003, pp. 579-583.
[2] S. Ananiadou, A methodology for automatic term recognition, in: Proceedings of COLING Symposium, 1994, pp. 1034-1038.
[3] K. Frantzi, S. Ananiadou, J. Tsujii, The C-value/NC-value method of automatic recognition for multi-word terms, in: Proceedings of ECDL Symposium, 1998, pp. 585-604.
[4] M. Weeber, J. Mork, A. Aronson, Developing a test collection for biomedical word sense disambiguation, in: Proceedings of AMIA Symposium, 2001.
[5] W.W. Chapman, W. Bridewell, P. Hanbury, G.F. Cooper, B. Buchanan, Evaluation of negation phrases in narrative clinical reports, in: Proceedings of AMIA Symposium, 2001, pp. 105-109.
[6] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Girju, R. Goodrum, V. Rus, The structure and performance of an open-domain question answering system, in: Proceedings of ACL Symposium, 2000.
[7] M. Hearst, Untangling text data mining, in: Proceedings of the 37th Annual Meeting of the Association for Computer Linguistics (ACL'99), 1999, pp. 3-10.
[8] M. Marcus, B. Santorini, M.A. Marcinkiewicz, Building a large annotated corpus of English: the Penn Treebank, Comput. Linguistics 19 (1993) 297-352.
[9] T. Brants, TnT: a statistical part-of-speech tagger, in: Proceedings of NAACL/ANLP-2000 Symposium, 2000.
[10] D. Cutting, J. Kupiec, J. Pedersen, P. Sibun, A practical part-of-speech tagger, in: Proceedings of ANLP Symposium, 1992.
[11] A. Mikheev, Automatic rule induction for unknown-word guessing, Comput. Linguistics 23 (3) (1997) 405-423.
[12] A. Ratnaparkhi, A maximum entropy part of speech tagger, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), University of Pennsylvania, May, 1996.
[13] P. Ruch, R. Baud, A. Rassinoux, P. Bouillon, G. Robert, Medical document anonymization with a semantic lexicon, in: Proceedings of AMIA Symposium, 2000, pp. 729-733.
[14] D. Campbell, S.B. Johnson, Comparing syntactic complexity in medical and non-medical corpora, in: Proceedings of the AMIA Annual Symposium, 2001, pp. 90-95.
[15] L.H. Smith, T. Rindflesch, J. Wilbur, MedPost: a part of speech tagger for biomedical text, Bioinformatics 20 (14) (2004) 2320-2321.
[16] J. Wertmer, U. Hahn, Really, is medical sublanguage that different? Experimental counter-evidence from tagging medical and newspaper corpora, in: Proceedings of Medinfo, 2004.
[17] P. Ruch, R. Baud, A. Geissbuhler, A.M. Rassinoux, Comparing general and medical texts for information retrieval based on natural language processing: an inquiry into lexical disambiguation, in: Proceedings of Medinfo Symposium, 2001, pp. 261-265.
[18] S. Pakhomov, M. Schonwetter, J. Bachenko, Generating training data for medical dictations, in: Proceedings of NAACL Symposium, 2001.
[19] B. Santorini, Part-of-speech tagging guidelines for the Penn Treebank project, Technical Report, Department of Computer and Information Science, University of Pennsylvania, 1991.
[20] J. Carletta, Assessing agreement on classification tasks: the Kappa statistic, Comput. Linguistics 22 (2) (1996) 249-254.
[21] M. Poessio, R. Vieira, A corpus based investigation of definite description use, Comput. Linguistics 24 (2) (1998) 186-215.
[22] J. Baldridge, T. Morton, G. Bierner, URL: http://maxent.sourceforge.net (last accessed 11/24/04).