Common data model for natural language processing based on two existing standard information models: CDA+GrAF
Journal of Biomedical Informatics 45 (2012) 703–710
Stéphane M. Meystre a,b,*, Sanghoon Lee a, Chai Young Jung a, Raphaël D. Chevrier c
a Department of Biomedical Informatics, University of Utah, School of Medicine, Salt Lake City, UT, United States
b VA Salt Lake City Health Care System, Salt Lake City, UT, United States
c University of Geneva, School of Medicine, Geneva, Switzerland
Article history: Received 31 July 2011; Accepted 30 November 2011; Available online 8 December 2011
Keywords: Natural language processing (MeSH L01.224.065.580); Medical informatics (L01.700); Data model; Information model; HL7 Clinical Document Architecture; ISO Graph Annotation Format
Abstract

An increasing need for collaboration and resources sharing in the Natural Language Processing (NLP) research and development community motivates efforts to create and share a common data model and a common terminology for all information annotated and extracted from clinical text. We have combined two existing standards: the HL7 Clinical Document Architecture (CDA), and the ISO Graph Annotation Format (GrAF; in development), to develop such a data model entitled "CDA+GrAF". We experimented with several methods to combine these existing standards, and eventually selected a method wrapping separate CDA and GrAF parts in a common standoff annotation (i.e., separate from the annotated text) XML document. Two use cases, clinical document sections, and the 2010 i2b2/VA NLP Challenge (i.e., problems, tests, and treatments, with their assertions and relations), were used to create examples of such standoff annotation documents, and were successfully validated with the XML schemata provided with both standards. We developed a tool to automatically translate annotation documents from the 2010 i2b2/VA NLP Challenge format to GrAF, and automatically generated 50 annotation documents using this tool, all successfully validated. Finally, we adapted the XSL stylesheet provided with HL7 CDA to allow viewing annotation XML documents in a web browser, and plan to adapt existing tools for translating annotation documents between CDA+GrAF and the UIMA and GATE frameworks. This common data model may ease directly comparing NLP tools and applications, combining their output, transforming and "translating" annotations between different NLP applications, and eventually "plug-and-play" of different modules in NLP applications.

© 2011 Elsevier Inc. All rights reserved.
* Corresponding author at: Department of Biomedical Informatics, University of Utah, 26 S 2000 E, HSEB suite 5700, Salt Lake City, UT 84112, United States. Fax: +1 801 581 4297. E-mail address: [email protected] (S.M. Meystre).
doi:10.1016/j.jbi.2011.11.018

1. Introduction

The rapid adoption of Electronic Health Records (EHRs) and the corresponding growth of narrative data in electronic form, along with the needs for enhanced quality of care and reduced medical errors, are strong incentives for the development of Natural Language Processing (NLP) in the clinical domain. Most EHR content is recorded in narrative format. These documents represent the majority of the information used for medical care [1], but technologies developed to enhance quality of care and reduce medical errors require structured and coded data instead. As a possible answer to this issue, NLP can be used to convert free-text into structured and coded data [2]. The NLP research community is aware of this potential, and more NLP tools and applications have recently been developed
and made available to the community than ever before. Even a few corpora of de-identified clinical documents have become available for research in this domain. Sharing resources, both applications and annotated corpora, has become an important need, but a common framework to represent and name the elements used by these resources is lacking. Certain fundamental representational principles have been widely adopted, such as the use of standoff annotations (i.e., separate from the annotated text) and XML (Extensible Markup Language), but annotation formats often vary considerably from application to application, and require significant transformation and terminology mapping when shared. Such a framework, featuring a common data model and a common terminology for all information annotated and extracted from clinical text, would enable significantly more efficient research and development collaborations. It would allow directly comparing NLP tools and applications, combining their output, transforming and ‘‘translating’’ annotations between different NLP applications, and eventually ‘‘plug-and-play’’ of different modules in NLP applications. This would be especially useful in systems combining
components from different sources, and in competitions such as the i2b2 NLP Challenges. The AMIA NLP working group and teams at the Veterans Health Administration have recognized this lack of a common framework to represent clinical text and the linguistic information extracted from it, and have initiated projects to answer this need. They are now collaborating in the exploration, evaluation, and development of a framework with a data model for combined clinical text and linguistic information, and terminologies for the linguistic information that can be extracted from clinical documents. The effort described here focuses on the former: the data model.

2. Background

The development of collections of narrative text corpora, as well as their enrichment with various annotations, is crucial to NLP research. Several such corpora have been developed in the last two decades, especially in the general English domain (e.g., Brown corpus [3], Penn Treebank [4], American National Corpus [5]), and later also in the biomedical domain (i.e., scientific biomedical publications; e.g., GENIA [6], PennBioIE [7]). In the clinical domain, patient confidentiality issues have significantly limited progress, and only very few de-identified clinical text corpora have recently become available (e.g., i2b2 NLP Challenges [8]). The representation of such text corpora annotations has also varied significantly, ranging from tabular formats (one annotation per line), to inline XML annotations, and standoff XML formats. Inline annotations are stored in the annotated text. Standoff annotations are stored separately from the annotated text, offering advantages such as multiple overlapping annotations [9]. Several information representation standards for clinical text or linguistic information have already been published or are under development. A model for structured clinical text representation was proposed by Friedman and colleagues in 1999.
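The contrast between inline and standoff annotation can be illustrated with a small hypothetical sketch; the element and attribute names below are invented for illustration and are not taken from any of the cited standards:

```xml
<!-- Inline annotation: markup is embedded directly in the annotated text -->
<sentence>The patient denies <problem>chest pain</problem>.</sentence>

<!-- Standoff annotation: the text is stored unchanged, and annotations
     reference it by character offsets, so several (possibly overlapping)
     annotation layers can coexist over the same read-only text -->
<text>The patient denies chest pain.</text>
<annotation type="problem" start="19" end="29"/>
<annotation type="assertion" value="absent" start="19" end="29"/>
```

The standoff form is what both CDA+GrAF and the annotation formats discussed below rely on: because the annotations live outside the text, multiple tools can annotate the same document without conflicting.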
It used inline XML annotations and was based on a set of predefined elements to annotate sections, sentences, phrases, problems, and associated attributes [10]. Prominent standards for clinical document structure and content include the HL7 Clinical Document Architecture (CDA [11]), the ASTM Continuity of Care Record (CCR [12]), and the HL7 Continuity of Care Document (CCD [13]). In 2001, Dolin already realized that ". . . given the variability in clinical notes, including structure, underlying information models, degree of semantic encoding, use of standard healthcare terminologies, and platform- and vendor-specific features, it is currently difficult to store and exchange documents with retention of standardized semantics over both time and distance." The CDA was developed to answer these issues and represent all types of clinical notes, while the CCR and the CCD are only intended for summary clinical information representation and exchange. Since its initial release in 2001, the HL7 CDA has evolved and has been implemented in several healthcare organizations including the Mayo Clinic, the Veterans Health Administration [14], the University of Erlangen-Nuremberg in Germany [15,16], and many other international projects. Johnson et al. have proposed a structured narrative model for the EHR based on the HL7 CDA, using dedicated elements to annotate and specify detailed structure in the text [17]. In the linguistic domain, several data models have been developed, such as Annotation Graphs [18] or MATE [19], but the most significant work is currently realized within the International Organization for Standardization (ISO) Technical Committee 37 (TC37) sub-committee 4 (SC4). The TC37/SC4 is focused on Language Resources Management, and is developing several standards for morphological, lexical, syntactic, and semantic information representation. Among these standards, the Linguistic Annotation
Framework (LAF [20], ISO 24612, 2009) provides a framework for linguistic annotation of language resources that can serve as a common reference for different annotation schemes. It includes a data model and also provides an XML serialization called the Graph Annotation Format (GrAF [21]). Both LAF and GrAF have been developed as pivot models, to serve as a lingua franca between different text corpora or NLP applications. This format has already been used to translate annotated text from the GATE framework [22] to the UIMA framework [23], and back [24]. Both are open source NLP frameworks offering a development environment, a collection of methods and resources for basic text processing tasks, many reusable components for advanced text analysis, and a specific data model. GATE (General Architecture for Text Engineering) is developed by a team at the University of Sheffield (UK); UIMA (Unstructured Information Management Architecture) was initially developed by IBM, and is now an Apache project. Finally, GrAF provides the flexibility to represent any type of text annotation, and is also used to represent annotations in the American National Corpus [5].

3. Methods

The existence of standards for different types of information representation that have already been evaluated motivated us to try combining them and developing a hybrid data model capable of representing clinical documents and their semantic content, along with linguistic information extracted from these documents. We chose the HL7 CDA to represent the clinical document metadata (e.g., patient name, author of the report, date of dictation), structure (i.e., sections and their headers), and semantic content (i.e., structured and coded concepts linked with reference terminologies), and ISO GrAF to represent linguistic information at the lexical level (tokens, sentence boundaries) and syntactic level (parts-of-speech, phrases). This hybrid data model is called "CDA+GrAF", and is explained in detail below.

3.1.
Practical clinical text annotation use cases

We used two different clinical text annotation use cases for the work described here: clinical document sections and the 2010 i2b2/VA NLP Challenge [25]. This NLP Challenge included annotations of medical problems, tests, and treatments; of medical problem assertions; and of relations between medical problems, tests, and treatments. A collection of 349 de-identified discharge summaries and progress notes from this NLP Challenge training corpus, along with their reference standard annotations, was used for our experiment. A sub-corpus of 50 documents was randomly selected and used for the exploration of our two use case annotations with GrAF and CDA separately. Two documents were also randomly selected for the manual development and testing of the various models described below (these documents were manually transformed and completely de-identified again for the examples included in this paper).

3.2. The HL7 Clinical Document Architecture

The HL7 CDA [11] was developed to represent the structure and semantics of all types of clinical documents, for storage and exchange, as mentioned above. It is part of the HL7 version 3 family of standards, and is therefore based on a common information model, the HL7 RIM (Reference Information Model) [26]. CDA documents are encoded in XML and are wrapped by the <ClinicalDocument> element. They include a header with all document metadata (i.e., data about the data), and a body with the document content (sections, paragraphs, lists, etc.). The header lies between the <ClinicalDocument>
and the <structuredBody> elements, and has four components: (1) document information to identify the document, and define confidentiality status and relationships to other documents; (2) encounter data to describe the setting in which a documented encounter occurred; (3) service actors who authored and authenticated the document, are intended to receive a copy of the document, or transcribed it; and (4) service targets that include the patient and other significant participants. The body comprises sections, paragraphs, lists, and tables. These structures have captions, can nest, and can contain coded "entries" with concepts from standard terminologies. A document section is wrapped by the <section> element and can contain a single narrative block (containing the human-readable content, wrapped by the <text> element), and any number of entries and external references. We used the HL7 CDA to represent the clinical document metadata, structure, semantic content, and also the original clinical document text. For our use cases, the metadata were very limited (de-identified documents). The structure and semantic content consisted of document sections, and of concepts coded with an existing standard terminology: SNOMED-CT [27].

3.3. The Graph Annotation Format

As mentioned above, ISO GrAF is the XML serialization of the Linguistic Annotation Framework (LAF [20], ISO 24612 (2009)). GrAF allows representation of diverse annotation types whilst maintaining at all times syntactic consistency between them. GrAF does not address the underlying semantic issues of annotation sharing, but offers instead a unique structured and flexible format to represent the annotations themselves. It is a high-level data model that virtually enables the mapping of any kind of annotation to any kind of media (e.g., text, sound, video). Each GrAF document has a header element containing meta-information about the annotations.
Then, GrAF expresses the referential structure of a linguistic annotation with three main XML elements: <node>, <edge>, and <region>, as depicted in Fig. 1. The <region> element indicates the area of a document
defined by start and end anchors, which provides primary positional data for other annotations. Both <node> and <edge> elements are used to contain the annotation information for a given object. Relationships between nodes are represented with edges. In our work, we use GrAF to represent lexical and syntactic information contained in narrative clinical documents. This information is stored as standoff annotations in a CDA+GrAF XML document. With the use of different GrAF elements (<node> and <link>), annotations point at regions of the clinical text. Through this process, annotations are mapped to their targets and informative content is added to the clinical document as proposed with GrAF [21]. In conformance with GrAF and LAF guidelines for good practice, none of the original ("read-only") text document is modified during the creation of the CDA+GrAF XML document. We limited the use of GrAF's annotations to syntactic and lexical information only. HL7 CDA indeed provides an efficient and broadly accepted model to represent the metadata, structure, and semantic content of clinical documents. Nonetheless, it is important to bear in mind that GrAF could be used for virtually any annotation purpose. For our use cases, GrAF was used to annotate all information extracted for the 2010 i2b2/VA NLP Challenge (i.e., medical problems, tests, and treatments, problem assertions, and relations between problems, tests, and treatments).

3.4. Combining the HL7 CDA and GrAF

HL7 CDA and ISO GrAF can be combined in multiple different ways, and we experimented with four of them (Fig. 2). From the most to the least integrated, these combination possibilities include:

1. Common standoff annotations (CDA) XML document with embedded GrAF annotations in each section.
2. Common standoff annotations (CDA) XML document with GrAF annotations grouped at the end of the CDA body.
3. Common standoff annotations XML document with separate CDA and GrAF annotations.
Fig. 1. GrAF XML schema representation. The referential structure of a linguistic annotation is represented by the three main XML elements <node>, <edge>, and <region>.
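A minimal standoff fragment using these three elements might look as follows. The syntax is modeled on GrAF as used in the American National Corpus, so namespace and attribute details may differ slightly in version 0.99.1; the identifiers and anchor values are illustrative:

```xml
<graph xmlns="http://www.xces.org/ns/GrAF/1.0/">
  <!-- regions anchor annotations to spans of the primary (read-only) text -->
  <region xml:id="r1" anchors="19 24"/>
  <region xml:id="r2" anchors="25 29"/>
  <!-- nodes link to one or more regions -->
  <node xml:id="n1"><link targets="r1"/></node>
  <node xml:id="n2"><link targets="r2"/></node>
  <!-- an edge relates two nodes (e.g., two tokens of one phrase) -->
  <edge xml:id="e1" from="n1" to="n2"/>
</graph>
```

Because nodes reference regions rather than containing text, several independent graphs can annotate the same spans without touching the source document.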
4. Separate standoff annotations XML documents (one CDA, one GrAF).

[Fig. 2 depicts the four combinations side by side: (1) GrAF annotations embedded in each CDA section; (2) GrAF annotations grouped at the end of the CDA body; (3) a common CDA+GrAF document with separate CDA and GrAF parts; (4) separate CDA and GrAF documents.]

Fig. 2. HL7 CDA and ISO GrAF combinations. The four HL7 CDA and GrAF combinations we explored are depicted with CDA content in yellow, and GrAF content in orange.
3.5. Automating the generation of annotation documents

Once the appropriate CDA–GrAF combination was selected, the automated conversion to and from CDA+GrAF became the next focus of our work. Tools to automatically convert annotations between the GATE framework [22], the UIMA framework [23], and CDA+GrAF are our final objective, and we initiated this development with the automated annotation conversion between the i2b2/VA NLP Challenge format, Knowtator [28], and GrAF. Knowtator is a Protégé [29] plug-in for text annotation. As depicted in Fig. 3, we first used an existing tool developed for the 2010 i2b2/VA NLP Challenge to convert from the i2b2 format (bar-delimited text) to the Knowtator format (XML). We then developed a new tool – the Knowtator–GrAF Translator – to convert from the Knowtator format to GrAF. To test this composite annotation translation tool, the 50-document corpus was used. The i2b2/VA NLP Challenge annotations of each document were automatically translated to Knowtator and then GrAF, and the resulting GrAF annotation documents were finally validated with the ISO GrAF XML schemata.

4. Results

4.1. HL7 CDA – ISO GrAF combinations exploration

Considering that an important premise for our work was to keep the existing standards we are using unchanged, and that we also favored integrating all annotations into one single file, the more integrated solutions were given priority in our exploration. We started the CDA–GrAF combinations exploration with the random selection of one clinical document for each use case. Then, for each combination, several variants of the annotations representation were manually crafted, and validation with HL7 CDA and ISO GrAF XML schemata was performed. Several iterations of modification and correction followed, based on validation errors and design considerations. When validation was eventually successful without errors, we focused our efforts on design details and finally settled on the data model described below.
Each of the CDA–GrAF combinations we explored had different positive and negative aspects, as listed in Table 1. Our main criteria for this exploration were (1) the possibility to validate the resulting CDA+GrAF documents with the original HL7 CDA and ISO GrAF XML schemata, (2) a one-step validation process, and therefore one XML annotation document, (3) the immunity of CDA+GrAF to future (backwards-compatible) evolutions of the CDA or GrAF standards, and (4) the capability to link GrAF annotations with text in the CDA. Combinations 1 and 2 are similar, and in both cases, validation of the CDA part, or of the GrAF part, was always possible; but combining both was never successful because of namespace conflicts and intrinsic CDA structure limitations. Precise definition is a strength of the CDA information model, but in our case this lack of flexibility was a problem. HL7 CDA XML schemata strictly define their elements and would not allow improper nesting of other elements without generating errors. Since we favored integration into one standoff annotations XML document, and did not want to modify the original CDA and GrAF schemata, we had to keep the CDA and GrAF parts separated. Among all combinations explored and listed above, only the third (common annotations XML document with separate CDA and GrAF annotations) and fourth (separate standoff annotations XML documents) could be validated without modification of the HL7 CDA and ISO GrAF XML schemata. Using these schemata without modification means that the CDA+GrAF model will be immune to all (backwards-compatible) evolutions of either the HL7 CDA or the ISO GrAF standard. Combination 4 was excluded from our study because of our aim for a one-step validation, and therefore for one XML document. The third combination was eventually selected and is described in detail below. A complete example is available as Appendix 1.
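As a rough sketch, the selected third combination wraps the two unchanged parts in a new top-level element. The wrapper element name and the GrAF namespace below are assumptions for illustration only (the authors' schema file is named CDA_GrAF.xsd, but the normative serialization is in Appendix 1), and both wrapped parts are heavily abbreviated:

```xml
<CDA_GrAF xmlns:cda="urn:hl7-org:v3"
          xmlns:graf="http://www.xces.org/ns/GrAF/1.0/">
  <!-- unchanged HL7 CDA part: metadata, sections, coded entries -->
  <cda:ClinicalDocument>
    <cda:component>
      <cda:structuredBody>
        <cda:component>
          <cda:section>
            <cda:title>Patient States Complaint</cda:title>
            <cda:text>The patient denies chest pain.</cda:text>
          </cda:section>
        </cda:component>
      </cda:structuredBody>
    </cda:component>
  </cda:ClinicalDocument>
  <!-- unchanged ISO GrAF part: standoff linguistic annotations -->
  <graf:graph>
    <graf:region xml:id="r1" anchors="19 29"/>
    <graf:node xml:id="n1"><graf:link targets="r1"/></graf:node>
  </graf:graph>
</CDA_GrAF>
```

Keeping the two namespaces side by side under a thin wrapper is what lets each part validate against its own unmodified schema.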
[Fig. 3 workflow: i2b2 Challenge Annotations → i2b2–Knowtator Translator → Knowtator Annotations → Knowtator–GrAF Translator → ISO GrAF Annotations.]

Fig. 3. Automated annotation formats conversion. The workflow of the composite pilot conversion tool is detailed, from i2b2/VA NLP Challenge annotations to GrAF annotations.
Table 1
Characteristics of CDA–GrAF combinations. The four combinations were compared on the following criteria:
– Integration of CDA and GrAF into a single XML file
– Validation without modification of the CDA XML schema
– One-step validation (by importing the GrAF schemata into the CDA XML schema)
– Namespace conflicts
– Immunity to CDA or GrAF XML schemata (backwards-compatible) modifications
– Cohesiveness of GrAF data
– Proximity of GrAF data to CDA clinical data
4.2. Common standoff annotations document with separate CDA and GrAF content

To combine the HL7 CDA document and GrAF annotations into one common standoff annotations XML file, we created a new top-level element that contains the CDA and the GrAF top-level elements. The validation of the resulting standoff annotations XML document requires 9 different XML schemata: 5 for CDA, 3 for GrAF, and 1 that defines the new top-level element. The CDA and GrAF XML schemata were unchanged and were obtained with the CDA and GrAF standard definitions (GrAF version 0.99.1 and HL7 CDA Release Two 04/21/2005). These XML schemata are organized as follows:

CDA_GrAF.xsd, imports and includes:
– POCD_MT000040.xsd (CDA), includes:
  - NarrativeBlock.xsd (CDA)
  - datatypes.xsd (CDA), includes datatypes-base.xsd (CDA)
  - voc.xsd (CDA)
– graf-0.99.1.xsd (GrAF)
– graf.xsd (GrAF)
– xml.xsd (GrAF)

4.2.1. Clinical document sections annotation with CDA+GrAF

Our first use case – clinical document sections – only required annotation with CDA elements. This standard already allows representing sections, as explained earlier, and the creation of standoff XML annotations in CDA+GrAF consisted of our new top-level element wrapping the CDA <ClinicalDocument> element; no GrAF element was necessary. Fig. 4 presents a partial example of such a CDA document, with one section ("Patient States Complaint") annotated. The detailed example in Appendix 1 includes the CDA part with encoded sections (the GrAF part in Appendix 1 can be ignored for this use case).

4.2.2. 2010 i2b2/VA NLP Challenge annotation with CDA+GrAF

The second use case (i.e., annotation of medical problems, tests, and treatments, of problem assertions, and of relations between problems, tests, and treatments) required both CDA and GrAF annotations, and was a good use case for the latter, including not only concepts, but also assertions about these concepts, and relations between them. A partial example of GrAF annotations is shown in Fig.
5 (complete example in Appendix 1). The <region> elements indicate the spans of text in the clinical note, and are defined with start and end anchors (2010 i2b2/VA NLP Challenge format of line:character index). We also annotated tokens, even if they were not part of the Challenge, and these tokens include some linguistic information at the lexical and syntactic levels (parts-of-speech). The token annotations are represented with <node> elements and are linked with the corresponding regions with the <link> sub-element. Concept and assertion annotations are also represented with <node> elements and linked with regions the same way tokens are (note the link with multiple regions for the "benign
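Based on the description above, a problem and a treatment with an assertion and a relation might be serialized roughly as follows; the labels, feature names, anchor values, and relation name are illustrative assumptions, not the exact reference-standard vocabulary of the Challenge:

```xml
<!-- regions use the i2b2 line:character anchors described above -->
<region xml:id="r10" anchors="29:3 29:13"/>  <!-- a problem span -->
<region xml:id="r11" anchors="31:0 31:7"/>   <!-- a treatment span -->
<!-- concept nodes link to their regions -->
<node xml:id="n10"><link targets="r10"/></node>
<node xml:id="n11"><link targets="r11"/></node>
<!-- annotations attach labels and features (e.g., the assertion) to nodes -->
<a label="problem" ref="n10"><fs><f name="assertion" value="present"/></fs></a>
<a label="treatment" ref="n11"/>
<!-- a relation between treatment and problem is carried by an edge -->
<edge xml:id="e10" from="n11" to="n10"/>
<a label="TrAP" ref="e10"/>
```

Assertions ride on the concept nodes as features, while relations become labeled edges, so all three Challenge annotation layers fit the same node/edge/region structure.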