Common data model for natural language processing based on two existing standard information models: CDA+GrAF



Journal of Biomedical Informatics 45 (2012) 703–710

Contents lists available at SciVerse ScienceDirect

Journal of Biomedical Informatics journal homepage: www.elsevier.com/locate/yjbin

Stéphane M. Meystre a,b,⇑, Sanghoon Lee a, Chai Young Jung a, Raphaël D. Chevrier c

a Department of Biomedical Informatics, University of Utah, School of Medicine, Salt Lake City, UT, United States
b VA Salt Lake City Health Care System, Salt Lake City, UT, United States
c University of Geneva, School of Medicine, Geneva, Switzerland

Article info

Article history: Received 31 July 2011; Accepted 30 November 2011; Available online 8 December 2011

Keywords: Natural language processing (MeSH L01.224.065.580); Medical informatics (L01.700); Data model; Information model; HL7 Clinical Document Architecture; ISO Graph Annotation Format

Abstract

An increasing need for collaboration and resource sharing in the Natural Language Processing (NLP) research and development community motivates efforts to create and share a common data model and a common terminology for all information annotated and extracted from clinical text. We have combined two existing standards, the HL7 Clinical Document Architecture (CDA) and the ISO Graph Annotation Format (GrAF; in development), to develop such a data model, entitled “CDA+GrAF”. We experimented with several methods to combine these existing standards, and eventually selected a method wrapping separate CDA and GrAF parts in a common standoff annotation (i.e., separate from the annotated text) XML document. Two use cases, clinical document sections and the 2010 i2b2/VA NLP Challenge (i.e., problems, tests, and treatments, with their assertions and relations), were used to create examples of such standoff annotation documents, which were successfully validated with the XML schemata provided with both standards. We developed a tool to automatically translate annotation documents from the 2010 i2b2/VA NLP Challenge format to GrAF, and automatically generated 50 annotation documents using this tool, all successfully validated. Finally, we adapted the XSL stylesheet provided with HL7 CDA to allow viewing annotation XML documents in a web browser, and plan to adapt existing tools for translating annotation documents between CDA+GrAF and the UIMA and GATE frameworks. This common data model may ease directly comparing NLP tools and applications, combining their output, transforming and “translating” annotations between different NLP applications, and eventually “plug-and-play” of different modules in NLP applications.

© 2011 Elsevier Inc. All rights reserved.

⇑ Corresponding author at: Department of Biomedical Informatics, University of Utah, 26 S 2000 E, HSEB suite 5700, Salt Lake City, UT 84112, United States. Fax: +1 801 581 4297. E-mail address: [email protected] (S.M. Meystre).
1532-0464/$ - see front matter © 2011 Elsevier Inc. All rights reserved. doi:10.1016/j.jbi.2011.11.018

1. Introduction

The rapid adoption of Electronic Health Records (EHRs) and the corresponding growth of narrative data in electronic form, along with the needs for enhanced quality of care and reduced medical errors, are strong incentives for the development of Natural Language Processing (NLP) in the clinical domain. Most EHR content is recorded in narrative format. These documents represent the majority of the information used for medical care [1], but technologies developed to enhance quality of care and reduce medical errors require structured and coded data instead. As a possible answer to this issue, NLP can be used to convert free text into structured and coded data [2]. The NLP research community is aware of this potential, and more NLP tools and applications have recently been developed and made available to the community than ever before. Even a few corpora of de-identified clinical documents have become available for research in this domain.

Sharing resources, both applications and annotated corpora, has become an important need, but a common framework to represent and name the elements used by these resources is lacking. Certain fundamental representational principles have been widely adopted, such as the use of standoff annotations (i.e., separate from the annotated text) and XML (Extensible Markup Language), but annotation formats often vary considerably from application to application, and require significant transformation and terminology mapping when shared. Such a framework, featuring a common data model and a common terminology for all information annotated and extracted from clinical text, would enable significantly more efficient research and development collaborations. It would allow directly comparing NLP tools and applications, combining their output, transforming and “translating” annotations between different NLP applications, and eventually “plug-and-play” of different modules in NLP applications. This would be especially useful in systems combining


components from different sources, and in competitions such as the i2b2 NLP Challenges. The AMIA NLP working group and teams at the Veterans Health Administration have recognized this lack of a common framework to represent clinical text and the linguistic information extracted from it, and have initiated projects to answer this need. They are now collaborating in the exploration, evaluation, and development of a framework with a data model for combined clinical text and linguistic information, and terminologies for the linguistic information that can be extracted from clinical documents. The effort described here focuses on the former: the data model.

2. Background

The development of collections of narrative text corpora, as well as their enrichment with various annotations, is crucial to NLP research. Several such corpora have been developed in the last two decades, especially in the general English domain (e.g., Brown corpus [3], Penn Treebank [4], American National Corpus [5]), and later also in the biomedical domain (i.e., scientific biomedical publications; e.g., GENIA [6], PennBioIE [7]). In the clinical domain, patient confidentiality issues have significantly limited progress, and only very few de-identified clinical text corpora have recently become available (e.g., i2b2 NLP Challenges [8]). The representation of such text corpora annotations has also varied significantly, ranging from tabular formats (one annotation per line), to inline XML annotations, and standoff XML formats. Inline annotations are stored in the annotated text. Standoff annotations are stored separately from the annotated text, offering advantages such as multiple overlapping annotations [9].

Several information representation standards for clinical text or linguistic information have already been published or are under development. A model for structured clinical text representation was proposed by Friedman and colleagues in 1999. It used inline XML annotations and was based on a set of predefined elements to annotate sections, sentences, phrases, problems, and associated attributes [10]. Prominent standards for clinical document structure and content include the HL7 Clinical Document Architecture (CDA [11]), the ASTM Continuity of Care Record (CCR [12]), and the HL7 Continuity of Care Document (CCD [13]). In 2001, Dolin already realized that “... given the variability in clinical notes, including structure, underlying information models, degree of semantic encoding, use of standard healthcare terminologies, and platform- and vendor-specific features, it is currently difficult to store and exchange documents with retention of standardized semantics over both time and distance.” The CDA was developed to answer these issues and represent all types of clinical notes, while the CCR and the CCD are only intended for summary clinical information representation and exchange. Since its initial release in 2001, the HL7 CDA has evolved and has been implemented in several healthcare organizations including the Mayo Clinic, the Veterans Health Administration [14], the University of Erlangen-Nuremberg in Germany [15,16], and many other international projects. Johnson et al. have proposed a structured narrative model for the EHR based on the HL7 CDA, using the and elements to annotate and specify detailed structure in the text [17].

In the linguistic domain, several data models have been developed, such as Annotation Graphs [18] or MATE [19], but the most significant work is currently being carried out within the International Organization for Standardization (ISO) Technical Committee 37 (TC37) sub-committee 4 (SC4). The TC37/SC4 is focused on Language Resources Management, and is developing several standards for morphological, lexical, syntactic, and semantic information representation. Among these standards, the Linguistic Annotation

Framework (LAF [20], ISO 24612, 2009) provides a framework for linguistic annotation of language resources that can serve as a common reference for different annotation schemes. It includes a data model and also provides an XML serialization called Graph Annotation Format (GrAF [21]). Both LAF and GrAF have been developed as pivot models, to serve as a lingua franca between different text corpora or NLP applications. This format has already been used to translate annotated text from the GATE framework [22] to the UIMA framework [23], and back [24]. Both are open source NLP frameworks offering a development environment, a collection of methods and resources for basic text processing tasks, many reusable components for advanced text analysis, and a specific data model. GATE (General Architecture for Text Engineering) is developed by a team at the University of Sheffield (UK); UIMA (Unstructured Information Management Architecture) was initially developed by IBM, and is now an Apache project. Finally, GrAF provides the flexibility to represent any type of text annotation, and is also used to represent annotations in the American National Corpus [5].

3. Methods

The existence of standards for different types of information representation that have already been evaluated motivated us to try combining them and developing a hybrid data model capable of representing clinical documents and their semantic content, along with linguistic information extracted from these documents. We chose the HL7 CDA to represent the clinical document metadata (e.g., patient name, author of the report, date of dictation), structure (i.e., sections and their headers), and semantic content (i.e., structured and coded concepts linked with reference terminologies), and ISO GrAF to represent linguistic information at the lexical level (tokens, sentence boundaries) and syntactic level (parts-of-speech, phrases). This hybrid data model is called “CDA+GrAF”, and is explained in detail below.

3.1. Practical clinical text annotation use cases

We used two different clinical text annotation use cases for the work described here: clinical document sections and the 2010 i2b2/VA NLP Challenge [25]. This NLP Challenge included annotations of medical problems, tests, and treatments; of medical problem assertions; and of relations between medical problems, tests, and treatments. A collection of 349 de-identified discharge summaries and progress notes from this NLP Challenge training corpus, along with their reference standard annotations, was used for our experiment. A sub-corpus of 50 documents was randomly selected and used for the exploration of our two use case annotations with GrAF and CDA separately. Two documents were also randomly selected for the manual development and testing of the various models described below (these documents were manually transformed and completely de-identified again for the examples included in this paper).

3.2. The HL7 Clinical Document Architecture

The HL7 CDA [11] was developed to represent the structure and semantics of all types of clinical documents, for storage and exchange, as mentioned above. It is part of the HL7 version 3 family of standards, and is therefore based on a common information model, the HL7 RIM (Reference Information Model) [26]. CDA documents are encoded in XML and are wrapped by the <ClinicalDocument> element. They include a header with all document metadata (i.e., data about the data), and a body with the document content (sections, paragraphs, lists, etc.). The header lies between the <ClinicalDocument>


and the <structuredBody> elements, and has four components: (1) document information to identify the document, and define confidentiality status and relationships to other documents; (2) encounter data to describe the setting in which a documented encounter occurred; (3) service actors who authored and authenticated the document, are intended to receive a copy of the document, or transcribed it; and (4) service targets that include the patient and other significant participants. The body comprises sections, paragraphs, lists, and tables. These structures have captions, can nest, and can contain coded “entries” with concepts from standard terminologies. A document section is wrapped by the <section> element and can contain a single narrative block (containing the human-readable content, wrapped by the <text> element), and any number of entries and external references.

We used the HL7 CDA to represent the clinical document metadata, structure, semantic content, and also the original clinical document text. For our use cases, the metadata were very limited (de-identified documents). The structure and semantic content consisted of document sections, and of concepts coded with an existing standard terminology: SNOMED-CT [27].

3.3. The graph annotation format

As mentioned above, ISO GrAF is the XML serialization of the Linguistic Annotation Framework (LAF [20], ISO 24612 (2009)). GrAF allows representation of diverse annotation types whilst maintaining syntactic consistency between them at all times. GrAF does not address the underlying semantic issues of annotation sharing, but offers instead a single structured and flexible format to represent the annotations themselves. It is a high-level data model that enables the mapping of virtually any kind of annotation to any kind of media (e.g., text, sound, video). Each GrAF document has an XML header element containing meta-information about the annotations. Then, GrAF expresses the referential structure of a linguistic annotation with three main XML elements: <region>, <node>, and <edge>, as depicted in Fig. 1. The <region> element indicates the area of a document


defined by start and end anchors, which provides primary positional data for other annotations. Both <node> and <edge> elements contain the annotation information for a given object. Relationships between nodes are represented with edges.

In our work, we use GrAF to represent lexical and syntactic information contained in narrative clinical documents. This information is stored as standoff annotations in a CDA+GrAF XML document. With the use of different GrAF elements, annotations point at regions of the clinical text. Through this process, annotations are mapped to their targets and informative content is added to the clinical document as proposed with GrAF [21]. In conformance with GrAF and LAF guidelines for good practice, no part of the original (“read-only”) text document is modified during the creation of the CDA+GrAF XML document. We limited the use of GrAF’s annotations to syntactic and lexical information only. HL7 CDA indeed provides an efficient and broadly accepted model to represent metadata, structure and semantic content of clinical documents. Nonetheless, it is important to bear in mind that GrAF could be used for virtually any annotation purpose. For our use cases, GrAF was used to annotate all information extracted for the 2010 i2b2/VA NLP Challenge (i.e., medical problems, tests, and treatments, problem assertions, and relations between problems, tests, and treatments).

3.4. Combining the HL7 CDA and GrAF

HL7 CDA and ISO GrAF can be combined in multiple different ways, and we experimented with four of them (Fig. 2). From the most to the least integrated, these combination possibilities include:

1. Common standoff annotations (CDA) XML document with embedded GrAF annotations in each section.
2. Common standoff annotations (CDA) XML document with GrAF annotations grouped at the end of the CDA body.
3. Common standoff annotations XML document with separate CDA and GrAF annotations.

Fig. 1. GrAF XML schema representation. The referential structure of a linguistic annotation is represented by the three main XML elements, <region>, <node>, and <edge>.
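The referential structure described above (regions anchored in the text, nodes linked to regions, edges between nodes) can be sketched with Python's standard library. This is only an illustration under stated assumptions: the attribute names (`anchors`, `targets`, `from`, `to`), the identifiers, and the absence of a namespace are assumptions made here, not details taken from the GrAF 0.99.1 schema.

```python
import xml.etree.ElementTree as ET


def build_graf_sketch():
    """Build a tiny GrAF-like graph: two regions, two nodes, one edge.

    Attribute names and ids are illustrative assumptions.
    """
    graph = ET.Element("graph")
    # Regions: areas of the primary text, defined by start and end anchors.
    ET.SubElement(graph, "region", {"id": "r1", "anchors": "0 8"})
    ET.SubElement(graph, "region", {"id": "r2", "anchors": "19 26"})
    # Nodes: annotation objects, each linked to the region it covers.
    for node_id, region_id in (("n1", "r1"), ("n2", "r2")):
        node = ET.SubElement(graph, "node", {"id": node_id})
        ET.SubElement(node, "link", {"targets": region_id})
    # Edge: a relationship between two nodes.
    ET.SubElement(graph, "edge", {"id": "e1", "from": "n1", "to": "n2"})
    return graph


if __name__ == "__main__":
    print(ET.tostring(build_graf_sketch(), encoding="unicode"))
```

Because the regions only carry positions, the primary text itself is never touched, which is the standoff property the model relies on.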


[Figure: four panels, Combination 1 to Combination 4, showing where the GrAF content sits relative to the CDA sections.]

Fig. 2. HL7 CDA and ISO GrAF combinations. The four HL7 CDA and GrAF combinations we explored are depicted with CDA content in yellow, and GrAF content in orange. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

4. Separate standoff annotations XML documents (one CDA, one GrAF).

Considering that an important premise for our work was to keep the existing standards we are using unchanged, and that we also favored integrating all annotations into one single file, the more integrated solutions were given priority in our exploration. We started the CDA–GrAF combinations exploration with the random selection of one clinical document for each use case. Then, for each combination, several variants of the annotations representation were manually crafted, and validation with HL7 CDA and ISO GrAF XML schemata was performed. Several iterations of modification and correction followed, based on validation errors and design considerations. When validation was eventually successful without errors, we focused our efforts on design details and finally settled on the data model described below.

3.5. Automating the generation of annotation documents

Once the appropriate CDA–GrAF combination was selected, the automated conversion to and from CDA+GrAF became the next focus of our work. Tools to automatically convert annotations between the GATE framework [22], the UIMA framework [23], and CDA+GrAF are our final objective, and we initiated this development with the automated annotation conversion between the i2b2/VA NLP Challenge format, Knowtator [28], and GrAF. Knowtator is a Protégé [29] plug-in for text annotation. As depicted in Fig. 3, we first used an existing tool developed for the 2010 i2b2/VA NLP Challenge to convert from the i2b2 format (bar-delimited text) to the Knowtator format (XML). We then developed a new tool, the Knowtator–GrAF Translator, to convert from the Knowtator format to GrAF. To test this composite annotations translation tool, the 50-document corpus was used. The i2b2/VA NLP Challenge annotations of each document were automatically translated to Knowtator and then GrAF, and the resulting GrAF annotation documents were finally validated with the ISO GrAF XML schemata.

4. Results

4.1. HL7 CDA – ISO GrAF combinations exploration

Each of the CDA–GrAF combinations we explored had different positive and negative aspects, as listed in Table 1. Our main criteria for this exploration were (1) the possibility to validate the resulting CDA+GrAF documents with the original HL7 CDA and ISO GrAF XML schemata, (2) a one-step validation process, and therefore one XML annotation document, (3) the immunity of CDA+GrAF to future (backwards-compatible) evolutions of the CDA or GrAF standards, and (4) the capability to link GrAF annotations with text in the CDA.

Combinations 1 and 2 are similar, and in both cases, validation of the CDA part, or of the GrAF part, was always possible; but combining both was never successful because of namespace conflicts and intrinsic CDA structure limitations. Precise definition is the strength of the CDA information model but in our case, this lack of flexibility was a problem. HL7 CDA XML schemata strictly define their elements and would not allow improper nesting of other elements without generating errors. Since we favored integration into one standoff annotations XML document, and did not want to modify the original CDA and GrAF schemata, we had to keep CDA and GrAF parts separated. Among all combinations explored and listed above, only the third (common annotations XML document with separate CDA and GrAF annotations) and fourth (separate standoff annotations XML documents) could be validated without modification of HL7 CDA and ISO GrAF XML schemata. Using these schemata without modification means that the CDA+GrAF model will be immune to all (backwards-compatible) evolutions of either the HL7 CDA or the ISO GrAF standard. Combination 4 was excluded from our study because of our aim for a one-step validation, and therefore for one XML document. The third combination was eventually selected and is described in detail below. A complete example is available as Appendix 1.
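To make the selected third combination more concrete, the following minimal sketch builds a wrapper document in which the CDA part and the GrAF part are sibling subtrees under a common root. The root name `CDA_GrAF` follows the schema file name and the query path given elsewhere in the paper; `urn:hl7-org:v3` is the HL7 v3 namespace used by CDA documents. The un-namespaced GrAF part, the `anchors` attribute, and the sample section text are assumptions made for illustration, and the result is not meant to validate against the real schemata (a real CDA header has many required elements).

```python
import xml.etree.ElementTree as ET

CDA_NS = "urn:hl7-org:v3"  # HL7 v3 namespace used by CDA documents


def cda(tag):
    """Return a namespace-qualified CDA tag name."""
    return f"{{{CDA_NS}}}{tag}"


def wrap_cda_and_graf(section_title, section_text):
    """Wrap a one-section CDA part and a GrAF part under one common root."""
    root = ET.Element("CDA_GrAF")

    # CDA part: document > component > structuredBody > component > section.
    doc = ET.SubElement(root, cda("ClinicalDocument"))
    body = ET.SubElement(ET.SubElement(doc, cda("component")), cda("structuredBody"))
    section = ET.SubElement(ET.SubElement(body, cda("component")), cda("section"))
    ET.SubElement(section, cda("title")).text = section_title
    ET.SubElement(section, cda("text")).text = section_text

    # GrAF part: kept as a separate sibling subtree, never nested inside CDA.
    graph = ET.SubElement(root, "graph")
    ET.SubElement(graph, "region", {"id": "r1", "anchors": f"0 {len(section_text)}"})
    return root


if __name__ == "__main__":
    # The section text is a made-up example, not taken from the corpus.
    wrapper = wrap_cda_and_graf("Patient States Complaint", "Dizziness for two days.")
    print(ET.tostring(wrapper, encoding="unicode"))
```

Keeping the two parts as siblings is what lets each be checked against its own unmodified schema, which nesting GrAF elements inside CDA sections (combinations 1 and 2) did not allow.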

[Workflow: i2b2 Challenge Annotations → i2b2-Knowtator Translator → Knowtator Annotations → Knowtator-GrAF Translator → ISO GrAF Annotations]

Fig. 3. Automated annotation formats conversion. The workflow of the composite pilot conversion tool is detailed, from i2b2/VA NLP Challenge annotations to GrAF annotations.
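As a rough illustration of the kind of translation this workflow performs, the sketch below parses one bar-delimited concept line and emits a GrAF-like region and node. It is not the authors' translator and it skips the intermediate Knowtator step. The line layout (`c="..." start end||t="..."`) reflects our understanding of the 2010 i2b2/VA concept files and should be treated as an assumption, as should every attribute name and label in the output.

```python
import re
import xml.etree.ElementTree as ET

# Assumed layout of a concept line: c="text" line:index line:index||t="type"
CONCEPT_LINE = re.compile(
    r'^c="(?P<text>.*)" (?P<start>\d+:\d+) (?P<end>\d+:\d+)\|\|t="(?P<type>[^"]*)"$'
)


def parse_concept(line):
    """Split one bar-delimited concept line into its fields."""
    match = CONCEPT_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a concept line: {line!r}")
    return match.groupdict()


def concept_to_graf(graph, concept, index):
    """Append a region and a linked, annotated node for one concept."""
    region = ET.SubElement(
        graph, "region",
        {"id": f"r{index}", "anchors": f"{concept['start']} {concept['end']}"},
    )
    node = ET.SubElement(graph, "node", {"id": f"n{index}"})
    ET.SubElement(node, "link", {"targets": region.get("id")})
    # Annotation set > annotation > feature set > features (name-value pairs).
    annotation_set = ET.SubElement(node, "as", {"type": "i2b2"})
    annotation = ET.SubElement(annotation_set, "a", {"label": "concept"})
    features = ET.SubElement(annotation, "fs")
    ET.SubElement(features, "f", {"name": "type", "value": concept["type"]})
    ET.SubElement(features, "f", {"name": "text", "value": concept["text"]})
    return node


if __name__ == "__main__":
    graph = ET.Element("graph")
    line = 'c="benign positional vertigo" 12:3 12:5||t="problem"'
    concept_to_graf(graph, parse_concept(line), 1)
    print(ET.tostring(graph, encoding="unicode"))
```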

Table 1. Characteristics of CDA–GrAF combinations.

| Characteristic | Combination 1 | Combination 2 | Combination 3 | Combination 4 |
| --- | --- | --- | --- | --- |
| Integration of CDA and GrAF into a single XML file | Yes | Yes | Yes | No |
| Validation without modification of the CDA XML schema | No | No | Yes | Yes |
| One-step validation (by importing the GrAF schemata into the CDA XML schema) | No | No | Yes | No |
| Namespace conflicts | Yes | Yes | No | No |
| Immunity to CDA or GrAF XML schemata (backwards-compatible) modifications | No | No | Yes | Yes |
| Cohesiveness of GrAF data | No | Yes | Yes | Yes |
| Proximity of GrAF data to CDA clinical data | Strong | Moderate | Moderate | Weak |

4.2. Common standoff annotations document with separate CDA and GrAF content

To combine the HL7 CDA document and GrAF annotations into one common standoff annotations XML file, we created a new top-level <CDA_GrAF> element that contains the CDA and the GrAF top-level elements. The validation of the resulting standoff annotations XML document requires 9 different XML schemata: 5 for CDA, 3 for GrAF, and 1 that defines the new top-level element. The CDA and GrAF XML schemata were unchanged and were obtained with the CDA and GrAF standard definitions (GrAF version 0.99.1 and HL7 CDA Release Two 04/21/2005). These XML schemata are organized as follows:

• CDA_GrAF.xsd, imports and includes:
  – POCD_MT000040.xsd (CDA), includes:
      NarrativeBlock.xsd (CDA)
      datatypes.xsd (CDA), includes datatypes-base.xsd (CDA)
      voc.xsd (CDA)
  – graf-0.99.1.xsd (GrAF)
  – graf.xsd (GrAF)
  – xml.xsd (GrAF)

4.2.1. Clinical document sections annotation with CDA+GrAF

Our first use case, clinical document sections, only required annotation with CDA elements. This standard already allows for sections representation, as explained earlier, and the creation of standoff XML annotations in CDA+GrAF consisted of our new top-level <CDA_GrAF> element wrapping the CDA <ClinicalDocument> element; no GrAF <graph> element was necessary. Fig. 4 presents a partial example of such a CDA document, with one section (“Patient States Complaint”) annotated. The detailed example in Appendix 1 includes the CDA part with encoded sections (the GrAF part in Appendix 1 can be ignored for this use case).

4.2.2. 2010 i2b2/VA NLP Challenge annotation with CDA+GrAF

The second use case (i.e., annotation of medical problems, tests, and treatments, of problem assertions, and of relations between problems, tests, and treatments) required both CDA and GrAF annotations, and was a good use case for the latter, including not only concepts, but also assertions about these concepts, and relations between them. A partial example of GrAF annotations is shown in Fig. 5 (complete example in Appendix 1). The <region> elements indicate the spans of text in the clinical note, and are defined with start and end anchors (2010 i2b2/VA NLP Challenge format of line:character index). We also annotated tokens, even if they were not part of the Challenge, and these tokens include some linguistic information at the lexical and syntactic levels (parts-of-speech). The token annotations are represented with <node> elements and are linked with the corresponding regions with the <link> sub-element. Concept and assertion annotations are also represented with <node> elements and linked with regions the same way tokens are (note the link with multiple regions for the “benign

positional vertigo” concept). Finally, relations between problems, tests, and treatments are annotated with <edge> elements, linking two <node> elements (i.e., concept assertions in our case). Nodes and edges can have annotation sets (<as> element) that include annotations (<a> elements), which in turn can include feature sets (<fs> element) with features (name-value pairs, in <f> elements).

4.3. Automatic GrAF annotations generation

As explained in Section 3.5, we randomly selected 50 documents from our i2b2/VA NLP Challenge corpus, and processed their annotations with the i2b2-Knowtator Translator without any error or other problem. We then used the Knowtator–GrAF Translator and automatically generated GrAF annotation files. All were successfully validated with the ISO GrAF XML schemata.

4.4. Standoff annotations document visualization

The XML format is meant for computers to read, not humans, but can easily be displayed as more human-readable text in a web browser according to the rules specified in an XSL stylesheet. Such a stylesheet is provided with the HL7 CDA standard, and we modified it to allow displaying not only the CDA part of the CDA+GrAF standoff annotations document, but also the original clinical text (i.e., unstructured), and a table with all annotations included in the GrAF part. In our modified web browser display (Appendix 2), the structured CDA content is displayed in bulleted lists along with some CDA header content (e.g., patient, healthcare provider). This part is displayed according to the unmodified templates in the HL7 CDA XSL stylesheet. The original text follows as a CDA section, for convenience. The last part with linguistic annotation data in GrAF is represented as a table with one annotation per row, and columns corresponding to details about the annotation such as the term, linked region (with start and end indexes), part-of-speech tags, concept types, assertions, or relations. Some of these columns are specific to the use cases.

5. Discussion

The hybrid CDA+GrAF data model offers several advantages. The XML format gives access to numerous existing tools for document creation, validation, transformation, and rendering. The use of HL7 CDA improves compatibility with some existing EHR systems. As a future (imminent) ISO standard, GrAF also improves the compatibility of our hybrid data model, but in the linguistic domain. The content and level of detail of clinical document text annotations will inevitably vary. A flexible but standardized model is therefore crucial. GrAF offers this required level of flexibility.

Disadvantages of CDA+GrAF are related to other aspects of the XML format and to the combination of two existing standards. Using the XML format renders standoff annotations documents very verbose, and difficult for humans to read, but the availability


Fig. 4. HL7 CDA document partial example. An example of the header and of one section is presented in this partial CDA example.

of an XSL stylesheet for web browser viewing alleviates this drawback. Combining CDA and GrAF adds complexity and some redundancy compared with another data model that could be created de novo, or based on GrAF only. The latter (using only GrAF) is an alternative we also explored that could be applied to both use cases. We found it attractive in terms of simplicity, but the established position of HL7 CDA in the clinical domain, and the lack of familiarity with GrAF in this same domain, motivated us to combine both standards.

Other models for clinical text annotations representation such as the ones proposed by Friedman et al. [10] or by Johnson et al. [17] offer some of the advantages of CDA+GrAF, but lack the flexibility required for common text annotation tasks. Johnson’s model has the advantage of being only based on the HL7 CDA, but both models use inline XML annotations, and therefore cannot represent overlapping annotations. Friedman’s model uses predefined XML elements, and would have to be modified to accommodate other types of annotations.

The research work presented here is only a proof-of-concept, and this is a limitation we are aware of. In the last modeling step, only two randomly selected clinical documents were used, and these documents might not be representative of the variations in content and format that could be found in our collection of documents. As future efforts, we plan to expand this experimentation to a larger and more varied collection of clinical documents. As mentioned earlier, this common data model is only one part of the common framework. The common terminology to

name all annotations would be the other part. In addition, a shared annotation guideline would be needed to ensure that the text annotations are understood and applied consistently. These explanations could be integrated in the GrAF header of CDA+GrAF.

CDA+GrAF annotation documents are meant to be automatically exported from and imported to NLP applications. We successfully tested a pilot composite tool to automatically convert annotations between the i2b2/VA NLP Challenge format and GrAF, and we plan to develop tools for the automated translation between CDA+GrAF and the GATE and UIMA frameworks. Ide and colleagues already experimented with the automated export and import of GrAF documents, to convert standoff annotation files from the GATE to the UIMA framework, and vice versa [24]. These tools are available with the American National Corpus [5], and we plan to adapt them to the CDA+GrAF hybrid data model, with the same compatibility with the GATE and UIMA frameworks.

As explained above, CDA+GrAF annotations are meant to be imported to an NLP application, or exported from such an application. However, their content can also be accessed directly with the robust XML querying functionalities offered by XQuery [30], the standard XML query language developed by the W3C. For example, a query like doc("Appendix1.xml")/CDA_GrAF/ClinicalDocument/component/structuredBody/component/section/entry/observation/code[@codeSystemName='SNOMED CT']/@code would extract all SNOMED-CT codes from the CDA+GrAF example available in Appendix 1.
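The same kind of direct access is possible without an XQuery engine. The sketch below follows the path of the query above using the limited XPath support in Python's standard library. The sample document is a made-up, heavily trimmed stand-in for Appendix 1, and its code values are placeholders rather than real terminology codes. It also omits namespaces to mirror the query as written; a real CDA part would normally be in the `urn:hl7-org:v3` namespace, in which case each step of the path would need a namespace prefix.

```python
import xml.etree.ElementTree as ET

# Made-up, trimmed stand-in for a CDA+GrAF document; codes are placeholders.
SAMPLE = """<CDA_GrAF>
  <ClinicalDocument>
    <component><structuredBody><component><section>
      <entry><observation>
        <code code="1111111" codeSystemName="SNOMED CT"/>
      </observation></entry>
      <entry><observation>
        <code code="2222-2" codeSystemName="LOINC"/>
      </observation></entry>
    </section></component></structuredBody></component>
  </ClinicalDocument>
  <graph/>
</CDA_GrAF>"""

# Path relative to the <CDA_GrAF> root, with the same attribute predicate.
PATH = (
    "ClinicalDocument/component/structuredBody/component/section/"
    "entry/observation/code[@codeSystemName='SNOMED CT']"
)


def snomed_codes(xml_text):
    """Return the code attribute of every SNOMED CT coded observation."""
    root = ET.fromstring(xml_text)  # root element is <CDA_GrAF>
    return [code.get("code") for code in root.findall(PATH)]


if __name__ == "__main__":
    print(snomed_codes(SAMPLE))  # prints ['1111111']
```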


Fig. 5. ISO GrAF annotations partial example. Examples of regions (i.e. spans of text), tokens, concepts and assertions, and relations between concepts are represented in this partial GrAF example.
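A short sketch may help show how such annotations are read back, including a concept that links to more than one region (as with the “benign positional vertigo” concept). It resolves each node to the text it covers and flattens its features into one row per annotation, similar in spirit to the table view of Section 4.4. For simplicity the anchors here are plain character offsets into a single string, not the line:character indexes used in the paper; the sample sentence, ids, and attribute names are all assumptions.

```python
import xml.etree.ElementTree as ET

# Made-up primary text; the concept is split across two separate spans.
TEXT = "benign and clearly positional vertigo"

GRAF = """<graph>
  <region id="r1" anchors="0 6"/>
  <region id="r2" anchors="19 37"/>
  <node id="n1">
    <link targets="r1 r2"/>
    <as type="i2b2"><a label="concept"><fs>
      <f name="type" value="problem"/>
    </fs></a></as>
  </node>
</graph>"""


def node_rows(graph_xml, text):
    """Return one dict per node: covered text plus its name-value features."""
    graph = ET.fromstring(graph_xml)
    # Map each region id to its (start, end) character offsets.
    spans = {}
    for region in graph.findall("region"):
        start, end = (int(anchor) for anchor in region.get("anchors").split())
        spans[region.get("id")] = (start, end)
    rows = []
    for node in graph.findall("node"):
        targets = node.find("link").get("targets").split()
        # Join the text of every linked region, in link order.
        covered = " ".join(text[spans[t][0]:spans[t][1]] for t in targets)
        features = {f.get("name"): f.get("value") for f in node.iter("f")}
        rows.append({"node": node.get("id"), "text": covered, **features})
    return rows


if __name__ == "__main__":
    for row in node_rows(GRAF, TEXT):
        print(row)
```

An inline XML format would have to break or duplicate markup to express this discontinuous span, whereas the standoff node simply links to two regions.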

6. Conclusions

We have developed a data model combining two existing standards, HL7 CDA and ISO GrAF, and successfully (manually and automatically) created valid examples of clinical text standoff annotations XML documents based on this data model. As mentioned earlier, a framework including a common data model with common terminologies for all annotations could enable: sharing of annotations without required translation/transformation, combining annotations from different sources, combining NLP modules and applications in a plug-and-play manner, and directly comparing annotations and NLP applications performance. Such a framework is currently in development at the Veterans Health Administration, with CDA+GrAF as data model.

Acknowledgments

We thank Veterans Health Administration CHIR and VINCI project members, and more specifically Guy Divita and Dr. Qing Zeng, for their useful suggestions. The AMIA NLP Workgroup, and more specifically Dr. Wendy Chapman, also contributed useful feedback. We also thank Dr. Nancy Ide and Keith Suderman for helpful information and the latest GrAF XML schemata, and Youngjun Kim for the i2b2-Knowtator conversion tool. Some funding was provided by the VA Health Services Research and Development VINCI project (PI: Dr. Jonathan Nebeker).

Appendix A. Supplementary material

A complete example of a CDA+GrAF XML document and of its visualization in a web browser can be found, in the online version, at doi:10.1016/j.jbi.2011.11.018.

References

[1] Pratt AW. Medicine, computers, and linguistics. Adv Biomed Eng 1973;3:97–140.
[2] Spyns P. Natural language processing in medicine: an overview. Methods Inf Med 1996;35(4–5):285–301.

[3] Francis WN. A tagged corpus – problems and prospects. In: Greenbaum S, Leech G, Svartvik J, editors. Studies in English linguistics for Randolph Quirk. London and New York: Longman; 1979. p. 192–209.
[4] Marcus M, Santorini B, Marcinkiewicz MA. Building a large annotated corpus of English. Comput Linguist 1993;19(2):313–30.
[5] Ide NM, Macleod C. The American national corpus: a standardized resource of American English. In: Proceedings of corpus linguistics. Lancaster, UK; 2001.
[6] Kim JD, Ohta T, Tateisi Y, Tsujii J. GENIA corpus – a semantically annotated corpus for bio-textmining. Bioinformatics 2003;19(Suppl. 1):I180–2.
[7] Kulick S, Bies A, Liberman M, Mandel M, McDonald R, Palmer M, et al. Integrated annotation for biomedical information extraction. HLT/NAACL workshop: Biolink; 2004. p. 61–8.
[8] i2b2.org.
[9] Thompson HS, McKelvie D. Hyperlink semantics for standoff markup of read-only documents. HCRC, University of Edinburgh; 1997.
[10] Friedman C, Hripcsak G, Shagina L, Liu H. Representing information in patient reports using natural language processing and the extensible markup language. J Am Med Inform Assoc 1999;6(1):76–87.
[11] Dolin RH, Alschuler L, Beebe C, Biron PV, Boyer SL, Essin D, et al. The HL7 clinical document architecture. J Am Med Inform Assoc 2001;8(6):552–69.
[12] Kibbe DC, Phillips RLJ, Green LA. The continuity of care record. Am Fam Phys 2004;70(7):1220, 1222–3.
[13] Dolin RH, Giannone G, Schadow G. Enabling joint commission medication reconciliation objectives with the HL7/ASTM Continuity of Care Document standard. AMIA annu symp proc; 2007. p. 186–90.
[14] Li J, Lincoln MJ. Model-driven CDA Clinical Document Development Framework. AMIA annu symp proc; 2007. p. 1031.
[15] Klein A, Prokosch HU, Muller M, Ganslandt T. Experiences with an interoperable data acquisition platform for multi-centric research networks based on HL7 CDA. Methods Inf Med 2007;46(5):580–5.
[16] Gerdsen F, Mueller S, Jablonski S, Prokosch HU. Standardized exchange of medical data between a research database, an electronic patient record and an electronic health record using CDA/SCIPHOX. AMIA annu symp proc; 2005. p. 963.
[17] Johnson SB, Bakken S, Dine D, Hyun S, Mendonça E, Morrison F, et al. An electronic health record based on structured narrative. J Am Med Inform Assoc 2008;15(1):54–64.
[18] Bird S, Liberman M. A formal framework for linguistic annotation. Speech Commun 2000;33:23–60.
[19] McKelvie D, Isard A, Mengel A, Møller M, Gross M, Klein M. The MATE Workbench – an annotation tool for XML coded speech corpora. Speech Commun 2001:97–112.
[20] Ide NM, Romary L. International standard for a linguistic annotation framework. J Nat Lang Eng 2004;10:211–25.
[21] Ide NM, Suderman K. GrAF: a graph-based format for linguistic annotations. In: Proceedings of the first linguistic annotation workshop. Prague, Czech Republic; 2007. p. 1–8.
[22] Cunningham H, Maynard D, Bontcheva K, Tablan V. GATE: a framework and graphical development environment for robust NLP tools and applications. In: Proceedings of the 40th anniversary meeting of the association for computational linguistics (ACL'02); 2002.
[23] Apache. UIMA (Unstructured Information Management Architecture).
[24] Ide NM, Suderman K. Bridging the gaps: interoperability for GrAF, GATE, and UIMA. In: Proceedings of the 3rd linguistic annotation workshop, ACL-IJCNLP; 2009. p. 27–34.
[25] Uzuner O, South B, Shen S, et al. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J Am Med Inform Assoc 2011;18:552–6.
[26] Health Level Seven, Inc. HL7 reference information model; 1994.
[27] NLM US. SNOMED Clinical Terms® (SNOMED-CT®).
[28] Ogren PV. Knowtator.
[29] The Protégé ontology editor and knowledge acquisition system.
[30] W3C XML Query Working Group. XQuery 1.0: an XML query language, 2nd ed.