A System for Automated Lexical Mapping
JENNIFER Y. SUN, MD, MS, YAO SUN, MD, PHD

Abstract
Objective: To automate the mapping of disparate databases to standardized medical vocabularies.
Background: Merging of clinical systems and medical databases, or aggregation of information from disparate databases, frequently requires a process whereby vocabularies are compared and similar concepts are mapped.
Design: Using a normalization phase followed by a novel alignment stage inspired by DNA sequence alignment methods, automated lexical mapping can map terms from various databases to standard vocabularies such as the UMLS (Unified Medical Language System) and LOINC (Logical Observation Identifier Names and Codes).
Measurements: This automated lexical mapping was evaluated using three real-world laboratory databases from different health care institutions. The authors report the sensitivity, specificity, percentage correct (true positives plus true negatives divided by total number of terms), and true positive and true negative rates as measures of system performance.
Results: The alignment algorithm was able to map 57% to 78% (average of 63% over all runs and databases) of equivalent concepts through lexical mapping alone. True positive rates ranged from 18% to 70%; true negative rates ranged from 5% to 52%.
Conclusion: Lexical mapping can facilitate the integration of data from diverse sources and decrease the time and cost required for manual mapping and integration of clinical systems and medical databases.
J Am Med Inform Assoc. 2006;13:334–343. DOI 10.1197/jamia.M1823.
Access to medical information is hindered by the variation that is inherent in the lexicon of medical terminology. As the medical field moves toward electronic health records, portability of patient information, and sharing of information across institutions, the need for a method to computationally normalize and map nonstandard terms and concepts to a standard vocabulary becomes more important. Several algorithms have previously been proposed to automate translation between medical vocabularies including the use of frames, semantic definitions, diagrams, and a combination of lexical, logical, and morphological methods.1–7 The Unified Medical Language System (UMLS), a product of the National Library of Medicine,8 exists to help in the development of systems to ‘‘understand’’ the language of biomedicine and health.9,10 The current version of the UMLS Metathesaurus contains information about over one million biomedical concepts and five million concept names from more than 100 controlled vocabularies and classifications (some in multiple languages). It includes vocabularies and coding systems designated as U.S. standards for the exchange of administrative and clinical data, including SNOMED CT (Systematized
Nomenclature of Medicine Clinical Terms) and LOINC (Logical Observation Identifiers Names and Codes). Tools developed for the UMLS include lexical variant generation tools, a tool for customizing a subset of the Metathesaurus (MetamorphoSys), and a tool for extracting UMLS concepts from text (MetaMap).11,12 LOINC is a clinical terminology important for laboratory test orders and results, identified by the Health Level 7 (HL7) Standards Development Organization as a preferred code set for laboratory test names. The LOINC database also has a mapping program called RELMA, the Regenstrief LOINC Mapping Assistant developed by the Regenstrief Institute.13 None of these systems, however, has been evaluated as a fully automated mapping system on production databases from multiple health care institutions. The Lexical INtegrator of Concepts (LINC) system performs completely automated lexical mapping of medical vocabularies. The following sections illustrate the stages involved in the implementation of the LINC system and discuss key issues and trade-offs in performance.
Affiliations of the authors: Newborn Medicine Informatics Program, Children's Hospital, Boston, MA.
Correspondence and reprints: Jennifer Y. Sun, MD, MS, 57 Blossomcrest Road, Lexington, MA 02421-7103.
Received for review: 03/08/05; accepted for publication: 02/01/06.

Background
The development of tools to manage variation among the many medical vocabularies throughout clinical and research domains is a daunting but necessary task. As the medical field develops increasingly complex information systems to store and exchange electronic information, the task of reconciling variations in terminology becomes increasingly urgent. Many different approaches have been created to help solve this problem.
Table 1A. Examples from the Laboratory Terms Databases

A. Examples of full terms from each database
TCH: alanine, 24-hr urine; amylase, plasma; ap b-2; ap fibrgen; factor viii recombinant; glucose, whole blood; hct.source; hemoglobin a2; hla cross match dfm; pco2, arterial blood; qu hb elect panel; rast for rhodotorula
IMH: arginine, urine quantitative; basement membrane ab igg, serum semi-quantitative; if cells cd34, blood quantitative; milk ab ige, serum quantitative; protein, urine quantitative; thyrotropin 60 minute post does trh iv, serum quantitative
DFCI: abs mono; albumin (ur); cmv chemilumin assay; cpk-mb; g#: cbc; ggtp; hbsag (surface ag); high dens.lipoprot; meta quantitative; iga (pedi); rapid hsv; sed rate

B. Examples of ambiguous full terms from each database; these are terms that the investigators were unable to map manually to the UMLS or LOINC vocabularies: ap ae11; 3rlab; g#: pv; "if amp s or ms, report–r"; 5ship; g#: ahav; li qc; abyid1; 1c qc temp freezer; 11b; cazcla; s/n ratio

C. Examples of excluded words from each database (combined across the three databases): ap, qc, quantitative, wksht, ur, serum, rast, plasma, urine, blood, semi, worksheet, micro, ab, csf, gross, screen, qualitative, hour, fluid, ige, igg, isolate, virus, abs, antibody, bwh

TCH = Children's Hospital; IMH = Intermountain Hospital; DFCI = Dana Farber Cancer Institute.
Table 1B. Examples of normalized terms from each database (excluded words in the full term are dropped from the normalized term)

Database | Full Term | Normalized Term
TCH | 5hiaa, urine mg/day | 5hiaa mg day
TCH | amylase, plasma | amylase
TCH | ap ebv/epstein barr virus | ebv epstein barr virus
IMH | aldolase, serum or plasma quantitative | aldolase
IMH | hydroxyproline, 24 hour urine quantitative | hydroxyproline
IMH | ketones, urine semi-quantitative test strip | ketones test strip
DFCI | abs monos | monos
DFCI | cmv ab igg | cmv igg
DFCI | ebv-vca (igg) antibody | ebv vca igg

TCH = Children's Hospital; IMH = Intermountain Hospital; DFCI = Dana Farber Cancer Institute.
The University of California San Francisco UMLS group began the undertaking of intervocabulary mapping in 1988 using lexical matching. Sherertz et al.14 used filters and rules to transform input for mapping. They attempted to map disease names and disease attributes to MeSH terms, achieving mappings for 47% to 48% of phrases in both cases. The SPECIALIST lexicon, a UMLS knowledge source, is an English-language lexicon that contains biomedical terms as well as common English terms. Lexical records contain base form, spelling variants, acronyms, and abbreviations. In addition, there is a database of neoclassical compound forms, single words that consist of several Greek or Latin morphemes that are very common in medical terminology.15 Language-processing tools have been developed to mediate between existing vocabularies and the lexicon. These programs consist of modules for normalization and lexical variant generation, including lowercasing, removing punctuation, removing stop words, sorting words in a term alphabetically, generating inflectional variants, reducing words to their base forms,
and generating derivational variants for words.16 However, these tools do not automate the task of mapping existing vocabularies to the UMLS Metathesaurus. A major issue in natural language processing is the lack of a truly comprehensive clinical vocabulary. The Large Scale Vocabulary Test17,18 showed that a combination of existing terminologies can represent a majority of concepts needed to describe patient conditions in a range of health care settings and information systems. This finding is significant in that it can serve as a strategy for improving current vocabularies and perhaps in establishing a standard national vocabulary. Structured or coded data in electronic medical records is not widely available. Instead, the main source of clinical information is in unstructured text reports and unconstrained data. Many natural language processing systems have been developed for the purpose of extraction of clinical medical information.19–25 These systems all include text parsers and part-of-speech tagging of text, some type of normalization method or variant expansion, and mapping of the individual
words/terms found. One of the core issues with these systems is the mapping of the normalized words to a standard vocabulary. Another issue is the development of specific lexicons for each individual system that are not necessarily generalizable. Our work differs from related work because our objective was not to extract text or attempt to parse it, but to develop a method focusing on the mapping of noun phrases to any standardized vocabulary, not a specific lexicon. Sager et al.19 developed the Linguistic String Project, which is one of the first comprehensive natural language processing systems for general English and was later adapted to medical text. The MedLEE system (Medical Language Extraction and Encoding system) was developed as part of the clinical information system at Columbia by Friedman et al.20 for use in actual patient care and was shown to improve care. The MedLEE system was also applied to map clinical information to coded forms.21 The MedsynDikate system, developed at Freiburg University, is used for extracting information from pathology findings reports. Its novel approach involved researching interdependencies between sentences while creating a conceptual graph of sentences parsed.22 The MetaMap program, by Aronson et al.,23,24 is a configurable program that maps biomedical text to concepts in the UMLS Metathesaurus. The algorithm generates variants using knowledge from the SPECIALIST lexicon and then maps these variants to the Metathesaurus.

Lau et al.25 propose a method for automated mapping of laboratory results to LOINC codes. Their method uses a series of rules to parse and map the laboratory tests to the LOINC descriptors and then to the LOINC codes. This method is very specific in its mapping process and would be difficult to generalize to other medical vocabulary domains.
Systems have also been developed to map abbreviations to full forms, such as AbbRE (abbreviation recognition and extraction) from Yu et al.,26 and systems for scoring abbreviation expansions, such as that by Chang et al.27 AbbRE uses a set of pattern-matching rules to map an abbreviation within free text (biomedical articles) to its full form. The system identifies abbreviations that are defined within the article, which, in the case of biomedical articles, occurred only 25% of the time. The remaining undefined abbreviations could be mapped to abbreviation databases in only 68% of cases.

Methods

Databases Used
Database dictionaries of laboratory studies were obtained listing all available names and definitions of laboratory tests. These databases, representing the query terms, were obtained from Children's Hospital in Boston, MA (TCH), Intermountain Health Care in Utah and Idaho (IMH), and the Dana Farber Cancer Institute in Boston, MA (DFCI). The TCH laboratory database contained 5,985 terms, of which 5,566 were unique terms; the IMH database had 4,440 terms, of which 3,364 were unique; and the DFCI database had 326 unique terms. Two dictionary sources were used: UMLS and LOINC. The UMLS dictionary has 11,033 terms and the LOINC dictionary has 34,840 terms. The UMLS dictionary terms are from the Semantic Network branch of the Metathesaurus that lists laboratory tests. The LOINC database was downloaded from version 2.12 (http://www.regenstrief.org/loinc/) and includes all concepts, not just laboratory concepts. Examples of full terms are shown in Table 1.

Overview
A graphical flowchart of the processes used in LINC is summarized in Figure 1. The following sections present the individual steps in the overall method: preprocessing the query and dictionary terms, the alignment process, and post-alignment sorting of the candidate mappings. The alignment process of mapping query terms occurs in several phases, each of which proceeds only if a match threshold is not reached, as discussed below.

Figure 1. An overall flowchart detailing the steps of the alignment process.

Preprocessing for Normalization
To account for variations in the query vocabulary compared to the dictionary vocabulary, normalization of each term was necessary. The normalization process involves converting the term to lower case, splitting the term into tokens (its constituent words), removing punctuation, removing duplicate tokens, and sorting the tokens by frequency. The frequency-sorting step uses a database created for each vocabulary used, consisting of the number of occurrences of each token within that particular vocabulary. For the query vocabulary, the 1% most frequent terms are removed for the initial mapping (see details in the "Exclusion Phase" section below). After normalization, two approaches to string alignment were employed. In the first method, the query terms are passed to the alignment method as separate tokens, scored separately, and then recombined to give an overall mapping score. In the second method, the normalized tokens are concatenated back into a string, passed to the alignment method, and scored as a whole.
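The normalization stage can be pictured with a short sketch. The following Python fragment is illustrative only, not the authors' implementation; the helper names and the direction of the frequency sort (rarest token first) are assumptions.

```python
import re
from collections import Counter

def tokenize(term):
    """Lowercase the term, strip punctuation, and split it into word tokens."""
    return re.sub(r"[^\w\s]", " ", term.lower()).split()

def build_frequency_table(vocabulary):
    """Tally how often each token occurs across all terms of one vocabulary."""
    counts = Counter()
    for term in vocabulary:
        counts.update(tokenize(term))
    return counts

def normalize(term, freq):
    """Tokenize, drop duplicate tokens, and order tokens by corpus frequency."""
    unique = list(dict.fromkeys(tokenize(term)))        # dedupe, keep order
    return sorted(unique, key=lambda tok: freq[tok])    # assumed: rarest first

vocabulary = ["Glucose, Whole Blood", "Amylase, Plasma", "pCO2, Arterial Blood"]
freq = build_frequency_table(vocabulary)
print(normalize("Glucose, Whole Blood", freq))  # e.g. ['glucose', 'whole', 'blood']
```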
Figure 2. Matrix representation of alignment. Query term is alk ptase (on the vertical) and dictionary term is alkaline phosphatase (on the horizontal). Highlighted cells represent positions where a character of the query term matches a character from the dictionary term within the optimal alignment.
Alignment
The inspiration for this algorithm comes from DNA sequence alignment algorithms such as BLAST (Basic Local Alignment Search Tool).28–32 LINC uses a matrix structure to find the best alignment between two strings: the query term and the dictionary term (Fig. 2). Every query term is mapped to every dictionary term to determine the dictionary terms with the best alignment. The matrix is initialized with zeros, and then for every cell where a character of the query term matches a character of the dictionary term (not including spaces), the numeral "1" is placed as a marker in the cell. To fully explore all possible alignments, the algorithm iteratively performs depth-first searches to find the "best" alignment. Using principles of dynamic programming, the larger problem of finding the optimal alignment is solved by finding the solutions to a series of smaller problems, namely, finding the optimal alignments for substrings. The smaller problems do not require recalculation because their results are saved, thus improving computational efficiency. The details of the alignment process are as follows. Each matrix cell containing a "1" (a character match) is represented by a node. The nodes are then linked together into a chain starting at the lower left corner of the matrix and proceeding to the right (Fig. 3). Starting with the first node in the chain, a score is calculated as the score for the first node plus the score for the next node; the calculation continues in this recursive way. By aligning every character of the query term to every matching character in the dictionary term, we are performing a depth-first search. Each node has an optimal score associated with it, representing the highest possible score that can be attained from that node forward (see "Scoring Algorithm" section). The algorithm tracks the scores for nodes that have already been traversed. Finding the node with the maximum score and following the path from that node through the matrix retrieves the best overall alignment.
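As a minimal sketch of the matrix setup (using the example from Figure 2), the binary match matrix can be built as follows; this is an illustration, not the LINC source code:

```python
def match_matrix(query, dictionary_term):
    """Cell [i][j] holds 1 when query[i] equals dictionary_term[j].

    Spaces are skipped, as described above; each 1 becomes a node
    in the chain that the depth-first search traverses."""
    return [[1 if q == d and q != " " else 0 for d in dictionary_term]
            for q in query]

matrix = match_matrix("alk ptase", "alkaline phosphatase")
for q_char, row in zip("alk ptase", matrix):
    print(q_char, row)
```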
Scoring Algorithm
The scoring method is a combination of multiple factors that contribute to an optimal alignment. Each node (matched character) is given an initial score of 1. To penalize gaps between matching characters, the initial score is divided by the squared distance between the current node and the next mapped node (i.e., the gap).

Figure 3. A graphical representation of how the chain is linked together. The dark solid arrows of the matrix show the reading of the cells from lower left and up the column. Each column is read from bottom to top, proceeding from the leftmost column to the right. The lower image shows the chain representation of the matrix. Coordinates represent the position of the cell in the matrix. The cells marked "1" are read from the matrix starting from the lower left corner, up the first column (column 0 in Figure 2), then proceeding up the second column (column 1) and continuing through the table from left to right.

Continuity between mapped characters is also tracked to benefit longer continuously matched node chains. A proximity score is calculated based on the Euclidean distance between any two nodes. If two matched characters are continuous within the original query term and dictionary term, then the proximity score equals 1. The continuity score is the sum of the proximity scores from all the nodes within the chain; but when two or more nodes have proximity scores equal to 1, the continuity score for that portion of the chain is squared prior to adding it to the overall continuity score. For example, a chain of four continuous nodes would have proximity scores of 1 + 1 + 1 = 3, and this would then be squared to add a continuity score of 9 to the overall score. Overall, the scoring scheme is skewed toward greatly rewarding longer chains (greater continuity) of matched nodes.

Formulas, for $\mathrm{Point}_1\,(x_1, y_1)$ and $\mathrm{Point}_2\,(x_2, y_2)$:

$$\mathrm{Gap} = x_2 - x_1$$
$$\mathrm{Node\ score} = \frac{1}{(x_2 - x_1)^2}$$
$$\mathrm{Proximity\ score} = \frac{\sqrt{2}}{\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}}$$

Specific examples of scoring follow.

Example 1
Query term = AST; dictionary term = AST
Node score (A) = 1; Node score (S) = 1; Node score (T) = 1
Table 2. Summary of Alignment Algorithm Results of TCH Laboratory Terms Versus the UMLS Standard Vocabulary

Run No. | Correct (%) | True Positive Rate (%) | True Negative Rate (%) | Top 1 Rate (%) | Top 3 Rate (%) | Sensitivity (%) | Specificity (%)
1 | 63 | 18 | 46 | 74 | 97 | 42 | 78
2 | 70 | 20 | 51 | 72 | 87 | 46 | 88
3 | 70 | 21 | 50 | 71 | 83 | 47 | 88
4 | 70 | 20 | 51 | 77 | 92 | 46 | 88
5 | 72 | 22 | 50 | 70 | 86 | 49 | 89
6 | 72 | 20 | 52 | 77 | 92 | 46 | 90
7 | 72 | 21 | 51 | 74 | 90 | 49 | 89
8 | 78 | 28 | 50 | 73 | 87 | 64 | 89

The columns labeled Algorithm, String Processing, Scoring, and Post-processing (in the continuation of this table) categorize the different methods that were used in each run (noted with a check mark) as detailed in the "Methods" section previously. Percentage correct is true positives plus true negatives divided by total number of query terms. Top 1 rate is the number of mappings that were ranked number 1 divided by the total number of true positives. Top 3 rate is the number of mappings that were ranked either 1, 2, or 3 divided by the total number of true positives. TCH = Children's Hospital; UMLS = Unified Medical Language System.
Proximity scores = 2 (A-S is continuous, and S-T is continuous)
Continuity score = 4 (2²)
Overall score = 1 + 1 + 1 + 4 = 7

Example 2
Query term = AST; dictionary term = ALT
Node score (A) = 1; Node score (S) = 0; Node score (T) = 1
Proximity score = 0.5 (A-T is not continuous)
Continuity score = 0.25
Overall score = 1 + 0 + 1 + 0.25 = 2.25

Example 3
Query term = AST; dictionary term = ASB
Node score (A) = 1; Node score (S) = 1; Node score (T) = 0
Proximity score = 1 (A-S is continuous, and S-T is not continuous)
Continuity score = 1
Overall score = 1 + 1 + 0 + 1 = 3
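A small sketch of the proximity and continuity calculations, written to reproduce the examples above. It is one interpretation, not the published code: in particular, it squares an isolated sub-1 proximity on its own (0.5 becomes 0.25, as in Example 2).

```python
import math

def proximity(node1, node2):
    """sqrt(2) over the Euclidean distance between two matched nodes;
    equals 1 exactly when the matches are adjacent in both strings."""
    return math.sqrt(2) / math.dist(node1, node2)

def continuity(proximities):
    """Sum proximity scores, squaring each run of consecutive 1s
    (and, per Example 2, squaring isolated sub-1 proximities too)."""
    total, run = 0.0, 0.0
    for p in proximities:
        if p == 1:
            run += p                     # extend the continuous run
        else:
            total += run ** 2 + p ** 2   # close the run, square the stray
            run = 0.0
    return total + run ** 2              # close any trailing run

# Example 1: AST vs AST, nodes (0,0)-(1,1)-(2,2) -> continuity 4.0
print(continuity([proximity((0, 0), (1, 1)), proximity((1, 1), (2, 2))]))
# Example 2: AST vs ALT, only A and T match -> continuity 0.25
print(continuity([proximity((0, 0), (2, 2))]))
```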
Exclusion Phase
In the preprocessing of the query vocabulary, the 1% most frequent words in that particular vocabulary are removed. Because these words are so prevalent within their respective vocabularies, they are often nonspecific and contribute an excessive amount to the alignment mapping between terms. Examples of these words are shown in Table 1. The initial alignment is done without these words. Then a second alignment ("second pass" in the flow chart) is done using the candidate mappings from the initial alignment, with the excluded words added back to the overall query term. By adding back the excluded words, higher scores can be assigned to the most specific mappings.
Abbreviation Expansion Process
Within the overall LINC process, the abbreviation expansion steps occur in the third through fifth passes of the algorithm. If a query term is not mapped in the first two passes (see flow diagram), it then falls into the abbreviation expansion process. Using a compilation of abbreviation dictionaries from various online resources, an overall abbreviation table was created. Online sources for abbreviations included Wikipedia, Eugene Free Community Network, JD.MD, and Börm Bruckmeier Publishing.33 Tables were created expanding each query and dictionary term if an available abbreviation was found within the term. All combinations of the abbreviations were also created; i.e., if a term consisted of two words, such as alk phos, with alk having one possible abbreviation expansion (to alkaline) and phos having two possible abbreviation expansions (to phosphorus and phosphatase), there would be five possible combinations for expansion (alk phosphorus, alk phosphatase, alkaline phos, alkaline phosphorus, alkaline phosphatase). If there were more than 16 combinations, the term was deemed to have no relevant expansions, since the expansions were then nonspecific and ambiguous. In the TCH database, there were 5,566 terms, of which 1,359 had more than 16 abbreviation expansion combinations. The IMH database had 3,364 terms, of which 692 terms had more than 16 expansions, and the DFCI database had 326 terms, of which 23 had more than 16 expansions. The algorithm uses three different passes to expand and map abbreviated terms. Alignment is attempted between the expanded query term and the unexpanded dictionary, then with the unexpanded query term and the expanded dictionary, and last with the expanded query term and the expanded dictionary.
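The combination step can be sketched with itertools.product; the function name and the tiny abbreviation table here are assumptions for illustration, not the authors' code:

```python
from itertools import product

def expansion_combinations(term, abbrev, limit=16):
    """Expand every token both as-is and through each known long form.

    A term whose expansions exceed the limit is treated as having no
    relevant expansions, as described above."""
    options = [[token] + abbrev.get(token, []) for token in term.split()]
    combos = [" ".join(words) for words in product(*options)]
    combos = [c for c in combos if c != term]   # keep only true expansions
    return [] if len(combos) > limit else combos

abbrev = {"alk": ["alkaline"], "phos": ["phosphorus", "phosphatase"]}
print(expansion_combinations("alk phos", abbrev))
# ['alk phosphorus', 'alk phosphatase', 'alkaline phos',
#  'alkaline phosphorus', 'alkaline phosphatase']
```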
Match Threshold
As a part of the scoring, a threshold was set to determine whether two terms were appropriate mappings based on the scoring algorithm described previously. The threshold is a percentage of a "perfect match," a perfect match being the case where the query term and the dictionary term are identical (normalized) character strings. Query terms that had no candidate mappings scored above the threshold are defined as "unmapped." To find the optimum threshold, the mapping algorithm was run with scoring thresholds ranging between 50% and 100%, in 5% increments. Plotting a receiver operating characteristic (ROC) curve, we further investigated thresholds between 80% and 90% (in increments of 1%), finally choosing a threshold of 85% as the best balance of sensitivity and specificity.
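In code, the acceptance test might look like the following sketch, where score_term_pair stands in for the alignment scoring routine described above (the name is hypothetical):

```python
def is_mapped(candidate_score, query_term, score_term_pair, threshold=0.85):
    """A candidate counts as a mapping only if it reaches 85% of a
    'perfect match': the query term aligned against itself."""
    perfect_score = score_term_pair(query_term, query_term)
    return candidate_score >= threshold * perfect_score
```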
Table 2 (Continued). For each run, check marks in the original table indicate the methods used, grouped under Algorithm (Nontoken, Token), String Processing (Second Pass, Alphabetize, Frequency), Scoring (Standard, Continuity), and Post-processing (Gap Penalty, Dictionary Length, Position, Position 2, Score and Position).
Post-processing
Several different techniques were used to reorder the candidate mappings so that the "best" mapping would be sorted to the top of the list. These methods include:
- Sort by dictionary term length.
- Sort by position: summing the x coordinates of the matched characters within the dictionary term.
- Sort by position, method 2: summing the first and last x coordinates of the matched characters of the dictionary term.
- Sort by score and position: a new score that is a combination of the score and position of mapped terms, where score is the score of the current query term; max score is the highest scoring mapping for the current query term; position sum is the sum of all the x coordinates for the mapped characters of the dictionary term; and max position sum is the maximum position sum of all the dictionary terms that map to the current query term:
$$\mathrm{New\ score} = \frac{\mathrm{score}}{\mathrm{max\ score}} \times \frac{\mathrm{max\ position\ sum}}{\mathrm{position\ sum}}$$
- Sort by percentage mapped: a new score that is a combination of the score and the percentage of the dictionary term mapped:
$$\mathrm{New\ score} = \mathrm{score} \times \frac{\#\,\mathrm{characters\ matched}}{\#\,\mathrm{characters\ of\ dictionary\ term}}$$
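The percentage-mapped sort (which proved best in the runs reported below) reduces to a one-line key function; the candidate tuples and their field names here are assumptions for illustration:

```python
def sort_by_percentage_mapped(candidates):
    """Re-rank candidates by score times the fraction of the dictionary
    term's characters that were matched; best mapping first.

    candidates: (dictionary_term, score, chars_matched) tuples."""
    def new_score(candidate):
        term, score, chars_matched = candidate
        return score * chars_matched / len(term)
    return sorted(candidates, key=new_score, reverse=True)

print(sort_by_percentage_mapped([
    ("alkaline phosphatase, serum", 9.0, 8),  # hypothetical scores
    ("alkaline phosphatase", 9.0, 8),          # shorter term ranks first
]))
```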
Evaluation
Using a random number generator, 200 query terms were randomly chosen from each query term database (TCH, IMH, and DFCI) for evaluation. LINC mappings were generated for each of these terms using various combinations of the algorithms discussed previously. For each LINC mapping run, a 2 × 2 truth table was then constructed to tally true positives (the algorithm found the correct mapping in the dictionary), false positives (the mapping found was incorrect, but scored above the threshold), false negatives (the mapping scored below the threshold, but the term was present in the dictionary), and true negatives (the mapping was below the threshold and the term was not in the dictionary). The gold standard for the truth table was based on manual evaluation of the mappings by the investigators.
A mapping was labeled true positive as long as an appropriate term from the dictionary scored above the threshold. In some cases, there were multiple mappings above the threshold, of which some would be correct mappings and some would be incorrect mappings; these were still labeled as true positives in that at least one correct mapping was identified. If the query term was ambiguous or undecipherable (as was the case with many abbreviated terms), those mappings were considered true negatives or false positives depending on the mapping results since there was no means by which to confirm a mapping. Additionally, for all terms for which the system did not generate a mapping scoring above the threshold, the investigator manually searched the dictionary vocabulary for the query term to determine whether the mapping should be labeled as a false negative.
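From the tallied truth table, the reported measures follow directly; a small sketch using the definitions given with the tables (the example counts are illustrative):

```python
def performance_measures(tp, fp, fn, tn):
    """Percentage correct is (TP + TN) / total query terms; sensitivity
    and specificity follow the usual truth-table definitions."""
    total = tp + fp + fn + tn
    return {
        "percent_correct": 100 * (tp + tn) / total,
        "sensitivity": 100 * tp / (tp + fn),
        "specificity": 100 * tn / (tn + fp),
    }

print(performance_measures(tp=56, fp=14, fn=30, tn=100))  # hypothetical counts
```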
Results
Results for 19 alignment runs are shown in Tables 2–5. In initial experiments with TCH, many different scoring algorithms and post-process sortings of mappings were evaluated. The initial runs were with TCH versus the UMLS only. An overall "best" algorithm was chosen from the initial runs and applied to TCH, IMH, and DFCI versus LOINC only. The abbreviation phases were added last. Table 2 shows an increase in the true positive rate with the changes in scoring algorithms and post-process sorting. The best mapping showed a true positive rate of 28% and a true negative rate of 50%, with an overall percentage correct of 78%. It was also noted that normalizing the query term by alphabetizing or sorting the tokens by frequency did not seem to make much difference and created some new problems, so both those methods were removed for the rest of the runs. Our initial mapping of TCH (5,985 terms) versus UMLS (11,033 terms) averaged 50 terms mapped per minute. Mapping to a larger dictionary like LOINC (34,840 terms) lowered the speed to six to ten terms mapped per minute. Table 3 shows that the mapping of TCH to LOINC had better success except when the abbreviation algorithm is applied. Table 4, for IMH, shows an overall true positive rate between 65% and 70%, with 99% of the mappings falling in the top three mappings produced and 95% in the top position. With the DFCI database, Table 5 shows little change in mapping with changes in the algorithm; mappings were between 62% and 63% correct with 33% to 34% true positive rates. Again, the mappings ranked in the top position in 82% of cases and 91% ranked in the top three positions.
Table 3. Summary of Alignment Algorithm Results of TCH Laboratory Terms to the LOINC Standard Vocabulary

Run No. | Correct (%) | True Positive Rate (%) | True Negative Rate (%) | Top 1 Rate (%) | Top 3 Rate (%) | Sensitivity (%) | Specificity (%)
1 | 73 | 25 | 48 | 76 | 90 | 61 | 81
2 | 73 | 24 | 50 | 74 | 96 | 57 | 85
3 | 69 | 20 | 49 | 75 | 83 | 48 | 83
4 | 59 | 24 | 35 | 81 | 87 | 63 | 56
5 | 58 | 25 | 34 | 78 | 84 | 66 | 53
6 | 57 | 25 | 32 | 82 | 88 | 66 | 52

For each run, check marks in the original table indicate the methods used, grouped under Algorithm (Nontoken, Token, Second Pass, Third Pass, Fourth Pass, Fifth Pass), String Processing (Frequency), and Post-processing (% Mapped), as detailed in the "Methods" section previously. Percentage correct is true positives plus true negatives divided by total number of query terms. Top 1 rate is the number of mappings that were ranked number 1 divided by the total number of true positives. Top 3 rate is the number of mappings that were ranked either 1, 2, or 3 divided by the total number of true positives. TCH = Children's Hospital; LOINC = Logical Observation Identifier Names and Codes.
Additionally, the query terms and matching dictionary terms were evaluated to determine the rate of nonmatching words within either. In the TCH versus UMLS runs, there were nonmatching words in 85% to 95% of mappings. In TCH versus LOINC, nonmatching words occurred in 98% to 100% of cases; in IMH versus LOINC, in 100% of mappings; and in DFCI versus LOINC, in 97% to 100% of mappings.
As shown in Table 6, there were varied results from the abbreviation expansion. For the TCH database, the percentage of correct mappings actually decreased. Although the abbreviation expansion process had found ten new mappings, it also incorrectly expanded 33 terms, resulting in a net decrease. For DFCI, the expansion netted no change in percentage correct, although the method did find as many as 26 new mappings. In the case of the IMH database, there was a net increase resulting from 12 new mappings found and only four errors in expansion.

Discussion
A comparison of the LINC system against past vocabulary mapping investigations highlights three key features. First, our system is flexible enough to allow mapping of different vocabularies, possibly from different domains. The system is not dependent on the creation of specialized lexicons to perform the mappings. Second, our system is fully automated, requiring no preformatting, manual input, or construction of rules or filters. Finally, the LINC system has been evaluated with multiple real-world vocabularies against both the UMLS and LOINC.

The system is biased to identify more mappings, thereby producing more false positives. It was assumed that it would be easier to eliminate incorrect mappings than to manually search a large vocabulary for an appropriate mapping.

In experimenting with different algorithms for normalization and mapping within LINC, there were trade-offs for each method used, and no single optimal method was found. In mapping query terms to dictionary terms, it was noted that the order of words within the terms differed from vocabulary to vocabulary. For example, the TCH query term "glucose, whole blood" does not exist in that exact form in the UMLS Metathesaurus. Instead, the UMLS has the term "whole blood glucose tests."

Three methods were employed to deal with these types of variation. The first method was alphabetizing the words within each term. Through this string processing method, the query term would become "blood glucose whole" and the UMLS term would become "blood glucose tests whole."

Another method was to order the words within each term by their frequency in the vocabulary. From each vocabulary, a frequency table was generated to tally the frequency of each word within its vocabulary. Using this frequency method, the TCH query term "whole blood glucose" would become "whole glucose blood." Because each term is ordered using frequencies within its own vocabulary, the UMLS term "whole blood glucose tests" would become "whole glucose tests blood."

With the alphabetization method, although words will be ordered in a standard manner across vocabularies, the ordering may create gaps in the alignment due to words in the dictionary term that do not occur in the query term; as in the example shown previously, the term "tests" creates a gap in an otherwise continuous string. With the frequency method, the least frequent terms will be the most relevant, but the downside is that words may be reordered in such a way that the score is decreased due to poor alignment.

The third method was to break each term into tokens (words) and perform a separate alignment on each token, from which the scores were joined to obtain an overall score. With the third method, the order of words does not matter, because as long as the token from the query term is found within the dictionary term, it will be mapped.
Table 4. Summary of Alignment Algorithm Results of IMH Laboratory Terms Versus the LOINC Standard Vocabulary

Run No. | Correct (%) | True Positive Rate (%) | True Negative Rate (%) | Top 1 Rate (%) | Top 3 Rate (%) | Sensitivity (%) | Specificity (%)
1 | 71 | 65 | 6 | 72 | 88 | 73 | 52
2 | 73 | 67 | 6 | 72 | 83 | 73 | 63
3 | 72 | 65 | 7 | 95 | 98 | 71 | 72
4 | 72 | 66 | 6 | 95 | 99 | 74 | 54
5 | 74 | 70 | 5 | 97 | 99 | 79 | 36
6 | 75 | 70 | 5 | 95 | 99 | 81 | 33

For each run, check marks in the original table indicate the methods used, grouped under Algorithm (Nontoken, Token, Second Pass, Third Pass, Fourth Pass, Fifth Pass), String Processing (Frequency), and Post-processing (% Mapped), as detailed in the "Methods" section previously. Percentage correct is true positives plus true negatives divided by total number of query terms. Top 1 rate is the number of mappings that were ranked number 1 divided by the total number of true positives. Top 3 rate is the number of mappings that were ranked either 1, 2, or 3 divided by the total number of true positives. IMH = Intermountain Hospital; LOINC = Logical Observation Identifier Names and Codes.
Table 5. Summary of Alignment Algorithm Results of DFCI Laboratory Terms Versus the LOINC Standard Vocabulary

Run No. | Correct (%) | True Positive Rate (%) | True Negative Rate (%) | Top 1 Rate (%) | Top 3 Rate (%) | Sensitivity (%) | Specificity (%)
2 | 63 | 34 | 30 | 64 | 82 | 63 | 63
3 | 62 | 33 | 30 | 81 | 90 | 62 | 61
4 | 63 | 41 | 22 | 80 | 89 | 82 | 44
5 | 63 | 42 | 21 | 82 | 91 | 85 | 41
6 | 63 | 42 | 21 | 82 | 91 | 86 | 40

For each run, check marks in the original table indicate the methods used, grouped under Algorithm (Nontoken, Token, Second Pass, Third Pass, Fourth Pass, Fifth Pass), String Processing (Frequency), and Post-processing (% Mapped), as detailed in the "Methods" section previously. Percentage correct is true positives plus true negatives divided by total number of query terms. Top 1 rate is the number of mappings that were ranked number 1 divided by the total number of true positives. Top 3 rate is the number of mappings that were ranked either 1, 2, or 3 divided by the total number of true positives. DFCI = Dana Farber Cancer Institute; LOINC = Logical Observation Identifier Names and Codes.
Another issue that hindered lexical mapping was the existence of less ‘‘discriminating’’ or specific words within the query term, such as ‘‘qualitative,’’ ‘‘quantitative,’’ ‘‘urine,’’ ‘‘blood,’’ and ‘‘csf’’ (cerebrospinal fluid). These terms are clinically relevant, but not the critical tokens that differentiate a query term. LINC addresses this issue by removing the most frequent terms in the query vocabulary (top 1%) from the query term for the initial alignment/mapping. After candidate mappings were identified during the initial alignment, a second alignment was run with the query term reincorporating these initially excluded tokens to obtain the most specific mapping possible.
Among the three vocabularies mapped, there was variation in the sensitivity and specificity. The vocabularies were quite different, as shown in Table 1, with the TCH vocabulary having many more nonmappable terms; thus, the specificity was higher because there were many more true negative terms. The IMH terms most closely resembled the LOINC terms; thus, that vocabulary had the highest sensitivity. The DFCI terms were abbreviated in many cases; thus, when expanded, many could be mapped to the LOINC vocabulary, increasing the sensitivity.
LINC tackles the obstacle of abbreviations in both the query terms and the dictionary terms by running several phases of abbreviation expansion. Because of the limitations of the available abbreviation dictionaries, however, many appropriate expansions were unavailable and many inappropriate expansions were performed, thus increasing the number of false positives and decreasing the specificity of the match. For example, ‘‘ser’’ was only expanded to serine, not serum. There is a trade-off with customized abbreviation dictionaries; with more possible expansions, there may be more matches, but there also may be more false positives generated from the incorrect expansions. Additionally, the matching algorithm was run against a subset of the UMLS Metathesaurus, which was likely helpful in decreasing the number of false positives; therefore, the higher specificity in the matching runs with the UMLS may be confounded by that factor.
Table 6. Joint Results Table: TCH, IMH, and DFCI Versus LOINC, Run 2 Versus Run 6

Query Database | Algorithm | Correct (%) | True Positive Rate (%) | True Negative Rate (%) | Sensitivity (%) | Specificity (%)
TCH | Without abbreviation | 73 | 24 | 50 | 57 | 85
TCH | With abbreviation | 57 | 25 | 32 | 66 | 52
IMH | Without abbreviation | 73 | 67 | 6 | 73 | 63
IMH | With abbreviation | 75 | 70 | 5 | 81 | 33
DFCI | Without abbreviation | 63 | 34 | 30 | 63 | 63
DFCI | With abbreviation | 63 | 42 | 21 | 86 | 40

Percentage correct is true positives plus true negatives divided by total number of query terms. TCH = Children's Hospital; IMH = Intermountain Hospital; DFCI = Dana Farber Cancer Institute; LOINC = Logical Observation Identifier Names and Codes.
Another major consideration is how to choose the "optimal mapping." A single query term may map to multiple terms in the dictionary vocabulary, but in an automated system, we only want the "best" mappings. Two methods were used to extract the best mappings from the list of candidate mappings generated. The first method was to use a mapping threshold. Using this threshold, the accuracy of the mapping can be controlled; an exact mapping would have a threshold of 100%, meaning that all words in the query term must appear in the dictionary term as exact string matches. As detailed in the "Methods" section, an ROC curve was used to optimize the threshold. The second method to extract the optimal mapping was to sort the list of candidate mappings according to various scoring metrics. As described previously, several sorting methods were evaluated. Initial sorting methods were based on the length of the dictionary term and the sum of positions of the character mappings of the dictionary terms, under the assumption that a shorter dictionary term without other extraneous tokens would be more relevant. In the end, the best sorting method was a simple one that scored mappings by the percentage of characters in the dictionary term that matched.
Limitations and Future Goals
A limiting factor in the breadth of this investigation is the choice of standardized medical vocabularies used, as well as native ambiguity in some of the legacy terms. Neither the UMLS Metathesaurus nor LOINC covered the entire domain of laboratory tests available at this time. In addition, the available abbreviation dictionaries for the latter part of our experimental runs were incomplete or often contained abbreviations irrelevant to our specific domain. Last, the system has only been tested on laboratory terms, which may not be representative of performance on other vocabularies such as diseases, medications, or genomics.

While the system is automated in its generation of potential mappings for a query term, currently an expert/clinician still needs to confirm the correct mapping from the list of generated candidates. Thus, there still exists a trade-off between the speed of automation and manual accuracy. Although manual confirmation of the lexical mapping is more time intensive than allowing the system to function in a totally automated fashion, this still represents an improvement in efficiency compared to systems that require manual search or collation of vocabularies.

Currently, LINC is implemented as part of MEDIATE (Medical Information Acquisition and Transmission Enabler),34 an automated system developed to integrate data from multiple disparate sources using semantic networks to perform context-based mapping. LINC provides the underlying lexical mapping for the nodes of the semantic network and helps provide the ground structure to support context-based mapping. Further development of both the lexical and context-based mapping techniques should yield further improvements in automating electronic medical information exchange.

Future avenues of exploration might include the following experiments. A phase for synonym incorporation could be evaluated. Further tests against SNOMED would provide additional evidence of the generalizability of the system. In addition, testing on different real-world vocabularies within the clinical or research realm can help test the scope of the algorithm's application. Finally, the adaptability of LINC could be tested on other databases in the medical domain that are not
limited to clinical databases. For example, the bioinformatics domain has many similar issues with standardization where automated lexical mapping will become more important.
Conclusion
In the drive to expedite and improve the efficiency of information exchange, the mapping of local clinical terminology to standardized vocabularies will always be necessary to accommodate the legacy systems of individual institutions. LINC uses novel methods to automate the lexical mapping of terms between medical terminologies. The evaluation of LINC on multiple real-world laboratory databases and two "standardized" medical vocabularies illustrates the continuing obstacles that confront data mapping efforts, and the performance of the system demonstrates promise to facilitate data exchange using automated lexical mapping.
References
1. Rocha RA, Rocha BHSC, Huff SM. Automated translation between medical vocabularies using a frame-based interlingua. Proc Annu Symp Comput Appl Med Care. 1993:690–4.
2. Rocha RA, Huff SM. Using diagrams to map controlled medical vocabularies. Proc Annu Symp Comput Appl Med Care. 1994:172–6.
3. Dolin RH, Huff SM, Rocha RA, Spackman KA, Campbell KE. Evaluation of a "lexically assign, logically refine" strategy for semi-automated integration of overlapping terminologies. J Am Med Inform Assoc. 1998;5:203–13.
4. Elkin PL, Cimino JJ, Lowe HJ, Aronow DB, Payne TH, Pincetl PS, Barnett GO. Mapping to MeSH: the art of trapping MeSH equivalence from within narrative text. Proc 12th SCAMC. 1988:185–90.
5. Barrows RC, Cimino JJ, Clayton PD. Mapping clinically useful terminology to a controlled medical vocabulary. Proc Annu Symp Comput Appl Med Care. 1994:211–5.
6. Hahn U, Honeck M, Piotrowski M, Schulz S. Subword segmentation: leveling out morphological variations for medical document retrieval. Proc AMIA Symp. 2001:229–33.
7. Cimino JJ, Barnett GO. Automated translation between medical terminologies using semantic definitions. MD Comput. 1990;7:104–9.
8. More information about the UMLS Knowledge Source Server (UMLSKS) is available from: http://umlsks.nlm.nih.gov. Accessed March 17, 2006.
9. McCray AT, Aronson AR, Browne AC, Rindflesch TC. UMLS knowledge for biomedical language processing. Bull Med Libr Assoc. 1993;81:184–94.
10. Humphreys BL, Lindberg DAB, Schoolman HM, Barnett GO. The Unified Medical Language System: an informatics research collaboration. J Am Med Inform Assoc. 1998;5:1–11.
11. Divita G, Browne AC, Rindflesch TC. Evaluating lexical variant generation to improve information retrieval. Proc AMIA Symp. 1998:775–9.
12. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32:D267–70.
13. McDonald CJ, Huff SM, Suico JG, et al. LOINC, a universal standard for identifying laboratory observations: a 5-year update. Clin Chem. 2003;49:624–33.
14. Sherertz DD, Tuttle MS, Blois MS, Erlbaum MS. Intervocabulary mapping within the UMLS: the role of lexical mapping. Proc Annu Symp Comput Appl Med Care. 1988:201–6.
15. McCray AT, Srinivasan S, Browne AC. Lexical methods for managing variation in biomedical terminologies. Proc Annu Symp Comput Appl Med Care. 1994:235–9.
16. McCray AT. The nature of lexical knowledge. Methods Inf Med. 1998;37:353–60.
17. Humphreys BL, McCray AT, Cheh ML. Evaluating the coverage of controlled health data terminologies: report on the results of the NLM/AHCPR Large Scale Vocabulary Test. J Am Med Inform Assoc. 1997;4:484–500.
18. McCray AT, Cheh ML, Bangalore AK, Rafei K, Razi AM, et al. Conducting the NLM/AHCPR Large Scale Vocabulary Test: a distributed Internet-based experiment. Proc AMIA Symp. 1997:560–4.
19. Sager N, Lyman M, Nhan NT, Tick LJ. Medical language processing: applications to patient data representation and automatic encoding. Methods Inf Med. 1995;34:140–6.
20. Friedman C, Alderson PO, Austin JH, Cimino JJ, Johnson SB. A general natural-language text processor for clinical radiology. J Am Med Inform Assoc. 1994;1:161–74.
21. Friedman C, Shagina L, Lussier Y, Hripcsak G. Automated encoding of clinical documents based on natural language processing. J Am Med Inform Assoc. 2004;11:392–402.
22. Hahn U, Romacker M, Schulz S. MedsynDikate: a natural language system for the extraction of medical information from findings reports. Int J Med Inform. 2002;67:63–74.
23. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Symp. 2001:17–21.
24. Background information on MetaMap Transfer (MMTx) is available at: http://mmtx.nlm.nih.gov/ and Semantic Knowledge Representation, http://skr.nlm.nih.gov/. Accessed March 17, 2006.
25. Lau LM, Johnson K, Monson K, Lam SH, Huff SM. A method for automated mapping of laboratory results to LOINC. Proc AMIA Symp. 2000:472–6.
26. Yu H, Hripcsak G, Friedman C. Mapping abbreviations to full forms in biomedical articles. J Am Med Inform Assoc. 2002;9:262–72.
27. Chang JT, Schütze H, Altman RB. Creating an online dictionary of abbreviations from MEDLINE. J Am Med Inform Assoc. 2002;9:612–20.
28. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–10.
29. Altschul SF, Madden TL, Schaffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–402.
30. Thompson JD, Plewniak F, Poch O. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 1999;27:2682–90.
31. Pertsemlidis A, Fondon JW 3rd. Having a BLAST with bioinformatics (and avoiding BLASTphemy). Genome Biol. 2001;2:1–10. Epub 2001 Sep 27.
32. Altschul SF, Boguski MS, Gish W, Wootton JC. Issues in searching molecular sequence databases. Nat Genet. 1994;6:119–29.
33. Wikipedia, The Free Encyclopedia: List of medical abbreviations, www.wikipedia.org/wiki/List_of_medical_abbreviations; Future Nurses, Our Acronym (& Abbreviation) List, www.efn.org/~nurses/acro.html; JD.MD Inc., Medical Abbreviations Glossary, www.jdmd.com/glossary/medabbr.pdf; Börm Bruckmeier Publishing, Medical Abbreviations, www.media4u.com/bbp/medabb_a.html. Accessed March 18, 2006.
34. Sun Y. Methods for automated concept mapping between medical databases. J Biomed Inform. 2004;37:162–78.