Research Paper
A Text Mining Approach to the Prediction of Disease Status from Clinical Discharge Summaries

Hui Yang, PhD, Irena Spasic, PhD, John A. Keane, Goran Nenadic, PhD

Abstract

Objective: The authors present a system developed for the Challenge in Natural Language Processing for Clinical Data (the i2b2 obesity challenge), whose aim was to automatically identify the status of obesity and 15 related co-morbidities in patients using their clinical discharge summaries. The challenge consisted of two tasks, textual and intuitive. The textual task was to identify explicit references to the diseases, whereas the intuitive task focused on the prediction of the disease status when the evidence was not explicitly asserted.

Design: The authors assembled a set of resources to lexically and semantically profile the diseases and their associated symptoms, treatments, etc. These features were explored in a hybrid text mining approach, which combined dictionary look-up, rule-based, and machine-learning methods.

Measurements: The methods were applied to a set of 507 previously unseen discharge summaries, and the predictions were evaluated against a manually prepared gold standard. The overall ranking of the participating teams was primarily based on the macro-averaged F-measure.

Results: The implemented method achieved a macro-averaged F-measure of 81% for the textual task (the highest achieved in the challenge) and 63% for the intuitive task (ranked 7th out of 28 teams; the highest was 66%). The micro-averaged F-measure showed an average accuracy of 97% for textual and 96% for intuitive annotations.

Conclusions: The performance achieved was in line with the agreement between human annotators, indicating the potential of text mining for accurate and efficient prediction of disease statuses from clinical discharge summaries.

J Am Med Inform Assoc. 2009;16:596–600. DOI 10.1197/jamia.M3096.

Affiliation of the authors: School of Computer Science, University of Manchester, Manchester, UK; Dr. Yang is currently with the Department of Computing, Open University, UK. This work was partially supported by the UK BBSRC project "Mining Term Associations from Literature to Support Knowledge Discovery in Biology". Irena Spasic gratefully acknowledges the support of the BBSRC and EPSRC via "The Manchester Centre for Integrative Systems Biology" grant. Correspondence: Goran Nenadic, Manchester Interdisciplinary Biocentre, University of Manchester, 131 Princess Street, Manchester M1 7DN, UK; e-mail: <[email protected]>. Received for review: 12/07/08; accepted for publication: 04/07/09.
Introduction
The objective of the 2008 i2b2 obesity challenge in natural language processing (NLP) for clinical data1 was to evaluate NLP systems on their performance in identifying patient obesity and associated co-morbidities based on hospital discharge summaries. Fifteen related diseases were considered: diabetes mellitus (DM), hypercholesterolemia, hypertriglyceridemia, hypertension (HTN), atherosclerotic CV disease (CAD), heart failure (CHF), peripheral vascular disease (PVD), venous insufficiency, osteoarthritis (OA), obstructive sleep apnea (OSA), asthma, GERD, gallstones/cholecystectomy, depression, and gout. The aim was to label each document with a disease/co-morbidity status, indicating whether:
• a patient was diagnosed with a disease/co-morbidity (Y—yes, disease present),
• a patient was diagnosed with not having a disease/co-morbidity (N—no, disease absent),
• it was uncertain whether a patient had a disease/co-morbidity or not (Q—questionable), or
• a disease/co-morbidity status was not mentioned in the discharge summary (U—unmentioned).
The challenge consisted of two tasks, textual and intuitive. The textual task was to identify explicit references to the diseases in the narrative text; each hospital report was to be labeled with one of four possible disease status labels (Y, N, Q, or U). The intuitive task focused on inferring the disease status even when the evidence was not explicitly asserted; the possible intuitive labels were Y, N, and Q for each disease. The organizers provided a training set of 730 hospital discharge summaries manually annotated with more than 22,000 labels.

We implemented a hybrid approach that combined three types of features (lexical, terminological, and semantic), exploited by dictionary look-up, rule-based, and machine-learning methods. We assembled a set of resources to lexically and semantically profile the diseases and their associated symptoms, treatments, etc. The methods were applied to a set of 507 previously unseen discharge summaries, and the predictions were evaluated against the manually prepared gold standard. In the textual task, our macro-averaged F-measure (81%) was the highest achieved in the challenge. In the intuitive task, we achieved a macro-averaged F-measure of 63%. The micro-averaged F-measure showed an average accuracy of 97% for textual annotation and 96% for intuitive annotation, indicating the potential of text mining techniques to accurately extract the disease status from hospital discharge summaries.
Methods

The general idea underlying our approach was to identify sentences that contained evidence to support a judgment for a given disease, and then to integrate the evidence gathered at the sentence level into a prediction at the document level. The system workflow consisted of three major steps: report pre-processing, textual prediction, and intuitive prediction, with a final integration of the textual and intuitive results (see Figure 1). The prediction steps were applied to each of the 16 diseases/co-morbidities separately.
Figure 1. The general design of the system.
Figure 2. The system architecture diagram.
The report pre-processing involved basic textual processing of the input discharge narratives. In the textual prediction step, explicit evidence was identified and combined to derive textual predictions. The intuitive prediction module focused on capturing intuitive clues that could associate a report with a disease, and the final intuitive judgments were combined with the textual ones. Figure 2 depicts the detailed architecture of the system. In the following sections we describe each module and the basic steps performed (for further details, see the JAMIA online data supplement at http://www.jamia.org).
Table 1. Examples of Disease-Status Lexico-Semantic Patterns (target patterns are in italic)

Label  Examples
N      No history of coronary artery disease; negative for CHF; denied congestive heart failure; she does not have a history of GERD; no signs or sxms of heart failure
Q      ?sleep apnea; question of asthma; it is possible that she has sleep apnea; possibility of sleep apnea; gout may be involved in this problem; No known diagnosis of CAD; assess for CAD was non-diagnostic; equivocal for coronary artery disease
U      need Cath to assess CAD; evaluate for PVD
N      Normal coronaries; clear coronary arteries; gallbladder was normal with no stones; he is a thin, health-appearing black man
U      We should also consider further gastroesophageal reflux studies as an outpatient; CAD assessment was not indicated

CHF = congestive heart failure; GERD = gastroesophageal reflux disease; CAD = coronary artery disease; PVD = peripheral vascular disease.

Report Pre-processing Module

Input discharge summaries were first split into sections using a set of flexible lexical matching rules that identified section titles and classified them into six predefined categories: “Diagnosis”, “Past or Present History of Illness”, “Social/Family History”, “Physical or Laboratory Examination”, “Medication/Disposition”, and “Other”. Section titles were recognized by matching the most frequent title keywords, collected semi-automatically from the training dataset. In addition, each section type was assigned a weight reflecting its predictive capacity for a given disease (see the Training Data Analyses section). The sections were decomposed into sentences using LingPipe.2 Part-of-speech (POS) tagging and shallow parsing were performed using the GeniaTagger, which is specifically tuned for biomedical text.3
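As an illustration of the section-splitting step, the sketch below matches candidate title lines against per-category keyword lists. The keyword lists, the title regular expression, and the function name are hypothetical stand-ins; the actual keyword lists were collected semi-automatically from the training data.

```python
import re

# Hypothetical title keywords per section category (illustrative only).
SECTION_KEYWORDS = {
    "Diagnosis": ("DIAGNOSIS",),
    "Past or Present History of Illness": ("HISTORY OF PRESENT ILLNESS",
                                           "PAST MEDICAL HISTORY"),
    "Social/Family History": ("SOCIAL HISTORY", "FAMILY HISTORY"),
    "Physical or Laboratory Examination": ("PHYSICAL EXAMINATION",
                                           "LABORATORY DATA"),
    "Medication/Disposition": ("MEDICATIONS", "DISPOSITION"),
}

# Candidate section titles: an upper-case phrase followed by a colon.
TITLE_RE = re.compile(r"^([A-Z][A-Z /]{2,40}):", re.MULTILINE)

def split_sections(report: str):
    """Split a discharge summary into (category, text) pairs."""
    sections, start, category = [], 0, "Other"
    for match in TITLE_RE.finditer(report):
        sections.append((category, report[start:match.start()].strip()))
        title = match.group(1)
        category = next((cat for cat, kws in SECTION_KEYWORDS.items()
                         if any(kw in title for kw in kws)), "Other")
        start = match.end()
    sections.append((category, report[start:].strip()))
    return sections
```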
Textual Prediction Module

The main objective of this module was to identify sentences that, given a disease, explicitly mentioned the disease itself and/or associated clinical terms. We lexically profiled each disease by collecting (1) its name and synonyms from public resources including the UMLS,4 (2) disease sub-classes (e.g., diabetes type II) and their synonyms, (3) disease super-classes (e.g., reflux for GERD and arthritis for OA) and their synonyms, and (4) clinical terms closely related to the disease (e.g., associated symptoms and treatments), imported from public medical resources or selected from the training dataset based on their occurrence statistics. All clinical terms collected were assigned confidence levels taking into account the quality of the prediction results obtained from the training dataset (available as an online data supplement at www.jamia.org). Initially, the sentences that contained any term from the lexical profile were labeled with Y, and, in the subsequent steps, the evidence was challenged and potentially reversed to N, Q, or U based on the context in which the terms were used. The sentence-based predictions were then combined at the document level. The four processing steps in this module are described briefly below (further details are given in the online supplement).

Step T1: Term matching. To cater for terminological variation, terms that characterize a disease were matched against the text approximately, taking into account morphological variants and, if necessary, ignoring word order and tolerating some distance between the words within a term (e.g., both “stent placement” and “placement of coronary stent” referred to the same treatment for CAD).

Step T2: Sentence filtering. Sentences that did not mention a disease-related term were filtered out. We also discarded sentences from the sections deemed less important for the textual task (namely “Social/Family History” and “Other”), sentences that potentially referred to family members, and sentences containing ambiguous disease terms.
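The following is a minimal sketch of the approximate matching in step T1, assuming simple whitespace tokenization and using prefix matching as a crude stand-in for morphological variants; the window size and the function name are illustrative only.

```python
from itertools import product

def approx_match(term: str, sentence: str, max_gap: int = 3) -> bool:
    """Rough sketch of step T1: all words of `term` must occur in
    `sentence` (by prefix, tolerating inflectional endings), in any
    order, within a short window of intervening words."""
    tokens = sentence.lower().split()
    candidates = []
    for word in term.lower().split():
        prefix = word[:max(4, len(word) - 2)]  # crude morphological tolerance
        hits = [i for i, tok in enumerate(tokens) if tok.startswith(prefix)]
        if not hits:
            return False
        candidates.append(hits)
    window = len(term.split()) + max_gap
    # accept if some combination of word positions fits in the window
    return any(max(c) - min(c) < window for c in product(*candidates))

# "stent placement" and "placement of a coronary stent" match the same term
assert approx_match("stent placement", "placement of a coronary stent")
```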
Step T3: Sentence labeling. After filtering, the remaining sentences were initially considered to support a judgment of disease presence (Y). We then applied a set of lexico-semantic patterns (see Table 1 for examples) to potentially re-label them with N, Q, or U judgments, using a pattern matching algorithm similar to NegEx.5 The patterns generalized the structure of manually collected examples that indicated a negative, questionable, or unmentioned disease status. If any of these patterns matched, the disease status was changed to the label associated with the pattern.

Step T4: Result integration. When a report contained multiple sentences with conflicting labels, we employed a weighted voting scheme. The score for each disease status label was obtained by collecting all sentences with the given label and adding up the weights associated with the containing sections. The highest-scored label was suggested as the final annotation, with ties labeled as Q.

We submitted the results of two runs for the textual task: in run 1, all clinical terms from the associated lexical profiles were used, whereas clinical terms with lower confidence were excluded in run 2.
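A schematic rendering of steps T3 and T4 is shown below. The patterns are toy stand-ins for the lexico-semantic pattern set (the real set, given in the online supplement, generalized examples such as those in Table 1), and the section weights are assumed to come from the training data analysis.

```python
import re
from collections import defaultdict

# Toy stand-ins for the lexico-semantic patterns of step T3; {t} is
# replaced by the (escaped) disease term at match time.
PATTERN_TEMPLATES = [
    (r"\b(?:no history of|negative for|denied)\b[^.;]*{t}", "N"),
    (r"(?:\?|question of|possibility of)\s*{t}", "Q"),
    (r"\b(?:evaluate for|assess for)\b[^.;]*{t}", "U"),
]

def label_sentence(sentence: str, term: str) -> str:
    """Step T3: a sentence mentioning the term starts as Y and is
    re-labelled if a negative/questionable/unmentioned pattern fires."""
    for template, label in PATTERN_TEMPLATES:
        if re.search(template.format(t=re.escape(term)), sentence, re.I):
            return label
    return "Y"

def integrate(evidence, section_weights):
    """Step T4: weighted voting over (label, section) pairs; the
    highest-scored label wins and ties fall back to Q."""
    scores = defaultdict(float)
    for label, section in evidence:
        scores[label] += section_weights.get(section, 1.0)
    top = max(scores.values())
    winners = [label for label, score in scores.items() if score == top]
    return winners[0] if len(winners) == 1 else "Q"

print(label_sentence("Negative for CHF on admission.", "CHF"))  # -> N
```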
Intuitive Prediction Module

The intuitive task focused on the prediction of the disease status (Y, N, or Q) based on both explicit and implicit textual assertions. We relied on a combination of term matching and clinical inference rule matching to extract disease information at the sentence level, and on a supervised learning method for disease status classification at the document level. The module consisted of five steps, described briefly here (further details are given in the online supplement).

Step I1: Candidate sentence identification. In the first step, the system identified potential evidence sentences (labeled Y initially) by looking for any of the following three evidence types within the sentences:

a. Terms referring to the disease symptoms (e.g., RCA occlusion for CAD). The first two intuitive runs (1 and 2) differed in the predictive capacity of the symptoms used (all terms vs. the most important ones for runs 1 and 2, respectively; see the Training Data Analyses section).
b. Important clinical facts or conditions related to the disease (e.g., weight > 200 lbs; systolic blood pressure > 135). Around 20 manually designed inference rules were used (two are sketched after this list).
c. Medications typically used to treat the disease and/or its symptoms (appearing within the “Medication/Disposition” sections).

Step I2: Sentence labeling. This step was analogous to textual prediction (step T3).
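Two of the inference rules from step I1b, paraphrased as code. The regular expressions, rule names, and exact trigger conditions are assumptions for illustration; only the thresholds (200 lbs, systolic 135) come from the text.

```python
import re

def rule_obesity_weight(sentence: str) -> bool:
    """Fires on an explicit body weight above 200 lbs (illustrative rule)."""
    m = re.search(r"weight\D{0,15}(\d{3})\s*(?:lbs?|pounds)", sentence, re.I)
    return bool(m) and int(m.group(1)) > 200

def rule_hypertension_bp(sentence: str) -> bool:
    """Fires on a recorded systolic blood pressure above 135 (illustrative rule)."""
    m = re.search(r"\b(?:BP|blood pressure)\D{0,15}(\d{2,3})\s*/\s*\d{2,3}",
                  sentence, re.I)
    return bool(m) and int(m.group(1)) > 135

assert rule_obesity_weight("Weight on discharge: 235 lbs.")
assert rule_hypertension_bp("Blood pressure was 152/88 on admission.")
```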
Table 2. The Distribution of Annotations in the i2b2 Obesity Challenge Datasets

             Training Set              Test Set
Label    Textual    Intuitive     Textual    Intuitive
Y          3208        3267         2192        2285
N            87        7362           65        5100
Q            39          26           17          14
U          8296           —         5770           —
Table 4. The Summary of the Evaluation of the Two Textual Runs

                  Run 1     Run 2
Micro P, R, F    0.9719    0.9723
Macro P          0.7578    0.8482
Macro R          0.7816    0.7737
Macro F          0.7693    0.8052

Step I3: Sentence-level result integration. As with the textual predictions (step T4), the integration of sentence-level predictions was performed when sentences carried different labels for the same disease. Three factors were considered: (a) the confidence level of the disease symptom terms found in the sentences; (b) the weight of the section where the evidence appeared; and (c) the significance of the three types of sentence evidence (step I1) for the given disease.

Step I4: Document-level labeling. This was an optional step, with only one run submitted (run 3, see below). We applied a support vector machine (SVM) classifier to assign disease labels at the document level. Phrases recognized in the pre-processing stage by the GeniaTagger were mapped to UMLS concepts using approximate string matching. Concepts mentioned in a negative context were identified using a negation module similar to NegEx. The weight assigned to a feature was calculated as the difference between the number of positive mentions of the corresponding concept and the number of negative mentions. Finding questionable evidence at the document level was considered infeasible (there were too few examples for machine learning), so we trained a binary SVM classifier that differentiated between potential Y and N labels only.

Step I5: Final result integration. Textual Y and N predictions were given high confidence and were recycled as intuitive predictions (see the Training Data Analyses section). Only Q and U textual judgments were adjusted in cases where intuitive evidence suggested different labels. More precisely, when new implicit evidence was established for a previously assigned textual Q or U judgment, it was changed to an intuitive Y, N, or Q label based on the procedure described in steps I1–I3. If no new sentence-level implicit evidence was established for a Q or U textual judgment, the SVM-based document classification was taken into account: if the classifier produced a highly confident Y label, the final intuitive label for the disease was amended to Y; otherwise, a textual Q judgment was kept unchanged, whereas a textual U judgment would change to N in the final intuitive annotation. This approach was used to produce intuitive run 3.
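The sketch below illustrates the feature weighting of step I4 and a binary Y/N classifier trained on those features. The paper does not name an SVM implementation; scikit-learn's LinearSVC is used here purely as a stand-in, and the example documents are hypothetical.

```python
from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def concept_features(mentions):
    """Step I4 feature weighting: the weight of a UMLS concept is its
    number of positive mentions minus its number of negated mentions.
    `mentions` is a list of (cui, is_negated) pairs for one document."""
    weights = Counter()
    for cui, negated in mentions:
        weights[cui] += -1 if negated else 1
    return dict(weights)

# Hypothetical two-document training set for one disease (labels Y/N);
# C0011849 is the UMLS concept for diabetes mellitus.
docs = [concept_features([("C0011849", False), ("C0011849", False)]),
        concept_features([("C0011849", True)])]
vectorizer = DictVectorizer()
classifier = LinearSVC().fit(vectorizer.fit_transform(docs), ["Y", "N"])
```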
Experiments and Results

Experimental Settings
The training and testing data for the challenge were collected from the Research Patient Data Repository of Partners HealthCare (see Table 2 for the distribution of the annotations, provided manually by two experts).
Training Data Analyses
We compared the textual and intuitive annotations assigned to each document-disease pair (see Table 3). Intuitive annotations largely agreed with the textual ones for the Y and N labels, and differed primarily on the textual Q and U labels. This observation motivated our integration strategy: the intuitive results “inherited” all textual Y and N predictions, and only Q and U textual labels were considered eligible for re-annotation in the intuitive part. The training data were further analyzed to estimate the relevance of certain features and their predictive capacity. We first analyzed the relevance of the six section types. A relative relevance weight was assigned to each section type as the ratio between the number of sentences in that section type whose labels were consistent with the expert-generated judgments (at the document level) and the total number of evidence sentences that supported the correct annotations. This gave us the relative predictive capacity of the section types for inferring the document label. Similar distributional analyses were performed for other features (see the online supplement for further details).
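A minimal sketch of this weight estimation, assuming the evidence sentences from the training data are available as (section type, sentence label, document label) triples; the data layout and function name are assumptions.

```python
from collections import Counter

def section_weights(evidence_sentences):
    """Estimate the relative predictive capacity of each section type.
    `evidence_sentences` is a list of (section_type, sentence_label,
    document_label) triples, one per evidence sentence."""
    consistent = Counter()
    for section, sentence_label, document_label in evidence_sentences:
        if sentence_label == document_label:  # sentence supports the expert label
            consistent[section] += 1
    total = sum(consistent.values())  # all sentences supporting correct annotations
    return ({section: count / total for section, count in consistent.items()}
            if total else {})
```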
Testing Environment and Results
Each of the 28 teams taking part in the challenge was allowed to submit the results of up to three system runs. System performance was measured using three standard measures: recall (R), precision (P), and F-measure. The results were micro- and macro-averaged across the status labels for each of the diseases considered, and the overall performance was measured in the same way for all diseases taken together. The participating teams were primarily ranked by the macro-averaged F-measure. Hereafter, we report a single averaged score for the micro values, as the values for micro P, micro R, and micro F were identical.
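For concreteness, the two averaging schemes can be sketched from per-label (tp, fp, fn) counts as below. The mean-of-per-label-F convention for the macro F is one common choice; the challenge's exact definition is assumed rather than quoted here.

```python
def _f(p: float, r: float) -> float:
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_prf(counts):
    """Macro averaging: compute P, R, F per status label, then average
    over labels, so rare labels such as Q weigh as much as frequent ones."""
    ps, rs = [], []
    for tp, fp, fn in counts.values():
        ps.append(tp / (tp + fp) if tp + fp else 0.0)
        rs.append(tp / (tp + fn) if tp + fn else 0.0)
    n = len(counts)
    fs = [_f(p, r) for p, r in zip(ps, rs)]
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

def micro_prf(counts):
    """Micro averaging pools the counts first. With exactly one label per
    report-disease pair, total fp equals total fn, so micro P = R = F,
    which is why a single micro value is reported."""
    tp = sum(t for t, _, _ in counts.values())
    fp = sum(f for _, f, _ in counts.values())
    fn = sum(f for _, _, f in counts.values())
    p, r = tp / (tp + fp), tp / (tp + fn)
    return p, r, _f(p, r)
```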
Table 3. Comparison of Textual and Intuitive Annotations in the Training Data on the Same Report-Disease Cases

                           Intuitive Annotations
Textual Annotations       Y        N        Q
Y                      2223        1        0
N                         0       56        0
Q                         3        1       17
U                       274     7286        9

Table 5. The Summary of the Evaluation of the Three Intuitive Runs

                  Run 1     Run 2     Run 3
Micro P, R, F    0.9572    0.9568    0.9559
Macro P          0.6383    0.6386    0.6369
Macro R          0.6294    0.6286    0.6289
Macro F          0.6336    0.6333    0.6326
Table 6. Disease-Based Performance of the Best Textual (Run 2) and Intuitive (Run 1) Runs

                        Textual Predictions (Run 2)              Intuitive Predictions (Run 1)
Disease                 Micro P,R,F  Macro P  Macro R  Macro F   Micro P,R,F  Macro P  Macro R  Macro F
OSA                       0.9940     0.9897   0.6618   0.6591      0.9939     0.9934   0.6616   0.6608
GERD                      0.9802     0.9743   0.7357   0.7298      0.9131     0.9390   0.5567   0.5760
Asthma                    0.9841     0.6461   0.7434   0.6861      0.9830     0.9656   0.9656   0.9656
Gallstones                0.9724     0.9676   0.8017   0.8569      0.9735     0.9535   0.9490   0.9512
Hypertriglyceridemia      0.9961     0.9490   0.9490   0.9490      0.9753     0.9873   0.7600   0.8357
Depression                0.9704     0.9193   0.9711   0.9429      0.8910     0.8757   0.7900   0.8220
Diabetes                  0.9821     0.9682   0.8477   0.8979      0.9791     0.9737   0.9773   0.9755
CHF                       0.9375     0.7774   0.8149   0.7935      0.9361     0.9588   0.6288   0.6268
OA                        0.9741     0.9369   0.9798   0.9566      0.9632     0.9519   0.6493   0.6327
Gout                      0.9881     0.9542   0.9849   0.9689      0.9740     0.9367   0.9428   0.9397
Hypercholesterolemia      0.9721     0.9248   0.8019   0.8526      0.8979     0.8997   0.9056   0.8977
Obesity                   0.9655     0.9835   0.4881   0.4858      0.9709     0.9737   0.9674   0.9701
PVD                       0.9822     0.9536   0.9765   0.9646      0.9742     0.9681   0.6317   0.6332
Venous insufficiency      0.9842     0.7778   0.9920   0.8531      0.9649     0.9818   0.7414   0.8163
Hypertension              0.9501     0.9082   0.8311   0.8655      0.9484     0.9271   0.9207   0.9239
CAD                       0.9215     0.8386   0.8964   0.8581      0.9629     0.6416   0.6436   0.6425
Mean average              0.9723     0.8482   0.7737   0.8052      0.9572     0.6383   0.6294   0.6336
SD                        0.0203     0.0960   0.1400   0.1338      0.0309     0.0838   0.1512   0.1497

OSA = obstructive sleep apnea; GERD = gastroesophageal reflux disease; CHF = congestive heart failure; OA = osteoarthritis; PVD = peripheral vascular disease; CAD = coronary artery disease.
The results of two textual runs were submitted (see Table 4). Run 2 improved on run 1, but only by a small margin. The macro-averaged F-measure for run 2 was the highest achieved in the challenge and was substantially better than the mean average of all participating teams (81% versus 56%). Similarly, the micro-averaged F-measure was high (97%) compared with the mean average for all participating teams (91%). A detailed analysis of the results is available in the online supplement.
The results of three runs were submitted for the intuitive task (see Table 5): run 1 was the best, with a macro-averaged F-measure of 63% (ranked 7th) and a micro-averaged F-measure of 96% (ranked 5th overall). A detailed analysis of the results is available in the online supplement.
Table 6 shows the detailed evaluation of the results for the individual diseases. In the textual task, the micro-averaged F-measure ranged from 92% (CAD) to 100% (hypertriglyceridemia), whereas for the intuitive task it ranged from 89% (depression) to 99% (OSA). The micro-averaged values were more consistent across diseases, whereas there were substantial differences in the macro-averaged metrics. A detailed analysis and full discussion of the results are available in the online supplement.
The system's performance may be improved in several ways. More work is required to expand the set of clinical inference rules and to match them reliably in textual narratives. Dynamic expansion of abbreviations, correctly mapping ambiguous abbreviations to the corresponding medical terms, should improve the identification of key clinical findings. Finally, estimating the discriminative power of the medications used to treat specific diseases should improve intuitive predictions.

Overall, the performance of our system and of most of the other systems developed for the i2b2 obesity challenge was comparable to that of a human expert, indicating that text mining techniques have substantial potential to extract disease status from hospital discharge summaries accurately and efficiently. However, more research is required to investigate whether the methodologies used can be easily ported between different areas of medical practice. The infrastructure developed for our system is general enough to be re-used across the clinical domain, but a few details require knowledge elicitation from domain experts or medical resources and manual changes to the system (e.g., the clinical inference rules). Still, a major bottleneck faced by medical text mining systems in general is the provision of training data, which need to be analyzed manually and statistically to identify the clues to be exploited in both rule-based and machine-learning approaches.
Conclusions

The system implementing the described methodology achieved excellent results, with an average micro accuracy of 97% for the textual task and 96% for the intuitive task. The macro-averaged F-measure of 81% for the textual task was the highest achieved in the challenge, and the macro-averaged F-measure of 63% for the intuitive task (the highest was 66%) was ranked 7th out of 28 teams. The macro-averaged measures showed that the prediction of questionable labels was the most challenging, in particular in the intuitive task.
References

1. i2b2 Obesity Challenge. Available at: http://www.i2b2.org/. Accessed Nov 23, 2008.
2. Carpenter B. Phrasal queries with LingPipe and Lucene: ad hoc genomics text retrieval. In: Proceedings of the 13th Annual Text Retrieval Conference; 2004.
3. Tsuruoka Y, Tateishi Y, Kim J, et al. Developing a robust part-of-speech tagger for biomedical text. Adv Inform 2005:382–92.
4. UMLS Knowledge Base. Available at: http://www.nlm.nih.gov/research/umls. Accessed Nov 23, 2008.
5. Chapman W, Bridewell W, Hanbury P, Cooper G, Buchanan B. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform 2001;34:301–10.