
Journal Pre-proof

Ensembles of Natural Language Processing Systems for Portable Phenotyping Solutions

Cong Liu, Casey N. Ta, James R. Rogers, Ziran Li, Junghwan Lee, Alex M Butler, Ning Shang, Fabricio Sampaio Peres Kury, Liwei Wang, Feichen Shen, Hongfang Liu, Lyudmila Ena, Carol Friedman, Chunhua Weng

PII: S1532-0464(19)30237-0
DOI: https://doi.org/10.1016/j.jbi.2019.103318
Reference: YJBIN 103318

To appear in: Journal of Biomedical Informatics

Received Date: 10 May 2019
Revised Date: 15 September 2019
Accepted Date: 21 October 2019

Please cite this article as: Liu, C., Ta, C.N., Rogers, J.R., Li, Z., Lee, J., Butler, A.M., Shang, N., Sampaio Peres Kury, F., Wang, L., Shen, F., Liu, H., Ena, L., Friedman, C., Weng, C., Ensembles of Natural Language Processing Systems for Portable Phenotyping Solutions, Journal of Biomedical Informatics (2019), doi: https://doi.org/10.1016/j.jbi.2019.103318

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2019 Published by Elsevier Inc.

Ensembles of Natural Language Processing Systems for Portable Phenotyping Solutions

Cong Liu1, Casey N. Ta1, James R. Rogers1, Ziran Li1, Junghwan Lee1, Alex M Butler1, Ning Shang1, Fabricio Sampaio Peres Kury1, Liwei Wang2, Feichen Shen2, Hongfang Liu2, Lyudmila Ena1, Carol Friedman1, Chunhua Weng1,*

1: Department of Biomedical Informatics, Columbia University, New York, NY 10032
2: Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN 55901, USA
*: Corresponding author


Abstract

Background: Manually curating standardized phenotypic concepts, such as Human Phenotype Ontology (HPO) terms, from narrative text in electronic health records (EHRs) is time-consuming and error-prone. Natural language processing (NLP) techniques can facilitate automated phenotype extraction and thus improve the efficiency of curating clinical phenotypes from clinical texts. While individual NLP systems can perform well for a single cohort, an ensemble-based method might shed light on increasing the portability of NLP pipelines across different cohorts.

Methods: We compared four NLP systems, MetaMapLite, MedLEE, ClinPhen, and cTAKES, and four ensemble techniques, intersection, union, majority voting, and machine learning, for extracting generic phenotypic concepts. We addressed two research questions regarding automated phenotype recognition: first, the performance of different approaches in identifying generic phenotypic concepts; second, the performance of different methods in identifying patient-specific phenotypic concepts. To better quantify the effects of concept granularity differences on performance, we developed a novel evaluation metric that considers concept hierarchies and frequencies. Each approach was evaluated on a gold standard set of clinical documents annotated by clinical experts. One dataset containing 1,609 concepts derived from 50 clinical notes from two different institutions was used in both evaluations, and an additional dataset of 608 concepts derived from 50 case report abstracts obtained from PubMed was used for the evaluation of generic phenotypic concept identification only.

Results: For generic phenotypic concept recognition, the top three performers in the NYP/CUIMC dataset were the union ensemble (F1, 0.643), the training-based ensemble (F1, 0.632), and the majority vote-based ensemble (F1, 0.622). In the Mayo dataset, the top three were the majority vote-based ensemble (F1, 0.642), cTAKES (F1, 0.615), and MedLEE (F1, 0.559). In the PubMed dataset, the top three were the majority vote-based ensemble (F1, 0.719), the training-based ensemble (F1, 0.696), and MetaMapLite (F1, 0.694). For identifying patient-specific phenotypes, the top three performers in the NYP/CUIMC dataset were the majority vote-based ensemble (F1, 0.610), MedLEE (F1, 0.609), and the training-based ensemble (F1, 0.585). In the Mayo dataset, the top three were the majority vote-based ensemble (F1, 0.604), cTAKES (F1, 0.531), and MedLEE (F1, 0.527).

Conclusions: Our study demonstrates that ensembles of natural language processing systems can improve both generic phenotypic concept recognition and patient-specific phenotypic concept identification over individual systems. Each individual NLP system performed best when applied to the dataset it was primarily designed for; however, combining multiple NLP systems into an ensemble generally improves performance. Specifically, an ensemble can increase the reproducibility of results across different cohorts and tasks, and thus provides a more portable phenotyping solution than individual NLP systems.

Keywords: Natural language processing, Human Phenotype Ontology, Concept recognition, Ensemble method, Evaluation, Reproducibility


Introduction

Undiagnosed genetic diseases can cause medical, psychosocial, and economic burdens for both patients and their families [1]. Next generation sequencing (NGS) methods, such as Whole Genome Sequencing (WGS) and Whole Exome Sequencing (WES), can yield a massive number of genetic variants in a single experiment, which makes NGS a promising tool for resolving undiagnosed genetic diseases. However, recent studies estimated the overall diagnostic yield of WES for previously undiagnosed disease to be only 20-40% [2, 3]. Interpretation and prioritization of these variants in a scalable fashion remains a bottleneck for utilizing NGS methods in clinical practice [4-6].

To address this bottleneck, many gene-prioritization tools have been developed to aid the interpretation of next generation sequencing results, including Phenolyzer [7], Phenomizer [8], PheVor [9], and Exomiser [10]. Among the numerous methods proposed, the most common and successful is to integrate phenotype information to empower the discovery of causal genes and variants. Most of these tools require as input a list of standardized diseases, such as those in the Online Mendelian Inheritance in Man (OMIM) [11], or, more often, standardized phenotypic terms such as those in the Human Phenotype Ontology (HPO) [12]. As a standardized vocabulary of the phenotypic abnormalities encountered in human diseases, the HPO contains over 13,000 terms and over 156,000 annotations to hereditary diseases as of April 2019. Conventionally, clinical researchers curate standardized HPO terms by manually reviewing electronic health record (EHR) narratives, a time-consuming and error-prone activity that often requires extensive knowledge of the HPO vocabulary. With the recent expansion of the HPO knowledge base and resources [12], the demand for scalable extraction of standardized phenotypic concepts continues to rise.

Recently, Son et al. proposed a framework called "EHR-Phenolyzer" [13] that leverages natural language processing (NLP) techniques to automate phenotype extraction from EHR narratives for whole exome sequencing-based gene prioritization. In their study, two clinical concept extraction systems, MedLEE and MetaMap, were customized to extract standardized HPO terms. An alternative tool, ClinPhen [14], was later designed specifically for patient-related HPO term extraction. In addition, Liu et al. developed an interactive web application, Doc2Hpo, that enables interactive and efficient phenotype concept curation from clinical text [15]. To identify patient-specific phenotypes (i.e., phenotypes that are not negated, not related to family members, not appearing in an educational context, etc.), additional rules are often handcrafted to determine the contextual properties (e.g., negation) of the recognized phenotypes.

Phenotypic concept extraction is a subfield of clinical concept extraction. Two main types of approaches can extract clinical concepts: rule-based approaches and supervised machine learning-based approaches. Rule-based systems rely on dictionaries and hand-coded rules to simultaneously extract and normalize phrases; they can handle rare concepts and allow for error correction on a case-by-case basis. Meanwhile, with more and more de-identified clinical text available to the research community, machine learning-based approaches, especially deep learning approaches, have achieved great success in clinical concept extraction. For example, Wei et al. implemented a Bi-LSTM-CRF model to extract medication entities and predict their associated adverse drug events [16]. Gehrmann et al. first used cTAKES to extract CUIs from clinical notes and then used the extracted CUIs as the input to a CNN model to classify the notes into ten phenotypes [17]. However, due to the scarcity of publicly available HPO-based annotation data, it is difficult to collect a dataset large enough to successfully train a deep learning model. In addition, most of the popular concept extraction systems are developed within a single institution and designed for specific use cases, which limits their portability to other sites when the writing style and sublanguage differ between the development site and the implementation site [18].

The ensemble method can combine the strengths of individual tools and improve the portability of NLP systems. An ensemble is a meta-algorithm that combines several base models into one predictive model, a combination that has shown superior performance in many machine learning tasks [19-21]. It has been widely used in multiple clinical and biomedical problems, including protein-protein interaction [22], gene expression-based disease diagnosis [23], causal molecular network inference [24], and biomarker identification [25]. Ensembles of NLP tools for concept recognition have been explored in many studies. For example, Torii et al. combined recognition results from individual systems with a voting schema to create BioTagger-GM, which achieved the best performance in the BioCreAtIvE II challenge for identifying gene/protein names from the literature [26, 27]. Doan et al. demonstrated that ensemble classifiers that combine individual classifiers into a voting system could achieve better performance than a single classifier in recognizing medication information from clinical text, using the 2009 i2b2 (Informatics for Integrating Biology and the Bedside) challenge datasets [28]. Kang et al. combined two dictionary-based systems and five statistical systems with a simple voting scheme and achieved a third-place finish in extracting medical problems, tests, and medications in the 2010 i2b2/VA challenge [29]. Kuo et al. built an ensemble pipeline combining cTAKES and MetaMap and demonstrated that it can improve the performance of NLP tools in extracting clinical data elements, though with high variability depending on the cohort.

However, ensemble approaches have not yet been applied to extract standardized HPO concepts, and only a few studies have quantified the performance of individual NLP systems on HPO-based phenotypic concept extraction [30]. Although the methods applicable to extracting HPO-based phenotypic concepts overlap with those for clinical concept extraction, the individual systems involved and the use cases differ. Thus, it is important to evaluate the performance of ensemble approaches for phenotypic concept extraction.

A major challenge for a quantitative evaluation is the lack of a proper evaluation metric that can handle the granularity differences caused by the hierarchical structure of the HPO. For example, if an NLP system identified "Stroke (HP:0001297)" while the concept identified by the gold standard is "Transient ischemic attack (HP:0002326)", a true-positive count is unsuitable because "Stroke" is not as specific as "Transient ischemic attack", whereas a false-positive count is also unsuitable because "Transient ischemic attack" is a subtype of "Stroke". A few studies have considered using "partial positives", where inexact overlaps contribute a half count towards both the precision and recall calculations [31]. However, this count is arbitrary and cannot distinguish low-frequency phenotypes from high-frequency phenotypes, of which the former are more important for genetics-based diagnosis [14].

Therefore, to address the incorrect scoring of phenotype extractions caused by granularity differences along hierarchical lineages of the ontology, we propose a novel evaluation metric that considers both the hierarchical structure of the ontology and phenotype frequency. We took a set of 50 clinical notes sourced from two institutions and manually identified and annotated concepts and their contextual properties (whether the terms were negated, and whether they applied to the patient or to family history). In addition, we included a set of 50 case report abstracts obtained from PubMed for an additional evaluation. We then compared the performance of four individual NLP systems and four ensemble techniques. The individual NLP systems were MedLEE [32], MetaMapLite [33], ClinPhen [14], and cTAKES [34]; the ensemble methods were intersection, union, majority voting, and machine learning. Our evaluation demonstrated that the ensemble-based methods consistently outperformed the individual systems. Compared with the individual systems, the ensembles increase the reproducibility of results across different datasets and tasks and provide a more portable phenotyping solution.

Methods

Dataset and Gold Standard Establishment

We used two batches of de-identified clinical notes: 40 pediatric clinical notes from NewYork-Presbyterian Hospital/Columbia University Irving Medical Center (NYP/CUIMC) and 10 pediatric notes from Mayo Clinic (Mayo). The Mayo Clinic notes had been de-identified and extracted for a previous study [13]; only specific paragraphs were extracted. The study was approved by the NYP/CUIMC and Mayo Clinic Institutional Review Boards. Six annotators were split across the NYP/CUIMC notes, and each note was annotated by three individuals. Only phenotypes that could be mapped to standard HPO concepts descending from "Phenotypic abnormality (HP:0000118)" were extracted. In addition, annotators were asked to indicate whether each phenotype was patient-specific by excluding negated terms, family-member-related terms, and education-related terms. Notably, concepts under "Abnormality of prenatal development or birth (HP:0001197)" were considered related to the patient even though some were experienced by the pregnant mother. For example, from the phrase "multiple miscarriages were experienced by mom" in a prenatal or newborn consulting note, we extracted "Spontaneous abortion (HP:0005268)" and assigned it as a patient-specific phenotype. The final extraction was summarized at the document level, and only unique records in each document were retained. More details on annotation can be found in the Supplementary Methods.

The gold standard set was then determined by the consensus of the three annotators (i.e., annotated by at least two of them). For records extracted by only one annotator, a fourth individual determined whether the record should be added to the gold standard set. For generic concept recognition, the gold standard set consisted of all unique concepts identified within each document regardless of their contextual properties. For patient-specific phenotype identification, the gold standard set contained only the unique concepts deemed patient-specific. The annotation of the Mayo Clinic notes followed a similar process, except that they were annotated by only two individuals due to their simplicity.

In addition to clinical notes, we curated another dataset (PubMed) containing 50 case report abstracts related to "Turner Syndrome" obtained from PubMed. We selected Turner Syndrome because it is a well-studied genetic disease whose phenotypes are well covered and familiar to our medical annotators. The 50 most recent PubMed case report abstracts related to "Turner Syndrome" were downloaded. An annotation procedure similar to that of the Mayo dataset was followed, with only two individuals annotating the dataset. However, unlike clinical notes, most concepts in the biomedical literature are not patient-specific. As a result, we did not annotate patient-specific phenotypic concepts, and this dataset was used only to evaluate performance in generic phenotypic concept recognition.

Individual NLP Systems

We evaluated four individual NLP systems: MetaMapLite, MedLEE, ClinPhen, and cTAKES. MetaMap is a tool that automatically identifies clinical terms in biomedical literature [35]. MetaMapLite is a less rigorous but faster version of MetaMap that aims to provide near real-time named-entity recognition [33]. In this study, we used MetaMapLite version 3.6.2rc2 (2018) and its supported version (2018AA) of the Unified Medical Language System (UMLS). We made minor changes to its Java source code to add the detection of contextual properties (negation and subject information) via a wrapper of the ConText program developed by Harkema et al. [36]. Concepts were determined to be patient-specific when the contextual properties did not include negation, family medical history, or education. To manage the number of terms MetaMapLite extracts, we limited the extraction results to the following phenotype-related UMLS semantic types: Anatomical Abnormality (anab), Finding (fndg), Congenital Abnormality (cgab), Disease or Syndrome (dsyn), Mental or Behavioral Dysfunction (mobd), Sign or Symptom (sosy), Laboratory or Test Result (lbtr), Pathologic Function (patf), and Cell or Molecular Dysfunction (comd). The UMLS concepts were then mapped to HPO concepts following the mapping at http://purl.obolibrary.org/obo/hp.obo.
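
As an illustration, the sketch below shows one way such a UMLS-to-HPO mapping can be derived from the cross-reference annotations in hp.obo. It is a minimal sketch, not the exact mapping procedure used in this study; the file path and the assumption that every relevant cross-reference appears as an "xref: UMLS:..." line inside a [Term] stanza are ours.

from collections import defaultdict

def build_umls_to_hpo(obo_path="hp.obo"):
    """Map UMLS CUIs to HPO IDs using the 'xref: UMLS:...' lines in hp.obo.

    Minimal sketch: assumes the standard OBO stanza layout, in which each
    [Term] block carries one 'id:' line followed by zero or more 'xref:' lines.
    """
    umls_to_hpo = defaultdict(set)
    current_id = None
    with open(obo_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line == "[Term]":
                current_id = None                      # start of a new term stanza
            elif line.startswith("id: HP:"):
                current_id = line[len("id: "):]
            elif line.startswith("xref: UMLS:") and current_id:
                cui = line[len("xref: UMLS:"):].split()[0]
                umls_to_hpo[cui].add(current_id)
    return umls_to_hpo

# Hypothetical usage with the "Hand muscle weakness" example discussed in the
# Error Analysis section:
# mapping = build_umls_to_hpo("hp.obo")
# mapping.get("C0239831")  # expected to contain "HP:0030237"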

MedLEE is a Medical Language Extraction and Encoding system developed primarily by Friedman et al. at Columbia University [37-39]. The NLP engine's lexicon was loaded with the HPO terms and synonyms available via the UMLS and SNOMED-CT. The raw text files were tokenized and split into sentences. After processing, XML files were generated with tagged tokens containing clinical note section information, sentence and token information, and the HPO concept(s) identified. HPO concepts were determined to be patient-specific when the extraction record contained neither a 'negation' nor a 'family' annotation.

ClinPhen is an NLP tool developed for patient-specific HPO term extraction [14]. We used its default settings, with the "rare phenotypes only" mode turned off, to extract phenotypes from clinical notes. Notably, since the results provided by ClinPhen contain only phenotypes positively observed in patients, all extracted phenotypes were patient-specific.

cTAKES is a comprehensive, modular, extensible NLP system designed primarily at Mayo Clinic for information extraction from clinical narratives [34]. cTAKES version 4.0.0 was used in this study. We first used its dictionary creator function to create an HPO-specific dictionary and then ran the clinical pipeline with the "fast dictionary lookup" option to extract the corresponding HPO terms from clinical notes. Phenotypic concepts were extracted as patient-specific if the value of the 'subject' attribute was 'patient' and the tag was not negated.

Since one of the primary goals of using the NLP tools is to derive a document-level summary that can be used for phenotyping, all results from the individual NLP systems were summarized at the document level as described in the gold standard section, and the document-level extraction results were then used as the input for the ensemble methods.
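
A minimal sketch of this document-level summarization step is shown below. The field names ("doc_id", "hpo_id", "negated", "subject") are hypothetical stand-ins for each system's raw output, which differs across the four tools.

def document_level_summary(mentions, patient_specific_only=False):
    """Collapse mention-level extractions into per-document sets of unique HPO IDs.

    mentions: iterable of dicts such as
        {"doc_id": "note01", "hpo_id": "HP:0001250",
         "negated": False, "subject": "patient"}
    (hypothetical field names; each system's raw output differs).
    """
    docs = {}
    for m in mentions:
        # For the patient-specific task, drop negated and non-patient mentions.
        if patient_specific_only and (m["negated"] or m["subject"] != "patient"):
            continue
        docs.setdefault(m["doc_id"], set()).add(m["hpo_id"])
    return docs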

Ensemble Methods

Figure 1 provides an overview of how an ensemble-based phenotype extraction pipeline works. We used four ensemble methods: (1) union, which extracts a concept if any of the NLP systems recognized it; (2) intersection, which extracts a concept only if all four NLP systems recognized it; (3) majority voting, which extracts a concept if at least two NLP systems recognized it; and (4) a training-based approach, in which we trained a model to learn the weight of each individual system.
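
The first three rules are simple set operations over the document-level concept sets described above; a minimal sketch follows (the four example outputs are hypothetical, not drawn from the study data):

from typing import List, Set

def ensemble(concept_sets: List[Set[str]], method: str = "majority",
             min_votes: int = 2) -> Set[str]:
    """Combine the document-level HPO concept sets produced by several NLP systems."""
    all_concepts = set().union(*concept_sets)
    if method == "union":
        return all_concepts
    if method == "intersection":
        return set.intersection(*concept_sets)
    if method == "majority":  # keep a concept if at least min_votes systems agree
        return {c for c in all_concepts
                if sum(c in s for s in concept_sets) >= min_votes}
    raise ValueError(f"unknown method: {method}")

# Hypothetical per-document outputs from the four systems:
outputs = [{"HP:0001250", "HP:0001643"},   # e.g., MedLEE
           {"HP:0001250"},                 # e.g., ClinPhen
           {"HP:0001250", "HP:0002359"},   # e.g., cTAKES
           {"HP:0001643", "HP:0002359"}]   # e.g., MetaMapLite
ensemble(outputs, "majority")  # all three concepts: each has at least 2 votes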

For the training-based approach, each instance contained four binary features representing whether the concept was extracted by MetaMapLite, MedLEE, ClinPhen, and cTAKES, respectively. The binary outcome of each instance was whether the phenotypic concept was identified in the gold standard set. A non-negative constraint on the weight of each feature was applied to prevent a negative contribution from any individual NLP system: for example, if ClinPhen finds a particular annotation (i.e., its feature is 1), it should have a non-negative contribution to the predicted positive response (i.e., an outcome of 1). In addition, given our limited training set size, it was more reasonable to train a simple classifier; we therefore trained a logistic regression model with a non-negative constraint. For prediction, a phenotype from the remaining test set extracted by at least one NLP system was fed to the classification model, and the prediction outcome determined whether the phenotype should be extracted by the ensemble pipeline. Models were trained separately for the different datasets. Notably, the training models differed between the two tasks, since the inputs of the model differed (i.e., generic concepts vs. patient-specific concepts).
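
Below is a minimal sketch of such a constrained logistic regression, fit by bounded optimization. It is our illustration of the idea rather than the exact implementation used in the study; whether the intercept was constrained, and which solver was used, are not specified in the text.

import numpy as np
from scipy.optimize import minimize

def fit_nonneg_logreg(X, y):
    """Logistic regression whose feature weights are constrained to be >= 0.

    X: (n_instances, 4) binary matrix, one column per NLP system;
    y: 1 if the concept appears in the gold standard, else 0.
    The intercept is left unconstrained (an assumption of this sketch).
    """
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])         # prepend an intercept column

    def neg_log_likelihood(w):
        z = Xb @ w
        # Numerically stable form of sum(log(1 + exp(z)) - y * z).
        return np.sum(np.logaddexp(0.0, z) - y * z)

    bounds = [(None, None)] + [(0.0, None)] * d  # intercept free, weights >= 0
    res = minimize(neg_log_likelihood, x0=np.zeros(d + 1),
                   method="L-BFGS-B", bounds=bounds)
    return res.x

def predict(w, X, threshold=0.5):
    Xb = np.hstack([np.ones((len(X), 1)), X])
    prob = 1.0 / (1.0 + np.exp(-(Xb @ w)))
    return prob >= threshold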

Concepts not extracted by any of the individual NLP systems were filtered out and assigned a negative outcome for all ensemble approaches, because the ensemble approaches cannot make positive predictions on them. The remaining data were split into training (20%) and testing (80%) sets. The performance of all approaches, including all combinations of individual systems and ensemble methods, was reported on the testing data plus the concepts that exist only in the gold standard.

Evaluation Metrics

We evaluated performance on two tasks: (1) generic phenotypic concept recognition and (2) patient-specific phenotypic concept identification. For each task, the test set is the remaining 80% of records recognized by the union of the NLP systems that were not used for training, plus the records extracted only in the gold standard set.

We used three evaluation metrics in this study. (1) Exact match (E): only phenotypic concepts with exactly the same HPO identifiers were considered correct. (2) Generalized match (G): both the gold standard set and the prediction set were first extended by adding all ancestors of each recognized phenotype up to, but not including, "Phenotypic abnormality (HP:0000118)". Duplicated concepts were then removed for each document, and precision and recall were computed by comparing the two extended sets. (3) Weighted generalized match (W): the same method as the generalized match, but with each concept weighted by its information entropy. We consider the frequency of a phenotype an important factor; because the generalized match considers only the ontology, we designed a weighted evaluation metric that assigns different weights to different concepts. Specifically, taking information entropy into account, we calculated a weighted precision and recall. We defined the information entropy of each phenotype p as follows:

$$E(p) = -\log\left(\frac{|A_p|}{\sum_{p' \in \mathrm{desc}(\mathrm{HP{:}0000118})} |A_{p'}|}\right)$$

where $|A_p|$ is the total number of documents with the HPO concept $p$ in their extraction set after ancestor extension, and $p' \in \mathrm{desc}(\mathrm{HP{:}0000118})$ in the denominator ranges over all phenotypic codes descending from "Phenotypic abnormality (HP:0000118)". The precision and recall of the generalized match were then weighted as

$$wPrecision = \frac{\sum_{p \in TP} E(p)}{\sum_{p \in TP \cup FP} E(p)}, \qquad wRecall = \frac{\sum_{p \in TP} E(p)}{\sum_{p \in TP \cup FN} E(p)}$$

where TP are the true positives, FP the false positives, and FN the false negatives. By first extending both the predicted set and the gold standard set to include ancestors, and then weighting each concept by its information content, the calculation guarantees that a parent has an equal or smaller weight than its child. If two concepts appear at the same depth along a hierarchical lineage, the more frequent one has the smaller weight. This evaluation metric therefore considers both ontology and frequency information.
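
The sketch below illustrates the weighted generalized match on a two-edge toy ontology echoing the Stroke/TIA example from the introduction. The document frequencies, and hence the entropy weights, are invented for illustration.

import math

# Toy ontology (child -> parents), rooted at "Phenotypic abnormality".
PARENTS = {"HP:0002326": ["HP:0001297"],   # Transient ischemic attack is-a Stroke
           "HP:0001297": ["HP:0000118"]}

def extend_with_ancestors(concepts, root="HP:0000118"):
    """Add every ancestor of each concept, excluding the root itself."""
    extended, stack = set(), list(concepts)
    while stack:
        c = stack.pop()
        if c == root or c in extended:
            continue
        extended.add(c)
        stack.extend(PARENTS.get(c, []))
    return extended

def weighted_pr(predicted, gold, entropy):
    """Weighted generalized precision and recall as defined above."""
    pred, gs = extend_with_ancestors(predicted), extend_with_ancestors(gold)
    tp, fp, fn = pred & gs, pred - gs, gs - pred
    w = lambda concepts: sum(entropy[p] for p in concepts)
    precision = w(tp) / (w(tp) + w(fp)) if pred else 0.0
    recall = w(tp) / (w(tp) + w(fn)) if gs else 0.0
    return precision, recall

# Invented relative document frequencies: the rarer TIA gets the larger weight.
entropy = {"HP:0001297": -math.log(0.20), "HP:0002326": -math.log(0.05)}
# Predicting the ancestor "Stroke" when the gold standard is "TIA" yields
# perfect weighted precision but only partial weighted recall (about 0.35).
weighted_pr({"HP:0001297"}, {"HP:0002326"}, entropy)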

Results

There are a total of 1,510 and 99 concept-document pairs recognized in the gold standard for the NYP/CUIMC notes and the Mayo Clinic notes, respectively. Among them, 653 and 69, respectively, are patient-specific concepts. In the PubMed dataset, we identified 608 concept-document pairs in the gold standard. Table 1 summarizes the differences among the three datasets. On average, documents in the NYP/CUIMC dataset contain more tokens than those in the other datasets; as a result, more HPO concepts are annotated per document in the NYP/CUIMC dataset. While the length of documents in the Mayo dataset is similar to that in the PubMed dataset, more HPO concepts are identified in the Mayo notes than in the PubMed literature; this likely reflects that clinical notes are a more comprehensive record of a patient's phenotype than a PubMed case report. The annotation depths of the HPO terms are similar across the datasets, indicating a similar level of granularity. Sentences in Table 1 were detected using the sentence-split function of the Natural Language Toolkit (NLTK, https://www.nltk.org/). Depth was measured as the shortest distance from the node to the root "Phenotypic abnormality (HP:0000118)".

Table 1. Summary statistics of the three datasets.

Statistics                                            Mayo     NYP/CUIMC   PubMed
Average length of sentences                           86.99    89.65       131.20
Average number of tokens                              295.00   1010.95     233.00
Average number of HPO annotations                     9.90     37.75       12.16
Average number of patient-specific HPO annotations    6.90     16.33       N/A
Average depth of HPO annotations                      4.41     4.02        4.26

To establish a human baseline, we evaluated the performance of the individual annotators against the gold standard, using the individual annotations made before discussion and consensus. Since the same set of annotators was used to create the gold standard, there is a potential bias in favor of human performance. Table 2 shows the human performance of generic concept recognition in the NYP/CUIMC, Mayo, and PubMed datasets.

Table 2. Human performance of generic HPO concept recognition in three datasets. E: exact match; G: generalized match; W: weighted generalized match.

                        F1                     Precision              Recall
Annotator          E      G      W        E      G      W        E      G      W
NYP/CUIMC Dataset
1              0.829  0.903  0.893    0.973  0.984  0.982    0.722  0.835  0.818
2              0.751  0.836  0.821    0.956  0.979  0.979    0.618  0.729  0.707
3              0.874  0.935  0.929    0.939  0.957  0.953    0.818  0.913  0.906
4              0.569  0.756  0.748    0.941  0.975  0.973    0.408  0.618  0.607
5              0.918  0.950  0.944    0.972  0.986  0.984    0.870  0.918  0.907
6              0.508  0.671  0.655    0.964  0.980  0.977    0.344  0.510  0.493
Mayo Dataset
1              0.869  0.884  0.879    0.981  0.996  0.996    0.779  0.794  0.787
2              0.800  0.917  0.909    0.778  0.893  0.883    0.824  0.942  0.937
PubMed Dataset
1              0.913  0.943  0.937    0.957  0.984  0.982    0.873  0.905  0.896
2              0.804  0.907  0.896    0.724  0.869  0.850    0.905  0.948  0.947

Table 3 shows the human performance of patient-specific phenotypic concept recognition in the NYP/CUIMC and Mayo datasets. In general, the human annotators tend to have higher precision but relatively lower recall.

Table 3. Human performance of patient-specific HPO concept recognition in two datasets. E: exact match; G: generalized match; W: weighted generalized match.

                        F1                     Precision              Recall
Annotator          E      G      W        E      G      W        E      G      W
NYP/CUIMC Dataset
1              0.861  0.936  0.928    0.984  0.990  0.988    0.766  0.888  0.875
2              0.783  0.861  0.847    0.920  0.950  0.948    0.681  0.787  0.765
3              0.841  0.915  0.906    0.928  0.948  0.941    0.769  0.885  0.874
4              0.751  0.858  0.850    0.926  0.959  0.956    0.632  0.776  0.766
5              0.849  0.925  0.918    0.973  0.984  0.984    0.754  0.872  0.861
6              0.714  0.791  0.779    0.955  0.965  0.963    0.570  0.671  0.654
Mayo Dataset
1              0.888  0.919  0.914    0.988  0.997  0.997    0.806  0.853  0.844
2              0.843  0.930  0.922    0.838  0.916  0.907    0.847  0.944  0.939

Table 4 summarizes the performance of the individual and ensemble approaches for generic phenotypic concept recognition in the NYP/CUIMC, Mayo, and PubMed datasets.

Table 4. Performance of generic HPO concept recognition for different NLP approaches in three datasets. E: exact match; G: generalized match; W: weighted generalized match; (E): ensemble method.

                             F1                     Precision              Recall
Method                  E      G      W        E      G      W        E      G      W
NYP/CUIMC Dataset
MedLEE              0.595  0.762  0.743    0.853  0.919  0.911    0.457  0.651  0.627
ClinPhen            0.345  0.480  0.460    0.803  0.926  0.912    0.220  0.324  0.308
cTAKES              0.590  0.760  0.736    0.746  0.870  0.856    0.488  0.674  0.646
MetaMapLite         0.542  0.707  0.679    0.806  0.886  0.871    0.409  0.588  0.556
(E) intersection    0.226  0.359  0.334    0.906  0.965  0.959    0.129  0.221  0.202
(E) union           0.634  0.792  0.770    0.701  0.832  0.813    0.580  0.755  0.732
(E) majority voting 0.622  0.773  0.753    0.833  0.904  0.893    0.497  0.676  0.651
(E) machine learning 0.632 0.784  0.765    0.812  0.892  0.880    0.519  0.701  0.677
Mayo Dataset
MedLEE              0.559  0.694  0.680    0.561  0.682  0.672    0.557  0.708  0.690
ClinPhen            0.367  0.492  0.461    0.633  0.894  0.874    0.259  0.341  0.314
cTAKES              0.615  0.731  0.716    0.735  0.865  0.853    0.530  0.634  0.618
MetaMapLite         0.548  0.697  0.685    0.740  0.941  0.936    0.436  0.554  0.541
(E) intersection    0.240  0.325  0.303    0.925  0.987  0.983    0.138  0.195  0.180
(E) union           0.582  0.712  0.697    0.508  0.651  0.636    0.682  0.786  0.772
(E) majority voting 0.642  0.771  0.759    0.745  0.917  0.909    0.565  0.666  0.652
(E) machine learning 0.539 0.669  0.652    0.701  0.875  0.864    0.461  0.567  0.550
PubMed Dataset
MedLEE              0.560  0.621  0.621    0.508  0.532  0.552    0.623  0.746  0.711
ClinPhen            0.573  0.746  0.718    0.715  0.922  0.907    0.479  0.627  0.595
cTAKES              0.643  0.820  0.799    0.723  0.899  0.889    0.580  0.755  0.725
MetaMapLite         0.694  0.800  0.778    0.869  0.921  0.914    0.578  0.707  0.678
(E) intersection    0.422  0.548  0.507    0.940  0.981  0.976    0.272  0.381  0.343
(E) union           0.608  0.678  0.685    0.500  0.549  0.566    0.774  0.886  0.866
(E) majority voting 0.719  0.852  0.834    0.761  0.913  0.905    0.681  0.798  0.774
(E) machine learning 0.696 0.830  0.809    0.800  0.921  0.912    0.620  0.757  0.729

Table 5 summarizes the performance of the individual and ensemble approaches for patient-specific phenotypic concept recognition in the NYP/CUIMC and Mayo datasets.

Table 5. Performance of patient-specific HPO concept recognition for different NLP approaches in the NYP/CUIMC and Mayo datasets. E: exact match; G: generalized match; W: weighted generalized match; (E): ensemble method.

                             F1                     Precision              Recall
Method                  E      G      W        E      G      W        E      G      W
NYP/CUIMC Dataset
MedLEE              0.609  0.737  0.716    0.727  0.812  0.802    0.524  0.675  0.647
ClinPhen            0.478  0.611  0.580    0.614  0.772  0.747    0.392  0.506  0.474
cTAKES              0.537  0.696  0.674    0.508  0.659  0.645    0.571  0.738  0.708
MetaMapLite         0.522  0.665  0.638    0.557  0.685  0.669    0.492  0.647  0.610
(E) intersection    0.352  0.468  0.434    0.903  0.939  0.928    0.219  0.312  0.283
(E) union           0.554  0.698  0.679    0.455  0.597  0.582    0.709  0.841  0.817
(E) majority voting 0.610  0.732  0.711    0.622  0.733  0.720    0.599  0.731  0.703
(E) machine learning 0.585 0.714  0.690    0.720  0.816  0.805    0.496  0.637  0.606
Mayo Dataset
MedLEE              0.527  0.661  0.645    0.512  0.645  0.634    0.543  0.679  0.657
ClinPhen            0.453  0.576  0.549    0.606  0.886  0.871    0.363  0.429  0.403
cTAKES              0.531  0.658  0.641    0.566  0.744  0.730    0.502  0.591  0.572
MetaMapLite         0.516  0.612  0.596    0.609  0.778  0.765    0.448  0.506  0.490
(E) intersection    0.335  0.375  0.363    1.000  1.000  1.000    0.202  0.231  0.223
(E) union           0.502  0.640  0.622    0.405  0.551  0.534    0.661  0.765  0.746
(E) majority voting 0.604  0.739  0.726    0.633  0.868  0.857    0.578  0.645  0.631
(E) machine learning 0.510 0.622  0.602    0.635  0.827  0.815    0.454  0.520  0.500

Discussion

Despite the success achieved by clinical NLP systems on multiple tasks, prior studies have shown that the heterogeneous nature of clinical sublanguages across institutions can affect the performance and portability of individual systems [40-42]. Our evaluation confirms this conclusion: for HPO-based phenotypic extraction, individual methods are better equipped to handle notes from their development site. For example, Columbia University's proprietary MedLEE achieves the best performance among the individual systems on the NYP/CUIMC dataset for both tasks; cTAKES, with its dictionary enriched with a Mayo EHR-maintained list of terms, achieves the best performance on the Mayo dataset; and MetaMapLite, primarily designed for biomedical literature processing, achieves the best performance on the PubMed dataset. According to the original paper by Deisseroth et al., ClinPhen, developed by Stanford researchers, performs much better on the Stanford dataset than cTAKES and MetaMap [14]. Our overall evaluation suggests that ClinPhen did not perform well as a single system for HPO-based phenotypic concept extraction, though it may be unfair to compare its performance on the generic phenotypic concept recognition task, since ClinPhen only identifies patient-specific concepts.

By combining multiple NLP systems, our study demonstrates that ensemble methods can improve performance on both the concept recognition and patient-specific concept identification tasks and, more importantly, can yield superior performance when generalized to other institutions. The main challenge for concept recognition is the low recall rate (i.e., lack of coverage in individual NLP systems), which might explain why a union ensemble can achieve better performance than individual NLP systems while an intersection ensemble has the worst performance. Consistent with other reports of ensemble performance in clinical concept extraction tasks [28, 29], a simple majority-voting schema achieves the best performance for most tasks and datasets, suggesting it is a near-optimal and portable phenotype extraction method that is also easily implemented. As a more advanced ensemble method, a training-based ensemble allows flexibility to adjust the weights of the individual NLP systems in the final ensemble model through learning. Its best performance was on NYP/CUIMC, which is most likely attributable to the fact that NYP/CUIMC is a much larger dataset than the other two. Without enough data, however, a training-based ensemble is vulnerable to irreproducibility and unreliability, and there is no way to know ahead of time how large a sample is needed to train a reliable ensemble model. Therefore, our suggestions for clinical researchers are as follows (Figure 2). First, if an individual NLP system has been developed within an institution, it is likely sufficient for extracting phenotypic concepts at that institution. Second, for network-based studies such as phenotyping in the eMERGE network [43], an ensemble method should be considered when implementing a phenotypic concept extraction pipeline; if no training set is available, a majority-voting ensemble can achieve relatively good performance. Third, because the size of the training sample does affect performance, researchers training a reliable ensemble should increase the sample size iteratively until the classifier demonstrates satisfactory performance or the limit of annotation effort is reached.

The three evaluation metrics considered in this study can be characterized as: (1) a binary notion of correctness; (2) ontology-integrated evaluation; and (3) frequency-integrated evaluation. Although this study found the three metrics to be highly correlated and generally rank-preserving across evaluations, the generalized match and weighted generalized match can be more intuitive under certain circumstances. Figure 3 demonstrates how these metrics behave in a few example scenarios of mismatched predictions and gold standards, based on the ontology shown on the left. When the divergence between the predicted set and the gold standard set increases (i.e., from case 1 to case 2), the exact match cannot reflect this change, while the generalized match and weighted generalized match do reflect the decreased performance. Similarly, the weighted generalized match can account for concepts with lower frequencies by imposing a heavier penalty. For example, comparing case 2 and case 3, the false-negative count decreases under the generalized match, but after weighting the concepts by frequency, the false-negative weight increases under the weighted generalized measure, leading to a decrease in recall. For both the generalized match and the weighted generalized match, the sets are first extended to include ancestral codes, but only unique codes are kept in order to avoid counting codes in the higher levels of the ontology multiple times. For example, case 7 predicts one more code ({N7, N8}) than case 6 ({N7}), but since the true positive ({N2}) has already been counted as an ancestor of N7, the true positives do not increase in case 7.

Other studies have proposed evaluations for ontology-based information extraction. For example, Perotte et al. proposed evaluation metrics that reflect the distances between gold-standard and predicted codes and their locations in the ICD-9 hierarchy [44]. Our proposed generalized match shares many similarities with these prior methods: by measuring the divergence path between the predicted code and the gold standard code and the shortest path from their common ancestor to the root, a "partial match" is penalized based on its position in the hierarchy. They also provide hierarchy-based precision and recall measurements, in which true positives are defined as predicted codes that are ancestors of, descendants of, or identical to a gold-standard code; false positives are predicted codes that are not true positives; and false negatives are gold-standard codes for which neither the code itself nor a descendant was predicted. In some cases this is similar to our generalized match (case 1 in Figure 3). However, our proposed method provides a more distinguishing measurement when the divergence between the predicted and gold standard codes differs along the lineage (e.g., case 1 vs. case 2 and case 5 vs. case 6). Maynard et al. proposed an augmented precision and recall that considers weighted semantic distance [45]. It requires a one-to-one (partial) match between predicted concepts and gold-standard concepts, and a partial match affects both precision and recall. Our approach differs because it is a document-level evaluation that avoids phrase matching (i.e., first matching the phrase in the text and then matching the concepts) and allows "ontology inference". For example, in case 4 of Figure 3, their metric yields a partial match between N1 and N3, and the recall is imperfect; under our proposed metrics, case 4 achieves both perfect precision and recall. From the NLP point of view, our proposed metrics may seem counter-intuitive; however, in a real-world application such as HPO-based differential diagnosis, there is no information gain at the document level from adding an "N1" to the predicted set. Furthermore, our weighted generalized match considers both ontology and frequency information. By extending the set to include all ancestors, our design guarantees that a parent in a hierarchical lineage has an equal or smaller weight than its child, emphasizing the importance of accuracy at more granular levels of identification. By assigning larger weights to less frequently occurring phenotypes, NLP methods are rewarded more for accurate predictions of less frequent phenotypes, which are more likely to be helpful for rare disease diagnosis [14].

Error Analysis

One reason for the low recall of concept recognition is suboptimal ontology mapping. MetaMapLite, cTAKES, and MedLEE are designed to first extract clinical entities as UMLS or SNOMED-CT concepts and then map them to the corresponding HPO concepts. However, the UMLS concepts often cannot be mapped to any HPO concept. For instance, in the clause "patient showed weakness of hands", MetaMapLite identified the UMLS concept "Weakness of hand [C0575810]", which cannot be mapped to any HPO concept; in contrast, identifying the UMLS concept "Hand muscle weakness [C0239831]" would map to "hand muscle weakness [HP:0030237]". The ontology mapping could be improved by adding the HPO concepts directly into the dictionary or by increasing the cross-references between terms of the ontologies. Furthermore, missing concepts in the HPO and the resulting knowledge gap are another major source of errors. For example, none of the NLP systems was able to identify an HPO term from "patient showed prolongation of conduction velocity". At the time of this study, there was no concept or synonym named "prolongation of conduction velocity" in the HPO. Human annotators, however, were able to identify the HPO concept "abnormal nerve conduction velocity [HP:0040129]", since "prolongation" is a type of "abnormality".

In addition to suboptimal ontology mapping, other errors during patient-specific phenotype identification resulted from incomplete or incompatible rules for contextual property detection in the individual NLP systems when applied to new corpora with different underlying text and structure [18, 46]. For example, ClinPhen was unable to identify "deny" as a negation term, which was common in clinical notes from NYP/CUIMC. As another example, phenotypic concepts related to family members were usually listed in a "Family History" section of the NYP/CUIMC notes; without identifying the correct section (as MedLEE does), relying on the context of the sentence alone is not enough to determine whether a phenotypic concept is family-related. Prior evidence has shown a decrease in performance when applying individual NLP systems to new corpora [47]. Furthermore, all of the NLP systems assume by default that the phenotypes appearing in a note are patient-specific and then use trigger terms to exclude them, which makes the systems susceptible to a high yield of false positives when a clinical note contains large paragraphs of educational material or treatment plans. This suggests that applying individual patient-specific concept extraction systems in real-world applications will require additional tuning.
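
For illustration, the sketch below shows the kind of trigger-based negation rule these systems rely on. The trigger list is invented (it is not ClinPhen's or ConText's actual vocabulary) and deliberately includes the "deny" forms whose absence caused the false positives noted above.

import re

# Invented trigger list for illustration only.
NEGATION_TRIGGERS = re.compile(
    r"\b(no|not|without|denies|denied|deny|negative for)\b", re.IGNORECASE)

def is_negated(sentence: str, concept_start: int) -> bool:
    """Naive rule: a concept is negated if any trigger precedes it in the sentence.

    Real systems such as ConText also bound the trigger's scope and handle
    termination terms; this sketch omits both.
    """
    return bool(NEGATION_TRIGGERS.search(sentence[:concept_start]))

sentence = "Patient denies fever or seizure."
is_negated(sentence, sentence.index("fever"))   # True: "denies" precedes "fever"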

Ensemble methods can provide promising improvements, but they are also affected by the performance of the individual systems. For example, an intersection ensemble inherits every system's limitations in recall, because every system must identify a particular concept, which can result in an even worse recall. In contrast, a union ensemble can improve the recall rate at the expense of precision; this is especially exacerbated in the patient-specific task because all the systems treat concepts as patient-specific by default. These reasons likely explain why majority voting achieved the best performance in the majority of the evaluations. Despite its promising performance, majority voting remained susceptible to errors. To better characterize these errors, we randomly selected ten clinical records and categorized the errors made by the majority-vote ensemble in patient-specific HPO concept recognition, based on exact match. Table 6 summarizes these findings.

Table 6. Examples of common errors made by a majority-voting ensemble. "Text" stands for the original clinical narrative; "GS" stands for the annotation in the gold standard; and "E" stands for the annotation of the majority-voting-based ensemble approach. FP: false positive; FN: false negative.

Error Cause         Error Type   Example
Abbreviation/Code   FN           Text: ...He had a PDA...
                                 GS: patent ductus arteriosus (HP:0001643); E: None
Verb phrase         FN           Text: ...fell frequent at school...
                                 GS: Frequent falls (HP:0002359); E: None
Negation            FP           Text: ...Negative for seizure, ..., muscle pain...
                                 GS: None; E: muscle pain (HP:0003326)
Context             FP           Text: ...supportive care RTC if fever...
                                 GS: None; E: fever (HP:0001945)
Granularity         FP/FN        Text: ...constipation of sudden onset...
                                 GS: Acute constipation (HP:0012451); E: constipation (HP:0002019)

Among the randomly sampled notes, we found that none of the false negatives resulted from an inability of all individual systems to extract the expected term; that is, at least one of the individual systems was able to extract the correct concept. Instead, 40% of the false negatives were due to a failure to understand a verb-based description and find the corresponding noun concept; 35% were caused by mismatches in granularity; and the remaining 25% were caused by failures to extract abbreviations or medical codes, such as ICD diagnostic codes. The false negatives attributable to ontology mapping inconsistencies in the individual systems did not occur in the ensemble approach, because it incorporated the different rules maintained by the multiple institutions to provide a comprehensive capture of the concepts. Regarding false positives, the majority (55%) in patient-specific concept extraction were caused by failures in negation identification, usually the result of complex sentence structure or a typo in the note. Approximately 15% resulted from granularity discrepancies between the annotation and the prediction. Another 15% were due to phenotypes not being used to describe the patient, generally because the phenotype was family-related, appeared in a treatment plan, or was used in an educational context. Finally, the remaining 15% of false positives were found to be correct predictions by the majority-vote ensemble, meaning that the human annotators had failed to identify that particular set of concepts. This last observation supports the use of a machine-facilitated platform for annotation work [15].

Limitations

This study has several limitations. First, although this study used ensemble methods for phenotype extraction, the four base systems used here do not constitute a complete list of potential systems. Because phenotype extraction is a subfield of concept extraction, many other systems, both rule-based and machine learning-based, could be leveraged or customized to achieve this goal. In addition, only very simple ensemble methods were explored in this study, while many more advanced ensemble methods exist. Furthermore, our analysis was limited to NYP/CUIMC and Mayo clinical notes, and for the public dataset we included only case reports from the literature. Although case reports describe patient phenotypes, their language and content structure are very different from those of real-world clinical notes. There are publicly available datasets containing clinical notes, such as i2b2 [48] and SemEval [49, 50]; because HPO is a relatively new ontology, manual annotation would be required to prepare these datasets for methodology development and evaluation of HPO-based phenotype extraction. Another limitation is that our current training-based ensemble is designed for exact match only, which might not be optimal for the other evaluation metrics. To achieve better performance, the training model should be optimized according to the evaluation metric that best fits the desired goals; for example, under the generalized evaluation metrics, the output for a training instance should be "1" if a generalized match is observed. In addition, more features related to the concept itself, such as the category of phenotypic abnormality, could be considered to train a more sophisticated model. However, it is important to note again that without a large enough training set, more features could be susceptible to the curse of dimensionality and overfitting [51], and thus reduce out-of-sample ("real-world") performance. Finally, the evaluation metrics in this study were designed to reflect a "high-reward, high-penalty" idea, which might not be suitable for other application fields. As is true for most studies in biomedical informatics, we do not believe there is a single universal "perfect" evaluation metric. Future work expanding the evaluations to real-world applications will allow a more robust assessment.

Conclusions

Our study shows that ensembles of natural language processing systems can improve the performance of recognizing HPO-based phenotype concepts in clinical notes. A simple majority voting-based ensemble increased robustness across different cohorts and tasks, while a machine learning-based ensemble might achieve better performance given a large enough training dataset. We recommend combining multiple NLP systems in an ensemble-based algorithm to make the phenotyping pipeline more portable.


Figure Legends

Figure 1. An overview of the ensemble pipeline and evaluation schema for extracting patient-related phenotypes. Four different NLP systems, MedLEE, MetaMapLite, ClinPhen, and cTAKES, were leveraged independently to extract patient-related phenotypes. Ensemble methods (e.g., majority voting) were used to integrate the results of the individual NLP systems. Red text represents the extracted terms. The gold standard was created by consensus of three different human annotators. A single human annotator was also asked to manually extract the phenotypes. The different extraction results were compared against the gold standard, with the results derived from the single annotator serving as a baseline for evaluating the automated extraction results.

Figure 2. A flowchart for choosing an NLP-based phenotype extraction method. The grey boxes mark the start and end points of the flowchart, the black boxes represent processes, and the white boxes represent decisions.

Figure 3. Four different evaluation metrics for concept recognition. The hypothetical ontology shown on the left is a simplified excerpt from the Human Phenotype Ontology (HPO). Each box refers to a phenotype defined in the HPO (term names shown underneath the ontology), and each node is labeled in the format "node_id(weight)", where the weight is an estimate based on the information content of the phenotype. The right side compares the four evaluation metrics. For the exact match, the calculation follows standard practice. For the generalized match, the count includes the predicted concept(s) plus all of their ancestors; for example, in case 6 the true positive count is 2 (i.e., {N2, N4}), the false negative count is 4 (i.e., {N1, N3, N6, N9}), and the false positive count is 1 (i.e., {N7}). For the weighted generalized match, the counting is similar to the generalized match except that each count is multiplied by its corresponding weight; for example, the true positives are calculated as the sum of the weights of {N2, N4}, which is 5. In Perotte's match, true positives are defined as predicted codes that are ancestors of, descendants of, or identical to a gold-standard code; false positives are predicted codes that are not true positives; and false negatives are gold-standard codes for which neither the code itself nor a descendant was predicted. Therefore, in case 6 there are no true positives, one false positive (i.e., {N7}), and one false negative (i.e., {N9}).


Declarations

Ethics approval: This study was approved by the Institutional Review Board of the Columbia University Medical Center (protocols AAAD1873 and AAAR3954).

Availability of data and material: The original clinical notes used in the current study are available from the corresponding authors upon reasonable request and with institutional approval. The processed results generated or analyzed during this study are included in this published article (and its supplementary information files).

Declaration of interests: C.F. is a consultant for Health Fidelity, which has licensed MedLEE from Columbia.

Funding: This study was supported by National Library of Medicine/National Human Genome Research Institute grant R01LM012895-01 and National Library of Medicine grant R01LM009886-10.

Author contributions: CL and CW conceived the study together. CL implemented the ensemble method and conducted the data analysis. LE and CF conducted the MedLEE-based parsing. CL, ZL, JRR, CT, FSPK, NS, AMB, and JL contributed to the gold standard curation. LW, FS, and HL provided the de-identified Mayo clinical notes. All authors reviewed and approved the manuscript.


References

1. Shashi, V., et al., The utility of the traditional medical genetics diagnostic evaluation in the context of next-generation sequencing for undiagnosed genetic disorders. Genet Med, 2014. 16(2): p. 176-82.
2. Retterer, K., et al., Clinical application of whole-exome sequencing across clinical indications. Genet Med, 2016. 18(7): p. 696-704.
3. Sawyer, S.L., et al., Utility of whole-exome sequencing for those near the end of the diagnostic odyssey: time to address gaps in care. Clin Genet, 2016. 89(3): p. 275-84.
4. Cooper, G.M. and J. Shendure, Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nat Rev Genet, 2011. 12(9): p. 628-40.
5. Kearney, H.M., et al., American College of Medical Genetics standards and guidelines for interpretation and reporting of postnatal constitutional copy number variants. Genet Med, 2011. 13(7): p. 680-5.
6. Richards, S., et al., Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med, 2015. 17(5): p. 405-24.
7. Yang, H., P.N. Robinson, and K. Wang, Phenolyzer: phenotype-based prioritization of candidate genes for human diseases. Nat Methods, 2015. 12(9): p. 841-3.
8. Kohler, S., et al., Clinical diagnostics in human genetics with semantic similarity searches in ontologies. Am J Hum Genet, 2009. 85(4): p. 457-64.
9. Singleton, M.V., et al., Phevor combines multiple biomedical ontologies for accurate identification of disease-causing alleles in single individuals and small nuclear families. Am J Hum Genet, 2014. 94(4): p. 599-610.
10. Robinson, P.N., et al., Improved exome prioritization of disease genes through cross-species phenotype comparison. Genome Res, 2014. 24(2): p. 340-8.
11. Amberger, J.S., et al., OMIM.org: Online Mendelian Inheritance in Man (OMIM(R)), an online catalog of human genes and genetic disorders. Nucleic Acids Res, 2015. 43(Database issue): p. D789-98.
12. Kohler, S., et al., Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources. Nucleic Acids Res, 2019. 47(D1): p. D1018-D1027.
13. Son, J.H., et al., Deep Phenotyping on Electronic Health Records Facilitates Genetic Diagnosis by Clinical Exomes. Am J Hum Genet, 2018. 103(1): p. 58-73.
14. Deisseroth, C.A., et al., ClinPhen extracts and prioritizes patient phenotypes directly from medical records to expedite genetic disease diagnosis. Genet Med, 2018.
15. Liu, C., et al., Doc2Hpo: a web application for efficient and accurate HPO concept curation. Nucleic Acids Res, 2019. 47(W1): p. W566-W570.
16. Wei, Q., et al., A study of deep learning approaches for medication and adverse drug event extraction from clinical text. J Am Med Inform Assoc, 2019.
17. Gehrmann, S., et al., Comparing deep learning and concept extraction based methods for patient phenotyping from clinical narratives. PLoS One, 2018. 13(2): p. e0192360.
18. Doing-Harris, K., et al., Document Sublanguage Clustering to Detect Medical Specialty in Cross-institutional Clinical Texts. Proc ACM Int Workshop Data Text Min Biomed Inform, 2013. 2013: p. 9-12.
19. Dietterich, T.G., Ensemble learning. 2002.
20. Yu, L., S. Wang, and K.K. Lai, Credit risk assessment with a multistage neural network ensemble learning approach. Expert Systems with Applications, 2008. 34(2): p. 1434-1444.
21. Liu, B., et al., iRSpot-EL: identify recombination spots with an ensemble learning approach. Bioinformatics, 2016. 33(1): p. 35-41.
22. Wang, X., et al., Protein-protein interaction sites prediction by ensemble random forests with synthetic minority oversampling technique. Bioinformatics, 2018.
23. Liu, C., et al., Multi-omics facilitated variable selection in Cox-regression model for cancer prognosis prediction. Methods, 2017. 124: p. 100-107.
24. Hill, S.M., et al., Inferring causal molecular networks: empirical assessment through a community-based effort. Nat Methods, 2016. 13(4): p. 310-8.
25. Liu, C., et al., High-dimensional omics data analysis using a variable screening protocol with prior knowledge integration (SKI). BMC Syst Biol, 2016. 10(Suppl 4): p. 118.
26. Torii, M., et al., BioTagger-GM: a gene/protein name recognition system. J Am Med Inform Assoc, 2009. 16(2): p. 247-55.
27. Torii, M., K. Wagholikar, and H. Liu, Using machine learning for concept extraction on clinical documents from multiple data sources. J Am Med Inform Assoc, 2011. 18(5): p. 580-7.
28. Doan, S., et al., Recognition of medication information from discharge summaries using ensembles of classifiers. BMC Med Inform Decis Mak, 2012. 12: p. 36.
29. Kang, N., et al., Using an ensemble system to improve concept extraction from clinical records. J Biomed Inform, 2012. 45(3): p. 423-8.
30. Groza, T., et al., Automatic concept recognition using the human phenotype ontology reference and test suite corpora. Database (Oxford), 2015. 2015.
31. Tseytlin, E., et al., NOBLE - Flexible concept recognition for large-scale biomedical natural language processing. BMC Bioinformatics, 2016. 17: p. 32.
32. Friedman, C., System and method for medical language extraction and encoding. 2000, Google Patents.
33. Demner-Fushman, D., W.J. Rogers, and A.R. Aronson, MetaMap Lite: an evaluation of a new Java implementation of MetaMap. J Am Med Inform Assoc, 2017. 24(4): p. 841-844.
34. Savova, G.K., et al., Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc, 2010. 17(5): p. 507-13.
35. Aronson, A.R., Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Symp, 2001: p. 17-21.
36. Harkema, H., et al., ConText: an algorithm for determining negation, experiencer, and temporal status from clinical reports. J Biomed Inform, 2009. 42(5): p. 839-51.
37. Friedman, C., et al., Architectural requirements for a multipurpose natural language processor in the clinical environment. Proc Annu Symp Comput Appl Med Care, 1995: p. 347-51.
38. Friedman, C., et al., A general natural-language text processor for clinical radiology. J Am Med Inform Assoc, 1994. 1(2): p. 161-74.
39. Friedman, C., et al., Automated encoding of clinical documents based on natural language processing. J Am Med Inform Assoc, 2004. 11(5): p. 392-402.
40. Fan, J.W., et al., Part-of-speech tagging for clinical text: wall or bridge between institutions? AMIA Annu Symp Proc, 2011. 2011: p. 382-91.
41. Wagholikar, K., et al., Feasibility of pooling annotated corpora for clinical concept extraction. AMIA Jt Summits Transl Sci Proc, 2012. 2012: p. 38.
42. Sohn, S., et al., Clinical documentation variations and NLP system portability: a case study in asthma birth cohorts across institutions. J Am Med Inform Assoc, 2017.
43. Newton, K.M., et al., Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network. J Am Med Inform Assoc, 2013. 20(e1): p. e147-54.
44. Perotte, A., et al., Diagnosis code assignment: models and evaluation metrics. J Am Med Inform Assoc, 2014. 21(2): p. 231-7.
45. Maynard, D., W. Peters, and Y. Li, Metrics for Evaluation of Ontology-based Information Extraction, in EON@WWW. 2006.
46. Patterson, O. and J.F. Hurdle, Document clustering of clinical narratives: a systematic study of clinical sublanguages. AMIA Annu Symp Proc, 2011. 2011: p. 1099-107.
47. Wu, S., et al., Negation's not solved: generalizability versus optimizability in clinical natural language processing. PLoS One, 2014. 9(11): p. e112774.
48. Uzuner, Ö., et al., 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J Am Med Inform Assoc, 2011. 18(5): p. 552-556.
49. Pradhan, S., et al., SemEval-2014 task 7: Analysis of clinical text, in Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014). 2014.
50. Elhadad, N., et al., SemEval-2015 task 14: Analysis of clinical text, in Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). 2015.
51. Lee, C.H. and H.J. Yoon, Medical big data: promise and challenges. Kidney Res Clin Pract, 2017. 36(1): p. 3-11.


Highlights

- The best performance of natural language processing systems often does not generalize to new datasets, resulting in limited reproducibility.
- Ensembles of natural language processing systems can consistently improve both generic phenotypic concept recognition and patient-specific phenotypic concept identification over individual systems.
- A simple majority voting-based ensemble can increase the reproducibility of performance across different cohorts and tasks.
- An evaluation metric considering both concept relations and frequencies is useful.


Graphical abstract

