Journal Pre-proof

Deep Learning Predicts Extreme Preterm Birth from Electronic Health Records

Cheng Gao, Sarah Osmundson, Digna R. Velez Edwards, Gretchen Purcell Jackson, Bradley A. Malin, You Chen

PII: S1532-0464(19)30253-9
DOI: https://doi.org/10.1016/j.jbi.2019.103334
Reference: YJBIN 103334

To appear in: Journal of Biomedical Informatics

Received Date: 17 January 2019
Revised Date: 23 September 2019
Accepted Date: 29 October 2019

Please cite this article as: Gao, C., Osmundson, S., Velez Edwards, D.R., Purcell Jackson, G., Malin, B.A., Chen, Y., Deep Learning Predicts Extreme Preterm Birth from Electronic Health Records, Journal of Biomedical Informatics (2019), doi: https://doi.org/10.1016/j.jbi.2019.103334
© 2019 Published by Elsevier Inc.
Deep Learning Predicts Extreme Preterm Birth from Electronic Health Records

Authors: Cheng Gao1, Sarah Osmundson2, Digna R. Velez Edwards1,2, Gretchen Purcell Jackson1,3,5, Bradley A. Malin1,4,6, and You Chen1

Author Affiliations:
1Department of Biomedical Informatics, School of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
2Department of Obstetrics and Gynecology, School of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
3Departments of Pediatric Surgery and Pediatrics, School of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
4Department of Biostatistics, School of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
5Evaluation Research Center, IBM Watson Health, Cambridge, MA, USA
6Department of Electrical Engineering & Computer Science, School of Engineering, Vanderbilt University, Nashville, TN, USA

To Whom Correspondence Should Be Addressed:
You Chen, Ph.D.
Department of Biomedical Informatics
Vanderbilt University
Nashville, TN 37203 USA
Email: [email protected]
Phone: +1 615 343 1939
Fax: +1 615 322 0502
Abstract

Objective: Models for predicting preterm birth have generally focused on very preterm (28 to 32 weeks) and moderate to late preterm (32 to 37 weeks) settings. However, extreme preterm birth (EPB), before the 28th week of gestational age, accounts for the majority of newborn deaths. We investigated the extent to which deep learning models that consider temporal relations documented in electronic health records (EHRs) can predict EPB.

Study Design: EHR data were subject to word embedding and a temporal deep learning model, in the form of recurrent neural networks (RNNs), to predict EPB. Due to the low prevalence of EPB, the models were trained on datasets where controls were undersampled to balance the case:control ratio. We then applied an ensemble approach to group the trained models to predict EPB in an evaluation setting with the natural EPB ratio. We evaluated the RNN ensemble models with 10 years of EHR data from 25,689 deliveries at Vanderbilt University Medical Center. We compared their performance with traditional machine learning models (logistic regression, support vector machine, gradient boosting) trained on datasets with balanced and natural EPB ratios. Risk factors associated with EPB were identified using adjusted odds ratios.

Results: The RNN ensemble models trained on artificially balanced data achieved a higher AUC (0.827 vs. 0.744) and sensitivity (0.965 vs. 0.682) than RNN models trained on datasets with the naturally imbalanced EPB ratio. In addition, the AUC (0.827) and sensitivity (0.965) of the RNN ensemble models were better than the AUC (0.777) and sensitivity (0.819) of the best baseline models trained on balanced data. Risk factors including twin pregnancy, short cervical length, hypertensive disorder, systemic lupus erythematosus, and hydroxychloroquine sulfate were found to be significantly associated with EPB.
Conclusion: Temporal deep learning can predict EPB up to 8 weeks before its occurrence. Accurate prediction of EPB may allow healthcare organizations to allocate resources effectively and ensure patients receive appropriate care.
Introduction

Extreme preterm birth (EPB), which is defined by the World Health Organization (WHO) as birth before the 28th week of gestational age1,2,3, is associated with significant morbidity and accounts for the majority of newborn deaths4-8. Preterm infants have a high risk of developing a wide array of complications, including cerebral palsy9,10,11, developmental delay, sensory impairments (e.g., visual and auditory deficits)12, school difficulties13-17, and behavioral problems18-20 both in childhood and adulthood21. Identifying pregnancies at risk for EPB and transferring care to centers with neonatal intensive care units (NICUs) can improve both short- and long-term outcomes for EPB infants21-26.
Healthcare facilities lacking appropriate-level NICUs usually do not have adequate resources
(e.g., experienced staff, specialized ventilators) to manage EPB infants22,23. As a consequence, EPB infants may not receive timely or appropriate treatment. Predicting EPB at an early point in care (e.g., several weeks before its occurrence) could allow healthcare organizations and expectant mothers at risk of EPB to plan for this possibility. In addition, the prediction of EPB could also assist NICUs in planning for transfers from obstetric settings. However, predicting EPB is challenging because premature delivery occurs through a variety of mechanisms and is affected by multiple factors. Prior research has identified risk factors (e.g., age, race, multiparity, presence of depression, alcohol and drug use, and cervical length) for very preterm (28 to 32 weeks of gestational age) and moderate to late preterm (32 to 37 weeks of gestational age) birth27-41 and has leveraged these risk factors to predict preterm birth42. However, these studies focused on a small number of risk factors, and the predictive performance was poor (no greater than 0.25 sensitivity with a specificity of 0.33)43. More recently, machine learning techniques have been applied to predict preterm birth, with some promising results44-45. The most
common approach used to predict very, moderate, or late preterm birth is logistic regression (LR)47. For instance, one LR-based model was shown to achieve a sensitivity of 0.62 at a specificity of 0.81 for late preterm birth prediction. Yet few studies have focused on risk factors for EPB or designed machine learning models to predict EPB. Those that have were limited in that they involved only a small number of women (e.g., the Appalachia study was based on 120 observations) and predicted EPB close to the time of delivery47. As a result, the EPB prediction models proposed to date do not allow sufficient time for families or healthcare organizations to prepare accordingly. In this paper, we introduce a framework based on deep learning to predict EPB early (i.e., before 20 weeks' gestational age) using information available in electronic health records (EHRs). To evaluate the approach, we studied 10 years of EHR data from expectant mothers who visited Vanderbilt University Medical Center (VUMC).
Related Work

Since this work leverages a deep learning approach to predict preterm birth, we provide background on preterm birth, deep learning, and existing data mining and machine learning approaches applied to predict preterm birth.

Preterm birth

Preterm birth (less than 37 weeks of gestational age) is one of the most common causes of death in children under 5 years worldwide, according to the WHO48. Extremely preterm birth refers to babies born before 28 weeks of gestational age and has been recognized to be associated with increased short-term and long-term adverse outcomes17-20, which require more intensive neonatal care10. A study10 conducted in New South Wales found that 29.3% of 614 live births at 20-28 weeks' gestation died in the labor ward and nearly 71% were admitted to the NICU. Two-thirds of the infants admitted to the NICU survived to 1 year, and these surviving infants were more likely to have cerebral palsy and major developmental disability. Another study11, conducted by a group at the University of Nottingham, assessed the cognitive and neurologic development, at 30 months of age, of children who were born at 25 or fewer completed weeks of gestation. They found that 21% of these extremely preterm children had cognitive impairment and 86% had severe disabilities at 30 months of age. Extremely preterm births are usually accompanied by low birth weight. Children born with low birth weight have a high rate of cognitive dysfunction at age five13. They tend to have poorer achievement, behavior, and academic performance at middle school age17. Existing studies found that more than a quarter of low-birth-weight children had attention deficit hyperactivity disorder at 12 years of age, and these children were also more likely to have generalized anxiety and depression18.
Predictive models of preterm birth

Preterm birth prediction is an important research direction, and much work has been done to improve prediction performance. Mercer et al. developed a risk score-based system to predict preterm birth49. They identified a number of risk factors, including fetal fibronectin, a history of preterm birth, and a short cervix, and used them as variables to train a multivariate logistic regression. Their system achieved a sensitivity of 24.2% (18.2%) and a specificity of 28.6% (33.3%) for nulliparous (multiparous) women. Goodwin and her colleagues50 explored the use of data mining and machine learning techniques to predict preterm birth. They used machine learning models to generate 520 rules to predict preterm birth, with an accuracy of ~53%. In subsequent work51, the same group studied 19,970 patient records with 1,622 variables from Duke University Medical Center using data mining and machine learning techniques. The best performance was achieved by logistic regression, with an AUC of 0.66.

Word embedding and deep learning on medical data

Word embedding has shown great success in representing contextual information of medical concepts. Choi et al.52 developed an algorithm to learn dense representations of diagnoses, medications, and procedures, and found that machine learning models supported by the low-dimensional representations achieved a 23% improvement in AUC on a specific prediction task. Deep learning combined with word embedding has also shown impressive performance in predicting diseases and patient outcomes in healthcare52-53. A research group at the Georgia Institute of Technology applied a series of advanced deep learning models, including long short-term memory (LSTM) and attention-based neural networks, to predict diseases and patient outcomes from clinical text54.
Study Materials

Data

This study is based on the EHR data of women who delivered newborns between August 9, 2005, and July 31, 2017, at VUMC. The data include patient demographics (e.g., age and self-reported race), diagnoses and procedures (e.g., ICD-9 billing codes and CPT procedure codes), prescribed medications, and laboratory test results. The data were preprocessed such that each datum was represented as a concept in a set of standardized terminologies, including SNOMED, RxNorm, and LOINC. The mapping was performed by the Vanderbilt Institute for Clinical and Translational Research (VICTR), which mapped each medical concept in the EHRs to a common concept in the Observational Medical Outcomes Partnership (OMOP) common data model. The data consist of 5,602,792 documented medical concepts, distributed over 25,689 deliveries. The numbers of unique medical concepts representing health conditions, procedures, prescriptions, and laboratory tests were over 21k, 22k, 20k, and 17k, respectively. The average numbers of concepts from SNOMED, RxNorm, and LOINC at the individual patient level were 31.8, 83.8, and 64, respectively.

Case and Control Definitions

We leveraged the following ICD-9 codes to identify gestational age: 765.21 (less than 24 weeks), 765.22 (24 weeks), 765.23 (25 to 26 weeks), 765.24 (27 to 28 weeks), 765.25 (29 to 30 weeks), 765.26 (31 to 32 weeks), 765.27 (33 to 34 weeks), 765.28 (35 to 36 weeks), and 765.29 (more than 37 weeks). In addition, we designed and applied regular expressions to discharge summaries to extract gestational age at delivery to confirm the age identified from the ICD-9 codes. For patients whose records did not contain the aforementioned ICD-9 codes or whose discharge summary did
not report a gestational age at delivery, we relied on their clinical notes before delivery to extract an estimated due date (EDD), from which we calculated gestational age at delivery. If the birth occurred before the 28th week of gestational age, it was designated as a case (EPB); otherwise, it was designated as a control (non-EPB). We randomly selected 5% of the identified cases and controls and invited two experts to independently review the medical records of the samples to confirm their classes: EPB or non-EPB. For the selected samples, the classes confirmed by the two experts were consistent with the classes derived from the combination of ICD-9 codes, gestational age extracted from discharge summaries, and gestational age calculated via the EDD. For each delivery, we ordered the medical concepts chronologically. In this study, the EHR data of the expectant mother available before 20 weeks' gestational age were relied upon to predict EPB. We selected 20 weeks' gestational age as the midpoint of pregnancy to enable early prediction of EPB while retaining sufficient information about the mother to allow for accurate prediction. This decision was based on the fact that most pregnant women will receive at least one healthcare service before 18-20 weeks' gestational age. Since a woman could have had more than one delivery during the study period, we selected her first delivery as an instance for the dataset to ensure that models built on the dataset are not biased toward historical deliveries. After applying these steps, 25,689 patients were available for this study. Among these, approximately 1% of the instances were EPB cases.

Data for Prediction Model Development and Evaluation

The dataset was split into two parts. We refer to instances of births that occurred between August 2005 and December 2014 as development data and instances between January 2015 and August
2017 as evaluation data. As alluded to above, EPB was a rare event (n = 132, or 0.75%, in the development set; n = 85, or 1.1%, in the evaluation set). The way we created the development and evaluation datasets differs from the common practice of splitting data at random. Prediction models trained on randomly created development and evaluation sets have a major problem: the development set can contain future information, and thus models built on future information are used to predict both past and future events. To solve this problem, we split the development and evaluation sets at a specific point in time, so that we leverage only past knowledge to train prediction models and then use the trained models to predict future events. The development data were used for model training and cross-validation. Due to the small number of cases, we under-sampled the controls in the development set to address the bias caused by the unbalanced case:control ratio. There are two typical ways to perform such sampling: under-sampling and up-sampling. We did not choose up-sampling because we want to ensure that the investigated instances satisfy the independent and identically distributed (i.i.d.) assumption, violations of which can impact the performance of machine learning models. Up-sampling usually generates a large number of simulated instances that are very similar to real case instances, and thus the real case instances and the simulated instances are partially dependent. This phenomenon becomes worse when the case:control ratio is extremely small, such as 0.75% in our study. We sampled 30 subsets without replacement, each of which contained an equal number of cases and controls. Within each subset, we applied a 3-fold cross-validation strategy.
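The balanced-subset construction described above can be sketched as follows. This is a minimal illustration with our own function and variable names: every case is kept, and controls are drawn without replacement within each subset; the paper does not specify whether sampling is also without replacement across the 30 subsets.

```python
import numpy as np

def make_balanced_subsets(case_ids, control_ids, n_subsets=30, seed=0):
    """Undersample controls to build balanced case:control development subsets.

    Each subset retains every case and draws an equal number of controls
    without replacement, yielding a 50:50 case:control ratio.
    """
    rng = np.random.default_rng(seed)
    subsets = []
    for _ in range(n_subsets):
        # Draw as many controls as there are cases, without replacement.
        sampled_controls = rng.choice(control_ids, size=len(case_ids), replace=False)
        subsets.append(np.concatenate([case_ids, sampled_controls]))
    return subsets
```

Each subset would then be passed to the 3-fold cross-validation routine described above.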
The evaluation set was used to assess the performance of the models in the natural, imbalanced setting. It was used to ascertain whether preterm birth prediction models trained on balanced data could generalize to inform future predictions in an imbalanced environment. Table 1 summarizes the demographics (age and self-reported race) and clinical characteristics (weeks at delivery, diagnoses, procedures, medications, and laboratory studies) of the mothers for the development and evaluation datasets. For the development dataset, the average and median age of mothers was approximately 28 years. The majority of pregnant women were Caucasian (68.1%) or Black (21.2%). Similarly, in the evaluation dataset, the average and median age of mothers was about 29 years, and the majority of pregnant women were Caucasian (70.1%) or Black (16.8%). The percentage of cases in the evaluation dataset (1.1%) was slightly higher than in the development dataset (0.75%). This may be due, in part, to the fact that VUMC has established itself as a regional referral center.
Prediction Methods

Our framework is built on a combination of natural language processing (NLP) and machine learning (e.g., regularized logistic regression and deep learning) to predict EPB. Figure 1 depicts the workflow for the framework, which consists of four components: i) word embedding; ii) cohort construction, which uses bootstrapping to undersample controls to compose balanced cohorts; iii) machine learning (based on regularized logistic regression and recurrent neural networks (RNNs)); and iv) medical concept ranking based on statistical models.

Bag of words and word embedding

In this study, we use both bag of words (BOW) and word embedding as NLP techniques to represent each medical concept, and we compare the performance of machine learning models based on the two techniques. BOW is a popular technique for preprocessing clinical narratives, and its effectiveness and efficiency for traditional machine learning models (e.g., logistic regression) have been impressive55. To apply this approach, the EHR of a patient can be likened to a document that consists of a series of words (medical concepts). We model the data as a matrix of EHRs by words, where each cell in the matrix corresponds to the number of times a certain concept was documented in a specific EHR. For numerical variables, such as laboratory test results, we converted the data into categorical values of low, normal, and high by comparing the numerical values with normal ranges. For instance, if a numerical value is below the lower bound of the normal range, it is assigned a low label; if the value is within the normal range, it is assigned a normal label; and if the value is above the normal range, it is assigned a high label. Since BOW does not consider temporal relations between words, each EHR consists of an unordered
bag of words. Term frequency-inverse document frequency (TF-IDF) is a popular weighting strategy that normalizes the importance of a medical concept within a patient's EHR. In this study, we use TF-IDF along with BOW for feature engineering.

Word embedding represents the temporal information in a patient's EHR56. We specifically use a skip-gram model, which aims to find a representation (e.g., a word vector) that can predict the surrounding medical concepts for a particular concept of interest. More concretely, word vectors are learned by fixing a sliding window of size $\omega$ and maximizing the following log-transformed probability:

$$\frac{1}{T}\sum_{t=1}^{T}\ \sum_{-\omega \le j \le \omega,\ j \neq 0} \log p(c_{t+j} \mid c_t), \quad (1)$$

where

$$p(c_{t+j} \mid c_t) = \frac{\exp\left(\mathbf{v}(c_{t+j})^{\mathsf{T}}\mathbf{v}(c_t)\right)}{\sum_{i=1}^{N}\exp\left(\mathbf{v}(c_i)^{\mathsf{T}}\mathbf{v}(c_t)\right)}. \quad (2)$$
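The objective above is optimized over (center, context) concept pairs drawn from a sliding window of size ω. A minimal sketch of that pair generation follows (the function name is ours; in practice a library such as gensim performs this step internally when training skip-gram vectors):

```python
def skipgram_pairs(sequence, window):
    """Enumerate the (center, context) pairs summed over in Equation (1).

    For each concept c_t, every concept within +/- `window` positions
    (excluding position t itself) becomes a prediction target.
    """
    pairs = []
    for t, center in enumerate(sequence):
        lo = max(0, t - window)
        hi = min(len(sequence), t + window + 1)
        for j in range(lo, hi):
            if j != t:
                pairs.append((center, sequence[j]))
    return pairs
```

For example, `skipgram_pairs(["a", "b", "c"], window=1)` yields `[("a", "b"), ("b", "a"), ("b", "c"), ("c", "b")]`.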
In this representation, $N$ is the number of medical concepts, $c_t$ is the concept at position $t$, $\mathsf{T}$ is the vector transpose operation, and $\mathbf{v}(c_t)$ is the vector representation of the concept $c_t$. Based on Equation (2), a skip-gram will maximize the inner product of the vector representations of temporally proximal concepts.

Cohort construction

We create two types of cohorts to train models. First, we build a cohort that keeps the case:control ratio in the development dataset the same as observed in the EHR system. We refer to models learned in this setting as imbalanced models. Second, we create a set of balanced datasets, where the case:control ratio is set to 50:50. Since prediction models based on imbalanced data may be
biased (i.e., in our study, dominated by controls), we rebalanced the development dataset. Specifically, we applied bootstrapping46 to undersample controls, sampling the development data into 30 datasets, in each of which the number of controls is the same as the number of cases (see Figure 1). Models trained in the balanced and imbalanced case-control settings are compared in terms of their EPB prediction performance.

Machine learning and deep learning models

We applied regularized LR, support vector machines (SVM), gradient boosting (GB), and RNNs to learn EPB prediction models on the balanced and imbalanced cohorts. LR, SVM, and GB are traditional machine learning models that have demonstrated strong prediction and classification performance; we used all of them as baselines. By contrast, RNNs are a type of artificial neural network used to characterize sequential information. With respect to the latter, we rely upon the long short-term memory (LSTM)57 model, which uses vector representations of medical concepts as well as sequences of these representations. Building on the medical concept representations (BOW, TF-IDF weighting, word embedding) along with the machine learning models, we investigated a collection of EPB prediction models.

Baselines with BOW

BOW, in combination with TF-IDF weighting, is a simple yet popular representation of the input to a machine learning model. To mitigate overfitting due to sparsity in the matrix, LR_BOW, SVM_BOW, GB_BOW, LR_BOW_WEIGHT (with TF-IDF weighting), SVM_BOW_WEIGHT, and GB_BOW_WEIGHT use a feature selection algorithm to triage variables that are highly correlated or unlikely to play a role in distinguishing cases from controls. Specifically, in the training set of each fold, each medical concept is assigned an entropy score in the form of its information gain
(IG). The IG reflects how well a concept distinguishes preterm birth status: the larger the score, the better the concept separates cases from controls. We order the medical concepts according to their scores. Since there are ~200 instances in each balanced dataset, and the total number of features should usually be less than the total number of instances, we retained the top 150 medical concepts as features.

Baselines with word embedding and BOW

Sequences of word vectors cannot be directly applied to traditional machine learning models such as LR. This is because, if LR were trained on the vectors directly, the total number of dimensions would be |𝑉| × 𝑙, where |𝑉| is the size of the vocabulary and 𝑙 is the length of a word vector. Such high dimensionality would lead to overfitting. Thus, we aggregate the vectors of all medical concepts across EHRs by taking their average, minimum, or maximum. We apply LR, SVM, and GB to these aggregated features to learn a series of prediction models. For instance, the LR-related models include LR_MEAN, LR_MIN, LR_MAX, LR_MIN_MAX, LR_MIN_MEAN, LR_MEAN_MAX, and LR_MIN_MAX_MEAN. Furthermore, we combine BOW and word embedding-based aggregated features as inputs to learn LR, SVM, and GB models, which we refer to as LR_BOW_MEAN, SVM_BOW_MEAN, and GB_BOW_MEAN, respectively. For word embedding, we set both the sliding window size 𝜔 and the length of a word vector 𝑙 to 100.

Neural networks with word embedding

For the RNN models, we use an LSTM network trained on word vectors directly. Each input is a sequence of word vectors, chronologically ordered as documented in the EHR. At the final step of the LSTM, logistic regression is applied to estimate the probability that an instance is an EPB. For the LSTM models, we use the vectors of all words as inputs, because we wanted to ensure the
sequential information we investigated was complete. When words are removed (e.g., low-frequency medications or laboratory test variables), the ordered relations between the remaining words within a sequence may be modified.

Medical concept ranking

Deep learning models with an attention mechanism can be leveraged to rank medical concepts; however, attention-based neural networks increase computational complexity, so we did not use an attention mechanism in this study. Instead, we designed an intuitive way to rank medical concepts via the LSTM models: we rank the importance of a medical concept based on the difference in the area under the ROC curve (AUC) of a deep learning model with and without the medical concept. The larger the difference, the greater the importance of the concept. In addition, we measured the odds ratio for each medical concept, adjusting for the age and race of the patient. We say that a medical concept is associated with EPB when the p-value is smaller than 0.05 after Bonferroni correction and the odds ratio is greater than 1.
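The AUC-difference ranking described above can be sketched as an ablation loop. Here `score_fn` is a hypothetical black box standing in for retraining and evaluating the LSTM on a given set of input sequences; the paper does not specify this step in code.

```python
def rank_concepts_by_auc_drop(score_fn, sequences, labels, concepts):
    """Rank concepts by the drop in AUC when each one is removed.

    score_fn(sequences, labels) -> AUC of the model on the given data
    (an assumed interface). A larger drop indicates a more important concept.
    """
    baseline = score_fn(sequences, labels)
    drops = {}
    for concept in concepts:
        # Remove every occurrence of the concept from every sequence.
        ablated = [[c for c in seq if c != concept] for seq in sequences]
        drops[concept] = baseline - score_fn(ablated, labels)
    # Sort so the most important (largest AUC drop) concept comes first.
    return sorted(drops.items(), key=lambda kv: kv[1], reverse=True)
```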
Experimental Design

We evaluate predictive performance in terms of sensitivity (SN), specificity (SP), positive predictive value (PPV), and area under the ROC curve (AUC).

Evaluation of models in a balanced setting

For each of the 30 cohorts, we use a 3-fold cross-validation strategy, such that two-thirds of the instances are used for training while the remaining one-third are reserved for testing in each fold. Within each cohort, we trained and validated a series of models: i) LR, SVM, and GB built on BOW and TF-IDF weighting, such as LR_BOW and LR_BOW_WEIGHT; ii) LR, SVM, and GB built on word embedding features, such as LR_MEAN, SVM_MEAN, and GB_MEAN; iii) LR, SVM, and GB built on a combination of BOW and word embedding features, such as LR_BOW_MEAN, SVM_BOW_MEAN, and GB_BOW_MEAN; and iv) LSTM built on vectors of words, referred to as LSTM_WORD2VEC. We report the averages and 95% confidence intervals of all models' performances. The models are categorized into 4 groups: non-neural networks without word embedding (e.g., LR_BOW, LR_BOW_WEIGHT), non-neural networks with word embedding (e.g., LR_MEAN), non-neural networks with a combination of BOW and word embedding (e.g., LR_BOW_MEAN), and neural networks with word embedding (LSTM_WORD2VEC). Within each group, the model with the best performance is retained for further evaluation on the evaluation set.

Evaluation of models in an imbalanced setting

We designed an ensemble algorithm to integrate the 30 best models learned from the balanced cohorts to predict each instance in the evaluation set. When the number of models predicting an instance as a case is above a certain threshold 𝛽, the instance is predicted as an EPB. We evaluated 𝛽 over the range of 1 to 30 and report on the performance of the ensemble algorithm on
the evaluation set. In addition, we predict each instance in the evaluation set using models trained on the imbalanced cohort and compare their performance with that of the ensemble models.
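The threshold vote can be sketched as follows (a minimal illustration with our own names; the text says the vote count must be "above" the threshold β, and the inclusive comparison used here is a simplifying assumption):

```python
import numpy as np

def ensemble_predict(model_preds, beta):
    """Flag an instance as EPB when at least `beta` models vote it a case.

    model_preds: (n_models, n_instances) array-like of 0/1 predictions,
    one row per balanced-cohort model. The inclusive >= comparison is an
    assumption; a strict > comparison would also match the text.
    """
    votes = np.asarray(model_preds).sum(axis=0)
    return (votes >= beta).astype(int)
```

Sweeping `beta` from 1 to 30 and scoring each resulting prediction vector reproduces the kind of threshold search described above.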
Results

Predictive performance in the balanced setting

Table 2 depicts the AUC of the four types of models in the balanced setting. All the models were both trained and tested on the balanced datasets. For non-neural networks with BOW or the weighting strategy, SVM_BOW achieved the best AUC (0.769). For non-neural networks with word embedding, LR_MEAN had the highest AUC (0.765). The performance of LR, SVM, and GB built on min, max, and mean aggregations and their combinations is detailed in Table 3. It can be seen that models based on mean values achieved the best performance; moreover, LR_MEAN outperformed SVM_MEAN and GB_MEAN. For non-neural networks with a combination of BOW and word embedding features, LR_BOW_MEAN achieved the best performance, with an AUC of 0.780. Compared to LR_BOW, LR_BOW_MEAN achieved an improvement of 2.49% in AUC. The AUC of LSTM_WORD2VEC was 0.811 with a 95% confidence interval of [0.799, 0.823], which is 4% and 5% higher than LR_BOW_MEAN (mean = 0.780, CI = [0.765, 0.794]) and SVM_BOW (mean = 0.769, CI = [0.754, 0.784]), respectively. Table 4 depicts the AUC, SN, SP, and PPV for the best models (SVM_BOW, LR_MEAN, LR_BOW_MEAN, and LSTM_WORD2VEC) in each category. LR_MEAN had performance similar to SVM_BOW in terms of AUC (0.765 vs. 0.769), SP (0.724 vs. 0.753), and PPV (0.753 vs. 0.754). LR_BOW_MEAN outperformed LR_MEAN and SVM_BOW. LSTM_WORD2VEC outperformed LR_BOW_MEAN with respect to AUC, SN, and PPV (AUC: 0.811 vs. 0.780; SN: 0.797 vs. 0.771; PPV: 0.788 vs. 0.779).
Predictive performance in the imbalanced setting
Based on the above results, we created an ensemble of the LSTM_WORD2VEC models that exhibited the highest AUC in each of the 30 cohorts. We tested the ensemble on the evaluation set with 𝛽 ranging from 1 to 30. The performance of the model is depicted in Figure 2. The ensemble's AUC was maximized at 0.827 when 𝛽 was 16 (as shown in Figure 2a). The ensemble achieved a sensitivity of 0.965 (𝛽 = 16, as shown in Figure 2b), which implies that 96.5% of cases were correctly predicted as such in the evaluation dataset. However, it can also be seen that the specificity of the ensemble was 69.8% (as shown in Figure 2c), which implies that 30.2% of non-EPB deliveries were falsely identified as EPB. The PPV of the ensemble models was extremely low (~3.3% at 𝛽 = 16, as shown in Figure 2d), which indicates that most of the predicted EPB were non-EPB. Table 5 shows the performance of models learned from the balanced and imbalanced cohorts; all the models were tested on the evaluation set. It can be seen that the ensemble models achieved higher AUC (SVM_BOW: 0.728 vs. 0.710; LR_MEAN: 0.749 vs. 0.743; LR_BOW_MEAN: 0.777 vs. 0.732; LSTM_WORD2VEC: 0.827 vs. 0.744) than the models learned from cohorts with the natural EPB ratio. Notably, the LSTM_WORD2VEC ensemble models achieved substantially higher sensitivity (0.965 vs. 0.682) than the LSTM_WORD2VEC models learned in the imbalanced setting.

Medical Concept Ranking

The LSTM ranking, odds ratio, and p-value of each medical concept are depicted in Table 6. We considered medical concepts that exhibited p-values smaller than 0.05 after Bonferroni correction and odds ratios greater than 1 to be associated with EPB; therefore, we only list the medical concepts associated with EPB in Table 6. It can be seen that the medical conditions related to EPB include 1) twin pregnancy, 2) systemic lupus erythematosus, 3) short
cervical length during pregnancy, and 4) hypertensive disorder. Beyond these health-condition risk factors, our study suggested that hydroxychloroquine sulfate, a medication used to treat pregnant women with systemic lupus erythematosus56, was also associated with EPB.
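As a point of reference for the association testing, a simple unadjusted odds ratio with a Woolf confidence interval can be computed from a 2x2 exposure table. This is a sketch with our own names; the paper's odds ratios are additionally adjusted for maternal age and race via multivariable regression, which is omitted here.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Woolf 95% confidence interval.

    a: exposed cases, b: exposed controls,
    c: unexposed cases, d: unexposed controls.
    """
    or_ = (a * d) / (b * c)
    # Woolf's method: standard error of log(OR).
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_) - z * se)
    upper = math.exp(math.log(or_) + z * se)
    return or_, lower, upper
```

An odds ratio whose confidence interval excludes 1 (together with a Bonferroni-corrected p-value below 0.05) would mark a concept as associated with EPB.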
Discussion

This study introduced a deep learning framework with word embedding and bootstrapping to predict EPB. We demonstrated that the RNN ensemble models built on artificially balanced datasets achieved the best predictive performance. Moreover, compared to models relying on known risk factors4,55, the RNN ensemble models achieved better sensitivity (96% vs. 18.2%-62.3%) in predicting preterm birth. We believe this is because deep learning considers the context of each medical concept in the EHR and learns hidden layers representing a variety of EPB and non-EPB patterns from a set of artificially balanced datasets. This study is further notable in that, to the best of our knowledge, it is one of the first to focus on extreme (< 28 weeks) preterm birth, rather than very or moderate to late preterm birth. One of the benefits of machine learning models is that they have the potential to identify risk for idiopathic or spontaneous preterm birth that would not usually be anticipated based only on known medical risk factors (e.g., history of preterm birth or short cervical length). The majority (around two-thirds) of preterm births occur in women with no history of preterm birth2,3,9. Beyond the typical risk factors clinicians use to identify preterm birth in routine care, machine-learned models can consider a broad array of health data to infer patterns related to preterm birth, which can potentially identify high risk for preterm birth that would otherwise go unrecognized. Incorporating machine learning-based predictive models into care may assist clinicians in preparing expectant families and delivery centers to manage at-risk preterm infants. The study also identified five medical concepts associated with EPB: twin pregnancy, systemic
lupus
erythematosus,
short
cervical
length,
hypertensive
disorder,
and
hydroxychloroquine sulfate. The first four medical conditions have face validity as risk factors, as there are various studies that have shown relationships between them and preterm birth. For
instance, Gardner and colleagues found that preterm birth was more likely to occur in twin pregnancies than in singletons58. Johnson and colleagues found that premature rupture of the membranes is very common in pregnancies in women with systemic lupus erythematosus and is the main cause of preterm birth in this population59. Andersen and colleagues found that shorter cervical length was associated with a high risk of preterm delivery60. The National High Blood Pressure Education Program categorized hypertensive disorders during pregnancy as: 1) chronic hypertension, 2) preeclampsia-eclampsia, 3) preeclampsia superimposed on chronic hypertension, and 4) gestational hypertension61. All four types of hypertensive disorder have been recognized to be associated with preterm birth42,62. Hydroxychloroquine sulfate is a treatment for systemic lupus erythematosus, which is itself associated with hypertension and spontaneous preterm birth.
Despite the merits of this study, this project was a pilot, and several limitations are worth acknowledging. First, although the ensemble model achieved an AUC of 0.83 and a sensitivity of 0.96 on the imbalanced evaluation dataset, the positive predictive value was still very low (3.3%). In our analysis, 2,418 (30.2%) controls were misclassified as cases, and 16.2% (393) of these falsely classified controls delivered before the 36th week of gestational age. Specifically, 72% (109) of the 152 deliveries between the 28th and 32nd week of gestational age were misclassified as cases, and 59% (284) of the 482 deliveries between the 32nd and 36th week were falsely identified as cases. Our framework was thus worse at discriminating control deliveries between the 28th and 36th week of gestational age from case deliveries than control deliveries after the 36th week.
One possible explanation is that deliveries between the 28th and 36th week of gestational age may exhibit clinical phenomena similar to those before the 28th week. In future investigations, researchers may need to refine their models to learn patterns that discriminate control deliveries between the 28th and 36th week of gestational age from EPB. Although the PPV of the ensemble model is low, the prediction model is still useful for clinicians because of its high sensitivity (0.965). For instance, imagine 100 patients at the 20th week of gestational age (the prediction time point of our models), only one of whom will have an EPB. Using the ensemble model, we expect approximately 30 patients to be flagged (PPV of 3.3%), and the one patient with EPB is likely to be among them (96.5% sensitivity). Thus, with probability 0.965, clinicians could monitor 30 patients instead of 100 to identify the patient who will have EPB, reducing workload by approximately 70%. Second, the PPV of all EPB prediction models, including those built on artificially balanced data and those built on data with the natural EPB ratio, is very low, and the reasons for the poor PPV are unclear. One explanation is that the models did not capture all characteristics of the controls in the evaluation dataset, or that many controls in the evaluation set had health information similar to that of EPB cases; as a result, many controls were classified as EPB cases. Third, three types of preterm birth (spontaneous, preterm premature rupture of the membranes, and delivery because of maternal or fetal infection) were treated as a single class. Future studies should separate the individual types and design corresponding machine learning models to predict each of them independently. Fourth, the preterm birth cases in this study were all collected from a single healthcare organization in the southeastern United States, so the results may be biased toward this specific institution or region. Additionally, the institution is a tertiary referral center with an advanced maternal-fetal center and fetal surgery program, and thus the patient population includes a large proportion of high-risk patients. The results may not generalize to other healthcare organizations.
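The workload arithmetic in the worked example above can be sketched as a short calculation. This is illustrative only; the cohort size, prevalence, sensitivity, and PPV are the values quoted in the text, and the function name is ours:

```python
# Illustrative sketch of the screening workload arithmetic quoted above.
# Assumed values from the text: 100 patients, 1 EPB case,
# sensitivity 0.965, PPV 0.033.

def screening_workload(n_patients, n_cases, sensitivity, ppv):
    """Return (expected flagged patients, fractional workload reduction)."""
    expected_true_positives = n_cases * sensitivity
    # PPV = TP / flagged, so flagged = TP / PPV
    expected_flagged = expected_true_positives / ppv
    workload_reduction = 1 - expected_flagged / n_patients
    return expected_flagged, workload_reduction

flagged, reduction = screening_workload(100, 1, 0.965, 0.033)
print(round(flagged))       # ~29, i.e., roughly 30 patients to monitor instead of 100
print(round(reduction, 2))  # ~0.71, i.e., roughly a 70% workload reduction
```

The small gap between the computed ~29 and the quoted 30 patients is rounding; the conclusion is unchanged.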
Fifth, we applied only LSTM networks, rather than other recurrent architectures such as gated recurrent units (GRU), to train EPB prediction models. GRUs have shown strong performance in many natural language processing and machine learning tasks, and we will test GRU-based models in future studies of EPB.
Conclusion
This paper introduced a deep learning-based framework with word embedding to predict EPB from information in the EHR. We built models on an artificially balanced dataset and subsequently evaluated them on an imbalanced dataset indicative of real-world settings. The models achieved better performance than traditional machine learning models on both balanced and imbalanced datasets. The models rediscovered known risk factors, including twin pregnancy, systemic lupus erythematosus, short cervical length, and hypertensive disorder, and also uncovered a potentially novel risk factor in the form of hydroxychloroquine sulfate. While the approach shows promise, approximately 30% of controls in the imbalanced setting were incorrectly predicted as EPB. We believe such errors could be reduced by training our model with data from larger and more diverse populations. Still, we believe this framework has the potential to support clinicians in identifying, early in pregnancy, risk for preterm birth that would not otherwise be anticipated in routine care, and to aid families in preparing to deliver in an appropriate setting. However, expert knowledge and clinical judgment may still be needed to interpret this risk and take appropriate action in individual cases.
Funding
This research was supported, in part, by the National Institutes of Health grant R01LM012854.
Contributors
CG performed the data collection and analysis, method design, experimental design, evaluation and interpretation of the results, and drafting and revising of the manuscript. SO, DE, and GJ performed evaluation and interpretation of the results and revising of the manuscript. BM performed method design, experimental design, evaluation and interpretation of the results, and revising of the manuscript. YC performed the data collection and analysis, method design, experimental design, evaluation and interpretation of the results, and drafting and revising of the manuscript.
Acknowledgment
We thank the developers at the Vanderbilt Institute of Clinical and Translational Research for mapping medical concepts in EHRs to common concepts in SNOMED, RxNorm, and LOINC.
Conflicts of Interest
GP is Vice President and Director of the Evaluation Research Center at IBM Watson Health.
References
1. Centers for Disease Control and Prevention. Preterm Birth. Reproductive Health. https://www.cdc.gov/reproductivehealth/maternalinfanthealth/pretermbirth.htm
2. Liu L, Oza S, Hogan D, Chu Y, Perin J, Zhu J, et al. Global, regional, and national causes of under-5 mortality in 2000-15: an updated systematic analysis with implications for the Sustainable Development Goals. Lancet. 2016;388(10063):3027-35.
3. Blencowe H, Cousens S, Oestergaard M, Chou D, Moller AB, Narwal R, Adler A, Garcia CV, Rohde S, Say L, Lawn JE. National, regional and worldwide estimates of preterm birth. The Lancet. 2012 Jun 9;379(9832):2162-72.
4. Mercer BM, Goldenberg RL, Das A, Moawad AH, Iams JD, Meis PJ, Copper RL, Johnson F, Thom E, McNellis D, Miodovnik M. The preterm prediction study: a clinical risk assessment system. American Journal of Obstetrics & Gynecology. 1996 Jun 1;174(6):1885-95.
5. Doyle LW. Outcome at 5 years of age of children 23 to 27 weeks' gestation: refining the prognosis. Pediatrics. 2001 Jul 1;108(1):134-41.
6. Doyle LW, Rogerson S, Chuang SL, James M, Bowman ED, Davis PG. Why do preterm infants die in the 1990s? The Medical Journal of Australia. 1999 Jun;170(11):528-32.
7. Tin W, Wariyar U, Hey E. Changing prognosis for babies of less than 28 weeks' gestation in the north of England between 1983 and 1994. BMJ. 1997 Jan 11;314(7074):107.
8. O'Connor AR, Stephenson T, Johnson A, Tobin MJ, Moseley MJ, Ratib S, Ng Y, Fielder AR. Long-term ophthalmic outcome of low birth weight children with and without retinopathy of prematurity. Pediatrics. 2002 Jan 1;109(1):12-8.
9. MacDorman MF, Gregory EC. Fetal and perinatal mortality: United States, 2013. National vital statistics reports: from the Centers for Disease Control and Prevention, National Center for Health Statistics, National Vital Statistics System. 2015 Jul;64(8):1-24.
10. Sutton L, Bajuk B. Population-based study of infants born at less than 28 weeks' gestation in New South Wales, Australia, in 1992-3. New South Wales Neonatal Intensive Care Unit Study Group. Paediatric and Perinatal Epidemiology. 1999 Jul;13(3):288-301.
11. Wood NS, Marlow N, Costeloe K, Gibson AT, Wilkinson AR. Neurologic and developmental disability after extremely preterm birth. New England Journal of Medicine. 2000 Aug 10;343(6):378-84.
12. Rijken M, Stoelhorst GM, Martens SE, van Zwieten PH, Brand R, Wit JM, Veen S. Mortality and neurologic, mental, and psychomotor development at 2 years in infants born less than 27 weeks' gestation: the Leiden follow-up project on prematurity. Pediatrics. 2003 Aug 1;112(2):351-8.
13. Mikkola K, Ritari N, Tommiska V, Salokorpi T, Lehtonen L, Tammela O, Pääkkönen L, Olsen P, Korkman M, Fellman V. Neurodevelopmental outcome at 5 years of age of a national cohort of extremely low birth weight infants who were born in 1996-1997. Pediatrics. 2005 Dec 1;116(6):1391-400.
14. Saigal S, den Ouden L, Wolke D, Hoult L, Paneth N, Streiner DL, Whitaker A, Pinto-Martin J. School-age outcomes in children who were extremely low birth weight from four international population-based cohorts. Pediatrics. 2003 Oct 1;112(4):943-50.
15. Victorian Infant Collaborative Study Group. Eight-year outcome in infants with birth weight of 500 to 999 grams: continuing regional study of 1979 and 1980 births. The Journal of Pediatrics. 1991 May 1;118(5):761-7.
16. Hack M, Taylor HG, Klein N, Eiben R, Schatschneider C, Mercuri-Minich N. School-age outcomes in children with birth weights under 750 g. New England Journal of Medicine. 1994 Sep 22;331(12):753-9.
17. Taylor HG, Klein N, Minich NM, Hack M. Middle-school-age outcomes in children with very low birthweight. Child Development. 2000 Nov 1;71(6):1495-511.
18. Botting N, Powls A, Cooke RW, Marlow N. Attention deficit hyperactivity disorders and other psychiatric outcomes in very low birthweight children at 12 years. Journal of Child Psychology and Psychiatry. 1997 Nov 1;38(8):931-41.
19. Hack M, Youngstrom EA, Cartar L, Schluchter M, Taylor HG, Flannery D, Klein N, Borawski E. Behavioral outcomes and evidence of psychopathology among very low birth weight infants at age 20 years. Pediatrics. 2004 Oct 1;114(4):932-40.
20. Cooke RW. Health, lifestyle, and quality of life for young adults born very preterm. Archives of Disease in Childhood. 2004 Mar 1;89(3):201-6.
21. Saigal S, Doyle LW. An overview of mortality and sequelae of preterm birth from infancy to adulthood. The Lancet. 2008 Jan 19;371(9608):261-9.
22. Yordy KD. Regionalization of health services: Current legislative directions in the United States. The Regionalization of Personal Health Services. 1976:201-15.
23. McCormick MC, Richardson DK. Access to neonatal intensive care. The Future of Children. 1995 Apr 1:162-75.
24. Merkatz IR, Johnson KG. Regionalization of perinatal care for the United States. Clinics in Perinatology. 1976 Sep 1;3(2):271-6.
25. McCormick MC, Shapiro S, Starfield BH. The regionalization of perinatal services: summary of the evaluation of a national demonstration program. JAMA. 1985 Feb 8;253(6):799-804.
26. Richardson DK, Reed K, Cutler JC, Boardman RC, Goodman K, Moynihan T, Driscoll J, Raye JR. Perinatal regionalization versus hospital competition: the Hartford example. Pediatrics. 1995 Sep 1;96(3):417-23.
27. Lasswell SM, Barfield WD, Rochat RW, Blackmon L. Perinatal regionalization for very low-birth-weight and very preterm infants: a meta-analysis. JAMA. 2010 Sep 1;304(9):992-1000.
28. Goldenberg RL, Cliver SP, Mulvihill FX, Hickey CA, Hoffman HJ, Klerman LV, Johnson MJ. Medical, psychosocial, and behavioral risk factors do not explain the increased risk for low birth weight among black women. American Journal of Obstetrics & Gynecology. 1996 Nov 1;175(5):1317-24.
29. Fiscella K. Race, perinatal outcome, and amniotic infection. Obstetrical & Gynecological Survey. 1996 Jan 1;51(1):60-6.
30. Goldenberg RL, Andrews WW, Faye-Petersen O, Cliver S, Goepfert AR, Hauth JC. The Alabama Preterm Birth Project: placental histology in recurrent spontaneous and indicated preterm birth. American Journal of Obstetrics & Gynecology. 2006 Sep 1;195(3):792-6.
31. Ananth CV, Getahun D, Peltier MR, Salihu HM, Vintzileos AM. Recurrence of spontaneous versus medically indicated preterm birth. American Journal of Obstetrics & Gynecology. 2006 Sep 1;195(3):643-50.
32. Romero R, Espinoza J, Kusanovic JP, Gotsch F, Hassan S, Erez O, Chaiworapongsa T, Mazor M. The preterm parturition syndrome. BJOG: An International Journal of Obstetrics & Gynaecology. 2006 Dec 1;113(s3):17-42.
33. Newman RB, Iams JD, Das A, Goldenberg RL, Meis P, Moawad A, Sibai BM, Caritis SN, Miodovnik M, Paul RH, Dombrowski MP. A prospective masked observational study of uterine contraction frequency in twins. American Journal of Obstetrics & Gynecology. 2006 Dec 1;195(6):1564-70.
34. Gavin NI, Gaynes BN, Lohr KN, Meltzer-Brody S, Gartlehner G, Swinson T. Perinatal depression: a systematic review of prevalence and incidence. Obstetrics & Gynecology. 2005 Nov 1;106(5, Part 1):1071-83.
35. Dayan J, Creveuil C, Marks MN, Conroy S, Herlicoviez M, Dreyfus M, Tordjman S. Prenatal depression, prenatal anxiety, and spontaneous preterm birth: a prospective cohort study among women with early and regular care. Psychosomatic Medicine. 2006 Nov 1;68(6):938-46.
36. Orr ST, James SA, Blackmore Prince C. Maternal prenatal depressive symptoms and spontaneous preterm births among African-American women in Baltimore, Maryland. American Journal of Epidemiology. 2002 Nov 1;156(9):797-802.
37. Schoenborn CA, Horm J. Negative moods as correlates of smoking and heavier drinking: implications for health promotion. Advance Data. 1993 Nov (236):1-6.
38. Zuckerman B, Amaro H, Bauchner H, Cabral H. Depressive symptoms during pregnancy: relationship to poor health behaviors. American Journal of Obstetrics & Gynecology. 1989 May 1;160(5):1107-11.
39. Tamura T, Goldenberg RL, Freeberg LE, Cliver SP, Cutter GR, Hoffman HJ. Maternal serum folate and zinc concentrations and their relationships to pregnancy outcome. The American Journal of Clinical Nutrition. 1992 Aug 1;56(2):365-70.
40. Goldenberg RL, Goepfert AR, Ramsey PS. Biochemical markers for the prediction of preterm birth. American Journal of Obstetrics & Gynecology. 2005 May 1;192(5):S36-46.
41. Leitich H, Brunbauer M, Kaider A, Egarter C, Husslein P. Cervical length and dilatation of the internal cervical os detected by vaginal ultrasonography as markers for preterm delivery: a systematic review. American Journal of Obstetrics & Gynecology. 1999 Dec 1;181(6):1465-72.
42. Goldenberg RL, Culhane JF, Iams JD, Romero R. Epidemiology and causes of preterm birth. The Lancet. 2008 Jan 5;371(9606):75-84.
43. Mercer BM, Goldenberg RL, Das A, Moawad AH, Iams JD, Meis PJ, Copper RL, Johnson F, Thom E, McNellis D, Miodovnik M. The preterm prediction study: a clinical risk assessment system. American Journal of Obstetrics & Gynecology. 1996 Jun 1;174(6):1885-95.
44. Vovsha I, Rajan A, Aouissi AS, et al. Predicting preterm birth is not elusive: Machine learning paves the way to individual wellness. In 2014 AAAI Spring Symposium Series. 2014.
45. Goodwin LK, Iannacchione MA, Hammond WE, et al. Data mining methods find demographic predictors of preterm birth. Nursing Research. 2001;50(6):340-5.
46. Tran T, Luo W, Phung D, et al. Preterm birth prediction: Deriving stable and interpretable rules from high dimensional data. Conference on Machine Learning in Healthcare. LA, USA. 2016.
47. Jesse DE, Seaver W, Wallace DC. Maternal psychosocial risks predict preterm birth in a group of women from Appalachia. Midwifery. 2003 Sep 1;19(3):191-202.
48. World Health Organization (WHO). Preterm Birth. https://www.who.int/news-room/factsheets/detail/preterm-birth. Accessed on 04/05/2019.
49. Goldenberg RL, Iams JD, Mercer BM, Meis PJ, Moawad AH, Copper RL, Das A, Thom E, Johnson F, McNellis D, Miodovnik M. The preterm prediction study: the value of new vs standard risk factors in predicting early and all spontaneous preterm births. NICHD MFMU Network. American Journal of Public Health. 1998 Feb;88(2):233-8.
50. Woolery LK, Grzymala-Busse J. Machine learning for an expert system to predict preterm birth risk. Journal of the American Medical Informatics Association. 1994 Nov 1;1(6):439-46.
51. Goodwin LK, Iannacchione MA, Hammond WE, Crockett P, Maher S, Schlitz K. Data mining methods find demographic predictors of preterm birth. Nursing Research. 2001 Nov 1;50(6):340-5.
52. Choi E, Schuetz A, Stewart WF, Sun J. Medical concept representation learning from electronic health records and its application on heart failure prediction. arXiv preprint arXiv:1602.03686. 2016 Feb 11.
53. Miftahutdinov Z, Tutubalina E. Deep Learning for ICD Coding: Looking for Medical Concepts in Clinical Documents in English and in French. In International Conference of the Cross-Language Evaluation Forum for European Languages 2018 Sep 10 (pp. 203-215). Springer, Cham.
54. Mullenbach J, Wiegreffe S, Duke J, Sun J, Eisenstein J. Explainable prediction of medical codes from clinical text. arXiv preprint arXiv:1802.05695. 2018 Feb 15.
55. Tran T, Luo W, Phung D, et al. Preterm birth prediction: Deriving stable and interpretable rules from high dimensional data. Conference on Machine Learning in Healthcare. LA, USA. 2016.
56. Goldberg Y, Levy O. word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722. 2014 Feb 15.
57. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation. 1997 Nov 15;9(8):1735-80.
58. Gardner MO, Goldenberg RL, Cliver SP, Tucker JM, Nelson KG, Copper RL. The origin and outcome of preterm twin pregnancies. Obstetrics & Gynecology. 1995 Apr 1;85(4):553-7.
59. Johnson MJ, Petri M, Witter FR, Repke JT. Evaluation of preterm delivery in a systemic lupus erythematosus pregnancy clinic. Obstetrics & Gynecology. 1995 Sep 1;86(3):396-9.
60. Andersen HF, Nugent CE, Wanty SD, Hayashi RH. Prediction of risk for preterm delivery by ultrasonographic measurement of cervical length. American Journal of Obstetrics & Gynecology. 1990 Sep 1;163(3):859-67.
61. Report of the National High Blood Pressure Education Program Working Group on High Blood Pressure in Pregnancy. Am J Obstet Gynecol. 2000 Jul;183(1):S1-S22.
62. Demissie K, Rhoads GG, Ananth CV, et al. Trends in preterm birth and neonatal mortality among blacks and whites in the United States from 1989 to 1997. Am J Epidemiol. 2001;154(4):307-315.
Figure Legends
Figure 1. A workflow representation of the preterm birth prediction framework.
Figure 2. Performance of the ensemble model in the imbalanced setting as a function of β.
Graphical abstract
Highlights
1. Extreme preterm birth is associated with significant morbidity and accounts for the majority of newborn deaths.
2. Deep learning models that consider temporal relations documented in electronic health records can predict extreme preterm birth.
3. Deep learning ensemble models trained on artificially balanced data can achieve higher performance than those trained on datasets with the naturally imbalanced extreme preterm birth ratio.
4. Risk factors including twin pregnancy, short cervical length, hypertensive disorder, systemic lupus erythematosus, and hydroxychloroquine sulfate were significantly associated with extreme preterm birth.
Table 1. A summary of the development and evaluation datasets.

                                            Development Dataset             Evaluation Dataset
Study period                                Aug. 9, 2005 - Dec. 31, 2014    Jan. 1, 2015 - Aug. 31, 2017
Number of patients                          17,607                          8,082
Cases                                       132 (0.75%)                     85 (1.1%)
Gestational age at delivery for cases (weeks)
  ≤ 22                                      6 (4.5%)                        9 (10.6%)
  22 - 24                                   12 (9.1%)                       20 (23.5%)
  24 - 26                                   55 (41.7%)                      22 (25.9%)
  26 - 28                                   59 (44.7%)                      34 (40.0%)
Gestational age at delivery for controls (weeks)
  28 - 32                                   251 (1.4%)                      152 (1.9%)
  32 - 36                                   966 (5.5%)                      482 (6.0%)
  > 36                                      16,258 (93.1%)                  7,363 (92.1%)
Mother age (years)
  1st quartile                              24                              25
  Median                                    28                              29
  Mean                                      27.9                            29.1
  3rd quartile                              32                              33
Race of mother (self-reported)
  Caucasian                                 11,993 (68.1%)                  5,662 (70.1%)
  Black                                     3,729 (21.2%)                   1,358 (16.8%)
  Other                                     1,885 (10.7%)                   1,062 (13.1%)
Number of diagnoses per mother
  Median                                    8                               14
  Mean                                      13.2                            20.2
  Standard deviation                        15.4                            20.8
Number of procedures per mother
  Median                                    13                              15
  Mean                                      16.8                            19.7
  Standard deviation                        14.3                            16.6
Number of medications per mother
  Median                                    14                              30
  Mean                                      33.5                            53.9
  Standard deviation                        47.3                            70.1
Number of laboratory studies per mother
  Median                                    17                              24
  Mean                                      26.8                            35.4
  Standard deviation                        24.7                            30.8
Table 2. Model performance with the artificially balanced dataset. The models were assessed on the validation datasets. The thirteen models fall into four categories. Performance is reported as AUC (Area Under the ROC Curve), with low and high bounds.

Category                                  Sub model         AUC mean   Low     High
Baselines with BOW                        LR_BOW            0.761      0.751   0.770
                                          SVM_BOW           0.769      0.754   0.784
                                          GB_BOW            0.763      0.748   0.778
                                          LR_BOW_WEIGHT     0.718      0.702   0.734
                                          SVM_BOW_WEIGHT    0.756      0.746   0.766
                                          GB_BOW_WEIGHT     0.614      0.593   0.635
Baselines with word embedding             LR_MEAN           0.765      0.755   0.770
                                          SVM_MEAN          0.751      0.733   0.768
                                          GB_MEAN           0.750      0.725   0.775
Baselines with BOW and word embedding     LR_BOW_MEAN       0.780      0.765   0.794
                                          SVM_BOW_MEAN      0.761      0.748   0.773
                                          GB_BOW_MEAN       0.770      0.758   0.792
Neural networks with word embedding       LSTM_WORD2VEC     0.811      0.799   0.823
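The bag-of-words (BOW) baselines in Table 2 represent each patient's EHR as an unordered vector of concept counts, discarding the temporal context that the LSTM retains. A minimal sketch of that featurization; the concept strings and vocabulary below are hypothetical stand-ins for the study's actual SNOMED/RxNorm/LOINC code set:

```python
from collections import Counter

# Hypothetical concept vocabulary (illustrative only, not the study's code set).
vocab = ["twin_pregnancy", "sle", "short_cervix", "hypertension", "hcq_200mg"]

def bow_vector(concept_sequence, vocab):
    """Map an ordered EHR concept sequence to an unordered count vector."""
    counts = Counter(concept_sequence)
    return [counts[c] for c in vocab]

# One patient's (hypothetical) concept sequence over pregnancy:
seq = ["sle", "hcq_200mg", "sle", "short_cervix"]
print(bow_vector(seq, vocab))  # [0, 2, 1, 0, 1]
```

Note that the order in which concepts were recorded is lost here, which is one plausible reason the sequence-aware LSTM_WORD2VEC model outperforms the BOW baselines.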
Table 3. The AUC of logistical regression, support vector machine and gradient boosting built on embedding based features including min, max, mean or their combinations. The models were assessed on the validation datasets.
Machine learning algorithm
Logistic regression
Support Vector Machine
Gradient Boosting
Sub_model
AUC Mean
Low
High
MIN
0.727
0.701
0.754
MAX
0.701
0.675
0.727
MEAN
0.765
0.755
0.774
MIN_MAX
0.71
0.684
0.736
MIN_MEAN
0.731
0.705
0.757
MAX_MEAN
0.715
0.689
0.742
MIN_MAX_MEAN
0.713
0.687
0.74
MIN
0.747
0.73
0.765
MAX
0.745
0.713
0.778
MEAN
0.751
0.733
0.768
MIN_MAX
0.734
0.717
0.751
MIN_MEAN
0.744
0.723
0.765
MAX_MEAN
0.735
0.71
0.761
MIN_MAX_MEAN
0.749
0.722
0.776
MIN
0.736
0.711
0.761
MAX
0.725
0.697
0.752
MEAN
0.75
0.725
0.775
MIN_MAX
0.739
0.707
0.771
MIN_MEAN
0.734
0.707
0.761
MAX_MEAN
0.727
0.71
0.744
MIN_MAX_MEAN
0.747
0.726
0.768
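The sub-models in Table 3 differ only in how each patient's sequence of concept embeddings is pooled into a fixed-length feature vector. A minimal sketch of min/max/mean pooling; the 4-dimensional embeddings below are hypothetical, whereas the study used word2vec-trained vectors:

```python
import numpy as np

def pool_embeddings(embeddings, ops=("min", "max", "mean")):
    """Pool an (n_concepts x dim) embedding matrix into one feature vector
    by concatenating the element-wise min/max/mean selected by `ops`."""
    E = np.asarray(embeddings, dtype=float)
    pieces = {"min": E.min(axis=0), "max": E.max(axis=0), "mean": E.mean(axis=0)}
    return np.concatenate([pieces[op] for op in ops])

# Three hypothetical 4-dimensional concept embeddings for one patient:
E = [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.0, 0.1, 0.2],
     [0.3, 0.4, 0.2, 0.0]]

print(pool_embeddings(E, ops=("mean",)))  # the MEAN sub-model's features
print(pool_embeddings(E).shape)           # (12,) for MIN_MAX_MEAN
```

For example, the MEAN sub-model (the strongest single pooling in Table 3) would feed only the element-wise mean vector to the downstream classifier.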
Table 4. Model performance with the artificially balanced dataset. The models were assessed on the validation datasets. Performance is reported as the mean and 95% confidence interval. (AUC = Area Under the ROC Curve; SN = Sensitivity; SP = Specificity; PPV = Positive Predictive Value)

Model           AUC                    SN                     SP                     PPV
SVM_BOW         0.769 (0.754-0.783)    0.749 (0.734-0.764)    0.724 (0.711-0.737)    0.754 (0.723-0.775)
LR_MEAN         0.765 (0.755-0.774)    0.715 (0.694-0.736)    0.753 (0.733-0.774)    0.753 (0.740-0.766)
LR_BOW_MEAN     0.780 (0.765-0.794)    0.771 (0.744-0.798)    0.778 (0.759-0.797)    0.779 (0.763-0.795)
LSTM_WORD2VEC   0.811 (0.799-0.823)    0.797 (0.774-0.821)    0.777 (0.757-0.797)    0.788 (0.776-0.800)
Table 5. Performance of ensemble models trained on data from the artificially balanced and naturally imbalanced settings. The models were assessed on the evaluation dataset. Performance is reported as the mean and 95% confidence interval. (AUC = Area Under the ROC Curve; SN = Sensitivity; SP = Specificity; PPV = Positive Predictive Value)

Model           Setting    AUC                    SN                     SP                     PPV
SVM_BOW         Balanced   0.728 (0.724-0.731)    0.660 (0.649-0.671)    0.796 (0.789-0.802)    0.033 (0.032-0.034)
                Natural    0.710 (0.704-0.716)    0.614 (0.592-0.636)    0.805 (0.789-0.821)    0.032 (0.031-0.033)
LR_MEAN         Balanced   0.749 (0.744-0.753)    0.819 (0.775-0.865)    0.678 (0.634-0.722)    0.026 (0.025-0.027)
                Natural    0.743 (0.739-0.748)    0.631 (0.603-0.650)    0.776 (0.757-0.799)    0.029 (0.028-0.030)
LR_BOW_MEAN     Balanced   0.777 (0.764-0.790)    0.797 (0.737-0.857)    0.756 (0.682-0.830)    0.027 (0.021-0.033)
                Natural    0.732 (0.725-0.739)    0.669 (0.656-0.682)    0.795 (0.776-0.814)    0.045 (0.043-0.047)
LSTM_WORD2VEC   Balanced   0.827 (0.813-0.840)    0.965 (0.945-0.985)    0.698 (0.659-0.736)    0.033 (0.032-0.034)
                Natural    0.744 (0.738-0.750)    0.682 (0.674-0.694)    0.743 (0.741-0.744)    0.028 (0.027-0.029)
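The "Balanced" rows in Table 5 come from ensembles of models trained on artificially balanced bootstrap samples. A minimal sketch of that resampling step, under the assumption that each member dataset keeps all cases and draws an equal-sized random subset of controls (the function and variable names are ours, not the study's code):

```python
import random

def balanced_bootstrap(case_ids, control_ids, n_models, seed=0):
    """Build n_models artificially balanced training sets: every set contains
    all cases plus an equal-sized random sample of controls."""
    rng = random.Random(seed)
    datasets = []
    for _ in range(n_models):
        controls = rng.sample(control_ids, len(case_ids))
        datasets.append(case_ids + controls)
    return datasets

# Toy IDs mimicking the heavy class imbalance (132 cases, thousands of controls):
cases = list(range(132))
controls = list(range(1000, 18000))
sets_ = balanced_bootstrap(cases, controls, n_models=5)
print(len(sets_), len(sets_[0]))  # 5 264
```

One member model would then be trained on each balanced set, and at prediction time the ensemble's score would be obtained by averaging (or voting over) the member outputs.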
Table 6. Medical concepts associated with EPB. All of the listed concepts had p-values smaller than 0.05 after Bonferroni correction and odds ratios greater than 1. The corresponding rank in the LSTM-word2vec model is shown in the leftmost column.

LSTM rank   Concept                                         Patients with concept   EPB   Odds ratio (95% CI)    p-value
12          Twin pregnancy with antenatal problem           397                     20    8.10 (4.98-13.18)      7.46 × 10^-33
18          Systemic lupus erythematosus                    63                      8     20.43 (9.53-43.80)     3.72 × 10^-7
62          Hydroxychloroquine Sulfate 200 MG               39                      5     20.20 (7.77-52.47)     4.15 × 10^-13
64          Hydroxychloroquine Sulfate 200 MG [Plaquenil]   47                      4     12.67 (4.48-35.81)     1.33 × 10^-10
65          Short cervical length in pregnancy              54                      6     17.29 (7.27-41.13)     5.30 × 10^-9
86          Hypertensive disorder                           264                     10    5.56 (2.88-10.71)      1.42 × 10^-11