An intelligent warning model for early prediction of cardiac arrest in sepsis patients


Computer Methods and Programs in Biomedicine 178 (2019) 47–58

Contents lists available at ScienceDirect

Computer Methods and Programs in Biomedicine journal homepage: www.elsevier.com/locate/cmpb

Samaneh Layeghian Javan a, Mohammad Mehdi Sepehri a,∗, Malihe Layeghian Javan b, Toktam Khatibi a

a Faculty of Industrial and Systems Engineering, Tarbiat Modares University, Tehran 1411713116, Iran
b Mashhad University of Medical Sciences, Mashhad, Iran

Article info

Article history: Received 1 December 2018; Revised 31 May 2019; Accepted 10 June 2019

Keywords: Intelligent warning model; Heart arrest; Prediction; Sepsis

Abstract

Background: Sepsis-associated cardiac arrest is a common problem with a low survival rate. Early prediction of cardiac arrest can provide the time required to intervene and prevent its onset, thereby reducing mortality. Several studies have used machine learning to predict cardiac arrest. However, no previous research has used machine learning to predict cardiac arrest in adult sepsis patients. Moreover, the potential of some techniques, including ensemble algorithms, to improve prediction outcomes has not yet been addressed. Methods are needed that generate high-performance predictions with a sufficient time lapse before the arrest. In this regard, various variables and parameters should also be examined.

Objective: The aim was to use machine learning to propose a cardiac arrest prediction model for adult patients with sepsis. The model should predict the arrest several hours before its incidence with high efficiency. A further goal was to investigate the effect of the time series dynamics of vital signs on the prediction of cardiac arrest.

Method: Thirty hours of clinical data for each sepsis patient were extracted from the MIMIC-III database (79 cases, 4532 controls). Three datasets (multivariate, time series and combined) were created. Various machine learning models for six time groups were trained on these datasets. The models included classical techniques (SVM, decision tree, logistic regression, KNN, GaussianNB) and ensemble methods (gradient boosting, XGBoost, random forest, balanced bagging classifier and stacking). Solutions were proposed to address the challenges of missing values, imbalanced classes and irregularity of the time series.

Results: The best results were obtained using a stacking algorithm and the multivariate dataset (accuracy = 0.76, precision = 0.19, sensitivity = 0.77, f1-score = 0.31, AUC = 0.82). The proposed model predicts the arrest up to six hours in advance with accuracy and sensitivity over 70%.

Conclusion: We showed that machine learning techniques, especially ensemble algorithms, have high potential for use in prognostic systems for sepsis patients. Compared with existing warning systems, including APACHE II and MEWS, the proposed model significantly improved the evaluation criteria. According to the results, the time series dynamics of vital signs are of great importance in predicting cardiac arrest in sepsis patients.

© 2019 Elsevier B.V. All rights reserved.

Abbreviations: Acc, Accuracy; APACHE, Acute Physiologic And Chronic Health Evaluation; AUC, Area Under ROC Curve; B, Balancing; BM, Bedside Monitor; bp, Blood Pressure; CA, Cardiac Arrest; CHP, Configuration Hyper Parameters; D, Dataset; DSR, Design Science Research; DT, Decision Tree; ECG, Electrocardiogram; ENR, Electronic Nursing Record; EHR, Electronic Health Record; F1, F1-score; FPR, False Positive Rate; FS, Feature Selection; GLM, Generalized Linear Model; HRV, Heart Rate Variability; ICU, Intensive Care Unit; KNN, K-Nearest Neighbor; LAB, Laboratory; LR, Logistic Regression; MEWS, Modified Early Warning Score; MICE, Multivariate Imputation by Chained Equations; MLT, Machine Learning Technique; NB, Naïve Bayes; NN, Neural Network; Prec, Precision; RF, Random Forest; ROSC, Return of Spontaneous Circulation; Sen, Sensitivity; Spec, Specificity; SVM, Support Vector Machine; T, Time.

∗ Corresponding author: M.M. Sepehri.

https://doi.org/10.1016/j.cmpb.2019.06.010
0169-2607/© 2019 Elsevier B.V. All rights reserved.


S. Layeghian Javan, M.M. Sepehri and M. Layeghian Javan et al. / Computer Methods and Programs in Biomedicine 178 (2019) 47–58

1. Introduction

Sepsis is a leading cause of death in general intensive care units (ICU) [1]. Its incidence is rising, with a reported rate in the USA of 436 severe cases per 100,000 persons in 2012 and an overall mortality of 17.5% [2]. Many sepsis cases result in cardiac arrest (CA) with poor outcome. Patients with sepsis are less likely to achieve return of spontaneous circulation (ROSC) or to survive to hospital discharge following in-hospital cardiac arrest [3]. Many cardiac arrests may be preventable. Since cardiopulmonary resuscitation in sepsis patients is challenging and usually unsuccessful, more research is required to prevent CA in these patients. Medical experts believe that early cardiac interventions can reduce mortality if CA is predicted before it happens.

Considering the importance of this issue, several studies have been conducted to predict the risk of CA. Traditional studies used standard statistical methods aimed at identifying group-level differences and often used a limited number of variables for prediction. Today, modern technologies provide a wealth of physiological data and clinical parameters that require robust and inexpensive computing tools to process. Machine learning, a subset of artificial intelligence, helps to analyze complex data automatically and produces significant results. Javan et al. [4] reviewed the use of machine learning algorithms in predicting cardiac arrest. According to this review, machine learning techniques (MLT) have high potential to serve as physicians' assistants for early prediction of cardiac arrest. Compared to traditional prediction methods such as regression, machine learning techniques provide better performance in many cases. However, the potential of some techniques, including ensemble algorithms, to improve prediction outcomes has not yet been addressed. It is also necessary to find methods for generating high-performance predictions with a sufficient time lapse before the arrest.

In recent years, some studies have also analyzed the time series of vital signs to produce early warning systems [5–10]. Physiological systems generate complex dynamics in their output signals, reflecting the state of the body's control systems. How the dynamics of these signals affect prediction outcomes remains to be investigated.

Considering the above motivations, an exploratory study was conducted to evaluate the efficiency of different classification algorithms in early prediction of cardiac arrest in adult sepsis patients. Our purpose was to maximize the prediction performance and to make predictions as early as possible. Another goal was to investigate whether the time series dynamics of vital signs improve the efficiency of cardiac arrest prediction in sepsis patients.

To meet these goals, we extracted 30 h of clinical data for adult sepsis patients recorded in the MIMIC-III database. Three datasets (multivariate, time series and combined) were created from the extracted data. Preprocessing tasks were performed to address the challenges of missing values, outliers, imbalanced classes and irregularity of the time series. Various machine learning techniques, including classical models and ensemble methods, were trained on the three datasets to predict CA occurrence in six time groups. We used sensitivity, precision and f1-score as the main criteria to evaluate the models. The best model in the 1-hour group was used to predict CA incidence in the other time groups.

The remainder of the paper is organized as follows. Section 2 describes the background and related works on CA prediction using machine learning. Section 3 details the proposed method. Section 4 presents the models and the results achieved. Section 5 provides comparisons and discusses the obtained results. Finally, Section 6 concludes the paper and presents limitations and recommendations for future work.

2. Background

2.1. Machine learning

In this study, since we had labeled records, the supervised learning approach was used. Several classification models, including classical methods (support vector machine (SVM), decision tree (DT), logistic regression (LR), k-nearest neighbor (KNN), GaussianNB) and ensemble methods (gradient boosting, XGBoost, random forest (RF), balanced bagging classifier and stacking), were used to classify the sepsis patients into two groups, normal and cardiac arrest. Ensemble methods combine various algorithms to improve the predictive efficiency of the models. All data mining tasks of this research were done in Python.

2.2. Early warning scores

In this paper, two scoring variables, APACHE II and MEWS, which are standard criteria for determining a patient's health status, were calculated as latent variables and added to the feature set. These scores can improve the understanding of the real condition of the patients and bring some benefits to the ICU management process [11].

2.2.1. MEWS

MEWS (Modified Early Warning Score) [12] is a composite score commonly used by medical staff to determine the severity of an illness. It evaluates the risk of a patient's death based on four physiological parameters (systolic blood pressure, heart rate, respiration rate, temperature) and the patient's level of consciousness. Each observation is assigned a score between 0 and 3. A total score of 5 or more is associated with an increased risk of death and indicates the need for immediate admission to the ICU.

2.2.2. APACHE II

APACHE II (Acute Physiologic And Chronic Health Evaluation II) [13] is one of the most widely used ICU scoring systems in the world. It consists of three components: (1) acute physiology score (APS); (2) age adjustment; (3) chronic health evaluation (CHE). APACHE II is calculated using the worst values of these parameters recorded during the first 24 h of the patient's hospitalization. Each patient receives a score between 0 and 71.

2.3. Design science research

The Design Science Research (DSR) paradigm is fundamentally a problem-solving paradigm that seeks to enhance human knowledge through the creation of innovative artifacts [14]. The artifacts include, but are not limited to, algorithms, human/computer interfaces, and system design methodologies or languages. In this paper, we followed the DSR methodology proposed by Peffers et al. [15]. This approach consists of six steps: problem identification (understanding the research objectives and requirements), definition of the expected results (quantitative or qualitative), design and development (using the existing theoretical knowledge to propose artifacts), demonstration (using the artifact to solve the problem), evaluation (comparing the artifact's performance results with the requirements for solving the problem) and communication (employing academic literature to present the rigor of the research, as well as the effectiveness of the solution).

2.4. Related works

Several studies on cardiac arrest prediction have been conducted using machine learning techniques. However, no previous research was found that used machine learning to predict cardiac arrest in adult sepsis patients [4]. In this paper, we aim to


classify normal/CA adult sepsis patients; therefore, only studies with a classification purpose were investigated. In the related literature, systematically reviewed by Javan et al. [4], the majority of classification studies used only electrocardiogram (ECG) or heart rate variability (HRV) parameters to classify normal and CA cases (24 out of 28 papers). In the current paper, ECG/HRV signals were not used. Therefore, to keep the comparison homogeneous, we did not consider the studies that analyzed ECG/HRV parameters.

Portela et al. [16] trained four machine learning techniques, Generalized Linear Models (GLM), SVM, DT and Naïve Bayes (NB), to classify cardiac arrhythmia/normal cases. The authors used admission and complementary data (vital signs, laboratory results, therapeutics) as input variables. Several scenarios were developed on the input variables and machine learning models were induced for each scenario. The SVM technique produced the best results in this study.

Kennedy et al. [17] used MLT to predict shock-induced CA in a pediatric intensive care unit, using time series analysis as input. The variables included data from code sheets, the hospital's data repository, and 12 h of data extracted from physiologic monitors. The authors trained 20 arrest prediction models using a matrix of five feature sets and four modeling algorithms: LR, DT, neural network (NN) and SVM. The best results were obtained by the SVM model.

Somanchi et al. [18] used SVM and LR for early prediction of CA. Input variables included demographic information, hospitalization history, and vitals and laboratory measurements from the last 24 h. Moreover, the authors extracted trend features of the vital signs. SVM produced the highest values of the performance criteria.

Churpek et al. [19] evaluated the capability of MLTs to detect clinical deterioration on the wards. Demographic variables, laboratory values, and vital signs were used to predict the combined outcome of cardiac arrest, intensive care unit transfer, or death. Two LR models (linear predictor terms and restricted cubic splines), tree-based models, KNN, SVM and NN algorithms were examined. The random forest algorithm was the most accurate model.

Although the previous studies have proven quite successful in predicting CA, further effort is required to produce higher prediction performance together with a sufficient time lapse before the event. Moreover, further research is needed to evaluate the effectiveness of less frequently used algorithms for CA prediction. Finally, because of the dynamic nature of physiological systems, more studies are needed that analyze the time series of patients' longitudinal physiological data for CA prediction.

3. Method


3.1. Data

In this study, clinical data from the MIMIC-III database [20] were used for analysis and modeling. MIMIC-III is a large, freely-available database comprising de-identified health-related data associated with over 40,000 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. Only a small fraction of patients in the MIMIC database have associated ECG waveforms. As a result, we did not use ECG signals for analytic tasks.

The purpose of this retrospective cohort study was to predict the occurrence of CA in adult patients with sepsis. Out of 7769 sepsis records, 127 CA cases with a registered incidence time were found. We intended to investigate how much earlier CA incidence can be predicted, and whether it can be predicted earlier than 24 h before the event. In addition, we were interested in examining the effect of trend features on prediction performance. Therefore, we needed several hours of patient data before the prediction time. Considering the CA patients' lengths of stay, and in order to maximize the number of included samples, the 30-hour interval was selected as the minimum length of stay. Consequently, patients under the age of 15 and those with a length of stay of less than 30 h were excluded from the dataset. In addition, because timing is central to this study, only cases in which the time of occurrence was recorded and CA occurred after the diagnosis of sepsis were selected as cardiac arrests. Finally, 79 records were selected as CA cases and 4532 records were included in the normal group.

3.2. Candidate features

A combination of approaches, including research background [17,21], a conceptual model, and statistical analysis, was used to identify acceptable attributes. However, many of these attributes were not electronically available. Moreover, some clinically useful concepts that were not explicitly measured were calculated as latent variables. These variables were derived from the multivariate physiological dataset and the time series of vital signs. In this study, two scoring variables, APACHE II and MEWS, were calculated and added to the feature set. The latent trend features (slope, intercept, mean, median, minimum, maximum, standard deviation, first quartile and third quartile) were also derived from various windows of the time series before the occurrence of CA. The data sources used for collecting these features were electronic health records (EHR), bedside monitors (BM) for vital signs, laboratories (LAB) and electronic nursing records (ENR) for manually entered patient values. Table 1 presents the list of candidate variables and their respective data sources.

3.3. Determine the appropriate time interval for extracting variables

In this study, the prediction time is denoted by T_p, and the time of CA incidence, which is h hours after T_p, is denoted by T_E. We are interested in predicting the outcome Y_i at T_E, defined as:

$$
Y_i =
\begin{cases}
1, & \text{if patient } i \text{ goes into cardiac arrest at } T_E \\
0, & \text{if patient } i \text{ does not go into cardiac arrest at } T_E
\end{cases}
$$

Table 1. Candidate features for predicting cardiac arrest in sepsis patients.

| Feature type | Feature | Data source |
|---|---|---|
| Multivariate features | Age, Gender, Tobacco, ICU unit, Acute renal failure (ARF), History of ICU admission | EHR |
| Multivariate features | Glasgow coma scale (GCS), Level of consciousness (AVPU) | ENR |
| Multivariate features | Pao2, Hco3, Sodium, Creatinine, Potassium, Hematocrit, Arterial ph, Hemoglobin, Blood lactate, Platelet count, Blood urea nitrogen (BUN), White blood cell count (WBC) | LAB |
| Time series | Spo2, Heart rate, Respiratory rate | BM |
| Time series | Temperature, Mean blood pressure, Systolic blood pressure, Diastolic blood pressure | BM / ENR |
| Clinical latent features | MEWS score, APACHE II score, Trend features | Calculated |
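The trend features listed among the clinical latent features (slope, intercept, mean, median, minimum, maximum, standard deviation, first and third quartile) can be illustrated with a minimal sketch. It is not the authors' code: the function name `trend_features`, the use of the bucket index as the time axis, the sample standard deviation, and the inclusive quartile method are all assumptions made for illustration.

```python
from statistics import mean, median, quantiles, stdev


def trend_features(values):
    """Summarise one window of an equally spaced vital-sign series.

    Assumptions (not stated in the paper): the time axis is the sample
    index 0..n-1, the standard deviation is the sample version, and
    quartiles use the 'inclusive' interpolation method.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a line")

    # Ordinary least-squares line through (index, value) pairs.
    x_bar = (n - 1) / 2
    y_bar = mean(values)
    sxx = sum((x - x_bar) ** 2 for x in range(n))
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "slope": slope,
        "intercept": intercept,
        "mean": y_bar,
        "median": median(values),
        "min": min(values),
        "max": max(values),
        "std": stdev(values),
        "q1": q1,
        "q3": q3,
    }


# Hypothetical heart-rate window rising by 2 beats per bucket.
features = trend_features([80, 82, 84, 86, 88])
```

Applying such a function to several window lengths of each vital sign would yield one feature vector per patient, which is how a time series can be fed to classifiers that expect tabular input.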




To predict the occurrence of CA, information prior to T_p was collected from both case and control patients. We have T_E − T_p = h with h ∈ {1, 2, 3, 4, 5, 6}; therefore, there are six different classification problems. As an example, Fig. 1 illustrates the appropriate time interval for extracting the values of variables in order to predict CA one hour (Fig. 1(a)) and two hours (Fig. 1(b)) before the incidence.

Fig. 1. Determining the appropriate time interval for data extraction.

3.4. Determining the event time

For CA patients, the time of CA incidence is considered the event time (T_E). For control patients, we chose the middle 30-hour interval between hospitalization and discharge as the data collection interval and considered the end of that interval as T_E.

3.5. Extracting variables

In this study, variables were collected in two ways. One group of variables is registered at admission and does not change during the hospitalization, including age, gender, height, weight, ICU type, hospitalization IDs, history of ICU admission and history of smoking. Other variables, including laboratory variables and vital signs, may be recorded once, several times, or never during the patient's hospitalization. We treated the patient's vital signs (heart rate, diastolic blood pressure, systolic blood pressure, mean blood pressure, respiration rate, spo2 and temperature) as time series. The patient's demographics and laboratory test results were chosen as multivariate variables. For multivariate features, the last value recorded before T_p was selected; for the time series features, all values recorded in the 30-hour interval were collected.

3.6. Statistical analysis of the data

To better understand the data, Table 2 presents the distribution (percentage of cases) of the categorical variables for the arrest and normal groups. This analysis was conducted before the preprocessing tasks.

Table 2. The distribution of the categorical variables in each class of data.

| Variable | Value | Normal group | Arrest group |
|---|---|---|---|
| AVPU | Alert | 50.1% | |
| AVPU | Arouse to Pain | 2.4% | |
| AVPU | Arouse to Stimuli | 2.4% | |
| AVPU | Arouse to Stimulation | 2.7% | |
| AVPU | Arouse to Voice | 15.3% | |
| AVPU | Confused | 0.4% | |
| AVPU | Dozing Intermit | 1.3% | |
| AVPU | Lethargic | 6.1% | |
| AVPU | Other/Remark | 0.1% | |
| AVPU | Paralytic Med | 0.4% | |
| AVPU | Sedated | 4.6% | |
| AVPU | Sleeping | 2.1% | |
| AVPU | Unresponsive | 4.6% | |
| Male | 0 | 42.2% | 36.7% |
| Male | 1 | 57.8% | 63.3% |
| ARF | 0 | 43.5% | 20.3% |
| ARF | 1 | 56.5% | 79.7% |
| GCS | 3 | 2.4% | |
| GCS | 4 | 0.3% | |
| GCS | 5 | 0.2% | |
| GCS | 6 | 1.3% | |
| GCS | 7 | 1.8% | |
| GCS | 8 | 2.4% | |
| GCS | 9 | 3.2% | |
| GCS | 10 | 4.7% | |
| GCS | 11 | 5.7% | |
| GCS | 12 | 0.6% | |
| GCS | 13 | 1.6% | |
| GCS | 14 | 4.6% | |
| GCS | 15 | 19.7% | |
| Tobacco | 0 | 88.9% | 87.3% |
| Tobacco | 1 | 11.1% | 12.7% |
| History of ICU admission | 0 | 58.0% | 62.0% |
| History of ICU admission | 1 | 42.0% | 38.0% |

Note: the arrest-group values for AVPU (13.7%, 5.3%, 7.6%, 19.8%, 2.3%, 11.5%) and GCS (0.8%) are listed here without row assignment.


According to Table 2, the normal and CA groups have similar distributions for gender, tobacco usage and history of ICU admission. Although the AVPU variable has many missing values, a large percentage of normal patients have "Alert" status, and the status "Unresponsive" is more common in the CA group. Only 0.8% of the CA group and 48% of the normal group have valid values for GCS. The percentage of patients with ARF was larger in the CA group than in the normal group.

Numerical variables were also analyzed statistically. The Kolmogorov-Smirnov test showed that the distributions of all variables were normal. An independent-samples t-test found no significant differences between the CA and normal groups in the mean values of platelet count, hemoglobin, spo2, pao2, hematocrit, hco3 and temperature. However, the group means differed significantly for age, BUN, arterial ph, creatinine, blood lactate, WBC count, potassium, respiratory rate, heart rate, systolic bp, diastolic bp and mean bp.

3.7. Data preprocessing

The challenges that need to be addressed are irregularity of the time series, missing values, outliers and imbalanced classes. The preprocessing step addresses them with the solutions described below.

3.7.1. Solving the problem of irregular time series with the bucketing technique

Different vital signs may be recorded at irregular intervals and with different frequencies. Many classification models are not inherently designed to classify streaming data. Although some of them can be adapted to classify time series, they require data with regular time intervals and replications. To solve the problem of irregular and multi-scale gaps in the time series, we used the bucketing technique. We divided the 30-hour window into 30 sequential buckets of 1 h each.
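As a rough illustration of bucketing, the sketch below averages irregular measurements into fixed one-hour buckets and then fills empty buckets using the rule given in Section 3.7.2 (previous bucket first, otherwise the next one). The paper reports doing this step in PostgreSQL; this Python version, its function names, and the `(hours_since_window_start, value)` input layout are assumptions made purely for illustration.

```python
def bucket_hourly(measurements, n_buckets=30):
    """Average irregular measurements into fixed one-hour buckets.

    measurements: iterable of (hours_since_window_start, value) pairs
    (a hypothetical input layout). Bucket k covers [k, k+1) hours.
    Buckets without any measurement are returned as None.
    """
    sums = [0.0] * n_buckets
    counts = [0] * n_buckets
    for t, value in measurements:
        if t < 0:
            continue  # measurement taken before the window starts
        idx = int(t)
        if idx < n_buckets:
            sums[idx] += value
            counts[idx] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


def fill_gaps(series):
    """Fill None entries with the previous value, else the next value."""
    filled = list(series)

    # Forward pass: carry the last observed value into later gaps.
    last = None
    for i, v in enumerate(filled):
        if v is None:
            filled[i] = last
        else:
            last = v

    # Backward pass: only leading gaps remain; use the next observed value.
    nxt = None
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is None:
            filled[i] = nxt
        else:
            nxt = filled[i]
    return filled


# Hypothetical heart-rate readings at 0.2 h, 0.7 h and 2.5 h.
raw = [(0.2, 80), (0.7, 84), (2.5, 90)]
regular = fill_gaps(bucket_hourly(raw, n_buckets=4))
```

A series with no measurements at all stays entirely `None`, so such cases would still have to be excluded or imputed by other means.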
The measurement values were then averaged within each bucket, so each time series consisted of 30 values at regular one-hour intervals. These calculations were carried out in PostgreSQL.

3.7.2. Imputing missing values

To address the problem of missing values, two different approaches were used for multivariate features and time series. Some multivariate features were missing 37% to 52% of their values and were therefore removed. For the remaining multivariate features, every record containing missing values was removed from the normal group (1851 records were deleted and 2681 records remained in the normal group). In the arrest group, such records were not deleted; their missing values were filled in using the multivariate imputation by chained equations (MICE) method [22]. MICE completes an incomplete column using data from the other columns and produces plausible synthetic values. Each incomplete column acts as a target variable with a specific set of predictor variables. The default set of predictors for a given target consists of all the other columns.

The time series of temperature and of systolic, diastolic and mean blood pressure were missing 50% to 70% of their values. Temperature was omitted because of its high rate of missing values, but the blood pressure variables were too important to eliminate. Since the number of CA cases was limited, we intended to avoid removing these records. For this purpose, we chose CA cases that had at least 25 values for each of their time series (53 CA cases). We also selected instances from the control group with at least 27 values for each of their time series (753 control patients). Missing values in a time series were filled with the value in the previous bucket. If there was no previous value, the next bucket value was used.

3.7.3. Removing outliers

To remove outliers, the acceptable range of each variable was determined according to the opinion of medical experts. Values outside the acceptable range were eliminated and then imputed using the MICE method.

3.7.4. Balancing the data

The class distribution of the dataset was highly imbalanced. Out of the 4611 patients with sepsis, there were only 79 CA cases (1.7%). Traditional classification methods do not perform well with imbalanced datasets, because they optimize overall performance and tend to assign most samples to the majority class. Various methods have been proposed to solve the problem of imbalanced data; they are generally classified into data-level and algorithm-level approaches. We examined different methods for dealing with imbalanced data using both approaches. For balancing at the algorithm level, the BalancedBaggingClassifier, XGBoost and GradientBoosting techniques, which are standard ensemble methods, as well as data weighting techniques, were used. Balancing methods at the data level included SMOTE oversampling, under sampling with ClusterCentroids, NearMiss and RandomUnderSampler, and a combination of oversampling and under sampling with the SMOTEENN technique.

Ultimately, the use of an unsupervised method for under sampling produced significant improvements in the results. We clustered the control class into r groups, with r being the number of CA cases. For each cluster, only the center of the cluster (medoid) was kept. The model was then trained with the CA cases and the medoids only.

3.7.5. Feature selection

In this study, due to the large number of models and the need for high speed, only filter methods were used for feature selection.
These methods include feature selection with the random forest algorithm, embedded functions of algorithms including l1 regularization in logistic regression and SVM, and the mutual_info, classif and selectKBest methods. In the multivariate dataset, feature selection often reduced the models' results. In the time series and combined datasets, which contained more than 300 variables, selectKBest produced better results than the other feature selection methods.

3.8. Evaluation criteria

Various criteria were used to evaluate the models, but the two most important criteria in this study were "precision" and "sensitivity", which are calculated as follows:

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}$$

where TP, FP, TN and FN denote "true positive", "false positive", "true negative" and "false negative", respectively. It is very important to identify all cases of cardiac arrest; therefore, a high value of sensitivity is needed. The number of false alarms should also be reduced; hence, the value of precision should be increased. In this study we mainly used F1-score as the model selection criterion, which is computed as follows:

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}$$

F1-score is the harmonic mean of sensitivity and precision. When comparing the models, we considered sensitivity the second priority after F1-score.
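The three criteria above can be computed directly from confusion-matrix counts. The following is a minimal sketch; the function names and the example counts are hypothetical and are not taken from the paper's results.

```python
def sensitivity(tp, fn):
    """Fraction of actual cardiac arrests that were flagged: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Fraction of alarms that were real arrests: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and sensitivity."""
    p = precision(tp, fp)
    s = sensitivity(tp, fn)
    return 2 * p * s / (p + s) if (p + s) else 0.0


# Hypothetical counts: 60 arrests caught, 19 missed, 250 false alarms.
tp, fn, fp = 60, 19, 250
sens = sensitivity(tp, fn)   # 60 / 79
prec = precision(tp, fp)     # 60 / 310
f1 = f1_score(tp, fp, fn)    # equals 2*TP / (2*TP + FP + FN)
```

With counts like these, sensitivity can be high while precision stays low, so the F1-score is pulled toward the weaker of the two.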


Moreover, implementing a warning system in the real world requires a high degree of specificity. It is therefore necessary to balance sensitivity and specificity. The ROC curve graphically displays the trade-off between sensitivity and specificity, and the area under the ROC curve (AUC) provides a useful summary measure for evaluating performance.
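One way to see what the AUC measures is its rank interpretation: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it by direct pairwise comparison. This is an illustrative formulation, not the routine used in the paper, and the function name `roc_auc` and the example data are assumptions.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for cardiac arrest, 0 for control.
    scores: model outputs, higher meaning more likely arrest.
    Quadratic in the number of samples, which is fine for small examples.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")

    # A correctly ordered pair counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores: one positive is ranked below one negative.
auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
```

Because the measure depends only on how positives rank against negatives, changing the ratio of the two classes does not by itself change the result.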

We conducted Grid-search method to find the optimal hyper parameters of the models. The grid search algorithm was guided by f1-score metric, measured by cross-validation on the training set.

3.8.1. Evaluating criteria for imbalanced datasets “Accuracy” is not a proper measure of performance for imbalanced datasets. With the preprocessed dataset in this study, if a model just makes a dummy prediction that all samples belong to the control group, the accuracy will be 97% (2681/2760). On the other hand, since the number of positive samples is much less than the number of negative samples, even a small amount of FPR generates a large number of false alarms (FP) and reduces the amount of precision. As a result, it is very difficult to have an acceptable precision with a high value of sensitivity with imbalanced datasets. However, the AUC criterion does not depend on the degree of data balancing. A model can discriminate the two classes well in an imbalanced dataset, but still has a low sensitivity, because it is very difficult to identify rare examples.

In this section we model one-hour group that predicts CA incidence one hour before it occurs.

4. Modelling and results In this study, different models for six time classes were trained on three different data sets: •





The first dataset contains multivariate features including demographics and the last values of vital signs and laboratory test results. The second dataset involved trend features extracted from time series of patient’s vital signs. The third dataset consisted of a combination of multivariate and trend features.

As previously mentioned, classification models are divided into six different groups in terms of the time of prediction. One group predicts whether CA will occur in the next hour? The next group estimates the likelihood of CA incidence in the next two hours, and so on. The following tuple can be used to represent each learning model:

4.1. Modeling one-hour group

4.1.1. Modeling with multivariate dataset
After handling missing values, 79 CA cases and 2681 control records remained in the multivariate dataset. We used stratified 10-fold cross validation to evaluate the predictive models [23]. In this method, the original sample is randomly split into k mutually exclusive subsets of approximately equal size, and the inducer is trained and tested k times. Each time, a single subset is held out as validation data to test the model and the remaining k−1 subsets are used as training data, so that each subset is selected exactly once as the validation data. The final estimate is the average of the k fold results. In stratified cross validation, the folds are built so that they contain approximately the same proportions of labels as the original dataset [23]. We trained logistic regression and SVM algorithms with the FS1 to FS5 feature selection methods and all balancing techniques:

2 MLT × 5 FS × 7 B = 70 models
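The stratified splitting described above can be illustrated with a minimal pure-Python sketch. It is a simplified stand-in, not the routine used in the study; the function name `stratified_k_fold` and the round-robin dealing scheme are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs whose folds keep the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin; the offset continues where the previous
        # class stopped, so the total fold sizes stay as even as possible.
        for pos, idx in enumerate(indices):
            folds[(pos + offset) % k].append(idx)
        offset += len(indices)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train_idx, test_idx

# Class sizes of the multivariate dataset: 79 CA cases (1) and 2681 controls (0).
labels = [1] * 79 + [0] * 2681
splits = list(stratified_k_fold(labels, k=10))
cases_per_fold = [sum(labels[i] for i in test) for _, test in splits]
print(cases_per_fold)
```

With 79 cases and 10 folds, each validation fold in this sketch receives 7 or 8 CA cases, so every fold contains positives at roughly the original ratio.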

The FS1 method (l1 regularization) can only be used to select features in SVM and logistic regression models. Moreover, the results showed that the FS2 (SelectKBest) method usually selected the variables common to both FS3 and FS4, and it produced better results than these two methods. Therefore, FS3 and FS4 were not used in the subsequent models. The other classical machine learning techniques (MLT3 to MLT5) and the stacking model (MLT10) were trained with the FS2 and FS5 feature selection techniques and all balancing methods:

3 MLT × 2 FS × 7 B = 42 models

In the random forest algorithm, the proper features were selected by the model:

1 MLT × 1 FS × 7 B = 7 models

In this notation, each learning model is represented by the tuple

Model = (T, D, MLT, FS, B, CHP)

where T is the time group, D refers to the dataset, MLT is the machine learning technique, FS is the feature selection method, B is the balancing method and CHP is the configuration of the hyperparameters of the machine learning model. Table 3 presents the values of the tuple elements. Each value is shown with an index; for example, MLT2 represents the SVM technique. Table 3 does not include CHP values because each machine learning model has its own specific hyperparameters.

The Gradient Boosting, XGBoost and Balanced Bagging Classifier techniques balance the data internally. FS2 and FS5 methods were used to select features for these algorithms:

3 MLT × 2 FS = 6 models

In total, 125 different models were induced on the multivariate data set for the 1-hour group: 70 + 42 + 7 + 6 = 125
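As a quick check of this total, the four products can be recomputed from the factor counts stated in the text. The group names in the sketch are descriptive labels introduced here, not terms from the study.

```python
from math import prod

# Factor counts as stated in the text for the multivariate dataset, 1-hour group.
model_grids = {
    "LR and SVM": {"MLT": 2, "FS": 5, "B": 7},
    "other classical techniques and stacking": {"MLT": 3, "FS": 2, "B": 7},
    "random forest": {"MLT": 1, "FS": 1, "B": 7},
    "internally balanced ensembles": {"MLT": 3, "FS": 2},
}

models_per_grid = {name: prod(factors.values()) for name, factors in model_grids.items()}
total_models = sum(models_per_grid.values())

for name, count in models_per_grid.items():
    print(f"{name}: {count} models")
print(f"total: {total_models} models")
```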

Table 3
The values of the tuple elements.
Index | T | D | MLT | FS | B
1 | 1h | Multivariate | Logistic regression | l1 regularization | Weighting
2 | 2h | Time Series | SVM | SelectKBest | SMOTE
3 | 3h | Combined | Decision tree | Mutual_info Classif | SMOTEENN
4 | 4h | | KNN | No feature selection | Cluster centroid
5 | 5h | | GaussianNB | Random forest | Near miss
6 | 6h | | Gradient Boosting | | Random under sampler
7 | | | XGBoost | | K-medoid clustering
8 | | | Random forest | |
9 | | | Balanced bagging classifier | |
10 | | | Stacking methods | |


Fig. 2. Selecting classification algorithms for CA prediction.

Table 4
The best results of the classical algorithms for predicting cardiac arrest incidence in the next hour using multivariate dataset.
Algorithm | Balancing method | Feature selection | Acc | Prec | Sen | F1 | AUC | FPR | Spec
LR | SMOTE | L1 penalty | 0.79 | 0.09 | 0.67 | 0.15 | 0.79 | 0.21 | 0.79
DT | Weighting | – | 0.74 | 0.07 | 0.67 | 0.13 | 0.76 | 0.26 | 0.74
GaussianNB | – | – | 0.88 | 0.06 | 0.44 | 0.11 | 0.75 | 0.12 | 0.88
Linear SVM | K-medoid clustering | – | 0.75 | 0.07 | 0.64 | 0.13 | 0.79 | 0.25 | 0.75
KNN (k = 15) | SMOTE | – | 0.89 | 0.06 | 0.36 | 0.10 | 0.65 | 0.11 | 0.89
Kernel SVM | K-medoid clustering | – | 0.78 | 0.09 | 0.68 | 0.15 | 0.80 | 0.21 | 0.79
Acc: Accuracy; Prec: Precision; Sen: Sensitivity; F1: F1-score; AUC: Area Under ROC Curve; FPR: False Positive Rate; Spec: Specificity.

4.1.1.1. Selecting machine learning algorithms. The steps for selecting the classification models for the multivariate dataset followed the characteristics of the MLTs, as shown in Fig. 2. First, based on the two criteria of speed and accuracy, we chose the faster techniques for modeling, namely the classical algorithms (SVM, DT, LR, KNN, GaussianNB). Then, since the interpretation of the results is important for medical decision-making, we selected the interpretable algorithms, DT and LR. Each algorithm was trained using several feature selection and balancing methods, and its configuration parameters were tuned to find the optimal solution. In this research, the optimal model is the one with the highest f-score. When two models had the same f-score, sensitivity was considered as the second priority. Moreover, sensitivity or accuracy values below 70% were considered unacceptable. Table 4 presents the best results of the classical algorithms for predicting CA incidence one hour later.

According to Table 4, the best sensitivity values produced by DT and LR were below 70%, which is why we examined the other classical algorithms in the next step. In this regard, probabilistic algorithms, including naïve Bayes, and non-probabilistic algorithms, including KNN and linear SVM, were investigated, but the sensitivity generated by these algorithms was also very low. Therefore, the kernel SVM algorithm, which has a lower learning speed, was trained with different kernels. Its best results were similar to the best results of LR. Given the higher speed and interpretable results of LR, we preferred LR. However, the sensitivity was still below 70%.

In the next step, we developed ensemble classifiers, including boosting, bagging and stacking methods, to improve the results. An ensemble algorithm combines a variety of classifiers to improve robustness and achieve higher performance. The boosting algorithms examined were XGBoost and GradientBoosting, and the bagging algorithms were Random Forest and BalancedBaggingClassifier. The best results of the ensemble methods are presented in Table 5. According to Table 5, the boosting algorithms produced unacceptable sensitivity values, whereas the bagging algorithms produced better results. To improve the results further, we combined the models with better outcomes to develop stacking models, and various combinations of the classification algorithms were investigated. Finally, a stacking model that combined the RF and BalancedBagging classifiers and used LR as a meta-classifier significantly improved the sensitivity.
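The selection rule stated above (highest f-score, sensitivity as tie-breaker, and a 70% floor on sensitivity and accuracy) can be sketched as a small function. The rows are the accuracy, sensitivity and F1 values reported in Tables 4 and 5; the names `Result` and `select_best` are illustrative and do not come from the study.

```python
from typing import NamedTuple, Optional, Sequence

class Result(NamedTuple):
    algorithm: str
    accuracy: float
    sensitivity: float
    f1: float

MIN_ACCEPTABLE = 0.70  # floor applied to both sensitivity and accuracy

def select_best(results: Sequence[Result]) -> Optional[Result]:
    """Return the acceptable model with the highest F1; ties go to sensitivity."""
    acceptable = [
        r for r in results
        if r.sensitivity >= MIN_ACCEPTABLE and r.accuracy >= MIN_ACCEPTABLE
    ]
    if not acceptable:
        return None
    return max(acceptable, key=lambda r: (r.f1, r.sensitivity))

classical = [  # Table 4
    Result("LR", 0.79, 0.67, 0.15),
    Result("DT", 0.74, 0.67, 0.13),
    Result("GaussianNB", 0.88, 0.44, 0.11),
    Result("Linear SVM", 0.75, 0.64, 0.13),
    Result("KNN (k = 15)", 0.89, 0.36, 0.10),
    Result("Kernel SVM", 0.78, 0.68, 0.15),
]
ensembles = [  # Table 5
    Result("XGBoost", 0.92, 0.35, 0.13),
    Result("Gradient Boosting", 0.95, 0.19, 0.19),
    Result("Random forest", 0.73, 0.63, 0.12),
    Result("BalancedBagging", 0.79, 0.63, 0.15),
    Result("Stacking (RF, BalancedBagging, SVM, LR)", 0.72, 0.74, 0.13),
    Result("Stacking (RF, BalancedBagging)", 0.75, 0.77, 0.15),
]

best = select_best(classical + ensembles)
print(best.algorithm if best else "no acceptable model")
```

Applied to the classical algorithms alone, the rule returns no acceptable model, which mirrors the reason given above for moving on to ensemble methods.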

Table 5
The best results of the ensemble methods for predicting cardiac arrest incidence in the next hour using multivariate dataset.
Algorithm | Balancing method | Feature selection | Acc | Prec | Sen | F1 | AUC | FPR | Spec
XGBoost | SMOTE | 12-best | 0.92 | 0.09 | 0.35 | 0.13 | 0.76 | 0.07 | 0.93
Gradient Boosting | – | No | 0.95 | 0.22 | 0.19 | 0.19 | 0.64 | 0.02 | 0.98
Random forest | Weighting | No | 0.73 | 0.06 | 0.63 | 0.12 | 0.77 | 0.27 | 0.73
BalancedBagging | – | No | 0.79 | 0.09 | 0.63 | 0.15 | 0.79 | 0.20 | 0.80
Stacking (RF, BalancedBagging, SVM, LR; meta_classifier: LR) | K-medoid clustering | No | 0.72 | 0.07 | 0.74 | 0.13 | 0.72 | 0.28 | 0.72
Stacking (RF, BalancedBagging; meta_classifier: LR) | K-medoid clustering | No | 0.75 | 0.08 | 0.77 | 0.15 | 0.79 | 0.25 | 0.75

Table 6
The effect of the data filtering on performance.
Data | Algorithm | Acc | Prec | Sen | F1 | AUC | FPR | Spec
Before filtering | Stacking | 0.75 | 0.08 | 0.77 | 0.15 | 0.79 | 0.25 | 0.75
After filtering | Stacking | 0.76 | 0.19 | 0.77 | 0.31 | 0.82 | 0.24 | 0.76
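One way to read Table 6 is through the class ratio. The sketch below recomputes the expected precision from the reported sensitivity and FPR, using the record counts given in Sections 4.1.1 and 4.1.2 (79 cases and 2681 controls before filtering, 53 cases and 732 controls after). The function name is illustrative, and the calculation is an approximation that treats the reported cross-validated rates as exact.

```python
def expected_precision(sensitivity, fpr, n_cases, n_controls):
    """Precision implied by a sensitivity, an FPR and the two class sizes."""
    true_alarms = sensitivity * n_cases
    false_alarms = fpr * n_controls
    return true_alarms / (true_alarms + false_alarms)

before = expected_precision(0.77, 0.25, n_cases=79, n_controls=2681)
after = expected_precision(0.77, 0.24, n_cases=53, n_controls=732)

print(f"before filtering: {before:.2f}")
print(f"after filtering:  {after:.2f}")
```

Under these assumptions the sensitivity and FPR barely change, so most of the gain in precision and f1-score follows from the smaller share of control records after filtering.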

According to the results, feature selection methods reduced the performance of the majority of the models.

4.1.2. Filtering the records based on the time of measuring vital signs
As previously mentioned, the last value recorded before Tp was selected for vital signs and laboratory test results in the multivariate dataset. This value could have been recorded anywhere from one to 30 h before Tp. Patient data are likely to have a high predictive value in the hours close to the time of CA incidence. Therefore, we only included patients for whom at least one vital sign value was recorded in each hour of their hospitalization. This reduced the number of CA cases to 53 and the number of control records to 732. In Table 6, the best results of the proposed stacking model are compared before and after filtering the records. According to Table 6, the results improved despite the removal of a large number of records. In particular, the precision and the f-score increased significantly. From these results, we conclude that the values of vital signs at times close to CA incidence have a great effect on the prediction performance.

4.1.3. Modeling with time series dataset
In the next step, in order to analyze the dynamics of the time series and use them for classification, we extracted the trend features of the time series. In this study, we used a simple approach to construct trend features. The time interval before Tp was divided into five windows, and the trend features for each window were extracted. The first window contains one bucket of information (one hour) before Tp. The second window contains the information of the two hours before Tp, the third window the four hours before Tp, the fourth window the eight hours before Tp, and the fifth window the 16 h before Tp. We also considered the entire interval before Tp as a window. For each window, a regression line was fitted, and the slope and intercept, which contain the trend information of that window, were recorded. Moreover, we extracted the minimum, maximum, mean, median, standard deviation, and first and third quartiles for each window. These calculations led to the production of 270 new trend features:

6 (number of time series) × 5 (number of windows) × 9 (number of features) = 270
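The window-based extraction can be sketched for a single vital sign as follows. This is a minimal illustration under several assumptions: the series is already regularized to one value per hour, the five window lengths (1, 2, 4, 8 and 16 h) follow the count used in the formula, a one-point window is given a zero slope, and the quartile method is the "inclusive" one of the Python standard library. The function names and the heart-rate values are hypothetical.

```python
import statistics

WINDOW_HOURS = (1, 2, 4, 8, 16)  # window lengths before Tp, as listed in the text

def fit_line(values):
    """Least-squares slope and intercept over x = 0, 1, ..., n-1."""
    n = len(values)
    if n < 2:
        return 0.0, float(values[0])  # assumption: a single point has zero slope
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def window_features(values):
    """The nine trend features of one window."""
    slope, intercept = fit_line(values)
    if len(values) >= 2:
        q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    else:
        q1 = q3 = float(values[0])
    return {
        "slope": slope, "intercept": intercept,
        "min": min(values), "max": max(values),
        "mean": statistics.fmean(values), "median": statistics.median(values),
        "std": statistics.pstdev(values),
        "q1": q1, "q3": q3,
    }

def trend_features(hourly_values):
    """hourly_values: one value per hour, oldest first, ending just before Tp."""
    features = {}
    for hours in WINDOW_HOURS:
        window = hourly_values[-hours:]
        for name, value in window_features(window).items():
            features[f"w{hours}h_{name}"] = value
    return features

heart_rate = [80 + i for i in range(16)]  # hypothetical hourly heart rates
feats = trend_features(heart_rate)
print(len(feats), "features per series;", len(feats) * 6, "for six series")
```

Each series contributes 5 × 9 = 45 features in this sketch, so six vital-sign series give the 270 features counted above.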

Given the better outcomes generated by the SVM, LR, RF, Balanced Bagging Classifier, and stacking methods in the previous step, only these techniques were used to analyze the time series. We used the two methods of L1 regularization and SelectKBest to select features for the SVM and LR models. In the SelectKBest method, the features were selected with four different values of k (20, 30, 40, and 50). The B1, B2 and B7 methods were used to balance the data:

2 MLT × 5 FS × 3 B = 30 models

In the random forest algorithm, the proper features were selected by the model:

1 MLT × 3 B = 3 models

The Balanced Bagging classifier technique balances the data internally. The SelectKBest method was used to select features for this algorithm:

1 MLT × 4 FS = 4 models

All three balancing methods and the SelectKBest method were examined for the stacking model:

1 MLT × 4 FS × 3 B = 12 models

In total, 49 different models were trained on the time series data. The best results of the algorithms are presented in Table 7. According to Table 7, the best results were obtained using the SVM algorithm with 40 selected features and SMOTE balancing. The highest values of sensitivity, precision and f1-score were 0.70, 0.169 and 0.27, respectively. Comparing the best results of the multivariate and time series analyses shows that multivariate analysis generally yields better results, especially for sensitivity. However, given that the time series consist of only six variables, this difference is acceptable, and the results indicate that the time series of vital signs are of great importance in the prediction of cardiac arrest.

Table 7
The best results of the time series analysis for predicting CA incidence in the next hour.
Algorithm | Balancing method | Feature selection | Acc | Prec | Sen | F1 | AUC | FPR | Spec
SVM | SMOTE | 40-Best | 0.76 | 0.17 | 0.70 | 0.27 | 0.81 | 0.24 | 0.76
LR | Weighting | L1 penalty | 0.76 | 0.17 | 0.68 | 0.26 | 0.77 | 0.23 | 0.77
Balanced Bagging | Embedded | 50-Best | 0.75 | 0.15 | 0.61 | 0.24 | 0.74 | 0.24 | 0.76
RF | Weighting | No | 0.69 | 0.10 | 0.54 | 0.16 | 0.69 | 0.29 | 0.71
Stacking | K-medoid | 50-Best | 0.76 | 0.16 | 0.61 | 0.25 | 0.75 | 0.23 | 0.77

Table 8
Comparison of the best results from multivariate, trend and combined features in predicting cardiac arrest one hour before the occurrence.
Dataset | Feature selection | Balancing method | Algorithm | Acc | Prec | Sen | F1 | AUC | FPR | Spec
Multivariate | – | K-medoid | Stacking | 0.76 | 0.19 | 0.77 | 0.31 | 0.82 | 0.24 | 0.76
Time series | 40_Best | SMOTE | Kernel SVM | 0.76 | 0.17 | 0.70 | 0.27 | 0.81 | 0.24 | 0.76
Combined dataset | L1 penalty | Weighting | LR | 0.77 | 0.18 | 0.70 | 0.28 | 0.78 | 0.22 | 0.78

Fig. 3. Comparing the evaluation criteria for different time groups using the proposed model.

4.1.4. Modeling with combined dataset
In the next step, the trend features were combined with the multivariate features to examine the effect of this combination on CA prediction. We induced the same set of models as for the time series dataset (49 models). The best results were obtained using the logistic regression algorithm. Table 8 compares the best results from the three datasets. According to Table 8, the combined dataset produced better results than the time series analysis, but the precision and sensitivity decreased compared with the multivariate dataset. This reduction may be due to overfitting caused by adding a large number of new features to the model, which the feature selection methods were probably unable to overcome. Moreover, the balancing methods may have been less effective because of the large number of features. The results show that ensemble classifiers perform well on datasets with fewer than 40 variables, whereas the classical algorithms produce better results on datasets with a large number of variables.

4.2. Modeling other time groups (prediction of cardiac arrest two to six hours before the occurrence)
As shown in Table 8, the multivariate dataset, coupled with a stacking model, produced the best results for predicting cardiac arrest an hour before its occurrence. In order to measure the predictive power of this early warning model, we predicted cardiac arrest at 2, 3, 4, 5 and 6 h before the incidence. Due to the importance of early prediction, this study did not investigate intervals shorter than one hour. For intervals longer than 6 h, the accuracy or sensitivity dropped below 70%; therefore, these intervals were not included. Since the multivariate dataset produced better results in the previous analysis, modeling in this section was performed only with multivariate datasets. Fig. 3 compares the best results produced for the different time groups using the stacking model on the multivariate dataset. In Fig. 3, we observe that the values of the criteria decrease in the even time-groups and increase in the odd time-groups. These fluctuations are probably due to the pattern of data collection. Perhaps the values of some variables were not recorded during even intervals, which increases the noise of the data and reduces the performance of the model at these intervals.


Fig. 4. Best values of the evaluation criteria using multivariate dataset in 1-hour group.

In spite of these fluctuations, the values of f1-score, sensitivity, precision, accuracy, and AUC generally decrease as the prediction time interval increases. In contrast, the FPR criterion, which reflects the number of false alarms, increases. Given these results, we found that the effectiveness of the prediction model decreases gradually as we move away from the time of the CA incidence. The reason is that more physiological changes occur in the patient’s body as the time of CA approaches.

5. Discussion
In this research, various models were induced to predict the incidence of cardiac arrest. As previously mentioned, increasing the sensitivity reduces the precision and vice versa. Our goal was a trade-off between sensitivity and precision, so that the majority of CA cases were detected while the number of false alarms was reduced. Although identifying all cases of cardiac arrest is important, false alarms increase the workload of the hospital staff, decrease the attention of the staff to the warnings and thus endanger the patients’ health. That is why we sought to maximize the F1-score, which is the harmonic mean of these two criteria. Meanwhile, we set a threshold and considered models with a sensitivity or accuracy below 70% unacceptable. On this basis, a stacking model on the multivariate dataset was selected as the best model. Other models can be used to maximize the other criteria. For example, the best accuracy, precision and sensitivity values were obtained using random forest, XGBoost, and SVM, respectively. In Fig. 4, the values of the different criteria of these three models and the proposed stacking model are compared. According to Fig. 4, the best values of the accuracy, FPR and specificity criteria were generated by the RF and XGBoost algorithms, but these models produced unacceptably low values for sensitivity. The XGBoost algorithm produced the best precision value as well. The SVM model generated a very high sensitivity value, but its other criteria (other than AUC) were unacceptably low. The proposed stacking model produced the highest f1-score and AUC, and acceptable values for the other criteria. All models obtained almost the same AUC.

5.1. Comparing the proposed model with existing warning systems
In order to evaluate the effectiveness of the proposed model, we compared it with two standard warning systems, APACHE II [13] and MEWS [12]. The dataset contained no information about two of the APACHE II parameters, which were therefore set to zero. Various thresholds were investigated for CA prediction. Ultimately, patients with an APACHE II score of 18 or more, or a MEWS score of 5 or more, were considered as patients who would experience CA in the next h hours. In Fig. 5, we compared the results of CA prediction one hour before the incidence using APACHE II, MEWS and the proposed model. According to Fig. 5, the proposed model produced a significant improvement in the sensitivity and AUC values compared with APACHE II and MEWS. It also improved the f1-score to a smaller extent.

5.2. Comparing the proposed model with the related works
In this section, we compare the results of the proposed model with the results obtained in the related papers. The previous studies differ from the current work in terms of the population under study, the number of records, the feature sets, and the ratio of CA cases to normal patients. Therefore, comparing their outcomes is somewhat challenging. Table 9 presents the results from the related works and the current paper. According to Table 9, Portela et al. [16] obtained a sensitivity of 91% using the SVM algorithm, but the F-score and precision values were not reported. It is also not clear how many hours in advance the cardiac arrhythmia was predicted. Compared with the proposed model for the 1-hour group in the current study, that work produced a higher sensitivity, but lower accuracy and specificity.


Kennedy et al. [17] obtained values of 94% and 98% for the accuracy and AUC criteria using SVM, but other criteria were not reported. In our study, which deals with an imbalanced dataset, accuracy is not a good measure for evaluation. According to Fig. 4, the random forest model reached an accuracy of 95%, but this model was discarded due to its low sensitivity. Somanchi et al. [18] used SVM to obtain a sensitivity of 80% for predicting cardiac arrest 4 h before the incidence. However, the other performance criteria were not provided; therefore, we could not make a complete comparison. The proposed stacking model in the current paper produced a sensitivity of 76% for predicting cardiac arrest 4 h before the incidence. Churpek et al. [19] used random forest to obtain an AUC of 83% in predicting clinical deterioration 24 h before the occurrence. However, no other performance metrics were provided, so we could not obtain a reliable estimate of the efficiency of that study. In addition to CA, that paper considers transfers to the ICU and deaths as case samples. In the present article, for prediction horizons longer than 6 h, the sensitivity or accuracy dropped below the threshold of 70%; therefore, these time-groups were not included in our analyses.

Fig. 5. Comparing the results of CA prediction one hour before the incidence, using APACHE II, MEWS and the proposed model.

Table 9
Comparing the proposed model with the related works.
No. | Reference | Population | Feature selection | Method | Time to CA | Acc | Prec | Sen | F1 | AUC | FPR | Spec
1 | [16] | | Selection by the model | SVM | | 0.69 | | 0.91 | | | | 0.63
2 | [17] | 109 control, 103 CA cases | SVMW, RFE | SVM | | 0.94 | | | | 0.98 | |
3 | [18] | 133,000 record (815 CAs) | Selection by the model | SVM | 4 h | | | 0.80 | | | |
4 | [19] | 269,999 patients (424 CAs, 13,188 ICU transfers, and 2840 deaths) | – | RF | 24 h | | | | | 0.83 | |
6 | Proposed model | 2681 control, 79 CA cases | – | Stacking | 1 h | 0.76 | 0.19 | 0.77 | 0.31 | 0.82 | 0.24 | 0.76
 | | | – | Stacking | 2 h | 0.73 | 0.17 | 0.80 | 0.28 | 0.74 | 0.28 | 0.73
 | | | – | Stacking | 3 h | 0.75 | 0.18 | 0.77 | 0.29 | 0.76 | 0.26 | 0.74
 | | | – | Stacking | 4 h | 0.71 | 0.16 | 0.76 | 0.26 | 0.70 | 0.30 | 0.70
 | | | – | Stacking | 5 h | 0.73 | 0.16 | 0.70 | 0.26 | 0.73 | 0.27 | 0.73
 | | | – | Stacking | 6 h | 0.70 | 0.15 | 0.74 | 0.25 | 0.73 | 0.30 | 0.70

6. Conclusion
In this study, we showed that machine learning techniques, especially ensemble algorithms, have high potential to be used in prognostic systems for sepsis patients. The main contributions of this paper were as follows:
• Evaluating the effectiveness of a wide range of machine learning algorithms in predicting CA, including classical models and ensemble methods.
• Investigating the effect of time series dynamics of vital signs on CA prediction.
• Predicting CA incidence in different time intervals.

There were various challenges in the dataset under investigation, including missing values, outliers, imbalanced classes of data and irregularity of time series. Innovative solutions were proposed to meet these challenges. In the next step, more than 220 models were induced on three different datasets (multivariate, time series and combined) to predict CA incidence one hour earlier. The models used various methods of feature selection and balancing. We aimed to increase the value of the f-score as much as possible. The best results were obtained using a stacking method, the multivariate dataset, all variables as the feature set and the k-medoid clustering method for balancing (precision = 19%, sensitivity = 77%, f1-score = 31% and AUC = 82%). However, the results of the time series analysis indicate that the time series dynamics of vital signs can also play a significant role in predicting CA (precision = 17%, sensitivity = 70% and f1-score = 27%).

In the next step, we used the best model of the 1-hour group to predict CA incidence earlier than one hour. We found that the effectiveness of the prediction model decreases gradually as we move away from the time of the CA incidence. We did not consider time groups of more than six hours, because the accuracy or sensitivity dropped below 70%. In comparison with existing warning systems such as APACHE II (precision = 18%, sensitivity = 67%, f1-score = 28% and AUC = 71%) and MEWS (precision = 20%, sensitivity = 62%, f1-score = 30% and AUC = 70%), the proposed model significantly improved the sensitivity and AUC metrics and improved the f1-score to a smaller extent. In addition, we compared the proposed model with the related works. However, due to the lack of sufficient information, it was not possible to make a fair comparison.

There are some practical limitations in this study. The proposed model was not designed to extract a set of rules that can be computed manually. Another limitation is that the proposed model was trained exclusively with ICU data from one health center and may not generalize to other hospitals or departments. In future studies, the proposed model should be evaluated with more data from different health centers. Methods of time series analysis other than extracting trend features should also be examined. In addition, the proposed model should be improved and combined with experts’ opinions. Ultimately, it would be useful to develop an architecture for a smart ICU that integrates the proposed model with the intensive care unit.

Conflicts of interest
None.

References
[1] D. Leoni, J. Rello, Cardiac arrest among patients with infections: causes, clinical practice and research implications, Clin. Microbiol. Infect. 23 (10) (2017) 730–735.
[2] J. Stoller, et al., Epidemiology of severe sepsis: 2008-2012, J. Crit. Care 31 (1) (2016) 58–62.
[3] R.W. Morgan, et al., Sepsis-associated in-hospital cardiac arrest: epidemiology, pathophysiology, and potential therapies, J. Crit. Care 40 (2017) 128–135 Supplement C.
[4] S.L. Javan, M.M. Sepehri, H. Aghajani, Toward analyzing and synthesizing previous research in early prediction of cardiac arrest using machine learning based on a multi-layered integrative framework, J. Biomed. Informat. 88 (2018) 70–89.
[5] L. Mayaud, et al., Dynamic data during hypotensive episode improves mortality predictions among patients with sepsis and hypotension, Crit. Care Med. 41 (4) (2013) 954.
[6] J. Wiens, E. Horvitz, J.V. Guttag, Patient risk stratification for hospital-associated c. diff as a time-series classification task, in: Proceedings of the Advances in Neural Information Processing Systems, 2012.
[7] H.L. Li-wei, et al., Tracking progression of patient state of health in critical care using inferred shared dynamics in physiological time series, in: Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, NIH Public Access, 2013.
[8] H.L. Li-wei, et al., Discovering shared dynamics in physiological signals: application to patient monitoring in ICU, in: Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), IEEE, 2012.
[9] T.G. Buchman, Nonlinear dynamics, complex systems, and the pathobiology of critical illness, Current Opin. Crit. Care 10 (5) (2004) 378–382.
[10] J.R. Moorman, et al., Cardiovascular oscillations at the bedside: early diagnosis of neonatal sepsis using heart rate characteristics monitoring, Physiol. Measur. 32 (11) (2011) 1821.
[11] F. Portela, et al., A Pervasive Intelligent System for Scoring MEWS and TISS-28 in Intensive Care, in: Proceedings of the 15th International Conference on Biomedical Engineering, Springer, 2014.
[12] C. Subbe, et al., Validation of a modified early warning score in medical admissions, QJM 94 (10) (2001) 521–526.
[13] W.A. Knaus, et al., APACHE II: a severity of disease classification system, Crit. Care Med. 13 (10) (1985) 818–829.
[14] A. Hevner, S. Chatterjee, Design science research in information systems, in: Design Research in Information Systems, Springer, 2010, pp. 9–22.
[15] K. Peffers, et al., A design science research methodology for information systems research, J. Manag. Inf. Syst. 24 (3) (2007) 45–77.
[16] F. Portela, et al., Preventing patient cardiac arrhythmias by using data mining techniques, in: Proceedings of the IEEE Conference on Biomedical Engineering and Sciences (IECBES), IEEE, 2014.
[17] C.E. Kennedy, et al., Using time series analysis to predict cardiac arrest in a pediatric intensive care unit, Pediatr. Crit. Care Med. 16 (9) (2015) e332 a journal of the Society of Critical Care Medicine and the World Federation of Pediatric Intensive and Critical Care Societies.
[18] S. Somanchi, et al., Early prediction of cardiac arrest (Code Blue) using electronic medical records, in: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, NSW, Australia, ACM: Sydney, 2015, pp. 2119–2126.
[19] M.M. Churpek, et al., Multicenter comparison of machine learning methods and conventional regression for predicting clinical deterioration on the wards, Crit. Care Med. 44 (2) (2016) 368–374.
[20] A.L. Goldberger, et al., PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals, Circulation 101 (23) (2000) e215–e220.
[21] J.M. Gonçalves, et al., Predict sepsis level in intensive medicine–data mining approach, in: Advances in Information Systems and Technologies, Springer, 2013, pp. 201–211.
[22] M.J. Azur, et al., Multiple imputation by chained equations: what is it and how does it work? Int. J. Methods Psychiatr. Res. 20 (1) (2011) 40–49.
[23] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, in: Proceedings of the IJCAI, Montreal, Canada, 1995.