Intradialytic blood pressure pattern recognition based on density peak clustering


Accepted Manuscript

Feng Wang, Jing-yi Zhou, Yu Tian, Yu Wang, Ping Zhang, Jiang-hua Chen, Jing-song Li

PII: S1532-0464(18)30096-0
DOI: https://doi.org/10.1016/j.jbi.2018.05.013
Reference: YJBIN 2983

To appear in: Journal of Biomedical Informatics

Received Date: 22 December 2017
Revised Date: 15 March 2018
Accepted Date: 20 May 2018

Please cite this article as: Wang, F., Zhou, J-y., Tian, Y., Wang, Y., Zhang, P., Chen, J-h., Li, J-s., Intradialytic blood pressure pattern recognition based on density peak clustering, Journal of Biomedical Informatics (2018), doi: https://doi.org/10.1016/j.jbi.2018.05.013

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Feng Wang #a, Jing-yi Zhou #b, Yu Tian a, Yu Wang a, Ping Zhang b, Jiang-hua Chen b, Jing-song Li *a

a Engineering Research Center of EMR and Intelligent Expert System, Ministry of Education, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China

b Kidney Disease Center, the First Affiliated Hospital, College of Medicine, Zhejiang University, China

# These authors contributed equally to this study.

* Corresponding author: Jing-song Li, Engineering Research Center of EMR and Intelligent Expert System, Ministry of Education, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Science, Zhejiang University, 38 Zheda Road, Hangzhou 310027, China. Tel: +86-571-87951564; Fax: +86-571-87951564; E-mail address: [email protected]

Conflict of interest: No potential conflicts of interest relevant to this article were reported.

Acknowledgments: This work was supported by the National High-tech R&D Program (No. 2015AA020109) and the National Natural Science Foundation of China (No. 81672916, No. 61702446). We thank all reviewers and editors who helped improve this work.

ABSTRACT

End-stage renal disease (ESRD) is the final stage of chronic kidney disease (CKD) and requires renal replacement therapy such as hemodialysis (HD) for survival. Intradialytic blood pressure (IBP) measurements are necessary to ensure patient safety during HD treatments and have clinical and prognostic significance. Studies on IBP measurements, especially IBP patterns, are limited. Previous studies classified IBP patterns manually on the basis of a priori knowledge, so their results were influenced by subjective judgment. In this study, we propose a new approach for identifying IBP patterns to classify ESRD patients. We used the dynamic time warping (DTW) algorithm to measure the similarity between two series of IBP data. Five blood pressure (BP) patterns were identified by applying the density peak clustering algorithm (DPCA) to the IBP data. To illustrate the association between BP patterns and prognosis, we constructed three random survival forest (RSF) models with different covariates. Model accuracy improved by 3.7% to 6.3% when BP patterns were included. The results suggest that BP patterns have clinical and prognostic significance regarding the risk of cerebrovascular events. The clustering approach can also be applied to other dense time series data from electronic health records (EHRs).

Keywords: Hemodialysis; Intradialytic blood pressure patterns; Dynamic time warping; Density peak clustering algorithm

1. INTRODUCTION

End-stage renal disease (ESRD) is the final stage of chronic kidney disease (CKD), when the kidneys can no longer remove waste and excess water from the body and dialysis or kidney transplantation is necessary for survival [1]. In the U.S., the number of incident (newly reported) ESRD cases in 2014 was 120,688, and the unadjusted (crude) incidence rate was 370 per million/year. Since 2011, both the number of incident cases and the unadjusted incidence rate have increased [2]. A lower estimated glomerular filtration rate [3], higher albuminuria [3], higher sodium intake [4], 25-hydroxyvitamin D deficiency [5] and lupus erythematosus [6] have been reported to be associated with progression to ESRD. However, few studies are available on the intradialytic blood pressure (IBP) of ESRD patients.

Observational studies have demonstrated associations between intradialytic hypotension [7-10], intradialytic hypertension [11-13], blood pressure (BP) variability [14-18] and mortality. Intradialytic hypotension [8-10] and low nadir systolic blood pressure (SBP) [8] are associated with high mortality and a high risk of cardiovascular events [9]. Intradialytic hypertension is associated with high mortality, and every 10-mm Hg increase in pre- to post-hemodialysis (HD) SBP is associated with a 12% higher mortality risk [12]. Significant long-term BP variability [16] is associated with stroke, heart disease and mortality. However, the results of these studies vary with the definitions used, and the lack of consensus regarding the diagnostic criteria has hampered data synthesis. Machine learning methods could therefore facilitate the automatic classification of IBP patterns and the study of their relationship with cerebrovascular risk in patients with ESRD, potentially leading to improvements in treatment and prognosis.

Few studies are available on IBP patterns. One study showed associations between IBP phenomena, including fall, rise and variability, and adverse clinical outcomes [19]. Another study fitted an IBP curve using regression splines and then calculated the first and second derivatives to determine four mutually exclusive classifications at different time points. The authors showed that these classifications correspond to the near-term risk of cardiac events and are moderately stable over a consecutive two-week period [20]. The studies described above were based on a priori knowledge and classified IBP patterns manually, so their results are subjective. In this study, we use the dynamic time warping (DTW) algorithm to assess the similarity of two series of IBP data and analyze IBP patterns using the density peak clustering algorithm (DPCA). We constructed three random survival forest (RSF) models to illustrate the association between BP patterns and prognosis.

2. MATERIALS AND METHODS

2.1 Data sources

The data are from the Electronic Health Record (EHR) System of the Kidney Disease Center of the First Affiliated Hospital of Zhejiang University (FAHZU), People's Republic of China (PRC). FAHZU was established in November 1947 and is a tertiary referral hospital. Its kidney disease center is one of the largest integrated treatment centers for kidney disease in China and was one of the first units to perform kidney puncture, HD, peritoneal dialysis, kidney transplantation and continuous renal replacement therapy. The EHR system includes HD, peritoneal dialysis, and kidney transplant records, with approximately 1.7 million HD records collected since 1991.

2.2 Cohort

We selected all patients with ESRD for at least 90 days who received maintenance HD at FAHZU’s Kidney Disease Center between July 30, 2007 and August 25, 2016. We selected BP records from different patients and merged them into one dataset to identify typical IBP patterns to represent the general HD population. Since our focus was on typical sessions, we required all patients to receive HD three times per week, with each session lasting at least 2 h and no more than 5 h. Ultimately, we included 14,885 ESRD patients on maintenance HD.

2.3 BP Pattern recognition

2.3.1 IBP

According to the EHR system of FAHZU’s Kidney Disease Center, each HD session lasted for approximately 4 h, and SBP was measured and recorded by a nurse during the session. The measurement interval varied from 5 minutes to 1 h depending on the nurse. We removed SBP values that were <60 mm Hg, >250 mm Hg, or lower than the corresponding diastolic BP measurements. We used all SBP information from different patients to assess similarity and clustering to obtain IBP patterns.
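The cleaning rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `(sbp, dbp)` pair layout and the function names `clean_session` and `usable` are hypothetical, and the 4-to-10-point session filter is the one reported in Section 3.1, not a separate rule.

```python
from typing import List, Optional, Tuple

# Plausibility thresholds for SBP, in mm Hg, as stated in the text.
SBP_MIN = 60
SBP_MAX = 250


def clean_session(readings: List[Tuple[float, Optional[float]]]) -> List[float]:
    """Return the SBP values of one HD session that pass the checks.

    `readings` is a hypothetical list of (sbp, dbp) pairs in time order;
    dbp may be None when no diastolic value was recorded.
    """
    kept = []
    for sbp, dbp in readings:
        if sbp < SBP_MIN or sbp > SBP_MAX:
            continue  # outside the plausible SBP range
        if dbp is not None and sbp < dbp:
            continue  # systolic below diastolic: treated as a recording error
        kept.append(sbp)
    return kept


def usable(sbp_values: List[float], min_points: int = 4, max_points: int = 10) -> bool:
    """Session-length filter (4 to 10 SBP points) reported in Section 3.1."""
    return min_points <= len(sbp_values) <= max_points


session = [(140, 80), (55, 40), (260, 90), (70, 85), (130, 75), (125, 70), (120, 72)]
cleaned = clean_session(session)
print(cleaned, usable(cleaned))
```

Whether a session is filtered before or after removing implausible readings is not stated in the text; the sketch applies the length filter to the cleaned values.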

2.3.2 Similarity measurement

The IBP data used in this study were time series with different time intervals. We used the DTW algorithm to measure the similarity between two series of IBP data. Time alignment based on dynamic programming was first applied by Itakura in the field of speech recognition [21]. Berndt et al. then formally proposed the DTW technique in the database domain and used it for pattern recognition [22]. The DTW technique dynamically warps the time axis to match two similar sequences. Consider the two intradialytic SBP sequences in this study:

$$S = \langle s_1, s_2, \ldots, s_i, \ldots, s_n \rangle, \qquad T = \langle t_1, t_2, \ldots, t_j, \ldots, t_m \rangle$$

S and T can be from the same patient or from two different patients, and $s_1, \ldots, s_n$ and $t_1, \ldots, t_m$ represent the SBP data arranged in time order during the HD session. The BP sequences S and T can be arranged in a grid of n by m (see Fig. 1), where each grid point $(i, j)$ corresponds to an alignment between the BP elements $s_i$ and $t_j$. The warping path $W = \langle w_1, w_2, \ldots, w_k, \ldots, w_K \rangle$ is the alignment of the elements of S and T for which the cumulative distance between them is minimal, with $w_k = (i_k, j_k)$ corresponding to the grid point $(i_k, j_k)$, that is, to the alignment of $s_{i_k}$ and $t_{j_k}$.

Fig. 1. Example of a warping path. The abscissa is the intradialytic SBP sequence $S = \langle s_1, \ldots, s_i, \ldots, s_n \rangle$, and the ordinate is the intradialytic SBP sequence $T = \langle t_1, \ldots, t_j, \ldots, t_m \rangle$. Each grid point $(i, j)$ corresponds to an alignment between the BP elements $s_i$ and $t_j$.

We used the Manhattan distance (L1 norm), $d(s_i, t_j) = |s_i - t_j|$, to measure the distance between the SBP points $s_i$ and $t_j$. After defining the distance between two points, the DTW process can be described as the process of finding the shortest warping path. The warping path is typically subject to several constraints, as described below.

• Boundary conditions: $w_1 = (1, 1)$ and $w_K = (n, m)$. This requires the warping path to start and finish in diagonally opposite corner cells of the matrix.
• Continuity: The steps in the grid are confined to neighboring points, $i_{k+1} - i_k \le 1$ and $j_{k+1} - j_k \le 1$.
• Monotonicity: The points must be monotonically ordered with respect to time, $i_{k+1} - i_k \ge 0$ and $j_{k+1} - j_k \ge 0$.

Several paths satisfy these constraints, and we were interested in the path with the lowest cost:

$$\mathrm{DTW}(S, T) = \min_{W} \sum_{k=1}^{K} d(w_k), \qquad d(w_k) = d(s_{i_k}, t_{j_k}) \tag{1}$$

This path can be found using dynamic programming to evaluate the following recurrence, which defines the cumulative distance $\gamma(i, j)$ as the distance $d(s_i, t_j)$ found in the current cell and the minimum of the cumulative distances of the adjacent elements:

$$\gamma(i, j) = d(s_i, t_j) + \min\{\gamma(i-1, j-1),\ \gamma(i-1, j),\ \gamma(i, j-1)\} \tag{2}$$

The shortest path value measures the similarity between the two series, and a smaller value corresponds to greater similarity. We can measure the similarity of the two intradialytic SBP sequences by calculating the DTW distance. The time and space complexity of DTW is O(nm).
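The dynamic programming recurrence described above can be sketched as follows. This is a minimal pure-Python illustration, not the authors' C implementation; the function name `dtw_distance` is hypothetical, and the cost is the unnormalized sum of Manhattan distances along the path, which is an assumption about the exact form of the path cost.

```python
from typing import Sequence


def dtw_distance(s: Sequence[float], t: Sequence[float]) -> float:
    """Cumulative DTW cost between two sequences using |s_i - t_j| as point distance."""
    n, m = len(s), len(t)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # gamma[i][j] holds the minimal cumulative distance aligning s[:i] with t[:j].
    # Row 0 and column 0 are padding so that the boundary condition falls out
    # of the recurrence: only gamma[0][0] is reachable.
    gamma = [[inf] * (m + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            gamma[i][j] = cost + min(
                gamma[i - 1][j - 1],  # diagonal step
                gamma[i - 1][j],      # step in s only
                gamma[i][j - 1],      # step in t only
            )
    return gamma[n][m]


# Two sequences of different lengths with the same shape have zero distance.
print(dtw_distance([0, 1, 2], [0, 1, 1, 2]))
# A rising and a falling sequence are far apart.
print(dtw_distance([0, 0.5, 1], [1, 0.5, 0]))
```

The first call illustrates why sequences of different lengths can be compared directly, and why two distinct sequences can have a DTW distance of exactly 0.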

2.3.3 Clustering serial measurements

In this study, we used the DPCA [23] to analyze IBP patterns. The DPCA is based on the idea that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher densities. This idea forms the basis of a clustering procedure in which the number of clusters arises intuitively, outliers are automatically spotted and excluded from the analysis, and clusters are recognized regardless of their shape and the dimensionality of the space in which they are embedded. For each data point i, we compute two quantities: its local density $\rho_i$ and its distance $\delta_i$ from points of higher density. Both quantities depend only on the distances $d_{ij}$ between data points, and we used the DTW distance. The local density $\rho_i$ of data point i is defined as the reciprocal of the average distance from point i to its M nearest points:

$$\rho_i = \left( \frac{1}{M} \sum_{j \in N_M(i)} d_{ij} \right)^{-1} \tag{3}$$

where $N_M(i)$ denotes the set of the M points nearest to point i. In this study, M is 1.5% of the total sample size, and $\delta_i$ is measured by computing the minimum distance between point i and any other point with higher density:

$$\delta_i = \min_{j:\ \rho_j > \rho_i} d_{ij} \tag{4}$$

For the point with the highest density, we take $\delta_i = \max_j d_{ij}$. Note that $\delta_i$ is much larger than the typical nearest neighbor distance only for points that are local or global maxima of the density. Therefore, cluster centers are recognized as points for which the value of $\delta_i$ is anomalously large.
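As a rough illustration of the two quantities just defined, the sketch below computes the local density and the distance to the nearest higher-density point from a precomputed distance matrix. It is a small pure-Python sketch, not the authors' parallel implementation; the function name `density_and_delta` is hypothetical, and treating a zero average distance as infinite density is an assumption made here to avoid division by zero.

```python
from typing import List, Tuple


def density_and_delta(dist: List[List[float]], m: int) -> Tuple[List[float], List[float]]:
    """Return (rho, delta) for every point given a symmetric distance matrix.

    rho[i]   = reciprocal of the mean distance from i to its m nearest points.
    delta[i] = distance from i to the nearest point of strictly higher rho,
               or the largest distance from i when no such point exists.
    """
    n = len(dist)
    rho = []
    for i in range(n):
        nearest = sorted(dist[i][j] for j in range(n) if j != i)[:m]
        mean_d = sum(nearest) / len(nearest)
        # Assumption: identical neighbours (mean distance 0) give infinite density.
        rho.append(1.0 / mean_d if mean_d > 0 else float("inf"))
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        if higher:
            delta.append(min(higher))
        else:
            delta.append(max(dist[i][j] for j in range(n) if j != i))
    return rho, delta


# Toy example: five points on a line, two well-separated groups.
pts = [0, 1, 2, 10, 11]
dist = [[abs(a - b) for b in pts] for a in pts]
rho, delta = density_and_delta(dist, m=2)
print(rho)
print(delta)
```

In the toy example, the points at 1 and 10 combine a comparatively high density with a large distance to any denser point, which is the signature of a cluster center in the decision graph. In the study the distance matrix would hold DTW distances and M would be 1.5% of the sample size.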

We obtained the IBP sequence data from each HD session. Since we wanted to summarize only the patterns of the waveforms, we used Min-Max Normalization,

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

to normalize the sequence data and then used the DTW method to calculate the distance (similarity) from other HD intradialytic SBP sequences. Then, we calculated the local density $\rho_i$ and the distance $\delta_i$ of all SBP sequences. We included 656,405 sequences in this study, for a total of 656,405 $\rho$ and $\delta$ values. We plotted $\rho$ and $\delta$ in a two-dimensional space called the decision graph, with $\rho$ as the abscissa and $\delta$ as the ordinate. According to the theory of the DPCA, the cluster cores are isolated points with relatively large $\rho$ and $\delta$ values in the decision graph. After the cluster centers have been found, each remaining point is assigned to the same cluster as its nearest neighbor of higher density.

Due to the vast amount of data we needed to process in the clustering process (we clustered 656,405 HD SBP sequences), we wrote our own code for DTW and DPCA to improve efficiency instead of using existing Python packages. The DTW algorithm was implemented in C and embedded in Python using the cffi package. The DPCA algorithm was implemented in Python and parallelized across 32 processes, a number chosen according to the capacity of the local server. All DTW and DPCA implementations were based on Python 2.7.13 and Python packages.
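The assignment step, in which each remaining point joins the cluster of its nearest neighbor of higher density, can be sketched as below. This is an illustrative single-process version with hypothetical names (`assign_clusters`, `centers`); it assumes the densest point is one of the chosen centers and breaks density ties by input order, neither of which is specified in the text.

```python
from typing import List


def assign_clusters(dist: List[List[float]], rho: List[float], centers: List[int]) -> List[int]:
    """Label every point with the index (0, 1, ...) of its cluster center.

    Points are visited in order of decreasing density, so the nearest
    denser neighbour of each point is always labelled before the point itself.
    """
    n = len(dist)
    labels = [-1] * n
    for label, c in enumerate(centers):
        labels[c] = label
    # Stable sort: ties in density keep their input order (an assumption).
    order = sorted(range(n), key=lambda i: -rho[i])
    for pos, i in enumerate(order):
        if labels[i] != -1:
            continue  # cluster centers are already labelled
        candidates = order[:pos]  # points with density at least as high
        if not candidates:
            raise ValueError("the densest point must be a cluster center")
        nearest = min(candidates, key=lambda j: dist[i][j])
        labels[i] = labels[nearest]
    return labels


# Toy example: five points on a line with precomputed densities.
pts = [0, 1, 2, 10, 11]
dist = [[abs(a - b) for b in pts] for a in pts]
rho = [1 / 1.5, 1.0, 1 / 1.5, 1 / 4.5, 1 / 5.0]
print(assign_clusters(dist, rho, centers=[1, 3]))
```

Because every point follows a chain of denser neighbors until it reaches a center, the assignment needs only one pass over the points once they are sorted by density.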

2.4 BP pattern assessment

2.4.1 Variables

The raw patterns were derived from the DTW and DPCA and changed over time. We weighted the one-year raw BP patterns of patients who survived at least one year to obtain the patients’ representative BP patterns as RSF model inputs. The weighting scheme is described in section 3.2.

We selected basic patient information, including sex and age (patient age at first HD), and two groups of HD-associated covariates to construct three prognostic models. The first group of covariates comprised existing BP characterization measures: interdialytic BP before an HD session, interdialytic BP after an HD session, and BP variability (standard deviation/mean). Because our work mainly focused on recognizing BP patterns through clustering, we wanted to compare the predictive ability of the representative BP patterns we recognized with that of these existing measures. We therefore used patients’ basic information and the first group of covariates to construct the first model. To see how a model with representative BP patterns performed, we used patients’ basic information and representative BP patterns to construct the second model. Because we wanted to summarize only the patterns of the BP waveforms, we used Min-Max Normalization before the clustering process. To compensate for the information lost during Min-Max Normalization, we added the minimum BP and maximum BP to the representative BP patterns to form the second group of HD-associated covariates, and we used patients’ basic information and this second group to construct the third model.

We used death due to cerebrovascular events as the survival outcome to calculate mortality. Cerebrovascular events mainly include cerebral infarction, cerebral hemorrhage and stroke. Survival time was calculated from the date of first HD to the cutoff date (August 25, 2016) in surviving patients or to the date of the cerebrovascular event. We compared the accuracy of the prognostic models constructed from these features to illustrate the association between representative BP patterns and prognosis.

2.4.2 RSFs

After obtaining a patient's representative BP pattern, we constructed an RSF model to illustrate the relationship between BP patterns and prognosis. RSF [24] is a random forest method for the analysis of right-censored survival data that considers both the survival time and the outcome. We used the log-rank splitting rule to grow survival trees, and we used out-of-bag (OOB) data to calculate the prediction error. The prediction error was measured as 1-C, where C is Harrell's concordance index [25]. The prediction error is between 0 and 1 and reflects how well the predictor ranks two random individuals in terms of survival; lower values indicate better ranking, and 0.5 corresponds to random guessing.

RSF algorithm:
1. Draw B bootstrap samples from the original data. Each bootstrap sample excludes an average of 37% of the data, referred to as OOB data.
2. Grow a survival tree for each bootstrap sample. At each node of the tree, randomly select p candidate variables. Split the node using the candidate variable that maximizes the survival difference between daughter nodes.
3. Grow the tree to full size under the constraint that a terminal node should have no fewer than d0 > 0 unique deaths.
4. Calculate a cumulative hazard function (CHF) for each tree and average them to obtain the ensemble CHF.
5. Using OOB data, calculate the prediction error for the ensemble CHF.

A total of 1028 patients who survived for at least one year were selected to construct three prognostic RSF models with three groups of covariates. The first RSF model was constructed with sex, age, interdialytic BP before an HD session, interdialytic BP after an HD session, and BP variability. The second RSF model was constructed with sex, age, and representative BP patterns. To compensate for information loss in Min-Max Normalization, we added the minimum BP and the maximum BP to the second model to construct the third model. We also combined all covariates to construct an RSF model to analyze the variable importance (VIMP). The error rate was estimated by the OOB error. The OOB error has been reported to be an approximately unbiased estimate of the generalization error and to be as accurate as an estimate obtained from a test set of the same size as the training set [26-28]. We therefore did not use a separate test set or cross-validation. All computations and modeling were performed using the freely available R package randomForestSRC [29].
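To make the prediction error 1-C concrete, the following sketch computes a simplified version of Harrell's concordance index for right-censored data. It is written in Python for illustration only (the study itself used the R package randomForestSRC); the function name `harrell_c` is hypothetical, and pairs with tied survival times are simply skipped, which is a simplification of how survival software handles ties.

```python
from typing import Sequence


def harrell_c(time: Sequence[float], event: Sequence[int], risk: Sequence[float]) -> float:
    """Simplified Harrell's C: share of comparable pairs ranked correctly.

    time  : follow-up time per subject
    event : 1 if the event was observed, 0 if the subject was censored
    risk  : predicted risk score (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i has an observed event
            # strictly before subject j's event or censoring time.
            if event[i] == 1 and time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


time = [2, 4, 6, 8]
event = [1, 1, 0, 1]
risk = [0.9, 0.7, 0.4, 0.1]
c = harrell_c(time, event, risk)
print("C =", c, "prediction error =", 1 - c)
```

In an RSF the risk score for each subject would come from the ensemble CHF computed on the trees for which that subject was out of bag.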

3. RESULTS

3.1 Data description

A total of 14,885 patients received HD 3 times a week, and each session lasted 3 to 4.5 h. We retained sessions with at least 4 and at most 10 SBP points, which left 656,405 HD sessions for the analysis of BP patterns. Fig. 2 shows the distribution of the number of SBP points per session.

Fig. 2. Statistical distribution of the SBP counts in an HD session, with counts of the sequence quantities for various sequence lengths.

A total of 1028 patients who survived for no less than one year were selected to construct three prognostic RSF models with groups of covariates. The covariates included basic patient information (sex and age), HD-related variables used in previous studies (interdialytic BP before an HD session, interdialytic BP after an HD session, and BP variability) and HD-related variables used in this study (BP pattern, minimum BP, and maximum BP). Table 1 contains descriptive statistics of the study population.

Table 1. Characteristics of the study population

| Covariate | Frequency | Ratio | Mean | Minimum | Maximum |
|---|---|---|---|---|---|
| Sex: Male | 636 | 0.62 | - | - | - |
| Sex: Female | 392 | 0.38 | - | - | - |
| Age at first HD (years) | - | - | 52 | 10 | 89 |
| Interdialytic SBP before HD (mm Hg) a | - | - | 146 | 139 | 155 |
| Interdialytic SBP after HD (mm Hg) a | - | - | 141 | 134 | 144 |
| BP variability b | - | - | 0.069 | 0.063 | 0.088 |
| Minimum BP (mm Hg) c | - | - | 125 | 69 | 171 |
| Maximum BP (mm Hg) c | - | - | 152 | 89 | 211 |

a Interdialytic SBP before or after HD and BP variability are expressed by the mean value in the first 10 HD sessions.
b BP variability was calculated by the standard deviation/mean.
c The maximum or minimum BP is expressed by the mean value in the first survival year among ESRD patients.

3.2 Outcomes related to SBP pattern recognition

The results of clustering are displayed in the decision graph shown in Fig. 3. According to the theory of the DPCA, the cluster cores are isolated points with relatively large $\rho$ and $\delta$ values in the decision graph. The cluster cores are represented by the red points in Fig. 3. Because the DTW distance does not satisfy the triangle inequality, points (sequences) with the same local density in Fig. 3, such as b1 and b2, have a DTW distance of 0 between them but different $\delta$ values (distances to the higher density point). We found that the DTW distances between the isolated red points with the same local density are all 0 and that their waveforms are similar. We therefore classified the isolated red points with the same local density as one pattern, so the clustering cores are represented by five isolated points (sequences) with different local densities: a, b1, c1, d, and e.

Fig. 3. Decision graph of the DPCA with M=1.5% N. N is the number of all samples. The cluster cores are the red points annotated by a, b1, b2, c1, c2, c3, c4, c5, c6, d, and e. The local density $\rho$ is based on the similarity measurement of the DTW method, and $\delta$ is based on $\rho$.

We used the five cluster cores in the decision graph as the raw BP patterns, which are displayed in Fig. 4. Pattern A (sequence a) first rises and then falls slightly, pattern B (sequence b1) is purely rising, pattern C (sequence c1) is purely falling, pattern D (sequence d) first falls and then rises slightly, and pattern E (sequence e) falls slightly and then rises. The patterns identified in Fig. 4 closely resemble the clinically derived patterns reported by Goldstein et al. [20], who fit a curve of IBP using regression splines and then calculated first and second derivatives to produce four mutually exclusive classifications at different time points. This agreement between an unsupervised clustering and a clinically motivated classification supports the view that fluctuations of BP measurements during HD sessions relate to clinical outcomes.

Fig. 4. Raw BP patterns. The abscissa indicates the order of the HD SBP sequences, and there is no unit due to the warping process of DTW. The ordinate is normalized, and the unit is 1. Pattern A is the waveform of sequence a. Pattern B is the waveform of sequence b1. Pattern C is the waveform of sequence c1. Pattern D is the waveform of sequence d. Pattern E is the waveform of sequence e.

According to the clustering results, the sequence of each HD session can be classified as a BP pattern. We weighted the one-year raw BP patterns of a patient who survived at least one year to obtain the patient’s representative BP pattern. For each of the five patterns, we counted the overall occurrence Q (occurrence in the 656,405 sequences) and the individual occurrence P (occurrence in one person) in one year. We calculated the ratio R=P/Q for each of the five patterns and took the pattern with the largest R as the patient’s representative BP pattern. The distribution of the patterns and the representative patterns are displayed in Table 2.

Table 2. Distribution of patterns and representative patterns

| SBP pattern | Overall occurrence a | Overall ratio | Weighted distribution b | Weighted ratio |
|---|---|---|---|---|
| A | 72655 | 11% | 137 | 13% |
| B | 105106 | 16% | 222 | 22% |
| C | 330256 | 50% | 218 | 21% |
| D | 94740 | 14% | 241 | 24% |
| E | 53648 | 9% | 210 | 20% |
| Total | 656405 | 100% | 1028 | 100% |

a The overall occurrence is the distribution of SBP patterns in the 656,405 sequences.
b The weighted distribution is the distribution of representative patterns in the one-year-survival population.
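The weighting rule R=P/Q can be illustrated with the overall occurrences reported in Table 2. The sketch below is an illustration only: the function name `representative_pattern` and the example patient are hypothetical, and ties between patterns are resolved by first occurrence, which the text does not specify.

```python
from collections import Counter
from typing import Dict, List

# Overall occurrence Q of each pattern, taken from Table 2.
OVERALL_Q: Dict[str, int] = {"A": 72655, "B": 105106, "C": 330256, "D": 94740, "E": 53648}


def representative_pattern(session_patterns: List[str], overall: Dict[str, int] = OVERALL_Q) -> str:
    """Return the pattern with the largest R = P / Q for one patient.

    session_patterns lists the raw pattern of each of the patient's
    sessions during the one-year window.
    """
    p = Counter(session_patterns)  # individual occurrence P per pattern
    ratios = {pattern: count / overall[pattern] for pattern, count in p.items()}
    return max(ratios, key=ratios.get)


# Hypothetical patient: pattern C is the most frequent, but relative to how
# common each pattern is overall, pattern B stands out.
patient = ["C"] * 80 + ["B"] * 40 + ["A"] * 10
print(representative_pattern(patient))
```

Dividing by Q keeps the very common pattern C from dominating every patient, which is consistent with the weighted distribution in Table 2 being far more even than the overall distribution.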

The Kaplan-Meier (KM) survival curves of the 5 representative patterns are displayed in Fig. 5. Patterns B (purely rising) and C (purely falling) appear to be the principal patterns: they account for the greatest proportions of the overall distribution (Table 2) and are the most widely separated in the KM curves (Fig. 5). Pattern C had the highest survival rate, and pattern B had the lowest. Comparing the five patterns, we inferred that patterns with a falling waveform correspond to a better survival rate.

Fig. 5. KM curves for the 5 representative patterns. Log-rank test P=0.0172<0.05. The representative pattern is measured by the weighted value in the first survival year of ESRD patients.

3.3 Outcomes related to SBP pattern assessment

To illustrate the association between HD BP patterns and prognosis, we constructed three RSF models, and the accuracy of the models is shown in Fig. 6.

Fig. 6. Accuracy of the three RSF models. The three RSF models were repeated 100 times to obtain the box plot. Model 1 is the RSF model with the input covariates of sex, age, interdialytic BP before an HD, interdialytic BP after an HD, and BP variability. Model 2 is the RSF model with the input covariates of sex, age, and representative BP patterns. Model 3 is the RSF model with the input covariates of sex, age, representative BP patterns, minimum BP, and maximum BP.

The introduction of BP patterns improved the accuracy of the prognostic models. Furthermore, we combined the variables to construct an RSF prognostic model to measure VIMP. The importance of the variables is shown in Fig. 7. The results showed that the importance of the BP pattern-related variables (model 2 and model 3: BP pattern, minimum BP, and maximum BP) is much greater than that of the variables used in previous studies (model 1: interdialytic BP before HD, interdialytic BP after HD, and BP variability).

Fig. 7. Variable importance (VIMP) of RSF model 4, with the input covariates of sex, age, interdialytic BP before HD, interdialytic BP after HD, BP variability, BP patterns, minimum BP, and maximum BP.

3.4 Stability of clustering

The clustering results varied with the M value in equation (3). Following the settings in Rodriguez and Laio's algorithm [23], we set M equal to 1.5% of the total sample size N. We display the decision graphs for a series of different M values in Fig. 8. The result is robust over a wide range of M: the cluster cores are the same five isolated points for M between 1% N and 7% N, and only the smallest value tested (M=0.5% N) yields fewer cores.

Fig. 8. Different decision graphs of a series of different M values. N is the number of all samples. We obtained 3 cluster cores when M=0.5% N. We obtained 5 cluster cores when M varied from 1% N to 7% N.

4. DISCUSSION

In this paper, we proposed an approach to classify individuals based on a series of longitudinal clinical measurements, using the example of repeated BP measurements routinely taken during HD sessions of patients with ESRD. We used the DTW method to measure the similarity between two SBP sequences, and five SBP patterns were obtained by applying the DPCA to the intradialytic SBP data. To illustrate the association between BP patterns and prognosis, we constructed three RSF models. The accuracy of model 2 (sex, age, and BP pattern) was approximately 3.7% higher (mean of 100 repetitions) than that of model 1 (sex, age, interdialytic BP before HD, interdialytic BP after HD, and BP variability). The results suggest that BP patterns have clinical and prognostic significance regarding the risk of cerebrovascular events. Model accuracy improved by a further 2.6% (mean of 100 repetitions) when minimum BP and maximum BP were added, indicating that the minimum and maximum BPs during HD are related to the risk of cerebrovascular events. This result is consistent with the argument that simple summary statistics can provide better prediction in HD [30]. The results are also consistent with previous studies showing that intradialytic hypotension [7-10] and hypertension [11-13] are related to adverse outcomes.

The KM survival curves of the 5 representative patterns are displayed in Fig. 5. Patterns B (purely rising) and C (purely falling) appear to be the principal patterns: they account for the greatest proportions of the overall distribution (Table 2) and are the most widely separated in the KM curves (Fig. 5). This finding is consistent with the decision graph, in which patterns B and C are cluster cores with relatively large $\rho$ and $\delta$ values. Pattern C had the highest survival rate, and pattern B had the lowest. Comparing the five patterns, we inferred that patterns with a falling waveform correspond to a better survival rate. This finding is consistent with the conclusions of previous studies showing that patients on HD with increasing BP during dialysis (i.e., intradialytic hypertension) have a poorer prognosis than patients with decreases in BP [11,31,32]. A certain degree of SBP decrease during HD sessions may therefore be associated with a lower risk of cerebrovascular events, which mainly include cerebral infarction, cerebral hemorrhage and stroke. This interpretation is consistent with the finding in the literature that hypertension is a risk factor for stroke [33].

Our approach also has potential applications in a wide range of analyses. For medical studies that capture dense sequences of measurements, such as health data that are routinely captured during the typical care of patients, we can identify patterns with clustering methods and analyze their relationship to prognosis. For example, BP series data during kidney transplant surgery may help predict surgical outcomes; in the intensive care unit, respiratory rates can be used to assess the risk of lung injury [34]; and beat-to-beat heart rate variability from continuous electrocardiogram monitoring can help predict outcomes after stroke [35,36]. Furthermore, the use of EHRs also provides an opportunity to analyze integrated laboratory data. Interventions based on patterns of intradialytic SBP, perhaps in combination with other EHR data, may ultimately result in improved clinical outcomes.

Our work has several strengths and limitations. We took advantage of a large dataset to identify a relatively homogeneous sample of patients undergoing regular HD. We were able to assess the stability of the clustering by comparing the decision graphs obtained with different M values. Furthermore, our approach to SBP pattern recognition is unsupervised because it is a clustering method, so the classification of BP patterns did not depend on subjective, predefined categories. We used the DTW method to measure the similarity between two temporal sequences, so this method could also be applied to other serial laboratory measurements in an EHR that are taken more sporadically. However, the weighting step was affected by the time span and by the method we used to weight different patterns. There may be a better method to cluster IBP patterns, and future work could extend in this direction.

Overall, we suggest an approach to identify intradialytic SBP patterns to classify ESRD patients based on dense sequences in EHR data. Because it is based on clustering, this work classifies ESRD patients in an unsupervised manner, which distinguishes it from previous studies. Given the limitations of the data and methods, the findings should be interpreted with caution. This investigation is open-ended and warrants further exploration from a combined statistical, informatics and clinical perspective.

Conflict of interest: No potential conflicts of interest relevant to this article were reported.

Acknowledgements: This work was supported by the National High-tech R&D Program (No. 2015AA020109) and the National Natural Science Foundation of China (No. 81672916, No. 61702446). We thank all reviewers and editors who helped improve this work.

REFERENCES

[1] United States Renal Data System, 2016 Annual Data Report Highlights, National Institutes of Health, National Institute of Diabetes and Digestive and Kidney Diseases, Bethesda, MD, 2016.

[2] United States Renal Data System, Incidence, Prevalence, Patient Characteristics, and Treatment Modalities, National Institutes of Health, National Institute of Diabetes and Digestive and Kidney Diseases, Bethesda, MD, 2016.

[3] S.I. Hallan, E. Ritz, S. Lydersen, S. Romundstad, K. Kvenild, S.R. Orth, Combining GFR and albuminuria to classify CKD improves prediction of ESRD, J. Am. Soc. Nephrol. 20 (2009) 1069-1077.

[4] S. Vegter, A. Perna, M.J. Postma, G. Navis, G. Remuzzi, P. Ruggenenti, Sodium intake, ACE inhibition, and progression to ESRD, J. Am. Soc. Nephrol. 23 (2012) 165-173.

[5] G.M. London, A.P. Guerin, F.H. Verbeke, B. Pannier, P. Boutouyrie, S.J. Marchais, et al., Mineral metabolism and arterial functions in end-stage renal disease: potential role of 25-hydroxyvitamin D deficiency, J. Am. Soc. Nephrol. 18 (2007) 613-620.

[6] E.C. Somers, W. Marder, P. Cagnoli, E.E. Lewis, P. DeGuire, C. Gordon, et al., Population-based incidence and prevalence of systemic lupus erythematosus: the Michigan Lupus Epidemiology and Surveillance Program, Arthritis Rheumatol. 66 (2014) 369-378.

[7] A. Tisler, K. Akocsi, B. Borbas, L. Fazakas, S. Ferenczi, S. Gorogh, et al., The effect of frequent or occasional dialysis-associated hypotension on survival of patients on maintenance haemodialysis, Nephrol. Dial. Transplant. 18 (2003) 2601-2605.

[8] T. Shoji, Y. Tsubakihara, M. Fujii, E. Imai, Hemodialysis-associated hypotension as an independent risk factor for two-year mortality in hemodialysis patients, Kidney Int. 66 (2004) 1212-1220.

[9] B.V. Stefansson, S.M. Brunelli, C. Cabrera, D. Rosenbaum, E. Anum, K. Ramakrishnan, et al., Intradialytic hypotension and risk of cardiovascular disease, Clin. J. Am. Soc. Nephrol. 9 (2014) 2124-2132.

[10] J.E. Flythe, H. Xue, K.E. Lynch, G.C. Curhan, S.M. Brunelli, Association of mortality risk with various definitions of intradialytic hypotension, J. Am. Soc. Nephrol. (2014) ASN.2014020222.

[11] J.K. Inrig, E.Z. Oddone, V. Hasselblad, B. Gillespie, U.D. Patel, D. Reddan, et al., Association of intradialytic blood pressure changes with hospitalization and mortality rates in prevalent ESRD patients, Kidney Int. 71 (2007) 454-461.

[12] J.K. Inrig, U.D. Patel, R.D. Toto, L.A. Szczech, Association of blood pressure increases during hemodialysis with 2-year mortality in incident hemodialysis patients: a secondary analysis of the Dialysis Morbidity and Mortality Wave 2 Study, Am. J. Kidney Dis. 54 (2009) 881-890.

[13] C.Y. Yang, W.C. Yang, Y.P. Lin, Postdialysis blood pressure rise predicts long-term outcomes in chronic hemodialysis patients: a four-year prospective observational cohort study, BMC Nephrol. 13 (2012) 12.

[14] J.E. Flythe, S. Kunaparaju, K. Dinesh, K. Cape, H.I. Feldman, S.M. Brunelli, Factors associated with intradialytic systolic blood pressure variability, Am. J. Kidney Dis. 59 (2012) 409-418.

[15] J.E. Flythe, J.K. Inrig, T. Shafi, T.I. Chang, K. Cape, K. Dinesh, et al., Association of intradialytic blood pressure variability with increased all-cause and cardiovascular mortality in patients treated with long-term hemodialysis, Am. J. Kidney Dis. 61 (2013) 966-974.

[16] P. Muntner, J. Whittle, A.I. Lynch, L.D. Colantonio, L.M. Simpson, P.T. Einhorn, et al., Visit-to-visit variability of blood pressure and coronary heart disease, stroke, heart failure, and mortality: a cohort study, Ann. Intern. Med. 163 (2015) 329-338.

[17] T.I. Chang, J.E. Flythe, S.M. Brunelli, P. Muntner, T. Greene, A.K. Cheung, et al., Visit-to-visit systolic blood pressure variability and outcomes in hemodialysis, J. Hum. Hypertens. 28 (2014) 18-24.

[18] T. Shafi, S.M. Sozio, K.J. Bandeen-Roche, P.L. Ephraim, J.R. Luly, W.L. St Peter, et al., Predialysis systolic BP variability and outcomes in hemodialysis patients, J. Am. Soc. Nephrol. 25 (2014) 799-809.

[19] M.M. Assimon, J.E. Flythe, Intradialytic blood pressure abnormalities: the highs, the lows and all that lies between, Am. J. Nephrol. 42 (2015) 337-350.

[20] B.A. Goldstein, T.I. Chang, W.C. Winkelmayer, Classifying individuals based on a densely captured sequence of vital signs: an example using repeated blood pressure measurements during hemodialysis treatment, J. Biomed. Inform. 57 (2015) 219-224.

[21] F. Itakura, Minimum prediction residual principle applied to speech recognition, IEEE Trans. Acoustics Speech Signal Process. 23 (1975) 67-72.

[22] D.J. Berndt, J. Clifford, Using dynamic time warping to find patterns in time series, KDD Workshop, Seattle, WA, 1994, pp. 359-370.

[23] A. Rodriguez, A. Laio, Machine learning. Clustering by fast search and find of density peaks, Science 344 (2014) 1492-1496.

[24] H. Ishwaran, U.B. Kogalur, E.H. Blackstone, M.S. Lauer, Random survival forests, Ann. Appl. Stat. (2008) 841-860.

[25] F.E. Harrell, Jr., R.M. Califf, D.B. Pryor, K.L. Lee, R.A. Rosati, Evaluating the yield of medical tests, JAMA 247 (1982) 2543-2546.

[26] R. Tibshirani, Bias, variance and prediction error for classification rules, University of Toronto, Department of Statistics, 1996.

[27] D.H. Wolpert, W.G. Macready, An efficient method to estimate bagging's generalization error, Mach. Learn. 35 (1999) 41-55.

[28] L. Breiman, Bagging predictors, Mach. Learn. 24 (1996) 123-140.

[29] H. Ishwaran, U.B. Kogalur, Random Forests for Survival, Regression and Classification (RF-SRC), R package version 2.4.2, 2017.

[30] B.A. Goldstein, G.M. Pomann, W.C. Winkelmayer, M.J. Pencina, A comparison of risk prediction methods using repeated observations: an application to electronic health records for hemodialysis, Stat. Med. 36 (2017) 2750-2763.

[31] J.K. Inrig, Intradialytic hypertension: a less-recognized cardiovascular complication of hemodialysis, Am. J. Kidney Dis. 55 (2010) 580-589.

[32] P.N. Van Buren, C. Kim, R. Toto, J.K. Inrig, Intradialytic hypertension and the association with interdialytic ambulatory blood pressure, Clin. J. Am. Soc. Nephrol. 6 (2011) 1684-1691.

[33] H. Ueshima, A. Sekikawa, K. Miura, T.C. Turin, N. Takashima, Y. Kita, et al., Cardiovascular disease and risk factors in Asia: a selected review, Circulation 118 (2008) 2702-2709.

[34] T.P. Clemmer, Computers in the ICU: where we started and where we are now, J. Crit. Care 19 (2004) 201-207.

[35] S.C. Tang, H.I. Jen, Y.H. Lin, C.S. Hung, W.J. Jou, P.W. Huang, et al., Complexity of heart rate variability predicts outcome in intensive care unit admitted patients with acute stroke, J. Neurol. Neurosurg. Psychiatry 86 (2015) 95-100.

[36] J.M. Schmidt, M. Crimmins, H. Lantigua, A. Fernandez, C. Zammit, C. Falo, et al., Prolonged elevated heart rate is a risk factor for adverse cardiac events and poor outcome after subarachnoid hemorrhage, Neurocrit. Care 20 (2014) 390-398.

Graphical abstract

Highlights

• We suggest an approach to identify intradialytic systolic BP patterns.
• We classify ESRD patients based on dense sequences in EHR data.
• We demonstrate that models with BP patterns predict cerebrovascular events better.
• We suggest that our clustering approach is generalizable to analyses of dense EHR data.