International Journal of Medical Informatics 129 (2019) 175–183
Machine learning approaches for risk assessment of peripherally inserted central catheter-related vein thrombosis in hospitalized patients with cancer
Shanshan Liu a, Fengyi Zhang b,c, Lingling Xie d, Ying Wang a, Qiufen Xiang d, Zhiying Yue d, Yue Feng e, Yanmeng Yang e, Junying Li d,⁎, Li Luo c,⁎, Chunhua Yu d,⁎
a Center of Infectious Diseases, West China Hospital of Sichuan University, China
b West China Biomedical Big Data Center, West China Hospital, Sichuan University, China
c Business School, Sichuan University, China
d Department of Thoracic Cancer and Cancer Research Center, West China Hospital of Sichuan University, China
e Nursing School, Sichuan University, China
ARTICLE INFO
Keywords: Peripherally inserted central catheter-related vein thrombosis; Machine learning approaches; Assessment; Seeley
ABSTRACT
Objective: The aim of this study was to conduct an effective assessment of peripherally inserted central venous catheter (PICC)-related thrombosis based on machine learning (ML) techniques, taking genotype into account.
Design: We conducted a prospective cohort study of 348 cancer patients with PICCs who were admitted to the Department of Oncology of West China Hospital over a 1-year period, between February 1, 2016, and February 31, 2017. We obtained the clinical attributes, onset, duration, and outcome of thrombosis from electronic health records. We assigned all patients to either the training or the testing set, and compared four ML models with the currently used criteria.
Results: ML methods showed good efficiency in PICC-related thrombosis risk assessment (with areas under the curve of 0.7733, 0.7869, 0.7833, and 0.7717, respectively) and outperformed the currently used criteria (Seeley), which did not identify any positive case.
Conclusions: Our research confirmed that ML approaches are powerful tools for identifying cancer patients with a high risk of PICC-related thrombosis, and that they outperform the currently used criteria (Seeley). Our research also offers some indications of the predictors and risk factors of PICC-related thrombosis. Based on our research, more precise assessments can be performed in cancer patients with PICCs to help decide on prophylaxis and effectively lower the incidence of PICC-related thrombosis.
1. Introduction
Peripherally inserted central venous catheters (PICCs) are common in cancer patients who require long-term chemotherapy and supportive care therapy [1]. They lower the peripheral vein irritation caused by chemotherapy drugs, avoid the mechanical and chemical phlebitis that result from repeated venipuncture, and prevent tissue necrosis caused by excessive permeability. However, PICCs also have some disadvantages. PICC-related vein thrombosis is one of the most common and serious complications of PICC placement [2]. Recent studies have shown that the presence of cancer increases the risk of catheter-related thrombosis (CRT) by 4.1-fold [3]. In PICC-related thrombosis, most clinical symptoms are not obvious after the thrombus has formed [4]. Consequently, the concealed thrombus can result in a post-embolization syndrome, which hampers treatment, aggravates pain and economic burden, and may even lead to embolism and endanger the patient's life [5]. Hence, methods to avoid such outcomes have attracted the attention of many oncologists and other specialists.
Prophylactic anticoagulation may reduce the incidence of PICC-related thrombosis in cancer patients; however, it can also increase the risk of bleeding [6,7]. Hence, assessing the risk of PICC-related thrombosis according to the status of the patient is important. It would be helpful to identify patients at high risk of PICC-related thrombosis and provide suggestions to the treating physicians to avoid thrombosis.
Several studies have focused on venous thromboembolism (VTE) risk assessment and forecasting. These studies have assessed the risk of VTE in various settings, such as surgical [8,9], hospitalized [10], outpatient [11,12], medical [13], and chemotherapy [12] settings. Khorana et al. proposed recent guidelines for cancer outpatients undergoing chemotherapy [14].
⁎ Corresponding authors: J. Li, L. Luo, and C. Yu.
https://doi.org/10.1016/j.ijmedinf.2019.06.001 Received 8 January 2018; Received in revised form 4 May 2019; Accepted 3 June 2019 1386-5056/ © 2019 Elsevier B.V. All rights reserved.
However, studies on PICC-related thrombosis are rare. Seeley et al. (2007) [15] proposed a prediction tool for PICC-related thrombosis. Its sensitivity, specificity, and negative predictive value (NPV) are high, but its positive predictive value (PPV) is only 28%, so it fails to reliably classify patients who test positive for PICC-related thrombosis. It is also suboptimal because the data were not divided into training and testing sets: the same samples were used for model development and verification, so its validity is not reliable and its evaluation indicators are biased. Hence, further testing on a separate testing set is needed to extend the validity of this instrument. Furthermore, genotypes were not considered, although some published articles have suggested that genetics is an independent risk factor of thrombosis, especially genotype 4G5G [16–19]. The study of the 4G5G polymorphism began in the late 1980s [16]. With the advancement of research, the associations between genotype 4G5G and thrombosis have been increasingly studied [17–19]. The 4G5G polymorphism is located in the promoter region of the gene for plasminogen activator inhibitor-1 (PAI-1), the major inhibitor of the fibrinolytic system; it can increase PAI-1 levels, thereby reducing fibrinolytic activity and promoting thrombosis.
Machine learning (ML) techniques are powerful tools for analyzing latent patterns and making assessments [20]. Previous studies applied ML techniques to assess venous thromboembolism risk and achieved outstanding results. Emily et al. (2012) [21] reported that ML methods can induce models that exceed previously developed scoring models for VTE. Furthermore, Patrizia et al. [22] developed VTE risk predictors with a multiple kernel machine learning model that combines random optimization (RO) and support vector machines (SVMs), which could optimize the relative importance of clinical attributes. In addition, Narain et al. [23] affirmed that natural language processing (NLP) along with ML classifiers could accurately identify VTE. All of the above-mentioned studies verify that ML techniques are powerful tools in thrombosis risk assessment. However, to the best of our knowledge, specific research on ML for PICC-related thrombosis has not yet been conducted.
The aim of this study was to effectively assess the risk of PICC-related thrombosis on the basis of ML techniques, with consideration of the genotype, where risk is defined as the probability of developing PICC-related thrombosis. With the ML model, we can objectively evaluate the risk of PICC-related thrombosis in patients with tumors and identify high-risk individuals earlier, which will promote the rational use of health resources and reduce the incidence of PICC-related thrombosis. To the best of our knowledge, this is the first study to identify patients with high risks of PICC-related thrombosis, and it will be a good reference point for further research. In addition, it can be used in medical decision making.
2. Methods
2.1. Data source
We conducted a prospective cohort study of 348 cancer patients with PICCs admitted to the Department of Oncology of West China Hospital (a 4300-bed comprehensive hospital in China) over a 1-year period, between February 1, 2016, and February 31, 2017. This investigation was approved by the institutional review board of the hospital. Patients were included if they had a PICC (inserted the day before), with or without chemotherapy and thromboprophylaxis. None of the patients had been diagnosed as having thrombosis on color Doppler ultrasonography at inclusion. PICC-related thrombosis was diagnosed by radiologists who interpreted the color Doppler ultrasonography images; the diagnoses were based not on personal experience but on scientific evidence [24]. On ultrasound examination, the thrombosis appeared as a low-echo area in the lumen of the blood vessel, presenting as a mass. The lumen did not disappear under pressure, and there was essentially no blood flow signal. Patients who had multiorgan failure or a life expectancy of < 1 month were excluded. Informed consent was obtained from the patients prior to the study.
All the selected patients were followed up for 30 days after catheter insertion. Follow-up assessments consisted of repeated color Doppler ultrasonography, a diagnostic method for symptomatic or asymptomatic PICC-related thrombosis, at weeks 1, 2, 3, and 4. Time to event was calculated from the date of PICC insertion until PICC-related thrombosis or the most recent follow-up visit. Before PICC insertion, all the subjects underwent peripheral blood counts, coagulation examination, and genetic tests, the results of which were considered among the most important clinical attributes. Furthermore, we collected clinical attributes such as onset, duration, and outcome of thrombosis from electronic health records, an information management system covering the whole process of work in our hospital; information from the anamnesis was also collected via face-to-face communication.
The data included sex, age, body mass index (BMI), smoking, drinking, nutrition (NRS 2002), metastatic cancer, trauma, surgery, family history of thrombosis, comorbidities (e.g., diabetes, heart failure, and renal failure), chemotherapy, and radiotherapy. Catheter-related data included technique, size and lumen of the catheter, site and vein of insertion, and tip location and malposition. Biochemical indicators and genotype were also considered in this study. Biochemical indicators of thrombosis include D-dimer, prothrombin time (PT), plasma fibrinogen (FIB), white blood cell count (WBC), and platelet count (PLT). Regarding genotype, the 4G5G polymorphism affects PAI-1, a major inhibitor of the fibrinolytic system; it consists of the 4G and 5G alleles, which give three genotypes, namely 4G/4G, 4G/5G, and 5G/5G. The PAI-1 activity of genotype 4G/4G is the highest, and it is the most likely to reduce fibrinolytic activity and promote thrombosis; genotype 5G/5G is the least likely to reduce fibrinolytic activity and promote thrombosis. An independent risk factor for general thrombosis is likely to be an equally independent risk factor for PICC-related thrombosis. Groups and eligible clinical predictors are detailed in Table 1.

Table 1
Predictors considered in this study.
Groups                  Clinical attributes
Demographic data        Sex, Age, BMI, Smoking, and Drinking
Physical health         Nutrition, Self-care ability, and Performance status
Disease types           Tumor site and stage, Metastatic cancer, Coronary artery disease, Heart failure, COPD, Hypertension, Diabetes, and Septicemia/Bacteremia
Anamnesis               Trauma, Surgery, VTE, Family history of VTE, and Central venous catheter (CVC/PICC/Port)
Biomedical indicators   D-Dimer, APTT, PT, FIB, WBC, and PLT
Treatment               Vascular interventional therapy, Hormonotherapy, Chemotherapy and Chemotherapeutic drugs, Radiotherapy, and Anticoagulation
Catheter factors        Technique, Size and lumen of the catheter, Site and vein of insertion, and Tip location and malposition
Genotype                Genotype 4G5G
Abbreviations: BMI, body mass index (kg/m2); Nutrition, Nutritional Risk Screening (NRS) 2002 score, where ≥3 indicates nutritional risk; COPD, chronic obstructive pulmonary disease; VTE, venous thromboembolism; CVC, central venous catheter; PICC, peripherally inserted central venous catheter; Port, implantable venous access port; APTT, activated partial thromboplastin time; PT, prothrombin time; FIB, plasma fibrinogen; WBC, white blood cell count; PLT, platelet count; Technique, vascular visualization (ultrasonography guided vs. dynamic technique); Tip location, the tip of the PICC is located at the cavoatrial junction, as confirmed radiographically; Malposition, the tip of the PICC is not in the normal location.
2.2. Random Forest with LASSO
Random forest (RF) [25,26] is an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees during training and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct the propensity of decision trees to overfit their training set [27].
In statistics and machine learning, LASSO (least absolute shrinkage and selection operator) is a regression analysis method that performs both variable selection and regularization to enhance the prediction accuracy and interpretability of the statistical model it produces. It was introduced by Robert Tibshirani in 1996, based on Leo Breiman's nonnegative garrote [28,29]. LASSO was originally formulated for the least squares model, and this simple case reveals a substantial amount about the behavior of the estimator, including its relationship to ridge regression and best subset selection, and the connection between LASSO coefficient estimates and so-called soft thresholding. It also reveals that (as in standard linear regression) the coefficient estimates need not be unique if covariates are collinear. The general formulation of LASSO is as given below:

$$\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^{P}} \left\{ \frac{1}{N} \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 \right\}$$

where y is the vector of N responses, X is the N × P matrix of predictors, β is the coefficient vector, and λ ≥ 0 controls the strength of the penalty.
In this study, we combined the LASSO and RF techniques to construct a powerful model (LASSO-RF) that achieves excellent attribute selection and outcome classification. Owing to its excellent performance in predictor selection, LASSO is used to select the predictors, and RF acts as the classifier to identify patients with a high risk of PICC-related thrombosis.

2.3. Experimental settings
Cases with PICC-related thrombosis were labelled as positive cases, and the observation of VTE was used as the response variable in this study. To construct and validate our PICC-related thrombosis risk assessment model, the patient data were randomly divided into two data sets as follows:
(1) Training set: 50% of the cases were used for training.
(2) Testing set: 50% of the cases were used to test the performance of the PICC-related thrombosis risk assessment model.
The percentages of cases with PICC-related thrombosis in the training and testing sets were both 16.38%.
We used five models to compare assessment outcomes: Seeley, Seeley-RF, Seeley-LASSO-RF, RF, and LASSO-RF. Seeley refers to the criteria proposed by Seeley et al. (2007) [15], which are widely used for PICC-related thrombosis risk assessment, with a cutoff score of 20 points. Seeley-RF and Seeley-LASSO-RF are the models that include Seeley as a predictor; they differ in whether LASSO is used to select predictors. RF and LASSO-RF are the non-Seeley versions of Seeley-RF and Seeley-LASSO-RF, respectively.
Sensitivity, specificity, PPV, NPV, and area under the curve (AUC) were used to measure the performance of each model. We also used Kaplan-Meier curves to present the performance of each risk assessment model. Seeley was excluded from this analysis, as it cannot identify any positive cases. In addition, the log-rank test was performed, and p values were provided.
The Kaplan-Meier estimator, also known as the product limit estimator, is a nonparametric statistical tool used to estimate the survival function from lifetime data. In medical research, it is often used to measure the fraction of patients living for a certain amount of time after treatment. The log-rank test is a hypothesis test to compare the survival distributions of two samples. It is a nonparametric test and is appropriate when the data are right skewed and censored. It is widely used in clinical trials to establish the efficacy of a new treatment in comparison with a control treatment when the measurement is the time to event.
For robustness, we employed Monte Carlo cross-validation [30], a common and effective technique in model selection that is widely used in the medical field [31,32]. One hundred repeated experiments were conducted with independent sampling of the training and testing sets, and the mean and standard deviation of the above-mentioned metrics were analyzed. To compare the performance of the models, t-tests were employed. The null hypothesis of the one-sided t-test between A and B is that A is not greater than B, whereas the null hypothesis of the two-sided t-test between A and B is that there is no difference between A and B. Hence, in this study, the one-sided t-test is more suitable.
Finally, we measured the importance of the predictors with the highest scores using RF (in terms of the mean decrease in accuracy) in the whole dataset. All the experiments were conducted in R (version 3.3.3).

3. Results
Tables 2 and 3 summarize the results achieved in the training and testing sets, respectively, using Seeley and the ML models. As shown in Table 2, in the training set the two ML models without LASSO achieved perfect performance, while the two models with LASSO were less precise. For Seeley-LASSO-RF, the AUC, PPV, NPV, sensitivity, and specificity are 0.8761, 0.5854, 0.9627, 0.8276, and 0.8836, respectively; for LASSO-RF they are 0.9355, 0.8571, 0.9660, 0.8276, and 0.9726, respectively. The performance of Seeley was much worse than that of the ML models in nearly all aspects; it attained a specificity of 1 only because it identified all cases as negative.
As shown in Table 3, in the testing set Seeley-LASSO-RF, LASSO-RF, and RF all achieved an AUC of around 0.8 (0.7978, 0.8092, and 0.7750), Seeley-RF attained 0.7291, and Seeley attained 0.5. In addition, Seeley achieved the best specificity, in accordance with the training set, but was not useful because it failed to identify any positive case. Moreover, among the four ML models, the models with Seeley did not perform better than their counterparts without Seeley: the AUC of Seeley-LASSO-RF (0.7978) is lower than that of LASSO-RF (0.8092), and the AUC of Seeley-RF (0.7291) is lower than that of RF (0.7750).
Fig. 1 reports the receiver-operating characteristic curves for all five models; the curves overlap one another. Fig. 2 depicts the Kaplan-Meier curves for the patients in the testing set, stratified by the output of the ML models. Seeley was excluded, as it cannot identify any positive cases. For each ML model, the assessment effectively separates patients by risk of PICC-related thrombosis. These results were validated by the log-rank test and found to be significant (p < 0.001).
To validate the robustness of the performance, we conducted 100 independent repeated experiments for each model. All experiments
Table 2
Training set results of the Seeley and ML models.
Model             Sensitivity   Specificity   PPV      NPV      AUC
Seeley-LASSO-RF   0.8276        0.8836        0.5854   0.9627   0.8761
Seeley-RF         1             1             1        1        1
Seeley            0             1             NA       0.8342   0.5
RF                1             1             1        1        1
LASSO-RF          0.8276        0.9726        0.8571   0.9660   0.9355
NA: not available. The PPV of Seeley cannot be calculated, as Seeley identifies all cases as negative. All five performance measures for each ML model in the training set are shown in the table.
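The "NA" entry for the PPV of Seeley follows directly from the definitions of the four threshold-based measures. The following is a minimal sketch, not the authors' R code, assuming binary labels coded as 1 (thrombosis) and 0 (no thrombosis); the function name and the toy labels are hypothetical.

```python
from typing import Dict, Optional, Sequence


def classification_metrics(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Dict[str, Optional[float]]:
    """Sensitivity, specificity, PPV and NPV for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(numerator: int, denominator: int) -> Optional[float]:
        # None plays the role of "NA" when the denominator is zero.
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Hypothetical toy labels: 2 positives among 10 patients.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
all_negative = [0] * len(y_true)  # a rule that never flags anyone
print(classification_metrics(y_true, all_negative))
```

For a rule that labels every case as negative, the sensitivity is 0, the specificity is 1, the NPV equals the proportion of true negatives, and the PPV is undefined because no case is predicted positive, which is the pattern reported for Seeley.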
only exception being that Seeley ranks first in Seeley-LASSO-RF. For Seeley-RF and RF, the ranks are exactly the same, although the importance scores of the individual predictors differ.
Table 3
Testing set results of Seeley and the ML models.
Model             Sensitivity   Specificity   PPV      NPV      AUC
Seeley-LASSO-RF   0.7500        0.8345        0.4667   0.9453   0.7978
Seeley-RF         0.4643        0.9379        0.5909   0.9006   0.7291
Seeley            0             1             NA       0.8382   0.5000
RF                0.5000        0.8828        0.4516   0.9014   0.7750
LASSO-RF          0.7143        0.9034        0.5882   0.9424   0.8092
4. Discussion
According to our validation of the performance of PICC-related thrombosis patient identification (Fig. 3 and Table 4), all ML models can effectively identify patients with PICC-related thrombosis, and the one-sided t-tests in Table 4 yield very low p values for every ML model compared with Seeley (the currently used criterion). This indicates that ML methods are effective approaches to assess PICC-related thrombosis risk, and all ML models performed significantly better than Seeley in our research. In addition, LASSO-RF is the best model: it has the highest AUC value and achieves the best prediction accuracy. As found in previous studies [20–22], ML models are powerful techniques for VTE prediction, and we specifically confirmed that they are also effective in forecasting PICC-related thrombosis, despite relatively poor sensitivity. These sensitivities of around 0.6 may be misleading given the low incidence rate of PICC-related thrombosis (16.38%). Despite this, all ML methods are still superior to Seeley, the currently and commonly used criteria for PICC-related thrombosis risk assessment.
The Seeley predictor comprises the following items: recently bedridden, localized tenderness along deep veins, smoking, osteomyelitis, reason for PICC insertion, and anticoagulant at home or in the hospital [15]. When Seeley is used alone as a classifier, it achieves an AUC of only 0.5 because it identifies all cases as negative. The failure of Seeley in this study may lie in the following reasons:
(1) Biased sampling: Seeley et al. [15] only labelled symptomatic PICC-related thrombosis (thrombosis that occurs with symptoms such as swelling, tenderness, and high skin temperature) as PICC-related thrombosis; in this study, however, we take both symptomatic and asymptomatic PICC-related thrombosis (thrombosis that occurs without symptoms) into consideration, and all cases were diagnosed by radiologists who interpreted color Doppler ultrasonography images. Even though symptomatic PICC-related thrombosis is easy to observe, asymptomatic PICC-related thrombosis seems to account for most thrombosis cases. The rate of PICC-related thrombosis in Seeley et al. [15] (7%) is much lower than that in this study (16.38%).
(2) Inadequate predictors: Seeley et al. [15] only considered predictors relating to social history, family history, and other diseases and conditions. In this study, genotype is considered in addition to the predictors used by Seeley et al. [15].
(3) Insufficient ability of generalization: Seeley et al. [15] only used a linear model (logistic regression) to fit the data and did not use a testing set to measure the performance of the model. In this study, we used a non-linear model (random forest) to learn the pattern for identifying PICC-related thrombosis, and validated the performance by Monte Carlo cross-validation.
In addition, the models that included Seeley as a predictor were not necessarily significantly better than the others in the testing set (Tables 4, A1, A2, A3, and A4). Hence, although Seeley could not identify any positive cases on its own, the importance of Seeley combined with ML approaches in PICC-related thrombosis risk assessment still needs further research.
Our research also offers some indications of the predictors and risk factors of PICC-related thrombosis. For example, according to Table 5, drinking and malnutrition are strongly associated with PICC-related thrombosis in all ML models. This finding is consistent with the assertions that drinking and malnutrition can increase the risk of VTE, as proposed by relevant studies [33–36]. Moreover, the Seeley-LASSO-RF and LASSO-RF models identify a family history of deep vein thrombosis [37], diabetes [38], and tip location [39–41] as predictors or potential risk factors, and these findings support the corresponding research studies [37–40]. Moreover, the Seeley-RF and RF models showed strong associations with genotype, sex, age, BMI, and smoking. Except for genotype,
NA: not available. The PPV of Seeley cannot be calculated, as Seeley identifies all cases as negative. All five performance measures for each ML model in the testing set are shown in the table.
Fig. 1. Receiver-operating characteristic curves (ROCs) of all five models in one experiment. The ROC curve can be thought of as a plot of power as a function of type I error of the decision rule. The AUC denotes the area under the ROC curve. The higher the AUC, the better the performance.
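The AUC described in the caption can be computed without drawing the curve, since it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties counted as one half). The sketch below is an illustration of this rank-based definition under that assumption; it is not the authors' R code, and the function name and scores are hypothetical.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties contribute one half
    return wins / (len(positives) * len(negatives))


# Hypothetical risk scores for four patients.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))   # 3 of 4 pairs ordered correctly
# A rule that gives every patient the same score cannot rank anyone.
print(roc_auc([0, 0, 1, 1], [0.0, 0.0, 0.0, 0.0]))
```

The second call illustrates why a criterion that labels every case as negative has an AUC of exactly 0.5: all positive-negative pairs are tied.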
were conducted in the same manner as in the previous experiment, and their performances in the training and testing sets are shown in Figs. A1 and 3, respectively. In Fig. 3, all ML models achieved a better AUC than Seeley (0.7733, 0.7869, 0.7833, and 0.7717, respectively). The same holds for NPV (ML models: 0.9105, 0.9154, 0.9147, and 0.9086; Seeley: 0.8382). The PPVs of the different ML models are 0.4860, 0.4957, 0.4845, and 0.4826, respectively. In addition, the PPV of Seeley is not available, and its specificity is constantly 1. The specificities of the other models (0.8793, 0.8821, 0.8770, and 0.8819) are close to that of Seeley. The sensitivities are 0.5489, 0.5754, 0.5743, and 0.5375, respectively. The performance over the 100 independent repeated experiments follows almost the same pattern as in the previous single experiment, and the standard deviations of the AUC and NPV are much smaller than those of the PPV, specificity, and sensitivity. In addition, paired one-sided t-tests on the AUC, PPV, NPV, sensitivity, and specificity were conducted, and the results are shown in Tables 4, A2, A3, A4, and A5, respectively. Table 4 shows that all ML models are significantly superior to Seeley in terms of AUC. According to Tables A2 and A4, similar results are found for PPV and sensitivity. ML models and Seeley are not comparable for PPV, and Seeley is significantly superior to the ML models in specificity, as it identifies all cases as negative. Table 5 depicts the importance of the predictors with the highest scores, measured using RF (in terms of the mean decrease in accuracy) in the whole dataset. For Seeley-LASSO-RF and LASSO-RF, the ranks of the most important predictors are nearly the same, with the
Fig. 2. Kaplan-Meier curves of PICC-related thrombosis-free survival in the testing set. A comparison between patients with low (solid line) and patients with high (dotted line) risk of PICC-related thrombosis based on all models are shown.
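The thrombosis-free survival curves and the log-rank comparison between the low-risk and high-risk groups can be illustrated as follows. This is a minimal sketch, not the authors' R code: the function names and follow-up times are hypothetical, events are coded as 1 for observed thrombosis and 0 for censoring, and the p value uses the chi-square distribution with one degree of freedom.

```python
import math
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float], events: Sequence[int]) -> List[Tuple[float, float]]:
    """Product-limit estimate: (time, survival probability) at each event time."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = n_leaving = 0
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]   # events observed at time t
            n_leaving += 1           # events plus censored cases at time t
            i += 1
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
    return curve


def log_rank_test(times_a, events_a, times_b, events_b) -> Tuple[float, float]:
    """Two-group log-rank test: (chi-square statistic, p value with 1 d.f.)."""
    pooled = list(zip(times_a, events_a)) + list(zip(times_b, events_b))
    event_times = sorted({t for t, e in pooled if e == 1})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)   # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)   # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e == 1)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("log-rank statistic is undefined for these data")
    chi_square = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


# Hypothetical follow-up (days): high-risk patients all develop thrombosis,
# low-risk patients are all censored at the end of the 4-week follow-up.
high_t, high_e = [3, 5, 7, 9, 11, 13], [1, 1, 1, 1, 1, 1]
low_t, low_e = [28, 28, 28, 28, 28, 28], [0, 0, 0, 0, 0, 0]
print(kaplan_meier(high_t, high_e))
print(log_rank_test(high_t, high_e, low_t, low_e))
```

A small p value indicates that the two survival curves differ more than would be expected by chance, which is how the separation between the solid and dotted lines is assessed.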
the other predictors are commonly used in VTE forecasting [41–43]. Genotype is a hotspot in medical research and is at the forefront of biotechnology development. Recent studies [44–46] have recognized that certain genotypes are related to thrombosis in cancer patients. The 4G5G polymorphism is associated with the fibrinolytic system and has been widely studied in the medical field, for example in diabetes [47], coronary heart disease [48], and oncology [49]. This study supports the findings of previous studies that reported that genotype 4G5G is a good predictor of increased VTE risk, including PICC-related thrombosis.
Our research has some limitations. Our sample size was modest, and all the participants were enrolled from a single medical institution. These limitations mainly arose because of the low PICC implementation rate and the high cost of data collection. Although West China Hospital is large, the low PICC implementation rate and the high cost of genotype detection (almost $110 per patient) hindered the inclusion of a larger number of cases. In addition, time series analysis, detailed medical treatment variables, and other related variables should be considered in future work.

5. Conclusion
In conclusion, our research confirms that ML approaches are powerful tools for identifying cancer patients with a high risk of PICC-related thrombosis, and that they outperform the currently used criteria. Furthermore, our research offers some indications of the predictors and risk factors of PICC-related thrombosis. Based on our research, a more precise assessment can be performed in cancer patients with PICCs to guide early prophylactic therapy and thus effectively lower the incidence of PICC-related thrombosis. Further work will be conducted on specific predictor selection based on data with a larger number of cases.

Author contributions
Conceived and designed the experiments: Junying Li, Shanshan Liu, and Fengyi Zhang
Performed the experiments: Lingling Xie, Ying Wang, Yue Feng, and Yanmeng Yang
Contributed reagents/materials/analysis tools: Qiufen Xiang and Zhiying Yue
Analyzed the data: Fengyi Zhang and Shanshan Liu
Drafted or revised the manuscript: Fengyi Zhang and Shanshan Liu
Approved the final version: Junying Li, Li Luo, and Chunhua Yu

Funding
The work was supported in part by the National Natural Science Foundation of China (Nos. 71532007, 71131006, and 71172197) and the Program Foundation for Science and Technology Department of Sichuan Province (2015SZ0155 and 2018SZ0203).
Fig. 3. Performance of 100 independent repeated experiments in the testing set. Means (standard deviations) are shown in each bar, and error bars refer to the corresponding maximums and minimums. NA: not available. NAN: not a number.

Table 4
P values of the paired t-test for the AUC of 100 independent repeated experiments in the testing set.
Row vs. column    Seeley        Seeley-RF     Seeley-LASSO-RF   LASSO-RF   RF
Seeley            NA            1—            1—                1—         1—
Seeley-RF         <0.0001***    NA            1—                1—         0.8762—
Seeley-LASSO-RF   <0.0001***    0.0001***     NA                0.997—     0.001***
LASSO-RF          <0.0001***    <0.0001***    0.0030**          NA         <0.0001***
RF                <0.0001***    0.1238—       0.9990—           1—         NA
Paired one-sided t-test was used to validate the difference. When the p value is low enough, we reject H0 and accept that the scheme in the row is better than the scheme in the column. Significance Codes:0 *** 0.001 ** 0.01 * 0.05 — 1. NA: Not available.
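The paired one-sided comparison behind each cell can be sketched as follows. This is an illustration under stated assumptions rather than the authors' R code: the AUC values are hypothetical, the function name is ours, and the p value uses a normal approximation to the Student t distribution, which is reasonable for about 100 paired experiments but only approximate for the five toy values shown here.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_one_sided_test(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Test H0 'A is not greater than B' on paired samples.

    Returns the paired t statistic and an approximate one-sided p value
    (normal approximation; an exact test would use the t distribution
    with n - 1 degrees of freedom).
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t_stat = statistics.fmean(diffs) / (sd / math.sqrt(n))
    p_value = 1.0 - statistics.NormalDist().cdf(t_stat)
    return t_stat, p_value


# Hypothetical AUCs of two models over the same five random splits.
auc_row = [0.80, 0.78, 0.82, 0.79, 0.81]
auc_col = [0.70, 0.69, 0.71, 0.72, 0.70]
t_rc, p_rc = paired_one_sided_test(auc_row, auc_col)   # is row better than column?
t_cr, p_cr = paired_one_sided_test(auc_col, auc_row)   # is column better than row?
print(p_rc, p_cr)
```

Because the two one-sided tests use opposite alternatives on the same differences, their p values sum to 1, which matches the mirrored entries of Table 4 (for example, 0.1238 and 0.8762 for RF and Seeley-RF).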
Table 5
Importance of predictors in each ML model.
Rank   Seeley-LASSO-RF                                  Seeley-RF                        RF                                LASSO-RF
1st    Seeley (15.79)                                   Genotype (10.61)                 Genotype (14.08)                  Drinking (24.6)
2nd    Drinking (13.96)                                 Sex (10.06)                      Sex (7.13)                        NRS 2002 Score (10.87)
3rd    NRS 2002 Score (8.45)                            Age (6.92)                       Age (7.07)                        Family History of Deep Vein Thrombosis (7.22)
4th    Family History of Deep Vein Thrombosis (6.36)    BMI (6.73)                       BMI (5.75)                        Diabetes (5.83)
5th    Diabetes (4.91)                                  Smoking (5.12)                   Smoking (5.7)                     Tip Location (4.14)
6th    Malposition (3.35)                               Drinking (5.07)                  Drinking (5.58)                   Chemotherapy Cycle (2.93)
7th    Tip Location (3.3)                               NRS 2002 Score (4.91)            NRS 2002 Score (3.99)             Radiotherapy (2.49)
8th    Chemotherapy (2.06)                              Self-care Ability Score (4.1)    Self-care Ability Score (3.87)    Anticoagulation (2.25)
Predictors with their corresponding importance scores in each model measured using RF are ranked, and only the top 8 predictors in each model are shown for conciseness.
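The idea behind a "mean decrease in accuracy" score can be illustrated with a simplified, model-agnostic permutation sketch: a predictor is important if shuffling its values lowers the accuracy of the fitted classifier. This is an assumption-laden illustration and not the authors' procedure; RF implementations typically compute the score on out-of-bag samples per tree, whereas the hypothetical helper below permutes columns of a fixed dataset for an arbitrary `predict` function.

```python
import random
from typing import Callable, List, Sequence


def accuracy(predict: Callable[[List[float]], int],
             rows: Sequence[List[float]], labels: Sequence[int]) -> float:
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(predict: Callable[[List[float]], int],
                           rows: Sequence[List[float]], labels: Sequence[int],
                           n_repeats: int = 20, seed: int = 0) -> List[float]:
    """Mean decrease in accuracy after shuffling each predictor column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between predictor j and the outcome
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical toy data: column 0 determines the label, column 1 is noise.
rows = [[0.1, 5.0], [0.2, 3.0], [0.3, 9.0], [0.4, 1.0], [0.45, 7.0],
        [0.6, 2.0], [0.7, 8.0], [0.8, 4.0], [0.9, 6.0], [0.95, 0.0]]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]


def toy_classifier(row: List[float]) -> int:
    # Stand-in for a fitted model; it only looks at column 0.
    return 1 if row[0] > 0.5 else 0


print(permutation_importance(toy_classifier, rows, labels))
```

In this toy setting the noise column receives an importance of zero, because the classifier never uses it, while the informative column receives a positive score.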
Conceived and designed the experiments: Junying Li, Li Luo, Chunhua Yu
Performed the experiments: Shanshan Liu, Ying Wang, Yue Feng, Yanmeng Yang
Analyzed the data: Fengyi Zhang, Lingling Xie
Contributed reagents/materials/analysis tools: Qiufen Xiang, Zhiying Yue
No conflict of interest exists in the submission of this manuscript, and the manuscript is approved by all authors for publication. I would like to declare on behalf of my co-authors that the work described was original research that has not been published previously and is not under consideration for publication elsewhere.

Conflict of interests
The authors report no conflicts of interest.

Summary points
What was already known on the topic
• Thrombosis is not caused by a single factor, but by compound factors.
• Risk assessment of PICC-related thrombosis would be helpful to identify the patients at high risk of PICC-related thrombosis and provide suggestions to the treating physicians to avoid thrombosis.
• ML approaches are powerful tools in making risk assessment of venous thromboembolism.
What this study added to our knowledge:
• ML approaches are powerful tools to identify cancer patients with high risk of PICC-related thrombosis.
• Genotype 4G5G is a good predictor of PICC-related thrombosis to identify patients with high risk. We can take specific preventive action for high-risk patients, thereby reducing the incidence of PICC-related thrombosis.
• Diabetes, family history of deep vein thrombosis, tip location, drinking, malnutrition, gender, age, BMI, and smoking are verified to be associated with PICC-related thrombosis. Among them, drinking and malnutrition are the most consistent across models.

Acknowledgments
The authors thank the editors and anonymous reviewers for their insightful and constructive comments and suggestions that have led to this improved version of the paper.
Appendix A
Fig. A1. Performance of 100 independent repeated experiments in the training set. The mean (standard deviation) is shown in each bar, and the error bars refer to the corresponding maximum and minimum. NA: not available. NAN: not a number.
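The summary statistics named in the caption can be computed as in the following minimal sketch. The function name `summarize_runs`, the sample scores, and the rule that any undefined score turns the whole bar into "NAN" are illustrative assumptions, not the authors' procedure.

```python
import math
from statistics import mean, stdev


def summarize_runs(scores):
    """Summarize one metric of one model over repeated experiments.

    Returns the bar label "mean (standard deviation)" together with the
    minimum and maximum used as error-bar limits.
    Assumption: if any run produced NaN (for example a PPV of 0/0),
    the whole summary is reported as "NAN".
    """
    if not scores:
        return {"label": "NA", "low": None, "high": None}
    if any(math.isnan(s) for s in scores):
        return {"label": "NAN", "low": None, "high": None}
    # Sample standard deviation needs at least two runs.
    sd = stdev(scores) if len(scores) > 1 else 0.0
    return {
        "label": f"{mean(scores):.4f} ({sd:.4f})",
        "low": min(scores),
        "high": max(scores),
    }


if __name__ == "__main__":
    # Hypothetical AUC values from three repeated experiments.
    print(summarize_runs([0.7, 0.8, 0.9]))
```

In the setting of Fig. A1, `scores` would hold the 100 values of one metric for one model.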
Table A1 The variable types of attributes considered.
Variable types | Attributes
Categorical variables | Sex; Smoking; Drinking; Nutrition; Tumor site and stage; Metastatic cancer; Coronary artery disease; Heart failure; COPD; Hypertension; Diabetes; Septicemia/Bacteremia; Trauma; Surgery; VTE; Family history of VTE; Central venous catheter (CVC/PICC/Port); D-Dimer; APTT; PT; FIB; WBC; PLT; Genotype 4G/5G; Vascular interventional therapy; Hormonotherapy; Chemotherapy and Chemotherapeutic drugs; Radiotherapy; Anticoagulation; Technique; Size and lumen of the catheter; Site and vein of insertion; Tip location and malposition
Numerical variables | Age; BMI; Self-care ability; Performance status
Table A2 P value of the paired one-sided t-test for PPVs of 100 independent repeated experiments in the testing set.
Row / Column | Seeley | Seeley-RF | Seeley-LASSO-RF | LASSO-RF | RF
Seeley | NA | NA | NA | NA | NA
Seeley-RF | NA | NA | 0.5876— | 0.9207— | 0.6922—
Seeley-LASSO-RF | NA | 0.4124— | NA | 0.9755— | 0.5613—
LASSO-RF | NA | 0.0793— | 0.0245* | NA | 0.1763—
RF | NA | 0.3078— | 0.4387— | 0.8237— | NA
Paired one-sided t-test was used to validate the difference. When the p value is low enough, we reject H0 and accept that the scheme in the row is better than the scheme in the column. Significance codes: 0 *** 0.001 ** 0.01 * 0.05 — 1. NA: Not available.
Table A3 P value of the paired one-sided t-test for NPVs of 100 independent repeated experiments in the testing set.
Row / Column | Seeley | Seeley-RF | Seeley-LASSO-RF | LASSO-RF | RF
Seeley | NA | 1— | 1— | 1— | 1—
Seeley-RF | < 0.0001*** | NA | 1— | 1— | 0.9887—
Seeley-LASSO-RF | < 0.0001*** | < 0.0001*** | NA | 0.7895— | 0.0004***
LASSO-RF | < 0.0001*** | < 0.0001*** | 0.2105— | NA | < 0.0001***
RF | < 0.0001*** | 0.0113* | 0.9996— | 1— | NA
Paired one-sided t-test was used to validate the difference. When the p value is low enough, we reject H0 and accept that the scheme in the row is better than the scheme in the column. Significance codes: 0 *** 0.001 ** 0.01 * 0.05 — 1. NA: Not available.
Table A4 P value of the paired one-sided t-test for sensitivity of 100 independent repeated experiments in the testing set.
Row / Column | Seeley | Seeley-RF | Seeley-LASSO-RF | LASSO-RF | RF
Seeley | NA | 1— | 1— | 1— | 1—
Seeley-RF | < 0.0001*** | NA | 1— | 1— | 0.9813—
Seeley-LASSO-RF | < 0.0001*** | < 0.0001*** | NA | 0.5780— | < 0.0005***
LASSO-RF | < 0.0001*** | < 0.0001*** | 0.4220— | NA | < 0.0002***
RF | < 0.0001*** | 0.0187* | 0.9995— | 0.9998— | NA
Paired one-sided t-test was used to validate the difference. When the p value is low enough, we reject H0 and accept that the scheme in the row is better than the scheme in the column. Significance codes: 0 *** 0.001 ** 0.01 * 0.05 — 1.
Table A5 P value of the paired one-sided t-test for specificity of 100 independent repeated experiments in the testing set.
Row / Column | Seeley | Seeley-RF | Seeley-LASSO-RF | LASSO-RF | RF
Seeley | NA | < 0.0001*** | < 0.0001*** | < 0.0001*** | < 0.0001***
Seeley-RF | 1— | NA | 0.1723— | 0.5152— | 0.2645—
Seeley-LASSO-RF | 1— | 0.8277— | NA | 0.9196— | 0.6518—
LASSO-RF | 1— | 0.4858— | 0.0804— | NA | 0.3241—
RF | 1— | 0.7355— | 0.3481— | 0.6759— | NA
Paired one-sided t-test was used to validate the difference. When the p value is low enough, we reject H0 and accept that the scheme in the row is better than the scheme in the column. Significance codes: 0 *** 0.001 ** 0.01 * 0.05 — 1.
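To illustrate the test described in the table notes, the following is a minimal sketch of a paired one-sided t-test using only the Python standard library. The function names `t_upper_tail` and `paired_one_sided_p`, the numerical integration scheme, and the sample scores are assumptions made for illustration; they are not the authors' code or data.

```python
import math
from statistics import mean, stdev


def t_upper_tail(t, df, steps=20000):
    """P(T > t) for Student's t with df degrees of freedom.

    Computed by Simpson integration of the density from 0 to |t|
    (steps must be even).
    """
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)

    def pdf(x):
        return c * (1 + x * x / df) ** (-(df + 1) / 2)

    b = abs(t)
    h = b / steps
    total = pdf(0.0) + pdf(b)
    for i in range(1, steps):
        total += pdf(i * h) * (4 if i % 2 else 2)
    upper = 0.5 - total * h / 3  # area to the right of |t|
    return upper if t >= 0 else 1 - upper


def paired_one_sided_p(row_scores, col_scores):
    """p value for H1: the row scheme scores higher than the column scheme.

    H0 is that the mean paired difference (row - column) is <= 0.
    """
    diffs = [r - c for r, c in zip(row_scores, col_scores)]
    n = len(diffs)
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("paired differences have zero variance")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t_upper_tail(t, n - 1)


if __name__ == "__main__":
    # Hypothetical metric values from five paired experiments.
    row = [0.80, 0.82, 0.79, 0.85, 0.81]
    col = [0.75, 0.78, 0.77, 0.80, 0.76]
    print(paired_one_sided_p(row, col), paired_one_sided_p(col, row))
```

Swapping the two arguments yields the complementary p value, which is why mirrored cells in these tables sum to approximately 1 (for example, 0.4124 and 0.5876 in Table A2).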