Predictive and Prescriptive Analytics for Optimal Decisioning

Tutorial D

Predictive and Prescriptive Analytics for Optimal Decisioning: Hospital Readmission Risk Mitigation Thomas Hill, PhD, Vladimir Rastunkov, PhD and John W. Cromwell, MD

Chapter Outline
Overview
Statistical Data Analysis vs General Predictive (Pattern Recognition) Models
Example Data
The Goal
Step 1: Data Acquisition
Step 2: Feature Selection and Predictor Coding
Target Variable
Feature Selection
Predictor Coding, for More Interpretable Results
Weight-of-Evidence (WoE) Coding
Step 3: Predictive Modeling and Interpreting Results
Assessing the Quality of the Model: Lift Charts
Understanding the Implications of Results: What-If and Reason Scores
Step 4: Decision Management and Prescriptions
Conclusions
References

EDITOR’S NOTE: This chapter is best described as a “combined case study and tutorial,” or as an “advanced tutorial.” This study assumes that the reader is familiar with the use of STATISTICA software and thus can perform some of the steps that are not explicitly provided. Therefore, for the new user of STATISTICA, it might be better to work through some of the other tutorials in this book before attempting to do all the steps needed to recreate this case study. (For those who want to do this, we suggest working through Tutorial M, on Schistosomiasis, and/or Tutorials F-1 and F-2, and G-3, on WoE and Automatic Binning, before doing this tutorial.) Alternatively, readers can read through this study and then decide which parts they can perform without studying other tutorials. The data set is provided on the companion web page of this book.

OVERVIEW

Traditional statistics and hypothesis testing were successfully used in medicine for decades, and some areas of statistics were developed specifically to address use cases and solve analytic problems in medicine and healthcare delivery (see, for example, Cox, 1972; Lee, 1980). However, advanced predictive analytics or data mining methods such as neural networks and ensemble models (see Nisbet et al., 2009) were usually considered “black boxes,” not useful for providing interpretable results and actionable information and guidance. These analytic approaches have thus not been routinely applied to analyze medical or patient-care data. In this tutorial we will describe the advantages of general (non-statistical) predictive modeling methods and algorithms, discuss ways to derive useful interpretations and actionable guidance from results, and show how general prediction models can be combined and improved with decision rules to deliver an effective “decisioning” system, i.e., an automated system that will provide relevant risk assessment and predictions.

Practical Predictive Analytics and Decisioning Systems for Medicine. DOI: http://dx.doi.org/10.1016/B978-0-12-411643-6.00019-3 © 2015 Thomas Hill, Vladimir Rastunkov and John W. Cromwell. Published by Elsevier Inc. All rights reserved.

342 PART | 2 Practical Step-by-Step Tutorials and Case Studies

Statistical Data Analysis vs General Predictive (Pattern Recognition) Models

There are a number of important differences between traditional statistical analysis and modeling (e.g., multiple regression and logistic regression) and the general predictive modeling techniques that have been widely adopted across many domains over the past decade or so (see, for example, Breiman, 2001; also Miner et al., 2012). Statistical analysis and modeling is typically based on testing of hypotheses and the estimation of population parameters from samples, based on statistical/mathematical reasoning alone. For example, in multiple regression the parameters of a linear model are estimated that predict some outcome or response variable y as a linear function of the available predictor or x variables. Thus these traditional methods reflect the “mean” or average of a population, and have become the basis of Evidence-Based Medicine (EBM). They are often referred to by physicians as “best practices,” but, unfortunately, do not provide predictions for an individual, although an individual who is one or two standard deviations from the average may respond quite differently to medical treatments.

In contrast to statistical modeling, the algorithms for general predictive modeling use a much more pragmatic approach. Namely, data mining and general predictive modeling typically apply general learning or pattern recognition algorithms to extract from sample data repeatable patterns that allow for the most accurate predictions, regardless of the nature and complexity of the relationships between predictors and outcomes. These modern statistical learning theory methods thus make predictions for the “individual,” not the mean of a population, and thus offer the best hope of making accurate predictions/prescriptions for the individual patient.

Example Data

Due to the Health Insurance Portability and Accountability Act of 1996 (HIPAA), the use of real healthcare data, even blinded to a high degree, is unlikely to be possible. For the purpose of this tutorial we’ve selected a different approach, by generating simulated data that have a format similar to that seen in the field. All conclusions based on the data are therefore for illustration purposes only, and do not contain any scientific information. The preview of the data is shown in Figure D.1. The data description and statement of the problem are courtesy of Dr. John W. Cromwell. The data set contains the following fields:

ID: unique identifier of the patient
Unplanned Readmission: binary variable that shows that the specific patient was readmitted to the hospital (Unplanned Readmission = 1)
Age: continuous variable that contains the age of the patient
Surgeon ID: unique ID of the surgeon
EBL: estimated blood loss
Apgar_Score: interval of the Apgar score of the patient
HMB_within_60_days: beta-hydroxy beta-methylbutyric acid measurement
ASA: physical status classification system
Gender: gender of the patient
WND_Class: wound contamination class
TrainTest: train and test samples identifier

FIGURE D.1 Preview of the data simulated for this tutorial.
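For readers without access to the companion data set, the following is a minimal sketch of how a file with the same field names could be simulated. It is not the procedure the authors used: the value ranges, category labels, and the risk formula are all invented for illustration, and only the column names come from the field list above.

```python
import random

def simulate_patients(n, seed=42):
    """Generate n simulated patient records with the tutorial's field names.

    All ranges and the risk formula below are hypothetical placeholders.
    """
    rng = random.Random(seed)
    rows = []
    for i in range(1, n + 1):
        surgeon = rng.randint(1, 4)
        ebl = round(rng.uniform(100, 4000), 1)
        # Hypothetical readmission risk: higher for one surgeon and high EBL.
        risk = 0.15 + 0.10 * (surgeon == 1) + 0.20 * (ebl > 2500)
        rows.append({
            "ID": i,
            "Unplanned Readmission": int(rng.random() < risk),
            "Age": rng.randint(18, 90),
            "Surgeon ID": surgeon,
            "EBL": ebl,
            "Apgar_Score": rng.choice(["0-2", "3-4", "5-6", "7-8", "9-10"]),
            "HMB_within_60_days": round(rng.uniform(0.5, 5.0), 2),
            "ASA": rng.randint(1, 5),
            "Gender": rng.choice(["Male", "Female"]),
            "WND_Class": rng.choice(["01", "02", "03", "04"]),
            "TrainTest": "Train" if rng.random() < 0.7 else "Test",
        })
    return rows

patients = simulate_patients(1000)
```

Fixing the seed makes the simulated file reproducible, which is convenient when the same records are reused in later steps.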

Predictive and Prescriptive Analytics for Optimal Decisioning: Hospital Readmission Risk Mitigation Tutorial | D

THE GOAL

The goal of this tutorial is to create an analytic flow that investigates predictors of readmission, builds an analytic model, combines the model with expert-defined logic, and provides sensitivity analysis for the selected prediction. We’ll use STATISTICA Workspace as the primary environment for implementation of this logic. The subsequent sections of this tutorial are devoted to specific steps of this process.

Step 1: Data Acquisition

STATISTICA Workspace is the environment for building analytic workflows and templates. It is often viewed as a “graphical” programming environment: users construct workflows from a set of predefined nodes and use graphical user interfaces to modify node parameters as needed. Advanced users who have elementary programming skills in Visual Basic can additionally create their own nodes. The links between the nodes can be viewed as data flows from one node to another, whereas transformations and calculations happen within the nodes.

The process starts from a data acquisition node. We create a new workspace from File > New > Workspace. Workspaces allow four primary channels for data input: (1) a data configuration template predefined in STATISTICA Enterprise, (2) an in-place database connection node, (3) an embedded STATISTICA spreadsheet, and (4) an import node (e.g., from MS Excel) or an input-generator SVB node. For the sake of simplicity, in this tutorial we’ll be using the embedded STATISTICA spreadsheet node. By clicking the Data Source button we select the spreadsheet to be used in the workflow (Figure D.2). The workflow is shown in Figure D.3.

FIGURE D.2 Select Data Source dialog.

FIGURE D.3 Screenshot of the workspace.

Even though it is possible to change the size of the workspace area at any moment, it is recommended to reserve enough space to allocate analysis objects beforehand.

Step 2: Feature Selection and Predictor Coding

A first step in any analysis that can potentially involve hundreds of predictor candidates is to identify relevant predictors that are likely to provide diagnostic value for the prediction of the respective outcome of interest, such as the probability for hospital readmission after discharge. Specifically, an initial review of available predictor variables should focus on how different parameters impact the risk of readmission if considered by themselves.

Target Variable

In this example the variable Unplanned Readmission represents the target variable, describing whether or not the respective patient was readmitted to the hospital within 30 days following discharge (0, no readmission; 1, readmitted at least once). Our simulated data file contains only eight predictors. The real data may contain hundreds of variables that can be used as predictors, including Gender, Age, and others.

Feature Selection

One approach to screen a large number of predictor candidates is to apply general feature selection algorithms. For example, the STATISTICA Feature Selection and Variable Screening module (StatSoft, Inc., 2013) applies so-called recursive partitioning or tree algorithms (see Nisbet et al., 2009) to all predictor variables to derive predictor importance, and to select a subset of k best and likely most useful predictors. The search is performed predictor-by-predictor; i.e., this approach performs a fast first-order (no-interactions considered) search of potential predictors of the outcome variable. Importance can be calculated using different statistics that reflect the degree to which different possible partitions of the values observed for a predictor under consideration will provide maximum differentiation between outcomes (e.g., maximum differential readmission rates). Note that, even though it is a first-order search, predictors that contribute to significant interaction effects will likely still be detected, and identified in subsequent modeling.

The Feature Selection node can be found in Data Mining > Tools > Feature Selection (Figure D.4). When the node is added to the workspace, we need to connect it to the data source and click on the Edit Parameters control. Feature Selection requires definition of the target and input variables (Figure D.5). Next, we execute the workspace. As shown on the screenshot in Figure D.6, the Feature Selection node has generated a downstream Reporting Document node that will accumulate all results. The Importance Plot is the main result in this analysis, and is saved in Reporting Documents (Figure D.7).

Surgeon ID was identified with top importance. We can look into the dependence of the readmission rate on the surgeon ID by adding a 2D Box Plots node to the workspace (Graphs > Common > Box) (Figure D.8). Note that we’ve selected unplanned readmission as the dependent variable and surgeon ID as the independent variable (according to the coding, an unplanned readmission is equal to 1, so the average value of this variable represents the readmission rate). The two graphs are combined in Figure D.9. The insert in the upper-right corner demonstrates variation in the readmission rates for different surgeons.

It should be considered a coincidence that Surgeon ID has the highest importance score in our simulated data. In practice, however, a similar result was observed because difficult patients were usually assigned to a specific surgeon; the result was therefore not an indication of his or her skills, but rather an initial (upon assignment) expert classification of the group of patients.
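The first-order screening idea can be sketched in a few lines. The example below is a simplified stand-in, not the STATISTICA algorithm: instead of searching tree partitions, it assumes every predictor is already categorical (or pre-binned) and ranks predictors by the Pearson chi-square statistic of their cross-tabulation with the 0/1 target. The function names `chi_square` and `rank_predictors` are hypothetical.

```python
def chi_square(groups, outcomes):
    """Pearson chi-square statistic for a grouping variable vs. a 0/1 outcome."""
    n = len(outcomes)
    total_pos = sum(outcomes)
    counts = {}
    for g, y in zip(groups, outcomes):
        counts.setdefault(g, [0, 0])[y] += 1  # [count of 0s, count of 1s]
    stat = 0.0
    for neg, pos in counts.values():
        size = neg + pos
        for observed, col_total in ((neg, n - total_pos), (pos, total_pos)):
            expected = size * col_total / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

def rank_predictors(rows, target, predictors):
    """Score each predictor by itself (no interactions) and sort by importance."""
    y = [r[target] for r in rows]
    scores = {p: chi_square([r[p] for r in rows], y) for p in predictors}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Tiny illustrative data set: surgeon separates outcomes, gender does not.
demo = [
    {"Surgeon ID": 1, "Gender": "Male", "Unplanned Readmission": 1},
    {"Surgeon ID": 1, "Gender": "Female", "Unplanned Readmission": 1},
    {"Surgeon ID": 1, "Gender": "Male", "Unplanned Readmission": 1},
    {"Surgeon ID": 1, "Gender": "Female", "Unplanned Readmission": 1},
    {"Surgeon ID": 2, "Gender": "Male", "Unplanned Readmission": 0},
    {"Surgeon ID": 2, "Gender": "Female", "Unplanned Readmission": 0},
    {"Surgeon ID": 2, "Gender": "Male", "Unplanned Readmission": 0},
    {"Surgeon ID": 2, "Gender": "Female", "Unplanned Readmission": 0},
]
ranking = rank_predictors(demo, "Unplanned Readmission", ["Gender", "Surgeon ID"])
```

Because each predictor is scored on its own, the cost grows linearly with the number of candidates, which is what makes this kind of screening practical for hundreds of variables.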

FIGURE D.4 Feature Selection node location on the ribbon bar.

FIGURE D.5 Variables for the Feature Selection.

FIGURE D.6 Screenshot of the workspace.

FIGURE D.7 Importance plot. Dependent variable: Unplanned readmission. Horizontal axis: Importance (Chi-square). Predictors shown: Surgeon ID, Apgar_Score, EBL, HMB_within_60_days, ASA, Age, Gender.

FIGURE D.8 Screenshot of the workspace.

FIGURE D.9 Variable importance plot for readmission rate (dependent variable: Unplanned readmission; axis: Importance (Chi-square)). Small insert in the upper right-hand corner shows average readmission rates by surgeon ID (Surgeon IDs 1 to 4).

Predictor Coding, for More Interpretable Results

After selecting predictors that will likely provide the greatest power for predicting the outcome of interest (hospital readmission probability, in this case), they are often recoded into meaningful intervals (for continuous predictors) or combined categories (for discrete predictors). Thus, when the recommendations from final results are provided to decision-makers, they can be communicated in ways that are more easily understood. For example, in this analysis a predictor variable EBL (estimated blood loss from the surgical procedure in milliliters) may be an important predictor of readmission risk. EBL is a continuous variable, and in order to derive simpler recommendations and rules describing the relationship between EBL and readmission risk it is useful to divide the range of values in this variable into intervals associated with different risk.

Weight-of-Evidence (WoE) Coding

In the case of risk-probability prediction, so-called optimal weight-of-evidence recoding can be very useful. Specifically, the weight of evidence (WoE) statistic for each (possible) binning of values in a continuous variable is calculated as follows (see also Siddiqi, 2006):

$$\mathrm{WoE} = \ln\!\left(\frac{\mathrm{Distr}\{\text{Unplanned Readmission} = 0\}}{\mathrm{Distr}\{\text{Unplanned Readmission} = 1\}}\right) \cdot 100$$

The smaller the value of WoE for a specific binned group of observations, the higher is the percentage of readmissions in that group. The goal of WoE coding in risk modeling is to identify binning (interval boundaries) based on historical data, so that the observations in each interval are maximally uniform with respect to (readmission) risk, while the differences in (readmission) risk across the intervals (groups) are maximized. Further, in order to arrive at more easily interpreted results, certain constraints can be applied to the binning algorithm, for example, to impose a linear risk function for the predictor over the consecutive bin intervals. Linear, quadratic, or higher-order polynomial constraints can be applied to the calculations of optimal predictor intervals (bins), to yield the largest differences in risk across the coded intervals while observing the respective constraint.

By recoding continuous predictors into binned intervals, or re-binning the categories of discrete predictor variables, the final results predicted from the application of general predictive modeling algorithms are often more easily communicated, providing clearer guidelines for specific interval boundaries that are associated with greater risk. In a sense, instead of using the original predictor values and metrics, predictions can then be communicated in terms of risk profiles and groups rather than more abstract specific values or often non-linear functions.

The data set used in this tutorial is already prepared and does not require binning. However, for reference purposes we’ll point out that the Weight of Evidence node is located under Data Mining > Tools > Weight of Evidence (Figure D.10). Once added to the workspace, the node provides an interactive environment for working with the groups, applying various constraints (e.g., Monotone, One minimum or maximum, One minimum and one maximum for continuous variables), defining interactions, and customizing groups (Figure D.11). As a result, the module generates a set of transformations from the initial variables to weight-of-evidence codes. Those transformations are defined by the Rules node.
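To make the WoE statistic concrete, here is a minimal sketch that computes it for bins that have already been chosen. It does not search for optimal bin boundaries or apply monotonicity constraints as the Weight of Evidence node does; it assumes each bin contains at least one readmitted and one non-readmitted case, and the function name is hypothetical.

```python
import math

def weight_of_evidence(bins, outcomes):
    """WoE per bin: ln(share of non-readmitted / share of readmitted) * 100.

    bins: bin label for each observation; outcomes: 0/1 readmission flags.
    Assumes every bin holds at least one case of each outcome.
    """
    total_neg = outcomes.count(0)
    total_pos = outcomes.count(1)
    counts = {}
    for b, y in zip(bins, outcomes):
        counts.setdefault(b, [0, 0])[y] += 1  # [non-readmitted, readmitted]
    woe = {}
    for b, (neg, pos) in counts.items():
        if neg == 0 or pos == 0:
            raise ValueError(f"bin {b!r} lacks one of the two outcomes")
        woe[b] = math.log((neg / total_neg) / (pos / total_pos)) * 100
    return woe

# Hypothetical EBL bins: the high-EBL bin has more readmissions.
ebl_bins = ["low", "low", "low", "low", "high", "high", "high", "high"]
readmit = [0, 0, 0, 1, 0, 1, 1, 1]
woe = weight_of_evidence(ebl_bins, readmit)
```

In this toy example the "high" bin receives a negative WoE and the "low" bin a positive one, matching the statement that smaller WoE values correspond to a higher share of readmissions.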

Step 3: Predictive Modeling and Interpreting Results

In this example, a general predictive modeling algorithm called stochastic gradient boosting trees (Friedman, 1999) will be used for modeling readmission probabilities. This method usually demonstrates excellent performance and robustness for classification and risk prediction tasks.

In classification problems it is important to understand the correct selection of prior probabilities. If classes have significantly different numbers of observations, equal priors are recommended (Classification tab of the Boosted Trees Specifications dialog). In our case, we add a Frequency tables node to calculate the percentage of cases with Unplanned Readmission = 1 (Statistics > Basic Statistics > Frequency Tables) (Figures D.12 and D.13). Results indicate that unplanned readmission was seen in 30% of cases in this data set, so we’ll use equal prior probabilities in this example.

In order to build the Boosted Trees model in STATISTICA, the following steps can be used. On the Data Mining tab we select Boosted Trees and Boosted Classification Trees (Figures D.14 and D.15). We access the parameters of the node, select variables, and identify different parameters of the analysis as shown on the screenshot in Figure D.16. Next, on the Advanced tab, we set the Maximum n of nodes to 7. This parameter represents the maximum complexity of each individual tree.

FIGURE D.10 Weight of Evidence node location on the ribbon bar.

FIGURE D.11 Weight of Evidence node interface.

FIGURE D.12 Screenshot of the workspace.

FIGURE D.13 Frequency table.

FIGURE D.14 Boosted Classification Trees location on the ribbon bar.

FIGURE D.15 Screenshot of the workspace.

FIGURE D.16 Variable Selection dialog.

For model quality assessment we need to generate results such as a classification matrix and lift chart. Consequently, we make the settings on the Classification tab as in Figure D.17. Finally, on the Advanced tab we select the variable that will identify cases for the test sample. If the variable is not selected, the test sample will be randomly assigned (Figure D.18). When all settings are done (Figure D.19) we execute the workspace and move to the analysis of the results. As the screenshot in Figure D.20 demonstrates, the Boosted Classification Trees node generates output for Reporting Documents, PMML Model, and Rapid Deployment nodes. Note that Boosted Trees is an iterative algorithm that is used to find the best solution. However, initialization of parameters is done in every run using a randomization function. This means that results achieved in two consecutive runs or on different machines may not be identical.
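The general mechanics of stochastic gradient boosting, including the role of randomization, can be illustrated with a deliberately small sketch. This is not the STATISTICA implementation: it assumes numeric predictors, uses one-split trees (stumps) rather than trees with up to 7 nodes, fits each stump to the residuals of the logistic loss on a random half of the data, and uses plain residual means as leaf values. All function and parameter names are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(xs, residuals):
    """Best single split (feature, threshold, left mean, right mean) by squared error."""
    best = None
    for j in range(len(xs[0])):
        values = sorted(set(row[j] for row in xs))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for row, r in zip(xs, residuals) if row[j] <= thr]
            right = [r for row, r in zip(xs, residuals) if row[j] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, lm, rm)
    return None if best is None else best[1:]

def fit_boosted_stumps(xs, ys, n_trees=50, learning_rate=0.1, subsample=0.5, seed=0):
    """Boost stumps on log-loss residuals; assumes both classes are present."""
    rng = random.Random(seed)  # the seed controls the random subsamples
    base_rate = sum(ys) / len(ys)
    f0 = math.log(base_rate / (1.0 - base_rate))
    scores = [f0] * len(xs)
    stumps = []
    k = max(2, int(subsample * len(xs)))
    for _ in range(n_trees):
        idx = rng.sample(range(len(xs)), k)
        stump = fit_stump([xs[i] for i in idx],
                          [ys[i] - sigmoid(scores[i]) for i in idx])
        if stump is None:  # no valid split in this subsample
            break
        j, thr, lm, rm = stump
        stumps.append(stump)
        for i, row in enumerate(xs):
            scores[i] += learning_rate * (lm if row[j] <= thr else rm)
    return {"f0": f0, "rate": learning_rate, "stumps": stumps}

def predict_proba(model, row):
    """Predicted probability of the positive class for one row of predictors."""
    score = model["f0"]
    for j, thr, lm, rm in model["stumps"]:
        score += model["rate"] * (lm if row[j] <= thr else rm)
    return sigmoid(score)
```

In this sketch, two runs with the same `seed` draw the same subsamples and therefore produce identical models, while different seeds generally do not. That is the same effect the text describes for consecutive runs that are initialized randomly.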

FIGURE D.17 Boosted Classification Trees node parameters.

FIGURE D.18 Test-Sample variable selection dialog.

FIGURE D.19 Boosted Classification Trees node parameters.

FIGURE D.20 Screenshot of the workspace.

Assessing the Quality of the Model: Lift Charts

The quality of models is usually characterized by lift charts. The value of lift is computed as the ratio of the readmission rate in a selected sample to the baseline readmission rate. Specifically, following best practices, the prediction model is applied to a hold-out sample of observations (not used for the model-building process itself), and then the top 10% of patients with the highest predicted readmission risk are selected. The actual readmission rates for those patients are then computed. This process is repeated for the next 10% of patients in the hold-out sample, and so on, to construct an entire lift chart as shown in Figure D.21 (see also Nisbet et al., 2009).

The base rate for hospital readmission was approximately 33% in this test subset. Among the top 10% of hold-out patients with the highest predicted readmission risk, the actual readmission rate is around 74.2%, or more than twice as high (a lift of about 2.2). Put another way, through the application of the predictive model, patients can be identified whose readmission risk is more than twice as high as that for the average patient.
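The lift calculation described above can be sketched as follows. The example computes cumulative lift (the top 10%, then the top 20%, and so on), which is one common way to build such a chart; it assumes the hold-out sample has at least as many observations as bins, and the function name is hypothetical.

```python
def cumulative_lift(scores, outcomes, n_bins=10):
    """Return (percentile, readmission rate, lift) for the top 10%, 20%, ...

    scores: predicted risks on the hold-out sample; outcomes: actual 0/1 flags.
    Assumes len(scores) >= n_bins and at least one readmission.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    base_rate = sum(outcomes) / len(outcomes)
    table = []
    for b in range(1, n_bins + 1):
        top = order[: round(len(order) * b / n_bins)]
        rate = sum(outcomes[i] for i in top) / len(top)
        table.append((100 * b // n_bins, rate, rate / base_rate))
    return table

# Ten hypothetical hold-out patients, already scored by a model.
demo_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
demo_outcomes = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
lift_table = cumulative_lift(demo_scores, demo_outcomes)
```

At the 100th percentile the selected group is the whole sample, so the rate equals the base rate and the lift is 1. With the figures quoted in the text, a 74.2% rate against a 33% base rate gives a lift of roughly 2.2.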

FIGURE D.21 Lift chart for test sample. Point labels show readmission rates in %. Vertical axis: Lift (1.0 to 2.2); horizontal axis: Percentile (10 to 100). Point labels, in decreasing order: 74.2%, 67.2%, 59.3%, 57.0%, 53.0%, 46.7%, 44.3%, 40.7%, 36.6%, 33.0%.

Understanding the Implications of Results: What-If and Reason Scores

Once the model is built, it can be used to predict probabilities of readmission for new patients. Those probabilities can then be used to bring recommendations to professionals working with those patients. However, while the identification of higher-risk patients is important, it is equally important to know how to mitigate this risk, or what to do about it.

Reason scores are a way to explain the predictions of analytic models and to identify the root causes driving a specific prediction. The term emerged in the risk-scoring domain, where a final score is computed as the sum of reason scores from multiple predictors (see Siddiqi, 2006). In general, reason scores are computed as first-order partial derivatives of the considered parameter (probability of readmission, in our case). Thus, they address the question of how much the risk will change if the value of a specific predictor variable changes or is changed.

For example, consider the following case. Table D.1 shows the characteristics of a patient predicted to have an elevated risk for hospital readmission (and in fact, this patient was readmitted to the hospital). The values in parentheses are calculated from the initial data by using Descriptive Statistics and Frequency Tables nodes (the average value is used as the baseline for continuous variables; the mode is used as the baseline for categorical variables). The “baseline” can also be defined based on expert opinion or a meta-analysis of scientific publications. In this tutorial we’ll be looking at the increase or decrease in the predicted probability of readmission in response to changing one of the parameters to the baseline value. Positive differences in our case mean that the specified factor contributed towards the increase of readmission probability by x%; otherwise it decreased that probability by x% compared to baseline. It is important to note that these values are not additive, as they are calculated under the assumption that all other variables except the one under consideration are fixed at their values. Still, these values can be placed into a graph that provides an immediate “recommendation” of how changing a specific predictor value will affect the likelihood of readmission.

As discussed above, we prepare the data set for further simulations (this process can be automated with SVB nodes, but this falls beyond the scope of this tutorial) (Figure D.22).

TABLE D.1 Characteristics of the Patient (Values in Parentheses Show Baseline)

Parameter name    Value            Parameter name        Value
Age               22 (33.544)      HMB_within_60_days    1.8 (2.976)
Surgeon ID        1 (4)            ASA                   9 (2.828)
EBL               3300 (1714.9)    Gender                Male (Male)
Apgar_Score       7-8 (5-6)        WND_Class             02 (02)

FIGURE D.22 Spreadsheet for sensitivity analysis.

FIGURE D.23 Screenshot of the workspace.

FIGURE D.24 Rapid Deployment node parameters.

The first row contains the values of the case in question. The second row represents the “baseline” values. Each of the remaining rows is a copy of the case with the baseline value substituted for one of the values; the name of the substituted variable is recorded in the Variable column. In order to calculate predictions for all scenarios, we plug the data set into the workspace and connect it to the Rapid Deployment node (Figure D.23). Before we run the Rapid Deployment node we need to include prediction probabilities in the output using the settings shown in Figure D.24. Next, we execute the Rapid Deployment node and append the predicted probabilities to the initial data set using the Concatenate Variables node (Data > Manage > Merge > Concatenate Variables) (Figure D.25). The result is shown as a spreadsheet in Figure D.26, or in the form of the calculated differences from the baseline (Figure D.27). Figure D.28 shows that the major contributing factors for the considered patient are Surgeon ID, EBL, and HMB_within_60_days.
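The what-if procedure just described (substitute one baseline value at a time and compare predictions) can be sketched as follows. The scoring function `toy_predict` is a made-up stand-in for the deployed model, with invented coefficients; only the patient and baseline values for the three fields shown are taken from Table D.1.

```python
def reason_scores(predict, case, baseline):
    """Change in predicted probability when one value is reset to its baseline.

    A positive score means the case's actual value raised the predicted risk
    relative to the baseline value, with all other values held fixed.
    """
    p_case = predict(case)
    scores = {}
    for name, base_value in baseline.items():
        scenario = dict(case)          # copy of the case ...
        scenario[name] = base_value    # ... with one value set to baseline
        scores[name] = p_case - predict(scenario)
    return scores

def toy_predict(patient):
    """Hypothetical stand-in for the deployed model (invented coefficients)."""
    risk = 0.2
    if patient["Surgeon ID"] == 1:
        risk += 0.3
    if patient["EBL"] > 2500:
        risk += 0.2
    return risk

case = {"Surgeon ID": 1, "EBL": 3300, "Gender": "Male"}
baseline = {"Surgeon ID": 4, "EBL": 1714.9, "Gender": "Male"}
scores = reason_scores(toy_predict, case, baseline)
```

Because each scenario changes a single value, the scores need not add up to the difference between the case and the all-baseline prediction, which is the non-additivity noted in the text.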

FIGURE D.25 Screenshot of the workspace.

FIGURE D.26 Spreadsheet with predictions.

FIGURE D.27 Sensitivities.

While the nature of why or how Surgeon ID impacts readmission risk may be complex (e.g., it could be a function of the severity of cases seen by different surgeons), this small example demonstrates that predictions derived from data mining methods can be explained in the same way as those from simpler statistical regression type models.

FIGURE D.28 Reason scores of different factors for patient with characteristics shown in Table D.1. Vertical axis: ΔP. Bars are shown for Surgeon ID, HMB_within_60_days, EBL, ASA, Age, WND_Class, Gender, and Apgar_Score; bar labels range from 54.6% down to -0.4%.

Step 4: Decision Management and Prescriptions

Decisioning logic that summarizes the rules guiding the decision-making process usually needs to be integrated with model predictions, for the following two reasons: (1) in order to associate statistical probabilities of certain conditions to specific prescriptions for the next “recommended” action; and (2) in order to augment analytic models with specific domain knowledge that is not captured by the analytic model. In the previous steps we’ve described the workflow to build a prediction for the readmission probability based on the model. Here, we’ll discuss how predicted values can be combined with rules logic.

The section Step 2: Feature Selection and Predictor Coding described WoE coding as a data preparation step to bin the range of predictor values into intervals associated with different risk. Note that such data pre-processing and binning of predictors for maximum predictive power while maintaining simple interpretability can be incorporated into the scoring process as simple rules, via if ... then ... elseif ... endif statements. In addition, by combining into such rules predicted risk probabilities and a priori expert knowledge about risk not captured in the analyses (e.g., because of lack of data), and linking it to action plans to mitigate risk, an effective automated decision support (“decisioning”) system can be built.

For example, Figure D.29 shows part of a decisioning logic flow implementing the optimal binning for predictor EBL (estimated blood loss from the surgical procedure in milliliters), linking the specific interval boundaries for this predictor to specific action plans or Prescriptions (Prescription 1, Prescription 2, etc.). The right side of Figure D.29 shows how this logic is implemented in the STATISTICA Decisioning Platform. The entire workflow that summarizes knowledge about the patients’ readmission probabilities can now be used to score patients at the point of discharge, and to execute appropriate action plans to mitigate readmission risk. Again, it is important to note that these predictive analytic and decisioning system predictions can be made for “individual patients,” thus providing the best possible care for the individual.

The rules can be implemented on the workflow as follows. First, we add a Subset node to separate the scenario of interest (Data > Subset) with the parameters shown in Figure D.30. The node outputs the single case described above (Figure D.31). Now we add the Rules node (Data Mining > Deployment > Rules Builder) with the set of rules as shown in Figure D.29 (Figure D.32). The combined output can now be retrieved as shown in Figure D.33.

For new patients the above workspace can be simplified to model deployment nodes only, as shown in Figure D.34 (the workspace node can be copied and pasted from one workspace to another, for example, by using the well-known Windows keys Ctrl + C and Ctrl + V). In STATISTICA Enterprise, the workspace can become a reusable template and be deployed to the STATISTICA Enterprise metadata repository. This functionality is available for STATISTICA Enterprise users through the Deploy button.
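The way binned predictors and model output combine into prescriptions can be sketched as a small rule table. The interval boundaries, the risk cutoff, and the wording of the added action below are hypothetical; the actual rules are those shown in Figure D.29, which is not reproduced here. Only the labels Prescription 1, Prescription 2, and so on follow the text.

```python
# Hypothetical EBL intervals (upper bound in mL, action); not the Figure D.29 values.
EBL_RULES = [
    (500.0, "Prescription 1"),
    (2000.0, "Prescription 2"),
    (float("inf"), "Prescription 3"),
]

def ebl_prescription(ebl_ml):
    """Map estimated blood loss to an action via if/elseif-style interval rules."""
    for upper, action in EBL_RULES:
        if ebl_ml <= upper:
            return action

def decide(ebl_ml, predicted_risk, risk_cutoff=0.5):
    """Combine the rule-based prescription with the model's predicted risk.

    risk_cutoff is an assumed expert-defined threshold.
    """
    action = ebl_prescription(ebl_ml)
    if predicted_risk >= risk_cutoff:
        action += " + high-risk follow-up"
    return action

# Score one patient at discharge (values chosen for illustration).
plan = decide(ebl_ml=3300, predicted_risk=0.74)
```

Keeping the boundaries in a table rather than in nested conditionals makes it straightforward to update them when the binning is re-estimated on new data.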

FIGURE D.29 Illustration of Rules logic and implementation in STATISTICA Rules Builder.

FIGURE D.30 Subset node parameters dialog.

FIGURE D.31 Spreadsheet with parameters for single case.

FIGURE D.32 Screenshot of the workspace.

FIGURE D.33 Final result. The spreadsheet contains probability predicted with analytic model and action plan assigned based on rules.

FIGURE D.34 Deployment workspace.

CONCLUSIONS

This tutorial provides an overview of how modern advanced predictive modeling and pattern recognition algorithms can be integrated with effective rules logic into a single decisioning platform, delivering actionable prescriptions to improve patient outcomes. Using an example data set describing hospital readmission risk in a sample of patients, all steps from data preparation, predictor coding, and model building to model implementation and interpretation were discussed. On the example data set, the predictive modeling algorithms provided a prediction model that can be expected to successfully identify the top 10% of patients, whose readmission risk is more than twice as high as the baseline risk (74.2% versus approximately 33% in the test sample). In addition, this tutorial described how results derived from complex modeling procedures can yield interpretable and actionable results that can be combined into action plans for mitigating risk (“prescriptive analytics”).

REFERENCES

Breiman, L., 2001. Statistical modeling: the two cultures. Stat. Sci. 16 (3), 199-231.
Cox, D.R., 1972. Regression models and life tables. J. R. Stat. Soc. 34, 187-220.
Friedman, J.H., 1999. Stochastic Gradient Boosting. Stanford University, Stanford, CA.
Lee, E.T., 1980. Statistical Methods for Survival Data Analysis. Lifetime Learning, Belmont, CA.
Miner, G., Elder, J., Hill, T., Nisbet, R., Delen, D., Fast, A., 2012. Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications. Elsevier, New York, NY.
Nisbet, R., Elder, J., Miner, G., 2009. Handbook of Statistical Analysis and Data Mining Applications. Elsevier, New York, NY.
Siddiqi, N., 2006. Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. Wiley & Sons, New York, NY.
StatSoft, Inc., 2013. STATISTICA (data analysis software system), version 12. www.statsoft.com.