Tutorial U
Predicting Breast Cancer Diagnosis Using Support Vector Machines
Haranath Varanasi
Chapter Outline
Introduction
Data Description
Data Analysis and Exploration
  Feature Selection and Root Cause Analysis (Using Chi-square Method (Default)/p-Value Method)
Modeling Using Support Vector Machine with Deployment
Rapid Deployment, Cross-Validating and Predicting on Different Data
Summary
Data Set Locations
INTRODUCTION
Support vector machines (SVMs) are a family of related supervised learning methods that analyze data and recognize patterns, and are used for classification and regression analysis. The goal of an SVM model is to predict which category a particular subject or individual belongs to, based on training-set examples. In this tutorial we will carry out feature selection and root cause analysis to select predictors, and then use an SVM model to predict the breast cancer diagnosis (benign or malignant) of a particular patient based upon measurements obtained by doctors from scanned images. The data set used is the Breast Cancer Wisconsin (Original) Data Set, which can be obtained from the UCI Machine Learning Repository.
Data Description
The Wisconsin Breast Cancer Data Set comprises 699 patients with a total of 11 different variables. These variables will be used to create models that predict the diagnosis (benign or malignant) of a particular patient. The variables are:
1. Sample code number: ID number
2. Clump Thickness
3. Uniformity of Cell Size
4. Uniformity of Cell Shape
5. Marginal Adhesion
6. Single Epithelial Cell Size
7. Bare Nuclei
8. Bland Chromatin
9. Normal Nucleoli
10. Mitoses
11. Class: (2 for benign, 4 for malignant)
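For readers who want to follow along outside STATISTICA, the sketch below shows one way to load the raw data file with Python and pandas. The file URL, the treatment of "?" as a missing-value marker, and the column labels assigned in code are assumptions based on the UCI repository listing, not part of this tutorial's STATISTICA workflow.

import pandas as pd

# Assumed location of the raw data file in the UCI repository (not given in this tutorial).
URL = ("http://archive.ics.uci.edu/ml/machine-learning-databases/"
       "breast-cancer-wisconsin/breast-cancer-wisconsin.data")

# Column names taken from the variable list above.
columns = [
    "Sample code number", "Clump Thickness", "Uniformity of Cell Size",
    "Uniformity of Cell Shape", "Marginal Adhesion", "Single Epithelial Cell Size",
    "Bare Nuclei", "Bland Chromatin", "Normal Nucleoli", "Mitoses", "Class",
]

# The raw UCI file marks missing Bare Nuclei values with "?"; drop such rows if present.
df = pd.read_csv(URL, header=None, names=columns, na_values="?").dropna()

print(df.shape)                      # roughly (699, 11) before dropping missing rows
print(df["Class"].value_counts())    # 2 = benign, 4 = malignant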
DATA ANALYSIS AND EXPLORATION
With these data, the ID number will not be used, and the categorical response is Diagnosis (Class: 2 for benign, 4 for malignant). The data set is already clean, so no data cleaning is required. First we run the feature selection process to select the best predictors of the target variable.
Feature Selection and Root Cause Analysis (Using Chi-square Method (Default)/p-Value Method)
1. Go to Data Mining, then down to Data Mining Workspace, and click on All Procedures.
2. Click on Data Source, select the breast-cancer-wisconsin data set, and click OK. The variable selection dialog appears immediately.
3. Select Class (Variable 11), which is categorical (2 for benign, 4 for malignant), as the target variable, and the remaining Variables 2-10 as continuous predictors, excluding the ID (Figure U.1). This loads the data as shown in Figure U.2.
4. Click on Node Browser, and select Feature Selection and Root Cause Analysis (Figure U.3).
5. Right-click on Feature Selection and Root Cause Analysis, and choose Edit Parameters.
   a. Edit parameters for running feature selection based on Chi-square (Figure U.4).
   b. Edit parameters for running feature selection based on p values (Figure U.5).
6. Right-click on Feature Selection and Root Cause Analysis, and click Run to Node.
7. Click on Reporting Documents to view the results (Figures U.6 and U.7; Tables U.1 and U.2).
Because the p values of all predictors are 0 (zero) with both the Chi-square and p-value methods, we might consider all of them predictors of Class. Before we do that, let's examine the correlations in the data more closely.
8. Open the data set, click on Multiple Regression, and select Variable 11 (Class) as the dependent variable and Variables 2-10 as independent variables; click OK, click the Advanced tab, then select Descriptive Statistics (Figure U.8).
9. Open the results workbook and examine the Correlations table and scatterplots (Figure U.9 and Table U.3).
Using the correlation scatterplots and the correlation matrix, we can conclude that the predictors are correlated with one another. Because the p values of all predictors are zero and the predictors are correlated, we will use all of them as predictors for modeling. (A rough Python analogue of the feature-scoring and correlation steps is sketched below.)
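The STATISTICA node ranks predictors by Chi-square and by F/p value; a rough, hedged equivalent using scikit-learn is sketched below. It assumes the DataFrame df from the earlier loading sketch and uses the chi2 and f_classif scoring functions, which will not reproduce STATISTICA's exact statistics but rank the predictors in a similar way. The correlation step mirrors Table U.3.

import pandas as pd
from sklearn.feature_selection import chi2, f_classif

# Assumes `df` from the earlier loading sketch; drop the ID column and split predictors/target.
X = df.drop(columns=["Sample code number", "Class"])
y = df["Class"]

# Chi-square and F (ANOVA) scores, analogous to the two ranking methods used above.
chi2_scores, chi2_p = chi2(X, y)
f_scores, f_p = f_classif(X, y)

ranking = pd.DataFrame(
    {"chi2": chi2_scores, "chi2 p": chi2_p, "F": f_scores, "F p": f_p},
    index=X.columns,
).sort_values("F", ascending=False)
print(ranking)

# Pairwise correlations among the predictors (and with Class), as in Table U.3.
print(df.drop(columns=["Sample code number"]).corr().round(3))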
FIGURE U.1 Selecting variables in the STATISTICA workspace.
FIGURE U.2 Another view of the variable selection dialog, and the loaded data file in the background.
FIGURE U.3 Using the 'Node Browser' to select specific data analysis processes.
FIGURE U.4 The 'Edit Parameters' dialog for Feature Selection, selecting "Fixed Number" of predictors.
FIGURE U.5 The ‘Edit Parameters’ dialog for Feature Selection, selecting “Based on p-value” of predictors.
FIGURE U.6 Predictors Importance plot (p value).
[Figure U.6 content: importance plot for dependent variable Class, with bars for Cell size, Cell shape, Bare nuclei, Epithelial cell size, Normal nucleoli, Bland chromatin, Clump thickness, Marginal adhesion, and Mitoses; x-axis: Importance (F value), 0-500.]
FIGURE U.7 Predictors Importance plot (Chi-square).
[Figure U.7 content: importance plot for dependent variable Class, with bars for the same nine predictors; x-axis: Importance (Chi-square), 0-600.]
TABLE U.1 p Values Obtained Using Chi-Square Method
Best predictors for categorical dependent var: Class (breast-cancer-wisconsin in breast-cancer-wisconsin)

Predictor              Chi-square   p value
Cell Size              534.1405     0.00
Cell Shape             522.0482     0.00
Bare Nuclei            485.3459     0.00
Epithelial Cell Size   443.9259     0.00
Normal Nucleoli        417.2568     0.00
Bland Chromatin        411.7986     0.00
Clump Thickness        380.4005     0.00
Marginal Adhesion      370.7075     0.00
Mitoses                195.1930     0.00
TABLE U.2 p Values Obtained Using p-Value Method
Best predictors for continuous dependent var: Class (Spreadsheet in breast-cancer-wisconsin)

Predictor              F value      p value
Cell Size              449.0604     0.000000
Cell Shape             340.2597     0.000000
Bare Nuclei            276.6565     0.000000
Epithelial Cell Size   200.7238     0.000000
Normal Nucleoli        205.2642     0.000000
Bland Chromatin        165.3686     0.000000
Clump Thickness        117.8626     0.000000
Marginal Adhesion      156.5069     0.000000
Mitoses                 67.2202     0.000000
FIGURE U.8 Initial results dialog for Multiple Regression.
[Figure U.9 content: matrix of correlation scatterplots, Correlations (breast-cancer-wisconsin in breast-cancer-wisconsin, 11v*699c), for Clump Thickness, Cell Size, Cell Shape, Marginal Adhesion, Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, and Mitoses.]
FIGURE U.9 Correlation scatterplots across predictors.
TABLE U.3 Correlation Table for Predictors
Correlations (breast-cancer-wisconsin in breast-cancer-wisconsin)
Columns, in order: Clump Thickness, Cell Size, Cell Shape, Marginal Adhesion, Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses, Class

Clump Thickness        1.000000  0.642481  0.653470  0.487829  0.523596  0.593091  0.553742  0.534066  0.350957  0.714790
Cell Size              0.642481  1.000000  0.907228  0.706977  0.753544  0.691709  0.755559  0.719346  0.460755  0.820801
Cell Shape             0.653470  0.907228  1.000000  0.685948  0.722462  0.713878  0.735344  0.717963  0.441258  0.821891
Marginal Adhesion      0.487829  0.706977  0.685948  1.000000  0.594548  0.670648  0.668567  0.603121  0.418898  0.706294
Epithelial Cell Size   0.523596  0.753544  0.722462  0.594548  1.000000  0.585716  0.618128  0.628926  0.480583  0.690958
Bare Nuclei            0.593091  0.691709  0.713878  0.670648  0.585716  1.000000  0.680615  0.584280  0.339210  0.822696
Bland Chromatin        0.553742  0.755559  0.735344  0.668567  0.618128  0.680615  1.000000  0.665602  0.346011  0.758228
Normal Nucleoli        0.534066  0.719346  0.717963  0.603121  0.628926  0.584280  0.665602  1.000000  0.433757  0.718677
Mitoses                0.350957  0.460755  0.441258  0.418898  0.480583  0.339210  0.346011  0.433757  1.000000  0.423448
Class                  0.714790  0.820801  0.821891  0.706294  0.690958  0.822696  0.758228  0.718677  0.423448  1.000000
MODELING USING SUPPORT VECTOR MACHINE WITH DEPLOYMENT
In these data the ID number will not be used, and the categorical response is Diagnosis (Class: 2 for benign, 4 for malignant).
1. Click on All Procedures in the Data Mining tab (Figure U.10).
2. Click on Data Source and bring the Breast Cancer Data Set into the DM workspace (Figure U.11).
Categorical variables contain information about some discrete quantity or characteristic describing the observations in the data file (e.g., Gender: Male or Female); continuous variables are measured on a continuous scale (e.g., Height, Weight, Cost).
FIGURE U.10 Selecting All Procedures.
FIGURE U.11 Selecting Data Source.
Dependent variables are the ones we want to predict; they are also sometimes called outcome variables. Predictor (independent) variables are those we use to make the prediction or classification of the outcome.
3. Double-click on the input data (breast cancer data) in the DM workspace, and click on Variables. Select Variable 11 (Class) in the "Dependent; categorical" section, and Variables 2-10 as predictors in the "Predictor; continuous" section (Figure U.12). Here we are trying to predict Class (2 for benign, 4 for malignant) from the other variables (the cell-level measurements).
4. Click on the Node Browser, select Support Vector Machine with Deployment (Classification), and double-click to bring it into the DM workspace.
5. Right-click on the "Support Vector Machine with Deployment" icon and choose Edit Parameters (Figure U.13).
6. On the General tab, change the detail of computed results from "Minimal" to "All results" (Figure U.14).
FIGURE U.12 Selecting variables dialog.
FIGURE U.13 By right-clicking on the Icon-Node the flying menu appears which includes ‘Edit parameters’.
Predicting Breast Cancer Diagnosis Using Support Vector Machines Tutorial | U
859
FIGURE U.14 Edit parameters dialog, General tab.
FIGURE U.15 Edit parameters dialog, Cross validation tab.
7. On the Cross-validation tab, select V-fold cross-validation (Figure U.15).
8. On the Deployment tab, check PMML/XML deployment so that the code will be output to a workbook (Figure U.16).
9. Leave the Sampling, SVM, Kernel, Training, and Results tabs at their defaults.
10. Make a connection from the data set to the SVM node by right-clicking the data set icon and then clicking Node (Connection from this Node) (Figure U.17).
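As a hedged illustration of V-fold cross-validation outside STATISTICA, the sketch below cross-validates an RBF-kernel SVM with scikit-learn. The values C = 2.0 and gamma = 0.2 are taken from the settings reported later in Tables U.4 and U.5; the fold count and the scaling step are assumptions, not part of the tutorial's workflow.

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumes X (Variables 2-10) and y (Class) from the earlier sketches.
# C and gamma mirror the values reported in the results tables; 10 folds is an assumption.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=2.0, gamma=0.2))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores = cross_val_score(model, X, y, cv=cv)
print(f"Mean CV accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")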
FIGURE U.16 Edit parameters dialog, Deployment tab.
FIGURE U.17 Right clicking on the data set icon brings up this flying menu.
11. Right-click on the SVM icon and click Run to Node to run the SVM model (Figure U.18).
12. Upon completion, there will be two more icons in the DM workspace. To view the results, right-click on Reporting Documents and click View Document (Figure U.19; Tables U.4 and U.5).
13. Right-click on Support Vector Machine PMML Deployment Code, and save the PMML (XML) file (Figure U.20).
14. Save it as Breast_Cancer_SVM_PMML. We will use this file for rapid deployment and cross-validation on a different data set (Figure U.21).
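Outside STATISTICA, an analogous "train and export" step might look like the sketch below: fit the SVM on the full training sample, reproduce summaries like Tables U.4 and U.5, and persist the model. Saving with joblib stands in for STATISTICA's PMML/XML deployment export and is an assumption, not the tutorial's actual mechanism; the file name is hypothetical.

import joblib
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumes X and y from the earlier sketches; C and gamma as reported in Tables U.4/U.5.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=2.0, gamma=0.2))
model.fit(X, y)

# Confusion matrix on the overall sample, analogous to Table U.5
# (rows = observed class, columns = predicted class; 2 = benign, 4 = malignant).
pred = model.predict(X)
print(confusion_matrix(y, pred, labels=[2, 4]))

# Persist the fitted model; this stands in for saving the PMML deployment code.
joblib.dump(model, "Breast_Cancer_SVM.joblib")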
FIGURE U.18 Selecting ‘Run to node’ in the flying menu runs the computations up to and through this node.
FIGURE U.19 By right-clicking on the Reporting Documents node one can select ‘View Document’ to see the results workbook.
TABLE U.4 Classification Summary (Support Vector Machine)
Classification summary (Support Vector Machine), Class, Overall sample (Spreadsheet in breast-cancer-wisconsin); SVM: Classification type 1 (C = 2.000); Kernel: Radial Basis Function (gamma = 0.200); Number of support vectors = 50 (40 bounded)

Class name   Total   Correct   Incorrect   Correct (%)   Incorrect (%)
2            444     432       12          97.29730      2.702703
4            239     232       7           97.07113      2.928870
TABLE U.5 Confusion Matrix (Support Vector Machine)
Confusion matrix (Support Vector Machine), Class, Overall sample (Spreadsheet in breast-cancer-wisconsin); SVM: Classification type 1 (C = 2.000); Kernel: Radial Basis Function (gamma = 0.200); Number of support vectors = 50 (40 bounded); observed (rows) × predicted (columns)

Class observed   Predicted 2   Predicted 4
2                432           12
4                7             232
FIGURE U.20 Right-clicking on the “Support vector machine PMML Deployment Code” brings up this flying menu allowing saving of this code to another location.
RAPID DEPLOYMENT, CROSS-VALIDATING, AND PREDICTING ON DIFFERENT DATA
We will rapidly deploy the previously built SVM model and cross-validate it using a different data set.
1. Open STATISTICA, and open the data set Breast_cancer_tut_test.
2. From the Data Mining tab, click on Rapid Deployment (Figure U.22).
3. Click on "Variable selection via PMML" (this selects the predictors and target variable from the PMML file) (Figure U.23).
4. Click on "Load models from disk" and select the saved PMML model file (Breast_Cancer_SVM_PMML), i.e., the model you want to run on this data set (Figure U.23).
5. Click OK.
Upon completion, we will see the results workbook (Figures U.24 and U.25). The SVM model predicted classes with 99.97% (100 - 0.03) accuracy, as shown in Table U.6.
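A hedged Python analogue of this rapid-deployment step is sketched below: reload the saved model and score a second data set. The file name Breast_cancer_tut_test.csv and its column layout are assumptions for illustration; the tutorial's actual test data is a STATISTICA spreadsheet and the actual deployment mechanism is PMML.

import joblib
import pandas as pd

# Assumed CSV export of the test spreadsheet; the real tutorial uses a STATISTICA data file.
test = pd.read_csv("Breast_cancer_tut_test.csv")

X_test = test.drop(columns=["Sample code number", "Class"])
y_test = test["Class"]

# Reload the model saved earlier (stand-in for loading the PMML deployment code).
model = joblib.load("Breast_Cancer_SVM.joblib")

pred = model.predict(X_test)
error_rate = (pred != y_test).mean()
print(f"Error rate: {error_rate:.6f}")   # compare with Table U.6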
FIGURE U.21 Save As dialog, where the document can be renamed if desired and saved.
FIGURE U.22 Selecting the ‘Rapid Deployment’ module.
FIGURE U.23 The ‘Rapid Deployment’ module with the saved PMML (xml) model selected.
FIGURE U.24 Summary of the Rapid Deployment model’s predictions of cases.
FIGURE U.25 The error rate spreadsheet from the Rapid Deployment results workbook.
TABLE U.6 Excerpt from Result Book Showing Prediction Accuracy
Summary of deployment (Error rates) (Breast_cancer_tut_test)

Model       Error rate
SVM Model   0.030000
SUMMARY
In this tutorial, we ran feature selection and root cause analysis to choose predictor variables. Upon noticing that all predictors had a zero p value (the lower the p value, the more significant the predictor; here, all were equally significant), we explored the data further and analyzed the correlations between the predictor variables. The correlation analysis showed that the predictors are highly correlated with one another, so we gave equal weight to all of them in the SVM model. We then ran SVM modeling, saved a PMML (XML) file for rapid deployment and cross-validation, deployed the SVM model on a different data set, and cross-validated the model.
DATA SET LOCATIONS ,http://archive.ics.uci.edu/ml/Datasets/Breast 1 Cancer 1 Wisconsin 1 %28Original%29.. ,http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-isconsin/wdbc.names.. ,https://courses.cs.washington.edu/courses/cse446/09wi/hw1/..