Predicting Breast Cancer Diagnosis Using Support Vector Machines

Tutorial U

Haranath Varanasi

Chapter Outline
Introduction
Data Description
Data Analysis and Exploration
Feature Selection and Root Cause Analysis (Using Chi-square Method (Default)/p Value Method)
Modeling Using Support Vector Machine with Deployment
Rapid Deployment, Cross-Validating, and Predicting on Different Data
Summary
Data Set Locations

INTRODUCTION

Support vector machines (SVMs) are a set of related supervised learning methods that analyze data and recognize patterns; they are used for classification and regression analysis. The goal of an SVM classification model is to predict which category a particular subject or individual belongs to, based on training set examples. In this tutorial we will carry out feature selection and root cause analysis to select predictors and then use an SVM model to predict the breast cancer diagnosis (benign or malignant) of a particular patient from information obtained by doctors through scanned images. The data set used is the Breast Cancer Wisconsin (Original) Data Set, which can be obtained from the UCI Machine Learning Repository.

Data Description

The Wisconsin Breast Cancer Data Set comprises 699 patients with a total of 11 variables. These variables will be used to create models that predict the diagnosis (benign or malignant) of a particular patient. The variables are:

1. Sample code number (ID number)
2. Clump Thickness
3. Uniformity of Cell Size
4. Uniformity of Cell Shape
5. Marginal Adhesion
6. Single Epithelial Cell Size
7. Bare Nuclei
8. Bland Chromatin
9. Normal Nucleoli
10. Mitoses
11. Class (2 for benign, 4 for malignant)
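For readers who want to inspect the data outside STATISTICA, the following is a minimal Python sketch of reading it. It assumes a comma-separated file whose 11 columns appear in the order listed above; the snake_case column names and the two sample rows are illustrative and are not taken from the data set.

```python
import csv
import io

# Column names follow the variable list above; the spellings are our own.
COLUMNS = [
    "id", "clump_thickness", "cell_size", "cell_shape", "marginal_adhesion",
    "epithelial_cell_size", "bare_nuclei", "bland_chromatin",
    "normal_nucleoli", "mitoses", "class",
]
LABELS = {2: "benign", 4: "malignant"}


def load_cases(handle):
    """Read comma-separated rows; return (predictors, labels), dropping the ID."""
    X, y = [], []
    for row in csv.reader(handle):
        if len(row) != len(COLUMNS):
            continue  # skip blank or malformed lines
        try:
            values = [int(v) for v in row]
        except ValueError:
            continue  # skip rows with non-numeric entries
        X.append(values[1:10])        # variables 2 through 10
        y.append(LABELS[values[10]])  # variable 11 (Class)
    return X, y


# Two made-up rows in the assumed layout, for illustration only.
sample = io.StringIO(
    "1000001,5,1,1,1,2,1,3,1,1,2\n"
    "1000002,8,10,10,8,7,10,9,7,1,4\n"
)
X, y = load_cases(sample)
print(X[0], y)
```

To read a real file, pass an open file handle (for example, `open(path, newline="")`) in place of the `io.StringIO` sample.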

Practical Predictive Analytics and Decisioning Systems for Medicine. DOI: http://dx.doi.org/10.1016/B978-0-12-411643-6.00038-7 © 2015 Haranath Varanasi. Published by Elsevier Inc. All rights reserved.

DATA ANALYSIS AND EXPLORATION

With these data the ID number will not be used, and the categorical response is Class, the diagnosis (2 for benign, 4 for malignant). This is a cleaned data set, so no cleaning is required. First we will run the feature selection process to select the best predictors of the target variable.

Feature Selection and Root Cause Analysis (Using Chi-square Method (Default)/p Value Method)

1. Go to Data Mining, then down to Data Mining Workspace, and click on All Procedures.
2. Click on Data Source, select the Breast-cancer-Wisconsin data set, and click OK. The variable selection box will come up immediately.
3. Select Class (Variable 11) as the categorical target variable (2 for benign, 4 for malignant) and the remaining variables (2 through 10) as continuous predictors, excluding ID (Figure U.1). This loads the data as shown in Figure U.2.
4. Click on Node Browser, and select Feature Selection and Root Cause Analysis (Figure U.3).
5. Right click on Feature Selection and Root Cause Analysis, and edit parameters.
   a. Edit parameters for running feature selection based on Chi-squares (Figure U.4).
   b. Edit parameters for running feature selection based on p values (Figure U.5).
6. Right click on Feature Selection and Root Cause Analysis, and click on Run to Node.
7. Click on Reporting Documents to view the results (Figures U.6, U.7; Tables U.1, U.2).

Because the p values of all predictors are reported as 0 (zero) with both the Chi-square and p-value methods, we might consider all of them to be predictors of Class. Before we do that, let’s examine the correlations in the data more closely.

8. Open the data set, click on Multiple Regression, and select Variable 11 (Class) as the Dependent variable and Variables 2 through 10 as Independent variables; click OK, click the Advanced tab, and then select Descriptive Statistics (Figure U.8).
9. Open the Results book and look at the Correlations table and scatterplots (Figure U.9 and Table U.3).

From the correlation scatterplots and the correlation matrix, we can conclude that the predictors are correlated with each other. Because every predictor has a p value of zero and the predictors are correlated, we will use all nine variables as predictors for modeling.
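The two ranking criteria used above can be sketched in plain Python. This is a minimal illustration, assuming a Pearson chi-square on the predictor-by-class contingency table and a one-way ANOVA F statistic; STATISTICA's exact computation (for example, how it bins continuous predictors) may differ. The function names and the toy values are our own and are not drawn from the data set.

```python
from collections import Counter


def chi_square(values, classes):
    """Pearson chi-square statistic for a predictor-by-class contingency table."""
    n = len(values)
    observed = Counter(zip(values, classes))
    value_totals = Counter(values)
    class_totals = Counter(classes)
    stat = 0.0
    for v, v_total in value_totals.items():
        for c, c_total in class_totals.items():
            expected = v_total * c_total / n
            stat += (observed[(v, c)] - expected) ** 2 / expected
    return stat


def f_value(values, classes):
    """One-way ANOVA F statistic of a predictor across the class groups."""
    groups = {}
    for v, c in zip(values, classes):
        groups.setdefault(c, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )
    return (between / (k - 1)) / (within / (n - k))


# Toy data (hypothetical): eight cases, four per class.
classes = [2, 2, 2, 2, 4, 4, 4, 4]
predictors = {
    "cell_size": [1, 1, 2, 1, 8, 10, 9, 10],  # separates the classes
    "mitoses": [1, 1, 1, 2, 1, 2, 1, 1],      # carries no class information
}

# Rank predictors by chi-square, largest first.
ranking = sorted(
    predictors, key=lambda name: chi_square(predictors[name], classes), reverse=True
)
for name in ranking:
    print(name, chi_square(predictors[name], classes), f_value(predictors[name], classes))
```

A larger statistic indicates a stronger association with Class, which is how the importance plots and Tables U.1 and U.2 order the predictors.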

FIGURE U.1 Selecting variables in the STATISTICA workspace.

FIGURE U.2 Another view of the variable selection dialog, and the loaded data file in the background.

FIGURE U.3 Using the ‘Node Browser’ to select specific data analysis processes.

FIGURE U.4 The ‘Edit Parameters’ dialog for Feature Selection, selecting “Fixed Number” of predictors.

FIGURE U.5 The ‘Edit Parameters’ dialog for Feature Selection, selecting “Based on p-value” of predictors.

FIGURE U.6 Predictors Importance plot (p value).

[Importance plot. Dependent variable: Class. Axis: Importance (F value), 0 to 500. Predictors shown: Cell size, Cell shape, Bare nuclei, Epithelial cell size, Normal nucleoli, Bland chromatin, Clump thickness, Marginal adhesion, Mitoses.]

FIGURE U.7 Predictors Importance plot (Chi-square).

[Importance plot. Dependent variable: Class. Axis: Importance (Chi-square), 0 to 600. Predictors shown: Cell size, Cell shape, Bare nuclei, Epithelial cell size, Normal nucleoli, Bland chromatin, Clump thickness, Marginal adhesion, Mitoses.]

854 PART | 2 Practical Step-by-Step Tutorials and Case Studies

TABLE U.1 p Values Obtained Using Chi-Square Method
Best predictors for categorical dependent var: Class (breast-cancer-wisconsin in breast-cancer-wisconsin)

Predictor               Chi-square   p value
Cell Size               534.1405     0.00
Cell Shape              522.0482     0.00
Bare Nuclei             485.3459     0.00
Epithelial Cell Size    443.9259     0.00
Normal Nucleoli         417.2568     0.00
Bland Chromatin         411.7986     0.00
Clump Thickness         380.4005     0.00
Marginal Adhesion       370.7075     0.00
Mitoses                 195.1930     0.00

TABLE U.2 p Values Obtained Using p-Value Method
Best predictors for continuous dependent var: Class (Spreadsheet in breast-cancer-wisconsin)

Predictor               F value      p value
Cell Size               449.0604     0.000000
Cell Shape              340.2597     0.000000
Bare Nuclei             276.6565     0.000000
Epithelial Cell Size    200.7238     0.000000
Normal Nucleoli         205.2642     0.000000
Bland Chromatin         165.3686     0.000000
Clump Thickness         117.8626     0.000000
Marginal Adhesion       156.5069     0.000000
Mitoses                  67.2202     0.000000

FIGURE U.8 Initial results dialog for Multiple Regression.

FIGURE U.9 Correlation scatterplots across predictors.

[Scatterplot matrix. Correlations (breast-cancer-wisconsin in breast-cancer-wisconsin, 11v*699c). Panels: Clump Thickness, Cell Size, Cell Shape, Marginal Adhesion, Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses.]

TABLE U.3 Correlation Table for Predictors
Correlations (breast-cancer-wisconsin in breast-cancer-wisconsin). Columns are numbered in the same order as the rows.

Variable                       (1)       (2)       (3)       (4)       (5)       (6)       (7)       (8)       (9)      (10)
(1) Clump Thickness       1.000000  0.642481  0.653470  0.487829  0.523596  0.593091  0.553742  0.534066  0.350957  0.714790
(2) Cell Size             0.642481  1.000000  0.907228  0.706977  0.753544  0.691709  0.755559  0.719346  0.460755  0.820801
(3) Cell Shape            0.653470  0.907228  1.000000  0.685948  0.722462  0.713878  0.735344  0.717963  0.441258  0.821891
(4) Marginal Adhesion     0.487829  0.706977  0.685948  1.000000  0.594548  0.670648  0.668567  0.603121  0.418898  0.706294
(5) Epithelial Cell Size  0.523596  0.753544  0.722462  0.594548  1.000000  0.585716  0.618128  0.628926  0.480583  0.690958
(6) Bare Nuclei           0.593091  0.691709  0.713878  0.670648  0.585716  1.000000  0.680615  0.584280  0.339210  0.822696
(7) Bland Chromatin       0.553742  0.755559  0.735344  0.668567  0.618128  0.680615  1.000000  0.665602  0.346011  0.758228
(8) Normal Nucleoli       0.534066  0.719346  0.717963  0.603121  0.628926  0.584280  0.665602  1.000000  0.433757  0.718677
(9) Mitoses               0.350957  0.460755  0.441258  0.418898  0.480583  0.339210  0.346011  0.433757  1.000000  0.423448
(10) Class                0.714790  0.820801  0.821891  0.706294  0.690958  0.822696  0.758228  0.718677  0.423448  1.000000
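A correlation matrix of the kind shown in Table U.3 can be reproduced with a few lines of Python. The sketch below assumes the Pearson product-moment correlation, which is the usual default; the helper names and the toy columns are hypothetical and are not values from the data set.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Toy columns (hypothetical values on the 1-to-10 scale used by the predictors).
columns = {
    "cell_size": [1, 1, 2, 1, 8, 10, 9, 10],
    "cell_shape": [1, 2, 1, 1, 7, 10, 8, 10],
    "mitoses": [1, 1, 1, 2, 1, 2, 1, 3],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, [round(r, 3) for r in row.values()])
```

As in Table U.3, the matrix is symmetric with ones on the diagonal, so only the upper or lower triangle needs to be inspected.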

MODELING USING SUPPORT VECTOR MACHINE WITH DEPLOYMENT

In these data the ID number will not be used, and the categorical response is Class (2 for benign, 4 for malignant).

1. Click on All Procedures in the Data Mining tab (Figure U.10).
2. Click on Data Source and bring the Breast Cancer data set into the DM workspace (Figure U.11).

Categorical variables are those that contain information about some discrete quantity or characteristic describing the observations in the data file (e.g., Gender: Male or Female); continuous variables are measured on some continuous scale (e.g., Height, Weight, Cost).

FIGURE U.10 Selecting All Procedures.

FIGURE U.11 Selecting Data Source.

Dependent variables are the ones we want to predict; they are also sometimes called outcome variables; predictor (independent) variables are those that we want to use for the prediction or classification (of categorical outcomes).

3. Double click on the input data (Breast cancer data) in the DM workspace, and click on Variables. Select Variable 11 (Class) in the “Dependent; categorical” section, and Variables 2 through 10 as predictor variables in the “Predictor; continuous” section (Figure U.12). Here we are trying to predict Class (2 for benign, 4 for malignant) from the other variables (cell information).
4. Click on the Node Browser, select Support Vector Machine with Deployment (Classification), and double click to bring it into the DM workspace.
5. Right click on the “Support Vector Machine with Deployment” icon and select Edit Parameters (Figure U.13).
6. On the General tab, change the detail of computed results from “Minimal” to “All results” (Figure U.14).

FIGURE U.12 Selecting variables dialog.

FIGURE U.13 By right-clicking on the Icon-Node the flying menu appears which includes ‘Edit parameters’.

FIGURE U.14 Edit parameters dialog, General tab.

FIGURE U.15 Edit parameters dialog, Cross validation tab.

7. In the Cross-validation tab, select V-fold cross-validation (Figure U.15).
8. In the Deployment tab, check PMML/XML deployment so that it will be output to a workbook (Figure U.16).
9. Leave the Sampling, SVM, Kernel, Training, and Results tabs alone (i.e., leave them at their defaults).
10. Make a connection to the SVM node from our data set by right clicking the data set icon and then clicking on Node (Connection from this Node) (Figure U.17).
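The idea behind the V-fold cross-validation option selected in step 7 can be sketched as follows: the cases are divided into V groups, and each group in turn is held out for testing while the model is trained on the rest. The value V = 10, the random seed, and the function names are assumptions made for illustration; the tutorial does not state the value STATISTICA uses.

```python
import random


def v_fold_indices(n_cases, v=10, seed=0):
    """Shuffle case indices and deal them into v folds of near-equal size."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    return [indices[i::v] for i in range(v)]


def train_test_splits(n_cases, v=10, seed=0):
    """Yield (train, test) index lists, holding out each fold exactly once."""
    folds = v_fold_indices(n_cases, v, seed)
    for held_out in range(v):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


# 699 cases, as in the data description.
splits = list(train_test_splits(699, v=10))
for train, test in splits:
    print(len(train), len(test))
```

Each case is used for testing exactly once, so the cross-validated error is an estimate of how the model performs on cases it was not trained on.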

FIGURE U.16 Edit parameters dialog, Deployment tab.

FIGURE U.17 Right clicking on the data set icon brings up this flying menu.

11. Right click on the SVM icon and click Run to Node to run the SVM model (Figure U.18).
12. Upon completion, there will be two more icons in the DM workspace. To view the results, right click on Results and click on View Document (Figure U.19; Tables U.4, U.5).
13. Right click on Support vector machine PMML deployment code, and save the PMML XML file (Figure U.20).
14. Save it as Breast_Cancer_SVM_PMML. We will use this file for rapid deployment and cross-validation on a different data set (Figure U.21).

FIGURE U.18 Selecting ‘Run to node’ in the flying menu runs the computations up to and through this node.

FIGURE U.19 By right-clicking on the Reporting Documents node one can select ‘View Document’ to see the results workbook.

TABLE U.4 Classification Summary (Support Vector Machine)
Classification summary (Support Vector Machine), Class, Overall sample (Spreadsheet in breast-cancer-wisconsin); SVM: Classification type 1 (C = 2.000); Kernel: Radial Basis Function (gamma = 0.200); Number of support vectors = 50 (40 bounded)

Class name   Total   Correct   Incorrect   Correct (%)   Incorrect (%)
2            444     432       12          97.29730      2.702703
4            239     232       7           97.07113      2.928870
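The settings reported in the header of Table U.4 can be read against the standard soft-margin SVM formulation, sketched below. This assumes that “Classification type 1” denotes the usual C-SVM and that the two classes are coded internally as y = -1 and y = +1; neither assumption is stated in the tutorial.

```latex
\begin{aligned}
&\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\,w^{\top}w + C\sum_{i=1}^{N}\xi_i
\quad\text{subject to}\quad
y_i\bigl(w^{\top}\phi(x_i)+b\bigr)\ge 1-\xi_i,\quad \xi_i\ge 0,\\[4pt]
&K(x_i,x_j)=\phi(x_i)^{\top}\phi(x_j)=\exp\!\bigl(-\gamma\,\lVert x_i-x_j\rVert^{2}\bigr),\\[4pt]
&\hat{y}(x)=\operatorname{sign}\Bigl(\sum_{i\in SV}\alpha_i\,y_i\,K(x_i,x)+b\Bigr),
\qquad 0\le\alpha_i\le C .
\end{aligned}
```

Under this reading, C = 2.000 is the penalty on margin violations and gamma = 0.200 sets the width of the radial basis function kernel. Support vectors are the training cases with a nonzero coefficient; a “bounded” support vector is one whose coefficient has reached the upper limit C, which is the case for 40 of the 50 support vectors here.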

TABLE U.5 Confusion Matrix (Support Vector Machine)
Confusion matrix (Support Vector Machine), Class, Overall sample (Spreadsheet in breast-cancer-wisconsin); SVM: Classification type 1 (C = 2.000); Kernel: Radial Basis Function (gamma = 0.200); Number of support vectors = 50 (40 bounded); observed (rows) × predicted (columns)

Class observed   Predicted 2   Predicted 4
2                432           12
4                7             232
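The counts in Table U.5 are enough to compute the usual summary measures. The short sketch below derives them in Python; treating malignant (class 4) as the “positive” class is our choice for illustration, and the variable names are our own.

```python
# Observed class -> predicted class -> count, copied from Table U.5.
confusion = {
    2: {2: 432, 4: 12},
    4: {2: 7, 4: 232},
}

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[c][c] for c in confusion)

accuracy = correct / total
# Share of observed malignant cases that were predicted malignant.
sensitivity = confusion[4][4] / sum(confusion[4].values())
# Share of observed benign cases that were predicted benign.
specificity = confusion[2][2] / sum(confusion[2].values())
# Share of malignant predictions that were actually malignant.
precision = confusion[4][4] / (confusion[2][4] + confusion[4][4])

print(f"cases: {total}, correct: {correct}")
print(f"accuracy:    {accuracy:.4f}")
print(f"sensitivity: {sensitivity:.4f}")
print(f"specificity: {specificity:.4f}")
print(f"precision:   {precision:.4f}")
```

The sensitivity and specificity computed this way match the “Correct (%)” column of Table U.4 for classes 4 and 2, respectively.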

FIGURE U.20 Right-clicking on the “Support vector machine PMML Deployment Code” brings up this flying menu allowing saving of this code to another location.

RAPID DEPLOYMENT, CROSS-VALIDATING, AND PREDICTING ON DIFFERENT DATA

We will rapidly deploy the previously built SVM model and cross-validate it using a different data set.

1. Open STATISTICA, and open the data set Breast_cancer_tut_test.
2. From the Data Mining tab, click on Rapid Deployment (Figure U.22).
3. Click on “Variable selection via PMML” (this selects the predictors and target variable from the PMML file) (Figure U.23).
4. Click on “Load models from disk” and select the saved PMML file, Breast_Cancer_SVM_PMML; the model will be run on the open data set (Figure U.23).
5. Click OK. Upon completion, we will see the Result book (Figures U.24, U.25).

The Result book reports an error rate of 0.03 for the SVM model (Table U.6). Taking the error rate as the proportion of misclassified cases, the model predicted classes with about 97% accuracy (1 − 0.03 = 0.97), in line with the training results in Table U.4.
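Conceptually, deploying a saved SVM model means evaluating its decision function on each new case. The sketch below shows that calculation for a radial basis function kernel with gamma = 0.2, as reported in Table U.4. The two support vectors, their coefficients, the intercept, and the dictionary layout are hypothetical; they are not read from the PMML file, and STATISTICA may also rescale the inputs before scoring.

```python
import math


def rbf_kernel(a, b, gamma):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))


def decision_value(x, model):
    """Weighted sum of kernel values against the support vectors, plus intercept."""
    return model["intercept"] + sum(
        coef * rbf_kernel(sv, x, model["gamma"])
        for sv, coef in zip(model["support_vectors"], model["dual_coefs"])
    )


def predict_class(x, model):
    """Map the sign of the decision value to the class codes 4 and 2."""
    return 4 if decision_value(x, model) > 0 else 2


# Hypothetical toy model: one benign-like and one malignant-like support vector.
toy_model = {
    "gamma": 0.2,
    "support_vectors": [[1] * 9, [10] * 9],
    "dual_coefs": [-1.0, 1.0],  # signed coefficients (alpha_i * y_i)
    "intercept": 0.0,
}

new_cases = [[2, 1, 1, 1, 2, 1, 1, 1, 1], [9, 10, 10, 8, 10, 10, 9, 10, 8]]
predictions = [predict_class(case, toy_model) for case in new_cases]
print(predictions)
```

Comparing such predictions with the observed classes in the new data set gives the error rate that the Result book reports.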

FIGURE U.21 Save As dialog, where the document can be renamed if desired and saved.

FIGURE U.22 Selecting the ‘Rapid Deployment’ module.

FIGURE U.23 The ‘Rapid Deployment’ module with the saved PMML (xml) model selected.

FIGURE U.24 Summary of the Rapid Deployment model’s predictions of cases.

FIGURE U.25 The error rate spreadsheet from the Rapid Deployment results workbook.

TABLE U.6 Excerpt from Result Book Showing Prediction Accuracy
Summary of deployment (Error rates) (Breast_cancer_tut_test)

              SVM Model
Error rate    0.030000

SUMMARY

In this tutorial, we ran feature selection to choose predictor variables. Every predictor had a reported p value of zero (the lower the p value, the higher the significance of the predictor), so the p values alone did not separate the predictors. We therefore explored the data further and analyzed the correlations between the predictor variables, which showed that the predictors are correlated with one another. Hence, we chose to use all of the predictors for SVM modeling. We then built the SVM model, saved it as a PMML XML file, rapidly deployed it on a different data set, and used that data set to cross-validate the model.

DATA SET LOCATIONS

http://archive.ics.uci.edu/ml/Datasets/Breast+Cancer+Wisconsin+%28Original%29
http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names
https://courses.cs.washington.edu/courses/cse446/09wi/hw1/