CHAPTER 6
Application of machine learning and image processing for detection of breast cancer
Muhammad Kashif1, Kaleem Razzaq Malik1, Sohail Jabbar2 and Junaid Chaudhry3
1Department of Computer Science, Air University Multan Campus, Multan, Pakistan, 2Department of Computer Science, National Textile University, Faisalabad, Pakistan, 3College of Security and Intelligence, Embry-Riddle Aeronautical University, Prescott, AZ, United States
6.1 Introduction
Implementing advanced computer applications in healthcare requires integrated approaches. These approaches have social, economic, political, and cultural impacts, and they also face the challenges of communication and information technologies. Data analysis and smart, computation-driven data are technologies that attract great attention in the healthcare domain (Spruit & Lytras, 2018). Many developed countries are moving toward the smart city concept, in which services are controlled by computers, and almost every field is becoming computerized. Decision-making is a major challenge in a smart environment. The modern era is also known as the data world because users generate data on a daily basis. These data are very useful for making the environment smart, since decisions are based on data. Artificial intelligence (AI) allows the computer to make decisions, provided that data are available to it (Lytras & Visvizi, 2018; Razzaq Malik et al., 2017).
Healthcare is a major challenge worldwide. Computers, intelligent systems, and intelligent devices play an important role in healthcare. Early detection of disease reduces the risk to human lives, and new innovations can reduce that risk further. Cancer is among the deadliest diseases in the world. Every year thousands of people die from cancer, often because a physician cannot diagnose it at an early stage: the symptoms of cancer are sometimes not visible from outside the body. At the second or third stage, cancer is much harder to cure. Some recent techniques can cure
Innovation in Health Informatics. DOI: https://doi.org/10.1016/B978-0-12-819043-2.00006-X © 2020 Elsevier Inc. All rights reserved.
the disease, but those techniques also have side effects. Early detection of cancer can therefore reduce the risk to the patient's life.
Breast cancer is the second leading disease causing deaths among women. According to one report, 2.5 million cases of cancer were identified in 2017 and 40,000 women were expected to die in the United States (DeSantis, Ma, Goding Sauer, Newman, & Jemal, 2017). A report by IARC shows that 8.2 million deaths were caused by cancer in 2012, and 27 million new cancer cases are expected before 2030 (Patel, Uvaid, & Suthar, 2012). A report by the International Agency of Cancer says that 79,000 women are facing breast cancer (Sreeja, Rathika, & Devaraj, 2012). This ratio is increasing in developed countries, which makes it a critical problem in healthcare.
Cells that grow rapidly and without control are called cancer. Breast cancer occurs due to such growth of cells in the breast. A group of extra tissue is known as a tumor, and a malignant tumor is cancerous.
In this work, we developed a model to predict breast cancer from mammogram images. We used a hybrid approach that combines mammogram processing with machine learning (ML) algorithms. Mammogram processing is used to extract features from the mammogram images, and ML classifies the images with abnormalities as Malignant or Benign. The images are taken from the Mammographic Image Analysis Society (MIAS) cancer dataset (Suckling et al., 2015). In the first phase, the images are preprocessed and then passed through segmentation. In the second phase, features are extracted. In the third phase, classification is done on the basis of the features. In the last phase, the different classification techniques are compared. The technique that gives the best results can then be used to predict cancer for new mammograms.
6.1.1 Mammograms
Mammograms can be used to check for the symptoms of breast cancer. A mammogram is an X-ray picture of the breast. The breast is compressed between two plates and an X-ray beam is applied to take the image; breast cancer can be diagnosed using these images. When a woman has no symptoms of breast cancer, screening mammography is performed. This process can reduce deaths from breast cancer among women aged 40–70. It also has some disadvantages: sometimes the breast has no disease but the mammogram shows an abnormality, which causes anxiety because more tests have to be done. Mammograms are also taken from younger women who have symptoms of breast cancer. The technique performed before symptoms occur is known as screening mammography, and the technique performed after symptoms occur is known as diagnostic mammography. Mammography is currently the most effective technique to detect early breast cancer (Sundaram, Sasikala, & Rani, 2014).
6.1.2 Preprocessing
Mammogram preprocessing improves the quality of the mammogram image so that later steps give more accurate results. In this phase, we remove noise from the mammogram to sharpen the image, which yields sharp edges and boundaries. Enhancement does not enlarge or shrink the mammogram image; it only increases the sharpness of its features. Image filters are used to remove irrelevant artifacts and enhance the quality of the image (John & Nallathambi, 2017).
6.1.3 Segmentation
Segmentation is the process of finding the region of interest (ROI). In this phase, the region that contains a tumor is selected and extracted from the mammogram image. This phase is necessary because, if the ROI is not extracted, the whole image has to be processed to extract features; the ROI restricts processing to the required features. Feature extraction is then applied to the ROI image. Two methods are used for image segmentation. The first, known as edge-based segmentation, differentiates the regions of the image by locating edges. The second, known as region-based segmentation, divides the image into small blocks by combining separate pixels into blocks (Bandyopadhyay, 2010).
6.1.4 Machine learning
ML is a subfield of AI. It has two major types: supervised algorithms are trained using labeled data, and unsupervised algorithms are trained using unlabeled data. In labeled data, the classification labels are known. In our case, label “0” indicates a normal case and label “1” indicates a cancerous case. ML is subcategorized into four types:
• Supervised ML
• Unsupervised learning
• Semisupervised learning
• Reinforcement and deep learning
6.1.4.1 Supervised machine learning
In supervised ML, two variables are used: the input variable (T) and the output variable (P). The algorithm learns a mapping function from input to output:
\[ P = f(T) \]
The objective of this mapping function is to predict the output value (P) for new input data (T). The algorithm learns from the training dataset and makes predictions for new data; learning is stopped once performance is acceptable. Supervised learning is subclassified into two categories:
• Classification
• Regression
6.1.4.1.1 Classification
Classification is used when the output variable is categorical, such as “yes” or “no,” or “disease” or “normal.”
6.1.4.1.2 Regression
Regression is used when the output variable has real or continuous values, such as “salary” or “weight.”
6.1.4.2 Unsupervised learning
Unsupervised learning is applied where only the input variable (T) is given and the output variable is unknown, so the correct answer for the input data is not known. The objective of unsupervised learning is to understand the structure or distribution of the data. The algorithm is left to its own devices to find patterns in the given data distribution. Unsupervised learning is subcategorized into clustering and association:
• Clustering
• Association
6.1.4.2.1 Clustering
Clustering is the process of grouping data so that items within a cluster are similar to one another and different from items in other clusters. For example, a college may group students according to their IQ into sections such as a super section, an average section, and a weak section. In another example, books are grouped on a shelf according to their field.
6.1.4.2.2 Association
Association means finding interesting patterns in data. It is mostly used in unsupervised learning to find frequent patterns. For example, in superstore data the rule [butter, egg] => [bread] says that a customer who buys butter and egg from the superstore is likely to buy bread.
6.1.4.3 Semisupervised learning
Semisupervised learning is similar to supervised learning, except that it also uses unlabeled data for training. Typically, there is a small amount of labeled data and a larger amount of unlabeled data, so it lies between supervised and unsupervised learning.
6.1.4.4 Reinforcement and deep learning
Both techniques are autonomous and self-learning, and large amounts of data are given to these algorithms for learning. The difference is that reinforcement learning learns by a trial-and-error mechanism, whereas deep learning learns by finding patterns in millions of data items (Haider, Malik, Khalid, Nawaz, & Jabbar, 2017). For example, to recognize dinosaur photos, millions of dinosaur photos are given to the algorithm so that it can learn by itself.
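Returning to the association rule from Section 6.1.4.2.2, the strength of a rule such as [butter, egg] => [bread] is commonly measured by its support and confidence. The sketch below is only an illustration: the function names and the five shopping baskets are hypothetical and do not come from the chapter.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    wanted = set(itemset)
    hits = sum(1 for basket in transactions if wanted <= set(basket))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical superstore baskets (illustrative data only).
baskets = [
    {"butter", "egg", "bread"},
    {"butter", "egg", "bread", "milk"},
    {"butter", "egg"},
    {"bread", "milk"},
    {"egg", "milk"},
]

rule_support = support(baskets, {"butter", "egg", "bread"})
rule_confidence = confidence(baskets, {"butter", "egg"}, {"bread"})
```

With these baskets, three customers buy butter and egg and two of them also buy bread, so the rule holds for two out of three matching baskets.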
6.2 Literature review
Ahmad, Eshlaghy, Poorebrahimi, Ebrahimi, and Razavi (2013) used three ML techniques to predict breast cancer. They used the ICBC dataset to extract results from experiments at the National Cancer Institute of Tehran, applying three ML algorithms (ANN, Decision Tree, and SVM). They observed that SVM has the highest accuracy at 95%, followed by ANN at 94%, while Decision Tree has the lowest accuracy at 93%.
Suji and Rajagopalan (2015) developed a system that combines CFRSFS, K-Mean clustering, and LS-SVM. They proposed a feature selection algorithm called CFRSFS, which selects only eight features; these are passed to K-Mean clustering for further feature processing. Their method has 99.54% accuracy.
Rana, Chandorkar, Dsouza, and Kazi (2015) applied different ML algorithms according to the dataset. They used the WDBC dataset to get experimental results and calculated the accuracy of the algorithms as 68% for SVM-linear, 68% for logistic regression, 72% for KNN, and 67% for Naïve Bayes.
Singh, Gupta, and Sharma (2010) used neural networks to detect breast cancer. They used 1808 cases of cancer, of which 387 were used for testing, and calculated an accuracy of 95.80% for the neural network algorithm.
Previous related work is presented in Table 6.1.

Table 6.1: Previous work.
No. | Authors | Technique or algorithm | Dataset | Accuracy (%)
1 | Ahmad et al. (2013) | Decision Tree; ANN; SVM | ICBC | 93; 94; 95
2 | Suji and Rajagopalan (2015) | CFRSFS, K-Mean | Wisconsin Diagnostic Breast Cancer (WDBC) | 99.54
3 | Rana et al. (2015) | KNN; SVM; Logistic Reg.; Naïve Bayes | WDBC | 72; 68; 68; 67
6.3 Proposed work
The proposed architecture is shown in Fig. 6.1.
6.3.1 Dataset
The sample dataset is taken from MIAS (Suckling et al., 2015), a United Kingdom research group interested in understanding mammogram images. The images are taken from the United Kingdom National Breast Screening center. The dataset contains 322 instances. An original image from this dataset, displayed in MATLAB, is shown in Fig. 6.2. This image is sent to the preprocessing module.
Figure 6.1 Proposed architecture.
Figure 6.2 Original mammogram from MIAS database (Suckling et al., 2015).
6.3.2 Noise removal (preprocessing)
As introduced in Section 6.1.2, the objective of preprocessing is to remove noise and enhance the quality of the mammogram before further processing. In this step, the image is loaded from the database and preprocessed to remove noise. The step is necessary because the original mammogram image contains noise and other artifacts. We apply a 2D median filter to remove noise from the image. Average and adaptive filters can also be applied, but the median filter is sufficient in our case. The preprocessed, noise-free image is shown in Fig. 6.3. This module is also executed in MATLAB.
Figure 6.3 Preprocessed mammogram.
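The median filtering step can be illustrated with a short sketch. The authors ran this step in MATLAB, so the Python code below is not their implementation; the function name `median_filter_2d`, the 3 x 3 window, the border handling, and the sample patch are all assumptions made for illustration. In practice a library routine such as SciPy's `scipy.ndimage.median_filter` would normally be used instead.

```python
from statistics import median


def median_filter_2d(image, size=3):
    """Apply a size x size median filter to a 2D list of gray levels.

    Border pixels use only the neighbours that fall inside the image
    (an assumed border policy; other tools pad the image instead).
    """
    half = size // 2
    rows, cols = len(image), len(image[0])
    filtered = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, c - half), min(cols, c + half + 1))
            ]
            filtered[r][c] = median(window)
    return filtered


# Hypothetical 5 x 5 patch: flat tissue with a single bright noise pixel.
noisy_patch = [
    [10, 10, 10, 10, 10],
    [10, 10, 10, 10, 10],
    [10, 10, 255, 10, 10],
    [10, 10, 10, 10, 10],
    [10, 10, 10, 10, 10],
]
clean_patch = median_filter_2d(noisy_patch)
```

Because the median ignores a single outlier in each window, the isolated bright pixel is replaced by the surrounding gray level while flat regions are left unchanged.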
6.3.3 Segmentation process
As introduced in Section 6.1.3, segmentation selects the ROI, here the area that contains a tumor, after noise removal. This step is necessary because it is very difficult to extract features from the whole image; we are interested in extracting features only from the region that has a tumor or cancer. Segmentation can be done using the GrabCut algorithm, thresholding, or the watershed algorithm. We apply thresholding, and the segmented image is used for feature extraction. The binary image is shown in Fig. 6.4: white pixels have the value “1” and black pixels have the value “0.” Combining the binary image with the original image gives a masked image that shows the tumor, as shown in Fig. 6.5. The segmented image required for feature extraction is shown in Fig. 6.6.
Figure 6.4 Noise-free image converted in binary.
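The thresholding and masking steps described in this section can be sketched as follows. This is a minimal illustration, not the authors' MATLAB code: the function names, the threshold value of 128, and the 4 x 4 patch are hypothetical, and the chapter does not report the threshold that was actually used.

```python
def threshold_image(image, threshold):
    """Return a binary image: 1 where the pixel exceeds threshold, else 0."""
    return [[1 if pixel > threshold else 0 for pixel in row] for row in image]


def apply_mask(image, mask):
    """Keep the original gray level where mask is 1; set the rest to 0."""
    return [
        [pixel * keep for pixel, keep in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


# Hypothetical 4 x 4 patch: a bright block surrounded by darker tissue.
patch = [
    [30, 40, 35, 20],
    [45, 210, 220, 30],
    [40, 230, 215, 25],
    [20, 35, 30, 15],
]
THRESHOLD = 128  # assumed value for illustration only

binary = threshold_image(patch, THRESHOLD)  # white = 1, black = 0
masked = apply_mask(patch, binary)          # bright region on a black background
```

The binary image plays the role of Fig. 6.4, and multiplying it with the original image corresponds to the masked image that highlights the candidate region.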
6.3.4 Feature extraction
In this step, we use the segmented image to extract features that describe the condition of the cancer. The features extracted from the segmented image are:
1. the radius of the highlighted region;
2. the entropy of the highlighted region;
3. the smoothness of the highlighted region;
4. the mean texture of the highlighted region; and
5. texture-based features.
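Some of the listed features can be sketched in code. The chapter does not give formulas for them, so every definition below is an assumption: the radius is taken as the radius of a circle with the same area as the region, the entropy as the Shannon entropy of the gray-level histogram, and the smoothness as 1 - 1/(1 + variance), a commonly used texture descriptor. The function names and the sample data are hypothetical.

```python
import math
from collections import Counter


def region_pixels(image, mask):
    """Collect the gray levels of pixels whose mask value is 1."""
    return [
        pixel
        for image_row, mask_row in zip(image, mask)
        for pixel, keep in zip(image_row, mask_row)
        if keep == 1
    ]


def equivalent_radius(mask):
    """Radius of a circle with the same area (pixel count) as the region."""
    area = sum(sum(row) for row in mask)
    return math.sqrt(area / math.pi)


def mean_intensity(pixels):
    """Average gray level of the region."""
    return sum(pixels) / len(pixels)


def entropy(pixels):
    """Shannon entropy (in bits) of the gray-level histogram."""
    total = len(pixels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(pixels).values()
    )


def smoothness(pixels):
    """Assumed descriptor: 1 - 1 / (1 + variance); 0 for a flat region."""
    mean = mean_intensity(pixels)
    variance = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return 1 - 1 / (1 + variance)


# Hypothetical segmented patch and its binary mask.
image = [[0, 0, 0], [200, 220, 0], [200, 220, 0]]
mask = [[0, 0, 0], [1, 1, 0], [1, 1, 0]]

pixels = region_pixels(image, mask)
feature_vector = [
    equivalent_radius(mask),
    entropy(pixels),
    smoothness(pixels),
    mean_intensity(pixels),
]
```

Each mammogram would yield one such feature vector, which is then paired with its label for training and testing.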
Figure 6.5 Tumor highlighted picture.
Figure 6.6 Segmented image.
6.3.5 Training model and testing
In this step, the labeled feature data are divided into two parts: one part for training and the other for testing. The training data are used to train the model. We then give the model the testing data without their labels and measure precision and recall by comparing its predictions with the held-back labels.
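The split into training and testing parts can be sketched as below. The helper name `split_dataset`, the 20% test fraction, the fixed seed, and the sample feature vectors are assumptions for illustration; the chapter does not describe how the split was drawn.

```python
import random


def split_dataset(features, labels, test_fraction=0.2, seed=0):
    """Shuffle the indices once, then cut them into test and train parts."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # seeded for repeatability
    n_test = round(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [features[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Hypothetical feature vectors and labels (0 = benign, 1 = malignant).
features = [(float(i), float(i) / 10) for i in range(10)]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

x_train, x_test, y_train, y_test = split_dataset(features, labels)
```

The labels in `y_test` are held back while the model predicts on `x_test`, and are used afterwards to score the predictions.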
6.3.6 Classification
In this step, the test data are given to the model and the predicted labels are obtained. The classification models applied are Support Vector Machine (SVM), Ada Boost, Decision Tree, Logistic Regression, K Nearest Neighbor (KNN), Random Forest, and Gradient Boost classifiers. Each algorithm gives a binary classification result. Performance measures are calculated from these results and used to compare the algorithms. On the basis of this comparison, we can suggest the best technique for the given healthcare problem.
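As an illustration of one of the listed classifiers, the sketch below implements a bare-bones K Nearest Neighbor vote. It is not the authors' implementation: the function name, the choice of k = 3, the Euclidean distance, and the toy feature vectors are assumptions, and a real experiment would normally rely on an established ML library.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label a query point by majority vote among its k nearest neighbours."""
    by_distance = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training data (0 = benign, 1 = malignant).
train_x = [(1.0, 0.2), (1.2, 0.1), (0.9, 0.3), (3.0, 0.9), (3.2, 0.8), (2.9, 1.0)]
train_y = [0, 0, 0, 1, 1, 1]

predicted = [knn_predict(train_x, train_y, q) for q in [(1.1, 0.2), (3.1, 0.9)]]
```

The output is a list of binary labels, the same form as the predicted labels reported for each classifier in the results.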
6.3.7 Performance evaluation metrics
To evaluate the performance of the algorithms, the confusion matrix is used and the formulas for precision, recall, f-score, sensitivity, specificity, and accuracy are applied. The confusion matrix divides the predicted instances into four outcomes:
PT = true positive
PF = false positive
NT = true negative
NF = false negative
In this chapter the positive class is Benign (no cancer) and the negative class is Malignant (cancer), as defined in Table 6.3. PT counts instances that are predicted to have no cancer and really have no cancer, so the algorithm is correct. PF counts instances that actually have cancer but are predicted to have no cancer, so the algorithm is wrong. NT counts instances that have cancer and are predicted to have cancer, so the algorithm is correct. Finally, NF counts instances that actually have no cancer but are predicted to have cancer; the algorithm is again wrong, as in the case of PF. Table 6.2 gives the confusion matrix, which relates the predicted class to the actual class through the four performance indices PT, NT, PF, and NF. Definitions of these measures are given in Table 6.3.

Table 6.2: Confusion matrix.
Actual class | Predicted Positive | Predicted Negative
Positive | PT | NF
Negative | PF | NT
PT, true positive; PF, false positive; NT, true negative; NF, false negative.

Table 6.3: Performance measurements.
Performance measure | Definition
PT | Correctly classified Benign tuples.
PF | False classified Malignant tuples that are marked Benign.
NT | Correctly classified Malignant tuples.
NF | False classified Benign tuples that are marked Malignant.
PT, true positive; PF, false positive; NT, true negative; NF, false negative.

Sensitivity is the ratio of PT to all actual positive instances (PT + NF):
\[ \text{Sensitivity} = \frac{PT}{PT + NF} \tag{6.1} \]
Specificity is the ratio of NT to all actual negative instances (PF + NT):
\[ \text{Specificity} = \frac{NT}{PF + NT} \tag{6.2} \]
The accuracy formula is given in Eq. (6.3):
\[ \text{Accuracy} = \frac{PT + NT}{PT + PF + NF + NT} \tag{6.3} \]
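6.3.8 f-Score measure
The f-score is an evaluation metric for ML algorithms. It is built from precision and recall:
\[ \text{Precision} = \frac{PT}{PT + PF} \tag{6.4} \]
\[ \text{Recall} = \frac{PT}{PT + NF} \tag{6.5} \]
Precision and recall are calculated from Eqs. (6.4) and (6.5) and used in Eq. (6.6) to calculate the F-measure:
\[ \text{F-measure} = \frac{2\,(\text{precision} \times \text{recall})}{\text{precision} + \text{recall}} \tag{6.6} \]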
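Eqs. (6.1) to (6.6) can be checked with a few lines of code. The sketch below applies them to the Ada Boost counts reported in Table 6.5 (PT = 23, NT = 1, PF = 3, NF = 15); the function names and the choice to return 0.0 when a denominator is zero are assumptions, not part of the chapter.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(pt, nt, pf, nf):
    """Compute the measures of Eqs. (6.1)-(6.6) from the four counts."""
    precision = ratio(pt, pt + pf)                          # Eq. (6.4)
    recall = ratio(pt, pt + nf)                             # Eq. (6.5)
    return {
        "sensitivity": ratio(pt, pt + nf),                  # Eq. (6.1)
        "specificity": ratio(nt, pf + nt),                  # Eq. (6.2)
        "accuracy": ratio(pt + nt, pt + pf + nf + nt),      # Eq. (6.3)
        "precision": precision,
        "recall": recall,
        "f_measure": ratio(2 * precision * recall, precision + recall),  # Eq. (6.6)
    }


# Ada Boost counts as reported in Table 6.5.
ada_boost = evaluate(pt=23, nt=1, pf=3, nf=15)
```

Note that sensitivity and recall share the same formula, so they differ in Table 6.6 only through rounding.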
6.4 Results
Table 6.4 gives the output labels and test score of each algorithm. Label “0” indicates a normal case and label “1” a cancerous case. The predicted labels are compared with the original labels from the dataset, and the resulting measures are given in Tables 6.5 and 6.6.

Table 6.4: Predicted labels by applied algorithms.
Algorithm | Test score (%) | Output (predicted label)
SVM | 76 | [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Ada Boost | 66 | [0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 0. 0. 1. 1. 1. 0. 1. 0. 0.]
Decision Tree | 50 | [1. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 1. 1. 1. 0. 0. 1. 1. 1. 0. 1. 0. 0.]
Logistic Regression | 76 | [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
Random Forest | 60 | [0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 1. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0. 0.]
Gradient Boost | 60 | [1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1. 1. 1. 0. 0. 0. 1. 1. 0. 0. 1. 1. 1. 0. 1. 0. 0.]
KNN | 63 | [0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0.]
Table 6.5: Predicted binary classification results of algorithms.
Model | Total instances | PT | NT | PF | NF
SVM | 42 | 38 | 0 | 4 | 0
Ada Boost | 42 | 23 | 1 | 3 | 15
Decision Tree | 42 | 21 | 2 | 2 | 17
Logistic Regression | 42 | 36 | 0 | 4 | 2
Random Forest | 42 | 26 | 0 | 4 | 12
Gradient Boost | 42 | 21 | 1 | 3 | 17
KNN | 42 | 30 | 2 | 2 | 8
PT, true positive; PF, false positive; NT, true negative; NF, false negative.
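Counts such as those in Table 6.5 come from comparing each predicted label with the true label. The sketch below shows one way to do this, following the convention of Tables 6.3 and 6.7 that the Benign label “0” is the positive class. The function name and the six sample labels are hypothetical, since the chapter does not list the true labels of the test instances.

```python
def confusion_counts(actual, predicted, positive=0):
    """Count PT, NT, PF, and NF, treating `positive` as the positive label."""
    counts = {"PT": 0, "NT": 0, "PF": 0, "NF": 0}
    for true_label, predicted_label in zip(actual, predicted):
        if predicted_label == positive:
            # Predicted Benign: correct if truly Benign, otherwise false positive.
            counts["PT" if true_label == positive else "PF"] += 1
        else:
            # Predicted Malignant: correct if truly Malignant, otherwise false negative.
            counts["NT" if true_label != positive else "NF"] += 1
    return counts


# Hypothetical labels for six test instances (0 = Benign, 1 = Malignant).
actual = [0, 0, 0, 0, 1, 1]
predicted = [0, 0, 1, 0, 0, 1]

counts = confusion_counts(actual, predicted)
```

The four counts always sum to the number of test instances, which is 42 for every row of Table 6.5.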
Table 6.6: Performance of algorithms calculated from the above formulas.
Model | Test score | f-Score | Sensitivity | Specificity | Precision | Recall | Accuracy (%)
SVM | 0.77 | 0.95 | 1 | 0 | 0.90 | 1 | 90
Ada Boost | 0.67 | 0.71 | 0.61 | 0.25 | 0.88 | 0.60 | 57
Decision Tree | 0.53 | 0.68 | 0.55 | 0.5 | 0.91 | 0.55 | 54
Logistic Regression | 0.70 | 0.92 | 0.94 | 0 | 0.90 | 0.95 | 85
Random Forest | 0.63 | 0.76 | 0.69 | 0 | 0.87 | 0.68 | 61
Gradient Boost | 0.60 | 0.67 | 0.55 | 0.25 | 0.88 | 0.55 | 52
KNN | 0.63 | 0.85 | 0.78 | 0.5 | 0.94 | 0.78 | 76
6.5 Discussions
The dataset is divided into training and testing data. Training data have 250 instances and testing data have 30 instances. After developing the model, 42 instances are predicted using the different models, and their binary classification results are given in Table 6.5. Using the above formulas, we calculated the f-score, precision, sensitivity, specificity, recall, and accuracy of the different algorithms applied to the breast cancer dataset. The graph in Fig. 6.7 shows the outcomes of the different algorithms; each algorithm predicts different results.
The total number of tested instances was 42. The SVM classifier predicted 38 instances as PT and 4 instances as PF. The Ada Boost classifier predicted 23 instances as PT, 1 instance as NT, 3 instances as PF, and 15 instances as NF. The Decision Tree classifier predicted 21 instances as PT, 2 instances as NT, 2 instances as PF, and 17 instances as NF. The Logistic Regression classifier predicted 36 instances as PT, 4 instances as PF, and 2 instances as NF. The Random Forest classifier predicted 26 instances as PT, 4 instances as PF, and 12 instances as NF. The Gradient Boost classifier predicted 21 instances as PT, 1 instance as NT, 3 instances as PF, and 17 instances as NF. The K Nearest Neighbor classifier predicted 30 instances as PT, 2 instances as NT, 2 instances as PF, and 8 instances as NF.
These outcomes determine the test score, f-score, sensitivity, specificity, precision, recall, and accuracy, so all of the results depend on them. SVM has the best accuracy at 90%, as shown in Table 6.6. Gradient Boost is the worst case, with the lowest accuracy and f-score. Logistic regression comes next after SVM, with 85% accuracy. Our results indicate that as the recall, f-score, and
Figure 6.7 Graph for confusion matrix of different algorithms (total instances, PT, NT, PF, and NF for each classifier).
Figure 6.8 Accuracy graph of applied algorithms.
Figure 6.9 Graph of different measurements calculated (test score, f-score, sensitivity, specificity, precision, recall, and accuracy for each classifier).
precision increase, our predictions improve. Increasing the training dataset is also expected to increase the accuracy of the model. On the basis of this work, we suggest that the SVM technique can be used to identify this critical disease, because it obtained the highest accuracy in our experiments.
Fig. 6.8 is a plot obtained from the values of the different measures given in Table 6.6, which are calculated from the formulas given above. The graph in Fig. 6.9 gives the test score, f-score, sensitivity, specificity, precision, recall, and accuracy. These values are calculated from the binary classification results and vary with them. The graph shows that algorithms with high precision and recall also show high accuracy, whereas techniques with low precision and recall give lower accuracy.
Table 6.7 lists the classification symbols. Benign is the type of image having no cancer, and Malignant is the type of image having cancer.

Table 6.7: Classification symbols.
Cancer type | Binary symbol
Benign | 0
Malignant | 1

SVM provides good results when its sigma value is small and its “c” parameter is large. Similarly, the performance of logistic regression depends on its “lambda” parameter; at a certain value of lambda its performance is best, as in our case.
6.6 Conclusion
Information technology plays an important role in healthcare. Communication technologies enable patients to stay in contact with their physicians, and physicians to stay in contact with their patients. Smart wearable sensing devices are now easily available. The patient wears these devices while performing daily routine activities. If there is any disturbance in the patient's body, the devices sense it and send a message to the doctor about the patient's condition. Through this process, the physician can give proper treatment according to the patient's condition, and lives can be saved using these innovations in smart healthcare.
The key finding of our work is that computer techniques can be used for the early detection of a critical disease, breast cancer, which can help save lives. Our work compares different algorithms that can be used for classifying the disease type and shows that the SVM algorithm gives reasonable accuracy for the prediction of cancer. This technique is therefore useful in healthcare systems for predicting disease, where computers are used to obtain results as early as possible so that patients are treated on time.
ML algorithms are used to predict cancer from mammogram images, and the experimental results are good. Mammogram images are not suitable for direct mining of the symptoms leading to a prediction, so a multistage strategy was needed. In the first step, this study applied segmentation, which leads to feature extraction. A total of 322 instances from the MIAS cancer image dataset were used. In the later phase, this work applied a mixed ML approach for cancer prediction. A combination of supervised and unsupervised techniques was used for classification. Based on the classifications, different measures were calculated for each algorithm. The observations show that the SVM classifier has the highest accuracy, leading to better prediction of cancer from complex images. The Gradient Boost classifier, on the other hand, shows the worst results with 52% accuracy; in Table 6.6 it also has the lowest f-score and, together with Decision Tree, the lowest recall, whereas SVM has the highest recall, f-score, and test score. These characteristics make the SVM a better option for such image datasets.
In the future, our approach can be used to predict lung cancer, liver cancer, skin cancer, brain tumors, heart disease, and diabetes. ML algorithms are trained using a dataset. In this world of Big Data, large datasets of MRI, CT scans, and mammograms are available in research labs. Researchers use these datasets to train ML algorithms and make predictions for new data. These techniques may be applied to many more diseases, as the healthcare field is growing very fast. Deep learning algorithms could also be used to predict and classify disease with high accuracy and precision. They are more advanced than the algorithms used in this project but demand high computing power. As computing power increases, researchers will be able to use newer and more accurate techniques.
6.7 Research contribution highlights
• Understanding of mammograms
• Reading of mammograms
• Finding of ROI
• Understanding features of mammograms
• Feature extraction from mammograms
• Comparison of ML algorithms
• Prediction for new data
• Suggestion of a technique for a critical healthcare problem
6.8 Teaching assignments
• Preprocessing of mammograms
• Feature extraction from mammograms
• Implementation of ML algorithms
• Differentiate between supervised and unsupervised learning
• The solution for classification problem by ML
• Which ML technique is best and why?
References
Ahmad, L. G., Eshlaghy, A., Poorebrahimi, A., Ebrahimi, M., & Razavi, A. (2013). Using three machine learning techniques for predicting breast cancer recurrence. Journal of Health and Medical Informatics, 4(124), 3.
Bandyopadhyay, S. K. (2010). Pre-processing of mammogram images. International Journal of Engineering Science and Technology, 2(11), 6753–6758.
DeSantis, C. E., Ma, J., Goding Sauer, A., Newman, L. A., & Jemal, A. (2017). Breast cancer statistics, 2017, racial disparity in mortality by state. CA: A Cancer Journal for Clinicians, 67(6), 439–448.
Haider, K. Z., Malik, K. R., Khalid, S., Nawaz, T., & Jabbar, S. (2017). Deepgender: Real-time gender classification using deep learning for smartphones. Journal of Real-Time Image Processing. Available from https://doi.org/10.1007/s11554-017-0714-3.
John, B., & Nallathambi, S. (2017). Study and analysis of filters. Advances in Computational Sciences and Technology, 10(3), 331–341.
Lytras, M., & Visvizi, A. (2018). Who uses smart city services and what to make of it: Toward interdisciplinary smart cities research. Sustainability, 10(6), 1998.
Patel, V. K., Uvaid, S., & Suthar, A. (2012). Mammogram of breast cancer detection based using image enhancement algorithm. International Journal of Emerging Technology and Advanced Engineering, 2(2012), 143–147.
Rana, M., Chandorkar, P., Dsouza, A., & Kazi, N. (2015). Breast cancer diagnosis and recurrence prediction using machine learning techniques. IJRET: International Journal of Research in Engineering and Technology eISSN, 2319-1163.
Razzaq Malik, K., Habib, M., Khalid, S., Ullah, F., Umar, M., Sajjad, T., & Ahmad, A. (2017). Data compatibility to enhance sustainable capabilities for autonomous analytics in IoT. Sustainability, 9(6), 877.
Singh, S., Gupta, P., & Sharma, M. K. (2010). Breast cancer detection and classification of histopathological images. International Journal of Engineering Science and Technology, 3(5), 4228.
Spruit, M., & Lytras, M. (2018). Applied data science in patient-centric healthcare. Elsevier.
Sreeja, G. B., Rathika, P., & Devaraj, D. (2012). Detection of tumours in digital mammograms using wavelet based adaptive windowing method. International Journal of Modern Education and Computer Science, 4(3), 57.
Suckling, J., Parker, J., Dance, D., Astley, S., Hutt, I., Boggis, C., . . . Savage, J. (2015). Mammographic image analysis society (MIAS) database v1.21 [Dataset]. Available from https://www.repository.cam.ac.uk/handle/1810/250394.
Suji, R. J., & Rajagopalan, S. (2015). A novel hybrid system for diagnosing breast cancer using fuzzy rough set and LS-SVM. Research Journal of Applied Sciences, Engineering and Technology, 10(1), 49–55.
Sundaram, K. M., Sasikala, D., & Rani, P. A. (2014). A study on preprocessing a mammogram image using adaptive median filter. International Journal of Innovative Research in Science, Engineering and Technology, 3(3), 10333–10337.