Available online at www.sciencedirect.com
ScienceDirect
www.elsevier.com/locate/procedia
Procedia Computer Science 161 (2019) 449–457
The Fifth Information Systems International Conference 2019
Analysis and Prediction of Diabetes Complication Disease using Data Mining Algorithm
Cut Fiarni*, Evasaria M. Sipayung, Siti Maemunah
Information Systems Department, Institut Teknologi Harapan Bangsa (ITHB), Jl Dipati Ukur 80-84 Bandung 4132, Indonesia
Abstract
Diabetes is one of the most dangerous chronic diseases and can lead to other serious complicating diseases. In Indonesia, the most common diabetes microvascular complications are retinopathy, nephropathy and neuropathy. To prevent these complications from manifesting, data mining techniques that extract knowledge of the risk factors for each complication become crucial. The goal of this research is to construct a prediction model for the three major diabetes complication diseases in Indonesia and to find the significant features correlated with them. In this research, the diabetes risk factors are narrowed to seven features: age, gender, BMI, family history of diabetes, blood pressure, duration of diabetes and blood glucose level. Naive Bayes Tree and C4.5 decision tree-based classification techniques and k-means clustering were used to analyze this dataset. After this analysis, we evaluated the performance of each technique and identified the correlated features and sub-features that act as risk factors for each disease. The most influential risk factor for retinopathy is being a female patient with a hypertension crisis. For nephropathy, the most prominent risk factor is a duration of diabetes of more than 4 years. Neuropathy is dominated by female patients with a BMI of more than 25. Family history of diabetes shows no distinct significant correlation with these complication diseases. The overall accuracy of the proposed model is 68%, so it could be used as an alternative method to help predict diabetes complication diseases at an early stage.
© 2019 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of The Fifth Information Systems International Conference 2019.
Keywords: Diabetes complication disease; data mining; prediction model; k-means; Naive Bayes; C4.5 decision tree.
* Corresponding author. Tel.: +62-22-250-6636. E-mail address: [email protected]
1877-0509 © 2019 The Authors. Published by Elsevier B.V.
10.1016/j.procs.2019.11.144
Cut Fiarni et al. / Procedia Computer Science 161 (2019) 449–457
1. Introduction
Diabetes Mellitus (DM) is defined as a group of metabolic disorders mainly caused by excess glucose in the bloodstream. The World Health Organization projects that more than 700 million people will be suffering from diabetes by 2030. Diabetes occurs throughout the world, but is more common in developed countries [1]. In Indonesia, the prevalence of diabetes was 10.9%, and the trend is gradually increasing [2]. As a metabolic disorder, diabetes can damage the blood vessels, which increases the risk of serious health complications that damage the heart, eyes, kidneys and nerves. The most common diabetes complication diseases are divided into two groups, based on damage to small blood vessels (microvascular) and damage to the arteries (macrovascular). Microvascular diseases are grouped by the organ they attack: the eyes (retinopathy), the kidneys (nephropathy) and the nerves (neuropathy). The major macrovascular complications include accelerated cardiovascular disease manifesting as strokes, among other serious diseases. According to the Indonesian Ministry of Health, the top three diabetes microvascular complication diseases are retinopathy, neuropathy and nephropathy [3]. One way to prevent the complications from getting worse is to gain information about their risk factors.
Due to the high mortality and morbidity of diabetes complication diseases, prevention and risk factor prediction have become an important and emerging research subject. Many studies have been conducted to gain knowledge about the risk factors and diagnosis of diabetes and prediabetes. However, few studies have evaluated diabetes complication diseases, especially their risk factors. Consequently, knowledge of diabetes complication diseases continues to be underutilized in disease prevention, and the complications are often only discovered once the disease has already manifested in a harmful condition. Diabetes is also known as the silent killer for this reason.
Diabetes risk factors are divided into two groups, modified and unmodified. Modified risk factors relate to attributes such as race, ethnicity, age, gender, etc. Unmodified risk factors relate to an unhealthy and sedentary lifestyle. In this research, we used attributes from the medical records of Indonesian diabetes patients to find the risk factors for each of the three major diabetes complication diseases in Indonesia.
In this information era, data mining has become an important technique for helping researchers extract knowledge from large and complex data, such as patient medical records. In this research, a dataset of Indonesian diabetic patients was processed with data mining techniques to find rules that help determine the risk factor attributes, and their values, for a potential diabetes complication disease. We used data mining to gain this knowledge from patient medical record attributes covering both modified and unmodified risk factors. Hence, in the architecture of this research approach, an effort was made to find the most suitable data mining technique to generate the rules and to identify the most influential attributes and their values among the modified and unmodified risk factors.
This paper is organized as follows: Section 2 provides the necessary background on data mining and related research on this subject, and explains how the proposed research differs. Section 3 presents the methodology, Section 4 provides the results and discussion, and Section 5 provides the conclusions.
2. Related work
Because each patient has different modified and unmodified risk factors, diabetes complication disease can manifest differently in each diabetes patient. This research focuses on extracting knowledge of the modified and unmodified risk factors for the three major microvascular diabetes complication diseases: Retinopathy (DR), Nephropathy (DN) and Peripheral Neuropathy (DPR). Retinopathy can cause blindness, nephropathy can cause renal failure, and peripheral neuropathy can cause foot ulcers and even lower limb amputation. Since a statistically significant relationship has been reported between diabetes duration and diabetic peripheral neuropathy [4], we also include diabetes duration as a risk factor to be analyzed.
Data mining is becoming an important element for analyzing data and gaining knowledge from it, especially for complex data with many attributes such as medical records. Data mining encompasses varied techniques to form clusters, to classify, and to find associations in structured and unstructured data. The most common data mining techniques used to diagnose diabetes are naive Bayes, decision trees and neural networks [5, 6]. Regarding unmodified risk factors, research on a specific gender and ethnicity becomes important. Sharmila et al. concentrate on gaining insight into big data prediction on an Indian diabetic dataset through Hadoop using the k-means method [7]. The results of these studies aim to construct models for automatic screening systems
for diabetic complication diseases. Models are constructed by selecting a subset of the features based on the character of the data or on the data mining technique with the best accuracy. A case-based reasoning algorithm has been adopted to build a conceptual model for a knowledge management system of diabetes complication disease [8]. For nephropathy among type 2 diabetic patients, a rule-based diagnostic classification using a decision tree algorithm has been built with genetic and clinical features in a gender-specific classification as risk factors [9]. For neuropathy, DuBrava et al. conducted research to identify risk factor variables correlated with a diagnosis of neuropathy in electronic health records of diabetic patients using random forest modeling. That research found that the most correlated variables are age, as a modified risk factor, along with variables from medical treatment and lab data, such as Charlson Comorbidity Index score, number of pre-index procedures and services, number of pre-index outpatient prescriptions, number of pre-index outpatient visits, etc. [10]. Diabetic retinopathy, as a major diabetes complication, has become the most studied field using the data mining approach of image processing. Torok et al. built automatic methods for retinopathy screening using retina photographs and tear fluid proteomics biomarkers of diabetic patients [11]. Previously, Zhang et al. proposed a method using images of tongue color, texture and geometry features of diabetic patients [12]. As opposed to previous similar research, the first aim of our research is to find correlated modified and unmodified risk factors and their significant values in the medical records of Indonesian diabetic patients; the second objective is to group those risk factors under each of the three major microvascular diabetes complication diseases as rules for a screening model using data mining techniques.
To achieve the research objectives, we compare two machine learning tasks, supervised and unsupervised learning. In supervised learning, patterns are learned inductively from labeled training data, here the diabetes complication diseases, in order to find the risk factors and their significant values. Supervised learning is divided into regression and classification. In this research, we build several hypotheses as alternative functions of the diabetic modified and unmodified risk factors to classify patients into the three major diabetes complication diseases. Some of the most common supervised learning techniques are decision trees, rule learning, k-nearest neighbors (k-NN), genetic algorithms, artificial neural networks and support vector machines. In unsupervised learning, hidden patterns in the data are discovered without any corresponding labels; the most common techniques are association and clustering [13]. In this research, the unsupervised learning technique used is clustering, which finds hidden patterns of risk factors by separating the diabetes patient data into groups with similar characteristics.
3. Research approach
In this section we discuss the approach taken to achieve the research objective. The main objective of this research is to construct models through a feature selection process over modified and unmodified diabetes risk factors, to adopt the data mining algorithm with the best accuracy, and to compare data mining learning techniques for diabetes complication diseases. Fig. 1 illustrates the workflow of this research approach. The approach has three main phases: data attribute selection and data mining pre-processing; data mining algorithms and analysis of their evaluation criteria; and generation of rules for the three major microvascular diabetes complication diseases to construct a general model. In the data mining analysis stage, the data undergo clustering and classification. In this research, the data mining analysis was carried out on the diabetes patient medical records with the Waikato Environment for Knowledge Analysis (WEKA) data mining program.
3.1. Data pre-processing methods
The diabetic medical records used in this research were obtained from three sources: the Sri Pamela Hospital and the Kumpulan Pane Hospital, Tebing Tinggi, and the Dolok Community Health Center, North Sumatra, Indonesia. After incomplete data were eliminated through an Extract, Transform, Load (ETL) process, the data set comprises 158 medical records with 15 attributes. After eliminating patient personal information from the 15 attributes, we use 8 attributes as modified and unmodified diabetes risk factors, as shown in Fig. 1.
Fig. 1. Architecture of Research Approach.
Among the unmodified risk factors, the features are gender and family history of diabetes. Among the modified risk factors, the features are the patient's common health indicators, namely BMI, blood pressure, duration of diabetes, blood glucose level and the patient's age. These risk factors were converted into eight features by adding the patient's complication disease, which is used as the target feature. The 8 features were then grouped into ordinal and categorical features according to their data type. The features, their sub-features and their descriptions are given in Table 1.
Table 1. Diabetes Complication Diseases Dataset Description.
| Feature Name | Data Type | Sub Feature | Description |
|---|---|---|---|
| Age | Ordinal | Young Adult | Age 25-35 |
| | | Adult | Age 36-45 |
| | | Young Middle Age | Age 46-55 |
| | | Middle Age | Age 56-65 |
| | | Senior Adult | > 65 years old |
| Gender | Categorical | Male/Female | |
| BMI | Ordinal | Underweight | < 18.5 |
| | | Normal | 18.5 to 24.9 |
| | | Overweight | 25 to 29.9 |
| | | Obese | > 30 |
| Family history of diabetes | Categorical | Yes/No | |
| Blood pressure | Ordinal | Normal | < 120 mm Hg |
| | | Prehypertension | 120 to 139 mm Hg |
| | | Hypertension 1 | 140 to 159 mm Hg |
| | | Hypertension 2 | 160 to 180 mm Hg |
| | | Hypertension Crisis | > 180 mm Hg |
| Duration of Diabetes Sufferers | Ordinal | Short Term | Less than 4 years |
| | | Middle Term | 4 to 10 years |
| | | Long Term | More than 10 years |
| Blood glucose level | Ordinal | Level 1 | < 200 mg/dL |
| | | Level 2 | 200-400 mg/dL |
| | | Level 3 | 401-600 mg/dL |
| | | Level 4 | > 600 mg/dL |
| Diabetes Complication Disease | Categorical | Retinopathy (DR) | Related to eye disorder |
| | | Nephropathy (DN) | Related to renal failure |
| | | Peripheral Neuropathy (DPR) | Related to foot and lower limb disorder |
| | | Other | |
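The binning in Table 1 can be illustrated with a short Python sketch. This is not the authors' WEKA or ETL pipeline; the function names and record keys are hypothetical, a single blood pressure reading in mm Hg is assumed, and the treatment of values that fall exactly on or between the printed bounds (for example a BMI of exactly 30) is an assumption.

```python
# Thresholds transcribed from Table 1. Boundary handling for values that sit
# exactly on or between the printed ranges is an assumption of this sketch.

def age_level(age_years: int) -> str:
    if age_years < 25:
        raise ValueError("Table 1 defines no age group below 25")
    if age_years <= 35:
        return "Young Adult"
    if age_years <= 45:
        return "Adult"
    if age_years <= 55:
        return "Young Middle Age"
    if age_years <= 65:
        return "Middle Age"
    return "Senior Adult"

def bmi_level(bmi: float) -> str:
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Normal"
    if bmi < 30:
        return "Overweight"
    return "Obese"

def blood_pressure_level(pressure_mmhg: float) -> str:
    if pressure_mmhg < 120:
        return "Normal"
    if pressure_mmhg < 140:
        return "Prehypertension"
    if pressure_mmhg < 160:
        return "Hypertension 1"
    if pressure_mmhg <= 180:
        return "Hypertension 2"
    return "Hypertension Crisis"

def duration_level(years: float) -> str:
    if years < 4:
        return "Short"
    if years <= 10:
        return "Middle"
    return "Long"

def glucose_level(mg_per_dl: float) -> str:
    if mg_per_dl < 200:
        return "Level 1"
    if mg_per_dl <= 400:
        return "Level 2"
    if mg_per_dl <= 600:
        return "Level 3"
    return "Level 4"

def discretize(record: dict) -> dict:
    """Return a copy of the record with the ordinal features binned.

    The keys used here are hypothetical; categorical fields pass through.
    """
    out = dict(record)
    out["age"] = age_level(record["age"])
    out["bmi"] = bmi_level(record["bmi"])
    out["blood_pressure"] = blood_pressure_level(record["blood_pressure"])
    out["duration_years"] = duration_level(record["duration_years"])
    out["glucose"] = glucose_level(record["glucose"])
    return out

# Illustrative (made-up) patient record, not taken from the dataset.
patient = {"age": 50, "gender": "Female", "bmi": 27.0,
           "blood_pressure": 185, "duration_years": 10, "glucose": 450}
print(discretize(patient))
```

Writing each feature as its own small function keeps the differing boundary conventions of Table 1 visible, for instance "160 to 180" being inclusive at the top while "25 to 29.9" is not.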
3.2. Data mining analysis and generate rules
In this phase, pattern learning and performance evaluation were carried out using the Waikato Environment for Knowledge Analysis (WEKA) data mining program. For unsupervised learning, we used unlabeled data to build clusters that correlate with each complication disease. To obtain the optimal number of clusters, we implemented the k-means algorithm with the Elbow method, a general method for determining the optimal number of clusters from data. This method starts with k = 2 and gradually increases k by 1 in each training run, evaluating the impact of each k on the performance value. At a certain k, the performance value drops dramatically, and after that it reaches a plateau as k increases further. This k is taken as the optimal number of clusters [14, 15].
In supervised learning, the data were labeled with four complication disease classes, and decision tree algorithms were then used to classify the risk factors for each class of complication disease. To generate a decision tree model that classifies diabetes complication disease, we use Naive Bayes Tree and C4.5. The WEKA classifier package has its own version of C4.5 known as J4.8 (the J48 classifier). The accuracy of class prediction depends on the most suitable algorithm used, whether it is clustering or classification. The performance of a classification technique is evaluated by the percentage of training instances that were correctly classified for each complication disease, using the diseases as labels. The percentage of correctly classified instances is often called accuracy. The outcome of this data analysis phase is the feature selection for each diabetes complication disease, based on its accuracy. Finally, in the last phase, we generated rules from the feature selection of the previous phase to form the prediction model.
4. Result and discussion
In this section we discuss the results of the data mining analysis and of rule generation, following the research architecture shown in Fig. 1. We analyze, learn and find knowledge from the Indonesian diabetic dataset in order to:
- find the relationship between the diabetes health risk features and their sub-features;
- find a possible number of clusters from the diabetes health risk features;
- classify the selected diabetes health risk features and their sub-features into each complication disease class;
- generate rules and construct the prediction model.
4.1. k-Means clustering
In this research, clustering is used to group data into several classes according to their common characteristics. In this first step, experiments were carried out by changing the number of clusters in k-means in order to obtain the most suitable grouping. Table 2 compares the clustering results for several predefined numbers of clusters.
Table 2. Comparison result of clustering method for Diabetes Complication Diseases.
| No. of clusters | Clustered instances | No. of iterations | Sum of squared errors | Difference value |
|---|---|---|---|---|
| 2 | 46 (51%) & 44 (49%) | 3 | 205 | 205 |
| 4 | 28 (31%), 39 (43%), 12 (13%) & 11 (12%) | 3 | 174 | 30 |
| 5 | 23 (26%), 37 (41%), 12 (13%), 9 (10%) & 9 (10%) | 4 | 158 | 16 |
| 6 | 19 (21%), 34 (38%), 12 (13%), 9 (10%), 10 (11%) & 6 (7%) | 5 | 150 | 8 |
| 7 | 19 (21%), 35 (39%), 10 (11%), 9 (10%), 7 (8%), 8 (9%) & 2 (2%) | 4 | 140 | 10 |
| 8 | 16 (18%), 35 (39%), 10 (11%), 9 (10%), 7 (8%), 8 (9%), 2 (2%) & 5 (6%) | 3 | 128 | 12 |
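The difference column of Table 2 can be recomputed from the reported sums of squared errors with a few lines of Python. This is only a sketch of one simple reading of the Elbow heuristic, namely picking the k at which the improvement over the previous reported k is smallest; that selection rule is an assumption, not necessarily the authors' exact procedure. Note that Table 2 does not report k = 3, so the first drop spans two steps, and that the subtraction gives 31 where the table lists 30, presumably because the printed values are rounded.

```python
# Sum of squared errors per number of clusters, transcribed from Table 2.
sse_by_k = {2: 205, 4: 174, 5: 158, 6: 150, 7: 140, 8: 128}

def sse_drops(sse):
    """Map each k (except the smallest) to the SSE reduction from the previous k."""
    ks = sorted(sse)
    return {k: sse[prev] - sse[k] for prev, k in zip(ks, ks[1:])}

drops = sse_drops(sse_by_k)

# Assumed selection rule: the k whose improvement over the previous k is smallest.
flattest_k = min(drops, key=drops.get)

for k in sorted(drops):
    print(f"k={k}: SSE drop {drops[k]}")
print("k with the smallest improvement:", flattest_k)
```

Under this rule the smallest improvement occurs at k = 6, which matches the number of clusters adopted in the discussion of Table 2.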
As shown in Table 2, by adopting the Elbow method, we could conclude that only k = 6 exhibits similar cluster distributions of characteristics. In this clustering, patients were characterized as middle aged, with blood pressure above 140 mm Hg, BMI in the overweight range, diagnosed with diabetes for between 4 and 10 years, and with no family history of diabetes. Unfortunately, further analysis of these clusters showed that we cannot gain knowledge about the desired classes of major microvascular complications, namely retinopathy, neuropathy and nephropathy, because they overlap.
Table 3. Comparison result of classification method for Diabetes Complication Diseases. Values are the correctness (%) for each complication disease.
| Classifier | Feature selection | Retinopathy | Neuropathy | Nephropathy | Others |
|---|---|---|---|---|---|
| NB | Gender, Family history of diabetes | 81 | 86 | 41 | 73 |
| C4.5 | Age, BMI, Blood pressure, Diabetes duration, Blood glucose level | 81 | 90 | 88 | 90 |
| C4.5 | Age, BMI, Blood pressure, Diabetes duration | 76 | 28 | 41 | 24 |
| C4.5 | Age, BMI, Diabetes duration, Blood glucose level | 81 | 90 | 88 | 90 |
| C4.5 | Age, Diabetes duration | 48 | 48 | 71 | 69 |
| C4.5 | Age, Blood glucose level | 81 | 90 | 88 | 90 |
| C4.5 | BMI, Diabetes duration | 33 | 72 | 53 | 51 |
| C4.5 | BMI, Blood glucose level | 81 | 90 | 88 | 90 |
| C4.5 | Blood pressure, Diabetes duration | 76 | 41 | 35 | 22 |
| C4.5 | Blood pressure, Blood glucose level | 81 | 90 | 88 | 69 |
| C4.5 | Family history of diabetes, Blood glucose level | 81 | 90 | 88 | 69 |
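The correctness percentages in Table 3 are per-class accuracies: for each complication disease, the share of its instances that the classifier labeled correctly. A minimal Python sketch of that calculation follows. The label lists are made-up illustrative values, not the study's data, and the function names are hypothetical.

```python
from collections import Counter

def per_class_accuracy(actual, predicted):
    """Percentage of instances of each actual class that were predicted correctly."""
    totals = Counter(actual)
    hits = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {label: 100.0 * hits[label] / totals[label] for label in totals}

def overall_accuracy(actual, predicted):
    """Percentage of all instances that were predicted correctly."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * correct / len(actual)

# Hypothetical labels for six patients (illustration only).
actual = ["retinopathy", "retinopathy", "neuropathy",
          "neuropathy", "nephropathy", "other"]
predicted = ["retinopathy", "neuropathy", "neuropathy",
             "neuropathy", "nephropathy", "nephropathy"]

by_class = per_class_accuracy(actual, predicted)
overall = overall_accuracy(actual, predicted)
print(by_class)
print(round(overall, 1))
```

Reporting both figures is useful because a high overall accuracy can hide a class that is rarely predicted correctly, as the spread between columns in Table 3 suggests.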
4.2. Classification method
For classification, we adopt Naive Bayes (NB) for the feature selection of the categorical data, namely gender and family history of diabetes. The Naive Bayes algorithm builds a probabilistic model by learning the conditional probabilities of each input attribute given each possible value of the output attribute. We then adopt C4.5 for the rest of the feature selection. We chose the C4.5 algorithm because it supports ordinal data and because its decision tree construction rules are relatively easy to understand. The rules are built from the most prominent correctness percentage of each feature selection for each complication disease class. As shown in Table 3, using C4.5, the features and sub-features that characterize each complication disease in the given dataset are:
- Characteristics of retinopathy patients: no family history of diabetes (81%), hypertension crisis (76%), diabetes duration middle (50%) and long term (50%), blood glucose level between 200-400 mg/dL (71%), normal BMI (42%).
- Characteristics of neuropathy patients: no family history of diabetes (86%), blood glucose level between 200-400 mg/dL (62%), diabetes duration of more than 10 years (41%), BMI of more than 25 (41%).
- Characteristics of nephropathy patients: diabetes duration of more than 10 years (87%), with BMI in the normal range (52%).
- Characteristics of other diseases: no family history of diabetes (72%), blood glucose level between 200-400 mg/dL (62%), diabetes duration between 4 and 10 years (51%).
4.3. Generate rules
In this stage we generated rules for each class of major microvascular diabetes complication disease based on the previous data mining analysis stage.
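One ingredient of that analysis stage, the Naive Bayes step of Section 4.2, can be sketched in plain Python to show what "learning the conditional probabilities of each input attribute given the output attribute" means for categorical features. This is a plain categorical Naive Bayes with add-one (Laplace) smoothing, which is an assumption of this sketch; it is neither WEKA's implementation nor the Naive Bayes Tree variant, and the four training rows are invented for illustration.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count classes and (feature, class) value frequencies from categorical data."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature, label) -> counts of values
    values_seen = defaultdict(set)        # feature -> distinct values observed
    for row, label in zip(rows, labels):
        for feature, value in row.items():
            value_counts[(feature, label)][value] += 1
            values_seen[feature].add(value)
    return class_counts, value_counts, values_seen

def posterior(model, row):
    """Return normalized P(class | row) under the naive independence assumption."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total  # prior P(class)
        for feature, value in row.items():
            count = value_counts[(feature, label)][value]
            # Add-one smoothed estimate of P(value | class); smoothing is an assumption.
            score *= (count + 1) / (n_label + len(values_seen[feature]))
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

# Invented training rows using the two categorical features of the study.
rows = [
    {"gender": "Female", "family_history": "No"},
    {"gender": "Female", "family_history": "No"},
    {"gender": "Male", "family_history": "Yes"},
    {"gender": "Male", "family_history": "No"},
]
labels = ["retinopathy", "retinopathy", "nephropathy", "nephropathy"]

model = train_naive_bayes(rows, labels)
probs = posterior(model, {"gender": "Female", "family_history": "No"})
print(max(probs, key=probs.get), probs)
```

With these invented rows, the query for a female patient without family history favors retinopathy, because both of her attribute values are more frequent in that class than in the other.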
As explained in the clustering stage, the optimal number of clusters is 6. However, as depicted in Fig. 2, although the 6 clusters yield knowledge about feature and sub-feature selection, they cannot be grouped into the three major microvascular diabetes complication diseases because the labelled instances overlap. Incorrectly labelled instances are eliminated using k-means clustering, followed by classification with the C4.5 decision tree classifier and the NB decision tree. This results in the proposed cascaded prediction model:
If Female & Blood pressure = Hypertension Crisis then Class => Retinopathy
If Female & BMI = Overweight or Obese then Class => Neuropathy
If Duration of Diabetes Sufferers = Middle or Long Term then Class => Nephropathy
To evaluate the performance of the proposed model, we use accuracy as the measurement metric. The overall accuracy of the prediction model of diabetes complication disease is 68%, with the highest accuracy in predicting retinopathy (see Fig. 2).
Fig. 2. Visualization of Feature Selection with 6 Clusters.
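The cascaded rules of Section 4.3 can be written as a small Python function. The sketch assumes that the rules are evaluated top to bottom in the order listed and that a patient matching none of them falls into the "Other" group; both points, as well as the function and argument names, are assumptions rather than details stated in the paper. Inputs are the sub-feature labels of Table 1.

```python
def predict_complication(gender, blood_pressure, bmi, duration):
    """Apply the three rules in listed order; the first match wins (assumed)."""
    if gender == "Female" and blood_pressure == "Hypertension Crisis":
        return "Retinopathy"
    if gender == "Female" and bmi in ("Overweight", "Obese"):
        return "Neuropathy"
    if duration in ("Middle", "Long"):
        return "Nephropathy"
    # Fallback class is an assumption; the listed rules do not define one.
    return "Other"

# Illustrative calls with made-up patients.
print(predict_complication("Female", "Hypertension Crisis", "Normal", "Short"))
print(predict_complication("Female", "Normal", "Obese", "Long"))
print(predict_complication("Male", "Prehypertension", "Normal", "Middle"))
print(predict_complication("Male", "Normal", "Normal", "Short"))
```

The order matters: a female patient who is overweight and has had diabetes for more than 10 years matches both the second and third rule, and under the assumed ordering she is assigned to neuropathy.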
5. Conclusion
Clustering and classification data mining techniques and their algorithms were studied to build a prediction model of diabetes complication disease. The model generates rules that sort diabetic medical data into four groups: nephropathy, retinopathy, neuropathy and mixed complications (other). To build the most suitable rule-based model for prediction, we evaluated the performance of the clustering and classification techniques. Compared to clustering, classification gives better information and performance, and can classify features and sub-features into the three major microvascular diabetes complication diseases. From the data mining analysis, we can conclude the most influential risk factor for each diabetes complication disease. Although blood glucose level and duration of diabetes both lead to complication disease, their effect is most prominent for nephropathy. The analysis also indicates that glucose level and genetics (family history of diabetes) do not influence a specific diabetes complication. The most common risk factor for retinopathy is blood pressure in the hypertension crisis range. For nephropathy, the most prominent risk factor is the duration of diabetes, especially more than 10 years. Diabetes patients who are overweight or obese have a higher risk of neuropathy. The accuracy of the proposed model is 68%, with the highest accuracy for the retinopathy prediction model. This means the model could be implemented in an automatic system, although more effort is needed to increase the current accuracy level. Future work is to implement the prediction model in an automatic pre-diagnosis system that helps diabetes patients identify each risk factor of the complication disease. To enhance the prediction model, more diabetes medical records are needed, especially samples from all regions of Indonesia. Other classification algorithms, such as logistic model tree, Random Forest and Random Tree, could also be used to gain better performance and thus a more reliable rule-based model.
References
[1] Shaw, J. E., R. A. Sicree, and P. Z. Zimmet. (2010) “Global Estimates of the Prevalence of Diabetes for 2010 and 2030.” Diabetes Res Clin Pract 87: 4–14.
[2] Health Research and Development Division of Ministry of Health Republic of Indonesia. (2018) Basic Health Research Survey.
[3] Data and Information Center of Ministry of Health Republic of Indonesia. (2014) Analysis and Situation of Diabetes.
[4] Tarigan, Tri J.E., E. Yunir, I. Subekti, A. Laurentius, A. Pramono, and Diah Martina. (2015) “Profile and Analysis of Diabetes Chronic Complications in Outpatient Diabetes Clinic of Cipto Mangunkusumo Hospital, Jakarta.” Medical Journal of Indonesia.
[5] Harleen, Bhambri. (2016) “A Prediction Technique in Data Mining for Diabetes Mellitus.” Journal of Management Sciences and Technology 4 (1).
[6] Misra. (2007) “Simplified Polynomial Neural Network for Classification Task in Data Mining”, in International Conf. on Evolutionary Computation, 721–728.
[7] Sharmila, K., and S. A. Vetha Manickam. (2016) “Diagnosing Diabetic Dataset using Hadoop and K-means Clustering Techniques.” Indian Journal of Science and Technology 9 (40).
[8] Fiarni, C. (2016) “Design of Knowledge Management System for Diabetic Complication Diseases”, in International Conference on Computing and Applied Informatics, IOP Publishing.
[9] Huang, G-M., K-Y. Huang, T-Y. Lee, and J. Weng. (2015) “An Interpretable Rule-Based Diagnostic Classification of Diabetic Nephropathy Among Type 2 Diabetes Patients.” BMC Bioinformatics 16 (S-1): S5.
[10] DuBrava, S., J. Mardekian, A. Sadosky, E. J. Bienen, B. Parsons, M. Hopps, and J. Markman. (2017) “Using Random Forest Models to Identify Correlates of a Diabetic Peripheral Neuropathy Diagnosis from Electronic Health Record Data.” Pain Medicine 18: 107–115.
[11] Torok, Zsolt, Tunde Peto, Eva Csosz, Edit Tukacs, Agnes M. Molnar, Andras Berta, Jozsef Tozser, Andras Hajdu, Valeria Nagy, Balint Domokos, and Adrienne Csutak. (2015) “Combined Methods for Diabetic Retinopathy Screening Using Retina Photographs and Tear Fluid Proteomics Biomarkers.” Journal of Diabetes Research: 1–9.
[12] Zhang, Bob, B.V.K. Vijay Kumar, and David Zhang. (2014) “Detecting Diabetes Mellitus and Nonproliferative Diabetic Retinopathy Using Tongue Color, Texture and Geometry Features.” IEEE Transactions on Biomedical Engineering 61 (2): 491–501.
[13] Kavakiotis, I., O. Tsave, A. Salifoglou, N. Maglaveras, I. Vlahavas, and I. Chouvarda. (2017) “Machine Learning and Data Mining Methods in Diabetes Research.” Computational and Structural Biotechnology Journal 15: 104–116.
[14] Coates, Adam, and Andrew Y. Ng. (2012) “Learning Feature Representations with K-means.” Neural Networks: Tricks of the Trade, 2nd edn, Springer.
[15] Kodinariya, T. M., and P. R. Makwana. (2013) “Review on Determining Number of Cluster in K-Means Clustering.” International Journal of Advance Research in Computer Science and Management Studies 1 (6): 90–95.