Available online at www.sciencedirect.com
ScienceDirect
Procedia Computer Science 159 (2019) 2172–2178
www.elsevier.com/locate/procedia
23rd International Conference on Knowledge-Based and Intelligent Information & Engineering Systems
Tree-based generational feature selection in medical applications
Wiesław Paja*
Faculty of Mathematics and Natural Sciences, University of Rzeszów, Pigonia Str. 1, 35-310 Rzeszów, Poland
Abstract

In many knowledge discovery experiments, feature selection is an obvious initial step. In this paper, an approach to tree-based generational feature selection in medical data analysis is presented. The approach applies a classification tree algorithm to estimate the importance of attributes, extracted from the structure of the tree, with a recursive application of generational feature selection. The method removes the selected features from the dataset and then creates the next generation of the important feature set. The process continues until the most important feature is a random (contrast) one. The implemented method was applied to three artificial and real-world medical datasets, and the results of selection and classification are presented. The models were mostly more efficient after selection than those built on the original datasets.

© 2019 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of KES International.

Keywords: feature selection; feature ranking; dimensionality reduction; relevance and irrelevance; generational feature selection
1. Introduction

According to many publications, the feature selection process has proven to be an effective initial step in preparing data, mostly in the domain of high-dimensional data, for various data mining and machine learning problems [1, 2, 3, 6, 7, 8]. The main objectives of feature selection are to build simpler and more efficient learning models [4, 5], to improve learning performance, and to prepare clear, understandable data. Data mining is impeded by many factors, such as a very large number of cases and variables, the irrelevance of part of the variables for the recognition process, internal dependence between conditional variables, the simultaneous presence of variables of different types, the presence of undefined or erroneous values of variables, and imbalanced categories. Thus, efficient feature selection techniques are important. These kinds of methods are often used as a preprocessing step before or during machine learning experiments [1]. Feature selection can be defined as a process of identification/selection of a subset of the original attributes so that the feature space is reduced according to a defined evaluation criterion.
* Corresponding author. Tel.: +0-000-000-0000; fax: +0-000-000-0000. E-mail address: [email protected]
Feature selection methods are often categorized into three groups based on how they combine the identification algorithm and the model building: filter, wrapper and embedded methods. Filter methods select features independently of the model. They take into consideration only general properties, such as the correlation with the variable to predict. These methods identify only the most relevant set of features. They tend to be efficient in computation time and robust to overfitting. Nevertheless, redundant, but relevant, features may not be recognized. In turn, wrapper methods evaluate subsets of features, which allows some dependencies between features to be identified. However, when the number of cases is insufficient, the risk of overfitting increases. Additionally, the computation time grows significantly when the set of variables is large. Finally, embedded methods build feature selection into the learning itself: the learning algorithm takes advantage of its own variable selection mechanism. Thus, it needs to know beforehand what a good selection is, which limits its exploitation. During the selection process two main objectives can be defined [2]:

• Minimal Optimal Feature Selection (MOFS) - the goal is to identify the minimal subset of features with the best classification quality;
• All Relevant Feature Selection (ARFS) - where the main objective is to find all relevant features, even those with minimal relevance [3].

Here, an algorithm of the ARFS kind is presented; during this process, the set of all relevant features is to be identified.

2. Generational feature selection algorithm

The presented algorithm calculates attribute importance based on a developed decision tree [10, 11]. The main idea of the algorithm is formally described as Algorithm 1. As an input, the decision system S = (U, A ∪ {d}) is considered. Then, the machine learning algorithm MLA applied during variable importance estimation is introduced; here the random forest algorithm is utilized. The output has the form of the selected relevant subset FS. The algorithm iteratively generates a learning model, and each iteration can therefore be called a "generation". After the first generation (iteration), the selected important features are separated from the rest of the dataset. The next generation is gathered based on the remaining data, and so on. In each iteration the contrast features A_CONT are generated (contrastFeatures) from the current set A_CURR (by random permutation of attribute values) and merged with it, creating the extended set A_EXT. Using this set, the machine learning algorithm MLA is applied and the learning model m is developed (generateModel). Here, the classification tree algorithm is used, the set of importance values M for each feature of the learning model m is estimated (modelMeasure), and a ranking algorithm is applied (rankingAlgorithm) to obtain the ranking L. In this research the importance value is estimated based on the tree structure (see Section 2.1). Then, the contrast feature with the highest ranking value (maxCFRank) is identified. Next, the set of relevant features FS, whose importance values l are higher than maxCFRank, is separated out of the currently investigated set A_CURR. If FS is empty, the algorithm stops. Finally, FS is gathered by removing the irrelevant features from the original set A.

The importance of attributes depends on their presence in the tree structure [16]. In each node of the investigated tree there occurs one attribute which identifies a certain number of learning objects.
Considering that a node located at a higher level of the tree identifies more objects and has a greater information gain, it can be assumed that the validity of attributes in the upper nodes of the tree is greater than in the lower nodes. In this way we can define the importance measure (see Eq. 1):

\mathrm{Importance}(a) = \sum_{j=1}^{N} \sum_{node=1}^{l} w_j \cdot \nu(node) \cdot \omega(a). \qquad (1)
Here ν(node) is the number of objects identified in a given tree node at level j where the investigated attribute a is tested, N is the number of levels of the tree, and ω(a) describes the occurrence of the attribute a: 1 if the attribute occurs in the node and 0 if it does not. Additionally, the weighting factor w_j of level j is defined as in Eq. 2:

w_j = \begin{cases} 1 & \text{if } j = 1, \\ w_{j-1} \cdot \delta & \text{if } 1 < j \le l \text{ and } \delta \in (0, 1). \end{cases} \qquad (2)
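For illustration, the measure of Eqs. (1)-(2) can be computed from a simple tabular description of the tree, i.e. (attribute, level, number of objects) records as in Table 1 below. The following is only a minimal sketch; the record format and the function names are assumptions introduced for this example, not part of the original implementation.

```python
# Minimal sketch of the importance measure of Eqs. (1)-(2).
# Assumption: the tree is given as records (attribute, level, n_objects),
# i.e. the tabular representation used in Table 1; delta is the level decay factor.

def level_weights(n_levels, delta=0.8):
    """Weights w_j from Eq. (2): w_1 = 1, w_j = w_{j-1} * delta."""
    weights = [1.0]
    for _ in range(1, n_levels):
        weights.append(weights[-1] * delta)
    return weights  # weights[j-1] corresponds to level j

def tree_importance(records, delta=0.8):
    """Importance(a) from Eq. (1): sum of w_j * nu(node) over the nodes where a occurs.

    The occurrence indicator omega(a) is handled implicitly, since only nodes
    that actually test attribute a appear in the record list.
    """
    max_level = max(level for _, level, _ in records)
    w = level_weights(max_level, delta)
    importance = {}
    for attribute, level, n_objects in records:
        importance[attribute] = importance.get(attribute, 0.0) + w[level - 1] * n_objects
    return importance
```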
Algorithm 1: Generational Feature Selection

Input : S = (U, A ∪ {d}) - a decision system; MLA - an applied machine learning algorithm.
Output: FS - a selected feature subset.

Function GFS(S, MLA)
    A_CURR ← A
    FS ← ∅
    x ← 0
    while x = 0 do
        A_CONT ← contrastFeatures(A_CURR)
        A_EXT ← A_CURR ∪ A_CONT
        S' ← (U, A_EXT ∪ {d})
        m ← generateModel(S', MLA)
        M ← modelMeasure(m)
        L ← rankingAlgorithm(A_EXT, M)
        maxCFRank ← max(L(a : a ∈ A_CONT))
        FS ← ∅
        for each l ∈ L(a : a ∈ A_CURR) do
            if l > maxCFRank then
                FS ← FS ∪ {a(l)}
            end
        end
        A_CURR ← A_CURR \ FS
        if FS = ∅ then
            x++
        end
    end
    FS ← A \ A_CURR
    return FS
end
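To make the loop of Algorithm 1 concrete, a minimal Python sketch is given below. It is not the author's original implementation: scikit-learn's impurity-based feature_importances_ of a classification tree is used as a stand-in for the tree-structure measure of Eq. (1), and the function and column names are assumptions made only for this illustration.

```python
# Minimal sketch of Algorithm 1 (generational feature selection).
# Assumptions: X is a pandas DataFrame of numeric attributes, y the decision vector;
# impurity-based importances stand in for the tree-structure measure of Eq. (1).
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def gfs(X: pd.DataFrame, y, random_state=0):
    rng = np.random.default_rng(random_state)
    current = list(X.columns)      # A_CURR
    selected = []                  # accumulated relevant features (A \ A_CURR)
    while current:
        # Contrast features A_CONT: random permutations of the current attributes.
        contrast = pd.DataFrame(
            {f"contrast_{c}": rng.permutation(X[c].to_numpy()) for c in current},
            index=X.index)
        extended = pd.concat([X[current], contrast], axis=1)   # A_EXT

        # Build the learning model m and estimate the importance values M.
        model = DecisionTreeClassifier(random_state=random_state).fit(extended, y)
        importance = dict(zip(extended.columns, model.feature_importances_))

        # maxCFRank: the best-ranked contrast feature.
        max_cf_rank = max(importance[c] for c in contrast.columns)
        generation = [a for a in current if importance[a] > max_cf_rank]
        if not generation:         # no original feature beats a contrast one -> stop
            break
        selected.extend(generation)
        current = [a for a in current if a not in generation]
    return selected
```

On a dataset such as Pima Indians Diabetes, this loop would peel off one generation of variables per iteration until a contrast column tops the ranking, as in the example of Section 2.1.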
2.1. An example of the GFS algorithm application

An example of a classification tree developed using the Pima Indians Diabetes dataset [12] is presented in Fig. 1. The tabular view of this tree, with the corresponding number of objects recognized in the nodes, the tree level and the expected class, is presented in Table 1. The first level (the root) contains the glucose variable and identifies 768 objects. Here, the δ parameter value was assumed to be 0.8, and the weight factor value for each level, defined according to Eq. 2 with δ = 0.8, is given in Table 2. Thus, using Eq. 1 we can calculate the importance value for each variable. For example, the glucose variable is present in levels 1, 3 and 4, so according to Eq. 3 its importance is equal to 858.7873.

Importance(glucose) = (768 · 1) + (173 · 0.4096) + (76 · 0.2621) ≈ 768 + 70.86 + 19.92 ≈ 858.7873   (3)
Similarly, the rest of the importances in the first iteration (generation) can be calculated, see Table 3. Here, we can see that the relevant variables are glucose, age, mass and pedigree. The lowest importance value is calculated for the artificial contrast variable contrast.1, which serves as a kind of threshold. The four identified relevant variables are therefore removed from the dataset, and the next iteration (generation) is conducted without them, see Algorithm 1. The summary of variable extraction over the generations is presented in Table 4. In the first generation four variables are selected and removed, in the second generation the next three variables, in the third generation one variable, and finally in the fourth generation the most important recognized variable is contrast.3, an artificial shadow variable. Using a 10-fold cross-validation process we can check how many times each variable is recognized as relevant by the GFS algorithm; for our example this is presented in Table 5.
Fig. 1. Example of classification tree.

Table 1. Tabular representation of the classification tree from Fig. 1.

Variable      # of objects   Level   Expected class
glucose       768            1       1
age           485            2       1
mass          283            2       2
mass          214            3       1
glucose       173            3       1
pedigree      118            4       1
glucose       76             4       1
contrast.1    35             5       2
The main conclusion for this example is that all eight variables in the Pima dataset are recognized as a relevant set of variables; only the pressure variable was not identified in every validation fold. A threshold on the number of occurrences could also be defined to select relevant features: with a threshold of more than 50% of occurrences, all 8 variables are treated as relevant (see the short sketch after Table 5).

3. Results and conclusions

During the experiments three datasets with different characteristics were used. Two of them are medical datasets. The Colon dataset (I2000) [14] contains the expression of the 2000 genes with the highest minimal intensity across the 62 tissues. The Lung dataset [13] contains 73 objects described by 325 features, categorized into two classes: positive are the examples with adenocarcinoma and negative the examples with squamous cancer. The last dataset is an artificial one, which was one of the Neural Information Processing Systems challenge problems in 2003 (called NIPS2003) [15]. It contains 2600 objects assigned to one of two classes.
Table 2. Weight vector values for each tree level.

Level   Weight
1       1.00
2       0.80
3       0.64
4       0.512
5       0.4096
6       0.3277
7       0.2621
8       0.2097
Table 3. Importance of variables identified in the first iteration (generation).

Variable      Importance
glucose       858.7873
age           388.0000
mass          290.6880
pedigree      38.6662
contrast.1    7.3400
Table 4. Variable importance extracted during the sequence of iterations (generations).

Iteration 1               Iteration 2               Iteration 3               Iteration 4
Variable    Importance    Variable    Importance    Variable    Importance    Variable    Importance
glucose     858.7873      insulin     776.4223      pressure    696.5844      contrast.3  1042.2677
age         388.0000      pregnant    691.8796      contrast.2  343.8924
mass        290.6880      triceps     154.0137
pedigree    38.6662       contrast.4  97.8091
contrast.1  7.3400
The objects are described by 500 variables: 5 of them are truly relevant, another 15 variables are random linear combinations of the first 5, and the rest of the data is uniform random noise. The goal is to select the 20 important variables from the system without selecting false attributes.

The GFS algorithm was applied and then three classification methods were involved: Naive Bayes (NB), decision tree (DT) and random forest (RF) algorithms were used to develop the learning models. The results of classification using the datasets with all original features and with the selected sets of features are presented in Table 6. The average classification accuracy is better for all datasets after their dimensionality reduction during the feature selection process (see Table 6). Additionally, the effectiveness of the GFS algorithm was confirmed by the Madelon dataset analysis: almost all of the truly relevant features (19 of the 20) were recognized, while the average accuracy increased from 0.68 to 0.74; only one relevant feature was not identified as important. In the case of the Colon dataset, the reduction of relevant features is spectacular: from the expression of the 2000 genes only 3 were indicated as relevant in all folds, together with an increase of the average accuracy (AVG) from 0.69 to 0.84. In turn, in the case of the Lung dataset, 204 features were selected from the 325 original features and the average accuracy increased from 0.80 to 0.82. The proposed procedure thus seems to be efficient in the all-relevant category of feature selection.
Table 5. Number of occurrences of variables during 10-fold cross-validation.

Variable    Number of occurrences
glucose     10
age         10
mass        10
pedigree    10
insulin     10
pregnant    10
triceps     10
pressure    9
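The greater-than-50% occurrence threshold mentioned in Section 2.1 can be applied directly to such fold counts. The sketch below uses the counts of Table 5; the dictionary name and helper function are assumptions introduced only for this illustration.

```python
# Applying the occurrence threshold from Section 2.1 to the fold counts of Table 5.
# A feature is kept if it was selected in more than 50% of the 10 cross-validation folds.
occurrences = {"glucose": 10, "age": 10, "mass": 10, "pedigree": 10,
               "insulin": 10, "pregnant": 10, "triceps": 10, "pressure": 9}

def stable_features(counts, n_folds=10, threshold=0.5):
    return [f for f, c in counts.items() if c / n_folds > threshold]

print(stable_features(occurrences))  # all eight Pima variables pass the 50% threshold
```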
Table 6. Classification accuracy before and after feature selection.

Dataset    Feature set                       Model   Accuracy
Colon      All features, 2000 features       NB      0.61
                                             DT      0.61
                                             RF      0.84
                                             AVG     0.69
Colon      Selected features, 3 features     NB      0.92
                                             DT      0.77
                                             RF      0.85
                                             AVG     0.84
Lung       All features, 325 features        NB      0.80
                                             DT      0.67
                                             RF      0.93
                                             AVG     0.80
Lung       Selected features, 204 features   NB      0.93
                                             DT      0.67
                                             RF      0.87
                                             AVG     0.82
Madelon    All features, 500 features        NB      0.61
                                             DT      0.75
                                             RF      0.69
                                             AVG     0.68
Madelon    Selected features, 19 features    NB      0.60
                                             DT      0.75
                                             RF      0.88
                                             AVG     0.74
The initial results presented here are promising and could be further developed by including an influence parameter connected with the generation in which a feature is discovered. For example, the subset of relevant features identified in the first generation/iteration could be given a higher degree of importance than a subset identified in, for example, the fourth iteration.
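To make the evaluation protocol behind Table 6 concrete, a minimal sketch of a before/after-selection comparison is given below. It is not the author's original experimental code: scikit-learn estimators stand in for the NB/DT/RF models named above, and the gfs function refers to the sketch shown after Algorithm 1.

```python
# Minimal sketch of the before/after-selection comparison reported in Table 6.
# Assumptions: X is a pandas DataFrame, y the labels; `gfs` is the generational
# selection sketch shown after Algorithm 1; selection is repeated inside each fold.
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def compare_before_after(X, y, n_folds=10):
    models = {"NB": GaussianNB(),
              "DT": DecisionTreeClassifier(random_state=0),
              "RF": RandomForestClassifier(random_state=0)}
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    scores = {name: {"all": [], "selected": []} for name in models}
    y = np.asarray(y)
    for train_idx, test_idx in skf.split(X, y):
        X_tr, X_te = X.iloc[train_idx], X.iloc[test_idx]
        y_tr, y_te = y[train_idx], y[test_idx]
        # Run feature selection on the training part of the fold only;
        # fall back to all features if nothing beats the contrast features.
        selected = gfs(X_tr, y_tr) or list(X_tr.columns)
        for name, model in models.items():
            m_all = clone(model).fit(X_tr, y_tr)
            scores[name]["all"].append(m_all.score(X_te, y_te))
            m_sel = clone(model).fit(X_tr[selected], y_tr)
            scores[name]["selected"].append(m_sel.score(X_te[selected], y_te))
    # Mean accuracy per model, before and after selection.
    return {name: (np.mean(s["all"]), np.mean(s["selected"])) for name, s in scores.items()}
```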
Acknowledgements

This work was supported by the Center for Innovation and Transfer of Natural Sciences and Engineering Knowledge at the University of Rzeszów.

References

[1] Kuhn, M., Johnson, K. (2013) Applied Predictive Modeling, Springer, New York, pp. 487-500.
[2] Bermingham, M.L., Pong-Wong, R., Spiliopoulou, A., Hayward, C., Rudan, I., Campbell, H., Wright, A.F., Wilson, J.F., Agakov, F., Navarro, P., Haley, C.S. (2015) Application of high-dimensional feature selection: evaluation for genomic prediction in man. Scientific Reports 5.
[3] Rudnicki, W.R., Wrzesień, M., Paja, W. (2015) All relevant feature selection methods and applications. In: U. Stańczyk, L. Jain (Eds.), Feature Selection for Data and Pattern Recognition, Studies in Computational Intelligence, vol. 584, Springer-Verlag, Germany, pp. 11-28.
[4] Nilsson, R., Peña, J.M., Björkegren, J., Tegnér, J. (2007) Detecting multivariate differentially expressed genes. BMC Bioinformatics 8, 150.
[5] Różycki, P., Kolbusz, J., Bartczak, T., Wilamowski, B. (2015) Using Parity-N Problems as a Way to Compare Abilities of Shallow, Very Shallow and Very Deep Architectures. Lecture Notes in Computer Science, vol. 9119, ICAISC 2015, Zakopane, pp. 112-122.
[6] Solorio-Fernández, S., et al. (2019) A review of unsupervised feature selection methods. Artificial Intelligence Review, 1-42.
[7] Masoudi-Sobhanzadeh, Y., et al. (2019) FeatureSelect: a software for feature selection based on machine learning approaches. BMC Bioinformatics 20:170.
[8] Deraeve, J., Alexander, W.H. (2018) Fast, Accurate, and Stable Feature Selection Using Neural Networks. Neuroinformatics 16, 253-268.
[9] Li, J., et al. (2017) Feature Selection: A Data Perspective. ACM Computing Surveys 50, 94:1-94:45.
[10] Paja, W. (2015) Medical diagnosis support and accuracy improvement by application of total scoring from feature selection approach. In: Proceedings of the 2015 Federated Conference on Computer Science and Information Systems (FedCSIS 2015), Annals of Computer Science and Information Systems, pp. 281-286.
[11] Paja, W. (2016) Feature Selection Methods Based on Decision Rule and Tree Models. In: Czarnowski, I., Caballero, A., Howlett, R., Jain, L. (eds) Intelligent Decision Technologies 2016. Smart Innovation, Systems and Technologies, vol. 57. Springer, Cham.
[12] Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., Johannes, R.S. (1988) Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In: Proceedings of the Symposium on Computer Applications and Medical Care, pp. 261-265. IEEE Computer Society Press.
[13] Peng, H., Long, F., Ding, C. (2005) Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(8), 1226-1238.
[14] Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J. (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences USA 96, 6745-6750. The data set is available at http://microarray.princeton.edu/oncology.
[15] Guyon, I., Gunn, S., Ben-Hur, A., Dror, G. (2005) Result Analysis of the NIPS 2003 Feature Selection Challenge. Advances in Neural Information Processing Systems 17, 545-552.
[16] Paja, W. (2017) Generational Feature Elimination to Find All Relevant Feature Subset. In: Czarnowski, I., Howlett, R., Jain, L. (eds) Intelligent Decision Technologies 2017. IDT 2017. Smart Innovation, Systems and Technologies, vol. 72. Springer, Cham, pp. 140-148.