Comparing Fitness Functions for Genetic Feature Transformation


14th IFAC Conference on Programmable Devices and Embedded Systems, October 5-7, 2016, Brno, Czech Republic


IFAC-PapersOnLine 49-25 (2016) 299–304


Jan Klusáček ∗   Václav Jirsík ∗∗

∗ Brno University of Technology, The Faculty of Electrical Engineering and Communication, Department of Control and Instrumentation, Technická 3082/12, Královo Pole, 61600 Brno, Česká republika, [email protected]
∗∗ Brno University of Technology, The Faculty of Electrical Engineering and Communication, Department of Control and Instrumentation, Technická 3082/12, Královo Pole, 61600 Brno, Česká republika, [email protected]

Abstract: A representation of features is a very important parameter when creating machine learning models. The main goal of this paper is to introduce a way to compare feature space transformations that change this representation. It particularly deals with a comparison of a method that uses genetic programming based on different fitness functions for the transformation of feature space. The fitness function is a very important part of the genetic algorithm because it defines the required properties of the new feature space. Several possible fitness functions are described and compared in this paper. The process of the comparison is also introduced, the selected functions are tested, and their results are compared to each other. The upsides and downsides of each method are discussed in the conclusion. A part of the work is a framework used to automate the processes needed to create this comparison.

© 2016, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved.

Keywords: Feature space, Feature space transformation, Genetic programming, Fitness function, Machine learning

1. INTRODUCTION

One of the most used areas of machine learning is supervised learning. The goal of supervised learning is to create a model based on given examples (consisting of both features and labels). This model should be able to assign labels to new (previously unseen) examples as accurately as possible 1. Generally, there are two sources of errors: variance and bias (Friedman, 1997). The variance error is caused by noise and missing values in the learning data and it can be reduced by increasing the number of examples. The second source of error is bias. Bias is caused by erroneous assumptions about the character of the learned model (for example, if we assume a linear relation between feature and label, but it is in fact quadratic, we can never fully fit the model to the data). This error can be reduced by using more complex models (for example, a higher order polynomial), but with the possibility of overfitting the model 2 (Geman et al., 1992). This means that the selection of the right model is one of the most important and straightforward ways to improve prediction accuracy. Another possibility to improve the accuracy of the resulting model is to preprocess the input data. This preprocessing is in fact a transformation of the model's input space (feature space transformation).

1 This accuracy is measured by a function called the error function.
2 Models that are more complex than necessary tend to find relations that aren't really present.

2. FEATURE TRANSFORMATIONS

Feature space transformations cannot introduce any new information; in the best case they can preserve all information already present in the original features, but often they reduce the amount of information. Nevertheless, they can improve the resulting accuracy. This improvement is achieved by transforming the feature space to a form better suited for a model. These transformations can bring other advantages apart from improving the resulting accuracy. They can, for example, reduce the amount of saved data or decrease the complexity of the model. Some of these transformations will be described further.

2.1 Input cleaning

The first step of preprocessing should be cleaning of the input. This process removes examples that have missing values or are corrupted in another way. It is necessary to have some additional knowledge of the examples to be able to decide which examples are corrupted.

2.2 Feature normalization

Raw data can be represented in different ways, using different units. This can lead to features with values differing by orders of magnitude. Models like decision trees that work with informational properties of features are generally unaffected by feature scale, but others like nearest neighbor are very sensitive to scale because they

2405-8963 © 2016, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved. Peer review under responsibility of International Federation of Automatic Control. doi:10.1016/j.ifacol.2016.12.053


work with a distance between examples (Yoon and Hwang, 1995). For this reason, it is essential to normalize all features before using them. The process of normalization is a complicated problem that needs to deal, among other things, with the problem of different distributions of features and with outliers.
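As an illustration of the scale problem, here is a minimal sketch of one common normalization choice (z-score standardization) in Python/NumPy; it is only an example of the idea and deliberately ignores the outlier and distribution issues mentioned above:

    import numpy as np

    def zscore_normalize(X):
        """Scale each feature (column) to zero mean and unit variance.

        A minimal illustration only: it ignores outliers and assumes no
        constant columns (std == 0 would divide by zero).
        """
        mean = X.mean(axis=0)
        std = X.std(axis=0)
        return (X - mean) / std

    # Example: two features with very different scales
    X = np.array([[1.0, 1000.0],
                  [2.0, 2000.0],
                  [3.0, 1500.0]])
    print(zscore_normalize(X))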

2.3 Feature selection

One of the most basic and most used transformations is feature selection. The goal of feature selection is to remove features with low relevancy (features that do not participate in the prediction). Apart from the apparent advantages of reducing the amount of stored data and the complexity of the created model, it can often increase the accuracy of the created model 3. Many different feature selection algorithms are used. A large group of feature selection algorithms relies on a greedy search: they select one feature after another, starting from the best one. All these algorithms can be divided according to the parameter used for feature selection.

The first group of methods, called wrappers, uses the accuracy of a created model to evaluate the quality of a feature set. They start by creating a model for each available feature, evaluate their accuracy and then select the best feature. The second feature is selected the same way from all possible feature subsets containing the best feature and one of the unselected features. This approach is repeated until the desired number of features is acquired. These methods lead to very good accuracy if the resulting subset is used with the same model. On the other hand, they have high computational requirements (repeated creation and evaluation of models) and can easily lead to overfitting.

The second group of methods is called filters. They work similarly to wrappers but they use a heuristic function 4 (criterion function) instead of the accuracy of a created model. This leads to lower computational requirements and the selected features are not dependent on the selected model, but these methods generally lead to worse accuracy than wrapper methods. All filter methods can be further divided according to their approach to redundancy. Simple methods ignore the possibility of redundant features and therefore can select similar or even identical features if they are present in the original feature space. More sophisticated methods try to address this possibility and evaluate the similarity between features. The downside of this approach is a big increase in the computational requirements of these methods.

Criterion functions used to select features are a very important part of the feature selection process and they are used further in the experiment conducted in this paper. For this reason, a few functions will be described in more detail.

Information gain and information gain ratio. Information gain (Guyon and Elisseeff, 2003) is one of the most used criterion functions. This criterion is defined for nominal features and labels. Information gain is defined by equation (1), where F is the original dataset, f is a candidate for the selected feature and H(F) is entropy. It is defined as the difference between the entropy calculated from the whole dataset and the entropies calculated from the data set partitioned using the given feature.

IG(F, f) = H(F) − Σ_{a ∈ vals(f)} H(F | a)    (1)

3 Models like decision trees are mostly immune to abundant features. Models like nearest neighbor are very sensitive to the dimensionality of the feature space and can greatly profit from its reduction.
4 There are many available functions, like information gain, Relief or area under curve.
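For concreteness, a minimal Python sketch of how information gain can be computed for a nominal feature, following equation (1); weighting the partition entropies by partition size is an assumption about the intended definition, since the paper does not spell it out:

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy of a list of nominal labels."""
        counts = Counter(labels)
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    def information_gain(feature_values, labels):
        """IG(F, f): entropy of the whole label set minus the (weighted)
        entropies of the partitions induced by the candidate feature."""
        n = len(labels)
        total = entropy(labels)
        for value in set(feature_values):
            part = [l for v, l in zip(feature_values, labels) if v == value]
            total -= len(part) / n * entropy(part)
        return total

    # Toy example: a feature that separates the labels perfectly
    print(information_gain(['a', 'a', 'b', 'b'], [0, 0, 1, 1]))  # 1.0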


This criterion function naturally prefers features with more distinct values; to mitigate this problem, the information gain ratio can be used (Agrawal et al., 1993). In this case, the information gain is divided by a factor that takes into account the number of possible distinct values.

Area under curve. The criterion area under the curve (AUC) is defined by (Fawcett, 2006) as the probability that a classifier based on the given feature will rank a randomly chosen positive instance higher than a randomly chosen negative instance. It can alternatively be understood as the number of steps required to order all examples according to their label when they were originally ordered by the feature. The main advantage of this criterion is its robustness. One problem with using AUC as a criterion function is that it is defined only for binary labels. It can, however, be adapted for multiclass tasks using decomposition. There are two basic ways to decompose a multiclass problem into binary ones: we can divide it into multiple classification problems, each solving classification between two classes, or between one class and all other classes. In this paper, the first approach is used.

Symmetric uncertainty. Symmetric uncertainty (SU) is a normalized variant of information gain, given by equation (2).

SU(F, f) = 2 · IG(F, f) / (H(F) + H(f))    (2)

Minimum redundancy maximal relevance. Minimum redundancy maximal relevance (mRMR) by (Ding and Peng, 2005) is a criterion that, unlike the others discussed so far, tries to address the problem of redundant information. The previously described criteria considered only the relevancy between a feature and the label when selecting the best feature, but they completely ignored the redundancy between selected features. This can easily lead to the selection of two identical copies (possibly with some noise) if they are present in the original feature set. This is undesirable because redundant features don't provide any additional information and undermine the whole purpose of feature selection. The main disadvantage of mRMR is its higher computational requirements, which rise quadratically with the number of selected features (because the redundancy with each previously selected feature has to be calculated for every candidate feature).
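To make the redundancy handling concrete, the following is a rough sketch of an mRMR-style greedy selection step. The relevance and redundancy callables are assumed to be mutual-information-like measures (for example the information gain above); this follows the general idea of (Ding and Peng, 2005), not any particular implementation:

    def mrmr_select(candidates, labels, relevance, redundancy, k):
        """Greedy mRMR-style selection: at each step pick the candidate with
        the highest relevance to the labels minus its mean redundancy with
        the already selected features (a sketch, not Ding & Peng's exact code)."""
        selected = []
        remaining = list(candidates)
        while remaining and len(selected) < k:
            def score(f):
                rel = relevance(f, labels)
                if not selected:
                    return rel
                red = sum(redundancy(f, s) for s in selected) / len(selected)
                return rel - red
            best = max(remaining, key=score)
            selected.append(best)
            remaining.remove(best)
        return selected

The inner sum over already selected features is what makes the cost grow quadratically with the number of selected features, as noted above.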


2.4 Principal component analysis

Principal component analysis (PCA), described by (Jolliffe, 2002), is a method used to transform (and possibly reduce the dimensionality of) the feature space. In contrast to filter and wrapper methods, it works with the complete feature space instead of one feature after another. PCA searches for projections from the original feature space to a new one. It searches for the new features as linear combinations of the original features, trying to maximize the information carried in each feature. PCA expects that features with the biggest variance carry the most information. This means that it creates each new feature as the linear combination of the original features that has the highest variance, with each new feature orthogonal to all previous ones.

2.5 Independent component analysis

Independent component analysis (ICA) is in some aspects similar to PCA. It also works with all features instead of one after another, and it also searches for projections of the original feature space to the new one as linear transformations. Unlike PCA, ICA expects that the original (measured) features are linear combinations of some basic features (which do not have normal distributions). As a result, it searches for such transformations that create the set of features that are the most independent 5.

5 Independence can be measured, for example, by mutual information.
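A brief sketch of both projections; scikit-learn is used here purely for illustration, the paper does not say which implementation was employed:

    import numpy as np
    from sklearn.decomposition import PCA, FastICA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))     # 200 examples, 5 original features

    # PCA: orthogonal linear projections ordered by explained variance
    X_pca = PCA(n_components=3).fit_transform(X)

    # ICA: linear projections chosen to make the new features as independent as possible
    X_ica = FastICA(n_components=3, random_state=0).fit_transform(X)

    print(X_pca.shape, X_ica.shape)   # (200, 3) (200, 3)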


2.6 Genetic approach

All previous feature space transformations search only in a very specific set of transformations. Feature selection doesn't change individual features at all, it can only create subsets. PCA and ICA create new features, but they search only for linear combinations of the original features. These limitations can be overcome by the usage of evolutionary methods like grammatical evolution (O'Neill and Ryan, 2001) or genetic programming, which can be used to optimize any tree structure, in this case an expression describing a feature transformation. Genetic programming methods, as demonstrated by (Vafaie and De Jong, 1998) and (Krawiec, 2002), can be used for these purposes. These methods allow searching for transformations in a huge space limited only by the types of nodes used for the construction of transformations and the maximal depth of the created tree. The huge search space is also the biggest challenge when using these methods, because it is generally possible to evaluate only a very small part of it. For every genetic algorithm, it is necessary to define a fitness function that will guide its search.

2.7 Genetic programming

Genetic programming (GP) is an evolutionary method described by (Koza, 1992a). It creates each individual from a given set of nodes. The set of nodes depends on the given task. It can be a set of Lisp commands as described by (Koza, 1994) or, for example, an equation. Every individual has a tree structure and the nodes have to be assembled using given rules (for example to assure that function parameters have the right types). The main challenge with genetic programming is that the genetic operator of random crossover can create invalid individuals (individuals that don't follow the given rules). It is possible to create the crossover in such a way that it follows these rules, or to prepare routines that are able to repair invalid individuals, but this is often quite complicated.
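As an illustration of what a GP individual can look like in this setting, here is a minimal sketch of an expression tree that maps original features to one new feature; the node set and the example expression are hypothetical, not taken from the paper:

    import math

    # A tree node is either a terminal (original feature index) or an operator
    # with child subtrees; evaluating the tree on one example yields a new feature value.
    def evaluate(node, example):
        if isinstance(node, int):            # terminal: index into original features
            return example[node]
        op, *children = node                 # operator node: ('add', left, right), ...
        vals = [evaluate(c, example) for c in children]
        if op == 'add':
            return vals[0] + vals[1]
        if op == 'mul':
            return vals[0] * vals[1]
        if op == 'log':
            return math.log(abs(vals[0]) + 1e-9)
        raise ValueError(op)

    # new_feature = log(x0 * x1) + x2
    tree = ('add', ('log', ('mul', 0, 1)), 2)
    print(evaluate(tree, [2.0, 3.0, 0.5]))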

2.8 Grammatical programming

Grammatical programming, described by (O'Neill and Ryan, 2001), is a method similar to genetic programming. The main difference is that it doesn't work with an individual in the form of a tree structure, but represents it as an integer string. The rules for creating the program (expression) from the integer string are described by a grammar. This grammar is a part of the grammatical programming. It is often given in Backus-Naur form and it ensures the creation of a valid individual. The genetic operators (mutation and crossover) are applied to the integer string and the resulting string is transformed into the tree structure using the same grammar. This process is shown in fig. 1. As a result of this approach, created individuals are always valid. The drawback of this approach is that small changes at the beginning of the integer string (genotype) can lead to big changes in the resulting tree structure (phenotype). This problem is called low locality (Rothlauf and Oetzel, 2006).

Fig. 1. Grammatical programming
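The mapping from genotype to phenotype sketched in Fig. 1 can be illustrated with a tiny hypothetical grammar: each integer codon selects one production (modulo the number of alternatives), in the usual grammatical-evolution style. The grammar and codon values below are illustrative only:

    GRAMMAR = {
        '<expr>': [['<expr>', '<op>', '<expr>'], ['<var>']],
        '<op>':   [['+'], ['*']],
        '<var>':  [['x0'], ['x1'], ['x2']],
    }

    def decode(codons, symbol='<expr>', pos=0):
        """Expand `symbol` using integer codons; returns (tokens, next codon position).
        No wrapping or depth limit - enough codons are assumed to be available."""
        if symbol not in GRAMMAR:
            return [symbol], pos
        rules = GRAMMAR[symbol]
        rule = rules[codons[pos] % len(rules)]
        pos += 1
        tokens = []
        for s in rule:
            out, pos = decode(codons, s, pos)
            tokens.extend(out)
        return tokens, pos

    tokens, _ = decode([0, 1, 0, 1, 1, 2])
    print(' '.join(tokens))

With the codons [0, 1, 0, 1, 1, 2] this prints "x0 * x2"; changing only the first codon switches the top-level rule and therefore the whole expression, which is the low-locality effect mentioned above.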

2.9 Fitness functions

A fitness function is a function that describes the quality of every individual created by the genetic algorithm. Its value is consequently used to determine the probability that an individual will be copied to the next generation (the genetic operator of selection). There are several requirements on the fitness function: a big value of the fitness function should correspond to features that lead to models with a good accuracy, and the calculation of the fitness function should be fast because it will be evaluated many times. This means that in our case 6 the requirements on fitness functions are the same as the requirements on functions used for feature selection; in fact, most of the functions used for feature selection can be directly used as fitness functions. The fitness functions considered by this paper were described in chap. 2.3.

6 The genetic operator of selection is selecting the best generated features; it is a feature selection task.

3. COMPARING RESULTS FOR DIFFERENT FITNESS FUNCTIONS

The main focus of this paper is to introduce a process for comparing different fitness functions. The most straightforward way to compare the quality of two fitness functions is to use both of them together with genetic programming



to create a new feature space, then create a model and evaluate their accuracies. The accuracy of each feature space can then be used to compare the quality of each fitness function. A problem with this approach is that the resulting accuracy doesn't depend only on the used fitness function, but also on the used dataset, on other parameters of the genetic programming 7 and on the used models. A good fitness function should work well with a wide range of different datasets, models, and GP parameters. For this reason, it is desirable to run multiple tests with different datasets and models for each fitness function and draw a conclusion from these results. This task can be very time-consuming and it has to be repeated every time changes are made to the fitness functions. For this reason, it is desirable to automate this process as much as possible.

3.1 Framework

To automate the process of testing different transformation methods, a framework automating these tasks was created. This framework was written in Python. The goal of this framework is not to create a complete application that could be deployed as is, because that would require very complex configuration and even then it could not cover all possible scenarios. It is rather a framework that takes care of tedious tasks and lets users focus on the real problem. Instead of complex configurations that try to address all possible scenarios, this framework implements functionality that is common for most tasks and the user has to customize it to meet his specific requirements. This customization is achieved by inheriting from prepared classes. This approach allows an easy way to add custom behavior, without the necessity of complex configuration. This approach requires that the user has some programming skills, but this should not be a problem because these skills are required anyway for the implementation of the transformation method.

7 Population size, number of generations, probabilities of mutation and crossover, etc.

The whole framework is written in Python. It uses the Django framework for access to the database and for visualization. It uses Matplotlib for graphs and the pandas library for work with data. Python was selected because it allows fast development and it is very widespread in the machine learning community. It also allows writing multiplatform code 8. A basic diagram depicting the framework's function is in Fig. 2. The first step of every test is a task creation. All necessary functionality for one task is implemented in a class Task. This class is derived from the Django class Model, which means that it is possible to easily load and save this class to a database. The most important property of this class is task_data. This property contains a pickled 9 object that will be automatically transferred and executed on a remote machine. The user only has to prepare his tests (or parts of tests) into this property and they will be automatically transferred to the available remote machine, executed, and their results collected. The Task class also allows defining some additional properties. It can, for example, require the results of another task before it can be executed.

Fig. 2. Framework diagram

The second part of the framework is the task planner. It can be run as a standalone program or as a part of a Django server. It has two tasks: it monitors the state of available machines and it assigns them tasks. After it finds a task that is prepared to be run, it transfers it to the remote machine using the ssh protocol and executes it. After the task is finished, its results are transferred back and saved to the database, and the machine is marked as ready for the next task.

8 The described framework is able to run on Windows and Linux without any modifications.
9 cPickle is a Python library that can easily serialize and deserialize generic Python objects. It has its limitations (it can't serialize objects like file handles, sockets etc.) but it works with simple objects.
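A hedged sketch of how a user's experiment might be packaged for the Task class described above. The names Task and task_data come from the paper; the experiment class, its fields and the pickling calls are guesses at the intended usage, not the framework's documented API:

    import pickle

    class TransformationExperiment:
        """User-defined job: everything needed to run one test on a remote machine."""
        def __init__(self, dataset_name, fitness_function_name):
            self.dataset_name = dataset_name
            self.fitness_function_name = fitness_function_name

        def run(self):
            # Here the user would load the dataset, run the GP transformation
            # with the chosen fitness function and return the results.
            return {'dataset': self.dataset_name,
                    'fitness': self.fitness_function_name,
                    'accuracy': None}

    # Hypothetical usage: serialize the job into a Task's task_data field,
    # so the planner can ship it to a remote machine over ssh and execute it.
    job = TransformationExperiment('Isolet', 'mRMR')
    task_data = pickle.dumps(job)          # what the framework would store in Task.task_data
    restored = pickle.loads(task_data)     # what the remote machine would do
    print(restored.run())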


Fig. 3. Example of output table


The third part of this framework is result visualization. This part of the framework, similarly to the others, doesn't provide a complete solution for the visualization of results. It just provides the infrastructure to allow easy creation of one's own protocols. All results are saved in the database, so they can be accessed and filtered using the Django ORM model. They can be further processed using the pandas or SciPy libraries. Tables and graphs can be created using the tables2 and Matplotlib libraries. Examples of possible graphical outputs achievable by the framework are in fig. 3 and fig. 4.

Fig. 4. Example of output graph

3.2 Experiment

The main goal of this paper is to introduce a way to compare feature space transformations and to compare the previously described fitness functions (chap. 2.9). The designed experiment uses 7 different datasets and 6 model types together with genetic programming (Koza, 1992b) and 5 fitness functions to evaluate the quality of each function. Genetic programming was used as the evolutionary algorithm in this paper because it has better locality than grammatical programming, and the expressions created in these tests are quite simple, so there are no problems with the creation of invalid individuals (see chap. 2.7 and 2.8).

All used datasets are from the UCI repository (Lichman, 2013) and are described in table 1. Datasets with a wide range of instance counts and feature counts were used.

Name          Instances   Features
Isolet        7797        617
Numerals      20000       649
Optdig        3823        63
Satelite      6435        1332
Scene         1137        294
Sensvehicle   78823       1000
Spectro       509         101

Table 1. Parameters of used datasets

The first step is the transformation of every dataset using genetic programming with a tested fitness function. The parameters used in the genetic programming are summarized in table 2. For every original feature space, a new feature space containing 10 features was created.

Parameter               Value
Generations             500
Population              500
Crossover probability   0.3
Mutation probability    0.1

Table 2. Parameters of used GA

In the second step, models are created using these transformed datasets. Simple models available in RapidMiner were used with their default settings. These weren't tuned, so they definitely don't give the best achievable results, but the goal of this experiment isn't the creation of the best model but a comparison of the different fitness functions. For this purpose, simple models with default settings are suitable and they provide fair conditions for the comparison. The models used in this experiment with their RapidMiner implementations are in table 3.

Model type                   RapidMiner implementation
Nearest neighbor             k-NN
Decision tree                Decision tree
Random forest                Random forest
Naive bayes classificator    Naive bayes
Neural networks              AutoMLP

Table 3. Models used in the experiment and their RapidMiner implementations

As part of the experiment, 10 models of each type are trained for each combination of dataset and fitness function using 10-fold cross validation. This means that 2100 models are created and evaluated for this experiment. Results from all 10 folds are used, and for each case the order of all fitness functions ranked by their accuracy is determined. The average order is then used to determine the quality of the tested fitness functions. All these experiments were conducted using the framework described earlier, using machines rented from Amazon's AWS service.

3.3 Results

The results of the whole experiment are summarized in table 4. Each row contains the average rank of one fitness function. These results show that the best fitness function of all the tested fitness functions is, by a big margin, mRMR.

Fitness function                         Average order rank
Area under curve                         2.060
Information gain                         3.22
Information gain ratio                   3.311
Minimal redundancy maximal relevancy     1.409

Table 4. Average rank achieved by different fitness functions

Another way to examine the results of this experiment is to compare the number of wins (first places) achieved by each fitness function. These results are shown in table 5. The results are similar to the previous case: mRMR is again the best fitness function by a large margin. On the other hand, the genetic algorithm using mRMR was more than 10 times slower than the others. The exact experiment duration for each function wasn't measured because many different machines were used for the experiment, so the runtimes of the individual runs can't be directly compared.
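A small pandas sketch of how the average ranks (table 4) and the win counts (table 5) can be derived from a results table; the column names and the toy accuracies are illustrative only, not the experiment's data:

    import pandas as pd

    # One row per (dataset, model, fold): accuracy achieved with each fitness function
    results = pd.DataFrame({
        'case':     ['c1', 'c2', 'c3'],
        'AUC':      [0.81, 0.78, 0.90],
        'IG':       [0.75, 0.74, 0.88],
        'IG ratio': [0.74, 0.73, 0.87],
        'mRMR':     [0.85, 0.80, 0.89],
    }).set_index('case')

    # Rank the fitness functions within every case (1 = best accuracy) ...
    ranks = results.rank(axis=1, ascending=False)
    print(ranks.mean())        # average rank per fitness function (cf. Table 4)

    # ... and count how often each function came first (cf. Table 5)
    print(ranks.eq(1.0).sum())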


Fitness function                         Number of wins
Area under curve                         112
Information gain                         2
Information gain ratio                   4
Minimal redundancy maximal relevancy     232

Table 5. Number of wins

4. CONCLUSION

As we can see from the results, the information gain and information gain ratio functions give very similar results. This can be explained by the similarity of their definitions. The best fitness criterion function is mRMR. This function uses the same relevancy measurement as information gain and it is the only criterion function in the test that works with redundancy, so we can expect that this property is important for usage as a fitness function. It can be expected that this is very important in our case because GP naturally creates lots of similar or even identical features, and methods that don't take this into account will often produce similar features. This conclusion is very interesting and it would be desirable to test other methods working with redundancy. One big downside of this fitness function (and of all similar methods) is that it is much slower than the others. This is caused by the necessity to calculate the redundancy with all other already selected features. Unfortunately, this problem is shared by all similar methods. It would be desirable to find a function that can reduce the computational requirements of the redundancy calculations, because it would enable the possibility to effectively use GP for these tasks. Further experiments should be executed in a pool of identical machines so that computing times can be meaningfully compared.

5. ACKNOWLEDGEMENTS

This work was supported by grant No. FEKT-S-14-2429 - "The research of new control methods, measurement procedures and intelligent instruments in automation", which was funded by the Internal Grant Agency of Brno University of Technology.

REFERENCES

Agrawal, R., Imielinski, T., and Swami, A. (1993). Database mining: A performance perspective. IEEE transactions on knowledge and data engineering, 5(6), 914–925.
Ding, C. and Peng, H. (2005). Minimum redundancy feature selection from microarray gene expression data. Journal of bioinformatics and computational biology, 3(02), 185–205.
Fawcett, T. (2006). An introduction to ROC analysis. Pattern recognition letters, 27(8), 861–874.
Friedman, J.H. (1997). On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1(1), 55–77. doi:10.1023/A:1009778005914.
Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural computation, 4(1), 1–58.

Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of machine learning research, 3(Mar), 1157–1182.
Jolliffe, I. (2002). Principal component analysis. Wiley Online Library.
Koza, J.R. (1992a). Genetic programming: on the programming of computers by means of natural selection, volume 1. MIT press.
Koza, J.R. (1992b). Genetic programming: on the programming of computers by means of natural selection, volume 1. MIT press.
Koza, J.R. (1994). Genetic programming II: Automatic discovery of reusable subprograms. Cambridge, MA, USA.
Krawiec, K. (2002). Genetic programming-based construction of features for machine learning and knowledge discovery tasks. Genetic Programming and Evolvable Machines, 3(4), 329–343.
Lichman, M. (2013). UCI machine learning repository. URL http://archive.ics.uci.edu/ml.
O'Neill, M. and Ryan, C. (2001). Grammatical evolution. IEEE Transactions on Evolutionary Computation, 5(4), 349–358.
Rothlauf, F. and Oetzel, M. (2006). On the locality of grammatical evolution. In European Conference on Genetic Programming, 320–330. Springer.
Vafaie, H. and De Jong, K. (1998). Feature space transformation using genetic algorithms. IEEE Intelligent Systems, 13(2), 57–65.
Yoon, K.P. and Hwang, C.L. (1995). Multiple attribute decision making: an introduction, volume 104. Sage publications.