Neurocomputing 73 (2010) 3200–3223
Generation and simplification of Artificial Neural Networks by means of Genetic Programming

Daniel Rivero, Julián Dorado, Juan Rabuñal, Alejandro Pazos

Department of Information and Communication Technologies, University of A Coruña, A Coruña, Spain
Article history: Received 16 March 2009; received in revised form 17 March 2010; accepted 17 May 2010; available online 7 July 2010. Communicated by A. Abraham.

Abstract
The development of Artificial Neural Networks (ANNs) is traditionally a slow process in which human experts are needed to experiment on different architectural procedures until they find the one that presents the correct results that solve a specific problem. This work describes a new technique that uses Genetic Programming (GP) in order to automatically develop simple ANNs, with a low number of neurons and connections. Experiments have been carried out in order to measure the behavior of the system and also to compare the results obtained using other ANN generation and training methods with evolutionary computation (EC) tools. The obtained results are, in the worst case, at least comparable to existing techniques and, in many cases, substantially better. As explained herein, the system has other important features such as variable discrimination, which provides new information on the problems to be solved.
Keywords: Artificial Neural Networks; Evolutionary computation; Genetic Programming
1. Introduction

Artificial Neural Networks (ANNs) are systems that are easily implemented and handled. They comprise a family of learning systems that have led to the solution of a large number of complex problems in diverse fields (classification, clustering, regression, etc.) [1-3]. They present interesting characteristics which make them a powerful problem-solving technique and have led many researchers to use them in a large number of different environments [4-6]. However, their use raises a series of problems, mainly in their development process. This process can be roughly divided into two parts: development of the architecture, and its training and validation. The development of the architecture refers to the process of determining how many neurons an ANN will have, in how many layers, and how they will be interconnected. Training refers to calculating the values of the connections. Given that the network architecture depends on the problem to be solved, the design of this architecture is usually a manual process based on experience: the expert has to experiment on several different architectures until he finds one that returns good results after the training process. Therefore, the expert has to carry out several experiments using different architectures and train each one of them in order to determine which of them is the best.
Therefore, for most problems, and depending on the application, determining an architecture can be a slow, manual process. In general, this can make the development of ANNs a very slow process in which the human expert may need to put a lot of effort into finding a network that solves the problem accurately. To this end, the present paper proposes a new method that allows the automatic development of ANNs by using Genetic Programming (GP). The method described in this paper is not tied to a particular problem; it can be used whenever an expert needs to develop an ANN and does not know which topology is best for solving the problem. In this case, GP can help him to find a good ANN, and the expert does not have to put in any effort to develop it. Although some ANN creation techniques have recently been developed, most of them are not completely automatic yet, and therefore the expert still has to do much work. This work involves the design of some initial networks or the setting of some parameter values, which means that the expert has to run experiments repeatedly to fit these values. In the system described herein, a standard parameter configuration set is presented and, as will be shown, it is useful for problems of different complexities. Therefore, if an expert wants to use this system, he will not need to set the value of any parameter, and the ANN development process is completely automatic. The system presented here uses GP as a method to automatically evolve ANNs. The networks evolved by this system do not follow a traditional architecture of fully connected hidden layers. The evolved topologies can have any kind of connectivity,
with connections, for example, between input and output neurons. As there is a direct relation between the ANN architecture and its representation capacity, allowing a topology to have any kind of connectivity with no limitations can increase the networks' representation capacity. It also allows networks to be obtained with the same representation capacity but with a lower number of neurons and connections. This is another objective of this work: obtaining networks that can solve a problem with the minimum number of neurons and connections. As the obtained results show, the system described in this work can develop such networks with a minimum set of neurons and connections. Furthermore, this paper contains a complete study of this new approach, including a study of parameters, a comparison with other techniques, features of this system, etc. In order to study the behavior of this system, some of the best known databases were used, so that the results can be compared with others. Moreover, a comparison with other automatic ANN development techniques is provided. The results show that the performance of this system can be better than that of the other techniques in most cases.
2. State of the art

2.1. Genetic Programming

Genetic Programming (GP) appeared at the end of the 1980s [7,8] as an evolution of Genetic Algorithms (GAs). A GA [9,52] is a search technique inspired by biology; more specifically, it takes the theory of evolution as the basis for its operation. GAs are used to solve optimization problems by copying the evolutionary behavior of species. Starting from an initial random population of solutions, this population is evolved by means of selection, mutation and crossover operators [16,17], inspired by natural evolution. By applying this set of operations, the population goes through an iterative process in which it reaches different states, each one called a generation. As a result of this process, the population is expected to reach a generation in which it contains a good solution for the problem. In general, this kind of algorithm is called an Evolutionary Algorithm (EA). GAs are EAs in which the solutions of the problem are codified as strings of bits or real numbers. This allows a lot of problems to be solved. However, this codification is not suitable for problems whose solutions cannot be expressed as a string of bits or real numbers. In order to apply an Evolutionary Algorithm to those problems in which GAs are not applicable, a new codification type emerged: the EA whose codification has the shape of trees instead of binary or real-number strings is called GP. The formal basis of GP, when this technique was first recorded under this name, dates back to 1992 with the publication of the book "Genetic Programming" [14], although a previous work by the same author [15] had already made it clear that genetic programming was simply an extension of genetic algorithms applied to tree-based programming structures. However, prior to that, there were some works describing machine language program creation techniques [10,11]. In other previous works, this time from the sixties [12,13], programs were evolved as finite state automata. GP works in the same way as GAs and EAs: from an initial random population, the population is evolved by means of evolutionary operators to create new solutions. In classical GP, the solutions are codified with the shape of trees. For this reason, in order to allow the EA to build correct trees, the user has to specify which nodes the EA can use to build them. There are two types of nodes: terminals (leaves of the trees) and
functions (nodes that have other nodes as arguments). Using them, it is possible to build complex expressions of very different natures: mathematical (including, e.g., arithmetical or trigonometric operators), logical (with Boolean or relational operators) and others, much more complex, that follow a particular grammar determined by the user. The ability of GP to adapt itself to a large number of different environments has brought it great success and a high level of application in many diverse fields. Although its main and most direct application is the generation of mathematical expressions [18], it has also been used in other areas such as rule generation [19], knowledge extraction [20], filter design [21,22], image processing [23-25], etc.
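As an illustration of how such expression trees are handled, the following minimal sketch evaluates a GP tree built from function nodes and terminal nodes, including a protected division of the kind used later in this work. The node layout, function names and example expression are assumptions made here for illustration; this is not code from the paper.

import operator

def protected_div(a, b):
    # Protected division: returns 1 when the divisor is 0
    return 1.0 if b == 0 else a / b

FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "%": protected_div}

def evaluate(node, variables):
    # A node is either ('term', value_or_variable_name) or ('func', name, [children])
    if node[0] == "term":
        value = node[1]
        return variables[value] if isinstance(value, str) else value
    _, name, children = node
    args = [evaluate(child, variables) for child in children]
    return FUNCTIONS[name](*args)

# Example: the expression x * (y + 2) codified as a tree
tree = ("func", "*", [("term", "x"),
                      ("func", "+", [("term", "y"), ("term", 2.0)])])
print(evaluate(tree, {"x": 3.0, "y": 1.0}))  # prints 9.0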
2.2. ANN development with evolutionary computation tools

The development of ANNs is a topic that has been extensively dealt with using very diverse techniques. The world of evolutionary algorithms is not an exception, and the great number of works that have been published on different techniques in this area [9,26-38,74,77] stands as proof. These techniques follow the general strategy of an evolutionary algorithm: an initial population consisting of different genotypes, each one of them codifying different parameters (typically, the weights of the connections and/or the architecture of the network and/or the learning rules), is randomly created. This population is evaluated in order to determine the fitness of each individual. Afterwards, this population is repeatedly evolved by means of different genetic operators (replication, crossover, mutation, etc.) until a determined termination criterion is fulfilled (for example, a good enough individual is obtained, or a predetermined maximum number of generations is reached). Essentially, the ANN generation process by means of evolutionary algorithms is divided into three main groups: evolution of the weights, of the architectures, and of the learning rules. The evolution of the weights starts from a network with a predetermined topology. In this case, the problem is to establish, by means of training, the values of the network connection weights. This is generally conceived as a problem of minimization of the network error. Most training algorithms, such as the backpropagation algorithm (BP) [39,40], are based on gradient minimization, which has several drawbacks [41,42]. One way of overcoming these problems is to carry out the training through an Evolutionary Algorithm [42], formulating the training process as the evolution of the connection weights in an environment defined by the network architecture and the problem to be solved. In these cases, the weights can be represented as strings of binary values [42-45] or real numbers [46-51,53], which means that GAs codified as binary strings or real-number strings can be used, respectively. The evolution of the architectures refers to the generation of the topology and connectivity of the neurons. The architecture of a network is of great importance for the successful application of ANNs, as it has a very significant impact on the processing capacity of the network. Therefore, the design of a network is crucial. Automated architecture design has been made possible by the appearance of constructive and destructive algorithms [54,55]. In general terms, a constructive algorithm starts from a minimal network (with a small number of layers, neurons and connections) and successively adds new layers, nodes and connections, if necessary, during the training. A destructive algorithm carries out the opposite operation, i.e., it starts from a maximal network and removes unnecessary nodes and connections during the training. However, the methods based
on Hill Climbing algorithms are quite susceptible to falling into a local minimum [56]. In order to develop ANN architectures by means of an evolutionary algorithm, it is necessary to decide how to codify a network inside the genotype so that it can be used by the genetic operators [30,33]. To this end, different types of network codifications have emerged. In the first codification method, direct codification, there is a one-to-one correspondence between the genes and the phenotypic representation [57]. The most typical codification method consists of a matrix C = (c_ij) of size N x N which represents an architecture with N nodes, where c_ij indicates the presence or absence of a connection between nodes i and j. It is possible to use c_ij = 1 to indicate a connection and c_ij = 0 to indicate the absence of a connection. In fact, c_ij could take real values instead of Boolean ones to represent the value of the connection weight between neurons i and j, so that architecture and connections can be developed simultaneously [58-60]. These types of codification are generally very simple and easy to implement, but have a lot of disadvantages [61-64]. More recently, in a work by Stanley, a system called NEAT (NeuroEvolution of Augmenting Topologies) has been presented [65]. This system uses a GA to encode and evolve ANNs. Starting from minimal ANNs (only input and output neurons are used at the beginning), the mutation operator allows the addition of new hidden nodes and connections [66]. As a counterproposal to this type of direct codification method, indirect codification types also exist. With the objective of reducing the length of the genotypes, only some of the architecture characteristics are coded into the chromosome. Within this type of codification, there are several types of representations. First, parametric representations are worth mentioning: the network can be represented by a set of parameters such as the number of hidden layers, the number of connections between two layers, etc. [67-70]. Although parametric representations can reduce the length of the chromosome, the evolutionary algorithm searches in a limited subspace of the whole search space that represents all the possible architectures. Another type of indirect codification is based on a representational system with the shape of grammatical rules [71,72]. In this system, the network is represented by a set of rules that build a matrix representing the network. These rules have the shape of production rules, with antecedent and consequent parts. As the rules are applied, the matrix which represents the network increases its size until it reaches its final dimension [61]. Other types of codification, more inspired by the world of biology, are the ones known as "growing methods". In them, the genotype no longer codes the network, but contains a set of instructions instead; the decoding of the genotype consists of the execution of these instructions, which leads to the construction of the phenotype [73]. These instructions usually include neuronal migrations [75], neuronal duplication or transformation, and neuronal differentiation [76]. When talking about indirect codifications, it is worth mentioning another type, based on the use of fractal subsets of a map [78]. According to this study, the fractal representation of the architectures is biologically more plausible than a representation with the shape of rules.
Three parameters are used, which take real values to specify each node of the architecture: a border code, an entry coefficient and an exit coefficient. Finally, and within the indirect codification methods, there are also other methods, very different from the ones already described. Andersen describes a technique in which each individual of a population is a hidden node instead of an architecture [79]. The network is constructed layer by layer, i.e.,
the hidden layers are added one by one if the architecture in question cannot reduce the training error below a certain threshold. Other similar works were carried out by Smith with similar results [80,81]. Using a similar idea, Moriarty proposed a system called SANE in which each individual of the population represents a hidden neuron [82]. In this work, several neural networks are built using random individuals in a single hidden layer, and the fitness of each individual is the average fitness of the networks in which it takes part. An important characteristic is that, in general, these methods only develop architectures and/or weights. The transfer function of each architecture node is assumed to have been previously determined by a human expert, and it is usually the same for all the network nodes (at least, for all the nodes of the same layer). Although the transfer function has been shown to have a great influence on the behavior of the network [83,84], few methods which lead to the evolution of the transfer function [85-89] have been developed. One of the few significant pieces of research can be found in Dorado [30], in which a two-layer GA is used to design the architecture of an ANN and to perform its training. During the evolutionary process, the parameters of the neurons are determined, among which is the transfer function. Recently, a new approach to developing ANNs has emerged. This approach is based on designing neural network ensembles instead of single ANNs [90,91]. Network ensembles are based on linear combinations of different networks, trained with the same data. The output of the whole ensemble is a combination of the outputs of the ensemble networks [92]. The development of these systems has been carried out by means of Evolutionary Computation [93,94] tools in which populations of networks are evolved and the final solution is made up by combining different individuals of the population [95,96]. Another interesting approach to the development of ANNs by means of EC is the evolution of the learning rule. This idea emerges because a training algorithm works differently when applied to networks with different architectures [97]. However, there are few works that focus on the evolution of the learning rule itself [98-103]. One of the most common approaches is based on setting the parameters of the BP algorithm: learning rate and momentum [104,105]. Some authors propose methods in which an EA is used to find these parameters while keeping the architecture constant [106,104]. Other authors, on the other hand, propose to code these BP algorithm parameters together with the network architecture inside the individuals of the population [67,107]. Due to the complexity involved in coding all the possible learning rules, certain restrictions should be established to simplify this representation. Chalmers [108] defined a learning rule as a linear combination of four variables and six constants. Each individual of the population is a binary string which exponentially codifies ten coefficients and one scale value. With respect to the use of GP to develop ANNs, very few works have been published on this topic. Among the most important are those described by Ritchie et al. [130,131]. They use a basic encoding approach in which, using binary trees, simple feed-forward ANNs are developed. This approach is similar to the one described by Koza and Rice [132].
As operators of the trees, they use special nodes to represent weights, transfer functions or inputs of the neural network. In this way, it is very easy to use GP to evolve simple feed-forward neural networks. However, the neurons of the evolved networks cannot reuse the computation done by other neurons. In other words, the output of a neuron can be connected as input to only one other neuron, not to a set of neurons. This is a huge limitation on the processing capacity of a neural network, which is based on the reuse
of previously processed neurons. In those works, the model is used in bioinformatics to model gene interactions for studying human diseases, and no results are reported on applying it to more general problems.
3. Model

This section describes the configuration of the GP algorithm used to develop ANNs. The evolved networks do not follow a traditional layer-based architecture with total connectivity between all the neurons of one layer and the following one. Instead, any kind of topology and architecture is allowed: any neuron can be connected to any other neuron. The only exception to this rule is that no cycles are allowed in the networks (no recurrent networks are allowed). Therefore, in order to allow neurons and connections to be obtained, these have to be represented inside the GP trees as special nodes. When one of these nodes is evaluated, a neuron or a connection between two neurons is created. As no layer information is introduced inside the GP trees, these neurons are not organised into layers, and any kind of connectivity is allowed. The development of an ANN by means of GP is achieved thanks to the typing property of GP [109]. This property allows the development of structures which follow a certain grammar. In this work, these structures have to allow the construction of ANNs. To achieve this, each node of the GP trees has a type. In addition, for those GP tree nodes that are not leaves of a tree (i.e., that have argument nodes), it is necessary to establish the type of each one of their argument nodes. Therefore, with the objective of using GP to generate ANNs, the first thing to do is to define the types that are going to be used. These types are the following [110]:
TNET: This type identifies the network. It is used only in the root of the tree.
TNEURON: This type identifies a node (or sub-tree) as a neuron, whether it is a hidden, an output or an input one.
TREAL: This type identifies a node (or sub-tree) as a real value. It is used to indicate the value of the connection weights, i.e., a node with this type is either a floating point constant or an arithmetical sub-tree which defines a real value. With just these three types, it is now possible to build networks. However, the terminal and function sets are more complicated. The description is the following:
ANN: It is the only node that has a TNET type. As the tree must have the type TNET, this node will be the root of the tree and cannot appear anywhere else. It has as many arguments as the network has outputs. Each of these arguments will be an output neuron, and therefore they will have a TNEURON type.
n-Neuron: Set of nodes that identify neurons with n inputs (other n neurons which pass their outputs to this one). These nodes have TNEURON type and 2n arguments. The first n arguments of such a node designate the neurons or sub-networks that will be inputs to this neuron (TNEURON type). The second n arguments have the TREAL type, and they contain the values of the connection weights of the corresponding input neurons (the first n arguments) to this neuron. Input_Neuron_n: Set of nodes which define an input neuron that receives its activation value from the variable n. These nodes are of TNEURON type and do not have arguments. Finally, and in order to generate the values of the connection weights (which will be sub-trees of the n-Neuron nodes), the arithmetic operators {+, -, *, %} are necessary, where % designates the protected division operation (its result is equal to 1 if the divisor is 0). These nodes perform operations between constants to give rise to new values. Therefore, real values should also be added in order to be able to carry out these operations. These real values are introduced by adding random constants ranging between [-4, 4].
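As a rough sketch of how these typed node sets could be declared, the following fragment records, for every node, its return type and the types of its arguments, which is the information a typed GP engine needs in order to build only well-formed trees. The data layout and names are assumptions made here for illustration; they are not the authors' implementation.

TNET, TNEURON, TREAL = "TNET", "TNEURON", "TREAL"

def node_spec(name, return_type, argument_types):
    return {"name": name, "returns": return_type, "args": argument_types}

def function_set(num_outputs, max_inputs):
    # ANN: root node, one TNEURON argument per network output
    specs = [node_spec("ANN", TNET, [TNEURON] * num_outputs)]
    # n-Neuron nodes: n neuron arguments followed by n weight (TREAL) arguments
    for n in range(2, max_inputs + 1):
        specs.append(node_spec(f"{n}-Neuron", TNEURON, [TNEURON] * n + [TREAL] * n))
    # Arithmetic operators for the weight sub-trees; % is the protected division
    for op in ("+", "-", "*", "%"):
        specs.append(node_spec(op, TREAL, [TREAL, TREAL]))
    return specs

def terminal_set(num_network_inputs):
    specs = [node_spec(f"Input_Neuron_{i}", TNEURON, []) for i in range(1, num_network_inputs + 1)]
    specs.append(node_spec("random constant in [-4, 4]", TREAL, []))
    return specs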
A more formal description of these operations is given in Table 1. In Fig. 1, a simple network which can be generated using these terminal and function sets is shown. The nodes named "IN_x" in the figure correspond to "Input_Neuron_x". The network has 4 inputs and 2 outputs. The two outputs are the neurons created by the two children of the "ANN" node: the nodes labelled as "1" and "2". The first output neuron ("1") is a 2-Neuron; therefore, it receives inputs from two neurons. Since its first two children are "Input_neuron_1" and "Input_neuron_3", this output neuron receives inputs from the first and third input neurons, and these connections are created. The other two children of the node are the weights of these two connections: 3.2 for the connection to the first input neuron and 1.1 (the result of the arithmetic sub-tree 2.1 + (-1)) for the connection to the third input neuron. The second output neuron ("2") is a 3-Neuron, which means that it receives inputs from three neurons. These are the first three children of this neuron: a 2-Neuron (labelled as "3"), the first input neuron and the fourth input neuron. The other three children are the weights of these connections: 2.8 for the connection to the 2-Neuron ("3"), 0.2 for the first input neuron and -2 for the fourth input neuron. Finally, neuron "3" is a 2-Neuron, which means that it receives inputs from two other neurons. In this case, these two neurons are the second and third input neurons (its first two children), and the weights of these two connections are determined by the other two children: -1.2 for the second input neuron and 2.5 for the third input neuron. It is important to keep in mind that, during the creation of a neuron, in the process of creating or referencing neurons, the same neuron can appear several times as input of another neuron. In this case, a new input connection between these two neurons is not established; instead, the existing connection weight is modified by adding the weight value of the new connection to it.
Table 1. Terminal and function sets.

Name              Type      Num. arguments   Argument types
Function set
  ANN             TNET      n                TNEURON, ..., TNEURON
  n-Neuron        TNEURON   2n               TNEURON, ..., TNEURON, TREAL, ..., TREAL
  +, -, *, %      TREAL     2                TREAL, TREAL
Terminal set
  Input_Neuron_n  TNEURON   -                -
  [-4, 4]         TREAL     -                -
Fig. 1. GP tree and its corresponding ANN.
Therefore, a common situation is that an "n-Neuron" operator does not reference n different neurons; instead, there may be several repeated neurons, especially if n has a high value. It is necessary to limit the value of n, i.e., to set the maximum number of inputs a neuron can have. A high value will surely lead to the described effect, i.e., not all those inputs will be used effectively and some of them will be repeated. However, a reasonably high value is also necessary in order to allow neurons with a high number of inputs when required. This typing system allows the construction of simple networks, but it has a significant disadvantage: it does not allow the reuse of network parts. With these terminal and function sets, one neuron cannot be the input to more than one different neuron (except the input neurons); i.e., the same neuron cannot be referenced several times from different parts of the ANN. This is a big drawback, because it removes one of the big advantages of ANNs, which is the partial reuse of their structure. As ANNs have a high connectivity, they massively reuse previously computed results, turning many parts of the network into functional blocks. Therefore, the described codification of the ANNs (terminal and function sets) is insufficient and, in order to overcome this problem, new elements have to be added to the terminal and function sets. In particular, the system has been extended with a list which allows previously created neurons to be referenced. While the tree is being evaluated and the network is constructed, the neurons created in the network are also stored in this list. By means of special operators, these neurons can be extracted from the list so they can be referenced. In order to extract neurons from the list, an index is used to reference one of its elements (a neuron), which will be the next neuron to be extracted after the execution of the corresponding operator. In this way, two new operators are needed: one to extract neurons from the list and one to modify the position of the index. The addition of neurons to the list is done automatically, at the same time that the tree is being evaluated. The description of the new operators (nodes) is the following:
"Pop" (TNEURON type): This node extracts the neuron from the list at the position pointed to by the index. This node replaces the evaluation of a neuron, because it returns an already existing one, and therefore it does not have arguments.
"Forward" (TNEURON type): This node moves the list index forward one position. It has one argument, with the TNEURON type. Unlike the previous operator, its evaluation does not replace the evaluation of a neuron, because its argument will be a neuron. The evaluation of this node has two effects: it moves the list index forward one position, and it returns the neuron resulting from the evaluation of its argument.

Therefore, a "Pop" node may be found in a tree instead of an "n-Neuron" node, meaning that, instead of creating a neuron, a previously created one will be referenced. It is also possible to find a "Forward" node, which will increment the index of the list but will also return a neuron, regardless of whether it is a newly created one or one that already exists. In the evaluation of the "n-Neuron" operator, the newly created neuron must be added to the list. The evaluation of this operator implies the possible creation of other neurons. The order in which this neuron is added to the list, relative to the evaluation of the arguments of this operator and the creation of new neurons, is essential. There are two possibilities:

The neuron is created and added to the list before evaluating its arguments and creating the respective neurons. In this case, when a node with a TNEURON type is evaluated in a sub-tree hanging from one of this neuron's arguments, and this node is a "Pop" node, a neuron from the list will be referenced. Given that the neuron being created is already present on the list, it may be the one referenced, which means that a recurrent link is being created. Therefore, this evaluation order allows the creation of recurrent networks.
The neuron is created, its arguments are evaluated, and the neuron is added to the list after the evaluation of its arguments. In this case, during the creation of the links of the argument neurons, the neuron will not be present on the list, which means that recurrent connections to ancestor neurons are not allowed. This is the case studied in this work, in which recurrent networks are not developed.
For a better understanding of this set of terminals and functions, pseudocode is included herein to show how each one of the nodes is evaluated.
evaluate(node)
begin
  case node of:
  ANN:
    begin
      create empty network, with input neurons
      create empty List
      Index = 1
      for i = 1 to (number of outputs)
        neuron = evaluate(argument i)
        set neuron as the i-th output of the network
      endfor
      return network
    end
  n-Neuron:
    begin
      create neuron
      for i = 1 to n
        input_neuron = evaluate(argument i)
        input_weight = evaluate(argument i+n)
        if (input_neuron is already an input of neuron) then
          update the weight of input_neuron with input_weight
        else
          set input_neuron as input of neuron with input_weight
        endif
      endfor
      add neuron to the List
      return neuron
    end
  Input_Neuron_n:
    return input neuron n
  arithmetical operator:
    begin
      f1 = evaluate(argument 1)
      f2 = evaluate(argument 2)
      return result of the operation between f1 and f2
    end
  float f:
    return f
  Pop:
    return List[Index]
  Forward:
    begin
      Index = Index + 1
      neuron = evaluate(argument)
      return neuron
    end
  endcase
end
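To make the genotype-to-phenotype step more concrete, the following short sketch shows how a decoded network of this kind, with arbitrary feed-forward connectivity rather than layers, can be evaluated on an input pattern while reusing the outputs of shared neurons. This is only an illustration under assumptions (sigmoid activation, no recurrent links, invented class and function names, and 0-based input indices); it is not the authors' implementation. The example wiring reproduces the network of Fig. 1.

import math

class Neuron:
    def __init__(self, name):
        self.name = name
        self.inputs = []  # list of (source, weight); source is a Neuron or an input index

    def add_input(self, source, weight):
        # A repeated reference to the same source adds the weights, as described above
        for i, (src, w) in enumerate(self.inputs):
            if src == source:
                self.inputs[i] = (src, w + weight)
                return
        self.inputs.append((source, weight))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def activate(neuron, pattern, cache):
    # Recursively evaluate a neuron; 'cache' reuses the values of shared sub-networks
    if neuron in cache:
        return cache[neuron]
    total = 0.0
    for source, weight in neuron.inputs:
        value = pattern[source] if isinstance(source, int) else activate(source, pattern, cache)
        total += weight * value
    cache[neuron] = sigmoid(total)
    return cache[neuron]

# Wiring of the example network of Fig. 1 (weights taken from the figure)
n3 = Neuron("3"); n3.add_input(1, -1.2); n3.add_input(2, 2.5)        # inputs x2 and x3
out1 = Neuron("1"); out1.add_input(0, 3.2); out1.add_input(2, 1.1)   # inputs x1 and x3
out2 = Neuron("2"); out2.add_input(n3, 2.8); out2.add_input(0, 0.2); out2.add_input(3, -2.0)
pattern = [0.5, 0.1, 0.9, 0.3]
cache = {}
print([activate(output, pattern, cache) for output in (out1, out2)])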
An example of a network including these operators is given in Fig. 2. In this figure, the arithmetic operation sub-trees which determine the connection weight values have been left out for the sake of simplicity. As in the previous figure, the nodes labelled "IN_n" refer to "Input_Neuron_n". The figure shows the network being generated at successive steps of the tree evaluation. The explanation of these steps, as labelled in the figure, is the following:
a) Network generated during the evaluation of the node labelled as ‘‘1’’, after the complete evaluation of the nodes labelled as ‘‘2’’ and ‘‘3’’, and before finishing the evaluation of node ‘‘1’’. Note that part of the ANN has been created and connected, and some neurons have been added to the list. The neurons added are those whose evaluation has finished (‘‘2’’, ‘‘3’’ and ‘‘4’’). Neuron ‘‘1’’ has not been added to the list yet because its evaluation has not finished yet. Until this part, the evaluation of the tree is very similar to that shown in the example of Fig. 1, except for the addition of the list. Note also that node ‘‘4’’ has
three arguments, but neuron ‘‘4’’ only has two inputs. This occurs because 2 of the nodes’ arguments refer to the same input neuron. b) This point is immediately after a), after node ‘‘1’’ has been completely evaluated. The only difference with the network previously described in a) is that this neuron (‘‘1’’) has been added to the list because its evaluation has been finished. After this point, the second output neuron (labelled as ‘‘5’’) will begin its evaluation. Fig. 2b shows in bold the addition of the neuron to the list. c) Network generated during the evaluation of the second output neuron (‘‘5’’), and before the evaluation of the first ‘‘Pop’’ node (this node must return a neuron). As the evaluation of the second output neuron has begun, this neuron has been created, but not added to the list, because its evaluation has not finished yet. Fig. 2c shows in bold the creation of the new neuron. d) This point is after the evaluation of the first ‘‘Pop’’ node. This node returned a neuron, and this neuron is the one pointed by the index in the list (neuron ‘‘2’’). As a consequence, a new connection is created between neuron ‘‘2’’ and neuron ‘‘5’’,
which is the father node of this "Pop" node. Fig. 2d shows in bold the creation of this connection.
e) Network generated after the execution of the "Forward" node (second child of neuron "5"), and before the evaluation of the last "Pop" node. The execution of this "Forward" node only makes the list index point to the next neuron (neuron "4"). This node also has to return a neuron, which will be the one returned by its only child. Fig. 2e shows in bold the movement of the index in the list.
f) At this point, the "Forward" and "Pop" nodes have finished their evaluation. As a result of the evaluation of the "Pop" node, the neuron pointed to by the index in the list (neuron "4") was returned to its father node. This node ("Forward") returned this neuron ("4") to its father (neuron "5"). Therefore, a connection between neurons "4" and "5" is created. Fig. 2f shows in bold the creation of the new connection in the network. After this point, the evaluation of the second output neuron ("5") is finished and, therefore, it would be added to the list. However, as the evaluation of the tree also finishes, this has no effect on the network created and therefore it is not shown in the example.

Fig. 2. GP tree and its corresponding ANNs in different parts of the tree evaluation.
Once a tree has been evaluated, the genotype has been turned into a phenotype, i.e., a network with fixed weight values that can now be evaluated; it does not have to be trained. The evolutionary process requires the assignment of a fitness value to each genotype. This fitness value is the result of evaluating the network on the training pattern set that represents the problem. In this case, the result of the evaluation is the mean squared error (MSE) of the difference between the network outputs and the expected ones.

However, this fitness value has been modified in order to make the system generate simple networks. For that reason, it is penalized with a value which is added to this error: a penalization value multiplied by the number of neurons of the network. Given that the evolutionary system has been designed to minimize the error value, adding this term makes a bigger network return a worse fitness value; therefore, the appearance of simple networks is favoured, because the added penalization is proportional to the number of neurons of the ANN. The final fitness calculation is the following:

fitness = MSE + N * P

where MSE is the mean squared error of the network on the training pattern set, N is the number of neurons of the network, and P is the penalization value applied to the number of neurons. The constant P has a low value and, as will be shown in the experiments, it has a great importance in the evolution of the system. In this way, by making a network with more neurons but identical behaviour have a worse fitness value, the system is driven to look for simple networks with fewer neurons. These metrics are similar to those used in previous works [12,13] to evolve finite state machines with a penalty factor on the number of states.

A common problem when generating and training ANNs is the tendency to overfit them. This can be observed when the training fitness value (in this case, the MSE) keeps improving during the training process while the test value begins to worsen. In that case, the network loses its capacity for generalization. In order to avoid this problem, a different pattern set is used besides the training set, with the objective of performing a validation. This validation set controls the training process in order to avoid the overfitting of the networks [111,112]. Using this validation set, the training occurs in a similar way, but each network returned by the evolutionary process is also evaluated on the validation set, which gives an estimation of the test error. The system always returns the network that obtained the best result on the validation set, even if the training process keeps going and other networks with better training results are obtained. This is carried out in a similar way to the early stopping technique [113,114]. Thus, the validation set provides an estimation of how a network is going to behave in the test.
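A compact sketch of this fitness computation and of the validation-based choice of the returned network follows. The helper names, the 'forward' callback and the assumption of a single scalar output per pattern are illustrative conventions introduced here, not part of the original system.

def mse(network, patterns, targets, forward):
    # Mean squared error of the network outputs against the expected outputs
    errors = [(forward(network, p) - t) ** 2 for p, t in zip(patterns, targets)]
    return sum(errors) / len(errors)

def fitness(network, patterns, targets, forward, num_neurons, penalty):
    # fitness = MSE + N * P, so larger networks receive a worse (higher) fitness
    return mse(network, patterns, targets, forward) + num_neurons * penalty

def pick_returned_network(candidates, val_patterns, val_targets, forward):
    # The system returns the network with the best validation MSE seen so far,
    # in the spirit of early stopping, even if training fitness keeps improving
    return min(candidates, key=lambda net: mse(net, val_patterns, val_targets, forward))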
4. Problem description

In order to test the behavior of this system, experiments were carried out on diverse problems of different natures. All these databases were taken from UCI [115], a repository of databases for the machine learning community. Therefore, the databases used herein are well known and have already been used with different machine learning tools, including ANNs, which makes them useful to validate and test
the system described here, and compare it with other ANN generation approaches. The first problem to be solved involves the diagnosis of appendicitis. This database has 106 cases with only 8 attributes. The second problem to be solved involves the classification of breast cancer into two possible types: benign and malignant cancer. The database has 699 cases: 458 benign (65.5%) and 241 malignant (34.5%). Each data point is characterized by 9 attributes that are considered to be continuous, although they take discrete values ranging between 1 and 10. The next problem is the classification of iris flowers. This problem was originally posed by Fisher in 1936 [116] as an application in the field of discriminatory analysis and clustering. Four continuous parameters are used to solve this problem. These parameters are measured in millimetres and correspond to four characteristics of the flowers: length and width of the petals and sepals. The measurements correspond to 150 flowers belonging to three distinct species of iris: Setosa, Versicolor and Virginica (50 of each type). Each output must return an expected value equal to 1 if the flower belongs to that class, or 0 otherwise. Another problem to be solved is the classification of poisonous mushrooms. In this case, the objective is to determine, from a set of 22 symbolic features, whether a mushroom is poisonous or not. To this end, a database with 5644 data points is used. In another problem, the objective is to detect whether heart disease is present or not. These data correspond to 13 measurements taken from 303 patients at the V.A. Hospital of Cleveland. The last problem to be solved consists of classifying radar measurements from the ionosphere. These data were taken by a system in Goose Bay, Labrador, Canada. This system consists of a set of 16 high frequency antennas with a transmission power of about 6.4 kW. The "good" measurements are those that demonstrate evidence of some type of structure in the ionosphere. The "bad" measurements are the ones that do not demonstrate this kind of evidence, i.e., the signals pass through the ionosphere. From a set of 34 attributes, the objective is to predict whether there are structures or not. In this case, there are 351 instances. Table 2 shows a small summary of the most important characteristics of the 6 problems to be solved. It is important to keep in mind that most of them have only one output. However, in the case of Iris, the objective is a classification into three possible classes, meaning that networks are generated with 3 outputs, one for each class. The attributes of all these databases have been normalized between 0 and 1.

Table 2. Summary of the datasets used.

Dataset           Num. inputs   Num. data points   Num. outputs
Appendicitis      8             106                1
Breast cancer     9             699                1
Iris flower       4             150                3
Mushroom          22            5644               1
Heart disease     13            303                1
Ionosphere        34            351                1
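The min-max normalization mentioned above can be sketched as follows, under the assumption that each attribute (column) is scaled independently into [0, 1]; this is an illustration, not the authors' preprocessing code.

def normalize_columns(rows):
    # Scale every attribute (column) of the dataset into [0, 1] with min-max normalization
    columns = list(zip(*rows))
    scaled = []
    for column in columns:
        lo, hi = min(column), max(column)
        span = (hi - lo) or 1.0  # guard against constant columns
        scaled.append([(value - lo) / span for value in column])
    return [list(row) for row in zip(*scaled)]

print(normalize_columns([[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]]))
# [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]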
5. Parameters

In this work, the influence of different parameters on the system performance is studied. Two of the most important, and the first to be studied, are those that limit the complexity of the networks to be created. These parameters are the maximum height of the tree and the maximum number of inputs that a neuron can have. A very low value of these parameters will
prevent the creation of networks complex enough to solve the problems, whereas a very high value will cause the creation of excessively large networks, which leads to efficiency losses in the evolutionary process and possibly to overfitting. In the experiments carried out, the following values have been taken:
For the maximum height of the tree, a value of 6 seems to be
adequate in order to get networks complex enough to solve arbitrarily complex problems. For this reason, values of 4, 5 and 6 have been taken for this parameter. In previous experiments, tests with values of 7 or even higher have been done. However, the complexity of the generated networks and the consequent demands of the computational resources provoke an enormous efficiency loss, reaching the point that the necessary computational requirements are so high that it is almost impossible to work with those values. With respect to the maximum number of inputs of each neuron, it also occurs that a value of 12 provides enough complexity to solve any problem. For this reason, experiments have been done with this parameter, taking values from 3 to 12. As performed with maximum height, experiments have been carried out with values higher than 12, but the computational resources required by the system provoke serious efficiency losses.
The influence of these parameters on each problem will be studied. The objective of this study is to determine a set of values that produce good results, not for a particular problem, but for any problem that could come up. In general, these parameters should be fitted to the particular problem to be solved, but one of the objectives of this work is to establish values which allow the solution of arbitrarily complex problems. For this reason, problems of very diverse complexity are used to determine which values of these parameters turn out to be good for the solution of different types of problems. Another parameter, which is an object of study herein, is the size of the GP population. However, the most important parameter to be studied is the penalization applied to the number of neurons. This parameter will lead to obtaining simple or complex networks. As will be shown, it has a crucial relevance in the behavior of the networks returned by the system. In addition, this parameter also has a notable influence on the entire evolutionary process. Prior to the study of these parameters, diverse experiments were carried out with the objective of setting the values of the rest of the GP parameters. In these experiments, different values were used for the different parameters of the system. As a result of these experiments, the following configuration was used for the experiments described in this paper:
crossover rate: 95%, mutation probability: 4%, selection algorithm: 2-individual tournament, creation algorithm: ramped half-and-half.
These parameter values were used because they returned the best results in most of the experiments performed. For all the experiments carried out in the following sections, the results obtained have been measured with respect to the computational effort, i.e., the number of executions of the fitness function. For each problem and parameter used in the experiments, 20 different trials have been done in which 500,000 fitness function iterations have been carried out.
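For reference, this standard configuration can be collected in a single structure; the field names are introduced here for illustration, and the height, input and population values are the ones adopted later in Section 6.

GP_CONFIG = {
    "crossover_rate": 0.95,
    "mutation_probability": 0.04,
    "selection": "2-individual tournament",
    "creation": "ramped half-and-half",
    "population_size": 1000,       # chosen in Section 6
    "max_tree_height": 5,          # chosen in Section 6
    "max_inputs_per_neuron": 6,    # chosen in Section 6
    "neuron_penalty": 0.0,         # no penalization; varied in Section 6
    "max_fitness_evaluations": 500_000,
    "trials_per_configuration": 20,
}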
6. Results

This section shows a study of the system's performance. The objective of this study is to find a set of parameter values that return good results when working with problems of different complexities. For this reason, validation results are reported in this section, because they are used to tune the parameters of the system and to select between different networks (in this case, each of them was obtained with a different parameter configuration). In order to evaluate the real performance of this system, and compare it with other systems, Section 7 reports a set of accuracies on the test set. In this section, databases were randomly divided into two parts, taking 70% of the data for the training set and the other 30% for the validation set.

The first experiments have the objective of determining the best complexity level of the system for solving arbitrarily difficult problems. As already described, this complexity is measured with two parameters: the maximum tree height and the maximum number of inputs of the neurons. With respect to the maximum height, values from 4 to 6 have been taken; a height lower than 4 does not provide the complexity necessary to solve these problems. Similarly, the minimum value taken for the maximum number of inputs of a neuron is 3, a value that, as can be imagined, is too low to obtain good results. Given that the problems have been normalized between 0 and 1, in all the experiments run the transfer function (which remains fixed, is not evolved with the network, and is the same for all neurons) is the sigmoid, which returns values between 0 and 1.

Fig. 3 shows the results obtained in validation for the resolution of the previously described problems. To achieve them, the algorithm was executed until a maximum of 500,000 fitness function evaluations was reached. During the execution of the algorithm, its output was the individual that obtained the best validation value up to that point. For each combination of parameters, 20 different trials have been performed. The results shown are the arithmetic means of the results (MSE) of each trial. In all these results, a constant population size of 1000 individuals has been maintained, with a penalization of 0 on the number of neurons, i.e., without this optimization. Since the problems have very different complexities, the results shown in this figure differ greatly between problems. For instance, the MSE results are in general higher in the iris flower problem than in the breast cancer problem (even though this last problem has more inputs than iris), because it is a more complex problem. In general, the Heart Cleveland and Ionosphere problems are harder to solve than the others, with the exception of the appendicitis problem. In this case, the small size of the pattern set (only 106 cases) causes an overfitting problem: when dividing the dataset into two parts, each one of them has a very low number of patterns, not enough to represent the whole data. This is the reason why the MSE results on this problem are much higher than on the others.

Fig. 3 shows that the behavior of the system is quite independent of the values of these parameters. From a value of 5 for the number of inputs upwards, the results are equivalent, and some problems (ionosphere, mushroom) seem to show a worse behavior with a height of 4. Moreover, the system does not show a significant difference for values of the maximum number of inputs from 6 to 12.
Therefore, the system is robust for values within the intervals [5,6] for the maximum height and [6,12] for the maximum number of inputs. To go on with the study of the rest of the parameters, a value has to be adopted for these two and kept constant. These values are a height of 5 and 6 as the maximum number
Fig. 3. MSE results in validation for different heights and numbers of inputs (one panel per problem: Appendicitis, Breast Cancer, Iris, Mushroom, Heart Cleveland and Ionosphere; each panel plots the validation MSE against the maximum number of inputs, from 3 to 12, for maximum heights 4, 5 and 6).
Table 3. MSE results in validation with different population sizes.

Population          Appendicitis   Breast cancer   Iris      Mushroom   Heart Cleveland   Ionosphere
100                 0.16600        0.01430         0.02590   0.01505    0.14787           0.06603
500                 0.15801        0.01353         0.02356   0.00714    0.14211           0.06221
1000                0.15833        0.01350         0.02188   0.00814    0.13956           0.0577
2000                0.15603        0.01362         0.02157   0.01184    0.14059           0.05911
3000                0.15444        0.01439         0.02964   0.01987    0.14043           0.06143
4000                0.15285        0.01386         0.02809   0.01856    0.14058           0.06388
5000                0.15549        0.01375         0.02700   0.02715    0.14047           0.06798
6000                0.15811        0.01462         0.03626   0.02664    0.14087           0.07339
7000                0.15680        0.01394         0.03888   0.03008    0.14078           0.07965
8000                0.15851        0.01389         0.03889   0.02928    0.14122           0.07597
9000                0.15665        0.01388         0.04402   0.03526    0.14153           0.08517
10,000              0.15991        0.01384         0.04829   0.02998    0.14105           0.08522
Ratio (500-4000)    1.0359         1.0659          1.3741    2.7829     1.0183            1.1071
Std. (500-4000)     0.0023         0.0004          0.0037    0.0059     0.0009            0.0025
of inputs to each neuron. These values were taken because of efficiency reasons: to reduce the computational load while obtaining good results in all the problems. Once these parameters are set, different population sizes have been tested in order to evaluate how this particular parameter affects the behavior of the system. Population sizes ranging from 100 to 10,000 individuals have been used. The results, which can be seen in Table 3, show the arithmetic mean of the errors in the validations obtained from those 20 tests. This table shows that a small population size returns better results. However, once again, the differences are not very
significant for different values of this parameter in most of the problems, and the results seem to be stable for population sizes between 500 and 4000 for most of the problems. The last two rows of this table are the ratio between the highest and the lowest MSE values for populations between 500 and 4000, and the standard deviation of the MSE values for populations between 500 and 4000. As can be seen in these two rows, in most of the problems the ratio is very close to 1. This means that in these problems the differences are not very significant. In one of these problems, Mushroom, there is a higher variation among these population size values. However, the
value chosen for the rest of the experiments returns a good MSE in this problem and the rest of them. This value is a population size of 1000 individuals. Finally, experiments have been carried out with a parameter of great importance, the penalization to the number of neurons. As previously mentioned, the value of this parameter is added to the fitness function multiplied by the number of neurons. The values taken by this parameter are 0.1, 0.01, 0.001, 0.0001, 0.00001 and 0. The value of 0 is the one taken for the experiments run this far (no penalization). Fig. 4 shows the evolution followed by the system in the ionosphere problem with different penalization values. In this graph, the MSE values (without penalization) of the validation corresponding to the best individual found by the algorithm are shown. The X-axis represents the computational effort, measured as the number of fitness function executions, up to a maximum of 500,000 executions. As it can be seen in the graph, the error obtained is lower when the penalization is lower. These error values are the result of applying the arithmetical mean of errors obtained after performing 20 different runs for each penalization value in each problem. The same results were found for the other problems, and similar graphs could be drawn, showing the same behavior. The computational cost required to obtain those solutions to the problem can be seen in this figure. In some of the problems, the system was still improving when it reached the 500,000 fitness function executions. This occurred in the case of the most complex problems (poisonous mushrooms or ionosphere). On the
other hand, in other simpler problems, such as the prediction of breast cancer or appendicitis, the best solution was found much sooner than this. In these cases, the rest of the training overfits the networks. To avoid this, and as already mentioned, the system’s output has been maintained as the individual with the best validation. Fig. 5 shows the average number of neurons of the ANNs that returned the best MSE values in validation (they correspond to the error values of Fig. 4) in the ionosphere problem. Obviously, the larger the penalization, the lower is the number of neurons. With a very high penalization (0.1), the system will generate only networks with very few neurons. With a too low penalization, the system generates networks with many neurons, but, as previously seen, the error obtained is lower. The same results were found for the other problems, and similar graphs could be drawn, showing the same behavior. Table 4 shows a summary of these two figures. This table shows the average number of neurons and connections of the networks found over the total number of executions for each penalization and each problem. In addition, it also shows the arithmetic mean of the MSE obtained in the training and the validation after 500,000 executions of the fitness function in the 20 different runs. The neuron number in this table means the total number of neurons apart from inputs (hidden+output) because no processing is performed on the input neurons and they have no input weights. For example, in the case of Iris flower or Heart Cleveland with a penalization of 0.1, the number of neurons is, respectively, 3 and 1. This means that no hidden neurons are used.
Fig. 4. MSE in validation for different penalization values in the ionosphere problem (validation MSE vs. computational effort, ×100 fitness function executions; one curve per penalization value 0.1, 0.01, 0.001, 0.0001, 0.00001 and 0).
Fig. 5. Average number of neurons for different penalization values in the ionosphere problem (number of neurons vs. computational effort, ×100 fitness function executions; same penalization values).
Table 4
Results obtained with different penalization values for each problem. For each problem (Appendicitis, Breast Cancer, Iris Flower, Mushroom, Heart Cleveland and Ionosphere) and each penalization value (0.1, 0.01, 0.001, 0.0001, 0.00001 and 0), the table reports the average number of neurons (with, in parentheses, the number of runs in which each neuron count was found), the average number of connections, and the mean training and validation MSE over the 20 runs after 500,000 fitness function executions.

When non-integer numbers of neurons appear in Table 4, they do not represent the number of neurons found in a single run but the average over the 20 different runs. Due to the stochastic nature of the methods used, the experiments must be repeated several times and the average of the results reported, and this average still gives an idea of the complexity of the networks found. For instance, a value of 3 means that 3 neurons were found in all 20 runs, while a value of 3.85 means that 3 neurons were found in 3 runs and 4 neurons in the remaining 17. In Table 4, next to the average number of neurons for each problem and penalization value, the neuron counts found in the individual runs are shown: each entry gives a number of neurons and, in parentheses, the number of runs in which that count was found. For instance, the value of 1.6 for appendicitis with a penalization of 0.01 means that 1 neuron was found in 8 runs and 2 neurons in the remaining 12. When the average is an integer, only the total number of runs (20) is shown next to it. The larger number of neurons in the case of Iris is due to these networks having 3 output neurons, while the other problems have only 1 output neuron, and the neuron count includes hidden and output neurons.
As previously commented, with a high penalization value (0.1) the system generates networks that only have output neurons. The number of neurons increases as the penalization value decreases, with the consequent reduction of the training error. As the number of neurons increases, so does the number of network connections.
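The averages quoted above are simply run-count weighted means; a one-line check of the values discussed in the text (the helper name is illustrative):

def mean_neurons(counts):
    # counts: list of (number_of_neurons, number_of_runs) pairs from Table 4
    total_runs = sum(runs for _, runs in counts)
    return sum(n * runs for n, runs in counts) / total_runs

print(mean_neurons([(1, 8), (2, 12)]))   # 1.6  (appendicitis, penalization 0.01)
print(mean_neurons([(3, 3), (4, 17)]))   # 3.85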
These two values (the number of neurons and connections) are indicators of the network complexity: the more complex the network, the lower the training errors obtained. However, this does not mean that the best results are achieved without penalization. Looking at the validation errors, a low penalization value favors better results than a higher one. This is because the complexity of the network had already been limited by the values chosen for the height of the tree and the maximum number of inputs of each neuron; if these values were high, the penalization parameter could take high values in order to limit the size of the generated networks. In any case, with the values taken here, in complex problems such as the classification of poisonous mushrooms the results indicate that it is better not to use a high penalization value, because a more complex network is needed to solve the problem; in these situations it would be advisable to increase the tree height and the maximum number of inputs per neuron. In simpler problems such as breast cancer prediction, the values that limit the network complexity (tree height and maximum number of inputs to each neuron) are enough to solve the problem satisfactorily. This means that the penalization value is certainly useful to optimize the resultant networks and to generate networks with few neurons. In the case of breast cancer prediction, the process found a solution much earlier than the 500,000th execution of the fitness function, which suggests that the maximum allowed complexity of the network is enough to solve the problem and that the system uses the penalization parameter to optimize the network architecture. Table 4 also shows that the selected problems have very different complexities, which can be seen in the differences between the validation results obtained in the different problems.
Table 5
Recommended intervals for the parameters and values chosen.

                                   Recommended interval    Value chosen
GP parameters
  Population size                  [500, 4000]             1000
  Selection algorithm              -                       2-individual tournament
  Creation algorithm               -                       Ramped half-and-half
  Crossover rate                   -                       95%
  Mutation rate                    -                       4%
  Function selection probability   -                       95%
System parameters
  Height                           [5, 6]                  5
  Maximum number of inputs         [6, 12]                 6
  Penalization                     [0, 0.001]              0.00001
As stated before, some of the problems (Ionosphere, Heart Cleveland) are harder than others (Breast Cancer, Iris flower), and therefore the error obtained is higher. In the case of the appendicitis problem, the difference between the training and validation results reveals overfitting, caused by the low number of patterns (106): when this set is split into training and validation, the two subsets are too small to correctly represent the original dataset. Looking at this table, only one problem returned good results with a high value of the penalization parameter; for the rest, the best results were obtained with low values, reaching a point at which the differences in error are not very significant. In general, values lower than 0.001 do not produce big variations in the results, so the system is robust in this range, and the recommended interval for this parameter is [0, 0.001]. From now on, the value adopted for the penalization is 0.00001, which has returned good results for most of the problems. Table 5 shows a summary of the recommended intervals for the parameter values and the values adopted in the experiments.
7. Test and comparison with other methods

The system described here has been compared with other ANN generation and training methods in order to evaluate its performance. In this section, accuracies on the test set are reported. When comparing classification algorithms, the most common procedure is to use cross validation [117] to estimate the accuracy of the algorithms and then t-tests to confirm whether the results are significantly different. In k-fold cross validation, the dataset D is divided into k non-overlapping sets D1, ..., Dk; in each iteration i (from 1 to k), the algorithm is trained with D\Di and tested on Di. However, some studies have shown that comparing algorithms with these t-tests on cross validation results leads to what is known as Type I error [36]. In [118] the performance of the k-fold cross validation method combined with a t-test was analyzed; a modified statistic was proposed, and it was proven more effective to perform k/2 executions of a 2-fold cross validation test, with different permutations of the data, than to run a single k-fold cross validation test. As a compromise between the accuracy of the test and the calculation time, 5 executions of a cross validation test with k = 2 were proposed, which gives the method its name, 5×2cv. In each of the 5 iterations, the data are divided randomly into two halves: one half is used to train the algorithm and the other to test the final solution,
meaning that there are 10 different test values (5 iterations, 2 results each) [36]. In [37] the 5×2cv method is used to compare different evolutionary techniques for generating and training ANNs, and the results reported there are the arithmetic means of the accuracies obtained in each of the 10 tests. These values are taken as the basis for comparing the technique described in this work with other well-known ones; each of the 10 test values was obtained after training with the method described herein. Regarding the parameters used to generate the ANNs for this comparison, the values are the same ones used in Section 6, given in Table 5.
An additional drawback of the 5×2cv technique is that it requires dividing the pattern set into two halves. This may not be a problem when working with large sets, but the pattern sets used here are quite small (see Table 2: breast cancer, iris, heart disease and ionosphere problems). To avoid overfitting, this work splits the training set into training and validation sets, performs a validation of the obtained networks, and returns the network with the best result on the validation subset. With 5×2cv, this means dividing the training set, which is already half of the initial pattern set, into two parts, so either training or validation is carried out with a very reduced number of data points, and one of these sets will surely not be representative of the search space being explored. To verify this effect, different experiments have been performed in which the training set was divided into training and validation, moving 50%, 40%, 30%, 20%, 10% and 0% of the training data points into the validation set. As previously explained, with validation the system returns the network that produced the best results on the validation set (which is an estimate of the test results); extracting 0%, i.e. performing no validation, shows the effect of overfitting. The results for the ionosphere problem are shown in Fig. 6, which plots the test accuracies obtained up to a maximum of 500,000 fitness function executions.
Fig. 6 shows that the best results are obtained when the training set is not split again into training and validation, i.e. when no validation set is used. This means that when the training set is split once more, the new sets are too small and thus not representative of the search space, especially the validation set. The fact that the results improve as the validation set grows suggests that it does not have an appropriate size to be representative of the pattern set and to offer a reliable estimate of the test. With a sufficiently large pattern set, it is recommended to divide it into three parts: training, test and validation. However, if this is not possible, it should be divided into only training and test.
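A sketch of the 5×2cv accuracy estimate described above (Python; the train_and_test callable is an assumption standing in for whatever model training routine is being evaluated, and scikit-learn is used only for the random halving):

import numpy as np
from sklearn.model_selection import train_test_split

def five_by_two_cv(X, y, train_and_test, seed=0):
    # 5 iterations; in each one the data are split into two random halves and
    # each half is used once for training and once for testing, giving 10
    # accuracy values whose mean is the reported accuracy.
    accuracies = []
    for i in range(5):
        X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=seed + i)
        accuracies.append(train_and_test(X1, y1, X2, y2))
        accuracies.append(train_and_test(X2, y2, X1, y1))
    return float(np.mean(accuracies)), accuracies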
Fig. 6. Comparison of the accuracies obtained in the 5×2cv test taking different amounts of the training set in order to perform validation in the ionosphere problem (test accuracy, %, vs. computational effort, ×100 fitness function executions; curves for no validation and for 50%, 40%, 30%, 20% and 10% validation).
Table 6
Network topologies used for the comparisons.

                   Proposed here                                [37]
                   Num. inputs  Num. hidden  Num. outputs       Num. inputs  Num. hidden  Num. outputs  Epochs
Breast cancer      9            3.3          1                  9            5            1             20
Iris flower        4            8.85         3                  4            5            3             80
Heart Cleveland    13           2.9          1                  26           5            1             40
Ionosphere         34           9.25         1                  34           10           1             40
The same results were obtained with the other problems, and similar graphs could be drawn.
A more detailed description of all the algorithms with which this technique is compared can be found in [37], together with the average times needed to achieve the reported results. A common way to compare the performance and efficiency of different algorithms is to compare execution times, as in [37]; however, this is not useful for comparing against other published results, because the same processor is not available, so a different measure has to be used. As EAs are being used, the number of generations might seem a natural measure, but it is not a good one, because a single generation of different algorithms can require very different computational times; even the evaluation of a single individual can take very different amounts of time, since computing the fitness of an individual may involve training a network in one algorithm and only evaluating it on the training set in another. A more universal measure is the computational effort, i.e. the number of times the pattern set is evaluated. With this measure, different algorithms can be compared using time expressed as computational effort. In this paper, the computational effort of each algorithm in [37] was calculated from parameters such as the population size, the number of generations, or the number of times the BP algorithm is run. The exact calculation varies from one algorithm to another, but it is very similar in all cases: it consists of computing the effort needed to evaluate each individual and multiplying that value by the population size and the number of generations.
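A minimal sketch of that effort calculation for a generational EA (the function name and the assumption that every individual costs the same number of pattern-set evaluations are illustrative, not taken from [37]):

def computational_effort(pop_size, generations, evals_per_individual=1):
    # Number of pattern-set evaluations: each generation evaluates the whole
    # population, and evaluating one individual may itself require several
    # passes over the pattern set (e.g. when BP epochs are part of the fitness).
    return pop_size * generations * evals_per_individual

print(computational_effort(84, 100))                           # plain evaluation: 8400
print(computational_effort(84, 100, evals_per_individual=5))   # hybrid with 5 BP epochs: 42,000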
Table 7
Comparison of this technique with different ANN training methods. For each problem and method, the first value is the accuracy (%) ± standard deviation reported in [37]; for the evolutionary methods, the computational effort needed to obtain it is given in parentheses, followed by the accuracy ± standard deviation obtained by the technique proposed here at that same effort.

                  BP [37]         Binary GA [37] (effort) vs. proposed      G3PCX [37] (effort) vs. proposed
Breast cancer     96.39 ± 0.58    98.88 ± 0.34 (8400) vs. 95.98 ± 1.04      98.94 ± 2.35 (2,247,200) vs. 96.19 ± 0.97
Iris flower       94.53 ± 2.12    88.67 ± 6.09 (7000) vs. 84.48 ± 2.85      89.73 ± 11.7 (1,566,450) vs. 95.51 ± 1.84
Heart Cleveland   78.17 ± 3.16    87.72 ± 3.42 (13,900) vs. 79.02 ± 3.14    90.42 ± 2.12 (6,055,200) vs. 80.56 ± 2.27
Ionosphere        84.77 ± 3.80    74.10 ± 1.94 (22,400) vs. 83.75 ± 3.81    64.10 ± 2.04 (15,736,050) vs. 88.72 ± 3.02
Mean              88.46           87.34 vs. 85.80                            85.79 vs. 90.24
The evaluation of each individual can imply evaluating an ANN only once, training an entire ANN, or something in between. Once the computational effort was calculated for a specific technique, the proposed system was run until it reached the same number of training pattern set evaluations. At that point, both algorithms being compared ([37] and this work) have run for the same time, measured as computational effort, and therefore the comparison is more accurate (the topologies involved are summarized in Table 6). In each of the comparative tables (Tables 7–11), each cell corresponds to a particular problem and a particular technique and includes five values: the accuracy obtained in [37] with its standard deviation, the computational effort needed to obtain it, and the accuracy obtained with the technique described here at that same effort, with its standard deviation.
Table 8
Comparison of this technique with evolutionary training strategies, using Lamarckian and Baldwinian strategies in 5% of the population (columns for 1, 2 and 5 BP iterations under each strategy; for each problem, the accuracy and standard deviation from [37], the corresponding computational effort, and the accuracy and standard deviation obtained by the technique proposed here at that effort).
Table 9
Comparison of this technique with evolutionary training strategies, using Lamarckian and Baldwinian strategies in 100% of the population (same structure as Table 8).

Table 10
Comparison of this technique with the feature selection method (for each problem, the accuracy and standard deviation from [37], the corresponding computational effort, and the accuracy and standard deviation obtained by the technique proposed here at that effort).
Table 11
Comparison of different ANN development techniques. Each entry gives the accuracy (%) ± standard deviation reported in [37], the computational effort needed to obtain it (in parentheses), and the accuracy ± standard deviation obtained by the technique proposed here with that same effort.

Breast cancer:    Matrix 96.77 ± 1.10 (92,000) vs. 96.27 ± 0.91;  Pruning 96.31 ± 1.21 (4620) vs. 95.66 ± 1.26;  Parameters 96.69 ± 1.13 (100,000) vs. 96.28 ± 0.91;  Grammar 96.71 ± 1.16 (300,000) vs. 96.24 ± 0.91
Iris flower:      Matrix 92.40 ± 2.67 (320,000) vs. 95.22 ± 2.05;  Pruning 92.40 ± 1.40 (4080) vs. 81.43 ± 4.23;  Parameters 91.73 ± 8.26 (400,000) vs. 95.20 ± 2.14;  Grammar 92.93 ± 3.08 (1,200,000) vs. 95.47 ± 1.87
Heart Cleveland:  Matrix 76.78 ± 7.87 (304,000) vs. 80.68 ± 2.40;  Pruning 89.50 ± 3.36 (7640) vs. 77.75 ± 3.28;  Parameters 65.89 ± 13.55 (200,000) vs. 80.71 ± 2.45;  Grammar 72.80 ± 12.56 (600,000) vs. 80.57 ± 2.22
Ionosphere:       Matrix 87.06 ± 2.14 (464,000) vs. 88.29 ± 2.96;  Pruning 83.66 ± 1.90 (11,640) vs. 81.68 ± 3.96;  Parameters 85.58 ± 3.08 (200,000) vs. 87.83 ± 3.14;  Grammar 88.03 ± 1.55 (600,000) vs. 88.31 ± 3.01
Mean:             Matrix 88.25 vs. 90.11;  Pruning 90.46 vs. 84.13;  Parameters 84.97 vs. 90.00;  Grammar 87.61 vs. 90.14
If the computational effort needed by a given technique is lower than 2,000,000 fitness function executions, the accuracy and standard deviation reported for the technique described in this work are those obtained at that effort; if it is greater,
the values reported are those obtained after 2,000,000 fitness function executions.
All of the accuracies shown in these tables were obtained as the result of a 5×2cv test: the data were divided into two halves (training and test) 5 times, and in each iteration each model was trained with one half and tested with the other, and vice versa, giving 10 different test values whose average is the final accuracy. This procedure is used throughout this section, including for the GP algorithm.
The techniques with which the comparison is performed can be divided into three groups: training of ANNs by means of evolutionary algorithms, feature selection, and design and training of ANNs by means of evolutionary algorithms.

7.1. Training of ANNs by means of evolutionary algorithms

The first group of techniques used for comparison are those that use genetic algorithms only to train ANNs with an already fixed topology. Table 6 summarizes the network topologies used in [37] together with the arithmetic mean of the topologies obtained with the method proposed here (extracted from Table 4). It is important to keep in mind that the results in the cited work were obtained using ANNs with a fully connected hidden layer and a fixed number of hidden neurons, whereas the topologies generated by the method proposed in this study do not correspond to a classical layered topology: the hidden neurons can have any kind of connectivity among themselves and with the input and output neurons, and connections between input and output neurons are also possible.
Table 7 shows the results obtained by the method proposed herein, compared with those obtained with the traditional backpropagation (BP) algorithm and with networks trained by means of a GA with either binary or real codification (the G3PCX algorithm [119]). When using binary codification, the configuration of the genetic algorithm was the following:
- chromosome of 16 bits per weight, each weight in the [−1, 1] interval (a decoding sketch is given after these two lists);
- population size of n = ⌊3√l⌋, where l is the number of chromosome bits;
- mutation rate of 1/l;
- two-individual tournament selection without replacement;
- in each generation the whole population was replaced, without elitism;
- after 100 generations the algorithm was stopped and the result was the individual with the highest accuracy.

On the other hand, the genetic algorithm with real codification, G3PCX [119], presents a similar configuration with the following changes:
- chromosome length of l, where l is the number of weights;
- population size of n = ⌊30√l⌋, where l is the number of weights;
- the algorithm was stopped after n iterations without improvement of the best solution, or after 50n executions of the fitness function;
- it uses a steady-state scheme, i.e. a single population that evolves and is not replaced by another, built from the genetic operators applied to it.
Results have also been obtained from training processes that combine a GA (with binary codification) with Lamarckian and Baldwinian strategies, in which the weights are additionally optimized by means of the BP algorithm. In the case of the Lamarckian
strategy, the BP algorithm is applied to the chromosomes of the individuals during the evaluation of the fitness function in a certain percentage of the individuals of the population, and the changes performed in the chromosomes (the weights of the ANNs) are permanent; the fitness value is the result of applying the BP algorithm. The Baldwinian strategy is similar, with the difference that the changes made to the individuals by the BP algorithm are not permanent: BP is executed only to calculate the fitness value. In addition, in these experiments, the best set of weights (i.e. the best chromosome) found with the Baldwinian strategy was used to train a network by means of the BP algorithm.
Table 8 shows the comparison between the method proposed in this work and the Lamarckian and Baldwinian strategies applied to 5% of the population; for each strategy, the number of BP iterations is indicated (1, 2 and 5 iterations for 1 BP, 2 BP and 5 BP, respectively). Table 9 shows a similar comparison with the strategies applied to 100% of the population. The GA used in these experiments has the same parameters as the binary GA of the previous case.

7.2. Feature selection

The next technique used in [37] for comparison is based on the selection of the variables of the problem [120,121]. It uses a GA with binary codification in which each bit of the chromosome indicates whether a given input variable will be used for training. The evaluation of an individual therefore consists of training the ANN (which has a fixed structure) with the input variables selected by the chromosome. The population was randomly initialized with ⌊3√l⌋ individuals, with a minimum size of 20. A standard crossover was applied with a probability of 1.0 and a mutation rate of 1/l. The networks were designed and trained as described in Table 6. The algorithm was stopped either when the best solution did not change for five generations or after a limit of 50 generations, although this limit was never reached because the algorithm always found a good solution much earlier. Table 10 shows a comparison of this technique with the one described in this work.

7.3. ANN design by means of evolutionary algorithms

The last set of techniques with which this work is compared uses evolutionary algorithms to design neural networks. The techniques to be compared with are the following:
- connectivity matrix;
- pruning;
- finding network parameters;
- graph-rewriting grammar.
In all these techniques, in order to evaluate the accuracy of each network generated, 5 iterations of a 5-fold cross validation test [36] are performed, which has a notable influence on the computational effort needed to achieve the results presented.
The connectivity matrix technique represents the topology of a network as a binary matrix: the element (i, j) of the matrix has a value of 1 if there is a connection between i and j, and 0 if there is no connection. A genetic algorithm with binary codification can be used directly, because the chromosome is obtained by concatenating the rows of the matrix [105]. In this case, the number of hidden neurons indicated in Table 6 is used, and
connections have been allowed between inputs and outputs, meaning that the length of the chromosome is l = (hidden + outputs) × inputs + hidden × outputs (see the sketch at the end of this subsection). A multipoint crossover was used with a probability of 1.0 and l/10 crossover points, and the mutation rate was 1/l. The population had a size of ⌊3√l⌋ individuals, with a minimum of 20. The algorithm was stopped after 5 generations without improvement of the best solution or after a maximum of 50 generations.
The pruning technique is based on a similar representation, but the method is different: it starts from a fully connected network [122], trained by means of the BP algorithm according to the parameters in Table 6. Once this trained network is obtained, the evolutionary algorithm is executed; the evaluation of each individual consists of removing from the previously trained network those weights whose value in the connectivity matrix is 0, and the resulting ANN is evaluated with the training set, with no further training. The networks began with the topologies shown in Table 6, with the same parameter configuration as in the previous case.
Finding the network parameters is a different approach: an evolutionary algorithm is used to identify the general design and training parameters of the networks [105,123], in this case the number of hidden neurons, the BP algorithm parameters, and the initial interval of the weights. The chromosome length was 36 bits, divided in the following way:
- 5 bits for the learning rate and the coefficient b of the activation function, in the [0, 1] interval;
- 5 bits for the number of hidden neurons, in the [0, 31] interval;
- 6 bits for the number of BP epochs;
- 20 bits for the upper and lower limits of the initial weights interval (10 bits for each value), with intervals [−10, 0] and [0, 10], respectively.

The evaluation of an individual consists of constructing the network, initializing it and training it according to these parameters. The population had 25 randomly initialized individuals. The algorithm used a two-point crossover with a probability of 1.0 and a mutation rate of 0.04. As in the other experiments, two-individual tournament selection without replacement was used, and the execution was stopped after 5 generations with no change in the best solution or after reaching a limit of 50 generations.
Finally, the graph-rewriting grammar [61] also relies on a connectivity matrix that represents the network but, unlike the previous cases, the matrix is not codified directly in the chromosome: a grammar is used to generate it. The chromosome only contains rules that turn each element of the matrix into a 2×2 sub-matrix. The grammar has 16 terminal symbols, which are 2×2 matrices, and 16 non-terminal symbols. The rules have the form n → m, where n is one of the non-terminal symbols and m is a 2×2 matrix of non-terminal symbols. There is a starting symbol, and the number of rewriting steps is determined by the user. The chromosome contains the 16 right-hand sides of the rules, the left-hand side being implicit in the position of the rule. To evaluate the fitness of an individual, the rules are decoded and a connectivity matrix is constructed by applying them; the network is generated from this matrix and trained by means of BP. For these problems the number of steps is limited to 8, so the resulting networks have a maximum of 256 elements. The size of the chromosome is therefore 256 bits (four 2×2 binary matrices for each one of the
16 rules). The population size is 64 individuals, with a multipoint crossover with a probability of 1.0 and l/10 crossover points and a mutation rate of 0.004. The algorithm was stopped after 5 generations with no improvement in the best individual or after 50 generations. The results obtained with these 4 methods, compared with the method described in this work, are given in Table 11.
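As a sketch of the connectivity-matrix encoding used by the first of these techniques (referenced above), the chromosome is the row-wise concatenation of the connectivity blocks; the helper names and the exact block layout are assumptions for illustration:

import numpy as np

def chromosome_length(n_inputs, n_hidden, n_outputs):
    # Length of the connectivity chromosome when input-output links are allowed:
    # l = (hidden + outputs) * inputs + hidden * outputs
    return (n_hidden + n_outputs) * n_inputs + n_hidden * n_outputs

def decode(chromosome, n_inputs, n_hidden, n_outputs):
    # Rebuild the two connectivity blocks from the flat bit string.
    bits = np.asarray(chromosome, dtype=int)
    cut = (n_hidden + n_outputs) * n_inputs
    from_inputs = bits[:cut].reshape(n_inputs, n_hidden + n_outputs)
    hidden_to_outputs = bits[cut:].reshape(n_hidden, n_outputs)
    return from_inputs, hidden_to_outputs

# Ionosphere-sized topology from Table 6: 34 inputs, 10 hidden, 1 output
print(chromosome_length(34, 10, 1))   # 384 bits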
8. Discussion

As the tables show, the results obtained by the method proposed here are of the same order as those presented in [37], improving on them most of the time. These good results, however, are only one of the features of the system; this section describes them in turn.

8.1. Results

Sections 7.1 and 7.2 compare the results obtained by this method with other ANN training methods based on EC and hybrid techniques. The accuracy values obtained by the technique described here in the 5×2cv tests are similar to those obtained with the other tools, and better most of the time, especially in the cases that require a lot of computational capacity (such as hybrid techniques like the Lamarckian strategies). However, the techniques with which it is compared start from a fixed network topology, so the intervention of a human expert is still necessary in those cases; the tool described here, on the other hand, can provide even better results without requiring that kind of human intervention.
Section 7.3 shows a comparison with a set of techniques that do not need a predetermined architecture, i.e. an expert's intervention is no longer needed to determine the ANN topology and connectivity. These techniques, because they join the evolution of the architecture with the training of the weights, require an enormous computational load. Table 11 shows that the technique described here gives worse results in only one of the comparisons; in the others, the accuracies achieved are much better than those obtained with the other techniques. Fig. 6 shows the computational cost needed to obtain the results in the ionosphere problem: the convergence of this algorithm is easily achieved and the results are obtained with much less computational effort than the values shown in Table 11.
In some problems (breast cancer, heart disease) there is a decrease in the test accuracy after an initial improvement, caused by the overfitting of the networks. This problem, as already explained, is due to the fact that there is no validation set; if bigger datasets were used, the training set could have been split again into training and validation to avoid this effect. In other problems, on the contrary, the accuracy keeps improving, so when the maximum number of fitness function evaluations is reached the system has not yet achieved the maximum accuracy that can be obtained for that problem. Therefore, the accuracy values shown in the tables are not the best that this technique can offer: in some cases the results would improve if the algorithm kept running, because it was stopped too early; in other cases, better values had been reached earlier and would have been kept if a validation set had been used. In the latter case the test accuracy is not expected to decrease significantly, and the resulting test accuracy would be very similar to the best test
Table 12
Comparison of the best results found between different ANN development techniques, ordered from best to worst for each problem.

          Breast cancer              Iris flower                Heart Cleveland            Ionosphere
Best  1   Matrix         96.77%      Proposed here  95.53%      Pruning        89.50%      Proposed here  88.72%
      2   Grammar        96.71%      Grammar        92.93%      Proposed here  81.04%      Grammar        88.03%
      3   Parameter      96.69%      Matrix         92.40%      Matrix         76.78%      Matrix         87.06%
      4   Pruning        96.31%      Pruning        92.40%      Grammar        72.80%      Parameter      85.58%
Worst 5   Proposed here  96.30%      Parameter      91.73%      Parameter      65.89%      Pruning        83.66%
Table 13
ANN development techniques ordered by accuracy.

          Technique        Mean position in Table 12
Best  1   Proposed here    2.25
      2   Grammar          2.5
      3   Matrix           2.5
      4   Pruning          3.5
Worst 5   Parameter        4.25
accuracy found during the training. A comparison of these best test accuracies with the accuracies obtained by the other tools is given in Table 12, where the techniques are ordered by accuracy for each problem. As can be seen in this table, the results obtained by the system described here are better than those returned by the other techniques in most of the problems, with the exception of breast cancer; even in that problem the difference is very small, only 0.47% between the result obtained by the tool described here and the best result. In another problem, heart disease, the technique proposed here gives the second best result, after pruning; however, pruning offers good results only in that particular problem and very low accuracies in the rest, so it is a very unstable tool. The position of each technique in Table 12 can be averaged over the problems to quantify the overall quality of each tool; with this measure, the tools can be ordered as shown in Table 13. This table shows that the technique described here obtains the best mean position, so it can be concluded that, on average and independently of the problem to be solved, it returns the best results.
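The mean positions of Table 13 are simply the average of each technique's rank over the four problems; a short check using the rankings listed in Table 12 (Python, illustrative only):

from statistics import mean

# Position (1 = best) of each technique per problem, as listed in Table 12:
# breast cancer, iris flower, heart Cleveland, ionosphere
positions = {
    "Proposed here": [5, 1, 2, 1],
    "Grammar":       [2, 2, 4, 2],
    "Matrix":        [1, 3, 3, 3],
    "Pruning":       [4, 4, 1, 5],
    "Parameter":     [3, 5, 5, 4],
}
for name, pos in sorted(positions.items(), key=lambda item: mean(item[1])):
    print(f"{name}: {mean(pos):.2f}")   # 2.25, 2.50, 2.50, 3.50, 4.25 as in Table 13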
8.2. Independence from the expert

As already explained, one of the main goals of this work was to develop a system in which the expert has a minimal participation, eliminating the excessive effort required by the classical ANN development process. The development of this system has introduced some parameters, and having to set their values might seem to remove that desired independence from the expert. However, Section 6 describes a set of experiments performed to obtain parameter values that give good results in problems of different complexity. The results in that section show, for each parameter, an interval in which the system is robust, without big changes in the returned error, so the exact value of each parameter does not have a big influence. In Section 7, one value of each parameter is chosen and used for the
comparison with other tools on problems of different complexity. Therefore, by using these parameter values (or other values inside these intervals), the parameters do not have to be tuned, and the effort required from the expert is minimal, limited to the classical machine learning tasks of data analysis and pre-processing. Moreover, the rest of the techniques used here for comparison do require the expert's participation: the ones used in Sections 7.1 and 7.2 need a previous network design, and although the tools used in Section 7.3 perform this design automatically, they still require some effort from the expert, such as designing an initial network (in the case of pruning) or establishing how the developed networks may look (in the case of parameter search or the graph-rewriting grammar). These tools are therefore not completely independent from the expert, who still has to make some effort to apply them correctly; this is not the case with the technique described in this work.
8.3. Discrimination of the input variables

Another important advantage of this technique is that it allows discriminating the variables that are not important for solving the problem. This is carried out by the evolutionary process itself, because the variables that are not significant for solving the problem do not appear in the resulting ANN. For example, Fig. 7 shows a network that obtained an accuracy of 90.85% in the first iteration of the 5×2cv method on the ionosphere problem. This problem is characterized, among other things, by its high number of inputs (34). However, a quick look at Fig. 7 shows that not so many variables are necessary to solve the problem with high accuracy: instead of using all of them, the system determined which were not useful and ended up with a reduced set of 7 variables. This feature selection can be very useful to give insight into the problem domain [124]. The figure also shows that the network obtained by this method, apart from discriminating the useful variables, is much simpler than the networks traditionally used to solve the problem, and it does not have the architectural limitation of being separated into layers: it can have any type of connectivity.
The reduction of the number of input features is given in Table 14, which shows the average number of variables used by the different networks after 500,000 fitness function evaluations, together with the initial number of input features. As can be seen, the number of inputs used is very low compared with the original number, especially in problems with many features, like ionosphere. This table also shows a principal component analysis (PCA) [125] of these problems. This analysis is based on the study of the
Fig. 7. Networks that solve the problem of the ionosphere: (a) the network obtained by this method, which uses only a reduced subset of the inputs X0–X33, and (b) a traditional fully connected layered network.
Table 14
Comparison of the features used in each problem and PCA analysis.

                 Num. features   Used features   PCA 1%   2%   5%   10%   15%
Breast cancer    9               5.775           8        8    7    3     3
Iris flower      4               3.15            3        3    2    2     1
Heart disease    13              6.175           11       9    7    4     2
Ionosphere       34              7.025           15       7    3    2     1
database from the point of view of the input features, which are usually quite correlated, so there is usually much redundant input information. PCA tries to reduce the dimension of the input feature set by applying several transformations to it; as a result, it returns a set of features (smaller than the original one) that are orthogonal and therefore uncorrelated. The importance of each variable in PCA is given by its variance, and to perform a PCA analysis one must indicate what percentage of the total variance is to be explained by the new model, i.e. by the new feature set and the transformations leading to it. Table 14 shows, for each problem and several variance percentages, the number of features of the model that results after removing the variables that contribute less than that percentage to the total variance of the dataset. As can be seen in the table, the number of variables used by the system for each problem corresponds to the result of applying PCA with a very low variance percentage: lower than 10%, and sometimes lower than 5% (iris flower) or even 2% (ionosphere). The system described in this work therefore discriminates a set of variables that contains a great amount of the information of the original database and, unlike PCA, this set of important variables can be used without applying any transformation to them.
This reduction in the number of input features is graphically shown in Fig. 8, which plots the average number of features used by the networks during the execution process, up to a maximum of 500,000 pattern file evaluations, in the ionosphere problem. Similar experimental results were found for the rest of the problems, and similar graphs could be drawn for them.
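A sketch of how PCA column counts of the kind reported in Table 14 can be obtained (assuming scikit-learn; this illustrates the idea of counting components above a variance threshold, not necessarily the exact procedure followed in the paper):

import numpy as np
from sklearn.decomposition import PCA

def n_relevant_components(X, threshold):
    # Number of principal components whose share of the total variance is at
    # least `threshold` (e.g. 0.05 for the 5% column of Table 14).
    ratios = PCA().fit(X).explained_variance_ratio_
    return int(np.sum(ratios >= threshold))

# usage sketch, for a dataset X already loaded as a 2-D array:
# counts = {t: n_relevant_components(X, t) for t in (0.01, 0.02, 0.05, 0.10, 0.15)}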
This discrimination of the input variables is an important advantage, because it can be used to extract knowledge from the problem. A clear example is the iris flower problem: it is already known that with only two variables (petal length and width) a 98% classification accuracy can be obtained (147 correct classifications out of 150 data points) [126], but it was believed that the other two variables were needed to obtain a higher accuracy. Previous work with this system provided new knowledge by discriminating the input variables of the iris flower problem, after a brief analysis of the networks returned by the system [127]:
- To correctly classify 149 out of 150 data points, the variable sepal length is not necessary.
- The variable petal length, which is necessary to achieve more than 145 correct classifications, is not required (together with the variable sepal length, which is not needed either) to correctly classify 145 out of 150 data points.
- As was already known and was found here again, the variables sepal length and sepal width are not necessary for the correct classification of 147 data points.
8.4. Optimization of the networks

Another important feature of this technique is that it optimizes the networks. As already explained, this optimization depends on the parameter that penalizes the networks according to their number of neurons, which forces the system to develop networks with few neurons. An example can be seen in Fig. 7, which shows a network found for the ionosphere problem that is much simpler than those that would be used in the classical ANN development process. Table 15 compares the network found here with the ones that could be used with the classical approach, i.e. with total connectivity between the neurons of one layer and those of the following one (see Fig. 7).
Fig. 8. Number of features used in the ionosphere problem (average number of variables used by the networks vs. computational effort, ×100 pattern file evaluations).
Table 15
Comparison between the network found with this method and the ones found with the classical method in the ionosphere problem.

Method           Number of hidden neurons   Number of connections
Classical        25                         875
                 20                         700
                 15                         525
                 3                          105
Proposed here    3                          15
Fig. 9. Traditional ANN that solves the iris flower problem (inputs X1–X4, a fully connected hidden layer, and one output per class: Setosa, Versicolor, Virginica).
Table 16
Comparison between the networks found with this method and the one found with the classical method in the iris flower problem.

Method           Hits   Accuracy (%)   Hidden neurons   Connections
[128]            148    98.66          5                35
Proposed here    149    99.33          3                15
                 148    98.66          1                11
                 147    98             1                9
                 146    97.33          1                10
                 145    96.66          1                10
As can be seen in Table 15, the network found in this work presents an important improvement in terms of the number of neurons and connections. Another example is the iris flower problem, whose resulting networks have been analyzed in previous work [127]. Table 16 summarizes the accuracies obtained here, along with the number of neurons and connections, compared with those obtained using the classical method [128]. As can be seen in Tables 15 and 16, the networks found by this system are much simpler than the ones obtained with the classical method.
8.5. Architectures

An additional advantage of this system is that the networks do not have the layer-wise limitations of networks built in the traditional way: the network can have any type of connectivity. In the particular case of this work, this connectivity has only been limited to networks without recurrent connections, as previously explained. This ability to have any type of connectivity is a great advantage over ANNs with a traditional architecture, which present many difficulties when analyzed. The networks developed by this system, since they have been optimized, have a low number of neurons, which makes them much easier to analyze, allowing one to discover, for example, which variables participate in a given output of the network. For example, Table 16 summarizes the architectures of the networks that solved the iris flower problem with different accuracies, compared with a classical network used in previous works, shown in Fig. 9 [128]. This network is very difficult to analyze due to its large number of neurons and connections, whereas the networks described in Table 16 are much simpler and can be analyzed more easily; in fact, they were analyzed, and the conclusions reached about the inputs of the problem are given at the end of Section 8.3. Another example is shown in Fig. 7 and Table 15 for the ionosphere problem: again, the possibility of having any type of connectivity allows finding much simpler networks that can be analyzed more easily.

9. Conclusions and future works

In this paper, a technique with which ANNs can be generated by means of GP is presented. Section 8 presents a set
of features of the system described here that can be summarized as follows:
- The results obtained make this tool, on average, better than the rest of the tools used for the comparison.
- The system is completely independent of the expert, who does not have to put in any effort to execute it, not even to tune the system parameters: finding the optimal architecture and weight values is completely automated.
- The system performs a discrimination of the input variables, which allows obtaining knowledge about the problem domain.
- The networks obtained by the system have been optimized so that they contain a minimal set of hidden neurons.
- The architectures found by the system can have any type of connectivity, which allows a better and much easier analysis of the networks; this analysis cannot be carried out with classical architectures.
Therefore, the technique presented in this paper is a powerful tool that can develop simple ANNs without human intervention. The classification problems addressed herein have already been studied with other tools, and this paper shows that evolutionary techniques can also be applied to develop ANNs that solve them, obtaining a very high accuracy on the test sets. Moreover, the networks developed by this system have been optimized so that they have a small number of neurons and connections. Another important feature is that the system can discriminate the inputs needed to obtain the resulting accuracy: as explained, some features proved to be useless in many of the networks, and others were not used by some networks, so it can be concluded that, although they can contribute to a high accuracy, their contribution is not essential.
Once the functioning of the system has been verified, the research work continues in several directions. One interesting research line is the study of the possible integration of a Genetic Algorithm (GA) into the system to train the generated networks; in this way, the GP system would only be in charge of creating different architectures, which would then be trained by means of the GA. Another interesting line is the modification of this system so that it can be used by a GP algorithm based on graphs, which would avoid the use of a list and of special operators for referencing an already existing neuron: the referencing of multiple neurons would be implicit in the GP algorithm [129].
Acknowledgements This work was supported in part by the Spanish Ministry of Education and Culture (Ref. TIC2003-07593, TIN2006-13274), the INBIOMED network (Ref. P10/52048) financed by the Carlos III Health Institute, grants from the General Directorate of Research of the Xunta de Galicia (Ref. PGIDIT03-PXIC10504PN, PGIDIT04PXIC10503PN, PGIDIT04-PXIC10504PN), and the European project INTERREG (Ref. IIIA-PROLIT-SP1E194/03). The development of the particular experiments in this paper was carried out thanks to the support of the ‘‘Centro de Supercomputacio´n de Galicia (CESGA)’’. The Cleveland heart disease database was available thanks to Robert Detrano, M.D., Ph.D., V.A. Medical Center, Long Beach and Cleveland Clinic Foundation.
References [1] W.S. McCulloch, W. Pitts, A logical calculus of ideas immanent in nervous activity, Bulletin of Mathematical Biophysics 5 (1943) 115–133. [2] G. Orchad, Neural Computing Research and Applications, Institute of Physics Publishing, Londres, 1993. [3] S. Haykin, in: Neural Networks, 2nd ed., Prentice Hall, Englewood Cliffs, NJ, 1999. [4] R. Andrews, R. Cable, J. Diederich, S. Geva, M. Golea, R. Hayward, C. Ho-Stuart, A.B. Tickle, An evaluation and comparison of techniques for extracting and refining rules from artificial neural networks (QUT NRC Technical Report), Queensland University of Technology, Neurocomputing Research Centre, Queensland, 1996. ˜ al, J. Dorado, A. Pazos, J. Pereira, D. Rivero, A new approach to the [5] J.R. Rabun extraction of ANN rules and to their generalization capacity through GP, Neural Computation 16 (2004) 1483–1524. ˜ al, J. Dorado (Eds.), Artificial Neural Networks in Real-Life [6] J.R. Rabun Applications, Idea Group Inc., 2005. [7] N.L. Cramer, A representation for the adaptive generation of simple sequential programs, in: Proceedings of First International Conference on Genetic Algorithms, Grefenstette, 1985. [8] C. Fujiki, Using the genetic algorithm to generate lisp source code to solve the prisoner’s dilemma, in: International Conference on GAs, 1987, pp. 236–240. [9] J.J. Holland, in: Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, MI, 1975. [10] R.M. Friedberg, A learning machine: part I, IBM Journal of Research and Development 2 (1) (1958) 2–13. [11] R.M. Friedberg, B. Dunham, J.H. North, A learning machine: part II, IBM Journal of Research and Development 3 (3) (1959) 282–287. [12] L.J. Fogel, On the organization of intellect, Ph.D. Dissertation, UCLA, 1964. [13] L.J. Fogel, A.J. Owens, M.J. Walsh, in: Artificial Intelligence through Simulated Evolution, John Wiley, 1966. [14] J.R. Koza, in: Genetic Programming: On the Programming of Computers by Jeans of Natural Selection, MIT Press, Cambridge, MA, 1992. [15] J.R. Koza, Hierarchical genetic algorithms operating on populations of computer programs, in: Proceedings of the Eleventh International Joint Conference on Artificial Intelligence IJCAI-89, Morgan Kaufmann, 1989, pp. 768–774. [16] M. Fuchs, Crossover versus mutation: an empirical and theoretical case study, in: Proceedings of the Third Annual Conference on Genetic Programming, Morgan Kauffman, San Francisco, CA, 1998. [17] S. Luke, L. Spector, A revised comparison of crossover and mutation in genetic programming, in: Proceedings of the Third Annual Conference on Genetic Programming, Morgan Kauffman, San Francisco, CA, 1998. ˜ al, J. Dorado, A. Pazos, Time series forecast [18] D. Rivero, J.R. Rabun with anticipation using genetic programming, IWANN 2005 (2005) 968–975. [19] M. Bot, in: Application of genetic programming to induction of linear classification trees. Final Term Project Report, Vrije Universiteit, Amsterdam, 1999. [20] A.P. Engelbrecht, S.E. Rouwhorst, L.A. Schoeman, Building block approach to genetic programming for rule discovery, in: R. Abbass, C. Sarkar, Newton (Eds.), Data Mining: A Heuristic Approach, Idea Group Publishing, 2001. ˜ al, J. Puertas, A. Santos, D. Rivero, Prediction and [21] J. Dorado, J.R. Rabun modelling of the flor of a typical urban basin through genetic programming, in: Applications of Evolutionary Computing, Proceedings of EvoWorshops 2002: EvoCOP, AvoIASP, EvoSTIM/EvoPLAN. ˜ al, J. Dorado, J. Puertas, A. Pazos, A. Santos, D. Rivero, Prediction [22] J.R. 
Rabun and modelling of the rainfall-runoff transformation of a typical urban basin using ANN and GP, Applied Artificial Intelligence, 2003. ˜ al, J. Dorado, A. Pazos, Using genetic programming for [23] D. Rivero, J.R. Rabun character discrimination in damaged documents. In: Applications of Evolutionary Computing, EvoWorkshops 2004: EvoBIO, EvoCOMNET, EvoHOT, EvoIASP, EvoMUSART, EvoSTOC (Conference proceedings), 2004, pp. 349–358. [24] M.I. Quintana, R. Poli, C. Claridge, On two approaches to image processing algorithm design for binary images using GP, in: Applications of Evolutionary Computing, Proceedings of EvoWorkshops 2003: EvoBIO, EvoCOP, EvoIASP, EvoMUSART, EvoROB, and EvoSTIM. [25] G. Adorni, S. Cagnoni, Design of explicitly or implicitly parallel lowresolution character recognition algorithms by means of genetic programming, in: R. Roy, M. Koppen, S. Ovaska, T. Furuhashi, F. Hoffmann (Eds.), Soft Computing and Industry: Recent Applications, Proceedings of the Sixth Online Conference on Soft Computing, Springer 2002, pp. 387–398. [26] R.R. Kampfner, Computational modelling of evolutionary learning, Ph.D. Dissertation, University of Michigan, Ann Arbor, MI, 1981. [27] R.R. Kampfner, M. Conrad, Computational modelling of evolutionary learning processes in the brain, Bulletin of Mathematical Biology 45 (6) (1983) 931–968. [28] D.B. Fogel, L.J. Fogel, V.W. Porto, Evolving neural networks, Biological Cybernetics 63 (6) (1990) 487–493. [29] Leandro M. Almeida, Teresa B. Ludermira, A multi-objective memetic and hybrid methodology for optimizing the parameters and performance of artificial neural networks, Neurocomputing 73 (7–9) (2010) 1438–1450.
[30] J. Dorado, Modelo de un sistema para la selección automática en dominios complejos, con una estrategia cooperativa, de conjuntos de entrenamiento y arquitecturas ideales de redes de neuronas artificiales utilizando algoritmos genéticos, Ph.D. Thesis, University of A Coruña, 1999.
[31] X. Yao, Evolving artificial neural networks, Proceedings of the IEEE 87 (9) (1999) 1423–1447.
[32] S. Nolfi, D. Parisi, Evolution and learning in neural networks, in: M.A. Arbib (Ed.), Handbook of Brain Theory and Neural Networks, second ed., MIT Press, Cambridge, MA, 2002, pp. 415–418.
[33] S. Nolfi, D. Parisi, Evolution of artificial neural networks, in: M.A. Arbib (Ed.), Handbook of Brain Theory and Neural Networks, second ed., MIT Press, Cambridge, MA, 2002, pp. 418–421.
[34] M.D. Ritchie, B.C. White, J.S. Parker, L.W. Hahn, J.H. Moore, Optimization of neural network architecture using genetic programming improves detection and modelling of gene–gene interactions in studies of human diseases, BMC Bioinformatics 3 (1) (2003).
[35] D. Du, K. Li, M. Fei, A fast multi-output RBF neural network construction method, Neurocomputing, available online 25 February 2010, in press.
[36] F. Herrera, C. Hervás, J. Otero, L. Sánchez, Un estudio empírico preliminar sobre los tests estadísticos más habituales en el aprendizaje automático, in: R. Giraldez, J.C. Riquelme, J.S. Aguilar (Eds.), Tendencias de la Minería de Datos en España, Red Española de Minería de Datos y Aprendizaje (TIC2002-11124-E), 2004, pp. 403–412.
[37] E. Cantú-Paz, C. Kamath, An empirical comparison of combinations of evolutionary algorithms and neural networks for classification problems, IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics (2005) 915–927.
[38] K. Davoian, W.A. Lippe, New self-adaptive EP approach for ANN weights training, Enformatika, Transactions on Engineering, Computing and Technology 15 (2006) 109–114.
[39] P. Werbos, Beyond regression: new tools for prediction and analysis in the behavioral sciences, Ph.D. Dissertation, Committee on Applied Mathematics, Harvard University, Cambridge, MA, November 1974.
[40] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning internal representations by error propagation, in: D.E. Rumelhart, J.L. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, MIT Press, Cambridge, MA, 1986, pp. 318–362.
[41] R.S. Sutton, Two problems with backpropagation and other steepest-descent learning procedures for networks, in: Proceedings of the Eighth Annual Conference of the Cognitive Science Society, Erlbaum, Hillsdale, NJ, 1986, pp. 823–831.
[42] D. Whitley, T. Starkweather, C. Bogart, Genetic algorithms and neural networks: optimizing connections and connectivity, Parallel Computing 14 (3) (1990) 347–361.
[43] M. Srinivas, L.M. Patnaik, Learning neural network weights using genetic algorithms - improving performance by search-space reduction, in: Proceedings of the 1991 IEEE International Joint Conference on Neural Networks (IJCNN'91 Singapore), vol. 3, pp. 2331–2336.
[44] H. de Garis, GenNets: genetically programmed neural nets - using the genetic algorithm to train neural nets whose inputs and/or outputs vary in time, in: Proceedings of the 1991 IEEE International Joint Conference on Neural Networks (IJCNN'91 Singapore), vol. 2, pp. 1391–1396.
[45] D.J. Janson, J.F. Frenzel, Training product unit neural networks with genetic algorithms, IEEE Expert 8 (1993) 26–33.
[46] F. Menczer, D. Parisi, Evidence of hyperplanes in the genetic learning of neural networks, Biological Cybernetics 66 (1992) 283–289.
[47] G.W. Greenwood, Training partially recurrent neural networks using evolutionary strategies, IEEE Transactions on Speech and Audio Processing 5 (1997) 192–194.
[48] D.B. Fogel, E.C. Wasson, E.M. Boughton, Evolving neural networks for detecting breast cancer, Cancer Letters 96 (1) (1995) 49–53.
[49] D.B. Fogel, E.C. Wasson, V.W. Porto, A step toward computer-assisted mammography using evolutionary programming and neural networks, Cancer Letters 119 (1) (1995) 93.
[50] W. Yan, Z. Zhu, R. Hu, Hybrid genetic/BP algorithm and its application for radar target classification, in: Proceedings of the 1997 IEEE National Aerospace and Electronics Conference, NAECON, Part 2 (of 2), pp. 981–984.
[51] P. Bartlett, T. Downs, Training a neural network with a genetic algorithm, Technical Report, Department of Electrical Engineering, University of Queensland, Australia, January 1990.
[52] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, MA, 1989.
[53] D. Montana, L. Davis, Training feed-forward neural networks using genetic algorithms, in: Proceedings of the 11th International Joint Conference on Artificial Intelligence, Morgan Kaufmann, San Mateo, CA, 1989, pp. 762–767.
[54] M. Frean, The upstart algorithm: a method for constructing and training feedforward neural networks, Neural Computation 2 (2) (1990) 198–209.
[55] J. Sietsma, R.J.F. Dow, Creating artificial neural networks that generalize, Neural Networks 4 (1) (1991) 67–79.
[56] P.J. Angeline, G.M. Saunders, J.B. Pollack, An evolutionary algorithm that constructs recurrent neural networks, IEEE Transactions on Neural Networks 5 (1994) 54–65.
[57] G.F. Miller, P.M. Todd, S.U. Hegde, Designing neural networks using genetic algorithms, in: Proceedings of the Third International Conference on Genetic Algorithms, Morgan Kaufmann, San Mateo, CA, 1989, pp. 379–384.
[58] F.J. Marin, F. Sandoval, Genetic synthesis of discrete-time recurrent neural network, in: Proceedings of the International Workshop on Artificial Neural Networks (IWANN'93), Lecture Notes in Computer Science, vol. 686, Springer-Verlag, Berlin, Germany, 1993, pp. 179–184.
[59] E. Alba, J.F. Aldana, J.M. Troya, Fully automatic ANN design: a genetic approach, in: Proceedings of the International Workshop on Artificial Neural Networks (IWANN'93), Lecture Notes in Computer Science, vol. 686, Springer-Verlag, Berlin, Germany, 1993, pp. 399–404.
[60] B. Kothari, B. Paya, I. Esat, Machinery fault diagnostics using direct encoding graph syntax for optimizing artificial neural network structure, in: Proceedings of the 1996 Third Biennial Joint Conference on Engineering Systems Design and Analysis, ESDA, Part 7 (of 9), ASME, New York, 1996, pp. 205–210.
[61] H. Kitano, Designing neural networks using genetic algorithms with graph generation system, Complex Systems 4 (1990) 461–476.
[62] X. Yao, Y. Liu, EPNet for chaotic time-series prediction, in: X. Yao, J.-H. Kim, T. Furuhashi (Eds.), Selected Papers of the First Asia-Pacific Conference on Simulated Evolution and Learning (SEAL'96), Lecture Notes in Artificial Intelligence, vol. 1285, Springer-Verlag, Berlin, Germany, 1997, pp. 146–156.
[63] X. Yao, Y. Liu, Toward designing artificial neural networks by evolution, Applied Mathematics and Computation 91 (1) (1998) 83–90.
[64] D. Thierens, Non-redundant genetic coding of neural networks, in: Proceedings of the 1996 IEEE International Conference on Evolutionary Computation, ICEC'96, pp. 571–575.
[65] K.O. Stanley, R. Miikkulainen, Evolving neural networks through augmenting topologies, Evolutionary Computation 10 (2) (2002) 99–127.
[66] S. Whiteson, D. Whiteson, Stochastic optimization for collision selection in high energy physics, in: IAAI 2007: Proceedings of the Nineteenth Annual Innovative Applications of Artificial Intelligence Conference.
[67] S.A. Harp, T. Samad, A. Guha, Toward the genetic synthesis of neural networks, in: J.D. Schaffer (Ed.), Proceedings of the Third International Conference on Genetic Algorithms and their Applications, Morgan Kaufmann, San Mateo, CA, 1989, pp. 360–369.
[68] S.A. Harp, T. Samad, A. Guha, Designing application-specific neural networks using the genetic algorithm, in: D.S. Touretzky (Ed.), Advances in Neural Information Processing Systems, vol. 2, Morgan Kaufmann, San Mateo, CA, 1990, pp. 447–454.
[69] N. Dodd, D. Macfarlane, C. Marland, Optimization of artificial neural network structure using genetic techniques implemented on multiple transputers, in: P. Welch, D. Stiles, T.L. Kunii, A. Bakkers (Eds.), Proceedings of Transputing'91, IOS, Amsterdam, The Netherlands, 1991, pp. 687–700.
[70] P.J.B. Hancock, GANNET: design of a neural net for face recognition by genetic algorithm, Technical Report CCCN-6, Center for Cognitive and Computational Neuroscience, Departments of Computer Science and Psychology, Stirling University, Stirling, UK, August 1990.
[71] E. Vonk, L.C. Jain, R. Johnson, Using genetic algorithms with grammar encoding to generate neural networks, in: Proceedings of the 1995 IEEE International Conference on Neural Networks, Part 4 (of 6), 1995, pp. 1928–1931.
[72] X. Yao, Y. Shi, A preliminary study on designing artificial neural networks using co-evolution, in: Proceedings of the IEEE Singapore International Conference on Intelligent Control and Instrumentation, Singapore, June 1995, pp. 149–154.
[73] S. Nolfi, D. Floreano, Evolutionary Robotics: The Biology, Intelligence and Technology of Self-Organizing Machines, MIT Press/Bradford Books, Cambridge, MA, 2000.
[74] T.J. Glezakos, T.A. Tsiligiridis, L.S. Iliadis, C.P. Yialouris, F.P. Maris, K.P. Ferentinos, Feature extraction for time-series data: an artificial neural network evolutionary training model for the management of mountainous watersheds, Neurocomputing 73 (1–3) (2009) 49–59.
[75] A. Cangelosi, S. Nolfi, D. Parisi, Cell division and migration in a 'genotype' for neural networks, Network: Computation in Neural Systems 5 (1994) 497–515.
[76] F. Gruau, Automatic definition of modular neural networks, Adaptive Behaviour 3 (1994) 151–183.
[77] T. Kathirvalavakumar, S. Jeyaseeli Subavathi, Neighborhood based modified backpropagation algorithm using adaptive learning parameters for training feedforward neural networks, Neurocomputing 72 (16–18) (2009) 3915–3921.
[78] J.W.L. Merrill, R.F. Port, Fractally configured neural networks, Neural Networks 4 (1) (1991) 53–60.
[79] H.C. Andersen, A.C. Tsoi, A constructive algorithm for the training of a multilayer perceptron based on the genetic algorithm, Complex Systems 7 (4) (1993) 249–268.
[80] R.E. Smith, H.B. Cribbs III, Is a learning classifier system a type of neural network?, Evolutionary Computation 2 (1) (1994) 19–36.
[81] R.E. Smith, H.B. Cribbs III, Combined biological paradigms: a neural, genetics-based autonomous systems strategy, Robotics and Autonomous Systems 22 (1) (1997) 65–74.
[82] D.E. Moriarty, R. Miikkulainen, Efficient reinforcement learning through symbiotic evolution, Machine Learning 22 (1996) 11–33.
[83] B. DasGupta, G. Schnitger, Efficient approximation with neural networks: a comparison of gate functions, Technical Report, Department of Computer Science, Pennsylvania State University, University Park, 1992.
[84] D.R. Lovell, A.C. Tsoi, The performance of the neocognitron with various S-cell and C-cell transfer functions, Technical Report, Intelligent Machines Laboratory, Department of Electrical Engineering, University of Queensland, April 1992.
[85] D.G. Stork, S. Walter, M. Burns, B. Jackson, Preadaptation in neural circuits, in: Proceedings of the International Joint Conference on Neural Networks, vol. 1, Washington, DC, 1990, pp. 202–205.
[86] D. White, P. Ligomenides, GANNet: a genetic algorithm for optimizing topology and weights in neural network design, in: Proceedings of the International Workshop on Artificial Neural Networks (IWANN'93), Lecture Notes in Computer Science, vol. 686, Springer-Verlag, Berlin, Germany, 1993, pp. 322–327.
[87] Y. Liu, X. Yao, Evolutionary design of artificial neural networks, in: Proceedings of the 1996 IEEE International Conference on Evolutionary Computation (ICEC'96), Nagoya, Japan, pp. 670–675.
[88] M.W. Hwang, J.Y. Choi, J. Park, Evolutionary projection neural networks, in: Proceedings of the 1997 IEEE International Conference on Evolutionary Computation, ICEC'97, pp. 667–671.
[89] A.V. Sebald, K. Chellapilla, On making problems evolutionarily friendly, part I: evolving the most convenient representations, in: V.W. Porto, N. Saravanan, D. Waagen, A.E. Eiben (Eds.), Evolutionary Programming VII: Proceedings of the 7th Annual Conference on Evolutionary Programming, Lecture Notes in Computer Science, vol. 1447, Springer-Verlag, Berlin, Germany, 1998, pp. 271–280.
[90] V. Khare, X. Yao, B. Sendhoff, Multi-network evolutionary systems and automatic problem decomposition, International Journal of General Systems 35 (3) (2006) 259–274.
[91] X. Yao, Md.M. Islam, Evolving artificial neural network ensembles, IEEE Computational Intelligence Magazine 3 (1) (2008) 31–42.
[92] M.P. Perrone, L.N. Cooper, When networks disagree: ensemble methods for hybrid neural networks, in: R.J. Mammone (Ed.), Neural Networks for Speech and Image Processing, Chapman & Hall, London, UK, 1993, pp. 126–142.
[93] A. Chandra, X. Yao, Ensemble learning using multi-objective evolutionary algorithms, Journal of Mathematical Modelling and Algorithms 5 (4) (2006) 417–445.
[94] A. Chandra, X. Yao, Evolving hybrid ensembles of learning machines for better generalisation, Neurocomputing 69 (7–9) (2006) 686–700.
[95] N. Garcia-Pedrajas, C. Hervas-Martinez, D. Ortiz-Boyer, Cooperative coevolution of artificial neural network ensembles for pattern classification, IEEE Transactions on Evolutionary Computation 9 (3) (2005) 271–302.
[96] V.R. Khare, X. Yao, B. Sendhoff, Y. Jin, H. Wersing, Co-evolutionary modular neural networks for automatic problem decomposition, in: Proceedings of the 2005 IEEE Congress on Evolutionary Computation, vol. 3, 2005, pp. 2691–2698.
[97] P.A. Castillo, M.G. Arenas, J.J. Castillo-Valdivieso, J.J. Merelo, A. Prieto, G. Romero, Artificial neural networks design using evolutionary algorithms, in: Proceedings of the Seventh World Conference on Soft Computing, 2002.
[98] D. Crosher, The artificial evolution of a generalized class of adaptive processes, in: X. Yao (Ed.), Preprints of AI'93 Workshop on Evolutionary Computation, 1993, pp. 18–36.
[99] P. Turney, D. Whitley, R. Anderson, Special issue on the Baldwinian effect, Evolutionary Computation 4 (3) (1996) 213–329.
[100] J. Baxter, The evolution of learning algorithms for artificial neural networks, in: D. Green, T. Bossomaier (Eds.), Complex Systems, IOS Press, Amsterdam, 1992, pp. 313–326.
[101] Y. Bengio, S. Bengio, Learning a synaptic learning rule, Technical Report 751, Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, Canada, 1990.
[102] S. Bengio, Y. Bengio, J. Cloutier, J. Gecsei, On the optimization of a synaptic learning rule, in: Preprints of the Conference on Optimality in Artificial and Biological Neural Networks, University of Texas, Dallas, 1992.
[103] A. Ribert, E. Stocker, Y. Lecourtier, A. Ennaji, Optimizing a neural network architecture with an adaptive parameter genetic algorithm, Lecture Notes in Computer Science, vol. 1240, Springer-Verlag, 1994, pp. 527–535.
[104] H. Kim, S. Jung, T. Kim, K. Park, Fast learning method for back-propagation neural network by evolutionary adaptation of learning rates, Neurocomputing 11 (1) (1996) 101–106.
[105] R. Belew, J. McInerney, N. Schraudolph, Evolving networks: using the genetic algorithm with connectionist learning, in: Proceedings of the Second Artificial Life Conference, Addison-Wesley, New York, NY, 1991, pp. 511–547.
[106] D. Patel, Using genetic algorithms to construct a network for financial prediction, in: Proceedings of SPIE: Applications of Artificial Neural Networks in Image Processing, Society of Photo-Optical Instrumentation Engineers, Bellingham, WA, USA, 1996, pp. 204–213.
[107] J. Merelo, M. Patón, A. Canas, A. Prieto, F. Morán, Genetic optimization of a multilayer neural network for cluster classification tasks, Neural Network World 3 (1993) 175–186.
[108] D. Chalmers, The evolution of learning: an experiment in genetic connectionism, in: D.S. Touretzky, J.L. Elman, G.E. Hinton (Eds.), Proceedings of the 1990 Connectionist Models Summer School, Morgan Kaufmann, San Mateo, CA, 1990, pp. 81–90.
[109] D.J. Montana, Strongly typed genetic programming, Evolutionary Computation 3 (2) (1995) 199–200.
[110] D. Rivero, J. Dorado, J. Rabuñal, A. Pazos, Using genetic programming for artificial neural network development and simplification, in: Proceedings of the Fifth WSEAS International Conference on Computational Intelligence, Man-Machine Systems and Cybernetics (CIMMACS'06), WSEAS Press, 2006, pp. 65–71.
[111] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, New York, 1995.
[112] B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, 1996.
[113] L. Prechelt, Early stopping - but when?, in: Neural Networks: Tricks of the Trade, 1996, pp. 55–69.
[114] L. Prechelt, Automatic early stopping using cross validation: quantifying the criteria, Neural Networks 11 (1998) 761–767.
[115] C.J. Merz, P.M. Murphy, UCI repository of machine learning databases, 2002, http://www-old.ics.uci.edu/pub/machine-learning-databases.
[116] R.A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics (1936) 179–188.
[117] M. Stone, Cross-validation: a review, Mathematische Operationsforschung und Statistik, Series Statistics 9 (1978) 127–139.
[118] T.G. Dietterich, Approximate statistical tests for comparing supervised classification learning algorithms, Neural Computation 10 (7) (1998) 1895–1924.
[119] K. Deb, A. Anand, D. Joshi, A computationally efficient evolutionary algorithm for real-parameter optimization, Evolutionary Computation 10 (4) (2002) 371–395.
[120] J. Yang, V. Honavar, Feature subset selection using a genetic algorithm, IEEE Intelligent Systems 13 (1998) 44–49.
[121] M. Ozdemir, F. Embrechts, C.M. Breneman, L. Lockwood, K.P. Bennett, Feature selection for in-silico drug design using genetic algorithms and neural networks, in: IEEE Mountain Workshop on Soft Computing in Industrial Applications, IEEE Press, 2001, pp. 53–57.
[122] R. Reed, Pruning algorithms - a survey, IEEE Transactions on Neural Networks 4 (5) (1993) 740–747.
[123] S.J. Marshall, R.F. Harrison, Optimization and training of feedforward neural networks by genetic algorithms, in: Proceedings of the Second International Conference on Artificial Neural Networks and Genetic Algorithms, Springer-Verlag, 1991, pp. 39–43.
[124] T.W. Brotherton, P.K. Simpson, D.B. Fogel, T. Pollard, Classifier design using evolutionary programming, in: A.V. Sebald, L.J. Fogel (Eds.), Proceedings of the Third Annual Conference on Evolutionary Programming, World Scientific Publishers, River Edge, NJ, 1994, pp. 68–75.
[125] I.T. Jolliffe, Principal Component Analysis, Springer-Verlag, New York, 1986.
[126] W. Duch, R. Adamczak, K. Grabczewski, A new methodology of extraction, optimisation and application of crisp and fuzzy logical rules, IEEE Transactions on Neural Networks 11 (2) (2000).
[127] D. Rivero, J. Rabuñal, J. Dorado, A. Pazos, Automatic design of ANNs by means of GP for data mining tasks: Iris flower classification problem, in: Adaptive and Natural Computing Algorithms, Eighth International Conference, ICANNGA 2007, Warsaw, Poland, April 2007, Proceedings, 2007, pp. 276–285.
[128] J.R. Rabuñal, Entrenamiento de redes de neuronas artificiales mediante algoritmos genéticos, Universidade da Coruña, 1999.
[129] D. Rivero, J. Dorado, J. Rabuñal, A. Pazos, J. Pereira, Artificial neural network development by means of genetic programming with graph codification, Enformatika, Transactions on Engineering, Computing and Technology, World Enformatika Society 15 (2006) 209–214.
[130] M.D. Ritchie, B.C. White, J.S. Parker, L.W. Hahn, J.H. Moore, Optimization of neural network architecture using genetic programming improves detection and modeling of gene–gene interactions in studies of human diseases, BMC Bioinformatics 7 (4) (2003) 28.
[131] M.D. Ritchie, A.A. Motsinger, W.S. Bush, C.S. Coffey, J.H. Moore, Genetic programming neural networks: a powerful bioinformatics tool for human genetics, Applied Soft Computing 7 (1) (2007) 471–479.
[132] J.R. Koza, J.P. Rice, Genetic generation of both the weights and architecture for a neural network, in: International Joint Conference on Neural Networks, vol. II, IEEE Press, 1991, pp. 397–404.
Daniel Rivero was born in A Coruña on January 30, 1978. He obtained his M.S. degree in Computer Science at the University of A Coruña, A Coruña, Spain, in 2001 and his Ph.D. degree in Computer Science in 2007 from the same university. He is currently an assistant professor at the Faculty of Computer Science of the University of A Coruña. Before obtaining that academic position, he had received research grants from different administrations for more than four years. His main research interests are artificial neural networks, genetic algorithms, genetic programming and adaptive systems.
Julian Dorado was born in A Coruña on July 30, 1970. He obtained his M.S. degree in Computer Science at the University of A Coruña, A Coruña, Spain, in 1994 and his Ph.D. degree in Computer Science from the same university in 1999. He also obtained his M.S. in Biology at the University of A Coruña in 2004. He is a Senior Lecturer at the Faculty of Computer Science of the University of A Coruña. He has headed several research projects for the university and for the regional and national governments. He is author or co-author of over 90 contributions to the most important conferences and of over 40 papers in different journals. He is co-editor of the Encyclopedia of Artificial Intelligence, Information Science Reference, 2008. His research interest is focused on artificial embryogeny, genetic algorithms, genetic programming, artificial neural networks and bioinformatics.
Juan Ramón Rabuñal was born in Arteixo on January 21, 1973. He obtained his B.Sc. degree in Computer Science in 1996, his M.S. degree in Computer Science in 1999 and his Ph.D. degree in Computer Science in 2002, all of them at the University of A Coruña, A Coruña, Spain. He also obtained his Ph.D. degree in Civil Engineering in 2008 at the same university. Nowadays, he shares his time between his lecturer position at the Faculty of Computer Science of the University of A Coruña and the direction of the Center of Technological Innovations in Construction and Civil Engineering. He has also headed several research projects for the university as well as the regional and national governments. His main research interests are artificial neural networks, genetic programming, genetic algorithms and artificial intelligence in civil engineering.
Alejandro Pazos was born in Padrón on November 20, 1959 (M-92). He obtained his M.S. degree in Medicine and Surgery in 1987 at the USC, Santiago de Compostela, Spain. He also holds a Ph.D. degree in Medicine obtained at the UCM, Madrid, Spain, in 1996, an M.S. degree in Computer Science obtained at the UPM, Madrid, Spain, in 1989 and a Ph.D. degree in Computer Science obtained at the UPM, Madrid, Spain, in 1990. He has been a Professor at the Faculty of Computer Science of the University of A Coruña, A Coruña, Spain, since 1999. He is also the head of the Department of Information and Communication Technologies at the same university. He has headed several research projects for regional, national and international administrations. He is author or co-author of more than 60 papers in different journals and more than 160 papers in different conferences. His main research interests are biomedical computer systems, artificial neural networks and artificial intelligence in medicine. Dr. Pazos is also affiliated to INNS, ACM and IAKE. He has worked as a consultant for the Spanish Ministry of Defence, and as a representative of the Science and Technology Ministry of Spain and of the Galician Regional Research and Development plan.