Neural Networks 22 (2009) 614–622
2009 Special Issue
Neural networks with multiple general neuron models: A hybrid computational intelligence approach using Genetic Programming
Alan J. Barton, Julio J. Valdés, Robert Orchard
Knowledge Discovery Group, Institute for Information Technology, National Research Council Canada, Ottawa, Ontario, Canada
Article info
Article history: Received 5 May 2009; received in revised form 11 June 2009; accepted 25 June 2009.
Keywords: General neuron model; Evolutionary Computation; Genetic Programming; Hybrid algorithm; Machine learning; Parameter space; Visualization
Abstract
Classical neural networks are composed of neurons whose nature is determined by a certain function (the neuron model), usually pre-specified. In this paper, a type of neural network (NN-GP) is presented in which: (i) each neuron may have its own neuron model in the form of a general function, (ii) any layout (i.e. network interconnection) is possible, and (iii) no bias nodes or weights are associated to the connections, neurons or layers. The general functions associated to a neuron are learned by searching a function space. They are not provided a priori, but are rather built as part of an Evolutionary Computation process based on Genetic Programming. The resulting network solutions are evaluated based on a fitness measure, which may, for example, be based on classification or regression errors. Two real-world examples are presented to illustrate the promising behaviour on classification problems via construction of a low-dimensional representation of a high-dimensional parameter space associated to the set of all network solutions.
© 2009 Published by Elsevier Ltd
1. Introduction

Many different neuron models, neural network architectures and learning procedures have been proposed, addressing two main types of problems: regression and classification. Both can be seen as derived from the general problem of function approximation. The feed forward neural network (multilayer perceptron) (Bishop, 2004; Ripley, 1996) is probably the most popular and it is composed of a layout of similar neurons arranged into layers, trained with the backpropagation algorithm or one of its many variants. In networks of these types, the goal of the learning procedure is to find the vector of weights for each neuron in the network such that a given function is optimized (a classification error, a least squared error, an entropy measure, etc.). From a more general perspective, the networks may be composed of neurons in which the neuron model is not fixed and in which the architecture is not necessarily a layered one. Fig. 1 shows a general network architecture with general multiple neuron models and Fig. 2 shows the same neuron models for a layered network. In particular, the neurons do not consist of the composition of aggregation and activation functions endowed with a vector of weights (possibly with a bias). Rather, they are given by a general analytical (deterministic) function, which can be connected in any way specified by a directed graph in order to define a
network. In this case, the learning procedure is oriented towards finding the collection of functions (and their parameters) such that the network output optimizes a given performance measure, as previously indicated. The use of general analytic functions as neuron models is appealing because they are: (i) easy for humans to understand, (ii) the preferred building blocks of modeling, and (iii) a classical and highly condensed form of knowledge. There are many possible approaches to Evolutionary Computation based learning of neural networks (Koehn, 1996; Stanley, 2004; Yao, 1993; Zhang & Mühlenbein, 1993). For example, Montana and Davis (1989) encode the weights of a fixed-architecture neural network within one chromosome and use a Genetic Algorithm (GA) to perform global optimization in order to find potential weight solutions for a sonar image classification problem. This paper uses multiple chromosomes (Valdés, Orchard, & Barton, 2007) and learns the complete function associated to a neuron, not only the weights (Barton & Valdés, 2008; Barton, 2009). The purpose of this paper is to extend previous results (Barton, Valdés, & Orchard, 2009) and to report the promising empirical behaviour of NN-GP when used for two real-world classification problems: underground cave detection and hydrochemical research in the Arctic. In particular, low-dimensional visual representations of the algorithm's parameter spaces are used to demonstrate the network solution qualities with respect to their parameters.

Fig. 1. General Neural Network containing (n) neurons where all activities occur (e.g. activation, aggregation). Weights are learned within the neuron. ℵi(·) is the function associated to the ith neuron, with output oi. The network may be organized into layers. Not all possible connections are shown. No bias neurons are used.

Fig. 2. Special case of the general neural network: a feed forward neural network for one specific architecture (NN: n−m−c). There are 3 layers and (n + m + c) neurons, where neurons in layer i are connected to neurons in layer i + 1. Neurons may not use all inputs, implying a connectivity upper bound. Possibilities exist providing more reuse (e.g. neurons in layer i may connect to neurons in all layers greater than i).

2. Evolutionary Computation

A general Evolutionary Computation (EC) algorithm (Algorithm 1) takes a problem, P, as input and produces a set of solutions, S, as output.
Algorithm 1 consists of three main stages: (i) initialization (Lines 1–4), (ii) processing of generations (Lines 5–11), and (iii) post-processing (Line 12).

Algorithm 1: General EC Specification
Input: A problem, P (i.e. a question)
Output: A set of solutions, S (i.e. possible answer(s) to the question)
1:  Let A ⇐ archivePopulations(∅, ∅);
2:  Let FP ⇐ constructPopulations(∅);
3:  computeFitness(FP);
4:  A ⇐ archivePopulations(A, FP);
5:  while not terminationCriteriaSatisfied(FP) do
6:      Let FP′ ⇐ constructPopulations(FP);
7:      computeFitness(FP′);
8:      FP ⇐ combinePopulations(FP, FP′);
9:      A ⇐ archivePopulations(A, FP);
10:     FP ⇐ selectIndividuals(FP, A);
11: end
12: S ⇐ selectIndividuals(∅, A);
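For illustration only, the following Python sketch mirrors the overall structure of Algorithm 1 under simplifying assumptions (a single population, mutation as the only variation operator, and termination by generation count, i.e. Case I only); the helper callables are hypothetical stand-ins, not the actual NN-GP implementation.

```python
import random

def run_ec(fitness_fn, make_individual, mutate, pop_size=20, max_generations=50):
    """Minimal sketch of the generic EC loop of Algorithm 1 (single population, mutation only)."""
    archive = []                                           # Line 1: A <- empty archive
    pop = [make_individual() for _ in range(pop_size)]     # Line 2: construct initial population
    fits = [fitness_fn(ind) for ind in pop]                # Line 3: compute fitness
    archive.extend(zip(pop, fits))                         # Line 4: archive the population
    for _ in range(max_generations):                       # Line 5: termination criterion (Case I only)
        offspring = [mutate(random.choice(pop)) for _ in range(pop_size)]   # Line 6: new individuals
        off_fits = [fitness_fn(ind) for ind in offspring]                   # Line 7: their fitness
        merged = sorted(zip(pop + offspring, fits + off_fits),              # Line 8: combine populations
                        key=lambda pair: pair[1], reverse=True)
        archive.extend(merged)                                              # Line 9: update archive
        pop = [ind for ind, _ in merged[:pop_size]]                         # Line 10: select survivors
        fits = [fit for _, fit in merged[:pop_size]]
    return max(archive, key=lambda pair: pair[1])                           # Line 12: best archived solution

# Toy usage: evolve a real number towards 3.0 (higher fitness is better).
best, best_fitness = run_ec(
    fitness_fn=lambda x: -abs(x - 3.0),
    make_individual=lambda: random.uniform(-10.0, 10.0),
    mutate=lambda x: x + random.gauss(0.0, 0.5),
)
print(best, best_fitness)
```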
2.1. Genetic Programming (GP)

Analytic functions are among the preferred building blocks for modeling and a highly condensed form of knowledge, but the direct discovery of general analytic functions poses enormous challenges because of the size of the search space. This problem can be approached from a computational intelligence perspective via Evolutionary Computation. In particular, Genetic Programming techniques aim at evolving computer programs, which ultimately are functions. Genetic Programming was introduced by Koza (1989, 1992, 1994) and Koza, Bennett, Andre, and Keane (1999) as an extension of genetic algorithms that evolves a population of computer programs. In Algorithm 1, the family of populations of individuals FP (Lines 2, 8, 9, 10) or FP′ (Lines 6, 8) is usually a family composed of one set; in other words, FP and FP′ are each a simple set of individuals I. GP programs may be symbolic expression trees, which are essentially functions. Algorithm 1 uses Koza's computer programs on Lines 2, 6 and 8, and names them internal representations. More specifically, one of Koza's computer programs is called an S-expression (S stands for symbolic) and is used, for example, within the List Processing Language (LISP). An example of a LISP S-expression that represents a neural network is given in Koza and Rice (1991).

2.2. Gene expression programming (GEP)

Gene expression programming (GEP) (Ferreira, 2001, 2006) is a variant of Genetic Programming where individuals are expression trees encoded as simple strings of fixed length (chromosomes) representing entities of different sizes and shapes, generally nonlinear expressions. For the interplay between GEP chromosomes and expression trees (ETs), GEP uses an unambiguous translation system to transfer the language of chromosomes into the language of expression trees and vice versa (Ferreira, 2006). The set of genetic operators applied to GEP chromosomes always produces valid ETs. The chromosomes in GEP are composed of genes structurally organized into a head and a tail (Ferreira, 2001). The head contains symbols that represent both functions (elements from a function set F) and terminals (elements from a terminal set T), whereas the tail contains only terminals. Therefore, two different alphabets occur at different regions within a gene. For each problem, the length of the head h is chosen, whereas the length of the tail t is a function of h and of the number of arguments of the function with the largest arity (nmax): t = h(nmax − 1) + 1. As an example, consider a gene composed of the function set F = {Q, +, −, ∗, /}, where Q represents the square root function, and the terminal set T = {a, b}. In this case nmax = 2. For h = 10 the tail length is t = 11, and the length of the gene is 10 + 11 = 21. Such a gene looks like *Q-b++a/-bbaabaaabaab (the first 10 symbols are the head, the remaining 11 the tail), and corresponds to the mathematical expression f(a, b) = √b · ((a + b/a) − ((a − b) + b)), which simplifies to f(a, b) = (b·√b)/a.
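The Karva (breadth-first) decoding of such a gene into an expression tree can be sketched as follows, assuming the leading symbol of the gene above is '*'; this is an illustrative reading of the translation rule, not code from the GEP system used in the paper.

```python
import math

ARITY = {'Q': 1, '+': 2, '-': 2, '*': 2, '/': 2}   # Q = square root; terminals a, b take no arguments

def karva_to_tree(gene):
    """Breadth-first (Karva) decoding of a GEP gene into a nested [symbol, children] tree."""
    symbols = list(gene)
    root = [symbols[0], []]
    level, pos = [root], 1
    while level:
        next_level = []
        for node in level:
            for _ in range(ARITY.get(node[0], 0)):   # terminals contribute no children
                child = [symbols[pos], []]
                pos += 1
                node[1].append(child)
                next_level.append(child)
        level = next_level
    return root

def evaluate(node, a, b):
    sym, kids = node
    if sym == 'a': return a
    if sym == 'b': return b
    if sym == 'Q': return math.sqrt(evaluate(kids[0], a, b))
    x, y = (evaluate(k, a, b) for k in kids)
    return {'+': x + y, '-': x - y, '*': x * y, '/': x / y}[sym]

tree = karva_to_tree('*Q-b++a/-bbaabaaabaab')        # head = first 10 symbols, tail = last 11
print(evaluate(tree, a=2.0, b=3.0))                   # sqrt(b) * (b / a) = approx. 2.598 here
```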
GEP chromosomes are usually composed of more than one gene of equal length. For each problem, the number of genes as well as the length of the head has to be chosen. Each gene encodes a sub-ET, and the sub-ETs interact with one another, forming more complex multi-subunit ETs through a connection function.
As an evolutionary algorithm, GEP defines its own set of crossover, mutation and other operators (Ferreira, 2001, 2006):
Inversion: restricted to the heads of genes. Any sequence might be randomly selected and inverted. The inversion operator randomly chooses the chromosome, the gene to be modified, and the start and termination points of the sequence to be inverted.
Mutation: can occur anywhere in the chromosome, but the structural organization of chromosomes must remain intact; that is, in the heads any symbol can change into another (function or terminal), whereas in the tails terminals can only change into other terminals. This way, the structural organization of chromosomes is preserved, and all the new individuals produced by mutation are structurally correct programs.
One-point recombination: the parent chromosomes are paired and split at exactly the same point. The material downstream of the recombination point is then exchanged between the two chromosomes.
Two-point recombination: two parent chromosomes are paired and two points are randomly chosen as crossover points. The material between the recombination points is then exchanged between the parent chromosomes, forming two new daughter chromosomes.
Gene recombination: entire genes are exchanged between two parent chromosomes, forming two daughter chromosomes containing genes from both parents. The exchanged genes are randomly chosen and occupy exactly the same position in the parent chromosomes.
RIS transposition: Root Insertion Sequence (RIS) elements are short fragments with a function in the first position that transpose to the start position of genes. All RIS elements start with a function, and therefore must be chosen from among the sequences of the heads. For that, a point is randomly chosen in the head and the gene is scanned downstream until a function is found. This function becomes the start position of the RIS element. If no function is found, the operator does nothing.
IS transposition: any sequence in the genome might become an Insertion Sequence (IS) element, and therefore these elements are randomly selected throughout the chromosome. A copy of the IS element is made and inserted at any position in the head of a gene, except the first position.
Gene transposition: an entire gene works as a transposon and transposes itself to the beginning of the chromosome. In contrast to the other forms of transposition, in gene transposition the transposon (the gene) is deleted at the place of origin. The gene transposition operator randomly chooses the chromosome to be modified and then randomly chooses one of its genes (except the first) to transpose.
RNC mutation: direct mutation of random numerical constants (RNCs); randomly selects particular targets in the arrays where the random numerical constants are kept, and randomly generates a new constant.
Dc mutation: similar to the mutation that occurs in the heads and tails of genes.
Dc inversion: randomly chooses the chromosome, the gene to be modified, and the start and termination points of the sequence to be inverted.
Dc IS transposition: randomly chooses the chromosome, the start and termination points of the IS element, and the target site.
To evaluate GEP chromosomes, different fitness functions can be used. The fitness $f_i$ of individual $i$ is expressed by some inverse relation with respect to an error measure, or by some positive relation with respect to a correlation or similarity measure. Typical choices are the absolute error $f_i = \sum_{j=1}^{C_t}\left(M - \left|C_{(i,j)} - T_{(j)}\right|\right)$, or the relative error $f_i = \sum_{j=1}^{C_t}\left(M - \left|\frac{C_{(i,j)} - T_{(j)}}{T_{(j)}}\cdot 100\right|\right)$, where $C_{(i,j)}$ is the value returned by chromosome $i$ for fitness case $j$, $T_{(j)}$ is the target value for fitness case $j$, and $M$ is the range of selection.
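As a sketch of how the head/tail organization constrains the operators just listed, the following hypothetical point-mutation routine only ever places terminals in the tail; the 0.044 rate reuses the mutation rate from Table 2.

```python
import random

FUNCTIONS = ['Q', '+', '-', '*', '/']     # function set F (Q = square root)
TERMINALS = ['a', 'b']                    # terminal set T

def mutate_gene(gene, head_len, rate=0.044):
    """Point mutation preserving the head/tail organization of a GEP gene:
    head positions may receive functions or terminals, tail positions only terminals."""
    out = []
    for i, sym in enumerate(gene):
        if random.random() < rate:
            pool = FUNCTIONS + TERMINALS if i < head_len else TERMINALS
            out.append(random.choice(pool))
        else:
            out.append(sym)
    return ''.join(out)

print(mutate_gene('*Q-b++a/-bbaabaaabaab', head_len=10))
```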
2.3. Fitness function

The algorithm can be applied to many domains because it can be tailored via its computation of fitness for an individual (Lines 3, 7) in order to include domain-dependent information. In fact, these fitness functions are used within the search methodology (Lines 4, 5, 9, 10) in order to determine whether an exact or approximate solution is desired from the family of populations of individuals FP (Lines 2, 8, 9, 10) or FP′ (Lines 6, 8). The internal representation of an individual is the data that a fitness function can access in order to compute the fitness. Algorithm 1 performs operations on this internal representation (Lines 2, 6, 8) that create new internal representations. The internal representation that is returned from Algorithm 1 on Line 12 will lead to the external representation that will be presented as the algorithm's solution to the particular problem of interest.
For NN-GP there is a one-to-one correspondence between chromosomes and neurons. Constructing new individuals will result in new chromosome values and hence new neuron functions over time. When an individual's fitness needs to be ascertained, the neural network needs to be evaluated for its performance (see Algorithm 2). In summary, GEP is being used as a global optimizer in order to minimize the neural network error on training data.
The fitness (I_fit) for an individual (I) may be computed in many ways. Two examples of fitness functions that result in values in [0, 1000] and that are appropriate under different conditions are:

$$I_{fit} = (1.0 - E) \cdot 1000.0 \quad (1)$$

$$I_{fit} = \frac{1.0}{E + 1.0} \cdot 1000.0 \quad (2)$$
Eq. (1) should be used for an error measure in [0..1], such as a classification error (Barton & Valdés, 2008): c P
EC =
) − card(Ciexp ∩ Ciobs )
i =1
(3)
card(Obj) c P
E CN =
exp
card(Ci
exp
card(Ci
) − card(Ciexp ∩ Ciobs ) /card(Ciexp )
i =1
c
(4)
while Eq. (2) is intended for an error measure in [0, ∞), such as a dissimilarity value (Sammon, 1969):

$$E_R^{Sammon} = \frac{1}{\sum_{i<j}\delta_{ij}}\sum_{i<j}\frac{(\delta_{ij} - \zeta_{ij})^2}{\delta_{ij}} \quad (5)$$
where δij is a dissimilarity measure in the original space between any objects i, j and ζij is a dissimilarity measure in the new space between the images of objects i, j.
It is clear that there are many possible error measures (E) that may be investigated. Typically, classical algorithms have been used for directly optimizing such measures: steepest descent, conjugate gradient, Fletcher–Reeves, Powell, Levenberg–Marquardt, and others. They suffer from entrapment in local extrema. Moreover, the mappings obtained using approaches of this kind are only implicit, as no functional representations are found. This paper constructs explicit mappings using NN-GP.
NN-GP heuristic termination criteria: the NN-GP algorithm will terminate if one of the following cases is satisfied:
Case I: the pre-specified maximum number of generations has been reached;
Case II: the error measure (E) increases on test data.
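For concreteness, a minimal sketch of how a classification error in the spirit of Eq. (3) is turned into a fitness value via Eq. (1); the class labels and data below are invented for illustration.

```python
def classification_error(expected, observed):
    """Fraction of objects whose observed class differs from the expected class;
    equivalent to Eq. (3), since summing per-class misclassifications gives the total."""
    wrong = sum(1 for e, o in zip(expected, observed) if e != o)
    return wrong / len(expected)

def fitness_from_error(error):
    """Eq. (1): map an error in [0, 1] to a fitness in [0, 1000] (higher is better)."""
    return (1.0 - error) * 1000.0

expected = [1, 1, 2, 2, 2, 1]
observed = [1, 2, 2, 2, 1, 1]
e_c = classification_error(expected, observed)     # 2 errors / 6 objects = 0.333...
print(e_c, fitness_from_error(e_c))                # approx. 0.333 and 666.7
```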
Algorithm 2: NN-GP’s Processing Algorithm
Input: Configuration parameters and an Individual (I) composed of multiple chromosomes
Output: The fitness (I_fit) associated to I
1:  if isFitAlreadyComputed(I) then return I_fit;
2:  Net ⇐ createFFNN(I, NN topology);
3:  Dict ⇐ {};                                   // construct neuron dictionary
4:  neuronIndex ⇐ 1;
5:  forall layersInNetwork do
6:      forall neuronsInLayer do
7:          c ⇐ getChromoAt(I, neuronIndex);
8:          neuronName ⇐ c.name;
9:          insert(neuronName, c.equation, Dict);
10:         neuronIndex ⇐ neuronIndex + 1;
11:     end
12: end
13: expandAllNeuronEquations(Dict);
14: E^Tr ⇐ computeError(Net, {Training Data});   // E^Te ⇐ computeError(Net, {Testing Data})
15: I_fit ⇐ f(E^Tr);                             // see Eq. (1) or Eq. (2)
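A minimal sketch of the flow of Algorithm 2, assuming each chromosome has already been decoded into a callable neuron function; the Individual class, the layer layout and the winner-take-all output mapping (one plausible reading of Table 1) are illustrative assumptions, not the paper's implementation.

```python
class Individual:
    """One decoded neuron function per chromosome, grouped by layer, as in NN-GP."""
    def __init__(self, neuron_fns):
        self.neuron_fns = neuron_fns   # list of layers; each layer is a list of callables
        self.fitness = None

def forward(individual, inputs):
    """Each neuron applies its own general function to the previous layer's outputs
    (no weights, biases or fixed activation); the first layer sees the input variables."""
    values = list(inputs)
    for layer in individual.neuron_fns:
        values = [fn(values) for fn in layer]
    return values

def nn_gp_fitness(individual, training_data):
    if individual.fitness is not None:                   # line 1: reuse a cached fitness
        return individual.fitness
    wrong = 0
    for x, expected_class in training_data:
        outputs = forward(individual, x)
        predicted = outputs.index(max(outputs)) + 1      # winner-take-all over output neurons
        wrong += (predicted != expected_class)
    e_tr = wrong / len(training_data)                    # training error (Eq. (3) style)
    individual.fitness = (1.0 - e_tr) * 1000.0           # Eq. (1)
    return individual.fitness

# Toy usage: a 2-2-2 network whose neurons are arbitrary expressions of their inputs.
ind = Individual([
    [lambda v: v[0] + v[1], lambda v: v[0] - v[1]],            # hidden layer
    [lambda v: 0.5 * v[0], lambda v: v[1] * v[1] - 1.0],       # output layer, one neuron per class
])
print(nn_gp_fitness(ind, [([1.0, 2.0], 1), ([2.0, 1.0], 2)]))  # 500.0 for this toy data
```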
Table 1
Mapping functions (ε = 0.001).
$_{f_1}\aleph_1^{\ell_{out}}(x) = c \cdot \frac{2}{\pi} \cdot \operatorname{asin}\!\left(\sin\!\left(\pi \cdot x - \frac{\pi}{2}\right)\right)$
$_{f_2}\aleph_1^{\ell_{out}}(x) = \frac{c}{1 + e^{-x}}$
$_{f_1}\aleph_{>1}^{\ell_{out}} = \left\{ c_i : i \in \{1, 2, \ldots, c = \operatorname{card}(\ell_{out})\} \ \text{and} \ o_i^{\ell_{out}} = \max_{1 \le j \le c} o_j^{\ell_{out}} \right\}$
$_{f_2}\aleph_{>1}^{\ell_{out}} = \left\{ c_i : i \in \{1, 2, \ldots, c = \operatorname{card}(\ell_{out})\} \ \text{and} \ \left| \sum_{j=1}^{c} \frac{1}{1 + e^{-o_j^{\ell_{out}}}} - \delta_i \right| < \varepsilon \right\}$
For both cases, the optimization process (see Algorithm 1) is maximizing fitness, which is related to minimizing training error; e.g. $E^{Tr} \in \{E_C^{Tr}, E_{CN}^{Tr}, E_{R\,Sammon}^{Tr}, \ldots\}$. For Case II, the error measure is the same as that used for the calculation on the training data; e.g. $E^{Te} \in \{E_C^{Te}, E_{CN}^{Te}, E_{R\,Sammon}^{Te}, \ldots\}$.
In Algorithm 1, the archive, A, is initialized to be empty on Line 1 and is updated on Lines 4 and 9. Every time an update is made, the current minimum archived training-error result, I_best ∈ A, is compared to I′_best ∈ P′ = FP′. That is, if (E^Tr(I′_best) < E^Tr(I_best)) and (E^Te(I′_best) > E^Te(I_best)) then Case II is satisfied, and termination should occur. The intuition behind Case II is that the population of ''good'' individuals is not generalizing on test data, so it is assumed that NN-GP is overfitting the appropriately mapped (Table 1) problem.

3. Experimental settings

The broad scale experimental settings are reported in Table 2. There is a large number of parameters that may take on a large number of values, and this paper only explores a small fraction; some are related to the neural network (NN) and some to the evolutionary algorithm (GEP).
Table 2
Broad scale experimental settings. There are 8 groups of experiments; within one group of 11,520 experiments, the 10 investigated parameters are marked with ✓. I.L.: Input Layer, H.L.: Hidden Layer, O.L.: Output Layer.
Error measure: Classical (E_C) or normalized (E_CN)
NN architecture: Fixed
NN connectivity: Learned
NN output layer: Single or multiple neurons
NN mapping function: f1ℵ1^ℓout, f2ℵ1^ℓout, f1ℵ>1^ℓout, f2ℵ>1^ℓout
Mapping Fcn tolerance: ε = 0.001 (for f2ℵ>1^ℓout)
✓ NN I.L. constant weights: 1, No. input variables
NN I.L. terminal weights: 1
✓ NN H.L. const. weights: 1, No. input variables
NN H.L. terminal weights: 1
NN O.L. constant weights: 1
NN O.L. terminal weights: 1
GEP seeds: 5 unique randomly chosen
GEP chromo./individual: Determined by No. Neurons
GEP symbol sets: 3 (Determined by No. Layers)
GEP symbol sets (The functions) ''name''(weight): ''+''(1), ''-''(1), ''*''(1)
GEP symbol sets (The terminals): I.L. uses input variables; H.L. uses I.L. outputs; O.L. uses H.L. outputs
✓ GEP genes/chromosome: 1, 2
✓ GEP gene headsize: 2, 3
GEP gene linking function: ''+'' (used when >1 gene)
✓ GEP Float const./gene: 1, 5
GEP float const. bounds: [−10.0, 10.0]
GEP max generations: 500
✓ GEP population size: 5, 20, 50
GEP num. elite individuals: 1
GEP inversion rate: 0.1
GEP mutation rate: 0.044
GEP IS transposition rate: 0.1
GEP RIS transposition rate: 0.1
GEP one point recomb.: 0.3
GEP two point recomb.: 0.3
GEP gene recombination: 0.1
GEP gene transposition: 0.1 (for genes)
✓ GEP species RNC mut.: 0.01, 0.03
✓ GEP species DC mut.: 0.044, 0.01, 0.03
✓ GEP species DC inver.: 0.1, 0.3
✓ GEP species DC IS: 0.1, 0.3
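The group size quoted above can be checked directly from the investigated values; the dictionary keys below are abbreviations of the Table 2 row names.

```python
from itertools import product

# The ten investigated parameter values from Table 2 (names abbreviated):
investigated = {
    'il_const_weights':    [1, 'n_inputs'],
    'hl_const_weights':    [1, 'n_inputs'],
    'genes_per_chromo':    [1, 2],
    'gene_head_size':      [2, 3],
    'float_consts_gene':   [1, 5],
    'population_size':     [5, 20, 50],
    'rnc_mutation_rate':   [0.01, 0.03],
    'dc_mutation_rate':    [0.044, 0.01, 0.03],
    'dc_inversion_rate':   [0.1, 0.3],
    'dc_is_transposition': [0.1, 0.3],
}
combinations = list(product(*investigated.values()))
print(len(combinations))        # 2304 distinct parameter combinations
print(len(combinations) * 5)    # times 5 random seeds = 11,520 experiments per group
```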
3.1. Geophysical Prospecting for caves

The NN-GP approach was applied to geophysical data obtained from an investigation dealing with the detection of underground caves. Cave detection is an important problem in some regions from the point of view of civil and geological engineering. For example, pillars of heavy structures like buildings or bridges may collapse if they are placed on top of caves not opened to the surface.
Geophysical methods complemented with a topographic survey were used in an area containing a known cave, with the purpose of finding the relation between the physical fields and the subsurface phenomena (Valdés & Gil, 1982). The investigated area, as well as the outline of the known cave, is shown in Fig. 4. From a classification point of view, this is a problem with partially defined classes: the existence of a cave beneath a measurement station is either known with confidence if the measurement is made over the known cave, or unknown, since there might be a cave beneath that is not open to the surface. The data variables are: V1 = the spontaneous electric potential (SP) of the earth's surface measured in the dry season, V2 = the vertical component of the electro-magnetic field in the Very Low Frequency (VLF) region of the spectrum, V3 = SP in the rainy season, V4 = gamma radioactive intensity (i.e. radioactivity), and V5 = altitude (local topography). A separation of the different geophysical field sources was necessary in order to focus the study on the contribution coming from underground targets, in an attempt to minimize the influence of both larger geological structures and local heterogeneities.

3.1.1. Data preprocessing

Geophysical fields are measured using different units. In order to neutralize the effect of the different scales introduced by the units of measurement, all values were transformed to standard scores (i.e. to variables with zero mean and unit variance). Each geophysical field was assumed to be described by an additive two-dimensional model composed of trend, signal,
and random noise. To focus on the signals produced by the underground target bodies, a linear trend term was fitted (by least squares) and subtracted from the original field. The residuals were then filtered by direct convolution with a low-pass finite-extent impulse response two-dimensional filter to attenuate the random noise component. The collection of residual fields (as an approximation to the signals for all physical fields) was used for analysis. In total, 1225 points in a regular grid were measured for the five physical fields previously mentioned.

3.2. Hydrochemical research in the Arctic

In the framework of the scientific expedition Spitzbergen'85 (Fagundo, Valdés, & Pulina, 1990; Valdés & Barton, 2007), hydrochemical and glaciological research in the Arctic was conducted. A multinational team performed glaciological and hydrogeological investigations in several regions of the island of Spitzbergen in the Svalbard archipelago. The purpose of the research was to determine mass and energy balance within experimental hydrogeological basins and to study the interaction between natural waters and rock-forming minerals under the conditions of a polar climate. Another goal was to compare with similar processes developed in tropical and temperate climatic conditions. This research made an important pioneering contribution to the evaluation of the impact of global climatic changes.
In the region there are complex processes taking place under peculiar geological, geomorphological and hydrogeological conditions, which are reflected in the water geochemistry. During that expedition, a collection of ice and water samples was taken from different hydrogeological zones in the region of the Hornsund fjord in Spitzbergen, specifically in the Werenskiold Glacier basin. These samples are representative of different zones from a hydrogeological point of view: subglacial, supraglacial, springs, lakes, snow, ice, and the tundra (Fig. 3). Hydrogeological and geochemical studies (Fagundo et al., 1990) have shown the relation between the different hydrogeological contexts in Spitzbergen and the chemical composition of the waters and ice. This is reflected by the existence of several families of waters as given by their geochemical composition. The samples were characterized by a collection of physical and chemical parameters determined in situ: V1 = temperature, V2 = pH, V3 = electrical conductivity, V4 = hydrocarbonate, V5 = chloride, V6 = sulphate, V7 = calcium, V8 = magnesium, and V9 = sodium–potassium.

Fig. 3. Schematic longitudinal section of the Werenskiold Glacier. The different classes of samples for Hydrochemical Research are: (A) precipitation (snow, ice), (B) supraglacier, (C) subglacier, (D) tundra, (E) moraine lakes.

Table 3
Mean errors for the 8 groups of experiments for each data set (E^Tr: training error, E^Te: testing error).
Geophysical Prospecting (c = 1); one output neuron NN:5-1-1, multiple output neurons NN:5-1-2:
Minimize E_C — f1ℵ1^ℓout(x): E^Tr 0.130, E^Te 0.128; f2ℵ1^ℓout(x): E^Tr 0.083, E^Te 0.075; f1ℵ>1^ℓout(x): E^Tr 0.080, E^Te 0.073; f2ℵ>1^ℓout(x): E^Tr 0.113, E^Te 0.107.
Minimize E_CN — f1ℵ1^ℓout(x): E^Tr 0.478, E^Te 0.477; f2ℵ1^ℓout(x): E^Tr 0.329, E^Te 0.319; f1ℵ>1^ℓout(x): E^Tr 0.328, E^Te 0.317; f2ℵ>1^ℓout(x): E^Tr 0.410, E^Te 0.418.
Hydrochemical Research (c = 5); one output neuron NN:9-2-1, multiple output neurons NN:9-2-5:
Minimize E_C — f1ℵ1^ℓout(x): E^Tr 0.688, E^Te 0.659; f2ℵ1^ℓout(x): E^Tr 0.576, E^Te 0.549; f1ℵ>1^ℓout(x): E^Tr 0.559, E^Te 0.524; f2ℵ>1^ℓout(x): E^Tr 0.644, E^Te 0.619.
Minimize E_CN — f1ℵ1^ℓout(x): E^Tr 0.740, E^Te 0.726; f2ℵ1^ℓout(x): E^Tr 0.670, E^Te 0.703; f1ℵ>1^ℓout(x): E^Tr 0.660, E^Te 0.652; f2ℵ>1^ℓout(x): E^Tr 0.740, E^Te 0.743.

4. Results
The experiments investigated 10 parameters (Table 2) controlling the algorithm (NN-GP), as reported in Algorithm 2, for 8 groups of experiments on each selected data set. The 8 groups are determined by selecting one of two error measures and one of four neural network output mapping functions. Each group contains 11,520 experiments (2304 distinct parameter combinations × 5 random seeds), yielding an experimental database composed of 8 · 11,520 = 92,160 neural network solutions for each data set. The mean values for experiments within each of the groups are presented in Table 3. Minimum errors may be an order of magnitude (or more) smaller than the error averages.

4.1. Selected NN-GP result: Geophysical Prospecting

The Geophysical Prospecting learning problem is related to determining the spatial location(s) of potential underground cave(s), given that at least one underground cave is known (Fig. 4).
4.1.1. Group 1 (E_C, f1ℵ1^ℓout) result

This group of experiments deals with neural networks containing 5 input neurons, 1 hidden layer with 1 neuron and 1 output layer with 1 neuron. Experiment Exp_{11,505}^{E_C, NN:5-1-1} was selected. The experiment's associated neural network solution has a training error of E_C^Tr = 0.052 and a testing error of E_C^Te = 0.033, which may be compared with the means reported in Table 3 of 0.130 and 0.128, respectively. The number of per-class correct classifications for training is c1 = 855/884 and c2 = 75/97, while that for testing is c1 = 217/220 and c2 = 19/24. For this selected experiment, the analytic function constructed from the complete neural network and associated only with the single output neuron is:
$$0.02 \cdot (v_1 + 2 \cdot v_2 + v_3 + v_5) - 5.66 \quad (6)$$
with v_i being described in the data description section of this paper. This function can be seen to contain 4 out of the original 5 variables, indicating the potential use of NN-GP as a feature subset selection algorithm. In addition, since the analytic function is explicit, NN-GP can be interpreted as a constructive induction algorithm. In any case, after application of the output mapping function (f1ℵ1^ℓout) for this group, the crisp class predictions for each of the 1225 measuring stations for both training and testing data were found and are visually represented in Fig. 4.
Fig. 4. Left: Location of the known cave for the Geophysical Prospecting problem. Right: Image map for one neural network solution NN:5-1-1. Error measure: E_C. Mapping function: f1ℵ1^ℓout(x).

Black areas indicate the
neural network's prediction for an underground cave location. This result may be compared to the known underground cave location in Fig. 4 and can be seen to be in quite close agreement. The black region located at approximately (10, 19) in the image is the network's prediction for an underground cave in an area where it is not known whether such a cave exists or not. In fact, a bore hole was drilled in that area and an underground cave was found.

4.2. Selected NN-GP result: Hydrochemical research

The experiment within the group of experiments in which the investigated error measure was E_C and the mapping function was f1ℵ>1^ℓout that leads to the absolute minimum training error is Exp_{20,193}^{E_C, NN:9-2-5}. This network used 9 neurons in the input layer, 2 neurons in the single hidden layer, and 5 neurons in the output layer; one neuron per class. The error values associated to this network solution are E_C^Tr = 0.18 and E_C^Te = 0.15 for training and testing, respectively. The per-class correct training predictions are c1 = 26/30, c2 = 24/24, c3 = 0/9, c4 = 8/9 and c5 = 19/22, while those for correct testing predictions are c1 = 2/2, c2 = 5/5, c3 = 0/1, c4 = 7/7 and c5 = 3/5. It can be observed that all predictions for c2 are correct, while all predictions for c3 are not. The simplified functions associated to each of the 5 output neurons are as follows:
$$(v_2 - 7.02) \cdot (v_4 + v_1 - 8.68) \cdot (v_7 + v_5) \quad (7)$$

$$(v_2 - 7.02) \cdot (v_4 + v_1 - 8.68) \cdot (v_7 + v_5) \cdot 7.12 \quad (8)$$

$$(v_2 - 7.02) \cdot (v_4 + v_1 - 8.68) \cdot (v_7 + v_5) - 10.04 \quad (9)$$

$$v_4 + \frac{6}{4} \cdot v_1 - \frac{25.45}{4} \quad (10)$$

$$-3.86 \quad (11)$$
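To make the use of these explicit neuron functions concrete, the following sketch evaluates Eqs. (7)–(11) for one hypothetical sample and selects the class of the maximally excited output neuron; the variable values are invented for illustration.

```python
def output_neurons(v):
    """Eqs. (7)-(11); v maps variable indices (v1..v9) to their standardized values."""
    shared = (v[2] - 7.02) * (v[4] + v[1] - 8.68) * (v[7] + v[5])   # sub-expression shared by neurons 1-3
    return [
        shared,                                   # Eq. (7), class 1
        shared * 7.12,                            # Eq. (8), class 2
        shared - 10.04,                           # Eq. (9), class 3: always below Eq. (7)
        v[4] + (6 / 4) * v[1] - 25.45 / 4,        # Eq. (10), class 4
        -3.86,                                    # Eq. (11), class 5 (a constant)
    ]

sample = {1: 0.2, 2: 7.5, 4: 4.1, 5: 1.3, 7: 2.0}   # hypothetical values for v1, v2, v4, v5, v7
outputs = output_neurons(sample)
print('predicted class:', outputs.index(max(outputs)) + 1)
```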
Inspection of the equations reveals why c3 has such poor empirical performance; namely, Eq. (9) will never be larger than Eq. (7). In addition, it can be observed that subexpressions within the neuron models are shared. For example, (v7 + v5) is shared between neurons 1, 2 and 3. It can also be observed that neuron 4 (Eq. (10)) partially shares a subexpression with neurons 1, 2 and 3, namely (v4 + v1 − 8.68). In fact, (v4 + (6/4) · v1 − 25.45/4) simplifies to (v4 + 1.5 · v1 − 6.3625).

4.3. Comparison with classical neural networks trained with backpropagation

A neural network backpropagation weight training algorithm (Anderson, 2008) was run on both the Geophysical Prospecting and Hydrochemical Research data sets for a total of 24 trials.
After 1000 epochs, the neural network training was terminated. An output layer learning rate of 0.001, a hidden layer learning rate of 0.1 and a momentum of 0.9 were used. In addition, this classical algorithm learns the neuron weights, uses a bias neuron weight, and has a logistic activation function.
For the Geophysical Prospecting problem, the selected architecture was NN:5-1-1; that is, 3 layers, with 5 neurons in the input layer, 1 neuron in the hidden layer and 1 neuron in the output layer. This architecture was chosen in order to be as close as possible to the reported experiments for NN-GP (see Table 3). One neuron in the output layer was used and the two classes used for training were {0, 1}, enabling the use of the logistic activation function within the output neuron. Training this classical network using backpropagation was performed over 12 trials in order to allow the random assignment of initial weights to occur with a greater chance of finding a good classical solution. The 12 classical networks were found (in the aggregate) to have a training error range E_C^Tr ∈ [0.1937, 0.1975] and a testing error range E_C^Te ∈ [0.2139, 0.2631]. On the other hand, one group of results (of the eight reported in Table 3) was selected from the collected experimental database containing 11,520 solutions for NN-GP using the error measure E_C and the neuron output mapping function f1ℵ>1^ℓout. This group of results has training errors in [0.0428, 0.5872] with a mean training error of 0.0801 (see Table 3), and testing errors in the range [0.0246, 0.5164] with mean 0.0729. When this group is compared to the obtained classical network solutions, it was found that the proportion of better NN-GP results is 11465/11520 · 100 = 99.52%.
For the Hydrochemical Research problem, the architecture was NN:9-2-5. The classes were constructed as a set of additional attributes, with each attribute corresponding to a particular class of water samples. For example, 10000 indicates class 1 and 00001 indicates class 5. This encoding scheme was required for the classical backpropagation algorithm to handle more than two class predictions with the logistic activation function. In addition, the maximum excited output neuron indicates the predicted class. For this set of trials, the 12 solutions had a training error range E_C^Tr ∈ [0.3861, 0.3900] and a testing error range E_C^Te ∈ [0.3847, 0.3856]. As before, the appropriate group of experiments from the collected experimental database was selected and it was found that NN-GP has training errors in the range [0.181, 0.862] with mean 0.560 and testing errors in the range [0.15, 0.8] with mean 0.524, respectively. In this case, the proportion of better NN-GP results is 358/11520 · 100 = 3.12%. This is a smaller percentage than for the Geophysical Prospecting results; but of those errors that are strictly smaller, the best are almost twice as small as the classical network results.
These empirical results demonstrate that NN-GP, without explicit activation and aggregation functions and without explicit bias neurons, is able to construct a neural network whose performance is competitive with that of a classical network algorithm. This indicates a promising research direction in which the controlling parameters of NN-GP need to be investigated more fully. The following section presents one possible approach for such an investigation, from the point of view of the broad properties of the associated parameter spaces.
5. Parameter spaces

Almost every machine learning algorithm depends on a collection of parameters of differing types (real valued, categorical, etc.) which influence the algorithm's behaviour, effectiveness and performance. Two sources of variation are: (i) the algorithm as such and (ii) the nature of the data. For algorithms requiring a large number of controlling parameters, it becomes difficult to understand the relationships between experimental parameters and the quality of the solutions. Additional questions are posed by the combinatorial explosion of the parameter combinations (a high-dimensional parameter space): are there properties of the parameters that systematically lead to good solutions, or is the algorithm capable of producing 'good' solutions more or less uniformly distributed within the parameter space, so that good solutions may be found in the vicinity of almost every parameter combination? For high-dimensional parameter spaces, exploratory tools enabling a visual inspection of parameter spaces become valuable.
The original set of objects O must be mapped into another smaller-dimension space Ô, with an appropriate metric (Valdés, 2003). A feature generation approach of this kind implies information loss and usually involves a nonlinear transformation (ϕ : O → Ô). In this work, unsupervised spaces are constructed because data structure is one of the most important elements to consider, as the location and adjacency relationships between different objects in the new space can give an indication of the similarity relationships (Borg & Lingoes, 1987; Chandon & Pinson, 1981) between the objects. The mapping ϕ is constructed while minimizing some error measure of information loss (Sammon, 1969). A frequently used error measure associated to the mapping ϕ is the Sammon error (Eq. (5)). In order to optimize that error measure, a dissimilarity measure δij derived from Gower's similarity coefficient for real-valued variables was used:

$$\delta_{ij} = \frac{1}{S_{ij}} - 1, \qquad S_{ij} = \frac{\sum_{k=1}^{p} s_{ijk}}{\sum_{k=1}^{p} w_{ijk}}, \qquad s_{ijk} = 1 - \frac{|x_{ik} - x_{jk}|}{\mathrm{Range}_k},$$

where p is the number of variables (attributes), x_ik is the value of attribute k for object i, x_jk is the value of attribute k for object j and Range_k is the range of attribute k. In the target space, the Euclidean distance was used for ζij. The solution of Eq. (5) can be obtained with a wide variety of optimization methods: classical (deterministic), computational intelligence-based, or hybrid. Here, the classical Fletcher–Reeves algorithm was used, which is a well known technique in deterministic optimization (Press, Flannery, Teukolsky, & Vetterling, 1992). The space mappings obtained using an approach of this kind are implicit, as the images of the transformed objects are computed directly and the algorithm does not provide a function representation. The accuracy of the mapping depends on the final error obtained in the optimization process.
According to the experimental settings presented in Table 2, each of the 11,520 experiments made (for each of the examples) represents a vector in a 10-dimensional parameter space. In order to produce an efficient computation of a nonlinear three-dimensional space for visual representation, a preprocessing step was made using a granular computing approach, since working with all of the 11,520 vectors would make the solution of Eq. (5) impractical. Instead, a collection of granules was constructed such that they keep the similarity structure of the whole set (up to a certain similarity threshold), while having a much smaller cardinality. A partition of the original data was computed such that each subset of the partition is composed of elements whose mutual similarity to a representative element within the subset is greater than or equal to a predefined similarity threshold. The set of representatives from the different granules is a smaller data set that preserves the main similarity structure of the whole and is easier and more accurate to optimize.
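A compact sketch of the dissimilarity and error computations just described, assuming equal Gower weights (w_ijk = 1) and invented low-dimensional images:

```python
def gower_dissimilarity(x_i, x_j, ranges):
    """delta_ij = 1/S_ij - 1, with S_ij the mean of s_ijk = 1 - |x_ik - x_jk| / Range_k
    (all weights w_ijk taken as 1 in this sketch)."""
    s = sum(1.0 - abs(a - b) / r for a, b, r in zip(x_i, x_j, ranges)) / len(ranges)
    return 1.0 / s - 1.0

def sammon_error(delta, zeta):
    """Eq. (5): delta and zeta map object pairs (i, j) to original / mapped dissimilarities."""
    norm = sum(delta.values())
    return sum((delta[p] - zeta[p]) ** 2 / delta[p] for p in delta) / norm

# Toy usage with three parameter vectors and their (hypothetical) images in the target space.
X = [[0.1, 5.0], [0.9, 1.0], [0.5, 3.0]]
ranges = [1.0, 10.0]
delta = {(i, j): gower_dissimilarity(X[i], X[j], ranges)
         for i in range(len(X)) for j in range(i + 1, len(X))}
zeta = {(0, 1): 0.9, (0, 2): 0.4, (1, 2): 0.5}      # invented low-dimensional distances
print(sammon_error(delta, zeta))
```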
5.1. Strategy for partitioning neural network solutions

In order to study the set of results, the error associated to each network is linked to the set of algorithm parameters that led to that solution quality. For example, {P1 = V1, P2 = V2, ..., Pn = Vn} may be the set of n investigated parameters Pi along with their associated values Vi for a particular network solution η with error measure E. Of course, the appropriate network quality measure E should be properly chosen and will dramatically affect the final analysis, e.g. E ∈ {E^Tr, E^Te, ...}. The results reported within this paper are for E = E^Tr; in particular, E^Tr = E_C^Tr.
For ease of notation, the investigated group of experimental results for a particular data set has been partitioned into solutions referred to as good, ok and bad using a heuristically constructed rule for class assignment. Of course, other partitioning strategies are possible, such as using statistical information (e.g. quartiles). In any case, the particular thresholds were heuristically chosen in order to produce partitions that have a small number of solutions in the good and bad classes. The general form of the rule assigns a solution to the class good if E ≤ a, to ok if a < E ≤ b, and to bad if E > b, and the specific values used for each data set are listed below (a literal rendering of this rule is sketched after the list):
• Geophysical Prospecting: E = E^Tr = E_C^Tr, a = 0.05 and b = 0.15;
• Hydrochemical Research: E = E^Tr = E_C^Tr, a = 0.42 and b = 0.50.
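A literal rendering of the class-assignment rule with the thresholds listed above; the treatment of the boundary values is an assumption.

```python
def solution_class(error, a, b):
    """Assign a network solution to 'good', 'ok' or 'bad' from its training error."""
    if error <= a:
        return 'good'
    if error <= b:
        return 'ok'
    return 'bad'

# Geophysical Prospecting thresholds: a = 0.05, b = 0.15
print([solution_class(e, 0.05, 0.15) for e in (0.03, 0.10, 0.40)])   # ['good', 'ok', 'bad']
```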
5.2. Case 1: Solutions distribution, Geophysical Prospecting `
The Group 4 (E_C, f2ℵ>1^ℓout(x), NN:5-1-2) set of experiments has 11,520 NN-GP solutions, which are represented within Fig. 5 in two distinct ways. In both representations, each solution has many associated parameters (Table 2), of which 10 are investigated within this paper. As such, both representations are three-dimensional interpretations of the higher-dimensional (10-dimensional) spaces, with sphere location and size used to indicate granule location and size, respectively. Each granule is formed based on similarity of algorithm parameters. Hence, each granule may have objects associated to any of the 3 classes of network solutions (good, ok, bad).
The left image in Fig. 5 shows an interpretation in which each granule is partitioned based on class, and then each partition is visualized as a sphere, coloured by class and sized by the number of objects within the partition. Therefore, up to three spheres may be rendered for each granule. The right image in Fig. 5 shows an interpretation in which each granule's location is indicated by a tiny black solid sphere. In addition, since each granule may be partitioned into up to 3 pieces, these pieces may be ordered. The ordering relation used within this paper is based on the number of objects (i.e. size) within the partition.
Fig. 5. Two representations of the distribution of 11,520 NN-GP solutions for the Geophysical Prospecting problem.
Fig. 6. Two representations of the distribution of 11,520 NN-GP solutions for the Hydrochemical Research problem.
For the right image, only the second and third most important partitions are displayed and coloured by their associated class (good, ok, bad). In particular, the largest spheres in the left image are all ''ok'' neural network solutions, and have been unwrapped (removed) in the right image in order to more clearly show the relation between ''good'' solutions (class 1, small dark spheres) and ''bad'' solutions (class 3). Generally, ''ok'' solutions are distributed throughout the space in Fig. 5, and the ''good'' solution locations (and hence their associated parameters) can be seen in relation to the ''bad'' solutions. This representation allows a preliminary investigation, such as the one presented in this paper, to focus on the more interesting regions within the parameter space.

5.3. Case 2: Solutions distribution, hydrochemistry

The group of 11,520 experiments represented in two distinct renderings within Fig. 6 are those for the investigation of E_C and the neuron output mapping function f1ℵ>1^ℓout. The left representation shows ''ok'' solutions distributed throughout the parameter space. The right representation removes all but the third most important partition within each granule, hence unwrapping the granules and revealing the locations of the ''good'' solutions (dark spheres). It can be observed that these dark spheres are located throughout the parameter space, indicating that radically different NN-GP parameter values lead to ''good'' solutions.

6. Conclusions

An approach for learning the functions within neural networks using Evolutionary Computation has been presented for a series of broad scale experiments. The classical neural network weights and activation and aggregation functions were not used. The results seem promising for the investigated problems, underground cave detection (Geophysical Prospecting) and water/ice and mineral interaction (Hydrochemical Research), as determined by the obtained solutions and by their distribution within the constructed parameter spaces. The presented research is preliminary, and further experimentation is required in order to more thoroughly understand the parameters influencing the algorithm's performance.
References
Anderson, C. W. (2008). C code for error backpropagation. http://www.cs.colostate.edu/~anderson/code. Accessed December 2008.
Barton, A. J., & Valdés, J. J. (2008). Computational intelligence techniques applied to magnetic resonance spectroscopy data of human brain cancers. In Lecture notes in artificial intelligence (LNAI) of the Lecture notes in computer science (LNCS): Vol. 5306. Sixth international conference on rough sets and current trends in computing (pp. 485–494). Springer-Verlag.
Barton, A. J. (2009). Learning the neuron functions within neural networks based on genetic programming. MCS thesis. Ottawa-Carleton Institute for Computer Science, School of Computer Science, Ottawa, Canada. Jan 2009.
Barton, A. J., Valdés, J. J., & Orchard, R. (2009). Learning the neuron functions within a neural network via genetic programming: Applications to geophysics and hydrogeology. In IEEE international joint conference on neural networks. pp unknown.
Bishop, C. M. (2004). Neural networks for pattern recognition. Oxford University Press.
Borg, I., & Lingoes, J. (1987). Multidimensional similarity structure analysis. New York, NY: Springer-Verlag.
Chandon, J. L., & Pinson, S. (1981). Analyse typologique. Théorie et applications. Paris: Masson.
Fagundo, J., Valdés, J. J., & Pulina, M. (1990). Hydrochemical investigations in extreme climatic areas, Cuba and Spitzbergen. In Water resources management and protection in tropical climates (pp. 45–54).
Ferreira, C. (2001). Gene expression programming: A new adaptive algorithm for solving problems. Complex Systems, 13(2), 87–129.
Ferreira, C. (2006). Gene expression programming: Mathematical modelling by an artificial intelligence. Germany: Springer Verlag.
Koza, J. R. (1989). Hierarchical genetic algorithms operating on populations of computer programs. In 11th international joint conference on artificial intelligence.
Koza, J. R., & Rice, J. P. (1991). Genetic generation of both the weights and architecture for a neural network. In International joint conference on neural networks, 2 (pp. 397–404).
Koza, J. R. (1992). Genetic programming: On the programming of computers by means of natural selection. MIT Press.
Koza, J. R. (1994). Genetic programming II: Automatic discovery of reusable programs. MIT Press.
Koza, J. R., Bennett, F. B., III, Andre, D., & Keane, M. A. (1999). Genetic programming III: Darwinian invention and problem solving. Morgan Kaufmann.
Koehn, P. (1996). Genetic encoding strategies for neural networks. In Proceedings of information processing and management of uncertainty in knowledge-based systems.
Montana, D. J., & Davis, L. (1989). Training feedforward neural networks using genetic algorithms. In Proceedings of the international joint conference on artificial intelligence. http://vishnu.bbn.com/people/dmontana.html. Accessed December 2008.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1992). Numerical recipes in C. Cambridge University Press.
Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge University Press.
Sammon, J. W. (1969). A non-linear mapping for data structure analysis. IEEE Transactions on Computers, C-18, 401–408.
Stanley, K. O. (2004). Efficient evolution of neural networks through complexification. Ph.D. dissertation. University of Texas at Austin, Austin, Texas, USA. http://www.cs.ucf.edu/~kstanley. Accessed December 2008.
Valdés, J. J., & Gil, J. L. (1982). Application of geophysical and geomathematical methods in the study of the Insunza karstic area (La Salud, La Habana). In Proceedings of the first international colloquium of physical-chemistry and karst hydrogeology in the Caribbean region (pp. 376–384).
Valdés, J. (2003). Virtual reality representation of information systems and decision rules: An exploratory technique for understanding data knowledge structure. In LNAI: Vol. 2639. Lecture notes in artificial intelligence (pp. 615–618). Springer-Verlag.
Valdés, J. J., Orchard, R., & Barton, A. J. (2007). Exploring medical data using visual spaces with genetic programming and implicit functional mappings. In Genetic and evolutionary computation conference (GECCO-2007) workshop on medical applications of genetic and evolutionary computation (MedGEC).
Valdés, J. J., & Barton, A. J. (2007). Multi-objective evolutionary optimization of neural networks for virtual reality visual data mining: Application to hydrochemistry. In IEEE international joint conference on neural networks.
Yao, X. (1993). A review of evolutionary artificial neural networks. International Journal of Intelligent Systems, 8(4), 539–567.
Zhang, B., & Mühlenbein, H. (1993). Evolving optimal neural networks using genetic algorithms with Occam's razor. Complex Systems, 7, 199–220.