Copyright © IFAC Automation in Mining, Mineral and Metal Processing, Nancy, France, 2004
FEATURE SELECTION USING SECOND ORDER DERIVATIVE PRUNING METHODS: A CASE STUDY IN GIS ANDES

Mohammed Attik •,•• Daniel Cassard •• Vincent Bouchot •• Andor Lips ••
• LORIA, Campus Scientifique - BP 239 - 54506 Vandœuvre-lès-Nancy Cedex - France
•• BRGM, 3 av. Claude Guillemin - BP 6009 - 45060 Orléans Cedex 2 - France
Abstract: In this paper, we show how neural networks can be applied to a Geographic Information System (GIS) for feature selection. More precisely, we want to select the most relevant variables in a classification task between gold deposits and deposits without gold by applying neural pruning methods to a GIS Andes dataset. Two families of pruning methods based on an analysis of the weight saliencies are presented, and their strengths and weaknesses are discussed. In the long term, this work could help to better understand the formation of gold deposits. Copyright © 2004 IFAC

Keywords: Neural Networks, Pruning, Feature Selection, Weight Saliency, GIS
1. INTRODUCTION
Artificial neural networks (ANNs) are good candidates for modelling a phenomenon, and they have already been applied with success to geological problems (Brown et al. 2000, Bougrain et al. 2003). However, they are considered a black-box technology because it is difficult to give an explicit explanation of the reasoning learned during training. Many researchers have therefore addressed the issue of improving the understanding of ANNs, the most attractive solutions being to select variables and to extract rules.

In this paper, we are interested in exploring various neural models to select the most relevant variables and the best topology optimization. The objectives of feature selection are numerous: improving the prediction performance, providing a faster predictor, and providing a better understanding of the underlying process that generates the data (variable selection makes rule extraction easier), while reducing the time and cost of collecting and transforming data.

The motivation of this work is that in the geology domain, as in many other fields, collecting very large amounts of data can be difficult at best and impossible at worst. It is important to determine the variables which really contribute to the formation of substance deposits and to determine the optimal topology so as to facilitate rule extraction (optimizing the architecture consists in removing not only variables but also weights in the network). In the next section, we review previous work on neural network feature selection methods: we introduce two families of methods and discuss their strengths and weaknesses. In Section 3, we describe the dataset extracted from a GIS of the Andean Cordillera. In Section 4, we present in detail the neural networks and the parameters used to train them and select the features. In Section 5, we present our analysis of the results after applying the pruning methods to the dataset. In Section 6, we conclude with some suggestions for future work.
2. PRUNING METHODS

In recent years, several heuristic methods based on computing a saliency for feature selection have been proposed. In this study we focus on two methods that perform a backward selection: Optimal Brain Damage (OBD) and Optimal Brain Surgeon (OBS), in which a connection is removed according to a relevance criterion often named the weight saliency (also termed sensitivity). More precisely, the weight with the smallest saliency generates the smallest error variation when it is removed. These methods have inspired methods specialized in variable selection, such as Optimal Cell Damage (OCD) and Unit-Optimal Brain Surgeon (Unit-OBS), which define the saliency of a variable as the combination of the saliencies of its outgoing weights. For all these methods, the saliency is computed using the error variation on the training set as performance measure.

These techniques consider a network trained to a local minimum in error. The functional Taylor series of the error with respect to the weights is:

\delta E = \sum_i g_i \,\delta w_i + \frac{1}{2} \sum_i h_{ii} \,\delta w_i^2 + \frac{1}{2} \sum_{i \neq j} h_{ij} \,\delta w_i \,\delta w_j + O(\|\delta w\|^3)    (1)

where g_i = \partial E / \partial w_i and the h_{ij} = \partial^2 E / \partial w_i \partial w_j are the elements of the Hessian matrix H of E. A well-trained network implies that the first term in (Eq. 1) is zero because E is at a minimum; when the perturbations are small, the last term is also negligible.
2.1 Optimal Brain Damage (OBD)

Optimal Brain Damage (OBD) was introduced by Le Cun (Le Cun et al. 1990). He measures the saliency of a weight by approximating (Eq. 1) with a diagonal Hessian:

\delta E = \frac{1}{2} \sum_i \frac{\partial^2 E}{\partial w_i^2} (\delta w_i)^2 = \frac{1}{2} \sum_i h_{ii} (\delta w_i)^2    (2)

where the h_{ii} are the diagonal elements of the Hessian matrix H of E. Setting \delta w_i = -w_i, i.e. removing weight w_i, the weight saliencies are given by:

s_i = \frac{1}{2} h_{ii} w_i^2    (3)

The pseudo-algorithm is presented below (Fig. 1).

Fig. 1. OBD algorithm:
(1) Choose a reasonable network architecture.
(2) Train the network until a reasonable solution is obtained.
(3) Compute the second derivatives h_{kk} for each parameter.
(4) Compute the saliencies for each parameter: s_k = h_{kk} w_k^2 / 2.
(5) Sort the parameters by saliency and delete some low-saliency parameters.
(6) Iterate to step 2.
2.2 Optimal Cell Damage (OCD)

The Optimal Cell Damage (OCD) method was proposed by (Cibas et al. 1994, Cibas et al. 1996) (Fig. 2). This method, derived from OBD, eliminates the variables with the smallest saliencies, where the saliency S_j of variable j is computed as the sum of the saliencies of its outgoing weights:

S_j = \sum_{i=1}^{m(j)} s_{ji}    (4)

where m(j) is the number of outgoing connections of variable j and s_{ji} is the OBD saliency of its i-th outgoing weight.

Fig. 2. OCD algorithm:
(1) Train the network until a local minimum fixed by a threshold \theta is obtained.
(2) Compute the saliency of every variable.
(3) Eliminate the variables whose saliency is below a given threshold.
(4) Iterate to step 1.
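A minimal sketch of Eq. (4), again ours rather than the authors': given the input-to-hidden weight matrix and the per-weight curvatures, the variable saliency is a row sum of OBD saliencies. The placeholder curvature values are an assumption for illustration only.

```python
import numpy as np

def ocd_variable_saliencies(w_in, h_diag_in):
    """Eq. (4): S_j is the sum of the OBD saliencies h_ii * w_i^2 / 2 of the
    outgoing weights of input variable j (row j of the input-to-hidden matrix)."""
    return (0.5 * h_diag_in * w_in ** 2).sum(axis=1)

rng = np.random.default_rng(0)
w_in = rng.normal(size=(35, 5))         # 35 input variables -> 5 hidden units
h_diag_in = np.ones_like(w_in)          # placeholder curvatures (illustration)
S = ocd_variable_saliencies(w_in, h_diag_in)
pruned_vars = np.flatnonzero(S < np.quantile(S, 0.2))  # variables below a threshold
```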
2.3 Optimal Brain Surgeon (OBS)

Optimal Brain Surgeon (OBS) (Hassibi and Stork 1993) is a further development of OBD (Fig. 3). Hassibi measures the saliency of a weight by approximating (Eq. 1) with the full Hessian:

\delta E = \frac{1}{2} \sum_i \sum_j h_{ij} \,\delta w_i \,\delta w_j = \frac{1}{2} \,\delta w^T H \,\delta w    (5)

OBS computes the full Hessian matrix iteratively, which leads to a more exact approximation of the error function; it does not use the diagonal approximation, and it computes the new weight values without explicit retraining. The drawback is that the inverse of the Hessian matrix must be computed to deduce the saliency and the weight change for every link.

Fig. 3. OBS algorithm:
(1) Train a "reasonably large" network to minimum error.
(2) Compute H^{-1}.
(3) Find the q that gives the smallest saliency L_q = w_q^2 / (2 [H^{-1}]_{qq}). If this candidate error increase is much smaller than E, the q-th weight should be deleted; proceed to step 4, otherwise go to step 5. (Other stopping criteria can be used too.)
(4) Use the q from step 3 to update all weights with \delta w = -\frac{w_q}{[H^{-1}]_{qq}} H^{-1} e_q. Go to step 2.
(5) No more weights can be deleted without a large increase in E. (At this point it may be desirable to retrain the network.)
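The selection and update of steps (3)-(4) of Fig. 3 can be sketched as follows; this is our code with a toy Hessian, and obs_step is our name:

```python
import numpy as np

def obs_step(w, H_inv, eps=1e-12):
    # Step 3: saliency L_q = w_q^2 / (2 [H^-1]_qq) for every weight q.
    d = np.diag(H_inv)
    saliency = w ** 2 / (2.0 * (d + eps))
    q = int(np.argmin(saliency))
    # Step 4: dw = -(w_q / [H^-1]_qq) H^-1 e_q zeroes weight q and
    # adjusts all remaining weights without retraining.
    dw = -(w[q] / (d[q] + eps)) * H_inv[:, q]
    return q, saliency[q], w + dw

H = np.array([[2.0, 0.3], [0.3, 1.0]])   # toy positive-definite Hessian
w = np.array([0.8, 0.1])
q, L_q, w_new = obs_step(w, np.linalg.inv(H))
assert abs(w_new[q]) < 1e-9              # the pruned weight is (numerically) zero
```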
2.4 Unit-Optimal Brain Surgeon (Unit-OBS)

Stahlberger and Riedmiller (Stahlberger and Riedmiller 1997) proposed, for OBS users, a calculation called Generalized Optimal Brain Surgeon (G-OBS) that obtains in a single step the update to apply to every weight when a subset of m weights is deleted. As a special case of G-OBS, they propose an algorithm specialized in variable selection, called Unit-OBS, which prunes the variable (the input neuron) whose removal generates the smallest increase in error (see Fig. 4).

Fig. 4. Unit-OBS algorithm:
(1) Train the network to minimum error.
(2) Compute H^{-1}.
(3) For each unit u:
    (a) Compute the indices q_1, q_2, ..., q_{m(u)} of the outgoing connections of unit u, where m(u) is the number of outgoing connections of unit u.
    (b) M = (e_{q_1}, e_{q_2}, ..., e_{q_{m(u)}}).
    (c) \Delta E(u) = \frac{1}{2} w^T M (M^T H^{-1} M)^{-1} M^T w.
(4) Find the u_0 that gives the smallest increase in error \Delta E(u_0).
(5) M = M(u_0) (refer to steps 3(a) and 3(b)).
(6) \Delta w = -H^{-1} M (M^T H^{-1} M)^{-1} M^T w.
(7) Remove unit u_0 and use \Delta w to update all weights.
(8) Repeat steps 2 to 7 until a break criterion is reached.
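Step 3(c) reduces to a small linear solve, since M merely selects the rows and columns of H^{-1} belonging to the unit's outgoing weights. A sketch under that observation (our code, toy data):

```python
import numpy as np

def unit_obs_increase(w, H_inv, idx):
    """dE(u) = 1/2 w^T M (M^T H^-1 M)^-1 M^T w, where M^T H^-1 M is the
    sub-matrix of H^-1 on the unit's outgoing weights and M^T w = w[idx]."""
    w_u = w[idx]
    return 0.5 * w_u @ np.linalg.solve(H_inv[np.ix_(idx, idx)], w_u)

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
H_inv = np.linalg.inv(A @ A.T + np.eye(6))      # toy inverse Hessian
w = rng.normal(size=6)
fan_out = {0: [0, 1], 1: [2, 3], 2: [4, 5]}     # unit -> outgoing weight indices
u0 = min(fan_out, key=lambda u: unit_obs_increase(w, H_inv, fan_out[u]))
```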
2.5 Comparison

The efficiency of OBS on large networks has been compared to that of OBD in the literature, and it turns out that OBD is preferable, since more weights can be removed with less computational effort. OBS is severely affected by the cumulated noise that the Hessian matrix calculation generates on the weights: there is a risk of eliminating significant variables/weights and of ending up with a network far from optimal. On the other hand, OBS and Unit-OBS require only one training for the whole pruning, whereas OBD and OCD require a training before each pruning step, which makes the former techniques faster in selection.

3. GIS ANDES SUBSET

The subset used to test the selection methods was extracted from GIS Andes, created by BRGM (the French Geological Survey) (Cassard 2000). GIS Andes is a homogeneous information system of the entire Andes Cordillera, covering an area of 3.83 million km² and extending for some 8500 km from the Guajira Peninsula (northern Colombia) to Cape Horn (Tierra del Fuego). Conceived as a tool both for the mining sector, as an aid to mineral exploration and development, and for the academic sector, as an aid to developing new metallogenic models, GIS Andes is based on original syntheses and compilations.
The subset contains samples from the whole Andean Cordillera. Each sample is described by 35 variables covering the geography, geology (lithology, faulting, recent volcanism), geometry of the subduction zone, geothermy, geophysics (seismicity, gravimetry) and gold deposits (see Table 1 for more details). The additional information concerning the segmentation of the Andes Cordillera into different zones (northern, central and southern sectors) was inspired by the work of Ramos (Ramos 1999).

The samples have been divided into a training set and a test set: 2268 samples were assigned to the training set and 484 samples to the test set. Each sample is labelled as a gold deposit or a no-gold deposit (deposit without gold).
Since our dataset presents a class imbalance problem, we used an over-sampling method (Japkowicz 2000) to resolve it. This method consists of re-sampling the smallest class at random until it contains as many examples as the other class. The choice of this method is based on the conservation of information rather than its loss.
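As an illustration, random over-sampling can be sketched as follows; this is not the authors' code, and the helper name oversample is ours:

```python
import numpy as np

def oversample(X, y, seed=0):
    """Random over-sampling (Japkowicz 2000): re-sample the minority class
    with replacement until both classes have the same number of examples."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    extra = rng.choice(np.flatnonzero(y == minority),
                       size=counts.max() - counts.min(), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

# Toy usage: 6 gold samples vs. 2 no-gold samples become 6 vs. 6.
X = np.arange(16).reshape(8, 2)
y = np.array([1, 1, 1, 1, 1, 1, 0, 0])
X_bal, y_bal = oversample(X, y)
assert (y_bal == 0).sum() == (y_bal == 1).sum()
```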
4. MODEL DESCRIPTION
The ANNs used are multilayer perceptrons (MLPs) with 35 neurons with linear activation functions in the input layer, 5 neurons with hyperbolic tangent activation functions in the hidden layer, and one neuron with a sigmoid activation function in the output layer. This number of hidden neurons provides a representation rich enough to solve this discrimination problem. The total number of weights for these fully connected networks (including biases) is 186; this value is to be compared with the number of weights remaining after pruning.
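A minimal sketch of the architecture just described (ours, not the authors' implementation); the assert checks the 186-parameter count:

```python
import numpy as np

def init_mlp(n_in=35, n_hidden=5, seed=0):
    """Fully connected 35-5-1 MLP with biases:
    35*5 + 5 + 5*1 + 1 = 186 parameters, matching the count above."""
    rng = np.random.default_rng(seed)
    return {"W1": rng.normal(scale=0.1, size=(n_in, n_hidden)),
            "b1": np.zeros(n_hidden),
            "W2": rng.normal(scale=0.1, size=(n_hidden, 1)),
            "b2": np.zeros(1)}

def forward(params, X):
    """Linear inputs, tanh hidden layer, sigmoid output."""
    h = np.tanh(X @ params["W1"] + params["b1"])
    return 1.0 / (1.0 + np.exp(-(h @ params["W2"] + params["b2"])))

params = init_mlp()
assert sum(p.size for p in params.values()) == 186
```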
Table 1. Variables of the dataset

Id  Short name   Definition
1   LONGITUDE    Longitude position
2   LATITUDE     Latitude position
3   BOUGUER      Bouguer anomaly calculation (10 km x 10 km)
4   GRTOPISO     Vertical gradient of the topo-isostatic residual anomaly (10 km x 10 km)
5   PROF_MOHO    Moho depth model assuming local isostasy compensation (20 km x 20 km)
6   DEM          Altitude (1 km x 1 km)
7   S-DEPTH      Distance to the subduction plane
8   TOPO_ISO     Topo-isostatic residual anomaly (10 km x 10 km)
9   BENPENTE     Dip of the underlying (modeled) subduction plane
10  DIST_SEIS    Distance to the closest underlying seism epicenter
11  DIST_SUBS    Distance to the (modeled) subduction plane
12  DIST_VOL     Distance to the closest volcano
13  NBR_SEISM    Number of earthquakes calculated over a 20 km radius
14  DIST_112     Distance to the nearest fault trending 90 to 112 degrees
15  DIST_135     Distance to the nearest fault trending 112 to 135 degrees
16  DIST_157     Distance to the nearest fault trending 135 to 157 degrees
17  DIST_180     Distance to the nearest fault trending 157 to 180 degrees
18  DIST_22      Distance to the nearest fault trending 0 to 22 degrees
19  DIST_45      Distance to the nearest fault trending 22 to 45 degrees
20  DIST_67      Distance to the nearest fault trending 45 to 67 degrees
21  DIST_90      Distance to the nearest fault trending 67 to 90 degrees
22  ZONE=1       Northern Andes
23  ZONE=2       Northern sector of the central Andes
24  ZONE=3       Central sector of the central Andes
25  ZONE=4       Southern sector of the central Andes
26  CENOZOIC     Cenozoic age of rocks
27  PALEOZOIC    Paleozoic age of rocks
28  MESOZOIC     Mesozoic age of rocks
29  PROTEROZOI   Proterozoic age of rocks
30  SEDIMENTAR   Sedimentary rock
31  VOLCANIC     Volcanic rock
32  VOL_SEDIME   Volcano-sedimentary rock
33  PLUTONIC     Plutonic rock
34  METAMORPHIC  Metamorphic rock
35  UNDIFFEREN   Undifferentiated rock
We use 200 cycles for MLP training; the mean square error is less than or equal to 0.18 on both the training set and the test set. If this level is not reached, MLP training starts again with another initialization, and definitively stops after 10 successive failures.
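The restart protocol can be sketched as follows; train_fn and mse_fn are hypothetical helpers standing in for the actual training and evaluation routines:

```python
def train_with_restarts(train_fn, mse_fn, target=0.18, cycles=200, max_failures=10):
    """Protocol described above: train for 200 cycles; if the mean square
    error on the training and test sets is not <= 0.18, restart from a new
    random initialization; stop definitively after 10 successive failures."""
    for attempt in range(max_failures):
        params = train_fn(cycles=cycles, seed=attempt)
        if mse_fn(params) <= target:
            return params
    return None  # 10 successive failures: give up
```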
Concerning pruning, for OBS and Unit-OBS (selection methods without retraining) pruning stops when the mean square error becomes greater than 0.18 on the training set or the test set. For the OBD method, 3 weights are pruned at a time (to speed up the process without significant changes in the results); for the OCD method, only one weight is pruned at a time.
Fig. 5. Distribution of the number of preserved variables

5. RESULTS AND DISCUSSION

Neural networks search for a solution in a space of possible solutions, so two trainings can lead to two different acceptable solutions. Since the pruning methods rest on the result of the neural network training, the pruning solution is not unique. For each method, 114 different initializations were tested (this number of tests is not statistically significant; the goal is to gain enough experience with these methods), and the results are presented as histograms.

Comparing the feature selections obtained by the different methods is a difficult task: there is no unique measure for comparing them, but different aspects can be used to prefer one method over another. In this study we use three measures of performance for variable selection and optimization: the number of preserved weights, the number of pruned variables, and the choice of pruned variables.
The histograms in Fig. 5 and Fig. 6 present the different solutions obtained by OBD and OBS. They show that the number of weights and the number of variables are greatly reduced compared to a fully connected multilayer perceptron (35 variables, 186 weights). As an example, OCD obtains with 7 variables the same performance as with all 35 variables.
. "
,
ID
100
~
n
n
n
120
No..orN:-oI~_vr-
Fig. 6. Distribution of the number of preserved weights
Fig. 7. Distribution of preserved variables
.. " ,
•
n
~ n~
Fig. 8. Distribution of the number of preserved variables
The histogram in Fig. 8 shows the performance of OCD and Unit-OBS for variable selection. These techniques give the best results compared to OBD and OBS with respect to the number of preserved variables (Fig. 5), which suggests that they are specialized in variable selection, whereas OBD and OBS are specialized in topology optimization. According to Fig. 5, Fig. 6 and Fig. 8, OBD and OCD outperform OBS and Unit-OBS, respectively, with respect to the number of preserved weights and the number of preserved variables.
The mean percentage of correct classification obtained by the pruning methods is equal to 75%.
Table 2 summarizes the variable selection of the different methods (see Fig. 7 and Fig. 9 for more details). The variables are presented in pertinence order; the last column gives the mean pertinence order over all methods. Table 2 can be divided into three classes, each containing one third of the total number of variables. Each class can be analyzed method by method; taking the intersection of the classes over all methods yields three classes, in decreasing order of pertinence:
• The first class contains 5 variables: the distances to the nearest fault trending 22 to 45 degrees, 90 to 112 degrees, 45 to 67 degrees and 0 to 22 degrees, and the dip of the underlying subduction plane.
• The second class contains 8 variables: the altitude, the central sector of the central Andes, volcano-sedimentary rock, sedimentary rock, the distance to the closest volcano, and the distances to the nearest fault trending 112 to 135 degrees, 157 to 180 degrees and 67 to 90 degrees.
• The last class contains the 22 remaining variables.

Fig. 9. Distribution of preserved variables

The fact that the results obtained by the pruning techniques differ can be explained by the observation that the selection methods rely heavily on heuristics for the three steps of feature selection:
• a feature evaluation criterion (in our case, the saliency, computed from the diagonal or full Hessian matrix) to compare variable subsets,
• a search procedure to explore a (sub)space of possible variable combinations,
• a stopping criterion or a model selection strategy.
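These three ingredients can be made explicit in a generic backward-selection loop; this is our sketch, not the paper's code, and saliency_fn, stop_fn and retrain_fn are hypothetical helpers:

```python
def backward_selection(variables, saliency_fn, stop_fn, retrain_fn=None):
    """Generic backward selection: evaluate a criterion (saliency_fn),
    search by greedily removing the least pertinent variable, and stop
    according to stop_fn; optionally retrain after each removal."""
    kept = list(variables)
    while len(kept) > 1 and not stop_fn(kept):
        saliencies = {v: saliency_fn(v, kept) for v in kept}
        kept.remove(min(saliencies, key=saliencies.get))
        if retrain_fn is not None:   # OBD/OCD style; OBS-style methods skip this
            retrain_fn(kept)
    return kept
```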
Table 2. Variables sorted by pertinence order for each method

Order  OBD  OBS  Unit-OBS  OCD  Mean
1      14   20   19        19   19
2      19   19   20        14   14
3      9    15   17        1    20
4      1    14   10        8    9
5      20   9    9         12   12
6      12   33   14        9    18
7      8    17   18        20   15
8      18   10   16        3    17
9      24   31   12        2    1
10     3    18   15        18   8
11     4    34   21        11   21
12     6    16   30        4    6
13     2    12   6         15   24
14     15   30   33        25   16
15     23   28   13        22   10
16     21   21   24        23   30
17     25   6    8         21   4
18     11   32   7         24   23
19     17   24   1         6    33
20     32   7    31        32   32
21     29   13   32        16   2
22     22   4    23        17   3
23     30   26   28        30   11
24     10   23   27        5    34
25     16   35   34        34   31
26     5    5    5         33   25
27     13   2    26        29   13
28     34   8    4         27   5
29     27   27   11        26   7
30     26   1    35        31   22
31     35   11   2         7    28
32     33   25   25        35   26
33     31   3    3         10   27
34     7    22   22        13   29
35     28   29   29        28   35
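The Mean column of Table 2 is consistent with averaging, for each variable, its rank over the four methods; a sketch under that assumption (our code, with a toy example):

```python
import numpy as np

def mean_pertinence_order(orderings, n_vars):
    """Average, over the methods, the rank of each variable id, then sort
    the variables by this mean rank (best, i.e. smallest, first)."""
    mean_rank = np.zeros(n_vars)
    for order in orderings:                     # order = variable ids, best first
        for rank, var in enumerate(order, start=1):
            mean_rank[var - 1] += rank / len(orderings)
    return np.argsort(mean_rank) + 1

# Toy usage with three variables and two hypothetical methods.
print(mean_pertinence_order([[2, 1, 3], [2, 3, 1]], n_vars=3))   # -> [2 1 3]
```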
6. CONCLUSIONS

In this article, we presented various second-order derivative pruning methods for feature selection with neural networks. We observed that different initializations of the MLP lead to different selections (variable selection or topology optimization) for the same pruning method, and that different pruning methods yield different results. We explained why pruning techniques based on neural networks give neither a unique solution for a given method nor the same result across methods. Pruning based on saliency analysis therefore requires a post-processing step to support a decision on the obtained results. The post-processing suggested in this study, based on different initializations associated with a pruning technique, makes it possible to prefer one pertinent variable over another, and to examine the distributions of the number of preserved variables and of the number of preserved weights of the model.

Finally, future work can be oriented towards estimating the optimal number of pertinent variables/weights. Knowing this number of parameters would make it possible to select the significant variables/weights associated with rules that can be extracted.

REFERENCES

Bougrain, Laurent, Maria Gonzalez, Vincent Bouchot, Daniel Cassard, Andor L.W. Lips, Frederic Alexandre and Gilbert Stein (2003). Knowledge recovery for continental-scale mineral exploration by neural networks. Natural Resources Research 12(3), 173-181.
Brown, W. M., T. D. Gedeon, D. I. Groves and R. G. Barnes (2000). Artificial neural networks: a new method for mineral prospectivity mapping. Australian Journal of Earth Sciences 47(4), 757.
Cassard, D. (Ed.) (2000). A metallogenic GIS of the Andes Cordillera. Abstracts CD, 31st International Geological Congress, Rio de Janeiro, Brazil.
Cibas, T., F. Fogelman Soulié, P. Gallinari and S. Raudys (1994). Variable selection with optimal cell damage. In: ICANN 94.
Cibas, T., F. Fogelman Soulié, P. Gallinari and S. Raudys (1996). Variable selection with neural networks.
Hassibi, Babak and David G. Stork (1993). Second order derivatives for network pruning: Optimal brain surgeon. In: Advances in Neural Information Processing Systems (Stephen Jose Hanson, Jack D. Cowan and C. Lee Giles, Eds.). Vol. 5. Morgan Kaufmann, San Mateo, CA. pp. 164-171.
Japkowicz, N. (2000). Learning from imbalanced data sets: a comparison of various strategies.
Le Cun, Yann, John S. Denker and Sara A. Solla (1990). Optimal brain damage. In: Advances in Neural Information Processing Systems: Proceedings of the 1989 Conference (David S. Touretzky, Ed.). Morgan Kaufmann, San Mateo, CA. pp. 598-605.
Ramos, V.A. (1999). Plate tectonic setting of the Andean Cordillera. A Special Issue dedicated to the 31st International Geological Congress, Rio de Janeiro, Brazil, 6-17 August 2000.
Stahlberger, Achim and Martin Riedmiller (1997). Fast network pruning and feature extraction by using the unit-OBS algorithm. In: Advances in Neural Information Processing Systems (Michael C. Mozer, Michael I. Jordan and Thomas Petsche, Eds.). Vol. 9. The MIT Press. p. 655.