Parallel genetic algorithm co-optimization of spectral pre-processing and wavelength selection for PLS regression




Chemometrics and Intelligent Laboratory Systems 107 (2011) 50–58



Parallel genetic algorithm co-optimization of spectral pre-processing and wavelength selection for PLS regression Olivier Devos ⁎, Ludovic Duponchel Laboratoire de Spectrochimie Infrarouge et Raman (LASIR CNRS UMR 8516), Université de Lille 1 Sciences et Technologies, Bât. C5, 59655 Villeneuve d'Ascq, France


Article history: Received 28 July 2010; received in revised form 14 January 2011; accepted 17 January 2011; available online 25 January 2011.

Keywords: Genetic algorithm; Variable selection; Spectral pre-processing; PLS; NIRS

Abstract

Spectral pre-processing and variable selection are often used to produce PLS regression models with better prediction abilities. We propose here to optimize the spectral pre-processing and the variable selection simultaneously for PLS regression. The method is based on a parallel genetic algorithm with a unique chromosome coding both for pre-processing and variable selection. A pool of 31 pre-processing functions with various settings is tested, and several pre-processing steps can be combined in the same chromosome. Three near infrared spectroscopic datasets have been used to evaluate the methodology. The efficacy of the co-optimization is evaluated by comparing the prediction ability of the PLS models with that obtained after pre-processing optimization only. The effect of the number of successive pre-processing steps has also been tested. Two different behaviors are observed for the datasets used here. In a first case the GA co-optimization procedure performs well, leading to a marked improvement of the prediction ability, especially when three consecutive pre-processing techniques are applied. In a second case, pre-processing optimization alone is enough to obtain an optimal model. All these models are optimal and more accurate than the classical models (built with "trial and error" methods). © 2011 Elsevier B.V. All rights reserved.

1. Introduction

Near infrared spectroscopy (NIRS) combined with multivariate regression models such as partial least squares (PLS) is widely used for quantitative applications in chemistry. Unfortunately NIR spectra are often subject to significant noise and to chemical and physical interferences, leading to less than optimal calibration. To improve the prediction ability and the robustness of the regression model, adapted data pre-processing and/or selection of the most relevant wavelengths can be applied. Spectral pre-processing [1], for example multiplicative scatter correction (MSC), offset correction, or Savitzky–Golay derivatives, can remove non-relevant sources of variation and non-linear behavior, and may lead to more parsimonious models (fewer latent variables in the PLS model) [2]. The models obtained may have better predictive abilities and are expected to be more robust. However, the choice of the pre-processing methods is generally based on a trial and error approach, on the practitioner's expertise and on basic rules (MSC and SNV correct scatter effects, derivatives remove additive and multiplicative effects…). The resulting PLS model therefore generally performs better than the one without pre-processing, but is not necessarily optimal.

⁎ Corresponding author. Tel.: +33 320 434 748; fax: +33 320 436 755. E-mail address: [email protected] (O. Devos). doi:10.1016/j.chemolab.2011.01.008

Considering the whole spectral range, we usually test a priori some combinations of pre-processing, which effectively correspond to a small percentage of the total number of possible ones. On the other hand, variable selection identifies the most useful wavelengths and removes the non-informative or noisy wavelength zones, for a precise and accurate regression model [3]. Numerous approaches exist for variable selection: knowledge-based selection, forward/backward procedures, interval selection methods… More details about these approaches can be found in the review article by Xiaobo et al. [4]. Most of these techniques are based on stepwise selection schemes or on elaborate search-based selection methods such as genetic algorithms (GA) [5,6]. Most of the time the pre-processing is not optimized, and generally only a simple pre-processing (mean center, autoscale) is applied before the selection. Moreover, even the worst pre-processing combination obtained on the full spectral range can yield the best model once wavelength selection is optimized. In other words, such a global optimization cannot be sequential. This paper proposes a methodology to optimize the pre-processing and the variable selection simultaneously for PLS regression. The strategy adopted here is based on a parallel genetic algorithm with chromosomes coding both for pre-processing and variable selection. In order to test the performance of the GA method, three near infrared spectroscopic datasets with different levels of noise, numbers of variables and numbers of samples have been used.

O. Devos, L. Duponchel / Chemometrics and Intelligent Laboratory Systems 107 (2011) 50–58

2. Theory

2.1. Overview of genetic algorithms

Genetic algorithms (GA) are efficient methods for solving complex optimization problems and have been widely used in chemistry [7–11]. The principle of genetic algorithms has been extensively described in the literature [12–14]; therefore only a brief description of the GA and of the problem-dependent steps (encoding, evaluation, and parallelization) is given here. In genetic algorithms, a candidate solution is represented by a chromosome consisting of a fixed-length string of genes (coding variables). The simple GA comprises the following steps (Fig. 1): an initial stage consisting of encoding, initialization and evaluation of the first generation, and an evolutionary stage in which a new population is generated through selection, recombination (crossover), mutation, evaluation and reinsertion. During the encoding step, the optimization problem is transformed into a representation that can be handled by the GA; classically, problems are encoded using a bit representation. A number of candidate solutions are then randomly generated; this step is called initialization. Usually the population size (total number of individuals or chromosomes) lies between 50 and 500 and remains constant along the GA procedure. The next step is the evaluation of the performance of each individual in the population by means of a fitness or objective function. The best individuals are selected according to their fitness by a selection algorithm such as roulette wheel or stochastic universal sampling [15]. Two new individuals (children or offspring) are then generated from two selected individuals (parents) by combining the chromosomes of the parents; this step is called recombination, or crossover for binary values. To escape from local minima of the solution space, some variable values of the new individuals can be randomly altered with a defined small mutation probability.
This small perturbation is also a good way to probe other parts of the solution space while keeping, to a certain extent, the adaptation of the considered individual. In the case of binary coding, mutation consists in flipping bits of an individual's variable values at random with a small probability. After mutation, the fitness of the children is determined and these new individuals are reinserted to obtain a new population of constant size. Depending on the number of children compared to the number of individuals in the population, different strategies are used, such as elitist reinsertion and fitness-based reinsertion. The cycle of selection, recombination, mutation, evaluation and reinsertion is repeated for a prescribed number of generations, or ends when an optimization criterion is met. Most of the GA steps and operators are generic, except for two main components that are problem dependent: the problem encoding and the evaluation function.

2.2. Encoding method for co-optimization of pre-processing and variable selection

For the co-optimization of pre-processing and wavelength selection, both are encoded in the same chromosome using a binary representation (Fig. 2). The first part of the chromosome corresponds to the pre-processing coding. Each pre-processing method is coded on several bits, depending on the number of pre-processing functions tested; for example, a five-bit string can code up to 2^5 (32) pre-processing methods. Furthermore, p pre-processing methods can be combined (several pre-processing bit strings are concatenated); during the fitness evaluation they are applied to the full spectra in the same order as they appear in the chromosome (from left to right). During the initialization step each bit value is assigned randomly, giving each pre-processing method the same probability of being selected. The second part of the chromosome represents the variable selection. Each bit represents a wavelength or a block of wavelengths; zeros indicate that the solution does not include these variables, while ones indicate the selected wavelengths. During the initialization step the value of each bit is randomly drawn with a fixed probability p0: on average, only a given percentage of the bits are initially set to 1. During the evaluation of each individual, the GA-selected pre-processing steps are first applied to the full spectra; in a second step the non-selected parts of the spectra are removed and PLS regression is applied to the selected wavelength zones to estimate the fitness of the individual.

2.3.
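The chromosome layout described above can be sketched as a small decoder. This is an illustrative sketch, not the authors' MATLAB code; the sizes (3 pre-processing genes of 5 bits each, 100 wavelength blocks) are assumptions matching the corn configuration used later in the paper.

```python
import numpy as np

N_PREPROC_BITS = 5      # 2**5 = 32 pre-processing methods
MAX_STEPS = 3           # p consecutive pre-processing genes
N_BLOCKS = 100          # one bit per wavelength block (assumption)

def decode(chromosome):
    """Split a binary chromosome into pre-processing IDs and a block mask."""
    head = chromosome[:MAX_STEPS * N_PREPROC_BITS]
    # Each 5-bit group codes one method ID (0 = no pre-processing),
    # applied to the full spectra from left to right.
    steps = [
        int("".join(map(str, head[i:i + N_PREPROC_BITS])), 2)
        for i in range(0, len(head), N_PREPROC_BITS)
    ]
    mask = np.array(chromosome[MAX_STEPS * N_PREPROC_BITS:], dtype=bool)
    return steps, mask

# Example: genes 11010 (#26), 00100 (#4), 00000 (#0), then the block mask.
chrom = [1, 1, 0, 1, 0,  0, 0, 1, 0, 0,  0, 0, 0, 0, 0] + [0] * N_BLOCKS
steps, mask = decode(chrom)
```

During fitness evaluation, `steps` would drive the pre-processing chain and `mask` would select the wavelength blocks kept for the PLS model.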
PLS model and fitness evaluation

The aim of the GA optimization procedure is to minimize a predefined fitness function. The fitness evaluates the quality of candidate solutions; for a PLS model it corresponds to the prediction ability of the regression model. Classically, for GA optimization of PLS models, the Root Mean Square Error of Cross Validation (RMSECV) is used, defined as follows [16]:

RMSECV = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n}}

where n is the number of predicted values, y_i the measured value and \hat{y}_i the value estimated by the PLS model. To evaluate the RMSECV, several iterations of random K-fold cross validation are performed. This method saves computation time and decreases the risk of overfitting compared to leave-one-out cross validation [16,17]. Finally, an independent validation set is used after GA optimization in order to evaluate the best-adapted individuals of the last generation through the root mean squared error of prediction (RMSEP).

2.4. Genetic algorithm parallelization
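The fitness just described can be sketched as an RMSECV pooled over several iterations of random K-fold cross validation. For self-containment, an ordinary least-squares fit stands in for the PLS model below; the error definition is the one given above, and all names and sizes are illustrative assumptions.

```python
import numpy as np

def rmsecv(X, y, k=10, n_iter=10, seed=0):
    """Pooled RMSECV over n_iter iterations of random K-fold CV (sketch)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    sq_err = []
    for _ in range(n_iter):
        idx = rng.permutation(n)             # random fold assignment
        for test in np.array_split(idx, k):
            train = np.setdiff1d(idx, test)
            # OLS stands in for PLS here to keep the sketch dependency-free.
            coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
            resid = X[test] @ coef - y[test]     # (y_hat_i - y_i)
            sq_err.extend(resid ** 2)
    return float(np.sqrt(np.mean(sq_err)))       # sqrt(sum(.)^2 / n)

# Toy demo: 60 samples, 5 variables, nearly linear response.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + 0.01 * rng.normal(size=60)
err = rmsecv(X, y)
```

In the actual procedure a cross-validated PLS model with the GA-chosen pre-processing and wavelengths would replace the least-squares fit.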

Fig. 1. Flow diagram of the genetic algorithm procedure.

To cover an important part of the solution space, a large population and a large number of generations are required before convergence is reached, leading to substantial computation time. To speed up the calculations, a parallel GA (PGA) is used instead of a single GA (SGA). The strategy implemented here is based on a single population and a synchronous master–slave parallelization [18] (also called distributed fitness evaluation). This methodology has been successfully applied to GA variable selection [19]. In this strategy, the master executes the GA operations (selection, recombination and mutation) and stores the population, while the slaves each evaluate a fraction of the individuals. Here the slaves decode the chromosome, apply the selected pre-processing, remove the unselected wavelengths from the spectra and evaluate the fitness of each individual. This strategy has been chosen because pre-processing the raw data and building and evaluating the PLS models require far more computation time than the master–slave communication and the GA operations (selection, recombination, crossover…). Furthermore, since no communication is necessary between slaves, the algorithm is straightforward to implement.

Fig. 2. Binary encoding for pre-processing and variable selection co-optimization.

3. Experimental

3.1. NIR datasets

Three near infrared spectroscopy datasets with different variable/object ratios, spectral interferences and non-linear behavior have been used. Their characteristics are summarized in Table 1. One of the major concerns when using GAs is the problem of random correlations [20] and therefore of overfitting, in which noise instead of information is modeled. This risk increases when the ratio of the number of variables to the number of objects increases; according to Leardi [6], a variable/object ratio of 5 is the critical point. Furthermore, the performance of GA-based feature selection has been found to diminish when the number of variables (wavelengths) increases [21], due in part to the exponential growth of the search domain. Therefore, for the datasets containing more than 100 data points, the wavelengths are included or excluded in "blocks"; the number of wavelengths per block indicates how many adjacent wavelengths are grouped together (see Table 1). To test the risk of overfitting during the GA selection, the randomization test proposed by Leardi [6] was performed on the datasets. The results of the test over 50 runs are below 4 (respectively 3.30, 0.542 and 0.22 for the three datasets), showing that GA can be applied successfully to these datasets.

Table 1
Datasets description.

Dataset      Analyte          Number of    Wavelengths  Number of  Samples in       Samples in
                              wavelengths  per block    blocks     calibration set  validation set
Corn         Moisture value   700          7            100        60               20
Tecator      Fat content      100          1            100        172              43
Sugar beet   Sucrose content  1050         10           105        1062             1062

3.1.1. Corn data

NIR spectra from 1100 to 2498 nm at 2 nm intervals (700 wavelengths) were recorded on 3 different NIR spectrometers for 80 corn samples with reference moisture values (from 9.37 to 11%). These data came originally from Mike Blackburn at Cargill and are freely available online from the Eigenvector website (http://software.eigenvector.com/Data/Corn). This dataset has been used for standardization benchmarking [22] and to test wavelength selection using Tikhonov regularization [23]. For this study only the spectra from instrument m5 have been used. Prior to analysis, the dataset was randomly split in two: 60 samples for calibration and 20 for validation as an independent test set.

3.1.2. Tecator data

NIR spectra from 850 to 1050 nm at 2 nm intervals (100 wavelengths) were recorded for 215 samples of ground pork meat with corresponding fat content (0.9 to 49%). The NIR spectra were recorded on a Tecator Infratec spectrometer. This dataset is freely available online on the StatLib website (http://lib.stat.cmu.edu/datasets/). The relationship between the fat content and the autoscaled spectra presents an obvious non-linearity. This dataset has been extensively used as a benchmark for non-linear regression models [24–26] and variable selection methods [27]. The samples had previously been divided into five data sets. As proposed by Thodberg [25], datasets C and M are used together as the calibration set (172 samples) and T is used for validation (43 samples).
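The synchronous master–slave scheme of Section 2.4 can be sketched with Python's multiprocessing module. This is purely an illustration, not the authors' Distributed Computing Toolbox implementation; the `fitness` function below is a trivial stand-in for the real decode / pre-process / cross-validated-PLS pipeline.

```python
from multiprocessing import Pool

def fitness(chromosome):
    # Stand-in cost: in the paper this would be the RMSECV of a PLS model
    # built on the decoded, pre-processed, wavelength-selected spectra.
    return sum(chromosome) / len(chromosome)

if __name__ == "__main__":
    # The master stores the population and runs the GA operators; only the
    # costly fitness evaluations are farmed out to the slave processes.
    population = [[0, 1, 0, 1], [1, 1, 1, 1], [0, 0, 0, 1]]
    with Pool(processes=2) as pool:              # the "slaves"
        scores = pool.map(fitness, population)   # synchronous: master waits
    # The master then ranks individuals and applies selection, crossover...
    best = population[scores.index(min(scores))]
```

Because each evaluation is independent, no slave-to-slave communication is needed, which is what makes this parallelization straightforward.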

3.1.3. Sugar beet data

This dataset consists of 2124 NIR spectra of sugar beet brei samples representative of the French production between 1999 and 2002. The aim is to predict the sucrose content [28,29]. The reference values were determined by the ICUMSA official method (14.7 to 21 g of sucrose per 100 g of beet). The spectra were recorded from 400 to 2498 nm every 2 nm (1050 wavelengths). The dataset has been randomly split into two sets: 1062 samples for calibration and 1062 for validation.

3.2. GA configuration

The GA parameter settings presented in Table 2 were obtained after fine-tuning using a one-at-a-time approach. The chosen parameter values are the ones leading to an optimal fitness within a low number of generations, without premature convergence. Since computation time is not a limitation, thanks to the use of a computing cluster with parallel GAs, a large population of 256 individuals has been chosen. The pre-processing methods are coded on 5 bits, representing 32 different methods with various settings covering the most common techniques for near-infrared spectra [1] (Table 3). During the initialization step each pre-processing method has the same probability of being selected. The probability of selecting no pre-treatment (coded 0) is therefore very low, and generally most individuals use as many pre-processing steps as allowed during the optimization. To overcome this problem, and to study the effect of the number of combined pre-processing steps on the PLS model results, a user-specified parameter is introduced that controls the maximum number of pre-processing steps the GA can select. For the variable selection, only 10% of the variables are selected (bit set to 1) during initialization. The fitness evaluation is based on the RMSECV calculated by 10 iterations of 10-fold cross validation. The optimal number of latent variables (LVs) retained for the PLS model is automatically selected according to the decrease in RMSECV produced by each LV added to the model (maximum of 20 LVs); LVs improving the RMSECV by less than 1% are discarded. At each generation, 50% of the parents (128 individuals) are selected by the stochastic universal sampling method and the children are generated using single-point crossover; a mutation probability of 0.5% is used. The new population is then composed of the 128 children and the 128 individuals of the previous generation with the best fitness (elitist reinsertion), so that the new generation is better adapted to its environment than the previous one. Finally, the optimization stops when 50% of the chromosomes of the last generation share exactly the same bit string. In all preliminary experiments, fewer than 150 generations were required to obtain convergence. Five independent GA runs with different initialization starting points are realized; the 128 best individuals of each run are kept and the duplicates are removed.

Table 2
Genetic algorithm parameter settings.

Parameter                                   Value
Population size                             256
Pre-processing coding                       Coded on 5 bits (32 pre-processings)
Pre-processing initialization               Random
Consecutive pre-processing                  User-defined (1 to 4)
Initialization probability for variables    0.1
PLS objective function                      RMSECV (10 iterations of 10-fold CV)
Maximum number of latent variables (PLS)    20
Crossover rate                              0.5 (single point)
Selection method                            Stochastic universal sampling
Mutation probability                        0.005
Reinsertion                                 Elitism replacement
Maximum number of generations               150
Chromosome convergence                      50%
Number of independent runs                  5

Table 3
The 32 pre-processing functions for GA-PLS co-optimization.

Category                Method                                               K   F    M   ID code  Gene coding
–                       No pre-processing                                    –   –    –   #0       00000
Filtering               Savitzky–Golay smoothing filter of polynomial        2   15   –   #1       00001
                        order K and F points in the filter                   3   15   –   #2       00010
                                                                             2   7    –   #3       00011
                                                                             3   7    –   #4       00100
                                                                             2   21   –   #5       00101
                                                                             3   21   –   #6       00110
Filtering               Detrend with polynomial order K                      1   –    –   #7       00111
                                                                             2   –    –   #8       01000
                                                                             3   –    –   #9       01001
Normalization           Multiplicative scatter correction (MSC)              –   –    –   #10      01010
                        Normalize, sum equal to unity                        –   –    –   #11      01011
                        Standard normal variate (SNV)                        –   –    –   #12      01100
Derivatization          Savitzky–Golay derivative filter of polynomial       2   7    1   #13      01101
                        order K, F points in the filter and derivative       2   15   1   #14      01110
                        degree M                                             2   21   1   #15      01111
                                                                             3   7    1   #16      10000
                                                                             3   15   1   #17      10001
                                                                             3   21   1   #18      10010
                                                                             2   7    2   #19      10011
                                                                             2   15   2   #20      10100
                                                                             2   21   2   #21      10101
                                                                             3   7    2   #22      10110
                                                                             3   15   2   #23      10111
                                                                             3   21   2   #24      11000
Scaling and centering   Autoscale                                            –   –    –   #25      11001
                        Mean center                                          –   –    –   #26      11010
                        Median center                                        –   –    –   #27      11011
                        Square root mean scale                               –   –    –   #28      11100
Filtering               Weighted least squares baseline of polynomial        1   –    –   #29      11101
                        order K                                              2   –    –   #30      11110
                                                                             3   –    –   #31      11111
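One GA generation with the operator settings of Table 2 can be sketched as follows. This is an illustrative sketch: fitness ranking stands in for stochastic universal sampling, the toy fitness is an assumption, and only the single-point crossover, the 0.005 bit-flip mutation probability and the elitist reinsertion follow the table.

```python
import random

MUTATION_P = 0.005  # bit-flip probability from Table 2

def crossover(a, b):
    """Single-point crossover producing two children."""
    point = random.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(chrom):
    """Flip each bit independently with a small probability."""
    return [bit ^ 1 if random.random() < MUTATION_P else bit for bit in chrom]

def next_generation(population, fitnesses):
    """Half the population breeds; elitist reinsertion keeps the best half."""
    n = len(population)
    ranked = [c for _, c in sorted(zip(fitnesses, population),
                                   key=lambda t: t[0])]
    parents = ranked[: n // 2]        # lower fitness = better (RMSECV)
    children = []
    while len(children) < n // 2:
        a, b = random.sample(parents, 2)
        for child in crossover(a, b):
            children.append(mutate(child))
    return parents + children[: n // 2]   # elitist reinsertion, constant size

random.seed(0)
pop = [[random.randint(0, 1) for _ in range(20)] for _ in range(8)]
fits = [sum(c) for c in pop]   # toy fitness: minimize the number of 1-bits
new_pop = next_generation(pop, fits)
```

The population size stays constant, as in the paper: 128 elite parents plus 128 children at each generation (8 individuals in this toy run).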

3.3. Material and software

All calculations were performed using MATLAB version 2007b (The MathWorks Inc., Natick, MA). The parallel GA scripts developed in-house are based on the Distributed Computing Toolbox (MathWorks) for the parallelization, the Genetic and Evolutionary Algorithm Toolbox (developed by Hartmut Pohlheim, GEATbx, Berlin) for the basic evolutionary operators, and the PLS Toolbox (Eigenvector Research, Manson, WA) for pre-processing and PLS regression. The program runs on a High Performance Computing Cluster (Transtec AG, Tübingen) with one master computer and 4 computing nodes, each with 2 dual-core AMD Opteron processors. With the parallel GA running on the 16 processors, a speed-up of around 12 is generally observed compared to a single processor.

4. Results and discussion

To test the effect of the co-optimization, two GA configurations have been used: one in which only the pre-processing is optimized, in which case the PLS models are built with the full spectra (variable selection bits fixed to 1), and a second in which both pre-processing and variable selection are optimized simultaneously. During optimization no restriction on the pre-processing is applied: all combinations are allowed and the same pre-processing may occur several times. Defining the pre-processing gene coding in this way is easy to implement. A better way would be to penalize the fitness function when a high number of pre-processing steps is used, or when the same pre-processing occurs several times in a given individual. Nevertheless, it would then be necessary to weight such penalty terms against the RMSECV in the fitness function, which is not a trivial task.
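A penalty of the kind just discussed could, for instance, look like the sketch below. The weights `w_steps` and `w_dup` are purely hypothetical assumptions; as noted above, calibrating them against the RMSECV is precisely the non-trivial part, and this form is one possibility among many.

```python
def penalized_fitness(rmsecv, steps, w_steps=0.01, w_dup=0.02):
    """Inflate the RMSECV for each active pre-processing step and for
    duplicated steps (hypothetical weights, not from the paper)."""
    active = [s for s in steps if s != 0]        # #0 codes "no pre-processing"
    n_dup = len(active) - len(set(active))       # repeated methods
    return rmsecv * (1 + w_steps * len(active) + w_dup * n_dup)

# Example: three steps with one duplicate (#25 + #3 + #25)
score = penalized_fitness(0.0034, [25, 3, 25])
```

Such a term would bias the search toward shorter, non-redundant pre-processing chains without forbidding the longer ones outright.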

4.1. Corn data

The PLS results for the best individual obtained after 5 independent GA runs are presented in Table 4. Different values of the maximum number of consecutive pre-processing steps (1 to 4) have been used. For the GA co-optimization, convergence is reached after 5 min and around 45 generations (i.e. 11,520 cross-validated PLS models). Looking at the best individual after GA-optimized pre-processing only (PLS model built on the full spectra), a progressive decrease in the prediction error can be noticed as the maximum number of pre-processing steps increases. Generally only centering methods (mean, median and square root center) are selected, except when 4 pre-processing steps are combined, in which case a first derivative is also selected. Since the combination of several centering methods is not natural, the use of only one pre-processing (mean center) might be preferred by the spectroscopist. Nevertheless, even if such spectral corrections cannot be interpreted, a much more accurate model is obtained with 4 pre-processing steps. For the co-optimization of pre-processing and variable selection, an important decrease in the RMSECV is observed compared to the one obtained after pre-processing optimization only. The RMSEP of the best individual does not show a significant difference when the number of pre-processing steps increases. The selected pre-processing methods are a combination of scaling methods and smoothing filters with 7 points and a second- or third-order polynomial. When both pre-processing and variable selection are optimized, the resulting PLS models are generally based on fewer latent variables than those with optimized pre-processing only.

The PLS model with the highest prediction ability was obtained for 3 consecutive pre-processing steps and 7 selected blocks of 7 wavelengths (7% of the original wavelengths). Fig. 3 represents the 150 best unique individuals after 5 independent GA co-optimization runs with the maximum number of pre-processing steps fixed to 3. The individuals are sorted by ascending RMSECV values (from 0.0032 to 0.0045). A very good homogeneity in the chosen pre-processing methods can be noticed across the 5 independent runs: for each individual at least one smoothing filter with 7 points and one centering method (mean, median, square root) are selected. Furthermore, 40% of the individuals selected fewer than 3 pre-processing steps. It can also be noticed that the wavelength blocks centered on 1910 nm and 2106 nm are always selected. Thanks to the computational power of the cluster, it was also possible to evaluate the co-optimization without using blocks, i.e. using discrete wavelengths directly. The best individual then obtained a higher RMSEP (0.0050) compared to the previous situation using blocks; the variable/object ratio was effectively equal to 11, indicating a problem of random correlations. The same behavior was observed for the Tecator dataset. Moreover, due to the high bandwidth, consecutive wavelengths are highly correlated and much of the spectral information is redundant; the dimension of the real solution space is lower than the initial one using all wavelengths, and the global optimum is thus reachable with the "blocks" methodology. For all these reasons, using blocks is a good choice when applying GAs to near infrared spectroscopic data.

4.2. Tecator data

For pre-processing optimization only, an important decrease in the RMSECV of the best individual (from 2.23 to 0.77, Table 5) is observed when the maximum number of pre-processing steps increases from 1 to 2.
Looking at the residuals, it is clear that the use of normalization (SNV or normalize) or of a second derivative alone does not correct the obvious non-linearity, whereas the combination of these 2 pre-processing steps in the order selected by the GA procedure (second derivative followed by normalization) completely removes the non-linear sources. For more than 2 consecutive pre-processing steps, mean centering and a smoothing filter are added to the previous pre-processing and no significant improvement in the prediction ability is observed. When both pre-processing and variable selection are optimized, lower RMSEP values are observed with an increasing number of consecutive pre-processing steps. We observe a statistically significant difference in RMSEP between co-optimization (0.50) and pre-processing selection only (0.67); the co-optimization with 3 pre-processing steps thus drastically reduces the confidence interval of the quantification method. These results are all the more interesting because an RMSEP value of 0.65 had previously been obtained with non-linear methods such as ANN on autoscaled spectra [25]. For the same pre-processing method

Table 4
Corn data. Results of the PLS model for the best individual obtained after GA optimization (5 independent runs) of a) pre-processing optimization only, b) variable selection with the previously optimized pre-processing and c) co-optimization of pre-processing and variable selection, for different maximum numbers of consecutive pre-processing steps. FS: full spectra (pre-processing optimization only). LVs: number of latent variables. RMSEP: Root Mean Square Error of Prediction on the independent test set.

                                Max. consecutive   Selected                Number of selected   LVs  RMSECV  RMSEP
                                pre-processing     pre-processing          wavelengths (wl)
Pre-processing selection only   1                  #26                     700 (FS)             9    0.0202  0.0208
                                2                  #27 + #28               700 (FS)             13   0.0166  0.0166
                                3                  #26 + #26 + #28         700 (FS)             11   0.0139  0.0133
                                4                  #27 + #26 + #28 + #16   700 (FS)             8    0.0116  0.0118
Co-optimization                 1                  #25                     49 (7 × 7 wl)        7    0.0037  0.0033
                                2                  #26 + #4                49 (7 × 7 wl)        9    0.0033  0.0036
                                3                  #26 + #27 + #4          49 (7 × 7 wl)        8    0.0032  0.0031
                                4                  #25 + #3 + #25 + #3     77 (11 × 7 wl)       9    0.0034  0.0037



Fig. 3. Corn data. Pre-processing and wavelengths selected by the GA-PLS procedure for the best individuals of the last generation. For the 5 independent runs the maximum number of pre-processing steps has been fixed to 3. Individuals are sorted in ascending order of RMSECV.

(autoscale) with a linear regression (PLS), the RMSEP is equal to 2.68 when the full spectra are used and 2.01 after variable selection. Non-linear data structures can thus effectively be handled with linear regression when pre-processing and variable selection are co-optimized. Looking more precisely at the best individuals for the co-optimization with 3 pre-processing steps (Fig. 4), we can observe that all models use the same spectral zones, with normalization and simultaneous first and second derivatives, which is quite unusual. Even if this pre-processing combination is difficult to interpret, the 5 independent runs converge to the same solution. If the spectroscopist considers that a complex spectral correction must be interpretable in order to use the best predictive model, the 4-pre-processing model containing only one derivative, for example, might be used instead. Genetic algorithms give a set of equivalent solutions that can be managed with additional specific criteria.

4.3. Sugar beet data

This dataset contains 1062 samples in the calibration set; the computation time is therefore longer (around 50 min and 50 generations). The results for the best individual are summarized in Table 6. For the pre-processing optimization only, it seems that the combination of 3 pre-processing steps removes the majority of the interference sources, since no significant improvement is observed when 4 pre-processing steps are used. The selected pre-processing methods are mainly a combination of normalization, first or second derivative and autoscale. For the co-optimization, compared to the PLS models with optimized pre-processing, the improvement is important (0.149 to 0.107) when 1 or 2 pre-processing steps are used. For more than 2 pre-processing steps (removal of most of the interference sources by the pre-processing), the improvement in RMSEP with the co-optimization is not significant. We can thus consider that a good global optimum was already found during the pre-processing optimization. For the last population of the 5 independent GA runs (134 individuals) and 3 pre-processing steps (Fig. 5), it can be observed that all these individuals are equivalent in terms of prediction ability (RMSEP from 0.101 to 0.106). Compared to the results of the other datasets, important variability from one run to the other is observed in the selected pre-processing and wavelengths. From Fig. 5 it can be seen that selection of poorly relevant variables probably

Table 5
Tecator data. Results of the PLS model for the best individual obtained after GA optimization (5 independent runs) of a) pre-processing optimization only and b) co-optimization of pre-processing and variable selection, for different maximum numbers of consecutive pre-processing steps. FS: full spectra (pre-processing optimization only). LVs: number of latent variables. RMSEP: Root Mean Square Error of Prediction on the independent test set.

                                Max. consecutive   Selected               Number of selected   LVs  RMSECV  RMSEP
                                pre-processing     pre-processing         wavelengths (wl)
Pre-processing selection only   1                  #12                    100 (FS)             5    2.23    2.33
                                2                  #22 + #12              100 (FS)             6    0.77    0.67
                                3                  #19 + #11 + #26        100 (FS)             7    0.70    0.70
                                4                  #19 + #11 + #25 + #1   100 (FS)             9    0.66    0.72
Co-optimization                 1                  #31                    17 (17 × 1 wl)       12   1.72    1.69
                                2                  #22 + #11              17 (17 × 1 wl)       6    0.66    0.67
                                3                  #13 + #20 + #10        19 (19 × 1 wl)       7    0.64    0.50
                                4                  #19 + #11 + #3 + #26   13 (13 × 1 wl)       6    0.64    0.51

O. Devos, L. Duponchel / Chemometrics and Intelligent Laboratory Systems 107 (2011) 50–58

Fig. 4. Tecator data. Pre-processing and wavelengths selected by the GA-PLS procedure for the best individuals of the last generation. For the 5 independent runs, the maximum number of pre-processing steps was fixed at 3. Individuals are sorted in ascending order of RMSECV.

occurs. In this particular case, it is better to select the 'pre-processing only' models, which will be more robust. In conclusion, the co-optimization is not the best option for this dataset, but we now know that no further improvement can be expected.
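As an illustration of how a decoded individual is evaluated, the sketch below chains pre-processing functions over a spectral matrix before regression and computes the root-mean-square error criterion used throughout the paper. This is a minimal sketch, not the authors' implementation: the SNV and first-derivative functions stand in for members of the 31-function pool, and only NumPy is assumed.

```python
import numpy as np

def snv(X):
    """Standard Normal Variate: centre and scale each spectrum (row)."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

def first_derivative(X):
    """Simple first difference along the wavelength axis (loses one point)."""
    return np.diff(X, axis=1)

def apply_chain(X, chain):
    """Apply a sequence of pre-processing functions in the coded order."""
    for f in chain:
        X = f(X)
    return X

def rmse(y_true, y_pred):
    """Root mean square error, the criterion behind RMSECV and RMSEP."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
```

After `apply_chain`, the selected wavelength columns would be extracted and passed to a PLS regression; the fitness of an individual is then the RMSE of the cross-validated predictions.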

5. Conclusions

In this paper we demonstrate that genetic algorithms are a viable approach to simultaneously optimize the pre-processing and the variable selection for PLS regression in order to improve the prediction ability. Furthermore, the proposed method shows reasonable computation times on the computer cluster even when the numbers of samples and variables are large. To test the performance of this method, three real NIR datasets with different characteristics (wavelength/sample ratio, interferences, non-linearity, etc.) have been used and the results of the optimization discussed. When the GA is used to optimize only the pre-processing (PLS model on the full spectra), important improvements are observed, mainly when several pre-processing steps are combined. In most cases, combining more than 3 pre-processing steps does not significantly increase the prediction ability of the PLS model.
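The unique chromosome coding both the pre-processing chain and the variable selection can be sketched as follows. The encoding here is hypothetical: integer genes select from the pre-processing pool (0 meaning "no step") and one bit per wavelength window marks selection. The pool size of 31 and the maximum of 3 consecutive steps follow the paper, while the window count and initial selection probability are illustrative values.

```python
import random
import numpy as np

N_PREPROC = 31   # size of the pre-processing pool (as in the paper)
MAX_STEPS = 3    # user-defined maximum number of consecutive steps
N_WINDOWS = 20   # number of wavelength windows (hypothetical granularity)

def random_chromosome(p_select=0.3):
    """One chromosome codes both the pre-processing chain and the windows:
    MAX_STEPS integer genes in 0..N_PREPROC (0 = 'no step'), followed by
    one selection bit per wavelength window."""
    steps = [random.randint(0, N_PREPROC) for _ in range(MAX_STEPS)]
    mask = [1 if random.random() < p_select else 0 for _ in range(N_WINDOWS)]
    return steps + mask

def decode(chrom):
    """Split a chromosome into its pre-processing chain and window mask."""
    steps = [g for g in chrom[:MAX_STEPS] if g != 0]   # drop 'no step' genes
    mask = np.array(chrom[MAX_STEPS:], dtype=bool)
    return steps, mask
```

With this layout, standard crossover and mutation operators act on the whole chromosome at once, which is what allows pre-processing and wavelength selection to be co-optimized rather than tuned sequentially.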

For the GA co-optimization, the PLS models obtained generally perform better than those obtained with optimized pre-processing only. For the first two datasets (corn and Tecator) the prediction abilities are significantly improved. For the sugar beet data, pre-processing optimization alone is enough to obtain an optimal model. We have to keep in mind that the aim of the optimization process is to find the minimal value on the 'fitness landscape', which presents a certain ruggedness. In other words, even though the three datasets are all near infrared spectroscopic data, their solution landscapes can differ, with local minima or flat regions. It is exactly these differences that explain the different results for the three considered datasets. Nevertheless, we sometimes observe odd combinations of pre-processing methods when co-optimization is used. Such combinations are difficult for a spectroscopist to justify; but very often it is equally impossible to interpret why, in classical PLS regression (built with the full spectral range), a second derivative followed by a normalization (MSC or SNV) is, for example, better than just a first derivative. Sometimes the best model is even obtained with a normalization alone. Nevertheless, we have to keep in mind that all co-optimizations converge to the same kind of solutions over five different runs, which gives a real and significant decrease of the RMSEP. The question is: is it necessary to be able to interpret a complex spectral

Table 6
Sugar beet data. Results of the PLS model for the best individual obtained after GA optimization (5 independent runs): (a) pre-processing optimization only and (b) co-optimization of pre-processing and variable selection, for different maximum numbers of consecutive pre-processing steps. FS: full spectra (pre-processing optimization only). LVs: number of latent variables. RMSEP: Root Mean Square Error of Prediction on the independent test set.

(a) Pre-processing selection only

Max. consecutive pre-processing | Selected pre-processing | Selected wavelengths (wl) | LVs | RMSECV | RMSEP
1 | #16                    | 1050 (FS) | 14 | 0.164 | 0.160
2 | #13 + #25              | 1050 (FS) | 8  | 0.143 | 0.149
3 | #11 + #24 + #25        | 1050 (FS) | 11 | 0.105 | 0.106
4 | #11 + #14 + #24 + #25  | 1050 (FS) | 9  | 0.104 | 0.104

(b) Co-optimization

Max. consecutive pre-processing | Selected pre-processing | Selected wavelengths (wl) | LVs | RMSECV | RMSEP
1 | #15                    | 140 (14 × 10 wl) | 13 | 0.119 | 0.121
2 | #11 + #15              | 110 (11 × 10 wl) | 13 | 0.103 | 0.107
3 | #11 + #21 + #25        | 230 (23 × 10 wl) | 11 | 0.096 | 0.101
4 | #11 + #18 + #6 + #25   | 240 (24 × 10 wl) | 12 | 0.096 | 0.102


Fig. 5. Sugar beet data. Pre-processing and wavelengths selected by the GA-PLS procedure for the best individuals of the last generation. For the 5 independent runs, the maximum number of pre-processing steps was fixed at 3. Individuals are sorted in ascending order of RMSECV.

correction in order to use the best predictive model? Genetic algorithms give a way to answer this question. If we do not want to use the best solution because of an unusual combination, the last population of individuals can provide other solutions with no statistical difference in RMSEP that are easier for spectroscopists to understand. Another way to improve the interpretability of the selected pre-processing and to reduce the computation time is to constrain certain combinations of pre-processing methods, for example by forcing the GA to choose at most one pre-processing method in each category (derivative, smoothing, etc.) or by incorporating expert knowledge. In order to remove the irrelevant or poorly relevant variables selected by the GA, an additional step such as a backward elimination procedure might be used after the GA optimization [6]. For the moment, the optimized pre-processing and the selected variables depend strongly on two user-defined parameters: the maximum number of pre-processing steps and the percentage of initially selected variables. In order to avoid this situation, an approach based on multiple-population GAs might be used. In this case the whole population is divided into several subpopulations with different values of the initial percentage of selected variables and of the maximum number of pre-processing steps. Subpopulations may exchange information during optimization by allowing some individuals to migrate from one subpopulation to another. Furthermore, to give preference to solutions with fewer pre-processing steps, variables and PLS factors, penalty terms can be introduced in the fitness calculation. Finally, the proposed method could easily be extended to other problems (such as classification) where pre-processing and variable selection are known to significantly improve the prediction ability [14].
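The penalty terms suggested above could be folded into the fitness function as in the following sketch. Both the weights and the function name are purely illustrative assumptions, not part of the published method; in practice they would have to be tuned per dataset so that the penalties do not dominate the RMSECV term.

```python
def penalized_fitness(rmsecv, n_steps, n_vars, n_lvs,
                      w_steps=0.01, w_vars=0.0005, w_lvs=0.005):
    """Fitness to minimise: RMSECV plus linear penalties that favour
    parsimonious solutions (fewer pre-processing steps, fewer selected
    variables, fewer PLS latent variables). Weights are hypothetical."""
    return rmsecv + w_steps * n_steps + w_vars * n_vars + w_lvs * n_lvs
```

With such a fitness, two individuals with equivalent RMSECV are ranked by complexity, so selection pressure drifts the population toward simpler, more interpretable models.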

References

[1] A. Rinnan, F. van den Berg, S.B. Engelsen, Review of the most common pre-processing techniques for near-infrared spectra, Trends Anal. Chem. 28 (2009) 1201–1222.
[2] O.E. de Noord, The influence of data preprocessing on the robustness and parsimony of multivariate calibration models, Chemom. Intell. Lab. Syst. 23 (1994) 65–70.
[3] C.H. Spiegelman, M.J. McShane, M.J. Goetz, M. Motamedi, Q. Li Yue, G.L. Coté, Theoretical justification of wavelength selection in PLS calibration: development of a new algorithm, Anal. Chem. 70 (1998) 35–44.
[4] Z. Xiaobo, Z. Jiewen, M.J.W. Povey, M. Holmes, M. Hanpin, Variables selection methods in near-infrared spectroscopy, Anal. Chim. Acta 667 (2010) 14–32.
[5] A.S. Bangalore, R.E. Shaffer, G.W. Small, M.A. Arnold, Genetic algorithm-based method for selecting wavelengths and model size for use with partial least-squares regression: application to near-infrared spectroscopy, Anal. Chem. 68 (1996) 4200–4212.
[6] R. Leardi, A. Lupianez Gonzalez, Genetic algorithms applied to feature selection in PLS regression: how and when to use them, Chemom. Intell. Lab. Syst. 41 (1998) 195–207.
[7] R. Wehrens, L.M.C. Buydens, Evolutionary optimisation: a tutorial, Trends Anal. Chem. 17 (1998) 193–203.
[8] R. Leardi, Genetic algorithms in chemometrics and chemistry: a review, J. Chemometr. 15 (2001) 559–569.
[9] R. Leardi, Genetic algorithms in chemistry, J. Chromatogr. A 1158 (2007) 226–233.
[10] J. Koljonen, T.E.M. Nordling, J.T. Alander, A review of genetic algorithms in near infrared spectroscopy and chemometrics: past and future, J. Near Infrared Spectrosc. 16 (2008) 189–197.
[11] R.M. Jarvis, R. Goodacre, Genetic algorithm optimization for pre-processing and variable selection of spectroscopic data, Bioinformatics 21 (2005) 860–868.
[12] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, 1989.
[13] D.B. Fogel, An introduction to simulated evolutionary optimization, IEEE Trans. Neural Netw. 5 (1994) 3–14.
[14] E.G. Talbi, Metaheuristics: From Design to Implementation, John Wiley & Sons, 2009.
[15] T. Blickle, L. Thiele, A comparison of selection schemes used in evolutionary algorithms, Evol. Comput. 4 (1996) 361–394.
[16] K. Baumann, H. Albert, M. von Korff, A systematic evaluation of the benefits and hazards of variable selection in latent variable regression. Part I. Search algorithm, theory and simulations, J. Chemometr. 16 (2002) 339–350.
[17] G. Cruciani, M. Baroni, S. Clementi, G. Costantino, D. Riganelli, B. Skagerberg, Predictive ability of regression models. Part I: Standard deviation of prediction errors (SDEP), J. Chemometr. 6 (1992) 335–346.
[18] E. Cantu-Paz, D.E. Goldberg, Efficient parallel genetic algorithms: theory and practice, Comput. Meth. Appl. Mech. Eng. 186 (2000) 221–238.
[19] A. da Silva Soares, R.K.H. Galvão, M.C.U. Araújo, S.F.C. Soares, Multi-core computation in chemometrics: case studies of voltammetric and NIR spectrometric analyses, J. Braz. Chem. Soc. 21 (2010) 1626–1634.
[20] D. Jouan-Rimbaud, D.L. Massart, O.E. de Noord, Random correlation in variable selection for multivariate calibration with a genetic algorithm, Chemom. Intell. Lab. Syst. 35 (1996) 213–220.


[21] S. Gourvénec, X. Capron, D.L. Massart, Genetic algorithms (GA) applied to the orthogonal projection approach (OPA) for variable selection, Anal. Chim. Acta 519 (2004) 11–21.
[22] J.A. Westerhuis, S. de Jong, A.K. Smilde, Direct orthogonal signal correction, Chemom. Intell. Lab. Syst. 56 (2001) 13–25.
[23] F. Stout, J.H. Kalivas, K. Heberger, Wavelength selection for multivariate calibration using Tikhonov regularization, Appl. Spectrosc. 61 (2007) 85–95.
[24] C. Borggaard, H.H. Thodberg, Optimal minimal neural interpretation of spectra, Anal. Chem. 64 (1992) 545–551.
[25] H.H. Thodberg, A review of Bayesian neural networks with an application to near infrared spectroscopy, IEEE Trans. Neural Netw. 7 (1996) 56–72.
[26] J.P. Vila, V. Wagner, P. Neveu, Bayesian nonlinear model selection and neural networks: a conjugate prior approach, IEEE Trans. Neural Netw. 11 (2000) 265–278.
[27] F. Rossi, A. Lendasse, D. Francois, V. Wertz, M. Verleysen, Mutual information for the selection of relevant variables in spectrometric nonlinear modeling, Chemom. Intell. Lab. Syst. 80 (2006) 215–226.
[28] Y. Roggo, L. Duponchel, B. Noé, J.-P. Huvenne, Sucrose content determination of sugar beets by near infrared reflectance spectroscopy. Comparison of calibration methods and calibration transfer, J. Near Infrared Spectrosc. 10 (2002) 137–150.
[29] Y. Roggo, L. Duponchel, C. Ruckebusch, J.-P. Huvenne, Statistical tests for comparison of quantitative and qualitative models developed with near infrared spectral data, J. Mol. Struct. 654 (2003) 253–262.