Chemometrics and Intelligent Laboratory Systems 65 (2003) 67–81
www.elsevier.com/locate/chemometrics
Genetic algorithms and neural networks for the quantitative analysis of ternary mixtures using surface plasmon resonance

Frank Dieterle *, Birgit Kieser, Günter Gauglitz

Institute of Physical and Theoretical Chemistry, University of Tübingen, Auf der Morgenstelle 8, 72076 Tübingen, Germany

Received 23 May 2002; received in revised form 27 August 2002; accepted 30 August 2002

* Corresponding author. Fax: +49-7071-295960. E-mail address: [email protected] (F. Dieterle).
Abstract

Recently, time-resolved measurements have been proposed for sensor research to reduce the number of sensors needed for a multicomponent analysis. These measurements usually generate many variables, which are unfortunately highly correlated, creating several problems in data analysis. In this study, a variable selection algorithm is presented, which is optimized for limited data sets with correlated variables. The algorithm is based on many parallel runs of genetic algorithms (GA). The calibration is performed using neural networks. The algorithm is successfully applied to the selection of time points of time-resolved measurements performed by a single transducer surface plasmon resonance (SPR) setup. The selection of the time points enables an improved calibration of the vapor concentrations of three analytes in ternary mixtures. The relative root mean square errors of prediction of an external validation data set by the optimized models were 3.6% for methanol, 5.9% for ethanol and 7.6% for 1-propanol. The variable selection is reproducible and not affected by chance correlation of variables. The selected time points give insight into characteristic sensor responses of the pure analytes, and it is shown that the analysis time can be significantly reduced.
© 2002 Elsevier Science B.V. All rights reserved.

Keywords: Genetic algorithms; Variable selection; Time-resolved measurements; Neural networks; Surface plasmon resonance
1. Introduction

Besides the development of new sensors, the qualitative and quantitative multicomponent analysis is a main objective in chemosensory research. The most common approach to a multicomponent analysis is the combination of several more or less specific transducers on a sensor array. For a quantitative
analysis of mixtures, this parallel approach is limited by the need for at least as many sensors as analytes to be quantified, resulting in a high technical effort. Recently, a new approach has been proposed by several groups for reducing the number of required sensors [1–11]. This approach is based on the exploitation of the time-specific information of sensor responses. If the various analytes of a mixture show different kinetics for the sorption into the sensitive layer, the resulting sensor response recorded versus time features a different shape for each of these analytes. This additional temporal information can render the parallel
information of different sensors in an array redundant. Consequently, different research groups have successfully quantified binary and ternary mixtures with single-sensor setups by recording the sensor responses over time (further referred to as time points) and by performing a multivariate analysis of these time points [1,11]. However, this approach is challenged by the problem of finding the significant time points. This task is highly problem-specific, and thus different strategies are used in the literature: either the sensor responses are recorded using a rather coarse time resolution [2,3], a very fine resolution [4–6], or a coarse resolution combined with other more or less arbitrary features [7,8]. A coarse time resolution gives a manageable quantity of information but increases the risk of losing important information, especially if several analytes show similar or very fast sensor responses. Thus, a fine time grid should generally be preferred. However, a high number of time points (or, more generally, variables) causes several data analysis problems:

1. Variables which are irrelevant, redundant or very noisy might be included and could therefore hide meaningful variables [12].
2. If the number of variables exceeds the number of measurements, the problem of overfitting occurs, especially when using neural networks or other multivariate calibration methods [13,14].
3. The danger of chance correlation increases with the number of variables [15].
4. A high number of variables prevents iterative calibration methods (like neural networks) from finding optimal models [16].

Therefore, an optimal subset of variables has to be selected. This task is known as variable selection or feature selection in the literature [17]. Many variable selection strategies have been proposed, such as forward selection [18], backward elimination [19], generalized simulated annealing [20,21], genetic algorithms (GA) [22–29], orthogonal descriptors [30] and many more. Several comparisons of different variable selection methods have identified genetic algorithms as very powerful tools, since they are able to find global optima in a complex multidimensional search space, which is defined by all possible combinations of the variables [21,31].
In this study, a strategy for the selection of variables is presented, which is based on genetic algorithms. However, in contrast to most common genetic variable selection strategies found in the literature, the proposed method is a two-step algorithm. The first step is based on parallel runs of many genetic algorithms using a random subsampling method for data splitting. The variables, which are ranked by the first step, are added iteratively to the final calibration model in a second step. The genetic algorithm itself is designed to select only highly predictive variables and to select as few variables as possible. Problems faced by many genetic variable selection strategies, such as chance correlation or the selection of different variables by different runs, are prevented. Besides using a data set of limited size very effectively, this approach is also suited for massively parallel computer systems. The calibration is based on neural networks, since the relationship between the sensor responses and the concentrations of the analytes under investigation is nonlinear. Another advantage of using neural networks for the calibration is that no a priori information about the relationship to be modeled is necessary, albeit at the cost of an extensive calibration. The combination of the measurements recorded with a high time resolution, the two-step variable selection strategy and the neural networks for calibration forms a system which is able to tailor itself to a multicomponent analysis problem with practically no user input. The variable selection strategy, which is presented in Section 2, is applied to single-sensor measurements of ternary vapor mixtures with the task of simultaneously quantifying the concentrations of methanol, ethanol and 1-propanol.
2. Theoretical

2.1. Genetic algorithms

Genetic algorithms (GA) have become a popular optimization method imitating the natural selection process of biological evolution. The parameters to be optimized are represented by a chromosome, whereby each parameter is encoded in a binary string called a gene.
Thus, a chromosome consists of as many genes as there are parameters to be optimized. A population, which consists of a given number of chromosomes, is initially created by randomly assigning '1' and '0' to all genes. The chromosomes are evaluated by a so-called fitness function; the best chromosomes have the highest probability to survive. The next generation is reproduced by selecting the best chromosomes, mating the chromosomes to produce an offspring population and by occasional mutation. The evaluation and reproduction steps are repeated until a certain number of generations or a defined fitness is reached. The theory and benefits of GA in variable selection have been described several times in the literature [23–29] and will not be repeated here. Instead, the GA implementation of this study and its special features, such as the fitness function and the parallel and iterative approach, are explained in detail. In this study, the first population of the GA is randomly generated except for one chromosome, which is set to use all variables. The binary string of a chromosome has the same length as the number of variables to select from, whereby the presence of a variable is coded as 1 and its absence as 0. After evaluating the fitness of the population, the individuals are selected by means of the roulette wheel: the chromosomes are allocated space on a roulette wheel proportional to their fitness, and thus the fittest individuals are more likely to be selected. Then, offspring chromosomes are created by a crossover technique. A one-point crossover is employed, which randomly selects a crossover point within the chromosome; the two parent chromosomes are interchanged at this point to produce two new offspring. After that, the chromosomes are mutated with a probability of 0.005 per gene by randomly changing genes from 0 to 1 and vice versa. The mutation prevents the GA from converging too quickly in a small area of the search space.
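The three operators described above can be summarized in a short sketch. The following Python code is only a minimal illustration and not the authors' original implementation; the fitness values are assumed to be scores where higher is better, while the per-gene mutation rate of 0.005 follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def roulette_select(population, fitness):
    """Pick one chromosome with probability proportional to its fitness.
    Assumes non-negative fitness scores where higher is better."""
    p = fitness / fitness.sum()
    idx = rng.choice(len(population), p=p)
    return population[idx]

def one_point_crossover(parent_a, parent_b):
    """Exchange the gene strings of two parents at a random crossover point."""
    point = rng.integers(1, len(parent_a))
    child_a = np.concatenate([parent_a[:point], parent_b[point:]])
    child_b = np.concatenate([parent_b[:point], parent_a[point:]])
    return child_a, child_b

def mutate(chromosome, rate=0.005):
    """Flip each gene (0 <-> 1) with the given per-gene probability."""
    flips = rng.random(len(chromosome)) < rate
    return np.where(flips, 1 - chromosome, chromosome)

# Initial population: one chromosome per row, one gene (0/1) per variable.
n_variables, population_size = 50, 50
population = rng.integers(0, 2, size=(population_size, n_variables))
population[0] = 1  # one chromosome is set to use all variables, as in the text
```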
A crucial point in using GA is the design of the fitness function, which determines what the GA should optimize. In the case of a variable selection for calibration, the goal is to find a small subset of variables which are most significant for a regression. In this study, the calibration is based on neural networks correlating the time-dependent sensor signals with the concentrations of the different analytes. In contrast to many GA for variable selection found in the literature, the fitness function f of this study takes into account not only the prediction error of the test data but also, partially, the calibration error and, primarily, the number of variables used to build the corresponding neural nets:

$$ f = 0.2\,\mathrm{MRMSE}_{\mathrm{cal}} + 0.8\,\mathrm{MRMSE}_{\mathrm{test}} - a\left(1 - \frac{t}{t_{\max}}\right) \qquad (1) $$
where t is the number of variables used by the neural networks, t_max is the total number of variables and MRMSE is the mean root mean square error of the respective calibration or test data. The MRMSE is defined in Eq. (2), with N as the total number of samples predicted, M as the total number of analytes, ŷ_{i,j} as the predicted concentration of analyte j in sample i and y_{i,j} as the corresponding known true concentration:

$$ \mathrm{MRMSE} = \frac{1}{20\,M} \sum_{k=1}^{20} \sum_{j=1}^{M} \sqrt{\frac{\sum_{i=1}^{N} \left(\hat{y}_{i,j} - y_{i,j}\right)^{2}}{N}} \qquad (2) $$

where the index k runs over the 20 training and prediction sessions per network described below. The fitness function can be broken up into three parts. The first two parts correspond to the accuracy of the neural networks: MRMSE_cal is based on the prediction of the calibration data used to build the neural nets, whereas MRMSE_test is based on the prediction of separate test data not used for training the neural networks. It was demonstrated in Ref. [32] that using the same data for variable selection and model calibration introduces a bias, and thus variables are selected based on data poorly representing the true relationship. On the other hand, it was also shown [32,33] that a variable selection based on a small data set is unlikely to find an optimal subset of variables. Therefore, a ratio of 1:4 between the influence of the calibration and the test data was chosen. Although partly arbitrary, this ratio should give the calibration data too little influence to bias the feature selection, while still taking the samples of the larger calibration set partly into account. The third part of the fitness function rewards small networks using only few variables by an amount proportional to the parameter a.
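For illustration, the two error terms and the size reward can be written in a few lines of code. This is a hedged sketch based on the reconstruction of Eqs. (1) and (2) above, not the authors' code; the weights 0.2/0.8 and the parameter a follow the text.

```python
import numpy as np

def mrmse(y_true, y_pred):
    """Mean RMSE over all analytes for a single training session (inner part of Eq. (2)).
    y_true, y_pred: arrays of shape (n_samples, n_analytes)."""
    rmse_per_analyte = np.sqrt(np.mean((y_pred - y_true) ** 2, axis=0))
    return rmse_per_analyte.mean()

def fitness(mrmse_cal, mrmse_test, n_selected, n_total, a=0.9):
    """Fitness of a chromosome, Eq. (1): weighted errors minus a reward,
    proportional to a, for using only few of the available variables."""
    return 0.2 * mrmse_cal + 0.8 * mrmse_test - a * (1.0 - n_selected / n_total)
```

In the paper, the two MRMSE values are additionally averaged over 20 training sessions per network to damp the effect of the random weight initialization. Note that lower values of f are better here; how such error-based values are mapped to roulette-wheel selection probabilities (e.g. by ranking) is a detail the paper does not spell out.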
The choice of a influences the number of variables used by the evolved neural nets. A high value of a results in only few variables being selected by each GA, whereas a small value of a results in more variables being selected. As the initial weights of a neural network are set randomly, the network finds a different local minimum for each run, with a slightly different prediction performance. In order to reduce the variance of the error of prediction due to the random weight initialization, the fitness in Eq. (2) is averaged over 20 training and prediction sessions per network.

2.2. Global algorithm

Many variable selection strategies perform the variable selection with only a single run of a genetic algorithm. Besides its benefits in computation time, this single-run approach has several drawbacks:

(1) Both the chromosomes of the initial population and the weights of the neural networks are randomly generated. As there is no guarantee that the walk of the genetic algorithm through the search space, which also contains random steps, always finds the best subset of variables before converging, different runs often find similar but not exactly identical subsets of variables [34].
(2) Jouan-Rimbaud et al. [35] demonstrated recently that, by chance correlation of variables, irrelevant variables are selected by GA or at least have a significant influence on the final model, even if validation procedures are used.
(3) Often, the prediction of a single test data set is used as a measure for the fitness function. Thus, the variable selection is judged using only this limited and not necessarily representative part of the complete data set.
(4) It has been shown [35] that several runs of GA end with similar but often different variables selected, due to collinearity in the data. This makes the variable selection not very robust.
In the literature, several approaches have been reported to solve these problems of single GA runs:

(1) Massart and Leardi [23,36] use a very refined algorithm for variable selection, which is based on the parallel run of many GA with different combinations of test and calibration data. A validation step is then performed to find the best variable subset. The GA is a hybrid algorithm using a stepwise backward elimination of variables to find the smallest possible subset of variables. Although this approach is very promising, Jouan-Rimbaud et al. [35] showed that this algorithm is still partly subject to chance correlation.
(2) In Ref. [24], Leardi et al. used 100 runs of GA with the same calibration and test data sets. The final model is obtained by systematically adding the variables, which are ranked according to their frequency of selection over the GA runs, and by using the combination with the smallest error of prediction. In Ref. [37], this algorithm is modified so that the different GA runs learn from each other.
(3) In Ref. [38], the predictions of several models found by different GA runs are averaged. However, the averaged prediction was not better than the prediction by a single model.
(4) In Ref. [34], 10 runs of GA were performed using different calibration and test data subsets. The final model uses all variables which were selected at least five times, whereby this limit is rather arbitrary.

The global algorithm proposed in this study uses several elements of the studies mentioned above and is presented in the flow diagram in Fig. 1. The algorithm can be divided into two steps. The first step consists of multiple parallel runs of GA using different calibration and test data subsets (dark gray boxes in the flow diagram). Variables which are represented more often than average in the final population of each GA run are collected over all GA runs and ranked according to their frequency of appearance in the final populations. The second step of the algorithm builds the final model in an iterative procedure: the variables are added to the neural network model stepwise according to their rank. The neural network is evaluated by the use of different calibration and test data subsets (dark gray boxes in Fig. 1). The RMSE of prediction of all test data sets is compared with the RMSE of the previous model by means of a Kruskal–Wallis nonparametric test. If the RMSE is significantly lower (p < 0.05), the last variable is accepted and the procedure is repeated with the next most important variable, until the predictions are no longer significantly improved. Finally, this neural network topology should be trained several times with the complete data set, and the neural net with the smallest error of cross-validation should be used as the final optimized model, which should be validated with an external data set not used during the complete variable selection algorithm.
Fig. 1. Flow chart of the global algorithm.
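The two steps, frequency ranking over many GA runs followed by iterative variable addition with a significance test, can be sketched as follows. This is an illustrative reading of the flow chart, not the original code; `evaluate_subset` is a hypothetical user-supplied function that trains the neural networks on several random calibration/test splits and returns one RMSE per split.

```python
import numpy as np
from scipy.stats import kruskal

def rank_variables(final_populations):
    """Step 1: over all GA runs, collect the variables that are represented more
    often than average in the final population and rank them by this count.
    final_populations: list of 0/1 arrays, each of shape (pop_size, n_variables)."""
    n_variables = final_populations[0].shape[1]
    counts = np.zeros(n_variables)
    for pop in final_populations:
        freq = pop.mean(axis=0)            # fraction of chromosomes using each variable
        counts += (freq > freq.mean())     # variables represented above average
    return np.argsort(counts)[::-1], counts

def add_variables_iteratively(ranked, evaluate_subset, p_threshold=0.05):
    """Step 2: add variables by rank; keep a variable only if the RMSEs over the
    test subsets are significantly lower (Kruskal-Wallis, p < 0.05) than before."""
    selected = [ranked[0]]
    best_rmses = evaluate_subset(selected)
    for var in ranked[1:]:
        trial_rmses = evaluate_subset(selected + [var])
        _, p = kruskal(best_rmses, trial_rmses)
        if p < p_threshold and np.median(trial_rmses) < np.median(best_rmses):
            selected.append(var)
            best_rmses = trial_rmses
        else:
            break  # stop once predictions no longer improve significantly
    return selected
```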
In both major steps of the global algorithm, the complete data set has to be split several times into a calibration (75%) and a test (25%) subset. This splitting was performed by random subsampling, also known as multiple holdout or repeated evaluation set [39]. Compared to a jackknifing procedure, this results in more pessimistic predictions of the test data and, according to Eq. (1), models are preferred which are more predictive and interpolate better. As already stated, the choice of a in the fitness function (Eq. (1)) influences the number of variables selected during each run of a GA. Too high a value of a partly ignores the accuracy of the neural nets and results in only few variables being selected. Consequently, there might be too few variables from the first step to add to the neural net in the second step. This problem can be recognized when all variables with a ranking higher than '0' are used for the neural net in the second step. On the other hand, too low a value of a results in too many variables being selected, which can be detected by the absence of any differentiation between the variables in the ranking. An empirical way to select an optimal a is to run a single GA with different values of a and to choose the value that results in the selection of the number of variables expected to be needed for the calibration. A good starting point is setting a to '1' for these single runs of the GA. However, preliminary studies showed that the parallel runs of the GA make the global algorithm quite robust towards the choice of a and of the population size, which is suggested to be set to the number of variables to select from. Although the global algorithm seems complex at first sight, this robustness makes it rather user friendly. Since major parts of the algorithm are based on the parallel run of independent calculations (light gray boxes in Fig. 1) with only few connection points for information exchange, the complete algorithm is suitable for highly distributed calculations on numerous parallel computers.
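A minimal sketch of the repeated 75%/25% random subsampling used in both steps is given below; the split ratio follows the text, while the number of repetitions is a free parameter.

```python
import numpy as np

def random_subsampling(n_samples, n_splits=20, test_fraction=0.25, seed=0):
    """Yield (calibration_idx, test_idx) index pairs for repeated random 75/25 splits."""
    rng = np.random.default_rng(seed)
    n_test = int(round(test_fraction * n_samples))
    for _ in range(n_splits):
        order = rng.permutation(n_samples)
        yield order[n_test:], order[:n_test]
```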
3. Experimental

3.1. Experimental setup

The single-transducer setup used for the time-resolved measurements is based on surface plasmon resonance (SPR), which is an evanescent field method. The white-light SPR setup, which is described in detail elsewhere [40], is sensitive to changes of the refractive index of the sensitive layer. The sensitive layer, which is based on a commercially available polycarbonate, was prepared by spin-coating 40 μl of a polycarbonate solution (Makrolon M2400; Bayer, Leverkusen, Germany; 1.7 wt.% in chloroform) for 40 s at 3000 rpm onto a silver layer, which had been deposited on a NPH2 glass prism [40]. The prepared polymer layer had a thickness of about 300 nm; the probe depth can be calculated as 98 nm. The analyte mixtures were generated by thermocontrolled bubblers filled with the pure liquid analytes. The analyte concentrations in the gas phase are defined by Antoine's law [41]. Defined vapor concentrations were obtained by using synthetic air as carrier gas and by diluting with computer-driven mass flow controllers (MKS, Munich, Germany). All vapors were mixed and thermostated before entering the measurement cell, which has a volume of 1 ml. A four-way valve before the cell ensures that the path length is the same for all analytes. All measurements were performed at a constant flow rate of 300 ml min⁻¹.
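As a hedged illustration of how such gas-phase concentrations are typically obtained, the saturation vapor pressure can be estimated with the Antoine equation, and the relative saturation pressure then follows from the dilution of the saturated bubbler stream. The constants and flows below are function arguments, not values from the paper.

```python
def antoine_pressure(A, B, C, T_celsius):
    """Saturation vapor pressure from the Antoine equation, log10(p) = A - B / (C + T)."""
    return 10 ** (A - B / (C + T_celsius))

def relative_saturation(flow_bubbler, flow_total):
    """Relative saturation pressure p/p0, assuming the bubbler stream leaves fully
    saturated and is diluted by the carrier-gas flow (a simplifying assumption)."""
    return flow_bubbler / flow_total
```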
3.2. Data sets

Two data sets were recorded for the multicomponent analysis of the ternary vapor mixtures. The first data set is a six-level equidistant full factorial design, whereby the relative saturation pressures of methanol, ethanol and 1-propanol were varied between 0 and 0.035. Eighteen concentration combinations were measured two more times for an estimation of the experimental signal inaccuracies. This data set was used for the variable selection process and for the training of the neural networks. The second data set is a five-level equidistant full factorial design with relative saturation pressures between 0.0035 and 0.0315. An additional 15 measurements were performed with the pure analytes at the five concentration levels. This second data set was not used for the variable selection process but served as an external validation data set. It is noteworthy that all concentrations of the two data sets differ. Thus, the second data set should give a realistic estimate of the network performance in a real-world situation [42]. Nine measurements were identified as outliers according to Ref. [43], whereby seven measurements of data set 1 and two measurements of data set 2 were removed. Consequently, data set 1 consists of 245 samples and data set 2 of 138 samples. All measurements were performed in random order within the two data sets. The second data set was recorded 1 month after data set 1, and the signals of the second data set were averaged over two measurements. During all measurements, the polycarbonate was exposed to the analyte mixtures for 600 s and afterwards to dry synthetic air for 4760 s. During the exposure to analyte, the data were recorded at 28 points of time, and during the exposure to synthetic air at 22 points of time. Thus, the signals were recorded at 50 points of time, resulting in 50 variables. All data were autoscaled by subtracting the mean and dividing by the standard deviation of each variable.

3.3. Neural networks

The neural networks implemented for this study belong to the class of multilayer feedforward backpropagation networks.
Due to the numerous excellent textbooks describing these networks in detail [44–47], only a short outline of the implementation used in this study is given: the topology consisted of three output units, six hidden units and as many input units as time points used. The nets were fully connected. The training included a maximum of 1000 learning cycles. Scaled conjugate gradient (SCG) was used as the optimization algorithm [48]. This algorithm includes pseudo-second-derivative information. In contrast to standard backpropagation, SCG converges faster, is more capable of dealing with local minima and is not sensitive to its parameters, which were all set to the standard values suggested in Ref. [48]. A hyperbolic tangent was used as activation function for the hidden layer and a linear activation function was used for the output layer. Backpropagation networks are often affected by an effect called overtraining [49]: an overtrained ANN memorizes the small calibration data set instead of generalizing from the data and consequently performs badly on new data, e.g. on test data sets. In this work, overtraining was countered by so-called early stopping [50], implemented by stopping the training when the error of cross-validation of the calibration data starts to rise, as the net may begin to lose its generalization ability at this point.
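A comparable network can be set up in a few lines, for example with scikit-learn; this is only an approximate stand-in, since scikit-learn offers no scaled conjugate gradient optimizer (Adam is used here instead) and its early stopping monitors a held-out validation fraction rather than a cross-validation error. The data arrays X_cal and y_cal are hypothetical placeholders.

```python
from sklearn.neural_network import MLPRegressor

# Six tanh hidden units, three (linear) outputs, at most 1000 learning cycles,
# early stopping on a held-out part of the calibration data.
net = MLPRegressor(hidden_layer_sizes=(6,), activation="tanh", solver="adam",
                   max_iter=1000, early_stopping=True, validation_fraction=0.2,
                   random_state=0)

# X_cal: autoscaled sensor signals at the selected time points, shape (n_samples, n_time_points)
# y_cal: relative saturation pressures of methanol, ethanol and 1-propanol, shape (n_samples, 3)
# net.fit(X_cal, y_cal); y_pred = net.predict(X_test)
```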
Table 1
Relative RMSE of calibration and prediction by neural networks using different numbers of time points

| Analyte    | Time points | Subsampling: calibration data (data set 1) (%) | Subsampling: test data (data set 1) (%) | Final neural net: calibration data (data set 1) (%) | Final neural net: test data (data set 2) (%) |
|------------|-------------|------------------------------------------------|-----------------------------------------|------------------------------------------------------|-----------------------------------------------|
| Methanol   | 50          | 4.2 ± 0.2                                      | 6.3 ± 0.7                               | 2.8                                                  | 4.3                                           |
| Methanol   | 5           | 4.0 ± 0.2                                      | 4.3 ± 0.7                               | 3.2                                                  | 3.6                                           |
| Methanol   | 6           | 4.1 ± 0.2                                      | 4.3 ± 0.7                               | 3.0                                                  | 3.6                                           |
| Methanol   | 4           | 4.4 ± 0.3                                      | 6.0 ± 0.7                               | 4.2                                                  | 6.7                                           |
| Ethanol    | 50          | 5.9 ± 0.3                                      | 8.4 ± 0.8                               | 6.2                                                  | 8.7                                           |
| Ethanol    | 5           | 6.9 ± 0.3                                      | 7.0 ± 0.7                               | 6.8                                                  | 6.3                                           |
| Ethanol    | 6           | 5.8 ± 0.2                                      | 6.4 ± 0.7                               | 6.2                                                  | 5.9                                           |
| Ethanol    | 4           | 9.4 ± 0.7                                      | 8.4 ± 1.0                               | 8.9                                                  | 11.3                                          |
| 1-Propanol | 50          | 7.0 ± 0.2                                      | 9.7 ± 1.0                               | 7.4                                                  | 12.6                                          |
| 1-Propanol | 5           | 7.4 ± 0.2                                      | 7.8 ± 0.7                               | 7.8                                                  | 7.6                                           |
| 1-Propanol | 6           | 7.4 ± 0.2                                      | 8.7 ± 0.7                               | 7.9                                                  | 10.8                                          |
| 1-Propanol | 4           | 8.1 ± 0.2                                      | 10.1 ± 0.8                              | 9.1                                                  | 18.1                                          |

For the subsampling procedure, the standard deviations over the different subsets are also given.
The performance of the different neural networks was measured by comparing the MRMSE of prediction and calibration (Eq. (2)) and by the relative root mean square errors listed in Table 1:

$$ \mathrm{RMSE}_{\mathrm{rel}} = \sqrt{\frac{\sum_{i=1}^{N} \left(\hat{y}_{i} - y_{i}\right)^{2}}{N}} \Bigg/ \frac{\sum_{i=1}^{N} y_{i}}{N} \qquad (3) $$
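In code, the relative RMSE of Eq. (3) is simply the RMSE normalized by the mean of the true values; a minimal sketch:

```python
import numpy as np

def rmse_rel(y_true, y_pred):
    """Relative RMSE, Eq. (3): RMSE divided by the mean of the true concentrations."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2)) / np.mean(y_true)
```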
4. Results

4.1. Sensor responses of pure analytes

First, the sensor responses for the pure analytes were investigated to see whether a multicomponent analysis of the analytes is feasible. In Fig. 2, the sensor responses are plotted during 600 s of exposure to single analyte–air mixtures and afterwards during exposure to synthetic air, whereby the measurement was repeated using 10 different concentrations of each analyte. The first plot shows that the sensor reacts spontaneously to the presence and absence of the methanol vapors, resulting in a rectangular shape along the time axis. The sensor response for ethanol is slower, resulting in shapes which are more curved. The sorption of 1-propanol into the sensitive polymer layer is very slow, and equilibrium between the 1-propanol vapors and the 1-propanol sorbed into the polymer layer is not reached. Therefore, the shape of the sensor response plot does not reach a plateau during sorption, and the signal needs a long time to come down to the baseline during desorption. The responses of the sensor to mixtures (not shown) are, to a first approximation, the sum of the single vapor responses, whereby for higher concentrations a significant saturation effect can be observed, which can also be seen as curvatures of the sensor responses along the concentration axis of pure ethanol and 1-propanol in Fig. 2. Additionally, the fast kinetics of methanol is slightly retarded by high concentrations of the other analytes, which might be ascribed to a competitive effect. In summary, a multicomponent analysis of the three analytes should be possible, as the plots of the sensor responses feature different shapes along the time axis.
Fig. 2. Sensor responses of pure analytes recorded during 600 s of exposure to pure analyte and afterwards during exposure to synthetic air. The x-axis represents the exposure time; the y-axis represents different relative saturation pressures (concentrations) of the analytes and the z-axis represents the signal shift during the SPR measurement.
The use of neural networks for calibration is justified by the nonlinear relationship between the sensor responses and the concentrations of all three analytes.
4.2. Calibration without variable selection

First, a direct calibration was performed using all 50 time points. Data set 1 was split into disjoint test and calibration sets by a 20-fold random subsampling procedure with a ratio of 1:3. A neural net was trained on the calibration data and then predicted the concentrations of the calibration and the test data. This procedure was repeated 20 times, resulting in 1400 predictions of test data. The relative RMSE of prediction of the calibration and the test data are listed in Table 1 (columns 3 and 4), including the standard deviations over the 20 folds. The errors of the calibration and test data increase in the series methanol, ethanol, 1-propanol, whereby the errors of the test data are significantly worse. The modeling power of the neural networks using all time points was also validated with data set 2. For this purpose, a neural net was trained using the complete data set 1 and subsequently predicted the concentrations of the external validation data set 2. The predictions of the calibration data and the external test data are listed in Table 1 (columns 5 and 6), whereby the predictions of the test data are even worse, with a relative RMSE of 4.3% for methanol, 8.7% for ethanol and 12.6% for 1-propanol.
In all cases, the predictions of the test data are significantly worse than the predictions of the calibration data. This effect is called shrinkage [51] and is an indication of poor calibration models, explainable by overtraining due to the large size of the neural nets in relation to the limited size of the calibration data set. To improve the performance of the neural networks, the network size and thus the number of time points have to be reduced by a variable selection procedure.

4.3. Variable selection

The variable selection algorithm described in Section 2 was applied to data set 1, using 100 parallel runs of the GA. Each genetic run evaluated populations of 50 chromosomes over about 60 generations, with the stopping criterion set to a convergence of the standard deviation of the genes below 0.04. The parameter a of the fitness function was set to 0.9, which results in approximately six variables being selected per single GA. The results of the first step of the global algorithm are shown in Fig. 3, where the ranking of the variables is shown as the frequency of the variables being present in the last populations of the genetic algorithms.
Fig. 3. Frequency of the selection of the time points for 100 parallel runs of the genetic algorithms. The five time points selected by the algorithms are labeled in addition.
In the second step, these variables enter the model according to their rank until the prediction of the test data no longer improves significantly. For this iterative procedure, the same 20-fold random subsampling procedure was used as described in the previous section. Thus, the predictions of the iterative procedure can be compared directly with the results of the previous section. The iterative procedure stopped after the addition of five variables. The corresponding mean RMSE is marked by an arrow in Fig. 4, which shows the mean RMSE versus the number of variables used in the second step of the global algorithm. Consequently, in accordance with Fig. 3, the algorithm suggests the time points 5, 15, 25, 55 and 615 s as the optimal variables. The errors of calibration of the iterative subsampling procedure with the five-time-point neural networks are comparable to those of the neural networks using 50 time points (Table 1, column 3), whereas the prediction of the test data by the neural networks using five time points is significantly better (Table 1, column 4). Practically no shrinkage is noticeable, indicating much more stable models compared with the calibration using all time points. The prediction errors of the test data should also give an estimate for other external data sets.
Consequently, the modeling power of the neural networks using these five time points was validated with external test data, using the complete data set 1 for training and data set 2 for prediction. Again, the neural networks using only five time points clearly outperformed the neural networks using all 50 time points. The external validation data were predicted with low relative errors of 3.6% for methanol, 6.3% for ethanol and 7.6% for 1-propanol. These errors agree with the errors estimated by the random subsampling procedure. The predictions of the calibration data set 1 and of the validation data set 2 are shown in the true-versus-predicted plots in Fig. 5. As the single predictions cannot be resolved graphically, the predictions of each concentration level are represented by their mean and standard deviation. In all plots, the means of each concentration level are located very close to the diagonal, demonstrating the absence of systematic errors. The neural networks seem to deal very well with the nonlinearities indicated in Fig. 2. The standard deviations, which are a measure of the scattering of the predictions, are almost the same for the test data and the calibration data. The predictions of methanol are least scattered, corresponding to the smallest RMSE in Table 1.
Fig. 4. Mean RMSE of prediction versus the number of time points selected by the iterative variable addition step of the global algorithm. After the addition of five variables, no significant improvement was achieved.
The signals of the three times 18 repeated measurements, which were spread over the complete concentration range used for the mixtures, show a relative standard deviation of 4.6%. These inaccuracies of the signals are caused by the noise of the spectrometer, inaccuracies of the gas mixing station and fluctuations of the measuring temperature. The rather small increase of the mean relative RMSE in the concentration domain after the data analysis shows the power of the combination of the proposed variable selection algorithm and the neural networks used for calibration. In order to assess the limits below which the components cannot be detected, the limits of detection of the analytes in mixtures were calculated following a procedure analogous to a univariate calibration function [52]:

$$ x_{D} = t_{1-\alpha,\nu}\, s_{0} + t_{1-\beta,\nu}\, s_{D} \qquad (4) $$
where x_D is the limit of detection, s_0 and s_D are the estimates of the standard deviations for the blank and at x_D based on ν degrees of freedom, and t represents the one-sided critical values of Student's t-distribution. Both α and β were set to 0.05 and the approximation s_D = s_0 was used. The limits of detection derived from the test data of the subsampling procedure were relative saturation pressures of 0.0024 for methanol, 0.0023 for ethanol and 0.0024 for 1-propanol.
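As a hedged numerical illustration of Eq. (4): with α = β = 0.05 and s_D ≈ s_0, the limit of detection reduces to x_D ≈ 2·t_{0.95,ν}·s_0. The degrees of freedom and the blank standard deviation in the example call below are placeholders, not values from the paper.

```python
from scipy.stats import t

def limit_of_detection(s0, dof, alpha=0.05, beta=0.05, sD=None):
    """Limit of detection, Eq. (4): x_D = t_{1-alpha,nu} * s0 + t_{1-beta,nu} * s_D."""
    sD = s0 if sD is None else sD  # approximation s_D = s_0 used in the paper
    return t.ppf(1 - alpha, dof) * s0 + t.ppf(1 - beta, dof) * sD

# Example with hypothetical inputs (blank standard deviation and degrees of freedom):
# x_D = limit_of_detection(s0=0.0007, dof=20)
```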
Fig. 5. Predicted concentrations (expressed as relative saturation pressures) versus the known true concentrations. The external data set 2 was predicted as test data set by neural networks trained on the complete calibration data set 1 using five time points.
4.4. Analysis of the selected time points

The five time points selected by the algorithm (5, 15, 25, 55 and 615 s) can be analyzed in more detail in combination with the response plots of the sensor exposed to the pure analytes (Fig. 2). The response surface of methanol shows that after 5 s the response has practically reached the plateau of the highest sensor signals, whereas ethanol and 1-propanol show nearly no sensor signal. The same applies to the 615 s signal, which is situated 15 s after the end of the exposure to analyte: methanol has already desorbed, whereas the sensor response of ethanol is still very high and 1-propanol shows practically no decrease of the sensor signal. Thus, these time points can be primarily correlated with the concentration of methanol. Large parts of the variance of the sensor signals after 15 and 25 s can be attributed to ethanol, since the signal of methanol has already reached its plateau and 1-propanol contributes negligibly to the total signal. On the other hand, the variance of the sensor signal at 55 s can mainly be ascribed to 1-propanol, as the sensor signal of methanol has completely, and the sensor signal of ethanol has nearly, reached equilibrium. In summary, all five time points selected by the algorithm can be associated with characteristic sensor responses of the pure analytes and consequently make sense in a chemical respect. Another benefit of the feature selection results from the direct relation of the variables to the analysis time: only information recorded during the first 55 s of the exposure to the analytes and up to 15 s after the exposure is evaluated. Thus, it should be sufficient to reduce the analyte exposure to 55 s and to record the sensor signals for 70 s, which would dramatically reduce the analysis time.

4.5. Variation of the number of time points

In addition to the five time points selected by the algorithm, the four and six most important variables according to Fig. 3 were investigated using the external validation data set 2 in the same way as described above. When only four time points are used by dropping the time point 55 s, all predictions are significantly worse according to Table 1. The predictions of 1-propanol are affected most, which can be explained by the omission of the time point 55 s, which has already been associated with the prediction of 1-propanol. The effects of increasing the number of variables from five to six by adding the time point 5360 s are more inconsistent: while the predictions of the test data are unchanged for methanol, the predictions of ethanol slightly improve, whereas the predictions of 1-propanol significantly deteriorate. Additionally, a systematic variation of the number of variables, similar to Fig. 4, was performed for the prediction of the external test data set 2.
The lowest MRMSE according to Eq. (2) for the external validation data was found when using the five time points selected by the algorithm, demonstrating that the algorithm selects an optimal number of variables.

4.6. Randomization tests and reproducibility

An unattended use of many genetic algorithms is limited by chance correlations of variables. These can occur if variables are noisy, if the number of samples is limited and if there are many variables to choose from. In that case, the GA may model noise instead of information and consequently select randomly correlated variables. Therefore, two tests were performed to investigate the robustness of the variable selection algorithm proposed in this study. The first test is a randomization test similar to Ref. [24]: the order of the elements in the concentration vector y is randomly permuted, so that the analyte concentrations are trained with measurements belonging to other concentrations. If a GA can model anything at all, it models only noise, as there should be no significant information left in the data. The test was performed using the global algorithm with the same settings used for data set 1 in Section 4.3. The ranking of the variables showed that all variables were selected with approximately the same frequencies and that, in contrast to Fig. 3, no variable was noticeably preferred (similar to the right side of Fig. 6). This means that if there is no information in the data set, the algorithm does not select variables showing a random correlation which might be modeled. In the subsequent second step, the algorithm stopped after the addition of one variable, with a relative mean RMSE of 70.5% for the test data of the random subsampling procedure. Compared with the relative mean RMSE of 6.4% for the original, meaningful data set 1, the dramatic increase of the RMSE and the immediate termination of the iterative step show that the algorithm prevents itself from modeling pure noise. The second test addresses the problem of the selection of randomly correlated variables by GA and was carried out similarly to Ref. [35]. In this test, the number of variables is increased by adding meaningless artificial variables containing only random numbers to the meaningful original variables.
Fig. 6. Frequency of selection for 50 time points of data set 1 and 50 additional random variables (R1 – R50) after the first step of the global algorithm.
The feature selection algorithm is then run on this increased set of variables. A well-performing algorithm should not select any of the artificial random variables, which contain no meaningful information. For this study, 50 random variables were added to the set of 50 original time points. The random variables were created from uniformly distributed random numbers with the same variation as the original time points. The global algorithm was applied to this extended data set in the same way as described before, except for two parameters adapted to the increased data set: the population size was increased to 100, resulting in about 110 generations until the convergence criterion was reached, and the parameter a was set to 1, which resulted in approximately six variables being selected in single runs of the GA. The variable ranking after the first step of the algorithm is shown in Fig. 6. It is obvious that all random variables are ranked very low and that no random variable can be found among the 27 most important time points.
The parallel runs of multiple GA with different combinations of test and calibration data seem to prevent the selection of randomly correlated variables, whereas single runs of the GA did select random variables, evident from the nonzero frequencies of the random variables in Fig. 6. Additionally, the left side of Fig. 6 looks very similar to Fig. 3, demonstrating the reproducibility of the ranking of the meaningful variables when the global algorithm is run repeatedly. The top five time points are ranked similarly to those obtained when the algorithm was applied to the original data set 1 (Fig. 3). Consequently, the same five variables are selected in the second step of the algorithm, demonstrating the reproducibility of the variable selection by the global algorithm.
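The two robustness checks described above are straightforward to set up in principle. The sketch below only shows how the perturbed inputs can be generated (permuting the concentration vector and appending uniformly distributed random variables); the global algorithm itself, represented here by the hypothetical placeholder `run_global_algorithm`, would then be applied to them as in Section 2.

```python
import numpy as np

rng = np.random.default_rng(1)

def randomization_test_targets(y):
    """Test 1: randomly permute the rows of the concentration matrix y,
    destroying the relation between signals and concentrations."""
    return y[rng.permutation(len(y))]

def add_random_variables(X, n_random=50):
    """Test 2: append uniformly distributed random variables with roughly the
    same spread as the original (autoscaled) variables."""
    low, high = X.min(), X.max()
    X_random = rng.uniform(low, high, size=(X.shape[0], n_random))
    return np.hstack([X, X_random])

# y_scrambled = randomization_test_targets(y)   # then: run_global_algorithm(X, y_scrambled)
# X_extended  = add_random_variables(X)         # then: run_global_algorithm(X_extended, y)
```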
5. Conclusions

In this study, an algorithm for the selection of variables has been presented. The algorithm was successfully applied to a data set of three analytes in vapor mixtures recorded by time-resolved measurements with a single-sensor SPR setup, whereby the data set is limited in size and contains highly correlated variables.
It has been shown that the algorithm is robust to chance correlations of variables and to noise and that the selection of variables is reproducible. The predictions of the concentrations of the three analytes were considerably improved when using the five time points selected by the algorithm instead of all 50 time points. The general application of the combination of the SPR setup, the time-resolved measurements and the calibration and variable selection procedure proposed in this study to other analytes is limited by the weakest component of this combination. The SPR detection scheme is quite sensitive to the sorption of the analytes, owing to the changes of the refractive index during the displacement of the air in the micropores by analyte molecules. The algorithm for the calibration and variable selection performs well and is expected to be directly applicable to other data sets. However, a crucial point is that the time-resolved measurements rely on different kinetics of the analytes. The principle of diffusion constants depending on the size of the analytes is only adaptable to a variety of analytes if suitable polymers with different-sized pores are available, which will be investigated further. At the moment, the application is limited to long-term measurements, as the analytes were measured over several minutes. However, the selection of the time points by the algorithm suggests a major reduction of the analysis time, which will be examined further, as will a variation of the thickness of the sensitive layer. Furthermore, the performance of the variable selection algorithm will be studied for data sets from other fields of research.
References

[1] H.M. Yan, G. Kraus, G. Gauglitz, Anal. Chim. Acta 312 (1995) 1–8.
[2] H.P. Beck, C. Wiegand, Fresenius' J. Anal. Chem. 351 (1995) 701–707.
[3] J. White, J.S. Kauer, T.A. Dickinson, D.R. Walt, Anal. Chem. 68 (1996) 2191–2202.
[4] M. Slama, C. Zaborosch, D. Wienke, F. Spener, Anal. Chem. 68 (1996) 3845–3850.
[5] O. Hernández, A.I. Jiménez, F. Jiménez, J.J. Arias, Anal. Chim. Acta 310 (1995) 53–61.
[6] B.W. Saunders, D.V. Thiel, A. Mackay-Sim, Analyst 120 (1995) 1013–1018.
[7] S.R. Johnson, J.M. Sutter, H.L. Engelhardt, P.C. Jurs, J. White, J.S. Kauer, T.A. Dickinson, D.R. Walt, Anal. Chem. 69 (1997) 4641–4648.
[8] J.M. Sutter, P.C. Jurs, Anal. Chem. 69 (1997) 856–862.
[9] C. Hervás, S. Ventura, J. Chem. Inf. Comput. Sci. 38 (1998) 1119–1124.
[10] K. Kato, Y. Kato, K. Takamatsu, T. Udaka, T. Nakahara, Y. Matsuura, K. Yoshikawa, Sens. Actuators, B 71 (2000) 192–196.
[11] V. Plegge, M. Slama, B. Süselbeck, D. Wienke, F. Spener, M. Knoll, M. Zaborosch, Anal. Chem. 72 (2000) 2937–2942.
[12] M.B. Seasholtz, B. Kowalski, Anal. Chim. Acta 277 (1993) 165–177.
[13] H. Martens, T. Naes, Multivariate Calibration, Wiley, New York, 1989.
[14] E. Richards, C. Bessant, S. Saini, Chemometr. Intell. Lab. Syst. 61 (2002) 35–49.
[15] D.J. Livingstone, D.T. Manallack, J. Med. Chem. 36 (1993) 65–70.
[16] D. Broadhurst, R. Goodacre, A. Jones, J.J. Rowland, D.B. Kell, Anal. Chim. Acta 348 (1997) 71–86.
[17] A.J. Miller (Ed.), Subset Selection in Regression, Chapman & Hall, London, 1990.
[18] S. Chatterjee, B. Price, Regression Analysis by Example, Wiley, New York, 1977.
[19] D.L. Selwood, D.J. Livingstone, J.C. Comley, B.A. O'Dowd, A.T. Hudson, P. Jackson, K.S. Jandu, V.S. Rose, J.N. Stables, J. Med. Chem. 33 (1990) 136–142.
[20] U. Härchner, J.H. Kalivas, Anal. Chim. Acta 311 (1995) 1–13.
[21] C.B. Lucasius, M.L.M. Bekers, G. Kateman, Anal. Chim. Acta 286 (1994) 135–153.
[22] R. Leardi, J. Chemom. 15 (2001) 559–569.
[23] D. Jouan-Rimbaud, D.L. Massart, R. Leardi, O.E. de Noord, Anal. Chem. 67 (1995) 4295–4301.
[24] R. Leardi, A.L. González, Chemometr. Intell. Lab. Syst. 41 (1998) 195–207.
[25] K. Hasegawa, Y. Miyashita, K. Funatsu, J. Chem. Inf. Comput. Sci. 37 (1997) 306–310.
[26] B.M. Smith, P.J. Gemperline, Anal. Chim. Acta 423 (2000) 167–177.
[27] S.S. So, M. Karplus, J. Med. Chem. 39 (1996) 5246–5256.
[28] H. Handels, T. Roß, J. Kreusch, H.H. Wolff, S.J. Pöppl, Artif. Intell. Med. 16 (1999) 283–297.
[29] H. Yoshida, R. Leardi, K. Funatsu, K. Varmuza, Anal. Chim. Acta 446 (2001) 485–494.
[30] B. Licic, N. Trinajstic, J. Chem. Inf. Comput. Sci. 39 (1999) 121–132.
[31] L. Xu, W.J. Zhang, Anal. Chim. Acta 446 (2001) 477–483.
[32] A.M. Kupinski, M.L. Giger, Med. Phys. 26 (1999) 2176–2182.
[33] M. Clark, R.D. Cramer, Quant. Struct.-Act. Relat. 12 (1993) 9–12.
[34] M.J. Arcos, M.C. Ortiz, B. Villahoz, L.A. Sarabia, Anal. Chim. Acta 339 (1997) 63–77.
[35] D. Jouan-Rimbaud, D.L. Massart, O.E. de Noord, Chemometr. Intell. Lab. Syst. 35 (1996) 213–220.
[36] R. Leardi, J. Chemom. 8 (1994) 65–79.
[37] R. Leardi, J. Chemom. 14 (2000) 643–655.
[38] S.S. So, M. Karplus, J. Med. Chem. 39 (1996) 1521–1530.
[39] M. Forina, G. Drava, R. Boggia, S. Lanteri, P. Conti, Anal. Chim. Acta 295 (1994) 109–118.
[40] B. Kieser, D. Pauluth, G. Gauglitz, Anal. Chim. Acta 434 (2001) 231–237.
[41] J.A. Riddick, W.B. Bunger, Organic Solvents, Wiley, New York, 1986.
[42] P.C. Jurs, G.A. Bakken, H.E. McClelland, Chem. Rev. 100 (2000) 2649–2678.
[43] I.V. Tetko, A.E.P. Villa, Neural Netw. 10 (1997) 1361–1374.
[44] J. Zupan, J. Gasteiger (Eds.), Neural Networks in Chemistry and Drug Design, 2nd ed., Wiley-VCH, Weinheim, 1999.
[45] D. Patterson (Ed.), Artificial Neural Networks, Theory and Applications, Prentice Hall, Upper Saddle River, 1996.
[46] J. Principe, N. Euliano, W. Lefebvre (Eds.), Neural and Adaptive Systems: Fundamentals through Simulations, Wiley, New York, 2000.
[47] S. Haykin (Ed.), Neural Networks: A Comprehensive Foundation, Prentice Hall, Upper Saddle River, 1999.
[48] M.F. Moller, Neural Netw. 6 (1993) 525–533.
[49] A. Weigend, in: M.C. Mozer, P. Smolensky, D.S. Touretzky, J.L. Elman, A.S. Weigend (Eds.), Proceedings of the 1993 Connectionist Models Summer School, Lawrence Erlbaum Associates, Hillsdale, NJ, 1994, pp. 335–342.
[50] W.S. Sarle, Proceedings of the 27th Symposium on the Interface of Computing Science and Statistics, 1995, pp. 352–360.
[51] M.L. Astion, P. Wilding, Arch. Pathol. Lab. Med. 116 (1992) 995–1001.
[52] L.A. Currie, Chemometr. Intell. Lab. Syst. 37 (1997) 151–181.