Analytica Chimica Acta 452 (2002) 255–264
Variable reduction on electronic tongue data Tom Artursson∗ , Per Spångeus, Martin Holmberg S-SENCE and Laboratory of Applied Physics, Linköping University, SE-58183 Linköping, Sweden Received 13 June 2001; accepted 4 October 2001
Abstract Reduction of the number of variables in data from a so-called electronic tongue contributes to simpler model calculations and less storage requirements. In this study, we have developed a model for this purpose. This model describes the response from the electrodes in the electronic tongue with two exponential functions plus a constant term, i(t) = k + kf e−tα + kc e−tβ , where t is the time. From the model, five parameters which describe the signal are extracted. These parameters can be used as inputs instead of the original signal to any multivariate algorithm. The results show that the variables obtained are at least as good as the original data to separate between different classes, even though the number of parameters has been reduced between 80 and 199 times. © 2002 Elsevier Science B.V. All rights reserved. Keywords: Electronic tongue; Variable reduction; Pre-processing; Curve-fitting; PCA
1. Introduction This paper describes a way to reduce the number of variables in data from a so-called electronic tongue. The principle behind this electronic tongue is to combine non-specific and overlapping signals from pulsed voltammetry used in wet-chemical processes with pattern recognition [1]. Since the size and shape of the current responses reflect the concentration and diffusion coefficients of electro-active and charged compounds in the solution, the whole response curve is used for interpretation. The current response basically stems from two sources, the Faradic and the charging currents. The Faradic current is a measure of the redox activity of the solution, while the charging current gives information about the concentration of ions and the adsorbed species on the electrode surface. Since ∗ Corresponding author. Tel.: +46-13-281357; fax: +46-13-288969. E-mail address:
[email protected] (T. Artursson).
the data contain a large number of variables in each measurement, there is a need to reduce this number. The current response is in this paper modeled with a second-order exponential model, which reduces the number of variables to five for each response compared to between 400 and 1000 in the original data. This model has been tested on two different datasets. The change of information content after variable reduction is studied through principal component analysis, Mahalanobis distance between clusters and classification using k-nearest-neighbor (KNN) clustering. 2. Theory The measurement principle used is pulsed voltammetry, where the current is measured as a function of the applied voltage, I = f (V ). When the voltage is changed there will be a current through the solution. This current has two origins, Faradic and charging current, both decaying with time. When the applied voltage is switched back again, similar but opposite
0003-2670/02/$ – see front matter © 2002 Elsevier Science B.V. All rights reserved. PII: S 0 0 0 3 - 2 6 7 0 ( 0 1 ) 0 1 4 4 8 - 9
256
T. Artursson et al. / Analytica Chimica Acta 452 (2002) 255–264
Fig. 1. Example of input (applied voltage) and output (measured current) from one electrode. The enlarged curve shows the current response from one pulse only.
reactions occur. An example of an applied voltage (input) and the current response (output) from one electrode is shown in Fig. 1. The very fast charging current is due to ion transport, where negatively charged ions move towards the positive electrode and vice versa. An electric double layer of charged particles is formed on the electrode surface as the charging current decays [2]. The Faradic current stems from redox reactions of electro-active compounds close to the electrode surface. This current is concentration limited and also decaying [2]. The current response, which is measured, is based on the sum of these two current components. A variable is in this context defined as one measurement in the time sequence of the current resulting from the voltage pulses. Each variable, is thus, dependent both on the applied voltage, the time since the voltage was applied to the electrode, what electrode material is used, and what chemical species that are present in the solution. The number of variables from the electronic tongue increases fast with the sampling rate, the number of electrodes, and the number of pulses. The total pattern, the voltammogram, from all electrodes is rather complex (see Fig. 2) and it is necessary to use multivariate methods to compare voltammograms from different samples. In purpose
to extract more information from the measurements and to reduce the number of variables, different methods have been tested. Holmin et al. tried hierarchical principal component analysis, wavelet transform, and physical modeling to reduce the number of variables [3]. One of the conclusions from that work was that the physical model, a model quite similar to a first-order exponential model, showed very promising results. This paper continues that work by developing the physical model to a second-order exponential model and also using new algorithms to calculate the parameters of the model. 2.1. The second-order exponential model Since all curves have similar shape, (decaying or increasing depending on the applied potential) (see Fig. 1) and based on two different currents, Faradic and charging current both decaying, it is natural to build a model with two exponential functions and a constant term (see Eq. (1)): i(t) = k + kf e−tα + kc e−tβ e−tα
(1)
where k + kf describes the Faradic current and kc e−tβ the charging current. This equation is used to
T. Artursson et al. / Analytica Chimica Acta 452 (2002) 255–264
257
Fig. 2. A voltammogram from three electrodes: platinum, iridium, and rhodium.
approximate the current response from one pulse with five parameters k, kf , α, kc and β. These parameters can be related to physical characteristics of the sample. The parameters k, kf and α are proportional to the concentration of the redox active species and their diffusion coefficients. The parameters kc and β describe the capacitance of the electric double layer and which is affected by molecules absorbing on the electrode surface [2]. If this model describes the response from the electronic tongue well, information is actually added since the curve is only allowed to have a certain shape which removes noise from the measurement system. The parameters from the model can be used as inputs to any multivariate algorithm instead of the very large voltammogram. Before the multivariate processing, we need to make the parameters non-interchangeable. To see the problem, let the exponentials in Eq. (1) change places. Now you get a different set of parameters—interchange kc and kf , and α and β in the parameter tuple, yet we have exactly the same current response. In conclusion, two different parameter tuples may yield exactly the same current response, thus, making the parameter tuples interchangeable. If the physical interpretation of Eq. (1) is dropped, the time constants could be sorted in order. We have chosen
another way to solve the interchangeable problem, namely by transforming the parameters into new non-interchangeable variables without physical interpretation. 2.2. Calculations of the parameters The parameters are calculated by solving a homogenous difference equation [4]. If the constant term, k in Eq. (1), is subtracted from both sides we have a new expression given by Eq. (2): F (t) = i(t) − k = k1 e−tα + k2 e−tβ
(2)
Since the computer samples the curve in discrete time, we transform Eq. (2) to discrete time representation Eq. (3): FF(q) = F (qT) = k1 (e−T α )q + k2 (e−T β )q
(3)
where q = 1, 2, . . . , b, b is the number of samples and T the sampling interval. Furthermore, let c = e−T α and d = e−T β and rewrite Eq. (3): FF(q) = k1 cq + k2 d q
(4)
Eq. (4) is the solution to the homogenous difference Eq. (5): FF(q + 2) + s1 FF(q + 1) + s2 FF(q) = 0
(5)
258
T. Artursson et al. / Analytica Chimica Acta 452 (2002) 255–264
Fig. 3. The solid circles show the part of the signal that is used for calculation of parameters.
Let FF(·) = i(·) − k and rewrite Eq. (5):
α=−
ln(c) T
(8)
where i(t) is the original signal, and is (t) is the simulated response calculated from the parameters above, and n is the number of variables. The Eqs. (2)–(10) are calculated on the transient part of the signal and the end of the signal (see Fig. 3). This is done in order to avoid too strong influence from the stationary part of the signal on the calculated parameters. If α and β are equal, a first-order exponential model is enough and the response curve will, therefore, be modeled with a first-order exponential function and a constant term. Transforming the parameters solves the problem with interchangeable parameters. Two new variables, v1 and v2 , are used instead of α and β. The parameters v1 and v2 are the coefficients to a second-order polynomial with the roots α and β (see Eq. (11)):
β=−
ln(d) T
(9)
(y − α)(y − β) = y 2 − (α + β)y + αβ
s1 i(q + 1) + s2 i(q) − (1 + s1 + s2 )k = −i(q + 2) (6) This equation system can be solved through least square methods. The terms c and d are found by calculating the roots to Eq. (7): x 2 + s1 x + s 2 = 0
(7)
Eq. (7) is the characteristic polynomial of the difference equation in Eq. (5). Calculate now k1 and k2 by solving the least square problem of Eq. (4). The terms α and β can now easily be calculated by Eqs. (8) and (9):
The parameters k, k1 , k2 , α, and β are then optimized so the agreement between the original and the simulated curve becomes as good as possible. The optimization routine finds the minimum of the error function described by Eq. (10): 1 e= (10) (i(t) − is (t))2 n
= y 2 − v1 y + v 2
(11)
The coefficients v1 and v2 are unique, i.e. if you interchange α and β you still get the same polynomial coefficients v1 and v2 . Instead of using k1 and k2 , we use the non interchangeable variables top = is (0) − k = k1 +k2 and slope = is (0) = −αk1 −βk2 . The parameters describing the double exponential curve and then
T. Artursson et al. / Analytica Chimica Acta 452 (2002) 255–264
used in the multivariate algorithms are top, slope, v1 , v2 and k. By experience we know that the measurements of the very first points of the response curve sometimes are uncertain. These uncertain measurements were detected and left out of the calculation. The measurements which are left out can then be reconstructed by extrapolating the model.
3. Experiments Two different datasets will be used for illustration. They differ in measuring objects, the number of variables and the instrument used. The first dataset, Chemicals, is a small dataset (10 objects×800 variables) and contains only 10 measurements on three different solutions (0.1 M LiCl, 0.1 M KCl and 0.1 M NaOH). Three samples each on LiCl and KCl and four samples on NaOH. These measurements consist only of two response pulses from a platinum electrode with a sampling frequency of 2 kHz. The second dataset, Mixed, consists of measurements on totally different complex solutions such as orange juice, process water from a pulp factory, tea, and a reference solution (K4 Fe[CN6 ]). This dataset contains 112 measurements equally divided among the classes. A series of 12 potential pulses was applied to each electrode (platinum, iridium and rhodium) with changing amplitude (see Fig. 1). The current response sampled with 2 kHz ended up in 71640 variables for each sample. Dataset Mixed contains more noise than dataset Chemicals since the former measurements were made with a magnetic stirrer. The stirrer perturbs the original measurements, which is very obvious when the same measurements are made with or without the stirrer. The influence of noise on the calculated parameters was studied by adding normally distributed random noise to an artificial signal. Twenty different noise signals with the same standard deviation were added to 20 identical artificial signals. The mean and the relative standard deviation for the model parameters were calculated. The relative standard deviation is calculated as the ratio between the standard deviation of the parameters and their true values, and is given in percent. The mean and the relative standard deviation was calculated
259
six times each time with increasing magnitude of noise. 3.1. Comparison of results To compare the results, different tools such as principal component analysis (PCA), cluster distance measure, and the classification rate before and after variable reduction have been used. The goal of PCA is often to get an overview of the data, i.e. tendencies for grouping of samples and variables [5,6]. PCA decomposes the data matrix (X) into scores (T ) and loadings (P). The first component t1 × p1T explains the largest possible variation in the data. The first component is then subtracted from the data, E = X−t1 ×p1T . The second component explains the largest variance in the residual matrix E and is orthogonal to the first component. The input to the PCA is scaled to zero mean and unit variance. The distance measure that has been used is the Mahalanobis distance [7]. This distance measure takes into account the variations within and between the different data classes. This distance measure has been calculated in the two-dimensional principal component space to avoid problems with singular or nearly singular variance–covariance matrices. A supervised pattern recognition algorithm was used for classification, namely the k-nearest-neighbor (KNN) clustering. The KNN classifies the samples based on the (Euclidean) distance to its nearest-neighbors. A classification of an observation is made as follows. The distance from the point to all training observations is calculated. Theses distances are sorted and a number of nearest-neighbors points (three points in this paper), with the smallest distance are kept. Among those observations a majority vote is made for each new measurement, i.e. the class having most points (≥2 in this paper) as nearest-neighbors is declared as the winner, and the new observation is said to belong to that class. The KNN analysis has been calculated in the two-dimensional principal component space. The classification was made only on dataset Mixed, since this was the only dataset that contained a large enough number of measurements from each class. The PCA and KNN were trained on half of the dataset, randomly chosen, and the other half was used as test set.
260
T. Artursson et al. / Analytica Chimica Acta 452 (2002) 255–264
4. Results and discussion A simulation of the response was made with help of the calculated parameters from the second-order exponential model. An example of the agreement between the original curve and the simulated is shown in Fig. 4, the original measurement was from dataset Chemicals. The original and simulated curves are really tight together and it is hard to see any difference. If the last part of the original and simulated signal is enlarged, see the zoomed part in Fig. 4, a recurrent pattern comes to light. The frequency of this pattern is 50 Hz, which corresponds to the power supply ac frequency. It is most likely that this recurrent pattern, which our model does not capture, does not contain any useful information about the samples. This is an example where the modeling yields noise reduction. Furthermore, it appears from Fig. 4 that our simulation gives us a final value which is a bit too low. This depends on the fact that only part of the stationary curve is used for calculation of the parameters. Fig. 5a and b shows the two score plots from dataset Chemicals before and after variable reduction. In the original data, there is a big jump in each of the three clusters (see Fig. 5a). Due to this jump, there is a overlap between LiCl and KCl. After the modeling, the number of variables has been reduced by a factor of 80. Even though the number of variables is reduced, the clusters
Table 1 The Mahalanobis distance between the three classes in dataset Chemicals Samples
Original data
Parameterised data
LiCl–NaOH LiCl–KCl NaOH–KCl
4.9 2.5 5.4
20.5 9.3 22.6
are easily separated after the treatment. The distances between the three clusters increased after the reduction (see Table 1). This indicates that the information that separates the clusters is more apparent after the reduction. A PCA was calculated on the dataset Mixed before variable reduction, the score plot shows the four classes juice, reference, process water and tea (see Fig. 6a). Two of the classes (reference and tea) overlap. The variance within the group reference samples is large and influences the first principal component very much. After the variable reduction, the clusters are on the whole better separated and there is no overlap between the reference and tea samples (see Fig. 6b). Even though the number of variables has been reduced from 71640 to 360, a reduction by a factor of 199, the important information that separates the different clusters are still there. The Mahalanobis distances between some of the clusters have increased, while it has
Fig. 4. Original response vs. simulated curve. The plot in the right corner of the figure is an enlargement of the last part of the original and simulated curve.
T. Artursson et al. / Analytica Chimica Acta 452 (2002) 255–264
261
Fig. 5. The two score plots from dataset Chemicals before (a) and after (b) variable reduction.
decreased for other clusters after variable reduction (see Table 2). But for the classes which were tight together in original data, the distance increased after the variable reduction (process water–tea, and reference–tea). This is reflected in the prediction rate, see the confusion matrix in Table 3. The classification error decreased from 8.6 to 0% going from the original data to the parameterised data. This maybe due
to the fact that information is provided by the model since the signal is forced to take a certain shape, which reduces the amount of noise. The samples in the reference group that stand out, samples which are encircled in the score plot of Fig. 6a, are measurements done after polishing the electrode. This is a pre-treatment that removes the oxide layer on the electrode surface. After the polishing,
262
T. Artursson et al. / Analytica Chimica Acta 452 (2002) 255–264
Fig. 6. Score plot from dataset Mixed: (a) original data (112 × 71640); and (b) reduced data (112 × 360). The samples, which are encircled, are the samples measured after the polish pre-treatment.
Table 2 The Mahalanobis distance between the four classes in dataset Mixed Samples
Original data
Parameterised data
Juice–process water Juice–reference Juice–tea Process water–reference Process water–tea Reference–tea
15.4 5.3 11.0 8.6 5.9 1.6
12.6 5.7 19.0 4.7 4.1 6.6
the electrode surface is refreshed and the redox reactions now easily take place. As written above, the response is based on two currents (Faradic and charging). Since the Faradic current is influenced by this Table 3 A confusion matrix of a KNN classification using the original data
Juice Process water Reference Tea
Juice
Process water
Reference
Tea
14 0 0 0
0 12 0 0
0 0 16 3
0 0 3 8
T. Artursson et al. / Analytica Chimica Acta 452 (2002) 255–264
263
Table 4 Average values of the parameters calculated on perturbed artificial signals with six different levels of noisea Noise level 1.3 2.7 4.0 5.2 1.3 2.5
× × × × × ×
10−4 10−4 10−4 10−4 10−3 10−3
Top 1.9 1.9 1.9 1.9 2.0 2.1
v1
Slope −6.7 −6.8 −6.9 −7.0 −8.3 −1.1
× × × × × ×
10−2 10−2 10−2 10−2 10−2 10−1
v2 × × × × × ×
6.0 6.1 6.2 6.3 7.4 9.6
10−2
5.0 5.1 5.2 5.4 7.3 1.1
10−2 10−2 10−2 10−2 10−2
k × × × × × ×
10−4
1.0 1.0 1.0 1.1 1.5 2.1
10−4 10−4 10−4 10−4 10−3
× × × × × ×
10−2 10−2 10−2 10−2 10−2 10−2
a The noise level is calculated as the ratio between the standard deviation of the noise in the stationary phase and the top value of the current response.
pre-treatment, the samples (responses) with a large Faradic part are more affected of this pre-treatment than others. This is why the reference samples are more influenced of the polish pre-treatment compared to the other groups, where the Faradic current is smaller. The polishing pre-treatment is also visible for the other classes, but not as clear as for the reference samples (see Fig. 6a). After the variable reduction, the reference samples do not influence the first principal component as much as for the original data. This depends on the fact that the number of variables that are affected by the polishing pre-treatment have decreased. This is due to the fact that the number of variables that relates to the stationary phase has decreased after the variable reduction. Still the polishing pre-treatment is visible in the score plot on the parameterised data (see Fig. 6b). The influence of noise on the calculated parameters was studied by calculating the mean and relative standard deviation of the parameters for different noise levels. The noise level is here defined as the ratio between the standard deviation of the current response in the stationary phase and the top value of the current response. For most of the current responses in the dataset Mixed, the noise level is below 1.3 × 10−3 . At this noise level, there is no strong systematic variations in the parameters caused by noise (see Table 4). On the other hand, it is obvious from Table 5 that the relative standard deviations of the parameters increase even with a small level of noise. These deviations are small compared to other error sources in the measurements (e.g. reproducibility and stability). To be able to capture the very rapid course of events, i.e. estimate the time constants well, with the model a high sampling frequency is needed. This is especially
Table 5 The relative standard deviations of the model parameters in percent of the true parameter values calculated from artificial signals perturbed with noisea Noise level 1.3 2.7 4.0 5.2 1.3 2.5
× × × × × × a
10−4
10−4 10−4 10−4 10−3 10−3
Top (%)
Slope (%)
v1 (%)
v2 (%)
k (%)
0.14 0.22 0.22 0.33 0.85 1.02
0.52 0.85 0.86 1.30 3.91 5.31
0.53 0.84 0.86 1.28 3.88 4.56
0.94 1.46 1.52 2.30 7.65 10.33
0.70 1.00 1.40 2.05 8.09 12.54
The noise level is defined in the same way as in Table 4.
important if the response is small and decays fast. We have seen that the parameters which have a direct relationship to the time constants (slope, v1 and v2 ) are more sensitive to the sampling frequency than the other two parameters (top and k). As a conclusion, the sampling frequency should be higher than about 0.3 kHz for our data.
5. Conclusions The electronic tongue generates very large datasets with a lot of variables. The result presented here show that the number of variables can be reduced without loosing information. The model, i(t) = k + kf e−tα + kc e−tβ , is based on the knowledge of how the electronic tongue response is generated. The terms, k + kf e−tα and kc e−tβ are approximations of the Faradic current and charging current, respectively. A transformation of the parameters k, kf , α, kc and β can be used as inputs to any multivariate algorithm. The quality of the variable reduction has been studied on two
264
T. Artursson et al. / Analytica Chimica Acta 452 (2002) 255–264
different datasets by comparing the original and simulated signal, score plots before and after the reduction, Mahalanobis distances between different classes before and after the reduction and the prediction rate for one of the datasets. Even though the number of variables has been reduced between 80 and 199 times, the important information which separates the clusters, can still be extracted from the parameters. Close reconstruction of the original signal is also possible using these parameters. In fact, the prediction rate increased after the parameterisation. This maybe due to the fact that information is supplied through the model, since the response is limited to a certain shape. The parameterised datasets are smaller and easier to work with, but the model requires a high sampling frequency and not too much noise in the data in order to calculate stable parameters. The method outlined above is of course not restricted to electronic tongue data. It can be applied to any signal that can be described with two exponential terms and a constant.
Acknowledgements Our research on chemical sensors is supported by grants from the Swedish Research Council for Engineering Science and the National Swedish Board
for Industrial and Technical Development (NUTEK). Swedish Industry, NUTEK and Linköping University finance Swedish Sensor Centre (S-SENCE). One of the authors is sponsored by the graduate school Forum Scientium (funded by the Council for Strategic Research in Sweden) in Lin Köping. The Forum Scientium contribution is gratefully acknowledged. Susanne Holmin is acknowledged for providing the datasets and valuable information. Fredrik Winquist and Fredrik Björefors are also acknowledged for valuable discussions on electrochemistry. References [1] F. Winquist, P. Wide, I. Lundström, Anal. Chim. Acta 357 (1997) 21. [2] J. Wang, Analytical electrochemistry, Wiley-VCH, New York, 1994. [3] S. Holmin, P. Spångeus, C. Krantz-Rulcker, F. Winquist, Sens. Actuators B 76 (2001) 454. [4] J. Proakis, D. Manolakis, Digital Signal Processing, Macmillan, New York, 1992. [5] S. Wold, K. Esbensen, P. Geladi, Chem. Intell. Lab. Syst. 2 (1987) 37. [6] A. Höskuldsson, Prediction Methods in Science and Technology. Part 1. Basic Theory, Thor Publishing, Denmark, 1996. [7] R.D. Maesschalck, D. Jouan-Rimbaud, D.L. Massart, Chem. Intell. Lab. Syst. 50 (2000) 1.