Sensors and Actuators B, 9 (1992) 9-15
Detection of vapours and odours from a multisensor array using pattern-recognition techniques
Part 2. Artificial neural networks

J. W. Gardner, E. L. Hines and H. C. Tang
Department of Engineering, University of Warwick, Coventry CV4 7AL (UK)
Received July 31, 1991; accepted in revised form December 10, 1991
Abstract

Considerable interest has recently arisen in the use of arrays of gas sensors together with an associated pattern-recognition technique to identify vapours and odours. The performance of the pattern-recognition technique depends upon the choice of parametric expression used to define the array output. At present, there is no generally agreed choice of this parameter for either individual sensors or arrays of sensors. In this paper, we have initially performed a parametric study on experimental data gathered from the response of an array of twelve tin oxide gas sensors to five alcohols and three beers. Five parametric expressions of sensor response are used to characterize the array output, namely, fractional conductance change, relative conductance, log of conductance change and normalized versions of the last two expressions. Secondly, we have applied the technique of artificial neural networks (ANNs) to our preprocessed data. The Rumelhart back-propagation technique is used to train all networks. We find that nearly all of our ANNs can correctly identify all the alcohols using our array of twelve tin oxide sensors, and so we use the total sum of squared network errors to determine their relative performance. It is found that the lowest network error occurs for the response parameter defined as the fractional change in conductance, with a value of 1.3 × 10⁻⁴, which is almost half that for the relative conductance. The normalization procedure is also found to improve network performance and so is worthwhile. The optimal network for our data-set is found to contain a single hidden layer of seven elements with a learning rate of 1.0 and momentum term of 0.7, rather than the values of 0.9 and 0.6 recommended by Rumelhart and McClelland, respectively. For this network, the largest output error is less than 0.1.
We find that this network outperforms principal-component and cluster analyses (discussed in Part 1) by identifying similar beer odours and offers considerable benefit in its ability to cope with non-linear and highly correlated data.
Introduction
Poor selectivity in metal oxide semiconductors and other gas-sensing materials has led to greater interest in the use of sensor arrays with pattern-recognition (PARC) techniques. Selectivity can, in principle, be improved by the use of an array of gas sensors with partially overlapping sensitivities and appropriate pattern-recognition techniques. The essential elements of such a system are the transduction properties of the individual sensors and the characteristics of the signal-processing and PARC system. The relationship between sensor output and gas concentration is generally non-linear, with saturation often occurring at high concentrations. Nevertheless, some success has been achieved in the use of a tin oxide gas-sensor array and linear pattern-recognition techniques by limiting the measurement to low concentrations where
the principle of superposition holds [1, 2]. Linear pattern-recognition techniques have also been used to identify volatiles from the response of other sensor arrays, such as piezoelectric, SAW, MOSFET and pellistor devices. A review of this work has been reported [3]. More recently, workers have studied the use of non-linear pattern-recognition techniques to model the non-linear transduction properties. Examples of the use of classical multivariate analysis (MVA) include the use of multiple non-linear regression [4], non-linear partial least squares [5] and partial model building (PMB) [6]. In our study, a different approach to this problem has been adopted. In Part 1 [7] we showed that the choice of sensor response parameter influences the nature of the transduction process and thus the performance of a subsequent linear classification technique, such as principal-component analysis (PCA) or cluster analysis (CA). A physical model
was constructed which suggested the use of the fractional conductance change as the sensor parameter in metal oxide thick-film devices. The non-linear concentration dependence could then be reduced through the use of a preprocessing algorithm that normalized an individual sensor output to that over the array. This process maps the n-dimensional response vector onto the surface of a unit hypersphere, and it substantially improved the performance of several linear classification techniques. In Part 2, we carry out a parametric study on the performance of artificial neural networks in classifying the response of a tin oxide sensor array to alcohols and beers. There are several disadvantages to the use of classical MVA, such as partial least squares (PLS) and PMB. First, these techniques only work well with transduction properties that are reasonably well behaved, i.e., quadratic or power law. Secondly, these techniques often assume a multivariate normal distribution that may not be appropriate and may lead to considerable errors. In contrast, ANNs can handle highly non-linear transduction properties and do not assume multivariate normal statistics. They can also tolerate a considerable change due to, say, noise or drift in a dependent variable with a lower degradation of performance than PCA or PLS [5].
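As a concrete illustration of the preprocessing step described above, the sketch below computes the fractional conductance change and the other response parameters listed later in Table 1 for a single measurement. The function name `preprocess`, the example conductance values, and our reading of the normalization denominators for the two normalized algorithms are illustrative assumptions, not material from the paper itself.

```python
import numpy as np

def preprocess(g_gas, g_air, method):
    """Compute the sensor response vector S for one of the five
    preprocessing algorithms of Table 1.  g_gas and g_air are 1-D
    arrays of sensor conductances in the vapour and in air."""
    g_gas = np.asarray(g_gas, dtype=float)
    g_air = np.asarray(g_air, dtype=float)
    if method == "relative":            # 1. G_gas / G_air
        return g_gas / g_air
    if method == "fractional":          # 2. (G_gas - G_air) / G_air
        return (g_gas - g_air) / g_air
    if method == "log":                 # 3. log|G_gas - G_air|
        return np.log(np.abs(g_gas - g_air))
    if method == "norm_fractional":     # 4. fractional change scaled to unit length
        s = (g_gas - g_air) / g_air
        return s / np.sqrt(np.sum(s ** 2))
    if method == "norm_log":            # 5. log change normalized over the array
        s = np.log(np.abs(g_gas - g_air))
        return s / np.sum(s)
    raise ValueError(method)

# Example: a 3-sensor array (illustrative conductances only)
g_air = np.array([1.0, 2.0, 4.0])
g_gas = np.array([3.0, 5.0, 6.0])
s4 = preprocess(g_gas, g_air, "norm_fractional")
print(round(float(np.sum(s4 ** 2)), 6))   # 1.0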
Fig. 1. Pattern recognition (unsupervised) in the Electronic Nose.
Method
Figure 1 shows the generalized arrangement of the pattern-recognition system. In its unsupervised mode, an unknown odour is presented to the sensor array and the chemical signal is converted to an electrical one via the interface electronics. The preprocessing and pattern recognition are carried out by an IBM-compatible PC, which identifies the unknown odour, having previously learnt standard odours.

Experimental data

The response of twelve commercial tin oxide gas sensors to five different alcohols and three beers was measured in the Warwick Electronic Nose [8]. An alcohol series of methanol, ethanol, 1-butanol, propanol and 2-methyl-1-butanol was chosen for the initial parametric study, because the response of the sensors should be broadly similar, with a low degree of selectivity. Measurements were made by injecting 0.5 μl of each alcohol or beer into a 20 l flask of air and recording the sensor outputs after 180 s. Eight sets of data were taken on the five alcohols and three beers. A full description of both the apparatus and experimental details is given elsewhere [9].

Data preprocessing

An investigation was carried out into the effect of five preprocessing algorithms that define the response of the sensor array; see Table 1. The first three do not use any array information, while the last two normalize the output to remove the concentration dependency. In addition, the fourth algorithm maps the n-dimensional response vector onto the surface of an n-dimensional hypersphere and was found to be advantageous when classifying by PCA or CA.

Network design and learning rules

ANNs are biologically inspired, with network configurations and algorithms developed from
TABLE 1. Preprocessing algorithms for the sensor i response, S_i. G_gas and G_air are the conductances in the alcohol/beer vapour and in air, respectively; n is the number of sensors.

Description                                      Formula
1. Relative conductance                          S_i = G_gas/G_air
2. Fractional conductance change                 S_i = (G_gas − G_air)/G_air
3. Log absolute conductance change               S_i = log|G_gas − G_air|
4. Normalized fractional conductance change      S_i = [(G_gas − G_air)/G_air] / [Σ_j ((G_gas − G_air)/G_air)_j²]^(1/2)
5. Normalized log absolute conductance change    S_i = log|G_gas − G_air| / Σ_j log|G_gas − G_air|_j
Fig. 2. Artificial neuron used in study.
studies of neural organization in the brain. In odour sensing, a comparison has been drawn by Gardner et al. between the mammalian olfactory system and a two-layer network design [9]. Recent studies by Nakamoto et al. [10] on a quartz-resonator array and by Sundgren et al. [11] on a MOSFET array have adopted similar networks, but have not shown these to be optimal. The artificial neuron used to design our networks is shown in Fig. 2. Each of the inputs, x_1 to x_12, is multiplied by an associated weight, w_1 to w_12, and applied to a summation block. Each weight has a value that reflects the strength of the synaptic link and thus the importance of a particular input (e.g., sensor). The output signal y_i of the summation block is further processed by a non-linear activation function F that 'squashes' the data. In our study, we have used a sigmoid activation function, so the neural output o_i is given by

o_i = 1/[1 + exp(−y_i)]   (1)

and

y_i = Σ_{j=1}^{n} w_j x_j   (2)

These artificial neurons were used to design several multilayer networks for testing. Figure 3 shows a generalized two-layer back-propagation network that has twelve inputs (sensors) in the input layer, an arbitrary number of artificial neurons in the hidden layer and five artificial neurons in the output layer (alcohols). The learning rules specify
the initial set of weights and determine how these weights change to improve the network performance. In all our parametric studies, we have used the back-propagation paradigm [12] and the delta learning rule. In the delta rule, the difference δ_i between target value t_i and actual neuronal output o_i is propagated back through the network to the input. The weightings of the processing elements are thus iteratively changed until the output errors are determined to be acceptable. A correction of Δ_i is made to the weightings, where

Δ_i(m + 1) = η δ_i x_i + α Δ_i(m)   (3)

w_i(m + 1) = w_i(m) + Δ_i(m + 1)   (4)
So the weights are adjusted by one term that is proportional to the error δ_i = (t_i − o_i), with a constant of proportionality, η, called the learning rate. The second term is used to improve the stability of the learning process and is called the momentum term, with α being the momentum coefficient. The performance of each network and learning rule was determined by calculating the total sum of the squared errors, E, from

E = Σ_j Σ_k (t_jk − o_jk)²   (5)

where j is the number of sensor response samples (up to eight) in the training and test set and k is the number of outputs in the output layer (up to five). After each iteration of the back-propagation process, the network error E should gradually fall and converge to an asymptotic value. For individual outputs, a value in the range [0.1, 0.2] is generally regarded as acceptable. A series of experiments has been carried out to investigate the performance of ANNs in classifying sensor array data. These involved variation of the preprocessing algorithms, learning rate η and momentum term α, number of processing elements and hidden layers, initial network weights and, finally, the size of the training set.
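To make the training procedure concrete, the following is a minimal numpy sketch of a two-layer back-propagation network trained with the delta rule and momentum of eqns (3) and (4), with the error of eqn (5) evaluated at the end. The data here are random stand-ins for the sensor responses; the array shapes, random seed and iteration count are our choices, not values from the paper, apart from the default settings η = 0.9 and α = 0.6.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

# Illustrative stand-in for the measured responses: 8 samples from a
# 12-sensor array, two target odour classes (random data, NOT the
# Warwick data-set).
X = rng.uniform(0.0, 1.0, size=(8, 12))
T = np.zeros((8, 2))
T[:4, 0] = 1.0
T[4:, 1] = 1.0

eta, alpha, n_hidden = 0.9, 0.6, 7          # default Rumelhart settings
W1 = rng.uniform(-0.01, 0.01, (12, n_hidden))
W2 = rng.uniform(-0.01, 0.01, (n_hidden, 2))
dW1 = np.zeros_like(W1)
dW2 = np.zeros_like(W2)

for _ in range(20000):
    H = sigmoid(X @ W1)                     # hidden-layer outputs
    O = sigmoid(H @ W2)                     # network outputs o_jk
    # delta rule: error times the sigmoid derivative o(1 - o)
    d_out = (T - O) * O * (1.0 - O)
    d_hid = (d_out @ W2.T) * H * (1.0 - H)
    # eqn (3): correction = eta * delta * input + alpha * previous correction
    dW2 = eta * (H.T @ d_out) + alpha * dW2
    dW1 = eta * (X.T @ d_hid) + alpha * dW1
    W2 += dW2                               # eqn (4)
    W1 += dW1

O = sigmoid(sigmoid(X @ W1) @ W2)
E = float(np.sum((T - O) ** 2))             # eqn (5): total sum of squared errors
print(E)
```

The momentum term α carries a fraction of the previous weight correction into the current one, which damps oscillations in the batch gradient and lets a relatively large η be used without instability.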
Results

Selection of preprocessing algorithm

Fig. 3. Two-layer Rumelhart back-propagation model.

TABLE 2. ANN results for the five preprocessing algorithms. Each entry gives the two network outputs (methanol/1-butanol elements) after 10 000 iterations.

Species     Relative      Fractional    Log absolute  Normalized    Log normalized
            conductance   conductance   conductance   conductance   conductance
Methanol    0.998/0.002   0.998/0.002   0.993/0.007   0.997/0.003   0.992/0.008
1-Butanol   0.011/0.989   0.011/0.989   0.007/0.993   0.003/0.997   0.005/0.995

The two-layer back-propagation network shown in Fig. 3 was used to test the effect of the choice of preprocessing algorithm on the alcohol data. A
preliminary study [9] showed that a two-layer network gives reasonable results. The recommended [12] learning parameters were used, i.e., a learning rate η of 0.9, a momentum coefficient α of 0.6 and initial random weights of ±0.01. Initially, only two of the alcohols (methanol and 1-butanol) were used for training, requiring only two output elements. The network was trained on seven of the eight samples, with the eighth used to test the network. Training took place for up to 10 000 iterations in 2000-iteration steps. Convergence was observed for all the preprocessing algorithms and all the alcohols were correctly identified. Table 2 shows the network output after 10 000 iterations for each of the five preprocessing algorithms. From Table 2 it appears that all the techniques do well, but a plot of the network error E, calculated from eqn. (5), against the number of iterations shows that the fractional conductance change is significantly better; see Fig. 4. In fact, outputs of 0.990/0.110 for methanol and 0.017/0.983 for 1-butanol are obtained after only 2000 iterations. This result agrees with the argument presented in Part 1 that the fractional conductance change should be used to define the sensor array response. Further training was carried out on all five alcohols and the network performance tested against each of the eight samples. This confirmed that the fractional conductance change was the optimal parameter, but its normalized value was a close second, an observation that was not apparent from Fig. 4. Consequently, all further experiments were carried out using the fractional conductance change to define the sensor array response.

Fig. 4. Effect of preprocessing parameter on the performance of the two-layer network.

Effects of learning rate and momentum term

The values of the learning rate η and the momentum coefficient α affect the performance of a network. It is generally recommended that the ratio of these parameters is 1.5, and so we tried several values of η and α with the ratio held constant (Table 3) on the entire data-set (five alcohols, seven elements in a single hidden layer). All of these networks could identify all five alcohols correctly. However, the default setting (Case 1) gave a markedly reduced network error of 0.15 at 20 000 iterations, compared to about 0.7 for the other two cases after more iterations.

TABLE 3. Choice of network parameters

Case   Learning rate (η)   Momentum (α)   Iterations (maximum)   Comment
1      0.90                0.60           20 000                 Default
2      0.45                0.30           30 000
3      0.09                0.06           100 000

Figures 5 and 6 show the variation of the network performance with learning rate η (α = 0.6) and momentum coefficient α (η = 0.9) after 20 000 to 100 000 iterations. Clearly, there is a minimum network error for η = 1.0 and α = 0.7, and we suggest these values should be adopted.

Fig. 5. Variation in network performance with learning rate η (α = 0.6).

Fig. 6. Variation in network performance with momentum coefficient α (η = 0.9).

Optimization of network design

Previous work showed that a single hidden layer with four processing elements provided a favourable network performance for a normalized conductance parameter [9]. We therefore decided to investigate the performance of other networks using the fractional conductance change. The number of processing elements was varied from 5 to 12 in a single hidden layer and from 4 to 8 in two hidden layers. The networks were trained on all five alcohols with up to 100 000 iterations. Figure 7 shows the network performance against the number of processing elements in a single hidden layer after 40 000, 60 000, 80 000 and 100 000 iterations. In each case the network error is minimal for seven processing elements. In comparison, the performance of a network with two hidden layers was considerably worse; typically the network error E was 20 times higher. No minimum was observed, but instead a gradual improvement with increasing number of processing elements, with an error of 10 for eight elements after 100 000 iterations. In conclusion, a network with a single hidden layer of seven rather than four elements improves performance by a factor of two to four, depending upon the number of iterations.

Fig. 7. Effect of number of artificial neurons in hidden layer on network performance.
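An architecture scan of this kind can be mimicked as follows: train the same style of network with different numbers of hidden elements and compare the final error E of eqn (5). The helper name `train_error`, the toy random data and the small iteration budget are illustrative assumptions; on the real data-set the minimum reported above occurred at seven elements.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def train_error(n_hidden, X, T, iters=2000, eta=0.9, alpha=0.6):
    """Train a 12 x n_hidden x 5 back-propagation network and return
    the final total sum of squared errors E (eqn (5))."""
    W1 = rng.uniform(-0.01, 0.01, (X.shape[1], n_hidden))
    W2 = rng.uniform(-0.01, 0.01, (n_hidden, T.shape[1]))
    dW1 = np.zeros_like(W1)
    dW2 = np.zeros_like(W2)
    for _ in range(iters):
        H = sigmoid(X @ W1)
        O = sigmoid(H @ W2)
        d_out = (T - O) * O * (1.0 - O)
        d_hid = (d_out @ W2.T) * H * (1.0 - H)
        dW2 = eta * (H.T @ d_out) + alpha * dW2   # eqn (3) with momentum
        dW1 = eta * (X.T @ d_hid) + alpha * dW1
        W2 += dW2                                  # eqn (4)
        W1 += dW1
    return float(np.sum((T - sigmoid(sigmoid(X @ W1) @ W2)) ** 2))

# Illustrative random stand-in for the five-alcohol data-set
X = rng.uniform(0.0, 1.0, (5, 12))
T = np.eye(5)                        # one target output element per alcohol
for n in range(5, 13):               # single hidden layer, 5 to 12 elements
    print(n, train_error(n, X, T))
```

On real data the curve of E against hidden-layer size is what reveals the optimum; here the loop simply reproduces the mechanics of the scan.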
Effect of initial randomized weights

The initial weightings of the networks were systematically varied to observe their influence upon the network performance. A two-layer back-propagation model was used with seven processing elements in the hidden layer (η = 0.09, α = 0.06). Six ranges of the initial weights ω were set at [−0.01, +0.01], [−0.1, +0.1], [−0.5, +0.5], [−1, +1], [−5, +5] and [−10, +10]. Figure 8 shows the network performance against the number of iterations for each range except [−10, +10], which failed to converge. In each of these cases all the alcohols were correctly identified. The largest network error
Fig. 8. Influence of initial weightings range, ω, upon performance of two-layer network.
Fig. 9. Effect of training-set size upon the performance of the two-layer network.
of 0.3348 was observed for [−5, +5] and the smallest value of 0.129 for [−1, +1]. This, in fact, is better than a network with four processing elements in a hidden layer for the default learning parameters. It seems that a small or large initial weighting range substantially reduces the ability of the network to learn the alcohol data, and so the default range is recommended.
Fig. 10. Identification of alcohol series using two-layer network (η = 0.9, α = 0.6, ω = [−1, +1]).
Size of training set

The size of the training set was also varied to observe the effect upon the performance of the two-layer back-propagation network with seven processing elements in the hidden layer (η = 0.09, α = 0.06, ω = [−1, +1]). Surprisingly, it was found that there was only a slight reduction in the performance of the network trained on between 50 and 100% of the sample set; see Fig. 9. Below 50%, unsatisfactory results were obtained. It was expected that the network error would rapidly increase with reduced training-set size, especially on this small training set.

Conclusions

We have shown that the choice of preprocessing algorithm influences the performance of an artificial neural network. The minimum network error was found with the use of the fractional conductance change (see Table 1) to define the sensor response parameter S_i. It was found that a two-layer back-propagation network with seven processing elements was best and was able to classify five alcohols from an array of twelve tin oxide sensors. From the results obtained here, we suggest that a
Fig. 11. Identification of beers (and lagers) using two-layer network (η = 0.9, α = 0.6, ω = [−1, +1]).
learning rate of 1.0 and a momentum coefficient of 0.7 are used rather than the default values, but that the initial weights lie in the recommended range [−1, +1]. Figure 10 shows the final network performance on the alcohols (20 000 iterations). Excellent identification is seen, with the largest output error δ less than 0.1. Figure 11 shows the same network applied to three beers. Although the result is not as good, the network was able to discriminate between two similar lagers and a beer. This last result shows that ANNs are, in certain cases, superior to multivariate techniques, such as PCA and CA, which were incapable of separating out the lagers. A similar conclusion has been drawn by other workers, who have used arrays of different types of sensor, such as piezoelectric [10] and MOSFET [11], on different chemical species. Thus the improvement in odour selectivity from the use of ANNs is not limited to a single type of chemical sensor, such as the semiconducting oxide, but appears to apply to solid-state chemical sensors in general. This conclusion is clearly encouraging to workers in the field of odour sensing, because the sensitivity of tin oxide sensors is typically at the ppm level, whereas key flavour compounds are often present at the ppb level. Nevertheless, there is the distinct possibility of increasing the sensitivity of semiconducting oxide sensors to the ppb level via novel signal-processing techniques [13]. However, although the extent and resolution of the functions that map the relationships between sensor space and classification space (i.e., odour space) still remain to be determined, it seems likely that a hybrid array is needed (one that contains several sensor types, e.g., tin oxide and polypyrrole), with the inherently fuzzy odour data mapped onto an adaptive neural network.
References

1  R. Müller and G. Horner, Chemosensors with pattern recognition, Siemens Forsch. Entwicklungsber., 15 (1986) 95-100.
2  H. V. Shurmer, J. W. Gardner and H. T. Chan, The application of discrimination techniques in alcohols and tobacco using tin oxide sensors, Sensors and Actuators, 18 (1989) 361-371.
3  J. W. Gardner and P. N. Bartlett, in P. T. Moseley, J. O. W. Norris and D. E. Williams (eds.), Techniques and Mechanisms in Gas Sensing, Adam Hilger, Bristol, 1991, pp. 347-380.
4  Chr. Hierold and R. Müller, Quantitative analysis of gas mixtures with non-selective gas sensors, Sensors and Actuators, 17 (1989) 587-592.
5  H. Sundgren, I. Lundström and F. Winquist, Evaluation of a multiple gas mixture with a simple MOSFET gas sensor array and pattern recognition, Sensors and Actuators B, 2 (1990) 115-123.
6  G. Horner and Chr. Hierold, Gas analysis by partial model building, Sensors and Actuators B, 2 (1990) 173-184.
7  J. W. Gardner, Detection of vapours and odours from a multisensor array using pattern recognition. Part 1. Principal component and cluster analysis, Sensors and Actuators B, 4 (1991) 109-116.
8  H. V. Shurmer, J. W. Gardner and P. Corcoran, Intelligent vapour discrimination using a composite 12-element sensor array, Sensors and Actuators B, 1 (1990) 256-260.
9  J. W. Gardner, E. L. Hines and M. Wilkinson, The application of artificial neural networks in an electronic nose, Meas. Sci. Technol., 1 (1990) 446-451.
10 T. Nakamoto, K. Fukunishi and T. Moriizumi, Identification capability of odor sensor using quartz-resonator array and neural-network pattern recognition, Sensors and Actuators B, 1 (1990) 473-476.
11 H. Sundgren, F. Winquist, I. Lukkari and I. Lundström, Artificial neural networks and gas sensor arrays: quantification of individual components in a gas mixture, Meas. Sci. Technol., 2 (1991) 464-469.
12 D. E. Rumelhart and J. L. McClelland, Parallel Distributed Processing, MIT Press, Boston, MA, 1986.
13 H. V. Shurmer, personal communication, University of Warwick, November 1991.
Biographies

Julian Gardner is a lecturer in the Department of Engineering at Warwick University, with special interests in sensor engineering.

Evor Hines is also a lecturer in the Engineering Department, with research interests in the application of artificial neural networks; current areas of work include ultrasonics, medicine and image analysis.

Thomas Tang has just completed an M.Sc. dissertation in information technology at Warwick University.