Sensors and Actuators B 80 (2001) 243-254
Artificial intelligence methods for selection of an optimized sensor array for identification of volatile organic compounds

Robi Polikar a,*, Ruth Shinar b, Lalita Udpa c, Marc D. Porter b

a Department of Electrical and Computer Engineering, Rowan University, 136 Rowan Hall, Glassboro, NJ 08028, USA
b Ames Laboratory, USDOE and Department of Chemistry, Microanalytical Instrumentation Center, Iowa State University, Ames, IA 50011, USA
c Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA

Accepted 19 July 2001
Abstract

We have investigated two artificial intelligence (AI)-based approaches for the optimum selection of a sensor array for the identification of volatile organic compounds (VOCs). The array consists of quartz crystal microbalances (QCMs), each coated with a different polymeric material. The first approach uses a decision tree classification algorithm to determine the minimum number of features that are required to classify the training data correctly. The second approach employs the hill-climb search algorithm to search the feature space for the optimal minimum feature set that maximizes the performance of a neural network classifier. We also examined the value of simple statistical procedures that could be integrated into the search algorithm in order to reduce computation time. The strengths and limitations of each approach are discussed. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Optimum coating selection; Decision tree; Wrapper search; Neural network classification
1. Introduction

Piezoelectric chemical sensors, such as surface acoustic wave (SAW) devices and quartz crystal microbalances (QCMs), have been widely used for the detection and identification of volatile organic compounds (VOCs) [1-8]. In general, an array of polymer-coated sensors is used for detection, where the change in the resonant frequency of each sensor as a function of VOC concentration constitutes a response pattern. Over the past 15 years, a significant amount of work has been done on developing pattern recognition algorithms, using principal component analysis, neural networks and fuzzy inference systems, for various gas sensing problems [9-13]. However, these methods can only be successful if the features (polymer-coated sensor responses) used to identify the VOCs allow an efficient separation of patterns in the feature space. The challenge is then to identify a subset of polymer coatings such that a classification algorithm provides optimum classification performance. Selection of coatings is usually based on
Portions of this work were completed while the corresponding author was with the Department of Electrical and Computer Engineering of Iowa State University.
* Corresponding author. Tel.: 1-856-256-5372; fax: 1-856-256-5241.
E-mail address: [email protected] (R. Polikar).
various chemical properties (e.g. solubility parameters [14-16]) of the VOCs and the compatibility of each with a range of compositionally different polymer coatings. Some researchers have also tried using various signal processing metrics, such as the Euclidean distance [17] or principal component analysis [18], to obtain the optimum set of coatings for specific applications. Since there may be a large number of polymers suitable for the identification of a VOC, the selection of the smallest set giving the best performance is an ill-defined problem, because testing every possible combination is usually not manageable. Furthermore, many researchers have investigated the relationship between the number of sensors and the performance of the array [19], and found that using as many sensors as possible does not necessarily improve the performance of a classification system. In fact, Park and Zellers [20] and Park et al. [21], through a careful analysis of the required number of sensors versus the number of analytes, and Osbourn et al. [22], through an examination of the effects of increasing the size of the sensor array, have shown that the performance of classifiers for VOC identification typically degrades as the number of sensors increases beyond a certain point. Therefore, an efficient algorithm for the optimum selection of sensors is of paramount importance. For small pools of potential coatings, an exhaustive search may be manageable. For example, Zellers and coworkers
used extended disjoint principal components regression analysis to conduct an exhaustive search on a 10-polymer dataset and identified four polymers as the requisite array elements for optimum identification of six VOCs [20,21,23]. The use of four polymers out of 10 amounts to 210 possible combinations, which is manageable for an exhaustive search. However, as the number of possible coatings increases, an exhaustive search becomes computationally prohibitive. Adding only two more coatings to the pool, for instance, requires evaluating 495 possible four-coating combinations, and the more practical problem of choosing 6 out of 20 coatings requires testing 38,760 different combinations of coatings.

In efforts to reduce the number of candidate coatings from a larger pool of potentially useful coatings, various pattern recognition (PR) algorithms have been developed. Principal component analysis (PCA), a dimensionality reduction technique, has been one of the most popular of such techniques. Carey et al. used PCA [24] to reduce the feature vector obtained from 27 sensors to fewer than 8 for an identification problem consisting of 14 VOCs. Avila et al. introduced correspondence analysis as an alternative to PCA [25] and showed that it had computational advantages as well as improved performance over PCA on the same dataset used by Carey et al. [24]. PCA has been employed not only in the gas sensor area, but also in many other areas where data analysis for dimensionality reduction is important. With PCA, the strategy is to find a set of n orthogonal vectors along which the m-dimensional data have the largest variance, such that n < m. PCA is, therefore, a dimensionality reduction procedure rather than a feature selection procedure: the principal components are computed as the projection of the data onto a set of orthogonal vectors, the eigenvectors of the covariance matrix of the data, and this covariance matrix may, and frequently does, contain significant information obtained from every sensor. Consequently, PCA does not reduce the number of sensors, nor does it identify the optimum set of coatings.

Recently, Osbourn and Martinez [26], and Ricco et al. [27] introduced visual empirical region of influence pattern recognition (VERI-PR) for the identification of VOCs. Various shortcomings of neural network and statistical techniques for pattern recognition have also been addressed by these authors. For example, VERI does not require or assume any specific probability distributions to be known, and it does not require a large number of parameters to be adjusted by the user. Furthermore, VERI is a versatile algorithm capable not only of pattern recognition, but also of optimum feature selection. The optimal feature selection capabilities of VERI on a VOC identification problem have been reported to be very promising [28]. However, the feature selection module is based on an exhaustive search, called leave-one-out, and the authors therefore recommended its use for pools of fewer than 20 coatings.

Selection of optimum coatings for gas sensing is actually a subset of the more general need for choosing an optimum
subset of features for a PR problem. Feature subset selection is commonly encountered in pattern analysis, machine learning and artificial intelligence [29,30]. Many studies have shown that most classification algorithms perform best when the feature space includes only the most relevant information required for identification [31-33]. While having relevant features is key to the successful performance of any classification algorithm, the definition of a relevant feature has been extensively debated. Some studies suggest algorithms that are preprocessing in nature. These preprocessing algorithms can be viewed as filtering the data, and thus eliminating irrelevant features. Statistical measures, such as properties of the probability distribution function of the data, are often employed for filtering out the irrelevant features, and consequently these algorithms are referred to as filter approaches [34-36]. Filter algorithms, however, are independent of the classification algorithm that will subsequently process the data. Other researchers suggest that the relevant features for any set of data depend on the classification algorithm [30,37]: a good set of features for a neural network, for example, may not be as effective for a decision tree. Such studies indicate that a feature selection algorithm must be based on, or wrapped around, the classification algorithm [37]. Feature selection algorithms that use such an approach are known as wrapper approaches. Most wrapper approaches, however, suffer from large computational time and space complexity, particularly for data sets with a large number of features. Due to the limited number of possible coatings typically used in the gas sensing area, the computational complexity of wrapper approaches does not constitute a major drawback here.

We have therefore analyzed two techniques based on the wrapper approach, and we report herein on the performances of these two artificial intelligence (AI) approaches for selecting the optimum set of coatings for VOC identification. The first approach is based on Quinlan's iterative dichotomizer 3 (ID3) algorithm [31], a decision tree algorithm that integrates classification and feature selection. The second approach is a modified version of the wrapper model of Kohavi and John [37], which uses a hill-climb search algorithm to search the feature space for an optimum set of features. The original wrapper model combines the hill-climb search with ID3; we have explored integrating the hill-climb search with a multilayer perceptron (MLP) neural network. We have also investigated the value of using a different starting point for the search, based on the variance of the data, to accelerate the convergence of the hill-climb search. This scheme allowed us to significantly reduce the computational complexity of the search. We emphasize that our goal was to develop a systematic and efficient procedure for determining the optimum coatings, and we note that the best set of coatings for any application depends on the analytes to be detected and identified. The analytes and coatings used in this study were
selected from those that have been reported extensively in the literature.

2. Experimental

2.1. Experimental system and sample preparation

The sensor responses used in this study were from 9 MHz QCMs purchased from Standard Crystals that were subsequently coated with several different polymer films. Cr/Au contacts were deposited onto the quartz by means of a resistive heating evaporator. The films were cast on the QCMs from dilute solutions of the polymers, typically 20 ml of 0.3-3% (w/w), spinning at 2000-5000 rpm. The sensors were then dried at 65 °C for 24 h. The thickness of each coating was calculated from the frequency shift detected after coating application [38]. The coated QCMs were subsequently mounted in a sealed test fixture, which could house up to six sensors. An array of 12 crystals, coated with the following polymers, was used to detect and identify 12 VOCs. The polymers were Apiezon L (APZ), poly(isobutylene) (PIB), poly[di(ethylene glycol) adipate] (DEGA), sol-gel (SG), poly[bis(cyanoallyl)polysiloxane] (OV275), poly(dimethylsiloxane) (PDS), poly(diphenoxyphosphazene) (PDPP), polychloroprene (PCP), poly[dimethylsiloxane-co-methyl(3-hydroxypropyl)siloxane]-graft-poly(ethylene glycol) 3-aminopropyl ether (PDS-CO), hydroxy-terminated poly(dimethylsiloxane) (PDS-OH), polystyrene beads (PSB), and graphite (GRAP).

Fig. 1 depicts a schematic of the experimental setup. The vapor generation system consisted of a gas stream module and a three-way switchable valve. The gas stream module included a reference module, dry nitrogen flowing at 200 sccm that served to establish the baseline response, and an analyte module. The analyte vapor was generated by means of calibrated mass flow controllers (Tylan General FC-280 AV) and conventional gas bubblers containing the
analytes. The bubblers were composed of two connected compartments. The carrier gas bubbled through the solution in the first compartment, supplying the vapor, whereas the second analyte-containing compartment served as a headspace equilibrator. This process resulted in a gas stream saturated with the analyte vapor. The saturated analyte vapor was further diluted with nitrogen to obtain the desired concentrations at a total flow rate of 200 sccm. The sensors were exposed periodically to the reference gas or to the diluted analyte vapor stream by means of the computer-controlled three-way valve and an MKS multi-gas controller (model 147B) that controlled the mass flow controllers. Polyethylene and Teflon® tubing together with stainless steel or brass valves were used, but only Teflon® and stainless steel were exposed to the analytes. All experiments were performed at ambient temperature.

To evaluate sensor performance, the resonant frequency of the sensors was monitored before and following exposure to the VOCs. Repeated measurements indicated reproducibility of the collected data, with small variations of 2-4%. The variability, due to small temperature fluctuations, was within experimental error. The frequency response was monitored using an HP8753C network analyzer, interfaced to an IEEE 488 card installed in a PC running HP8516A resonator-measurement software. Real-time data were displayed and saved. The data were then analyzed to obtain frequency shifts (relative to the baseline) versus VOC concentration. Typical noise levels (standard deviations of the baseline) for the QCMs were around 0.01 Hz. Further details regarding the experimental setup can be found in [39,40].

2.2. Data collection and handling

The 12 VOCs used were acetone (AC), methyl ethyl ketone (MEK), ethanol (ET), methanol (ME), 1,2-dichloroethane (DCA), acetonitrile (ACN), 1,1,1-trichloroethane (TCA), trichloroethylene (TCE), hexane (HX), octane (OC), toluene (TL) and xylene (XL). These VOCs were exposed to
Fig. 1. Experimental setup.
Fig. 2. A typical signature pattern for toluene.
the sensor array at seven different concentration levels, namely 70, 140, 210, 250, 300, 350 and 700 ppm, yielding the 84 responses that constituted the experimental database. The response pattern at each concentration was composed of 12 features, each representing the resonant frequency change of one of the 12 coated sensors in response to the VOC in question. These responses were considered signature patterns of their respective VOCs. The signature pattern for toluene at 250 ppm is shown in Fig. 2 as a typical response pattern.

We note that the sensor responses to any given VOC were notably linear with concentration within the 50-1500 ppm range. Therefore, the available data, with responses to 12 VOCs at seven concentrations, were interpolated to include 15 concentrations. This interpolation was achieved through linear regression of the experimental data; linear regression coefficients with r² > 0.998 were obtained for each sensor and each VOC. The interpolated data set allowed us to obtain estimated frequency responses of the sensors at 70, 100, 150, 200, 250, 300, 350, 400, 450, 500, 600, 700, 800, 900, 1000, and 1500 ppm, resulting in 180 data instances. The interpolated data, however, were not used for any of the training algorithms; they were used only for evaluating neural network performance at intermediate concentration levels. We also note that the change in frequency is linearly dependent on the thickness of the coating: the thicker the coating, the larger the frequency response. However, no attempt was made to normalize the responses with respect to coating thickness, in part to test the generalization capabilities of the neural network classifiers.

3. Results

In the following sections, we describe and evaluate two methods to select a subset of the 12 coatings that can
uniquely identify all 12 VOCs. The first method is C5.0, the newest version of Quinlan's C4.5 decision tree algorithm, which is based on its well-known predecessor ID3 [31]. C5.0 is a classification algorithm with a built-in feature selector that automatically chooses the features that maximize its own performance. The second method expands the recent work of Kohavi and John [37], and is based on an organized search of the feature space for the optimum feature subset. We note that the frequency response patterns (signature patterns) for the various VOCs will be referred to as feature vectors; the response of each individual sensor (coated with a different polymer) constitutes an individual feature. We therefore use the terms feature and sensor response interchangeably.

3.1. Method I: ID3/C4.5/C5.0 family of decision trees

3.1.1. Generating decision trees

Decision trees are compact forms of displaying a list of IF-THEN rules in a hierarchical order. These rules are used to make classification decisions about the input pattern. Decision trees are among the most commonly used machine learning algorithms for classification applications, ID3 being one of the most popular [30,31]. ID3 classifies test data by constructing a decision tree from training data; the algorithm determines the features necessary for the correct classification of the training data. Construction of the tree starts by identifying the most important feature, based on the information content of each feature. For a training data set, T, the probability, P, that a given response pattern belongs to a specific class C_i is

P = \frac{\mathrm{freq}(C_i, T)}{|T|}    (1)
where freq(C_i, T) is the number of patterns in T that belong to class C_i, and |T| is the total number of patterns in the training data set. The information I associated with the probability P is defined as

I = -\log_2\left(\frac{\mathrm{freq}(C_i, T)}{|T|}\right)    (2)

and is measured in units of bits. Note that, by definition, P must lie in the [0, 1] interval; the minus sign in Eq. (2) assures that I is a positive quantity. The average amount of information, info(T) in bits, needed to identify the class of any pattern in the training set T is then defined as the sum over all classes, weighted by their frequency of occurrence:

\mathrm{info}(T) = -\sum_{i=1}^{N} \frac{\mathrm{freq}(C_i, T)}{|T|} \log_2\left(\frac{\mathrm{freq}(C_i, T)}{|T|}\right)    (3)

The information needed to identify the class of any pattern after the training data set has been partitioned into K subsets based on the value of feature X is given by

\mathrm{info}_X(T) = \sum_{i=1}^{K} \frac{|T_i|}{|T|}\, \mathrm{info}(T_i)    (4)

where |T_i| is the number of patterns in partition i of the training set T. Eq. (3) is generally referred to as the entropy before partitioning, and Eq. (4) as the entropy after partitioning of the training set T. The original ID3 algorithm uses a criterion called gain to determine the additional information obtained by partitioning T using the feature X:

\mathrm{gain}(X) = \mathrm{info}(T) - \mathrm{info}_X(T)    (5)
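As a concrete illustration of Eqs. (3)-(5), the following minimal Python sketch (not the authors' code; the toy labels and partitions are invented purely for illustration) computes the entropy of a labeled training set and the gain of a candidate split:

```python
from collections import Counter
from math import log2

def info(labels):
    """Entropy of a set of class labels, Eq. (3)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_x(partitions):
    """Entropy after partitioning on a feature X, Eq. (4);
    `partitions` is a list of label lists, one per branch of the split."""
    n = sum(len(p) for p in partitions)
    return sum((len(p) / n) * info(p) for p in partitions)

def gain(labels, partitions):
    """Information gain of the split, Eq. (5)."""
    return info(labels) - info_x(partitions)

# Toy example: 7 ethanol and 7 toluene patterns, split perfectly by a threshold
labels = ["ET"] * 7 + ["TL"] * 7
partitions = [["ET"] * 7, ["TL"] * 7]
print(gain(labels, partitions))   # 1.0 bit for this perfect two-class split
```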
ID3 first selects the feature that has the largest gain and places that feature at the root of the tree. This feature is then removed from the feature set, and the feature with the next largest gain becomes the second most important feature, and so forth. This criterion, however, has a very strong bias in favor of features that have many outcomes, that is, features that partition the data set into the largest number of subsets. Although a legitimate bias, use of this criterion can cause poor classifier performance if an irrelevant feature uniquely identifies all classes. This problem can be overcome by defining the split_info of a feature:

\mathrm{split\_info}(X) = -\sum_{i=1}^{K} \frac{|T_i|}{|T|} \log_2\left(\frac{|T_i|}{|T|}\right)    (6)

where split_info represents the potential amount of information that is generated by dividing T into K partitions by the feature X. Then

\mathrm{gain\_ratio}(X) = \frac{\mathrm{gain}(X)}{\mathrm{split\_info}(X)}    (7)

is the proportion of the information generated by the split of T that is useful. If the number of partitions, K, generated by the
Fig. 3. Sample decision tree generated by C5.0.
feature X is excessively large, split_info(X) will also be large, making gain_ratio(X) small. Details and example uses of this procedure can be found elsewhere [41]. C4.5, and more recently C5.0, the newest versions in the ID3 family of decision trees, add a number of new features to the algorithm, such as cross-validation and boosting, as described later in this section.

A sample decision tree generated by C5.0 for this particular problem of determining optimum coatings is shown in Fig. 3. The decision tree given by C5.0 can easily be converted into a set of rules from which the classification can be made. For example, the first rule generated by the tree in Fig. 3 can be expressed as "IF the PIB response is less than 0.19947, AND the response of PSB is less than 0.057534, THEN the VOC is ET (7)", where the sensor responses are given as normalized frequency deviations from the fundamental frequency of each coated sensor in response to pure nitrogen. The number in parentheses (7) refers to the number of patterns that were classified correctly by this rule; in this case, all seven responses to the seven different concentrations of ethanol were classified successfully by this rule. If applicable, a second number following a slash is given, corresponding to the number of misclassified cases.

This decision tree algorithm was tested on the original, experimentally obtained data set consisting of the responses of 12 sensors to 12 VOCs. A number of trees were constructed using various options such as pruning, cross-validation, and boosting. Pruning is a procedure for removing redundancy from the generated decision tree. Pruning usually results in much simpler trees, using a significantly smaller number of features than the original tree, at a possible cost of minor performance deterioration. The features used in the final tree are considered the most important features.
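C5.0 itself is a commercial package; as a rough, freely available analogue (our illustration, not the tool used in this work), an entropy-based scikit-learn decision tree can be trained on the response patterns and queried for the coatings it actually uses. Here `X` (an 84 x 12 array of normalized frequency shifts) and `y` (the VOC labels) are assumed to be already loaded:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

coatings = ["APZ", "PIB", "DEGA", "SG", "OV275", "PDS",
            "PDPP", "PCP", "PDS-CO", "PDS-OH", "PSB", "GRAP"]

# X: 84 x 12 array of normalized frequency shifts, y: 84 VOC labels (assumed available)
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# The coatings the fitted tree actually splits on form a candidate reduced array
used = sorted({coatings[i] for i in tree.tree_.feature if i >= 0})
print("coatings used by the tree:", used)
print(export_text(tree, feature_names=coatings))   # IF-THEN style rule listing
```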
Cross-validation is used to optimize the generated tree by evaluating it on a test data set. Cross-validation is achieved by partitioning the entire database (training and testing) into M blocks, where each block is internally divided into a training sub-block and a testing sub-block (hold-out set). During this partitioning, the number of patterns and the class distributions are made as uniform as possible. M trees are generated from these M blocks of data, and the average error rate over the M hold-out sets is considered a good predictor of the error rate of the tree built from the full data set. Finally, boosting is also a procedure for generating multiple trees, in which the test signals misclassified by the previous tree are moved to the training data set. Boosting [42,43] is a common procedure for improving the performance of a classifier.

3.1.2. Results using decision trees

Among the many trees generated using these various options, none performed satisfactorily. Although the algorithm was able to reduce the number of features from 12 to 5-7, the classification performance using these features was only in the range of 63-83%. Contrary to its original intent, previous studies have shown that this algorithm is most useful when the features selected by its decision tree are used to train a neural network [44]. In other words, this algorithm appears to be a good feature selection algorithm rather than a classification algorithm, although it was originally designed as a classification scheme. One of the better trees, constructed using the boosting, pruning, and cross-validation options, is the five-feature tree shown in Fig. 4. As is evident, this tree used the features APZ, PIB, OV275, PSB, and PDS. These features were then used to train a neural network. An MLP neural network with a 5 × 25 × 12 architecture was trained, where the numbers refer to the number of nodes in each layer: the responses of the five sensors constituted the input layer, there were 25 hidden-layer nodes, and each of the 12 output nodes represented one of the 12 VOCs.
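For reference, a minimal sketch of such a network (using scikit-learn's MLPClassifier as a stand-in for the authors' original implementation; `X_tree`, holding the responses of the five tree-selected coatings, and `y`, the VOC labels, are assumed to be available) might look like:

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# X_tree: 84 x 5 responses of APZ, PIB, OV275, PSB, PDS; y: 84 VOC labels (assumed)
X_train, X_test, y_train, y_test = train_test_split(
    X_tree, y, train_size=30, stratify=y, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(25,), max_iter=5000, random_state=0)
mlp.fit(X_train, y_train)
print("test accuracy:", mlp.score(X_test, y_test))
```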
Table 1
Results of the neural network trained with the features suggested by C5.0

VOC    Train    Performance        VOC    Train    Performance
AC     2        6/7                TCA    2        7/7
MEK    1        6/7                TCE    3        7/7
ET     4        6/7                HX     4        7/7
ME     2        7/7                OC     2        7/7
ACN    2        6/7                TL     3        7/7
DCA    3        7/7                XL     2        7/7
The training data were obtained by randomly selecting 30 patterns from the database of 84 patterns. The remaining 54 patterns constituted the test data and were used to evaluate the classification performance. The results are summarized in Table 1. The Train column indicates the number of signals used in the training data for each VOC, and the Performance column indicates the number of correctly classified patterns for each VOC out of the seven in the original dataset. With only four misclassified signals and a test data performance of 93%, this neural network performed significantly better than the best decision tree used as a classifier. The same neural network was then tested on the expanded (synthetic) data set, and the results are shown in Table 2. With eight misclassifications, the network that was trained with patterns of the original (experimentally obtained) database had a correct classification performance of 96% on the expanded data set.

The performance of the neural network, and consequently that of the feature selection capability of the decision tree method, must be evaluated with some caution, however. The feature subset containing five features was the best of over 40 different feature subsets suggested by C5.0 in various attempts. A number of different parameters had to be adjusted by trial and error in order to obtain this feature set. Therefore, this algorithm may not be the most efficient one to use, particularly for users who are not very familiar with decision tree algorithms.

3.2. Method II: modified wrapper approach

The decision tree-based approach for the selection of optimum features is a very efficient algorithm, rapidly converging to a solution. The algorithm is difficult to use, however, requiring the adjustment of several parameters. Furthermore, since it does not satisfactorily perform
Fig. 4. Optimal decision tree generated by C5.0.
Table 2
Results on the expanded data set (15 concentrations)

VOC    Performance        VOC    Performance
AC     14/15              TCA    13/15
MEK    14/15              TCE    15/15
ET     10/15              HX     15/15
ME     15/15              OC     14/15
ACN    14/15              TL     13/15
DCA    12/15              XL     13/15
as a classifier, a separate classification algorithm, such as a neural network, must be used for the actual classification. On the other hand, the features chosen by the decision tree may not always be optimal for neural network classification. Most importantly, the decision tree cannot be asked for a pre-specified number of features: the number of features used in the tree is determined by the algorithm, not by the user. Being able to pre-specify a number K and ask the algorithm to find the best K features, the ones that optimize performance among all other K-feature subsets, is a very desirable property. These concerns motivated us to look for an alternative method for obtaining an optimal feature subset.

More recently developed wrapper approaches [37,45] have been used successfully as feature selection algorithms, where the features are selected based on the performance of the subsequent classification algorithm. Thus, the selected features are optimal for the particular classification algorithm to be used. Wrapper approaches also allow the user to select the number of features, and there are fewer parameters to adjust. These benefits, however, come at the cost of increased computational complexity.

3.2.1. Strong and weak relevance

Kohavi and John expanded the meaning of relevance in feature selection by defining strong relevance and weak relevance as follows [37]: let X_i be a feature, S_i = {X_1, ..., X_{i-1}, X_{i+1}, ..., X_m} be the set of all features except X_i, and let x_i and s_i be value assignments to X_i and S_i, respectively. The feature X_i is strongly relevant if and only if there exist some x_i, y, and s_i such that the probabilistic relation in Eq. (8) holds:

P(Y = y \mid X_i = x_i, S_i = s_i) \neq P(Y = y \mid S_i = s_i), \quad \text{for } P(X_i = x_i, S_i = s_i) > 0    (8)

where Y is a random variable over the set of classes, and y is a class assignment to the current pattern. A feature X_i is weakly relevant if it is not strongly relevant and there exists a subset of features S'_i of S_i for which there exist some x_i, y, and s'_i with P(X_i = x_i, S'_i = s'_i) > 0 such that

P(Y = y \mid X_i = x_i, S'_i = s'_i) \neq P(Y = y \mid S'_i = s'_i)    (9)

These definitions are based on Bayes classifiers [46], which are considered statistically optimal classifiers. However, these classifiers require that the distribution of the data and their classes be fully known, which is seldom true. According to the above definitions, a feature is strongly relevant if removing this feature alone results in performance degradation of an optimal Bayes classifier. A feature X is then weakly relevant if it is not strongly relevant and there exists a subset of features S' that does not include X, such that the performance of the Bayes classifier on S' is worse than the performance on S' ∪ {X}. Kohavi and John's [37] approach, originally developed to improve the classification accuracy of ID3, simply searches
a subset of the feature space in an organized manner. The algorithm is based on testing all feature subsets within a limited search space using ID3 until there is no further improvement in ID3 performance. The underlying idea is that any feature subset selection algorithm should be based on the subsequent classification algorithm intended for use. Therefore, the feature subset selection algorithm must work together with, or be "wrapped around", its intended classifier.

The best method for finding the optimum feature subset is to search the feature space exhaustively over every possible feature combination. The problem, however, is that such a search can be computationally prohibitive. Because of this, only a subset of the feature space should be searched, in an organized manner, by exploiting any additional information that is available. The search can start, for instance, with all features and progress by removing features that do not contribute notably to the classification performance (backward search), or alternatively, the search can begin devoid of features and proceed by adding features that contribute the most to classification performance (forward search).

We adopted the forward search approach, starting the search with zero features; the performance was evaluated for each single feature using a classifier acting as the evaluation function. Once the feature that gave the best performance was identified, one and only one feature at a time was added to it. These two-feature subsets were then evaluated by the classifier, and the best two-feature subset was determined. A third feature was then added to those two features, and this procedure was continued until adding new features no longer improved performance. This search procedure is commonly known as the hill-climb search algorithm, in which a subset of the feature space is searched until the best performance is found. Kohavi and John [37] suggested that this method could find all strongly relevant features as well as some weakly relevant features. These researchers also suggested that this method would work best with decision tree algorithms or with Bayes classifiers; they therefore used the classification performance of ID3 and naive Bayes classifiers as their evaluation functions.

The reduction in computational complexity achieved by the hill-climb search can easily be seen from a numerical example. Searching for the best feature subset from a set of 12 features exhaustively requires evaluating (that is, training and testing)

C(12, 1) + C(12, 2) + C(12, 3) + \cdots + C(12, 11) + C(12, 12) = 4095

different networks, where C(n, k) is the number of possible combinations of choosing k features from a set of n. The maximum number of subsets searched using the hill-climb, on the other hand, grows only quadratically with the original number of features N (roughly N(N - 1)/2); for 12 features, there would be 66 subsets to search, which is computationally far more feasible. Fig. 5 shows the complete search space for N = 4, where each node has a binary code indicating the features
Fig. 6 illustrates the application of this algorithm to the 12-feature space. Note that at each stage, the number of possible combinations that need to be evaluated decreases by one. Therefore, the maximum number of feature subsets that must be evaluated is N + (N - 1) + (N - 2) + ... + (N - (N - 1)), which grows only quadratically with N rather than exponentially. Also note that in most cases, the total number of feature subsets actually searched will be less than this number, since the search stops when the optimum set is found.
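As a quick, illustrative check of these counts (an aside on our part, not part of the original analysis), the exhaustive total and the hill-climb figure quoted in the text can be computed directly:

```python
from math import comb

n = 12
exhaustive = sum(comb(n, k) for k in range(1, n + 1))   # 2**n - 1 = 4095 subsets
hill_climb = n * (n - 1) // 2                           # the 66-subset figure quoted above
print(exhaustive, hill_climb)
```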
Fig. 5. Complete feature space for N = 4.
that are included and the ones that are not. For instance, the feature subset [0, 1, 1, 0] includes the second and third features but not the first and fourth. Note that each node is connected only to nodes that have one and only one feature added or deleted. Every feature set obtained from a previous node (parent node) by adding or removing one feature is called a child node, and obtaining the children of a parent node is referred to as expanding. As such, the hill-climb search algorithm can be formally described as follows:

1. Let S be the initial feature subset, typically (0, 0, ..., 0, 0).
2. Expand S: find all children of S by adding or removing one feature at a time.
3. Apply the evaluation function, f, to each child, s.
4. Let s_max be the child with the highest evaluation, f(s_max).
5. If f(s_max) > f(S), then set S = s_max and return to step 2; otherwise
6. Return S as the solution.
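A compact sketch of this forward hill-climb wrapper in Python is given below. It is illustrative rather than the authors' original code: scikit-learn's MLPClassifier stands in for the T × 25 × 12 network used as the evaluation function, and the small per-feature penalty (see Section 3.2.2) is an assumed value.

```python
from sklearn.neural_network import MLPClassifier

def evaluate(features, X_train, y_train, X_test, y_test, penalty=0.001):
    """Train an MLP on the given feature subset and return its test accuracy,
    minus a small (illustrative) penalty per feature to discourage large subsets."""
    clf = MLPClassifier(hidden_layer_sizes=(25,), max_iter=5000, random_state=0)
    clf.fit(X_train[:, features], y_train)
    return clf.score(X_test[:, features], y_test) - penalty * len(features)

def hill_climb(X_train, y_train, X_test, y_test):
    """Forward hill-climb search; X_train/X_test are NumPy arrays (patterns x sensors)."""
    n_features = X_train.shape[1]
    selected, best_score = [], float("-inf")
    while True:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:                       # all features already selected
            return selected, best_score
        scores = {f: evaluate(selected + [f], X_train, y_train, X_test, y_test)
                  for f in candidates}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best_score:         # no further improvement: stop
            return selected, best_score
        selected, best_score = selected + [f_best], scores[f_best]
```

Called with the 30-pattern training set and the 54-pattern test set, such a routine walks the search space of Fig. 6 one added sensor at a time.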
3.2.2. Results using wrapper approach

The hill-climb search algorithm has a potential problem of becoming trapped at a local performance maximum, since the search continues only until the performance stops improving. It is quite possible for the performance to stop improving at a certain feature set, but then to improve again as further features are added. Since the feature space in this database was manageably small, all 66 subsets in the hill-climb search space were examined, effectively eliminating this potential problem of a local performance maximum. For each feature subset, a new MLP was trained with the experimentally generated training data set consisting of 30 patterns, and the network was then tested on the remaining 54 patterns. The performance of the MLP (percentage of correctly classified test samples) was used as the evaluation function. The network architecture was T × 25 × 12, where T was the number of features in the current feature subset being evaluated. The same training and testing data sets were used for each network. On a reasonably configured machine, for example a Pentium III running at 800 MHz or better with 128 MB RAM or more, the algorithm takes a little over 1 h to complete.

The feature subset that had the best performance for the least number of features on the model neural network architecture (4 × 25 × 12 with 30 training data, 54 testing
Fig. 6. Hill-climb search for the feature space with 12 features. The evaluation function is the performance of the T × 25 × 12 MLP neural network, where T is the number of features in the current feature space.
Table 3
The performance of the best feature subset as chosen by hill-climb

VOC    Train    Performance        VOC    Train    Performance
AC     2        7/7                TCA    2        7/7
MEK    1        7/7                TCE    2        7/7
ET     2        7/7                HX     3        7/7
ME     3        7/7                OC     4        7/7
ACN    2        7/7                TL     4        7/7
DCA    3        7/7                XL     2        7/7
data) was PIB, OV275, SG, and PDPP. Although another subset with an additional feature had slightly higher performance, the four-feature subset was preferred because of its smaller dimension. To avoid adding features too readily, we also added a subroutine to the program code that applies a small penalty for each additional feature. It is interesting to note that PIB, OV275, and PDPP were also on the most successful coatings list of Zellers et al. [23] for a similar list of VOCs. The algorithm was therefore able to pick the best set of coatings in a fraction of the time, without requiring an exhaustive search of all possible coating combinations. The feature subset chosen by the hill-climb search, based on 30 training patterns, classified all 54 validation patterns (previously unseen by the network) as well as the training patterns correctly, giving 100% classification performance. The distribution of the training data and the performance are given in Table 3. This network was also tested with the expanded synthetic data set of 15 concentrations, and all but three patterns (all ethanol) of the total 180-signal set were classified correctly, giving a classification performance of 98.3%.

3.3. Improving the wrapper approach

The hill-climb search technique is not guaranteed to find the best feature set, since it is prone to getting trapped at a local maximum in the performance space. When the total number of features is small, the algorithm can search the entire hill-climb search space, which essentially eliminates the local performance maximum problem. Note that the hill-climb search space is typically orders of magnitude smaller than the entire feature space. When the total number of features is large, however, searching even the hill-climb search space may be quite computationally expensive. Furthermore, when starting with one feature, the initial steps are more likely to result in the network not converging, or in the search being trapped at a local maximum. This situation arises because one (or very few) feature is not sufficient for convergence to the desired error minimum or for satisfactory performance. On the other hand, we note that the time required for selecting the next feature decreases as the number of selected features increases. In other words, finding the best subset with k features takes more time than finding the best subset with k + 1 features, given that the best k features are known from previous iterations. We observe that the search spends most of its time during the initial steps, and that the networks are
most likely to fail during these steps. This observation leads to a more computationally efficient approach: if we can identify a few of the best possible features up front and then start the hill climbing from that point, the search time can be significantly reduced. Furthermore, selecting the first few critical features well improves performance by avoiding initial missteps in the hill climb, and reduces the possibility of being trapped at a local maximum. If no prior information is available about the data and/or the relevance of the possible features, statistical procedures can be used to determine important features. One such procedure uses the variance of the features among different classes. Intuitively, features whose values change when the class changes carry more information than features whose values do not change with class. In particular, if the value of a particular feature is constant regardless of the class, then that feature provides no discriminatory information and is of no use. This approach, used naively, has a major flaw: if a particular feature changes erratically from pattern to pattern, its variance will be very high, yet that high variance renders the feature useless for classification. Care must therefore be taken to select the features that have maximum variance among different classes but minimum variance among the patterns of the same class. Such features are good candidates as starting features. An effective normalization scheme is also necessary for this approach to work.

When applied to the database we analyzed, the features that had the highest variance among different classes (and the smallest within individual classes) were PIB and OV275, with which the hill-climb search agreed by also identifying them as two of the best features. However, identifying these two features took about 1 h (on a Pentium III 800 MHz machine with 128 MB RAM) using the hill-climb search, whereas the variance-based identification of the initial two features took only a few seconds. When the hill-climb search was initialized with PIB and OV275, the search identified the other two features, SG and PDPP, in less than 15 min. Since the time required for computing the variances is negligible compared to the search time, the total running time was also less than 15 min. For completeness, the following is the list of features, in descending order of their variances:

1. PIB
2. OV275
3. GRAP
4. PSB
5. PDPP
6. APZ
7. DEGA
8. PCP
9. PDS-CO
10. PDS
11. SG
12. PDS-OH
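One simple way to formalize this criterion is a Fisher-like ratio of between-class to within-class variance; the particular formula below is our illustration and not necessarily the exact measure used in the original work. `X` (the response matrix) and `y` (the VOC labels) are assumed to be available:

```python
import numpy as np

def variance_ranking(X, y):
    """Rank features by between-class variance divided by within-class variance."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    class_means = np.array([X[y == c].mean(axis=0) for c in classes])
    between = class_means.var(axis=0)                       # spread of the class means
    within = np.mean([X[y == c].var(axis=0) for c in classes], axis=0)
    score = between / (within + 1e-12)                      # large = discriminative
    return np.argsort(score)[::-1]                          # feature indices, best first

ranked = variance_ranking(X, y)
seed = list(ranked[:2])   # e.g. use the top two coatings to seed the hill-climb search
```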
Table 4
The performance of the coatings chosen based on their variance

VOC    Train    Performance        VOC    Train    Performance
AC     2        6/7                TCA    3        5/7
MEK    2        5/7                TCE    1        1/7
ET     2        7/7                HX     4        7/7
ME     2        6/7                OC     3        6/7
ACN    3        6/7                TL     2        5/7
DCA    3        7/7                XL     3        7/7
It should be noted that SG, which was chosen by the hill-climb search, was at the bottom of the list. At issue then is how this approach alone would perform if the features were chosen from the top of this list. To answer this question, the top four coatings from the list were chosen and the standard network was trained again with 30 cases. The distribution of the training data and the results of this network are shown in Table 4. Note that the worst performance came from TCE, which happened to have only one representative in the training set. Apart from TCE's six misclassifications, there were nine other misclassifications. These results show that features should not be chosen based on their variance alone. However, choosing a few features with the highest variance and using them as the initial features in the hill-climb search may provide the best of both worlds, reducing both the total processing time of the search algorithm and the possibility of being trapped at a local maximum.

4. Conclusions and future work

We examined the viability of two feature selection methods for the VOC identification problem. The first approach, using a decision tree to determine the features carrying the most information and then training a neural network with these features, performed fairly well on both the experimental and the expanded databases. The correct classification performance was 93% on the original, experimentally generated data set, and 96% on the expanded data set. One major drawback of this scheme is the number of parameters that need to be optimized for the various options of the decision tree algorithm. On the other hand, decision trees are considerably faster to train than neural networks, whose long training times constitute a major drawback of the second approach. The second approach, based on a hill-climb search of the feature space, performed very well: the network trained with the four features selected by the hill-climb search classified all patterns correctly. This approach, although significantly faster and computationally more efficient than an exhaustive search, can still be computationally expensive if the number of features becomes large. Using simple statistical measures to select the first few features was introduced as an intuitive and effective solution for reducing the computation time.
If the original feature space is very large, and/or the statistical properties of the data do not allow accurate predictions of the initial feature set, yet another alternative is to integrate the decision tree-based approach into a hybrid procedure. In such cases, the decision tree can be used to obtain a rough estimate of the relevant features, and the hill-climb search can be started with these features. Finally, we note two of the most important advantages of the hill-climb approach: it allows the user to pre-specify the number of features desired, and it requires fewer parameters to be optimized. Future work includes developing better search algorithms and/or better starting points for these selection algorithms. We are also interested in pursuing the optimum selection of coatings for mixtures of VOCs.

Acknowledgements

The authors gratefully acknowledge the assistance and suggestions of Guojun Liu, Robert Lipert, and Bikas Vaidya. This work was supported by Fisher Controls International of Marshalltown, Iowa and the Microanalytical Instrumentation Center of Iowa State University. The Ames Laboratory is operated for the US Department of Energy by Iowa State University under contract W-7405-Eng-82.
References [1] S.L. Rose-Pehrsson, J.W. Grate, D.S. Ballantine, P.C. Jurs, Detection of hazardous vapors including mixtures using pattern recognition analysis of responses from surface acoustic wave devices, Anal. Chem. 60 (1988) 2801±2811. [2] A. D'Amico, C. Di Natale, E. Verona, in: K. Rogers (Ed.), Handbook of Biosensors and Electronic Noses, CRC Press, Boca Raton, FL, 1997, Chapter 9, pp. 197±223. [3] J.W. Grate, S. Rose-Pehrsson, D.L. Venezky, M. Klutsy, H. Wohltjen, Smart sensor system for trace organophosphorus and organosulfur vapor detection employing a temperature-controlled array of surface acoustic wave sensors, automated sample preconcentration, and pattern recognition, Anal. Chem. 65 (1993) 1868±1881. [4] J.W. Grate, B.M. Wise, M.H. Abraham, Method for unknown vapor characterization and classification using a multivariate sorption detector: initial derivation and modeling based on polymer-coated acoustic wave sensor arrays and linear solvation energy relationships, Anal. Chem. 71 (1999) 4544±4553. [5] J.D.N. Cheeke, Z. Wang, Acoustic wave gas sensors, Sens. Actuators B 59 (1999) 146±153. [6] C.K. O'Sullivan, G.G. Guilbault, Commercial quartz crystal microbalances Ð theory and applications, Biosens. Bioelectron. 14 (1999) 663±670. [7] L. Cui, M.J. Swann, A. Glidle, J.R. Barker, J.M. Cooper, Odour mapping using microresistor and piezoelectric sensor pairs, Sens. Actuators B 66 (2000) 94±97. [8] T. Nakamoto, A. Iguchi, T. Moriizumi, Vapor supply method in odor sensing system and analysis of transient sensor responses, Sens. Actuators B 71 (2000) 155±160. [9] J.W. Gardner, Detection of vapours and odours from a multisensor array using pattern recognition. Part 1. Principal component and cluster analysis, Sens. Actuators B 4 (1991) 109±115.
R. Polikar et al. / Sensors and Actuators B 80 (2001) 243±254 [10] P.M. Schweizer-Berberich, S. Vaihinger, W. Gopel, Characterization of food freshness with sensor arrays, Sens. Actuators B 18/19 (1994) 282±290. [11] N. Ryman-Tubb, They all stink! chemometrics and the neural approach, Proc. SPIE Virtual Intell. 2878 (1996) 117±127. [12] B. Yea, T. Osaki, K. Sugahara, R. Konishi, The concentration estimation of inflammable gases with a semiconductor gas sensor utilizing neural networks and fuzzy inference, Sens. Actuators B 41 (1997) pp. 121±129. [13] Z. Wang, J. Hwang, B.R. Kowalski, ChemNets: theory and application, Anal. Chem. 67 (1995) 1497±1504. [14] D.S. Ballantine, S.L. Rose, J.W. Grate, H. Wohltjen, Correlation of surface acoustic wave device coating responses with solubility properties and chemical structure using pattern recognition, Anal. Chem. 58 (1986) 3058±3066. [15] J.W. Grate, M.H. Abraham, Solubility interactions and the design of chemically-selective sorbent coatings for chemical sensors and arrays, Sens. Actuators B 3 (1991) 85±111. [16] D. Amati, D. Arn, N. Blom, M. Ehrat, J. Saunois, H.M. Widmer, Sensitivity and selectivity of surface acoustic wave sensors for organic solvent vapor detection, Sens. Actuators B 7 (1992) 587±591. [17] K. Nakamura, T. Suzuki, T. Nakamoto, T. Moriizumi, Sensing film selection of QCM odor sensor suitable for apple flavor discrimination, IEICE Trans. Electron. E83 (2000) 1051±1056. [18] T. Nakamoto, K. Nakamura, T. Moriizumi, Classification and evaluation of sensing films for QCM odor sensors by steady-state sensor response measurement, Sens. Actuators B 69 (2000) 295±301. [19] C. Di Natale, A. D'Amico, A.M.F. Davide, Redundancy in sensor arrays, Sens. Actuators A 37/38 (1993) 612±617. [20] J. Park, E.T. Zellers, Determining the minimum number of sensors required for multiple vapor recognition with arrays of polymercoated SAW sensors, Proc. Electrochem. Soc. 99 (1999) 132±137. [21] J. Park, W.A. Groves, E.T. Zellers, Vapor recognition with small arrays of polymer-coated microsensors, Anal. Chem. 71 (1999) 3877±3886. [22] G.C. Osbourn, R.F. Martinez, J.W. Bartholomew, W.G. Yelton, A.J. Ricco, Optimizing chemical sensor array, Proc. Electrochem. Soc. 99 (1999) 127±131. [23] E.T. Zellers, S.A. Batterman, M. Han, S.J. Patrash, Optimal coating selection for the analysis of organic vapor mixtures with polymercoated surface acoustic wave sensor arrays, Anal. Chem. 67 (1995) 1092±1106. [24] W.P. Carey, K.R. Beebe, B.R. Kowalski, D.L. Illman, T. Hirschfeld, Selection of adsorbates for chemical sensor arrays by pattern recognition, Anal. Chem. 58 (1986) 149±153. [25] F. Avila, D.E. Myers, C. Palmer, Correspondence analysis and adsorbate selection for chemical sensor arrays, J. Chemom. 5 (1991) 455±465. [26] G.C. Osbourne, R.F. Martinez, Empirically defined regions of influence for clustering analysis, Pattern Recog. 28 (1995) 1793± 1806. [27] A.J. Ricco, R.C. Crooks, G.C. Osbourn, Surface acoustic wave chemical sensor arrays: new chemically sensitive interfaces combined with novel cluster analysis to detect volatile organic compounds and mixtures, Accounts Chem. Res. 31 (1998) 289±296. [28] G.C. Osbourn, J.W. Bartholomew, A.J. Ricco, G.C. Frye, Visualempirical region of influence pattern recognition applied to chemical microsensor array selection and chemical analysis, Accounts Chem. Res. 31 (1998) 297±305. [29] R.O. Duda, D. Stork, P.E. Hart, Pattern Classification, 2nd Edition, Wiley, New York, 2001. [30] T.M. 
Mitchell, Machine Learning, WCB/McGraw Hill, Boston, 1997. [31] J.R. Quinlan, Induction of decision trees, Machine Learning 1 (1986) 81±106. [32] D.W. Aha, D. Kibler, M.K. Albert, Instance-based learning algorithms, Machine Learning 6 (1991) 37±66.
[33] B.V. Dasarathy, Nearest Neighborhood (NN) Norms: NN Pattern Classification Techniques, IEEE Computer Society Press, Los Alamitos, 1990. [34] H. Almuallim, T.G. Dietterich, Learning Boolean concepts in the presence of many irrelevant features, Artif. Intell. 69 (1994)279±306. [35] K. Kira, L.A. Rendell, Feature selection problem: traditional methods and a new algorithm, in: Proceedings of the 10th International Conference on Machine Learning (AAAI92), 1992, pp. 129±134. [36] D. Koller, M. Sahami, Toward optimal feature selection, in: Proceedings of the 13th International Conference on Machine Learning (ICML-96), 1996, pp. 284±292. [37] R. Kohavi, G.H. John, Wrappers for feature subset selection, Artif. Intell. 97 (1997) 273±324. [38] H. Wohltjen, Mechanism of operation and design considerations for surface acoustic wave device vapor sensors, Sens. Actuators 5 (1984) 307±325. [39] R. Polikar, Algorithms for Enhancing Pattern Separability, Optimum Feature Selection and Incremental Learning with Applications to Gas Sensing Electronic Nose Systems, Ph.D. Dissertation, Iowa State University, Ames, IA, 2000. [40] R. Shinar, G. Liu, M.D. Porter, Graphite microparticles as coatings for quartz crystal microbalance-based gas sensors, Anal. Chem. 72 (2000) 5981±5987. [41] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA, 1993. [42] R. Schapire, Strength of weak learning, Machine Learning 5 (1990) 197±227. [43] Y. Freund, R. Schapire, A decision theoretic generalization of on-line learning and an application to boosting, Comput. Syst. Sci. 57 (1997) 119±139. [44] M. Seo, Automatic Ultrasonic Signal Classification Scheme, MS Thesis, Iowa State University, Ames, IA, 1997. [45] R. Kohavi, G.H. John, in: L. Huan, H. Motoda (Eds.), Feature Extraction, Construction and Selection: A Data Mining Perspective, Kluwer Academic Publishers, Norwell, MA, 1998. [46] K. Fukunaga, Statistical Pattern Recognition, 2nd Edition, Academic Press, San Diego, 1990.
Biographies Robi Polikar received his BS degree in electronics and communications engineering from Istanbul Technical University in 1993, MS and PhD degrees both in co-major biomedical engineering and electrical engineering from Iowa State University, Ames, Iowa, in 1995 and in 2000, respectively. He is currently an assistant professor of electrical and computer engineering at Rowan University, Glassboro, NJ. His current research interests include signal processing, pattern recognition, and applications of neural networks for biomedical engineering. Ruth Shinar received her PhD in physical chemistry from the Hebrew University, Jerusalem, Israel, in 1977. She was a post-doctoral fellow at the University of California, Santa Barbara before joining the Microelectronics Research Center and then the Microanalytical Instrumentation Center at Iowa State University. Her research interests include surface chemistry, chemical sensors, photovoltaics, bioassays, and chip-scale instrumentation. Lalita Udpa received her PhD in electrical engineering from Colorado State University in 1986. She is currently a professor of electrical engineering at Iowa State University. Dr. Udpa works primarily in the area of nondestructive evaluation (NDE). Her research interests include numerical modeling of the forward problem and solution of the inverse problems in NDE. She works extensively on the application of signal processing, pattern recognition, and neural network algorithms. She also teaches graduate level pattern recognition and signal processing classes at Iowa State University.
Marc D. Porter received his BS and MS degrees in chemistry from Wright State University in 1977 and 1979, respectively, and his PhD in analytical chemistry from Ohio State University in 1984. His graduate work focused on new ways to characterize electrode materials. He then joined Bell Communications Research as a post-doctoral associate, and explored
structural and electrochemical issues in self-assembled monolayers. He is presently a professor of chemistry at Iowa State University and is the Director of its Microanalytical Instrumentation Center. His research interests include surface analytical chemistry, monolayer assemblies, chip-scale instrumentation, analytical separations, bioassays, and chemically selective microscopies.