expert system for identification of toxic compounds from low resolution mass spectra

Chemometrics and Intelligent Laboratory Systems 23 (1994) 351-364

Pattern recognition/expert system for identification of toxic compounds from low resolution mass spectra

Donald R. Scott

Atmospheric Research and Exposure Assessment Laboratory, US Environmental Protection Agency, Research Triangle Park, NC 27711, USA

(Received 7 December 1993; accepted 8 February 1994)

Abstract

An empirical rule-based pattern recognition/expert system for classifying, estimating molecular weights and identifying low resolution mass spectra of toxic and other organic compounds has been developed and evaluated. The system was designed to accommodate low concentration spectra and provide some information for mixtures. It consists of a classifier followed by molecular weight estimators, filters and identification modules. Computed series of allowed molecular weights and selected base peaks for five classes are used in the filters to reduce misclassification and ensure correct identification. The target classes are nonhalobenzenes; chlorobenzenes; bromo- and bromochloroalkanes/alkenes; mono- and dichloroalkanes/alkenes; and tri-, tetra- and pentachloroalkanes/alkenes. The identification module for the 75 target compounds relies upon the high accuracy of the molecular weight estimators and base peak data for unique identification. The total system was extensively tested with reference spectra of 32 potential air pollutants, 99 randomly selected compounds, 37 gas chromatographic-mass spectroscopic (GC-MS) field spectra and with 400 pharmaceutical related spectra. Even with incomplete spectra the classification and identification performance was very good, with accuracies of 97% (test, random and pharmaceutical) and 95% (field GC-MS). The median absolute deviations from the true molecular weights of the test, random, field and pharmaceutical spectra were 1-2 Da and the average absolute deviations were 6-10 Da. The program is very fast and runs on a personal computer.

Elsevier Science B.V. SSDI 0169-7439(94)00010-G

1. Introduction

The analysis of complex environmental samples, whether air, water, soil or biological, for organic pollutants presents analytical problems not encountered in other types of samples. These samples frequently contain hundreds of organic components, including trace quantities of compounds sought, and the compounds usually are not completely separated by the chromatographic methods used. At present gas chromatography-mass spectrometry (GC-MS) or, for less volatile pollutants, liquid chromatography-mass spectrometry (LC-MS) are the methods of choice for identification and quantitation of organic compounds in environmental and other complex samples. The usual approach to compound identification from low resolution mass spectra of these types of samples is the use of the library search [1,2] alone or with some manual interpretation. Candidate structures are ranked according to a similarity index determined by the search algorithm. An ideal search algorithm should assign probabilities to the candidate structures so that an objective judgement of the accuracy of the spectral match and related chemical structure can be obtained. Some recent progress has been made toward this goal [3] and with interpretation of the candidate structure list [4]. However, variability of spectra due to instrumental operation and the inherent limitations of mass spectrometry limit the performance of any search algorithm.

In a recent study [5] the identification accuracy of two widely used search algorithms, probability based matching (PBM) [6] and the INCOS dot product method [7], and three other algorithms was determined with a set of 12592 reference spectra. The best of these algorithms, an augmented INCOS type, could only correctly identify 88% of the test compounds at ranks 1 and 2 on the candidate list. The PBM method was found to correctly identify only 78%. These results were obtained on reference spectra and represent the optimum performance of these search algorithms with a wide variety of chemical structures. The performance of library search methods on mass spectra of actual environmental or other complex samples will certainly not be as good. An indication of search performance on simulated real samples is given in the results of a study by Pellizzari et al. [8]. Four methods of identifying the 85 volatile compounds in a low-concentration mixture from GC-MS data were studied. The best performance was shown by the INCOS dot product method with 75% accuracy, exceeding the next best performance, which was manual interpretation by a spectroscopist. On this relatively simple sample, about 22% of the compounds were not correctly identified in the top five candidates on the search list even by the best search algorithm.
In order to increase the accuracy of compound identification as well as supply some information on spectra which are currently not interpreted due to time or other constraints, a pattern recognition/expert system approach to the problem has been investigated [9]. The basic approach is to classify an unknown mass spectrum as one of the known classes in a target set and then follow that classification with a molecular weight estimate, associated lower limit and a positive identification, if possible. It is recognized that many of the spectra will not be positively identified, but chemical class and molecular weight estimates will be provided for these spectra. Extensive evaluation of a recently developed expert system [9] to perform these functions and of a related one for classification and molecular weight estimation [10,11] has given very encouraging results. However, the pattern recognition classifier, which is the same for both systems, misclassified some complex chemical structures as halogen containing compounds and therefore erroneously estimated the molecular weights of these compounds. A new classification and identification system for mass spectra has been designed and evaluated. The classifier, exclusion filters, molecular weight predictors and identification modules have been re-designed to reduce misclassifications. This was done by using the improved accuracy of the molecular weight predictors and specific definitions of the class members to prevent spectra of complex chemical compounds from being misclassified. The resulting system is very fast, easy to use and can be run on a personal computer. It can be used to assist in manual or computer-aided identification of mass spectra of environmental and other types of samples.

2. Methods

The expert system was developed and evaluated on an IBM compatible personal computer. The shell system used was 1ST-CLASS FUSION, version 2.0, which is an inductive rule-building program using the ID3 algorithm. Further details regarding this shell system are available elsewhere [12,13]. All reference spectra were from the US National Institute of Standards and Technology/EPA/MSDC Database, Personal Computer Version 4 (hereafter called the NIST data base), except for the pharmaceutical spectra. These were obtained from J.T. Clerc of the University of Bern and came from an ETH Zurich data base. All mass spectra were ternary encoded for use with the expert system. Intensities (relative to the base peak intensity of 100%) of 0 to 4.99% were assigned values of 0, those of 5 to 49.9% values of 0.5 and those of 50 to 100% values of 1.0. No masses less than 30 Da were used in any rules.
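
The ternary encoding described above can be illustrated with a minimal Python sketch. The function names and the dictionary representation of a spectrum (mass mapped to raw intensity) are assumptions made for illustration; they are not part of the original shell system.

```python
def ternary_encode(relative_intensity):
    """Map an intensity in percent of the base peak to 0, 0.5 or 1.0."""
    if not 0 <= relative_intensity <= 100:
        raise ValueError("relative intensity must lie between 0 and 100%")
    if relative_intensity < 5:
        return 0.0   # 0 to 4.99%
    if relative_intensity < 50:
        return 0.5   # 5 to 49.9%
    return 1.0       # 50 to 100%


def encode_spectrum(peaks, min_mass=30):
    """Encode a spectrum given as {mass: raw intensity}.

    Intensities are first scaled so that the base peak is 100%.
    Masses below min_mass are dropped, since no masses under 30 Da
    were used in any rules. (Hypothetical helper, illustration only.)
    """
    base = max(peaks.values())
    return {
        mass: ternary_encode(100.0 * intensity / base)
        for mass, intensity in peaks.items()
        if mass >= min_mass
    }
```

With this representation, a peak at 45% of the base peak is encoded as 0.5 and a peak at 4% as 0, so weak peaks that may vanish at low concentration carry no weight in the rules.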

3. System specifications and design

Two problems encountered in the mass spectra of environmental and other complex samples are incomplete separation of mixtures and low concentration of components, resulting in missing peaks. Usual library search techniques cannot successfully cope with either of these problems. The basic purpose of the present expert system is to alleviate these problems by providing partial identification of an unknown compound or mixture with chemical class and molecular weight information and by providing definitive identifications of the target compounds, where possible. The main design specifications for the expert system included the following: (a) use high Shannon information content peaks, if possible; (b) use high intensity peaks in the rules; (c) supply partial identification; (d) reduce false identifications. Additional requirements were that the system should be easily implemented, be user friendly and be fast in operation. These last requirements essentially dictate the use of an expert system on a personal computer. A very simplified schematic diagram of the expert system design is shown in Fig. 1.

Fig. 1. Schematic diagram of classification and identification expert system.

A classifier module is followed by molecular weight estimators, molecular weight filters, base peak filters
and identification modules. Each of the six classes, except the very large unknown class, has its own set of filters, molecular weight estimator and identification module. The unknown class has its own molecular weight predictor but no filters or identification module. If a spectrum does not pass a given filter constraint, then it is diverted to the unknown class and appropriate information regarding its class, estimated molecular weight and lower limit is given. If a spectrum passes all filter constraints, including those in the identification module, then the spectrum is identified and its estimated and true molecular weight are provided. The actual path through the system starts with the classifier where a spectrum is assigned a tentative class, including the unknown class. The spectrum then passes to the molecular weight estimator for that class where a tentative molecular weight is assigned. If the tentative molecular weight is correct for a defined member of that specific class, then the spectrum passes the molecular weight filter and proceeds to the exclusion base peak filter for the assigned class. If the spectrum contains no base peaks which are excluded in the first base peak filter, then it passes to a second base peak filter. The first base peak filter excludes spectra which have been misclassified and the second filter checks for known base peaks of target compounds. If the spectrum passes the second base peak filter, then it goes to the identification module for further verification and subsequent identification as a member of the target set. Spectra may be diverted to the unknown class directly from the classifier or by being rejected from the two filters. If the spectrum is passed to the unknown class, its tentative molecular weight is recalculated with the rules in the unknown class estimator. The unknown class is a very large one consisting of all spectra which do not fit into the other five groups determined from the target set. 
An important design specification for the molecular weight filter was a very narrow definition of the allowed members of the five major classes so that only chemical structures present in the training spectra were represented. Further details about each type of module are given below.
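
The path through the system described above can be summarized in a short Python sketch. All names here (`ClassModules`, `Result`, `run_system` and the callables they hold) are hypothetical placeholders; the actual rules were built in the expert system shell and are not reproduced. Lower limits are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

UNKNOWN = "unknown"


@dataclass
class ClassModules:
    """Per-class modules; each callable is an assumed stand-in for a rule set."""
    estimate_mw: Callable[[dict], float]
    mw_filter: Callable[[float], bool]                 # allowed molecular weights
    exclusion_filter: Callable[[dict], bool]           # first base peak filter
    target_base_peak_filter: Callable[[dict], bool]    # second base peak filter
    identify: Callable[[dict, float], Optional[str]]   # identification module


@dataclass
class Result:
    chem_class: str
    estimated_mw: float
    identity: Optional[str] = None


def run_system(spectrum, classify, modules: Dict[str, ClassModules],
               estimate_unknown_mw) -> Result:
    chem_class = classify(spectrum)
    if chem_class == UNKNOWN:
        return Result(UNKNOWN, estimate_unknown_mw(spectrum))

    m = modules[chem_class]
    mw = m.estimate_mw(spectrum)

    # Rejection by the molecular weight filter or the exclusion base peak
    # filter diverts the spectrum to the unknown class, where the
    # molecular weight is recalculated with the unknown class rules.
    if not (m.mw_filter(mw) and m.exclusion_filter(spectrum)):
        return Result(UNKNOWN, estimate_unknown_mw(spectrum))

    # Failing the second base peak filter or the identification module
    # leaves the class and molecular weight estimate in place.
    if m.target_base_peak_filter(spectrum):
        return Result(chem_class, mw, m.identify(spectrum, mw))
    return Result(chem_class, mw)
```

The sketch makes one design point explicit: only the first two filters can change the class assignment, while the later stages decide only whether a specific target identity is reported.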

4. Target and training set

The target set is a relatively small, but representative, group of 75 toxic and other pollutants including a variety of chemical structures. A complete list of the compounds has been reported previously [14]. The types of compounds included are given in Table 1. This set was originally derived from volatile air pollutants but incorporates compounds found on pollutant lists for other media. Most of the target compounds are substituted benzenes or substituted alkanes/alkenes with fewer than five carbons. Only alkanes/alkenes with a chlorine and/or bromine substituent were included. These target compounds together with 31 representative members of the unknown class were used as training spectra.

Table 1
Training set composition and performance results

Class                                         Structure                     Number in set
Nonhalobenzenes                               Benzene                        1
                                              Alkylbenzenes                 10
                                              Alkenylbenzenes                1
                                              Phenylketones                  1
                                              Phenylaldehydes                1
                                              Benzonitriles                  1
Chlorobenzenes                                Chlorobenzenes                 4
                                              Alkylchlorobenzenes            3
                                              Alkenylchlorobenzenes          1
Mono- and dichloroalkanes/alkenes             Monochloroalkanes              4
                                              Monochloroalkenes              1
                                              Dichloroalkanes                8
                                              Dichloroalkenes                3
                                              Monochloroepoxyalkane          1
                                              Monochlorooxyalkene            1
Tri-, tetra- and pentachloroalkanes/alkenes   Trichloroalkanes               5
                                              Trichloroalkenes               1
                                              Tetrachloroalkanes             6
                                              Tetrachloroalkenes             1
                                              Pentachloroalkanes             1
Bromo- and bromochloroalkanes/alkenes         Monobromoalkanes               4
                                              Dibromoalkanes                 7
                                              Tribromoalkanes                1
                                              Monobromoalkenes               2
                                              Monobromomonochloroalkanes     3
                                              Monobromodichloroalkanes       1
                                              Monobromotrichloroalkanes      1
                                              Dibromomonochloroalkanes       1
Unknown                                       Cyclic ethers                  3
                                              Acyclic hydrocarbons          13
                                              Cyclic hydrocarbons            3
                                              Unsaturated hydrocarbons       3
                                              Aldehydes                      3
                                              Ketones                        3
                                              Alcohols                       2
                                              Acids                          1

Classification and identification accuracy: 100%
Molecular weight absolute deviations: median 0.0 Da, average 0.17 Da

5. Pattern recognition classifier

The classifier is the most important module in the system. It provides tentative class information
and determines which filters and molecular weight modules are appropriate for a spectrum. In previous studies the training set was found by unsupervised SIMCA pattern recognition to contain six classes: nonhalobenzenes; chlorobenzenes; bromo- and bromochloroalkanes/alkenes; mono- and dichloroalkanes/alkenes; tri-, tetra- and pentachloroalkanes/alkenes; and unknown (all others). The set of initial masses used in deriving the rules for these predetermined classes was selected from those with high Shannon information content and high peak intensity, i.e., 50% of base peak intensity or greater. From an original set of 31 key masses, fourteen were used in the final rules of the best previous classifier [9]. In deriving the present rules with the ID3 algorithm, these 14 masses (39, 42, 49, 55, 63, 76, 77, 93, 96, 107, 108, 122, 129, 165) were initially used but were insufficient to separate the six defined classes. It was then found that complete separation could be accomplished by deleting mass 165 and adding masses 41 and 75. These 15 key masses had Shannon binary information contents relative to the 62235 spectra in the entire NIST data base ranging from 0.48 to 1.0 bit, with 60% having 0.90 bit or greater. There were 36 branches in the classification decision tree, with 25 of them being concerned with only one or two training spectra. The largest numbers of branches in the resulting decision tree involved the bromo- and bromochloroalkanes/alkenes (9 branches) and mono- and dichloroalkanes/alkenes (10 branches) classes. The fewest branches were concerned with the nonhalobenzene (3 branches), unknown (4 branches) and chlorobenzene (4 branches) classes.
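
The Shannon binary information content quoted for the key masses can be illustrated with the standard binary entropy formula. The sketch below assumes that the relevant probability is the fraction of data base spectra in which a given mass appears as a peak; the text does not state the exact definition used, so this is an interpretation rather than the author's procedure.

```python
import math


def binary_information(p):
    """Shannon information (bits) of a binary feature present with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a peak that is always or never present carries no information
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))
```

Under this assumption a mass present in half of all spectra carries the maximum of 1.0 bit, while one present in about 10% of spectra carries roughly 0.47 bit, which is close to the lower end of the range reported for the 15 key masses.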

6. Molecular weight estimators

The molecular weight estimators are a key part of the new system. The more accurate the estimates from these modules, the better the performance of the first exclusion filters and the accuracy of the output information. Even if a spectrum was classified as unknown, an estimate of its molecular weight and associated lower limit was provided. For cases where definite identities cannot be established, but class information can, the class and estimated molecular weight can help to eliminate a large number of candidate compounds. Even if the identity of a particular spectrum had been established, this module provided an estimated molecular weight which could be compared with the true value as an additional identity check. The molecular weights discussed throughout this study are those calculated with average isotopic composition.

The estimator modules, one for each of the six classes, provide the calculated molecular weights and the calculated lower limits. Recently [14,15] it was found that two spectral features, MAXMASS (the highest observed mass with an intensity of at least 5% of the base peak) and HIMAX1 (the highest observed mass with a peak intensity of at least 1%), are very efficient indicators of the molecular ion mass. For example, in the set of 400 pharmaceutical spectra, HIMAX1 is equal to the molecular ion mass 16.2% of the time and within 1 Da another 39.3%. An illustration of the relationship between the molecular weight and MAXMASS and HIMAX1 for this set of 400 spectra is shown in Fig. 2.

Fig. 2. Linear relationship between (a) MAXMASS and (b) HIMAX1 and molecular weight.

The basis of the estimation procedure is the application of linear corrections to HIMAX1 to provide molecular weights. These corrections for each class module were empirically determined from the appropriate sets of training spectra. Both MAXMASS and HIMAX1 occurred in all class rules, where MAXMASS acted as a preliminary sorting variable, but corrections were only applied to HIMAX1. The resulting rules ranged from 3-4 branches for the nonhalobenzene and chlorobenzene classes to 21 branches for the very diverse unknown class. Applied corrections ranged from -4.6 Da for the tri-, tetra- and pentachloroalkane/alkene class to +34.4 Da for the same class. The smallest corrections appeared in the nonhalobenzene and chlorobenzene classes. The lower limits to the molecular weights were determined from the average of MAXMASS and HIMAX1 less 5 Da. The correction of 5 Da to the average is required to adjust for the occurrence of halogen isotopic peaks, since the calculated molecular weights are average ones. The 106 training compounds and 400 randomly selected spectra were used to test the lower limit estimates.
Of these 506 spectra only four from the random set had molecular weights lower than the estimated lower limits. All four appeared to be contaminated with high mass ions, which would cause HIMAXl values and the lower limits to be too high.
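
A minimal Python sketch of the two spectral features and the lower limit follows. The spectrum representation ({mass: intensity}) and the function names are assumptions for illustration. The class-specific corrections themselves are not given in the text, so `estimate_mw` simply takes the correction as an argument and treats it as an additive offset, which is also an assumption.

```python
def highest_mass_at_or_above(spectrum, threshold_percent):
    """Highest mass whose intensity is at least threshold_percent of the base peak."""
    base = max(spectrum.values())
    return max(
        mass for mass, intensity in spectrum.items()
        if 100.0 * intensity / base >= threshold_percent
    )


def maxmass(spectrum):
    """MAXMASS: highest mass with intensity of at least 5% of the base peak."""
    return highest_mass_at_or_above(spectrum, 5.0)


def himax1(spectrum):
    """HIMAX1: highest mass with intensity of at least 1% of the base peak."""
    return highest_mass_at_or_above(spectrum, 1.0)


def lower_limit(maxmass_value, himax1_value):
    """Average of MAXMASS and HIMAX1 less 5 Da."""
    return (maxmass_value + himax1_value) / 2.0 - 5.0


def estimate_mw(himax1_value, correction):
    """Apply an empirically determined class correction (assumed additive) to HIMAX1."""
    return himax1_value + correction
```

For a spectrum with MAXMASS = HIMAX1 = 122, the lower limit is 117 Da, and a correction of -1.0 Da would give an estimate of 121 Da.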

The probable errors for the molecular weight estimates were calculated with a robust statistic, the median absolute deviation [16]. For a normal distribution, this error corresponds to limits of about 3.5 times the standard deviation. The probable error for the unknown class, which was determined from the 99 random spectra, was ±5 Da and for all other classes was ±1 Da.
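
The difference between the median and average absolute deviations reported in this study can be seen in a small sketch. It assumes the deviations are taken between estimated and true molecular weights, as in the evaluation sections; the numbers used in any example call are made up for illustration and are not data from the paper.

```python
from statistics import mean, median


def absolute_deviations(estimated, true):
    """Absolute deviation of each estimated molecular weight from its true value."""
    return [abs(e - t) for e, t in zip(estimated, true)]


def median_absolute_deviation(estimated, true):
    return median(absolute_deviations(estimated, true))


def average_absolute_deviation(estimated, true):
    return mean(absolute_deviations(estimated, true))
```

A single badly estimated spectrum raises the average considerably but barely moves the median, which is why the median is the more robust indicator of typical error.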

7. Exclusion filters

7.1. Molecular weight filter

After passing through the classification module, a spectrum was either directed to one of the five main class modular routes or to the unknown class. If the latter occurred, the spectrum passed to the unknown class molecular weight estimator and the appropriate message was displayed with the estimated molecular weight and its lower limit. If the spectrum was classified as one of the five main classes, it passed to a set of filters for that particular class. The first set of filters is very important since these filters refine the classifier assignments. Evaluations of the previous classifier [9,10] showed that some unusual chemical structures in test spectra were being misclassified as chlorine- or bromine-containing compounds. An example is 4-diethylaminobenzaldehyde, which was classified as a chlorobenzene instead of as an unknown. Molecular weight estimation rules for the wrong class were then used, resulting in erroneous predictions. In the previous classifier, rather broad definitions of the five major classes were allowed since essentially only the classifier results were relied on to define class members. In the present filters a very strict definition of the allowed chemical structures in the training set was employed, which was based on the known molecular weights of the training set classes. For example, no hydroxychloro- or aminochloroalkanes/alkenes were included in the mono- and dichloroalkane/alkene class. The first filter following the classifier and molecular weight estimator is a computed one with a different filter for each class except unknown, which had none. Only those molecular weights which could possibly arise from the series of chemical structures, including higher molecular weight members, represented in the training set compounds were passed. Each allowed molecular weight had an error band of ±1 Da in all filters. An example for the mono- and dichloroalkane/alkene class would consist of the discrete series of molecular weights of all compounds from chloromethane through chlorooctane and dichlorohexane, including all corresponding chloroalkenes with up to three double bonds. This type of filter was effective at lower molecular weights, but at higher molecular weights the series of allowed molecular weights was essentially continuous and ineffective. Therefore, upper limits for passing these filters were established at 218 (nonhalobenzenes), 197 (chlorobenzenes), 154 (mono- and dichloroalkanes/alkenes), 251 (tri-, tetra- and pentachloroalkanes/alkenes) and 299 Da (bromo- and bromochloroalkanes/alkenes). Spectra with higher estimated molecular weights or those that did not match the computed allowed molecular weights were routed to the unknown class estimator. The advantage of the new filters is that the classes are narrowly defined and the resulting classification is more accurate. The penalty for rejection by these filters is classification as an unknown and molecular weight estimation using unknown class rules, which are fairly accurate even for other classes.
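
The computed molecular weight filter for the mono- and dichloroalkane/alkene class can be sketched as follows. The formula C(n)H(2n+2-x-2d)Cl(x), the requirement of at least d+1 carbons for d double bonds, and the rounded average atomic weights are assumptions used to generate the series; the exact series in the original filter may differ (for example, it also covered the oxygen-containing training structures, which are ignored here).

```python
# Average atomic weights (rounded standard values).
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "Cl": 35.453}


def allowed_weights(max_carbons_by_chlorines=None, max_double_bonds=3):
    """Average molecular weights for chloromethane through chlorooctane and
    dichloromethane through dichlorohexane, with up to three double bonds."""
    if max_carbons_by_chlorines is None:
        max_carbons_by_chlorines = {1: 8, 2: 6}
    weights = set()
    for n_cl, max_c in max_carbons_by_chlorines.items():
        for n_c in range(1, max_c + 1):
            for d in range(max_double_bonds + 1):
                if n_c < d + 1:
                    continue  # not enough carbons for d double bonds
                n_h = 2 * n_c + 2 - n_cl - 2 * d
                if n_h < 0:
                    continue
                mw = (n_c * ATOMIC_WEIGHT["C"] + n_h * ATOMIC_WEIGHT["H"]
                      + n_cl * ATOMIC_WEIGHT["Cl"])
                weights.add(round(mw, 2))
    return sorted(weights)


def passes_mw_filter(estimated_mw, allowed, tolerance=1.0, upper_limit=154.0):
    """Pass only estimates within the error band of an allowed weight
    and not above the class upper limit."""
    if estimated_mw > upper_limit:
        return False
    return any(abs(estimated_mw - w) <= tolerance for w in allowed)
```

In this sketch an estimate of 62.5 Da (chloroethene) or 99.0 Da (dichloroethane) passes, while 58.0 Da matches no member of the series and 160 Da exceeds the class upper limit, so both would be routed to the unknown class estimator.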

7.2. Base peak interference filter

If a spectrum passed the class molecular weight filter, it then encountered a class base peak filter. Since these peaks are the most intense in any spectrum, they are obviously always present. The purpose of this filter was to eliminate spectra which were misclassified and happened to have estimated molecular weights which matched some in the class. Base peaks of compounds which were found to be misclassified in earlier studies [9] were used in these filters. Usually the base peak of the misclassified spectrum alone was used. However, in some cases the base peaks of the misclassified spectra were identical to some in the training set spectra. It was then necessary to use the base peaks and additional masses in the filters. Each part of every class filter was extensively tested vs. all 62235 spectra in the entire NIST data base to ensure its validity. Any spectrum rejected by these filters was routed to the unknown class.

7.3. Base peak identification filter

Any spectrum passing the base peak interference filter proceeded to the second base peak filter. The purpose of the first two filters in the system following the classifier was to correct for possible misclassification. The basic purpose of the next two filters (modules) in the system was to prevent nontarget compounds from being misidentified as target compounds and to correctly identify actual target compounds. This base peak module checked for the appearance in the unknown spectrum of any of the base peaks present in the target set. If one was present, then the spectrum was passed to the identification module. If not, then the previously assigned class information and estimated molecular weight and its lower limit were displayed. Since the base peak is the highest intensity peak in the mass spectrum, it is apparent that it will always be present even in low concentration spectra. With this system the occurrence of the correct base peak is assumed to be a necessary, but not sufficient, step in correctly identifying a target spectrum.

8. Identification module

An unknown spectrum which passed all previous filters proceeded to the last module in the expert system, the identification module. This last module verified the estimated molecular weight and presence in the unknown spectrum of certain key masses for the target set of compounds. Since the molecular weight estimators were very accurate for the target set, the estimated molecular weights were used in these modules. The key masses consisted of base peaks, other high intensity peaks and, in some cases, lower intensity masses which were necessary to distinguish between similar spectra of different target compounds. Besides the correct base peak and molecular weight, one to four additional masses were checked. Simultaneous occurrence of all these features was required for correct identification to be assumed. If a spectrum passed this last filter, then its identity and estimated molecular weight were displayed along with its correct molecular weight. Comparison of the estimated and true molecular weights allowed the user to make a final check on the authenticity of the identification. If the unknown spectrum did not pass this filter, then the previously assigned class, estimated molecular weight and its lower limit were displayed. Each part of this module was extensively tested vs. all 62235 spectra in the entire NIST data base to ensure its validity. During this final evaluation of the system it was found that some spectra were identical within the encoding scheme and in actual experimental practice. These alternative identities for target spectra were also incorporated into the identification messages. In the nonhalobenzene class the spectra of o- and p-xylene and 1,3,5- and 1,2,4-trimethylbenzene were identical. Also 1-methyl-2-, 1-methyl-3- and 1-methyl-4-isopropylbenzenes had identical spectra, as did 1-ethyl-2-, 1-ethyl-3- and 1-ethyl-4-methylbenzene. In the chlorobenzene class o- and p-chlorobenzene had identical spectra. In the bromo- and bromochloroalkane/alkene class the 1-, 2- or 3-bromopropene-1 and cyclopropylbromide spectra were identical, as were the 1- and 2-bromopropane spectra. In the mono- and dichloroalkane/alkene class the 1,3- and 1,4-dichlorobutane spectra, the 1,1- and 1,2-dichloroethene spectra and the l,Cdichlorobutene-1 and 1,4-dichlorobutene-2 spectra were identical.

9. Operation of the system

Communications between the user and the system occur via screen messages and keyboard entries. The user is asked for the values of MAXMASS (the largest mass with an intensity of at least 5%), HIMAX1 (the largest mass with an intensity of at least 1%), the intensities of certain key masses and the mass of the base peak. Two examples follow: a spectrum which belongs to one of the target classes and another which is from the unknown class. The first session is with 3-bromo-1-propene, a target compound, which has a molecular weight of 121 Da. This session took 60 s, including data look-up and manual keyboard input. Note that intensity data are ternary encoded.

QUERY                                                         RESPONSE
What is the intensity of mass peak 42? (Enter 0, 0.5 or 1)    0
What is the intensity of mass peak 49?                        0
What is the intensity of mass peak 77?                        0
What is the intensity of mass peak 63?                        0
What is the intensity of mass peak 75?                        0
What is the intensity of mass peak 96?                        0
What is the intensity of mass peak 76?                        0
What is the intensity of mass peak 107?                       0
What is the intensity of mass peak 41?                        1
What is the value of MAXMASS?                                 122
What is the value of HIMAX1?                                  122
What is the base peak?                                        41
What is the intensity of mass peak 67?                        0
What is the intensity of mass peak 122?                       0.5

The system response was: "This compound is a bromo- or bromochloroalkane/alkene and has been identified as 1-, 2-, or 3-bromo-1-propene or cyclopropyl bromide. Its estimated and true molecular weights are 121.0 and 121.0 dalton." The first nine responses were used in the classification module to classify the spectrum as that of a bromo- or bromochloroalkane/alkene. The next two responses for MAXMASS and HIMAX1 were used to calculate the molecular weight, which successfully passed the molecular weight filter. The next two responses for the value of the base peak and the intensity of mass peak 67 were used to successfully pass the two base peak filters. The last response together with the already known base peak and previously computed molecular weight were used to establish the identity of the compound. Note that the estimated and true molecular weights of this member of the target set are identical. Similar results would be obtained with spectra of the other five classes with the appropriate class and molecular weight information displayed. The second session is with p-(acetylamino)phenol, which has a molecular weight of 151 Da and is a member of the unknown class. This session took about 45 s and the system produced the following message: "The class of this compound is unknown, but its estimated molecular weight is 152 ± 5 dalton. A lower limit to its molecular weight is 146.5 dalton." There is one other general type of spectrum which will be processed by the system: a member of one of the five target classes but not a specific member of the target set. In this case the displayed message is, for a nonhalobenzene: "This compound is a nonhalobenzene with an estimated molecular weight of ____ ± 1 dalton. An estimated lower limit to the molecular weight is ____ dalton." This message would be displayed, for example, if a spectrum did not pass the identification module.

10. System evaluation results

The approach used during final refinement of the total system was to extensively test with a variety of spectra, particularly with the base peak interference filters and the identification filters, where the entire NIST data base was used. Classification accuracy was based on the number of incorrect spectral class assignments using the strict definitions of class members described previously. This means that a nitrobenzene, for example, which was finally classified as a nonhalobenzene would be counted as a misclassification since it was not a member of the training classes. With the present system it should be classified as a member of the unknown class. The strict view that no spectrum can be correctly identified if it is misclassified was adopted, and therefore the identification and classification accuracies are usually equal.

10.1. Training and pollutant test spectra

The optimum performance for the system should be obtained by testing with the 106 training spectra from which the rules were derived. The results for these spectra are listed at the end of Table 1. The classification and identification accuracies for this set were 100%, showing that the mathematical structure for classification and identification exists and can be successfully modeled with the variables used. The molecular weight estimates had a median absolute deviation from the true values for the entire training set of 0.0 Da. The average absolute deviation was only 0.17 Da from the true values vs. 0.99 Da for the previous system [9]. These excellent results show that the molecular weight estimation procedure is fundamentally sound. Further tests beyond the original domain of the training spectra were also carried out with a set of 32 spectra [11]. Many of these compounds had been tentatively identified in field ambient air samples, and additional mixed class compounds, which were not in the training set, were also included to test the classification rules. These compounds included 15 alkanes and alkenes; 9 aldehydes, ketones and alcohols; 2 chloroalkanes/alkenes; and 6 substituted benzenes. The benzenes included nitro-, chloro-, fluoro-, bromo- and bromochloro- substituents. The molecular weights of these compounds ranged from 30 Da for formaldehyde to 236 Da for 1,4-dibromobenzene. The classification and identification accuracies with the present system were 97%, with chloroethene, a nontarget monochloroalkane/alkene compound, misclassified as an unknown. Even with this misclassification the molecular weight estimate for chloroethene was within 0.5 Da of the true value. None of these spectra were misidentified as target compounds. The median and average absolute deviations for the molecular weights with this set were 1.0 and 5.6 Da vs. 7 and 13 Da in the previous system [9].

10.2. Random and pharmaceutical spectra

Essentially, with a rule-based expert system one is interpolating within a domain of expertise which is defined by the training set. Therefore, the present system should perform well within the narrowly defined five main classes and to a much lesser extent within the extremely large and diverse unknown class. However, to test the performance limits of the system, 99 NIST reference spectra of compounds with molecular weights less
than 350 were selected at random and used to evaluate the system. The types of compounds included were various hydrocarbons; oxygenated hydrocarbons; nitrogen and sulfur containing compounds; chloro-, bromo- and other halogen substituted hydrocarbons; pyridines and pyrazines; and various substituted benzenes. The benzenes had alkyl-, bromo-, sulfur and oxygen-, nitrogen-, and oxygen- substituents. The wide variety of chemical structures in this set caused many misclassification problems with previous expert systems [9,11]. The list of individual compounds has been published elsewhere [15]. Performance with this set of randomly selected spectra should approximate the worst system performance, since the rules are being extrapolated into unexplored structural space. As seen from Table 2, the median and average absolute deviations of the predicted molecular weights were 1.0 and 7.3 Da vs. 7 and 15 Da for the previous system [9]. These greatly improved molecular weight results show that the rules have a validity well beyond the original domain. The classification and identification accuracy was 97%, with only three nontarget compounds, 1-bromo-2,2-dimethylpropane, 1,4-dichlorocyclohexane and trans-1,4-dimethyl-3-piperidinol, misclassified, the first two as unknown and the last as mono- or dichloroalkane/alkene. This classification result was a great improvement over the previous result of 76% [9] for the same spectra, primarily due to the introduction of the new filters. The identification results verified the effectiveness of the filters, since none of the 99 spectra was misidentified as a target compound.

Another set of spectra from outside the original training domain was also used to test the
ruggedness of the system. A set of 400 mass spectra of compounds of primarily pharmaceutical interest from a Swiss data base was used. They included many substituted benzenes and other chemical structures which may be of interest in environmental samples. The compounds were composed only of carbon, hydrogen, nitrogen and oxygen, and their molecular weights ranged from 58 to 578 Da with an average of 207. A wide variety of structures of volatile and nonvolatile compounds was included in this set, with formulas ranging from C3H4O2 (acrylic acid) to C32H38N2O8 (11-demethoxyreserpine). Most of the structures were in the unknown class, with only eleven nonhalobenzenes included within the present narrow class definitions. Forty percent of the spectra were incomplete, lacking peak data below ca. 55 Da. The results with this evaluation set are listed in Table 2. The median and average absolute deviations of the predicted molecular weights from the true values were 2.0 and 10 Da, which are similar to those for the random spectra. The classification and identification accuracy was 97.5%, with most of the misclassified compounds being apparent or actual members of the nonhalobenzene class. The previous classification accuracy for this set was 63% [10].

Table 2
Performance results with various test sets

Set               Number   Molecular weight absolute deviations   Classification and
                           Median (Da)    Average (Da)            identification accuracy (%)
Pollutant test        32   1.0            5.6                     97
Random NIST           99   1.0            7.3                     97
Pharmaceutical       400   2.0            10.0                    97.5
Overall              531   2.0            9.2                     97.4

10.3. Field GC-MS tests

Another test of the system was made with GC-MS data obtained from field samples of volatile ambient air contaminants collected on Tenax and thermally desorbed into a GC-MS system. The GC column was not a capillary column and unresolved mixtures probably occur in the samples. There are probably also spectra with missing low intensity peaks due to the very low pollutant concentrations. Thirty-seven scans were selected to represent typical results. A very serious problem with the use of these data for evaluation tests is that the identity, and therefore the class and molecular weight, of many of the compounds could not be definitely established. Each spectrum was subjected to a library search using a new optimized algorithm [5] and the NIST data base. These search results were used in combination with previous SIMCA pattern recognition class assignments [9] and knowledge of likely ambient air pollutants to establish identities. The top ranked compounds from the search results were unreliable in a few cases, yielding reactive or very unlikely structures. In these and other cases the class information from the previous SIMCA studies and the fact that some candidate structures from the search list are highly unlikely in ambient air samples were used to eliminate obviously incorrect identities. In some cases compounds with different molecular weights were equally likely and an average molecular weight was used to calculate the deviations. The assumed identities and test results, including lower limits to the molecular weights, are listed in Table 3. The median and average absolute deviations for the molecular weight estimates were 1.0 and 5.9 Da, which are very similar to those for the random spectra. The lower limits deviated from the assumed molecular weights by a median of only 5 Da and an average of 11 Da. In only one case, octanal, was the lower limit higher than the expected molecular weight. The apparent classification and identification accuracy was 95%, with the two target compounds, 1,1,1-trichloroethane and toluene, misclassified as unknown. This latter misclassification problem has been noted with the previous system [9] and is due to the rigid structure of the rules and to the distortion in these particular field spectra. Again, no nontarget spectra were incorrectly identified as target compounds.

Table 3
Performance results with GC-MS field data

                                                                        Molecular weight (Da)
No.   Identification                                                    Expected  Predicted  Lower limit
287   Pentane                                                           72        72         59.5
294   1,1,2-Trichloro-1,2,2-trifluoroethane                             187       155        150
303   2-Methyl-1-propen-1-one/2-methyl-2-propenal                       70        69         65
338   Perfluorotoluene                                                  236       237        231.5
344   1,1,1-Trichloroethane a                                           133       126        115
355   Benzene                                                           78        78         74
390   2,2,3- or 2,2,4-trimethylpentane/2,2-dimethylhexane               114       99         73.5
401   Heptane                                                           100       100        95
415   2- or 3-heptene                                                   98        97         93
425   Methylcyclohexane/4,4-dimethyl-2-pentene                          98        100        95
471   Toluene a                                                         92        109        87.5
490   2-Methylheptane                                                   114       129        101.5
494   2,4-Dimethylpentanal/2-methylpentanal                             114/100   100        87.5
508   Hexanal                                                           100       111        99.5
524   3-Methyleneheptane/3-ethyl-4-methyl-1-pentene                     112       130        92.5
527   1-Octene/pentylcyclopropane                                       112       111        93
538   1,2-; 1,3-; 1,4-Dimethylcyclohexane                               112       111        107
555   2-; 3- or 4-octene                                                112       111        107
570   2-; 3- or 4-octene                                                112       111        107
617   1,1,3-Trimethylcyclohexane/1- or 2-ethyl-2,4-dimethylcyclohexane  126/140   111        106.5
643   Ethylbenzene                                                      106       106        101.5
663   o-; m- or p-xylene                                                106       106        101.5
687   2,4-Dimethylhexane/2,6-dimethylheptane                            114/128   114        94.5
712   o- or p-xylene                                                    106       106        101.5
728   2-Methyl-1-octene/2,6-dimethyl-1-heptene                          126       126        106.5
740   1-Nonene/1-octene                                                 126/112   126        106.5
776   2-; 3- or 4-nonene                                                126       126        121
790   Isopropylbenzene                                                  120       120        115.5
795   2-; 3- or 4-nonene                                                126       126        121
830   Propylcyclohexane/isopropylcyclohexane                            126       126        121
883   1-Ethyl-2- or -4-methylbenzene/1,2,4-trimethylbenzene             120       120        115.5
888   1-Ethyl-2-; -3- or -4-methylbenzene                               120       120        115.5
904   1-Ethyl-2-; -3- or -4-methylbenzene                               120       120        115.5
966   1,2,4-Trimethylbenzene/1-ethyl-2- or -4-methylbenzene             120       120        115.5
974   Octanal                                                           128       148        129
1026  2,4-Dimethylheptane/nonane                                        128       99         87
1082  1-Fluoro-2-; -3- or -4-iodobenzene                                222       223        217.5

Average absolute deviation                                                        5.9        11.0
Median absolute deviation                                                         1          5
Classification and identification accuracy                                        95%

a Compounds marked a were incorrectly classified as unknown.

10.4. System selectivity test

In every case where a base peak alone or with other peaks was used in the filters, a complete sequential search of the 62235 spectra in the NIST mass spectral data base was performed to ensure that no legitimate members of the particular class were being excluded. Compounds of all molecular weights were included in these searches, and the elemental composition and intensity of mass peaks were used as search constraints. Every individual branch in every class filter was checked for selectivity. In addition, each rule in the identification module was checked against the entire data base to ensure that only the target compounds would be identified. With the exceptions noted above in the description of the identification module, only the target compounds were identified. These results show the very high selectivity of the total identification system.
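The selectivity check described in this section amounts to running a rule over every library entry and listing anything it accepts that is not a target. A minimal Python sketch of that loop follows. Everything in it is an assumption made for illustration: the `LibraryEntry` fields, the `toluene_like` rule and the four-entry toy library (nominal values entered by hand, not taken from the NIST data base) are hypothetical stand-ins, and the actual rules of the identification module are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set


@dataclass(frozen=True)
class LibraryEntry:
    """Hypothetical, heavily reduced library record."""
    name: str
    nominal_mw: int  # nominal molecular weight, Da
    base_peak: int   # m/z of the most intense peak


Rule = Callable[[LibraryEntry], bool]


def false_identifications(rule: Rule, library: Iterable[LibraryEntry],
                          targets: Set[str]) -> List[str]:
    """Names of non-target entries that the rule nevertheless accepts."""
    return [e.name for e in library if rule(e) and e.name not in targets]


def missed_targets(rule: Rule, library: Iterable[LibraryEntry],
                   targets: Set[str]) -> List[str]:
    """Names of target entries that the rule rejects."""
    return [e.name for e in library if e.name in targets and not rule(e)]


# Toy library for illustration only (hand-entered nominal values).
TOY_LIBRARY = [
    LibraryEntry("benzene", 78, 78),
    LibraryEntry("pyridine", 79, 79),
    LibraryEntry("toluene", 92, 91),
    LibraryEntry("1,3,5-cycloheptatriene", 92, 91),
]


def toluene_like(entry: LibraryEntry) -> bool:
    """Deliberately loose hypothetical rule: MW 92 with base peak 91."""
    return entry.nominal_mw == 92 and entry.base_peak == 91


if __name__ == "__main__":
    targets = {"toluene"}
    print("false identifications:",
          false_identifications(toluene_like, TOY_LIBRARY, targets))
    print("missed targets:",
          missed_targets(toluene_like, TOY_LIBRARY, targets))
```

A rule would pass such a check only when the first list is empty over the whole library. A non-empty list, as for the loose rule in this toy case, signals that an additional condition is needed before the rule can be called selective.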

11. Conclusions

This empirical system does not employ any spectral matching or data base searching, since the required information extracted from the training set and the auxiliary logic for the filters are embedded in the rules, not in a data base. The general design and approach used in the expert system could be used for other target sets. However, the only way to ensure that a system like this one is actually selective and accurate is by extensive evaluation, which is usually not done due to the time and effort required. This type of evaluation is very tedious and requires a large data base of high quality reference spectra. This system has been extensively evaluated and the results show that the basic logic and empirical rules are correct.

This redesigned system is a great improvement over the previous one [9], particularly with regard to misclassification. For example, with the very difficult random spectra the classification accuracy increased from 76 to 97% due to the more restrictive filters in the present system. Much of the success of the filters and identification modules is due to the very high accuracy of the molecular weight estimators on target compounds. Although the performance data over all data sets are very impressive, the system is not perfect, as can be seen from some remaining misclassification problems in the random and pharmaceutical sets and missed identifications with the GC-MS data. However, it should be remembered that most of the data sets used in the evaluation require extrapolation well beyond the domain of the original training set. The performance of this system is probably near optimal and further major performance improvements will be very difficult to attain. For example, it might be possible to relax the strict identification rules for target compounds and increase the corresponding accuracy. However, this would decrease the selectivity of the total system and cause more problems than it would cure.

The methods used in this system have a number of advantages over existing search techniques. The entire spectrum is not required, and therefore the method will work even with incomplete spectra, as shown with the pharmaceutical spectra. Partial identification information is provided in the form of a class assignment, a molecular weight and a reliable lower limit to the molecular weight. Some molecular weight information is provided even for mixtures, with the estimate being for the component with the highest HIMAX1 value, which is usually the one with the highest molecular weight.

This particular expert system also has some weaknesses. Low molecular weight compounds with few and/or low mass peaks will almost always be classified as unknown due to the classification rules. Spectra contaminated by ions of much higher mass than the molecular ion will yield erroneously high molecular weight predictions and lower limits. Compounds that fragment and leave essentially no detectable masses near the original molecular ion mass will yield erroneously low molecular weight estimates. Examples of this latter problem are some phthalates, nitro compounds and alkanes which occur in the random and pharmaceutical evaluation spectra.

This system should be useful in the interpretation of complex environmental GC-MS or other mass spectra. It will not falsely identify target compounds due to the very strict filters in the identification module. However, it may not identify target compound spectra if the spectra are distorted or contaminated. It could be used in conjunction with manual interpretation or in library searches to limit candidate structures by using the molecular weight and/or lower limit.
As a quality control program, it could be used to verify previously assigned identities of toxic and other target compounds. The classification and molecular weight estimation features can be used for any type of low resolution mass spectra.
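The suggestion of limiting library search candidates with the molecular weight estimate and its lower limit can be sketched in a few lines of Python. This is a hedged illustration, not part of the system itself: the function name `filter_candidates`, the `(name, molecular weight)` candidate format, the hit list and the tolerance of 2 Da are all assumptions chosen for the example. The estimate (78 Da) and lower limit (74 Da) are the Table 3 values for scan 355.

```python
from typing import List, Optional, Sequence, Tuple

Candidate = Tuple[str, float]  # (name, molecular weight in Da); assumed format


def filter_candidates(candidates: Sequence[Candidate],
                      lower_limit: float,
                      estimate: Optional[float] = None,
                      tolerance: Optional[float] = None) -> List[str]:
    """Keep candidates whose molecular weight is consistent with the estimates.

    A candidate is dropped if its molecular weight is below the lower limit,
    or, when both `estimate` and `tolerance` are given, if it lies further
    than `tolerance` Da from the estimated molecular weight.
    """
    kept = []
    for name, mw in candidates:
        if mw < lower_limit:
            continue
        if estimate is not None and tolerance is not None:
            if abs(mw - estimate) > tolerance:
                continue
        kept.append(name)
    return kept


if __name__ == "__main__":
    # Hypothetical library-search hit list with nominal molecular weights.
    hits = [("2-butanone", 72), ("benzene", 78),
            ("pyridine", 79), ("cyclohexane", 84)]
    print(filter_candidates(hits, lower_limit=74))
    print(filter_candidates(hits, lower_limit=74, estimate=78, tolerance=2))
```

The lower limit alone is the more conservative constraint. Even so, Table 3 contains one scan (octanal) in which the lower limit exceeded the expected molecular weight, so a filter of this kind is better treated as a ranking aid than as a hard exclusion.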

Molecular weight accuracies of 1-2 Da (median) or 6-10 Da (average) are expected, with classification accuracies of ca. 95% or greater. Free copies of the classification and molecular weight programs [10,15] as well as the present system are available from the author if a formatted floppy disk is sent to the address above.
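The median and average absolute deviations quoted for the field data can be recomputed from the expected and predicted molecular weights in Table 3, as in the following Python sketch. Where two expected values are listed, their mean is used, as described in Section 10.3. The helper names are hypothetical. The `pooled` function is a further assumption: the text does not say how the "Overall" row of Table 2 was obtained, and the sketch only shows that a size-weighted mean of the per-set values given in the text is consistent with it.

```python
from statistics import mean, median

# (expected molecular weight(s), predicted molecular weight) in Da,
# transcribed from Table 3 in scan order (37 scans).
TABLE3 = [
    ((72,), 72), ((187,), 155), ((70,), 69), ((236,), 237), ((133,), 126),
    ((78,), 78), ((114,), 99), ((100,), 100), ((98,), 97), ((98,), 100),
    ((92,), 109), ((114,), 129), ((114, 100), 100), ((100,), 111),
    ((112,), 130), ((112,), 111), ((112,), 111), ((112,), 111),
    ((112,), 111), ((126, 140), 111), ((106,), 106), ((106,), 106),
    ((114, 128), 114), ((106,), 106), ((126,), 126), ((126, 112), 126),
    ((126,), 126), ((120,), 120), ((126,), 126), ((126,), 126),
    ((120,), 120), ((120,), 120), ((120,), 120), ((120,), 120),
    ((128,), 148), ((128,), 99), ((222,), 223),
]


def absolute_deviations(rows):
    """|mean(expected) - predicted| for each scan."""
    return [abs(mean(expected) - predicted) for expected, predicted in rows]


def pooled(values, sizes):
    """Size-weighted mean of per-set values (assumed pooling scheme)."""
    return sum(v * n for v, n in zip(values, sizes)) / sum(sizes)


if __name__ == "__main__":
    devs = absolute_deviations(TABLE3)
    print("average absolute deviation:", round(mean(devs), 1))  # 5.9
    print("median absolute deviation:", median(devs))           # 1

    sizes = [32, 99, 400]  # pollutant test, random NIST, pharmaceutical
    print("pooled average deviation:",
          round(pooled([5.6, 7.3, 10.0], sizes), 1))            # 9.2
    print("pooled accuracy:",
          round(pooled([97, 97, 97.5], sizes), 1))              # 97.4
```

The median cannot be pooled in this way, which is why only the average deviation and the accuracy are cross-checked against the "Overall" row.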

Disclaimer

The information in this document has been subjected to Agency review and approved for publication. The mention of trade names or commercial products does not constitute endorsement or recommendation for use.

References

[1] D.P. Martinsen, Survey of computer aided methods for mass spectral interpretation, Applied Spectroscopy, 35 (1981) 255-266.
[2] R.G. Dromey, Data systems for mass spectrometry, Finnigan MAT Spectra, 10 (1984) 3-10.
[3] S.E. Stein, Estimating probabilities of correct identification from results of mass spectral library searches, Journal of the American Society of Mass Spectrometry, (1994) in press.
[4] K. Varmuza, W. Werther, D. Henneberg and B. Weimann, Computer-aided interpretation of mass spectra by a combination of library search with principal components analysis, Rapid Communications in Mass Spectrometry, 4 (1990) 159-162.
[5] S.E. Stein and D.R. Scott, Optimization and testing of mass spectral library search algorithms for compound identification, Journal of the American Society of Mass Spectrometry, (1994) submitted.
[6] F.W. McLafferty and D.B. Stauffer, Retrieval and interpretative computer programs for mass spectrometry, Journal of Chemical Information and Computer Science, 25 (1985) 245-252.
[7] S. Sokolow, J. Karnofsky and P. Gustafson, The Finnigan Library Search Program, Finnigan Application Report, No. 2, March, 1978.
[8] E.D. Pellizarri, T. Hartwell and J.A. Crowder, Comparative Evaluation of GC/MS Data Analysis Processing, Project Report PB-85125664, US Environmental Protection Agency, Research Triangle Park, NC, 1985.
[9] D.R. Scott, Pattern recognition/expert system for mass spectra of volatile toxic and other compounds, Analytica Chimica Acta, 265 (1992) 43-54.
[10] D.R. Scott, A. Levitsky and S.E. Stein, Large scale evaluation of a pattern recognition/expert system for mass spectral molecular weight estimation, Analytica Chimica Acta, 278 (1993) 137-147.
[11] D.R. Scott, Rapid and accurate method for estimating molecular weights of organic compounds from low resolution mass spectra, Chemometrics and Intelligent Laboratory Systems, 16 (1992) 193-202.
[12] D.R. Scott, Classification of binary mass spectra of toxic compounds with an inductive expert system and comparison with SIMCA class modeling, Analytica Chimica Acta, 211 (1988) 11-29.
[13] D.R. Scott, 1ST CLASS and FUSION expert shell systems (Software Review), Chemometrics and Intelligent Laboratory Systems, 8 (1990) 245-247.
[14] D.R. Scott, Improved method for estimating molecular weights of volatile organic compounds from low resolution mass spectra, Chemometrics and Intelligent Laboratory Systems, 12 (1991) 189-200.
[15] D.R. Scott, Empirical pattern recognition/expert system for molecular weight estimation of low resolution mass spectra, Analytica Chimica Acta, 285 (1994) 209-222.
[16] P.J. Huber, Robust Statistical Procedures, Society for Industrial and Applied Mathematics, Philadelphia, PA, 1977, p. 3.