Analytica Chimica Acta 427 (2001) 233–244
Three-step procedure for infrared spectrum interpretation☆
F. Ehrentreich∗
Institut für Analytische Chemie, Technische Universität Dresden, D-01062 Dresden, Germany
Received 15 September 1998; received in revised form 15 June 2000; accepted 13 September 2000
Abstract

A stepwise procedure is proposed that regards computer-supported structure elucidation relying on infrared spectra as leading to hypotheses that have to be verified in subsequent steps. The test set compounds were selected by a preliminary counter-propagation neural network prediction. The phenolic group was investigated as the target substructure; however, structures not containing that group were also included because of false positive predictions of the neural network. For verification, library search procedures, including the partial cross correlation approach, were applied together with maximum common substructure analysis. In the majority of instances, erroneous propositions of the neural network could be corrected. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Infrared spectroscopy; Computer-supported structure elucidation; Spectrum–structure-correlation; Partial cross correlation approach; Maximum common substructure; Neural networks
☆ Presented at the conference COMPANA incorporating LISMS (Chemometrics and Data Evaluation in Analytical Chemistry), Duisburg, 15–18 September 1998.
∗ Present address: Institut für Biochemie, Universität zu Köln, Zülpicher Strasse 47, D-50674 Köln, Germany. Tel.: +49-221-470-6436; fax: +49-221-470-5092. E-mail address: [email protected] (F. Ehrentreich).
0003-2670/01/$ – see front matter © 2001 Elsevier Science B.V. All rights reserved. PII: S0003-2670(00)01199-5

1. Introduction

Computer-supported interpretation of infrared spectra, i.e. qualitative spectral analysis, may lead to the recognition of a structure within a cycle that consists of the following four steps.

1. Derivation of chemical structures or substructures from the infrared spectrum: solution of the inverse spectrum–structure-correlation (SSC) problem, and decision finding to accept the right (true positive and/or true negative) fragments and to reject the wrong (false positive and/or false negative) ones. This step may be subdivided according to the procedures that solve the inverse SSC problem:
   1.1. Library search for comparing an unknown spectrum with a set of reference spectra (see [1] for an overview).
   1.2. Numerical methods to calculate decision functions between classes of compounds:
        • regression procedures (PCA, PLS) [2];
        • pattern recognition procedures (see [1] for an overview);
        • artificial neural networks [2–9].
   1.3. Application of logic using rules derived by infrared spectroscopy experts, sometimes denoted as expert systems [10–15].
   1.4. Dialog or interactive systems, such as SpecTool [16].
2. Generation of candidate structures by a structure generator.
3. Simulation of spectra (solving the primary or direct SSC problem).
4. Comparison of simulated spectra with the sample spectrum and decision finding to accept the right structure.

A joint application of diverse procedures may also be performed. Several reasons prevent the cycle (steps 1–4) from being carried out automatically. The first is insufficient representation, i.e. missing reference spectra for the library search; methods that belong to category 1.2 or to step 4 suffer from it as well. The second concerns the insufficient reliabilities of items 1.2 and 1.3. Insufficient reliability of computer-supported structure elucidation is not limited to infrared spectroscopy; however, much progress has been achieved with 13C-NMR spectroscopy, leading to truly automated systems [17]. Suppose the reliabilities of the proposed fragments are normalised to the range 0–1, where 0 denotes a fully unreliable and 1 a completely reliable hypothesis. If the reliabilities are not close to 1, no automated strategy exists at present to select the right fragments as goodlist restrictions for the structure generator and to discard the false ones. Solving the problem by a combinatorial sequence weighted by the reliabilities seems hopeless. At present it is assumed, explicitly or implicitly, that a human takes the decision in interactive mode.
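A small numerical sketch may make the reliability argument concrete. It assumes, purely for illustration, that the fragment hypotheses are independent of each other; that assumption, the example values and the function names are not taken from the text.

```python
def joint_reliability(reliabilities):
    """Probability that every accepted fragment is correct.

    Assumes independent hypotheses (an assumption of this sketch).
    """
    product = 1.0
    for r in reliabilities:
        if not 0.0 <= r <= 1.0:
            raise ValueError("reliabilities must lie in the range 0-1")
        product *= r
    return product


def number_of_goodlist_candidates(n_fragments):
    """Number of accept/reject combinations for n uncertain fragments."""
    return 2 ** n_fragments


# Ten proposed fragments, each 90% reliable (illustrative values only).
print(round(joint_reliability([0.9] * 10), 3))  # 0.349
print(number_of_goodlist_candidates(10))        # 1024
```

Under these assumptions, even fragments that are individually 90% reliable give a goodlist that is entirely correct only about one time in three, while the number of accept/reject combinations grows exponentially with the number of fragments.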
Consequently, one may speak of automating partial steps of infrared-supported structure elucidation, while the complete process is driven interactively. In unfavourable cases the black-box character of procedures such as neural networks hinders the evaluation of the results, because the “reasoning scheme” is hidden. Further, the reliabilities evaluate the overall performance of the system under investigation in an averaged, statistical sense, but they cannot evaluate the actual, single case. In contrast to quantitative analysis, however, exactly that is demanded in qualitative analysis, because in the end a strict yes/no question has to be answered.
Fig. 1. Strategies to improve the reliability of computer-based qualitative infrared spectral analysis.

Fig. 2. Three-step procedure for computer-supported qualitative infrared spectral analysis consisting of the steps hypothesis building, verification and conclusion (CPNN, counter-propagation neural network; MCSS, maximum common substructure search; PCCF, partial cross correlation function; SSC, spectrum structure correlation; the term prototype is used to characterise classes or subclasses of chemical compounds, i.e. generic chemical structures with generalised spectral properties).

2. Experimental

The collection of 13938 infrared spectra stems from the SpecInfo infrared database [18]. The spectra were available in pre-processed format, i.e. standard operations such as baseline correction and smoothing had already been performed. Explicit quality criteria are not assigned to the spectra; however, in some instances the purity is given. Transmission spectra were transformed to absorbance and renormalised from the standard normalisation of the SpecInfo infrared database (range 0–999) to the interval 0–1. A cubic spline interpolation procedure was applied to ensure equidistant spectral points with a step size of 2 cm−1 within the library. The spectra cover the interval 400–4000 cm−1. Most of the original database spectra have a wavenumber increment of 1.93 cm−1.

The structures are stored in an ISIS/BASE [19] structural database. The substructure search integrated in the ISIS/DRAW/BASE system for a PC environment was sufficient for searching the phenolic fragment in the hitlists. However, to derive large common fragments not previously defined, the program ToSiM [20] was applied, which relies on the maximum common substructure (MCSS) approach [21–23]. ToSiM includes rules to derive large common substructures that occur frequently in an analysed hitlist. Because the substructure need not be common to all structures of the hitlist, fragments larger than the maximum common substructure (in the strict sense) are derived in general.

The computations were performed on a PC using the MATLAB package [24]. The 64 test structures and spectra used in the current investigations were chosen from a subset of the SpecInfo database (1446 chemical structures and infrared spectra) that Zupan had selected by means of Kohonen neural networks as a small representative subset for studying spectrum structure correlations efficiently [25]. The CPNN program was used as programmed by Novic and Zupan [7].
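The pre-processing chain described above (transmission to absorbance, rescaling to 0–1, cubic spline resampling to a 2 cm−1 grid) can be sketched as follows. The original computations were done in MATLAB; Python with NumPy and SciPy is substituted here. The order of the operations, the interpretation of the 0–999 scale as transmission, the floor value used to avoid log(0), and the function name are all assumptions of this sketch, not details given in the text.

```python
import numpy as np
from scipy.interpolate import CubicSpline


def preprocess_spectrum(wavenumbers, transmission, step=2.0, low=400.0, high=4000.0):
    """Return (grid, absorbance) on an equidistant grid, scaled to 0-1.

    Assumes `transmission` uses the 0-999 scale and that `wavenumbers`
    covers the interval [low, high]; both are assumptions of this sketch.
    """
    wn = np.asarray(wavenumbers, dtype=float)
    t = np.asarray(transmission, dtype=float) / 999.0
    # Avoid log(0) for fully absorbing points (the floor value is an assumption).
    t = np.clip(t, 1e-3, 1.0)
    absorbance = -np.log10(t)

    # CubicSpline needs strictly increasing x values.
    order = np.argsort(wn)
    spline = CubicSpline(wn[order], absorbance[order])

    n_points = int(round((high - low) / step)) + 1
    grid = low + step * np.arange(n_points)
    # Clip small negative overshoots of the spline to zero.
    resampled = np.clip(spline(grid), 0.0, None)

    peak = resampled.max()
    if peak > 0:
        resampled = resampled / peak
    return grid, resampled


# Synthetic single-band spectrum on a roughly 1.93 cm-1 grid (illustrative only).
wn = np.linspace(400.0, 4000.0, 1866)
trans = 999.0 * (1.0 - 0.8 * np.exp(-((wn - 1600.0) / 30.0) ** 2))
grid, absorbance = preprocess_spectrum(wn, trans)
print(len(grid), grid[1] - grid[0])
```

With a 2 cm−1 step the interval 400–4000 cm−1 yields 1801 points per spectrum, so all library spectra share one wavenumber axis and can be compared point by point.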
Table 1
Parameter set of the counter-propagation neural network and results for the phenolic group

Parameter set                    Value
Training set
  Number of spectra              749
  Dimension of layer             30 × 30
  Dimension of spectrum vector   128
  Number of fragments            34
  Pre-processing                 Hadamard transform
  Initialisation of weights      Randomised (0.0–1.0)
  Epochs                         100
  Correction function            Triangle
  Maximum correction             0.5
  Minimum correction             0.01
  Toroid                         –
  Stimulated neurons (%)         51
  RMS                            0.19
  Threshold for phenols          0.3
  Phenols (true positive)        11
  Phenols (false positive)       8
  Phenols (false negative)       0
  Phenols (true negative)        730
Test set
  Number of spectra              695
  RMS                            0.25
  Phenols (true positive)        16
  Phenols (false positive)       8
  Phenols (false negative)       23
  Phenols (true negative)        648
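As a worked example, the four counts per set in Table 1 can be turned into the usual figures of merit. The counts are taken from Table 1; the choice of sensitivity, specificity and precision as summary values, and the function name, are illustrative assumptions rather than quantities reported in the source.

```python
def figures_of_merit(tp, fp, fn, tn):
    """Summary statistics for a two-class (phenol / no phenol) prediction."""
    return {
        "n": tp + fp + fn + tn,
        "sensitivity": tp / (tp + fn),   # share of phenols that were found
        "specificity": tn / (tn + fp),   # share of non-phenols correctly excluded
        "precision": tp / (tp + fp),     # share of phenol predictions that are right
    }


training = figures_of_merit(tp=11, fp=8, fn=0, tn=730)
test = figures_of_merit(tp=16, fp=8, fn=23, tn=648)

for name, m in (("training", training), ("test", test)):
    print(name, m["n"], round(m["sensitivity"], 2),
          round(m["specificity"], 3), round(m["precision"], 2))
# training 749 1.0 0.989 0.58
# test 695 0.41 0.988 0.67
```

The counts add up to the stated set sizes (749 and 695). The precision values show that roughly one third or more of the positive phenol predictions are wrong, which is the motivation for treating the network output as a hypothesis to be verified.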
3. Results and discussion

In general the reliability of derived fragments (methods 1.2 and 1.3 in the scheme of Section 1) is insufficient. Where estimated, the numerical values for many of the substructures differ too much from 1 to be useful in an automated mode (cf. [3,7]). Hence, two general strategies may be followed, as shown in Fig. 1. The first is more straightforward, because it aims to raise the reliabilities close to 1. However, as long as this aim is not realised, improved computer support for reaching correct decisions is also necessary and valuable. Further, experience gained along that second route will reveal weaknesses in the first and help to improve it. As long as the ultimate goal of fully automatic interpretation has not been reached, it is no restriction if post-processing of the obtained information is designed to run in interactive mode. The starting point for post-processing is the hypothetical character of the derived fragments. In this context, the qualitative analysis for obtaining structural fragments should be performed in three steps, as shown in Fig. 2. In the middle of Fig. 2, the three steps are highlighted. In fact, the number of steps is not important; it could also have been four, if the first step would
be split into a learning step and a test step. However, it should be emphasised that the first step leads to hypotheses using SSCs. As an example of its realisation, the application of a counter-propagation neural network (CPNN) is given in the rightmost column. Because of the hypothetical character of the derived fragments, verification is necessary as a second step. It should be performed in the spectrum space, either by the partial cross correlation function (PCCF) approach [15] or by a library search over the complete spectrum. In the latter case the cross correlation function was applied as the similarity measure [15,26]. As a last step, conclusions about the absence or presence of the proposed fragments have to be drawn. It seems reasonable to do that in the discrete space of the structures by an MCSS analysis, e.g. with ToSiM [20]. The tendency of the variance of the SSCs is shown schematically in the left column. For structure elucidation by infrared spectroscopy, in contrast to 13C-NMR spectroscopy [17], the variance of the prototypes is often too high to reach reliable conclusions. Evidently, it tends to be smaller if individual reference species, rather than prototypes, are compared with the sample.

As the hypothesis builder a CPNN was applied in the current investigations. In addition to the verification/rejection of structural hypotheses, the CPNN was to be tested regarding its capability to select characteristic intervals for use by the PCCF method. The parameters of the CPNN and the figures of merit for the phenolic group are given in [27] and are summarised in Table 1. The selection of intervals to be investigated is motivated by different experiences. On the one hand, it could be shown [15] that in some instances the investigation of selected characteristic intervals according to the knowledge of SSCs was superior to the analysis of the complete spectrum. On the other hand, Varmuza et al. [28] have derived large fragments in many cases from an investigation of the complete range of the infrared spectrum using a highly sophisticated MCSS procedure with an advanced version of ToSiM [20].

Fig. 3. Spectral investigation scheme. The first level consists of the complete spectral range. The second level contains standard subintervals, cf. Section 3. Level 3 is due to decomposition of the prototype spectrum that was learned by the CPNN into characteristic intervals.

Table 2 Analysis of 3-(2-propynyloxy)-phenol in the ranges 4000–400, 2000–400 and 1700–1550 cm−1 (true positive by CPNN: confirmed by confirmation step)

The investigated ranges may be categorised by three levels, as shown in Fig. 3. The first level consists
of the complete spectral range. It corresponds to the MCSS analysis of a classical library search. In contrast to the strategy developed by Varmuza et al. [28], the procedure applied in the present investigations gave more weight to the first hits. The presence of phenol as a large common substructure was accepted if substructures of the first three hits (3/3) were isomorphic to it. Phenol was also considered present if four out of five (4/5) or five out of seven (5/7) hits contained it. The test structure itself is indicated by hit number zero.

The second level contains standard subintervals. The boundary at 2000 cm−1 often separates the infrared spectrum into a part with higher resolution (wavenumbers below 2000 cm−1) and a part with lower resolution (wavenumbers above 2000 cm−1). Note, however, that in the current investigations the resolution of the spectrum is uniform, as described in Section 2. The interval 1000–1500 cm−1 was chosen according to the definition of the fingerprint region [29]. The scheme should not be considered rigid. It may be adapted to other characteristic and well-separated regions, e.g. the vibrations of functional groups such as O–H, triple bonds or C=O.

Level 3 results from the decomposition of the prototype spectrum learned by the CPNN into characteristic intervals. For the present investigations the selection was performed interactively. The prototype spectrum, as shown in Fig. 3, corresponds to the normalised weights of the Kohonen layer of the CPNN. A projection was performed from the output layer to the Kohonen layer via the neuron with the highest weight for the phenolic group. An enhancement of the method is being considered, in which the weight spectrum is chosen according to the actual neuron in the output layer that indicated the considered fragment. As a prerequisite for that procedure, the automation of “valley picking” to select the characteristic intervals is in progress.

Hence, two types of information are taken from the hypothesis building and used for further refinement of the qualitative analysis. First, the structural information of the obtained fragment is used and second, the spectral information regarding characteristic spectral intervals is exploited, both considering the spectrum–structure-correlations in a hypothetical way. Note some general problems of the CPNN-based learning of SSCs. Some of the learned intervals are not characteristic, such as the region of the C–H stretching vibrations (3120–2600 cm−1). That is because almost all compounds used in the training set and, moreover, in organic chemistry contain aliphatic and/or aromatic C–H fragments.

Table 3 Analysis of α-[(aminocarbonyl)amino]-4-hydroxy-benzeneacetic acid in the ranges 4000–400 and 2000–400 cm−1 (true positive by CPNN: erroneously rejected by confirmation step)

Table 4 Analysis of 4-chloro-5-[[(3-nitrophenyl)methyl]sulfonyl]-2-phenyl-3(2H)-pyridazinone in the ranges 4000–400 and 2000–400 cm−1 (false positive by CPNN: true negative by confirmation step)

The results of the CPNN concerning the phenolic fragment are summarised in Table 1, separately for the training and the test set. These results are regarded as hypotheses and should be confirmed or rejected by the verification and conclusion steps according to Fig. 2. The investigations were restricted to the evaluation of the goodlist fragments, i.e. false positive determinations should be excluded; however, an outlook on the check of badlist fragments is also given. With respect to the training set, 10 of the 11 true positives were confirmed, i.e. one structure was erroneously excluded afterwards as not containing the phenolic group. For seven of the eight false positives of the training set, the phenolic group was considered not to be present. Fourteen of the fifteen true positives of the test set were confirmed and all eight false positives were rejected or at least called into question.
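The acceptance criterion used in the conclusion step (phenol present in 3 of the first 3, 4 of the first 5, or 5 of the first 7 hits) can be written down compactly. This is a minimal sketch: the function and constant names are hypothetical, the input is assumed to be a rank-ordered list of booleans with the test structure itself (hit number zero) already removed, and "four out of five" is read as "at least four".

```python
# (required hits containing the fragment, number of top hits inspected)
ACCEPTANCE_RULES = ((3, 3), (4, 5), (5, 7))


def fragment_accepted(hit_contains_fragment, rules=ACCEPTANCE_RULES):
    """Decide whether a fragment is accepted from a ranked hitlist.

    hit_contains_fragment: booleans ordered by rank, best hit first.
    """
    for required, inspected in rules:
        top = hit_contains_fragment[:inspected]
        # A rule only applies if enough hits are available to inspect.
        if len(top) == inspected and sum(top) >= required:
            return True
    return False


print(fragment_accepted([True, True, True]))                              # True (3/3)
print(fragment_accepted([True, False, True, True, True]))                 # True (4/5)
print(fragment_accepted([False, True, True, False, True, True, True]))    # True (5/7)
print(fragment_accepted([False, False, True, True, True, True, False]))   # False
```

Because the rules are evaluated on progressively longer prefixes of the hitlist, the first hits carry more weight than later ones, as intended by the procedure.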
Table 5 ToSiM analysis of 4-chloro-5-[[(3-nitrophenyl)methyl]sulfonyl]-2-phenyl-3(2H)-pyridazinone in the range 2000–400 cm−1
An example of the confirmation of the phenolic group is shown in Table 2. The phenolic fragment is derived not only from the complete spectrum but also from smaller intervals, giving more evidence for its determination. As shown in Table 2, the spectral interval 1700–1550 cm−1 selected from the CPNN analysis proved to be informative during the confirmation step. As an additional advantage of the refining procedure, a further chemical environment, the ether group, could even enlarge the derived target fragment.
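The idea of restricting the library comparison to a characteristic interval can be sketched as follows. The PCCF itself is defined in [15] and is not reproduced here; this sketch substitutes a plain normalised cross correlation evaluated inside the chosen wavenumber window, which is a simplified stand-in. All function names and the toy data are hypothetical.

```python
import math


def interval_slice(wavenumbers, values, low, high):
    """Keep only the points whose wavenumber lies in [low, high]."""
    return [v for w, v in zip(wavenumbers, values) if low <= w <= high]


def normalised_cross_correlation(a, b, max_lag=0):
    """Largest normalised cross-correlation value for shifts up to max_lag points."""
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    if norm == 0.0:
        return 0.0
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for i, x in enumerate(a):
            j = i + lag
            if 0 <= j < len(b):
                total += x * b[j]
        best = max(best, total / norm)
    return best


def rank_library(sample, library, wavenumbers, low, high, max_lag=0):
    """Return (name, score) pairs sorted by similarity within [low, high] cm-1."""
    s = interval_slice(wavenumbers, sample, low, high)
    scores = []
    for name, reference in library.items():
        r = interval_slice(wavenumbers, reference, low, high)
        scores.append((normalised_cross_correlation(s, r, max_lag), name))
    scores.sort(reverse=True)
    return [(name, score) for score, name in scores]


# Toy spectra on the 2 cm-1 grid: one band each (illustrative only).
wn = list(range(400, 4001, 2))
band = lambda centre, width: [math.exp(-((w - centre) / width) ** 2) for w in wn]
sample = band(1600, 30)
library = {
    "ref_a": [s + p for s, p in zip(sample, band(3000, 30))],  # same band plus another
    "ref_b": band(1200, 20),                                   # unrelated band
}
print(rank_library(sample, library, wn, 1550, 1700)[0][0])
```

In the toy data, ref_a matches the sample perfectly inside 1700–1550 cm−1 although it differs elsewhere, which illustrates why a hit can rank higher on a characteristic interval than on the complete spectrum.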
Table 6 Analysis of 3,5-dimethoxy-benzenemethanol in the ranges 4000–400 and 2000–400 cm−1 (false positive by CPNN: false positive by the confirmation step)

The phenolic fragment of α-[(aminocarbonyl)amino]-4-hydroxy-benzeneacetic acid shown in Table 3 was correctly recognised by the counter-propagation neural network. However, that conclusion could not be derived from the hitlists computed by the cross correlation approach. The phenolic group was not contained within the first hits, neither in the ranges 4000–400 and 2000–400 cm−1 shown in Table 3 nor in other ranges or their combinations. Obviously, other fragments had a larger influence on the spectrum. Note that the aminocarbonylamino fragment could be determined from the restricted interval of 2000–400 cm−1; however, it does not appear as a common substructure when the complete spectrum is evaluated. A more thorough evaluation of the interval 2000–400 cm−1 by application of the ToSiM program [20] is shown in the rightmost column. The six best fragments extracted by ToSiM are listed. The proportion of their presence within the first 19 hits
Table 7 Analysis of 2-t-butyl-p-cresol in the range 4000–400 cm−1 and by joint interpretation of the intervals 1300–1150, 1700–1550 and 3700–3120 cm−1 (false negative by CPNN: true positive by confirmation step)
is shown. That analysis also shows that the aminocarbonylamino fragment frequently occurs within the first hits.

In Table 4 hitlists for the analysis of 4-chloro-5-[[(3-nitrophenyl)methyl]sulfonyl]-2-phenyl-3(2H)-pyridazinone are shown. Neither the analysis of the complete spectrum nor that of the interval 2000–400 cm−1 confirmed the presence of the phenolic group, which tells the analyst to consider its rejection. It may simply be deduced that larger fragments are less frequently represented than smaller ones, both in a structural database and in the hitlist resulting from a library search. Such a result is shown in Table 5. The rather small benzene fragment was contained in the complete set of 19 hits. However, a much larger common substructure could be assigned if the hitlist was restricted to the seven best results. Further, when only seven hits were analysed, the chlorine substituent was contained in six of them and the remaining structure contained bromine in that position. An interactive evaluation of these findings may lead to a specification of the structure. Note that unions of these large fragments with the nitrobenzene fragment restrict the set of structural hypotheses enormously. In practice, such conclusions cannot be reached in a simple way. However, they could usefully be taken into account in the structure generation and spectra simulation steps if additional information from other spectroscopic methods were available.

In Table 6 both the CPNN and the refinement procedure conclude a false positive fragment. The large common substructures derived by ToSiM result from the analysis of the complete spectrum; however, the benzenemethanolic group was recognised neither by analysis of the complete spectrum nor by the analysis of smaller intervals. The analysis of that structure is a really difficult task for infrared spectroscopy. Additional information has to be drawn from other spectroscopic methods in such cases.
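The trade-off between fragment size and frequency in the hitlist can be illustrated by counting how often each fragment occurs among the first n hits. ToSiM derives its fragments by graph-based maximum common substructure analysis; the sketch below does not do that and instead assumes each hit has already been annotated with fragment labels. The labels, the hitlist and the function name are hypothetical and only mirror the counts quoted in the text (chlorine in six of seven hits, bromine in the remaining one).

```python
from collections import Counter


def fragment_proportions(hitlist_fragments, n_hits):
    """Fraction of the first n_hits structures that contain each fragment label."""
    top = hitlist_fragments[:n_hits]
    if not top:
        return {}
    counts = Counter()
    for fragments in top:
        counts.update(set(fragments))  # count each label once per hit
    return {label: count / len(top) for label, count in counts.items()}


# Illustrative hitlist: benzene everywhere, chlorine in six of the first seven hits.
hitlist = [{"benzene", "chlorine"}] * 6 + [{"benzene", "bromine"}]
proportions = fragment_proportions(hitlist, n_hits=7)
for label in sorted(proportions, key=proportions.get, reverse=True):
    print(label, round(proportions[label], 2))
```

Varying `n_hits` shows the effect described for Table 5: a short hitlist supports larger, more specific fragments, while a long one leaves only small fragments such as benzene with full coverage.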
With Table 7 the basic idea of verifying a hypothesis is abandoned: the CPNN did not indicate the phenolic group for the infrared spectrum of 2-t-butyl-p-cresol. As an example and for comparison, the analysis with respect to characteristic intervals was performed in the same manner as for the structures given before. It can be seen from Table 7 that the analysis of the complete spectrum as well as the analysis by the PCCF approach of the intervals 3700–3120, 1700–1550 and 1300–1150 cm−1 clearly assigned the phenolic group. Further, from the best hits an even larger substructure (2-t-butyl-phenol) could be derived.

4. Conclusions

Given the currently insufficient reliabilities of computer-supported structure elucidation relying on infrared spectroscopy, it is necessary to provide the analyst with additional information. First, it should support the analyst in excluding erroneously determined fragments; here this concerns merely the rejection of false positive fragments. Second, the procedure should give some insight into the way the conclusions were drawn for the actual sample. A test set was chosen with structures proposed by a counter-propagation neural network as containing the phenolic group as substructure. In most cases, the verification could already be done by a library search with subsequent MCSS analysis. Furthermore, the intervals selected from the CPNN were in some instances successfully applied in a PCCF/MCSS analysis. The evaluation is supported by the fact that the species from which the conclusions were drawn are shown. That is a general advantage of comparison with individual species over prototype analysis. False rejections of correct hypotheses derived from the CPNN and, even more, the confirmation of a false positive fragment have shown the limitations. Obviously, the distribution of representatives in the substructure–subspectrum space is not dense enough to reach correct conclusions in such cases. Other analytical information must be considered together with infrared spectral data in those instances.

Acknowledgements

The work was funded by the “Deutsche Forschungsgemeinschaft”. The support of Chemical Concepts, Weinheim, FRG, for making the SpecInfo infrared spectral database available to the author is gratefully acknowledged.

References

[1] H.J. Luinge, Vib. Spectrosc. 1 (1990) 3.
[2] T. Visser, H.J. Luinge, J.H. van der Maas, Anal. Chim. Acta 296 (1994) 141.
[3] E.W. Robb, M.E. Munk, Mikrochim. Acta I (1990) 131.
[4] M.E. Munk, M.S. Madison, E.W. Robb, J. Chem. Inf. Comput. Sci. 36 (1996) 231.
[5] R.J. Fessenden, L. Györgyi, J. Chem. Soc., Perkin Trans. 2 (1991) 1755.
[6] M. Meyer, K. Meyer, H. Hobert, Anal. Chim. Acta 282 (1993) 407.
[7] M. Novic, J. Zupan, J. Chem. Inf. Comput. Sci. 35 (1995) 454.
[8] C. Klawun, C.L. Wilkins, J. Chem. Inf. Comput. Sci. 36 (1996) 249.
[9] J. Zupan, M. Novic, J. Gasteiger, Chemom. Intell. Lab. Syst. 27 (1995) 175.
[10] M.E. Elyashberg, L.A. Gribov, V.V. Serov, Molecular Spectral Analysis and the Computer, Nauka, Moscow, 1980 (in Russian).
[11] M.E. Elyashberg, E.R. Martirosian, Y.Z. Karasev, H. Thiele, H. Somberg, Anal. Chim. Acta 337 (1997) 265.
[12] V.V. Serov, L.A. Gribov, M.E. Elyashberg, J. Mol. Struct. 129 (1985) 183.
[13] F. Ehrentreich, U. Dietze, U. Meyer, H. Schulz, H.-M. Klötzer, S.J. Abbas, M. Otto, Fresenius J. Anal. Chem. 354 (1996) 829.
[14] F. Ehrentreich, Fresenius J. Anal. Chem. 357 (1997) 527.
[15] F. Ehrentreich, Fresenius J. Anal. Chem. 359 (1997) 56.
[16] M. Cadisch, E. Pretsch, Fresenius J. Anal. Chem. 344 (1992) 173.
[17] M. Will, W. Fachinger, J.R. Richert, J. Chem. Inf. Comput. Sci. 36 (1996) 221.
[18] SpecInfo, Chemical Concepts, Weinheim, Germany.
[19] ISIS, MDL Information Systems, San Leandro, USA.
[20] H. Scsibrany, K. Varmuza, Fresenius J. Anal. Chem. 344 (1992) 220.
[21] M.M. Cone, R. Venkataraghavan, F.W. McLafferty, J. Am. Chem. Soc. 99 (1977) 7668.
[22] J.J. McGregor, P. Willett, J. Chem. Inf. Comput. Sci. 21 (1981) 137.
[23] L. Chen, W. Robien, J. Chem. Inf. Comput. Sci. 32 (1992) 501.
[24] Matlab 5.0, The Mathworks Inc., Natick, USA.
[25] J. Zupan, Technical Report, BMBF-Project on Interpretation of Infrared Spectra, Ljubljana, Slovenia, 1995.
[26] G. Horlick, G.M. Hieftje, Correlation methods in chemical data measurement, in: D.M. Hercules, G.M. Hieftje, L.R. Snyder, M.A. Evenson (Eds.), Contemporary Topics in Analytical and Clinical Chemistry, Plenum Press, New York, 1978.
[27] F. Ehrentreich, M. Novic, S. Bohanec, J. Zupan, Bewertung von IR-Spektrum–Struktur-Korrelationen mit Counterpropagation-Netzen, in: J. Gasteiger (Ed.), Proceedings of the 10th Workshop on Computers in Chemistry, 1995, GDCh, Frankfurt am Main, 1996, pp. 271–292.
[28] K. Varmuza, P.N. Penchev, H. Scsibrany, J. Chem. Inf. Comput. Sci. 38 (1998) 420.
[29] H. Günzler, H. Böck, IR-Spektroskopie, 2nd Edition, VCH, Weinheim, 1990.