New system for computer-aided infrared and Raman spectrum interpretation


Chemometrics and Intelligent Laboratory Systems 88 (2007) 107 – 117 www.elsevier.com/locate/chemolab

Eugene Karpushkin a,⁎, Andrey Bogomolov a,1, Yury Zhukov a, Michael Boruta b

a Advanced Chemistry Development, Inc., Moscow Office, 6 Akademika Bakuleva str., 117513 Moscow, Russia
b Advanced Chemistry Development, Inc., 110 Yonge Street, 14th floor, Toronto, Ontario, Canada M5C 1T4

Received 31 March 2006; received in revised form 27 July 2006; accepted 16 August 2006 Available online 6 October 2006

Abstract

A new software tool for the interpretation of infrared and Raman spectra has been developed. It makes use of fragment libraries comprising representative, carefully selected and refined data on characteristic vibration frequencies. The information was collected from multiple sources including published correlation tables and reference spectral databases. In an automatic mode, the system performs structure-to-spectrum verification to test the correspondence between a drawn chemical structure and an experimental spectrum. Computer-aided “manual” interpretation (an expert mode) facilitates the assignment of structural elements to spectral features by means of highlighting the library information over the spectrum. The interpretation performance was significantly improved compared to other systems of this type due to some novel features. These are original fragment structures incorporating the concept of nucleus (vibrating group) and a new system of queries (fuzzy atoms and bonds), as well as the inter-fragment logic, which make fragment formulation extremely flexible. Important methodological aspects of fragment library construction were considered, and the main principles of fragment formulation on the basis of experimental spectra were formalized. Principal component analysis (PCA) was applied to distinguish clusters formed by spectral responses due to a specific structural environment and to refine characteristic frequency regions.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Computer-aided spectral interpretation; Infrared spectroscopy; Raman spectroscopy; PCA

⁎ Corresponding author. E-mail address: [email protected] (E. Karpushkin).
1 Present address: European Molecular Biology Laboratory, Hamburg Outstation, Build. 25a, 85 Notkestrasse, 22603 Hamburg, Germany.
0169-7439/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.chemolab.2006.08.010

1. Introduction

Computer-aided spectral interpretation is one of the most intricate problems of qualitative analysis in vibrational spectroscopy. In the infrared region, two factors make the interpretation of spectral signals in terms of possible functional groups particularly challenging. First, there is a high degree of overlap between spectral bands, especially in the fingerprint region containing the main structural information. Second, characteristic bands may shift significantly depending on the structural environment of functional groups. Besides, various chemical and physical factors influencing sample components should be taken into account, e.g., tautomerism, chemical interactions, and solvent effects. Therefore, interpretation of infrared (IR) and Raman spectra requires a wealth of knowledge about characteristic spectral properties of functional groups, and many different factors must be considered simultaneously.

During the last two decades, a number of new algorithmic approaches to the spectral interpretation problem were developed, such as neural networks [1–4] or factor-based methods [5]. Nevertheless, the traditional expert-driven interpretation technique based on the knowledge of characteristic group frequencies [6–10] remains the most common and dominates the available software.

The ultimate goal of spectral interpretation in general is full structure elucidation from spectral data. This task belongs to the class of so-called “inverse problems”, which are typically solved by imposing various constraints on the full set of possible solutions. These constraints may be derived from a priori information about sample properties and from available analytical data. Structure elucidation based on vibrational spectra only (except for individual simple cases) is hardly possible due to their insufficient information content [11]. Nevertheless, IR and Raman spectroscopy are often used as complementary techniques in combination with nuclear magnetic resonance or mass spectrometry data. In routine laboratory practice, the problem of qualitative analysis of infrared data usually boils down to testing the hypothesis of correspondence between an experimental spectrum and a suggested structure; in many cases, even a conclusion about the class of compound is useful for sample characterization. There are three main approaches to the problem of computer-aided spectrum interpretation.

(1) Computational approach. The theoretical spectrum is directly predicted from a candidate structure and compared to the experimental one, and thus, the correspondence is tested.
(2) Database search. As in the previous method, the interpretation is based on spectral shape comparison; however, spectral databases are used as a source of reference information rather than calculated spectra.
(3) Expert systems. These tools make use of information on characteristic group frequencies collected in an electronic table or knowledge base (fragment library) for sequential topological interpretation, by searching for a specific peak pattern corresponding to a tested structural fragment and vice versa.

In the present paper we report on a system of the latter type. Such systems tend to emulate the work of a human expert: fragment-by-fragment and peak-by-peak interpretation based on accumulated experience. One of the fundamental theoretical assumptions of vibrational spectroscopy, which provides a background for such a piecewise spectral interpretation, declares the general independence of vibration frequencies of the chemical environment of the atoms involved in the vibration. In practice, spectral signals of the same group may vary significantly, and the environment should be taken into account for successful spectrum interpretation. Therefore, the ability to handle various environmental situations flexibly is one of the most important requirements for the knowledge base of IR/Raman interpretation software. For fragment definition, existing expert systems [6–16] mostly offer a limited number of hard-coded descriptors of structural neighborhood (e.g., aromatic ring, specific heteroatoms, conjugation with a double bond) and the logic of “environmental spheres”. These means are sometimes insufficient to define a specific environmental situation or to separate it from a more common case.

In the present work, we intended to create a system that would make it possible to define structural fragments, including the functional group and its chemical environment, with a desired level of distinctness (or ambiguity). We also addressed the problem of extracting precise and reliable information about spectral features of the fragments from experimental spectral data. This important methodological issue is usually left out of consideration by the authors of reference books and interpretation software. For the interpretation tool presented here, we collected verified IR and Raman spectral characteristics of over one thousand structure fragments covering the main classes of organic compounds. Spectral information was extracted from numerous literature sources [17–23] as well as from available databases of experimental spectra [24–26]. The developed system possesses a number of unique features, in particular, an optimized architecture, elements of inter-fragment logic, three operating levels of the tool (browsing, interactive interpretation, and automatic spectrum–structure verification), as well as an original user-friendly interface. System performance is analyzed and discussed in comparison with similar software tools.

2. Methods and data

2.1. Fragment

The fragment library is a database composed of records, which we hereafter call fragments. Each fragment contains information on spectral characteristics of a specific group of atoms in a specific environmental context as well as other related data. The fragment architecture is shown in Table 1 and illustrated by a real library example. In the table, the fragment information is divided into three sections called general, structural, and spectral information. The first one assigns an ID number, textual identifier, and class attribute to the fragment. The structural part includes a drawn structural formula (hereafter called fragment structure) designating the vibrating atomic group and its possible environment; besides, it may define inter-fragment relationships (see Section 2.4). Spectral information is represented by a list of peaks associated with the fragment vibrations. Each peak is defined by intervals of its characteristic frequency, intensity, and width, as well as supplementary and reference information.

Table 1
Fragment architecture

Field name            Example                      Description
General information
  Fragment ID         420                          Unique fragment identification number
  Description         Pyridine N-oxide [2-]        Textual fragment identifier
  Class name          Heterocycles                 Chemical class
  Comment             May be detailed further on   Textual comments
Structural information
  Fragment structure  (drawn structure)            Drawn chemical formula that defines vibrating group (nucleus) and its possible environment
  Nucleus             C–H                          Textual identifier of the nucleus
  Exception           982, 983, 986, 987           Inter-fragment relationships
  Equivalence         10                           Inter-fragment relationships
Spectral information (peak table)
  Peak 1
    Range             3000–3100                    Range of characteristic peak positions
    Height            3–5                          Peak intensity range, relative units from 1 (very weak) to 5 (very strong)
    Width             10–100                       Peak FWHH (full width at half-height) range
    Vibration         C–H stretching               Vibration type
    Reference         [17]                         Reference
    Type              VH                           Peak information usage: verification and highlight (VH) or highlight only (H)
    Comment           Corrected                    Textual comments
  Peak 2
    …                 …                            …

The fragment structure is a key element of the present interpretation system. It has been developed on the basis of the original ACD/Labs technology for electronic chemical structures [27]. A fragment structure represents an ensemble of environmental situations of the same vibrating group, called the fragment nucleus, that have common (very similar) spectral characteristics. The fragment structure is designated by a conventional structural formula, except that it always includes the highlighted nucleus (shown in bold in Table 1), whose characteristic properties it defines. The rest of the formula, called the environment, typically includes so-called queries: special symbols used in place of normal atoms and bonds to designate a set of possible neighborhoods. The list of queries presently used in the system is given in Table 2.

The nucleus is defined as a connected group of atoms whose joint vibration is spectroscopically observed. Since the term “nucleus” introduces a new entity into the conventional nomenclature of IR spectral interpretation, an explanation of its necessity is relevant here.
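The record layout of Table 1 can be pictured as a small data structure. The following is only a minimal sketch, not the actual ACD/Labs format: the class names `Peak` and `Fragment`, the field names, the `matches` helper, and the assumption that peak width is expressed in cm−1 are all hypothetical, and the drawn fragment structure is reduced to a text placeholder.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Peak:
    """One row of the peak table (hypothetical field names)."""
    position_range: Tuple[float, float]  # characteristic frequency interval, cm-1
    height_range: Tuple[int, int]        # relative intensity, 1 (very weak) to 5 (very strong)
    width_range: Tuple[float, float]     # FWHH interval (units assumed to be cm-1)
    vibration: str
    reference: str = ""
    usage: str = "VH"                    # "VH": verification and highlight, "H": highlight only
    comment: str = ""

    def matches(self, position: float, height: int, width: float) -> bool:
        """True if an observed peak falls inside all three intervals."""
        return (self.position_range[0] <= position <= self.position_range[1]
                and self.height_range[0] <= height <= self.height_range[1]
                and self.width_range[0] <= width <= self.width_range[1])


@dataclass
class Fragment:
    """General, structural, and spectral information of one library record."""
    fragment_id: int
    description: str
    class_name: str
    nucleus: str
    structure: str = ""                                   # placeholder for the drawn structure
    exceptions: List[int] = field(default_factory=list)   # IDs of excluded fragments
    equivalence: Optional[int] = None                     # group ID of equivalent fragments
    peaks: List[Peak] = field(default_factory=list)
    comment: str = ""


# The example record of Table 1, with only the first peak filled in.
fragment_420 = Fragment(
    fragment_id=420,
    description="Pyridine N-oxide [2-]",
    class_name="Heterocycles",
    nucleus="C–H",
    exceptions=[982, 983, 986, 987],
    equivalence=10,
    peaks=[Peak((3000.0, 3100.0), (3, 5), (10.0, 100.0),
                "C–H stretching", reference="[17]", usage="VH", comment="Corrected")],
    comment="May be detailed further on",
)
```

Keeping the three intervals on the peak rather than on the fragment mirrors the table, where each peak carries its own range, height, and width.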
The term “functional group”, adopted from organic chemistry, is commonly used in the literature and software to designate structural information associated with characteristic spectral features. In many cases, these terms are congruent, e.g., methyl or carbonyl, or “functional group” means an even broader structural fragment than our “nucleus” and may include more than one underlying elementary vibration, as in the case of carboxyl. However, there are multiple exceptions. For example, single-atom functional groups, e.g., –Cl, cannot be taken as a nucleus since the latter needs at least two atoms for a joint vibration. On the other hand, the C–C–O vibration in esters goes beyond the conventional –COO– functional group. The introduction of the term “nucleus” provides the necessary flexibility and specificity with regard to the problem of IR/Raman spectral interpretation. Its usage allows one to avoid ambiguities and inconsistencies associated with the general-purpose term “functional group”.

Table 2
Queries used for fragment structure encoding

Query      Description
Query atoms
  ⁎        Any atom
  A        Any atom except for hydrogen
  SatC     Saturated carbon atom
  Aryl     Atom belonging to an aromatic ring
  Hal      Halogen atom (F, Cl, Br, or I)
  [L]      Atom from an arbitrary list L
  CorH     Carbon or hydrogen atom
Query bonds
  –Ar–     Aromatic bond
  –Rn–     Bond within a ring
  –Ch–     Bond in a chain
  –D–      Double bond (non-aromatic)

The fragment object addresses the problem of optimal data structure. However, the efficiency of an interpretation/verification system equally depends on the information content of the fragment library. In order to provide high quality and reliability of the data, we imposed the following requirements on each input fragment during the library construction.

(1) Selectivity. The optimal frequency interval of a characteristic signal was found to be 20 cm−1 to 50 cm−1. For wider ranges the information becomes too general and, hence, less helpful because of the increasing risk of false positive hits. Narrower regions, on the contrary, increase the risk of false negatives and rejection of a correct structure when the corresponding peak is shifted due to, e.g., a different sampling procedure or experimental conditions, which is even more undesirable.
(2) Homogeneity. Dispersion of values within the frequency interval should be due to stochastic error. If experimental data described by the fragment show evident clusters related to the structural environment, breaking it into two or more fragments should be considered.
(3) Representativeness. Each fragment added to the knowledge base should be represented by at least three experimental reference spectra. The safe level of representativeness corresponds to five or more spectra.

The main principles of data collection are considered in detail in the next section.

2.2. Library construction principles

Filling the fragment library with complete and reliable data presents a serious methodological problem. Experimental data are the principal source of spectral information; therefore, representative collections of high-quality spectra are required. Each fragment of our knowledge base has been formulated and verified making use of all available sources of spectral information. They include comprehensive printed atlases and textbooks [17–23] as well as spectral databases [24–26]. Such a combined approach allowed us to use the wealth of experience accumulated in the literature and, at the same time, to verify, correct, and expand the contents based on the primary spectral data.

The knowledge database has been constructed element-wise, fragment by fragment. First, a list of atom groups (future fragment nuclei) having characteristic signals in the infrared region was defined. For the main classes of organic compounds covered by the present system, these groups are well known and published (see, for example, [17–23]). Therefore, raw spectral data are not required at the initial stage.

Fig. 1. Fragment structure overlap: (a) non-overlapping fragments; (b) inclusion; and (c) intersection.

Table 3
Fragment statistics of the structure verification library (number of fragments)

Class/subclass of compounds        IR verification library   Raman verification library
Aliphatic hydrocarbons             254                       196
  Methyl compounds                 61                        52
  Methylene compounds              78                        43
  Methine compounds                16                        8
  Alkenes                          75                        74
  Alkynes                          24                        19
Carbocyclic aromatic compounds     69                        69
Carbonyl compounds                 158                       148
  Acyl halogenides                 10                        11
  Carboxylates                     10                        4
  Carboxylic anhydrides            8                         8
  Carboxylic carbonates            9                         9
  Carboxylic esters                35                        32
  Aldehydes                        15                        11
  Ketones                          25                        31
  Carboxylic acids                 8                         8
Other O compounds                  45                        38
  Alcohols                         24                        20
  Ethers                           13                        10
  Peroxy compounds                 8                         8
N compounds                        149                       163
  Amines                           23                        23
  Amides                           12                        14
  Imines                           9                         10
  Imides                           6                         9
  Nitrocompounds                   11                        12
S compounds                        129                       78
P compounds                        99                        85
Si compounds                       32                        30
Se compounds                       36                        23
B compounds                        53                        8
Non-carbon acids                   11                        7
Heterocycles                       183                       132
Inorganic ions                     59                        47
Other                              100                       88
Total                              1377                      1112

Since characteristic group frequencies are not completely independent of the structural neighborhood, the process of fragment formulation consists in determining a set of environmental situations in which the intervals of spectral bands associated with the fragment nucleus are specific, i.e., narrow enough to provide independence of the environment. The following four-step fragment formulation procedure was developed and utilized in the present work.

(1) Fragment expression starts by analyzing a group of spectra of pure chemical compounds whose structures include the target nucleus. For populous classes, the analysis may initially start from a well-known subclass, e.g., from the ketone >C=O instead of the general carbonyl.
(2) The key step in fragment formulation is the detection of clustered subsets of the initial set of related spectral bands, whose characteristic frequencies are very close to each other and, at the same time, differ noticeably from the rest of the data. In the simplest case, when the starting set is a single homogeneous data set, this step is skipped.
(3) If the resulting subset meets the requirement of fragment homogeneity (Section 2.1 above), it can be taken as a basis for a new fragment. Otherwise, the cluster analysis is repeated as described in step 2.
(4) If the resulting fragment is represented by fewer than three experimental spectra, it cannot be accepted as sufficiently representative (Section 2.1). In such a case, the data are considered outlying and stay outside the library coverage.

It is generally assumed that the literature data on characteristic group frequencies published in various atlases and textbooks result from the analysis of representative experimental material. Nevertheless, contradictions between different sources are quite common; hence, data validation and correction are necessary. Besides, literature fragments often tend to be excessively general and should therefore be broken into several more specific fragments. For these reasons, we carefully validated all information taken from indirect sources, such as correlation tables, and, if necessary, corrected it using the above fragment formulation method. Quite often, new fragments that were not described in the reference literature were found and expressed during the analysis of experimental spectral data.
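The acceptance logic of steps 2 to 4 can be illustrated with a short sketch. It is an assumption-laden toy, not the authors' procedure: the paper detects clusters with PCA score plots, whereas `split_by_gaps` below is a naive one-dimensional stand-in that splits sorted peak positions at large gaps; the function names, the 15 cm−1 gap threshold, and the example positions are all made up. Only the 50 cm−1 selectivity limit and the three-spectra and five-spectra representativeness levels come from the text.

```python
def split_by_gaps(positions, gap=15.0):
    """Naive stand-in for step 2: split sorted peak positions (cm-1)
    wherever two neighbouring values differ by more than `gap`."""
    ordered = sorted(positions)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for p in ordered[1:]:
        if p - clusters[-1][-1] > gap:
            clusters.append([p])       # large gap: start a new cluster
        else:
            clusters[-1].append(p)     # small gap: extend the current cluster
    return clusters


def assess_cluster(positions, max_width=50.0, min_spectra=3, safe_spectra=5):
    """Check one candidate fragment against the selectivity and
    representativeness requirements of Section 2.1."""
    low, high = min(positions), max(positions)
    return {
        "interval": (low, high),
        "selective": (high - low) <= max_width,        # interval no wider than 50 cm-1
        "representative": len(positions) >= min_spectra,  # at least three spectra
        "safe": len(positions) >= safe_spectra,           # five or more spectra
    }


# Made-up peak positions of one nucleus observed in eight reference spectra.
observed = [1685.0, 1690.0, 1694.0, 1715.0, 1718.0, 1722.0, 1725.0, 1780.0]
clusters = split_by_gaps(observed)
reports = [assess_cluster(c) for c in clusters]
```

With these made-up values the positions fall into three groups; the single point at 1780 cm−1 fails the representativeness check and would be treated as outlying, as described in step 4.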

Fig. 2. Fragment occurrence in structures of the reference database [24].


Once the fragment structure is found, the next stage is to extract the corresponding spectral information from experimental data. In an ideal case, related peaks lie close to each other and the full interval of their positions is taken as the required offset range. When peak positions scatter within 50 cm−1, the interval is considered selective and the analysis stops. Otherwise, the outermost points at the interval ends are revised. If their deviation can be explained in terms of an unusual structural environment, such data points are taken as outliers and ignored, or, in case of three or more non-typical points of the same nature, the possibility of separating them into a new fragment should be considered. If, in spite of all efforts, the offset interval remains relatively broad, it is accepted for the fragment library “as it is”.

Information about bandwidths and intensities is not as critical for the interpretation as the frequency intervals. Although it is optionally applied by our system for manual or automatic verification, it is usually considered as supplementary data only. Therefore, the width and intensity intervals should be set broad enough to avoid false negatives in the structure verification procedure. When estimating the peak width and intensity from an experimental spectrum, signal overlap should be taken into account. In case of significant overlap, a peak fitting procedure was applied to model the whole group. In this method, each peak is fitted by a mixed Gauss + Lorentz function, Eq. (1), using the Levenberg–Marquardt optimization routine [28] to minimize the least-squares deviation of the total simulated signal from the experimental spectrum.

$$f(x) = M \cdot \frac{H}{4\left(\frac{x - X_0}{W}\right)^2 + 1} + (1 - M) \cdot H \cdot e^{-4\ln(2)\left(\frac{x - X_0}{W}\right)^2} \qquad (1)$$

where $X_0$ is the peak position (offset), $H$ is the peak height, $W$ is the peak full width at half-height (FWHH), and $M$ is the fraction of the Lorentz shape in the function.
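The peak shape of Eq. (1) can be written down directly. The sketch below only evaluates the mixed Gauss + Lorentz function and the summed signal of a group of peaks; it does not perform the Levenberg–Marquardt fit itself. The function and parameter names are illustrative and are not taken from the software.

```python
import math


def mixed_peak(x, x0, height, width, lorentz_fraction):
    """Mixed Gauss + Lorentz peak of Eq. (1).

    x0 is the peak position, height is H, width is the FWHH W,
    and lorentz_fraction is M (0 = pure Gauss, 1 = pure Lorentz).
    """
    u = (x - x0) / width
    lorentz = height / (4.0 * u * u + 1.0)
    gauss = height * math.exp(-4.0 * math.log(2.0) * u * u)
    return lorentz_fraction * lorentz + (1.0 - lorentz_fraction) * gauss


def simulated_signal(x, peaks):
    """Total simulated signal at x for a group of overlapping peaks.

    `peaks` is a list of (x0, height, width, lorentz_fraction) tuples.
    """
    return sum(mixed_peak(x, *p) for p in peaks)
```

Both components are parameterized so that the function equals H at x = X0 and H/2 at x = X0 ± W/2 for any M, which is why W can be read as the full width at half-height of the mixed peak.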

Fig. 3. Structure verification results for allyl acetate: (a) IR spectrum and the structure (allyl highlighted); and (b) structure verification protocol (methyl in acetate fragment highlighted).


Fig. 4. Determination of aromatic methyl by spectral interpretation for toluene: (a) IR spectrum and the structure; and (b) spectrum verification protocol.

2.3. Principal component analysis for fragment formulation

Principal component analysis (PCA) [29] has been applied to aid in the formulation of highly representative fragments based on at least ten experimental spectra. PCA decomposes the original data matrix, X, in accordance with Eq. (2).

$$X = T P^{\mathrm{T}} + E \qquad (2)$$

where X (r × c) is the data matrix (spectra in rows), T (r × a) is the matrix of scores, P (c × a) is the matrix of loadings, and E (r × c) is the matrix of residual errors; r and c are the numbers of spectra and wavelengths, respectively, and a is the effective rank of X chosen for the PCA model.

The main advantage of PCA is its ability to reveal and visualize internal data structures. Working in the factor space instead of the raw data makes PCA particularly effective for the analysis of overlapped spectral signals. In the present work, a PCA plot of scores was used for cluster recognition, i.e., for distinguishing data subsets with similar spectral properties (step 2 of fragment formulation, Section 2.2 above). The plot of loadings and the bi-plot (overlaid scores and loadings) were also helpful for the determination and refinement of signal frequency regions, specifically, in the case of overlap. Examples of PCA application are given in the Discussion section below. PLS Toolbox by Eigenvector Research, Inc., version 3.54 [30] for Matlab® was used to perform PCA.

2.4. Inter-fragment logic

There are two types of inter-fragment logical interaction in the library: exception and equivalence.


2.4.1. Exception

Each fragment structure defines a separate set of structural situations that does not intersect with others. Exception is the basic compositional principle of the library, which helps to avoid ambiguities in new structure assignment during the system operation. However, in the course of the library construction, fragment intersections are rather common and should be handled. Typical situations of two-fragment interaction are shown in Fig. 1a–c. Fragment overlap occurs during the formulation process when the environment effects on spectral characteristics of the same nucleus are analyzed. Inclusion (Fig. 1b) is one of the most typical cases: a specific fragment B (for instance, dialkyl ketone carbonyl) has spectral characteristics that differ from the more general case defined by fragment A, e.g., ketone carbonyl, and thus B should be isolated from A. To exclude B, its ID is indicated in the Exception field of A (Table 1). The case of partial intersection (Fig. 1c) can arise from ambiguous fragment definitions taken from the literature. For example, if A and B represent alkyl and aryl ketones, respectively, then the alkyl aryl ketone case falls into the intersection. The situation is resolved by annexation of the common subset to one of the two fragments or by its exclusion from both. To exclude the common area from both intersecting fragments, a cross-reference exception is applied.
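The effect of the Exception field on inclusion and intersection can be sketched with a toy library. This is a hedged illustration only: the real system matches drawn fragment structures, whereas here each fragment is reduced to a hypothetical predicate over a tuple of substituent labels, and the fragment IDs, names, and the function `applicable_fragments` are invented for the example.

```python
def applicable_fragments(environment, library):
    """Return the IDs of fragments that apply to `environment`.

    A fragment applies when its own pattern matches and none of the
    fragments listed in its exception field matches as well.
    """
    matched = {fid for fid, frag in library.items() if frag["matches"](environment)}
    return sorted(
        fid for fid in matched
        if not any(exc in matched for exc in library[fid]["exceptions"])
    )


# Toy library (hypothetical IDs): one general fragment with two included
# specific fragments, which in turn exclude each other (cross-reference).
library = {
    1: {"name": "ketone carbonyl (general)",
        "matches": lambda env: True,
        "exceptions": [2, 3]},
    2: {"name": "alkyl ketone carbonyl",
        "matches": lambda env: "alkyl" in env,
        "exceptions": [3]},
    3: {"name": "aryl ketone carbonyl",
        "matches": lambda env: "aryl" in env,
        "exceptions": [2]},
}

dialkyl = applicable_fragments(("alkyl", "alkyl"), library)
diaryl = applicable_fragments(("aryl", "aryl"), library)
alkyl_aryl = applicable_fragments(("alkyl", "aryl"), library)
other = applicable_fragments(("vinyl", "vinyl"), library)
```

In this toy setup a dialkyl ketone is assigned only to the alkyl fragment (inclusion), while an alkyl aryl ketone is assigned to neither specific fragment because of the cross-reference exception; since the general fragment lists both as exceptions, it is excluded too, so such a case would need its own fragment or annexation to one of the two.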

2.4.2. Equivalence

Sometimes, representation of a specific structural situation by a single fragment is problematic: for example, consider “C–H stretching in monosubstituted benzenes”. In such cases, several fragments are added to the library, although they have a common nucleus, identical spectral properties, and represent the same functional group. The above example includes three fragments with the nucleus C–H in ortho-, meta-, and para-positions. To keep the relation between such fragments in the library, they are assigned a group ID, which is indicated in the Equivalence field (Table 1). Equivalent fragments can be used by the system separately, or as a whole group, depending on the specific task.

Fig. 5. Fragment hierarchy of ketones' carbonyl: (1) general ketone, (2) aryl ketone, (3) dialkyl ketone, (4) 4-pyrone, (5) cyclopropenone, (6) 4-pyridone, (7) 4-thiopyrone, (8) cyclopropanone, (9) cyclobutanone, (10) cyclopentanone, (11) α-haloketone, (12) flavone, (13) hydrogen-bonded aryl ketone, (14) 1,2-diketone, (15) 1,3-diketone, (16) α,α′-dihaloketone. Fragment structures are given in Table 4.

2.5. Two-way structure-spectrum verification

Verification of mutual correspondence between an experimental spectrum and a hypothetical structure consists of two independent tests, using the approach implemented in the X-Pert software [10]. The first test (structure verification) starts from a list of library fragments that represent the verified structure. Each fragment should be confirmed by the presence of corresponding peaks in the experimental spectrum. Discrepancies are reported in the protocol. The other test, called spectral verification (or spectral interpretation), analyzes the spectrum for characteristic peak patterns known to be indicative of a specific structural fragment. The absence of such a fragment in the tested structure puts the structure in question; this is reported.

Two different fragment libraries are used for the tests. Both have the same format (Table 1), but their contents differ to meet specific verification purposes. While the structure verification library is the main and full common-purpose fragment collection, the spectral interpretation library contains only a limited number of fragments with highly specific and reliable spectral characteristics. Therefore, the price of false negative verification (erroneously reported absence of an expected structure fragment) is much higher in the case of spectral interpretation. On the contrary, the absence of a specific spectral peak expected in the structure verification is only a warning that can still be neglected due to sample or experimental peculiarities. If no discrepancies were found during the two-way verification procedure, the structure-spectrum pair under test is considered non-contradictory; this, however, still does not provide grounds for the final confirmation of correspondence. In this respect, negative verification results may be more informative and helpful, as they provide an appropriate basis for rejection.

3. Discussion

In this section, the IR/Raman interpretation system performance is briefly analyzed in comparison with available competitive products. Some examples of its practical application are given. IR and Raman verification libraries presently include over 1300 and 1100 fragments, respectively, and cover all main classes of small-molecule organic compounds (Table 3). The total number of structural elements (in our case, fragments) necessary for the interpretation of vibrational spectra has a theoretical limit, which follows from the limited number of characteristic vibrating atomic groups and their vibration modes. In reality, the effect of structural environment leads to a wider variability of fragments; however, it is also limited in the expert interpretation approach that is based on the similarity of spectral responses within possibly more abundant sets of structural situations. The number of fragments in our library is presumably close to saturation. This conclusion can hardly be proven directly; it is mostly based on the drastic decrease in the relative yield of new fragments from the processed source data, which was observed during the library compilation. In later stages, when the library size exceeded 1000 fragments, only

about 5 new fragments were produced per hundred new spectrum–structure pairs. This value is at least 10 times lower than that in the beginning. For comparison, the X-Pert IR interpretation system operates with a set of libraries that contain about 450 fragments in total; the SIRS-SS system includes about 250 fragments [31]. Fragment occurrence in real-world structures is illustrated by the histogram in Fig. 2. High library abundance of fragments with regard to typical small organic molecules is illustrated by Fig. 3, where eight fragments were detected for the relatively small and simple structure of allyl acetate. The protocol reports the verification results for each fragment; this process transparency facilitates the analysis of results. One can navigate the verified fragments (upper part of the protocol). For a selected fragment, the table of corresponding peaks is shown below and the fragment itself is highlighted over the structure.

In contrast to the general-purpose verification library, the IR and Raman libraries for spectral interpretation include only 17 and 21 fragments, respectively. Unique peak patterns that can be unambiguously interpreted as fragments are quite uncommon, considering that similar combinations of peaks from several different vibrations should also be avoided. To be truly indicative, spectral information of an interpretation library fragment should include an intense peak in an unusual frequency region, such as the cyanate –O–C≡N band at about 2250 cm−1, or a set of several finely defined peaks, as in the case of aromatic methyl (Fig. 4); the protocol of spectrum verification reports the verification results for each spectral band. In spite of the small number of fragments in the knowledge base, the coverage (fraction of structures where at least one fragment is present) of the interpretation library is 79% (based on the reference spectral database [24]). The prediction error of spectral interpretation (the fraction of erroneously detected fragments) for raw spectra from [24] was 27%. Subsequent semi-automatic spectral preprocessing, including the deconvolution of overlapped peaks (peak fitting procedure, Section 2.2), reduced this number to about 10%.

Much attention was paid to the quality and reliability of information during the library development. Well-formulated fragments keep an optimum balance between specificity (selectivity) and representativeness. The optimized information structure developed in the present system produces highly selective fragments. Our fragment structure possesses several unique features providing maximum flexibility in the fragment formulation; in particular, the explicitly defined nucleus and a new system of queries. In fact, this structure enables fragment definition with the necessary level of fuzziness (from “any neighborhood” on one end) or distinctness (up to an explicitly drawn chemical structure) in the specification of the nucleus environment. This structure avoids the limitations of “environmental spheres”, the approach commonly used by our predecessors [10,12]. For example, in the system [12] highly characteristic properties of two representative classes, carboxylic acid anhydrides and chloroanhydrides, were missed: the former was absent completely, while the latter was included into the common and broad >C=O fragment.

Table 4
List of carbonyl fragments selected to illustrate an inter-fragment relationship in Fig. 5 (the drawn fragment structures of the original table are not reproduced)

No.   Description
1     General ketone
2     Aryl ketone
3     Dialkyl ketone
4     4-pyrone
5     Cyclopropenone
6     4-pyridone
7     4-thiopyrone
8     Cyclopropanone
9     Cyclobutanone
10    Cyclopentanone
11    α-haloketone
12    Flavone
13    Hydrogen-bonded aryl ketone
14    1,2-diketone
15    1,3-diketone
16    α,α′-dihaloketone
Inter-fragment logic, specifically fragment exception, is another important novelty of our system that provides an easy way to define selective fragments unambiguously. Fig. 5 uses the carbonyl group in ketones as an example to illustrate the utility of exception logic in a complex environmental hierarchy (Table 4). The 16 initial fragments in this scheme have 27 pair-wise intersections at five levels. Note that this is only a part of the whole carbonyl hierarchy, which includes 113 fragments in total. The present inter-fragment logic of exceptions seems to be the best way to handle such complex cases while keeping a simple, human-readable fragment structure format.

Obtaining frequency regions that are as narrow as possible but statistically well-grounded is a complex methodological problem that is often neglected by authors and developers [7,9,12,17–20,22,23,32]. Leaving such important issues as the methods of fragment selection or frequency region definition unpublished keeps the question of data reliability unanswered and reduces the overall value of the material. The formalized fragment formulation approach presented here is aimed at eliminating errors related to the “human factor”. An expert's knowledge, based on hundreds or thousands of “manually” interpreted spectra, remains rather intuitive and subjective, which makes the quantification of published sources of characteristic frequencies difficult.
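The idea of fragment exception can be sketched as follows. This is only an illustration of the principle, not the fragment format used in the system: the `Fragment` class, the `resolve` function, the three fragment names, and the frequency regions are hypothetical placeholders, and the substructure search that yields the list of matched fragments is assumed to have been done elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    name: str
    region: tuple                        # (low, high) in cm^-1, placeholder values
    exceptions: frozenset = frozenset()  # more specific fragments that override this one


# Hypothetical three-level slice of a carbonyl hierarchy.
LIBRARY = {f.name: f for f in (
    Fragment("ketone C=O, any", (1650, 1750),
             frozenset({"ketone C=O, aryl", "ketone C=O, diaryl"})),
    Fragment("ketone C=O, aryl", (1670, 1700),
             frozenset({"ketone C=O, diaryl"})),
    Fragment("ketone C=O, diaryl", (1650, 1675)),
)}


def resolve(matched_names, library=LIBRARY):
    """Keep only matched fragments that are not overridden by a matched exception."""
    matched = set(matched_names)
    return sorted(
        name for name in matched
        if not (library[name].exceptions & matched)
    )


# A diaryl ketone matches all three fragments; only the most specific one survives.
print(resolve(["ketone C=O, any", "ketone C=O, aryl", "ketone C=O, diaryl"]))
# A plain aliphatic ketone matches only the generic fragment, which is kept.
print(resolve(["ketone C=O, any"]))
```

With exceptions declared on the generic fragments, each specific fragment can keep a short definition of its own, and only its narrower frequency region is used when it applies.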


Multivariate analysis and PCA provide a reasonable alternative to methods based purely on historical human experience, accumulated and summarized in someone's brain. The ability to handle thousands of spectra and variables at a time, extracting related information, is extremely valuable for statistical fragment formulation. The example given in Fig. 6 demonstrates the application of PCA to distinguishing between different structural clusters and to the accurate definition of the corresponding frequency ranges. PCA-based cluster analysis of 34 gas-phase IR spectra of carbonyl-containing compounds reveals four clear clusters: aldehydes and chloroanhydrides having either aromatic or aliphatic substitution. From the bi-plot it is seen that the intense C=O stretching band allows one to distinguish between aldehydes and chloroanhydrides (the corresponding peaks occur at 1696–1766 cm⁻¹ and 1770–1830 cm⁻¹, respectively), as well as between aromatic and aliphatic aldehydes (1718–1730 cm⁻¹ and 1734–1766 cm⁻¹). In contrast, the major C=O peak is of little use for distinguishing between aromatic and aliphatic chloroanhydrides; that distinction should be based on other, secondary, spectral features. Since the distance from a variable point to the center of coordinates in the loading plot correlates with the aggregate component signal intensity, the “leaves” in this plot can be used for accurate determination of the corresponding peak intervals.

Unobtrusive tunable automation and user-friendliness are important attributes of any expert-operated tool, implying extensive interaction between a human and the machine. The present system provides three different operating levels: navigation, computer-aided manual interpretation, and automated verification. The first level makes it possible to use the entire library as a source of reference information, with the advantages of an electronic table (searching, filtering, sorting, etc.). The second level, which can be called “expert mode”, is

Fig. 6. PCA scores and loadings (bi-plot of PC1 versus PC2) of IR spectra of aldehydes and chloroanhydrides. Scores (the markers correspond to individual structures): aliphatic chloroanhydrides (group A), hollow circles; aliphatic aldehydes (group B), filled circles; aromatic aldehydes (group C), filled stars; aromatic chloroanhydrides (group D), hollow stars. Loadings (the markers correspond to individual wavenumbers): C=O stretching band in chloroanhydrides, hollow circles; C=O stretching band in aliphatic aldehydes, grey filled circles; C=O stretching band in aromatic aldehydes, black filled circles; other variables, smaller grey circles. Arrows point at the most characteristic variables for each potential fragment.
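The kind of analysis summarized in Fig. 6 can be sketched with a few lines of NumPy. The spectra below are synthetic single Gaussian bands, not the 34 gas-phase spectra used in the paper; the band centers are simply placed inside the C=O ranges quoted in the text, and the band width, noise level, and the helper names `gaussian_band`, `pca`, and `cluster_separation` are assumptions made for this illustration. PCA is computed by singular value decomposition of the mean-centered data matrix, which is one common way to obtain scores and loadings and is not necessarily the algorithm used by the authors.

```python
import numpy as np


def gaussian_band(wavenumbers, center, width=8.0):
    """Synthetic absorption band with a Gaussian profile and unit height."""
    return np.exp(-0.5 * ((wavenumbers - center) / width) ** 2)


def pca(X, n_components=2):
    """PCA via SVD of the column-centered matrix (rows are spectra)."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    return scores, loadings


def cluster_separation(scores, labels):
    """Smallest centroid-to-centroid distance and largest within-cluster radius."""
    groups = sorted(set(labels))
    centroids = {g: scores[labels == g].mean(axis=0) for g in groups}
    between = min(
        np.linalg.norm(centroids[a] - centroids[b])
        for i, a in enumerate(groups) for b in groups[i + 1:]
    )
    within = max(
        np.linalg.norm(scores[labels == g] - centroids[g], axis=1).max()
        for g in groups
    )
    return between, within


rng = np.random.default_rng(0)
wavenumbers = np.arange(1650.0, 1850.0, 2.0)

# Synthetic band centers (cm^-1), chosen inside the ranges quoted in the text.
centers = {
    "aromatic aldehyde": 1724.0,
    "aliphatic aldehyde": 1750.0,
    "chloroanhydride": 1800.0,
}

spectra, labels = [], []
for name, center in centers.items():
    for _ in range(8):
        shift = rng.normal(0.0, 2.0)                    # small band shift
        noise = rng.normal(0.0, 0.01, wavenumbers.size)
        spectra.append(gaussian_band(wavenumbers, center + shift) + noise)
        labels.append(name)

X = np.vstack(spectra)
labels = np.array(labels)
scores, loadings = pca(X)

# Variables far from the origin of the loading plot mark the characteristic region.
loading_norm = np.hypot(loadings[:, 0], loadings[:, 1])
important = wavenumbers[loading_norm > 0.5 * loading_norm.max()]

between, within = cluster_separation(scores, labels)
print(f"cluster separation: {between:.2f}, largest cluster radius: {within:.2f}")
print(f"characteristic variables span {important.min():.0f}-{important.max():.0f} cm^-1")
```

The wavenumbers selected by the loading norm play the role of the “leaves” in the loading plot; on real spectra the 0.5 threshold used here would have to be chosen by inspection.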


Fig. 7. Automatic verification of the incorrectly assigned structure of benzoic acid against the IR spectrum of toluene: (a) spectrum and structure; (b) structure verification protocol; (c) spectrum verification protocol.

intended for performing manual assignment of structural elements to spectral peaks (and vice versa) as it is typically done by a human operator, but accompanied by interactive assistance. When a part of the structure is selected by the user, the related library information (peak positions, intensity ranges, and vibration types) is immediately highlighted over the spectrum, and the library is automatically filtered for the included fragments. The last, fully automated level performs mutual spectrum–structure verification as described in Section 2.5. Examples of structure and spectrum verification are presented in Figs. 3 and 4, respectively; e.g., Fig. 4 shows successful structure verification results for toluene. The generated spectrum verification protocol also reveals complete correspondence between spectrum and structure. To show how the system reacts to a wrong structure, the methyl group in the same toluene example was deliberately replaced by a carboxylic group (Fig. 7a). The structure verification protocol in this case reported failed verification of three fragments related to carboxylic acids


(Fig. 7b). At the same time, the absence of methyl was detected by the spectral interpretation routine and reported (Fig. 7c).

Development of the presented system is in progress. Our plans include enabling user-defined fragment input and library management. Updating the verification databases with new thematic libraries (e.g., polymer or pollutant spectra) is also planned. Application of multivariate spectrum-to-structure regression for more effective fragment expression is another prospective subject of future research.

4. Conclusions

The new IR/Raman spectral interpretation system and software reported in the present paper is an essential step forward in the development of the expert-driven interpretation approach based on the application of characteristic frequency libraries. High predictive accuracy in combination with a user-friendly interface makes it a useful tool both for routine spectral interpretation and for educational purposes. New principles of fragment formulation create a methodological basis for the development of future computer-aided interpretation tools. Multivariate analysis was shown to be a powerful approach for extracting reliable information for library construction; it has great potential for further system improvement.

Acknowledgements

We cordially thank Michail E. Elyasberg and Edward R. Martirosyan for fruitful discussions and for providing the possibility to work with the X-Pert software. William Costa (FDM) is acknowledged for providing the possibility to use the Fiveash Data Management, Inc. IR spectral databases during the development and testing of the present interpretation system. Antony Williams is acknowledged for supporting this work and for his help during the manuscript preparation.

References

[1] D. Ricard, C. Cachet, D. Cabrol-Bass, T.P. Forrest, J. Chem. Inf. Comput. Sci. 33 (1993) 202–210.
[2] M. Novic, J. Zupan, J. Chem. Inf. Comput. Sci. 35 (1995) 454–466.
[3] C. Klawun, C.L. Wilkins, J. Chem. Inf. Comput. Sci. 36 (1996) 249–257.
[4] C. Klawun, C.L. Wilkins, J. Chem. Inf. Comput. Sci. 36 (1996) 69–81.
[5] P.N. Penchev, G.N. Andreev, K. Varmuza, Anal. Chim. Acta 388 (1999) 145–159.
[6] M.E. Elyasberg, Russ. Chem. Rev. 67 (1999) 525–547.
[7] M. Farkas, J. Markos, P. Szepesvary, I. Bartha, G. Szalontai, Z. Simon, Anal. Chim. Acta 133 (1981) 19–29.
[8] M. Farkas, G. Szalontai, Z. Simon, Z. Csapo, M. Farkas, Gy Pfeifer, Anal. Chim. Acta 133 (1981) 31–40.
[9] K. Baumann, J.T. Clerc, Anal. Chim. Acta 348 (1997) 327–344.
[10] M.E. Elyashberg, E.R. Martirosian, Yu.Z. Karasev, H. Thiele, H. Somberg, Anal. Chim. Acta 337 (1997) 265–286.
[11] L.A. Gribov, M.E. Elyasberg, V.V. Serov, J. Mol. Struct. 50 (1978) 371–387.
[12] J.E. Dubois, G. Mathieu, P. Peguet, A. Panaye, J.P. Doucet, J. Chem. Inf. Comput. Sci. 30 (1990) 290–302.
[13] H. Huixiao, X. Xinquan, J. Chem. Inf. Comput. Sci. 30 (1990) 203–210.
[14] H.B. Woodruff, G.M. Smith, Anal. Chem. 52 (1980) 2321–2327.
[15] G.N. Andreev, O.K. Argirov, P.N. Penchev, Anal. Chim. Acta 284 (1993) 131–136.
[16] R.E. Carhart, D.H. Smith, H. Brown, C. Djerassi, J. Am. Chem. Soc. 97 (1975) 5755–5762.
[17] G. Socrates, Infrared Characteristic Group Frequencies: Tables and Charts, Second edition, John Wiley & Sons Ltd., Toronto, 1994.
[18] D. Lin-Vien, N.B. Colthup, W.G. Fateley, J.G. Grasselli, Infrared and Raman Characteristic Frequencies of Organic Molecules, Academic Press, San Diego, 1991.
[19] B. Smith, Infrared Spectral Interpretation: A Systematic Approach, CRC Press, New York, 1999.
[20] N.P.G. Roeges, A Guide to the Complete Interpretation of Infrared Spectra of Organic Structures, John Wiley & Sons, Toronto, 1994.
[21] L.J. Bellamy, The Infra-Red Spectra of Complex Molecules, Methuen and Co. Ltd., 1968.
[22] L.A. Kazitsyna, N.B. Kupletskaya, The Use of UV-, IR-, NMR- and MASS-spectrometry in Organic Chemistry [in Russian], Moscow State University, Moscow, 1979.
[23] K. Nakanishi, Infrared Absorption Spectroscopy, Holden-Day, Inc., San Francisco, 1962.
[24] FDM FTIR Spectra of Organic Compounds, Copyright (c) 2000, 2001, 2002 Fiveash Data Management, Inc.
[25] Gas-Phase FT-IR Spectra of Organic Compounds, National Institute of Standards and Technology, Environmental Protection Agency, 1992 (original data).
[26] SDBSWeb: http://www.aist.go.jp/RIODB/SDBS/ (National Institute of Advanced Industrial Science and Technology).
[27] www.acdlabs.com.
[28] W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical Recipes in Fortran 77, Second edition, Cambridge University Press, 1992.
[29] S. Wold, K. Esbensen, P. Geladi, Chemometr. Intell. Lab. Syst. 2 (1987) 37–52.
[30] http://software.eigenvector.com/#pls.
[31] J. Yao, B. Fan, J.-P. Doucet, A. Panaye, S. Yuan, J. Li, J. Chem. Inf. Comput. Sci. 41 (2001) 1046–1052.
[32] J. Coates, in: R.A. Meyers (Ed.), Encyclopedia of Analytical Chemistry, John Wiley & Sons Ltd., Chichester, 2000, pp. 10815–10837.