Original Research Paper
Chemometrics and Intelligent Laboratory Systems, 19 (1993) 217-223
Elsevier Science Publishers B.V., Amsterdam
The CSEARCH-NMR data base approach to solve frequent questions concerning substituent effects on 13C NMR chemical shifts
Lingran Chen ¹ and Wolfgang Robien
Department of Organic Chemistry, University of Vienna, Währingerstrasse 38, A-1090 Vienna (Austria)
(Received 3 September 1992; accepted 26 January 1993)
Abstract
Chen, L. and Robien, W., 1993. The CSEARCH-NMR data base approach to solve frequent questions concerning substituent effects on 13C NMR chemical shifts. Chemometrics and Intelligent Laboratory Systems, 19: 217-223.
A novel approach for solving some frequently encountered questions concerning substituent effects on chemical shifts in 13C NMR spectroscopy has been implemented in the CSEARCH-NMR data base system. The features and possible applications of the method are described in detail and some examples are given.
INTRODUCTION
The CSEARCH-NMR data base system [1], containing some 85 000 13C NMR reference spectra and 15 000 NMR spectra of other nuclei, has proven its excellent flexibility over a ten-year period. Retrieval by well-defined items such as molecular formula, compound name and spectral pattern can be done in a very easy and user-friendly way by simple mouse clicks. In principle, these questions can also be answered by a few other programs published in the chemical literature, because most of the algorithms are well established. These methods are designed to answer very dedicated and well-defined questions using the available reference data. However, only a few techniques are known to extract the complete information contained in a large data base of reference spectra. Even the simple question “How large are chemical shift differences induced by acetylation?” can only be answered in a cumbersome, time-consuming way that needs a lot of user interaction and even manual calculations. Therefore, a new generation of algorithms has been developed for automatic extraction and analysis of substituent-induced chemical shift differences (SCSD) of 13C NMR spectra, utilizing the whole information content of a large reference data base in order to give more complete and precise answers. In this paper, the features and possible applications of the method are described and some examples are given. The detailed description of the algorithm will be presented elsewhere [2,3].
Correspondence to: W. Robien, Department of Organic Chemistry, University of Vienna, Währingerstrasse 38, A-1090 Vienna (Austria).
¹ On leave from the University of Science and Technology of China, Hefei, Anhui 230026, The People’s Republic of China.
0169-7439/93/$06.00 © 1993 - Elsevier Science Publishers B.V. All rights reserved
L. Chen and W. Robien / Chemom. Intell. Lab. Syst. 19 (1993) 217-223 / Original Research Paper
BRIEF DESCRIPTION OF THE SCSD METHOD
Substituent-induced chemical shift differences play an important role both in theoretical studies of the nature of substituent effects on chemical shift values and in the practical simulation of NMR spectra of a known molecular structure. The SCSD value is simply defined as the difference of the chemical shift values between two corresponding compounds differing by a substituent under investigation. The present approach for automatic extraction and analysis of substituent-induced chemical shift differences of 13C NMR spectra can be regarded as a significant extension of our new methodology for automatic calculation of chemical shift differences from two given structures and their 13C NMR spectra [4,5]. The main difference between the two methods is that the latter deals with only two well-defined structures, while the SCSD approach does not need an explicit structure representation; only the two different substituents (S1 and S2) under investigation must be given. The program automatically retrieves all possible structure pairs (S1-R, S2-R) from the data base, where R can be any partial structure containing at least one carbon atom. Subsequently, chemical shift differences within each structure pair containing S1 and S2 are calculated. Finally, the SCSD approach shows statistical information on the obtained data, offering information about the effects of substitution on chemical shift values.
The main procedures of the SCSD algorithm are:
(A) Accept the structures of substituents and fragment from the user.
(B) Determine all the possible structure pairs according to the differences of molecular formula of the two input substituents and their elemental composition.
(C) Perform screening by using three-atom and ring-size screens generated from the substituents and fragment under investigation and delete incompatible structure pairs.
(D) Deal with each structure pair:
(a) Perceive substituents S1 and S2 in structures 1 and 2, respectively.
(b) Perceive the fragment, if given, in structure 1.
(c) Check whether R1 is isomorphic to R2 or not.
(d) Calculate SCSD values for each carbon atom considered.
(E) Collect and analyze the obtained SCSD values.
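Steps (B) and (D)(d) lend themselves to a brief illustration. The following Python sketch is not the authors' FORTRAN 77 implementation; it only illustrates, under stated assumptions, how candidate pairs might be selected by molecular-formula difference and how SCSD values follow once the carbons of the two structures have been mapped onto each other. The function names, the dictionary-based entry layout and the approximate shift values are hypothetical, and the screening and perception steps (C) and (D)(a)-(c) are assumed to have been carried out already.

```python
from collections import Counter
from itertools import permutations


def formula_difference(f1, f2):
    """Return the elemental difference f1 - f2, dropping zero entries."""
    diff = Counter(f1)
    diff.subtract(Counter(f2))
    return {element: n for element, n in diff.items() if n != 0}


def candidate_pairs(entries, s1, s2):
    """Step (B): ordered pairs whose formulas differ exactly by S1 - S2."""
    target = formula_difference(s1, s2)
    return [
        (a["id"], b["id"])
        for a, b in permutations(entries, 2)
        if formula_difference(a["formula"], b["formula"]) == target
    ]


def scsd(shifts_1, shifts_2, atom_map):
    """Step (D)(d): shift in structure 1 minus shift in structure 2."""
    return {c1: round(shifts_1[c1] - shifts_2[c2], 1) for c1, c2 in atom_map.items()}


# Hypothetical toy entries; shifts (ppm) are approximate illustrative values,
# not records taken from the CSEARCH-NMR data base.
entries = [
    {"id": "fluorobenzene", "formula": {"C": 6, "H": 5, "F": 1},
     "shifts": {1: 163.3, 2: 115.5, 3: 130.1, 4: 124.1}},
    {"id": "benzene", "formula": {"C": 6, "H": 6},
     "shifts": {1: 128.5, 2: 128.5, 3: 128.5, 4: 128.5}},
    {"id": "toluene", "formula": {"C": 7, "H": 8}, "shifts": {}},
]

pairs = candidate_pairs(entries, {"F": 1}, {"H": 1})
by_id = {entry["id"]: entry for entry in entries}
identity_map = {1: 1, 2: 2, 3: 3, 4: 4}  # assumed outcome of steps (D)(a)-(c)
values = scsd(by_id["fluorobenzene"]["shifts"], by_id["benzene"]["shifts"], identity_map)
```

In this toy set only the fluorobenzene/benzene pair differs by +F and -H, so toluene is rejected at the formula stage. In a real data base many formula-compatible pairs would still fail the later structural checks.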
INPUT AND OUTPUT OF THE PROGRAM
The SCSD program has four different input modes, summarized below, which allow different types of questions to be answered:
(1) input of one substituent;
(2) input of two substituents;
(3) input of one substituent and one fragment;
(4) input of two substituents and one fragment.
The screen output contains a table showing the statistical information of the obtained SCSD data. In input mode (1) or (2) the results are arranged in increasing order of the distance between the atom holding the substituent and the other carbons. This method allows all the possible structure pairs that are consistent with the input substituent(s) to be collected. The resulting table supplies the user with the following information:
(a) The largest and smallest SCSD values induced by the substituent(s) and the corresponding structure pair information.
(b) Information about the influence of the substituent(s) on the chemical shift values as a function of the distance to the substituent.
The chemical shift differences appearing at carbon atoms with the same distance usually vary within a considerably large range. This fact reveals that the influences of substituents upon the chemical shift values also depend heavily on the structural environment of each carbon atom considered. Therefore, in many cases, the user may prefer to investigate some specific carbon atoms in a certain fragment, e.g., a benzene ring system; this can be easily done by using input modes (3) or (4). The program uses this fragment information to select only those structure pairs which contain both substituents and the fragment under investigation. Furthermore, only the chemical shift differences of the carbon atoms corresponding to the fragment are calculated, and the results are arranged according to the positions of the carbons within the query fragment. For each structure pair the complete SCSD information is stored in a file, allowing further detailed investigations.
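The distance-ordered output of input modes (1) and (2) can be pictured with a small sketch. It assumes that "distance" means the number of bonds between the carbon holding the substituent and the carbon considered; the paper does not spell out the metric, so this is an assumption, as are the function names, the adjacency-list representation and the toy SCSD values.

```python
from collections import defaultdict, deque


def bond_distances(adjacency, start):
    """Breadth-first search: number of bonds from `start` to every reachable atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbour in adjacency[atom]:
            if neighbour not in dist:
                dist[neighbour] = dist[atom] + 1
                queue.append(neighbour)
    return dist


def group_by_distance(adjacency, attachment_atom, scsd_values):
    """Collect SCSD values per bond distance, in increasing order of distance."""
    dist = bond_distances(adjacency, attachment_atom)
    grouped = defaultdict(list)
    for atom, value in scsd_values.items():
        grouped[dist[atom]].append(value)
    return dict(sorted(grouped.items()))


# Hypothetical six-membered ring; atom 1 is assumed to carry the substituent.
ring = {1: [2, 6], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5, 1]}
toy_scsd = {1: 34.8, 2: -13.0, 3: 1.6, 4: -4.4, 5: 1.6, 6: -13.0}  # illustrative only
grouped = group_by_distance(ring, 1, toy_scsd)
```

Pooling such groups over many structure pairs would give, for each distance, the spread of values that the text describes as considerably large.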
EXAMPLES AND DISCUSSION
Example 1 shows an investigation of the effects of a fluorine substituent on the chemical shift values of the carbon atoms of a benzene ring skeleton in various substituted benzene derivatives. In order to do this, we simply enter a fluorine atom as the only substituent and a benzene ring as the fragment (Fig. 1) into the SCSD program (input mode (3)). The program first creates a table containing all the possible structure pairs (= 9089) according to the difference of the molecular formulas calculated from the input data. This means that the two molecules of each structure pair differ only by +F and -H in their elemental compositions; in other words, the first molecule contains one fluorine atom more and one hydrogen atom less than the second molecule. From these possible structure pairs, the program then applies several filtering strategies to find those pairs in which the structures of the two molecules are exactly identical except that they differ by +F and -H, and both contain a benzene ring. Out of 9089 possible pairs, 2821 valid structure pairs were obtained, one of which is shown in Fig. 2. The results are summarized in Table 1.
Fig. 1. The structural fragment of example 1, where X indicates the position of the substituent.
Table 1 gives an overview for all carbons within the benzene ring system, showing the highest, average and lowest SCSD values. Each highest and lowest SCSD value is accompanied by its corresponding structure pair, allowing easy access to the reference data base. The average values are calculated separately for positive and negative SCSD values and are given together with the number of hits (carbon atoms used in the calculation). The structure pair having the largest SCSD value (43.5 ppm) at position 1 in the benzene ring is shown in Fig. 2. The corresponding SCSD values are inserted into the structural diagram.
One important fact should be noticed from the results given in Table 1. The average values resulting from the maximum number of corresponding carbons agree very well with the substituent-induced chemical shifts (SCS) derived from monosubstituted benzene derivatives [6]. If the number of structure pairs used for the calculation of SCSD values is large enough (approx. > 1000), this relationship is almost independent of the data base used, as can be shown from the investigation of more than 50 different substituents [3,7]. Those SCS values have been widely used to estimate chemical shifts for polysubstituted benzene derivatives. However, the very large SCSD ranges shown in Table 1 demonstrate that using such SCS values derived from monosubstituted benzenes to predict chemical shifts can be very unreliable in many cases of polysubstituted derivatives. A more detailed discussion of this problem will be given elsewhere [7]. Recently, a new spectral prediction method which can automatically utilize SCSD values derived from both mono- and polysubstituted parent structures has been proposed [8], showing very good results [9].
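The per-position statistics reported in Table 1 (highest and lowest value with the corresponding structure pair, and separate averages with hit counts for positive and negative values) can be sketched as follows. This is a minimal illustration, not the program's actual code. The record layout, the function name and the toy numbers are hypothetical, and excluding zero differences from both averages is an assumption suggested by the hit counts in the tables rather than stated in the text.

```python
from statistics import mean


def summarize_position(records):
    """Summarize the SCSD values of one position.

    `records` is a non-empty list of (scsd_value, pair_label) tuples.
    """
    if not records:
        raise ValueError("at least one record is required")
    highest = max(records, key=lambda record: record[0])
    lowest = min(records, key=lambda record: record[0])
    plus = [value for value, _ in records if value > 0]
    minus = [value for value, _ in records if value < 0]
    return {
        "highest": highest,  # value plus the structure pair it came from
        "lowest": lowest,
        "plus_mean": round(mean(plus), 1) if plus else 0.0,
        "plus_hits": len(plus),
        "minus_mean": round(mean(minus), 1) if minus else 0.0,
        "minus_hits": len(minus),
    }


# Invented toy values (ppm) and pair labels, for illustration only.
toy_records = [
    (4.0, "pair-A"),
    (-2.0, "pair-B"),
    (1.0, "pair-C"),
    (-1.0, "pair-D"),
    (0.0, "pair-E"),
]
summary = summarize_position(toy_records)
```

Keeping the pair label next to each extreme value mirrors the 'Entries used' columns, which let the user go straight from a conspicuous number to the two reference entries behind it.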
Example 2 shows the comparison of the influences of hydroxy and acetoxy groups on the chemical shifts within the steroid skeleton of 3β,5β-derivatives. In this case, input mode (4) of the SCSD program was used, i.e., hydroxy and acetoxy groups were entered as two different substituents and the partial structure shown in Fig. 3 was entered as the fragment. According to the difference of the molecular formulas of the hydroxy and acetoxy substituents, a total of fourteen possible structure pairs was obtained. Only seven pairs were valid according to the partial structure condition and were further used in the calculation of SCSD values. A copy of the final result is given in Table 2, and the graphical representation of this result is summarized in Fig. 4, showing comparably narrow ranges within the A-ring, in good agreement with the well-known influences caused by acetylation.
Fig. 2. The structure pair having the largest SCSD value (43.5 ppm) at position 1 within the benzene ring. The SCSD values are inserted into the structural diagram.
TABLE 1
The statistical information on substituent-induced chemical shift values induced by fluorine substitution
The structures used in this calculation contain exactly one benzene ring; data bases used: A B C F H J K L M. All possible structure pairs found = 9089. Total number of structure pairs used = 2821.
Differences of chemical shifts in each position (Plus/Minus: average values)
Position *   Highest   Entries used ** (1-2)   Plus    Hits   Minus    Hits   Lowest   Entries used (1-2)
1             43.5     1096F-7257C             33.1    2813     0.0             7.2    20446A-5705A
3              0.8     25988B-14441A            0.8       2   -12.9    2813   -30.5    11852A-14337A
4              7.2     7209B-8592B              2.0    2370    -1.6     445    -8.4    5962B-1600H
5             18.0     16771B-6517B             4.3     113    -4.4    2702   -30.1    25988B-14441A
6              7.2     7209B-8592B              1.8    2452    -0.9     361    -7.1    17920A-970B
7              0.8     25988B-14441A            0.8       2   -13.1    2812   -21.9    19345B-1988A
* For carbon numbering, see Fig. 1.
** The two ‘Entries used’ columns show the structure pairs (record address and data base identifier) corresponding to ‘Highest values’ and ‘Lowest values’, respectively.
Fig. 3. The structural fragment for example 2, where X indicates the position of the substituents (-OH or -O-COCH3).
TABLE 2
The statistical information on chemical shift difference values induced by HO- and CH3COO-
The structures used during the calculation are 3β,5β-steroidal compounds stored in data base Z. All possible structure pairs found = 14. Total number of structure pairs used = 7.
Differences of chemical shifts in each position (Plus/Minus: average values)
Position *   Highest   Entries used ** (1-2)   Plus    Hits   Minus    Hits   Lowest   Entries used (1-2)
1              0.5     352-362                  0.5       1    -1.2       6    -3.7    372-382
2              3.8     372-382                  3.0       7     0.0             2.6    152-162
3             -2.0     352-362                  0.0            -3.8       7    -4.6    152-162
5              4.2     352-362                  3.0       7     0.0             2.7    152-162
6              0.4     352-362                  0.4       1    -1.0       6    -1.5    152-162
7              0.3     152-162                  0.3       6    -0.1       1    -0.1    372-382
8              0.4     262-272                  0.3       4    -0.1       3    -0.2    352-362
9              0.4     352-362                  0.2       3    -0.7       4    -1.8    372-382
10             0.6     372-382                  0.3       3    -0.1       4    -0.1    152-212
* For carbon numbering, see Fig. 3.
** Same as Table 1.
AUTOMATIC DETECTION OF 13C NMR SPECTRUM DATA BASE ERRORS BY COMPARISON OF SPECTRA OF SIMILAR COMPOUNDS
Computer-based spectroscopic data bases are becoming increasingly important in many areas of chemistry. On the other hand, as the information content of data bases increases, the validation of the stored reference material becomes more and more challenging. A recently proposed method [5] for automatic detection of data base errors is based on the comparison of two identical structures and their corresponding spectra. This method has proved to be a very powerful tool for detection of data base errors, but it is only applicable to those cases where redundant spectra of a compound exist in the data base under investigation. In this section, we present another approach to detect errors by comparison of similar compounds.
The novel approach is based on the SCSD algorithm [2,3]. As has been discussed previously, the SCSD algorithm can automatically extract SCSD values from all the possible structure pairs available in a data base. The two structures of each pair are identical except that they differ by only one substituent, depending on the given information. By selection of input modes (3) or (4) the SCSD program produces a table containing the average SCSD values, the largest and the smallest SCSD values of each carbon atom of the structural fragment under investigation as well as the corresponding structure pair information. The extreme SCSD values can give some hints about errors because an unusual SCSD value is quite probably the result of a mistake. Therefore, with such extreme SCSD values and the corresponding structure pair information available, it is easy to check possible errors behind each corresponding reference entry.
Another method for error detection is based on the comparison of experimental and estimated chemical shift values [5]. Like the method based on the SCSD algorithm, this technique is also not restricted to identical structure pairs. It should be emphasized that these three different methods are all powerful tools for automatic error detection and they should be applied to existing data base systems in order to achieve high-quality reference data for the spectroscopist. In fact, these methods have been implemented into the CSEARCH-NMR data base system for the automatic error detection within a data collection of some 85 000 NMR spectra.
EXPERIMENTAL
The method described here has been implemented in the CSEARCH-NMR data base system. The data collection of 13C NMR spectra consists of some 85 000 spectra, including the libraries of the University of Vienna, SADTLER Research Laboratories and the German Cancer Research Center at Heidelberg. The programs, consisting of about 5000 lines of source code, have been written in FORTRAN 77 under the UNIX operating system on a Silicon Graphics workstation and an IBM R6000 workstation.
CONCLUSION
The SCSD approach described in this paper can extract the complete SCSD information from a large data base according to one or two input substituents and, optionally, an additional fragment, depending on the user’s choice. The final result shows the effects of the substituent(s) under investigation on chemical shift changes at specific positions in a molecular structure. The four different input modes enable the user to generate conveniently with this tool complete and precise answers to various frequently encountered questions concerning the substituent influences on 13C NMR chemical shift values. Furthermore, this method has found application in the automatic detection of spectral data base errors through its ability to extract the extreme SCSD values and the corresponding structure pair information.
Fig. 4. The graphical representation of the results for example 2 taken from Table 2. Panels: the ranges of chemical shift differences (largest/smallest) and the mean values of chemical shift differences (positive/negative).
ACKNOWLEDGEMENT
LC thanks the Austrian Academic Exchange Service for financial support. The authors are grateful to the staff of the University Computing Center for helpful discussions during the program development. This project was supported by the European Academic Supercomputing Initiative (EASI).
REFERENCES
1 H. Kalchhauser and W. Robien, CSEARCH: A computer program for identification of organic compounds and fully automated assignment of carbon-13 nuclear magnetic resonance spectra, Journal of Chemical Information and Computer Sciences, 25 (1985) 103-108.
2 L. Chen and W. Robien, Journal of Chemical Information and Computer Sciences, in press.
3 L. Chen, Ph.D. Thesis, University of Vienna, in preparation.
4 L. Chen and W. Robien, MCSS: A new algorithm for perception of maximal common substructures and its application to NMR spectral studies. 1. The algorithm, Journal of Chemical Information and Computer Sciences, 32 (1992) 501-506.
5 L. Chen and W. Robien, MCSS: A new algorithm for perception of maximal common substructures and its application to NMR spectral studies. 2. Applications, Journal of Chemical Information and Computer Sciences, 32 (1992) 507-510.
6 E. Pretsch, A. Fürst and W. Robien, Parameter set for the prediction of the 13C NMR chemical shifts of sp2- and sp-hybridized carbon atoms in organic compounds, Analytica Chimica Acta, 248 (1991) 415-428.
7 L. Chen and W. Robien, Journal of Chemical Information and Computer Sciences, in press.
8 L. Chen and W. Robien, A novel approach for optimized prediction of 13C NMR spectra using increments, Analytica Chimica Acta, 272 (1993) 301-308.
9 L. Chen and W. Robien, Optimized prediction of 13C NMR spectra using increments. Comparison with other methods, Fresenius’ Journal of Analytical Chemistry, 344 (1992) 214-216.