Analytica Chimica Acta, 221 (1989) 345-351
Elsevier Science Publishers B.V., Amsterdam. Printed in The Netherlands
Short Communication
A COMPUTER SEARCH SYSTEM FOR SIMILAR ORGANIC COMPOUNDS IN CARBON-13 NUCLEAR MAGNETIC RESONANCE DATA FILES
SHEN-GANG YUAN*, YUN-WEI WANG, DONG CHEN and CHONG-ZHI ZHENG
Shanghai Institute of Organic Chemistry, Academia Sinica, Shanghai (China)
(Received 29th August 1987)
Summary. From the 13C-NMR spectrum of an unknown compound, the system provides a list of compounds ranked according to their similarity to the unknown. The similarity is estimated in a three-dimensional feature space, rather than by direct match of all peaks. All the spectra in the library are first converted to pattern points in the feature space by a dimensionality-reduction method. Thus, the search for similar compounds is simplified to a search for points within a given distance from the point representing the unknown. The compounds listed can be offered as the result or used for further operations (match of carbon number and peak position) in order to get a more exact result. An auto-optimization option is included to provide efficiency and user convenience.
0003-2670/89/$03.50 © 1989 Elsevier Science Publishers B.V.

Carbon-13 nuclear magnetic resonance (NMR) spectroscopy is valuable in obtaining information about the carbon skeleton of organic compounds and has been widely applied in organic structural analysis. Interpretation of 13C-NMR spectroscopic data for structure elucidation requires knowledge of the relationship between chemical shifts and molecular structures, and a great deal of work has been done to establish such relationships theoretically. However, the complexity of the factors affecting the shielding constants of carbon nuclei restricts the application of these theoretical approaches. Since Grant and Paul [1] and Lindeman and Adams [2] proposed an empirical relationship between chemical shift and structure for alkanes, the general approach based on correlation between chemical shift and a characteristic segment (substructure) of the molecule has advanced greatly and various correlation models have been proposed. The success of these empirical models depends on correct selection of the characteristic segment in the structure and on the availability of large collections of reference spectral data. However, large collections of reference data are not easily handled by conventional means; consequently, many computer data-bank systems have been developed along with the rapid increase of 13C-NMR information. To develop the application of computers to the simulation and elucidation of spectroscopic data in this laboratory, an infrared data bank was first established [3]. More recently, a 13C-NMR data bank was developed. This communication describes the new search method for similar compounds, based on 13C-NMR spectra, that is used in the system.

Description of the system

Most present 13C-NMR file-search systems are based on the philosophy adopted for the mass and infrared spectra data banks, which were developed much earlier. These systems are generally of two kinds: original spectrum searches and compressed spectrum searches. Though the former is simple, it needs much more storage space and a longer search time because of the amount of redundant information included in the original spectrum. In view of these shortcomings, Schwarzenbach et al. [4] and Bremser et al. [5] developed spectral code systems to give compressed spectra, which were used in their computer systems. The new method of compression reported in this communication is designed to find similar compounds in the library from the 13C-NMR spectrum of an unknown compound. Given that unknown compounds are often not present in the reference library, a search system is more useful if it can provide a list of reference compounds ranked according to their similarity to the unknown. To facilitate comparison between the 13C-NMR spectra of different compounds, each spectrum was transformed to a point in a three-dimensional feature space. The Euclidean distance between the points is taken as the measure of the similarity of the spectra and hence of the compounds. Thus, the search for similar compounds is simplified to a search for all the points within a given distance from the point representing the unknown. The order of these distances is the order of the similarity between the unknown and the reference compounds.
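The distance-based ranking just described can be illustrated with a minimal Python sketch. The function name `rank_similar`, the compound labels and all coordinates are hypothetical and serve only as an illustration; they are not taken from the system reported here.

```python
import math

def rank_similar(unknown, library, max_distance):
    """Return (distance, name) pairs for all library points lying within
    max_distance of the unknown, ordered from most to least similar."""
    hits = []
    for name, point in library.items():
        d = math.dist(unknown, point)  # Euclidean distance in the feature space
        if d <= max_distance:
            hits.append((d, name))
    return sorted(hits)

# Hypothetical feature-space coordinates, for illustration only.
library = {
    "compound A": (0.10, -0.20, 0.05),
    "compound B": (0.14, -0.16, 0.09),
    "compound C": (0.80, 0.40, -0.30),
}
unknown = (0.11, -0.19, 0.06)

ranked = rank_similar(unknown, library, max_distance=0.1)
for distance, name in ranked:
    print(f"{name}: {distance:.3f}")
```

In this toy example only compounds A and B fall within the chosen distance, and A is listed first because its point lies closest to the unknown.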
Such a search result can be provided directly to the user, or used for further operations, for example matching of carbon numbers or a forward or reverse peak match, depending on the requirements of the user. The system is based on an auto-optimizing search to provide high efficiency and convenience for the user.

Definition of the feature space

In the first step of the transformation of the 13C-NMR spectra to the feature space, the method used is similar to that of Wilkins et al. [6] for the conversion of graphic information to digital data. In this step, some information which is redundant or difficult to utilize (e.g., peak intensity and peak multiplicity) is not included. The intensities are not quantitatively proportional to the number of nuclei in resonance, and multiplicities are not present in most original spectra. In this work, the spectrum range of 0-200 ppm was selected and divided into 200 1-ppm intervals; within each interval, if one or more peaks appear, the interval is denoted by one, otherwise by zero. Because the occurrence of
peaks outside the specified range is relatively low, peaks appearing outside the range are neglected in the transformation to feature space. However, while the search for similar compounds is in progress, the information on these peaks is still available. After this treatment, each spectrum is converted to a point in the 200-dimensional pattern space.

In pattern recognition, several methods are available for converting a data point in a space of high dimensionality to a point in a low-dimensional space; among the most used are the nonlinear mapping and Karhunen-Loève transformations. However, quite apart from the difficulties found in their practical application, such methods have the serious and seldom noticed shortcoming that they treat all data points as a set; consequently, if a new data point is added, it is necessary to recalculate the coordinates for all the points [7]. Thus these methods are not suitable for computer file-search systems. On the basis of the work of Fukunaga and Olsen [8], Lin and Chen [7] proposed a method for converting multivariate chemical information to a three-dimensional feature space. Though their method has been criticized by Drack [9], the algorithm is very simple and should be very suitable for the file-search system as long as the reference points are selected properly.

Wehrli and Wirthlin [10] have listed the peak positions of 13C nuclei in 137 different chemical environments. In the work reported here, all these chemical environments were divided into six kinds according to their appearance in each 1-ppm interval. These six kinds form the three pairs of reference spectra: $R_1^-$ and $R_1^+$, $R_2^-$ and $R_2^+$, and $R_3^-$ and $R_3^+$. For each reference spectrum, each 1-ppm interval is either marked zero if none of the chemical environments appears, or weighted with a value between 0.1 and 1.0 if one or more appear. The weight is used to distinguish between the different chemical environments in the same pair of reference spectra.
For example, if in the case of $R_1^-$, each 1-ppm interval has three chemical environments in the ranges 0-3 ppm and 178-188 ppm, then the weights are 0.1 for the first to third 1-ppm intervals and 0.2 for the 178th to 188th intervals. The six corresponding reference spectra are shown in Fig. 1. The height of the peaks indicates the weighting. After selection of the reference points, the feature space can be defined according to the method of Lin and Chen [7]. The coordinates of every spectrum in the library can be calculated very conveniently from the following formulas:

$$x_i = \frac{d(iR_1^-) - d(iR_1^+)}{d(iR_1^-) + d(iR_1^+)}$$

$$y_i = \frac{d(iR_2^-) - d(iR_2^+)}{d(iR_2^-) + d(iR_2^+)}$$

$$z_i = \frac{d(iR_3^-) - d(iR_3^+)}{d(iR_3^-) + d(iR_3^+)}$$

where $d(iR_j^-)$ and $d(iR_j^+)$ are defined as the Euclidean distances between the ith spectrum and the first and second reference spectra of the jth pair, respectively. In this way, all the spectra in the library can be converted to a point in
[Fig. 1. The bar spectra for the three pairs of reference spectra (horizontal axis in ppm).]
the feature space. The search for similar compounds can then be conducted in a space which describes the spectral information in a very compact form.

Search procedure

To keep its efficiency high, the search does not compare the unknown spectrum with all the spectra in the library. The system first calculates the coordinates and the modulus of the unknown in the feature space using the same method, and then selects only those spectra whose modulus differs from that of the unknown by no more than a given value (calculated from the similarity wanted). The system then calculates the distance between the unknown and each of these selected spectra in the feature space: if the distance is smaller than the similarity index, the corresponding spectrum is a candidate. These candidates can be output directly as the result, or can be treated in further operations. The operations available at present are a match of carbon numbers and a forward or reverse peak match within a variable error range. These operations can be varied to suit the special requirements of the user.

Because the search algorithm is very simple, only one factor, the similarity index, affects the result. For convenient general use, the similarity index can be selected by the user, or it can be set and regulated by the system depending on the search condition, so that in any case some reasonable result is achieved. To avoid the effects of variable peak positions (caused, for example, by different experimental conditions), a variable match error is included in the peak-match algorithm.

Results and discussion

At present, about 20 000 13C-NMR spectra are included in the library. The data base includes spectral data (chemical shifts, multiplicity, relaxation time), structural data (compound name, formula, WLN), bibliography (authors, journal, date) and experimental conditions (instrument, temperature, solvent, etc.).
The data base is augmented continually by the addition of new data collected from scientific journals and from new compounds synthesized at this Institute. All these data are input twice by different persons and have to pass self-consistency checks and cross-checks in order to avoid input errors and ensure internal consistency.

Generally, whether or not the input unknown spectrum is present in the library, the system will produce a series of candidate compounds ranked in order of similarity to the unknown. If the unknown is present in the library, it is usually listed first. Thus, the user can always get some valuable structural information about the unknown compound. Some representative search results obtained with test spectra are presented in Table 1. Neither 3-methyl-1-butyne nor 1-heptyne is included in the library; with these compounds, the outputs were three large cycloalkynes and 1-octyne, respectively, which can be considered satisfactory. Similarly, for the two compounds present in the library, the output list of similar compounds covers a reasonably wide range. Such results are very suitable as a basis for further computerized automatic elucidation or, for a user who needs more exactly similar compounds, as a basis for the further operations available in the system.

TABLE 1
Some search results

Sample                      Reference compound
3-Methyl-1-butyne           1,3,9,11-Cyclohexadecatetrayne
                            1,3,10,12-Cyclooctadecatetrayne
                            1,3,10,12,19,21,28,30-Cyclohexatriacontaoctyne
1-Heptyne                   1-Octyne
3-Methyl-3-ethylpentane     3-Methyl-3-ethylpentane
                            2-Methylbutane
                            3,3-Dimethylpentane
                            1-Bromopentane
Acetic acid                 Acetic acid
                            Acetamide
                            Acetate N,N,N-trimethylmethanaminium

Test operations over a period of months have shown that the file-search system is particularly suitable for compounds with numerous peaks, for which a search by conventional methods is relatively slow. Another advantage of the system is that, because the similarity index can be set by the user according to the compound, if no similar compound is found, the index can be relaxed automatically or manually so that less similar compounds will be found.

Conclusion

The new search method satisfactorily solves the problem of searching for similar compounds from 13C-NMR spectral data. However, the system is not yet optimal, especially with regard to the selection of the reference spectra. Further studies on the relationship between the 13C-NMR chemical shift and structure are in progress to improve the system. The preliminary results have shown not only that the system is suitable for finding similar compounds, but also that the output compounds are suitable as a starting point for automated structure elucidation. The latter saves computational time, increases efficiency and provides more relevant results in the elucidation step.
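As an illustration of the feature-space transformation and the search procedure described in this communication, the steps can be put together in a short Python sketch. It is a sketch under stated assumptions, not the program used in the system: the function names, the peak lists and, in particular, the reference spectra in `REF_PAIRS` are hypothetical placeholders (the actual weights are those of Fig. 1), the treatment of a peak at exactly 200 ppm is an assumption, and the modulus window is assumed to equal the similarity index.

```python
import math

N_INTERVALS = 200  # 0-200 ppm divided into 1-ppm intervals

def encode(peaks_ppm):
    """Binary pattern vector: 1 for every interval holding at least one peak."""
    vec = [0.0] * N_INTERVALS
    for shift in peaks_ppm:
        if 0.0 <= shift <= 200.0:  # peaks outside the range are neglected
            vec[min(int(shift), N_INTERVALS - 1)] = 1.0  # 200 ppm -> last interval (assumption)
    return vec

def reference(ranges):
    """Weighted reference spectrum built from (first, last_exclusive, weight) ranges."""
    vec = [0.0] * N_INTERVALS
    for first, last, weight in ranges:
        for k in range(first, last):
            vec[k] = weight
    return vec

# Placeholder reference pairs (R-, R+); the ranges and weights are invented.
REF_PAIRS = [
    (reference([(0, 3, 0.1), (178, 188, 0.2)]), reference([(10, 50, 0.5)])),
    (reference([(50, 90, 0.3)]), reference([(100, 150, 0.6)])),
    (reference([(150, 200, 0.4)]), reference([(20, 40, 1.0)])),
]

def coordinate(vec, ref_minus, ref_plus):
    """One coordinate: (d- - d+) / (d- + d+) for a pair of reference spectra."""
    d_minus = math.dist(vec, ref_minus)
    d_plus = math.dist(vec, ref_plus)
    total = d_minus + d_plus
    return (d_minus - d_plus) / total if total else 0.0

def to_feature_point(vec, ref_pairs=REF_PAIRS):
    """Map a 200-dimensional pattern vector to (x, y, z)."""
    return tuple(coordinate(vec, r_minus, r_plus) for r_minus, r_plus in ref_pairs)

def search(unknown_pt, library_pts, index):
    """Modulus prefilter followed by the distance test; returns ranked candidates."""
    unknown_modulus = math.hypot(*unknown_pt)
    hits = []
    for name, pt in library_pts.items():
        if abs(math.hypot(*pt) - unknown_modulus) > index:
            continue  # modulus too different: skip the distance calculation
        d = math.dist(unknown_pt, pt)
        if d < index:
            hits.append((d, name))
    return sorted(hits)

# Hypothetical peak lists (ppm), for illustration only.
spectra = {"a": [13.9, 22.4, 31.6], "b": [14.1, 22.9, 31.2, 68.3], "c": [128.5, 171.0]}
points = {name: to_feature_point(encode(peaks)) for name, peaks in spectra.items()}
hits = search(points["a"], points, index=4.0)
```

Because the difference between the moduli of two points can never exceed the distance between them, a prefilter with this window cannot discard a spectrum that would have passed the distance test; it only avoids distance calculations for points that are certain to fail.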
REFERENCES

1 D.M. Grant and E.E.G. Paul, J. Am. Chem. Soc., 85 (1963) 1701; 86 (1964) 2984.
2 L.P. Lindeman and J.O. Adams, Anal. Chem., 43 (1971) 1245.
3 C.-Z. Zheng, Y. Wang, C. Qian, C.Z. Nie and Y.C. Hui, Kexue Tongbao, 11 (1981) 663.
4 R. Schwarzenbach, J. Meili, H. Konitzer and J.T. Clerc, Org. Magn. Reson., 8 (1976) 11.
5 W. Bremser, B. Franke and H. Wagner, Chemical Shift Ranges in Carbon-13 NMR Spectroscopy, Verlag Chemie, Weinheim, 1982.
6 C.L. Wilkins, R.C. Williams, T.R. Brunner and P.J. McCombie, J. Am. Chem. Soc., 96 (1974) 418.
7 C.H. Lin and H.F. Chen, Anal. Chem., 49 (1977) 1357.
8 K. Fukunaga and D.R. Olsen, IEEE Trans. Comput., 20 (1971) 917.
9 H. Drack, Anal. Chem., 50 (1978) 2147.
10 F.W. Wehrli and T. Wirthlin, Interpretation of Carbon-13 NMR Spectra, Heyden, London, 1976.