Journal of Molecular Structure, 319 (1994) 31-39
0022-2860/94/$07.00 © 1994 - Elsevier Science B.V. All rights reserved
Chemical structure and IR spectra of organic compounds connected by statistical methods
V.A. Dementiev*, S.D. Timchenko
Department of Physics, Timiryasev Agricultural Academy, 127550 Moscow, Russian Federation
(Received 22 July 1993)
Abstract
To determine the connection between the structure of a compound and its non-characteristic spectral attributes, a computer experiment was carried out in which the methods of factor analysis were applied to a set of IR spectra. A clustering of the initial spectra was obtained that matches the chemical classification of the investigated compounds. A universal basis was obtained for the transition to a low-dimensional space in which the investigated spectra are described by 11 factors instead of the initial 65. The idea of spectral space identification is proposed. A linear statistical model is presented that enables the IR spectra of organic compounds to be constructed from their structural formulae.
Introduction
The application of the methods of mathematical statistics to large files of IR spectra has previously been connected with the problem of creating correlation tables for the automatic recognition of compounds from their spectra by expert systems. The efforts of researchers have thus been directed towards establishing confidence levels for the characteristic spectral attributes of given nuclear groups. This task was solved successfully for characteristic vibrational frequencies, but not for intensities, since the absorption intensities in the IR spectrum of a molecule are complex functions of the whole molecular structure rather than of separate groups. Information about intensities and about non-characteristic frequencies has therefore often remained unused in such research. As a consequence, the loss of spectral information at the input of an expert system led to ambiguity in establishing the structure of the investigated compounds. Attempts to recover these losses by bringing the spectral “fingerprint” region within the scope of an expert system greatly complicate the logic of the expert system, because the computing apparatus of molecular vibration theory has to be included [1].
It is interesting to investigate the possibility of applying multidimensional statistical methods to a set of spectra with the aim of revealing spectral-structural correlations that involve all the accessible information in a spectrum, and not just information about characteristic spectral attributes. If successful, this would provide an opportunity to increase substantially the intelligence of an expert system. Since the “fingerprint” region provides rich information on the features of nuclear groups taken jointly, processing the complete spectral picture can enable an expert system to recognise the structure of the investigated compounds without using the complex methods of the analytical theory of spectra.
* Corresponding author.
SSDI 0022-2860(93)07892-Z
V.A. Dementiev, S.D. Timchenko / J. Mol. Struct. 319 (1994) 31-39
The purpose of the present work is to answer the following questions. Is it possible to classify sets of IR spectra of large organic compounds so as to reveal the internal laws of variability of the spectral attributes of the compounds in connection with the features of their structural formulae? Is it possible to find a small number of integrated statistical characteristics that describe the whole set of spectral and structural features of the investigated classes of compounds? Is it possible, on the basis of these characteristics, to construct the complete spectrum of a compound with sufficient accuracy from several registered spectral points or from a given structural formula?
Initially, to exclude the influence of incidental noise and background, we carried out a computer experiment whose object was not experimental but calculated IR spectra, reported in refs. 2, 3 and 4. These spectra were obtained by fitting theoretical spectra to experimental ones using methods for solving the inverse spectral problem, so that they reflect spectral reality with sufficient accuracy (frequency with an error of up to 10-20 cm⁻¹, intensity with an error of up to 20%). We may thus be considered to have used vibrational spectral theory to clean the experimental spectral data of extraneous factors, and hereafter we treat the investigated spectra as a sample of experimental data. From now on we disregard the known results of the theoretical analysis of these spectra, since we want to obtain information on their internal laws and their relations with structure by statistical methods alone.
Fig. 1. IR spectra of compounds used in the statistical experiment.
Eighty-one spectra of compounds belonging to the classes of alkanes, alkenes, cycloalkanes, hydrocarbons with double, triple and conjugated bonds, and the corresponding benzene derivatives were used in the experiment. The head molecules of some homologous series (acetylene, allene, benzene) were not included in the investigated set, as their spectra are relatively poor. The spectral region 400-1700 cm⁻¹ was considered. This region is of interest because both characteristic and non-characteristic vibrations of the given molecules appear in it. The region of higher frequencies is less interesting, because mostly characteristic vibrations appear there. The spectral region of interest was divided into 65 intervals of 20 cm⁻¹ width, in accordance with the accepted accuracy of the inverse spectral methods used for the calculated spectra. In the present work the spectrum of an organic
compound used as the object of statistical investigation was represented by a random vector of 65 dimensions. The components of the vector were the absolute absorption intensities in each interval, expressed in absolute units of 10⁻* cm² mol⁻¹ s⁻¹. Each particular spectrum was considered as a realisation of the random vector. The information on the structural and group content of the corresponding chemical compounds was also represented by a random vector. The components of this vector in a particular realisation were the number of structural groups of each type in the given compound and the number of bonds between each possible pair of adjacent groups. The following groups were considered to be standard structural units of organic compounds: CH3-, -CH2-, -CH<, CH2=CH-, -CH=CH- (cis, trans), -CH=C<, Ph with various types of substitution, and other similar groups.
Figure 1 shows all 81 spectra. It is obvious from the figure that it is very difficult to classify this set of spectra in any way other than by picking out characteristic frequencies. Each spectrum described in this way can be represented by a point in 65-dimensional space, and the whole investigated set by 81 points in this space. In comparison with Fig. 1 this method of representing the information gives no additional clarity. However, it allows us to propose a new method of spectral space identification, which consists in a transition to a new space of lower dimension in which the points representing the spectra are gathered in relatively compact clusters characterised by high statistical uniformity of the internal relations between spectral attributes. The transition to the required new basis can be found using the methods of factor analysis [5,6]. The transition technique used in the present work, which is based on principal component analysis of the above-mentioned random vectors, has previously been described by us in detail [7].
Results
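The 65-component representation of a spectrum introduced before this section can be sketched as follows. This is only a minimal illustration, not the authors' program: the function name `spectrum_to_vector`, the line list and the choice of summing the intensities of all lines that fall in one interval are assumptions made here for the example. Only the region 400-1700 cm⁻¹ and the 20 cm⁻¹ interval width come from the text.

```python
import numpy as np

# Region and interval width from the text: 400-1700 cm^-1 in 20 cm^-1 steps.
# 66 edges delimit 65 intervals.
EDGES = np.linspace(400.0, 1700.0, 66)


def spectrum_to_vector(wavenumbers, intensities):
    """Sum the intensities of the lines falling in each 20 cm^-1 interval.

    Summation within an interval is an assumption of this sketch; lines
    outside 400-1700 cm^-1 are ignored.
    """
    vector, _ = np.histogram(wavenumbers, bins=EDGES, weights=intensities)
    return vector


# Hypothetical line list (wavenumber in cm^-1, absolute intensity).
lines_nu = np.array([725.0, 731.0, 1465.0, 1650.0, 2950.0])
lines_int = np.array([3.0, 1.5, 8.0, 2.0, 40.0])

x = spectrum_to_vector(lines_nu, lines_int)  # one realisation of the 65-vector
```

Stacking one such vector per compound row by row would give the 81 x 65 data matrix on which the statistical analysis operates.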
An analysis of the principal components carried out for all the investigated spectra showed that the first 11 eigenvectors (e.v.) of the covariance matrix account for 80% of the total variance, as shown in Fig. 2. Given the prevailing contribution of the first 11 e.v. to the total variance, we adopted these 11 vectors as the new basis set and decomposed the investigated spectra in this basis. It was found that any spectrum of the investigated set can be restored from the values of the eleven factors with a lower
error than that of the solution of the inverse spectral problems (20% in intensity). We note in particular that the columns of the transition matrix are universal functions (factors) over the field of investigated spectra, which enable the spectral information to be compressed on transition to the new space. The existence of such functions and the possibility of compressing the spectral information are a statistical analogue of the possibility of constructing adequate vibrational models of molecules with universal parameterisation, together with an analytical theory of molecular vibrations that predicts the spectral appearance of these models. Hence the basis vectors of the new spectral space play the same role as the parameters of molecular models. The technique of compression and restoration of the spectral information described above is the statistical model of the given set of spectra. It is clear that restoring a spectrum by this technique on a computer is faster than restoring it by the calculation methods of the theory of molecular vibrational spectra.
Fig. 2. Cumulative sum of the variance obtained from the eigenvectors of the covariance matrix of the spectra.
Fig. 3. Factor loadings on the first three eigenvectors: 1-F1; 2-F2; 3-F3.
Figure 3 shows the factor loadings for the first three e.v. of the covariance matrix. It is clear that for the basis vector F1, significant correlation coefficients between
intensities of absorption in the intervals at wavenumbers of 880, 960, 1200 and 1600 cm⁻¹ are revealed. For the basis vector F2, the intensities in the intervals at 720, 860 and 980 cm⁻¹ correlate. For the basis vector F3 the only essential intensity is that in the 820 cm⁻¹ interval. It is clear from Fig. 3 that the correlation coefficients between groups of intensities are not large, which explains the failure of approaches that treat the intensities of absorption lines as characteristic attributes of nuclear groups. However, these correlation coefficients are significant, and they permit the dimension of the space of spectral representation to be reduced from 65 to 11.
In the new 11-dimensional space all the investigated spectra are again represented by 81 points, but the points are now grouped in several relatively compact clusters. Within each cluster the set of spectra has greater statistical uniformity of the relation between structure and spectral attributes. Therefore, for spectra belonging to one cluster, it is possible to make a transition to a new system of basis vectors of even lower dimension and to characterise each spectrum by a smaller number of factors. The mutual correlation of spectral attributes thus grows inside a cluster, enabling the spectral classification to be refined.
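The compression and restoration of spectra described above can be sketched with a standard principal component analysis. The code below is a minimal sketch under stated assumptions, not the authors' procedure (which is described in ref. 7): the function names are invented for the example, and the 81 x 65 data matrix is synthetic random data standing in for the real spectra. The 80% threshold is the share of the total variance quoted in the text.

```python
import numpy as np


def principal_basis(X, variance_share=0.80):
    """Return the mean spectrum, the leading eigenvectors (as columns) and
    the cumulative share of variance carried by the sorted eigenvectors."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)      # 65 x 65 covariance matrix
    eigval, eigvec = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
    order = np.argsort(eigval)[::-1]          # largest variance first
    eigval, eigvec = eigval[order], eigvec[:, order]
    cumulative = np.cumsum(eigval) / eigval.sum()
    # Smallest number of eigenvectors whose cumulative share reaches the threshold.
    k = int(np.searchsorted(cumulative, variance_share)) + 1
    return mean, eigvec[:, :k], cumulative


def compress(X, mean, basis):
    """Factor values: coordinates of the centred spectra in the new basis."""
    return (X - mean) @ basis


def restore(F, mean, basis):
    """Approximate spectra rebuilt from their factor values."""
    return F @ basis.T + mean


# Synthetic stand-in for the 81 x 65 matrix of spectra (NOT the paper's data).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(81, 11))
mixing = rng.normal(size=(11, 65))
X = hidden @ mixing + 0.05 * rng.normal(size=(81, 65))

mean, basis, cumulative = principal_basis(X)
F = compress(X, mean, basis)
X_hat = restore(F, mean, basis)
relative_error = np.linalg.norm(X - X_hat) / np.linalg.norm(X - mean)
```

The first two columns of `F` give the coordinates of each spectrum in the F1-F2 plane, which is the kind of projection in which clusters can be looked for. The squared `relative_error` equals the share of variance left outside the retained eigenvectors.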
As an illustration, consider the transition to the space with the basis F1, F2, F3. These three e.v. account for only 40% of the total variance; the accuracy of restoration of the spectra from three factors is therefore lower than from 11, but we obtain evidence of spectral clustering. Figure 4 shows the projection of this space onto the F1-F2 plane. It is clear that the entire set of investigated spectra is distributed among four relatively compact clusters, separated in the new space by various distances. We note that the most compact, central cluster 1 consists basically of the spectra of normal and cyclic alkanes and alkenes, cluster 2 of the spectra of benzene derivatives, and clusters 3 and 4 of the spectra of compounds with conjugated and isolated double bonds. Thus the classification of the spectra carried out by purely statistical methods agrees with the chemical classification of the compounds included in the experiment.
Fig. 4. Projection of the investigated spectra on the F1-F2 plane.
On the basis of the classification of the investigated spectra by statistical methods, we attempted to construct a statistical model of the relationship between the random spectrum vector and the vector of structural features of a compound. Theoretical research on the relationship between the intensity of a characteristic line in the IR spectra of complex compounds and the number of structural groups of a given type in these compounds [8] allows us to assume that, in
our case, the most adequate appears to be the linear model. In accordance with this assumption we searched for the coefficients of a linear transformation giving the transition from the random vector of structural features to the random vector of spectra. Such a transformation should be sought and used only for statistically similar spectra, within the limits of one cluster, or within the limits of several clusters if the investigated compounds are multifunctional. We investigated this using 23 spectra belonging to clusters 1 and 2. The corresponding random vectors of the structural features of the compounds were constructed for these spectra. The coefficients of the required linear model were obtained by the least squares method. Experience has shown that the model permits a spectrum from the given clusters to be restored from a given structural formula with satisfactory accuracy, provided that the compound consists of a sufficiently large number of structural groups. Figure 5 gives the results of the restoration of some spectra from their structural formulae; these compounds were included in building the linear model. The realisation of the structure vector of tert-butylbenzene, for instance, was given in the following form: groups CH3- 3, group >C< 1, group -Ph 1; bonds CH3->C< 3, bond >C<-Ph 1. The restoration results are satisfactory with respect to both frequencies and intensities. This gives us an opportunity to restore the structure vector of an unknown compound from a given spectral cluster. Simple compounds are restored with less accuracy.
Fig. 5. Comparison of the results of restoration of some spectra using their structural formula with real spectra: (a) tributylbenzene; (b) m-divinylbenzene; (c) isobutylene; (d) ethylbenzene; (e) o-xylene; (f) m-xylene; (g) biphenyl; (h) mesitylene.
Discussion of results
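The linear structure-to-spectrum model fitted by least squares in the preceding section can be sketched as follows. This is a hedged illustration only: the feature ordering in `FEATURES`, the function names and the synthetic training data are assumptions of the example, and the model is taken without an intercept term. Only the tert-butylbenzene counts and the figure of 23 training spectra come from the text.

```python
import numpy as np

# Hypothetical ordering of structural features: group counts, then bond counts.
FEATURES = ["CH3-", ">C<", "-Ph", "CH3->C<", ">C<-Ph"]

# tert-Butylbenzene as given in the text:
# 3 CH3- groups, 1 >C< group, 1 -Ph group, 3 CH3->C< bonds, 1 >C<-Ph bond.
tert_butylbenzene = np.array([3.0, 1.0, 1.0, 3.0, 1.0])


def fit_linear_model(S, X):
    """Least-squares coefficient matrix B minimising ||S @ B - X||.

    S: compounds x structural features, X: compounds x spectral intervals.
    """
    B, *_ = np.linalg.lstsq(S, X, rcond=None)
    return B


def predict_spectrum(s, B):
    """Spectrum vector predicted for one structure vector s."""
    return s @ B


# Synthetic training set of 23 compounds (NOT the paper's data).
rng = np.random.default_rng(1)
S_train = rng.integers(0, 5, size=(23, len(FEATURES))).astype(float)
B_true = rng.uniform(0.0, 1.0, size=(len(FEATURES), 65))
X_train = S_train @ B_true

B = fit_linear_model(S_train, X_train)
x_pred = predict_spectrum(tert_butylbenzene, B)
```

With real structure vectors, group counts and bond counts are often linearly dependent; `np.linalg.lstsq` then returns the minimum-norm solution, which is one reasonable but not unique choice of coefficients.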
The results show that the methods of multidimensional statistics help in constructing models of the IR spectra of a vast set of compounds belonging simultaneously to several chemical classes. In particular, these compounds can be multifunctional, as in the computer experiment described above. With the help of the constructed model, a newly registered experimental spectrum can be identified if the investigated compound belongs to one of the chemical classes included in the model. The spectrum is identified by the values of its factors, determined by decomposing the spectrum in the universal basis obtained in this work. The spectrum is assigned to one of the clusters, which determines the structure of the correlations between the spectral attributes of the investigated compound. The spectrum of a compound can be not only identified but also predicted, using the proposed linear model, on the basis of the features of the structural formula of the compound. It is important that computer prediction by the proposed method is considerably faster than prediction by calculation methods. Hence the method suggested in the present work can be applied successfully in expert spectral systems.
The technique of universal basis vectors and determination of the corresponding factors outlined in this work allows us to look at the process of spectrum registration from a new angle. The factors are measurable. If necessary, it is sufficient to measure a minimum number of factors to restore the whole spectrum with some accuracy. Such a feature may prove technically attractive in the development of various techniques for the express analysis of molecular spectra.
In this work we used information on the absolute intensities of the investigated spectra. The extensive collections of IR spectra from which information could be extracted for the construction of statistical models contain, as a rule, only relative intensities. This is a major difficulty. It may be overcome by applying the internal standard method, which was widely used to obtain the results of the solution of the inverse spectral problem in refs. 2-4.
The proposed technique can influence the logic of the formulation of inverse spectral problems in the theory of the vibrational spectra of molecules. It is known that the purpose of solving an inverse spectral problem is a search for parameters that are valid for long series of multifunctional compounds. At the same time, inverse spectral problems are posed and solved for individual spectra. The technique described in this work permits an integrated spectral portrait of an integrated vibrational model of a multifunctional compound to be constructed. Having obtained such a compression of the structural and spectral information, it is possible to formulate an inverse spectral problem for finding the parameters of the integrated vibrational model, and then to transfer these parameters into the model of a particular compound in order to predict its spectral response. It is clear that such an integrated model is, theoretically, simply the reverse of the transformation described in this work. However, realisation of this idea requires additional research on the ambiguity of the inverse spectral problem.
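The remark in the Discussion that a small number of measurements may suffice to restore a whole spectrum can be illustrated with a short sketch. It rests on an assumption that goes beyond the text: the factor values are estimated by least squares from intensities measured in a few intervals, given an already known mean spectrum and basis. The function name, the basis, the mean and the measured intervals are all hypothetical, and the test spectrum is constructed to lie exactly in the span of the basis, so the restoration here is exact by design.

```python
import numpy as np


def restore_from_points(measured_idx, measured_values, mean, basis):
    """Estimate factor values from a few measured intervals, then rebuild
    the intensities in all intervals from those factor values."""
    A = basis[measured_idx, :]                      # basis rows at measured intervals
    b = measured_values - mean[measured_idx]        # centred measurements
    scores, *_ = np.linalg.lstsq(A, b, rcond=None)  # least-squares factor values
    return mean + basis @ scores, scores


# Synthetic example with k = 3 hypothetical factors (NOT the paper's data).
rng = np.random.default_rng(2)
k = 3
basis, _ = np.linalg.qr(rng.normal(size=(65, k)))   # 65 x 3, orthonormal columns
mean = rng.uniform(1.0, 2.0, size=65)
true_scores = rng.normal(size=k)
spectrum = mean + basis @ true_scores

# Six hypothetical measured intervals out of 65.
measured_idx = np.array([5, 16, 24, 40, 53, 60])
restored, scores = restore_from_points(
    measured_idx, spectrum[measured_idx], mean, basis
)
```

For real spectra, which are only approximately described by the retained factors, the restoration would be approximate, and its accuracy would depend on which intervals are measured.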
References
[1] L.A. Gribov and V.A. Dementiev, Simulation of vibrational spectra of complex compounds on a computer, Science, (1989) 158 (in Russian).
[2] L.A. Gribov, V.A. Dementiev and A.T. Todorovsky, Interpreted vibrational spectra of alkanes, alkenes and benzene derivatives, Science, (1986) 496 (in Russian).
[3] L.A. Gribov, V.A. Dementiev and O.V. Novoselova, Interpreted vibrational spectra of hydrocarbons with isolated and conjugated double and triple bonds, Science, (1987) 472 (in Russian).
[4] M.E. Eliasberg, Ju.Z. Karasev, L.A. Gribov and V.A. Dementiev, Interpreted vibrational spectra of hydrocarbons cyclohexane and cyclopentane derivatives, Science, (1988) 376 (in Russian).
[5] K. Iberla, Factor analysis, Finance and statistics, 1982, 350 pp.
[6] D. Lowley and A. Maxwell, Factor analysis as a statistical method, The World, 1967, 144 pp.
[7] S.D. Timchenko and V.A. Dementiev, Classification of IR spectra of organic compounds by multidimensional statistics methods, J. Struct. Chem., 34 (1993) 94.
[8] L.A. Gribov and W.J. Orville-Thomas, Theory and methods of calculation of molecular spectra, Wiley, Chichester/New York, 1988, 636 pp.