Journal of Molecular Structure, 292 (1993) 141-160 Elsevier Science Publishers B.V., Amsterdam
Neural nets for mass and vibrational spectra

J. Gasteiger*a, X. Lia, V. Simona, M. Novicb, J. Zupanb

aOrganisch-chemisches Institut, Technische Universität München, D-8046 Garching, Germany
bInstitute of Chemistry, Hajdrihova 19, 61115 Ljubljana, Slovenia

*To whom correspondence, proofs and offprints should be sent.

A brief introduction to neural networks, with emphasis on multi-layer networks with error-backpropagation learning and on the Kohonen network, is given. The application of these two types of networks to the investigation of the relationships between structure and mass spectral and IR spectral data is illustrated with reports from the literature and with work of our own.
1. INTRODUCTION

Neural networks have been developed as models of the information processing in the human brain. Their study dates back to the forties with the work of McCulloch and Pitts [1] and that of Hebb [2]. These early attempts and the work that built on them had theoretical weaknesses and were not very successful in solving practical problems; the field therefore suffered severe set-backs in the sixties and seventies. It was not until 1982 that a publication by J. J. Hopfield [3] brought new life into neural network research. A dramatic increase in activity was initiated in 1986 by the introduction of the backpropagation algorithm by D. E. Rumelhart, G. E. Hinton, and R. J. Williams [4]. This led to a host of applications of neural network models for processing information in areas as diverse as engineering, the stock market, picture processing, and medical diagnosis.

The application of neural networks to chemical problems has indeed increased rapidly: there were only three publications in 1988, five in 1989, 30 in 1990, and 110 papers on the use of neural networks in chemistry in 1991. A review on the use of neural networks for solving chemical problems has recently appeared [5].

The interest in neural networks stems from their potential as general problem-solving techniques. The same algorithm can be applied to weather forecasting or to the interpretation of spectral data. Not the algorithm itself, but the information that is put into a neural network and the way it is presented, determines which problem is being solved. A host of different neural network models has been developed, each with its own strengths and emphasis on particular problem types. The following kinds of problems can be dealt with by neural networks [5]:
Classification: An object is assigned to one or several categories out of a series of alternatives.

Modelling: Instead of writing an explicit equation for the relationship between a dependent variable and some parameters (measurements), an implicit relationship is modelled by a neural network.

Association: An object is retrieved even if only incomplete or corrupted data are input.

Projection: High-dimensional information is projected into a space of lower dimensionality.

The first two areas of application are by far the most important ones in chemistry. In this paper only two, albeit very important, neural network models will be presented. This is followed by their application to the study of the relationships between structure and data from mass spectra as well as from IR spectra.
2. NEURAL NETWORKS
A neural network can simply be considered as a box that accepts a series of input data and transforms them into one or more output values. The input can be spectral data with substructures as output, or it can be data on electronic and energy parameters of a bond with a value for the reactivity of this bond as output. Within such a box there are basic processing units, the artificial neurons, or neurons for short, that are models of the biological neurons or nerve cells. These neurons are highly connected. The data received on input are passed along many connecting lines, are transformed on entering a neuron and within the neuron, and are then sent along connecting lines to additional neurons until finally the output values are produced. In Figure 1 a biological neuron is compared with its model, the artificial neuron.

Figure 1: Biological and artificial neuron.

The biological neuron has many dendrites attached to the cell body that serve to receive signals. The drawing in Figure 1 is strongly magnified and simplified, as a neuron has many more dendrites than indicated in this picture. The signals received by the dendrites are sent into the neuron and might cause it to fire. In this case, a signal is sent through the axon, which carries synapses at its ends. These synapses attach to the dendrites of other neurons and thus provide the connections of the neurons into a network. On crossing the synaptic gap, a signal is modified by an extent expressed as the synaptic strength. The strength of a synapse may change with time and is one of the foundations of learning.

The artificial neuron receives a series of data, x_i (i = 1, 2, ..., n). These input data are modified by weights, w_ji, that serve to model the strength of the synapse that is crossed when the signal enters the dendrite. The various signals (data) entering a neuron, j, give rise to a net signal, Net_j, that is obtained by eq. 1.
Net_j = \sum_i w_{ji} x_i   (1)
The output of a neuron, out_j, is not this net result of the input data, but is obtained from it by a transfer function. The most commonly used transfer function is a sigmoidal (logistic or Fermi) function (eq. 2).

out_j = 1 / [1 + \exp(-\alpha_j Net_j + \theta_j)]   (2)
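For illustration only (this code is ours and not taken from any of the studies discussed below), the operations of eqs. 1 and 2 amount to just a few lines of Python; the input values and weights are arbitrary example numbers:

import numpy as np

def neuron_output(x, w, alpha=1.0, theta=0.0):
    """Single artificial neuron: net input (eq. 1) and sigmoidal transfer (eq. 2)."""
    net = np.dot(w, x)                                  # Net_j = sum_i w_ji * x_i  (eq. 1)
    return 1.0 / (1.0 + np.exp(-alpha * net + theta))   # out_j                     (eq. 2)

# Example with three arbitrary input signals and their weights
x = np.array([0.2, 0.7, 0.1])
w = np.array([0.5, -1.3, 0.8])
print(neuron_output(x, w))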
This sigmoidal function has the effect that the relationship between the input and output data can be non-linear. The basic operations in a neuron as expressed by eqs. 1 and 2 are rather simple. The real power of the neural network approach is the result of interconnecting the neurons. Thus, in the expression neural network, the emphasis is on the word network. Different forms and architectures for organizing neurons into a network have been proposed. This organizational form and the method of learning in a neural network are the characteristics of a particular neural network model. More than 20 different neural network types have been proposed. In the following, only two of them, which are considered of more widespread interest for chemical applications, will be explained in some detail: 1. a multi-layer neural network with error-backpropagation learning, and 2. a Kohonen network for building self-organized feature maps.
2.1. Multi-layer neural network
In a multi-layer neural network, the neurons are organized into layers. The neurons of one layer are not connected to each other but are connected to all neurons of the layer directly above and of the layer directly below the considered layer. There is an additional layer of input units that only serves to distribute the input data onto all neurons of the first layer. Figure 2 shows a neural network consisting of two layers of neurons and the input units; we call this a two-layer neural network.

Figure 2: Two-layer neural network with input units, one hidden, and one output layer.

Information flows from the top of Figure 2, the input units, to the bottom. The output of one layer of neurons is the input to the next one. The neurons of the first layer in Figure 2 cannot be accessed directly from the outside and are called hidden neurons. The last layer of neurons provides the output values of the network and is therefore called the output layer. Usually, an additional neuron with connection weights equal to one is added to each layer. This neuron is called the bias. It is responsible for the shift θ in the transfer function (see eq. 2).

The essential task in any neural network application is the determination of the weights for the connections between the neurons. This is achieved in a learning process. Learning in a multi-layer neural network is usually performed by the error-backpropagation algorithm, widely popularized by Rumelhart, Hinton and Williams [4].
The backpropagation algorithm is a supervised learning technique whereby a series of input data and their known associated output data (e.g., the spectrum of a compound and its structure) are presented to the network. The weights are changed iteratively, in epochs of presenting the training set of input/output pairs to the network, until the output data obtained are within a given threshold of the expectation value (target).
Figure 3: The backpropagation algorithm. The error of the output layer, \epsilon_j^{(2)} = T_j - out_j^{(2)}, is propagated back through the network, and the weights are changed according to \Delta w_{ji} = \eta \epsilon_j out_i + \mu \Delta w_{ji}(previous), with \eta the learning rate and \mu the momentum term.

Figure 3 serves to illustrate the essential characteristics of the backpropagation algorithm. First, the weights are assigned randomly. The input data are sent through the layers of the network in a feed-forward manner. The output of the last (here, second) layer, out^{(2)}, is compared with the expected output, the target, T, to determine the error \epsilon^{(2)}. This error is backpropagated from the last to the first layer of the network. The change of the weights, \Delta w_{ji}^{(2)}, in the second layer is made by a gradient descent method, taking the derivative of the error \epsilon^{(2)} with respect to the net input, Net_j. Furthermore, the change of the weights is made proportional to the output value of the layer above, here out^{(1)}. The amount of the change in the weights is determined by a preset parameter \eta, the so-called learning rate.

Whereas the error in the last, the output, layer can be calculated directly, the error in the hidden layer is not known. Here comes in a basic assumption of the backpropagation algorithm, which derives the error in a hidden layer (here \epsilon^{(1)}) from the error in the layer below; hence the name error-backpropagation. Usually, a second term is added to the change of the weights that also considers the change of the weights in the previous iteration. The parameter, \mu, that determines to what extent this previous change is considered is called the momentum term. It brings a memory effect into the learning procedure. The lower the momentum, the more rapidly the previous changes are forgotten.

Multi-layer neural networks with error-backpropagation learning are the most widely used neural network models. In chemistry they account for about 80-90% of all applications. This is because the types of problems that can be tackled with backpropagation networks are quite common in chemistry. They can be applied to classification as well as to modelling problems. In classification, an object (e.g., a spectrum) has to be assigned to one or several categories (e.g., substructures) out of a series of classes. In modelling, a relationship is established between the output values and the input data, much in the same way as by an equation, e.g., one obtained by statistical methods. However, with neural networks this relationship is expressed implicitly in the weights and need not be specified explicitly. In addition, neural networks can also model non-linear relationships.
Although the backpropagation algorithm consists of well-defined equations, its application to a specific problem requires a lot of work to obtain optimum results. First, the architecture of the network has to be determined: the number of layers and the number of neurons in the hidden layers. Values for the learning rate, \eta, and the momentum term, \mu, have to be chosen; quite often they are changed in a systematic manner during the learning process. All this is usually achieved by trial and error.

An important role is played by the selection of the training and test datasets. The size of the training set should be at least as large as the number of weights in the network. Unfortunately, many applications violate this rule. If the number of training data is low, the effect of overtraining becomes very serious: the training dataset can be learned perfectly by the network, because it has so many weights to adjust to the training data, but such a network has no predictive capability, because only a local and not a general solution was found. Above all, the choice of appropriate representations of the input and output information is the most critical point in deciding the success of a neural network study.
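As an illustration of the procedure outlined in Figure 3, the following Python sketch trains a two-layer network by error-backpropagation with a learning rate \eta and a momentum term \mu. It is a simplified sketch of our own (fixed \eta and \mu, a constant input of 1 taking the role of the bias neuron), not the implementation used in any of the studies discussed below; the XOR example data are arbitrary.

import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def train(X, T, n_hidden=3, eta=0.5, mu=0.8, epochs=5000, seed=0):
    """Two-layer network (one hidden, one output layer) trained by
    error-backpropagation with a momentum term (cf. Figure 3)."""
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-0.5, 0.5, (n_hidden, X.shape[1] + 1))   # +1 column: bias input
    W2 = rng.uniform(-0.5, 0.5, (T.shape[1], n_hidden + 1))   # +1 column: bias neuron
    dW1_prev, dW2_prev = np.zeros_like(W1), np.zeros_like(W2)
    for _ in range(epochs):
        for x, t in zip(X, T):
            x1 = np.append(x, 1.0)                       # input plus bias
            out1 = np.append(sigmoid(W1 @ x1), 1.0)      # hidden-layer output plus bias
            out2 = sigmoid(W2 @ out1)                    # output layer
            err2 = (t - out2) * out2 * (1.0 - out2)      # error of the output layer
            err1 = (W2.T @ err2)[:-1] * out1[:-1] * (1.0 - out1[:-1])   # backpropagated error
            dW2 = eta * np.outer(err2, out1) + mu * dW2_prev             # weight changes with
            dW1 = eta * np.outer(err1, x1) + mu * dW1_prev               # learning rate and momentum
            W2 += dW2
            W1 += dW1
            dW2_prev, dW1_prev = dW2, dW1
    return W1, W2

# Example: the XOR relationship, a small non-linear classification problem
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2 = train(X, T)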
2.2. Kohonen network
Whereas error-backpropagation learning is rather remote from the learning mechanism in the human brain, the Kohonen network model [7, 8] is probably the one most closely related to a biological system. The human brain has regions in the cortex that process somatosensory, auditory, and visual information, respectively. These parts of the human brain contain maps of certain regions of the body. Thus, the somatosensory cortex has a map of the entire human body, with those parts of the body that have many sensory receptors (tongue, lips, hands) covering larger parts of the cortex than those having only few sensory receptors.

A Kohonen network has a two-dimensional arrangement of neurons that generates a map of the information input into the network. It is also called a "self-organized topological feature map". This expresses several characteristics of this model: it is an unsupervised learning method (self-organized), it tries to conserve the topology of the information, and it can be taken as a method for projection into two dimensions. Central to a Kohonen network are the concept of topology and the neighborhood relationship among the neurons. Figure 4 shows the first and second sphere of neighbors of a neuron in a quadratic network.

Learning in a Kohonen network is an unsupervised, competitive learning. After sending the input data into a randomly initialized Kohonen network, that neuron, c, with the largest output, or with the weights most similar to the input data, is enhanced even further, i.e., its output is further increased, or its weights are adjusted so as to make them even more similar to the input data (eq. 3).

out_c \leftarrow \min_j [ \sum_{i=1}^{m} (x_i - w_{ji})^2 ]   (3)
The other neurons are also amplified, but to a decreasing amount with increasing distance from the central neuron, c. With each cycle of the learning process the fall-off of the excitation with distance is increased, such that fewer and fewer spheres of neighborhood are enhanced. In the end, the network will have stabilized such that the information input into the network is spread over the plane of neurons. The plane of projection can be a simple two-dimensional plane, or it can be the surface of a torus. A torus is obtained by connecting the top row of the plane to the bottom row and the left-hand column of the plane to the right-hand column. The torus is thus a plane without beginning and end, and each neuron on a torus has the same number of neighbors in each sphere. As we will see, a Kohonen network is a powerful tool to extract the essential characteristics within complex information.

Figure 4: The first and second spheres of neighbors of an activated neuron in a quadratic network.
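The competitive learning of eq. 3 and the shrinking neighborhood described above can be sketched as follows. This is our own minimal Python illustration; the grid size, the learning-rate decay, and the Gaussian fall-off of the excitation are assumed choices, and the toroidal topology mentioned in the text is used:

import numpy as np

def train_kohonen(data, rows=9, cols=9, epochs=50, eta0=0.5, seed=0):
    """Kohonen network: the neuron whose weights are most similar to the
    input (eq. 3) is selected, and it and its neighbors are adjusted.
    Toroidal topology: opposite edges of the plane are connected."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows, cols, data.shape[1]))
    r, c = np.mgrid[0:rows, 0:cols]
    for epoch in range(epochs):
        eta = eta0 * (1.0 - epoch / epochs)                       # decreasing learning rate
        radius = max(1.0, (max(rows, cols) / 2) * (1.0 - epoch / epochs))
        for x in data:
            d2 = ((weights - x) ** 2).sum(axis=2)                 # eq. 3: squared distances
            win = np.unravel_index(np.argmin(d2), d2.shape)       # winning (central) neuron c
            # grid distance to the winning neuron on the torus
            dr = np.minimum(np.abs(r - win[0]), rows - np.abs(r - win[0]))
            dc = np.minimum(np.abs(c - win[1]), cols - np.abs(c - win[1]))
            h = np.exp(-(dr ** 2 + dc ** 2) / (2.0 * radius ** 2))
            weights += eta * h[..., None] * (x - weights)         # adjust winner and neighbors
    return weights

# Example with random 13-dimensional data (cf. the descriptors used in section 4.2)
data = np.random.default_rng(1).random((100, 13))
weights = train_kohonen(data)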
3. APPLICATION OF NEURAL NETWORKS TO SPECTROSCOPIC DATA

Neural networks have already been applied to a host of spectroscopic data, ranging from spectroscopic calibration and baseline correction to spectra-structure correlations [5]. Spectra-structure correlations are a very promising application of neural networks. First, spectral data play an important role in the structure elucidation process. Secondly, these relationships are in most cases too complex to be put into explicit equations.

In this paper we limit ourselves to the investigation of the relationships between the structure of a compound and its mass or infrared spectrum. In both cases, typical work already published is presented first. This is followed by reports on more recent work that tries to remedy some deficiencies of the initial studies and aims at more advanced insights.
4. MASS SPECTRA - STRUCTURE
Mass spectra play an important role in the elucidation of the structure of organic compounds. However, the relationships between structure and mass spectral data are too complex to be expressed explicitly. In this situation resort is taken to deriving features from both the structure and the mass spectrum and to trying to find relationships between such features, i.e., between substructures and a variety of spectral features (Figure 5). Deriving features from structures and mass spectra might lead to an information reduction such that the original information cannot be reconstructed. This is certainly the case with mass spectral features. In the case of substructures, a powerful structure generator using additional constraints may be able to construct the correct full structure.
Figure 5: Transforming the task of mass spectra - structure correlation into one between substructures and spectral features.

One of the most critical points in the endeavor to find relationships between a mass spectrum and the structure of a compound is the fact that a mass spectrum embodies not only one structure, but a host of structures: the ionized molecular structure and a series of ions derived from it by fragmentation and rearrangement reactions. In the first section we will report on work that used neural networks for expressing the relationships between mass spectral features and one or several substructures. In the second section we will address this one-to-many relationship (one mass spectrum - many ions) explicitly and report on work using neural networks for modelling mass spectral fragmentation reactions.

4.1. Relationships between features
A large-scale study of the relationships between mass spectral data and chemical structure was made by B. Curry and D. E. Rumelhart [9]. Mass spectra were represented by 493 spectral features, including the logarithm of the intensity of the peaks between m/z 40-219, the logarithm of the intensity of neutral losses between 0-179, the modulo-14 series, etc. Thus, the input to the network consisted of 493 real values (scaled between 0 and 1). The structure of a compound was coded by the presence or absence of 36 substructures such as a benzenoid ring, a carbon-carbon double bond, various carbonyl groups like ketone or ester, a C-halogen bond, etc. The output of the network was a series of 36 zeros and ones.

A multi-layer neural network with error-backpropagation learning was used in this study. One hidden layer with 80 neurons was found necessary to learn the training set. Thus, a rather massive network with (including the bias) 493 x (80 + 1) + (80 + 1) x 36 = 42,849 weights was used (Figure 6).
Figure 6: Architecture of the network (493 input units, 80 hidden neurons, 36 output neurons; 42,849 weights) used to study mass spectra - structure relationships.

In order to determine such a large number of weights and to achieve results that not only reproduce the training set but also offer predictive ability, one needs a correspondingly large number of data for training. The authors [9] did comply with this requirement and split a dataset of more than 44,000 mass spectra into a training set of 31,926 mass spectra and a test set of 12,671 mass spectra. Clearly, working with such a large dataset for training requires substantial computational resources.
Thus, in order to achieve learning of the training data, i.e., that on input of the 493 features of a mass spectrum the correct substructures of this compound are output, the entire dataset had to be presented 50 times to the backpropagation algorithm (50 epochs). This required two weeks of computation time on a SUN 4 or an HP 9000/370 workstation. Large as these computation times are, they are not that serious, because they have to be spent only once, in the training phase. Once a neural network has been trained, predictions can be made very rapidly. In this case, the prediction of the substructures in a compound by MSnet is obtained within 0.5 s (on a workstation) after input of the mass spectrum.

The quality of the prediction was measured by a reliability index, obtained from the number of class members correctly asserted, I_c, and the number of compounds falsely accused of being class members, I_f, by eq. 4.

reliability = I_c / (I_c + I_f)   (4)
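In code, eq. 4 is a simple ratio of counts. The following helper is a hypothetical illustration of ours (not part of MSnet) that evaluates the reliability index for one substructure class from binary predictions and true class memberships:

def reliability(predicted, actual):
    """Reliability index of eq. 4: correctly asserted class members I_c
    divided by all compounds asserted to be class members (I_c + I_f)."""
    i_c = sum(1 for p, a in zip(predicted, actual) if p and a)       # true positives
    i_f = sum(1 for p, a in zip(predicted, actual) if p and not a)   # false positives
    return i_c / (i_c + i_f) if (i_c + i_f) else 0.0

# Example with made-up predictions for one substructure class
print(reliability([1, 1, 0, 1, 0], [1, 0, 0, 1, 1]))   # -> 0.666...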
The reliability index is within 94-100% for 16 substructures. This high value compares favorably with the results obtained by expert systems. In fact, the reliability indices obtained with MSnet are slightly higher than those obtained with STIRS, one of the more successful expert systems for mass spectra [10]. High as these reliability indices are, they do hide some problems. Compounds that are only rarely contained in the database are difficult to identify. For example, only 33 phthalates are contained in the training set of about 32,000 mass spectra. Phthalates have a highly characteristic peak at m/z 149. However, the majority of compounds in the entire dataset with a peak at m/z 149 are not phthalates.
This has the consequence that phthalates are never recognized. As a solution to this problem a hierarchy of neural networks was proposed. The top, main network as just described assigns a structure to one or several of the 36 functional groups on the basis of its mass spectrum. For the substructures, specialized subnets are developed. Only four such subnets, for benzenoid rings, for O-C=O, for methyl groups and for the ether group, have been reported. Some details of the O-C=O subnet are given here. Input to such a subnet are again the 493 mass spectral features, together with units that only serve to identify class membership. The O-C=O subnet contains one layer of 36 hidden neurons and 22 output neurons for 22 different O-C=O subclasses (acids, esters, phthalates, etc.). The O-C=O subnet is now able to correctly identify 92% (30) of the phthalates with 0.5% false positives, i.e. 0.5% of the compounds in the entire dataset are incorrectly identified as phthalates. From the information given in the paper one can conclude that these 0.5% must be 33 compounds. Thus, looking at the numbers in a different manner: of those 63 compounds identified as phthalates by the O-C=O subnet, only 30 (47.6%) are correct. Interesting and promising as the concept of a hierarchy of neural networks for representing the relationships between mass spectral data and structure is, it still leaves much to be desired and done.

Specialized neural networks for the classification of compounds have been developed by H. Lohninger [11]. Separate networks were obtained for the classification of polycyclic aromatic hydrocarbons, for steroids, and for barbiturates. The work on steroids is further elaborated here. Eight mass spectral features were
selected and input into a network with an architecture of eight input units, 16 neurons in the hidden layer, and one output neuron indicating the presence (1) or absence (0) of a steroid. A training set of 100 steroids and 188 non-steroids was used to determine the weights by the error-backpropagation algorithm. This network could correctly assign 94.4% of a test set of 794 steroids and 88.6% of 906 non-steroids. Such classifiers open interesting perspectives for using specialized neural networks as detectors in the analysis of mixtures. Comparing the work of ref. [11] with that of Curry and Rumelhart [9], it is surprising that so few mass spectral features (eight compared with 493) were sufficient for achieving such a highly distinctive classifier. This stresses the point that one of the most important tasks in any neural network application is the selection of the input and output representation. We will be concerned with this point throughout this paper.
4.2. Mass spectral processes
We have already discussed that one of the reasons that makes the relationship between mass spectra and structure so complicated is the fact that one mass spectrum embodies the imprints of many structures: primary ions and sequences of ions derived from them (see Figure 7). To address this problem explicitly we have developed the program system FRANZ (FRagmentation and Rearrangement ANalyZer), which generates a detailed fragmentation scheme of individual ionization, fragmentation, and rearrangement steps, given a mass spectrum and the associated structure (see Figure 8) [12].
Figure 7: The relationship between a mass spectrum and structure is a one-to-many (ions) relationship.
In addition, probabilities for the occurrence of the individual reaction steps are derived from the peak intensities. A discussion of this system is beyond the scope of this paper; details of the system and its applications have already been published [12, 13]. Figure 9 shows the fragmentation scheme of 5-hexen-3-one as obtained with FRANZ. In Figure 10 the experimental mass spectrum of 5-hexen-3-one is compared with that part of the mass spectrum that can be explained by the fragmentation scheme of Figure 9; 88.7% of the intensity can thus be accounted for. The derivation of the details of the fragmentation of an organic compound in the mass spectrometer by FRANZ opens new perspectives in our understanding of mass spectral processes.
Figure 8: The generation of an explicit fragmentation scheme from a mass spectrum and the associated structure.

Figure 9: Fragmentation scheme obtained with FRANZ for 5-hexen-3-one.

Above and beyond that, FRANZ has a second, equally important application. By processing a series of mass spectrum/structure pairs, a series of fragmentation schemes can be generated by FRANZ. Then, all instances of a certain reaction type, e.g., all α-cleavages, can be collected in a file (Figure 11). These individual examples of a fundamental mass spectral reaction type can then be investigated by statistical or neural network methods.

Although FRANZ provides transition probabilities for the instances of a given reaction type, initially only a dataset of α-cleavages simply classified as occurring or not observed was investigated. The dataset contained 144 sites in 70 different molecules comprising alcohols, ethers, thiols, thioethers, amines, and alkyl chlorides, bromides and iodides that could potentially undergo an α-cleavage. 69 of these α-cleavages were observed; 75 of them were classified as not occurring.
Figure 10: Experimental mass spectrum of 5-hexen-3-one and that part that can be explained by the fragmentation scheme of Figure 9.

Figure 11: Collecting instances of general reaction types by processing a series of mass spectrum/structure pairs.
Figure 12: General scheme of an α-cleavage.

Figure 12 shows the general scheme of an α-cleavage. A series of electronic and energy effects, including charge distribution, inductive, resonance and polarizability effects, as well as bond dissociation energies, were calculated for the atoms and bonds of the reaction sites by previously published empirical procedures [14]. After elimination of parameters with no variance and of those that are linearly dependent, 35 parameters remained.
Taking all these parameters as input values to a neural network was considered too many: with 35 input units one would surely have ended up with hundreds of weights, too high a number in comparison to the 144 data available. Therefore, efforts were made to reduce the number of parameters further by a process called reconstruction learning [15]. In effect, reconstruction learning searches for the input units and neurons that have only small weights when backpropagation learning is performed on a multi-layer neural network. These input units and neurons are therefore considered rather insignificant. The weights are further reduced until those input units and neurons can be eliminated. This process of weight reduction and input and neuron elimination is iteratively repeated until no further elimination of neurons appears possible without substantial loss in learning ability. We thus ended up with 13 physicochemical parameters for the reaction site of an α-cleavage [16]. These parameters included effective polarizabilities, charge and electronegativity values at the various atoms of the reaction site in the educt and product of an α-cleavage, as well as a value for the bond dissociation energy of bond 2-3 in the educt (see Figure 12).

With 13 parameters, equivalent to 13 input units, a multi-layer network of reasonable size and number of weights seemed feasible. In fact, a network with 13 input units, three hidden neurons and one output unit was able to learn correctly all 62 data selected for training from the entire dataset of 144 α-cleavages. The remaining 82 data were used for testing the net, and all except one were correctly predicted by this network. Thus, a neural network could be trained that is able to classify whether a reaction site in an organic molecule containing an alcohol, ether, thiol, thioether, amine or alkyl halide function will undergo an α-cleavage or not.
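The spirit of this parameter reduction can be conveyed by a simplified sketch: repeatedly retrain a network and discard the input unit whose outgoing weights have the smallest magnitude. This is our own stand-in and differs in detail from the reconstruction learning procedure of ref. [15]; the training function is assumed to behave like the two-layer sketch of section 2.1, with the last weight column belonging to the bias input:

import numpy as np

def prune_inputs(X, T, train_fn, min_inputs=13):
    """Iteratively eliminate the input unit with the smallest outgoing weights,
    retraining after each elimination (simplified stand-in for reconstruction
    learning [15])."""
    keep = list(range(X.shape[1]))                          # indices of surviving input units
    while len(keep) > min_inputs:
        W1, _ = train_fn(X[:, keep], T)                     # e.g. the train() sketch of section 2.1
        importance = np.abs(W1[:, :len(keep)]).sum(axis=0)  # summed weight magnitude per input
        del keep[int(np.argmin(importance))]                # drop the least significant input
    return keep

# Hypothetical usage: keep = prune_inputs(descriptors, reactivity, train, min_inputs=13)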
After this success in classification, a backpropagation network will next be trained for modelling: reaction probabilities for the various general reaction types observed in the mass spectrometer, as obtained by FRANZ, will be submitted to neural network training. These neural networks will then be used in the MASSIMO system (MASS spectra SIMulatOr) for simulating mass spectra [12].

Some words have to be said about the selection of training and test sets. A training set should cover the information space as widely and evenly as possible. Usually, experimental design techniques are used for the task of choosing a dataset that has points distributed over the entire multi-dimensional information space. However, with a 13-dimensional space as obtained in the study of the α-cleavage (see above), experimental design techniques would require a prohibitively large number of data points, many more than were actually available. In this situation we have developed the use of a Kohonen network as an alternative to experimental design techniques.

In effect, a Kohonen network is used to project the data from the 13-dimensional space containing the parameters for the electronic and energy influences on the α-cleavage onto a two-dimensional plane consisting of 9 x 9 = 81 neurons (Figure 13). The results of this mapping are shown in Figure 14. Of the 81 neurons, 51 are activated by α-cleavages. Many of these neurons contain several different α-cleavages. However, quite a few of these multiply activated neurons contain either only feasible α-cleavages (in black) or only non-observed α-cleavages (in gray). Apparently, these different α-cleavages contain similar information, and it suffices to take only one α-cleavage from those of a single neuron.
Figure 13: Kohonen network for the mapping of 13-dimensional information into a plane.

Figure 14: Kohonen map of α-cleavages projected from the 13-dimensional space. Black: observed α-cleavages. Gray: non-occurring α-cleavages. White: not occupied by an α-cleavage. Cross: both observed and non-occurring α-cleavages.
In several cases, conflicts arise in a neuron, as both observed and non-occurring α-cleavages end up in the same neuron (indicated by crosses). In those cases, both types of α-cleavages, reactive and non-reactive, were added to the training set. In this manner, we ended up with 62 α-cleavages in the training set that contained at least one α-cleavage from each activated neuron and thus covered the information space as broadly as possible.
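This selection scheme can be expressed in a few lines. The sketch below is our own illustration: it maps every reaction site into a trained Kohonen network (such as the train_kohonen sketch of section 2.2, an assumption of this example) and keeps one example per activated neuron and class, so that conflict neurons contribute one observed and one non-observed α-cleavage:

import numpy as np

def select_training_set(data, labels, weights):
    """Pick training examples from a trained Kohonen map: one example per
    activated neuron; for conflict neurons, one example of each class."""
    chosen = {}
    for idx, x in enumerate(data):
        d2 = ((weights - x) ** 2).sum(axis=2)
        cell = np.unravel_index(np.argmin(d2), d2.shape)   # activated neuron for this example
        key = (cell, labels[idx])                          # neuron plus class (observed or not)
        chosen.setdefault(key, idx)                        # keep the first hit per neuron and class
    return sorted(chosen.values())

# Example: 13-dimensional descriptors and binary reactivity labels (made-up data)
rng = np.random.default_rng(2)
data = rng.random((144, 13))
labels = rng.integers(0, 2, 144)
# weights = train_kohonen(data)                            # trained 9 x 9 map, section 2.2 sketch
# training_indices = select_training_set(data, labels, weights)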
5. IR SPECTRA - STRUCTURE CORRELATIONS
5.1. Spectrum intervals vs. substructures
M. Munk and coworkers [17] tried to derive the presence or absence of substructures from the IR spectrum of any organic compound. The IR spectrum in the range of 400 cm-1 to 3960 cm-1 was divided into 256 intervals and each interval was assigned to an input unit. If a peak occurred in such an interval, then the intensity of the peak (between 0.0 and 1.0) was given to the input unit. 36 functional groups were selected to represent the structure of a compound and, thus, 36 output neurons were needed (see Figure 15). A two-layer neural network with 34 neurons in the hidden layer was chosen. Altogether this amounts to 256 x (34 + 1) + (34 + 1) x 36 = 10,220 weights. 2,499 IR spectra were taken for training the network by the backpropagation algorithm, and 416 for testing the results. It must be realized that the number of data (IR spectra - structure pairs) is less than 25% of the number of weights. This violates the rule that the number of training data should be about equal to the number of weights and thus carries the danger that the results may be only of local validity and may not have much predictive power.
Figure 15: Input and output representation (256 spectral intervals, 36 substructures) for the IR spectrum - structure correlation [17].

The results obtained were also compared with those of a previous study that used only a single-layer neural network [18]. In all cases, the two-layer network gave better answers. As a typical result, the values obtained for primary alcohols are given (Figure 16). The dataset contained 265 primary alcohols. A cut-off was set at the median of the distribution curve, which is at 0.86 in this case. At this value, half of the primary alcohols (132) were correctly identified, with 34 compounds falsely classified as primary alcohols. Such data were used to calculate so-called A50 values, the accuracy at 50% information retrieval, by eq. 5.

A50 = 132 / (132 + 34) = 79.5%   (5)

Of the 36 functional groups, 21 had A50 values higher than 90%, 9 had values between 75-90%, 5 were in the range 50-75%, and only one was lower than 50%.
Although a value of 80% might seem quite acceptable, it means that at the chosen cut-off value only half of the considered substructures are correctly identified and that 20% of the identified compounds are falsely perceived to have this substructure. Basically, as Figure 16 indicates, such a neural network can be used to determine with reasonable reliability the absence or the presence of a functional group by using one cut-off value at low output values and one cut-off value at rather high output values. However, there is a large number of compounds for which no reliable decision can be made (say, when the output values lie between 0.15 and 0.90). Such information can successfully be used in an expert system for structure elucidation that draws its information from different spectroscopic methods. The additional merit of this work lies in the fact that such a simple encoding scheme for IR spectra and structure can already give reasonable results. However, in order to obtain results that go beyond those just presented, deeper representations of IR spectra and of structure are needed. This is the theme of the next two sections.

Figure 16: Distribution of output values at the output neuron for primary alcohols (curves for -CH2OH present and -CH2OH not present).

5.2. IR spectra representation

In recent work, M. Novic and J. Zupan [19] used a Hadamard transformation (a transformation by box waves) to represent IR spectra. 512 points were taken along the frequency range of an IR spectrum, submitted to a Hadamard transformation, and truncated to the first 64 coefficients. Figure 17 compares an original spectrum with the one reconstructed from the first 64 Hadamard coefficients. As can be seen, the information loss is quite acceptable.

Figure 17: Original IR spectrum and the spectrum reconstructed from the first 64 Hadamard coefficients.

These 64 coefficients were input into an 11 by 11 Kohonen network (Figure 18). In effect, an IR spectrum is thus stored in a single neuron. After training of this network with 150 IR spectra, the neurons were analysed as to which IR spectra they contained.
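The Hadamard compression described above can be sketched with a Hadamard matrix. This is our own illustration using a synthetic spectrum, and it assumes that the first 64 coefficients in the standard (Sylvester) ordering are the ones retained:

import numpy as np
from scipy.linalg import hadamard

def compress_spectrum(spectrum, n_keep=64):
    """Hadamard-transform a 512-point IR spectrum, keep the first n_keep
    coefficients, and reconstruct the spectrum from them."""
    n = len(spectrum)                       # must be a power of two, here 512
    H = hadamard(n)                         # matrix of +1/-1 box waves
    coeffs = H @ spectrum / n               # forward Hadamard transformation
    truncated = np.zeros(n)
    truncated[:n_keep] = coeffs[:n_keep]    # retain only the first 64 coefficients
    reconstructed = H @ truncated           # inverse transform (H is symmetric)
    return coeffs[:n_keep], reconstructed

# Example with a synthetic "spectrum" of 512 points (two Gaussian bands)
x = np.linspace(0, 1, 512)
spectrum = np.exp(-((x - 0.3) / 0.02) ** 2) + 0.5 * np.exp(-((x - 0.7) / 0.05) ** 2)
coeffs, approx = compress_spectrum(spectrum)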
Figure 18: Kohonen network (11 x 11 neurons) for storing the 64 Hadamard coefficients of an IR spectrum.

When the structures of the IR spectra in the various neurons were analysed for their functional groups, it was found that IR spectra containing the same functional group ended up in the same or in adjacent neurons.

Figure 19: Distribution of functional groups in the Kohonen network trained by IR spectra.

In Figure 19 the functional groups are identified by letters, A for acids, E for esters, K for ketones, O for ethers, etc. As indicated, the IR spectra of acids, esters, and ethers, respectively, group in specific areas of the Kohonen network. Thus, the Kohonen network was able to extract (by unsupervised learning) from the Hadamard coefficients of the IR spectra those features that are typical for the various functional groups.

Closer inspection of the Kohonen network showed that it can be even further discriminating. The clustering of the IR spectra into neurons according to their functional groups can even be traced to individual weights of the Kohonen network. Thus, the clustering of the acids is largely brought about by the weights in the sixth layer of the Kohonen net, as indicated in Figure 20. There, the lines of equal weights are drawn, with the maximum labelled by the letter A. The sixth layer corresponds to the frequency region 3110-3030 cm-1.

Figure 20: Distribution of weights in the sixth layer of the Kohonen network.

The identification of the ethers is mainly made by the weights in the 47th layer of the Kohonen network, as shown in Figure 21, which gives the lines of equal weights and indicates the maximum with the letter O. This layer corresponds to the frequency region
1132-1100 cm-1. Thus, by proper representation of the IR spectra, a Kohonen network is able to cluster them according to similar spectral properties that reflect the presence of specific structural fragments.

Figure 21: Distribution of weights in the 47th layer of the Kohonen network.

5.3. Structure representation

The identification of many functional groups, e.g., of carbonyl groups, by IR spectral data is a fairly established procedure. This is largely due to the fact that such functional groups show characteristic valence vibrations. The identification of skeletons from IR spectral data presents a more formidable problem. Attempts to solve this problem have been made by including skeletons with various substitution patterns in the list of substructures. Expert systems have been developed that work with 229 [20] or nearly 700 [21] substructures. However, this is an open-ended endeavor because the list of substituted skeletons is, in principle, infinite.

We believe that a proper representation of the structure of a molecule has to take account of the features that are responsible for the occurrence of bands in the IR spectrum: the change in dipole moments on vibrations of various parts of the molecule. What is therefore needed is a proper account of the three-dimensional structure of a molecule and its electron distribution. As an inroad to such a more elaborate representation of chemical structure for correlating it with IR spectral data we have taken the following steps (see the sketch after this list):

1. Input the structural formula of a molecule.
2. Generate a 3D model of the molecule with the automatic model builder CORINA [22].
3. Calculate the charge distribution by the empirical PEOE method [23, 24].
4. Calculate the electrostatic potential on the van der Waals surface.
5. Use a Kohonen network to project this electrostatic potential into a plane.
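A rough sketch of steps 4 and 5 is given below. It is our own illustration: it assumes that the 3D coordinates, van der Waals radii and PEOE partial charges of steps 1-3 are already available (here replaced by made-up numbers), and it uses a simple point-charge Coulomb sum for the electrostatic potential:

import numpy as np

def surface_potential(coords, radii, charges, n_per_atom=200, seed=0):
    """Sample points on the van der Waals surface and evaluate a point-charge
    electrostatic potential there; coordinates and charges are assumed given."""
    rng = np.random.default_rng(seed)
    points, potentials = [], []
    for c, r in zip(coords, radii):
        v = rng.normal(size=(n_per_atom, 3))
        v /= np.linalg.norm(v, axis=1, keepdims=True)
        cand = c + r * v                                            # points on this atom's sphere
        # keep only points not buried inside any other atom (molecular surface)
        d = np.linalg.norm(cand[:, None, :] - coords[None, :, :], axis=2)
        outside = (d >= radii[None, :] - 1e-6).all(axis=1)
        for p in cand[outside]:
            dist = np.linalg.norm(coords - p, axis=1)
            potentials.append(np.sum(charges / dist))               # Coulomb sum
            points.append(p)
    return np.array(points), np.array(potentials)

# Illustrative input: a made-up three-atom fragment (coordinates in Angstrom)
coords  = np.array([[0.0, 0.0, 0.0], [1.2, 0.0, 0.0], [-1.2, 0.0, 0.0]])
radii   = np.array([1.7, 1.5, 1.5])
charges = np.array([0.2, -0.1, -0.1])
pts, pot = surface_potential(coords, radii, charges)
# The surface points (and their potential) can then be projected into a plane,
# e.g. with the train_kohonen sketch of section 2.2:
# weights = train_kohonen(pts)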
Figure 22 shows the Kohonen maps obtained for substituted benzene derivatives with the substituent combinations dichloro, chloro-nitro, and dinitro in all nine possible arrangements. Clearly, these images in grey-shading are only a poor reflection of the color maps we are actually working with. These Kohonen maps allow one to identify the substituents on the benzene system. Different as the electronic natures of the two substituents are, common features of the ortho-, meta-, and para-substituted derivatives can nevertheless clearly be distinguished. We hope that we can derive from such Kohonen maps descriptors that can be used to represent skeletons in general, and ortho-, meta-, and para-substituted benzene derivatives in particular, for correlating them with IR spectral data.
Figure 22: Kohonen maps of the electrostatic potential of dichloro-, chloro-nitro-, and dinitrobenzene derivatives. Top row: ortho-compounds; middle row: meta-compounds; bottom row: para-derivatives.
6. CONCLUSION

Neural networks offer interesting perspectives for the processing of chemical information. Their ability to incorporate relationships in an implicit manner makes them particularly attractive for the study of correlations that are too complex to be expressed in an explicit way. This is particularly true for the relationships between the structure of a molecule and its spectral data. Success critically hinges on the kind of representation chosen for the encoding of the information. In addition, different neural network models offer characteristic advantages.

ACKNOWLEDGEMENTS

We thank the Bundesminister für Forschung und Technologie (BMFT), Bonn, and the Ministry of Science and Technology (MZT) of Slovenia for financial support of this work. K. Weidenblcher's help in preparing the figures is appreciated.
References

[1] W. S. McCulloch, W. Pitts, Bull. Math. Biophysics 5 (1943) 115-133; ibid. 9 (1947) 127-147.
[2] D. O. Hebb, The Organization of Behavior, Wiley, New York, 1949.
[3] J. J. Hopfield, Proc. Natl. Acad. Sci. 79 (1982) 2554-2558.
[4] D. E. Rumelhart, G. E. Hinton, R. J. Williams, in: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D. E. Rumelhart, J. L. McClelland, Eds., Vol. 1, MIT Press, Cambridge, 1986, pp. 318-362.
[5] J. Zupan, J. Gasteiger, Anal. Chim. Acta 248 (1991) 1-30.
[6] J. Gasteiger, J. Zupan, Angew. Chem., submitted.
[7] T. Kohonen, Biol. Cybernetics 43 (1982) 59.
[8] T. Kohonen, Self-Organization and Associative Memory, Springer, Berlin, 1988.
[9] B. Curry, D. E. Rumelhart, Tetrahedron Comput. Methodol. 3 (1990) 213-237.
[10] H. E. Dayringer, G. M. Pesyna, R. Venkataraghavan, F. W. McLafferty, Org. Mass Spectrom. 11 (1976) 529-542.
[11] H. Lohninger, in: Software-Development in Chemistry 5, J. Gmehling, Ed., Springer, Berlin, 1991.
[12] J. Gasteiger, W. Hanebeck, K.-P. Schulz, J. Chem. Inf. Comput. Sci. 32 (1992) 264-271.
[13] S. Bauerschmidt, W. Hanebeck, K.-P. Schulz, J. Gasteiger, Anal. Chim. Acta 265 (1992) 169-181.
[14] J. Gasteiger, M. Marsili, M. G. Hutchings, H. Saller, P. Löw, P. Röse, K. Rafeiner, J. Chem. Inf. Comput. Sci. 30 (1990) 467-476.
[15] T. Aoyama, H. Ichikawa, Chem. Pharm. Bull. 39 (1991) 1211-1228.
[16] V. Simon, J. Gasteiger, unpublished results.
[17] M. Munk, M. S. Madison, E. W. Robb, Mikrochim. Acta 1991 II, 505-514.
[18] M. E. Munk, E. W. Robb, Mikrochim. Acta 1990, 131.
[19] M. Novič, J. Zupan, Vestn. Slov. Kem. Drust. (1992), in print.
[20] H. Huixiao, X. Xinquan, J. Chem. Inf. Comput. Sci. 30 (1990) 203-210.
[21] J. E. Dubois, G. Mathieu, P. Peguet, A. Panaye, J. P. Doucet, J. Chem. Inf. Comput. Sci. 30 (1990) 290-302.
[22] J. Gasteiger, C. Rudolph, J. Sadowski, Tetrahedron Comput. Methodol. (1992), in print.
[23] J. Gasteiger, M. Marsili, Tetrahedron 36 (1980) 3219-3228.
[24] J. Gasteiger, H. Saller, Angew. Chem. 97 (1985) 699-701; Angew. Chem. Int. Ed. Engl. 24 (1985) 687-689.