Neurocomputing 3 (1991) 247-257 Elsevier
Backpropagation neural network analysis of circular dichroism spectra of globular proteins

Martin Blazek (a), Petr Pancoska (b) and Timothy A. Keiderling (c)

(a) Institute of Computer Science, Czechoslovak Academy of Sciences, Prague, Czechoslovakia
(b) Department of Chemical Physics, Faculty of Mathematics and Physics, Charles University, Prague, Czechoslovakia
(c) Department of Chemistry, University of Illinois at Chicago, USA

Received 17 May 1991
Revised 10 June 1991
Abstract

Blazek, M., P. Pancoska and T.A. Keiderling, Backpropagation neural network analysis of circular dichroism spectra of globular proteins, Neurocomputing 3 (1991) 247-257

The possible applications of multilayer neural networks and the backpropagation algorithm to qualitative analysis and quantitative evaluation of circular dichroism spectra of proteins are discussed. The analysis consisted of three parts: (i) the results of x-ray crystallographic analyses of proteins were sorted out using both symmetric 5-n-5 neural networks (n = 1, 2, 3, 4) and 'classical' cluster analysis. (ii) The x-ray results were used as the learning information in the second step: the network was trained to recognize them from the experimental spectral intensities forming the input layer of the network. (iii) The reverse step (i.e. reconstruction of the spectrum from the structural information) was the final task. The developed combination of the networks forms a unique, internally consistent tool for the biomolecular spectra analysis.
Keywords. Backpropagation; neural network analysis.
1. Introduction
The ability of neural networks (NN) to discover hidden relationships by supervised input-output mapping of the evaluated information makes this calculation method a promising tool for empirical analyses of experimental data in chemistry, chemical and molecular physics, biophysics etc. These scientific disciplines gather the relevant (and often redundant) information mostly from a combination of physical, chemical and biochemical methods (such as spectroscopy, crystallography, sequencing and measurements of biological activity). Very often, especially in the case of large biomolecules, the theoretical models for quantitative extraction of the sought information from the experiment are either absent, or the computational effort necessary for their use is beyond current possibilities. As an example we can mention the lack of any theory for the relation between the sequence of amino acids in the
0925-2312/91/$03.50 © 1991 -Elsevier Science Publishers B.V. All rights reserved
protein and the spatial arrangement of its polypeptide chain in the native state, or the effectively 'infinite' computer requirements of quantum chemical calculations of the spectra of a protein molecule consisting of several thousand atoms. Interestingly, an application of neural networks to the first problem was published recently [1]. In the spectroscopy of biomolecules, pattern recognition in the NMR spectra of sugars [2] can serve as an example of a successful NN application.

In this paper we present the first results of neural network analysis applied to circular dichroism (CD) spectra of proteins. The CD spectrum $\Delta\varepsilon(\nu)$ is the plot of the difference in the absorption of left- and right-circularly polarized light within the absorption bands of an optically active sample, $\Delta\varepsilon(\nu) = \varepsilon_L(\nu) - \varepsilon_R(\nu)$. These measurements are routinely available in the visible/ultraviolet spectral region, i.e. in the electronic absorption bands [3]. Instrumentation has recently been developed at UIC and other laboratories that makes CD measurements reasonably routine over much of the infrared spectral region as well, i.e. in vibrational absorption bands [4]. Both techniques can be viewed as having a first-order dependence on molecular geometry. Their conformational sensitivity arises from the interaction of the probing polarized light with the three-dimensional distribution of charge density on the atoms of a (chiral) protein molecule.

The variability of the CD spectra can be rationalized by the fact that any part of a protein polypeptide chain can be folded into only one of five standard (regular) geometrical arrangements: α-helical, β-sheet, bend, turn and 'other' conformations [5]. The diversity of the geometrical arrangements of the peptide chain atoms in these conformations is the basis of the differences in their CD manifestation. The common features of a given regular conformation in various proteins are, on the other hand, the reason for the overall qualitative stability of the
'standard' shapes of CD spectra related to α-helical, β-sheet, bend and possibly turn segments of protein secondary structure. The fractions (FC) of amino acids forming these distinct segments of protein secondary structure (calculated relative to the total number of amino acids in the polypeptide chain) are widely used in biochemical and biotechnical applications as useful descriptors of the sample conformation. These fractions also represent the 'concentrations' of the individual secondary structures and, as such, they largely determine the contributions of the standard CD spectra to the overall CD bandshape measured for a given protein. The reverse of this assumption forms the basis for the quantitative exploitation of CD spectra measurements in terms of $FC_\alpha$, $FC_\beta$, $FC_{turn}$, $FC_{bend}$ and $FC_{other}$ ($\sum_i FC_i = 1$).

The information necessary for the calibration of any method of quantitative spectra analysis is provided by the independent determination of protein secondary structure in crystals by x-ray diffraction measurements. All classical methods of CD spectra analysis use the following assumptions [6]: (i) retention of the protein crystal structure in solution; (ii) simple additivity of contributions from secondary structures (no contributions from their interactions are taken into account); (iii) no contributions from chromophores other than the amide group (such as those of aromatic side chains, disulfides etc.); (iv) geometric variability of the regular secondary structures is neglected. None of these assumptions is, of course, fully valid for any real protein CD spectrum, irrespective of the frequency region studied, and their violation generates the problems with which a successful method for estimating the secondary structure of proteins must cope. We hope that the generalization capabilities of neural networks can overcome these problems: the network is trained to extract from the CD spectra just that part of the information which is directly related to the relevant values $FC_i$, without interference from other factors.
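Assumption (ii) above can be made concrete with a short sketch. The Python example below is illustrative only and is not taken from any of the classical methods cited: it assumes a hypothetical matrix `basis` whose columns stand in for the 'standard' spectra of the five conformations (random numbers here, not measured spectra), builds a spectrum as their FC-weighted sum, and recovers the fractions by ordinary least squares. The function name `fit_fractions` and the clipping and renormalisation step are choices made for this sketch.

```python
import numpy as np

def fit_fractions(spectrum, basis):
    """Least-squares estimate of fractions f in spectrum ~ basis @ f.

    basis: (n_points, n_structures) array, one hypothetical reference
    spectrum per column. Negative estimates are clipped to zero and the
    result is rescaled so that the fractions sum to 1.
    """
    f, *_ = np.linalg.lstsq(basis, spectrum, rcond=None)
    f = np.clip(f, 0.0, None)
    return f / f.sum()

# Synthetic stand-in data: five made-up basis spectra sampled at 80 points.
rng = np.random.default_rng(0)
basis = rng.normal(size=(80, 5))
true_fc = np.array([0.40, 0.25, 0.10, 0.10, 0.15])

# Simple additivity: the spectrum is the FC-weighted sum of the basis spectra.
spectrum = basis @ true_fc
estimated_fc = fit_fractions(spectrum, basis)
```

With noise-free synthetic data the fractions are recovered essentially exactly; the difficulties listed in (i) to (iv) correspond to a real spectrum not being an exact weighted sum of fixed basis spectra.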
2. Experiments
Symmetrical neural networks of the 5-n-5 type (n = 1, 2, 3 and 4) were used to gain insight into the 4-dimensional space of the FC 5-vectors of the 62 proteins from [7]. One iteration step of the training process consisted of 62 single backpropagation adaptive cycles, during which the output layer of the network was set to be identical to the input layer. Both the synaptic weights and the activities of all neurons were recorded and used for further analysis. The activities of the hidden layer neurons were used as coordinates characterising the positions of the individual proteins in a one-, two- or three-dimensional 'contracted' projected space in order to visualize the grouping of the proteins studied.

Alternatively, the clusters of proteins characterised by 5-component $FC_i$ vectors were sought using the single linkage, complete linkage, group average, incremental sum of squares, centroid, median and Lance-Williams flexible cluster analysis algorithms [11]. The results were transformed into two-dimensional tree graphs (dendrograms), in which the lengths of the branches connecting the clusters are given by the relative dissimilarity indexes (RDI). (The most 'distant' objects in FC-space have RDI = 1; for identical objects RDI = 0.) Except for the single linkage method, the clustering provided by all algorithms was stable up to the RDI level < 0.25. We therefore present here only the results of the Lance-Williams flexible algorithm as representative examples.

The electronic and vibrational CD spectra were measured for thirteen proteins representing the different folding types identified by the above methods. The protocols of the sample
preparations and measurements are described elsewhere [8-10]. VCD spectra in the amide I' region (1750-1550 cm$^{-1}$) were recalculated to $\Delta A/A$ values ($A$ being the peak absorbance of the IR absorption band at about 1650 cm$^{-1}$). UVCD spectra (260-180 nm) were transformed into molar $\Delta\varepsilon$ values on a per-amide basis. For every sample of the training set, 100 VCD intensities (step 2 cm$^{-1}$) and 80 UVCD intensities (step 1 nm) were fed into the 180-neuron input layer (every neuron in this layer representing the corresponding frequency or wavelength) of a 180-80-5 backpropagation network. The network was trained to recognize the ($FC_\alpha$, $FC_\beta$, $FC_b$, $FC_t$, $FC_p$) vectors, whose components are the fractions of α-helix, β-structure, bends, turns and other secondary structures, respectively, as determined for the corresponding protein by x-ray crystallography.

Alternatively, the UVCD and VCD spectra of the proteins were analysed by the principal component (PC) method of factor analysis, followed by multiple regression of the PC factor loadings on the x-ray determined $FC_i$ values of the calibration samples [9, 10].

To translate the numerical $FC_i$ descriptors back to the spectroscopic (analog) form, a 5-80-180 backpropagation neural network was trained to recognize the combined calibrating UVCD and VCD spectra (output layer) from the five input x-ray $FC_i$ values of a given protein. The CD spectra were proportionally normalized to the (-1, 1) interval for this purpose.
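The shape of the 180-80-5 network described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the sigmoid activation, squared-error criterion, learning rate, weight initialisation and the class name `ThreeLayerNet` are all choices made here, and the "spectra" and "fractions" are random stand-ins, not measured data.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ThreeLayerNet:
    """Fully connected input-hidden-output network with sigmoid units."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, size=(n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(0.0, 0.1, size=(n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def forward(self, x):
        h = sigmoid(x @ self.w1 + self.b1)   # hidden-layer activities
        y = sigmoid(h @ self.w2 + self.b2)   # output-layer activities
        return h, y

    def train_pattern(self, x, target, rate=0.1):
        """One backpropagation cycle for a single pattern (squared error)."""
        h, y = self.forward(x)
        delta_out = (y - target) * y * (1.0 - y)
        delta_hid = (delta_out @ self.w2.T) * h * (1.0 - h)
        self.w2 -= rate * np.outer(h, delta_out)
        self.b2 -= rate * delta_out
        self.w1 -= rate * np.outer(x, delta_hid)
        self.b1 -= rate * delta_hid
        return 0.5 * np.sum((y - target) ** 2)

# Synthetic stand-in data (NOT measured spectra): 19 samples, each with
# 100 "VCD" and 80 "UVCD" intensities already scaled into (-1, 1).
rng = np.random.default_rng(1)
vcd = rng.uniform(-1.0, 1.0, size=(19, 100))
uvcd = rng.uniform(-1.0, 1.0, size=(19, 80))
spectra = np.hstack([vcd, uvcd])                 # shape (19, 180)
fractions = rng.dirichlet(np.ones(5), size=19)   # each row sums to 1

net = ThreeLayerNet(180, 80, 5)
epoch_errors = []
for epoch in range(100):
    # One iteration step = one backpropagation cycle per training sample.
    errors = [net.train_pattern(x, t) for x, t in zip(spectra, fractions)]
    epoch_errors.append(float(np.mean(errors)))
```

The same class with shape (5, n, 5) and the target set equal to the input would give a symmetric network of the kind described at the start of this section, with the first value returned by `forward` providing the hidden-layer coordinates; swapping inputs and targets with shape (5, 80, 180) would correspond to the reverse network.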
3. Results and discussion
3.1 Analysis of x-ray information about the secondary structure of proteins

The selection of the calibration set of protein molecules, the CD spectra of which will be used to train the network evaluating the information
about the secondary structure, is the primary problem, although it is more chemical than computational. The reference set of samples should be sufficiently large and should cover all the different structural types of the proteins studied. We therefore first analysed the extensive information obtained from x-ray crystallography, as published in the Dictionary of protein secondary structures [7]. Little is known about the structure of the FC data and the underlying physical factors determining the actual protein folding. The 'topology' of the protein FC-space (necessary for the required selection of representative protein samples) can therefore be defined by similarity evaluation. In this approach, the 'galaxies' of proteins closely related by their $FC_i$ values are sought first, and representative samples are used for CD spectra measurements. At the same time, these clusters of similar proteins represent the reference points for further orientation in the FC-space. Cluster analysis is the classical tool for such a study. We have found that neural networks of the 5-n-5 type, which are forced to 'pass' the input information (FC-vectors) through the activities of the hidden layer neurons, are useful for gaining insight into the factors generating the observed similarity patterns. The discriminative power of the neural network compression in combination with classical cluster analysis is seen in Fig. 1, where the individual proteins are projected into the two-dimensional plane using as coordinates the activities of two of the three hidden layer neurons of a 5-3-5 network, and these projections are connected according to the results of cluster analysis. Seven groups of different protein folding types were characterised in this way, and the training set of 19 proteins for the calibration of the further quantitative analyses of CD spectra was selected to cover this variability. The mapping of FC 5-vectors in the 5-n-5 networks with n < 5 cannot be homomorphic.
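The cluster analysis step referred to above can be illustrated with the Lance-Williams recurrence, in which the dissimilarity between a cluster k and a newly merged cluster (i, j) is updated as d(k, ij) = a·d(k, i) + a·d(k, j) + b·d(i, j), with a = (1 - b)/2 for the flexible variant. The sketch below is a plain Python illustration under assumptions that are not given in the text: Euclidean starting distances, b = -0.25 (a commonly used value), and random 5-vectors standing in for FC data. The function name `flexible_clustering` is hypothetical.

```python
import numpy as np

def flexible_clustering(points, beta=-0.25):
    """Agglomerative clustering with the Lance-Williams flexible update.

    Returns a list of merges (members_a, members_b, dissimilarity),
    in the order in which the clusters were joined.
    """
    n = len(points)
    alpha = (1.0 - beta) / 2.0
    # Pairwise Euclidean distances as the starting dissimilarities.
    diff = points[:, None, :] - points[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=2))
    clusters = {i: [i] for i in range(n)}
    # Keys are (smaller id, larger id).
    dist = {(i, j): d[i, j] for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        (i, j), dij = min(dist.items(), key=lambda kv: kv[1])
        updated = {}
        for k in clusters:
            if k in (i, j):
                continue
            dki = dist[(min(k, i), max(k, i))]
            dkj = dist[(min(k, j), max(k, j))]
            # Lance-Williams flexible update (gamma = 0).
            updated[(k, next_id)] = alpha * dki + alpha * dkj + beta * dij
        merges.append((clusters[i], clusters[j], dij))
        dist = {key: v for key, v in dist.items()
                if i not in key and j not in key}
        dist.update(updated)
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return merges

# Random 5-vectors with unit sum as stand-ins for FC vectors.
rng = np.random.default_rng(0)
fc = rng.dirichlet(np.ones(5), size=8)
merges = flexible_clustering(fc)
```

Each entry of `merges` corresponds to one branch point of a dendrogram; how the merge levels would be rescaled to a relative dissimilarity index is not specified in the text and is left out of the sketch.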
Fig. 1. Projections of 62 proteins from [7] using the activities of the first two neurons of the hidden layer of a 5-3-5 network. The proteins are grouped together in clusters from K1 to K9 according to the cluster analysis results of the corresponding FC-values. [Plot not reproduced; both axes run from about 0.1 to 0.6.]

The topology of the network forces the input variables into some relations when passing through the hidden layer. The analysis of the synaptic weights documents this process. The flow of the input information through the 5-n-5 networks with n = 2, 3 and 4 is shown in Fig. 2. The weights connecting individual neurons in these networks are represented there on a relative scale by the areas of squares. From the analysis of these weights we can conclude that it is mostly $FC_\alpha$ which determines the results in the 5-1-5 network. With two neurons in the hidden layer, $FC_\alpha$ and $FC_\beta$ are dominant for the input-output projection, with a small contribution from $FC_p$. For n = 3, more complicated relations are incorporated into the projection, but the above FC dominate again. We see the reason for this in the range of variability of the corresponding FC values: $FC_\alpha$ varies from 0 to 0.9, $FC_\beta$ from 0 to 0.5, and $FC_p$ from 0.08 to 0.6. The minimal network configuration which is able to project the input information correctly is 5-3-5. This observation supports the idea of relations between the FC of some protein secondary structures. The rank of the input matrix should be lower than 4. Principal component analysis of this matrix of all 62 FC 5-vectors gave the following eigenvalues and cumulative percentages of the input variance: $\lambda_1 = 2.42$ (48.3%), $\lambda_2 = 1.11$ (70.5%), $\lambda_3 = 0.94$ (89.3%), $\lambda_4 = 0.54$ (100%). The varimax rotation of the PC factor
analysis results yielded the following simple structure of factor loadings:
         α       β       b       t       p
f1     -0.8     1.0     0.10    0.0     0.1
f2     -0.5     0.0     0.2     0.1     1.0
f3     -0.4     0.1     1.0     0.0     0.2
f4     -0.2     0.0     0.0     1.0     0.1
$FC_\alpha$ and $FC_\beta$ clearly form the rotated factor 1; the second factor is connected with the p-fractions, the third with the bend content and the fourth with the turn content. This result is fully consistent with the performance of the projecting neural networks discussed above. We believe that, at least for the analysed selection of proteins, the neural network approach clearly revealed the relationship between the $FC_\alpha$ and $FC_\beta$ fractions, which is neither accidental nor a product of the mathematical data analysis techniques used.

The level of generalization of the network employed can also be reflected in the internal relations of the neuron potentials and in their connection to the input data. The following nonlinear correlations were found by regression of the hidden layer neuron activities $p_i$ on the input and output layer content:

$$p_1 = \exp(-1.78 - 0.4\,FC_\alpha), \qquad FC_\beta = \exp(-0.29 - 18.86\,p_1)$$
$$p_2 = \exp(-2.33 + 0.01\,FC_\alpha), \qquad FC_b = \exp(-0.23 - 15.21\,p_2)$$
$$p_3 = \exp(-1.88 - 0.04\,FC_\alpha), \qquad FC_t = \exp(-2.24 + 2.25\,p_3)$$
$$FC_p = \exp(-0.24 - 7.29\,p_1)$$

The practical value of these equations lies in the fact that they allow an approximate calculation of the FC of all other secondary structure types when $FC_\alpha$ is known. The following mean differences between the x-ray data and the results of the above equations were calculated for the 62 proteins of the analysed set: $\Delta FC_\beta = 2.7\%$, $\Delta FC_b = 4.0\%$, $\Delta FC_t = 3.5\%$ and $\Delta FC_p = 5.8\%$.
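The cumulative percentages quoted with the four principal component eigenvalues earlier in this section can be checked by simple arithmetic. The short Python sketch below uses only the eigenvalues as printed (2.42, 1.11, 0.94, 0.54); the variable names are chosen here for illustration.

```python
# Eigenvalues as printed in the text.
eigenvalues = [2.42, 1.11, 0.94, 0.54]

total = sum(eigenvalues)
running = 0.0
cumulative = []
for ev in eigenvalues:
    running += ev
    # Cumulative share of the total variance, in percent.
    cumulative.append(100.0 * running / total)

for ev, c in zip(eigenvalues, cumulative):
    print(f"{ev:.2f}  {c:.1f}%")
```

The computed values are 48.3, 70.5, 89.2 and 100.0%, which agree with the quoted figures to within the rounding of the printed eigenvalues (the third is quoted as 89.3%). The first three components thus account for roughly 89% of the variance, in line with the observation that three hidden neurons are needed.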
3.2 Quantitative analysis of CD spectra

In this modification of the NN application we take advantage of the fact that the 180-80-5 network allows us to reduce the CD spectra directly to the fractional concentrations of the individual secondary structures forming the protein conformation. The generalization capability of the network is used to bypass any feature selection procedure (such as those developed in our recent publications [10, 11]). The spectra are thus analysed independently of any prescribed algorithm. The training set consisted of the combined UVCD and VCD spectra of 19 proteins with known x-ray structure. The average error achieved after $n \cdot 10^6$ iterations was $n \cdot 10^{-3}$.

In testing the NN performance in the analysis, we adopted a procedure similar to that used for testing other methods of reducing CD spectra to structural information. In this approach, several NN calculations were performed with the training set reduced by one protein. The spectra of the omitted sample were then analysed by the NN, and the resulting fractional concentrations of secondary structure were compared with the known ones. This procedure resulted in the following average errors between the calculated and true secondary structure concentrations: α-helix 15%, β-structure 12%, bends 7%, turns 5% and other conformations 10%. These values can be compared with the results of factor analysis of the same set of VCD spectra, where the average errors were α-helix 9%, β-structure 7%, bends 5% and other structures 6%, respectively. Identical treatment of the UVCD spectra yielded errors of 8, 10, 4 and 3%. For the partial least-squares analysis of the VCD (UVCD) spectra, the errors for α-helix, β-structure, bends, turns and others were 8(7), 14(23), 5(10), 10(10) and 5(4)%, respectively. Dousseau and Pezolet recently published a similar least-squares and partial least-squares analysis of infrared absorption spectra of 15 proteins, with errors ranging from 5-10% for α-helix, 4-12% for β-structure, 6-10% for turns and 4-11% for other conformations. In its present state of development, the NN approach is thus comparable in its quantitative results with the other methods of protein spectra analysis. Further work on the optimization of the NN procedure is in progress.

The unique capability of the NN, which is not achievable by any other method, is the possibility in principle of generating CD spectra from FC values. In Fig. 3 we present the series of VCD and UVCD spectra generated by the 5-80-180 network from the fractional concentrations of the 62 proteins published in [7]. The calculated spectra are grouped together in families determined by the cluster analysis of their secondary structure composition, as described in the previous section. The capability of the NN to generate physically acceptable results follows from the similarities of the generated spectral bandshapes within the groups of
structurally similar proteins. This is the most exciting potential of the neural network implementation in this field. Once the training set of protein CD spectra has been supplemented with more samples, we expect that the differences between the experimental and the network-calculated curves will be analysable in terms of physical factors such as solvent or concentration effects on the protein structure in solution as compared with the crystal state. These topics are the subject of further research.

Fig. 3. The CD curves for clusters from K1 to K9. [Plots not reproduced: for each cluster K1 to K9 there is one VCD panel plotted against wavenumber ν (cm$^{-1}$) and one UVCD panel plotted against wavelength λ (nm).]

Acknowledgement

We gratefully acknowledge partial support of this research through a grant (GM30147) from the National Institutes of Health.

References
[1] N. Qian and T.J. Sejnowski, J. Mol. Biol. 202 (1988) 865-884.
[2] J.U. Thomsen, B. Meyer, J. Magnet. Resonance 84 (1989) 212-217.
[3] P. Pancoska, in: Experimental Methods of Biophysics (V. Prosser, ed.), chap. 6 (Academia, Prague, 1989).
[4] T.A. Keiderling, in: Practical Fourier Transform Infrared Spectroscopy. Industrial and Laboratory Chemical Analyses (J. Ferraro, K. Krishnan, eds.) (Academic Press, San Diego, CA, 1990) 203-284.
[5] G.A. Walton, in: Polypeptides and Protein Structure (Academic Press, New York, 1981).
[6] M. Manning, J. Pharmaceut. Biomed. Anal. 7 (1989) 1103-1119.
[7] W. Kabsch, C. Sander, Biopolymers 22 (1983) 2577-2637.
[8] M.A. Sharaf, D.L. Illman, B.R. Kowalski, in: Chemometrics (John Wiley, New York, 1986).
[9] P. Pancoska, S.C. Yasui, T.A. Keiderling, Biochemistry 28 (1989) 5917-5923.
[10] P. Pancoska, S.C. Yasui, T.A. Keiderling, Biochemistry, in press.
[11] P. Pancoska, T.A. Keiderling, Biochemistry, in press.