Ultramicroscopy 9 (1982) 3-8
North-Holland Publishing Company
0304-3991/82/0000-0000/$02.75 © 1982 North-Holland

METHODS FOR STUDYING THE DYNAMIC BEHAVIOR OF BIOLOGICAL MACROMOLECULES

Joachim FRANK
Center for Laboratories and Research, New York State Department of Health, Empire State Plaza, Albany, New York 12201, USA

Received 23 March 1982

Multivariate statistical analysis of molecule projections opens up new possibilities in the study of the dynamical behavior of molecular structure.

1. Introduction
Natura non facit saltus: “nature doesn’t make jumps”. We refer to the part of nature that is seen in the electron microscope: the projection of a biological macromolecule and its dynamic behavior due to the process of observation, or due to other controlled experimental factors. While there is no doubt that jumps do occur in the atomic realm, the general experience with structural observations at the stain-limited level of resolution (15-20 Å) is that of a gradual rather than abrupt change in the appearance of molecule images.
At a time when many electron microscopic groups were struggling to overcome the hurdles posed by radiation damage and contrasting media in the study of a static molecule (e.g. refs. [1,2]), Hoppe envisioned a program of research, trace structure analysis [3,4], designed to study the dynamic behavior of molecules from a series of observations at constant time intervals. The ultimate goal of this study, the description of the changes of the three-dimensional structure, would require a phenomenal effort even for a single molecule, since it would involve the analysis of a series of projections in each time interval. I would like to take the liberty, on the occasion of Professor Hoppe’s 65th birthday, of making a similar leap into the future: the value and the significance of trace structure analysis, if it ever comes close to its goal, would be much enhanced if an entire set of molecules in its successive time realizations were subjected to multivariate data analysis.
2. Molecules and tombs
In a random walk model [5], the structural change due to radiation damage is seen as a gradual accumulation of small random changes. According to this model, the similarity between any two images of an exposure series decreases with the number of time intervals elapsed. This prediction conforms both with experimental observations [5] and with common sense. If we were to draw a graph to symbolize the changes in the molecule projection as a function of time, we would choose a linear arrangement (fig. 1), where decreasing similarity would be depicted as an increase in the distance between the realizations of the molecule at different stages.
[Figure 1: S(t), S(t + Δt), S(t + 2Δt), S(t + 3Δt) connected along a line]
Fig. 1. Graph symbolizing the continuous changes of a molecule image as a function of time.
What do an Egyptian tomb and a molecule projection have in common? It is the fact that culture, as well as nature, changes gradually: culture does not make jumps either. If the tomb in a given historical period is characterized by a set of archeologically significant observations (the occurrence of 70 types of artifacts [6]), we may see these in analogy to the measurements, on a sampling grid, of the optical density distribution representing a molecule image. It can be assumed that, although any particular artifact may have been in use only for a limited period, the staggered “phasing in” and “phasing out” of different artifacts at different times creates a continuous linear order among the sets of observations. To put it another way, if the incidence of N artifacts at a given time is described by an N-dimensional vector, then any two vectors separated by a small time interval appear highly correlated. Thus the linear order in time is revealed by a linear order in similarity [6].
The method of multivariate data analysis used to uncover this linear order was correspondence analysis [7,8]. The capability of this method to order “time-scrambled” data is very valuable in the archeological application to which we have been referring [6], but it may be equally useful in the ordering of molecule images, if one proceeds from the analysis of a single molecule and its changes to that of a large set of molecules.
In the next section, we will first attempt to give a rationale for the use of correspondence analysis in identifying a linear data structure in R^N associated with continuous changes of a molecule image. We then proceed to introduce the concept of structural ancestry and associated branched data structures.
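The ordering of time-scrambled incidence vectors described in this section can be illustrated with a minimal sketch. The data are hypothetical (12 tombs, 16 artifact types, each artifact in use for a window of five consecutive periods), not the tomb data of ref. [6], and the function correspondence_analysis is our own illustrative helper: a textbook formulation of correspondence analysis through the singular value decomposition of the standardized residuals, not a reproduction of the programs used in refs. [6-8].

```python
import numpy as np

def correspondence_analysis(table):
    """Return principal row coordinates (one column per factor) of a non-negative table."""
    P = table / table.sum()
    r = P.sum(axis=1)                        # row masses
    c = P.sum(axis=0)                        # column masses
    expected = np.outer(r, c)                # independence model
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sing, _ = np.linalg.svd(S, full_matrices=False)
    return (U * sing) / np.sqrt(r)[:, None]

# Hypothetical data: 12 tombs in time order, 16 artifact types; artifact j is
# present in tomb t when 0 <= j - t < 5 (staggered phasing in and phasing out).
n_tombs, n_artifacts, window = 12, 16, 5
lag = np.arange(n_artifacts)[None, :] - np.arange(n_tombs)[:, None]
incidence = ((lag >= 0) & (lag < window)).astype(float)

# Vectors close in time are highly correlated; distant ones are not.
corr = np.corrcoef(incidence)
print(corr[0, 1], corr[0, 3], corr[0, 5])

# Scramble the time order, then let the first factor re-order the tombs.
rng = np.random.default_rng(0)
perm = rng.permutation(n_tombs)              # perm[i] = true time index of row i
factor1 = correspondence_analysis(incidence[perm])[:, 0]
recovered = perm[np.argsort(factor1)]        # true indices sorted along factor 1
print(recovered)                             # time order, or its reverse
```

The sign of a factor is arbitrary, so the sequence is recovered only up to reversal; which end is "early" must come from outside information.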
3. Correspondence analysis of molecule images
Let a molecule image (showing a particular view of the molecule) be represented by an array of N optical density measurements on a regular sampling grid. If an N-dimensional set of measurements is depicted as a vector in N-dimensional Euclidean hyperspace, the similarity between two sets implies that the corresponding vector end points are in close proximity to one another. Provided that the changes of the molecule structure are continuous, the end points of vectors corresponding to different observation times would sketch out a “time line” that worms its way through the N-dimensional “event volume” that comprises all data configurations physically possible.
Since our natural space of representation is two-dimensional, a convenient visualization of the time line involves a two-dimensional projection of the N-dimensional hyperspace. The direction of projection should be chosen such that the variation of the projected line in the two coordinate directions is maximized. For such a choice, the linear arrangement of data points projected will be most easily recognized since the chance of encountering crossovers is smallest. Thus the requirements for optimum visualization of the “natural” linear order of molecule images are identical to those for the identification of classes among molecule images [9-11]: in general, the data cloud in R^N must be searched for directions in which the components of the total interimage variance are largest. This is accomplished by correspondence analysis, which is distinguished from other multivariate data analysis methods by scale invariance and certain symmetry properties [8].
In order to apply correspondence analysis directly to image data we must ensure that they are properly matched. For the analysis of the tomb data [6] we have the trivial requirement that the elements in the incidence vector describing the set of artifacts at a given time are ordered consistently; for instance, if element number 39 describes the presence or absence of a certain vase in Tomb Number One, it must have the same meaning in all other vectors. This stipulation, if applied to the analysis of a set of images, translates into the requirement that all images be aligned. We may describe alignment of the image set as a relative spatial arrangement that has the property of minimizing the total interimage variance [12]. The details of this procedure, which is normally done iteratively, may be found in refs. [13-15].
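The effect of alignment on the total interimage variance can be sketched as follows. This is a deliberately simplified illustration under our own assumptions: only integer, circular translations are considered, a single known reference is used instead of the iterative procedure of refs. [13-15], and the test image (two Gaussian blobs) and the function names are hypothetical. The shift is found at the peak of the cross-correlation function, computed via the Fourier transform.

```python
import numpy as np

def total_interimage_variance(images):
    """Sum over all pixels of the variance across the image set."""
    return float(np.var(images, axis=0).sum())

def align_to_reference(image, reference):
    """Circularly shift `image` to the peak of its cross-correlation with `reference`."""
    # Cross-correlation theorem: cc[k] = sum_n image[n] * reference[n + k]
    cc = np.fft.ifft2(np.fft.fft2(reference) * np.conj(np.fft.fft2(image))).real
    dy, dx = np.unravel_index(np.argmax(cc), cc.shape)
    return np.roll(image, shift=(int(dy), int(dx)), axis=(0, 1))

# Hypothetical "molecule image": two Gaussian blobs on a 32 x 32 grid.
yy, xx = np.mgrid[0:32, 0:32]
reference = (np.exp(-((yy - 12) ** 2 + (xx - 10) ** 2) / 18.0)
             + 0.5 * np.exp(-((yy - 20) ** 2 + (xx - 22) ** 2) / 8.0))

# Eight copies, each displaced by a random integer shift of up to 5 pixels.
rng = np.random.default_rng(1)
shifts = rng.integers(-5, 6, size=(8, 2))
images = np.array([np.roll(reference, shift=(int(sy), int(sx)), axis=(0, 1))
                   for sy, sx in shifts])

aligned = np.array([align_to_reference(im, reference) for im in images])
var_before = total_interimage_variance(images)
var_after = total_interimage_variance(aligned)
print(var_before, var_after)   # the variance collapses once the set is aligned
```

With real micrographs the images differ by noise and by genuine structural variation, so the variance is reduced by alignment but does not vanish; rotational alignment would be required as well.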
4. Multiple pathways of structural changes: Structural ancestry
If we had the task of following a single molecule through various stages of change as envisioned by Hoppe in the concept of trace structure analysis [3,4], we would know the time sequence beforehand: each micrograph would be labeled in the experiment, and the multivariate data analysis would tell us nothing about the sequence that we did not know before. However, if we were to take the concept of trace structure analysis seriously, we would have to account for the fact that different competing pathways of physical and chemical reactions may be possible, each realized with a different probability. If we allow branching for each reaction product, we may be faced with a large “family tree” that would require the study not of one but of many separate molecules by the elaborate analysis (three-dimensional reconstruction in each time interval) proposed by Hoppe [4]. The relationships of descent symbolized by the family tree may be called structural ancestry.
An example of such a tree structure is shown in fig. 2a. While any given molecule will follow only one of the three possible pathways (s-s2-s21, s-s2-s22, and s-s1), a sufficiently large number of molecules will trace the entire tree. By virtue of the fact that “closeness” in the tree symbolizing structural ancestry is mapped into closeness in the N-dimensional Euclidean hyperspace R^N, we see that for a sufficiently large number of molecules analyzed, the tree “exists” as a data structure in R^N: the points in R^N (each of which represents a molecule image) are distributed in a branched data cloud, whose branching topology is identical to that of the structural ancestry. Thus if correspondence analysis is applied to a data set formed by throwing together all molecule images recorded in a time series of micrographs, the factor maps obtained will actually visualize the tree depicted in fig. 2a (or any of its topological equivalents), provided that the changes induced by radiation damage (or other controlled experimental conditions) are predominant among the structural variations observed (fig. 2b).
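The statement that the tree “exists” as a data structure in R^N can be checked on synthetic data. The sketch below is an illustration under stated assumptions, not the analysis proposed in the text: each stage of change is represented by a single point in a 64-dimensional space, every branch advances along its own coordinate axis, a small amount of noise is added, and the connectivity is read off with a Euclidean minimum spanning tree (a tool substituted here for the visual inspection of factor maps). The branch names follow fig. 2a; all other names are hypothetical.

```python
import numpy as np

def minimum_spanning_tree_degrees(points):
    """Node degrees of the Euclidean minimum spanning tree (Prim's algorithm)."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best_dist = dist[0].copy()            # cheapest link from the tree to each point
    best_from = np.zeros(n, dtype=int)    # tree point that provides this link
    degree = np.zeros(n, dtype=int)
    for _ in range(n - 1):
        candidates = np.where(in_tree, np.inf, best_dist)
        new = int(np.argmin(candidates))
        degree[new] += 1
        degree[best_from[new]] += 1
        in_tree[new] = True
        closer = dist[new] < best_dist
        best_dist[closer] = dist[new][closer]
        best_from[closer] = new
    return degree

n_dims, steps = 64, 6
e = np.eye(n_dims)

def segment(start, direction):
    """`steps` successive stages leaving `start` along `direction` in unit increments."""
    return [start + (j + 1) * direction for j in range(steps)]

origin = np.zeros(n_dims)
s = segment(origin, e[0])        # common trunk s
s1 = segment(s[-1], e[1])        # pathway s-s1
s2 = segment(s[-1], e[2])        # shared part of the other two pathways
s21 = segment(s2[-1], e[3])      # pathway s-s2-s21
s22 = segment(s2[-1], e[4])      # pathway s-s2-s22
cloud = np.array([origin] + s + s1 + s2 + s21 + s22)

rng = np.random.default_rng(2)
cloud = cloud + 0.01 * rng.standard_normal(cloud.shape)   # small noise
cloud = cloud[rng.permutation(len(cloud))]                # time-scrambled

degree = minimum_spanning_tree_degrees(cloud)
n_ends = int((degree == 1).sum())     # free ends of the tree
n_nodes = int((degree == 3).sum())    # branch points
print(n_ends, n_nodes)
```

A tree with two branch points and four free ends has the topology of fig. 2a. With many noisy realizations per stage the spanning tree would acquire spurious side twigs, so this check only illustrates the noise-poor limit.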
The actual shape of the tree, and the ways in which its branches may be twisted (while always maintaining the same nodal structure), are of course dependent upon the “angle of view” in R^N, or the combination of factors chosen for the map display, bearing in mind that the factors form an orthonormalized system [8].
Fig. 2. (a) Structural ancestry: graph describing the dynamic behavior of the structure of a molecule. A linear, unbranched segment symbolizes a gradual change. A node point marks a change that may lead to two (or more) different structures with different probabilities. The diagram is probabilistic: only one of the three possible pathways will be realized by a given molecule. (b) Branched data cloud resulting from analyzing a large set of molecules whose structural ancestry graph is the one shown in (a). Each point represents a realization of the molecule (as imaged in the electron microscope) along the pathways of (a). The figure shows the expected appearance of the N-dimensional data cloud when projected into a coordinate system (“factor map”) spanned by two of the most important factors of the eigenvector expansion produced by correspondence analysis. The topology of the linear “skeleton” graph that represents the connectivity of the cloud segments (broken line) is expected to be identical to the topology of the graph in (a).
Since this result, so far only derived by reasoning, has a somewhat futuristic ring to it, we have attempted to demonstrate the capability of correspondence analysis to visualize the graph of structural ancestry by using a simple molecule model. The molecule was built from 10 spheres arranged in two planes (fig. 3). The structural changes were modeled by changing the (x, y) positions of two subunits independently. Twenty-one projections were obtained in different configurations linked together by structural ancestry (fig. 3a) according to the way in which the configurations were derived from one another. Correspondence analysis was applied to a data set consisting of the 21 projections. The factor maps (fig. 3b) reveal the structural ancestry underlying the modeling of the changes unambiguously.
Of course, a more realistic model would include the effects of noise, using multiple realizations of each stage of structural change. The result would be that the points depicting the images on the factor map would scatter around the ideal linear graph, limiting the ability of the analysis to distinguish closely neighboring branches. However, the aim of the model computation was to demonstrate that rather complex relationships among images are uncovered by the new method of analysis, and that the tracing of structural descent is possible in principle.
Fig. 3. Model computation to illustrate the use of correspondence analysis for tracing the structural ancestry of a molecule and its derivatives from a set of projections. The three-dimensional model molecule consists of ten subunits represented by spheres, arranged in two pentagons. The pentagons are in different z positions of the volume, with the distance in z-direction such that the spheres are just touching one another. Upper and lower pentagons are shifted by -2 and +2 units in x-direction, respectively. The structural modifications are obtained by gradually translating the subunits labeled A and B of the upper pentagon. The different branches of structural ancestry are obtained in the following way: s: move A only, from original position to position A1; s1: move A only, beyond position A1; s2: move A from position A1 to position A2 and move B from original position to position B1 in x-direction; s21: move A beyond A2 and move B beyond B1 in x-direction; s22: as s21, but add a y-component to the movement of B. (a) Graph of structural ancestry with model projections overlaid. The graph symbolically represents the way the modified configurations have been derived from the original structural model. (b) Map of factor 1 versus 3 obtained by correspondence analysis of the set of model projections shown in (a). (Note that the order in which the model projections enter the analysis has no effect on the outcome.) As predicted, the correspondence analysis reveals the relationships among the images. The graph obtained by connecting the image positions on the factor map is topologically equivalent to the graph of structural ancestry in (a). The map of factor 1 versus 2 shows the same structure as the 1 versus 3 map, but with some overlap of the branches.
Clearly, the same method could be used to reconstruct the genealogy of a family from a set of photographs (time-scrambled!) found in the family album, again relying on the assumption of continuity in the change of facial features from one generation to the next. (It should be noted, though, that the changes of scale, angle of view, and illumination, and the confusion created by hairstyles, beards, moustaches, glasses, etc. would produce almost insurmountable difficulties for such an analysis.)
Examples of linear, unbranched data structures relating to the variation of an experimental parameter have been previously observed: the strong variation in the staining of 40S ribosomal subunits [11,16] and the variation of the tilt angle of 30S ribosomal subunits [17]; we note that the latter use of correspondence analysis to sort out projections from a population of randomly oriented particles appears as a generalization of Hoppe’s idea of “correlation mapping” [18].
The result obtained by both the Gedankenexperiment and the model computation suggests the following method for the study of the dynamic behavior of macromolecules. Only a single view (corresponding to one of the stable positions of the molecule on the support film) of the molecule is analyzed. All molecules presenting this view in a series of micrographs taken at different stages of an experiment (a radiation damage or any other in situ experiment) are selected, aligned, and subjected to correspondence analysis.
A tree structure depicting all possible experimental realizations of the molecule (as seen in the selected projections) and their interrelationships will emerge in the factor maps if the structural differences produced by the controlled experimental conditions are significantly larger than those produced by other, uncontrolled conditions (e.g., variations of the angle of view and staining). Once the tree is visualized, the molecule projection at any stage (i.e., corresponding to any segment or node of the tree) can be obtained with high significance, utilizing the enhancement achievable by averaging. Three-dimensional reconstruction could be done at any selected point along the tree, the advantage here being that tilt series need be taken only where necessary as judged by the structural ancestry diagram.
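The enhancement achievable by averaging the images that fall on one segment of the tree can be made concrete with a short numerical sketch. It rests on assumptions of our own: the images are perfectly aligned, all show the same underlying projection (a hypothetical Gaussian blob), and the noise is additive, Gaussian, and independent from image to image. Under these assumptions the residual noise of the average of M images is expected to fall as 1/sqrt(M).

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical noise-free projection belonging to one stage of the tree.
yy, xx = np.mgrid[0:32, 0:32]
projection = np.exp(-((yy - 16) ** 2 + (xx - 16) ** 2) / 32.0)

# M noisy, aligned realizations of that stage (additive Gaussian noise).
sigma, n_images = 1.0, 100
noisy = projection + sigma * rng.standard_normal((n_images, 32, 32))
average = noisy.mean(axis=0)

# Residual noise of a single image versus that of the average.
residual_single = float(np.std(noisy[0] - projection))
residual_average = float(np.std(average - projection))
print(residual_single, residual_average, sigma / np.sqrt(n_images))
```

In practice the number of images available for a given segment or node limits the gain, and images taken from neighboring stages blur the average to the extent that the structure has changed between them.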
5. Conclusions
It has been demonstrated that, for a gradually changing molecule, the similarities among molecule projections imaged at different times allow the entire history of structural change (including any branching in the process of changing) to be traced by correspondence analysis. Thus the alignment and correspondence analysis of a large set of molecule projections (all relating to the same view) showing many “snapshots” of different molecules at different times appears to be a way to organize the data collection for dynamic studies, and at the same time to obtain pertinent information about the progress of structural change.
It is appropriate that, at the end of this leap into the future, the reader is guided back to earth. In the course of our discussion, a number of idealizing assumptions have been made, the most important of which is the postulated continuity of structural changes. This precludes the analysis of molecules that break up into two or more reaction products. We also must be aware of resolution limitations existing in practical electron microscopy, which probably confine the method of analysis to the quaternary level of molecule structure.
Acknowledgments
I would like to thank Adriana Verschoor for carrying out the model computations.
References
[1] M. Beer, J. Frank, K.-J. Hanszen, E. Kellenberger and R.C. Williams, Quart. Rev. Biophys. 7 (1975) 211.
[2] P.N.T. Unwin and R. Henderson, J. Mol. Biol. 94 (1975) 425.
[3] W. Hoppe, Naturwissenschaften 61 (1974) 239.
[4] W. Hoppe, Z. Naturforsch. 30a (1975) 1188.
[5] M. Eckert, Thesis, Technical University Munich (1976).
[6] M.O. Hill, Appl. Statistics 23 (1974) 340.
[7] J.P. Benzecri, in: Methodologies of Pattern Recognition, Ed. S. Watanabe (Academic Press, New York, 1969) p. 35.
[8] L. Lebart, A. Morineau and N. Tabard, Techniques de la Description Statistique (Dunod, Paris, 1977).
[9] M. van Heel and J. Frank, in: Pattern Recognition in Practice, Eds. E.S. Gelsema and L.N. Kanal (North-Holland, Amsterdam, 1980) p. 235.
[10] M. van Heel and J. Frank, Ultramicroscopy 6 (1981) 187.
[11] J. Frank, A. Verschoor, M. Boublik, J.S. Wall and J. Hainfeld, in: Proc. 10th Intern. Congr. on Electron Microscopy, Hamburg, 1982, submitted.
[12] M. van Heel, J.-P. Bretaudiere and J. Frank, in: Proc. 10th Intern. Congr. on Electron Microscopy, Hamburg, 1982, submitted.
[13] J. Frank, W. Goldfarb, D. Eisenberg and T.S. Baker, Ultramicroscopy 3 (1978) 283.
[14] M. Kessel, J. Frank and W. Goldfarb, J. Supramol. Struct. 14 (1980) 405.
[15] J. Frank, The Role of Correlation Techniques in Computer Image Processing, in: Computer Processing of Electron Images, Ed. P.W. Hawkes (Springer, Berlin, 1980) p. 187.
[16] J. Frank, A. Verschoor and M. Boublik, Multivariate Statistical Analysis of Ribosome Electron Micrographs: L and R Lateral Views of the 40S Subunit from HeLa Cells, submitted.
[17] M. van Heel and M. Stöffler-Meilicke, in: Proc. 10th Intern. Congr. on Electron Microscopy, Hamburg, 1982, submitted.
[18] W. Hoppe, in: Unconventional Electron Microscopy for Molecular Structure Determination, Eds. W. Hoppe and R. Mason (Vieweg, Braunschweig, 1979) p. 191.