Journal of
Structural Biology Journal of Structural Biology 146 (2004) 368–380 www.elsevier.com/locate/yjsbi
The Kohonen self-organizing map: a tool for the clustering and alignment of single particles imaged using random conical tilt L.M. Zampighi,a,1 C.L. Kavanau,b,1 and G.A. Zampighia,b,* a b
Department of Physiology, David Geffen UCLA School of Medicine, Los Angeles CA, USA Department of Neurobiology, David Geffen UCLA School of Medicine, Los Angeles CA, USA Received 5 November 2003, and in revised form 15 January 2004
Abstract An important step in determining the three-dimensional structure of single macromolecules is to bring common features in the images into register through alignment and classification. Here, we took advantage of the striking computational properties of the Kohonen self-organizing map (SOM) to align and classify images of channels obtained by random conical geometry into more homogeneous subsets. First, we used simulations with artificially created images to deduce simple geometrical rules governing the mapping of bounded (differing in size and shape) and unbounded (differing in in-plane orientation) variations in the output plane. Second, we measured the effect of noise on the accuracy of the algorithm to separate homogeneous subsets. Finally, we applied the rules ascertained in the previous steps to separate freeze-fracture images of the cytoplasmic and external domains of the small (118 kDa) aquaporin-0 water channel. Comparison with the results obtained from a similar input set using alignment-throughclassification [J. Mol. Biol. 325 (2003) 201] showed that both methods converged to stable classes exhibiting the same overall shapes (tetragonal and octagonal) for the cytoplasmic and external views of the channel. Processing with the SOM, however, was simplified by the utilization of the geometric rules governing the mapping of bounded and unbounded variations as well as the lack of subjectivity in selecting the reference images during alignment. Ó 2004 Elsevier Inc. All rights reserved. Keywords: Unsupervised networks; Self-organizing maps; Alignment/classification; Freeze-fracture electron microscopy; Random conical tilt; AQP0
1. Introduction The need to study biological macromolecules in conditions closer to their native cellular environment has encouraged the development of methods for the analysis of single molecules. These methods calculate structures by combining projections along different viewing directions and thus avoid the need for 2D and 3D crystals required for electron and X-ray diffraction crystallography. In principle, single particle methods should provide information at the atomic level. In practice, however, a number of factors cloud the higher resolution features with signals unrelated to the structure of the macromolecule. This paper describes the use of the remarkable properties of the Kohonen self-orga*
Corresponding author. Fax: 1-310-825-2024. E-mail address:
[email protected] (G.A. Zampighi). 1 These authors contributed equally to this work. 1047-8477/$ - see front matter Ó 2004 Elsevier Inc. All rights reserved. doi:10.1016/j.jsb.2004.01.008
nizing map to process images of single particles gathered with the electron microscope. An important step in image processing is to bring common features in the images into register through an alignment process that requires homogeneous input sets (Frank, 1996). Yet, the input images gathered with the electron microscope are heterogeneous due to specimen deformation and different molecular conformations. The analysis of heterogeneous sets is dealt with by clustering similar images into subsets called ‘‘classes.’’ If the input set is aligned, the classes are homogeneous subsets of the original set. If the input set is not aligned but accurate references are available, homogeneous classes are easily obtained by a ‘‘reference-based’’ alignment procedure. If the input set is not aligned and accurate references are not available, the classes are obtained by alignment-through-classification. This method involves an iterative process of successive rounds of alignment and classification where each round
L.M. Zampighi et al. / Journal of Structural Biology 146 (2004) 368–380
generates better references, which lead to better alignment and so on. Alignment-through-classification is considered a general idea rather than a rigid procedure and often produces unpredictable results, which make the comparison of models calculated from different operators (and on occasion the same operator) difficult. Therefore, there is a need for alternative methods that can independently align and classify complex input sets. We explored the self-organizing map (the ‘‘SOM’’ or the ‘‘network’’) developed by Teuvo Kohonen almost two decades ago (Kohonen, 1984). The SOM is one of a multitude of algorithms known collectively as artificial neuronal networks, which were developed both to facilitate the understanding of brain functions and to take advantage of their unique computational properties. Kohonen describes the SOM as a clustering method that systematically maps high dimensional data onto a low dimensional grid (the ‘‘output plane’’), which lends itself to visualization and interpretation (Kohonen, 2001). Marabini and Carazo (1994) first introduced the SOM to processing images of single particles and later extended it to clustering tomograms (Pascual-Montano et al., 2002). While their work has unequivocally demonstrated that images of biological macromolecules are systematically mapped in the output plane, no simple rules to guide the interpretation of these mappings have been deduced. Their work has not resulted in 3D models calculated solely through SOM processing that can be directly compared to those calculated with established methods, such as alignment-through-classification. Here, we used the SOM to process artificially created images of increasing complexity to determine whether the principal variations in the images follow rules that can be applied to the processing of ‘‘real’’ images obtained using the electron microscope. By processing images of sticks, we determined that the end-points of a bounded variation (differing in length) were mapped at diametrically opposite corners of the output plane. In contrast, images of an unbounded variation (differing in in-plane rotation) were mapped in closed perimeters. By processing projections calculated from the atomic models of the aquaporin-1 and glycerol facilitator channels (Fu et al., 2000; Sui et al., 2001), we determined the effect of defined amounts of noise in the mapping of the two types of variations. We applied the rules deduced from the simulations to the clustering of images of the water channel aquaporin-0 (AQP0), a small tetramer of 6.6 nm side and 118-kDa molecular weight. Random insertion in phospholipid bilayers resulted in images of the channel viewed from the external and cytoplasmic domains (‘‘preferential orientations’’), which have been difficult to separate into homogeneous classes using the alignment-through-classification method (Zampighi et al., 2003). Both methods separated the cytoplasmic and external domains of the AQP0 into subsets exhibiting
369
octagonal and tetragonal shapes with the same dimensions. Processing with the SOM was simplified by the elimination of the subjectivity in selecting the images used as references during alignment. Also, the 3D models of the subsets obtained through processing with the SOM exhibited lower Fourier Shell Correlation values than those reconstructed from classes obtained by alignment-through-classification.
2. Materials and methods With the exception of processing using the SOM, all methods have already been described in detail in previous publications (Konig et al., 1997; Turk et al., 2000; Zampighi et al., 2003). 2.1. Preparation of the sample AQP0 was purified from plasma membrane fractions isolated from lens fiber cells of calf lenses following procedures already published (Konig et al., 1997). Reconstitution in phospholipid vesicles was done by the bcyclodextrin method described previously (Turk et al., 2000). 2.2. Preparation of the replicas Aliquots (1–5 ll) of the reconstituted liposomes were spread on glass surfaces, sandwiched between a glass and copper specimen carrier (BalTec), and frozen by immersion in liquid propane, as described previously (Zampighi et al., 1989). Knocking the copper hat on top of the glass surface with a liquid nitrogen-cooled knife at )145 °C and 1 107 mbar in a JOEL RFD 9019CR apparatus fractured the specimens. The cleaved surfaces were shadowed first with a thin layer of carbon to eliminate ‘‘decoration’’ and then with platinum and carbon in all three axes simultaneously (Zampighi et al., 2003). The replicas were floated in 10% hydrofluoric acid solution, transferred to distilled water, and deposited on holes 1–2 lm in diameter (‘‘holey’’ grids). The region of the replica spanning the holes was used to collect images. 2.3. Electron microscopy The intra-membrane particles representing the water channel were imaged first tilted at 50° and then un-tilted (random conical tilt, RCT; Radermacher, 1988; Radermacher et al., 1987). Both un-tilted and tilted images were collected at 60,000 magnification in a Philips CM12 equipped with a CCD camera (1024 1024 pixels). The digital micrographs had a pixel size of 0.31 nm on the molecular scale. The un-tilted images were recorded at 600 nm under-focus. The tilted images
370
L.M. Zampighi et al. / Journal of Structural Biology 146 (2004) 368–380
exhibited a gradient of focus that extended from perfect focus to 200–300 nm under-focus. Therefore, the images did not require correction for the contrast transfer function (CTF). The images were recorded under low electron dose conditions by focusing in regions away from the area of interest. While metal replicas are essentially insensitive to radiation (Muller et al., 1985), organic material left attached during cleaning was a source of contamination. The location of the tilt axis was determined by the deformation of holes in the substrate and by analysis of the tilted data set. Astigmatism, specimen drift, and the level of under-focus were assessed from Fourier transforms calculated during imaging. 2.4. Computer simulations (a) Processing of simple images. Single-pixel image (a black and a white square) simulations were run using the neural network program KMAP developed in-house in Qbasic for PC. Processing of single-pixel images explained the selection of the ‘‘winning image’’ and the ‘‘update’’ of the output plane (Appendices A and B). Multi-pixel images of sticks were created using a program also developed in-house. The program allowed the creation of sticks varying in length (10, 20, 30, 40, 50, and 60 pixels) and in in-plane rotation (0–177° in 3° increments). Simulations using data sets comprised of sticks were run using the Xmipp software package (Marabini et al., 1996). Processing the images of sticks determined how the network mapped the bounded (differing in length) and unbounded (differing in in-plane rotation) variations in the output plane. (b) Processing of model projections and effect of noise. We used the Situs software package (Wriggers et al., 1999) to create electron microscopy densities of atomic models of aquaporin-1 (Sui et al., 2001) and glycerol facilitator (Fu et al., 2000) channels at 1 nm resolution. The models were halved into cytoplasmic and external domains and the four resulting models were projected along the a (0–90°) and b (0–10°) angles. These projections (450 for each model) were expected to simulate the variations present in the images of AQP0 collected using the RCT geometry. Different levels of noise were added to the projections of the cytoplasmic and external domains of the channels using the ‘‘proc2d’’ program in the EMAN software package (Ludtke et al., 1999). The program adds a flat band of Gaussian noise with a defined number of SD (1– 7). The resulting data sets were processed using the Xmipp package (Marabini et al., 1996). 2.5. Processing of images of AQP0 Particle selection was done using ‘‘boxer,’’ a program in the EMAN software package (Ludtke et al., 1999), by
displaying a pair of images on a LINUX workstation with dual monitors. This produced two galleries of 3366 images, one un-tilted and one tilted, centered inside 64pixel side boxes. The un-tilted particles were aligned by translation, and C4 symmetry was imposed using the Imagic-5 software package (van Heel et al., 1996). The selection of C4 symmetry was based on the study of the principal Eigen images of a similar input set performed in a previous study (Zampighi et al., 2003). The un-tilted images were then processed with the software package Xmipp (Marabini et al., 1996). The output from the Xmipp package (or any SOM) is a mathematical representation of the most significant features of the images (the ‘‘code vectors’’). Instead of visualizing the ‘‘code vectors,’’ it was more convenient to display the average images mapped into the different locations (bins) of the output plane (note: if no images were mapped into a particular bin then the average image is blank). We used variable numbers of iterations (i.e., presentations to the network) and visualization on output grids of different dimensions. We obtained good results by setting the number of iterations equal to 20 the number of images (67,320) and mapping on a 15 15 grid. The data set was normalized in Xmipp to eliminate the variations in intensity. We used a linearly decreasing learning coefficient (Lo ) with a value of 0.1, and a Gaussian distribution radius update of 15. The critical updating process depended on the ‘‘learning rate function,’’ L, and the ‘‘neighborhood radius.’’ L established the upper boundary for the amount of change (i.e., the learning) permitted to the output images. The original value, Lo , called the ‘‘learning coefficient,’’ was used in the initial iteration. The remaining iterations used a monotonically decreasing sequence for the learning rate function that allowed for fine-tuning of the output images in the grid (Appendix B). We used the ‘‘Gaussian distribution rule’’ to update the output images. As the distance increased between the ‘‘winning image’’ and the grid image being updated, the magnitude of the updating decreased. The neighborhood radius was used to control the standard deviation of the Gaussian distribution curve: the smaller the value, the narrower the width of the corresponding bell curves; the smaller the neighborhood radius, the weaker the updates performed at any given distance from the ‘‘winning image.’’ 2.6. Three-dimensional reconstruction of images mapped at opposite corners To perform 3D reconstruction, one needs the three Euler angles, a, b, and c. a (the azimuth) is the angle at which the input image was rotated in the plane and was determined using the SOM. b is the tilt angle of the plane and was determined experimentally (50°). Gamma, the tilt angle about the z-axis of the coordinate
L.M. Zampighi et al. / Journal of Structural Biology 146 (2004) 368–380
system, was 0° because the channels were inserted in the phospholipid bilayer (Radermacher, 1988; Radermacher et al., 1987). After determining the Euler angles, the tilted images were rotated to bring the direction of the tilt axes along the y-axis, and the 3D reconstruction was calculated using the weighted back projection algorithm (Radermacher, 1988; Radermacher et al., 1987) and a direct Fourier method based on the discrete Radon transform (Lanzavecchia et al., 1999). Comparing each tilt image with projections obtained from the 3D-reconstruction refined the Euler angles. This refinement of the viewing directions, based on the ‘‘projection matching strategy’’ (Harauz and Ottensmeyer, 1984) was iterated several times until reaching convergence. Finally, the reconstructions of the cytoplasmic and external domains were aligned manually and joined to create the model of the envelope for the complete channel. 2.7. Docking The model of the AQP0 channel was docked into the atomic resolution model of the AQP1 tetramer. We used SITUS to perform automatic rigid body fitting (Wriggers et al., 1999) and VMD to visualize the docked model (Humphrey et al., 1996). The number of vectors varied from 3 to 9 and the best solutions were selected based on the values of rmsd, the correlation coefficients, and the fitting parameters.
3. Results Image processing using the SOM was a surprisingly simple procedure. First, the operator generated an output plane comprised of images with random density values. After initialization of the output plane, the input images were presented sequentially to the network. Second, the SOM compared each input image to all the output plane images and declared the most similar to be the ‘‘winning image.’’ Third, the entire output plane was updated (Appendix A), a process that changed all the output images to resemble the input image. The magnitude of the updating was the key to the processing: the winning image was updated the most while those located farthest away from the winning image were updated the least (Appendix B). To help ‘‘fine tune’’ the output plane, the amount of updating decreased with the number of iterations (Appendix B). After the iteration cycle was completed, the input images were compared to the output images and placed in locations (‘‘bins’’) based once again on image similarity. We used this simple method to process single pixel (a black and a white square) and of multi-pixels (sticks, projections of channels, and AQP0 particles) images. Processing of single-pixel images explained the selection
371
of the ‘‘winning image’’ and the ‘‘update’’ of the output plane. Processing of sticks determined how the network mapped the bounded and unbounded variations in the output plane. Processing of artificially created projections of the aquaporin-1 and glycerol facilitator channels determined the robustness of the algorithm in dealing with noise. The appendices describe the winning image criteria, the grid-updating equation, the learning distance function Di ðnÞ, and the learning coefficient LðnÞ used here. 3.1. Computer simulations 3.1.1. Single-pixel images After initialization (reset) of the output images created by the operator, the black and white squares were presented to the network one at the time. For the white square, the closest matching image in the output plane (the ‘‘winning image’’) was located in row 4, column 4 (Fig. 1A). The network then changed the pixel value (i.e., ‘‘updated’’) the output images so that they better resemble the white square. The concentric circles of decreasing intensity indicate to the decreasing amount of updating (Fig. 1B). The change in the brightness of the output images around the winning image was evident after a single iteration (Fig. 1C). At the end of the iteration cycle, the whitest and darkest squares were mapped in diametrically opposite corners of the output plane (Fig. 1D). Analysis of the intensity value revealed that the network ordered the output images along lines where the brightness increased monotonically from dark to bright (green lines connecting the red numbers at the corners, Fig. 2). 3.1.2. Multi-pixel images Differences in orientation are a principal characteristic in images of single particles extracted from electron micrographs. This characteristic arises from the fact that, in contrast to crystals, single particles can adopt all possible orientations in 3D space (described by the Euler angles). The differences in orientation were ‘‘unbounded’’ variations because the first and last images of the series are identical. In contrast, the differences in size and shape were ‘‘bounded’’ variations because the end points of the series are the most dissimilar. (a) Mapping of a ‘‘bounded’’ variation. To determine how the network mapped bounded variations, we processed images of sticks differing in lengths but with the same in-plane orientation. The SOM mapped this variation following a simple rule: sticks with the largest difference in length (the end points) occupied diametrically opposite corners (Fig. 3A). If input sets containing images of sticks with four different lengths were mapped on a square grid, the network neatly clustered a different length at each corner of the plane. If the input set contained sticks with six or more different lengths, the
372
L.M. Zampighi et al. / Journal of Structural Biology 146 (2004) 368–380
Fig. 1. Processing of single-pixel images. The input plane was comprised of white and black squares that are presented sequentially to the network. The panels represent changes in the output plane images that occur after critical steps in processing. (A) The initial random distribution of brightness levels in the output plane resulting after initialization (reset). The arrow points to the ‘‘winning image’’ selected for the white pixel presented to the network. (B) After the ‘‘winning image’’ was selected, the brightness level of all images comprising the output plane was updated according to a ‘‘Gaussian distribution rule’’. The different circles represent the degree of updating. Output images closer to the ‘‘winning image’’ were updated more (white circle) than output images located farther away (gray circle). (C) The output images after a single updating. Note that when compared to (A): the original darker images in the vicinity of the ‘‘winning image’’ now appear gray. (D) After repeating the updating process 800 times, the network mapped the whiter and blacker images at diametrically opposite corners in the output plane.
mapping was more complex (Z-shaped, not shown). Yet, the end points, (the largest and shortest sticks) were always mapped in diametrically opposite corners of the output plane. Removal of the images representing these end points from the input set freed corners for mapping the next group of images that differed the most in length. (b) Mapping of an ‘‘unbounded’’ variation. We processed images of sticks with the same length but with a rotation of 0–177° in 3° increments. The SOM mapped this unbounded variation in closed perimeters in the output plane (Fig. 3B). Sticks mapped in diametrically opposite corners differed the most in orientation (90°) and sticks mapped in neighboring locations differed the least in orientation, satisfying the learning distance function Di ðnÞ (Appendix B). The dimensions of the output plane influenced only the resolution of the clustering. A 5 5 grid, with an outer perimeter of 16 bins, mapped sticks that differed in orientation by 11.2°; the 15 15 grid with an outer perimeter of 56 bins, however, mapped sticks that differed only by 3.2°. Therefore, the dimension of the output plane impinges upon the visualization but not the interpretation of the mapping of unbounded variations. (c) Mapping of the images exhibiting bounded and unbounded variations. The mapping of sticks exhibiting both bounded and unbounded variations was complex. The SOM solved the problem by mapping the bounded variation in closed perimeters and the unbounded variation in the perimeterÕs diameters (Fig. 4A). The shortest sticks were always placed in the inner and the longest in the outermost perimeters of the output map. Therefore,
concentric perimeters neatly solved the problem posed by the mapping of sticks that differed in length and inplane rotation. A corollary of this simulation was that the networks always assign the largest number of output images (‘‘real state’’) to the variation present in the largest number of input images. In this simulation, the sticks exhibited 60 different orientations but only five different lengths. This situation parallels that of input sets extracted from electron micrographs where particle orientation always dominates the contributions of bounded variations, such as particle size and shape. Therefore, the gridÕs corners contain heterogeneous mixtures of images exhibiting the principal variations in the input set. To resolve this problem, we decreased the dominance (i.e., the number of images) representing the in-plane rotation through alignment. To accomplish this task, we took advantage of the learning distance function Di ðnÞ, the property that clusters the most similar images in neighboring locations of the output grid (Appendix B). Since the input images mapped in the same bin of the output plane exhibited the largest similarity, we first aligned them against their average image (using ‘‘AlignDirect’’). This local alignment within each bin was followed by the alignment of the average images from all the bins containing input images in the output plane using one of them as a reference. The resultant ‘‘rotationally aligned’’ input set was re-processed in the SOM. As expected, the partially aligned images were mapped into seven bins, instead of the 124 occupied when mapping the un-aligned input set (compare Figs. 4A and B).
L.M. Zampighi et al. / Journal of Structural Biology 146 (2004) 368–380
373
Fig. 2. A representation of the mapping of black and white squares using numbers varying from close to 1 (blacker) to close to 2 (whiter). The brightness levels of the output images were distributed monotonically along lines with the output images representing the largest difference in brightness (red numbers) mapped at diametrically opposite corners.
More importantly, this simple strategy of decreasing the dominance of the unbounded variation clustered the sticks that differed the most in length (the end points) at diametrically opposite corners of the output plane (Fig. 4B, corrected). A key point was that alignment does not have to be perfect for mapping bounded variations at diametrically opposite corners of the output plane. (d) The effect of noise. We tested the approach developed with sticks by processing data sets that contained mixtures from projections of the external and cytoplasmic domains of the aquaporin-1 and glycerol facilitator channels, which had been solved at atomic resolution using X-ray diffraction crystallography (Fu et al., 2000; Sui et al., 2001). The data set (1800 images) was comprised of images of four models (two from each protein) that were rotated along the a and b angles. This data set provides a stringent test because, at 1 nm resolution, the projections of the two domains are very similar and hence difficult to separate, particularly after the level of noise is increased. As expected, in the absence of noise, the SOM mapped each image type at one of the corners of the output plane (not shown). Table 1 describes the results of adding defined levels of noise to the projections. We determined the effect of increasing the amount of noise on the mappings by determining the percentage of images that were mapped at the ‘‘wrong’’ corner in the output plane. At SNR >0.9, the SOM placed each image type in the correct corner with essentially 100% accuracy (column 3, Table 1). Below 0.9, the accuracy of the mapping deteriorated rapidly, reaching 0% at a SNR value of 0.8 (Table 1). Since un-tilted freeze-fracture images of AQP0 have SNR values of 1, we concluded that the noise is not an important factor in the processing of images from the channel. We realized that: (a) the end points of a bounded variation were mapped in diametrically opposite corners of the output plane; (b) the unbounded variation was mapped in closed perimeters; and (c) the complex
mapping of input sets exhibiting bounded and unbounded variations can be reduced by local alignment within the bin against their average images (or code vectors). Armed with this realization, we proceeded to cluster the input set containing un-tilted images obtained by RCT geometry representing the external and cytoplasmic domains of the water channel AQP0. Since freeze-fracture images have SNR of 1, the effect of noise in the mapping of the AQP0 images was ignored. 3.2. Processing of AQP0 images The purified AQP0 channel is a tetramer 6.6 nm per side and 118 kDa in molecular weight (Konig et al., 1997). The tetramers were reconstituted with their external and cytoplasmic domains oriented toward the lumen or the exterior of the liposomes (‘‘preferential orientation’’). The particles were imaged both un-tilted and tilted at a high angle (random conical tilt; Radermacher et al., 1987). Therefore, the critical step in reconstructing the AQP0 channels inserted in bilayers and imaged using random conical tilt geometry is to separate the images representing the two domains independently from their in-plane orientation (azimuth). The tilt angle was fixed experimentally (50°), and the orientation of the channel across the membrane was fixed by the fact that proteins cross the bilayer at a fixed orientation. We have already classified images of the AQP0 channel obtained by RTC geometry using the alignment-through-classification method (Zampighi et al., 2003). With the lessons learned by processing the sticks, we separate the two sides of the AQP0 and compared the merits of both procedures in terms of accuracy and computational properties. (a) 2D-Analysis. We used 3366 un-tilted images of single freeze-fracture particles representing views of the two domains of AQP0. Imposing C4 symmetry decreased image inhomogeneities resulting from the evaporation of a layer of metal of variable thickness on
374
L.M. Zampighi et al. / Journal of Structural Biology 146 (2004) 368–380
Fig. 3. Contrasting the mapping of bounded and unbounded variations. (A) The mapping of 240 images of sticks with four different lengths (small panels on top). To simplify the visualization of the mapping, the output planes show only the locations containing clusters of input images (not the code vectors as in Fig. 1). The network mapped these 240 images in the bins located at the corner of the output plane. The arrow indicates that the largest and shortest sticks (i.e., the end points of the variation) were mapped in diametrically opposite corners. (B) The mapping of 60 images of sticks of the same length (small panel on top) that were rotated from 0–177° in 3° increments. The network mapped these images at the outer perimeter of the output plane. The input images mapped in bins close together contained images that differed the least in rotation while the input images mapped in bins farther apart differed the most in the rotation. The opposite corners contained sticks differing by 90°.
Fig. 4. Contrasting the mapping of images comprised of a bounded and an unbounded variation before and after alignment. As in Fig. 3, the output planes show only the locations containing clusters of input images. (A) Mapping of an input set comprised of 240 images of sticks with four different lengths and rotated from 0–177° in 3° increments. The images were mapped in 120 locations in the output plane, which were arranged in three concentric closed perimeters. Sticks with the shortest lengths were placed in the inner perimeters and sticks with the longest lengths in the outer perimeters. This arrangement of the input images was dictated by the fact that the longest sticks changed the most with rotation. There were only three closed perimeters because the two longest sticks were mapped together in the same bins. (B) Mapping of the same set of input images after decreasing the dominance of the rotation in the plane using Align-Direct. The images in the different locations were aligned against their average images (see Section 2). The input images were mapped in seven bins instead of 120 bins. The mapping in seven bins (instead of the expected four) resulted from imperfect rotational alignment. Yet, the important result was that the longest and shortest sticks (the end points) were mapped in diametrically opposite corners (arrow).
the fracture faces. This symmetry selection was based on the study of the Euler images described in a previous publication (Zampighi et al., 2003). To cluster the images in subsets representing the external and cytoplasmic domains, we followed the following procedure: (1) The first SOM mapped the unprocessed 3366 un-tilted images (with C4 imposed) in a 15 15 output plane comprised of 225 bins. (2) We calculated the angle of the input images contained in the
bin with respect to their average image. (3) We calculated the angle between all the average images with respect to one reference chosen among them. (4) We summed the two angles for each image (its angle with respect to the average image and the average image with respect to the reference) and rotated the images. (5) A final SOM mapping, calculated using the locally aligned input set, ordered the clusters of particles representing the cytoplasmic and external domains of the channel at
L.M. Zampighi et al. / Journal of Structural Biology 146 (2004) 368–380 Table 1 Effect of noise
SNR
Mapping Accuracy (%)
Number of Models
1.28
100
4
0.91
99
4
0.89
90
3
0.86
64
2
0.83
33
0
The first column shows the effect of noise on the image quality. The second column shows the signal-to-noise ratio (SNR) values of the different images. Columns three and four show results of processing with the SOM. The third column expresses the accuracy (%) of the mapping. (One hundred percent accuracy or ‘‘perfect mapping’’ means that the SOM mapped only images of one of the models at the corners of the output plane). The fourth column expresses another aspect of the mapping: how many models were mapped at the corners of the grid. The number 4 means that mapping was perfect and each corner contains a different model.
opposite corners of the output plane (Fig. 5). The corresponding tilted views were used to reconstruct the molds (Zampighi et al., 2003). Misalignment was corrected by re-projection of the 3D model. The average images of these clusters were then compared to the classes obtained by the alignment-throughclassification procedure (Zampighi et al., 2003). The clusters processed by the SOM and the classes calculated by the alignment-through-classification procedure exhibited the same overall dimensions and shape. The particle dimensions ranged from 10.3 to 9.3 nm (Table 2). The corner #1 and the diametrically opposite corner #4 exhibited similar shapes (Fig. 5) but corner #1 was 1 nm larger than corner #4 (Table 2). In contrast, the
375
average images at corner #2 and corner #3 exhibited similar dimensions (Table 2) but different shapes (Fig. 5). The average image at corner #2 exhibited a tetragonal shape while that at corner #3 was octagonal (Fig. 5). Therefore, the SOM mapped the shape variation in one pair of diametrically opposite corners and the size variation in the other (see Table 3). The strategy using the SOM differed from the alignment-through-classification method in the absence of extensive bouts of iteration, the speed of processing and the absence of operator input in the selection of the references used for alignment. The three step strategy (first SOM, alignment, second SOM) required hours of processing, which were needed to select the ideal processing parameters (Di ðnÞ and LðnÞ), the output plane dimensions, the number of iterations and the starting conditions to make sure that the optimal mapping would be achieved. Since the references were the average images of the bins, the operator input was reduced to a minimum. Finally, the end points of the variation (in this case the tetragonal and octagonal shapes of the channel) were conveniently placed at opposite corners in the output plane, a fact that simplified the selection of the clusters used for 3D reconstructions. (b) 3D Reconstruction. To determine the quality of the mapping, we used the input images mapped at the corners to select the corresponding tilted images for 3D reconstruction (Radermacher, 1988; Radermacher et al., 1987). As in our previous study (Zampighi et al., 2003), the 3D reconstructions appeared as cups with external convex and internal concave surfaces (designated the ‘‘molds,’’ not shown). The molds were used to reconstruct the volumes, which represented the molecular envelope of the particles (designated the ‘‘imprints,’’ Fig. 6). The four imprints exhibited similar dimensions (Table 2) but different shapes (Fig. 6). Imprints #1 and #2 exhibited octagonal shapes while imprints #3 and #4 were tetragonal (Fig. 6). The imprints also exhibited smaller features including elevations at the 4-fold axis in imprints #1 and #4 and small depressions in imprint #2 and #3. Also, imprint #2 exhibited four smaller subunits with a small depression at the center (red area, Fig. 6). These subunits were not observed in the other imprints, although imprint #1 had four small depressions in the same location (arrows, Fig. 6). Therefore, analysis of the 3D envelopes indicated that the dominant variation was shape with the end points being octagonal and tetragonal. This conclusion strongly supports the results obtained using the alignment-through-classification (Zampighi et al., 2003). We focused on the images located at diametrically opposite corners of the grid to identify the external and cytoplasmic domains of the channel. A concern raised by this approach is the fate of the input images that were not mapped at the corners. In an effort to resolve this concern, we calculated the imprints of the entire output
376
L.M. Zampighi et al. / Journal of Structural Biology 146 (2004) 368–380
Fig. 5. Mapping of the un-tilted images of AQP0 channels inserted in phospholipid bilayers and imaged using the RCT geometry. The large map underneath represents the output plane (the code vectors) resulting from processing the entire input set that was aligned by translation and rotation of the input images. The smaller map above shows the average images clustered at the four corners (3 3 bins colored red, blue, green, and yellow). One pair of diametrically opposite corners (1, 4) mapped input images of slightly different dimensions. The other pair (2, 3) mapped the input images exhibiting octagonal (corner 2) and tetragonal (corner 3) shapes.
Table 2 Properties of the imprints calculated from images clustered at the corners in the output plane
Number of images Dimensions with coat Dimensions without coat Across Height Shape Pore–pore distance Volume (in voxels) FSCa
Corner 1
Corner 2
Corner 3
Corner 4
428 10.3 6.7 6.5–6.8 3.41 Octagonal 2.45 3159 1.0
184 9.7 6.2 5.8–6.8 3.41 Octagonal 2.98 3039 1.0
206 10 6.2 5.3–6.8 3.1 Tetragonal N/A 2661 1.0
417 9.3 6.5 6.2–6.8 3.41 Tetragonal N/A 2526 1.0
The imprints exhibited similar dimensions (size, height, and across) and volume measured in voxels. The imprints differed in shape, with particles exhibiting octagonal shapes mapped at the upper locations and particles with tetragonal shapes at the lower locations of the output plane. Pore–pore distance refers to a feature present in the octagonal shaped imprints only. * In nanometer. a Fourier shell correlation (FSC) was calculated using the 0.5 criteria. We use this criteria to demonstrate that the information used for the reconstruction of the particles was consistent, not to estimate their spatial resolution.
Table 3 Molecular characteristics of the models calculated with the SOM
Model 1, 4 Model 2, 3 a
Sidea
Heighta
Volume (voxels)
MW
rmsd
6.6 6.2
6.8 6.5
5685 5700
114 114
1.707 2.328
In nanometer.
plane in groups of 3 3 bins (not shown). As expected, the imprints of these images were a continuum of variation between the octagonal and tetragonal shapes
mapped at the opposite corners. This observation justified our focus on the imprints representing the end points of the variation in the shape of the particles.
L.M. Zampighi et al. / Journal of Structural Biology 146 (2004) 368–380
3.3. The complete channel Since freeze-fracture exposes particles representing one half of the channel only, we joined together the imprints at opposite corners of the output plane to reconstruct the entire AQP0 channel (2, 3 and 1, 4; green areas, Fig. 7). The models exhibited similar volumes (5685 voxels for 1, 4, and 5700 voxels for 2, 3) and dimensions (6.2–6.7 nm in side and 3.1–3.4 nm in height, Table 2). Moreover, we performed rigid-body docking of the atomic structure of AQP1 (Sui et al., 2001) to both models (Fig. 7). The docking estimated an rmsd of 1.707 for model 1, 4 and 2.328 for model 2, 3 (k ¼ 5). A contributor to the lower rmsd was that the amino and carboxyl termini protruded from the envelope of the cytoplasmic domain and the external loops from the envelope of the external domains (Fig. 7). The model reconstructed by joining imprints 1, 4 exhibited negative densities at the 4-fold axis and in a location corresponding to the middle of the phospholipid bilayer. Also, the external and cytoplasmic surfaces tapered off, giving a more rounded shape. Inspite of these small differences, however, the similarities of the two models were more significant than the differences. Therefore, the AQP0 image set collected using RCT can be described by a single unbounded variation (the azimuth) and two bounded variations (particle size and shape). The end points of the principal variation were the octagonal and tetragonal shapes. The small variation in size most likely represented differences in the thickness of the layer of metal used for visualization.
4. Discussion In a previous publication, we used alignmentthrough-classification to reconstruct the 3D envelope of the AQP0 channel reconstituted in phospholipid bilayers (Zampighi et al., 2003). This places us in a good position to compare the classification of a similar data set in an effort to grasp the relative strengths and weaknesses of both methods. Both methods identified the octagonal and tetragonal shapes as representative of the external and cytoplasmic domains of the channel. Alignmentthrough-classification separated the input images into eight stable classes (Zampighi et al., 2003). In contrast, the approach using the SOM separated the images in neatly arranged clusters, with the end points of the principal variation (the differing in shape) at diametrically opposite corners. Therefore, the SOM and alignment-through-classification procedures distinguished the same shapes for the particles representing the cytoplasmic and external views of the AQP0 channel. There were, however, unique advantages in using the SOM for processing images of particles representing a
377
small molecular assembly, such as AQP0 (118-kDa molecular weight), which is approximately five times smaller than molecules studied by single particle cryoelectron microscopy. The principal advantage was the elimination of the inherent subjectivity in selecting the classes to be used as references for the successive rounds of iteration required in the alignment-through-classification method. In contrast, processing using the SOM was a straightforward three step process: (a) mapping of the entire input set, (b) aligning to decrease the dominance of the in-plane orientation variation and to obtain the relative azimuth angles of the clusters of images, and (c) a final mapping of the locally aligned set to select the end points of the shape and size variations located at the opposite corners. The same determination using alignment-throughclassification required numerous iterations and subjective decisions about ‘‘good’’ and ‘‘bad’’ classes. In addition, the SOM mapped the end points (the tetragonal and octagonal clusters) at diametrically opposite corners of the output plane, which removed subjectivity in selecting the best clusters for 3D reconstruction. The power of the algorithm in clustering bounded variations at the corners was evident even when the input set was only partially aligned. Also, the lower values for FSC attested to the higher consistency of the images in the subsets separated using the SOM. 4.1. What are the disadvantages of processing with the SOM? As is often the case, the strength of a method is its weakness. For the SOM, the principal weakness is the tendency to preserve the topology and metric of the variations in the input set. In particular, if the input set contains many variations, such as for example in the case of input sets imaged using angular reconstruction geometry, the network will tend to map them by assigning a similar number of bins (‘‘real state’’) in the output plane. Since the grid is low dimensional (square or rectangular), the different variations compete for space and thus complicate the interpretation of the final mapping. An output plane of higher dimension, such as a cylinder or a torus, overcomes some of these problems, but this solution complicates the visualization of the output plane, thereby negating one of the principal benefits of the algorithm. Another disadvantage was that the clusters of images at the corners of the grid allowed the reconstruction of two complete models of the channel by joining the imprints of the opposite corners (model 1, 4 and model 2, 3, Fig. 7). Both models exhibited similar sides (6.6 vs. 6.2 nm), height (6.8 vs. 6.5 nm) and volume (Table 2) and hence predicted molecular masses that were consistent with a channel comprised of four polypeptide chains.
378
L.M. Zampighi et al. / Journal of Structural Biology 146 (2004) 368–380
Fig. 6. Imprints of the envelopes calculated by using the corresponding tilted images mapped at the four corners of the output plane (corners 1–4, Fig. 5). The imprints represent the volume contained inside the 3D reconstruction of the replica (the ‘‘mold,’’ Zampighi et al., 2003). The red area in corner #2 delineates a subunit comprised of a cylindrical elevation with a small depression at the center. The cylindrical elevations might correspond to the polypeptide chain and the depression at the vestibule of the aqueous pore (Table 2).
Fig. 7. Models of the complete envelope of the AQP0 channel constructed by joining the imprints at diametrically opposite corners in Fig. 6 (1, 4 and 2, 3). The structure of the AQP1 channel (Sui et al., 2001) was docked within the envelope. One polypeptide chain was colored red for emphasis. Both models fitted the transmembrane domain of the AQP1 channel but not the amino and carboxyl termini or the external loops (green area). The envelopes differed in shape (1, 4 was rounder) and in a small elevation at the 4-fold axis that was present only in model 1, 4.
In an effort to identify the better model, we fitted the atomic structure of the aquaporin-1 channel (AQP1, Sui et al., 2001) into the models (Fig. 7) using rigidbody docking (Wriggers et al., 1999). Both the 1, 4 and
2, 3 envelopes fitted accurately at the region where AQP0 and AQP1 exhibited the highest amino acid sequence homology (the transmembrane domain) but poorly at the regions with no or low sequence
L.M. Zampighi et al. / Journal of Structural Biology 146 (2004) 368–380
homology (the amino- and carboxyl-termini and the external loops). A negative density around the 4-fold axis of the cytoplasmic surface of the 2, 3 model might account for the location of the carboxyl terminus. However, at the resolution accomplished by the reconstructions, it was not possible to distinguish whether model 1, 4 or model 2, 3 were better representations of the AQP0 channel. In conclusion, when dealing with input sets imaged using RTC imaging geometry, the SOM succeeded in aligning and clustering the input set of a small channel (6.6 nm side). The 3D models obtained from clusters produced envelopes that were consistent with models calculated with the alignment-through-classification method. Processing using the SOM provides a simple, yet effective, solution to the iterative rounds of alignment and classification needed to generate the references to separate homogeneous subsets for 3D reconstruction of biological macromolecules from single particles extracted from electron micrographs.
Acknowledgments We wish to thank Drs. E. Wright, S. Lanzavecchia, and S.A. Simon for helpful comments and discussion. Mr. N. Tang wrote and modified existing computer programs. This work was supported by NIH DK60846 and from the laboratory of Dr. E. Wright.
minjIðnÞ Gi ðnÞj ¼ jIðnÞ Gwin ðnÞj: i
Let us agree on the following terminology: IðnÞ is the pixel value of the nth image presented to the network and Gi ðnÞ the pixel value of the ith output grid image prior to update via the nth presented input image. The values for the index ‘‘i’’ in Gi ðnÞ is assigned to the output images in a one-to-one correspondence. Here we use a simple ‘‘raster’’ where the upper left-hand corner grid image of the grid was assigned an index value of 1. For a network that employs a 5 5 output plane, the index ‘‘i’’ changes from left to right and from top to bottom until an index value of 25 is reached for the lower right-hand corner grid image. The ‘‘winning image’’ is the image in the output plane fG1 ðnÞ; G2 ðnÞ; . . . ; G25 ðnÞg that most closely resembles the presented image IðnÞ. The network finds this ‘‘winning image’’ by comparing pixel values and searching for the lowest numerical value in the following set: fjIðnÞ G1 ðnÞj; jIðnÞ G2 ðnÞj; . . . :; jIðnÞ G25 ðnÞjg. The index value corresponding to the lowest numerical value in this set specifies the location of the winning image on the grid. Eq. (A.1) describes the ‘‘winning image’’ criteria:
ðA:1Þ
The network maps the input images differing the most by placing them in maximally separated locations (bins) in the plane (red numbers, Fig. 2). The network does this because it creates a monotonically increasing order on the output grid across diametrically opposite corners (green arrows, Fig. 2). In the case of multi-pixel images, the network looks for the closest matching multi-pixel image in the grid in its search for the winning image. The same mathematical symbolism used for single-pixel networks may be applied to multi-pixel networks. The only difference is that instead of using numbers, IðnÞ and Gi ðnÞ are represented by multi-dimensional vectors; one vector dimension is assigned to each pixel comprising the multi-pixel image. The degree of similarity between two multi-pixel images is defined by the proximity of the corresponding image vectors in vector space: the closer the vectors, the greater the degree of similarity of the images they represent. The winning image vector is the one, among all the image vectors, that lies closest to the presented image vector. This distance of closest approach is jIðnÞ Gwin ðnÞj, which represents the shortest distance between the input image vector IðnÞ and the entire set of output image vectors Gi ðnÞ. Once the location of the winning image has been identified for a given input image, the network then individually updates all of the output images using the following rule: Gi ðn þ 1Þ ¼ Gi ðnÞ þ LðnÞDi ðnÞ½IðnÞ GðnÞ;
Appendix A. The winning image criteria and grid update equation
379
ðA:2Þ
where Gi ðn þ 1Þ represents the updated ith grid image vector; LðnÞ, the ‘‘learning rate’’ function that monotonically decreases with successive presentations of input images, and is also openly bounded between zero and one: 0 < LðnÞ < 1; Di ðnÞ is the ‘‘learning distance’’ function that monotonically decreases with successive presentations of input image, and is openly bounded between zero and one: 0 < Di ðnÞ < 1. For a given winning image the value of the learning distance function, Di ðnÞ, changes depending on the grid image, i, being updated (see Appendix B). In Eq. (A.2), the closer the product LðnÞ Di ðnÞ is to 1, the more the updated output images, Gi ðn þ 1Þ, resemble the input image IðnÞ. When LðnÞ Di ðnÞ ¼ 1, we have: Gi ðn þ 1Þ ¼ Gi ðnÞ þ ðlÞ½IðnÞ Gi ðnÞ ¼ Gi ðnÞ þ IðnÞ Gi ðnÞ ¼ IðnÞ:
ðA:3Þ
In this extreme case, the network would force all the updated output images, Gi ðn þ 1Þ, to be identical to IðnÞ. The product LðnÞ Di ðnÞ is referred to as the ‘‘neighborhood function’’ and acts as a ‘‘regulator’’ that controls how much the grid images are modified to resemble the presented image with each network update. To attain eventual convergence, it is necessary that the
380
L.M. Zampighi et al. / Journal of Structural Biology 146 (2004) 368–380
neighborhood function approaches zero as the network is progressively trained. Appendix B. The learning distance function Di (n) and the learning rate function L(n) The networkÕs ability to cluster similar images together on the grid is largely due to the behavior of the learning distance function Di ðnÞ. The learning distance function varies during training according to the iteration number n and the particular grid image being updated i. Eq. (B.1) represents the learning distance function used here: S 2 Di ðnÞ ¼ exp : ðB:1Þ 2R2 ðnÞ Si is the Euclidean distance on the grid separating the location of the center of the ‘‘winning image’’ from the location of the center of the output image, i, being updated, and R is the Gaussian neighborhood radius. The latter may be chosen to be a monotonically decreasing distance that begins with some initial radius R0 for the first training iteration, and linearly progresses down to a radius of 1 for the final iteration, or in other words: ðR0 1ÞðT nÞ : ðB:2Þ T In this way the images become ordered first globally and then locally on the grid surface as the network is progressively trained. The learning rate function LðnÞ establishes an upper limit on the amount of change (or learning) permitted of the output images with each network iteration. The value of LðnÞ decreases as each new image is presented to the network to ‘‘fine tune’’ the grid images with each successive presentation of an input image. In this paper, the specific learning rate function is the following monotonically decreasing sequence: h n n oi L ¼ L0 1 ; ðB:3Þ T where T is the total number of training iterations, n is the current iteration number, and L0 is the initial learning rate. RðnÞ ¼ 1 þ
References Frank, J., 1996. Three-Dimensional Electron Microscopy of Macromolecular Assemblies. Academic Press, New York. Fu, D., Libson, A., Mierke, L.J.W., Weitzman, C., Nollert, P., Krucinsky, J., Strroud, R.M., 2000. Structure of a glycerol-
conducting channel and the basis of its selectivity. Science 290, 481–486. Harauz, G., Ottensmeyer, F.P., 1984. Nucleosome reconstruction via phosphorus mapping. Science 226, 936–940. Humphrey, W., Dalke, A., Schulten, K., 1996. VMD—visual molecular dynamics. J. Mol. Graph. 14, 33–38. Kohonen, T., 1984. Self-Organization and Associative Memory. Springer, Berlin. pp. 119–157. Kohonen, T., 2001. Self-Organizing Maps, third ed. Springer Series in Information Sciences, Heidelberg. Konig, N., Zampighi, G.A., Butler, P.J., 1997. Characterization of the major intrinsic protein (MIP) from bovine lens fibre membranes by electron microscopy and hydrodynamics. J. Mol. Biol. 265, 590– 602. Lanzavecchia, S., Bellon, P.L., Radermacher, M., 1999. Fast and accurate three-dimensional reconstruction from projections with random orientations via Radon transforms. J. Struct. Biol. 128, 152–164. Ludtke, S.J., Balwin, P.R., Chiu, W., 1999. EMAN: semiautomatic software for high-resolution single-particle reconstruction. J. Struct. Biol. 128, 82–97. Marabini, R., Carazo, J.M., 1994. Pattern recognition and classification of images of biological macromolecules using artificial neuronal networks. Biophys. J. 66, 1804–1814. Marabini, R., Masegosa, I.M., San Martın, M.C., Marco, S., Fernandez, J.J., de la Fraga, L.G., Vaquerizo, C., Carazo, J.M., 1996. Xmipp: An image processing package for electron microscopy. J. Struct. Biol. 116, 237–241. Muller, T., Gross, H., Winkler, H., Moor, H., 1985. High resolution shadowing with pure carbon. Ultramicroscopy 16, 340–348. Pascual-Montano, A., Taylor, K.A., Winkler, H., Pascual-Marqui, R.D., Carazo, J.M., 2002. Quantitative self-organizing maps for clustering electron tomograms. J. Struct. Biol. 138, 114–122. Radermacher, M., 1988. Three-dimensional reconstruction of single particles from random and non-random tilt series. J. Electron Microsc. Tech. 9, 359–394. Radermacher, M., Wagenknecht, T., Verschoor, A., Frank, J., 1987. Three-dimensional reconstruction from a single-exposure random conical tilt series applied to the 50S ribosomal subunit of Escherichia coli. J. Microsc. 146, 113–136. Sui, H., Han, B.G., Lee, J.K., Wallan, P., Jap, B.K., 2001. Structural basis of water-specific transport through the AQP1 water channel. Nature 414, 872–878. Turk, E., Kim, O., le Coutre, J., Whitelegge, J.P., Eskandari, S., Lam, J.T., Kreman, M., Zampighi, G.A., Faull, K.F., Wright, E.M., 2000. Molecular characterization of Vibrio parahaemolyticus vSGLT. A model for sodium-coupled sugar cotransporters. J. Biol. Chem. 275, 25711–25716. van Heel, M., Harauz, G., Orlova, E.V., 1996. A new generation of IMAGIC image processing system. J. Struct. Biol. 116, 17–24. Wriggers, W., Milligan, R.A., McCammon, J.A., 1999. Situs: a package for docking crystal structures into low resolution maps from electron microscopy. J. Struct. Biol. 125, 185–195. Zampighi, G.A., Hall, J.E., Ehring, G.R., Simon, S.A., 1989. The structural organization and protein composition of lens fiber junctions. J. Cell Biol. 108, 2255–2275. Zampighi, G.A., Kreman, M., Lanzavecchia, S., Turk, E., Eskandari, E., Zampighi, L., Wright, E.M., 2003. Structure of functional AQP0 channels in phospholipid bilayers. J. Mol. Biol. 325, 201– 210.