TEXTAL System: Artificial Intelligence Techniques for Automated Protein Model Building

TEXTAL System: Artificial Intelligence Techniques for Automated Protein Model Building

244 [12] map interpretation and refinement the map interpretation routines regarding phase quality, and the inadequate use of knowledge-based decis...

693KB Sizes 1 Downloads 160 Views

244

[12]

map interpretation and refinement

the map interpretation routines regarding phase quality, and the inadequate use of knowledge-based decision making during model building. Acknowledgments This work was supported by EU Grant BIO2-CT920524/BIO4-CT96-0189 (R.J.M.). The authors thank Keith Wilson, Zbyszek Dauter, Rob Meijers, and Petrus Zwart for fruitful discussions and useful comments; Ge´rard Bricogne, for helpful suggestions, advice, mathematical rigor, and for generously allowing R. J. M. to write this contribution while working at Global Phasing, Ltd.; Eric Blanc, Pietro Roversi, Claus Flensburg, and Clemens Vonrhein for constructive critique on an initial draft of this manuscript; Garib Mushudov and Eleanor Dodson for help with REFMAC; and all ARP/wARP users for helpful suggestions.

[12] TEXTAL System: Artificial Intelligence Techniques for Automated Protein Model Building By Thomas R. Ioerger and James C. Sacchettini Introduction

Significant advances have been made toward improving many of the complex steps in macromolecular crystallography, from new crystal growth techniques, to more powerful phasing methods such as multiwavelength anomalous diffraction (MAD),1 to new computational algorithms for heavyatom search,2–4 reciprocal-space refinement, and so on.5 However, the step of interpreting the electron density map and building an accurate model of a protein (i.e., determining atomic coordinates) from the electron density map remains one of the most difficult to improve. Currently it takes days to weeks for a human crystallographer to build a structure from an electron density map, often with the help of a 3D visualization and model-building program such as O.6 This manual process is both time-consuming and error-prone. Even with an electron density map of high quality, model 1

W. A. Hendrickson and C. M. Ogata, Methods Enzymol. 276, 494 (1997). C. M. Weeks, G. T. DeTitta, H. A. Hauptmann, P. Thuman, and R. Miller, Acta Crystallogr. A 50, 210 (1994). 3 de la E. Fortelle and G. Bricogne, Methods Enzymol. 276, 590 (1997). 4 T. C. Terwilliger and J. Berendzen, Acta Crystallogr. D Biol. Crystallogr. 55, 849 (1999). 5 A. T. Bru¨nger, P. D. Adams, G. M. Clore, W. L. DeLano, P. Gros, R. W. Grosse-Kunstleve, J.-S. Jiang, J. Kuszewski, M. Nigles, N. S. Pannu, R. J. Read, L. M. Rice, T. Simmonson, and G. L. Warren, Acta Crystallogr. D Biol. Crystallogr. 54, 905 (1998). 6 T. A. Jones, J.-Y. Zou, and S. W. Cowtan, Acta Crystallogr. A 47, 110 (1991). 2

METHODS IN ENZYMOLOGY, VOL. 374

Copyright 2003, Elsevier Inc. All rights reserved. 0076-6879/03 $35.00

[12]

TEXTAL system

245

building is a long and tedious process.7 There are many sources of noise and errors that can perturb the appearance of the density map.8–10 All these effects contribute to making the density sometimes difficult to interpret. There exist several prior methods for, or related to, automated model building, such as searching fragment libraries,11,12 template convolution and other fast Fourier transform (FFT)-based approaches,13,14 the freeatom insertion method of ARP/wARP,15,16 DADI17 (which uses a real-space correlational search), X-Powerfit,18 MAID,19 MAIN,20 and molecular scene analysis.21–24 However, most of these methods have limitations. For example, fragment library searches11,12 require user intervention to pick C coordinates in the sequence (although backbone tracing can help, it does not reliably determine C locations), and methods like ARP/wARP and molecular scene analysis seem to work best only at high ˚ or better. resolution, for example, around 2.5 A TEXTAL is a new computer program designed to build protein structures automatically from electron density maps. It uses AI (artificial intelligence) and pattern recognition techniques to try to emulate the intuitive 7

G. J. Kleywegt and T. A. Jones, Methods Enzymol. 277, 208 (1997). J. S. Richardson and D. C. Richardson, Methods Enzymol. 115, 189 (1985). 9 C. I. Branden and T. A. Jones, Nature 343, 687 (1990). 10 T. A. Jones and M. Kjeldgaard, Methods Enzymol. 277, 173 (1997). 11 T. A. Jones and S. Thirup, EMBO J. 5, 819 (1986). 12 L. Holm and C. Sander, J. Mol. Biol. 218, 183 (1991). 13 G. J. Kleywegt and T. A. Jones, Acta Crystallogr. D Biol. Crystallogr. 53, 179 (1997). 14 K. Cowtan, Acta Crystallogr. D Biol. Crystallogr. 54, 750 (1998). 15 A. Perrakis, T. K. Sixma, K. S. Wilson, and V. S. Lamzin, Acta Crystallogr. D Biol. Crystallogr. 53, 448 (1997). 16 A. Perrakis, R. Morris, and V. Lamzin, Nat. Struct. Biol. 6, 458 (1999). 17 D. J. Diller, M. R. Redinbo, E. Pohl, and W. G. J. Hol, Proteins Struct. Funct. Genet. 36, 526 (1999). 18 T. J. Oldfield, in ‘‘Crystallographic Computing 7: Proceedings from the Macromolecular Crystallography Computing School’’ (P. E. Bourne and K. Watenpaugh, eds.). Oxford University Press, New York, 1996. 19 D. G. Levitt, Acta Crystallogr. D Biol. Crystallogr. 57, 1013 (2001). 20 D. Turk, in ‘‘Methods in Macromolecular Crystallography’’ (D. Turk and L. Johnson, eds.). NATO Science Series I, Vol. 325, p. 148. Kluwer Academic, Dordrecht, The Netherlands, 2001. 21 L. Leherte, S. Fortier, J. Glasgow, and F. H. Allen, Acta Crystallogr. D Biol. Crystallogr. 50, 155 (1994). 22 K. Baxter, E. Steeg, R. Lathrop, J. Glasgow, and S. Fortier, in ‘‘Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology,’’ p. 25. American Association for Artificial Intelligence, Menlo Park, CA, 1996. 23 L. Leherte, J. Glasgow, K. Baxter, E. Steeg, and S. Fortier, Artif. Intell. Res. 7, 125 (1997). 24 S. Fortier, A. Chiverton, J. Glasgow, and L. Leherete, Methods Enzymol. 277, 131 (1997). 8

246

map interpretation and refinement

[12]

decision-making of experts in solving protein structures. Previously solved structures are exploited to help understand the relationship between patterns of electron density and local atomic coordinates. The method applies this along with a number of heuristics to predict the likely positions of atoms in a model for an uninterpreted map. TEXTAL is aimed at solving ˚ range, which are just at the range of human intermaps in the 2.5 to 3.5-A pretability, but turn out to be the majority of cases in practice for maps constructed from MAD data. TEXTAL has the potential to reduce one of the bottlenecks of highthroughput structural genomics.25 By automating the final step of model building (for noisy, medium-to-low resolution maps), less effort will be required of human crystallographers, allowing them to focus on regions of a map where the density is poor. TEXTAL will eventually be integrated with other computational methods, such as reciprocal-space refinement26 or statistical density modification,27 to iterate between building approximate models (in poor maps) and improving phases, which can then be used to produce more accurate maps and allow better models to be built. This will be implemented in the PHENIX crystallographic computing environment currently under development at the Lawrence Berkeley National Laboratory.28 Principles of Pattern Recognition and Application to Crystallography

Pattern recognition techniques can be used to mimic the way the crystallographer’s eye processes the shape of density in a region and comprehends it as something recognizable, such as a tryptophan side chain, or a  sheet, or a disulfide bridge. The history of statistical pattern recognition is long, and a great deal of research, both theoretical and applied (i.e., development of algorithms), has been done in a wide range of application domains. Much work has been done in image recognition in two dimensions, such as recognizing military vehicles or geographic features in satellite images, faces, fingerprints, carpet textures, parts on an assembly line for manufacturing, and even vegetables for automated sorting29; however,

25

S. K. Burley, S. C. Almo, J. B. Bonanno, M. Capel, M. R. Chance, T. Gaasterland, D. Lin, A. Sali, W. Studier, and S. Swaminathian, Nat. Genet. 232, 151 (1999). 26 G. N. Murshudov, A. A. Vagin, and E. J. Dodson, Acta Crystallogr. D Biol. Crystallogr. 53, 240 (1997). 27 T. C. Terwilliger, Acta Crystallogr. D Biol. Crystallogr. 56, 965 (2000). 28 P. D. Adams, R. W. Grosse-Kunstleve, L.-W. Hung, T. R. Ioerger, A. J. McCoy, N. W. Moriarty, R. J. Read, J. C. Sacchettini, and T. C. Terwilliger, Acta Crystallogr. D Biol. Crystallogr. 58, 1948 (2002).

[12]

TEXTAL system

247

patterns in electron density maps are three-dimensional, and less work has been done for these kinds of problems. The basic idea behind pattern recognition (at least for supervised learning) is to ‘‘train’’ the system by giving it labeled examples in several competing categories (such as tanks vs. civilian cars and trucks). Each example is usually represented by a set of descriptive features, which are measurements derived from the data source that characterize the unique aspects of each one (e.g., color, size). Then the pattern recognition algorithm tries to find some combination of feature values that is characteristic of each category that can be used to discriminate among the categories (i.e., given a new unlabeled example, classify it by determining to which category it most likely belongs). There are many pattern recognition algorithms that have been developed for this purpose but work in different ways, including decision trees, neural networks, Bayesian classifiers, nearest neighbor learners, and support vector machines.30,31 The central problem in many pattern recognition applications is in identifying features that reflect significant similarities among the patterns. In some cases, features may be noisy (e.g., clouds obscuring a satellite photograph). In other cases, features may be truly irrelevant, such as trying to determine the quality of an employee by the color of his or her shirt. The most difficult issue of all is interaction among features, where features are present that contain information, but their relevance to the target class on an individual basis is weak, and their relationship to the pattern is recognizable only when they are looked at in combination with other features.32,33 A good example of this is the almost inconsequential value of individual pixels in an image; no single pixel can tell much about the content of an image (unless the patterns were rigidly constrained in space), yet the combination of them all contains all the information that is needed to make a classification. While some methods exist for extracting features automatically,34 currently consultation with domain experts is almost always needed to determine how to process raw data into meaningful, high-level features that are likely to have some form of correlation with 29

R. C. Gonzales and R. C. Woods, ‘‘Digital Image Processing.’’ Addison-Wesley, Reading, MA, 1992. 30 T. Mitchell, ‘‘Machine Learning.’’ McGraw-Hill, New York, 1997. 31 R. O. Duda, P. E. Hart, and D. G. Stork, ‘‘Pattern Classification.’’ John Wiley & Sons, New York, 2001. 32 L. Rendell and R. Seshu, Comput. Intell. 6, 247 (1990). 33 G. John, J. Kohavi, and K. Pfleger, in ‘‘Proceedings of the Eleventh International Conference on Machine Learning,’’ p. 121. Morgan Kaufmann, San Francisco, 1994. 34 H. Liu and H. Motoda, ‘‘Feature Extraction, Construction, and Selection: A Data Mining Perspective.’’ Kluwer Academic, Dordrecht, The Netherlands, 1998.

248

map interpretation and refinement

[12]

the target classes, often making manual decisions about how to normalize, smooth, transform, or otherwise manipulate the input variables (i.e., ‘‘feature engineering’’). A pattern recognition approach can be used to interpret electron density maps in the following way. First, we restrict our attention to local ˚ radius* (a whole regions of density, which are defined as spheres of 5A ˚ map can be thought of and modeled as a collection of overlapping 5A spheres). Our goal is to predict the local molecular structure (atomic coordinates) in each such region. This can be done by identifying the region with the most similar pattern of density in a database of previously-solved maps, and then using the coordinates of atoms in the known region as a basis for estimating coordinates of atoms in the unknown. To facilitate this patternmatching process, features must be extracted that characterize the patterns of density in spherical regions and can be used to recognize numerically when two regions might be similar. This idea of feature-based retrieval is illustrated in Fig. 1. Recall that electron density maps are really 3D volumetric data sets (i.e., a three-dimensional grid of density values that covers the space around and including the protein). As we describe in more detail later, features for regions can be computed from an array of local density values by methods such as calculating statistics of various orders, moments of inertia, contour properties (e.g., surface area, smoothness, connectivity), and other geometric properties of the distribution and ‘‘shape’’ of the density. Not all of these features are equally relevant; we use a specialized feature-weighting scheme to determine which ones are most important. For this pattern recognition approach to work, an important requirement of the features is that they should be rotation-invariant. A feature is rotation-invariant if its value would remain constant even if the region were rotated, i.e., F(region) ¼ F(Rot(,region)), where  is an arbitrary set of rotation parameters that can be used to transform grid-point coordinates locally around the center of the region. Rotation-invariance is important for matching regions of electron density because proteins can appear in arbitrary orientations in electron density maps. If one region has a similar pattern to another, we want to detect this robustly, even if they are in different orientations. Features such as Fourier coefficient amplitudes are well-known to be translation-invariant, but they are not rotation-invariant. Therefore, one of the initial challenges in the design of TEXTAL was to *We selected 5 A as a standard size for our patterns because (a) this is just about large enough to cover a single side-chain, (b) smaller regions would lead to redundancy in the database of feature-extracted regions and fewer predicted atoms per region, and (c) larger regions would contain such complexity in their density patterns that sufficiently similar matches might not be found in even the largest of databases, i.e. they might be unique.

[12]

TEXTAL system

249

˚ -spherical region of density is shown Fig. 1. Illustration of feature-based retrieval. A 5A around a histidine residue (centered on the C atom). Below it is shown a hypothetical feature vector, consisting of a list of scalar values that are a function of the density pattern in the region. This feature vector may be used to search for other regions with similar patterns of density, which would presumably have a similar profile of feature values, independent of orientation.

develop a set of rotation-invariant numeric features to capture patterns in spherical regions of electron density. Given a set of rotation-invariant features, this pattern-matching approach may be used to predict local atomic coordinates in arbitrary spherical regions throughout a density map. However, to provide some structure for the process, we subdivide the problem of map interpretation along the traditional lines of decomposition used by human crystallographers: first, we try to identify the backbone (or main-chain) of the protein, modeled as linear chains of C atoms, and then we apply the pattern-matching process to predict the coordinates of other backbone and side-chain atoms around each C. The first step is accomplished by a routine called CAPRA (for ‘‘C-Alpha Pattern Recognition Algorithm’’). We refer to the second step as LOOKUP, because of the use of a database of previously solved maps. Both routines use pattern recognition (though different techniques), and both rely centrally on the extraction of rotation-invariant features. CAPRA feeds the features into a neural network to predict likely locations 35 of C atoms in a map. Then LOOKUP is run on each consecutive C by ˚ sphere around it and using them to identify extracting features for the 5A 35

T. R. Ioerger and J. C. Sacchettini, Acta Crystallogr. D Biol. Crystallogr. 58, 2043 (2002).

250

[12]

map interpretation and refinement

the most similar region of density in the database of solved maps (with features pre-extracted from C-centered regions in maps of known structures), from which coordinates of atoms in the new map may be estimated. Methods

Overview of TEXTAL System As an automated model-building system, TEXTAL takes an electron density map as input and ultimately outputs a protein model (with atomic coordinates). There are three overall stages, depicted in Fig. 2, that TEXTAL goes through to solve an uninterpreted electron density map. The first step involves tracing the main chain, which is done by the CAPRA subsystem. The output of CAPRA is a set of C chains—a PDB file containing several chains (multiple fragments are possible, due to breaks in the main-chain density), each of which contains one atom per residue (the predicted C atoms). These C chains are fed as input into the second stage, which we call LOOKUP. During LOOKUP, TEXTAL calculates features for each region around a predicted C atom and uses these features to search for regions with similar density patterns in a database of regions from maps of known structures, whose features have been calculated offline. For each C, LOOKUP extracts the atoms of the local residue in the best-matching region (from a PDB file) and translates and rotates them into corresponding positions in the uninterpreted map. By concatenating these transformed ATOM records, LOOKUP fills out the C chains with all the additional side-chain and backbone atoms; the output of LOOKUP is a complete PDB file of the residues that could be modeled. There is a variety of inconsistencies that might need to be resolved, such as getting the independent residue predictions in a chain to agree on directionality

CAPRA

Post-Processing

LOOKUP

O

O

PDB file

OH Electron density map

Backbone (Ca chains)

Initial model

Fig. 2. Main stages of TEXTAL.

Final model

[12]

TEXTAL system

251

of the backbone, refining the coordinates of backbone atoms to draw close ˚ spacing of C atoms, and adjusting the side-chain atoms to the ideal 3.8-A to eliminate any steric conflicts. Also, another major postprocessing step involves correcting the identities of the residues by aligning chains into likely positions in the amino acid sequence of the protein, if known. Because TEXTAL is able to recognize amino acids only by the shape (size, structure) of their side chains, there is ambiguity that occasionally leads to prediction of incorrect amino acid identities in chains. By using a special amino acid similarity matrix that reflects the kinds of mistakes TEXTAL tends to make, we often can determine the exact identity of amino acids based on where those chains fit in the known sequence. This can be used to go back through the LOOKUP routine to select the best match of the correct type, producing a more accurate model. CAPRA: C-Alpha Pattern Recognition Algorithm CAPRA operates in essentially four main steps, shown in Fig. 3. First, the map is scaled to roughly 1.0 ¼ 1, which is important for making patterns comparable between different maps. Then, a trace of the map is ˚ made. The trace gives a connected skeleton of pseudo-atoms (on a 0.5-A grid) that generally goes through the center of the contours (i.e., approximating the medial axis). Note that the trace goes not only along the backbone, but also branches out into side chains (similar to the output of Bones).6 CAPRA picks a subset of the pseudo-atoms in the trace (which we refer to as ‘‘waypoints’’) that appear to represent C atoms. This central step is done in a pattern-analytic way using a neural network, described later. Finally, after deciding on a set of likely C atoms, CAPRA must link them together into chains. This is a difficult task because there are often breaks in the density along the main chain, as well as many false connections between contacting side chains in the density. CAPRA uses a combination of several heuristic search and analysis techniques to try to arrive at a set of reasonable C chains. Tracing is done in a way similar to many other skeletonization algo˚ grid rithms,36,37 as follows. First, a list is built of all lattice points on a 0.5-A throughout the map that are contained within a contour of some fixed threshold (we currently use a cutoff of around 0.7 in density). This leaves hundreds of thousands of clustered points that must be reduced to a backbone. The points are removed by an iterative process, going from worst (lowest density) to best (highest density), as long as they do not create a 36 37

J. Greer, Methods Enzymol. 115, 206 (1985). S. M. Swanson, Acta Crystallogr. D Biol. Crystallogr. 50, 695 (1994).

252

map interpretation and refinement

Electron density map

[12]

Scaling of density

Tracing of map

Predicting Ca locations (Neural network)

Ca chains

Linking Ca atoms into chains

Fig. 3. Steps within CAPRA.

local break in connectivity. This is evaluated by collecting the 26 surrounding points in a 3  3  3 box and preventing the elimination of the center point if it would create two or more (disconnected) components among its 26 neighbors. Generally, because the highest density occurs near the center of contour regions, the outer points are removed first, and maintaining connectivity becomes a factor only as the list of points is reduced nearly to a linear skeleton. What remains is roughly a few thousand pseudo-atoms, typically around 10 times as many as expected C atoms (due to the closer ˚ spacing along the backbone, and the meandering of the skeleton into 0.5-A side chains). To determine which of these pseudo-atoms are likely to represent true C atoms, CAPRA relies on pattern recognition. The goal is to learn how to associate certain characteristics in the local density pattern with an estimate of the proximity to the closest C. CAPRA uses a two-layer feedforward neural network (with 20 hidden units and sigmoid thresholds in each layer) to predict, for each pseudo-atom in the trace, how close it is likely to be to a true C (see Ioerger and Sacchettini35 for details). The inputs to the network consist of 19 feature values extracted from the region of density surrounding each pseudo-atom (Table I). The neural network is

[12]

253

TEXTAL system TABLE I Rotation-Invariant Features Used in TEXTAL to Characterize Patterns of Density in Spherical Regions

Class of features Statistical

Symmetry Moments of inertia and their ratios

Shape/geometry

Specific features Mean of density Standard deviation Skewness Kurtosis Distance to center of mass Magnitude of primary moment Magnitude of secondary moment Magnitude of tertiary moment Ratio of primary to secondary moment Ratio of primary to tertiary moment Ratio secondary to tertiary moment Min angle between density spokes Max angle between density spokes Sum of angles between density spokes

Method of calculation P ð1=nÞ Pi ½ð1=nÞ ði  Þ2 1=2 P ½ð1=nÞ ði  Þ3 1=3 P ½ð1=nÞ ði  Þ4 1=4 j < xc ; yc ; zc > j where P xc ¼ ð1=nÞ xi j , etc. Compute inertia matrix, diagonalize, sort eigenvalues

Compute 3 distinct radial vectors that have greatest local density summation

trained by giving it examples of these feature vectors for high-density lattice points at varying distances from C atoms, ranging from 0 to around ˚ , in maps of sample proteins. The weights in the network are optimized 6A on this data set, using the well-known back-propagation algorithm.38 Given these distance predictions, the set of candidate C atoms (i.e., way points) is selected as follows. All the pseudo-atoms in the trace are ranked by their predicted distance to the nearest C. Then, starting from the top (smallest predicted distance), atoms are chosen as long as they ˚ of any previously chosen atoms. This procedure has are not within 2.5 A the effect of choosing waypoints in a more-or-less random order through the protein, but the advantage is that preference is given to the pseudoatoms with the highest scores first, since these atoms are generally most likely to be near C atoms; decisions on those with lower scores are put off until later. Each trace point selected as a candidate C has the property of being locally best, in that it has the closest predicted distance to a true ˚ radius. Trace points whose preC among its neighbors within a 2.5-A ˚ are discarded, as they dicted distances to C atoms are greater than 3.0 A are highly unlikely to really be near C atoms, and are almost always found in side chains. 38

G. E. Hinton, Artif. Intell. 40, 185 (1989).

254

map interpretation and refinement

[12]

Next, the putative C atoms are linked together into linear chains using the BUILD_CHAINS routine. There are many choices on how to link the C atoms together, since the underlying connectivity of the trace forms a graph with many branches and cycles. BUILD_CHAINS relies on a variety of heuristics to help distinguish between genuine connections along the main-chain and false connections (e.g., between side-chain contacts). Possible links between C atoms are first discovered by following connected chains of trace atoms to a neighboring C candidate not too far away ˚ ), provided the path through the trace atoms does not go too (within 5A ˚ ) to another C candidate. This links the C candidates toclose (<2A gether in a way that reflects the connectivity of the underlying trace, typically forming an over-connected graph. Next, BUILD_CHAINS attempts to identify potential pieces of secondary structure, using a geometric analysis. All connected fragments of length 7 are enumerated, and they are evaluated for their ‘‘linearity’’ and their ‘‘helicity.’’ Linearity is measured by the ratio of the end-to-end distance to the sum of the lengths of the individual links, which is typically between 0.8 and 1.0 for  strands, and helicity is measured by computing the mean deviation from 95 for bond angles and þ50 for torsion angles among consecutive C atoms within an  helix.39 Finally, all this information is brought together to make heuristic decisions about which candidate C atoms to link together into chains. First, the graph is divided into connected components. Then for each separate component, a different strategy is applied based on its size. For small connected components (with up to 20 atoms), a depth-first search is used to enumerate all possible non-self-intersecting paths; they are scored with a function that reflects desirable attributes such as overall length, good neural network predictions, consistency with secondary structure, and so on, and the single path with the highest score is selected. For large connected components (with more than 20 atoms), a different strategy is used in which the links in the overconnected component are incrementally clipped down to linear chains, where no atom has more than two connections to other atoms. First, all cycles are broken at the weakest point (with worst neural network prediction), and then individual links at three-way (or more) branch-points are clipped. The links chosen for clipping are based on a similar scoring function as described previously, tending to prefer to clip links to short branches (e.g., side chains) containing atoms with poor neural net scores, and tending to avoid clipping links that fall along putative secondary structures. The procedures of CAPRA are illustrated in Fig. 4, which shows (1) the pseudo-atoms of the trace, (2) waypoints selected on the basis of locally 39

T. J. Oldfield and R. E. Hubbard, Proteins Struct. Funct. Genet. 18, 324 (1994).

[12]

TEXTAL system

255

Fig. 4. Illustrations of the main steps of the CAPRA routine. (A) Close-up of tracer points in the area of a helix in CzrA (estimated distance to true C is predicted for each of these points by a neural network). (B) Selected waypoints (in red, with locally minimum predicted ˚ ). (C) Bonds (in purple) drawn between distance to a true C, and minimum spacing of 2.5 A connected waypoints, showing links in side chains as well as main chain. (D) Final subset of connected waypoints to form linear chains (in green).

minimal distance-to-true-C predictions by the neural network, (3) linking waypoints together based on trace connectivity, and (4) determination of linearized substructure that uses geometric analysis and other information to identify plausible chains, with cycles removed and branches into side chains clipped off. LOOKUP: The Core Pattern-Matching Routine After constructing the main-chain with CAPRA, the LOOKUP routine is used to fill in the remaining backbone and side-chain atoms. LOOKUP is called individually on each C in the main chain, and the local sets of predicted atoms for each residue are concatenated to produce a complete initial model. LOOKUP takes a pattern recognition approach to predicting the local coordinates of atoms in a region. It tries to find the most similar

256

map interpretation and refinement

[12]

region in a database of previously solved maps and bases its prediction of local coordinates on the positions of atoms found in the matched region. Extraction of Features to Represent Density Patterns. The key to the matching process is the extraction of numerical features that capture and represent various aspects of the patterns of density. As mentioned before, it is important for the features to be rotation-invariant, in order to detect similarities between similar regions that might be in different orientations. Currently, for each region, TEXTAL calculates 19 different features that can be grouped into 4 general classes40 The first class of features consists of statistical measures, such as the mean, standard deviation, skewness, and kurtosis (third and fourth-order moments) of density in the region. These measures are clearly rotation-invariant; for example, a region of density would have the same standard deviation even if it were rotated in an arbitrary direction. A second class of features is based on moments of inertia. During feature extraction, TEXTAL calculates the inertia matrix and diagnoalizes it to extract the moments of inertia as eigenvalues. These reflect aspects of the symmetry and dispersion of the density in the local region. In addition to the absolute value of the moments themselves, we also compute ratios of the moments as features. Another feature, in a class by itself, is the distance to center-of-mass, which measures whether the density in the region is locally balanced or off-set. Finally, there is a class of features we call ‘‘spokes.’’ These measure geometric properties related to the shape of the density. For regions centered on C atoms, there are typically three spokes (or tubes) of density emanating from the center: two for the backbone and one for the side chain. We identify up to three (non-adjacent) directions in space where the weighted sum of the density in those directions is locally maximum, and then we measure the angles among these vectors (min, max, and sum). In some regions, the spokes sit relatively flat (coplanar), with about 120 between them, and in other regions, they form more of a pyramid or are drawn close together. The formal definitions of these features, 19 in all, can be found in [Holton et al., 2000]. While many other features are possible, we have found these to be sufficient. In addition, each feature can be calculated over different radii, ˚ , 4A ˚ , 5A ˚ and 6A ˚ , so every so they are parameterized; currently, we use 3A individual feature has four distinct versions.y All these features are rotation invariant. They are used both in CAPRA, as inputs to the neural network to predict how close a given 40

T. R. Holton, J. A. Christopher, T. R. Ioerger, and J. C. Sacchettini, Acta Crystallogr. D Biol. Crystallogr. 46, 722 (2000). y Note, these radii can each capture slightly different information about the density pattern in the surrounding spherical region, and they could have different sensitivity to noise.

[12]

TEXTAL system

257

pseudo-atom in the trace appears to be to a true C, and also in LOOKUP, to search the database of known regions for regions with similar patterns of density. Searching the Region Database Using Feature Matching. The feature vectors produced by the feature-extraction process can be used to efficiently search a large database of regions from previously-solved maps to find similar matches. LOOKUP implements this database-search process (illustrated in Figure 5). First, given the coordinates of a putative C (from CAPRA), the feature vector representing the pattern of density in the region is calculated. Then it is compared to the feature vectors of previously-solved regions in the database, and the closest match is identified. Matches are first evaluated by comparing feature vectors, i.e., to find those with minimum feature-based differences, and later by density correlation, as described below. Finally, atoms from the best-matching region are looked up in the structure (assumed to be known), and they are rotated and translated into position in the new model (around the original C) by applying the appropriate geometric transformations.

Input: Ca coordinates of new region 1. Feature extraction Feature vector = 2. Calculate weighted euclidean distance to each feature vector in database 3. Sort matches based on distance 4. Select top K regions with smallest distance 5. Evaluate density correlation of region to top K matches (optimizing rotations) 6. Re-rank top K matches by density correlation 7. Select best matching region

Database of Regions: Centered on Ca atoms in solved maps with known structures with pre-extracted feature vectors Each region consists of: PDB id, residue number, coordinates of center (Ca ), and feature values: pdb,res,, pdb,res,, pdb,res,, ...

8. Retrieve coordinates for atoms in matched region from database (PDB files) 9. Rotate and translate atoms into position in the new map Output: Predicted coordinates of atoms in new region

Fig. 5. The LOOKUP process.

258

map interpretation and refinement

[12]

The database of feature-extracted regions that TEXTAL uses is derived from back-transformed maps. These are generated by calculating structure factors (by Fourier transform) from a known protein structure, and then inverse-Fourier transforming them to compute the map, using ˚ (to simulate medium-resolution maps). The only reflections up to 2.8 A proteins consist of a subset of 200 proteins from PDBSelect,41,42 which is a representative set of nonredundant, high-resolution structures (with no more than 25% pairwise homology among them). Since these backtransformed maps have minimal phase error, they contain the ideal density patterns around common small-scale motifs found in real proteins (including many variations in side-chain and backbone conformations, contacts, ˚ temperature factors, occupancies, etc.). Features are calculated for a 5-A spherical region around each C atom in each protein structure for which we generated a map, producing a database with 50,000 regions. The feature-based differences between regions are measured by computing a weighted Euclidean distance between their feature vectors. The formula for the weighted Euclidean distance between two feature vectors is given as: distðR1 ; R2 Þ ¼

rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi X wi ðFi ðR1 Þ  Fi ðR2 ÞÞ2 i

where the Fs are the features and the Rs are the regions. The larger the difference between individual feature values is, the larger the overall distance will be. Regions that have similar patterns of density should have similar feature values, and hence a low feature-based distance score. During the database search, LOOKUP computes the distance between the probe (unknown) region and every (known) region in the database, and keeps a list of the top K ¼ 400 matches with least distance, representing potential matches. The weights wi can be used to normalize the feature differences onto a uniform scale, so larger values do not dominate unfairly. Furthermore, the weights may be biased to give higher emphasis to features that are more relevant, and reduce the influence of less-relevant or noisy features. A novel algorithm for weighting features in this way, called SLIDER, is described in Holton et al.40 and Gopal et al.43 The calculation of feature-based distance, however, is not always sufficient. For example, while two regions that have similar patterns are 41

U. Hobohm, M. Scharf, R. Schneider, and C. Sander, Protein Sci. 1, 409 (1992). U. Hobohm and C. Sander, Protein Sci. 3, 522 (1994). 43 K. Gopal, R. Pai, T.R. Ioerger, T. Romo, Sacchettini J.C. ‘‘Proceedings of the 15th Conference on Innovative Applications of Artificial Intelligence,’’ pp. 93–100 (2003). 42

[12]

TEXTAL system

259

expected to have similar features values, it cannot be guaranteed that regions with different patterns will necessarily have different feature values. Therefore, there could be some spurious matches to regions that are not truly similar. Hence we use this selection initially as a filter, to catch at least some similar regions. Then we must follow this up with a more computationally expensive step of evaluating the candidate matches further by calculating density correlation (this is a time-consuming operation that involves searching for the optimal rotation between the two regions; see Holton et al.40 for details). This is done for all the top K matches. They are then reranked according to their density correlation score. This gives a more accurate estimate of similarity, but because of the slowness of the calculation and size of the database, we cannot afford to run it on all possible regions in the database. Once LOOKUP has identified the region in the database that appears to have the greatest similarity to the region we are trying to model, the final step is to retrieve the coordinates of atoms from the known structure for the map from which the matching region was derived, specifically, the local side-chain and backbone atoms of the residue whose C is at the center of the region, and apply the appropriate transformations to place them into position in the new map. The transformation of the coordinates of each atom happens as follows: each atom from the database region is translated to the vicinity of the origin by subtracting the coordinates of the center of the region, and then the rotation matrix that gave the highest density correlation for the match is applied to put it into the appropriate orientation for the new region. Finally, the predicted atom is translated into the new region by adding the coordinates of its center. This procedure is repeated for all the atoms in each region surrounding a C atom identified by CAPRA. The resulting side-chain and backbone atoms are written out in the form of ATOM records in a new PDB file, which constitutes the initial, unrefined model generated by TEXTAL for the map. Postprocessing Once a complete initial model has been constructed (through CAPRA and LOOKUP), the third stage of TEXTAL consists of postprocessing. There are a number of postprocessing procedures that can be applied to help reduce imperfections in the initial model and improve the accuracy of the final model. The first postprocessing step is a simple routine to fix ‘‘flipped’’ residues, that is, residues whose backbone atoms are going in the wrong direction with respect to their neighbors. After the backbone atoms have been

260

map interpretation and refinement

[12]

retrieved by LOOKUP for all the C atoms in a chain, a calculation of overall directionality for the chain is performed. For each residue i in the chain (except on the termini), the angle between the Ci–Ci vector and the Ci–Ci þ 1 vector is calculated. If the backbone for residue i is pointing ‘‘forward,’’ angle(Ci, Ci, Ci þ 1) should be near 0 (we use a threshold of <45 ). If it is pointing ‘‘backward,’’ then angle(Ci, Ci, Ci  1) should be near 0. Similarly, directionality can be estimated by computing the angle between Ci, Ni, and Ci  1 (near 0 means pointing forward) and to Ci þ 1 (near 0 means pointing backward). These angles are computed all the way up the chain, and a majority vote on the most common direction is taken. Then a sweep is made back through the individual residues in the chain; if they are determined to be going in reverse of the selected direction, subsequent matches in the list of candidate regions retrieved during LOOKUP are scanned (in order of highest density correlation down) until a match is found whose backbone atoms are going in a direction consistent with the rest of the chain. The second postprocessing step is real-space refinement.44–47 TEXTAL employs a simple form of real-space refinement (with coordinates only, not torsion angles). This procedure is not powerful enough to fix all potential problems in the model, but does an adequate job of adjusting the spacing of backbone atoms between adjacent residues in the same chain. The procedure tries 20 random perturbations of the coordinates of each atom up to ˚ from its current position, and picks the one that appears to improve 0.1 A the local energy the most. The local energy function is based on the terms of the global energy function to which a specific atom contributes. The terms in the global energy function include local density of each atom (interpolated from the map), bond distance constraints (quadratic function of the deviation from the optimal distance), bond angle constraints, torsion angle constraints (mainly for planarity of aromatic side chains and peptide bonds in the backbone), and steric contacts (increases exponentially if distance between atoms drops below sum of van der Waals radii). After local perturbations to the atomic coordinates are made, a check of the global energy is made, and the new configuration is accepted only if the global energy has decreased. This cycle is repeated up to 100 times. While initially there is no constraint on the connection between backbone atoms at adjacent residues (as they are modeled independently by LOOKUP), this procedure now pulls the Ni and Ci þ 1 atoms together to 44

R. Diamond, Acta Crystallogr. A 27, 436 (1971). M. S. Chapman, Acta Crystallogr. A 51, 69 (1995). 46 T. J. Oldfield, Acta Crystallogr. D Biol. Crystallogr. 57, 82 (2001). 47 D. E. Tronrud, Methods Enzymol. 277, 306 (1997). 45

[12]

TEXTAL system

261

a reasonable bond distance, rotates the atoms so the peptide bond is planar, ˚ and even tends to pull the neighboring C atoms to within about 3.8 A distance spacing all the way down the chain. Other real-space refinement routines, such as in TNT,45,47 could also conceivably be used. The third postprocessing step, which we are currently in the process of implementing, involves correcting the identities of mislabeled amino acids. Recall that, since TEXTAL models side chains only on the basis of local patterns in the electron density, it cannot always determine the exact identity of the amino acid (e.g., due to isoforms such as valine and threonine), and occasionally even predicts slightly smaller or larger residues due to noise perturbing the local density pattern. However, most of the time TEXTAL outputs a residue that is at least structurally similar to the correct residue. We can correct mistakes about residue identities using knowledge of the true amino acid sequence of the protein if we know how the predicted fragment mapped into this sequence. The idea is to use sequence alignment techniques to determine where each fragment maps into the true sequence; then the correct identities of each amino acid can be determined, and another scan through the list of candidates returned by LOOKUP can be used to replace the side chains with an amino acid of the correct type at each position. We use a gapped alignment algorithm,48 since TEXTAL occasionally adds or skips an extra C in a chain, appearing as an insertion or deletion. However, the gap parameters (gap-open and gap-extension penalties) must be specially optimized for TEXTAL, since the length distribution of gaps is different from that expected from evolutionarily related sequences. Preliminary experiments have shown that the accuracy of alignment of chains is sensitive to length (longer chains are easier to align), and that the alignments can be improved by using a special amino acid similarity matrix that reflects the types of mistakes TEXTAL tends to make (based on empirical statistics of predicting one amino acid for another; TEXTAL tends to confuse residues with similar size and shape, although chemical differences are irrelevant, so traditional similarity matrices, e.g., PAM or BLOSUM, could not be used). Further testing of this alignment method is in progress. Results

Here we report the results of evaluating TEXTAL in detail on two experimental electron density maps. The first protein is CzrA, a dimeric DNA-binding transcription factor.49 CzrA contains 105 amino acids per 48 49

T. F. Smith and M. S. Waterman, J. Mol. Biol. 147, 195 (1981). Eicken et al., in preparation (2003).

262

map interpretation and refinement

[12]

subunit, and is composed of four  helices (10 residues on the C and N termini were disordered and could not be built or refined, so the effective size is 95 residues). Three wavelengths of diffraction data were collected at the BioCARS beamline and phased by the MAD method at a resolution of ˚ . After building an initial model, similarity to another protein in the 2.8 A ˚ by PDB was discovered, and the final structure was then solved at 2.3 A molecular replacement. The final R-factor and Rfree values for CzrA were 0.195 and 0.249. The second protein is mevalonate kinase (MVK), an enzyme involved in isoprene biosynthesis.50 It contains 317 amino acids composed primarily of large  sheets, with a few  helices packed around the outside. The data, also collected at three wavelengths at BioCARS, ˚ . Both were phased with selenomethionine MAD at a resolution of 2.4 A maps were solvent-flattened with DM,51 manually built using O, and refined with CNS.5 The final R-factor and Rfree values for MVK were 0.197 and 0.282. To test the ability of TEXTAL to build models for medium-resolution ˚ maps were generated from both data sets by truncating the maps, 2.8-A structure factors. The results presented below are from running TEXTAL ˚ maps. In addition, we extended the boundaries of each map on these 2.8-A ˚ border on each to cover one complete monomer (with an additional 5-A side), so the chain could be traced in its entirety (i.e., to avoid having a border of the map cut the molecule into multiple pieces appearing in separate parts of the asymmetric unit). Tracer Results The first steps of CAPRA involve scaling and tracing the maps. The maps must first be scaled uniformly to make the density patterns comparable between the target map and database of previously solved maps; the outcome is that the maps are scaled roughly to a level similar to 1.0 ¼ 1. The algorithm adjusts the scale slightly to negate the effect of different solvent proportions. The trace was constructed using discrete lattice points ˚ grid. The resulting pseudo-atoms (tracer points) follow along on a 0.5-A the centers of the ‘‘tubes’’ of density and have a spacing of roughly 0.5– ˚ , depending on the direction. The number of tracer points for each 0.9 A map is shown in Table II.

50

D. Yang, L. W. Shipman, C. A. Roessner, A. I. Scott, and J. C. Sacchettini, J. Biol. Chem. 277, 11559 (2002). 51 K. Cowtan, Joint CCP4 and ESF-EACBM Newsletter on Protein Crystallography 31, 526 (1994).

[12]

263

TEXTAL system TABLE II Statistics of Tracer Output

Protein

Dimensions ˚ 3) of map (A

Volume as % of ASU

CZRA MVK

56  58  39 42  54  78

65% 163%

No. of tracer points

No. of tracer points ˚ of within 3 A structure

No. of true C atoms in structure

Ratio of tracer points to true C

3459 8891

910 2784

95 317

9.6 8.8

Fig. 6. Examples of Tracer output for: (A) a helix in CZRA, and (B) three strands of a  sheet in MVK (0.7 contour shown in yellow).

As Table II shows, the Tracer program greatly compresses the information content of an electron density map to on the order of a few thousand tracer points. We find that, on average, there are around 9 tracer atoms per residue (Fig. 6). The trace forms a compact representation of the density in the map. The core CAPRA routine, described next, takes advantage of this simplified representation by hypothesizing that any true C atom will be located near the trace, and in fact CAPRA picks the best local estimates of C positions directly from the trace itself (i.e., the waypoints). CAPRA Results Next, we examine the operation of the core CAPRA algorithm, which in the end produces linear chains of predicted C coordinates. There are four substeps within this routine: 1. Prediction of the estimated distance of each trace point to a true C atom (using a neural net) 2. Selection of waypoints among the tracer atoms

264

map interpretation and refinement

[12]

3. Discovery of the ‘‘structural framework’’ of the protein, consisting of short fragments of connected waypoints that are either linear or helical 4. Linking the structural fragments together into linear chains Figure 7 shows the correlation between predicted and true distances between pseudo-atoms in the trace of CzrA and true C atoms in the manually built structure. This scatter plot illustrates that, in general, trace atoms that are farther away from true C atoms tend to have higher distances predicted by the neural network (based on using rotation-invariant features of the local density pattern as input). The predictions are not perfect, but there is a clear general trend. This is sufficient to enable CAPRA to pick an initial set of waypoints out of the trace (one per residue) that are likely to be near true C atoms, as a basis for subsequent steps. Note that, at ˚ grid; but this point, the waypoints are restricted to the artificial 0.5-A real-space refinement, applied at the end of the whole TEXTAL process, affords an opportunity to adjust the coordinates of the atoms to improve satisfaction of geometric constraints along the backbone, along with the fit to the density. Given these predictions, the rest of the steps in CAPRA do a nice job of creating long C chains that consistently follow the backbone, excluding breaks in the density. The paths and connectivity that it chooses are often

Fig. 7. Correlation between distances predicted by the neural net and true distances from each waypoint to the nearest true C in CzrA.

[12]

TEXTAL system

265

visually consistent with the underlying structure, only rarely traversing false connections through side-chain contacts. It produces candidate C ˚ apart on average (although ranging from atoms spaced roughly 3.8 A ˚ 2.5 up to 5 A), and corresponding nearly one-to-one with true C atoms, leaving only a few skips or spurious insertions. For example, on a back˚ map (with perfect structure factors calculated from the transformed 2.8-A known model), CAPRA was able to identify a single chain of 135 residues in length for 3NLL, an / protein with 137 residues. This accuracy is probably due to the fact that, even if one of the lattice points in the trace near a true C is mispredicted by the neural net to have a high distance, often this is compensated for by another trace point nearby that is correctly predicted. In between C atoms, and also off in side chains, the predicted distance of trace points to C atoms increases roughly as expected. Figure 8 shows stereo views of the overall output of the CAPRA algorithm (C chains) for CzrA, along with the C trace of the manually built and refined model for comparison. Significant secondary structures can easily be seen to be captured, including several  helices. When the predicted C chains in the model for CzrA are trimmed down to just those that cover the true structure (monomer), CAPRA is found to output 2 chains of length 65 and 25 residues that cover 94% of the molecule (90 of 95 residues). The break between the chains occurs in the region of a small -hairpin loop that has weak density. For MVK, CAPRA outputs 10 chains covering 91% of the molecule (287 of 317 residues). Except for a short chain of length 6, the chains ranged in length from 19 to 57 residues, with a mean of 28.7. The breaks tend to occur in loop regions, rather than in regular secondary structures like helices and strands. The only significant part of the molecule that was not built was a solvent-exposed helix plus adjacent turns totaling 18 consecutive residues; this was due to weak density that broke up the connectivity in this region. The RMSD (root–mean–square deviation) scores for the predicted C coordinates in the CAPRA chains, compared with their closest neighbors ˚ for CzrA and MVK, rein the manually built models, were 0.68 and 0.84 A spectively. (These RMSD scores were calculated from the final TEXTAL models, which includes real-space refinement.) LOOKUP Results Given the C chains from CAPRA as input, the side-chain coordinates predicted by LOOKUP matched the local density patterns very well, and the additional (non-C) atoms in the backbone were also fit very well. In many cases, TEXTAL placed carbonyl oxygens in nearly the correct

266

map interpretation and refinement

[12]

Fig. 8. CAPRA chains for CzrA (in green), with manually-built model superimposed (in purple). (All stereo views in this paper are rendered as ‘‘cross-eyed’’ stereo. All molecular images in this paper were made with the computer graphics program, Spock, written by Dr. Jon. A. Christopher, http://quorum.tamu.edu.)

position, based purely on the shape of the local density around the backbone. For example, in CzrA, 65 of 90 carbonyl oxygens were predicted ˚ of their correct position, and 144 of 287 carbonyl oxygens in within 1 A ˚ of their correct position. On the basis of MVK were predicted within 1 A placement of carbonyls, TEXTAL also correctly determined the directionality of both predicted chains in CzrA and all 10 chains in MVK (except the shortest one of length 6). The all-atom RMSD scores for the TEXTAL models built for CzrA ˚ , respectively, relative to the manually built and MVK were 0.88 and 1.00 A and refined models (see Table III). Because LOOKUP does not know the true identities of the amino acids being modeled at this stage, it sometimes predicts a residue chemically different from the one in the true model at a given location. Hence, there is not necessarily a one-to-one correspondence between atoms in the TEXTAL-generated and true models. Therefore, these RMSD scores were calculated by pairing up atoms between two models, choosing closest pairs first, and excluding subsequent atoms from matching with atoms already selected for existing pairs. The RMSD scores are the root–mean–square averages over these pairwise (nearest ˚ (to reduce the impact of unmoneighbor) distances, up to a cutoff of 3.0 A deled regions). To better illustrate the meaning of these RMSD scores, histograms of the individual pairwise atomic distance distributions for both CzrA and MVK are shown in Fig. 9. It is clear that the majority of atoms in ˚ of a mate in the manually built structure. the CzrA model are within 1 A The distribution of pairwise distances is similar for MVK.

[12]

267

TEXTAL system TABLE III RMS Scores for TEXTAL Models Compared with Models Built Manually Protein

CzrA

MVK

C RMS (backbone) All-atom RMS (closest pairs)

˚ 0.68 A ˚ 0.88 A

˚ 0.84 A ˚ 1.00 A

Fig. 9. Histograms of pairwise distances between atoms in TEXTAL-generated models ˚ buckets. and nearest neighbors in true structures for CzrA and MVK, shown in 0.2-A

On average, the matches for each region (around each C) had a local density correlation of 0.85 for CzrA and 0.81 for MVK. This indicates that the LOOKUP process was able to discover reasonably similar matches in the database (first filtered by feature-matching, and then further evaluated and ranked by density correlation). However, LOOKUP does not always predict the amino acid identity in each position correctly; the accuracy, in terms of percent identity, was only 26.7 and 19.3% for CzrA and MVK, respectively. Part of the reason why amino acids are not recognized more precisely is because the only information the method has available at this stage is the local pattern of the density, and many residues have similar or even identical structures. Nonetheless, LOOKUP often chooses a structurally similar residue. We divided the amino acids into the following seven categories of similarly sized residues—{AG} {CS} {P} {VTDNLI} {QEMH} {FWY} {RK}{—and computed the frequency with which the TEXTAL {

Note that structural similarity is quite different from the usual biochemical notion of amino acid similarity for the purposes of estimating homology—polarity is irrelevant since it cannot be directly observed in low-resolution electron density patterns; so residues like Thr and Val are considered to match, for example.

268

map interpretation and refinement

[12]

model had a residue that was in the same category as the residue in the true structure. For CzrA, this structural similarity score was 54.4%, and for MVK it was 46.4%. So LOOKUP is clearly being influenced by the patterns of density and inserting residues of sizes similar to those in the true structure. One of the reasons that the similarity scores are not closer to 100% may be that the density for many of the side chains of residues on the surface that project out into solvent has been truncated, due to disorder or possibly solvent-flattening, making them appear smaller than they really are. Several close-up examples of the models built by TEXTAL are shown in Figs. 10 and 11. In the first figure, a fragment of a helix and turn in CzrA is shown. TEXTAL has done a nice job of building a model to fit the density (i.e., predicting reasonable coordinates for backbone and side-chains) in comparison to the true structure. Notice how well even the carbonyl oxygens in the backbone are modeled; this was accomplished purely by pattern-matching (retrieval of regions with similar patterns of density), though the initial placement of cabonyls by this process is improved significantly by real-space refinement in postprocessing. In Fig. 11, a region of a  sheet in the core of the MVK is shown. Figures 10 and 11 reflect visually the level of accuracy of protein models constructed by TEXTAL from electron density maps.

Fig. 10. Stereoview of a helix and turn in CzrA, contoured at around 0.7. The TEXTAL model is in red; the true (refined) structure is in blue.

[12]

TEXTAL system

269

Fig. 11. Stereoview of TEXTAL model (red) superimposed on manually built model of MVK (blue) for a buried portion of a  sheet.

Discussion

It is important to remember that TEXTAL is based on pattern recognition – this is both its strength and weakness. The analysis of local patterns of electron density allows TEXTAL to make complex predictions of atomic coordinates in a simple way by analogy (database lookup), without requiring commitment to or understanding of a general model of the relationship between scatterers and density. Furthermore, since the features capture such general aspects of the electron density pattern (e.g., symmetry, geometry), TEXTAL is relatively insensitive to resolution, and we ˚ . Howhave obtained good results on maps with resolution as poor as 3.1A ever, in places where noise perturbs the density (e.g., truncated side chains, breaks in the main chain), TEXTAL is often limited because it cannot recognize anything meaningful. While TEXTAL still makes occasional mistakes, many of these are due to poor density, where even a human crystallographer would have a difficult time building in atomic coordinates. The next major extension of TEXTAL’s capabilities is to integrate model building with phase refinement. It is possible that, by building partial models for a few interpretable fragments (or even a polyalanine backbone) in a map with poor density, either standard phase combination (e.g., SigmaA52), reciprocal-space refinement techniques (as in Refmac26 or

270

map interpretation and refinement

[12]

CNS5), or more recent statistical density modification methods27 could be used to improve phase estimates, thereby generating a new map with higher-quality density. This process could then be iterated by applying another round of model building, and so on. However, the threshold in terms of phase error at which TEXTAL starts working well (producing reasonable models), and the magnitude of the improvements in the accuracy of phases by this approach, are still under investigation. The model-building capabilities of TEXTAL also could be potentially used to facilitate other methods, such as determining a mask for solvent-flattening, and automatically detecting noncrystallographic symmetry. All these possibilities can be explored when TEXTAL is integrated as the model-building component into the PHENIX environment28 which will provide a common interface, scripting mechanism, and data format interconversion routines for running these kinds of experiments. There are many additional ideas that can or are being tested to improve its accuracy, such as adding new features or clustering the database. But even in its current state, TEXTAL can be a great time-saving benefit to crystallographers, by automatically building accurate models for electron density maps. Currently, access to TEXTAL is being provided through a Web site (http://textal.tamu.edu:12321), where maps are uploaded and processed on our server. For medium-sized electron density maps, TEXTAL currently takes roughly 1–6 h§ to build a model, including 10–30 min for CAPRA to build the backbone C chains, about 1–3 h to run LOOKUP (for modeling the side chains), the rest of the time for postprocessing (sequence alignment, real-space refinement). Acknowledgments This work was supported in part by grant PO1-GM63210 from the National Institutes of Health.

52 §

R. J. Read, Acta Crystallogr. A 42, 140 (1986). These estimates are based on running on one 400-MHz processor of an SGI Origin 2000.