JOURNAL
OF MAGNETIC
RESONANCE
82,467-482 (1989)
Programs for Computer-Assisted CHARLES School qfpharmacy,
Sequential Assignment of Proteins
D. EADSAND IRWIN D. KUNTZ
Department of Pharmaceutical Chemistry, University ofCal$ornia San Francisco, California 94143
Received June 22, 1988; revised August 30, 1988 A set of computer programs for assisting in the sequential assignment of proton chemical shifts in proteins is presented. These programs use peak coordinates and intensities directly as input. The major tasks performed by the programs include (1) use of scalar coupling information to find N, C-LX,and C-0 proton chemical shifts for the spin systems: (2) use of nuclear Overhauser effect data for the identification of pairs of spin systems which may be adjacent in the primary sequence; and (3) assembly of long lists of spin systems which are sequential in the primary sequence.The output of these programs may be combined with knowledge of spin system types obtained manually to quickly and unambiguously assign the majority of the spin systems. Operation of the programs on a test data set acquired for a mutant of bovine pancreatic trypsin inhibitor correctly placed 4 8 % of the spin systems with finger print peaks in sequential lists of length 3 or longer, and lists of possible next and previous spin systems were created for most of the other spin systems. By combining knowledge of the spin system types (determined manually) with the program output, it was possible to assign unambiguously 8 6 % of the spin systems. 0 1989 Academic Press, Inc. INTRODUCTION
Two-dimensional NMR spectroscopy is now a well-established technique for determining the structure of small protein m o lecules in solution. G iven a suitable set of spectra, the data analysis proceeds in several stages(I). F irst, the chemical shifts of the m a jority of protons in the protein must be determined. This is usually carried out using the “sequential assignment” @A) strategy (2, 3). This procedure determines spin system types using through-bond connectivity first and then organizes the spin systems sequentially using nuclear Overhauser effect data. Recently a “m a in chain directed” (MCD) strategy has been reported (4). In the MCD approach the sequential relationship among N, C-LYand C-/3 protons is established first according to their presencein secondary structure, and then specific side-chain identification is carried out. G iven the sequential assignments, lim its on the distances between pairs of protons in the protein are obtained by assigning other cross peaks in the NOESY’ spectrum. F inally, this distance information is used to construct a m o d e l of the protein structure, using, for example, the distance-geometry algorithm (5). I W e use the usual acronyms for common NMR experiments, NOESY meaning nuclear Overhauser effect spectroscopy, and COSY meaning correlation spectroscopy. See Ref. (I) for details and original references.Other experiments are discussed in the text as needed.
467
0022-2364/89 $3.00 Copyright 0 1989 by Academic Press, inc. All rights ofreproduction in any form reserved
468
EADS AND
KUNTZ
Arguably, the most tedious and time-consuming stage of the structure determination process is the assignment of chemical shifts. The source of this difficulty is not that the logic of the procedure is complex, but rather that the quantity of data involved is large and potentially ambiguous. This combination of simple logic and large data sets makes the chemical-shift assignment problem suitable for computer automation, and such software could conceivably save the spectroscopist a great deal of time. A second reason for developing automatic assignment software is to aid in the problem of keeping track of the decisions and operations carried out during the assignment process, so that the work may be easily checked or reproduced by other investigators. Due to the nonlinear nature of the assignment process, detailed bookkeeping during a manual assignment is tedious and distracting. In the limit of a reliable, totally automated assignment procedure, specification of the input and the assignment algorithm will allow the assignment to be reproduced, so that no manual bookkeeping at all would be required. A semiautomatic approach represents a middle ground. Input and output of programs which handle certain aspects of the procedure would provide “snap shots” of the process. This should be of value in documentation of assignments. In this paper we present algorithms for automation of several aspects of the assignment process. Previous work from our laboratory has produced a program which applies the SA logic to an input consisting of spin system types and a characterization of the possible connectivities between them (6). This approach has proved quite useful in keeping track of all possibilities for the assignment of spin systems for the relatively large protein, a-bungarotoxin (7). Furthermore, this program detected logical inconsistencies in the sequence of spin systems determined by NMR compared to the primary sequence of the protein determined by classical sequencing techniques, and this led to corrections of the published primary sequence (8). In order to generate the input for this program, the spectroscopist must evaluate manually the spin system types and search the NOESY spectrum for cross peaks among all pairs of protons (6). One could reduce the human effort required for this approach by dealing directly and automatically with peak coordinates and intensities. Consequently, the programs described here use peak coordinates and intensities as input. Our approach to automating the assignment process is based on the sequential assignment logic (.3), but also includes features introduced by Englander and Wand (4). We will briefly summarize the two approaches. STRATEGIES
FOR ASSIGNMENT
The first step of the sequential assignment strategy is to identify the spin systems, which are networks of mutually scalar-coupled protons. Because the peptide bond interrupts the scalar-coupling network, each spin system corresponds to an amino acid. Several amino acid types have unique scalar-coupling patterns, while others fall into groups having indistinguishable scalar-coupling patterns. In the end, each spin system identified spectroscopically may be associated with a group of one or more amino acid types. Based on a structural and statistical analysis of a number of proteins
COMPUTER-ASSISTED
SEQUENTIAL
PROTEIN ASSIGNMENT
469
(2) it is likely that the N, C-cu,and C-/3 protons of an amino acid residue will be close in space to the N proton of the next amino acid in the primary sequence. Thus, the strategy of sequential assignments is to look for NOESY cross peaks between N, c‘(Y,and C-/3 protons of one spin system with the N proton of another. If such peaks are found, then one may assume, with a well-defined probability of being correct (2). that the two spin systems are sequential. If one builds lists of sequential spin systems of known types and compares these lists to the known primary sequence, it is possible to assign the spin systems to their corresponding amino acids in the sequence. Ambiguities can arise for two reasons. First, the relevant proton pairs can in some casesbe close in space and not be sequential in the primary sequence. Second, chemical-shift degeneracies can prohibit assigning a cross peak uniquely to a spin system. In both cases the ambiguities are resolved by constraints imposed by the primary sequence on the types of spin systems which may be sequential, or by acquiring spectra under different conditions, which can resolve the chemical-shift degeneracies. In the MCD strategy, Englander and Wand (4) assign spin systems to various elements of secondary structure using only connections between N, C-a, and C-/3 protons, that is, without regard to spin system type. These authors have described characteristic patterns of connectivity for the major secondary structure elements and analyzed the frequency and fidelity of their occurrence in a number of proteins. It is only after the spin systems are put in long sequential lists that their types are considered and identified with the primary sequence. An advantage of this approach is that it is easier to find a unique position in the sequence of a long list of sequential residues given the identity of a few of its spin systems than it is with a short list. When applying the sequential assignment strategy, the purpose of searching for NOESY cross peaks is to establish sequential connectivity, and this does not require knowledge of the spin systems beyond the /3 protons. The purpose of tracing the spin systems is to establish correspondence with the primary sequence and resolve ambiguities in the sequential connectivity. These two operations are logic& indepcndent. We may therefore choose to first trace the spin systems only out to the ij protons, then look for NOESY cross peaks, and finally finish tracing the spin systems and establish correspondence with the primary sequence. There is little or no advan tage to doing things in this order when analyzing data by hand, but there is a real advantage to this strategy using an automatic approach, because an effective computer program for tracing spin systems through N, C-LY,and C-p protons may be readily implemented. Therefore, in the present work we shall adopt the sequential assignment strategy of Wiithrich and co-workers for the purpose of establishing sequential connectivity among spin systems. However, regarding the determination of spin system type we shall adopt the philosophy of Englander and Wand in that this step will be delayed until after sequential lists are established. We may summarize the strategy incorporated in our algorithms. First, trace the spin systems out to the ,L3protons. Next, look for NOESY cross peaks between the relevant protons. Then, create lists of sequential spin systems. The final aspect, tracing the spin systems beyond the p protons and establishing correspondence with the primary sequence, is still left to the spectroscopist but is greatly facilitated by the output of the programs.
470
EADS AND IMPLEMENTATION
KUNTZ AND
TESTING
The tasks of finding p protons, looking for connectivity between spin systems, and linking sequential spin systems into lists have each been implemented as separate programs. The programs communicate by generating output files which are suitable as input for the next program in the list. These files are also readable and easily interpretable by the user. This allows the programs to be run in a semiautomatic mode because the user may inspect and edit the program output, or create input at any stage of the process. In this way, the user may incorporate logic and information not available to the programs into the assignment process. Five files are required as input to the programs: (1) a file containing the coordinates of the peaks in the fingerprint region of the COSY spectrum; (2) a file containing arbitrary labels for these peaks, i.e., spin system names; (3) a file containing the coordinates of all other COSY peaks; (4) a file containing the coordinates of the HOHAHA peaks; and (5) a file containing the coordinates and intensities of NOESY peaks in the region where cross peaks with the N protons may be found. The information in these files and the sequential assignment logic are used by the programs to generate lists of spin system names in an order corresponding to their order in the primary sequence. To illustrate and test the operation of each of the programs, a sample data set has been acquired. This consists of the peak coordinates from appropriate spectra acquired on a mutant of bovine pancreatic trypsin inhibitor (BPTI) in which the Cys 30 and Cys 5 1 residues have been replaced by Ala residues. The protein was the kind gift of Drs. M. Hurle and S. Anderson of Genentech, Inc. Details of the spectral acquisition will be published along with the structure analysis at a later time. This spectrum was easy to assign by hand because most chemical shifts are very similar to those published for the wild-type protein under identical conditions (9). It is advantageous to use a real data set to test the programs because such a set contains the overlaps, omissions, and spurious peaks that would not be found in a fabricated data set. To be useful to the experimentalist, the programs must be tolerant of these real problems in the input data. Fabricated data sets have been used to test certain logical operations of the programs. Files containing peak coordinates and intensities were generated from the experimental data using an interactive peak picking program developed in our laboratory by Dr. Sandasivam Manogaran. In this program, the user identifies peaks by eye from a contour plot displayed on a graphics terminal and selects the peaks by creating a rectangle around each peak using cursor-oriented commands. The program calculates the coordinates of the center of gravity of the points within the rectangle and takes these as the peak coordinates. The integrated intensity (in arbitrary units) is also calculated for the points within the rectangle. The names chosen for the fingerprint peaks are those of the corresponding amino acids. They were given their correct names in the test case to facilitate evaluation of the program performance, although knowledge of their identities was not used in any of the automatic procedures. We emphasize that within the program the spin system names are arbitrary.
COMPUTER-ASSISTED FINDING
SEQUENTIAL
PROTEIN ASSIGNMENT
4’7 1
N, C-a, AND C-0 PROTONS
The algorithm described here automatically traces spin systems out to the /3 protons. Because of the unique locations and large dispersions in the chemical shifts of N and C-a protons, their COSY cross peaks occur in a relatively isolated region of the spectrum known as the fingerprint region (I). Peaks in this region are generally very well resolved, so that they serve as convenient starting points in tracing the spin systems. The coordinates of these cross peaks give directly the N and C-a proton chemical shifts for the corresponding spin systems. To find the p protons, one needs only to find cross peaks with the (Yprotons already identified. Several approaches to identifying the B protons are imaginable. One m ight look directly in the COSY spectrum for cross peaks involving the LYproton of interest. However, there is often overlap in the (Yproton chemical shifts and unique association of cy and /3 protons can be difficult. Another approach involves the relay-type experiments, in which coherence is transferred among several spins in the spin system. For example, in the RELAY experiment (IO), one would observe coherence transfer from the N through the C-cuto the C-p protons, and thus observe cross peaks between the N and the C-ol protons as well as between the N and the C-p protons in the same spin system. Peaks occurring in the RELAY experiment which do not occur in the COSY would be ascribable to /I protons. In our laboratory, we find that the homonuclear Hartmann-Hahn (II, 12) (HOHAHA) experiment reliably gives excellent sensitivity. HOHAHA spectra show cross peaks among all pairs of spins in a spin system. Thus, one would observe N proton cross peaks with the (Y, & y, etc., protons of the spin system. These sets of peaks appear as “ladders” above the N proton of the spin system. The problem in this case is to identify the members of the ladder which represent p protons. This can be determined by looking in the COSY spectrum for cross peaks between the (Ychemical shifts and the shifts of the members of the HOHAHA ladder. Using the HOHAHA ladder in this way virtually eliminates the problems of N chemical-shift degeneracy that would be encountered by a direct search of the COSY data. For example, consider the case in which two spin systems share the same C-tx proton chemical shift but have different N proton chemical shifts. For each of these spin systems one would test each member of the ladder for being a 6 proton. Once an cu-/3COSY cross peak is found, it is known immediately which spin system it belongs to by virtue of its appearance in one HOHAHA ladder and not the other. By looking directly in the COSY spectrum one would not be able to make this distinction. Similarly, consider the case in which two spin systems have the same HN chemical shift, but different C-a chemical shifts. In this case, the two HOHAHA ladders will be in line, but when an LY+ cross peak is found, it can be ascribed to the appropriate spin system by virtue of the chemical shift of the (Yproton with which it joins. Difficulties arise only when both the N and the C-cuprotons are the same in both spin systems. The algorithm for this approach, called NAB (HN, HC-alpha, He-beta), is given in Fig. 1. The program uses the fingerprint, spin system name, nonfmgerprint COSY. and HOHAHA files. The user must also specify a window size, w, in units of parts
472
EADS AND for each fingerprint
KUNTZ
peak
find the corresponding
HOHAHA ladder
for each member of the HOHAHA ladder search the COSY spectrum
for a cross peak between
the alpha and HOHAHA shifts
if found, enter this as a beta proton for this spin system end for end for FIG. 1. NAB algorithm.
per million. The program must search the peak files for peaks occurring at certain positions. The window size provides a criterion for determining if a peak should be considered as occurring at that position. A peak p is said to lie within the window of a target t if the length of the vector tp drawn in the spectral plane joining t and p is less than W. For reasons discussed above, overlap and degeneracy are not much of a problem for our test data set, so a window as large as 0.020 ppm can be used. For larger proteins, a smaller window size will be required. One can envision situations in which the NAB algorithm will create ambiguous output. For glycine residues, which often have two fingerprint peaks, the NAB algorithm will list two spin systems, each with the same N proton chemical shift but with interchanged C-(Y and C-p proton chemical shifts. In most cases, this can be immediately recognized by the user, and the redundant entry may be edited by hand. The distinction between C-a and C-/3 protons is irrelevant to the SA strategy, so listing the second C-a proton in glycine residues as a C-p proton causes no difficulty. Occasionally, an N or C-cxproton may have an unusual chemical shift and the corresponding cross peak will lie outside of the expected fingerprint region. If such a cross peak is not listed in the fingerprint file used as input to NAB, then the program will not consider that spin system in its search for C-/3 proton chemical shifts. An example of the output of this program operating on the test data set, using a window of 0.015 ppm, is given in Table 1. Of the 55 /3 protons located by hand belonging to spin systems having fingerprint peaks in our spectra (excluding the degenerate fingerprint peaks), the program found 44 of these automatically (80%), suggested 3 more using the “soft logic” option (see below), and introduced two extraneous peaks. For the wild-type protein, 63 C-p protons have unique chemical shifts for the same spin systems2 (9). Consideration of the differences between the manually and automatically assigned B protons is of value in evaluating this result. Example 1: L29 Automatic: 6.669 Manual: 6.670
4.664 4.669
1.580 1.580
1.270
The spin system has a fi proton at 1.270 and a y proton at 1.340. These peaks are close enough that in the HOHAHA spectrum, they appear as a single broad peak and ’ This number is corrected for the two @protons lost due to the Cys to Ala mutations.
COMPUTER-ASSISTED
SEQUENTIAL TABLE
PROTEIN
ASSIGNMENT
1
NAB Output” Label
N
C-a
C-B
L29 N44 A21 A51 N43 T54 co5 s41 E07 LO6 Y 1O/C38 D50 F04 G56 N24 A48 A58 K26 K15 c55 T32 118 A30 G51 G57 G28 G28 R17/A16 K41 R20 R42 v34 M52 E49 G36 Q31 DO3 119 Tll A25 R39 Y21 F33 Y35 F22 K46 F45 Y23
6.669 6.690 6.145 6.875 7.151 7.371 7.384 7.405 7.501 7.522 7.666 7.735 7.742 7.749 7.736 7.791 7.852 7.880 7.914 8.017 8.017 8.024 8.024 8.107 8.107 8.114 8.1 14 8.121 8.25 1 8.286 8.299 8.334 8.368 8.409 8.568 8.568 8.616 8.630 8.864 8.877 9.015 9.029 9.208 9.345 9.799 9.840 10.012 10.439
4.664 4.815 4.223 1.745 5.008 3.948 4.306 4.278 4.512 4.402 4.857 4.223 4.499 3.741 4.581 3.122 3.948 4.003 4.333 4.072 5.118 4.113 5.008 3.741 3.893 3.507 3.824 4.223 4.361 4.581 3.590 3.852 3.893 3.728 3.177 4.746 4.210 4.182 4.485 3.169 3.852 5.559 4.174 4.870 5.270 4.361 5.049 4.072
1.580 2.392 1.105 0.947 3.293 3.892 1.525 3.960 2.199 1.759
’ Chemical-shift
C-B
2.708
3.135 2.736 3.740
2.688 2.013 2.825 0.982 1.236 1.814 1.938 3.947 1.807 1.222 3.892 3.733 3.493 1.518 1.559 0.686 1.876 1.532 1.752 1.518 1.518 1.862 2.701 3.960 1.491 2.179 2.406 2.894
1.099
1.773 1.910 4.222 2.701
2.529
2.784 2.763 3.293
reference: Y23 e proton = 6.330 ppm.
3.382 2.330
473
474
EADS AND
KUNTZ
were picked as such. The apparent chemical shift of this conglomerate peak is 1.29 1. The actual COSY peak, located at 4.669, 1.270, is outside the window of 0.0 15. When assigning this system by hand, it became obvious that the 1.291 HOHAHA peak contained two resonances because of the location of cu-/?and p-p cross peaks at 1.270 and y-6 cross peaks at 1.340. Example 2: CO5 Automatic: 7.384 Manual: 7.385
4.306 4.306
1.525 2.763 2.763
The two B protons are nearly degenerate at near 2.763 ppm. The COSY peak at 4.320, 1.52 1 is an (r-p cross peak for K15. It was assigned to the wrong spin system because it overlaps with another peak, K4 1 LY+, causing its position to be improperly determined by our peak identification method. Example 3: E07 Automatic: 7.501 Manual: 7.495
4.512 4.504
2.199 2.199
2.096
There is no detectable cross peak near 2.096,4.5 12 in the COSY spectrum, but in the manual assignment, it was deduced that this was a @proton because both it and the 2.199 resonance show cross preaks with the y protons, and because it was known already that the spin system arose from a glutamate residue. This is an example where information and logic are being used in the manual analysis that are not available to the program. Such an approach is common in later stages of the manual assignment process, but would not occur on a first pass through an uncharacterized sample. Example 4: D50 Automatic: 7.735 Manual: 7.736
4.223 4.215
2.688 2.688
2.839
The HOHAHA peak at 2.839 was not picked because it overlaps strongly with two other peaks. During the manual analysis, the unusual peak shape was noted and COSY peaks were used to verify the identity. Example 5: G36 Automatic: 8.568 Manual: 8.568
3.177 4.222
1.518 4.222 3.169
In the fingerprint region, only the 3.177 (Yproton of this glycine appears, and this weakly. This spin system also shares its N proton chemical shift with Q31, which introduces extra peaks in the HOHAHA ladder, including the peak at 1.5 18. There is a COSY peak at 3.185, 1.52 1 belonging to the side chain of another residue. When evaluated manually, it was known that the system was a Gly, the peak at 4.222 was observed in the HOHAHA, and the peak at 1.5 18 was not even considered. Example 6: Y 35 Automatic: 9.345 Manual: 9.346
4.870 4.855
2.477
2.598
COMPUTER-ASSISTED
SEQUENTIAL
PROTEIN ASSlGNMENT
475
The cu-0 cross peaks are both strong, but the (Yshift from the fingerprint is outside the window from them. A window of 0.020 does include both peaks. In summary, differences in the automatic and manual results occur for the following reasons: (1) Neardegeneracy of two cross peaks results in picking them together as a single peak, or in errors in their coordinate determinations; (2) cross peaks from other spin systems give false-positive results; (3) peak picking gives chemical shifts outside the window size; and (4) additional knowledge about the identity of a spin system, or about other peaks in a spin system beyond the C-p protons, allows a spectroscopist to make decisions the program is not designed to make. In addition, there is an occasional occurrence of fingerprint peak being absent from the COSY but present in the HOHAHA. The user is free to add these to the output of NAB and make any other changes desired. Implementation of a peak locating program that can deal with overlapping peaks and nonideal lineshapes could help alleviate many of the discrepancies observed for these data. Included in the NAB program is an option to apply a second, less reliable criterion for identifying ,L?protons. When this option is invoked, for those spins systems for which the above analysis has identified a single C-p proton, the program will search for peaks in the HOHAHA ladder which are joined to the 0 proton by a p--/3 cross peak in the COSY spectrum. The obvious difficulty with this is that it may also locate 8-7 cross peaks. When applied to the test data set, this option suggeststhree candidates that turn out to be /3 protons according to a manual analysis, but most often it finds y protons. FINDING
CONNECTIONS
BETWEEN SPIN SYSTEMS
The algorithm for finding NOESY cross peaks between spin systems is called NOESER (NOE search). This algorithm is based on a data structure called a spin system connectivity matrix (SSCM). Any N, C-a, or C-P proton in any spin system may be specified by two indices: a spin system index and a second index which encodes the atoms. For the atom indices we use 1 for N, 2 for C-a, and 3 and 4 for C-p protons. Thus a cross peak between any two protons requires four indices for specification. The SSCM is a four-dimensional matrix whose elements contain the index of the NOESY peak joining the pair of protons specified by the matrix indices, or a zero if there is no peak joining them. The matrix is set up by searching the NOESY spectrum for peaks joining every pair of protons in the spin system list generated by NAB or otherwise supplied by the user. Given the SSCM, searches for any patterns of connectivity among and within the spin systems may be carried out. For example, characteristic secondary structure patterns as described by Englander and Wand (4) may be evaluated. In our current implementation of the SA logic, cross peaks between the N, C-a, and C-8 protons and the N protons of other spin systems are detected in the matrix and written to an output file as potential pairs of sequential residues. A sample entry from this rather long output is given in Table 2. In the section “F22 next,” there are seven residues whose N proton resonance shows a cross peak with the N, C-(-Y,or C-p protons of F22. In the case of G36 and Q3 1, which share the same
476
EADS AND TABLE NOESER Output
KUNTZ 2
Showing Cross-Peak Intensities Joining Spin Systems
F22 previous HN chemical shift = 9.799 Residue
N/intensity
C-u/intensity
C-P/intensity
C-P/intensity
N44
6.690 594.0
4.815 0.0
2.392 0.0
2.708 0.0
A51
6.875 0.0
I.745 0.0
0.947 1815.0
T32
8.017 0.0
5.118 6438.0
3.947 0.0
A30
8.024 0.0
5.008 0.0
1.222 5180.0
G36
8.568 21170.0
3.177 0.0
1.518 0.0
Q31
8.568 21170.0
4.746 0.0
1.518 0.0
Y21
9.029 0.0
5.559 74880.0
2.406 1986.0
2.529 7035.0
F33
9.208 1697.0
4.774 0.0
2.894 0.0
3.039 0.0
4.222 0.0
F22 next Shifts:
N 9.779
C-Cu 5.270
C-P 2.784
Residue
HN shift
Intensity
Intensity
Intensity
N44 s47 G36 Q31 F33 F45 Y23
6.690 7.405 8.568 8.568 9.208 10.012 10.439
594.0 3976.0 21170.0 21170.0 1697.0 0.0 0.0
0.0 31670.0 0.0 0.0 0.0 2228.0 46720.0
0.0 0.0 0.0 0.0 0.0 0.0 36900.0
N proton chemical shift, it is not possible to determine which, if either, of these makes a valid connection to the N proton of F22, so both of them must be considered. Cross peaks are also detected with other residues and are due to coincidental occurrence of peaks from other spin systems at that position, or from long range coupling. Y23, which is the correct next residue, is also included in the list. Most residues show a comparable list of potential next and previous residues. The length of the list is sensitive to the window parameter w, because for larger window sizes there is more likely to be chemical-shift degeneracy within that window.
COMPUTER-ASSISTED
SEQUENTIAL
PROTEIN ASSIGNMENT
477
Given the many possibilities for the next spin system in the sequence, the program requires an additional rule for determining the most likely next spin system from the list. We find that the following rule works well: the residue that makes the greatest number of connections gets the highest priority. In the event of a tie, the residue with the largest sum of cross-peak intensities gets highest priority. In the event of a second tie, then all peaks which tie are considered highest priority. In the F22 example, Y23 is correctly deemed the most likely next residue. This list of high-priority potential pairs is written to a separate file. Included in this file is a link number, which gives a count of the number of NOESY peaks joining each pair of spin systems. The link number provides a useful filter for the operation of the next program on this file. SORTING SPIN SYSTEMS BASED ON PAIR-WISE CONNECTIVITY
The high-priority file generated by the NOESER algorithm provides a list of pairs of spin systems that have a high probability of being sequential. Our next task is to use this information to arrange the list of spin systems into the order corresponding to the primary sequence. If the pair-wise connectivity list were correct, complete, and unambiguous, then this task would present little problem. However, the pair-wise connectivity list generated from NMR data may contain errors and conflicting hnkages due to long range coupling and m isassignment of NOESY peaks. Also, one expects incomplete information from the NMR data because prolines give no signal in the fingerprint region, and because other peaks may be m issing from this region for other reasons such as loss of C-cuproton resonances due to water suppression. These problems can lead to the formation of loops and tangled pathways through the sequence because of branching. Our solution to this problem is the “KILTER” algorithm, shown in Fig. 2. Because of the directional nature of the pair-wise connectivity input, one can detect if a spin system is the first in the sequence or follows a break in the connectivity data by determ ining if any other spin systems precede it. All spin systems which have nothing preceding them are placed on a stack. For each element on the stack, a list of sequential spin systems is created based on the pair-wise connectivity data until the end of the list is reached (no next element listed). A difficulty arises when there is more than one possible next spin system. The algorithm handles this by marking all possible next spin systems as branch elements, adding them to the current list, and terminating the current list. These branch elements are pushed on the stack as starting points for constructing new lists. The algorithm also deals with other special cases as follows. Each time a list-initializing element is taken from the stack, previously constructed lists are checked to see if they begin with the same element. If such a list is found, then a new list is not constructed with that element, since it would duplicate the existing list. Each time a new element is added to a growing list, all existing elements of that same list are checked for a match with that new element. If the list already contains that element, the list is terminated and marked to indicate it contains a loop. Finally, after the stack is emptied, every element in the list of spin systems is checked to be sure it is an element of some sequential list. When a spin system is found that is not a member of a sequential list, it is pushed on the stack and marked as being part of a loop structure.
478
EADS AND
find all beginning
elements
KUNTZ
and push them on the stack
pop the next element from the stack and start a new list (unless this element already begins an existing list)
get the next element(s) from the pair-wise connectivity
list
if there is a single next element, add it to the current list if there is more than one next element push each branch element on the stack add the branch elements
to the current list and mark them as such
end the current list if there are no next elements,
end the current list
until the current list is ended until the stack is empty FIG. 2. KILTER
algorithm.
These rules guarantee that even with the most pathological input, the number of lists of sequential spin systems can never exceed the number of spin systems in the input. Also, the length of the lists is limited by the same constraint. In this way the problems with looping and ambiguous pathways are avoided. The output consists of lists of spin systems bordered by break points and branch points. The results of applying the KILTER algorithm to the output of NOESER using the test data are presented in Table 3. A minimum link number of 2 was imposed for this run. VALUE
TO THE
NMR
SPECTROSCOPIST
To illustrate the behavior of the programs described here and their value to the NMR spectroscopist, consider the KILTER output in Table 3. Of the 46 spin systems identified by their fingerprint peaks, 22 of these (48%) were placed correctly in lists of length 3 or longer. Four more were placed correctly in lists of length 2 (total correct: 57%), although 8 were placed incorrectly in lists of length 2. Our experience with this and with fabricated data sets is that incorrect linkages occur almost exclusively in lists of length 2, so these are best disregarded. Following execution of the KILTER algorithm, the automatic assistance to the assignment process provided by these programs is completed. It currently remains a task for the spectroscopist to evaluate spin system types and establish correspondence of the lists of spin systems generated by KILTER with the primary sequence. By manually identifying some of the spin system types in these lists, it will usually be possible to establish correspondence with the primary sequence and thereby assign all the spin systems in the lists.
COMPUTER-ASSISTED
SEQUENTIAL TABLE
PROTEIN
479
ASSIGNMENT
3
KILTER
Output=
F45
K46
T32
Length
8 4 4 3 3 2 2 2* 2* 2* 2* 1
1
1
K41 Y21 T32 A25 E49 G28 co5 DO3 c55 R17/A16 118 R3Y TIl 119 G56 F04 R20 G57 K15 A58 Y lo/C38 E07 T54 A30 Q31 G36
” Symbol definitions:
R42 F22 F33 K26 D50 L29 LO6 G56 M52 L29 Y35
N43 Y23 v34 A27 A51
N44 N24 Y35 <
K26 c55
<
F04
G56
<
F04
G56
< < < <
F04 G36 c55 c55
G56 Q31 T32 T32
118 118
s47
A48
-
N44
A30 A30
-: loop detected; <: the following are branch elements; *: list contains error.
When used in conjunction with knowledge of the spin system types, the list of potential next and previous spin systems, as generated by the NOESER program (Table 2) is also of great help during subsequent stages of the assignment process. For example, once a spin system at the end of a list is assigned, it is known from the primary sequence which type of spin system must follow. Inspection of output such as that found in Table 2 will reveal in some cases that only one member of the list of potential next residues is consistent with this constraint, allowing another residue to be assigned. The list of potential next residues may be otherwise reduced by removing spin systems which have already been assigned. This also applies to working backward through the primary sequence; the lists of potential previous residues may be reduced using the same logic. A spectroscopist will not often need to consult the original spectra during these operations, because all relevant connections are contained in this output and organized in a useful way. We have applied this approach to our test data set in order to assign as many spin systems as possible. This required determining spin system types by a manual analysis
480
EADS AND
KUNTZ
of the spectra. Several amino acid types show unique scalar-coupling patterns (I). With our data, those spin systems with scalar-coupling patterns matching the pattern expected for a particular amino acid were assigned to that amino acid type. Other amino acids fall into groups whose members cannot be distinguished based on scalar coupling. When such coupling patterns were detected in our data, all the amino acid types consistent with that coupling pattern were listed as possible assignments for that system. For several of the spin systems, it was impossible to uniquely associate the observed coupling pattern with any of those expected for the amino acids. This occurred most often with the larger spin systems, because of peak overlap in crowded regions of the spectra, or because of low signal intensity. In these cases, all amino acid types having coupling patterns which include the unambiguous parts of the observed coupling pattern as a subset were listed as possible amino acid types for that spin system. Alanine residues were distinguished from other residues showing only one C-p proton (which can occur for spin systems having degenerate C-p proton chemical shifts) based on the upfield location of the alanine methyl resonance. During this work, cross peaks between N and C-a protons in the HOHAHA spectrum which were not recognized or not present in the COSY fingerprint region were identified. These spin systems were traced manually and added to the output of the NAB program. The two pairs of peaks which are nearly degenerate in the fingerprint region were analyzed carefully so as to separate the chemical shifts belonging to each one. The amino acid residues whose spin systems were not identified include all prolines as well as Rl, G37, and R53. Also, chemical-shift values for the N and C-cu protons, which are taken from the COSY fingerprint file by the NAB algorithm, were replaced by the coordinates from the HOHAHA file so that all the coordinates arise from the same spectrum. Execution of NOESER and KILTER on this input placed 2 1 of the 5 1 spin systems, 4 l%, into sequential lists of length 3 or longer. (The discrepancy between this result and the 48% figure presented earlier for the input file generated automatically is due to the presence of more spin systems in the edited input file.) Correspondence with the primary sequence for these lists was easily established given information on the spin system types. Assignment of additional spin systems was achieved by consulting the lists of potential next and previous spin systems. By removing spin systems from these lists which had already been assigned, and by using restrictions on the types of spin systems which can follow assigned spin systems, it was possible to unambiguously assign an additional 23 residues. The 44 residues finally assigned represent 86% of the observed fingerprint peaks. Seven residues with fingerprint peaks remained unassigned. Residues D3 and F4 were found to be adjacent but their sequential order was ambiguous. A search for connectivity to the aromatic ring of F4 would resolve this ambiguity. Residues E7, G12, C14, C38, and C55 have no cross peaks in the NOESY peak file with their adjacent residues. These peaks are missing for the usual reasons of peak overlap, errors in chemical-shift determination, and limited signal intensity. An important practical consideration for the operation of these programs is the accuracy and precision of peak coordinate determination. There are several places in the analysis where this could have an effect. In the NAB algorithm, HOHAHA peaks
COMPUTER-ASSISTED
SEQUENTIAL
PROTEIN
ASSIGNMENT
481
must be classified into ladders of peaks sharing the same w2 coordinate. If there is a great deal of scatter in the determination of chemical shifts, then peaks which belong in the HOHAHA ladder will be missed. This problem may be overcome by choosing a larger window size, but this solution will exacerbate a second problem, that of peaks from nearby ladders being assigned to the wrong ladder. Another operation in which inaccurate or imprecise peak coordinate identification is problematic is searching spectra for peaks at predefined positions. This happens both in the NAB algorithm, where a-0 cross peaks are identified from the COSY, and in the NOESER algorithm, where NOE peaks are searched for among all pairs of protons. In these cases, using a large window size to compensate for inaccurate or imprecise peak picking will have the serious consequences of introducing erroneous /3 proton assignments or erroneous connectivities among spin systems. Our method of calculating the center of gravity of a peak may be evaluated by comparing the calculated coordinates of corresponding peaks from different spectra. In general, we find a standard deviation in the w2 coordinates of 0.005 ppm, and in the w, coordinates of 0.006 ppm. The latter dimension has a lower digital resolution. These numbers, and the number of peaks that may be identified as being in common between two spectra, depend to some extent on the window size chosen for the comparison, and on the crowdedness of the spectra. However, two standard deviations, or a window size of 0.01 ppm, was found to be a good choice. As larger proteins are studied and spectra become more crowded, it will obviously be advantageous to improve the accuracy and precision of coordinate determination, allowing use of a smaller window size so that spurious cross-peak assignments are minimized. Given the required spectra, the process of creating the peak files required approximately 4 to 5 days of effort in our hands. Execution of the NAB, NOESER, and KILTER programs took less than 15 min on a VAX 1 l/750 computer. Manually tracing the spin systems beyond the /3 protons took approximately 5 to 6 days. Consulting the list of potential next and previous spin systems generated by NOESER to extend the automatic assignments required about a half day. Thus, the total time required to assign 86% of the observable residues in our test system was approximately 2 weeks. Similar work carried out manually might require months of effort. the bulk of which involving the tedious task of working through the NOESY connectivities. In summary, the execution of these programs serves to convert several functionally disorganized lists of chemical-shift coordinates into an organized list of potential connectivities between spin systems. The ranking scheme incorporated into the NOESER algorithm allows the KILTER algorithm to create several long lists of spin systems which are very likely to be sequential within the primary sequence. Using the list of potential connectivities and knowledge of the spin system types, these lists can be extended and new lists created with little effort. The overall result is the unambiguous assignment of a large fraction of the observed spin systems. The time scale of the entire process is significantly shorter than the time scale for achieving the same result manually. Thus, these algorithms should be of real value to the practicing NMR spectroscopist. The programs are written in Fortran 77 and run under UNIX: they are available upon request.
482
EADS AND
KUNTZ
ACKNOWLEDGMENTS The authors thank Dr. M. Hurle for providing the mutant bovine pancreatic trypsin inhibitor. This work was supported in part by Research Resources Grant RR0 1695 and by NIH Grant GM3 1497. C. D. Eads was supported by NIH Postdoctoral Training Grant GM 10654. REFERENCES 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. II. 12.
K. W~THRICH, “NMR of Proteins and Nucleic Acids,” Wiley, New York, 1986. M. BILLETER, W. BRAUN, AND K. WOTHRICH, J. Mol. Biol. 155,321 (1982). K. W~THRICH, G. WIDER, G. WAGNER, AND W. BRAUN, J. Mol. Biol. 155,311 (1982). S. W. ENGLANDER AND A. J. WAND, Biochemistry 26,5953 (1987). G. M. CRIPPEN, “Distance Geometry and Conformational Calculations,” Research Studies Press, Chichester, 198 1. M. BILLETER, V. J. BASJS, AND I. D. KUNTZ, J. Mugn. Reson. 76,400 (1988). V. J. BASUS, M. BILLETER, R. A. LOVE, R. M. STROUD, AND I. D. KUNTZ, Biochemistry 27,2763 (1988). P. A. KOSEN, J. FINER-MOORE, M. P. MCCARTHY, AND V. J. BASUS, Biochemistry 27,2775 (1988). G. WAGNER, W. BRAUN, T. F. HAVEL, T. SCHAUMANN, N. Go, AND K. W~THRICH, J. Mol. Biol. 196,611(1987). A. BAX AND G. DROBINY, J. Mugn. Reson. 61,306 (1985). A. BAXANDD. G. DAVIS, J. Magn. Reson. 65,355 (1985). D. G. DAVISANDA. BAX, J. Am. Chem. Sot. 107,2820(1985).