J. Mol. Biol. (1991) 219, 727-732
A Simple Method to Generate Alternate Alignments of Protein Sequences
Mansoor A. S. Saqi and Michael J. E. Sternberg Biomolecular
Model&q Laboratory Cancer Research Fund PO Box 123, 44 Lincoln’s Inn Fields London WClA 3PX, U.K. Imperial
1990; accepted 11 March
A major problem in sequence alignments based on the standard dynamic programming method is that the optimal path does not necessarily yield the best equivalencing of residues assessedby structural or functional criteria. An algorithm is presented that finds suboptimal alignments of protein sequences by a simple modification to the standard dynamic programming method. The standard pairwise weight matrix elements are modified in order to penalize, but not eliminate, the equivalencing of residues obtained from previous alignments. The algorithm thereby yields a limited set of alternate alignments that can differ considerably from the optimal. The approach is benchmarked on the alignments of immunoglobulin domains. Without a prior knowledge of the optimal choice of gap penalty, one of the suboptimal alignments is shown to be more accurate than the optimal. Keywords: sequence alignment; dynamic programming; suboptimal paths: protein structure prediction; homology modelling
Vingron & Argos, 1990). Thus, the correct alignment may only be found at some distance away from optimality. There are two different approaches to compute suboptimal alignments: the algorithm of Kruskal & Sankoff (1983) and the approach of Waterman (1983), which has been explored by Vingron & Argos (1990). Kruskal & Sankoff (1983), and subsequently Waterman & Eggert (1987), modify the Smith & Waterman (1981) algorithm that finds local similarities between two sequences by suppressing minor perturbations associated with the best alignment as they often reveal nothing new. All cells visited along an alignment path are marked and the similarity scores of these cells are then set to zero. However, this approach would prevent any equivalenced residues in the optimal alignment being produced again in a suboptimal alignment. Furthermore, the user is required manually to generate the complete suboptimal alignments from the fragments identified by the algorithm. A rigorous algorithm to compute complete suboptimal alignments has been given by Waterman (1983). However, for the comparison of two sequencesof moderate length there generally exist a large number of alignments close to the optimal. Also, the enumeration of all suboptimal paths is quite complex, especially if more flexible gap penal-
1. Introduction The rapid increase in the number of protein sequences has highlighted the need to explore relationships between sequencesand thereby identify those residues which play an important structural and/or functional role. A sequence alignment of two or more proteins (Barton & Sternberg, 19876; Needleman & Wunsch, 1970; Sellers, 1974; Vingron & Argos, 1989) can lead to the identification of conserved residues that are essential for function (Argos & Leberman, 1985; Zvelebil et aZ., 1987). Furthermore, modelling studies can be initiated if there is an experimental three-dimensional structure for one of the homologous sequences(Bates et al., 1989; Blundell et al., 1987; Sutcliffe et al., 1987a,b).
The standard sequence alignment methods carry out global (Needleman & Wunsch, 1970) or local (Smith & Waterman, 1981) comparisons using dynamic programming methods, either by maximizing a similarity or minimizing a distance (Sellers, 1974). They result in an optimal score for the sequence comparison and a corresponding optimal alignment. The best alignment is the one that correctly equivalences similar structural regions in the two sequences. However, this may not be the one that produces the best score in an automatic sequence comparison (Barton 6 Sternberg, 1987a; 727 0022-2836/91/12072746
M. A. S. Saqi and M. J. E. Sternberg
ties are to be used (Gotoh, 1982; Altschull & Erickson, 1986). Vingron & Argos (1990) discuss the importance of suboptimal alignments as a way of determining reliable regions in protein sequence alignments. They calculate the maximal score of an alignment path going through a point in the alignment but do not evaluate the alignment themselves. This work shows that the suboptimal alignments include many minor perturbations about the optimal alignment (see Fig. 2 of Vingron & Argos, 1990), many of which are trivial as they do not lead to major differences in the equivalencing of structural fails to generate a regions. Thus, their approach limited set of suboptimal alignments that can realistically be used in subsequent analysis or modelbuilding. The problem is that often the optimal alignment only contains some sections that are correctly aligned, but suboptimals must be considered to obtain the other regions of correct alignment. We propose here a simple modification to the traditional dynamic programming algorithm that incorporates features of the methods described above to obtain suboptimal alignments. The aims of our method are: (1) to generate automatically the complete suboptimal sequence alignment; (2) to obtain a limited set of alternate alignments that can be subsequently used in other algorithms; (3) to generate only suboptimal alignments that include non-trivial differences from the optimal alignment. The algorithm was benchmarked on two alignments of pairs of immunoglobulin sequences: (1) light-chain variable region (FABVL) against heavy-chain variable region (FABVH); (2) lightchain constant region (FABCL) against (FABVH). These immunoglobulin domains were from the Fab fragment of human myeloma IgG New, whose crystal structure has been solved (Poljak et al., 1973). The structural alignment was taken from Cohen et al. (1981) and for each pair of proteins a set of equivalent residues were selected from central portions of regions of homologous b-strands in the X-ray structure. The accuracy of the alignments was calculated from the number of amino acid pairs within the defined zones that were aligned as in the structural alignment (see Barton & Sternberg, 1987a).
2. Algorithm The computer program developed implemented a variant of the standard widely used Sellers (1974) algorithm (see also Kruskal & Sankoff, 1983). The algorithm finds a minimum distance D(a, b) of an alignment of two sequences a = a,. .ai . . . a,,, and b = b, . bj b,. A matrix of amino acid pair scores must be chosen, where the (i,j) element of the matrix gives the substitution score for ai with bj. As the distance is minimized a scaled version (MDMLSO) of the standard MDM,,, matrix (Dayhoff, 1978) of amino acid substitution scores was used: MDM&,(i,j)
The MDM;,, matrix was used to generate a weight matrix W, where W(i,j) is the MDM;,, value for a, versus bj. Using W and a suitable gap penalty function, the distance matrix D is generated, where D(i,j) is the minimum distance of the alignment of a, . .ai with b, . . bj keeping pointers at each sitca (i, j) to enable a path through the matrix to be computed, corresponding to the alignment of u and b with minimum length. The gap penatty fun&ion used here is of the form u’ = v + ku. where opening a. gap costs v and each null in the gap costs %c. In our modification to TieId suboptimal alignments, after an alignment, 1s obtained. t,hosr cells (i. j) t~hat correspond to a match or substitut,ion along the alignment path are marked. The distance matrix is then recalculated with t)hr weights. W(i.Q). associated with marked cells increased by a small amount’ A. The weight matrix is increased b> A as the algorithm minimizes the distance. (In implementations that maximize t,he similarity, the weight matrix would be reduced by A.) This procedure is repeated iterat,ively. The value of A depends on the scoring scheme and should br a fraction of a typical weight matrix entry. Tn our scheme, each weight matrix element, W(ij) is irk the range 0.1 to 2.4, given the scaling of the Dayhoff’ (1978) matrix, and A was set, to lO(!,& of t,htA most favourable score of 0.1. i.e. A = 001.
3. FABVL/FABVH and FABCL/FABVH Alignments For FABVIJFABVH, t,he best result obtained from the init,ial scan of (P, U) space was 33 out of a maximum of 11 (X04;, correctly aligned) for (v, u) = (1. f ). However, one does not know the best choice of gap penalty parameters and, following the work of Barton & Sternberg (1987a), t,he effect of choice of gap penalty parameters on the accurac) of the alignment was investigated. A value of (v, U) = (0. 3) was taken as yielding results u)f typical, but not, the highest, accuracy wit.h 26 residues (63%) correctly aligned. This allows us to investigate whether t,he best result could br improved upon by examining a set of suboptimals. The alignment of FABCL with FABVH is more difficult and illustrates the sensitivity of alignment algorithms to gap penalt,y parameters. The highest score obtained from the scan of (27, 71) space for (v. u) = (1. I ) was 16 out, of a possible 38 correct matches (42(j/). For ((3, TV)= (0. 3) there arr 11 correct matches (29 OJ~). Suboptimals are found using t,htt scheme described above. The results are displayed in Table 1. Often alignments corresponding to runs with thtx same score are identical but this is not< necessarily the case and must be determined by inspection. In both alignments and for bot,h gap penalt,J choices, bett#er alignments are found among the wit)h suboptimals. For FAKVL/FABVH (71, 7~) = (0. 3), Table I shows an increase from 63% to 76%) correctly aligned. .For the alignment of FABCL with FABVH ((71. U) = (I. 1)) the best score
Table Accuracy Alignment FARVL/FAlWH b, u) = (1, 1) FARVL/FABVH b> u) = (0, 3) FARCL/FARVH (v, fh) = (1, 1) FAHCL/FABVH (fJ> u) = (0,3)
of optimal and suboptimal alignments Accuracy optimal
Accuracy of best suboptimal
Run giving best suboptimal
The number of correct matches (residue pairs within the defined zones that are aligned as in the structural alignment) for the optimal run and the best suboptimal run and the run at which the best suboptimal occurs are shown. The parameter A = 001.
of 16 (42%) is marginally improved upon in searching for suboptimals. This score can also be found among the suboptimals for the second choice of gap penalties where the score increases from 26% to 42 y. (Fig. 1). As the best choice of gap penalty parameters is generally not known a priori, this example demonstrates the importance of examining the suboptimals. An idea of the stability of an alignment is given by the number of iterations through which the alignment persists. Vingron & Argos (1990) have shown that these stable regions tend to be the most accurately aligned. Figure 2 shows the residue equivalences in some of the alignments for (v, U) = (0,3). The optimal and
near suboptimal correctly align the F and parts of the E and G strands but fail to align strands A, B, C and D. The alignment with the highest score (number 33) matches correctly strands A, B and C but fails with the other strands. The D strands are particularly difficult to align correctly. 4. Effect of A The only adjustable parameter in this method is A, which was chosen after trials as O-01. The smaller the absolute value of A, the less change there is to the distance D matrix in each iteration. Thus, a search with a small absolute value for A can find minor differences from a previous iteration but is
3 ;: r E
ii k z 5 0’ E z 0 -~.-.-.-.-.....1.......,.....................~.....
Figure 1. Histogram of the number of correct matches for the optimal and subsequent runs for FABCL/FABVH A = @Ol.
(v, U) = (0, 3) and
M. A. S. Saqi and M. J. E. Sternberg
1 (11 correct matches)
33 (16 correct matches)
GGG G 'FFFFE
Figure 2. Alignments of’ FABCL with FABVH. The optimal alignmenti (run 1) and run XS, the Ijest suboptimal are given, The B-strands are labelled A to G. The gap penalty parameters are (a. U) = (0, S) and A = 001. The top saquencr is FABCL. the bottom FABVH.
(v. u) = (0, 3). the optimal alignment was 320,6 accurate and the most, accurate alignment (59f)/o) occurred in the fifth suboptimal path (data not, shown). The sets of ail alignments within a specified distance E of the optimal was also enutnerated using the method of Waterman (1983) for FABCI,/
also more likely to find the same alignment. Table 2 shows the effect of changing A. As both of the above studies used the immunoglobulin fold, the choice of A was also explored for the alignment of sperm whale myoglobin with leghaemoglobin based on the structural alignment given by Bashford et al. (1987). With A = 001 and
Table 2 Accuracy
Alignment FABVL/FAHVH (A = 0.005) FABVL/FAII\‘H (A = 002) FAN’L/FAt3VH (A = 0.005) FAM:L/FABVH (A = 0.02)
for different values of and FABCLlFABVN
best, suboptimal 3t/41
‘“1; 4 I
The number of correct matches (residue pairs within the defined zones that are aligned as in the structural alignment) for the best and second best suboptimal runs and the run value at which they occur (run 1 is the optimal) are shown. The gap penalty parameters are (v, U) = (0,3). For FABVL/ FAUVH with A = 0005. the initial score of 26 persists over 7 iterations instead of just 4 as is the case for A = 001. With A = @02, the coarser search results in a maximum of 30 being found instead of 31 (A = 0.01). For FAUCLIFAUVH, the lower absolute value of A = OGO5 results in the initial score of 11 persisting over 6 iterations as opposed to 3 for A = @Ol. The best score of 16, which was obtained on the 33rd run with A = @Ol, is now reached on run 57. For A = OG?. the top score is reached sooner. a.s expected, on run 21. The choice of A affects the hehaviour of our algorithm. If the value of A is vrr~ large, the algorithm behaves like that described hy Kruskal & Sankoff (1983). With A practically zero. many local perturbations are picked up and this resembles the algorithm described hg Waterman (1983). The exact choice of the value of A between these extremes will he governed by the particular application of the algorithm balancing the sensitivity of the search for alternate alignments against the number of suhoptimals generated.
FABVH. For (v, u) = (0,4) there are 16 optimal (E = 0) alignments and these alignments differ trivially. With E = 61 there are 1620 alignments. A gap penalty of (0, 3), used in this study gives 16,384 optimal alignments. Clearly, the set of all paths for a given (w, u) will almost certainly include the correct alignment if E is large enough, but it is not clear using this procedure how to filter out from the vast number of trivial minor perturbations a useful set of alignments for modelling. The Kruskal & Sankoff (1983) algorithm was also used for the alignment of FABCL with FABVH. While the Kruskal & Sankoff (1983) method and related algorithms (Smith & Waterman, 1981; Waterman & Eggart, 1987) are useful for locating separate non-intersecting regions of local similarity, they do not explore the set of suboptimals which may only differ in part. When applied to FABCL/ FABVH, the first (top) alignment correctly equivalenced the B and parts of the C, E and F strands but subsequent regions of next best local similarity ‘failed to improve on this.
The advantage of this approach to generate suboptimals is that it can be included easily in standard programs and any choice of flexible gap penalty can be employed. There are a large number of applications of the dynamic programming method in sequence comparisons and structural alignments of proteins and nucleic acids. These applications include multiple alignment algorithms (Barton & Sternberg, 19876; Vingron & Argos, 1989), motif or pattern searches (Barton & Sternberg, 1990; Bowie et al., 1990; Gribskov et al., 1988) and structural comparisons (Barton & Sternberg, 1988; Taylor & Orengo, 1989; Sali &, Blundell, 1990). If an actual model-building study were undertaken, more accurate methods of aligning two sequences, given the crystal structure of one of them, should be used. These more accurate methods employ different weighting schemes for insertion in secondary structures compared to loops (Barton & Sternberg, 1987a; Lesk et al., 1986) and/or use different pairwise entries in the weight matrix if buried non-polar residues are equivalenced (Henneke. 1989). The next step in the use of this approach to generate a limited set of alignments is to develop filters using other sets of criteria. Methods are currently being developed to distinguish between correctly and incorrectly folded proteins (Novotny et al., 1984, 1988; Baumann et aE., 1989; Gregoret & Cohen, 1996; Sippl, 1990; Hendlich et al., 1990). The modelling would then start with several suboptimal alignments followed by the examination and filtering of a series of three-dimensional models. Thus, the automatic generation of suboptimals is central to the development of fully automatic methods of model-building by homology.
References Altschul, S. F. & Erickson, B. W. (1986). Bull. Math. Biol. 48, 603-616. Argos, P. 6 Leberman, R. (1985). Eur. J. B&hem. 152, 651-656. Barton, G. J. & Sternberg, M. J. E. (1987a). Protein Eng. 1, 89-94. Barton, G. J. & Sternberg, M. .J. E. (1987b). J. Mol. Biol. 198, 327-337. Barton, G. J. & Sternberg, M. J. E. (1988). J. Mol. Graph. 6, 190-196. Barton, G. J. & Sternberg, M. .J. E. (1990). .J. Mol. Bi01. 212, 389-402. Bashford, D., Chothia, C. & Lesk, .4. M. (1987). J. Mol. Biol. 1%. 199-216. Bates, P. A., McGregor, M. J., Islam, S. ii., Sattentau. Q. ,J. & Sternberg, M. J. E. (1989). Protein Eng. 3. 13-21. Baumann, G., Frommel, C. & Sander, (‘. (1989). Protein Eng. 2, 329-334. Blundell, T. L., Sibanda, B. I,.. Sternberg, M. J. E. & Thornton, J. M. (1987). Natuw (LIondon), 326. 3477352. Bowie, J. U., Clarke, N. D., Pabo, C. 0. 8: Sauer, R. T. (1990). Proteins, 7, 257-264. Cohen,F. E., Kovotny, J., Sternberg, M. J. E.? Campbell, D. G. & Williams, A. F. (1981). Biochem. J. 195. 31-40. Dayhoff. M. 0. (1978).Editor of Atlas of Protein Sequence & Structure, vol. 5, supplement 3, National Biomedical ResearchFoundation, WashingDon, DC. Gotoh, 0. (1982). J. Mol. Biol. 162, 705-708. Gregoret, L. M. & Cohen,F. E. (1990).J. Mol. Biol. 211, 959-974. Gribskov, M.. Homyak, M., Edenfield. .I. C Eisenberg. D. (1988). (TABZOS, 4, 61-66. Hendlich, M.. Lackner, I’., Weitckus. S., Floeckner, H.. Froschauer, R.. Gottsbacher, K.. Casari, G. & Sippl. M. J. (1990). J. Mol. Biol. 216. 1677180. Henneke. C. M. (1989). CABZOS, 5. 141-150. Kruskal, ,J. B. & Sankoff, D. (1983). In Time Warps, String Practice
of Sequence Comparison J. B.. eds), pp. 265-310.
(Sankoff, D. & Addison-Wesley.
Kruskal. London. Lesk, A. M.. Levitt. M. & Chothia. (‘. (1986). Protein Eng. 1, 77778. Needleman, S. B. & Wunsch, C. D. (1970). .I. Mol. Biol. 48, 443-453. Novotny. ,J.. Bruccoleri, R. E. & Karplus. M. (1984). J. Mol. Biol. 177. 787-818. Novotny. ,J.. Rashin, A. A. & Bruccoleri. R,. E. (1988). Proteins. 4. 19-30. Poljak. R. J.. Amzel, L. M., Avey. H. l’., (:hen, R. L., Phizackerley, R. P. & Saul. F. (1973). Proc. Nat. Acad. Sci., F.S.A. 70, 330553310. Sali. A. &. Blundell. T. L. (1990). J. Nol. Biol. 212. 403-428. Sellers, 1’. H. (1974). &&n. ,I. Appl. Math. 26, 787-793. Sippl, M. (1990). J. Mol. Biol. 213. 859-883. Smith, T. F. & Waterman. M. S. (1981). .J. Mol. Biol. 147. 195197. Sutcliffe, M. J., Haneef, I.. Carney. 1). & Blundell. T. I,. (1987a). Protein Eng. 1, 377-384. Sutcliffe, M. J.. Hayes, F. 8r Blundcll. T. L. (19870). Protein Eng. 1. 385-392. Taylor, W. R. 8r Orengo. C. A. (1989). .J. Mol. Biol. 208. l-22.
M. A. S. Saqi and M. J. E. Sternberg
732 Vingron, M. & Vingron, M. & Waterman, M. 3123-3124. Waterman, M. 723-728.
Argos, P. (1989). C’ABZOS. 5, 115-121. Argos, P. (1990). Protein Eng. 3, 565-569. S. (1983). Proc. Nat. Acud. Sci., U.S.A. 80. S. & Eggert,
Zvelebil, M. J. .J. M., Barton, G. .J., Taylor. W. ft. & Sternberg, M. ,J. E. (1987). J. Mol. Biol. 195. 957-961.
M. (1987). J. Mol. Biol. 197.
Edited by A. R. Fersht