[29] Hierarchical method to align large numbers of biological sequences

[29] Hierarchical method to align large numbers of biological sequences

456 ALIGNING PROTEIN AND NUCLEIC ACID I291 SEQUENCES Conclusion Here, several protein sequences were compared simultaneously with a multiple s...

1MB Sizes 0 Downloads 48 Views

456

ALIGNING

PROTEIN

AND

NUCLEIC

ACID

I291

SEQUENCES

Conclusion Here, several protein sequences were compared simultaneously with a multiple sequence comparison method to show conserved regions of one sequence. However, the method is applicable for comparing DNA and RNA sequences, too. Pairwise comparisons are superimposed; thus, enormous computer power is not required. This approach is very fruitful because a multiple sequence comparison gives more data than individual pairwise analyses. As a test, sequences for liquefying and saccharifying cY-amylases were shown to share some common regions with other amylolytic enzymes, but each shared different conserved regions with others. Several secondary structural features can be predicted, and their use to compare protein sequences can point to important features possibly invisible by sequence comparisons. However, some caution is necessary in the use of secondary structure predictions, because predictive methods do not always give correct results. In connection with sequence analysis, structure predictions can be valuable in searching for functional sites, e.g., for protein engineering studies. The relatedness of two hydropathy prediction scales was analyzed with the method. Such applications extended to the analysis of different predictive methods may have numerous applications in the study of the relationships between different structural features. The method can even be used to analyze three-dimensional structures, either refined or modeled. Acknowledgments Antti Euranto and Petri Luostarinen are thanked for implementation This work was supported by a grant from Neste Oy Foundation.

[291 Hierarchical

Method Biological By

WILLIAM

of the algorithm.

to Align Large Numbers Sequences R.

of

TAYLOR

Introduction The rapidly increasing determinations of biological sequences have increased the corresponding need for a fast and effective multiple sequence alignment computer program for their analysis. The contents of this volume indicate that this need has not been neglected by those who write METHODS

IN ENZYMOLOGY,

VOL.

183

Copyright 0 1990 by Academic Press, Inc. All rights of reproduction in any form reserved.

1291

HIERARCHICAL

ALIGNMENT

SCHEME

457

programs. The number of sequence alignment programs that have appeared in the last few years, in particular those directed toward aligning more than two sequences, seem to have kept pace (at least in variety) with the growing sequence data banks. In this chapter I describe the computer program that resulted in response to my own need to align more than two protein sequences. However, with the exception of similar contemporary developments, I have made no attempt to review the rapidly changing multiple sequence alignment field. Background Several years ago, if the need arose to align more than two sequences I used the ALIGN program’ to align pairs of sequences and then combined these pair alignments together in a text editor to produce a multiple alignment. If the sequences were not very similar (e.g., <50% identity) then some care was necessary to choose the best alignments. For this I followed this simple heuristic: take the best alignment, say A with B (written AB), as a core and then add on the next best alignment that contains either A or B, say, BC, giving the multiple alignment ABC. This can be repeated using the edge sequences in the alignment until all the sequences are aligned (see Fig. 1). However, as editing several large sequences together in this way can prove to be rather tedious I wrote a simple program, based on the results of the ALIGN program, to do this. Strangely, repeating what I had been doing in the text editor with a computer program, although conceptually straightforward, proved to be surprisingly difficult to code. Chained

Pair-wise Alignment

RecursiveAlignment Linking The program that linked together alignments2 was written in FORTRAN, and the routine that performed the merging required more than 100 lines of code. (The source of the complexity lay in placing insertions in regions that already contain inserts.) A more aesthetically acceptable solution to this problem was found when the program was recoded in the I B. C. Orcutt, M. 0. Dayhoff, D. A. George, and W. C. Barker, “User’s Guide for the Alignment Score Program of the Protein Identification Resource (PIR),” PIR Report ALI-1284, National Biomedical Research Foundation, Washington, D.C., 1984. * W. R. Taylor, CABIOS 3,s 1 (1987).

458

ALIGNING

PROTEIN

AND

NUCLEIC

ACID

I291

SEQUENCES

a ISRBT

ISCHT

ISRBT ISCHT

4.40

ISLAT

4.29

ISBYT

2.80

ISBSTF

2.28

3

ISLAT

ISBYT

ISBSTF

4.29

2.80

2.28

4.13 2

2.76

2.40

2.70

2.41

2.41

2.00

2.00

b ISBYT

- Triasephosphate

isomerase

(EC 5.3.1.1)

- Baker's

yeast

2.78 ISLAT ISRBT ISCHT

4.29

isomerase

(EC 5.3.1.1)

- Coelacanth

isomerase

(EC

5.3.1.1)

- Rabbit

- Triosephosphate

isomerase

(EC

5.3.1.1)

- Chicken

- Triasephosphate stearothermophilus

isomerase

(EC

5.3.1.1)

- Bacillus

4.40 2.40

ISBSTF

- Triosephosphate - Triosephosphate

FIG. 1. (a) Relatedness matrix obtained from the alignment of five sequences of triosephosphate isomerase.. The best alignment (ISRBT with ISCHT) is indicated by a thick box. From this pair, other overlapping sequences are found (indicated by finer boxes), producing the ordered list of sequences in (b).

computer language C. This treated the combined alignments as a linked tree of pointers which was then recursively processed by one simple routine. The tree has its root at the beginning of the first sequence (A) in the alignment. This residue was considered to be linked to the second residue in the same sequence and any residue against which it had been aligned. The aligned residue in, say, sequence B in the alignment AB is linked only to itself in the following alignment in the chain of pair alignments, say, in sequence B in the alignment BC. This second occurrence of the residue in B is, like the initial residue, linked to the following residue (in B) and any residue with which it aligns (in C). However, the residue in B is not necessarily the first in the sequence so it must also have a link to its preceding residue. These links define a tree spanning the whole alignment that has at most three branches from any node (see Fig. 2). The branches are referred to as LEFT and RIGHT (along a sequence) and DOWN when jumping from one pair alignment to the next. The order or processing the branches is LEFT, DOWN, then RIGHT, and the same procedure processes each branch. The residues are entered in the final alignment only after each branch has hit a dead end since at this stage the total number of insertions required between the last and current positions is known.

r291

HIERARCHICAL

ALIGNMENT

SCHEME

459

a Sequence

1 2 3 4

Sequence

Sequence sequence

= = = =

ALGORITHM ALIGNING ELEMENTARY ALIGNMENTS

b

AL*GORITEIM ::::: ALIGN*ING

:::

I I I I I

\ \ \ \

ALIGN**'ING : : : : : : : EL*+EMENTARY

:

:

I

:

:

I

I

I

I I I I

EL**EMENTARY : : : : : : ALIGNMENTS

:

:

:

I

I

I

I

:

FIG. 2. (a) Four sequences to be aligned by the algorithm described in the text. (b) The three best alignments resulting from the pairwise alignment of the sequences in (a) are linked together. These links, including those to the right and left along the first sequence in each pair, form a tree shown in (c). (c) The tree defined in (b) is processed by the following recursive routine:

enter (aln, col, row, pos) co1 = bump-check (row, col); if (clear(LEFT)) enter(aln, co1 - 1, row, pos - 1); if (down) co1 = entetjaln + 1, col, row + 1, down); if (clear(RIGHT)) enterjam, co1 + 1, row, pos + 1); alignment[row][col] = residue[a]; return col; The routine (shown only in skeleton C code) calls itself with three different sets of parameters depending on the direction of branching. The parameters are as follows: aln, a pointer to an ordered list of pairwise alignments (AD, BC, CD); col, the column position in the output array (alignment); row, the row position in the same array; and pos, the position in the alignment pointed to by aln. The variable down is a pointer from one pair alignment to another and has the value of the new position in the following alignment. The routine clear simply checks end conditions when moving right and left along a sequence, while bump-check keeps check on the maximum current column position and increments this if there is danger of overwriting existing entries in the alignment array. Because of this possible change, the value of co1 must be returned. The growing multiple alignment is filled up only on the return path from a series of calls since only at this point is the number of required columns known. (d) The resulting alignment.

460

ALIGNING

PROTEIN

AND

NUCLEIC

ACID

SEQUENCES

f291

c root

\ A

-

L

-

*

-

I

I

A

L

I

I

I

E

L . L

* : I

A

G

I I

-

0

---------

R

-

I

I

I

G

N

*

I

I

I

: G

E : N

: N

l

1

M-E-N : : M E

-

I

T

-

H

-

f.q

I

I

N

G

I

I

I

T : T

A : S

R - Y

I

d AL*GO**RITHM ALIGN***ING EL**EMENTARY ALIGNMENTS FIG.

2c and d. See legend on p. 459.

Comparisonto True Multiple Alignment Method The problem of multiple sequence alignment had meanwhile been approached in a more rigorous way through the generalization of the basic Needleman and Wunsch algorithm3 to higher dimensions, each corresponding to a sequence (e.g., Refs. 4, 5, and this volume). These “true” multiple alignment programs can produce a mathematically defined “best” alignment since they simultaneously consider all the sequences, whereas the concatenation of pairwise alignments considers only two at a time. Unfortunately, the “true” multiple alignment programs require considerable computer resources (both time and memory) to solve problems of a useful size. When aligning protein sequences, this expenditure is not necessarily justified since there is no clear definition of how two protein sequences should, in principle, be related. This theoretical weakness arises because the amino acids in protein sequences realize their full functional 3 S. B. Needleman and C. D. Wunsch, J. Mol. Biol. 48,444 (1970). 4 M. Murata, J. S. Richardson, and I. L. Sussman, Pm. Nat/ Acad. Sci. U.S.A. 82, 3073 (1985). 5 M. S. Johnson and R. F. Doolittle, J. Mol. Evol. 23,267 (1986).

1291

HIERARCHICAL

ALIGNMENT

SCHEME

461

significance only in the three-dimensional structure of the protein. Since this structure is generally unknown, the constraints on the mutability of each residue cannot be fully known. Consequently, residues in different sequences cannot be compared without an inherent degree of uncertainty. The work of Dayhoff and Barker6 provides a guide to the likelihood that two residues in different sequences are equivalent; however, compared to this rough measure, the subtleties in preference discriminated by the Needleman- Wunsch algorithm3 are very fine. In practical terms, therefore, it is not justified to spend a large amount of (computer) time seeking a mathematically best solution to the multiple sequence alignment problem while the underlying model of amino acid relatedness remains ill defined. As an example, I repeated the alignment of three small copper-binding proteins that had been aligned by the method of Murata et al4 The alignment produced by linking pairwise alignments* contained all the essential features of the solution arrived at by the more rigorous method and arrived at effectively the same solution 30 times faster (on a slower computer). Incorporating

Cluster

Algorithm

The method of Taylor* linked together pairwise alignments to produce a multiple alignment by selecting the best aligned pair of sequences from a family and adding further alignments onto this core, in order of their relatedness, until all the sequences had been incorporated. For a single homogeneously related family of sequences this method is sufficient. If the family contains two distinct subfamilies, however, then the alignments of these subfamilies will be linked through a poorly aligned pair. A simple change was made to the earlier algorithm to avoid this situation. Beginning again with the best aligned pair of sequences within the group (i.e., those with the highest score), further alignments are considered in order of decreasing score. If the alignments overlap (e.g., AZ? and BC form ABC, as above) then the chain of linked alignments is extended.* If, however, the alignment cannot be added to the end, then, rather than be rejected, it begins a new chain (unless either of the pair of sequences is already buried in a chain). Subsequent alignments are compared to all free ends for potential sites of addition, and where an alignment can join two ends, the chains are fused into one (i.e., the alignment HZ joins chains ZJK and EGFH forming EGFHZJK) Run to its conclusion (when the alignments of all pairs of sequences have been considered) this algorithm will 6 M. 0. Dayhoff and W. C. Barker, “Atlas of Protein Sequence and Structure,” National Biomedical Research Foundation, Washington, DC., 1978.

Suppl. 3.

462

ALIGNING

PROTEIN

AND

NUCLEIC

ACID

SEQUENCES

1291

generally produce an ordered list of sequences very similar to that produced by the previous algorithm. The new method, however, has the advantage that a cutoff can be introduced for the score value, below which alignments will not be allowed to join the chains of sequences. This implies that the subfamilies need not necessarily be linked and that isolated sequences may remain excluded (see Fig. 3). An interesting variation of the above algorithm is to allow a chain to be linked to itself. That is, given the alignment ABC, if the next best alignment is AC then the list ABC can be considered to be a circle, closed to further additions. This condition results in smaller groups, but these are

c

3. The basic algorithm illustrated for a two-pass run (two cycles) on 11 sequences. The sequences are designated A to K, and the matrix (a) shows the relatedness of each pair of sequences. The relatedness (or score) for a pair of sequences is indicated graphically, with solid being strongly related through shades of gray to blank representing weak or no similarity. The scores for each pair are then sorted (b), and those above a given cutoff (i.e., black or gray) are used to define clusters of sequences (c). Two different clusters are shown. In both, the initial pair of sequences in the cluster is boxed, with the others having been added to this pair. In the first example three groups were initially defined but the lower two fused into one. The second example shows the effect of the circle parameter (see text). The dashed lines indicate groups that are closed to addition. This configuration is followed through a second pass of the algorithm in which the alignment of each group behaves as a single (consensus) sequence (see text for details). This second pass quickly unifies the sequences in one group. FIG.

WI

HIERARCHICAL

ALIGNMENT

SCHEME

then more likely to be closely related to each other than to nonmembers the group. Consensus Sequence

463 of

Definition

In a parallel development, I had been using consensus patterns (templates) to align remotely related sequences.’ These patterns were matched to a new sequence using either functional classes of amino acids or a score based on the frequency of amino acid occurrence at a given position in a sequence alignment. Similar ideas were developed by Grioskov et al.* and Barton and Stemberg using the Needleman - Wunsch algorithm3 (see elsewhere in this volume). These methods defined a consensus profile from a family of proteins against which a new sequence was matched using the frequency of amino acid occurrence as a score. Each sequence alignment defined by the above cluster algorithm can be reduced to a consensus sequence. This approach has the attractive feature that by reducing each subalignment to a single consensus sequence, the cluster algorithm can be reapplied to both consensus and single sequences to produce further groupings until nothing more will align. The score between two positions in a pair of alignments was defined as follows:

where NA is the number of sequences in alignment A, A& the number of sequences in alignment B, and MXy the measure of relatedness between residues X and Y (Ai and Bj for positions i and j in alignments A and B, respectively). The overall score (A’) for the alignment of the two alignments is given by the sum of scores (s) on the best path through the score matrix (see Fig. 4) and normalized by both the number of sequences in each alignment and the length of the shorter alignment, i.e.,

where LA and LB are the lengths of alignments (or sequences) A and B, respectively.

’ W. R. Taylor, J. Mol. Biol. 188,233 (1986). 8 M. Gribskov, A. D. McLachlan, and D. Eisenberg, Proc. Nat/. Acad. Sci. U.S.A. 84,4355 (1987). g J. G. Barton and M. J. E. Stemberg. J. Mol. Biol. 198, 327 (1987).

464

ALIGNING

PROTEIN

AND NUCLEIC

ACID SEQUENCES

r291

AADDEFGH +--D

G

-q

H

-J

Id

WINDOW

II

4

FIG. 4. The effect of the window parameter is illustrated for the alignment of two short sequences (AADDEFGH with ADCDEGH). Using a simple scoring scheme (identity scores 2 and a gap costs l), the resulting values accumulated in the cells of the score matrix are shown with some trace-back routes (arrows). The window parameter with a value of 4 instructs the program to ignore the shaded cells.

Implementation

Needleman- WunschAlgorithm Recoded The algorithm described above has been implemented as a computer program called MULTAL. lo As before, all alignments, whether between previous alignments, single sequences, or both, are determined by a standard, Needleman and Wunsch-like algorithm.3 In the simpler method of Taylor2 the routines from the program ALIGN’ were used. In this later version these were completely rewritten to improve performance, resulting in roughly an order of magnitude increase in the speed of the basic algorithm. A window was also defined about the diagonal of the score matrix to ignore unreasonably large insertions (see Fig. 4). This is a standard technique, variations of which are described in Kruskal and Sankoff.” The window is centered on the diagonal if the sequences are of equal length; lo W. R. Taylor, .I. Mol. Evol. 28, 161 (1988). ‘I J. B. Kruskal and D. Sankoff, “Times Warps, String Edits and Macromolecules” D. Sankoff and J. B. Kruskal, eds.) Chap. 10. Addison-Wesley, Reading, Massachusetts, 1983.

WI

HIERARCHICAL

ALIGNMENT

SCHEME

465

otherwise, it is offset in proportion to the difference in their lengths, and if this exceeds the window size, the comparison is not calculated. MULTAL was written in C, and internal sequence storage was controlled using the dynamic memory allocation features of this language. Thus, there is effectively no limit on the number of sequences that can be considered, providing total storage requirements remain within the virtual memory limit of the system. The current version of the program interacts directly with the University of Wisconsin (UWGCG) sequence analysis package.

Alignment Parameters Sequence alignment is sensitive to the choice of gap penalty and the form of the relatedness matrix, and it is often desirable to vary these in response to the similarity of the sequences considered. As the effective overall similarity of the sequences (and clusters) can vary through the cycles of the above algorithm, the program allows both the gap penalty and the relatedness matrix to be altered between cycles. Variation of the relatedness matrix is restricted to an initial choice of two matrices, say, the identity matrix (ID) and the Dayhoff and Barker matrix6 of amino acid relatedness (MD). A parameter (matwt) was defined that instructs the program to apply a weight to each matrix when calculating the distance between two amino acids. For example, if matwt = 2 then the score is calculated as 0.8ID + 0.2MD. (The ID matrix scores 10 for a match so as to be numerically more equivalent to MD.) Typically, a run might begin using full weight on the first matrix (matwt = 0) when aligning closely related sequences and, on subsequent cycles, gradually shift emphasis onto the second matrix by increasing matwt. In addition, it is common practice to allow a constant to be added to the matrix. In the implementation described here a constant can be added to each matrix at the start of a run but not varied between cycles (6 was added to MD in the results discussed below). To allow complete freedom in the variation of these parameters, the program reads a complete set afresh between cycles. A typical parameter specification file is shown in Table I.

InteractiveAlignment Because automatic alignment programs are not guaranteed to produce a result that conforms with a user’s expectations, the program can also be run interactively, with the ability to examine the intermediate results and vary the parameters accordingly if these are not satisfactory. As alignments

466

ALIGNING

PROTEIN

AND NUCLEIC

1291

ACID SEQUENCES

TABLE II PARAMETER SPECIFKATION FILE” Parameter matwt

gapen

span

window

cutoff

Sequences plus families

0 0 0 0 0 0 0 0 0 0 0 0 0 0

20 20 25 25 30 35 40 45 50 50 50 50 50 50

1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192

5 5 10 15 20 25 30 35 40 45 50 50 50 50

990 950 900 850 800 750 700 650 600 550 500 500 500 500

5727 5653 5370 5152 4840 4354 4105 3796 3599 3245 3032 2613 2366 2319 2318

Families 0 52 222 317 438 545 606 648 679 729 758 750 724 714 714

0 The condensation of the NBRF data bank into families is tabulated for progressive cycles of the program MULTAL. The gap penalty (gapen) was slowly increased over the cycles, and there was a progressive shift from the identity (ID) matrix toward an average of this with the Dayhoff and Barker6 relatedness matrix (MD) controlled by the matwt parameter. The parameters window, span, and cutoff control the degree of clustering, with the latter setting a limit to the score between sequences considered for alignment. See Fig. 7 for plot.

can be evaluated only after they have been calculated, the program retains the state of the alignments in previous cycles to be returned to if desired. This interactive facility was encoded by calling the alignment program recursively from itself, rather than successively in a loop. This trivial change results in the automatic storage of the previous states of the program, which can then be returned to by exiting the current state. Strategies

for Large Numbers

of Sequences

A fast multiple alignment program was developed to enable interactive use as described above, which is especially useful when aligning remotely related sequences. However, a fast program also allows greater numbers of sequences to be considered if one is still prepared to wait as long as before.

r291

HIERARCHICAL

ALIGNMENT

467

SCHEME

Local Clustering It is often found that the sequences to be aligned are encountered already ordered (perhaps taxonomically classified) with the more similar sequences adjacent. In this situation it is unnecessary to compare all pairs of sequences in the first cycle. The program, therefore, allows only pairs to be compared that lie within a preset separation of each other, referred to below as the span (Fig. 5). It might be expected that this could allow the alignment of closely related but remotely placed sequences to be missed. However, if this occurs in the early cycles, on subsequent cycles the span can be increased and, combined with the contraction caused by the clustering of sequences, the parameters can be chosen so that in the final cycles any isolated sequence is compared to all others, either as an individual or as a component in a consensus. The window size can also be altered between cycles of the program, as can the cutoff level on the score. Together these features allow similar sequences (also of similar length) to be initially aligned into subfamilies, then on subsequent cycles the similarity cutoff can be lowered (to allow the

A

K FIG. 5. Using the matrix described in Fig. 3, the effect of the span parameter is illustrated (for the value of span = 3).

468

ALIGNING

PROTEIN

AND

NUCLEIC

ACID

SEQUENCES

I291

subfamilies to align) while simultaneously widening the window. Thus, although there is now more to calculate with each comparison, there are fewer comparisons to perform. Using all these features together allows large numbers of sequences to be initially scanned and partially condensed into closely related families. For subsequent cycles a parameter profile can be chosen to fit the size of the group of sequences and their rate of condensation. Rough Presorting of Sequence Order If the sequences are encountered in a random order, then the above strategy will still produce a complete alignment, only the rate of condensation will be much slower. A general approach to this problem is to use a rough but fast measure of sequence relatedness for the first pass over the sequences. To compare each sequence with every other would typically involve many millions of comparisons, so any form of exact sequence alignment is impractical. However, it is possible to precalculate the location of all sequence fragments of a fixed length for a large number of sequences. A rough measure of sequence relatedness can thus be relatively quickly calculated as the number of common peptides. The resulting list of scores for all pairwise comparisons between sequences could be used directly by the program MULTAL, however, this would have the undesirable effect of forcing an alignment order based on a rough measure of relatedness. Instead, the scores were sorted and the cluster algorithm applied without a cutoff to provide a complete ordered list of sequences. A separate program called SEQSORT implements this method using the tripeptide lookup table provided with the PIR database. A peptide length of 4 was used (i.e., 16,000 peptides from AAAA to YYYY), which greatly reduces the number of matches (relative to a length of 3) and thus saves time when incrementing the count of hits common to each pair of sequences (which is the rate-limiting step). Clustering

Tests

Cytochrome c Superfamily Alignment The effect of variation in the degree and rate of sequence clustering was investigated using the cytochrome c family. This extensive family (> 100 sequences) consists of several subfamilies, the interrelationships of which have been analyzed by Schwartz and DayhofP (see also Dayhoff and

I2 R. M. Schwartz and M. 0. DayhofF, Science 199,395 (1978).

1291

HIERARCHICAL

ALIGNMENT

SCHEME

469

Barker6 and Fig. 6a). The program MULTAL was applied to this family using different alignment strategies and the results compared. Two different strategies were tried. The first allowed no families to align that scored less than 500 using the ID matrix (for a pair of sequences this is -50% identity). Using the same cutoff, the second approach gradually shifts weight from the ID matrix to the MD matrix as described above. As this allows higher overall scores to be obtained, the gap penalty (gapen) was gradually increased throughout this run in compensation. The approach using the “softer” scoring scheme created the largest families. These agreed well with the Schwartz and Dayhoffi2 classification by defining a large group of sequences consisting of the eukaryotic mitochondrial and photosynthetic bacterial sequences (80 sequences including cytochromes c, c-550, and most of the c2 sequences). The other sequences fall into two main groups; the c6 sequences from cyanobacteria and chloroplasts (13 sequences) and the aerobic bacterial sequences (c-551 and some c2, 8 sequences). The remainder includes a small family of two c-555 sequences, remotely related to a C~ sequence. These minor families were eventually joined into one large family (see Fig. 6b). Using the exact matching scheme (ID matrix), the families were more fragmented (e.g., the photosynthetic bacterial and eukaryotic mitochondrial sequences remained distinct). Although the major subfamilies were clearly defined, 14 sequences remained unclassified (Fig. 6c) compared to 4 in the previous run. Both the above runs began with the order of sequences as found in the PIR database, which is already almost optimally ordered. As a more exacting test, the alignment program was provided with the sequences in the order determined by the program SEQSORT and rerun using the “softer” scoring scheme described above. The order imposed by SEQSORT reflected all the major subfamilies, but these were arranged in different positions relative to the order in the PIR database. The results of the alignment starting from this order again identified the major subfamilies and, surprisingly, arrived at this grouping by a route that is largely isostructural with the alignment trees described above (Fig. 6d). Some differences were found, however, the most extreme example being the early alignment of CCMST and CCRF2V, proteins which are separated by 55 sequences in the PIR order. This large displacement was a result of the two sequences being adjacent in the SEQSORT-derived order. Despite this misplacement the resulting alignment was correct. NBRF Data Bank Alignment As a test of speed and capacity, the program was applied to the current NBRF (PIR) protein sequence database (Release 16.0). To avoid polypro-

3

CCSf CCYB Ccl” ccnl

CCffCl CCES ccl!, CCL.P cccr. CCHP CCDSK CCSl CCBISC CCZP CCSL ccsc ecus CCRPSN CCilECC WCS CCRZ CCZ" WCS CCTO CCED ccsx CCLI CCRl CEKD ccxs ccw ccnc CCFS CCP.z

El-

-

-

-

-

1

- - -

-

c

- -

CC?.” CCWP cc!!0 CCYP CCRS cc!& CCXGC CC”STCCPY ccc!4 ccx ceos ccsu CCPN CCSf CCLS CCfG cesw cccl CCDF CC3

d

-

-

L!E

I

__.__. CCPSIA CCm.6 CCPF‘ CCAL‘O CC3F6 CCPR6 CCIM

CCEI CCEG -

FIG. 6. See legend on p. 472

CCDC

CCFA I- CCSP CCGK CCEl

472

ALIGNING

PROTEIN

AND

NUCLEIC

ACID SEQUENCES

1291

teins and large repetitive sequences, a maximum sequence length of 500 residues was set. Similarly, to avoid short peptides, a minimum length of 50 residues was set, leaving 5727 sequences. It might be noted at this point that because the program controls its own storage requirements, no changes were required to read in this volume of data. The first two passes over the data base proceeded cautiously, aligning only closely adjacent sequences within 90% identity and of lengths within 5 residues. In subsequent cycles, the score cutoff was linearly reduced to 500 while the window size was similarly increased until a maximum insertion size of 50 residues was allowed. The range of adjacent sequences compared (span) was doubled in each cycle until this exceeded the size of the data bank. In the final cycles, the span was larger than the size of the condensed data bank; thus, all sequences had been compared to each other, either as individuals or as members of a family. The progressive condensation of the data bank is plotted in Fig. 7 and tabulated, with details of the parameter for each cycle, in Table I. In the final cycles the condensation of the data base had converged, with little reduction being made in the number of distinct entries (single plus consensus sequences). The data base condensed to 2318 distinct entries, of which 714 contained more than one sequence. This degree of compression is similar to a less complete analysis on an older version of the same data bank.‘O I3 W. C. Barker, L. T. Hunt, B. C. Orcutt, D. G. George, L. S. Yeh, H. R. Chen, M. C. Blomquist, G. C. Johnson, E. I. Se&cl-Ross, M. K. Hong and R. S. Ledley, “Protein Identification Resource,” Version 4.3. National Biomedical Research Foundation, Washington, D.C., 1984. FIG. 6. Progressive clustering of the cytochrome c superfamily by the program MULTAL using different parameters. The ill sequences are taken from the PIR data bank (Version 4.3)i2 in the order in which they occur there and are designated by a code followed by the protein name and their source. (a) For comparison the phylogenctic tree derived by Schwartz and Dayhoff” is shown in a simplified form (adpated from Dayhoff and Barker6). Those sequences known in 1978 are shown in bold-face type. In the following trees (b-d) each level of the tree represents the state of clustering after a cycle of the program. For example, in the run described in (b), CCHO, CCHP, CCRB, CCDG, CCKGG, and CCMST were aligned in the first cycle of the program. The order of the sequences within such groups (and hence their position in the alignment) is determined by the algorithm described in the text. For clarity, however, the order as found in the PIR data bank is retained. To facilitate comparison with (a), elements that correspond to that tree arc emphasized. (b) Clusters defined using the parameters specified in the MD matrix (using the “softer” scoring scheme). Careful examination shows that the trees are remarkably similar. (c) Clusters defined using the parameters specified in the ID matrix (scored by amino acid identity). The tree is largely isostructural with (b) but less complete. (d) As (b) but using the sequence order generated by the program SEQSORT (see text). The tree is largely isostructural with (b) barring some local&d rearrangement. One unexpectedly long connection is made between CCMST and CCRF2V, for clarity, it is drawn behind the list of sequence codes.

r291

HIERARCHICAL

ALIGNMENT

473

SCHEME

5727

CYCLE

NO

7. Plot of the condensation of the NBRF data bank. The 5727 starting sequences are condensed into consensus sequences (families), the number of which rises to 7 14, while the total distinct entries (single plus consensus sequences) falls to 23 18, leaving only 1604 unattached sequences (see Table I for full details). The dashed line plots the course of a similar runr” using an older (smaller) version of the data bank but on a doubled scale. FIG.

The above run took 10 days (234 CPU-hr) on a SUN-4 computer and performed roughly 13 million alignments. Such a large run gives a good estimate of the average speed of the basic pairwise alignment algorithm on typical protein sequences (between 50 and 500 residues) and gives a value of 7f set to align a pair of sequences (or consensus sequences). As this step is rate determining for large runs, this value can be used to calculate the total time for any run once the number of pairwise alignments has been estimated. Summary The method presented here is intended as a compromise between finding a good overall alignment and the time taken to do so. Many multiple alignment algorithms spend an excessively large amount of effort trying to find the best global alignment. This time is often ill spent because the results of the standard dynamic programming alignment algorithm are dominated by the choice of gap penalty and the form of the score matrix, both of which have a poor theoretical foundation. Nonetheless, it is impor-

474

ALIGNING

PROTEINANDNUCLEICACIDSEQUENCES

1301

tant that savings in time do not compromise the quality of the alignment. By using the consensus sequence approach, this danger is largely avoided as the conserved features of the sequences are quickly identified and preserved through further cycles. In the alignment of existing alignments, which is one of the more novel aspects of the method, each alignment was treated as an averaged consensus sequence with gaps making no contribution. This gives rise to the advantageous property that gaps will have a greater propensity to be inserted where there are already gaps and is equivalent to a local change in the gap penalty. This type of behavior represents a transition away from the homogeneous scoring schemes used in aligning two sequences toward a scoring scheme that depends on position in the sequence. The alignment of consensus sequences thus forms a bridge between simple pair alignment and the alignment of discrete patterns in which sequence features and allowed gap locations are exaggerated. To complete this transition the program described above has been integrated into the earlier pattern matching (template) program.’ Such templates can reliably locate sequence similarities that are too weak or scattered to be found by the more standard alignment methodsi and should therefore produce a further condensation of the sequence data bank. Only by continually extending our knowledge of the relationships between sequences to increasingly distant similarities can we hope to avoid being overwhelmed by the increasing amount of data. I4 L. H. Pearl and W. R. Taylor, Nature (London) 329,35 1 (1987).

[30] Significance BYJOHN

of Protein

F. COLLINS

Sequence Similarities

and ANDREW

F. W. COULSON

Introduction Similarities between pairs of protein sequences are expressed as alignments of subsequences such as: ** * * * ** * * * * * ** * NYTGLRKQMAVKKYLN S I LNGK2 -NILEDPVPVKRHSDAVFlD 137 -NIVEE LR RRHADGSFSDEMNYV L DS LATRDFINWLLQTKMETHODS

IN ENZYMOLOGY,

VOL.

183

Copyright 0 1990 by Academic’l’tzss, Inc. All right.3 of reproduction in any form reserved.

42 175